
Statistical Methods for Machine Learning
Larry Wasserman

Carnegie Mellon University


Lecture Notes
Review
Density Estimation
Nonparametric Regression
Linear Regression
Sparsity
Nonparametric Sparsity
Linear Classifiers
Nonparametric Classifiers
Random Forests
Clustering
Graphical Models
Directed Graphical Models
Causal Inference
Minimax Theory
Nonparametric Bayesian Inference
Conformal Prediction
Differential Privacy
Optimal Transport and Wasserstein Distance
Two Sample Testing
Dimension Reduction
Boosting
Support Vector Machines
Online Learning

https://www.stat.cmu.edu/~larry/=sml/
36-708
Statistical Methods in Machine Learning
Syllabus, Spring 2019
http://www.stat.cmu.edu/~larry/=sml
Lectures: Tuesday and Thursday 1:30 - 2:50 pm (POS 152)

This course is an introduction to Statistical Machine Learning. The goal is to study modern methods and the
underlying theory for those methods. There are two pre-requisites for this course:

1. 36-705 (Intermediate Statistical Theory)


2. 10-707 (Regression)

Contact Information
Instructor:
Larry Wasserman BH 132G 412-268-8727 [email protected]
Teaching Assistants:
The names and office hours for the TA’s will be on the course website.

Office Hours
Larry Wasserman Tuesdays 12:00-1:00 pm Baker Hall 132G

Text
There is no text but course notes will be posted. Useful references are:

1. Trevor Hastie, Robert Tibshirani, Jerome Friedman (2001). The Elements of Statistical Learning. Available at http://www-stat.stanford.edu/~tibs/ElemStatLearn/.
2. Chris Bishop (2006). Pattern Recognition and Machine Learning.
3. Luc Devroye, László Györfi, Gábor Lugosi. (1996). A probabilistic theory of pattern recognition.
4. Gyorfi, Kohler, Krzyzak and Walk (2002). A Distribution-Free Theory of Nonparametric Regression.
5. Larry Wasserman (2004). All of Statistics: A Concise Course in Statistical Inference.
6. Larry Wasserman (2005). All of Nonparametric Statistics.

Grading

1. There will be four assignments. They are due Fridays at 3:00 p.m. Hand them in by uploading a pdf file
to Canvas.
2. Midterm Exam. The date is Thursday MARCH 7.
3. Project. There will be a final project, described later in the syllabus.

Grading will be as follows:


50% Assignments
25% Midterm
25% Project
Policy on Collaboration
Collaboration on homework assignments with fellow students is encouraged. However, such collaboration should
be clearly acknowledged, by listing the names of the students with whom you have had discussions concerning
your solution. You may not, however, share written work or code after discussing a problem with others. The
solutions should be written by you.
Topics

1. Introduction and Review


(a) Statistics versus ML
(b) concentration
(c) bias and variance
(d) minimax
(e) linear regression
(f) linear classification
(g) logistic regression
2. Nonparametric Inference
(a) Density Estimation
(b) Nonparametric Regression
i. kernels
ii. local polynomial
iii. NN
iv. RKHS
(c) Nonparametric Classification
i. plug-in
ii. nn
iii. density-based
iv. kernelized SVM
v. trees
vi. random forests
3. High Dimensional Methods
(a) Forward stepwise regression
(b) Lasso
(c) Ridge Regression
(d) High dimensional classification
4. Clustering
5. Graphical models
6. Minimax theory
7. Causality
8. Dimension reduction: PCA, nonlinear
9. Other Possible Topics
(a) Mixture models
(b) Wasserstein distance and optimal transport
(c) boosting
(d) active learning
(e) nonparametric bayes
(f) deep learning
(g) differential privacy
(h) interactive data analysis
(i) multinomials
(j) statistical-computational tradeoff
(k) random matrices
(l) conformal methods
(m) interpolation

Course Calendar
The course calendar is posted on the course website and will be updated throughout the semester.
Project
The project involves picking a topic of interest, reading the relevant results in the area and then writing a short
paper (8 pages) summarizing the key ideas in the area. You may focus on a single paper if you prefer. You are
NOT required to do new research.
The paper should include background, statements of important results, and brief proof outlines for the results.
If appropriate, you should also include numerical experiments or an application with real data.

1. You may work by yourself or in teams of two.


2. The goals are (i) to summarize key results in the literature on a particular topic, (ii) to present a summary
of the theoretical analysis (results and proof sketches) of the methods, and (iii) to implement some of the main
methods. You may develop new theory if you like but it is not required.
3. You will provide: (i) a proposal, (ii) a progress report and (iii) a final report.
4. The reports should be well-written.

Proposal. A one page proposal is due February 8. It should contain the following information: (1) project
title, (2) team members, (3) precise description of the problem you are studying, (4) anticipated scope of the
project, and (5) reading list. (Papers you will need to read).
Progress Report. Due April 5. Three pages. Include: (i) a high quality introduction, (ii) what have you
done so far, (iii) what remains to be done and (iv) a clear description of the division of work among teammates,
if applicable.
Final Report: Due May 3. The paper should be in NIPS format. (pdf only). Maximum 8 pages. No
appendix is allowed. You should submit a pdf file electronically. It should have the following format:

1. Introduction. Motivation and a quick summary of the area.


2. Notation and Assumptions.
3. Key Results.
4. Proof outlines for the results.
5. Implementation (simulations or real data example.)
6. Conclusion. This includes comments on the meaning of the results and open questions.
36-708 Introduction and Review

1 Statistics versus ML

Statistics and ML are overlapping fields. Both address the same question: how do we extract
information from data? But there are differences in emphasis. In particular, some topics get
greater emphasis than others. Here are some examples:
More emphasis in ML       More emphasis in Stat     Common Areas
Bandits                   Confidence Sets           Prediction (Regression and Classification)
Reinforcement Learning    Large Sample Theory       Probability Bounds (Concentration)
Efficient Computation     Statistical Optimality    Clustering
Deep Learning             Causality                 Graphical Models
However, the lines between the two fields are blurry and will become increasingly so.

Another difference between the two fields is that ML researchers tend to publish short pa-
pers in conferences while Statisticians tend to publish long papers in journals. Each has
advantages and disadvantages.

2 Concentration

Hoeffding’s inequality:

Theorem 1 (Hoeffding) If Z1 , Z2 , . . . , Zn are iid with mean µ and P(a ≤ Zi ≤ b) = 1,
then for any ε > 0,

P(|Z̄n − µ| > ε) ≤ 2 exp( −2nε²/(b − a)² )                    (1)

where Z̄n = (1/n) Σ_{i=1}^n Zi .
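To make the bound concrete, here is a minimal Python/NumPy simulation sketch (not part of the original notes; the uniform data and the constants are assumptions) comparing the empirical deviation probability of a bounded sample mean with the right-hand side of (1).

```python
import numpy as np

rng = np.random.default_rng(0)

def hoeffding_bound(eps, n, a=0.0, b=1.0):
    """Right-hand side of Hoeffding's inequality (1)."""
    return 2.0 * np.exp(-2.0 * n * eps**2 / (b - a)**2)

n, eps, reps = 100, 0.1, 20000
# Z_i ~ Uniform(0, 1), so a = 0, b = 1 and mu = 0.5.
Z = rng.uniform(0.0, 1.0, size=(reps, n))
deviations = np.abs(Z.mean(axis=1) - 0.5)

print("empirical P(|Zbar - mu| > eps):", np.mean(deviations > eps))
print("Hoeffding bound:              ", hoeffding_bound(eps, n))
```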

VC Dimension. Let A be a class of sets. If F is a finite set, let s(A, F ) be the number of
subsets of F ‘picked out’ by A, that is, s(A, F ) = |{F ∩ A : A ∈ A}|. Define the growth function

sn (A) = sup_{|F|=n} s(A, F ).

Note that sn (A) ≤ 2^n . The VC dimension of a class of sets A is

VC(A) = sup{ n : sn (A) = 2^n }.                              (2)

If the VC dimension is finite, then there is a phase transition in the growth function from
exponential to polynomial:

Theorem 2 (Sauer’s Theorem) Suppose that A has finite VC dimension d. Then, for
all n ≥ d,

s(A, n) ≤ (en/d)^d .                                          (3)

Given data Z1 , . . . , Zn ∼ P , the empirical measure Pn is

Pn (A) = (1/n) Σ_i I(Zi ∈ A).

Theorem 3 (Vapnik and Chervonenkis) Let A be a class of sets. For any t > √(2/n),

P( sup_{A∈A} |Pn (A) − P (A)| > t ) ≤ 4 s(A, 2n) exp(−nt²/8)                    (4)

and hence, with probability at least 1 − δ,

sup_{A∈A} |Pn (A) − P (A)| ≤ √( (8/n) log( 4 s(A, 2n)/δ ) ).                    (5)

Hence, if A has finite VC dimension d then

sup_{A∈A} |Pn (A) − P (A)| ≤ √( (8/n) [ log(4/δ) + d log(ne/d) ] ).             (6)
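As a quick numerical illustration, here is a small Python helper (an illustrative sketch, not from the notes) that evaluates the right-hand side of (6) for a given sample size n, VC dimension d, and confidence level δ.

```python
import numpy as np

def vc_deviation_bound(n, d, delta):
    """Upper bound (6) on sup_A |Pn(A) - P(A)| for a class with VC dimension d."""
    return np.sqrt((8.0 / n) * (np.log(4.0 / delta) + d * np.log(n * np.e / d)))

for n in [100, 1000, 10000, 100000]:
    print(n, vc_deviation_bound(n, d=5, delta=0.05))
```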

Bernstein’s inequality is a more refined inequality than Hoeffding’s inequality. It is especially
useful when the variance of Yi is small. Suppose that Y1 , . . . , Yn are iid with mean µ, Var(Yi ) ≤
σ² and |Yi | ≤ M . Then

P(|Ȳ − µ| > ε) ≤ 2 exp( − nε² / (2σ² + 2Mε/3) ).                    (7)

It follows that, for small enough ε and c,

P( |Ȳ − µ| > √(2σ²t/n) + t/(2n(1 − c)) ) ≤ e^{−t} .

3 Probability
1. Xn →P 0 means that, for every ε > 0, P(|Xn | > ε) → 0 as n → ∞.

2. Xn ⇝ Z means that P(Xn ≤ z) → P(Z ≤ z) at all continuity points z.

3. Xn = OP (an ) means that Xn /an is bounded in probability: for every ε > 0 there is
an M > 0 such that, for all large n, P(|Xn /an | > M ) ≤ ε.

4. Xn = oP (an ) means that Xn /an goes to 0 in probability: for every ε > 0,

P(|Xn /an | > ε) → 0 as n → ∞.
5. Law of large numbers: if X1 , . . . , Xn ∼ P then

X̄n →P µ

where X̄n = (1/n) Σ_{i=1}^n Xi and µ = E[Xi ].

6. Central limit theorem: if X1 , . . . , Xn ∼ P then

√n (X̄n − µ)/σ ⇝ N (0, 1)

where σ² = Var(Xi ).

4 Basic Statistics
1. Bias and Variance. Let θ̂ be an estimator of θ. Then

E(θ̂ − θ)² = bias² + Var

where bias = E[θ̂] − θ and Var = Var(θ̂). In many cases there is a bias-variance trade-
off. In parametric problems, we typically have that the standard deviation is O(n^{−1/2})
but the bias is O(1/n) so the variability dominates. In nonparametric problems this is
no longer true. We have to choose tuning parameters in classifiers and estimators to
balance the bias and variance.
2. A set of distributions P is a statistical model. They can be small (parametric models)
or large (nonparametric models).
3. Confidence Sets. Let X1 , . . . , Xn ∼ P where P ∈ P. Let θ = T (P ) be some quantity
of interest. Then Cn = C(X1 , . . . , Xn ) is a 1 − α confidence set if

inf_{P ∈P} P (T (P ) ∈ Cn ) ≥ 1 − α.

4. Maximum Likelihood. Parametric model {pθ : θ ∈ Θ}. We also write pθ (x) =
p(x; θ). Let X1 , . . . , Xn ∼ pθ . The MLE θ̂n (maximum likelihood estimator) maximizes the
likelihood function

L(θ) = Π_{i=1}^n p(Xi ; θ).

5. Fisher information In (θ) = nI(θ) where

I(θ) = −E[ ∂² log p(X; θ) / ∂θ² ].

6. Then

(θ̂n − θ)/sn ⇝ N (0, 1)

where sn = √( 1/(nI(θ̂)) ).

7. Asymptotic 1 − α confidence interval: Cn = θ̂n ± zα/2 sn . Then

P(θ ∈ Cn ) → 1 − α.

5 Minimaxity

Let P be a set of distributions. Let θ be a parameter and let L(θ̂, θ) be a loss function. The
minimax risk is

Rn = inf_{θ̂} sup_{P ∈P} EP [L(θ̂, θ)].

If sup_{P ∈P} EP [L(θ̂, θ)] = Rn then θ̂ is a minimax estimator.

For example, if X1 , . . . , Xn ∼ N (θ, 1) and L(θ̂, θ) = (θ̂ − θ)² then the minimax risk is 1/n
and the minimax estimator is X̄n .

As another example, if X1 , . . . , Xn ∼ p where Xi ∈ Rd , L(p̂, p) = ∫ (p̂ − p)² and p ∈ P, the
set of densities with bounded second derivatives, then Rn = (C/n)^{4/(4+d)} . The kernel density
estimator is minimax.

6 Regression
1. Y ∈ R, X ∈ Rd and the prediction risk is

E(Y − m(X))².

We write X = (X(1), . . . , X(d)).

2. The minimizer is m(x) = E(Y |X = x).

3. Best linear predictor: minimize

E(Y − β^T X)²

where X(1) = 1 so that β1 is the intercept. The minimizer is

β = Λ⁻¹ α

where Λ(j, k) = E[X(j)X(k)] and α(j) = E(Y X(j)).

4. The data are

(X1 , Y1 ), . . . , (Xn , Yn ).

Given a new X we predict Y .

5. Minimize the training error

R̂(β) = (1/n) Σ_i (Yi − β^T Xi )².

Solution: least squares:

β̂ = (X^T X)⁻¹ X^T Y

where X(i, j) = Xi (j).

6. Fitted values: Ŷ = Xβ̂ = HY where H = X(X^T X)⁻¹ X^T is the hat matrix: the projector
onto the column space of X. (A short numerical sketch is given after this list.)

7. Bias-Variance tradeoff: Write Y = m(X) + ε and let Ŷ = m̂(X) where m̂(x) = x^T β̂. Then

R = E(Y − Ŷ)² = σ² + ∫ b²(x)p(x)dx + ∫ v(x)p(x)dx

where b(x) = E[m̂(x)] − m(x), v(x) = Var(m̂(x)) and σ² = Var(ε).
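Here is a minimal Python/NumPy least-squares sketch (illustrative; the simulated design and coefficients are assumptions, not from the notes) computing β̂ = (X^T X)⁻¹ X^T Y and the hat matrix H.

```python
import numpy as np

rng = np.random.default_rng(1)

n, d = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])  # X(1) = 1 is the intercept
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Least squares: beta_hat = (X^T X)^{-1} X^T Y (solve is preferred to an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Hat matrix H = X (X^T X)^{-1} X^T and fitted values Y_hat = H Y
H = X @ np.linalg.solve(X.T @ X, X.T)
Y_hat = H @ Y

print("beta_hat:", beta_hat)
print("max |Y_hat - X beta_hat|:", np.max(np.abs(Y_hat - X @ beta_hat)))
```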

7 Classification
1. X ∈ Rd and Y ∈ {0, 1}.
2. Classifier h : Rd → {0, 1}.
3. Prediction risk:
R(h) = P(Y ≠ h(X)).
The Bayes rule minimizes R(h):

h(x) = I(m(x) > 1/2) = I(π1 p1 (x) > π0 p0 (x))

where m(x) = P(Y = 1|X = x), π1 = P(Y = 1), π0 = P(Y = 0), p1 (x) = p(x|Y = 1)
and p0 (x) = p(x|Y = 0).
4. Re-coded loss. If we code Y as Y ∈ {−1, +1}, then many classifiers can be written
as
h(x) = sign(ψ(x))
for some ψ. For linear classifiers, ψ(x) = β^T x. Then the loss can be written as
I(Y ≠ h(X)) = I(Y ψ(X) < 0) and the risk is

R = P(Y ≠ h(X)) = P(Y ψ(X) < 0).

5. Linear Classifiers. A linear classifier has the form hβ (x) = I(β^T x > 0). (I am
including an intercept in x. In other words x = (1, x(2), . . . , x(d)).) Given data
(X1 , Y1 ), . . . , (Xn , Yn ) there are several ways to estimate a linear classifier:

(a) Empirical risk minimization (ERM): Choose β̂ to minimize

Rn (β) = (1/n) Σ_{i=1}^n I(Yi ≠ hβ (Xi )).

(b) Logistic regression: use the model

P (Y = 1|X = x) = e^{β^T x} / (1 + e^{β^T x}) ≡ p(x, β).

So Yi ∼ Bernoulli(p(Xi , β)). The likelihood function is

L(β) = Π_i p(Xi , β)^{Yi} (1 − p(Xi , β))^{1−Yi} .

The log-likelihood is strictly concave, so we can find the maximizer β̂ easily. It is
easy to check that the classifier I(p(x, β̂) > 1/2) is linear.

(c) SVM (support vector machine). Code Y as +1 or −1. We can write the classifier
as hβ (x) = sign(ψβ (x)) where ψβ (x) = x^T β. As we said above, the loss can be written
as I(Y ≠ h(X)) = I(Y ψ(X) < 0). Now replace the nonconvex loss I(Y ψ(X) < 0)
with the hinge loss [1 − Yi ψβ (Xi )]+ . We minimize the regularized loss

Σ_{i=1}^n [1 − Yi ψβ (Xi )]+ + λ||β||².

6. The SVM is an example of the general idea of replacing the true loss with a surrogate
loss that is easier to minimize. Replacing I(Y ψ(X) < 0) with

L(Y, ψ(X)) = log(1 + exp(−Y ψ(X)))

gives back logistic regression. The adaboost algorithm uses

L(Y, ψ(X)) = exp(−Y ψ(X)).

And, as we said above, the SVM uses the hinge loss

L(Y, ψ(X)) = [1 − Y ψ(X)]+ .
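To see how these surrogates compare, here is a small Python sketch (illustrative, not from the notes) evaluating the 0-1 loss and the three surrogate losses as functions of the margin Y ψ(X).

```python
import numpy as np

def zero_one(margin):     # true classification loss I(Y psi(X) < 0)
    return float(margin < 0)

def hinge(margin):        # SVM surrogate
    return max(0.0, 1.0 - margin)

def logistic(margin):     # logistic-regression surrogate
    return np.log1p(np.exp(-margin))

def exponential(margin):  # AdaBoost surrogate
    return np.exp(-margin)

for m in np.linspace(-2, 2, 9):
    print(f"margin={m:+.1f}  0-1={zero_one(m):.0f}  hinge={hinge(m):.3f}  "
          f"logistic={logistic(m):.3f}  exp={exponential(m):.3f}")
```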

Density Estimation
36-708

1 Introduction

Let X1 , . . . , Xn be a sample from a distribution P with density p. The goal of nonparametric


density estimation is to estimate p with as few assumptions about p as possible. We denote
the estimator by pb. The estimator will depend on a smoothing parameter h and choosing h
carefully is crucial. To emphasize the dependence on h we sometimes write pbh .

Density estimation is used for regression, classification, clustering and unsupervised predic-
tion. For example, if p̂(x, y) is an estimate of p(x, y) then we get the following estimate of
the regression function:

m̂(x) = ∫ y p̂(y|x) dy

where p̂(y|x) = p̂(y, x)/p̂(x). For classification, recall that the Bayes rule is

h(x) = I(p1 (x)π1 > p0 (x)π0 )

where π1 = P(Y = 1), π0 = P(Y = 0), p1 (x) = p(x|y = 1) and p0 (x) = p(x|y = 0). Inserting
sample estimates of π1 and π0 , and density estimates for p1 and p0 yields an estimate of
the Bayes rule. For clustering, we look for the high density regions, based on an estimate
of the density. Many classifiers that you are familiar with can be re-expressed this way.
Unsupervised prediction is discussed in Section 9. In this case we want to predict Xn+1 from
X1 , . . . , X n .

Example 1 (Bart Simpson) The top left plot in Figure 1 shows the density
p(x) = (1/2) φ(x; 0, 1) + (1/10) Σ_{j=0}^{4} φ(x; (j/2) − 1, 1/10)                    (1)

where φ(x; µ, σ) denotes a Normal density with mean µ and standard deviation σ. Marron
and Wand (1992) call this density “the claw” although we will call it the Bart Simpson
density. Based on 1,000 draws from p, we computed a kernel density estimator, described
later. The estimator depends on a tuning parameter called the bandwidth. The top right plot
is based on a small bandwidth h which leads to undersmoothing. The bottom right plot is
based on a large bandwidth h which leads to oversmoothing. The bottom left plot is based
on a bandwidth h which was chosen to minimize estimated risk. This leads to a much more
reasonable density estimate.


Figure 1: The Bart Simpson density from Example 1. Top left: true density. The other plots
are kernel estimators based on n = 1,000 draws. Bottom left: bandwidth h = 0.05 chosen by
leave-one-out cross-validation. Top right: bandwidth h/10. Bottom right: bandwidth 10h.

2 Loss Functions

The most commonly used loss function is the L2 loss

∫ (p̂(x) − p(x))² dx = ∫ p̂²(x)dx − 2 ∫ p̂(x)p(x)dx + ∫ p²(x)dx.

The risk is R(p, pb) = E(L(p, pb)).

Devroye and Györfi (1985) make a strong case for using the L1 norm

‖p̂ − p‖1 ≡ ∫ |p̂(x) − p(x)| dx

as the loss instead of L2 . The L1 loss has the following nice interpretation. If P and Q are
distributions, define the total variation metric

dTV (P, Q) = sup_A |P (A) − Q(A)|

where the supremum is over all measurable sets. Now if P and Q have densities p and q then

dTV (P, Q) = (1/2) ∫ |p − q| = (1/2) ‖p − q‖1 .

Thus, if ∫ |p − q| < δ then we know that |P (A) − Q(A)| < δ/2 for all A. Also, the L1 norm is
transformation invariant. Suppose that T is a one-to-one smooth function. Let Y = T (X).
Let p and q be densities for X and let p̃ and q̃ be the corresponding densities for Y . Then

∫ |p(x) − q(x)| dx = ∫ |p̃(y) − q̃(y)| dy.

Hence the distance is unaffected by transformations. The L1 loss is, in some sense, a much
better loss function than L2 for density estimation. But it is much more difficult to deal
with. For now, we will focus on L2 loss. But we may discuss L1 loss later.
Another loss function is the Kullback-Leibler loss ∫ p(x) log(p(x)/q(x)) dx. This is not a good
loss function to use for nonparametric density estimation. The reason is that the Kullback-
Leibler loss is completely dominated by the tails of the densities.

3 Histograms

Perhaps the simplest density estimators are histograms. For convenience, assume that the
data X1 , . . . , Xn are contained in the unit cube X = [0, 1]d (although this assumption is not
crucial). Divide X into bins, or sub-cubes, of size h. We discuss methods for choosing

h later. There are N ≈ (1/h)^d such bins and each has volume h^d . Denote the bins by
B1 , . . . , BN . The histogram density estimator is

p̂h (x) = Σ_{j=1}^{N} (θ̂j / h^d) I(x ∈ Bj )                    (2)

where

θ̂j = (1/n) Σ_{i=1}^{n} I(Xi ∈ Bj )

is the fraction of data points in bin Bj . Now we bound the bias and variance of p̂h . We will
assume that p ∈ P(L) where

P(L) = { p : |p(x) − p(y)| ≤ L‖x − y‖, for all x, y }.          (3)
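Before turning to the analysis, here is a minimal Python/NumPy sketch of the histogram estimator (2) for d = 1 on [0, 1] (an illustration; the simulated data and the binwidth are assumptions, not part of the notes).

```python
import numpy as np

rng = np.random.default_rng(2)

def hist_density(x, data, h):
    """Histogram density estimator (2) on [0, 1] with binwidth roughly h (d = 1)."""
    n_bins = int(np.ceil(1.0 / h))
    binwidth = 1.0 / n_bins
    # theta_hat_j: fraction of data points in each bin B_j
    counts, edges = np.histogram(data, bins=n_bins, range=(0.0, 1.0))
    theta_hat = counts / len(data)
    j = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    return theta_hat[j] / binwidth

data = rng.beta(2, 5, size=1000)          # some density supported on [0, 1]
grid = np.linspace(0.01, 0.99, 5)
print(hist_density(grid, data, h=0.1))
```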

First we bound the bias. Let θj = P (X ∈ Bj ) = ∫_{Bj} p(u)du. For any x ∈ Bj ,

ph (x) ≡ E(p̂h (x)) = θj / h^d                    (4)

and hence

p(x) − ph (x) = p(x) − ( ∫_{Bj} p(u)du ) / h^d = (1/h^d) ∫_{Bj} (p(x) − p(u))du.

Thus,

|p(x) − ph (x)| ≤ (1/h^d) ∫_{Bj} |p(x) − p(u)|du ≤ (1/h^d) ∫_{Bj} Lh√d du = Lh√d

where we used the fact that if x, u ∈ Bj then ‖x − u‖ ≤ √d h.

Now we bound the variance. Since p is Lipschitz on a compact set, it is bounded. Hence,
θj = ∫_{Bj} p(u)du ≤ C ∫_{Bj} du = Ch^d for some C. Thus, the variance is

Var(p̂h (x)) = (1/h^{2d}) Var(θ̂j ) = θj (1 − θj )/(nh^{2d}) ≤ θj /(nh^{2d}) ≤ C/(nh^d).

We conclude that the L2 risk is bounded by

sup_{p∈P(L)} R(p, p̂) = sup_{p∈P(L)} ∫ E(p̂h (x) − p(x))² dx ≤ L²h²d + C/(nh^d).          (5)

The upper bound is minimized by choosing h = ( C/(L²nd) )^{1/(d+2)} . (Later, we shall see a more
practical way to choose h.) With this choice,

sup_{P ∈P(L)} R(p, p̂) ≤ C0 (1/n)^{2/(d+2)}

where C0 = L²d (C/(L²d))^{2/(d+2)} .

Later, we will prove the following theorem which shows that this upper bound is tight.
Specifically:

Theorem 2 There exists a constant C > 0 such that

inf_{p̂} sup_{P ∈P(L)} E ∫ (p̂(x) − p(x))² dx ≥ C (1/n)^{2/(d+2)} .                    (6)

3.1 Concentration Analysis For Histograms

Let us now derive a concentration result for p̂h . We will bound

sup_{P ∈P} P^n (‖p̂h − p‖∞ > ε)

where ‖f‖∞ = sup_x |f (x)|. Assume that ε ≤ 1. First, note that

P(‖p̂h − ph ‖∞ > ε) = P( max_j |θ̂j /h^d − θj /h^d | > ε ) = P( max_j |θ̂j − θj | > εh^d ) ≤ Σ_j P(|θ̂j − θj | > εh^d ).

Using Bernstein’s inequality and the fact that θj (1 − θj ) ≤ θj ≤ Ch^d ,

P(|θ̂j − θj | > εh^d ) ≤ 2 exp( − (1/2) nε²h^{2d} / (θj (1 − θj ) + εh^d /3) )
                     ≤ 2 exp( − (1/2) nε²h^{2d} / (Ch^d + εh^d /3) )
                     ≤ 2 exp( −cnε²h^d )

where c = 1/(2(C + 1/3)). By the union bound and the fact that N ≤ (1/h)^d ,

P( max_j |θ̂j − θj | > εh^d ) ≤ 2h^{−d} exp( −cnε²h^d ) ≡ πn .

Earlier we saw that sup_x |p(x) − ph (x)| ≤ L√d h. Hence, with probability at least 1 − πn ,

‖p̂h − p‖∞ ≤ ‖p̂h − ph ‖∞ + ‖ph − p‖∞ ≤ ε + L√d h.                    (7)

Now set

ε = √( (1/(cnh^d)) log( 2/(δh^d) ) ).

Then, with probability at least 1 − δ,

‖p̂h − p‖∞ ≤ √( (1/(cnh^d)) log( 2/(δh^d) ) ) + L√d h.                    (8)

Choosing h = (c2 /n)^{1/(2+d)} we conclude that, with probability at least 1 − δ,

‖p̂h − p‖∞ ≤ √( c⁻¹ n^{−2/(2+d)} [ log(2/δ) + (2/(2+d)) log n ] ) + L√d n^{−1/(2+d)} = O( (log n / n)^{1/(2+d)} ).    (9)

4 Kernel Density Estimation


A one-dimensional smoothing kernel is any smooth function K such that ∫ K(x) dx = 1,
∫ xK(x)dx = 0 and σ²_K ≡ ∫ x²K(x)dx > 0. Smoothing kernels should not be confused with
Mercer kernels which we discuss later. Some commonly used kernels are the following:

Boxcar:       K(x) = (1/2) I(x)             Gaussian: K(x) = (1/√(2π)) e^{−x²/2}
Epanechnikov: K(x) = (3/4)(1 − x²) I(x)     Tricube:  K(x) = (70/81)(1 − |x|³)³ I(x)

where I(x) = 1 if |x| ≤ 1 and I(x) = 0 otherwise. These kernels are plotted in Figure 2.
Two commonly used multivariate kernels are Π_{j=1}^{d} K(xj ) and K(‖x‖).


Figure 2: Examples of smoothing kernels: boxcar (top left), Gaussian (top right), Epanech-
nikov (bottom left), and tricube (bottom right).


Figure 3: A kernel density estimator pb. At each point x, pb(x) is the average of the kernels
centered over the data points Xi . The data points are indicated by short vertical bars. The
kernels are not drawn to scale.

Suppose that X ∈ Rd . Given a kernel K and a positive number h, called the bandwidth,
the kernel density estimator is defined to be

p̂(x) = (1/n) Σ_{i=1}^{n} (1/h^d) K( ‖x − Xi ‖ / h ).                    (10)

More generally, we define

p̂H (x) = (1/n) Σ_{i=1}^{n} KH (x − Xi )

where H is a positive definite bandwidth matrix and KH (x) = |H|^{−1/2} K(H^{−1/2} x). For
simplicity, we will take H = h²I and we get back the previous formula.
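Here is a minimal Python/NumPy sketch of the kernel density estimator (10) for d = 1 with a Gaussian kernel, applied to simulated draws from the claw density of Example 1 (illustrative only; the data-generation step and the bandwidths are assumptions, not from the notes).

```python
import numpy as np

rng = np.random.default_rng(3)

def gaussian_kernel(u):
    """One-dimensional Gaussian smoothing kernel."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x, data, h):
    """Kernel density estimator (10) for d = 1: average of scaled kernels at each x."""
    x = np.atleast_1d(x)
    u = (x[:, None] - data[None, :]) / h          # (n_eval, n_data) matrix of (x - X_i)/h
    return gaussian_kernel(u).mean(axis=1) / h

# Simulate from the claw density: with prob 1/2 draw N(0,1),
# otherwise one of the five narrow components N((j/2)-1, 1/10).
n = 1000
comp = rng.integers(0, 2, size=n)
claw_means = (rng.integers(0, 5, size=n) / 2.0) - 1.0
data = np.where(comp == 0, rng.normal(0, 1, n), rng.normal(claw_means, 0.1))

grid = np.linspace(-3, 3, 7)
for h in [0.05, 0.5]:                             # small vs large bandwidth
    print("h =", h, np.round(kde(grid, data, h), 3))
```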

Sometimes we write the estimator as pbh to emphasize the dependence on h. In the multivari-
ate case the coordinates of Xi should be standardized so that each has the same variance,
since the norm kx − Xi k treats all coordinates as if they are on the same scale.

The kernel estimator places a smoothed out lump of mass of size 1/n over each data point
Xi ; see Figure 3. The choice of kernel K is not crucial, but the choice of bandwidth h
is important. Small bandwidths give very rough estimates while larger bandwidths give
smoother estimates.

4.1 Risk Analysis

In this section we examine the accuracy of kernel density estimation. We will first need a
few definitions.

Assume that Xi ∈ X ⊂ Rd where X is compact. Let β and L be positive numbers. Given a
vector s = (s1 , . . . , sd ), define |s| = s1 + · · · + sd , s! = s1 ! · · · sd !, x^s = x1^{s1} · · · xd^{sd} and

D^s = ∂^{s1 +···+sd} / (∂x1^{s1} · · · ∂xd^{sd}).

Let β be a positive integer. Define the Hölder class

Σ(β, L) = { g : |D^s g(x) − D^s g(y)| ≤ L‖x − y‖, for all s such that |s| = β − 1, and all x, y }.   (11)

For example, if d = 1 and β = 2 this means that

|g′(x) − g′(y)| ≤ L |x − y|, for all x, y.

The most common case is β = 2; roughly speaking, this means that the functions have
bounded second derivatives.

If g ∈ Σ(β, L) then g(x) is close to its Taylor series approximation:

|g(u) − gx,β (u)| ≤ L‖u − x‖^β                    (12)

where

gx,β (u) = Σ_{|s|≤β} ((u − x)^s / s!) D^s g(x).                    (13)

In the common case of β = 2, this means that

| p(u) − [p(x) + (x − u)^T ∇p(x)] | ≤ L‖x − u‖².

Assume now that the kernel K has the form K(x) = G(x1 ) · · · G(xd ) where G has support
on [−1, 1], ∫ G = 1, ∫ |G|^p < ∞ for any p ≥ 1, ∫ |t|^β |K(t)|dt < ∞ and ∫ t^s K(t)dt = 0 for
s ≤ β.

An example of a kernel that satisfies these conditions for β = 2 is G(x) = (3/4)(1 − x²) for
|x| ≤ 1. Constructing a kernel that satisfies ∫ t^s K(t)dt = 0 for β > 2 requires using kernels
that can take negative values.

Let ph (x) = E[p̂h (x)]. The next lemma provides a bound on the bias ph (x) − p(x).

Lemma 3 The bias of p̂h satisfies:

sup_{p∈Σ(β,L)} |ph (x) − p(x)| ≤ ch^β                    (14)

for some c.

Proof. We have

|ph (x) − p(x)| = | (1/h^d) ∫ K(‖u − x‖/h)p(u)du − p(x) |
               = | ∫ K(‖v‖)(p(x + hv) − p(x))dv |
               ≤ | ∫ K(‖v‖)(p(x + hv) − px,β (x + hv))dv | + | ∫ K(‖v‖)(px,β (x + hv) − p(x))dv |.

The first term is bounded by Lh^β ∫ K(s)|s|^β ds since p ∈ Σ(β, L). The second term is 0 from
the properties of K since px,β (x + hv) − p(x) is a polynomial of degree β (with no constant
term). □

Next we bound the variance.

Lemma 4 The variance of p̂h satisfies:

sup_{p∈Σ(β,L)} Var(p̂h (x)) ≤ c/(nh^d)                    (15)

for some c > 0.

Proof. We can write p̂(x) = n^{−1} Σ_{i=1}^{n} Zi where Zi = (1/h^d) K( ‖x − Xi ‖ / h ). Then,

Var(Zi ) ≤ E(Zi²) = (1/h^{2d}) ∫ K²( ‖x − u‖ / h ) p(u)du = (h^d / h^{2d}) ∫ K²(‖v‖) p(x + hv)dv
         ≤ (sup_x p(x) / h^d) ∫ K²(‖v‖)dv ≤ c/h^d

for some c since the densities in Σ(β, L) are uniformly bounded. The result follows. □

Since the mean squared error is equal to the variance plus the bias squared we have:

Theorem 5 The L2 risk is bounded above, uniformly over Σ(β, L), by h^{2β} + 1/(nh^d) (up to
constants). If h ≍ n^{−1/(2β+d)} then

sup_{p∈Σ(β,L)} E ∫ (p̂h (x) − p(x))² dx ≍ (1/n)^{2β/(2β+d)} .                    (16)

When β = 2 and h ≍ n^{−1/(4+d)} we get the rate n^{−4/(4+d)} .

4.2 Minimax Bound

According to the next theorem, there does not exist an estimator that converges faster than
O(n^{−2β/(2β+d)} ). We state the result for integrated L2 loss although similar results hold for
other loss functions and other function spaces. We will prove this later in the course.

Theorem 6 There exists C depending only on β and L such that

inf_{p̂} sup_{p∈Σ(β,L)} Ep ∫ (p̂(x) − p(x))² dx ≥ C (1/n)^{2β/(2β+d)} .                    (17)

Theorem 6 together with (16) imply that kernel estimators are rate minimax.

4.3 Concentration Analysis of Kernel Density Estimator

Now we state a result which says how fast p̂(x) concentrates around p(x). First, recall
Bernstein’s inequality: Suppose that Y1 , . . . , Yn are iid with mean µ, Var(Yi ) ≤ σ² and
|Yi | ≤ M . Then

P(|Ȳ − µ| > ε) ≤ 2 exp( − nε² / (2σ² + 2Mε/3) ).                    (18)

Theorem 7 For all small ε > 0,

P(|p̂(x) − ph (x)| > ε) ≤ 2 exp( −cnh^d ε² ).                    (19)

Hence, for any δ > 0,

sup_{p∈Σ(β,L)} P( |p̂(x) − p(x)| > √( C log(2/δ)/(nh^d) ) + ch^β ) < δ                    (20)

for some constants C and c. If h ≍ n^{−1/(2β+d)} then

sup_{p∈Σ(β,L)} P( |p̂(x) − p(x)|² > c/n^{2β/(2β+d)} ) < δ.

Note that the last statement follows from the bias-variance calculation followed by Markov’s
inequality. The first statement does not.

Proof. By the triangle inequality,

|b
p(x) − p(x)| ≤ |b
p(x) − ph (x)| + |ph (x) − p(x)| (21)

10
where ph (x) = E(bp(x)). From Lemma 3, |ph (x) − p(x)| ≤ chβ for some c. Now pb(x) =
n−1 ni=1 Zi where
P
 
1 kx − Xi k
Zi = d K .
h h
Note that |Zi | ≤ c1 /hd where c1 = K(0). Also, Var(Zi ) ≤ c2 /hd from Lemma 4. Hence, by
Bernstein’s inequality,
n2 nhd 2
   
p(x) − ph (x)| > ) ≤ 2 exp −
P(|b ≤ 2 exp −
2c2 h−d + 2c1 h−d /3 4c2
p
whenever  ≤ 3c2 /c1 . If we choose  = C log(2/δ)/(nhd ) where C = 4c2 then
r !
C
P |b
p(x) − ph (x)| > ≤ δ.
nhd

The result follows from (21). 

4.4 Concentration in L∞

Theorem 7 shows that, for each x, pb(x) is close to p(x) with high probability. We would like
a version of this result that holds uniformly over all x. That is, we want a concentration
result for
kb
p − pk∞ = sup |b p(x) − p(x)|.
x
We can write
kb
ph − pk∞ ≤ kb ph − ph k∞ + chβ .
ph − ph k∞ + kph − pk∞ ≤ kb
We can bound the first term using something called bracketing together with Bernstein’s
theorem to prove that,
d
3n2 hd
  
C
P(kbph − ph k∞ > ) ≤ 4 exp − . (22)
hd+1  28K(0)

An alternative approach is to replace Bernstein’s inequality with a more sophisticated in-


equality due to Talagrand. We follow the analysis in Giné and Guillou (2002). Let
   
x−· d
F= K ,x ∈ R ,h > 0 .
h
We assume there exists positive numbers A and v such that
 v
A
sup N (Fh , L2 (P ), kF kL2 (P ) ) ≤ , (23)
P 

where N (T, d, ) denotes the -covering number of the metric space (T, d), F is the envelope
function of F and the supremum is taken over the set of all probability measures on Rd . The
quantities A and v are called the VC characteristics of Fh .

Theorem 8 (Giné and Guillou 2002) Assume that the kernel satisfies the above prop-
erty.

1. Let h > 0 be fixed. Then, there exist constants c1 > 0 and c2 > 0 such that, for all
small  > 0 and all large n,
 
ph (x) − ph (x)| >  ≤ c1 exp −c2 nhd 2 .

P sup |b (24)
x∈Rd

nhdn
2. Let hn → 0 as n → ∞ in such a way that | log hdn |
→ ∞. Let
s
| log hn |
n ≥ . (25)
nhdn

Then, for all n large enough, (24) holds with h and  replaced by hn and n , respectively.

The above theorem imposes minimal assumptions on the kernel K and, more importantly,
on the probability distribution P , whose density is not required to be bounded or smooth,
and, in fact, may not even exist. Combining the above theorem with Lemma 3 we have the
following result.

Theorem 9 Suppose that p ∈ Σ(β, L). Fix any δ > 0. Then


r !
C log n
P sup |bp(x) − p(x)| > d
+ chβ < δ
x nh

for some constants C and c where C depends on δ. Choosing h  log n/n−1/(2β+d) we have
 
2 C log n
P sup |bp(x) − p(x)| > 2β/(2β+d) < δ.
x n

4.5 Boundary Bias

We have ignored what happens near the boundary of the sample space. If x is O(h) close to
the boundary, the bias is O(h) instead of O(h2 ). There are a variety of fixes including: data
reflection, transformations, boundary kernels, local likelihood.

4.6 Confidence Bands and the CLT
p
Consider first a single point x. Let sn (x) = Var(b
ph (x)). The CLT implies that

pbh (x) − ph (x)


Zn (x) ≡ N (0, τ 2 (x)) H
sn (x)

for some τ (x). This is true even P hn → 0 and


if h = hn is decreasing. Specifically, suppose thatP
nhn → ∞. Note that Zn (x) = i=1 Lni , say. According to Lyapounov’s CLT, ni=1 Lni
n

N (0, 1) as long as
Xn
lim E[Ln,i |2+δ = 0
n→∞
i=1

for some δ > 0. But this is does not yield a confidence interval for p(x). To see why, let us
write
pbh (x) − p(x) pbh (x) − ph (x) ph (x) − p(x) bias
= + = Zn (x) + √ .
sn (x) sn (x) sn (x) var(x)
Assuming that the optimize the risk by balancing the bias and the variance, the second term
is some constant c. So
pbh (x) − p(x)
N (c, τ 2 (x)).
sn (x)
This means that the usual confidence interval pbh (x) ± zα/2 s(x) will not cover p(x) with
probability tending to 1 − α. One fix for this is to undersmooth the estimator. (We sacrifice
risk for coverage.) An easier approach is just to interpret pbh (x) ± zα/2 s(x) as a confidence
interval for the smoothed density ph (x) instead of p(x).

But this only gives an interval at one point. To get a confidence band we use the bootstrap.
Let Pn be the empirical distribution of X1 , . . . , Xn . The idea is to estimate the distribution
√ 
Fn (t) = P nhd ||b ph (x) − ph (x)||∞ ≤ t

with the bootstrap estimator


√ 
Fbn (t) = P p∗h (x) − pbh (x)||∞ ≤ t X1 , . . . , Xn
nhd ||b

where pb∗h is constructed from the bootstrap sample X1∗ , . . . , Xn∗ ∼ Pn . Later in the course,
we will show that
P
sup |Fn (t) − Fbn (t)| → 0.
t

Here is the algorithm.

1. Let Pn be the empirical distribution that puts mass 1/n at each data point Xi .

Figure 4: 95 percent bootstrap confidence bands using various bandwidths (h = 1, 2, 3).

2. Draw X1∗ , . . . , Xn∗ ∼ Pn . This is called a bootstrap sample.

3. Compute the density estimator p̂∗h based on the bootstrap sample.

4. Compute R = √(nh^d) ||p̂∗h − p̂h ||∞ .

5. Repeat steps 2-4 B times. This gives R1 , . . . , RB .

6. Let zα be the upper α quantile of the Rj ’s. Thus

(1/B) Σ_{j=1}^{B} I(Rj > zα ) ≈ α.

7. Let

ℓn (x) = p̂h (x) − zα /√(nh^d) ,    un (x) = p̂h (x) + zα /√(nh^d) .
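Below is a compact Python/NumPy sketch of the bootstrap band algorithm above for d = 1 (illustrative; the Gaussian-kernel estimator, bandwidth, grid, and simulated data are assumptions, and the sup norm is approximated on the grid).

```python
import numpy as np

rng = np.random.default_rng(4)

def kde(x, data, h):
    """Gaussian kernel density estimate on a grid x (d = 1)."""
    u = (x[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))

def bootstrap_band(data, h, grid, B=500, alpha=0.05):
    """Bootstrap confidence band for the smoothed density p_h, following steps 1-7 (h^d = h here)."""
    n = len(data)
    p_hat = kde(grid, data, h)
    R = np.empty(B)
    for b in range(B):
        boot = rng.choice(data, size=n, replace=True)      # draw X* ~ P_n
        R[b] = np.sqrt(n * h) * np.max(np.abs(kde(grid, boot, h) - p_hat))
    z_alpha = np.quantile(R, 1.0 - alpha)                   # upper alpha quantile of the R_j
    half_width = z_alpha / np.sqrt(n * h)
    return p_hat - half_width, p_hat + half_width

data = rng.normal(size=200)
grid = np.linspace(-3, 3, 50)
lower, upper = bootstrap_band(data, h=0.3, grid=grid)
print("band width near 0:", float(upper[25] - lower[25]))
```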

Theorem 10 Under appropriate (very weak) conditions, we have



lim inf P `n (x) ≤ ph (x) ≤ u(x) for all x ≥ 1 − α.
n→∞

See Figure 4.

If you want a confidence band for p you need to reduce the bias (undersmooth). A simple
way to do this is with twicing. Suppose that β = 2 and that we use the kernel estimator p̂h .
Note that,

E[p̂h (x)] = p(x) + C(x)h² + o(h²)
E[p̂2h (x)] = p(x) + 4C(x)h² + o(h²)

for some C(x). That is, the leading term of the bias is b(x) = C(x)h². So if we define

b̂(x) = ( p̂2h (x) − p̂h (x) ) / 3

then

E[b̂(x)] = b(x).

We define the bias-reduced estimator

p̃h (x) = p̂h (x) − b̂(x) = (4/3) ( p̂h (x) − (1/4) p̂2h (x) ).

A confidence set centered at p̃h will be asymptotically valid but will not be an optimal
estimator. This is a fundamental conflict between estimation and inference.
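A small Python sketch of the twicing estimator above (illustrative; it assumes the same 1-d Gaussian-kernel estimator and simulated data as the earlier sketches):

```python
import numpy as np

rng = np.random.default_rng(5)

def kde(x, data, h):
    """1-d Gaussian kernel density estimate."""
    u = (np.asarray(x)[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))

def twicing(x, data, h):
    """Bias-reduced estimator p_tilde_h = (4/3) * (p_hat_h - (1/4) p_hat_{2h})."""
    return (4.0 / 3.0) * (kde(x, data, h) - 0.25 * kde(x, data, 2.0 * h))

data = rng.normal(size=500)
grid = np.array([-1.0, 0.0, 1.0])
print("p_hat_h :", np.round(kde(grid, data, 0.4), 3))
print("twicing :", np.round(twicing(grid, data, 0.4), 3))
```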

5 Cross-Validation

In practice we need a data-based method for choosing the bandwidth h. To do this, we will
need to estimate the risk of the estimator and minimize the estimated risk over h. Here, we
describe two cross-validation methods.

5.1 Leave One Out

A common method for estimating risk is leave-one-out cross-validation. Recall that the loss
function is
∫ (p̂(x) − p(x))² dx = ∫ p̂²(x)dx − 2 ∫ p̂(x)p(x)dx + ∫ p²(x)dx.

The last term does not involve p̂ so we can drop it. Thus, we now define the loss to be

L(h) = ∫ p̂²(x) dx − 2 ∫ p̂(x)p(x)dx.

The risk is R(h) = E(L(h)).

Definition 11 The leave-one-out cross-validation estimator of risk is

R̂(h) = ∫ ( p̂(−i) (x) )² dx − (2/n) Σ_{i=1}^{n} p̂(−i) (Xi )                    (26)

where p̂(−i) is the density estimator obtained after removing the ith observation.

It is easy to check that E[R̂(h)] = R(h).
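A direct, unoptimized Python sketch of the leave-one-out score (26) for a 1-d Gaussian-kernel estimator (illustrative; the squared-density term is computed numerically on a grid and averaged over the leave-one-out fits, and the data are simulated). In practice one would use a closed form such as (27) below, or the FFT.

```python
import numpy as np

rng = np.random.default_rng(6)

def kde_at(x, data, h):
    """1-d Gaussian kernel density estimate at points x given data."""
    u = (np.asarray(x)[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))

def loo_cv_score(data, h, grid):
    """Estimate (26): averaged integral of p_hat_(-i)^2 minus (2/n) sum of p_hat_(-i)(X_i)."""
    n = len(data)
    dx = grid[1] - grid[0]
    loo_at_xi = np.empty(n)
    sq_integral = 0.0
    for i in range(n):
        rest = np.delete(data, i)
        loo_at_xi[i] = kde_at([data[i]], rest, h)[0]
        sq_integral += np.sum(kde_at(grid, rest, h) ** 2) * dx / n
    return sq_integral - 2.0 * loo_at_xi.mean()

data = rng.normal(size=100)
grid = np.linspace(-4, 4, 200)
for h in [0.1, 0.3, 0.6, 1.0]:
    print(f"h={h:.1f}  CV score={loo_cv_score(data, h, grid):.4f}")
```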

When the kernel is Gaussian, the cross-validation score can be written, after some tedious
algebra, as follows. Let φ(z; σ) denote a Normal density with mean 0 and variance σ 2 . Then,
√ d
φd
(0; 2h) n − 2 XY √
R(h) =
b + φ(X i` − X j` ; 2h) (27)
(n − 1) n(n − 1)2 i6=j `=1
d
2 XY
− φ(Xi` − Xj` ; h). (28)
n(n − 1) i6=j `=1

The estimator pb and the cross-validation score can be computed quickly using the fast Fourier
transform; see pages 61–66 of Silverman (1986).

For histograms, it is easy to work out the leave-one-out cross-validation score in closed form:

R̂(h) = 2/((n − 1)h) − ( (n + 1)/((n − 1)h) ) Σ_j θ̂j² .

A further justification for cross-validation is given by the following theorem due to Stone
(1984).

Theorem 12 (Stone’s theorem) Suppose that p is bounded. Let p̂h denote the kernel
estimator with bandwidth h and let ĥ denote the bandwidth chosen by cross-validation. Then,

∫ (p(x) − p̂ĥ (x))² dx / ( inf_h ∫ (p(x) − p̂h (x))² dx )  →  1  almost surely.                    (29)

The bandwidth for the density estimator in the bottom left panel of Figure 1 is based on
cross-validation. In this case it worked well but of course there are lots of examples where
there are problems. Do not assume that, if the estimator pb is wiggly, then cross-validation
has let you down. The eye is not a good judge of risk.

There are cases when cross-validation can seriously break down. In particular, if there are
ties in the data then cross-validation chooses a bandwidth of 0.

5.2 Data Splitting

An alternative to leave-one-out is V -fold cross-validation. A common choice is V = 10. For
simplicity, let us consider here just splitting the data in two halves. This version of cross-
validation comes with stronger theoretical guarantees. Let pbh denote the kernel estimator

based on bandwidth h. For simplicity, assume the sample size is even and denote the sample
size by 2n. Randomly split the data X = (X1 , . . . , X2n ) into two sets of size n. Denote
these by Y = (Y1 , . . . , Yn ) and Z = (Z1 , . . . , Zn ).1 Let H = {h1 , . . . , hN } be a finite grid of
bandwidths. Let n  
1X 1 kx − Yi k
pbj (x) = K .
n i=1 hdj h
Thus we have a set P = {b
p1 , . . . , pb} of density estimators.
R R
We would like to minimize L(p, pbj ) = pb2j (x) − 2 pbj (x)p(x)dx. Define the estimated risk
Z n
b pbj ) = pb2 (x) − 2
X
Lbj ≡ L(p, pbj (Zi ). (30)
j
n i=1

Let pb = argming∈P L(p,


b g). Schematically:

Y → {b
p1 , . . . , pbN } = P
split
X = (X1 , . . . , X2n ) =⇒
Z → {L bN }
b1 , . . . , L

Theorem 13 (Wegkamp 1999) There exists a C > 0 such that


C log N
p − pk2 ) ≤ 2 min E(kg − pk2 ) +
E(kb .
g∈P n

This theorem can be proved using concentration of measure techniques that we discuss later
in class. A similar result can be proved for V -fold cross-validation.

5.3 Asymptotic Expansions

In this section we consider some asymptotic expansions that describe the behavior of the
kernel estimator. We focus on the case d = 1.

Theorem 14 Let RxR = E(p(x) − pb(x))2 and let R = Rx dx. Assume that p00 is absolutely
R

continuous and that p000 (x)2 dx < ∞. Then,


R
1 4 4 00 2 p(x) K 2 (x)dx
 
1
Rx = σK hn p (x) + +O + O(h6n )
4 nhn n
1
It is not necessary to split the data into two sets of equal size. We use the equal split version for
simplicity.

and R
K 2 (x)dx
Z  
1 4 4 00 2 1
R = σK hn p (x) dx + +O + O(h6n ) (31)
4 nh n
2
R
where σK = x2 K(x)dx.

Proof. Write Kh (x, X) = h−1 K ((x − X)/h) and pb(x) = n−1 i Kh (x, Xi ). Thus, E[b
P
p(x)] =
E[Kh (x, X)] and Var[bp(x)] = n−1 Var[Kh (x, X)]. Now,
 
x−t
Z
1
E[Kh (x, X)] = K p(t) dt
h h
Z
= K(u)p(x − hu) du
h2 u2 00
Z  
0
= K(u) p(x) − hup (x) + p (x) + · · · du
2
Z
1
= p(x) + h2 p00 (x) u2 K(u) du · · ·
2
R R
since K(x) dx = 1 and x K(x) dx = 0. The bias is
1 2 2 00
E[Khn (x, X)] − p(x) = σK hn p (x) + O(h4n ).
2
By a similar calculation,
R
p(x) K 2 (x) dx
 
1
Var[b
p(x)] = +O .
n hn n
The first result then follows since the risk is the squared bias plus variance. The second
result follows from integrating the first result. 

If we differentiate (31) with respect to h and set it equal to 0, we see that the asymptotically
optimal bandwidth is
 1/5
c2
h∗ = (32)
c21 A(f )n
where c1 = x2 K(x)dx, c2 = K(x)2 dx and A(f ) = f 00 (x)2 dx. This is informative
R R R

because it tells us that the best bandwidth decreases at rate n−1/5 . Plugging h∗ into (31),
we see that if the optimal bandwidth is used then R = O(n−4/5 ).

6 High Dimensions

The rate of convergence n−2β/(2β+d) is slow when the dimension d is large. In this case it is
hopeless to try to estimate the true density p precisely in the L2 norm (or any similar norm).

We need to change our notion of what it means to estimate p in a high-dimensional problem.
Instead of estimating p precisely we have to settle for finding an adequate approximation
of p. Any estimator that finds the regions where p puts large amounts of mass should be
considered an adequate approximation. Let us consider a few ways to implement this type
of thinking.

Biased Density Estimation. Let ph (x) = E(bph (x)). Then


 
kx − uk
Z
1
ph (x) = K p(u)du
hd h
R
so that the mean of pbh can be thought of as a smoothed version of p. Let Ph (A) = A
ph (u)du
be the probability distribution corresponding to ph . Then

Ph = P ⊕ K h

where ⊕ denotes convolution2 and Kh is the distribution with density h−d K(kuk/h). In
other words, if X ∼ Ph then X = Y + Z where Y ∼ P and Z ∼ Kh . This is just another
way to say that Ph is a blurred or smoothed version of P . ph need not be close in L2 to p but
still could preserve most of the important shape information about p. Consider then choosing
a fixed h > 0 and estimating ph instead of p. This corresponds to ignoring the bias in the
density estimator. From Theorem 8 we conclude:

Theorem 15 Let h > 0 be fixed. Then P(‖p̂h − ph ‖∞ > ε) ≤ C e^{−ncε²} . Hence,

‖p̂h − ph ‖∞ = OP ( √(log n / n) ).

The rate of convergence is fast and is independent of dimension. How to choose h is not
clear.

Independence Based Methods. If we can live with some bias, we can reduce the dimen-
sionality by imposing some independence assumptions. The simplest example is to treat the
components (X1 , . . . , Xd ) as if they are independent. In that case
d
Y
p(x1 , . . . , xd ) = pj (xj )
j=1

and the problem is reduced to a set of one-dimensional density estimation problems.


2
If X ∼ P and Y ∼ Q are independent, then the distribution of X + Y is denoted by P ? Q and is called
the convolution of P and Q.

An extension is to use a forest. We represent the distribution with an undirected graph. A
graph with no cycles is a forest. Let E be the edges of the graph. Any density consistent
with the forest can be written as
d
Y Y pj,k (xj , xk )
p(x) = pj (xj ) .
j=1
pj (xj )pk (xk )
(j,k)∈E

To estimate the density therefore only require that we estimate one and two-dimensional
marginals. But how do we find the edge set E? Some methods are discussed in Liu et al
(2011) under the name “Forest Density Estimation.” A simple approach is to connect pairs
greedily using some measure of correlation.

Density Trees. Ram and Gray (2011) suggest a recursive partitioning scheme similar to
decision trees. They split each coordinate dyadically, in a greedy fashion. The density
estimator is taken to be piecewise constant. They use an L2 risk estimator to decide when to
split. This seems promising. The idea seems to have been re-discovered in Yang and Wong
(arXiv:1404.1425) and Liu and Wong (arXiv:1401.2597). Density trees seem very promising.
It would be nice if there was an R package to do this and if there were more theoretical
results.

7 Example

Figure 5 shows a synthetic two-dimensional data set, the cross-validation function and two
kernel density estimators. The data are 100 points generated as follows. We select a point
randomly on the unit circle then add Normal noise with standard deviation 0.1. The first
estimator (lower left) uses the bandwidth that minimizes the leave-one-out cross-validation
score. The second uses twice that bandwidth. The cross-validation curve is very sharply
peaked with a clear minimum. The resulting density estimate is somewhat lumpy. This is
because cross-validation is aiming to minimize L2 error which does not guarantee that the
estimate is smooth. Also, the dataset is small so this effect is more noticeable. The estimator
with the larger bandwidth is noticeably smoother. However, the lumpiness of the estimator
is not necessarily a bad thing.

8 Derivatives

Kernel estimators can also be used to estimate the derivatives of a density.3 Let D⊗r p denote
the rth derivative p. We are using Kronecker notation. Let D⊗0 p = p, D⊗1 f is the gradient
3
In this section we follow Chacon and Duong (2013), Electronic Journal of Statistics, 7, 499-532.


Figure 5: Synthetic two-dimensional data set. Top left: data. Top right: cross-validation
function. Bottom left: kernel estimator based on the bandwidth that minimizes the cross-
validation score. Bottom right: kernel estimator based on the twice the bandwidth that
minimizes the cross-validation score.

of p, and D⊗2 p = vecH where H is the Hessian. We also write this as p(r) when convenient.

Let H be a bandwidth matrix and let


n
1X
pbH (x) = KH (x − Xi )
n i=1

where KH (x) = |H|−1/2 K(H −1/2 x). We define


n
(r) (x) = D ⊗r p
1 X ⊗r
pc bH (x) = D KH (x − Xi ).
n i=1
For computation, it is useful to note that
D⊗r KH (x) = |H|−1/2 (H −1/2 )⊗r D⊗r KH (H −1/2 x).

The asymptotic mean squared error is derived in Chacon, Duong and Wand (2011) and is
given by
1 −1/2 m2 (K)
|H |tr((H −1 )⊗r R(D⊗r (K))) + 2 tr((Idr ⊗ vecT H)R(D⊗(r+2) p)(Idr ⊗ vec(H)))
n 4
R R
where R(g) = g(x)g T (x)dx, m2 (K) = xxT K(x)dx. The optimal H has entries of order
n−2/(d+2r+4) which yield an asymptotic mean squared error of order n−4/(d+2r+4) . In dimension
d = 1, the risk looks like this as a function of r:

r risk
0 n−4/5
1 n−4/7
2 n−4/9

We see that estimating derivatives is harder than estimating the density itself.

Chacon and Duong (2013) derive an estimate of the risk:


CVr (H) = (−1)r |H|−1/2 vecT (H −1 )⊗r Gn
where
1 X ⊗2r −1/2 2 X
Gn = D K(H (X i − X j )) − D⊗2r K(H −1/2 (Xi − Xj ))
n2 i,j n(n − 1) i6=j

and K = K ? K. We can now minimize CV over H. It would be nice if someone wrote an


R package to do this. I think the ks package does much of this.

One application of this that we consider later in the course is mode-based clustering. Here,
we use density estimation to find the modes of the density. We associate clusters with these
modes. We can also test for a mode by testing if D2 p(x) < 0 at the estimated modes.

9 Unsupervised Prediction and Anomaly Detection

We can use density estimation to do unsupervised prediction and anomaly detection. The
basic idea is due to Vovk, and was developed in a statistical framework in Lei, Robins and
Wasserman (2014).

Suppose we observe Y1 , . . . , Yn ∼ P . We want to predict Yn+1 . We will construct a level α


test for the null hypothesis H0 : Yn+1 = y. We do this for every value of y. Then we invert
the test, that is, we set Cn to be the set of y’s that are not rejected. It follows that

P(Yn+1 ∈ Cn ) ≥ 1 − α.

The prediction set Cn is finite sample and distribution-free.

Fix a value y. Let A = (Y1 , . . . , Yn , y) be the augmented dataset. That is, we set Yn+1 = y.
Let p̂A be a density estimate based on A. Consider the vector

p̂A (Y1 ), . . . , p̂A (Yn+1 ).

Under H0 , the rank of these values is uniformly distributed. That is, for each i,

P( p̂A (Yi ) ≤ p̂A (y) ) = 1/(n + 1).

A p-value for the test is

π(y) = (1/(n + 1)) Σ_{i=1}^{n+1} I( p̂A (Yi ) ≤ p̂A (y) ).

The prediction set is

Cn = { y : π(y) ≥ α }.

Computing Cn is tedious. Fortunately, Jing, Robins and Wasserman (2014) show that there is
a simpler set that still has the correct coverage (but is slightly larger). The set is constructed
as follows. Let Zi = p̂(Yi ). Order these observations

Z(1) ≤ · · · ≤ Z(n) .

Let k = ⌊(n + 1)α⌋ and let

t = Z(k) − K(0)/(nh^d).

Define

Cn+ = { y : p̂(y) ≥ t }.

Lemma 16 We have that Cn ⊂ Cn+ and hence

P(Yn+1 ∈ Cn ) ≥ 1 − α.

Finally, we note that any Yi with a small p-value can be regarded as an outlier (anomaly).
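A small Python sketch of the simpler prediction set Cn+ and its threshold t (illustrative; the 1-d Gaussian-kernel estimator, fixed bandwidth, and simulated data are assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(7)

def kde_at(x, data, h):
    """1-d Gaussian kernel density estimate at points x."""
    u = (np.asarray(x)[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))

def conformal_threshold(Y, h, alpha=0.1):
    """Threshold t = Z_(k) - K(0)/(n h^d) defining C_n^+ = {y : p_hat(y) >= t} (d = 1)."""
    n = len(Y)
    Z = np.sort(kde_at(Y, Y, h))            # Z_(1) <= ... <= Z_(n)
    k = int(np.floor((n + 1) * alpha))
    K0 = 1.0 / np.sqrt(2.0 * np.pi)         # K(0) for the Gaussian kernel
    return Z[k - 1] - K0 / (n * h)

Y = rng.normal(size=300)
h, alpha = 0.3, 0.1
t = conformal_threshold(Y, h, alpha)

# Check which candidate future values fall in the prediction set C_n^+.
candidates = np.array([0.0, 1.5, 3.5])
print("in C_n^+:", kde_at(candidates, Y, h) >= t)
```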

The above method is exact. We can also use a simpler, asymptotic approach. With Z(k)
defined above, set Ĉ = {y : p̂(y) ≥ t} where now t = Z(k) . From Cadre, Pelletier and Pudlo
(2013) we have that

√(nh^d) µ(Ĉ∆C) →P c

for some constant c where C is the true 1 − α level set. Hence, P (Yn+1 ∈ Ĉ) = 1 − α + oP (1).

10 Manifolds and Singularities

Sometimes a distribution is concentrated near a lower-dimensional set. This causes problems


for density estimation. In fact the density, as we usually think of it, may not be defined.

As a simple example, suppose P is supported on the unit circle in R2 . The distribution


P is singular with respect to Lebesgue measure µ. This means that there are sets A with
P (A) > 0 even though µ(A) = 0. Effectively, this means that the density is infinite. To see
this, consider a point x on the circle. Let B(x, ) be a ball of radius  centered at x. Then
P(B(x, ))
p(x) = lim → ∞. H
→0 µ(B(x, ))

Note also that the L2 loss does not make any sense. If you tried to use cross-validation, you
would find that the estimated risk is minimized at h = 0. H

A simple solution is to focus on estimating the smoothed density ph (x) which is well-defined
for every h > 0. More sophisticated ideas are based on topological data analysis which we
discuss later in the course.

11 Series Methods

We have emphasized kernel density estimation. There are many other density estimation
methods. Let us briefly mention a method based on basis functions. For simplicity, suppose
that Xi ∈ [0, 1] and let φ1 , φ2 , . . . be an orthonormal basis for
Z 1
F = {f : [0, 1] → R, f 2 (x)dx < ∞}.
0

Thus

∫ φ²j (x)dx = 1,    ∫ φj (x)φk (x)dx = 0 for j ≠ k.

An example is the cosine basis:

φ0 (x) = 1,    φj (x) = √2 cos(2πjx), j = 1, 2, . . . .

If p ∈ F then

p(x) = Σ_j βj φj (x)

where βj = ∫_0^1 p(x)φj (x)dx. An estimate of p is p̂(x) = Σ_{j=1}^{k} β̂j φj (x) where

β̂j = (1/n) Σ_{i=1}^{n} φj (Xi ).

The number of terms k is the smoothing parameter and can be chosen using cross-validation.
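A minimal Python sketch of the cosine-series estimator above (illustrative; it assumes data supported on [0, 1] and a hand-picked number of terms k):

```python
import numpy as np

rng = np.random.default_rng(8)

def cosine_basis(j, x):
    """Orthonormal cosine basis on [0, 1]: phi_0 = 1, phi_j = sqrt(2) cos(2 pi j x)."""
    x = np.asarray(x, dtype=float)
    return np.ones_like(x) if j == 0 else np.sqrt(2.0) * np.cos(2.0 * np.pi * j * x)

def series_estimator(x, data, k):
    """p_hat(x) = sum_j beta_hat_j phi_j(x) with beta_hat_j = (1/n) sum_i phi_j(X_i)."""
    est = np.zeros_like(np.asarray(x, dtype=float))
    for j in range(k + 1):
        beta_hat_j = cosine_basis(j, data).mean()
        est += beta_hat_j * cosine_basis(j, x)
    return est

data = rng.beta(2, 5, size=1000)            # a density supported on [0, 1]
grid = np.linspace(0.05, 0.95, 5)
for k in [3, 10]:                           # k is the smoothing parameter
    print("k =", k, np.round(series_estimator(grid, data, k), 3))
```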

It can be shown that


Z n
X ∞
X
2
p(x) − p(x)) dx] =
R = E[ (b Var(βbj ) + βj2 . H
j=1 j=k+1

The first term is of order O(k/n). To bound the second term (the bias) one usually assumes
that p is a Sobolev space of order q which means that p ∈ P with
( ∞
)
X X
P= p∈F : p= βj φj : βj2 j 2q < ∞ .
j j=1

In that case it can be shown that


 2q
k 1
R≈ + . H
n k
The optimal k is k ≈ n1/(2q+1) with risk
2q
  2q+1
1
R=O .
n

11.1 L1 Methods

Here we discuss another approach to choosing h aimed at the L1 loss. The idea is to select
a class of sets A (which we call test sets) and choose h to make ∫_A p̂h (x)dx close to P (A)
for all A ∈ A. That is, we would like to minimize

∆(g) = sup_{A∈A} | ∫_A g(x)dx − P (A) |.                    (33)

VC Classes. Let A be a class of sets with VC dimension ν. As in section 5.2, split the data
X into Y and Z with P = {b p1 , . . . , pbN } constructed from Y . For g ∈ P define
Z
∆n (g) = sup g(x)dx − Pn (A)
A∈A A
−1
Pn
where Pn (A) = n i=1 I(Zi ∈ A). Let pb = argming∈P ∆n (g).

Theorem 17 For any δ > 0 there exists c such that


 r 
ν
P ∆(b p) > min ∆(b pj ) + 2c < δ.
j n

Proof. We know that  r 


ν
P sup |Pn (A) − P (A)| > c < δ.
A∈A n
Hence, except on an event of probability at most δ, we have that
Z Z
∆n (g) = sup g(x)dx − Pn (A) ≤ sup g(x)dx − P (A) + sup Pn (A) − P (A)
A∈A A A∈A A A∈A
r
ν
≤ ∆(g) + c .
n
By a similar argument, ∆(g) ≤ ∆n (g) + c nν . Hence, |∆(g) − ∆n (g)| ≤ c nν for all g. Let
p p
p∗ = argming∈P ∆(g). Then,
r r r
ν ν ν
∆(p) ≤ ∆(b p) ≤ ∆n (b
p) + c ≤ ∆n (p∗ ) + c ≤ ∆(p∗ ) + 2c .
n n n


The difficulty in implementing this idea is computing and minimizing ∆n (g). Hjort and
Walker (2001) presented a similar method which can be practically implemented when d = 1.

Yatracos Classes. Devroye and Györfi (2001) use a class of sets called a Yatracos class which
leads to estimators with some remarkable properties. n Let P = {po1 , . . . , pN } be a set of
densities and define the Yatracos class of sets A = A(i, j) : i 6= j where A(i, j) = {x :
pi (x) > pj (x)}. Let
pb = argming∈G ∆(g)
where Z
∆n (g) = sup g(u)du − Pn (A)
A∈A A
Pn
and Pn (A) = n−1 i=1 I(Zi ∈ A) is the empirical measure based on a sample Z1 , . . . , Zn ∼ p.

Theorem 18 The estimator pb satisfies
Z Z
|b
p − p| ≤ 3 min |pj − p| + 4∆ (34)
j

R
where ∆ = supA∈A A
p − Pn (A) .

R
The term minj |pj − p| is like a bias while term ∆ is like the variance.
R R
Proof. Let i be such that pb = pi and let s be such that |ps − p| = minj |pj − p|. Let
B = {pi > ps } and C = {ps > pi }. Now,
Z Z Z
|b
p − p| ≤ |ps − p| + |ps − pi |. (35)

Let B denote all measurable sets. Then,


Z Z Z Z Z
|ps − pi | = 2 max pi − ps ≤ 2 sup pi − ps
A∈{B,C} A A A∈A A A
Z Z
≤ 2 sup pi − Pn (A) + 2 sup ps − Pn (A)
A∈A A A∈A A
Z
≤ 4 sup ps − Pn (A)
A∈A A
Z Z Z
≤ 4 sup ps − p + 4 sup p − Pn (A)
A∈A A A A∈A A
Z Z Z Z
= 4 sup ps − p + 4∆ ≤ 4 sup ps − p + 4∆
A∈A A A A∈B A A
Z
= 2 |ps − p| + 4∆.

The result follows from (35). 

Now we apply this to kernel estimators. Again we split the data X into two halves Y =
(Y1 , . . . , Yn ) and Z = (Z1 , . . . , Zn ). For each h let
n  
1X kx − Yi k
pbh (x) = K .
n i=1 h

Let n o
A = A(h, ν) : h, ν > 0, h 6= ν
where A(h, ν) = {x : pbh (x) > pbν (x)}. Define
Z
∆n (g) = sup g(u)du − Pn (A)
A∈A A

Pn
where Pn (A) = n−1 i=1 I(Zi ∈ A) is the empirical measure based on Z. Let

pb = argming∈G ∆(g).

Under some regularity conditions on the kernel, we have the following result.

Theorem 19 (Devroye and Györfi, 2001.) The risk of pb satisfies


Z Z r
log n
E |bp − p| ≤ c1 inf E |bph − p| + c2 . (36)
h n

The proof involves showing that the terms on the right hand side of (34) are small. We refer
the reader to Devroye and Györfi (2001) for the details.
R
Recall that dT V (P, Q) = supA |P (A) − Q(A)| = (1/2) |p(x) − q(x)|dx where the supremum
is over all measurable sets. The above theorem says that the estimator does well in the
total variation metric, even though the method only used the Yatracos class of sets. Finding
computationally efficient methods to implement this approach remains an open question.

12 Mixtures

Another approach to density estimation is to use mixtures. We will discuss mixture modelling
when we discuss clustering.

13 Two-Sample Hypothesis Testing

Density estimation can be used for two-sample testing. Given X1 , . . . , Xn ∼ p and Y1 , . . . , Ym ∼
q we can test H0 : p = q using ∫ (p̂ − q̂)² as a test statistic. More interestingly, we can test lo-
cally H0 : p(x) = q(x) at each x. See Duong (2013) and Kim, Lee and Lei (2018). Note that
under H0 , the bias cancels from p̂(x) − q̂(x). Also, some sort of multiple testing correction
is required.

14 Functional Data and Quasi-Densities

In some problems, X is not just high dimensional, it is infinite dimensional. For example
suppose that each Xi is a curve. An immediate problem is that the concept of a density
is no longer well defined. On a Euclidean space, the density p for a probability measure is

the function that satisfies P (A) = ∫_A p(u)dµ(u) for all measurable A where µ is Lebesgue
measure. Formally, we say that p is the Radon-Nikodym derivative of P with respect to the
dominating measure µ. Geometrically, we can think of p as

p(x) = lim_{ε→0} P(‖X − x‖ ≤ ε) / V (ε)

where V (ε) = ε^d π^{d/2} /Γ(d/2 + 1) is the volume of a sphere of radius ε. Under appropriate
conditions, these two notions of density agree. (This is the Lebesgue density theorem.)

When the outcome space X is a set of curves, there is no dominating measure and hence
there is no density. Instead, we define the density geometrically by

qε (x) = P(ξ(x, X) ≤ ε)

for a small ε where ξ is some metric on X . However we cannot divide by V (ε) and let ε tend
to 0 since the dimension d is infinite.

One way around this is to use a fixed ε and work with the unnormalized density qε . For the
purpose of finding high-density regions this may be adequate. An estimate of qε is

q̂ε (x) = (1/n) Σ_{i=1}^{n} I(ξ(x, Xi ) ≤ ε).

An alternative is to expand Xi into a basis: X(t) ≈ Σ_{j=1}^{k} βj ψj (t). A density can be defined
in terms of the βj ’s.

Example 20 Figure 6 shows the tracks (or paths) of 40 North Atlantic tropical cyclones
(TC). The full dataset, consisting of 608 from 1950 to 2006 is shown in Figure 7. Buchman,
Lee and Schafer (2009) provide a thorough analysis of the data. We refer the reader to their
paper for the full details.4

Each data point— that is, each TC track— is a curve in R2 . Various questions are of
interest: Where are the tracks most common? Is the density of tracks changing over time?
Is the track related to other variables such as windspeed and pressure?

Each curve Xi can be regarded as mapping Xi : [0, Ti ] → R2 where Xi (t) = (Xi1 (t), Xi2 (t))
is the position of the TC at time t and Ti is the lifelength of the TC. Let
n o
Γi = (Xi1 (t), Xi2 (t)) : 0 ≤ t ≤ Ti

be the graph of Xi . In other words, Γi is the track, regarded as a subset of points in R2 .


We will use the Hausdorff metric to measure the distance between curves. The Hausdorff
4
Thanks to Susan Buchman for providing the data.

Figure 6: Paths of 40 tropical cyclones in the North Atlantic.

distance between two sets A and B is

dH (A, B) = inf{ ε : A ⊂ B^ε and B ⊂ A^ε }                                       (37)
          = max{ sup_{x∈A} inf_{y∈B} ‖x − y‖, sup_{x∈B} inf_{y∈A} ‖x − y‖ }      (38)

where A^ε = ∪_{x∈A} B(x, ε) is called the enlargement of A and B(x, ε) = {y : ‖y − x‖ ≤ ε}.

We use the unnormalized kernel estimator

q̂ε (γ) = (1/n) Σ_{i=1}^{n} I(dH (γ, Γi ) ≤ ε).

Figure 8 shows the 10 TC’s with highest local density and the 10 TC’s with lowest local
density using ε = 16.38. This choice of ε corresponds to the 10th percentile of the values
{dH (Xi , Xj ) : i ≠ j}. The high density tracks correspond to TC’s in the gulf of Mexico with
short paths. The low density tracks correspond to TC’s in the Atlantic with long paths.
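A small Python sketch of the Hausdorff distance (37)-(38) between two curves represented as finite point sets (illustrative; the toy arcs below are assumptions, and real TC tracks would be discretized the same way):

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between two point sets A and B (rows are points in R^2)."""
    # Pairwise Euclidean distances: D[i, j] = ||A_i - B_j||
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    # max of the two directed Hausdorff distances, as in (38)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

# Two toy "tracks": points along two arcs of different radii.
t = np.linspace(0.0, np.pi, 50)
track1 = np.column_stack([np.cos(t), np.sin(t)])
track2 = np.column_stack([1.2 * np.cos(t), 1.2 * np.sin(t)])
print("d_H =", hausdorff(track1, track2))
```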

15 Miscellanea

Another method for selecting h which is sometimes used when p is thought to be very smooth
is the plug-in method. The idea is to take the formula for the mean squared error (equation
31), insert a guess of p′′ and then solve for the optimal bandwidth h. For example, if d = 1

Figure 7: Paths of 608 tropical cyclones in the North Atlantic.

Figure 8: 10 highest density paths (black) and 10 lowest density paths (blue).

and under the idealized assumption that p is a univariate Normal this yields h∗ = 1.06σn−1/5 .
Usually, σ is estimated by min{s, Q/1.34} where s is the sample standard deviation and Q
is the interquartile range.5 This choice of h∗ works well if the true density is very smooth
and is called the Normal reference rule.
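A one-function Python sketch of the Normal reference rule (illustrative; the data here are simulated):

```python
import numpy as np

rng = np.random.default_rng(9)

def normal_reference_bandwidth(data):
    """h* = 1.06 * sigma_hat * n^(-1/5), with sigma_hat = min(s, Q/1.34)."""
    n = len(data)
    s = np.std(data, ddof=1)
    q = np.subtract(*np.percentile(data, [75, 25]))   # interquartile range
    sigma_hat = min(s, q / 1.34)
    return 1.06 * sigma_hat * n ** (-1 / 5)

data = rng.normal(size=500)
print("Normal reference rule bandwidth:", normal_reference_bandwidth(data))
```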

Since we don’t want to necessarily assume that p is very smooth, it is usually better to
estimate h using cross-validation. See Loader (1999) for an interesting comparison between
cross-validation and plugin methods.

A generalization of the kernel method is to use adaptive kernels where one uses a different
bandwidth h(x) for each point x. One can also use a different bandwidth h(xi ) for each data
point. This makes the estimator more flexible and allows it to adapt to regions of varying
smoothness. But now we have the very difficult task of choosing many bandwidths instead
of just one.

Density estimation is sometimes used to find unusual observations or outliers. These are
observations for which pb(Xi ) is very small.

16 Summary
1. A commonly used nonparametric density estimator is the kernel estimator

$$\hat p_h(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h^d}\, K\left(\frac{\|x - X_i\|}{h}\right).$$

2. The kernel estimator is rate minimax over certain classes of densities.

3. Cross-validation methods can be used for choosing the bandwidth h.

⁵Recall that the interquartile range is the 75th percentile minus the 25th percentile. The reason for dividing
by 1.34 is that Q/1.34 is a consistent estimate of σ if the data are from a N(µ, σ²).

Nonparametric Regression
Statistical Machine Learning, Spring 2019
Ryan Tibshirani and Larry Wasserman

1 Introduction
1.1 Basic setup
Given a random pair (X, Y ) ∈ Rd × R, recall that the function

m0 (x) = E(Y |X = x)

is called the regression function (of Y on X). The basic goal in nonparametric regression: to
construct a predictor of Y given X. This is basically the same as constructing an estimate m̂
of m_0, from i.i.d. samples (X_i, Y_i) ∈ R^d × R, i = 1, . . . , n. Given a new X, our prediction of
Y is m̂(X). We often call X the input, predictor, feature, etc., and Y the output, outcome,
response, etc.
Note for i.i.d. samples (Xi , Yi ) ∈ Rd × R, i = 1, . . . , n, we can always write

$$Y_i = m_0(X_i) + \epsilon_i, \quad i = 1, \ldots, n,$$

where ε_i, i = 1, . . . , n are i.i.d. random errors, with mean zero. Therefore we can think
about the sampling distribution as follows: (X_i, ε_i), i = 1, . . . , n are i.i.d. draws from some
common joint distribution, where E(ε_i) = 0, and Y_i, i = 1, . . . , n are generated from the
above model.
It is common to assume that each ε_i is independent of X_i. This is a very strong as-
sumption, and you should think about it skeptically. We too will sometimes make this
assumption, for simplicity. It should be noted that a good portion of theoretical results
that we cover (or at least, similar theory) also holds without this assumption.

1.2 Fixed or random inputs?


Another common setup in nonparametric regression is to directly assume a model

$$Y_i = m_0(X_i) + \epsilon_i, \quad i = 1, \ldots, n,$$

where now X_i, i = 1, . . . , n are fixed inputs, and ε_i, i = 1, . . . , n are i.i.d. with E(ε_i) = 0.
For arbitrary Xi , i = 1, . . . , n, this is really just the same as starting with the random
input model, and conditioning on the particular values of Xi , i = 1, . . . , n. (But note: after
conditioning on the inputs, the errors are only i.i.d. if we assumed that the errors and inputs
were independent in the first place.)

Generally speaking, nonparametric regression estimators are not defined with the ran-
dom or fixed setups specifically in mind, i.e., there is no real distinction made here. A
caveat: some estimators (like wavelets) do in fact assume evenly spaced fixed inputs, as in
Xi = i/n, i = 1, . . . , n,
for evenly spaced inputs in the univariate case.
Theory is not completely the same between the random and fixed input worlds (some
theory is sharper when we assume fixed input points, especially evenly spaced input points),
but for the most part the theory is quite similar.
Therefore, in what follows, we won’t be very precise about which setup we assume—
random or fixed inputs—because it mostly doesn’t matter when introducing nonparametric
regression estimators and discussing basic properties.

1.3 Notation
We will define an empirical norm ‖ · ‖_n in terms of the training points X_i, i = 1, . . . , n,
acting on functions m : R^d → R, by

$$\|m\|_n^2 = \frac{1}{n}\sum_{i=1}^n m^2(X_i).$$

This makes sense no matter if the inputs are fixed or random (but in the latter case, it is a
random norm)
When the inputs are considered random, we will write P_X for the distribution of X, and
we will define the L_2 norm ‖ · ‖_2 in terms of P_X, acting on functions m : R^d → R, by

$$\|m\|_2^2 = E[m^2(X)] = \int m^2(x)\, dP_X(x).$$

So when you see ‖ · ‖_2 in use, it is a hint that the inputs are being treated as random.
A quantity of interest will be the (squared) error associated with an estimator m̂ of m_0,
which can be measured in either norm:

$$\|\hat m - m_0\|_n^2 \quad \text{or} \quad \|\hat m - m_0\|_2^2.$$

In either case, this is a random quantity (since m̂ is itself random). We will study bounds
in probability or in expectation. The expectation of the errors defined above, in terms of
either norm (but more typically the L2 norm) is most properly called the risk; but we will
often be a bit loose in terms of our terminology and just call this the error.

1.4 Bias-Variance Tradeoff


If (X, Y) is a new pair then

$$E(Y - \hat m(X))^2 = \int b_n^2(x)\, dP(x) + \int v(x)\, dP(x) + \tau^2 = \|\hat m - m_0\|_2^2 + \tau^2$$

where b_n(x) = E[m̂(x)] − m_0(x) is the bias, v(x) = Var(m̂(x)) is the variance and τ² =
E(Y − m_0(X))² is the unavoidable error. Generally, we have to choose tuning parameters
carefully to balance the bias and variance.
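To make the tradeoff concrete, here is a small simulation sketch (our own, not part of the notes) that estimates bias², variance, and their sum for a k-nearest-neighbors fit at a single point over repeated samples; the data-generating model, noise level, and grid of k values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m0 = lambda x: np.sin(4 * x)          # hypothetical true regression function
tau2, n, x0, reps = 0.3 ** 2, 200, 0.5, 500

def knn_at(x, X, Y, k):
    idx = np.argsort(np.abs(X - x))[:k]
    return Y[idx].mean()

for k in (3, 10, 30, 100):
    fits = np.empty(reps)
    for r in range(reps):
        X = rng.uniform(0, 1, n)
        Y = m0(X) + rng.normal(0, np.sqrt(tau2), n)
        fits[r] = knn_at(x0, X, Y, k)
    bias2 = (fits.mean() - m0(x0)) ** 2
    var = fits.var()
    print(f"k={k:3d}  bias^2={bias2:.4f}  var={var:.4f}  sum={bias2 + var:.4f}")
```

Small k gives small bias but large variance; large k does the opposite, which is the balance the tuning parameter must strike.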

1.5 What does “nonparametric” mean?
Importantly, in nonparametric regression we don’t assume a particular parametric form
for m_0. This doesn't mean, however, that we can't estimate m_0 using (say) a linear com-
bination of spline basis functions, written as $\hat m(x) = \sum_{j=1}^p \hat\beta_j g_j(x)$. A common question:
the coefficients on the spline basis functions β1 , . . . , βp are parameters, so how can this be
nonparametric? Again, the point is that we don’t assume a parametric form for m0 , i.e.,
we don’t assume that m0 itself is an exact linear combination of splines basis functions
g1 , . . . , gp .

1.6 What we cover here


The goal is to expose you to a variety of methods, and give you a flavor of some interesting
results, under different assumptions. A few topics we will cover in more depth than others,
but overall, this will be far from a complete treatment of nonparametric regression. Below
are some excellent texts that you can consult for more details, proofs, etc.
Nearest neighbors, kernel smoothing, local polynomials: Tsybakov (2009). Smoothing
splines: de Boor (1978), Green & Silverman (1994), Wahba (1990). Reproducing kernel
Hilbert spaces: Scholkopf & Smola (2002), Wahba (1990). Wavelets: Johnstone (2011),
Mallat (2008). General references, more theoretical: Gyorfi, Kohler, Krzyzak & Walk
(2002), Wasserman (2006). General references, more methodological: Hastie & Tibshirani
(1990), Hastie, Tibshirani & Friedman (2009), Simonoff (1996).
Throughout, our discussion will bounce back and forth between the multivariate case
(d > 1) and univariate case (d = 1). Some methods have obvious (natural) multivariate ex-
tensions; some don’t. In any case, we can always use low-dimensional (even just univariate)
nonparametric regression methods as building blocks for a high-dimensional nonparametric
method. We’ll study this near the end, when we talk about additive models.

1.7 Holder Spaces and Sobolev Spaces


The class of Lipschitz functions H(1, L) on T ⊂ R is the set of functions g such that

|g(y) − g(x)| ≤ L|x − y| for all x, y ∈ T.

A differentiable function is Lipschitz if and only if it has a bounded derivative. Conversely, a
Lipschitz function is differentiable almost everywhere.
Let T ⊂ R and let β be an integer. The Holder space H(β, L) is the set of functions g
mapping T to R such that g is ℓ = β − 1 times differentiable and satisfies

$$|g^{(\ell)}(y) - g^{(\ell)}(x)| \leq L|x - y| \quad \text{for all } x, y \in T.$$

(There is an extension to real valued β but we will not need that.) If g ∈ H(β, L) and
ℓ = β − 1, then we can define the Taylor approximation of g at x by

$$\tilde g(y) = g(x) + (y - x)g'(x) + \cdots + \frac{(y - x)^\ell}{\ell!}\, g^{(\ell)}(x)$$

and then |g(y) − g̃(y)| ≤ L|y − x|^β.

The definition for higher dimensions is similar. Let X be a compact subset of Rd . Let
β and L be positive numbers. Given a vector s = (s1 , . . . , sd ), define |s| = s1 + · · · + sd ,
s! = s_1! · · · s_d!, x^s = x_1^{s_1} · · · x_d^{s_d} and

$$D^s = \frac{\partial^{s_1 + \cdots + s_d}}{\partial x_1^{s_1} \cdots \partial x_d^{s_d}}.$$

Let β be a positive integer. Define the Hölder class


$$H_d(\beta, L) = \Big\{ g : |D^s g(x) - D^s g(y)| \leq L\|x - y\|, \ \text{for all } s \text{ such that } |s| = \beta - 1, \text{ and all } x, y \Big\}. \qquad (1)$$
For example, if d = 1 and β = 2 this means that

|g 0 (x) − g 0 (y)| ≤ L |x − y|, for all x, y.

The most common case is β = 2; roughly speaking, this means that the functions have
bounded second derivatives.
Again, if g ∈ Hd (β, L) then g(x) is close to its Taylor series approximation:

$$|g(u) - g_{x,\beta}(u)| \leq L\|u - x\|^\beta \qquad (2)$$

where
$$g_{x,\beta}(u) = \sum_{|s| \leq \beta - 1} \frac{(u - x)^s}{s!}\, D^s g(x). \qquad (3)$$

In the common case of β = 2, this means that

$$\big| g(u) - [g(x) + (u - x)^T \nabla g(x)] \big| \leq L\|x - u\|^2.$$

The Sobolev class S1 (β, L) is the set of β times differentiable functions (technically, it
only requires weak derivatives) g : R → R such that

$$\int \big( g^{(\beta)}(x) \big)^2\, dx \leq L^2.$$

Again this extends naturally to Rd . Also, there is an extension to non-integer β. It can be


shown that Hd (β, L) ⊂ Sd (β, L).

2 k-nearest-neighbors regression
Here’s a basic method to start us off: k-nearest-neighbors regression. We fix an integer
k ≥ 1 and define

$$\hat m(x) = \frac{1}{k} \sum_{i \in N_k(x)} Y_i, \qquad (4)$$

where Nk (x) contains the indices of the k closest points of X1 , . . . , Xn to x.
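Here is a minimal Python sketch of the estimator in (4); the Euclidean metric and the toy data are our own illustrative choices, not part of the notes.

```python
import numpy as np

def knn_regress(x, X, Y, k):
    # X: (n, d) inputs, Y: (n,) responses, x: (d,) query point
    dist = np.linalg.norm(X - x, axis=1)
    idx = np.argsort(dist)[:k]          # indices of the k nearest points, N_k(x)
    return Y[idx].mean()

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
Y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.3, 200)
print(knn_regress(np.array([0.5]), X, Y, k=10))
```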

This is not at all a bad estimator, and you will find it used in lots of applications, in
many cases probably because of its simplicity. By varying the number of neighbors k, we can
achieve a wide range of flexibility in the estimated function m̂, with small k corresponding
to a more flexible fit, and large k less flexible.
But it does have its limitations, an apparent one being that the fitted function m̂
essentially always looks jagged, especially for small or moderate k. Why is this? It helps to
write

$$\hat m(x) = \sum_{i=1}^n w_i(x)\, Y_i, \qquad (5)$$

where the weights w_i(x), i = 1, . . . , n are defined as

$$w_i(x) = \begin{cases} 1/k & \text{if } X_i \text{ is one of the } k \text{ nearest points to } x \\ 0 & \text{else.} \end{cases}$$

Note that w_i(x) is discontinuous as a function of x, and therefore so is m̂(x).


The representation (5) also reveals that the k-nearest-neighbors estimate is in a class of
estimates we call linear smoothers, i.e., writing Y = (Y1 , . . . , Yn ) ∈ Rn , the vector of fitted
values

$$\hat\mu = (\hat m(X_1), \ldots, \hat m(X_n)) \in \mathbb{R}^n$$

can simply be expressed as μ̂ = SY. (To be clear, this means that for fixed inputs X_1, . . . , X_n,
the vector of fitted values μ̂ is a linear function of Y; it does not mean that m̂(x) need behave
linearly as a function of x.) This class is quite large, and contains many popular estimators,
as we'll see in the coming sections.
The k-nearest-neighbors estimator is universally consistent, which means E‖m̂ − m_0‖₂² → 0
as n → ∞, with no assumptions other than E(Y²) < ∞, provided that we take k = k_n such
that k_n → ∞ and k_n/n → 0; e.g., k = √n will do. See Chapter 6.2 of Gyorfi et al. (2002).
Furthermore, assuming the underlying regression function m0 is Lipschitz continuous,
the k-nearest-neighbors estimate with k ≍ n^{2/(2+d)} satisfies

$$E\|\hat m - m_0\|_2^2 \lesssim n^{-2/(2+d)}. \qquad (6)$$

See Chapter 6.3 of Gyorfi et al. (2002). Later, we will see that this is optimal.
Proof sketch: assume that Var(Y |X = x) = σ 2 , a constant, for simplicity, and fix
(condition on) the training points. Using the bias-variance tradeoff,

$$E\big(\hat m(x) - m_0(x)\big)^2 = \underbrace{\big(E[\hat m(x)] - m_0(x)\big)^2}_{\mathrm{Bias}^2(\hat m(x))} + \underbrace{E\big(\hat m(x) - E[\hat m(x)]\big)^2}_{\mathrm{Var}(\hat m(x))}$$

$$= \bigg(\frac{1}{k}\sum_{i \in N_k(x)} \big(m_0(X_i) - m_0(x)\big)\bigg)^2 + \frac{\sigma^2}{k}$$

$$\leq \bigg(\frac{L}{k}\sum_{i \in N_k(x)} \|X_i - x\|_2\bigg)^2 + \frac{\sigma^2}{k}.$$

In the last line we used the Lipschitz property |m_0(x) − m_0(z)| ≤ L‖x − z‖₂, for some
constant L > 0. Now for "most" of the points we'll have ‖X_i − x‖₂ ≤ C(k/n)^{1/d}, for a

Figure 1: The curse of dimensionality, with ε = 0.1.

constant C > 0. (Think of having input points X_i, i = 1, . . . , n spaced equally over (say)
[0, 1]^d.) Then our bias-variance upper bound becomes

$$(CL)^2 \Big(\frac{k}{n}\Big)^{2/d} + \frac{\sigma^2}{k}.$$

We can minimize this by balancing the two terms so that they are equal, giving k^{1+2/d} ≍ n^{2/d},
i.e., k ≍ n^{2/(2+d)} as claimed. Plugging this in gives the error bound of n^{−2/(2+d)}, as claimed.

2.1 Curse of dimensionality


Note that the above error rate n^{−2/(2+d)} exhibits a very poor dependence on the dimension
d. To see it differently: given a small ε > 0, think about how large we need to make n to
ensure that n^{−2/(2+d)} ≤ ε. Rearranged, this says n ≥ ε^{−(2+d)/2}. That is, as we increase d,
we require exponentially more samples n to achieve an error bound of ε. See Figure 1 for
an illustration with ε = 0.1.
In fact, this phenomenon is not specific to k-nearest-neighbors, but a reflection of the
curse of dimensionality, the principle that estimation becomes exponentially harder as the
number of dimensions increases. This is made precise by minimax theory: we cannot hope
to do better than the rate in (6) over H_d(1, L), which we write for the space of L-Lipschitz
functions in d dimensions, for a constant L > 0. It can be shown that

$$\inf_{\hat m} \ \sup_{m_0 \in H_d(1, L)} E\|\hat m - m_0\|_2^2 \ \gtrsim \ n^{-2/(2+d)}, \qquad (7)$$

where the infimum above is over all estimators m̂. See Chapter 3.2 of Gyorfi et al. (2002).
So why can we sometimes predict well in high dimensional problems? Presumably, it is
because m0 often (approximately) satisfies stronger assumptions. This suggests we should

look at classes of functions with more structure. One such example is the additive model,
covered later in the notes.

3 Kernel Smoothing and Local Polynomials


3.1 Kernel smoothing
Kernel regression or kernel smoothing begins with a kernel function K : R → R, satisfying
$$\int K(t)\, dt = 1, \qquad \int t\,K(t)\, dt = 0, \qquad 0 < \int t^2 K(t)\, dt < \infty.$$

Three common examples are the box-car kernel:

$$K(t) = \begin{cases} 1 & |t| \leq 1/2 \\ 0 & \text{otherwise,} \end{cases}$$

the Gaussian kernel:

$$K(t) = \frac{1}{\sqrt{2\pi}} \exp(-t^2/2),$$

and the Epanechnikov kernel:

$$K(t) = \begin{cases} \frac{3}{4}(1 - t^2) & \text{if } |t| \leq 1 \\ 0 & \text{else.} \end{cases}$$

Warning! Don’t confuse this with the notion of kernels in RKHS methods
which we cover later.
Given a bandwidth h > 0, the (Nadaraya-Watson) kernel regression estimate is defined
as

$$\hat m(x) = \frac{\sum_{i=1}^n K\Big(\frac{\|x - X_i\|_2}{h}\Big)\, Y_i}{\sum_{i=1}^n K\Big(\frac{\|x - X_i\|_2}{h}\Big)} = \sum_{i=1}^n w_i(x)\, Y_i \qquad (8)$$

where $w_i(x) = K(\|x - X_i\|_2/h) \big/ \sum_{j=1}^n K(\|x - X_j\|_2/h)$. Hence kernel smoothing is also a
linear smoother.
In comparison to the k-nearest-neighbors estimator in (4), which can be thought of as
a raw (discontinuous) moving average of nearby responses, the kernel estimator in (8) is a
smooth moving average of responses. See Figure 2 for an example with d = 1.
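A minimal Python sketch of the Nadaraya-Watson estimator (8) with a Gaussian kernel; the bandwidth and the simulated data are arbitrary choices for illustration.

```python
import numpy as np

def nw_regress(x, X, Y, h):
    # Nadaraya-Watson estimate at a single query point x (1-d inputs here)
    w = np.exp(-0.5 * ((x - X) / h) ** 2)   # Gaussian kernel weights
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 200)
Y = np.sin(4 * X) + rng.normal(0, 0.3, 200)
grid = np.linspace(0, 1, 5)
print([round(nw_regress(x0, X, Y, h=0.1), 3) for x0 in grid])
```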

3.2 Error Analysis

The kernel smoothing estimator is universally consistent (E‖m̂ − m_0‖₂² → 0 as n → ∞, with
no assumptions other than E(Y²) < ∞), provided we take a compactly supported kernel
K, and bandwidth h = h_n satisfying h_n → 0 and n h_n^d → ∞ as n → ∞. See Chapter 5.2 of
Gyorfi et al. (2002). We can say more.

Figure 2: Comparing k-nearest-neighbor and Epanechnikov kernels, when d = 1. From
Chapter 6 of Hastie et al. (2009).

Theorem. Suppose that d = 1 and that m'' is bounded. Also suppose that X has a
non-zero, differentiable density p and that the support is unbounded. Then, the risk is

$$R_n = \frac{h_n^4}{4} \left( \int x^2 K(x)\, dx \right)^2 \int \left( m''(x) + 2 m'(x)\, \frac{p'(x)}{p(x)} \right)^2 dx \ + \ \frac{\sigma^2 \int K^2(x)\, dx}{n h_n} \int \frac{dx}{p(x)} \ + \ o\left(\frac{1}{n h_n}\right) + o(h_n^4)$$

where p is the density of P_X.

The first term is the squared bias. The dependence on p and p' is the design bias and
is undesirable. We'll fix this problem later using local linear smoothing. It follows that the
optimal bandwidth is h_n ≈ n^{−1/5}, yielding a risk of n^{−4/5}. In d dimensions, the term n h_n
becomes n h_n^d. In that case the optimal bandwidth is h_n ≈ n^{−1/(4+d)}, yielding a risk of
n^{−4/(4+d)}.

If the support has boundaries then there is bias of order O(h) near the boundary.
This happens because of the asymmetry of the kernel weights in such regions. See Figure 3.
Specifically, the bias is of order O(h²) in the interior but is of order O(h) near the
boundaries. The risk then becomes O(h³) instead of O(h⁴). We'll fix this problem using
local linear smoothing. Also, the result above depends on assuming that P_X has a density.
We can drop that assumption (and allow for boundaries) and get a slightly weaker result
due to Gyorfi, Kohler, Krzyzak and Walk (2002).

For simplicity, we will use the spherical kernel K(‖x‖) = I(‖x‖ ≤ 1); the results can be
extended to other kernels. Hence,

$$\hat m(x) = \frac{\sum_{i=1}^n Y_i\, I(\|X_i - x\| \leq h)}{\sum_{i=1}^n I(\|X_i - x\| \leq h)} = \frac{\sum_{i=1}^n Y_i\, I(\|X_i - x\| \leq h)}{n P_n(B(x, h))}$$

where P_n is the empirical measure and B(x, h) = {u : ‖x − u‖ ≤ h}. If the denominator
is 0 we define m̂(x) = 0. The proof of the following theorem is from Chapter 5 of Györfi,
Kohler, Krzyżak and Walk (2002).
Theorem: Risk bound without density. Suppose that the distribution of X has
compact support and that Var(Y |X = x) ≤ σ 2 < ∞ for all x. Then
$$\sup_{P \in H_d(1,L)} E\|\hat m - m\|_P^2 \ \leq \ c_1 h^2 + \frac{c_2}{n h^d}. \qquad (9)$$

Hence, if h ≍ n^{−1/(d+2)} then

$$\sup_{P \in H_d(1,L)} E\|\hat m - m\|_P^2 \ \leq \ \frac{c}{n^{2/(d+2)}}. \qquad (10)$$

The proof is in the appendix. Note that the rate n−2/(d+2) is slower than the pointwise
rate n−4/(d+2) because we have made weaker assumptions.
Recall from (7) we saw that this was the minimax optimal rate over Hd (1, L). More
generally, the minimax rate over Hd (α, L), for a constant L > 0, is

$$\inf_{\hat m} \ \sup_{m_0 \in H_d(\alpha, L)} E\|\hat m - m_0\|_2^2 \ \gtrsim \ n^{-2\alpha/(2\alpha+d)}, \qquad (11)$$

see again Chapter 3.2 of Gyorfi et al. (2002). However, as we saw above, with extra condi-
tions, we got the rate n−4/(4+d) which is minimax for Hd (2, L). We’ll get that rate under
weaker conditions with local linear regression.
If the support of the distribution of X lives on a smooth manifold of dimension r < d
then the term

$$\int \frac{dP(x)}{n P(B(x, h))}$$

is of order 1/(n h^r) instead of 1/(n h^d). In that case, we get the improved rate n^{−2/(r+2)}.

3.3 Local Linear Regression


We can alleviate this boundary bias issue by moving from a local constant fit to a local
linear fit, or a local polynomial fit.
To build intuition, another way to view the kernel estimator in (8) is the following: at
each input x, define the estimate m̂(x) = θ̂_x, where θ̂_x is the minimizer of

$$\sum_{i=1}^n K\Big(\frac{\|x - X_i\|}{h}\Big)\, (Y_i - \theta)^2$$

over all θ ∈ R. Instead, we could consider forming the local estimate
m̂(x) = α̂_x + β̂_x^T x, where α̂_x, β̂_x minimize

$$\sum_{i=1}^n K\Big(\frac{\|x - X_i\|}{h}\Big)\, (Y_i - \alpha - \beta^T X_i)^2$$

over all α ∈ R, β ∈ R^d. This is called local linear regression.
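Here is a minimal Python sketch of local linear regression at a single query point, solving the weighted least squares problem above with a Gaussian kernel; the bandwidth and the simulated data are our own illustrative choices, not from the notes.

```python
import numpy as np

def local_linear(x, X, Y, h):
    # X: (n, d) inputs, Y: (n,) responses, x: (d,) query point
    w = np.exp(-0.5 * (np.linalg.norm(X - x, axis=1) / h) ** 2)  # kernel weights
    B = np.hstack([np.ones((len(X), 1)), X])                     # rows b(X_i) = (1, X_i)
    W = np.diag(w)
    coef = np.linalg.solve(B.T @ W @ B, B.T @ W @ Y)             # (alpha_hat, beta_hat)
    return np.concatenate([[1.0], x]) @ coef                     # b(x)^T (B^T W B)^{-1} B^T W Y

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
Y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.3, 200)
print(local_linear(np.array([0.02]), X, Y, h=0.1))  # evaluation near the boundary
```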

Figure 3: Comparing (Nadaraya-Watson) kernel smoothing to local linear regression; the
former is biased at the boundary, the latter is unbiased (to first-order). From Chapter 6 of
Hastie et al. (2009).

We can rewrite the local linear regression estimate m̂(x). This is just given by a weighted
least squares fit, so

$$\hat m(x) = b(x)^T (B^T \Omega B)^{-1} B^T \Omega Y,$$

where b(x) = (1, x) ∈ R^{d+1}, B ∈ R^{n×(d+1)} with ith row b(X_i), and Ω ∈ R^{n×n} is diagonal
with ith diagonal element K(‖x − X_i‖₂/h). We can write this more concisely as m̂(x) = w(x)^T Y,
where w(x) = ΩB(B^T ΩB)^{−1} b(x), which shows that local linear regression is a linear smoother
too.

The vector of fitted values μ̂ = (m̂(x_1), . . . , m̂(x_n)) can be expressed as

$$\hat\mu = \begin{pmatrix} w_1(x)^T Y \\ \vdots \\ w_n(x)^T Y \end{pmatrix} = B (B^T \Omega B)^{-1} B^T \Omega Y = S Y,$$

which should look familiar to you from weighted least squares.

Now we'll sketch how the local linear fit reduces the bias, fixing (conditioning on) the
training points. Compute at a fixed point x,

$$E[\hat m(x)] = \sum_{i=1}^n w_i(x)\, m_0(X_i).$$

Using a Taylor expansion of m_0 about x,

$$E[\hat m(x)] = m_0(x) \sum_{i=1}^n w_i(x) + \nabla m_0(x)^T \sum_{i=1}^n (X_i - x)\, w_i(x) + R,$$

where the remainder term R contains quadratic and higher-order terms, and under regular-
ity conditions, is small. One can check that in fact for the local linear regression estimator m̂,

$$\sum_{i=1}^n w_i(x) = 1 \quad \text{and} \quad \sum_{i=1}^n (X_i - x)\, w_i(x) = 0,$$

and so E[m̂(x)] = m_0(x) + R, which means that m̂ is unbiased to first-order.
It can be shown that local linear regression removes boundary bias and design bias.

Theorem. Under some regularity conditions, the risk of m̂ is

$$\frac{h_n^4}{4} \int \mathrm{tr}\Big( m''(x) \int K(u)\, u u^T du \Big)^2 dP(x) \ + \ \frac{1}{n h_n^d} \int K^2(u)\, du \int \sigma^2(x)\, dP(x) \ + \ o\big(h_n^4 + (n h_n^d)^{-1}\big).$$

For a proof, see Fan & Gijbels (1996). For points near the boundary, the bias is
C h² m''(x) + o(h²) whereas the bias is C h m'(x) + o(h) for kernel estimators.

In fact, Fan (1993) shows a rather remarkable result. Let R_n be the minimax risk for
estimating m(x_0) over the class of functions with bounded second derivatives in a neighbor-
hood of x_0. Then the maximum risk r_n of the local linear estimator with optimal bandwidth
satisfies

$$1 + o(1) \ \geq \ \frac{R_n}{r_n} \ \geq \ (0.896)^2 + o(1).$$

Moreover, if we compute the minimax risk over all linear estimators we get R_n/r_n → 1.

3.4 Higher-order smoothness


How can we hope to get optimal error rates over H_d(α, L), when α ≥ 2? With kernels there
are basically two options: use local polynomials, or use higher-order kernels.
Local polynomials build on our previous idea of local linear regression (itself an extension
of kernel smoothing). Consider d = 1, for concreteness. Define $\hat m(x) = \hat\beta_{x,0} + \sum_{j=1}^k \hat\beta_{x,j}\, x^j$,
where β̂_{x,0}, . . . , β̂_{x,k} minimize

$$\sum_{i=1}^n K\Big(\frac{|x - X_i|}{h}\Big) \Big( Y_i - \beta_0 - \sum_{j=1}^k \beta_j X_i^j \Big)^2$$

over all β_0, β_1, . . . , β_k ∈ R. This is called (kth-order) local polynomial regression.


Again we can express

$$\hat m(x) = b(x)^T (B^T \Omega B)^{-1} B^T \Omega y = w(x)^T y,$$

where b(x) = (1, x, . . . , x^k), B is an n × (k + 1) matrix with ith row b(X_i) = (1, X_i, . . . , X_i^k),
and Ω is as before. Hence again, local polynomial regression is a linear smoother.
Assuming that m0 ∈ H1 (α, L) for a constant L > 0, a Taylor expansion shows that the
local polynomial estimator m̂ of order k, where k is the largest integer strictly less than α,
and where the bandwidth scales as h ≍ n^{−1/(2α+1)}, satisfies

$$E\|\hat m - m_0\|_2^2 \lesssim n^{-2\alpha/(2\alpha+1)}.$$


Figure 4: A higher-order kernel function: specifically, a kernel of order 4

See Chapter 1.6.1 of Tsybakov (2009). This matches the lower bound in (11) (when d = 1)
In multiple dimensions, d > 1, local polynomials become kind of tricky to fit, because of
the explosion in terms of the number of parameters we need to represent a kth order poly-
nomial in d variables. Hence, an interesting alternative is to return back kernel smoothing
but use a higher-order kernel. A kernel function K is said to be of order k provided that
$$\int K(t)\, dt = 1, \qquad \int t^j K(t)\, dt = 0, \ \ j = 1, \ldots, k-1, \qquad \text{and} \qquad 0 < \int t^k K(t)\, dt < \infty.$$

This means that the kernels we were looking at so far were of order 2.
An example of a 4th-order kernel is K(t) = (3/8)(3 − 5t²) 1{|t| ≤ 1}, plotted in Figure 4.
Notice that it takes negative values.
Lastly, while local polynomial regression and higher-order kernel smoothing can help
“track” the derivatives of smooth functions m0 ∈ Hd (α, L), α ≥ 2, it should be noted that
they don’t share the same universal consistency property of kernel smoothing (or k-nearest-
neighbors). See Chapters 5.3 and 5.4 of Gyorfi et al. (2002)

4 Splines
Suppose that d = 1. Define an estimator by

$$\hat m = \operatorname*{argmin}_{m} \ \sum_{i=1}^n \big( Y_i - m(X_i) \big)^2 + \lambda \int_0^1 m''(x)^2\, dx. \qquad (12)$$

Spline Lemma. The minimizer of (12) is a cubic spline with knots at the data points.
(Proof in the Appendix.)

The key result presented above tells us that we can choose a basis η1 , . . . , ηn for the set
of kth-order natural splines with knots over x1 , . . . , xn , and reparametrize the problem as
$$\hat\beta = \operatorname*{argmin}_{\beta \in \mathbb{R}^n} \ \sum_{i=1}^n \Big( Y_i - \sum_{j=1}^n \beta_j \eta_j(X_i) \Big)^2 + \lambda \int_0^1 \Big( \sum_{j=1}^n \beta_j \eta_j''(x) \Big)^2 dx. \qquad (13)$$

This is a finite-dimensional problem, and after we compute the coefficients β̂ ∈ R^n, we know
that the smoothing spline estimate is simply $\hat m(x) = \sum_{j=1}^n \hat\beta_j \eta_j(x)$.
Defining the basis matrix and penalty matrix N, Ω ∈ R^{n×n} by

$$N_{ij} = \eta_j(X_i) \quad \text{and} \quad \Omega_{ij} = \int_0^1 \eta_i''(x)\, \eta_j''(x)\, dx \quad \text{for } i, j = 1, \ldots, n, \qquad (14)$$

the problem in (13) can be written more succinctly as

$$\hat\beta = \operatorname*{argmin}_{\beta \in \mathbb{R}^n} \ \|Y - N\beta\|_2^2 + \lambda\, \beta^T \Omega \beta, \qquad (15)$$

showing the smoothing spline problem to be a type of generalized ridge regression problem.
In fact, the solution in (15) has the explicit form

$$\hat\beta = (N^T N + \lambda\Omega)^{-1} N^T Y,$$

and therefore the fitted values μ̂ = (m̂(x_1), . . . , m̂(x_n)) are

$$\hat\mu = N (N^T N + \lambda\Omega)^{-1} N^T Y \equiv S Y. \qquad (16)$$

Therefore, once again, smoothing splines are a type of linear smoother.
A special property of smoothing splines: the fitted values in (16) can be computed
in O(n) operations. This is achieved by forming N from the B-spline basis (for natural
splines), and in this case the matrix N^T N + λΩ ends up being banded (with a bandwidth
that only depends on the polynomial order k). In practice, smoothing spline computations
are extremely fast.
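To illustrate the generalized ridge structure of (14)-(16), here is a small Python sketch (our own, not from the notes) that builds a crude basis and penalty matrix numerically and computes the fitted values. It uses a truncated-power cubic basis with numerical integration rather than the natural-spline/B-spline machinery described above, so it only shows the linear-algebra structure, not the fast banded computation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(4 * x) + rng.normal(0, 0.3, n)

# Truncated power cubic basis with knots at the data points (a stand-in for the
# natural spline basis eta_1, ..., eta_n in the text).
def basis(t):
    t = np.atleast_1d(t)
    cols = [np.ones_like(t), t, t**2, t**3] + [np.clip(t - k, 0, None)**3 for k in x]
    return np.column_stack(cols)

grid = np.linspace(0, 1, 2001)
G = basis(grid)
# second derivatives of each basis function on the grid, by finite differences
G2 = np.gradient(np.gradient(G, grid, axis=0), grid, axis=0)
Omega = (G2.T @ G2) * (grid[1] - grid[0])      # Omega_ij ~ int eta_i'' eta_j''
N = basis(x)                                   # N_ij = eta_j(x_i)

lam = 1e-4
beta = np.linalg.solve(N.T @ N + lam * Omega, N.T @ y)
fitted = N @ beta                              # mu_hat = N (N'N + lam*Omega)^{-1} N'y
print(np.round(fitted[:5], 3))
```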

4.1 Error rates


Recall the Sobolev class of functions S_1(m, C): for an integer m ≥ 0 and C > 0, it contains
all m times differentiable functions f : R → R such that

$$\int \big( f^{(m)}(x) \big)^2 dx \leq C^2.$$

(The Sobolev class Sd (m, C) in d dimensions can be defined similarly, where we sum over
all partial derivatives of order m.)
Assuming m0 ∈ S1 (m, C) for the underlying regression function, where C > 0 is a
constant, the smoothing spline estimator mb of polynomial order k = 2m − 1 with tuning
parameter λ ≍ n^{1/(2m+1)} ≍ n^{1/(k+2)} satisfies

$$\|\hat m - m_0\|_n^2 \lesssim n^{-2m/(2m+1)} \quad \text{in probability.}$$
The proof of this result uses much more fancy techniques from empirical process theory
(entropy numbers) than the proofs for kernel smoothing. See Chapter 10.1 of van de Geer
(2000) This rate is seen to be minimax optimal over S1 (m, C) (e.g., Nussbaum (1985)).

5 Mercer kernels, RKHS
5.1 Hilbert Spaces
A Hilbert space is a complete inner product space. We will see that a reproducing kernel
Hilbert space (RKHS) is a Hilbert space with extra structure that makes it very useful for
statistics and machine learning.
An example of a Hilbert space is

$$L_2[0, 1] = \Big\{ f : [0, 1] \to \mathbb{R} : \int f^2 < \infty \Big\}$$

endowed with the inner product

$$\langle f, g \rangle = \int f(x)\, g(x)\, dx.$$

The corresponding norm is

$$\|f\| = \sqrt{\langle f, f \rangle} = \sqrt{\int f^2(x)\, dx}.$$

We write fn → f to mean that ||fn − f || → 0 as n → ∞.

5.2 Evaluation Functional


The evaluation functional δx assigns a real number to each function. It is defined by
δx f = f (x). In general, the evaluation functional is not continuous. This means we can
have f_n → f but δ_x f_n does not converge to δ_x f. For example, let f(x) = 0 and f_n(x) =
√n I(x < 1/n²). Then ‖f_n − f‖ = 1/√n → 0. But δ_0 f_n = √n, which does not converge to
δ_0 f = 0. Intuitively, this is because Hilbert spaces can contain very unsmooth functions.
We shall see that RKHS are Hilbert spaces where the evaluation functional is continuous.
Intuitively, this means that the functions in the space are well-behaved.

5.3 Nonparametric Regression


We observe (X1 , Y1 ), . . . , (Xn , Yn ) and we want to estimate m(x) = E(Y |X = x). The
approach we used earlier was based on smoothing kernels:

$$\hat m(x) = \frac{\sum_i Y_i\, K\big(\frac{\|x - X_i\|}{h}\big)}{\sum_i K\big(\frac{\|x - X_i\|}{h}\big)}.$$

Another approach is regularization: choose m to minimize

$$\sum_i (Y_i - m(X_i))^2 + \lambda J(m)$$

for some penalty J. This is equivalent to: choose m ∈ M to minimize $\sum_i (Y_i - m(X_i))^2$,
where M = {m : J(m) ≤ L} for some L > 0.
We would like to construct M so that it contains smooth functions. We shall see that
a good choice is to use a RKHS.

5.4 Mercer Kernels
A RKHS is defined by a Mercer kernel. A Mercer kernel K(x, y) is a function of two
variables that is symmetric and positive definite. This means that, for any function f ,
Z Z
K(x, y)f (x)f (y)dx dy ≥ 0.

(This is like the definition of a positive definite matrix: xT Ax ≥ 0 for each x.)
Our main example is the Gaussian kernel
$$K(x, y) = e^{-\frac{\|x - y\|^2}{\sigma^2}}.$$

Given a kernel K, let Kx (·) be the function obtained by fixing the first coordinate. That
is, Kx (y) = K(x, y). For the Gaussian kernel, Kx is a Normal, centered at x. We can create
functions by taking linear combinations of the kernel:

$$f(x) = \sum_{j=1}^k \alpha_j K_{x_j}(x).$$

Let H_0 denote all such functions:

$$\mathcal{H}_0 = \Big\{ f : f(x) = \sum_{j=1}^k \alpha_j K_{x_j}(x) \Big\}.$$

Given two such functions $f(x) = \sum_{j=1}^k \alpha_j K_{x_j}(x)$ and $g(x) = \sum_{j=1}^m \beta_j K_{y_j}(x)$ we define an
inner product

$$\langle f, g \rangle = \langle f, g \rangle_K = \sum_i \sum_j \alpha_i \beta_j K(x_i, y_j).$$

In general, f (and g) might be representable in more than one way. You can check that
⟨f, g⟩_K is independent of how f (or g) is represented. The inner product defines a norm:

$$\|f\|_K = \sqrt{\langle f, f \rangle} = \sqrt{\sum_j \sum_k \alpha_j \alpha_k K(x_j, x_k)} = \sqrt{\alpha^T K \alpha}$$

where α = (α1 , . . . , αk )T and K is the k × k matrix with Kjk = K(xj , xk ).

5.5 The Reproducing Property


Let $f(x) = \sum_i \alpha_i K_{x_i}(x)$. Note the following crucial property:

$$\langle f, K_x \rangle = \sum_i \alpha_i K(x_i, x) = f(x).$$

This follows from the definition of hf, gi where we take g = Kx . This implies that

$$\langle K_x, K_y \rangle = K(x, y).$$

This is called the reproducing property. It also implies that Kx is the representer of the
evaluation functional.
The completion of H0 with respect to || · ||K is denoted by HK and is called
the RKHS generated by K.
To verify that this is a well-defined Hilbert space, you should check that the following
properties hold:

⟨f, g⟩ = ⟨g, f⟩
⟨cf + dg, h⟩ = c⟨f, h⟩ + d⟨g, h⟩
⟨f, f⟩ = 0 iff f = 0.

The last one is not obvious so let us verify it here. It is easy to see that f = 0 implies
that hf, f i = 0. Now we must show that hf, f i = 0 implies that f (x) = 0. So suppose that
hf, f i = 0. Pick any x. Then

$$0 \leq f^2(x) = \langle f, K_x \rangle^2 = \langle f, K_x \rangle\, \langle f, K_x \rangle \leq \|f\|^2\, \|K_x\|^2 = \langle f, f \rangle\, \|K_x\|^2 = 0$$

where we used Cauchy–Schwarz. So 0 ≤ f²(x) ≤ 0, which means that f(x) = 0.


Returning to the evaluation functional, suppose that fn → f . Then

$$\delta_x f_n = \langle f_n, K_x \rangle \to \langle f, K_x \rangle = f(x) = \delta_x f$$

so the evaluation functional is continuous. In fact, a Hilbert space is a RKHS if and


only if the evaluation functionals are continuous.

5.6 Examples
Example 1. Let H be all functions f on R such that the support of the Fourier transform
of f is contained in [−a, a]. Then

$$K(x, y) = \frac{\sin(a(y - x))}{a(y - x)}$$

and

$$\langle f, g \rangle = \int f g.$$

Example 2. Let H be all functions f on (0, 1) such that


$$\int_0^1 \big( f^2(x) + (f'(x))^2 \big)\, x^2\, dx < \infty.$$

Then

$$K(x, y) = (xy)^{-1}\big( e^{-x}\sinh(y)\, I(0 < x \leq y) + e^{-y}\sinh(x)\, I(0 < y \leq x) \big)$$

and

$$\|f\|^2 = \int_0^1 \big( f^2(x) + (f'(x))^2 \big)\, x^2\, dx.$$

Example 3. The Sobolev space of order m is (roughly speaking) the set of functions f
such that $\int (f^{(m)})^2 < \infty$. For m = 1 and X = [0, 1] the kernel is

$$K(x, y) = \begin{cases} 1 + xy + \frac{x y^2}{2} - \frac{y^3}{6} & 0 \leq y \leq x \leq 1 \\[4pt] 1 + xy + \frac{y x^2}{2} - \frac{x^3}{6} & 0 \leq x \leq y \leq 1 \end{cases}$$

and

$$\|f\|_K^2 = f^2(0) + f'(0)^2 + \int_0^1 (f''(x))^2\, dx.$$

5.7 Spectral Representation


Suppose that sup_{x,y} K(x, y) < ∞. Define eigenvalues λ_j and orthonormal eigenfunctions
ψ_j by

$$\int K(x, y)\, \psi_j(y)\, dy = \lambda_j \psi_j(x).$$

Then $\sum_j \lambda_j < \infty$ and sup_x |ψ_j(x)| < ∞. Also,

$$K(x, y) = \sum_{j=1}^\infty \lambda_j \psi_j(x)\, \psi_j(y).$$

Define the feature map Φ by

$$\Phi(x) = (\sqrt{\lambda_1}\, \psi_1(x), \sqrt{\lambda_2}\, \psi_2(x), \ldots).$$

We can expand f either in terms of K or in terms of the basis ψ_1, ψ_2, . . .:

$$f(x) = \sum_i \alpha_i K(x_i, x) = \sum_{j=1}^\infty \beta_j \psi_j(x).$$

Furthermore, if $f(x) = \sum_j a_j \psi_j(x)$ and $g(x) = \sum_j b_j \psi_j(x)$, then

$$\langle f, g \rangle = \sum_{j=1}^\infty \frac{a_j b_j}{\lambda_j}.$$

Roughly speaking, when ||f ||K is small, then f is smooth.

5.8 Representer Theorem


Let ℓ be a loss function depending on (X_1, Y_1), . . . , (X_n, Y_n) and on f(X_1), . . . , f(X_n). Let
f̂ minimize

$$\ell + g(\|f\|_K^2)$$

where g is any monotone increasing function. Then f̂ has the form

$$\hat f(x) = \sum_{i=1}^n \alpha_i K(x_i, x)$$

for some α_1, . . . , α_n.

5.9 RKHS Regression

Define m̂ to minimize

$$R = \sum_i (Y_i - m(X_i))^2 + \lambda \|m\|_K^2.$$

By the representer theorem, $\hat m(x) = \sum_{i=1}^n \alpha_i K(x_i, x)$. Plug this into R and we get

$$R = \|Y - K\alpha\|^2 + \lambda\, \alpha^T K \alpha$$

where K_{jk} = K(X_j, X_k) is the Gram matrix. The minimizer over α is

$$\hat\alpha = (K + \lambda I)^{-1} Y$$

and $\hat m(x) = \sum_j \hat\alpha_j K(X_j, x)$. The fitted values are

$$\hat Y = K\hat\alpha = K(K + \lambda I)^{-1} Y = L Y.$$
So this is a linear smoother.
We can use cross-validation to choose λ. Compare this with smoothing kernel
regression.
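A minimal Python sketch of this RKHS (kernel ridge) regression estimator with a Gaussian kernel; the kernel width σ, the penalty λ, and the simulated data are our own illustrative choices.

```python
import numpy as np

def gauss_kernel(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
Y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.3, 100)

sigma, lam = 0.2, 0.1
K = gauss_kernel(X, X, sigma)                         # Gram matrix K_jk = K(X_j, X_k)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), Y)  # alpha_hat = (K + lam I)^{-1} Y

def m_hat(x_new):
    k = gauss_kernel(np.atleast_2d(x_new), X, sigma)  # kernel evaluations at x_new
    return (k @ alpha).item()

print(m_hat([0.5]))
```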

5.10 Logistic Regression

Let

$$m(x) = P(Y = 1 \mid X = x) = \frac{e^{f(x)}}{1 + e^{f(x)}}.$$

We can estimate m by minimizing

$$-\text{loglikelihood} + \lambda \|f\|_K^2.$$

Then $\hat f(x) = \sum_j \alpha_j K(x_j, x)$ and α may be found by numerical optimization. In this case,
smoothing kernels are much easier.

5.11 Support Vector Machines


Suppose Y_i ∈ {−1, +1}. Recall that the linear SVM minimizes the penalized hinge loss:

$$J = \sum_i [1 - Y_i(\beta_0 + \beta^T X_i)]_+ + \frac{\lambda}{2}\|\beta\|_2^2.$$

The dual is to maximize

$$\sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j Y_i Y_j \langle X_i, X_j \rangle$$

subject to 0 ≤ α_i ≤ C.
The RKHS version is to minimize

$$J = \sum_i [1 - Y_i f(X_i)]_+ + \frac{\lambda}{2}\|f\|_K^2.$$

The dual is the same except that ⟨X_i, X_j⟩ is replaced with K(X_i, X_j). This is called the
kernel trick.

5.12 The Kernel Trick
This is a fairly general trick. In many algorithms you can replace hxi , xj i with K(xi , xj ) and
get a nonlinear version of the algorithm. This is equivalent to replacing x with Φ(x) and
replacing hxi , xj i with hΦ(xi ), Φ(xj )i. However, K(xi , xj ) = hΦ(xi ), Φ(xj )i and K(xi , xj ) is
much easier to compute.
In summary, by replacing hxi , xj i with K(xi , xj ) we turn a linear procedure into a
nonlinear procedure without adding much computation.

5.13 Hidden Tuning Parameters


There are hidden tuning parameters in the RKHS. Consider the Gaussian kernel
$$K(x, y) = e^{-\frac{\|x - y\|^2}{\sigma^2}}.$$

For nonparametric regression we minimize $\sum_i (Y_i - m(X_i))^2$ subject to ‖m‖_K ≤ L. We
control the bias variance tradeoff by doing cross-validation over L. But what about σ?
This parameter seems to get mostly ignored. Suppose we have a uniform distribution
on a circle. The eigenfunctions of K(x, y) are the sines and cosines. The eigenvalues λk die
off like (1/σ)2k . So σ affects the bias-variance tradeoff since it weights things towards lower
order Fourier functions. In principle we can compensate for this by varying L. But clearly
there is some interaction between L and σ. The practical effect is not well understood.
We’ll see this again when we discuss interpolation.
Now consider the polynomial kernel K(x, y) = (1 + hx, yi)d . This kernel has the same
eigenfunctions but the eigenvalues decay at a polynomial rate depending on d. So there is
an interaction between L, d and, the choice of kernel itself.

6 Linear smoothers
6.1 Definition
Every estimator we have discussed so far is a linear smoother, meaning that $\hat m(x) = \sum_i w_i(x) Y_i$ for some weights w_i(x) that do not depend on the Y_i's. Hence, the fitted values
μ̂ = (m̂(X_1), . . . , m̂(X_n)) are of the form μ̂ = SY for some matrix S ∈ R^{n×n} depending on
the inputs X1 , . . . , Xn —and also possibly on a tuning parameter such as h in kernel smooth-
ing, or λ in smoothing splines—but not on the Yi ’s. We call S, the smoothing matrix. For
comparison, recall that in linear regression, µ b = HY for some projection matrix H.
For linear smoothers μ̂ = SY, the effective degrees of freedom is defined to be

$$\nu \equiv \mathrm{df}(\hat\mu) \equiv \sum_{i=1}^n S_{ii} = \mathrm{tr}(S),$$

the trace of the smoothing matrix S.

6.2 Cross-validation
K-fold cross-validation can be used to estimate the prediction error and choose tuning
parameters.
For linear smoothers μ̂ = (m̂(x_1), . . . , m̂(x_n)) = SY, leave-one-out cross-validation can
be particularly appealing because in many cases we have the seemingly magical reduction

$$\mathrm{CV}(\hat m) = \frac{1}{n}\sum_{i=1}^n \big( Y_i - \hat m^{-i}(X_i) \big)^2 = \frac{1}{n}\sum_{i=1}^n \left( \frac{Y_i - \hat m(X_i)}{1 - S_{ii}} \right)^2, \qquad (17)$$

where m̂^{−i} denotes the estimated regression function that was trained on all but the ith pair
(Xi , Yi ). This leads to a big computational savings since it shows us that, to compute leave-
one-out cross-validation error, we don’t have to actually ever compute m b −i , i = 1, . . . , n.
Why does (17) hold, and for which linear smoothers µ b = Sy? Just rearranging (17)
perhaps demystifies this seemingly magical relationship and helps to answer these questions.
Suppose we knew that m̂ had the property

$$\hat m^{-i}(X_i) = \frac{1}{1 - S_{ii}}\big( \hat m(X_i) - S_{ii} Y_i \big). \qquad (18)$$

That is, to obtain the estimate at X_i under the function m̂^{−i} fit on all but (X_i, Y_i), we take
the sum of the linear weights (from our original fitted function m̂) across all but the ith
point, $\hat m(X_i) - S_{ii} Y_i = \sum_{j \neq i} S_{ij} Y_j$, and then renormalize so that these weights sum to 1.
This is not an unreasonable property; e.g., we can immediately convince ourselves that
it holds for kernel smoothing. A little calculation shows that it also holds for smoothing
splines (using the Sherman-Morrison update formula).
From the special property (18), it is easy to show the leave-one-out formula (17). We
have

$$Y_i - \hat m^{-i}(X_i) = Y_i - \frac{1}{1 - S_{ii}}\big( \hat m(X_i) - S_{ii} Y_i \big) = \frac{Y_i - \hat m(X_i)}{1 - S_{ii}},$$
and then squaring both sides and summing over n gives (17).
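The identity (17) is easy to check numerically. Below is a small Python sketch (our own, not from the notes) comparing brute-force leave-one-out CV with the shortcut formula for a Nadaraya-Watson smoother, whose smoothing matrix S is explicit; the bandwidth and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 80, 0.15
X = rng.uniform(0, 1, n)
Y = np.sin(4 * X) + rng.normal(0, 0.3, n)

# Smoothing matrix of the Nadaraya-Watson estimator: S_ij = w_j(X_i)
W = np.exp(-0.5 * ((X[:, None] - X[None, :]) / h) ** 2)
S = W / W.sum(axis=1, keepdims=True)
fitted = S @ Y

# shortcut formula from (17)
cv_shortcut = np.mean(((Y - fitted) / (1 - np.diag(S))) ** 2)

# brute-force leave-one-out
errs = []
for i in range(n):
    keep = np.arange(n) != i
    w = np.exp(-0.5 * ((X[i] - X[keep]) / h) ** 2)
    errs.append((Y[i] - np.sum(w * Y[keep]) / np.sum(w)) ** 2)
print(cv_shortcut, np.mean(errs))   # the two numbers agree
```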
Finally, generalized cross-validation is a small twist on the right-hand side in (17) that
gives an approximation to leave-one-out cross-validation error. It is defined by replacing
the appearances of the diagonal terms S_ii with the average diagonal term tr(S)/n,

$$\mathrm{GCV}(\hat m) = \frac{1}{n}\sum_{i=1}^n \left( \frac{Y_i - \hat m(X_i)}{1 - \mathrm{tr}(S)/n} \right)^2 = (1 - \nu/n)^{-2}\, \hat R,$$

where ν is the effective degrees of freedom and R̂ is the training error. This can be of
computational advantage in some cases where tr(S) is easier to compute than individual
elements S_ii.

7 Additive models
7.1 Motivation and definition
Computational efficiency and statistical efficiency are both very real concerns as the dimen-
sion d grows large, in nonparametric regression. If you’re trying to fit a kernel, thin-plate

20
spline, or RKHS estimate in > 20 dimensions, without any other kind of structural con-
straints, then you’ll probably be in trouble (unless you have a very fast computer and tons
of data).
Recall from (11) that the minimax rate over the Holder class Hd (α, L) is n−2α/(2α+d) ,
which has an exponentially bad dependence on the dimension d. This is usually called the
curse of dimensionality (though the term apparently originated with Bellman (1962), who
encountered an analogous issue but in a separate context—dynamic programming).
What can we do? One answer is to change what we’re looking for, and fit estimates
with less flexibility in high dimensions. Think of a linear model in d variables: there is a
big difference between this and a fully nonparametric model in d variables. Is there some
middle man that we can consider, that would make sense?
Additive models play the role of this middle man. Instead of considering a full d-
dimensional function of the form

m(x) = m(x(1), . . . , x(d)) (19)

we restrict our attention to functions of the form

m(x) = m1 (x(1)) + · · · + md (x(d)). (20)

As each function mj , j = 1, . . . , d is univariate, fitting an estimate of the form (20) is


certainly less ambitious than fitting one of the form (19). On the other hand, the scope
of (20) is still big enough that we can capture interesting (marginal) behavior in high
dimensions.
There is a link to naive-Bayes classification that we will discuss later.
The choice of estimator of the form (20) need not be regarded as an assumption we
make about the true function m0 , just like we don’t always assume that the true model is
linear when using linear regression. In many cases, we fit an additive model because we
think it may provide a useful approximation to the truth, and is able to scale well with the
number of dimensions d.
A classic result by Stone (1985) encapsulates this idea precisely. He shows that, while
it may be difficult to estimate an arbitrary regression function m0 in multiple dimensions,
we can still estimate its best additive approximation madd well. Assuming each component
function madd
0,j , j = 1, . . . , d lies in the Holder class H1 (α, L), for constant L > 0, and
we can use an additive model, with each component m b j , j = 1, . . . , d estimated using an
appropriate kth degree spline, to give

b j − madd
Ekm 2
j k2 . n
−2α/(2α+1)
, j = 1, . . . , d.
add
Hence each component of the best additive approximation f to m0 can be estimated
at the optimal univariate rate. Loosely speaking, though we cannot hope to recover m0
arbitrarily, we can recover its major structure along the coordinate axes.

7.2 Backfitting
Estimation with additive models is actually very simple; we can just choose our favorite
univariate smoother (i.e., nonparametric estimator), and cycle through estimating each

21
function mj , j = 1, . . . , d individually (like a block coordinate descent algorithm). Denote
the result of running our chosen univariate smoother to regress Y = (Y1 , . . . , Yn ) ∈ Rn over
the input points Z = (Z1 , . . . , Zn ) ∈ Rn as

b = Smooth(Z, Y ).
m

E.g., we might choose Smooth(·, ·) to be a cubic smoothing spline with some fixed value of
the tuning parameter λ, or even with the tuning parameter selected by generalized cross-
validation
Once our univariate smoother has been chosen, we initialize m b 1, . . . , m
b d (say, to all to
zero) and cycle over the following steps for j = 1, . . . , d, 1, . . . , d, . . .:
P
1. define ri = Yi − `6=j mb ` (xi` ), i = 1, . . . , n;

2. smooth mb j = Smooth(x(j), r);


P
bj = m
3. center m b j − n1 ni=1 m
b j (Xi (j)).

This algorithm is known as backfitting. In last step above, we are removing the mean from
each fitted function mb j , j = 1, . . . , d, otherwise the model would not be identifiable. Our
final estimate therefore takes the form

b
m(x) b 1 (x(1)) + · · · + m(x(d))
=Y +m b
P
where Y = n1 ni=1 Yi . Hastie & Tibshirani (1990) provide a very nice exposition on the
some of the more practical aspects of backfitting and additive models.
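Here is a compact Python sketch of backfitting (our own illustration, with a simple Nadaraya-Watson smoother playing the role of Smooth(·, ·)); the bandwidth, number of passes, and simulated data are arbitrary choices.

```python
import numpy as np

def nw_smooth(z, r, h=0.1):
    # Nadaraya-Watson fit of r on z, evaluated at the training points z
    W = np.exp(-0.5 * ((z[:, None] - z[None, :]) / h) ** 2)
    return (W / W.sum(axis=1, keepdims=True)) @ r

def backfit(X, Y, n_passes=20):
    n, d = X.shape
    fits = np.zeros((n, d))                  # current estimates m_j(X_i(j))
    for _ in range(n_passes):
        for j in range(d):
            r = Y - fits[:, np.arange(d) != j].sum(axis=1)   # partial residuals
            fits[:, j] = nw_smooth(X[:, j], r)
            fits[:, j] -= fits[:, j].mean()  # center for identifiability
    return Y.mean(), fits                    # final fit is Ybar + sum_j m_j

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
Y = np.sin(4 * X[:, 0]) + (X[:, 1] - 0.5) ** 2 + rng.normal(0, 0.2, 300)
Ybar, fits = backfit(X, Y)
print(Ybar, np.round(fits[:3], 3))
```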
In many cases, backfitting is equivalent to blockwise coordinate descent performed on
a joint optimization criterion that determines the total additive estimate. E.g., for the
additive cubic smoothing spline optimization problem,
n 
X d
X 2 X
d Z 1
b 1, . . . , m
m b d = argmin Yi − mj (xij ) + λj m00j (t)2 dt,
m1 ,...,md 0
i=1 j=1 j=1

backfitting is exactly blockwise coordinate descent (after we reparametrize the above to be


in finite-dimensional form, using a natural cubic spline basis).
The beauty of backfitting is that it allows us to think algorithmically, and plug in
whatever we want for the univariate smoothers. This allows for several extensions. One
extension: we don’t need to use the same univariate smoother for each dimension, rather,
we could mix and match, choosing Smoothj (·, ·), j = 1, . . . , d to come from entirely different
methods or giving estimates with entirely different structures.
Another extension: to capture interactions, we can perform smoothing over (small)
groups of variables instead of individual variables. For example we could fit a model of the
form X X
m(x) = mj (x(j)) + mjk (x(j), x(k)).
j j<k

22
7.3 Error rates
Error rates for additive models are both kind of what you’d expect and surprising. What
you’d expect: if the underlying function m0 is additive, and we place standard assumptions
on its component functions, such as f0,j ∈ S1 (m, C), j = 1, . . . , d, for a constant C > 0,
a somewhat straightforward argument building on univariate minimax theory gives us the
lower bound
inf sup Ekmb − m0 k22 & dn−2m/(2m+1) .
m
b m ∈⊕d S (m,C)
0 j=1 1

This is simply d times the univariate minimax rate. (Note that we have been careful to
track the role of d here, i.e., it is not being treated like a constant.) Also, standard methods
like backfitting with univariate smoothing splines of polynomial order k = 2m − 1, will also
match this upper bound in error rate (though the proof to get the sharp linear dependence
on d is a bit trickier).

7.4 Sparse additive models


Recently, sparse additive models have received a good deal of attention. In truly high
dimensions, we might believe that only a small subset of the variables play a useful role in
modeling the regression function, so might posit a modification of (20) of the form
X
m(x) = mj (x(j))
j∈S

where S ⊆ {1, . . . , d} is an unknown subset of the full set of dimensions.


This is a natural idea, and to estimate a sparse additive model, we can use methods that
are like nonparametric analogies of the lasso (more accurately, the group lasso). This is a
research topic still very much in development; some recent works are Lin & Zhang (2006),
Ravikumar et al. (2009), Raskutti et al. (2012). We’ll cover this in more detail when we
talk about the sparsity, the lasso, and high-dimensional estimation.

8 Variance Estimation and Confidence Bands


Let
σ 2 (x) = Var(Y |X = x).
We can estimate σ 2 (x) as follows. Let m(x)
b be an estimate of the regression function. Let
b i ). Now apply nonparametric regression again treating e2i as the response. The
ei = Yi − m(X
b2 (x) can be shown to be consistent under some regularity conditions.
resulting estimator σ
Ideally we would also like to find random functions `n and un such that

P (`n (x) ≤ m(x) ≤ un (x) for all x) → 1 − α.

For the reasons we discussed earlier with density functions, this is essentially an impossible
problem.
We can, however, still
P get an informal (but useful) estimatePthe variability of m(x). b
b
Suppose that m(x) = i wi (x)Yi . The conditional variance is i wi2 (x)σ 2 (x) which can

23
P 2
be estimatedqby i wi (x)bσ 2 (x). An asymptotic, pointwise (biased) confidence band is
P 2
b
m(x) ± zα/2 σ 2 (x).
i wi (x)b
A better idea is to bootstrap the quantity

n supx |m(x)
b − E[m(x)]|
b
b(x)
σ

to get a bootstrap quantile tn . Then


 
b(x)
tn σ b(x)
tn σ
b
m(x) − √ , m(x)b + √
n n

is a bootstrap variability band.

9 Wavelet smoothing
Not every nonparametric regression estimate needs to be a linear smoother (though this
does seem to be very common), and wavelet smoothing is one of the leading nonlinear tools
for nonparametric estimation. The theory of wavelets is elegant and we only give a brief
introduction here; see Mallat (2008) for an excellent reference
You can think of wavelets as defining an orthonormal function basis, with the basis
functions exhibiting a highly varied level of smoothness. Importantly, these basis functions
also display spatially localized smoothness at different locations in the input domain. There
are actually many different choices for wavelets bases (Haar wavelets, symmlets, etc.), but
these are details that we will not go into
We assume d = 1. Local adaptivity in higher dimensions is not nearly as settled as
it is with smoothing splines or (especially) kernels (multivariate extensions of wavelets are
possible, i.e., ridgelets and curvelets, but are complex)
Consider basis functions, φ1 , . . . , φn , evaluated over n equally spaced inputs over [0, 1]:

Xi = i/n, i = 1, . . . , n.

The assumption of evenly spaced inputs is crucial for fast computations; we also typically
assume with wavelets that n is a power of 2. We now form a wavelet basis matrix W ∈ Rn×n ,
defined by
Wij = φj (Xi ), i, j = 1, . . . , n
The goal, given outputs y = (y1 , . . . , yn ) over the evenly spaced input points, is to
represent y as a sparse combination of the wavelet basis functions. To do so, we first
perform a wavelet transform (multiply by W T ):

θe = W T y,

we threshold the coefficients θ (the threshold function Tλ to be defined shortly):

θb = Tλ (θ),
e

24
and then perform an inverse wavelet transform (multiply by W ):

b = W θb
µ

The wavelet and inverse wavelet transforms (multiplication by W T and W ) each require
O(n) operations, and are practically extremely fast due do clever pyramidal multiplication
schemes that exploit the special structure of wavelets
The threshold function Tλ is usually taken to be hard-thresholding, i.e.,

[Tλhard (z)]i = zi · 1{|zi | ≥ λ}, i = 1, . . . , n,

or soft-thresholding, i.e.,

[Tλsoft (z)]i = zi − sign(zi )λ · 1{|zi | ≥ λ}, i = 1, . . . , n.

These thresholding functions are both also O(n), and computationally trivial, making
wavelet smoothing very fast overall
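As a toy illustration of the transform-threshold-invert recipe, here is a Python sketch using the Haar basis built as a dense orthogonal matrix (rather than the fast pyramid algorithm mentioned above); the threshold level and the piecewise test signal are our own illustrative choices.

```python
import numpy as np

def haar(n):
    # Rows of the returned orthogonal matrix form an orthonormal (sampled) Haar
    # basis; n must be a power of 2. This matrix plays the role of W^T above.
    if n == 1:
        return np.array([[1.0]])
    H = haar(n // 2)
    return np.vstack([np.kron(H, [1.0, 1.0]),
                      np.kron(np.eye(n // 2), [1.0, -1.0])]) / np.sqrt(2)

def soft(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

rng = np.random.default_rng(0)
n = 256
x = np.arange(n) / n
y = np.where(x < 0.5, np.sin(8 * np.pi * x), 0.2) + rng.normal(0, 0.1, n)

Wt = haar(n)
theta = Wt @ y                        # wavelet transform, theta_tilde = W^T y
lam = 0.1 * np.sqrt(2 * np.log(n))    # illustrative threshold level
mu = Wt.T @ soft(theta, lam)          # mu_hat = W T_lambda(theta_tilde)
print(np.round(mu[:8], 3))
```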
We should emphasize that wavelet smoothing is not a linear smoother, i.e., there is no
b = Sy for all y
single matrix S such that µ
We can write the wavelet smoothing estimate in a more familiar form, following our
previous discussions on basis functions and regularization. For hard-thresholding, we solve

θb = argmin ky − W θk22 + λ2 kθk0 ,


θ∈Rn

b Here kθk0 = Pn 1{θi 6= 0},


b = W θ.
and then the wavelet smoothing fitted values are µ i=1
the number of nonzero components of θ, called the “`0 norm”. For soft-thresholding, we
solve
θb = argmin ky − W θk22 + 2λkθk1 ,
θ∈Rn
b Here kθk1 = Pn |θi |, the `1
b = W θ.
and then the wavelet smoothing fitted values are µ i=1
norm

9.1 The strengths of wavelets, the limitations of linear smoothers


Apart from its computational efficiency, an important strength of wavelet smoothing is that
it can represent a signal that has a spatially heterogeneous degree of smoothness, i.e., it
can be both smooth and wiggly at different regions of the input domain. The reason that
wavelet smoothing can achieve such local adaptivity is because it selects a sparse number
of wavelet basis functions, by thresholding the coefficients from a basis regression
We can make this more precise by considering convergence rates over an appropriate
function class. In particular, we define the total variation class M (k, C), for an integer k ≥ 0
and C > 0, to contain all k times (weakly) differentiable functions whose kth derivative
satisfies
XN
TV(f (k) ) = sup |f (k) (zi+1 ) − f (k) (zi )| ≤ C.
0=z1 <z2 <...<zN <zN +1 =1
j=1
R1
(Note that if f has k + 1 continuous derivatives, then TV(f (k) ) = 0 |f (k+1) (x)| dx.)

25
For the wavelet smoothing estimator, denoted by m b wav , Donoho & Johnstone (1998)
provide a seminal analysis. Assuming that m0 ∈ M (k, C) for a constant C > 0 (and further
conditions on the setup), they show that (for an appropriate scaling of the smoothing
parameter λ),

b wav − m0 k22 . n−(2k+2)/(2k+3) and


Ekm inf sup b − m0 k22 & n−(2k+2)/(2k+3) .
Ekm
m
b m0 ∈M (k,C)
(21)
Thus wavelet smoothing attains the minimax optimal rate over the function class M (k, C).
(For a translation of this result to the notation of the current setting, see Tibshirani (2014).)
Some important questions: (i) just how big is the function class M (k, C)? And (ii) can
a linear smoother also be minimax optimal over M (k, C)?
It is not hard to check M (k, C) ⊇ S1 (k + 1, C 0 ), the (univariate) Sobolev space of order
k + 1, for some other constant C 0 > 0. We know from the previously mentioned theory
on Sobolev spaces that the minimax rate over S1 (k + 1, C 0 ) is again n−(2k+2)/(2k+3) . This
suggests that these two function spaces might actually be somewhat close in size
But in fact, the overall minimax rates here are sort of misleading, and we will see
from the behavior of linear smoothers that the function classes are actually quite different.
Donoho & Johnstone (1998) showed that the minimax error over M (k, C), restricted to
linear smoothers, satisfies

inf sup b − m0 k22 & n−(2k+1)/(2k+2) .


Ekm (22)
m
b linear m0 ∈M (k,C)

(See again Tibshirani (2014) for a translation to the notation of the current setting.) Hence
the answers to our questions are: (ii) linear smoothers cannot cope with the heterogeneity
of functions in M (k, C), and are bounded away from optimality, which means (i) we
can interpret M (k, C) as being much larger than S1 (k + 1, C 0 ), because linear smoothers
can be optimal over the latter class but not over the former. See Figure 5 for a diagram
Let’s back up to emphasize just how remarkable the results (21), (22) really are. Though
it may seem like a subtle difference in exponents, there is actually a significant difference
in the minimax rate and minimax linear rate: e.g., when k = 0, this is a difference of n^{−2/3}
(optimal) and n^{−1/2} (optimal among linear smoothers) for estimating a function of bounded
variation. Recall also just how broad the linear smoother class is: kernel smoothing, regres-
sion splines, smoothing splines, RKHS estimators ... none of these methods can achieve a
better rate than n−1/2 over functions of bounded variation
Practically, the differences between wavelets and linear smoothers in problems with
spatially heterogeneous smoothness can be striking as well. However, you should keep in
mind that wavelets are not perfect: a shortcoming is that they require a highly restrictive
setup: recall that they require evenly spaced inputs, and n to be power of 2, and there are
often further assumptions made about the behavior of the fitted function at the boundaries
of the input domain
Also, though you might say they marked the beginning of the story, wavelets are not the
end of the story when it comes to local adaptivity. The natural thing to do, it might seem,
is to make (say) kernel smoothing or smoothing splines more locally adaptive by allowing
for a local bandwidth parameter or a local penalty parameter. People have tried this, but it

26
Figure 5: A diagram of the minimax rates over M (k, C) (denoted Fk in the picture) and
S1 (k + 1, C) (denoted Wk+1 in the picture)

is both difficult theoretically and practically to get right. A cleaner approach is to redesign
the kind of penalization used in constructing smoothing splines directly.

10 More on Splines: Regression and Smoothing Splines


10.1 Splines
• Regression splines and smoothing splines are motivated from a different perspective
than kernels and local polynomials; in the latter case, we started off with a special
kind of local averaging, and moved our way up to higher-order local models. With
regression splines and smoothing splines, we build up our estimate globally, from a
set of select basis functions
• These basis functions, as you might guess, are splines. Let’s assume that d = 1 for
simplicity. (We’ll stay in the univariate case, for the most part, in this section.) A
kth-order spline f is a piecewise polynomial function of degree k that is continuous
and has continuous derivatives of orders 1, . . . , k − 1, at its knot points. Specifically,
there are t1 < . . . < tp such that f is a polynomial of degree k on each of the intervals
(−∞, t1 ], [t1 , t2 ], . . . , [tp , ∞)
and f (j) is continuous at t1 , . . . , tp , for each j = 0, 1, . . . , k − 1
• Splines have some special (some might say: amazing!) properties, and they have been
a topic of interest among statisticians and mathematicians for a very long time. See

de Boor (1978) for an in-depth coverage. Informally, a spline is a lot smoother than
a piecewise polynomial, and so modeling with splines can serve as a way of reducing
the variance of fitted estimators. See Figure 6
• A bit of statistical folklore: it is said that a cubic spline is so smooth, that one cannot
detect the locations of its knots by eye!
• How can we parametrize the set of splines with knots at t_1, ..., t_p? The most natural
way is to use the truncated power basis, g_1, ..., g_{p+k+1}, defined as

g_1(x) = 1,  g_2(x) = x,  ...,  g_{k+1}(x) = x^k,
g_{k+1+j}(x) = (x - t_j)_+^k,   j = 1, ..., p.      (23)
(Here x+ denotes the positive part of x, i.e., x+ = max{x, 0}.) From this we can see
that the space of kth-order splines with knots at t1 , . . . , tp has dimension p + k + 1
• While these basis functions are natural, a much better computational choice, both for
speed and numerical accuracy, is the B-spline basis. This was a major development
in spline theory and is now pretty much the standard in software. The key idea:
B-splines have local support, so a basis matrix that we form with them (to be defined
below) is banded. See de Boor (1978) or the Appendix of Chapter 5 in Hastie et al.
(2009) for details

10.2 Regression splines


• A first idea: let’s perform regression on a spline basis. In other words, given inputs
x1 , . . . , xn and responses y1 , . . . , yn , we consider fitting functions f that are kth-order
splines with knots at some chosen locations t1 , . . . tp . This means expressing f as
f(x) = \sum_{j=1}^{p+k+1} \beta_j g_j(x),

where β1 , . . . , βp+k+1 are coefficients and g1 , . . . , gp+k+1 , are basis functions for order
k splines over the knots t1 , . . . , tp (e.g., the truncated power basis or B-spline basis)
• Letting y = (y_1, ..., y_n) \in R^n, and defining the basis matrix G \in R^{n \times (p+k+1)} by

G_{ij} = g_j(x_i),   i = 1, ..., n,  j = 1, ..., p + k + 1,

we can just use least squares to determine the optimal coefficients \hat\beta = (\hat\beta_1, ..., \hat\beta_{p+k+1}),

\hat\beta = argmin_{\beta \in R^{p+k+1}} \|y - G\beta\|_2^2,

which then leaves us with the fitted regression spline \hat f(x) = \sum_{j=1}^{p+k+1} \hat\beta_j g_j(x)

• Of course we know that \hat\beta = (G^T G)^{-1} G^T y, so the fitted values \hat\mu = (\hat f(x_1), ..., \hat f(x_n)) are

\hat\mu = G (G^T G)^{-1} G^T y,

and regression splines are linear smoothers
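To make the least squares recipe above concrete, here is a minimal numpy sketch (not from the notes); the knots and toy data are made up for illustration, and in practice one would use a B-spline basis, as discussed below, for numerical stability.

```python
import numpy as np

def truncated_power_basis(x, knots, k):
    """Columns: 1, x, ..., x^k, then (x - t_j)_+^k for each knot t_j."""
    cols = [x**j for j in range(k + 1)]
    cols += [np.maximum(x - t, 0.0)**k for t in knots]
    return np.column_stack(cols)                      # shape (n, p + k + 1)

def fit_regression_spline(x, y, knots, k=3):
    G = truncated_power_basis(x, knots, k)
    beta, *_ = np.linalg.lstsq(G, y, rcond=None)      # least squares coefficients
    return lambda xnew: truncated_power_basis(np.atleast_1d(xnew), knots, k) @ beta

# toy usage with made-up knots
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(8 * x) + 0.3 * rng.standard_normal(200)
fhat = fit_regression_spline(x, y, knots=np.linspace(0.1, 0.9, 8))
```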

Figure 6: Illustration of the effects of enforcing continuity at the knots, across various orders
of the derivative, for a cubic piecewise polynomial (panels: discontinuous; continuous; continuous
first derivative; continuous second derivative). From Chapter 5 of Hastie et al. (2009)
• This is a classic method, and can work well provided we choose good knots t1 , . . . , tp ;
but in general choosing knots is a tricky business. There is a large literature on knot
selection for regression splines via greedy methods like recursive partitioning

10.3 Natural splines


• A problem with regression splines is that the estimates tend to display erratic be-
havior, i.e., they have high variance, at the boundaries of the input domain. (This
is the opposite problem to that with kernel smoothing, which had poor bias at the
boundaries.) This only gets worse as the polynomial order k gets larger

• A way to remedy this problem is to force the piecewise polynomial function to have a
lower degree to the left of the leftmost knot, and to the right of the rightmost knot—
this is exactly what natural splines do. A natural spline of order k, with knots at
t1 < . . . < tp , is a piecewise polynomial function f such that

– f is a polynomial of degree k on each of [t1 , t2 ], . . . , [tp−1 , tp ],


– f is a polynomial of degree (k − 1)/2 on (−∞, t1 ] and [tp , ∞),
– f is continuous and has continuous derivatives of orders 1, . . . , k − 1 at t1 , . . . , tp .

It is implicit here that natural splines are only defined for odd orders k

• What is the dimension of the span of kth order natural splines with knots at t1 , . . . , tp ?
Recall for splines, this was p + k + 1 (the number of truncated power basis functions).
For natural splines, we can compute this dimension by counting:

(k + 1) \cdot (p - 1)  +  \Big( \frac{k-1}{2} + 1 \Big) \cdot 2  -  k \cdot p  =  p,

where the first term (a) is the number of free parameters in the interior intervals [t_1, t_2], ..., [t_{p-1}, t_p],
the second term (b) is the number of free parameters in the exterior intervals (-\infty, t_1], [t_p, \infty), and the
third term (c) is the number of constraints at the knots t_1, ..., t_p. The fact that the total dimension
is p is amazing; this is independent of k!

• Note that there is a variant of the truncated power basis for natural splines, and a
variant of the B-spline basis for natural splines. Again, B-splines are the preferred
parametrization for computational speed and stability

• Natural splines of cubic order are the most common special case: these are smooth
piecewise cubic functions that are simply linear beyond the leftmost and rightmost
knots

10.4 Smoothing splines


• Smoothing splines, at the end of the day, are given by a regularized regression over
the natural spline basis, placing knots at all inputs x1 , . . . , xn . They circumvent the
problem of knot selection (as they just use the inputs as knots), and they control

for overfitting by shrinking the coefficients of the estimated function (in its basis
expansion)

• Interestingly, we can motivate and define a smoothing spline directly from a func-
tional minimization perspective. With inputs x1 , . . . , xn lying in an interval [0, 1], the
smoothing spline estimate \hat f, of a given odd integer order k, is defined as

\hat f = argmin_f \sum_{i=1}^n ( y_i - f(x_i) )^2 + \lambda \int_0^1 ( f^{(m)}(x) )^2 dx,   where m = (k+1)/2.    (24)

This is an infinite-dimensional optimization problem over all functions f for which
the criterion is finite. The criterion trades off the least squares error of f over the
observed pairs (x_i, y_i), i = 1, ..., n, against a penalty term that is large when the mth
derivative of f is wiggly. The tuning parameter \lambda \ge 0 governs the relative strength of the
two terms

• By far the most commonly considered case is k = 3, i.e., cubic smoothing splines,
which are defined as

\hat f = argmin_f \sum_{i=1}^n ( y_i - f(x_i) )^2 + \lambda \int_0^1 f''(x)^2 dx    (25)

• Remarkably, it so happens that the minimizer in the general smoothing spline prob-
lem (24) is unique, and is a natural kth-order spline with knots at the input points
x_1, ..., x_n! Here we give a proof for the cubic case, k = 3, from Green & Silverman
(1994) (see also Exercise 5.7 in Hastie et al. (2009))
The key result can be stated as follows: if \tilde f is any twice differentiable function on
[0, 1], and x_1, ..., x_n \in [0, 1], then there exists a natural cubic spline f with knots at
x_1, ..., x_n such that f(x_i) = \tilde f(x_i), i = 1, ..., n and

\int_0^1 f''(x)^2 dx \le \int_0^1 \tilde f''(x)^2 dx.

Note that this would in fact prove that we can restrict our attention in (25) to natural
splines with knots at x1 , . . . , xn
Proof: the natural spline basis with knots at x_1, ..., x_n is n-dimensional, so given any
n points z_i = \tilde f(x_i), i = 1, ..., n, we can always find a natural spline f with knots at
x_1, ..., x_n that satisfies f(x_i) = z_i, i = 1, ..., n. Now define

h(x) = \tilde f(x) - f(x).

Consider

\int_0^1 f''(x) h''(x) dx = f''(x) h'(x) \Big|_0^1 - \int_0^1 f'''(x) h'(x) dx
  = - \int_{x_1}^{x_n} f'''(x) h'(x) dx
  = - \sum_{j=1}^{n-1} f'''(x) h(x) \Big|_{x_j}^{x_{j+1}} + \int_{x_1}^{x_n} f^{(4)}(x) h(x) dx
  = - \sum_{j=1}^{n-1} f'''(x_j^+) \big( h(x_{j+1}) - h(x_j) \big),

where in the first line we used integration by parts; in the second we used the fact that
f''(0) = f''(1) = 0, and f'''(x) = 0 for x \le x_1 and x \ge x_n, as f is a natural spline; in
the third we used integration by parts again; in the fourth line we used the fact that f'''
is constant on any open interval (x_j, x_{j+1}), j = 1, ..., n - 1, and that f^{(4)} = 0, again
because f is a natural spline. (In the above, we use f'''(u^+) to denote \lim_{x \downarrow u} f'''(x).)
Finally, since h(x_j) = 0 for all j = 1, ..., n, we have

\int_0^1 f''(x) h''(x) dx = 0.

From this, it follows that

\int_0^1 \tilde f''(x)^2 dx = \int_0^1 \big( f''(x) + h''(x) \big)^2 dx
  = \int_0^1 f''(x)^2 dx + \int_0^1 h''(x)^2 dx + 2 \int_0^1 f''(x) h''(x) dx
  = \int_0^1 f''(x)^2 dx + \int_0^1 h''(x)^2 dx,

and therefore

\int_0^1 f''(x)^2 dx \le \int_0^1 \tilde f''(x)^2 dx,    (26)

with equality if and only if h''(x) = 0 for all x \in [0, 1]. Note that h'' = 0 implies that
h must be linear, and since we already know that h(x_j) = 0 for all j = 1, ..., n, this
is equivalent to h = 0. In other words, the inequality (26) holds strictly except when
\tilde f = f, so the solution in (25) is uniquely a natural spline with knots at the inputs

10.5 Finite-dimensional form


• The key result presented above tells us that we can choose a basis \eta_1, ..., \eta_n for the set
of kth-order natural splines with knots over x_1, ..., x_n, and reparametrize the problem
(24) as

\hat\beta = argmin_{\beta \in R^n} \sum_{i=1}^n \Big( y_i - \sum_{j=1}^n \beta_j \eta_j(x_i) \Big)^2 + \lambda \int_0^1 \Big( \sum_{j=1}^n \beta_j \eta_j^{(m)}(x) \Big)^2 dx.    (27)

This is a finite-dimensional problem, and after we compute the coefficients \hat\beta \in R^n,
we know that the smoothing spline estimate is simply \hat f(x) = \sum_{j=1}^n \hat\beta_j \eta_j(x)

• Defining the basis matrix and penalty matrix N, \Omega \in R^{n \times n} by

N_{ij} = \eta_j(x_i)   and   \Omega_{ij} = \int_0^1 \eta_i^{(m)}(x) \eta_j^{(m)}(x) dx   for i, j = 1, ..., n,    (28)

the problem in (27) can be written more succinctly as

\hat\beta = argmin_{\beta \in R^n} \|y - N\beta\|_2^2 + \lambda \beta^T \Omega \beta,    (29)

showing the smoothing spline problem to be a type of generalized ridge regression
problem. In fact, the solution in (29) has the explicit form

\hat\beta = (N^T N + \lambda\Omega)^{-1} N^T y,

and therefore the fitted values \hat\mu = (\hat f(x_1), ..., \hat f(x_n)) are

\hat\mu = N (N^T N + \lambda\Omega)^{-1} N^T y.    (30)

Therefore, once again, smoothing splines are a type of linear smoother
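As a minimal sketch of the finite-dimensional computation in (29)-(30): assuming a natural spline basis matrix N and penalty matrix Omega have already been computed (constructing them is the part not shown here), the fit is a single linear solve.

```python
import numpy as np

def smoothing_spline_fit(N, Omega, y, lam):
    """Generalized ridge solve: beta = (N^T N + lam*Omega)^{-1} N^T y."""
    beta = np.linalg.solve(N.T @ N + lam * Omega, N.T @ y)
    mu = N @ beta                     # fitted values mu_hat = N beta_hat
    return beta, mu
```

With a banded B-spline basis one would use a banded solver instead of this dense one, which is what makes the O(n) computation mentioned below possible.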

• A special property of smoothing splines: the fitted values in (30) can be computed in
O(n) operations. This is achieved by forming N from the B-spline basis (for natural
splines), in which case the matrix N^T N + \lambda\Omega ends up being banded (with a band-
width that only depends on the polynomial order k). In practice, smoothing spline
computations are extremely fast

10.6 Reinsch form


• It is informative to rewrite the fitted values in (30) in what is called Reinsch form,

\hat\mu = N (N^T N + \lambda\Omega)^{-1} N^T y
    = N \big( N^T ( I + \lambda (N^T)^{-1} \Omega N^{-1} ) N \big)^{-1} N^T y
    = (I + \lambda Q)^{-1} y,    (31)

where Q = (N^T)^{-1} \Omega N^{-1}

• Note that this matrix Q does not depend on \lambda. If we compute an eigendecomposition
Q = U D U^T, with D = diag(d_1, ..., d_n), then the eigendecomposition of
S = N (N^T N + \lambda\Omega)^{-1} N^T = (I + \lambda Q)^{-1} is

S = \sum_{j=1}^n \frac{1}{1 + \lambda d_j} u_j u_j^T


Figure 7: Eigenvectors and eigenvalues for the Reinsch form of the cubic smoothing spline
operator, defined over n = 50 evenly spaced inputs on [0, 1]. The left plot shows the bottom
7 eigenvectors of the Reinsch matrix Q. We can see that the smaller the eigenvalue, the
“smoother” the eigenvector. The right plot shows the weights wj = 1/(1 + λdj ), j = 1, . . . , n
implicitly used by the smoothing spline estimator (32), over 8 values of λ. We can see that
when λ is larger, the weights decay faster, so the smoothing spline estimator places less
weight on the “nonsmooth” eigenvectors

• Therefore the smoothing spline fitted values are \hat\mu = S y, i.e.,

\hat\mu = \sum_{j=1}^n \frac{u_j^T y}{1 + \lambda d_j} u_j.    (32)

Interpretation: smoothing splines perform a regression on the orthonormal basis


u1 , . . . , un ∈ Rn , yet they shrink the coefficients in this regression, with more shrinkage
assigned to eigenvectors uj that correspond to large eigenvalues dj
• So what exactly are these basis vectors u_1, ..., u_n? These are known as the Demmler-
Reinsch basis, and a lot of their properties can be worked out analytically. Ba-
sically: the eigenvectors u_j that correspond to smaller eigenvalues d_j are smoother,
and so with smoothing splines, we shrink less in their direction. Said differently, by
increasing \lambda in the smoothing spline estimator, we are tuning out the more wiggly
components. See Figure 7
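A small numpy sketch of the spectral view in (31)-(32): the eigendecomposition of Q is computed once, and for any lambda the fit just rescales the coordinates of y in the eigenvector basis by 1/(1 + lambda d_j). Here N and Omega are assumed given, as above; this is illustrative, not how one would compute in practice.

```python
import numpy as np

def reinsch_fit(N, Omega, y, lam):
    Q = np.linalg.solve(N.T, Omega) @ np.linalg.inv(N)   # Q = (N^T)^{-1} Omega N^{-1}
    d, U = np.linalg.eigh((Q + Q.T) / 2)                 # symmetrize for numerical safety
    w = 1.0 / (1.0 + lam * d)                            # shrinkage weight per eigenvector
    return U @ (w * (U.T @ y))                           # mu_hat = sum_j w_j (u_j^T y) u_j
```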

10.7 Kernel smoothing equivalence


• Something interesting happens when we plot the rows of the smoothing spline matrix
S. For evenly spaced inputs, they look like the translations of a kernel! See Figure
8, left plot. For unevenly spaced inputs, the rows still have a kernel shape; now, the
bandwidth appears to adapt to the density of the input points: lower density, larger
bandwidth. See Figure 8, right plot


Figure 8: Rows of the cubic smoothing spline operator S defined over n = 100 evenly spaced
input points on [0, 1]. The left plot shows 3 rows of S (in particular, rows 25, 50, and 75)
for λ = 0.0002. These look precisely like translations of a kernel. The right plot considers
a setup where the input points are concentrated around 0.5, and shows 3 rows of S (rows
5, 50, and 95) for the same value of λ. These still look like kernels, but the bandwidth is
larger in low-density regions of the inputs

• What we are seeing is an empirical validation of a beautiful asymptotic result due to Silverman.
It turns out that the cubic smoothing spline estimator is asymptotically equivalent to
a kernel regression estimator, with an unusual choice of kernel. Recall that both are
linear smoothers; the equivalence is established by showing that under some conditions
the smoothing spline weights converge to kernel weights, under the "Silverman kernel":

K(x) = \frac{1}{2} \exp(-|x|/\sqrt{2}) \sin(|x|/\sqrt{2} + \pi/4),    (33)

and a local choice of bandwidth h(x) = \lambda^{1/4} q(x)^{-1/4}, where q(x) is the density of the
input points. That is, the bandwidth adapts to the local distribution of inputs. See
Figure 9 for a plot of the Silverman kernel

• The Silverman kernel is "kind of" a higher-order kernel. It satisfies

\int K(x) dx = 1,   \int x^j K(x) dx = 0,  j = 1, ..., 3,   but   \int x^4 K(x) dx = -24.

So it lies outside the scope of usual kernel analysis
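The moment claims above are easy to check numerically; the following sketch (using scipy's quad on a truncated range, which is harmless given the exponential decay of the kernel) should print values close to 1, 0, 0, 0, -24.

```python
import numpy as np
from scipy.integrate import quad

def K(x):
    return 0.5 * np.exp(-abs(x) / np.sqrt(2)) * np.sin(abs(x) / np.sqrt(2) + np.pi / 4)

for j in range(5):
    val, _ = quad(lambda x: x**j * K(x), -40, 40, limit=200)
    print(j, round(val, 4))
```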

• There is more recent work that connects smoothing splines of all orders to kernel smoothing


Figure 9: The Silverman kernel in (33), which is the (asymptotically) equivalent implicit
kernel used by smoothing splines. Note that it can be negative

10.8 Error rates


• Define the Sobolev class of functions W_1(m, C), for an integer m \ge 0 and C > 0, to
contain all m times differentiable functions f : R \to R such that

\int \big( f^{(m)}(x) \big)^2 dx \le C^2.

(The Sobolev class W_d(m, C) in d dimensions can be defined similarly, where we sum
over all partial derivatives of order m.)

• Assuming f_0 \in W_1(m, C) for the underlying regression function, where C > 0 is a
constant, the smoothing spline estimator \hat f in (24) of polynomial order k = 2m - 1
with tuning parameter \lambda \asymp n^{1/(2m+1)} = n^{1/(k+2)} satisfies

\| \hat f - f_0 \|_n^2 \lesssim n^{-2m/(2m+1)}   in probability.

The proof of this result uses fancier techniques from empirical process theory
(entropy numbers) than the proofs for kernel smoothing. See Chapter 10.1 of van de
Geer (2000)

• This rate is seen to be minimax optimal over W1 (m, C) (e.g., Nussbaum (1985)).
Also, it is worth noting that the Sobolev W1 (m, C) and Holder H1 (m, L) classes are
equivalent in the following sense: given W1 (m, C) for a constant C > 0, there are
L0 , L1 > 0 such that

H1 (m, L0 ) ⊆ W1 (m, C) ⊆ H1 (m, L1 ).

The first containment is easy to show; the second is far more subtle, and is a con-
sequence of the Sobolev embedding theorem. (The same equivalences hold for the
d-dimensional versions of the Sobolev and Holder spaces.)

10.9 Multivariate splines
• Splines can be extended to multiple dimensions, in two different ways: thin-plate
splines and tensor-product splines. The former construction is more computationally
efficient but in some sense more limiting; the penalty for a thin-plate spline, of
polynomial order k = 2m - 1, is

\sum_{\alpha_1 + \cdots + \alpha_d = m} \int \Big( \frac{\partial^m f(x)}{\partial x_1^{\alpha_1} \partial x_2^{\alpha_2} \cdots \partial x_d^{\alpha_d}} \Big)^2 dx,

which is rotationally invariant. Both of these constructions are discussed in Chapter 7 of
Green & Silverman (1994) (see also Chapters 15 and 20.4 of Gyorfi et al. (2002))

• The multivariate extensions (thin-plate and tensor-product) of splines are highly non-
trivial, especially when we compare them to the (conceptually) simple extension of
kernel smoothing to higher dimensions. In multiple dimensions, if one wants to study
penalized nonparametric estimation, it is (arguably) easier to study reproducing ker-
nel Hilbert space estimators. We'll see, in fact, that this covers smoothing splines
(and thin-plate splines) as a special case

References
Bellman, R. (1962), Adaptive Control Processes, Princeton University Press.

de Boor, C. (1978), A Practical Guide to Splines, Springer.

Devroye, L., Gyorfi, L., & Lugosi, G. (1996), A Probabilistic Theory of Pattern Recognition,
Springer.

Donoho, D. L. & Johnstone, I. (1998), ‘Minimax estimation via wavelet shrinkage’, Annals
of Statistics 26(8), 879–921.

Fan, J. (1993), ‘Local linear regression smoothers and their minimax efficiencies’, The An-
nals of Statistics pp. 196–216.

Fan, J. & Gijbels, I. (1996), Local polynomial modelling and its applications: monographs
on statistics and applied probability 66, Vol. 66, CRC Press.

Green, P. & Silverman, B. (1994), Nonparametric Regression and Generalized Linear Mod-
els: A Roughness Penalty Approach, Chapman & Hall/CRC Press.

Gyorfi, L., Kohler, M., Krzyzak, A. & Walk, H. (2002), A Distribution-Free Theory of
Nonparametric Regression, Springer.

Hastie, T. & Tibshirani, R. (1990), Generalized Additive Models, Chapman and Hall.

Hastie, T., Tibshirani, R. & Friedman, J. (2009), The Elements of Statistical Learning;
Data Mining, Inference and Prediction, Springer. Second edition.

Johnstone, I. (2011), Gaussian estimation: Sequence and wavelet models, Under contract to
Cambridge University Press. Online version at http://www-stat.stanford.edu/~imj.

Kim, S.-J., Koh, K., Boyd, S. & Gorinevsky, D. (2009), ‘`1 trend filtering’, SIAM Review
51(2), 339–360.

Lin, Y. & Zhang, H. H. (2006), ‘Component selection and smoothing in multivariate non-
parametric regression’, Annals of Statistics 34(5), 2272–2297.

Mallat, S. (2008), A wavelet tour of signal processing, Academic Press. Third edition.

Mammen, E. & van de Geer, S. (1997), 'Locally adaptive regression splines', Annals of
Statistics 25(1), 387–413.

Nussbaum, M. (1985), 'Spline smoothing in regression models and asymptotic efficiency in
L2', Annals of Statistics 13(3), 984–997.

Raskutti, G., Wainwright, M. & Yu, B. (2012), ‘Minimax-optimal rates for sparse addi-
tive models over kernel classes via convex programming’, Journal of Machine Learning
Research 13, 389–427.

Ravikumar, P., Liu, H., Lafferty, J. & Wasserman, L. (2009), ‘Sparse additive models’,
Journal of the Royal Statistical Society: Series B 75(1), 1009–1030.

Scholkopf, B. & Smola, A. (2002), ‘Learning with kernels’.

Simonoff, J. (1996), Smoothing Methods in Statistics, Springer.

Steidl, G., Didas, S. & Neumann, J. (2006), ‘Splines in higher order TV regularization’,
International Journal of Computer Vision 70(3), 214–255.

Stone, C. (1985), ‘Additive regression models and other nonparametric models’, Annals of
Statistics 13(2), 689–705.

Tibshirani, R. J. (2014), ‘Adaptive piecewise polynomial estimation via trend filtering’,


Annals of Statistics 42(1), 285–323.

Tsybakov, A. (2009), Introduction to Nonparametric Estimation, Springer.

van de Geer, S. (2000), Empirical Processes in M-Estimation, Cambridge University Press.

Wahba, G. (1990), Spline Models for Observational Data, Society for Industrial and Applied
Mathematics.

Wang, Y., Smola, A. & Tibshirani, R. J. (2014), ‘The falling factorial basis and its statistical
properties’, International Conference on Machine Learning 31.

Wasserman, L. (2006), All of Nonparametric Statistics, Springer.

Yang, Y. (1999), 'Nonparametric classification–Part I: Rates of convergence', IEEE Trans-
actions on Information Theory 45(7), 2271–2284.

Appendix: Locally adaptive estimators
10.10 Locally adaptive regression splines
Locally adaptive regression splines (Mammen & van de Geer 1997), as their name suggests,
can be viewed as a variant of smoothing splines that exhibits better local adaptivity. For a
given integer order k \ge 0, the estimate is defined as

\hat m = argmin_f \sum_{i=1}^n ( Y_i - f(X_i) )^2 + \lambda \, TV(f^{(k)}).    (34)

The minimization domain is infinite-dimensional: the space of all functions for which the
criterion is finite
Another remarkable variational result, similar to that for smoothing splines, shows that
(34) has a kth order spline as a solution (Mammen & van de Geer 1997). This almost
turns the minimization into a finite-dimensional one, but there is one catch: the knots of
this kth-order spline are generally not known, i.e., they need not coincide with the inputs
x1 , . . . , xn . (When k = 0, 1, they do, but in general, they do not)
To deal with this issue, we can redefine the locally adaptive regression spline estimator
to be

\hat m = argmin_{f \in G_k} \sum_{i=1}^n ( Y_i - f(X_i) )^2 + \lambda \, TV(f^{(k)}),    (35)

i.e., we restrict the domain of minimization to be G_k, the space of kth-order spline functions
with knots in T_k, where T_k is a subset of {x_1, ..., x_n} of size n - k - 1. The precise definition
of T_k is not important; it is just given by trimming away k + 1 boundary points from the
inputs
As we already know, the space G_k of kth-order splines with knots in T_k has dimension
|T_k| + k + 1 = n. Therefore we can choose a basis g_1, ..., g_n for the functions in G_k, and the
problem in (35) becomes one of finding the coefficients in this basis expansion,

\hat\beta = argmin_{\beta \in R^n} \sum_{i=1}^n \Big( Y_i - \sum_{j=1}^n \beta_j g_j(X_i) \Big)^2 + \lambda \, TV\Big( \Big( \sum_{j=1}^n \beta_j g_j \Big)^{(k)} \Big),    (36)

and then we have \hat m(x) = \sum_{j=1}^n \hat\beta_j g_j(x)
Now define the basis matrix G \in R^{n \times n} by

G_{ij} = g_j(X_i),   i, j = 1, ..., n.

Suppose we choose g_1, ..., g_n to be the truncated power basis. Denoting T_k = {t_1, ..., t_{n-k-1}},
we compute

\Big( \sum_{j=1}^n \beta_j g_j \Big)^{(k)}(x) = k! \, \beta_{k+1} + k! \sum_{j=k+2}^n \beta_j 1\{x \ge t_{j-k-1}\},

and so

TV\Big( \Big( \sum_{j=1}^n \beta_j g_j \Big)^{(k)} \Big) = k! \sum_{j=k+2}^n |\beta_j|.

Hence the locally adaptive regression spline problem (36) can be expressed as

\hat\beta = argmin_{\beta \in R^n} \|y - G\beta\|_2^2 + \lambda k! \sum_{i=k+2}^n |\beta_i|.    (37)

This is a lasso regression problem on the truncated power basis matrix G, with the first k + 1
coefficients (those corresponding to the pure polynomial functions, in the basis expansion)
left unpenalized
This reveals a key difference between the locally adaptive regression splines (37) (origi-
nally, problem (35)) and the smoothing splines (29), which originally came from the problem

\hat m = argmin_f \sum_{i=1}^n ( Y_i - f(X_i) )^2 + \lambda \int_0^1 ( f^{(m)}(x) )^2 dx,   where m = (k+1)/2.    (38)

In the first problem, the total variation penalty is translated into an \ell_1 penalty on the
coefficients of the truncated power basis, and hence it acts as a knot selector for the estimated
function. That is, at the solution in (37), the estimated spline has knots at a subset of T_k
(a subset of the input points x_1, ..., x_n), with fewer knots when \lambda is larger. In contrast,
recall, at the smoothing spline solution in (29), the estimated function has knots at each of
the inputs x_1, ..., x_n. This is a major difference between the \ell_1 and \ell_2 penalties
From a computational perspective, the locally adaptive regression spline problem in (37)
is actually a lot harder than the smoothing spline problem in (29). Recall that the latter
reduces to solving a single banded linear system, which takes O(n) operations. On the other
hand, fitting locally adaptive regression splines in (37) requires solving a lasso problem with
a dense n × n regression matrix G; this takes something like O(n3 ) operations. So when
n = 10, 000, there is a big difference between the two.
There is a tradeoff here, as with the extra computation comes much improved local adap-
tivity of the fits. See Figure 10 for an example. Theoretically, when m_0 \in M(k, C) for a
constant C > 0, Mammen & van de Geer (1997) show that the locally adaptive regression spline
estimator, denoted \hat m_{lrs}, with \lambda \asymp n^{1/(2k+3)}, satisfies

\| \hat m_{lrs} - m_0 \|_n^2 \lesssim n^{-(2k+2)/(2k+3)}   in probability,

so (like wavelets) it achieves the minimax optimal rate n^{-(2k+2)/(2k+3)} over M(k, C). In this regard,
as we discussed previously, locally adaptive regression splines have a big advantage over any linear smoother
(not just smoothing splines)

10.11 Trend filtering


At a high level, you can think of trend filtering as a computationally efficient version of locally
adaptive regression splines, though its original construction (Steidl et al. 2006, Kim et al.
2009) comes from a fairly different perspective. We will begin by describing the connection
to locally adaptive regression splines, following Tibshirani (2014)
Revisit the formulation of locally adaptive regression splines in (35), where the mini-
mization domain is G_k = span{g_1, ..., g_n}, and g_1, ..., g_n are the kth-order truncated power
basis

g_1(x) = 1,  g_2(x) = x,  ...,  g_{k+1}(x) = x^k,
g_{k+1+j}(x) = (x - t_j)_+^k,   j = 1, ..., n - k - 1,      (39)

Figure 10: The top left plot shows a simulated true regression function, which has inhomoge-
neous smoothness: smoother towards the left part of the domain, wigglier towards the right.
The top right plot shows the locally adaptive regression spline estimate with 19 degrees of
freedom; notice that it picks up the right level of smoothness throughout. The bottom left
plot shows the smoothing spline estimate with the same degrees of freedom; it picks up the
right level of smoothness on the left, but is undersmoothed on the right. The bottom right
panel shows the smoothing spline estimate with 33 degrees of freedom; now it is appropriately
wiggly on the right, but oversmoothed on the left. Smoothing splines cannot simultaneously
represent different levels of smoothness at different regions in the domain; the same is true
of any linear smoother

having knots in a set Tk ⊆ {X1 , . . . Xn } with size |Tk | = n − k − 1. The trend filtering
problem is given by replacing G_k with a different function space,

\hat m = argmin_{f \in H_k} \sum_{i=1}^n ( Y_i - f(X_i) )^2 + \lambda \, TV(f^{(k)}),    (40)

where the new domain is H_k = span{h_1, ..., h_n}. Assuming that the input points are
ordered, x_1 < ... < x_n, the functions h_1, ..., h_n are defined by

h_j(x) = \prod_{\ell=1}^{j-1} (x - x_\ell),   j = 1, ..., k + 1,
h_{k+1+j}(x) = \prod_{\ell=1}^{k} (x - x_{j+\ell}) \cdot 1\{x \ge x_{j+k}\},   j = 1, ..., n - k - 1.      (41)

(Our convention is to take the empty product to be 1, so that h_1(x) = 1.) These are dubbed
the falling factorial basis, and are piecewise polynomial functions, taking an analogous form
to the truncated power basis functions in (39). Loosely speaking, they are given by
replacing an rth-order power function in the truncated power basis with an appropriate
r-term product, e.g., replacing x^2 with (x - x_2)(x - x_1), and (x - t_j)^k with
(x - x_{j+k})(x - x_{j+k-1}) \cdots (x - x_{j+1})
Defining the falling factorial basis matrix H \in R^{n \times n} by

H_{ij} = h_j(X_i),   i, j = 1, ..., n,

it is now straightforward to check that the proposed problem of study, trend filtering in
(40), is equivalent to

\hat\beta = argmin_{\beta \in R^n} \|y - H\beta\|_2^2 + \lambda k! \sum_{i=k+2}^n |\beta_i|.    (42)

This is still a lasso problem, but now in the falling factorial basis matrix H. Compared to
the locally adaptive regression spline problem (37), there may not seem to be much of a
difference here: like G, the matrix H is dense, and solving (42) would be slow. So why did
we go to all the trouble of defining trend filtering, i.e., introducing the somewhat odd basis
h_1, ..., h_n in (41)?
The usefulness of trend filtering (42) is seen after reparametrizing the problem, by
inverting H. Let \theta = H\beta, and rewrite the trend filtering problem as

\hat\theta = argmin_{\theta \in R^n} \|y - \theta\|_2^2 + \lambda \|D\theta\|_1,    (43)

where D \in R^{(n-k-1) \times n} denotes the last n - k - 1 rows of k! \cdot H^{-1}. Explicit calculation
shows that D is a banded matrix (Tibshirani 2014, Wang et al. 2014). For simplicity of
exposition, consider the case when X_i = i, i = 1, ..., n. Then, e.g., the first three orders of
difference operators are:

k = 0:  D = [ -1  1  0  ... ;  0  -1  1  ... ;  ... ]
k = 1:  D = [ 1  -2  1  0  ... ;  0  1  -2  1  ... ;  ... ]
k = 2:  D = [ -1  3  -3  1  ... ;  0  -1  3  -3  ... ;  ... ]
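These discrete derivative matrices are easy to build; a small sketch for evenly spaced inputs, using repeated first differences of the identity (the general-input construction in Tibshirani (2014) additionally rescales by the input gaps, which is not shown here):

```python
import numpy as np

def difference_operator(n, k):
    """D in R^{(n-k-1) x n}: the order-(k+1) discrete derivative for x_i = i."""
    D = np.eye(n)
    for _ in range(k + 1):
        D = np.diff(D, axis=0)        # successive first differences of the rows
    return D

print(difference_operator(5, 0))      # rows like [-1,  1,  0, ...]
print(difference_operator(5, 1))      # rows like [ 1, -2,  1, ...]
print(difference_operator(5, 2))      # rows like [-1,  3, -3,  1, ...]
```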


Figure 11: Trend filtering and locally adaptive regression spline estimates, fit on the same
data set as in Figure 10. The two are tuned at the same level, and the estimates are visually
indistinguishable

One can hence interpret D as a type of discrete derivative operator, of order k + 1. This
also suggests an intuitive interpretation of trend filtering (43) as a discrete approximation
to the original locally adaptive regression spline problem in (34)
The bandedness of D means that the trend filtering problem (43) can be solved efficiently,
in close to linear time (complexity O(n^{1.5}) in the worst case). Thus trend filtering estimates
are much easier to fit than locally adaptive regression splines
But what of their statistical relevancy? Did switching over to the falling factorial basis
(41) wreck the local adaptivity properties that we cared about in the first place? Fortu-
nately, the answer is no, and in fact, trend filtering and locally adaptive regression spline
estimates are extremely hard to distinguish in practice. See Figure 11
Moreover, Tibshirani (2014) and Wang et al. (2014) prove that the estimates from trend
filtering and locally adaptive regression splines, denoted \hat m_{tf} and \hat m_{lrs}, respectively,
when the tuning parameter \lambda for each scales as n^{1/(2k+3)}, satisfy

\| \hat m_{tf} - \hat m_{lrs} \|_n^2 \lesssim n^{-(2k+2)/(2k+3)}   in probability.

This coupling shows that trend filtering converges to the underlying function m_0 at the rate
n^{-(2k+2)/(2k+3)} whenever locally adaptive regression splines do, making trend filtering also minimax
optimal over M(k, C). In short, trend filtering offers provably significant improvements
over linear smoothers, with a computational cost that is not much steeper than a single
banded linear system solve

10.12 Proof of (9)
Let

m_h(x) = \frac{ \sum_{i=1}^n m(X_i) I(\|X_i - x\| \le h) }{ n P_n(B(x, h)) }.

Let A_n = \{ P_n(B(x, h)) > 0 \}. When A_n is true,

E\big( ( \hat m_h(x) - m_h(x) )^2 \mid X_1, ..., X_n \big) = \frac{ \sum_{i=1}^n Var(Y_i | X_i) I(\|X_i - x\| \le h) }{ n^2 P_n^2(B(x, h)) } \le \frac{ \sigma^2 }{ n P_n(B(x, h)) }.

Since m \in M, we have that |m(X_i) - m(x)| \le L\|X_i - x\| \le Lh for X_i \in B(x, h) and hence

|m_h(x) - m(x)|^2 \le L^2 h^2 + m^2(x) I_{A_n(x)^c}.

Therefore,

E\int ( \hat m_h(x) - m(x) )^2 dP(x) = E\int ( \hat m_h(x) - m_h(x) )^2 dP(x) + E\int ( m_h(x) - m(x) )^2 dP(x)
  \le E\int \frac{ \sigma^2 I_{A_n(x)} }{ n P_n(B(x, h)) } dP(x) + L^2 h^2 + \int m^2(x) E( I_{A_n(x)^c} ) dP(x).    (44)

To bound the first term, let Y = n P_n(B(x, h)). Note that Y \sim Binomial(n, q) where
q = P(X \in B(x, h)). Now,

E\Big( \frac{I(Y > 0)}{Y} \Big) \le E\Big( \frac{2}{1 + Y} \Big) = \sum_{k=0}^n \frac{2}{k+1} \binom{n}{k} q^k (1 - q)^{n-k}
  = \frac{2}{(n+1) q} \sum_{k=0}^n \binom{n+1}{k+1} q^{k+1} (1 - q)^{n-k}
  \le \frac{2}{(n+1) q} \sum_{k=0}^{n+1} \binom{n+1}{k} q^k (1 - q)^{n-k+1}
  = \frac{2}{(n+1) q} ( q + (1 - q) )^{n+1} = \frac{2}{(n+1) q} \le \frac{2}{n q}.

Therefore,

E\int \frac{ \sigma^2 I_{A_n(x)} }{ n P_n(B(x, h)) } dP(x) \le 2 \sigma^2 \int \frac{ dP(x) }{ n P(B(x, h)) }.
We may choose points z_1, ..., z_M such that the support of P is covered by \bigcup_{j=1}^M B(z_j, h/2)
where M \le c_2 / h^d. Thus,

\int \frac{ dP(x) }{ n P(B(x, h)) } \le \sum_{j=1}^M \int \frac{ I(x \in B(z_j, h/2)) }{ n P(B(x, h)) } dP(x) \le \sum_{j=1}^M \int \frac{ I(x \in B(z_j, h/2)) }{ n P(B(z_j, h/2)) } dP(x)
  \le \frac{M}{n} \le \frac{c_1}{n h^d}.

The third term in (44) is bounded by

\int m^2(x) E( I_{A_n(x)^c} ) dP(x) \le \sup_x m^2(x) \int ( 1 - P(B(x, h)) )^n dP(x)
  \le \sup_x m^2(x) \int e^{-n P(B(x,h))} dP(x)
  = \sup_x m^2(x) \int e^{-n P(B(x,h))} \frac{ n P(B(x,h)) }{ n P(B(x,h)) } dP(x)
  \le \sup_x m^2(x) \sup_u ( u e^{-u} ) \int \frac{ dP(x) }{ n P(B(x,h)) }
  \le \sup_x m^2(x) \sup_u ( u e^{-u} ) \frac{ c_1 }{ n h^d } = \frac{ c_2 }{ n h^d }.

10.13 Proof of the Spline Lemma


The key result can be stated as follows: if \tilde f is any twice differentiable function on [0, 1],
and x_1, ..., x_n \in [0, 1], then there exists a natural cubic spline f with knots at x_1, ..., x_n
such that f(x_i) = \tilde f(x_i), i = 1, ..., n and

\int_0^1 f''(x)^2 dx \le \int_0^1 \tilde f''(x)^2 dx.

Note that this would in fact prove that we can restrict our attention in (25) to natural
splines with knots at x_1, ..., x_n.
The natural spline basis with knots at x_1, ..., x_n is n-dimensional, so given any n points
z_i = \tilde f(x_i), i = 1, ..., n, we can always find a natural spline f with knots at x_1, ..., x_n that
satisfies f(x_i) = z_i, i = 1, ..., n. Now define

h(x) = \tilde f(x) - f(x).

Consider

\int_0^1 f''(x) h''(x) dx = f''(x) h'(x) \Big|_0^1 - \int_0^1 f'''(x) h'(x) dx
  = - \int_{x_1}^{x_n} f'''(x) h'(x) dx
  = - \sum_{j=1}^{n-1} f'''(x) h(x) \Big|_{x_j}^{x_{j+1}} + \int_{x_1}^{x_n} f^{(4)}(x) h(x) dx
  = - \sum_{j=1}^{n-1} f'''(x_j^+) \big( h(x_{j+1}) - h(x_j) \big),

where in the first line we used integration by parts; in the second we used the fact that f''(0) =
f''(1) = 0, and f'''(x) = 0 for x \le x_1 and x \ge x_n, as f is a natural spline; in the third we
used integration by parts again; in the fourth line we used the fact that f''' is constant on
any open interval (x_j, x_{j+1}), j = 1, ..., n - 1, and that f^{(4)} = 0, again because f is a natural
spline. (In the above, we use f'''(u^+) to denote \lim_{x \downarrow u} f'''(x).) Finally, since h(x_j) = 0 for
all j = 1, ..., n, we have

\int_0^1 f''(x) h''(x) dx = 0.
From this, it follows that

\int_0^1 \tilde f''(x)^2 dx = \int_0^1 \big( f''(x) + h''(x) \big)^2 dx
  = \int_0^1 f''(x)^2 dx + \int_0^1 h''(x)^2 dx + 2 \int_0^1 f''(x) h''(x) dx
  = \int_0^1 f''(x)^2 dx + \int_0^1 h''(x)^2 dx,

and therefore

\int_0^1 f''(x)^2 dx \le \int_0^1 \tilde f''(x)^2 dx,    (45)

with equality if and only if h''(x) = 0 for all x \in [0, 1]. Note that h'' = 0 implies that h must
be linear, and since we already know that h(x_j) = 0 for all j = 1, ..., n, this is equivalent to
h = 0. In other words, the inequality (45) holds strictly except when \tilde f = f, so the solution
in (25) is uniquely a natural spline with knots at the inputs.

Linear Regression

We observe D = {(X1 , Y1 ), . . . , (Xn , Yn )} where Xi = (Xi (1), . . . , Xi (d)) ∈ Rd and Yi ∈ R.


For notational simplicity, we will always assume that Xi (1) = 1.

Given a new pair (X, Y) we want to predict Y from X. The conditional prediction risk is

R(\hat m) = E[ (Y - \hat m(X))^2 \mid D ] = \int (y - \hat m(x))^2 dP(x, y)

and the prediction risk of \hat m is

r(\hat m) = E(Y - \hat m(X))^2 = E[ R(\hat m) ]

where the expected value is over all random variables. The true regression function is

m(x) = E[Y |X = x].

We have the following bias-variance decomposition:

r(\hat m) = \sigma^2 + \int b_n^2(x) dP(x) + \int v_n(x) dP(x)

where

\sigma^2 = E[Y - m(X)]^2,   b_n(x) = E[\hat m(x)] - m(x),   v_n(x) = Var(\hat m(x)).

Let \epsilon = Y - m(X). Note that

E[\epsilon] = E[Y - m(X)] = E[ E[Y - m(X) \mid X] ] = 0.

A linear predictor has the form g(x) = β T x. The best linear predictor minimizes E(Y −β T X)2 .
(We do not assume that m(x) is linear.) The minimizer, assuming that Σ is non-singular, is

β∗ = Σ−1 α

where \Sigma = E[XX^T] and \alpha = E(YX). We will use linear predictors, but we should
never assume that m(x) is linear. The excess risk of the linear predictor \beta^T x is

r(β) − r(β∗ ) = (β − β∗ )T Σ(β − β∗ ). (1)

The training error is

\hat r_n(\beta) = \frac{1}{n} \sum_i (Y_i - X_i^T \beta)^2

1 Low Dimensional Linear Regression

Recall that \Sigma = E[XX^T]. The least squares estimator \hat\beta minimizes the training error \hat r_n(\beta).
We then have that

\hat\beta = \hat\Sigma^{-1} \hat\alpha

where

\hat\Sigma = \frac{1}{n} \sum_i X_i X_i^T,   \hat\alpha = \frac{1}{n} \sum_i Y_i X_i.

We want to show that r(\hat\beta) is close to r(\beta_*). For simplicity, we will assume that the distri-
bution P of (Y_i, X_i) is supported on a compact set. Also, for simplicity, we assume that \hat\beta is
truncated by some large constant L.

Theorem 1 Let \mathcal{P} be the set of all distributions for Z = (X, Y) supported on a compact set
K. There exist constants c_1, c_2 such that the following is true. For any \epsilon > 0,

\sup_{P \in \mathcal{P}} P\big( r(\hat\beta_n) > r(\beta_*(P)) + 2\epsilon \big) \le c_1 e^{-n c_2 \epsilon^2}.    (2)

Hence,

r(\hat\beta_n) - r(\beta_*) = O_P\Big( \sqrt{1/n} \Big).

Proof. Given any \beta, define \tilde\beta = (-1, \beta) and \Lambda = E[Z Z^T] where Z = (Y, X). Note that

r(\beta) = E(Y - \beta^T X)^2 = E[(Z^T \tilde\beta)^2] = \tilde\beta^T \Lambda \tilde\beta.

Similarly,

\hat r_n(\beta) = \tilde\beta^T \hat\Lambda_n \tilde\beta

where

\hat\Lambda_n = \frac{1}{n} \sum_i Z_i Z_i^T.

So

|\hat r_n(\beta) - r(\beta)| = |\tilde\beta^T (\hat\Lambda_n - \Lambda) \tilde\beta| \le \|\tilde\beta\|_1^2 \Delta_n

where

\Delta_n = \max_{j,k} |\hat\Lambda_n(j, k) - \Lambda(j, k)|.

By Hoeffding's inequality and the union bound,

P\Big( \sup_{\beta \in B} |\hat r_n(\beta) - r(\beta)| > \epsilon \Big) \le c_1 e^{-n c_2 \epsilon^2},

where B denotes the (compact) set of candidate coefficient vectors (truncated by L).
On the event \sup_{\beta \in B} |\hat r_n(\beta) - r(\beta)| < \epsilon, we have

r(\beta_*) \le r(\hat\beta_n) \le \hat r_n(\hat\beta_n) + \epsilon \le \hat r_n(\beta_*) + \epsilon \le r(\beta_*) + 2\epsilon.

The above result is not tight. Here is a more refined bound.

Theorem 2 (Theorem 11.3 of Gyorfi, Kohler, Krzyzak and Walk, 2002) Let \sigma^2 =
\sup_x Var(Y | X = x) < \infty. Assume that all the random variables are bounded by L < \infty.
Then

E\int |\hat\beta^T x - m(x)|^2 dP(x) \le 8 \inf_\beta \int |\beta^T x - m(x)|^2 dP(x) + \frac{C d (\log(n) + 1)}{n}.

The proof is straightforward but very long. The strategy is to first bound
n^{-1} \sum_i (\hat\beta^T X_i - m(X_i))^2 using the properties of least squares. Then, using
concentration of measure, one can relate n^{-1} \sum_i f^2(X_i) to \int f^2(x) dP(x).

We have the following central limit theorem for \hat\beta.

Theorem 3 We have

\sqrt{n} (\hat\beta - \beta) \rightsquigarrow N(0, \Gamma)

where

\Gamma = \Sigma^{-1} E\big[ (Y - X^T \beta)^2 X X^T \big] \Sigma^{-1}.

The covariance matrix \Gamma can be consistently estimated by

\hat\Gamma = \hat\Sigma^{-1} \hat M \hat\Sigma^{-1}

where

\hat M(j, k) = \frac{1}{n} \sum_{i=1}^n X_i(j) X_i(k) \hat\epsilon_i^2

and \hat\epsilon_i = Y_i - \hat\beta^T X_i.

The matrix \hat\Gamma is called the sandwich estimator. The Normal approximation can be used to
construct confidence intervals for \beta. For example, \hat\beta(j) \pm z_{\alpha/2} \sqrt{\hat\Gamma(j, j)/n} is an asymptotic 1 - \alpha
confidence interval for \beta(j). We can also get confidence intervals by using the bootstrap. Do
not use the textbook formulas for the standard errors of \hat\beta. These assume that the regression
function itself is linear. See Buja et al (2015) for details.
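A hedged numpy sketch of the sandwich covariance estimate and the resulting normal-approximation intervals; the data matrix X is assumed to include the constant column X_i(1) = 1, and the function name is made up for illustration.

```python
import numpy as np
from scipy import stats

def ols_sandwich_ci(X, Y, alpha=0.05):
    n = X.shape[0]
    Sigma_hat = X.T @ X / n
    beta_hat = np.linalg.solve(Sigma_hat, X.T @ Y / n)
    eps_hat = Y - X @ beta_hat
    M_hat = (X * eps_hat[:, None]**2).T @ X / n          # M(j,k) = mean of X(j) X(k) eps^2
    Sigma_inv = np.linalg.inv(Sigma_hat)
    Gamma_hat = Sigma_inv @ M_hat @ Sigma_inv            # sandwich estimator
    se = np.sqrt(np.diag(Gamma_hat) / n)
    z = stats.norm.ppf(1 - alpha / 2)
    return beta_hat, beta_hat - z * se, beta_hat + z * se
```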

2 High Dimensional Linear Regression

Now suppose that d > n. We can no longer use least squares. There are many approaches.

The simplest is to preprocess the data to reduce the dimension. For example, we can perform
PCA on the X's and use the first k principal components where k < n. Alternatively, we
can cluster the covariates based on their correlations. We can then use one feature from each
cluster or take the average of the covariates within each cluster. Another approach is to
screen the variables by choosing the k features with the largest correlation with Y. After
dimension reduction, we can then use least squares. These preprocessing methods can be very
effective.

A different approach is to use all the covariates but, instead of least squares, we shrink the
coefficients towards 0. This is called ridge regression and is discussed in the next section.

Yet another approach is model selection, where we try to find a good subset of the covariates.
Let S be a subset of {1, ..., d} and let X_S = (X(j) : j \in S). If the size of S is not too
large, we can regress Y on X_S instead of X.

In particular, fix k < n and let S_k denote all subsets of size k. For a given S \in S_k, let
\beta_S = \Sigma_S^{-1} \alpha_S be the best linear predictor for the subset S. We would like to choose S \in S_k to
minimize

E(Y - \beta_S^T X_S)^2.

This is equivalent to:

minimize E(Y - \beta^T X)^2 subject to \|\beta\|_0 \le k

where \|\beta\|_0 is the number of non-zero elements of \beta.

There will be a bias-variance tradeoff. As k increases, the bias decreases but the variance
increases.

We can approximate the risk with the training error. But the minimization is over all subsets
of size k and is NP-hard, so best subset regression is infeasible. We can
approximate best subset regression in two different ways: a greedy approximation or a convex
relaxation. The former leads to forward stepwise regression. The latter leads to the lasso.

All these methods involve a tuning parameter which can be chosen by cross-validation.

3 Ridge Regression

In this case we minimize

\frac{1}{n} \sum_i (Y_i - X_i^T \beta)^2 + \lambda \|\beta\|^2
Forward Stepwise Regression

1. Input k. Let S = \emptyset.
2. Let r_j = n^{-1} \sum_i Y_i X_i(j) denote the correlation between Y and the jth feature.
   Let J = argmax_j |r_j|. Let S = S \cup \{J\}.
3. Compute the regression of Y on X_S = (X(j) : j \in S). Compute the residuals
   e = (e_1, ..., e_n) where e_i = Y_i - \hat\beta_S^T X_i.
4. Compute the correlations r_j between the residuals e and the remaining features.
5. Let J = argmax_j |r_j|. Let S = S \cup \{J\}.
6. Repeat steps 3-5 until |S| = k.
7. Output S.

Figure 1: Forward Stepwise Regression

where \lambda \ge 0. The minimizer is

\hat\beta = (\hat\Sigma + \lambda I)^{-1} \hat\alpha.

As \lambda increases, the bias increases and the variance decreases.
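A one-line numpy sketch of the ridge estimator above (with \lambda = 0 it recovers least squares when \hat\Sigma is invertible):

```python
import numpy as np

def ridge(X, Y, lam):
    n, d = X.shape
    Sigma_hat = X.T @ X / n
    alpha_hat = X.T @ Y / n
    return np.linalg.solve(Sigma_hat + lam * np.eye(d), alpha_hat)
```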

Theorem 4 (Hsu, Kakade and Zhang 2014) Suppose that \|X_i\| \le r. Let \beta^T x be the
best linear approximation to m(x). Then, with probability at least 1 - 4e^{-t},

r(\hat\beta) - r(\beta) \le \Big( 1 + O\Big( \frac{1 + r^2/\lambda}{n} \Big) \Big) \Big( \frac{\lambda \|\beta\|^2}{2} + \frac{\sigma^2 \, tr(\Sigma)}{2 \lambda n} \Big).

Proposition 5 If Y = X^T \beta + \epsilon, \epsilon \sim N(0, \sigma^2) and \beta \sim N(0, \tau^2 I), then the posterior mean
is the ridge regression estimator with \lambda = \sigma^2/\tau^2.

4 Forward Stepwise Regression (Greedy Regression)

Forward stepwise regression is a greedy approximation to best subset regression. In what
follows, we will assume that the features have been standardized to have sample mean 0 and
sample variance n^{-1} \sum_i X_i(j)^2 = 1. The algorithm is in Figure 1; a short code sketch follows.
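A minimal numpy sketch of the algorithm in Figure 1, assuming standardized features; it refits least squares on the selected set at each step (the refit is what distinguishes this from the L2 boosting variant in the appendix). The function name is made up for illustration.

```python
import numpy as np

def forward_stepwise(X, Y, k):
    n, d = X.shape
    S, resid = [], Y.copy()
    beta_S = np.zeros(0)
    for _ in range(k):
        corr = X.T @ resid / n                    # correlations with current residuals
        corr[S] = 0.0                             # never reselect a chosen feature
        S.append(int(np.argmax(np.abs(corr))))
        beta_S, *_ = np.linalg.lstsq(X[:, S], Y, rcond=None)
        resid = Y - X[:, S] @ beta_S              # residuals from the refit on S
    return S, beta_S
```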

Now we will discuss the theory of forward stepwise regression. Let’s start with a functional,
noise-free version. We want to greedily approximate a function f using a dictionary of
functions D = {ψ1 , ψ2 , . . . , }. The elements of D are called atoms. Assume that ||ψ|| = 1 for
all ψ ∈ D. Assume that f and the atoms of the dictionary belong to a Hilbert space H.

1. Input: f .

2. Initialize: r0 = f , f0 = 0, V = ∅.

3. Repeat: At step N define

gN = argmaxψ∈D |hrN −1 , ψi|

and set V_N = V_{N-1} \cup \{g_N\}. Let f_N be the projection of f onto Span(V_N).

Let r_N = f - f_N.

Figure 2: The Orthogonal Greedy Algorithm.

Let \Sigma_N denote all linear combinations of elements of D with at most N terms. Define the
best N-term approximation error

\sigma_N(f) = \inf_{|\Lambda| \le N} \ \inf_{g \in Span(\Lambda)} \|f - g\|    (3)

where \Lambda denotes a subset of D and Span(\Lambda) is the set of linear combinations of functions in
\Lambda.

Suppose first that f is in the span of the dictionary. The function may then have more than
one expansion of the form f = \sum_j \beta_j \psi_j. We define the norm

\|f\|_{L_p} = \inf \|\beta\|_p

where the infimum is over all expansions of f. The functional version of stepwise regres-
sion, known as the Orthogonal Greedy Algorithm (OGA), is also known as Orthogonal
Matching Pursuit. The algorithm is given in Figure 2.

The algorithm produces a series of approximations f_N with corresponding residuals r_N. We
have the following two theorems from Barron et al (2008), the first dating back to DeVore
and Temlyakov (1996).

Theorem 6 For all f \in L_1, the residual r_N after N steps of OGA satisfies

\|r_N\| \le \frac{ \|f\|_{L_1} }{ \sqrt{N + 1} }    (4)

for all N \ge 1.

Proof. Note that f_N is the best approximation to f from Span(V_N). On the other hand, the
best approximation from the set \{a g_N : a \in R\} is \langle f, g_N \rangle g_N. The error of the former must be
smaller than the error of the latter. In other words, \|f - f_N\|^2 \le \|f - f_{N-1} - \langle r_{N-1}, g_N \rangle g_N\|^2.
Thus,

\|r_N\|^2 \le \|r_{N-1} - \langle r_{N-1}, g_N \rangle g_N\|^2
  = \|r_{N-1}\|^2 + |\langle r_{N-1}, g_N \rangle|^2 \|g_N\|^2 - 2 |\langle r_{N-1}, g_N \rangle|^2
  = \|r_{N-1}\|^2 - |\langle r_{N-1}, g_N \rangle|^2,    (5)

using \|g_N\| = 1. Now, f = f_{N-1} + r_{N-1} and \langle f_{N-1}, r_{N-1} \rangle = 0. So,

\|r_{N-1}\|^2 = \langle r_{N-1}, r_{N-1} \rangle = \langle r_{N-1}, f - f_{N-1} \rangle = \langle r_{N-1}, f \rangle
  = \sum_j \beta_j \langle r_{N-1}, \psi_j \rangle \le \sup_{\psi \in D} |\langle r_{N-1}, \psi \rangle| \sum_j |\beta_j|
  = \sup_{\psi \in D} |\langle r_{N-1}, \psi \rangle| \, \|f\|_{L_1} = |\langle r_{N-1}, g_N \rangle| \, \|f\|_{L_1}.

Continuing from equation (5), and using \|r_{N-1}\|^2 \le |\langle r_{N-1}, g_N \rangle| \|f\|_{L_1}, we have

\|r_N\|^2 \le \|r_{N-1}\|^2 - |\langle r_{N-1}, g_N \rangle|^2 = \|r_{N-1}\|^2 \Big( 1 - \frac{ |\langle r_{N-1}, g_N \rangle|^2 }{ \|r_{N-1}\|^2 } \Big)
  \le \|r_{N-1}\|^2 \Big( 1 - \frac{ \|r_{N-1}\|^2 }{ \|f\|_{L_1}^2 } \Big).

If a_0 \ge a_1 \ge a_2 \ge \cdots are nonnegative numbers such that a_0 \le M and a_N \le a_{N-1}(1 -
a_{N-1}/M), then it follows by induction that a_N \le M/(N + 1). The result follows by setting
a_N = \|r_N\|^2 and M = \|f\|_{L_1}^2.

If f is not in L1 , it is still possible to bound the error as follows.

Theorem 7 For all f \in H and h \in L_1,

\|r_N\|^2 \le \|f - h\|^2 + \frac{ 4 \|h\|_{L_1}^2 }{ N }.    (6)
P P
Proof. Choose any h \in L_1 and write h = \sum_j \beta_j \psi_j where \|h\|_{L_1} = \sum_j |\beta_j|. Write f =
f_{N-1} + (f - f_{N-1}) = f_{N-1} + r_{N-1} and note that r_{N-1} is orthogonal to f_{N-1}. Hence, \|r_{N-1}\|^2 =
\langle r_{N-1}, f \rangle and so

\|r_{N-1}\|^2 = \langle r_{N-1}, f \rangle = \langle r_{N-1}, h + f - h \rangle = \langle r_{N-1}, h \rangle + \langle r_{N-1}, f - h \rangle
  \le \langle r_{N-1}, h \rangle + \|r_{N-1}\| \|f - h\|
  = \sum_j \beta_j \langle r_{N-1}, \psi_j \rangle + \|r_{N-1}\| \|f - h\|
  \le \sum_j |\beta_j| |\langle r_{N-1}, \psi_j \rangle| + \|r_{N-1}\| \|f - h\|
  \le \max_j |\langle r_{N-1}, \psi_j \rangle| \sum_j |\beta_j| + \|r_{N-1}\| \|f - h\|
  = |\langle r_{N-1}, g_N \rangle| \|h\|_{L_1} + \|r_{N-1}\| \|f - h\|
  \le |\langle r_{N-1}, g_N \rangle| \|h\|_{L_1} + \frac{1}{2} ( \|r_{N-1}\|^2 + \|f - h\|^2 ).

Hence,

|\langle r_{N-1}, g_N \rangle|^2 \ge \frac{ ( \|r_{N-1}\|^2 - \|f - h\|^2 )^2 }{ 4 \|h\|_{L_1}^2 }.

Thus,

a_N \le a_{N-1} \Big( 1 - \frac{ a_{N-1} }{ 4 \|h\|_{L_1}^2 } \Big)

where a_N = \|r_N\|^2 - \|f - h\|^2. By induction, the last displayed inequality implies that
a_N \le 4 \|h\|_{L_1}^2 / N and the result follows.

Corollary 8 For each N,

\|r_N\|^2 \le \sigma_N^2 + \frac{ 4 \theta_N^2 }{ N }

where \theta_N is the L_1 norm of the best N-atom approximation.

In Figure 3 we re-express forward stepwise regression in a form closer to the notation we
have been using. In this version, we have a finite dictionary D_n and a data vector Y =
(Y_1, ..., Y_n)^T, and we use the empirical norm defined by

\|h\|_n = \sqrt{ \frac{1}{n} \sum_{i=1}^n h^2(X_i) }.

We assume that the dictionary is normalized in this empirical norm.

By combining the previous results with concentration of measure arguments (see appendix
for details) we get the following result, due to Barron, Cohen, Dahmen and DeVore (2008).

1. Input: Y ∈ Rn .

2. Initialize: r0 = Y , fb0 = 0, V = ∅.

3. Repeat: At step N define

g_N = argmax_{\psi \in D} |\langle r_{N-1}, \psi \rangle_n|

where \langle a, b \rangle_n = n^{-1} \sum_{i=1}^n a_i b_i. Set V_N = V_{N-1} \cup \{g_N\}. Let f_N be the projection
of Y onto Span(V_N). Let r_N = Y - f_N.

Figure 3: The Greedy (Forward Stepwise) Regression Algorithm: Dictionary Version

Theorem 9 Let h_n = argmin_{h \in F_N} \|f_0 - h\|^2. Suppose that \limsup_{n \to \infty} \|h_n\|_{L_1,n} < \infty. Let
N \sim \sqrt{n}. Then, for every \gamma > 0, there exists C > 0 such that

\|f - \hat f_N\|^2 \le 4 \sigma_N^2 + \frac{ C \log n }{ n^{1/2} }

except on a set of probability n^{-\gamma}.

Let us compare this with the lasso, which we will discuss next. Let f_L = \sum_j \beta_j \psi_j minimize
\|f - f_L\|^2 subject to \|\beta\|_1 \le L. Then, we will see that

\|f - \hat f_L\|^2 \le \|f - f_L\|^2 + O_P\Big( \Big( \frac{\log n}{n} \Big)^{1/2} \Big)

which is the same rate.

The rate n−1/2 is in fact optimal. It might be surprising that the rate is independent of the
dimension. Why do you think this is the case?

4.1 The Lasso

The lasso approximates best subset regression by using a convex relaxation. In particular,
the norm \|\beta\|_0 is replaced with \|\beta\|_1 = \sum_j |\beta_j|. (The norm \|\beta\|_1 can be thought of as a
measure of sparsity: the vectors x = (1/\sqrt{d}, ..., 1/\sqrt{d}) and y = (1, 0, ..., 0) have the same
L_2 norm, but \|y\|_1 = 1 while \|x\|_1 = \sqrt{d}.)

The lasso estimator \hat\beta is defined as the minimizer of

\sum_i (Y_i - \beta^T X_i)^2 + \lambda \|\beta\|_1.

This is a convex problem, so the estimator can be found efficiently. The estimator is sparse:
for large enough \lambda, many of the components of \hat\beta are 0. This is proved in the course on
convex optimization. Now we discuss some theoretical properties of the lasso.
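For concreteness, here is a simple proximal-gradient (iterative soft-thresholding) sketch for the penalized form above. It is not how production solvers work (coordinate descent is standard), but it shows where the sparsity comes from: the soft-thresholding step sets small coefficients exactly to zero.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, Y, lam, n_iter=1000):
    L = 2 * np.linalg.norm(X, 2)**2          # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ beta - Y)      # gradient of sum_i (Y_i - X_i^T beta)^2
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta
```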

The following result was proved in Zhao and Yu (2006), Meinshausen and Bühlmann (2005)
and Wainwright (2006). The version we state is from Wainwright (2006). Let β = (β1 , . . . , βs , 0, . . . , 0)
and decompose the design matrix as X = (XS XS c ) where S = {1, . . . , s}. Let βS =
(β1 , . . . , βs ).

Theorem 10 (Sparsistency) Suppose that:

1. The true model is linear.

2. The design matrix satisfies

\| X_{S^c}^T X_S (X_S^T X_S)^{-1} \|_\infty \le 1 - \epsilon   for some 0 < \epsilon \le 1.    (7)

3. \phi_n(d_n) > 0.

4. The \epsilon_i are Normal.

5. \lambda_n satisfies

\frac{ n \lambda_n^2 }{ \log(d_n - s_n) } \to \infty

and

\frac{1}{ \min_{1 \le j \le s_n} |\beta_j| } \Big( \sqrt{ \frac{\log s_n}{n} } + \lambda_n \Big\| \Big( \frac{1}{n} X^T X \Big)^{-1} \Big\| \Big) \to 0.    (8)

Then the lasso is sparsistent, meaning that P( support(\hat\beta) = support(\beta) ) \to 1, where
support(\beta) = \{ j : \beta(j) \ne 0 \}.

The conditions of this theorem are very strong. They are not checkable and they are unlikely
to ever be true in practice.

Theorem 11 (Consistency: Meinshausen and Yu 2006) Assume that

1. The true regression function is linear.

2. The columns of X have norm n and the covariates are bounded.

3. E(\exp |\epsilon_i|) < \infty and E(\epsilon_i^2) = \sigma^2 < \infty.

4. E(Y_i^2) \le \sigma_y^2 < \infty.

5. 0 < \phi_n(k_n) \le \Phi_n(k_n) < \infty for k_n = \min\{n, d_n\}.

6. \liminf_{n \to \infty} \phi_n(s_n \log n) > 0 where s_n = \|\beta_n\|_0.

Then

\| \beta_n - \hat\beta_n \|^2 = O_P\Big( \frac{\log n}{n} \cdot \frac{ s_n \log n }{ \phi_n^2(s_n \log n) } \Big) + O\Big( \frac{1}{\log n} \Big).    (9)

If

s_n \log d_n \Big( \frac{\log n}{n} \Big) \to 0    (10)

and

\lambda_n = \sqrt{ \frac{ \sigma_y^2 \Phi_n(\min\{n, d_n\}) n^2 }{ s_n \log n } }    (11)

then \| \hat\beta_n - \beta_n \|^2 \to 0 in probability.

Once again, the conditions of this theorem are very strong. They are not checkable and they
are unlikely to ever be true in practice.

The next theorem is the most important one. It does not require unrealistic conditions. We
state the theorem for bounded covariates. A more general version appears in Greenshtein
and Ritov (2004).

Theorem 12 Let Z = (Y, X). Assume that |Y| \le B and \max_j |X(j)| \le B. Let

\beta_* = argmin_{\|\beta\|_1 \le L} r(\beta)

where r(\beta) = E(Y - \beta^T X)^2. Thus, x^T \beta_* is the best, sparse linear predictor (in the L_1 sense).
Let \hat\beta be the lasso estimator:

\hat\beta = argmin_{\|\beta\|_1 \le L} \hat r(\beta)

where \hat r(\beta) = n^{-1} \sum_{i=1}^n (Y_i - X_i^T \beta)^2. With probability at least 1 - \delta,

r(\hat\beta) \le r(\beta_*) + \sqrt{ \frac{ 16 (L + 1)^4 B^2 }{ n } \log\Big( \frac{ \sqrt{2} \, d }{ \sqrt{\delta} } \Big) }.
Proof. Let Z = (Y, X) and Z_i = (Y_i, X_i). Define \gamma \equiv \gamma(\beta) = (-1, \beta). Then

r(\beta) = E(Y - \beta^T X)^2 = \gamma^T \Lambda \gamma

where \Lambda = E[Z Z^T]. Note that \|\gamma\|_1 = \|\beta\|_1 + 1. Let B = \{\beta : \|\beta\|_1 \le L\}. The training
error is

\hat r(\beta) = \frac{1}{n} \sum_{i=1}^n (Y_i - X_i^T \beta)^2 = \gamma^T \hat\Lambda \gamma

where \hat\Lambda = \frac{1}{n} \sum_{i=1}^n Z_i Z_i^T. For any \beta \in B,

|\hat r(\beta) - r(\beta)| = |\gamma^T (\hat\Lambda - \Lambda) \gamma| \le \sum_{j,k} |\gamma(j)| |\gamma(k)| |\hat\Lambda(j, k) - \Lambda(j, k)| \le \|\gamma\|_1^2 \Delta_n \le (L + 1)^2 \Delta_n

where

\Delta_n = \max_{j,k} |\hat\Lambda(j, k) - \Lambda(j, k)|.

So,

r(\hat\beta) \le \hat r(\hat\beta) + (L + 1)^2 \Delta_n \le \hat r(\beta_*) + (L + 1)^2 \Delta_n \le r(\beta_*) + 2 (L + 1)^2 \Delta_n.

Note that |Z(j) Z(k)| \le B^2 < \infty. By Hoeffding's inequality,

P( \Delta_n(j, k) \ge \epsilon ) \le 2 e^{-n \epsilon^2 / (2B^2)}

and so, by the union bound,

P( \Delta_n \ge \epsilon ) \le 2 d^2 e^{-n \epsilon^2 / (2B^2)} = \delta

if we choose \epsilon = \sqrt{ (4B^2/n) \log( \sqrt{2} d / \sqrt{\delta} ) }. Hence,

r(\hat\beta) \le r(\beta_*) + \sqrt{ \frac{ 16 (L + 1)^4 B^2 }{ n } \log\Big( \frac{ \sqrt{2} \, d }{ \sqrt{\delta} } \Big) }

with probability at least 1 - \delta.

Problems With Sparsity. Sparse estimators are convenient and popular but they can
have some problems. Say that \hat\beta is weakly sparsistent if, for every \beta,

P_\beta\big( I(\hat\beta_j \ne 0) \le I(\beta_j \ne 0) \text{ for all } j \big) \to 1    (12)

as n \to \infty. In particular, if \hat\beta_n is sparsistent, then it is weakly sparsistent. Suppose that d
is fixed. Then the least squares estimator \hat\beta_n is minimax and satisfies

\sup_\beta E_\beta( n \|\hat\beta_n - \beta\|^2 ) = O(1).    (13)

But sparsistent estimators have much larger risk:

Theorem 13 (Leeb and Pötscher 2007) Suppose that the following conditions hold:

1. d is fixed.
2. The covariates are nonstochastic and n^{-1} X^T X \to Q for some positive definite matrix
Q.
3. The errors \epsilon_i are independent with mean 0, finite variance \sigma^2 and have a density f
satisfying

0 < \int \Big( \frac{f'(x)}{f(x)} \Big)^2 f(x) dx < \infty.

If \hat\beta is weakly sparsistent then

\sup_\beta E_\beta( n \|\hat\beta_n - \beta\|^2 ) \to \infty.    (14)

More generally, if \ell is any nonnegative loss function then

\sup_\beta E_\beta\big( \ell( n^{1/2} (\hat\beta_n - \beta) ) \big) \to \sup_s \ell(s).    (15)


Proof. Choose any s \in R^d and let \beta_n = -s/\sqrt{n}. Then,

\sup_\beta E_\beta( \ell( n^{1/2} (\hat\beta - \beta) ) ) \ge E_{\beta_n}( \ell( n^{1/2} (\hat\beta - \beta_n) ) ) \ge E_{\beta_n}( \ell( n^{1/2} (\hat\beta - \beta_n) ) I(\hat\beta = 0) )
  = \ell( -\sqrt{n} \beta_n ) P_{\beta_n}(\hat\beta = 0) = \ell(s) P_{\beta_n}(\hat\beta = 0).

Now, P_0(\hat\beta = 0) \to 1 by assumption. It can be shown that we also have P_{\beta_n}(\hat\beta = 0) \to 1;
this follows from a property called contiguity. Hence, with probability tending to 1,

\sup_\beta E_\beta( \ell( n^{1/2} (\hat\beta - \beta) ) ) \ge \ell(s).

Since s was arbitrary, the result follows.

It follows that, if R_n denotes the minimax risk, then

\sup_\beta \frac{ R(\hat\beta_n) }{ R_n } \to \infty.
The implication is that when d is much smaller than n, sparse estimators have poor behavior.
However, when dn is increasing and dn > n, the least squares estimator no longer satisfies
(13). Thus we can no longer say that some other estimator outperforms the sparse estimator.
In summary, sparse estimators are well-suited for high-dimensional problems but not for low
dimensional problems.
2
This follows from a property called contiguity.

13
5 Cross Validation

The following result is from Gyorfi, Kohler, Krzyzak and Walk (2002). Let M =R{mh } be a
finite class of regression estimators indexed by a parameter h. Let mbh minimize |mh (x) −
m(x)|2 dP (x) over M. We want to show that cross-validation (data-splitting) leads to an
estimator with risk nearly as good as mbh .

Split the data into training D = {(X1 , Y1 ), . . . , (Xn , Yn )} and test D0 = {(X10 , Y10 ), . . . , (Xn0 , Yn0 )}.
Let mH minimize n−1 i∈D0 |Yi − mh (Xi )|2 . Assume that the data Yi and the estimators are
P
bounded by L.

Theorem 14 For any δ > 0,


C(1 + log |M |)
Z Z
E |mH (x) − m(x)| dP (x) ≤ (1 + δ)E |mbh (x) − m(x)|2 dP (x) +
2
n
where c = L2 (16/δ + 35 + 19δ).

Proof. Then
Z  Z 
2
E |mH − m| dP (x)|D = E |Y − mH | dP (x)|D − E|Y − m(X)|2
2

= T1 + T2
where Z 
T1 = E |Y − mH | dP (x)|D − E|Y − m(X)|2 − T2
2

and
1X 1X
|Yi − mH (Xi )|2 − |Yi − m(Xi )|2 ≤ (1+δ) |Yi − mbh (Xi )|2 − |Yi − m(Xi )|2
 
T2 = (1+δ)
n D0 n D0

and so
E[T2 |D] ≤ (1 + δ) E(|Y − mbh (X)|2 |D) − E|Y − m(X)|2

Z
= (1 + δ) |mbh (x) − m(x)|2 dP (x).

The second part of the proof involves some tedious calculations. We will bound P (T1 ≥ s|D).
The event T1 ≥ s is the same as
!
2 2 1X 2 2

(1 + δ) E(|mH (X) − Y | |D) − E|m(X) − Y | − |Yi − mH (Xi )| − |Yi − m(Xi )| ≥
n D0
s + δ E|mH (X) − Y |2 − E|m(X) − Y |2 .


14
This has probability at most |M| times the probabilty that
!
2 1X 2
|Yi − mH (Xi )|2 − |Yi − m(Xi )|2

(1 + δ) E(|mh (X) − Y | |D) − E|m(X) − Y | − ≥
n D0
s + δ E|mh (X) − Y |2 − E|m(X) − Y |2


for some h, that is


1X s + δE[Z|D]
E[Z|D] − Zi ≥
n i 1+δ
for some h, where Z = |mh (X) − Y |2 − |m(X) − Y |2 . Now
Z
2 2 2
σ = Var(Z|D) ≤ E[Z |D] ≤ 16L |mh (x) − m(x)|2 dP (x) = 16L2 E[Z|D].

Using this, and Bernstein’s inequality,


 
s + δE[Z|D]
P E[Z|D] − Z ≥ |D
1+δ
s + δσ 2 /(16L2 )
 
≤ P E[Z|D] − Z ≥ |D
1+δ
≤ e−nA/B

where
δσ 2
 
1
A= s+
(1 + δ)2 16L2
and
2 8L2 δσ 2
 
2
B = 2σ + s+ .
31+δ 16L2
Now A/B ≥ s/c for c = L2 (16/δ + 35 + 19δ). So

P (T1 ≥ s|D) ≤ |M|e−ns/c .

Finally Z ∞
c|M| −nu/c
E[T1 |D] ≤ u + P (T1 > s|D) ≤ u + e .
u n
The result follows by setting u = c log |M|/n. 

6 Inference?

Is it possible to do inference after model selection? Do we need to? I’ll discuss this in class.

15
References

Buja, Berk, Brown, George, Pitkin, Traskin, Zhao and Zhang (2015). Models as Apprx-
imations — A Conspiracy of Random Regressors and Model Deviations Against Classical
Inference in Regression. Statistical Science.

Hsu, Kakade and Zhang (2014). Random design analysis and ridge regression. arXiv:1106.2363.

Gyorfi, Kohler, Krzyzak and Walk. (2002). A Distribution-Free Theory of Nonparametric


Regression. Springer.

Appendix: L2 Boosting
(0) (k)
Define estimators m
bn ,...,m b (0) (x) = 0 and then iterate the follow-
b n , . . . , as follows. Let m
ing steps:

1. Compute the residuals Ui = Yi − m b (k) (Xi ).

2. Regress the residuals on the Yi ’s: βbj = i Ui Xij / i Xij2 , j = 1, . . . , d.


P P

− βbJ XiJ )2 .
P
3. Find J = argminj RSSj where RSSj = i (Ui

b (k+1) (x) = m
4. Set m b (k) (x) + βbJ xJ .

The version above is called L2 boosting or matching pursuit. A variation is to set


mb (k+1) (x) = m b (k) (x) + ν βbJ xJ where 0 < ν ≤ 1. Another variation is to set m b (k+1) (x) =
mb (k) (x)+νsign(βbJ )xJ which is called forward stagewise regression. Yet another variation
is to set mb (k) to be the linear regression estimator based on all variables selected up to that
point. This is forward stepwise regression or orthogonal matching pursuit.

Theorem 15 The matching pursuit estimator is linear. In particular,

Yb (k) = Bk Y (16)

where Yb (k) = (m
b (k) (X1 ), . . . , m
b (k) (Xn ))T ,

Bk = I − (I − Hk )(I − Hk−1 ) · · · (I − H1 ), (17)

and
Xj XTj
Hj = . (18)
kXj k2

16
Pdn
Theorem 16 (Bühlmann 2005) Let mn (x) = j=1 βj,n xj be the best linear approxima-
tion based on dn terms. Suppose that:
1−ξ
(A1 Growth) dn ≤ C0 eC1 n for some C0 , C1 > 0 and some 0 < ξ ≤ 1.

(A2 Sparsity) supn dj=1


Pn
|βj,n | < ∞.

(A3 Bounded Covariates) supn max1≤j≤dn maxi |Xij | < ∞ with probability 1.

(A4 Moments) E||s < ∞ for some s > 4/ξ.

Then there exists kn → ∞ such that

b n (X) − mn (x)|2 → 0
EX |m (19)

as n → 0.

We won’t prove the theorem


R but we will outline the idea. Let H be a Hilbert space with
inner product hf, gi = f (x)g(x)dP (x). Let D be a dictionary, that is a set of functions,
each of unit norm, that span H. Define a functional version of matching pursuit, known as
the weak greedy algorithm, as follows. Let R0 (f ) = f , F0 = 0. At step k, find gk ∈ D so
that
|hRk−1 (f ), gk i| ≥ tk sup |hRk−1 (f ), hi|
h∈D

for some 0 < tk ≤ 1. In the weak greedy algorithm we take Fk = Fk−1 +hf, gk igk . In the weak
orthogonal greedy algorithm we take Fk to be the projection of Rk−1 (f ) onto {g1 , . . . , gk }.
Finally set Rk (f ) = f − Fk .

Theorem 17 (Temlyakov 2000) Let f (x) = j βj gj (x) where gj ∈ D and ∞


P P
j=1 |βj | ≤
B < ∞. Then, for the weak orthogonal greedy algorithm
B
kRk (f )k ≤  1/2 (20)
Pk 2
1+ j=1 tj

and for the weak greedy algorithm


B
kRk (f )k ≤  tk /(2(2+tk )) . (21)
Pk 2
1+ j=1 tj

L2 boosting essentially replaces hf, Xj i with hY, Xj in = n−1 i Yi Xij . Now hY, Xj in has
P
mean hf, Xj i. The main burden of the proof is to show that hY, Xj in is close to hf, Xj i with

17
high probability and then apply Temlyakov’s result. For this we use Bernstein’s inequality.
Recall that if |Zj | are bounded by M and Zj has variance σ 2 then

n2
 
1
P(|Z − E(Zj )| > ) ≤ 2 exp − 2 . (22)
2 σ + M /3
Hence, the probability that any empirical inner products differ from their functional coun-
terparts is no more than
n2
 
2 1
dn exp − 2 →0 (23)
2 σ + M /3
because of the growth condition.

Appendix: Proof of Theorem 9

The L1 norm depends on n and so we denote this by khkL1,n . For technical reasons, we
assume that kf k∞ ≤ B, that fbn is truncated to be no more than B and that kψk∞ ≤ B for
all ψ ∈ Dn .

Theorem 18 Suppose that pn ≡ |D|n ≤ nc for some c ≥ 0. Let fbN be the output of
the stepwise regression algorithm after N steps. Let f (x) = E(Y |X = x) denote the true
regression function. Then, for every h ∈ Dn ,
!
2
8khkL CN log n 1
P kf − fbN k2 > 4kf − hk2 + 1,n
+ < γ
N n n

for some positive constants γ and C.

Before proving this theorem, we need some preliminary results. For any Λ ⊂ D, let SΛ =
Span(Λ). Define ( )
[
FN = SΛ : |Λ| ≤ N .

Recall that, if F is a set of functions then Np (, F, ν) is the Lp covering entropy with respect
to the probability measure ν and Np (, F) is the supremum of Np (, F, ν) over all probability
measures ν.

Lemma 19 For every t > 0, and every Λ ⊂ Dn ,


|Λ|+1 |Λ|+1
2eB 2 3eB 2
   
2eB 3eB
N1 (t, SΛ ) ≤ 3 log , N2 (t, SΛ ) ≤ 3 log .
t t t2 t2

18
Also,
N +1 N +1
2eB 2 3eB 2
   
N 2eB 3eB N
N1 (t, FN ) ≤ 12p log , N2 (t, FN ) ≤ 12p log .
t t t2 t2

Proof. The first two equation follow from standard covering arguments. The second two
equations follow from the fact that the number of subsets of Λ of size at most N is
N   X N  j
X p ep  ep N
N
 p N
≤ ≤N ≤ p max N ≤ 4pN .
j j N N N
j=1 j=1

The following lemma is from Chapter 11 of Gyorfi et al. The proof is long and technical and
we omit it.

Lemma 20 Suppose that |Y | ≤ B, where B ≥ 1, and F is a set of real-valued


R functions
such that kf k∞ ≤ B for all f ∈ F. Let f0 (x) = E(Y |X = x) and kgk2 = g 2 (x)dP (x).
Then, for every α, β > 0 and  ∈ (0, 1/2],
!
P (1 − )kf − f0 k2 ≥ kY − f k2n − kY − f0 k2n + (α + β) for some f ∈ F

2 (1 − )αn
   
β
≤ 14N1 ,F exp − .
20B 214(1 + )B 4

Proof of Theorem 18. For any h ∈ Fn we have


!
kfb − f0 k2n = kfb − f0 k2 − 2 kY − fbk2n − kY − f0 k2n
| {z }
A1
! !
+ 2 kY − fbk2n − kY − hk2n + 2 kY − hk2n − kY − f0 k2n .
| {z } | {z }
A2 A3

Apply Lemma 20 with  = 1/2 together with Lemma 19 to conclude that, for C0 > 0 large
enough,  
C0 N log n 1
P A1 > for some f < γ .
n n
To bound A2 , apply Theorem 7 with norm k · kn and with Y replacing f . Then,
4khk21,n
kY − fbk2n ≤ kY − hk2n +
k
19
8khk21,n
and hence A2 ≤ k
. Next, we have that

E(A3 ) = kf0 − hk2

and for large enough C1 ,


 
C1 N log n 1
P A3 > kf0 − hk2 + for some f < .
n nγ

20
A Closer Look at Sparse Regression
Ryan Tibshirani
(ammended by Larry Wasserman)

1 Introduction
In these notes we take a closer look at sparse linear regression. Throughout, we
make the very strong assumption that Yi = β T Xi + i where E[i |Xi ] = 0 and
Var(i |Xi ) = σ 2 . These assumptions are highly unrealistic but they permit a more de-
tailed analysis. There are several books on high-dimensional estimation: Hastie, Tib-
shirani & Wainwright (2015), Buhlmann & van de Geer (2011), Wainwright (2017).

2 Best subset selection, ridge regression, and the lasso


2.1 Three norms: `0 , `1 , `2
In terms of regularization, we typically choose the constraint set C to be a sublevel set
of a norm (or seminorm), and equivalently, the penalty function P (·) to be a multiple
of a norm (or seminorm)
Let’s consider three canonical choices: the `0 , `1 , and `2 norms:
p p p
X 1/2
X X
kβk0 = 1{βj 6= 0}, kβk1 = |βj |, kβk2 = βj2 .
j=1 j=1 j=1

(Truthfully, calling it “the `0 norm” is a misnomer, since it is not a norm: it does not
satisfy positive homogeneity, i.e., kaβk0 6= akβk0 whenever a 6= 0, 1.)
In constrained form, this gives rise to the problems:

min ky − Xβk22 subject to kβk0 ≤ k (Best subset selection) (1)


β∈Rp

min ky − Xβk22 subject to kβk1 ≤ t (Lasso regression) (2)


β∈Rp

min ky − Xβk22 subject to kβk22 ≤ t (Ridge regession) (3)


β∈Rp

where k, t ≥ 0 are tuning parameters. Note that it makes sense to restrict k to be


an integer; in best subset selection, we are quite literally finding the best subset of
variables of size k, in terms of the achieved training error
Though it is likely the case that these ideas were around earlier in other contexts, in
statistics we typically subset selection to Beale et al. (1967), Hocking & Leslie (1967),
ridge regression to Hoerl & Kennard (1970), and the lasso to Tibshirani (1996), Chen
et al. (1998)

1
In penalized form, the use of `0 , `1 , `2 norms gives rise to the problems:
1
min ky − Xβk22 + λkβk0 (Best subset selection) (4)
β∈Rp 2
1
minp ky − Xβk22 + λkβk1 (Lasso regression) (5)
β∈R 2
1
min ky − Xβk22 + λkβk22 (Ridge regression) (6)
β∈Rp 2

with λ ≥ 0 the tuning parameter. In fact, problems (2), (5) are equivalent. By this,
we mean that for any t ≥ 0 and solution βb in (2), there is a value of λ ≥ 0 such
that βb also solves (5), and vice versa. The same equivalence holds for (3), (6). (The
factors of 1/2 multiplying the squared loss above are inconsequential, and just for
convenience)
It means, roughly speaking, that computing solutions of (2) over a sequence of
t values and performing cross-validation (to select an estimate) should be basically
the same as computing solutions of (5) over some sequence of λ values and perform-
ing cross-validation (to select an estimate). Strictly speaking, this isn’t quite true,
because the precise correspondence between equivalent t, λ depends on the data X, y
Notably, problems (1), (4) are not equivalent. For every value of λ ≥ 0 and
solution βb in (4), there is a value of t ≥ 0 such that βb also solves (1), but the converse
is not true

2.2 A Toy Example


It is helpful to first consider a toy example. Suppose that Y ∼ N (µ, 1). Let’s consider
the three different estimators we get using the following three different loss functions:
1 1 1
(Y − µ)2 + λ||µ||0 , (Y − µ)2 + λ|µ|, (Y − µ)2 + λµ2 .
2 2 2
You should verify that the solutions are
√ Y
µ
b = H(Y ; 2λ), µ b = S(Y ; λ), µb=
1 + 2λ
where H(y; a) = yI(|y| > a) is the hard-thresholding operator, and

y − a if y > a

S(y; a) = 0 if − a ≤ y ≤ a

y + a if y < a.

Hard thresholding creates a “zone of sparsity” but it is discontinuous. Soft thresh-


olding also creates a “zone of sparsity” but it is scontinuous. The L2 loss creates a
nice smooth estimator but it is never sparse. (You can verify the solution to the L1
problem using sub-differentials if you know convex analysis, or by doing three cases
separately: µ > 0, µ = 0, µ < 0.)

2
TABLE 3.4. Estimators of βj in the case of orthonormal columns of X. M and λ
are constants chosen by the corresponding techniques; sign denotes the sign of its
argument (±1), and x+ denotes “positive part” of x. Below the table, estimators
are shown by broken red lines. The 45◦ line in gray shows the unrestricted estimate
for reference.
2.3 Sparsity Estimator Formula
The best subset selection and the lasso estimators have a special, useful property:
Best subset
their solutions are sparse, i.e., at (size M ) βb β̂
a solution wej ·will j | ≥β
I(|β̂have b|jβ̂=
(M0) |)
for many components
j ∈ {1, . . . , p}. In problem
Ridge (1), this is obviously true,
β̂j /(1 λ) k ≥ 0 controls the sparsity
+where
level. In problem (2), it is less obviously true, but we get a higher degree of sparsity
the smaller the value Lasso
of t ≥ 0. In the penalized sign(forms, j | −(5),
β̂j )(|β̂(4), λ)+we get more sparsity
the larger the value of λ ≥ 0
This isBestnotSubset
true of ridge regression,Ridge i.e., the solution of (3) orLasso (6) generically has
all nonzero components, no matter the value of t or λ. Note that sparsity is desirable, λ
for two reasons: (i) it corresponds to performing variable selection in the constructed
linear model, and (ii) it) |provides a level of interpretability (beyond sheer accuracy)
|β̂(M
That the `0 (0,0)norm induces sparsity is obvious.(0,0) But, why does the(0,0)`1 norm induce
sparsity and not the `2 norm? There are different ways to look at it; let’s stick
with intuition from the constrained problem forms (2), (5). Figure 1 shows the
“classic” picture, contrasting the way the contours of the squared error loss hit the
two constraint sets, the `1 and `2 balls. As the `1 ball has sharp corners (aligned with
the coordinate axes), we get sparse solutions

β2 ^
β
. β2 ^
β
.

β1 β1

FIGURE 3.11. Estimation picture for the lasso (left) and ridge regression
Figure 1: The “classic” illustration comparing lasso and ridge constraints. From
(right). Shown are contours of the error and constraint functions. The solid blue
Chapter 3 of Hastie et al. (2009)
areas are the constraint regions |β | + |β | ≤ t and β 2 + β 2 ≤ t2 , respectively,
1 2 1 2
while the red ellipses are the contours of the least squares error function.
Intuition can also be drawn from the orthogonal case. When X is orthogonal, it
is not hard to show that the solutions of the penalized problems (4), (5), (6) are
XT y
βbsubset = H√2λ (X T y), βblasso = Sλ (X T y), βbridge =
1 + 2λ

3
respectively, where Ht (·), St (·) are the componentwise hard- and soft-thresholding
functions at the level t. We see several revealing properties: subset selection and
lasso solutions exhibit sparsity when the componentwise least squares coefficients
(inner products X T y) are small enough; the lasso solution exihibits shrinkage, in
that large enough least squares coefficients are shrunken towards zero by λ; the ridge
regression solution is never sparse and compared to the lasso, preferentially shrinkage
the larger least squares coefficients even more

2.4 Convexity
The lasso and ridge regression problems (2), (3) have another very important prop-
erty: they are convex optimization problems. Best subset selection (1) is not, in fact
it is very far from being convex. Consider using the norm ||β||p as a penalty. Sparsity
requires p ≤ 1 and convexity requires p ≥ 1. The only norm that gives sparsity and
convexity is p = 1. The appendix has a brief review of convexity.

2.5 Theory For Subset Selection


Despite its computational intractability, best subset selection has some attractive risk
properties. A classic result is due to Foster & George (1994), on the in-sample risk of
best subset selection in penalized form (4), which we will paraphrase here. First, we
raise a very simple point: if A denotes the support (also called the active set) of the
subset selection solution βb in (4)—meaning that βbj = 0 for all j ∈ / A, and denoted
A = supp(β)—then
b we have
βbA = (XAT XA )−1 XAT y,
(7)
βb−A = 0.
Here and throughout we write XA for the columns of matrix X in a set A, and xA for
the components of a vector x in A. We will also use X−A and x−A for the columns
or components not in A. The observation in (7) follows from the fact that, given the
support set A, the `0 penalty term in the subset selection criterion doesn’t depend
on the actual magnitudes of the coefficients (it contributes a constant factor), so the
problem reduces to least squares.
Now, consider a standard linear model as with X fixed, and  ∼ N (0, σ 2 I). Sup-
pose that the underlying coefficients have support S = supp(β0 ), and s0 = |S|. Then,
the estimator given by least squares on S, i.e.,
βboracle = (X T XS )−1 X T y,
S S S
oracle
βb−S = 0.
is is called oracle estimator, and as we know from our previous calculations, has
in-sample risk
1 s0
kX βboracle − Xβ0 k22 = σ 2 .
n n
4
Foster & George (1994) consider this setup, and compare the risk of the best
subset selection estimator βb in (4) to the oracle risk of σ 2 s0 /n. They show that, if
we choose λ  σ 2 log p, then the best subset selection estimator satisfies
EkX βb − Xβ0 k22 /n
≤ 4 log p + 2 + o(1), (8)
σ 2 s0 /n
as n, p → ∞. This holds without any conditions on the predictor matrix X. Moreover,
they prove the lower bound
EkX βb − Xβ0 k22 /n
inf sup ≥ 2 log p − o(log p),
βb X,β0 σ 2 s0 /n
where the infimum is over all estimators β, b and the supremum is over all predictor
matrices X and underlying coefficients with kβ0 k0 = s0 . Hence, in terms of rate, best
subset selection achieves the optimal risk inflation over the oracle risk.
Returning to what was said above, the kicker is that we can’t really compute
the best subset selection estimator for even moderately-sized problems. As we will
in the following, the lasso provides a similar risk inflation guarantee, though under
considerably stronger assumptions.
Lastly, it is worth remarking that even if we could compute the subset selection
estimator at scale, it’s not at all clear that we would want to use this in place of the
lasso. (Many people assume that we would.) We must remind ourselves that theory
provides us an understanding of the performance of various estimators under typically
idealized conditions, and it doesn’t tell the complete story. It could be the case that
the lack of shrinkage in the subset selection coefficients ends up being harmful in
practical situations, in a signal-to-noise regime, and yet the lasso could still perform
favorably in such settings.
Update. Some nice recent work in optimization (Bertsimas et al. 2016) shows
that we can cast best subset selection as a mixed integer quadratic program, and
proposes to solve it (in general this means approximately, though with a certified
bound on the duality gap) with an industry-standard mixed integer optimization
package like Gurobi. However, in a recent paper, Hastie, Tibshirani and Tibshirani
(arXiv:1707.08692) show that best subset selection does not do well statistically unless
there is an extremely high signal to noise ratio.

3 Basic properties and geometry of the lasso


3.1 Ridge regression and the elastic net
A quick refresher: the ridge regression problem (6) is always strictly convex (assuming
λ > 0), due to the presense of the squared `2 penalty kβk22 . To be clear, this is true
regardless of X, and so the ridge regression solution is always well-defined, and is in
fact given in closed-form by βb = (X T X + 2λI)−1 X T y.

5
3.2 Lasso
Now we turn to subgradient optimality (sometimes called the KKT conditions) for
the lasso problem in (5). They tell us that any lasso solution βb must satisfy

X T (y − X β)
b = λs, (9)

where s ∈ ∂kβk
b 1 , a subgradient of the `1 norm evaluated at β.
b Precisely, this means
that 
{+1}
 βbj > 0
sj ∈ {−1} βbj < 0 j = 1, . . . , p. (10)

[−1, 1] βbj = 0,

From (9) we can read off a straightforward but important fact: even though the
solution βb may not be uniquely determined, the optimal subgradient s is a function
of the unique fitted value X βb (assuming λ > 0), and hence is itself unique.
Now from (10), note that the uniqueness of s implies that any two lasso solutions
must have the same signs on the overlap of their supports. That is, it cannot happen
that we find two different lasso solutions βb and βe with βbj > 0 but βej < 0 for some
j, and hence we have no problem interpretating the signs of components of lasso
solutions.
Let’s assume henceforth that the columns of X are in general position (and we
are looking at a nontrivial end of the path, with λ > 0), so the lasso solution βb is
unique. Let A = supp(β) b be the lasso active set, and let sA = sign(βbA ) be the signs
of active coefficients. From the subgradient conditions (9), (10), we know that

XAT (y − XA βbA ) = λsA ,

and solving for βbA gives

βbA = (XAT XA )−1 (XAT y − λsA ),


(11)
βb−A = 0

(where recall we know that XAT XA is invertible because X has columns in general
position). We see that the active coefficients βbA are given by taking the least squares
coefficients on XA , (XAT XA )−1 XAT y, and shrinking them by an amount λ(XAT XA )−1 sA .
Contrast this to, e.g., the subset selection solution in (7), where there is no such
shrinkage.
Now, how about this so-called shrinkage term (XAT XA )−1 XAT y? Does it always
act by moving each one of the least squares coefficients (XAT XA )−1 XAT y towards zero?
Indeed, this is not always the case, and one can find empirical examples where a
lasso coefficient is actually larger (in magnitude) than the corresponding least squares
coefficient on the active set. Of course, we also know that this is due to the correlations

6
between active variables, because when X is orthogonal, as we’ve already seen, this
never happens.
On the other hand, it is always the case that the lasso solution has a strictly
smaller `1 norm than the least squares solution on the active set, and in this sense,
we are (perhaps) justified in always referring to (XAT XA )−1 XAT y as a shrinkage term.
To see this, note that, for any vector b, ||b||1 = sT b where s is the vector of signs of
b 1 = sT βb = sT βbA and so
b. So ||β|| A

b 1 = sT (X T XA )−1 X T y − λsT (X T XA )−1 sA < k(X T XA )−1 X T yk1 .


kβk (12)
A A A A A A A

The first term is less than or equal to k(XAT XA )−1 XAT yk1 , and the term we are sub-
tracting is strictly negative (because (XAT XA )−1 is positive definite).

4 Theoretical analysis of the lasso


4.1 Slow rates
There has been an enormous amount theoretical work analyzing the performance of
the lasso. Some references (warning: a highly incomplete list) are Greenshtein &
Ritov (2004), Fuchs (2005), Donoho (2006), Candes & Tao (2006), Meinshausen &
Buhlmann (2006), Zhao & Yu (2006), Candes & Plan (2009), Wainwright (2009); a
helpful text for these kind of results is Buhlmann & van de Geer (2011).
We begin by stating what are called slow rates for the lasso estimator. Most of
the proofs are simple enough that they are given below. These results don’t place
any real assumptions on the predictor matrix X, but deliver slow(er) rates for the
risk of the lasso estimator than what we would get under more assumptions, hence
their name.
We will assume the standard linear model with X fixed, and  ∼ N (0, σ 2 ). We
will also assume that kXj k22 ≤ n, for j = 1, . . . , p. That the errors are Gaussian can
be easily relaxed to sub-Gaussianity.
The lasso estimator in bound form (2) is particularly easy to analyze. Suppose that
we choose t = kβ0 k1 as the tuning parameter. Then, simply by virtue of optimality
of the solution βb in (2), we find that
b 2 ≤ ky − Xβ0 k2 ,
ky − X βk 2 2

or, expanding and rearranging,


kX βb − Xβ0 k22 ≤ 2h, X βb − Xβ0 i.
Here we denote ha, bi = aT b. The above is sometimes called the basic inequality
(for the lasso in bound form). Now, rearranging the inner product, using Holder’s
inequality, and recalling the choice of bound parameter:
kX βb − Xβ0 k22 ≤ 2hX T , βb − β0 i ≤ 4kβ0 k1 kX T k∞ .

7
Notice that kX T k∞ = maxj=1,...,p |XjT | is a maximum of p Gaussians, each with
mean zero and variance upper bounded by σ 2 n. By a standard maximal inequality
for Gaussians, for any δ > 0,
p
max |XjT | ≤ σ 2n log(ep/δ),
j=1,...,p

with probability at least 1−δ. Plugging this to the second-to-last display and dividing
by n, we get the finite-sample result for the lasso estimator
r
1 2 2 log(ep/δ)
kX βb − Xβ0 k2 ≤ 4σkβ0 k1 , (13)
n n
with probability at least 1 − δ.
The high-probability result (13) implies an in-sample risk bound of
r
1 log p
EkX βb − Xβ0 k2 . kβ0 k1
2
.
n n
Compare to this with the risk bound (8) for best subset selection, which is on the
(optimal) order of s0 log p/n when β0 has s0 nonzero components. If each of the
nonzero components here has constant p magnitude, then above risk bound for the
lasso estimator is on the order of s0 log p/n, which is much slower.
Predictive risk. Instead of in-sample risk, we might also be interested in out-
of-sample risk, as after all that reflects actual (out-of-sample) predictions. In least
squares, recall, we saw that out-of-sample risk was generally higher than in-sample
risk. The same is true for the lasso Chatterjee (2013) gives a nice, simple analysis of
out-of-sample risk for the lasso. He assumes that x0 , xi , i = 1, . . . , n are i.i.d. from
an arbitrary distribution supported on a compact set in Rp , and shows that the lasso
estimator in bound form (2) with t = kβ0 k1 has out-of-sample risk satisfying
r
log p
E(x0 β − x0 β) . kβ0 k1
Tb T 2 2
.
n
The proof is not much more complicated than the above, for the in-sample risk, and
reduces to a clever application of Hoeffding’s inequality, though we omit it for brevity.
Note here the dependence on kβ0 k21 , rather than kβ0 k1 as in the in-sample risk. This
agrees with the analysis we did in the previous set of notes where we did not assume
the linear model. (Only the interpretation changes.)

Oracle inequality. If we don’t want to assume linearity of the mean then we


can still derive an oracle inequality that characterizes the risk of the lasso estimator
in excess of the risk of the best linear predictor. For this part only, assume the more
general model
y = µ(X) + ,

8
with an arbitrary mean function µ(X), and normal errors  ∼ N (0, σ 2 ). We will
analyze the bound form lasso estimator (2) for simplicity. By optimality of β,
b for any
1
other β feasible for the lasso problem in (2), it holds that
e

hX T (y − X β),
b βe − βi
b ≤ 0. (14)

Rearranging gives

hµ(X) − X β, b ≤ hX T , βb − βi.
b X βe − X βi e (15)

Now using the polarization identity kak22 + kbk22 − ka − bk22 = 2ha, bi,

kX βb − µ(X)k22 + kX βb − X βk
e 2 ≤ kX βe − µ(X)k2 + 2hX T , βb − βi,
2 2
e

and from the exact same arguments as before, it holds that


r
1 1 e 2 ≤ kX βe − µ(X)k2 + 4σt 2 log(ep/δ) ,
1
kX βb − µ(X)k22 + kX βb − X βk 2 2
n n n n
with probability at least 1 − δ. This holds simultaneously over all βe with kβk
e 1 ≤ t.
Thus, we may write, with probability 1 − δ,
  r
1 2 1 2 2 log(ep/δ)
kX βb − µ(X)k2 ≤ inf kX βe − µ(X)k2 + 4σt .
n e 1 ≤t n
kβk n

Also if we write X βebest as the best linear that predictor of `1 at most t, achieving
the infimum on the right-hand side (which we know exists, as we are minimizing a
continuous function over a compact set), then
r
1 2 log(ep/δ)
kX βb − X βebest k22 ≤ 4σt ,
n n
with probability at least 1 − δ

4.2 Fast rates


Under very strong assumptions we can get faster rates. For example, if we assume
that X satisfies the restricted eigenvalue condition with constant φ0 > 0, i.e.,

1
kXvk22 ≥ φ20 kvk22 for all subsets J ⊆ {1, . . . , p} such that |J| = s0
n
and all v ∈ Rp such that kvJ c k1 ≤ 3kvJ k1 (16)
1
To see this, consider minimizing a convex function f (x) over a convex set C. Let xb be a
minimizer. Let z ∈ C be any other point in C. If we move away from the solution xb we can only
x). In other words, h∇f (b
increase f (b x), z − zbi ≥ 0.

9
then
s0 log p
kβb − β0 k22 . (17)
nφ20
with probability tending to 1. (This condition can be slightly weakened, but not
much.) The condition is unlikely to hold in any real problem. Nor is it checkable.
The proof is in the appendix.

4.3 Support recovery


Here we discuss results on support recovery of the lasso estimator. There are a few
versions of support recovery results and again Buhlmann & van de Geer (2011) is
a good place to look for a thorough coverage. Here we describe a result due to
Wainwright (2009), who introduced a proof technique called the primal-dual witness
method. The assumptions are even stronger (and less believable) than in the previous
section. In addition to the previous assumptions we need:
Mutual incoherence: for some γ > 0, we have

k(XST XS )−1 XST Xj k1 ≤ 1 − γ, for j ∈


/ S,

Minimum eigenvalue: for some C > 0, we have


 
1 T
Λmin X XS ≥ C,
n S

where Λmin (A) denotes the minimum eigenvalue of a matrix A


Minimum signal:
4γλ
β0,min = min |β0,j | ≥ λk(XST XS )−1 k∞ + √ ,
j∈S C
Pq
where kAk∞ = maxi=1,...,m j=1 |Aij | denotes the `∞ norm of an m × q matrix A
Under these assumptions, once can show that, if λ is chosen just right, then
b = support(β)) → 1.
P (support(β) (18)

The proof is in the appendix.

References
Beale, E. M. L., Kendall, M. G. & Mann, D. W. (1967), ‘The discarding of variables
in multivariate analysis’, Biometrika 54(3/4), 357–366.

Bertsimas, D., King, A. & Mazumder, R. (2016), ‘Best subset selection via a modern
optimization lens’, The Annals of Statistics 44(2), 813–852.

10
Buhlmann, P. & van de Geer, S. (2011), Statistics for High-Dimensional Data,
Springer.

Candes, E. J. & Plan, Y. (2009), ‘Near ideal model selection by `1 minimization’,


Annals of Statistics 37(5), 2145–2177.

Candes, E. J. & Tao, T. (2006), ‘Near optimal signal recovery from random projec-
tions: Universal encoding strategies?’, IEEE Transactions on Information Theory
52(12), 5406–5425.

Chatterjee, S. (2013), Assumptionless consistency of the lasso. arXiv: 1303.5817.

Chen, S., Donoho, D. L. & Saunders, M. (1998), ‘Atomic decomposition for basis
pursuit’, SIAM Journal on Scientific Computing 20(1), 33–61.

Donoho, D. L. (2006), ‘Compressed sensing’, IEEE Transactions on Information The-


ory 52(12), 1289–1306.

Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. (2004), ‘Least angle regression’,
Annals of Statistics 32(2), 407–499.

Foster, D. & George, E. (1994), ‘The risk inflation criterion for multiple regression’,
The Annals of Statistics 22(4), 1947–1975.

Fuchs, J. J. (2005), ‘Recovery of exact sparse representations in the presense of


bounded noise’, IEEE Transactions on Information Theory 51(10), 3601–3608.

Greenshtein, E. & Ritov, Y. (2004), ‘Persistence in high-dimensional linear predictor


selection and the virtue of overparametrization’, Bernoulli 10(6), 971–988.

Hastie, T., Tibshirani, R. & Friedman, J. (2009), The Elements of Statistical Learning;
Data Mining, Inference and Prediction, Springer. Second edition.

Hastie, T., Tibshirani, R. & Wainwright, M. (2015), Statistical Learning with Sparsity:
the Lasso and Generalizations, Chapman & Hall.

Hocking, R. R. & Leslie, R. N. (1967), ‘Selection of the best subset in regression


analysis’, Technometrics 9(4), 531–540.

Hoerl, A. & Kennard, R. (1970), ‘Ridge regression: biased estimation for nonorthog-
onal problems’, Technometrics 12(1), 55–67.

Meinshausen, N. & Buhlmann, P. (2006), ‘High-dimensional graphs and variable se-


lection with the lasso’, The Annals of Statistics 34(3), 1436–1462.

Osborne, M., Presnell, B. & Turlach, B. (2000a), ‘A new approach to variable selection
in least squares problems’, IMA Journal of Numerical Analysis 20(3), 389–404.

11
Osborne, M., Presnell, B. & Turlach, B. (2000b), ‘On the lasso and its dual’, Journal
of Computational and Graphical Statistics 9(2), 319–337.

Raskutti, G., Wainwright, M. J. & Yu, B. (2011), ‘Minimax rates of estimation for
high-dimensional linear regression over `q -balls’, IEEE Transactions on Information
Theory 57(10), 6976–6994.

Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, Journal of
the Royal Statistical Society: Series B 58(1), 267–288.

van de Geer, S. & Buhlmann, P. (2009), ‘On the conditions used to prove oracle
results for the lasso’, Electronic Journal of Statistics 3, 1360–1392.

Wainwright, M. (2017), High-Dimensional Statistics: A Non-Asymptotic View, Cam-


bridge University Press. To appear.

Wainwright, M. J. (2009), ‘Sharp thresholds for high-dimensional and noisy sparsity


recovery using `1 -constrained quadratic programming (lasso)’, IEEE Transactions
on Information Theory 55(5), 2183–2202.

Zhao, P. & Yu, B. (2006), ‘On model selection consistency of lasso’, Journal of Ma-
chine Learning Research 7, 2541–2564.

5 Appendix: Convexity
It is convexity that allows to equate (2), (5), and (3), (6) (and yes, the penalized forms
are convex problems too). It is also convexity that allows us to both efficiently solve,
and in some sense, precisely understand the nature of the lasso and ridge regression
solutions
Here is a (far too quick) refresher/introduction to basic convex analysis and convex
optimization. Recall that a set C ⊆ Rn is called convex if for any x, y ∈ C and
t ∈ [0, 1], we have
tx + (1 − t)y ∈ C,
i.e., the line segment joining x, y lies entirely in C. A function f : Rn → R is called
convex if its domain dom(f ) is convex, and for any x, y ∈ dom(f ) and t ∈ [0, 1],

f tx + (1 − t)y ≤ tf (x) + (1 − t)f (y),

i.e., the function lies below the line segment joining its evaluations at x and y. A
function is called strictly convex if this same inequality holds strictly for x 6= y and
t ∈ (0, 1)
E.g., lines, rays, line segments, linear spaces, affine spaces, hyperplans, halfspaces,
polyhedra, norm balls are all convex sets

12
E.g., affine functions aT x + b are convex and concave, quadratic functions xT Qx +
bT x + c are convex if Q  0 and strictly convex if Q  0, norms are convex
Formally, an optimization problem is of the form

min f (x)
x∈D
subject to hi (x) ≤ 0, i = 1, . . . m
`j (x) = 0, j = 1, . . . r

Here D = dom(f ) ∩ m
T Tr
i=1 dom(hi ) ∩ j=1 dom(`j ) is the common domain of all func-
tions. A convex optimization problem is an optimization problem in which all functions
f, h1 , . . . hm are convex, and all functions `1 , . . . `r are affine. (Think: why affine?)
Hence, we can express it as

min f (x)
x∈D
subject to hi (x) ≤ 0, i = 1, . . . m
Ax = b

Why is a convex optimization problem so special? The short answer: because any
local minimizer is a global minimizer. To see this, suppose that x is feasible for the
convex problem formulation above and there exists some R > 0 such that

f (x) ≤ f (y) for all feasible y with kx − yk2 ≤ R.

Such a point x is called a local minimizer. For the sake of contradiction, suppose that
x was not a global minimizer, i.e., there exists some feasible z such that f (z) < f (x).
By convexity of the constraints (and the domain D), the point tz + (1 − t)x is feasible
for any 0 ≤ t ≤ 1. Furthermore, by convexity of f ,

f tz + (1 − t)x ≤ tf (z) + (1 − t)f (x) < f (x)

for any 0 < t < 1. Lastly, we can choose t > 0 small enough so that kx − (tz + (1 −
t)x)k2 = tkx − zk2 ≤ R, and we obtain a contradiction
Algorithmically, this is a very useful property, because it means if we keep “going
downhill”, i.e., reducing the achieved criterion value, and we stop when we can’t do
so anymore, then we’ve hit the global solution
Convex optimization problems are also special because they come with a beautiful
theory of beautiful convex duality and optimality, which gives us a way of understand-
ing the solutions. We won’t have time to cover any of this, but we’ll mention what
subgradient optimality looks like for the lasso
Just based on the definitions, it is not hard to see that (2), (3), (5), (6) are convex
problems, but (1), (4) are not. In fact, the latter two problems are known to be
NP-hard, so they are in a sense even the worst kind of nonconvex problem

13
6 Appendix: Geometry of the solutions
One undesirable feature of the best subset selection solution (7) is the fact that
it behaves discontinuously with y. As we change y, the active set A must change
at some point, and the coefficients will jump discontinuously, because we are just
doing least squares onto the active set. So, does the same thing happen with the
lasso solution (11)? The answer it not immediately clear. Again, as we change y,
the active set A must change at some point; but if the shrinkage term were defined
“just right”, then perhaps the coefficients of variables to leave the active set would
gracefully and continously drop to zero, and coefficients of variables to enter the
active set would continuously move form zero. This would make whole the lasso
solution continuous. Fortuitously, this is indeed the case, and the lasso solution
βb is continuous as a function of y. It might seem a daunting task to prove this,
but a certain perspective using convex geometry provides a very simple proof. The
geometric perspective in fact proves that the lasso fit X βb is nonexpansive in y, i.e.,
1-Lipschitz continuous, which is a very strong form of continuity. Define the convex
polyhedron C = {u : kX T uk∞ ≤ λ} ⊆ Rn . Some simple manipulations of the KKT
conditions show that the lasso fit is given by

X βb = (I − PC )(y),

the residual from projecting y onto C. A picture to show this (just look at the left
panel for now) is given in Figure 2.
The projection onto any convex set is nonexpansive, i.e., kPC (y) − PC (y 0 )k2 ≤
ky − y 0 k2 for any y, y 0 . This should be visually clear from the picture. Actually, the
same is true with the residual map: I − PC is also nonexpansive, and hence the lasso
fit is 1-Lipschitz continuous. Viewing the lasso fit as the residual from projection
onto a convex polyhedron is actually an even more fruitful perspective. Write this
polyhedron as
C = (X T )−1 {v : kvk∞ ≤ λ},
where (X T )−1 denotes the preimage operator under the linear map X T . The set
{v : kvk∞ ≤ λ} is a hypercube in Rp . Every face of this cube corresponds to a subset
A ⊆ {1, . . . p} of dimensions (that achieve the maximum value |λ|) and signs sA ∈
{−1, 1}|A| (that tell which side of the cube the face will lie on, for each dimension).
Now, the faces of C are just faces of {v : kvk∞ ≤ λ} run through the (linear) preimage
transformation, so each face of C can also indexed by a set A ⊆ {1, . . . p} and signs
sA ∈ {−1, 1}|A| . The picture in Figure 2 attempts to convey this relationship with
the colored black face in each of the panels.
Now imagine projecting y onto C; it will land on some face. We have just argued
that this face corresponds to a set A and signs sA . One can show that this set A is
exactly the active set of the lasso solution at y, and sA are exactly the active signs.
The size of the active set |A| is the co-dimension of the face. Looking at the picture:
we can that see that as we wiggle y around, it will project to the same face. From the

14
y

X β̂

û A, sA

(X T )−1
0
0

{v : kvk∞ ≤ λ}
T
C = {u : kX uk∞ ≤ λ}

Rn Rp

Figure 2: A geometric picture of the lasso solution. The left panel shows the polyhe-
dron underlying all lasso fits, where each face corresponds to a particular combination
of active set A and signs s; the right panel displays the “inverse” polyhedron, where
the dual solutions live

correspondence between faces and active set and signs of lasso solutions, this means
that A, sA do not change as we perturb y, i.e., they are locally constant. But this isn’t
true for all points y, e.g., if y lies on one of the rays emanating from the lower right
corner of the polyhedron in the picture, then we can see that small perturbations of
y do actually change the face that it projects to, which invariably changes the active
set and signs of the lasso solution. However, this is somewhat of an exceptional case,
in that such points can be form a of Lebesgue measure zero, and therefore we can
assure ourselves that the active set and signs A, sA are locally constant for almost
every y.
From the lasso KKT conditions (9), (10), it is possible to compute the lasso
solution in (5) as a function of λ, which we will write as β(λ),
b for all values of the
tuning parameter λ ∈ [0, ∞]. This is called the regularization path or solution path of
the problem (5). Path algorithms like the one we will describe below are not always
possible; the reason that this ends up being feasible for the lasso problem (5) is that
the solution path β(λ),
b λ ∈ [0, ∞] turns out to be a piecewise linear, continuous
function of λ. Hence, we only need to compute and store the knots in this path,
which we will denote by λ1 ≥ λ2 ≥ . . . ≥ λr ≥ 0, and the lasso solution at these
knots. From this information, we can then compute the lasso solution at any value

15
1
of λ by linear interpolation.
The knots λ1 ≥ . . . ≥ λr in the solution path correspond to λ values at which
the active set A(λ) = supp(β(λ))
b changes. As we decrease λ from ∞ to 0, the knots
typically correspond to the point at which a variable enters the active set; this con-
nects the lasso to an incremental variable selection procedure like forward stepwise
regression. Interestingly though, as we decrease λ, a knot in the lasso path can also
correspond to the point at which a variables leaves the active set. See Figure 3.
0.3
0.2
Coefficients

0.1
0.0
−0.1
−0.2

0.5 1.0 1.5 2.0

lambda

Figure 3: An example of the lasso path. Each colored line denotes a component of the
lasso solution βbj (λ), j = 1, . . . , p as a function of λ. The gray dotted vertical lines
mark the knots λ1 ≥ λ2 ≥ . . .

The lasso solution path was described by Osborne et al. (2000a,b), Efron et al.
(2004). Like the construction of all other solution paths that followed these seminal
works, the lasso path is essentially given by an iterative or inductive verification of the
KKT conditions; if we can maintain that the KKT conditions holds as we decrease
λ, then we know we have a solution. The trick is to start at a value of λ at which the
solution is trivial; for the lasso, this is λ = ∞, at which case we know the solution
must be β(∞)
b = 0.
Why would the path be piecewise linear? The construction of the path from the

16
KKT conditions is actually rather technical (not difficult conceptually, but somewhat
tedious), and doesn’t shed insight onto this matter. But we can actually see it clearly
from the projection picture in Figure 2.
As λ decreases from ∞ to 0, we are shrinking (by a multiplicative factor λ) the
polyhedron onto which y is projected; let’s write Cλ = {u : kX T uk∞ ≤ λ} = λC1 to
make this clear. Now suppose that y projects onto the relative interior of a certain
face F of Cλ , corresponding to an active set A and signs sA . As λ decreases, the
point on the boundary of Cλ onto which y projects, call it u b(λ) = PCλ (y), will move
along the face F , and change linearly in λ (because we are equivalently just tracking
the projection of y onto an affine space that is being scaled by λ). Thus, the lasso
fit X β(λ)
b =y−u b(λ) will also behave linearly in λ. Eventually, as we continue to
decrease λ, the projected point u b(λ) will move to the relative boundary of the face F ;
then, decreasing λ further, it will lie on a different, neighboring face F 0 . This face will
correspond to an active set A0 and signs sA0 that (each) differ by only one element to
A and sA , respectively. It will then move linearly across F 0 , and so on.
Now we will walk through the technical derivation of the lasso path, starting
at λ = ∞ and β(∞)
b = 0, as indicated above. Consider decreasing λ from ∞, and
continuing to set β(λ)
b = 0 as the lasso solution. The KKT conditions (9) read

X T y = λs,

where s is a subgradient of the `1 norm evaluated at 0, i.e., sj ∈ [−1, 1] for every j =


1, . . . , p. For large enough values of λ, this is satisfied, as we can choose s = X T y/λ.
But this ceases to be a valid subgradient if we decrease λ past the point at which
λ = |XjT y| for some variable j = 1, . . . , p. In short, β(λ)
b = 0 is the lasso solution for
all λ ≥ λ1 , where
λ1 = max |XjT y|. (19)
j=1,...,p

What happens next? As we decrease λ from λ1 , we know that we’re going to have to
change β(λ)
b from 0 so that the KKT conditions remain satisfied. Let j1 denote the
variable that achieves the maximum in (19). Since the subgradient was |sj1 | = 1 at
λ = λ1 , we see that we are “allowed” to make βbj1 (λ) nonzero. Consider setting

βbj1 (λ) = (XjT1 Xj1 )−1 (XjT1 y − λsj1 )


(20)
βbj (λ) = 0, for all j 6= j1 ,

as λ decreases from λ1 , where sj1 = sign(XjT1 y). Note that this makes β(λ)
b a piecewise
linear and continuous function of λ, so far. The KKT conditions are then
 
XjT1 y − Xj1 (XjT1 Xj1 )−1 (XjT1 y − λsj1 ) = λsj1 ,

which can be checked with simple algebra, and


 
T T −1 T
Xj y − Xj1 (Xj1 Xj1 ) (Xj1 y − λsj1 ) ≤ λ,

17
for all j 6= j1 . Recall that the above held with strict inequality at λ = λ1 for all j 6= j1 ,
and by continuity of the constructed solution β(λ), b it should continue to hold as we
decrease λ for at least a little while. In fact, it will hold until one of the piecewise
linear paths
XjT (y − Xj1 (XjT1 Xj1 )−1 (XjT1 y − λsj1 )), j 6= j1
becomes equal to ±λ, at which point we have to modify the solution because otherwise
the implicit subgradient

XjT (y − Xj1 (XjT1 Xj1 )−1 (XjT1 y − λsj1 ))


sj =
λ
will cease to be in [−1, 1]. It helps to draw yourself a picture of this.
Thanks to linearity, we can compute the critical “hitting time” explicitly; a short
calculation shows that, the lasso solution continues to be given by (20) for all λ1 ≥
λ ≥ λ2 , where

XjT (I − Xj1 (XjT1 Xj1 )−1 Xj1 )y


λ2 = max+ , (21)
j6=j1 , sj ∈{−1,1} sj − XjT Xj1 (XjT1 Xj1 )−1 sj1

and max+ denotes the maximum over all of its arguments that are < λ1 .
To keep going: let j2 , s2 achieve the maximum in (21). Let A = {j1 , j2 }, sA =
(sj1 , sj2 ), and consider setting

βbA (λ) = (XAT XA )−1 (XAT y − λsA )


(22)
βb−A (λ) = 0,

as λ decreases from λ2 . Again, we can verify the KKT conditions for a stretch of
decreasing λ, but will have to stop when one of

XjT (y − XA (XAT XA )−1 (XAT y − λsA ), j∈


/A

becomes equal to ±λ. By linearity, we can compute this next “hitting time” explic-
itly, just as before. Furthermore, though, we will have to check whether the active
components of the computed solution in (22) are going to cross through zero, because
past such a point, sA will no longer be a proper subgradient over the active compo-
nents. We can again compute this next “crossing time” explicitly, due to linearity.
Therefore, we maintain that (22) is the lasso solution for all λ2 ≥ λ ≥ λ3 , where λ3 is
the maximum of the next hitting time and the next crossing time. For convenience,
the lasso path algorithm is summarized below.
As we decrease λ from a knot λk , we can rewrite the lasso coefficient update in
Step 1 as
βbA (λ) = βbA (λk ) + (λk − λ)(XAT XA )−1 sA ,
(23)
βb−A (λ) = 0.

18
We can see that we are moving the active coefficients in the direction (λk −λ)(XAT XA )−1 sA
for decreasing λ. In other words, the lasso fitted values proceed as

X β(λ)
b b k ) + (λk − λ)XA (X T XA )−1 sA ,
= X β(λ A

for decreasing λ. Efron et al. (2004) call XA (XAT XA )−1 sA the equiangular direction,
because this direction, in Rn , takes an equal angle with all Xj ∈ Rn , j ∈ A.
For this reason, the lasso path algorithm in Algorithm ?? is also often referred
to as the least angle regression path algorithm in “lasso mode”, though we have not
mentioned this yet to avoid confusion. Least angle regression is considered as another
algorithm by itself, where we skip Step 3 altogether. In words, Step 3 disallows any
component path to cross through zero. The left side of the plot in Figure 3 visualizes
the distinction between least angle regression and lasso estimates: the dotted black
line displays the least angle regression component path, crossing through zero, while
the lasso component path remains at zero.
Lastly, an alternative expression for the coefficient update in (23) (the update in
Step 1) is
λk − λ T
βbA (λ) = βbA (λk ) + (XA XA )−1 XAT r(λk ),
λk (24)
β−A (λ) = 0,
b

where r(λk ) = y − XA βbA (λk ) is the residual (from the fitted lasso model) at λk . This
follows because, recall, λk sA are simply the inner products of the active variables
with the residual at λk , i.e., λk sA = XAT (y − XA βbA (λk )). In words, we can see that
the update for the active lasso coefficients in (24) is in the direction of the least
squares coefficients of the residual r(λk ) on the active variables XA .

7 Appendix: Fast Rates


Here is a proof of (17). There are many flavors of fast rates, and the conditions
required are all very closely related. van de Geer & Buhlmann (2009) provides a nice
review and discussion. Here we just discuss two such results, for simplicity.
Compatibility result. Assume that X satisfies the compatibility condition with
respect to the true support set S, i.e., for some compatibility constant φ0 > 0,

1 φ2
kXvk22 ≥ 0 kvS k21 for all v ∈ Rp such that kv−S k1 ≤ 3kvS k1 . (25)
n s0
While this may look like an odd condition, we will see it being useful in the proof
below, and we will also have some help interpreting it when we discuss the restricted
eigenvalue condition shortly. Roughly, it means the (truly active) predictors can’t be
too correlated

19
Recall from our previous analysis for the lasso estimator in penalized form (5), we
showed on an event Eδ of probability at least 1 − δ,
p
kX βb − Xβ0 k22 ≤ 2σ 2n log(ep/δ)kβb − β0 k1 + 2λ(kβ0 k1 − kβk
b 1 ).

Choosing λ large enough and applying the triangle inequality then gave us the slow
rate wepderived before. Now we choose λ just slightly larger (by a factor of 2):
λ ≥ 2σ 2n log(ep/δ). The remainder of the analysis will be performed on the event
Eδ and we will no longer make this explicit until the very end. Then

kX βb − Xβ0 k22 ≤ λkβb − β0 k1 + 2λ(kβ0 k1 − kβk


b 1)
≤ λkβbS − β0,S k1 + λkβb−S k1 + 2λ(kβ0 k1 − kβk
b 1)
≤ λkβbS − β0,S k1 + λkβb−S k1 + 2λ(kβ0,S − βbS k1 − kβb−S k1 )
= 3λkβbS − β0,S k1 − λkβb−S k1 ,

where the two inequalities both followed from the triangle inequality, one application
for each of the two terms, and we have used that βb0,−S = 0. As kX βb − Xβ0 k22 ≥ 0,
we have shown
kβb−S − βb0,−S k1 ≤ 3kβbS − β0,S k1 ,
and thus we may apply the compatibility condition (25) to the vector v = βb − β0 .
This gives us two bounds: one on the fitted values, and the other on the coefficients.
Both start with the key inequality (from the second-to-last display)

kX βb − Xβ0 k22 ≤ 3λkβbS − β0,S k1 . (26)

For the fitted values, we upper bound the right-hand side of the key inequality (26),
r
2 s0
kX βb − Xβ0 k2 ≤ 3λ kX βb − Xβ0 k2 ,
nφ20

or dividing through both sides by kX βb − Xβ0 k2 , then squaring both sides, and di-
viding by n,
1 9s0 λ2
kX βb − Xβ0 k22 ≤ 2 2 .
n n φ0
p
Plugging in λ = 2σ 2n log(ep/δ), we have shown that

1 72σ 2 s0 log(ep/δ)
kX βb − Xβ0 k22 ≤ , (27)
n nφ20

with probability at least 1 − δ. Notice the similarity between (27) and (8): both
provide us in-sample risk bounds on the order of s0 log p/n, but the bound for the
lasso requires a strong compability assumption on the predictor matrix X, which
roughly means the predictors can’t be too correlated

20
For the coefficients, we lower bound the left-hand side of the key inequality (26),

nφ20 b
kβS − β0,S k21 ≤ 3λkβbS − β0,S k1 ,
s0
so dividing through both sides by kβbS − β0,S k1 , and recalling kβb−S k1 ≤ 3kβbS − β0,S k1 ,
which implies by the triangle inequality that kβb − β0 k1 ≤ 4kβbS − β0,S k1 ,
12s0 λ
kβb − β0 k1 ≤ .
nφ20
p
Plugging in λ = 2σ 2n log(ep/δ), we have shown that
r
24σs 0 2 log(ep/δ)
kβb − β0 k1 ≤ 2
, (28)
φ0 n
p
with probability at least 1 − δ. This is a error bound on the order of s0 log p/n for
the lasso coefficients (in `1 norm)
Restricted eigenvalue result. Instead of compatibility, we may assume that
X satisfies the restricted eigenvalue condition with constant φ0 > 0, i.e.,

1
kXvk22 ≥ φ20 kvk22 for all subsets J ⊆ {1, . . . , p} such that |J| = s0
n
and all v ∈ Rp such that kvJ c k1 ≤ 3kvJ k1 . (29)

This produces essentially the same results as in (27), (28), but additionally, in the `2
norm,
s0 log p
kβb − β0 k22 .
nφ20
with probability tending to 1
Note the similarity between (29) and the compatibility condition (25). The former
is actually stronger, i.e., it implies the latter, because kβk22 ≥ kβJ k22 ≥ kβJ k21 /s0 . We
may interpret the restricted eigenvalue condition roughly as follows: the requirement
(1/n)kXvk22 ≥ φ20 kvk22 for all v ∈ Rn would be a lower bound of φ20 on the smallest
eigenvalue of (1/n)X T X; we don’t require this (as this would of course mean that X
was full column rank, and couldn’t happen when p > n), but instead that require
that the same inequality hold for v that are “mostly” supported on small subsets J
of variables, with |J| = s0

8 Appendix: Support Recovery


Again we assume a standard linear model (??), with X fixed, subject to the scaling
kXj k22 ≤ n, for j = 1, . . . , p, and  ∼ N (0, σ 2 ). Denote by S = supp(β0 ) the true
support set, and s0 = |S|. Assume that XS has full column rank

21
We aim to show that, at some value of λ, the lasso solution βb in (5) has an active
set that exactly equals the true support set,

A = supp(β)
b = S,

with high probability. We actually aim to show that the signs also match,

sign(βbS ) = sign(β0,S ),

with high probability. The primal-dual witness method basically plugs in the true
support S into the KKT conditions for the lasso (9), (10), and checks when they can
be verified
We start by breaking up (9) into two blocks, over S and S c . Suppose that
supp(β)
b = S at a solution β.
b Then the KKT conditions become

XST (y − XS βbS ) = λsS (30)


X T (y − XS βbS ) = λs−S .
−S (31)

Hence, if we can satisfy the two conditions (30), (31) with a proper subgradient
s, such that
sS = sign(β0,S ) and ks−S k∞ = max |sj | < 1,
j ∈S
/

then we have met our goal: we have recovered a (unique) lasso solution whose active
set is S, and whose active signs are sign(β0,S )
So, let’s solve for βbS in the first block (30). Just as we did in the work on basic
properties of the lasso estimator, this yields

βbS = (XST XS )−1 XST y − λsign(β0,S ) ,



(32)

where we have substituted sS = sign(β0,S ). From (31), this implies that s−S must
satisfy
1 T
X−S I − XS (XST XS )−1 XST y + X−S
T
XS (XST XS )−1 sign(β0,S ).

s−S = (33)
λ
To lay it out, for concreteness, the primal-dual witness method proceeds as follows:

1. Solve for the lasso solution over the S components, βbS , as in (32), and set
βb−S = 0

2. Solve for the subgradient over the S c components, s−S , as in (33)

3. Check that sign(βbS ) = sign(β0,S ), and that ks−S k∞ < 1. If these two checks
pass, then we have certified there is a (unique) lasso solution that exactly re-
covers the true support and signs

22
The success of the primal-dual witness method hinges on Step 3. We can plug in y =
Xβ0 + , and rewrite the required conditions, sign(βbS ) = sign(β0,S ) and ks−S k∞ < 1,
as

sign(β0,j + ∆j ) = sign(β0,j ), where


∆j = eTj (XST XS )−1 XST  − λsign(β0,S ) , for all j ∈ S, (34)


and
1 T
X−S I − XS (XST XS )−1 XST  + X−S
T
XS (XST XS )−1 sign(β0,S )

< 1. (35)
λ ∞

As  ∼ N (0, σ 2 I), we see that the two required conditions have been reduced to
statements about Gaussian random variables. The arguments we need to check these
conditions actually are quite simply, but we will need to make assumptions on X and
β0 . These are:
With these assumptions in place on X and β0 , let’s first consider verifying (34),
and examine ∆S , whose components ∆j , j ∈ S are as defined in (34). We have

k∆S k∞ ≤ k(XST XS )−1 XST k∞ + λk(XST XS )−1 k∞ .

Note that w = (XST XS )−1 XST  is Gaussian with mean zero and covariance σ 2 (XST XS )−1 ,
so the variances of components of w are bounded by
  σ2n
σ 2 Λmax (XST XS )−1 ≤ ,
C
where we have used the minimum eigenvalue assumption. By a standard result on
the maximum of Gaussians, for any δ > 0, it holds with probability at least 1 − δ
that
σ p
k∆S k∞ ≤ √ 2n log (es0 /δ) + λk(XST XS )−1 k∞
C  
γ σp
≤ β0,min + √ 2n log (es0 /δ) − 4λ .
C γ
| {z }
a

where in the second line we used the minimum signal condition. As long as a < 0,
we can see that the sign condition (34) is verified
Now, let's consider verifying (35). Using the mutual incoherence condition, we
have

\Big\| \frac{1}{\lambda} X_{-S}^T \big( I - X_S (X_S^T X_S)^{-1} X_S^T \big) \epsilon + X_{-S}^T X_S (X_S^T X_S)^{-1} \mathrm{sign}(\beta_{0,S}) \Big\|_\infty \le \|z\|_\infty + (1 - \gamma),

where z = (1/\lambda) X_{-S}^T (I - P_{X_S}) \epsilon, with P_{X_S} = X_S (X_S^T X_S)^{-1} X_S^T the projection
matrix onto the column space of X_S. Notice that z is Gaussian with mean zero
and covariance (\sigma^2/\lambda^2) X_{-S}^T (I - P_{X_S}) X_{-S}, so the components of z have variances bounded
by

\frac{\sigma^2 n}{\lambda^2} \Lambda_{\max}(I - P_{X_S}) \le \frac{\sigma^2 n}{\lambda^2}.
Therefore, again by the maximal Gaussian inequality, for any \delta > 0, it holds with
probability at least 1 - \delta that

\Big\| \frac{1}{\lambda} X_{-S}^T \big( I - X_S (X_S^T X_S)^{-1} X_S^T \big) \epsilon + X_{-S}^T X_S (X_S^T X_S)^{-1} \mathrm{sign}(\beta_{0,S}) \Big\|_\infty
   \le \frac{\sigma}{\lambda} \sqrt{2 n \log\big( e(p - s_0)/\delta \big)} + (1 - \gamma)
   = 1 + \underbrace{\Big( \frac{\sigma}{\lambda} \sqrt{2 n \log\big( e(p - s_0)/\delta \big)} - \gamma \Big)}_{b}.

Thus, as long as b < 0, the subgradient condition (35) is verified.

So it remains to choose \lambda so that a, b < 0. For \lambda \ge (2\sigma/\gamma) \sqrt{2 n \log(ep/\delta)}, we can
see that

a \le 2\lambda - 4\lambda < 0, \qquad b \le \gamma/2 - \gamma < 0,

so (34), (35) are verified—and hence the lasso estimator recovers the correct support and
signs—with probability at least 1 - 2\delta.

8.1 A note on the conditions


As we moved from the slow rates, to fast rates, to support recovery, the assumptions
we used got stronger and stronger. For the slow rates, we essentially assumed
nothing about the predictor matrix X except for column normalization. For the
fast rates, we had to additionally assume a compatibility or restricted eigenvalue
condition, which, roughly speaking, limited the correlations of the predictor variables
(particularly concentrated over the underlying support S). For support recovery, we
needed a whole lot more. The minimum eigenvalue condition on (1/n)(X_S^T X_S)^{-1} is
somewhat like the restricted eigenvalue condition on X. But the mutual incoherence
condition is even stronger; it requires the regression coefficients

\eta_j(S) = (X_S^T X_S)^{-1} X_S^T X_j,

given by regressing each X_j on the truly active variables X_S, to be small (in \ell_1 norm)
for all j \notin S. In other words, no truly inactive variable can be highly correlated with
(or well explained, in a linear projection sense, by) the truly active variables.
Finally, the minimum signal condition ensures that the nonzero entries of the true
coefficient vector \beta_0 are big enough to detect. This is quite restrictive and is not
needed for risk bounds, but it is crucial for support recovery.

8.2 Minimax bounds
Under the data model y = X\beta_0 + \epsilon with X fixed, subject to the scaling \|X_j\|_2^2 \le n, for
j = 1, \dots, p, and \epsilon \sim N(0, \sigma^2 I), Raskutti et al. (2011) derive upper and lower bounds
on the minimax prediction error

M(s_0, n, p) = \inf_{\hat\beta} \sup_{\|\beta_0\|_0 \le s_0} \frac{1}{n} \|X\hat\beta - X\beta_0\|_2^2.

(Their analysis is actually considerably broader than this and covers the coefficient
error \|\hat\beta - \beta_0\|_2, as well as \ell_q constraints on \beta_0, for q \in [0, 1].) They prove that, under
no additional assumptions on X,

M(s_0, n, p) \lesssim \frac{s_0 \log(p/s_0)}{n},

with probability tending to 1.
They also prove that, under a type of restricted eigenvalue condition in which

c_0 \le \frac{(1/n)\|Xv\|_2^2}{\|v\|_2^2} \le c_1 \quad \text{for all } v \in \mathbb{R}^p \text{ such that } \|v\|_0 \le 2 s_0,

for some constants c_0 > 0 and c_1 < \infty, it holds that

M(s_0, n, p) \gtrsim \frac{s_0 \log(p/s_0)}{n},

with probability at least 1/2.
The implication is that, for some X, minimax optimal prediction may be possible
at a faster rate than s_0 \log(p/s_0)/n; but for low correlations, this is the rate we should
expect. (This is consistent with the worst-case-X analysis of Foster & George (1994),
who show that the worst-case behavior is attained in the orthogonal X case.)

J. R. Statist. Soc. B (2009)
71, Part 5, pp. 1009–1030

Sparse additive models

Pradeep Ravikumar,
University of California, Berkeley, USA

and John Lafferty, Han Liu and Larry Wasserman


Carnegie Mellon University, Pittsburgh, USA

[Received April 2008. Final revision March 2009]

Summary. We present a new class of methods for high dimensional non-parametric regression
and classification called sparse additive models. Our methods combine ideas from sparse linear
modelling and additive non-parametric regression. We derive an algorithm for fitting the models
that is practical and effective even when the number of covariates is larger than the sample
size. Sparse additive models are essentially a functional version of the grouped lasso of Yuan
and Lin. They are also closely related to the COSSO model of Lin and Zhang but decouple
smoothing and sparsity, enabling the use of arbitrary non-parametric smoothers. We give an
analysis of the theoretical properties of sparse additive models and present empirical results
on synthetic and real data, showing that they can be effective in fitting sparse non-parametric
models in high dimensional data.
Keywords: Additive models; Lasso; Non-parametric regression; Sparsity

1. Introduction
Substantial progress has been made recently on the problem of fitting high dimensional
linear regression models of the form Yi = XiT β + "i , for i = 1, . . . , n. Here Yi is a real-valued
response, Xi is a predictor and "i is a mean 0 error term. Finding an estimate of β when p > n
that is both statistically well behaved and computationally efficient has proved challenging;
however, under the assumption that the vector β is sparse, the lasso estimator (Tibshirani,
1996) has been remarkably successful. The lasso estimator β̂ minimizes the l1 -penalized sum of
squares
 
\sum_i (Y_i - X_i^T \beta)^2 + \lambda \sum_{j=1}^p |\beta_j|

with the l_1-penalty \|\beta\|_1 encouraging sparse solutions, where many components \hat\beta_j are 0. The
good empirical success of this estimator has been recently backed up by results confirming that
it has strong theoretical properties; see Bunea et al. (2007), Greenshtein and Ritov (2004), Zhao
and Yu (2007), Meinshausen and Yu (2006) and Wainwright (2006).
The non-parametric regression model Y_i = m(X_i) + \varepsilon_i, where m is a general smooth function,
relaxes the strong assumptions that are made by a linear model but is much more challenging
in high dimensions. Hastie and Tibshirani (1999) introduced the class of additive models of the
form

Address for correspondence: Larry Wasserman, Department of Statistics, 232 Baker Hall, Carnegie Mellon
University, Pittsburgh, PA 15213-3890, USA.
E-mail: [email protected]


p
Yi = fj .Xij / + "i : .1/
j=1

This additive combination of univariate functions—one for each covariate Xj —is less general
than joint multivariate non-parametric models but can be more interpretable and easier to fit;
in particular, an additive model can be estimated by using a co-ordinate descent Gauss–Seidel
procedure, called backfitting. Unfortunately, additive models only have good statistical and
computational behaviour when the number of variables p is not large relative to the sample size
n, so their usefulness is limited in the high dimensional setting.
In this paper we investigate sparse additive models (SPAMs), which extend the advantages of
sparse linear models to the additive non-parametric setting. The underlying model is the same
as in equation (1), but we impose a sparsity constraint on the index set \{j : f_j \not\equiv 0\} of functions
f_j that are not identically zero. Lin and Zhang (2006) have proposed COSSO, an extension of
the lasso to this setting, for the case where the component functions fj belong to a reproducing
kernel Hilbert space. They penalized the sum of the reproducing kernel Hilbert space norms of
the component functions. Yuan (2007) proposed an extension of the non-negative garrotte to
this setting. As with the parametric non-negative garrotte, the success of this method depends
on the initial estimates of component functions fj .
In Section 3, we formulate an optimization problem in the population setting that induces
sparsity. Then we derive a sample version of the solution. The SPAM estimation procedure
that we introduce allows the use of arbitrary non-parametric smoothing techniques, effectively
resulting in a combination of the lasso and backfitting. The algorithm extends to classifica-
tion problems by using generalized additive models. As we explain later, SPAMs can also be
thought of as a functional version of the grouped lasso (Antoniadis and Fan, 2001; Yuan and
Lin, 2006).
The main results of this paper include the formulation of a convex optimization problem
for estimating a SPAM, an efficient backfitting algorithm for constructing the estimator and
theoretical results that analyse the effectiveness of the estimator in the high dimensional setting.
Our theoretical results are of two different types. First, we show that, under suitable choices
of the design parameters, the SPAM backfitting algorithm recovers the correct sparsity pattern
asymptotically; this is a property that we call sparsistency, as a shorthand for ‘sparsity pattern
consistency’. Second, we show that the estimator is persistent, in the sense of Greenshtein and
Ritov (2004), which is a form of risk consistency.
In the following section we establish notation and assumptions. In Section 3 we formulate
SPAMs as an optimization problem and derive a scalable backfitting algorithm. Examples show-
ing the use of our sparse backfitting estimator on high dimensional data are included in Section
5. In Section 6.1 we formulate the sparsistency result, when orthogonal function regression is
used for smoothing. In Section 6.2 we give the persistence result. Section 7 contains a discussion
of the results and possible extensions. Proofs are contained in Appendix A.
The statements of the theorems in this paper were given, without proof, in Ravikumar et al.
(2008). The backfitting algorithm was also presented there. Related results were obtained in
Meier et al. (2008) and Koltchinskii and Yuan (2008).

2. Notation and assumptions


We assume that we are given independent data (X_1, Y_1), \dots, (X_n, Y_n) where X_i = (X_{i1}, \dots, X_{ij}, \dots, X_{ip})^T \in [0, 1]^p and

Y_i = m(X_i) + \varepsilon_i   (2)

with \varepsilon_i \sim N(0, \sigma^2) independent of X_i and

m(x) = \sum_{j=1}^p f_j(x_j).   (3)

Let \mu denote the distribution of X, and let \mu_j denote the marginal distribution of X_j for each
j = 1, \dots, p. For a function f_j on [0, 1] denote its L_2(\mu_j) norm by

\|f_j\|_{\mu_j} = \sqrt{\int_0^1 f_j(x)^2 \, d\mu_j(x)} = \sqrt{E\{f_j(X_j)^2\}}.   (4)

When the variable X_j is clear from the context, we remove the dependence on \mu_j in the notation
\|\cdot\|_{\mu_j} and simply write \|f_j\|.
For j \in \{1, \dots, p\}, let H_j denote the Hilbert subspace L_2(\mu_j) of measurable functions f_j(x_j)
of the single scalar variable x_j with zero mean, E\{f_j(X_j)\} = 0. Thus, H_j has the inner product

\langle f_j, f_j' \rangle = E\{f_j(X_j)\, f_j'(X_j)\}   (5)

and \|f_j\| = \sqrt{E\{f_j(X_j)^2\}} < \infty. Let H = H_1 \oplus H_2 \oplus \dots \oplus H_p denote the Hilbert space of functions
of (x_1, \dots, x_p) that have the additive form m(x) = \sum_j f_j(x_j), with f_j \in H_j, j = 1, \dots, p.
Let {ψjk , k = 0, 1, . . .} denote a uniformly bounded, orthonormal basis with respect to L2 [0, 1].
Unless stated otherwise, we assume that f_j \in T_j where

T_j = \Big\{ f_j \in H_j : f_j(x_j) = \sum_{k=0}^\infty \beta_{jk} \psi_{jk}(x_j), \; \sum_{k=0}^\infty \beta_{jk}^2 k^{2\nu_j} \le C^2 \Big\}   (6)

for some 0 < C < \infty. We shall take \nu_j = 2 although the extension to other levels of smoothness is
straightforward. It is also possible to adapt to \nu_j although we do not pursue that direction here.
Let \Lambda_{\min}(A) and \Lambda_{\max}(A) denote the minimum and maximum eigenvalues of a square matrix
A. If v = (v_1, \dots, v_k)^T is a vector, we use the norms

\|v\| = \sqrt{\sum_{j=1}^k v_j^2}, \qquad \|v\|_1 = \sum_{j=1}^k |v_j|, \qquad \|v\|_\infty = \max_j |v_j|.   (7)

3. Sparse backfitting
The outline of the derivation of our algorithm is as follows. We first formulate a population
level optimization problem and show that the minimizing functions can be obtained by iterat-
ing through a series of soft thresholded univariate conditional expectations. We then plug in
smoothed estimates of these univariate conditional expectations, to derive our sparse backfitting
algorithm.

3.1. Population sparse additive models


For simplicity, assume that E(Y_i) = 0. The standard additive model optimization problem in
L_2(\mu) (the population setting) is

\min_{f_j \in H_j, 1 \le j \le p} E\Big\{ Y - \sum_{j=1}^p f_j(X_j) \Big\}^2   (8)

where the expectation is taken with respect to X and the noise \varepsilon. Now consider the following
modification of this problem that introduces a scaling parameter for each function, and that
imposes additional constraints:

\min_{\beta \in \mathbb{R}^p, g_j \in H_j} E\Big\{ Y - \sum_{j=1}^p \beta_j g_j(X_j) \Big\}^2   (9)

subject to

\sum_{j=1}^p |\beta_j| \le L,   (10)

E(g_j^2) = 1, \quad j = 1, \dots, p,   (11)

noting that g_j is a function whereas \beta = (\beta_1, \dots, \beta_p)^T is a vector. The constraint that \beta lies in
the l_1-ball \{\beta : \|\beta\|_1 \le L\} encourages sparsity of the estimated \beta, just as for the parametric lasso
(Tibshirani, 1996). It is convenient to absorb the scaling constants \beta_j into the functions f_j, and
to re-express the minimization in the following equivalent Lagrangian form:

L(f, \lambda) = \frac{1}{2} E\Big\{ Y - \sum_{j=1}^p f_j(X_j) \Big\}^2 + \lambda \sum_{j=1}^p \sqrt{E\{f_j^2(X_j)\}}.   (12)

Theorem 1. The minimizers f_j \in H_j of equation (12) satisfy

f_j = \Big[ 1 - \frac{\lambda}{\sqrt{E(P_j^2)}} \Big]_+ P_j \quad \text{almost surely}   (13)

where [\cdot]_+ denotes the positive part, and P_j = E(R_j | X_j) denotes the projection of the residual
R_j = Y - \sum_{k \ne j} f_k(X_k) onto H_j.
An outline of the proof of this theorem appears in Ravikumar et al. (2008). A formal proof
is given in Appendix A. At the population level, the f_j s can be found by a co-ordinate descent
procedure that fixes \{f_k : k \ne j\} and fits f_j by equation (13), and then iterates over j.

3.2. Data version of sparse additive models


To obtain a sample version of the population solution, we insert sample estimates into the pop-
ulation algorithm, as in standard backfitting (Hastie and Tibshirani, 1999). Thus, we estimate
the projection P_j = E(R_j | X_j) by smoothing the residuals:

\hat{P}_j = S_j R_j   (14)

where S_j is a linear smoother, such as a local linear or kernel smoother. Let

\hat{s}_j = \frac{1}{\sqrt{n}} \|\hat{P}_j\| = \sqrt{\mathrm{mean}(\hat{P}_j^2)}   (15)

be the estimate of \sqrt{E(P_j^2)}. Using these plug-in estimates in the co-ordinate descent procedure
yields the SPAM backfitting algorithm that is given in Table 1.
This algorithm can be seen as a functional version of the co-ordinate descent algorithm for
solving the lasso. In particular, if we solve the lasso by iteratively minimizing with respect to a
single co-ordinate, each iteration is given by soft thresholding; see Table 2. Convergence properties
of variants of this simple algorithm have been recently treated by Daubechies et al. (2004, 2007).
Our sparse backfitting algorithm is a direct generalization of this algorithm, and it reduces to
it in the case where the smoothers are local linear smoothers with large bandwidths, i.e., as the
bandwidth approaches ∞, the local linear smoother approaches a global linear fit, yielding the
estimator \hat{P}_j(i) = \hat\beta_j X_{ij}. When the variables are standardized,
Table 1. SPAM backfitting algorithm†

Input: data (X_i, Y_i), regularization parameter \lambda.
Initialize \hat{f}_j = 0, for j = 1, \dots, p.
Iterate until convergence, for each j = 1, \dots, p:
  Step 1: compute the residual, R_j = Y - \sum_{k \ne j} \hat{f}_k(X_k);
  Step 2: estimate P_j = E(R_j | X_j) by smoothing, \hat{P}_j = S_j R_j;
  Step 3: estimate the norm, \hat{s}_j^2 = (1/n) \sum_{i=1}^n \hat{P}_j^2(i);
  Step 4: soft threshold, \hat{f}_j = [1 - \lambda/\hat{s}_j]_+ \hat{P}_j;
  Step 5: centre, \hat{f}_j \leftarrow \hat{f}_j - \mathrm{mean}(\hat{f}_j).
Output: component functions \hat{f}_j and estimator \hat{m}(X_i) = \sum_j \hat{f}_j(X_{ij}).

†The first two steps in the iterative algorithm are the usual backfitting
procedure; the remaining steps carry out functional soft thresholding.

Table 2. Co-ordinate descent lasso†

Input: data (X_i, Y_i), regularization parameter \lambda.
Initialize \hat\beta_j = 0, for j = 1, \dots, p.
Iterate until convergence, for each j = 1, \dots, p:
  Step 1: compute the residual, R_j = Y - \sum_{k \ne j} \hat\beta_k X_k;
  Step 2: project the residual onto X_j, P_j = X_j^T R_j;
  Step 3: soft threshold, \hat\beta_j = [1 - \lambda/|P_j|]_+ P_j.
Output: estimator \hat{m}(X_i) = \sum_j \hat\beta_j X_{ij}.

†The SPAM backfitting algorithm is a functional version of the co-ordinate descent
algorithm for the lasso, which computes \hat\beta = \arg\min\big( \frac{1}{2}\|Y - X\beta\|_2^2 + \lambda \|\beta\|_1 \big).

 
\hat{s}_j = \sqrt{\frac{1}{n} \sum_{i=1}^n \hat\beta_j^2 X_{ij}^2} = |\hat\beta_j|,
so the soft thresholding in step 4 of the SPAM backfitting algorithm is the same as the soft
thresholding in step 3 in the co-ordinate descent lasso algorithm.
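For illustration, the following is a minimal Python sketch of the backfitting loop in Table 1 (not part of the original paper). A Nadaraya-Watson kernel smoother stands in for the generic linear smoother S_j; the bandwidth, iteration count, and function names are our own assumptions, and Y is assumed to be centred.

import numpy as np

def kernel_smoother(x, r, h=0.1):
    # Nadaraya-Watson smoother of residuals r against covariate x
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (K @ r) / K.sum(axis=1)

def spam_backfit(X, Y, lam, n_iter=50, h=0.1):
    n, p = X.shape
    F = np.zeros((n, p))                          # fitted components f_j(X_ij)
    for _ in range(n_iter):
        for j in range(p):
            R_j = Y - F.sum(axis=1) + F[:, j]         # Step 1: partial residual
            P_j = kernel_smoother(X[:, j], R_j, h)    # Step 2: smooth the residual
            s_j = np.sqrt(np.mean(P_j ** 2)) + 1e-12  # Step 3: estimated norm
            f_j = max(0.0, 1.0 - lam / s_j) * P_j     # Step 4: soft threshold
            F[:, j] = f_j - f_j.mean()                # Step 5: centre
    return F                                      # m_hat(X_i) = F.sum(axis=1)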

3.3. Basis functions


It is useful to express the model in terms of basis functions. Recall that B_j = (\psi_{jk} : k = 1, 2, \dots)
is an orthonormal basis for T_j and that \sup_x |\psi_{jk}(x)| \le B for some B. Then

f_j(x_j) = \sum_{k=1}^\infty \beta_{jk} \psi_{jk}(x_j)   (16)

where \beta_{jk} = \int f_j(x_j)\, \psi_{jk}(x_j)\, dx_j.
Let us also define

\tilde{f}_j(x_j) = \sum_{k=1}^d \beta_{jk} \psi_{jk}(x_j)   (17)

where d = d_n is a truncation parameter. For the Sobolev space T_j of order 2 we have that
\|f_j - \tilde{f}_j\|^2 = O(1/d^4). Let S = \{j : f_j \ne 0\}. Assuming the sparsity condition |S| = O(1), it follows
that \|m - \tilde{m}\|^2 = O(1/d^4) where \tilde{m} = \sum_j \tilde{f}_j. The usual choice is d \asymp n^{1/5}, yielding truncation bias
\|m - \tilde{m}\|^2 = O(n^{-4/5}).
In this setting, the smoother can be taken to be the least squares projection onto the truncated
set of basis functions \{\psi_{j1}, \dots, \psi_{jd}\}; this is also called orthogonal series smoothing. Let
\Psi_j denote the n \times d_n matrix that is given by \Psi_j(i, l) = \psi_{j,l}(X_{ij}). The smoothing matrix is the
projection matrix S_j = \Psi_j (\Psi_j^T \Psi_j)^{-1} \Psi_j^T. In this case, the backfitting algorithm in Table 1 is a
co-ordinate descent algorithm for minimizing

\frac{1}{2n} \Big\| Y - \sum_{j=1}^p \Psi_j \beta_j \Big\|^2 + \lambda \sum_{j=1}^p \sqrt{\frac{1}{n} \beta_j^T \Psi_j^T \Psi_j \beta_j},
which is the sample version of equation (12). This is the Lagrangian of a second-order cone
program, and standard convexity theory implies the existence of a minimizer. In Section 6.1
we prove theoretical properties of SPAMs by assuming that this particular smoother is being used.
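For concreteness, here is a small sketch of this orthogonal series smoother (not from the paper). The cosine basis on [0, 1] is used as one uniformly bounded orthonormal basis; the basis choice, names, and the example data are our own assumptions.

import numpy as np

def basis_matrix(x, d):
    # n x d design of cosine basis functions psi_k on [0, 1]
    k = np.arange(1, d + 1)
    return np.sqrt(2.0) * np.cos(np.pi * k[None, :] * x[:, None])

def orthogonal_series_smoother(xj, r, d):
    """Least squares projection of the residual r onto the truncated basis
    {psi_1, ..., psi_d}: S_j r = Psi_j (Psi_j^T Psi_j)^{-1} Psi_j^T r."""
    Psi = basis_matrix(xj, d)
    coef, *_ = np.linalg.lstsq(Psi, r, rcond=None)
    return Psi @ coef

# example, with d of the order n^{1/5} as suggested for second-order smoothness
n = 200
rng = np.random.default_rng(1)
x = rng.uniform(size=n)
r = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)
d = max(3, int(np.ceil(n ** 0.2)))
fitted = orthogonal_series_smoother(x, r, d)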

3.4. Connection with the grouped lasso


The SPAM model can be thought of as a functional version of the grouped lasso (Yuan and Lin,
2006) as we now explain. Consider the following linear regression model with multiple factors:
Y = \sum_{j=1}^{p_n} X_j \beta_j + \varepsilon = X\beta + \varepsilon,   (18)

where Y is an n \times 1 response vector, \varepsilon is an n \times 1 vector of independent and identically distributed
mean 0 noise, X_j is an n \times d_j matrix corresponding to the jth factor and \beta_j is the
corresponding d_j \times 1 coefficient vector. Assume for convenience (in this subsection only) that
each X_j is orthogonal, so that X_j^T X_j = I_{d_j}, where I_{d_j} is the d_j \times d_j identity matrix. We use
X = (X_1, \dots, X_{p_n}) to denote the full design matrix and \beta = (\beta_1^T, \dots, \beta_{p_n}^T)^T to denote the
parameter.
The grouped lasso estimator is defined as the solution of the following convex optimization
problem:

\hat\beta(\lambda_n) = \arg\min_\beta \Big\{ \|Y - X\beta\|_2^2 + \lambda_n \sum_{j=1}^{p_n} \sqrt{d_j}\, \|\beta_j\| \Big\}   (19)

where \sqrt{d_j} scales the jth term to compensate for different group sizes.
It is obvious that, when dj = 1 for j = 1, . . . , pn , the grouped lasso becomes the standard lasso.
From the Karush-Kuhn-Tucker optimality conditions, a necessary and sufficient condition for
\hat\beta = (\hat\beta_1^T, \dots, \hat\beta_{p_n}^T)^T to be the grouped lasso solution is

-X_j^T (Y - X\hat\beta) + \frac{\lambda \sqrt{d_j}\, \hat\beta_j}{\|\hat\beta_j\|} = 0, \quad \forall\, \hat\beta_j \ne 0,   (20)

\|X_j^T (Y - X\hat\beta)\| \le \lambda \sqrt{d_j}, \quad \forall\, \hat\beta_j = 0.

On the basis of this stationary condition, an iterative blockwise co-ordinate descent algorithm
can be derived; as shown by Yuan and Lin (2006), a solution to equation (20) satisfies

\hat\beta_j = \Big[ 1 - \frac{\lambda \sqrt{d_j}}{\|S_j\|} \Big]_+ S_j   (21)

where S_j = X_j^T (Y - X\beta_{\setminus j}), with \beta_{\setminus j} = (\beta_1^T, \dots, \beta_{j-1}^T, 0^T, \beta_{j+1}^T, \dots, \beta_{p_n}^T)^T. By iteratively applying
equation (21), the grouped lasso solution can be obtained.
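A minimal numpy sketch of this blockwise iteration (not from the paper): it applies update (21) cyclically, assuming each block X_j has orthonormal columns; the block layout and iteration count are our own illustrative choices.

import numpy as np

def grouped_lasso(X_blocks, Y, lam, n_iter=100):
    """Blockwise coordinate descent for the grouped lasso via update (21);
    X_blocks is a list of n x d_j matrices with orthonormal columns."""
    betas = [np.zeros(Xj.shape[1]) for Xj in X_blocks]
    for _ in range(n_iter):
        for j, Xj in enumerate(X_blocks):
            fit_others = sum(Xk @ bk for k, (Xk, bk) in
                             enumerate(zip(X_blocks, betas)) if k != j)
            S_j = Xj.T @ (Y - fit_others)            # S_j = X_j^T (Y - X beta_{\j})
            norm_Sj = np.linalg.norm(S_j) + 1e-12
            scale = max(0.0, 1.0 - lam * np.sqrt(Xj.shape[1]) / norm_Sj)
            betas[j] = scale * S_j                   # update (21)
    return betas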
As discussed in Section 1, the COSSO model of Lin and Zhang (2006) replaces the lasso
constraint on \sum_j |\beta_j| with a reproducing kernel Hilbert space constraint. The advantage of our
formulation is that it decouples smoothness (g_j \in T_j) and sparsity (\sum_j |\beta_j| \le L). This leads to a
simple algorithm that can be carried out with any non-parametric smoother and scales easily
to high dimensions.

4. Choosing the regularization parameter


We choose \lambda by minimizing an estimate of the risk. Let \nu_j be the effective degrees of freedom
for the smoother on the jth variable, i.e. \nu_j = \mathrm{tr}(S_j) where S_j is the smoothing matrix for the
jth dimension. Also let \hat\sigma^2 be an estimate of the variance. Define the total effective degrees of
freedom as

\mathrm{df}(\lambda) = \sum_j \nu_j I(\|\hat{f}_j\| \ne 0).   (22)

Two estimates of risk are

C_p = \frac{1}{n} \sum_{i=1}^n \Big( Y_i - \sum_{j=1}^p \hat{f}_j(X_{ij}) \Big)^2 + \frac{2\hat\sigma^2}{n} \mathrm{df}(\lambda)   (23)

and

\mathrm{GCV}(\lambda) = \frac{(1/n) \sum_{i=1}^n \big( Y_i - \sum_j \hat{f}_j(X_{ij}) \big)^2}{\{1 - \mathrm{df}(\lambda)/n\}^2}.   (24)

The first is C_p and the second is generalized cross-validation but with degrees of freedom defined
by \mathrm{df}(\lambda). A proof that these are valid estimates of risk is not currently available; thus, they should
be regarded as heuristics.
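As a small illustration (not part of the paper), the two risk estimates can be computed from a fitted SPAM as follows; the array layout and the assumption that sigma2_hat and the smoother traces nu are supplied externally are our own.

import numpy as np

def spam_risk_estimates(Y, fitted_components, nu, sigma2_hat):
    """Cp (23) and GCV (24) for a SPAM fit.
    fitted_components: n x p array of f_hat_j(X_ij);
    nu: length-p vector of effective degrees of freedom tr(S_j)."""
    n = len(Y)
    active = np.linalg.norm(fitted_components, axis=0) > 0
    df = np.sum(np.asarray(nu)[active])                  # equation (22)
    rss = np.mean((Y - fitted_components.sum(axis=1)) ** 2)
    cp = rss + 2.0 * sigma2_hat * df / n                 # equation (23)
    gcv = rss / (1.0 - df / n) ** 2                      # equation (24)
    return cp, gcv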
On the basis of the results in Wasserman and Roeder (2007) about the lasso, it seems likely
that choosing \lambda by risk estimation can lead to overfitting. One can further clean the estimate by
testing H_0: f_j = 0 for all j such that \hat{f}_j \ne 0. For example, the tests in Fan and Jiang (2005) could
be used.

5. Examples
To illustrate the method, we consider a few examples.

5.1. Synthetic data


We generated n = 100 observations for an additive model with p = 100 and four relevant variables,

Y_i = \sum_{j=1}^4 f_j(X_{ij}) + \varepsilon_i,

where \varepsilon_i \sim N(0, 1); the relevant component functions are given by

f_1(x) = -\sin(1.5x),
f_2(x) = x^3 + 1.5(x - 0.5)^2,
f_3(x) = -\phi(x, 0.5, 0.8^2),
f_4(x) = \sin\{\exp(-0.5x)\},

where \phi(\cdot, 0.5, 0.8^2) is the Gaussian probability distribution function with mean 0.5 and standard
deviation 0.8. The data therefore have 96 irrelevant dimensions. The covariates are sampled
independent and identically distributed from \mathrm{Uniform}(-2.5, 2.5). All the component functions
are standardized, i.e.
Fig. 1. Simulated data: (a) empirical l_2-norm of the estimated components plotted against the regularization
parameter \lambda (the value on the x-axis is proportional to \sum_j \|\hat{f}_j\|); (b) C_p-scores against the amount of
regularization (the dotted vertical line marks the value of \lambda with the smallest C_p-score); estimated (solid)
versus true (dashed) additive component functions for (c)-(f) the first four relevant dimensions and (g)-(j)
the first four irrelevant dimensions ((c) l_1 = 97.05; (d) l_1 = 88.36; (e) l_1 = 90.65; (f) l_1 = 79.26; (g)-(j) l_1 = 0)
\frac{1}{n} \sum_{i=1}^n f_j(X_{ij}) = 0, \qquad \frac{1}{n-1} \sum_{i=1}^n f_j^2(X_{ij}) = 1.   (25)

The results of applying SPAMs are summarized in Fig. 1, using the plug-in bandwidths
h_j = 0.6\, \mathrm{sd}(X_j)/n^{1/5}.
Fig. 1(a) shows regularization paths as the parameter \lambda varies; each curve is a plot of \|\hat{f}_j(\lambda)\|
versus

\sum_{k=1}^p \|\hat{f}_k(\lambda)\| \Big/ \max_\lambda \sum_{k=1}^p \|\hat{f}_k(\lambda)\|   (26)

for a particular variable X_j. The estimates are generated efficiently over a sequence of \lambda-values
by 'warm starting' \hat{f}_j(\lambda_t) at the previous value \hat{f}_j(\lambda_{t-1}). Fig. 1(b) shows the C_p-statistic as a
function of regularization level.
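A sketch of this data-generating process in numpy (not from the paper; the seed and the vectorized helper are our own, while the four functions, the uniform covariates, and the standardization in (25) follow the text).

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 100
X = rng.uniform(-2.5, 2.5, size=(n, p))

f1 = lambda x: -np.sin(1.5 * x)
f2 = lambda x: x ** 3 + 1.5 * (x - 0.5) ** 2
f3 = lambda x: -np.exp(-0.5 * ((x - 0.5) / 0.8) ** 2) / (0.8 * np.sqrt(2 * np.pi))
f4 = lambda x: np.sin(np.exp(-0.5 * x))

def standardize(v):
    # centre and scale each component as in equation (25)
    return (v - v.mean()) / v.std(ddof=1)

components = np.column_stack([standardize(f(X[:, j]))
                              for j, f in enumerate([f1, f2, f3, f4])])
Y = components.sum(axis=1) + rng.standard_normal(n)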

5.2. Functional sparse coding


Olshausen and Field (1996) proposed a method of obtaining sparse representations of data such
as natural images; the motivation comes from trying to understand principles of neural coding.
In this example we suggest a non-parametric form of sparse coding.

Fig. 2. Comparison of sparse reconstructions by using (a) the lasso and (b) SPAMs
Let \{y^i\}_{i=1,\dots,N} be the data to be represented with respect to some learned basis, where each
instance y^i \in \mathbb{R}^n is an n-dimensional vector. The linear sparse coding optimization problem is

\min_{\beta, X} \sum_{i=1}^N \Big\{ \frac{1}{2n} \|y^i - X\beta^i\|^2 + \lambda \|\beta^i\|_1 \Big\}   (27)

such that

\|X_j\| \le 1.   (28)
Here X is an n × p matrix with columns Xj , representing the ‘dictionary’ entries or basis vec-
tors to be learned. It is not required that the basis vectors are orthogonal. The l1 -penalty on
the coefficients β i encourages sparsity, so each data vector yi is represented by only a small
number of dictionary elements. Sparsity allows the features to specialize, and to capture salient
properties of the data.
This optimization problem is not jointly convex in β i and X. However, for fixed X , each
weight vector β i is computed by running the lasso. For fixed β i , the optimization is similar to
ridge regression and can be solved efficiently. Thus, an iterative procedure for (approximately)
solving this optimization problem is easy to derive.
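A minimal sketch of that alternating scheme for the linear problem (27)-(28) is given below (not part of the paper). It uses scikit-learn's Lasso for the coefficient step, whose objective (1/(2n))||y - Xw||^2 + alpha||w||_1 matches (27) with alpha = lambda; the dictionary update, ridge jitter, and dictionary size are our own assumptions.

import numpy as np
from sklearn.linear_model import Lasso

def sparse_coding(Y, p, lam, n_outer=10, seed=0):
    """Alternating minimization for linear sparse coding: a lasso step for the
    codes, then least squares plus column renormalization for the dictionary."""
    rng = np.random.default_rng(seed)
    n, N = Y.shape                      # each column Y[:, i] is one data vector y^i
    X = rng.standard_normal((n, p))
    X /= np.maximum(np.linalg.norm(X, axis=0), 1.0)
    B = np.zeros((p, N))
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
    for _ in range(n_outer):
        for i in range(N):              # codes: one lasso problem per data vector
            B[:, i] = lasso.fit(X, Y[:, i]).coef_
        # dictionary: ridge-regularized least squares, then project to ||X_j|| <= 1
        X = Y @ B.T @ np.linalg.inv(B @ B.T + 1e-6 * np.eye(p))
        X /= np.maximum(np.linalg.norm(X, axis=0), 1.0)
    return X, B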
In the case of sparse coding of natural images, as in Olshausen and Field (1996), the basis vec-
tors Xj encode basic edge features at different scales and spatial orientations. In the functional
version, we no longer assume a linear parametric fit between the dictionary X and the data
y. Instead, we model the relationship by using an additive model. This leads to the following
optimization problem for functional sparse coding:

\min_{f, X} \sum_{i=1}^N \Big\{ \frac{1}{2n} \Big\| y^i - \sum_{j=1}^p f_j^i(X_j) \Big\|^2 + \lambda \sum_{j=1}^p \|f_j^i\| \Big\}   (29)

such that

\|X_j\| \le 1, \quad j = 1, \dots, p.   (30)
Fig. 2 illustrates the reconstruction of various image patches by using the sparse linear model
compared with the SPAM. Local linear smoothing was used with a Gaussian kernel having fixed
bandwidth h = 0:05 for all patches and all codewords. The codewords Xj are those obtained by
using the Olshausen-Field procedure; these become the design points in the regression estima-
tors. Thus, a codeword for a 16 × 16 patch corresponds to a vector Xj of dimension 256, with
each Xij the grey level for a particular pixel.

6. Theoretical properties
6.1. Sparsistency
In the case of linear regression, with f_j(X_j) = \beta_j^{*T} X_j, several researchers have shown that, under
certain conditions on n and p, the number of relevant variables s = |\mathrm{supp}(\beta^*)|, and the design
matrix X, the lasso recovers the sparsity pattern asymptotically, i.e. the lasso estimator \hat\beta_n is
sparsistent:

P\{\mathrm{supp}(\beta^*) = \mathrm{supp}(\hat\beta_n)\} \to 1.   (31)

Here, \mathrm{supp}(\beta) = \{j : \beta_j \ne 0\}. References include Wainwright (2006), Meinshausen and Bühlmann
(2006), Zou (2005), Fan and Li (2001) and Zhao and Yu (2007). We show a similar result for
SPAMs under orthogonal function regression.
In terms of an orthogonal basis \psi, we can write

Y_i = \sum_{j=1}^p \sum_{k=1}^\infty \beta_{jk}^* \psi_{jk}(X_{ij}) + \varepsilon_i.   (32)

To simplify the notation, let \beta_j be the d_n-dimensional vector \{\beta_{jk}, k = 1, \dots, d_n\} and let \Psi_j
be the n \times d_n matrix \Psi_j(i, k) = \psi_{jk}(X_{ij}). If A \subset \{1, \dots, p\}, we denote by \Psi_A the n \times d|A| matrix
where, for each j \in A, \Psi_j appears as a submatrix in the natural way.
We now analyse the sparse backfitting algorithm of Table 1 by assuming that an orthogonal
series smoother is used to estimate the conditional expectation in its step 2. As noted earlier, an
orthogonal series smoother for a predictor X_j is the least squares projection onto a truncated
set of basis functions \{\psi_{j1}, \dots, \psi_{jd}\}. Our optimization problem in this setting is

\min_\beta \Big\{ \frac{1}{2n} \Big\| Y - \sum_{j=1}^p \Psi_j \beta_j \Big\|^2 + \lambda \sum_{j=1}^p \sqrt{\frac{1}{n} \beta_j^T \Psi_j^T \Psi_j \beta_j} \Big\}.   (33)

Combined with the soft thresholding step, the update for f_j in the algorithm in Table 1 can thus
be seen to solve the problem

\min_\beta \Big\{ \frac{1}{2n} \|R_j - \Psi_j \beta_j\|_2^2 + \lambda_n \sqrt{\frac{1}{n} \beta_j^T \Psi_j^T \Psi_j \beta_j} \Big\}

where \|v\|_2^2 denotes \sum_{i=1}^n v_i^2 and R_j = Y - \sum_{l \ne j} \Psi_l \beta_l is the residual for f_j. The sparse backfitting
algorithm thus solves

\min_\beta \{R_n(\beta) + \lambda_n \Omega(\beta)\} = \min_\beta \Big\{ \frac{1}{2n} \Big\| Y - \sum_{j=1}^p \Psi_j \beta_j \Big\|_2^2 + \lambda_n \sum_{j=1}^p \frac{1}{\sqrt{n}} \|\Psi_j \beta_j\|_2 \Big\}   (34)

where R_n denotes the squared error term and \Omega denotes the regularization term, and each \beta_j is
a d_n-dimensional vector. Let S denote the true set of variables \{j : f_j \ne 0\}, with s = |S|, and let
S^c denote its complement. Let \hat{S}_n = \{j : \hat\beta_j \ne 0\} denote the estimated set of variables from the
minimizer \hat\beta_n, with corresponding function estimates \hat{f}_j(x_j) = \sum_{k=1}^{d_n} \hat\beta_{jk} \psi_{jk}(x_j). For the results in
this section, we shall treat the covariates as fixed. A preliminary version of the following result
is stated, without proof, in Ravikumar et al. (2008).
Theorem 2. Suppose that the following conditions hold on the design matrix X in the orthogonal
basis \psi:

\Lambda_{\max}\Big( \frac{1}{n} \Psi_S^T \Psi_S \Big) \le C_{\max} < \infty,   (35)

\Lambda_{\min}\Big( \frac{1}{n} \Psi_S^T \Psi_S \Big) \ge C_{\min} > 0,   (36)

\max_{j \in S^c} \Big\| \frac{1}{n} \Psi_j^T \Psi_S \Big( \frac{1}{n} \Psi_S^T \Psi_S \Big)^{-1} \Big\| \le \sqrt{\frac{C_{\min}}{C_{\max}}} \frac{1 - \delta}{\sqrt{s}}, \quad \text{for some } 0 < \delta \le 1.   (37)

Assume that the truncation dimension d_n satisfies d_n \to \infty and d_n = o(n). Furthermore, suppose
the following conditions, which relate the regularization parameter \lambda_n to the design
parameters n and p, the number of relevant variables s and the truncation size d_n:

\frac{s}{d_n \lambda_n} \to 0,   (38)

\frac{d_n \log\{d_n (p - s)\}}{n \lambda_n^2} \to 0,   (39)

\frac{1}{\rho_n^*} \Big\{ \sqrt{\frac{\log(s d_n)}{n}} + \frac{s^{3/2}}{d_n} + \lambda_n \sqrt{s d_n} \Big\} \to 0   (40)

where \rho_n^* = \min_{j \in S} \|\beta_j^*\|_\infty. Then the solution \hat\beta_n to problem (33) is unique and satisfies \hat{S}_n = S
with probability approaching 1.
This result parallels the theorem of Wainwright (2006) on model selection consistency of
the lasso; however, technical subtleties arise because of the truncation dimension d_n, which is
increasing with sample size, and the matrix \Psi_j^T \Psi_j which appears in the regularization of \beta_j. As
a result, the operator norm rather than the \infty-norm appears in the incoherence condition (37).
Note, however, that condition (37) implies that

\|\Psi_{S^c}^T \Psi_S (\Psi_S^T \Psi_S)^{-1}\|_\infty = \max_{j \in S^c} \|\Psi_j^T \Psi_S (\Psi_S^T \Psi_S)^{-1}\|_\infty   (41)

\le (1 - \delta) \sqrt{\frac{C_{\min} d_n}{C_{\max}}}   (42)

since (1/\sqrt{n}) \|A\|_\infty \le \|A\| \le \sqrt{m}\, \|A\|_\infty for an m \times n matrix A. This relates it to the more standard
incoherence conditions that have been used for sparsistency in the case of the lasso.
The following corollary, which imposes the additional condition that the number of relevant
variables is bounded, follows directly. It makes explicit how to choose the design parameters dn
and λn , and implies a condition on the fastest rate at which the minimum norm ρÅn can approach 0.
Corollary 1. Suppose that s = O(1), and assume that the design conditions (35)-(37) hold. If
the truncation dimension d_n, regularization parameter \lambda_n and minimum norm \rho_n^* satisfy

d_n \asymp n^{1/3},   (43)

\lambda_n \asymp \frac{\log(np)}{n^{1/3}},   (44)

\frac{1}{\rho_n^*} = o\Big( \frac{n^{1/6}}{\log(np)} \Big)   (45)

then P(\hat{S}_n = S) \to 1.

The following proposition clarifies the implications of condition (45), by relating the sup-norm
\|\beta_j\|_\infty to the function norm \|f_j\|_2.
Proposition 1. Suppose that f(x) = \sum_k \beta_k \psi_k(x) is in the Sobolev space of order \nu > 1/2, so that
\sum_{i=1}^\infty \beta_i^2 i^{2\nu} \le C^2 for some constant C. Then

\|f\|_2 = \|\beta\|_2 \le c \|\beta\|_\infty^{2\nu/(2\nu+1)}   (46)

for some constant c.

For instance, the result of corollary 1 allows the norms of the coefficients \beta_j to decrease as
\|\beta_j\|_\infty = \log^2(np)/n^{1/6}. In the case \nu = 2, this would allow the norms \|f_j\|_2 of the relevant
functions to approach 0 at the rate \log^{8/5}(np)/n^{2/15}.
6.2. Persistence
The previous assumptions are very strong. They can be weakened at the expense of obtaining
weaker results. In particular, in this section we do not assume that the true regression function
is additive. We use arguments like those in Juditsky and Nemirovski (2000) and Greenshtein
and Ritov (2004) in the context of linear models. In this section we treat X as random and we
use triangular array asymptotics, i.e. the joint distribution for the data can change with n. Let
(X, Y) denote a new pair (independent of the observed data) and define the predictive risk when
predicting Y with v(X) by

R(v) = E\{Y - v(X)\}^2.   (47)

When v(x) = \sum_j \beta_j g_j(x_j) we also write the risk as R(\beta, g) where \beta = (\beta_1, \dots, \beta_p) and g =
(g_1, \dots, g_p). Following Greenshtein and Ritov (2004) we say that an estimator \hat{m}_n is persistent
(risk consistent) relative to a class of functions M_n, if

R(\hat{m}_n) - R(m_n^*) \stackrel{P}{\to} 0   (48)

where

m_n^* = \arg\min_{v \in M_n} \{R(v)\}   (49)

is the predictive oracle. Greenshtein and Ritov (2004) showed that the lasso is persistent for
M_n = \{l(x) = x^T \beta : \|\beta\|_1 \le L_n\} and L_n = o\{(n/\log n)^{1/4}\}. Note that m_n^* is the best linear approximation
(in prediction risk) in M_n but the true regression function is not assumed to be linear.
Here we show a similar result for SPAMs.
In this section, we assume that the SPAM estimator \hat{m}_n is chosen to minimize

\frac{1}{n} \sum_{i=1}^n \Big\{ Y_i - \sum_j \beta_j g_j(X_{ij}) \Big\}^2   (50)

subject to \|\beta\|_1 \le L_n and g_j \in T_j. We make no assumptions about the design matrix. Let M_n \equiv
M_n(L_n) be defined by

M_n = \Big\{ m : m(x) = \sum_{j=1}^{p_n} \beta_j g_j(x_j) : E(g_j) = 0, \; E(g_j^2) = 1, \; \sum_j |\beta_j| \le L_n \Big\}   (51)

and let m_n^* = \arg\min_{v \in M_n} \{R(v)\}.


Theorem 3. Suppose that p_n \le \exp(n^\xi) for some \xi < 1. Then,

R(\hat{m}_n) - R(m_n^*) = O_P\Big( \frac{L_n^2}{n^{(1-\xi)/2}} \Big)   (52)

and hence, if L_n = o(n^{(1-\xi)/4}), then the SPAM is persistent.

7. Discussion
The results that are presented here show how many of the recently established theoretical prop-
erties of l1 -regularization for linear models extend to SPAMs. The sparse backfitting algorithm
that we have derived is attractive because it decouples smoothing and sparsity, and can be used
with any non-parametric smoother. It thus inherits the nice properties of the original backfitting
procedure. However, our theoretical analyses have made use of a particular form of smoothing,
using a truncated orthogonal basis. An important problem is thus to extend the theory to cover
more general classes of smoothing operators. Convergence properties of the SPAM backfitting
algorithm should also be investigated; convergence of special cases of standard backfitting was
studied by Buja et al. (1989).
An additional direction for future work is to develop procedures for automatic bandwidth
selection in each dimension. We have used plug-in bandwidths and truncation dimensions dn in
our experiments and theory. It is of particular interest to develop procedures that are adaptive
to different levels of smoothness in different dimensions. It would also be of interest to consider
more general penalties of the form pλ .fj /, as in Fan and Li (2001).
Finally, we note that, although we have considered basic additive models that allow functions
of individual variables, it is natural to consider interactions, as in the functional analysis-of-
variance model. One challenge is to formulate suitable incoherence conditions on the functions
that enable regularization-based procedures or greedy algorithms to recover the correct inter-
action graph. In the parametric setting, one result in this direction is Wainwright et al. (2007).

Acknowledgements
This research was supported in part by National Science Foundation grant CCF-0625879 and
a Siebel scholarship to PR.

Appendix A: Proofs
A.1. Proof of theorem 1
Consider the minimization of the Lagrangian

\min_{\{f_j \in H_j\}} \{L(f, \lambda)\} \equiv \frac{1}{2} E\Big\{ Y - \sum_{j=1}^p f_j(X_j) \Big\}^2 + \lambda \sum_{j=1}^p \sqrt{E\{f_j(X_j)^2\}}   (53)

with respect to f_j \in H_j, holding the other components \{f_k, k \ne j\} fixed. The stationary condition is obtained
by setting the Fréchet derivative to 0. Denote by \partial_j L(f, \lambda; \eta_j) the directional derivative with respect to f_j
in the direction \eta_j(X_j) \in H_j \{E(\eta_j) = 0, E(\eta_j^2) < \infty\}. Then the stationary condition can be formulated as

\partial_j L(f, \lambda; \eta_j) = \tfrac{1}{2} E\{(f_j - R_j + \lambda v_j)\eta_j\} = 0   (54)

where R_j = Y - \sum_{k \ne j} f_k is the residual for f_j, and v_j \in H_j is an element of the subgradient \partial \sqrt{E(f_j^2)},
satisfying v_j = f_j/\sqrt{E(f_j^2)} if E(f_j^2) \ne 0 and v_j \in \{u_j \in H_j \,|\, E(u_j^2) \le 1\} otherwise.
Using iterated expectations, the above condition can be rewritten as

E[\{f_j + \lambda v_j - E(R_j | X_j)\}\eta_j] = 0.   (55)

But, since f_j - E(R_j | X_j) + \lambda v_j \in H_j, we can compute the derivative in the direction \eta_j = f_j - E(R_j | X_j) +
\lambda v_j \in H_j, implying that

E[\{f_j(x_j) - E(R_j | X_j = x_j) + \lambda v_j(x_j)\}^2] = 0,   (56)

i.e.

f_j + \lambda v_j = E(R_j | X_j) \quad \text{almost everywhere.}   (57)

Denote the conditional expectation E(R_j | X_j)—also the projection of the residual R_j onto H_j—by P_j.
Now, if E(f_j^2) \ne 0, then v_j = f_j/\sqrt{E(f_j^2)}, which from condition (57) implies

\sqrt{E(P_j^2)} = \sqrt{E[\{f_j + \lambda f_j/\sqrt{E(f_j^2)}\}^2]}   (58)
            = \Big( 1 + \frac{\lambda}{\sqrt{E(f_j^2)}} \Big) \sqrt{E(f_j^2)}   (59)
            = \sqrt{E(f_j^2)} + \lambda   (60)
            \ge \lambda.   (61)

If E(f_j^2) = 0, then f_j = 0 almost everywhere, and E(v_j^2) \le 1. Equation (57) then implies that

\sqrt{E(P_j^2)} \le \lambda.   (62)

We thus obtain the equivalence

\sqrt{E(P_j^2)} \le \lambda \iff f_j = 0 \quad \text{almost everywhere.}   (63)

Rewriting equation (57) in light of result (63), we obtain

\Big( 1 + \frac{\lambda}{\sqrt{E(f_j^2)}} \Big) f_j = P_j \quad \text{if } \sqrt{E(P_j^2)} > \lambda, \qquad f_j = 0 \quad \text{otherwise.}

Using equation (60), we thus arrive at the soft thresholding update for f_j:

f_j = \Big[ 1 - \frac{\lambda}{\sqrt{E(P_j^2)}} \Big]_+ P_j   (64)

where [\cdot]_+ denotes the positive part and P_j = E[R_j | X_j].

A.2. Proof of theorem 2


A vector \hat\beta \in \mathbb{R}^{d_n p} is an optimum of the objective function in expression (34) if and only if there is a
subgradient \hat{g} \in \partial\Omega(\hat\beta), such that

\frac{1}{n} \Psi^T \Big( \sum_j \Psi_j \hat\beta_j - Y \Big) + \lambda_n \hat{g} = 0.   (65)

The subdifferential \partial\Omega(\beta) is the set of vectors g \in \mathbb{R}^{p d_n} satisfying

g_j = \frac{(1/n)\Psi_j^T \Psi_j \beta_j}{\sqrt{(1/n)\beta_j^T \Psi_j^T \Psi_j \beta_j}} \quad \text{if } \beta_j \ne 0, \qquad g_j^T \Big( \frac{1}{n}\Psi_j^T \Psi_j \Big)^{-1} g_j \le 1 \quad \text{if } \beta_j = 0.

Our argument is based on the technique of a primal dual witness, which has been used previously in
the analysis of the lasso (Wainwright, 2006). In particular, we construct a coefficient subgradient pair
(\hat\beta, \hat{g}) which satisfies \mathrm{supp}(\hat\beta) = \mathrm{supp}(\beta^*) and in addition satisfies the optimality conditions for the
objective (34) with high probability. Thus, when the procedure succeeds, the constructed coefficient vector \hat\beta
is equal to the solution of the convex objective (34), and \hat{g} is an optimal solution to its dual. From its
construction, the support of \hat\beta is equal to the true support \mathrm{supp}(\beta^*), from which we can conclude that
the solution of the objective (34) is sparsistent. The construction of the primal dual witness proceeds as
follows.
(a) Set \hat\beta_{S^c} = 0.
(b) Set \hat{g}_S = \partial\Omega(\beta^*)_S.
(c) With these settings of \hat\beta_{S^c} and \hat{g}_S, obtain \hat\beta_S and \hat{g}_{S^c} from the stationary conditions in equation (65).
For the witness procedure to succeed, we must show that (\hat\beta, \hat{g}) is optimal for the objective (34), meaning
that

\hat\beta_j \ne 0 \quad \text{for } j \in S,   (66a)

g_j^T \Big( \frac{1}{n}\Psi_j^T \Psi_j \Big)^{-1} g_j < 1 \quad \text{for } j \in S^c.   (66b)

For uniqueness of the solution, we require strict dual feasibility, meaning strict inequality in condition
(66b).
A.2.1. Condition (66a)
Setting \hat\beta_{S^c} = 0 and

\hat{g}_j = \frac{(1/n)\Psi_j^T \Psi_j \beta_j^*}{\sqrt{(1/n)\beta_j^{*T} \Psi_j^T \Psi_j \beta_j^*}} \quad \text{for } j \in S,

the stationarity condition for \hat\beta_S is given by

\frac{1}{n} \Psi_S^T (\Psi_S \hat\beta_S - Y) + \lambda_n \hat{g}_S = 0.   (67)

Let V = Y - \Psi_S \beta_S^* - W denote the error due to finite truncation of the orthogonal basis, where W =
(\varepsilon_1, \dots, \varepsilon_n)^T. Then the stationarity condition (67) can be simplified as

\frac{1}{n} \Psi_S^T \Psi_S (\hat\beta_S - \beta_S^*) - \frac{1}{n}\Psi_S^T W - \frac{1}{n}\Psi_S^T V + \lambda_n \hat{g}_S = 0,

so that

\hat\beta_S - \beta_S^* = \Big( \frac{1}{n}\Psi_S^T \Psi_S \Big)^{-1} \Big( \frac{1}{n}\Psi_S^T W + \frac{1}{n}\Psi_S^T V - \lambda_n \hat{g}_S \Big),   (68)

where we have used the assumption that (1/n)\Psi_S^T \Psi_S is non-singular. Recalling our definition of the minimum
function norm \rho_n^* = \min_{j \in S} \|\beta_j^*\|_\infty > 0, it suffices to show that \|\hat\beta_S - \beta_S^*\|_\infty < \rho_n^*/2, to ensure that

\mathrm{supp}(\beta_S^*) = \mathrm{supp}(\hat\beta_S) = \{j : \|\hat\beta_j\|_\infty \ne 0\},

so that condition (66a) would be satisfied. Using \Sigma_{SS} = (1/n)(\Psi_S^T \Psi_S) to simplify the notation, we have the
l_\infty-bound,

\|\hat\beta_S - \beta_S^*\|_\infty \le \underbrace{\Big\| \Sigma_{SS}^{-1} \frac{1}{n}\Psi_S^T W \Big\|_\infty}_{T_1} + \underbrace{\Big\| \Sigma_{SS}^{-1} \frac{1}{n}\Psi_S^T V \Big\|_\infty}_{T_2} + \lambda_n \underbrace{\|\Sigma_{SS}^{-1} \hat{g}_S\|_\infty}_{T_3}.   (69)

We now proceed to bound the quantities T_1, T_2 and T_3.

A.2.2. Bounding T3
Note that, for j \in S,

1 = g_j^T \Big( \frac{1}{n}\Psi_j^T \Psi_j \Big)^{-1} g_j \ge \frac{1}{C_{\max}} \|g_j\|^2,

and thus \|g_j\| \le \sqrt{C_{\max}}. Noting further that

\|g_S\|_\infty = \max_{j \in S}(\|g_j\|_\infty) \le \max_{j \in S}(\|g_j\|_2) \le \sqrt{C_{\max}},   (70)

it follows that

T_3 := \|\Sigma_{SS}^{-1} \hat{g}_S\|_\infty \le \sqrt{C_{\max}}\, \|\Sigma_{SS}^{-1}\|_\infty.   (71)

A.2.3. Bounding T2
We proceed in two steps; we first bound \|V\|_\infty and use this to bound \|(1/n)\Psi_S^T V\|_\infty. Note that, as we are
working over the Sobolev spaces S_j of order 2,

|V_i| = \Big| \sum_{j \in S} \sum_{k=d_n+1}^\infty \beta_{jk}^* \psi_{jk}(X_{ij}) \Big| \le B \sum_{j \in S} \sum_{k=d_n+1}^\infty |\beta_{jk}^*|
     = B \sum_{j \in S} \sum_{k=d_n+1}^\infty \frac{|\beta_{jk}^*| k^2}{k^2} \le B \sum_{j \in S} \sqrt{\sum_{k=d_n+1}^\infty \beta_{jk}^{*2} k^4} \sqrt{\sum_{k=d_n+1}^\infty \frac{1}{k^4}}
     \le s B C \sqrt{\sum_{k=d_n+1}^\infty \frac{1}{k^4}} \le \frac{sB}{d_n^{3/2}},

for some constant B > 0. It follows that

\Big| \frac{1}{n} (\Psi_S^T V)_{jk} \Big| = \Big| \frac{1}{n} \sum_i \psi_{jk}(X_{ij}) V_i \Big| \le B \|V\|_\infty \le \frac{Ds}{d_n^{3/2}},   (72)

where D denotes a generic constant. Thus,

T_2 := \Big\| \Sigma_{SS}^{-1} \frac{1}{n}\Psi_S^T V \Big\|_\infty \le \|\Sigma_{SS}^{-1}\|_\infty \frac{Ds}{d_n^{3/2}}.   (73)

A.2.4. Bounding T1
Let Z = T_1 = \Sigma_{SS}^{-1}(1/n)\Psi_S^T W. Note that W \sim N(0, \sigma^2 I), so that Z is Gaussian as well, with mean 0. Consider
its lth component, Z_l = e_l^T Z. Then E(Z_l) = 0, and

\mathrm{var}(Z_l) = \frac{\sigma^2}{n} e_l^T \Sigma_{SS}^{-1} e_l \le \frac{\sigma^2}{C_{\min} n}.

By Gaussian comparison results (Ledoux and Talagrand, 1991), we have then that

E(\|Z\|_\infty) \le 3 \sqrt{\log(s d_n)\, \|\mathrm{var}(Z)\|_\infty} \le 3\sigma \sqrt{\frac{\log(s d_n)}{n C_{\min}}}.   (74)

Substituting the bounds for T_2 and T_3 from equations (73) and (71) respectively into equation (69),
and using the bound for the expected value of T_1 from inequality (74), it follows from an application of
Markov's inequality that

P\Big( \|\hat\beta_S - \beta_S^*\|_\infty > \frac{\rho_n^*}{2} \Big) \le P\Big( \|Z\|_\infty + \|\Sigma_{SS}^{-1}\|_\infty (D s d_n^{-3/2} + \lambda_n \sqrt{C_{\max}}) > \frac{\rho_n^*}{2} \Big)
   \le \frac{2}{\rho_n^*} \big\{ E(\|Z\|_\infty) + \|\Sigma_{SS}^{-1}\|_\infty (D s d_n^{-3/2} + \lambda_n \sqrt{C_{\max}}) \big\}
   \le \frac{2}{\rho_n^*} \Big\{ 3\sigma \sqrt{\frac{\log(s d_n)}{n C_{\min}}} + \|\Sigma_{SS}^{-1}\|_\infty \Big( \frac{Ds}{d_n^{3/2}} + \lambda_n \sqrt{C_{\max}} \Big) \Big\},

which converges to 0 under the condition that

\frac{1}{\rho_n^*} \Big\{ \sqrt{\frac{\log(s d_n)}{n}} + \Big\| \Big( \frac{1}{n}\Psi_S^T \Psi_S \Big)^{-1} \Big\|_\infty \Big( \frac{s}{d_n^{3/2}} + \lambda_n \Big) \Big\} \to 0.   (75)

Noting that

\Big\| \Big( \frac{1}{n}\Psi_S^T \Psi_S \Big)^{-1} \Big\|_\infty \le \frac{\sqrt{s d_n}}{C_{\min}},   (76)

it follows that condition (75) holds when

\frac{1}{\rho_n^*} \Big\{ \sqrt{\frac{\log(s d_n)}{n}} + \frac{s^{3/2}}{d_n} + \lambda_n \sqrt{s d_n} \Big\} \to 0.   (77)

But this is satisfied by assumption (40) in the theorem. We have thus shown that condition (66a) is satisfied
with probability converging to 1.

A.2.5. Condition (66b)


We now must consider the dual variables \hat{g}_{S^c}. Recall that we have set \hat\beta_{S^c} = \beta_{S^c}^* = 0. The stationarity
condition for j \in S^c is thus given by

\frac{1}{n} \Psi_j^T (\Psi_S \hat\beta_S - \Psi_S \beta_S^* - W - V) + \lambda_n \hat{g}_j = 0.

It then follows from equation (68) that

\hat{g}_{S^c} = \frac{1}{\lambda_n} \Big\{ \frac{1}{n} \Psi_{S^c}^T \Psi_S (\beta_S^* - \hat\beta_S) + \frac{1}{n} \Psi_{S^c}^T (W + V) \Big\}
        = \frac{1}{\lambda_n} \Big\{ \frac{1}{n} \Psi_{S^c}^T \Psi_S \Big( \frac{1}{n}\Psi_S^T \Psi_S \Big)^{-1} \Big( \lambda_n \hat{g}_S - \frac{1}{n}\Psi_S^T W - \frac{1}{n}\Psi_S^T V \Big) + \frac{1}{n} \Psi_{S^c}^T (W + V) \Big\},

so

\hat{g}_{S^c} = \frac{1}{\lambda_n} \Big\{ \Sigma_{S^c S} \Sigma_{SS}^{-1} \Big( \lambda_n \hat{g}_S - \frac{1}{n}\Psi_S^T W - \frac{1}{n}\Psi_S^T V \Big) + \frac{1}{n} \Psi_{S^c}^T (W + V) \Big\}.   (78)

Condition (66b) requires that

g_j^T \Big( \frac{1}{n}\Psi_j^T \Psi_j \Big)^{-1} g_j < 1,   (79)

for all j \in S^c. Since

g_j^T \Big( \frac{1}{n}\Psi_j^T \Psi_j \Big)^{-1} g_j \le \frac{1}{C_{\min}} \|g_j\|^2   (80)

it suffices to show that \max_{j \in S^c} \|g_j\| < \sqrt{C_{\min}}. From equation (78), we see that \hat{g}_j is Gaussian, with mean

\mu_j = E(\hat{g}_j) = \Sigma_{jS} \Sigma_{SS}^{-1} \Big( \hat{g}_S - \frac{1}{\lambda_n} \frac{1}{n}\Psi_S^T V \Big) + \frac{1}{\lambda_n} \frac{1}{n}\Psi_j^T V.

This can be bounded as

\|\mu_j\| \le \|\Sigma_{jS} \Sigma_{SS}^{-1}\| \Big( \|\hat{g}_S\| + \frac{1}{\lambda_n} \Big\| \frac{1}{n}\Psi_S^T V \Big\| \Big) + \frac{1}{\lambda_n} \Big\| \frac{1}{n}\Psi_j^T V \Big\|
       = \|\Sigma_{jS} \Sigma_{SS}^{-1}\| \Big( \sqrt{s C_{\max}} + \frac{1}{\lambda_n} \Big\| \frac{1}{n}\Psi_S^T V \Big\| \Big) + \frac{1}{\lambda_n} \Big\| \frac{1}{n}\Psi_j^T V \Big\|.   (81)

Using the bound \|(1/n)\Psi_j^T V\|_\infty \le D s/d_n^{3/2} from equation (72), we have

\Big\| \frac{1}{n}\Psi_j^T V \Big\| \le \sqrt{d_n} \Big\| \frac{1}{n}\Psi_j^T V \Big\|_\infty \le \frac{Ds}{d_n},

and hence

\Big\| \frac{1}{n}\Psi_S^T V \Big\| \le \sqrt{s d_n} \Big\| \frac{1}{n}\Psi_S^T V \Big\|_\infty \le \frac{D s^{3/2}}{d_n}.

Substituting in the bound (81) on the mean \mu_j,

\|\mu_j\| \le \|\Sigma_{jS} \Sigma_{SS}^{-1}\| \Big( \sqrt{s C_{\max}} + \frac{D s^{3/2}}{\lambda_n d_n} \Big) + \frac{Ds}{\lambda_n d_n}.   (82)

Assumptions (37) and (38) of the theorem can be rewritten as

\|\Sigma_{jS} \Sigma_{SS}^{-1}\| \le \sqrt{\frac{C_{\min}}{C_{\max}}} \frac{1 - \delta}{\sqrt{s}} \quad \text{for some } \delta > 0,   (83)

\frac{s}{\lambda_n d_n} \to 0.   (84)

Thus the bound on the mean becomes

\|\mu_j\| \le \sqrt{C_{\min}} (1 - \delta) + \frac{2Ds}{\lambda_n d_n} < \sqrt{C_{\min}},

for sufficiently large n. It therefore suffices, for condition (66b) to be satisfied, to show that

P\Big( \max_{j \in S^c} \|\hat{g}_j - \mu_j\|_\infty > \frac{\delta}{2\sqrt{d_n}} \Big) \to 0,   (85)

since this implies that

\|\hat{g}_j\| \le \|\mu_j\| + \|\hat{g}_j - \mu_j\| \le \|\mu_j\| + \sqrt{d_n}\, \|\hat{g}_j - \mu_j\|_\infty \le \sqrt{C_{\min}} (1 - \delta) + \frac{\delta}{2} + o(1),

with probability approaching 1. To show result (85), we again appeal to Gaussian comparison results.
Define

Z_j = \Psi_j^T \big( I - \Psi_S (\Psi_S^T \Psi_S)^{-1} \Psi_S^T \big) \frac{W}{n},   (86)

for j \in S^c. Then the Z_j are zero-mean Gaussian random variables, and we need to show that

P\Big( \max_{j \in S^c} \frac{\|Z_j\|_\infty}{\lambda_n} \ge \frac{\delta}{2\sqrt{d_n}} \Big) \to 0.   (87)

A calculation shows that E(Z_{jk}^2) \le \sigma^2/n. Therefore, we have by Markov's inequality and Gaussian comparison
that

P\Big( \max_{j \in S^c} \frac{\|Z_j\|_\infty}{\lambda_n} \ge \frac{\delta}{2\sqrt{d_n}} \Big) \le \frac{2\sqrt{d_n}}{\delta \lambda_n} E\big( \max_{jk} |Z_{jk}| \big)
   \le \frac{2\sqrt{d_n}}{\delta \lambda_n} \Big[ 3 \sqrt{\log\{(p - s) d_n\}} \max_{jk} \sqrt{E(Z_{jk}^2)} \Big]
   \le \frac{6\sigma}{\delta \lambda_n} \sqrt{\frac{d_n \log\{(p - s) d_n\}}{n}},

which converges to 0 given the assumption (39) of the theorem that

\frac{\lambda_n^2 n}{d_n \log\{(p - s) d_n\}} \to \infty.

Thus condition (66b) is also satisfied with probability converging to 1, which completes the proof.

A.3. Proof of proposition 1


For any index k we have that

\|f\|_2^2 = \sum_{i=1}^\infty \beta_i^2   (88)
        \le \|\beta\|_\infty \sum_{i=1}^\infty |\beta_i|   (89)
        = \|\beta\|_\infty \sum_{i=1}^k |\beta_i| + \|\beta\|_\infty \sum_{i=k+1}^\infty |\beta_i|   (90)
        \le k \|\beta\|_\infty^2 + \|\beta\|_\infty \sum_{i=k+1}^\infty \frac{i^\nu |\beta_i|}{i^\nu}   (91)
        \le k \|\beta\|_\infty^2 + \|\beta\|_\infty \sqrt{\sum_{i=1}^\infty \beta_i^2 i^{2\nu}} \sqrt{\sum_{i=k+1}^\infty \frac{1}{i^{2\nu}}}   (92)
        \le k \|\beta\|_\infty^2 + \|\beta\|_\infty C \sqrt{\frac{k^{1-2\nu}}{2\nu - 1}},   (93)

where the last inequality uses the bound

\sum_{i=k+1}^\infty i^{-2\nu} \le \int_k^\infty x^{-2\nu} \, dx = \frac{k^{1-2\nu}}{2\nu - 1}.   (94)

Let k^* be the index that minimizes expression (93). Some calculus shows that k^* satisfies

c_1 \|\beta\|_\infty^{-2/(2\nu+1)} \le k^* \le c_2 \|\beta\|_\infty^{-2/(2\nu+1)}   (95)

for some constants c_1 and c_2. Using the above expression in expression (93) then yields

\|f\|_2^2 \le \|\beta\|_\infty \big( c_2 \|\beta\|_\infty^{(2\nu-1)/(2\nu+1)} + c_1 \|\beta\|_\infty^{(2\nu-1)/(2\nu+1)} \big)   (96)
        = c \|\beta\|_\infty^{4\nu/(2\nu+1)}   (97)

for some constant c, and the result follows.

A.4. Proof of theorem 3


We begin with some notation. If M is a class of functions then the L_\infty bracketing number N_{[\,]}(\varepsilon, M) is
defined as the smallest number of pairs B = \{(l_1, u_1), \dots, (l_k, u_k)\} such that \|u_j - l_j\|_\infty \le \varepsilon, 1 \le j \le k, and
such that for every m \in M there exists (l, u) \in B such that l \le m \le u. For the Sobolev space T_j,

\log\{N_{[\,]}(\varepsilon, T_j)\} \le K \Big( \frac{1}{\varepsilon} \Big)^{1/2}   (98)

for some K > 0; see van der Vaart (1998). The bracketing integral is defined to be

J_{[\,]}(\delta, M) = \int_0^\delta \sqrt{\log\{N_{[\,]}(u, M)\}} \, du.   (99)

From corollary 19.35 of van der Vaart (1998),

E\Big( \sup_{g \in M} |\hat\mu(g) - \mu(g)| \Big) \le \frac{C\, J_{[\,]}(\|F\|_\infty, M)}{\sqrt{n}}   (100)

for some C > 0, where F(x) = \sup_{g \in M} |g(x)|, \mu(g) = E\{g(X)\} and \hat\mu(g) = n^{-1} \sum_{i=1}^n g(X_i).
Set Z \equiv (Z_0, \dots, Z_p) = (Y, X_1, \dots, X_p) and note that

R(\beta, g) = \sum_{j=0}^p \sum_{k=0}^p \beta_j \beta_k E\{g_j(Z_j)\, g_k(Z_k)\}   (101)

where we define g_0(z_0) = z_0 and \beta_0 = -1. Also define

\hat{R}(\beta, g) = \frac{1}{n} \sum_{i=1}^n \sum_{j=0}^p \sum_{k=0}^p \beta_j \beta_k g_j(Z_{ij})\, g_k(Z_{ik}).   (102)

Hence \hat{m}_n is the minimizer of \hat{R}(\beta, g) subject to the constraint \sum_j \beta_j g_j(x_j) \in M_n(L_n) and g_j \in T_j. For all
(\beta, g),

|\hat{R}(\beta, g) - R(\beta, g)| \le \|\beta\|_1^2 \max_{jk} \sup_{g_j \in S_j,\, g_k \in S_k} |\hat\mu_{jk}(g) - \mu_{jk}(g)|   (103)

where

\hat\mu_{jk}(g) = n^{-1} \sum_{i=1}^n g_j(Z_{ij})\, g_k(Z_{ik})

and \mu_{jk}(g) = E\{g_j(Z_j)\, g_k(Z_k)\}. From inequality (98) it follows that

\log\{N_{[\,]}(\varepsilon, M_n)\} \le 2 \log(p_n) + K \Big( \frac{1}{\varepsilon} \Big)^{1/2}.   (104)

Hence, J_{[\,]}(C, M_n) = O\{\sqrt{\log(p_n)}\} and it follows from inequality (100) and Markov's inequality that

\max_{jk} \sup_{g_j \in S_j,\, g_k \in S_k} |\hat\mu_{jk}(g) - \mu_{jk}(g)| = O_P\Big( \sqrt{\frac{\log(p_n)}{n}} \Big) = O_P\Big( \frac{1}{n^{(1-\xi)/2}} \Big).   (105)

We conclude that

\sup_{g \in M_n} |\hat{R}(g) - R(g)| = O_P\Big( \frac{L_n^2}{n^{(1-\xi)/2}} \Big).   (106)

Therefore,

R(m_n^*) \le R(\hat{m}_n) \le \hat{R}(\hat{m}_n) + O_P\Big( \frac{L_n^2}{n^{(1-\xi)/2}} \Big) \le \hat{R}(m_n^*) + O_P\Big( \frac{L_n^2}{n^{(1-\xi)/2}} \Big) \le R(m_n^*) + O_P\Big( \frac{L_n^2}{n^{(1-\xi)/2}} \Big)

and the conclusion follows.

References
Antoniadis, A. and Fan, J. (2001) Regularized wavelet approximations (with discussion). J. Am. Statist. Ass., 96,
939–967.
Buja, A., Hastie, T. and Tibshirani, R. (1989) Linear smoothers and additive models. Ann. Statist., 17, 453–510.
Bunea, F., Tsybakov, A. and Wegkamp, M. (2007) Sparsity oracle inequalities for the lasso. Electron. J. Statist.,
1, 169–194.
Daubechies, I., Defrise, M. and DeMol, C. (2004) An iterative thresholding algorithm for linear inverse problems.
Communs Pure Appl. Math., 57, 1413–1457.
Daubechies, I., Fornasier, M. and Loris, I. (2007) Accelerated projected gradient method for linear inverse
problems with sparsity constraints. Technical Report. Princeton University, Princeton. (Available from
arXiv:0706.4297.)
Fan, J. and Jiang, J. (2005) Nonparametric inference for additive models. J. Am. Statist. Ass., 100, 890–907.
Fan, J. and Li, R. Z. (2001) Variable selection via penalized likelihood. J. Am. Statist. Ass., 96, 1348–1360.
Greenshtein, E. and Ritov, Y. (2004) Persistency in high dimensional linear predictor-selection and the virtue of
over-parametrization. Bernoulli, 10, 971–988.
Hastie, T. and Tibshirani, R. (1999) Generalized Additive Models. New York: Chapman and Hall.
Juditsky, A. and Nemirovski, A. (2000) Functional aggregation for nonparametric regression. Ann. Statist., 28,
681–712.
Koltchinskii, V. and Yuan, M. (2008) Sparse recovery in large ensembles kernel machines. In Proc. 21st A. Conf.
Learning Theory, pp. 229–238. Eastbourne: Omnipress.
Ledoux, M. and Talagrand, M. (1991) Probability in Banach Spaces: Isoperimetry and Processes. New York:
Springer.
Lin, Y. and Zhang, H. H. (2006) Component selection and smoothing in multivariate nonparametric regression.
Ann. Statist., 34, 2272–2297.
Meier, L., van de Geer, S. and Bühlmann, P. (2008) High-dimensional additive modelling. (Available from arXiv.)
Meinshausen, N. and Bühlmann, P. (2006) High dimensional graphs and variable selection with the lasso. Ann.
Statist., 34, 1436–1462.
Meinshausen, N. and Yu, B. (2006) Lasso-type recovery of sparse representations for high-dimensional data.
Technical Report 720. Department of Statistics, University of California, Berkeley.
Olshausen, B. A. and Field, D. J. (1996) Emergence of simple-cell receptive field properties by learning a sparse
code for natural images. Nature, 381, 607–609.
Ravikumar, P., Liu, H., Lafferty, J. and Wasserman, L. (2008) Spam: sparse additive models. In Advances in
Neural Information Processing Systems, vol. 20 (eds J. Platt, D. Koller, Y. Singer and S. Roweis), pp. 1201–1208.
Cambridge: MIT Press.
Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B, 58, 267–288.
van der Vaart, A. W. (1998) Asymptotic Statistics. Cambridge: Cambridge University Press.
Wainwright, M. (2006) Sharp thresholds for high-dimensional and noisy recovery of sparsity. Technical Report
709. Department of Statistics, University of California, Berkeley.
Wainwright, M. J., Ravikumar, P. and Lafferty, J. D. (2007) High-dimensional graphical model selection using l1 -
regularized logistic regression. In Advances in Neural Information Processing Systems, vol. 19 (eds B. Schölkopf,
J. Platt and T. Hoffman), pp. 1465–1472. Cambridge: MIT Press.
Wasserman, L. and Roeder, K. (2007) Multi-stage variable selection: screen and clean. Carnegie Mellon University,
Pittsburgh. (Available from arXiv:0704.1139.)
Yuan, M. (2007) Nonnegative garrote component selection in functional ANOVA models. Proc. Artif. Intell.
Statist. (Available from www.stat.umn.edu/∼aistat/proceedings.)
Yuan, M. and Lin, Y. (2006) Model selection and estimation in regression with grouped variables. J. R. Statist.
Soc. B, 68, 49–67.
Zhao, P. and Yu, B. (2007) On model selection consistency of lasso. J. Mach. Learn. Res., 7, 2541–2567.
Zou, H. (2005) The adaptive lasso and its oracle properties. J. Am. Statist. Ass., 101, 1418–1429.
Linear Classification
36-708

In these notes we discuss parametric classification, in particular, linear classification, from


several different points of view. We begin with a review of classification.

1 Review of Classification

The problem of predicting a discrete random variable Y from another random variable X
is called classification, also sometimes called discrimination, pattern classification or pat-
tern recognition. We observe iid data (X1 , Y1 ), . . . , (Xn , Yn ) ∼ P where Xi ∈ Rd and
Yi ∈ {0, 1, . . . , K − 1}. Often, the covariates X are also called features. The goal is to
predict Y given a new X; here are some examples:

1. The Iris Flower study. The data are 50 samples from each of three species of Iris
flowers, Iris setosa, Iris virginica and Iris versicolor; see Figure 1. The length and
width of the sepal and petal are measured for each specimen, and the task is to predict
the species of a new Iris flower based on these features.

2. The Coronary Risk-Factor Study (CORIS). The data consist of attributes of 462 males
between the ages of 15 and 64 from three rural areas in South Africa. The outcome Y
is the presence (Y = 1) or absence (Y = 0) of coronary heart disease and there are 9
covariates: systolic blood pressure, cumulative tobacco (kg), ldl (low density lipopro-
tein cholesterol), adiposity, famhist (family history of heart disease), typea (type-A
behavior), obesity, alcohol (current alcohol consumption), and age. The goal is to
predict Y from all these covariates.

3. Handwriting Digit Recognition. Here each Y is one of the ten digits from 0 to 9. There
are 256 covariates X1 , . . . , X256 corresponding to the intensity values of the pixels in a
16 × 16 image; see Figure 2.

4. Political Blog Classification. A collection of 403 political blogs were collected during
two months before the 2004 presidential election. The goal is to predict whether a blog
is liberal (Y = 0) or conservative (Y = 1) given the content of the blog.

A classification rule, or classifier, is a function h : X → {0, . . . , K −1} where X is the domain


of X. When we observe a new X, we predict Y to be h(X). Intuitively, the classification rule
h partitions the input space X into K disjoint decision regions whose boundaries are called
decision boundaries. In these notes, we consider linear classifiers whose decision boundaries
are linear functions of the covariate X. For K = 2, we have a binary classification problem.

Figure 1: Three different species of the Iris data. Iris setosa (Left), Iris versicolor (Middle),
and Iris virginica (Right).

For K > 2, we have a multiclass classification problem. To simplify the discussion, we mainly
discuss binary classification, and briefly explain how methods can extend to the multiclass
case.

A binary classifier h is a function from X to {0, 1}. It is linear if there exists a function
H(x) = β0 + β T x such that h(x) = I(H(x) > 0). H(x) is also called a linear discriminant
function. The decision boundary is therefore defined as the set \{x \in \mathbb{R}^d : H(x) = 0\}, which
corresponds to a (d − 1)-dimensional hyperplane within the d-dimensional input space X .

The classification risk, or error rate, of h is defined as

R(h) = P\big( Y \ne h(X) \big)   (1)

and the empirical classification error, or training error, is

\hat{R}(h) = \frac{1}{n} \sum_{i=1}^n I\big( h(X_i) \ne Y_i \big).   (2)

Here is some notation that we will use.

X covariate (feature)
X domain of X, usually X ⊂ Rd
Y response (pattern)
h binary classifier, h : X → {0, 1} 
H linear discriminant function, H(x) = β0 + β T x and h(x) = I H(x) > 0
m regression function, m(x) = E(Y |X = x) = P(Y = 1|X = x)
PX marginal distribution of X
pj pj (x) = p(x|Y = j), the conditional density1 of X given that Y = j
π1 π1 = P(Y = 1)
P joint distribution of (X, Y )

Now we review some key results.

Figure 2: Examples from the zipcode data.

Theorem 1 The rule h that minimizes R(h) is

h^*(x) = \begin{cases} 1 & \text{if } m(x) > \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}   (3)

where m(x) = E(Y |X = x) = P(Y = 1|X = x) denotes the regression function.

The rule h∗ is called the Bayes rule. The risk R∗ = R(h∗ ) of the Bayes rule is called the
Bayes risk. The set {x ∈ X : m(x) = 1/2} is called the Bayes decision boundary.

Proof. We will show that R(h) - R(h^*) \ge 0. Note that

R(h) = P\{Y \ne h(X)\} = \int P\big( Y \ne h(X) \,|\, X = x \big) \, dP_X(x).

It suffices to show that

P\big( Y \ne h(X) \,|\, X = x \big) - P\big( Y \ne h^*(X) \,|\, X = x \big) \ge 0 \quad \text{for all } x \in \mathcal{X}.   (4)
Now,

P\big( Y \ne h(X) \,|\, X = x \big) = 1 - P\big( Y = h(X) \,|\, X = x \big)   (5)
 = 1 - \big[ P\big( Y = 1, h(X) = 1 \,|\, X = x \big) + P\big( Y = 0, h(X) = 0 \,|\, X = x \big) \big]   (6)
 = 1 - \big[ h(x) P(Y = 1 | X = x) + \big( 1 - h(x) \big) P(Y = 0 | X = x) \big]   (7)
 = 1 - \big[ h(x) m(x) + \big( 1 - h(x) \big)\big( 1 - m(x) \big) \big].   (8)

Hence,

P\big( Y \ne h(X) \,|\, X = x \big) - P\big( Y \ne h^*(X) \,|\, X = x \big)
 = \big[ h^*(x) m(x) + \big( 1 - h^*(x) \big)\big( 1 - m(x) \big) \big] - \big[ h(x) m(x) + \big( 1 - h(x) \big)\big( 1 - m(x) \big) \big]
 = \big( 2 m(x) - 1 \big)\big( h^*(x) - h(x) \big) = 2 \Big( m(x) - \frac{1}{2} \Big) \big( h^*(x) - h(x) \big).   (9)

When m(x) \ge 1/2 and h^*(x) = 1, (9) is non-negative. When m(x) < 1/2 and h^*(x) = 0, (9)
is again non-negative. This proves (4). \square

We can rewrite h^* in a different way. From Bayes' theorem,

m(x) = P(Y = 1 | X = x) = \frac{p(x | Y = 1) P(Y = 1)}{p(x | Y = 1) P(Y = 1) + p(x | Y = 0) P(Y = 0)} = \frac{\pi_1 p_1(x)}{\pi_1 p_1(x) + (1 - \pi_1) p_0(x)},   (10)

where \pi_1 = P(Y = 1). From the above equality, we have that

m(x) > \frac{1}{2} \quad \text{is equivalent to} \quad \frac{p_1(x)}{p_0(x)} > \frac{1 - \pi_1}{\pi_1}.   (11)

Thus the Bayes rule can be rewritten as

h^*(x) = \begin{cases} 1 & \text{if } \frac{p_1(x)}{p_0(x)} > \frac{1 - \pi_1}{\pi_1} \\ 0 & \text{otherwise.} \end{cases}   (12)
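To illustrate (12), here is a tiny Python sketch that computes the Bayes rule for two univariate Gaussian class-conditional densities; the densities and the value of π₁ are made up for the example and are not part of the notes.

import numpy as np
from scipy.stats import norm

pi1 = 0.3                                          # P(Y = 1), chosen for illustration
p0 = lambda x: norm.pdf(x, loc=0.0, scale=1.0)     # p(x | Y = 0)
p1 = lambda x: norm.pdf(x, loc=2.0, scale=1.0)     # p(x | Y = 1)

def bayes_rule(x):
    # h*(x) = 1  iff  p1(x)/p0(x) > (1 - pi1)/pi1, as in equation (12)
    return (p1(x) / p0(x) > (1 - pi1) / pi1).astype(int)

x = np.linspace(-4, 6, 11)
print(bayes_rule(x))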

If H is a set of classifiers then the classifier h_o \in H that minimizes R(h) is the oracle classifier.
Formally,

R(h_o) = \inf_{h \in H} R(h)

and R_o = R(h_o) is called the oracle risk of H. In general, if h is any classifier and R^* is the
Bayes risk then,

R(h) - R^* = \underbrace{R(h) - R(h_o)}_{\text{distance from oracle}} + \underbrace{R(h_o) - R^*}_{\text{distance of oracle from Bayes error}}.   (13)

The first term is analogous to the variance, and the second is analogous to the squared bias
in linear regression.

For a binary classification problem, given a covariate X we only need to predict its class label
Y = 0 or Y = 1. This is in contrast to a regression problem where we need to predict a
real-valued response Y \in \mathbb{R}. Intuitively, classification is a much easier task than regression.
To rigorously formalize this, let m^*(x) = E(Y | X = x) be the true regression function and
let h^*(x) be the corresponding Bayes rule. Let \hat{m}(x) be an estimate of m^*(x) and define the
plug-in classification rule:

\hat{h}(x) = \begin{cases} 1 & \text{if } \hat{m}(x) > \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}   (14)
We have the following theorem.

Theorem 2 The risk of the plug-in classifier rule in (14) satisfies

R(\hat{h}) - R^* \le 2 \sqrt{\int \big( \hat{m}(x) - m^*(x) \big)^2 \, dP_X(x)}.

Proof. In the proof of Theorem 1 we showed that

P\big( Y \ne \hat{h}(X) \,|\, X = x \big) - P\big( Y \ne h^*(X) \,|\, X = x \big) = \big( 2m(x) - 1 \big)\big( h^*(x) - \hat{h}(x) \big)
 = |2m(x) - 1| \, I\big( h^*(x) \ne \hat{h}(x) \big) = 2 \big| m(x) - 1/2 \big| \, I\big( h^*(x) \ne \hat{h}(x) \big).

Now, when h^*(x) \ne \hat{h}(x), there are two possible cases: (i) \hat{h}(x) = 1 and h^*(x) = 0; (ii)
\hat{h}(x) = 0 and h^*(x) = 1. In both cases, we have that |\hat{m}(x) - m^*(x)| \ge |m^*(x) - 1/2|.
Therefore,

P\big( \hat{h}(X) \ne Y \big) - P\big( h^*(X) \ne Y \big) = 2 \int \big| m^*(x) - 1/2 \big| \, I\big( h^*(x) \ne \hat{h}(x) \big) \, dP_X(x)
 \le 2 \int \big| \hat{m}(x) - m^*(x) \big| \, I\big( h^*(x) \ne \hat{h}(x) \big) \, dP_X(x)
 \le 2 \int \big| \hat{m}(x) - m^*(x) \big| \, dP_X(x)   (15)
 \le 2 \sqrt{\int \big( \hat{m}(x) - m^*(x) \big)^2 \, dP_X(x)}.   (16)

The last inequality follows from the fact that E|Z| \le \sqrt{E Z^2} for any Z. \square

This theorem implies that if the regression estimate \hat{m}(x) is close to m^*(x) then the plug-in
classification risk will be close to the Bayes risk. The converse is not necessarily true. It is
possible for \hat{m} to be far from m^*(x) and still lead to a good classifier. As long as \hat{m}(x) and
m^*(x) are on the same side of 1/2 they yield the same classifier.

Example 3 Figure 3 shows two one-dimensional regression functions. In both cases, the
Bayes rule is h∗ (x) = I(x > 0) and the decision boundary is D = {x = 0}. The left plot
illustrates an easy problem; there is little ambiguity around the decision boundary. Even a
poor estimate of m(x) will recover the correct decision boundary. The right plot illustrates a
hard problem; it is hard to know from the data if you are to the left or right of the decision
boundary.


Figure 3: The Bayes rule is h∗ (x) = I(x > 0) in both plots, which show the regression
function m(x) = E(Y |x) for two problems. The left plot shows an easy problem; there is
little ambiguity around the decision boundary. The right plot shows a hard problem; it is
hard to know from the data if you are to the left or right of the decision boundary.

So classification is easier than regression. Can it be strictly easier? Suppose that R(\hat{m}) → 0,
where R(\hat{m}) = \int (\hat{m}(x) − m^*(x))^2 \, dP(x) is the regression risk. We have that

R(\hat{h}) − R(h^*) ≤ 2 \int |\hat{m}(x) − m^*(x)| \, I(h^*(x) ≠ \hat{h}(x)) \, dP(x)
  = 2 \int |\hat{m}(x) − m^*(x)| \, I(h^*(x) ≠ \hat{h}(x)) \, I(m^*(x) ≠ 1/2) \, dP(x)
  = 2 E\big[ |\hat{m}(X) − m^*(X)| \, I(h^*(X) ≠ \hat{h}(X)) \, I(m^*(X) ≠ 1/2) \big]
  ≤ 2 E\big[ |\hat{m}(X) − m^*(X)| \, I(h^*(X) ≠ \hat{h}(X)) \, I(|m^*(X) − 1/2| ≤ ε, \ m^*(X) ≠ 1/2) \big]
    + 2 E\big[ |\hat{m}(X) − m^*(X)| \, I(h^*(X) ≠ \hat{h}(X)) \, I(|m^*(X) − 1/2| > ε) \big]
  ≤ 2 \sqrt{R(\hat{m})} \, (a^{1/2} + b^{1/2}),

where the last step uses the Cauchy-Schwarz inequality,

a = P(h^*(X) ≠ \hat{h}(X), \ |m^*(X) − 1/2| ≤ ε, \ m^*(X) ≠ 1/2)

and

b = P(h^*(X) ≠ \hat{h}(X), \ |m^*(X) − 1/2| > ε).

Now

b ≤ P(|\hat{m}(X) − m^*(X)| > ε) ≤ \frac{R(\hat{m})}{ε^2} → 0,

so

\lim_{n → ∞} \frac{R(\hat{h}) − R(h^*)}{\sqrt{R(\hat{m})}} ≤ 2 a^{1/2}.

But a → 0 as ε → 0, so

\frac{R(\hat{h}) − R(h^*)}{\sqrt{R(\hat{m})}} → 0.

So the left-hand side can be smaller than the right-hand side. But how much smaller? Yang (1999)
showed that if the class of regression functions is sufficiently rich, then

\inf_{\hat{m}} \sup_{m ∈ \mathcal{M}} R(\hat{m}) ≍ r_n^2 \qquad \text{and} \qquad \inf_{\hat{h}} \sup_{m ∈ \mathcal{M}} \big[ R(\hat{h}) − R(h^*) \big] ≍ r_n,

which says that the minimax classification rate is the square root of the regression rate.

But there are natural classes that fail the richness condition, such as low-noise classes. For
example, if P(|m^*(X) − 1/2| ≤ ε) = 0 and \hat{m} satisfies an exponential inequality, then
(R(\hat{h}) − R(h^*)) / \sqrt{R(\hat{m})} is exponentially small. So it really depends on the problem.

2 Empirical Risk Minimization

The conceptually simplest approach is empirical risk minimization (ERM), where we minimize
the training error over all linear classifiers. Let H_β(x) = β^T x (where x(1) = 1) and h_β(x) =
I(H_β(x) > 0). We define \hat{β} to be the value of β that minimizes

\hat{R}_n(β) = \frac{1}{n} \sum_{i=1}^n I(Y_i ≠ h_β(X_i)).

The problem with this approach is that it is difficult to minimize \hat{R}_n(β). The theory for
ERM is straightforward. First, let us recall the following. Let \mathcal{H} be a set of classifiers and let
h^* = argmin_{h ∈ \mathcal{H}} R(h). Let \hat{h} minimize the empirical risk \hat{R}. If \sup_{h ∈ \mathcal{H}} |\hat{R}(h) − R(h)| ≤ ε then
R(\hat{h}) ≤ R(h^*) + 2ε. To see this, note that if \sup_{h ∈ \mathcal{H}} |\hat{R}(h) − R(h)| ≤ ε then, using the fact
that \hat{h} minimizes \hat{R},

R(h^*) ≤ R(\hat{h}) ≤ \hat{R}(\hat{h}) + ε ≤ \hat{R}(h^*) + ε ≤ R(h^*) + 2ε.

So we need to bound

P\Big( \sup_{h ∈ \mathcal{H}} |\hat{R}(h) − R(h)| > ε \Big).

If \mathcal{H} has finite VC dimension r then

P\Big( \sup_{h ∈ \mathcal{H}} |\hat{R}(h) − R(h)| > ε \Big) ≤ 8(n+1)^r e^{-nε^2/32}.

Half-spaces in R^d have VC dimension r = d + 1, so

P\Big( \sup_{h ∈ \mathcal{H}} |\hat{R}(h) − R(h)| > ε \Big) ≤ 8(n+1)^{d+1} e^{-nε^2/32}.

We conclude that P(R(\hat{h}) − R(h^*) > 2ε) ≤ 8(n+1)^{d+1} e^{-nε^2/32}.

The result can be improved if there are not too many data points near the decision boundary.
We will state a result due to Koltchinskii and Panchenko (2002) that involves the margin. (See
also Kakade, Sridharan and Tewari 2009.) Let us take Y_i ∈ {−1, +1} so we can write
h(x) = sign(H(x)) with H(x) = β^T x. Suppose that |X(j)| ≤ A < ∞ for each j. We also restrict
ourselves to the set of linear classifiers with |β(j)| ≤ A. Define the margin-sensitive loss

φ_γ(u) = \begin{cases} 1 & \text{if } u ≤ 0 \\ 1 − u/γ & \text{if } 0 < u ≤ γ \\ 0 & \text{if } u > γ. \end{cases}

Then, for any such classifier h, with probability at least 1 − δ,

P(Y ≠ h(X)) ≤ \frac{1}{n} \sum_{i=1}^n φ_γ(Y_i H(X_i)) + \frac{4 A^{3/2} \sqrt{d}}{γ n} + \Big( \frac{8}{γ} + 1 \Big) \sqrt{\frac{\log(4/δ)}{2n}}.

This means that, if there are few observations near the boundary, then, by taking γ large,
we can make the loss small. However, the restriction to bounded covariates and bounded
classifiers is non-trivial.

3 Gaussian Discriminant Analysis

Suppose that p_0(x) = p(x | Y = 0) and p_1(x) = p(x | Y = 1) are both multivariate Gaussians:

p_k(x) = \frac{1}{(2π)^{d/2} |Σ_k|^{1/2}} \exp\Big( -\frac{1}{2} (x − μ_k)^T Σ_k^{-1} (x − μ_k) \Big), \qquad k = 0, 1,

where Σ_0 and Σ_1 are both d × d covariance matrices. Thus, X | Y = 0 ∼ N(μ_0, Σ_0) and
X | Y = 1 ∼ N(μ_1, Σ_1).

Given a square matrix A, we define |A| to be the determinant of A. For a binary classification
problem with Gaussian class-conditional distributions, we have the following theorem.

Theorem 4 If X | Y = 0 ∼ N(μ_0, Σ_0) and X | Y = 1 ∼ N(μ_1, Σ_1), then the Bayes rule is

h^*(x) = \begin{cases} 1 & \text{if } r_1^2 < r_0^2 + 2 \log\big( \frac{π_1}{1 − π_1} \big) + \log\big( \frac{|Σ_0|}{|Σ_1|} \big) \\ 0 & \text{otherwise} \end{cases}    (17)

where r_i^2 = (x − μ_i)^T Σ_i^{-1} (x − μ_i), i = 0, 1, is the Mahalanobis distance.

Proof. By definition, the Bayes rule is h^*(x) = I(π_1 p_1(x) > (1 − π_1) p_0(x)). Plugging in the
specific forms of p_0 and p_1 and taking logarithms, we see that h^*(x) = 1 if and only if

(x − μ_1)^T Σ_1^{-1} (x − μ_1) − 2 \log π_1 + \log |Σ_1| < (x − μ_0)^T Σ_0^{-1} (x − μ_0) − 2 \log(1 − π_1) + \log |Σ_0|.    (18)

The theorem immediately follows from some simple algebra. □

Let π_0 = 1 − π_1. An equivalent way of expressing the Bayes rule is

h^*(x) = \mathrm{argmax}_{k ∈ \{0,1\}} δ_k(x)    (19)

where

δ_k(x) = -\frac{1}{2} \log |Σ_k| - \frac{1}{2} (x − μ_k)^T Σ_k^{-1} (x − μ_k) + \log π_k    (20)

is called the Gaussian discriminant function. The decision boundary of the above classifier
can be characterized by the set {x ∈ X : δ_1(x) = δ_0(x)}, which is quadratic, so this procedure
is called quadratic discriminant analysis (QDA).

In practice, we use sample estimates of π_0, π_1, μ_0, μ_1, Σ_0, Σ_1 in place of their population
values, namely

\hat{π}_0 = \frac{1}{n} \sum_{i=1}^n (1 − Y_i), \qquad \hat{π}_1 = \frac{1}{n} \sum_{i=1}^n Y_i,    (21)

\hat{μ}_0 = \frac{1}{n_0} \sum_{i: Y_i = 0} X_i, \qquad \hat{μ}_1 = \frac{1}{n_1} \sum_{i: Y_i = 1} X_i,    (22)

\hat{Σ}_0 = \frac{1}{n_0 − 1} \sum_{i: Y_i = 0} (X_i − \hat{μ}_0)(X_i − \hat{μ}_0)^T,    (23)

\hat{Σ}_1 = \frac{1}{n_1 − 1} \sum_{i: Y_i = 1} (X_i − \hat{μ}_1)(X_i − \hat{μ}_1)^T,    (24)

where n_0 = \sum_i (1 − Y_i) and n_1 = \sum_i Y_i. (Note: we could also estimate Σ_0 and Σ_1 using their
maximum likelihood estimates, which replace n_0 − 1 and n_1 − 1 with n_0 and n_1.)

A simplification occurs if we assume that Σ_0 = Σ_1 = Σ. In this case, the Bayes rule is

h^*(x) = \mathrm{argmax}_k δ_k(x)    (25)

where now

δ_k(x) = x^T Σ^{-1} μ_k − \frac{1}{2} μ_k^T Σ^{-1} μ_k + \log π_k.    (26)

Hence, the classifier is linear. The parameters are estimated as before, except that we use a
pooled estimate of Σ:

\hat{Σ} = \frac{(n_0 − 1)\hat{Σ}_0 + (n_1 − 1)\hat{Σ}_1}{n_0 + n_1 − 2}.    (27)

The classification rule is

h^*(x) = \begin{cases} 1 & \text{if } δ_1(x) > δ_0(x) \\ 0 & \text{otherwise.} \end{cases}    (28)

The decision boundary {x ∈ X : δ_0(x) = δ_1(x)} is linear, so this method is called linear
discriminant analysis (LDA).

When the dimension d is large, fully specifying the QDA decision boundary requires d +
d(d − 1) parameters, and fully specifying the LDA decision boundary requires d + d(d − 1)/2
parameters. Such a large number of free parameters can induce a large variance. To further
regularize the model, two popular methods are diagonal quadratic discriminant analysis
(DQDA) and diagonal linear discriminant analysis (DLDA). The only difference between
DQDA and DLDA and their counterparts QDA and LDA is that, after calculating \hat{Σ}_1 and \hat{Σ}_0
as in (23) and (24), we set all the off-diagonal elements to zero. This is also called the
independence rule.
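
To make the plug-in estimation concrete, here is a minimal sketch of the QDA and LDA discriminant functions (20) and (26) computed from the sample quantities above. This is illustrative Python/NumPy code, not part of the original notes; the function and variable names are made up.

import numpy as np

def fit_gaussian_classes(X, Y):
    # Sample estimates (21)-(24): class proportions, means, and covariances.
    est = {}
    for k in (0, 1):
        Xk = X[Y == k]
        est[k] = (len(Xk) / len(X), Xk.mean(axis=0), np.cov(Xk, rowvar=False))
    return est

def qda_discriminant(x, pi_k, mu_k, Sigma_k):
    # delta_k(x) in (20): -0.5 log|Sigma_k| - 0.5 (x-mu)^T Sigma_k^{-1} (x-mu) + log pi_k
    diff = x - mu_k
    _, logdet = np.linalg.slogdet(Sigma_k)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(Sigma_k, diff) + np.log(pi_k)

def lda_discriminant(x, pi_k, mu_k, Sigma_pooled):
    # delta_k(x) in (26) with a common (pooled) covariance matrix.
    Sinv_mu = np.linalg.solve(Sigma_pooled, mu_k)
    return x @ Sinv_mu - 0.5 * mu_k @ Sinv_mu + np.log(pi_k)

def predict_qda(x, est):
    # Classify to the class with the largest discriminant, as in (19).
    return max((0, 1), key=lambda k: qda_discriminant(x, *est[k]))

For DQDA/DLDA one would simply zero out the off-diagonal entries of the estimated covariance matrices before computing the discriminants.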

We now generalize to the case where Y takes more than two values, that is, Y ∈
{0, . . . , K − 1} for K > 2. First, we characterize the Bayes classifier in this multiclass
setting.

Theorem 5 Let R(h) = P(h(X) ≠ Y) be the classification error of a classification rule
h(x). The Bayes rule h^* minimizing R(h) can be written as

h^*(x) = \mathrm{argmax}_k P(Y = k | X = x).    (29)

Proof. We have

R(h) = 1 − P(h(X) = Y)    (30)
     = 1 − \sum_{k=0}^{K−1} P(h(X) = k, Y = k)    (31)
     = 1 − \sum_{k=0}^{K−1} E\big[ I(h(X) = k) P(Y = k | X) \big].    (32)

It is clear that h^*(x) = \mathrm{argmax}_k P(Y = k | X = x) achieves the minimal classification error
1 − E\big[ \max_k P(Y = k | X) \big]. □

Let π_k = P(Y = k). The next theorem extends QDA and LDA to the multiclass setting.

Theorem 6 Suppose that Y ∈ {0, . . . , K − 1} with K ≥ 2. If p_k(x) = p(x | Y = k) is
Gaussian, X | Y = k ∼ N(μ_k, Σ_k), the Bayes rule for multiclass QDA can be written as

h^*(x) = \mathrm{argmax}_k δ_k(x)

where

δ_k(x) = -\frac{1}{2} \log |Σ_k| - \frac{1}{2} (x − μ_k)^T Σ_k^{-1} (x − μ_k) + \log π_k.    (33)

If all the Gaussians have a common covariance matrix Σ, then

δ_k(x) = x^T Σ^{-1} μ_k − \frac{1}{2} μ_k^T Σ^{-1} μ_k + \log π_k.    (34)

Let n_k = \sum_i I(Y_i = k) for k = 0, . . . , K − 1. The sample estimates of π_k, μ_k, Σ_k,
and Σ are:

\hat{π}_k = \frac{1}{n} \sum_{i=1}^n I(Y_i = k), \qquad \hat{μ}_k = \frac{1}{n_k} \sum_{i: Y_i = k} X_i,    (35)

\hat{Σ}_k = \frac{1}{n_k − 1} \sum_{i: Y_i = k} (X_i − \hat{μ}_k)(X_i − \hat{μ}_k)^T,    (36)

\hat{Σ} = \frac{\sum_{k=0}^{K−1} (n_k − 1) \hat{Σ}_k}{n − K}.    (37)

Example 7 Let us return to the Iris data example. Recall that there are 150 observations
made on three classes of the iris flower: Iris setosa, Iris versicolor, and Iris virginica. There
are four features: sepal length, sepal width, petal length, and petal width. In Figure 4 we
visualize the dataset. Within each class, we plot the densities of each feature. It is easy to
see that the distributions of petal length and petal width are quite different across the
classes, which suggests that they are very informative features.

Figures 5 and 6 provide multiple figure arrays illustrating the classification of observations
based on LDA and QDA for every combination of two features. The classification boundaries
and errors are obtained by restricting the data to the given pair of features before
fitting the model. We see that the decision boundaries for LDA are linear, while the decision
boundaries for QDA are highly nonlinear. The training errors for LDA and QDA on these data
are both 0.02. From these figures, we see that it is very easy to discriminate the observations
of class Iris setosa from those of the other two classes.


Figure 4: The Iris data: The estimated densities for different features are plotted within
each class. It’s easy to see that the distributions of petal length and petal width are quite
different across different classes, which suggests that they are very informative features.


Figure 5: Classifying the Iris data using LDA. The multiple figure array illustrates the
classification of observations based on LDA for every combination of two features. The
classification boundaries and error are obtained by simply restricting the data to a given
pair of features before fitting the model. In these plots, “s” represents the class label Iris
setosa, “e” represents the class label Iris versicolor, and “v” represents the class label Iris
virginica. The red letters illustrate the misclassified observations.


Figure 6: Classifying the Iris data using QDA. The multiple figure array illustrates the
classification of observations based on QDA for every combination of two features. The
classification boundaries are displayed, and the classification error is computed by restricting
the data to the given pair of features. In these plots, "s" represents the class label Iris
setosa, "e" represents the class label Iris versicolor, and "v" represents the class label Iris
virginica. The red letters indicate the misclassified observations.

4 Fisher Linear Discriminant Analysis

There is another version of linear discriminant analysis due to Fisher (1936). The idea is to
first reduce the covariates to one dimension by projecting the data onto a line. Algebraically,
this means replacing the covariate X = (X_1, . . . , X_d)^T with the linear combination
U = w^T X = \sum_{j=1}^d w_j X_j. The goal is to choose the vector w = (w_1, . . . , w_d)^T that "best
separates the data into two groups." Then we perform classification with the one-dimensional
covariate U instead of X.

What do we mean by "best separates the data into two groups"? Formally, we would like
the two groups to have means that are as far apart as possible relative to their spread. Let
μ_j denote the mean of X for Y = j, j = 0, 1, and let Σ be the covariance matrix of X.
Then, for j = 0, 1, E(U | Y = j) = E(w^T X | Y = j) = w^T μ_j and Var(U) = w^T Σ w. Define the
separation by

J(w) = \frac{\big( E(U | Y = 0) − E(U | Y = 1) \big)^2}{w^T Σ w}
     = \frac{(w^T μ_0 − w^T μ_1)^2}{w^T Σ w}
     = \frac{w^T (μ_0 − μ_1)(μ_0 − μ_1)^T w}{w^T Σ w}.

J is sometimes called the Rayleigh coefficient. Our goal is to find w that maximizes J(w).
Since J(w) involves the unknown population quantities μ_0, μ_1, Σ, we estimate J as follows.
Let n_j = \sum_{i=1}^n I(Y_i = j) be the number of observations in class j, let \hat{μ}_j be the sample
mean vector of the X's for class j, and let S_j be the sample covariance matrix of the
observations in class j. Define

\hat{J}(w) = \frac{w^T S_B w}{w^T S_W w}    (38)

where

S_B = (\hat{μ}_0 − \hat{μ}_1)(\hat{μ}_0 − \hat{μ}_1)^T,
S_W = \frac{(n_0 − 1) S_0 + (n_1 − 1) S_1}{(n_0 − 1) + (n_1 − 1)}.

Theorem 8 The vector

\hat{w} = S_W^{-1} (\hat{μ}_0 − \hat{μ}_1)    (39)

is a maximizer of \hat{J}(w).

Proof. Maximizing \hat{J}(w) is equivalent to maximizing w^T S_B w subject to the constraint
w^T S_W w = 1. This is a generalized eigenvalue problem. By the definition of eigenvector and
eigenvalue, the maximizer \hat{w} is the eigenvector of S_W^{-1} S_B corresponding to the largest
eigenvalue. The key observation is that S_B = (\hat{μ}_0 − \hat{μ}_1)(\hat{μ}_0 − \hat{μ}_1)^T, which implies that for
any vector w, S_B w must be in the direction of \hat{μ}_0 − \hat{μ}_1. The desired result immediately
follows. □

We call

f(x) = \hat{w}^T x = (\hat{μ}_0 − \hat{μ}_1)^T S_W^{-1} x    (40)

the Fisher linear discriminant function. Given a cutting threshold c_m ∈ R, Fisher's
classification rule is

h(x) = \begin{cases} 0 & \text{if } \hat{w}^T x ≥ c_m \\ 1 & \text{if } \hat{w}^T x < c_m. \end{cases}    (41)

Fisher's rule is the same as the Gaussian LDA rule in (26) when

c_m = \frac{1}{2} (\hat{μ}_0 − \hat{μ}_1)^T S_W^{-1} (\hat{μ}_0 + \hat{μ}_1) − \log\Big( \frac{\hat{π}_0}{\hat{π}_1} \Big).    (42)
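
For concreteness, here is a minimal Python/NumPy sketch (not from the notes; the names are invented) of the Fisher direction (39) and the threshold (42):

import numpy as np

def fisher_lda(X, Y):
    # Sample means and within-class covariance S_W as in (38).
    X0, X1 = X[Y == 0], X[Y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    n0, n1 = len(X0), len(X1)
    SW = ((n0 - 1) * np.cov(X0, rowvar=False) + (n1 - 1) * np.cov(X1, rowvar=False)) / (n0 + n1 - 2)
    w = np.linalg.solve(SW, mu0 - mu1)                 # equation (39)
    # Threshold (42) that matches the Gaussian LDA rule.
    c = 0.5 * w @ (mu0 + mu1) - np.log((n0 / len(X)) / (n1 / len(X)))
    return w, c

def fisher_classify(x, w, c):
    # Fisher's rule (41): label 0 on the side of mu0-hat, label 1 otherwise.
    return 0 if w @ x >= c else 1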

5 Logistic Regression

One approach to binary classification is to estimate the regression function m(x) = E(Y | X =
x) = P(Y = 1 | X = x) and, once we have an estimate \hat{m}(x), use the classification rule

\hat{h}(x) = \begin{cases} 1 & \text{if } \hat{m}(x) > \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}    (43)

For binary classification problems, one possible choice is the linear regression model

Y = m(X) + ε = β_0 + \sum_{j=1}^d β_j X_j + ε.    (44)

The linear regression model does not explicitly constrain Y to take on binary values. A
more natural alternative is to use logistic regression, which is the most common binary
classification method.

Before we describe the logistic regression model, let's recall some basic facts about binary
random variables. If Y takes values 0 and 1, we say that Y has a Bernoulli distribution with
parameter π_1 = P(Y = 1). The probability mass function for Y is p(y; π_1) = π_1^y (1 − π_1)^{1−y}
for y = 0, 1. The likelihood function for π_1 based on iid data Y_1, . . . , Y_n is

L(π_1) = \prod_{i=1}^n p(Y_i; π_1) = \prod_{i=1}^n π_1^{Y_i} (1 − π_1)^{1 − Y_i}.    (45)

In the logistic regression model, we assume that

m(x) = P(Y = 1 | X = x) = \frac{\exp(β_0 + x^T β)}{1 + \exp(β_0 + x^T β)} ≡ π_1(x, β_0, β).    (46)

In other words, given X = x, Y is Bernoulli with mean π_1(x, β_0, β). We can write the model
as

\mathrm{logit}\, P(Y = 1 | X = x) = β_0 + x^T β    (47)

where logit(a) = log(a/(1 − a)). The name "logistic regression" comes from the fact that
exp(x)/(1 + exp(x)) is called the logistic function.

Lemma 9 Both the linear regression and logistic regression models have linear decision
boundaries.

Proof. The linear decision boundary for linear regression is straightforward. The same
result for logistic regression follows from the monotonicity of the logistic function. □

The parameters β_0 and β = (β_1, . . . , β_d)^T can be estimated by maximum conditional
likelihood. The conditional likelihood function for (β_0, β) is

L(β_0, β) = \prod_{i=1}^n π_1(x_i, β_0, β)^{Y_i} (1 − π_1(x_i, β_0, β))^{1 − Y_i}.

Thus the conditional log-likelihood is

ℓ(β_0, β) = \sum_{i=1}^n \Big\{ Y_i \log π_1(x_i, β_0, β) + (1 − Y_i) \log(1 − π_1(x_i, β_0, β)) \Big\}    (48)
         = \sum_{i=1}^n \Big\{ Y_i (β_0 + x_i^T β) − \log\big( 1 + \exp(β_0 + x_i^T β) \big) \Big\}.    (49)

The maximum conditional likelihood estimators \hat{β}_0 and \hat{β} cannot be found in closed form.
However, the log-likelihood function is concave and can be maximized efficiently by Newton's
method, applied iteratively as follows.

Note that the logistic regression classifier essentially replaces the 0-1 loss with a smooth
loss function. In other words, it uses a surrogate loss function.

For notational simplicity, we redefine (locally to this section) the d-dimensional covariate x_i
and the parameter vector β as the following (d + 1)-dimensional vectors:

x_i ← (1, x_i^T)^T \quad \text{and} \quad β ← (β_0, β^T)^T.    (50)

Thus, we write π_1(x, β_0, β) as π_1(x, β) and ℓ(β_0, β) as ℓ(β).

To maximize ℓ(β), the (k + 1)-th Newton step replaces the k-th iterate \hat{β}^{(k)} by

\hat{β}^{(k+1)} ← \hat{β}^{(k)} − \Big( \frac{∂^2 ℓ(\hat{β}^{(k)})}{∂β ∂β^T} \Big)^{-1} \frac{∂ℓ(\hat{β}^{(k)})}{∂β}.    (51)

The gradient and the Hessian are both evaluated at \hat{β}^{(k)} and can be written as

\frac{∂ℓ(\hat{β}^{(k)})}{∂β} = \sum_{i=1}^n \big( Y_i − π_1(x_i, \hat{β}^{(k)}) \big) x_i \quad \text{and} \quad \frac{∂^2 ℓ(\hat{β}^{(k)})}{∂β ∂β^T} = −X^T W X    (52)

where W = diag(w_{11}^{(k)}, w_{22}^{(k)}, . . . , w_{nn}^{(k)}) is an n × n diagonal matrix with

w_{ii}^{(k)} = π_1(x_i, \hat{β}^{(k)}) \big( 1 − π_1(x_i, \hat{β}^{(k)}) \big).    (53)

Let π_1^{(k)} = \big( π_1(x_1, \hat{β}^{(k)}), . . . , π_1(x_n, \hat{β}^{(k)}) \big)^T. Then (51) can be written as

\hat{β}^{(k+1)} = \hat{β}^{(k)} + (X^T W X)^{-1} X^T (y − π_1^{(k)})    (54)
           = (X^T W X)^{-1} X^T W \big( X \hat{β}^{(k)} + W^{-1} (y − π_1^{(k)}) \big)    (55)
           = (X^T W X)^{-1} X^T W z^{(k)}    (56)

where z^{(k)} ≡ (z_1^{(k)}, . . . , z_n^{(k)})^T = X \hat{β}^{(k)} + W^{-1} (y − π_1^{(k)}) with

z_i^{(k)} = \log\Big( \frac{π_1(x_i, \hat{β}^{(k)})}{1 − π_1(x_i, \hat{β}^{(k)})} \Big) + \frac{y_i − π_1(x_i, \hat{β}^{(k)})}{π_1(x_i, \hat{β}^{(k)}) \big( 1 − π_1(x_i, \hat{β}^{(k)}) \big)}.    (57)

Given the current estimate \hat{β}^{(k)}, the above Newton iteration forms a quadratic approximation
to the negative log-likelihood using a Taylor expansion at \hat{β}^{(k)}:

−ℓ(β) ≈ \underbrace{\frac{1}{2} (z − Xβ)^T W (z − Xβ)}_{ℓ_Q(β)} + \text{constant}.    (58)

The update equation (56) corresponds to solving the quadratic optimization

\hat{β}^{(k+1)} = \mathrm{argmin}_β \, ℓ_Q(β).    (59)

We then get an iterative algorithm called iteratively reweighted least squares. See Figure 7.

Iteratively Reweighted Least Squares Algorithm

Choose starting values \hat{β}^{(0)} = (\hat{β}_0^{(0)}, \hat{β}_1^{(0)}, . . . , \hat{β}_d^{(0)})^T and compute π_1(x_i, \hat{β}^{(0)}) using
Equation (46), for i = 1, . . . , n, with β replaced by its initial value \hat{β}^{(0)}.

For k = 0, 1, 2, . . ., iterate the following steps until convergence.

1. Calculate z_i^{(k)} according to (57) for i = 1, . . . , n.

2. Calculate \hat{β}^{(k+1)} according to (56). This corresponds to doing a weighted linear
   regression of z^{(k)} on X.

3. Update the π_1(x_i, \hat{β})'s using (46) with the current estimate \hat{β}^{(k+1)}.

Figure 7: Finding the Logistic Regression MLE.
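
Below is a minimal Python/NumPy sketch of the IRLS updates (56)-(57) (not part of the notes; it assumes the design matrix X already contains a leading column of ones, and the weights are clipped away from zero for numerical safety):

import numpy as np

def logistic_irls(X, y, n_iter=25, tol=1e-8):
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        eta = X @ beta
        p = 1.0 / (1.0 + np.exp(-eta))               # pi_1(x_i, beta), equation (46)
        W = np.clip(p * (1.0 - p), 1e-10, None)      # diagonal weights, equation (53)
        z = eta + (y - p) / W                        # working response, equation (57)
        # Weighted least squares step, equation (56): beta = (X^T W X)^{-1} X^T W z
        beta_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta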

We can get estimated standard errors for the final solution \hat{β}. Recall that for the k-th
iteration the Fisher information matrix I(\hat{β}^{(k)}) takes the form

I(\hat{β}^{(k)}) = −E\Big( \frac{∂^2 ℓ(\hat{β}^{(k)})}{∂β ∂β^T} \Big) ≈ X^T W X,    (60)

so we estimate the standard error of \hat{β}_j as the square root of the j-th diagonal element of
I(\hat{β})^{-1}.

Example 10 We apply logistic regression to the Coronary Risk-Factor Study (CORIS)
data, which yields the following estimates and Wald statistics W_j for the coefficients:

Covariate      \hat{β}_j      se      W_j      p-value


Intercept -6.145 1.300 -4.738 0.000
sbp 0.007 0.006 1.138 0.255
tobacco 0.079 0.027 2.991 0.003
ldl 0.174 0.059 2.925 0.003
adiposity 0.019 0.029 0.637 0.524
famhist 0.925 0.227 4.078 0.000
typea 0.040 0.012 3.233 0.001
obesity -0.063 0.044 -1.427 0.153
alcohol 0.000 0.004 0.027 0.979
age 0.045 0.012 3.754 0.000

6 Logistic Regression Versus LDA

There is a close connection between logistic regression and Gaussian LDA. Let (X, Y) be a
pair of random variables where Y is binary and let p_0(x) = p(x | Y = 0), p_1(x) = p(x | Y = 1)
and π_1 = P(Y = 1). By Bayes' theorem,

P(Y = 1 | X = x) = \frac{p(x | Y = 1) π_1}{p(x | Y = 1) π_1 + p(x | Y = 0)(1 − π_1)}.    (61)

If we assume that each group is Gaussian with the same covariance matrix Σ, i.e., X | Y =
0 ∼ N(μ_0, Σ) and X | Y = 1 ∼ N(μ_1, Σ), we have

\log \frac{P(Y = 1 | X = x)}{P(Y = 0 | X = x)} = \log\Big( \frac{π_1}{1 − π_1} \Big) − \frac{1}{2} (μ_0 + μ_1)^T Σ^{-1} (μ_1 − μ_0)    (62)
  \quad + x^T Σ^{-1} (μ_1 − μ_0)    (63)
  ≡ α_0 + α^T x.    (64)

On the other hand, the logistic regression model is, by assumption,

\log \frac{P(Y = 1 | X = x)}{P(Y = 0 | X = x)} = β_0 + β^T x.

These are the same model in the sense that both lead to classification rules that are linear in x.
The difference is in how we estimate the parameters.

This is an example of a generative versus a discriminative model. In Gaussian LDA we
estimate the whole joint distribution by maximizing the full likelihood

\prod_{i=1}^n p(X_i, Y_i) = \underbrace{\prod_{i=1}^n p(X_i | Y_i)}_{\text{Gaussian}} \; \underbrace{\prod_{i=1}^n p(Y_i)}_{\text{Bernoulli}}.    (65)

In logistic regression we maximize the conditional likelihood \prod_{i=1}^n p(Y_i | X_i) and ignore the
marginal term \prod_{i=1}^n p(X_i):

\prod_{i=1}^n p(X_i, Y_i) = \underbrace{\prod_{i=1}^n p(Y_i | X_i)}_{\text{logistic}} \; \underbrace{\prod_{i=1}^n p(X_i)}_{\text{ignored}}.    (66)

Since classification only requires knowledge of p(y | x), we don't really need to estimate the
whole joint distribution. Logistic regression leaves the marginal distribution p(x) unspecified,
so it relies on fewer parametric assumptions than LDA. This is an advantage of the logistic
regression approach over LDA. However, if the true class-conditional distributions are
Gaussian, logistic regression will be asymptotically less efficient than LDA, i.e., to achieve a
given level of classification error, logistic regression requires more samples.

7 Regularized Logistic Regression

As with linear regression, when the dimension d of the covariate is large, we cannot simply fit
a logistic model to all the variables without experiencing numerical and statistical problems.
Akin to the lasso, we will use regularized logistic regression, which includes sparse logistic
regression and ridge logistic regression.

Let ℓ(β_0, β) be the log-likelihood defined in (49). The sparse logistic regression estimator is
an ℓ_1-regularized conditional log-likelihood estimator

(\hat{β}_0, \hat{β}) = \mathrm{argmin}_{β_0, β} \Big\{ −ℓ(β_0, β) + λ \|β\|_1 \Big\}.    (67)

Similarly, the ridge logistic regression estimator is an ℓ_2-regularized conditional log-likelihood
estimator

(\hat{β}_0, \hat{β}) = \mathrm{argmin}_{β_0, β} \Big\{ −ℓ(β_0, β) + λ \|β\|_2^2 \Big\}.    (68)

The algorithm for ridge logistic regression only requires a simple modification of the
iteratively reweighted least squares algorithm and is left as an exercise.

For sparse logistic regression, an easy way to calculate \hat{β}_0 and \hat{β} is to apply an ℓ_1-regularized
Newton procedure. Similar to the Newton method for unregularized logistic regression, for
the k-th iteration we first form a quadratic approximation to the negative log-likelihood
−ℓ(β_0, β) based on the current estimate \hat{β}^{(k)}:

−ℓ(β_0, β) ≈ \underbrace{\frac{1}{2} \sum_{i=1}^n w_{ii}^{(k)} \Big( z_i^{(k)} − β_0 − \sum_{j=1}^d β_j x_{ij} \Big)^2}_{ℓ_Q(β_0, β)} + \text{constant}    (69)

where w_{ii}^{(k)} and z_i^{(k)} are defined in (53) and (57). Since we have an ℓ_1-regularization term,
the updating formula for the estimate in the (k + 1)-th step becomes

(\hat{β}_0^{(k+1)}, \hat{β}^{(k+1)}) = \mathrm{argmin}_{β_0, β} \Big\{ \frac{1}{2} \sum_{i=1}^n w_{ii}^{(k)} \Big( z_i^{(k)} − β_0 − \sum_{j=1}^d β_j x_{ij} \Big)^2 + λ \|β\|_1 \Big\}.    (70)

This is a weighted lasso problem and can be solved using coordinate descent. See Figure 8.

Even though the above iterative procedure does not have a general convergence guarantee,
it works very well in practice.

Sparse Logistic Regression Using Coordinate Descent

Choose starting values \hat{β}^{(0)} = (\hat{β}_0^{(0)}, \hat{β}_1^{(0)}, . . . , \hat{β}_d^{(0)})^T.

(Outer loop) For k = 0, 1, 2, . . ., iterate the following steps until convergence.

1. For i = 1, . . . , n, calculate π_1(x_i, \hat{β}^{(k)}), z_i^{(k)}, and w_{ii}^{(k)} according to (46), (57),
   and (53).

2. Set α_0 = \frac{\sum_{i=1}^n w_{ii}^{(k)} z_i^{(k)}}{\sum_{i=1}^n w_{ii}^{(k)}} and α_ℓ = \hat{β}_ℓ^{(k)} for ℓ = 1, . . . , d.

3. (Inner loop) Iterate the following steps until convergence.

   For j ∈ {1, . . . , d}:

   (a) For i = 1, . . . , n, calculate r_{ij} = z_i^{(k)} − α_0 − \sum_{ℓ ≠ j} α_ℓ x_{iℓ}.

   (b) Calculate u_j^{(k)} = \sum_{i=1}^n w_{ii}^{(k)} r_{ij} x_{ij} and v_j^{(k)} = \sum_{i=1}^n w_{ii}^{(k)} x_{ij}^2.

   (c) Set α_j = \mathrm{sign}(u_j^{(k)}) \, \frac{(|u_j^{(k)}| − λ)_+}{v_j^{(k)}}.

4. Set \hat{β}_0^{(k+1)} = α_0 and \hat{β}_ℓ^{(k+1)} = α_ℓ for ℓ = 1, . . . , d.

5. Update the π_1(x_i, \hat{β})'s using (46) with the current estimate \hat{β}^{(k+1)}.

Figure 8: Sparse Logistic Regression.
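
Here is a compact Python/NumPy sketch of the inner weighted-lasso loop of Figure 8 (not from the notes; the names are made up, and it refreshes the intercept each sweep, a small variant of step 2):

import numpy as np

def weighted_lasso_cd(X, z, w, lam, n_sweeps=50):
    # Minimizes 0.5 * sum_i w_i (z_i - a0 - sum_j a_j x_ij)^2 + lam * ||a||_1,
    # i.e. the quadratic problem (70) for fixed weights w and working response z.
    n, d = X.shape
    a0 = np.sum(w * z) / np.sum(w)                   # step 2 of Figure 8
    a = np.zeros(d)
    for _ in range(n_sweeps):
        for j in range(d):
            r = z - a0 - X @ a + a[j] * X[:, j]      # partial residual, step (a)
            u = np.sum(w * r * X[:, j])              # u_j, step (b)
            v = np.sum(w * X[:, j] ** 2)             # v_j, step (b)
            a[j] = np.sign(u) * max(abs(u) - lam, 0.0) / v   # soft-threshold, step (c)
        # Refresh the unpenalized intercept (a minor variant; Figure 8 sets it once per outer step).
        a0 = np.sum(w * (z - X @ a)) / np.sum(w)
    return a0, a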


Figure 9: The 0-1 classification loss (blue dashed line), hinge loss (red solid line) and logistic
loss (black dotted line).

8 Support Vector Machines

The support vector machine (SVM) classifier is a linear classifier that replaces the 0-1 loss
with a surrogate loss function. (Logistic regression uses a different surrogate.) In this
section, the outcomes are coded as −1 and +1, so the 0-1 loss is L(x, y, β) = I(y ≠ h_β(x)) =
I(y H_β(x) < 0). The SVM replaces the 0-1 loss with the hinge loss
L_{hinge}(y, H(x)) ≡ (1 − y H(x))_+ rather than the logistic loss. The hinge loss is the smallest
convex function that lies above the 0-1 loss. (When we discuss nonparametric classifiers, we
will consider more general support vector machines.)

The support vector machine classifier is \hat{h}(x) = I(\hat{H}(x) > 0), where the hyperplane
\hat{H}(x) = \hat{β}_0 + \hat{β}^T x is obtained by minimizing

\sum_{i=1}^n (1 − Y_i H(X_i))_+ + \frac{λ}{2} \|β\|_2^2    (71)

where λ > 0 and the factor 1/2 is only for notational convenience.

Figure 9 compares the hinge loss, 0-1 loss, and logistic loss. The advantages of the hinge
loss are that it is convex, it has a corner which leads to efficient computation, and the
minimizer of E(1 − Y H(X))_+ is the Bayes rule. A disadvantage of the hinge loss is that one
can't recover the regression function m(x) = E(Y | X = x).
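
As a rough illustration only (a Python/NumPy sketch, not the solver developed in these notes), the objective (71) can be minimized directly by subgradient descent; Y is assumed to be coded as ±1 and the step size is a made-up constant:

import numpy as np

def svm_subgradient(X, y, lam=1.0, n_iter=2000, lr=0.01):
    # Minimize sum_i (1 - y_i (b0 + x_i^T b))_+ + (lam/2) ||b||^2 by subgradient descent.
    n, d = X.shape
    b0, b = 0.0, np.zeros(d)
    for _ in range(n_iter):
        margin = y * (b0 + X @ b)
        active = margin < 1                       # points contributing to the hinge loss
        g_b = lam * b - X[active].T @ y[active]   # subgradient with respect to beta
        g_b0 = -np.sum(y[active])                 # subgradient with respect to beta_0
        b -= lr * g_b
        b0 -= lr * g_b0
    return b0, b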

The SVM classifier is often developed from a geometric perspective. Suppose first that the
data are linearly separable, that is, there exists a hyperplane that perfectly separates the
two classes. How can we find a separating hyperplane? LDA is not guaranteed to find it. A
separating hyperplane will minimize

−\sum_{i ∈ M} Y_i H(X_i)


Figure 10: The hyperplane H(x) has the largest margin of all hyperplanes that separate the
two classes.

where M is the index set of all misclassified data points. Rosenblatt's perceptron algorithm
takes starting values and, for each misclassified point, iteratively updates the coefficients as

\begin{pmatrix} β \\ β_0 \end{pmatrix} ← \begin{pmatrix} β \\ β_0 \end{pmatrix} + ρ \begin{pmatrix} Y_i X_i \\ Y_i \end{pmatrix}

where ρ > 0 is the learning rate. If the data are linearly separable, the perceptron algorithm
is guaranteed to converge to a separating hyperplane. However, there could be many
separating hyperplanes, and different starting values may lead to different ones. The question
is: which separating hyperplane is the best? A minimal sketch of the update is given below.
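
The following Python/NumPy sketch of Rosenblatt's update is illustrative only (not part of the notes; labels are assumed to be coded as ±1):

import numpy as np

def perceptron(X, y, rho=1.0, n_epochs=100):
    # Cycle through the data, updating (beta, beta_0) on each misclassified point.
    n, d = X.shape
    beta, beta0 = np.zeros(d), 0.0
    for _ in range(n_epochs):
        updated = False
        for i in range(n):
            if y[i] * (beta0 + X[i] @ beta) <= 0:   # misclassified (or on the boundary)
                beta += rho * y[i] * X[i]
                beta0 += rho * y[i]
                updated = True
        if not updated:                             # converged: all points correctly classified
            break
    return beta0, beta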

Intuitively, it seems reasonable to choose the hyperplane "furthest" from the data in the
sense that it separates the +1's and −1's and maximizes the distance to the closest point.
This hyperplane is called the maximum margin hyperplane. The margin is the distance from
the hyperplane to the nearest data point. Points on the boundary of the margin are called
support vectors. See Figure 10. The goal, then, is to find a separating hyperplane that
maximizes the margin. After some simple algebra, we can show that (71) achieves exactly
this goal. In fact, (71) also works for data that are not linearly separable.

The unconstrained optimization problem (71) can be equivalently formulated in constrained
form:

\min_{β_0, β} \Big\{ \frac{1}{2} \|β\|_2^2 + \frac{1}{λ} \sum_{i=1}^n ξ_i \Big\}    (72)
\text{subject to} \quad ξ_i ≥ 0 \ \text{and} \ ξ_i ≥ 1 − y_i H(x_i) \ \text{for all } i.    (73)

Given two vectors a and b, let ⟨a, b⟩ = a^T b = \sum_j a_j b_j denote the inner product of a and b.
The following lemma provides the dual of the optimization problem in (72).

Lemma 11 The dual of the SVM optimization problem in (72) takes the form

\hat{α} = \mathrm{argmax}_{α ∈ R^n} \Big\{ \sum_{i=1}^n α_i − \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n α_i α_k Y_i Y_k ⟨X_i, X_k⟩ \Big\}    (74)

\text{subject to} \quad 0 ≤ α_1, . . . , α_n ≤ \frac{1}{λ} \quad \text{and} \quad \sum_{i=1}^n α_i Y_i = 0,    (75)

with the primal-dual relationship \hat{β} = \sum_{i=1}^n \hat{α}_i Y_i X_i. We also have

\hat{α}_i \big( 1 − ξ_i − Y_i (\hat{β}_0 + \hat{β}^T X_i) \big) = 0, \qquad i = 1, . . . , n.    (76)

Proof. Let α_i, γ_i ≥ 0 be the Lagrange multipliers. The Lagrangian function can be written
as

L(ξ, β, β_0, α, γ) = \frac{1}{2} \|β\|_2^2 + \frac{1}{λ} \sum_{i=1}^n ξ_i + \sum_{i=1}^n α_i \big( 1 − ξ_i − Y_i H(X_i) \big) − \sum_{i=1}^n γ_i ξ_i.    (77)

The Karush-Kuhn-Tucker conditions are

∀i, \quad α_i ≥ 0, \ γ_i ≥ 0, \ ξ_i ≥ 0 \ \text{and} \ ξ_i ≥ 1 − Y_i H(X_i),    (78)

β = \sum_{i=1}^n α_i Y_i X_i, \quad \sum_{i=1}^n α_i Y_i = 0 \quad \text{and} \quad α_i + γ_i = 1/λ,    (79)

∀i, \quad α_i \big( 1 − ξ_i − Y_i H(X_i) \big) = 0 \quad \text{and} \quad γ_i ξ_i = 0.    (80)

The dual formulation in (74) follows by plugging (78) and (79) into (77). The primal-dual
complementary slackness condition (76) is obtained from the first equation in (80). □

The dual problem (74) is easier to solve than the primal problem (71). The data points
(X_i, Y_i) for which \hat{α}_i > 0 are called support vectors. By (76) and (72), for all the data
points (X_i, Y_i) satisfying Y_i(\hat{β}_0 + \hat{β}^T X_i) > 1, we must have \hat{α}_i = 0. The solution of the
dual problem is therefore sparse. From the first equality in (79), we see that the final estimate
\hat{β} is a linear combination of the support vectors only. Among the support vectors, if
\hat{α}_i < 1/λ, we call (X_i, Y_i) a margin point. For a margin point (X_i, Y_i), the last equality in
(79) implies that γ_i > 0, and then the second equality in (80) implies ξ_i = 0. Moreover, using
the first equality in (80), we get

\hat{β}_0 = Y_i − X_i^T \hat{β}.    (81)

Therefore, once \hat{β} is given, we can calculate \hat{β}_0 from any margin point (X_i, Y_i).

Example 12 We consider classifying two types of irises, versicolor and virginica. There are
50 observations in each class. The covariates are "Sepal.Length", "Sepal.Width", "Petal.Length",
and "Petal.Width". After fitting an SVM we get a 3/100 misclassification rate. The SVM
uses 33 support vectors.

9 Case Study I: Supernova Classification

A supernova is an exploding star. Type Ia supernovae are a special class of supernovae that
are very useful in astrophysics research. These supernovae have a characteristic light curve,
which is a plot of the luminosity of the supernova versus time. The maximum brightness
of all type Ia supernovae is approximately the same. In other words, the true (or absolute)
brightness of a type Ia supernova is known. On the other hand, the apparent (or observed)
brightness of a supernova can be measured directly. Since we know both the absolute and
apparent brightness of a type Ia supernova, we can compute its distance. Because of this,
type Ia supernovae are sometimes called standard candles. Two supernovae, one type Ia
and one non-type Ia, are illustrated in Figure 11. Astronomers also measure the redshift
of the supernova, which is essentially the speed at which the supernova is moving away
from us. The relationship between distance and redshift provides important information for
astrophysicists in studying the large scale structure of the universe.

Figure 11: Two supernova remnants from the NASA’s Chandra X-ray Observatory study.
The image in the right panel, the so-called Kepler supernova remnant, is ”Type Ia”. Such
supernovae have a very symmetric, circular remnant. This type of supernova is thought to
be caused by a thermonuclear explosion of a white dwarf, and is often used by astronomers
as a “standard candle” for measuring cosmic distances. The image in the left panel is a
different type of supernova that comes from “core collapse.” Such supernovae are distinctly
more asymmetric. (Credit: NASA/CXC/UCSC/L. Lopez et al.)

A challenge in astrophysics is to classify supernovae as type Ia versus other types. [?]


released a mixture of real and realistically simulated supernovae and challenged the scientific
community to find effective ways to classify the type Ia supernovae. The dataset consists of

about 20,000 simulated supernovae. For each supernova, there are a few noisy measurements
of the flux (brightness) in four different filters. These four filters correspond to different
wavelengths. Specifically, the filters correspond to the g-band (green), r-band (red), i-band
(infrared) and z-band (blue). See Figure 12.

Figure 12: Four filters (g, r, i, z-bands) corresponding to a type Ia supernova DES-SN000051.
For each band, a weighted regression spline fit (solid red) with the corresponding standard
error curves (dashed green) is provided. The black points with bars represent the flux values
and their estimated standard errors.

To estimate a linear classifier we need to preprocess the data to extract features. One
difficulty is that each supernova is only measured at a few irregular time points, and these
time points are not aligned. To handle this problem we use nonparametric regression to get
a smooth curve. (We used the estimated measurement errors of each flux as weights and
fitted a weighted least squares regression spline to smooth each supernova.) All four filters

of each supernova are then aligned according to the peak of the r-band. We also rescale so
that all the curves have the same maximum.

The goal of this study is to build linear classifiers to predict whether a supernova is type
Ia or not. For simplicity, we only use the information in the r-band. First, we align the
fitted regression spline curves of all supernovae by calibrating their maximum peaks and set
the corresponding time point to be day 0. There are altogether 19,679 supernovae in the
dataset, of which 1,367 are labeled. To get a higher signal-to-noise ratio, we discard all
supernovae with fewer than 10 r-band flux measurements. We finally get a trimmed dataset
with 255 supernovae, 206 of which are type Ia and 49 of which are non-type Ia.



Figure 13: The matrix of scatterplots of the first five features of the supernova data. On the
diagonal cells are the estimated univariate densities of each feature. The off-diagonal cells
visualize the pairwise scatter plots of the two corresponding variables with a least squares fit.
We see the time-domain features are highly correlated, while the frequency-domain features
are almost uncorrelated.

We use two types of features: time-domain features and frequency-domain features. The
time-domain features are the interpolated regression spline values on an equally spaced time
grid. In this study, the grid has length 100, ranging from day −20 to day 80. Since all the
fitted regression curves have similar global shapes, the time-domain features are expected to
be highly correlated. This conjecture is confirmed by the matrix of scatterplots of the first
five features in Figure 13. To make the features less correlated, we also extract
frequency-domain features, which are simply the discrete cosine transformations of the
corresponding time-domain features. More specifically, given the time-domain features
X_1, . . . , X_d (d = 100), their corresponding frequency-domain features \tilde{X}_1, . . . , \tilde{X}_d can be
written as

\tilde{X}_j = \frac{2}{d} \sum_{k=1}^d X_k \cos\Big[ \frac{π}{d} \Big( k − \frac{1}{2} \Big)(j − 1) \Big] \quad \text{for } j = 1, . . . , d.    (82)

The right panel of Figure 13 illustrates the scatterplot matrix of the first 5 frequency-domain
features. In contrast to the time-domain features, the frequency-domain features have low
correlation.
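
A small Python/NumPy sketch of the transformation (82) (not part of the study's actual code; the array name is made up):

import numpy as np

def dct_features(x):
    # x: length-d vector of time-domain features X_1, ..., X_d.
    # Returns the frequency-domain features of equation (82).
    d = len(x)
    k = np.arange(1, d + 1)
    j = np.arange(1, d + 1)
    basis = np.cos(np.pi / d * np.outer(k - 0.5, j - 1))   # basis[k-1, j-1] = cos[(pi/d)(k - 1/2)(j - 1)]
    return (2.0 / d) * x @ basis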

We apply sparse logistic regression (LR), support vector machines (SVM), diagonal linear
discriminant analysis (DLDA), and diagonal quadratic discriminant analysis (DQDA) to
this dataset. For each method, we conduct 100 runs; within each run, 40% of the data are
randomly selected for training and the remaining 60% are used for testing.

Figure 14 illustrates the regularization paths of sparse logistic regression using the
time-domain and frequency-domain features. A regularization path shows the coefficient value
of each feature over the whole range of regularization parameters. Since the time-domain
features are highly correlated, the corresponding regularization path is quite irregular. In
contrast, the paths for the frequency-domain features behave stably.

Figure 14: The regularization paths of sparse logistic regression using the time-domain and
frequency-domain features. The vertical axis corresponds to the values of the coefficients,
plotted as a function of their ℓ_1-norm. The path using time-domain features is highly
irregular, while the path using frequency-domain features is more stable.

Figure 15 compares the classification performance of all these methods. The results show
that classification in the frequency domain is not helpful. The regularization paths of the
SVM are the same in both the time and frequency domains. This is expected since the
discrete cosine transformation is an orthonormal transformation, which corresponds to
rotating the data in the feature space while preserving Euclidean distances and inner
products. It is easy

Figure 15: Comparison of different methods on the supernova dataset using both the time-
domain (left column) and frequency-domain features (right Column). Top four figures: mean
error curves (black: training error, red: test error) and their corresponding standard error
bars for sparse logistic regression (LR) and support vector machines (SVM). Bottom two
figures: boxplots of the training and test errors of diagonal linear discriminant analysis
(DLDA) and diagonal quadratic discriminant analysis (DQDA). For the time-domain fea-
tures, the SVM achieves the smallest test error among all methods.

to see that the SVM is rotation invariant. Sparse logistic regression is not rotation invariant
due to the ℓ_1-norm regularization term. The performance of sparse logistic regression in
the frequency domain is worse than in the time domain. DLDA and DQDA are also not
rotation invariant; their performance decreases significantly in the frequency domain
compared to the time domain. In both the time and frequency domains, the SVM outperforms
all the other methods, followed by sparse logistic regression, which is better than DLDA
and DQDA.

10 Case Study II: Political Blog Classification

In this example, we classify political blogs according to whether their political leanings are
liberal or conservative. Snapshots of two political blogs are shown in Figure 16.

Figure 16: Examples of two political blogs with different orientations, one conservative and
the other liberal.

The data consist of 403 political blogs in a two-month window before the 2004 presidential
election. Among these blogs, 205 are liberal and 198 are conservative. We use bag-of-words
features, i.e., each unique word from these 403 blogs serves as a feature. For each blog,
the value of a feature is the number of occurrences of the word normalized by the total
number of words in the blog. After converting all words to lower case, we remove stop words
and only retain words with at least 10 occurrences across all the 403 blogs. This results
in 23,955 features, each of which corresponds to an English word. Such features are only a
crude representation of the text as an unordered collection of words, disregarding
all grammatical structure. We also extracted features that use hyperlink information. In
particular, we selected 292 out of the 403 blogs that are heavily linked to, and for each
blog i = 1, . . . , 403, its linkage information is represented as a 292-dimensional binary vector
(xi1 , . . . , xi292 )T where xij = 1 if the ith blog has a link to the jth feature blog. The total
number of covariates is then 23,955 + 292 = 24,247. Even though the link features only
constitute a small proportion, they are important for predictive accuracy.

We run the full regularization paths of sparse logistic regression and support vector machines,


Figure 17: Comparison of the sparse logistic regression (LR) and support vector machine
(SVM) on the political blog data. (Top four figures): The mean error curves (Black: training
error, Red: test error) and their corresponding standard error bars of the sparse LR and SVM,
with and without the linkage information. (Bottom two figures): Two typical regularization
paths of the sparse logistic LR with and without the linkage information. On this dataset, the
diagonal linear discriminant analysis (DLDA) achieves a test error 0.303 (sd = 0.07) without
the linkage information and a test error 0.159 (sd = 0.02) with the linkage information.

100 times each. For each run, the data are randomly partitioned into training (60%) and
testing (40%) sets. Figure 17 shows the mean error curves with their standard errors. From
Figure 17, we see that the linkage information is crucial. Without the linkage features, the
smallest mean test error of the support vector machine along the regularization path is 0.247,
while that of sparse logistic regression is 0.270. With the link features, the smallest test
error of the support vector machine becomes 0.132. Although the support vector machine
has a better mean error curve, it has a much larger standard error. Two typical regularization
paths for sparse logistic regression, with and without the link features, are provided at the
bottom of Figure 17. By examining these paths, we see that when the link features are used,
11 of the first 20 selected features are link features. In this case, although the class
conditional distribution is obviously not Gaussian, we still apply diagonal linear discriminant
analysis (DLDA) to this dataset for a comparative study. Without the linkage features,
DLDA has a mean test error of 0.303 (sd = 0.07). With the linkage features, DLDA has a
mean test error of 0.159 (sd = 0.02).

Nonparametric Classification
10/36-702

1 Introduction

Let h : X → {0, 1} denote a classifier, where X is the domain of X. In parametric
classification we assumed that h took a very constrained form, typically linear. In
nonparametric classification we aim to relax this assumption.

Let us recall a few definitions and facts. The classification risk, or error rate, of h is

R(h) = P(Y ≠ h(X))    (1)

and the empirical error rate or training error rate based on training data (X_1, Y_1), . . . , (X_n, Y_n)
is

\hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^n I(h(X_i) ≠ Y_i).    (2)

R(h) is minimized by the Bayes rule

h^*(x) = \begin{cases} 1 & \text{if } m(x) > \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}
       = \begin{cases} 1 & \text{if } \frac{p_1(x)}{p_0(x)} > \frac{1 − π}{π} \\ 0 & \text{otherwise,} \end{cases}    (3)

where m(x) = P(Y = 1 | X = x), p_j(x) = p(x | Y = j) and π = P(Y = 1). The excess risk of
a classifier h is R(h) − R(h^*).

In the multiclass case, Y ∈ {1, . . . , k}, the Bayes rule is

h^*(x) = \mathrm{argmax}_{1 ≤ j ≤ k} π_j p_j(x) = \mathrm{argmax}_{1 ≤ j ≤ k} m_j(x)

where m_j(x) = P(Y = j | X = x), π_j = P(Y = j) and p_j(x) = p(x | Y = j).

2 Plugin Methods

One approach to nonparametric classification is to estimate the unknown quantities in the
expression for the Bayes rule (3) and simply plug them in. For example, if \hat{m} is any
nonparametric regression estimator then we can use

\hat{h}(x) = \begin{cases} 1 & \text{if } \hat{m}(x) > \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}    (4)

For example, we could use the kernel regression estimator

\hat{m}_h(x) = \frac{\sum_{i=1}^n Y_i K\big( \|x − X_i\| / h \big)}{\sum_{i=1}^n K\big( \|x − X_i\| / h \big)}.

However, the bandwidth should be optimized for classification error as described in Section 8.
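
For instance, here is a Python/NumPy sketch of the kernel plug-in rule with a Gaussian kernel (not from the notes; the bandwidth h is left to the user):

import numpy as np

def kernel_plugin_classify(x, X, Y, h):
    # Nadaraya-Watson estimate of m(x) = P(Y = 1 | X = x), then threshold at 1/2 as in (4).
    dists = np.linalg.norm(X - x, axis=1)
    weights = np.exp(-0.5 * (dists / h) ** 2)     # Gaussian kernel K(||x - X_i|| / h)
    m_hat = np.sum(weights * Y) / np.sum(weights)
    return int(m_hat > 0.5)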

We have the following theorem.

Theorem 1 Let \hat{h} be the plug-in classifier based on \hat{m}. Then,

R(\hat{h}) − R(h^*) ≤ 2 \int |\hat{m}(x) − m(x)| \, dP(x) ≤ 2 \sqrt{\int |\hat{m}(x) − m(x)|^2 \, dP(x)}.    (5)

An immediate consequence of this theorem is that any result about nonparametric regression
can be turned into a result about nonparametric classification. For example, if
\int |\hat{m}(x) − m(x)|^2 dP(x) = O_P(n^{-2β/(2β+d)}) then R(\hat{h}) − R(h^*) = O_P(n^{-β/(2β+d)}).
However, (5) is an upper bound and it is possible that R(\hat{h}) − R(h^*) is strictly smaller than
\sqrt{\int |\hat{m}(x) − m(x)|^2 dP(x)}.

When Y ∈ {1, . . . , k} the plugin rule has the form

\hat{h}(x) = \mathrm{argmax}_j \hat{m}_j(x)

where \hat{m}_j(x) is an estimate of P(Y = j | X = x).

3 Classifiers Based on Density Estimation

We can apply nonparametric density estimation to each class to get estimators \hat{p}_0 and \hat{p}_1.
Then we define

\hat{h}(x) = \begin{cases} 1 & \text{if } \frac{\hat{p}_1(x)}{\hat{p}_0(x)} > \frac{1 − \hat{π}}{\hat{π}} \\ 0 & \text{otherwise} \end{cases}    (6)

where \hat{π} = n^{-1} \sum_{i=1}^n Y_i. Hence, any nonparametric density estimation method yields a
nonparametric classifier.

A simplification occurs if we assume that the covariate has independent coordinates,
conditioned on the class variable Y. Thus, if X_i = (X_{i1}, . . . , X_{id})^T has dimension d and if
we assume conditional independence, then the density factors as p_j(x) = \prod_{ℓ=1}^d p_{jℓ}(x_ℓ). In
this case we can estimate the one-dimensional marginals p_{jℓ}(x_ℓ) separately and then define
\hat{p}_j(x) = \prod_{ℓ=1}^d \hat{p}_{jℓ}(x_ℓ). This has the advantage that we never have to do more than a
one-dimensional density estimate. This approach is called naive Bayes. The resulting classifier
can sometimes be very accurate even if the independence assumption is false.
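
Here is a minimal Python/NumPy sketch of naive Bayes with one-dimensional Gaussian kernel density estimates (not from the notes; a single fixed bandwidth is assumed for simplicity):

import numpy as np

def kde_1d(grid_x, samples, h):
    # One-dimensional Gaussian kernel density estimate evaluated at grid_x.
    z = (grid_x - samples[:, None]) / h
    return np.mean(np.exp(-0.5 * z ** 2), axis=0) / (h * np.sqrt(2 * np.pi))

def naive_bayes_classify(x, X, Y, h=0.5):
    # p_j(x) is estimated as the product of per-coordinate KDEs within class j.
    scores = []
    for j in (0, 1):
        Xj = X[Y == j]
        log_p = np.log(len(Xj) / len(X))          # log prior pi_j
        for ell in range(X.shape[1]):
            log_p += np.log(kde_1d(np.array([x[ell]]), Xj[:, ell], h)[0])
        scores.append(log_p)
    return int(np.argmax(scores))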

It is easy to extend density-based methods to multiclass problems. If Y ∈ {1, . . . , k} then
we estimate the k densities \hat{p}_j(x) = \hat{p}(x | Y = j) and the classifier is

\hat{h}(x) = \mathrm{argmax}_j \hat{π}_j \hat{p}_j(x)

where \hat{π}_j = n^{-1} \sum_{i=1}^n I(Y_i = j).

4 Nearest Neighbors

The k-nearest neighbor classifier is

h(x) = \begin{cases} 1 & \text{if } \sum_{i=1}^n w_i(x) I(Y_i = 1) > \sum_{i=1}^n w_i(x) I(Y_i = 0) \\ 0 & \text{otherwise} \end{cases}    (7)

where w_i(x) = 1 if X_i is one of the k nearest neighbors of x and w_i(x) = 0 otherwise.
"Nearest" depends on how you define the distance. Often we use the Euclidean distance
\|X_i − X_j\|. In that case you should standardize the variables first.

The k-nearest neighbor classifier can be recast as a plugin rule. Define the regression
estimator

\hat{m}(x) = \frac{\sum_{i=1}^n Y_i I(\|X_i − x\| ≤ d_k(x))}{\sum_{i=1}^n I(\|X_i − x\| ≤ d_k(x))}

where d_k(x) is the distance between x and its k-th nearest neighbor. Then \hat{h}(x) = I(\hat{m}(x) > 1/2).
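
A direct Python/NumPy sketch of the k-nearest neighbor rule (7) (not from the notes; Euclidean distance, with ties broken toward class 0):

import numpy as np

def knn_classify(x, X, Y, k):
    # Find the k training points closest to x and take a majority vote of their labels.
    dists = np.linalg.norm(X - x, axis=1)
    neighbors = np.argsort(dists)[:k]
    return int(np.sum(Y[neighbors] == 1) > np.sum(Y[neighbors] == 0))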

It is interesting to consider the classification error when n is large. First suppose that k = 1
and consider a fixed x. Then \hat{h}(x) is 1 if the closest X_i has label Y = 1 and \hat{h}(x) is 0 if the
closest X_i has label Y = 0. When n is large, the closest X_i is approximately equal to x. So
the probability of an error is

m(X_i)(1 − m(x)) + (1 − m(X_i)) m(x) ≈ m(x)(1 − m(x)) + (1 − m(x)) m(x) = 2 m(x)(1 − m(x)).

Define

L_n = P(Y ≠ \hat{h}(X) \,|\, D_n)

where D_n = {(X_1, Y_1), . . . , (X_n, Y_n)}. Then we have that

\lim_{n → ∞} E(L_n) = E(2 m(X)(1 − m(X))) ≡ R_{(1)}.    (8)

The Bayes risk can be written as R^* = E(A) where A = \min\{m(X), 1 − m(X)\}. Note that
2 m(X)(1 − m(X)) = 2 A(1 − A) and, since Cov(A, 1 − A) = −Var(A) ≤ 0, we have
E(A(1 − A)) ≤ E(A) E(1 − A). Hence, we obtain the well-known result due to Cover and
Hart (1967),

R^* ≤ R_{(1)} = 2 E(A(1 − A)) ≤ 2 E(A) E(1 − A) = 2 R^*(1 − R^*) ≤ 2 R^*.

Thus, for any problem with small Bayes error, the k = 1 nearest neighbor rule has small
error.

More generally, for any odd k,

\lim_{n → ∞} E(L_n) = R_{(k)}    (9)

where

R_{(k)} = E\Big( \sum_{j=0}^k \binom{k}{j} m^j(X)(1 − m(X))^{k−j} \big[ m(X) I(j < k/2) + (1 − m(X)) I(j > k/2) \big] \Big).

Theorem 2 (Devroye et al. 1996) For all odd k,

R^* ≤ R_{(k)} ≤ R^* + \frac{1}{\sqrt{k e}}.    (10)

Proof. We can rewrite R_{(k)} as R_{(k)} = E(a(m(X))) where

a(z) = \min\{z, 1 − z\} + |2z − 1| \, P\Big( B > \frac{k}{2} \Big)

and B ∼ Binomial(k, \min\{z, 1 − z\}). The mean of a(z) is less than or equal to its maximum
and, by symmetry, we can take the maximum over 0 ≤ z ≤ 1/2. Hence, letting B ∼
Binomial(k, z), we have, by Hoeffding's inequality,

R_{(k)} − R^* ≤ \sup_{0 ≤ z ≤ 1/2} (1 − 2z) P\Big( B > \frac{k}{2} \Big) ≤ \sup_{0 ≤ z ≤ 1/2} (1 − 2z) e^{-2k(1/2 − z)^2} = \sup_{0 ≤ u ≤ 1} u e^{-k u^2/2} = \frac{1}{\sqrt{k e}}. \quad □


If the distribution of X has a density function then we have the following.

Theorem 3 (Devroye and Györfi 1985) Suppose that the distribution of X has a density
and that k → ∞ and k/n → 0. For every ε > 0 the following is true. For all large n,

P(R(\hat{h}) − R^* > ε) ≤ e^{-n ε^2 / (72 γ_d^2)}

where \hat{h} is the k-nearest neighbor classifier estimated on a sample of size n, and where γ_d
depends on the dimension d of X.

Recently, Chaudhuri and Dasgupta (2014) have obtained some very general results about
k-nn classifiers. We state one of their key results here.

Theorem 4 (Chaudhuri and Dasgupta 2014) Suppose that

P(\{x : |m(x) − 1/2| ≤ t\}) ≤ C t^β

for some β ≥ 0 and some C > 0. Also, suppose that m satisfies the following smoothness
condition: for all x and r > 0,

|m(B) − m(x)| ≤ L P(B^o)^α

where B = \{u : \|x − u\| ≤ r\}, B^o = \{u : \|x − u\| < r\} and m(B) = (P(B))^{-1} \int_B m(u) \, dP(u).
Fix any 0 < δ < 1. Let h^* be the Bayes rule. With probability at least 1 − δ,

P(\hat{h}(X) ≠ h^*(X)) ≤ δ C \Big( \frac{\log(1/δ)}{n} \Big)^{\frac{αβ}{2α + 1}}.

If k ≍ n^{2α/(2α+1)} then

R(\hat{h}) − R(h^*) ≍ n^{-\frac{α(β+1)}{2α+1}}.

4.1 Partitions and Trees

As with nonparametric regression, simple and interpretable classifiers can be derived by
partitioning the range of X. Let Π_n = {A_1, . . . , A_N} be a partition of X. Let A_j be the
partition element that contains x. Then \hat{h}(x) = 1 if \sum_{X_i ∈ A_j} Y_i ≥ \sum_{X_i ∈ A_j} (1 − Y_i) and
\hat{h}(x) = 0 otherwise. This is nothing other than the plugin classifier based on the partition
regression estimator

\hat{m}(x) = \sum_{j=1}^N \overline{Y}_j I(x ∈ A_j)

where \overline{Y}_j = n_j^{-1} \sum_{i=1}^n Y_i I(X_i ∈ A_j) is the average of the Y_i's in A_j and n_j = \#\{i : X_i ∈ A_j\}.
(We define \overline{Y}_j to be 0 if n_j = 0.)

Recall from the results on regression that if

m ∈ \mathcal{M} = \Big\{ m : |m(x) − m(z)| ≤ L \|x − z\|, \ x, z ∈ R^d \Big\}    (11)

and the binwidth b satisfies b ≍ n^{-1/(d+2)}, then

E\|\hat{m} − m\|_P^2 ≤ \frac{c}{n^{2/(d+2)}}.    (12)
5
Age
< 50 ≥ 50

Blood Pressure 1

< 100 ≥ 100

0 1

Figure 1: A simple classification tree.

1
Blood Pressure

110
1
0

50
Age

Figure 2: Partition representation of classification tree.

From (5), we conclude that R(b h) − R(h∗ ) = O(n−1/(d+2) ). However, this binwidth was based
on the bias-variance tradeoff of the regression problem. For classification, b should be chosen
as described in Section 8.

Like regression trees, classification trees are partition classifiers where the partition is built
recursively. For illustration, suppose there are two covariates, X1 = age and X2 = blood
pressure. Figure 1 shows a classification tree using these variables.

The tree is used in the following way. If a subject has Age ≥ 50 then we classify him as
Y = 1. If a subject has Age < 50 then we check his blood pressure. If systolic blood pressure
is < 100 then we classify him as Y = 1, otherwise we classify him as Y = 0. Figure 2 shows
the same classifier as a partition of the covariate space.

Here is how a tree is constructed. First, suppose that y ∈ Y = {0, 1} and that there is
only a single covariate X. We choose a split point t that divides the real line into two sets

6
A1 = (−∞, t] and A2 = (t, ∞). Let rs (j) be the proportion of observations in As such that
Yi = j: Pn
I(Y = j, Xi ∈ As )
Pn i
rs (j) = i=1 (13)
i=1 I(Xi ∈ As )

for s = 1, 2 and j = 0, 1. The impurity of the split t is defined to be I(t) = 2s=1 γs where
P

1
X
γs = 1 − rs (j)2 . (14)
j=0

This particular measure of impurity is known as the Gini index. If a partition element As
contains all 0’s or all 1’s, then γs = 0. Otherwise, γs > 0. We choose the split point t to
minimize the impurity. Other indices of impurity besides the Gini index can be used, such
as entropy. The reason for using impurity rather than classification error is because impurity
is a smooth function and hence is easy to minimize.

When there are several covariates, we choose whichever covariate and split that leads to the
lowest impurity. This process is continued until some stopping criterion is met. For example,
we might stop when every partition element has fewer than n0 data points, where n0 is some
fixed number. The bottom nodes of the tree are called the leaves. Each leaf is assigned a 0
or 1 depending on whether there are more data points with Y = 0 or Y = 1 in that partition
element.

This procedure is easily generalized to the case where Y ∈ {1, . . . , K}. We define the impurity
by
Xk
γs = 1 − rs2 (j) (15)
j=1

where ri (j) is the proportion of observations in the partition element for which Y = j.

5 Minimax Results

The minimax classification risk over a set of joint distributions P is


 
Rn (P) = inf sup R(b h) − Rn∗ (16)
h P ∈P
b

where R(bh) = P(Y 6= bh(X)), Rn∗ is the Bayes error and the infimum is over all classifiers
constructed from the data (X1 , Y1 ), . . . , (Xn , Yn ). Recall that
sZ
h) − R(h∗ ) ≤ 2
R(b |m(x)
b − m(x)|2 dP (x)

7
Class Rate Condition
E(α) n−α/(2α+d) α > 1/2
BV n−1/3
p
MI log n/n
−α/(2α+1)
L(α, q) n α > (1/q − 1/2)+
α
Bσ,q n−α/(2α+d) α/d > 1/q − 1/2
Neural nets see text

Table 1: Minimax Rates of Convergence for Classification.


q
Thus Rn (P) ≤ 2 R en (P) where R en (P) is the minimax risk for estimating the regression
function m. Since this is just anqinequality, it leaves open the following question: can Rn (P)
be substantially smaller than 2 R en (P)? Yang (1999) proved that the answer is no, in cases
where P is substantially rich. Moreover, we can achieve minimax classification rates using
plugin regression methods.

However, with smaller classes that invoke extra assumptions, such as the Tsybakov noise
condition, there can be a dramatic difference. Here, we summarize Yang’s results under
the richness assumption. This assumption is simply that if m is in the class, then a small
hypercube containing m is also in the class. Yang’s results are summarized in Table 1.

The classes in Table 1 are the following: E(α) is th Sobolev space of order α, BV is the class
of functions of bounded variation, MI is all monotone functions, L(α, q) are α-Lipschitz (in
α
q-norm), and Bσ,q are Besov spaces. For neural nets we have the bound, for every  > 0,
1+(2/d)  1+(1/d)
  4+(4/d) + 
1 log n 4+(2/d)
≤ Rn (P) ≤
n n

It appears that, as d → ∞, we get the dimension independent rate (log n/n)1/4 . However,
this result requires some caution since the class of distributions implicitly gets smaller as d
increases.

6 Support Vector Machines

When we discussed linear classification, we defined SVM classifier b


h(x) = sign(H(x))
b where
T
H(x)
b = βb0 + βb x and βb minimizes
X
[1 − Yi H(Xi )]+ + λ||β||22 .
i

8
We can do a nonparametric version by letting H be in a RKHS and taking the penalty to be
||H||2K . In terms of implementation, this means replacing every instance of an inner product
hXi , Xj i with K(Xi , Xj ).

7 Boosting

Boosting refers to a class of methods that build classifiers in a greedy, iterative way. The
original boosting algorithm is called AdaBoost and is due to Freund and Schapire (1996).
See Figure 3.

The algorithm seems mysterious and there is quite a bit of controversey about why (and
when) it works. Perhaps the most compelling explanation is due to Friedman, Hastie and
Tibshirani (2000) which is the explanation we will give. However, the reader is warned
that there is not consensus on the issue. Futher discussions can be found in Bühlmann and
Hothorn (2007), Zhang and Yu (2005) and Mease and Wyner (2008). The latter paper is
followed by a spirited discussion from several authors. Our view is that boosting combines
two distinct ideas: surrogate loss functions and greedy function approximation.

In this section, we assume that Yi ∈ {−1, +1}. Many classifiers then have the form
h(x) = sign(H(x))
for some function H(x). For example, a linear classifier corresponds to H(x) = β T x. The
risk can then be written as
R(h) = P(Y 6= h(X)) = P(Y H(X) < 0) = E(L(A))
where A = Y H(X) and L(a) = I(a < 0). As a function of a, the loss L(a) is discontinuous
which makes it difficult to work with. Friedman, Hastie and Tibshirani (2000) show that
−a −yH(x)
AdaBoost corresponds to using P a surrogate loss, namely, L(a) = e = e P. Consider
finding a classifier of the form m αm hm (x) by minimizing the exponential loss i e−Yi H(Xi ) .
If we do this iteratively, adding one function
P at a time, this leads precisely to AdaBoost.
Typically, the classifiers hj in the sum m αm hm (x) are taken to be very simple classifiers
such as small classification trees.

The argument in Friedman, Hastie and Tibshirani (2000) is as follows. Consider minimizing
the expected loss J(F ) = E(e−Y F (X) ). Suppose our current estimate is F and consider
updating to an improved estimate F (x) + cf (x). Expanding around f (x) = 0,
J(F + cf ) = E(e−Y (F (X)+cf (X)) ) ≈ E(e−Y F (X) (1 − cY f (X) + c2 Y 2 f 2 (X)/2))
= E(e−Y F (X) (1 − cY f (X) + c2 /2))
since Y 2 = f 2 (X) = 1. Now consider minimizing the latter expression a fixed X = x.
If we minimize over f (x) ∈ {−1, +1} we get f (x) = 1 if Ew (y|x) > 0 and f (x) = −1 if

9
1. Input: (X1 , Y1 ), . . . , (Xn , Yn ) where Yi ∈ {−1, +1}.

2. Set wi = 1/n for i = 1, . . . , n.

3. Repeat for m = 1, . . . , M .
Pn
(a) Compute the weighted error (h) = i=1 wi I(Yi 6= h(Xi ) and find hm to
minimize (h).
(b) Let αm = (1/2) log((1 − )/).
(c) Update the weights:
wi e−αm Yi hm (Xi )
wi ←
Z
where Z is chosen so that the weights sum to 1.

4. The final classifier is


M
!
X
h(x) = sign αm hm (x) .
m=1

Figure 3: AdaBoost

10
Ew (y|x) < 0 where Ew (y|x) = E(w(x, y)y|x)/E(w(x, y)|x) and w(x, y) = e−yF (x) . In other
words, the optimal f is simply the Bayes classifier with respect to the weights. This is exactly
the first step in AdaBoost. If we fix now fix f (x) and minimize over c we get
 
1 1−
c = log
2 
where  = Ew (I(Y 6= f (x))). Thus the updated F (x) is

F (x) ← F (x) + cf (x)

as in AdaBoost. When we update F this way, we change the weights to


   
−cf (x)y 1−
w(x, y) ← w(x, y)e = w(x, y) exp log I(Y 6= f (x))

which again is the same as AadBoost.

Seen in this light, boosting really combines two ideas. The first is the use of surrogate loss
functions. The second is greedy function approximation.

8 Choosing Tuning Parameters

All the nonparametric methods involve tuning parameters, for example, the number of neigh-
bors k in nearest neighbors. As with density estimation and regression, these parameters
can be chosen by a variety of cross-validation methods. Here we describe the data splitting
version of cross-validation. Suppose the data are (X1 , Y1 ), . . . , (X2n , Y2n ). Now randomly
split the data into two halves that we denote by
n o n o
∗ ∗ ∗ ∗
D = (X1 , Y1 ), . . . , (Xn , Yn ) , and E = (X1 , Y1 ), . . . , (Xn , Yn ) .
e e e e

Construct classifiers H = {h1 , . . . , hN } from D corresponding to different values of the tuning


parameter. Define the risk estimator
n
b j) = 1
X
R(h I(Yi∗ 6= hj (Xi∗ )).
n i=1

Let b
h = argminh∈H R(h).
b

Theorem 5 Let h∗ ∈ H minimize R(h) = P(Y 6= h(X)). Then


s  !
1 2N
P R(bh) > R(h∗ ) + 2 log ≤ δ.
2n δ

11
2
Proof. By Hoeffding’s inequality, P(|R(h)
b − R(h)| > ) ≤ 2e−2n , for each h ∈ H. By the
union bound,
2
P(max |R(h)
b − R(h)| > ) ≤ 2N e−2n = δ
h∈H
q
1
log 2N

where  = 2n δ
. Hence, except on a set of probability at most δ,

h) ≤ R(
R(b bbh) +  ≤ R(
bbh∗ ) +  ≤ R(b
h∗ ) + 2.

p
Note that the difference between R(bh) and R(h∗ ) is O( log N/n) but in regression it was
O(log N/n) which is an interesting difference between the two settings. Under low noise
conditions, the error can be improved.

A popular modification of data-splitting is K-fold cross-validation. The data are divided


into K blocks; typicaly K = 10. One block is held out as test data to estimate risk. The
process is then repeated K times, leaving out a different block each time, and the results are
averaged over the K repetitions.

9 Example

The following data are from simulated images of gamma ray events for the Major Atmo-
spheric Gamma-ray Imaging Cherenkov Telescope (MAGIC) in the Canary Islands. The
data are from archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope. The telescope
studies gamma ray bursts, active galactic nuclei and supernovae remnants. The goal is to
predict if an event is real or is background (hadronic shower). There are 11 predictors that
are numerical summaries of the images. We randomly selected 400 training points (200 pos-
itive and 200 negative) and 1000 test cases (500 positive and 500 negative). The results of
various methods are in Table 2. See Figures 4, 5, 6, 7.

10 Sparse Nonparametric Logistic Regression

For high dimensional problems we can use sparsity-based methods. The nonparametric
additive logistic model is
P 
p
exp j=1 fj (Xj )
P(Y = 1 | X) ≡ p(X; f ) = P  (17)
p
1 + exp j=1 f j (X j )

where Y ∈ {0, 1}, and the population log-likelihood is


`(f ) = E [Y f (X) − log (1 + exp f (X))] (18)

12
Method Test Error
Logistic regression 0.23
SVM (Gaussian Kernel) 0.20
Kernel Regression 0.24
Additive Model 0.20
Reduced Additive Model 0.20
11-NN 0.25
Trees 0.20

Table 2: Various methods on the MAGIC data. The reduced additive model is based on
using the three most significant variables from the additive model.
0.2

0.8

0.05
0.4
0.1
0.0

0.6

0.00
0.2
0.0

0.4

−0.05
−0.1

−0.1

0.0
0.2

−0.10
−0.2

−0.2

0.0

−0.2

−0.15
−0.3

−0.2
−0.3

−0.20
−0.4

−1 0 1 2 3 4 5 0 2 4 6 −1 0 1 2 3 4 −2 −1 0 1 2 −2 −1 0 1 2
0.15

0.06

0.04
0.2
0.2
0.04
0.10

0.03
0.1
0.02

0.02
0.05

0.1

0.01
0.00

0.0
0.00

0.00
−0.02

0.0

−0.1
−0.05

−0.04

−0.02
−0.10

−0.1

−0.2
−0.06

−4 −2 0 2 −4 −2 0 2 −6 −2 0 2 4 6 −1.0 0.0 1.0 2.0 −2 −1 0 1 2

Figure 4: Estimated functions for additive model.

13
0.29
0.28
0.27
Test Error

0.26
0.25

0 10 20 30 40 50

Figure 5: Test error versus k for nearest neighbor estimator.

xtrain.V9 < −0.189962


|

xtrain.V1 < 1.21831 xtrain.V1 < −0.536394

xtrain.V4 < 0.411748 xtrain.V4 < −0.769513 xtrain.V10 < −0.369401 xtrain.V2 < 0.343193

xtrain.V8 < −0.912288 xtrain.V3 < −0.463854 xtrain.V3 < −1.14519 xtrain.V7 < 0.015902
0 0 0 0

xtrain.V6 < −0.274359 xtrain.V10 < 0.4797 xtrain.V2 < −0.607802 xtrain.V8 < −0.199142
0 0 1 0

xtrain.V5 < 1.41292 xtrain.V4 < 1.95513


xtrain.V3 < −0.96174
1 1 0 0 0

xtrain.V1 < −0.787108


1 0 1 1 0

1 1

Figure 6: Full tree.

14
xtrain.V9 < | !0.189962

xtrain.V1 < 1.21831 xtrain.V1 < !0.536394

1 0

xtrain.V10 < !0.369401


0
1 0

Figure 7: Classification tree. The size of the tree was chcosen by cross-validation.
P
where f (X) = j=1p fj (Xj ). To fit this model, the local scoring algorithm runs the backfit-
ting procedure within Newton’s method. One iteratively computes the transformed response
for the current estimate fb

Yi − p(Xi ; fb)
Zi = fb(Xi ) + (19)
p(Xi ; fb)(1 − p(Xi ; fb))

and weights w(Xi ) = p(Xi ; fb)(1 − p(Xi ; fb), and carries out a weighted backfitting of (Z, X)
with weights w. The weighted smooth is given by
Sj (wRj )
Pbj = . (20)
Sj w
where Sj is a linear smoothing matrix, such as a kernel smoother. This extends iteratively
reweighted least squares to the nonparametric setting.

A sparsity penality can be incorporated, just as for sparse additive models (SpAM) for
regression. The Lagrangian is given by
p q
!
X
L(f, λ) = E log 1 + ef (X) − Y f (X) + λ
  
E(fj2 (Xj )) − L (21)
j=1

and the stationary condition for component


q function fj is E (p − Y | Xj ) + λvj = 0 where vj
is an element of the subgradient ∂ E(fj2 ). As in the unregularized case, this condition is

15
nonlinear in f , and so we linearize the gradient of the log-likelihood around fb. This yields
the linearized condition E [w(X)(f (X) − Z) | Xj ] + λvj = 0. To see this, note that
 
0 = E p(X; f ) − Y + p(X; f )(1 − p(X; f ))(f (X) − f (X)) | Xj + λvj
b b b b (22)
= E [w(X)(f (X) − Z) | Xj ] + λvj (23)
When E(fj2 ) 6= 0, this implies the condition
 
E (w | Xj ) + q λ  fj (Xj ) = E(wRj | Xj ). (24)
2
E(fj )

In the finite sample case, in terms of the smoothing matrix Sj , this becomes
Sj (wRj )
fj = .q . (25)
Sj w + λ E(fj2 )

If kSj (wRj )k < λ, then fj = 0. Otherwise, this implicit, nonlinear equation for fj cannot be
solved explicitly, so one simply iterates until convergence:
Sj (wRj )
fj ← √ . (26)
Sj w + λ n /kfj k
When λ = 0, this yields the standard local scoring update (20).

Example 6 (SpAM for Spam) Here we consider an email spam classification problem,
using the logistic SpAM backfitting algorithm above. This dataset has been studied Hastie et
al (2001) using a set of 3,065 emails as a training set, and conducting hypothesis tests to
choose significant variables; there are a total of 4,601 observations with p = 57 attributes, all
numeric. The attributes measure the percentage of specific words or characters in the email,
the average and maximum run lengths of upper case letters, and the total number of such
letters.

The results of a typical run of logistic SpAM are summarized in Figure 8, using plug-in
bandwidths. A held-out set is used to tune the regularization parameter λ.

11 Bagging and Random Forests

Suppose we draw B bootstrap samples and each time we construct a classifier. This gives
classifiers h1 , . . . , hB . We now classify by combining them:
(
1 if B1 j hj (x) ≥ 21
P
h(x) =
0 otherwise.

16
λ(×10−3 ) Error # zeros selected variables
5.5 0.2009 55 { 8,54}

5.0 0.1725 51 { 8, 9, 27, 53, 54, 57}

4.5 0.1354 46 {7, 8, 9, 17, 18, 27, 53, 54, 57, 58}

4.0 0.1083 ( ) 20 {4, 6–10, 14–22, 26, 27, 38, 53–58}

3.5 0.1117 0 all


3.0 0.1174 0 all
2.5 0.1251 0 all
2.0 0.1259 0 all
0.20
empirical prediction error
0.18
0.16
0.14
0.12

2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5


penalization parameter

Figure 8: (Email spam) Classification accuracies and variable selection for logistic SpAM.

This is called bagging which stands for bootstrap aggregration. The basline classifiers are
usually trees.

A variation is to choose a random subset of the predictors to split on at each stage. The
resulting classifier is called a random forests. Random forests often perform very well. Their
theoretical performance is not well understood. Some good references are:

Biau, Devroye and Lugosi. (2008). Consistency of Random Forests and Other Average
Classifiers. JMLR.

Biau, G. (2012). Analysis of a Random Forests Model. arXiv:1005.0208.

Lin and Jeon. Random Forests and Adaptive Nearest Neighbors. Journal of the American
Statistical Association, 101, p 578.

17
Wager, S. (2014). Asymptotic Theory for Random Forests. arXiv:1405.0352.

Wager, S. (2015). Uniform convergence of random forests via adaptive concentration. arXiv:1503.06388.

Appendix: Multiclass Sparse Logistic Regression

Now we consider the multiclass version. Suppose we have the nonparametric K-class logistic
regression model
ef` (X)
pf (Y = ` | X) = PK ` = 1, . . . , K (27)
fm (X)
m=1 e
where each function has an additive form
f` (X) = f`1 (X1 ) + f`2 (X2 ) + · · · + f`p (Xp ). (28)
In Newton’s algorithm, we minimize the quadratic approximation to the log-likelihood
h i 1 h i
L(f ) ≈ L(fb) + E (Y − pb)T (f − fb) + E (f − fb)T H(fb)(f − fb) (29)
2
where pb(X) = (pfb(Y = 1 | X), . . . , pfb(Y = K | X)), and H(fb(X)) is the Hessian

H(fb) = −diag(b p(X)T .


p(X)) + pb(X)b (30)
Maximizing the right hand size of (29) is equivalent to minimizing
    1
−E (Y − pb)T (f − fb) − E fbT Jf + E f T Jf

(31)
2
which is, in turn, equivalent to minimizing the surrogate loss function
1
E kZ − Af k22 .

Q(f, fb) ≡ = (32)
2
where J = −H(fb), A = J 1/2 , and Z is defined by

Z = J −1/2 (Y − pb) + J 1/2 fb (33)


= A−1 (Y − pb) + Afb. (34)

The above calculation can be reexpressed as follows, which leads a multiclass backfitting
algorithm. The difference in log-likelihoods for functions {fb` } and {f` } is, to second order,
 !2 
K−1 K−1 K−1
X X Y` − p` (X) X
E p` (X) fb` (X) − pk (X)fbk (X) + − f` (X) + pk (X)fk (X) 
`=0 k=0
p ` (X)
k=0
(35)

18
where p` (X) = P(Y = ` | X), and Y` = δ(Y, `) are indicator variables. Minimizing over {f` }
gives coupled equations for the functions f` ; they can’t be solved independently over `.

A practical approach is to use coordinate descent, computing the function f` holding the
other functions {fk }k6=` fixed, and iterating. Assuming that fk = fbk for k 6= `, this simplifies
to
"  2 X  2 #
Y ` − p ` p k − Y k
E p` (1 − p` )2 fb` + − f` + pk p2` fb` + − f` . (36)
p` (1 − p` ) k6=`
pk p`

After some algebra, this can be seen to be the same as the usual objective function in the
binary case, where we take fb0 = 1 and fb1 arbitrary.

Now assume f` (and fb` ) has an additive form: f` (X) = pj=1 f`j (Xj ). Some further calcula-
P
tion shows that minimizing over each f`j yields the following backfitting algorithm:
" ! #
X Y ` − p `
E p` (1 − p` ) fb` − f`k + | Xj
p ` (1 − p` )
k6=j
f`j (Xj ) ← . (37)
E [p` (1 − p` ) | Xj ]

We approximate the conditional expectations by smoothing, as usual:

Sj (xj )T (w` (X) R`j (X))


f`j (xj ) ← (38)
Sj (xj )T (w` (X))

where
X Y` − p` (X)
R`j (X) = fb` (X) − f`k (Xk ) + (39)
k6=j
p` (X)(1 − p` (X))
w` (X) = p` (X)(1 − p` (X)). (40)

This is the same as in binary logistic regression. We thus have the following algorithm:

Multiclass Logistic Backfitting

1. Initialize {fb` = 0}, and set Z(X) = K.

2. Iterate until convergence:

For each ` = 0, 1, . . . , K − 1
A. Initialize f` = fb`
B. Iterate until convergence:

19
For each j = 1, 2, . . . , p

Sj (xj )T (w` (X) R`j (X))


f`j (xj ) ← where
Sj (xj )T (w` (X))
X Y` − p` (X)
R`j (X) = fb` (X) − f`k (Xk ) +
k6=j
p` (X)(1 − p` (X))
w` (X) = p` (X)(1 − p` (X)).

C. Update Z(X) ← Z(X) − ef` (X) + ef` (X) .


b

D. Set fb` ← f` .

Incrementally updating the normalizing constants (step C) is important so that the proba-
bilties p` (X) = ef` (X) /Z(X) can be efficiently computed, and we avoid an O(K 2 ) algorithm.
b

This can be extended to include a sparsity constraint, as in the binary case.

20
Random Forests
One of the best known classifiers is the random forest. It is very simple and effective but
there is still a large gap between theory and practice. Basically, a random forest is an average
of tree estimators.

These notes rely heavily on Biau and Scornet (2016) as well as the other references at the
end of the notes.

1 Partitions and Trees

We begin by reviewing trees. As with nonparametric regression, simple and interpretable


classifiers can be derived by partitioning the range of X. Let Πn = {A1 , . . . , AN } be
a partition of X . Let Aj be the partition element that contains x. Then b h(x) = 1 if
P P
Xi ∈Aj Yi ≥ Xi ∈Aj (1 − Yi ) and h(x) = 0 otherwise. This is nothing other than the plugin
b
classifier based on the partition regression estimator
N
X
m(x)
b = Y j I(x ∈ Aj )
j=1

−1
Pn
where Y j = nj i=1 Yi I(Xi ∈ Aj ) is the average of the Yi ’s in Aj and nj = #{Xi ∈ Aj }.
(We define Y j to be 0 if nj = 0.)

Recall from the results on regression that if m ∈ H1 (1, L) and the binwidth b of a regular
partition satisfies b  n−1/(d+2) then
c
b − m||2P ≤ 2/(d+2) .
E||m (1)
n
h) − R(h∗ ) = O(n−1/(d+2) ).
We conclude that the corresponding classification risk satisfies R(b

Regression trees and classification trees (also called decision trees) are partition classifiers
where the partition is built recursively. For illustration, suppose there are two covariates,
X1 = age and X2 = blood pressure. Figure 1 shows a classification tree using these variables.

The tree is used in the following way. If a subject has Age ≥ 50 then we classify him as
Y = 1. If a subject has Age < 50 then we check his blood pressure. If systolic blood pressure
is < 100 then we classify him as Y = 1, otherwise we classify him as Y = 0. Figure 2 shows
the same classifier as a partition of the covariate space.

Here is how a tree is constructed. First, suppose that there is only a single covariate X. We
choose a split point t that divides the real line into two sets A1 = (−∞, t] and A2 = (t, ∞).
Let Y 1 be the mean of the Yi ’s in A1 and let Y 2 be the mean of the Yi ’s in A2 .

1
Age
< 50 ≥ 50

Blood Pressure 1

< 100 ≥ 100

0 1

Figure 1: A simple classification tree.

1
Blood Pressure

110
1
0

50
Age

Figure 2: Partition representation of classification tree.

2
For continuous Y (regression), the split is chosen to minimize the training error. For binary
Y (classification), the split is chosen to minimizeP2a surrogate for classification error. A
common choice is the impurity defined by I(t) = s=1 γs where
2
γs = 1 − [Y s + (1 − Y s )2 ]. (2)

This particular measure of impurity is known as the Gini index. If a partition element As
contains all 0’s or all 1’s, then γs = 0. Otherwise, γs > 0. We choose the split point t to
minimize the impurity. Other indices of impurity besides the Gini index can be used, such
as entropy. The reason for using impurity rather than classification error is because impurity
is a smooth function and hence is easy to minimize.

Now we continue recursively splitting until some stopping criterion is met. For example, we
might stop when every partition element has fewer than n0 data points, where n0 is some
fixed number. The bottom nodes of the tree are called the leaves. Each leaf has an estimate
m(x)
b which is the mean of Yi ’s in that leaf. For classification, we take b
h(x) = I(m(x)
b > 1/2).
When there are several covariates, we choose whichever covariate and split that leads to the
lowest impurity.

The result is a piecewise constant estimator that can be represented as a tree.

2 Example

The following data are from simulated images of gamma ray events for the Major Atmo-
spheric Gamma-ray Imaging Cherenkov Telescope (MAGIC) in the Canary Islands. The
data are from archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope. The telescope
studies gamma ray bursts, active galactic nuclei and supernovae remnants. The goal is to
predict if an event is real or is background (hadronic shower). There are 11 predictors that
are numerical summaries of the images. We randomly selected 400 training points (200 pos-
itive and 200 negative) and 1000 test cases (500 positive and 500 negative). The results of
various methods are in Table 1. See Figures 3, 4, 5, 6.

3 Bagging

Trees are useful for their simplicity and interpretability. But the prediction error can be
reduced by combining many trees. A common approach, called bagging, is as follows.

Suppose we draw B bootstrap samples and each time we construct a classifier. This gives tree
classifiers h1 , . . . , hB . (The same idea applies to regression.) We now classify by combining

3
Method Test Error
Logistic regression 0.23
SVM (Gaussian Kernel) 0.20
Kernel Regression 0.24
Additive Model 0.20
Reduced Additive Model 0.20
11-NN 0.25
Trees 0.20

Table 1: Various methods on the MAGIC data. The reduced additive model is based on
using the three most significant variables from the additive model.
0.2

0.8

0.05
0.4
0.1
0.0

0.6

0.00
0.2
0.0

0.4

−0.05
−0.1

−0.1

0.0
0.2

−0.10
−0.2

−0.2

0.0

−0.2

−0.15
−0.3

−0.2
−0.3

−0.20
−0.4

−1 0 1 2 3 4 5 0 2 4 6 −1 0 1 2 3 4 −2 −1 0 1 2 −2 −1 0 1 2
0.15

0.06

0.04
0.2
0.2
0.04
0.10

0.03
0.1
0.02

0.02
0.05

0.1

0.01
0.00

0.0
0.00

0.00
−0.02

0.0

−0.1
−0.05

−0.04

−0.02
−0.10

−0.1

−0.2
−0.06

−4 −2 0 2 −4 −2 0 2 −6 −2 0 2 4 6 −1.0 0.0 1.0 2.0 −2 −1 0 1 2

Figure 3: Estimated functions for additive model.

4
0.29
0.28
0.27
Test Error

0.26
0.25

0 10 20 30 40 50

Figure 4: Test error versus k for nearest neighbor estimator.

xtrain.V9 < −0.189962


|

xtrain.V1 < 1.21831 xtrain.V1 < −0.536394

xtrain.V4 < 0.411748 xtrain.V4 < −0.769513 xtrain.V10 < −0.369401 xtrain.V2 < 0.343193

xtrain.V8 < −0.912288 xtrain.V3 < −0.463854 xtrain.V3 < −1.14519 xtrain.V7 < 0.015902
0 0 0 0

xtrain.V6 < −0.274359 xtrain.V10 < 0.4797 xtrain.V2 < −0.607802 xtrain.V8 < −0.199142
0 0 1 0

xtrain.V5 < 1.41292 xtrain.V4 < 1.95513


xtrain.V3 < −0.96174
1 1 0 0 0

xtrain.V1 < −0.787108


1 0 1 1 0

1 1

Figure 5: Full tree.

5
xtrain.V9 < | !0.189962

xtrain.V1 < 1.21831 xtrain.V1 < !0.536394

1 0

xtrain.V10 < !0.369401


0
1 0

Figure 6: Classification tree. The size of the tree was chosen by cross-validation.

them: (
1 if B1 j hj (x) ≥ 1
P
2
h(x) =
0 otherwise.
This is called bagging which stands for bootstrap aggregation. A variation is sub-bagging
where we use subsamples instead of bootstrap samples.

To get some intuition about why bagging is useful, consider this example from Buhlmann
and Yu (2002). Suppose that x ∈ R and consider the simple decision rule θbn = I(Y n ≤ x).
Let µ = E[Yi ] and for simplicity assume that Var(Yi ) = 1. Suppose that x is close to µ

relative to the sample size. We can model this by setting x ≡ xn = µ + c/ n. Then θbn
converges to I(Z ≤ c) where Z ∼ N (0, 1). So the limiting mean and variance of θbn are

Φ(c) and Φ(c)(1 − Φ(c)). Now the bootstrap
√ distribution of Y (conditional on Y1 , . . . , Yn )

is approximately N (Y , 1/n). That is, n(Y − Y ) ≈ N (0, 1). Let E ∗ denote the average
with respect to the bootstrap randomness. Then, if θen is the bagged estimator, we have
" !#
∗ ∗ ∗
√ ∗ √
θen = E [I(Y ≤ xn )] = E I n(Y − Y ) ≤ n(xn − Y )

= Φ( n(xn − Y )) + o(1) = Φ(c + Z) + o(1)

where Z ∼ N (0, 1), and we used the fact that Y ≈ N (µ, 1/n).

To summarize, θbn ≈ I(Z ≤ c) while θen ≈ Φ(c + Z) which is a smoothed version of I(Z ≤ c).

6
In other words, bagging is a smoothing operator. In particular, suppose we take c = 0.
Then θbn converges to a Bernoulli with mean 1/2 and variance 1/4. The bagged estimator
converges to Φ(Z) = Unif(0, 1) which has mean 1/2 and variance 1/12. The reduction in
variance is due to the smoothing effect of bagging.

4 Random Forests

Finally we get to random forests. These are bagged trees except that we also choose random
subsets of features for each tree. The estimator can be written as
1 X
m(x)
b = m
b j (x)
M j

where mb j is a tree estimator based on a subsample (or bootstrap) of size a using p randomly
selected features. The trees are usually required to have some number k of observations in
the leaves. There are three tuning parameters: a, p and k. You could also think of M as a
tuning parameter but generally we can think of M as tending to ∞.

For each tree, we can estimate the prediction error on the un-used data. (The tree is built
on a subsample.) Averaging these prediction errors gives an estimate called the out-of-bag
error estimate.

Unfortunately, it is very difficult to develop theory for random forests since the splitting
is done using greedy methods. Much of the theoretical analysis is done using simplified
versions of random forests. For example, the centered forest is defined as follows. Suppose
the data are on [0, 1]d . Choose a random feature, split in the center. Repeat until there are
k leaves. This defines one tree. Now we average M such trees. Breiman (2004) and Biau
(2002) proved the following.

Theorem 1 If each feature is selected with probability 1/d, k = o(n) and k → ∞ then

E[|m(X)
b − m(X)|2 ] → 0

as n → ∞.

Under stronger assumptions we can say more:

Theorem 2 Suppose that m is Lipschitz and that m only depends on a subset S of the
features and that the probability of selecting j ∈ S is (1/S)(1 + o(1)). Then
3
  4|S| log
2 1 2+3
E|m(X)
b − m(X)| = O .
n

7
This is better than the usual Lipschitz rate n−2/(d+2) if |S| ≤ p/2. But the condition that
we select relevant variables with high probability is very strong and proving that this holds
is a research problem.

A significant step forward was made by Scornet, Biau and Vert (2015). Here is their result.

Theorem 3 Suppose that Y = j mj (X(j)) +  where X ∼ Uniform[0, 1]d ,  ∼ N (0, σ 2 )


P
and each mj is continuous. Assume that the split is chosen using the maximum drop in sums
of squares. Let tn be the number of leaves on each tree and let an be the subsample size. If
tn → ∞, an → ∞ and tn (log an )9 /an → 0 then
E[|m(X)
b − m(X)|2 ] → 0
as n → ∞.

Again, the theorem has strong assumptions but it does allow a greedy split selection. Scornet,
Biau and Vert (2015) provide another interesting result. Suppose that (i) there is a subset
S of relevant features, (ii) p = d, (iii) mj is not constant on any interval for j ∈ S. Then
with high probability, we always split only on relevant variables.

5 Connection to Nearest Neighbors

Lin and Jeon (2006) showed that there is a connection between random forests and k-NN
methods. We say that Xi is a layered nearest neighbor (LNN) of x If the hyper-rectangle
defined by x and Xi contains no data points except Xi . Now note that if tree is grown until
each leaf has one point, then m(x)
b is simply a weighted average of LNN’s. More generally,
Lin and Jeon (2006) call Xi a k-potential nearest neighbor k − P N N if there are fewer than
k samples in the the hyper-rectangle defined by x and Xi . If we restrict to random forests
whose leaves have k points then it follows easily that m(x)
b is some weighted average of the
k − P N N ’s.

Let us know return to LNN’s. Let Ln (x) denote all LNN’s of x and let Ln (x) = |Ln (x)|. We
could directly define
1 X
m(x)
b = Yi I(Xi ∈ Ln (x)).
Ln (x) i
Biau and Devroye (2010) showed that, if X has a continuous density,
(d − 1)!E[Ln (x)]
→ 1.
2d (log n)d−1
Moreover, if Y is bounded and m is continuous then, for all p ≥ 1,
b n (X) − m(X)|p → 0
E|m

8
as n → ∞. Unfortunately, the rate of convergence is slow. Suppose that Var(Y |X = x) = σ 2
is constant. Then
σ2 σ 2 (d − 1)!
b n (X) − m(X)|p ≥
E|m ∼ d .
E[Ln (x)] 2 (log n)d−1

If we use k-PNN, with k → ∞ and k = o(n), then the results Lin and Jeon (2006) show that
the estimator is consistent and has variance of order O(1/k(log n)d−1 ).

As an aside, Biau and Devroye (2010) also show that if we apply bagging to the usual 1-NN
rule to subsamples of size k and then average over subsamples, then, if k → ∞ and k = o(n),
then for all p ≥ 1 and all distributions P , we have that E|m(X)
b − m(X)|p → 0. So bagged
1-NN is universally consistent. But at this point, we have wondered quite far from random
forests.

6 Connection to Kernel Methods

There is also a connection between random forests and kernel methods (Scornet 2016). Let
Aj (x) be the cell containing x in the j th tree. Then we can write the tree estimator as

1 X X Yi I(Xi ∈ Aj (x)) 1 XX
m(x)
b = = Wij Yj
M j i Nj (x) M j i

where Nj (x) is the number of data points in Aj (x) and Wij = I(Xi ∈ Aj (x))/Nj (x). This
suggests that a cell Aj with low density (and hence small Nj (x)) has a high weight. Based
on this observation, Scornet (2016) defined kernel based random forest (KeRF) by
P P
j i Yi I(Xi ∈ Aj (x))
m(x)
b = P .
j Nj (x)

With this modification, m(x)


b is the average of each Yi weighted by how often it appears in
the trees. The KeRF can be written as
P
Yi K(x, Xi )
m(x)
b = Pi
s Kn (x, Xs )

where
1 X
Kn (x, z) = I(x ∈ Aj (x)).
M j

The trees are random. So let us write the j th tree as Tj = T (Θj ) for some random quantity
Θj . So the forests is built from T (Θ1 ), . . . , T (ΘM ). And we can write Aj (x) as A(x, Θj ).
Then Kn (x, z) converges almost surely (as M → ∞) to κn (x, z) = PΘ (z ∈ A(x, Θ)) which is

9
just the probability that x and z are connected, in the sense that they are in the same cell.
Under some assumptions, Scornet (2016) showed that KeRF’s and forests are close to each
other, thus providing a kernel interpretation of forests.

Recall the centered forest we discussed earlier. This is a stylized forest — quite different
from the forests used in practice — but they provide a nice way to study the properties
of the forest. In the case of KeRF’s, Scornet (2016) shows that if m(x) is Lipschitz and
X ∼ Unif([0, 1]d ) then
  3+d1log 2
2 12
E[(m(x)
b − m(x)) ] ≤ C(log n) .
n

This is slower than the minimax rate n−2/(d+2) but this probably reflects the difficulty in
analyzing forests.

7 Variable Importance

Let m
b be a random forest estimator. How important is feature X(j)?

LOCO. One way to answer this question is to fit the forest with all the data and fit it
again without using X(j). When we construct a forest, we randomly select features for each
tree. This second forest can be obtained by simply average the trees where feature j was
b (−j) . Let H be a hold-out sample of size m. Then let
not selected. Call this m

bj = 1
X
∆ Wi
m i∈H

where
b (−j) (Xi ))2 − (Yi − m(X
Wi = (Yi − m b i ))2 .
Then ∆j is a consistent estimate of the prediction risk inflation that occurs by not having
access to X(j). Formally, if T denotes the training data then,
" #
E[∆ b (−j) (X))2 − (Y − m(X))
b j |T ] = E (Y − m b 2
T ≡ ∆j .

In fact, since ∆
b j is simply an average, we can easily construct a confidence interval. This
approach is called LOCO (Leave-Out-COvariates). Of course, it is easily extended to sets
of features. The method is explored in Lei, G’Sell, Rinaldo, Tibshirani, Wasserman (2017)
and Rinaldo, Tibshirani, Wasserman (2015).

Permutation Importance. A different approach is to permute the values of X(j) for the
out-of-bag observations, for each tree. Let Oj be the out-of-bag observations for tree j and

10
let Oj∗ be the out-of-bag observations for tree j with X(j) permuted.

bj = 1
XX
Γ Wij
M j i

where
1 X 1 X
Wij = b j (Xi ))2 −
(Yi − m b j (Xi ))2 .
(Yi − m
mj i∈O∗ mj i∈O
j j

This avoids using a hold-out sample. This is estimating

b j0 ))2 ] − E[(Y − m(X))


Γj = E[(Y − m(X b 2
]

where Xj0 has the same distribution as X except that Xj0 (j) is an independent draw from
X(j). This is a lot like LOCO but its meaning is less clear. Note that mb j is not changed
when X(j) is permuted. Gregorutti, Michel and Saint Pierre. (2013) show that, when (X, )
is Gaussian, that Var(X) = (1 − c)I + c11T and that Cov(Y, X(j)) = τ for all j then
 2
τ
Γj = 2 .
1 − c + dc

It
P is not clear how this connects to the actual importance of X(j). In the case where Y =
2
j mj (X(j)) +  with E[|X] = 0 and E[ |X] < ∞, they show that Γj = 2Var(mj (X(j)).

8 Inference

Using
√ the theory of infinite order U -statistics, Mentch and Hooker (2015) showed that
n(m(x)
b − E[m(x)])/σ
b converges to a Normal(0,1) and they show how to estimate σ.

Wager and Athey (2017) show asymptotic normality if we use sample splitting: part of the
data are used to build the tree and part is used to estimate the average in the leafs of the
tree. Under a number of technical conditions — including the fact that we use subsamples
of size s = nβ with β < 1 — they show that (m(x)
b − m(x))/σn (x) N (0, 1) and they show
how to estimate σn (x). Specifically,
 2 X
n−1 n
bn2 (x)
σ = b j (x), Nij )2
(Cov(m
n n−s i

where the covariance is with respect to the trees in the forest and Nij = 1 of (Xi , Yi ) was in
the j th subsample and 0 otherwise.

11
9 Summary

Random forests are considered one of the best all purpose classifiers. But it is still a mystery
why they work so well. The situation is very similar to deep learning. We have seen that
there are now many interesting theoretical results about forests. But the results make strong
assumptions that create a gap between practice and theory. Furthermore, there is no theory
to say why forests outperform other methods. The gap between theory and practice is due
to the fact that forests — as actually used in practice — are complex functions of the data.

10 References

Biau, Devroye and Lugosi. (2008). Consistency of Random Forests and Other Average
Classifiers. JMLR.

Biau, Gerard, and Scornet. (2016). A random forest guided tour. Test 25.2: 197-227.

Biau, G. (2012). Analysis of a Random Forests Model. arXiv:1005.0208.

Buhlmann, P., and Yu, B. (2002). Analyzing bagging. Annals of Statistics, 927-961.

Gregorutti, Michel, and Saint Pierre. (2013). Correlation and variable importance in random
forests. arXiv:1310.5726.

Lei J, G’Sell M, Rinaldo A, Tibshirani RJ, Wasserman L. (2017). Distribution-free predictive


inference for regression. Journal of the American Statistical Association.

Lin, Y. and Jeon, Y. (2006). Random Forests and Adaptive Nearest Neighbors. Journal of
the American Statistical Association, 101, p 578.

L. Mentch and G. Hooker. (2015). Ensemble trees and CLTs: Statistical inference for
supervised learning. Journal of Machine Learning Research.

Rinaldo A, Tibshirani R, Wasserman L. (2015). Uniform asymptotic inference and the


bootstrap after model selection. arXiv preprint arXiv:1506.06266.

Scornet E. Random forests and kernel methods. (2016). IEEE Transactions on Information
Theory. 62(3):1485-500.

Wager, S. (2014). Asymptotic Theory for Random Forests. arXiv:1405.0352.

Wager, S. (2015). Uniform convergence of random forests via adaptive concentration. arXiv:1503.06388.

Wager, S. and Athey, S. (2017). Estimation and inference of heterogeneous treatment effects
using random forests. Journal of the American Statistical Association.

12
Clustering
10/26-702 Spring 2017

1 The Clustering Problem

In a clustering problem we aim to find groups in the data. Unlike classification, the data
are not labeled, and so clustering is an example of unsupervised learning. We will study the
following approaches:

1. k-means
2. Mixture models
3. Density-based Clustering I: Level Sets and Trees
4. Density-based Clustering II: Modes
5. Hierarchical Clustering
6. Spectral Clustering

Some issues that we will address are:

1. Rates of convergence
2. Choosing tuning parameters
3. Variable selection
4. High Dimensional Clustering

Example 1 Figures 17 and 18 show some synthetic examples where the clusters are meant
to be intuitively clear. In Figure 17 there are two blob-like clusters. Identifying clusters like
this is easy. Figure 18 shows four clusters: a blob, two rings and a half ring. Identifying
clusters with unusual shapes like this is not quite as easy. In fact, finding clusters of this
type requires nonparametric methods.

2 k-means (Vector Quantization)

One of the oldest approaches to clustering is to find k representative points, called prototypes
or cluster centers, and then divide the data into groups based on which prototype they are
closest to. For now, we assume that k is given. Later we discuss how to choose k.

Warning! My view is that k is a tuning parameter; it is not the number of clusters. Usually
we want to choose k to be larger than the number of clusters.

1
Let X1 , . . . , Xn ∼ P where Xi ∈ Rd . Let C = {c1 , . . . , ck } where each cj ∈ Rd . We call C a
codebook. Let ΠC [X] be the projection of X onto C:

ΠC [X] = argminc∈C ||c − X||2 . (1)

Define the empirical clustering risk of a codebook C by


n n
1X 2 1X
Rn (C) = Xi − ΠC [Xi ] = min ||Xi − cj ||2 . (2)
n i=1 n i=1 1≤j≤k

Let Ck denote all codebooks of length k. The optimal codebook C


b = {b ck } ∈ C k
c1 , . . . , b
minimizes Rn (C):
Cb = argminC∈C Rn (C).
k
(3)
The empirical risk is an estimate of the population clustering risk defined by
2
R(C) = E X − ΠC [X] = E min ||X − cj ||2 (4)
1≤j≤k

where X ∼ P . The optimal population quantization C ∗ = {c∗1 , . . . , c∗k } ∈ Ck minimizes R(C).


b as an estimate of C ∗ . This method is called k-means clustering or vector
We can think of C
quantization.

A codebook C = {c1 , . . . , ck } defines a set of cells known as a Voronoi tesselation. Let


n o
Vj = x : ||x − cj || ≤ ||x − cs ||, for all s 6= j . (5)

The set Vj is known as a Voronoi cell and consists of all points closer to cj than any other
point in the codebook. See Figure 1.

The usual algorithm to minimize Rn (C) and find C b is the k-means clustering algorithm—
also known as Lloyd’s algorithm— see Figure 2. The risk Rn (C) has multiple minima. The
algorithm will only find a local minimum and the solution depends on the starting values.
A common way to choose the starting values is to select k data points at random. We will
discuss better methods for choosing starting values in Section 2.1.

Example 2 Figure 3 shows synthetic data inspired by the Mickey Mouse example from
http: // en. wikipedia. org/ wiki/ K-means_ clustering . The data in the top left plot
form three clearly defined clusters. k-means easily finds in the clusters (top right). The
bottom shows the same example except that we now make the groups very unbalanced. The
lack of balance causes k-means to produce a poor clustering. But note that, if we “overfit
then merge” then there is no problem.

2




Figure 1: The Voronoi tesselation formed by 10 cluster centers c1 , . . . , c10 . The cluster centers
are indicated by dots. The corresponding Voronoi cells T1 , . . . , T10 are defined as follows: a
point x is in Tj if x is closer to cj than ci for i 6= j.

1. Choose k centers c1 , . . . , ck as starting values.


2. Form the clusters C1 , . . . , Ck as follows. Let g = (g1 , . . . , gn ) where gi = argminj ||Xi − cj ||.
Then Cj = {Xi : gi = j}.
3. For j = 1, . . . , k, let nj denote the number of points in Cj and set

1 X
cj ←− Xi .
nj i: Xi ∈Cj

4. Repeat steps 2 and 3 until convergence.


b = {c1 , . . . , ck } and clusters C1 , . . . , Ck .
5. Output: centers C

Figure 2: The k-means (Lloyd’s) clustering algorithm.

3
●●●●●
● ●
●●● ●● ●●
● ●
●●●●
●●● ●
●●
● ●●●● ●●● ●● ● ●●●
●●●●
●●
●●●● ●●
●●●● ●●● ●●●
●●

● ●● ●● ● ●●●
●●●● ●
●●●

● ●
●●
●●●
●●●●

●●
●● ●● ●● ●
● ●
●●●●● ● ● ●
● ●●● ●● ●●●●
●●
●●

●●● ●
●●

● ●
●●●●

●●●●●●
●● ●●

●●●● ●

● ●● ●●●●
● ●●●●●
● ●●●● ●●
●●
● ●
●●●●●
● ●●●
●● ●●
● ● ●●●

● ●● ●●●●● ●●
●● ●●
●● ●
●●●

●●●●●●●●●
●● ●●● ●●●●



● ●
●●
●●●●● ●
●●●●
● ● ● ●●
●●●●● ●● ●

●● ●●
● ●●● ●●● ●
●●


●● ●●●●● ● ●●●● ●●● ●
●●● ● ● ●●●

● ●●●
● ●
●●●●●●
●●● ●●●●●●● ●
●●

●●● ● ●

● ●●●
●● ●●●●●
● ●●

● ●
● ● ●
●● ●●
●● ●
●●● ● ●● ●● ● ●● ● ●●
●●●● ●●
● ●●● ●
●●● ●●●●● ●
●●●●●● ●●
●●
● ●●● ●●●
●●
● ● ● ● ●● ● ● ● ●
● ● ●● ●●●●● ●
●●●●● ● ●
● ●●
●● ●


●●
●●
●● ●● ●●●●
●● ●● ● ●●
●● ●
●●●● ● ● ●●● ●
●●
●●● ●
●●●●●
● ●
●●
●●●●●●

● ●●

●●
●●

●● ●●● ● ● ● ●●●
●● ●●●● ●●●●
● ● ●● ●
●● ●●●
● ●
● ● ●●●●
● ●●

●●●●●●●● ●●●●● ●● ●
●●●●●● ●
●●●●●● ●● ●

●● ● ●● ●
● ● ●●●● ●● ● ● ●●
●● ●●●●●● ●●● ●●●●● ●● ●●● ● ●●●●●
● ●
●● ●● ●●●●●
●●

● ●●●●●●●●●●●●●
●●



●● ●●● ● ●● ● ●
● ●●● ●
● ●
● ●●
● ●

●●

●● ●

● ● ●●
● ●
●●●●
● ●●●
● ●●
●● ●● ● ●
●●
●● ●
● ●●●● ● ● ●
● ●●●●
●●●● ●●●
●● ●
●●●●●●●●
●●●● ●●●
●●●●●●●● ●● ●
● ● ●
● ● ● ●
●● ●●●●● ●● ● ●
● ● ●


● ● ●
●●
●●

● ●

●●●●●●● ●● ● ●●
●● ●●
● ●
●●●●● ●●
● ●●● ●●●●● ● ●● ●●●● ●
●● ●●●●
● ●●●●● ●
● ●●●●● ●● ●●●●
●●
●●●● ●● ●● ●● ●●● ●● ●● ●
● ●●●●● ●●
●● ● ●
●●
● ●●
●●● ● ● ●● ●●● ●● ● ●●●●
● ●


●●●● ● ●
●●●●●
● ●● ●●●●● ●
● ● ●
● ● ●
● ●
●●●●●
● ● ●●●●● ●
● ● ●
●●
●●
●●
●●● ● ●
●●●● ●● ●●
●●● ●●
●●●

●●


●●
● ●●
●●
●●● ●
●●
●●●●
●● ●●
●●● ●●


●●
●●●

●●

●●●● ●
●●●●●
●●

●●●

● ●●●●

●●



●●

●●
● ●
●● ●●●
●● ●●●●● ●●●

●●●

● ●●●●

●●
●●
●●


● ●
●●●●●●
●●


●●
●●●
●●
●●
●●●● ●●
● ●



●●●
●●● ●●


●●●
● ●
●●


●● ●

●● ●
●●● ●●
●●

● ●
●●● ●
● ●●
● ●


●●●●
●●●●●


●●●
●● ●
●●


● ●● ● ● ●●
● ● ● ● ● ●● ● ● ●● ● ● ●


●●● ●● ● ●

●● ● ● ●
●● ●



●●
● ● ●●●● ●
●●
●●● ●

●●● ● ●
●● ●


● ●● ●
●● ●



●●
● ● ●●●●●


●●●



●● ●●●
●●
● ●●
●●●●●● ●


●● ●●●●● ●





●● ●●●
●●
● ● ●

●●●● ●



●● ●●
●●●



●●●●
● ●● ● ●
●●●
●●● ●
●●●
● ●
●●●●●●●● ● ●●●●
● ● ●●●●
●●● ●
●●●
●●●●
●●●●●● ● ●
●● ●● ● ● ●● ● ●●
● ●● ● ● ●
●●●● ●
●●●




●●● ●
● ●
●●●●
● ●●●


● ●●
●●
●●
● ●
●● ●●●●●
●●




●●●
● ●

●●●
● ● ● ●
● ●●
●●

● ●●
●●
●●
● ●●
● ●● ●●●●● ●●●
●●●
● ●● ● ●● ● ●● ●● ●●
● ●

● ●●● ● ●
●● ● ● ● ●● ●
●● ● ● ●● ●● ●
●●●
●●●●●
●●●●●●


●●●
● ●●●
● ●●●●● ●● ●●●
●●●●●


●●●
● ●●● ●●●
● ●
● ● ●●

●●●●●● ● ● ● ●●

●●●●●● ●

●● ● ● ●
●● ● ●

●●
●●●●●
● ●●
● ●●●

●●●●●
● ●● ●●
●●●●●
●●●


●●
●●● ●●●
● ● ●●
●●●
● ●

●●●●


●●●●●● ● ●


● ●

●●
●●
●●
● ●
●●●
●● ●● ●●● ●●●
●● ●
● ●● ● ● ●

● ●●● ●● ●●●●● ●
●●
● ●●●●
● ● ● ● ● ●
● ●
● ● ● ●● ● ● ● ●
● ● ●● ● ●●● ● ●●● ●●●●●● ●●●●● ●●●

● ●●● ●●● ●●●● ● ●●●
●● ● ● ●●● ●● ● ●●
●●
● ● ●●●●●● ●● ● ● ●● ● ●
● ● ●●
●●●●● ●
● ●
● ●● ● ●
●● ●●●
●● ●
●●● ● ●
● ●●
●● ● ●
● ●●● ●●● ● ●● ●●● ●●
●●● ● ●
● ● ●●●●● ● ● ●● ●● ● ●● ● ●
● ● ● ● ●● ●●●
●●
● ●
● ● ● ●
●●
● ● ●● ● ●● ●● ●●



●●● ●● ●●●● ● ●●●● ● ●●● ●●
● ●● ● ●●●● ●
●●● ●●●● ●● ●●●● ●● ●●● ●● ● ●● ●●●● ●● ●●●
● ● ●●● ● ● ●● ● ●●●
● ●●●●
●● ●● ●


● ●●● ●
●● ●


●●●●●
●●


●●
●●●●● ●● ●

● ●●● ● ●●
●● ● ●●●
●●●● ●●
●●●●
●●●●●●● ●● ● ● ●●●●●● ● ●
● ● ●● ● ● ●●●● ●●
●●●●●●●
● ● ●● ●●● ●● ● ●● ●
● ●●●●●●●●● ●● ●● ●●● ●● ●
● ●●●● ●●
●● ● ● ●●● ●● ● ●●●● ●● ●●
● ●
● ●●●
●● ●● ●● ●● ●● ●● ●● ● ●●● ●● ●● ●

●●● ●● ●● ●●●
●● ●● ●● ● ● ● ●● ●●●●● ●● ● ●● ● ●● ● ●
●● ● ●● ● ●● ● ●
●●●●● ●● ● ●● ●
●●●●
● ●●
●●● ●●
● ● ●● ●●● ● ●
● ●
●● ●
● ●● ● ●●● ●●
●●● ●●●●● ●● ● ●●●●● ●●
●●●

●● ●
● ●
●●● ●● ● ● ●
●●●● ●
●●●●●●
● ●● ●● ● ● ● ●●●● ●● ●● ●● ● ●●
●●● ● ●● ● ● ● ●●
●●●●● ●● ● ● ●

●● ● ● ● ●●●● ● ●●●●

Figure 3: Synthetic data inspired by the “Mickey Mouse” example from wikipedia. Top
left: three balanced clusters. Top right: result from running k means with k = 3. Bottom
left: three unbalanced clusters. Bottom right: result from running k means with k = 3
on the unbalanced clusters. k-means does not work well here because the clusters are very
unbalanced.

Example 3 We applied k-means clustering to the Topex data with k = 9. (Topex is a


satellite.) The data are discretized so we treated each curve as one vector of length 70. The
resulting nine clusters are shown in Figure 4.

Example 4 (Supernova Clustering) Figure 5 shows supernova data where we apply k-


means clustering with k = 4. The type Ia supernovae get split into two groups although the
groups are very similar. The other type also gets split into two groups which look qualitatively
different.

Example 5 The top left plot of Figure 6 shows a dataset with two ring-shaped clusters. The
remaining plots show the clusters obtained using k-means clustering with k = 2, 3, 4. Clearly,
k-means does not capture the right structure in this case unless we overfit then merge.

2.1 Starting Values for k-means

Since Rbn (C) has multiple minima, Lloyd’s algorithm is not guaranteed to minimize Rn (C).
The clustering one obtains will depend on the starting values. The simplest way to choose
starting values is to use k randomly chosen points. But this often leads to poor clustering.

4
100 150 200

100 150 200

100 150 200


50

50

50
0

0
0 10 30 50 70 0 10 30 50 70 0 10 30 50 70
100 150 200

100 150 200

100 150 200


50

50

50
0

0 10 30 50 70 0 10 30 50 70 0 10 30 50 70
100 150 200

100 150 200

100 150 200


50

50

50
0

0 10 30 50 70 0 10 30 50 70 0 10 30 50 70

Figure 4: The nine clusters found in the Topex data using k-means clustering with k = 9.
Each plot show the curves in that cluster together with the mean of the curves in that cluster.

5
0 20 40 60 80 100 0 20 40 60 80 100

Cluster 1 Cluster 2

0 20 40 60 80 100 0 20 40 60 80 100

Cluster 3 Cluster 4

Figure 5: Clustering of the supernova light curves with k = 4.

● ●●● ● ●●●
●● ● ● ●●● ●● ● ● ●●●
●●
●● ●● ● ● ●●
●● ●● ●
● ●● ●
●● ● ●●
●● ●●
● ●●
● ●● ●
●● ● ●●
● ● ●
● ● ●● ● ●
● ●
●●
● ●
●●●
● ● ● ●●
●●
●●●


●●● ●
● ● ● ●
● ●

●● ●





●● ●
●●
●●

● ●● ●● ●



●●●


● ●●
●● ●


●● ● ● ●

●●
● ●
●● ●
●● ●● ● ●

● ●
●● ●●●
● ● ● ●●

● ●
●●
●●●
● ●●

● ● ● ●

●●
●●
● ●
●●●●●●●●
●● ● ● ●●


●● ●
● ●●
● ● ●
● ● ●
●● ● ●●

●● ●
● ●● ●●

●●
● ●● ●●

● ●●● ●
●●● ●
● ● ● ● ●●●
●●●●


●●

●●

●●


● ●●


●●
●●●●


● ●●
● ● ●
●●


●●● ●

● ●●

●●
● ●●
● ●

●● ●●

●●

● ●
●● ●●
●●
● ●●●
● ●


●●
●●● ●●


● ●
●● ●
●●●●●●●● ●


● ●

● ●

● ●●
● ●
● ●
●● ●●
●● ●● ●●

● ●● ●●●
●●● ●
● ● ●● ● ●●● ●
●●●

Figure 6: Top left: a dataset with two ring-shaped clusters. Top right: k-means with k = 2.
Bottom left: k-means with k = 3. Bottom right: k-means with k = 4.

6
● ● ● ●
● ●●● ● ●●●
●● ● ● ● ●● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ●
●● ●● ●● ● ● ● ●●
●● ●● ●● ● ● ● ●●
● ● ●● ● ● ● ●● ●● ●●●● ● ● ● ●● ● ● ● ●● ●● ●●●● ●
● ●● ●● ● ● ● ● ● ● ● ●●● ● ●● ● ● ●● ● ● ● ● ● ● ● ●●● ● ●● ● ●
●● ● ● ●● ● ● ● ● ●●● ● ● ● ●● ● ●
● ●● ●● ● ● ●● ● ● ● ● ●●● ● ● ● ●● ● ●
●● ● ● ● ●● ● ● ●
●●● ●● ●●●
● ● ●●●
●● ● ●● ●●
● ● ●●● ●
●●● ● ●●● ●● ●●●
● ● ●●●
●● ● ●● ●●
● ● ●●● ● ●● ●

● ●● ● ●● ●● ●
●● ●●●● ● ● ● ●
● ●
● ●


● ● ● ● ●● ● ●● ●● ●
●● ●●●● ● ● ● ●
● ●
● ●


● ● ● ●
● ● ●●● ● ● ● ●●●● ● ● ● ●● ●● ● ● ● ● ●●● ● ● ● ●●●● ● ● ● ●●● ●
●●● ●
● ●●●●● ●● ● ● ●● ● ●● ● ● ● ●●
●●
●● ●●● ●
● ●●●●● ●● ● ● ●● ● ●● ● ● ● ●●
●● ●

● ● ● ● ●●
● ● ● ● ● ●● ● ● ● ● ●●
● ● ● ● ● ●●
●● ●● ● ●● ● ● ● ● ●●● ● ●● ●● ●● ●● ● ●● ● ● ● ● ●●● ● ●● ●●
● ● ● ●● ●●●●● ●● ● ●● ● ● ● ●●
●● ● ● ● ● ●● ●●●●● ●● ● ●● ● ● ● ●●
●● ●
●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●
● ● ● ●●● ● ● ● ● ● ● ●●● ● ● ●
● ●● ●● ● ●

● ●
● ● ●● ● ●● ●● ● ●

● ●
● ● ●●
●● ● ● ●● ● ●
●● ● ● ● ●● ● ● ●
● ● ● ● ● ●
● ●●● ●● ● ● ● ● ●●● ●● ● ● ●
● ●
● ●● ● ●● ● ● ● ●● ●● ● ● ● ● ●
● ●● ● ●● ● ●● ● ● ● ●● ●● ● ● ● ● ●
● ●●
●● ●●● ●● ● ●● ●●● ●● ●
● ● ● ● ● ● ● ● ● ●
● ●● ●
● ● ●● ● ●● ● ● ● ● ● ● ●●● ●
● ●● ●
● ● ●● ● ●● ● ● ● ● ● ● ●●● ●
● ● ●● ●● ● ● ●● ● ● ●● ●● ● ● ● ●● ●● ● ● ●● ● ● ●● ●● ●
●●● ● ●● ● ● ●
● ● ●● ●●
●● ● ● ● ●●● ● ●● ● ● ●
● ● ●● ●●
●● ● ● ●
● ● ●● ●●●●●● ●
● ●●● ●

●● ● ●● ● ● ● ●● ● ● ●● ●●●●●● ●
● ●●● ●

●● ● ●● ● ● ● ●●
●●●●● ●●●● ● ●●●
● ●● ●●●●
● ●●● ●●● ●● ●● ●●●●● ●●●● ● ●●●
● ●● ●●●●
● ●●● ●●● ●● ●●
● ●● ●● ●
●● ●● ●
● ● ●
● ●●
● ●●

● ●● ●● ●
●● ●● ●
● ● ●
● ●●
● ●●

●● ●●●●● ●● ●●●● ●● ●● ● ●
● ●●● ● ●
●●
●● ●●●●● ●●● ● ●● ●●●●● ●● ●●●● ●● ●● ● ●
● ●●● ● ●
●●
●● ●●●●● ●●● ●
● ●●● ● ● ●● ●● ● ● ● ● ●● ● ● ● ●●● ● ● ●● ●● ● ● ● ● ●● ● ●
● ●● ● ● ●● ● ● ● ●● ● ● ●● ● ●
● ● ● ● ●● ●● ● ●
● ● ● ● ● ● ● ●● ●● ● ●
● ● ●
● ● ●● ● ● ● ● ●● ● ● ●● ● ● ● ● ●●
● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ●● ●●
● ● ● ● ●● ● ● ● ● ●●
● ● ● ●● ●● ● ● ● ●● ●●
● ● ● ● ● ●● ● ● ● ● ● ● ●● ●
● ● ● ● ● ● ● ●
●● ● ● ● ●
● ● ● ● ● ● ● ● ●● ● ● ● ●
● ● ● ● ● ● ● ●
● ●● ● ● ● ●●● ● ● ●
● ●
● ● ● ●● ● ● ● ●●● ● ● ●
● ●
● ●
● ● ● ● ● ● ● ● ●
● ●● ● ● ● ● ● ● ● ● ● ●
● ●● ●

● ● ●●●●● ● ● ● ● ●
● ● ●●●●● ● ● ● ●
●●● ●● ●● ●●● ●● ●●
●●
● ● ●●
● ●● ● ● ●●
● ●
● ● ● ●● ●

● ● ● ● ●
● ●●
●● ●
●●● ●●● ● ●●● ● ● ●●
● ● ●●
● ●● ● ● ●●
● ●
● ● ● ●● ●

● ● ● ● ●
● ●●
●● ●
●●● ●●● ● ●●● ● ●
● ●● ●● ● ●●● ● ●● ●● ● ●● ●● ● ●●● ● ●● ●●
● ● ●●
● ● ●
● ●●
●●● ● ● ●● ●●●●● ● ●●●●

●●

● ● ● ● ● ● ● ●●
● ● ●
● ●●
●●● ● ● ●● ●●●●● ● ●●●●

●●

● ● ● ● ●
● ●●●●● ●● ●●●● ● ●● ● ●●
● ● ● ●●●●● ●● ●●●● ● ●● ● ●●
● ●
●● ●
●● ●● ●●● ●
● ● ● ● ●●●● ● ●●

●●●●● ● ●● ● ● ●● ●
●● ●● ●●● ●
● ● ● ● ●●●● ● ●●

●●●●● ● ●● ● ●
● ●● ● ●
● ●● ● ● ● ● ● ●● ● ●
● ●● ● ● ● ●
● ●● ● ● ● ●● ● ● ● ●● ● ●● ● ● ● ●● ● ● ● ●●
● ●●
●● ● ● ● ● ●● ● ● ●●
●● ● ● ● ● ●● ●
● ● ●●● ●● ● ● ● ● ●●● ●● ● ●
● ●● ● ●● ● ● ●● ● ●● ●
● ● ● ● ● ●
●●● ●●●

● ● ● ●
● ●●
● ● ●●

●● ● ● ● ●● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ●
●● ●● ●● ● ● ● ●●
●● ●● ●● ● ● ● ●●
● ● ●● ● ● ● ● ●●
● ●●●● ● ● ● ●● ● ● ● ● ●●
● ●●●● ●
● ●● ●● ● ● ● ● ● ● ● ●●● ● ●● ● ● ●● ● ● ● ● ● ● ● ●●● ● ●● ● ●
●● ● ● ●●● ● ● ● ●●● ● ● ● ● ● ●
● ●● ●● ● ● ●●● ● ● ● ●●● ● ● ● ● ● ●
●● ● ●● ● ● ● ●● ● ●● ● ● ●
●●● ●● ●●●
● ● ●●● ● ●● ●●
● ● ●●● ● ●● ●

●●● ●● ●●●
● ● ●●● ● ●● ●●
● ● ●●● ● ●● ●

● ● ● ●● ●● ●
●● ●●●● ● ● ● ● ●
● ● ●●● ●


● ● ● ● ● ● ●● ●● ●
●● ●●●● ● ● ● ● ●
● ● ●●● ●


● ● ●
● ● ●●●●● ●● ●●●●● ●● ● ● ●●●● ● ●● ● ●●● ● ● ● ● ●●●●● ●● ●●●●● ●● ● ● ●●●● ● ●● ● ●●● ● ●
●●● ● ● ● ●●
● ● ● ● ● ● ●● ● ●●● ● ● ● ●●
● ● ● ● ● ● ●● ●
● ● ●● ●● ● ● ● ● ●● ● ●
● ●● ● ● ●● ●
●● ●● ●● ●● ● ● ●● ●● ● ● ● ● ●● ● ●
● ●● ● ● ●● ●
●● ●● ●● ●●
● ● ● ●● ●●●●● ●● ● ●● ● ● ●●
● ●● ● ● ● ● ●● ●●●●● ●● ● ●● ● ● ●●
● ●● ●
●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●
● ● ● ●●● ● ● ● ● ● ● ●●● ● ● ●
● ●● ●● ● ●

● ●
● ● ●● ● ●● ●● ● ●

● ●
● ● ●●
●● ● ● ●● ● ●
●● ● ● ● ●● ● ● ●
● ● ● ● ● ●
● ●●● ●● ● ● ● ● ●●● ●● ● ● ●
● ●
● ●● ● ●● ● ● ● ●● ●● ● ● ● ● ●
● ●● ● ●● ● ●● ● ● ● ●● ●● ● ● ● ● ●
● ●●
●● ●●● ●● ● ●● ●●● ●● ●
● ● ● ● ● ● ● ● ● ●
● ●● ●
● ● ●● ● ●● ● ● ● ● ● ● ●●● ●
● ●● ●
● ● ●● ● ●● ● ● ● ● ● ● ●●● ●
● ● ●● ●● ● ● ●● ● ● ●● ●● ● ● ● ●● ●● ● ● ●● ● ● ●● ●● ●
●●● ● ●● ● ● ●
● ● ●● ●●
●● ● ● ● ●●● ● ●● ● ● ●
● ● ●● ●●
●● ● ● ●
● ● ●● ●●●●●● ●
● ●●● ●

●● ● ●● ● ● ● ●● ● ● ●● ●●●●●● ●
● ●●● ●

●● ● ●● ● ● ● ●●
●●●●● ●●●● ● ●●●
● ●● ●●●●
● ●●● ●●● ●● ●● ●●●●● ●●●● ● ●●●
● ●● ●●●●
● ●●● ●●● ●● ●●
●● ●● ●
●●●●● ● ● ● ●● ●● ●● ●● ●
●●●●● ● ● ● ●● ●●
● ●
●● ●● ●● ●●● ● ●●●
●●● ●●
● ● ●●● ● ● ●
● ●
●● ●● ●● ●●● ● ●●●
●●● ●●
● ● ●●● ● ● ●
●● ● ●
● ●●● ● ● ●
● ● ● ● ●●● ●● ● ●
● ●●● ● ● ●
● ● ● ● ●●●
● ●●● ● ● ●● ●●
● ●●●●●
● ● ●●●● ● ● ●●● ● ● ●● ●●
● ●●●●●
● ● ●●●● ●
● ● ● ● ●● ● ● ●
● ● ● ● ● ● ●● ● ● ●
● ●
● ● ●● ● ● ● ●● ● ● ● ● ●● ● ● ● ●● ● ●
● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ●
● ● ● ● ●● ● ● ● ● ●●
● ● ● ●● ●● ● ● ● ●● ●●
● ● ● ● ● ●● ● ● ● ● ● ● ●● ●
● ● ● ● ● ● ● ●
●● ● ● ● ●
● ● ● ● ● ● ● ● ●● ● ● ● ●
● ● ● ● ● ● ● ●
● ●● ● ● ● ●●●● ● ● ● ●
● ● ● ●● ● ● ● ●●●● ● ● ● ●
● ●
● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ●
●● ●● ●● ● ●● ●● ●● ●
● ● ●●●
●● ● ● ●●● ●
●● ● ● ● ● ● ●● ●

● ●●●●
● ● ●●●
●● ● ● ●●● ●
●● ● ● ● ● ● ●● ●

● ●●●●
● ●●
● ● ● ● ●
● ● ● ● ● ● ● ●● ●●● ● ● ● ●●
● ● ● ● ●
● ● ● ● ● ● ● ●● ●●● ● ●
● ●● ●● ● ●
●●
● ● ●● ●● ●●● ● ● ●● ● ● ●● ● ●● ●● ● ●
●●
● ● ●● ●● ●●● ● ● ●● ● ● ●●
● ● ●●
● ● ●●
●●● ● ●●●●
● ●●
●● ●● ● ● ● ● ●●
● ● ●●
●●● ● ●●●●
● ●●
●● ●● ● ●
● ●●●●● ●● ● ●● ● ●●●●●● ● ●●●●● ●● ● ●● ● ●●●●●●
●●●● ●● ●● ● ●●●● ●● ●● ●
●● ●
●● ●● ●●● ●
● ● ● ● ●●●● ● ●●

●●●●
● ● ●● ● ● ●● ●
●● ●● ●●● ●
● ● ● ● ●●●● ● ●●

●●●●
● ● ●● ● ●
● ●● ● ●
● ●● ● ● ● ● ● ●● ● ●
● ●● ● ● ● ●
● ●● ● ● ● ●● ● ● ● ●● ● ●● ● ● ● ●● ● ● ● ●●
● ●●
●● ● ● ● ● ●● ● ● ●●
●● ● ● ● ● ●● ●
● ● ●●● ● ● ● ● ● ● ●●● ● ● ● ●
● ●● ● ●● ● ● ●● ● ●● ●
● ● ● ● ● ●
●●● ●●●

Figure 7: An example with 9 clusters. Top left: data. Top right: k-means with random
starting values. Bottom left: k-means using starting values from hierarchical clustering.
Bottom right: the k-means++ algorithm.

Example 6 Figure 7 shows data from a distribution with nine clusters. The raw data are in
the top left plot. The top right plot shows the results of running the k-means algorithm with
k = 9 using random points as starting values. The clustering is quite poor. This is because
we have not found the global minimum of the empirical risk function. The two bottom plots
show better methods for selecting starting values that we will describe below.

Hierarchical Starting Values. Tseng and Wong (2005) suggest the following method for
choosing staring values for k-means. Run single-linkage hierarchical clustering (which we
describe in Section 6) to obtains p × k clusters. They suggest using p = 3 as a default. Now
take the centers of the k-largest of the p × k clusters and use these as starting values. See
the bottom left plot in Figure 7.

k-means++ . Arthur and Vassilvitskii (2007) invented an algorithm called k-means++ to get
good starting values. They show that if the starting points are chosen in a certain way, then
we can get close to the minimum with high probability. In fact the starting points themselves
— which we call seed points — are already close to minimizing Rn (C). The algorithm is
described in Figure 8. See the bottom right plot in Figure 7 for an example.

Theorem 7 (Arthur and Vassilvitskii, 2007). Let C = {c1 , . . . , ck } be the seed points from

7
1. Input: Data X = {X1 , . . . , Xn } and an integer k.

2. Choose c1 randomly from X = {X1 , . . . , Xn }. Let C = {c1 }.

3. For j = 2, . . . , k:

(a) Compute D(Xi ) = minc∈C ||Xi − c|| for each Xi .


(b) Choose a point Xi from X with probability

      p_i = \frac{D^2(X_i)}{\sum_{j=1}^n D^2(X_j)}.

(c) Call this randomly chosen point cj . Update C ←− C ∪ {cj }.

4. Run Lloyd’s algorithm using the seed points C = {c1 , . . . , ck } as starting points and output
the result.

Figure 8: The k-means++ algorithm.

the k-means++ algorithm. Then,


  
E\big[ R_n(C) \big] \le 8(\log k + 2) \min_{C} R_n(C)    (6)

where the expectation is over the randomness of the algorithm.

See Arthur and Vassilvitskii (2007) for a proof. They also show that the Euclidean distance can be replaced with the \ell_p norm in the algorithm. The result is the same except that the constant 8 gets replaced by 2^{p+2}. It is possible to improve the k-means++ algorithm. Ailon, Jaiswal and Monteleoni (2009) showed that, by choosing 3 log k points instead of one point at each step of the algorithm, the log k term in (6) can be replaced by a constant. They call the algorithm k-means#.
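
To make the seeding rule in Figure 8 concrete, here is a minimal numpy sketch of D^2 sampling (a sketch only: the function name kmeans_pp_seeds is ours, and in practice the seeds would be passed to an existing implementation of Lloyd's algorithm, for example scikit-learn's KMeans accepts an initial array via its init argument).

    import numpy as np

    def kmeans_pp_seeds(X, k, rng=None):
        """Return k seed points chosen by the k-means++ rule (D^2 sampling)."""
        rng = np.random.default_rng(rng)
        n = X.shape[0]
        # First center chosen uniformly at random from the data.
        centers = [X[rng.integers(n)]]
        for _ in range(1, k):
            # Squared distance from each point to its nearest current center.
            d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
            # Sample the next center with probability proportional to D^2.
            centers.append(X[rng.choice(n, p=d2 / d2.sum())])
        return np.array(centers)

    # Example: seeds for k = 9 clusters, to be handed to Lloyd's algorithm.
    X = np.random.default_rng(0).normal(size=(500, 2))
    seeds = kmeans_pp_seeds(X, k=9, rng=1)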

2.2 Choosing k

In k-means clustering we must choose a value for k. This is still an active area of research
and there are no definitive answers. The problem is much different than choosing a tuning
parameter in regression or classification because there is no observable label to predict.
Indeed, for k-means clustering, both the true risk R and estimated risk Rn decrease to 0

as k increases. This is in contrast to classification where the true risk gets large for high
complexity classifiers even though the empirical risk decreases. Hence, minimizing risk does
not make sense. There are so many proposals for choosing tuning parameters in clustering
that we cannot possibly consider all of them here. Instead, we highlight a few methods.

Elbow Methods. One approach is to look for sharp drops in the estimated risk. Let R_k denote the minimal risk among all clusterings into k clusters and let \hat{R}_k be the corresponding empirical risk. It is easy to see that R_k is a nonincreasing function of k, so minimizing R_k does not make sense. Instead, we can look for the first k at which the improvement R_k - R_{k+1} is small, sometimes called an elbow. This can be done informally by looking at a plot of \hat{R}_k. We can try to make this more formal by fixing a small number \alpha > 0 and defining

k_\alpha = \min\left\{ k : \frac{R_k - R_{k+1}}{\sigma^2} \le \alpha \right\}    (7)

where \sigma^2 = E(||X - \mu||^2) and \mu = E(X). An estimate of k_\alpha is

\hat{k}_\alpha = \min\left\{ k : \frac{\hat{R}_k - \hat{R}_{k+1}}{\hat{\sigma}^2} \le \alpha \right\}    (8)

where \hat{\sigma}^2 = n^{-1}\sum_{i=1}^n ||X_i - \bar{X}||^2.
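
As a quick illustration, here is a sketch that computes \hat{k}_\alpha using scikit-learn's KMeans, whose inertia_ attribute is the within-cluster sum of squares so that \hat{R}_k = inertia_/n; the function name and the default \alpha are our choices.

    import numpy as np
    from sklearn.cluster import KMeans

    def elbow_k(X, alpha=0.1, kmax=20, seed=0):
        """Return the smallest k with (Rhat_k - Rhat_{k+1}) / sigma_hat^2 <= alpha."""
        n = X.shape[0]
        sigma2 = np.mean(np.sum((X - X.mean(axis=0)) ** 2, axis=1))   # hat sigma^2
        # Rhat_k = within-cluster sum of squares divided by n, for k = 1, ..., kmax + 1.
        R = [KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_ / n
             for k in range(1, kmax + 2)]
        for k in range(1, kmax + 1):
            if (R[k - 1] - R[k]) / sigma2 <= alpha:
                return k
        return kmax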

Unfortunately, the elbow method often does not work well in practice because there may not
be a well-defined elbow.

Hypothesis Testing. A more formal way to choose k is by way of hypothesis testing. For
each k we test

Hk : the number of clusters is k versus Hk+1 : the number of clusters is > k.

We begin with k = 1. If the test rejects, then we repeat the test for k = 2. We continue until we reach the first k that is not rejected. In summary, \hat{k} is the first k that is not rejected.

Currently, my favorite approach is the one in Liu, Hayes, Nobel and Marron (2012, JASA, 1281-1293). They simply test whether the data are multivariate Normal. If this test rejects, they split into two clusters and repeat. They have an R package, sigclust, for this. A similar procedure, called PG-means, is described in Feng and Hamerly (2007).
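
The sketch below conveys only the recursive test-and-split idea; it is not the sigclust test. As a crude stand-in for a multivariate Normality test it applies scipy's D'Agostino test to each cluster's projection onto its first principal component, and all names and thresholds here are our choices.

    import numpy as np
    from scipy import stats
    from sklearn.cluster import KMeans

    def split_and_test(X, alpha=0.05, min_size=30):
        """Recursively split clusters whose 1-d projection fails a Normality test.
        Crude stand-in for sigclust-style testing, for illustration only."""
        clusters, queue = [], [X]
        while queue:
            C = queue.pop()
            if len(C) < 2 * min_size:
                clusters.append(C)
                continue
            # Project onto the cluster's first principal component.
            centered = C - C.mean(axis=0)
            proj = centered @ np.linalg.svd(centered, full_matrices=False)[2][0]
            _, pval = stats.normaltest(proj)
            if pval > alpha:
                clusters.append(C)          # looks Gaussian enough: keep as one cluster
            else:
                labels = KMeans(n_clusters=2, n_init=10).fit_predict(C)
                queue.extend([C[labels == 0], C[labels == 1]])
        return clusters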

Example 8 Figure 9 shows a two-dimensional example. The top left plot shows a single
cluster. The p-values are shown as a function of k in the top right plot. The first k for which
the p-value is larger than α = .05 is k = 1. The bottom left plot shows a dataset with three
clusters. The p-values are shown as a function of k in the bottom right plot. The first k for
which the p-value is larger than α = .05 is k = 3.


Figure 9: Top left: a single cluster. Top right: p-values for various k. The first k for which
the p-value is larger than .05 is k = 1. Bottom left: three clusters. Bottom right: p-values
for various k. The first k for which the p-value is larger than .05 is k = 3.

Stability. Another class of methods is based on the idea of stability. The idea is to find the largest number of clusters that can be estimated with low variability.

We start with a high level description of the idea and then we will discuss the details. Suppose
that Y = (Y1 , . . . , Yn ) and Z = (Z1 , . . . , Zn ) are two independent samples from P . Let Ak
be any clustering algorithm that takes the data as input and outputs k clusters. Define the
stability
Ω(k) = E [s(Ak (Y ), Ak (Z))] (9)
where s(·, ·) is some measure of the similarity of two clusterings. To estimate Ω we use
random subsampling. Suppose that the original data are X = (X_1, . . . , X_{2n}). Randomly split the data into two equal sets Y and Z of size n. This process is repeated N times. Denote the random split obtained in the j-th trial by Y^j, Z^j. Define

\hat{\Omega}(k) = \frac{1}{N}\sum_{j=1}^N s(A_k(Y^j), A_k(Z^j)).

For large N, \hat{\Omega}(k) will approximate \Omega(k). There are two ways to choose k. We can choose a small k with high stability. Alternatively, we can choose k to maximize \hat{\Omega}(k) if we somehow standardize \hat{\Omega}(k).

Now we discuss the details. First, we need to define the similarity between two clusterings.
We face two problems. The first is that the cluster labels are arbitrary: the clustering
(1, 1, 1, 2, 2, 2) is the same as the clustering (4, 4, 4, 8, 8, 8). Second, the clusterings Ak (Y )
and Ak (Z) refer to different data sets.

The first problem is easily solved. We can insist the labels take values in {1, . . . , k} and then
we can maximize the similarity over all permutations of the labels. Another way to solve
the problem is the following. Any clustering method can be regarded as a function ψ that
takes two points x and y and outputs a 0 or a 1. The interpretation is that ψ(x, y) = 1 if x
and y are in the same cluster while ψ(x, y) = 0 if x and y are in a different cluster. Using
this representation of the clustering renders the particular choice of labels moot. This is the
approach we will take.

Let ψY and ψZ be clusterings derived from Y and Z. Let us think of Y as training data and
Z as test data. Now ψY returns a clustering for Y and ψZ returns a clustering for Z. We’d
like to somehow apply ψY to Z. Then we would have two clusterings for Z which we could
then compare. There is no unique way to do this. A simple and fairly general approach is
to define
\psi_{Y,Z}(Z_j, Z_k) = \psi_Y(Y_j', Y_k')    (10)
where Y_j' is the closest point in Y to Z_j and Y_k' is the closest point in Y to Z_k. (More
generally, we can use Y and the cluster assignment to Y as input to a classifier; see Lange
et al 2004). The notation ψY,Z indicates that ψ is trained on Y but returns a clustering for

Z. Define

s(\psi_{Y,Z}, \psi_Z) = \binom{n}{2}^{-1} \sum_{s \neq t} I\big(\psi_{Y,Z}(Z_s, Z_t) = \psi_Z(Z_s, Z_t)\big).

Thus s is the fraction of pairs of points in Z on which the two clusterings \psi_{Y,Z} and \psi_Z agree. Finally, we define

\hat{\Omega}(k) = \frac{1}{N}\sum_{j=1}^N s(\psi_{Y^j, Z^j}, \psi_{Z^j}).
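
Here is a minimal numpy/scikit-learn sketch of this stability estimate, using k-means as the clustering algorithm A_k and nearest neighbors in Y to implement \psi_{Y,Z} as in (10); the function names and the number of splits N are our choices.

    import numpy as np
    from sklearn.cluster import KMeans

    def pair_agreement(labels1, labels2):
        """Fraction of pairs (s, t), s != t, on which two clusterings of Z agree."""
        same1 = labels1[:, None] == labels1[None, :]
        same2 = labels2[:, None] == labels2[None, :]
        mask = ~np.eye(len(labels1), dtype=bool)
        return np.mean(same1[mask] == same2[mask])

    def stability(X, k, N=20, seed=0):
        """Estimate Omega(k) by random splitting, as in the text."""
        rng = np.random.default_rng(seed)
        n2 = len(X) // 2
        omegas = []
        for _ in range(N):
            idx = rng.permutation(len(X))
            Y, Z = X[idx[:n2]], X[idx[n2:2 * n2]]
            kmY = KMeans(n_clusters=k, n_init=10).fit(Y)
            kmZ = KMeans(n_clusters=k, n_init=10).fit(Z)
            # psi_{Y,Z}: each Z_j gets the cluster of its closest point in Y.
            nearest = np.argmin(((Z[:, None, :] - Y[None, :, :]) ** 2).sum(-1), axis=1)
            labels_YZ = kmY.labels_[nearest]
            omegas.append(pair_agreement(labels_YZ, kmZ.predict(Z)))
        return np.mean(omegas)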

Now we need to decide how to use \hat{\Omega}(k) to choose k. The interpretation of \hat{\Omega}(k) requires some care. First, note that 0 \le \hat{\Omega}(k) \le 1 and \hat{\Omega}(1) = \hat{\Omega}(n) = 1. So simply maximizing \hat{\Omega}(k) does not make sense. One possibility is to look for a small k > 1 with high stability. Alternatively, we could try to normalize \hat{\Omega}(k). Lange et al (2004) suggest dividing by the value of \hat{\Omega}(k) obtained when cluster labels are assigned randomly. The theoretical justification for this choice is not clear. Tibshirani, Walther, Botstein and Brown (2001) suggest that we should compute the stability separately over each cluster and then take the minimum. However, this can sometimes lead to very low stability for all k > 1.

Many authors have considered schemes of this form, including Breckenridge (1989), Lange,
Roth, Braun and Buhmann (2004), Ben-Hur, Elisseeff and Guyon (2002), Dudoit and
Fridlyand (2002), Levine and Domany (2001), Buhmann (2010), Tibshirani, Walther, Bot-
stein and Brown (2001) and Rinaldo and Wasserman (2009).

It is important to interpret stability correctly. These methods choose the largest number
of stable clusters. That does not mean they choose “the true k.” Indeed, Ben-David, von
Luxburg and Pál (2006), Ben-David and von Luxburg (2008) and Rakhlin (2007)
have shown that trying to use stability to choose “the true k” — even if that is well-defined
— will not work. To explain this point further, we consider some examples from Ben-David,
von Luxburg and Pál (2006). Figure 10 shows the four examples. The first example (top
left plot) shows a case where we fit k = 2 clusters. Here, stability analysis will correctly
show that k is too small. The top right plot has k = 3. Stability analysis will correctly show
that k is too large. The bottom two plots show potential failures of stability analysis. Both
cases are stable but k = 2 is too small in the bottom left plot and k = 3 is too big in the
bottom right plot. Stability is subtle. There is much potential for this approach but more
work needs to be done.

2.3 Theoretical Properties

A theoretical property of the k-means method is given in the following result. Recall that
C ∗ = {c∗1 , . . . , c∗k } minimizes R(C) = E||X − ΠC [X] ||2 .


Figure 10: Examples from Ben-David, von Luxburg and Pál (2006). The first example (top
left plot) shows a case where we fit k = 2 clusters. Stability analysis will correctly show that
k is too small. The top right plot has k = 3. Stability analysis will correctly show that k
is too large. The bottom two plots show potential failures of stability analysis. Both cases
are stable but k = 2 is too small in the bottom left plot and k = 3 is too big in the bottom
right plot.

Theorem 9 Suppose that P(||X_i||^2 \le B) = 1 for some B < \infty. Then

E(R(\hat{C})) - R(C^*) \le c \sqrt{\frac{k(d+1)\log n}{n}}    (11)

for some c > 0.

Warning! The fact that R(\hat{C}) is close to R(C^*) does not imply that \hat{C} is close to C^*.

This proof is due to Linder, Lugosi and Zeger (1994). The proof uses techniques from a later
lecture on VC theory so you may want to return to the proof later.

Proof. Note that

R(\hat{C}) - R(C^*) = R(\hat{C}) - R_n(\hat{C}) + R_n(\hat{C}) - R(C^*) \le R(\hat{C}) - R_n(\hat{C}) + R_n(C^*) - R(C^*) \le 2 \sup_{C \in \mathcal{C}_k} |R(C) - R_n(C)|.

For each C define a function f_C by f_C(x) = ||x - \Pi_C[x]||^2. Note that \sup_x |f_C(x)| \le 4B for all C. Now, using the fact that E(Y) = \int_0^\infty P(Y \ge t)\, dt whenever Y \ge 0, we have

2 \sup_{C \in \mathcal{C}_k} |R(C) - R_n(C)| = 2 \sup_C \left| \frac{1}{n}\sum_{i=1}^n f_C(X_i) - E(f_C(X)) \right|
 = 2 \sup_C \left| \int_0^\infty \left( \frac{1}{n}\sum_{i=1}^n I(f_C(X_i) > u) - P(f_C(Z) > u) \right) du \right|
 \le 8B \sup_{C,u} \left| \frac{1}{n}\sum_{i=1}^n I(f_C(X_i) > u) - P(f_C(Z) > u) \right|
 = 8B \sup_A \left| \frac{1}{n}\sum_{i=1}^n I(X_i \in A) - P(A) \right|

where A varies over all sets of the form \{f_C(x) > u\}. The shattering number of this class \mathcal{A} is s(\mathcal{A}, n) \le n^{k(d+1)}. This follows since each set \{f_C(x) > u\} is a union of the complements of k spheres. By the VC Theorem,

P(R(\hat{C}) - R(C^*) > \epsilon) \le P\left( 8B \sup_A \left| \frac{1}{n}\sum_{i=1}^n I(X_i \in A) - P(A) \right| > \epsilon \right)
 = P\left( \sup_A \left| \frac{1}{n}\sum_{i=1}^n I(X_i \in A) - P(A) \right| > \frac{\epsilon}{8B} \right)
 \le 4 (2n)^{k(d+1)} e^{-n\epsilon^2/(512 B^2)}.

Now conclude that E(R(\hat{C}) - R(C^*)) \le C \sqrt{k(d+1)\frac{\log n}{n}}.

A sharper result, together with a lower bound, is the following.

Theorem 10 (Bartlett, Linder and Lugosi 1997) Suppose that P(||X||^2 \le 1) = 1 and that n \ge k^{4/d}, dk^{1-2/d}\log n \ge 15, kd \ge 8, n \ge 8d and n/\log n \ge dk^{1+2/d}. Then,

E(R(\hat{C})) - R(C^*) \le 32\sqrt{\frac{dk^{1-2/d}\log n}{n}} = O\left(\sqrt{\frac{dk\log n}{n}}\right).

Also, if k \ge 3 and n \ge 16k/(2\Phi^2(-2)), then for any method \hat{C} that selects k centers there exists a distribution P such that

E(R(\hat{C})) - R(C^*) \ge c_0 \sqrt{\frac{k^{1-4/d}}{n}}

where c_0 = \Phi^4(-2)\, 2^{-12}/\sqrt{6} and \Phi is the standard Gaussian distribution function.

See Bartlett, Linder and Lugosi (1997) for a proof. It follows that k-means is risk consistent in the sense that R(\hat{C}) - R(C^*) \to 0 in probability, as long as k = o(n/(d^3 \log n)). Moreover, the lower bound implies that we cannot find any other method that improves much over the k-means approach, at least with respect to this loss function.

The k-means algorithm can be generalized in many ways. For example, if we replace the L2
norm with the L1 norm we get k-medians clustering. We will not discuss these extensions
here.

2.4 Overfitting and Merging

The best way to use k-means clustering is to “overfit then merge.” Don’t think of the k in
k-means as the number of clusters. Think of it as a tuning parameter. k-means clustering
works much better if we:

1. Choose k large.
2. Merge close clusters.

This eliminates the sensitivity to the choice of k and it allows k-means to fit clusters with
arbitrary shapes. Currently, there is no definitive theory for this approach but in my view,
it is the right way to do k-means clustering.
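
A minimal sketch of the overfit-then-merge idea, assuming we run k-means with a deliberately large k and then merge centers that single linkage joins within a chosen radius; the parameters big_k and merge_radius are ours and would need tuning in practice.

    import numpy as np
    from sklearn.cluster import KMeans
    from scipy.cluster.hierarchy import linkage, fcluster

    def overfit_then_merge(X, big_k=50, merge_radius=1.0, seed=0):
        """Cluster by overfitting k-means and then merging nearby centers."""
        km = KMeans(n_clusters=big_k, n_init=10, random_state=seed).fit(X)
        # Single-linkage clustering of the centers; cut the tree at merge_radius.
        Zlink = linkage(km.cluster_centers_, method="single")
        center_labels = fcluster(Zlink, t=merge_radius, criterion="distance")
        # Each point inherits the merged label of its k-means center.
        return center_labels[km.labels_]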

3 Mixture Models

Simple cluster structure can be discovered using mixture models. We start with a simple
example. We flip a coin with success probability π. If heads, we draw X from a density
p1 (x). If tails, we draw X from a density p0 (x). Then the density of X is

p(x) = πp1 (x) + (1 − π)p0 (x),

which is called a mixture of the two densities p_1 and p_0. Figure 11 shows the density of a mixture of two Gaussians.

Let Z ∼ Bernoulli(\pi) be the unobserved coin flip. Then we can also write p(x) as

p(x) = \sum_{z=0,1} p(x, z) = \sum_{z=0,1} p(x|z)\, p(z)    (12)

where p(x | Z = 0) := p_0(x), p(x | Z = 1) := p_1(x) and p(z) = \pi^z (1 - \pi)^{1-z}. Equation (12) is called the hidden variable representation. A more formal definition of finite mixture models is as follows.

[Finite Mixture Models] Let \{p_\theta(x) : \theta \in \Theta\} be a parametric class of densities. Define the mixture model

p_\psi(x) = \sum_{j=0}^{K-1} \pi_j\, p_{\theta_j}(x),

where the mixing coefficients \pi_j \ge 0, \sum_{j=0}^{K-1}\pi_j = 1 and \psi = (\pi_0, . . . , \pi_{K-1}, \theta_0, . . . , \theta_{K-1}) are the unknown parameters. We call p_{\theta_0}, . . . , p_{\theta_{K-1}} the component densities.

Generally, even if {pθ (x) : θ ∈ Θ} is an exponential family model, the mixture may no
longer be an exponential family.
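
Returning to the hidden variable representation (12): it also tells us how to simulate from a mixture, namely draw Z first and then draw X from the corresponding component. A tiny numpy sketch, using the component parameters of Figure 11:

    import numpy as np

    rng = np.random.default_rng(0)
    n, pi = 1000, 0.6                        # mixing weight P(Z = 1) = pi
    Z = rng.binomial(1, pi, size=n)          # hidden labels
    # X | Z = 1 ~ p1 = N(2.95, 1),  X | Z = 0 ~ p0 = N(-1.25, 1)
    X = np.where(Z == 1, rng.normal(2.95, 1.0, n), rng.normal(-1.25, 1.0, n))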

3.1 Mixture of Gaussians

Let φ(x; µj , σj2 ) be the probability density function of a univariate Gaussian distribution with
mean µj and variance σj2 . A typical finite mixture model is the mixture of Gaussians. In
one dimension, we have

p_\psi(x) = \sum_{j=0}^{K-1} \pi_j\, \phi(x; \mu_j, \sigma_j^2),

which has 3K - 1 unknown parameters, due to the restriction \sum_{j=0}^{K-1}\pi_j = 1.

A mixture of d-dimensional multivariate Gaussians is

p(x) = \sum_{j=0}^{K-1} \frac{\pi_j}{(2\pi)^{d/2}|\Sigma_j|^{1/2}} \exp\left( -\frac{1}{2}(x - u_j)^T \Sigma_j^{-1}(x - u_j) \right).

There are in total

K\left( \frac{d(d+1)}{2} + d \right) + (K - 1) = \frac{Kd(d+3)}{2} + K - 1

parameters in the mixture of K multivariate Gaussians: d(d+1)/2 parameters in each \Sigma_j, d parameters in each u_j, and K - 1 mixing coefficients.

3.2 Maximum Likelihood Estimation

A finite mixture model pψ (x) has parameters ψ = (π0 , . . . , πK−1 , θ0 , . . . , θK−1 ). The likelihood
of ψ based on the observations X1 , . . . , Xn is
L(\psi) = \prod_{i=1}^n p_\psi(X_i) = \prod_{i=1}^n \left( \sum_{j=0}^{K-1} \pi_j\, p_{\theta_j}(X_i) \right)


Figure 11: A mixture of two Gaussians, p(x) = \frac{2}{5}\phi(x; -1.25, 1) + \frac{3}{5}\phi(x; 2.95, 1).

and, as usual, the maximum likelihood estimator is the value \hat{\psi} that maximizes L(\psi). Usually, the likelihood is multimodal and one seeks a local maximum instead of a global maximum.

For fixed θ0 , . . . , θK−1 , the log-likelihood is often a concave function of the mixing parameters
πj . However, for fixed π0 , . . . , πK−1 , the log-likelihood is not generally concave with respect
to θ0 , . . . , θK−1 .

One way to find \hat{\psi} is to apply your favorite optimizer directly to the log-likelihood

\ell(\psi) = \sum_{i=1}^n \log\left( \sum_{j=0}^{K-1} \pi_j\, p_{\theta_j}(X_i) \right).

However, \ell(\psi) is not jointly convex with respect to \psi. It is not clear which algorithm is the best to optimize such a nonconvex objective function.

A convenient and commonly used algorithm for finding the maximum likelihood estimates of
a mixture model (or the more general latent variable models) is the expectation-maximization
(EM) algorithm. The algorithm runs in an iterative fashion and alternates between the
“E-step” which computes conditional expectations with respect to the current parameter
estimate, and the “M-step” which adjusts the parameter to maximize a lower bound on
the likelihood. While the algorithm can be slow to converge, its simplicity and the fact
that it doesn’t require a choice of step size make it a convenient choice for many estimation
problems.
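
To make the E and M steps concrete, here is a minimal numpy/scipy sketch of EM for a univariate mixture of two Gaussians (illustration only: the function name, the initialization and the fixed iteration count are our choices, with no convergence diagnostics).

    import numpy as np
    from scipy.stats import norm

    def em_two_gaussians(X, n_iter=100):
        """EM for p(x) = pi * N(mu1, s1^2) + (1 - pi) * N(mu0, s0^2)."""
        pi, mu0, mu1, s0, s1 = 0.5, X.min(), X.max(), X.std(), X.std()
        for _ in range(n_iter):
            # E-step: responsibilities P(Z = 1 | X_i) under the current parameters.
            w1 = pi * norm.pdf(X, mu1, s1)
            w0 = (1 - pi) * norm.pdf(X, mu0, s0)
            r = w1 / (w1 + w0)
            # M-step: weighted maximum likelihood updates.
            pi = r.mean()
            mu1 = np.sum(r * X) / r.sum()
            mu0 = np.sum((1 - r) * X) / (1 - r).sum()
            s1 = np.sqrt(np.sum(r * (X - mu1) ** 2) / r.sum())
            s0 = np.sqrt(np.sum((1 - r) * (X - mu0) ** 2) / (1 - r).sum())
        return pi, mu0, mu1, s0, s1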

On the other hand, while simple and flexible, the EM algorithm is only one of many numerical procedures for obtaining a (local) maximum likelihood estimate in latent variable models. In some cases procedures such as Newton's method or conjugate gradient may be more effective, and should be considered as alternatives to EM. In general the EM algorithm converges linearly, and may be extremely slow when the amount of missing information is large.

In principle, there are polynomial time algorithms for finding good estimates of ψ based on
spectral methods and the method of moments. It appears that, at least so far, these methods

are not yet practical enough to be used in routine data analysis.

Example. The data are measurements on duration and waiting time of eruptions of the Old Faithful geyser from August 1 to August 15, 1985. There are two variables with 299 observations. The first variable, "Duration", represents the eruption time in minutes. The second variable, "Waiting", represents the waiting time to the next eruption. These data are believed to have two modes. We fit a mixture of two Gaussians using the EM algorithm. To illustrate the EM steps, we purposely choose a bad starting point. The EM algorithm quickly converges in six steps. Figure 12 shows the fitted densities for all six steps. We see that even though the starting density is unimodal, it quickly becomes bimodal.

Figure 12: Fitting a mixture of two Gaussians on the Old Faithful Geyser data. The initial values are \pi_0 = \pi_1 = 0.5, u_0 = (4, 70)^T, u_1 = (3, 60)^T and \Sigma_1 = \Sigma_2 = \begin{pmatrix} 0.8 & 7 \\ 7 & 70 \end{pmatrix}. We see that even though the starting density is not bimodal, the EM algorithm converges quickly to a bimodal density.

3.3 The Twilight Zone

Mixture models are conceptually simple but they have some strange properties.

Computation. Finding the mle is NP-hard.


Infinite Likelihood. Let p_\psi(x) = \sum_{j=1}^k \pi_j \phi(x; \mu_j, \sigma_j^2) be a mixture of Gaussians. Let L(\psi) = \prod_{i=1}^n p_\psi(X_i) be the likelihood function based on a sample of size n. Then \sup_\psi L(\psi) = \infty. To see this, set \mu_j = X_1 for some j. Then \phi(X_1; \mu_j, \sigma_j^2) = (\sqrt{2\pi}\,\sigma_j)^{-1}. Now let \sigma_j \to 0.
We have φ(X1 ; µj , σj2 ) → ∞. Therefore, the log-likelihood is unbounded. This behavior
is very different from a typical parametric model. Fortunately, if we define the maximum
likelihood estimate to be a mode of L(ψ) in the interior of the parameter space, we get a
well-defined estimator.

Multimodality of the Density. Consider the mixture of two Gaussians

p(x) = (1 - \pi)\phi(x; \mu_1, \sigma^2) + \pi\phi(x; \mu_2, \sigma^2).

You would expect p(x) to be multimodal but this is not necessarily true. The density p(x)
is unimodal when |µ1 − µ2 | ≤ 2σ and bimodal when |µ1 − µ2 | > 2σ. One might expect that
the maximum number of modes of a mixture of k Gaussians would be k. However, there are
examples where a mixture of k Gaussians has more than k modes. In fact, Edelsbrunner,
Fasy and Rote (2012) show that the relationship between the number of modes of p and the
number of components in the mixture is very complex.

Nonidentifiability. A model \{p_\theta(x) : \theta \in \Theta\} is identifiable if

\theta_1 \neq \theta_2 implies P_{\theta_1} \neq P_{\theta_2}

where Pθ is the distribution corresponding to the density pθ . Mixture models are noniden-
tifiable in two different ways. First, there is nonidentifiability due to permutation of labels.
For example, consider a mixture of two univariate Gaussians,

pψ1 (x) = 0.3φ(x; 0, 1) + 0.7φ(x; 2, 1)

and
pψ2 (x) = 0.7φ(x; 2, 1) + 0.3φ(x; 0, 1),
then p_{\psi_1}(x) = p_{\psi_2}(x) even though \psi_1 = (0.3, 0.7, 0, 2, 1)^T \neq (0.7, 0.3, 2, 0, 1)^T = \psi_2. This is
not a serious problem although it does contribute to the multimodality of the likelihood.

A more serious problem is local nonidentifiability. Suppose that

p(x; π, µ1 , µ2 ) = (1 − π)φ(x; µ1 , 1) + πφ(x; µ2 , 1). (13)

When µ1 = µ2 = µ, we see that p(x; π, µ1 , µ2 ) = φ(x; µ). The parameter π has disappeared.
Similarly, when \pi = 0, the parameter \mu_2 disappears. This means that there are subspaces of

the parameter space where the family is not identifiable. This local nonidentifiability causes
many of the usual theoretical properties— such as asymptotic Normality of the maximum
likelihood estimator and the limiting χ2 behavior of the likelihood ratio test— to break
down. For the model (13), there is no simple theory to describe the distribution of the
likelihood ratio test for H0 : µ1 = µ2 versus H1 : µ1 6= µ2 . The best available theory is
very complicated. However, some progress has been made lately using ideas from algebraic
geometry (Yamazaki and Watanabe 2003, Watanabe 2010).

The lack of local identifiability causes other problems too. For example, we usually have that the Fisher information is non-zero and that \hat{\theta} - \theta = O_P(n^{-1/2}) where \hat{\theta} is the maximum likelihood estimator. Mixture models are, in general, irregular: they do not satisfy the usual
regularity conditions that make parametric models so easy to deal with. Here is an example
from Chen (1995).

Consider a univariate mixture of two Gaussians:

p_\theta(x) = \frac{2}{3}\phi(x; -\theta, 1) + \frac{1}{3}\phi(x; 2\theta, 1).
Then it is easy to check that I(0) = 0 where I(θ) is the Fisher information. Moreover, no
estimator of θ can converge faster than n−1/4 if the number of components is not known
in advance. Compare this to a Normal family φ(x; θ, 1) where the Fisher information is
I(θ) = n and the maximum likelihood estimator converges at rate n−1/2 . Moreover, the
distribution of the mle is not even well understood for mixture models. The same applies to
the likelihood ratio test.

Nonintuitive Group Membership. Our motivation for studying mixture models in this chapter was clustering. But one should be aware that mixtures can exhibit unexpected behavior with respect to clustering. Let

p(x) = (1 − π)φ(x; µ1 , σ12 ) + πφ(x; µ2 , σ22 ).

Suppose that µ1 < µ2 . We can classify an observation as being from cluster 1 or cluster 2
by computing the probability of being from the first or second component, denoted Z = 0
and Z = 1. We get

P(Z = 0 | X = x) = \frac{(1 - \pi)\phi(x; \mu_1, \sigma_1^2)}{(1 - \pi)\phi(x; \mu_1, \sigma_1^2) + \pi\phi(x; \mu_2, \sigma_2^2)}.

Define Z(x) = 0 if P(Z = 0|X = x) > 1/2 and Z(x) = 1 otherwise. When σ1 is much
larger than σ2 , Figure 13 shows Z(x). We end up classifying all the observations with large
Xi to the leftmost component. Technically this is correct, yet it seems to be an unintended
consequence of the model and does not capture what we mean by a cluster.

Improper Posteriors. Bayesian inference is based on the posterior distribution p(ψ|X1 , . . . , Xn ) ∝


L(ψ)π(ψ). Here, π(ψ) is the prior distribution that represents our knowledge of ψ before

Figure 13: Mixtures are used as a parametric method for finding clusters. Observations with
x = 0 and x = 6 are both classified into the first component.
seeing the data. Often, the prior is improper, meaning that it does not have a finite integral. For example, suppose that X_1, . . . , X_n \sim N(\mu, 1). It is common to use an improper prior \pi(\mu) = 1. This is improper because

\int \pi(\mu)\, d\mu = \infty.

Nevertheless, the posterior p(\mu | D_n) \propto L(\mu)\pi(\mu) is a proper distribution, where L(\mu) is the data likelihood of \mu. In fact, the posterior for \mu is N(\bar{X}, 1/\sqrt{n}) where \bar{X} is the sample mean. The posterior inferences in this case coincide exactly with the frequentist inferences. In many parametric models, the posterior inferences are well defined even if the prior is improper and usually they approximate the frequentist inferences. Not so with mixtures. Let

p(x; \mu) = \frac{1}{2}\phi(x; 0, 1) + \frac{1}{2}\phi(x; \mu, 1).    (14)
If π(µ) is improper then so is the posterior. Moreover, Wasserman (2000) shows that the only
priors that yield posteriors in close agreement to frequentist methods are data-dependent
priors.

What Does All This Mean? Mixture models can have very unusual and unexpected behavior. This does not mean that we should not use mixture models; indeed, mixture models are extremely useful. However, when you use them it is important to keep in mind that many of the properties of models that we often take for granted may not hold. Compare this to kernel density estimators, which are simple and very well understood. If you are going to use mixture models, I advise you to remember the words of Rod Serling:

There is a fifth dimension beyond that which is known to man. It is a dimension


as vast as space and as timeless as infinity. It is the middle ground between light
and shadow, between science and superstition, and it lies between the pit of man’s
fears and the summit of his knowledge. This is the dimension of imagination. It
is an area which we call the Twilight Zone.

4 Density-Based Clustering I: Level Set Clustering

Let p be the density of the data. Let L_t = \{x : p_h(x) > t\} denote an upper level set of p. Suppose that L_t can be decomposed into finitely many disjoint sets: L_t = C_1 \cup \cdots \cup C_{k_t}. We call \mathcal{C}_t = \{C_1, . . . , C_{k_t}\} the level set clusters at level t.

Let \mathcal{C} = \bigcup_{t \ge 0} \mathcal{C}_t. The clusters in \mathcal{C} form a tree: if A, B \in \mathcal{C}, then either (i) A \subset B, or (ii) B \subset A, or (iii) A \cap B = \emptyset. We call \mathcal{C} the level set cluster tree.

The level sets can be estimated in the obvious way: \hat{L}_t = \{x : \hat{p}_h(x) > t\}. How do we decompose \hat{L}_t into its connected components? This can be done as follows. For each t let

X_t = \{X_i : \hat{p}_h(X_i) > t\}.


Now construct a graph G_t where each X_i \in X_t is a vertex and there is an edge between X_i and X_j if and only if ||X_i - X_j|| \le \epsilon where \epsilon > 0 is a tuning parameter. Bobrowski et al (2014) show that we can take \epsilon = h. G_t is called a Rips graph. The clusters at level t are estimated by taking the connected components of the graph G_t. In summary:

1. Compute \hat{p}_h.
2. For each t, let X_t = \{X_i : \hat{p}_h(X_i) > t\}.
3. Form a graph Gt for the points in Xt by connecting Xi and Xj if ||Xi − Xj || ≤ h.
4. The clusters at level t are the connected components of Gt .
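
A minimal sketch of these four steps, assuming a Gaussian kernel density estimate and using scipy's connected-components routine on the h-neighborhood (Rips) graph; the function name and the label convention (-1 for points below the level) are our choices.

    import numpy as np
    from scipy.spatial.distance import cdist, pdist, squareform
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    def level_set_clusters(X, h, t):
        """Return cluster labels (or -1) for the level set {p_hat_h > t}."""
        n, d = X.shape
        # 1. Gaussian kernel density estimate p_hat_h at the data points.
        D2 = cdist(X, X, "sqeuclidean")
        phat = np.exp(-D2 / (2 * h ** 2)).sum(axis=1) / (n * (np.sqrt(2 * np.pi) * h) ** d)
        # 2. Keep the points with p_hat_h(X_i) > t.
        keep = np.where(phat > t)[0]
        # 3. Rips graph: connect kept points within distance h of each other.
        A = squareform(pdist(X[keep])) <= h
        # 4. Clusters at level t = connected components of the graph.
        ncomp, comp = connected_components(csr_matrix(A), directed=False)
        labels = -np.ones(n, dtype=int)
        labels[keep] = comp
        return labels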

A Python package, called DeBaCl, written by Brian Kent, can be found at

http://www.brianpkent.com/projects.html.

Fabrizio Lecci has written an R implementation, included in his R package TDA (topological data analysis). You can get it at:

http://cran.r-project.org/web/packages/TDA/index.html

Two examples are shown in Figures 14 and 15.


Figure 14: DeBaClR in two dimensions.


Figure 15: DeBaClR in three dimensions.

4.1 Theory

How well does this work? Define the Hausdorff distance between two sets by

H(U, V) = \inf\{ \epsilon : U \subset V \oplus \epsilon \ \text{and} \ V \subset U \oplus \epsilon \}

where

V \oplus \epsilon = \bigcup_{x \in V} B(x, \epsilon)

and B(x, \epsilon) denotes a ball of radius \epsilon centered at x. We would like to say that L_t and \hat{L}_t are close. In general this is not true. Sometimes L_t and L_{t+\delta} are drastically different even for small \delta. (Think of the case where a mode has height t.) But we can estimate stable level sets. Let us say that L_t is stable if there exists a > 0 and C > 0 such that, for all \delta < a,

H(Lt−δ , Lt+δ ) ≤ Cδ.

Theorem 11 Suppose that L_t is stable. Then H(\hat{L}_t, L_t) = O_P(\sqrt{\log n/(n h^d)}).

Proof. Let r_n = \sqrt{\log n/(n h^d)}. We need to show two things: (i) for every x \in L_t there exists y \in \hat{L}_t such that ||x - y|| = O_P(r_n) and (ii) for every x \in \hat{L}_t there exists y \in L_t such that ||x - y|| = O_P(r_n). First, we note that, by earlier results, ||\hat{p}_h - p_h||_\infty = O_P(r_n). To show (i), suppose that x \in L_t. By the stability assumption, there exists y \in L_{t+r_n} such that ||x - y|| \le C r_n. Then p_h(y) > t + r_n, which implies that \hat{p}_h(y) > t and so y \in \hat{L}_t. To show (ii), let x \in \hat{L}_t so that \hat{p}_h(x) > t. Thus p_h(x) > t - r_n. By stability, there is a y \in L_t such that ||x - y|| \le C r_n.

4.2 Persistence

Consider a smooth density p with M = supx p(x) < ∞. The t-level set clusters are the
connected components of the set Lt = {x : p(x) ≥ t}. Suppose we find the upper level
sets Lt = {x : p(x) ≥ t} as we vary t from M to 0. Persistent homology measures how
the topology of Lt varies as we decrease t. In our case, we are only interested in the modes,
which correspond to the zeroth order homology. (Higher order homology refers to holes,
tunnels etc.) The idea of using persistence to study clustering was introduced by Chazal,
Guibas, Oudot and Skraba (2013).

Imagine setting t = M and then gradually decreasing t. Whenever we hit a mode, a new
level set cluster is born. As we decrease t further, some clusters may merge and we say that
one of the clusters (the one born most recently) has died. See Figure 16.


Figure 16: Starting at the top of the density and moving down, each mode has a birth time
b and a death time d. The persistence diagram (right) plots the points (d1 , b1 ), . . . , (d4 , b4 ).
Modes with a long lifetime are far from the diagonal.

In summary, each mode mj has a death time and a birth time denoted by (dj , bj ). (Note that
the birth time is larger than the death time because we start at high density and move to
lower density.) The modes can be summarized with a persistence diagram where we plot the
points (d1 , b1 ), . . . , (dk , bk ) in the plane. See Figure 16. Points near the diagonal correspond
to modes with short lifetimes. We might kill modes with lifetimes smaller than the bootstrap quantile t_\alpha defined by

t_\alpha = \inf\left\{ z : \frac{1}{B}\sum_{b=1}^B I\big( ||\hat{p}_h^{*b} - \hat{p}_h||_\infty > z \big) \le \alpha \right\}.    (15)

Here, \hat{p}_h^{*b} is the density estimator based on the b-th bootstrap sample. This corresponds to killing a mode if it is in a 2 t_\alpha band around the diagonal. See Fasy, Lecci, Rinaldo, Wasserman, Balakrishnan and Singh (2014). Note that the starting and ending points of the vertical bars on the level set tree are precisely the coordinates of the persistence diagram. (A more precise bootstrap approach was introduced in Chazal, Fasy, Lecci, Michel, Rinaldo and Wasserman (2014).)
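
A minimal one-dimensional sketch of this bootstrap step, approximating the sup-norm by a maximum over a grid; the grid, the number of bootstrap replications and the function names are our choices.

    import numpy as np

    def kde(x_grid, X, h):
        """Gaussian kernel density estimate on a grid (1-d)."""
        Z = (x_grid[:, None] - X[None, :]) / h
        return np.exp(-0.5 * Z ** 2).sum(axis=1) / (len(X) * h * np.sqrt(2 * np.pi))

    def bootstrap_band(X, h, B=500, alpha=0.05, grid_size=512, seed=0):
        """Bootstrap quantile t_alpha of sup |p_hat*_h - p_hat_h| (grid approximation)."""
        rng = np.random.default_rng(seed)
        grid = np.linspace(X.min() - 3 * h, X.max() + 3 * h, grid_size)
        phat = kde(grid, X, h)
        sups = np.array([
            np.max(np.abs(kde(grid, rng.choice(X, size=len(X), replace=True), h) - phat))
            for _ in range(B)
        ])
        # t_alpha is the (1 - alpha) empirical quantile of the bootstrap sup-norms;
        # modes with lifetime smaller than 2 * t_alpha would be killed.
        return np.quantile(sups, 1 - alpha)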

5 Density-Based Clustering II: Modes

Let p be the density of X ∈ Rd . Assume that p has modes m1 , . . . , mk0 and that p is a Morse
function, which means that the Hessian of p at each stationary point is non-degenerate. We

can use the modes to define clusters as follows.

Figure 17: A synthetic example with two “blob-like” clusters.

Figure 18: A synthetic example with four clusters with a variety of different shapes.

5.1 Mode Clustering

Given any point x ∈ Rd , there is a unique gradient ascent path, or integral curve, passing
through x that eventually leads to one of the modes. We define the clusters to be the “basins
of attraction” of the modes, the equivalence classes of points whose ascent paths lead to the
same mode. Formally, an integral curve through x is a path πx : R → Rd such that πx (0) = x
and
\pi_x'(t) = \nabla p(\pi_x(t)).    (16)
Integral curves never intersect (except at stationary points) and they partition the space.

Equation (16) means that the path π follows the direction of steepest ascent of p through x.
The destination of the integral curve π through a (non-mode) point x is defined by

dest(x) = \lim_{t \to \infty} \pi_x(t).    (17)

It can then be shown that for all x, dest(x) = mj for some mode mj . That is: all integral
curves lead to modes. For each mode mj , define the sets
A_j = \big\{ x : dest(x) = m_j \big\}.    (18)

These sets are known as the ascending manifolds, and also known as the cluster associated
with mj , or the basin of attraction of mj . The Aj ’s partition the space. See Figure 19. The
collection of ascending manifolds is called the Morse complex.

Figure 19: The left plot shows a function with four modes. The right plot shows the ascending
manifolds (basins of attraction) corresponding to the four modes.

Given data X_1, . . . , X_n we construct an estimate \hat{p} of the density. Let \hat{m}_1, . . . , \hat{m}_k be the estimated modes and let \hat{A}_1, . . . , \hat{A}_k be the corresponding ascending manifolds derived from \hat{p}. The sample clusters C_1, . . . , C_k are defined to be C_j = \{ X_i : X_i \in \hat{A}_j \}.

Recall that the kernel density estimator is


\hat{p}(x) \equiv \hat{p}_h(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h^d} K\left(\frac{||x - X_i||}{h}\right)    (19)

where K is a smooth, symmetric kernel and h > 0 is the bandwidth.1 The mean of the
estimator is

p_h(x) = E[\hat{p}_h(x)] = \int K(t)\, p(x + th)\, dt.    (20)

To locate the modes of \hat{p}_h we use the mean shift algorithm, which finds modes by approximating the steepest ascent paths. The algorithm is given in Figure 20. The result of this process is the set of estimated modes \hat{M} = \{\hat{m}_1, . . . , \hat{m}_k\}. We also get the clustering for free: the mean shift algorithm shows us which mode each point is attracted to. See Figure 21.

A modified version of the algorithm is the blurred mean-shift algorithm (Carreira-Perpinan,


2006). Here, we use the data as the mesh and we replace the data with the mean-shifted data
at each step. This converges very quickly but must be stopped before everything converges
to a single point; see Figures 22 and 23.
^1 In general, we can use a bandwidth matrix H in the estimator, with \hat{p}(x) \equiv \hat{p}_H(x) = \frac{1}{n}\sum_{i=1}^n K_H(x - X_i) where K_H(x) = |H|^{-1/2} K(H^{-1/2}x).

Mean Shift Algorithm

1. Input: \hat{p}(x) and a mesh of points A = \{a_1, . . . , a_N\} (often taken to be the data points).

2. For each mesh point a_j, set a_j^{(0)} = a_j and iterate the following equation until convergence:

   a_j^{(s+1)} \longleftarrow \frac{\sum_{i=1}^n X_i\, K\big(||a_j^{(s)} - X_i||/h\big)}{\sum_{i=1}^n K\big(||a_j^{(s)} - X_i||/h\big)}.

3. Let \hat{M} be the unique values of the set \{a_1^{(\infty)}, . . . , a_N^{(\infty)}\}.

4. Output: \hat{M}.

Figure 20: The Mean Shift Algorithm.
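
A minimal numpy sketch of the iteration in Figure 20 with a Gaussian kernel; the way modes are grouped at the end (merging converged points within a tolerance) and the parameter defaults are our choices.

    import numpy as np

    def mean_shift(X, h, n_iter=200, tol=1e-3):
        """Run the mean shift iteration from every data point; return (modes, labels)."""
        A = X.astype(float).copy()
        for _ in range(n_iter):
            # Kernel weights K(||a_j - X_i|| / h) for every mesh point a_j.
            D2 = ((A[:, None, :] - X[None, :, :]) ** 2).sum(-1)
            W = np.exp(-0.5 * D2 / h ** 2)
            A_new = (W @ X) / W.sum(axis=1, keepdims=True)
            if np.max(np.abs(A_new - A)) < tol:
                A = A_new
                break
            A = A_new
        # Group converged points that lie close together into modes.
        modes, labels = [], np.empty(len(A), dtype=int)
        for j, a in enumerate(A):
            for m, mode in enumerate(modes):
                if np.linalg.norm(a - mode) < 10 * tol:
                    labels[j] = m
                    break
            else:
                modes.append(a)
                labels[j] = len(modes) - 1
        return np.array(modes), labels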

What we are doing is tracing out the gradient flow. The flow lines lead to the modes and
they define the clusters. In general, a flow is a map φ : Rd × R → Rd such that φ(x, 0) = x
and φ(φ(x, t), s) = φ(x, s + t). The latter is called the semi-group property.

5.2 Choosing the Bandwidth

As usual, choosing a good bandwidth is crucial. You might wonder if increasing the band-
width decreases the number of modes. Silverman (1981) showed that the answer is yes if
you use a Normal kernel.


Figure 21: A simple example of the mean shift algorithm.

Figure 22: The crescent data example. Top left: data. Top right: a few steps of mean-shift.
Bottom left: a few steps of blurred mean-shift.
Figure 23: The Broken Ring example. Top left: data. Top right: a few steps of mean-shift.
Bottom left: a few steps of blurred mean-shift.

Theorem 12 (Silverman 1981) Let \hat{p}_h be a kernel density estimator using a Gaussian kernel in one dimension. Then the number of modes of \hat{p}_h is a non-increasing function of h. The Gaussian kernel is the unique kernel with this property.

We still need a way to pick h. We can use cross-validation as before. One could argue that
we should choose h so that we estimate the gradient g(x) = ∇p(x) well since the clustering
is based on the gradient flow.

How can we estimate the loss of the gradient? Consider first the scalar case. Note that

\int (\hat{p}' - p')^2 = \int (\hat{p}')^2 - 2\int \hat{p}' p' + \int (p')^2.

We can ignore the last term. The first term is known. To estimate the middle term, we use integration by parts to get

\int \hat{p}' p' = -\int \hat{p}'' p,

suggesting the cross-validation estimator

\int (\hat{p}'(x))^2 dx + \frac{2}{n}\sum_i \hat{p}_i''(X_i)

where \hat{p}_i'' is the leave-one-out second derivative. More generally, by repeated integration by parts, we can estimate the loss for the r-th derivative by

CV_r(h) = \int (\hat{p}^{(r)}(x))^2 dx - (-1)^r \frac{2}{n}\sum_i \hat{p}_i^{(2r)}(X_i).

Let’s now discuss estimating derivatives more generally following Chacon and Duong (2013).
Let

\hat{p}_H(x) = \frac{1}{n}\sum_{i=1}^n K_H(x - X_i)

where K_H(x) = |H|^{-1/2} K(H^{-1/2}x). Let D = \partial/\partial x = (\partial/\partial x_1, . . . , \partial/\partial x_d) be the gradient operator. Let H(x) be the Hessian of p(x) whose entries are \partial^2 p/(\partial x_j \partial x_k). Let

D^{\otimes r} p = (Dp)^{\otimes r} = \partial^r p/\partial x^{\otimes r} \in R^{d^r}

denote the r-th derivatives, organized into a vector. Thus

D^{\otimes 0} p = p, \quad D^{\otimes 1} p = Dp, \quad D^{\otimes 2} p = vec(H)

where vec takes a matrix and stacks the columns into a vector.

The estimate of D^{\otimes r} p is

\hat{p}^{(r)}(x) = D^{\otimes r}\hat{p}_H(x) = \frac{1}{n}\sum_{i=1}^n D^{\otimes r} K_H(x - X_i) = |H|^{-1/2}(H^{-1/2})^{\otimes r}\, \frac{1}{n}\sum_{i=1}^n D^{\otimes r} K(H^{-1/2}(x - X_i)).

The integrated squared error is

L = \int ||D^{\otimes r}\hat{p}_H(x) - D^{\otimes r} p(x)||^2 dx.

Chacon, Duong and Wand show that E[L] is minimized by choosing H so that each entry has order n^{-2/(d+2r+4)}, leading to a risk of order O(n^{-4/(d+2r+4)}). In fact, it may be shown that

E[L] = \frac{1}{n}|H|^{-1/2}\, tr\big((H^{-1})^{\otimes r} R(D^{\otimes r}K)\big) - \frac{1}{n}\, tr\, R^*(K_H \star K_H, D^{\otimes r}p) + tr\, R^*(K_H \star K_H, D^{\otimes r}p) - 2\, tr\, R^*(K_H, D^{\otimes r}p) + tr\, R(D^{\otimes r}p)

where

R(g) = \int g(x)\, g^T(x)\, dx, \qquad R^*(a, g) = \int (a \star g)(x)\, g^T(x)\, dx

and (a \star g) is componentwise convolution.

To estimate the loss, we expand L as

L = \int ||D^{\otimes r}\hat{p}_H(x)||^2 dx - 2\int \langle D^{\otimes r}\hat{p}_H(x), D^{\otimes r}p(x)\rangle dx + \text{constant}.

Using some high-voltage calculations, Chacon and Duong (2013) derived the following leave-one-out approximation to the first two terms:

CV_r(H) = (-1)^r |H|^{-1/2} (vec(H^{-1})^{\otimes r})^T B(H)

where

B(H) = \frac{1}{n^2}\sum_{i,j} D^{\otimes 2r}\bar{K}(H^{-1/2}(X_i - X_j)) - \frac{2}{n(n-1)}\sum_{i \neq j} D^{\otimes 2r} K(H^{-1/2}(X_i - X_j))

and \bar{K} = K \star K. In practice, the minimization is easy if we restrict to matrices of the form H = h^2 I.

A better idea is to use a fixed (non-decreasing) h. We don't need h to go to 0 to find the clusters. More on this when we discuss persistence.

5.3 Theoretical Analysis

How well can we estimate the modes?

Theorem 13 Assume that p is Morse with finitely many modes m_1, . . . , m_k. Then for h > 0 and not too large, p_h is Morse with modes m_{h1}, . . . , m_{hk} and (possibly after relabelling),

\max_j ||m_j - m_{jh}|| = O(h^2).

With probability tending to 1, \hat{p}_h has the same number of modes, which we denote by \hat{m}_{h1}, . . . , \hat{m}_{hk}. Furthermore,

\max_j ||\hat{m}_{jh} - m_{jh}|| = O_P\left(\sqrt{\frac{1}{n h^{d+2}}}\right)

and

\max_j ||\hat{m}_{jh} - m_j|| = O(h^2) + O_P\left(\sqrt{\frac{1}{n h^{d+2}}}\right).

Remark: Setting h \asymp n^{-1/(d+6)} gives the rate n^{-2/(d+6)}, which is minimax (Tsybakov 1990) under smoothness assumptions. See also Romano (1988). However, if we take the fixed-h point of view, then we have a n^{-1/2} rate.

Proof Outline. Put a small ball B_j around each m_{jh}. We will skip the first step, which is to show that there is one (and only one) local mode of \hat{p}_h in B_j. Let's focus on showing

\max_j ||\hat{m}_{jh} - m_{jh}|| = O_P\left(\sqrt{\frac{1}{n h^{d+2}}}\right).

For simplicity, write m = m_{jh} and x = \hat{m}_{jh}. Let g(x) and H(x) be the gradient and Hessian of p_h(x) and let \hat{g}(x) and \hat{H}(x) be the gradient and Hessian of \hat{p}_h(x). Then

(0, . . . , 0)^T = \hat{g}(x) = \hat{g}(m) + \left( \int_0^1 \hat{H}(m + u(x - m))\, du \right)(x - m)

and so

\left( \int_0^1 \hat{H}(m + u(x - m))\, du \right)(x - m) = g(m) - \hat{g}(m)

where we used the fact that g(m) = 0. Multiplying on the left by (x - m)^T we have

(x - m)^T \left( \int_0^1 \hat{H}(m + u(x - m))\, du \right)(x - m) = (g(m) - \hat{g}(m))^T (x - m).

Let \lambda = \inf_{0 \le u \le 1} \lambda_{\min}(\hat{H}(m + u(x - m))). Then \lambda = \lambda_{\min}(H(m)) + o_P(1) and

(x - m)^T \left( \int_0^1 \hat{H}(m + u(x - m))\, du \right)(x - m) \ge \lambda\, ||x - m||^2.

Hence, using Cauchy-Schwarz,

\lambda ||x - m||^2 \le ||\hat{g}(m) - g(m)||\, ||x - m|| \le ||x - m|| \sup_y ||\hat{g}(y) - g(y)|| \le ||x - m||\, O_P\left(\sqrt{\frac{1}{n h^{d+2}}}\right)

and so ||x - m|| = O_P\left(\sqrt{\frac{1}{n h^{d+2}}}\right).

Remark: If we treat h as fixed (not decreasing), then the rate is O_P(\sqrt{1/n}), independent of the dimension.

6 Hierarchical Clustering

Hierarchical clustering methods build a set of nested clusters at different resolutions. There are two types of hierarchical clustering: agglomerative (bottom-up) and divisive (top-down). With agglomerative clustering we start with some distance or dissimilarity d(x, y) between points. We then extend this distance so that we can compute the distance d(A, B) between two sets of points A and B.

The three most common ways of extending the distance are:

Single Linkage:    d(A, B) = \min_{x \in A,\, y \in B} d(x, y)

Average Linkage:   d(A, B) = \frac{1}{N_A N_B}\sum_{x \in A,\, y \in B} d(x, y)

Complete Linkage:  d(A, B) = \max_{x \in A,\, y \in B} d(x, y)

The algorithm is:

1. Input: data X = {X1 , . . . , Xn } and metric d giving distance between clusters.

2. Let Tn = {C1 , C2 , . . . , Cn } where Ci = {Xi }.

3. For j = n − 1 to 1:

(a) Find $j, k$ to minimize $d(C_j, C_k)$ over all $C_j, C_k \in T_{j+1}$.

(b) Let $T_j$ be the same as $T_{j+1}$ except that $C_j$ and $C_k$ are replaced with $C_j \cup C_k$.

4. Return the sets of clusters $T_1, \ldots, T_n$.

Figure 24: Hierarchical clustering applied to two noisy rings. Top left: the data. Top right: two clusters from hierarchical clustering using single linkage. Bottom left: average linkage. Bottom right: complete linkage.

The result can be represented as a tree, called a dendrogram. We can then cut the tree at different places to yield any number of clusters ranging from 1 to n. Single linkage often produces long, thin clusters while complete linkage tends to produce rounder, more compact clusters. Average linkage is in between.

Example 14 Figure 24 shows agglomerative clustering applied to data generated from two rings plus noise. The noise is large enough so that the smaller ring looks like a blob. The data are shown in the top left plot. The top right plot shows hierarchical clustering using single linkage. (The tree is cut to obtain two clusters.) The bottom left plot shows average linkage and the bottom right plot shows complete linkage. Single linkage works well while average and complete linkage do poorly.
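A short sketch of this kind of experiment (not the code behind Figure 24), using scipy's agglomerative clustering on synthetic two-ring data; the sample size, radii and noise level are arbitrary choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# synthetic data: a small ring and a large ring, both with additive noise
rng = np.random.default_rng(1)
n = 200
theta = rng.uniform(0, 2 * np.pi, n)
radius = np.r_[np.full(n // 2, 1.0), np.full(n // 2, 4.0)]
X = np.c_[radius * np.cos(theta), radius * np.sin(theta)]
X += rng.normal(scale=0.1, size=X.shape)

# agglomerative clustering with the three linkages, tree cut at two clusters
for method in ["single", "average", "complete"]:
    Z = linkage(X, method=method)                    # merge tree (dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
    print(method, np.bincount(labels)[1:])           # sizes of the two clusters
```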

Let us now mention some theoretical properties of hierarchical clustering. Suppose that
X1 , . . . , Xn is a sample from a distribution P on Rd with density p. A high density cluster is
a maximal connected component of a set of the form {x : p(x) ≥ λ}. One might expect that
single linkage clusters would correspond to high density clusters. This turns out not quite
to be the case. See Hartigan (1981) for details. DasGupta (2010) has a modified version

of hierarchical clustering that attempts to fix this problem. His method is very similar to
density clustering.

Single linkage hierarchical clustering is the same as geometric graph clustering. Let $G = (V, E)$ be a graph where $V = \{X_1, \ldots, X_n\}$ and $E_{ij} = 1$ if $\|X_i - X_j\| \le \epsilon$ and $E_{ij} = 0$ if $\|X_i - X_j\| > \epsilon$. Let $C_1, \ldots, C_k$ denote the connected components of the graph. As we vary $\epsilon$ we get exactly the hierarchical clustering tree.
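The equivalence is easy to use computationally: form the $\epsilon$-graph and take its connected components. A minimal sketch (the data and $\epsilon$ are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def geometric_graph_clusters(X, eps):
    """Clusters = connected components of the graph joining points within
    distance eps (the same as cutting the single linkage tree at height eps)."""
    A = squareform(pdist(X)) <= eps          # adjacency matrix of the eps-graph
    np.fill_diagonal(A, False)
    n_comp, labels = connected_components(csr_matrix(A), directed=False)
    return n_comp, labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(4, 0.3, (50, 2))])
n_comp, labels = geometric_graph_clusters(X, eps=1.0)
print(n_comp, "connected components")        # two well-separated blobs -> 2
```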

Finally, let us mention divisive clustering. This is a form of hierarchical clustering where we start with one large cluster and then break the cluster recursively into smaller and smaller pieces.

7 Spectral Clustering

Spectral clustering refers to a class of clustering methods that use ideas related to eigenvectors.
An excellent tutorial on spectral clustering is von Luxburg (2006) and some of this section
relies heavily on that paper. More detail can be found in Chung (1997).

Let $G$ be an undirected graph with $n$ vertices. Typically these vertices correspond to observations $X_1, \ldots, X_n$. Let $W$ be an $n \times n$ symmetric weight matrix. Say that $X_i$ and $X_j$ are connected if $W_{ij} > 0$. The simplest type of weight matrix has entries that are either 0 or 1. For example, we could define
$$
W_{ij} = I(\|X_i - X_j\| \le \epsilon).
$$
An example of a more general weight matrix is $W_{ij} = e^{-\|X_i - X_j\|^2/(2h^2)}$.

The degree matrix $D$ is the $n \times n$ diagonal matrix with $D_{ii} = \sum_{j=1}^n W_{ij}$. The graph Laplacian is
$$
L = D - W. \qquad (21)
$$

The graph Laplacian has many interesting properties which we list in the following result. Recall that a vector $v$ is an eigenvector of $L$ if there is a scalar $\lambda$ such that $Lv = \lambda v$, in which case we say that $\lambda$ is the eigenvalue corresponding to $v$. Let $\mathcal{L}(v) = \{cv : c \in \mathbb{R}, c \neq 0\}$ be the linear space generated by $v$. If $v$ is an eigenvector with eigenvalue $\lambda$ and $c$ is any nonzero constant, then $cv$ is also an eigenvector with eigenvalue $\lambda$. These eigenvectors are considered equivalent. In other words, $\mathcal{L}(v)$ is the set of vectors that are equivalent to $v$.

Theorem 15 The graph Laplacian L has the following properties:

1. For any vector $f = (f_1, \ldots, f_n)^T$,
$$
f^T L f = \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n W_{ij} (f_i - f_j)^2.
$$

2. $L$ is symmetric and positive semi-definite.

3. The smallest eigenvalue of $L$ is 0. The corresponding eigenvector is $(1, 1, \ldots, 1)^T$.

4. $L$ has $n$ non-negative, real-valued eigenvalues $0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$.

5. The number of eigenvalues that are equal to 0 is equal to the number of connected components of $G$. That is, $0 = \lambda_1 = \cdots = \lambda_k$ where $k$ is the number of connected components of $G$. The corresponding eigenvectors $v_1, \ldots, v_k$ are orthogonal and each is constant over one of the connected components of the graph.

Part 1 of the theorem says that L is like a derivative operator. The last part shows that we
can use the graph Laplacian to find the connected components of the graph.

Proof.

(1) This follows from direct algebra.

(2) Since $W$ and $D$ are symmetric, it follows that $L$ is symmetric. The fact that $L$ is positive semi-definite follows from part (1).

(3) Let $v = (1, \ldots, 1)^T$. Then
$$
Lv = Dv - Wv = \begin{pmatrix} D_{11} \\ \vdots \\ D_{nn} \end{pmatrix} - \begin{pmatrix} D_{11} \\ \vdots \\ D_{nn} \end{pmatrix} = \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}
$$
which equals $0 \times v$.

(4) This follows from parts (1)-(3).

(5) First suppose that $k = 1$ and thus that the graph is connected. We already know that $\lambda_1 = 0$ and $v_1 = (1, \ldots, 1)^T$. Suppose there were another eigenvector $v$ with eigenvalue 0. Then
$$
0 = v^T L v = \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n W_{ij} (v(i) - v(j))^2.
$$
It follows that $W_{ij} (v(i) - v(j))^2 = 0$ for all $i$ and $j$, so $v(i) = v(j)$ whenever $W_{ij} > 0$. Since $G$ is connected, $v$ is constant and thus $v \in \mathcal{L}(v_1)$.

Now suppose that $G$ has $k$ components. Let $n_j$ be the number of nodes in component $j$. We can relabel the vertices so that the first $n_1$ nodes correspond to the first connected component, the next $n_2$ nodes correspond to the second connected component and so on. Let $v_1 = (1, \ldots, 1, 0, \ldots, 0)$ where the 1's correspond to the first component. Let $v_2 = (0, \ldots, 0, 1, \ldots, 1, 0, \ldots, 0)$ where the 1's correspond to the second component. Define $v_3, \ldots, v_k$ similarly. Due to the re-ordering of the vertices, $L$ has block diagonal form:
$$
L = \begin{pmatrix} L_1 & & & \\ & L_2 & & \\ & & \ddots & \\ & & & L_k \end{pmatrix}.
$$
Here, each $L_i$ corresponds to one of the connected components of the graph. It is easy to see that $L v_j = 0$ for $j = 1, \ldots, k$. Thus, each $v_j$, for $j = 1, \ldots, k$, is an eigenvector with zero eigenvalue. Suppose that $v$ is any eigenvector with 0 eigenvalue. Arguing as before, $v$ must be constant over some component and 0 elsewhere. Hence, $v \in \mathcal{L}(v_j)$ for some $1 \le j \le k$. $\Box$

Example 16 Consider the graph with vertices $X_1, X_2, X_3, X_4, X_5$ and edges $X_1 - X_2$, $X_3 - X_4$ and $X_4 - X_5$, and suppose that $W_{ij} = 1$ if and only if there is an edge between $X_i$ and $X_j$. Then
$$
W = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 1 & 0
\end{pmatrix}, \qquad
D = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 2 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}
$$
and the Laplacian is
$$
L = D - W = \begin{pmatrix}
1 & -1 & 0 & 0 & 0 \\
-1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 \\
0 & 0 & -1 & 2 & -1 \\
0 & 0 & 0 & -1 & 1
\end{pmatrix}.
$$
The eigenvalues of $L$, from smallest to largest, are $0, 0, 1, 2, 3$. The eigenvectors are
$$
v_1 = \begin{pmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \end{pmatrix} \quad
v_2 = \begin{pmatrix} 0 \\ 0 \\ 1 \\ 1 \\ 1 \end{pmatrix} \quad
v_3 = \begin{pmatrix} 0 \\ 0 \\ -.71 \\ 0 \\ .71 \end{pmatrix} \quad
v_4 = \begin{pmatrix} -.71 \\ .71 \\ 0 \\ 0 \\ 0 \end{pmatrix} \quad
v_5 = \begin{pmatrix} 0 \\ 0 \\ -.41 \\ .82 \\ -.41 \end{pmatrix}
$$
Note that the first two eigenvectors correspond to the connected components of the graph.
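Example 16 is easy to check numerically; a small sketch with numpy:

```python
import numpy as np

# adjacency matrix for the graph with edges X1-X2, X3-X4 and X4-X5
W = np.array([[0, 1, 0, 0, 0],
              [1, 0, 0, 0, 0],
              [0, 0, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))        # degree matrix
L = D - W                         # graph Laplacian

vals, vecs = np.linalg.eigh(L)    # eigenvalues returned in increasing order
print(np.round(vals, 2))          # [0. 0. 1. 2. 3.]
# any eigenvector with eigenvalue 0 is constant on each connected component
print(np.round(vecs[:, :2], 2))
```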


Figure 25: The top shows a simple graph. The remaining plots are the eigenvectors of
the graph Laplacian. Note that the first two eigenvectors correspond to the two connected
components of the graph.

Note that $f^T L f$ measures the smoothness of $f$ relative to the graph. This means that the higher order eigenvectors generate a basis where the first few basis elements are smooth (with respect to the graph) and the later basis elements become more wiggly.

Example 17 Figure 25 shows a graph and the corresponding eigenvectors. The first two eigenvectors correspond to the two connected components of the graph. The other eigenvectors can be thought of as forming basis vectors within the connected components.

One approach to spectral clustering is to set
$$
W_{ij} = I(\|X_i - X_j\| \le \epsilon)
$$
for some $\epsilon > 0$ and then take the clusters to be the connected components of the graph, which can be found by getting the eigenvectors of the Laplacian $L$. This is exactly equivalent to the geometric graph clustering described in the previous section. In this case we have gained nothing except that we have a new algorithm to find the connected components of the graph. However, there are other ways to use spectral methods for clustering, as we now explain.

The idea underlying the other spectral methods is to use the Laplacian to transform the
data into a new coordinate system in which clusters are easier to find. For this purpose, one

typically uses a modified form of the graph Laplacian. The most commonly used weights for this purpose are
$$
W_{ij} = e^{-\|X_i - X_j\|^2/(2h^2)}.
$$
Other kernels $K_h(X_i, X_j)$ can be used as well. We define the symmetrized Laplacian $D^{-1/2} W D^{-1/2}$ and the random walk Laplacian $\mathcal{L} = D^{-1} W$. (We will explain the name shortly.) These are very similar and we will focus on the latter. Some authors define the random walk Laplacian to be $I - D^{-1} W$. We prefer to use the definition $\mathcal{L} = D^{-1} W$ because, as we shall see, it has a nice interpretation. The eigenvectors of $I - D^{-1} W$ and $D^{-1} W$ are the same so it makes little difference which definition is used. The main difference is that the connected components have eigenvalues 1 instead of 0.

Lemma 18 Let $L = D - W$ be the graph Laplacian of a graph $G$ and let $\mathcal{L} = D^{-1} W$ be the random walk Laplacian.

1. $\lambda$ is an eigenvalue of $\mathcal{L}$ with eigenvector $v$ if and only if $Lv = (1 - \lambda) D v$.

2. 1 is an eigenvalue of $\mathcal{L}$ with eigenvector $(1, \ldots, 1)^T$.

3. $\mathcal{L}$ is positive semidefinite with $n$ non-negative real-valued eigenvalues.

4. The number of eigenvalues of $\mathcal{L}$ equal to 1 equals the number of connected components of $G$. Let $v_1, \ldots, v_k$ denote the eigenvectors with eigenvalues equal to 1. The linear space spanned by $v_1, \ldots, v_k$ is spanned by the indicator functions of the connected components.

Proof. Homework. $\Box$

Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ be the eigenvalues of $\mathcal{L}$ with eigenvectors $v_1, \ldots, v_n$. Define
$$
Z_i \equiv T(X_i) = \left( \sqrt{\lambda_1}\, v_1(i), \ldots, \sqrt{\lambda_r}\, v_r(i) \right).
$$

The mapping T : X → Z transforms the data into a new coordinate system. The numbers
h and r are tuning parameters. The hope is that clusters are easier to find in the new
parameterization.

To get some intuition for this, note that $\mathcal{L}$ has a nice probabilistic interpretation (Coifman, Lafon, Lee 2006). Consider a Markov chain on $X_1, \ldots, X_n$ where we jump from $X_i$ to $X_j$ with probability
$$
\mathbb{P}(X_i \longrightarrow X_j) = \mathcal{L}(i, j) = \frac{K_h(X_i, X_j)}{\sum_s K_h(X_i, X_s)}.
$$

The Laplacian $\mathcal{L}(i, j)$ captures how easy it is to move from $X_i$ to $X_j$. If $Z_i$ and $Z_j$ are close in Euclidean distance, then they are connected by many high density paths through the

data. This Markov chain is a discrete version of a continuous Markov chain with transition probability
$$
P(x \rightarrow A) = \frac{\int_A K_h(x, y)\, dP(y)}{\int K_h(x, y)\, dP(y)}.
$$
The corresponding averaging operator $\hat{A} : f \rightarrow \tilde{f}$ is
$$
(\hat{A} f)(i) = \frac{\sum_j f(j) K_h(X_i, X_j)}{\sum_j K_h(X_i, X_j)}
$$
which is an estimate of $A : f \rightarrow \tilde{f}$ where
$$
A f = \frac{\int f(y) K_h(x, y)\, dP(y)}{\int K_h(x, y)\, dP(y)}.
$$

The lower order eigenvectors of $\mathcal{L}$ are vectors that are smooth relative to $P$. Thus, projecting onto the first few eigenvectors parameterizes the data in terms of closeness with respect to the underlying density.

The steps are:

Input: $n \times n$ similarity matrix $W$.

1. Let $D$ be the $n \times n$ diagonal matrix with $D_{ii} = \sum_j W_{ij}$.

2. Compute the Laplacian $\mathcal{L} = D^{-1} W$.

3. Find the first $k$ eigenvectors $v_1, \ldots, v_k$ of $\mathcal{L}$.

4. Project each $X_i$ onto the eigenvectors to get new points $\hat{X}_i$.

5. Cluster the points $\hat{X}_1, \ldots, \hat{X}_n$ using any standard clustering algorithm.
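A minimal numpy sketch of these five steps (Gaussian weights, random walk Laplacian, k-means on the embedded points). The bandwidth h and the use of scikit-learn's KMeans in the last step are illustrative choices, not prescribed by the notes.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(X, k, h):
    # 1. similarity matrix with Gaussian weights
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / (2 * h ** 2))
    # 2. random walk Laplacian: divide each row by its degree
    L = W / W.sum(axis=1, keepdims=True)
    # 3. first k eigenvectors (largest eigenvalues, which are near 1)
    vals, vecs = np.linalg.eig(L)
    order = np.argsort(-vals.real)
    V = vecs[:, order[:k]].real
    # 4-5. embed the points and run k-means in the new coordinates
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (60, 2)), rng.normal(5, 0.5, (60, 2))])
print(np.bincount(spectral_cluster(X, k=2, h=1.0)))   # roughly [60 60]
```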

There is another way to think about spectral clustering. Spectral methods are similar to
multidimensional scaling. However, multidimensional scaling attempts to reduce dimension
while preserving all pairwise distances. Spectral methods attempt instead to preserve local
distances.

Example 19 Figure 26 shows a simple synthetic example. The top left plot shows the data.
We apply spectral clustering with Gaussian weights and bandwidth h = 3. The top middle
plot shows the first 20 eigenvalues. The top right plot shows the first versus the second
eigenvector. The two clusters are clearly separated. (Because the clusters are so separated,
the graph is essentially disconnected and the first eigenvector is not constant. For large h,
the graph becomes fully connected and v1 is then constant.) The remaining six plots show
the first six eigenvectors. We see that they form a Fourier-like basis within each cluster.
Of course, single linkage clustering would work just as well with the original data as in the
transformed data. The real advantage would come if the original data were high dimensional.

Figure 26: Top left: data. Top middle: eigenvalues. Top right: second versus third eigen-
vectors. Remaining plots: first six eigenvectors.

Figure 27: Spectral analysis of some zipcode data. Top: h = 6. Bottom: h = 4. The plots
on the right show the second versus third eigenvector. The three colors correspond to the
three digits 1, 2 and 3.

Example 20 Figure 27 shows a spectral analysis of some zipcode data. Each datapoint is a
16 x 16 image of a handwritten number. We restrict ourselves to the digits 1, 2 and 3. We
use Gaussian weights and the top plots correspond to h = 6 while the bottom plots correspond
to h = 4. The left plots show the first 20 eigenvalues. The right plots show a scatterplot of
the second versus the third eigenvector. The three colors correspond to the three digits. We
see that with a good choice of h, namely h = 6, we can clearly see the digits in the plot. The
original dimension of the problem is $16 \times 16 = 256$. That is, each image can be represented by a point in $\mathbb{R}^{256}$. However, the spectral method shows that most of the information is captured by two eigenvectors, so the effective dimension is 2. This example also shows that the choice
of h is crucial.

Spectral methods are interesting. However, there are some open questions:

1. There are tuning parameters (such as h) and the results are sensitive to these param-
eters. How do we choose these tuning parameters?
2. Does spectral clustering perform better than density clustering?

8 High-Dimensional Clustering

As usual, interesting and unexpected things happen in high dimensions. The usual methods
may break down and even the meaning of a cluster may not be clear.

8.1 High Dimensional Behavior

I'll begin by discussing some recent results from Sarkar and Ghosh (arXiv:1612.09121). Suppose we have data coming from $k$ distributions $P_1, \ldots, P_k$. Let $\mu_r$ be the mean of $P_r$ and $\Sigma_r$ be the covariance matrix. Most clustering methods depend on the pairwise distances $\|X_i - X_j\|^2$. Now,
$$
\|X_i - X_j\|^2 = \sum_{a=1}^d \delta_a
$$
where $\delta_a = (X_i(a) - X_j(a))^2$. This is a sum. As $d$ increases, by the law of large numbers we might expect this sum to converge to a number (assuming the features are not too dependent). Indeed, suppose that $X$ is from $P_r$ and $Y$ is from $P_s$; then
$$
\frac{1}{\sqrt{d}} \|X - Y\| \stackrel{P}{\rightarrow} \sqrt{\sigma_r^2 + \sigma_s^2 + \nu_{rs}}
$$
where
$$
\nu_{rs} = \lim_{d \rightarrow \infty} \frac{1}{d} \sum_{a=1}^d (\mu_r(a) - \mu_s(a))^2
$$
and
$$
\sigma_r^2 = \lim_{d \rightarrow \infty} \frac{1}{d}\, \mathrm{trace}(\Sigma_r).
$$
Note that $\nu_{rr} = 0$.

Consider two clusters, $C_1$ and $C_2$. In the limit, the scaled squared distances behave as follows:

$X \in C_1$, $Y \in C_1$: $\frac{1}{d}\|X - Y\|^2 \rightarrow 2\sigma_1^2$

$X \in C_2$, $Y \in C_2$: $\frac{1}{d}\|X - Y\|^2 \rightarrow 2\sigma_2^2$

$X \in C_1$, $Y \in C_2$: $\frac{1}{d}\|X - Y\|^2 \rightarrow \sigma_1^2 + \sigma_2^2 + \nu_{12}$

If
$$
\sigma_1^2 + \nu_{12} < \sigma_2^2
$$
then every point in cluster 2 is closer to points in cluster 1 than to other points in cluster 2. Indeed, if you simulate high dimensional Gaussians, you will see that all the standard clustering methods fail terribly.
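A quick simulation (with made-up parameters satisfying $\sigma_1^2 + \nu_{12} < \sigma_2^2$) shows the inversion: the scaled between-cluster distances are smaller than the within-cluster distances of cluster 2.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2000, 50
# cluster 1: N(0, 1^2 I_d); cluster 2: N(1, 2^2 I_d)
# so sigma1^2 = 1, sigma2^2 = 4, nu12 = 1 and sigma1^2 + nu12 = 2 < 4 = sigma2^2
X1 = rng.normal(0.0, 1.0, (n, d))
X2 = rng.normal(1.0, 2.0, (n, d))

def mean_scaled_sq_dist(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return d2.mean() / A.shape[1]          # average of ||X - Y||^2 / d

print("within C1:", round(mean_scaled_sq_dist(X1, X1), 2))  # about 2*sigma1^2 = 2
print("within C2:", round(mean_scaled_sq_dist(X2, X2), 2))  # about 2*sigma2^2 = 8
print("between  :", round(mean_scaled_sq_dist(X1, X2), 2))  # about 1 + 4 + 1 = 6
```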

What's really going on is that high dimensional data tend to concentrate on rings (thin spherical shells). Pairwise distance methods don't respect rings.

An interesting fix suggested by Sarkar and Ghosh is to use the mean absolute difference distance (MADD) defined by
$$
\rho(x, y) = \frac{1}{n-2} \sum_{z \neq x, y} \Big|\, \|x - z\| - \|y - z\| \,\Big|.
$$
Suppose that $X \sim P_r$ and $Y \sim P_s$. They show that $\rho(X, Y) \stackrel{P}{\rightarrow} c_{rs}$ where $c_{rs} \ge 0$ and $c_{rs} = 0$ if and only if $\sigma_r^2 = \sigma_s^2$ and $\nu_{br} = \nu_{bs}$ for all $b$. What this means is that pairwise distance methods only work if $\nu_{rs} > |\sigma_r^2 - \sigma_s^2|$ but MADD works if either $\nu_{rs} \neq 0$ or $\sigma_r \neq \sigma_s$.

Pairwise distances only use information about two moments and they combine this moment
information in a particular way. MADD combines the moment information in a different and
more effective way. One could also invent other measures that separate mean and variance
information or that use higher moment information.
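A direct (O(n^3)) sketch of the MADD dissimilarity matrix; the resulting matrix can be fed to any distance-based clustering method.

```python
import numpy as np

def madd_matrix(X):
    """rho(x, y) = average over z != x, y of | ||x - z|| - ||y - z|| |."""
    n = X.shape[0]
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))  # Euclidean distances
    R = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mask = np.ones(n, dtype=bool)
            mask[[i, j]] = False
            R[i, j] = R[j, i] = np.abs(D[i, mask] - D[j, mask]).mean()
    return R

# e.g. single linkage on the MADD dissimilarities:
#   from scipy.cluster.hierarchy import linkage
#   from scipy.spatial.distance import squareform
#   Z = linkage(squareform(madd_matrix(X)), method="single")
```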

8.2 Variable Selection

If X ∈ Rd is high dimensional, then it makes sense to do variable selection before clustering.


There are a number of methods for doing this. But, frankly, none are very convincing. This
is, in my opinion, an open problem. Here are a couple of possibilities.

Marginal Selection (Screening). In marginal selection, we look for variables that marginally look "clustery." This idea was used in Chan and Hall (2010) and Wasserman, Azizyan and Singh (2014). We proceed as follows:

Test For Multi-Modality

1. Fix $0 < \alpha < 1$. Let $\tilde{\alpha} = \alpha/(nd)$.

2. For each $1 \le j \le d$, compute $T_j = \mathrm{Dip}(F_{nj})$ where $F_{nj}$ is the empirical distribution function of the $j$-th feature and $\mathrm{Dip}(F)$ is defined in (22).

3. Reject the null hypothesis that feature $j$ is not multimodal if $T_j > c_{n, \tilde{\alpha}}$ where $c_{n, \tilde{\alpha}}$ is the critical value for the dip test.

Any test of multimodality may be used. Here we describe the dip test (Hartigan and Hartigan, 1985). Let $Z_1, \ldots, Z_n \in [0, 1]$ be a sample from a distribution $F$. We want to test "$H_0$: $F$ is unimodal" versus "$H_1$: $F$ is not unimodal." Let $\mathcal{U}$ be the set of unimodal distributions. Hartigan and Hartigan (1985) define
$$
\mathrm{Dip}(F) = \inf_{G \in \mathcal{U}} \sup_x |F(x) - G(x)|. \qquad (22)
$$
If $F$ has a density $p$ we also write $\mathrm{Dip}(F)$ as $\mathrm{Dip}(p)$. Let $F_n$ be the empirical distribution function. The dip statistic is $T_n = \mathrm{Dip}(F_n)$. The dip test rejects $H_0$ if $T_n > c_{n,\alpha}$ where the critical value $c_{n,\alpha}$ is chosen so that, under $H_0$, $\mathbb{P}(T_n > c_{n,\alpha}) \le \alpha$. (Specifically, $c_{n,\alpha}$ can be defined by $\sup_{G \in \mathcal{U}} \mathbb{P}_G(T_n > c_{n,\alpha}) = \alpha$. In practice, $c_{n,\alpha}$ can be defined by $\mathbb{P}_U(T_n > c_{n,\alpha}) = \alpha$ where $U$ is Unif(0,1); Hartigan and Hartigan (1985) suggest that this suffices asymptotically.)

Since we are conducting multiple tests, we cannot test at a fixed error rate $\alpha$. Instead, we replace $\alpha$ with $\tilde{\alpha} = \alpha/(nd)$. That is, we test each marginal and we reject $H_0$ if $T_n > c_{n, \tilde{\alpha}}$. By the union bound, the chance of at least one false rejection of $H_0$ is at most $d\tilde{\alpha} = \alpha/n$.

There are more refined tests such as the excess mass test given in Chan and Hall (2010), building on work by Muller and Sawitzki (1991). For simplicity, we use the dip test here; a fast implementation of the test is available in R.
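A sketch of the screening loop. It assumes a third-party dip test implementation: here the Python package diptest, taken to provide diptest.diptest(x) returning the dip statistic and a p-value (R's diptest package plays the same role). Rejecting when the p-value is below $\tilde{\alpha}$ is used in place of comparing $T_j$ to the critical value $c_{n,\tilde{\alpha}}$.

```python
import numpy as np
import diptest  # third-party package, assumed to provide diptest.diptest(x) -> (dip, pval)

def marginal_screen(X, alpha=0.05):
    """Keep feature j if the dip test rejects unimodality at level alpha/(n*d)."""
    n, d = X.shape
    alpha_tilde = alpha / (n * d)          # the correction used in the notes
    keep = []
    for j in range(d):
        dip, pval = diptest.diptest(X[:, j])
        if pval < alpha_tilde:             # evidence that marginal j is multimodal
            keep.append(j)
    return keep

# example: one clearly bimodal feature among four noise features
rng = np.random.default_rng(0)
n = 500
bimodal = np.r_[rng.normal(-3, 1, n // 2), rng.normal(3, 1, n // 2)]
X = np.c_[bimodal, rng.normal(size=(n, 4))]
print(marginal_screen(X))                  # ideally [0]
```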

Marginal selection can obviously fail. See Figure 28 taken from Wasserman, Azizyan and
Singh (2014).

Sparse k-means. Here we discuss the approach in Witten and Tibshirani (2010). Recall that in k-means clustering we choose $C = \{c_1, \ldots, c_k\}$ to minimize
$$
R_n(C) = \frac{1}{n} \sum_{i=1}^n \|X_i - \Pi_C[X_i]\|^2 = \frac{1}{n} \sum_{i=1}^n \min_{1 \le j \le k} \|X_i - c_j\|^2. \qquad (23)
$$
This is equivalent to minimizing the within sums of squares
$$
\sum_{j=1}^k \frac{1}{n_j} \sum_{s, t \in A_j} d^2(X_s, X_t) \qquad (24)
$$
where $A_j$ is the $j$-th cluster and $d^2(x, y) = \sum_{r=1}^d (x(r) - y(r))^2$ is squared Euclidean distance. Further, this is equivalent to maximizing the between sums of squares
$$
B = \frac{1}{n} \sum_{s,t} d^2(X_s, X_t) - \sum_{j=1}^k \frac{1}{n_j} \sum_{s, t \in A_j} d^2(X_s, X_t). \qquad (25)
$$

Witten and Tibshirani propose to replace the Euclidean norm with the weighted norm $d_w^2(x, y) = \sum_{r=1}^d w_r (x(r) - y(r))^2$. Then they propose to maximize
$$
B = \frac{1}{n} \sum_{s,t} d_w^2(X_s, X_t) - \sum_{j=1}^k \frac{1}{n_j} \sum_{s, t \in A_j} d_w^2(X_s, X_t) \qquad (26)
$$
over $C$ and $w$ subject to the constraints
$$
\|w\|_2 \le 1, \quad \|w\|_1 \le s, \quad w_j \ge 0
$$
where $w = (w_1, \ldots, w_d)$. The optimization is done iteratively by optimizing over $C$, optimizing over $w$, and repeating. See Figure 29.


Figure 28: Three examples, each showing two clusters and two features X(1) and X(2). The
top plots show the clusters. The bottom plots show the marginal density of X(1). Left: The
marginal fails to reveal any clustering structure. This example violates the marginal signature
assumption. Middle: The marginal is multimodal and hence correctly identifies X(1) as a
relevant feature. This example satisfies the marginal signature assumption. Right: In this
case, X(1) is relevant but X(2) is not. Despite the fact that the clusters are close together,
the marginal is multimodal and hence correctly identifies X(1) as a relevant feature. This
example satisfies the marginal signature assumption. (Figure from Wasserman, Azizyan and
Singh, 2014).


The $\ell_1$ norm on the weights causes some of the components of $w$ to be 0, which results in variable selection. There is no theory that shows that this method works.

Sparse Alternate Sum Clustering. Arias-Castro and Pu (arXiv:1602.07277) introduced a method called SAS (Sparse Alternate Sum) clustering. It is very simple and intuitively appealing.

Recall that k-means minimizes
$$
\sum_{\ell} \frac{1}{|C_\ell|} \sum_{i, j \in C_\ell} \|X_i - X_j\|^2.
$$

Suppose we want a clustering based on a subset of features S such that |S| = L. Let
δa (i, j) = (Xi (a) − Xj (a))2 be the pairwise distance for the ath feature. Assume that each

1. Input $X_1, \ldots, X_n$ and $k$.

2. Set $w = (w_1, \ldots, w_d)$ where $w_1 = \cdots = w_d = 1/\sqrt{d}$.

3. Iterate until convergence:

(a) Optimize (26) over $C$ holding $w$ fixed. Find $c_1, \ldots, c_k$ from the k-means algorithm using distance $d_w(X_i, X_j)$. Let $A_j$ denote the $j$-th cluster.

(b) Optimize (26) over $w$ holding $c_1, \ldots, c_k$ fixed. The solution is
$$
w_r = \frac{s_r}{\sqrt{\sum_{t=1}^d s_t^2}}
$$
where
$$
s_r = (a_r - \Delta)_+, \qquad a_r = \left( \frac{1}{n} \sum_{s,t} (X_s(r) - X_t(r))^2 - \sum_{j=1}^k \frac{1}{n_j} \sum_{s, t \in A_j} (X_s(r) - X_t(r))^2 \right)_+
$$
and $\Delta = 0$ if $\|w\|_1 < s$; otherwise $\Delta > 0$ is chosen so that $\|w\|_1 = s$.

Figure 29: The Witten-Tibshirani Sparse k-means Method
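A sketch of the weight update in step 3(b): soft-threshold the per-feature between-cluster sums $a = (a_1, \ldots, a_d)$, renormalize to unit $\ell_2$ norm, and choose $\Delta$ by bisection so that the $\ell_1$ norm is at most $s$ (assuming $1 \le s \le \sqrt{d}$, as in Witten and Tibshirani). The helper names are mine, not from their code.

```python
import numpy as np

def _normalize(v):
    nrm = np.linalg.norm(v)
    return v / nrm if nrm > 0 else v

def update_weights(a, s, tol=1e-8):
    """Maximize sum_r w_r a_r over w >= 0 with ||w||_2 <= 1 and ||w||_1 <= s.
    The solution is w = soft(a, Delta) / ||soft(a, Delta)||_2."""
    a = np.maximum(np.asarray(a, dtype=float), 0.0)
    w = _normalize(a)                     # Delta = 0
    if w.sum() <= s:
        return w
    lo, hi = 0.0, a.max()                 # bisection for Delta
    while hi - lo > tol:
        delta = 0.5 * (lo + hi)
        w = _normalize(np.maximum(a - delta, 0.0))
        if w.sum() > s:
            lo = delta
        else:
            hi = delta
    return w

# a_r would be the between sums of squares of feature r for the current clusters;
# here a made-up example just to show the effect of the l1 bound s:
a = np.array([5.0, 3.0, 1.0, 0.2, 0.1])
print(np.round(update_weights(a, s=1.5), 3))   # most weight on the first features
```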

feature has been standardized so that
$$
\sum_{i,j} \delta_a(i, j) = 1
$$
for all $a$. Define $\delta_S(i, j) = \sum_{a \in S} \delta_a(i, j)$. Then we can say that the goal of sparse clustering is to minimize
$$
\sum_{\ell} \frac{1}{|C_\ell|} \sum_{i, j \in C_\ell} \delta_S(i, j)
$$
over clusterings and subsets. They propose to minimize by alternating between finding clusters and finding subsets. The former is the usual k-means. The latter is trivial because $\delta_S$ decomposes into marginal components. Arias-Castro and Pu also suggest a permutation method for choosing the size of $S$. Their numerical experiments are very promising. Currently, no theory has been developed for this approach.
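A minimal sketch of the SAS alternation (my own simplified version, not Arias-Castro and Pu's code): alternate k-means on the currently selected features with re-selecting the L features that have the smallest within-cluster sums for the current clustering.

```python
import numpy as np
from sklearn.cluster import KMeans

def sas_cluster(X, k, L, n_iter=20, seed=0):
    """Alternate k-means on the selected features with feature re-selection."""
    n, d = X.shape
    X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize the features
    S = np.arange(L)                              # arbitrary initial feature set
    labels = None
    for _ in range(n_iter):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = km.fit_predict(X[:, S])
        # within-cluster sum of squares of each feature for the current clusters
        within = np.zeros(d)
        for c in range(k):
            Xc = X[labels == c]
            within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
        S_new = np.argsort(within)[:L]            # features that cluster best
        if set(S_new) == set(S):
            break
        S = S_new
    return labels, np.sort(S)
```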

8.3 Mosaics

A different idea is to create a partition of features and observations which I like to call a
mosaic. There are papers that cluster features and observations simultaneously but clear
theory is still lacking.

9 Examples

Example 21 Figures 17 and 18 show some synthetic examples where the clusters are meant
to be intuitively clear. In Figure 17 there are two blob-like clusters. Identifying clusters like
this is easy. Figure 18 shows four clusters: a blob, two rings and a half ring. Identifying
clusters with unusual shapes like this is not quite as easy. To the human eye, these certainly
look like clusters. But what makes them clusters?

Example 22 (Gene Clustering) In genomic studies, it is common to measure the expres-


sion levels of d genes on n people using microarrays (or gene chips). The data (after much
simplification) can be represented as an n × d matrix X where Xij is the expression level
of gene j for subject i. Typically d is much larger than n. For example, we might have
d ≈ 5, 000 and n ≈ 50. Clustering can be done on genes or subjects. To find groups of
similar people, regard each row as a data vector so we have n vectors X1 , . . . , Xn each of
length d. Clustering can then be used to place the subjects into similar groups.

Example 23 (Curve Clustering) Sometimes the data consist of a set of curves f1 , . . . , fn


and the goal is to cluster similarly shaped curves together. For example, Figure 30 shows a


Figure 30: Some curves from a dataset of 472 curves. Each curve is a radar waveform from
the Topex/Poseidon satellite.

small sample of curves from a dataset of 472 curves from Frappart (2003). Each curve is a radar waveform from the Topex/Poseidon satellite, which was used to map the surface topography
of the oceans.3 One question is whether the 472 curves can be put into groups of similar
shape.

Example 24 (Supernova Clustering) Figure 31 shows another example of curve cluster-


ing. Briefly, each data point is a light curve, essentially brightness versus time. The top two
plots show the light curves for two types of supernovae called “Type Ia” and “other.” The
bottom two plots show what happens if we throw away the labels (“Type Ia” and “other”)
and apply a clustering algorithm (k-means clustering). We see that the clustering algorithm
almost completely recovers the two types of supernovae.

3
See http://topex-www.jpl.nasa.gov/overview/overview.html. The data are available at “Work-
ing Group on Functional and Operator-based Statistics” a web site run by Frederic Ferrarty
and Philippe Vieu. The address is http://www.math.univ-toulouse.fr/staph/npfda/. See also
http://podaac.jpl.nasa.gov/DATA CATALOG/topexPoseidoninfo.html.


Figure 31: Light curves for supernovae. The top two plots show the light curves for two
types of supernovae. The bottom two plots show the results of clustering the curves into two
groups, without using knowledge of their labels.

10 Bibliographic Remarks

k-means clustering goes back to Stuart Lloyd who apparently came up with the algorithm in
1957 although he did not publish it until 1982. See [?]. Another key reference is [?]. Similar
ideas appear in [?]. The related area of mixture models is discussed at length in McLachlan
and Basford (1988). k-means is actually related to principal components analysis; see Ding
and He (2004) and Zha, He, Ding, Simon and Gu (2001). The probabilistic behavior of
random geometric graphs is discussed in detail in [?].
