Elements of Statistical Learning II - Ch.6 Kernel Smoothing Methods - Notes

Ch.6 Kernel Smoothing Methods

Applications

- (Non-parametric) Regression
- Unsupervised Learning – Density estimation
- (Non-parametric) Classification
- To do better on density estimation and classification: Gaussian mixture models (see below)

Kernel Smoothing Regression

Summary – Kernel smoother


A kernel smoother is a statistical technique to estimate a real-valued function f: Rp -> R as the
weighted average of neighbouring observed data. The weight is defined by the kernel, such that
closer points are given higher weights. The estimated function is smooth, and the level of
smoothness is set by a single parameter.

This technique is most appropriate when the dimension of the predictor is low (p < 3), for
example for data visualization.

Further info
- A class of regression techniques that achieve flexibility in estimating the
regression function f(X) over the domain Rp by fitting a different but simple model
separately at each query point x0.
- This is done by using only those observations close to the target point x0 to fit the
simple model, and in such a way that the resulting estimated function ˆf(X) is
smooth in Rp.
o This localization is achieved via a weighting function or kernel Kλ(x0, xi),
which assigns a weight to xi based on its distance from x0.
o The kernels Kλ are typically indexed by a parameter λ that dictates the
width of the neighbourhood.
Training
- These memory-based methods require in principle little or no training; all the work
gets done at evaluation time.
- The only parameter that needs to be determined from the training data is λ. The
model, however, is the entire training data set.

Simple kernel smoothing (locally weighted average of our points)


Mathematically (for a 1-d kernel smoother), the Nadaraya–Watson kernel-weighted average is

ˆf(x0) = Σi Kλ(x0, xi) yi / Σi Kλ(x0, xi),  with Kλ(x0, x) = D(|x − x0| / hλ(x0))

- The model is the entire training data set, because every observation enters through Kλ(x0, xi). x0 is any query point on the continuous x-axis; by sweeping x0 across the axis we construct the whole function approximation (see the sketch below).
- hλ(x0) is a width function (indexed by λ) that determines the width of the neighbourhood at x0.
o It can be adaptive or constant (e.g. hλ(x0) = λ for the Epanechnikov quadratic kernel).
- E.g. D(t) = ¾ * (1 – t²) if |t| ≤ 1, 0 otherwise (Epanechnikov quadratic kernel).
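As an illustration (not from the book), a minimal NumPy sketch of the Nadaraya–Watson smoother with the Epanechnikov kernel; the toy sine data and all names are my own:

import numpy as np

def epanechnikov(t):
    # D(t) = 3/4 * (1 - t^2) for |t| <= 1, and 0 otherwise
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def nadaraya_watson(x0, x, y, lam):
    # Locally weighted average of the y's around the query point x0,
    # with constant width h_lambda(x0) = lam
    w = epanechnikov(np.abs(x - x0) / lam)
    return np.sum(w * y) / np.sum(w) if np.sum(w) > 0 else np.nan

# Toy example: smooth noisy samples of sin(x)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
grid = np.linspace(0, 2 * np.pi, 200)
f_hat = np.array([nadaraya_watson(x0, x, y, lam=0.5) for x0 in grid])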
Extra Kernel Smoothing Regression parts

Kernel smoothing of local regressions (kernel-weighted local regression; an improvement to simple kernel smoothing)
- Local Linear regression
- Local Polynomial regression

What are they


- With kernel smoothing, instead of taking a raw moving average we essentially take a smoothly varying, locally weighted average (through the kernel weighting)
- We now go one step further: instead of taking the locally weighted average around a point, we fit a weighted local regression there (and repeat this for every query point)

What do they help with?


We have progressed from the raw moving average to a smoothly varying locally
weighted average by using kernel weighting. The smooth kernel fit still has problems,
however, as exhibited in Figure 6.3 (left panel). Locally weighted averages can be badly
biased on the boundaries of the domain, because of the asymmetry of the kernel in that
region. By fitting straight lines rather than constants locally, we can remove this bias
exactly to first order; see Figure 6.3 (right panel). Actually, this bias can be present in the
interior of the domain as well, if the X values are not equally spaced (for the same
reasons, but usually less severe). Again locally weighted linear regression will make a
first-order correction.

To summarize some collected wisdom on this issue:


- Local linear fits can help bias dramatically at the boundaries at a modest cost in
variance. Local quadratic fits do little at the boundaries for bias, but increase the
variance a lot.
- Local quadratic fits tend to be most helpful in reducing bias due to curvature in
the interior of the domain.
- Asymptotic analysis suggests that local polynomials of odd degree dominate those of even degree. This is largely due to the fact that asymptotically the MSE is dominated by boundary effects.
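A minimal sketch (my own illustrative code, not the book's) of locally weighted linear regression: at each query point x0 we solve a kernel-weighted least squares problem for an intercept and slope, then evaluate the fitted line at x0.

import numpy as np

def epanechnikov(t):
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def local_linear(x0, x, y, lam):
    # Weighted least squares fit of a straight line around x0
    w = epanechnikov(np.abs(x - x0) / lam)
    B = np.column_stack([np.ones_like(x), x])      # local basis (1, x)
    sw = np.sqrt(w)
    # Re-weight rows by sqrt(w) and solve an ordinary least squares problem
    beta, *_ = np.linalg.lstsq(B * sw[:, None], y * sw, rcond=None)
    return beta[0] + beta[1] * x0                  # value of the local line at x0

Compared with the locally weighted average above, this local linear fit removes the first-order bias at the boundaries, as discussed in the text.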

Selecting width of kernel


In each of the kernels Kλ, λ is a parameter that controls its width:
- For the Epanechnikov or tri-cube kernel with metric width, λ is the radius of the
support region.
- For the Gaussian kernel, λ is the standard deviation.
- For k-nearest neighbourhoods, λ is the number k of nearest neighbors, often expressed as a fraction or span k/N of the total training sample.

Note: There is a natural bias–variance tradeoff as we change the width of the averaging window, which is most explicit for local averages:
- If the window is narrow, ˆf(x0) is an average of a small number of yi close to x0, so its variance will be relatively large, while the bias will tend to be small.
- If the window is wide, the variance of ˆf(x0) will be small relative to the variance of any yi, but the bias will be higher, because observations further from x0 are used, and f(x) there may differ from its value at x0.
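In practice a common way to make this tradeoff is to choose λ by cross-validation. A self-contained sketch (my own code, reusing the same Epanechnikov Nadaraya–Watson smoother and leave-one-out CV over a grid of candidate widths):

import numpy as np

def nw_predict(x0, x, y, lam):
    # Nadaraya-Watson smoother with the Epanechnikov kernel
    t = np.abs(x - x0) / lam
    w = np.where(t <= 1, 0.75 * (1 - t**2), 0.0)
    return np.sum(w * y) / np.sum(w) if np.sum(w) > 0 else np.nan

def loocv_error(x, y, lam):
    # Leave-one-out CV: predict each y_i from the remaining N-1 points
    errs = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        pred = nw_predict(x[i], x[mask], y[mask], lam)
        if not np.isnan(pred):
            errs.append((y[i] - pred) ** 2)
    return np.mean(errs)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
candidate_lams = np.linspace(0.1, 2.0, 20)
best_lam = min(candidate_lams, key=lambda lam: loocv_error(x, y, lam))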

Structured local regression


- Structured kernels
- Structured regression functions

Kernel Density Estimation

- The (Parzen) kernel density estimate at a point x0 is ˆf(x0) = (1 / (N λ)) Σi Kλ(x0, xi).
- e.g. with the Gaussian kernel (1D): ˆf(x0) = (1/N) Σi φλ(x0 – xi), where φλ is the Gaussian density with mean zero and standard deviation λ.
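A small sketch (my own) of the 1-d Gaussian kernel density estimate; for real use, scipy.stats.gaussian_kde does essentially the same thing with an automatic bandwidth:

import numpy as np

def gaussian_kde_1d(x0, data, lam):
    # Parzen estimate: average of Gaussian densities (std. dev. lam) centred at the data
    z = (x0 - data) / lam
    phi = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return np.sum(phi) / (data.size * lam)

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1.0, 300), rng.normal(3, 0.5, 200)])
grid = np.linspace(-6, 6, 400)
density = np.array([gaussian_kde_1d(g, data, lam=0.4) for g in grid])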


Radial Basis functions (RBF)
- Note: Kernels can be thought of as radial functions/kernels (functions whose value depends only on the distance between the input (xi or x) and some fixed point (x0 or ξj))
- Using kernels as basis functions -> radial basis functions
o i.e. transforming the data into another feature space before applying an algorithm
- They can be used, for example, for least squares regression, or for any other algorithm that assumes linearity in the feature space (where, without the RBF transformation of the features, we would not expect linearity)
o e.g. PCA (kernel PCA), k-means (kernel k-means; spectral clustering), etc.

E.g. Kernel Linear Regression model

o ˆf(x) = Σj βj Kλ(ξj, x), i.e. a linear expansion in kernel basis functions centred at prototype points ξj
o With the Gaussian kernel this is an RBF network

Note: Radial basis functions form the bridge between the modern “kernel methods” (ML –
e.g. kernel SVM) and local fitting technology (i.e. kernel smoothing methods)

Training:
- Similar to what is explained below for RBFs represented as neural networks, except that we use OLS to estimate the betas:
o Get the ξj from the centres of k-means clustering (one way to do it)
o λ can be determined from the kernel (e.g. for the Gaussian kernel it is the standard deviation)
o OLS for the betas

Explained as Neural Networks


https://en.wikipedia.org/wiki/Radial_basis_function_network
Training
RBF networks are typically trained from pairs of input and target values x(t), y(t) by a two-step algorithm.
In the first step, the center vectors ci of the RBF functions in the hidden layer are chosen. This
step can be performed in several ways; centers can be randomly sampled from some set of
examples, or they can be determined using k-means clustering. Note that this step
is unsupervised.
The second step simply fits a linear model with coefficients wi to the hidden layer's outputs with
respect to some objective function. A common objective function, at least for regression/function
estimation, is the least squares function:
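A minimal sketch of that two-step recipe (my own code; assumes scikit-learn is available for the k-means step, and uses a Gaussian RBF with a common width):

import numpy as np
from sklearn.cluster import KMeans

def rbf_features(x, centers, lam):
    # One Gaussian RBF feature per centre xi_j, with common width lam
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * lam ** 2))

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Step 1 (unsupervised): choose the centres xi_j via k-means on the inputs
centers = KMeans(n_clusters=10, n_init=10, random_state=0).fit(
    x.reshape(-1, 1)).cluster_centers_.ravel()

# Step 2 (supervised): ordinary least squares for the output weights beta
Phi = rbf_features(x, centers, lam=0.5)
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Predictions at new points
x_new = np.linspace(0, 2 * np.pi, 100)
y_hat = rbf_features(x_new, centers, lam=0.5) @ beta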

Kernels and RBFs in SVMs: http://cs229.stanford.edu/notes/cs229-notes3.pdf (Andrew Ng lecture notes)

Mixture Models for Density Estimation and Classification

Density estimation, f(x)


- f(x) is a linear combination (mixture) of densities of x, each with its own parameters (e.g. mean and covariance matrix), weighted by mixing proportions αm that sum to one
o Gaussian mixture model – Gaussian densities with different means and covariance matrices: f(x) = Σm αm φ(x; μm, Σm)
o Fit by maximum likelihood, using the EM algorithm (Ch.8)
o Each component density can be viewed as a kind of kernel

Classification
- The probability that observation xi belongs to class/component m is the responsibility rim, i.e. the weighted component m (term m of the mixture) divided by the whole mixture f(xi); for the Gaussian mixture, rim = αm φ(xi; μm, Σm) / Σk αk φ(xi; μk, Σk)
o e.g. xi is age, and f(x) is as defined above under density estimation
o Suppose we threshold each value ri2 and hence define δi = I(ri2 > 0.5)
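A quick sketch using scikit-learn's GaussianMixture (EM under the hood); the two-group "age"-like toy data is made up for illustration:

import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 1-d "age"-like data drawn from two latent groups
rng = np.random.default_rng(0)
ages = np.concatenate([rng.normal(35, 5, 300), rng.normal(60, 8, 150)]).reshape(-1, 1)

# Fit a two-component Gaussian mixture by maximum likelihood (EM)
gm = GaussianMixture(n_components=2, random_state=0).fit(ages)

density = np.exp(gm.score_samples(ages))    # mixture density f(x_i) at each observation
r = gm.predict_proba(ages)                  # responsibilities r_im for each component
delta = (r[:, 1] > 0.5).astype(int)         # delta_i = I(r_i2 > 0.5), thresholding component 2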

Additional notes
Computational Considerations
- Kernel and local regression and density estimation are memory-based methods.
For many real-time applications, this can make this class of methods
(computationally) infeasible.
