Elements of Statistical Learning II - Ch.6 Kernel Smoothing Methods - Notes
Applications
- (Non-parametric) Regression
- Unsupervised Learning – Density estimation
- (Non-parametric) Classification
- To do better: Gaussian mixture models (used for both density estimation and classification)
These techniques are most appropriate when the dimension of the predictor is low (p < 3), for
example for data visualization.
Further info
- A class of regression techniques that achieve flexibility in estimating the
regression function f(X) over the domain Rp by fitting a different but simple model
separately at each query point x0.
- This is done by using only those observations close to the target point x0 to fit the
simple model, and in such a way that the resulting estimated function f̂(X) is
smooth in Rp.
o This localization is achieved via a weighting function or kernel Kλ(x0, xi),
which assigns a weight to xi based on its distance from x0 (sketched in code after this list).
o The kernels Kλ are typically indexed by a parameter λ that dictates the
width of the neighbourhood.
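As a concrete illustration, a minimal Python sketch of one standard choice, the Epanechnikov quadratic kernel Kλ(x0, x) = D(|x − x0|/λ) with D(t) = 3/4 (1 − t²) for |t| ≤ 1 and 0 otherwise (the sample points are made up):

```python
import numpy as np

def epanechnikov(x0, x, lam):
    """K_lam(x0, x) = D(|x - x0| / lam) with the Epanechnikov
    quadratic kernel D(t) = 3/4 (1 - t^2) for |t| <= 1, else 0."""
    t = np.abs(x - x0) / lam
    return np.where(t <= 1, 0.75 * (1 - t ** 2), 0.0)

# weights around a query point x0 = 0.5: nearby xi get larger weight,
# points farther than lam away get weight 0
xi = np.array([0.1, 0.4, 0.5, 0.6, 0.9])
print(epanechnikov(0.5, xi, lam=0.25))  # [0.  0.63 0.75 0.63 0. ]
```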
Training
- These memory-based methods require in principle little or no training; all the work
gets done at evaluation time.
- The only parameter that needs to be determined from the training data is λ. The
model, however, is the entire training data set.
- The model is the entire training data set because every observation enters
Kλ(x0, xi) through xi. The query point x0 can be any point on the continuous
x-axis; by sweeping x0 across the axis we construct the function approximation
(see the sketch below).
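A minimal sketch of this evaluation-time fitting, assuming a Gaussian kernel and synthetic 1-D data (all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))           # stored training inputs
y = np.sin(4 * x) + rng.normal(0, 0.3, 100)   # stored training responses

def f_hat(x0, lam=0.1):
    """Nadaraya-Watson kernel-weighted average at query point x0."""
    w = np.exp(-0.5 * ((x - x0) / lam) ** 2)  # Gaussian K_lam(x0, xi)
    return np.sum(w * y) / np.sum(w)

# "training" only stored the data; the work happens now, at evaluation time
grid = np.linspace(0, 1, 200)
curve = np.array([f_hat(x0) for x0 in grid])
```

Shrinking lam gives a wigglier fit (higher variance, lower bias); widening it oversmooths (lower variance, higher bias), which is exactly the tradeoff noted next.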
Note: There is a natural bias–variance tradeoff as we change the width of the averaging
window, which is most explicit for local averages:
o f̂(x0) = Σi Kλ(x0, xi) yi / Σi Kλ(x0, xi) (the Nadaraya-Watson kernel-weighted average)
o With a Gaussian kernel, the basis expansion f(x) = Σj βj Kλ(ξj, x) is an RBF network
Note: Radial basis functions form the bridge between modern "kernel methods" in machine
learning (e.g. kernel SVMs) and local fitting technology (i.e. kernel smoothing methods).
Training (RBF networks):
- Similar to the procedure explained below for RBFs represented as a neural network,
except we use OLS to estimate the betas (see the sketch after this list):
o Get the centres ξj from k-means clustering (one way to do it)
o λ can be determined from the kernel (e.g. for a Gaussian kernel it is the standard
deviation)
o OLS for the betas βj
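A minimal sketch of this recipe on synthetic data; the shared-bandwidth heuristic (mean pairwise distance between centres) is my assumption for illustration, not a prescription from the book:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (200, 1))                    # synthetic inputs
y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.3, 200)  # noisy responses

# 1) centres xi_j from k-means clustering
centres = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X).cluster_centers_

# 2) lambda from the kernel: one shared Gaussian standard deviation,
#    chosen here by a heuristic (an assumption for this sketch)
lam = cdist(centres, centres)[np.triu_indices(10, k=1)].mean()

# 3) OLS for the betas on the basis matrix Phi_ij = K_lam(xi_j, x_i)
Phi = np.exp(-cdist(X, centres) ** 2 / (2 * lam ** 2))
betas, *_ = np.linalg.lstsq(Phi, y, rcond=None)

f_hat = Phi @ betas  # fitted values at the training points
```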
Classification
- Probability that xi belongs to class/component m: the responsibility rim is the value
of weighted component m (term m of the linear combination f(xi)) divided by the whole
sum f(xi)
o e.g. xi is Age and f(x) is the mixture density defined above under density estimation
o Suppose we threshold each responsibility ri2 and hence define δi = I(ri2 > 0.5),
assigning observation i to component 2 when its responsibility exceeds 0.5 (see the
sketch below)
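A minimal sketch using scikit-learn's GaussianMixture on synthetic two-group "Age" data (the data and parameters are made up):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# synthetic 1-D predictor (e.g. Age) drawn from two latent groups
x = np.concatenate([rng.normal(35, 5, 100), rng.normal(60, 6, 100)]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(x)
r = gm.predict_proba(x)              # responsibilities r_im
delta = (r[:, 1] > 0.5).astype(int)  # delta_i = I(r_i2 > 0.5)
```

Note that which fitted component ends up labelled 2 is arbitrary, so in practice you check the fitted means before interpreting δi.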
Additional notes
Computational Considerations
- Kernel smoothing, local regression, and kernel density estimation are memory-based
methods: the model is the entire training set, and the fit at a single query point x0
costs O(N) flops. For many real-time applications, this can make this class of methods
computationally infeasible.