Machine learning Lecture 02
• k-Nearest Neighbor
• Locally weighted regression
• Lazy and eager learning
Nearest Neighbor Algorithm
• Learning Algorithm:
– Store training examples
• Prediction Algorithm:
– To classify a new example x, find the training
example (xi, yi) that is nearest to x
– Predict the class y = yi
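As a concrete illustration, a minimal 1-nearest-neighbor classifier might look like the sketch below (NumPy-based; the class name OneNN and its method names are my own, not from the lecture):

```python
import numpy as np

class OneNN:
    def fit(self, X, y):
        # "Training" is just storing the examples.
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        return self

    def predict(self, x):
        # Find the stored example nearest to x and return its label.
        d = np.linalg.norm(self.X - np.asarray(x, dtype=float), axis=1)
        return self.y[np.argmin(d)]
```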
K-Nearest Neighbor Methods
• To classify a new input vector x, examine the k closest training data points to x
and assign x to the most frequently occurring class among them
[Figure: a query point x and its nearest neighbors for k = 1 and k = 5]
Each line segment is equidistant between two points of opposite classes. The
more examples that are stored, the more complex the decision boundaries can
become.
Instance-Based Learning
Key idea: just store all training examples $\langle x_i, f(x_i) \rangle$
Nearest neighbor (1-nearest neighbor):
Given query instance $x_q$, locate the nearest training example $x_n$, estimate
$\hat{f}(x_q) \leftarrow f(x_n)$
k-nearest neighbor:
Given $x_q$, take a vote among its k nearest neighbors (if the target function is
discrete-valued)
Take the mean of the f values of the k nearest neighbors (if real-valued):
$\hat{f}(x_q) \leftarrow \dfrac{\sum_{i=1}^{k} f(x_i)}{k}$
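A small sketch covering both cases (majority vote for a discrete-valued target, mean for a real-valued one); the function name knn_predict and its arguments are illustrative, not from the lecture:

```python
import numpy as np
from collections import Counter

def knn_predict(Xtrain, ytrain, xq, k=5, regression=False):
    """Predict f(xq) from the k nearest training examples."""
    d = np.linalg.norm(Xtrain - xq, axis=1)            # Euclidean distance to xq
    nbrs = np.argsort(d)[:k]                           # indices of the k closest points
    if regression:
        return ytrain[nbrs].mean()                     # mean of f over the neighbors
    return Counter(ytrain[nbrs]).most_common(1)[0][0]  # most frequent class
```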
Distance-Weighted k-NN
Might want to weight nearer neighbors more heavily ...
$\hat{f}(x_q) \leftarrow \dfrac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}$
where
$w_i \equiv \dfrac{1}{d(x_q, x_i)^2}$
and $d(x_q, x_i)$ is the distance between $x_q$ and $x_i$
Note: now it makes sense to use all training examples
instead of just k
→ using all examples in this way is known as Shepard's method
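A sketch of the distance-weighted rule for a real-valued target, assuming NumPy arrays; passing k=None uses every training example, i.e. Shepard's method (the function name is my own):

```python
import numpy as np

def weighted_knn(Xtrain, ytrain, xq, k=None):
    """Distance-weighted prediction with w_i = 1 / d(xq, x_i)^2."""
    d = np.linalg.norm(Xtrain - xq, axis=1)
    if np.any(d == 0):                     # query coincides with a training point
        return ytrain[np.argmin(d)]
    idx = np.argsort(d) if k is None else np.argsort(d)[:k]
    w = 1.0 / d[idx] ** 2                  # nearer neighbors get larger weights
    return np.dot(w, ytrain[idx]) / w.sum()
```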
Nearest Neighbor
When to Consider
– Instances map to points in ℝⁿ
– Less than 20 attributes per instance
– Lots of training data
Advantages
– Training is very fast
– Learn complex target functions
– Do not lose information
Disadvantages
– Slow at query time
– Easily fooled by irrelevant attributes
Issues
• Distance measure
– Most common: Euclidean
• Choosing k
– Increasing k reduces variance, increases bias
• For high-dimensional space, problem that the nearest
neighbor may not be very close at all!
• Memory-based technique. Must make a pass through
the data for each classification. This can be prohibitive
for large data sets.
Distance
• Notation: object i is described by p measurements
$x^i = (x^i_1, x^i_2, \ldots, x^i_p)$
• Euclidean distance:
$d(i,j) = \sqrt{\sum_{k=1}^{p} \left( x^i_k - x^j_k \right)^2}$
• Efficiency trick: using the squared Euclidean distance gives the same nearest
neighbors and avoids computing the square root.
• ED makes sense when the different measurements are commensurate, i.e. each
variable is measured in the same units.
• If the measurements are different, say length and weight, it is not clear how
they should be combined.
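For example, a nearest-neighbor search can compare squared distances directly (an illustrative sketch, not code from the lecture):

```python
import numpy as np

def nearest_index(Xtrain, xq):
    # Squared Euclidean distance: the argmin is unchanged, and no sqrt is needed.
    d2 = np.sum((Xtrain - xq) ** 2, axis=1)
    return int(np.argmin(d2))
```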
Standardization
When variables are not commensurate, we can standardize
them by dividing by the sample standard deviation. This
makes them all equally important.
The estimate of the standard deviation of $x_k$:
$\hat{\sigma}_k = \left[ \frac{1}{n} \sum_{i=1}^{n} \left( x^i_k - \bar{x}_k \right)^2 \right]^{1/2}$
where $\bar{x}_k$ is the sample mean:
$\bar{x}_k = \frac{1}{n} \sum_{i=1}^{n} x^i_k$
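A minimal sketch of this standardization, assuming the data sit in NumPy arrays; np.std uses the same 1/n convention as the estimate above:

```python
import numpy as np

def standardize(Xtrain, Xquery):
    """Divide every feature by its training-set standard deviation."""
    sigma = Xtrain.std(axis=0)       # sigma_k, computed with the 1/n convention
    sigma[sigma == 0] = 1.0          # guard against constant features
    return Xtrain / sigma, Xquery / sigma
```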
Weighted Euclidean distance
Finally, if we have some idea of the relative importance of
each variable, we can weight them:
$d_{WE}(i,j) = \left[ \sum_{k=1}^{p} w_k \left( x^i_k - x^j_k \right)^2 \right]^{1/2}$
One option: weight each feature by its mutual information with the class.
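A one-line sketch of the weighted distance, where the weight vector w could hold, for instance, per-feature mutual-information scores (illustrative, not lecture code):

```python
import numpy as np

def weighted_euclidean(xi, xj, w):
    # d_WE(i, j) with per-feature weights w_k
    return np.sqrt(np.sum(w * (xi - xj) ** 2))
```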
Other Distance Metrics
• Minkowski or $L_\lambda$ metric:
$d(i,j) = \left( \sum_{k=1}^{p} \left| x_k(i) - x_k(j) \right|^{\lambda} \right)^{1/\lambda}$
• Manhattan, city block or $L_1$ metric:
$d(i,j) = \sum_{k=1}^{p} \left| x_k(i) - x_k(j) \right|$
• Euclidean distance is the $L_2$ metric ($\lambda = 2$)
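The whole Minkowski family is easy to compute directly; in the sketch below (my own naming), lam=1 gives the Manhattan distance and lam=2 the Euclidean distance:

```python
import numpy as np

def minkowski(xi, xj, lam=2.0):
    # L_lambda metric between two feature vectors
    return np.sum(np.abs(xi - xj) ** lam) ** (1.0 / lam)
```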
K-NN and irrelevant features
[Figure: scatter of '+' and 'o' training points around a query point '?', illustrating how an irrelevant feature can make the query's nearest neighbors come from the wrong class.]
Efficient Indexing: Kd-trees
• A kd-tree is similar to a decision tree, except that we split using the
median value along the dimension having the highest variance, and
points are stored
• A kd-tree is a tree with the following properties
– Each node represents a rectilinear region (faces aligned with axes)
– Each node is associated with an axis aligned plane that cuts its region
into two, and it has a child for each sub-region
– The directions of the cutting planes alternate with depth
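For illustration, one can query a prebuilt tree with SciPy's scipy.spatial.KDTree instead of coding one from scratch; note that SciPy splits with a sliding-midpoint rule rather than the median/highest-variance rule described above, but the idea of fast neighbor lookup is the same:

```python
import numpy as np
from collections import Counter
from scipy.spatial import KDTree

def knn_with_kdtree(Xtrain, ytrain, xq, k=5):
    tree = KDTree(Xtrain)            # build once, reuse for many queries
    _, idx = tree.query(xq, k=k)     # indices of the k nearest stored points
    return Counter(ytrain[np.atleast_1d(idx)]).most_common(1)[0][0]
```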
Edited Nearest Neighbor
• Storing all of the training examples can require a huge amount of
memory. Select a subset of points that still give good classifications.
– Incremental deletion. Loop through the training data and test each
point to see if it can be correctly classified given the other points. If
so, delete it from the data set.
– Incremental growth. Start with an empty data set. Add each point to
the data set only if it is not correctly classified by the points already
stored.
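A sketch of the incremental-growth variant, assuming NumPy arrays and any nearest-neighbor classifier predict(Xstored, ystored, x), such as the knn_predict sketch earlier; names are my own:

```python
import numpy as np

def incremental_growth(X, y, predict):
    """Keep a point only if the points stored so far misclassify it."""
    keep = [0]                                      # seed with the first example
    for i in range(1, len(X)):
        if predict(X[keep], y[keep], X[i]) != y[i]:
            keep.append(i)                          # stored set failed, so keep x_i
    return X[keep], y[keep]
```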
KNN Advantages
• Easy to program
• No optimization or training required
• Classification accuracy can be very good; can
outperform more complex models
Nearest Neighbor Summary
• Advantages
– variable-sized hypothesis space
– Learning is extremely efficient
• however growing a good kd-tree can be expensive
– Very flexible decision boundaries
• Disadvantages
– distance function must be carefully chosen
– Irrelevant or correlated features must be eliminated
– Typically cannot handle more than 30 features
– Computational costs: Memory and classification-time
computation
Locally Weighted Linear
Regression: LWLR
• Idea:
– k-NN forms local approximation for each query
point xq
– Why not form an explicit approximation $\hat{f}$ for the
region surrounding xq?
• Fit a linear function to the k nearest neighbors
• Fit a quadratic, ...
• Thus producing a "piecewise approximation" to $f$
– Minimize error over the k nearest neighbors of xq
– Minimize error over the entire set of examples, weighting by distance
– Combine the two above
LWLR: Continued
$\mathrm{Error}_1(x_q) \equiv \frac{1}{2} \sum_{x \in k \text{ nearest nbrs of } x_q} \left( f(x) - \hat{f}(x) \right)^2$

$\mathrm{Error}_2(x_q) \equiv \frac{1}{2} \sum_{x \in D} \left( f(x) - \hat{f}(x) \right)^2 K\!\left( d(x_q, x) \right)$

$\mathrm{Error}_3(x_q) \equiv \frac{1}{2} \sum_{x \in k \text{ nearest nbrs of } x_q} \left( f(x) - \hat{f}(x) \right)^2 K\!\left( d(x_q, x) \right)$
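A sketch that minimizes the Error2 criterion in closed form as a weighted least-squares fit; the Gaussian kernel K(d) = exp(-d² / (2τ²)) and the bandwidth tau are my own choices, since the slides leave K unspecified:

```python
import numpy as np

def lwlr_predict(Xtrain, ytrain, xq, tau=1.0):
    """Fit a linear model at xq, weighting each example by K(d(xq, x))."""
    d = np.linalg.norm(Xtrain - xq, axis=1)
    w = np.exp(-d ** 2 / (2 * tau ** 2))                 # kernel weights K(d)
    A = np.hstack([np.ones((len(Xtrain), 1)), Xtrain])   # add an intercept column
    sw = np.sqrt(w)
    # Weighted least squares: minimize sum_x w(x) * (f(x) - f_hat(x))^2
    coef, *_ = np.linalg.lstsq(A * sw[:, None], ytrain * sw, rcond=None)
    return float(np.dot(np.r_[1.0, xq], coef))
```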
LWR Example
[Figure: training data fit with a simple (global) regression $f_1$ and with locally weighted regression $f_2$; the predicted value at the query point is shown for each.]
Does it matter?
• An eager learner must commit to a single global approximation
• A lazy learner can create many different local approximations
• If they use the same hypothesis space H, the lazy learner can effectively represent more complex functions