Machine learning Lecture 02

The document discusses instance-based learning, focusing on k-Nearest Neighbor (k-NN) methods, which classify new inputs based on the closest training examples. It highlights advantages such as fast training and flexibility in decision boundaries, while also addressing challenges like the curse of dimensionality and the impact of irrelevant features. Additionally, it covers techniques like locally weighted regression and the differences between lazy and eager learning approaches.


Instance-Based Learning

• k-Nearest Neighbor
• Locally weighted regression
• Lazy and eager learning
Nearest Neighbor Algorithm
• Learning Algorithm:
– Store training examples
• Prediction Algorithm:
– To classify a new example x, find the training example (x_i, y_i) that is nearest to x
– Predict the class y = y_i (a minimal sketch follows)
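A minimal 1-NN sketch in Python (not from the slides; the class and function names are illustrative), assuming the examples live in NumPy arrays and Euclidean distance is used:

import numpy as np

class NearestNeighbor:
    """1-NN: memorize the training set, predict the label of the closest stored example."""

    def fit(self, X, y):
        # "Training" is just storing the examples.
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        return self

    def predict_one(self, x):
        # Euclidean distance from the query to every stored example.
        dists = np.linalg.norm(self.X - np.asarray(x, dtype=float), axis=1)
        return self.y[np.argmin(dists)]

Usage would look like NearestNeighbor().fit(X_train, y_train).predict_one(x_query), where X_train, y_train, and x_query are placeholders for the caller's data.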
K-Nearest Neighbor Methods
• To classify a new input vector x, examine the k-closest training data points to x
and assign the object to the most frequently occurring class

[Figure: query point x shown with its k = 1 and k = 5 neighborhoods.]

• Common values for k: 3, 5


Decision Boundaries
• The nearest neighbor algorithm does not explicitly compute decision
boundaries. However, the decision boundaries form a subset of the Voronoi
diagram for the training data.

1-NN Decision Surface

• Each line segment is equidistant between two points of opposite classes. The more examples that are stored, the more complex the decision boundaries can become.
Instance-Based Learning
Key idea: just store all training examples ⟨x_i, f(x_i)⟩
Nearest neighbor (1-Nearest Neighbor):
– Given query instance x_q, locate the nearest training example x_n, then estimate

  \hat{f}(x_q) \leftarrow f(x_n)

k-Nearest Neighbor:
– Given x_q, take a vote among its k nearest neighbors (if the target function is discrete-valued)
– Take the mean of the f values of its k nearest neighbors (if the target function is real-valued):

  \hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} f(x_i)}{k}
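A compact sketch of both rules above (illustrative names, not from the slides): a majority vote for a discrete-valued target and the mean of the f values for a real-valued one:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_q, k=5, discrete=True):
    """Estimate f(x_q) from the k nearest training examples."""
    dists = np.linalg.norm(np.asarray(X_train, dtype=float) - np.asarray(x_q, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]                   # indices of the k closest examples
    values = np.asarray(y_train)[nearest]
    if discrete:
        return Counter(values).most_common(1)[0][0]   # majority vote
    return float(values.mean())                       # mean of the f values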
Distance-Weighted k-NN
Might want to weight nearer neighbors more heavily:

  \hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}

where

  w_i \equiv \frac{1}{d(x_q, x_i)^2}

and d(x_q, x_i) is the distance between x_q and x_i.
Note: now it makes sense to use all training examples instead of just k
→ Shepard's method
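A sketch of the distance-weighted rule (illustrative names; the zero-distance check is an assumption for the case where the query coincides with a stored example):

import numpy as np

def weighted_knn_predict(X_train, y_train, x_q, k=5):
    """Distance-weighted k-NN estimate of a real-valued target."""
    X = np.asarray(X_train, dtype=float)
    y = np.asarray(y_train, dtype=float)
    dists = np.linalg.norm(X - np.asarray(x_q, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    d = dists[nearest]
    if np.any(d == 0):                  # query exactly matches a stored example
        return float(y[nearest][d == 0][0])
    w = 1.0 / d**2                      # w_i = 1 / d(x_q, x_i)^2
    return float(np.sum(w * y[nearest]) / np.sum(w))

Setting k = len(X_train) turns this into Shepard's method: all training examples contribute, each with a distance-based weight.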
Nearest Neighbor
When to Consider
– Instances map to points in R^n
– Less than 20 attributes per instance
– Lots of training data
Advantages
– Training is very fast
– Learn complex target functions
– Do not lose information
Disadvantages
– Slow at query time
– Easily fooled by irrelevant attributes
Issues
• Distance measure
– Most common: Euclidean
• Choosing k
– Increasing k reduces variance, increases bias
• In high-dimensional spaces, the nearest neighbor may not be very close at all!
• Memory-based technique. Must make a pass through
the data for each classification. This can be prohibitive
for large data sets.
Distance
• Notation: an object with p measurements is written

  x^{(i)} = \left( x_1^{(i)}, x_2^{(i)}, \ldots, x_p^{(i)} \right)

• The most common distance metric is Euclidean distance:

  d_E\!\left( x^{(i)}, x^{(j)} \right) = \left( \sum_{k=1}^{p} \left( x_k^{(i)} - x_k^{(j)} \right)^2 \right)^{1/2}

• Efficiency trick: using squared Euclidean distance gives the same answer (same ordering of neighbors) and avoids computing the square root.
• Euclidean distance makes sense when the different measurements are commensurate, i.e., each variable is measured in the same units.
• If the measurements are different, say length and weight, the appropriate scaling is not clear.
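A small illustration of the squared-distance trick (function name is an assumption): the argmin over squared distances picks the same neighbor while skipping the square root:

import numpy as np

def nearest_index(X, x_q):
    """Index of the nearest stored example, compared by squared Euclidean distance."""
    diffs = np.asarray(X, dtype=float) - np.asarray(x_q, dtype=float)
    sq_dists = np.einsum('ij,ij->i', diffs, diffs)   # per-row sum of squared differences
    return int(np.argmin(sq_dists))                  # same argmin as the true distance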
Standardization
When variables are not commensurate, we can standardize
them by dividing by the sample standard deviation. This
makes them all equally important.
The estimate for the standard deviation of x_k is

  \hat{\sigma}_k = \left( \frac{1}{n} \sum_{i=1}^{n} \left( x_k^{(i)} - \bar{x}_k \right)^2 \right)^{1/2}

where \bar{x}_k is the sample mean:

  \bar{x}_k = \frac{1}{n} \sum_{i=1}^{n} x_k^{(i)}
Weighted Euclidean distance
Finally, if we have some idea of the relative importance of
each variable, we can weight them:
  d_{WE}(i, j) = \left( \sum_{k=1}^{p} w_k \left( x_k^{(i)} - x_k^{(j)} \right)^2 \right)^{1/2}

One option: weight each feature by its mutual information with the class.
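One way that option might look in code (a sketch, assuming scikit-learn is available; mutual_info_classif supplies one weight per feature):

import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mutual_info_weighted_distance(X_train, y_train):
    """Build a weighted Euclidean distance with per-feature mutual-information weights."""
    w = mutual_info_classif(X_train, y_train)     # estimated MI between each feature and the class
    def dist(a, b):
        diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
        return float(np.sqrt(np.sum(w * diff**2)))
    return dist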
Other Distance Metrics
• Minkowski or L_\lambda metric:

  d(i, j) = \left( \sum_{k=1}^{p} \left| x_k(i) - x_k(j) \right|^{\lambda} \right)^{1/\lambda}

• Manhattan, city block or L_1 metric:

  d(i, j) = \sum_{k=1}^{p} \left| x_k(i) - x_k(j) \right|

• L_\infty metric:

  d(i, j) = \max_{k} \left| x_k(i) - x_k(j) \right|
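All three metrics fit in one short function (a sketch; the parameter name lam is illustrative, and lam = np.inf selects the max metric):

import numpy as np

def minkowski(a, b, lam=2.0):
    """L_lambda distance: lam=1 is Manhattan, lam=2 is Euclidean, lam=inf is the max metric."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    if np.isinf(lam):
        return float(diff.max())
    return float(np.sum(diff**lam) ** (1.0 / lam))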
The Curse of Dimensionality
• Nearest neighbor breaks down in high-dimensional spaces because the
“neighborhood” becomes very large.
• Suppose we have 5000 points uniformly distributed in the unit
hypercube and we want to apply the 5-nearest neighbor algorithm.
• Suppose our query point is at the origin.
– 1D: on a one-dimensional line, we must go a distance of 5/5000 = 0.001 on average to capture the 5 nearest neighbors
– 2D: in two dimensions, we must go sqrt(0.001) ≈ 0.032 to get a square that contains 0.001 of the volume
– dD: in d dimensions, we must go (0.001)^(1/d) along each axis
Curse of Dimensionality cont.
• With 5000 points in 10 dimensions, we must go 0.501 distance along
each attribute in order to find the 5 nearest neighbors!
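A quick numeric check of these figures (a sketch; the fraction 0.001 is 5 neighbors out of 5000 uniform points):

# Edge length of a sub-cube holding a fraction 0.001 of the unit hypercube's volume.
frac = 5 / 5000
for d in (1, 2, 3, 10, 100):
    print(d, round(frac ** (1 / d), 3))
# prints 0.001, 0.032, 0.1, 0.501, 0.933 -- in high dimensions the "neighborhood" spans most of each axis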
K-NN and Irrelevant Features
• "Irrelevant features" are data attributes that do not significantly contribute to the prediction task, but are still included in the distance calculations.
[Figure: training examples (+ and o) spread along a single relevant feature, with a query point "?" among the o's.]
K-NN and Irrelevant Features (cont.)
[Figure: the same + and o examples plotted against a second, irrelevant feature, with the query "?" shown in the two-dimensional space.]
Efficient Indexing: Kd-trees
• A kd-tree is similar to a decision tree, except that each split uses the median value along the dimension having the highest variance, and the data points themselves are stored in the tree
• A kd-tree is a tree with the following properties
– Each node represents a rectilinear region (faces aligned with axes)
– Each node is associated with an axis aligned plane that cuts its region
into two, and it has a child for each sub-region
– The directions of the cutting planes alternate with depth
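For the lookup itself, an off-the-shelf kd-tree can be used; a minimal sketch with SciPy (an assumption — the slides do not prescribe a library, and SciPy's splitting rule differs in detail from the one described above):

import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
X = rng.random((5000, 3))          # stored training points (toy data)
tree = KDTree(X)                   # build the index once

x_q = np.array([0.5, 0.5, 0.5])
dists, idx = tree.query(x_q, k=5)  # 5 nearest neighbors without scanning all points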
Edited Nearest Neighbor
• Storing all of the training examples can require a huge amount of memory. Select a subset of points that still gives good classifications (a sketch of the incremental-growth option follows this list).
– Incremental deletion. Loop through the training data and test each
point to see if it can be correctly classified given the other points. If
so, delete it from the data set.
– Incremental growth. Start with an empty data set. Add each point to
the data set only if it is not correctly classified by the points already
stored.
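A sketch of the incremental-growth option (names are illustrative); a point is kept only when the points stored so far misclassify it under the 1-NN rule:

import numpy as np

def incremental_growth(X, y):
    """Condensed training set: add a point only if the current subset misclassifies it."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    keep_X, keep_y = [X[0]], [y[0]]                   # seed with the first example
    for x, label in zip(X[1:], y[1:]):
        dists = np.linalg.norm(np.array(keep_X) - x, axis=1)
        if keep_y[int(np.argmin(dists))] != label:    # 1-NN on the kept subset is wrong
            keep_X.append(x)
            keep_y.append(label)
    return np.array(keep_X), np.array(keep_y)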
KNN Advantages
• Easy to program
• No optimization or training required
• Classification accuracy can be very good; can
outperform more complex models
Nearest Neighbor Summary
• Advantages
– Variable-sized hypothesis space
– Learning is extremely efficient
• however growing a good kd-tree can be expensive
– Very flexible decision boundaries
• Disadvantages
– Distance function must be carefully chosen
– Irrelevant or correlated features must be eliminated
– Typically cannot handle more than 30 features
– Computational costs: Memory and classification-time
computation
Locally Weighted Linear Regression (LWLR)
• Idea:
– k-NN forms a local approximation to f for each query point x_q
– Why not form an explicit approximation f̂ over the region surrounding x_q?
• Fit a linear function to the k nearest neighbors
• Fit a quadratic, ...
• This produces a "piecewise approximation" to f
– Choices for the error to minimize:
• Error over the k nearest neighbors of x_q
• Error over the entire set of examples, weighted by distance
• A combination of the two above
LWLR: Continued
• Linear regression:

  \hat{f}(x) = w_0 + \sum_{i=1}^{n} w_i a_i(x)

  Error = \frac{1}{2} \sum_{x \in D} \left( f(x) - \hat{f}(x) \right)^2

  Error_1 = \frac{1}{2} \sum_{x \in k \text{ nearest nbrs of } x_q} \left( f(x) - \hat{f}(x) \right)^2

  Error_2 = \frac{1}{2} \sum_{x \in D} \left( f(x) - \hat{f}(x) \right)^2 K\!\left( d(x_q, x) \right)

  Error_3 = \frac{1}{2} \sum_{x \in k \text{ nearest nbrs of } x_q} \left( f(x) - \hat{f}(x) \right)^2 K\!\left( d(x_q, x) \right)
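A sketch of minimizing Error_3 for a single query (illustrative; a Gaussian kernel stands in for K, and the weighted least-squares fit uses NumPy):

import numpy as np

def lwlr_predict(X, y, x_q, k=20, bandwidth=1.0):
    """Locally weighted linear prediction at x_q, fit on its k nearest neighbors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_q = np.asarray(x_q, dtype=float)
    d = np.linalg.norm(X - x_q, axis=1)
    nbrs = np.argsort(d)[:k]
    w = np.exp(-(d[nbrs] / bandwidth) ** 2)                  # kernel K(d(x_q, x))
    A = np.hstack([np.ones((len(nbrs), 1)), X[nbrs]])        # intercept w_0 plus linear terms
    sw = np.sqrt(w)                                          # weighted least squares via scaling
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y[nbrs] * sw, rcond=None)
    return float(np.concatenate(([1.0], x_q)) @ coef)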
LWR Example
[Figure: comparison of f1 (simple regression) against locally weighted regression fits f2, f3, and f4, showing the training data, the predicted value using simple regression, and the predicted value using locally weighted (piece-wise) regression. Yike Guo, Advanced Knowledge Management, 2000.]
Lazy and Eager Learning
• Lazy: wait for the query before generalizing
– k-Nearest Neighbor
• Eager: generalize before seeing the query
– ID3, Backpropagation, etc.

Does it matter?
• An eager learner must commit to a single global approximation
• A lazy learner can create many local approximations
• If they use the same hypothesis space H, the lazy learner can represent more complex functions
