K Nearest Neighbours
Introduction to machine learning
a. KNN is a supervised, instance-based learning algorithm: an observation is
classified by looking at the training examples closest to it in the feature
space
b. For classification, the algorithm takes a majority vote among the K training
instances most similar to a given "unseen" observation, where K is a positive
integer (a usage sketch follows this list)
c. Suited for classification problems where the relationship between features
and target classes is complex and difficult to model explicitly, yet items
within a class tend to be fairly homogeneous in their attribute values
d. Not suitable if the data is noisy and the target classes have no clear
demarcation in terms of attribute values
e. The training data is represented by the scattered data points in the feature
space
f. The color of the data points indicate the class they belong to
g. The grey point is the query point who's class has to be fixed
a. Similarity between points is measured by the distance between them,
typically using the Euclidean method (see the sketch after this list)
b. Dimensions with a larger range of possible values will dominate the result
of the Euclidean distance calculation
c. To ensure all the dimensions have a similar scale, we normalize the data on
all the dimensions / attributes
d. There are multiple ways of normalizing the data. We will use Z-score
standardization (sketched below)
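A small sketch tying points a-d together (the toy numbers are invented): the
Euclidean distance formula, how an attribute with a large range dominates it,
and how Z-score standardization (z = (x - mean) / std) restores balance:

    import numpy as np

    # Euclidean distance: d(p, q) = sqrt(sum_i (p_i - q_i)^2)
    def euclidean(p, q):
        return np.sqrt(np.sum((p - q) ** 2))

    # Toy data: column 0 is height in cm, column 1 is income -- very
    # different ranges.
    X = np.array([[180.0, 70000.0],
                  [165.0, 30000.0],
                  [172.0, 52000.0]])

    # The raw distance is dominated almost entirely by the income column.
    print(euclidean(X[0], X[1]))   # ~40000; the height difference barely registers

    # Z-score standardization per column: every attribute now has mean 0, std 1.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    print(euclidean(Z[0], Z[1]))   # both attributes contribute comparably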
Commonly used distance / similarity measures:
1. Minkowski distance
2. Euclidean distance
3. Manhattan distance
4. Chebyshev distance
5. Mahalanobis distance
6. Inner product
7. Cosine similarity
8. Pearson correlation
9. Hamming distance
10. Jaccard similarity
11. Edit distance or Levenshtein distance
Ref:
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html
http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/
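A few of these metrics computed with scipy.spatial.distance, as a quick sketch
(the vectors are arbitrary examples):

    from scipy.spatial import distance

    u, v = [1, 0, 2], [2, 1, 0]

    print(distance.minkowski(u, v, p=3))   # Minkowski with p = 3
    print(distance.euclidean(u, v))        # Euclidean (Minkowski, p = 2)
    print(distance.cityblock(u, v))        # Manhattan (Minkowski, p = 1)
    print(distance.chebyshev(u, v))        # Chebyshev (max coordinate difference)
    print(distance.cosine(u, v))           # 1 - cosine similarity
    print(distance.hamming(u, v))          # fraction of positions that differ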
b. The algorithm computes the distance of the query point x from all the data
points in the training set, arranges them in ascending order, and takes the
top K observations; call this set A. K is usually odd so that the vote cannot
tie
e. The class frequencies within A give an empirical probability for each class,
and x is assigned the class with the maximum probability, i.e. the majority
class in A (a from-scratch sketch follows)
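A from-scratch sketch of these steps (the function and variable names are my
own):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x, k=3):
        # Distance of x from every training point.
        dists = np.linalg.norm(X_train - x, axis=1)
        # Indices of the K nearest neighbours, by ascending distance: the set A.
        A = np.argsort(dists)[:k]
        # Empirical probability of each class within A ...
        probs = {cls: n / k for cls, n in Counter(y_train[A]).items()}
        # ... and x gets the class with maximum probability.
        return max(probs, key=probs.get)

    X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.0], [4.2, 3.9]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_predict(X_train, y_train, np.array([1.1, 0.9])))   # -> 0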
a. The Voronoi diagram is formed from lines that perpendicularly bisect the
lines connecting two neighboring points
b. Each point s has a Voronoi cell V(s) consisting of all locations closer to s
than to any other point; for K = 1, these cells are exactly the decision
regions
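A quick sketch of computing the cells with scipy (the points are arbitrary
examples):

    import numpy as np
    from scipy.spatial import Voronoi

    points = np.array([[0, 0], [2, 0], [0, 2], [2, 2], [1, 1]])
    vor = Voronoi(points)

    print(vor.vertices)       # coordinates of the Voronoi vertices
    print(vor.ridge_points)   # pairs of input points separated by each ridge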
a. How to pick the right K? K can range from 1 to the number of training data
points!
b. The value of K can affect the performance of the classifier
c. K in KNN is a hyperparameter. It has to be discovered through iteration!
d. Since we will be evaluating a hyperparameter, we need to ensure the data is
split into three sets, i.e. training, validation and testing
e. The iteration to find K should involve only the training and validation data
(see the sketch after this list)
f. We can imagine K as a way of influencing the shape of the boundary between
classes
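A sketch of that tuning loop (the synthetic dataset and split ratios are my own
choices):

    # 60/20/20 split into training, validation and test sets; the test set is
    # held out entirely and would be used only once, at the very end.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=300, n_features=4, random_state=0)
    X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp,
                                                      test_size=0.25,
                                                      random_state=0)

    # Try odd K values only; keep the one that scores best on validation data.
    best_k, best_acc = None, 0.0
    for k in range(1, 16, 2):
        acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train) \
                                                 .score(X_val, y_val)
        if acc > best_acc:
            best_k, best_acc = k, acc

    print(best_k, best_acc)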
K Nearest Neighbors - (K and Voronoi boundaries)
b. With a small K, the boundaries have sharp bends and there are many islands;
the surface represents a complex model, likely to suffer from variance error
c. With a large K, the boundary will be relatively smooth with little or no
sharp turns; islands will be minimized and variance error will be low, but
bias error increases
Disadvantages -
1. Fixing the optimal value of K is a challenge
2. Will not be effective when the class distributions overlap
3. Does not output a model; it calculates distances afresh for every new point
(a "lazy learner")
4. Computationally intensive (O(D * N^2)); this can be mitigated with k-d tree
indexes, which take time to build up front (see the sketch below)
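As a sketch of that mitigation, scikit-learn's KNeighborsClassifier accepts
algorithm="kd_tree", which spends time building the tree index during fit() so
that each subsequent neighbour query is much cheaper than a brute-force scan
(the random data is illustrative):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    X = np.random.rand(1000, 3)          # 1000 training points in 3-D
    y = (X[:, 0] > 0.5).astype(int)

    # fit() builds the k-d tree; predict() then queries it per point.
    knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree").fit(X, y)
    print(knn.predict([[0.2, 0.5, 0.7]]))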