
Introduction to machine learning

K Nearest Neighbours

1
Introduction to machine learning
K Nearest Neighbors -

a. The KNN classifier is also a non-parametric and instance-based learning algorithm.
I. Non-parametric means it makes no assumptions about the distribution of the data and thus avoids the risk of mis-specifying the underlying distribution. For example, suppose the data is non-Gaussian but the learning model assumes a Gaussian form; in that case, the algorithm would make extremely poor predictions.
II. Instance-based learning means that the algorithm does not explicitly learn a model. It simply memorizes (keeps in RAM) the training instances, which are subsequently used to predict the classes of unseen data. Minimal training, but high cost at testing time!

b. For classification, the algorithm takes a majority vote among the K most
similar instances to a given "unseen" observation (see the sketch after this list). K is a positive integer.

c. Suited for classification problems where the relationships between features and target classes
are complex and difficult to understand, yet items within a class tend
to be fairly homogeneous in their attribute values.

d. Not suitable if the data is noisy and the target classes do not have a clear
demarcation in terms of attribute values.
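
A minimal sketch of this majority-vote idea in scikit-learn. The tiny toy arrays and K = 3 below are illustrative assumptions, not part of the slides:

# Illustrative sketch of KNN classification with scikit-learn (toy data assumed)
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
                    [6.0, 9.0], [1.2, 0.9], [5.5, 7.5]])
y_train = np.array([0, 0, 1, 1, 0, 1])      # class labels of the training instances

knn = KNeighborsClassifier(n_neighbors=3)    # K = 3: majority vote among the 3 nearest points
knn.fit(X_train, y_train)                    # "training" is just storing the instances

x_unseen = np.array([[1.1, 1.5]])            # a new, unseen observation
print(knn.predict(x_unseen))                 # predicted class, by majority vote
print(knn.predict_proba(x_unseen))           # vote proportions interpreted as probabilities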

2
Introduction to machine learning

K Nearest Neighbors -

e. The training data is represented by the scattered data points in the feature
space.
f. The color of the data points indicates the class they belong to.
g. The grey point is the query point whose class has to be determined.

3
Introduction to machine learning

K Nearest Neighbors – (Similarity Measurements)

a. Similarity is measured by the distance between points, typically using the Euclidean method (see the sketch below).

b. Other distance measures include Manhattan distance, Minkowski distance,
Mahalanobis distance, Bhattacharyya distance, etc.
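
As an illustration only (the two points below are assumed), a few of these distances can be computed directly in NumPy:

# Illustrative sketch: Euclidean, Manhattan and Minkowski distances between two assumed points
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))             # sqrt of the sum of squared differences
manhattan = np.sum(np.abs(p - q))                     # sum of absolute differences
minkowski3 = np.sum(np.abs(p - q) ** 3) ** (1 / 3)    # Minkowski distance with p = 3

print(euclidean, manhattan, minkowski3)               # 5.0, 7.0, ~4.5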

4
Introduction to machine learning

K Nearest Neighbors based classifications -

a. The distance formula is highly dependent on how the features / attributes /
dimensions are measured.

b. Dimensions with a larger possible range of values will dominate the
result of the distance calculation when using the Euclidean formula.

c. To ensure all dimensions have a similar scale, we normalize the data on all
dimensions / attributes.

d. There are multiple ways of normalizing the data. We will use Z-score
standardization (sketched below).
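
A sketch of Z-score standardization, done by hand and with scikit-learn's StandardScaler. The sample matrix is an assumption for illustration:

# Illustrative sketch of Z-score standardization: (x - mean) / std, per dimension
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[170.0, 60000.0],
              [160.0, 45000.0],
              [180.0, 52000.0]])             # two features on very different scales

X_manual = (X - X.mean(axis=0)) / X.std(axis=0)   # manual Z-score

scaler = StandardScaler()                    # same transform; fit on training data,
X_scaled = scaler.fit_transform(X)           # reuse the fitted scaler on test data

print(np.allclose(X_manual, X_scaled))       # True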

5
Introduction to machine learning

K Nearest Neighbors based classifications –

There are many distance calculation formulas in the scikit-learn package (a usage sketch follows the references below):

1. Minkowski distance
2. Euclidean distance
3. Manhattan distance
4. Chebyshev distance
5. Mahalanobis distance
6. Inner product
7. Cosine similarity
8. Pearson correlation
9. Hamming distance
10. Jaccard similarity
11. Edit distance or Levenshtein distance

Ref:
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html
http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/
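
A brief sketch of how a few of these metrics can be accessed through the DistanceMetric class referenced above. The import path matches the linked documentation; in newer scikit-learn releases the same class lives under sklearn.metrics:

# Illustrative sketch: pairwise distances under a few metrics from scikit-learn
import numpy as np
from sklearn.neighbors import DistanceMetric   # sklearn.metrics.DistanceMetric in newer versions

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

for name in ["euclidean", "manhattan", "chebyshev"]:
    dist = DistanceMetric.get_metric(name)
    print(name)
    print(dist.pairwise(X))                    # 3 x 3 matrix of pairwise distances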
6
Introduction to machine learning

K Nearest Neighbors - (Methodology)

a. Technically, given a positive integer K, an unseen observation x and a similarity
metric d, the KNN algorithm performs two steps:

b. It computes the distance of the data point x from all the other data points in the
training set, arranges them in ascending order, and takes the top K observations. Let this
set be A. K is usually odd.

c. It estimates the conditional probability P(Y = j | X = x) = (1/K) · Σ over i in A of I(y_i = j)

d. I is an indicator function which returns 1 if a point in the set A of K items is from class j, and 0 otherwise. In
simple language: what proportion of the K neighbors belongs to each class. The proportion is the
probability, and the sum of all the probabilities is 1 (see the sketch after this list).

e. The data point x is assigned the class with the maximum probability.

f. An alternate way of understanding KNN is as a method that calculates decision
boundaries in the feature space, a.k.a. Voronoi regions or tessellations.
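
The two steps and the conditional probability above can be written directly in NumPy. The toy data and K = 3 are assumptions for illustration:

# Illustrative from-scratch sketch of the KNN steps described above
import numpy as np

def knn_predict(X_train, y_train, x, K=3):
    # Step 1: distances from x to every training point, sorted ascending, keep top K (set A)
    distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    A = np.argsort(distances)[:K]

    # Step 2: conditional probability P(Y = j | X = x) = (1/K) * sum over A of I(y_i = j)
    classes = np.unique(y_train)
    probs = np.array([np.mean(y_train[A] == j) for j in classes])

    # Assign the class with the maximum probability
    return classes[np.argmax(probs)], dict(zip(classes, probs))

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1, 0])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), K=3))   # class 0; probabilities sum to 1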

7
Introduction to machine learning

K Nearest Neighbors - (Voronoi Diagram / Tessellations)

a. The Voronoi diagram is formed from lines that bisect, and are perpendicular to,
the lines connecting two neighboring data points
b. Each point s has a Voronoi cell V(s) consisting of all points closer to s than to
any other point

Voronoi boundaries created using the nearest neighbor method, i.e. K = 1

8
Introduction to machine learning

K Nearest Neighbors - (K the magic)

a. How to pick the right K? K can range from 1 to the number of training data points!
b. The value of K can affect the performance of the classifier
c. K in KNN is a hyperparameter. It has to be discovered through iteration!
d. Since we will be evaluating a hyperparameter, we need to ensure the data is split
into three sets, i.e. training, validation and testing.
e. The iteration to find K should use only the training and validation data (see the sketch after this list)
f. We can imagine K as a way of influencing the shape of the boundary between
classes
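
A sketch of the iteration over candidate K values using a separate validation split. The synthetic data and the candidate range are assumptions:

# Illustrative sketch: pick K on a validation set, then report accuracy on the held-out test set
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=5, random_state=0)   # assumed synthetic data
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_k, best_acc = None, -1.0
for k in range(1, 26, 2):                    # odd K values only; an assumed search range
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc

print("best K on validation:", best_k)
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("test accuracy:", final.score(X_test, y_test))   # test data used only once, at the end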

9
Introduction to machine learning
K Nearest Neighbors - (K and Voronoi boundaries)

Image Source : https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/

For K = 1:
a. K = 1 creates Voronoi boundaries based on individual points. Each point has a region around
itself as its domain.
b. The boundaries have sharp bends and there are many islands. The surface represents a complex
model likely to suffer from variance error.

For a large K:
a. A large value of K means a larger spread of data points is considered to decide the boundary.
b. The boundary will be relatively smooth with little or no sharp turns. Islands will be
minimized and variance error will be low, but bias error increases.
10
Introduction to machine learning

K Nearest Neighbors based classifications -


Advantages -
1. Makes no assumptions about the distributions of classes in the feature space
2. Works for multi-class problems directly
3. Easy to implement and understand
4. Relatively robust to outliers (for larger values of K)

Disadvantages -
1. Fixing the optimal value of K is a challenge
2. Will not be effective when the class distributions overlap
3. Does not output a model; it calculates distances for every new point (lazy learner)
4. Computationally intensive (O(D·N²) for pairwise distances); this can be addressed using KD-tree
algorithms, which take time to build (a sketch follows)
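
As noted in point 4, a KD-tree index trades preparation time for faster queries; in scikit-learn this is a constructor option. The data below is an assumption for illustration:

# Illustrative sketch: build a KD-tree index instead of brute-force neighbor search
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))                 # assumed data; low dimensionality suits KD-trees
y = (X[:, 0] + X[:, 1] > 0).astype(int)

knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
knn.fit(X, y)                                    # tree construction happens here
print(knn.predict(rng.normal(size=(3, 3))))      # subsequent queries avoid scanning all points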

11
Introduction to machine learning

K Nearest Neighbors based classifications -

Lab 3 - Model the given data to predict the type of breast cancer

Description - Sample data is available at
https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)

Creator:
Dr. William H. Wolberg (physician)
University of Wisconsin Hospitals, Madison, Wisconsin, USA

Donor:
Olvi Mangasarian (mangasarian '@' cs.wisc.edu)
Received by David W. Aha (aha '@' cs.jhu.edu)

The dataset has 10 attributes listed below:
1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10

Sol: KNN+Breast+Cancer+Modeling.ipynb
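
A hedged outline of how the lab could be approached. The local file name, the shortened column names, and dropping the '?' values in Bare Nuclei are assumptions; refer to the solution notebook for the actual workflow:

# Illustrative outline for the lab; file name, column names and preprocessing are assumptions
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

cols = ["id", "clump_thickness", "cell_size", "cell_shape", "adhesion",
        "epithelial_size", "bare_nuclei", "chromatin", "nucleoli", "mitoses", "class"]
df = pd.read_csv("breast-cancer-wisconsin.data", names=cols, na_values="?").dropna()

X = df.drop(columns=["id", "class"])             # drop the id; it carries no signal
y = df["class"]                                   # 2 = benign, 4 = malignant in the UCI coding

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

scaler = StandardScaler().fit(X_train)            # Z-score standardization, fitted on training data
knn = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_train), y_train)
print("test accuracy:", knn.score(scaler.transform(X_test), y_test))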

12
