The KNN algorithm classifies new data points based on their distance to labeled examples in a training set. It assumes that similar things exist in close proximity, so it finds the k nearest neighbors of a new point and assigns the most common label of those neighbors. The distance metric, usually Euclidean distance, is used to calculate distances between data points. Choosing the optimal k value, which determines the number of neighbors considered, is important but challenging, as different k values can lead to different classifications of test points.
K-Nearest Neighbours (K-NN) Algorithm
Classification: K-Nearest Neighbours (K-NN)
Prof. (Dr.) Honey Sharma

KNN Classification
The KNN algorithm hinges on the assumption that similar things exist in close proximity; in other words, similar data points are near to each other: "Birds of a feather flock together." KNN captures this idea of similarity (sometimes called distance, proximity, or closeness) mathematically, by calculating the distance between points in the feature space. The algorithm stores the training data and classifies new test data using a distance metric: it finds the k nearest neighbors of the test point and assigns the class label held by the majority of those neighbors.

Consider the following figure. Suppose we have plotted the data points from our training set in a two-dimensional feature space, giving a total of 6 data points (3 red and 3 blue). Red data points belong to 'class1' and blue data points belong to 'class2'. The yellow data point represents the new point whose class is to be predicted. Since its nearest neighbors are red, we say it belongs to 'class1'.

Distance Metrics
The distance metric is the key hyperparameter through which we measure how far a new test input is from the stored training points. The Euclidean distance is the most widely used measure for computing the distance between a test sample and the training values.

How to choose a K value?
Selecting the K value that gives the maximum accuracy of the model is always challenging for a data scientist. K is the number of nearest neighbors considered. For every prediction, distances must be computed between the test point and all labelled training points; because this work is deferred to prediction time rather than done during training, KNN is called a lazy learning algorithm. As the image above illustrates, if we proceed with K=3 the test input is predicted to belong to class B, whereas with K=7 it is predicted to belong to class A. This shows how strongly the value of K can affect KNN's performance.
● There is no pre-defined statistical method for finding the most favorable value of K.
● Initialize K to some value and start computing.
● A small value of K leads to unstable decision boundaries.
● A larger value of K is generally better for classification, as it smooths the decision boundaries.
● Plot the error rate against K over a defined range of values, then choose the K with the minimum error rate (see the sketch after this list).
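The slides do not include code, so the following is a minimal from-scratch sketch in Python of the ideas above: Euclidean distance, majority vote among the k nearest training points, and a simple error-rate scan over K on a held-out set. The toy data points, the validation set, and the helper names (euclidean, knn_predict) are invented purely for illustration and are not part of the presentation.

import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Lazy learning: nothing is fitted up front; all distances are computed
    # at prediction time against the stored training data.
    neighbours = sorted(zip(train_X, train_y), key=lambda p: euclidean(p[0], query))
    k_labels = [label for _, label in neighbours[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Toy 2-D training set mirroring the slide's figure: 3 red points ('class1')
# and 3 blue points ('class2').
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6)]
train_y = ['red', 'red', 'red', 'blue', 'blue', 'blue']

# New (yellow) point to classify.
print(knn_predict(train_X, train_y, (1.5, 1.5), k=3))   # -> 'red'

# Choosing K: evaluate the error rate on a small held-out set for a range of
# K values and keep the K with the minimum error, as the error-rate-vs-K plot suggests.
val_X = [(2, 2), (5, 6), (7, 7), (1, 3)]
val_y = ['red', 'blue', 'blue', 'red']

errors = {}
for k in range(1, 6):
    wrong = sum(knn_predict(train_X, train_y, x, k) != y for x, y in zip(val_X, val_y))
    errors[k] = wrong / len(val_y)

best_k = min(errors, key=errors.get)
print(errors, "best k:", best_k)

In practice a library implementation (for example, scikit-learn's KNeighborsClassifier) would normally be used; the from-scratch version is shown only to make the distance computation and the majority vote explicit.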