ML Lec-10
LECTURE-10
BY
Dr. Ramesh Kumar Thakur
Assistant Professor (II)
School Of Computer Engineering
v The K-Nearest Neighbors (KNN) algorithm is a popular machine learning technique used for
classification and regression tasks.
v It relies on the idea that similar data points tend to have similar labels or values.
v During the training phase, the KNN algorithm stores the entire training dataset as a reference.
v When making predictions, it calculates the distance between the input data point and all the training
examples, using a chosen distance metric such as Euclidean distance.
v Next, the algorithm identifies the K nearest neighbors to the input data point based on their distances.
v In the case of classification, the algorithm assigns the most common class label among the K
neighbors as the predicted label for the input data point.
v For regression, it calculates the average or weighted average of the target values of the K neighbors
to predict the value for the input data point.
v The KNN algorithm is straightforward and easy to understand, making it a popular choice in various
domains.
v However, its performance can be affected by the choice of K and the distance metric, so careful
parameter tuning is necessary for optimal results.
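The workflow described above can be sketched in a few lines of Python. This is a minimal illustration assuming scikit-learn is available; the Iris dataset and K = 3 are arbitrary choices, not prescribed by the lecture:

```python
# Minimal KNN classification sketch (assumes scikit-learn; Iris and K = 3 are illustrative choices).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# "Training" simply stores the data; distances to all stored points are computed at prediction time.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)

print("Test accuracy:", knn.score(X_test, y_test))
# For regression, KNeighborsRegressor averages the target values of the K neighbors instead.
```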
v The KNN algorithm can be used for both classification and regression predictive problems.
v However, it is more widely used for classification problems in industry. To evaluate any technique, we
generally look at 3 important aspects:
v 1. Ease of interpreting output
v 2. Calculation time
v 3. Predictive Power
v We intend to find out the class of the blue star (BS). BS can belong either to the class RC (red circles) or
GS (green squares) and nothing else. The “K” in the KNN algorithm is the number of nearest neighbors we
wish to take the vote from. Let’s say K = 3. Hence, we will now draw a circle centered on BS, just big
enough to enclose only three data points on the plane, as shown below:
v The three closest points to BS are all RC. Hence, with a good confidence level, we can say that BS
should belong to the class RC. Here, the choice is obvious, as all three votes from the closest
neighbors went to RC. The choice of the parameter K is very crucial in this algorithm.
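A small from-scratch sketch of this K = 3 vote; the coordinates below are invented purely to mirror the blue-star example and are not taken from the slide’s figure:

```python
# From-scratch K = 3 majority vote (toy, made-up coordinates).
import math
from collections import Counter

train = [((1.0, 1.0), "RC"), ((1.2, 0.8), "RC"), ((0.9, 1.3), "RC"),
         ((3.0, 3.0), "GS"), ((3.2, 2.8), "GS")]
bs = (1.1, 1.0)  # the "blue star" query point

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Sort training points by distance to BS, keep the 3 nearest, then take a majority vote.
nearest = sorted(train, key=lambda point: euclidean(point[0], bs))[:3]
votes = Counter(label for _, label in nearest)
print(votes.most_common(1)[0][0])  # -> "RC": all three votes go to RC
```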
v In the classification setting, the K-nearest neighbor algorithm essentially boils down to forming a majority
vote between the K most similar instances to a given “unseen” observation. Similarity is defined
according to a distance metric between two data points. A popular one is the Euclidean distance method:
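For two $n$-dimensional points $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, the Euclidean distance is

$$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$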
v K-NN is also a lazy learner because it doesn’t learn a discriminative function from the training data
but “memorizes” the training dataset instead.
v 1) There is no structured method to find the best value for “K”. We need to experiment with various
values by trial and error, since the best K for a given dataset is not known in advance.
v 2) Smaller values of K can be noisy: individual neighbors, including mislabeled ones, have a higher
influence on the result.
v 3) Larger values of K give smoother decision boundaries, which means lower variance but
increased bias. They are also computationally more expensive.
v 4) Another way to choose K is through cross-validation. One approach is to carve a validation set out of
the training dataset: take a small portion of the training data, call it the validation set, and use it to
evaluate different possible values of K. We predict the label of every instance in the validation set
with K = 1, K = 2, K = 3, and so on, then look at which value of K gives the best performance on the
validation set. That value is taken as the final setting of the algorithm, so we are minimizing the
validation error (see the sketch after this list).
v 5) In practice, a common rule of thumb is k = sqrt(N), where N is the number of samples in the
training dataset.
v 6) Try to keep the value of k odd in order to avoid ties between two classes of data.
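A minimal sketch of the validation-set procedure from point 4, assuming scikit-learn; the dataset and the candidate K values are illustrative choices:

```python
# Choosing K on a held-out validation set (illustrative sketch, assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Carve a validation set out of the training data.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

best_k, best_acc = None, 0.0
for k in (1, 3, 5, 7, 9):  # odd candidates, to avoid ties between two classes
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc

print("Best K on the validation set:", best_k)
```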
v Advantages of KNN
1. Simple to implement
2. Flexible to feature/distance choices
3. Naturally handles multi-class cases
4. Can do well in practice with enough representative data
v Disadvantages of KNN
1. Need to determine the value of parameter K (number of nearest neighbors)
2. Computation cost is quite high because we need to compute the distance of each query instance to all
training samples.
3. High storage requirement, since the entire training dataset must be kept.
4. We must have a meaningful distance function for the features.
v Data splitting divides a dataset into three main subsets: the training set, used to train the model; the
validation set, used to tune hyperparameters and detect overfitting; and the testing set, used to assess
the model’s performance on new, unseen data.
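A minimal sketch of such a three-way split, assuming scikit-learn; the 70/15/15 proportions are an arbitrary illustrative choice:

```python
# Train / validation / test split (illustrative 70/15/15 proportions, assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold out the test set, then split the remainder into training and validation sets.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # roughly 70% / 15% / 15% of the data
```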
v Each subset serves a unique purpose in the iterative process of developing a machine-learning model. The
importance of the train-validation-test split is as follows:
v Overfitting Prevention
v Overfitting occurs when a model learns the training data too well, capturing noise and irrelevant patterns.
The validation set acts as a checkpoint, allowing overfitting to be detected. By evaluating the
model’s performance on a separate dataset, you can adjust the model’s complexity, regularization, or other
hyperparameters to prevent overfitting and improve generalization.
v Performance Evaluation
v The testing set provides the final check of a machine learning model’s performance. After training and
validation, the model is evaluated on the testing set, which simulates real-world scenarios with data it has
never seen. A model that performs well on the testing set has successfully generalized to new, unseen
data. This step is important for gaining confidence in deploying the model for real-world applications.