Lecture Material 12

K-Nearest Neighbours (KNN)


The K-Nearest Neighbors (KNN) algorithm is a supervised machine learning method used to tackle both classification and regression problems.

It is widely applicable in real-life scenarios because it is non-parametric: it makes no underlying assumptions about the distribution of the data.

The KNN algorithm works by finding the K nearest neighbors to a given data point according to a distance metric, such as the Euclidean distance. The class (for classification) or value (for regression) of the data point is then determined by a majority vote or the average of those K neighbors. This approach allows the algorithm to adapt to different patterns and make predictions based on the local structure of the data.

Euclidean Distance
This is simply the Cartesian distance between two points in a plane/hyperplane: d(x, y) = sqrt(Σᵢ (xᵢ - yᵢ)²). It can be visualized as the length of the straight line joining the two points under consideration, and it measures the net displacement between two states of an object.

Manhattan Distance
The Manhattan distance, d(x, y) = Σᵢ |xᵢ - yᵢ|, is generally used when we are interested in the total distance traveled by an object rather than its net displacement.

Minkowski Distance
Both the Euclidean and the Manhattan distance are special cases of the Minkowski distance, d(x, y) = (Σᵢ |xᵢ - yᵢ|^p)^(1/p).


From the formula above we can see that when p = 2 the Minkowski distance is the same as the Euclidean distance, and when p = 1 we obtain the Manhattan distance.

How to choose the value of k for the KNN Algorithm?
The value of k is crucial in the KNN algorithm: it defines how many neighbors are consulted when making a prediction.

The value of k in the k-nearest neighbors (k-NN) algorithm should be chosen based on
the input data.

If the input data has many outliers or much noise, a higher value of k is usually better, since averaging over more neighbors smooths out the noise. On the other hand, if the input data has little noise, a lower value of k works well.

It is recommended to choose an odd value for k to avoid ties in classification.

Cross-validation methods can help in selecting the best k value for the given dataset.
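As a sketch of that idea (my own illustration, assuming the same iris dataset used later in this lecture), cross_val_score can compare candidate odd values of k:

In [ ]: # Sketch: choosing k by 5-fold cross-validation on iris
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k}: mean accuracy {scores.mean():.3f}")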

Workings of the KNN algorithm

The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity: it predicts the label or value of a new data point by considering the labels or values of its K nearest neighbors in the training dataset.
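Before the scikit-learn version below, here is a minimal from-scratch sketch of that idea (my own illustration on a made-up toy dataset):

In [ ]: # Sketch: KNN classification by hand - distances, k nearest, majority vote
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    distances = np.linalg.norm(X_train - x_new, axis=1)    # Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]                    # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote among their labels

X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5])))  # "A"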

In [ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.model_selection import train_test_split


from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

In [ ]: # load the dataset


iris=sns.load_dataset("iris")
iris.head()

In [ ]: iris.info()

In [ ]: X=iris.drop("species", axis =1)


y=iris["species"]

In [ ]: # split the data into train and test


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [ ]: knn=KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train,y_train)

In [ ]: y_predict=knn.predict(X_test)

In [ ]: # predict a single, hypothetical flower: [sepal_length, sepal_width, petal_length, petal_width]
knn.predict([[10,10,2,.5]])

In [ ]: # evaluate the model


print(accuracy_score(y_test, y_predict))
print(precision_score(y_test, y_predict, average='weighted'))
print(recall_score(y_test, y_predict, average='weighted'))

In [ ]: print(f1_score(y_test, y_predict, average='weighted'))

In [ ]: print(confusion_matrix(y_test, y_predict))
sns.heatmap(confusion_matrix(y_test, y_predict), annot=True)

In [ ]: print(classification_report(y_test, y_predict))

KNN Regression:
In [ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import LabelEncoder

In [ ]: tip=sns.load_dataset("tips")
tip.head()

In [ ]: tip["day"].value_counts()

In [ ]: # Convert strings into numeric data using label encoding


le=LabelEncoder()


for columns in tip.columns:


if tip[columns].dtypes=="object" or tip[columns].dtypes=="category":
tip[columns]=le.fit_transform(tip[columns])

In [ ]: tip.head()

In [ ]: tip["day"].value_counts()

In [ ]: # Split the data in input and output


X=tip.drop("tip", axis=1)
y=tip["tip"]

In [ ]: # Split the data in train and test


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [ ]: knn=KNeighborsRegressor(n_neighbors=5)

In [ ]: knn.fit(X_train, y_train)

In [ ]: knn.predict(X_test)

In [ ]: tip["sex"].nunique()

In [ ]: tip.head(5)

In [ ]: # predict the tip for a single, hypothetical encoded row: [total_bill, sex, smoker, day, time, size]
knn.predict([[20,1,0,5,0,3]])

In [ ]: # Evaluate the model


print(mean_squared_error(y_test, knn.predict(X_test)))
print(r2_score(y_test, knn.predict(X_test)))
print(mean_absolute_error(y_test, knn.predict(X_test)))

Decision tree
A decision tree is one of the most powerful supervised learning algorithms, used for both classification and regression tasks.

A decision tree is a flowchart-like tree structure in which each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. The topmost node in a decision tree is known as the root node. The tree learns to partition the data on the basis of attribute values.

It is constructed by recursively splitting the training data into subsets based on the values
of the attributes until a stopping criterion is met, such as the maximum depth of the tree
or the minimum number of samples required to split a node.

During training, the decision tree algorithm selects the best attribute to split the data based on a metric such as entropy or Gini impurity, which measures the level of impurity or randomness in the subsets: for class proportions pᵢ at a node, entropy = -Σᵢ pᵢ log₂(pᵢ) and Gini impurity = 1 - Σᵢ pᵢ². The goal is to find the attribute that maximizes the information gain, i.e. the reduction in impurity after the split.


Some of the common Terminologies used in Decision Trees are as follows:

Root Node: The topmost node in the tree, representing the complete dataset. It is the starting point of the decision-making process.
Decision/Internal Node: A node that represents a choice regarding an input feature. Branches from internal nodes connect them to leaf nodes or to other internal nodes.
Leaf/Terminal Node: A node without any child nodes, indicating a class label or a numerical value.
Splitting: The process of dividing a node into two or more sub-nodes using a split criterion and a selected feature.
Branch/Sub-Tree: A subsection of the decision tree that starts at an internal node and ends at leaf nodes.
Parent Node: A node that divides into one or more child nodes.
Child Node: A node that emerges when a parent node is split.
Impurity: A measure of the homogeneity of the target variable in a subset of data; it refers to the degree of randomness or uncertainty in a set of examples. The Gini index and entropy are two commonly used impurity measures in decision trees for classification tasks.
Variance: A measure of how much the predicted and target variables vary across different samples of a dataset; it is used for regression problems in decision trees. Mean squared error, mean absolute error, friedman_mse, or half Poisson deviance are used to measure variance for regression tasks.
Information Gain: The reduction in impurity achieved by splitting a dataset on a particular feature. The splitting criterion selects the feature that offers the greatest information gain; it is used to determine the most informative feature to split on at each node of the tree, with the goal of creating pure subsets.
Entropy: A measure of the randomness or uncertainty in a dataset, used to quantify the homogeneity of a sample. A decision tree algorithm uses entropy to calculate the information gain, which determines the best feature to split the dataset on.

Pruning: The process of removing branches from the tree that provide no additional information or that lead to overfitting (a code sketch follows the Gini section below).


Gini Impurity or index: Gini impurity is a score that evaluates how accurate a split is among the classified groups. It is 0 when all observations belong to one class and grows as the classes become more evenly mixed (up to 0.5 for a two-class problem), so we want the Gini score to be as low as possible. The Gini index is the evaluation metric we shall use to evaluate our decision tree model.
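As a concrete illustration of the pruning idea mentioned above (my own sketch, not from the original lecture), scikit-learn implements cost-complexity pruning through the ccp_alpha parameter of its tree estimators; the alpha value here is illustrative, not tuned:

In [ ]: # Sketch: cost-complexity pruning with ccp_alpha
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
unpruned = DecisionTreeClassifier(random_state=42).fit(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=42).fit(X, y)
# pruning removes branches, so the pruned tree has fewer nodes
print(unpruned.tree_.node_count, pruned.tree_.node_count)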

In [ ]: import math

In [ ]: # Example Dataset
# Let's say we have a dataset with two classes, A and B
# Suppose in a dataset of 10 elements, 4 are of class A and 6 are of class B

# Number of elements in each class


n_A = 4
n_B = 6
total = n_A + n_B

In [ ]: # let's calculate the proportions


p_A = n_A / total
p_B = n_B / total

# print the proportions


print("Proportion of A: ", p_A)
print("Proportion of B: ", p_B)

In [ ]: # Calculate the entropy
# Entropy is a measure of uncertainty
entropy = -p_A * math.log2(p_A) - p_B * math.log2(p_B)
print("Entropy: ", entropy)

In [ ]: # gini impurity
# Gini impurity is a measure of misclassification
gini = 1- p_A**2 - p_B**2
print("Gini Impurity: ", gini)

In [ ]: # Information Gain
# Assuming a split on some feature divides the dataset into two subsets
# Subset 1: 2 elements of A, 3 of B
# Subset 2: 2 elements of A, 3 of B


# Entropy and size for each subset


n_1_A, n_1_B = 2, 3
n_2_A, n_2_B = 2, 3

p_1_A = n_1_A / (n_1_A + n_1_B)


p_1_B = n_1_B / (n_1_A + n_1_B)
entropy_1 = -p_1_A * math.log2(p_1_A) - p_1_B * math.log2(p_1_B) if p_1_A and p_1_B else 0

p_2_A = n_2_A / (n_2_A + n_2_B)


p_2_B = n_2_B / (n_2_A + n_2_B)
entropy_2 = -p_2_A * math.log2(p_2_A) - p_2_B * math.log2(p_2_B) if p_2_A and p_2_B else 0

# Calculating information gain


info_gain = entropy - ((n_1_A + n_1_B) / total * entropy_1 + (n_2_A + n_2_B) / total * entropy_2)
print("Information Gain: ", info_gain)

Based on our example dataset with two classes (A and B), we have calculated the
following values:

Entropy: The calculated entropy of the dataset is approximately 0.971. For a binary problem the maximum possible entropy is 1 (a perfect 50/50 mix), so this value indicates a high level of disorder: the 4:6 split of classes A and B is close to an even mix.

Gini Impurity: The Gini impurity for the dataset is 0.48, close to the binary maximum of 0.5, which again indicates a strong mix of classes A and B.

Information Gain: The information gain from the chosen split is 0.0, meaning the split did not reduce the entropy of the dataset at all. This is expected here: both subsets have the same 2:3 class ratio as the parent dataset, so the split adds no information that helps distinguish class A from class B.

These metrics provide insight into the nature of the dataset and the effectiveness of
potential splits when constructing a decision tree. In practical applications, you would use
these calculations to choose the best feature and split at each node in the tree to
maximize the purity of the subsets created.
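To illustrate that node-level choice (my own sketch on made-up data, not part of the lecture), the following picks the best threshold on a single numeric feature by information gain:

In [ ]: # Sketch: choosing the best split threshold by information gain
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(feature, labels, threshold):
    left = [l for x, l in zip(feature, labels) if x <= threshold]
    right = [l for x, l in zip(feature, labels) if x > threshold]
    n = len(labels)
    return entropy(labels) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)

feature = [2.0, 3.0, 4.5, 6.0, 7.5]
labels = ["A", "A", "B", "B", "B"]
for t in [2.5, 3.75, 5.25, 6.75]:
    print(t, round(info_gain(feature, labels, t), 3))
# 3.75 separates the classes perfectly, so it gives the highest gain (~0.971)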

In [ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

In [ ]: df=sns.load_dataset("titanic")
df.head()

In [ ]: df.drop("deck", axis=1, inplace=True)

In [ ]: imputer=SimpleImputer(strategy="most_frequent")
df[["age","fare","embark_town", "embarked"]]=imputer.fit_transform(df[["age","fare","embark_town", "embarked"]])


In [ ]: df.isnull().sum()

In [ ]: df.info()

In [ ]: le=LabelEncoder()

In [ ]: for col in df.columns:


if df[col].dtypes=="object" or df[col].dtypes=="category":
df[col]=le.fit_transform(df[col])

In [ ]: X=df.drop(["survived","alive"] , axis=1)
y=df["survived"]

In [ ]: # split the data in train and test


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [ ]: model=DecisionTreeClassifier( max_depth=3, random_state=42)

In [ ]: fit_model=model.fit(X_train, y_train)
fit_model

In [ ]: y_predict=model.predict(X_test)
y_predict

In [ ]: # evaluate the model


print(accuracy_score(y_test, y_predict))
print(precision_score(y_test, y_predict, average='weighted'))

In [ ]: print(confusion_matrix(y_test, y_predict))

In [ ]: sns.heatmap(confusion_matrix(y_test, y_predict), annot=True, cmap="YlGnBu", fmt="d")

In [ ]: print(classification_report(y_test, y_predict))

In [ ]: # save the decision tree model

from sklearn.tree import export_graphviz


export_graphviz(fit_model, out_file="tree.dot", filled=True, rounded=True)

In [ ]: plot_tree(fit_model, filled=True, rounded=True, feature_names=X.columns, class_names=True)


plt.title("Decision Tree trained model")
plt.show()

Decision Tree Regression


Decision Tree Regression is a type of regression algorithm that uses a decision tree to
model the relationship between the input features and the target variable. It is a non-
parametric algorithm, meaning it does not make any underlying assumptions about the
distribution of the data or the shape of the function that relates the input features to the
target variable.
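A quick sketch of what that means in practice (my own illustration on toy data, not from the original lecture): a regression tree predicts the mean target value of the training samples that end up in each leaf:

In [ ]: # Sketch: a depth-1 regression tree predicts each leaf's mean target
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.2, 0.8, 9.0, 9.5, 10.0])

stump = DecisionTreeRegressor(max_depth=1).fit(X, y)
print(stump.predict([[2.5], [11.5]]))  # ~[1.0, 9.5], the two leaf means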


In [ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

In [ ]: # Import dataset
df=sns.load_dataset("iris")
df.head()

In [ ]: # Split the data in input and output for regression


X=df.drop(["sepal_length", "species"], axis=1)
y=df["sepal_length"]

In [ ]: # Split the data in train and test


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [ ]: # Create the model


model=DecisionTreeRegressor(max_depth=3, random_state=42)
# Train the model
fit_model=model.fit(X_train, y_train)
fit_model

In [ ]: # Predict the model


y_predict=model.predict(X_test)
y_predict

In [ ]: # Evaluate the model


print(mean_squared_error(y_test, y_predict))

In [ ]: # Plot the tree


plot_tree(fit_model, filled=True, rounded=True, feature_names=X.columns)
plt.title("Decision Tree trained model")
plt.show()

