Lecture Material 12
The K-NN algorithm works by finding the K nearest neighbors to a given data point
based on a distance metric, such as Euclidean distance. The class or value of the data
point is then determined by the majority vote or average of the K neighbors. This
approach allows the algorithm to adapt to different patterns and make predictions based
on the local structure of the data.
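As a rough illustration of this procedure, the following from-scratch sketch (using NumPy and a made-up toy dataset, not part of the lecture code) computes Euclidean distances, picks the k closest training points, and takes a majority vote:

In [ ]: import numpy as np
from collections import Counter
def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of the k nearest neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]
# Toy data: three points of class A near (1, 1) and three of class B near (6, 6)
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 7], [6, 7]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))  # predicts "A"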
Euclidean Distance
This is simply the Cartesian distance between two points in a plane or hyperplane. Euclidean distance can also be visualized as the length of the straight line joining the two points under consideration. This metric corresponds to the net displacement between two states of an object.
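For two points $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, the Euclidean distance is

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$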
Manhattan Distance
The Manhattan Distance metric is generally used when we are interested in the total distance traveled by an object rather than its net displacement.
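It sums the absolute differences along each dimension:

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$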
Minkowski Distance
We can say that both the Euclidean and the Manhattan distance are special cases of the Minkowski distance.
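For an order parameter $p \ge 1$, the Minkowski distance is defined as

$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$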
From the formula above we can see that when p = 2 it reduces to the Euclidean distance, and when p = 1 we obtain the Manhattan distance.
The value of k in the k-nearest neighbors (k-NN) algorithm should be chosen based on
the input data.
If the input data has more outliers or noise, a higher value of k would be better. On the
other hand, if the input data has less noise, a lower value of k would be better.
Cross-validation methods can help in selecting the best k value for the given dataset.
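As a quick sketch of that idea (the iris dataset and the candidate values of k below are illustrative choices, not taken from the lecture), cross_val_score can be used to compare several values of k:

In [ ]: import seaborn as sns
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
iris = sns.load_dataset("iris")
X = iris.drop("species", axis=1)
y = iris["species"]
# 5-fold cross-validated accuracy for several candidate values of k
for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k}: mean accuracy {scores.mean():.3f}")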
In [ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
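The data-loading and train/test-split steps are assumed here (seaborn's iris dataset, species as the target, and an 80/20 split):

In [ ]: # Assumed data-loading and splitting step
iris = sns.load_dataset("iris")
X = iris.drop("species", axis=1)   # the four measurement columns
y = iris["species"]                # target class
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)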
In [ ]: iris.info()
In [ ]: knn=KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train,y_train)
In [ ]: y_predict=knn.predict(X_test)
In [ ]: knn.predict([[10,10,2,.5]])
In [ ]: print(confusion_matrix(y_test, y_predict))
sns.heatmap(confusion_matrix(y_test, y_predict), annot=True)
In [ ]: print(classification_report(y_test, y_predict))
KNN Regression:
In [ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import LabelEncoder
In [ ]: tip=sns.load_dataset("tips")
tip.head()
In [ ]: tip["day"].value_counts()
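Before the regressor below can be fitted, the categorical columns need numeric labels and the data needs a train/test split; a sketch of these assumed steps (the encoded columns, tip as the target, the 80/20 split, and the random_state are all assumptions) is:

In [ ]: # Assumed preprocessing: label-encode the categorical columns
le = LabelEncoder()
for col in ["sex", "smoker", "day", "time"]:
    tip[col] = le.fit_transform(tip[col])
X = tip.drop("tip", axis=1)   # features: every column except the target
y = tip["tip"]                # target: the tip amount
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)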
In [ ]: tip.head()
In [ ]: tip["day"].value_counts()
In [ ]: knn=KNeighborsRegressor(n_neighbors=5)
In [ ]: knn.fit(X_train, y_train)
In [ ]: knn.predict(X_test)
In [ ]: tip["sex"].nunique()
In [ ]: tip.head(5)
In [ ]: knn.predict([[20,1,0,5,0,3]])
Decision tree
A decision tree is one of the most powerful supervised learning algorithms, used for both classification and regression tasks.
It is constructed by recursively splitting the training data into subsets based on the values
of the attributes until a stopping criterion is met, such as the maximum depth of the tree
or the minimum number of samples required to split a node.
During training, the Decision Tree algorithm selects the best attribute to split the data
based on a metric such as entropy or Gini impurity, which measures the level of impurity
or randomness in the subsets. The goal is to find the attribute that maximizes the
information gain or the reduction in impurity after the split.
Root Node: It is the topmost node in the tree, which represents the complete dataset. It
is the starting point of the decision-making process.
Decision/Internal Node: A node that symbolizes a choice regarding an input feature.
Branching off of internal nodes connects them to leaf nodes or other internal nodes.
Leaf/Terminal Node: A node without any child nodes that indicates a class label or a
numerical value.
Splitting: The process of splitting a node into two or more sub-nodes using a split
criterion and a selected feature.
Branch/Sub-Tree: A subsection of the decision tree that starts at an internal node and ends at the leaf nodes.
Parent Node: The node that divides into one or more child nodes.
Child Node: The nodes that emerge when a parent node is split.
Impurity: A measurement of the target variable’s homogeneity in a subset of data. It refers to the degree of randomness or uncertainty in a set of examples. The Gini index and entropy are two commonly used impurity measurements in decision trees for classification tasks.
Variance: Variance measures how much the predicted and the target variables vary in different samples of a dataset. It is used for regression problems in decision trees. Mean squared error, mean absolute error, friedman_mse, or half Poisson deviance are used to measure the variance for regression tasks in a decision tree.
Information Gain: Information gain is a measure of the reduction in impurity achieved by splitting a dataset on a particular feature in a decision tree. The splitting criterion is determined by the feature that offers the greatest information gain. It is used to determine the most informative feature to split on at each node of the tree, with the goal of creating pure subsets.
Entropy: Entropy is a measure of the randomness or uncertainty in a dataset. It is used to
calculate the homogeneity of a sample. A decision tree algorithm uses entropy to
calculate the information gain, which is used to determine the best feature to split the
dataset.
Pruning: The process of removing branches from the tree that do not provide any additional information or that lead to overfitting (a short scikit-learn pruning sketch follows this glossary).
Gini Impurity or index: Gini Impurity is a score that evaluates how accurate a split is among the classified groups. It takes values between 0 and 1, where 0 means all observations belong to one class and values approaching 1 mean the elements are distributed randomly across many classes. We therefore want the Gini index score to be as low as possible. The Gini Index is the evaluation metric we shall use to evaluate our Decision Tree model.
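As mentioned under Pruning above, scikit-learn supports cost-complexity pruning through the ccp_alpha parameter; the short sketch below (the iris data and the alpha value are arbitrary illustrations, not part of the lecture code) shows its effect on tree size:

In [ ]: from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)
# A larger ccp_alpha prunes more branches, trading training accuracy for a simpler tree
unpruned = DecisionTreeClassifier(random_state=42).fit(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=42).fit(X, y)
print("Leaves without pruning:", unpruned.get_n_leaves())
print("Leaves with pruning:   ", pruned.get_n_leaves())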
In [ ]: import math
In [ ]: # Example Dataset
# Let's say we have a dataset with two classes, A and B
# Suppose in a dataset of 10 elements, 4 are of class A and 6 are of class B
p_A = 4/10  # proportion of class A
p_B = 6/10  # proportion of class B
In [ ]: # Entropy Calculate
# Entropy is a measure of uncertainty
entropy = -p_A * math.log2(p_A) - p_B * math.log2(p_B)
print("Entropy: ", entropy)
In [ ]: # gini impurity
# Gini impurity is a measure of misclassification
gini = 1- p_A**2 - p_B**2
print("Gini Impurity: ", gini)
In [ ]: # Information Gain
# Assuming a split on some feature divides the dataset into two subsets
# Subset 1: 2 elements of A, 3 of B
# Subset 2: 2 elements of A, 3 of B
entropy_subset = -(2/5) * math.log2(2/5) - (3/5) * math.log2(3/5)  # identical for both subsets
info_gain = entropy - (0.5 * entropy_subset + 0.5 * entropy_subset)  # parent entropy minus weighted subset entropy
print("Information Gain: ", info_gain)
Based on our example dataset with two classes (A and B), we have calculated the
following values:
Entropy: The calculated entropy of the dataset is approximately 0.971. For a binary classification the maximum possible entropy is 1 (complete disorder) and the minimum is 0 (no disorder), so this value indicates a high level of disorder: the two classes are mixed in close to equal proportions.
Gini Impurity: The Gini impurity for the dataset is 0.48. This is just below 0.5, the maximum possible value for two classes, so it likewise indicates a nearly even mix of classes A and B.
Information Gain: The information gain from the chosen split is 0.0. This result implies
that the split did not reduce the entropy or disorder of the dataset. In other words, the
split did not add any additional information that could help distinguish between classes
A and B more effectively than before.
These metrics provide insight into the nature of the dataset and the effectiveness of
potential splits when constructing a decision tree. In practical applications, you would use
these calculations to choose the best feature and split at each node in the tree to
maximize the purity of the subsets created.
In [ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
In [ ]: df=sns.load_dataset("titanic")
df.head()
In [ ]: imputer=SimpleImputer(strategy="most_frequent")
df[["age","fare","embark_town", "embarked"]]=imputer.fit_transform(df[["age","fa
In [ ]: df.isnull().sum()
In [ ]: df.info()
In [ ]: le=LabelEncoder()
In [ ]: X=df.drop(["survived","alive"] , axis=1)
y=df["survived"]
In [ ]: fit_model=model.fit(X_train, y_train)
fit_model
In [ ]: y_predict=model.predict(X_test)
y_predict
In [ ]: print(confusion_matrix(y_test, y_predict))
In [ ]: print(classification_report(y_test, y_predict))
In [ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
In [ ]: # Import dataset
df=sns.load_dataset("iris")
df.head()
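A minimal sketch of how a decision tree regression on this dataset could proceed (the choice of petal_length as the target, the label encoding of species, the 80/20 split, and the default regressor settings are all assumptions):

In [ ]: # Encode the categorical species column and predict petal_length from the other columns
le = LabelEncoder()
df["species"] = le.fit_transform(df["species"])
X = df.drop("petal_length", axis=1)
y = df["petal_length"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_predict))
print("MAE:", mean_absolute_error(y_test, y_predict))
print("R2:", r2_score(y_test, y_predict))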