3. Machine Learning
Machine Learning
Chapters
1. Introduction to Machine Learning
2. Regression
3. Classification
4. Clustering
5. Principal Component Analysis
Chapter 1
1. Introduction to Machine Learning
ML Terminology
• Variables / Features
• These are the columns of the dataset; the dataset may come from files, databases, and other sources
• Independent Variable
• It is used in the equation to find the output (pattern)
• It is also known as the Predictor
• Dependent Variable
• It is the output of the equation
• It is also known as the Response / Target
• Actual Value
• The dependent variable value from the dataset
• Predicted Value
• The dependent variable value from the equation
• Error
• The difference between the actual and predicted values
• Accuracy Metric
• A value/measure used to evaluate how well the machine learning model is trained
Regression
• Regression:
• The dependent variable is continuous, for example the salary of an employee
• Regression Techniques:
• Linear Regression
• Predictor and Response variables are linearly related
• Simple Linear Regression
• Multiple Linear Regression
• Non-Linear Regression
• Predictor and Response variables are non-linearly related
• Polynomial Regression
Classification
• Classification:
• The dependent variable is categorical, for example whether a mail is spam or not
• Classification Techniques:
• Logistic Regression
• Decision Tree
• Support Vector Machine
• K-Nearest Neighbor
• Naïve Bayes
• Random Forest (ensemble technique)
Reinforcement Learning
• Reinforcement Learning
• The machine is trained using rewards and penalties
• A reward is a positive point
• A penalty is a negative point
2. Regression
• $y = \beta_0 + \beta_1 x + \epsilon$
• Where:
• y is the dependent variable
• x is the independent variable
• $\beta_0$ is the intercept
• $\beta_1$ is the slope or coefficient
• $\epsilon$ is the error term or residual
• $y = mx + c$
• Where:
• y is the dependent variable
• x is the independent variable
• c is the intercept
• m is the slope or coefficient
OLS Method
$y = 11x + 12$
OLS
import pandas as pd
import statsmodels.api as sm

# loading the dataset
emp_ds = pd.read_csv('data/Emp_Salary.csv')
x = emp_ds[['YearsExperience']]
y = emp_ds.iloc[:, -1]

# adding the constant (intercept) term and fitting the OLS model
x = sm.add_constant(x)
model = sm.OLS(y, x).fit()
model.summary()
SLR Walkthrough
Importing
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.metrics import r2_score
SLR Walkthrough
Loading data
#loading data from csv file
file_path = 'data/Emp_Salary.csv'
emp_ds = pd.read_csv(file_path)
SLR Walkthrough
Handling na values
#finding na values
emp_ds.isna().sum()
Checking Relation
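The relation check on this slide was shown as a chart; a minimal sketch of such a plot, assuming the Salary column as the response (the matplotlib import is added here):

from matplotlib import pyplot as plt

# scatter plot to check the relation between experience and salary
plt.scatter(emp_ds['YearsExperience'], emp_ds['Salary'])
plt.xlabel('YearsExperience')
plt.ylabel('Salary')
plt.show()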
Splitting x and y
x = emp_ds[['YearsExperience']].values
y = emp_ds.iloc[:, -1].values
# splitting into train and test sets (the split call is reconstructed; the ratio is illustrative)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=101)
print(x_train.shape, x_test.shape)
Building Model
#training model
slr_model = LinearRegression()
slr_model.fit(x_train, y_train)
#finding parameters
print(f'Coef : {slr_model.coef_} \nIntercept : {slr_model.intercept_}')
Evaluating Model
# evaluating the model on the test set
y_pred = slr_model.predict(x_test)
print('R2 Score : ', r2_score(y_test, y_pred))
Finding outliers
# the function body is reconstructed; an IQR-based rule is assumed here
def find_outliers(data):
    q1, q3 = data.quantile(0.25), data.quantile(0.75)
    iqr = q3 - q1
    outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
    return outliers.to_list()
Deleting outliers
# deleting outliers from the dataset (the drop step is reconstructed)
outliers = find_outliers(emp_ds['Salary'])
emp_ds = emp_ds[~emp_ds['Salary'].isin(outliers)]
# after re-splitting and re-fitting the model on the cleaned data, evaluate again
y_pred = slr_model.predict(x_test)
print('R2 Score : ', r2_score(y_test, y_pred))
MLR Walkthrough
import seaborn as sns
from matplotlib import pyplot as plt

# loading the advertisements dataset and checking feature correlations
adv_ds = pd.read_csv('data/Advertisments.csv')
adv_ds.head()
sns.heatmap(adv_ds.corr(), annot=True)
plt.show()
Splitting Dataset
x = adv_ds.iloc[:,:-1]
y = adv_ds.iloc[:,-1]
# splitting into train and test sets (this step is reconstructed; the ratio is illustrative)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=101)
# fitting the multiple linear regression model and predicting
mlr_model = LinearRegression()
mlr_model.fit(x_train, y_train)
y_pred = mlr_model.predict(x_test)
Finding Multicollinearity
# delete all variables that have a VIF value of more than 5 and rebuild the model
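The VIF computation itself is not preserved on the slide; a minimal sketch using statsmodels' variance_inflation_factor (the threshold of 5 comes from the comment above):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# computing the VIF for every predictor in x
vif = pd.DataFrame()
vif['feature'] = x.columns
vif['VIF'] = [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]
print(vif)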
Residuals Normality
import statsmodels.api as sm

# residuals are the differences between actual and predicted values
residuals = y_test - y_pred
sm.qqplot(residuals)
plt.show()
Activity
• Build a regression model to predict house price on the Real Estate dataset
Polynomial Regression
• Linear regression is not suitable for data with a non-linear relation
• Polynomial regression is an extension of linear regression using an nth-degree polynomial
# loading the employee grade/salary dataset
emp_ds = pd.read_csv('data/Emp_Grade_Salary.csv')
emp_ds.head()
from sklearn.preprocessing import PolynomialFeatures

# x, y are assumed to be split from emp_ds (e.g., Grade vs Salary) on an earlier slide
# converting x to polynomial features
poly_conv = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly_conv.fit_transform(x)
# splitting the polynomial features into train and test sets (this step is reconstructed)
x_train, x_test, y_train, y_test = train_test_split(x_poly, y, test_size=0.4, random_state=101)
# building the model
pr_model = LinearRegression()
pr_model.fit(x_train, y_train)
plt.figure(figsize=(4,4))
plt.scatter(x, y, label='Actual Data')
plt.plot(x, pr_model.predict(x_poly), color='g', label='Regression Line')
plt.title('Grade vs Salary')
plt.xlabel('Grade')
plt.ylabel('Salary')
plt.legend()
plt.show()
Polynomial Regression
• Finding the best degree is the key challenge in polynomial regression
• We check different degree values, starting from 2 up to n, as in the loop below
• Select the degree with the best score or minimum error
train_errors = []
test_errors = []
for d in range(1, 10):
    poly_conv = PolynomialFeatures(degree=d, include_bias=False)
    x_poly = poly_conv.fit_transform(x)
    x_train, x_test, y_train, y_test = train_test_split(x_poly, y, test_size=0.4, random_state=101)
    model = LinearRegression()
    model.fit(x_train, y_train)
    train_pred = model.predict(x_train)
    test_pred = model.predict(x_test)
    train_RMSE = np.sqrt(mean_squared_error(y_train, train_pred))
    test_RMSE = np.sqrt(mean_squared_error(y_test, test_pred))
    train_errors.append(train_RMSE)
    test_errors.append(test_RMSE)
# plotting the training and testing error for each tried degree
steps = range(len(train_errors))
plt.plot(steps, train_errors, label='Training Error')
plt.plot(steps, test_errors, label='Testing Error')
plt.xlabel('Steps')
plt.ylabel('Error')
plt.legend()
plt.show()
• Underfitting:
• The model performs poorly on the testing set, and also on the training set
• Bias is high and variance is low
• Avoiding underfitting:
• Increasing the training time of the model
• Increasing the number of features
Cross-Validation
• The model is trained and evaluated on different combinations of train and test sets drawn from the same dataset
• It is commonly performed as k-fold cross-validation
from sklearn.model_selection import KFold, cross_val_score

# 5-fold cross-validation of a linear regression model, scored with R2
lm = LinearRegression()
k_folds = KFold(n_splits=5, shuffle=True, random_state=100)
scores = cross_val_score(lm, x, y, scoring='r2', cv=k_folds)
np.mean(np.absolute(scores))
Regularization
• One of the most crucial ideas in machine learning is regularization.
• It is a method for preventing the model from overfitting by adding a penalty term to the loss function.
• By lowering the magnitude of the coefficients, this strategy keeps all variables or features in the model.
• Consequently, it keeps the model's generality and accuracy.
• The coefficients of the features are regularized, i.e. shrunk toward zero.
• In the regularization approach, we preserve the same number of features while reducing their magnitude.
• A small term, scaled by the lambda hyperparameter, is introduced into the loss/cost function; this term is called the penalty.
• Types of Regularization:
• Ridge Regularization
• Lasso Regularization
Ridge Regularization
• It is also known as L2 Regularization
• The penalty term is lambda multiplied by the sum of the squared coefficients
• Equation as follows:
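The equation image from the original slide is not preserved; the standard form of the ridge cost function is:

$$\text{cost} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2$$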
Ridge Regression
from sklearn.linear_model import Ridge

# fitting ridge regression with a fixed alpha and evaluating on the test set
ridge_model = Ridge(alpha=10)
ridge_model.fit(x_train, y_train)
y_pred = ridge_model.predict(x_test)
MAE = mean_absolute_error(y_test, y_pred)
RMSE = np.sqrt(mean_squared_error(y_test, y_pred))
from sklearn.linear_model import RidgeCV

# searching for the best alpha with cross-validation
ridge_cv_model = RidgeCV(alphas=range(1, 101, 5), scoring='neg_mean_absolute_error')
ridge_cv_model.fit(x_train, y_train)
ridge_cv_model.alpha_
Activity
• Build a ridge regression model to predict house price on the Real Estate dataset
Lasso Regularization
• It is also known as L1 Regularization
• The penalty term is lambda multiplied by the sum of the absolute values of the coefficients
• Equation as follows:
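The equation image from the original slide is not preserved; the standard form of the lasso cost function is:

$$\text{cost} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p}\left|\beta_j\right|$$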
Lasso Regression
from sklearn.linear_model import Lasso

# fitting lasso regression with a fixed alpha and evaluating on the test set
lasso_model = Lasso(alpha=100)
lasso_model.fit(x_train, y_train)
y_pred = lasso_model.predict(x_test)
MAE = mean_absolute_error(y_test, y_pred)
RMSE = np.sqrt(mean_squared_error(y_test, y_pred))
from sklearn.linear_model import LassoCV

# searching for the best alpha with cross-validation
lasso_cv_model = LassoCV(eps=0.1, n_alphas=100, cv=5)
lasso_cv_model.fit(x_train, y_train)
lasso_cv_model.alpha_
Activity
• Build a lasso regression model to predict house price on the Real Estate dataset
Chapter 3
Classification
Introduction to Classification
• Classifying samples into groups is called classification
• In classification, the dependent variable has categorical values, such as yes or no
• If the dependent variable has only two categorical values, the problem is a binary classification problem
• If the dependent variable has more than two categorical values, the problem is a multiclass classification problem
Classification Techniques
• Logistic Regression
• Decision Tree
• K Nearest Neighbor
• Support Vector Machine
• Naïve Bayes
• Ensemble Methods
• Random Forest
• Gradient Boosting
Logistic Regression
• Despite the name "regression", it is a classification technique (algorithm)
• It is a probabilistic model
• It uses MLE (Maximum Likelihood Estimation) to estimate its parameters
• It uses a linear model (equation) internally to predict the labels (dependent variable)
• The linear model is transformed into a non-linear model by applying a function called the sigmoid
• It returns values between 0 and 1 (probability values) for the samples
• The sigmoid function is given below:

$$f(x) = \frac{1}{1 + e^{-x}}$$

• Here e is the base of the natural logarithm, with value approximately 2.718
https://www.vcalc.com/wiki/vCalc/Sigmoid+Function
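A small Python sketch of the sigmoid function (illustrative, not from the slides):

import numpy as np

def sigmoid(x):
    # maps any real value into the (0, 1) range
    return 1 / (1 + np.exp(-x))

print(sigmoid(np.array([-5, -2, 0, 2, 5])))  # values approach 0 on the left and 1 on the right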
• We introduce a decision surface (threshold) to classify a sample; the default is 0.5.
• For example, in binary classification, if the sigmoid value is greater than or equal to 0.5 the sample is classified as 1, else as 0

$$\text{cost} = \frac{1}{n}\sum_{i=1}^{n} -\left[y_i \log(f(x_i)) + (1 - y_i)\log(1 - f(x_i))\right]$$
X     Y    Sigmoid    Threshold    Y'
-5    0    0.02       0.5          0
-2    0    0.17       0.5          0
10    1    1          0.5          1
20    1    1          0.5          1
1     0    0.69       0.5          1
18    1    1          0.5          1
LogisticRegression Class
• Parameters, attributes, and methods of sklearn's LogisticRegression class
Importing
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
Loading Dataset
# loading the bank marketing dataset (the file path here is illustrative)
bank_ds = pd.read_csv('data/bank.csv')
bank_ds.info()
x and y split
x = bank_ds[['age']]
y = bank_ds['y']
Build Model
# splitting into train and test sets (this step is reconstructed; the ratio is illustrative)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
log_model = LogisticRegression()
log_model.fit(x_train, y_train)
log_model.coef_, log_model.intercept_
Evaluating Model
y_pred = log_model.predict(x_test)
accuracy_score(y_test, y_pred)
Accuracy Score
• It is the ratio of correctly predicted labels to the total number of labels
• Its value ranges from 0 to 1
• 0 means all samples are wrongly predicted
• 1 means all samples are correctly predicted
• 0.5 means only 50 percent of the observations are correctly predicted
Confusion Matrix
• It is an n-by-n square matrix with a detailed prediction breakdown for each class label
• It shows how many samples are correctly and wrongly predicted for each class
• It changes for different threshold values
• Precision: (how many predicted positives are actually correct)
• It is the ratio between TP and TP + FP
• $\text{Precision}(P) = \frac{TP}{TP + FP}$
• Recall: (how many actual positives are predicted)
• It is the ratio between TP and TP + FN
• $\text{Recall}(R) = \frac{TP}{TP + FN}$
• F1 Score: the harmonic mean of precision and recall
• $F_1 = 2 \cdot \frac{P \times R}{P + R}$
• False Positive Rate (used together with recall to draw the ROC curve)
• $FPR = \frac{FP}{FP + TN}$
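These metrics are also available in sklearn.metrics; a minimal, self-contained sketch with illustrative labels:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 1, 0]   # actual labels (illustrative)
y_hat  = [0, 1, 1, 1, 0, 0]   # predicted labels (illustrative)
print('Precision :', precision_score(y_true, y_hat))  # TP / (TP + FP)
print('Recall    :', recall_score(y_true, y_hat))     # TP / (TP + FN)
print('F1 Score  :', f1_score(y_true, y_hat))         # 2PR / (P + R)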
MC Logistic Model
#importing required packages
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
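The slide that loads the iris data is not preserved; a minimal sketch of that step (mirroring the GaussianNB walkthrough later in these slides):

# loading iris and splitting into x (features) and y (target)
iris_ds = load_iris()
x = iris_ds.data
y = iris_ds.target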
#splitting dataset into train test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=0)
# fitting the multiclass logistic regression model (this step is reconstructed)
mc_log_model = LogisticRegression(max_iter=200)
mc_log_model.fit(x_train, y_train)
y_pred = mc_log_model.predict(x_test)
# displaying the confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True)
# classification report
print(classification_report(y_test, y_pred))
Activity
• Build a multiclass logistic regression model on the digits dataset from the sklearn package
K-Nearest Neighbor
• Distance measures used to find the nearest neighbors:
• Euclidean distance
• $\sqrt{\sum (p_1 - p_2)^2}$
• Manhattan distance
• $\sum |p_1 - p_2|$
• Minkowski distance
• $\left(\sum |p_1 - p_2|^p\right)^{1/p}$
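A small numpy sketch of these distance measures between two points (illustrative values):

import numpy as np

p1 = np.array([1.0, 2.0, 3.0])
p2 = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((p1 - p2) ** 2))
manhattan = np.sum(np.abs(p1 - p2))
p = 3
minkowski = np.sum(np.abs(p1 - p2) ** p) ** (1 / p)
print(euclidean, manhattan, minkowski)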
• Disadvantages:
• It is difficult to find the best k value
• The computational cost is high
KNeighborsClassifier Class
• Parameters and methods of sklearn's KNeighborsClassifier class
KNN Walkthrough
KNN Model
#importing required packages
from sklearn import datasets as dss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
# loading a dataset through the dss alias (the exact dataset is not preserved on the
# slide; the breast cancer dataset is assumed here)
cancer_ds = dss.load_breast_cancer()
x, y = cancer_ds.data, cancer_ds.target
# displaying shape of x
x.shape
# splitting and fitting the KNN model (these steps are reconstructed;
# the split ratio and k=5 are assumptions)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
knn_model = KNeighborsClassifier(n_neighbors=5).fit(x_train, y_train)
# Evaluating KNN model
y_pred = knn_model.predict(x_test)
Activity
• Build a KNN model on the digits dataset from the sklearn package
Decision Tree
• It is a tree-like structure used to make decisions
• It can be used for regression as well as classification
• Decision Tree algorithms:
• ID3
• C4.5
• CART
ASM Techniques
• ASM stands for Attribute Selection Measures
• These techniques select the feature used to split when creating the tree
• ASM Techniques:
• Entropy
• Information Gain
• Gini Index
Entropy
• It measures the randomness in the information being processed
• Higher entropy means more randomness of classes
• Lower entropy means less randomness of classes
• The equation is as follows:
• $\text{Entropy}(T) = \sum_{i=1}^{c} -P_i \log_2 P_i$
Information Gain
• It measures the purity of the classes at a node after a split
• If the information gain is high, the node contains mostly one class
• If the information gain is low, the node contains a mix of all classes
• The equation is as follows:
• $IG(T, X) = \text{Entropy}(T) - \text{Entropy}(T, X)$
Gini Index
• It also measures the purity of the classes at a node
• It works opposite to information gain: lower Gini values indicate purer nodes
• The equation is as follows:
• $\text{Gini} = 1 - \sum_{i=1}^{c} (P_i)^2$
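A small sketch computing entropy and the Gini index for a set of class labels (illustrative, not from the slides):

import numpy as np

def entropy(labels):
    # class probabilities from the label counts
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(-p * np.log2(p))

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

print(entropy(['yes', 'yes', 'no', 'no']), gini(['yes', 'yes', 'no', 'no']))  # 1.0 and 0.5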
DecisionTreeClassifier Class
• Parameters, attributes, and methods of sklearn's DecisionTreeClassifier class
DT Classifier Model
#importing required packages
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score,confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
import seaborn as sns
DT Classifier Model
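The slide that loads the data and fits the decision tree is not preserved here; a minimal sketch of those steps, assuming the iris dataset (as in the imports above) and default parameters:

# loading iris and splitting into train/test sets (split ratio is illustrative)
iris_ds = load_iris()
x, y = iris_ds.data, iris_ds.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# fitting the decision tree and predicting on the test set
dt_model = DecisionTreeClassifier(random_state=0)
dt_model.fit(x_train, y_train)
y_pred = dt_model.predict(x_test)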
# classification report
print(classification_report(y_test, y_pred))
#displaying tree
plt.figure(figsize=(20,20))
plot_tree(dt_model, class_names=['one', 'two', 'three'])
plt.show()
Activity
• Build a decision tree model on the digits dataset from the sklearn package
Decision Tree
• Advantages:
• Easy to read and interpret
• Less data cleaning required
• Disadvantages:
• Easily Overfits
• Unstable nature
Decision Tree
• Avoiding Overfitting:
• Pruning (see the sketch after this list)
• Pre-Pruning
• Post-Pruning
• Ensemble
• Random Forest
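As a sketch, pre-pruning can be expressed through sklearn constructor parameters such as max_depth and min_samples_leaf, and post-pruning through cost-complexity pruning (ccp_alpha); the parameter values below are illustrative and x_train, y_train are reused from the walkthrough above:

from sklearn.tree import DecisionTreeClassifier

# pre-pruning: limit the tree while it grows
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)

# post-pruning: grow fully, then prune using cost-complexity pruning
path = DecisionTreeClassifier().cost_complexity_pruning_path(x_train, y_train)
post_pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2])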
Naïve Bayes
• It is a supervised machine learning algorithm based on Bayes' theorem
• It is mainly used for classification problems on high-dimensional datasets
• It is a probabilistic model
• Most of the time it is used for text classification
• Naïve means assuming that all features are independent of each other
• It uses Bayes' theorem (law)
• Bayes' law is as follows:
• $P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$
• Where
• P(A|B) is posterior probability
• P(B|A) is likelihood probability
• P(A) is prior probability
• P(B) is marginal probability
• Types of Naïve Bayes:
• Gaussian
• The features follow a normal distribution
• Multinomial
• The data follows a multinomial distribution
• Bernoulli
• Similar to multinomial, but the features have boolean values
GaussianNB Class
• Attributes and methods of sklearn's GaussianNB class
GaussianNB Model
# importing required libraries
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
# loading the iris dataset (this step is reconstructed) and splitting into x and y
iris_ds = load_iris()
x = iris_ds.data
y = iris_ds.target
# splitting and fitting the Gaussian naive Bayes model (these steps are
# reconstructed; the split ratio is illustrative)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
gnb_model = GaussianNB().fit(x_train, y_train)
# evaluating the model
y_pred = gnb_model.predict(x_test)
acc_score = accuracy_score(y_test, y_pred)
Support Vector Machine
• Polynomial Kernel: The polynomial kernel function transforms the data into a higher-dimensional space using a polynomial function.
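As a sketch, the kernel is chosen through SVC's kernel parameter (the values below are illustrative):

from sklearn.svm import SVC

# polynomial kernel of degree 3 vs. the default RBF kernel
poly_svc = SVC(kernel='poly', degree=3)
rbf_svc = SVC(kernel='rbf')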
SVC Walkthrough
SVC
#importing required libraries
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.metrics import classification_report
125
3. Classification
SVC Walkthrough
SVC
# loading the breast cancer dataset (this step is reconstructed) and getting x and y
cancer_ds = datasets.load_breast_cancer()
x = cancer_ds.data
y = cancer_ds.target
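The slides that split the data and fit the SVC model are not preserved; a minimal sketch of those steps (the split ratio and the default kernel are assumptions):

from sklearn.model_selection import train_test_split

# splitting, fitting, and predicting
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
svc_model = SVC()
svc_model.fit(x_train, y_train)
y_pred = svc_model.predict(x_test)
cls_rpt = classification_report(y_test, y_pred)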
print(cls_rpt)
Ensemble Techniques
• Ensemble techniques are machine learning methods that combine multiple models to
improve the accuracy and robustness of the predictions.
• The predictions of these models are combined to make the final prediction.
• Boosting:
• A sequence of models is trained on the same data, with each model focusing on
the samples that the previous model got wrong.
• The predictions of these models are combined to make the final prediction.
• Stacking:
• The predictions of multiple models are combined using another model, called a meta-model, to make the final prediction.
• Ensemble techniques can improve the accuracy and robustness of the predictions, reduce
overfitting, and handle noisy or missing data.
• However, they can also increase the complexity and computational cost of the model.
• Ensemble techniques:
• Random Forest:
• Random Forest is a type of bagging technique
• Gradient Boosting
• Gradient Boosting is a type of boosting technique
• AdaBoost:
• AdaBoost is a type of boosting technique
Random Forest
# loading the breast cancer dataset (this step is reconstructed) and getting x and y
cancer_ds = datasets.load_breast_cancer()
x = cancer_ds.data
y = cancer_ds.target
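The slides that import, split, and fit the random forest are not preserved; a minimal sketch of those steps (parameter values are assumptions):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# splitting, fitting, and predicting
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
rf_model = RandomForestClassifier(n_estimators=100, random_state=0)
rf_model.fit(x_train, y_train)
y_pred = rf_model.predict(x_test)
cls_rpt = classification_report(y_test, y_pred)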
print(cls_rpt)
Gradient Boosting
# Loading cancer dataset from sklearn package
cancer_ds = datasets.load_breast_cancer()
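The slides that split the data and fit the gradient boosting model are not preserved; a minimal sketch of those steps (parameter values are assumptions):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# getting x and y, splitting, fitting, and predicting
x, y = cancer_ds.data, cancer_ds.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
gb_model = GradientBoostingClassifier(random_state=0)
gb_model.fit(x_train, y_train)
y_pred = gb_model.predict(x_test)
cls_rpt = classification_report(y_test, y_pred)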
print(cls_rpt)
4. Clustering
Clustering
• Clustering is an unsupervised machine learning technique
• It is used for grouping similar data points together based on their characteristics or features
• The goal of clustering is to find natural groups or clusters in the data, without prior knowledge of the group labels
• Clustering algorithms typically operate by measuring the similarity between data points and assigning them to groups based on their similarity
• Types of clustering algorithms:
• K-means clustering:
• It partitions the data into k clusters based on their similarity.
• Hierarchical clustering:
• It creates a hierarchy of clusters by recursively merging or splitting clusters based
on their similarity.
• Density-based clustering
• It identifies clusters based on areas of high density in the data.
K-means clustering
• The goal of k-means clustering is to partition a set of observations into k clusters in such a way that the points within each cluster are as similar as possible.
• The points across different clusters are as dissimilar as possible.
• The k-means algorithm works by randomly initializing k cluster centers, and then iteratively assigning each data point to the nearest cluster center based on its distance.
• The algorithm then re-computes the cluster centers based on the new assignments, and repeats the process until convergence.
K-Means Walkthrough
K-Means
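The code slides for this walkthrough are only partially preserved; below is a minimal sketch, assuming a dataset generated with make_blobs and the elbow method used to pick k (the variable names centers and i_wss are kept to match the fragments that follow):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt

# generating a toy dataset with two blobs
x, _ = make_blobs(n_samples=500, centers=[[1, 1], [3, 3]], cluster_std=0.4, random_state=0)

# elbow method: within-cluster sum of squares (inertia) for k = 1..10
centers = range(1, 11)
i_wss = []
for k in centers:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)
    i_wss.append(km.inertia_)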
# plotting the number of clusters against the within-cluster sum of squares (elbow method)
plt.plot(centers, i_wss)
plt.show()
# fitting the final model (k=2 chosen from the elbow plot; this step is reconstructed)
k_means = KMeans(n_clusters=2, n_init=10, random_state=0).fit(x)
# Visualizing clusters
plt.scatter(x[:, 0], x[:, 1], c=k_means.labels_)
plt.show()
Hierarchical clustering
• Hierarchical clustering starts with each data point as a separate cluster and then iteratively
merges clusters based on the distance between them, until all data points are contained in a
single cluster.
• Agglomerative clustering :
• Agglomerative clustering starts with each data point
as a separate cluster and iteratively merges the
closest pairs of clusters until all data points are
contained in a single cluster.
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs, make_circles
from matplotlib import pyplot as plt
#Generating Dataset
centers = [[1, 1], [3, 3]]
ds1 = make_blobs(n_samples=500, centers=centers, cluster_std=0.4, random_state=0)
ds2 = make_circles(n_samples=500, noise=0.1, factor=0.2)
# fitting agglomerative clustering on the circles dataset (ds2)
agg_clstr = AgglomerativeClustering(n_clusters=2)
x = ds2[0]
agg_clstr.fit(x)
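A short visualization of the resulting clusters (illustrative, mirroring the K-means plot above):

# colouring each point by its assigned cluster label
plt.scatter(x[:, 0], x[:, 1], c=agg_clstr.labels_)
plt.show()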
Density-based clustering
• Density-based clustering is a clustering technique that identifies
clusters based on the density of data points in the feature space.
• The key parameter in density-based clustering is the minimum number of data points required to
form a cluster, known as the minimum cluster size or the minimum points threshold.
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the one popular
density-based clustering algorithm.
• DBSCAN works by defining a radius around each data point and counting the number of data
points within that radius.
• A point is considered to be a core point if there are at least a specified minimum number of points
(the minimum points threshold) within its radius.
• If a point is not a core point but is within the radius of a core point, it is considered a border
point. All other points that do not meet either of these criteria are classified as noise points.
• DBSCAN then forms clusters by connecting core points that are within each other's radius, and
any border points that are within the radius of a core point.
• DBSCAN also allows for the detection of noise points, which are data points that do not belong to any cluster.
DBSCAN Walkthrough
DBSCAN
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs, make_circles
from matplotlib import pyplot as plt
#Generating Dataset
centers = [[1, 1], [3, 3]]
ds1 = make_blobs(n_samples=500, centers=centers, cluster_std=0.4, random_state=0)
ds2 = make_circles(n_samples=500, noise=0.1, factor=0.2)
# fitting DBSCAN on the circles dataset (ds2); eps is the neighbourhood radius
dbs = DBSCAN(eps=0.2, min_samples=5)
x = ds2[0]
dbs.fit(x)
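A short visualization of the DBSCAN result (illustrative, mirroring the earlier cluster plots):

# colouring each point by its DBSCAN label (-1 marks noise points)
plt.scatter(x[:, 0], x[:, 1], c=dbs.labels_)
plt.show()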
5. Principal Component Analysis
• Standardize the data by subtracting the mean and dividing by the standard deviation.
• Calculate the covariance matrix of the standardized data.
• Calculate the eigenvectors and eigenvalues of the covariance matrix.
• Choose the first k eigenvectors with the largest eigenvalues to form the basis of the
lower-dimensional subspace.
• Multiply the standardized data by the eigenvectors
• Select the k components (a numpy sketch of these steps follows below)
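A numpy sketch of these steps (illustrative, separate from the sklearn walkthrough that follows):

import numpy as np
from sklearn import datasets

x = datasets.load_iris().data                  # (n_samples, n_features)
x_std = (x - x.mean(axis=0)) / x.std(axis=0)   # 1. standardize
cov = np.cov(x_std, rowvar=False)              # 2. covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)       # 3. eigenvalues / eigenvectors
order = np.argsort(eig_vals)[::-1]             # 4. largest eigenvalues first
k = 2                                          # number of components to keep
x_pca = x_std @ eig_vecs[:, order[:k]]         # 5. project onto the chosen eigenvectors
print(x_pca.shape)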
PCA Walkthrough
PCA
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# loading iris, standardizing it, and projecting onto the first two principal components
iris_ds = datasets.load_iris()
x = iris_ds.data
pca_2 = PCA(n_components=2)
x_std = StandardScaler().fit_transform(x)
pca_2_x = pca_2.fit_transform(x_std)