
Data Science

Module Name: Machine Learning

1
Machine Learning

Chapters
1. Introduction to Machine Learning
2. Regression
3. Classification
4. Clustering
5. Principal Component Analysis

2
Chapter 1

Introduction to Machine Learning

3
1. Introduction to Machine Learning

What is Machine Learning?


• It is a subset/branch of Artificial Intelligence and computer science
• It enables systems to learn and improve from experience without
being explicitly programmed
• Deals with data and algorithms to imitate the human learning process and
improve accuracy over time
• The main objectives of Machine Learning:
• To allow computers to learn autonomously without human
intervention
• To make predictions on unknown datasets or data points

4
1. Introduction to Machine Learning

Applications of Machine Learning


• Email Filtering
• Fraud Detection
• Image Recognition
• Speech Recognition
• Recommendations
• Medical Diagnosis

5
1. Introduction to Machine Learning

ML Terminology
• Variables / Features
• These are the columns of the dataset; the dataset may
come from files, databases and other sources
• Independent Variable
• It is used in the equation to find the output (pattern)
• It is also known as the Predictor
• Dependent Variable
• It is the output of the equation
• It is also known as the Response / Target

6
1. Introduction to Machine Learning

ML Terminology
• Actual Value
• Dependent Variable value from dataset
• Predicted Value
• Dependent Variable value from equation
• Error
• The difference between the actual and predicted values
• Accuracy Metric
• A value/measure used to identify how well the machine is trained or to evaluate a machine learning algorithm

7
1. Introduction to Machine Learning

Type of Machine Learning


• Machine Learning is mainly categorized as follows:
• Supervised Machine Learning
• Unsupervised Machine Learning
• Reinforcement Learning

8
1. Introduction to Machine Learning

Supervised Machine Learning


• Supervised Machine Learning
• We use both (independent and dependent) variables to build a supervised machine learning model
• The machine is trained under the supervision of the dependent variable
• Kinds of Supervised Machine Learning:
• Regression
• Classification

9
1. Introduction to Machine Learning

Regression
• Regression:
• The dependent variable is continuous, for example the salary of an employee
• Regression Techniques:
• Linear Regression
• Predictor and Response variables are linearly related
• Simple Linear Regression
• Multiple Linear Regression
• Non-Linear Regression
• Predictor and Response variables are non-linearly related
• Polynomial Regression
10
1. Introduction to Machine Learning

Classification
• Classification:
• The dependent variable is categorical, for example whether a mail is spam or not
• Classification Techniques:
• Logistic Regression
• Decision Tree
• Support Vector Machine
• K-Nearest Neighbor
• Naïve Bayes
• Random Forest (ensemble technique)

11
1. Introduction to Machine Learning

Unsupervised Machine Learning


• Unsupervised Machine Learning
• We only have independent variables to build an unsupervised machine learning model
• The machine is trained without the supervision of a dependent variable
• Kinds of Unsupervised Machine Learning:
• Clustering
• Grouping data based on patterns
• Association Rule
• Rules are used to make predictions

12
1. Introduction to Machine Learning

Reinforcement Learning
• Reinforcement Learning
• The machine is trained using rewards and penalties
• Rewards are positive points
• Penalties are negative points

13
2. Regression

Simple Linear Regression


• Simple means only one independent variable is present in building the model (equation)
• The linear model equation is as follows:

• y = β₀ + β₁x + ε
• Where:
• y is the dependent variable
• x is the independent variable
• β₀ is the intercept
• β₁ is the slope or coefficient
• ε is the error term or residual
14
2. Regression

Simple Linear Regression


• The linear equation is as follows:

• y = mx + c
• Where:
• y is the dependent variable
• x is the independent variable
• c is the intercept
• m is the slope or coefficient

15
2. Regression

Simple Linear Regression


• To train the machine on simple linear regression we use the OLS method
• OLS stands for Ordinary Least Squares
• It is used to find the intercept and slope (the unknown parameters)
• The method relies on minimizing the sum of squared residuals (differences between the actual (y)
and predicted (y') values)
• The error equation is given below:

• SSE = Σ(yᵢ − yᵢ')²

16
2. Regression

OLS Method

#no   m (slope)   c (intercept)   SSE
1     10          11              2000
2     10.5        11.5            1500
3     11          12              1000
4     11.5        12.5            1500
5     12          13              2000

The SSE is lowest at m = 11, c = 12, so the fitted line is:

y = 11x + 12
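A minimal sketch of the idea in the table above (assuming NumPy, with illustrative data values that are not from the slides): compute the SSE for each candidate (m, c) pair and keep the line with the lowest SSE.

import numpy as np

# illustrative data (assumed values, not from the slides)
x = np.array([1, 2, 3, 4, 5])
y = np.array([23, 34, 45, 56, 67])

# candidate (slope, intercept) pairs, as in the table above
candidates = [(10, 11), (10.5, 11.5), (11, 12), (11.5, 12.5), (12, 13)]

best = None
for m, c in candidates:
    y_pred = m * x + c                  # predicted values for this line
    sse = np.sum((y - y_pred) ** 2)     # sum of squared errors
    if best is None or sse < best[2]:
        best = (m, c, sse)

print(f'Best line: y = {best[0]}x + {best[1]}')

In practice OLS solves for the slope and intercept analytically (as statsmodels does on the next slide); the loop above only illustrates the idea of minimizing the SSE.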
17
2. Regression

SLR Walkthrough using statsmodels

OLS
import pandas as pd
import statsmodels.api as sm

emp_ds = pd.read_csv('data/Emp_Salary.csv')

x = emp_ds[['YearsExperience']]
y = emp_ds.iloc[:,-1]

x = sm.add_constant(x)   # adds the intercept term
model = sm.OLS(y, x).fit()

model.summary()
18
2. Regression

SLR Walkthrough

importing

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.metrics import r2_score

from matplotlib import pyplot as plt


import seaborn as sns

19
2. Regression

SLR Walkthrough

Loading data
#loading data from csv file
file_path = 'data/Emp_Salary.csv'
emp_ds = pd.read_csv(file_path)

#displaying 1st 2 rows


emp_ds.head(2)

#finding dataset information


emp_ds.info()

20
2. Regression

SLR Walkthrough

Handling na values

#finding na values
emp_ds.isna().sum()

#replacing na values with forward fill (ffill)

emp_ds1 = emp_ds.fillna(method='ffill')

21
2. Regression

SLR Walkthrough

Checking Relation

#check relation between x and y


sns.pairplot(data=emp_ds)

22
2. Regression

SLR Walkthrough

Splitting x and y

#splitting dataset into IVs(x) and DV(y)

x = emp_ds1[['YearsExperience']].values
y = emp_ds1.iloc[:,-1].values

23
2. Regression

SLR Walkthrough

Train Test Split

#splitting dataset into train and test


x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=10)

print(x_train.shape, x_test.shape)

24
2. Regression

SLR Walkthrough

Building Model

#training model
slr_model = LinearRegression()
slr_model.fit(x_train, y_train)

#finding parameters
print(f'Coef : {slr_model.coef_} \nIntercept : {slr_model.intercept_}')

25
2. Regression

SLR Walkthrough

Evaluating Model

#evaluating model
y_pred = slr_model.predict(x_test)

print('R2 Score : ', r2_score(y_test, y_pred))


print('MAE : ', mean_absolute_error(y_test, y_pred))
print('MSE : ', mean_squared_error(y_test, y_pred))

26
2. Regression

SLR Walkthrough

Drawing Regression Line

#drawing regression line

plt.scatter(emp_ds2['YearsExperience'], emp_ds2['Salary'], label='Actual')


plt.plot(emp_ds2['YearsExperience'],
slr_model.predict(emp_ds2[['YearsExperience']]), color='green', label='Predicted')
plt.legend()
plt.title('YoE vs Salary')
plt.xlabel('YoE')
plt.ylabel('Salary')
27
2. Regression

SLR Walkthrough

Finding outliers

#visualizing outliers with a boxplot


sns.boxplot(y='Salary', data=emp_ds2)

28
2. Regression

SLR Walkthrough

Finding outliers as list

#function to find outliers


def find_outliers(df):
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    IQR = q3 - q1
    outliers = df[((df < (q1 - 1.5*IQR)) | (df > (q3 + 1.5*IQR)))]

    return outliers.to_list()

29
2. Regression

SLR Walkthrough

Deleting outliers

#deleting outliers
outliers = find_outliers(emp_ds1['Salary'])

emp_ds2 = emp_ds1.query(f'Salary not in {outliers}')

30
2. Regression

SLR Walkthrough

Building & Evaluating Model Again

#building model without outliers


x = emp_ds2.iloc[:,[0]]
y = emp_ds2.iloc[:,1]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=10)


slr_model = LinearRegression()
slr_model.fit(x_train, y_train)
print(f'Coef : {slr_model.coef_} \nIntercept : {slr_model.intercept_}')

y_pred = slr_model.predict(x_test)
print('R2 Score : ', r2_score(y_test, y_pred))
31
2. Regression

Multiple Linear Regression


• It is an extension of simple linear regression
• More than one independent variable is present
• The linear model equation is as follows:

• y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε

• Where:
• y is the dependent variable
• x₁, x₂, …, xₙ are the independent variables
• β₀ is the intercept
• β₁, β₂, …, βₙ are the slopes or coefficients
• ε is the error term or residual
32
2. Regression

Multiple Linear Regression


• Assumptions:
• Independent variables are linearly related to the dependent variable
• Little or no multicollinearity (no linear relations between the independent variables)
• Normality of the residuals

33
2. Regression

MLR Walkthrough

Loading Data Set

#loading data from csv file

adv_ds = pd.read_csv('data/Advertisments.csv')
adv_ds.head()

Finding Linear Relation

#heatmap with correlation value

sns.heatmap(adv_ds.corr(), annot=True)
plt.show()
34
2. Regression

MLR Walkthrough

Splitting Dataset

#splitting x, y, train and test

x = adv_ds.iloc[:,:-1]
y = adv_ds.iloc[:,-1]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=10)

35
2. Regression

MLR Walkthrough

Building and Evaluating Model

mlr_model = LinearRegression()
mlr_model.fit(x_train, y_train)

y_pred = mlr_model.predict(x_test)

print('R2 Score : ', r2_score(y_test, y_pred))


print('MAE : ', mean_absolute_error(y_test, y_pred))
print('RMSE : ', np.sqrt(mean_squared_error(y_test, y_pred)))

36
2. Regression

MLR Walkthrough

Finding Multicollinearity

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.Series([variance_inflation_factor(x.values, i) for i in range(x.shape[1])],
                index=x.columns)

#drop all variables that have a VIF value of more than 5 and rebuild the model

37
2. Regression

MLR Walkthrough

Residuals Normality
import statsmodels.api as sm

#computing residuals as the difference between actual and predicted values
residuals = y_test - mlr_model.predict(x_test)

sm.qqplot(residuals)
plt.show()

#if the points follow a straight line then the residuals are normal and the model is good

39
2. Regression

Activity
• Build regression model to predict house price on Real Estate Dataset

40
2. Regression

Polynomial Regression
• Linear regression is not suitable for data with a non-linear relation
• It is an extension of linear regression with an nth degree polynomial

• The polynomial equation is as follows:

• y = b₀ + b₁x₁ + b₂x₁² + ⋯ + bₙx₁ⁿ

41
2. Regression

Polynomial Regression Walkthrough

Loading Data Set

#loading data from csv file

emp_ds = pd.read_csv('data/Emp_Grade_Salary.csv')
emp_ds.head()

#build linear regression and check the score

42
2. Regression

Polynomial Regression Walkthrough

Creating Polynomial Features

#x to polynomial features

from sklearn.preprocessing import PolynomialFeatures

poly_conv = PolynomialFeatures(degree=2,include_bias=False)
x_poly = poly_conv.fit_transform(x)

43
2. Regression

Polynomial Regression Walkthrough

Building model using x_poly, y

#train test split


x_train, x_test, y_train, y_test = train_test_split(x_poly, y, test_size=0.3, random_state=10)

#building model
pr_model = LinearRegression()
pr_model.fit(x_train, y_train)

44
2. Regression

Polynomial Regression Walkthrough

Visualizing the graph

plt.figure(figsize=(4,4))
plt.scatter(x, y, label='Actual Data')
plt.plot(x, pr_model.predict(x_poly), color='g', label='Regression Line')
plt.title('Grade vs Salary')
plt.xlabel('Grade')
plt.ylabel('Salary')
plt.legend()

45
2. Regression

Polynomial Regression
• Finding the best degree is the main challenge in polynomial regression
• We check different degree values, starting from 2 up to n
• Select the degree with the best score or minimum error

46
2. Regression

Polynomial Regression Walkthrough

Finding best degree

train_errors = []
test_errors = []

for d in range(1,10):
    poly_conv = PolynomialFeatures(degree=d,include_bias=False)
    x_poly = poly_conv.fit_transform(x)
    x_train, x_test, y_train, y_test = train_test_split(x_poly, y, test_size=0.4,
                                                        random_state=101)
    model = LinearRegression()
    model.fit(x_train,y_train)
47
2. Regression

Polynomial Regression Walkthrough

Finding best degree

    #continuation of the loop body from the previous slide
    train_pred = model.predict(x_train)
    test_pred = model.predict(x_test)
    train_RMSE = np.sqrt(mean_squared_error(y_train,train_pred))
    test_RMSE = np.sqrt(mean_squared_error(y_test,test_pred))

    train_errors.append(train_RMSE)
    test_errors.append(test_RMSE)

48
2. Regression

Polynomial Regression Walkthrough

Finding best degree

steps = range(len(train_errors))
plt.plot(steps, train_errors, label='Training Error')
plt.plot(steps, test_errors, label='Testing Error')
plt.xlabel('Steps')
plt.ylabel('Error')
plt.legend()

49
2. Regression

Bias and Variance


• We will have two kinds of errors in a model, as follows:
• Bias
• It is the error on the training set
• Variance
• It is the error on the testing set

50
2. Regression

Overfitting and Underfitting


• Overfitting:
• The model does pretty well on the training set, but does poorly on the testing
set
• Bias is low and variance is high

• Underfitting:
• The model does poorly on the training set (and therefore generalizes poorly to the testing
set)
• Bias is high and variance is low

51
2. Regression

Bias – Variance Tradeoff


• To make the model more generalized (doing well on both the
training set and the test set), we train the model with different
combinations of parameter values

• We pick the one with low bias and low variance

52
2. Regression

Overfitting and Underfitting


• Avoiding overfitting:
• Training with more data
• Removing features
• Cross-Validation
• Regularization
• Ensembling

• Avoiding underfitting:
• Increasing the training time of the model
• Increasing the number of features
53
2. Regression

Cross-Validation
• The model is trained with different combinations of train and test sets from the same dataset
• It is sometimes known as k-fold cross validation

54
2. Regression

Cross-Validation

Cross-Validation

from sklearn.model_selection import cross_val_score


from sklearn.model_selection import KFold

lm = LinearRegression()
k_folds = KFold(n_splits = 5, shuffle = True, random_state = 100)
scores = cross_val_score(lm, x, y, scoring='r2', cv=k_folds)

np.mean(np.absolute(scores))

55
2. Regression

Regularization
• One of the most crucial ideas in machine learning is regularization.
• It is a method for preventing the model from overfitting by adding extra information (a penalty term) to
the cost function.
• By lowering the magnitude of the coefficients, this strategy keeps all
variables or features in the model.
• Consequently, it keeps the model's generality and accuracy.
• The coefficients of the features are mostly regularized or shrunk toward zero.

56
2. Regression

Regularization
• In the regularization approach, we preserve the same number of features while reducing
the magnitude of their coefficients.
• A small error term, scaled by the lambda hyperparameter, is added to the loss/cost function; this
term is called the penalty

• Type of Regularization:
• Ridge Regularization
• Lasso Regularization

57
2. Regression

Ridge Regularization
• It is also known as L2 Regularization
• The penalty term is lambda multiplied by the squares of the coefficients
• The equation is as follows:

• Cost = Σ(yᵢ − yᵢ')² + λ Σ βⱼ²

58
2. Regression

Ridge Regularization

Ridge Regression

from sklearn.linear_model import Ridge

ridge_model = Ridge(alpha=10)
ridge_model.fit(x_train,y_train)
y_pred = ridge_model.predict(x_test)
MAE = mean_absolute_error(y_test,y_pred)
RMSE = np.sqrt(mean_squared_error(y_test,y_pred))

print("Test MAE is:"+ str(MAE))

print("Test RMSE is:"+ str(RMSE))

59
2. Regression

Ridge Regularization

Finding Best alpha

from sklearn.linear_model import RidgeCV

ridge_cv_model = RidgeCV(alphas=range(1,101,5),scoring='neg_mean_absolute_error')
ridge_cv_model.fit(x_train,y_train)

ridge_cv_model.alpha_

60
2. Regression

Activity
• Build ridge regression to predict house price on Real Estate Dataset

61
2. Regression

Lasso Regularization
• It is also known as L1 Regularization
• The penalty term is lambda multiplied by the absolute values of the coefficients
• The equation is as follows:

• Cost = Σ(yᵢ − yᵢ')² + λ Σ |βⱼ|

62
2. Regression

Lasso Regularization

Lasso Regression

from sklearn.linear_model import Lasso

lasso_model = Lasso(alpha=100)
lasso_model.fit(x_train,y_train)
y_pred = lasso_model.predict(x_test)
MAE = mean_absolute_error(y_test,y_pred)
RMSE = np.sqrt(mean_squared_error(y_test,y_pred))

print("Test MAE is:"+ str(MAE))

print("Test RMSE is:"+ str(RMSE))

63
2. Regression

Lasso Regularization

Finding Best alpha

from sklearn.linear_model import LassoCV

lasso_cv_model = LassoCV(eps=0.1,n_alphas=100,cv=5)
lasso_cv_model.fit(x_train,y_train)

lasso_cv_model.alpha_

64
2. Regression

Activity
• Build lasso regression to predict house price on Real Estate Dataset

65
Chapter 3

Classification

66
3. Classification

Introduction to Classification
• Classifying samples into groups is called classification
• In classification the dependent variable has categorical values, such as yes or no
• If the dependent variable has only two categorical values then the problem is a binary
classification problem
• If the dependent variable has more than two categorical values then the problem is a multiclass
classification problem

67
3. Classification

Classification Techniques
• Logistic Regression
• Decision Tree
• K Nearest Neighbor
• Support Vector Machine
• Naïve Bayes
• Ensemble Methods
• Random Forest
• Gradient Boosting

68
3. Classification

Logistic Regression
• Don't be confused by the name regression; it is a classification
technique (algorithm)
• It is a probabilistic model
• It uses MLE (Maximum Likelihood Estimation) (Maximum
Probability)
• It uses a linear model (equation) internally to predict the labels
(dependent variable)

69
3. Classification

Logistic Regression
• The linear model is transformed into a non-linear model by applying a function called the
sigmoid
• It returns values from 0 to 1 (probability values) for the samples
• The sigmoid function is given below:

• f(x) = 1 / (1 + e⁻ˣ)
• Here e is the base of the natural logarithm, with value 2.718

https://www.vcalc.com/wiki/vCalc/Sigmoid+Function
70
3. Classification

Logistic Regression
• We introduce a decision surface (threshold) to classify a sample; the default is 0.5.
• For example, in binary classification, if the sigmoid value is greater than or equal to 0.5 we classify the sample as
1, else 0

• The cost or loss function is as follows:

• cost = −(1/n) Σ [yᵢ log(f(xᵢ)) + (1 − yᵢ) log(1 − f(xᵢ))]
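A small sketch (assuming NumPy) of the sigmoid transform and the cost function above, applied to a few illustrative values:

import numpy as np

def sigmoid(x):
    # maps any real value into the (0, 1) range
    return 1 / (1 + np.exp(-x))

# illustrative linear-model outputs and true labels (assumed values)
z = np.array([-5.0, -2.0, 1.0, 10.0])
y = np.array([0, 0, 0, 1])

p = sigmoid(z)                    # probability values
y_hat = (p >= 0.5).astype(int)    # apply the 0.5 threshold

# cost = -(1/n) * sum(y*log(p) + (1-y)*log(1-p))
cost = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(p, y_hat, cost)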

71
3. Classification

Logistic Regression Example

X    Y   Sigmoid   Threshold   Y'
-5   0   0.02      0.5         0
-2   0   0.17      0.5         0
10   1   1         0.5         1
20   1   1         0.5         1
1    0   0.69      0.5         1
18   1   1         0.5         1

72
3. Classification

LogisticRegression Class

Parameter Description

penalty Regularization norm (l1, l2, elasticnet)


C Regularization Term (0, 0.001, 0.1, 1, 10)
solver Optimizer (liblinear, lbfgs, sag)
multi_class Classification Type (auto, ovr, multinomial)

73
3. Classification

LogisticRegression Class

Attributes Description

coef_ Coefficient of the features


intercept_ Intercept (a.k.a. bias) added

Methods Description

fit(X, y) Fit the model to the given training data.


predict_proba(X) Probability estimates
predict(X) Predict class labels
get_params([deep]) Get parameters for this estimator

74
3. Classification

Logistic Regression(Binary Classification) Walkthrough

Importing

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

75
3. Classification

Logistic Regression(Binary Classification) Walkthrough

Loading Dataset

bank_ds = pd.read_csv('data/bank/bank.csv', delimiter=';')


bank_ds.head()

bank_ds.info()

#Do all preprocessing required

76
3. Classification

Logistic Regression(Binary Classification) Walkthrough

x and y split

x = bank_ds[['age']]
y = bank_ds['y']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

77
3. Classification

Logistic Regression(Binary Classification) Walkthrough

Build Model

log_model = LogisticRegression()
log_model.fit(x_train, y_train)

log_model.coef_, log_model.intercept_

78
3. Classification

Logistic Regression(Binary Classification) Walkthrough

Evaluating Model

y_pred = log_model.predict(x_test)

#y_pred_proba = log_model.predict_proba(x_test) #this returns the probability value

accuracy_score(y_test, y_pred)

79
3. Classification

Classification Model Evaluation Metrics


• Metrics:
• Accuracy Score
• Confusion Matrix
• Precision, Recall and F1 score
• ROC-AUC Score

80
3. Classification

Accuracy Score
• It is the ratio of correctly predicted labels to the total number of labels
• Its value ranges from 0 to 1
• 0 means all predictions are wrong
• 1 means all predictions are correct
• 0.5 means only 50 percent of observations are correctly predicted

• Function from sklearn.metrics


• accuracy_score(y_actual, y_pred)

81
3. Classification

Confusion Matrix
• It is an n-by-n square matrix with a detailed prediction breakdown for each class label
• It shows how many samples are correctly and wrongly predicted for each class
• It will change for different threshold values

• Function from sklearn.metrics:


• confusion_matrix(y_actual, y_pred)

Source: https://upload.wikimedia.org/wikipedia/commons/6/6f/ConfusionMatrix.png
82
3. Classification

Precision, Recall and F1 score


• Precision: (how many of the predicted positives are actually positive)
• It is the ratio between TP and TP+FP

• Precision(P) = TP / (TP + FP)

• Recall: (how many of the actual positives are predicted as positive)
• It is the ratio between TP and TP+FN

• Recall(R) = TP / (TP + FN)

83
3. Classification

Precision, Recall and F1 score


• In some problems we need to aim for high precision
• In some problems we need to aim for high recall
• In some problems we need both high precision and high recall; this is very difficult, so we trade
off between precision and recall with the F1 score
• It is the harmonic mean of precision and recall

• F1 = 2 * (P * R) / (P + R)
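A short sketch (assuming scikit-learn) of computing these metrics from actual and predicted labels; the label arrays are illustrative:

from sklearn.metrics import precision_score, recall_score, f1_score

# illustrative actual and predicted labels (assumed values)
y_actual = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred   = [1, 0, 1, 0, 0, 1, 1, 0]

print('Precision :', precision_score(y_actual, y_pred))    # TP / (TP + FP)
print('Recall    :', recall_score(y_actual, y_pred))        # TP / (TP + FN)
print('F1 Score  :', f1_score(y_actual, y_pred))            # harmonic mean of P and R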

84
3. Classification

ROC Curve-AUC Score


• ROC stands for Receiver Operating Characteristic
• The ROC Curve is the graph between the true positive rate and the false
positive rate at different thresholds

• TPR = TP / (TP + FN)

• FPR = FP / (FP + TN)

Source: https://upload.wikimedia.org/wikipedia/commons/4/4d/Threshold_roc.wikipedia_edit.svg
85
3. Classification

ROC Curve-AUC Score


• AUC stands for Area Under the Curve
• It gives the performance of the model
• If the AUC score is 1 then the model predicts 100% correctly

• Functions from sklearn.metrics:

• roc_curve(y_actual, y_pred)
• roc_auc_score(y_actual, y_pred)
• RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=0.8).plot()

Source: https://upload.wikimedia.org/wikipedia/commons/4/4d/Threshold_roc.wikipedia_edit.svg
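A minimal sketch (assuming scikit-learn and matplotlib) of how these functions fit together; the labels and scores below are illustrative, and in practice the scores would come from predict_proba:

from sklearn.metrics import roc_curve, roc_auc_score, RocCurveDisplay
from matplotlib import pyplot as plt

# illustrative true labels and predicted probabilities (assumed values)
y_actual = [0, 0, 1, 1, 0, 1, 1, 0]
y_score  = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]

fpr, tpr, thresholds = roc_curve(y_actual, y_score)   # TPR and FPR at each threshold
auc = roc_auc_score(y_actual, y_score)

RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=auc).plot()
plt.show()
print('AUC Score :', auc)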
86
3. Classification

Logistic Regression(Multiclass Classification) Walkthrough

MC Logistic Model
#importing required packages
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, log_loss
import seaborn as sns

#loading iris dataset


iris_ds = load_iris()

#splitting dataset into features(x) and labels(y)


x = iris_ds.data
y = iris_ds.target
87
3. Classification

Logistic Regression(Multiclass Classification) Walkthrough

MC Logistic Model
#splitting dataset into train test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=0)

#building and training logistic model


mc_log_model = LogisticRegression(multi_class='multinomial', max_iter=1000)
mc_log_model.fit(x_train, y_train)

#evaluating trained model


y_pred = mc_log_model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy {accuracy}')
88
3. Classification

Logistic Regression(Multiclass Classification) Walkthrough

MC Logistic Model
#displaying confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True)

#classification report
print(classification_report(y_test, y_pred))

#checking overfitting or underfitting


training_loss = log_loss(y_train, mc_log_model.predict_proba(x_train))
print(f'Training Loss : {training_loss}')
testing_loss = log_loss(y_test, mc_log_model.predict_proba(x_test))
print(f'Testing Loss : {testing_loss}')
89
3. Classification

Activity
• Build multiclass logistic regression model on digit dataset from sklearn package

90
3. Classification

K Nearest Neighbour (KNN)


• It is one of the simplest Machine Learning algorithms, based on the
Supervised Learning technique
• It can be used for Regression as well as for Classification, but is
mainly used for classification
• It is a non-parametric algorithm
• No assumptions are made about the dataset
• It is a lazy learner algorithm
• It stores the dataset, and performs the computation at the time of
classification

91
3. Classification

K Nearest Neighbour (KNN)


• A new data point is classified based on its k nearest neighbours
• If more neighbours of one class are present, the new data point is classified into that class
• It uses the distance between the new data point and the k neighbours to decide the class of the new data point
• Distance can be calculated as follows:
• Euclidean distance
• √(Σ(p1 − p2)²)
• Manhattan distance
• Σ|p1 − p2|
• Minkowski distance
• (Σ|p1 − p2|ᵖ)^(1/p)
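A small sketch (assuming NumPy) of the three distance calculations for two illustrative points:

import numpy as np

# two illustrative data points (assumed values)
p1 = np.array([1.0, 2.0, 3.0])
p2 = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((p1 - p2) ** 2))            # straight-line distance
manhattan = np.sum(np.abs(p1 - p2))                    # sum of absolute differences
p = 3
minkowski = np.sum(np.abs(p1 - p2) ** p) ** (1 / p)    # p=2 gives Euclidean, p=1 gives Manhattan

print(euclidean, manhattan, minkowski)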
92
3. Classification

K Nearest Neighbour (KNN)


• Advantages:
• It is robust to the noisy data
• It can be more effective if the data is large

• Disadvantages:
• Difficult to find best k value
• Computational cost is large

93
3. Classification

KNeighborsClassifier Class

Parameter Description

n_neighbors Number of neighbors


metric Metric to use for distance computation

Methods Description

fit(X, y) Fit the model to the given training data.


predict_proba(X) Probability estimates
predict(X) Predict class labels
get_params([deep]) Get parameters for this estimator
94
3. Classification

KNN Walkthrough

KNN Model
#importing required packages
from sklearn import datasets as dss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Loading cancer dataset from sklearn package


cancer_ds = dss.load_breast_cancer()

# Getting x and y from cancer_ds


x = cancer_ds.data
y = cancer_ds.target
95
3. Classification

KNN Walkthrough

KNN Model
# displaying shape of x
x.shape

# splitting train and test


x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Building KNN model


knn_model = KNeighborsClassifier(n_neighbors=7)
knn_model.fit(x_train, y_train)

96
3. Classification

KNN Walkthrough

KNN Model
# Evaluating KNN model
y_pred = knn_model.predict(x_test)

acc_score = accuracy_score(y_test, y_pred)


auc_score = roc_auc_score(y_test, y_pred)

print(f'Accuracy Score : {acc_score}')


print(f'Auc Score : {auc_score}')

97
3. Classification

K Nearest Neighbour (KNN)


• Finding the best k value:
• Fit the model with different k values
• Find the errors or scores for each k
• Plot the graph of errors or scores vs k
• Pick the k value where the line bends like an elbow, as sketched below
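A short sketch of this procedure (assuming scikit-learn and the x_train/x_test split from the walkthrough above):

from sklearn.neighbors import KNeighborsClassifier
from matplotlib import pyplot as plt

errors = []
k_values = range(1, 21)
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(x_train, y_train)
    errors.append(1 - model.score(x_test, y_test))   # error rate = 1 - accuracy

plt.plot(k_values, errors, marker='o')
plt.xlabel('k')
plt.ylabel('Error rate')
plt.show()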

98
3. Classification

Activity
• Build KNN model on digit dataset from sklearn package

99
3. Classification

Decision Tree
• It is a tree-like structure used to make decisions
• It can be used for regression as well as classification
• Decision Tree algorithms:
• ID3
• C4.5
• CART

Source: https://upload.wikimedia.org/wikipedia/en/4/4f/GEP_decision_tree_with_numeric_and_nominal_attributes.png

100
3. Classification

Decision Tree Terminology


• Root Node
• Leaf Node
• Splitting
• Sub Tree
• Parent/Child Node
• Pruning
• Attribute Selection Measures

https://upload.wikimedia.org/wikipedia/commons/a/a8/Decision_Tree_Depth_2.png

101
3. Classification

ASM Techniques
• ASM stands for Attribute Selection Measures
• These techniques select the feature used to split when building the tree
• ASM Techniques
• Entropy
• Information Gain
• Gini Index

102
3. Classification

Entropy
• The randomness in the information being processed
• Higher entropy means more randomness of classes
• Lower entropy means less randomness of classes
• The equation is as follows:
• Entropy(T) = Σᵢ₌₁ᶜ −Pᵢ log₂(Pᵢ)

Log2 Calculator: https://www.omnicalculator.com/math/log-2

103
3. Classification

Information Gain
• It measures how well a split separates the class values at a node
• If the information gain is high then the node contains almost only one class's values
• If the information gain is low then the node is a mix of all class values
• The equation is as follows:
• IG(T, X) = Entropy(T) − Entropy(T, X)

104
3. Classification

Gini Index
• It measures the impurity of the class values at a node
• It works opposite to information gain
• The equation is as follows:
• Gini = 1 − Σᵢ₌₁ᶜ (Pᵢ)²
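A minimal sketch (assuming NumPy) that computes entropy and Gini index for an illustrative set of class labels at a node:

import numpy as np

def entropy(labels):
    # entropy = -sum(p_i * log2(p_i)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # gini = 1 - sum(p_i^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

node_labels = ['yes', 'yes', 'yes', 'no', 'no']    # illustrative labels (assumed values)
print(entropy(node_labels), gini(node_labels))     # ~0.971 and 0.48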

105
3. Classification

DecisionTreeClassifier Class

Parameter Description

criterion gini, entropy
max_depth The maximum depth of the tree
min_samples_split The minimum number of samples required to split a node
min_impurity_decrease A node will be split if this split induces a decrease of the impurity greater than or equal to this value
ccp_alpha Complexity parameter used for Minimal Cost-Complexity Pruning
106
3. Classification

DecisionTreeClassifier Class

Attributes Description

classes_ The classes labels


feature_names_in_ Names of features seen during fit

Methods Description

fit(X, y) Fit the model to the given training data.


predict_proba(X) Probability estimates
predict(X) Predict class labels
get_params([deep]) Get parameters for this estimator

107
3. Classification

Decision Tree Classifier Walkthrough

DT Classifier Model
#importing required packages
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score,confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
import seaborn as sns

#loading iris dataset


iris_ds = load_iris()

108
3. Classification

Decision Tree Classifier Walkthrough

DT Classifier Model

#splitting dataset into features(x) and labels(y)


x = iris_ds.data
y = iris_ds.target

#splitting dataset into train test split


x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=0)

#building and training decision tree model


dt_model = DecisionTreeClassifier(criterion='entropy')
dt_model.fit(x_train, y_train)
109
3. Classification

Decision Tree Classifier Walkthrough

DT Classifier Model

#evaluating trained model


y_pred = dt_model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy {accuracy}')

#displaying confusion matrix


cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True)

#classification report
print(classification_report(y_test, y_pred))
110
3. Classification

Decision Tree Classifier Walkthrough

DT Classifier Model

#displaying tree
plt.figure(figsize=(20,20))
plot_tree(dt_model, class_names=['one', 'two', 'three'])
plt.show()

111
3. Classification

Activity
• Build decision tree model on digit dataset from sklearn package

112
3. Classification

Decision Tree
• Advantages:
• Easy to read and interpret
• Less data cleaning required

• Disadvantages:
• Easily Overfits
• Unstable nature

113
3. Classification

Decision Tree
• Avoiding Overfitting:
• Pruning
• Pre-Pruning
• Post-Pruning
• Ensemble
• Random Forest

114
3. Classification

Naïve Bayes
• It is a supervised machine learning algorithm based on bayes theorem
• It is mainly used for classification problems for high dimensional dataset
• It is probabilistic model
• Most of the time it is used for text classification
• Naïve means assuming all features are independent to each other
• It uses bayes theorem or law

115
3. Classification

Naïve Bayes
• Bayes' law is as follows:

• P(A|B) = P(B|A) · P(A) / P(B)

• Where
• P(A|B) is the posterior probability
• P(B|A) is the likelihood probability
• P(A) is the prior probability
• P(B) is the marginal probability
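A tiny worked sketch of the law with illustrative (assumed) probabilities for a spam-filter style example:

# illustrative (assumed) probabilities
p_a = 0.3          # P(A): prior probability that a mail is spam
p_b_given_a = 0.6  # P(B|A): likelihood that a spam mail contains the word "offer"
p_b = 0.25         # P(B): marginal probability that any mail contains the word "offer"

# Bayes law: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)   # 0.72, the posterior probability of spam given the word appears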

116
3. Classification

Naïve Bayes
• Types of naïve bayes:
• Gaussian
• Features follow normal distribution
• Multinomial
• Data follows multinomial distribution
• Bernoulli
• Same like multinomial, but features will have boolean values

117
3. Classification

GaussianNB Class

Attributes Description

classes_ The classes labels


feature_names_in_ Names of features seen during fit

Methods Description

fit(X, y) Fit the model to the given training data.


predict_proba(X) Probability estimates
predict(X) Predict class labels
get_params([deep]) Get parameters for this estimator

118
3. Classification

GaussianNB Classifier Walkthrough

GaussianNB Model
#importing required libraries
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

#loading iris dataset


iris_ds = load_iris()

119
3. Classification

GaussianNB Classifier Walkthrough

GaussianNB Model
#splitting dataset into x and y
x = iris_ds.data
y = iris_ds.target

#splitting x, y into train and test


x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

#Building gaussian naive bayes model


gnb_model = GaussianNB()
gnb_model.fit(x_train, y_train)

120
3. Classification

GaussianNB Classifier Walkthrough

GaussianNB Model
#Evaluating model
y_pred = gnb_model.predict(x_test)
acc_score = accuracy_score(y_test, y_pred)

print(f'Accuracy Score : {acc_score}')


print(classification_report(y_test, y_pred))

121
3. Classification

Support Vector Machines (SVM)


• It is a supervised machine learning algorithm
• It is used for regression and classification problems on high
dimensional datasets
• It separates the classes with the best line, called the decision
boundary or hyperplane
• The points used to find the hyperplane are called support vectors
• Types of SVM:
• Linearly separable
• Non-linearly separable

122
3. Classification

Support Vector Machines (SVM)


• Non-linearly separable data uses the kernel trick.
• The kernel transforms low dimensional data into high dimensional data

• Types of kernel functions :


• Linear Kernel: The linear kernel is the simplest type of kernel function. It is
used when the data is linearly separable.

• Polynomial Kernel: The polynomial kernel function transforms the data into a
higher-dimensional space using a polynomial function.

123
3. Classification

Support Vector Machines (SVM)

• Radial Basis Function (RBF) Kernel: The RBF kernel is the most commonly used kernel
function in SVMs. It transforms the data into a higher-dimensional space using a Gaussian function.

• Sigmoid Kernel: The sigmoid kernel function transforms the data using a sigmoid function.

Source: https://miro.medium.com/max/621/1*oDksheYAj1eP0Be-a6r_qQ.png
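A brief sketch (assuming scikit-learn and an x_train/y_train, x_test/y_test split such as the one in the walkthrough that follows) of selecting these kernels through the kernel parameter of SVC:

from sklearn.svm import SVC

# same estimator, different kernel functions (other parameters left at sklearn defaults)
for kernel in ('linear', 'poly', 'rbf', 'sigmoid'):
    model = SVC(kernel=kernel)
    model.fit(x_train, y_train)
    print(kernel, model.score(x_test, y_test))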

124
3. Classification

SVC Walkthrough

SVC
#importing required libraries
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.metrics import classification_report

from sklearn.model_selection import train_test_split

# Loading cancer dataset from sklearn package


cancer_ds = datasets.load_breast_cancer()

125
3. Classification

SVC Walkthrough

SVC
# Getting x and y from cancer_ds
x = cancer_ds.data
y= cancer_ds.target

# splitting train and test


x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Building SVC model


svc_model = SVC()
svc_model.fit(x_train, y_train)

126
3. Classification

SVC Walkthrough

SVC

# Evaluating SVC model


y_pred = svc_model.predict(x_test)
cls_rpt = classification_report(y_test, y_pred)

print(cls_rpt)

127
3. Classification

Ensemble Techniques
• Ensemble techniques are machine learning methods that combine multiple models to
improve the accuracy and robustness of the predictions.

• Types of ensemble techniques, including:

• Bagging (Bootstrap Aggregating):


• Multiple models are trained on different subsets of the training data using
bootstrapping, a statistical sampling technique.

• The predictions of these models are combined to make the final prediction.
128
3. Classification

Ensemble Techniques
• Boosting:
• A sequence of models is trained on the same data, with each model focusing on
the samples that the previous model got wrong.

• The predictions of these models are combined to make the final prediction.

129
3. Classification

Ensemble Techniques

• Stacking:
• The predictions of multiple models are combined using another model, called a meta-
model, to make the final prediction.

• Ensemble techniques can improve the accuracy and robustness of the predictions, reduce
overfitting, and handle noisy or missing data.

• However, they can also increase the complexity and computational cost of the model.
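A short sketch (assuming scikit-learn) of the meta-model idea described above, using StackingClassifier with two base models and a logistic regression meta-model on the breast cancer dataset used elsewhere in this chapter:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# base models whose predictions become the inputs of the meta-model
base_models = [('rf', RandomForestClassifier(n_estimators=50)),
               ('knn', KNeighborsClassifier(n_neighbors=7))]

stack_model = StackingClassifier(estimators=base_models,
                                 final_estimator=LogisticRegression(max_iter=1000))
stack_model.fit(x_train, y_train)
print('Accuracy :', stack_model.score(x_test, y_test))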

130
3. Classification

Ensemble Techniques
• Ensemble techniques:
• Random Forest:
• Random Forest is a type of bagging technique

• Gradient Boosting
• Gradient Boosting is a type of boosting technique

• AdaBoost:
• AdaBoost is a type of boosting technique

131
3. Classification

Ensemble Techniques
• Random Forest:

Source: https://upload.wikimedia.org/wikipedia/commons/4/4e/Random_forest_explain.png
132
3. Classification

Random Forest Walkthrough

Random Forest

from sklearn import datasets


from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

from sklearn.model_selection import train_test_split

# Loading cancer dataset from sklearn package


cancer_ds = datasets.load_breast_cancer()

133
3. Classification

Random Forest Walkthrough

Random Forest
# Getting x and y from cancer_ds
x = cancer_ds.data
y= cancer_ds.target

# splitting train and test


x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Random Forest model


rf_model = RandomForestClassifier(n_estimators=50)
rf_model.fit(x_train, y_train)

134
3. Classification

Random Forest Walkthrough

Random Forest

# Evaluating Random Forest model


y_pred = rf_model.predict(x_test)
cls_rpt = classification_report(y_test, y_pred)

print(cls_rpt)

135
3. Classification

Ensemble Techniques
• Gradient Boosting:

Source: https://miro.medium.com/max/1400/1*jbncjeM4CfpobEnDO0ZTjw.png
136
3. Classification

Gradient Boosting Walkthrough

Gradient Boosting

from sklearn import datasets


from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

from sklearn.model_selection import train_test_split

# Loading cancer dataset from sklearn package


cancer_ds = datasets.load_breast_cancer()

137
3. Classification

Gradient Boosting Walkthrough

Gradient Boosting
# Loading cancer dataset from sklearn package
cancer_ds = datasets.load_breast_cancer()

# Getting x and y from cancer_ds


x = cancer_ds.data
y= cancer_ds.target

# splitting train and test


x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

138
3. Classification

Gradient Boosting Walkthrough

Gradient Boosting

# Building Gradient Boosting model


gb_model = GradientBoostingClassifier(n_estimators=35)
gb_model.fit(x_train, y_train)

# Evaluating Gradient Boosting model


y_pred = gb_model.predict(x_test)
cls_rpt = classification_report(y_test, y_pred)

print(cls_rpt)

139
4. Clustering

Clustering
• Clustering is an unsupervised machine learning technique

• Clustering is a machine learning technique used for grouping similar data points together
based on their characteristics or features.

• The goal of clustering is to find natural groups or clusters in the data, without prior
knowledge of the group labels.

• Clustering algorithms typically operate by measuring the similarity between data points
and assigning them to groups based on their similarity
140
4. Clustering

Clustering
• Types of clustering algorithms:
• K-means clustering:
• It partitions the data into k clusters based on their similarity.

• Hierarchical clustering:
• It creates a hierarchy of clusters by recursively merging or splitting clusters based
on their similarity.

• Density-based clustering
• It identifies clusters based on areas of high density in the data.
141
4. Clustering

K-means clustering
• The goal of k-means clustering is to partition a set of
observations into k clusters in such a way that the points
within each cluster are as similar as possible.
• The points across different clusters are as dissimilar as
possible.
• The k-means algorithm works by randomly initializing k cluster
centers, and then iteratively assigning each data point
to the nearest cluster center based on its distance.
• The algorithm then re-computes the cluster centers based on
the new assignments, and repeats the process until
convergence.

Source: https://www.gatevidyalay.com/wp-content/uploads/2020/01/K-Means-Clustering.png
142
4. Clustering

K-Means Walkthrough

K-Means

from sklearn import datasets


from sklearn.cluster import KMeans
from matplotlib import pyplot as plt

# Loading iris dataset from sklearn package


iris_ds = datasets.load_iris()

# Creating clusters with 2 centroids


k_means = KMeans(n_clusters=2)

143
4. Clustering

K-Means Walkthrough

K-Means

# Finding best k using elbow method


i_wss = []
centers = list(range(1, 11))
for center in centers:
k_means = KMeans(n_clusters=center)
k_means.fit(iris_ds.data)
i_wss.append(k_means.inertia_)

plt.plot(centers, i_wss)

144
4. Clustering

K-Means Walkthrough

K-Means

# Visualizing clusters (x is the iris data; k_means refit with the chosen k)
x = iris_ds.data
k_means = KMeans(n_clusters=2).fit(x)
plt.scatter(x[:,0], x[:,1], c=k_means.labels_)

145
4. Clustering

Hierarchical clustering
• Hierarchical clustering starts with each data point as a separate cluster and then iteratively
merges clusters based on the distance between them, until all data points are contained in a
single cluster.

• Types of hierarchical clustering:


• Agglomerative
• Divisive

146
4. Clustering

Hierarchical clustering
• Agglomerative clustering :
• Agglomerative clustering starts with each data point
as a separate cluster and iteratively merges the
closest pairs of clusters until all data points are
contained in a single cluster.

• Divisive clustering:
• Divisive clustering starts with all data points in a
single cluster and iteratively splits the cluster into
smaller clusters until each data point is contained in
a separate cluster.

Source: https://miro.medium.com/max/1039/0*afzanWwrDq9vd2g-
147
4. Clustering

Hierarchical clustering Walkthrough

Hierarchical clustering
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs, make_circles
from matplotlib import pyplot as plt

#Generating Dataset
centers = [[1, 1], [3, 3]]
ds1 = make_blobs(n_samples=500, centers=centers, cluster_std=0.4, random_state=0)
ds2 = make_circles(n_samples=500, noise=0.1, factor=0.2)

148
4. Clustering

Hierarchical clustering Walkthrough

Hierarchical clustering
agg_clstr = AgglomerativeClustering(n_clusters=2)
x = ds2[0]
agg_clstr.fit(x)

plt.scatter(x[:,0], x[:,1], c= agg_clstr.labels_)


plt.show()

149
4. Clustering

Density-based clustering
• Density-based clustering is a clustering technique that identifies
clusters based on the density of data points in the feature space.

• It is particularly useful when dealing with data that has complex


and irregular cluster shapes or when there is no prior knowledge
about the number of clusters in the data.

• The main idea behind density-based clustering is to group together


data points that are close to each other and have a high density of
nearby points, while separating points that have low densities.

Source: https://media.geeksforgeeks.org/wp-content/uploads/fig-1-300x300.jpg
150
4. Clustering

Density-based clustering
• The key parameter in density-based clustering is the minimum number of data points required to
form a cluster, known as the minimum cluster size or the minimum points threshold.

• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the one popular
density-based clustering algorithm.

• DBSCAN works by defining a radius around each data point and counting the number of data
points within that radius.

151
4. Clustering

Density-based clustering
• A point is considered to be a core point if there are at least a specified minimum number of points
(the minimum points threshold) within its radius.

• If a point is not a core point but is within the radius of a core point, it is considered a border
point. All other points that do not meet either of these criteria are classified as noise points.

• DBSCAN then forms clusters by connecting core points that are within each other's radius, and
any border points that are within the radius of a core point.

• DBSCAN also allows for the detection of noise points, which are data points that do not belong
to any cluster.
152
4. Clustering

DBSCAN Walkthrough

DBSCAN
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs, make_circles
from matplotlib import pyplot as plt

#Generating Dataset
centers = [[1, 1], [3, 3]]
ds1 = make_blobs(n_samples=500, centers=centers, cluster_std=0.4, random_state=0)
ds2 = make_circles(n_samples=500, noise=0.1, factor=0.2)

153
4. Clustering

DBSCAN Walkthrough

DBSCAN
dbs = DBSCAN(eps=0.2, min_samples=5)
x = ds2[0]
dbs.fit(x)

plt.scatter(x[:,0], x[:,1], c=dbs.labels_)


plt.show()

154
5. Principal Component Analysis

Principal Component Analysis(PCA)


• Principal Component Analysis is a widely used technique in dimensionality reduction.
• PCA is used to transform high-dimensional data into a lower-dimensional representation that
captures most of the variability of the original data.
• In PCA, a set of orthogonal basis vectors, called principal components, are calculated to represent
the data in a way that minimizes the information loss.
• These principal components are linear combinations of the original variables
• The first principal component accounts for the largest amount of variability in the data
• The second principal component accounts for the second largest amount of variability, and so on.

155
5. Principal Component Analysis

Principal Component Analysis(PCA)

Source: https://www.analytixlabs.co.in/blog/wp-content/uploads/2021/05/Blog-Image-1.jpg
156
5. Principal Component Analysis

Principal Component Analysis(PCA)


• PCA Steps:

• Standardize the data by subtracting the mean and dividing by the standard deviation.
• Calculate the covariance matrix of the standardized data.
• Calculate the eigenvectors and eigenvalues of the covariance matrix.
• Choose the first k eigenvectors with the largest eigenvalues to form the basis of the
lower-dimensional subspace.
• Multiply the standardized data by the eigenvectors
• Select the first k components

157
5. Principal Component Analysis

PCA Walkthrough

PCA
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris_ds = datasets.load_iris()
x = iris_ds.data

pca_2 = PCA(n_components=2)
x_std = StandardScaler().fit_transform(x)   # standardize before PCA
pca_2_x = pca_2.fit_transform(x_std)
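A small follow-up sketch: the explained_variance_ratio_ attribute of the fitted PCA shows how much of the total variability each selected component captures, which can guide the choice of the number of components.

# proportion of variance captured by each principal component
print(pca_2.explained_variance_ratio_)
print('Total variance captured :', pca_2.explained_variance_ratio_.sum())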

158
