
RISHI MS INSTITUTE OF ENGINEERING &

TECHNOLOGY FOR WOMEN


(Approved by AICTE, New Delhi and Affiliated to JNTUH)
Nizampet Cross Road, JNTUH Kukatpally Hyderabad–500085

LABORATORY MANUAL
MACHINE LEARNING LAB
(R22 Regulations)

For

B. Tech
III Year

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


&
INFORMATION TECHNOLOGY
INDEX

S.NO   TOPIC

I      List of Experiments

II     V/M/POs/PSOs/PEOs

III    Syllabus

IV     Course Objectives & Course Outcomes

List of Experiments

Exp. No.  Experiment Name
1   Write a Python program to compute Central Tendency Measures: Mean, Median, Mode; Measures of Dispersion: Variance, Standard Deviation.
2   Study of Python Basic Libraries such as Statistics, Math, Numpy and Scipy.
3   Study of Python Libraries for ML applications such as Pandas and Matplotlib.
4   Write a Python program to implement Simple Linear Regression.
5   Implementation of Multiple Linear Regression for House Price Prediction using sklearn.
6   Implementation of Decision Tree using sklearn and its parameter tuning.
7   Implementation of KNN using sklearn.
8   Implementation of Logistic Regression using sklearn.
9   Implementation of K-Means Clustering.
10  Performance analysis of Classification Algorithms on a specific dataset (Mini Project).

RISHI M.S. INSTITUTE OF ENGINEERING & TECHNOLOGY FOR
WOMEN
(Affiliated to JNTUH University, Approved by AICTE)
Department of
Information Technology & Computer Science and Engineering

Vision of the institution:

To be a center of excellence in producing women engineers and scientists who are professionally competent social leaders, able to face a multi-disciplinary global environment, by imparting quality technical education, values and ethics through innovative methods of teaching and learning.

Mission of the institution:

 To promote women technocrats capable of resolving the problems faced by society using the knowledge imparted.
 To prepare self-reliant women engineers for the technological growth of the nation and society by laying a strong theoretical foundation accompanied by wide practical training.
 To equip young women with creative thinking capabilities and empower them towards innovation.

RISHI M.S. INSTITUTE OF ENGINEERING & TECHNOLOGY FOR
WOMEN
(Affiliated to JNTUH University, Approved by AICTE)
Department of
Information Technology & Computer Science and Engineering

Vision & Mission of Department

Vision of the department

To empower women by providing cutting-edge technology to female technocrats in the field of Information Technology, allowing them to develop into competent engineers and entrepreneurs.

Mission of the department

 Adopting creative techniques to nurture and strengthen the core skill of Computer
Science.
 Introduce students to the most recent technological advancements.
 Impart quality education; improve the research, entrepreneurial, and employability
skills of women technocrats.
 Instill professional ethics and a sense of social responsibility in students.
 Strengthen the Industry-Academia interface, which will enable graduates to
emerge as academic leaders or inspiring entrepreneurs

RISHI M.S. INSTITUTE OF ENGINEERING & TECHNOLOGY FOR
WOMEN
(Affiliated to JNTUH University, Approved by AICTE)
Department of
Information Technology & Computer Science and Engineering

Program Outcomes (POs)

1. Engineering Knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.
2. Problem Analysis: Identify, formulate, review research literature, and
analyze complex engineering problems reaching substantiated conclusions
using first principles of mathematics, natural sciences, and engineering
sciences.
3. Design/Development of Solutions: Design solutions for complex
engineering problems and design system components or processes that meet
the specified needs with appropriate consideration for the public health and
safety, and the cultural, societal, and environmental considerations.
4. Conduct Investigations of Complex Problems: Use research-based
knowledge and research methods including design of experiments, analysis
and interpretation of data, and synthesis of the information to provide valid
conclusions.
5. Modern Tool Usage: Create, select and apply appropriate techniques,
resources and modern engineering and IT tools including prediction and
modeling to complex engineering activities with an understanding of the
limitations.
6. The Engineer and Society: Apply reasoning informed by the contextual
knowledge to assess societal, health, safety, legal and cultural issues and
the consequent responsibilities relevant to the professional engineering
practice.
7. Environment and Sustainability: Understand the impact of the
professional engineering solutions in societal and environmental contexts,
and demonstrate the knowledge of and need for sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and
responsibilities and norms of the engineering practice.
9. Individual and Team Work: Function effectively as an individual, and as
a member or leader in diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as, being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.
11. Project Management and Finance: Demonstrate knowledge and
understanding of the engineering and management principles and apply
these to one’s own work, as a member and leader in a team, to manage
projects and in multidisciplinary environments.
12. Life-long Learning: Recognize the need for, and have the preparation and ability to engage in independent and life-long learning in the broadest context of technological change.

RISHI M.S. INSTITUTE OF ENGINEERING & TECHNOLOGY FOR
WOMEN
(Affiliated to JNTUH University, Approved by AICTE)
Department of
Information Technology & Computer Science and Engineering

Program specific outcomes (PSOs)

PSO 1: Improve the student's ability to decipher the basic principles and methodology
of computer systems. Improve the student’s ability to absorb facts and
technical ideas in order to build and develop software.
PSO2: The capacity to create novel job routes as an entrepreneur using modern
computer languages and evolving technologies like SDLC, Python,
Machine Learning, Social Networks, Cyber Security, Mobile Apps etc.

RISHI M.S. INSTITUTE OF ENGINEERING & TECHNOLOGY FOR
WOMEN
(Affiliated to JNTUH University, Approved by AICTE)
Department of
Information Technology & Computer Science and Engineering

Program educational objectives (PEOs)

PEO-1: Engineering graduates with excellent fundamental and technical skills will have successful careers in industry, meeting the needs of Indian and worldwide firms.

PEO-2: With determination, development, self-reliance, leadership, and moral principles, engineering graduates will become successful entrepreneurs who will encourage employability.

PEO-3: To support personal and organizational progress, engineering graduates will pursue higher education and engage in lifelong learning.

SYLLABUS
MACHINE LEARNING LAB
B. Tech III Year

Week-1:
1. Write a python program to compute Central Tendency Measures: Mean, Median, Mode
Measure of Dispersion: Variance, Standard Deviation.

2. Study of Python Basic Libraries such as Statistics, Math, Numpy and Scipy.

Week-2:
3. Study of Python Libraries for ML application such as Pandas and Matplotlib.
4. Write a Python program to implement Simple Linear Regression

Week-3:

5. Implementation of Multiple Linear Regression for House Price Prediction using sklearn.

6. Implementation of Decision tree using sklearn and its parameter tuning.


Week-4:

7. Implementation of KNN using sklearn


8. Implementation of Logistic Regression using sklearn

Week-5:

9. Implementation of K-Means Clustering


10. Performance analysis of Classification Algorithms on a specific dataset (Mini Project)

TEXT BOOK:

1. Machine Learning – Tom M. Mitchell, MGH.

REFERENCE BOOK:
1. Machine Learning: An Algorithmic Perspective, Stephen Marsland, Taylor & Francis.

Course Objectives:

1. The objective of this lab is to give an overview of various machine learning techniques and to demonstrate them using Python.

Course Outcomes: After learning the contents of this course the student will be able to
 Understand modern notions in predictive data analysis
 Select data, perform model selection, manage model complexity and identify the trends
 Understand a range of machine learning algorithms along with their strengths and weaknesses
 Build predictive models from data and analyze their performance

CO-PO MAPPING:

CO PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12

CO1 2 3 3 2 2 2 2

MACHINE CO2 2 3 1 1 3 2 3
LEARNING
CO3 2 2 3 3 3 3 1

CO-PSO MAPPING:
PSO-1 PSO-2
CO1 3 2
CO2 3 2
CO3 3 3

WEEK-1

1. Write a Python program to compute Central Tendency Measures: Mean, Median, Mode; and Measures of Dispersion: Variance, Standard Deviation.
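
A minimal sketch of such a program, using Python's built-in statistics module on a small hand-typed sample list (the data values below are assumptions chosen only for illustration):

import statistics

# Sample data (assumed values for illustration)
data = [12, 15, 12, 18, 20, 22, 12, 25, 30, 15]

# Central tendency measures
print("Mean   :", statistics.mean(data))
print("Median :", statistics.median(data))
print("Mode   :", statistics.mode(data))

# Measures of dispersion (population variance and standard deviation)
print("Variance          :", statistics.pvariance(data))
print("Standard deviation:", statistics.pstdev(data))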

2. Study of Python Basic Libraries such as Statistics, Math,
Numpy and Scipy
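
A short sketch that exercises each of the four libraries could look like the following (the sample values are assumptions):

import math
import statistics
import numpy as np
from scipy import stats

values = [4, 8, 15, 16, 23, 42]   # assumed sample data

# math: scalar mathematical functions
print(math.sqrt(25), math.factorial(5), math.pi)

# statistics: descriptive statistics on plain Python lists
print(statistics.mean(values), statistics.stdev(values))

# numpy: vectorized operations on arrays
arr = np.array(values)
print(arr.mean(), arr.std(), arr.cumsum())

# scipy: higher-level scientific routines built on numpy
print(stats.describe(arr))
print(stats.ttest_1samp(arr, popmean=18))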

WEEK-2

3. Study of Python Libraries for ML applications such as Pandas and Matplotlib.
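
A brief sketch of the kind of Pandas and Matplotlib usage this experiment covers, built on a small hand-made DataFrame (the column names and values are assumptions):

import pandas as pd
import matplotlib.pyplot as plt

# Build a small DataFrame by hand (assumed values)
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "marks":         [35, 45, 50, 62, 70, 78],
})

print(df.head())        # inspect the first rows
print(df.describe())    # summary statistics per column

# Simple visualizations with Matplotlib via the DataFrame plotting API
df.plot(x="hours_studied", y="marks", kind="scatter", title="Marks vs hours studied")
plt.show()

df["marks"].plot(kind="hist", bins=5, title="Distribution of marks")
plt.show()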

4. Write a Python program to implement Simple
Linear Regression.
Implementation of Linear Regression using Python
Linear regression is a statistical technique for describing the relationship between a dependent variable and one or more independent variables. This tutorial discusses the basic concepts of linear regression as well as its application in Python.

To understand the basics of linear regression, we begin with its most basic form, i.e., "simple linear regression".

Simple Linear Regression


Simple linear regression (SLR) is a method to predict a response using a single feature. It assumes that the two variables are linearly related. Thus, we try to find a linear function that predicts the response value (y) as accurately as possible from the feature, or independent variable (x).

Let's consider a dataset in which we have a number of responses y per


feature x:

For simplification, we define:

x as feature vector, i.e., x = [x1, x2, x3, …., xn],

y as response vector, i.e., y = [y1, y2, y3 …., yn]

for n observations (in above example, n = 10).

A scatter plot of the above dataset looks like: -

This line is referred to as the regression line.

The equation of the regression line can be written as:

h(xi) = β0 + β1xi

Here,

o h(xi) signifies the predicted response value for the i-th observation.
o β0 and β1 are the regression coefficients and represent the y-intercept and slope of the regression line, respectively.

In order to build our model, we need to "learn" or estimate the values of the regression coefficients β0 and β1. Once we have determined those coefficients, we can use the model to forecast responses!

In this tutorial, we're going to employ the concept of Least Squares.

Let's consider:

yi = β0 + β1xi + εi = h(xi) + εi  ⇒  εi = yi − h(xi)

Here, εi is the residual error in the i-th observation.

So, our goal is to minimize the total residual error.

We define the cost function, or squared error, J as the sum of squared residuals:

J(β0, β1) = Σ εi²  (summed over i = 1 … n)

and our task is to find the values of β0 and β1 for which J(β0, β1) is minimum.

Without going into the mathematical details, the least-squares result is:

b1 = SSxy / SSxx,   b0 = ȳ − b1·x̄

where SSxy is the sum of cross deviations of "y" and "x":

SSxy = Σ xi·yi − n·x̄·ȳ

and SSxx is the sum of squared deviations of "x":

SSxx = Σ xi² − n·x̄²


CODE:

import numpy as nmp
import matplotlib.pyplot as mtplt

def estimate_coeff(p, q):
    # Number of observations
    n1 = nmp.size(p)
    # Means of the p and q vectors
    m_p = nmp.mean(p)
    m_q = nmp.mean(q)

    # Cross-deviation and deviation about p
    SS_pq = nmp.sum(q * p) - n1 * m_q * m_p
    SS_pp = nmp.sum(p * p) - n1 * m_p * m_p

    # Regression coefficients
    b_1 = SS_pq / SS_pp
    b_0 = m_q - b_1 * m_p

    return (b_0, b_1)

def plot_regression_line(p, q, b):
    # Plot the actual observations as a scatter plot
    mtplt.scatter(p, q, color="m", marker="o", s=30)

    # Predicted response vector
    q_pred = b[0] + b[1] * p

    # Plot the regression line
    mtplt.plot(p, q_pred, color="g")

    # Axis labels
    mtplt.xlabel('p')
    mtplt.ylabel('q')

    # Show the plot
    mtplt.show()

def main():
    # Observation points (data)
    p = nmp.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
    q = nmp.array([11, 13, 12, 15, 17, 18, 18, 19, 20, 22])

    # Estimate the coefficients
    b = estimate_coeff(p, q)
    print("Estimated coefficients are :\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))

    # Plot the regression line
    plot_regression_line(p, q, b)

if __name__ == "__main__":
    main()

Output:

Estimated coefficients are :


b_0 = -0.4606060606060609
b_1 = 1.1696969696969697
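
As a cross-check, the same fit can be reproduced with scikit-learn's LinearRegression on the data points used above; a minimal sketch (the expected values are the ones printed by the program above):

import numpy as np
from sklearn.linear_model import LinearRegression

p = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]).reshape(-1, 1)  # feature as a column vector
q = np.array([11, 13, 12, 15, 17, 18, 18, 19, 20, 22])

model = LinearRegression().fit(p, q)
print("b_0 =", model.intercept_)   # approx. -0.46, matching the output above
print("b_1 =", model.coef_[0])     # approx. 1.17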

WEEK-3

5. Implementation of Multiple Linear Regression for House Price Prediction using sklearn

Multiple linear regression refers to a statistical technique that is used to predict the outcome of a
variable based on the value of two or more variables. It is sometimes known simply as multiple
regression, and it is an extension of linear regression. The variable that we want to predict is known
as the dependent variable, while the variables we use to predict the value of the dependent variable
are known as independent or explanatory variables.

Multiple linear regression is used to estimate the relationship between two or more independent
variables and one dependent variable. You can use multiple linear regression when you want to
know:

How strong the relationship is between two or more independent variables and one dependent
variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
The value of the dependent variable at a certain value of the independent variables (e.g. the
expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).
Multiple Linear Regression Formula

yi = β0 + β1xi1 + β2xi2 + … + βpxip + ϵ

Where:

yi is the dependent or predicted variable
β0 is the y-intercept, i.e., the value of y when all the independent variables are 0.
β1 and β2 are the regression coefficients that represent the change in y relative to a one-unit change in xi1 and xi2, respectively.
βp is the slope coefficient for each independent variable
ϵ is the model's random error (residual) term.
Assumptions of Multiple Linear Regression
Multiple linear regression is based on the following assumptions:

1. A linear relationship between the dependent and independent variables

2. The independent variables are not highly correlated with each other

3. The variance of the residuals is constant

4. Independence of observation

Going forward, let's see how to implement house price prediction using the multiple linear regression algorithm.

Import libraries

# Import numpy and pandas package


import pandas as pd
import numpy as np
# Data visualization
from matplotlib import pyplot as plot
import statsmodels.api as sm
import seaborn as sns

Reading the dataset
data = pd.read_csv(r'Housing.csv')

Data Inspection
data.head(5)

Display first 5 records


data.info()

data.describe()

Describe the descriptive statistics for the dataset

data.shape

Total number of rows & columns

Data Cleaning

check if any null data present in the dataset


data.isnull().sum()

NULL() check

Finally, there is no null data present in the dataset, so there is no need to replace any values.

Detect Outliers

Outliers are extreme values that fall a long way outside of the other observations.

We create a separate function to detect outliers in the dataset, using box plots from the Seaborn library.

def detectOutliers():
    fig, axs = plot.subplots(2, 3, figsize=(10, 5))
    plt1 = sns.boxplot(data['price'], ax=axs[0, 0])
    plt2 = sns.boxplot(data['area'], ax=axs[0, 1])
    plt3 = sns.boxplot(data['bedrooms'], ax=axs[0, 2])
    plt1 = sns.boxplot(data['bathrooms'], ax=axs[1, 0])
    plt2 = sns.boxplot(data['stories'], ax=axs[1, 1])
    plt3 = sns.boxplot(data['parking'], ax=axs[1, 2])
    plot.tight_layout()

detectOutliers()

Outlier Detection

Price and area have considerable outliers. The next step is to drop the outliers.

# Outlier reduction for price
plot.boxplot(data.price)
Q1 = data.price.quantile(0.25)
Q3 = data.price.quantile(0.75)
IQR = Q3 - Q1
data = data[(data.price >= Q1 - 1.5*IQR) & (data.price <= Q3 + 1.5*IQR)]

# Outlier reduction for area
plot.boxplot(data.area)
Q1 = data.area.quantile(0.25)
Q3 = data.area.quantile(0.75)
IQR = Q3 - Q1
data = data[(data.area >= Q1 - 1.5*IQR) & (data.area <= Q3 + 1.5*IQR)]

To verify whether any outliers still exist:

detectOutliers()

Data Visualization
sns.pairplot(data)
plot.show()

Pairplot

The next step is visualizing the categorical variables.
plot.figure(figsize=(20, 12))
plot.subplot(3,3,1)
sns.boxplot(x='mainroad', y='price', data=data)
plot.subplot(3,3,2)
sns.boxplot(x='guestroom', y='price', data=data)
plot.subplot(3,3,3)
sns.boxplot(x='basement', y='price', data=data)
plot.subplot(3,3,4)
sns.boxplot(x='hotwaterheating', y='price', data=data)
plot.subplot(3,3,5)
sns.boxplot(x='airconditioning', y='price', data=data)
plot.subplot(3,3,6)
sns.boxplot(x='furnishingstatus', y='price', data=data)
plot.show()
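
To complete the experiment, the regression itself can be fitted with sklearn; a minimal sketch, assuming the cleaned data DataFrame from above and that the yes/no categorical columns are mapped to 1/0 (the exact feature list is an assumption based on the columns plotted above):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Map binary yes/no columns to 1/0 (column names assumed from the plots above)
binary_cols = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning']
for col in binary_cols:
    data[col] = data[col].map({'yes': 1, 'no': 0})

# Features and target
X = data[['area', 'bedrooms', 'bathrooms', 'stories', 'parking'] + binary_cols]
y = data['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Intercept:", model.intercept_)
print("Coefficients:", dict(zip(X.columns, model.coef_)))
print("R^2 on test data:", r2_score(y_test, y_pred))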

6. Implementation of Decision tree using sklearn and its parameter
tuning
Decision Tree Classifiers
The diagram below demonstrates how decision trees work to make decisions. The top node is
called the root node. Each of the decision points are called decision nodes. The final decision point
is referred to as a leaf node.

Beyond this, decision trees are great algorithms because:

 They’re generally faster to train than other algorithms such as neural networks
 Their complexity is a by-product of the data’s attributes and dimensions
 It’s a non-parametric method meaning that they do not depend on probability distribution
assumptions
 They can handle high dimensional data with high degrees of accuracy

How do Decision Tree Classifiers Work?

Gini Impurity refers to a measurement of the likelihood of incorrect classification of a new


instance of a random variable if that instance was randomly classified according to the distribution
of class labels from the dataset.

Ok, that sentence was a mouthful! The Gini Impurity measures the likelihood that an item will be misclassified if it's randomly assigned a class based on the data's distribution. To generalize this to a formula, we can write:

Gini = 1 − Σ pi²

where pi is the proportion of items in the data belonging to class i (this is exactly what the function below computes).

We can calculate the impurity using this Python function:

# Calculating Gini Impurity of a Pandas DataFrame Column
from collections import Counter

def gini_impurity(column):
    # column is a Pandas Series of class labels
    impurity = 1
    counters = Counter(column)
    for value in column.unique():
        impurity -= (counters[value] / len(column)) ** 2

    return impurity

Using Decision Tree Classifiers in Python’s Sklearn

# Downloading and exploring the Titanic dataset
import pandas as pd
data = pd.read_csv(
    'https://github.com/datagy/data/raw/main/titanic.csv',
    usecols=['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'])
data = data.dropna()

print(data.head())

# Returns:
#    Survived  Pclass     Sex   Age  SibSp  Parch     Fare Embarked
# 0         0       3    male  22.0      1      0   7.2500        S
# 1         1       1  female  38.0      1      0  71.2833        C
# 2         1       3  female  26.0      0      0   7.9250        S
# 3         1       1  female  35.0      1      0  53.1000        S
# 4         0       3    male  35.0      0      0   8.0500        S

Let’s better understand the distribution of the data by plotting a pairplot using Seaborn. We’ll temporarily
load the target feature into the DataFrame to be able to color points based on whether people survived.

# Plotting a Pairplot of Titanic Data


import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv(
'https://github.com/datagy/data/raw/main/titanic.csv',
usecols=['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'])
data = data.dropna()

sns.pairplot(data=data, hue='Survived')
plt.show()

A pairplot of the Titanic Dataset

Before we dive much further, let's first drop a few more variables. In particular, we'll drop all the non-numeric variables for now. Machine learning models tend to require numerical columns to work. We'll focus on these later, but for now we'll keep things simple:

# Loading only numeric columns


import pandas as pd

data = pd.read_csv(
'https://github.com/datagy/data/raw/main/titanic.csv',
usecols=['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare'])
data = data.dropna()

X = data.copy()
y = X.pop('Survived')

In the code above, we loaded only the numeric columns (by removing 'Sex' and 'Embarked'). Then, we split the data into two variables:

X: our features matrix (because it’s a matrix, it’s denoted with a capital letter)
y: our target variable

Splitting Data into Training and Testing Data in Sklearn

Let’s first load the function and then see how we can apply it to our data:

# Splitting data into training and testing data


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 100)

Understanding DecisionTreeClassifier in Sklearn


In this section, we’ll explore how the DecisionTreeClassifier class works in Sklearn.
We can import the class from the tree module. Let’s see how we can import the
class and explore its different parameters:
# How to Import the DecisionTreeClassifer Class
from sklearn.tree import DecisionTreeClassifier
DecisionTreeClassifier(
*,
criterion='gini',
splitter='best',
max_depth=None,
min_samples_split=2,
min_samples_leaf=1,
min_weight_fraction_leaf=0.0,
max_features=None,
random_state=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
class_weight=None,
ccp_alpha=0.0
)

Now, let’s see how we can build our first decision tree classifier using Sklearn!

# Creating Our First Decision Tree Classifier


from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

In the code above we accomplished two critical things (in very few lines of code):
We created our Decision Tree Classifier model and assigned it to the variable clf
We then applied the .fit() method to train the model. In order to do this, we passed in our training
data.
Scikit-Learn takes care of making all the decisions for us (for better or worse!). Now, let’s see
how we can make predictions with this newly created model:
# Making Predictions with Our Model
predictions = clf.predict(X_test)
print(predictions[:5])
Let’s break down what we did in the code above:
We assigned a new variable, predictions, which takes the values from applying the .predict()
method to our model clf.
We make predictions based on our X_test data

Validating a Decision Tree Classifier Algorithm in Python’s Sklearn


# Attempting to build a model with non-numeric data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = pd.read_csv(
'https://github.com/datagy/data/raw/main/titanic.csv',
usecols=['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex', 'Embarked'])
X = X.dropna()
y = X.pop('Survived')

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 100)

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Raises
# ValueError: could not convert string to float: 'female'

By one-hot encoding the non-numeric columns, we can safely use them. Let's see how we can use Python and Scikit-Learn to convert our columns to their one-hot encoded equivalents.

# One-hot encoding our data


from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

column_transformer = make_column_transformer(
(OneHotEncoder(), ['Sex', 'Embarked']),
remainder='passthrough')

X_train = column_transformer.fit_transform(X_train)
X_train = pd.DataFrame(data=X_train, columns=column_transformer.get_feature_names())

Let’s break down what we did here:

We imported the OneHotEncoder() class and the make_column_transformer function


We created a column transformer object
We then apply the .fit_transform() method to simultaneously fit and transform the column
transformations on the X_train dataset
We then converted the dataset back into a Pandas DataFrame
Let’s see how we can now use our dataset to make classifications using a Decision Tree Classifier
in Scikit-Learn:

# Making Predictions with One-Hot Encoded Values


X_test = column_transformer.transform(X_test)
X_test = pd.DataFrame(data=X_test, columns=column_transformer.get_feature_names())

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))

# Returns: 0.775

Let’s see how we can make this work:

# Creating a dictionary of parameters to use in GridSearchCV


from sklearn.model_selection import GridSearchCV

params = {
'criterion': ['gini', 'entropy'],
'max_depth': [None, 2, 4, 6, 8, 10],
'max_features': [None, 'sqrt', 'log2', 0.2, 0.4, 0.6, 0.8],
'splitter': ['best', 'random']
}

clf = GridSearchCV(
estimator=DecisionTreeClassifier(),
param_grid=params,
cv=5,
n_jobs=5,
verbose=1,
)

clf.fit(X_train, y_train)
print(clf.best_params_)
This returns the following dictionary:

# The best parameters


{
'criterion': 'entropy',
'max_depth': 4,
'max_features': 0.6,
'splitter': 'best'
}

WEEK-4
7. Implementation of KNN using sklearn.

The Idea Behind K-Nearest Neighbours Algorithm


Our behavior is shaped by the companions we grew up with. Our parents also shape our personalities in
various ways. If you grow up among folks who enjoy sports, it is highly likely that you will end up loving
sports. There are of course exceptions. KNN works similarly.

If you have a close buddy and spend most of your time with him/her, you will end up having similar
interests and loving same things. That is kNN with k=1.
If you constantly hang out with a group of 5, each one in the group has an impact on your behavior and you
will end up becoming the average of 5. That is kNN with k=5.
kNN classifier identifies the class of a data point using the majority voting principle. If k is set to 5, the
classes of 5 nearest points are examined. Prediction is done according to the predominant class. Similarly,
kNN regression takes the mean value of 5 nearest locations.

We can easily see which people are close to each other, but how is closeness measured between data points? The distance between data points is measured. There are various techniques to estimate the distance. Euclidean distance (Minkowski distance with p=2) is one of the most regularly used distance measurements. The graphic below explains how to compute the Euclidean distance between two points in a 2-dimensional space. It is determined using the squares of the differences between the x and y coordinates of the points.
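
In 2-D this is d = sqrt((x1 − x2)² + (y1 − y2)²); a one-line NumPy version, with two arbitrarily chosen points as an example:

import numpy as np

a = np.array([2.0, 3.0])   # point 1 (assumed)
b = np.array([5.0, 7.0])   # point 2 (assumed)

distance = np.sqrt(np.sum((a - b) ** 2))   # equivalent to np.linalg.norm(a - b)
print(distance)                            # 5.0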

Implementation of KNN Algorithm in Python
Let’s now get into the implementation of KNN in Python. We’ll go over the steps to help you break the
code down and make better sense of it.
1. Importing the modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
2. Creating Dataset
Scikit-learn has a lot of tools for creating synthetic datasets, which are great for testing machine learning
algorithms. I’m going to utilize the make blobs method.

X, y = make_blobs(n_samples=500, n_features=2, centers=4, cluster_std=1.5, random_state=4)
This code generates a dataset of 500 samples separated into four classes with a total of two characteristics.
Using associated parameters, you may quickly change the number of samples, characteristics, and classes.
We may also change the distribution of each cluster (or class).

3. Visualize the Dataset


plt.style.use('seaborn')
plt.figure(figsize = (10,10))
plt.scatter(X[:,0], X[:,1], c=y, marker= '*',s=100,edgecolors='black')
plt.show()

4. Splitting Data into Training and Testing Datasets

It is critical to partition a dataset into train and test sets for every supervised machine learning
method. We first train the model and then put it to the test on various portions of the dataset. If
we don’t separate the data, we’re simply testing the model with data it already knows. Using the
train_test_split method, we can simply separate the tests.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

With the train_size and test_size options, we can determine how much of the original data is used for the train and test sets, respectively. The default split is 75% for the train set and 25% for the test set.

5. KNN Classifier Implementation


After that, we’ll build a kNN classifier object. I develop two classifiers with k values of 1 and 5 to
demonstrate the relevance of the k value. The models are then trained using a train set. The k value is
chosen using the n_neighbors argument. It does not need to be explicitly specified because the default
value is 5.

knn5 = KNeighborsClassifier(n_neighbors = 5)
knn1 = KNeighborsClassifier(n_neighbors=1)

6.Predictions for the KNN Classifiers

Then, in the test set, we forecast the target values and compare them to the actual values.

knn5.fit(X_train, y_train)
knn1.fit(X_train, y_train)

y_pred_5 = knn5.predict(X_test)
y_pred_1 = knn1.predict(X_test)

7. Predict Accuracy for both k values


from sklearn.metrics import accuracy_score
print("Accuracy with k=5", accuracy_score(y_test, y_pred_5)*100)
print("Accuracy with k=1", accuracy_score(y_test, y_pred_1)*100)

The accuracy for the values of k comes out as follows:

Accuracy with k=5 93.60000000000001


Accuracy with k=1 90.4

8. Visualize Predictions
Let’s view the test set and predicted values with k=5 and k=1 to see the influence of k values.
plt.figure(figsize = (15,5))

plt.subplot(1,2,1)

plt.scatter(X_test[:,0], X_test[:,1], c=y_pred_5, marker= '*', s=100,edgecolors='black')

plt.title("Predicted values with k=5", fontsize=20)

plt.subplot(1,2,2)

plt.scatter(X_test[:,0], X_test[:,1], c=y_pred_1, marker= '*', s=100,edgecolors='black')

plt.title("Predicted values with k=1", fontsize=20)

plt.show()

Visualize Predictions KNN

8. Implementation of Logistic Regression using sklearn.

Logistic regression, despite its name, is a classification algorithm rather than regression algorithm.
Based on a given set of independent variables, it is used to estimate discrete value (0 or 1, yes/no,
true/false). It is also called logit or MaxEnt Classifier.

Basically, it measures the relationship between the categorical dependent variable and one or more
independent variables by estimating the probability of occurrence of an event using its logistics
function.

sklearn.linear_model.LogisticRegression is the module used to implement logistic regression.

Parameters
Following table lists the parameters used by Logistic Regression module −

Sr.No Parameter & Description


penalty − str, ‘l1’, ‘l2’, ‘elasticnet’ or ‘none’, optional, default =
‘l2’
1
This parameter is used to specify the norm (L1 or L2) used in
penalization (regularization).
dual − Boolean, optional, default = False
2 It is used for dual or primal formulation whereas dual formulation is
only implemented for L2 penalty.
tol − float, optional, default=1e-4
3
It represents the tolerance for stopping criteria.
C − float, optional, default=1.0
4 It represents the inverse of regularization strength, which must
always be a positive float.
fit_intercept − Boolean, optional, default = True
5 This parameter specifies that a constant (bias or intercept) should be
added to the decision function.
intercept_scaling − float, optional, default = 1
This parameter is useful when
6
 the solver ‘liblinear’ is used
 fit_intercept is set to true
class_weight − dict or ‘balanced’ optional, default = none
It represents the weights associated with classes. If we use the
default option, it means all the classes are supposed to have weight
7
one. On the other hand, if you choose class_weight: balanced, it
will use the values of y to automatically adjust weights.

8 random_state − int, RandomState instance or None, optional,
default = none
This parameter represents the seed of the pseudo random number
generated which is used while shuffling the data. Followings are the
options
 int − in this case, random_state is the seed used by random
number generator.
 RandomState instance − in this case, random_state is the
random number generator.
 None − in this case, the random number generator is the
RandonState instance used by np.random.
solver − str, {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’},
optional, default = ‘liblinear’
This parameter represents which algorithm to use in the
optimization problem. Followings are the properties of options
under this parameter −
 liblinear − It is a good choice for small datasets. It also
handles L1 penalty. For multiclass problems, it is limited to
one-versus-rest schemes.
9
 newton-cg − It handles only L2 penalty.
 lbfgs − For multiclass problems, it handles multinomial loss.
It also handles only L2 penalty.
 saga − It is a good choice for large datasets. For multiclass
problems, it also handles multinomial loss. Along with L1
penalty, it also supports ‘elasticnet’ penalty.
 sag − It is also used for large datasets. For multiclass
problems, it also handles multinomial loss.
max_iter − int, optional, default = 100
10 As name suggest, it represents the maximum number of iterations
taken for solvers to converge.
multi_class − str, {‘ovr’, ‘multinomial’, ‘auto’}, optional, default =
‘ovr’
 ovr − For this option, a binary problem is fit for each label.
 multinomial − For this option, the loss minimized is the
11
multinomial loss fit across the entire probability distribution.
We can’t use this option if solver = ‘liblinear’.
 auto − This option will select ‘ovr’ if solver = ‘liblinear’ or
data is binary, else it will choose ‘multinomial’.
verbose − int, optional, default = 0
By default, the value of this parameter is 0 but for liblinear and
12 lbfgs solver we should set verbose to any positive number.

13 warm_start − bool, optional, default = false
With this parameter set to True, we can reuse the solution of the
previous call to fit as initialization. If we choose default i.e. false, it
will erase the previous solution.
n_jobs − int or None, optional, default = None
If multi_class = ‘ovr’, this parameter represents the number of CPU
14
cores used when parallelizing over classes. It is ignored when
solver = ‘liblinear’.
l1_ratio − float or None, optional, default = None
15 It is used in case when penalty = ‘elasticnet’. It is basically the
Elastic-Net mixing parameter with 0 <= l1_ratio <= 1.

Attributes
The following table lists the attributes used by the Logistic Regression module −
Sr.No Attributes & Description
coef_ − array, shape(n_features,) or (n_classes,
n_features)
1 It is used to estimate the coefficients of the features
in the decision function. When the given problem is
binary, it is of the shape (1, n_features).
Intercept_ − array, shape(1) or (n_classes)
2 It represents the constant, also known as bias,
added to the decision function.
classes_ − array, shape(n_classes)
3 It will provide a list of class labels known to the
classifier.
n_iter_ − array, shape (n_classes) or (1)
4 It returns the actual number of iterations for all the
classes.
Implementation Example
Following Python script provides a simple example of implementing logistic regression on iris
dataset of scikit-learn −

from sklearn import linear_model
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
LRG = linear_model.LogisticRegression(
    random_state=0, solver='liblinear', multi_class='auto'
).fit(X, y)
LRG.score(X, y)
Output
0.96
The output shows that the above Logistic Regression model gave an accuracy of 96 percent.
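
Once fitted, the same model object can also produce class labels and probabilities for new samples; a small follow-up sketch (the sample measurements below are assumptions):

# Predict the class and class probabilities for one new iris sample
sample = [[5.1, 3.5, 1.4, 0.2]]             # assumed sepal/petal measurements
print(LRG.predict(sample))                  # predicted class label
print(LRG.predict_proba(sample).round(3))   # probability for each of the three classes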

WEEK-5
9. Implementation of K-Means Clustering

K Means Clustering for Classification


Clustering is a type of unsupervised machine learning which aims to find homogeneous subgroups such that objects in the same group (cluster) are more similar to each other than to objects in other groups.
KMeans is a clustering algorithm which divides observations into k clusters. Since we can dictate the number of clusters, it can easily be used in classification, where we divide data into clusters which can be equal to or more than the number of classes.

I'll be using the handwritten digits dataset that comes with scikit-learn (a collection of labelled handwritten digits) and use KMeans to find clusters within the dataset and test how good they are as features.

Implementation:
import numpy as np
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

class clust():
    def _load_data(self, sklearn_load_ds):
        data = sklearn_load_ds
        X = pd.DataFrame(data.data)
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            X, data.target, test_size=0.3, random_state=42)

    def __init__(self, sklearn_load_ds):
        self._load_data(sklearn_load_ds)

    def classify(self, model=LogisticRegression(random_state=42)):
        model.fit(self.X_train, self.y_train)
        y_pred = model.predict(self.X_test)
        print('Accuracy: {}'.format(accuracy_score(self.y_test, y_pred)))

    def Kmeans(self, output='add'):
        n_clusters = len(np.unique(self.y_train))
        clf = KMeans(n_clusters=n_clusters, random_state=42)
        clf.fit(self.X_train)
        y_labels_train = clf.labels_
        y_labels_test = clf.predict(self.X_test)
        if output == 'add':
            self.X_train['km_clust'] = y_labels_train
            self.X_test['km_clust'] = y_labels_test
        elif output == 'replace':
            self.X_train = y_labels_train[:, np.newaxis]
            self.X_test = y_labels_test[:, np.newaxis]
        else:
            raise ValueError('output should be either add or replace')
        return self
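
A usage sketch matching the experiments described in the output section below; method chaining works because Kmeans returns self (this driver code is an assumption, not shown in the original listing):

# Clusters alone as the only feature
clust(load_digits()).Kmeans(output='replace').classify()

# Plain logistic regression on the raw pixel features
clust(load_digits()).classify()

# Cluster label appended as an extra feature
clust(load_digits()).Kmeans(output='add').classify()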

Output:

In the first attempt only clusters found by KMeans are used to train a classification model.
These clusters alone give a decent model with an accuracy of 78.33%. Let’s compare it with
an out of the box Logistic Regression model.

In our final iteration we are using the clusters as features, the results show an improvement
over our previous model.

10. Performance analysis of Classification Algorithms on a specific
dataset (Mini Project)
The following algorithms were used for classification analysis:

· Decision Tree Classifier,

· Random Forest Classifier,

· XGBoost Classifier,

· Naïve Bayes,

· Support Vector Machines (SVM),

· AdaBoost.

Data Cleaning

The first step is to import and clean the data (if needed) using pandas before starting the

analysis.

There are 25 austenitic (A), 17 martensitic (M), 11 ferritic (F) and 9 precipitation-hardening (P) stainless steels in the dataset.

There are 62 rows (stainless steels) and 17 columns (attributes) of data. 15 columns cover the chemical composition information of the alloys. The first column is the AISI designation and the last column is the type of the alloy. Our target is to estimate the type of the steel.

Descriptive statistics of the dataset are shown below.
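
Before looking at each algorithm individually, here is a minimal sketch of the shared train/test workflow used below, assuming the cleaned data is in a DataFrame df whose first column is the AISI designation and whose last column 'Type' holds the class label (the file name and column names are assumptions):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Assumed file and column names; the real dataset has 62 alloys and 17 columns
df = pd.read_csv('stainless_steels.csv')
X = df.drop(columns=['AISI', 'Type'])   # 15 chemical-composition features
y = df['Type']                          # A / M / F / P

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

for name, clf in [('Decision Tree', DecisionTreeClassifier(random_state=42)),
                  ('Random Forest', RandomForestClassifier(random_state=42))]:
    clf.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, clf.predict(X_test)))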

Decision Tree Classifier

First algorithm is the Decision Tree Classifier. It uses a decision tree (as a predictive model) to go
from observations about an item (represented in the branches) to conclusions about the item’s
target value (represented in the leaves).

The results are very good; actually, only one alloy type was classified mistakenly.

Random Forest Classifier


Random forests or random decision forests are an ensemble learning method for classification, regression
and other tasks that operate by constructing a multitude of decision trees at training time and outputting the
class that is the mode of the classes (classification) or mean/average prediction (regression) of the
individual trees.

Random forests generally outperform decision trees, but their accuracy is lower than gradient boosted
trees. However, data characteristics can affect their performance [ref].

Hyperparameter Tuning with Grid Search

Even though I got satisfactory results with Random Forest Analysis, I applied hyperparameter
tuning with Grid Search. Grid search is a common method for tuning a model’s hyperparameters.
The grid search algorithm is simple: feed it a set of hyperparameters and the values to be tested for
each hyperparameter, and then run an exhaustive search over all possible combinations of these
values, training one model for each set of values. The algorithm then compares the scores of each
model it trains and keeps the best one. Here are the results:
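
A sketch of what that grid search could look like for the Random Forest model, reusing X_train and y_train from the workflow sketch above (the parameter grid here is an assumption, not the author's exact grid):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 3, 5, 7],
    'max_features': ['sqrt', 'log2'],
}

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring='f1_macro',   # macro-averaged f1 is sensible for the four imbalanced classes
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)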

Hyperparameter tuning with Grid Search took the results to a perfect level, which may also indicate overfitting.

XGBoost Classifier

XGBoost is well known to provide better solutions than other machine


learning algorithms. In fact, since its inception, it has become the
“state-of-the-art” machine learning algorithm to deal with structured
data.

The results of the XGBoost Classifier provided the best results for this
classification study.

Hyperparameter Tuning with Grid Search
Once again, I applied the hyperparameter tuning with Grid Search, even though the results were
near perfect.

Naïve Bayes Classifier

Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem and
used for solving classification problems. Naïve Bayes Classifier is one of the simple and most
effective Classification algorithms which helps in building the fast machine learning models that
can make quick predictions.
The results are shown below:

Support Vector Machines (SVM)
Support-vector machines (SVMs, also support-vector networks) are supervised learning models with
associated learning algorithms that analyze data for classification and regression analysis. An SVM maps
training examples to points in space to maximize the width of the gap between the two categories. New
examples are then mapped into that same space and predicted to belong to a category based on which side
of the gap they fall.
The results are shown below:

AdaBoost

AdaBoost, short for Adaptive Boosting, is a machine learning meta-algorithm, which can be used in
conjunction with many other types of learning algorithms to improve performance. The output of the
other learning algorithms (‘weak learners’) is combined into a weighted sum that represents the final
output of the boosted classifier.

The results are shown below:

Conclusion
In this article, I used six different Supervised Machine Learning (Classification) algorithms with the
purpose of classifying four types of stainless steels (multi-class) according to their chemical compositions
comprised of 15 elements in the alloy. The dataset included 62 alloys; which made it a small, but a very
accurate dataset (all the information was taken from ASM International Sources (formerly known as
American Society of Metals)).

The analysis provides evidence that:

· Considering the f1 scores, Random Forest and XGBoost methods produced the best results
(0.94).

· After hyperparameter tuning by Grid Search, RF and XGBoost f1 scores jumped to 100 %.

· Multiple runs of the same algorithm produced widely different results, most probably due to the limited data size.

· The poorest f1 scores were mostly for the classification of the types with the fewest samples, which were the ferritic and precipitation-hardening steels.

· Finally, the test classification accuracy of 95% achieved by three models (DT, RF and XGBoost) and 100% by two tuned models demonstrates that the ML approach can be effectively applied to steel classification despite the small number of alloys and heterogeneous input parameters (chemical compositions). Based on only 62 cases, the models achieved a very high level of performance for multi-class alloy type classification.

