Machine Learning
LABORATORY MANUAL
MACHINE LEARNING LAB
(R22 Regulations)
For
B. Tech
III Year
I. List of Experiments
II. V/M/POs/PSOs/PEOs
III. Syllabus
List of Experiments
RISHI M.S. INSTITUTE OF ENGINEERING & TECHNOLOGY FOR WOMEN
(Affiliated to JNTUH University, Approved by AICTE)
Department of Information Technology & Computer Science and Engineering
Adopting creative techniques to nurture and strengthen the core skills of Computer Science.
Introduce students to the most recent technological advancements.
Impart quality education; improve the research, entrepreneurial, and employability skills of women technocrats.
Instill professional ethics and a sense of social responsibility in students.
Strengthen the Industry-Academia interface, which will enable graduates to emerge as academic leaders or inspiring entrepreneurs.
10. Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.
11. Project Management and Finance: Demonstrate knowledge and
understanding of the engineering and management principles and apply
these to one’s own work, as a member and leader in a team, to manage
projects and in multidisciplinary environments.
12. Life-long Learning: Recognize the need for, and have the preparation and
ability to engage in independent and life-long learning in the broadest
context of technological change.
PSO 1: Improve the student's ability to decipher the basic principles and methodology of computer systems, and to absorb facts and technical ideas in order to build and develop software.
PSO 2: The capacity to create novel job routes as an entrepreneur using modern computer languages and evolving technologies such as SDLC, Python, Machine Learning, Social Networks, Cyber Security, Mobile Apps, etc.
SYLLABUS
MACHINE LEARNING LAB
B. Tech III Year
Week-1:
1. Write a Python program to compute Central Tendency Measures: Mean, Median, Mode and Measures of Dispersion: Variance, Standard Deviation.
2. Study of Python Basic Libraries such as Statistics, Math, NumPy and SciPy.
Week-2:
3. Study of Python Libraries for ML applications such as Pandas and Matplotlib.
4. Write a Python program to implement Simple Linear Regression.
Week-3:
5. Implementation of Multiple Linear Regression for House Price Prediction using sklearn.
6. Implementation of Decision tree using sklearn and its parameter tuning.
Week-4:
7. Implementation of KNN using sklearn.
8. Implementation of Logistic Regression using sklearn.
Week-5:
9. Implementation of K-Means Clustering.
10. Performance analysis of Classification Algorithms on a specific dataset (Mini Project).
TEXT BOOK:
REFERENCE BOOK:
1. Machine Learning: An Algorithmic Perspective, Stephen Marsland, Taylor & Francis.
Course Objectives:
1. The objective of this lab is to give an overview of the various machine learning techniques and to demonstrate them using Python.
Course Outcomes: After learning the contents of this course, the student will be able to:
Understand modern notions in predictive data analysis
Select data, model selection, model complexity and identify the trends
CO-PO MAPPING (MACHINE LEARNING):
CO  PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 2 3 3 2 2 2 2
CO2 2 3 1 1 3 2 3
CO3 2 2 3 3 3 3 1
CO-PSO MAPPING:
PSO-1 PSO-2
CO1 3 2
CO2 3 2
CO3 3 3
WEEK-1
1. Write a Python program to compute Central Tendency Measures: Mean, Median, Mode and Measures of Dispersion: Variance, Standard Deviation.
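A minimal illustrative program for this experiment, using Python's built-in statistics module (the sample data below is made up):

import statistics

data = [12, 15, 11, 15, 18, 15, 14, 19, 21, 15]   # illustrative sample data

print("Mean:", statistics.mean(data))
print("Median:", statistics.median(data))
print("Mode:", statistics.mode(data))
print("Variance:", statistics.variance(data))            # sample variance
print("Standard Deviation:", statistics.stdev(data))     # sample standard deviation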
2. Study of Python Basic Libraries such as Statistics, Math,
Numpy and Scipy
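A short illustrative sketch of the four libraries named above (the values used are arbitrary):

import statistics
import math
import numpy as np
from scipy import stats

values = [4, 8, 15, 16, 23, 42]

# statistics: descriptive statistics on plain Python lists
print("mean:", statistics.mean(values), "stdev:", statistics.stdev(values))

# math: scalar mathematical functions and constants
print("sqrt(2):", math.sqrt(2), "pi:", math.pi, "factorial(5):", math.factorial(5))

# numpy: fast array operations
arr = np.array(values)
print("numpy mean:", arr.mean(), "numpy std:", arr.std())

# scipy: scientific routines built on top of numpy
print("scipy describe:", stats.describe(arr))
print("scipy standard error of mean:", stats.sem(arr))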
WEEK-2
3. Study of Python Libraries for ML applications such as Pandas and Matplotlib.
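A brief illustrative sketch of Pandas and Matplotlib (the small table below is made up for the demonstration):

import pandas as pd
import matplotlib.pyplot as plt

# pandas: tabular data handling with DataFrames
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "marks": [35, 45, 50, 62, 70, 78],
})
print(df.head())
print(df.describe())

# matplotlib: basic plotting
plt.scatter(df["hours_studied"], df["marks"], color="b", marker="o")
plt.xlabel("hours_studied")
plt.ylabel("marks")
plt.title("Marks vs hours studied")
plt.show()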
4. Write a Python program to implement Simple
Linear Regression.
Implementation of Linear Regression using Python
Linear regression is a statistical technique for modelling the relationship between a dependent variable and one or more independent variables. This section covers the basic concepts of simple linear regression and its implementation in Python.
In simple linear regression, the fitted line

    h(x_i) = b_0 + b_1 * x_i

is referred to as the regression line. Here, b_0 is the intercept and b_1 is the slope of the line.
In order to build our model, we need to "learn" or estimate the values of the regression coefficients b_0 and b_1. After we've determined those coefficients, we are able to make use of the model to forecast the response.
Let's consider:

    y_i = b_0 + b_1 * x_i + e_i

Here, e_i is the residual error in the i-th observation, and our mission is to find the values of b_0 and b_1 for which the total squared error

    J(b_0, b_1) = (1 / 2n) * sum_i (e_i)^2

is minimum. Without going into the mathematical details, we present the result below:

    b_1 = SS_xy / SS_xx
    b_0 = mean(y) - b_1 * mean(x)

where SS_xy is the sum of the cross deviations of "y" and "x",

    SS_xy = sum_i (x_i - mean(x)) * (y_i - mean(y))

and SS_xx is the sum of the squared deviations of "x",

    SS_xx = sum_i (x_i - mean(x))^2
import numpy as np
import matplotlib.pyplot as mtplt

def estimate_coeff(p, q):
    # number of observations
    n1 = np.size(p)
    # mean of the p and q vectors
    m_p = np.mean(p)
    m_q = np.mean(q)
    # sum of cross-deviations of q and p, and of squared deviations of p
    SS_pq = np.sum(q * p) - n1 * m_q * m_p
    SS_pp = np.sum(p * p) - n1 * m_p * m_p
    # here, we will calculate the regression coefficients
    b_1 = SS_pq / SS_pp
    b_0 = m_q - b_1 * m_p
    return (b_0, b_1)

def plot_regression_line(p, q, b):
    # Now, we will plot the actual points or observations as a scatter plot
    mtplt.scatter(p, q, color="m", marker="o", s=30)
    # here, we will calculate the predicted response vector
    q_pred = b[0] + b[1] * p
    # here, we will plot the regression line
    mtplt.plot(p, q_pred, color="g")
    # here, we will put the labels
    mtplt.xlabel('p')
    mtplt.ylabel('q')
    # here, we will show the plot
    mtplt.show()

def main():
    # entering the observation points or data
    p = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
    q = np.array([11, 13, 12, 15, 17, 18, 18, 19, 20, 22])
    # now, we will estimate the coefficients
    b = estimate_coeff(p, q)
    print("Estimated coefficients are:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # Now, we will plot the regression line
    plot_regression_line(p, q, b)

if __name__ == "__main__":
    main()
Output:
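Running the program on the ten sample points should print coefficient estimates of roughly b_0 ≈ -0.46 and b_1 ≈ 1.17 and display the scatter of the observations together with the fitted regression line drawn in green.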
WEEK-3
5. Implementation of Multiple Linear Regression for House Price Prediction using sklearn.
Multiple linear regression refers to a statistical technique that is used to predict the outcome of a
variable based on the value of two or more variables. It is sometimes known simply as multiple
regression, and it is an extension of linear regression. The variable that we want to predict is known
as the dependent variable, while the variables we use to predict the value of the dependent variable
are known as independent or explanatory variables.
Multiple linear regression is used to estimate the relationship between two or more independent
variables and one dependent variable. You can use multiple linear regression when you want to
know:
How strong the relationship is between two or more independent variables and one dependent
variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
The value of the dependent variable at a certain value of the independent variables (e.g. the
expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).
Multiple Linear Regression Formula:

    y = b_0 + b_1*x_1 + b_2*x_2 + ... + b_n*x_n + e

Where: y is the dependent variable to be predicted, b_0 is the intercept, b_1 ... b_n are the regression coefficients of the independent variables x_1 ... x_n, and e is the model's error term.
Among the assumptions of multiple linear regression:
· The independent variables are not highly correlated with each other.
· The observations are independent of each other.
Going forward, let us see how to implement house price prediction using the multiple linear regression algorithm.
Import libraries
import pandas as pd
import matplotlib.pyplot as plot
import seaborn as sns
Reading the dataset
data = pd.read_csv(r'Housing.csv')
Data Inspection
data.head(5)
data.describe()
data.shape
Data Cleaning
Null check:
data.isnull().sum()
There is no null data present in the dataset, so there is no need to replace or impute any values.
Detect Outliers
Outliers are extreme values that fall a long way outside of the other observations.
We create a separate function to detect outliers in the dataset, using box plots from the Seaborn library.
def detectOutliers():
    fig, axs = plot.subplots(2, 3, figsize=(10, 5))
    plt1 = sns.boxplot(data['price'], ax=axs[0, 0])
    plt2 = sns.boxplot(data['area'], ax=axs[0, 1])
    plt3 = sns.boxplot(data['bedrooms'], ax=axs[0, 2])
    plt1 = sns.boxplot(data['bathrooms'], ax=axs[1, 0])
    plt2 = sns.boxplot(data['stories'], ax=axs[1, 1])
    plt3 = sns.boxplot(data['parking'], ax=axs[1, 2])
    plot.tight_layout()

detectOutliers()
Outlier Detection
Price and area have considerable outliers. The next step is to drop the outliers.
# Outlier reduction for price
plot.boxplot(data.price)
Q1 = data.price.quantile(0.25)
Q3 = data.price.quantile(0.75)
IQR = Q3 - Q1
data = data[(data.price >= Q1 - 1.5*IQR) & (data.price <= Q3 + 1.5*IQR)]

# Outlier reduction for area
plot.boxplot(data.area)
Q1 = data.area.quantile(0.25)
Q3 = data.area.quantile(0.75)
IQR = Q3 - Q1
data = data[(data.area >= Q1 - 1.5*IQR) & (data.area <= Q3 + 1.5*IQR)]

To verify whether any outliers still exist, run the function again:
detectOutliers()
Data Visualization
sns.pairplot(data)
plot.show()
Pairplot
plot.figure(figsize=(20, 12))
plot.subplot(3,3,1)
sns.boxplot(x='mainroad', y='price', data=data)
plot.subplot(3,3,2)
sns.boxplot(x='guestroom', y='price', data=data)
plot.subplot(3,3,3)
sns.boxplot(x='basement', y='price', data=data)
plot.subplot(3,3,4)
sns.boxplot(x='hotwaterheating', y='price', data=data)
plot.subplot(3,3,5)
sns.boxplot(x='airconditioning', y='price', data=data)
plot.subplot(3,3,6)
sns.boxplot(x='furnishingstatus', y='price', data=data)
plot.show()
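With the data cleaned and visualized, a minimal sketch of the sklearn model fit on the numeric columns of the cleaned data frame (the feature list, split ratio and random_state below are illustrative choices):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# numeric predictors available in Housing.csv
features = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking']
X = data[features]
y = data['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# fit the multiple linear regression model
mlr = LinearRegression()
mlr.fit(X_train, y_train)

print("Coefficients:", mlr.coef_)
print("Intercept:", mlr.intercept_)
print("R^2 on the test data:", r2_score(y_test, mlr.predict(X_test)))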
6. Implementation of Decision tree using sklearn and its parameter
tuning
Decision Tree Classifiers
The diagram below demonstrates how decision trees work to make decisions. The top node is
called the root node. Each of the decision points are called decision nodes. The final decision point
is referred to as a leaf node.
Some advantages of decision tree classifiers:
· They're generally faster to train than other algorithms such as neural networks.
· Their complexity is a by-product of the data's attributes and dimensions.
· They are a non-parametric method, meaning that they do not depend on probability distribution assumptions.
· They can handle high-dimensional data with high degrees of accuracy.
How do Decision Tree Classifiers Work?
The Gini Impurity measures the likelihood that an item will be misclassified if it's randomly assigned a class based on the data's distribution. To generalize this to a formula, we can write:

    Gini = 1 - sum_i (p_i)^2

where p_i is the proportion of samples belonging to class i and the sum runs over all classes.
We can calculate the impurity using a Python function along these lines:

def gini_impurity(labels):
    # proportion of samples in each class
    probabilities = labels.value_counts(normalize=True)
    # Gini impurity: 1 minus the sum of squared class proportions
    impurity = 1 - (probabilities ** 2).sum()
    return impurity
print(data.head())
# Returns:
# Survived Pclass Sex Age SibSp Parch Fare Embarked
#0 0 3 male 22.0 1 0 7.2500 S
#1 1 1 female 38.0 1 0 71.2833 C
#2 1 3 female 26.0 0 0 7.9250 S
#3 1 1 female 35.0 1 0 53.1000 S
#4 0 3 male 35.0 0 0 8.0500 S
Let’s better understand the distribution of the data by plotting a pairplot using Seaborn. We’ll temporarily
load the target feature into the DataFrame to be able to color points based on whether people survived.
data = pd.read_csv(
'https://github.com/datagy/data/raw/main/titanic.csv',
usecols=['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'])
data = data.dropna()
sns.pairplot(data=data, hue='Survived')
plt.show()
A pairplot of the Titanic Dataset
Before we dive much further, let's first drop a few more variables. In particular, we'll drop all the non-numeric variables for now, since machine learning models tend to require numerical columns to work. We'll return to these later, but for now we'll keep things simple:
data = pd.read_csv(
'https://github.com/datagy/data/raw/main/titanic.csv',
usecols=['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare'])
data = data.dropna()
X = data.copy()
y = X.pop('Survived')

In the code above, we loaded only the numeric columns (by removing 'Sex' and 'Embarked'). Then, we split the data into two variables:
X: our features matrix (because it's a matrix, it's denoted with a capital letter)
y: our target variable
Splitting Data into Training and Testing Data in Sklearn
Let’s first load the function and then see how we can apply it to our data:
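A minimal sketch of this step, keeping the default 75/25 split (the random_state value is an arbitrary choice):

from sklearn.model_selection import train_test_split

# split the features X and the target y created above
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)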
Now, let’s see how we can build our first decision tree classifier using Sklearn!
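A sketch of this step with the sklearn API, using the classifier's default settings:

from sklearn.tree import DecisionTreeClassifier

# create the Decision Tree classifier and fit it on the training data
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)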
In the code above we accomplished two critical things (in very few lines of code):
We created our Decision Tree Classifier model and assigned it to the variable clf
We then applied the .fit() method to train the model. In order to do this, we passed in our training
data.
Scikit-Learn takes care of making all the decisions for us (for better or worse!). Now, let’s see
how we can make predictions with this newly created model:
# Making Predictions with Our Model
predictions = clf.predict(X_test)
print(predictions[:5])
Let’s break down what we did in the code above:
We assigned a new variable, predictions, which takes the values from applying the .predict()
method to our model clf.
We make predictions based on our X_test data
X = pd.read_csv(
    'https://github.com/datagy/data/raw/main/titanic.csv',
    usecols=['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex', 'Embarked'])
X = X.dropna()
y = X.pop('Survived')

# re-split the data, now including the string columns
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
# Raises
# ValueError: could not convert string to float: 'female'
By doing this, we can safely use non-numeric columns. Let's see how we can use Python and Scikit-Learn to convert our columns to their one-hot encoded versions.

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score

column_transformer = make_column_transformer(
    (OneHotEncoder(), ['Sex', 'Embarked']),
    remainder='passthrough')

X_train = column_transformer.fit_transform(X_train)
X_train = pd.DataFrame(data=X_train, columns=column_transformer.get_feature_names_out())

# apply the same (already fitted) transformation to the test data
X_test = column_transformer.transform(X_test)
X_test = pd.DataFrame(data=X_test, columns=column_transformer.get_feature_names_out())

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))
# Returns: 0.775

To tune the classifier's hyperparameters, we can define a grid of candidate values and search over all of their combinations:
params = {
'criterion': ['gini', 'entropy'],
'max_depth': [None, 2, 4, 6, 8, 10],
'max_features': [None, 'sqrt', 'log2', 0.2, 0.4, 0.6, 0.8],
'splitter': ['best', 'random']
}
from sklearn.model_selection import GridSearchCV

clf = GridSearchCV(
    estimator=DecisionTreeClassifier(),
    param_grid=params,
    cv=5,
    n_jobs=5,
    verbose=1,
)
clf.fit(X_train, y_train)
print(clf.best_params_)
This returns the following dictionary:
WEEK-4
7. Implementation of KNN using sklearn.
If you have a close buddy and spend most of your time with him or her, you will end up having similar interests and loving the same things. That is kNN with k=1.
If you constantly hang out with a group of 5, each one in the group has an impact on your behavior and you
will end up becoming the average of 5. That is kNN with k=5.
kNN classifier identifies the class of a data point using the majority voting principle. If k is set to 5, the
classes of 5 nearest points are examined. Prediction is done according to the predominant class. Similarly,
kNN regression takes the mean value of 5 nearest locations.
We can easily tell when people are close to each other, but how do we decide whether data points are close? The distance between data points is measured. There are various techniques to estimate the distance; Euclidean distance (Minkowski distance with p=2) is one of the most regularly used distance measures. For two points (x1, y1) and (x2, y2) in 2-dimensional space, it is computed from the squared differences of their x and y coordinates:

    d = sqrt((x1 - x2)^2 + (y1 - y2)^2)
Implementation of KNN Algorithm in Python
Let’s now get into the implementation of KNN in Python. We’ll go over the steps to help you break the
code down and make better sense of it.
1. Importing the modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
2. Creating Dataset
Scikit-learn has a lot of tools for creating synthetic datasets, which are great for testing machine learning algorithms. I'm going to utilize the make_blobs method.
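A sketch of the dataset creation; the sample count, number of centres, spread and random_state below are illustrative choices:

# create a synthetic 2-feature dataset with 4 cluster centres
X, y = make_blobs(n_samples=500, n_features=2, centers=4, cluster_std=1.5, random_state=4)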
4. Splitting Data into Training and Testing Datasets
It is critical to partition a dataset into train and test sets for every supervised machine learning method. We first train the model and then put it to the test on various portions of the dataset. If we don't separate the data, we're simply testing the model with data it already knows. Using the train_test_split method, we can easily separate the data.
With the train_size and test_size options, we may determine how much of the original data is utilized for the train and test sets, respectively. The default separation is 75% for the train set and 25% for the test set.
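A minimal sketch of the call, keeping the default 75/25 split (the random_state value is arbitrary):

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)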
knn5 = KNeighborsClassifier(n_neighbors = 5)
knn1 = KNeighborsClassifier(n_neighbors=1)
Then, in the test set, we forecast the target values and compare them to the actual values.
knn5.fit(X_train, y_train)
knn1.fit(X_train, y_train)
y_pred_5 = knn5.predict(X_test)
y_pred_1 = knn1.predict(X_test)
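A short sketch of how the two models can be compared on the test set using accuracy_score:

from sklearn.metrics import accuracy_score

print("Accuracy with k=5:", accuracy_score(y_test, y_pred_5))
print("Accuracy with k=1:", accuracy_score(y_test, y_pred_1))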
8. Visualize Predictions
Let’s view the test set and predicted values with k=5 and k=1 to see the influence of k values.
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred_5, marker='*', s=100, edgecolors='black')
plt.title("Predicted values with k=5")
plt.subplot(1, 2, 2)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred_1, marker='*', s=100, edgecolors='black')
plt.title("Predicted values with k=1")
plt.show()
8. Implementation of Logistic Regression using sklearn.
Logistic regression, despite its name, is a classification algorithm rather than a regression algorithm. Based on a given set of independent variables, it is used to estimate discrete values (0 or 1, yes/no, true/false). It is also called the logit or MaxEnt classifier.
Basically, it measures the relationship between the categorical dependent variable and one or more independent variables by estimating the probability of occurrence of an event using its logistic function.
Parameters
The following table lists some of the parameters used by the Logistic Regression module −
8   random_state − int, RandomState instance or None, optional, default = None
    This parameter represents the seed of the pseudo-random number generator which is used while shuffling the data. Following are the options:
    int − in this case, random_state is the seed used by the random number generator.
    RandomState instance − in this case, random_state is the random number generator.
    None − in this case, the random number generator is the RandomState instance used by np.random.
9   solver − str, {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}, optional, default = 'liblinear'
    This parameter represents which algorithm to use in the optimization problem. Following are the properties of the options under this parameter:
    liblinear − It is a good choice for small datasets. It also handles the L1 penalty. For multiclass problems, it is limited to one-versus-rest schemes.
    newton-cg − It handles only the L2 penalty.
    lbfgs − For multiclass problems, it handles the multinomial loss. It also handles only the L2 penalty.
    saga − It is a good choice for large datasets. For multiclass problems, it also handles the multinomial loss. Along with the L1 penalty, it also supports the 'elasticnet' penalty.
    sag − It is also used for large datasets. For multiclass problems, it also handles the multinomial loss.
10  max_iter − int, optional, default = 100
    As the name suggests, it represents the maximum number of iterations taken for the solvers to converge.
11  multi_class − str, {'ovr', 'multinomial', 'auto'}, optional, default = 'ovr'
    ovr − For this option, a binary problem is fit for each label.
    multinomial − For this option, the loss minimized is the multinomial loss fit across the entire probability distribution. We can't use this option if solver = 'liblinear'.
    auto − This option will select 'ovr' if solver = 'liblinear' or the data is binary, else it will choose 'multinomial'.
12  verbose − int, optional, default = 0
    By default, the value of this parameter is 0, but for the liblinear and lbfgs solvers we should set verbose to any positive number.
13  warm_start − bool, optional, default = False
    With this parameter set to True, we can reuse the solution of the previous call to fit as initialization. If we choose the default, i.e. False, it will erase the previous solution.
14  n_jobs − int or None, optional, default = None
    If multi_class = 'ovr', this parameter represents the number of CPU cores used when parallelizing over classes. It is ignored when solver = 'liblinear'.
15  l1_ratio − float or None, optional, default = None
    It is used when penalty = 'elasticnet'. It is basically the Elastic-Net mixing parameter with 0 <= l1_ratio <= 1.
Attributes
The following table lists the attributes used by the Logistic Regression module −
Sr.No   Attributes & Description
1   coef_ − array, shape (n_features,) or (n_classes, n_features)
    It is used to estimate the coefficients of the features in the decision function. When the given problem is binary, it is of the shape (1, n_features).
2   intercept_ − array, shape (1,) or (n_classes,)
    It represents the constant, also known as bias, added to the decision function.
3   classes_ − array, shape (n_classes,)
    It provides a list of class labels known to the classifier.
4   n_iter_ − array, shape (n_classes,) or (1,)
    It returns the actual number of iterations for all the classes.
Implementation Example
The following Python script provides a simple example of implementing logistic regression on the iris dataset of scikit-learn −
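A minimal sketch of such a script (the solver setting is an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# fit a logistic regression model on the full iris dataset
LRG = LogisticRegression(random_state=0, solver='liblinear').fit(X, y)

print("Mean accuracy on the training data:", LRG.score(X, y))
print("Classes:", LRG.classes_)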
WEEK-5
9. Implementation of K-Means Clustering
I'll be using the MNIST-style digits dataset that comes with scikit-learn, which is a collection of labelled handwritten digits, and use KMeans to find clusters within the dataset and test how good the cluster assignments are as a feature.
Implementation:
import numpy as np
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
class clust():
def _load_data(self, sklearn_load_ds):
data = sklearn_load_ds
X = pd.DataFrame(data.data)
self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(X,
data.target, test_size=0.3, random_state=42)
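A self-contained sketch of the same idea, reusing the imports above (the parameter choices are illustrative): cluster the digits with KMeans, then compare a Logistic Regression trained on the cluster labels alone, on the raw pixel values, and on the raw pixels plus the cluster label as an extra feature.

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=42)

# cluster the training images into 10 groups (ideally one per digit)
kmeans = KMeans(n_clusters=10, random_state=42, n_init=10)
kmeans.fit(X_train)

# 1) Logistic Regression trained on the cluster labels alone
lr_clusters = LogisticRegression(random_state=42, max_iter=1000)
lr_clusters.fit(kmeans.labels_.reshape(-1, 1), y_train)
pred = lr_clusters.predict(kmeans.predict(X_test).reshape(-1, 1))
print("Clusters only:", accuracy_score(y_test, pred))

# 2) Out-of-the-box Logistic Regression on the raw pixel values
lr_raw = LogisticRegression(random_state=42, max_iter=1000)
lr_raw.fit(X_train, y_train)
print("Raw features:", accuracy_score(y_test, lr_raw.predict(X_test)))

# 3) Raw pixels plus the cluster label as an additional feature
X_train_aug = np.hstack([X_train, kmeans.labels_.reshape(-1, 1)])
X_test_aug = np.hstack([X_test, kmeans.predict(X_test).reshape(-1, 1)])
lr_aug = LogisticRegression(random_state=42, max_iter=1000)
lr_aug.fit(X_train_aug, y_train)
print("Raw + cluster feature:", accuracy_score(y_test, lr_aug.predict(X_test_aug)))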
Output:
In the first attempt only clusters found by KMeans are used to train a classification model.
These clusters alone give a decent model with an accuracy of 78.33%. Let’s compare it with
an out of the box Logistic Regression model.
In the final iteration we use the clusters as additional features; the results show an improvement over our previous model.
10. Performance analysis of Classification Algorithms on a specific
dataset (Mini Project)
The following algorithms were used for classification analysis:
· Decision Tree,
· Random Forest,
· XGBoost Classifier,
· Naïve Bayes,
· Support Vector Machines (SVM),
· AdaBoost.
Data Cleaning
The first step is to import and clean the data (if needed) using pandas before starting the
analysis.
There are 25 austenitic (A), 17 martensitic (M), 11 ferritic (F) and 9 precipitation-hardening (P) stainless steels in the dataset. There are 62 rows (stainless steels) and 17 columns (attributes) of data. 15 columns cover the chemical composition information of the alloys. The first column is the AISI designation and the last column is the type of the alloy. Our target is to estimate the type of the steel.
First algorithm is the Decision Tree Classifier. It uses a decision tree (as a predictive model) to go
from observations about an item (represented in the branches) to conclusions about the item’s
target value (represented in the leaves).
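A hypothetical sketch of the common pipeline used for each classifier is given below; the file name 'stainless_steels.csv' and the column layout are placeholders based on the dataset description above.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# hypothetical file: first column is the AISI designation, last column the alloy type,
# and the 15 columns in between hold the chemical composition
df = pd.read_csv('stainless_steels.csv')
X = df.iloc[:, 1:-1]   # chemical composition attributes
y = df.iloc[:, -1]     # alloy type: A, M, F or P

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

The same pipeline can be reused for the other classifiers by swapping DecisionTreeClassifier for RandomForestClassifier, GaussianNB, SVC, AdaBoostClassifier or XGBClassifier.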
The results are very good; actually, only one alloy type was classified mistakenly.
Random forests generally outperform decision trees, but their accuracy is lower than gradient boosted
trees. However, data characteristics can affect their performance [ref].
Hyperparameter Tuning with Grid Search
Even though I got satisfactory results with Random Forest Analysis, I applied hyperparameter
tuning with Grid Search. Grid search is a common method for tuning a model’s hyperparameters.
The grid search algorithm is simple: feed it a set of hyperparameters and the values to be tested for
each hyperparameter, and then run an exhaustive search over all possible combinations of these
values, training one model for each set of values. The algorithm then compares the scores of each
model it trains and keeps the best one. Here are the results:
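A sketch of such a grid search over a Random Forest, reusing the split from the previous sketch (the parameter grid is illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 3, 5, 10],
    'max_features': ['sqrt', 'log2'],
}

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring='f1_macro',
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)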
Hyperparameter tuning with Grid Search took the results to the perfect level — or overfitting.
XGBoost Classifier
The XGBoost Classifier provided the best results for this classification study.
Hyperparameter Tuning with Grid Search
Once again, I applied the hyperparameter tuning with Grid Search, even though the results were
near perfect.
Naïve Bayes Classifier
The Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes' theorem and used for solving classification problems. The Naïve Bayes classifier is one of the simplest and most effective classification algorithms, and it helps in building fast machine learning models that can make quick predictions.
The results are shown below:
Support Vector Machines (SVM)
Support-vector machines (SVMs, also support-vector networks) are supervised learning models with
associated learning algorithms that analyze data for classification and regression analysis. An SVM maps
training examples to points in space to maximize the width of the gap between the two categories. New
examples are then mapped into that same space and predicted to belong to a category based on which side
of the gap they fall.
The results are shown below:
AdaBoost
AdaBoost, short for Adaptive Boosting, is a machine learning meta-algorithm, which can be used in
conjunction with many other types of learning algorithms to improve performance. The output of the
other learning algorithms (‘weak learners’) is combined into a weighted sum that represents the final
output of the boosted classifier.
Conclusion
In this article, I used six different Supervised Machine Learning (Classification) algorithms with the
purpose of classifying four types of stainless steels (multi-class) according to their chemical compositions
comprised of 15 elements in the alloy. The dataset included 62 alloys, which made it a small but very accurate dataset (all the information was taken from ASM International sources, formerly known as the American Society for Metals).
The analysis provides evidence that:
· Considering the f1 scores, Random Forest and XGBoost methods produced the best results
(0.94).
· After hyperparameter tuning by Grid Search, RF and XGBoost f1 scores jumped to 100 %.
· Multiple runs of the same algorithm produced noticeably different results, most probably due to the limited data size.
· The poorest f1 scores were mostly for the classification of the types with the fewest samples, which were the ferritic and precipitation-hardened steels.
· Finally, the test classification accuracy of 95% achieved by 3 models (DT, RF and XGBoost) and 100% by 2 tuned models demonstrates that the ML approach can be effectively applied to steel classification despite the small number of alloys and the heterogeneous input parameters (chemical compositions). Based on only 62 cases, the models achieved a very high level of performance for multi-class alloy type classification.