
Machine Learning Model Management – PAD21D04T

Unit -2

ML Model Management - Introduction

Machine learning: the problem setting

In general, a learning problem considers a set of n samples of data and then tries to predict
properties of unknown data. If each sample is more than a single number and, for instance, a
multi-dimensional entry (multivariate data), it is said to have several attributes or features.

Learning problems fall into a few categories:

 Supervised learning, in which the data comes with additional attributes that we want
to predict. This problem can be either:
o Classification: samples belong to two or more classes and we want to learn
from already labeled data how to predict the class of unlabeled data. An
example of a classification problem would be handwritten digit recognition, in
which the aim is to assign each input vector to one of a finite number of
discrete categories. Another way to think of classification is as a discrete (as
opposed to continuous) form of supervised learning where one has a limited
number of categories and for each of the n samples provided, one is to try to
label them with the correct category or class.
o Regression: if the desired output consists of one or more continuous variables,
then the task is called regression. An example of a regression problem would
be the prediction of Housing Prices.
 Unsupervised learning, in which the training data consists of a set of input vectors x
without any corresponding target values. The goal in such problems may be to
discover groups of similar examples within the data, where it is called clustering, or to
determine the distribution of data within the input space, known as density estimation,
or to project the data from a high-dimensional space down to two or three dimensions
for the purpose of visualization.
Training set and testing set

Machine learning is about learning some properties of a data set and then testing those
properties against another data set. A common practice in machine learning is to evaluate an
algorithm by splitting a data set into two. We call one of those sets the training set, on which
we learn some properties; we call the other set the testing set, on which we test the learned
properties.

Loading an example dataset

scikit-learn comes with a few standard datasets, for instance the iris and digits datasets for
classification and the diabetes dataset for regression.

In the following, we start a Python interpreter from our shell and then load
the iris and digits datasets. Our notational convention is that $ denotes the shell prompt
while >>> denotes the Python interpreter prompt:

$ python
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> digits = datasets.load_digits()

A dataset is a dictionary-like object that holds all the data and some metadata about the data.
This data is stored in the .data member, which is an (n_samples, n_features) array. In the case of
supervised problems, one or more response variables are stored in the .target member.

For instance, in the case of the digits dataset, digits.data gives access to the features that can
be used to classify the digits samples:

>>> print(digits.data)
[[ 0. 0. 5. ... 0. 0. 0.]
[ 0. 0. 0. ... 10. 0. 0.]
[ 0. 0. 0. ... 16. 9. 0.]
...
[ 0. 0. 1. ... 6. 0. 0.]
[ 0. 0. 2. ... 12. 0. 0.]
[ 0. 0. 10. ... 12. 1. 0.]]
and digits.target gives the ground truth for the digit dataset, that is the number corresponding
to each digit image that we are trying to learn:
>>> digits.target
array([0, 1, 2, ..., 8, 9, 8])

Shape of the data arrays

The data is always a 2D array, shape (n_samples, n_features), although the original data may
have had a different shape. In the case of the digits, each original sample is an image of
shape (8, 8) and can be accessed using:

>>> digits.images[0]
array([[ 0., 0., 5., 13., 9., 1., 0., 0.],
[ 0., 0., 13., 15., 10., 15., 5., 0.],
[ 0., 3., 15., 2., 0., 11., 8., 0.],
[ 0., 4., 12., 0., 0., 8., 8., 0.],
[ 0., 5., 8., 0., 0., 9., 8., 0.],
[ 0., 4., 11., 0., 1., 12., 7., 0.],
[ 0., 2., 14., 5., 10., 12., 0., 0.],
[ 0., 0., 6., 13., 10., 0., 0., 0.]])

The simple example on this dataset illustrates how starting from the original problem one can
shape the data for consumption in scikit-learn.
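
For instance, the 8x8 images above can be flattened into the (n_samples, n_features) layout with a
single reshape; a minimal sketch, using only the digits data already loaded:

>>> data = digits.images.reshape((len(digits.images), -1))  # flatten each 8x8 image into 64 features
>>> data.shape
(1797, 64)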

Learning and predicting

In the case of the digits dataset, the task is to predict, given an image, which digit it
represents. We are given samples of each of the 10 possible classes (the digits zero through
nine) on which we fit an estimator to be able to predict the classes to which unseen samples
belong.

In scikit-learn, an estimator for classification is a Python object that implements the
methods fit(X, y) and predict(T).

An example of an estimator is the class sklearn.svm.SVC, which implements support vector
classification. The estimator’s constructor takes as arguments the model’s parameters.

For now, we will consider the estimator as a black box:

>>> from sklearn import svm


>>> clf = svm.SVC(gamma=0.001, C=100.)
Choosing the parameters of the model

In this example, we set the value of gamma manually. To find good values for these
parameters, we can use tools such as grid search and cross validation.
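
As a hedged sketch of how such a search might look (the candidate values for gamma and C below are
illustrative assumptions, not values prescribed here):

>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = {'gamma': [0.0001, 0.001, 0.01], 'C': [1, 10, 100]}
>>> search = GridSearchCV(svm.SVC(), param_grid, cv=5)  # 5-fold cross-validation over the grid
>>> search.fit(digits.data, digits.target)
>>> clf = search.best_estimator_  # the estimator refitted with the best parameter combination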

The clf (for classifier) estimator instance is first fitted to the data; that is, it must learn from
the data. This is done by passing our training set to the fit method. For the training set,
we’ll use all the images from our dataset, except for the last image, which we’ll reserve for
our prediction. We select the training set with the [:-1] Python syntax, which produces a new
array that contains all but the last item from digits.data:

>>> clf.fit(digits.data[:-1], digits.target[:-1])


SVC(C=100.0, gamma=0.001)
Now you can predict new values. In this case, you’ll predict using the last image
from digits.data. By predicting, you’ll determine the image from the training set that best
matches the last image.

>>> clf.predict(digits.data[-1:])
array([8])
Refitting and updating parameters

Hyper-parameters of an estimator can be updated after it has been constructed via


the set_params() method. Calling fit() more than once will overwrite what was learned by any
previous fit():

>>> import numpy as np


>>> from sklearn.datasets import load_iris
>>> from sklearn.svm import SVC
>>> X, y = load_iris(return_X_y=True)

>>> clf = SVC()


>>> clf.set_params(kernel='linear').fit(X, y)
SVC(kernel='linear')
>>> clf.predict(X[:5])
array([0, 0, 0, 0, 0])

>>> clf.set_params(kernel='rbf').fit(X, y)
SVC()
>>> clf.predict(X[:5])
array([0, 0, 0, 0, 0])
Here, the default kernel rbf is first changed to linear via SVC.set_params() after the
estimator has been constructed, and changed back to rbf to refit the estimator and to make a
second prediction.

Multiclass vs. multilabel fitting

When using multiclass classifiers, the learning and prediction task that is performed is
dependent on the format of the target data fit upon:

>>> from sklearn.svm import SVC


>>> from sklearn.multiclass import OneVsRestClassifier
>>> from sklearn.preprocessing import LabelBinarizer

>>> X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
>>> y = [0, 0, 1, 1, 2]

>>> classif = OneVsRestClassifier(estimator=SVC(random_state=0))


>>> classif.fit(X, y).predict(X)
array([0, 0, 1, 1, 2])

In the above case, the classifier is fit on a 1d array of multiclass labels and
the predict() method therefore provides corresponding multiclass predictions. It is also
possible to fit upon a 2d array of binary label indicators:

>>> y = LabelBinarizer().fit_transform(y)
>>> classif.fit(X, y).predict(X)
array([[1, 0, 0],
[1, 0, 0],
[0, 1, 0],
[0, 0, 0],
[0, 0, 0]])

Here, the classifier is fit() on a 2d binary label representation of y, using the LabelBinarizer.
In this case predict() returns a 2d array representing the corresponding multilabel predictions.

Note that the fourth and fifth instances returned all zeroes, indicating that they matched none
of the three labels fit upon. With multilabel outputs, it is similarly possible for an instance to
be assigned multiple labels:
>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
>>> y = MultiLabelBinarizer().fit_transform(y)
>>> classif.fit(X, y).predict(X)
array([[1, 1, 0, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 0],
[1, 0, 1, 0, 0]])
In this case, the classifier is fit upon instances each assigned multiple labels.
The MultiLabelBinarizer is used to binarize the 2d array of multilabels to fit upon. As a
result, predict() returns a 2d array with multiple predicted labels for each instance.

Creating and saving ML models with scikit-learn


Once we create a machine learning model, our job doesn’t end there. We can save the
model to use in the future, using either the pickle or the joblib library. The dump() method is
used to serialize and save the model, and the load() method is used to load and use the saved
model. Now let’s demonstrate how to do it. pickle and joblib expose a very similar
dump/load interface.

syntax of dump() method:


pickle.dump(obj, file, protocol=None, *, fix_imports=True, buffer_callback=None)

parameters:
 obj: the Python object to be pickled.
 file: the file object (or buffer) the pickled object will be written to.
 fix_imports: if True and protocol is less than 3, pickle will try to map the new Python 3
names to the old module names used in Python 2, so that the pickled stream remains readable
with Python 2. The default is True, and it may only be passed as a keyword argument.
syntax of load() method:
pickle.load(file, *, fix_imports=True, encoding='ASCII', errors='strict', buffers=None)

The load() method reads the pickled representation of an object from the open file object file
and returns the reconstituted object hierarchy specified therein.
Example 1: Saving and loading models using pickle

pickle is Python’s default mechanism for serializing objects. Your machine learning
model can be serialized (encoded) using the pickling process, and the serialized format
can then be saved to a file. You can load this file later to deserialize (decode) your model and
use it to produce new predictions. The example that follows trains a linear regression
model. In the example below we fit the model on the training data, and the dump() method is
used to save it; dump() takes the machine learning model and a file object. The test data is
used to make predictions after loading the model with the load() method, and the root mean
squared error metric is used to evaluate the model's predictions.

# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import pickle

# import the dataset


dataset = pd.read_csv('headbrain1.csv')

X = dataset.iloc[:, : -1].values
Y = dataset.iloc[:, -1].values

# train test split


X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.2, random_state=0)

# create a linear regression model


regressor = LinearRegression()
regressor.fit(X_train, y_train)

# save the model


filename = 'linear_model.sav'
pickle.dump(regressor, open(filename, 'wb'))

# load the model


load_model = pickle.load(open(filename, 'rb'))

y_pred = load_model.predict(X_test)
print('root mean squared error : ', np.sqrt(
metrics.mean_squared_error(y_test, y_pred)))

Output:
root mean squared error : 72.11529287182815

Example 2: Saving and loading models using joblib

The SciPy ecosystem includes Joblib, which offers tools for pipelining Python jobs. It
offers tools for effectively saving and loading Python objects that employ NumPy data
structures. This can be helpful for machine learning algorithms that need to store the
complete dataset or have a lot of parameters. Let’s look at a simple example where we save
and load a linear regression model. The same steps are repeated while using the joblib
library.

# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import joblib

# import the dataset


dataset = pd.read_csv('headbrain1.csv')

X = dataset.iloc[:, : -1].values
Y = dataset.iloc[:, -1].values

# train test split


X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.2, random_state=0)

# create a linear regression model


regressor = LinearRegression()
regressor.fit(X_train, y_train)

# save the model


filename = 'linear_model_2.sav'
# joblib accepts a file path directly, so the file does not need to be opened manually
joblib.dump(regressor, filename)

# load the model

load_model = joblib.load(filename)

y_pred = load_model.predict(X_test)
print('root mean squared error : ', np.sqrt(
metrics.mean_squared_error(y_test, y_pred)))

Output:
root mean squared error : 72.11529287182815

Models for Regression

Introduction

Regression problems are prevalent in machine learning, and regression analysis is the most
often used technique for solving them. It is based on data modelling and entails determining
the best-fit line, the line that passes through the data points with the shortest possible distance
between the line and each data point. While there are several techniques for regression analysis,
linear and logistic regression are the most widely used. Ultimately, the type of regression
analysis model we adopt will be determined by the nature of the data.

Regression Model/Analysis

Predictive modelling techniques such as regression models/analysis may be used to determine
the relationship between a dataset’s dependent (target) and independent variables. It is widely
used when the dependent and independent variables are linked in a linear or non-linear
fashion and the target variable takes a set of continuous values. Regression analysis
approaches thus help establish relationships between variables, model time series, and make
forecasts. Regression analysis, for example, is a good way to examine the relationship
between a corporation's sales and its advertising expenditures.

Purpose of a regression model

Regression analysis is used for one of two purposes: predicting the value of the dependent
variable when information about the independent variables is known, or predicting the effect
of an independent variable on the dependent variable.


Types of Regression Models

There are numerous regression analysis approaches available for making predictions.
Additionally, the choice of technique is determined by various parameters, including the
number of independent variables, the form of the regression line, and the type of dependent
variable.

Let us examine several of the most often utilized regression analysis techniques:

1. Linear Regression

The most extensively used modelling technique is linear regression, which assumes a linear
connection between a dependent variable (Y) and an independent variable (X). It employs a
regression line, also known as a best-fit line. The linear connection is defined as Y = c + m*X
+ e, where ‘c’ denotes the intercept, ‘m’ denotes the slope of the line, and ‘e’ is the error
term.

The linear regression model can be simple (with one dependent and one independent
variable) or multiple (with one dependent variable and more than one independent variable).


2. Logistic Regression

When the dependent variable is discrete, the logistic regression technique is applicable. In
other words, this technique is used to compute the probability of mutually exclusive
occurrences such as pass/fail, true/false, 0/1, and so forth. Thus, the target variable can take
on only one of two values, a sigmoid curve represents its connection to the independent
variable, and the predicted probability has a value between 0 and 1.
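
A minimal sketch of fitting such a model with scikit-learn; the breast cancer dataset is an
illustrative assumption chosen only because it has a binary target:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# binary target: malignant vs. benign
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# the sigmoid maps the linear score to a probability between 0 and 1
clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test[:3]))  # class probabilities for the first three test samples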

3. Polynomial Regression

The technique of polynomial regression analysis is used to represent a non-linear relationship
between dependent and independent variables. It is a variant of the multiple linear regression
model, except that the best-fit line is curved rather than straight.
4. Ridge Regression

The ridge regression technique is applied when data exhibits multicollinearity, that is, when
the independent variables are highly correlated. While least squares estimates are unbiased
under multicollinearity, their variances are large enough to cause the observed values to
diverge from the actual values. Ridge regression reduces standard errors by biasing the
regression estimates.

The lambda (λ) parameter in the ridge regression equation resolves the multicollinearity
problem.
5. Lasso Regression

As with ridge regression, the lasso (Least Absolute Shrinkage and Selection Operator)
technique penalizes the absolute magnitude of the regression coefficients. Additionally, the
lasso regression technique performs variable selection, shrinking some coefficient values all
the way to zero. The basic idea of lasso regression is to introduce a little bias so that the
variance can be substantially reduced, which leads to a lower overall MSE. Notice that as λ
increases, variance drops substantially with very little increase in bias. Beyond a certain
point, though, variance decreases less rapidly and the shrinkage in the coefficients causes
them to be significantly underestimated, which results in a large increase in bias.

The test MSE is lowest when we choose a value for λ that produces an optimal tradeoff
between bias and variance.

When λ = 0, the penalty term in lasso regression has no effect, and thus it produces the same
coefficient estimates as least squares. However, by increasing λ to a certain point we can
reduce the overall test MSE. This means the model fit by lasso regression can produce
smaller test errors than the model fit by least squares regression.
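
A brief sketch of this shrinkage behaviour; the dataset and the λ (alpha) values below are
illustrative assumptions:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
for alpha in [0.01, 0.1, 1.0, 10.0]:  # alpha plays the role of lambda
    lasso = Lasso(alpha=alpha).fit(X, y)
    n_zero = np.sum(lasso.coef_ == 0)  # coefficients shrunk exactly to zero
    print(f'alpha={alpha}: {n_zero} of {X.shape[1]} coefficients are zero')

As alpha grows, more coefficients are driven to exactly zero, which is the variable-selection
effect described above.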
6. Quantile Regression

The quantile regression approach is an extension of the linear regression technique. It is
employed when the linear regression requirements are not met or when the data contains
outliers. Quantile regression is widely used in statistics and econometrics.

7. Bayesian Linear Regression

Bayesian linear regression is a form of regression analysis used in machine learning that
applies Bayes’ theorem to calculate the values of the regression coefficients. Rather than
finding the least-squares estimates, this technique determines the posterior distribution of the
coefficients. As a result, the approach outperforms ordinary linear regression in terms of
stability.

8. Principal Components Regression

Multicollinear regression data is often evaluated using the principal components regression
approach. Like ridge regression, principal components regression reduces standard errors by
biasing the regression estimates. Principal component analysis (PCA) is used first to
transform the training data, and then the resulting transformed samples are used to train the
regressors.

9. Partial Least Squares Regression

The partial least squares regression technique is a fast and efficient covariance-based
regression analysis technique. It is advantageous for regression problems with many
independent variables that have a high probability of multicollinearity between them. The
method reduces the number of variables to a manageable set of predictors, which is then used
in a regression.

10. Elastic Net Regression

Elastic net regression combines the ridge and lasso regression techniques and is particularly
useful when dealing with strongly correlated data. It regularizes regression models by
utilizing both the ridge and lasso penalties.
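
A short sketch of how the two penalties are mixed in scikit-learn; the alpha and l1_ratio values
are illustrative assumptions:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge, Lasso, ElasticNet

X, y = load_diabetes(return_X_y=True)
ridge = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty only
lasso = Lasso(alpha=1.0).fit(X, y)                    # L1 penalty only
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # 50/50 mix of L1 and L2 penalties
print(ridge.coef_, lasso.coef_, enet.coef_, sep='\n')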

Getting started with Regression model example

Linear Regression using Iris Dataset

The Iris Dataset

The Iris dataset contains three species, namely Iris setosa, Iris versicolor and Iris virginica, with
50 rows of data for each species. The column names represent the features of the flower that
were studied and recorded.

# Import packages
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Load Iris Data
iris = load_iris()

# Creating pd DataFrames
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
target_df = pd.DataFrame(data=iris.target, columns=['species'])

def converter(specie):
    if specie == 0:
        return 'setosa'
    elif specie == 1:
        return 'versicolor'
    else:
        return 'virginica'

target_df['species'] = target_df['species'].apply(converter)

# Concatenate the DataFrames
iris_df = pd.concat([iris_df, target_df], axis=1)
iris_df.describe()
iris_df.info()

# Converting Objects to Numerical dtype
iris_df.drop('species', axis=1, inplace=True)
target_df = pd.DataFrame(columns=['species'], data=iris.target)
iris_df = pd.concat([iris_df, target_df], axis=1)

# Variables
X = iris_df.drop(labels='sepal length (cm)', axis=1)
y = iris_df['sepal length (cm)']

# Splitting the Dataset


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.33, random_state= 101)

# Instantiating LinearRegression() Model


lr = LinearRegression()

# Training/Fitting the Model


lr.fit(X_train, y_train)

# Making Predictions
pred = lr.predict(X_test)

# Evaluating Model's Performance
print('Mean Absolute Error:', mean_absolute_error(y_test, pred))
print('Mean Squared Error:', mean_squared_error(y_test, pred))
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, pred)))

Results of fitting the model

Mean Absolute Error: 0.26498350887555133
Mean Squared Error: 0.10652500975036944
Root Mean Squared Error: 0.3263816933444176

Now to test…
iris_df.loc[6]
d = {'sepal length (cm)': [4.6],
     'sepal width (cm)': [3.4],
     'petal length (cm)': [1.4],
     'petal width (cm)': [0.3],
     'species': 0}
test_df = pd.DataFrame(data=d)
test_df

# Predict the sepal length of this single sample from its other features
pred = lr.predict(test_df.drop('sepal length (cm)', axis=1))
print('Predicted Sepal Length (cm):', pred[0])
print('Actual Sepal Length (cm):', 4.6)

As you can see, there is a discrepancy between the predicted value and the actual value; the
difference is approximately 0.283 cm, which is a little higher than the mean absolute error.

Classification Management

Classification is a supervised machine learning method where the model tries to predict the
correct label of a given input data. In classification, the model is fully trained using the
training data, and then it is evaluated on test data before being used to perform prediction on
new unseen data.

For instance, an algorithm can learn to predict whether a given email is spam or ham (not
spam).
Before diving into the classification concept, we will first understand the difference between
the two types of learners in classification: lazy and eager learners. Then we will clarify the
misconception between classification and regression.

Lazy Learners Vs. Eager Learners

There are two types of learners in machine learning classification: lazy and eager learners.

Eager learners are machine learning algorithms that first build a model from the training
dataset before making any prediction on future datasets. They spend more time during the
training process because of their eagerness to have a better generalization during the training
from learning the weights, but they require less time to make predictions.

Most machine learning algorithms are eager learners, and below are some examples:

 Logistic Regression.

 Support Vector Machine.

 Decision Trees.

 Artificial Neural Networks.


Lazy learners or instance-based learners, on the other hand, do not create any model
immediately from the training data, and this is where the lazy aspect comes from. They just
memorize the training data, and each time there is a need to make a prediction, they search
for the nearest neighbour from the whole training data, which makes them very slow during
prediction. Some examples of this kind are:

 K-Nearest Neighbour.

 Case-based reasoning.

Machine Learning Classification Vs. Regression

There are four main categories of Machine Learning algorithms: supervised, unsupervised,
semi-supervised, and reinforcement learning.

Even though classification and regression are both from the category of supervised learning,
they are not the same.

 The prediction task is a classification when the target variable is discrete. An application is
the identification of the underlying sentiment of a piece of text.

 The prediction task is a regression when the target variable is continuous. An example can be
the prediction of the salary of a person given their education degree, previous work
experience, geographical location, and level of seniority.

Examples of Machine Learning Classification in Real Life

Supervised Machine Learning Classification has different applications in multiple domains of
our day-to-day life. Below are some examples.

Healthcare

Training a machine learning model on historical patient data can help healthcare specialists
accurately analyze their diagnoses:

 During the COVID-19 pandemic, machine learning models were implemented to efficiently
predict whether a person had COVID-19 or not.
 Researchers can use machine learning models to predict new diseases that are more likely to
emerge in the future.

Education

Education is one of the domains dealing with the most textual, video, and audio data. This
unstructured information can be analyzed with the help of Natural Language technologies to
perform different tasks such as:

 The classification of documents per category.

 Automatic identification of the underlying language of students' documents during their
application.

 Analysis of students’ feedback sentiments about a Professor.

Transportation

Transportation is the key component of many countries' economic development. As a result,
industries are using machine and deep learning models:

 To predict which geographical location will have a rise in traffic volume.

 Predict potential issues that may occur in specific locations due to weather conditions.

Sustainable agriculture

Agriculture is one of the most valuable pillars of human survival. Introducing sustainability
can help improve farmers' productivity at a different level without damaging the
environment:

 By using classification models to predict which type of land is suitable for a given type of
seed.

 Predict the weather to help them take proper preventive measures.

Different Types of Classification Tasks in Machine Learning

There are four main classification tasks in Machine learning: binary, multi-class, multi-label,
and imbalanced classifications.
Binary Classification

In a binary classification task, the goal is to classify the input data into two mutually
exclusive categories. The training data in such a situation is labeled in a binary format: true
and false; positive and negative; 0 and 1; spam and not spam, etc., depending on the problem
being tackled. For instance, we might want to detect whether a given image is a truck or a
boat.

Logistic Regression and Support Vector Machines algorithms are natively designed for
binary classifications. However, other algorithms such as K-Nearest Neighbors and Decision
Trees can also be used for binary classification.

Multi-Class Classification

Multi-class classification, on the other hand, has more than two mutually exclusive class
labels, and the goal is to predict to which class a given input example belongs. For example, a
model might correctly classify an image as a plane.

Most of the binary classification algorithms can also be used for multi-class classification.
These algorithms include but are not limited to:

 Random Forest

 Naive Bayes

 K-Nearest Neighbors

 Gradient Boosting

 SVM

 Logistic Regression.

We can apply binary transformation approaches such as one-versus-one and one-versus-all
to adapt native binary classification algorithms for multi-class classification tasks.

One-versus-one: this strategy trains as many classifiers as there are pairs of labels. If we
have a 3-class classification, we will have three pairs of labels, thus three classifiers.
In general, for N labels, we will have Nx(N-1)/2 classifiers. Each classifier is trained on a
single binary dataset, and the final class is predicted by a majority vote between all the
classifiers. One-vs-one approach works best for SVM and other kernel-based algorithms.

One-versus-rest: here we consider each label in turn as the positive class and combine all the
remaining labels into a single negative class. With 3 classes, we will have 3 classifiers.

In general, for N labels, we will have N binary classifiers.
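
A minimal sketch of both strategies with scikit-learn, reusing a small toy dataset (an
assumption for illustration):

from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
y = [0, 0, 1, 1, 2]  # three classes

ovo = OneVsOneClassifier(SVC(random_state=0)).fit(X, y)   # trains 3*(3-1)/2 = 3 pairwise classifiers
ovr = OneVsRestClassifier(SVC(random_state=0)).fit(X, y)  # trains 3 classifiers, one per class
print(ovo.predict(X), ovr.predict(X))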

Multi-Label Classification

In multi-label classification tasks, we try to predict 0 or more classes for each input example.
In this case, there is no mutual exclusion because the input example can have more than one
label.

Such a scenario can be observed in different domains, such as auto-tagging in Natural
Language Processing, where a given text can contain multiple topics. Similarly, in computer
vision an image can contain multiple objects; for example, a model might predict that an
image contains a plane, a boat, a truck, and a dog.

It is not possible to use multi-class or binary classification models to perform multi-label
classification. However, most algorithms used for those standard classification tasks have
their specialized versions for multi-label classification. We can cite:

 Multi-label Decision Trees

 Multi-label Gradient Boosting

 Multi-label Random Forests
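
As a small sketch (with toy data assumed for illustration), scikit-learn's RandomForestClassifier
accepts a 2d binary indicator target directly, which gives a multi-label random forest:

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
y = MultiLabelBinarizer().fit_transform([[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]])

clf = RandomForestClassifier(random_state=0).fit(X, y)  # one forest, several labels per sample
print(clf.predict(X))  # a 2d array of predicted label indicators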

Imbalanced Classification

For the imbalanced classification, the number of examples is unevenly distributed in each
class, meaning that we can have more of one class than the others in the training data. Let’s
consider the following 3-class classification scenario where the training data contains: 60% of
trucks, 25% of planes, and 15% of boats.

The imbalanced classification problem could occur in the following scenarios:

 Fraudulent transaction detection in financial industries

 Rare disease diagnosis

 Customer churn analysis

Using conventional predictive models such as Decision Trees, Logistic Regression, etc. may
not be effective when dealing with an imbalanced dataset, because they might be biased
toward predicting the class with the highest number of observations and treat the classes with
fewer observations as noise.

So, does that mean that such problems are left behind?

Of course not! We can use multiple approaches to tackle the imbalance problem in a dataset.
The most commonly used approaches include sampling techniques or harnessing the power
of cost-sensitive algorithms.

Sampling Techniques

These techniques aim to balance the class distribution of the original dataset by:

 Cluster-based oversampling: oversampling performed within clusters of examples so that each
cluster, and therefore each class, becomes balanced.

 Random undersampling: random elimination of examples from the majority class.

 SMOTE oversampling: generation of synthetic minority-class examples by interpolating between
existing minority examples.

Cost-Sensitive Algorithms

These algorithms take into consideration the cost of misclassification. They aim to minimize
the total cost generated by the models.

 Cost-sensitive Decision Trees.

 Cost-sensitive Logistic Regression.

 Cost-sensitive Support Vector Machines.
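
A hedged sketch of the idea using scikit-learn's class_weight parameter; the dataset and the
90/10 imbalance ratio are assumptions chosen for illustration:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# imbalanced toy data: roughly 90% majority class, 10% minority class
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(max_iter=1000, class_weight='balanced').fit(X, y)  # minority errors cost more
print((plain.predict(X) == 1).sum(), (weighted.predict(X) == 1).sum())  # the weighted model predicts the minority class more often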


Metrics to Evaluate Machine Learning Classification Algorithms

Now that we have an idea of the different types of classification models, it is crucial to
choose the right evaluation metrics for those models. The most commonly used metrics are
accuracy, precision, recall, F1 score, and AUC (Area Under the Curve), the area under the
ROC (Receiver Operating Characteristic) curve.
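
A short sketch of computing these metrics with scikit-learn; y_true, y_pred, and y_scores
below are assumed toy vectors:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                    # hard class predictions
y_scores = [0.1, 0.6, 0.9, 0.8, 0.4, 0.2, 0.7, 0.3]  # predicted probabilities for class 1

print('Accuracy :', accuracy_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))
print('Recall   :', recall_score(y_true, y_pred))
print('F1 score :', f1_score(y_true, y_pred))
print('ROC AUC  :', roc_auc_score(y_true, y_scores))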

Building Machine Learning Pipelines

What is a Machine Learning Pipeline?


A machine learning pipeline helps automate machine learning workflows by processing and
integrating data sets into a model, which can then be evaluated and delivered. A well-built
pipeline helps in the flexibility of the model implementation. A pipeline in machine learning
is a technical infrastructure that allows an organization to organize and automate machine
learning operations.
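
As a minimal code-level sketch of the idea, a scikit-learn Pipeline chains a preprocessing step
and a model so they are fitted and applied as one unit; the particular steps and dataset below
are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([('scale', StandardScaler()),                  # data processing step
                 ('clf', LogisticRegression(max_iter=1000))])  # modelling step
pipe.fit(X_train, y_train)  # each step is fitted in order on the training data
print('Test accuracy:', pipe.score(X_test, y_test))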

How to Build a Machine Learning Pipeline?


There are mainly seven stages of building an end-to-end pipeline in machine learning. Let us
look at each of these stages:

1. Data Ingestion
The initial stage in every machine learning workflow is transferring incoming data into a data
repository. The vital element is that data is saved without alteration, allowing everyone to
record the original information accurately. You can obtain data from various sources,
including pub/sub requests. Also, you can use streaming data from other platforms. Each
dataset has a separate pipeline, which you can analyze simultaneously. The data is split
within each pipeline to take advantage of numerous servers or processors. This reduces the
overall time to perform the task by distributing the data processing across multiple pipelines.
For storing data, use NoSQL databases as they are an excellent choice for keeping massive
amounts of rapidly evolving organized/unorganized data. They also provide storage space
that is shared and extensible.
2. Data Processing
This time-consuming phase entails taking input, unorganized data and converting it into data
that the models can use. During this step, a distributed pipeline evaluates the data's quality for
structural differences, incorrect or missing data points, outliers, anomalies, etc., and corrects
any abnormalities along the way. This stage also includes the process of feature engineering.
Once you ingest data into the pipeline, the feature engineering process begins. It stores all the
generated features in a feature data repository. It transfers the output of features to the online
feature data storage upon completion of each pipeline, allowing for easy data retrieval.

3. Data Splitting
The primary objective of a machine learning data pipeline is to apply an accurate model to
data that it hasn't been trained on, based on the accuracy of its feature prediction. To assess
how the model performs against new datasets, you need to divide the existing labeled data into
training, testing, and validation subsets at this point. Model training and model evaluation are
the next two pipelines in this stage, and both should be able to access the API used for data
splitting. The splitting service should raise a notification and return the dataset if a split would
result in an irregular data distribution, protecting the downstream pipeline (model training or
evaluation) from selecting such values.
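
A small sketch of producing training, validation, and testing subsets with two successive
splits; the 60/20/20 ratio and the dataset are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# first split off the test set, then carve a validation set out of the remainder
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # roughly 60% / 20% / 20% of the data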

4. Model Training
This pipeline includes the entire collection of training model algorithms, which you can use
repeatedly and alternately as needed. The model training service obtains the training
configuration details, and the pipeline's process requests the required training dataset from the
API (or service) constructed during the data splitting stage. Once it sets the model,
configurations, training parameters, and other elements, it stores them in a model candidate
data repository, which will be evaluated and used further in the pipeline. Model training
should take into account error tolerance, data backups, and failover on training segments. For
example, you can retrain each split if the latest attempt fails owing to a transitory glitch.

5. Model Evaluation
This stage assesses the stored models' predictive performance using the test and validation
data subsets until a model solves the business problem efficiently. The model evaluation step
uses several criteria to compare predictions on the evaluation dataset with actual values. A
notification is broadcast once a model is ready for deployment, and the pipeline chooses the
"best" model from the evaluation sample to make predictions on future cases. A library of
multiple evaluators provides the accuracy metrics of a model and stores them against the
model in the data repository.

6. Model Deployment
Once the model evaluation is complete, the pipeline selects the best model and deploys it.
The pipeline can deploy multiple machine learning models to ensure a smooth transition
between old and new models; the pipeline services continue to work on new prediction
requests while deploying a new model.

7. Monitoring Model Performance


The final stage of a pipeline in machine learning is model monitoring and performance
scoring. This stage entails monitoring and assessing the model's behaviour on a regular and
recurring basis to gradually enhance it. Models are used for scoring based on feature values
imported by previous stages. When a new prediction is issued, the Performance Monitoring
Service receives a notification, runs the performance evaluation, records the outcomes, and
raises the necessary alerts. It compares the scoring to the observed results generated by the
data pipeline during the assessment. You can use various methods for monitoring, the most
common of which is logging analytics.

There are different tools in Machine learning for building a Pipeline. Some are given below
along with their usage:
 Obtaining the Data: Managing the Database - PostgreSQL, MongoDB, DynamoDB, MySQL;
Distributed Storage - Apache Hadoop, Apache Spark/Apache Flink.

 Scrubbing / Cleaning the Data: Scripting Languages - SAS, Python, and R; Processing in a
Distributed manner - MapReduce/Spark, Hadoop; Data Wrangling Tools - R, Python Pandas.

 Exploring / Visualizing the Data to find the patterns and trends: Python, R, MATLAB, and Weka.

 Modeling the data to make the predictions: Machine Learning algorithms - Supervised,
Unsupervised, Reinforcement, Semi-Supervised, and Semi-unsupervised learning; Important
libraries - Python (Scikit-learn) / R (CARET).

 Interpreting the result: Data Visualization Tools - ggplot, Seaborn, D3.JS, Matplotlib, Tableau.

Overview – Pipelines

Overview

Pipelines provide the capability to create automated models that seamlessly
integrate human and machine elements for data processing within a structured
pipeline framework. This architecture is composed of a series of interconnected
nodes, where each node's output serves as the input for the next in line.

Pipeline Process

Dataloop's pipeline process enables the smooth transition of data across various
stages, including:

 Labeling tasks
 Quality assurance tasks
 Execution of functions embedded within the Dataloop system
 Integration of code snippets
 Utilization of machine learning (ML) models
Throughout this process, your data can be filtered based on specific criteria,
segmented, merged, and its status can be modified as needed.

In summary, Dataloop's pipeline is equipped to:

 Streamline any production pipeline
 Preprocess and label data
 Automate operations through applications and models
 Postprocess data and train models of any type or scale with the utmost
performance and availability standards.

Process Example
1. The pipeline preprocesses the data with code, which may involve actions
like segmenting a video into frames.
2. After the data is preprocessed, it is directed to three different tasks, all of which
run in parallel.
3. Items that are marked as completed during these tasks are sent to a separate
task, such as a Quality Assurance (QA) task.
4. Conversely, items that are labeled with a discard status are directed to a
separate dataset for further handling.

Main Sections of the Pipeline Page

The Pipelines page allows you to manage and monitor your pipelines. Here, you
can initiate new pipelines either from the ground up or by using a template, as well
as access the dedicated details page for each individual pipeline.

The Pipeline page can be broken down into three sections, as outlined below:

Section 1: Metrics Bar

The metrics bar is where you can view the counts of your existing pipelines and the
breakdown of pipelines categorized by their statuses, which include Active,
Inactive, and Failed.

Section 2: Search and Create Pipelines

The Pipeline page features a Search field, which permits you to employ the
Pipeline's Name, Status, or Creator as search parameters for locating specific
pipelines. Furthermore, you have the option to filter pipelines based on their status.
Using the Create Pipeline button allows you to create pipelines.

Section 3: Pipelines List

The Pipeline page displays a table containing a list of pipelines, with their
respective details presented in individual columns. The available columns are as
follows:

The available columns are:

 Pipeline Name: The name of the pipeline serves as the entry point to access the pipeline's
canvas page. When you click on an Active Pipeline, it will open in View Mode, whereas clicking
on a Paused Pipeline will open it in Edit Mode.
 Status: The status of the pipeline can be one of the following: Active, Inactive, or Failed.
 Pause/Play: Indicates whether the pipeline is currently active or inactive. By clicking the
play/pause icon, you can activate or deactivate the pipeline as needed.
 Pending Cycles: Indicates the number of pipeline cycles in the queue. Once the initial node
service in the cycle becomes "active", the cycles will start running.
 Created By: The avatar displayed here belongs to the user who created the pipeline.
 Created At: This timestamp corresponds to when the pipeline was created.
 Updated At: This timestamp corresponds to when the pipeline was updated.
 Last Execution: This timestamp corresponds to the most recent cycle execution of the pipeline.

Pipeline Actions

The Pipelines page enables you to perform various actions on pipelines:

 Create Pipeline: Use the Create Pipeline button to create a new pipeline.
 Double-click on a Pipeline:
o Double-clicking on an Active Pipeline will open the Pipeline
in View mode.
o Double-clicking on a Paused Pipeline will open the Pipeline in Edit mode.
 Customize the table columns by clicking the Show/Hide Columns icon on the
right-side.
 Sort the table to present meaningful data by clicking the header of each
column.
 View or edit a pipeline by clicking the View only or Edit mode icons.
 Clicking on the Ellipsis (three-dots) icon will provide you additional actions,
including:
o Copy Pipeline ID: This allows you to copy the Pipeline ID. It can be used
for SDK Scripts.

o Rename Pipeline: This allows you to rename the pipeline, with the caveat
that renaming is not possible if the pipeline is currently running.

o Manage Secrets: This enables you to handle available secrets for your
pipeline. Clicking on Manage Pipeline allows you to add multiple secrets
from the available list. Managing the pipeline's secrets allows you to
create environment variables that will be securely stored in the vault. This
is recommended when handling sensitive information, such as usernames
and passwords in your code.

o View Executions: This feature allows you to view a detailed page of the
pipeline executions based on the selected pipeline's ID.

o View Logs: This option allows you to access a detailed page of the logs
related to the selected pipeline's ID.

o View Audit Logs: This feature provides access to a detailed page of the
audit logs linked to the selected pipeline's ID.

o Delete Pipeline: This action enables you to delete the selected pipeline.

Machine Learning Pipeline Tools


A machine learning pipeline uses hundreds of tools, libraries, and frameworks. Sometimes, it
becomes difficult for companies to hire a separate data science team for building machine
learning pipelines using various resources. This is where machine learning pipeline tools
come into the picture. Businesses involved with data processing tasks but running low on a
budget require machine learning pipeline tools to improve their performance and efficiency.
A machine learning pipeline tool handles the development, maintenance, and tracking of data
processing pipelines. Machine learning pipeline tools help businesses streamline their data
usage, resulting in better decision-making and increased overall productivity.
How do Machine Learning Pipeline Tools Benefit Businesses?
 Accurate Machine Learning Models- Automated machine learning pipeline
technologies can provide a continuous supply of high-quality data that will aid in fine-tuning
your machine learning algorithms. This creates better machine learning models that will
generate more accurate predictions.
 Faster Deployment- Data pipeline automation accelerates the process of training,
testing, and refining machine learning models, allowing you to deploy them sooner in
the market.
 Enhanced Business Forecasting- You may improve your business forecasting abilities
by using data pipeline technologies that help you construct a better machine learning
model. Improved business forecasting enables you to stay ahead of the competition,
provide a better client experience, and reap business profits.
Let us look at a few popular tools used in building an end-to-end machine learning pipeline
1. MLFlow
MLflow is a free and open-source tool for managing machine learning workflow, including
experimentation, production, deployment, and a centralized model repository. It has four
elements: tracking, projects, models, and registration. Individuals and organizations of any
scale can benefit from MLflow. The tool is not reliant on any particular library or a
programming language and can be combined with any machine learning library.
2. DVC
Data Version Control, or DVC, is an experimentation tool that helps define your pipeline
irrespective of the programming language used. DVC enables you to save time when
discovering a bug in earlier versions of your ML model by utilizing code and data versioning
and reproducibility. DVC is an open-source version control system for machine learning
applications. It keeps track of data sets and machine learning models, making them more
shareable and reproducible. DVC handles large files, data sets, machine learning models,
metrics, and code.
3. Neptune
Neptune is a machine learning metadata repository designed for monitoring various
experiments by research and production teams. It comes with a customizable metadata format
that lets you organize training and production info any way you desire. All model building
metadata may be logged, stored, shown, managed, compared, and queried in one place. It's
similar to a dictionary or folder layout that you build in code and then present in the user
interface.
4. Polyaxon
Polyaxon is a Kubernetes-based machine learning platform for reproducing and managing
machine learning workflows. Polyaxon can be hosted and maintained in any data center or
cloud provider. Polyaxon's orchestration lets you get the most out of your cluster by
managing jobs and experiments through a CLI, dashboard, SDKs, and REST API. Large-scale
deep learning applications can be built, trained, and monitored using the platform. It
supports major deep learning frameworks like Torch, TensorFlow, and MXNet.

Machine Learning pipeline techniques and tools - Example

The following are examples of machine learning pipeline orchestration tools and platforms:

 Metaflow.
 Kedro pipelines.
 ZenML.
 Flyte.
 Kubeflow pipelines.

Metaflow

Metaflow, originally a Netflix project, is a cloud-native framework that couples all the pieces
of the ML stack together—from orchestration to versioning, modeling, deployment, and other
stages. Metaflow allows you to specify a pipeline as a DAG of computations relating to your
workflow. Netflix runs hundreds to thousands of machine learning projects on Metaflow—
that’s how scalable it is.

Metaflow differs from other pipelining frameworks because it can load and store artifacts
(such as data and models) as regular Python instance variables. Anyone with a working
knowledge of Python can use it without learning other domain-specific languages (DSLs).
Kedro

Kedro is a Python library for building modular data science pipelines. Kedro assists you in
creating data science workflows composed of reusable components, each with a “single
responsibility,” to speed up data pipelining, improve data science prototyping, and promote
pipeline reproducibility.

ZenML

ZenML is an extensible, open-source MLOps framework for building portable, production-ready
MLOps pipelines. It's built for data scientists and MLOps engineers to collaborate as
they develop for production.

Flyte

Flyte is a platform for orchestrating ML pipelines at scale. You can use Flyte for deployment,
maintenance, lifecycle management, version control, and training. You can also use it with
platforms like Feast, PyTorch, TensorFlow, and whylogs to do tasks for the whole model
lifecycle.

Kubeflow Pipelines

Kubeflow Pipelines is an orchestration tool for building and deploying portable, scalable, and
reproducible end-to-end machine learning workflows directly on Kubernetes clusters. You
can define Kubeflow Pipelines with the following steps:

Step 1: Write the code implementation for each component as an executable file/script or
reuse pre-built components.

Step 2: Define the pipeline using a domain-specific language (DSL).

Step 3: Build and compile the workflow you have just defined.

Step 4: Step 3 creates a static YAML file, which can then be triggered to run the pipeline
through the intuitive Python SDK for pipelines.
Kubeflow is notably complex, and with slow development iteration cycles, other K8s-based
platforms like Flyte are making it easier to build pipelines. But deploying a cloud-managed
service like Google Kubernetes Engine (GKE) can be easier.

Machine Learning Pipeline Implementation

Machine Learning Pipeline Deployment on Different Platforms


This section gives you an overview of deploying machine learning data pipelines on various
platforms such as Azure, AWS, etc. It also consists of some projects which will provide you
with a better idea of deploying machine learning pipelines on these platforms.
Azure Machine Learning Pipelines
The Azure Machine Learning Pipeline makes creating, monitoring, and enhancing machine
learning processes easier. It is easy to use and consists of various other pipelines, each of
which has a function. Some of the multiple benefits of an Azure machine learning data
pipeline are-
 The Azure Machine Learning pipeline enables the coordination of several pipelines with
diverse and extensible computation and storage facilities. Individual pipeline phases are
run on separate compute units to use existing compute resources.
 It enables the creation of pipeline layouts for specific instances to activate published
pipelines from multiple systems, allowing for reusability.
 Azure machine learning pipelines optimize productivity by constantly monitoring data
and result pathways.

Here are some Azure MLOps projects you can practice to gain hands-on experience working
with Azure Machine Learning Pipelines -

 Azure Text Analytics for Medical Search Deployment


This project develops a machine learning application to recognize the relationship and pattern
between various medical terms. It will illustrate how to build an intelligent search engine that
will scan for documents that contain those keywords. The project also entails creating
an Azure machine learning pipeline to deploy and extend the application. This project will
introduce you to various Azure services, including Azure Data Storage, Data Factory,
databricks, and Azure Containers, among others.
 MLOps using Azure DevOps to Deploy a Classification Model
In this MLOps Azure project, you'll learn how to use scalable CI/CD ML pipelines to deploy
a classification machine learning model on Azure to predict a customer's licensing status. It
will assist you in comprehending Azure DevOps and developing a classification model that
will forecast the licensing status. You'll also learn how to use Azure DevOps to implement
the license status classification model in a scalable manner. The dataset contains customer
data whose license status is forecasted, such as granted, updated, or terminated.
 Azure Deep Learning-Deploy RNN CNN models for TimeSeries
You will learn how to do Docker-based deployment of RNN and CNN models for time
series forecasting on Azure Cloud in this Azure MLOps project. Azure is used as the cloud
platform, and Microsoft Azure's ecosystem includes several robust services for building an
end-to-end MLOps pipeline. In this project, you will deploy your time-series deep learning
model on the Azure cloud platform in a multi-part manner.

AWS Machine Learning Pipelines

AWS machine learning data pipelines allow businesses to develop, test, and deploy Machine
Learning models at volume. Data transformation, feature extraction, data retrieval, model
assessment and evaluation, and model deployment are all part of this process.

 ML Model Deployment on AWS for Customer Churn Prediction


This MLOps project idea seeks to deploy a model that predicts if a client will churn in the
coming days or not. AWS (Amazon Web Services) is the cloud provider in this project. This
project will introduce you to Terraform and show you how to deploy your machine learning
model on the Gunicorn web server. It will teach you to store the Terraform state in the AWS
s3 backend bucket.

 AWS MLOps Project to Deploy Multiple Linear Regression Model


This project aims to create a cost-optimized machine learning pipeline for a time series
multiple linear regression model on the AWS cloud platform (Amazon Web Services).
Working on this project will introduce you to the concept of Docker, Lightsail, and Flask. It
will help you understand the EC2 machine setup and deploy the machine learning application
on Lightsail.
Machine learning data pipelines play a crucial role in making businesses stand out in the
industry. These pipelines save a lot of time and enhance the overall efficiency of machine
learning operations in an organization. You can learn more about how ML pipelines work by
practicing a variety of solved end-to-end MLOps projects from the ProjectPro repository.

Iterative Machine Learning Model

It has been mentioned several times that machine learning implementation goes through an
iterative cycle. Each step of the entire ML cycle is visited again and again.

What makes the ML cycle iterative? Why does it become necessary to perform the same steps
repeatedly?
The answer lies in the nature of the problems that machine learning is trying to solve.
Machine learning as a field has no boundaries set at this point in time. It is an evolving field of
technology: as new algorithms are developed, new opportunities are found in the real world to
solve, and vice versa. Machine learning is primarily used as a solution in areas where
traditional programming ceases to have a viable solution. For example:

 Problems too complex to code explicitly, such as face recognition, or text extraction and
understanding from a variety of documents in different languages.

 Extremely high amounts of available data, such as stock market prediction, or government
agencies trying to get meaningful insights about the population based on census data.

 Dynamic availability of information, such as product or movie recommendations on Netflix,
Amazon, etc., which depend heavily on your last transaction or the current transaction you
are executing.

By its innate nature, machine learning is applicable to all those applications/systems
that are trying to learn and where you do not explicitly code the solution to the problem.
However, it is not right to assume that no coding is required at all.
Machine learning lets you code in a way that allows the application or system to
learn to solve the problem on its own.

Learning is an iterative process. Even when an infant tries to learn to walk, it has to go through
same process of walking, falling, standing, walking, balancing etc. again and again until it
achieves a certain degree of confidence to walk and run independently.

The same fundamental concept applies to Machine learning as well where it goes through the
Machine learning cycle repeatedly until desired confidence level is achieved.

Machine Learning Iterative process

Going through this cycle is necessary to ensure that the machine learning model is capturing the
patterns, characteristics and inter-dependencies from the given data.

A machine learning solution is only as good as the data that drives it.

You are never guaranteed a perfect model (in fact, there is no perfect, fully generalized model)
until it has gone through a significant number of iterations to learn the various scenarios
present in the data.
With this iterative process you are trying to obtain the best model, one that performs equally
well on unseen data. "Best" is measured through various metrics depending on the problem at
hand: for a prediction problem with a continuous output variable, it can be measured with
RMSE (root mean squared error) or R² (R-squared), whereas for a classification problem the
confusion matrix is better suited to measure model performance.

Machine Learning Cycle (Design > Development > Deployment > Optimize)

Summary
To achieve the desired ML model performance, having high-quality data is a crucial
requirement. However, for most real-world problems there are primarily three reasons that
impose challenges on the implementation of any ML model:
 Implementation | Integration | Data Quality
Hence, it becomes necessary to go through the ML pipeline steps mentioned below repeatedly.
 Build
1. Collect and prepare training data — Involves data Collection or data Engineering
followed by EDA — Exploratory Data analysis
2. Data Pre-processing and Feature Engineering
3. Choose Algorithm
 Train
1. Model training
2. Hyper parameter Optimization
3. Manage training requirements
4. Model Evaluation, Tuning and debugging
 Deployment
1. Deploy model in production
2. Address scalability requirements
3. Monitor quality, detect drift and retrain if required.

Comparison of Different Models


Create a pipeline consisting of a linear SVM, a simple Decision Tree and a simple Random
Forest Classifier.

Create three different pipelines:

 One for a standard linear SVM
 One for a default decision tree
 One for a RandomForestClassifier

from sklearn import svm
from sklearn import tree
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA

# X_train, X_test, y_train, y_test are assumed to have been prepared earlier

# Construct some pipelines
pipe_svm = Pipeline([('pca', PCA(n_components=27)),
                     ('clf', svm.SVC(random_state=123))])

pipe_tree = Pipeline([('pca', PCA(n_components=27)),
                      ('clf', tree.DecisionTreeClassifier(random_state=123))])

pipe_rf = Pipeline([('pca', PCA(n_components=27)),
                    ('clf', RandomForestClassifier(random_state=123))])

# List of pipelines, List of pipeline names
pipelines = [pipe_svm, pipe_tree, pipe_rf]
pipeline_names = ['Support Vector Machine', 'Decision Tree', 'Random Forest']

# Loop to fit each of the three pipelines
for pipe in pipelines:
    print(pipe)
    pipe.fit(X_train, y_train)

# Compare accuracies
for index, val in enumerate(pipelines):
    print('%s pipeline test accuracy: %.3f' % (pipeline_names[index],
                                               val.score(X_test, y_test)))

The first loop prints each Pipeline's configuration; the accuracy comparison then produces the
following output:

Support Vector Machine pipeline test accuracy: 0.749


Decision Tree pipeline test accuracy: 0.666
Random Forest pipeline test accuracy: 0.745
