Unit 2 MLMM
In general, a learning problem considers a set of n samples of data and then tries to predict
properties of unknown data. If each sample is more than a single number and, for instance, a
multi-dimensional entry (multivariate data), it is said to have several attributes or features.
Supervised learning, in which the data comes with additional attributes that we want
to predict. This problem can be either:
o Classification: samples belong to two or more classes and we want to learn
from already labeled data how to predict the class of unlabeled data. An
example of a classification problem would be handwritten digit recognition, in
which the aim is to assign each input vector to one of a finite number of
discrete categories. Another way to think of classification is as a discrete (as
opposed to continuous) form of supervised learning where one has a limited
number of categories and for each of the n samples provided, one is to try to
label them with the correct category or class.
o Regression: if the desired output consists of one or more continuous variables,
then the task is called regression. An example of a regression problem would
be the prediction of Housing Prices.
Unsupervised learning, in which the training data consists of a set of input vectors x
without any corresponding target values. The goal in such problems may be to
discover groups of similar examples within the data, where it is called clustering, or to
determine the distribution of data within the input space, known as density estimation,
or to project the data from a high-dimensional space down to two or three dimensions
for the purpose of visualization.
Training set and testing set
Machine learning is about learning some properties of a data set and then testing those
properties against another data set. A common practice in machine learning is to evaluate an
algorithm by splitting a data set into two. We call one of those sets the training set, on which
we learn some properties; we call the other set the testing set, on which we test the learned
properties.
scikit-learn comes with a few standard datasets, for instance the iris and digits datasets for
classification and the diabetes dataset for regression.
In the following, we start a Python interpreter from our shell and then load
the iris and digits datasets. Our notational convention is that $ denotes the shell prompt
while >>> denotes the Python interpreter prompt:
$ python
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> digits = datasets.load_digits()
A dataset is a dictionary-like object that holds all the data and some metadata about the data.
This data is stored in the .data member, which is an (n_samples, n_features) array. In the case of
supervised problems, one or more response variables are stored in the .target member.
For instance, in the case of the digits dataset, digits.data gives access to the features that can
be used to classify the digits samples:
>>> print(digits.data)
[[ 0. 0. 5. ... 0. 0. 0.]
[ 0. 0. 0. ... 10. 0. 0.]
[ 0. 0. 0. ... 16. 9. 0.]
...
[ 0. 0. 1. ... 6. 0. 0.]
[ 0. 0. 2. ... 12. 0. 0.]
[ 0. 0. 10. ... 12. 1. 0.]]
and digits.target gives the ground truth for the digit dataset, that is the number corresponding
to each digit image that we are trying to learn:
>>> digits.target
array([0, 1, 2, ..., 8, 9, 8])
The data is always a 2D array, shape (n_samples, n_features), although the original data may
have had a different shape. In the case of the digits, each original sample is an image of
shape (8, 8) and can be accessed using:
>>> digits.images[0]
array([[ 0., 0., 5., 13., 9., 1., 0., 0.],
[ 0., 0., 13., 15., 10., 15., 5., 0.],
[ 0., 3., 15., 2., 0., 11., 8., 0.],
[ 0., 4., 12., 0., 0., 8., 8., 0.],
[ 0., 5., 8., 0., 0., 9., 8., 0.],
[ 0., 4., 11., 0., 1., 12., 7., 0.],
[ 0., 2., 14., 5., 10., 12., 0., 0.],
[ 0., 0., 6., 13., 10., 0., 0., 0.]])
The simple example on this dataset illustrates how starting from the original problem one can
shape the data for consumption in scikit-learn.
In the case of the digits dataset, the task is to predict, given an image, which digit it
represents. We are given samples of each of the 10 possible classes (the digits zero through
nine) on which we fit an estimator to be able to predict the classes to which unseen samples
belong.
In this example, we set the value of gamma manually. To find good values for these
parameters, we can use tools such as grid search and cross validation.
The clf (for classifier) estimator instance is first fitted to the data; that is, it must learn from the data. This is done by passing our training set to the fit method. For the training set, we'll use all the images from our dataset except the last one, which we'll reserve for prediction. We select the training set with the [:-1] Python syntax, which produces a new array that contains all but the last item from digits.data.
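The estimator construction and fitting steps that this passage refers to are not shown in these notes. A minimal sketch in the style of the scikit-learn tutorial (the gamma=0.001 and C=100. values are the tutorial's manual choices, not requirements):
>>> from sklearn import svm
>>> clf = svm.SVC(gamma=0.001, C=100.)
>>> clf.fit(digits.data[:-1], digits.target[:-1])
SVC(C=100.0, gamma=0.001)
Now the class of the last, held-out image can be predicted: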
>>> clf.predict(digits.data[-1:])
array([8])
Refitting and updating parameters
>>> clf.set_params(kernel='rbf').fit(X, y)
SVC()
>>> clf.predict(X[:5])
array([0, 0, 0, 0, 0])
Here, the default kernel rbf is first changed to linear via SVC.set_params() after the
estimator has been constructed, and changed back to rbf to refit the estimator and to make a
second prediction.
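Only the second half of that sequence appears above. A minimal self-contained sketch of the full sequence, assuming the iris data and an SVC as in the scikit-learn tutorial:
>>> from sklearn.datasets import load_iris
>>> from sklearn.svm import SVC
>>> X, y = load_iris(return_X_y=True)
>>> clf = SVC()
>>> clf.set_params(kernel='linear').fit(X, y)   # change the kernel to linear and fit
SVC(kernel='linear')
>>> clf.predict(X[:5])
array([0, 0, 0, 0, 0])
>>> clf.set_params(kernel='rbf').fit(X, y)      # change back to rbf and refit
SVC()
>>> clf.predict(X[:5])
array([0, 0, 0, 0, 0])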
When using multiclass classifiers, the learning and prediction task that is performed is
dependent on the format of the target data fit upon:
>>> X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
>>> y = [0, 0, 1, 1, 2]
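The classifier classif used in the snippets below is never constructed in these notes; the scikit-learn tutorial wraps an SVC in a OneVsRestClassifier, roughly as follows:
>>> from sklearn.svm import SVC
>>> from sklearn.multiclass import OneVsRestClassifier
>>> from sklearn.preprocessing import LabelBinarizer
>>> classif = OneVsRestClassifier(estimator=SVC(random_state=0))
>>> classif.fit(X, y).predict(X)
array([0, 0, 1, 1, 2])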
In the above case, the classifier is fit on a 1d array of multiclass labels and
the predict() method therefore provides corresponding multiclass predictions. It is also
possible to fit upon a 2d array of binary label indicators:
>>> y = LabelBinarizer().fit_transform(y)
>>> classif.fit(X, y).predict(X)
array([[1, 0, 0],
[1, 0, 0],
[0, 1, 0],
[0, 0, 0],
[0, 0, 0]])
Here, the classifier is fit() on a 2d binary label representation of y, using the LabelBinarizer.
In this case predict() returns a 2d array representing the corresponding multilabel predictions.
Note that the fourth and fifth instances returned all zeroes, indicating that they matched none
of the three labels fit upon. With multilabel outputs, it is similarly possible for an instance to
be assigned multiple labels:
>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
>>> y = MultiLabelBinarizer().fit_transform(y)
>>> classif.fit(X, y).predict(X)
array([[1, 1, 0, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 0],
[1, 0, 1, 0, 0]])
In this case, the classifier is fit upon instances each assigned multiple labels.
The MultiLabelBinarizer is used to binarize the 2d array of multilabels to fit upon. As a
result, predict() returns a 2d array with multiple predicted labels for each instance.
Model persistence: the pickle dump() and load() methods
Syntax of the dump() method:
pickle.dump(obj, file, protocol=None, *, fix_imports=True, buffer_callback=None)
Parameters:
obj: The Python object to be pickled.
file: The file or buffer that the pickled object will be written to.
fix_imports: If fix_imports is true and the protocol is less than 3, pickle tries to map the new Python 3 names onto the old module names used in Python 2, so that the pickle stream is readable with Python 2. It defaults to True and must be passed as a keyword argument.
Syntax of the load() method:
pickle.load(file, *, fix_imports=True, encoding='ASCII', errors='strict', buffers=None)
The load() method reads the pickled representation of an object from the open file object file and returns the reconstituted object hierarchy specified therein.
Example 1: Saving and loading models using pickle
pickle is Python's default mechanism for serializing objects. A machine learning model can be serialized/encoded with the pickling process and the serialized format saved to a file. You can later load this file to deserialize/decode the model and use it to make new predictions. The example that follows trains a linear regression model: the model is fit on the training data and dump() is used to save it, taking the trained model and a file object. The model is then reloaded with load(), predictions are made on the test data, and the root mean squared error metric is used to evaluate them.
# import packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import pickle

# load the data ('data.csv' is a placeholder; the last column is assumed to be the target)
dataset = pd.read_csv('data.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# train the model and serialize it to a file with dump()
model = LinearRegression().fit(X_train, y_train)
pickle.dump(model, open('model.pkl', 'wb'))

# deserialize the model with load() and evaluate it on the test data
load_model = pickle.load(open('model.pkl', 'rb'))
y_pred = load_model.predict(X_test)
print('root mean squared error : ', np.sqrt(
    metrics.mean_squared_error(y_test, y_pred)))
Output:
root mean squared error : 72.11529287182815
The SciPy ecosystem includes Joblib, which offers tools for pipelining Python jobs. It
offers tools for effectively saving and loading Python objects that employ NumPy data
structures. This can be helpful for machine learning algorithms that need to store the
complete dataset or have a lot of parameters. Let’s look at a simple example where we save
and load a linear regression model. The same steps are repeated while using the joblib
library.
# import packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import joblib

# load the data ('data.csv' is a placeholder; the last column is assumed to be the target)
dataset = pd.read_csv('data.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# train the model and save it with joblib.dump()
model = LinearRegression().fit(X_train, y_train)
joblib.dump(model, 'model.joblib')

# reload the model with joblib.load() and evaluate it on the test data
load_model = joblib.load('model.joblib')
y_pred = load_model.predict(X_test)
print('root mean squared error : ', np.sqrt(
    metrics.mean_squared_error(y_test, y_pred)))
Output:
root mean squared error : 72.11529287182815
Introduction
Regression problems are prevalent in machine learning, and regression analysis is the most often used technique for solving them. It is based on data modelling and entails determining the best-fit line that passes through all data points with the shortest possible distance between the line and each data point. While there are other techniques for regression analysis, linear and logistic regression are the most widely used. Ultimately, the type of regression analysis used is determined by the nature of the data.
Regression Model/Analysis
Regression analysis is a predictive modelling technique that examines the relationship between a dataset's dependent (target) and independent variables. It is widely used when the dependent and independent variables are linked in a linear or non-linear fashion and the target variable has a set of continuous values. Regression analysis approaches thus help in establishing relationships between variables, modelling time series, and forecasting; for example, regression can be used to examine the relationship between advertising expenditure and sales.
Regression analysis is used for one of two purposes: predicting the value of the dependent variable when information about the independent variables is known, or predicting the effect of changes in the independent variables on the dependent variable.
There are numerous regression analysis approaches available for making predictions. The choice of technique is driven by factors such as the number of independent variables, the shape of the regression line, and the type of dependent variable.
Let us examine several of the most often utilized regression analysis techniques:
1. Linear Regression
The most extensively used modelling technique is linear regression, which assumes a linear
connection between a dependent variable (Y) and an independent variable (X). It employs a
regression line, also known as a best-fit line. The linear connection is defined as Y = c+m*X
+ e, where ‘c’ denotes the intercept, ‘m’ denotes the slope of the line, and ‘e’ is the error
term.
The linear regression model can be simple (with a single independent variable) or multiple (with more than one independent variable).
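A short sketch showing how the intercept c and slope m of the fitted line can be read off a scikit-learn model; the data values here are made up purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# toy data: y is roughly 2*x + 1 plus noise (illustrative values only)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

model = LinearRegression().fit(X, y)
print('intercept (c):', model.intercept_)   # estimate of c
print('slope (m):', model.coef_[0])         # estimate of m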
2. Logistic Regression
When the dependent variable is discrete, the logistic regression technique is applicable. In other words, this technique is used to compute the probability of mutually exclusive outcomes such as pass/fail, true/false, 0/1, and so forth. Thus, the target variable can take on only one of two values, and a sigmoid curve represents its relationship to the independent variables.
3. Polynomial Regression
Polynomial regression models a non-linear relationship between the dependent and independent variables. It is a variant of the multiple linear regression model, except that the best-fit line is curved rather than straight.
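A minimal sketch of polynomial regression in scikit-learn, obtained by expanding the features with PolynomialFeatures and then fitting an ordinary linear model; the degree-2 choice and the toy data are illustrative assumptions:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# toy quadratic data for illustration
X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 - X.ravel() + 2 + np.random.normal(0, 0.2, 30)

# degree-2 polynomial features followed by linear regression
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
print(poly_model.predict([[1.5]]))   # prediction for a new input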
4. Ridge Regression
The ridge regression technique is applied when data exhibits multicollinearity, that is, when the independent variables are highly correlated. While least squares estimates are unbiased under multicollinearity, their variances are large enough to cause the observed values to diverge from the actual values. Ridge regression reduces standard errors by adding a small bias to the regression estimates.
The lambda (λ) variable in the ridge regression equation resolves the multicollinearity
problem.
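In scikit-learn the λ penalty is exposed as the alpha parameter of Ridge; a small sketch on made-up, deliberately collinear data (the alpha value is arbitrary and would normally be tuned by cross-validation):

import numpy as np
from sklearn.linear_model import Ridge

# two strongly correlated features to mimic multicollinearity (toy data)
rng = np.random.RandomState(0)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.01, size=50)   # nearly identical to x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=50)

ridge = Ridge(alpha=1.0)   # alpha plays the role of lambda
ridge.fit(X, y)
print(ridge.coef_)          # coefficients are shrunk, spreading weight across x1 and x2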
5. Lasso Regression
As with ridge regression, the lasso (Least Absolute Shrinkage and Selection Operator)
technique penalizes the absolute magnitude of the regression coefficient. Additionally, the
lasso regression technique employs variable selection, which leads to the shrinkage of
coefficient values to absolute zero. The basic idea of lasso regression is to introduce a little
bias so that the variance can be substantially reduced, which leads to a lower overall MSE.
Notice that as λ increases, variance drops substantially with very little increase in bias.
Beyond a certain point, though, variance decreases less rapidly and the shrinkage in the
coefficients causes them to be significantly underestimated which results in a large increase
in bias.
Plotting test MSE against λ shows that the test MSE is lowest when we choose a value for λ that produces an optimal tradeoff between bias and variance.
When λ = 0, the penalty term in lasso regression has no effect, so it produces the same coefficient estimates as least squares. However, by increasing λ to a certain point we can reduce the overall test MSE, which means the model fit by lasso regression can produce smaller test errors than the model fit by least squares regression.
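A sketch of choosing λ for the lasso by cross-validation with scikit-learn's LassoCV, which searches a grid of alpha (λ) values and keeps the one with the lowest validation error; the synthetic dataset is purely illustrative:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# synthetic regression data with only a few informative features
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = LassoCV(cv=5, random_state=0).fit(X, y)
print('chosen alpha (lambda):', lasso.alpha_)
print('non-zero coefficients:', np.sum(lasso.coef_ != 0))   # many are shrunk to exactly zero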
6. Quantile Regression
The quantile regression approach is a subset of the linear regression technique. It is employed when the linear regression requirements are not met or when the data contains outliers.
7. Bayesian Regression
Bayesian regression is a machine learning technique that uses Bayes' theorem to calculate the values of the regression coefficients. Rather than determining the least-squares estimates, this technique determines the posterior distribution of the coefficients.
8. Principal Components Regression
Multicollinear regression data is often evaluated using the principal components regression approach. Like ridge regression, this approach reduces standard errors by biasing the regression estimates. Principal component analysis (PCA) is used first to transform the training data, and the resulting transformed samples are then used to train the regression model.
9. Partial Least Squares Regression
The partial least squares regression technique is a fast and efficient covariance-based regression technique. It is useful for problems with many independent variables and a high probability of multicollinearity between them. The method reduces the variables to a manageable number of predictors, which are then used in a regression.
10. Elastic Net Regression
Elastic net regression combines the ridge and lasso regression techniques and is particularly useful when dealing with strongly correlated data. It regularizes regression models by using both the ridge and lasso penalties.
There are 3 species in the Iris genus, namely Iris Setosa, Iris Versicolor, and Iris Virginica, with 50 rows of data for each species of Iris flower. The column names represent the features of the flower that were studied and recorded.
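The code below refers to iris_df, lr, and X_test without showing how they were created. One plausible setup, assuming the goal is to predict sepal length (cm) from the remaining columns with a linear regression (this choice of target and split is an assumption, not stated in these notes):

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# build a DataFrame with the four measurements plus the species label
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target

# assumed target: sepal length, predicted from the other columns
X = iris_df.drop(columns=['sepal length (cm)'])
y = iris_df['sepal length (cm)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

lr = LinearRegression().fit(X_train, y_train)

Under this assumption, the target column would be dropped from test_df below before calling lr.predict on it.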
# Making Predictions
pred = lr.predict(X_test)
Now to test…
iris_df.loc[6]
d = {'sepal length (cm)' : [4.6],
'sepal width (cm)' : [3.4],
'petal length (cm)' : [1.4],
'petal width (cm)' : [0.3],
'species' : 0}
test_df = pd.DataFrame(data= d)
test_df
As you can see, there is a discrepancy between the predicted value and the actual value; the difference is approximately 0.283 cm, which is slightly higher than the mean absolute error.
Classification Management
Classification is a supervised machine learning method where the model tries to predict the
correct label of a given input data. In classification, the model is fully trained using the
training data, and then it is evaluated on test data before being used to perform prediction on
new unseen data.
For instance, an algorithm can learn to predict whether a given email is spam or ham (not spam).
Before diving into the classification concept, we will first understand the difference between
the two types of learners in classification: lazy and eager learners. Then we will clarify the
misconception between classification and regression.
There are two types of learners in machine learning classification: lazy and eager learners.
Eager learners are machine learning algorithms that first build a model from the training dataset before making any prediction on future datasets. They spend more time during the training process, because of their eagerness to achieve better generalization by learning the weights, but they require less time to make predictions. Most machine learning algorithms are eager learners; examples include:
Logistic Regression.
Decision Trees.
Lazy learners, or instance-based learners, on the other hand, do not build a model from the training data immediately; they memorize the training data and generalize only when a prediction is requested. Examples include:
K-Nearest Neighbour.
Case-based reasoning.
There are four main categories of Machine Learning algorithms: supervised, unsupervised,
semi-supervised, and reinforcement learning.
Even though classification and regression are both from the category of supervised learning,
they are not the same.
The prediction task is a classification when the target variable is discrete. An application is
the identification of the underlying sentiment of a piece of text.
The prediction task is a regression when the target variable is continuous. An example can be
the prediction of the salary of a person given their education degree, previous work
experience, geographical location, and level of seniority.
Healthcare
Training a machine learning model on historical patient data can help healthcare specialists
accurately analyze their diagnoses:
During the COVID-19 pandemic, machine learning models were implemented to efficiently
predict whether a person had COVID-19 or not.
Researchers can use machine learning models to predict new diseases that are more likely to
emerge in the future.
Education
Education is one of the domains dealing with the most textual, video, and audio data. This unstructured information can be analyzed with the help of Natural Language technologies to perform different tasks such as classifying documents by category or identifying the language of a document.
Transportation
Predict potential issues that may occur in specific locations due to weather conditions.
Sustainable agriculture
Agriculture is one of the most valuable pillars of human survival. Introducing sustainability can help improve farmers' productivity at different levels without damaging the environment, for example by using classification models to predict which type of land is suitable for a given type of seed.
There are four main classification tasks in Machine learning: binary, multi-class, multi-label,
and imbalanced classifications.
Binary Classification
In a binary classification task, the goal is to classify the input data into two mutually
exclusive categories. The training data in such a situation is labeled in a binary format: true and false; positive and negative; 0 and 1; spam and not spam, etc., depending on the problem being tackled. For instance, we might want to detect whether a given image is of a truck or a boat.
Logistic Regression and Support Vector Machines algorithms are natively designed for
binary classifications. However, other algorithms such as K-Nearest Neighbors and Decision
Trees can also be used for binary classification.
Multi-Class Classification
The multi-class classification, on the other hand, has at least two mutually exclusive class
labels, where the goal is to predict to which class a given input example belongs to. In the
following case, the model correctly classified the image to be a plane.
Most of the binary classification algorithms can be also used for multi-class classification.
These algorithms include but are not limited to:
Random Forest
Naive Bayes
K-Nearest Neighbors
Gradient Boosting
SVM
Logistic Regression.
Binary classifiers can also be applied to multi-class problems using one of two strategies, sketched in code after this list.
One-versus-one: this strategy trains as many classifiers as there are pairs of labels. For a 3-class problem, there are three pairs of labels and thus three classifiers. In general, for N labels we will have N×(N-1)/2 classifiers. Each classifier is trained on a single binary dataset, and the final class is predicted by a majority vote over all the classifiers. The one-vs-one approach works best for SVM and other kernel-based algorithms.
One-versus-rest: in this strategy, each label in turn is treated as an independent (positive) label, and the remaining labels combined are treated as a single label. With 3 classes, we will have 3 classifiers.
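Both strategies are available as meta-estimators in scikit-learn; a minimal sketch comparing how many underlying classifiers each one trains on a 3-class problem (the iris data and logistic regression base model are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)            # 3 classes
base = LogisticRegression(max_iter=1000)

ovo = OneVsOneClassifier(base).fit(X, y)     # N*(N-1)/2 = 3 pairwise classifiers
ovr = OneVsRestClassifier(base).fit(X, y)    # one classifier per class = 3
print(len(ovo.estimators_), len(ovr.estimators_))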
Multi-Label Classification
In multi-label classification tasks, we try to predict 0 or more classes for each input example.
In this case, there is no mutual exclusion because the input example can have more than one
label.
Imbalanced Classification
For the imbalanced classification, the number of examples is unevenly distributed in each
class, meaning that we can have more of one class than the others in the training data. Let’s
consider the following 3-class classification scenario where the training data contains: 60% of
trucks, 25% of planes, and 15% of boats.
Conventional predictive models such as Decision Trees, Logistic Regression, etc. may not be effective when dealing with an imbalanced dataset, because they can be biased toward predicting the class with the highest number of observations and treating the classes with fewer observations as noise.
So, does that mean that such problems are left behind?
Of course not! We can use multiple approaches to tackle the imbalance problem in a dataset.
The most commonly used approaches include sampling techniques or harnessing the power
of cost-sensitive algorithms.
Sampling Techniques
These rebalance the class distribution before training, for example through random undersampling of the majority class, random oversampling of the minority class, SMOTE, or cluster-based oversampling.
Cost-Sensitive Algorithms
These algorithms take into consideration the cost of misclassification. They aim to minimize
the total cost generated by the models.
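One simple cost-sensitive option in scikit-learn is the class_weight parameter, which raises the penalty for misclassifying the rarer classes; a sketch on an imbalanced toy dataset (the 90/10 split and the model are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# imbalanced toy data: roughly 90% of samples in one class
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# 'balanced' re-weights errors inversely to class frequency
clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X, y)
print(clf.score(X, y))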
Now that we have an idea of the different types of classification models, it is crucial to choose the right evaluation metrics for those models. The most commonly used metrics are accuracy, precision, recall, F1 score, and the area under the ROC (Receiver Operating Characteristic) curve, known as the AUC (Area Under the Curve).
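All of these metrics are available in sklearn.metrics; a short sketch computing them for a binary classifier (the breast cancer dataset and logistic regression model are arbitrary illustrative choices):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]   # class probabilities for the ROC curve

print('accuracy :', accuracy_score(y_test, y_pred))
print('precision:', precision_score(y_test, y_pred))
print('recall   :', recall_score(y_test, y_pred))
print('F1 score :', f1_score(y_test, y_pred))
print('ROC AUC  :', roc_auc_score(y_test, y_score))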
1. Data Ingestion
The initial stage in every machine learning workflow is transferring incoming data into a data
repository. The vital element is that data is saved without alteration, allowing everyone to
record the original information accurately. You can obtain data from various sources,
including pub/sub requests. Also, you can use streaming data from other platforms. Each
dataset has a separate pipeline, which you can analyze simultaneously. The data is split
within each pipeline to take advantage of numerous servers or processors. This reduces the
overall time to perform the task by distributing the data processing across multiple pipelines.
For storing data, use NoSQL databases as they are an excellent choice for keeping massive
amounts of rapidly evolving organized/unorganized data. They also provide storage space
that is shared and extensible.
2. Data Processing
This time-consuming phase entails taking input, unorganized data and converting it into data
that the models can use. During this step, a distributed pipeline evaluates the data's quality for
structural differences, incorrect or missing data points, outliers, anomalies, etc., and corrects
any abnormalities along the way. This stage also includes the process of feature engineering.
Once you ingest data into the pipeline, the feature engineering process begins. It stores all the
generated features in a feature data repository. It transfers the output of features to the online
feature data storage upon completion of each pipeline, allowing for easy data retrieval.
3. Data Splitting
The primary objective of a machine learning data pipeline is to apply an accurate model to
data that it hasn't been trained on, based on the accuracy of its feature prediction. To assess
how the model works against new datasets, you need to divide the existing labeled data into
training, testing, and validation data subsets at this point. Model training and model evaluation are the next two pipelines in this stage, and both should access the data through the API (or service) used for data splitting. To protect those pipelines against splits that result in an irregular data distribution, the splitting service should produce a notification and return the dataset through that API.
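A minimal sketch of the three-way split described above, done with two calls to scikit-learn's train_test_split (the 70/15/15 proportions and the iris data are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# first hold out 30% of the data, then split that portion half-and-half
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # roughly 70% / 15% / 15%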
4. Model Training
This pipeline includes the entire collection of training model algorithms, which you can use
repeatedly and alternatively as needed. The model training service obtains the training
configuration details, and the pipeline's process requests the required training dataset from the
API (or service) constructed during the data splitting stage. Once it sets the model,
configurations, training parameters, and other elements, it stores them into a model candidate
data repository which will be evaluated and used further in the pipeline. Model training should account for error tolerance, data backups, and failover on training segments. For example, you can retrain a given split if the latest attempt fails owing to a transitory glitch.
5. Model Evaluation
This stage assesses the stored models' predictive performance using test and validation data
subsets until a model solves the business problem efficiently. The model evaluation step uses
several criteria to compare predictions on the evaluation dataset with actual values. A
notification service is broadcast once a model is ready for deployment, and the pipeline
chooses the "best" model from the evaluation sample to make predictions on future cases. A
library of multiple evaluators provides the accuracy metrics of a model and stores them
against the model in the data repository.
6. Model Deployment
Once the model evaluation is complete, the pipeline selects the best model and deploys it.
The pipeline can deploy multiple machine learning models to ensure a smooth transition
between old and new models; the pipeline services continue to work on new prediction
requests while deploying a new model.
There are different tools in Machine learning for building a Pipeline. Some are given below
along with their usage:
Step while building the pipeline: Interpreting the result
Tools: Data visualization tools such as ggplot, Seaborn, D3.js, Matplotlib, and Tableau.
Overview – Pipelines
Pipeline Process
Dataloop's pipeline process enables the smooth transition of data across various
stages, including:
Labeling tasks
Quality assurance tasks
Execution of functions embedded within the Dataloop system
Integration of code snippets
Utilization of machine learning (ML) models
Throughout this process, your data can be filtered based on specific criteria,
segmented, merged, and its status can be modified as needed.
Process Example
1. The pipeline begins with preprocessing of the data by code, which may involve actions like segmenting a video into frames.
2. After the data is preprocessed, it is directed to three different tasks, all of which
run in parallel.
3. Items that are marked as completed during these tasks are sent to a separate
task, such as a Quality Assurance (QA) task.
4. Conversely, items that are labeled with a discard status are directed to a
separate dataset for further handling.
The Pipelines page allows you to manage and monitor your pipelines. Here, you
can initiate new pipelines either from the ground up or by using a template, as well
as access the dedicated details page for each individual pipeline.
The Pipeline page can be broken down into three sections, as outlined below:
The metrics bar is where you can view the counts of your existing pipelines and the
breakdown of pipelines categorized by their statuses, which include Active,
Inactive, and Failed.
The Pipeline page features a Search field, which permits you to employ the
Pipeline's Name, Status, or Creator as search parameters for locating specific
pipelines. Furthermore, you have the option to filter pipelines based on their status.
Using the Create Pipeline button allows you to create pipelines.
The Pipeline page displays a table containing a list of pipelines, with their
respective details presented in individual columns. The available columns are as
follows:
Pipeline Name: The name of the pipeline serves as the entry point to access the pipeline's canvas page. When you click on an Active Pipeline, it will open in View Mode, whereas clicking on a Paused Pipeline will open it in Edit Mode.
Status: The status of the pipeline can be one of the following: Active, Inactive, or Failed.
Pause/Play: Indicates whether the pipeline is currently active or inactive. By clicking on the play/pause icon, you can activate or deactivate the pipeline as needed.
Pending Cycles: Indicates the number of pipeline cycles in the queue. Once the initial node service in the cycle becomes "active", the cycles will start running.
Created By: The avatar displayed here belongs to the user who created the pipeline.
Last Execution: This timestamp corresponds to the most recent cycle execution of the pipeline.
Pipeline Actions
Create Pipeline: Use the Create Pipeline button to create a new pipeline.
Double-click on a Pipeline:
o Double-clicking on an Active Pipeline will open the Pipeline
in View mode.
o Double-clicking on a Paused Pipeline will open the Pipeline in Edit mode.
Customize the table columns by clicking the Show/Hide Columns icon on the right side.
Sort the table to present meaningful data by clicking the header of each
column.
View or edit a pipeline by clicking the View only or Edit mode icons.
Clicking on the Ellipsis (three-dots) icon will provide you with additional actions, including:
o Copy Pipeline ID: This allows you to copy the Pipeline ID. It can be used
for SDK Scripts.
o Rename Pipeline: This allows you to rename the pipeline, with the caveat
that renaming is not possible if the pipeline is currently running.
o Manage Secrets: This enables you to handle available secrets for your
pipeline. Clicking on Manage Pipeline allows you to add multiple secrets
from the available list. Managing the pipeline's secrets allows you to
create environment variables that will be securely stored in the vault. This
is recommended when handling sensitive information, such as usernames
and passwords in your code.
o View Executions: This feature allows you to view a detailed page of the
pipeline executions based on the selected pipeline's ID.
o View Logs: This option allows you to access a detailed page of the logs
related to the selected pipeline's ID.
o View Audit Logs: This feature provides access to a detailed page of the
audit logs linked to the selected pipeline's ID.
o Delete Pipeline: This action enables you to delete the selected pipeline.
The following are examples of machine learning pipeline orchestration tools and platforms:
Metaflow.
Kedro pipelines.
ZenML.
Flyte.
Kubeflow pipelines.
Metaflow
Metaflow, originally a Netflix project, is a cloud-native framework that couples all the pieces
of the ML stack together—from orchestration to versioning, modeling, deployment, and other
stages. Metaflow allows you to specify a pipeline as a DAG of computations relating to your
workflow. Netflix runs hundreds to thousands of machine learning projects on Metaflow—
that’s how scalable it is.
Metaflow differs from other pipelining frameworks because it can load and store artifacts
(such as data and models) as regular Python instance variables. Anyone with a working
knowledge of Python can use it without learning other domain-specific languages (DSLs).
Kedro
Kedro is a Python library for building modular data science pipelines. Kedro assists you in
creating data science workflows composed of reusable components, each with a “single
responsibility,” to speed up data pipelining, improve data science prototyping, and promote
pipeline reproducibility.
ZenML
ZenML is an extensible, open-source MLOps framework for building portable, production-ready machine learning pipelines from reusable steps.
Flyte
Flyte is a platform for orchestrating ML pipelines at scale. You can use Flyte for deployment,
maintenance, lifecycle management, version control, and training. You can also use it with
platforms like Feast, PyTorch, TensorFlow, and whylogs to do tasks for the whole model
lifecycle.
Kubeflow Pipelines
Kubeflow Pipelines is an orchestration tool for building and deploying portable, scalable, and
reproducible end-to-end machine learning workflows directly on Kubernetes clusters. You
can define Kubeflow Pipelines with the following steps:
Step 1: Write the code implementation for each component as an executable file/script, or reuse pre-built components.
Step 2: Define the pipeline by specifying how the components depend on and pass data to one another.
Step 3: Build and compile the workflow you have just defined.
Step 4: Step 3 produces a static YAML file that can be triggered to run the pipeline through the intuitive Python SDK for pipelines.
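A minimal sketch of those steps, assuming the Kubeflow Pipelines v2 Python SDK (kfp); the component, pipeline name, and file name are made up for illustration:

from kfp import dsl, compiler

# Step 1: a lightweight Python component (hypothetical example)
@dsl.component
def add(a: float, b: float) -> float:
    return a + b

# Step 2: define the pipeline by wiring components together
@dsl.pipeline(name='add-pipeline')
def add_pipeline(x: float = 1.0, y: float = 2.0):
    first = add(a=x, b=y)
    add(a=first.output, b=3.0)

# Steps 3-4: compile to a static YAML file that can be submitted to a cluster
compiler.Compiler().compile(add_pipeline, 'add_pipeline.yaml')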
Kubeflow is notably complex and has slow development iteration cycles, so other Kubernetes-based platforms like Flyte are making it easier to build pipelines. Running Kubeflow on a cloud-managed Kubernetes service such as Google Kubernetes Engine (GKE) can, however, make deployment easier.
Working through Azure MLOps projects built on Azure Machine Learning Pipelines is a good way to gain hands-on experience.
AWS machine learning data pipelines allow businesses to develop, test, and deploy Machine
Learning models at volume. Data transformation, feature extraction, data retrieval, model
assessment and evaluation, and model deployment are all part of this process.
It has been mentioned several times that machine learning implementation goes through an iterative cycle: each step of the entire ML cycle is visited again and again.
What makes the ML cycle iterative? Why does it become necessary to perform the same steps repeatedly?
The answer lies in the nature of the problems that machine learning is trying to solve. Machine learning as a field has no fixed boundaries at this point in time; it is an evolving field of technology. As new algorithms are developed, new opportunities are found in the real world to apply them, and vice versa. Machine learning is primarily used as a solution in areas where traditional programming ceases to offer a viable solution, for example:
Problems too complex to code by hand, such as face recognition, or text extraction and understanding from a variety of documents in different languages.
Problems with extremely large amounts of data, such as stock market prediction, or government agencies trying to get meaningful insights about the population from census data.
Learning is an iterative process. Even when an infant learns to walk, it has to go through the same process of walking, falling, standing, walking, and balancing again and again until it achieves a certain degree of confidence to walk and run independently.
The same fundamental concept applies to Machine learning as well where it goes through the
Machine learning cycle repeatedly until desired confidence level is achieved.
Going through this cycle is necessary to ensure that Machine learning model is capturing the
patterns, characteristics and inter-dependencies from the given data.
A machine learning solution is only as good as the data that drives it.
You are never guaranteed a perfect model (there is no perfect model, only a well-generalized one) until it has gone through a significant number of iterations to learn the various scenarios present in the data.
In fact, with this iterative process you are trying to obtain the best model, one that performs equally well on unseen data. The term "best" is measured through various metrics depending on the problem at hand: for example, a prediction problem with a continuous output variable can be measured with RMSE (root mean squared error) or R² (R-squared), whereas for a classification problem the confusion matrix is better suited to measure model performance.
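The metrics mentioned here are one-liners in scikit-learn; a tiny sketch with made-up values, purely to show the calls:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, confusion_matrix

# regression example (illustrative values only)
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.3, 2.9, 6.5])
print('RMSE:', np.sqrt(mean_squared_error(y_true, y_pred)))
print('R^2 :', r2_score(y_true, y_pred))

# classification example (illustrative values only)
print(confusion_matrix([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))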
Machine Learning Cycle (Design > Development > Deployment > Optimize)
Summary
To achieve the desired ML model performance, having high quality data is a crucial requirement. However, for most real-world problems there are primarily three reasons that impose challenges on the implementation of any ML model:
Implementation | Integration | Data Quality
Hence, it becomes necessary to go through the ML pipeline steps mentioned below repeatedly.
Build
1. Collect and prepare training data — involves data collection or data engineering followed by EDA (exploratory data analysis)
2. Data Pre-processing and Feature Engineering
3. Choose Algorithm
Train
1. Model training
2. Hyperparameter optimization
3. Manage training requirements
4. Model Evaluation, Tuning and debugging
Deployment
1. Deploy model in production
2. Address scalability requirements
3. Monitor quality, detect drift and retrain if required.
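Putting the Build, Train, and Deployment steps together, a compact sketch of an end-to-end scikit-learn pipeline with hyperparameter optimization and model persistence; the dataset, model, parameter grid, and file name are illustrative choices:

import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Build: prepare data and choose an algorithm
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])

# Train: hyperparameter optimization with cross-validation
grid = GridSearchCV(pipe, {'clf__C': [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)

# Evaluate, then persist the best model for deployment
print('test accuracy:', grid.score(X_test, y_test))
joblib.dump(grid.best_estimator_, 'model.joblib')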