
Machine Learning

Unit: 1
Topics
1. Intro to ML
2. ML Life Cycle
3. Types of ML
3.1 Supervised and Unsupervised
3.2 Batch and Online
3.3 Instance-Based and Model-Based
4. Scope and Limitations
5. Challenges of ML
6. Data Visualization
7. Hypothesis Function and Testing
8. Data Pre-processing
9. Data Augmentation
10. Normalizing Data Set
11. Bias-Variance Tradeoff
12. Relation Between AI, ML, DP and DS
ML Life Cycle
Machine learning gives computer systems the ability to learn automatically without being explicitly programmed. But how does a machine learning system work? It can be described using the machine learning life cycle, a cyclic process for building an efficient machine learning project. The main purpose of the life cycle is to find a solution to the problem at hand.
The machine learning life cycle involves seven major steps, which are given below:
o Gathering Data

o Data preparation

o Data Wrangling

o Analyse Data

o Train the model

o Test the model

o Deployment
The most important thing in the complete process is to understand the problem and to know its purpose. Therefore, before starting the life cycle, we need to understand the problem, because a good result depends on a good understanding of the problem.
In the complete life cycle process, to solve a problem we create a machine learning system called a "model", and this model is created by providing "training". But to train a model we need data; hence, the life cycle starts with collecting data.

1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify the different data sources and collect the data, as data can be gathered from various sources such as files, databases, the internet, or mobile devices. It is one of the most important steps of the life cycle, because the quantity and quality of the collected data determine the quality of the output: in general, the more data we have, the more accurate the prediction can be.
This step includes the tasks below:
o Identify various data sources
o Collect data
o Integrate the data obtained from different sources
By performing the above tasks, we get a coherent set of data, also called a dataset. It will be used in the further steps.
2. Data Preparation
After collecting the data, we need to prepare it for the further steps. Data preparation is the step where we put our data into a suitable place and prepare it for use in machine learning training.
In this step, we first put all the data together and then randomize the ordering of the data.
This step can be further divided into two processes:
o Data exploration:
It is used to understand the nature of the data that we have to work with. We need to understand the characteristics, format, and quality of the data. A better understanding of the data leads to an effective outcome. Here we look for correlations, general trends, and outliers.
o Data pre-processing:
The next step is pre-processing of the data for its analysis, as illustrated in the sketch below.
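This minimal sketch assumes pandas and a hypothetical file "data.csv"; the exploration calls mirror the two processes above.

# A minimal data preparation sketch; the file name is a placeholder.
import pandas as pd

df = pd.read_csv("data.csv")                 # put all the data together
df = df.sample(frac=1, random_state=42)      # randomize the ordering of the data
df = df.reset_index(drop=True)

# Data exploration: characteristics, format, and quality
print(df.head())                             # format of the data
print(df.describe())                         # general trends and ranges
print(df.corr(numeric_only=True))            # correlations between numeric columns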

3. Data Wrangling
Data wrangling is the process of cleaning raw data and converting it into a usable format. It involves cleaning the data, selecting the variables to use, and transforming the data into a proper format to make it more suitable for analysis in the next step. It is one of the most important steps of the complete process, because cleaning the data is required to address quality issues.
The data we collect is not always useful to us, as some of it may be irrelevant. In real-world applications, collected data may have various issues, including:
o Missing values
o Duplicate data
o Invalid data
o Noise
So, we use various filtering techniques to clean the data. It is mandatory to detect and remove these issues, because they can negatively affect the quality of the outcome.
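To make the cleaning step concrete, here is a hedged sketch of handling the four issues above with pandas; the column names ("age", "income") and the clipping thresholds are hypothetical.

# A minimal data-wrangling sketch; columns and limits are made up.
import pandas as pd

df = pd.read_csv("data.csv")
df = df.drop_duplicates()                          # duplicate data
df["age"] = df["age"].fillna(df["age"].median())   # missing values
df = df[df["age"].between(0, 120)]                 # invalid data
# Noise: clip extreme outliers to the 1st/99th percentiles
low, high = df["income"].quantile([0.01, 0.99])
df["income"] = df["income"].clip(low, high)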
4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step involves:
o Selection of analytical techniques
o Building models
o Reviewing the result
The aim of this step is to build a machine learning model that analyzes the data using various analytical techniques, and to review the outcome. It starts with determining the type of problem, where we select a machine learning technique such as Classification, Regression, Cluster analysis, Association, etc.; we then build the model using the prepared data and evaluate it.
Hence, in this step, we take the data and use machine learning algorithms to build the model.

5. Train Model
The next step is to train the model. In this step, we train the model to improve its performance and get a better outcome for the problem.
We use datasets to train the model using various machine learning algorithms. Training a model is required so that it can learn the various patterns, rules, and features in the data.

6. Test Model
Once our machine learning model has been trained on a given dataset, we test the model. In this step, we check the accuracy of the model by providing it with a test dataset. Testing the model determines the percentage accuracy of the model as per the requirements of the project or problem.
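The following is a minimal sketch of the train and test steps with scikit-learn; the synthetic dataset and the choice of a decision tree are assumptions made purely for illustration.

# A minimal train/test sketch on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)                 # step 5: train the model

y_pred = model.predict(X_test)              # step 6: test the model
print("Accuracy:", accuracy_score(y_test, y_pred))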

7. Deployment
The last step of the machine learning life cycle is deployment, where we deploy the model in a real-world system.
If the prepared model produces accurate results as per our requirements, with acceptable speed, then we deploy it in the real system. But before deploying the project, we check whether the model keeps improving its performance using the available data. The deployment phase is similar to making the final report for a project.
Types of ML
Machine learning is a subset of AI, which
enables the machine to automatically learn
from data, improve performance from past
experiences, and make predictions. Machine
learning contains a set of algorithms that work on a huge amount of data. Data is fed to these algorithms to train them, and on the basis of that training, they build a model and perform a specific task.

Supervised Machine Learning
Supervised learning is the type of machine learning in which machines are trained using well-"labelled" training data, and on the basis of that data, machines predict the output. Labelled data means input data that is already tagged with the correct output.
In supervised learning, the training data provided to the machine works as a supervisor that teaches the machine to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
Supervised learning is the process of providing input data as well as correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function to map the input variable (x) to the output variable (y).
Supervised learning: Supervised learning is the learning of a model from an input variable (say, X) and an output variable (say, Y), using an algorithm to map the input to the output. That is, Y = f(X).
In the real world, supervised learning can be used for risk assessment, image classification, fraud detection, spam filtering, etc.
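As a concrete illustration of learning the mapping Y = f(X), here is a minimal linear regression sketch; the synthetic data (generated from f(x) = 3x + 2) is invented for the example.

# Learning Y = f(X) from labelled examples; data is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(10).reshape(-1, 1)     # input variable (x)
y = 3 * X.ravel() + 2                # output variable (y): here f(x) = 3x + 2

model = LinearRegression()
model.fit(X, y)                      # the labelled data acts as the supervisor
print(model.predict([[12]]))         # ~38: the learned f applied to a new input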
How Does Supervised Learning Work?
In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on held-out test data and predicts the output.
The working of supervised learning can be easily understood by the following example:
Suppose we have a dataset of different types of shapes, including squares, rectangles, triangles, and polygons. The first step is to train the model on each shape:
o If the given shape has four sides, and all the sides are equal, then it will be labelled as a Square.
o If the given shape has three sides, then it will be labelled as a Triangle.
o If the given shape has six equal sides, then it will be labelled as a Hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the shape. The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of its number of sides and predicts the output.
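The toy sketch below mirrors the shape rules described above; the feature encoding (number of sides, whether all sides are equal) is an assumption chosen for illustration.

# A toy shape classifier; features and labels are invented.
from sklearn.tree import DecisionTreeClassifier

# Each shape is encoded as [number_of_sides, all_sides_equal (1/0)]
X_train = [[4, 1], [4, 0], [3, 1], [3, 0], [6, 1]]
y_train = ["square", "rectangle", "triangle", "triangle", "hexagon"]

clf = DecisionTreeClassifier().fit(X_train, y_train)
print(clf.predict([[4, 1]]))   # a new four-sided shape with equal sides -> 'square'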
Steps Involved in Supervised Learning:
o First, determine the type of training dataset.
o Collect/gather the labelled training data.
o Split the dataset into a training dataset, a test dataset, and a validation dataset (a split of this kind is sketched after this list).
o Determine the input features of the training dataset, which should carry enough information for the model to accurately predict the output.
o Determine a suitable algorithm for the model, such as a support vector machine, decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes we need validation sets as control parameters; these are subsets of the training dataset.
o Evaluate the accuracy of the model by providing it with the test set. If the model predicts the correct outputs, then our model is accurate.
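Here is a minimal sketch of the split step mentioned above, producing training, validation, and test sets; the 60/20/20 ratios and the synthetic data are assumptions for illustration.

# Splitting data into training, validation, and test sets (60/20/20).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)

# First carve out the test set, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 300 100 100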
Types of Supervised Machine Learning Algorithms:
Supervised learning can be further divided
into two types of problems:

1. Regression
Regression algorithms are used if there is a
relationship between the input variable and the
output variable. It is used for the prediction of
continuous variables, such as Weather
forecasting, Market Trends, etc. Below are
some popular Regression algorithms which
come under supervised learning:
o Linear Regression

o Regression Trees

o Non-Linear Regression

o Bayesian Linear Regression

o Polynomial Regression

2. Classification
Classification algorithms are used when the output variable is categorical, meaning there are two or more classes such as Yes-No, Male-Female, True-False, etc. A typical application is spam filtering. Below are some popular Classification algorithms which come under supervised learning:
o Random Forest

o Decision Trees

o Logistic Regression

o Support vector Machines


Note: We will discuss these algorithms in
detail in later chapters.

Advantages of Supervised Learning:
o With the help of supervised learning, the model can predict the output on the basis of prior experience.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning models help us solve various real-world problems, such as fraud detection, spam filtering, etc.
Disadvantages of Supervised Learning:
o Supervised learning models are not suitable for handling complex tasks.
o Supervised learning cannot predict the correct output if the test data is different from the training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need sufficient knowledge about the classes of objects.

Unsupervised Machine Learning
In the previous topic, we learned about supervised machine learning, in which models are trained using labeled data. But there may be many cases in which we do not have labeled data and need to find the hidden patterns in the given dataset. To solve such cases, we need unsupervised learning techniques.
Unsupervised Learning: Unsupervised learning is where only the input data (say, X) is present and there is no corresponding output variable.

What is Unsupervised Learning?
As the name suggests, unsupervised learning is a machine learning technique in which models are not supervised using a labelled training dataset. Instead, the models themselves find the hidden patterns and insights in the given data. It can be compared to the learning that takes place in the human brain while learning new things. It can be defined as:
Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision.
Unsupervised learning cannot be directly
applied to a regression or classification
problem because unlike supervised learning,
we have the input data but no corresponding
output data. The goal of unsupervised learning
is to find the underlying structure of
dataset, group that data according to
similarities, and represent that dataset in a
compressed format.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs. The algorithm is never trained on the given dataset, which means it has no idea about the features of the dataset. The task of the unsupervised learning algorithm is to identify the image features on its own. It will perform this task by clustering the image dataset into groups according to the similarities between images.
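As a hedged sketch of this idea, the code below clusters unlabeled points with k-means; the synthetic blobs stand in for image features, which in practice would come from a feature-extraction step.

# Clustering unlabeled data with k-means; data is synthetic.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=2, random_state=42)  # true labels ignored

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
groups = kmeans.fit_predict(X)        # each point is assigned to a cluster
print(groups[:10])                    # cluster ids discovered without any labels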
Why Use Unsupervised Learning?
Below are some main reasons which describe the importance of unsupervised learning:
o Unsupervised learning is helpful for finding useful insights in the data.
o Unsupervised learning is similar to how a human learns to think through their own experiences, which makes it closer to true AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes it all the more important.
o In the real world, we do not always have input data with corresponding output, so to solve such cases we need unsupervised learning.
Working of Unsupervised Learning
In unsupervised learning, we take unlabeled input data, meaning it is not categorized and corresponding outputs are not given. This unlabeled input data is fed to the machine learning model in order to train it. The model first interprets the raw data to find the hidden patterns in it, and then a suitable algorithm such as k-means clustering or the Apriori algorithm is applied.
Once the suitable algorithm is applied, it divides the data objects into groups according to the similarities and differences between the objects.
Types of Unsupervised Learning Algorithm:
The unsupervised learning algorithms can be further categorized into two types of problems:
o Clustering: Clustering is a method of grouping objects into clusters such that the objects with the most similarities remain in a group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between data objects and categorizes them as per the presence and absence of those commonalities.
o Association: An association rule is an unsupervised learning method used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategies more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of association rules is Market Basket Analysis (a toy sketch follows after this list).
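The toy sketch below illustrates the association idea by simply counting item co-occurrences; the transactions are invented, and a real system would run an algorithm such as Apriori over a large transaction database.

# Counting item co-occurrence as a toy stand-in for association mining.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "butter"},
    {"bread", "jam"},
    {"bread", "butter", "jam"},
    {"milk", "bread"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of {bread, butter}: the fraction of baskets containing both.
print(pair_counts[("bread", "butter")] / len(transactions))   # 0.5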
Note: We will learn these algorithms in
later chapters.

Unsupervised Learning Algorithms:
Below is a list of some popular unsupervised learning algorithms:
o K-means clustering
o KNN (k-nearest neighbours)
o Hierarchical clustering
o Anomaly detection
o Neural networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition


Advantages of Unsupervised Learning
o Unsupervised learning is used for more
complex tasks as compared to supervised
learning because, in unsupervised learning,
we don't have labeled input data.
o Unsupervised learning is preferable as it is
easy to get unlabeled data in comparison to
labeled data.
Disadvantages of Unsupervised Learning
o Unsupervised learning is intrinsically more
difficult than supervised learning as it does
not have corresponding output.
o The result of the unsupervised learning
algorithm might be less accurate as input
data is not labeled, and algorithms do not
know the exact output in advance.
Difference between Supervised and Unsupervised Learning
o Supervised learning algorithms are trained using labelled data; unsupervised learning algorithms are trained using unlabelled data.
o A supervised learning model takes direct feedback to check whether it is predicting the correct output or not; an unsupervised learning model takes no feedback.
o A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in the data.
o In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided.
o The goal of supervised learning is to train the model so that it can predict the output when given new data; the goal of unsupervised learning is to find the hidden patterns and useful insights in the data.
o Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision.
o Supervised learning can be used for Classification and Regression problems; unsupervised learning can be used for Clustering and Association problems.
o Supervised learning is used for cases where we know the input as well as the corresponding outputs; unsupervised learning is used for cases where we have only input data and no corresponding output data.
o The number of classes is known in supervised learning; it is not known in unsupervised learning.
o A supervised learning model generally produces accurate results, since it is trained with labelled data; the results of unsupervised learning may be less accurate, as the input data is not labelled.
o Computational complexity is higher in supervised learning compared to unsupervised learning.
o Supervised learning typically uses offline analysis of data.
o Supervised learning is not close to true Artificial Intelligence, as we first train the model for each piece of data and only then can it predict the correct output; unsupervised learning is closer to true AI, as it learns from data in a way similar to how a child learns from experience.
o Supervised learning includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision Tree, Bayesian Logic, etc.; unsupervised learning includes algorithms such as K-means Clustering, Hierarchical Clustering, and the Apriori algorithm.
o Applications of supervised learning include fraud detection, handwriting detection, pattern recognition, speech recognition, etc.
Regression vs. Classification in Machine Learning
Regression and Classification algorithms are both Supervised Learning algorithms. Both are used for prediction in machine learning and work with labeled datasets. The difference between them lies in how they are applied to different machine learning problems.
The main difference between Regression and Classification algorithms is that Regression algorithms are used to predict continuous values such as price, salary, age, etc., while Classification algorithms are used to predict/classify discrete values such as Male or Female, True or False, Spam or Not Spam, etc.
Classification:
Classification is the process of finding a function which helps in dividing the dataset into classes based on different parameters. In classification, a computer program is trained on the training dataset, and based on that training it categorizes the data into different classes. The task of the classification algorithm is to find the mapping function that maps the input (x) to the discrete output (y).
Example: The best example for understanding the classification problem is email spam detection. The model is trained on the basis of millions of emails and various parameters, and whenever it receives a new email, it identifies whether the email is spam or not. If the email is spam, it is moved to the Spam folder.
Types of ML Classification Algorithms:
Classification algorithms can be further divided into the following types:
o Logistic Regression
o K-Nearest Neighbours
o Support Vector Machines
o Kernel SVM
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification
Regression:
Regression is the process of finding the correlations between dependent and independent variables. It helps in predicting continuous variables, such as market trends, house prices, etc. The task of the regression algorithm is to find the mapping function that maps the input variable (x) to the continuous output variable (y).
Example: Suppose we want to do weather forecasting; for this, we would use a regression algorithm. In weather prediction, the model is trained on past data, and once training is completed, it can easily predict the weather for future days.
Types of Regression Algorithms:
o Simple Linear Regression
o Multiple Linear Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
Difference between Regression and Classification
o In Regression, the output variable must be of a continuous nature or a real value; in Classification, the output variable must be a discrete value.
o The task of the regression algorithm is to map the input value (x) to the continuous output variable (y); the task of the classification algorithm is to map the input value (x) to the discrete output variable (y).
o Regression algorithms are used with continuous data; classification algorithms are used with discrete data.
o In Regression, we try to find the best-fit line, which can predict the output more accurately; in Classification, we try to find the decision boundary, which divides the dataset into different classes.
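To make the contrast concrete, here is a minimal side-by-side sketch: a regression model outputs a real value, while a classification model outputs a discrete class. The synthetic data and the choice of linear/logistic regression are assumptions for illustration.

# Regression predicts a continuous value; classification predicts a class.
from sklearn.datasets import make_regression, make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression

Xr, yr = make_regression(n_samples=100, n_features=2, random_state=0)
reg = LinearRegression().fit(Xr, yr)
print(reg.predict(Xr[:1]))        # a real-valued output, e.g. a price

Xc, yc = make_classification(n_samples=100, n_features=4, random_state=0)
clf = LogisticRegression().fit(Xc, yc)
print(clf.predict(Xc[:1]))        # a discrete output, e.g. class 0 or 1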
Batch Learning and Online Learning
Batch learning, also known as offline learning, involves training a model on a fixed dataset, or batch of data, all at once. The model is trained on the entire dataset and then used to make predictions on new data. This means that batch learning requires a complete dataset before training can begin, and the model cannot be updated once it has been trained without retraining the entire model. Batch learning is commonly used in situations where the dataset is relatively small and can be processed quickly.
Batch learning represents the training
of machine learning models in a batch
manner. In other words, batch learning
represents the training of the models at
regular intervals such as weekly, bi-
weekly, monthly, quarterly, etc. In
batch learning, the system is not
capable of learning incrementally. The
models must be trained using all the
available data every single time. The
data gets accumulated over a period of
time. The models then get trained with
the accumulated data from time to time
at periodic intervals. This model
training takes a lot of time and
computing resources. Hence, it is
typically done offline. After the models
are trained, they are launched into
production and they run without
learning anymore. Batch learning is
also called offline learning. The
models trained using batch or offline
learning are moved into production
only at regular intervals based on the
performance of models trained with
new data.
Building offline models, or models trained in a batch manner, requires training the models with the entire training data set. Improving the model performance would require retraining all over again with the entire training data set. These models are static in nature, which means that once they are trained, their performance will not improve until a new model is retrained. The model's performance tends to decay slowly over time, simply because the world continues to evolve while the model remains unchanged. This phenomenon is often called model rot or data drift. The solution is to regularly retrain the model on up-to-date data. Offline models, or models trained using batch learning, are deployed in the production environment by replacing the old model with the newly trained model.
For the model to learn about new data, the model needs to be trained with all the data from scratch, and the old model then needs to be replaced with the new one. As part of batch learning, the whole process of training, evaluating, and launching a machine learning system can be automated. Model training using the full set of data can take many hours, so the batch is typically run infrequently (e.g., every 24 hours or even just weekly), since training on the full set of data requires a lot of computing resources (CPU, memory space, disk space, disk I/O, network I/O, etc.).
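A minimal sketch of such periodic retraining follows; the synthetic "old" and "new" data simulate accumulation between training runs, and the model choice is an assumption for illustration.

# Batch (offline) learning: retrain from scratch on ALL accumulated data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_old, y_old = make_classification(n_samples=500, random_state=0)
X_new, y_new = make_classification(n_samples=100, random_state=1)  # data since last run

# Batch learning cannot update incrementally: combine everything and retrain.
X_all = np.vstack([X_old, X_new])
y_all = np.concatenate([y_old, y_new])
new_model = LogisticRegression(max_iter=1000).fit(X_all, y_all)
# ...the old production model would then be replaced by new_model.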
There can be various reasons why we might choose to adopt batch learning for training the models. Some of these reasons are the following:
 The business requirements do not require frequent retraining of the models.
 The data distribution is not expected to change frequently; therefore, batch learning is suitable.
 The software systems (big data) required for real-time learning are not available, for various reasons including cost.
 The expertise required for creating an incremental learning system is not available.
If models trained using batch learning need to learn about new data, they must be retrained using the new dataset and then swapped in for the model already in production, based on criteria such as model performance. The whole process of batch learning can be automated as well. The disadvantage of batch learning is that it takes a lot of time and resources to retrain the model.
Whether a model should be retrained in a batch manner can be decided based on model performance. Red-amber-green statuses can be used to indicate the health of the model based on its prediction accuracy or error rates; accordingly, the models can be chosen to be retrained or not.
The following stakeholders can be
involved in reviewing the model
performance and leveraging batch
learning:
 Business/product owners
 Product managers
 Data science architects
 Data scientists
 ML engineers
On the other hand, online learning, also known as incremental learning or streaming learning, involves training a model on new data as it arrives, one observation (or mini-batch) at a time. The model is updated each time a new observation is received, allowing it to adapt to changes in the data over time. Online learning is commonly used in situations where the data is too large to be processed all at once, or where the data is constantly changing, as with stock market data or social media data.
In online learning, training happens incrementally by continuously feeding in data as it arrives, or in small groups (mini-batches). Each learning step is fast and cheap, so the system can learn about new data on the fly as it arrives. Online learning is great for machine learning systems that receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly or autonomously. It is also a good option if you have limited computing resources: once an online learning system has learned about new data instances, it does not need them anymore, so you can discard them (unless you want to be able to roll back to a previous state and "replay" the data) or move the data to another form of storage (warm or cold storage) if you are using a data lake. This can save a huge amount of space and cost.

[Fig 1. Online learning – Machine Learning System]

Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine's main memory (this is also called out-of-core learning). The algorithm loads part of the data, runs a training step on that data, and repeats the process until it has run on all of the data.
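The sketch below shows incremental training with scikit-learn's SGDClassifier, whose partial_fit method supports exactly this mini-batch, out-of-core style; the stream is simulated with synthetic data.

# Online / out-of-core learning with partial_fit; data is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, random_state=42)
classes = np.unique(y)                   # all classes must be declared up front

model = SGDClassifier(random_state=42)
for i in range(0, len(X), 100):          # feed the "stream" in mini-batches
    model.partial_fit(X[i:i+100], y[i:i+100], classes=classes)
    # each step is fast and cheap; the mini-batch can be discarded afterwards

print(model.predict(X[:5]))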
One of the key aspects of online learning is the learning rate: the rate at which the system adapts to new data. A system with a high learning rate will rapidly adapt to new data but will also tend to quickly forget what it learned from old data. A system with a low learning rate will behave more like batch learning.
One of the big disadvantages of an online learning system is that if it is fed bad data, the system's performance will degrade and users will see the impact instantly. Thus, it is very important to put an appropriate data governance strategy in place to ensure that the data being fed is of high quality. In addition, it is very important to monitor the performance of the machine learning system very closely.
Data governance needs to be put in place
across different levels such as the following
when choosing to go with online learning:
 Data ingestion
 ETL pipelines
 Feature extraction
 Predictions
The following are some of the challenges of adopting an online learning method:
 Data governance
 Model governance, including appropriate algorithm and model selection on the fly
Online models require only a single deployment in the production setting, and they evolve over a period of time. The disadvantage of online models is that they do not have the entire dataset available for training. The models are trained incrementally based on assumptions made from the available data, and those assumptions can at times be sub-optimal.
Instance-Based Learning
Machine learning systems categorized as instance-based learning are systems that learn the training examples by heart and then generalize to new instances based on some similarity measure. The approach is called instance-based because it builds its hypotheses from the training instances themselves. It is also known as memory-based learning or lazy learning (because processing is delayed until a new instance must be classified). The time complexity of this approach depends on the size of the training data: each time a new query is encountered, the previously stored data is examined and a target function value is assigned to the new instance. The worst-case time complexity is O(n), where n is the number of training instances.
For example, if we were to create a spam filter with an instance-based learning algorithm, instead of just flagging emails that are already marked as spam, our spam filter would be programmed to also flag emails that are very similar to them. This requires a measure of resemblance between two emails; a similarity measure could be having the same sender, repetitive use of the same keywords, or something else.

What is instance-based learning & how does it work?
Instance-based learning (also known as memory-based learning or lazy learning) involves memorizing the training data in order to make predictions about future data points. This approach doesn't require any prior knowledge or assumptions about the data, which makes it easy to implement and understand. However, it can be computationally expensive, since all of the training data needs to be stored in memory before making a prediction. Additionally, this approach doesn't always generalize well to unseen data, because its predictions are based on memorized examples rather than a learned model.
In instance-based learning, the system learns the training data by heart. At prediction time, the system uses a similarity measure to compare new cases with the learned data. K-nearest neighbours (KNN) is an algorithm that belongs to the instance-based learning class of algorithms. KNN is a non-parametric algorithm because it does not assume any specific form or underlying structure in the data. Instead, it relies on a measure of similarity between each pair of data points. Generally speaking, this measure is based on either Euclidean distance or cosine similarity; however, other metrics can be used depending on the type of data being analyzed. Once the similarity between points is calculated, KNN looks at the neighbours closest to a new point and uses these neighbours as examples to make its prediction. This means that instead of creating a generalizable model from all of the data, KNN looks for similarities among individual data points and makes predictions accordingly. For example, a new instance would be predicted as a triangle if the majority of its nearest neighbours are triangles.
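Here is a minimal KNN sketch of that idea; the two-dimensional points and class names are invented for illustration.

# Instance-based learning with k-nearest neighbours; data is invented.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 1], [1, 2], [2, 1],       # stored instances of class "triangle"
           [8, 8], [8, 9], [9, 8]]       # stored instances of class "circle"
y_train = ["triangle"] * 3 + ["circle"] * 3

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)                # "training" just memorizes the instances
print(knn.predict([[2, 2]]))             # -> ['triangle'], by majority of neighbours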
In addition to providing accurate predictions, one major advantage of KNN over other supervised learning algorithms is its versatility: KNN can be used with both numeric targets, such as predicting house prices, and categorical targets, such as predicting whether a website visitor will purchase a product or not. Furthermore, KNN fits no model parameters, since it does not assume any underlying structure in the data; its behaviour depends only on the choice of k and on how the closeness of two points is measured.
Because KNN is an instance-based learning
algorithm, it is not suitable for very large
datasets. This is because the model has to store
all of the training examples in memory, and
making predictions on new data points
involves comparing the new point to all of the
stored training examples. However, for small
or medium-sized datasets, KNN can be a very
effective learning algorithm.
Other instance-based learning algorithms
include learning vector quantization
(LVQ) and self-organizing maps (SOMs).
These algorithms also memorize the training
examples and use them to make predictions on
new data, but they use different techniques to
do so.

Advantages:
1. Instead of estimating the target function over the entire instance space, local approximations can be made to the target function.
2. This approach can adapt easily to new data, which can be collected as we go.
Disadvantages:
1. Classification costs are high.
2. A large amount of memory is required to store the data, and each query involves building a local model from scratch.
Some of the instance-based learning algorithms are:
1. K Nearest Neighbor (KNN)
2. Self-Organizing Map (SOM)
3. Learning Vector Quantization (LVQ)
4. Locally Weighted Learning (LWL)
5. Case-Based Reasoning
