
Machine Learning

Unit: 1
Topics
1. Intro to ML
2. ML Life Cycle
3. Types of ML
3.1 Supervised and Unsupervised
3.2 Batch and Online
3.3 Instance-Based and Model-Based
4. Scope and Limitations
5. Challenges of ML
6. Data Visualization
7. Hypothesis Function and Testing
8. Data Pre-processing
9. Data Augmentation
10. Normalizing Data Set
11. Bias-Variance Tradeoff
12. Relation Between AI, ML, DP and DS
ML Life Cycle
Machine learning gives computer systems the ability to learn automatically without being explicitly programmed. But how does a machine learning system work? It can be described using the machine learning life cycle, a cyclic process for building an efficient machine learning project. The main purpose of the life cycle is to find a solution to the problem at hand.
The machine learning life cycle involves seven major steps, which are given below:
o Gathering Data

o Data preparation

o Data Wrangling

o Analyse Data

o Train the model

o Test the model

o Deployment
The most important thing in the complete process is to understand the problem and to know its purpose. Therefore, before starting the life cycle, we need to understand the problem, because a good result depends on a good understanding of the problem.
In the complete life cycle process, to solve a problem we create a machine learning system called a "model", and this model is created by providing "training". But to train a model we need data; hence, the life cycle starts with collecting data.

1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify the different data sources and collect the data, as data can be gathered from various sources such as files, databases, the internet, or mobile devices. It is one of the most important steps of the life cycle, because the quantity and quality of the collected data determine the quality of the output: in general, the more data we have, the more accurate the prediction can be.
This step includes the tasks below:
o Identify various data sources
o Collect data
o Integrate the data obtained from different sources
By performing the above tasks, we get a coherent set of data, also called a dataset. It will be used in the further steps.
2. Data Preparation
After collecting the data, we need to prepare it for the further steps. Data preparation is the step where we put our data into a suitable place and prepare it for use in machine learning training.
In this step, we first put all the data together and then randomize the ordering of the data.
This step can be further divided into two processes:
o Data exploration:
It is used to understand the nature of the data that we have to work with. We need to understand the characteristics, format, and quality of the data. A better understanding of the data leads to an effective outcome. Here we look for correlations, general trends, and outliers.
o Data pre-processing:
The next step is pre-processing of the data for its analysis, as illustrated in the sketch below.
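This minimal sketch assumes pandas and a hypothetical file "data.csv"; the exploration calls mirror the two processes above.

# A minimal data preparation sketch; the file name is a placeholder.
import pandas as pd

df = pd.read_csv("data.csv")                 # put all the data together
df = df.sample(frac=1, random_state=42)      # randomize the ordering of the data
df = df.reset_index(drop=True)

# Data exploration: characteristics, format, and quality
print(df.head())                             # format of the data
print(df.describe())                         # general trends and ranges
print(df.corr(numeric_only=True))            # correlations between numeric columns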

3. Data Wrangling
Data wrangling is the process of cleaning raw data and converting it into a usable format. It involves cleaning the data, selecting the variables to use, and transforming the data into a proper format to make it more suitable for analysis in the next step. It is one of the most important steps of the complete process, because cleaning the data is required to address quality issues.
The data we collect is not always useful to us, as some of it may be irrelevant. In real-world applications, collected data may have various issues, including:
o Missing values
o Duplicate data
o Invalid data
o Noise
So, we use various filtering techniques to clean the data. It is mandatory to detect and remove these issues, because they can negatively affect the quality of the outcome.
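To make the cleaning step concrete, here is a hedged sketch of handling the four issues above with pandas; the column names ("age", "income") and the clipping thresholds are hypothetical.

# A minimal data-wrangling sketch; columns and limits are made up.
import pandas as pd

df = pd.read_csv("data.csv")
df = df.drop_duplicates()                          # duplicate data
df["age"] = df["age"].fillna(df["age"].median())   # missing values
df = df[df["age"].between(0, 120)]                 # invalid data
# Noise: clip extreme outliers to the 1st/99th percentiles
low, high = df["income"].quantile([0.01, 0.99])
df["income"] = df["income"].clip(low, high)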
4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step involves:
o Selection of analytical techniques
o Building models
o Reviewing the result
The aim of this step is to build a machine learning model that analyzes the data using various analytical techniques, and to review the outcome. It starts with determining the type of problem, where we select a machine learning technique such as Classification, Regression, Cluster analysis, Association, etc.; we then build the model using the prepared data and evaluate it.
Hence, in this step, we take the data and use machine learning algorithms to build the model.

5. Train Model
The next step is to train the model. In this step, we train the model to improve its performance and get a better outcome for the problem.
We use datasets to train the model using various machine learning algorithms. Training a model is required so that it can learn the various patterns, rules, and features in the data.

6. Test Model
Once our machine learning model has been trained on a given dataset, we test the model. In this step, we check the accuracy of the model by providing it with a test dataset. Testing the model determines the percentage accuracy of the model as per the requirements of the project or problem.
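The following is a minimal sketch of the train and test steps with scikit-learn; the synthetic dataset and the choice of a decision tree are assumptions made purely for illustration.

# A minimal train/test sketch on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)                 # step 5: train the model

y_pred = model.predict(X_test)              # step 6: test the model
print("Accuracy:", accuracy_score(y_test, y_pred))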

7. Deployment
The last step of the machine learning life cycle is deployment, where we deploy the model in a real-world system.
If the prepared model produces accurate results as per our requirements, with acceptable speed, then we deploy it in the real system. But before deploying the project, we check whether the model keeps improving its performance using the available data. The deployment phase is similar to making the final report for a project.
Types of ML
Machine learning is a subset of AI, which
enables the machine to automatically learn
from data, improve performance from past
experiences, and make predictions. Machine
learning contains a set of algorithms that work on a huge amount of data. Data is fed to these algorithms to train them, and on the basis of that training, they build a model and perform a specific task.

Supervised Machine Learning
Supervised learning is the type of machine learning in which machines are trained using well-"labelled" training data, and on the basis of that data, machines predict the output. Labelled data means input data that is already tagged with the correct output.
In supervised learning, the training data provided to the machine works as a supervisor that teaches the machine to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
Supervised learning is the process of providing input data as well as correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function to map the input variable (x) to the output variable (y).
Supervised learning: Supervised learning is the learning of a model from an input variable (say, X) and an output variable (say, Y), using an algorithm to map the input to the output. That is, Y = f(X).
In the real world, supervised learning can be used for risk assessment, image classification, fraud detection, spam filtering, etc.
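As a concrete illustration of learning the mapping Y = f(X), here is a minimal linear regression sketch; the synthetic data (generated from f(x) = 3x + 2) is invented for the example.

# Learning Y = f(X) from labelled examples; data is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(10).reshape(-1, 1)     # input variable (x)
y = 3 * X.ravel() + 2                # output variable (y): here f(x) = 3x + 2

model = LinearRegression()
model.fit(X, y)                      # the labelled data acts as the supervisor
print(model.predict([[12]]))         # ~38: the learned f applied to a new input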
How Does Supervised Learning Work?
In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on held-out test data and predicts the output.
The working of supervised learning can be easily understood by the following example:
Suppose we have a dataset of different types of shapes, including squares, rectangles, triangles, and polygons. The first step is to train the model on each shape:
o If the given shape has four sides, and all the sides are equal, then it will be labelled as a Square.
o If the given shape has three sides, then it will be labelled as a Triangle.
o If the given shape has six equal sides, then it will be labelled as a Hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the shape. The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of its number of sides and predicts the output.
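The toy sketch below mirrors the shape rules described above; the feature encoding (number of sides, whether all sides are equal) is an assumption chosen for illustration.

# A toy shape classifier; features and labels are invented.
from sklearn.tree import DecisionTreeClassifier

# Each shape is encoded as [number_of_sides, all_sides_equal (1/0)]
X_train = [[4, 1], [4, 0], [3, 1], [3, 0], [6, 1]]
y_train = ["square", "rectangle", "triangle", "triangle", "hexagon"]

clf = DecisionTreeClassifier().fit(X_train, y_train)
print(clf.predict([[4, 1]]))   # a new four-sided shape with equal sides -> 'square'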
Steps Involved in Supervised Learning:
o First, determine the type of training dataset.
o Collect/gather the labelled training data.
o Split the dataset into a training dataset, a test dataset, and a validation dataset (a split of this kind is sketched after this list).
o Determine the input features of the training dataset, which should carry enough information for the model to accurately predict the output.
o Determine a suitable algorithm for the model, such as a support vector machine, decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes we need validation sets as control parameters; these are subsets of the training dataset.
o Evaluate the accuracy of the model by providing it with the test set. If the model predicts the correct outputs, then our model is accurate.
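Here is a minimal sketch of the split step mentioned above, producing training, validation, and test sets; the 60/20/20 ratios and the synthetic data are assumptions for illustration.

# Splitting data into training, validation, and test sets (60/20/20).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)

# First carve out the test set, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 300 100 100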
Types of Supervised Machine Learning Algorithms:
Supervised learning can be further divided
into two types of problems:

1. Regression
Regression algorithms are used if there is a
relationship between the input variable and the
output variable. It is used for the prediction of
continuous variables, such as Weather
forecasting, Market Trends, etc. Below are
some popular Regression algorithms which
come under supervised learning:
o Linear Regression

o Regression Trees

o Non-Linear Regression

o Bayesian Linear Regression

o Polynomial Regression

2. Classification
Classification algorithms are used when the output variable is categorical, meaning there are two or more classes such as Yes-No, Male-Female, True-False, etc. A typical application is spam filtering. Below are some popular Classification algorithms which come under supervised learning:
o Random Forest

o Decision Trees

o Logistic Regression

o Support vector Machines


Note: We will discuss these algorithms in
detail in later chapters.

Advantages of Supervised Learning:
o With the help of supervised learning, the model can predict the output on the basis of prior experience.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning models help us solve various real-world problems, such as fraud detection, spam filtering, etc.
Disadvantages of Supervised Learning:
o Supervised learning models are not suitable for handling complex tasks.
o Supervised learning cannot predict the correct output if the test data is different from the training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need sufficient knowledge about the classes of objects.

Unsupervised Machine Learning
In the previous topic, we learned about supervised machine learning, in which models are trained using labeled data. But there may be many cases in which we do not have labeled data and need to find the hidden patterns in the given dataset. To solve such cases, we need unsupervised learning techniques.
Unsupervised Learning: Unsupervised learning is where only the input data (say, X) is present and there is no corresponding output variable.

What is Unsupervised Learning?
As the name suggests, unsupervised learning is a machine learning technique in which models are not supervised using a labelled training dataset. Instead, the models themselves find the hidden patterns and insights in the given data. It can be compared to the learning that takes place in the human brain while learning new things. It can be defined as:
Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision.
Unsupervised learning cannot be directly
applied to a regression or classification
problem because unlike supervised learning,
we have the input data but no corresponding
output data. The goal of unsupervised learning
is to find the underlying structure of
dataset, group that data according to
similarities, and represent that dataset in a
compressed format.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs. The algorithm is never trained on the given dataset, which means it has no idea about the features of the dataset. The task of the unsupervised learning algorithm is to identify the image features on its own. It will perform this task by clustering the image dataset into groups according to the similarities between images.
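As a hedged sketch of this idea, the code below clusters unlabeled points with k-means; the synthetic blobs stand in for image features, which in practice would come from a feature-extraction step.

# Clustering unlabeled data with k-means; data is synthetic.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=2, random_state=42)  # true labels ignored

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
groups = kmeans.fit_predict(X)        # each point is assigned to a cluster
print(groups[:10])                    # cluster ids discovered without any labels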
Why Use Unsupervised Learning?
Below are some main reasons which describe the importance of unsupervised learning:
o Unsupervised learning is helpful for finding useful insights in the data.
o Unsupervised learning is similar to how a human learns to think through their own experiences, which makes it closer to true AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes it all the more important.
o In the real world, we do not always have input data with corresponding output, so to solve such cases we need unsupervised learning.
Working of Unsupervised Learning
In unsupervised learning, we take unlabeled input data, meaning it is not categorized and corresponding outputs are not given. This unlabeled input data is fed to the machine learning model in order to train it. The model first interprets the raw data to find the hidden patterns in it, and then a suitable algorithm such as k-means clustering or the Apriori algorithm is applied.
Once the suitable algorithm is applied, it divides the data objects into groups according to the similarities and differences between the objects.
Types of Unsupervised Learning Algorithm:
The unsupervised learning algorithms can be further categorized into two types of problems:
o Clustering: Clustering is a method of grouping objects into clusters such that the objects with the most similarities remain in a group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between data objects and categorizes them as per the presence and absence of those commonalities.
o Association: An association rule is an unsupervised learning method used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategies more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of association rules is Market Basket Analysis (a toy sketch follows after this list).
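The toy sketch below illustrates the association idea by simply counting item co-occurrences; the transactions are invented, and a real system would run an algorithm such as Apriori over a large transaction database.

# Counting item co-occurrence as a toy stand-in for association mining.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "butter"},
    {"bread", "jam"},
    {"bread", "butter", "jam"},
    {"milk", "bread"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of {bread, butter}: the fraction of baskets containing both.
print(pair_counts[("bread", "butter")] / len(transactions))   # 0.5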
Note: We will learn these algorithms in
later chapters.

Unsupervised Learning Algorithms:
Below is a list of some popular unsupervised learning algorithms:
o K-means clustering
o KNN (k-nearest neighbours)
o Hierarchical clustering
o Anomaly detection
o Neural networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition


Advantages of Unsupervised Learning
o Unsupervised learning is used for more
complex tasks as compared to supervised
learning because, in unsupervised learning,
we don't have labeled input data.
o Unsupervised learning is preferable as it is
easy to get unlabeled data in comparison to
labeled data.
Disadvantages of Unsupervised Learning
o Unsupervised learning is intrinsically more
difficult than supervised learning as it does
not have corresponding output.
o The result of the unsupervised learning
algorithm might be less accurate as input
data is not labeled, and algorithms do not
know the exact output in advance.
Difference between Supervised and Unsupervised Learning
o Supervised learning algorithms are trained using labelled data; unsupervised learning algorithms are trained using unlabelled data.
o A supervised learning model takes direct feedback to check whether it is predicting the correct output or not; an unsupervised learning model takes no feedback.
o A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in the data.
o In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided.
o The goal of supervised learning is to train the model so that it can predict the output when given new data; the goal of unsupervised learning is to find the hidden patterns and useful insights in the data.
o Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision.
o Supervised learning can be used for Classification and Regression problems; unsupervised learning can be used for Clustering and Association problems.
o Supervised learning is used for cases where we know the input as well as the corresponding outputs; unsupervised learning is used for cases where we have only input data and no corresponding output data.
o The number of classes is known in supervised learning; it is not known in unsupervised learning.
o A supervised learning model generally produces accurate results, since it is trained with labelled data; the results of unsupervised learning may be less accurate, as the input data is not labelled.
o Computational complexity is higher in supervised learning compared to unsupervised learning.
o Supervised learning typically uses offline analysis of data.
o Supervised learning is not close to true Artificial Intelligence, as we first train the model for each piece of data and only then can it predict the correct output; unsupervised learning is closer to true AI, as it learns from data in a way similar to how a child learns from experience.
o Supervised learning includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision Tree, Bayesian Logic, etc.; unsupervised learning includes algorithms such as K-means Clustering, Hierarchical Clustering, and the Apriori algorithm.
o Applications of supervised learning include fraud detection, handwriting detection, pattern recognition, speech recognition, etc.
Regression vs. Classification in Machine Learning
Regression and Classification algorithms are both Supervised Learning algorithms. Both are used for prediction in machine learning and work with labeled datasets. The difference between them lies in how they are applied to different machine learning problems.
The main difference between Regression and Classification algorithms is that Regression algorithms are used to predict continuous values such as price, salary, age, etc., while Classification algorithms are used to predict/classify discrete values such as Male or Female, True or False, Spam or Not Spam, etc.
Classification:
Classification is the process of finding a function which helps in dividing the dataset into classes based on different parameters. In classification, a computer program is trained on the training dataset, and based on that training it categorizes the data into different classes. The task of the classification algorithm is to find the mapping function that maps the input (x) to the discrete output (y).
Example: The best example for understanding the classification problem is email spam detection. The model is trained on the basis of millions of emails and various parameters, and whenever it receives a new email, it identifies whether the email is spam or not. If the email is spam, it is moved to the Spam folder.
Types of ML Classification Algorithms:
Classification algorithms can be further divided into the following types:
o Logistic Regression
o K-Nearest Neighbours
o Support Vector Machines
o Kernel SVM
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification
Regression:
Regression is the process of finding the correlations between dependent and independent variables. It helps in predicting continuous variables, such as market trends, house prices, etc. The task of the regression algorithm is to find the mapping function that maps the input variable (x) to the continuous output variable (y).
Example: Suppose we want to do weather forecasting; for this, we would use a regression algorithm. In weather prediction, the model is trained on past data, and once training is completed, it can easily predict the weather for future days.
Types of Regression Algorithms:
o Simple Linear Regression
o Multiple Linear Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
Difference between Regression and Classification
o In Regression, the output variable must be of a continuous nature or a real value; in Classification, the output variable must be a discrete value.
o The task of the regression algorithm is to map the input value (x) to the continuous output variable (y); the task of the classification algorithm is to map the input value (x) to the discrete output variable (y).
o Regression algorithms are used with continuous data; classification algorithms are used with discrete data.
o In Regression, we try to find the best-fit line, which can predict the output more accurately; in Classification, we try to find the decision boundary, which divides the dataset into different classes.
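To make the contrast concrete, here is a minimal side-by-side sketch: a regression model outputs a real value, while a classification model outputs a discrete class. The synthetic data and the choice of linear/logistic regression are assumptions for illustration.

# Regression predicts a continuous value; classification predicts a class.
from sklearn.datasets import make_regression, make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression

Xr, yr = make_regression(n_samples=100, n_features=2, random_state=0)
reg = LinearRegression().fit(Xr, yr)
print(reg.predict(Xr[:1]))        # a real-valued output, e.g. a price

Xc, yc = make_classification(n_samples=100, n_features=4, random_state=0)
clf = LogisticRegression().fit(Xc, yc)
print(clf.predict(Xc[:1]))        # a discrete output, e.g. class 0 or 1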
Batch Learning and Online Learning
Batch learning, also known as offline learning, involves training a model on a fixed dataset, or batch of data, all at once. The model is trained on the entire dataset and then used to make predictions on new data. This means that batch learning requires a complete dataset before training can begin, and the model cannot be updated once it has been trained without retraining the entire model. Batch learning is commonly used in situations where the dataset is relatively small and can be processed quickly.
Batch learning represents the training
of machine learning models in a batch
manner. In other words, batch learning
represents the training of the models at
regular intervals such as weekly, bi-
weekly, monthly, quarterly, etc. In
batch learning, the system is not
capable of learning incrementally. The
models must be trained using all the
available data every single time. The
data gets accumulated over a period of
time. The models then get trained with
the accumulated data from time to time
at periodic intervals. This model
training takes a lot of time and
computing resources. Hence, it is
typically done offline. After the models
are trained, they are launched into
production and they run without
learning anymore. Batch learning is
also called offline learning. The
models trained using batch or offline
learning are moved into production
only at regular intervals based on the
performance of models trained with
new data.
Building offline models, or models trained in a batch manner, requires training the models with the entire training data set. Improving the model performance would require retraining all over again with the entire training data set. These models are static in nature, which means that once they are trained, their performance will not improve until a new model is retrained. The model's performance tends to decay slowly over time, simply because the world continues to evolve while the model remains unchanged. This phenomenon is often called model rot or data drift. The solution is to regularly retrain the model on up-to-date data. Offline models, or models trained using batch learning, are deployed in the production environment by replacing the old model with the newly trained model.
For the model to learn about new data, the model needs to be trained with all the data from scratch, and the old model then needs to be replaced with the new one. As part of batch learning, the whole process of training, evaluating, and launching a machine learning system can be automated. Model training using the full set of data can take many hours, so the batch is typically run infrequently (e.g., every 24 hours or even just weekly), since training on the full set of data requires a lot of computing resources (CPU, memory space, disk space, disk I/O, network I/O, etc.).
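A minimal sketch of such periodic retraining follows; the synthetic "old" and "new" data simulate accumulation between training runs, and the model choice is an assumption for illustration.

# Batch (offline) learning: retrain from scratch on ALL accumulated data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_old, y_old = make_classification(n_samples=500, random_state=0)
X_new, y_new = make_classification(n_samples=100, random_state=1)  # data since last run

# Batch learning cannot update incrementally: combine everything and retrain.
X_all = np.vstack([X_old, X_new])
y_all = np.concatenate([y_old, y_new])
new_model = LogisticRegression(max_iter=1000).fit(X_all, y_all)
# ...the old production model would then be replaced by new_model.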
There can be various reasons why we might choose to adopt batch learning for training the models. Some of these reasons are the following:
 The business requirements do not require frequent retraining of the models.
 The data distribution is not expected to change frequently; therefore, batch learning is suitable.
 The software systems (big data) required for real-time learning are not available, for various reasons including cost.
 The expertise required for creating an incremental learning system is not available.
If models trained using batch learning need to learn about new data, they must be retrained using the new dataset and then swapped in for the model already in production, based on criteria such as model performance. The whole process of batch learning can be automated as well. The disadvantage of batch learning is that it takes a lot of time and resources to retrain the model.
Whether a model should be retrained in a batch manner can be decided based on model performance. Red-amber-green statuses can be used to indicate the health of the model based on its prediction accuracy or error rates; accordingly, the models can be chosen to be retrained or not.
The following stakeholders can be
involved in reviewing the model
performance and leveraging batch
learning:
 Business/product owners
 Product managers
 Data science architects
 Data scientists
 ML engineers
On the other hand, online learning, also known as incremental learning or streaming learning, involves training a model on new data as it arrives, one observation (or mini-batch) at a time. The model is updated each time a new observation is received, allowing it to adapt to changes in the data over time. Online learning is commonly used in situations where the data is too large to be processed all at once, or where the data is constantly changing, as with stock market data or social media data.
In online learning, training happens incrementally by continuously feeding in data as it arrives, or in small groups (mini-batches). Each learning step is fast and cheap, so the system can learn about new data on the fly as it arrives. Online learning is great for machine learning systems that receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly or autonomously. It is also a good option if you have limited computing resources: once an online learning system has learned about new data instances, it does not need them anymore, so you can discard them (unless you want to be able to roll back to a previous state and "replay" the data) or move the data to another form of storage (warm or cold storage) if you are using a data lake. This can save a huge amount of space and cost.

[Fig 1. Online learning – Machine Learning System]

Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine's main memory (this is also called out-of-core learning). The algorithm loads part of the data, runs a training step on that data, and repeats the process until it has run on all of the data.
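The sketch below shows incremental training with scikit-learn's SGDClassifier, whose partial_fit method supports exactly this mini-batch, out-of-core style; the stream is simulated with synthetic data.

# Online / out-of-core learning with partial_fit; data is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, random_state=42)
classes = np.unique(y)                   # all classes must be declared up front

model = SGDClassifier(random_state=42)
for i in range(0, len(X), 100):          # feed the "stream" in mini-batches
    model.partial_fit(X[i:i+100], y[i:i+100], classes=classes)
    # each step is fast and cheap; the mini-batch can be discarded afterwards

print(model.predict(X[:5]))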
One of the key aspects of online learning is the learning rate: the rate at which the system adapts to new data. A system with a high learning rate will rapidly adapt to new data but will also tend to quickly forget what it learned from old data. A system with a low learning rate will behave more like batch learning.
One of the big disadvantages of an online learning system is that if it is fed bad data, the system's performance will degrade and users will see the impact instantly. Thus, it is very important to put an appropriate data governance strategy in place to ensure that the data being fed is of high quality. In addition, it is very important to monitor the performance of the machine learning system very closely.
Data governance needs to be put in place
across different levels such as the following
when choosing to go with online learning:
 Data ingestion
 ETL pipelines
 Feature extraction
 Predictions
The following are some of the challenges of adopting an online learning method:
 Data governance
 Model governance, including appropriate algorithm and model selection on the fly
Online models require only a single deployment in the production setting, and they evolve over a period of time. The disadvantage of online models is that they do not have the entire dataset available for training. The models are trained incrementally based on assumptions made from the available data, and those assumptions can at times be sub-optimal.
Instance-Based Learning
Machine learning systems categorized as instance-based learning are systems that learn the training examples by heart and then generalize to new instances based on some similarity measure. The approach is called instance-based because it builds its hypotheses from the training instances themselves. It is also known as memory-based learning or lazy learning (because processing is delayed until a new instance must be classified). The time complexity of this approach depends on the size of the training data: each time a new query is encountered, the previously stored data is examined and a target function value is assigned to the new instance. The worst-case time complexity is O(n), where n is the number of training instances.
For example, if we were to create a spam filter with an instance-based learning algorithm, instead of just flagging emails that are already marked as spam, our spam filter would be programmed to also flag emails that are very similar to them. This requires a measure of resemblance between two emails; a similarity measure could be having the same sender, repetitive use of the same keywords, or something else.

What is instance-based learning & how does it work?
Instance-based learning (also known as memory-based learning or lazy learning) involves memorizing the training data in order to make predictions about future data points. This approach doesn't require any prior knowledge or assumptions about the data, which makes it easy to implement and understand. However, it can be computationally expensive, since all of the training data needs to be stored in memory before making a prediction. Additionally, this approach doesn't always generalize well to unseen data, because its predictions are based on memorized examples rather than a learned model.
In instance-based learning, the system learns the training data by heart. At prediction time, the system uses a similarity measure to compare new cases with the learned data. K-nearest neighbours (KNN) is an algorithm that belongs to the instance-based learning class of algorithms. KNN is a non-parametric algorithm because it does not assume any specific form or underlying structure in the data. Instead, it relies on a measure of similarity between each pair of data points. Generally speaking, this measure is based on either Euclidean distance or cosine similarity; however, other metrics can be used depending on the type of data being analyzed. Once the similarity between points is calculated, KNN looks at the neighbours closest to a new point and uses these neighbours as examples to make its prediction. This means that instead of creating a generalizable model from all of the data, KNN looks for similarities among individual data points and makes predictions accordingly. For example, a new instance would be predicted as a triangle if the majority of its nearest neighbours are triangles.
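Here is a minimal KNN sketch of that idea; the two-dimensional points and class names are invented for illustration.

# Instance-based learning with k-nearest neighbours; data is invented.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 1], [1, 2], [2, 1],       # stored instances of class "triangle"
           [8, 8], [8, 9], [9, 8]]       # stored instances of class "circle"
y_train = ["triangle"] * 3 + ["circle"] * 3

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)                # "training" just memorizes the instances
print(knn.predict([[2, 2]]))             # -> ['triangle'], by majority of neighbours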
In addition to providing accurate predictions, one major advantage of KNN over other supervised learning algorithms is its versatility: KNN can be used with both numeric targets, such as predicting house prices, and categorical targets, such as predicting whether a website visitor will purchase a product or not. Furthermore, KNN fits no model parameters, since it does not assume any underlying structure in the data; its behaviour depends only on the choice of k and on how the closeness of two points is measured.
Because KNN is an instance-based learning
algorithm, it is not suitable for very large
datasets. This is because the model has to store
all of the training examples in memory, and
making predictions on new data points
involves comparing the new point to all of the
stored training examples. However, for small
or medium-sized datasets, KNN can be a very
effective learning algorithm.
Other instance-based learning algorithms
include learning vector quantization
(LVQ) and self-organizing maps (SOMs).
These algorithms also memorize the training
examples and use them to make predictions on
new data, but they use different techniques to
do so.

Advantages:
1. Instead of estimating the target function over the entire instance space, local approximations can be made to the target function.
2. This approach can adapt easily to new data, which can be collected as we go.
Disadvantages:
1. Classification costs are high.
2. A large amount of memory is required to store the data, and each query involves building a local model from scratch.
Some of the instance-based learning algorithms are:
1. K Nearest Neighbor (KNN)
2. Self-Organizing Map (SOM)
3. Learning Vector Quantization (LVQ)
4. Locally Weighted Learning (LWL)
5. Case-Based Reasoning
