Unit-1 Part-1 Material

INTRODUCTION

What is Machine Learning

In the real world, we are surrounded by humans who can learn from their experiences, and we have computers or machines that simply work on our instructions. But can a machine also learn from experience or past data the way a human does? This is where Machine Learning comes in.

Machine Learning is a subset of artificial intelligence that is mainly concerned with the development of algorithms that allow a computer to learn from data and past experience on its own. The term machine learning was first introduced by Arthur Samuel in 1959. We can define it in a summarized way as:

Definition

Machine learning enables a machine to automatically learn from data, improve performance
from experiences, and predict things without being explicitly programmed.

With the help of sample historical data, known as training data, machine learning algorithms build a mathematical model that helps in making predictions or decisions without being explicitly programmed. Machine learning brings computer science and statistics together to create predictive models. It constructs or uses algorithms that learn from historical data: the more information we provide, the better the performance.

A machine has the ability to learn if it can improve its performance by gaining more
data.

How does Machine Learning work

A Machine Learning system learns from historical data, builds prediction models, and, whenever it receives new data, predicts the output for it. The accuracy of the predicted output depends on the amount of data: a large amount of data helps build a better model that predicts the output more accurately.

Suppose we have a complex problem that requires some predictions. Instead of writing code for it, we just need to feed the data to generic algorithms; the machine builds the logic from the data and predicts the output. Machine learning has changed our way of thinking about such problems. In short, the workflow of a Machine Learning algorithm is: historical data is fed to a learning algorithm, which builds a model that is then used to predict outputs for new data.

Features of Machine Learning:

o Machine learning uses data to detect various patterns in a given dataset.


o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is similar to data mining in that it also deals with huge amounts of data.

Need for Machine Learning

The need for machine learning is increasing day by day. The reason is that machine learning is capable of doing tasks that are too complex for a person to implement directly. As humans, we have limitations: we cannot manually process huge amounts of data. For this we need computer systems, and machine learning makes things easy for us.

We can train machine learning algorithms by providing them with huge amounts of data and letting them explore the data, construct models, and predict the required output automatically. The performance of a machine learning algorithm depends on the amount of data, and it can be measured by the cost function. With the help of machine learning, we can save both time and money.

The importance of machine learning can be easily understood through its use cases. Currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestions by Facebook, etc. Various top companies such as Netflix and Amazon have built machine learning models that use vast amounts of data to analyze user interests and recommend products accordingly.

Classification of Machine Learning

At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning

1) Supervised Learning

Supervised learning is a type of machine learning method in which we provide sample labeled data to the machine learning system in order to train it, and on that basis, it predicts the output.

The system creates a model using labeled data to understand the datasets and learn about each data point. Once training and processing are done, we test the model by providing sample data to check whether it predicts the correct output.

The goal of supervised learning is to map input data to output data. Supervised learning is based on supervision, just as a student learns under the supervision of a teacher. An example of supervised learning is spam filtering.

Supervised learning can be grouped further into two categories of algorithms:

o Classification
o Regression

2) Unsupervised Learning

Unsupervised learning is a learning method in which a machine learns without any supervision.

The training is provided to the machine with a set of data that has not been labeled, classified, or categorized, and the algorithm needs to act on that data without any supervision. The goal of unsupervised learning is to restructure the input data into new features or groups of objects with similar patterns.

In unsupervised learning, we don't have a predetermined result. The machine tries to find useful insights from the huge amount of data. It can be further classified into two categories of algorithms:

o Clustering
o Association

3) Reinforcement Learning

Reinforcement learning is a feedback-based learning method, in which a learning agent gets a reward for each right action and a penalty for each wrong action. The agent learns automatically from this feedback and improves its performance. In reinforcement learning, the agent interacts with the environment and explores it. The goal of the agent is to collect the maximum reward points, and hence it improves its performance.

A robotic dog that automatically learns the movement of its arms is an example of reinforcement learning.

Artificial Intelligence (AI): Developing machines to mimic human intelligence and behaviour.

Machine Learning (ML): Algorithms that learn from structured data to predict
outputs and discover patterns in that data.

Deep Learning (DL): Algorithms based on highly complex neural networks that
mimic the way a human brain works to detect patterns in large unstructured data
sets.
What is Artificial Intelligence?

Artificial intelligence, or AI, recreates human intelligence and behaviour using algorithms, data, and models. AI predicts, automates, and completes tasks typically done by humans, with greater accuracy and precision, reduced bias, and savings in cost and time.

When is Artificial Intelligence Used?

Artificial intelligence is used when a machine completes a task using human intellect and behaviours. For example, Roomba, the smart robotic vacuum, uses AI to analyze the size of the room, obstacles, and pathways. Just like a human being taking this information into account, Roomba then retains this information and creates the most efficient route for vacuuming.

Machine Learning

What is Machine Learning?

Machine learning, or ML, is a type of AI that uses algorithms to learn from data to make sense of it or predict a pattern. Machine learning uses methods from neural networks, statistics, operations research, and physics to find hidden insights within data without being programmed where to look or what to conclude. For example, machine learning is used to develop self-learning processes in which software is given instructions on accomplishing a specific task. The machine is then trained and learns how to perform the job by analyzing relevant data and algorithms, allowing it to understand how to accomplish the task and then improve its performance.

When to Use Machine Learning

Use machine learning when you’re looking to teach a model how to perform a task,
such as predicting an output or discovering a pattern using structured data
(see Structured vs. Unstructured Data for definitions). For example, Spotify builds
you a customized playlist based on your favourite songs and the data from other
users who share your likes and dislikes.
Structured vs. Unstructured Data

Structured data (quantitative data) is organized data that is decipherable by ML algorithms, easily used by businesses, and accessible by more tools than unstructured data. This type of data has a predefined format, which limits its flexibility and use cases. Examples of structured data include dates, phone numbers, customer names, and product names.

Unstructured data (qualitative data) is typically easy and inexpensive to store and can be used across different formats as it does not have a defined purpose. However, since this type of data comes in many different forms, it isn't easy to analyze and leverage. DL is commonly used for unstructured data and is the best option for the most challenging use cases. Examples of unstructured data include photos, audio, and video files.
The Benefits of Machine Learning

Accurate Forecasting

Companies gain significant and precise insights when integrating machine learning
with their data analytics to forecast factors such as market trends and consumer
buying habits. This helps companies save on costs and better manage their
inventory. ML can also indicate other items, such as transportation costs, future
demand, and delivery lead times. Machine learning is used in this scenario over
deep learning as ML models are better equipped to handle structured data, which is
used in forecasting, and are better at predicting trends.

Automation

Using machine learning, businesses can reduce the time spent analyzing
complicated data sets. The results and tasks accomplished by machine learning
models are often very reliable and well done. This is because the model can learn
from itself by making its predictions and improving its algorithms, meaning that no
human intervention is needed. Meanwhile, a deep learning model requires human
intervention during its early stages as someone needs to review its results since it
works with unstructured data.

Trend and Pattern Recognition

Machine learning models are designed to handle large sets of structured data and
analyze them to discover patterns and trends humans wouldn’t identify. A deep
learning model is not recommended as it’s not designed to recognize trends and
patterns within structured data.
Machine Learning Applications

Chatbots

Chatbots are conversational artificial intelligence systems trained using machine learning to provide the appropriate response or assistance based on inputs. These systems learn from past experiences, such as questions asked by previous visitors and the responses given, and from datasets containing possible future queries and appropriate answers. While deep learning plays a role in chatbots, this specific feature of providing proper responses to questions is unique to machine learning since it requires structured data analysis.

Educational Tools

Educational tools, such as apps that teach you different languages, also use machine learning. By analyzing the data you provide from completing sections of a course, ML adjusts the educational system to meet your needs. Deep learning does not apply to this function, as educational apps primarily use structured data.
Streaming Platforms

Recommendations on streaming platforms are another form of machine learning. The ML model integrated into these platforms analyzes songs, movies or shows you have engaged with in the past, compares them with other data from customers with similar consumer behaviours and then suggests additional content you may enjoy. Once again, this function uses structured data instead of unstructured data, so deep learning models cannot be applied.

Deep Learning

What is Deep Learning?

Deep learning is the evolution of machine learning and neural networks, which uses advanced computer programming and training to understand complex patterns hidden in large data sets. DL is about understanding how the human brain works in different situations and then trying to recreate its behaviour. Deep learning is used for complicated problems such as facial recognition, defect detection and image processing.
When to Use Deep Learning?

Deep learning is used to complete complex tasks and train models using unstructured data. For example, deep learning is commonly used in image classification tasks like facial recognition. Although machine learning models can also identify faces, deep learning models are more accurate. In this case, DL takes the unstructured data (images of faces) and extracts factors such as the various facial features. The extracted features are then matched to those stored in a database.

The Benefits of Deep Learning

Efficiently Handles Unstructured Data

While machine learning models can handle various types of data, they are limited in understanding unstructured data (such as handwriting, images and voices). This means that the knowledge hidden in this data may go unnoticed, and this is where deep learning fills the gap. When businesses train their deep learning models, they must do so with unstructured data, which can help the company optimize many of its business-related functions.

Scalability

Deep learning’s ability to process massive amounts of data simultaneously and perform analysis quickly makes this approach highly scalable. A company can improve its productivity, modularity, and portability by using deep learning. For example, Google’s Cloud AI platform can run deep neural networks at scale on their cloud, leveraging their infrastructure to scale batch prediction and improving efficiency by scaling the number of nodes based on traffic requests.

Parallel and Distributed Algorithms

Since deep learning models are better at supporting parallel and distributed algorithms, the amount of time it takes a DL model to learn the relevant parameters is significantly reduced. The models can be trained locally (using only one machine); however, working with massive data sets then becomes challenging. Parallel and distributed algorithms allow the data (or model) to be distributed across multiple machines, making the training more effective. They also speed up the time the model needs to learn and train, saving the company time and money.

Deep Learning Applications

Virtual Assistants

Deep learning is used in virtual assistants such as Alexa and Siri, which use Natural Language Processing (NLP). NLP analyzes and understands unstructured data, such as forms of human language (written and verbal). It also handles tasks such as language recognition, sentiment analysis and text classification and then creates the appropriate response to your input. When using NLP, it's recommended to use deep learning as it better understands unstructured data, such as written and verbal language, which helps in scenarios like sentiment analysis.

Self-Driving Vehicles

Self-driving cars are autonomous decision-making systems that process data from
multiple sensors such as cameras, LiDAR, RADAR and GPS. The data collected is
then analyzed using deep learning algorithms to produce relevant decisions
depending on the car’s environment. Deep learning plays a role in a self-driving
car’s perception as it helps the car recognize and classify objects, buildings,
beings, road signs, traffic lights, etc., picked up by its sensors and cameras. DL is
also used to improve the visual odometry of the car, which helps the car calculate
its position and orientation while navigating (Source: Neptune.ai).

Manufacturing

Deep learning is also used in manufacturing to improve quality. A significant expense the manufacturing industry faces is equipment and machinery maintenance. Deep learning models decrease the time a piece of equipment is out of commission by identifying quality problems using process monitoring and anomaly detection. This saves the company money on unscheduled repairs, helps them
better design their equipment, improves employee safety and product quality, and
increases productivity. Only deep learning can be used for this function, as ML
models are limited in handling the unstructured data involved in process
monitoring and anomaly detection.
A Summary of Artificial Intelligence, Machine Learning, and Deep Learning

Artificial Intelligence recreates human intelligence and behaviours using algorithms, data, and models. AI is implemented when a machine completes a task using human behaviours.

Machine Learning, or ML, a subset of AI, uses algorithms to learn from data and make sense of it or predict patterns. ML is used when you're looking to teach a model how to predict an output or discover a trend using structured data.

Deep Learning, or DL, a subset of ML, is the evolution of machine learning and neural networks, which uses advanced computer programming and training to understand complex patterns hidden in large data sets, similar to a human brain.

Types of Machine Learning

Machine learning is a subset of AI, which enables the machine to automatically learn
from data, improve performance from past experiences, and make predictions. Machine
learning contains a set of algorithms that work on a huge amount of data. Data is fed to these
algorithms to train them, and on the basis of training, they build the model & perform a
specific task.

These ML algorithms help to solve different business problems like Regression, Classification, Forecasting, Clustering, and Association, etc.

Based on the methods and way of learning, machine learning is divided into mainly four
types, which are:

1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning

1. Supervised Machine Learning

As its name suggests, supervised machine learning is based on supervision. It means that in the supervised learning technique, we train the machines using a "labelled" dataset, and based on the training, the machine predicts the output. Here, the labelled data specifies that some of the inputs are already mapped to the output. More precisely: first, we train the machine with the input and corresponding output, and then we ask the machine to predict the output for a test dataset.

Suppose we have an input dataset of cat and dog images. First, we provide training to the machine to understand the images, using features such as the shape and size of the tail of a cat and a dog, the shape of the eyes, colour, and height (dogs are taller, cats are smaller). After training, we input the picture of a cat and ask the machine to identify the object and predict the output. The machine, now well trained, will check all the features of the object, such as height, shape, colour, eyes, ears, tail, etc., and find that it's a cat. So, it will put it in the Cat category. This is how the machine identifies objects in supervised learning.

The main goal of the supervised learning technique is to map the input variable(x) with
the output variable(y). Some real-world applications of supervised learning are Risk
Assessment, Fraud Detection, Spam filtering, etc.

Categories of Supervised Machine Learning

Supervised machine learning can be classified into two types of problems, which are given
below:

o Classification
o Regression
a) Classification

Classification algorithms are used to solve classification problems, in which the output variable is categorical, such as "Yes" or "No", "Male" or "Female", "Red" or "Blue", etc. Classification algorithms predict the categories present in the dataset. Some real-world examples of classification algorithms are spam detection, email filtering, etc.

Some popular classification algorithms are given below:

o Random Forest Algorithm
o Decision Tree Algorithm
o Logistic Regression Algorithm
o Support Vector Machine Algorithm
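As a brief illustration of the classification workflow, here is a hedged sketch using scikit-learn; the Iris dataset, the decision tree model, and the 25% test split are illustrative assumptions, not requirements of the technique.

```python
# A minimal sketch of a supervised classification workflow with
# scikit-learn. The Iris dataset, the decision tree model, and the
# 25% test split are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)   # labelled data: features X, classes y

# Hold out a portion of the labelled data to test the trained model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)      # learn the input-to-output mapping
y_pred = clf.predict(X_test)   # predict categories for unseen data
print("Accuracy:", accuracy_score(y_test, y_pred))
```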

b) Regression

Regression algorithms are used to solve regression problems in which there is a linear
relationship between input and output variables. These are used to predict continuous output
variables, such as market trends, weather prediction, etc.

Some popular regression algorithms are given below:

o Simple Linear Regression Algorithm
o Multivariate Regression Algorithm
o Decision Tree Algorithm
o Lasso Regression
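Similarly, here is a minimal regression sketch that fits a linear relationship between an input and a continuous output; the synthetic data (y roughly equal to 3x + 4 plus noise) is an illustrative assumption.

```python
# A minimal sketch of simple linear regression on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))               # continuous input
y = 3.0 * X.ravel() + 4.0 + rng.normal(0, 1, 100)   # continuous output + noise

reg = LinearRegression().fit(X, y)
print("slope:", reg.coef_[0], "intercept:", reg.intercept_)
print("prediction at x=5:", reg.predict([[5.0]])[0])  # expect roughly 19
```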

Advantages and Disadvantages of Supervised Learning

Advantages:

o Since supervised learning works with a labelled dataset, we can have an exact idea about the classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior experience.

Disadvantages:

o These algorithms are not able to solve complex tasks.
o They may predict the wrong output if the test data is different from the training data.
o They require a lot of computational time to train.
Applications of Supervised Learning

Some common applications of Supervised Learning are given below:

o Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process,
image classification is performed on different image data with pre-defined labels.
o Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis. This is done by using medical images and past data labelled with disease conditions. With such a process, the machine can identify a disease for new patients.
o Fraud Detection - Supervised Learning classification algorithms are used for
identifying fraud transactions, fraud customers, etc. It is done by using historic data to
identify the patterns that can lead to possible fraud.
o Spam detection - In spam detection & filtering, classification algorithms are used.
These algorithms classify an email as spam or not spam. The spam emails are sent to
the spam folder.
o Speech Recognition - Supervised learning algorithms are also used in speech
recognition. The algorithm is trained with voice data, and various identifications can
be done using the same, such as voice-activated passwords, voice commands, etc.

2. Unsupervised Machine Learning

Unsupervised learning is different from the supervised learning technique; as its name suggests, there is no need for supervision. It means that in unsupervised machine learning, the machine is trained using an unlabeled dataset, and it predicts the output without any supervision.

In unsupervised learning, the models are trained with the data that is neither classified nor
labelled, and the model acts on that data without any supervision.

The main aim of the unsupervised learning algorithm is to group or categorize the unsorted dataset according to similarities, patterns, and differences. Machines are instructed to find the hidden patterns in the input dataset.

Let's take an example to understand it more precisely. Suppose there is a basket of fruit images, and we input it into the machine learning model. The images are totally unknown to the model, and the task of the machine is to find the patterns and categories of the objects.

So the machine will discover its own patterns and differences, such as colour difference and shape difference, and predict the output when it is tested with the test dataset.
Categories of Unsupervised Machine Learning

Unsupervised Learning can be further classified into two types, which are given below:

o Clustering
o Association

1) Clustering

The clustering technique is used when we want to find the inherent groups from the data. It is
a way to group the objects into a cluster such that the objects with the most similarities
remain in one group and have fewer or no similarities with the objects of other groups. An
example of the clustering algorithm is grouping the customers by their purchasing behaviour.

Some of the popular clustering algorithms are given below:

o K-Means Clustering algorithm
o Mean-shift algorithm
o DBSCAN Algorithm
o Principal Component Analysis
o Independent Component Analysis
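As a brief illustration of clustering on unlabelled data, here is a hedged sketch using scikit-learn's KMeans; the two synthetic blobs are an illustrative assumption.

```python
# A minimal sketch of K-Means clustering on unlabelled data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Unlabelled dataset: we deliberately ignore the true labels.
X, _ = make_blobs(n_samples=200, centers=2, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)   # group points purely by similarity

print("cluster sizes:", [int((labels == k).sum()) for k in (0, 1)])
print("cluster centers:\n", km.cluster_centers_)
```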

2) Association

Association rule learning is an unsupervised learning technique which finds interesting relations among variables within a large dataset. The main aim of this learning algorithm is to find the dependency of one data item on another data item and map those variables accordingly so that it can generate maximum profit. This algorithm is mainly applied in Market Basket analysis, Web usage mining, continuous production, etc.

Some popular algorithms of Association rule learning are the Apriori algorithm, Eclat, and the FP-growth algorithm.
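As a brief illustration of market basket analysis, here is a hedged sketch of the Apriori algorithm, assuming the third-party mlxtend library is available; the four-basket transaction list is an illustrative assumption.

```python
# A minimal sketch of association rule mining in the market-basket style,
# assuming the third-party mlxtend library is installed; the tiny
# transaction list is an illustrative assumption.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk", "butter"],
]

# One-hot encode the transactions: one boolean column per item.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Apriori finds itemsets that appear in at least 50% of the baskets;
# rules such as "butter -> milk" can then be read off these itemsets.
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
print(frequent)
```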

Advantages and Disadvantages of Unsupervised Learning Algorithm

Advantages:

o These algorithms can be used for more complicated tasks than supervised ones, because they work on unlabeled datasets.
o Unsupervised algorithms are preferable for various tasks, as obtaining an unlabeled dataset is easier than obtaining a labelled dataset.
Disadvantages:

o The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and the algorithms are not trained with the exact output in advance.
o Working with unsupervised learning is more difficult, as it works with unlabelled datasets that do not map to an output.

Applications of Unsupervised Learning


o Network Analysis: Unsupervised learning is used to identify plagiarism and copyright issues through document network analysis of text data for scholarly articles.
o Recommendation Systems: Recommendation systems widely use unsupervised learning techniques to build recommendation applications for different web applications and e-commerce websites.
o Anomaly Detection: Anomaly detection is a popular application of unsupervised learning, which can identify unusual data points within the dataset. It is used to discover fraudulent transactions.
o Singular Value Decomposition: Singular Value Decomposition, or SVD, is used to extract particular information from a database. For example, extracting information about each user located at a particular location.

3. Semi-Supervised Learning

Semi-supervised learning is a type of machine learning algorithm that lies between supervised and unsupervised machine learning. It represents the intermediate ground between supervised learning (with labelled training data) and unsupervised learning (with no labelled training data), and uses a combination of labelled and unlabeled datasets during the training period.

Although semi-supervised learning is the middle ground between supervised and unsupervised learning and operates on data that contains a few labels, it mostly consists of unlabeled data. Labels are costly, so for corporate purposes a dataset may have only a few of them. This distinguishes it from supervised and unsupervised learning, which are based on the presence and absence of labels, respectively.

The concept of semi-supervised learning was introduced to overcome the drawbacks of supervised and unsupervised learning algorithms. The main aim of semi-supervised learning is to make effective use of all the available data, rather than only the labelled data as in supervised learning. Initially, similar data is clustered with an unsupervised learning algorithm, which then helps label the unlabeled data. This is done because labelled data is comparatively more expensive to acquire than unlabeled data.

We can illustrate these algorithms with an example. Supervised learning is where a student is under the supervision of an instructor at home and college. If that student analyses the same concept on their own, without any help from the instructor, it comes under unsupervised learning. Under semi-supervised learning, the student has to revise on their own after analyzing the same concept under the guidance of an instructor at college.

Advantages and disadvantages of Semi-supervised Learning

Advantages:

o The algorithm is simple and easy to understand.
o It is highly efficient.
o It is used to solve the drawbacks of supervised and unsupervised learning algorithms.

Disadvantages:

o Iteration results may not be stable.
o These algorithms cannot be applied to network-level data.
o Accuracy is low.

4. Reinforcement Learning

Reinforcement learning works on a feedback-based process, in which an AI agent (a software component) automatically explores its surroundings by hit and trial, taking actions, learning from experiences, and improving its performance. The agent gets rewarded for each good action and punished for each bad action; hence the goal of a reinforcement learning agent is to maximize the rewards.

In reinforcement learning, there is no labelled data as in supervised learning; agents learn from their experiences only.

The reinforcement learning process is similar to that of a human being; for example, a child learns various things from experiences in day-to-day life. Playing a game is an example of reinforcement learning: the game is the environment, the moves of the agent at each step define states, and the goal of the agent is to get a high score. The agent receives feedback in the form of punishments and rewards.

Due to its way of working, reinforcement learning is employed in different fields such as game theory, operations research, information theory, and multi-agent systems.

A reinforcement learning problem can be formalized using a Markov Decision Process (MDP). In an MDP, the agent constantly interacts with the environment and performs actions; at each action, the environment responds and generates a new state.
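To make this interaction loop concrete, here is a minimal Q-learning sketch on a toy five-state corridor environment; the environment, the reward scheme, and the hyperparameters (alpha, gamma, epsilon) are illustrative assumptions, not part of the original text.

```python
# A minimal Q-learning sketch: states 0..4 on a line, the agent moves
# left/right and is rewarded only for reaching state 4 (the goal).
import numpy as np

n_states, n_actions = 5, 2           # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))  # action-value estimates
alpha, gamma, epsilon = 0.1, 0.9, 0.2
rng = np.random.default_rng(0)

for _ in range(500):                 # episodes
    s = 0
    while s != 4:                    # state 4 is the goal
        # epsilon-greedy: explore sometimes, otherwise act greedily
        a = rng.integers(2) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else min(4, s + 1)
        r = 1.0 if s_next == 4 else 0.0   # reward only at the goal
        # Q-update: nudge the estimate toward reward + discounted future value
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

# Greedy policy per state; for states 0..3 it should move right (action 1).
print(Q.argmax(axis=1))
```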

Categories of Reinforcement Learning

Reinforcement learning is categorized mainly into two types of methods/algorithms:

o Positive Reinforcement Learning: Positive reinforcement learning specifies increasing the tendency that the required behaviour will occur again by adding something. It enhances the strength of the agent's behaviour and impacts it positively.
o Negative Reinforcement Learning: Negative reinforcement learning works exactly opposite to positive RL. It increases the tendency that the specific behaviour will occur again by avoiding the negative condition.

Real-world Use cases of Reinforcement Learning


o Video Games:
RL algorithms are very popular in gaming applications, where they are used to achieve super-human performance. Some popular game-playing systems that use RL algorithms are AlphaGO and AlphaGO Zero.
o Resource Management:
The paper "Resource Management with Deep Reinforcement Learning" showed how to use RL in computers to automatically learn to schedule resources across waiting jobs in order to minimize average job slowdown.
o Robotics:
RL is widely used in robotics applications. Robots are used in industrial and manufacturing areas, and these robots are made more powerful with reinforcement learning. Different industries have a vision of building intelligent robots using AI and machine learning technology.
o Text Mining:
Text mining, one of the great applications of NLP, is now being implemented with the help of reinforcement learning by the company Salesforce.

Advantages and Disadvantages of Reinforcement Learning

Advantages

o It helps in solving complex real-world problems that are difficult to solve with general techniques.
o The learning model of RL is similar to human learning; hence highly accurate results can be found.
o It helps in achieving long-term results.

Disadvantages
o RL algorithms are not preferred for simple problems.
o RL algorithms require huge data and computations.
o Too much reinforcement learning can lead to an overload of states which can weaken
the results.

o The curse of dimensionality limits reinforcement learning for real physical systems.

Batch and Online Learning

Another criterion used to classify Machine Learning systems is whether or not the system can learn incrementally from a stream of incoming data.

In batch learning, the system is incapable of learning incrementally: it must be trained using
all the available data. This will generally take a lot of time and computing resources, so it is
typically done offline. First the system is trained, and then it is launched into production and
runs without learning anymore; it just applies what it has learned. This is called offline
learning.

For the model to learn about new data, it needs to be trained from scratch with all of the data; the old model is then replaced with the new one. As part of batch learning, the whole process of training, evaluating, and launching a machine learning system can be automated. Model training using the full set of data can take many hours and a lot of computing resources (CPU, memory space, disk space, disk I/O, network I/O, etc.), so the batch is typically run infrequently (e.g., daily or weekly) rather than continuously.

Online Learning

In online learning, training happens incrementally by continuously feeding in data as it arrives, or in small groups (mini-batches). Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives.

Online learning is great for machine learning systems that receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly or autonomously. It is also a good option if you have limited computing resources: once an online learning system has learned about new data instances, it does not need them anymore, so you can discard them (unless you want to be able to roll back to a previous state and "replay" the data) or move the data to another form of storage (warm or cold storage) if you are using a data lake. This can save a huge amount of space and cost.
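As a concrete illustration, the following sketch simulates online learning with scikit-learn's SGDRegressor, whose partial_fit method performs exactly this kind of fast, incremental learning step; the synthetic mini-batches standing in for a data stream are an illustrative assumption.

```python
# A minimal sketch of online (incremental) learning with partial_fit;
# the simulated stream of mini-batches is an illustrative assumption.
import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=0)
rng = np.random.default_rng(0)

for step in range(100):                    # data arriving as mini-batches
    X_batch = rng.uniform(0, 1, size=(16, 1))
    y_batch = 2.0 * X_batch.ravel() + 0.5  # true relationship: y = 2x + 0.5
    model.partial_fit(X_batch, y_batch)    # one fast, cheap learning step
    # X_batch/y_batch can now be discarded; the model keeps only its weights

# After the stream, the weights should be close to slope 2 and intercept 0.5.
print("learned coef:", model.coef_, "intercept:", model.intercept_)
```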

Instance-Based Versus Model-Based Learning

Instance-based vs Model-based Learning: Differences

Machine learning is a field of artificial intelligence that deals with giving machines the ability
to learn without being explicitly programmed. In this context, instance-based learning and
model-based learning are two different approaches used to create machine learning models.
While both approaches can be effective, they also have distinct differences that must be taken
into account when building a machine learning system. Let’s explore the differences between
these two types of machine learning.

What is instance-based learning & how does it work?


Instance-based learning (also known as memory-based learning or lazy learning) involves
memorizing training data in order to make predictions about future data points. This approach
doesn’t require any prior knowledge or assumptions about the data, which makes it easy to
implement and understand. However, it can be computationally expensive since all of the
training data needs to be stored in memory before making a prediction. Additionally, this
approach doesn’t generalize well to unseen data sets because its predictions are based on
memorized examples rather than learned models.
In instance-based learning, the system learns the training data by heart. At prediction time, the system uses a similarity measure to compare new cases with the learned data. K-nearest neighbors (KNN) is an algorithm that belongs to the instance-based learning class of algorithms. KNN is a non-parametric algorithm because it does not assume any specific form or underlying structure in the data. Instead, it relies on a measure of similarity between each pair of data points. Generally speaking, this measure is based on either Euclidean distance or cosine similarity; however, other metrics can be used depending on the type of data being analyzed. Once the similarity between two points is calculated, KNN looks at how many neighbors are within a certain radius around that point and uses these neighbors as examples to make its prediction. This means that instead of creating a generalizable model from all of the data, KNN looks for similarities among individual data points and makes predictions accordingly. For example, a new instance would be predicted as a triangle when a greater number of triangles lie in its proximity.

In addition to providing accurate predictions, one major advantage of using KNN over other forms of supervised learning algorithms is its versatility; KNN can be used with both numeric datasets, such as when predicting house prices, and categorical datasets, such as when predicting whether a website visitor will purchase a product or not. Furthermore, there are few parameters to tune in KNN, since it does not assume any underlying structure in the data that needs to be fitted; the parameters that do exist govern how close two points must be to be considered similar.

Because KNN is an instance-based learning algorithm, it is not suitable for very large
datasets. This is because the model has to store all of the training examples in memory, and
making predictions on new data points involves comparing the new point to all of the stored
training examples. However, for small or medium-sized datasets, KNN can be a very
effective learning algorithm.
Other instance-based learning algorithms include learning vector quantization
(LVQ) and self-organizing maps (SOMs). These algorithms also memorize the training
examples and use them to make predictions on new data, but they use different techniques to
do so.
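Here is a minimal sketch of the instance-based behaviour described above, using scikit-learn's KNeighborsClassifier; the toy dataset and the choice of k = 3 are illustrative assumptions. Note that fitting merely stores the examples, and prediction compares new points against them.

```python
# A minimal sketch of instance-based (lazy) learning with k-nearest
# neighbors; the toy dataset and k=3 are illustrative assumptions.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Training points: two features per point, with binary class labels.
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# "Training" a KNN model just stores the examples in memory.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)

# Prediction compares the new point to the stored examples and takes a
# majority vote among its 3 nearest neighbors.
print(knn.predict([[2, 2]]))  # expected: [0], closest to the first cluster
print(knn.predict([[6, 5]]))  # expected: [1], closest to the second cluster
```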

What is model-based learning & how does it work?


Model-based learning (also known as structure-based or eager learning) takes a different approach by constructing models from the training data that can generalize better than instance-based methods. This involves using algorithms like linear regression, logistic regression, decision trees, random forests, etc. to create an underlying model from which predictions can be made for new data points. Here, the prediction about the class is decided based on a boundary learned from the training data, rather than by comparing the new case against a stored data set using similarity measures.
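For contrast with the KNN sketch above, here is a minimal model-based sketch using logistic regression on the same kind of toy data; the dataset is an illustrative assumption. Training compresses the data into a small set of learned parameters (a decision boundary), and prediction uses only those parameters.

```python
# A minimal sketch of model-based (eager) learning: logistic regression
# fits a decision boundary, after which the training data is not needed.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X_train, y_train)

# Unlike KNN, prediction uses only the learned parameters (a boundary),
# not the stored training examples.
print(model.coef_, model.intercept_)    # the learned boundary
print(model.predict([[2, 2], [6, 5]]))  # expected: [0 1]
```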

Main challenges of Machine Learning

 Inadequate Training Data
 Poor quality of data
 Non-representative training data
 Overfitting and Underfitting
 Monitoring and maintenance
 Getting bad recommendations
 Lack of skilled resources
 Customer Segmentation
Commonly used Algorithms in Machine Learning

Machine Learning is the study of learning algorithms that use past experience to make future decisions. Although Machine Learning has a variety of models, here is a list of the machine learning algorithms most commonly used by data scientists and professionals today.

o Linear Regression
o Logistic Regression
o Decision Tree
o Bayes Theorem and Naïve Bayes Classification
o Support Vector Machine (SVM) Algorithm
o K-Nearest Neighbor (KNN) Algorithm
o K-Means
o Gradient Boosting algorithms
o Dimensionality Reduction Algorithms
o Random Forest

1. Inadequate Training Data

The major issue that arises while using machine learning algorithms is the lack of quality as well as quantity of data. Although data plays a vital role in the processing of machine learning algorithms, many data scientists report that inadequate data, noisy data, and unclean data severely degrade machine learning algorithms. For example, a simple task may require thousands of sample data points, while an advanced task such as speech or image recognition needs millions of sample data examples. Further, data quality is also important for the algorithms to work ideally, but an absence of data quality is common in machine learning applications. Data quality can be affected by factors such as the following:

o Noisy Data: responsible for inaccurate predictions, affecting decisions as well as accuracy in classification tasks.
o Incorrect data: responsible for faulty programming and results in machine learning models; incorrect data may affect the accuracy of the results.
o Generalizing of output data: sometimes generalizing output data becomes complex, which results in comparatively poor future actions.

2. Poor quality of data

As we have discussed above, data plays a significant role in machine learning, and it must be
of good quality as well. Noisy data, incomplete data, inaccurate data, and unclean data lead to
less accuracy in classification and low-quality results. Hence, data quality can also be
considered as a major common problem while processing machine learning algorithms.

3. Non-representative training data

To make sure our trained model generalizes well, we have to ensure that the sample training data is representative of the new cases to which we need to generalize. The training data must cover cases that have already occurred as well as cases that are occurring now.

Further, if we use non-representative training data in the model, it results in less accurate predictions. A machine learning model is said to be ideal if it predicts well for generalized cases and provides accurate decisions. If there is too little training data, there will be sampling noise in the model; this is called a non-representative training set. It won't be accurate in its predictions, as it will be biased towards one class or group.

Hence, we should use representative data in training to protect against being biased and make
accurate predictions without any drift.

4. Overfitting and Underfitting

Overfitting:

Overfitting is one of the most common issues faced by machine learning engineers and data scientists. Whenever a machine learning model is trained on a huge amount of data, it can start capturing the noise and inaccurate values present in the training data set, which negatively affects the performance of the model. Let's understand this with a simple example: suppose the training data contains 1000 mangoes, 1000 apples, 1000 bananas, and 5000 papayas. There is then a considerable probability of identifying an apple as a papaya, because we have a massive amount of biased data in the training set, so the prediction is negatively affected. The main reason behind overfitting is using non-linear methods in machine learning algorithms, as they can build unrealistic data models. We can reduce overfitting by using linear and parametric algorithms in the machine learning models.

Methods to reduce overfitting:

o Increase the training data in the dataset.
o Reduce model complexity by simplifying the model and selecting one with fewer parameters.
o Ridge regularization and Lasso regularization (a short sketch follows this list).
o Early stopping during the training phase.
o Reduce the noise.
o Reduce the number of attributes in the training data.
o Constrain the model.
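As a sketch of the regularization idea from the list above, the following compares plain least squares with Ridge and Lasso on the same over-flexible polynomial model; the synthetic quadratic data and the degree-10 features are illustrative assumptions.

```python
# A minimal sketch of how Ridge and Lasso regularization constrain a model
# that would otherwise overfit; the synthetic data is an assumption.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(30, 1))
y = 0.5 * X.ravel() ** 2 + rng.normal(0, 0.1, 30)   # quadratic truth + noise

# A degree-10 polynomial fit with plain least squares can chase the noise;
# the penalty term in Ridge/Lasso shrinks the coefficients and constrains it.
for name, reg in [("OLS", LinearRegression()),
                  ("Ridge", Ridge(alpha=1.0)),
                  ("Lasso", Lasso(alpha=0.01, max_iter=50000))]:
    model = make_pipeline(PolynomialFeatures(degree=10), reg)
    model.fit(X, y)
    coefs = model.named_steps[reg.__class__.__name__.lower()].coef_
    print(f"{name}: largest |coefficient| = {np.abs(coefs).max():.2f}")
```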
Underfitting:

Underfitting is just the opposite of overfitting. Whenever a machine learning model is trained with too little data or too simple a model, it produces incomplete and inaccurate predictions and destroys the accuracy of the machine learning model.

Underfitting occurs when our model is too simple to capture the underlying structure of the data, just like an undersized pant. This generally happens when we have limited data in the data set and we try to build a linear model with non-linear data. In such scenarios, the model fails to capture the complexity of the data, its rules become too simple to apply to this data set, and the model starts making wrong predictions.

Methods to reduce underfitting:

o Increase model complexity.
o Remove noise from the data.
o Train on more and better features.
o Reduce the constraints.
o Increase the number of epochs to get better results.

5. Monitoring and maintenance

As we know, generalized output data is mandatory for any machine learning model; hence, regular monitoring and maintenance are compulsory. Different results for different actions require data changes; hence, editing the code, as well as allocating resources to monitor the model, also becomes necessary.

6. Getting bad recommendations

A machine learning model operates within a specific context, which can result in bad recommendations and concept drift in the model. Let's understand this with an example: at one point in time a customer is looking for some gadgets, but the customer's requirements change over time while the machine learning model keeps showing the same recommendations, even though the customer's expectations have changed. This incident is called Data Drift. It generally occurs when new data is introduced or the interpretation of data changes. However, we can overcome this by regularly updating and monitoring the data according to expectations.

7. Lack of skilled resources

Although machine learning and artificial intelligence are continuously growing in the market, these industries are still younger than others. The absence of skilled resources in the form of manpower is also an issue. We need people with in-depth knowledge of mathematics, science, and technology to develop and manage scientific material for machine learning.
8. Customer Segmentation

Customer segmentation is also an important issue while developing a machine learning algorithm. The challenge is to identify the customers who act on the recommendations shown by the model versus those who do not even check them. Hence, an algorithm is necessary to recognize customer behaviour and trigger relevant recommendations for the user based on past experience.

9. Process Complexity of Machine Learning

The machine learning process is very complex, which is another major issue faced by machine learning engineers and data scientists. Machine learning and artificial intelligence are very new technologies, still in an experimental phase and continuously changing over time. Much of the work proceeds by hit-and-trial experiments; hence the probability of error is higher than expected. Further, the process includes analyzing the data, removing data bias, training the data, and applying complex mathematical calculations, which makes the procedure more complicated and quite tedious.

10. Data Bias

Data bias is also a big challenge in machine learning. These errors exist when certain elements of the dataset are heavily weighted or given more importance than others. Biased data leads to inaccurate results, skewed outcomes, and other analytical errors. However, we can resolve this error by determining where the data is actually biased in the dataset and then taking the necessary steps to reduce it.

Methods to remove data bias:

o Research more for customer segmentation.
o Be aware of your general use cases and potential outliers.
o Combine inputs from multiple sources to ensure data diversity.
o Include bias testing in the development process.
o Analyze data regularly and keep tracking errors to resolve them easily.
o Review the collected and annotated data.
o Use multi-pass annotation, such as sentiment analysis, content moderation, and intent recognition.

11. Lack of Explainability

Lack of explainability basically means that the outputs of a model cannot be easily comprehended, as the model is programmed in specific ways to deliver outputs for certain conditions. Hence, a lack of explainability is also found in machine learning algorithms, which reduces the credibility of the algorithms.
12. Slow implementations and results

This issue is also commonly seen in machine learning models. Machine learning models can be highly efficient in producing accurate results but are time-consuming. Slow programs, excessive requirements, and overloaded data take more time than expected to provide accurate results. This requires continuous maintenance and monitoring of the model to deliver accurate results.

13. Irrelevant features

Although machine learning models are intended to give the best possible outcome, if we feed garbage data as input, then the result will also be garbage. Hence, we should use relevant features in our training sample. A machine learning model is considered good if the training data has a good set of features and few to no irrelevant features.

Statistical Learning

A main challenge in data science is the mathematical analysis of the data. When the goal is to interpret the model and quantify the uncertainty in the data, this analysis is usually referred to as statistical learning.

Statistical learning theory is a framework for machine learning that draws from statistics and
functional analysis. It deals with finding a predictive function based on the data presented. The
main idea in statistical learning theory is to build a model that can draw conclusions from data
and make predictions.

There are two major goals for modeling data:

i. To accurately predict some future quantity of interest, given some observed data.

ii. To discover unusual or interesting patterns in the data.

To achieve these two goals, one must rely on knowledge from three important pillars:

1. Function approximation: We usually assume that the underlying mathematical function is not completely known, but can be approximated well, given enough computing power and data.

2. Optimization: Given a class of approximating functions, we wish to find the best possible
function in that class.

3. Probability and Statistics: In general, the data used to fit the model is viewed as a
realization of a random process, whose probability law determines the accuracy with which
we can predict future observations.

Types of Data in Statistical Learning

There are two main types of data:

 Dependent Variable — a variable (y) whose values depend on the values of other
variables (a dependent variable is sometimes also referred to as a target variable)

 Independent Variables — a variable (x) whose value does not depend on the values of
other variables (independent variables are sometimes also referred to as predictor
variables, input variables, explanatory variables, or features)

In statistical learning, the independent variables are the variables that affect the dependent variable.

A common example of an independent variable is age. There is nothing one can do to increase or decrease age; this variable is independent.

Some common examples of Dependent Variables are:

 Weight — a person’s weight is dependent on his or her age, diet, and activity levels
(as well as other factors)

 Temperature — temperature is impacted by altitude, distance from the equator (latitude), and distance from the sea

In graphs, the independent variable is often plotted along the x-axis while the dependent
variable is plotted along the y-axis.

For example, in a graph showing how the price of a home is affected by the size of the home, square footage is the independent variable, while the price of the home is the dependent variable.
Supervised and Unsupervised Learning

Feature: input
Response: output

Given an input or feature vector x, one of the main goals of machine learning is to predict an output or response variable y.

Example

1. x could be a digitized signature and y a binary variable that indicates whether the signature is genuine or false.

2. x represents the weight and smoking habits of an expecting mother and y the birth weight
of the baby.

Prediction Function

The data science attempt at this prediction is encoded in a mathematical function g, called the prediction function, which takes as input x and outputs a guess g(x) for y (denoted by ŷ, for example). In a sense, g encompasses all the information about the relationship between the variables x and y, excluding the effects of chance and randomness in nature.

Regression

A regression is a statistical technique that relates a dependent variable to one or more independent (explanatory) variables.

A regression model is able to show whether changes observed in the dependent variable are
associated with changes in one or more of the explanatory variables.

 It does this by essentially fitting a best-fit line and seeing how the data is dispersed
around this line.

Calculating Regression

Linear regression models often use a least-squares approach to determine the line of best fit.
The least-squares technique is determined by minimizing the sum of squares created by a
mathematical function. A square is, in turn, determined by squaring the distance between a
data point and the regression line or mean value of the data set.

Once this process has been completed a regression model is constructed.

The general form of each type of regression model is:

Simple linear regression:

Y = a + bX + u

Multiple linear regression:

Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u

where:

Y = the dependent variable you are trying to predict or explain
X = the explanatory (independent) variable(s) you are using to predict or associate with Y
a = the y-intercept
b = the beta coefficient, i.e., the slope of the explanatory variable(s)
u = the regression residual or error term
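As a worked illustration of the least-squares calculation described above, the following sketch fits Y = a + bX + u by hand using the closed-form formulas for the slope and intercept; the five data points are an illustrative assumption.

```python
# A worked sketch of least squares for simple linear regression,
# Y = a + bX + u; the five data points are an illustrative assumption.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# b = sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
b = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
a = Y.mean() - b * X.mean()        # the fitted line passes through the means

residuals = Y - (a + b * X)        # u: what the line fails to explain
print("a (intercept):", a, "b (slope):", b)
print("sum of squared residuals:", (residuals ** 2).sum())
```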

Errors in Machine Learning

We can describe an error as an action which is inaccurate or wrong. In machine learning, error is used to see how accurately our model can predict both on the data it learns from and on new, unseen data. Based on this error, we choose the machine learning model that performs best for a particular dataset.

There are two main types of errors present in any machine learning model. They are
Reducible Errors and Irreducible Errors.

 Irreducible errors are errors which will always be present in a machine learning model,
because of unknown variables, and whose values cannot be reduced.

 Reducible errors are those errors whose values can be further reduced to improve a model. They arise because our model’s output function does not match the desired output function, and they can be reduced by optimizing the model.

We can further divide reducible errors into two: Bias and Variance.
Figure 1: Errors in Machine Learning

What is Bias?

To make predictions, our model analyzes our data and finds patterns in it. Using these patterns, we can make generalizations about certain instances in our data. After training, our model learns these patterns and applies them to the test set to make predictions.

Bias is the difference between our model’s predictions and the actual values. It reflects the simplifying assumptions our model makes about the data in order to be able to predict new data.

Figure 2: Bias

When the bias is high, the assumptions made by our model are too basic and the model can’t capture the important features of our data. This means that our model has not captured the patterns in the training data and hence cannot perform well on the testing data either. If this is the case, our model cannot handle new data and cannot be sent into production.
This instance, where the model cannot find patterns in our training set and hence fails for
both seen and unseen data, is called Underfitting.

The below figure shows an example of Underfitting. As we can see, the model has found no
patterns in our data and the line of best fit is a straight line that does not pass through any of
the data points. The model has failed to train properly on the data given and cannot predict
new data either.

Figure 3: Underfitting

What is Variance?

Variance is the very opposite of bias. During training, the model is allowed to ‘see’ the data a certain number of times so it can find patterns in it. If it does not work on the data for long enough, it will not find patterns, and bias occurs. On the other hand, if our model is allowed to view the data too many times, it will learn very well for only that data: it will capture most patterns in the data, but it will also learn from the unnecessary data present, i.e. from the noise.

We can define variance as the model’s sensitivity to fluctuations in the data. A model that learns from noise will end up treating trivial features as important.
Figure 4: Example of Variance

In the above figure, we can see that our model has learned extremely well from our training data, which has taught it to identify cats. But when given new data, such as the picture of a fox, our model predicts it as a cat, as that is what it has learned. When the variance is high, our model captures all the features of the data given to it, including the noise, tunes itself to that data, and predicts it very well; but when given new data it cannot generalize, because it is too specific to the training data.

Hence, our model will perform really well on the training data and achieve high accuracy there, but will fail to perform on new, unseen data. New data may not have exactly the same features, and the model won’t be able to predict it very well. This is called Overfitting.

Figure 5: Over-fitted model, showing model performance on a) training data and b) new data
Plotting Bias and Variance Using Python

While discussing model accuracy, we need to keep in mind the prediction errors, i.e. bias and variance, that will always be associated with any machine learning model. There will always be a slight difference between what our model predicts and the actual values; these differences are called errors. The goal of an analyst is not to eliminate errors but to reduce them, and there is always a tradeoff in how low both kinds of error can be made at once.
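The sketch below (synthetic data; scikit-learn and matplotlib are assumed to be available) fits polynomials of increasing degree and plots training and testing error: the left of the plot shows high-bias behaviour and the right shows high-variance behaviour.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 40))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 40)   # noisy synthetic data

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=0)

degrees = range(1, 11)
train_err, test_err = [], []
for d in degrees:
    coeffs = np.polyfit(x_tr, y_tr, d)   # fit a polynomial of degree d
    train_err.append(np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2))
    test_err.append(np.mean((np.polyval(coeffs, x_te) - y_te) ** 2))

plt.plot(degrees, train_err, label="training error")   # falls as complexity grows
plt.plot(degrees, test_err, label="testing error")     # U-shaped: bias vs. variance
plt.xlabel("model complexity (polynomial degree)")
plt.ylabel("mean squared error")
plt.legend()
plt.show()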

Bias-Variance Tradeoff

For any model, we have to find the right balance between bias and variance. This ensures that we capture the essential patterns in our data while ignoring the noise present in it. This is called the Bias-Variance Tradeoff, and it helps optimize the error in our model and keep it as low as possible.

An optimized model will be sensitive to the patterns in our data, but at the same time will be able to generalize to new data. For this, both the bias and the variance should be low, so as to prevent overfitting and underfitting.

Plotting error against model complexity shows that when bias is high, the error on both the training and testing sets is high. When variance is high, the model performs well on the training set and the training error is low, but it gives a high error on the testing set. In between there is a region where the error on both the training and testing sets is low, with bias and variance in balance.

Figure 7: Bull’s Eye Graph for Bias and Variance


The above bull’s-eye graph helps explain the bias-variance tradeoff. The best fit is when the predictions are concentrated at the center, i.e. at the bull’s-eye. We can see that as we get farther and farther away from the center, the error in our model increases. The best model is one where bias and variance are both low.

What is Bias?

In general, a machine learning model analyses the data, finds patterns in it, and makes predictions. While training, the model learns these patterns in the dataset and applies them to the test data for prediction. While making predictions, a difference occurs between the values predicted by the model and the actual/expected values, and this difference is known as bias error, or error due to bias. It can be defined as the inability of a machine learning algorithm such as Linear Regression to capture the true relationship between the data points. Every algorithm begins with some amount of bias, because bias arises from the assumptions in the model that make the target function simpler to learn. A model has either:

o Low Bias: A low-bias model makes fewer assumptions about the form of the target function.
o High Bias: A high-bias model makes more assumptions and becomes unable to capture the important features of our dataset. A high-bias model also cannot perform well on new data.

Generally, a linear algorithm has a high bias, as this is what makes it learn fast. The simpler the algorithm, the more bias is likely to be introduced, whereas a nonlinear algorithm often has low bias.

Some examples of machine learning algorithms with low bias are Decision Trees, k-Nearest Neighbours and Support Vector Machines. At the same time, algorithms with high bias are Linear Regression, Linear Discriminant Analysis and Logistic Regression.

Ways to reduce High Bias:

High bias mainly occurs when the model is too simple. Below are some ways to reduce high bias (the last point is sketched in code below):

o Increase the input features, as the model is underfitted.
o Decrease the regularization term.
o Use more complex models, e.g. by including some polynomial features.
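A sketch of the last point using scikit-learn (the data are synthetic): a plain linear model underfits a quadratic relationship, while adding polynomial features reduces the bias.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (100, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.5, 100)   # quadratic relationship plus noise

plain = LinearRegression().fit(X, y)   # straight line: underfits (high bias)
poly = make_pipeline(PolynomialFeatures(degree=2),
                     LinearRegression()).fit(X, y)   # captures the curve

print("R^2, plain linear:", plain.score(X, y))   # low: misses the pattern
print("R^2, polynomial  :", poly.score(X, y))    # high: extra features reduce bias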

What is a Variance Error?

The variance specifies the amount by which the prediction would change if different training data were used. In simple words, variance tells how much a random variable differs from its expected value. Ideally, a model should not vary too much from one training dataset to another, which means the algorithm should be good at capturing the hidden mapping between the input and output variables. Variance errors are either low variance or high variance.

Low variance means there is a small variation in the prediction of the target function with
changes in the training data set. At the same time, High variance shows a large variation in
the prediction of the target function with changes in the training dataset.

A model that shows high variance learns a lot and performs well on the training dataset, but does not generalize well to unseen data. As a result, such a model gives good results on the training dataset but shows high error rates on the test dataset.

Since, with high variance, the model learns too much from the dataset, it leads to overfitting. A model with high variance has the following problems:

o A high variance model leads to overfitting.
o Increased model complexity.

Usually, nonlinear algorithms have a lot of flexibility to fit the data, and hence have high variance.

Some examples of machine learning algorithms with low variance are Linear Regression, Logistic Regression, and Linear Discriminant Analysis. At the same time, algorithms with high variance are Decision Trees, Support Vector Machines, and k-Nearest Neighbours.

Ways to Reduce High Variance:

o Reduce the input features or the number of parameters, as the model is overfitted.
o Do not use an overly complex model.
o Increase the training data.
o Increase the regularization term (a sketch of this point in code follows below).
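A sketch of the regularization point using scikit-learn's Ridge (invented data): increasing the regularization strength alpha shrinks the coefficients, which damps the model's sensitivity to fluctuations in the training set, at the cost of a little extra bias.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 10))           # few samples, many features: prone to high variance
y = X[:, 0] + rng.normal(0, 0.1, 30)    # only the first feature actually matters

for alpha in (0.01, 1.0, 100.0):
    model = Ridge(alpha=alpha).fit(X, y)
    # larger alpha -> smaller coefficients -> lower variance
    print(alpha, np.round(np.abs(model.coef_).mean(), 3))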

Different Combinations of Bias-Variance

There are four possible combinations of bias and variance, which are represented by the below diagram:
1. Low-Bias, Low-Variance: The combination of low bias and low variance gives an ideal machine learning model. However, it is rarely achievable in practice.
2. Low-Bias, High-Variance: With low bias and high variance, model predictions are inconsistent but accurate on average. This case occurs when the model learns a large number of parameters, which leads to overfitting.
3. High-Bias, Low-Variance: With high bias and low variance, predictions are consistent but inaccurate on average. This case occurs when a model does not learn well from the training dataset or uses too few parameters. It leads to underfitting problems in the model.
4. High-Bias, High-Variance: With high bias and high variance, predictions are inconsistent and also inaccurate on average.

How to identify High Variance or High Bias?

High variance can be identified if the model has:

o Low training error and high test error.

High bias can be identified if the model has:

o High training error, with the test error almost as high as the training error.

Both checks are sketched in code below.
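A sketch on synthetic data (decision trees are used only as an example of a model whose flexibility is easy to control):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, (200, 1))
y = np.sin(4 * X.ravel()) + rng.normal(0, 0.3, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeRegressor(max_depth=None).fit(X_tr, y_tr)   # very flexible
stump = DecisionTreeRegressor(max_depth=1).fit(X_tr, y_tr)     # very simple

# Training R^2 near 1 but a noticeably lower test R^2 -> high variance (overfitting)
print("deep tree:", deep.score(X_tr, y_tr), deep.score(X_te, y_te))
# Low training R^2 close to the test R^2 -> high bias (underfitting)
print("stump    :", stump.score(X_tr, y_tr), stump.score(X_te, y_te))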

Bias-Variance Trade-Off

While building a machine learning model, it is really important to take care of bias and variance in order to avoid overfitting and underfitting. If the model is very simple, with fewer parameters, it may have low variance and high bias; whereas if the model has a large number of parameters, it will have high variance and low bias. So it is necessary to strike a balance between the bias and variance errors, and this balance between the bias error and variance error is known as the Bias-Variance trade-off.

Cross Validation

Cross-validation is a technique for evaluating ML models by training several ML models on subsets of the available input data and evaluating them on the complementary subset of the data. Use cross-validation to detect overfitting, i.e. failing to generalize a pattern.

The three steps involved in cross-validation are as follows:

1. Reserve some portion of the sample dataset.
2. Train the model using the rest of the dataset.
3. Test the model using the reserved portion of the dataset.
Methods of Cross Validation

Validation
In this method, we perform training on 50% of the given dataset and the remaining 50% is used for testing. The major drawback of this method is that since we train on only 50% of the dataset, the remaining 50% may contain important information that we miss while training our model, i.e. higher bias.
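A minimal sketch of this 50/50 validation method with scikit-learn (the iris dataset and logistic regression are chosen purely for illustration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Reserve 50% of the data for testing; train on the other 50%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", model.score(X_test, y_test))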

LOOCV (Leave One Out Cross Validation)

In this method, we train on the whole dataset but leave out a single data point, and iterate this over each data point in turn. It has advantages as well as disadvantages.
An advantage of this method is that we make use of all data points, hence it has low bias.
The major drawback is that it leads to higher variation in the testing estimate, because we test against a single data point each time; if that point is an outlier, the variation can be large. Another drawback is that it takes a lot of execution time, since it iterates as many times as there are data points.
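A sketch of LOOCV using scikit-learn's LeaveOneOut (same illustrative dataset and model as above); note that the model is refit once per data point, which is where the long execution time comes from:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# One fold per data point: 150 fits for the 150 iris samples
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", scores.mean())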
K-Fold Cross Validation
In this method, we split the dataset into k subsets (known as folds), then train on k-1 of the subsets and leave one subset out for evaluation of the trained model. We iterate k times, with a different subset reserved for testing each time.

Example
The listing below shows the training and testing subsets generated in k-fold cross-validation. Here we have 25 instances in total. In the first iteration we use the first 20 percent of the data for testing and the remaining 80 percent for training (instances [0-4] for testing, [5-24] for training), while in the second iteration we use the second subset of 20 percent for testing and the remaining four subsets for training (instances [5-9] for testing, [0-4] and [10-24] for training), and so on.

Total instances: 25
Value of k: 5
No. Iteration Training set observations Testing set observations
1 [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [0 1 2 3 4]
2 [ 0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [5 6 7 8 9]
3 [ 0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24] [10 11 12 13 14]
4 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23 24] [15 16 17 18 19]
5 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19] [20 21 22 23 24]
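The listing above can be reproduced with scikit-learn's KFold (a sketch, assuming scikit-learn is available):

import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)   # 5 folds, no shuffling
for i, (train_idx, test_idx) in enumerate(kf.split(np.arange(25)), start=1):
    print(i, train_idx, test_idx)   # matches the iterations listed above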
Comparison of train/test split to cross-validation

Advantages of train/test split:

1. It runs K times faster than K-fold cross-validation, because K-fold cross-validation repeats the train/test split K times.
2. It is simpler to examine the detailed results of the testing process.
Advantages of cross-validation:
1. More accurate estimate of out-of-sample accuracy.
2. More “efficient” use of data as every observation is used for both training and testing.
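Both approaches are short one-liners in scikit-learn; a sketch comparing them on the same illustrative data used earlier:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Single train/test split: fast, but the estimate depends on one particular split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
print("hold-out accuracy :", model.fit(X_tr, y_tr).score(X_te, y_te))

# 5-fold cross-validation: every observation is used for both training and testing
print("5-fold CV accuracy:", cross_val_score(model, X, y, cv=5).mean())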
