Unit-1 Part-1 Material
In the real world, we are surrounded by humans who can learn from their experiences, and by computers or machines that simply work on our instructions. But can a machine also learn from experiences or past data the way a human does? This is where Machine Learning comes in.
Machine Learning is a subset of artificial intelligence that is mainly concerned with the development of algorithms which allow a computer to learn from data and past experiences on its own. The term machine learning was first introduced by Arthur Samuel in 1959. We can define it in a summarized way as:
Definition
Machine learning enables a machine to automatically learn from data, improve performance
from experiences, and predict things without being explicitly programmed.
With the help of sample historical data, which is known as training data, machine learning
algorithms build a mathematical model that helps in making predictions or decisions
without being explicitly programmed. Machine learning brings computer science and
statistics together for creating predictive models. Machine learning constructs or uses the
algorithms that learn from historical data. The more information we provide, the better the performance.
A machine has the ability to learn if it can improve its performance by gaining more
data.
A Machine Learning system learns from historical data, builds the prediction models,
and whenever it receives new data, predicts the output for it. The accuracy of predicted
output depends upon the amount of data, as the huge amount of data helps to build a better
model which predicts the output more accurately.
Suppose we have a complex problem where we need to make some predictions. Instead of writing code for it, we just feed the data to generic algorithms, and with the help of these algorithms, the machine builds the logic from the data and predicts the output. Machine learning has changed our way of thinking about such problems. The block diagram below explains the working of a Machine Learning algorithm:
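To make this workflow concrete, here is a minimal sketch in Python. The tiny dataset and the choice of scikit-learn's DecisionTreeClassifier are assumptions made only for illustration; any learning algorithm could stand in for the "generic algorithm" mentioned above.

```python
# A minimal sketch of the train-then-predict workflow described above.
# The toy data and the classifier choice are illustrative assumptions.
from sklearn.tree import DecisionTreeClassifier

# Historical (training) data: [height_cm, weight_kg] -> label
X_train = [[150, 50], [160, 60], [170, 80], [180, 90]]
y_train = ["light", "light", "heavy", "heavy"]

model = DecisionTreeClassifier()   # a generic learning algorithm
model.fit(X_train, y_train)        # build the model from past data

# New, unseen data: the trained model predicts the output for it
print(model.predict([[165, 72]]))
```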
The need for machine learning is increasing day by day. The reason behind the need for
machine learning is that it is capable of doing tasks that are too complex for a person to
implement directly. As humans, we have limitations: we cannot manually process huge amounts of data, so we need computer systems, and this is where machine learning makes things easy for us.
We can train machine learning algorithms by providing them huge amounts of data and letting them explore the data, construct models, and predict the required output automatically.
The performance of the machine learning algorithm depends on the amount of data, and it can
be determined by the cost function. With the help of machine learning, we can save both time
and money.
The importance of machine learning can be easily understood by its use cases. Currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestions on Facebook, and so on. Various top companies such as Netflix and Amazon have built machine learning models that use a vast amount of data to analyze user interests and recommend products accordingly.
Machine learning can be broadly classified into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1) Supervised Learning
The system creates a model using labelled data to understand the datasets and learn about each data point. Once the training and processing are done, we test the model by providing sample data to check whether it predicts the correct output or not.
The goal of supervised learning is to map input data to output data. Supervised learning is based on supervision, and it is the same as when a student learns things under the supervision of a teacher. An example of supervised learning is spam filtering.
2) Unsupervised Learning
The training is provided to the machine with the set of data that has not been labeled,
classified, or categorized, and the algorithm needs to act on that data without any supervision.
The goal of unsupervised learning is to restructure the input data into new features or a group
of objects with similar patterns.
In unsupervised learning, we don't have a predetermined result. The machine tries to find useful insights from a huge amount of data. It can be further classified into two categories of algorithms:
o Clustering
o Association
3) Reinforcement Learning
Reinforcement learning is a feedback-based method in which an agent learns by performing actions and receiving rewards or penalties. A robotic dog that automatically learns the movement of its arms is an example of reinforcement learning.
Machine Learning (ML): Algorithms that learn from structured data to predict
outputs and discover patterns in that data.
Deep Learning (DL): Algorithms based on highly complex neural networks that
mimic the way a human brain works to detect patterns in large unstructured data
sets.
Machine Learning
Machine learning, or ML, is a type of AI that uses algorithms to learn from data in order to make sense of it or predict a pattern. Machine learning uses methods from neural networks, statistics, operations research, and physics to find hidden insights within data without being programmed where to look or what to conclude. For example, machine learning is used to develop self-learning processes, where software is given instructions on accomplishing a specific task. The machine is then trained and learns how to perform the job by analyzing relevant data, allowing it to understand how to accomplish the task and then improve its performance.
Use machine learning when you’re looking to teach a model how to perform a task,
such as predicting an output or discovering a pattern using structured data
(see Structured vs. Unstructured Data for definitions). For example, Spotify builds
you a customized playlist based on your favourite songs and the data from other
users who share your likes and dislikes.
Structured vs. Unstructured Data
Structured data (quantitative data) is organized in a predefined format, such as rows and columns in a table, which makes it straightforward to store and analyze. Unstructured data (qualitative data) is typically easy and inexpensive to store and can exist in many different formats, as it does not have a predefined structure. However, because it comes in so many forms, it is not easy to analyze and leverage. DL is commonly used for unstructured data and is the best option for the most challenging use cases. Examples of unstructured data include photos, audio, and video files.
The Benefits of Machine Learning
Accurate Forecasting
Companies gain significant and precise insights when integrating machine learning
with their data analytics to forecast factors such as market trends and consumer
buying habits. This helps companies save on costs and better manage their
inventory. ML can also estimate other factors, such as transportation costs, future demand, and delivery lead times. Machine learning is used in this scenario over deep learning because ML models are better equipped to handle structured data, which is what forecasting relies on, and are better at predicting trends.
Automation
Using machine learning, businesses can reduce the time spent analyzing
complicated data sets. The results and tasks accomplished by machine learning
models are often very reliable and well done. This is because the model can learn
from itself by making its predictions and improving its algorithms, meaning that no
human intervention is needed. Meanwhile, a deep learning model requires human
intervention during its early stages as someone needs to review its results since it
works with unstructured data.
Identifying Trends and Patterns
Machine learning models are designed to handle large sets of structured data and analyze them to discover patterns and trends humans wouldn't identify. A deep learning model is not recommended here, as it's not designed to recognize trends and patterns within structured data.
Machine Learning Applications
Chatbots
Educational Tools
Educational tools, such as apps that teach you different languages, also use
machine learning. By analyzing the data you provided from completing sections of
the course, ML uses that knowledge to adjust the educational system to meet your
needs. Deep learning does not apply to this function, as educational apps primarily
use structured data.
Streaming Platforms
Deep Learning
Deep learning is the evolution of machine learning and neural networks, which uses advanced computer programming and training to understand complex patterns hidden in large data sets. DL is about understanding how the human brain works in different situations and then trying to recreate its behaviour. Deep learning is used for complicated problems such as facial recognition, defect detection and image processing.
When to Use Deep Learning?
Deep learning is used to complete complex tasks and train models using
unstructured data. For example, deep learning is commonly used in image
classification tasks like facial recognition. Although machine learning models can
also identify faces, deep learning models are more accurate. In this case, it takes
the unstructured data (images of faces) and extracts factors such as the various
facial features. The extracted features are then matched to those stored in a
database.
While machine learning models can handle various types of data, they are limited
when understanding unstructured data (such as handwriting, images and voices).
This means that the knowledge hidden in this data may go unnoticed, and it is
where deep learning fills the gap. When businesses train their deep learning
models, they must do so with unstructured data, as it can help the company
optimize many of their business-related functions.
Scalability
Since deep learning models are better at supporting parallel and distributed algorithms, the amount of time it takes for a DL model to learn the relevant parameters is significantly reduced. The models can be trained locally (using only one machine); however, this becomes challenging when working with massive data sets. Parallel and distributed algorithms allow the data (or the model) to be distributed across multiple machines, making the training more effective. They also speed up the time the model needs to learn and train, saving the company time and money.
Virtual Assistants
Deep learning is used in virtual assistants such as Alexa and Siri, which use
Natural Language Processing (NLP). NLP analyzes and understands unstructured
data, such as forms of human language (written and verbal). It also analyzes factors
such as language recognition, sentiment analysis and text classification and then
creates the appropriate response to your input. When using NLP, it's recommended to use deep learning, as it better understands unstructured data such as written and verbal language, which helps in tasks like sentiment analysis.
Self-Driving Vehicles
Self-driving cars are autonomous decision-making systems that process data from
multiple sensors such as cameras, LiDAR, RADAR and GPS. The data collected is
then analyzed using deep learning algorithms to produce relevant decisions
depending on the car’s environment. Deep learning plays a role in a self-driving
car’s perception as it helps the car recognize and classify objects, buildings,
beings, road signs, traffic lights, etc., picked up by its sensors and cameras. DL is
also used to improve the visual odometry of the car, which helps the car calculate
its position and orientation while navigating (Source: Neptune.ai ).
Manufacturing
In summary: Machine Learning, or ML, a subset of AI, uses algorithms to learn from data and make sense of the data or predict patterns. ML is used when you're looking to teach a model how to predict an output or discover a trend using structured data. Deep Learning, or DL, a subset of ML, is the evolution of machine learning and neural networks, which uses advanced computer programming and training to understand complex patterns hidden in large data sets, similar to a human brain.
Types of Machine Learning
Machine learning is a subset of AI, which enables the machine to automatically learn
from data, improve performance from past experiences, and make predictions. Machine
learning contains a set of algorithms that work on a huge amount of data. Data is fed to these
algorithms to train them, and on the basis of training, they build the model & perform a
specific task.
Based on the methods and way of learning, machine learning is divided into mainly four types, which are:
1. Supervised learning
2. Unsupervised learning
3. Semi-supervised learning
4. Reinforcement learning
1. Supervised Machine Learning
As its name suggests, Supervised machine learning is based on supervision. It means in the
supervised learning technique, we train the machines using the "labelled" dataset, and based
on the training, the machine predicts the output. Here, the labelled data specifies that some of
the inputs are already mapped to the output. More preciously, we can say; first, we train the
machine with the input and corresponding output, and then we ask the machine to predict the
output using the test dataset.
Suppose we have an input dataset of cats and dog images. So, first, we will provide the
training to the machine to understand the images, such as the shape & size of the tail of cat
and dog, Shape of eyes, colour, height (dogs are taller, cats are smaller), etc. After
completion of training, we input the picture of a cat and ask the machine to identify the object
and predict the output. Now, the machine is well trained, so it will check all the features of
the object, such as height, shape, colour, eyes, ears, tail, etc., and find that it's a cat. So, it will
put it in the Cat category. This is the process of how the machine identifies the objects in
Supervised Learning.
The main goal of the supervised learning technique is to map the input variable(x) with
the output variable(y). Some real-world applications of supervised learning are Risk
Assessment, Fraud Detection, Spam filtering, etc.
Supervised machine learning can be classified into two types of problems, which are given
below:
o Classification
o Regression
a) Classification
Classification algorithms are used to solve classification problems, in which the output variable is categorical, such as "Yes" or "No", "Male" or "Female", "Red" or "Blue", etc. The
classification algorithms predict the categories present in the dataset. Some real-world
examples of classification algorithms are Spam Detection, Email filtering, etc.
b) Regression
Regression algorithms are used to solve regression problems, in which the output variable is a continuous value and there is a relationship between the input and output variables. These are used to predict continuous output variables, such as market trends, weather prediction, etc.
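As a hedged illustration of the difference, the sketch below trains one classifier (categorical output) and one regressor (continuous output) with scikit-learn. The toy spam word-counts and house sizes are invented for the example.

```python
# Classification vs regression: categorical output vs continuous output.
# The tiny datasets below are illustrative assumptions, not real data.
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: predict a category (spam = 1, not spam = 0) from two word counts
X_cls = [[0, 1], [3, 0], [4, 1], [0, 0]]
y_cls = [0, 1, 1, 0]
clf = LogisticRegression().fit(X_cls, y_cls)
print("predicted class:", clf.predict([[2, 0]]))

# Regression: predict a continuous value (price in $1000s) from size in sq. ft
X_reg = [[800], [1000], [1200], [1500]]
y_reg = [120.0, 150.0, 180.0, 225.0]
reg = LinearRegression().fit(X_reg, y_reg)
print("predicted price:", reg.predict([[1100]]))
```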
Advantages:
o Since supervised learning works with a labelled dataset, we can have an exact idea about the classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages:
o These algorithms are not able to solve complex tasks on their own, and training them can require a lot of computation time.
o The model may predict the wrong output if the test data is very different from the training data.
Applications of Supervised Learning:
o Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process,
image classification is performed on different image data with pre-defined labels.
o Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis purposes. It is
done by using medical images and past data labelled with disease
conditions. With such a process, the machine can identify a disease for the new
patients.
o Fraud Detection - Supervised Learning classification algorithms are used for
identifying fraud transactions, fraud customers, etc. It is done by using historic data to
identify the patterns that can lead to possible fraud.
o Spam detection - In spam detection & filtering, classification algorithms are used.
These algorithms classify an email as spam or not spam. The spam emails are sent to
the spam folder.
o Speech Recognition - Supervised learning algorithms are also used in speech
recognition. The algorithm is trained with voice data, and various identifications can
be done using the same, such as voice-activated passwords, voice commands, etc.
2. Unsupervised Machine Learning
Unsupervised learning is different from the supervised learning technique; as its name suggests, there is no need for supervision. It means that, in unsupervised machine learning, the machine is trained using an unlabelled dataset and predicts the output without any supervision.
In unsupervised learning, the models are trained with the data that is neither classified nor
labelled, and the model acts on that data without any supervision.
The main aim of the unsupervised learning algorithm is to group or categorize the unsorted dataset according to similarities, patterns, and differences. Machines are instructed to find the hidden patterns in the input dataset.
Let's take an example to understand it more precisely; suppose there is a basket of fruit
images, and we input it into the machine learning model. The images are totally unknown to
the model, and the task of the machine is to find the patterns and categories of the objects.
So, now the machine will discover its patterns and differences, such as colour difference,
shape difference, and predict the output when it is tested with the test dataset.
Categories of Unsupervised Machine Learning
Unsupervised Learning can be further classified into two types, which are given below:
o Clustering
o Association
1) Clustering
The clustering technique is used when we want to find the inherent groups from the data. It is
a way to group the objects into a cluster such that the objects with the most similarities
remain in one group and have fewer or no similarities with the objects of other groups. An
example of the clustering algorithm is grouping the customers by their purchasing behaviour.
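A minimal sketch of this customer-grouping idea is shown below, assuming scikit-learn's KMeans and two invented features (visits per month and average spend).

```python
# Clustering sketch: group customers by purchasing behaviour.
# The two features and the customer values are illustrative assumptions.
from sklearn.cluster import KMeans

customers = [[2, 20], [3, 25], [2, 22],          # infrequent, low-spend customers
             [10, 200], [12, 220], [11, 210]]    # frequent, high-spend customers

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster assignment for each customer
print(kmeans.cluster_centers_)  # centre of each discovered group
```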
2) Association
Association rule learning finds interesting relationships (dependencies) among variables in a large dataset, such as items that are frequently bought together. Some popular algorithms of association rule learning are the Apriori algorithm, Eclat, and the FP-growth algorithm.
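To illustrate the idea behind association rule learning, the toy sketch below simply counts how often pairs of items appear together in some invented shopping baskets and reports their support. Real implementations such as Apriori (for example in the mlxtend library) do this far more efficiently.

```python
# Toy sketch of the frequent-itemset idea behind association rule learning.
# The transactions are invented for illustration.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of a pair = fraction of transactions containing both items
for pair, count in pair_counts.most_common(3):
    print(pair, "support =", count / len(transactions))
```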
Advantages:
o These algorithms can be used for complicated tasks compared to the supervised ones
because these algorithms work on the unlabeled dataset.
o Unsupervised algorithms are preferable for various tasks as getting the unlabeled
dataset is easier as compared to the labelled dataset.
Disadvantages:
o The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and the algorithms are not trained with the exact output in advance.
o Working with unsupervised learning is more difficult, as it works with unlabelled data that does not map to a known output.
3. Semi-Supervised Learning
Semi-supervised learning lies between supervised and unsupervised learning: the model is trained on a small amount of labelled data together with a large amount of unlabelled data. We can imagine these algorithms with an example. Supervised learning is where a student is under the supervision of an instructor at home and college. Further, if that student is self-analysing the same concept without any help from the instructor, it comes under unsupervised learning. Under semi-supervised learning, the student has to revise himself after analysing the same concept under the guidance of an instructor at college.
Advantages:
Disadvantages:
4. Reinforcement Learning
In reinforcement learning, there is no labelled data like supervised learning, and agents learn
from their experiences only.
The reinforcement learning process is similar to a human being; for example, a child learns
various things by experiences in his day-to-day life. An example of reinforcement learning is
to play a game, where the Game is the environment, moves of an agent at each step define
states, and the goal of the agent is to get a high score. The agent receives feedback in terms of rewards and punishments.
Due to its way of working, reinforcement learning is employed in different fields such as game theory, operations research, information theory, and multi-agent systems.
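A very small, hedged sketch of this reward-driven learning loop is shown below: tabular Q-learning on an invented one-dimensional world where the agent earns a reward only at the goal state. The environment, rewards, and hyper-parameters are assumptions for illustration only.

```python
# Tabular Q-learning on a toy 1-D world: states 0..4, reward +1 at state 4.
import random

n_states = 5
actions = [-1, +1]                          # move left or move right
Q = [[0.0, 0.0] for _ in range(n_states)]   # Q[state][action index]
alpha, gamma, epsilon = 0.5, 0.9, 0.1       # learning rate, discount, exploration rate

for episode in range(200):
    state = 0
    while state != n_states - 1:
        # Explore randomly sometimes (and whenever the Q-values are tied), else act greedily
        if random.random() < epsilon or Q[state][0] == Q[state][1]:
            a = random.randrange(2)
        else:
            a = Q[state].index(max(Q[state]))
        nxt = min(max(state + actions[a], 0), n_states - 1)
        reward = 1.0 if nxt == n_states - 1 else 0.0
        # Q-learning update: move the value of (state, action) towards reward + discounted future value
        Q[state][a] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][a])
        state = nxt

print([round(max(q), 2) for q in Q])  # learned state values grow towards the goal
```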
Advantages
Disadvantages
o RL algorithms are not preferred for simple problems.
o RL algorithms require huge data and computations.
o Too much reinforcement learning can lead to an overload of states which can weaken
the results.
o The curse of dimensionality limits reinforcement learning for real physical systems.
Batch and Online Learning
Another criterion used to classify machine learning systems is whether or not the system can learn incrementally from a stream of incoming data.
Batch Learning
In batch learning, the system is incapable of learning incrementally: it must be trained using
all the available data. This will generally take a lot of time and computing resources, so it is
typically done offline. First the system is trained, and then it is launched into production and
runs without learning anymore; it just applies what it has learned. This is called offline
learning.
For the model to learn about the new data, the model would need to be trained with all the data from scratch. The old model would then need to be replaced with the new model. And, as part of batch learning, the whole process of training, evaluating, and launching a machine learning system gets automated. The following picture represents the automation of batch learning.
Model training using the full set of data can take many hours and requires a lot of computing resources (CPU, memory, disk space, disk I/O, network I/O, etc.). For this reason, batch training is typically run only periodically (for example, daily or weekly) rather than continuously.
Online Learning
In online learning, the training happens in an incremental manner by continuously feeding data as it arrives, or in small groups (mini-batches). Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives.
Online learning is great for machine learning systems that receive data as a continuous
flow (e.g., stock prices) and need to adapt to change rapidly or autonomously. It is also a
good option if you have limited computing resources: once an online learning system has
learned about new data instances, it does not need them anymore, so you can discard them
(unless you want to be able to roll back to a previous state and "replay" the data) or move the data to another form of storage (warm or cold storage) if you are using a data lake. This can save a huge amount of space and cost.
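A hedged sketch of online learning with scikit-learn's SGDRegressor, which supports incremental training through partial_fit, is shown below. The stream of mini-batches is simulated here; in practice each batch would arrive continuously and could be discarded after the update.

```python
# Online (incremental) learning sketch: the simulated stream and the
# linear relationship y = 3x + 2 are illustrative assumptions.
import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=0)

rng = np.random.default_rng(0)
for step in range(100):                        # each step = one incoming mini-batch
    X_batch = rng.uniform(0, 10, size=(16, 1))
    y_batch = 3.0 * X_batch.ravel() + 2.0 + rng.normal(0, 0.5, 16)
    model.partial_fit(X_batch, y_batch)        # learn incrementally, then discard the batch

print(model.coef_, model.intercept_)           # should approach slope 3 and intercept 2
```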
Instance-based Learning
In instance-based learning, the system learns the training data by heart. At the time of making a prediction, the system uses a similarity measure to compare the new cases with the learned
data. K-nearest neighbors (KNN) is an algorithm that belongs to the instance-based learning
class of algorithms. KNN is a non-parametric algorithm because it does not assume any
specific form or underlying structure in the data. Instead, it relies on a measure of similarity
between each pair of data points. Generally speaking, this measure is based on either
Euclidean distance or cosine similarity; however, other forms of metric can be used
depending on the type of data being analyzed. Once the similarity between two points is
calculated, KNN looks at how many neighbors are within a certain radius around that point
and uses these neighbors as examples to make its prediction. This means that instead of
creating a generalizable model from all of the data, KNN looks for similarities among
individual data points and makes predictions accordingly. The picture below demonstrates how a new instance would be predicted as a triangle, based on the greater number of triangles in its proximity.
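The following minimal sketch shows the instance-based idea directly: the "model" is just the stored training points, and a new point is classified by a majority vote among its k nearest neighbours. The 2-D points and labels are invented to mirror the triangle example above.

```python
# A from-scratch k-nearest-neighbours sketch (instance-based learning).
# The training points and labels are illustrative assumptions.
from collections import Counter
import math

train = [((1.0, 1.0), "circle"), ((1.5, 2.0), "circle"),
         ((5.0, 5.0), "triangle"), ((6.0, 5.5), "triangle"), ((5.5, 6.0), "triangle")]

def knn_predict(x, k=3):
    # Similarity measure: Euclidean distance to every stored training point
    nearest = sorted(train, key=lambda point: math.dist(x, point[0]))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((5.2, 5.4)))   # -> "triangle", by majority vote of its neighbours
```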
Model-based Learning
In model-based learning, instead of memorizing the training examples, the system builds a model of the data (for example, by fitting the parameters of a mathematical function) and then uses that model to make predictions on new cases.
Instance-based vs Model-based Learning: Differences
Machine learning is a field of artificial intelligence that deals with giving machines the ability
to learn without being explicitly programmed. In this context, instance-based learning and
model-based learning are two different approaches used to create machine learning models.
While both approaches can be effective, they also have distinct differences that must be taken
into account when building a machine learning system. Let’s explore the differences between
these two types of machine learning.
In addition to providing accurate predictions, one major advantage of using KNN over other supervised learning algorithms is its versatility: KNN can be used with both numeric datasets, such as when predicting house prices, and categorical datasets, such as when predicting whether a website visitor will purchase a product or not. Furthermore, KNN has very few parameters to tune, since it does not assume any underlying structure in the data that needs to be fitted; the main choices are the number of neighbours k and how closeness (similarity) between two points is measured.
Because KNN is an instance-based learning algorithm, it is not suitable for very large
datasets. This is because the model has to store all of the training examples in memory, and
making predictions on new data points involves comparing the new point to all of the stored
training examples. However, for small or medium-sized datasets, KNN can be a very
effective learning algorithm.
Other instance-based learning algorithms include learning vector quantization
(LVQ) and self-organizing maps (SOMs). These algorithms also memorize the training
examples and use them to make predictions on new data, but they use different techniques to
do so.
Machine Learning is the study of learning algorithms that use past experience to make future decisions. Although Machine Learning has a wide variety of models, here is a list of the machine learning algorithms most commonly used by data scientists and professionals today:
o Linear Regression
o Logistic Regression
o Decision Tree
o Bayes Theorem and Naïve Bayes Classification
o Support Vector Machine (SVM) Algorithm
o K-Nearest Neighbor (KNN) Algorithm
o K-Means
o Gradient Boosting algorithms
o Dimensionality Reduction Algorithms
o Random Forest
Common Issues in Machine Learning
The major issue that comes up while using machine learning algorithms is the lack of quality as well as quantity of data. Although data plays a vital role in machine learning, many data scientists report that inadequate, noisy, and unclean data make it extremely difficult for machine learning algorithms to perform well. For example, a simple task may require thousands of sample data points, while an advanced task such as speech or image recognition needs millions of examples. Data quality is also important for the algorithms to work well, yet poor data quality is common in Machine Learning applications. Data quality can be affected by factors such as:
o Noisy Data- It is responsible for an inaccurate prediction that affects the decision as
well as accuracy in classification tasks.
o Incorrect data - It is also responsible for faulty results obtained from machine learning models; hence, incorrect data may affect the accuracy of the results.
o Generalizing of output data - Sometimes generalizing the output data becomes complex, which results in comparatively poor future actions.
As we have discussed above, data plays a significant role in machine learning, and it must be
of good quality as well. Noisy data, incomplete data, inaccurate data, and unclean data lead to
less accuracy in classification and low-quality results. Hence, data quality can also be
considered as a major common problem while processing machine learning algorithms.
To make sure our training model generalizes well, we have to ensure that the sample training data is representative of the new cases to which we need to generalize. The training data must cover all the cases that have already occurred, as well as those that are occurring.
If we use non-representative training data in the model, it results in less accurate predictions. A machine learning model is said to be ideal if it predicts well for generalized cases and provides accurate decisions. If there is too little training data, there will be sampling noise in the model; such data is called a non-representative training set, and the resulting model won't be accurate in its predictions and will end up biased towards one class or group. Hence, we should use representative training data to protect against bias and to make accurate predictions without any drift.
Overfitting:
Overfitting is one of the most common issues faced by Machine Learning engineers and data scientists. Whenever a machine learning model is trained on a huge amount of data, it can start capturing the noise and inaccuracies present in the training data set, which negatively affects the performance of the model. Let's understand this with a simple example where the training data set contains 1000 mangoes, 1000 apples, 1000 bananas, and 5000 papayas. There is then a considerable probability of an apple being identified as a papaya, because we have a massive amount of biased data in the training data set, so the prediction is negatively affected. Overfitting often arises when very flexible, non-linear methods are fitted to limited or noisy data, as they can build unrealistic models of the data. We can reduce overfitting by using simpler (for example, linear or parametric) models, regularization, cross-validation, or more training data.
Underfitting:
Underfitting is just the opposite of overfitting. Whenever a machine learning model is trained with too little data, or the model is too simple, it produces incomplete and inaccurate predictions and destroys the accuracy of the machine learning model.
Underfitting occurs when our model is too simple to capture the underlying structure of the data, just like an undersized pant. This generally happens when we have limited data in the data set and we try to fit a linear model to non-linear data. In such scenarios, the rules of the machine learning model are too simple for this data set, and the model starts making wrong predictions as well.
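The hedged sketch below makes the overfitting/underfitting contrast concrete: polynomials of increasing degree are fitted to a small noisy sample of a quadratic relationship, and the training and test errors are compared. The data, degrees, and random seed are assumptions for illustration only.

```python
# Underfitting vs overfitting: compare train/test error of polynomial fits.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 30)
y = x**2 + rng.normal(0, 1.0, 30)            # true relationship is quadratic plus noise
x_tr, y_tr = x[:20], y[:20]                  # training sample
x_te, y_te = x[20:], y[20:]                  # held-out test sample

for degree in (1, 2, 9):
    coefs = np.polyfit(x_tr, y_tr, degree)   # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coefs, x_tr) - y_tr) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_te) - y_te) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")

# A degree-1 model tends to underfit (both errors high); a degree-9 model tends to
# overfit (very low training error, higher test error); degree 2 balances the two.
```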
As we know, generalized output data is mandatory for any machine learning model; hence, regular monitoring and maintenance become compulsory. Different results for different actions require changes to the data; hence, editing the code, as well as providing resources for monitoring the model, also become necessary.
A machine learning model operates in a specific context, and when that context changes, it can result in bad recommendations and concept drift in the model. Let's understand this with an example: at a specific time, a customer is looking for some gadgets, but the customer's requirements change over time while the machine learning model keeps showing the same recommendations, even though the customer's expectations have changed. This phenomenon is called data drift. It generally occurs when new data is introduced or the interpretation of the data changes. However, we can overcome it by regularly monitoring and updating the data according to the changed expectations.
Although Machine Learning and Artificial Intelligence are continuously growing in the market, these fields are still relatively young compared to others. The absence of skilled resources in the form of manpower is also an issue. Hence, we need people with in-depth knowledge of mathematics, science, and technology for developing and managing machine learning systems.
Process Complexity of Machine Learning
The machine learning process itself is very complex, which is another major issue faced by machine learning engineers and data scientists. Machine Learning and Artificial Intelligence are very new technologies, still partly in an experimental phase and continuously changing over time. There is a lot of trial and error involved, so the probability of error is higher than expected. Further, the process also includes analyzing the data, removing data bias, training the data, applying complex mathematical calculations, and more, making the procedure complicated and quite tedious.
Data bias is also a big challenge in Machine Learning. These errors exist when certain elements of the dataset are weighted more heavily or given more importance than others. Biased data leads to inaccurate results, skewed outcomes, and other analytical errors. However, we can resolve this by determining where the data is actually biased in the dataset and then taking the necessary steps to reduce that bias.
Slow implementation is another issue commonly seen in machine learning models. Machine learning models are highly efficient at producing accurate results but can be time-consuming. Slow programs, excessive requirements, and overloaded data take more time than expected to provide accurate results. This requires continuous maintenance and monitoring of the model to deliver accurate results.
Although machine learning models are intended to give the best possible outcome, if we feed garbage data as input, the result will also be garbage. Hence, we should use relevant features in our training sample. A machine learning model is said to be good if the training data has a good set of features and few or no irrelevant features.
Statistical Learning
A main challenge in data science is the mathematical analysis of the data. When the goal is
to interpret the model and quantify the uncertainty in the data, this analysis is usually referred
to as statistical learning.
Statistical learning theory is a framework for machine learning that draws from statistics and
functional analysis. It deals with finding a predictive function based on the data presented. The
main idea in statistical learning theory is to build a model that can draw conclusions from data
and make predictions.
The two main goals of statistical learning are:
i. To accurately predict some future quantity of interest, given some observed data.
ii. To discover unusual or interesting patterns in the data.
To achieve these two goals, one must rely on knowledge from three important pillars:
1. Function Approximation: Building a mathematical model for the data usually means understanding how one data variable depends on the others; the most natural way to represent such a relationship is via a mathematical function.
2. Optimization: Given a class of approximating functions, we wish to find the best possible function in that class.
3. Probability and Statistics: In general, the data used to fit the model is viewed as a
realization of a random process, whose probability law determines the accuracy with which
we can predict future observations.
Dependent Variable — a variable (y) whose values depend on the values of other
variables (a dependent variable is sometimes also referred to as a target variable)
Independent Variables — a variable (x) whose value does not depend on the values of
other variables (independent variables are sometimes also referred to as predictor
variables, input variables, explanatory variables, or features)
In statistical learning, the independent variable(s) are the variable that will affect the
dependent variable.
A common example of an Independent Variable is age. There is nothing one can do to increase or decrease age, so this variable is independent.
A common example of a Dependent Variable is weight — a person's weight depends on his or her age, diet, and activity levels (as well as other factors).
In graphs, the independent variable is often plotted along the x-axis while the dependent
variable is plotted along the y-axis.
For example, when showing how the price of a home is affected by the size of the home, square footage is the independent variable while the price of the home is the dependent variable.
Supervised and Unsupervised Learning
Feature - input
Response - output
Given an input or feature vector x, one of the main goals of machine learning is to predict an output or response variable y.
Example
1. x could be a digitized signature and y a binary variable that indicates whether the signature
is genuine or false.
2. x represents the weight and smoking habits of an expecting mother and y the birth weight
of the baby.
Prediction Function
The data science attempt at this prediction is encoded in a mathematical function g, called the prediction function, which takes as input x and outputs a guess g(x) for y (denoted ŷ, for example). In a sense, g encompasses all the information about the relationship between the variables x and y, excluding the effects of chance and randomness in nature.
Regression
A regression model is able to show whether changes observed in the dependent variable are
associated with changes in one or more of the explanatory variables.
It does this by essentially fitting a best-fit line and seeing how the data is dispersed
around this line.
Calculating Regression
Linear regression models often use a least-squares approach to determine the line of best fit.
The least-squares technique determines this line by minimizing the sum of squared residuals, where each squared residual is the squared distance between a data point and the regression line (or the mean value of the data set).
Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u
where:
Y = the dependent variable you are trying to predict or explain
X = the explanatory (independent) variable(s) you are using to predict or associate with Y
a = the y-intercept
b = the slope (regression coefficient) of each explanatory variable
u = the regression residual or error term
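As a small illustration of the least-squares fit described above, the sketch below estimates the intercept a and slope b for an invented home-size/price dataset using NumPy's polyfit.

```python
# Least-squares linear regression: Y = a + b*X, fitted by minimizing squared residuals.
# The home-size and price numbers are illustrative assumptions.
import numpy as np

size_sqft = np.array([800, 1000, 1200, 1500, 1800])   # X, the independent variable
price_k   = np.array([120, 150, 175, 230, 260])       # Y, the dependent variable (in $1000s)

b, a = np.polyfit(size_sqft, price_k, 1)               # degree-1 fit returns (slope, intercept)
print(f"intercept a = {a:.2f}, slope b = {b:.4f}")
print("predicted price for 1300 sq. ft:", a + b * 1300)
```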
There are two main types of errors present in any machine learning model. They are
Reducible Errors and Irreducible Errors.
Irreducible errors are errors which will always be present in a machine learning model,
because of unknown variables, and whose values cannot be reduced.
Reducible errors are those errors whose values can be further reduced to improve a model.
They are caused because our model’s output function does not match the desired output
function and can be optimized.
We can further divide reducible errors into two: Bias and Variance.
Figure 1: Errors in Machine Learning
What is Bias?
To make predictions, our model analyzes our data and finds patterns in it. Using these patterns, we can make generalizations about certain instances in our data. After training, our model learns these patterns and applies them to the test set to make predictions.
Bias is the difference between our actual and predicted values. Bias is the simple assumptions
that our model makes about our data to be able to predict new data.
Figure 2: Bias
When the bias is high, the assumptions made by our model are too basic and the model can't capture the important features of our data. This means that our model hasn't captured the patterns in the training data and hence cannot perform well on the testing data either. If this is the case, our model cannot handle new data and cannot be sent into production.
This instance, where the model cannot find patterns in our training set and hence fails for
both seen and unseen data, is called Underfitting.
The below figure shows an example of Underfitting. As we can see, the model has found no
patterns in our data and the line of best fit is a straight line that does not pass through any of
the data points. The model has failed to train properly on the data given and cannot predict
new data either.
Figure 3: Underfitting
What is Variance?
Variance is the very opposite of bias. During training, the model is allowed to 'see' the data a certain number of times to find patterns in it. If it does not work on the data for long enough, it will not find patterns, and bias occurs. On the other hand, if our model is allowed to view the data too many times, it will learn very well only for that data: it will capture most patterns in the data, but it will also learn from the unnecessary data present, that is, from the noise.
We can define variance as the model’s sensitivity to fluctuations in the data. Our model may
learn from noise. This will cause our model to consider trivial features as important.
Figure 4: Example of Variance
In the above figure, we can see that our model has learned extremely well for our training
data, which has taught it to identify cats. But when given new data, such as the picture of a
fox, our model predicts it as a cat, as that is what it has learned. This happens when the variance is high: our model captures all the features of the data given to it, including the noise, tunes itself to that data, and predicts it very well; but when given new data it cannot generalize, as it is too specific to the training data.
Hence, our model will perform really well on the training data and get high accuracy, but will fail to perform on new, unseen data. New data may not have exactly the same features, and the model won't be able to predict it very well. This is called Overfitting.
Figure 5: Over-fitted model where we see model performance on, a) training data b)
new data
Bias-Variance Tradeoff
While discussing model accuracy, we need to keep in mind the prediction errors, i.e. bias and variance, that will always be associated with any machine learning model. There will always be a slight difference between what our model predicts and the actual values. These differences are called errors. The goal of an analyst is not to eliminate errors but to reduce them. There is always a tradeoff in how low the two kinds of error can be made simultaneously.
For any model, we have to find the perfect balance between Bias and Variance. This just
ensures that we capture the essential patterns in our model while ignoring the noise present in it. This is called the Bias-Variance Tradeoff. It helps optimize the error in our model and keeps it
as low as possible.
An optimized model will be sensitive to the patterns in our data, but at the same time will be
able to generalize to new data. In this, both the bias and variance should be low so as to
prevent overfitting and underfitting.
In the above figure, we can see that when bias is high, the error in both the training and testing sets is also high. If we have high variance, the model performs well on the training set (the error is low) but gives a high error on the testing set. There is a region in the middle where the error in both the training and testing sets is low, and the bias and variance are in balance.
What is Bias?
In general, a machine learning model analyses the data, finds patterns in it and makes predictions. While training, the model learns these patterns in the dataset and applies them to the test data for prediction. While making predictions, a difference occurs between the values predicted by the model and the actual/expected values, and this difference is known
as bias errors or Errors due to bias. It can be defined as an inability of machine learning
algorithms such as Linear Regression to capture the true relationship between the data points.
Each algorithm begins with some amount of bias because bias occurs from assumptions in the
model, which makes the target function simple to learn. A model has either:
o Low Bias: A low bias model will make fewer assumptions about the form of the
target function.
o High Bias: A model with a high bias makes more assumptions, and the model
becomes unable to capture the important features of our dataset. A high bias model
also cannot perform well on new data.
Generally, a linear algorithm has high bias, which makes it learn fast. The simpler the algorithm, the more bias it is likely to introduce. A nonlinear algorithm, on the other hand, often has low bias.
Some examples of machine learning algorithms with low bias are Decision Trees, k-Nearest
Neighbours and Support Vector Machines. At the same time, an algorithm with high bias
is Linear Regression, Linear Discriminant Analysis and Logistic Regression.
High bias mainly occurs when the model is too simple. Below are some ways to reduce high bias:
o Increase the number of input features, as the model may be underfitted.
o Decrease the regularization term.
o Use a more complex model, for example by including some polynomial features.
What is Variance?
Variance specifies how much the prediction would change if a different training data set were used. In simple words, variance tells us how much a random variable differs from its expected value. Ideally, a model should not vary too much from one training dataset to another, which means the algorithm should be good at understanding the hidden mapping between the input and output variables. Variance errors are either low variance or high variance.
Low variance means there is a small variation in the prediction of the target function with
changes in the training data set. At the same time, High variance shows a large variation in
the prediction of the target function with changes in the training dataset.
A model that shows high variance learns a lot and performs well on the training dataset, but does not generalize well to an unseen dataset. As a result, such a model gives good results with the training dataset but shows high error rates on the test dataset. Since, with high variance, the model learns too much from the dataset, high variance leads to overfitting of the model.
Usually, nonlinear algorithms, which have a lot of flexibility to fit the model, have high variance. Some examples of machine learning algorithms with low variance are Linear Regression, Logistic Regression, and Linear Discriminant Analysis. At the same time, algorithms with high variance are Decision Trees, Support Vector Machines, and K-Nearest Neighbours.
There are four possible combinations of bias and variances, which are represented by the
below diagram:
1. Low-Bias, Low-Variance:
The combination of low bias and low variance shows an ideal machine learning
model. However, it is not possible practically.
2. Low-Bias, High-Variance: With low bias and high variance, model predictions are
inconsistent and accurate on average. This case occurs when the model learns with a
large number of parameters and hence leads to an overfitting
3. High-Bias, Low-Variance: With high bias and low variance, predictions are consistent but inaccurate on average. This case occurs when a model does not learn well from the training dataset or uses only a few parameters. It leads to underfitting problems in the model.
4. High-Bias, High-Variance:
With high bias and high variance, predictions are inconsistent and also inaccurate on
average.
o A high-bias (underfitted) model can be identified by a high training error, with the test error almost similar to the training error.
Bias-Variance Trade-Off
While building the machine learning model, it is really important to take care of bias and
variance in order to avoid overfitting and underfitting in the model. If the model is very
simple with fewer parameters, it may have low variance and high bias. Whereas, if the model
has a large number of parameters, it will have high variance and low bias. So, it is required to
make a balance between bias and variance errors, and this balance between the bias error and
variance error is known as the Bias-Variance trade-off.
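One way to see this trade-off empirically is to train the same kind of model on many resampled training sets and measure, at a fixed test point, how far the average prediction is from the truth (bias) and how much the predictions spread out (variance). The sketch below does this for a simple and a flexible polynomial model; the data-generating function and all settings are assumptions for illustration.

```python
# Empirical bias/variance estimate at one test point, by repeated refitting.
import numpy as np

rng = np.random.default_rng(0)
true_f = np.sin                                 # assumed true relationship
x_test = 2.0                                    # fixed point at which we measure

def predictions(degree, n_runs=200):
    preds = []
    for _ in range(n_runs):
        x = rng.uniform(0, np.pi, 15)           # a fresh noisy training set each run
        y = true_f(x) + rng.normal(0, 0.3, 15)
        coefs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coefs, x_test))
    return np.array(preds)

for degree in (1, 5):
    p = predictions(degree)
    print(f"degree {degree}: bias^2 = {(p.mean() - true_f(x_test))**2:.4f}, "
          f"variance = {p.var():.4f}")

# The simple model (degree 1) tends to show higher bias and lower variance;
# the flexible model (degree 5) tends to show lower bias but higher variance.
```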
Cross Validation
Validation
In this method, we perform training on 50% of the given data set and the remaining 50% is used for testing. The major drawback of this method is that, because we train on only 50% of the dataset, the remaining 50% may contain important information that the model never sees during training, i.e. this can lead to higher bias.
Example
The diagram below shows an example of the training subsets and evaluation subsets generated in k-fold cross-validation. Here, we have a total of 25 instances. In the first iteration we use the first 20 percent of the data for evaluation and the remaining 80 percent for training (observations [0-4] for testing and [5-24] for training), while in the second iteration we use the second subset of 20 percent for evaluation and the remaining data for training (observations [5-9] for testing and [0-4] and [10-24] for training), and so on.
Total instances: 25
Value of k: 5
Iteration   Training set observations                                        Testing set observations
1 [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [0 1 2 3 4]
2 [ 0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [5 6 7 8 9]
3 [ 0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24] [10 11 12 13 14]
4 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23 24] [15 16 17 18 19]
5 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19] [20 21 22 23 24]
Comparison of train/test split to cross-validation
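As a hedged illustration of this comparison, the sketch below produces a single hold-out split and the five folds shown in the table above, using scikit-learn's train_test_split and KFold on the same 25 instances. With cross-validation, every instance is used for evaluation exactly once, whereas a single train/test split evaluates on only one fixed subset.

```python
# Train/test split vs 5-fold cross-validation on 25 instances (indices only).
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.arange(25).reshape(-1, 1)     # 25 instances, as in the example above

# Single train/test (hold-out) split: one fixed evaluation set
X_train, X_test = train_test_split(X, test_size=0.2, shuffle=False)
print("hold-out test instances:", X_test.ravel())

# 5-fold cross-validation: every instance is used for testing exactly once
kf = KFold(n_splits=5, shuffle=False)
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"fold {i}: test = {test_idx}")
```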