UNIT III
Introduction to machine learning – Linear Regression Models: Least squares, single & multiple
variables, Bayesian linear regression, gradient descent, Linear Classification Models: Discriminant
function – Probabilistic discriminative model - Logistic regression, Probabilistic generative model –
Naive Bayes, Maximum margin classifier – Support vector machine, Decision Tree, Random forests
Can a machine also learn from experiences or past data like a human does? So here comes the role
of Machine Learning.
A subset of artificial intelligence known as machine learning focuses primarily on the creation of
algorithms that enable a computer to independently learn from data and previous experiences.
Arthur Samuel first used the term "machine learning" in 1959. It could be summarized as follows:
Without being explicitly programmed, machine learning enables a machine to automatically learn from
data, improve performance from experiences, and predict things.
For the purpose of developing predictive models, machine learning brings together statistics and
computer science.
Machine learning constructs or uses algorithms that learn from historical data. The performance improves in proportion to the amount of data we provide.
A machine can learn if it can gain more data to improve its performance.
A machine learning system builds prediction models, learns from previous data, and predicts the
output of new data whenever it receives it.
The amount of data available helps to build a better model that accurately predicts the output.
Let's say we have a complex problem in which we need to make predictions. Instead of writing code,
we just need to feed the data to generic algorithms, which build the logic based on the data and predict
the output.
Machine learning has changed our way of thinking about such problems. Features of Machine Learning:
o It is a data-driven technology.
o Machine learning is similar to data mining, as it also deals with huge amounts of data.
1) Supervised Learning
The system uses labelled data to build a model that understands the datasets and learns about each one.
After the training and processing are done, we test the model with sample data to see if it can
accurately predict the output.
The mapping of the input data to the output data is the objective of supervised learning. Spam filtering
is an example of supervised learning.
o Classification
o Regression
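As a small, concrete illustration of these two supervised tasks, the following Python sketch (assuming scikit-learn and NumPy are installed; the tiny arrays are made-up toy data) trains one classifier and one regressor on labelled examples:

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Labelled training data: each row of X is an input, y holds the known outputs.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Classification: predict a category (0 = "not spam", 1 = "spam").
y_class = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y_class)
print(clf.predict([[2.5], [5.5]]))        # expected: [0 1]

# Regression: predict a continuous value (e.g. a salary or price).
y_reg = np.array([1.9, 4.1, 6.0, 8.1, 9.9, 12.2])
reg = LinearRegression().fit(X, y_reg)
print(reg.predict([[7.0]]))               # roughly 14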
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any supervision.
The training is provided to the machine with the set of data that has not been labelled, classified, or
categorized, and the algorithm needs to act on that data without any supervision.
The goal of unsupervised learning is to restructure the input data into new features or a group of
objects with similar patterns.
In unsupervised learning, we don't have a predetermined result. The machine tries to find useful
insights from the huge amount of data. It can be further classified into two categories of algorithms:
o Clustering
o Association
3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method, in which a learning agent gets a reward
for each right action and gets a penalty for each wrong action.
The agent learns automatically with these feedbacks and improves its performance.
In reinforcement learning, the agent interacts with the environment and explores it. The goal of an
agent is to get the most reward points, and hence, it improves its performance.
A robotic dog that automatically learns the movement of its limbs is an example of reinforcement learning.
A few decades ago (about 40-50 years), machine learning was science fiction, but today it is part of our daily life.
Machine learning is making our day to day life easy from self-driving cars to Amazon virtual
assistant "Alexa".
However, the idea behind machine learning is quite old and has a long history.
Machine Learning at present:
The field of machine learning has made significant strides in recent years, and its applications are numerous, including self-driving cars, Amazon Alexa, chatbots, and recommender systems.
It incorporates clustering, classification, decision tree, SVM algorithms, and reinforcement learning, as
well as unsupervised and supervised learning.
Present-day AI models can be used for making various predictions, including weather prediction, disease prediction, stock market analysis, and so on.
Machine learning is a buzzword for today's technology, and it is growing very rapidly day by day. We
are using machine learning in our daily life even without knowing it such as Google Maps, Google
assistant, Alexa, etc. Below are some most trending real-world applications of Machine Learning:
1. Image Recognition:
It is used to identify objects, persons, places, digital images, etc. The popular use case of image
recognition and face detection is, Automatic friend tagging suggestion:
Facebook provides us a feature of auto friend tagging suggestion. Whenever we upload a photo with
our Facebook friends, then we automatically get a tagging suggestion with name, and the technology
behind this is machine learning's face detection and recognition algorithm.
It is based on the Facebook project named "Deep Face," which is responsible for face recognition and
person identification in the picture.
2. Speech Recognition
While using Google, we get an option of "Search by voice"; this comes under speech recognition, which is a popular application of machine learning.
Speech recognition is a process of converting voice instructions into text, and it is also known as
"Speech to text", or "Computer speech recognition."
At present, machine learning algorithms are widely used by various applications of speech
recognition. Google assistant, Siri, Cortana, and Alexa are using speech recognition technology to
follow the voice instructions.
3. Traffic prediction:
If we want to visit a new place, we take help of Google Maps, which shows us the correct path with the
shortest route and predicts the traffic conditions.
It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, mainly using:
o Real-time location of the vehicle from the Google Maps app and sensors
Everyone who uses Google Maps is helping to make the app better. It takes information from the user and sends it back to its database to improve performance.
4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies such
as Amazon, Netflix, etc., for product recommendation to the user.
Whenever we search for a product on Amazon, we start getting advertisements for the same product while surfing the internet on the same browser, and this is because of machine learning.
Google understands the user interest using various machine learning algorithms and suggests the
product as per customer interest.
Similarly, when we use Netflix, we get recommendations for series, movies, etc., and this is also done with the help of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars. Machine learning plays a significant role in self-driving cars. Tesla, a popular car manufacturer, is working on self-driving cars. It uses machine learning methods to train its car models to detect people and objects while driving.
6. Email Spam and Malware Filtering:
Whenever we receive a new email, it is filtered automatically as important, normal, and spam. We
always receive an important mail in our inbox with the important symbol and spam emails in our spam
box, and the technology behind this is Machine learning. Below are some spam filters used by Gmail:
o Content Filter
o Header filter
o Rules-based filters
o Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve Bayes
classifier are used for email spam filtering and malware detection.
We have various virtual personal assistants such as Google assistant, Alexa, Cortana, Siri. As the
name suggests, they help us in finding the information using our voice instruction. These assistants can
help us in various ways just by our voice instructions such as Play music, call someone, Open an
email, Scheduling an appointment, etc.
These assistants record our voice instructions, send them to a server on the cloud, decode them using ML algorithms, and act accordingly.
Machine learning is making our online transactions safe and secure by detecting fraudulent transactions.
Whenever we perform an online transaction, there are various ways a fraudulent transaction can take place, such as fake accounts, fake IDs, and money stolen in the middle of a transaction. To detect this, a feed-forward neural network helps us by checking whether it is a genuine or a fraudulent transaction.
For each genuine transaction, the output is converted into hash values, and these values become the input for the next round. Genuine transactions follow a specific pattern, which changes for a fraudulent transaction; the model detects this change and makes our online transactions more secure.
Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in share prices, so machine learning's long short-term memory (LSTM) neural network is used for the prediction of stock market trends.
In medical science, machine learning is used for disease diagnosis. With this, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain.
Nowadays, if we visit a new place and do not know the language, it is not a problem at all; machine learning helps us by converting the text into a language we know.
Google's GNMT (Google Neural Machine Translation) provides this feature: a neural machine translation model that translates text into our familiar language, known as automatic translation.
The technology behind automatic translation is a sequence-to-sequence learning algorithm, which is also used with image recognition to translate text from one language to another.
Machine learning has given the computer systems the abilities to automatically learn without being
explicitly programmed.
But how does a machine learning system work? It can be described using the machine learning life cycle.
Machine learning life cycle is a cyclic process to build an efficient machine learning project. The main
purpose of the life cycle is to find a solution to the problem or project.
Machine learning life cycle involves seven major steps, which are given below:
o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train Model
o Test Model
o Deployment
o The most important thing in the complete process is to understand the problem and to know the purpose of the problem. Therefore, before starting the life cycle, we need to understand the problem, because a good result depends on a good understanding of the problem.
o In the complete life cycle process, to solve a problem, we create a machine learning system
called "model", and this model is created by providing "training". But to train a model, we need
data, hence, life cycle starts by collecting data.
1. Gathering Data:
Data Gathering is the first step of the machine learning life cycle. The goal of this step is to identify the different data sources and obtain the data related to the problem.
In this step, we need to identify the different data sources, as data can be collected from various
sources such as files, database, internet, or mobile devices. It is one of the most important steps of
the life cycle.
The quantity and quality of the collected data will determine the efficiency of the output. The more data we have, the more accurate the prediction will be.
o Collect data
By performing this task, we get a coherent set of data, also called a dataset, which will be used in further steps.
2. Data preparation
After collecting the data, we need to prepare it for further steps. Data preparation is a step where we
put our data into a suitable place and prepare it to use in our machine learning training.
In this step, first, we put all data together, and then randomize the ordering of data.
o Data Exploration:
It is used to understand the nature of data that we have to work with.
o We need to understand the characteristics, format, and quality of data.
A better understanding of data leads to an effective outcome. In this, we find Correlations,
general trends, and outliers.
o Data Pre-processing:
Now the next step is pre-processing of data for its analysis.
3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a useable format.
It is the process of cleaning the data, selecting the variable to use, and transforming the data in a proper
format to make it more suitable for analysis in the next step.
It is one of the most important steps of the complete process. Cleaning of data is required to address
the quality issues.
The data we have collected may not always be useful, as some of it may be irrelevant. In real-world applications, collected data may have various issues, including:
o Missing Values
o Duplicate data
o Invalid data
o Noise
It is mandatory to detect and remove the above issues because they can negatively affect the quality of the outcome.
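As a rough sketch of what this cleaning can look like in practice (assuming pandas and NumPy; the column names and values are made up for illustration):

import numpy as np
import pandas as pd

# Made-up raw data containing a missing value and a duplicate row.
df = pd.DataFrame({
    "Country": ["India", "Germany", "France", "Germany", "Germany"],
    "Age": [38, 30, 48, 40, 40],
    "Salary": [48000, 54000, 65000, np.nan, np.nan],
    "Purchased": ["No", "No", "No", "Yes", "Yes"],
})

df = df.drop_duplicates()                                  # remove duplicate rows
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())    # fill missing values
print(df)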
4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step involves:
o Building models
The aim of this step is to build a machine learning model to analyze the data using various analytical
techniques and review the outcome.
It starts with the determination of the type of the problems, where we select the machine learning
techniques such as Classification, Regression, Cluster analysis, Association, etc. then build the
model using prepared data, and evaluate the model.
Hence, in this step, we take the data and use machine learning algorithms to build the model.
5. Train Model
We use datasets to train the model using various machine learning algorithms. Training a model is required so that it can understand the various patterns, rules, and features.
6. Test Model
Once the model has been trained on a given dataset, we test it. In this step, we check the accuracy of our model by providing a test dataset to it.
Testing the model determines the percentage accuracy of the model as per the requirements of the project or problem.
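A minimal sketch of this train/test cycle, assuming scikit-learn and using its bundled Iris dataset purely as an example (any labelled dataset would do):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Split the data: most of it trains the model, the rest tests it.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)     # train the model
predictions = model.predict(X_test)                         # test the model
print("Accuracy:", accuracy_score(y_test, predictions))     # percentage accuracy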
7. Deployment
If the above-prepared model is producing an accurate result as per our requirement with acceptable
speed, then we deploy the model in the real system.
But before deploying the project, we will check whether it is improving its performance using
available data or not. The deployment phase is similar to making the final report for a project.
Artificial intelligence and machine learning are the part of computer science that are correlated with
each other.
Although these are two related technologies and people sometimes use them as synonyms for each other, they are still different terms in various respects.
AI is a bigger concept to create intelligent machines that can simulate human thinking capability and
behavior, whereas, machine learning is an application or subset of AI that allows machines to learn
from data without being programmed explicitly.
Artificial Intelligence
Artificial intelligence is a technology using which we can create intelligent systems that can simulate
human intelligence.
An artificial intelligence system does not need to be pre-programmed; instead, it uses algorithms that can work with their own intelligence.
It involves machine learning algorithms such as Reinforcement learning algorithm and deep learning
neural networks.
AI is being used in multiple places such as Siri, Google’s AlphaGo, AI in Chess playing, etc.
o Weak AI
o General AI
o Strong AI
Currently, we are working with weak AI and general AI. The future of AI is strong AI, which is said to be more intelligent than humans.
Machine learning
Machine learning is about extracting knowledge from the data. It can be defined as,
Machine learning is a subfield of artificial intelligence, which enables machines to learn from past
data or experiences without being explicitly programmed.
Key differences between Artificial Intelligence (AI) and Machine learning (ML): AI is the broader goal of creating intelligent machines that simulate human thinking and behaviour, whereas ML is the subset of AI in which machines learn patterns from data without being explicitly programmed.
What is a dataset?
A dataset is a collection of data in which the data is arranged in some order. A dataset can contain anything from a series of arrays to a database table. The below table shows an example of a dataset:
Country   Age   Salary   Purchased
India     38    48000    No
Germany   30    54000    No
France    48    65000    No
Germany   40             Yes
A tabular dataset can be understood as a database table or matrix, where each column corresponds to a particular variable, and each row corresponds to a record or example. The most supported file type for a tabular dataset is the "Comma Separated File," or CSV. But to store tree-like data, we can use a JSON file more efficiently.
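For instance, a tabular CSV dataset could be loaded with pandas as in the short sketch below (the file name data.csv is hypothetical):

import pandas as pd

# Each CSV column becomes a variable, and each row becomes one example.
df = pd.read_csv("data.csv")      # hypothetical file name
print(df.head())                  # first few rows
print(df.shape)                   # (number of rows, number of columns)

# Tree-like data would instead be read from JSON, e.g. pd.read_json("data.json").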
o Ordinal data: These data are similar to categorical data but can be measured on the basis of comparison (they have an order).
Types of datasets
Machine learning spans different domains, each requiring specific types of datasets. A few common types of datasets used in machine learning include:
Image Datasets:
Image datasets contain a collection of images and are commonly used in computer vision tasks such as image classification, object detection, and image segmentation.
Examples :
o ImageNet
o CIFAR-10
o MNIST
Text Datasets:
Text datasets consist of textual information, such as articles, books, or social media posts. These datasets are used in NLP techniques like sentiment analysis, text classification, and machine translation.
Time Series Datasets:
Time series datasets include data points collected over time. They are commonly used in forecasting, anomaly detection, and trend analysis.
Examples :
o Climate data
o Sensor readings.
Tabular Datasets:
Tabular datasets are structured data organized in tables or spreadsheets. They contain rows representing instances or samples and columns representing features or attributes. Tabular datasets are used for tasks like regression and classification. The dataset given earlier in this section is an example of a tabular dataset.
Regression analysis is a statistical method to model the relationship between a dependent (target) variable and one or more independent (predictor) variables.
Regression analysis helps us to understand how the value of the dependent variable is changing
corresponding to an independent variable when other independent variables are held fixed. It predicts
continuous/real values such as temperature, age, salary, price, etc.
We can understand the concept of regression analysis using the below example:
Example: Suppose there is a marketing company A that runs various advertisements every year and gets sales in return. The company has records of the advertising spend over the last 5 years and the corresponding sales.
Now, the company wants to spend $200 on advertising next year and wants to predict the corresponding sales. To solve such prediction problems in machine learning, we need regression analysis.
It is mainly used for prediction, forecasting, time series modeling, and determining the cause-and-effect relationship between variables.
"Regression shows a line or curve that passes through all the datapoints on target-predictor graph
in such a way that the vertical distance between the datapoints and the regression line is
minimum." The distance between datapoints and line tells whether a model has captured a strong
relationship or not.
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical
method that is used for predictive analysis. Linear regression makes predictions for continuous/real or
numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or more independent (x) variables, hence it is called linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between the
variables. Consider the below image:
Mathematically, we can represent a linear regression as:
y = a0 + a1x + ε
Here, y is the dependent (target) variable, x is the independent (predictor) variable, a0 is the intercept of the line, a1 is the linear regression coefficient (slope), and ε is a random error term.
The values of the x and y variables are the training dataset used for the linear regression model representation.
Linear regression can be further divided into two types of the algorithm:
o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called Simple Linear Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Multiple Linear Regression.
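The sketch below (assuming scikit-learn and small made-up arrays) fits both variants: a simple model with one independent variable and a multiple model with two:

import numpy as np
from sklearn.linear_model import LinearRegression

y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Simple linear regression: one independent variable.
X_simple = np.array([[1], [2], [3], [4], [5]])
simple = LinearRegression().fit(X_simple, y)
print("a1 =", simple.coef_[0], "a0 =", simple.intercept_)

# Multiple linear regression: two independent variables.
X_multi = np.array([[1, 10], [2, 8], [3, 13], [4, 9], [5, 15]])
multi = LinearRegression().fit(X_multi, y)
print("coefficients:", multi.coef_, "intercept:", multi.intercept_)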
A linear line showing the relationship between the dependent and independent variables is called
a regression line. A regression line can show two types of relationship:
When working with linear regression, our main goal is to find the best fit line that means the error
between predicted values and actual values should be minimized. The best fit line will have the least
error.
The different values for the weights or the line coefficients (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best fit line; to calculate this we use a cost function.
Cost function-
o The different values for the weights or line coefficients (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best fit line.
o Cost function optimizes the regression coefficients or weights. It measures how a linear
regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the
input variable to the output variable. This mapping function is also known as Hypothesis
function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. For N observations it can be written as:
MSE = (1/N) Σ (yi − (a0 + a1xi))²
where yi is the actual value of the i-th observation and a0 + a1xi is the corresponding predicted value.
Residuals: The distance between an actual value and the corresponding predicted value is called a residual. If the observed points are far from the regression line, the residuals will be high, and so the cost function will be high. If the scatter points are close to the regression line, the residuals will be small, and hence the cost function will be small.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
o A regression model uses gradient descent to update the coefficients of the line by reducing the
cost function.
o It starts with randomly selected coefficient values and then iteratively updates them to reach the minimum of the cost function.
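A minimal NumPy sketch of this procedure on made-up data (the learning rate and number of iterations are arbitrary choices), updating a0 and a1 by following the gradient of the MSE cost:

import numpy as np

# Toy data that roughly follows y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

a0, a1 = 0.0, 0.0      # initial coefficient values
lr = 0.01              # learning rate (step size)

for _ in range(5000):
    y_pred = a0 + a1 * x
    error = y_pred - y
    # Gradients of the MSE cost with respect to a0 and a1.
    grad_a0 = 2 * error.mean()
    grad_a1 = 2 * (error * x).mean()
    a0 -= lr * grad_a0
    a1 -= lr * grad_a1

print("a0 =", a0, "a1 =", a1)   # approaches the least-squares solution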
Supervised Machine Learning algorithm can be broadly classified into Regression and Classification
Algorithms.
Regression algorithms predict outputs for continuous values, but to predict categorical values, we need classification algorithms.
The Classification algorithm is used to identify the category of new observations on the basis of
training data.
A program learns from the given dataset or observations and then classifies new observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can also be called targets, labels, or categories.
The output variable of Classification is a category, not a value, such as "Green or Blue", "fruit or
animal", etc. It takes labeled input data, which means it contains input with the corresponding output.
The main goal of the Classification algorithm is to identify the category of a given dataset, and these
algorithms are mainly used to predict the output for the categorical data.
In the below diagram, there are two classes, class A and Class B. These classes have features that are
similar to each other and dissimilar to other classes.
The algorithm which implements the classification on a dataset is known as a classifier. There are two
types of Classifications:
o Binary Classifier: If the classification problem has only two possible outcomes, then it is
called as Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, then it is called
as Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.
1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test dataset. In this case, classification is done on the basis of the most closely related data stored in the training dataset. It takes less time in training but more time for predictions.
Example: K-NN algorithm, Case-based reasoning
2. Eager Learners: Eager learners develop a classification model from the training dataset before receiving a test dataset. Opposite to lazy learners, an eager learner takes more time in learning and less time in prediction. Example: Decision Trees, Naïve Bayes, ANN.
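The contrast can be sketched with scikit-learn on made-up toy data: K-NN (lazy) simply stores the training set and does its work at prediction time, while a decision tree (eager) builds its model up front:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier   # lazy learner
from sklearn.tree import DecisionTreeClassifier      # eager learner

X = np.array([[1], [2], [3], [6], [7], [8]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)  # "training" mostly stores the data
tree = DecisionTreeClassifier().fit(X, y)            # builds the tree before any test data arrives

print(knn.predict([[2.5], [6.5]]))    # classified from the nearest stored examples
print(tree.predict([[2.5], [6.5]]))   # classified from the pre-built tree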
Classification algorithms can be further divided mainly into two categories:
o Linear Models
  o Logistic Regression
o Non-linear Models
  o K-Nearest Neighbours
  o Kernel SVM
  o Naïve Bayes
Logistic Regression:
o The sigmoid function is a mathematical function used to map the predicted values to probabilities.
o It maps any real value into another value within the range of 0 and 1.
o The value of the logistic regression output must be between 0 and 1 and cannot go beyond this limit, so it forms a curve like the "S" form. The S-form curve is called the sigmoid function or the logistic function.
o We use the concept of a threshold value, which defines the probability of either 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0.
The mathematical steps to get the Logistic Regression equation are given below:
o We start from the equation of a straight line: y = b0 + b1x1 + b2x2 + ... + bnxn
o In Logistic Regression y can be between 0 and 1 only, so we divide the above equation by (1 − y): y / (1 − y), which is 0 for y = 0 and infinity for y = 1.
o But we need a range between −infinity and +infinity, so taking the logarithm of the equation, it becomes: log[y / (1 − y)] = b0 + b1x1 + b2x2 + ... + bnxn
o Binomial: There can be only two possible types of the dependent variables, such as 0 or 1, Pass
or Fail, etc.
o Multinomial: There can be 3 or more possible unordered types of the dependent variable, such
as "cat", "dogs", or "sheep"
o Ordinal: There can be 3 or more possible ordered types of dependent variables, such as "low",
"Medium", or "High".
Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality reduction
techniques in machine learning to solve more than two-class classification problems. It is also
known as Normal Discriminant Analysis (NDA) or Discriminant Function Analysis (DFA).
This can be used to project the features of higher dimensional space into lower-dimensional space in
order to reduce resources and dimensional costs.
Although the basic logistic regression algorithm is limited to two-class problems, Linear Discriminant Analysis is applicable to classification problems with more than two classes.
Linear Discriminant analysis is one of the most popular dimensionality reduction techniques used
for supervised classification problems in machine learning. It is also considered a pre-processing step
for modeling differences in ML and applications of pattern classification.
Whenever there is a requirement to separate two or more classes having multiple features efficiently,
the Linear Discriminant Analysis model is considered the most common technique to solve such
classification problems. For example, suppose we have two classes with multiple features and need to separate them efficiently. If we classify them using a single feature, the classes may show overlapping.
To overcome this overlapping issue in the classification process, we keep increasing the number of features.
Example:
Let's assume we have to classify two different classes having two sets of data points in a 2-dimensional
plane as shown below image:
However, it may be impossible to draw a straight line in a 2-D plane that separates these data points efficiently, but using Linear Discriminant Analysis we can reduce the 2-D plane to a 1-D plane. Using this technique, we can also maximize the separability between multiple classes.
Let's consider an example where we have two classes in a 2-D plane having an X-Y axis, and we need
to classify them efficiently. As we have already seen in the above example that LDA enables us to
draw a straight line that can completely separate the two classes of the data points. Here, LDA uses an
X-Y axis to create a new axis by separating them using a straight line and projecting data onto a new
axis.
Hence, we can maximize the separation between these classes and reduce the 2-D plane into 1-D.
To create a new axis, Linear Discriminant Analysis uses the following criteria:
o Maximize the distance between the means of the two classes.
o Minimize the variation (scatter) within each class.
Using these two conditions, LDA generates a new axis in such a way that it maximizes the distance between the means of the two classes and minimizes the variation within each class.
In other words, we can say that the new axis will increase the separation between the data points of the
two classes and plot them onto the new axis.
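A minimal sketch with scikit-learn's LinearDiscriminantAnalysis on made-up 2-D points for two classes, projecting the 2-D data onto the single discriminant axis described above:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two classes of 2-D points (toy data).
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# Reduce the 2-D plane to the 1-D axis that best separates the classes.
lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)

print(X_1d.ravel())            # each point's position on the new axis
print(lda.predict([[4, 4]]))   # LDA can also classify new points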
Why LDA?
o Logistic Regression is one of the most popular classification algorithms that perform well for
binary classification but falls short in the case of multiple classification problems with
well-separated classes. At the same time, LDA handles these quite efficiently.
o LDA can also be used in data pre-processing to reduce the number of features, just as PCA,
which reduces the computing cost significantly.
o LDA is also used in face detection algorithms. In Fisherfaces, LDA is used to extract useful
data from different faces. Coupled with eigenfaces, it produces effective results.
LDA fails in cases where the means of the class distributions are shared (equal). In such cases, LDA cannot create a new axis that makes the classes linearly separable.
o Face Recognition
Face recognition is the popular application of computer vision, where each face is represented
as the combination of a number of pixel values. In this case, LDA is used to minimize the
number of features to a manageable number before going through the classification process. It
generates a new template in which each dimension consists of a linear combination of pixel
values. If a linear combination is generated using Fisher's linear discriminant, then it is called
Fisher's face.
o Medical
In the medical field, LDA has a great application in classifying the patient disease on the basis
of various parameters of patient health and the medical treatment which is going on. On such
parameters, it classifies disease as mild, moderate, or severe. This classification helps the
doctors in either increasing or decreasing the pace of the treatment.
o Customer Identification
In customer identification, LDA is currently being applied. With the help of LDA, we can easily identify and select the features that characterize a group of customers who are likely to purchase a specific product in a shopping mall.
o For Predictions
LDA can also be used for making predictions and thus in decision making. For example, "Will you buy this product?" will give a predicted result of one of two possible classes: buying or not buying.
o In Learning
Nowadays, robots are being trained for learning and talking to simulate human work, and it can
also be considered a classification problem. In this case, LDA builds similar groups on the
basis of different parameters, including pitches, frequencies, sound, tunes, etc.
Difference between Linear Discriminant Analysis and PCA
o PCA is an unsupervised algorithm that does not care about classes and labels and only aims to
find the principal components to maximize the variance in the given dataset. At the same time,
LDA is a supervised algorithm that aims to find the linear discriminants to represent the axes
that maximize separation between different classes of data.
o LDA is much more suitable for multi-class classification tasks compared to PCA. However, PCA is assumed to perform better when the sample size is comparatively small.
o Both LDA and PCA are used as dimensionality reduction techniques; in practice, PCA is often applied first, followed by LDA.
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is
used for Classification as well as Regression problems. However, primarily, it is used for Classification
problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes so that we can easily put the new data point in the correct category in
the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed the Support Vector Machine.
Consider the below diagram in which there are two different categories that are classified using a
decision boundary or hyperplane:
Example: SVM can be understood with the example that we used for the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We first train our model with lots of images of cats and dogs so that it can learn their different features, and then we test it on this strange creature. The SVM creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors); on the basis of these support vectors, it classifies the new example as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional
space, but we need to find out the best decision boundary that helps to classify the data points. This
best boundary is known as the hyperplane of SVM.
The dimension of the hyperplane depends on the number of features present in the dataset: if there are 2 features (as shown in the image), the hyperplane is a straight line, and if there are 3 features, the hyperplane is a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, which means the maximum distance between the data points of the two classes.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a
dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a
classifier that can classify the pair(x1, x2) of coordinates in either green or blue. Consider the below
image:
Since this is a 2-D space, we can easily separate these two classes just by using a straight line. But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or
region is called as a hyperplane. SVM algorithm finds the closest point of the lines from both the
classes. These points are called support vectors. The distance between the vectors and the hyperplane
is called as margin. And the goal of SVM is to maximize this margin. The hyperplane with maximum
margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we
cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have used
two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below image:
Since we are in 3-D space, the decision boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, it becomes a circle of radius 1, which separates the non-linear data.
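A minimal scikit-learn sketch of the same idea on made-up circular data: a linear SVM struggles, while an RBF-kernel SVM handles the non-linear boundary without adding the z dimension by hand:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy non-linear data: class 0 inside the unit circle, class 1 outside it.
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # the z = x^2 + y^2 idea

linear_svm = SVC(kernel="linear").fit(X, y)   # no straight line separates the classes well
rbf_svm = SVC(kernel="rbf").fit(X, y)         # the kernel handles the circular boundary

print("linear accuracy:", linear_svm.score(X, y))
print("rbf accuracy:", rbf_svm.score(X, y))
print("support vectors per class:", rbf_svm.n_support_)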
SVM Applications
1. Face detection
2. Image classification
3. Text categorization
Limitations of SVM
1. It is sensitive to noise.
2. The optimal design for multiclass SVM classifiers is also a research area.
Soft-margin SVM (slack variables):
1. A slack variable equal to 0 corresponds to a point that is correctly classified and outside the margin.
2. A slack variable > 0 corresponds to a point inside the margin or on the wrong side of the hyperplane.
3. C is the trade-off between the slack variable penalty and the margin.
Decision Tree
There are various algorithms in machine learning, so choosing the best algorithm for the given dataset and problem is the main point to remember while creating a machine learning model. Below are two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node
of the tree. This algorithm compares the values of root attribute with the record (real dataset) attribute
and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree. The complete process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values for the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in step 3. Continue this process until a stage is reached where you cannot further classify the nodes; the final nodes are called leaf nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should
accept the offer or Not. So, to solve this problem, the decision tree starts with the root node (Salary
attribute by ASM). The root node splits further into the next decision node (distance from the office)
and one leaf node based on the corresponding labels. The next decision node further gets split into one
decision node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes
(Accepted offers and Declined offer). Consider the below diagram:
Attribute Selection Measures
While implementing a Decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure, or ASM. By this measurement, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the
below formula:
Information Gain = Entropy(S) − (Weighted Avg) × Entropy(each feature)
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in the data. Entropy can be calculated as:
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
Where,
o S = total number of samples
o P(yes) = probability of yes
o P(no) = probability of no
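A small Python sketch of these two formulas on made-up yes/no counts, computing the entropy of a split and the resulting information gain:

import math

def entropy(p_yes, p_no):
    # Entropy of a node with the given class probabilities (0 * log(0) treated as 0).
    total = 0.0
    for p in (p_yes, p_no):
        if p > 0:
            total -= p * math.log2(p)
    return total

# Toy example: parent node S has 9 "yes" and 5 "no" samples.
parent = entropy(9 / 14, 5 / 14)

# A candidate split into two children: (6 yes, 2 no) and (3 yes, 3 no).
child1 = entropy(6 / 8, 2 / 8)
child2 = entropy(3 / 6, 3 / 6)
weighted = (8 / 14) * child1 + (6 / 14) * child2

print("Information Gain =", parent - weighted)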
2. Gini Index:
o Gini Index is used to determine the best feature to split the data on at every node of the tree.
o It is a measure of how mixed or impure a dataset is.
o Gini index is a measure of inequality or impurity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm. It can be written as Gini Index = 1 − Σj Pj², where Pj is the probability of class j at the node.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create those binary splits.
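Both criteria are exposed by scikit-learn's decision tree, as in the short sketch below (the Iris dataset is used only as a convenient example):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# CART-style tree using the Gini index (the default criterion).
gini_tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)

# The same tree grown with entropy / information gain instead.
entropy_tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)

print("gini tree accuracy:", gini_tree.score(X, y))
print("entropy tree accuracy:", entropy_tree.score(X, y))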
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal decision
tree.
A too-large tree increases the risk of overfitting, and a small tree may not capture all the important
features of the dataset.
Therefore, a technique that decreases the size of the learning tree without reducing accuracy is known
as Pruning.
o It is simple to understand, as it follows the same process that a human follows while making any decision in real life.
o It can be very useful for solving decision-related problems.
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.
Random Forest
Random Forest is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset."
Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of the predictions, predicts the final output.
A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.
The below diagram explains the working of the Random Forest algorithm:
Assumptions for Random Forest
Since the random forest combines multiple trees to predict the class of the dataset, it is possible that
some decision trees may predict the correct output, while others may not.
But together, all the trees predict the correct output. Therefore, below are two assumptions for a better
Random forest classifier:
o There should be some actual values in the feature variable of the dataset so that the classifier
can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.
Below are some points that explain why we should use the Random Forest algorithm:
o It predicts output with high accuracy, and even for a large dataset it runs efficiently.
Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions with each tree created in the first phase.
The working process can be explained in the below steps and diagram:
Step-1: Select K random data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step 1 and Step 2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data point to the category that wins the majority of votes.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to the
Random forest classifier. The dataset is divided into subsets and given to each decision tree. During the
training phase, each decision tree produces a prediction result, and when a new data point occurs, then
based on the majority of results, the Random Forest classifier predicts the final decision. Consider the
below image:
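A minimal sketch of this majority-voting procedure with scikit-learn's RandomForestClassifier (the Iris dataset stands in for the fruit images; N is set via n_estimators):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# N = 100 decision trees, each trained on a random bootstrap subset of the data.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Each tree votes; the forest returns the majority class.
print("accuracy:", forest.score(X_test, y_test))
print("prediction for one sample:", forest.predict(X_test[:1]))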
Applications of Random Forest
There are mainly four sectors where Random Forest is mostly used:
1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
o It enhances the accuracy of the model and prevents the overfitting issue.
o Although random forest can be used for both classification and regression tasks, it is not as well suited to regression tasks.