Data Mining UNIT-2 Notes

Unit-II: Predictive Analytics

Classification and Prediction - Basic Concepts of


Classification and Prediction, General Approach to
solving a classification problem- Logistic Regression -
LDA - Decision Trees: Tree Construction Principle –
Feature Selection measure – Tree Pruning - Decision Tree
construction Algorithm, Random Forest, Bayesian
Classification-Accuracy and Error Measures- Evaluating
the Accuracy of the classifier / predictor- Ensemble
methods and Model selection.

Classification and Prediction in Data Mining

There are two forms of data analysis that can be used to


extract models describing important classes or predict
future data trends. These two forms are as follows:

1. Classification
2. Prediction

We use classification and prediction to extract a model
representing the data classes and to predict future data
trends. Classification predicts the categorical labels of
data, while prediction models estimate continuous values.
This analysis gives us a good understanding of the data at
a large scale.

Classification models predict categorical class labels, and


prediction models predict continuous-valued functions.
For example, we can build a classification model to
categorize bank loan applications as either safe or risky
or a prediction model to predict the expenditures in
dollars of potential customers on computer equipment
given their income and occupation.

What is Classification?

Classification is to identify the category or the class label


of a new observation. First, a set of data is used as
training data. The set of input data and the
corresponding outputs are given to the algorithm. So,
the training data set includes the input data and their
associated class labels. Using the training dataset, the
algorithm derives a model or the classifier. The derived
model can be a decision tree, mathematical formula, or a
neural network. In classification, when unlabeled data is
given to the model, it should find the class to which it
belongs. The new data provided to the model is the test
data set.

Classification is the process of assigning a record to a
class. One simple example of classification is to check
whether it is raining or not. The answer can either be yes
or no, so there is a limited number of choices. Sometimes
there can be more than two classes to classify. That is
called multiclass classification.

The bank needs to analyze whether giving a loan to a


particular customer is risky or not. For example, based
on observable data for multiple loan borrowers, a
classification model may be established that forecasts
credit risk. The data could track job records,
homeownership or leasing, years of residency, number and
type of deposits, historical credit ranking, etc. The goal
would be credit ranking, the predictors would be the
other characteristics, and the data would represent a
case for each consumer. In this example, a model is
constructed to find the categorical label. The labels are
risky or safe.

How does Classification Work?

The functioning of classification with the assistance of


the bank loan application has been mentioned above.
There are two stages in the data classification system:
model creation, and applying the classifier for classification.
1. Developing the Classifier or model creation: This
level is the learning stage or the learning process.
The classification algorithms construct the classifier
in this stage. A classifier is constructed from a
training set composed of database records and
their corresponding class labels. Each record
that makes up the training set belongs to a
category or class. We may also refer to these
records as samples, objects, or data points.
2. Applying classifier for classification: The classifier
is used for classification at this level. The test data
are used here to estimate the accuracy of the
classification algorithm. If the consistency is deemed
sufficient, the classification rules can be expanded
to cover new data records. It includes:
o Sentiment Analysis: Sentiment analysis is
highly helpful in social media monitoring. We
can use it to extract social media insights. We
can build sentiment analysis models to read
and analyze misspelled words with advanced
machine learning algorithms. Accurately
trained models provide consistently accurate
outcomes in a fraction of the time.
o Document Classification: We can use
document classification to organize the
documents into sections according to the
content. Document classification refers to text
classification; we can classify the words in the
entire document. And with the help of machine
learning classification algorithms, we can
execute it automatically.
o Image Classification: Image classification is

used to assign an image to one of a set of trained
categories. These could be the caption of the image, a
statistical value, or a theme. You can tag images
to train your model for relevant categories by
applying supervised learning algorithms.
o Machine Learning Classification: It uses the

statistically demonstrable algorithm rules to


execute analytical tasks that would take
humans hundreds of hours to perform.
3. Data Classification Process: The data classification
process can be categorized into five steps:
o Define the goals, strategy, workflows, and
architecture of data classification.
o Classify the confidential data that we store.
o Label the data by using marks.
o Use the results to improve protection and
compliance.
o Data is dynamic, and classification is a
continuous process.

What is Data Classification Lifecycle?

The data classification life cycle produces an excellent


structure for controlling the flow of data to an
enterprise. Businesses need to account for data security
and compliance at each level. With the help of data
classification, we can perform it at every stage, from
origin to deletion. The data life-cycle has the following
stages, such as:

1. Origin: It produces sensitive data in various


formats, with emails, Excel, Word, Google
documents, social media, and websites.
2. Role-based practice: Role-based security restrictions
apply to all sensitive data, tagged based on in-house
protection policies and agreement rules.
3. Storage: Here, we have the obtained data,
including access controls and encryption.
4. Sharing: Data is continually distributed among
agents, consumers, and co-workers from various
devices and platforms.
5. Archive: Here, data is eventually archived within
an industry's storage systems.
6. Publication: Through the publication of data, it can
reach customers. They can then view and download
it in the form of dashboards.

What is Prediction?

Another process of data analysis is prediction. It is used


to find a numerical output. Same as in classification, the
training dataset contains the inputs and corresponding
numerical output values. The algorithm derives the
model or a predictor according to the training dataset.
The model should find a numerical output when the new
data is given. Unlike in classification, this method does
not have a class label. The model predicts a continuous-
valued function or ordered value.

Regression is generally used for prediction. Predicting


the value of a house depending on the facts such as the
number of rooms, the total area, etc., is an example for
prediction.
For example, suppose the marketing manager needs to
predict how much a particular customer will spend at his
company during a sale. In this case, we are asked to
forecast a numerical value, so this is an example of
numeric prediction. Here, a model or a predictor will be
developed that forecasts a continuous or ordered value
function.

Classification and Prediction Issues

The major issue is preparing the data for Classification


and Prediction. Preparing the data involves the following
activities, such as:
1. Data Cleaning: Data cleaning involves removing
the noise and treatment of missing values. The
noise is removed by applying smoothing
techniques, and the problem of missing values is
solved by replacing a missing value with the most
commonly occurring value for that attribute.
2. Relevance Analysis: The database may also have
irrelevant attributes. Correlation analysis is used to
know whether any two given attributes are related.
3. Data Transformation and reduction: The data can
be transformed by any of the following methods.
o Normalization: The data is transformed using

normalization. Normalization involves scaling


all values for a given attribute to make them fall
within a small specified range. Normalization is
used when the neural networks or the methods
involving measurements are used in the
learning step.
o Generalization: The data can also be
transformed by generalizing it to the higher
concept. For this purpose, we can use the
concept hierarchies.
NOTE: Data can also be reduced by some other
methods such as wavelet transformation, binning,
histogram analysis, and clustering.
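For instance, a minimal sketch of normalization, assuming scikit-learn is available (the attribute values below are hypothetical and not from these notes), that scales each attribute into the range [0, 1]:

# Minimal sketch of min-max normalization (illustrative only)
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical attribute values: income and age for four customers
data = np.array([[48000.0, 25.0],
                 [72000.0, 40.0],
                 [31000.0, 22.0],
                 [95000.0, 58.0]])

scaler = MinMaxScaler()              # scales each column into [0, 1]
scaled = scaler.fit_transform(data)
print(scaled)                        # every value now lies between 0 and 1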

Comparison of Classification and Prediction


Methods
Here are the criteria for comparing the methods of
Classification and Prediction, such as:

o Accuracy: The accuracy of the classifier can be
referred to as the ability of the classifier to predict
the class label correctly, and the accuracy of the
predictor can be referred to as how well a given
predictor can estimate the unknown value.
o Speed: The speed of the method depends on the
computational cost of generating and using the
classifier or predictor.
o Robustness: Robustness is the ability to make
correct predictions or classifications given noisy data
or data with missing values. In the context of data
mining, it is the ability of the classifier or predictor
to make correct predictions from incoming unknown data.
o Scalability: Scalability refers to the ability to
construct the classifier or predictor efficiently as
the amount of data grows.
o Interpretability: Interpretability is how readily we
can understand the reasoning behind predictions or
classification made by the predictor or classifier.
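For the accuracy criterion in particular, a minimal sketch (assuming scikit-learn; the labels are hypothetical) of how accuracy is typically computed from true and predicted class labels:

# Minimal sketch: accuracy of a classifier from its labels (illustrative only)
from sklearn.metrics import accuracy_score

y_true = ['safe', 'risky', 'safe', 'safe', 'risky']   # hypothetical true labels
y_pred = ['safe', 'safe',  'safe', 'safe', 'risky']   # hypothetical predictions
print(accuracy_score(y_true, y_pred))                 # 4 of 5 correct -> 0.8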

Difference between Classification and


Prediction

The decision tree, applied to existing data, is a


classification model. We can get a class prediction by
applying it to new data for which the class is unknown.
The assumption is that the new data comes from a
distribution similar to the data we used to construct our
decision tree. In many instances, this is a correct
assumption, so we can use the decision tree to build a
predictive model. Classification/prediction is the
process of finding a model that describes the data classes or
concepts. The purpose is to predict the
class of objects whose class label is unknown using this
model. Below are some major differences between
classification and prediction.

Classification vs. Prediction:
o Classification is the process of identifying which category a
new observation belongs to, based on a training data set
containing observations whose category membership is known.
Prediction is the process of identifying the missing or
unavailable numerical data for a new observation.
o In classification, the accuracy depends on finding the class
label correctly. In prediction, the accuracy depends on how
well a given predictor can guess the value of a predicted
attribute for new data.
o In classification, the model is known as the classifier.
In prediction, the model is known as the predictor.
o In classification, a model or classifier is constructed to
find categorical labels. In prediction, a model or predictor
is constructed that predicts a continuous-valued function or
ordered value.
o For example, grouping patients based on their medical
records can be considered classification, whereas predicting
the correct treatment for a particular disease for a person
can be thought of as prediction.

Classification-Based Approaches in Data Mining

Classification is the process of finding a set of
models (or functions) that describe and distinguish data
classes or concepts, with the aim of being able to
use the model to predict the class of objects whose
class label is unknown. The derived model depends
on the analysis of a set of training data
(i.e., data objects whose class label is known). The
derived model may be represented in various
forms, such as classification (if-then) rules, decision trees,
and neural networks. Data mining has different types of
classifiers: classification is a form of data analysis that
extracts models describing important data classes.
Such models are called classifiers. For example, we
can build a classification model for banks to categorize
loan applications.
A general approach to classification:
Classification is a two-step process involving,
Learning Step: It is a step where the Classification
model is to be constructed. In this phase, training data
are analyzed by a classification Algorithm.
Classification Step: It is the step where the model is
used to predict class labels for given data. In this
phase, test data are used to estimate the accuracy of the
classification rules.
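A minimal sketch of this two-step process, assuming scikit-learn and its bundled iris dataset (not part of the original notes): the learning step fits a classifier on training data, and the classification step applies it to held-out test data to estimate accuracy.

# Minimal sketch of the learning step and the classification step (illustrative only)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Learning step: construct the classifier from the training set
clf = DecisionTreeClassifier().fit(X_train, y_train)

# Classification step: predict class labels for test data and estimate accuracy
print(accuracy_score(y_test, clf.predict(X_test)))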
Basic algorithms of classification:
 Decision Tree Induction

 Naïve Bayesian Classification

 Rule-Based Classification

 SVM(Support Vector Machine)

 Generalized Linear Models

 Bayesian classification

 Classification by Backpropagation

 K-NN Classifier

 Frequent-Pattern Based Classification

 Rough set theory

 Fuzzy Logic
Decision Tree Induction:
 Decision Tree Induction is the learning of decision

trees from class labeled training tuples.


 Given a tuple X, for which the association class label is

unknown the attribute values of tuples are tested


against the decision tree.
 A path that is traced from the root to the leaf node,

which holds the class prediction for the tuple.


 These trees are then converted into Classification

rules.
 Decision trees are easy to interpret and do not need
any domain knowledge.


Naïve Bayesian Classification:
 They are Statistical Classifiers.

 They can predict class membership probabilities such

as the probability that a given tuple belongs to a


particular class.
 Naïve classifiers assume that the effect of an attribute
value on a class is independent of values of other
attributes.
 The mathematical formula for this classification is Bayes' theorem:

p(H|X) = p(X|H) p(H) / p(X)

where H is a hypothesis and p(H|X) is the posterior probability that
H holds given the evidence of the tuple X (observed
data), and
p(X|H) is the posterior probability of X conditioned on
H.
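As a hedged sketch of this idea (assuming scikit-learn and the iris dataset; not part of the original notes), GaussianNB estimates the class-conditional probabilities under the attribute-independence assumption and predicts the class with the highest posterior probability:

# Minimal sketch of naive Bayesian classification (illustrative only)
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
nb = GaussianNB().fit(X, y)

print(nb.predict(X[:1]))        # most probable class for the first tuple
print(nb.predict_proba(X[:1]))  # posterior probability p(H|X) for each class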
Rule-Based Classification:
 Rules are a good way of representing information or

knowledge.
 A rule-based classifier uses a set of IF-THEN rules for
classification, where each rule is represented as
IF condition THEN conclusion
(for example, IF age = youth AND student = yes THEN buys_computer = yes).
 The IF part is called the rule precondition (or antecedent), and
the THEN part is called the rule consequent.
 This implies that the next (THEN) part will execute
only if the condition is met.

Now let's see how to classify outliers. A database may
contain data objects that do not conform to the general behavior
or model of the data. These data objects are outliers.
The investigation of outlier data is known as
OUTLIER MINING. An outlier may be detected or
classified using statistical tests which assume a
distribution or probability model for the data, or using
distance measures where objects having a small fraction
of "close" neighbors in space are considered outliers.
Rather than utilizing statistical or distance measures,
deviation-based techniques distinguish
exceptions/outliers by inspecting differences in the
principal attributes of objects in a group.
Outlier detection (also referred to as anomaly detection)
is the process of finding data objects with behaviors
that are very different from expectations. Such objects
are called outliers or anomalies. Outlier detection is vital
in many applications in addition to fraud detection, such as
medical care, public safety and security, industrial
damage detection, image processing, sensor/video
network surveillance, and intrusion detection.
In general, outliers can be classified into three
categories, namely global outliers, contextual (or
conditional) outliers, and collective outliers. Let's
examine each of these categories.

Global Outliers: In a given data set, a data
object is a global outlier if it deviates significantly
from the rest of the data set. Global outliers are
sometimes called point anomalies and are the simplest sort
of outliers. Most outlier detection methods are aimed
at finding global outliers.
Contextual Outliers: In a given data set, a
data object may be a contextual outlier if it
deviates significantly with regard to a specific context of
the object. Contextual outliers are also referred to as
conditional outliers because they are conditional on the
chosen context. Therefore, in contextual outlier
detection, the context has to be specified as part
of the problem definition. Unlike global
outlier detection, in contextual outlier detection, whether
a data object is an outlier depends on not only the
behavioral attributes but also the contextual attributes.
Contextual outliers are a generalization of local outliers,
a notion introduced in density-based outlier analysis
approaches. An object in a data set is a local
outlier if its density significantly deviates from the local
area in which it occurs.
Collective Outliers: In a given data set, a subset of
data objects forms a collective outlier if the
objects as a whole deviate significantly from the entire
data set. Importantly, the individual data objects might
not be outliers. Unlike global or contextual outlier
detection, in collective outlier detection we have to consider
not only the behavior of individual objects but also
that of groups of objects. Therefore, to detect collective
outliers, we need background knowledge of the relationships
among data objects, such as distance or similarity
measurements between objects.

Logistic Regression for Classification

Logistic Regression comes under Supervised


Learning. Supervised Learning is when the algorithm
learns on a labeled dataset and analyses the training
data. These labeled data sets have inputs and expected
outputs.
Supervised learning can be further split into
classification and regression.

Classification is about predicting a label, by identifying


which category an object belongs to based on different
parameters.

Regression is about predicting a continuous output, by


finding the correlations between dependent and
independent variables.


What is Logistic Regression?

Logistic Regression is a statistical approach and a


Machine Learning algorithm that is used for
classification problems and is based on the concept of
probability. It is used when the dependent variable
(target) is categorical.
It is widely used when the classification problem at hand
is binary; true or false, yes or no, etc. For example, it
can be used to predict whether an email is spam (1) or
not (0).
Logistic regression uses the sigmoid function to return
the probability of a label.

Sigmoid Function

Sigmoid Function is a mathematical function used to


map the predicted values to probabilities. The function
has the ability to map any real value into another value
within a range of 0 and 1.

Code:
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

The rule is that the value of the logistic regression must
be between 0 and 1. Due to the limitations of it not being
able to go beyond the value 1, on a graph it forms a
curve in the form of an "S". This is an easy way to
identify the Sigmoid function or the logistic function.
In regards to Logistic Regression, the concept used is
the threshold value. The threshold values help to define
the probability of either 0 or 1. For example, values
above the threshold value tend to 1, and a value below
the threshold value tends to 0.
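A minimal sketch of this thresholding, assuming scikit-learn and a synthetically generated dataset (both are assumptions, not part of the original notes):

# Minimal sketch: logistic regression probabilities and a 0.5 threshold (illustrative only)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical binary classification data
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

probs = model.predict_proba(X[:5])[:, 1]   # sigmoid output: probability of class 1
labels = (probs >= 0.5).astype(int)        # values above the threshold tend to 1
print(probs)
print(labels)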

Type of Logistic Regression

1. Binomial: This means that there can be only two


possible types of the dependent variables, such as 0 or
1, Yes or No, etc.
2. Multinomial: This means that there can be 3 or more
possible unordered types of the dependent variable,
such as "cat", "dogs", or "sheep"
3. Ordinal: This means that there can be 3 or more
possible ordered types of dependent variables, such as
"low", "Medium", or "High".

Linear and Logistic Regression

Linear Regression is similar to Logistic Regression but


different.
Linear Regression assumes that there is a linear
relationship between dependent and independent
variables. It uses the line of best fit that describes two or
more variables. The aim of Linear Regression is to
accurately predict the output for the continuous
dependent variable.
However, Logistic regression predicts the probability of
an event or class that is dependent on other factors,
therefore the output of Logistic Regression always lies
between 0 and 1.

Linear Discriminant Analysis (LDA) in


Machine Learning

Linear Discriminant Analysis (LDA) is one of the


commonly used dimensionality reduction techniques
in machine learning to solve more than two-class
classification problems. It is also known as Normal
Discriminant Analysis (NDA) or Discriminant
Function Analysis (DFA).

This can be used to project the features of higher


dimensional space into lower-dimensional space in
order to reduce resources and dimensional costs. In this
topic, "Linear Discriminant Analysis (LDA) in machine
learning”, we will discuss the LDA algorithm for
classification predictive modeling problems, limitations of
logistic regression, representation of linear Discriminant
analysis model, how to make a prediction using LDA,
how to prepare data for LDA, extensions to LDA and
much more. So, let's start with a quick introduction to
Linear Discriminant Analysis (LDA) in machine learning.
Note: Before starting this topic, it is recommended to
learn the basics of the Logistic Regression algorithm and to have a
basic understanding of classification problems in
machine learning as a prerequisite.

What is Linear Discriminant Analysis (LDA)?

Although the logistic regression algorithm is limited to
two-class problems, linear discriminant analysis is applicable
to classification problems with more than two classes.

Linear Discriminant analysis is one of the most


popular dimensionality reduction techniques used for
supervised classification problems in machine
learning. It is also considered a pre-processing step for
modeling differences in ML and applications of pattern
classification.

Whenever there is a requirement to separate two or


more classes having multiple features efficiently, the
Linear Discriminant Analysis model is considered the
most common technique to solve such classification
problems. For example, if we have two classes with multiple
features and need to separate them efficiently, classifying
them using a single feature may show
overlapping.
To overcome the overlapping issue in the classification
process, we must keep increasing the number of features.

Example:

Let's assume we have to classify two different classes


having two sets of data points in a 2-dimensional plane
as shown in the image below:
However, it is impossible to draw a straight line in the 2-D
plane that can separate these data points efficiently, but
using linear discriminant analysis we can
reduce the 2-D plane into a 1-D line. Using this
technique, we can also maximize the separability
between multiple classes.

How Linear Discriminant Analysis (LDA)


works?

Linear Discriminant analysis is used as a dimensionality


reduction technique in machine learning, using which we
can easily transform a 2-D and 3-D graph into a 1-
dimensional plane.

Let's consider an example where we have two classes in


a 2-D plane having an X-Y axis, and we need to classify
them efficiently. As we have already seen in the above
example that LDA enables us to draw a straight line that
can completely separate the two classes of the data
points. Here, LDA uses an X-Y axis to create a new axis
by separating them using a straight line and projecting
data onto a new axis.

Hence, we can maximize the separation between these


classes and reduce the 2-D plane into 1-D.
To create a new axis, Linear Discriminant Analysis uses
the following criteria:

o It maximizes the distance between means of two
classes.
o It minimizes the variance within the individual class.

Using the above two conditions, LDA generates a new


axis in such a way that it can maximize the distance
between the means of the two classes and minimizes the
variation within each class.

In other words, we can say that the new axis will increase
the separation between the data points of the two
classes and plot them onto the new axis.
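A minimal sketch of this projection, assuming scikit-learn and the iris dataset (not part of the original notes): LDA uses the class labels to project the data onto a smaller number of discriminant axes.

# Minimal sketch of LDA as supervised dimensionality reduction (illustrative only)
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=1)  # project onto one discriminant axis
X_1d = lda.fit_transform(X, y)                    # uses the class labels, unlike PCA

print(X_1d[:5])          # original 4-D samples mapped onto a 1-D axis
print(lda.score(X, y))   # LDA can also classify the samples directly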

Why LDA?
o Logistic Regression is one of the most popular
classification algorithms that perform well for binary
classification but falls short in the case of multiple
classification problems with well-separated classes.
At the same time, LDA handles these quite
efficiently.
o LDA can also be used in data pre-processing to
reduce the number of features, just as PCA, which
reduces the computing cost significantly.
o LDA is also used in face detection algorithms. In
Fisherfaces, LDA is used to extract useful data from
different faces. Coupled with eigenfaces, it produces
effective results.

Drawbacks of Linear Discriminant Analysis


(LDA)

LDA is specifically used to solve supervised
classification problems for two or more classes, which is
not possible using logistic regression in machine
learning. However, LDA fails in some cases where the
means of the distributions are shared. In such cases, LDA fails
to create a new axis that makes both classes linearly
separable.

To overcome such problems, we use non-linear


Discriminant analysis in machine learning.

Extension to Linear Discriminant Analysis


(LDA)
Linear Discriminant analysis is one of the most simple
and effective methods to solve classification problems in
machine learning. It has so many extensions and
variations as follows:

1. Quadratic Discriminant Analysis (QDA): For


multiple input variables, each class deploys its own
estimate of variance.
2. Flexible Discriminant Analysis (FDA): It is used
when non-linear combinations of inputs are
used, such as splines.
3. Regularized Discriminant Analysis (RDA): This uses
regularization in the estimate of the variance
(actually covariance) and hence moderates the
influence of different variables on LDA.

Real-world Applications of LDA

Some of the common real-world applications of Linear


discriminant Analysis are given below:

o Face Recognition
Face recognition is a popular application of
computer vision, where each face is represented as
the combination of a number of pixel values. In this
case, LDA is used to minimize the number of
features to a manageable number before going
through the classification process. It generates a
new template in which each dimension consists of a
linear combination of pixel values. If a linear
combination is generated using Fisher's linear
discriminant, then it is called Fisher's face.
o Medical
In the medical field, LDA has a great application in
classifying the patient disease on the basis of
various parameters of patient health and the
medical treatment which is going on. On such
parameters, it classifies disease as mild, moderate,
or severe. This classification helps the doctors in
either increasing or decreasing the pace of the
treatment.
o Customer Identification
In customer identification, LDA is currently being
applied. It means with the help of LDA; we can
easily identify and select the features that can
specify the group of customers who are likely to
purchase a specific product in a shopping mall. This
can be helpful when we want to identify a group of
customers who mostly purchase a product in a
shopping mall.
o For Predictions
LDA can also be used for making predictions and so
in decision making. For example, "Will you buy this
product?" will give a predicted result of one of
two possible classes: buying or not buying.
o In Learning
Nowadays, robots are being trained for learning
and talking to simulate human work, and it can also
be considered a classification problem. In this case,
LDA builds similar groups on the basis of different
parameters, including pitches, frequencies, sound,
tunes, etc.

Difference between Linear Discriminant


Analysis and PCA

Below are some basic differences between LDA and PCA:

o PCA is an unsupervised algorithm that does not


care about classes and labels and only aims to find
the principal components to maximize the variance
in the given dataset. At the same time, LDA is a
supervised algorithm that aims to find the linear
discriminants to represent the axes that maximize
separation between different classes of data.
o LDA is much more suitable for multi-class
classification tasks compared to PCA. However, PCA
is assumed to perform better when the
sample size is comparatively small.
o Both LDA and PCA are used as dimensionality
reduction techniques, where PCA is often applied first,
followed by LDA.


How to Prepare Data for LDA


Below are some suggestions that one should always
consider while preparing the data to build the LDA
model:

o Classification Problems: LDA is mainly applied for


classification problems to classify the categorical
output variable. It is suitable for both binary and
multi-class classification problems.
o Gaussian Distribution: The standard LDA model
applies the Gaussian Distribution of the input
variables. One should review the univariate
distribution of each attribute and transform them
into more Gaussian-looking distributions. For e.g.,
use log and root for exponential distributions and
Box-Cox for skewed distributions.
o Remove Outliers: It is good to firstly remove the
outliers from your data because these outliers can
skew the basic statistics used to separate classes in
LDA, such as the mean and the standard deviation.
o Same Variance: As LDA always assumes that all the
input variables have the same variance, hence it is
always a better way to firstly standardize the data
before implementing an LDA model. By this, the
Mean will be 0, and it will have a standard deviation
of 1.
Decision Tree


A decision tree is one of the most powerful tools of


supervised learning algorithms used for both classification
and regression tasks. It builds a flowchart-like tree structure
where each internal node denotes a test on an attribute, each
branch represents an outcome of the test, and each leaf node
(terminal node) holds a class label. It is constructed by
recursively splitting the training data into subsets based on the
values of the attributes until a stopping criterion is met, such
as the maximum depth of the tree or the minimum number of
samples required to split a node.
During training, the Decision Tree algorithm selects the best
attribute to split the data based on a metric such as entropy or
Gini impurity, which measures the level of impurity or
randomness in the subsets. The goal is to find the attribute that
maximizes the information gain or the reduction in impurity
after the split.
What is a Decision Tree?
A decision tree is a flowchart-like tree structure where each
internal node denotes the feature, branches denote the rules
and the leaf nodes denote the result of the algorithm. It is a
versatile supervised machine-learning algorithm, which is
used for both classification and regression problems. It is one
of the very powerful algorithms. And it is also used in
Random Forest to train on different subsets of training data,
which makes random forest one of the most powerful
algorithms in machine learning.
Decision Tree Terminologies
Some of the common Terminologies used in Decision Trees
are as follows:
 Root Node: It is the topmost node in the tree, which

represents the complete dataset. It is the starting point of


the decision-making process.
 Decision/Internal Node: A node that symbolizes a choice
regarding an input feature. Branching off of internal nodes
connects them to leaf nodes or other internal nodes.
 Leaf/Terminal Node: A node without any child nodes that
indicates a class label or a numerical value.
 Splitting: The process of splitting a node into two or more
sub-nodes using a split criterion and a selected feature.
 Branch/Sub-Tree: A subsection of the decision tree starts
at an internal node and ends at the leaf nodes.
 Parent Node: The node that divides into one or more child
nodes.
 Child Node: The nodes that emerge when a parent node is
split.
 Impurity: A measurement of the target variable’s
homogeneity in a subset of data. It refers to the degree of
randomness or uncertainty in a set of examples. The Gini
index and entropy are two commonly used impurity
measurements in decision trees for classification tasks.
 Variance: Variance measures how much the predicted and
the target variables vary across different samples of a dataset.
It is used for regression problems in decision trees. Mean
squared error, Mean Absolute Error are used to measure
the variance for the regression tasks in the decision tree.
 Information Gain: Information gain is a measure of the
reduction in impurity achieved by splitting a dataset on a
particular feature in a decision tree. The splitting criterion is
determined by the feature that offers the greatest
information gain, It is used to determine the most
informative feature to split on at each node of the tree, with
the goal of creating pure subsets
 Pruning: The process of removing branches from the tree
that do not provide any additional information or lead to
overfitting.
Decision Tree

Attribute Selection Measures:


Construction of Decision Tree: A tree can be “learned” by
splitting the source set into subsets based on Attribute
Selection Measures. Attribute selection measure (ASM) is a
criterion used in decision tree algorithms to evaluate the
usefulness of different attributes for splitting a dataset. The
goal of ASM is to identify the attribute that will create the
most homogeneous subsets of data after the split, thereby
maximizing the information gain. This process is repeated on
each derived subset in a recursive manner called recursive
partitioning. The recursion is completed when the subset at a
node all has the same value of the target variable, or when
splitting no longer adds value to the predictions. The
construction of a decision tree classifier does not require any
domain knowledge or parameter setting and therefore is
appropriate for exploratory knowledge discovery. Decision
trees can handle high-dimensional data.
Entropy:
Entropy is the measure of the degree of randomness or
uncertainty in the dataset. In the case of classifications, It
measures the randomness based on the distribution of class
labels in the dataset.
The entropy for a subset of the original dataset having K
classes (for the ith node) can be defined as:

Entropy(S) = - Σ (k = 1 to K) p(k) log2 p(k)

Where,
 S is the dataset sample.
 k is a particular class from the K classes.
 p(k) is the proportion of the data points that belong to class
k out of the total number of data points in dataset sample
S.
 Here p(k) should not be equal to zero.

Important points related to Entropy:


1. The entropy is 0 when the dataset is completely
homogeneous, meaning that each instance belongs to the
same class. It is the lowest entropy indicating no
uncertainty in the dataset sample.
2. when the dataset is equally divided between multiple
classes, the entropy is at its maximum value. Therefore,
entropy is highest when the distribution of class labels is
even, indicating maximum uncertainty in the dataset
sample.
3. Entropy is used to evaluate the quality of a split. The goal
of entropy is to select the attribute that minimizes the
entropy of the resulting subsets, by splitting the dataset into
more homogeneous subsets with respect to the class labels.
4. The highest information gain attribute is chosen as the
splitting criterion (i.e., the reduction in entropy after
splitting on that attribute), and the process is repeated
recursively to build the decision tree.
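A minimal sketch of this entropy computation, assuming NumPy (the labels are hypothetical):

# Minimal sketch: entropy of a set of class labels (illustrative only)
import numpy as np

def entropy(labels):
    # H(S) = - sum over classes k of p(k) * log2(p(k))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy(['yes', 'yes', 'no', 'no']))    # 1.0 (maximum uncertainty)
print(entropy(['yes', 'yes', 'yes', 'yes']))  # 0 (homogeneous set; may print as -0.0)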
Gini Impurity or Index:
Gini Impurity is a score that evaluates how accurate a split is
among the classified groups. The Gini Impurity takes a
value in the range between 0 and 1, where 0 means all
observations belong to one class, and values near 1 indicate a
random distribution of the elements across classes. We
therefore want a Gini index score that is as low as possible. The
Gini index is the evaluation metric we shall use to evaluate our
decision tree model. It is calculated as:

Gini(S) = 1 - Σ i (p_i)^2

Here,
 p_i is the proportion of elements in the set that belong to the
ith category.
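A minimal sketch of the Gini impurity computation, assuming NumPy (the labels are hypothetical):

# Minimal sketch: Gini impurity of a set of class labels (illustrative only)
import numpy as np

def gini(labels):
    # Gini(S) = 1 - sum over classes i of p_i^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(['safe', 'safe', 'risky', 'risky']))  # 0.5 (evenly mixed)
print(gini(['safe', 'safe', 'safe', 'safe']))    # 0.0 (pure node)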
Information Gain:
Information gain measures the reduction in entropy or
variance that results from splitting a dataset based on a
specific property. It is used in decision tree algorithms to
determine the usefulness of a feature by partitioning the
dataset into more homogeneous subsets with respect to the
class labels or target variable. The higher the information
gain, the more valuable the feature is in predicting the target
variable.
The information gain of an attribute A, with respect to a
dataset S, is calculated as follows:

IG(S, A) = H(S) - Σ (v in values(A)) (|S_v| / |S|) * H(S_v)

where
 A is the specific attribute or class label,
 H(S) is the entropy of dataset sample S,
 |S_v| is the number of instances in the subset S_v that have the
value v for attribute A, and |S| is the total number of instances
in S.

Information gain measures the reduction in entropy or
variance achieved by partitioning the dataset on attribute A.
The attribute that maximizes information gain is chosen as the
splitting criterion for building the decision tree.
Information gain is used in both classification and regression
decision trees. In classification, entropy is used as the measure
of impurity, while in regression, variance is used as the measure
of impurity. The information gain calculation remains the
same in both cases, except that variance is used in place
of entropy in the formula for regression.
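A minimal sketch of the information gain computation described above, assuming NumPy and a hypothetical "outlook" attribute (illustrative only):

# Minimal sketch: information gain of a split on one attribute (illustrative only)
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attribute_values, labels):
    # IG(S, A) = H(S) - sum over values v of (|S_v| / |S|) * H(S_v)
    attribute_values = np.asarray(attribute_values)
    labels = np.asarray(labels)
    weighted = 0.0
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        weighted += (len(subset) / len(labels)) * entropy(subset)
    return entropy(labels) - weighted

# Hypothetical attribute "outlook" against a yes/no target
outlook = ['sunny', 'sunny', 'rain', 'rain', 'overcast', 'overcast']
play    = ['no',    'no',    'yes',  'no',   'yes',      'yes']
print(information_gain(outlook, play))  # about 0.667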
How does the Decision Tree algorithm Work?
The decision tree operates by analyzing the data set to predict
its classification. It commences from the tree’s root node,
where the algorithm views the value of the root attribute
compared to the attribute of the record in the actual data set.
Based on the comparison, it proceeds to follow the branch and
move to the next node.
The algorithm repeats this action for every subsequent node
by comparing its attribute values with those of the sub-nodes
and continuing the process further. It repeats until it reaches
the leaf node of the tree. The complete mechanism can be
better explained through the algorithm given below.
 Step-1: Begin the tree with the root node, says S, which

contains the complete dataset.


 Step-2: Find the best attribute in the dataset using Attribute

Selection Measure (ASM).


 Step-3: Divide the S into subsets that contains possible

values for the best attributes.


 Step-4: Generate the decision tree node, which contains the

best attribute.
 Step-5: Recursively make new decision trees using the

subsets of the dataset created in step -3. Continue this


process until a stage is reached where you cannot further
classify the nodes; the final node is called a leaf
node. This is the Classification and Regression Tree algorithm.
Advantages of the Decision Tree:
1. It is simple to understand as it follows the same process
which a human follows while making any decision in real
life.
2. It can be very useful for solving decision-related problems.
3. It helps to think about all the possible outcomes for a
problem.
4. There is less requirement of data cleaning compared to
other algorithms.
Disadvantages of the Decision Tree:
1. The decision tree contains lots of layers, which makes it
complex.
2. It may have an overfitting issue, which can be resolved
using the Random Forest algorithm.
3. For more class labels, the computational complexity of the
decision tree may increase.
What are appropriate problems for Decision tree
learning?
Although a variety of decision tree learning methods have
been developed with somewhat differing capabilities and
requirements, decision tree learning is generally best suited to
problems with the following characteristics:
1. Instances are represented by attribute-value pairs:
In the world of decision tree learning, we commonly use
attribute-value pairs to represent instances. An instance is
defined by a predetermined group of attributes, such as
temperature, and its corresponding value, such as hot. Ideally,
we want each attribute to have a finite set of distinct values,
like hot, mild, or cold. This makes it easy to construct
decision trees. However, more advanced versions of the
algorithm can accommodate attributes with continuous
numerical values, such as representing temperature with a
numerical scale.
2. The target function has discrete output values:
The decision tree method is ordinarily employed for learning
Boolean-valued functions, such as yes or no. Decision tree
approaches can be readily extended to learning functions with
more than two possible output values. A more substantial
extension allows learning target functions with numeric
outputs, although the use of decision trees in this
setting is comparatively rare.
3. Disjunctive descriptions may be required:
Decision trees naturally represent disjunctive expressions.
4. The training data may contain errors:
Decision tree learning methods are robust to errors, both
errors in the classification of the training examples and
errors in the attribute values that describe these
examples.
5. The training data may contain missing attribute values:
In certain cases, the input information designed for training
might have absent characteristics. Employing decision tree
approaches can still be possible despite experiencing
unknown features in some training samples. For instance,
when considering the level of humidity throughout the day,
this information may only be accessible for a specific set of
training specimens.
Practical issues in learning decision trees include:
 Determining how deeply to grow the decision tree,
 Handling continuous attributes,
 Choosing an appropriate attribute selection measure,
 Handling training data with missing attribute values,
 Handling attributes with differing costs, and
 Improving computational efficiency.

To build the Decision Tree, CART (Classification and


Regression Tree) algorithm is used. It works by selecting the
best split at each node based on metrics like Gini impurity or
information gain in order to create a decision tree. Here are
the basic steps of the CART algorithm:
1. The root node of the tree is supposed to be the complete
training dataset.
2. Determine the impurity of the data based on each feature
present in the dataset. Impurity can be measured using
metrics like the Gini index or entropy for classification and
Mean squared error, Mean Absolute Error, friedman_mse,
or Half Poisson deviance for regression.
3. Then selects the feature that results in the highest
information gain or impurity reduction when splitting the
data.
4. For each possible value of the selected feature, split the
dataset into two subsets (left and right), one where the
feature takes on that value, and another where it does not.
The split should be designed to create subsets that are as
pure as possible with respect to the target variable.
5. Based on the target variable, determine the impurity of each
resulting subset.
6. For each subset, repeat steps 2–5 iteratively until a stopping
condition is met. For example, the stopping condition could
be a maximum tree depth, a minimum number of samples
required to make a split or a minimum impurity threshold.
7. Assign the majority class label for classification tasks or the
mean value for regression tasks for each terminal node (leaf
node) in the tree.
Classification and Regression Tree algorithm for
Classification
Let the data available at node m be Qm with nm samples,
and let tm be the candidate threshold for node m. Then the
classification and regression tree algorithm for classification
can be written as:

G(Qm, tm) = (nm_left / nm) * H(Qm_left) + (nm_right / nm) * H(Qm_right)

Here,
 H is the measure of impurity of the left and right subsets
at node m; it can be entropy or Gini impurity.
 nm_left and nm_right are the numbers of instances in the left
and right subsets at node m.
To select the split parameter, we choose the threshold that
minimizes this quantity:

tm* = argmin over tm of G(Qm, tm)
Example:
 Python3

# Import the necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from graphviz import Source

# Load the dataset
iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

# DecisionTreeClassifier
tree_clf = DecisionTreeClassifier(criterion='entropy', max_depth=2)
tree_clf.fit(X, y)

# Plot the decision tree graph
export_graphviz(
    tree_clf,
    out_file="iris_tree.dot",
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True
)

with open("iris_tree.dot") as f:
    dot_graph = f.read()
Source(dot_graph)

Output:
Decision Tree Classifier

Classification and Regression Tree algorithm for


Regression
Let the data available at node m be Qm with nm samples,
and let tm be the candidate threshold for node m. Then the
classification and regression tree algorithm for regression can
be written as:

G(Qm, tm) = (nm_left / nm) * MSE(Qm_left) + (nm_right / nm) * MSE(Qm_right)

Here,
 MSE is the mean squared error of the left and right subsets at
node m.
 nm_left and nm_right are the numbers of instances in the left
and right subsets at node m.
To select the split parameter, we choose the threshold that
minimizes this quantity:

tm* = argmin over tm of G(Qm, tm)
Example:
 Python3

# Import the necessary libraries
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_graphviz
from graphviz import Source

# Load the dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(criterion='squared_error', max_depth=2)
tree_reg.fit(X, y)

# Plot the decision tree graph
export_graphviz(
    tree_reg,
    out_file="diabetes_tree.dot",
    feature_names=diabetes.feature_names,
    rounded=True,
    filled=True
)

with open("diabetes_tree.dot") as f:
    dot_graph = f.read()
Source(dot_graph)

Output:
Decision Tree Regression

Strengths and Weaknesses of the Decision Tree Approach


The strengths of decision tree methods are:
 Decision trees are able to generate understandable rules.

 Decision trees perform classification without requiring

much computation.
 Decision trees are able to handle both continuous and

categorical variables.
 Decision trees provide a clear indication of which fields are

most important for prediction or classification.


 Ease of use: Decision trees are simple to use and don’t

require a lot of technical expertise, making them accessible


to a wide range of users.
 Scalability: Decision trees can handle large datasets and can

be easily parallelized to improve processing time.


 Missing value tolerance: Decision trees are able to handle

missing values in the data, making them a suitable choice


for datasets with missing or incomplete data.
 Handling non-linear relationships: Decision trees can
handle non-linear relationships between variables, making
them a suitable choice for complex datasets.
 Ability to handle imbalanced data: Decision trees can

handle imbalanced datasets, where one class is heavily


represented compared to the others, by weighting the
importance of individual nodes based on the class
distribution.
The weaknesses of decision tree methods :
 Decision trees are less appropriate for estimation tasks

where the goal is to predict the value of a continuous


attribute.
 Decision trees are prone to errors in classification problems

with many classes and a relatively small number of training


examples.
 Decision trees can be computationally expensive to train.

The process of growing a decision tree is computationally


expensive. At each node, each candidate splitting field must
be sorted before its best split can be found. In some
algorithms, combinations of fields are used and a search
must be made for optimal combining weights. Pruning
algorithms can also be expensive since many candidate sub-
trees must be formed and compared.
 Decision trees are prone to overfitting the training data,

particularly when the tree is very deep or complex. This can


result in poor performance on new, unseen data.
 Small variations in the training data can result in different

decision trees being generated, which can be a problem


when trying to compare or reproduce results.
 Many decision tree algorithms do not handle missing data

well, and require imputation or deletion of records with


missing values.
 The initial splitting criteria used in decision tree algorithms
can lead to biased trees, particularly when dealing with
unbalanced datasets or rare classes.
 Decision trees are limited in their ability to represent

complex relationships between variables, particularly when


dealing with nonlinear or interactive effects.
 Decision trees can be sensitive to the scaling of input

features, particularly when using distance-based metrics or


decision rules that rely on comparisons between values.
Implementation (C++):

// Importing required headers
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstdlib>
#include <ctime>
#include <fstream>
#include <iostream>
#include <iterator>
#include <queue>
#include <random>
#include <sstream>
#include <vector>
#include <stdio.h>
using namespace std;

int main()
{
    // Generating random data for classification
    int X[100][5];
    int t[100];
    srand(10);
    for (int i = 0; i < 100; i++) {
        for (int j = 0; j < 5; j++) {
            X[i][j] = rand() % 2;
            t[i] = rand() % 2;
        }
    }

    // Splitting data into train and test sets
    int X_train[70][5];
    int X_test[30][5];
    int t_train[70];
    int t_test[30];
    for (int i = 0; i < 70; i++) {
        for (int j = 0; j < 5; j++) {
            X_train[i][j] = X[i][j];
            t_train[i] = t[i];
        }
    }
    for (int i = 0; i < 30; i++) {
        for (int j = 0; j < 5; j++) {
            X_test[i][j] = X[i + 70][j];
            t_test[i] = t[i + 70];
        }
    }

    // Randomly predicting binary values for the test set
    int predicted_value[30];
    for (int i = 0; i < 30; i++) {
        predicted_value[i] = rand() % 2;
    }

    // Printing predicted binary values for the test set
    for (int i = 0; i < 30; i++) {
        cout << predicted_value[i] << " ";
    }
    cout << "\n";

    // Calculating number of 0s and 1s in the train set
    int zeroes = 0;
    int ones = 0;
    for (int i = 0; i < 70; i++) {
        if (t_train[i] == 0) {
            zeroes += 1;
        }
        else {
            ones += 1;
        }
    }

    // Calculating Gini index of the train labels
    float val = 1
                - ((zeroes / 70.0) * (zeroes / 70.0)
                   + (ones / 70.0) * (ones / 70.0));
    cout << "Gini : " << val << "\n";

    // Calculating accuracy of the random predictions
    int match = 0;
    int UnMatch = 0;
    for (int i = 0; i < 30; i++) {
        if (predicted_value[i] == t_test[i]) {
            match += 1;
        }
        else {
            UnMatch += 1;
        }
    }
    float accuracy = match / 30.0;
    cout << "Accuracy is: " << accuracy << "\n";

    // Returning 0 on successful completion
    return 0;
}

Output
1 1 0 0 1 0 1 0 1 0 0 1 1 0 0 1 0 1 1 1 0 0 0
1 1 0 0 0 1 0
Gini : 0.5
Accuracy is: 0.366667
Decision Tree Classification Algorithm

o Decision Tree is a Supervised learning


technique that can be used for both classification
and Regression problems, but mostly it is preferred
for solving Classification problems. It is a tree-
structured classifier, where internal nodes
represent the features of a dataset, branches
represent the decision rules and each leaf node
represents the outcome.
o In a Decision tree, there are two types of nodes:
the Decision Node and the Leaf Node. Decision nodes
are used to make any decision and have multiple
branches, whereas Leaf nodes are the output of
those decisions and do not contain any further
branches.
o The decisions or the test are performed on the basis
of features of the given dataset.
o It is a graphical representation for getting all the
possible solutions to a problem/decision based
on given conditions.
o It is called a decision tree because, similar to a tree,
it starts with the root node, which expands on
further branches and constructs a tree-like
structure.
o In order to build a tree, we use the CART
algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question, and based
on the answer (Yes/No), it further splits the tree into
subtrees.
o Below diagram explains the general structure of a
decision tree:
Note: A decision tree can contain categorical data
(YES/NO) as well as numeric data.
Why use Decision Trees?

There are various algorithms in Machine learning, so


choosing the best algorithm for the given dataset and
problem is the main point to remember while creating a
machine learning model. Below are the two reasons for
using the Decision tree:

o Decision Trees usually mimic human thinking ability


while making a decision, so it is easy to understand.
o The logic behind the decision tree can be easily
understood because it shows a tree-like structure.

Decision Tree Terminologies


 Root Node: Root node is from where the decision tree
starts. It represents the entire dataset, which further gets
divided into two or more homogeneous sets.
 Leaf Node: Leaf nodes are the final output node, and the
tree cannot be segregated further after getting a leaf node.
 Splitting: Splitting is the process of dividing the decision
node/root node into sub-nodes according to the given
conditions.
 Branch/Sub Tree: A tree formed by splitting the tree.
 Pruning: Pruning is the process of removing the
unwanted branches from the tree.
 Parent/Child node: The root node of the tree is called
the parent node, and other nodes are called the child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, for predicting the class of the given


dataset, the algorithm starts from the root node of the
tree. This algorithm compares the values of root
attribute with the record (real dataset) attribute and,
based on the comparison, follows the branch and jumps
to the next node.

For the next node, the algorithm again compares the


attribute value with the other sub-nodes and move
further. It continues the process until it reaches the leaf
node of the tree. The complete process can be better
understood using the below algorithm:
o Step-1: Begin the tree with the root node, says S,
which contains the complete dataset.
o Step-2: Find the best attribute in the dataset
using Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the
possible values of the best attribute.
o Step-4: Generate the decision tree node, which
contains the best attribute.
o Step-5: Recursively make new decision trees using
the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where
the nodes cannot be classified further; such a final
node is called a leaf node, as sketched in the code below.

Example: Suppose there is a candidate who has a job


offer and wants to decide whether he should accept the
offer or Not. So, to solve this problem, the decision tree
starts with the root node (Salary attribute by ASM). The
root node splits further into the next decision node
(distance from the office) and one leaf node based on
the corresponding labels. The next decision node further
gets split into one decision node (Cab facility) and one
leaf node. Finally, the decision node splits into two leaf
nodes (Accepted offer and Declined offer). Consider the
below diagram:
Attribute Selection Measures

While implementing a Decision tree, the main issue that
arises is how to select the best attribute for the root
node and for the sub-nodes. To solve such problems,
there is a technique called the Attribute selection
measure, or ASM. With this measure, we can easily
select the best attribute for the nodes of the tree.
There are two popular techniques for ASM, which are:

o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of changes in
entropy after the segmentation of a dataset based
on an attribute.
o It calculates how much information a feature
provides us about a class.
o According to the value of information gain, we split
the node and build the decision tree.
o A decision tree algorithm always tries to maximize
the value of information gain, and a node/attribute
having the highest information gain is split first. It
can be calculated using the below formula:

1. Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in


a given attribute. It specifies randomness in data.
Entropy can be calculated as:
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)

Where,

o S= Total number of samples


o P(yes)= probability of yes
o P(no)= probability of no
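
As a quick numerical check of these two formulas, the short sketch below computes the entropy of a set of labels and the information gain obtained by splitting on a single feature (the helper functions and the toy data are illustrative, not part of any library):

import math

def entropy(labels):
    # Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(feature_values, labels):
    # Information Gain = Entropy(S) - sum of (weight * entropy of each subset)
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Toy example: 3 "yes" and 2 "no" samples, split by an Outlook-like feature
labels  = ["yes", "yes", "no", "no", "yes"]
feature = ["sunny", "rain", "sunny", "rain", "rain"]
print(entropy(labels))                    # about 0.971
print(information_gain(feature, labels))  # gain obtained by splitting on the feature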

2. Gini Index:
o Gini index is a measure of impurity or purity used
while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be
preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm
uses the Gini index to create binary splits.
o Gini index can be calculated using the below
formula:
Gini Index = 1 − ∑j Pj²
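
For example, evaluating this formula for two possible nodes shows why a lower Gini index is preferred (a minimal sketch; the class probabilities are made up for illustration):

# Gini Index = 1 - sum_j (P_j)^2
def gini_index(class_probabilities):
    return 1 - sum(p ** 2 for p in class_probabilities)

print(gini_index([0.5, 0.5]))   # 0.5  -> maximally impure binary node
print(gini_index([0.9, 0.1]))   # 0.18 -> much purer node, preferred when splitting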

Pruning: Getting an Optimal Decision tree

Pruning is a process of deleting the unnecessary nodes


from a tree in order to get the optimal decision tree.

A too-large tree increases the risk of overfitting, and a


small tree may not capture all the important features of
the dataset. Therefore, a technique that decreases the
size of the learning tree without reducing accuracy is
known as Pruning. There are mainly two types of
tree pruning techniques used:

o Cost Complexity Pruning
o Reduced Error Pruning.
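
As one concrete way to apply cost complexity pruning, scikit-learn exposes a ccp_alpha parameter on DecisionTreeClassifier (in versions that support it). The sketch below assumes training and test arrays such as the x_train, y_train, x_test, y_test built in the implementation section later in this unit; larger values of alpha give smaller trees.

# Hedged sketch of cost complexity pruning with scikit-learn
# (assumes x_train, y_train, x_test, y_test are already prepared).
from sklearn.tree import DecisionTreeClassifier

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(x_train, y_train)

for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(x_train, y_train)
    # Larger alpha -> fewer leaves; pick the alpha with the best test accuracy.
    print(alpha, pruned.get_n_leaves(), pruned.score(x_test, y_test))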

Advantages of the Decision Tree


o It is simple to understand, as it follows the same
process that a human follows while making any
decision in real life.
o It can be very useful for solving decision-related
problems.
o It helps to think about all the possible outcomes for
a problem.
o There is less requirement of data cleaning
compared to other algorithms.

Disadvantages of the Decision Tree

o The decision tree contains lots of layers, which


makes it complex.
o It may have an overfitting issue, which can be
resolved using the Random Forest algorithm.
o For more class labels, the computational complexity
of the decision tree may increase.

Python Implementation of Decision Tree

Now we will implement the Decision tree using Python.


For this, we will use the dataset "user_data.csv," which
we have used in previous classification models. By using
the same dataset, we can compare the Decision tree
classifier with other classification models such
as KNN, SVM, Logistic Regression, etc.

Steps will also remain the same, which are given below:

o Data Pre-processing step


o Fitting a Decision-Tree algorithm to the Training
set
o Predicting the test result
o Test accuracy of the result(Creation of Confusion
matrix)
o Visualizing the test set result.

1. Data Pre-Processing Step:

Below is the code for the pre-processing step:

1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd
5.
6. #importing datasets
7. data_set= pd.read_csv('user_data.csv')
8.
9. #Extracting Independent and dependent Variable
10. x= data_set.iloc[:, [2,3]].values
11. y= data_set.iloc[:, 4].values
12.
13. # Splitting the dataset into training and test set.
14. from sklearn.model_selection import train_test_split

15. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
16.
17. #feature Scaling
18. from sklearn.preprocessing import StandardScaler

19. st_x= StandardScaler()


20. x_train= st_x.fit_transform(x_train)
21. x_test= st_x.transform(x_test)

In the above code, we have pre-processed the data.


Where we have loaded the dataset, which is given as:

2. Fitting a Decision-Tree algorithm to the


Training set
Now we will fit the model to the training set. For this, we
will import the DecisionTreeClassifier class
from sklearn.tree library. Below is the code for it:

1. #Fitting Decision Tree classifier to the training set


2. from sklearn.tree import DecisionTreeClassifier
3. classifier= DecisionTreeClassifier(criterion='entropy', random_state=0)
4. classifier.fit(x_train, y_train)

In the above code, we have created a classifier object, in


which we have passed two main parameters;

o "criterion='entropy': Criterion is used to measure


the quality of split, which is calculated by
information gain given by entropy.
o random_state=0": For generating the random
states.

Below is the output for this:


Out[8]:
DecisionTreeClassifier(class_weight=None,
criterion='entropy', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort=False,
random_state=0,
splitter='best')

3. Predicting the test result

Now we will predict the test set result. We will create a


new prediction vector y_pred. Below is the code for it:

1. #Predicting the test set result


2. y_pred= classifier.predict(x_test)

Output:

In the below output image, the predicted output and


real test output are given. We can clearly see that there
are some values in the prediction vector, which are
different from the real vector values. These are
prediction errors.

4. Test accuracy of the result (Creation of
Confusion matrix)

In the above output, we have seen that there were some


incorrect predictions, so if we want to know the number
of correct and incorrect predictions, we need to use the
confusion matrix. Below is the code for it:

1. #Creating the Confusion matrix


2. from sklearn.metrics import confusion_matrix
3. cm= confusion_matrix(y_test, y_pred)

Output:

In the above output image, we can see the confusion


matrix, which has 6+3 = 9 incorrect
predictions and 62+29 = 91 correct predictions.
Therefore, we can say that compared to other
classification models, the Decision Tree classifier
made a good prediction.
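
The overall accuracy can also be read directly from this confusion matrix, since the correct predictions sit on its diagonal. A small follow-up sketch (using the cm, y_test and y_pred objects created above):

from sklearn.metrics import accuracy_score

accuracy = cm.trace() / cm.sum()          # correct predictions / all predictions
print(accuracy)                           # 91 / 100 = 0.91 for the matrix above
print(accuracy_score(y_test, y_pred))     # same value computed directly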

5. Visualizing the training set result:

Here we will visualize the training set result. To visualize


the training set result we will plot a graph for the
decision tree classifier. The classifier will predict yes or
No for the users who have either Purchased or Not
purchased the SUV car as we did in Logistic
Regression. Below is the code for it:
1. #Visualizing the training set result
2. from matplotlib.colors import ListedColormap
3. x_set, y_set = x_train, y_train
4. x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min()
- 1, stop = x_set[:, 0].max() + 1, step =0.01),
5. nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].m
ax() + 1, step = 0.01))
6. mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(),
x2.ravel()]).T).reshape(x1.shape),
7. alpha = 0.75, cmap = ListedColormap(('purple','green' )))

8. mtp.xlim(x1.min(), x1.max())
9. mtp.ylim(x2.min(), x2.max())
10. for i, j in enumerate(nm.unique(y_set)):
11. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
12. c = ListedColormap(('purple', 'green'))(i), label
= j)
13. mtp.title('Decision Tree Algorithm (Training set)')
14. mtp.xlabel('Age')
15. mtp.ylabel('Estimated Salary')
16. mtp.legend()
17. mtp.show()

Output:
The above output is completely different from the rest
classification models. It has both vertical and horizontal
lines that are splitting the dataset according to the age
and estimated salary variable.

As we can see, the tree is trying to capture each data
point, which is a case of overfitting; one way to reduce
it is sketched below.
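
One simple way to reduce this overfitting is to constrain the tree while fitting it, for example by limiting its depth or the minimum number of samples per leaf. The values below are only illustrative, not tuned:

# Refitting the classifier with a depth limit to reduce overfitting (sketch).
from sklearn.tree import DecisionTreeClassifier

simpler = DecisionTreeClassifier(criterion='entropy', max_depth=4,
                                 min_samples_leaf=5, random_state=0)
simpler.fit(x_train, y_train)
print(simpler.score(x_train, y_train))   # training accuracy
print(simpler.score(x_test, y_test))     # test accuracy, typically closer to training now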

6. Visualizing the test set result:

Visualization of test set result will be similar to the


visualization of the training set except that the training
set will be replaced with the test set.

1. #Visualizing the test set result


2. from matplotlib.colors import ListedColormap
3. x_set, y_set = x_test, y_test
4. x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min()
- 1, stop = x_set[:, 0].max() + 1, step =0.01),
5. nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].m
ax() + 1, step = 0.01))
6. mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(),
x2.ravel()]).T).reshape(x1.shape),
7. alpha = 0.75, cmap = ListedColormap(('purple','green' )))

8. mtp.xlim(x1.min(), x1.max())
9. mtp.ylim(x2.min(), x2.max())
10. for i, j in enumerate(nm.unique(y_set)):
11. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
12. c = ListedColormap(('purple', 'green'))(i), label
= j)
13. mtp.title('Decision Tree Algorithm(Test set)')
14. mtp.xlabel('Age')
15. mtp.ylabel('Estimated Salary')
16. mtp.legend()
17. mtp.show()

Random Forest Algorithm

Random Forest is a popular machine learning algorithm


that belongs to the supervised learning technique. It can
be used for both Classification and Regression problems
in ML. It is based on the concept of ensemble
learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the
performance of the model.

As the name suggests, "Random Forest is a classifier


that contains a number of decision trees on various
subsets of the given dataset and takes the average to
improve the predictive accuracy of that
dataset." Instead of relying on one decision tree, the
random forest takes the prediction from each tree and
based on the majority votes of predictions, and it
predicts the final output.

A greater number of trees in the forest generally leads to
higher accuracy and reduces the risk of overfitting.

The below diagram explains the working of the Random


Forest algorithm:
Note: To better understand the Random Forest
Algorithm, you should have knowledge of the Decision
Tree Algorithm.

Assumptions for Random Forest

Since the random forest combines multiple trees to


predict the class of the dataset, it is possible that some
decision trees may predict the correct output, while
others may not. But together, all the trees predict the
correct output. Therefore, below are two assumptions
for a better Random forest classifier:

o There should be some actual values in the feature
variable of the dataset so that the classifier can
predict accurate results rather than a guessed result.
o The predictions from each tree must have very low
correlations.

Why use Random Forest?

Below are some points that explain why we should use


the Random Forest algorithm:

o It takes less training time as compared to other
algorithms.
o It predicts output with high accuracy, and even for a
large dataset it runs efficiently.
o It can also maintain accuracy when a large
proportion of data is missing.

How does Random Forest algorithm work?

Random Forest works in two phases: the first is to create
the random forest by combining N decision trees, and
the second is to make predictions for each tree created
in the first phase.

The Working process can be explained in the below


steps and diagram:

Step-1: Select random K data points from the training


set.
Step-2: Build the decision trees associated with the
selected data points (Subsets).

Step-3: Choose the number N for decision trees that


you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each


decision tree, and assign the new data points to the
category that wins the majority votes (a minimal sketch
of this process follows below).
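
A minimal sketch of these steps, using scikit-learn decision trees as building blocks, is given below. It only illustrates bootstrap sampling and majority voting; the RandomForestClassifier used later in this unit does this internally and additionally considers a random subset of features at each split. The helper names and the assumption of integer class labels are for illustration only.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_simple_forest(x_train, y_train, n_trees=10, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    n = len(x_train)
    for _ in range(n_trees):                       # Step-3 / Step-4: build N trees
        idx = rng.integers(0, n, size=n)           # Step-1: random sample with replacement
        tree = DecisionTreeClassifier().fit(x_train[idx], y_train[idx])   # Step-2
        trees.append(tree)
    return trees

def predict_majority(trees, x_new):
    votes = np.array([t.predict(x_new) for t in trees])   # Step-5: collect each tree's vote
    # Majority vote per sample (assumes integer class labels, e.g. 0/1)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])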

The working of the algorithm can be better understood


by the below example:

Example: Suppose there is a dataset that contains


multiple fruit images. So, this dataset is given to the
Random forest classifier. The dataset is divided into
subsets and given to each decision tree. During the
training phase, each decision tree produces a prediction
result, and when a new data point occurs, then based on
the majority of results, the Random Forest classifier
predicts the final decision. Consider the below image:
Applications of Random Forest

There are mainly four sectors where Random forest


mostly used:

1. Banking: Banking sector mostly uses this algorithm


for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease
trends and risks of the disease can be identified.
3. Land Use: We can identify the areas of similar land
use by this algorithm.
4. Marketing: Marketing trends can be identified
using this algorithm.
Advantages of Random Forest

o Random Forest is capable of performing both


Classification and Regression tasks.
o It is capable of handling large datasets with high
dimensionality.
o It enhances the accuracy of the model and prevents
the overfitting issue.

Disadvantages of Random Forest

o Although random forest can be used for both


classification and regression tasks, it is less suitable
for regression tasks.

Python Implementation of Random Forest


Algorithm

Now we will implement the Random Forest algorithm
using Python. For this, we will use the same dataset
"user_data.csv", which we have used in previous
classification models. By using the same dataset, we can
compare the Random Forest classifier with other
classification models such as Decision tree
Classifier, KNN, SVM, Logistic Regression, etc.

Implementation Steps are given below:

o Data Pre-processing step


o Fitting the Random forest algorithm to the Training
set
o Predicting the test result
o Test accuracy of the result (Creation of Confusion
matrix)
o Visualizing the test set result.

1.Data Pre-Processing Step:

Below is the code for the pre-processing step:

1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd
5.
6. #importing datasets
7. data_set= pd.read_csv('user_data.csv')
8.
9. #Extracting Independent and dependent Variable
10. x= data_set.iloc[:, [2,3]].values
11. y= data_set.iloc[:, 4].values
12.
13. # Splitting the dataset into training and test set.
14. from sklearn.model_selection import train_test_split

15. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
16.
17. #feature Scaling
18. from sklearn.preprocessing import StandardScaler
19. st_x= StandardScaler()
20. x_train= st_x.fit_transform(x_train)
21. x_test= st_x.transform(x_test)

In the above code, we have pre-processed the data.


Where we have loaded the dataset, which is given as:

2. Fitting the Random Forest algorithm to the


training set:

Now we will fit the Random forest algorithm to the


training set. To fit it, we will import
the RandomForestClassifier class from
the sklearn.ensemble library. The code is given below:

1. #Fitting Random Forest classifier to the training set
2. from sklearn.ensemble import RandomForestClassifier
3. classifier= RandomForestClassifier(n_estimators= 10, criterion="entropy")
4. classifier.fit(x_train, y_train)

In the above code, the classifier object takes below


parameters:

o n_estimators= The required number of trees in the


Random Forest. The default value is 10. We can
choose any number but need to take care of the
overfitting issue.
o criterion= It is a function to analyze the accuracy of
the split. Here we have taken "entropy" for the
information gain.

Output:
RandomForestClassifier(bootstrap=True,
class_weight=None, criterion='entropy',
max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decre
ase=0.0, min_impurity_split=None,
min_samples_leaf=1
, min_samples_split=2,
min_weight_fractio
n_leaf=0.0, n_estimators=10,
n_jobs=None,
oob_score=False, random_state=None,
verbose=0,
warm_start=False)
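
Before moving on, the effect of the n_estimators parameter described above can be checked with a quick experiment (the values tried are illustrative): more trees generally give a more stable test accuracy at the cost of longer training time.

from sklearn.ensemble import RandomForestClassifier

for n in (1, 10, 50, 100):
    rf = RandomForestClassifier(n_estimators=n, criterion="entropy", random_state=0)
    rf.fit(x_train, y_train)
    print(n, rf.score(x_test, y_test))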

3. Predicting the Test Set result

Since our model is fitted to the training set, so now we


can predict the test result. For prediction, we will create
a new prediction vector y_pred. Below is the code for it:

1. #Predicting the test set result


2. y_pred= classifier.predict(x_test)

Output:

The prediction vector is given as:


By checking the above prediction vector and test set real
vector, we can determine the incorrect predictions done
by the classifier.

4. Creating the Confusion Matrix

Now we will create the confusion matrix to determine


the correct and incorrect predictions. Below is the code
for it:

1. #Creating the Confusion matrix


2. from sklearn.metrics import confusion_matrix
3. cm= confusion_matrix(y_test, y_pred)
Output:

As we can see in the above matrix, there are 4+4= 8


incorrect predictions and 64+28= 92 correct
predictions.

5. Visualizing the training Set result

Here we will visualize the training set result. To visualize


the training set result we will plot a graph for the
Random forest classifier. The classifier will predict yes or
No for the users who have either Purchased or Not
purchased the SUV car as we did in Logistic
Regression. Below is the code for it:

1. from matplotlib.colors import ListedColormap


2. x_set, y_set = x_train, y_train
3. x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min()
- 1, stop = x_set[:, 0].max() + 1, step =0.01),
4. nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].m
ax() + 1, step = 0.01))
5. mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(),
x2.ravel()]).T).reshape(x1.shape),
6. alpha = 0.75, cmap = ListedColormap(('purple','green' )))

7. mtp.xlim(x1.min(), x1.max())
8. mtp.ylim(x2.min(), x2.max())
9. for i, j in enumerate(nm.unique(y_set)):
10. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1
],
11. c = ListedColormap(('purple', 'green'))(i), label
= j)
12. mtp.title('Random Forest Algorithm (Training set)')
13. mtp.xlabel('Age')
14. mtp.ylabel('Estimated Salary')
15. mtp.legend()
16. mtp.show()

Output:
The above image is the visualization result for the
Random Forest classifier working with the training set
result. It is very much similar to the Decision tree
classifier. Each data point corresponds to each user of
the user_data, and the purple and green regions are the
prediction regions. The purple region is classified for the
users who did not purchase the SUV car, and the green
region is for the users who purchased the SUV.

So, in the Random Forest classifier, we have taken 10


trees that have predicted Yes or NO for the Purchased
variable. The classifier took the majority of the
predictions and provided the result.

6. Visualizing the test set result

Now we will visualize the test set result. Below is the


code for it:

1. #Visualizing the test set result


2. from matplotlib.colors import ListedColormap
3. x_set, y_set = x_test, y_test
4. x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min()
- 1, stop = x_set[:, 0].max() + 1, step =0.01),
5. nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].m
ax() + 1, step = 0.01))
6. mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(),
x2.ravel()]).T).reshape(x1.shape),
7. alpha = 0.75, cmap = ListedColormap(('purple','green' )))
8. mtp.xlim(x1.min(), x1.max())
9. mtp.ylim(x2.min(), x2.max())
10. for i, j in enumerate(nm.unique(y_set)):
11. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1
],
12. c = ListedColormap(('purple', 'green'))(i), label
= j)
13. mtp.title('Random Forest Algorithm(Test set)')
14. mtp.xlabel('Age')
15. mtp.ylabel('Estimated Salary')
16. mtp.legend()
17. mtp.show()

Output:

The above image is the visualization result for the test


set. We can check that there is a minimal number of
incorrect predictions (8), without the overfitting issue.
We will get different results by changing the number of
trees in the classifier.
