
UNIT-3

Classification

Classification: Basic Concepts, General Approach to solving a classification problem, Decision Tree Induction: Working of a Decision Tree, building a decision tree, methods for expressing attribute test conditions, measures for selecting the best split, Algorithm for decision tree induction.
Basic Concepts:
What is classification? What is prediction?
Classification: It is the task of learning a target function f that maps each attribute set X to one of the predefined class labels Y.

 In data mining, classification is used for prediction (future analysis): classifier algorithms and decision trees are used to determine the unknown class labels of new instances from their attribute values.
 It constructs a model (such as a decision tree), which is then used to classify instances.
 Classification can work on both categorical and numerical attributes.

Prediction: Prediction is also used to estimate unknown or missing values.
 It likewise uses models in order to predict attribute values.
 Such models include neural networks, if-then rules, and other mechanisms.

Classification and prediction are used in applications such as:


 Credit approval
 Target marketing
 Medical diagnosis

Classification—A Two-Step Process


• Model construction: describing a set of predetermined classes

1.Each tuple/sample is assumed to belong to a predefined class, as


determined by the class label attribute
2.The set of tuples used for model construction: training set
3.The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects – Estimate accuracy
of the model
1.The known label of test sample is compared with the classified result
from the model
2.Accuracy rate is the percentage of test set samples that are correctly
classified by the model
3.Test set is independent of training set, otherwise over-fitting will occur

(Figure: Process (2) – Using the model in prediction.)

Supervised vs. Unsupervised Learning:

Supervised learning (classification)
– Supervision: the training data (observations, measurements, etc.)
are accompanied by labels indicating the class of the observations.
– New data is classified based on the training set.

Unsupervised learning (clustering)
– The class labels of the training data are unknown.
– Given a set of measurements, observations, etc., the aim is to
establish the existence of classes or clusters in the data.

A classification model serves two important roles in data mining.

Predictive Model: First, it is used as a predictive model to classify previously


unlabelled instances. A good classification model must provide accurate
predictions with a fast response time.

Descriptive Model: Second, it serves as a descriptive model to identify the


characteristics that distinguish instances from different classes. This is
particularly useful for critical applications, such as medical diagnosis, where it
is insufficient to have a model that makes a prediction without justifying how it
reaches such a decision.

General Approach To Solve Classification Problem


Classification is the task of assigning labels to unlabelled data instances, and a classifier is used to perform such a task. The model is created using a given set of instances, known as the training set, which contains attribute values as well as class labels for each instance.
The systematic approach for learning a classification model given a training set is known as a learning algorithm. The process of using a learning algorithm to build a classification model from the training data is known as induction. This process is also often described as “learning a model” or “building a model.” The process of applying a classification model to unseen test instances to predict their class labels is known as deduction.
Thus, the process of classification involves two steps: applying a learning
algorithm to training data to learn a model, and then applying the model to
assign labels to unlabelled instances.
Confusion Matrix:
The performance of a model (classifier) can be evaluated by comparing the
predicted labels against the true labels of instances. This information can be
summarized in a table called a confusion matrix.
Table depicts the confusion matrix for a binary classification problem. Each
entry fij denotes the number of instances from class i predicted to be of class j.
For example, f01 is the number of instances from class 0 incorrectly predicted
as class 1. The number of correct predictions made by the model is (f11 + f00)
and the number of incorrect predictions is (f10 + f01).
Although a confusion matrix provides the information needed to determine how
well a classification model performs, summarizing this information into a single
number makes it more convenient to compare the relative performance of
different models. This can be done using an evaluation metric such as accuracy,
which is computed in the following way:

    Accuracy = (number of correct predictions) / (total number of predictions)

For binary classification problems, the accuracy of a model is given by

    Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00)

Error rate is defined as follows for binary classification problems:

    Error rate = (f10 + f01) / (f11 + f10 + f01 + f00)
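To make these formulas concrete, the following small Python sketch computes accuracy and error rate from the four confusion-matrix entries; the counts used here are made-up values for illustration only.

    # Confusion-matrix entries: f_ij = number of class-i instances predicted as class j.
    # The counts below are hypothetical, for illustration only.
    f11, f10, f01, f00 = 40, 10, 5, 45

    total = f11 + f10 + f01 + f00
    accuracy = (f11 + f00) / total      # fraction of correct predictions
    error_rate = (f10 + f01) / total    # fraction of incorrect predictions

    print(f"Accuracy = {accuracy:.2f}, Error rate = {error_rate:.2f}")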

Decision Tree Classifier


To illustrate how a decision tree works, consider the classification problem of
distinguishing mammals from non-mammals using the vertebrate data set.
Suppose a new species is discovered by scientists. How can we tell whether it is a
mammal or a non-mammal? One approach is to pose a series of questions about
the characteristics of the species. The first question we may ask is whether the
species is cold- or warm-blooded. If it is cold-blooded, then it is definitely not a
mammal. Otherwise, it is either a bird or a mammal. In the latter case, we need to
ask a follow-up question: Do the females of the species give birth to their young?
Those that do give birth are definitely mammals, while those that do not are likely
to be non-mammals (with the exception of egg-laying mammals such as the platypus and spiny anteater).
The previous example illustrates how we can solve a classification problem by
asking a series of carefully crafted questions about the attributes of the test
instance. Each time we receive an answer, we could ask a follow-up question
until we can conclusively decide on its class label. The series of questions and
their possible answers can be organized into a hierarchical structure called a
decision tree.
The tree has three types of nodes:
• A root node, with no incoming links and zero or more outgoing links.
• Internal nodes, each of which has exactly one incoming link and two or more
outgoing links.
• Leaf or terminal nodes, each of which has exactly one incoming link and no
outgoing links.
Every leaf node in the decision tree is associated with a class label. The non-
terminal nodes, which include the root and internal nodes, contain attribute
test conditions that are typically defined using a single attribute. Each
possible outcome of the attribute test condition is associated with exactly one
child of this node. For example, the root node of the tree shown in Figure 3.4
uses the attribute Body Temperature to define an attribute test condition that
has two outcomes, warm and cold, resulting in two child nodes.
Given a decision tree, classifying a test instance is straightforward. Starting
from the root node, we apply its attribute test condition and follow the
appropriate branch based on the outcome of the test. This will lead us either to
another internal node, for which a new attribute test condition is applied, or to
a leaf node. Once a leaf node is reached, we assign the class label associated
with the node to the test instance.
Figure traces the path used to predict the class label of a flamingo. The path
terminates at a leaf node labeled as Non-mammals.
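The traversal just described can be sketched in a few lines of Python. The nested-dictionary tree representation and the attribute names below are assumptions made for illustration; they mirror the vertebrate example but are not taken from any particular library.

    # A toy decision tree for the vertebrate example, stored as nested dictionaries.
    # This representation is an assumption made for illustration, not a standard API.
    tree = {
        "attribute": "Body Temperature",
        "children": {
            "cold": {"label": "Non-mammals"},
            "warm": {
                "attribute": "Gives Birth",
                "children": {
                    "yes": {"label": "Mammals"},
                    "no": {"label": "Non-mammals"},
                },
            },
        },
    }

    def classify(node, instance):
        # Follow attribute test conditions until a leaf node is reached.
        while "label" not in node:
            outcome = instance[node["attribute"]]
            node = node["children"][outcome]
        return node["label"]

    flamingo = {"Body Temperature": "warm", "Gives Birth": "no"}
    print(classify(tree, flamingo))   # -> Non-mammals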
A Basic Algorithm to Build a Decision Tree
Hunt’s algorithm is the basis for many current implementations of decision tree classifiers, including ID3, C4.5, and CART.

In Hunt’s algorithm, a decision tree is grown in a recursive fashion. The tree


initially contains a single root node that is associated with all the training
instances. If a node is associated with instances from more than one class, it is
expanded using an attribute test condition that is determined using a splitting
criterion. A child leaf node is created for each outcome of the attribute test
condition and the instances associated with the parent node are distributed to
the children based on the test outcomes. This node expansion step can then be
recursively applied to each child node, as long as it has labels of more than
one class. If all the instances associated with a leaf node have identical class
labels, then the node is not expanded any further. Each leaf node is assigned a
class label that occurs most frequently in the training instances associated
with the node.
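A minimal Python sketch of this recursive node-expansion idea is given below. The data representation and the find_best_split helper (passed in as a parameter and assumed to return an attribute together with a partitioning function, or None when no useful split exists) are illustrative assumptions, not part of Hunt’s original formulation.

    from collections import Counter

    def hunt(instances, labels, find_best_split):
        # Stopping condition: all instances have identical class labels.
        if len(set(labels)) == 1:
            return {"label": labels[0]}
        split = find_best_split(instances, labels)
        if split is None:                      # no useful split: majority-class leaf
            return {"label": Counter(labels).most_common(1)[0][0]}
        attribute, partition = split
        majority = Counter(labels).most_common(1)[0][0]
        node = {"attribute": attribute, "children": {}}
        for outcome, (subset, sublabels) in partition(instances, labels).items():
            if not subset:                     # empty child: label with parent's majority class
                node["children"][outcome] = {"label": majority}
            else:
                node["children"][outcome] = hunt(subset, sublabels, find_best_split)
        return node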
To illustrate how the algorithm works, consider the training set shown in Table
for the loan borrower classification problem. Suppose we apply Hunt’s
algorithm to fit the training data. The tree initially contains only a single leaf
node as shown in Figure (a). This node is labelled as Defaulted = No, since the
majority of the borrowers did not default on their loan payments. The training
error of this tree is 30% as three out of the ten training instances have the class
label Defaulted = Yes. The leaf node can therefore be further expanded because
it contains training instances from more than one class.
Let Home Owner be the attribute chosen to split the training instances. The
justification for choosing this attribute as the attribute test condition will be
discussed later. The resulting binary split on the Home Owner attribute is
shown in Figure (b). All the training instances for which Home Owner = Yes are
propagated to the left child of the root node and the rest are propagated to the
right child. Hunt’s algorithm is then recursively applied to each child. The left
child becomes a leaf node labelled Defaulted = No, since all instances
associated with this node have identical class label Defaulted = No. The right
child has instances from each class label. Hence, we split it further. The
resulting subtrees after recursively expanding the right child are shown in
Figures (c) and (d). Hunt’s algorithm, as described above, makes some
simplifying assumptions that are often not true in practice.
Assumptions and the ways to handle them
1. Some of the child nodes created in Hunt’s algorithm can be empty if none of the training instances have the particular attribute values. One way to handle this is by declaring each of them as a leaf node with the class label that occurs most frequently among the training instances associated with their parent nodes.
2. If all training instances associated with a node have identical attribute values but different class labels, it is not possible to expand this node any further. One way to handle this case is to declare it a leaf node and assign it the class label that occurs most frequently in the training instances associated with this node.

Design Issues of Decision Tree Induction


Hunt’s algorithm is a generic procedure for growing decision trees in a greedy
fashion. To implement the algorithm, there are two key design issues that must
be addressed.
1. What is the splitting criterion? At each recursive step, an attribute must be selected to partition the training instances associated with a node into smaller subsets associated with its child nodes. The splitting criterion determines which attribute is chosen as the test condition and how the training instances should be distributed to the child nodes.
2. What is the stopping criterion? The basic algorithm stops expanding a node only when all the training instances associated with the node have the same class labels or have identical attribute values. Although these conditions are sufficient, there are reasons to stop expanding a node much earlier even if the leaf node contains training instances from more than one class. This process is called early termination, and the condition used to determine when a node should be stopped from expanding is called a stopping criterion.

Methods for Expressing Attribute Test Conditions


Decision tree induction algorithms must provide a method for expressing an
attribute test condition and its corresponding outcomes for different attribute
types.

Binary Attributes: The test condition for a binary attribute generates two potential outcomes.

Nominal Attributes: Since a nominal attribute can have many values, its attribute test condition can be expressed in two ways, as a multiway split or a binary split.

Multiway split: The number of outcomes depends on the number of distinct


values for the corresponding attribute.
For example, if an attribute such as marital status has three distinct values—
single, married, or divorced—its test condition will produce a three-way split.

Binary split: partitioning all values taken by the nominal attribute into two groups.
For example, some decision tree algorithms, such as CART, produce only binary splits by considering all 2^(k−1) − 1 ways of creating a binary partition of k attribute values. Figure illustrates three different ways of grouping the attribute values for marital status into two subsets.

Ordinal Attributes :
Ordinal attributes can also produce binary or multiway splits. Ordinal attribute
values can be grouped as long as the grouping does not violate the order
property of the attribute values. Figure illustrates various ways of splitting
training records based on the Shirt
Size attribute. The groupings shown in Figures (a) and (b) preserve the order
among the attribute values, whereas the grouping shown in Figure (c) violates
this property because it combines the attribute values Small and Large into the
same partition while Medium and Extra Large are combined into another
partition.

Continuous Attributes For continuous attributes, the attribute test condition


can be expressed as a comparison test (e.g., A < v) producing a binary split, or as a range query of the form vi ≤ A < vi+1, for i = 1, ..., k, producing a multiway split.
For the binary split, any possible value v between the minimum and maximum
attribute values in the training data can be used for constructing the
comparison test A<v. However, it is sufficient to only consider distinct attribute
values in the training set as candidate split positions.
For the multiway split, any possible collection of attribute value ranges can be
used, as long as they are mutually exclusive and cover the entire range of
attribute values between the minimum and maximum values observed in the
training set.
One approach for constructing multiway splits is to apply the discretization
strategies. After discretization, a new ordinal value is assigned to each
discretized interval, and the attribute test condition is then defined using this
newly constructed ordinal attribute.

Measures for Selecting an Attribute Test Condition


There are many measures that can be used to determine the goodness of an
attribute test condition. These measures try to give preference to attribute
test conditions that partition the training instances into purer subsets in the
child nodes, which mostly have the same class labels. Having purer nodes is
useful since a node that has all of its training instances from the same class
does not need to be expanded further. In contrast, an impure node containing
training instances from multiple classes is likely to require several levels of
node expansions, thereby increasing the depth of the tree considerably.

Impurity Measure for a Single Node


The impurity of a node measures how dissimilar the class labels are for the data instances belonging to a common node. Following are examples of measures that can be used to evaluate the impurity of a node t:

    Entropy(t) = − Σi pi(t) log2 pi(t)          (sum over classes i = 0, ..., c−1)
    Gini index(t) = 1 − Σi [pi(t)]^2
    Classification error(t) = 1 − max_i [pi(t)]

where pi(t) is the relative frequency of training instances that belong to class i at node t, c is the total number of classes, and 0 log2 0 = 0 in entropy calculations. All three measures give a zero impurity value if a node contains instances from a single class and maximum impurity if the node has an equal proportion of instances from multiple classes.
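A small Python sketch of these three measures, computed from the list of class counts at a node, is given below (an illustrative helper, not a library routine).

    import math

    def impurity(class_counts):
        # Return (entropy, gini, classification_error) for a node,
        # given the list of class counts at that node.
        n = sum(class_counts)
        probs = [c / n for c in class_counts if c > 0]   # skip zero counts: 0 log2 0 = 0
        entropy = sum(p * math.log2(1.0 / p) for p in probs)
        gini = 1.0 - sum(p * p for p in probs)
        error = 1.0 - max(probs)
        return entropy, gini, error

    print(impurity([10, 0]))   # pure node: all three measures are 0
    print(impurity([5, 5]))    # uniform node: maximum impurity (1.0, 0.5, 0.5)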

Comparison among the impurity measures for binary classification


problems.
Since there are only two classes, p0(t)+p1(t) = 1. The horizontal axis p refers to
the fraction of instances that belong to one of the two classes. Observe that all
three measures attain their maximum value when the class distribution is uniform
(i.e., p0(t)=p1(t)= 0.5) and minimum value when all the instances belong to a
single class (i.e., either p0(t) or p1(t) equals 1). The following examples
illustrate how the values of the impurity measures vary as we alter the class
distribution.

Based on these calculations, node N1 has the lowest impurity value, followed by
N2 and N3. This example, along with Figure, shows the consistency among the
impurity measures, i.e., if a node N1 has lower entropy than node N2, then the
Gini index and error rate of N1 will also be lower than that of N2. Despite their
agreement, the attribute chosen as splitting criterion by the impurity measures
can still be different.
Collective Impurity of Child Nodes:
Consider an attribute test condition that splits a node containing N training
instances into k children, {v1,v2,···,vk}, where every child node represents a
partition of the data resulting from one of the k outcomes of the attribute test
condition. Let N(vj) be the number of training instances associated with a
child node vj, whose impurity value is I(vj). Since a training instance in the parent node reaches node vj for a fraction of N(vj)/N times, the collective impurity of the child nodes can be computed by taking a weighted sum of the impurities of the child nodes, as follows:

    I(children) = Σj [N(vj)/N] × I(vj)          (sum over children j = 1, ..., k)

Example: [Weighted Entropy]


Consider the candidate attribute test condition shown in Figures (a) and (b) for
the loan borrower classification problem. Splitting on the Home Owner attribute
will generate two child nodes.
Identifying the best attribute test condition
To determine the goodness of an attribute test condition, we need to compare
the degree of impurity of the parent node (before splitting) with the weighted
degree of impurity of the child nodes (after splitting). The larger their
difference, the better the test condition. This difference, Δ, also termed as the
gain in purity of an attribute test condition, can be defined as follows:

    Δ = I(parent) − I(children)

where I(parent) is the impurity of a node before splitting and I(children) is the
weighted impurity measure after splitting. It can be shown that the gain is non-
negative since I(parent) ≥ I(children) for any reasonable measure such as those
presented above. The higher the gain, the purer are the classes in the child
nodes relative to the parent node. The splitting criterion in the decision tree
learning algorithm selects the attribute test condition that shows the maximum
gain. Note that maximizing the gain at a given node is equivalent to minimizing
the weighted impurity measure of its children since I(parent) is the same for all
candidate attribute test conditions. Finally, when entropy is used as the
impurity measure, the difference in entropy is commonly known as information
gain, Δinfo.

Splitting of Qualitative Attributes


Consider the first two candidate splits shown in Figure involving qualitative
attributes Home Owner and Marital Status. The initial class distribution at
the parent node is (0.3, 0.7), since there are 3 instances of class Yes and 7 instances of class No in the training data. Thus, the entropy of the parent node before splitting is −0.3 log2(0.3) − 0.7 log2(0.7) = 0.881, and the split with the lower weighted entropy of its child nodes (shown in the figure) yields the larger information gain.
Binary Splitting of Qualitative Attributes
Consider building a decision tree using only binary splits and the Gini index as
the impurity measure. Figure 3.13 shows examples of four candidate splitting
criteria for the Home Owner and Marital Status attributes. Since there are 3 borrowers in the training set who defaulted and 7 others who repaid their loan (see Table in Figure 3.13), the Gini index of the parent node before splitting is

    1 − (3/10)^2 − (7/10)^2 = 0.420

If Home Owner is chosen as the splitting attribute, the Gini index for the child nodes N1 and N2 are 0 and 0.490, respectively. The weighted average Gini index for the children is

    (3/10) × 0 + (7/10) × 0.490 = 0.343

where the weights represent the proportion of training instances assigned to


each child. The gain using Home Owner as splitting attribute is 0.420 - 0.343 =
0.077. Similarly, we can apply a binary split on the Marital Status attribute.
However,
since Marital Status is a nominal attribute with three outcomes, there are three
possible ways to group the attribute values into a binary split. The weighted
average Gini index of the children for each candidate binary split is shown in
Figure. Based on these results, Home Owner and the last binary split using
Marital Status are clearly the best candidates, since they both produce the
lowest weighted average Gini index. Binary splits can also be used for ordinal
attributes, if the binary partitioning of the attribute values does not violate the
ordering property of the values.
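To tie the weighted Gini index and the purity gain together, here is a small Python sketch. The child class counts [0, 3] and [3, 4] are inferred from the Home Owner split described above (the left child is pure, the right child has 3 defaulters and 4 non-defaulters, giving Gini 0.490), so the printed gain should come out near 0.077.

    def gini(counts):
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def weighted_impurity(children_counts):
        # I(children) = sum_j N(v_j)/N * I(v_j), using the Gini index as I.
        n = sum(sum(c) for c in children_counts)
        return sum(sum(c) / n * gini(c) for c in children_counts)

    def gain(parent_counts, children_counts):
        # Purity gain: delta = I(parent) - I(children).
        return gini(parent_counts) - weighted_impurity(children_counts)

    # Loan borrower example: parent has 3 defaulters and 7 non-defaulters;
    # splitting on Home Owner gives children with class counts [0, 3] and [3, 4].
    print(gain([3, 7], [[0, 3], [3, 4]]))   # about 0.077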

Binary Splitting of Quantitative Attributes


Consider the problem of identifying the best binary split Annual Income ≤ τ for
the preceding loan approval classification problem. As discussed previously,
even though τ can take any value between the minimum and maximum values of
annual income in the training set, it is sufficient to only consider the annual
income values observed in the training set as candidate split positions. For each
candidate τ, the training set is scanned once to count the number of borrowers
with annual income less than or greater than τ along with their class proportions.
We can then compute the Gini index at each candidate split position and choose
the τ that produces the lowest value. Computing the Gini index at each candidate
split position requires O(N) operations, where N is the number of training
instances. Since there are at most N possible candidates, the overall complexity of
this brute-force method is O(N^2). It is possible to reduce the complexity of this problem to O(N log N) by using a method described as follows (see illustration in Figure 3.14). In this method, we first sort the training instances based on their annual income, a one-time cost that requires O(N log N) operations. The candidate
split positions are given by the midpoints between every two adjacent sorted
values: $55,000, $65,000, $72,500, and so on. For the first candidate, since none
of the instances has an annual income less than or equal to $55,000, the Gini
index for the child node with Annual Income < $55,000 is equal to zero. In
contrast, there are 3 training instances of class Yes and 7 instances of class No
with annual income greater than $55,000. The Gini index for this node is 0.420.
The weighted average Gini index for the first candidate split position, τ =
$55,000, is equal to 0×0+1×0.420 = 0.420. For the next candidate, τ = $65,000,
the class distribution of its child nodes can be obtained with a simple update of
the distribution for the previous candidate. This is because, as τ increases from
$55,000 to $65,000, there is only one training instance affected by the change. By
examining the class label of the affected training instance, the new class
distribution is obtained. For example, as τ increases to $65,000, there is only one
borrower in the training set, with an annual income of $60,000, affected by this
change. Since the class label for the borrower is No, the count for class No
increases from 0 to 1 (for Annual Income ≤ $65,000) and decreases from 7 to 6
(for Annual Income > $65,000), as shown in Figure 3.14. The distribution for the
Yes class remains unaffected. The updated Gini index for this candidate split
position is 0.400. This procedure is repeated until the Gini index for all candidates
are found. The best split position corresponds to the one that produces the lowest
Gini index, which occurs at τ =
$97,500. Since the Gini index at each candidate split position can be computed
in O(1) time, the complexity of finding the best split position is O(N) once all the
values are kept sorted, a one-time operation that takes O(N log N) time. The overall complexity of this method is thus O(N log N), which is much smaller than the O(N^2) time taken by the brute-force method. The amount of computation can
be further reduced by considering only candidate split positions located between
two adjacent sorted instances with different class labels. For example, we do not
need to consider candidate split positions located between $60,000 and $75,000
because all three instances with annual income in this range ($60,000, $70,000,
and $75,000) have the same class labels. Choosing a split position within this
range only increases the degree of impurity, compared to a split position located
outside this range. Therefore, the candidate split positions at τ = $65,000 and τ
= $72,500 can be ignored. Similarly, we do not need to consider the candidate
split positions at $87,500, $92,500, $110,000, $122,500, and $172,500 because
they are located between two adjacent instances with the same labels. This
strategy reduces the number of candidate split positions to consider from 9 to 2
(excluding the two boundary cases τ = $55,000 and τ = $230,000).
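The sorted-sweep procedure described above can be sketched in Python as follows. The income and label lists are reconstructed from the worked example (they may differ from the original table in minor details), the two boundary candidates below the minimum and above the maximum are omitted for brevity, and the additional same-label optimization is not applied; the sketch should report the best threshold near 97.5 (i.e., $97,500).

    def best_numeric_split(values, labels):
        # Find the threshold tau that minimizes the weighted Gini index of the
        # binary split "attribute <= tau" (sort once, then sweep the midpoints).
        def gini(counts):
            n = sum(counts.values())
            return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts.values())

        data = sorted(zip(values, labels))
        n = len(data)
        left = {lab: 0 for lab in set(labels)}                   # counts for attribute <= tau
        right = {lab: labels.count(lab) for lab in set(labels)}  # counts for attribute > tau

        best_tau, best_score = None, float("inf")
        for i in range(n - 1):
            _, lab = data[i]
            left[lab] += 1          # move one instance across the split boundary
            right[lab] -= 1
            if data[i][0] == data[i + 1][0]:
                continue            # no valid midpoint between equal attribute values
            tau = (data[i][0] + data[i + 1][0]) / 2
            score = (i + 1) / n * gini(left) + (n - i - 1) / n * gini(right)
            if score < best_score:
                best_tau, best_score = tau, score
        return best_tau, best_score

    # Annual incomes (in $1000s) and Defaulted labels, reconstructed from the example.
    incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
    labels  = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
    print(best_numeric_split(incomes, labels))   # expected: (97.5, 0.3)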

Gain Ratio

Gain ratio is a modification of information gain that reduces its bias. Gain ratio overcomes the problem with information gain by taking into account the number of branches that would result before making the split. It corrects information gain by taking the intrinsic information of a split into account. In other words, gain ratio adds a penalty to information gain.

We already know how to calculate gain, or information gain.

One potential limitation of impurity measures such as entropy and the Gini index is that they tend to favor qualitative attributes with a large number of distinct values. Figure shows three candidate attributes for partitioning the data set. As previously mentioned, the attribute Marital Status is a better choice than the attribute Home Owner, because it provides a larger information gain. However, if we compare them against Customer ID, the latter produces the purest partitions with the maximum information gain, since the weighted entropy and Gini index are equal to zero for its children. Yet, Customer ID is not a good
attribute for splitting because it has a unique value for each instance. Even
though a test condition involving Customer ID will accurately classify every
instance in the training data, we cannot use such a test condition on new
test instances with Customer ID values that haven’t been seen before
during training. This example suggests that having a low impurity value alone is insufficient to find a good attribute test condition for a node. The number of children produced by the splitting attribute should also be taken into consideration while deciding the best attribute test condition. There are
two ways to overcome this problem. One way is to generate only binary
decision trees, thus avoiding the difficulty of handling attributes with
varying number of partitions. This strategy is employed by decision tree
classifiers such as CART. Another way is to modify the splitting criterion to
take into account the number of partitions produced by the attribute. For
example, in the C4.5 decision tree algorithm, a measure known as gain
ratio is used to compensate for attributes that produce a large number of
child nodes. This measure is computed as follows:

    Gain ratio = Δinfo / Split Info
    Split Info = − Σi [N(vi)/N] log2 [N(vi)/N]          (sum over children i = 1, ..., k)

where N(vi) is the number of instances assigned to node vi and k is the total
number of splits. The split information measures the entropy of splitting a node
into its child nodes and evaluates if the split results in a larger number of
equally-sized child nodes or not. For example, if every partition has the same
number of instances, then ∀i : N(vi)/N =1/k and the split information would be
equal to log2 k. Thus, if an attribute produces a large number of splits, its split
information is also large, which in turn, reduces the gain ratio.
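A small sketch of the split information and gain ratio computation is given below; the Δinfo value is assumed to be computed separately (for example, with the gain sketch shown earlier).

    import math

    def split_info(child_sizes):
        # Split Info = -sum_i (N(v_i)/N) * log2(N(v_i)/N)
        n = sum(child_sizes)
        return -sum(s / n * math.log2(s / n) for s in child_sizes if s > 0)

    def gain_ratio(delta_info, child_sizes):
        return delta_info / split_info(child_sizes)

    # If every one of k partitions has the same size, split information equals log2(k):
    print(split_info([5, 5, 5, 5]))   # 2.0 == log2(4)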

Example: [Gain Ratio]


Consider the data set shown in the figure. We want to select the best attribute test condition among the following three attributes: Gender, Car Type, and Customer ID. The entropy before splitting is computed from the class distribution of the full data set, and the gain ratio of each candidate attribute is then compared.

Attribute selection measures and the algorithms that use them:
Gini index – used in the CART algorithm
Information gain – used in the ID3 (Iterative Dichotomiser) decision tree algorithm
Gain ratio – used as an attribute selection criterion in algorithms such as C4.5

Algorithm for Decision Tree Induction


The algorithm presents pseudocode for the decision tree induction algorithm. The
input to this algorithm is a set of training instances E along with the
attribute set F. The algorithm works by recursively selecting the best
attribute to split the data (Step 7) and expanding the nodes of the tree (Steps
11 and 12) until the stopping criterion is met (Step 1).
The details of this algorithm are:
1. The createNode() function extends the decision tree by creating a new node. A node in the decision tree either has a test condition, denoted as node.test_cond, or a class label, denoted as node.label.

2. The find_best_split() function determines the attribute test
partitioning the training instances associated with a node. The splitting
attribute chosen depends on the impurity measure used. The popular measures
include entropy and the Gini index.
3. The Classify() function determines the class label to be assigned to a leaf
node. For each leaf node t, let p(i|t) denote the fraction of training instances
from class i associated with the node t. The label assigned to the leaf node is
typically the one that occurs most frequently in the training instances that are associated with this node:

    node.label = argmax_i p(i|t)

where the argmax operator returns the class i that maximizes p(i|t). Besides
providing the information needed to determine the class label of a leaf node, p(i|
t) can also be used as a rough estimate of the probability that an instance
assigned to the leaf node t belongs to class i.

4. The stopping_cond() function is used to terminate the tree-growing


process by checking whether all the instances have identical class label
or attribute values. Since decision tree classifiers employ a top-down,
recursive partitioning approach for building a model, the number of
training instances associated with a node decreases as the depth of the
tree increases. As a result, a leaf node may contain too few training
instances to make a statistically significant decision about its class label.
This is known as the data fragmentation problem. One way to avoid this
problem is to disallow splitting of a node when the number of instances
associated with the node falls below a certain threshold.
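Putting the four routines together, a minimal runnable Python version of the induction skeleton might look as follows. The dictionary-based node structure, the Gini-based find_best_split restricted to categorical attributes, and the toy data at the end are all simplifying assumptions for illustration; this is not the pseudocode from the original algorithm listing.

    from collections import Counter

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def stopping_cond(instances, labels, attributes):
        # Stop when all labels are identical, no attributes remain,
        # or all instances have identical attribute values.
        return (len(set(labels)) == 1 or not attributes
                or all(x == instances[0] for x in instances))

    def classify_leaf(labels):
        # Leaf label: the class that occurs most frequently (argmax_i p(i|t)).
        return Counter(labels).most_common(1)[0][0]

    def find_best_split(instances, labels, attributes):
        # Choose the attribute whose multiway split gives the lowest weighted Gini index.
        n = len(labels)
        best_attr, best_score = None, float("inf")
        for attr in attributes:
            groups = {}
            for x, y in zip(instances, labels):
                groups.setdefault(x[attr], []).append(y)
            score = sum(len(g) / n * gini(g) for g in groups.values())
            if score < best_score:
                best_attr, best_score = attr, score
        return best_attr

    def tree_growth(instances, labels, attributes):
        if stopping_cond(instances, labels, attributes):
            return {"label": classify_leaf(labels)}            # createNode() as a leaf
        attr = find_best_split(instances, labels, attributes)
        node = {"test_cond": attr, "children": {}}             # createNode() as an internal node
        partitions = {}
        for x, y in zip(instances, labels):
            partitions.setdefault(x[attr], ([], []))
            partitions[x[attr]][0].append(x)
            partitions[x[attr]][1].append(y)
        remaining = [a for a in attributes if a != attr]
        for outcome, (sub_x, sub_y) in partitions.items():
            node["children"][outcome] = tree_growth(sub_x, sub_y, remaining)
        return node

    # Hypothetical toy data, for illustration only.
    data = [{"Home Owner": "Yes", "Marital Status": "Single"},
            {"Home Owner": "No", "Marital Status": "Married"},
            {"Home Owner": "No", "Marital Status": "Single"}]
    print(tree_growth(data, ["No", "No", "Yes"], ["Home Owner", "Marital Status"]))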
Tree Pruning
Pruning is a procedure that decreases the size of a decision tree. It can reduce the risk of overfitting by limiting the size of the tree or eliminating parts of the tree that provide little predictive power. Pruning works by trimming branches that reflect anomalies in the training data caused by noise or outliers, in a way that improves the generalization performance of the tree.
Pruning methods typically use statistical measures to remove the least reliable branches, frequently resulting in faster classification and an improvement in the ability of the tree to correctly classify independent test data.

There are two approaches to tree pruning which are as follows –


Prepruning Approach

In the prepruning approach, a tree is “pruned” by halting


its construction early (e.g., by deciding not to further split
or partition the subset of training tuples at a given node).
Upon halting, the node becomes a leaf. The leaf may hold
the most frequent class among the subset tuples or the
probability distribution of those tuples.
When constructing a tree, measures such as statistical
significance, information gain, Gini index, and so on, can
be used to assess the goodness of a split. If partitioning
the tuples at a node would result in a split that falls below
a prespecified threshold, then further partitioning of the
given subset is halted. There are difficulties, however, in
choosing an appropriate threshold. High thresholds could
result in oversimplified trees, whereas low thresholds
could result in very little simplification.
Post Pruning Approach
The second and more common approach is postpruning,
which removes subtrees from a “fully grown” tree. A
subtree at a given node is pruned by removing its
branches and replacing it with a leaf. The leaf is labeled
with the most frequent class among the subtree being
replaced. For example, notice the subtree at node “A3?” in
the unpruned tree of Figure 8.6. Suppose that the most
common class within this subtree is “class B.” In the
pruned version of the tree, the subtree in question is
pruned by replacing it with the leaf “class B.”
The cost complexity pruning algorithm used in CART is
an example of the postpruning approach. This approach
considers the cost complexity of a tree to be a function of
the number of leaves in the tree and the error rate of the
tree (where the error rate is the percentage of tuples
misclassified by the tree). It starts from the bottom of the
tree. For each internal node, N, it computes the cost
complexity of the subtree at N, and the cost complexity of
the subtree at N if it were to be pruned (i.e., replaced by a
leaf node). The two values are compared. If pruning the
subtree at node N would result in a smaller cost
complexity, then the subtree is pruned. Otherwise, it is
kept.
A pruning set of class-labeled tuples is used to estimate
cost complexity. This set is independent of the training set
used to build the unpruned tree and of any test set used
for accuracy estimation. The algorithm generates a set of
progressively pruned trees. In general, the smallest
decision tree that minimizes the cost complexity is
preferred.
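As a rough illustration of the cost-complexity comparison made at an internal node, consider the sketch below. It assumes the nested-dictionary tree representation used in the earlier traversal sketch, a user-supplied classify_fn(node, instance) routine, a pruning set of (instance, label) pairs, and a complexity parameter alpha; it is a simplified illustration of the idea, not the CART procedure itself.

    def count_leaves(node):
        if "label" in node:
            return 1
        return sum(count_leaves(c) for c in node["children"].values())

    def errors(node, pruning_set, classify_fn):
        # Number of pruning-set tuples misclassified by the (sub)tree rooted at node.
        return sum(1 for x, y in pruning_set if classify_fn(node, x) != y)

    def cost_complexity(node, pruning_set, classify_fn, alpha):
        # Cost complexity = error rate plus a penalty proportional to the number of leaves.
        n = len(pruning_set)
        return errors(node, pruning_set, classify_fn) / n + alpha * count_leaves(node)

    def maybe_prune(node, majority_label, pruning_set, classify_fn, alpha):
        # Replace the subtree with a leaf if that lowers the cost complexity.
        leaf = {"label": majority_label}
        if (cost_complexity(leaf, pruning_set, classify_fn, alpha)
                <= cost_complexity(node, pruning_set, classify_fn, alpha)):
            return leaf
        return node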
C4.5 uses a method called pessimistic pruning, which is
similar to the cost complexity method in that it also uses
error rate estimates to make decisions regarding subtree
pruning. Pessimistic pruning, however, does not require
the use of a prune set. Instead, it uses the training set to
estimate error rates. Recall that an estimate of accuracy
or error based on the training set is overly optimistic and,
therefore, strongly biased. The pessimistic pruning
method therefore adjusts the error rates obtained from the
training set by adding a penalty, so as to counter the bias
incurred.
Rather than pruning trees based on estimated error rates,
we can prune trees based on the number of bits required
to encode them. The “best” pruned tree is the one that
minimizes the number of encoding bits.
Scalability and Decision Tree Induction
The efficiency of existing decision tree algorithms, such as ID3, C4.5, and CART, has been well established for relatively small data sets. Efficiency becomes an issue of concern when these algorithms are applied to the mining of very large real-world databases.
Techniques
1. RainForest – uses AVC-sets (Attribute-Value, Class label)
2. BOAT (Bootstrapped Optimistic Algorithm for Tree construction)

What is a Confusion Matrix?

A Confusion matrix is an N x N matrix used for evaluating the


performance of a classification model, where N is the number of
target classes. The matrix compares the actual target values with
those predicted by the machine learning model. This gives us a
holistic view of how well our classification model is performing
and what kinds of errors it is making.

For a binary classification problem, we would have a 2 x 2 matrix


as shown below with 4 values:

 The target variable has two values: Positive or Negative


 The columns represent the actual values of the target variable
 The rows represent the predicted values of the target variable

True Positive (TP)
 The predicted value matches the actual value.
 The actual value was positive and the model predicted a positive value.

True Negative (TN)
 The predicted value matches the actual value.
 The actual value was negative and the model predicted a negative value.

False Positive (FP) – Type 1 error
 The predicted value was falsely predicted.
 The actual value was negative but the model predicted a positive value.
 Also known as a Type 1 error.

False Negative (FN) – Type 2 error
 The predicted value was falsely predicted.
 The actual value was positive but the model predicted a negative value.
 Also known as a Type 2 error.
Let me give you an example to better understand this. Suppose
we had a classification dataset with 1000 data points. We fit a
classifier on it and get the below confusion matrix:

The different values of the Confusion matrix would be as follows:

 True Positive (TP) = 560; meaning 560 positive class data points
were correctly classified by the model
 True Negative (TN) = 330; meaning 330 negative class data
points were correctly classified by the model
 False Positive (FP) = 60; meaning 60 negative class data points
were incorrectly classified as belonging to the positive class by
the model
 False Negative (FN) = 50; meaning 50 positive class data points
were incorrectly classified as belonging to the negative class by
the model
This turned out to be a pretty decent classifier for our dataset
considering the relatively larger number of true positive and true
negative values.

Example
Suppose we are trying to create a model that predicts whether or not a person has a particular disease. The confusion matrix for this is given as:

From the above example, we can conclude that:

o The table is given for a two-class classifier, which has two predictions, "Yes" and "No." Here, Yes means that the patient has the disease, and No means that the patient does not have the disease.
o The classifier has made a total of 100 predictions. Out of 100 predictions, 89 are correct predictions, and 11 are incorrect predictions.
o The model predicted "Yes" 32 times and "No" 68 times, whereas the actual "Yes" occurred 27 times and the actual "No" occurred 73 times.

Calculations using Confusion Matrix:


We can perform various calculations for the model, such as the model's
accuracy, using this matrix. These calculations are given below:

o Classification Accuracy: It is one of the important parameters for evaluating classification problems. It defines how often the model predicts the correct output. It can be calculated as the ratio of the number of correct predictions made by the classifier to the total number of predictions made by the classifier. The formula is given below:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)

o Misclassification rate: It is also termed the error rate, and it defines how often the model gives wrong predictions. The error rate can be calculated as the ratio of the number of incorrect predictions to the total number of predictions made by the classifier. The formula is given below:

    Error rate = (FP + FN) / (TP + TN + FP + FN)

o Precision: It can be defined as, out of all instances predicted as positive by the model, how many were actually positive. It can be calculated using the formula below:

    Precision = TP / (TP + FP)

o Recall: It is defined as, out of all actual positive instances, how many the model predicted correctly. The recall should be as high as possible.

    Recall = TP / (TP + FN)

Recall vs Precision
Recall is the number of relevant documents retrieved by a search divided by the total number of existing relevant documents, while precision is the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search.

o F-measure: If two models have low precision and high recall or vice versa, it is difficult to compare them. So, for this purpose, we can use the F-score. This score helps us evaluate recall and precision at the same time. The F-score is maximum when recall is equal to precision. It can be calculated using the formula below:

    F-measure = (2 × Precision × Recall) / (Precision + Recall)
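Using the counts from the 1000-instance example earlier (TP = 560, TN = 330, FP = 60, FN = 50), these metrics can be computed with a few lines of Python:

    # Counts taken from the 1000-instance example above.
    TP, TN, FP, FN = 560, 330, 60, 50
    total = TP + TN + FP + FN

    accuracy   = (TP + TN) / total         # 0.89
    error_rate = (FP + FN) / total         # 0.11
    precision  = TP / (TP + FP)            # about 0.903
    recall     = TP / (TP + FN)            # about 0.918
    f_measure  = 2 * precision * recall / (precision + recall)   # about 0.911

    print(accuracy, error_rate, precision, recall, f_measure)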

Other important terms used in Confusion Matrix:

o Null Error rate: It defines how often our model would be incorrect if it always predicted the majority class. As per the accuracy paradox, the best classifier for a given application can sometimes have a higher error rate than the null error rate.
o ROC Curve: The ROC is a graph displaying a classifier's performance for all possible thresholds. The graph plots the true positive rate (on the y-axis) against the false positive rate (on the x-axis).
o PRC area: area under the Precision-Recall Curve.
o MCC: Matthews Correlation Coefficient.

Visual Mining for Decision Tree Induction

Perception-based Classification (PBC)

“Are there any interactive approaches to decision tree


induction that allow us to visualize the data and the tree
as it is being constructed? Can we use any knowledge of
our data to help in building the tree?”
In this section, you will learn about an approach to decision tree induction that supports these options. Perception-based classification (PBC) is an interactive approach based
on multidimensional visualization techniques and allows
the user to incorporate background knowledge about the
data when building a decision tree. By visually interacting
with the data, the user is also likely to develop a deeper
understanding of the data. The resulting trees tend to be
smaller than those built using traditional decision tree
induction methods and so are easier to interpret, while
achieving about the same accuracy.
“How can the data be visualized to support interactive
decision tree construction?” PBC uses a pixel-oriented
approach to view multidimensional data with its class
label information. The circle segments approach is
adapted, which maps d-dimensional data objects to a circle
that is partitioned into d segments, each representing one
attribute (Section 2.3.1). Here, an attribute value of a data
object is mapped to one colored pixel, reflecting the
object's class label. This mapping is done for each
attribute–value pair of each data object. Sorting is done
for each attribute to determine the arrangement order
within a segment. For example, attribute values within a
given segment may be organized so as to display
homogeneous (with respect to class label) regions within
the same attribute value. The amount of training data that
can be visualized at one time is approximately determined
by the product of the number of attributes and the number
of data objects.
The PBC system displays a split screen, consisting of a
Data Interaction window and a Knowledge Interaction
window (Figure 8.9). The Data Interaction window
displays the circle segments of the data under
examination, while the Knowledge Interaction window
displays the decision tree constructed so far. Initially, the
complete training set is visualized in the Data Interaction
window, while the Knowledge Interaction window displays
an empty decision tree.
Traditional decision tree algorithms allow only binary
splits for numeric attributes. PBC, however, allows the
user to specify multiple split-points, resulting in multiple
branches to be grown from a single tree node.
A tree is interactively constructed as follows. The user
visualizes the multidimensional data in the Data
Interaction window and selects a splitting attribute and
one or more split-points. The current decision tree in the
Knowledge Interaction window is expanded. The user
selects a node of the decision tree. The user may either
assign a class label to the node (which makes the node a
leaf) or request the visualization of the training data
corresponding to the node. This leads to a new
visualization of every attribute except the ones used for
splitting criteria on the same path from the root. The
interactive process continues until a class has been
assigned to each leaf of the decision tree.
The trees constructed with PBC were compared with trees
generated by the CART, C4.5, and SPRINT algorithms
from various data sets. The trees created with PBC were
of comparable accuracy with the tree from the algorithmic
approaches, yet were significantly smaller and, thus,
easier to understand. Users can use their domain
knowledge in building a decision tree, but also gain a
deeper understanding of their data during the
construction process.
