Decision Tree.pptx

The document discusses various classification methods in machine learning, including Naïve Bayesian classification, decision trees, and instance-based methods, highlighting the differences between supervised and unsupervised learning. It explains the two-step process of data classification involving a learning step to construct a model and a classification step to predict class labels for new data. Additionally, it details decision tree algorithms, attribute selection measures like information gain, gain ratio, and Gini index, and the principles of Bayesian classification.


Classification

Naïve Bayesian classification, Decision trees, Decision rules, Instance-based methods
Supervised vs. Unsupervised Learning
Supervised learning (classification)
• Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations.
• New data is classified based on the training set.
Unsupervised learning (clustering)
• The class labels of the training data are unknown.
• Given a set of measurements, observations, etc., the aim is to establish the
existence of classes or clusters in the data.
Prediction Problems: Classification vs. Numeric Prediction
Classification
• Predicts categorical class labels (discrete or nominal)
• Constructs a model based on the training set and the values (class labels) of a
classifying attribute, then uses the model to classify new data
Numeric Prediction
• Models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
• Credit/loan approval
• Medical diagnosis: whether a tumor is cancerous or benign
• Fraud detection: whether a transaction is fraudulent
• Web page categorization: which category a page belongs to
Classification
Data classification is a two-step process: Learning & Classification step.
i. Learning step (or Training phase or Model construction): where a
classification model is constructed.
• Describing a set of predetermined classes.
• “Learning from” a training set made up of database tuples and their
associated class labels.
• Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute.
• The set of tuples used for model construction is the training set.
• The model is represented as classification rules, decision trees, or
mathematical formulae
• The class label attribute is categorical (or nominal) in that each value
serves as a category or class.
Classification
ii. Classification step (Model usage): where the model is used to predict
class labels for given data.
• Classifying future or unknown objects
• If the accuracy is acceptable, use the model to classify new data
Note: If the test set is used to select models, it is called validation (test)
set
Classification
• Estimate accuracy of the model
• The known label of each test sample is compared with the classified result
from the model.
• Accuracy rate is the percentage of test set samples that are correctly
classified by the model
• Test set is independent of training set (otherwise overfitting)
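
As an illustration of this two-step process and accuracy estimation, here is a minimal sketch using scikit-learn (not part of the original slides; the iris data set, the 70/30 split, and the classifier settings are arbitrary choices for demonstration):

# Learning step: construct a model from labeled training tuples.
# Classification step: apply it to an independent test set and estimate accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                   # attribute vectors and class labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)           # test set kept independent of training set

model = DecisionTreeClassifier(random_state=42)     # learning step: model construction
model.fit(X_train, y_train)

y_pred = model.predict(X_test)                      # classification step: predict labels for new data
print("Accuracy:", accuracy_score(y_test, y_pred))  # fraction of test tuples correctly classified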
Decision Tree
• A decision tree is a flowchart-like tree structure:
• Each internal node (nonleaf node) denotes a test on an attribute
• Each branch represents an outcome of the test.
• Each leaf node (or terminal node) holds a class label.
• Internal nodes are denoted by rectangles.
• Leaf nodes are denoted by ovals.
• Some decision tree algorithms produce only binary trees.
• Others can produce nonbinary trees.
Decision Tree
How are decision trees used for classification?
• Given a tuple, X, for which the associated class label is unknown, the
attribute values of the tuple are tested against the decision tree.
• A path is traced from the root to a leaf node, which holds the class
prediction for that tuple.
• Decision trees can easily be converted to classification rules.
• Decision tree classifiers have good accuracy. However, successful use may
depend on the data at hand.
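
To make the traversal concrete, here is a small Python sketch. The tree below mirrors the buys_computer example used later in this deck; the youth/student and senior/credit_rating branches follow the usual textbook version and should be read as illustrative, not as output produced on these slides.

# A decision tree as nested dicts: internal nodes test an attribute,
# branches correspond to attribute values, leaves hold class labels.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {"attribute": "student",
                  "branches": {"yes": "buys_computer = yes", "no": "buys_computer = no"}},
        "middle_aged": "buys_computer = yes",
        "senior": {"attribute": "credit_rating",
                   "branches": {"fair": "buys_computer = yes", "excellent": "buys_computer = no"}},
    },
}

def classify(node, x):
    # Trace a path from the root to a leaf; the leaf holds the class prediction.
    while isinstance(node, dict):
        node = node["branches"][x[node["attribute"]]]
    return node

print(classify(tree, {"age": "youth", "student": "yes"}))   # -> buys_computer = yes

Each root-to-leaf path reads directly as a classification rule, e.g. IF age = youth AND student = yes THEN buys_computer = yes.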
Decision Tree
• Popular traditional Decision Tree Algorithms:
✔ ID3 (Iterative Dichotomiser)
✔ C4.5 (a successor of ID3)
✔ CART (Classification And Regression Trees)
• ID3, C4.5, and CART adopt a greedy (i.e., nonbacktracking) approach in
which decision trees are constructed in a top-down recursive
divide-and-conquer manner.
Decision Tree
The decision tree algorithm takes three parameters:
i. D: a data partition; initially, it is the complete set of training tuples and their
associated class labels.
ii. attribute_list: the set of candidate attributes.
iii. Attribute selection method: specifies a heuristic procedure for selecting
the attribute that “best” discriminates the given tuples according to class.
The criterion consists of a splitting attribute and, possibly, either a
split-point or splitting subset.
Attribute selection measures include information gain and the Gini index.
Gini index: the resulting tree is strictly binary.
Information gain: allows multiway splits (i.e., two or more branches to be grown
from a node).
Decision Tree
• If the tuples in D are all of the same class, then node N becomes a leaf
and is labeled with that class (Stopping criteria).
• Otherwise, the algorithm calls Attribute selection method to determine
the splitting criterion.
Stopping criteria: Recursive partitioning stops only when any one of the
following terminating conditions is true:
✔ All the tuples in partition D (represented at node N) belong to the
same class.
✔ There are no remaining attributes on which the tuples may be further
partitioned.
✔ There are no tuples for a given branch, that is, a partition Dj is empty.
Decision Tree
• Let A be the splitting attribute. A has v distinct values, {a1, a2, ..., av},
based on the training data.
• A is discrete-valued: In this case, the outcomes of the test at node N
correspond directly to the known values of A.
• A branch is created for each known value, aj, of A and labeled with that
value.
• A is continuous-valued: In this case, the test at node N has two possible
outcomes, corresponding to the conditions A ≤ split point and A > split
point, respectively.
Decision Tree: Attribute Selection Measures
Three popular attribute selection measures—
a) Information gain (ID3)
b) Gain ratio (C4.5)
c) Gini index (CART)
Decision Tree: Attribute Selection Measures- Information gain
Information Gain:
• ID3 uses information gain as its attribute selection measure.
• The attribute with the highest information gain is chosen as the splitting
attribute for node N.
• This attribute minimizes the information needed to classify the tuples in the
resulting partitions and reflects the least randomness or “impurity” in these partitions.
Decision Tree: Attribute Selection Measures- Information gain
Entropy (Information Theory)
• A measure of uncertainty associated with a random variable.
• Higher entropy -> higher uncertainty
• Lower entropy -> lower uncertainty
• The expected information (or entropy) needed to classify a tuple in D
is:
Decision Tree: Attribute Selection Measures- Information gain
• The expected information (or entropy) needed to classify a tuple in D is:
  Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)
• where p_i is the nonzero probability that an arbitrary tuple in D belongs to
class Ci and is estimated by |C_i,D| / |D|.
• A log function to the base 2 is used because the information is encoded in bits.
• Info(D) is just the average amount of information needed to identify the
class label of a tuple in D.
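
For intuition, the two extreme cases of this formula for two classes:

  Info(D) = -(0.5 log2 0.5 + 0.5 log2 0.5) = 1 bit   (classes equally likely: maximum uncertainty)
  Info(D) = -(1 × log2 1) = 0 bits                   (all tuples in one class: no uncertainty)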
Decision Tree: Attribute Selection Measures- Information gain
• Information needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Info(Dj)
• The term |Dj| / |D| acts as the weight of the jth partition.
• Info_A(D) is the expected information required to classify a tuple from D
based on the partitioning by A.
• The smaller the expected information (still) required, the greater the
purity of the partitions.
Decision Tree: Attribute Selection Measures- Information gain
• Information gain is defined as the difference between the original
information requirement (i.e., based on just the proportion of classes)
and the new requirement (i.e., obtained after partitioning on A).
• Information gained by branching on attribute A:
  Gain(A) = Info(D) - Info_A(D)
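
These quantities combine into a short computation. Below is a minimal Python sketch working directly from per-class tuple counts (a simplification; a full implementation would derive these counts from the data partition itself):

import math

def info(counts):
    # Info(D): expected information (entropy) from per-class tuple counts.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_a(partitions):
    # Info_A(D): weighted entropy after splitting D into v partitions on attribute A.
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * info(p) for p in partitions)

def gain(counts, partitions):
    # Gain(A) = Info(D) - Info_A(D)
    return info(counts) - info_a(partitions)

# Class distribution of D in the buys_computer example: 9 yes, 5 no.
# Splitting on age gives partitions with (yes, no) counts (2, 3), (4, 0), (3, 2).
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # 0.247 (0.246 when intermediate values are rounded)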
Decision Tree: Attribute Selection Measures- Information gain
• In this example, class label attribute, buys_computer, has two distinct
values (namely, yes, no).
• There are nine tuples of class yes and five tuples of class no.
• A (root) node N is created for the tuples in D.
• To find the splitting criterion for
these tuples, we first need to
compute the expected information
to classify a tuple in D.
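
With nine yes tuples and five no tuples, this works out to:

  Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits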
Decision Tree: Attribute Selection Measures- Information gain
• Next, we need to compute the expected information requirement for
each attribute.
• Let’s start with the attribute age. We need to look at the distribution of
yes and no tuples for each category of age.
Decision Tree: Attribute Selection Measures- Information gain
• For the age category “youth,” there are two yes tuples and three no
tuples.
• For the category “middle aged,” there are four yes tuples and zero no
tuples.
• For the category “senior,” there
are three yes tuples and two no
tuples.
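
From these per-category (yes, no) counts, the expected information requirement for age and the resulting gain are:

  Info_age(D) = (5/14) × Info(2,3) + (4/14) × Info(4,0) + (5/14) × Info(3,2) = 0.694 bits
  Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246 bits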
Decision Tree: Attribute Selection Measures- Information gain

• Similarly, we can compute Gain(income) =0.029 bits, Gain(student) =


0.151 bits, and Gain(credit_rating) = 0.048 bits.
• Because age has the highest
information gain among the
attributes, it is selected as the
splitting attribute.
Decision Tree: Attribute Selection Measures- Information gain

• *Note the tuples falling into the partition for age D


middle aged all belong to the same class.
• Therefore be created at the end of this branch and
labeled “yes.”
Decision Tree: Attribute Selection Measures
• Computing Information-Gain for Continuous-Valued Attributes:
• Let attribute A be a continuous-valued attribute
• Must determine the best split point for A
Sort the value A in increasing order
Typically, the midpoint between each pair of adjacent values is
considered as a possible split point
(a_i + a_{i+1}) / 2 is the midpoint between the values of a_i and a_{i+1}
The point with the minimum expected information requirement for
A is selected as the split-point for A
• Split:
D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set
of tuples in D satisfying A > split-point
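
A sketch of this midpoint search in Python (the attribute values and labels at the bottom are made up purely for illustration):

import math

def info(labels):
    # Entropy of a list of class labels.
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def best_split_point(values, labels):
    # Sort the values, try the midpoint of each pair of adjacent values, and keep
    # the split point with the minimum expected information requirement Info_A(D).
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(len(pairs) - 1):
        split = (pairs[i][0] + pairs[i + 1][0]) / 2          # (a_i + a_{i+1}) / 2
        d1 = [lab for v, lab in pairs if v <= split]         # D1: A <= split-point
        d2 = [lab for v, lab in pairs if v > split]          # D2: A > split-point
        expected = (len(d1) * info(d1) + len(d2) * info(d2)) / len(pairs)
        if best is None or expected < best[1]:
            best = (split, expected)
    return best

print(best_split_point([22, 25, 30, 35, 40, 45], ["no", "no", "yes", "yes", "yes", "no"]))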
Decision Tree: Attribute Selection Measures- Gain Ratio
• Information gain measure is biased toward tests with many outcomes.
• Information gain prefers to select attributes having a large number of
values.
• For example, consider an attribute that acts as a unique identifier such
as product ID.
• A split on product ID would result in a large number of partitions (as
many as there are values), each one containing just one tuple.
• Because each partition is pure, the information required to classify data set D
based on this partitioning is Info_product_ID(D) = 0.
• Therefore, the information gained by partitioning on this attribute is
maximal.
Decision Tree: Attribute Selection Measures- Gain Ratio
• C4.5, a successor of ID3, uses an extension to information gain known
as gain ratio, which attempts to overcome this bias.

• It normalizes the information gain using a “split information” value:
  SplitInfo_A(D) = - Σ_{j=1}^{v} (|Dj| / |D|) × log2(|Dj| / |D|)
• For each outcome, it considers the number of tuples having that outcome
with respect to the total number of tuples in D.
• The gain ratio is then defined as GainRatio(A) = Gain(A) / SplitInfo_A(D).
• The attribute with the maximum gain ratio is selected as the splitting
attribute.
Decision Tree: Attribute Selection Measures- Gain Ratio
• Computation of the gain ratio for the attribute income.
• We have Gain(income) = 0.029.
• Therefore, GainRatio(income) = 0.029 / 1.557 = 0.019.
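
For completeness, the 1.557 above is the split information of income. In the AllElectronics example from Han, Kamber, and Pei (the partition sizes come from the textbook data set and are not shown on these slides), income splits D into partitions of four, six, and four tuples, so:

  SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557
  GainRatio(income) = 0.029 / 1.557 ≈ 0.019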
Decision Tree: Attribute Selection Measures- Gini Index
• The Gini index measures the impurity of a data partition D or set of training tuples:
  Gini(D) = 1 - Σ_{i=1}^{m} p_i^2
• where p_i is the probability that a tuple in D belongs to class Ci and is
estimated by |C_i,D| / |D|. The sum is computed over m classes.
• Gini index considers a binary split for each attribute.
• To determine the best binary split on A, we examine all the possible
subsets that can be formed using known values of A.
• For example, if income has three possible values, namely {low, medium,
high}, then the possible subsets are {low, medium, high}, {low, medium},
{low, high}, {medium, high}, {low}, {medium}, {high}, and {}.
Decision Tree: Attribute Selection Measures- Gini Index
• We exclude the full set, {low, medium, high}, and the empty set {}
from consideration since, conceptually, they do not represent a split.
• Therefore, there are 2v − 2 possible ways to form two partitions of the
data D, based on a binary split on A.
• When considering a binary split, we compute a weighted sum of the
impurity of each resulting partition.
• For example, if a binary split on A partitions D into D1 and D2, the Gini
index of D given that partitioning is:
  Gini_A(D) = (|D1| / |D|) × Gini(D1) + (|D2| / |D|) × Gini(D2)

• For a discrete-valued attribute, the subset that gives the minimum Gini
index for that attribute is selected as its splitting subset.
Decision Tree: Attribute Selection Measures- Gini Index
• The reduction in impurity that would be incurred by a binary split on a
discrete- or continuous-valued attribute A is:
  ΔGini(A) = Gini(D) - Gini_A(D)
• The attribute that maximizes the reduction in impurity (or, equivalently,
has the minimum Gini index) is selected as the splitting attribute.
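
A compact Python sketch of these Gini computations, again working from per-class tuple counts. The (yes, no) breakdown used for D1 and D2 below follows the textbook AllElectronics example and is not stated on these slides:

def gini(counts):
    # Gini(D) = 1 - sum of squared class probabilities, from per-class tuple counts.
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(d1_counts, d2_counts):
    # Gini_A(D) for a binary split of D into partitions D1 and D2.
    n1, n2 = sum(d1_counts), sum(d2_counts)
    total = n1 + n2
    return (n1 / total) * gini(d1_counts) + (n2 / total) * gini(d2_counts)

gini_d = gini([9, 5])                       # impurity of D: about 0.459
split = gini_split([7, 3], [2, 2])          # income in {low, medium} vs. {high}: about 0.443
print(round(gini_d, 3), round(split, 3), round(gini_d - split, 3))   # reduction about 0.016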
Decision Tree: Attribute Selection Measures- Gini Index
• Using the Gini index to compute the impurity of D (nine yes and five no tuples):
  Gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
• To find the splitting criterion for the tuples in D, we need to compute
the Gini index for each attribute.
• Consider the attribute income and each of the possible splitting subsets.
• The subset {low, medium} results in 10 tuples in partition D1 satisfying the
condition “income ∈ {low, medium}.”
• The remaining four tuples of D would be assigned to partition D2.
Decision Tree: Attribute Selection Measures- Gini Index
• The Gini index value computed based on the partitioning {low, medium} is:
  Gini_{income ∈ {low, medium}}(D) = (10/14) × Gini(D1) + (4/14) × Gini(D2) = 0.443
• The Gini index for the subsets {low, high} and {medium} is 0.458.
• For {medium, high} and {low} it is 0.450.
• The best binary split for attribute income is on {low, medium} (or {high})
because it minimizes the Gini index.
Decision Tree: Attribute Selection Measures- Gini Index
• Evaluating age, we obtain {youth, senior} (or {middle aged}) as the best
split for age with a Gini index of 0.375.
• The attributes student and credit_rating are both binary, with Gini index
values of 0.367 and 0.429, respectively.

Decision Tree
The three measures, in general, return good results but
• Information gain:
biased towards multivalued attributes
• Gain ratio:
tends to prefer unbalanced splits in which one partition is much
smaller than the others
• Gini index:
biased to multivalued attributes
has difficulty when # of classes is large
tends to favor tests that result in equal-sized partitions and purity in
both partitions
Bayes Classification
• Bayesian classifiers are statistical classifiers.
• Bayesian classifier can predict class membership probabilities such as
the probability that a given tuple belongs to a particular class.
• Bayesian classification is based on Bayes’ theorem.
• A simple Bayesian classifier is known as the naïve Bayesian classifier.
• Naïve Bayesian classifiers assume that the effect of an attribute value
on a given class is independent of the values of the other attributes.

• Bayes’ theorem is useful in that it provides a way of calculating the


posterior probability, P(H|X) from P(H), P(X|H), and P(X).
Bayes Classification
• For example, X is a 35-year-old customer with an income of $40,000.
Suppose that H is the hypothesis that our customer will buy a
computer.
• Then P(H|X) reflects the probability that customer X will buy a
computer given that we know the customer’s age and income.
Naïve Bayesian Classification
• Naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
• Let D be a training set of tuples and their associated class labels. As
usual, each tuple is represented by an n-dimensional attribute vector, X
= (x1, x2,…, xn), depicting n measurements made on the tuple from n
attributes, respectively, A1, A2,…, An.
• Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple, X, the
classifier will predict that X belongs to the class having the highest
posterior probability, conditioned on X. That is, the naïve Bayesian
classifier predicts that tuple X belongs to the class Ci if and only if
  P(Ci | X) > P(Cj | X) for 1 ≤ j ≤ m, j ≠ i.
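
A minimal sketch of this decision rule for categorical attributes (plain frequency counting with no smoothing; the tuples at the bottom are hypothetical, and a practical implementation would add Laplace smoothing and handle continuous attributes):

from collections import Counter, defaultdict

def train(tuples, labels):
    # Estimate P(Ci) and P(xk | Ci) by counting over the training set D.
    class_counts = Counter(labels)
    cond_counts = defaultdict(Counter)            # (class, attribute index) -> value counts
    for x, c in zip(tuples, labels):
        for k, v in enumerate(x):
            cond_counts[(c, k)][v] += 1
    return class_counts, cond_counts

def predict(x, class_counts, cond_counts):
    # Choose the class Ci that maximizes P(Ci) * product over k of P(xk | Ci).
    n = sum(class_counts.values())
    best_class, best_p = None, -1.0
    for c, cc in class_counts.items():
        p = cc / n                                # prior P(Ci)
        for k, v in enumerate(x):
            p *= cond_counts[(c, k)][v] / cc      # class-conditional P(xk | Ci)
        if p > best_p:
            best_class, best_p = c, p
    return best_class

X = [("youth", "no"), ("youth", "yes"), ("senior", "yes"), ("middle_aged", "no")]
y = ["no", "yes", "yes", "yes"]
print(predict(("youth", "yes"), *train(X, y)))    # -> "yes"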
Summary
• Data sets are made up of data objects. A data object represents an
entity. Data objects are described by attributes. Attributes can be
nominal, binary, ordinal, or numeric.
• The values of a nominal (or categorical) attribute are symbols or names
of things, where each value represents some kind of category, code, or
state.
• Binary attributes are nominal attributes with only two possible states
(such as 1 and 0 or true and false). If the two states are equally
important, the attribute is symmetric; otherwise it is asymmetric.
• An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between
successive values is not known.
References
• Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and
Techniques, 3rd Edition, Morgan Kaufmann.
• Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to
Data Mining, Pearson India.
