
Topic 1

Classification Basics

[Jiawei Han, Jian Pei, Hanghang Tong. 2022. Data Mining Concepts and Techniques. 4th Ed. Morgan
Kaufmann. ISBN: 0128117605.]
[Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, Vipin Kumar. 2018. Introduction to Data Mining.
2nd Ed. Pearson. ISBN: 0133128903.]

1
Contents

1. Decision Tree
2. Naïve Bayesian Classification
3. Rule-Based Classification
4. Evaluate Classifier Performance
5. K-Nearest Neighbors Classification

2
Introduction – Basic Concepts

• Classification is a form of data analysis that extracts models describing data classes. A classifier (or classification model) predicts categorical labels (classes).
• Data classification is a two-step process,
consisting of a training (or learning) step (where a
classification model is constructed) and a
classification (prediction) step (where the model is
used to predict class labels for given data).

3
Introduction – Basic Concepts

• Numeric prediction models are constructed by regression analysis to predict continuous-valued functions.
• Classification and numeric prediction are two
major types of prediction problems.

4
Introduction – Basic Concepts

• Classification applications include fraud detection, target marketing, performance prediction, manufacturing, medical diagnosis, and so on.
• Basic classification techniques are decision tree,
Bayes(ian), and rule-based classifiers.
• Construction and evaluation of a classifier require
partitioning labeled data set D into training set
(two-thirds or 70% of D) and test set (one-third or
30% of D). This partition method is called holdout.

5
Introduction – Basic Concepts

• Typical data partitioning methods are holdout, random sampling, cross-validation, and bootstrapping.

6
Introduction – Basic Concepts

• Evaluate and compare different classifiers by using various accuracy measures.
• A confusion matrix can be used to evaluate a
classifier’s quality.

7
Introduction – Basic Concepts

• Ensemble methods can be used to increase overall accuracy by learning and combining a series of individual (base) classifier models.
• Popular ensemble methods are bagging, boosting,
and random forests.

8
Introduction – Basic Concepts

• Training data D_train contains tuples X_train = (x1, x2, ..., xn, y)
- X_train is also called an attribute/feature vector
- y is called the class label attribute.
• Test data D_Test contains tuples X_Test = (x1, x2, ..., xn) without the class label attribute.
• Data tuples can be referred to as samples,
examples, instances, data points, or objects.

9
Introduction – Basic Concepts

• In supervised learning (e.g., classification), the class label of each training tuple is provided.
• In unsupervised learning (e.g., clustering), the
class label of each training tuple is not known.
• The accuracy of a classifier on a given test set is
the percentage of test set tuples that are correctly
classified by the (trained/learned) classifier.

10
1. Decision Tree

• Consider class-labeled training tuples D from the AllElectronics customer data set.
• How do we predict whether a given customer at AllElectronics is likely to purchase a computer (i.e., buys_computer = yes or no)?
11
1. Decision Tree

• A constructed decision tree classifier

12
1. Decision Tree

• During tree construction, attribute selection measures are used to select the attribute that best partitions the tuples into distinct classes.
• Three popular attribute selection measures are
information gain, gain ratio, and Gini index.
• Reference links
https://en.wikipedia.org/wiki/ID3_algorithm
https://en.wikipedia.org/wiki/C4.5_algorithm
http://www.cs.cmu.edu/~tom/mlbook.html

13
Basic Algorithm for Inducing a Decision Tree

Algorithm: Generate_decision_tree.
// Generate decision tree from training tuples of data partition
D.
Input:
• Data partition D is set of training tuples and their associated
class labels
• attribute_list is the set of candidate attributes
• Attribute_selection_method is a procedure to determine the
splitting criterion that “best” partitions the data tuples into
individual classes. This criterion consists of a splitting_attribute
and, possibly, either a split-point or splitting subset.
Output: A decision tree.
14
Basic Algorithm for Inducing a Decision Tree

Method:
1. Create a node N
2. if tuples in D are all of the same class C then
3. return N as leaf node labeled with the class C
4. if attribute_list is empty then
5. return N as a leaf node labeled with the
majority class in D // majority voting

15
Basic Algorithm for Inducing a Decision Tree

6. Apply Attribute_selection_method(D, attribute_list) to find the “best” splitting_criterion
7. Label node N with splitting_criterion
8. if splitting_attribute is discrete-valued and
multiway splits allowed then
// not restricted to binary trees
9. attribute_list ← attribute_list – splitting_attribute
// remove splitting_attribute

16
Basic Algorithm for Inducing a Decision Tree

10. for each outcome j of splitting_criterion
// partition tuples and grow subtrees for each partition
11. Let Dj be the set of data tuples in D satisfying outcome j
// a partition
12. if Dj is empty then
13. Attach a leaf labeled with the majority class
in D to node N
14. else Attach the node returned by
Generate_decision_tree(Dj, attribute_list) to
node N
endfor
15. return N
17
Basic Algorithm for Inducing a Decision Tree

• The computational complexity of the algorithm given training set D is O(n × |D| × log(|D|)), where n is the number of attributes describing the tuples in D and |D| is the number of training tuples in D.

18
Three Possibilities for Partitioning Tuples

19
Attribute Selection Measures - Notations Used

• D (parent set) is the training set of class-labeled tuples.
• m is the number of classes, each class is denoted
as Ci (for i = 1, ..., m)
• Ci,D is the set of tuples of class Ci in D.
• |D| is the number of tuples in D.
• |Ci,D| is the number of tuples in Ci,D.

20
Attribute Selection Measures - Notations Used

• Attribute A has v distinct values {a1, a2, ..., av}.
• Attribute A can be used to split D into v partitions or subsets {D1, D2, ..., Dv}, where Dj contains those tuples in D that have outcome aj of A.

21
Information Gain (ID3)

Info(D) = –Σi=1..m pi log2(pi)   (8.1)

(0·log2(0) = 0; pi = |Ci,D| / |D| is the ratio of class Ci tuples among the training tuples at (root) node D (depth 0 node))
• The attribute A with the highest information gain Gain(A) is chosen as the splitting attribute at node N.
22
Information Gain (ID3)

• pi,j is the ratio of class Ci tuples among the training tuples at node Dj (depth 1: left, middle, or right node). That is, Dj contains tuples whose value of A is aj.

23
Attribute Selection Measures - Notations Used

• Before partitioning D, Info(D) is the expected information needed to identify the class label of a tuple in D. Info(D) is also known as the entropy of D.
• After partitioning D on A, Info(Dj) is the expected
information required to classify a tuple from D
based on the partitioning by A.
• After partitioning D on A, InfoA(D) is the amount
of information still needed to classify a tuple in D.
• pi is the nonzero probability that an arbitrary tuple
in D belongs to class Ci, pi = |Ci,D| / |D|.
24
Example 1: Information Gain (ID3)

• Example 1. Given training set D of class-labeled tuples randomly selected from the AllElectronics
customer database as shown in Table 8.1. Class
label attribute, buys_computer, has two distinct
values, namely, {yes, no}. Therefore, there are two
classes (i.e., m = 2). Let class C1 correspond to yes
and class C2 correspond to no.

25
Example 1: Information Gain (ID3)

|D| = 14, m = 2, |C1,D| = 9, |C2,D| = 5


26
Example 1: Information Gain (ID3)

• There are nine tuples of class yes (|C1,D| = 9) and five tuples of class no (|C2,D| = 5). A (root) node N is created for the tuples in D. To find the splitting criterion for these tuples, we must compute the information gain of each attribute.

27
Example 1: Information Gain (ID3)

• We first use Eq. (8.1) to compute the expected information (known as entropy) needed to classify a tuple in D:

pi = |Ci,D| / |D|
Info(D) = –[p1log2(p1) + p2log2(p2)], where p1 = 9/14 and p2 = 5/14
Info(D) = –(9/14)log2(9/14) – (5/14)log2(5/14) = 0.940 bits

28
Example 1: Information Gain (ID3)

• Next, we need to compute the expected information requirement for each attribute (i.e., InfoA(D)).
• Let’s start with attribute age. We need to look at the distribution of yes and no tuples for each category of age. That is, we have v = 3: 5 youth tuples (|D1| = 5), 4 middle_aged tuples (|D2| = 4), and 5 senior tuples (|D3| = 5).

29
Example 1: Information Gain (ID3)

30
Example 1: Information Gain (ID3)

- For category youth (|D1| = 5): there are two yes tuples (|C1,D1| = 2, so p1,1 = 2/5) and three no tuples (|C2,D1| = 3, so p2,1 = 3/5).
- For category middle_aged (|D2| = 4): there are four yes tuples (|C1,D2| = 4, so p1,2 = 4/4) and zero no tuples (|C2,D2| = 0, so p2,2 = 0/4).
- For category senior (|D3| = 5): there are three yes tuples (|C1,D3| = 3, so p1,3 = 3/5) and two no tuples (|C2,D3| = 2, so p2,3 = 2/5).

31
Example 1: Information Gain (ID3)

• Using Eq. (8.2), the expected information needed to classify a tuple in D if the tuples are partitioned according to age is

InfoA(D) = |D1|/|D| × Info(D1) + |D2|/|D| × Info(D2) + |D3|/|D| × Info(D3), where A = age, v = 3,
Info(D1) = –[p1,1log2(p1,1) + p2,1log2(p2,1)],
Info(D2) = –[p1,2log2(p1,2) + p2,2log2(p2,2)],
Info(D3) = –[p1,3log2(p1,3) + p2,3log2(p2,3)]
32
Example 1: Information Gain (ID3)

33
Example 1: Information Gain (ID3)

• Infoage(D) = 0.694 < Info(D) = 0.940
→ data purity (or homogeneity) is improved.
• The information gain from such a partitioning is

Gain(age) = Info(D) – Infoage(D) = 0.940 – 0.694 = 0.246 (bits)

34
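The computation above is easy to verify in code. Below is a minimal Python sketch of my own (not from the textbook) that reproduces Info(D), Infoage(D), and Gain(age) from the class counts in Table 8.1.

import math

def entropy(counts):
    """Info(D) = -sum(pi * log2(pi)), with 0*log2(0) treated as 0."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Class counts (yes, no) from Table 8.1
info_D = entropy([9, 5])                      # 0.940 bits

# Partitions of D by age: youth, middle_aged, senior
partitions = [[2, 3], [4, 0], [3, 2]]
n = sum(sum(p) for p in partitions)           # 14 tuples
info_age = sum(sum(p) / n * entropy(p) for p in partitions)  # 0.694 bits

gain_age = info_D - info_age                  # 0.246 bits
print(f"Info(D) = {info_D:.3f}, Info_age(D) = {info_age:.3f}, Gain(age) = {gain_age:.3f}")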
Example 1: Information Gain (ID3)

• Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit_rating) = 0.048 bits [Exercise].
• Because age has the highest information gain among the attributes, it is selected as the splitting attribute.
• Because age has the highest information gain
among the attributes, it is selected as the splitting
attribute.

35
Example 1: Information Gain (ID3)

36
Reminder: Information Gain (ID3)

37
Gain Ratio (C4.5)

• The information gain measure is biased toward tests with many outcomes. That is, it prefers to select attributes having a large number of values. In other words, the information gain measure is biased toward multivalued attributes.
• C4.5, a successor of ID3, uses an extension to
information gain known as gain ratio, which
attempts to overcome this bias.

38
Gain Ratio (C4.5)

GainRatio(A) = Gain(A) / SplitInfoA(D) (8.6)

• The attribute with the maximum gain ratio is selected as the splitting attribute.
• Example: A = income, v = 3, we have
SplitInfoA(D) = –[(|D1|/|D|)×log2(|D1|/|D|) +
(|D2|/|D|)×log2(|D2|/|D|) +
(|D3|/|D|)×log2(|D3|/|D|)]

39
Example 2: Gain Ratio (C4.5)

• Example 2. Let D be the training data shown in Table 8.1. A test on income splits the given data into three partitions, namely low, medium, and high, containing four, six, and four tuples, respectively.
• We have |D| = 14, |D1| = 4, |D2| = 6, |D3| = 4.
• To compute the gain ratio of income, we first use Eq. (8.5) to obtain
SplitInfoincome(D) = –(4/14)log2(4/14) – (6/14)log2(6/14) – (4/14)log2(4/14) = 1.557

40
Example 2: Gain Ratio (C4.5)

41
Example 2: Gain Ratio (C4.5)

42
Example 2: Gain Ratio (C4.5)

• From Example 1, we have Gain(income) = 0.029. Therefore, GainRatio(income) = 0.029 / 1.557 = 0.019.
• Similarly, we can compute GainRatio(age) = 0.156 bits, GainRatio(student) = 0.152 bits, and GainRatio(credit_rating) = 0.049 bits [Exercise].
• Because age has the highest gain ratio among the attributes, it is selected as the splitting attribute.

43
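As a quick check, here is a small Python sketch (mine, not the textbook's) that computes SplitInfoincome(D) and GainRatio(income) from the partition sizes above.

import math

def split_info(sizes):
    """SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|))."""
    total = sum(sizes)
    return -sum((s / total) * math.log2(s / total) for s in sizes if s > 0)

si_income = split_info([4, 6, 4])     # partition sizes for low, medium, high
gain_income = 0.029                   # from Example 1
print(f"SplitInfo = {si_income:.3f}")                         # 1.557
print(f"GainRatio(income) = {gain_income / si_income:.3f}")   # 0.019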
Reminder

44
Gini index (CART)

• Gain ratio tends to prefer unbalanced splits in which one partition is much smaller than the others.
• Gini index is biased to multivalued attributes and
has difficulty when the number of classes is large. It
also tends to favor tests that result in equal-sized
partitions and purity in both partitions.
• Gini index considers a binary split (i.e., two-way
split) for each attribute.
• For each attribute, each of the possible binary
splits is considered.
45
Gini index (CART)

• The Gini index measures the impurity of training tuples D and is defined as

Gini(D) = 1 – Σi=1..m pi^2, where pi = |Ci,D| / |D|   (8.7)

• A binary split on A partitions D into D1 and D2; the Gini index of D given that partitioning is
GiniA(D) = (|D1|/|D|)×Gini(D1) + (|D2|/|D|)×Gini(D2) (8.8)

46
Gini index (CART)

• For a discrete-valued attribute, the subset that gives the minimum Gini index for that attribute, GiniA(D), is selected as its splitting subset.
• The reduction in impurity that would be incurred by a binary split on a discrete- or continuous-valued attribute A is
ΔGini(A) = Gini(D) – GiniA(D) (8.9)
• The attribute that maximizes the reduction in impurity ΔGini(A) (i.e., has the minimum Gini index) is selected as the splitting attribute.
47
Gini index (CART)

• Training data D can be split on attribute A using a binary (or two-way) split or a multiway split.
• A multiway split has a smaller Gini index than a binary split because a binary split actually merges some of the outcomes of a multiway split, and thus results in less pure subsets.

48
Example 3: Gini index (CART)

• Example 3. Let D be the given training data as shown in Table 8.1, where there are nine tuples belonging to the class buys_computer = yes and the remaining five tuples belong to the class buys_computer = no. A (root) node N is created for the tuples in D.

• We have |C1,D| = 9, |C2,D| = 5, |D| = 14

49
Example 3: Gini index (CART)

• We first use Eq. (8.7) for the Gini index to compute the impurity of D:

Gini(D) = 1 – (9/14)^2 – (5/14)^2 = 0.459

50
Example 3: Gini index (CART)

• To find the splitting criterion for the tuples in D, we need to compute the Gini index for each attribute.
• Let’s start with attribute income and consider each of the possible splitting subsets.

51
Example 3: Gini index (CART)

• Consider the subset {low, medium}. This would result in 10 tuples in partition D1 satisfying the condition “income ∈ {low, medium}.” The remaining four tuples of D would be assigned to partition D2 (i.e., income ∈ {high}). That is, we have |D1| = 10, |D2| = 4, |D| = 14.

52
Example 3: Gini index (CART)

53
Example 3: Gini index (CART)

• The Gini index value computed based on this partitioning (i.e., attribute income and subset {low, medium}) is

Giniincome∈{low,medium}(D) = (10/14)×Gini(D1) + (4/14)×Gini(D2)
= (10/14)(1 – (7/10)^2 – (3/10)^2) + (4/14)(1 – (2/4)^2 – (2/4)^2) = 0.443

54
Example 3: Gini index (CART)

• Similarly, the Gini index values for splits on the remaining subsets are 0.458 (for the subsets {low, high} and {medium}) and 0.450 (for the subsets {medium, high} and {low}) [Exercise].
• Therefore, the best binary split for attribute income is on {low, medium} (or {high}) because it minimizes the Gini index.

55
Example 3: Gini index (CART)

• Evaluating age, we obtain {youth, senior} (or {middle_aged}) as the best split for age, with a Gini index of 0.357 [Exercise].
• The attributes student and credit_rating are both binary, with Gini index values of 0.367 and 0.429, respectively [Exercise].

56
Example 3: Gini index (CART)

• Attribute age and splitting subset {youth, senior} therefore give the minimum Gini index overall (i.e., 0.357), with a reduction in impurity of 0.459 – 0.357 = 0.102.
• The binary split “age ∈ {youth, senior}?” results in the maximum reduction in impurity of the tuples in D and is selected as the splitting criterion.

57
Example 3: Gini index (CART)

58
Reminder

59
Other Attribute Selection Measures

• Information gain is biased toward multivalued attributes.
• Gain ratio tends to prefer unbalanced splits in
which one partition is much smaller than the others.
• Gini index is biased toward multivalued attributes
and has difficulty when the number of classes is
large. It also tends to favor tests that result in equal-
size partitions and purity in both partitions.

60
Other Attribute Selection Measures

• Other attribute selection measures are CHAID (Chi-squared Automatic Interaction Detector, based on the statistical χ2 test), C-SEP, and the G-statistic (a close approximation to the χ2 distribution).
• No one attribute selection measure has been found to be significantly superior to the others. Most measures give quite good results.

61
Scalability of Decision Tree Induction

• Decision tree algorithms such as ID3, C4.5, and CART have the restriction that the training tuples should reside in memory. That is, these methods are applicable to small data sets.
• Scalable decision tree induction methods include RainForest and BOAT (Bootstrapped Optimistic Algorithm for Tree construction).

62
Scalability of Decision Tree Induction

• The RainForest method maintains an AVC-set (where “AVC” stands for “Attribute-Value, Classlabel”) for each attribute, at each tree node, describing the training tuples at the node.
• The AVC-set of an attribute A at node N gives the class label counts for each value of A for the tuples at N.

63
AVC-sets for the Tuple Data of Table 8.1

64
Scalability of Decision Tree Induction

• The set of all AVC-sets at a node N is the AVC-group of N.
• The size of an AVC-set for attribute A at node N depends only on the number of distinct values of A and the number of classes in the set of tuples at N.
• Typically, the size of an AVC-set for attribute A at node N should fit in memory.

65
Scalability of Decision Tree Induction

• The BOAT method uses a statistical technique known as bootstrapping to create several smaller samples (or subsets) of the given training data, each of which fits in memory.
• Each subset is used to construct a tree, resulting in several trees.
• The trees are examined and used to construct a new tree, T’, that turns out to be “very close” to the tree that would have been generated if all the original training data had fit in memory.
66
Scalability of Decision Tree Induction

• The basic decision tree induction algorithm requires one scan of D per tree level!
• BOAT was found to be two to three times faster than RainForest, while constructing exactly the same tree.
• An additional advantage of BOAT is that it can be used for incremental updates. That is, BOAT can take new insertions and deletions for the training data and update the decision tree to reflect these changes, without having to reconstruct the tree from scratch.
67
Contents

1. Decision Tree
2. Naïve Bayesian Classification
3. Rule-Based Classification
4. Evaluate Classifier Performance
5. K-Nearest Neighbors Classification

68
2. Naïve Bayesian Classification

• A Bayesian classifier (BC) can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
• A BC has high accuracy and speed when applied to large databases.
• A simple Bayesian classifier is known as the naïve Bayesian classifier (NBC).
• NBC is comparable in performance with decision tree and selected neural network classifiers.

69
2. Bayes’ Theorem

• Let X be a data tuple (a.k.a. “evidence”).
• Let H be a hypothesis that X belongs to a class C.
• Classification is to determine P(H | X).
P(H | X) = [P(X | H)×P(H)] / P(X) (8.10)
• P(H | X) is the probability that the hypothesis H
holds given the observed data tuple X. That is, we
are looking for the probability that tuple X belongs
to class C, given that we know the attribute
description of X.
70
2. Bayes’ Theorem

• P(H | X) is the posterior probability (or a posteriori probability) of H conditioned on X.
• Example: suppose that
- We know the age and income of the customer X
(e.g., age = youth, income = medium).
- H is the hypothesis that the customer X will buy a
computer.
- Then, P(H | X) is the probability that the customer
X will buy a computer.
71
2. Bayes’ Theorem

• P(X | H) is the posterior probability of X conditioned on H (i.e., given that hypothesis H is true).
• Example:
- We know that customer X will buy a computer.
- Then, P(X | H) is the probability that customer X
is youth and has medium income.

72
2. Bayes’ Theorem

• P(H) is the prior probability (or a priori probability) of H (regardless of the data X).
• Example: P(H) is the probability that any given customer will buy a computer, regardless of age, income, or any other information. P(C1) = 9/14
• Example: P(H) is probability that any given
customer will buy computer, regardless of age,
income, or any other information. P(C1) = 9/14
• P(X) is the prior probability of X (regardless of
hypothesis H).
• Example: P(X) is probability that customer from
data set D is youth and has medium income.
P(age = youth, income = medium) = 2 /14
73
2. Bayes’ Theorem

• P(X | H), P(H), and P(X) can be computed from the given data set D.
• Bayes’ theorem is useful in that it provides a way of calculating the posterior probability P(H | X) from P(X | H), P(H), and P(X).
• Bayes’ theorem is

P(H | X) = [P(X | H)×P(H)] / P(X)   (8.10)
74
2. Naïve Bayesian Classification

• The naïve Bayesian classifier (or simple Bayesian classifier) works as follows.
1. Let D be a training set of tuples and their
associated class labels. Each tuple is represented by
an n-dimensional attribute vector, X = (x1, x2, ...,
xn), depicting n measurements made on the tuple
from n attributes A1, A2, ..., An, respectively.
- Example: X = (x1 = youth, x2 = medium, x3 = yes,
x4 = fair)

75
2. Naïve Bayesian Classification

2. Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple X, the naïve Bayesian classifier (NBC) will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, NBC predicts that tuple X belongs to class Ci if and only if
P(Ci | X) > P(Cj | X) for 1 ≤ j ≤ m, j ≠ i.
Thus, we maximize P(Ci | X).

76
2. Naïve Bayesian Classification

• The class Ci for which P(Ci | X) is maximized is called the maximum posteriori hypothesis.
• By Bayes’ theorem (Eq. 8.10), we have

P(Ci | X) = [P(X | Ci)×P(Ci)] / P(X), where P(Ci) = |Ci,D|/|D|

(Eq. (8.10): P(H | X) = [P(X | H)×P(H)] / P(X))
• Example: P(C1) = 9/14, P(C2) = 5/14

77
2. Naïve Bayesian Classification

3. P(X) is constant for all classes, so only P(X | Ci)P(Ci) needs to be maximized.
• If the class prior probabilities P(Ci) are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = ... = P(Cm), and we would therefore maximize P(X | Ci).
• If the class prior probabilities P(Ci) are known, we need to maximize P(X | Ci)P(Ci), where P(Ci) = |Ci,D|/|D| and |Ci,D| is the number of training tuples of class Ci in D.
78
2. Naïve Bayesian Classification

4. P(X | Ci) is computed, under the naïve assumption of class-conditional independence, as

P(X | Ci) = P(x1 | Ci) × P(x2 | Ci) × ... × P(xn | Ci)

• We can compute the probabilities P(x1 | Ci), P(x2 | Ci), ..., P(xn | Ci) from the training tuples, where xk is the value of attribute Ak for tuple X (e.g., X = (x1 = youth, x2 = medium, x3 = yes, x4 = fair)).
• For each attribute, we look at whether the attribute is categorical or continuous-valued.

79
2. Naïve Bayesian Classification

• To compute P(X | Ci), we consider the following.
(a) If Ak is categorical, then P(xk | Ci) is the number of tuples of class Ci in D having the value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
(b) If Ak is continuous-valued: read the textbook for details.

80
2. Naïve Bayesian Classification

5. To predict the class label of X, P(X | Ci)P(Ci) is computed for each class Ci. NBC predicts that the class label of tuple X is class Ci if and only if

P(X | Ci)P(Ci) > P(X | Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i

(P(Ci | X) = [P(X | Ci) × P(Ci)] / P(X) (8.11))
• The predicted class label is the class Ci for which P(X | Ci)P(Ci) is maximum.

81
2. Naïve Bayesian Classification

• The formulas used are
P(Ci) = |Ci,D| / |D|
P(X | Ci) = P(x1 | Ci) × P(x2 | Ci) × ... × P(xn | Ci)
Predict the class Ci that maximizes P(X | Ci)P(Ci).

82
2. Naïve Bayesian Classification

• Example 4. Predicting a class label using naïve Bayesian classification. We want to predict the class label of a tuple using NBC, given the training data D shown in Table 8.1. The data tuples are described by the attributes age, income, student, and credit_rating. The class label attribute, buys_computer, has two distinct values (namely, {yes, no}). Let C1 correspond to the class buys_computer = yes and C2 correspond to buys_computer = no. The tuple we want to classify is
83
2. Naïve Bayesian Classification

X = (age = youth, income = medium, student = yes, credit_rating = fair)
i.e., X = (x1 = youth, x2 = medium, x3 = yes, x4 = fair)
• We need to maximize P(X | Ci)P(Ci), for i = 1, 2. P(Ci), the prior probability of each class, can be computed based on the training tuples in D.

84
2. Naïve Bayesian Classification

• Step 1: Compute P(Ci) = |Ci,D|/|D|
• Reminder: P(Ci) = |Ci,D|/|D|, where |Ci,D| is the number of training tuples of class Ci in D.
P(C1) = |C1,D|/|D| = P(buys_computer = yes) = 9/14 = 0.643
P(C2) = |C2,D|/|D| = P(buys_computer = no) = 5/14 = 0.357

85
2. Naïve Bayesian Classification

• Step 2: Compute the conditional probabilities P(xk | Ci).
• To compute P(X | Ci), for i = 1, 2, we compute the following conditional probabilities.
• Reminder: P(X | Ci) = P(x1 | Ci) × P(x2 | Ci) × ... × P(xn | Ci) is computed from the training tuples in D, where xk is the value of attribute Ak for the given tuple X and P(xk | Ci) is the fraction of class-Ci tuples having value xk for Ak
(e.g., X = (x1 = youth, x2 = medium, x3 = yes, x4 = fair))

86
2. Naïve Bayesian Classification

• Step 2.1: Compute P(xk | C1) for each attribute value, where x1 = youth, x2 = medium, x3 = yes, x4 = fair
P(age = youth | C1) = 2/9 = 0.222
P(income = medium | C1) = 4/9 = 0.444
P(student = yes | C1) = 6/9 = 0.667
P(credit_rating = fair | C1) = 6/9 = 0.667
(C1 means buys_computer = yes)
87
2. Naïve Bayesian Classification

• Step 2.2: Compute P(xk | C2) for each attribute value, where x1 = youth, x2 = medium, x3 = yes, x4 = fair
P(age = youth | C2) = 3/5 = 0.600
P(income = medium | C2) = 2/5 = 0.400
P(student = yes | C2) = 1/5 = 0.200
P(credit_rating = fair | C2) = 2/5 = 0.400
(C2 means buys_computer = no)
88
2. Naïve Bayesian Classification

• Step 3: Compute P(X | Ci) for i = 1, 2.
• Using the probabilities P(xk | Ci) computed above, we obtain P(X | C1).
Step 3.1: Compute P(X | C1) = P(age = youth | C1)
× P(income = medium | C1)
× P(student = yes | C1)
× P(credit_rating = fair | C1)
= 2/9 × 4/9 × 6/9 × 6/9
= 0.222 × 0.444 × 0.667 × 0.667 = 0.044.
89
2. Naïve Bayesian Classification

• Similarly, we obtain P(X | C2).
Step 3.2: Compute P(X | C2) = P(age = youth | C2)
× P(income = medium | C2)
× P(student = yes | C2)
× P(credit_rating = fair | C2)
= 3/5 × 2/5 × 1/5 × 2/5
= 0.600 × 0.400 × 0.200 × 0.400 = 0.019.

90
2. Naïve Bayesian Classification

• Step 4: Compute P(X | Ci)P(Ci)
• To find the class Ci that maximizes P(X | Ci)P(Ci), we compute
P(X | C1)P(C1) = 0.044 × 0.643 = 0.028
P(X | C2)P(C2) = 0.019 × 0.357 = 0.007

91
2. Naïve Bayesian Classification

• Step 5: Classification
• We have P(X | C1)P(C1) = 0.028 > P(X | C2)P(C2)
= 0.007. Thus, NBC predicts buys_computer = yes
for the given tuple X (i.e., unseen X is
classified/labeled as C1).
/* X = (age = youth,
income = medium,
student = yes,
credit_rating = fair) */
92
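The whole worked example can be condensed into a few lines of Python. This is a minimal sketch of my own (the counts are taken from Table 8.1), not library code.

from math import prod

# Per-class counts from Table 8.1: 9 yes, 5 no (|D| = 14)
priors = {"yes": 9 / 14, "no": 5 / 14}

# P(xk | Ci) for X = (age=youth, income=medium, student=yes, credit_rating=fair)
likelihoods = {
    "yes": [2 / 9, 4 / 9, 6 / 9, 6 / 9],
    "no":  [3 / 5, 2 / 5, 1 / 5, 2 / 5],
}

# Score each class by P(X | Ci) * P(Ci); P(X) is a common constant and can be dropped
scores = {c: prod(likelihoods[c]) * priors[c] for c in priors}
print(scores)                          # {'yes': 0.028..., 'no': 0.006...}
print(max(scores, key=scores.get))     # 'yes'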
2. Naïve Bayesian Classification

• Recall: P(X | Ci) = P(x1 | Ci) × P(x2 | Ci) × ... × P(xn | Ci)
• If P(xk | Ci) = 0 for some k, then P(X | Ci) is zero.
• Example: P(student = yes | buys_computer = no) = 0

93
2. Naïve Bayesian Classification

• Solution to the problem of P(xk | Ci) = 0 for some k: We can assume that the training data set D is so large that adding one to each count that we need would only make a negligible difference in the estimated probability value, yet would conveniently avoid the case of P(xk | Ci) = 0.

94
2. Naïve Bayesian Classification

• The above technique to deal with P(xk | Ci) = 0 is known as the Laplacian correction or Laplace estimator. Specifically, if we have q counts (e.g., 3 counts of income: low, medium, high) to which we each add one, then we must remember to add q to the corresponding denominator used in the probability calculation.

95
2. Naïve Bayesian Classification

• Example 5. Using the Laplacian correction to avoid computing probability values of zero. Suppose that for the class buys_computer = yes in some training data set D containing 1000 tuples, we have 0 tuples with income = low, 990 tuples with income = medium, and 10 tuples with income = high.
• The probabilities of these events, without the Laplacian correction, are 0/1000 = 0, 990/1000 = 0.990, and 10/1000 = 0.010, respectively.

96
2. Naïve Bayesian Classification

• Using the Laplacian correction for the three quantities, we pretend that we have 1 more tuple for each income-value pair. In this way, we obtain the following probabilities: 1/1003 = 0.001, 991/1003 = 0.988, and 11/1003 = 0.011, respectively.
• The “corrected” probability estimates are close to their “uncorrected” counterparts, yet the zero probability value is avoided.

97
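A one-line helper makes the correction concrete; this is an illustrative sketch of my own, not textbook code.

def laplace_smooth(counts):
    """Add-one (Laplacian) smoothing: (count + 1) / (total + q), q = number of values."""
    total, q = sum(counts), len(counts)
    return [(c + 1) / (total + q) for c in counts]

# income counts (low, medium, high) for class buys_computer = yes
print(laplace_smooth([0, 990, 10]))   # [~0.001, ~0.988, ~0.011]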
Contents

1. Decision Tree
2. Naïve Bayesian Classification
3. Rule-Based Classification
4. Evaluate Classifier Performance
5. K-Nearest Neighbors Classification

98
3. Rule-Based Classification

• We study rule-based classification, where the learned/trained model is represented as a set of IF-THEN rules.
• Classification rules can be generated either from a decision tree or directly from the training data D using a sequential covering algorithm.

99
3. Rule-Based Classification

• Rules represent knowledge in the form of IF-THEN rules for classification.
R: IF condition THEN conclusion.
• The “IF” part (or left side) of a rule is known as
the rule antecedent or precondition.
• The “THEN” part (or right side) is the rule
consequent.

100
3. Rule-Based Classification

• Example: the rule R1 is written as
R1: IF age = youth AND student = yes THEN buys_computer = yes.
or
R1: (age = youth) ∧ (student = yes) → (buys_computer = yes).

101
3. Rule-Based Classification

• If the condition (i.e., all the attribute tests) in a rule antecedent holds true for a given tuple, we say that the rule antecedent is satisfied (or the rule is satisfied) and that the rule covers the tuple.
• A rule R can be assessed/evaluated by its coverage and accuracy.
• Given a tuple X from a class-labeled data set D, let |D| be the number of tuples in D.

102
3. Rule-Based Classification

• Let ncovers be the number of tuples covered by R.
• Let ncorrect be the number of tuples correctly classified by R.
• We can define the coverage and accuracy of R as
coverage(R) = ncovers / |D|
accuracy(R) = ncorrect / ncovers
103
3. Rule-Based Classification

• A rule’s coverage is the percentage of tuples that are covered by the rule (i.e., their attribute values hold true for the rule’s antecedent).
• A rule’s accuracy is the percentage of tuples (covered by the rule) that are correctly classified.

104
3. Rule-Based Classification

• Example 6. Rule accuracy and coverage. Given the data set D shown in Table 8.1.

105
3. Rule-Based Classification

• Our task is to predict whether a customer will buy a computer. Consider rule R1,

R1: IF age = youth AND student = yes THEN buys_computer = yes

which covers 2 of the 14 tuples.
• R1 can correctly classify both tuples.
• Therefore, coverage(R1) = 2/14 = 14.28% and accuracy(R1) = 2/2 = 100%.
106
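These two measures are trivial to compute once we can test whether a rule covers a tuple. Below is an illustrative Python sketch; the dictionary representation and helper name are my own, not from the textbook.

def rule_stats(D, antecedent, consequent):
    """Return (coverage, accuracy) of a rule over data set D.
    antecedent: dict of attribute tests; consequent: (attribute, value)."""
    covered = [t for t in D if all(t[a] == v for a, v in antecedent.items())]
    if not covered:
        return 0.0, 0.0
    attr, value = consequent
    correct = sum(1 for t in covered if t[attr] == value)
    return len(covered) / len(D), correct / len(covered)

# Two of the Table 8.1 tuples, for illustration
D = [
    {"age": "youth", "student": "yes", "buys_computer": "yes"},
    {"age": "senior", "student": "no", "buys_computer": "no"},
]
print(rule_stats(D, {"age": "youth", "student": "yes"}, ("buys_computer", "yes")))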
3. Rule-Based Classification

• How do we use rule-based classification to predict the class label of a given tuple X?
• If a rule is satisfied by X, the rule is said to be triggered.
• Example: suppose we have
X = (age = youth, income = medium, student = yes, credit_rating = fair).
• We want to classify X according to buys_computer. X satisfies R1, which triggers the rule.
107
Conflict Resolution

• If R1 is the only rule satisfied, then the rule fires by returning the class prediction for X.
• Note that triggering does not always mean firing because there may be two or more rules that are satisfied.
• If two or more rules are triggered, we have a conflict problem.
• That is, what if they each specify a different class?

108
Conflict Resolution

• If two or more rules are triggered, we need a conflict resolution strategy to figure out which rule gets to fire and assign its class prediction to X.
• Two main possible strategies for conflict resolution are size ordering and rule ordering.

109
Conflict Resolution – Size Ordering

• The size ordering scheme assigns the highest priority to the triggering rule that has the “toughest” requirements, where toughness is measured by the rule antecedent size. That is, the triggering rule with the most attribute tests is fired.

110
Conflict Resolution – Rule Ordering

• The rule ordering scheme can be class-based ordering or rule-based ordering.
• Class-based ordering:
- Classes are sorted in decreasing order of prevalence. That is, all the rules for the most prevalent (or most frequent) class come first, the rules for the next prevalent class come next, and so on.
- Alternatively, classes may be sorted based on the misclassification cost per class.
111
Conflict Resolution – Rule Ordering

• Rule-based ordering: rules are organized into one long priority list, according to some measure of rule quality (e.g., accuracy, coverage, size of attribute tests, or advice from domain experts).
- When rule-based ordering is used, the rule set is known as a decision list.
- The triggering rule that appears earliest in the list has the highest priority, and so it gets to fire its class prediction.
112
Conflict Resolution – Rule Ordering

• Most rule-ordering classification systems use a class-based rule-ordering strategy.

113
Default Rule

• What if no rule is satisfied by X? How can we determine the class label of X?
- A default rule (Rd: {} → C) can be set up to specify a default class based on the training set.
- The default class may be the majority class.
- The default rule is evaluated at the end, if and only if no other rule covers X.
- The condition in the default rule is empty, and the default rule fires when no other rule is satisfied by X.

114
Rule Extraction from Decision Tree

• We study how to build a rule-based classifier by extracting IF-THEN rules from a decision tree.
• IF-THEN rules are easier to understand than a large decision tree.
• One rule is created for each path from the root to a leaf node.
• Each attribute-value pair along a path is logically ANDed to form the rule antecedent (“IF” part).
• The leaf node holds the class prediction, forming the rule consequent (“THEN” part).
115
Rule Extraction from Decision Tree

• Example 7. Extracting classification rules from a decision tree. The decision tree of Figure 8.2 can be converted to classification IF-THEN rules by tracing the path from the root node to each leaf node in the tree. The rules extracted from Figure 8.2 are as follows.

116
Rule Extraction from Decision Tree

R1: IF age = youth AND student = no THEN buys_computer = no
R2: IF age = youth AND student = yes THEN buys_computer = yes
R3: IF age = middle_aged THEN buys_computer = yes
R4: IF age = senior AND credit_rating = excellent THEN buys_computer = no
R5: IF age = senior AND credit_rating = fair THEN buys_computer = yes
117
Rule Extraction from Decision Tree

• The extracted rules are mutually exclusive (no rule conflict) and exhaustive (a default rule is not required).
• Mutually exclusive and exhaustive ensure that every tuple is covered by exactly one rule.

118
Rule Induction: Sequential Covering Algorithm

• IF-THEN rules can be extracted directly from the training data (i.e., without having to generate a decision tree first) using a sequential covering algorithm.
• Rules are learned for one class at a time.
• Each time a rule is learned, the tuples covered by the rule are removed, and the process repeats on the remaining tuples.

119
Rule Induction: Sequential Covering Algorithm

• A basic sequential covering algorithm is shown in Figure 8.10.
• The process continues until the terminating condition is met, such as when there are no more training tuples or the quality of a rule returned is below a user-specified threshold.

120
Rule Induction: Sequential Covering Algorithm

Algorithm: Sequential covering. Learn a set of IF-THEN rules for classification.
Input:
• D is a data set of class-labeled tuples
• Att_vals is the set of all attributes and their
possible values.
Output: A set of IF-THEN rules.

121
Rule Induction: Sequential Covering Algorithm

Method:
1. Rule_set = {} // initial set of rules learned is empty
2. for each class c do
3. repeat
4. Rule = Learn_One_Rule(D, Att_vals, c)
5. remove tuples covered by Rule from D
6. Rule_set = Rule_set + Rule // add new rule
// to rule set
7. until terminating condition
8. endfor
9. return Rule_set
Figure 8.10 Basic sequential covering algorithm.
122
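The pseudocode in Figure 8.10 maps directly onto a short Python loop. The sketch below is my own outline, under the assumption that a Learn_One_Rule helper is available (its implementation, e.g., a greedy general-to-specific search, is left abstract here, as are the rule's quality and covers members).

def sequential_covering(D, att_vals, classes, learn_one_rule, min_quality=0.5):
    """Figure 8.10 as Python: learn rules one class at a time,
    removing covered tuples after each learned rule."""
    rule_set = []
    for c in classes:
        remaining = list(D)
        while remaining:
            rule = learn_one_rule(remaining, att_vals, c)  # assumed helper
            if rule is None or rule.quality < min_quality:
                break                      # terminating condition
            remaining = [t for t in remaining if not rule.covers(t)]
            rule_set.append(rule)
    return rule_set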
Rule Induction: Sequential Covering Algorithm

• Rules are grown in a general-to-specific manner (i.e., starting with R: {} → C) or in a specific-to-general manner.

123
Rule Induction: Sequential Covering Algorithm

Figure 8.11 A general-to-specific search through rule space.

124
Rule Induction: Sequential Covering Algorithm

• Example:

IF THEN loan_decision = accept.
IF income = high THEN loan_decision = accept.
IF income = high AND credit_rating = excellent THEN loan_decision = accept.

125
Rule Quality Measures/Metrics

• Example 8. Choosing between two rules based on accuracy. Consider the two rules illustrated in Figure 8.12. Both are for the class loan_decision = accept. We use “a” to represent tuples of class “accept” and “r” for tuples of class “reject.” Rule R1 correctly classifies 38 of the 40 tuples it covers. Rule R2 covers only two tuples, which it correctly classifies. Their respective accuracies are 95% and 100%. Thus, R2 has greater accuracy than R1, but it is not the better rule because of its small coverage.
126
Rule Quality Measures: accuracy and coverage

accuracy(R1) = 95%
accuracy(R2) = 100%

Figure 8.12 Rules for the class loan_decision = accept, showing accept (a) and reject (r) tuples.
127
Rule Quality Measures: accuracy and coverage

• We see that accuracy on its own is not a reliable estimate of rule quality.
• Coverage on its own is not useful either.
• We seek other measures (e.g., FOIL’s information gain, the likelihood ratio statistic, the Laplace measure) for evaluating rule quality, which may integrate aspects of accuracy and coverage.

128
Rule Quality Measures: FOIL_Gain

• Suppose we are learning rules for class c.
• Our current rule is R: IF condition THEN class = c (e.g., condition = A).
• We want to see if logically ANDing a given attribute test to condition would result in a better rule.
• We call the new condition condition’ (e.g., condition’ = A AND B), where R’: IF condition’ THEN class = c is our potential new rule. That is, we want to see if R’ is any better than R.
129
Rule Quality Measures: FOIL_Gain

• One rule quality measure/metric is based on information gain and was proposed in FOIL (First Order Inductive Learner).
• The tuples of the class for which we are learning rules are called positive tuples, while the remaining tuples are negative.
• Let pos (neg) be the number of positive (negative) tuples covered by R.
• Let pos’ (neg’) be the number of positive (negative) tuples covered by R’.
130
Rule Quality Measures: FOIL_Gain

• FOIL assesses the information gained by extending condition to condition’ as

FOIL_Gain = pos’ × [log2(pos’/(pos’ + neg’)) – log2(pos/(pos + neg))]

(i.e., FOIL’s information gain for R’ w.r.t. R)
• FOIL information gain favors rules that have high accuracy and cover many positive tuples.
• The rule with the higher FOIL_Gain value is better. That is, if FOIL_Gain(R0, R) > FOIL_Gain(R0, R’), R is better than R’.
131
Example 1: FOIL_Gain

• Suppose R covers 350 positive examples and 150 negative examples, and R’ covers 300 positive examples and 50 negative examples. Compute FOIL’s information gain for the rule R’ w.r.t. R (i.e., compute FOIL_Gain(R, R’)).
• We have pos = 350, neg = 150, pos’ = 300, neg’ = 50.
FOIL_Gain = 300 × [log2(300/350) – log2(350/500)] = 300 × [(–0.222) – (–0.515)] ≈ 87.7

132
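The same computation in Python (a small helper I wrote for illustration):

import math

def foil_gain(pos, neg, pos2, neg2):
    """FOIL_Gain of extended rule R' (pos2, neg2) w.r.t. rule R (pos, neg)."""
    return pos2 * (math.log2(pos2 / (pos2 + neg2)) - math.log2(pos / (pos + neg)))

print(foil_gain(350, 150, 300, 50))   # ~87.65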
Rule Quality Measures: Likelihood Ratio

• The likelihood ratio statistic R can be used to prune rules that have poor coverage and is computed as

R = 2 Σi=1..m fi log2(fi / ei)

where m is the number of classes, fi is the observed frequency of class i examples that are covered by the rule, and ei is the expected frequency of a rule that makes random predictions; by convention, log2(0) = 0.
• If R(r1) > R(r2), then r1 is a better rule than r2.

133
Rule Quality Measures: Laplace

• The Laplace measure takes into account the rule coverage and is computed by

Laplace = (f+ + 1) / (n + k)

where n and f+ are the number of examples and positive examples covered by the rule, respectively, and k = m is the number of classes.
• If the rule coverage is large, then its Laplace measure asymptotically approaches the rule accuracy f+ / n. That is, a rule whose Laplace measure is close to its accuracy is a better rule.
134
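Both measures are easy to script. The sketch below is my own, following the definitions above, and evaluates rule r1 from Exercise 5 (covers 50 positives and 5 negatives in a training set with 60 positives and 100 negatives).

import math

def likelihood_ratio(f, class_counts):
    """R = 2 * sum(fi * log2(fi / ei)); f = per-class counts covered by the
    rule, class_counts = class counts in the full training set."""
    covered, total = sum(f), sum(class_counts)
    e = [covered * c / total for c in class_counts]  # expected under random guessing
    return 2 * sum(fi * math.log2(fi / ei) for fi, ei in zip(f, e) if fi > 0)

def laplace(f_pos, n, k=2):
    """Laplace measure = (f+ + 1) / (n + k)."""
    return (f_pos + 1) / (n + k)

print(likelihood_ratio([50, 5], [60, 100]))   # ~99.9
print(laplace(50, 55))                        # ~0.895 (accuracy is 50/55 = 0.909)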
Contents

1. Decision Tree
2. Naïve Bayesian Classification
3. Rule-Based Classification
4. Evaluate Classifier Performance
5. K-Nearest Neighbors Classification

135
4. Evaluate Classifier Performance

• How accurately can the classifier predict the purchasing behavior of future customers?

136
4. Evaluate Classifier Performance

• The classifier evaluation measures include
- accuracy (aka recognition rate),
- error rate (aka misclassification rate),
- sensitivity (aka recall, true positive rate),
- specificity (aka true negative rate),
- precision,
- F1 (aka F, F-score), and
- Fβ.

137
Summary of Classifier Evaluation Measures

138
4. Evaluate Classifier Performance

• True positives (TP): These refer to the positive tuples that were correctly labeled by the classifier. Let TP be the number of true positives.
• True negatives (TN): These are the negative tuples that were correctly labeled by the classifier. Let TN be the number of true negatives.

139
4. Evaluate Classifier Performance

• False positives (FP): These are the negative tuples that were incorrectly/falsely labeled as positive (e.g., tuples of class buys_computer = no for which the classifier predicted buys_computer = yes). Let FP be the number of false positives.
• False negatives (FN): These are the positive tuples that were mislabeled as negative (e.g., tuples of class buys_computer = yes for which the classifier predicted buys_computer = no). Let FN be the number of false negatives.
140
4. Evaluate Classifier Performance

141
4. Evaluate Classifier Performance

• Accuracy is most effective when the class distribution is relatively balanced.
• Unbalanced data contains a significant majority of the negative class and a minority positive class.
• The sensitivity and specificity measures should be used for imbalanced data.

142
4. Evaluate Classifier Performance

• Sensitivity (aka true positive rate, recall) is the proportion of positive tuples that are correctly identified (i.e., accuracy for the positive class).

sensitivity = TP / P (8.23)

• Specificity is the proportion of negative tuples that are correctly identified (i.e., accuracy for the negative class).

specificity = TN / N (8.24)
143
4. Evaluate Classifier Performance

144
Example 1

• Confusion matrix for the classes buys_computer = yes (C1) and buys_computer = no (C2), where an entry in row i (i = 1, 2) and column j (j = 1, 2) shows the number of tuples of class i that were labeled by the classifier as class j.

Actual \ Predicted | yes  | no   | Total
yes                | 6954 | 46   | 7000
no                 | 412  | 2588 | 3000

• sensitivity = 99.34, specificity = 86.27, (overall) accuracy = 95.42
145
Example 1

(overall) accuracy = (TP + TN) / (P + N)
accuracy = (6954 + 2588) / (7000 + 3000) = 9542 / 10000 = 0.9542 = 95.42%

146
Example 1

sensitivity (true positive recognition/rate) = TP / P
sensitivity = 6954 / 7000 = 0.9934 = 99.34%
specificity (true negative recognition/rate) = TN / N
specificity = 2588 / 3000 = 0.8627 = 86.27%

147
Example 1

precision (C1) = TP / (TP + FP)
precision (C1) = 6954 / (6954 + 412) = 6954 / 7366 = 0.9441 = 94.41%,
where C1 means buys_computer = yes
148
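The arithmetic in Example 1 can be bundled into a small helper; this sketch is my own illustration of the formulas above.

def classifier_metrics(tp, fn, fp, tn):
    """Compute the basic evaluation measures from confusion matrix counts."""
    p, n = tp + fn, fp + tn
    return {
        "accuracy":    (tp + tn) / (p + n),
        "sensitivity": tp / p,           # recall, true positive rate
        "specificity": tn / n,           # true negative rate
        "precision":   tp / (tp + fp),
    }

print(classifier_metrics(tp=6954, fn=46, fp=412, tn=2588))
# {'accuracy': 0.9542, 'sensitivity': 0.9934..., 'specificity': 0.8626..., 'precision': 0.9440...}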
Example 2

• Confusion matrix for the classes cancer = yes (C1) and cancer = no (C2).

Actual \ Predicted | yes | no   | Total
yes                | 90  | 210  | 300
no                 | 140 | 9560 | 9700

• sensitivity = 30.00 (low/poor accuracy for the positive class), specificity = 98.56 (high accuracy for the negative class), (overall) accuracy = 96.50 (high accuracy, but it may not be acceptable)
149
Example 2

(overall) accuracy = (TP + TN) / (P + N)
accuracy = (90 + 9560) / (300 + 9700) = 9650 / 10000 = 0.9650 = 96.50%
(high accuracy, but it may not be acceptable)

150
Example 2

sensitivity (true positive recognition/rate) = TP / P
sensitivity = 90 / 300 = 0.3 = 30.00% (low/poor)
specificity (true negative recognition/rate) = TN / N
specificity = 9560 / 9700 = 0.9856 = 98.56%

151
Example 2

precision (cancer = yes) = TP / (TP + FP)
precision (cancer = yes) = 90 / (90 + 140) = 90 / 230 = 0.3913 = 39.13%

152
Contents

1. Decision Tree
2. Naïve Bayesian Classification
3. Rule-Based Classification
4. Evaluate Classifier Performance
5. K-Nearest Neighbors Classification

153
5. K-Nearest Neighbors Classification

• Eager learning (e.g., decision tree) spends a lot of time on model building (training/learning).
- Once a model has been built, classifying a test example is extremely fast.
• Lazy learning (e.g., K-nearest-neighbor classifier) does not require model building (no training).
- Classifying a test example is quite expensive because we need to compute the proximity values individually between the test and training examples.

154
5. K-Nearest Neighbors Classification

• When we want to classify an unknown (unseen) tuple, a K-nearest-neighbor (K-NN) classifier searches the pattern space for the K training tuples that are closest to the unknown tuple. These K training tuples are the K “nearest neighbors” of the unknown tuple.
• For K-NN classification, the unknown tuple is assigned the most common class among its K nearest neighbors (i.e., the majority class of its K nearest neighbors).
155
5. K-Nearest Neighbors Classification

The 1-, 2-, and 3-nearest neighbors of an instance x.
• In (b), we may randomly choose one of the class labels (i.e., + or –) to classify the data point x.
156
5. K-Nearest Neighbors Classification

• The Euclidean distance between two points or tuples X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n) is defined as

d(X1, X2) = sqrt( Σk=1..n (x1k – x2k)^2 )

• Other distance metrics (e.g., Manhattan, Minkowski, Cosine, and Mahalanobis distance) can be used.

157
5. K-Nearest Neighbors Classification

• The importance of choosing the right value for K:
- If K is too small, then the K-NN classifier may be susceptible to overfitting because of noise in the training data.
- If K is too large, the K-NN classifier may misclassify the test instance because its list of nearest neighbors may include data points that are located far away from its neighborhood, as shown below.
158
5. K-Nearest Neighbors Classification

K-NN classification with large K.


(x is classified as – instead of +)
159
5. K-Nearest Neighbors Classification

Algorithm v1: Basic K-NN classification algorithm
1. Find the K training instances that are closest to the unseen instance.
2. Take the most commonly occurring class label of these K instances and assign it to the class label of the unseen instance.

160
5. K-Nearest Neighbors Classification

Algorithm v2: Basic K-NN classification algorithm.
1. Let K be the number of nearest neighbors and D be the set of training examples.
2. for each test example z = (x’, y’) do
3.   Compute d(x’, x), the distance between z and every example (x, y) ∈ D.
4.   Select Dz ⊆ D, the set of K training examples closest to z.
5.   y’ = argmax_v Σ(xi, yi)∈Dz I(v = yi)
6. end for

161
5. K-Nearest Neighbors Classification

• Once the K-NN list Dz is obtained, the test example z = (x’, y’) is classified based on the majority class of its K nearest neighbors:

y’ = argmax_v Σ(xi, yi)∈Dz I(v = yi)

where v is a class label, yi is the class label for one of the K nearest neighbors, and I(·) is an indicator function that returns the value 1 if its argument is true and 0 otherwise.
162
5. K-Nearest Neighbors Classification

• In the majority voting approach, every neighbor has the same impact on the classification. This makes the algorithm sensitive to the choice of K.

163
5. K-Nearest Neighbors Classification

• One way to reduce the impact of K is to weight the influence of each nearest neighbor xi according to its distance: wi = 1/d(x’, xi)^2.
• As a result, training examples that are located far away from z = (x’, y’) have a weaker impact on the classification compared to those that are located close to z.

164
5. K-Nearest Neighbors Classification

• Using the distance-weighted voting scheme, the class label can be determined as follows:

y’ = argmax_v Σ(xi, yi)∈Dz wi × I(v = yi)

where v is a class label, yi is the class label for one of the K nearest neighbors of z = (x’, y’), I(·) is an indicator function that returns the value 1 if its argument is true and 0 otherwise, and wi = 1/d(x’, xi)^2.

165
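Putting the pieces together, here is a compact distance-weighted K-NN classifier in Python. It is a from-scratch sketch of the scheme above, with made-up toy data; for real work, scikit-learn's KNeighborsClassifier offers a similar weights='distance' option.

import math
from collections import defaultdict

def knn_predict(train, z, k=3, weighted=True):
    """train: list of (vector, label); z: query vector.
    Majority vote of the k nearest neighbors, optionally distance-weighted."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(train, key=lambda xy: dist(xy[0], z))[:k]
    votes = defaultdict(float)
    for x, y in neighbors:
        d = dist(x, z)
        # an exact match (d == 0) falls back to an unweighted vote to avoid division by zero
        votes[y] += 1.0 / (d * d) if (weighted and d > 0) else 1.0
    return max(votes, key=votes.get)

train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((3.0, 3.2), "-"), ((3.1, 2.9), "-")]
print(knn_predict(train, (1.1, 1.0), k=3))   # '+'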
Data Normalization (or Feature Scaling)

• K-NN classifiers can produce wrong predictions due to the varying scales of the attribute values of tuples.
• For example, suppose we want to classify a group of people based on attributes such as height (measured in meters) and weight (measured in pounds).

166
Data Normalization (or Feature Scaling)

• The height attribute has low variability, ranging from 1.5 m to 1.85 m, whereas the weight attribute may vary from 90 lb. to 250 lb.
• If the scale of the attributes is not taken into consideration, the proximity measure may be dominated by differences in the weights of a person.

167
Data Normalization (or Feature Scaling)

• Data normalization (aka feature scaling): We normalize the values of each attribute before computing the proximity measure (e.g., Euclidean distance).
- This helps prevent attributes with large ranges (e.g., weight) from outweighing attributes with smaller ranges (e.g., height).

168
Data Normalization (or Feature Scaling)

• Min-max normalization (aka unity-based normalization) can be used to transform a value v of a numeric attribute A to v’ in the range [0, 1] by computing
v’ = (v – minA) / (maxA – minA) ∈ [0, 1],
where minA and maxA are the minimum and maximum values of attribute A.

169
Data Normalization (or Feature Scaling)

• In general, min-max normalization (aka unity-based normalization) can be used to transform a value v of a numeric attribute A to v’ in the range [ℓ, u] by computing
v’ = ℓ + [(v – minA)/(maxA – minA)]×(u – ℓ) ∈ [ℓ, u],
where minA and maxA are the minimum and maximum values of attribute A.

170
Data Normalization (or Feature Scaling)

• Note that it is possible that an unseen instance may have a value of A that is less than minA or greater than maxA. If we want to keep the adjusted numbers in the range from 0 to 1, we can just convert any values of A that are less than minA or greater than maxA to 0 or 1, respectively.

171
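A small Python sketch of min-max normalization with the clipping behavior just described (my own illustration; the sample heights are made up):

def min_max(v, min_a, max_a, lo=0.0, hi=1.0):
    """Min-max normalize v into [lo, hi], clipping out-of-range unseen values."""
    scaled = lo + (v - min_a) / (max_a - min_a) * (hi - lo)
    return max(lo, min(hi, scaled))

heights = [1.5, 1.6, 1.85]
print([min_max(h, 1.5, 1.85) for h in heights])   # [0.0, 0.286..., 1.0]
print(min_max(2.0, 1.5, 1.85))                    # 1.0 (clipped)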
5. K-Nearest Neighbors Classification

• Dealing with non-numeric attributes: For non-numeric attributes (e.g., nominal or categorical), a simple method is to compare the corresponding value of the non-numeric attribute in tuple X1 with that in tuple X2.
- If the two are identical (e.g., tuples X1 and X2 both have the color blue), then the difference between the two is 0.
- If the two are different (e.g., tuple X1 is blue but tuple X2 is red), then the difference is 1.
172
Summary

173
Exercises

1. Compute Gain(income), Gain(student), and Gain(credit_rating).
2. Compute GainRatio(age), GainRatio(student),
and GainRatio(credit_rating).
3. Compute Gini index values for splits on the
subsets {low, high} and {medium}, {medium, high}
and {low}).

174
Exercises

4. Given a training data set D shown in the table below for a binary classification problem. The class label attribute Play has two different values {Yes, No}.

175
Exercises

a. Compute the information gain for the attribute Outlook.
b. Compute the gain ratio for the attribute Temperature using Gain(Temperature) = 0.0292.
c. Compute the Gini index for the attribute Temperature and the splitting subset {Cool, Mild}.

176
Exercises

5. Consider a training data set D that contains p = 60 positive examples and n = 100 negative examples. Suppose that we are given the following two candidate rules.
Rule r1: covers p1 = 50 positive examples and n1 = 5 negative examples.
Rule r2: covers p2 = 2 positive examples and n2 = 0 negative examples.

177
Exercises

Which rule is better according to
a. the accuracy metric?
b. the coverage metric?
c. the FOIL_Gain metric? Assume that the initial rule r0: {} → + covers p0 = 60 positive examples and n0 = 100 negative examples.
d. the likelihood ratio statistic R?
e. the Laplace metric?

178
Exercises

6. You are given a training dataset D shown in the table below for a binary classification problem.
The class-labeled training dataset D

179
Exercises

The dataset D given above contains the details of policy holders at an insurance company. The attributes (i.e., descriptive features) included in the table describe each policy holder's ID, gender, age, the type of insurance policy they hold, and their preferred contact channel. The preferred contact channel attribute is the class label attribute (i.e., target feature) that has two different values {phone, email}.
Given the test instance X = (Gender = female, Age = young, Policy = plan A), what class label will a naïve Bayesian classifier predict for the given test instance X?
180
Exercises

7. You are given a training dataset D shown in the table below for a binary classification problem, where MP = Magazine Promotion, WP = Watch Promotion, LIP = Life Insurance Promotion, CCI = Credit Card Insurance.
The class-labeled training dataset D for credit card customers

181
Exercises

The class label attribute Gender of a cardholder has two different values {Male, Female}. Given the test instance X = (MP = Yes, WP = Yes, LIP = No, CCI = No), what class label will a naïve Bayesian classifier predict for the given test instance X?

182
References

1. Jiawei Han, Jian Pei, Hanghang Tong. 2022. Data Mining: Concepts and Techniques. 4th Ed. Morgan Kaufmann. ISBN: 0128117605.
2. Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, Vipin Kumar. 2018. Introduction to Data Mining. 2nd Ed. Pearson. ISBN: 0133128903.
3. Charu C. Aggarwal. 2015. Data Mining: The Textbook. Springer. ISBN: 3319141414.

183
References

4. Nong Ye. 2013. Data Mining: Theories, Algorithms, and Examples. CRC Press. ISBN: 1439808384.
5. Aurelien Geron. 2017. Hands-On Machine Learning with Scikit-Learn and TensorFlow. O'Reilly Media. ISBN: 1491962291.
6. Sebastian Raschka, Vahid Mirjalili. 2017. Python Machine Learning. 2nd Ed. Packt Publishing. ISBN: 1787125939.

184
References

7. Gavin Hackeling. 2017. Mastering Machine Learning with scikit-learn. 2nd Ed. Packt Publishing. ISBN: 1788299876.
8. Peter Harrington. 2012. Machine Learning in Action. Manning Publications. ISBN: 1617290181.
9. Prateek Joshi. 2017. Artificial Intelligence with Python. Packt Publishing. ISBN: 178646439X.

185
Extra Slides

• The Iris flower classes are setosa (class value = 0), versicolor (class value = 1), and virginica (class value = 2).

186
Visualize a Decision Tree – Iris Data Set

• The Iris data set contains 150 tuples (50 setosa tuples (class value = 0), 50 versicolor tuples (class value = 1), and 50 virginica tuples (class value = 2)).

187
Visualize a Decision Tree

• Install the graphviz package


C:\>conda install -c anaconda graphviz
Proceed ([y]/n)? y

• Check the installation of the graphviz package


C:\> conda list
...
graphviz 2.38.0 4 anaconda
...

188
Visualize a Decision Tree

• Open Windows Command Prompt, type


C:\Windows\System32>cd D:\Visualize
C:\Windows\System32>D:

• Convert the iris_tree.dot file to iris_tree.png
D:\Visualize>dot -Tpng iris_tree.dot -o iris_tree.png

• Convert the iris_tree.dot file to iris_tree.pdf
D:\Visualize>dot -Tpdf iris_tree.dot -o iris_tree.pdf

189
Visualize a Decision Tree

Iris Decision Tree
190


Visualize a Decision Tree

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz

# Load data
iris = load_iris()
# extract petal length and width
X = iris.data[:, 2:]
y = iris.target

191
Visualize a Decision Tree

# Create and train decision tree
tree_clf = DecisionTreeClassifier(
    max_depth=2, random_state=42)
tree_clf.fit(X, y)

192
Visualize a Decision Tree

# Visualize
export_graphviz(
    tree_clf,
    out_file="iris_tree.dot",
    feature_names=iris.feature_names[2:],  # use petal length and width only
    class_names=iris.target_names,
    rounded=True,
    filled=True)

193
Extra Slides

• Typical data partitioning methods


1. A simple hold-out validation split

194
Extra Slides

2. A k-fold cross-validation (e.g., 3, 4, or 5)

3-fold cross-validation

195
Extra Slides

3. Random sampling

196
Extra Slides

4. Bootstrapping

197
Extra Slides - Weka

• Weka is a machine learning tool (classification, clustering, association rules, ...)
https://www.cs.waikato.ac.nz/ml/weka/

198
