
Data Mining:

Concepts and
Techniques
(3rd ed.)

— Chapter 8 —

Jiawei Han, Micheline Kamber, and Jian Pei


University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
1
Chapter 8. Classification: Basic
Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
2
Classification
Classification is a task in data mining that involves
assigning a class label to each instance in a dataset
based on its features.
The goal of classification is to build a model that
accurately predicts the class labels of new instances
based on their features.
There are two main types of classification:
binary classification and multi-class classification.
Binary classification involves classifying instances
into two classes, such as “spam” or “not spam”, while
multi-class classification involves classifying
instances into more than two classes.
3
Classification
The process of building a classification model typically
involves the following steps:

Data Collection:
The first step in building a classification model is data
collection.
Data Preprocessing:
The second step in building a classification model is data
preprocessing.
Handling Missing Values
Dealing with Outliers
Data Transformation

4
Classification
Feature Selection:
The third step in building a classification model is feature selection.
Correlation Analysis
Information Gain: Information gain is a measure of the amount of information
that a feature provides for classification. Features with high information gain
are selected for classification.
Principal Component Analysis (PCA)
Model Selection:
The fourth step in building a classification model is model selection.

Decision Trees
Support Vector Machines
Neural Networks (RNN or CNN)
5
Classification
Model Training:
The fifth step in building a classification model is model training.
Model training involves using the selected classification algorithm to learn the
patterns in the data.
The data is divided into a training set and a validation set.
The model is trained using the training set, and its performance is evaluated on
the validation set.
Model Evaluation:
The sixth step in building a classification model is model evaluation.
Model evaluation involves assessing the performance of the trained model on a
test set.
This is done to ensure that the model generalizes well to unseen data.

6
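To make the workflow above concrete, here is a minimal sketch (not from the slides) of splitting data into training, validation, and test sets and evaluating a model; it assumes scikit-learn is available, and the synthetic dataset and the decision-tree model are illustrative placeholders.

# Minimal sketch of the train / validate / evaluate workflow described above.
# Assumes scikit-learn is installed; the synthetic data and the choice of a
# decision tree are placeholders, not part of the original slides.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Split into training, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0)

model = DecisionTreeClassifier(max_depth=5, random_state=0)
model.fit(X_train, y_train)                                                  # model training
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))   # model selection
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))       # final evaluation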
Data Mining Techniques

7
Supervised vs. Unsupervised
Learning
 Supervised learning (classification)
 Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of training data are unknown
 Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in
the data
8
Prediction Problems:
Classification vs. Numeric
Prediction
 Classification
 predicts categorical class labels (discrete or nominal)

 classifies data (constructs a model) based on the training

set and the values (class labels) in a classifying attribute


and uses it in classifying new data
 Numeric Prediction
 models continuous-valued functions, i.e., predicts

unknown or missing values


 Typical applications
 Credit/loan approval:

 Medical diagnosis: if a tumor is cancerous or benign

 Fraud detection: if a transaction is fraudulent

 Web page categorization: which category it is

9
Classification—A Two-Step
Process
 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class, as

determined by the class label attribute


 The set of tuples used for model construction is training set

 The model is represented as classification rules, decision trees, or

mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model


The known label of test sample is compared with the classified
result from the model

Accuracy rate is the percentage of test set samples that are
correctly classified by the model

Test set is independent of training set (otherwise overfitting)
 If the accuracy is acceptable, use the model to classify new data

 Note: If the test set is used to select models, it is called validation (test) set
10
Process (1): Model
Construction

Classification
Algorithms
Training
Data

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof    3       no
Mary   Assistant Prof    7       yes
Bill   Professor         2       yes
Jim    Associate Prof    7       yes
Dave   Assistant Prof    6       no
Anne   Associate Prof    3       no

Classifier (Model):
IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
11
Process (2): Using the Model in
Prediction

Classifier

Testing Data:
NAME     RANK            YEARS   TENURED
Tom      Assistant Prof    2       no
Merlisa  Associate Prof    7       no
George   Professor         5       yes
Joseph   Assistant Prof    7       yes

Unseen Data: (Jeff, Professor, 4) -> Tenured?
12
Illustrating Classification Task
Training Set -> Learning algorithm (Induction) -> Learn Model -> Model
Model + Test Set -> Apply Model (Deduction)

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

13
Examples of Classification Task
 Predicting tumor cells as benign or malignant

 Classifying credit card transactions


as legitimate or fraudulent

 Classifying secondary structures of protein


as alpha-helix, beta-sheet, or random
coil

 Categorizing news stories as finance,


weather, entertainment, sports, etc

14
The Classification Problem
(informal definition)

 Given a collection of annotated data (in this case, 5 instances of Katydids
and five of Grasshoppers), decide what type of insect the unlabeled example is.

Katydid or Grasshopper?
15
For any domain of interest, we can measure features

Color {Green, Brown, Gray, Other}, Has Wings?, Abdomen Length, Thorax Length,
Antennae Length, Mandible Size, Spiracle Diameter, Leg Length

16
Our_Collection

We can store features in a database. The classification problem can now be
expressed as: given a training database (Our_Collection), predict the class
label of a previously unseen instance.

Insect ID  Abdomen Length  Antennae Length  Insect Class
1          2.7             5.5              Grasshopper
2          8.0             9.1              Katydid
3          0.9             4.7              Grasshopper
4          1.1             3.1              Grasshopper
5          5.4             8.5              Katydid
6          2.9             1.9              Grasshopper
7          6.1             6.6              Katydid
8          0.5             1.0              Grasshopper
9          8.3             6.6              Katydid
10         8.1             4.7              Katydids
11         5.1             7.0              ???????   (previously unseen instance)


17
Grasshoppers / Katydids

[Scatter plot: Antenna Length (vertical axis) vs. Abdomen Length (horizontal
axis) for the Grasshopper and Katydid instances]

18
Grasshoppers / Katydids

[Same scatter plot: Antenna Length vs. Abdomen Length]

Each of these data objects is called:
• an exemplar
• a (training) example
• an instance
• a tuple

19
Problem 1

Examples of class A (left bar, right bar): (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B (left bar, right bar): (5, 2.5), (5, 2), (8, 3), (4.5, 3)
20
Problem 1

Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)

What class is this object? (8, 1.5)
What about this one, A or B? (4.5, 7)
21
Problem 2                          Oh! This one's hard!

Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3)

What class is this object? (8, 1.5)
22
Problem 3                          This one is really hard!
                                   What is this, A or B?

Examples of class A: (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B: (5, 6), (7, 5), (4, 8), (7, 7)

Unlabeled query: (6, 6)
23
Why did we spend so much time with these problems?

Because we wanted to show that almost all classification problems have a
geometric interpretation. Check out the next 3 slides…
24
Problem 1
[Plot: class A and class B examples in (Right Bar, Left Bar) space]

Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)

Here is the rule again: if the left bar is smaller than the right bar, it is
an A, otherwise it is a B.
25
Problem 2
[Plot: class A and class B examples in (Right Bar, Left Bar) space]

Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3)

Let me look it up… here it is: the rule is, if the two bars are equal sizes,
it is an A. Otherwise it is a B.
26
Problem 3
[Plot: class A and class B examples in (Right Bar, Left Bar) space, axes from 10 to 100]

Examples of class A: (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B: (5, 6), (7, 5), (4, 8), (7, 7)

The rule again: if the square of the sum of the two bars is less than or
equal to 100, it is an A. Otherwise it is a B.
27
Grasshoppers / Katydids

[Scatter plot: Antenna Length vs. Abdomen Length for the insect data]

28
previously unseen instance = 11, 5.1, 7.0, ???????

• We can "project" the previously unseen instance into the same space as the
  database.

• We have now abstracted away the details of our particular problem. It will
  be much easier to talk about points in space.

[Scatter plot: the unseen instance plotted among the Katydids and
Grasshoppers in (Abdomen Length, Antenna Length) space]
29
Simple Linear Classifier
R.A. Fisher, 1890-1962

[Plot: a straight decision line separating Katydids (above) from Grasshoppers
(below) in (Abdomen Length, Antenna Length) space]

If previously unseen instance above the line
    then class is Katydid
    else class is Grasshopper
30
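A tiny sketch of this linear decision rule in code; the line coefficients below are made up for illustration and are not the ones Fisher's method would actually fit to the insect data.

# Sketch of the "simple linear classifier" rule above: a point above the
# decision line is a Katydid, otherwise a Grasshopper. The slope and intercept
# are illustrative placeholders, not fitted values.
def classify_insect(abdomen_length, antenna_length, slope=-1.0, intercept=10.0):
    # Decision line: antenna = slope * abdomen + intercept
    line_value = slope * abdomen_length + intercept
    return "Katydid" if antenna_length > line_value else "Grasshopper"

print(classify_insect(5.1, 7.0))   # the previously unseen instance (ID 11)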
Classification Accuracy

                              Predicted class
                              Class = Katydid (1)   Class = Grasshopper (0)
Actual  Class = Katydid (1)          f11                   f10
class   Class = Grasshopper (0)      f01                   f00

Accuracy   = Number of correct predictions / Total number of predictions
           = (f11 + f00) / (f11 + f10 + f01 + f00)

Error rate = Number of wrong predictions / Total number of predictions
           = (f10 + f01) / (f11 + f10 + f01 + f00)

31
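A small sketch of these two formulas in code; the counts used here are hypothetical.

# Accuracy and error rate from the four confusion-matrix counts above
# (f11, f10, f01, f00). The counts are made-up example values.
def accuracy(f11, f10, f01, f00):
    return (f11 + f00) / (f11 + f10 + f01 + f00)

def error_rate(f11, f10, f01, f00):
    return (f10 + f01) / (f11 + f10 + f01 + f00)

counts = dict(f11=40, f10=10, f01=5, f00=45)        # hypothetical counts
print(accuracy(**counts), error_rate(**counts))     # 0.85 0.15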
Confusion Matrix
 In a binary decision problem, a classifier labels examples as either positive
or negative.
 Classifiers produce a confusion (contingency) matrix, which shows four
quantities: TP (true positive), TN (true negative), FP (false positive),
FN (false negative)

Confusion Matrix
                         Actual positive (+)   Actual negative (-)
Predicted positive (Y)          TP                    FP
Predicted negative (N)          FN                    TN

32
The simple linear classifier is defined for higher
dimensional spaces

33
we can visualize it as being an
n-dimensional hyperplane

34
It is interesting to think about what would happen in this example if we did not have
the 3rd dimension…

35
Which of the "Problems" can be solved by the Simple Linear Classifier?

1) Perfect
2) Useless
3) Pretty Good

[Plots: the three problems with a candidate linear decision boundary]

Problems that can be solved by a linear classifier are called
linearly separable.

36
A Famous Problem
R. A. Fisher's Iris Dataset.

• 3 classes
• 50 of each class

The task is to classify Iris plants into one of 3 varieties (Setosa,
Versicolor, Virginica) using the Petal Length and Petal Width.

Iris Setosa Iris Versicolor Iris Virginica 37


We can generalize the piecewise linear classifier to N classes, by fitting N-1 lines. In this case
we first learned the line to (perfectly) discriminate between Setosa and Virginica/Versicolor,
then we learned to approximately discriminate between Virginica and Versicolor.

[Plot: piecewise linear boundaries separating Setosa, Versicolor, and
Virginica in petal width / petal length space]
If petal width > 3.272 – (0.325 * petal length) then class = Virginica
Elseif petal width…
38
We have now seen one classification algorithm,
and we are about to see more. How should we
compare them?
• Predictive accuracy
• Speed and scalability
– time to construct the model
– time to use the model
– efficiency in disk-resident databases
• Robustness
– handling noise, missing values and irrelevant features, streaming data
• Interpretability:
– understanding and insight provided by the model

39
Chapter 8. Classification: Basic
Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
40
Another Example of Decision
Tree
(attribute types: categorical, categorical, continuous, class)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Tree: MarSt? {Married -> NO; Single, Divorced -> Refund?}
      Refund? {Yes -> NO; No -> TaxInc?}
      TaxInc? {< 80K -> NO; > 80K -> YES}

There could be more than one tree that fits the same data!

41
Decision Tree Induction: An
Example
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

 Training data set: Buys_computer
 The data set follows an example of Quinlan's ID3 (Playing Tennis)
 Resulting tree:

age? {<=30 -> student?; 31..40 -> yes; >40 -> credit rating?}
student? {no -> no; yes -> yes}
credit rating? {excellent -> no; fair -> yes}
42
Example of a Decision Tree
(attribute types: categorical, categorical, continuous, class)

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes)
Refund? {Yes -> NO; No -> MarSt?}
MarSt?  {Married -> NO; Single, Divorced -> TaxInc?}
TaxInc? {< 80K -> NO; > 80K -> YES}

43
Decision Tree Classification
Task

Training Set -> Tree Induction algorithm (Induction) -> Learn Model -> Model (Decision Tree)
Model + Test Set -> Apply Model (Deduction)

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
44
Apply Model to Test Data

Start from the root of tree.

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Tree: Refund? {Yes -> NO; No -> MarSt?}
      MarSt?  {Single, Divorced -> TaxInc?; Married -> NO}
      TaxInc? {< 80K -> NO; > 80K -> YES}

45
Apply Model to Test Data

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Refund = No -> follow the "No" branch to MarSt
MarSt = Married -> assign Cheat = "No"

50
Decision Tree Terminology

51
Algorithm for Decision Tree
Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-

conquer manner
 At start, all the training examples are at the root

 Attributes are categorical (if continuous-valued, they are

discretized in advance)
 Examples are partitioned recursively based on selected

attributes
 Test attributes are selected on the basis of a heuristic or

statistical measure (e.g., information gain)


 Conditions for stopping partitioning
 All samples for a given node belong to the same class

 There are no remaining attributes for further partitioning –

majority voting is employed for classifying the leaf


 There are no samples left
52
Decision Tree Classification
Task
Training Set -> Tree Induction algorithm (Induction) -> Learn Model -> Model (Decision Tree)
Model + Test Set -> Apply Model (Deduction)

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Test Set

53
Decision Tree Induction
 Many Algorithms:
 Hunt’s Algorithm (one of the earliest)

 CART

 ID3, C4.5

 SLIQ,SPRINT

 John Ross Quinlan is a computer science researcher in data


mining and decision theory.
 He has contributed extensively to the development of decision
tree algorithms, including inventing the
canonical C4.5 and ID3 algorithms.
54
Decision Tree Classifier

Ross Quinlan

[Plot: the insect data in (Abdomen Length, Antenna Length) space, partitioned
by the tree below]

Abdomen Length > 7.1? {yes -> Katydid; no -> Antenna Length > 6.0?}
Antenna Length > 6.0? {yes -> Katydid; no -> Grasshopper}
55
Antennae shorter than body? {Yes -> Grasshopper; No -> 3 Tarsi?}
3 Tarsi? {Yes -> Foretiba has ears?; No -> Cricket}
Foretiba has ears? {Yes -> Katydids; No -> Camel Cricket}

Decision trees predate computers.


56
Definition
 Decision tree is a classifier in the form of a tree structure
 Decision node: specifies a test on a single attribute
 Leaf node: indicates the value of the target attribute
 Arc/edge: split of one attribute
 Path: a disjunction of tests to make the final decision

 Decision trees classify instances or examples by starting at the
root of the tree and moving through it until a leaf node is reached.

57
Decision Tree Classification
Decision tree generation consists of two phases
 Tree construction


At start, all the training examples are at the root

Partition examples recursively based on selected
attributes
 Tree pruning


Identify and remove branches that reflect noise or
outliers
Use of decision tree: Classifying an unknown sample
 Test the attribute values of the sample against the decision

tree

58
Decision Tree Representation

 Each internal node tests an attribute


 Each branch corresponds to attribute value
 Each leaf node assigns a classification
outlook? {sunny -> humidity?; overcast -> yes; rain -> wind?}
humidity? {high -> no; normal -> yes}
wind? {strong -> no; weak -> yes}

59
How to Construct Decision Tree

Basic algorithm (a greedy algorithm)


 Tree is constructed in a top-down recursive divide-and-conquer

manner
 At start, all the training examples are at the root

 Attributes are categorical (if continuous-valued, they can be

discretized in advance)
 Examples are partitioned recursively based on selected attributes.

 Test attributes are selected on the basis of a heuristic or statistical

measure (e.g., information gain)


Conditions for stopping partitioning
 All samples for a given node belong to the same class

 There are no remaining attributes for further partitioning –

majority voting is employed for classifying the leaf


 There are no samples left
60
Top-Down Decision tree
Induction

Main loop:

1. A <- the "best" decision attribute for the next node
2. Assign A as the decision attribute for node
3. For each value of A, create a new descendant of node
4. Sort training examples to leaf nodes
5. If training examples perfectly classified,
   Then STOP, Else iterate over new leaf nodes

61
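A compact sketch of this main loop (greedy top-down induction on categorical attributes), using information gain as the "best attribute" heuristic that the deck defines in detail later; the attribute names and the tiny example table are illustrative only.

# Sketch of the top-down induction main loop above: pick the "best" attribute,
# split, and recurse until the examples at a node are perfectly classified.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    n = len(labels)
    gain = entropy(labels)
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

def build_tree(rows, labels, attributes):
    if len(set(labels)) == 1:                      # perfectly classified -> leaf
        return labels[0]
    if not attributes:                             # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    node = {best: {}}
    for value in set(row[best] for row in rows):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        node[best][value] = build_tree([rows[i] for i in keep],
                                       [labels[i] for i in keep],
                                       [a for a in attributes if a != best])
    return node

# Illustrative toy data (hypothetical, not from the slides)
rows = [{"refund": "yes", "marital": "single"},
        {"refund": "no",  "marital": "married"},
        {"refund": "no",  "marital": "single"}]
labels = ["no", "no", "yes"]
print(build_tree(rows, labels, ["refund", "marital"]))
# e.g. {'refund': {'yes': 'no', 'no': {'marital': {'married': 'no', 'single': 'yes'}}}}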
Tree Induction

 Greedy strategy.
 Split the records based on an attribute test that

optimizes certain criterion.

 Issues
 Determine how to split the records


How to specify the attribute test condition?

How to determine the best split?
 Determine when to stop splitting

62
How to Split Records
 Random Split

The tree can grow huge

These trees are hard to understand.

Larger trees are typically less accurate than smaller trees.

 Principled Criterion

Selection of an attribute to test at each node - choosing the most useful
attribute for classifying examples.

How?

Information gain

measures how well a given attribute separates the training examples
according to their target classification

This measure is used to select among the candidate attributes at
each step while growing the tree
63
Tree Induction

 Greedy strategy:
 Split the records based on an attribute test that optimizes a certain
criterion.
 Hunt's algorithm: recursively partition training records into
successively purer subsets.
 How to measure purity/impurity?

 Entropy and information gain (covered in the lecture slides)

 Gini (covered in the textbook)

 Classification error

64
How to determine the Best
Split
Before Splitting: 10 records of class 0, 10 records of class 1

Candidate splits:
  Own Car?    {Yes: C0=6, C1=4;  No: C0=4, C1=6}
  Car Type?   {Family: C0=1, C1=3;  Sports: C0=8, C1=0;  Luxury: C0=1, C1=7}
  Student ID? {c1: C0=1, C1=0; ...; c10: C0=1, C1=0; c11: C0=0, C1=1; ...; c20: C0=0, C1=1}

Which test condition is the best?
Why is student id a bad feature to use?

65
How to determine the Best
Split
 Greedy approach:
 Nodes with homogeneous class

distribution are preferred


 Need a measure of node impurity:

C0: 5 C0: 9
C1: 5 C1: 1

Non-homogeneous, Homogeneous,
High degree of impurity Low degree of impurity

66
Picking a Good Split Feature
 Goal is to have the resulting tree be as small as possible, per
Occam’s razor.
 Finding a minimal decision tree (nodes, leaves, or depth) is an NP-
hard optimization problem.
 Top-down divide-and-conquer method does a greedy search for a
simple tree but does not guarantee to find the smallest.
 General lesson in Machine Learning and Data Mining: “Greed

is good.”
 Want to pick a feature that creates subsets of examples that are
relatively “pure” in a single class so they are “closer” to being leaf
nodes.
 There are a variety of heuristics for picking a good test, a popular
one is based on information gain that originated with the ID3
system of Quinlan (1979).
67
Information Theory
 Think of playing "20 questions": I am thinking of an integer
between 1 and 1,000 -- what is it? What is the first
question you would ask?
 What question will you ask?
 Why?

 Entropy measures how much more information you need


before you can identify the integer.
 Initially, there are 1000 possible values, which we assume
are equally likely.
 What is the maximum number of questions you need to
ask?
68
Entropy
 Entropy (disorder, impurity) of a set of examples, S,
relative to a binary classification is:

Entropy(S) = - p1 log2(p1) - p0 log2(p0)


where p1 is the fraction of positive examples in S and
p0 is the fraction of negatives.
 If all examples are in one category, entropy is zero (we
define 0log(0)=0)
 If examples are equally mixed (p1=p0=0.5), entropy is a
maximum of 1.
69
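A quick numeric check of this definition (a sketch, not part of the slides), confirming the two extreme cases in the bullets above.

import math

def binary_entropy(p1):
    # Entropy(S) = -p1 log2(p1) - p0 log2(p0), with 0*log2(0) defined as 0
    p0 = 1.0 - p1
    h = 0.0
    for p in (p1, p0):
        if p > 0:
            h -= p * math.log2(p)
    return h

print(binary_entropy(1.0))   # 0.0 -> all examples in one class
print(binary_entropy(0.5))   # 1.0 -> evenly mixed classes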
Entropy
 Entropy can be viewed as the number of bits required
on average to encode the class of an example in S
where data compression (e.g. Huffman coding) is used
to give shorter codes to more likely cases.
 For multi-class problems with c categories, entropy
generalizes to:

Entropy(S) = - Σ (i = 1 to c) pi log2(pi)

70
Entropy Plot for Binary
Classification

[Plot: entropy of a binary (m = 2) classification as a function of the class proportion; maximum of 1 at 0.5]

71
Brief Review of Entropy

 The entropy is 0 if the outcome is certain.


 The entropy is maximum if we have no knowledge of the
system (or any outcome is equally possible).

Entropy of a 2-class
problem with regard to
the portion of one of the
two groups

72
Information Gain

Is the expected reduction in entropy caused by


partitioning the examples according to this
attribute.

Is the number of bits saved when encoding the


target value of an arbitrary member of S, by
knowing the value of attribute A.

73
Information Gain in Decision
Tree Induction
 Assume that using attribute A, a current set will be
partitioned into some number of child sets

 The encoding information that would be gained by


branching on A

Gain( A) E (Current set )   E (all child sets )

Note: entropy is at its minimum if the collection of objects is completely uniform

74
Example for Computing
Entropy
Entropy(t) = - Σ_j p(j | t) log2 p(j | t)

NOTE: p( j | t) is computed as the relative frequency of class j at node t

C1 0 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1


C2 6 Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0

C1 1 P(C1) = 1/6 P(C2) = 5/6


C2 5 Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65

C1 2 P(C1) = 2/6 P(C2) = 4/6


C2 4 Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92

C1 3   C2 3   P(C1) = 3/6 = 1/2   P(C2) = 3/6 = 1/2
       Entropy = - (1/2) log2(1/2) - (1/2) log2(1/2) = -(1/2)(-1) - (1/2)(-1) = 1/2 + 1/2 = 1
75
How to Calculate log2x
 Many calculators only have a button for log10x and
logex (note log typically means log10)
 You can calculate the log for any base b as follows:
 logb(x) = logk(x) / logk(b)

 Thus log2(x) = log10(x) / log10(2)


 Since log10(2) = .301, just calculate the log base 10
and divide by .301 to get log base 2.
 You can use this for HW if needed

76
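The same change-of-base trick in Python (math.log2 is also available directly); a tiny sketch, not from the slides.

import math

x = 8
print(math.log10(x) / math.log10(2))   # ~3.0  (log base 2 via base 10)
print(math.log2(x))                    # 3.0   (direct)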
Splitting Based on INFO…….
 Information Gain:
GAIN_split = Entropy(p) - Σ (i = 1 to k) (ni / n) Entropy(i)

Parent Node, p is split into k partitions;


ni is number of records in partition i
 Measures Reduction in Entropy achieved because of
the split. Choose the split that achieves most
reduction (maximizes GAIN)
 Used in ID3 and C4.5
 Disadvantage: Tends to prefer splits that result in
large number of partitions, each being small but
pure.
77
Continuous Attribute?
(more on it later)
 Each non-leaf node is a test, its edge partitioning the attribute
into subsets (easy for discrete attribute).
 For continuous attribute

Partition the continuous value of attribute A into a discrete set
of intervals
 Create a new boolean attribute Ac, looking for a threshold c:
     Ac = true if A <= c, false otherwise
How to choose c ?

78
Person   Hair Length  Weight  Age  Class
Homer    0"           250     36   M
Marge    10"          150     34   F
Bart     2"           90      10   M
Lisa     6"           78      8    F
Maggie   4"           20      1    F
Abe      1"           170     70   M
Selma    8"           160     41   F
Otto     10"          180     38   M
Krusty   6"           200     45   M

Comic    8"           290     38   ?
79
Entropy(S) = - (p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911

Let us try splitting on Hair Length:  Hair Length <= 5?
  yes: Entropy(1F,3M) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.8113
  no:  Entropy(3F,2M) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.9710

Gain(A) = E(current set) - Σ E(all child sets)
Gain(Hair Length <= 5) = 0.9911 - (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911


80
Entropy(S) = - (p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911

Let us try splitting on Weight:  Weight <= 160?
  yes: Entropy(4F,1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219
  no:  Entropy(0F,4M) = -(0/4)log2(0/4) - (4/4)log2(4/4) = 0

Gain(A) = E(current set) - Σ E(all child sets)
Gain(Weight <= 160) = 0.9911 - (5/9 * 0.7219 + 4/9 * 0) = 0.5900


81
Entropy(S) = - (p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911

Let us try splitting on Age:  Age <= 40?
  yes: Entropy(3F,3M) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1
  no:  Entropy(1F,2M) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.9183

Gain(A) = E(current set) - Σ E(all child sets)
Gain(Age <= 40) = 0.9911 - (6/9 * 1 + 3/9 * 0.9183) = 0.0183


82
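A short sketch that reproduces the three gains computed on the last three slides from the Person table; the values should come out near 0.0911, 0.5900 and 0.0183 (up to rounding).

import math

# (hair_length, weight, age, class) rows from the Person table above
people = [(0, 250, 36, "M"), (10, 150, 34, "F"), (2, 90, 10, "M"),
          (6, 78, 8, "F"), (4, 20, 1, "F"), (1, 170, 70, "M"),
          (8, 160, 41, "F"), (10, 180, 38, "M"), (6, 200, 45, "M")]

def entropy(labels):
    n = len(labels)
    return sum(-labels.count(c) / n * math.log2(labels.count(c) / n)
               for c in set(labels))

def gain(rows, attr_index, threshold):
    labels = [r[-1] for r in rows]
    left = [r[-1] for r in rows if r[attr_index] <= threshold]
    right = [r[-1] for r in rows if r[attr_index] > threshold]
    return entropy(labels) - (len(left) / len(rows)) * entropy(left) \
                           - (len(right) / len(rows)) * entropy(right)

print(round(gain(people, 0, 5), 4))     # Hair Length <= 5  -> ~0.0911
print(round(gain(people, 1, 160), 4))   # Weight <= 160     -> ~0.5900
print(round(gain(people, 2, 40), 4))    # Age <= 40         -> ~0.0183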
Of the 3 features we had, Weight was best.

But while people who weigh over 160 are perfectly classified (as males), the
under-160 people are not perfectly classified… So we simply recurse!

This time we find that we can split on Hair Length, and we are done!

Weight <= 160? {yes -> Hair Length <= 2?; no -> Male}
83
We don't need to keep the data around, just the test conditions.

Weight <= 160? {yes -> Hair Length <= 2?; no -> Male}
Hair Length <= 2? {yes -> Male; no -> Female}

How would these people be classified?
84
It is trivial to convert Decision Trees to rules…

Weight <= 160? {yes -> Hair Length <= 2?; no -> Male}
Hair Length <= 2? {yes -> Male; no -> Female}

Rules to Classify Males/Females

If Weight greater than 160, classify as Male
Elseif Hair Length less than or equal to 2, classify as Male
Else classify as Female
85
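The same rules written as a small function (a sketch of the tree above, using the thresholds from the worked example).

# The decision tree above expressed directly as IF-THEN rules in code.
def classify(weight, hair_length):
    if weight > 160:
        return "Male"       # Weight > 160 -> Male
    elif hair_length <= 2:
        return "Male"       # Weight <= 160 and Hair Length <= 2 -> Male
    else:
        return "Female"     # otherwise -> Female

print(classify(weight=290, hair_length=8))   # the unlabeled "Comic" row -> Male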
Once we have learned the decision tree, we don’t even need a computer!

This decision tree is attached to a medical machine, and is designed to help


nurses make decisions about what type of doctor to call.

Decision tree for a typical shared-care setting applying the system for the
diagnosis of prostatic obstructions.
86
The worked examples we have seen were performed on small datasets. However
with small datasets there is a great danger of overfitting the data…

When you have few datapoints, there are many possible splitting rules that
perfectly classify the data, but will not generalize to future datasets.

Wears green? {Yes -> Female; No -> Male}

For example, the rule "Wears green?" perfectly classifies the data, so does
"Mothers name is Jacqueline?", so does "Has blue shoes"…
87
Attribute Selection Measure:
Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain
 Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
 Expected information (entropy) needed to classify a tuple in D:
     Info(D) = - Σ (i = 1 to m) pi log2(pi)
 Information needed (after using A to split D into v partitions) to classify D:
     Info_A(D) = Σ (j = 1 to v) (|Dj| / |D|) × Info(Dj)
 Information gained by branching on attribute A:
     Gain(A) = Info(D) - Info_A(D)
88
Attribute Selection: Information
Gain
 Class P: buys_computer = "yes"        Class N: buys_computer = "no"

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

age      pi   ni   I(pi, ni)
<=30     2    3    0.971
31…40    4    0    0
>40      3    2    0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

(5/14) I(2,3) means "age <=30" has 5 out of 14 samples, with 2 yes'es and 3 no's.

Hence Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

(Training data: the buys_computer table shown earlier.)
89
Computing Information Gain
for Continuous-Valued Attributes
 Let attribute A be a continuous-valued attribute
 Must determine the best split point for A
 Sort the value A in increasing order
 Typically, the midpoint between each pair of adjacent values
is considered as a possible split point
 (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
 The point with the minimum expected information
requirement for A is selected as the split-point for A
 Split:
 D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point
90
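A sketch of this midpoint search: sort the values of A, evaluate each midpoint as a candidate split point, and keep the one with the minimum expected information requirement; the toy values below are made up.

import math

def entropy(labels):
    n = len(labels)
    return sum(-labels.count(c) / n * math.log2(labels.count(c) / n)
               for c in set(labels))

def best_split_point(values, labels):
    # Candidate split points are midpoints between adjacent sorted values
    pairs = sorted(zip(values, labels))
    best = None
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue
        mid = (a + b) / 2
        left = [lab for v, lab in pairs if v <= mid]
        right = [lab for v, lab in pairs if v > mid]
        expected_info = (len(left) / len(pairs)) * entropy(left) \
                      + (len(right) / len(pairs)) * entropy(right)
        if best is None or expected_info < best[1]:
            best = (mid, expected_info)
    return best

# Toy data: the chosen split point should land between the two classes
print(best_split_point([60, 70, 75, 85, 90, 95],
                       ["no", "no", "no", "yes", "yes", "yes"]))   # (80.0, 0.0)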
Gain Ratio for Attribute
Selection (C4.5)
 Information gain measure is biased towards attributes with a
large number of values
 C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
SplitInfo_A(D) = - Σ (j = 1 to v) (|Dj| / |D|) × log2(|Dj| / |D|)
 GainRatio(A) = Gain(A)/SplitInfo(A)
 Ex.

 gain_ratio(income) = 0.029/1.557 = 0.019


 The attribute with the maximum gain ratio is selected as the
splitting attribute
91
Measure of Impurity: GINI (at
node t)
 Gini Index for a given node t with classes j:
     GINI(t) = 1 - Σ_j [p(j | t)]^2

NOTE: p(j | t) is computed as the relative frequency of class j at node t

 Example: Two classes C1 & C2 and node t has 5 C1 and 5 C2 examples.
Compute Gini(t):
  1 - [p(C1|t)^2 + p(C2|t)^2] = 1 - [(5/10)^2 + (5/10)^2]
  = 1 - [1/4 + 1/4] = 1/2

 Do you think this Gini value indicates a good split or bad split? Is it an
extreme value?


92
More on Gini

Worst Gini corresponds to probabilities of 1/nc, where


nc is the number of classes.
For 2-class problems the worst Gini will be ½

How do we get the best Gini? Come up with an


example for node t with 10 examples for classes C1
and C2
10 C1 and 0 C2
Now what is the Gini?

1 - [(10/10)^2 + (0/10)^2] = 1 - [1 + 0] = 0
So 0 is the best Gini

So for 2-class problems:


Gini varies from 0 (best) to ½ (worst). 93
Some More Examples
 Below we see the Gini values for 4 nodes with different
distributions. They are ordered from best to worst.

 Note that thus far we are only computing GINI for


one node. We need to compute it for a split and
then compute the change in Gini from the parent
node.
C1 0 C1 1 C1 2 C1 3
C2 6 C2 5 C2 4 C2 3
Gini=0.000 Gini=0.278 Gini=0.444 Gini=0.500

94
Example of Computing GINI
GINI (t ) 1   [
j
p ( j | t )] 2

C1 0 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1


C2 6 Gini = 1 – P(C1)2 – P(C2)2 = 1 – 0 – 1 = 0

C1 1 P(C1) = 1/6 P(C2) = 5/6


C2 5 Gini = 1 – (1/6)2 – (5/6)2 = 0.278

C1 2 P(C1) = 2/6 P(C2) = 4/6


C2 4 Gini = 1 – (2/6)2 – (4/6)2 = 0.444

95
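A small sketch of this computation that reproduces the four node Gini values shown above (0, 0.278, 0.444, 0.5).

def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2, where p(j|t) is the class relative frequency
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, round(gini(counts), 3))   # 0.0, 0.278, 0.444, 0.5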
Gini Index (CART, IBM
IntelligentMiner)
 If a data set D contains examples from n classes, the gini index, gini(D),
is defined as
     gini(D) = 1 - Σ (j = 1 to n) pj^2
where pj is the relative frequency of class j in D
 If a data set D is split on A into two subsets D1 and D2, the gini index
gini_A(D) is defined as
     gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
 Reduction in Impurity:
     Δgini(A) = gini(D) - gini_A(D)
 The attribute that provides the smallest gini_A(D) (or the largest
reduction in impurity) is chosen to split the node (need to enumerate all
the possible splitting points for each attribute)
96
Gini Index (CART, IBM
IntelligentMiner)
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no 97
Computation of Gini Index
 Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no"
     gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
 Suppose the attribute income partitions D into 10 tuples in D1: {low,
medium} and 4 in D2: {high}
     gini_income in {low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
 Gini{low,high} is 0.458; Gini{medium,high} is 0.450. Thus, split on
{low,medium} (and {high}) since it has the lowest Gini index
 All attributes are assumed continuous-valued
 May need other tools, e.g., clustering, to get the possible split
values
 98
Comparing Attribute Selection
Measures
 The three measures, in general, return good results but
 Information gain:

biased towards multivalued attributes
 Gain ratio:

tends to prefer unbalanced splits in which one partition is
much smaller than the others
 Gini index:

biased to multivalued attributes

has difficulty when # of classes is large

tends to favor tests that result in equal-sized partitions
and purity in both partitions
99
Splitting Criteria Based on
Classification Error
 Classification error at a node t :

Error (t ) 1  max P (i | t )
i

 Measures misclassification error made by a node.


 Maximum (1 - 1/nc) when records are equally distributed
among all classes, implying least interesting information

Minimum (0.0) when all records belong to one class,
implying most interesting information

100
Examples of Computing Error
Error (t ) 1  max P (i | t )
i

C1 0 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1


C2 6 Error = 1 – max (0, 1) = 1 – 1 = 0

C1 1 P(C1) = 1/6 P(C2) = 5/6


C2 5 Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6

C1 2 P(C1) = 2/6 P(C2) = 4/6


C2 4 Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3

101
Comparison among Splitting
Criteria
For a 2-class problem:

102
Discussion
 Error rate is often the metric used to evaluate a
classifier (but not always)
 So it seems reasonable to use error rate to

determine the best split


 That is, why not just use a splitting metric that

matches the ultimate evaluation metric?


 But this is wrong!


The reason is related to the fact that decision trees use a
greedy strategy, so we need to use a splitting metric that
leads to globally better results

The other metrics will empirically outperform error rate,
although there is no proof for this.
103
Other Attribute Selection
Measures
 CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
 C-SEP: performs better than info. gain and gini index in certain cases
 G-statistic: has a close approximation to χ2 distribution
 MDL (Minimal Description Length) principle (i.e., the simplest solution is
preferred):
 The best tree as the one that requires the fewest # of bits to both (1)
encode the tree, and (2) encode the exceptions to the tree
 Multivariate splits (partition based on multiple variable combinations)
 CART: finds multivariate splits based on a linear comb. of attrs.
 Which attribute selection measure is the best?
 Most give good results, none is significantly superior to the others
104
Overfitting and Tree Pruning
 Overfitting: An induced tree may overfit the training data
 Too many branches, some may reflect anomalies due to

noise or outliers
 Poor accuracy for unseen samples

 Two approaches to avoid overfitting


 Prepruning: Halt tree construction early; do not split a node

if this would result in the goodness measure falling below a


threshold

Difficult to choose an appropriate threshold
 Postpruning: Remove branches from a “fully grown” tree—

get a sequence of progressively pruned trees



Use a set of data different from the training data to
decide which is the “best pruned tree” 105
Splitting Based on Continuous
Attributes
 Different ways of handling
 Discretization to form an ordinal categorical

attribute

Static – discretize once at the beginning

Dynamic – ranges can be found by equal interval
bucketing, equal frequency bucketing
(percentiles), or clustering.

 Binary Decision: (A < v) or (A  v)



consider all possible splits and finds the best cut

can be more compute intensive
106
Splitting Based on Continuous
Attributes
(i) Binary split:     Taxable Income > 80K?  {Yes, No}
(ii) Multi-way split: Taxable Income?  {< 10K, [10K,25K), [25K,50K), [50K,80K), > 80K}

107
Tree Replication
[Figure: a tree rooted at P with children Q and R, where the same subtree
rooted at S appears under more than one branch]
 Same subtree appears in multiple branches


108
Advantages/Disadvantages of
DT’s
• Advantages:
– Easy to understand (Doctors love them!)

– Easy to generate rules

• Disadvantages:
– May suffer from overfitting.

– Classifies by rectangular partitioning (so does not

handle correlated features very well).


– Can be quite large – pruning is necessary.

– Does not handle streaming data easily

109
Enhancements to Basic Decision Tree
Induction
 Allow for continuous-valued attributes
 Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete set of
intervals
 Handle missing attribute values
 Assign the most common value of the attribute
 Assign probability to each of the possible values
 Attribute construction
 Create new attributes based on existing ones that are
sparsely represented
 This reduces fragmentation, repetition, and replication
110
Classification in Large Databases
 Classification—a classical problem extensively studied by
statisticians and machine learning researchers
 Scalability: Classifying data sets with millions of examples and
hundreds of attributes with reasonable speed
 Why is decision tree induction popular?

relatively faster learning speed (than other classification
methods)

convertible to simple and easy to understand classification
rules

can use SQL queries for accessing databases

comparable classification accuracy with other methods
 RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)

Builds an AVC-list (attribute, value, class label)
111
Scalability Framework for
RainForest

 Separates the scalability aspects from the criteria that


determine the quality of the tree
 Builds an AVC-list: AVC (Attribute, Value, Class_label)
 AVC-set (of an attribute X )
 Projection of training dataset onto the attribute X and
class label where counts of individual class label are
aggregated
 AVC-group (of a node n )
 Set of AVC-sets of all predictor attributes at the node n

112
Rainforest: Training Set and Its
AVC Sets

Training Examples: the buys_computer table (age, income, student,
credit_rating, buys_computer) shown earlier.

AVC-set on Age                        AVC-set on income
Age     Buy_Computer: yes  no         income  Buy_Computer: yes  no
<=30                  2    3          high                  2    2
31..40                4    0          medium                4    2
>40                   3    2          low                   3    1

AVC-set on Student                    AVC-set on credit_rating
student Buy_Computer: yes  no         Credit rating  Buy_Computer: yes  no
yes                   6    1          fair                         6    2
no                    3    4          excellent                    3    3
113
BOAT (Bootstrapped Optimistic Algorithm
for Tree Construction)
 Use a statistical technique called bootstrapping to create
several smaller samples (subsets), each fits in memory
 Each subset is used to create a tree, resulting in several
trees
 These trees are examined and used to construct a new
tree T’
 It turns out that T’ is very close to the tree that would
be generated using the whole data set together
 Adv: requires only two scans of DB, an incremental alg.

114
Chapter 8. Classification: Basic
Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
115
Bayesian Classification:
Why?
 A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
 Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
116
Bayes’ Theorem: Basics
 Total probability theorem:  P(B) = Σ (i = 1 to M) P(B | Ai) P(Ai)

 Bayes' theorem:  P(H | X) = P(X | H) P(H) / P(X)
 Let X be a data sample (“evidence”): class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X), (i.e., posteriori probability): the
probability that the hypothesis holds given the observed data sample X
 P(H) (prior probability): the initial probability

E.g., X will buy computer, regardless of age, income, …
 P(X): probability that sample data is observed
 P(X|H) (likelihood): the probability of observing the sample X, given that
the hypothesis holds

E.g., Given that X will buy computer, the prob. that X is 31..40,
medium income
117
Bayes’ Theorem: Basics
 Let X be a data tuple.
 In Bayesian terms, X is considered “evidence.”
 As usual, it is described by measurements made on
a set of n attributes.
 Let H be some hypothesis such as that the data
tuple X belongs to a specified class C.
 For classification problems, we want to determine
P(H|X), the probability that the hypothesis H holds
given the “evidence” or observed data tuple X.

 In other words, we are looking for the probability


that tuple X belongs to class C, given that we know
the attribute description of X.
118
Bayes’ Theorem: Basics
 P(H|X) is the posterior probability, or a posteriori
probability, of H conditioned on X.

 For example, suppose our world of data tuples is


confined to customers described by the attributes
age and income, respectively, and that X is a 35-
year-old customer with an income of $40,000.

 Suppose H is the hypothesis that our customer will


buy a computer.
 Then P(H|X) reflects the probability that customer X
will buy a computer given that we know the
customer’s age and income.
119
Bayes’ Theorem: Basics
 In contrast, P(H) is the prior probability, or a
priori probability, of H.
 For our example, this is the probability that any
given customer will buy a computer, regardless of
age, income, or any other information, for that
matter.
 The posterior probability, P(H|X), is based on more
information (e.g., customer information) than the
prior probability, P(H), which is independent of X.

120
Bayes’ Theorem: Basics
 Similarly, P(X|H) is the posterior probability of X
conditioned on H.
 That is, it is the probability that a customer, X, is 35
years old and earns $40,000, given that we know
the customer will buy a computer.
 P(X) is the prior probability of X.
 Using our example, it is the probability that a person
from our set of customers is 35 years old and earns
$40,000.

121
Prediction Based on Bayes’
Theorem
 Given training data X, posteriori probability of a hypothesis H,
P(H|X), follows the Bayes’ theorem

P(H | X) P(X | H ) P( H ) P(X | H )P( H ) / P(X)


P(X)
 Informally, this can be viewed as
posteriori = likelihood x prior/evidence
 Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
 Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost
122
Classification Is to Derive the Maximum
Posteriori
 Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute vector
X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
 This can be derived from Bayes' theorem
     P(Ci | X) = P(X | Ci) P(Ci) / P(X)

 Since P(X) is constant for all classes, only
     P(X | Ci) P(Ci)
needs to be maximized

123
Bayes Classifier: Training Dataset

Class:  C1: buys_computer = 'yes'    C2: buys_computer = 'no'

Data to be classified:
X = (age <= 30, Income = medium, Student = yes, Credit_rating = Fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
124
Bayes Classifier: An Example
(using the buys_computer training table from the previous slide)

P(Ci):  P(buys_computer = "yes") = 9/14 = 0.643
        P(buys_computer = "no")  = 5/14 = 0.357

Compute P(X|Ci) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no")  = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no")  = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no")  = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no")  = 2/5 = 0.4

X = (age <= 30, income = medium, student = yes, credit_rating = fair)

P(X|Ci):  P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
          P(X | buys_computer = "no")  = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(X|Ci)*P(Ci):  P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.028
                P(X | buys_computer = "no")  * P(buys_computer = "no")  = 0.007

Therefore, X belongs to class ("buys_computer = yes")
125
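A minimal sketch that reproduces the computation above for X = (age <= 30, income = medium, student = yes, credit_rating = fair); plain relative frequencies, no smoothing, exactly as on the slide.

# Naive Bayes on the buys_computer table: P(Ci) * product of P(x_k | Ci),
# estimated by simple relative frequencies (no Laplace smoothing), as above.
rows = [
    ("<=30", "high", "no", "fair", "no"),       ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),       (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),      (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),   (">40", "medium", "no", "excellent", "no"),
]

X = ("<=30", "medium", "yes", "fair")   # tuple to classify

def score(cls):
    cls_rows = [r for r in rows if r[-1] == cls]
    prior = len(cls_rows) / len(rows)                        # P(Ci)
    likelihood = 1.0
    for k, value in enumerate(X):                            # product of P(x_k | Ci)
        likelihood *= sum(1 for r in cls_rows if r[k] == value) / len(cls_rows)
    return prior * likelihood

for cls in ("yes", "no"):
    print(cls, round(score(cls), 3))   # yes ~0.028, no ~0.007 -> predict "yes"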
Bayes Classifier: Comments
 Advantages
 Easy to implement

 Good results obtained in most of the cases

 Disadvantages
 Assumption: class conditional independence, therefore loss of

accuracy
 Practically, dependencies exist among variables


E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer,
diabetes, etc.

Dependencies among these cannot be modeled by Naïve
Bayes Classifier
 How to deal with these dependencies? Bayesian Belief Networks
(Chapter 9)
126
Chapter 8. Classification: Basic
Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
127
Using IF-THEN Rules for
Classification
 Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
 Rule antecedent/precondition vs. rule consequent

 Assessment of a rule: coverage and accuracy


 n_covers = # of tuples covered by R
 n_correct = # of tuples correctly classified by R
     coverage(R) = n_covers / |D|   (D: training data set)
     accuracy(R) = n_correct / n_covers
 If more than one rule is triggered, we need conflict resolution
 Size ordering: assign the highest priority to the triggering rules that has the

“toughest” requirement (i.e., with the most attribute tests)


 Class-based ordering: decreasing order of prevalence or misclassification

cost per class


 Rule-based ordering (decision list): rules are organized into one long

priority list, according to some measure of rule quality or by experts


128
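A sketch of the coverage and accuracy computation above for a single rule over a toy dataset; the rule R is the one from the slide, the tuples are made up.

# coverage(R) = n_covers / |D|,  accuracy(R) = n_correct / n_covers  (as above)
D = [  # (age, student, buys_computer) -- toy tuples, not from the slides
    ("youth", "yes", "yes"), ("youth", "yes", "yes"), ("youth", "yes", "no"),
    ("youth", "no", "no"), ("middle", "yes", "yes"), ("senior", "no", "no"),
]

def rule_antecedent(t):          # R: IF age = youth AND student = yes
    return t[0] == "youth" and t[1] == "yes"

covered = [t for t in D if rule_antecedent(t)]
correct = [t for t in covered if t[2] == "yes"]   # THEN buys_computer = yes
print("coverage:", len(covered) / len(D))         # 3/6 = 0.5
print("accuracy:", len(correct) / len(covered))   # 2/3 ~ 0.667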
Rule Extraction from a Decision
Tree
 Rules are easier to understand than large trees
 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction: the leaf holds
the class prediction
 Rules are mutually exclusive and exhaustive

Tree: age? {<=30 -> student?; 31..40 -> yes; >40 -> credit rating?}
      student? {no -> no; yes -> yes}
      credit rating? {excellent -> no; fair -> yes}
 Example: Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
129
Rule Induction: Sequential
Covering Method
 Sequential covering algorithm: Extracts rules directly from training
data
 Typical sequential covering algorithms: FOIL (First Order Inductive Learner),
AQ (Aquarius), CN2 (Concept Learning System 2), RIPPER (Repeated Incremental Pruning to
Produce Error Reduction)
 Rules are learned sequentially, each for a given class Ci will cover
many tuples of Ci but none (or few) of the tuples of other classes
 Steps:
 Rules are learned one at a time

 Each time a rule is learned, the tuples covered by the rules are

removed
 Repeat the process on the remaining tuples until the termination

condition, e.g., when no more training examples or when the


quality of a rule returned is below a user-specified threshold.
130
Chapter 8. Classification: Basic
Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
131
Model Evaluation and Selection
 Evaluation metrics: How can we measure accuracy? Other
metrics to consider?
 Use validation test set of class-labeled tuples instead of
training set when assessing accuracy
 Methods for estimating a classifier’s accuracy:
 Holdout method, random subsampling
 Cross-validation
 Bootstrap
 Comparing classifiers:
 Confidence intervals
 Cost-benefit analysis and ROC Curves
132
Classifier Evaluation Metrics:
Confusion Matrix
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)

Example of Confusion Matrix:


Actual class\Predicted buy_computer buy_computer Total
class = yes = no
buy_computer = yes 6954 46 7000
buy_computer = no 412 2588 3000
Total 7366 2634 10000

 Given m classes, an entry, CMi,j in a confusion matrix indicates


# of tuples in class i that were labeled by the classifier as class j
133
Accuracy, Error Rate,
Sensitivity and Specificity
A\P    C    ¬C
C      TP   FN    P
¬C     FP   TN    N
       P'   N'    All

 Classifier Accuracy, or recognition rate: percentage of test set tuples
that are correctly classified
     Accuracy = (TP + TN)/All
 Error rate: 1 - accuracy, or
     Error rate = (FP + FN)/All

 Class Imbalance Problem:
 One class may be rare, e.g. fraud, or HIV-positive
 Significant majority of the negative class and minority of the positive class
 Sensitivity: True Positive recognition rate = TP/P
 Specificity: True Negative recognition rate = TN/N
134
Precision and Recall, and F-
measures
 Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive

 Recall: completeness – what % of positive tuples did the


classifier label as positive?
 Perfect score is 1.0
 Inverse relationship between precision & recall

F measure (F1 or F-score): harmonic mean of precision and
recall,


Fß: weighted measure of precision and recall

assigns ß times as much weight to recall as to precision

135
Precision and Recall, and F-
measures

136
Classifier Evaluation Metrics:
Example

Actual Class\Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                   90             210           300     30.00 (sensitivity)
cancer = no                    140            9560          9700    98.56 (specificity)
Total                          230            9770          10000   96.50 (accuracy)

 Precision = 90/230 = 39.13%     Recall = 90/300 = 30.00%

137
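A quick sketch recomputing precision, recall (sensitivity), and specificity for the cancer example above, plus the F1 score from the harmonic-mean definition two slides back.

TP, FN, FP, TN = 90, 210, 140, 9560   # cancer example above

precision = TP / (TP + FP)            # 90/230   ~ 0.3913
recall    = TP / (TP + FN)            # 90/300   = 0.30   (= sensitivity)
specificity = TN / (TN + FP)          # 9560/9700 ~ 0.9856
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean ~ 0.3396

print(round(precision, 4), round(recall, 4), round(specificity, 4), round(f1, 4))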
Estimating Confidence Intervals: t-test

 If only 1 test set available: pairwise comparison


 For the ith round of 10-fold cross-validation, the same cross partitioning
is used to obtain err(M1)i and err(M2)i
 Average over the 10 rounds to get the mean difference of the error rates
 The t-test computes a t-statistic with k-1 degrees of freedom
 If two test sets are available: use a non-paired t-test,
where k1 & k2 are the # of cross-validation samples used for M1 & M2, resp.
138
Estimating Confidence Intervals:
Table for t-distribution

 Symmetric
 Significance level,
e.g., sig = 0.05 or
5% means M1 & M2
are significantly
different for 95% of
population
 Confidence limit, z
= sig/2

139
Estimating Confidence Intervals:
Statistical Significance
 Are M1 & M2 significantly different?

Compute t. Select significance level (e.g. sig = 5%)

Consult table for t-distribution: Find t value corresponding to
k-1 degrees of freedom (here, 9)
 t-distribution is symmetric: typically upper % points of
distribution shown → look up value for confidence limit
z=sig/2 (here, 0.025)
 If t > z or t < -z, then t value lies in rejection region:
 Reject null hypothesis that mean error rates of M1 & M2 are the same
 Conclude: statistically significant difference between M1 & M2

Otherwise, conclude that any difference is chance
140
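The same decision rule as a sketch, with the printed t-distribution table replaced by scipy.stats.t.ppf; SciPy is an assumption here, not something the slides require.

# Sketch: two-sided significance test for the t value computed above.
from scipy import stats

def significantly_different(t_value, k=10, sig=0.05):
    # Upper sig/2 point of the t-distribution with k - 1 degrees of freedom;
    # for k = 10 and sig = 0.05 this is roughly 2.262.
    z = stats.t.ppf(1 - sig / 2, df=k - 1)
    return abs(t_value) > z   # True: reject the null hypothesis of equal error rates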
Model Selection: ROC
Curves
 ROC (Receiver Operating Characteristics) curves: for visual
   comparison of classification models
 Originated from signal detection theory
 Shows the trade-off between the true positive rate and the false
   positive rate
 Vertical axis represents the true positive rate
 Horizontal axis represents the false positive rate
 The plot also shows a diagonal line
 The area under the ROC curve is a measure of the accuracy of the
   model
 Rank the test tuples in decreasing order: the one that is most
   likely to belong to the positive class appears at the top of the
   list
 The closer to the diagonal line (i.e., the closer the area is to
   0.5), the less accurate is the model
 A model with perfect accuracy will have an area of 1.0
141
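A small NumPy sketch of the construction just described: rank the test tuples by their estimated probability of being positive, trace out the (FPR, TPR) points, and approximate the area under the curve with the trapezoidal rule. NumPy and the toy labels/scores are assumptions for illustration; ties between scores are not handled.

# Sketch: ROC points and area under the curve from ranked test tuples.
import numpy as np

def roc_points(y_true, scores):
    order = np.argsort(-np.asarray(scores))     # most likely positive first
    y = np.asarray(y_true)[order]
    p, n = y.sum(), len(y) - y.sum()
    tpr = np.concatenate(([0.0], np.cumsum(y) / p))        # true positive rate
    fpr = np.concatenate(([0.0], np.cumsum(1 - y) / n))    # false positive rate
    return fpr, tpr

def auc(fpr, tpr):
    # Trapezoidal rule: 1.0 = perfect model, 0.5 = the diagonal (random guessing).
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

fpr, tpr = roc_points([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
print(auc(fpr, tpr))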
Issues Affecting Model Selection
 Accuracy
 classifier accuracy: predicting class label
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability
 understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
142
Chapter 8. Classification: Basic
Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
143
Ensemble Methods: Increasing the
Accuracy
 Ensemble methods
   Use a combination of models to increase accuracy
   Combine a series of k learned models, M1, M2, …, Mk, with the aim
     of creating an improved model M*
 Popular ensemble methods
   Bagging: averaging the prediction over a collection of classifiers
   Boosting: weighted vote with a collection of classifiers
   Ensemble: combining a set of heterogeneous classifiers
144
Bagging: Bootstrap Aggregation
 Analogy: Diagnosis based on multiple doctors’ majority vote
 Training
   Given a set D of d tuples, at each iteration i, a training set Di of d
     tuples is sampled with replacement from D (i.e., bootstrap)
   A classifier model Mi is learned for each training set Di
 Classification: classify an unknown sample X
   Each classifier Mi returns its class prediction
   The bagged classifier M* counts the votes and assigns the class with the
     most votes to X
 Prediction: can be applied to the prediction of continuous values by taking
   the average value of each prediction for a given test tuple
 Accuracy
   Often significantly better than a single classifier derived from D
   For noisy data: not considerably worse, more robust
   Proven to improve accuracy in prediction
145
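A compact sketch of the training and voting steps above. It assumes NumPy arrays for X and y, non-negative integer class labels, and scikit-learn decision trees as the base learners; sklearn.ensemble.BaggingClassifier packages the same idea.

# Sketch: bagging by hand - bootstrap samples, one tree per sample, majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, random_state=0):
    rng = np.random.default_rng(random_state)
    d = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, d, size=d)    # sample d tuples WITH replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.stack([m.predict(X) for m in models])   # k x n matrix of predictions
    # Majority vote per test tuple (labels assumed to be non-negative integers).
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)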
Boosting
 Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
 How does boosting work?
 Weights are assigned to each training tuple
 A series of k classifiers is iteratively learned

After a classifier Mi is learned, the weights are updated to allow
the subsequent classifier, Mi+1, to pay more attention to the
training tuples that were misclassified by Mi
 The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
 Boosting algorithm can be extended for numeric prediction
 Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
146
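A minimal AdaBoost-style sketch of the weight-update loop described above, assuming NumPy arrays, labels coded as -1/+1, and scikit-learn decision stumps as the weak learners. The exponential update shown here is one common formulation; implementations that instead scale correctly classified tuples by error/(1 - error) arrive at the same weight distribution after normalization.

# Sketch: boosting with iterative tuple reweighting (AdaBoost-style, labels in {-1, +1}).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, rounds=10):
    d = len(X)
    w = np.full(d, 1.0 / d)                  # start with equal tuple weights
    models, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y])           # weighted error of this round's classifier
        if err == 0 or err >= 0.5:
            break
        alpha = 0.5 * np.log((1 - err) / err)   # vote weight grows as error shrinks
        w *= np.exp(-alpha * y * pred)          # raise weights of misclassified tuples
        w /= w.sum()                            # renormalize to a distribution
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    agg = sum(a * m.predict(X) for a, m in zip(alphas, models))
    return np.sign(agg)                      # weighted vote of the individual classifiers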
Random Forest (Breiman 2001)
 Random Forest:
   Each classifier in the ensemble is a decision tree classifier and is
     generated using a random selection of attributes at each node to
     determine the split
   During classification, each tree votes and the most popular class is
     returned
 Two methods to construct a Random Forest:
   Forest-RI (random input selection): Randomly select, at each node, F
     attributes as candidates for the split at the node. The CART methodology
     is used to grow the trees to maximum size
   Forest-RC (random linear combinations): Creates new attributes (or
     features) that are a linear combination of the existing attributes
     (reduces the correlation between individual classifiers)
 Comparable in accuracy to AdaBoost, but more robust to errors and outliers
 Insensitive to the number of attributes selected for consideration at each
   split, and faster than bagging or boosting
147
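In scikit-learn terms (an assumption, not part of the slides), Forest-RI corresponds roughly to the sketch below: max_features controls how many attributes F are drawn as split candidates at each node, and bootstrap=True gives every tree its own bagged training set.

# Sketch: a Forest-RI style random forest via scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # illustrative dataset
forest = RandomForestClassifier(
    n_estimators=100,        # number of trees in the ensemble
    max_features="sqrt",     # F ~ sqrt(#attributes) candidate attributes per node
    bootstrap=True,          # each tree trains on a bootstrap sample of D
    random_state=0)
print(cross_val_score(forest, X, y, cv=10).mean())   # majority-vote accuracy estimate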
Chapter 8. Classification: Basic
Concepts

 Classification: Basic Concepts


 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
148
Summary (I)
 Classification is a form of data analysis that extracts models
describing important data classes.
 Effective and scalable methods have been developed for decision
tree induction, Naive Bayesian classification, rule-based
classification, and many other classification methods.
 Evaluation metrics include: accuracy, sensitivity, specificity,
precision, recall, F measure, and Fß measure.
 Stratified k-fold cross-validation is recommended for accuracy
estimation. Bagging and boosting can be used to increase overall
accuracy by learning and combining a series of individual models.

149
Summary (II)
 Significance tests and ROC curves are useful for model selection.
 There have been numerous comparisons of the different
classification methods; the matter remains a research topic
 No single method has been found to be superior over all others
for all data sets
 Issues such as accuracy, training time, robustness, scalability,
and interpretability must be considered and can involve trade-
offs, further complicating the quest for an overall superior
method

150
References (1)
 C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997
 C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press,
1995
 L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984
 C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data
Mining and Knowledge Discovery, 2(2): 121-168, 1998
 P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data
for scaling machine learning. KDD'95
 H. Cheng, X. Yan, J. Han, and C.-W. Hsu,
Discriminative Frequent Pattern Analysis for Effective Classification, ICDE'07
 H. Cheng, X. Yan, J. Han, and P. S. Yu,
Direct Discriminative Pattern Mining for Effective Classification, ICDE'08
 W. Cohen. Fast effective rule induction. ICML'95
 G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for
gene expression data. SIGMOD'05
151
References (2)
 A. J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall, 1990.
 G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and
differences. KDD'99.
 R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley, 2001
 U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI’94.
 Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and
an application to boosting. J. Computer and System Sciences, 1997.
 J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree
construction of large datasets. VLDB’98.
 J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision Tree
Construction. SIGMOD'99.
 T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer-Verlag, 2001.
 D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The
combination of knowledge and statistical data. Machine Learning, 1995.
 W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple
Class-Association Rules, ICDM'01.
152
References (3)
 T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity,
and training time of thirty-three old and new classification algorithms. Machine
Learning, 2000.
 J. Magidson. The Chaid approach to segmentation modeling: Chi-squared
automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of
Marketing Research, Blackwell Business, 1994.
 M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data
mining. EDBT'96.
 T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
 S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-
Disciplinary Survey, Data Mining and Knowledge Discovery 2(4): 345-389, 1998
 J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
 J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML’93.
 J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
 J. R. Quinlan. Bagging, boosting, and c4.5. AAAI'96.
153
References (4)
 R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and
pruning. VLDB’98.
 J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data
mining. VLDB’96.
 J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann,
1990.
 P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley,
2005.
 S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert
Systems. Morgan Kaufman, 1991.
 S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
 I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques, 2ed. Morgan Kaufmann, 2005.
 X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03
 H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical
clusters. KDD'03.
154