Data Mining: Concepts and Techniques (3rd ed.)
— Chapter 8 —
Classification: Basic Concepts
Data Collection:
The first step in building a classification model is data
collection.
Data Preprocessing:
The second step in building a classification model is data
preprocessing.
Handling Missing Values
Dealing with Outliers
Data Transformation
4
Classification
Feature Selection:
The third step in building a classification model is feature selection.
Correlation Analysis
Information Gain: Information gain is a measure of the amount of information
that a feature provides for classification. Features with high information gain
are selected for classification.
Principal Component Analysis (PCA)
Model Selection:
The fourth step in building a classification model is model selection.
Decision Trees
Support Vector Machines
Neural Networks (RNN or CNN)
5
Classification
Model Training:
The fifth step in building a classification model is model training.
Model training involves using the selected classification algorithm to learn the
patterns in the data.
The data is divided into a training set and a validation set.
The model is trained using the training set, and its performance is evaluated on
the validation set.
Model Evaluation:
The sixth step in building a classification model is model evaluation.
Model evaluation involves assessing the performance of the trained model on a
test set.
This is done to ensure that the model generalizes well to unseen data.
6
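As a concrete illustration of this training/validation split (not part of the slides), the sketch below uses scikit-learn's train_test_split and a decision tree; the dataset, the 80/20 proportion, and the model choice are assumptions made only for illustration.

```python
# Sketch: hold out a validation set before training, assuming scikit-learn is available.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 80% for training, 20% held out for validation (a typical, not mandatory, split).
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)       # model training
print("validation accuracy:", model.score(X_val, y_val))     # evaluation on held-out data
```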
Data Mining Techniques
7
Supervised vs. Unsupervised
Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in
the data
8
Prediction Problems:
Classification vs. Numeric
Prediction
Classification
predicts categorical class labels (discrete or nominal)
Numeric prediction
models continuous-valued functions, i.e., predicts unknown or missing values
9
Classification—A Two-Step
Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of test sample is compared with the classified
result from the model
Accuracy rate is the percentage of test set samples that are
correctly classified by the model
Test set is independent of training set (otherwise overfitting)
If the accuracy is acceptable, use the model to classify new data
Note: If the test set is used to select models, it is called validation (test) set
10
Process (1): Model Construction
[Figure: classification algorithms learn a classifier from the training data; the classifier is then applied to testing data and to unseen data, e.g., (Jeff, Professor, 4) → Tenured?]

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
12
Illustrating Classification Task
A learning algorithm induces a model from the training set; the model is then applied to the test set.

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
6    No       Medium   60K      No

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
15   No       Large    67K      ?
13
Examples of Classification Task
Predicting tumor cells as benign or malignant
14
The Classification Problem (informal definition)
Given a collection of annotated data (here, labeled Katydids and Grasshoppers), decide the class of a previously unseen example: Katydid or Grasshopper?
15
For any domain of interest, we can measure features.
[Figure: insect features: Abdomen Length, Thorax Length, Antennae Length, Mandible Size, Spiracle Diameter, Leg Length]
16
Our_Collection
We can store features in a database. The classification problem can now be expressed as: given a training database (Our_Collection), predict the class label of a previously unseen instance.

Insect ID  Abdomen Length  Antennae Length  Insect Class
1          2.7             5.5              Grasshopper
2          8.0             9.1              Katydid
3          0.9             4.7              Grasshopper
4          1.1             3.1              Grasshopper
5          5.4             8.5              Katydid
6          2.9             1.9              Grasshopper
7          6.1             6.6              Katydid
8          0.5             1.0              Grasshopper
9          8.3             6.6              Katydid
10         8.1             4.7              Katydid

[Figure: scatter plot of Antenna Length vs. Abdomen Length]
18
Grasshoppers vs. Katydids
[Figure: scatter plot of Antenna Length vs. Abdomen Length; each database row becomes a point, also called an instance or a tuple]
19
Problem 1
Each example is a pair of bar heights (left bar, right bar).
Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)
20
Problem 1: What class is this object?
Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)
Query objects: (8, 1.5): what class is this object? (4.5, 7): what about this one, A or B?
21
Problem 2: Oh! This one’s hard!
Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3)
Query object: (8, 1.5)
22
Problem 3
Examples of class A: (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B: (5, 6), (7, 5), (4, 8), (7, 7)
Query object: (6, 6): this one is really hard! What is this, A or B?
23
Why did we spend so much time with these problems?
Because we wanted to show that almost all classification problems have a geometric interpretation; check out the next 3 slides…
24
Problem 1
Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)
[Figure: the examples plotted as points in Left Bar vs. Right Bar space]
Here is the rule again: if the left bar is smaller than the right bar, it is an A; otherwise it is a B.
25
Problem 2
Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3)
[Figure: the examples plotted as points in Left Bar vs. Right Bar space]
Let me look it up… here it is: the rule is, if the two bars are equal sizes, it is an A; otherwise it is a B.
26
Problem 3
Examples of class A: (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B: (5, 6), (7, 5), (4, 8), (7, 7)
[Figure: the examples plotted as points in Left Bar vs. Right Bar space, with axes running from 10 to 100]
The rule again: if the square of the sum of the two bars is less than or equal to 100, it is an A; otherwise it is a B.
27
Grasshoppers vs. Katydids
[Figure: scatter plot of Antenna Length vs. Abdomen Length]
28
Previously unseen instance = (Insect ID 11, Abdomen Length 5.1, Antennae Length 7.0, Insect Class ???????)
• We can “project” the previously unseen instance into the same space as the database.
• We have now abstracted away the details of our particular problem. It will be much easier to talk about points in space.
[Figure: the unseen instance plotted among the Katydids and Grasshoppers in Antenna Length vs. Abdomen Length space]
29
Simple Linear Classifier (R. A. Fisher, 1890–1962)
If the previously unseen instance is above the line, then class is Katydid; else class is Grasshopper.
[Figure: a straight line separating Katydids from Grasshoppers in Antenna Length vs. Abdomen Length space]
30
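To make the "above the line" test concrete, here is a minimal sketch of such a linear classifier in Python. The line antennae = -1.0 × abdomen + 10.0 is hand-picked so that it happens to separate the ten training examples above; it is not the line from the figure, and in practice the coefficients would be fitted to the data (e.g., with Fisher's linear discriminant).

```python
# Minimal sketch of a simple linear classifier for the insect example.
# The boundary coefficients are hand-picked for illustration, not fitted.

def classify_insect(abdomen_length, antennae_length, slope=-1.0, intercept=10.0):
    """Return 'Katydid' if the point lies above the line, else 'Grasshopper'."""
    boundary = slope * abdomen_length + intercept
    return "Katydid" if antennae_length > boundary else "Grasshopper"

# Training database (Our_Collection): (abdomen length, antennae length, class)
our_collection = [
    (2.7, 5.5, "Grasshopper"), (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"), (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),     (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"),     (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"),     (8.1, 4.7, "Katydid"),
]

correct = sum(classify_insect(abd, ant) == label for abd, ant, label in our_collection)
print(correct, "of", len(our_collection), "training examples classified correctly")

# The previously unseen instance (abdomen = 5.1, antennae = 7.0) falls above this line.
print(classify_insect(5.1, 7.0))  # -> 'Katydid'
```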
Classification Accuracy
Confusion matrix (actual class in rows, predicted class in columns):

                         Predicted: Katydid (1)  Predicted: Grasshopper (0)
Actual: Katydid (1)      f11                     f10
Actual: Grasshopper (0)  f01                     f00

Accuracy = (number of correct predictions) / (total number of predictions)
         = (f11 + f00) / (f11 + f10 + f01 + f00)
31
Confusion Matrix
In a binary decision problem, a classifier labels examples as either positive or negative.
Classifiers produce a confusion (contingency) matrix, which shows four quantities: TP (true positives), TN (true negatives), FP (false positives), FN (false negatives).

Confusion Matrix:
                        Positive (+)  Negative (-)
Predicted positive (Y)  TP            FP
Predicted negative (N)  FN            TN
32
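As a small illustration of these four quantities (not from the slides), the sketch below tallies TP, TN, FP, FN and the accuracy from lists of actual and predicted binary labels; the example labels are invented.

```python
# Sketch: compute the confusion-matrix entries and accuracy for binary labels
# (1 = positive, 0 = negative). The example data below is illustrative only.

def confusion_counts(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn

actual    = [1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + tn + fp + fn)   # (f11 + f00) / total
print(tp, tn, fp, fn, accuracy)
```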
The simple linear classifier is defined for higher
dimensional spaces
33
We can visualize it as an n-dimensional hyperplane.
34
It is interesting to think about what would happen in this example if we did not have
the 3rd dimension…
35
Which of the “Problems” can be solved by the Simple Linear Classifier?
1) Perfect
2) Useless
3) Pretty Good
[Figure: linear decision boundaries drawn on Problems 1, 2, and 3]
Problems that can be solved by a linear classifier are called linearly separable.
36
A Famous Problem
R. A. Fisher’s Iris Dataset.
• 3 classes (Setosa, Versicolor, Virginica)
• 50 of each class
The task is to classify Iris plants into one of 3 varieties using the Petal Length and Petal Width.
[Figure: linear decision boundaries in petal width vs. petal length space]
If petal width > 3.272 – (0.325 * petal length) then class = Virginica
Elseif petal width…
38
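A minimal sketch of the first branch of that rule applied to the standard Iris measurements, assuming scikit-learn is available to load the data; only the Virginica test shown on the slide is coded, and the elided Elseif branch is left out.

```python
# Sketch: apply the slide's linear rule for Virginica to Fisher's Iris data.
# Assumes scikit-learn is installed; only the branch shown on the slide is coded.
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]   # cm
petal_width = iris.data[:, 3]    # cm

for pl, pw in zip(petal_length[:5], petal_width[:5]):
    if pw > 3.272 - 0.325 * pl:
        label = "Virginica"
    else:
        label = "not decided by this branch"  # the slide's Elseif is elided
    print(f"petal_length={pl}, petal_width={pw} -> {label}")
```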
We have now seen one classification algorithm,
and we are about to see more. How should we
compare them?
• Predictive accuracy
• Speed and scalability
– time to construct the model
– time to use the model
– efficiency in disk-resident databases
• Robustness
– handling noise, missing values and irrelevant features, streaming data
• Interpretability:
– understanding and insight provided by the model
39
Chapter 8. Classification: Basic
Concepts
Classification: Basic Concepts
Decision Tree Induction
Bayes Classification Methods
Rule-Based Classification
Model Evaluation and Selection
Techniques to Improve Classification Accuracy:
Ensemble Methods
Summary
40
Another Example of Decision Tree
Attributes: Refund (categorical), Marital Status (categorical), Taxable Income (continuous); class attribute: Cheat.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Tree: MarSt = Married → NO; MarSt = Single, Divorced → Refund; Refund = Yes → NO; Refund = No → TaxInc; TaxInc < 80K → NO; TaxInc > 80K → YES.
There could be more than one tree that fits the same data!
41
Decision Tree Induction: An Example
Training data set: Buys_computer
The data set follows an example of Quinlan’s ID3 (Playing Tennis)

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

Resulting tree: age? splits into <=30, 31..40, and >40; the <=30 branch tests student? (no → no, yes → yes), the 31..40 branch is a leaf (yes), and the >40 branch tests credit_rating? (excellent → no, fair → yes).
42
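For readers who want to reproduce a tree like this, here is a rough sketch using scikit-learn's DecisionTreeClassifier with the entropy criterion on the one-hot-encoded Buys_computer table. Note that scikit-learn grows binary (CART-style) trees, so the result is analogous to, not identical with, the ID3 tree above; pandas and scikit-learn are assumed to be installed.

```python
# Sketch: induce a decision tree from the Buys_computer data with scikit-learn.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rows = [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
df = pd.DataFrame(rows, columns=["age", "income", "student", "credit_rating", "buys_computer"])

X = pd.get_dummies(df.drop(columns="buys_computer"))  # one-hot encode categorical attributes
y = df["buys_computer"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # text rendering of the learned tree
```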
Example of a Decision Tree
[Figure: the Tid / Refund / Marital Status / Taxable Income / Cheat training data (categorical, categorical, continuous, class) shown next to an induced decision tree, with the splitting attributes highlighted]
43
Decision Tree Classification Task
A decision tree is induced from the training set and then applied as the model to the test set.

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
15   No       Large    67K      ?
44
Apply Model to Test Data
Start from the root of the tree and follow the branch that matches the test record at each node.

Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Tree: Refund = Yes → NO; Refund = No → MarSt; MarSt = Single, Divorced → TaxInc (< 80K → NO, > 80K → YES); MarSt = Married → NO.

Walking the tree for this record: Refund = No → MarSt; MarSt = Married → NO. Assign Cheat to “No”.
45–50
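The tree walk above is just nested conditionals. Here is a minimal sketch of it in code, with the tree written by hand to mirror the figure (not learned from data); the slide leaves the boundary case of exactly 80K unspecified, so this sketch sends ties to the "> 80K" branch.

```python
# Sketch: the Refund / MarSt / TaxInc tree from the slide as nested conditionals.

def predict_cheat(refund, marital_status, taxable_income):
    if refund == "Yes":
        return "No"
    # Refund == "No": split on marital status
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income (ties go to the "> 80K" branch here)
    return "No" if taxable_income < 80_000 else "Yes"

# Test record from the slide: Refund = No, Married, 80K  ->  Cheat = "No"
print(predict_cheat("No", "Married", 80_000))
```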
Decision Tree Terminology
51
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
53
Decision Tree Induction
Many Algorithms:
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5 (Ross Quinlan)
SLIQ, SPRINT

[Figure: decision tree for the insect data. Abdomen Length > 7.1? yes → Katydid; no → Antenna Length > 6.0? (yes → Katydid, no → Grasshopper)]
55
[Figure: a taxonomic decision tree for insects using questions such as “Antennae shorter than body?” and “3 Tarsi?”, with leaves including Grasshopper and Cricket]
57
Decision Tree Classification
Decision tree generation consists of two phases
Tree construction
At start, all the training examples are at the root
Partition examples recursively based on selected
attributes
Tree pruning
Identify and remove branches that reflect noise or
outliers
Use of decision tree: Classifying an unknown sample
Test the attribute values of the sample against the decision
tree
58
Decision Tree Representation
[Figure: a decision tree whose leaves are labeled yes / no]
59
How to Construct Decision Tree
The tree is built in a top-down, recursive, divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes.
Main loop: select the best splitting attribute for the current node, partition the examples by its values, and recurse on each partition until a stopping condition is met.
61
Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
62
How to Split Records
Random Split
The tree can grow huge
These trees are hard to understand.
Larger trees are typically less accurate than smaller trees.
Principled Criterion
Selection of an attribute to test at each node - choosing the most useful
attribute for classifying examples.
How?
Information gain
measures how well a given attribute separates the training examples
according to their target classification
This measure is used to select among the candidate attributes at
each step while growing the tree
63
Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
64
How to determine the Best Split
Before splitting: 10 records of class 0 and 10 records of class 1.
[Figure: candidate splits on attributes such as Gender, Own Car?, Car Type?, and Student ID?, each partitioning the 20 records]
65
How to determine the Best Split
Greedy approach: nodes with a homogeneous class distribution are preferred, so we need a measure of node impurity.
Example: a node with C0: 5, C1: 5 is non-homogeneous (high degree of impurity); a node with C0: 9, C1: 1 is homogeneous (low degree of impurity).
66
Picking a Good Split Feature
Goal is to have the resulting tree be as small as possible, per
Occam’s razor.
Finding a minimal decision tree (nodes, leaves, or depth) is an NP-
hard optimization problem.
Top-down divide-and-conquer method does a greedy search for a
simple tree but does not guarantee to find the smallest.
General lesson in Machine Learning and Data Mining: “Greed
is good.”
Want to pick a feature that creates subsets of examples that are
relatively “pure” in a single class so they are “closer” to being leaf
nodes.
There are a variety of heuristics for picking a good test, a popular
one is based on information gain that originated with the ID3
system of Quinlan (1979).
67
Information Theory
Think of playing "20 questions": I am thinking of an integer
between 1 and 1,000 -- what is it? What is the first
question you would ask?
What question will you ask?
Why?
Entropy(S) = - Σ_{i=1..c} p_i log2(p_i)
70
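A small sketch of this formula in code (not from the slides); it computes the entropy of a node from its per-class counts and reproduces the extreme cases shown in the binary-classification plot on the next slide.

```python
# Sketch: entropy of a class distribution, Entropy(S) = -sum_i p_i * log2(p_i).
import math

def entropy(class_counts):
    """Entropy (in bits) of a node with the given per-class counts."""
    total = sum(class_counts)
    ps = [c / total for c in class_counts if c > 0]  # skip empty classes (0 * log 0 = 0)
    return -sum(p * math.log2(p) for p in ps)

print(entropy([5, 5]))   # maximally impure two-class node -> 1.0 bit
print(entropy([9, 1]))   # mostly pure node -> about 0.469 bits
print(entropy([10, 0]))  # pure node -> 0.0 bits
```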
Entropy Plot for Binary
Classification
m=2
71
Brief Review of Entropy
Entropy of a 2-class
problem with regard to
the portion of one of the
two groups
72
Information Gain
73
Information Gain in Decision Tree Induction
Assume that using attribute A, a current set will be partitioned into some number of child sets.
The information gained by branching on A is the entropy of the current set minus the weighted average entropy of the child sets.
74
Example for Computing Entropy
Entropy(t) = - Σ_j p(j|t) log2 p(j|t)
76
Splitting Based on INFO
Information Gain:
GAIN_split = Entropy(p) - Σ_{i=1..k} (n_i / n) × Entropy(i)
where parent node p is split into k partitions and n_i is the number of records in partition i.
For a continuous attribute A, define a boolean test A_c that is true when A clears a threshold c and false otherwise. How to choose c?
78
Person   Hair Length   Weight   Age   Class
Homer 0” 250 36 M
Marge 10” 150 34 F
Bart 2” 90 10 M
Lisa 6” 78 8 F
Maggie 4” 20 1 F
Abe 1” 170 70 M
Selma 8” 160 41 F
Otto 10” 180 38 M
Krusty 6” 200 45 M
Comic 8” 290 38 ?
79
Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Let us try splitting on Hair Length:
Entropy(1F, 3M) = -(1/4) log2(1/4) - (3/4) log2(3/4) = 0.8113
Entropy(3F, 2M) = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.9710

Let us try splitting on Weight:
Entropy(4F, 1M) = -(4/5) log2(4/5) - (1/5) log2(1/5) = 0.7219
Entropy(0F, 4M) = -(0/4) log2(0/4) - (4/4) log2(4/4) = 0

Let us try splitting on Age:
Entropy(3F, 3M) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1
Entropy(1F, 2M) = -(1/3) log2(1/3) - (2/3) log2(2/3) = 0.9183
83
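The worked entropies above are easy to check in code. The sketch below recomputes them from the (female, male) counts in each partition and also computes the information gain of each candidate split relative to the parent node's 4F/5M entropy, which is what makes Weight the preferred root test; the gain values are computed here rather than quoted from the slides.

```python
# Sketch: recompute the entropies from the slide and the information gain of each split.
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

parent = entropy([4, 5])  # 4 females, 5 males in the labeled table above (~0.9911 bits)

# (female, male) counts in the two partitions produced by each candidate split
splits = {
    "Hair Length": [(1, 3), (3, 2)],
    "Weight":      [(4, 1), (0, 4)],
    "Age":         [(3, 3), (1, 2)],
}

for attr, parts in splits.items():
    n = sum(sum(p) for p in parts)
    children = sum((sum(p) / n) * entropy(p) for p in parts)
    print(f"{attr}: child entropies {[round(entropy(p), 4) for p in parts]}, "
          f"gain = {parent - children:.4f}")
# Weight gives the largest gain (~0.59), so it is chosen as the root test.
```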
We don’t need to keep the data around, just the test conditions.
[Figure: a one-level tree that splits on Weight <= 160?, with one branch labeled Male and the other Female]
84
It is trivial to convert Decision Trees to rules…
[Figure: Weight <= 160? no → Male; yes → Hair Length <= 2? (yes → Male, no → Female)]

Rules to Classify Males/Females:
If Weight greater than 160, classify as Male
Elseif Hair Length less than or equal to 2, classify as Male
Else classify as Female
85
Once we have learned the decision tree, we don’t even need a computer!
Decision tree for a typical shared-care setting applying the system for the
diagnosis of prostatic obstructions.
86
The worked examples we have seen were performed on small datasets. However, with small datasets there is a great danger of overfitting the data…
[Figure: a tiny dataset of just two labeled examples, one Female and one Male]
For example, the rule “Wears green?” perfectly classifies the data; so does “Mother’s name is Jacqueline?”; so does “Has blue shoes”…
87
Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
Expected information (entropy) needed to classify a tuple in D:
Info(D) = - Σ_{i=1..m} p_i log2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)
Information gained by branching on attribute A: Gain(A) = Info(D) - Info_A(D)
Gini example: for a node with a 5/5 class split, Gini = 1 - [(5/10)^2 + (5/10)^2] = 1 - [¼ + ¼] = ½; for a pure 10/0 node, Gini = 1 - [(10/10)^2 + (0/10)^2] = 1 - [1 + 0] = 0.
So 0 is the best (purest) Gini value.
94
Example of Computing GINI
GINI(t) = 1 - Σ_j [p(j|t)]^2
95
Gini Index (CART, IBM IntelligentMiner)
If a data set D contains examples from n classes, the gini index gini(D) is defined as
gini(D) = 1 - Σ_{j=1..n} p_j^2
where p_j is the relative frequency of class j in D.
If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)
Reduction in impurity:
Δgini(A) = gini(D) - gini_A(D)
The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute).
96
Gini Index (CART, IBM IntelligentMiner)
Training data: the 14-tuple Buys_computer data set (age, income, student, credit_rating → buys_computer) shown earlier.
97
Computation of Gini Index
Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”:
gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:
gini_income∈{low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
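A quick sketch that carries this calculation through in code. The per-class counts (7 yes / 3 no for income in {low, medium}, 2 yes / 2 no for income = high) are read off the Buys_computer table, and the final number (about 0.443) is computed here rather than quoted from the slide.

```python
# Sketch: Gini index for the Buys_computer example.
def gini(counts):
    """gini = 1 - sum_j p_j^2 for the given per-class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([9, 5]))                       # gini(D) ~ 0.459

# income in {low, medium}: 10 tuples (7 yes, 3 no); income = high: 4 tuples (2 yes, 2 no)
d1, d2 = [7, 3], [2, 2]
n = sum(d1) + sum(d2)
gini_income = (sum(d1) / n) * gini(d1) + (sum(d2) / n) * gini(d2)
print(gini_income)                        # ~0.443 for this binary split on income
```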
Error(t) = 1 - max_i P(i|t)
100
Examples of Computing Error
Error(t) = 1 - max_i P(i|t)
101
Comparison among Splitting
Criteria
For a 2-class problem:
102
Discussion
Error rate is often the metric used to evaluate a classifier (but not always).
So it seems reasonable to use error rate as the splitting metric as well; in practice, however, it is usually not the best choice.
The reason is related to the fact that decision trees use a greedy strategy, so we need to use a splitting metric that leads to globally better results.
The other metrics (e.g., information gain and Gini) will empirically outperform error rate, although there is no proof of this.
103
Other Attribute Selection
Measures
CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
C-SEP: performs better than info. gain and gini index in certain cases
G-statistic: has a close approximation to χ2 distribution
MDL (Minimal Description Length) principle (i.e., the simplest solution is
preferred):
The best tree is the one that requires the fewest # of bits to both (1)
encode the tree, and (2) encode the exceptions to the tree
Multivariate splits (partition based on multiple variable combinations)
CART: finds multivariate splits based on a linear comb. of attrs.
Which attribute selection measure is the best?
Most give good results, but none is significantly superior to the others
104
Overfitting and Tree Pruning
Overfitting: An induced tree may overfit the training data
Too many branches, some may reflect anomalies due to noise or outliers
Poor accuracy for unseen samples

Discretizing a continuous attribute:
Static – discretize once at the beginning
Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
107
Tree Replication
[Figure: a decision tree in which the same subtree appears on more than one branch of the root]
• Disadvantages:
  – May suffer from overfitting.
109
Enhancements to Basic Decision Tree
Induction
Allow for continuous-valued attributes
Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete set of
intervals
Handle missing attribute values
Assign the most common value of the attribute
Assign probability to each of the possible values
Attribute construction
Create new attributes based on existing ones that are
sparsely represented
This reduces fragmentation, repetition, and replication
110
Classification in Large Databases
Classification—a classical problem extensively studied by
statisticians and machine learning researchers
Scalability: Classifying data sets with millions of examples and
hundreds of attributes with reasonable speed
Why is decision tree induction popular?
relatively faster learning speed (than other classification
methods)
convertible to simple and easy to understand classification
rules
can use SQL queries for accessing databases
comparable classification accuracy with other methods
RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
Builds an AVC-list (attribute, value, class label)
111
Scalability Framework for
RainForest
112
Rainforest: Training Set and Its
AVC Sets
114
Chapter 8. Classification: Basic
Concepts
Classification: Basic Concepts
Decision Tree Induction
Bayes Classification Methods
Rule-Based Classification
Model Evaluation and Selection
Techniques to Improve Classification Accuracy:
Ensemble Methods
Summary
115
Bayesian Classification:
Why?
A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
Foundation: Based on Bayes’ Theorem.
Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
116
Bayes’ Theorem: Basics
Total Probability Theorem: P(B) = Σ_{i=1..M} P(B|A_i) P(A_i)
120
Bayes’ Theorem: Basics
Similarly, P(X|H) is the posterior probability of X
conditioned on H.
That is, it is the probability that a customer, X, is 35
years old and earns $40,000, given that we know
the customer will buy a computer.
P(X) is the prior probability of X.
Using our example, it is the probability that a person
from our set of customers is 35 years old and earns
$40,000.
121
Prediction Based on Bayes’ Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes’ theorem:
P(H|X) = P(X|H) P(H) / P(X)
123
Bayes Classifier: Training Dataset
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’
Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Training data: the 14-tuple Buys_computer data set shown earlier.
124
Bayes Classifier: An Example
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357
The class-conditional probabilities P(xk | Ci) for each attribute value of X are then estimated from the same table, and X is assigned to the class Ci with the largest P(X|Ci)P(Ci).
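To make the omitted arithmetic concrete, here is a sketch of the full naïve Bayes calculation for X = (age <= 30, income = medium, student = yes, credit_rating = fair), counting the conditional probabilities directly from the 14-tuple table; the conclusion (buys_computer = "yes" wins) follows from those counts.

```python
# Sketch: naive Bayes estimate for X on the Buys_computer data.
# Counts are taken from the 14-tuple training table shown earlier.
rows = [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")  # the tuple to classify

for cls in ("yes", "no"):
    in_class = [r for r in rows if r[4] == cls]
    prior = len(in_class) / len(rows)                 # P(Ci)
    likelihood = 1.0
    for k, value in enumerate(x):                     # naive independence assumption
        likelihood *= sum(1 for r in in_class if r[k] == value) / len(in_class)
    print(cls, prior * likelihood)  # P(X|Ci)*P(Ci); 'yes' (~0.028) beats 'no' (~0.007)
```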
Disadvantages
Assumption: class conditional independence, therefore loss of
accuracy
Practically, dependencies exist among variables
E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer,
diabetes, etc.
Dependencies among these cannot be modeled by Naïve
Bayes Classifier
How to deal with these dependencies? Bayesian Belief Networks
(Chapter 9)
126
Chapter 8. Classification: Basic
Concepts
Classification: Basic Concepts
Decision Tree Induction
Bayes Classification Methods
Rule-Based Classification
Model Evaluation and Selection
Techniques to Improve Classification Accuracy:
Ensemble Methods
Summary
127
Using IF-THEN Rules for
Classification
Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
Rule antecedent/precondition vs. rule consequent
Each time a rule is learned, the tuples covered by the rule are removed
Repeat the process on the remaining tuples until a termination condition is met
Fß: weighted measure of precision and recall
assigns ß times as much weight to recall as to precision
135
Precision and Recall, and F-
measures
136
Classifier Evaluation Metrics:
Example
137
Estimating Confidence Intervals: t-test
Symmetric
Significance level,
e.g., sig = 0.05 or
5% means M1 & M2
are significantly
different for 95% of
population
Confidence limit, z
= sig/2
139
Estimating Confidence Intervals:
Statistical Significance
Are M1 & M2 significantly different?
Compute t. Select significance level (e.g. sig = 5%)
Consult table for t-distribution: Find t value corresponding to
k-1 degrees of freedom (here, 9)
t-distribution is symmetric: typically upper % points of
distribution shown → look up value for confidence limit
z=sig/2 (here, 0.025)
If t > z or t < -z, then the t value lies in the rejection region:
Reject the null hypothesis that the mean error rates of M1 and M2 are the same
Conclude: statistically significant difference between M1 and M2
Otherwise, conclude that any difference is chance
140
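For readers who want to run this comparison in practice, here is a sketch using SciPy's paired t-test on per-fold error rates from 10-fold cross-validation; the error values below are invented for illustration.

```python
# Sketch: paired t-test on per-fold error rates of two models (k = 10 folds).
# The error rates below are made-up illustration values.
from scipy import stats

errors_m1 = [0.12, 0.10, 0.14, 0.11, 0.13, 0.12, 0.15, 0.11, 0.12, 0.13]
errors_m2 = [0.15, 0.14, 0.16, 0.15, 0.14, 0.16, 0.17, 0.15, 0.14, 0.16]

t_stat, p_value = stats.ttest_rel(errors_m1, errors_m2)  # k - 1 = 9 degrees of freedom
print(t_stat, p_value)
# If p_value < sig (e.g., 0.05), reject the hypothesis that the mean error rates are the same.
```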
Model Selection: ROC Curves
ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
Originated from signal detection theory
Shows the trade-off between the true positive rate and the false positive rate
The area under the ROC curve is a measure of the accuracy of the model
Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
The closer to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model
The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate
The plot also shows a diagonal line
A model with perfect accuracy will have an area of 1.0
141
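A minimal sketch of how the curve and its area are typically obtained, assuming scikit-learn is available; the labels and scores are invented for illustration.

```python
# Sketch: ROC curve points and area under the curve with scikit-learn.
# Labels and scores below are illustrative only.
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]                        # actual classes
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]  # test tuples ranked by score

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # false/true positive rates per threshold
print(list(zip(fpr, tpr)))
print(roc_auc_score(y_true, y_score))              # area under the curve (0.5 = random, 1.0 = perfect)
```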
Issues Affecting Model Selection
Accuracy
classifier accuracy: predicting class label
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
142
Chapter 8. Classification: Basic
Concepts
Classification: Basic Concepts
Decision Tree Induction
Bayes Classification Methods
Rule-Based Classification
Model Evaluation and Selection
Techniques to Improve Classification Accuracy:
Ensemble Methods
Summary
143
Ensemble Methods: Increasing the
Accuracy
Ensemble methods
Use a combination of models to increase accuracy
classifiers
Boosting: weighted vote with a collection of classifiers
144
Bagging: Bootstrap Aggregation
Analogy: Diagnosis based on multiple doctors’ majority vote
Training
Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., bootstrap)
A classifier model Mi is learned for each training set Di
Classification: classify an unknown sample X
Each classifier Mi returns its class prediction
The bagged classifier M* counts the votes and assigns the class with the most votes to X
Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple
Accuracy
Often significantly better than a single classifier derived from D
Random Forest: during classification, each tree votes and the most popular class is returned
Two Methods to construct Random Forest:
Forest-RI (random input selection): Randomly select, at each node, F attributes as candidates for the split at the node. The CART methodology is used to grow the trees to maximum size
Forest-RC (random linear combinations): Creates new attributes (or features) that are a linear combination of the existing attributes
149
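As a practical illustration (not part of the slides), a bagged ensemble of decision trees and a random forest can be compared against a single tree in a few lines with scikit-learn; the synthetic dataset is just a stand-in.

```python
# Sketch: bagging of decision trees and a random forest with scikit-learn
# on a synthetic dataset (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(n_estimators=25, random_state=0)        # default base learner is a decision tree
forest = RandomForestClassifier(n_estimators=25, random_state=0)   # bagging + random attribute selection

for name, model in [("single tree", single_tree), ("bagging", bagged), ("random forest", forest)]:
    scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validated accuracy
    print(name, scores.mean())
```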
Summary (II)
Significance tests and ROC curves are useful for model selection.
There have been numerous comparisons of the different
classification methods; the matter remains a research topic
No single method has been found to be superior over all others
for all data sets
Issues such as accuracy, training time, robustness, scalability,
and interpretability must be considered and can involve trade-
offs, further complicating the quest for an overall superior
method
150
References (1)
C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997
C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press,
1995
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984
C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data
Mining and Knowledge Discovery, 2(2): 121-168, 1998
P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data
for scaling machine learning. KDD'95
H. Cheng, X. Yan, J. Han, and C.-W. Hsu,
Discriminative Frequent Pattern Analysis for Effective Classification, ICDE'07
H. Cheng, X. Yan, J. Han, and P. S. Yu,
Direct Discriminative Pattern Mining for Effective Classification, ICDE'08
W. Cohen. Fast effective rule induction. ICML'95
G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for
gene expression data. SIGMOD'05
151
References (2)
A. J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall, 1990.
G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and
differences. KDD'99.
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley, 2001
U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI’94.
Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and
an application to boosting. J. Computer and System Sciences, 1997.
J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree
construction of large datasets. VLDB’98.
J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision Tree
Construction. SIGMOD'99.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer-Verlag, 2001.
D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The
combination of knowledge and statistical data. Machine Learning, 1995.
W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple
Class-Association Rules, ICDM'01.
152
References (3)
T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity,
and training time of thirty-three old and new classification algorithms. Machine
Learning, 2000.
J. Magidson. The Chaid approach to segmentation modeling: Chi-squared
automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of
Marketing Research, Blackwell Business, 1994.
M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data
mining. EDBT'96.
T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-
Disciplinary Survey, Data Mining and Knowledge Discovery 2(4): 345-389, 1998
J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML’93.
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
J. R. Quinlan. Bagging, boosting, and c4.5. AAAI'96.
153
References (4)
R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and
pruning. VLDB’98.
J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data
mining. VLDB’96.
J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann,
1990.
P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley,
2005.
S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert
Systems. Morgan Kaufmann, 1991.
S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques, 2ed. Morgan Kaufmann, 2005.
X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03
H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical
clusters. KDD'03.
154