08 Class Basic
— Chapter 8 —
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
Note: If the test set is used to select models, it is called a validation (test) set
Process (1): Model Construction
[Figure: training data is fed to classification algorithms, which learn a classifier (the model)]

Process (2): Using the Model
[Figure: the classifier is applied to testing data and then to unseen data, e.g., (Jeff, Professor, 4) -> Tenured?]

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Chapter 8. Classification: Basic Concepts
[Decision tree induction example: the tree learned from the training data below classifies buys_computer; its leaves are labeled no / yes]
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected
attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain); a minimal sketch of the procedure follows
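As a rough illustration only (not taken from the slides), here is a minimal Python sketch of this greedy top-down induction, using information gain (defined on the following slides) as the selection heuristic; the data layout (a list of dicts with categorical attributes) is an assumption:

import math
from collections import Counter

def entropy(rows, target):
    # Info(D): expected bits needed to classify a tuple drawn from rows
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(rows, attr, target):
    # Gain(attr) = Info(D) - weighted Info of the partitions induced by attr
    n = len(rows)
    info_a = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        info_a += len(subset) / n * entropy(subset, target)
    return entropy(rows, target) - info_a

def build_tree(rows, attributes, target):
    # Greedy top-down recursion: pick the best attribute, partition, recurse
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attributes:        # stopping conditions
        return Counter(labels).most_common(1)[0][0]    # majority-class leaf
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    branches = {}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        rest = [a for a in attributes if a != best]
        branches[value] = build_tree(subset, rest, target)
    return {best: branches}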
Brief Review of Entropy
[Plot: entropy of a two-class (m = 2) distribution as a function of p1; it peaks at 1 bit when the two classes are equally likely]
Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
Expected information (entropy) needed to classify a tuple in D:
Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
Information gained by branching on attribute A:
Gain(A) = Info(D) - Info_A(D)
Example (attribute selection on the buys_computer training data):
Class P: buys_computer = 'yes' (9 tuples); class N: buys_computer = 'no' (5 tuples)
Info(D) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940
Gain(age) = Info(D) - Info_{age}(D) = 0.246
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit\_rating) = 0.048
Age has the highest information gain, so it is selected as the splitting attribute at the root.
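A quick numeric check of these figures (plain Python; the class counts and the age partition counts are read off the 14-tuple table above):

import math

def info(*counts):
    # I(c1, c2, ...): entropy of a class distribution given raw class counts
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

info_d = info(9, 5)                       # 9 'yes' vs. 5 'no' tuples -> ~0.940 bits

# Partition by age: <=30 -> (2 yes, 3 no), 31..40 -> (4, 0), >40 -> (3, 2)
info_age = 5/14 * info(2, 3) + 4/14 * info(4, 0) + 5/14 * info(3, 2)   # ~0.694
gain_age = info_d - info_age              # largest of the four gains

print(info_d, gain_age)                   # roughly 0.940 and 0.247, i.e. the 0.246 above up to rounding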
Computing Information-Gain for
Continuous-Valued Attributes
Let attribute A be a continuous-valued attribute
Must determine the best split point for A
Sort the values of A in increasing order
Typically, the midpoint between each pair of adjacent values
is considered as a possible split point
(ai+ai+1)/2 is the midpoint between the values of ai and ai+1
The point with the minimum expected information
requirement for A is selected as the split-point for A
Split:
D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point
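A small sketch of this midpoint search (illustrative only; assumes numeric attribute values paired with class labels):

import math
from collections import Counter

def info(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    # Try the midpoint of each adjacent pair of sorted values and keep the one
    # that minimizes the expected information Info_A(D) of the binary split.
    pairs = sorted(zip(values, labels))
    best, best_info = None, float('inf')
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                    # no midpoint between equal values
        split = (pairs[i][0] + pairs[i + 1][0]) / 2     # (a_i + a_{i+1}) / 2
        left = [lab for v, lab in pairs if v <= split]  # D1: A <= split-point
        right = [lab for v, lab in pairs if v > split]  # D2: A > split-point
        expected = (len(left) * info(left) + len(right) * info(right)) / len(pairs)
        if expected < best_info:
            best, best_info = split, expected
    return best

# e.g. best_split_point([25, 32, 45, 51, 38], ['no', 'yes', 'yes', 'no', 'yes'])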
Gain Ratio for Attribute Selection (C4.5)
Information gain measure is biased towards attributes with a
large number of values
C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)
GainRatio(A) = Gain(A)/SplitInfo(A)
Ex. SplitInfo_{income}(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557, so GainRatio(income) = 0.029 / 1.557 = 0.019; the attribute with the maximum gain ratio is selected as the splitting attribute
Overfitting: an induced tree may overfit the training data; too many branches may reflect anomalies due to noise or outliers, giving poor accuracy for unseen samples
Repetition and Replication Problems in Decision
Trees
Chapter 8. Classification: Basic Concepts
Prediction Based on Bayes’ Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}
Naïve Bayes Classifier
A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between attributes):
P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \dots \times P(x_n|C_i)
This greatly reduces the computation cost: only counts the class distribution
If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk
for Ak divided by |Ci, D| (# of tuples of Ci in D)
If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
and P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})
Naïve Bayes Classifier: Training Dataset
Class: C1: buys_computer = 'yes', C2: buys_computer = 'no'
Data to be classified: X = (age <= 30, income = medium, student = yes, credit_rating = fair)
(The training data is the same 14-tuple age / income / student / credit_rating / buys_computer table shown earlier.)
Naïve Bayes Classifier: An Example
P(Ci): P(buys_computer = 'yes') = 9/14 = 0.643, P(buys_computer = 'no') = 5/14 = 0.357
Multiplying the per-attribute conditional probabilities for X gives P(X|yes) P(yes) > P(X|no) P(no), so X is classified as buys_computer = 'yes' (worked through in the sketch below)
Zero-probability problem: if any conditional probability is zero, the whole product is zero; the Laplacian correction (add 1 to each count) avoids this, and the 'corrected' probability estimates stay close to their 'uncorrected' counterparts
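A hedged sketch reproducing this example with relative-frequency estimates; the counts are read off the 14-tuple table above (no Laplacian correction applied):

# Priors P(Ci): 9 'yes' and 5 'no' tuples out of 14
p_yes, p_no = 9 / 14, 5 / 14

# Conditional probabilities P(xk | Ci) for X = (age<=30, income=medium, student=yes, credit_rating=fair)
px_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)    # P(X | yes) ~ 0.044
px_no = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)     # P(X | no)  ~ 0.019

# Compare P(X | Ci) * P(Ci); the larger product gives the predicted class
print('yes' if px_yes * p_yes > px_no * p_no else 'no')   # -> yes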
Naïve Bayes Classifier: Comments
Advantages
Easy to implement
Disadvantages
Assumption: class conditional independence, therefore loss of accuracy
Practically, dependencies exist among variables; such dependencies cannot be modeled by a Naïve Bayes classifier
How to deal with these dependencies? Bayesian Belief Networks
(Chapter 9)
Chapter 8. Classification: Basic Concepts
Using IF-THEN Rules for Classification
Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
Rule antecedent/precondition vs. rule consequent
Using IF-THEN Rules for Classification
If more than one rule is triggered, conflict resolution is needed
Size ordering: assign the highest priority to the triggering rule that has the 'toughest' requirement (i.e., with the most attribute tests)
Rule Extraction from a Decision Tree
Rules are easier to understand than large trees
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
Rules are mutually exclusive and exhaustive
[Figure: the buys_computer decision tree; age? branches to <=30 (student? -> no: no, yes: yes), 31..40 (yes), and >40 (credit rating? -> excellent: no, fair: yes)]
Example: Rule extraction from our buys_computer decision-tree
IF age = <=30 AND student = no THEN buys_computer = no
IF age = <=30 AND student = yes THEN buys_computer = yes
IF age = 31…40 THEN buys_computer = yes
IF age = >40 AND credit_rating = excellent THEN buys_computer = no
IF age = >40 AND credit_rating = fair THEN buys_computer = yes
Rule Induction: Sequential Covering Method
Sequential covering algorithm: Extracts rules directly from training
data
Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
Rules are learned sequentially; each rule for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of the other classes
Steps:
Rules are learned one at a time
Each time a rule is learned, the tuples covered by the rule are removed
Repeat the process on the remaining tuples until termination
How are rules learned?
Sequential Covering Algorithm
[Figure: the positive examples are covered step by step; Rule 1 covers some of them, Rule 2 covers more, and Rule 3 covers more still]
Is Coverage a Good Rule-Quality Measure?
Rule Generation
To generate a rule
while (true)
    find the best predicate p
    if foil-gain(p) > threshold then add p to the current rule
    else break
[Figure: a rule is grown greedily, from A3=1 to A3=1 && A1=2 to A3=1 && A1=2 && A8=5, each extension separating the positive from the negative examples more cleanly]
How to Learn-One-Rule?
R: If condition Then C
R’=If condition’ then C where condition’ is condition with new
attribute test added
Rule-Quality measures: consider both coverage and accuracy
Foil-gain (in FOIL & RIPPER): assesses the information gained by extending condition':
FOIL\_Gain = pos' \times \left(\log_2\frac{pos'}{pos' + neg'} - \log_2\frac{pos}{pos + neg}\right)
where pos / neg (pos' / neg') are the numbers of positive / negative tuples covered by R (R'); it favors rules that have high accuracy and cover many positive tuples
Rule pruning based on an independent set of test tuples
FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}
where pos / neg are the numbers of positive / negative tuples covered by R
If FOIL_Prune is higher for the pruned version of R, prune R (a minimal sketch of both FOIL measures follows)
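An illustrative Python sketch of the two FOIL measures above; the rule representation (a dict of attribute tests) and the helper names are assumptions, not from the slides:

import math

def covers(rule, tup):
    # A rule is a dict of attribute tests; it covers a tuple that matches all of them
    return all(tup.get(a) == v for a, v in rule.items())

def pos_neg(rule, data, target, cls):
    pos = sum(1 for t in data if covers(rule, t) and t[target] == cls)
    neg = sum(1 for t in data if covers(rule, t) and t[target] != cls)
    return pos, neg

def foil_gain(rule, extended_rule, data, target, cls):
    # FOIL_Gain = pos' * (log2(pos'/(pos'+neg')) - log2(pos/(pos+neg)))
    pos, neg = pos_neg(rule, data, target, cls)
    pos2, neg2 = pos_neg(extended_rule, data, target, cls)
    if pos == 0 or pos2 == 0:
        return 0.0
    return pos2 * (math.log2(pos2 / (pos2 + neg2)) - math.log2(pos / (pos + neg)))

def foil_prune(rule, prune_set, target, cls):
    # FOIL_Prune(R) = (pos - neg) / (pos + neg) on an independent prune set
    pos, neg = pos_neg(rule, prune_set, target, cls)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0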
Chapter 8. Classification: Basic Concepts
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive: precision = TP / (TP + FP)
Recall: completeness – what % of the positive tuples the classifier labeled as positive: recall = TP / (TP + FN)
F measure (F1): the harmonic mean of precision and recall: F1 = 2 × precision × recall / (precision + recall)
Fβ
The F-beta score is the weighted harmonic mean of
precision and recall, reaching its optimal value at 1 and
its worst value at 0.
The beta parameter determines the weight of
precision in the combined score. beta < 1 lends more
weight to precision, while beta > 1 favors recall (beta
-> 0 considers only precision, beta -> inf only recall).
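A minimal sketch of these metrics computed from confusion-matrix counts (function and argument names are illustrative):

def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    # Precision, recall, and the weighted harmonic mean F_beta = (1 + b^2) P R / (b^2 P + R)
    precision = tp / (tp + fp)          # share of predicted positives that are correct
    recall = tp / (tp + fn)             # share of actual positives that were found
    b2 = beta * beta
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta

# e.g. precision_recall_fbeta(tp=90, fp=140, fn=210, beta=1.0)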
Evenly Distributed Classes and Imbalanced Classes
Accuracy is an appropriate measure if the data classes are fairly evenly distributed
Sensitivity (recall), specificity, precision, F, and F-beta are preferred for the class imbalance problem, especially when the main class of interest is rare
Classifier Evaluation Metrics: Example
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
Holdout method
Given data is randomly partitioned into two independent sets
Training set (e.g., 2/3) for model construction
Test set (e.g., 1/3) for accuracy estimation
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
Random sampling: a variation of holdout
Repeat holdout k times, accuracy = avg. of the accuracies
obtained
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets; at the i-th iteration, use subset Di as the test set and the remaining subsets as training data
*Stratified cross-validation*: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial data
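A plain k-fold cross-validation sketch; train_fn and predict_fn stand in for whatever learner is being evaluated (assumed callables, not from the slides):

import random

def k_fold_accuracy(data, k, train_fn, predict_fn, target):
    # Each of the k mutually exclusive folds is used exactly once as the test set
    data = data[:]
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    correct = total = 0
    for i in range(k):
        test = folds[i]
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        model = train_fn(train)
        correct += sum(1 for t in test if predict_fn(model, t) == t[target])
        total += len(test)
    return correct / total              # accuracy accumulated over all k test folds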
Evaluating Classifier Accuracy: Bootstrap
Bootstrap
Works well with small data sets
Samples the given training tuples uniformly with replacement
i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
Several bootstrap methods exist; a common one is the .632 bootstrap
A data set with d tuples is sampled d times, with replacement, resulting in
a training set of d samples. The data tuples that did not make it into the
training set end up forming the test set. About 63.2% of the original data
end up in the bootstrap, and the remaining 36.8% form the test set (since
(1 - 1/d)^d ≈ e^{-1} = 0.368)
Repeat the sampling procedure k times; the overall accuracy of the model is the average, over the k repetitions, of 0.632 × Acc(M_i)_{test\_set} + 0.368 × Acc(M_i)_{train\_set} (a sketch follows)
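A sketch of this .632 estimate (train_fn and acc_fn are assumed callables; accuracy is averaged over the k bootstrap repetitions):

import random

def bootstrap_632_accuracy(data, k, train_fn, acc_fn):
    acc = 0.0
    for _ in range(k):
        train = [random.choice(data) for _ in data]        # sample d tuples with replacement
        chosen = {id(t) for t in train}
        test = [t for t in data if id(t) not in chosen]    # the ~36.8% left out of the sample
        model = train_fn(train)
        acc += 0.632 * acc_fn(model, test) + 0.368 * acc_fn(model, train)
    return acc / k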
Model Selection: ROC Curves
ROC (Receiver Operating
Characteristics) curves: for visual
comparison of classification models
Originated from signal detection theory
Shows the trade-off between the true
positive rate and the false positive rate
The area under the ROC curve is a measure of the accuracy of the model
Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model
[Figure: vertical axis represents the true positive rate; horizontal axis represents the false positive rate; the plot also shows a diagonal line; a model with perfect accuracy will have an area of 1.0]
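A small sketch of how the ROC points can be computed from ranked test tuples (scores paired with 0/1 labels; illustrative only, assumes both classes are present):

def roc_points(scored_tuples):
    # Rank the tuples by decreasing score; after each tuple, record (FPR, TPR)
    ranked = sorted(scored_tuples, key=lambda st: -st[0])
    p = sum(1 for _, y in ranked if y == 1)
    n = len(ranked) - p
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n, tp / p))   # one point per threshold position
    return points

# The area under these points (AUC) is ~1.0 for a near-perfect ranking and ~0.5 for random guessing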
ROC Curve contd.
[Figures: example ROC curves for different classification models]
Issues Affecting Model Selection
Accuracy
classifier accuracy: predicting class label
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Chapter 8. Classification: Basic Concepts
Ensemble methods
Use a combination of models to increase accuracy
Bagging: averaging the prediction over a collection of classifiers
Boosting: weighted vote with a collection of classifiers
Ensemble Methods
Ensembles yield better results when there is significant
diversity among models.
The classifiers should perform better than random
guessing
Each base classifier can be allocated to a different CPU
and so ensembles are parallelizable
Decision Boundary with Ensemble Methods
Bagging: Bootstrap Aggregation
Analogy: Diagnosis based on multiple doctors' majority vote
Training
Given a set D of d tuples, at each iteration i, a training set D_i of d tuples is sampled with replacement from D (i.e., a bootstrap sample)
A classifier model M_i is learned for each training set D_i
Prediction: each classifier M_i casts an equal-weight vote and the class with the most votes wins (a minimal sketch follows); in boosting (AdaBoost), by contrast, the weight of classifier M_i's vote is \log\frac{1 - error(M_i)}{error(M_i)}
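A minimal bagging sketch (train_fn is an assumed callable that returns a model usable as a function; not from the slides):

import random
from collections import Counter

def bagging(data, k, train_fn):
    # Train k models, each on a bootstrap sample Di drawn with replacement from D
    return [train_fn([random.choice(data) for _ in data]) for _ in range(k)]

def bagged_predict(models, x):
    # Classification: each model casts one equal-weight vote; the majority wins
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]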
Random Forest (Breiman 2001)
Random Forest:
Each classifier in the ensemble is a decision tree classifier, generated using a random selection of attributes at each node to determine the split; during classification each tree votes and the most popular class is returned
Two Methods to construct Random Forest:
Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology
is used to grow the trees to maximum size
Forest-RC (random linear combinations): Creates new attributes (or features) that are a linear combination of the existing attributes, which reduces the correlation between the individual classifiers
Classification of Class-Imbalanced Data Sets
Class-imbalance problem: Rare positive example but numerous
negative ones, e.g., medical diagnosis, fraud, oil-spill from satellite
radar images, fault, etc.
Traditional methods assume a balanced distribution of classes and
equal error costs: not suitable for class-imbalanced data
Typical methods for imbalanced data in 2-class classification:
Oversampling: re-sampling of data from the positive class
Under-sampling: randomly eliminate tuples from the negative class
Threshold-moving: move the decision threshold, t, so that the rare class tuples are easier to classify, and hence there is less chance of costly false negative errors
Ensemble techniques: Ensemble multiple classifiers introduced
above
Still difficult for class imbalance problem on multiclass tasks
Chapter 8. Classification: Basic Concepts
Summary (II)
Significance tests and ROC curves are useful for model selection.
There have been numerous comparisons of the different
classification methods; the matter remains a research topic
No single method has been found to be superior over all others
for all data sets
Issues such as accuracy, training time, robustness, scalability,
and interpretability must be considered and can involve trade-
offs, further complicating the quest for an overall superior
method
Predictor Error Measures
Measure predictor accuracy: measure how far off the predicted value is from
the actual known value
Loss function: measures the error betw. yi and the predicted value yi’
Absolute error: | yi – yi’|
Squared error: (yi – yi’)2
Test error (generalization error): the average loss over the test set
Mean absolute error: \frac{1}{d}\sum_{i=1}^{d} |y_i - y_i'|        Mean squared error: \frac{1}{d}\sum_{i=1}^{d} (y_i - y_i')^2
Relative absolute error: \frac{\sum_{i=1}^{d} |y_i - y_i'|}{\sum_{i=1}^{d} |y_i - \bar{y}|}        Relative squared error: \frac{\sum_{i=1}^{d} (y_i - y_i')^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}
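A short sketch computing these four error measures for paired actual/predicted values (illustrative only):

def error_measures(y_true, y_pred):
    # Mean absolute/squared error, plus the relative (normalized) versions
    d = len(y_true)
    mean_y = sum(y_true) / d
    abs_err = [abs(a - p) for a, p in zip(y_true, y_pred)]
    sq_err = [(a - p) ** 2 for a, p in zip(y_true, y_pred)]
    mae = sum(abs_err) / d
    mse = sum(sq_err) / d
    rae = sum(abs_err) / sum(abs(a - mean_y) for a in y_true)
    rse = sum(sq_err) / sum((a - mean_y) ** 2 for a in y_true)
    return mae, mse, rae, rse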