Copy of Classification-1
Classification and Prediction
What is classification? What is
regression?
Issues regarding classification and
prediction
Classification by decision tree induction
Scalable decision tree induction
Classification vs. Prediction
Classification:
◼ predicts categorical class labels
◼ classifies data (constructs a model) based on the
training set and the values (class labels) in a classifying
attribute and uses it in classifying new data
Regression:
◼ models continuous-valued functions, i.e., predicts
unknown or missing values
Typical Applications
◼ credit approval
◼ target marketing
◼ medical diagnosis
◼ treatment effectiveness analysis
Why Classification? A motivating
application
Credit approval
◼ A bank wants to classify its customers based on whether
they are expected to pay back their approved loans
◼ The history of past customers is used to train the
classifier
◼ The classifier provides rules, which identify potentially
reliable future customers
◼ Classification rule:
If age = “31...40” and income = high then credit_rating =
excellent
◼ Future customers
Paul: age = 35, income = high → excellent credit rating
John: age = 20, income = medium → fair credit rating
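The rule above can be written as a small function and applied to the two future customers. This is a minimal sketch: the function name `credit_rating` and the fallback to "fair" when the rule does not fire are assumptions made for illustration, not part of the bank example itself.

```python
def credit_rating(age: int, income: str) -> str:
    """Apply the single classification rule from the slide."""
    # Rule: if age is in 31...40 and income = high then credit_rating = excellent
    if 31 <= age <= 40 and income == "high":
        return "excellent"
    # Assumption: "fair" is used as the default when the rule does not fire.
    return "fair"


# The two future customers from the slide: name -> (age, income)
customers = {"Paul": (35, "high"), "John": (20, "medium")}
ratings = {name: credit_rating(age, income) for name, (age, income) in customers.items()}
print(ratings)
```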
Classification—A Two-Step Process
Model construction: describing a set of predetermined
classes
◼ Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
◼ The set of tuples used for model construction: training set
◼ The model is represented as classification rules, decision
trees, or mathematical formulae
Model usage: for classifying future or unknown objects
◼ Estimate accuracy of the model
The known label of test samples is compared with the
classified result from the model
Accuracy rate is the percentage of test set samples that
are correctly classified by the model
The test set must be independent of the training set; otherwise the
accuracy estimate is over-optimistic because over-fitting goes undetected
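The accuracy estimate described above can be sketched in a few lines. The toy model (a threshold on a single number) and the four test samples are hypothetical and exist only to make the example runnable. The test samples must not have been used to build the model.

```python
def accuracy(model, test_samples, test_labels):
    """Fraction of test samples whose predicted label matches the known label."""
    predicted = [model(x) for x in test_samples]
    correct = sum(p == y for p, y in zip(predicted, test_labels))
    return correct / len(test_labels)


# Hypothetical model: predicts "yes" when the single feature is at least 5.
def toy_model(x):
    return "yes" if x >= 5 else "no"


# Hypothetical held-out test set (not used for model construction).
test_samples = [2, 7, 5, 3]
test_labels = ["no", "yes", "yes", "yes"]

acc = accuracy(toy_model, test_samples, test_labels)
print(f"accuracy = {acc:.2f}")  # 3 of 4 correct
```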
Classification Process (1):
Model Construction
[Figure: training data is passed to a classification algorithm to build the model; testing data and unseen data such as (Jeff, Professor, 4) are then classified to answer "Tenured?"]
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Mellisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Supervised vs. Unsupervised
Learning
Supervised learning (classification)
◼ Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
◼ New data is classified based on the training set
Unsupervised learning (clustering)
◼ The class labels of the training data are unknown
◼ Given a set of measurements, observations, etc., the aim is
to establish the existence of classes or clusters in the data
Issues regarding classification and
prediction (1): Data Preparation
Data cleaning
◼ Preprocess data in order to reduce noise and handle
missing values
Relevance analysis (feature selection)
◼ Remove the irrelevant or redundant attributes
Data transformation
◼ Generalize and/or normalize data
numerical attribute income → categorical {low, medium, high}
normalize all numerical attributes to [0,1)
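Both transformations can be sketched briefly. The income thresholds below are hypothetical, chosen only for illustration. The normalization shown is plain min-max scaling, which maps the largest value to exactly 1.0; a strictly half-open range [0,1) would need a denominator slightly larger than max - min.

```python
def discretize_income(income: float, low_max: float = 30_000, medium_max: float = 70_000) -> str:
    """Generalize a numerical income to {low, medium, high}.

    The cut points are hypothetical; in practice they come from domain knowledge
    or from a discretization procedure.
    """
    if income <= low_max:
        return "low"
    if income <= medium_max:
        return "medium"
    return "high"


def min_max_normalize(values):
    """Min-max scaling: the smallest value maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to scale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(discretize_income(25_000), discretize_income(50_000), discretize_income(90_000))
print(min_max_normalize([10, 20, 30]))
```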
Issues regarding classification and prediction
(2): Evaluating Classification Methods
Predictive accuracy
Speed
◼ time to construct the model
◼ time to use the model
Robustness
◼ handling noise and missing values
Scalability
◼ efficiency in disk-resident databases
Interpretability:
◼ understanding and insight provided by the model
Goodness of rules (quality)
◼ decision tree size
◼ compactness of classification rules
Classification by Decision Tree
Induction
Decision tree
◼ A flow-chart-like tree structure
◼ Internal node denotes a test on an attribute
◼ Branch represents an outcome of the test
◼ Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases
◼ Tree construction
At start, all the training examples are at the root
◼ Tree pruning
Identify and remove branches that reflect noise or outliers
Training data (follows an example from Quinlan's ID3)
age     income  student  credit_rating  buys_computer
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Output: A Decision Tree for
“buys_computer”
[Figure: decision tree with root test "age?" and branches <=30, 31..40, and >40; the leaves are labeled no or yes]
Algorithm for Decision Tree
Induction
Basic algorithm (a greedy algorithm)
◼ Tree is constructed in a top-down recursive divide-and-conquer
manner
◼ At start, all the training examples are at the root
◼ Attributes are categorical (if continuous-valued, they are
discretized in advance)
◼ Samples are partitioned recursively based on selected attributes
◼ Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
Conditions for stopping partitioning
◼ All samples for a given node belong to the same class
◼ There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
◼ There are no samples left
Algorithm for Decision Tree
Induction (pseudocode)
Algorithm GenDecTree(Sample S, Attlist A)
1. create a node N
2. If all samples are of the same class C then label N with C;
terminate;
3. If A is empty then label N with the most common class C in
S (majority voting); terminate;
4. Select a ∈ A with the highest information gain; label N with
a;
5. For each value v of a:
a. Grow a branch from N with condition a=v;
b. Let Sv be the subset of samples in S with a=v;
c. If Sv is empty then attach a leaf labeled with the most
common class in S;
d. Else attach the node generated by GenDecTree(Sv, A-a)
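The pseudocode above translates fairly directly into Python. The following is a minimal sketch, not a reference implementation: the dictionary-based tree representation, the `domains` argument (the possible values of each attribute), and the four-row toy data set are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Info(D): expected information needed to classify a tuple."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(rows, labels, attr):
    """Gain(attr) = Info(D) - Info_attr(D)."""
    n = len(labels)
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[attr] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def gen_dec_tree(rows, labels, attrs, domains):
    # Step 2: all samples are of the same class.
    if len(set(labels)) == 1:
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    # Step 3: no attributes left, so use majority voting.
    if not attrs:
        return majority
    # Step 4: select the attribute with the highest information gain.
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {"attr": best, "branches": {}}
    rest = [a for a in attrs if a != best]
    # Step 5: grow one branch per value of the selected attribute.
    for v in domains[best]:
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        if not idx:
            # Step 5c: empty subset, attach a leaf with the most common class in S.
            node["branches"][v] = majority
        else:
            # Step 5d: recurse on the subset with the attribute removed.
            node["branches"][v] = gen_dec_tree(
                [rows[i] for i in idx], [labels[i] for i in idx], rest, domains
            )
    return node


# Hypothetical toy data: the class depends only on "student".
rows = [
    {"student": "no", "income": "high"},
    {"student": "no", "income": "low"},
    {"student": "yes", "income": "high"},
    {"student": "yes", "income": "low"},
]
labels = ["no", "no", "yes", "yes"]
domains = {"student": ["no", "yes"], "income": ["low", "high"]}

tree = gen_dec_tree(rows, labels, ["student", "income"], domains)
print(tree)
```

Leaves are plain class labels and internal nodes are dictionaries, which keeps the sketch short; a real implementation would usually use a node class and handle continuous attributes.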
Attribute Selection Measure:
Information Gain (ID3/C4.5)
◼ Select the attribute with the highest information gain
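◼ Let p_i be the probability that an arbitrary tuple in D
belongs to class C_i, estimated by |C_{i,D}|/|D|
◼ Expected information (entropy) needed to classify a tuple
in D:
$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
◼ Information needed to classify D after using A to split D
into v partitions:
$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$
◼ Information gained by branching on attribute A:
$Gain(A) = Info(D) - Info_A(D)$
◼ Gain ratio (C4.5) normalizes information gain by the split information:
$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$
Ex. $SplitInfo_{income}(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557$
◼ GainRatio(A) = Gain(A)/SplitInfo_A(D)
Ex. gain_ratio(income) = 0.029/1.557 = 0.019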
The attribute with the maximum gain ratio is
selected as the splitting attribute
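The split information and gain ratio for income can be checked numerically. This is a small sketch: the function names are illustrative, the partition sizes 4, 6, and 4 (out of 14 tuples) and the gain of 0.029 are taken from the example, and everything else is computed.

```python
import math


def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|)) over the partitions of D."""
    n = sum(partition_sizes)
    return -sum((s / n) * math.log2(s / n) for s in partition_sizes if s > 0)


def gain_ratio(gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    return gain / split_info(partition_sizes)


# income splits the 14 tuples into partitions of size 4, 6, and 4.
si = split_info([4, 6, 4])
gr = gain_ratio(0.029, [4, 6, 4])
print(f"SplitInfo = {si:.3f}, gain ratio = {gr:.3f}")  # about 1.557 and 0.019
```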
Gini index (CART, IBM
IntelligentMiner)
If a data set D contains examples from n classes, the gini index
gini(D) is defined as
$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$
where p_j is the relative frequency of class j in D
If a data set D is split on A into two subsets D_1 and D_2, the
gini index gini_A(D) is defined as
$gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)$
Reduction in impurity:
$\Delta gini(A) = gini(D) - gini_A(D)$
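Ex. for the binary splits on income in the 14-tuple buys_computer data: gini_{income ∈ {low,medium}}(D) = 0.443, gini_{income ∈ {low,high}}(D) = 0.458, and gini_{income ∈ {medium,high}}(D) = 0.450; the split {low,medium} vs. {high} is the best since its value is the lowest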
All attributes are assumed continuous-valued
May need other tools, e.g., clustering, to get the possible split
values
Can be modified for categorical attributes
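The gini definitions above can be evaluated with a few lines of code. This is a sketch under stated assumptions: the function names are illustrative, and the class counts in the example (7 yes / 3 no in one subset, 2 yes / 2 no in the other) are chosen for illustration.

```python
from collections import Counter


def gini(labels):
    """gini(D) = 1 - sum of squared relative class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_split(d1, d2):
    """gini_A(D): size-weighted gini of the two subsets produced by a binary split."""
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)


def gini_reduction(d1, d2):
    """Reduction in impurity: gini(D) - gini_A(D), where D is the union of d1 and d2."""
    return gini(d1 + d2) - gini_split(d1, d2)


# Illustrative class labels for the two subsets of a binary split.
d1 = ["yes"] * 7 + ["no"] * 3
d2 = ["yes"] * 2 + ["no"] * 2

print(f"gini(D)   = {gini(d1 + d2):.3f}")
print(f"gini_A(D) = {gini_split(d1, d2):.3f}")
print(f"reduction = {gini_reduction(d1, d2):.3f}")
```

The split with the smallest gini_A(D), or equivalently the largest reduction in impurity, is the one chosen.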
Comparing Attribute Selection Measures
The three measures, in general, return good
results but
◼ Information gain:
biased towards multivalued attributes
◼ Gain ratio:
tends to prefer unbalanced splits in which one partition
is much smaller than the others
◼ Gini index:
biased to multivalued attributes
has difficulty when # of classes is large
tends to favor tests that result in equal-sized partitions
and purity in both partitions
Comparison among Splitting Criteria
For a 2-class problem:
Overfitting and Tree Pruning
Overfitting: An induced tree may overfit the
training data
◼ Too many branches, some may reflect anomalies due to noise or
outliers
◼ Poor accuracy for unseen samples
Design goals:
◼ Able to handle large disk-resident training
sets
◼ No restrictions on training-set size
Building tree
GrowTree(TrainingData D)
    Partition(D);

Partition(Data D)
    if (all points in D belong to the same class) then
        return;
    for each attribute A do
        evaluate splits on attribute A;
    use best split found to partition D into D1 and D2;
    Partition(D1);
    Partition(D2);
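A runnable version of the Partition procedure might look as follows. This is a minimal in-memory sketch, so it does not address the disk-resident design goal. It assumes numeric attributes, binary splits of the form "attribute <= threshold", and the gini index as the splitting index; the majority-class fallback for data that cannot be split further is an added guard that the pseudocode does not show.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(points, labels):
    """Return (score, attribute index, threshold) with the lowest weighted gini, or None."""
    best = None
    n = len(points)
    for a in range(len(points[0])):
        # The largest value cannot be a threshold: its right side would be empty.
        for t in sorted({p[a] for p in points})[:-1]:
            left = [y for p, y in zip(points, labels) if p[a] <= t]
            right = [y for p, y in zip(points, labels) if p[a] > t]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, a, t)
    return best


def partition(points, labels):
    # All points belong to the same class: return a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    split = best_split(points, labels)
    if split is None:
        # Added guard (assumption): identical points with mixed classes.
        return Counter(labels).most_common(1)[0][0]
    _, a, t = split
    left = [(p, y) for p, y in zip(points, labels) if p[a] <= t]
    right = [(p, y) for p, y in zip(points, labels) if p[a] > t]
    return {
        "attr": a,
        "threshold": t,
        "left": partition([p for p, _ in left], [y for _, y in left]),
        "right": partition([p for p, _ in right], [y for _, y in right]),
    }


def grow_tree(points, labels):
    return partition(points, labels)


# Hypothetical two-attribute data: the class depends on the first attribute only.
points = [(0.1, 0.9), (0.2, 0.1), (0.8, 0.2), (0.9, 0.8)]
labels = ["a", "a", "b", "b"]
tree = grow_tree(points, labels)
print(tree)
```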
Data Setup
Gini Index
◼ if data D contains examples from c classes
$Gini(D) = 1 - \sum_{j=1}^{c} p_j^2$
where p_j is the relative frequency of class j in D
◼ If D is split into D1 and D2 with n1 and n2 tuples each (n = n1 + n2)
$Gini_{split}(D) = \frac{n_1}{n} gini(D_1) + \frac{n_2}{n} gini(D_2)$
Note: Only class frequencies are needed to compute the index
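Because only class frequencies are needed, all candidate split points of a sorted numeric attribute can be scored in a single pass by moving one tuple at a time from a "right" class histogram to a "left" one. The sketch below illustrates that idea; the function names, the use of midpoints between adjacent distinct values as thresholds, and the toy data are assumptions for illustration.

```python
from collections import Counter


def gini_from_counts(counts):
    """Gini index computed from a class histogram (class -> frequency)."""
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_numeric_split(values, labels):
    """Scan the sorted attribute once and return (threshold, Gini_split) of the best split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    left = Counter()         # class histogram of tuples at or below the candidate split
    right = Counter(labels)  # class histogram of the remaining tuples
    best_threshold, best_score = None, None
    for i in range(n - 1):
        value, label = pairs[i]
        left[label] += 1
        right[label] -= 1
        next_value = pairs[i + 1][0]
        if value == next_value:
            continue  # no split point between equal attribute values
        n1, n2 = i + 1, n - (i + 1)
        score = n1 / n * gini_from_counts(left) + n2 / n * gini_from_counts(right)
        if best_score is None or score < best_score:
            # Assumption: use the midpoint between adjacent distinct values.
            best_threshold, best_score = (value + next_value) / 2, score
    return best_threshold, best_score


# Hypothetical attribute values and class labels.
result = best_numeric_split([1, 2, 3, 4], ["a", "a", "b", "b"])
print(result)
```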
Finding Split Points
evaluate splitting index for various subsets using the constructed matrix;
class/value matrix
◼ Next step is to update the Class List with the new nodes
◼ Scan the attr list that is used to split and update the corresponding
leaf entry in the Class List
Two Approaches:
◼ Stop growing the tree beyond a certain point
◼ First over-fit, then post prune. (More widely used)
Tree building divided into phases:
▪ Growth phase
▪ Prune phase
Hard to decide when to stop growing the tree, so
second approach more widely used.
Criteria for finding correct final tree size:
Three criteria:
◼ Cross validation with separate test data
◼ Use some criterion function to choose the best size
Example: Minimum description length (MDL)
criterion
◼ Statistical bounds: use all data for training but apply
statistical test to decide right size.
Occam’s Razor
Given two models of similar generalization errors, one
should prefer the simpler model over the more complex
model
Therefore, one should include model complexity when
evaluating a model
• Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle large number of features
• Cons
– Cannot handle complicated relationship between features
– simple decision boundaries
– problems with lots of missing data
Decision Boundary
[Figure: data points plotted over x and y, both ranging from 0 to 1, next to a decision tree whose root test is "x < 0.43?"; the Yes and No branches lead to further tests and to leaves labeled with class counts]
• The border line between two neighboring regions of different classes is
known as the decision boundary
• The decision boundary is parallel to the axes because each test condition
involves a single attribute at a time
Oblique Decision Trees
[Figure: oblique split with the test condition x + y < 1 separating Class = + from the other class]