Classification
— Chapter 8 —
Chapter 8. Classification: Basic Concepts
Classification—A Two-Step Process
■ Model construction: describing a set of predetermined classes
■ Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
■ The set of tuples used for model construction is the training set
■ The model is represented as classification rules, decision trees, or
mathematical formulae
■ Model usage: for classifying future or unknown objects
■ Estimate accuracy of the model
■ The known label of each test sample is compared with the classified result from the model
■ Accuracy rate is the percentage of test set samples that are correctly classified by the model
■ Test set is independent of training set (otherwise overfitting will occur)
Process (1): Model Construction
[Figure: Training data is fed to a classification algorithm, which produces a classifier (model), e.g., the rule IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’]
Process (2): Using the Model in Prediction
[Figure: The classifier is applied first to testing data to estimate accuracy, then to new/unseen data, e.g., (Jeff, Professor, 4) → Tenured?]
[Figure: Example decision tree for buys_computer: the root tests age (<=30, 31..40, >40); the <=30 branch tests student (no/yes leaves), the 31..40 branch is a ‘yes’ leaf, and the >40 branch tests credit_rating (no/yes leaves)]
Algorithm for Decision Tree Induction
■ Basic algorithm (a greedy algorithm)
■ Tree is constructed in a top-down recursive
divide-and-conquer manner
■ At start, all the training examples are at the root
■ Attributes are categorical (if continuous-valued, they are discretized in advance)
■ Examples are partitioned recursively based on selected
attributes
■ Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
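As a concrete illustration of the greedy top-down procedure above, here is a minimal Python sketch (not from the original slides); it assumes categorical attributes stored in per-tuple dicts, and the names build_tree and info_gain are illustrative only.

    import math
    from collections import Counter

    def entropy(labels):
        # Info(D) = -sum_i p_i log2(p_i), estimated from class frequencies
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        # Gain(attr) = Info(D) - sum_j |D_j|/|D| * Info(D_j) for a categorical attribute
        parts = {}
        for row, lab in zip(rows, labels):
            parts.setdefault(row[attr], []).append(lab)
        remainder = sum(len(p) / len(labels) * entropy(p) for p in parts.values())
        return entropy(labels) - remainder

    def build_tree(rows, labels, attributes):
        # Greedy, top-down, recursive divide-and-conquer induction
        if len(set(labels)) == 1:                       # all tuples in one class: leaf
            return labels[0]
        if not attributes:                              # no attributes left: majority vote
            return Counter(labels).most_common(1)[0][0]
        best = max(attributes, key=lambda a: info_gain(rows, labels, a))
        remaining = [a for a in attributes if a != best]
        branches = {}
        for value in {row[best] for row in rows}:       # partition on the selected attribute
            idx = [i for i, r in enumerate(rows) if r[best] == value]
            branches[value] = build_tree([rows[i] for i in idx],
                                         [labels[i] for i in idx], remaining)
        return {"split_on": best, "branches": branches}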
Attribute Selection Measure: Information
Gain (ID3/C4.5)
■ Select the attribute with the highest information gain
■ Let $p_i$ be the probability that an arbitrary tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$
■ Expected information (entropy) needed to classify a tuple in D:
   $Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
■ Information needed (after using attribute A to split D into v partitions) to classify D:
   $Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$
■ Information gained by branching on attribute A:
   $Gain(A) = Info(D) - Info_A(D)$
Attribute Selection: Information Gain
■ Class P: buys_computer = “yes” (9 tuples)
■ Class N: buys_computer = “no” (5 tuples)
■ $Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$
■ $Gain(age) = Info(D) - Info_{age}(D) = 0.246$
■ Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048, so age is selected as the splitting attribute
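A small sketch that reproduces these numbers from class counts alone; the per-branch (yes, no) counts for age are an assumption taken from the standard 14-tuple AllElectronics example (consistent with the 2/9 and 3/5 fractions used in the Naïve Bayes example later).

    import math

    def entropy(counts):
        # Info(D) = -sum_i p_i log2(p_i) over class counts
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c)

    def gain(class_counts, partition_counts):
        # Gain(A) = Info(D) - sum_j |D_j|/|D| * Info(D_j)
        total = sum(class_counts)
        info_a = sum(sum(p) / total * entropy(p) for p in partition_counts)
        return entropy(class_counts) - info_a

    print(round(entropy([9, 5]), 3))                         # Info(D) ≈ 0.94
    # assumed (yes, no) counts per age branch: <=30 -> (2, 3), 31..40 -> (4, 0), >40 -> (3, 2)
    print(round(gain([9, 5], [(2, 3), (4, 0), (3, 2)]), 3))  # ≈ 0.247 (0.246 on the slide, which rounds intermediate values)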
Computing Information-Gain for
Continuous-Valued Attributes
■ Let attribute A be a continuous-valued attribute
■ Must determine the best split point for A
■ Sort the values of A in increasing order
■ Typically, the midpoint between each pair of adjacent values
is considered as a possible split point
■ $(a_i + a_{i+1})/2$ is the midpoint between the values of $a_i$ and $a_{i+1}$
■ The point with the minimum expected information
requirement for A is selected as the split-point for A
■ Split:
■ D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point
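A hedged Python sketch of the midpoint search described above; the function name best_split_point and its toy input values are illustrative, not from the slides.

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_split_point(values, labels):
        # Sort the values of A, try each midpoint (a_i + a_{i+1}) / 2, and keep the
        # split point with the minimum expected information requirement Info_A(D)
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        best_info, best_point = float("inf"), None
        for i in range(n - 1):
            if pairs[i][0] == pairs[i + 1][0]:
                continue                                   # equal neighbours yield no new midpoint
            mid = (pairs[i][0] + pairs[i + 1][0]) / 2
            left = [lab for v, lab in pairs if v <= mid]   # D1: tuples with A <= split-point
            right = [lab for v, lab in pairs if v > mid]   # D2: tuples with A > split-point
            info_a = len(left) / n * entropy(left) + len(right) / n * entropy(right)
            if info_a < best_info:
                best_info, best_point = info_a, mid
        return best_point

    print(best_split_point([25, 32, 40, 47, 51], ["yes", "yes", "no", "no", "no"]))  # 36.0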
Gain Ratio for Attribute Selection (C4.5)
■ Information gain measure is biased towards attributes with a
large number of values
■ C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
■ $SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$
■ GainRatio(A) = Gain(A)/SplitInfo_A(D)
■ Ex. GainRatio(income) = 0.029/1.557 = 0.019
■ The attribute with the maximum gain ratio is selected as the splitting attribute
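A small check of the gain-ratio arithmetic; the 4/6/4 split of the income attribute is an assumption from the standard AllElectronics data, and Gain(income) = 0.029 comes from the worked example above.

    import math

    def split_info(partition_sizes):
        # SplitInfo_A(D) = -sum_j |D_j|/|D| * log2(|D_j|/|D|)
        total = sum(partition_sizes)
        return -sum((s / total) * math.log2(s / total) for s in partition_sizes if s)

    print(round(split_info([4, 6, 4]), 3))           # SplitInfo_income(D) ≈ 1.557
    print(round(0.029 / split_info([4, 6, 4]), 3))   # GainRatio(income) ≈ 0.019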
Gini Index (CART)
■ Reduction in impurity: $\Delta gini(A) = gini(D) - gini_A(D)$, where $gini(D) = 1 - \sum_{i=1}^{m} p_i^2$
■ The attribute providing the largest reduction in impurity is chosen to split the node
Overfitting and Tree Pruning
■ Overfitting: an induced tree may overfit the training data
■ Too many branches, some may reflect anomalies due to noise or outliers
■ Poor accuracy for unseen samples
Classification in Large Databases
■ Classification—a classical problem extensively studied by
statisticians and machine learning researchers
■ Scalability: Classifying data sets with millions of examples and
hundreds of attributes with reasonable speed
■ Why is decision tree induction popular?
■ relatively faster learning speed (than other classification
methods)
■ convertible to simple and easy to understand classification
rules
■ can use SQL queries for accessing databases
Bayes’ Theorem: Basics
■ Bayes’ Theorem: $P(H|\mathbf{X}) = \frac{P(\mathbf{X}|H)\,P(H)}{P(\mathbf{X})}$
■ P(X|H) (the likelihood) is the probability of observing X given that H holds, e.g., given that a customer will buy a computer, the probability that the customer is 31..40 with medium income
Prediction Based on Bayes’ Theorem
■ Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes’ theorem:
   $P(H|\mathbf{X}) = \frac{P(\mathbf{X}|H)\,P(H)}{P(\mathbf{X})}$
■ Informally, this can be viewed as posterior = likelihood × prior / evidence
Classification Is to Derive the Maximum Posteriori
■ Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector $\mathbf{X} = (x_1, x_2, \ldots, x_n)$
■ Suppose there are m classes $C_1, C_2, \ldots, C_m$.
■ Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
■ This can be derived from Bayes’ theorem:
   $P(C_i|\mathbf{X}) = \frac{P(\mathbf{X}|C_i)\,P(C_i)}{P(\mathbf{X})}$
■ Since P(X) is constant for all classes, only $P(\mathbf{X}|C_i)\,P(C_i)$ needs to be maximized
Naïve Bayes Classifier
■ A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between attributes):
   $P(\mathbf{X}|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$
■ This greatly reduces the computation cost: only the class distribution needs to be counted
■ If $A_k$ is categorical, $P(x_k|C_i)$ is the number of tuples in $C_i$ having value $x_k$ for $A_k$ divided by $|C_{i,D}|$; if $A_k$ is continuous-valued, $P(x_k|C_i)$ is usually estimated with a Gaussian distribution
Naïve Bayes Classifier: Training Dataset
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data to be classified:
X = (age <=30,
Income = medium,
Student = yes,
Credit_rating = Fair)
Naïve Bayes Classifier: An Example
■ P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
■ Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
■ X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
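The same computation, written out as a minimal Python sketch using the probabilities listed above (the dictionary layout and feature names are illustrative):

    # prior and conditional probabilities copied from the slide
    priors = {"yes": 9 / 14, "no": 5 / 14}
    cond = {
        "yes": {"age<=30": 2 / 9, "income=medium": 4 / 9, "student=yes": 6 / 9, "credit=fair": 6 / 9},
        "no":  {"age<=30": 3 / 5, "income=medium": 2 / 5, "student=yes": 1 / 5, "credit=fair": 2 / 5},
    }
    x = ["age<=30", "income=medium", "student=yes", "credit=fair"]

    scores = {}
    for c in priors:
        p = priors[c]
        for feature in x:
            p *= cond[c][feature]            # class-conditional independence assumption
        scores[c] = p                        # proportional to P(X|Ci) * P(Ci)

    print({c: round(p, 3) for c, p in scores.items()})   # {'yes': 0.028, 'no': 0.007}
    print(max(scores, key=scores.get))                    # 'yes'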
Naïve Bayes Classifier: Comments
■ Advantages
■ Easy to implement
■ Disadvantages
■ Assumption: class conditional independence, therefore loss of
accuracy
■ Practically, dependencies exist among variables (e.g., a patient’s profile, symptoms, and disease), and such dependencies cannot be modeled by the Naïve Bayes Classifier
■ How to deal with these dependencies? Bayesian Belief Networks
(Chapter 9)
Summary (II)
■ Significance tests and ROC curves are useful for model selection.
■ There have been numerous comparisons of the different
classification methods; the matter remains a research topic
■ No single method has been found to be superior over all others
for all data sets
■ Issues such as accuracy, training time, robustness, scalability,
and interpretability must be considered and can involve
trade-offs, further complicating the quest for an overall superior
method
References (1)
■ C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997
■ C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press,
1995
■ L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984
■ C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data
Mining and Knowledge Discovery, 2(2): 121-168, 1998
■ P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data
for scaling machine learning. KDD'95
■ H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative Frequent Pattern Analysis for
Effective Classification, ICDE'07
■ H. Cheng, X. Yan, J. Han, and P. S. Yu, Direct Discriminative Pattern Mining for
Effective Classification, ICDE'08
■ W. Cohen. Fast effective rule induction. ICML'95
■ G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for
gene expression data. SIGMOD'05
References (2)
■ A. J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall, 1990.
■ G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and
differences. KDD'99.
■ R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley, 2001
■ U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI’94.
■ Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and
an application to boosting. J. Computer and System Sciences, 1997.
■ J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree
construction of large datasets. VLDB’98.
■ J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision Tree
Construction. SIGMOD'99.
■ T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer-Verlag, 2001.
■ D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The
combination of knowledge and statistical data. Machine Learning, 1995.
■ W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple
Class-Association Rules, ICDM'01.
References (3)
■ T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity,
and training time of thirty-three old and new classification algorithms. Machine
Learning, 2000.
■ J. Magidson. The Chaid approach to segmentation modeling: Chi-squared
automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of
Marketing Research, Blackwell Business, 1994.
■ M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data
mining. EDBT'96.
■ T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
■ S. K. Murthy, Automatic Construction of Decision Trees from Data: A
Multi-Disciplinary Survey, Data Mining and Knowledge Discovery 2(4): 345-389, 1998
■ J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
■ J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML’93.
■ J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
■ J. R. Quinlan. Bagging, boosting, and C4.5. AAAI'96.
References (4)
■ R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and
pruning. VLDB’98.
■ J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data
mining. VLDB’96.
■ J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann,
1990.
■ P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley,
2005.
■ S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert
Systems. Morgan Kaufmann, 1991.
■ S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
■ I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques, 2ed. Morgan Kaufmann, 2005.
■ X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03
■ H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical
clusters. KDD'03.
CS412 Midterm Exam Statistics
■ Opinion question answering: like the style: 70.83%, dislike: 29.16%
■ Score distribution: >=90: 24; 80-89: 54; 70-79: 46; 60-69: 37; 50-59: 15; 40-49: 2; <40: 2
Issues: Evaluating Classification Methods
■ Accuracy
■ classifier accuracy: predicting class label
■ predictor accuracy: guessing the value of predicted attributes
■ Speed
■ time to construct the model (training time)
■ time to use the model (classification/prediction time)
■ Robustness: handling noise and missing values
■ Scalability: efficiency in disk-resident databases
■ Interpretability: understanding and insight provided by the model
Predictor Error Measures
■ Measure predictor accuracy: measure how far off the predicted value is from
the actual known value
■ Loss function: measures the error between $y_i$ and the predicted value $y_i'$
■ Absolute error: $|y_i - y_i'|$
■ Squared error: $(y_i - y_i')^2$
■ Test error (generalization error): the average loss over the test set
■ Mean absolute error: $\frac{1}{d}\sum_{i=1}^{d}|y_i - y_i'|$    Mean squared error: $\frac{1}{d}\sum_{i=1}^{d}(y_i - y_i')^2$
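A minimal sketch of these two loss measures (function names are illustrative):

    def mean_absolute_error(y, y_pred):
        # (1/d) * sum |y_i - y_i'| over the d test tuples
        return sum(abs(a - p) for a, p in zip(y, y_pred)) / len(y)

    def mean_squared_error(y, y_pred):
        # (1/d) * sum (y_i - y_i')^2; squaring exaggerates the influence of large errors
        return sum((a - p) ** 2 for a, p in zip(y, y_pred)) / len(y)

    print(mean_absolute_error([3.0, 5.0], [2.5, 6.0]))   # 0.75
    print(mean_squared_error([3.0, 5.0], [2.5, 6.0]))    # 0.625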
Scalable Decision Tree Induction Methods
■ PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning, stopping growth of the tree earlier
■ RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
■ Builds an AVC-list (attribute, value, class label)
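A rough sketch of the AVC-set idea behind RainForest, i.e., one table of (attribute value, class label) counts per attribute; the dict-of-dicts layout and the function name build_avc_sets are illustrative, not the paper's actual data structure.

    from collections import defaultdict

    def build_avc_sets(rows, label_key="class"):
        # One AVC-set per attribute: counts indexed by (attribute value, class label).
        # These aggregates are sufficient for computing split measures such as
        # information gain, so the raw tuples need not stay in memory.
        avc = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        for row in rows:
            label = row[label_key]
            for attr, value in row.items():
                if attr != label_key:
                    avc[attr][value][label] += 1
        return avc

    # e.g. build_avc_sets([{"age": "<=30", "student": "no", "class": "no"},
    #                      {"age": "31..40", "student": "yes", "class": "yes"}])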
Data Cube-Based Decision-Tree Induction
■ Integration of generalization with decision-tree induction
(Kamber et al.’97)
■ Classification at primitive concept levels
■ E.g., precise temperature, humidity, outlook, etc.
■ Low-level concepts, scattered classes, bushy
classification-trees
■ Semantic interpretation problems
■ Cube-based multi-level classification
■ Relevance analysis at multi-levels
■ Information-gain analysis with dimension + level