Classification and Prediction
Classification and Prediction
Prediction
Classification and Prediction
What is classification? What is
regression?
Issues regarding classification and
prediction
Classification by decision tree induction
Scalable decision tree induction
Classification vs. Prediction
Classification:
predicts categorical class labels
classifies data (constructs a model) based on the
training set and the values (class labels) in a classifying
attribute and uses it in classifying new data
Regression:
models continuous-valued functions, i.e., predicts
unknown or missing values
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
Why Classification? A motivating
application
Credit approval
A bank wants to classify its customers based on whether
they are expected to pay back their approved loans
The history of past customers is used to train the
classifier
The classifier provides rules, which identify potentially
reliable future customers
Classification rule:
If age = “31...40” and income = high then credit_rating =
excellent
Future customers
Paul: age = 35, income = high excellent credit rating
John: age = 20, income = medium fair credit rating
Classification—A Two-Step Process
Model construction: describing a set of predetermined
classes
Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
The set of tuples used for model construction: training set
The model is represented as classification rules, decision
trees, or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of test samples is compared with the
Testing
Data Unseen Data
(Jeff, Professor, 4)
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no Tenured?
Mellisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
Supervised vs. Unsupervised
Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of training data is unknown
Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
Issues regarding classification and
prediction (1): Data Preparation
Data cleaning
Preprocess data in order to reduce noise and handle
missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
numerical attribute income categorical
{low,medium,high}
normalize all numerical attributes to [0,1)
Issues regarding classification and
prediction (2): Evaluating Classification
Methods
Predictive accuracy
Speed
time to construct the model
time to use the model
Robustness
handling noise and missing values
Scalability
efficiency in disk-resident databases
Interpretability:
understanding and insight provided by the model
Goodness of rules (quality)
decision tree size
compactness of classification rules
Classification by Decision Tree
Induction
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases
Tree construction
At start, all the training examples are at the root
Tree pruning
Identify and remove branches that reflect noise or outliers
age?
<=30 overcast
30..40 >40
no yes no yes
Scalable Decision Tree Induction Methods