Classification and Prediction
Classification:
- predicts categorical class labels (discrete or nominal)
- constructs a model based on the training set and the values (class labels) of a classifying attribute, then uses it to classify new data

Prediction:
- models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications:
- credit approval
- target marketing
- medical diagnosis
- treatment effectiveness analysis
Classification: A Two-Step Process
Step 1 (model construction): a classification algorithm builds a classifier (model) from the training set.

Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (Model):

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
Step 2 (model usage): the classifier labels future or unseen data; its accuracy is first estimated on a test set independent of the training set.

Test data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4). Tenured? The model predicts tenured = 'yes', since rank = 'professor'.
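A minimal sketch of this two-step process in Python, using the rule above as the learned model (the function name and data layout are illustrative, not from the source):

```python
# Step 1 produced the model: IF rank = 'professor' OR years > 6
# THEN tenured = 'yes'. Step 2 applies it to new tuples.

def classify(rank: str, years: int) -> str:
    """Apply the learned rule to one (rank, years) tuple."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Test data: comparing predictions against the known labels
# estimates the model's accuracy before it is used.
test_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]
correct = sum(classify(r, y) == label for _, r, y, label in test_data)
print(correct / len(test_data))   # 0.75 (Merlisa is misclassified)

# Unseen data:
print(classify("Professor", 4))   # -> yes, so Jeff is tenured
```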
Issues: Data Preparation
- Data cleaning
- Data transformation

Issues: Evaluating Classification Methods
- Predictive accuracy
- Speed and scalability: time to construct the model; time to use the model
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability: understanding and insight provided by the model
- Goodness of rules: decision tree size; compactness of classification rules
Training Dataset

This follows an example from Quinlan's ID3.

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
Output: A Decision Tree for buys_computer

age?
- <=30 → student? (no → no, yes → yes)
- 31..40 → yes
- >40 → credit rating? (excellent → no, fair → yes)
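One way to make the tree concrete: a nested dict plus a traversal routine (a sketch; the encoding and names are illustrative, not from the source):

```python
# The learned tree: inner nodes map attribute values to subtrees,
# leaves are class labels for buys_computer.
tree = {
    "age": {
        "<=30":   {"student": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40":    {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

def classify(node, sample: dict) -> str:
    """Follow the sample's attribute values down to a leaf label."""
    while isinstance(node, dict):
        attribute = next(iter(node))               # the node's test
        node = node[attribute][sample[attribute]]  # take the branch
    return node

x = {"age": ">40", "student": "no", "credit_rating": "excellent"}
print(classify(tree, x))  # -> no
```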
Decision Trees: Entropy and Information Gain

- Expected information (entropy) needed to classify a sample with $s_i$ tuples of class $C_i$: $I(s_1, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i$
- Entropy of attribute A with values $\{a_1, \ldots, a_v\}$: $E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s}\, I(s_{1j}, \ldots, s_{mj})$
- Information gained by branching on A: $\mathrm{Gain}(A) = I(s_1, \ldots, s_m) - E(A)$
Attribute Selection by Information Gain Computation

- Class P: buys_computer = "yes"; Class N: buys_computer = "no"
- $I(p, n) = I(9, 5) = 0.940$
- Compute the entropy for age:

age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
31..40  4    0    0
>40     3    2    0.971

$E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$

Here $\frac{5}{14} I(2,3)$ means that age <= 30 appears in 5 of the 14 samples, with 2 yes's and 3 no's. Hence

$\mathrm{Gain}(age) = I(p, n) - E(age) = 0.940 - 0.694 = 0.246$

Similarly, $\mathrm{Gain}(income) = 0.029$, $\mathrm{Gain}(student) = 0.151$, $\mathrm{Gain}(credit\_rating) = 0.048$. Since age has the highest information gain, it becomes the test attribute at the root of the tree.
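The computation above can be checked with a few lines of Python (a sketch, not from the source; `info` corresponds to I and `e_age` to E(age)):

```python
from math import log2

def info(*counts: int) -> float:
    """Expected information I(s1, ..., sm) = -sum p_i log2 p_i."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# I(9, 5): 9 'yes' and 5 'no' tuples overall.
print(round(info(9, 5), 3))           # 0.940

# E(age): size-weighted entropy of the three age branches.
branches = [(2, 3), (4, 0), (3, 2)]   # (p_i, n_i) per branch
n = sum(p + q for p, q in branches)   # 14 samples
e_age = sum((p + q) / n * info(p, q) for p, q in branches)
print(round(e_age, 3))                # 0.694

# Gain(age) = I(9, 5) - E(age)
print(round(info(9, 5) - e_age, 3))   # 0.246
```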
Gini Index

- If a data set T contains examples from n classes, the gini index is $gini(T) = 1 - \sum_{j=1}^{n} p_j^2$, where $p_j$ is the relative frequency of class j in T.
- If T is split into two subsets $T_1$ and $T_2$ with sizes $N_1$ and $N_2$, the gini index of the split is $gini_{split}(T) = \frac{N_1}{N}\, gini(T_1) + \frac{N_2}{N}\, gini(T_2)$.
- The attribute providing the smallest $gini_{split}(T)$ is chosen to split the node.
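A sketch of both formulas in Python; the function names are illustrative (not from the source), and the example counts come from splitting the buys_computer labels on student:

```python
def gini(labels: list) -> float:
    """gini(T) = 1 - sum of squared class frequencies."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(t1: list, t2: list) -> float:
    """Size-weighted gini of a binary split of T into T1 and T2."""
    n = len(t1) + len(t2)
    return len(t1) / n * gini(t1) + len(t2) / n * gini(t2)

# Splitting the 14 buys_computer labels on student = yes/no:
yes_student = ["yes"] * 6 + ["no"] * 1   # 7 students: 6 yes, 1 no
no_student  = ["yes"] * 3 + ["no"] * 4   # 7 non-students: 3 yes, 4 no
print(round(gini_split(yes_student, no_student), 3))   # -> 0.367
```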
Generating Classification Rules
One rule is created for each path from the root to a leaf
Example (one rule per path in the tree above):

IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31..40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
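Rule generation can be sketched as a walk over every root-to-leaf path of the dict-encoded tree used earlier (illustrative code, not from the source):

```python
tree = {
    "age": {
        "<=30":   {"student": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40":    {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

def rules(node, conditions=()):
    """Yield (conditions, class_label) for every root-to-leaf path."""
    if not isinstance(node, dict):       # leaf: the rule is complete
        yield conditions, node
        return
    attribute = next(iter(node))
    for value, subtree in node[attribute].items():
        yield from rules(subtree, conditions + ((attribute, value),))

for conds, label in rules(tree):
    body = " AND ".join(f'{a} = "{v}"' for a, v in conds)
    print(f'IF {body} THEN buys_computer = "{label}"')
```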
Avoid Overfitting in Classification
Enhancements to Basic Decision Tree Induction
- Attribute construction
Bayes' Theorem

- Given a data sample X, the posterior probability of a class $C_i$ is $P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}$
- The naive Bayesian classifier assumes class conditional independence among attributes, so that $P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$
Naive Bayesian Classifier: Training Dataset

- Class C1: buys_computer = "yes"; Class C2: buys_computer = "no"
- Training data: the same 14-tuple buys_computer table shown earlier
- Data sample to classify: X = (age <= 30, income = medium, student = yes, credit_rating = fair)
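A sketch of the naive Bayes computation for this X over the training table (not from the source; it computes $P(X \mid C_i)\,P(C_i)$ for each class):

```python
# Rows: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")

for c in ("yes", "no"):
    rows = [r for r in data if r[-1] == c]
    prior = len(rows) / len(data)        # P(C_i)
    likelihood = 1.0                     # P(X | C_i) under independence
    for k, value in enumerate(x):
        likelihood *= sum(r[k] == value for r in rows) / len(rows)
    print(c, round(prior * likelihood, 3))   # yes 0.028, no 0.007

# X is assigned to the class maximizing P(X|C_i)P(C_i): "yes".
```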
Advantages:
- Easy to implement
- Good results obtained in most of the cases

Disadvantages:
- Assumption of class conditional independence, and therefore loss of accuracy
- Practically, dependencies exist among variables. E.g., in hospitals, a patient's profile (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.) are interdependent. Such dependencies cannot be modeled by a naive Bayesian classifier.

How to deal with these dependencies? Bayesian belief networks.
The k-Nearest Neighbor Algorithm

[Figure: a query point xq among positive (+) and negative (−) training examples; its class is determined by its k nearest neighbors]
The distance-weighted variant weights the contribution of each of the k neighbors by its distance to the query point $x_q$, giving greater weight to closer neighbors: $w \equiv \frac{1}{d(x_q, x_i)^2}$
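A sketch of distance-weighted k-NN on toy 2-D data (illustrative, not from the source; labels are +1/−1 and `dist2` is the squared Euclidean distance):

```python
def knn_predict(train, xq, k=3):
    """train: list of ((f1, f2), label). Returns the weighted vote."""
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    neighbors = sorted(train, key=lambda t: dist2(t[0], xq))[:k]
    vote = 0.0
    for point, label in neighbors:
        w = 1.0 / (dist2(point, xq) + 1e-9)  # w = 1 / d(xq, xi)^2
        vote += w * label
    return +1 if vote > 0 else -1

train = [((1, 1), +1), ((1, 2), +1), ((4, 4), -1), ((5, 4), -1)]
print(knn_predict(train, (1.5, 1.5)))  # -> +1 (closest points are +)
```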
Confusion Matrix

Actual class \ Predicted class   C1                     ¬C1
C1                               True Positives (TP)    False Negatives (FN)
¬C1                              False Positives (FP)   True Negatives (TN)

Example of Confusion Matrix:

Actual class \ Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes               6954                 46                  7000
buy_computer = no                412                  2588                3000
Total                            7366                 2634                10000

- Given m classes, an entry CM_i,j in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j
- May have extra rows/columns to provide totals
Classifier Accuracy, or recognition rate: percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN) / All
Error rate: 1 − accuracy, or
  Error rate = (FP + FN) / All

Class imbalance problem:
- One class may be rare, e.g., fraud or HIV-positive
- Significant majority of the negative class and minority of the positive class
- Sensitivity: true positive recognition rate; Sensitivity = TP / P
- Specificity: true negative recognition rate; Specificity = TN / N
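Plugging the example confusion matrix into these formulas (a sketch, not from the source):

```python
TP, FN = 6954, 46     # actual buy_computer = yes (P = 7000)
FP, TN = 412, 2588    # actual buy_computer = no  (N = 3000)
ALL = TP + FN + FP + TN

accuracy = (TP + TN) / ALL       # 0.9542
error_rate = (FP + FN) / ALL     # 0.0458
sensitivity = TP / (TP + FN)     # ~0.9934 (true positive rate)
specificity = TN / (TN + FP)     # ~0.8627 (true negative rate)
print(accuracy, error_rate, round(sensitivity, 4), round(specificity, 4))
```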
What Is Prediction?

- Prediction models continuous-valued functions, chiefly via regression analysis
- Non-linear regression
Predictive Modeling in Databases

- Minimal generalization
- Prediction
- Determine the major factors which influence the prediction: data relevance analysis (uncertainty measurement, entropy analysis, expert judgement, etc.)
- Multi-level prediction: drill-down and roll-up analysis
Linear regression: $Y = \alpha + \beta X$
- The two parameters, $\alpha$ and $\beta$, specify the line and are estimated from the data at hand, by applying the least squares criterion to the known values $Y_1, Y_2, \ldots$ and $X_1, X_2, \ldots$

Multiple regression: $Y = b_0 + b_1 X_1 + b_2 X_2$
- Many nonlinear functions can be transformed into the above.
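A sketch of the least squares estimates for $\alpha$ and $\beta$ in closed form (the function name and data are illustrative, not from the source):

```python
def fit_line(xs, ys):
    """Return (alpha, beta) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
         / sum((x - mean_x) ** 2 for x in xs)
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Hypothetical data lying roughly on the line Y = 1 + 2X:
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
alpha, beta = fit_line(xs, ys)
print(round(alpha, 2), round(beta, 2))  # close to 1 and 2
```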
Log-linear models:
- The multi-way table of joint probabilities is approximated by a product of lower-order tables
- Probability: $p(a, b, c, d) = \alpha_{ab}\, \beta_{ac}\, \chi_{ad}\, \delta_{bcd}$