Chapter 4: Classification & Prediction
Rule-Based Classification: using IF-THEN rules for classification
Example
R1: IF age=youth AND student=yes THEN buys_computer=yes
Terminology
- Rule antecedent (precondition): the IF part of the rule
- Rule consequent: the THEN part of the rule
- Two measures assess the quality of a rule: Coverage and Accuracy
Methodology
- A rule R is assessed against the tuples X of a class-labeled data set D
Consider:
- n_covers: the number of tuples covered by R
- n_correct: the number of tuples correctly classified by R
- |D|: the total number of tuples in D

$$coverage(R) = \frac{n_{covers}}{|D|} \qquad accuracy(R) = \frac{n_{correct}}{n_{covers}}$$
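To make the two measures concrete, here is a minimal Python sketch that evaluates rule R1 from the example above against a small made-up data set (the tuples are illustrative, not from the slides):

```python
# Minimal sketch: coverage and accuracy of rule
# R1: IF age=youth AND student=yes THEN buys_computer=yes
# The data set D below is illustrative only.
D = [
    {"age": "youth",       "student": "yes", "buys_computer": "yes"},
    {"age": "youth",       "student": "yes", "buys_computer": "no"},
    {"age": "youth",       "student": "no",  "buys_computer": "no"},
    {"age": "middle-aged", "student": "yes", "buys_computer": "yes"},
    {"age": "senior",      "student": "no",  "buys_computer": "no"},
]

def antecedent(t):                      # IF part of R1
    return t["age"] == "youth" and t["student"] == "yes"

covered   = [t for t in D if antecedent(t)]
n_covers  = len(covered)
n_correct = sum(t["buys_computer"] == "yes" for t in covered)

print("coverage(R1) =", n_covers / len(D))      # 2/5 = 0.4
print("accuracy(R1) =", n_correct / n_covers)   # 1/2 = 0.5
```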
Important
- Triggering vs. firing: a rule is triggered when a tuple satisfies its antecedent; it fires when it is actually used to assign the class label
Problems
- More than one rule is triggered (conflicting rules)
- No rule is triggered
Solution: use a default rule that fires, for example, the most frequent class
Conflicting Rules
X(age=youth, student=yes, income=low)
R1: IF age=youth AND student=yes THEN buys_computer=yes
R2: IF income=low THEN buys_computer=no
Conflict resolution: use size ordering (prefer the rule with the most attribute tests in its antecedent) or rule ordering (e.g. class-based ordering, as in C4.5)
Rule Extraction from a Decision Tree
- One rule is created for each path from the root to a leaf node
- Each splitting criterion along a given path is logically ANDed to form the rule antecedent (IF part)
- The leaf node holds the class prediction (the rule consequent)
Example:
R1: IF age=youth AND student=no THEN buys_computer=no
R2: IF age=youth AND student=yes THEN buys_computer=yes
R3: IF age=middle-aged THEN buys_computer=yes
R4: IF age=senior AND credit_rating=excellent THEN buys_computer=yes
R5: IF age=senior AND credit_rating=fair THEN buys_computer=no
[Decision tree figure: the root splits on age; the youth branch splits on student (no → buys_computer=no, yes → buys_computer=yes), middle-aged → buys_computer=yes, and the senior branch splits on credit_rating (excellent → buys_computer=yes, fair → buys_computer=no).]
Properties of rules extracted from a decision tree
- Mutually exclusive: no two rules can be triggered by the same tuple
- Exhaustive: there is one rule for each possible attribute-value combination, so every tuple is covered
Note: The order of the rules does not matter when they are extracted from a decision tree
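To illustrate the extraction procedure, here is a minimal Python sketch that walks a hand-coded tree (a nested-tuple representation of the example tree above, chosen only for brevity) and prints one IF-THEN rule per root-to-leaf path:

```python
# Minimal sketch: extract IF-THEN rules from a decision tree.
# Internal node: (attribute, {value: subtree}); leaf node: class label string.
tree = ("age", {
    "youth":       ("student", {"no": "buys_computer=no", "yes": "buys_computer=yes"}),
    "middle-aged": "buys_computer=yes",
    "senior":      ("credit_rating", {"excellent": "buys_computer=yes",
                                      "fair":      "buys_computer=no"}),
})

def extract_rules(node, conditions=()):
    """Return one rule per root-to-leaf path; the conditions are ANDed."""
    if isinstance(node, str):                       # leaf: emit the rule
        return [f"IF {' AND '.join(conditions)} THEN {node}"]
    attribute, branches = node
    rules = []
    for value, subtree in branches.items():
        rules += extract_rules(subtree, conditions + (f"{attribute}={value}",))
    return rules

for i, rule in enumerate(extract_rules(tree), start=1):
    print(f"R{i}: {rule}")      # reproduces rules R1..R5 above
```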
Pruning rules
- Any rule that does not improve the estimated accuracy of the rule set can be pruned
- Pruning may generate rules that are no longer mutually exclusive and no longer exhaustive; C4.5 therefore uses class-based ordering of the pruned rules
[Figure: the same tree drawn with the subtrees under the youth, middle-aged, and senior branches abbreviated as A, B, and C.]
Sequential covering (rule induction)
- Rules are learned one at a time; when a rule is learned, the tuples covered by the rule are removed (rules need to be accurate, but not necessarily of high coverage)
- The process stops when no tuples are left or when the quality measure of a rule falls below a threshold
- Each rule is grown in a general-to-specific manner
Example (learning one rule)
- Start with the most general rule, whose antecedent is empty: IF THEN loan_decision=accept
- Consider each possible attribute test that may be added to the rule
- Adopt a greedy depth-first strategy, choosing at each step the candidate rule with the highest quality (use beam search, where the k best candidate rules are maintained rather than only one)
- Repeat the process until the rule reaches an acceptable quality level, for example:
  IF income=high AND credit_rating=excellent THEN loan_decision=accept
[Figure: training tuples of class accept (a) and reject (r); rule R1 covers a region containing mostly accept tuples, and a second rule R2 is then learned from the remaining tuples.]
Rule quality measures
- Use entropy of the tuples covered by a candidate rule to decide which attribute test to add
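A minimal Python sketch of the greedy learn-one-rule step described above; for simplicity it uses rule accuracy on the covered tuples as the quality measure instead of entropy, and the loan data set is made up for illustration:

```python
# Illustrative class-labeled tuples (not from the slides).
D = [
    {"income": "high", "credit_rating": "excellent", "loan_decision": "accept"},
    {"income": "high", "credit_rating": "fair",      "loan_decision": "accept"},
    {"income": "high", "credit_rating": "fair",      "loan_decision": "reject"},
    {"income": "low",  "credit_rating": "excellent", "loan_decision": "reject"},
    {"income": "low",  "credit_rating": "fair",      "loan_decision": "reject"},
]
CLASS_ATTR, TARGET = "loan_decision", "accept"

def rule_accuracy(tuples):
    """Fraction of covered tuples that belong to the target class."""
    return sum(t[CLASS_ATTR] == TARGET for t in tuples) / len(tuples) if tuples else 0.0

def learn_one_rule(D):
    """Grow one rule general-to-specific, greedily adding the best attribute test."""
    conditions, covered, used = [], D, set()
    while rule_accuracy(covered) < 1.0:
        candidates = {(a, t[a]) for t in covered for a in t
                      if a != CLASS_ATTR and a not in used}
        if not candidates:
            break
        best = max(candidates,
                   key=lambda c: rule_accuracy([t for t in covered if t[c[0]] == c[1]]))
        conditions.append(best)
        used.add(best[0])
        covered = [t for t in covered if t[best[0]] == best[1]]
    antecedent = " AND ".join(f"{a}={v}" for a, v in conditions)
    return f"IF {antecedent} THEN {CLASS_ATTR}={TARGET}", covered

rule, covered = learn_one_rule(D)
print(rule)   # IF income=high AND credit_rating=excellent THEN loan_decision=accept
# In sequential covering, the covered tuples would now be removed and the
# process repeated on the remaining data.
```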
4.6 Prediction
4.7 How to Evaluate and Improve Classification
Lazy learners
- The learner waits until the last minute, before doing any model construction, in order to classify a given test tuple:
  - Store the training tuples
  - Wait for test tuples
  - Perform generalization based on the similarity between the test tuple and the stored training tuples
Eager learners vs. lazy learners
- Eager learners construct a classification model before any test tuple is presented
- Lazy learners do less work when the training tuples are presented and more work when a test tuple is presented
k-Nearest-Neighbor classifiers
- Each tuple is a point in an n-dimensional space: X1 = (x11, ..., x1n), X2 = (x21, ..., x2n)
- Closeness between two tuples is defined by a distance metric, e.g. the Euclidean distance:

$$dist(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$$
Classification
- The unknown tuple is assigned the most common class among its k nearest neighbors
- When k=1, the unknown tuple is assigned the class of the training tuple that is closest to it
- The 1-NN scheme has a misclassification probability that is no worse than twice that of the case where the exact probability density of each class is known
Prediction
- Nearest-neighbor classifiers can also be used for prediction, i.e. to return a real-valued output for an unknown tuple
- The predictor returns the average value of the real-valued labels associated with the k nearest neighbors of the unknown tuple
Example (riding-mower data)
24 households described by Income ($000s); class: Owners of a riding mower = 1, Non-Owners = 2.

RID  Income($000s)  Class     RID  Income($000s)  Class
1    60              1        13   75              2
2    85.5            1        14   52.8            2
3    64.8            1        15   64.8            2
4    61.5            1        16   43.2            2
5    87               1        17   84              2
6    110.1           1        18   49.2            2
7    108             1        19   59.4            2
8    82.8            1        20   66              2
9    69              1        21   47.4            2
10   93              1        22   33              2
11   51              1        23   51              2
12   81              1        24   63              2

We randomly divide the data into 18 training cases and 6 test cases (tuples 6, 7, 12, 14, 19, 20), then use the training cases to classify the test cases and compute the error rates.
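A minimal Python sketch of the k-nearest-neighbor classifier on this data, using Income as the only attribute (so distance is simply the absolute income difference) and the train/test split given above; since the original example may have used more information than income alone, the error rates printed here are only illustrative and need not match the table reported next:

```python
# Minimal k-NN sketch on the riding-mower data (Income only).
incomes = [60, 85.5, 64.8, 61.5, 87, 110.1, 108, 82.8, 69, 93, 51, 81,
           75, 52.8, 64.8, 43.2, 84, 49.2, 59.4, 66, 47.4, 33, 51, 63]
classes = [1] * 12 + [2] * 12                 # Owners = 1, Non-Owners = 2
data = list(zip(range(1, 25), incomes, classes))

test_ids = {6, 7, 12, 14, 19, 20}             # the 6 test cases from the slide
train = [(x, c) for rid, x, c in data if rid not in test_ids]
test  = [(x, c) for rid, x, c in data if rid in test_ids]

def knn_predict(x, train, k):
    """Majority class among the k training tuples closest to x."""
    neighbors = sorted(train, key=lambda t: abs(t[0] - x))[:k]
    labels = [c for _, c in neighbors]
    return max(set(labels), key=labels.count)

for k in (1, 3, 11, 13, 18):
    errors = sum(knn_predict(x, train, k) != c for x, c in test)
    print(f"k={k:2d}  misclassification error = {errors}/{len(test)}")
```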
Values of K
- On the 6 test cases, the misclassification error stays at 33% for small values of K, drops to 17% for K = 11 and K = 13, and rises to 50% for K = 18
- Choose the value of K that minimizes the misclassification error on the test cases: a very small K is sensitive to noise, while a very large K (here K = 18, i.e. all training tuples) ignores the local structure of the data
4.6.1 Definitions
Regression analysis
- A statistical methodology used to model the relationship between one or more independent (predictor) variables and a dependent (response) variable
Linear regression
- The response is modeled as a straight-line function of a single predictor variable:

$$y = b + wx$$

- b: the Y-intercept
- w: the slope of the line
- With regression coefficients this is written $y = \beta_0 + \beta_1 x$
Method of least squares
- Estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line
- Used to solve overdetermined systems (more equations than unknowns)
- Model: $y_i = f(x_i, \beta) = \beta_0 + \beta_1 x_i$
- Given the data set $(x_1, y_1), (x_2, y_2), \ldots, (x_{|D|}, y_{|D|})$, minimize the sum of squared residuals

$$S = \sum_{i=1}^{|D|} r_i^2, \qquad r_i = y_i - f(x_i, \beta)$$

- Setting the partial derivatives to zero,

$$\frac{\partial S}{\partial \beta_j} = 2 \sum_i r_i \frac{\partial r_i}{\partial \beta_j} = 0, \qquad j = 1, \ldots, m$$

  yields the least-squares estimates

$$\beta_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}, \qquad \beta_0 = \bar{y} - \beta_1 \bar{x}$$
Example
- Data points: (1, 6), (2, 5), (3, 7), (4, 10)
- Model: $y = f(x, \beta) = \beta_0 + \beta_1 x$, which gives the overdetermined system

$$\beta_0 + 1\beta_1 = 6, \quad \beta_0 + 2\beta_1 = 5, \quad \beta_0 + 3\beta_1 = 7, \quad \beta_0 + 4\beta_1 = 10$$

- Sum of squared residuals:

$$S = [6 - (\beta_0 + 1\beta_1)]^2 + [5 - (\beta_0 + 2\beta_1)]^2 + [7 - (\beta_0 + 3\beta_1)]^2 + [10 - (\beta_0 + 4\beta_1)]^2$$

- Minimizing S (here $\bar{x} = 2.5$, $\bar{y} = 7$) gives $\beta_1 = 7/5 = 1.4$ and $\beta_0 = 7 - 1.4 \times 2.5 = 3.5$, i.e. the best-fitting line $y = 3.5 + 1.4x$
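A short Python check of this example, computing the least-squares coefficients directly from the closed-form formulas above:

```python
# Least-squares fit of y = b0 + b1*x for the example points.
xs = [1, 2, 3, 4]
ys = [6, 5, 7, 10]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

print(b0, b1)   # 3.5 1.4  ->  best-fitting line y = 3.5 + 1.4x
```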
Multiple linear regression
- Involves more than one predictor variable; the training data have the form $(X_1, y_1), (X_2, y_2), \ldots, (X_{|D|}, y_{|D|})$, where each $X_i$ is a vector of predictor values

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n$$
Nonlinear (polynomial) regression
- A polynomial model such as

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$$

  can be converted to a linear model by introducing new variables

$$x_1 = x, \quad x_2 = x^2, \quad x_3 = x^3 \quad \Rightarrow \quad y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$$

  which is then solvable by the method of least squares
Other regression-based models (generalized linear models) include:
- Poisson regression
- Logistic regression

Logistic Regression
- Based on the logistic (sigmoid) function, an S-shaped curve that maps any real value into the interval (0, 1):

$$f(x) = \frac{e^x}{1 + e^x}$$

- The input is a linear combination of the predictor variables, $x = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k$, so the probability of the positive class is

$$P(Y = 1 \mid x_1, x_2, \ldots, x_k) = \frac{e^{\beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k}}$$
Logistic Regression (parameter estimation)
- The coefficients $\beta_0, \beta_1, \ldots, \beta_p$ are estimated by maximum likelihood: writing $p_j = P(Y = 1 \mid x_{1j}, \ldots, x_{pj}) = \dfrac{e^{\beta_0 + \beta_1 x_{1j} + \ldots + \beta_p x_{pj}}}{1 + e^{\beta_0 + \beta_1 x_{1j} + \ldots + \beta_p x_{pj}}}$ for training tuple j, the likelihood of the data is

$$L(\beta) = \prod_{j=1}^{|D|} p_j^{\,y_j} (1 - p_j)^{1 - y_j}$$
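A minimal Python sketch of these formulas: the logistic function, the class-1 probability for a given coefficient vector, and the log of the likelihood L(β) on a tiny made-up data set (the data and the coefficient values are illustrative, not fitted values from the slides):

```python
import math

def logistic(z):
    # f(z) = e^z / (1 + e^z), computed in the equivalent 1 / (1 + e^-z) form
    return 1.0 / (1.0 + math.exp(-z))

def p_y1(x, beta):
    """P(Y=1 | x) for beta = (b0, b1, ..., bk) and x = (x1, ..., xk)."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return logistic(z)

def log_likelihood(X, y, beta):
    """log L(beta) = sum_j [ y_j*log(p_j) + (1 - y_j)*log(1 - p_j) ]."""
    return sum(yj * math.log(p_y1(xj, beta)) + (1 - yj) * math.log(1 - p_y1(xj, beta))
               for xj, yj in zip(X, y))

# Illustrative data: one predictor, class labels 0/1, and a guessed coefficient vector.
X = [(1.0,), (2.0,), (3.0,), (4.0,)]
y = [0, 0, 1, 1]
beta = (-5.0, 2.0)

print(p_y1((3.0,), beta))          # ~0.73
print(log_likelihood(X, y, beta))  # higher (less negative) means a better fit
```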
Accuracy is better measured using test data that was not used to
build the classifier
Confusion matrix for m classes: entry CM(i,j) is the number of tuples of actual class i that the classifier labeled as class j.

                 Predicted Class1   Predicted Class2   ...   Predicted Classm
Actual Class1    CM(1,1)            CM(1,2)            ...   CM(1,m)
Actual Class2    CM(2,1)            CM(2,2)            ...   CM(2,m)
...
Actual Classm    CM(m,1)            CM(m,2)            ...   CM(m,m)
For a two-class problem (positive class C1, negative class C2):

             Predicted C1       Predicted C2
Actual C1    true positives     false negatives
Actual C2    false positives    true negatives

- Sensitivity (true positive rate), Specificity (true negative rate), and Precision:

$$sens = \frac{t\_pos}{pos} \qquad spec = \frac{t\_neg}{neg} \qquad precision = \frac{t\_pos}{t\_pos + f\_pos}$$

- Accuracy expressed in terms of sensitivity and specificity:

$$accuracy = sens \cdot \frac{pos}{pos + neg} + spec \cdot \frac{neg}{pos + neg}$$

  where pos and neg are the numbers of positive and negative tuples, t_pos and t_neg the numbers of correctly classified positive and negative tuples, and f_pos the number of negative tuples incorrectly labeled as positive.

Predictor error measures (y_i: actual value, y_i': predicted value)

- Mean absolute error: $\dfrac{\sum_{i=1}^{|D|} |y_i - y_i'|}{|D|}$

- Mean squared error: $\dfrac{\sum_{i=1}^{|D|} (y_i - y_i')^2}{|D|}$

- Relative absolute error: $\dfrac{\sum_{i=1}^{|D|} |y_i - y_i'|}{\sum_{i=1}^{|D|} |y_i - \bar{y}|}$

- Relative squared error: $\dfrac{\sum_{i=1}^{|D|} (y_i - y_i')^2}{\sum_{i=1}^{|D|} (y_i - \bar{y})^2}$
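A small Python sketch that evaluates both kinds of measures defined above: the classifier measures from hypothetical confusion-matrix counts, and the predictor error measures using the actual and fitted values from the earlier least-squares example (all numbers are illustrative):

```python
# Classifier measures from hypothetical confusion-matrix counts.
t_pos, f_neg = 90, 10        # actual positives: correctly / incorrectly classified
f_pos, t_neg = 20, 80        # actual negatives: incorrectly / correctly classified
pos, neg = t_pos + f_neg, f_pos + t_neg

sens      = t_pos / pos                                     # sensitivity
spec      = t_neg / neg                                     # specificity
precision = t_pos / (t_pos + f_pos)
accuracy  = sens * pos / (pos + neg) + spec * neg / (pos + neg)
print(sens, spec, precision, accuracy)                      # 0.9 0.8 ~0.818 0.85

# Predictor error measures, using the earlier regression example:
# actual values y and predictions from the fitted line y = 3.5 + 1.4x.
y      = [6, 5, 7, 10]
y_pred = [3.5 + 1.4 * x for x in (1, 2, 3, 4)]              # 4.9, 6.3, 7.7, 9.1
n, y_bar = len(y), sum(y) / len(y)

mae = sum(abs(a - p) for a, p in zip(y, y_pred)) / n        # mean absolute error
mse = sum((a - p) ** 2 for a, p in zip(y, y_pred)) / n      # mean squared error
rae = sum(abs(a - p) for a, p in zip(y, y_pred)) / sum(abs(a - y_bar) for a in y)
rse = sum((a - p) ** 2 for a, p in zip(y, y_pred)) / sum((a - y_bar) ** 2 for a in y)
print(mae, mse, rae, rse)                                   # ~1.0 ~1.05 ~0.67 ~0.30
```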
"
"
Methods for estimating accuracy: holdout, random subsampling, cross-validation, and bootstrap.

Holdout
- The data are randomly partitioned into two independent sets: a training set used to build the model and a test set used to estimate its accuracy
Random Subsampling
- The holdout method is repeated k times; the overall accuracy estimate is the average of the accuracies obtained over the iterations
Cross-validation (k-fold)
- The data are partitioned into k mutually exclusive folds of approximately equal size; training and testing are performed k times, with each fold used exactly once as the test set
- Each sample is used the same number of times for training and once for testing
Cross-validation variants
- Leave-one-out: k is set to the number of tuples, so a single tuple is held out for testing at each iteration
- Stratified cross-validation: the folds are stratified so that the class distribution in each fold is approximately the same as in the full data
Bootstrap
- The training set is sampled uniformly with replacement from the data; the tuples that are never selected form the test set
- Since each tuple has a probability of about $(1 - 1/|D|)^{|D|} \approx 0.368$ of not being chosen, roughly 63.2% of the original tuples end up in the bootstrap sample (the .632 bootstrap)
Bagging
Intuition
- Instead of asking one doctor for a diagnosis (how accurate is that single diagnosis?), the patient asks several doctors and receives diagnosis_1, diagnosis_2, diagnosis_3; the patient then chooses the diagnosis that occurs more often than any of the others (majority vote)
Bagging (bootstrap aggregation)
- From the data, k classifiers M1, M2, ..., Mk are built; to classify a new data sample, their votes are combined into a single prediction
- k iterations: at each iteration a training set Di is sampled with replacement from the original data, and a model Mi is learned from Di
- The combined model M* returns the most frequent class in the case of classification, and the average value in the case of prediction
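A minimal Python sketch of bagging with a deliberately simple base learner (a one-attribute threshold rule, standing in for any real classifier), showing the bootstrap sampling of each Di and the majority vote of M*; the data are made up for illustration:

```python
import random

def train_stump(sample):
    """Base learner: pick the threshold on x that best separates the two classes."""
    best = None
    for thr in sorted({x for x, _ in sample}):
        for lo, hi in ((1, 2), (2, 1)):
            acc = sum((lo if x <= thr else hi) == c for x, c in sample) / len(sample)
            if best is None or acc > best[0]:
                best = (acc, thr, lo, hi)
    _, thr, lo, hi = best
    return lambda x: lo if x <= thr else hi

def bagging(data, k, base_learner):
    """Train k models M1..Mk, each on a bootstrap sample drawn with replacement."""
    models = [base_learner([random.choice(data) for _ in range(len(data))])
              for _ in range(k)]
    def M_star(x):                                 # majority vote of the k models
        votes = [m(x) for m in models]
        return max(set(votes), key=votes.count)
    return M_star

# Illustrative data: class 1 tends to have larger x values than class 2.
data = [(x, 1) for x in (60, 85, 65, 62, 87, 110, 108, 83)] + \
       [(x, 2) for x in (53, 65, 43, 49, 59, 47, 33, 51)]
M = bagging(data, k=11, base_learner=train_stump)
print(M(100), M(40))   # expected: 1 2
```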
Boosting
Intuition
- As in bagging, the patient consults several doctors, but now each diagnosis carries a weight that reflects how accurate that doctor has been in the past, e.g. diagnosis_1 with weight 0.4, diagnosis_2 with weight 0.5, diagnosis_3 with weight 0.1; the final diagnosis is the weighted combination of the individual diagnoses
- In boosting, classifiers are learned iteratively; after classifier Mi is learned, the weights of the training tuples it misclassified are increased so that the next classifier pays more attention to them, and each classifier's vote is weighted by its accuracy
Example: AdaBoost Algorithm
- The error of model Mi is the weighted sum of the errors of the training tuples, where err(Xj) is 1 if tuple Xj is misclassified and 0 otherwise:

$$error(M_i) = \sum_j w_j \cdot err(X_j)$$

- The weight of model Mi's vote is

$$\log \frac{1 - error(M_i)}{error(M_i)}$$
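A compact Python sketch of the AdaBoost bookkeeping described above: the weighted error of each model, the log((1 − error)/error) vote weight, and the reweighting of misclassified tuples; the base learner is a simple threshold rule and the data are made up for illustration:

```python
import math

# Illustrative 1-D data: (x, class) with class labels +1 / -1.
data = [(1, 1), (2, 1), (3, 1), (4, -1), (5, -1), (6, 1), (7, -1), (8, -1)]

def weighted_stump(data, w):
    """Base learner: threshold rule minimizing the weighted training error."""
    best = None
    for thr in range(0, 9):
        for sign in (1, -1):
            err = sum(wi for (x, c), wi in zip(data, w)
                      if (sign if x <= thr else -sign) != c)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    err, thr, sign = best
    return (lambda x: sign if x <= thr else -sign), err

def adaboost(data, rounds):
    n = len(data)
    w = [1.0 / n] * n                              # uniform tuple weights to start
    models = []
    for _ in range(rounds):
        model, error = weighted_stump(data, w)
        error = max(error, 1e-10)                  # guard against division by zero
        alpha = math.log((1 - error) / error)      # weight of this model's vote
        models.append((alpha, model))
        # Increase the weights of misclassified tuples, then renormalize.
        w = [wi * math.exp(alpha) if model(x) != c else wi
             for (x, c), wi in zip(data, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    def M_star(x):                                 # combined model: weighted vote
        return 1 if sum(a * m(x) for a, m in models) > 0 else -1
    return M_star

M = adaboost(data, rounds=5)
print([M(x) for x, _ in data])   # should recover most of the training labels
```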