Supervised Learning
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
Linear regression and gradient descent
Neural networks
K-nearest neighbor
Ensemble methods
Summary
An application of supervised learning
Endless applications of supervised learning.
An emergency room in a hospital measures 17
variables (e.g., blood pressure, heart rate, etc.) of
newly admitted patients.
A decision is needed: whether to put a new patient in
an intensive-care unit (ICU).
Due to the high cost of the ICU, patients who may
survive less than a month are given higher priority.
Problem: to predict high-risk patients and
discriminate them from low-risk patients.
Another application classifies loan applications into
Yes (approved) and
No (not approved).
What is the class for the following applicant/case? No.
$entropy(D) = -\sum_{j=1}^{|C|} \Pr(c_j)\log_2 \Pr(c_j)$, where $\sum_{j=1}^{|C|} \Pr(c_j) = 1$.

After partitioning $D$ on attribute $A_i$ into disjoint subsets $D_1, \ldots, D_v$:

$$entropy_{A_i}(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times entropy(D_j)$$
Age     Yes  No  entropy(Di)
young    2    3    0.971
middle   3    2    0.971
old      4    1    0.722

$$entropy_{Age}(D) = \frac{5}{15} \times entropy(D_1) + \frac{5}{15} \times entropy(D_2) + \frac{5}{15} \times entropy(D_3)$$
$$= \frac{5}{15} \times 0.971 + \frac{5}{15} \times 0.971 + \frac{5}{15} \times 0.722 = 0.888$$
Own_house is a better
choice for the root.
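As a sanity check, here is a minimal Python sketch of the entropy_Age(D) computation above; the function names are mine, and the class counts come from the Age table.

```python
import math

def entropy(counts):
    """Entropy of a class distribution given per-class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# (Yes, No) counts for each value of Age, from the table above.
partitions = {"young": (2, 3), "middle": (3, 2), "old": (4, 1)}
n = sum(sum(c) for c in partitions.values())  # 15 examples in total

# entropy_Age(D): entropies of the partitions, weighted by |Dj| / |D|.
entropy_age = sum(sum(c) / n * entropy(c) for c in partitions.values())
print(round(entropy_age, 3))  # 0.888, matching the worked example
```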
Efficiency
time to construct the model
time to use the model
Robustness: handling noise and missing values
Scalability: efficiency when the data is large
Interpretability: how understandable the model is
and what insight it provides.
Compactness of the model: size of the tree, or
the number of rules.
Evaluation methods
Holdout set: The available data set D is divided into
two disjoint subsets,
the training set Dtrain (for learning a model)
the test set Dtest (for testing the model)
Important: training set should not be used in testing
and the test set should not be used in learning.
The unseen test set provides an unbiased estimate of accuracy.
The test set is also called the holdout set. (The
examples in the original data set D are all labeled
with classes.)
This method is used when the data set D is large.
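A minimal sketch of such a split, assuming a 70/30 train/test division (the fraction is an arbitrary illustrative choice):

```python
import random

def holdout_split(D, test_fraction=0.3, seed=0):
    """Randomly split labeled data D into disjoint train and test sets."""
    D = D[:]                      # copy so the caller's data is not reordered
    random.Random(seed).shuffle(D)
    cut = int(len(D) * (1 - test_fraction))
    return D[:cut], D[cut:]       # (Dtrain, Dtest)
```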
Evaluation methods (cont…)
n-fold cross-validation: The available data is
partitioned into n equal-size disjoint subsets.
Use each subset as the test set and combine the
rest n-1 subsets as the training set to learn a
classifier.
The procedure is run n times, which gives n accuracies.
The final estimated accuracy of learning is the
average of the n accuracies.
10-fold and 5-fold cross-validations are commonly
used.
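A minimal sketch of this procedure; `train_and_test` is a hypothetical callback that learns a classifier on the training folds and returns its accuracy on the test fold.

```python
def cross_validation_accuracy(D, n, train_and_test):
    """n-fold cross-validation: average accuracy over n disjoint test folds.

    D should already be shuffled. train_and_test(train, test) must return
    the accuracy of a classifier learned on `train`, evaluated on `test`.
    """
    folds = [D[i::n] for i in range(n)]   # n roughly equal-size disjoint folds
    accs = []
    for i in range(n):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        accs.append(train_and_test(train, test))
    return sum(accs) / n                  # final estimated accuracy
```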
Evaluation methods (cont…)
Leave-one-out cross-validation:
used when the data set is very small.
a special case of cross-validation
Each fold of the cross validation has only a
single test example and all the rest of the
data is used in training.
If the original data has m examples, this is m-fold
cross-validation.
$$p = \frac{TP}{TP + FP}. \qquad r = \frac{TP}{TP + FN}.$$
Precision p is the number of correctly classified
positive examples divided by the total number of
examples that are classified as positive.
Recall r is the number of correctly classified positive
examples divided by the total number of actual
positive examples in the test set.
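A minimal sketch of these two formulas; the counts in the example call are made up for illustration.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from confusion-matrix counts."""
    p = tp / (tp + fp) if tp + fp else 0.0  # correct positives / predicted positives
    r = tp / (tp + fn) if tp + fn else 0.0  # correct positives / actual positives
    return p, r

print(precision_recall(tp=8, fp=2, fn=4))  # (0.8, 0.666...)
```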
An example
Goal: find the class $c_j$ such that $\Pr(C = c_j \mid A_1 = a_1, \ldots, A_{|A|} = a_{|A|})$ is maximal.
Question: Can we estimate this probability directly,
without using a decision tree or a list of rules?
Apply Bayes’ Rule
$$\Pr(C = c_j \mid A_1 = a_1, \ldots, A_{|A|} = a_{|A|}) = \frac{\Pr(A_1 = a_1, \ldots, A_{|A|} = a_{|A|} \mid C = c_j)\,\Pr(C = c_j)}{\Pr(A_1 = a_1, \ldots, A_{|A|} = a_{|A|})}$$

$$= \frac{\Pr(A_1 = a_1, \ldots, A_{|A|} = a_{|A|} \mid C = c_j)\,\Pr(C = c_j)}{\sum_{r=1}^{|C|} \Pr(A_1 = a_1, \ldots, A_{|A|} = a_{|A|} \mid C = c_r)\,\Pr(C = c_r)}$$
We are done!
How do we estimate $\Pr(A_i = a_i \mid C = c_j)$? Easy!
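A minimal sketch of this counting approach for categorical attributes, assuming hard class labels; the function names and data layout are mine, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (attribute_tuple, class_label) pairs."""
    class_counts = Counter(c for _, c in examples)
    cond = defaultdict(Counter)          # (attr index, class) -> value counts
    for attrs, c in examples:
        for i, a in enumerate(attrs):
            cond[(i, c)][a] += 1
    return class_counts, cond, len(examples)

def classify_nb(attrs, class_counts, cond, n):
    """Pick c_j maximizing Pr(c_j) * prod_i Pr(A_i = a_i | C = c_j)."""
    def score(c):
        s = class_counts[c] / n          # Pr(c_j) from class frequencies
        for i, a in enumerate(attrs):
            s *= cond[(i, c)][a] / class_counts[c]  # Pr(A_i = a_i | c_j)
        return s
    return max(class_counts, key=score)
```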
where $N_{ti}$ is the number of times that word $w_t$ occurs in document $d_i$, with $\sum_{t=1}^{|V|} N_{ti} = |d_i|$ and

$$\sum_{t=1}^{|V|} \Pr(w_t \mid c_j; \Theta) = 1. \qquad (25)$$

The word probabilities are estimated from the counts:

$$\Pr(w_t \mid c_j; \hat{\Theta}) = \frac{\sum_{i=1}^{|D|} N_{ti} \Pr(c_j \mid d_i)}{\sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N_{si} \Pr(c_j \mid d_i)}. \qquad (26)$$
In order to handle 0 counts for infrequently occurring
words that do not appear in the training set but may
appear in the test set, we need to smooth the
probability. With Lidstone smoothing, $0 \le \lambda \le 1$:

$$\Pr(w_t \mid c_j; \hat{\Theta}) = \frac{\lambda + \sum_{i=1}^{|D|} N_{ti} \Pr(c_j \mid d_i)}{\lambda |V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N_{si} \Pr(c_j \mid d_i)}. \qquad (27)$$
Class prior probabilities:

$$\Pr(c_j \mid \hat{\Theta}) = \frac{\sum_{i=1}^{|D|} \Pr(c_j \mid d_i)}{|D|} \qquad (28)$$
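A minimal sketch of equations (27) and (28) under the simplifying assumption of hard labels, i.e. Pr(cj | di) is 0 or 1; the names and the lambda default of 0.1 are illustrative choices, not from the slides.

```python
from collections import Counter

def train_text_nb(docs, labels, vocab, lam=0.1):
    """Naive Bayes for text with Lidstone smoothing.

    docs: list of token lists; labels: one hard class label per doc;
    vocab: set of all words; lam: smoothing parameter, 0 <= lam <= 1.
    """
    classes = set(labels)
    # Eq. (28) with hard labels: fraction of documents in each class.
    prior = {c: labels.count(c) / len(labels) for c in classes}
    word_prob = {}
    for c in classes:
        counts = Counter(w for d, l in zip(docs, labels) if l == c for w in d)
        total = sum(counts.values())     # total word occurrences in class c
        # Eq. (27): (lam + count of w_t in class c) / (lam*|V| + total).
        word_prob[c] = {w: (lam + counts[w]) / (lam * len(vocab) + total)
                        for w in vocab}
    return prior, word_prob
```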
only require dot products $\phi(\mathbf{x}) \cdot \phi(\mathbf{z})$ and never the mapped
vector $\phi(\mathbf{x})$ in its explicit form. This is a crucial point.
Thus, if we have a way to compute the dot product
$\phi(\mathbf{x}) \cdot \phi(\mathbf{z})$ using the input vectors $\mathbf{x}$ and $\mathbf{z}$ directly,
there is no need to know the feature vector $\phi(\mathbf{x})$ or even the mapping $\phi$ itself.
In SVM, this is done through the use of kernel
functions, denoted by K:

$$K(\mathbf{x}, \mathbf{z}) = \phi(\mathbf{x}) \cdot \phi(\mathbf{z}) \qquad (82)$$
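A quick numerical check of this idea, using the degree-2 polynomial kernel K(x, z) = (x · z)² and its explicit feature map as an illustrative special case of eq. (82):

```python
import math

def phi(x):
    """Explicit degree-2 feature map for 2-D input: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def K(x, z):
    """Degree-2 polynomial kernel, computed from x and z directly."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, 0.5)
lhs = K(x, z)                                     # kernel on the inputs
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))  # dot product in feature space
print(lhs, rhs)  # both 16.0: the kernel equals the feature-space dot product
```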
kNN is simple and works quite well in many applications.
Try many distance functions and data pre-processing
methods, e.g., as in the sketch below.
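A minimal kNN sketch with a pluggable distance function; the names and the squared-Euclidean default are my own choices.

```python
from collections import Counter

def knn_classify(x, train, k=3,
                 dist=lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))):
    """Classify x by majority vote among its k nearest training examples.

    train: list of (vector, label) pairs. dist: any distance function;
    squared Euclidean by default, but trying others is often worthwhile.
    """
    nearest = sorted(train, key=lambda ex: dist(x, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```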
(Figure: a test point among labeled documents; Pr(science | ●) = ? is estimated from the point's nearest neighbors.)
Original:       1 2 3 4 5 6 7 8
Training set 1: 2 7 8 3 7 6 3 1
Training set 2: 7 8 5 6 4 2 7 1
Training set 3: 3 6 2 7 5 6 2 2
Training set 4: 4 5 1 4 6 4 3 8
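A sketch of how such bootstrap training sets are drawn, sampling |D| examples with replacement; the random draws will not reproduce the table's rows exactly.

```python
import random

def bootstrap_sample(D, rng):
    """Sample |D| examples from D uniformly with replacement (one bagging round)."""
    return [rng.choice(D) for _ in D]

rng = random.Random(0)
original = [1, 2, 3, 4, 5, 6, 7, 8]
for t in range(4):
    print(f"Training set {t + 1}:", bootstrap_sample(original, rng))
```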
Boosting maintains a weighted training set
(x1, y1, w1)
(x2, y2, w2)
…
(xn, yn, wn)
with non-negative weights that sum to 1 (wi = 1/n initially).
In each round: build a classifier ht whose accuracy on the
weighted training set is > ½ (better than random), then change the weights.
(Figures: error comparisons of Bagged C4.5 vs. C4.5, Boosted C4.5 vs. C4.5, and Boosting vs. Bagging.)
Training
for t = 1 … T (see the sketch below)
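The slide's pseudocode is truncated here; below is a minimal AdaBoost-style sketch using the usual weight-update rule, where `learn` is a hypothetical weak-learner callback.

```python
import math

def adaboost(examples, learn, T):
    """AdaBoost training sketch. learn(examples, w) must return a weak
    classifier h (x -> {-1, +1}) with weighted training error < 1/2.

    examples: list of (x, y) with y in {-1, +1}; returns (classifiers, alphas).
    """
    n = len(examples)
    w = [1.0 / n] * n                     # wi = 1/n initially
    hs, alphas = [], []
    for t in range(T):
        h = learn(examples, w)
        err = sum(wi for wi, (x, y) in zip(w, examples) if h(x) != y)
        if err == 0 or err >= 0.5:        # degenerate cases: stop early
            break
        alpha = 0.5 * math.log((1 - err) / err)
        # Increase weights of misclassified examples, decrease the rest.
        w = [wi * math.exp(-alpha * y * h(x)) for wi, (x, y) in zip(w, examples)]
        z = sum(w)
        w = [wi / z for wi in w]          # renormalize so weights sum to 1
        hs.append(h)
        alphas.append(alpha)
    return hs, alphas
```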