Unit V - Classification and Prediction 2020-21
Classification and Prediction
Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by Neural Networks
Classification based on concepts from association rule mining
Other Classification Methods
Prediction
Classification accuracy
Summary
Classification vs. Prediction
Classification:
predicts categorical class labels (discrete or nominal)
Prediction:
models continuous-valued functions, i.e., predicts unknown or missing numeric values
Typical applications: target marketing, medical diagnosis
Classification—A Two-Step Process
1. Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined
class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, or mathematical formulae
2. Model usage: for classifying future or unknown objects
Estimate the accuracy of the model on a test set (independent of the training set) before applying it to new data
Classification Process (1): Model Construction
The training data is fed to a classification algorithm, which constructs the classifier (the learned model).
Classification Process (2): Use the Model in Prediction
The classifier is then applied to testing data to estimate accuracy, and finally to unseen data.
Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Unseen data: (Jeff, Professor, 4), Tenured?
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: The training data are accompanied by
labels indicating the class of the observations
New data is classified based on the training set
Issues regarding classification and prediction:
1. Data Preparation
A. Data cleaning
Preprocess data in order to reduce noise and handle missing values
B. Relevance analysis (feature selection)
Remove irrelevant or redundant attributes
C. Data transformation
Generalize and/or normalize data. Concept hierarchies can be used.
Issues regarding classification and prediction:
2. Comparing Classification Methods
1. Predictive accuracy:
ability of the model to correctly predict the class label of
new/unseen data.
2. Speed
time to construct the model
time to use the model
3. Robustness
handling noise and missing values
4. Scalability
efficiency of the model for large databases
5. Interpretability:
understanding the insight provided by the model
6. Goodness of rules
decision tree size
compactness of classification rules
Output: A Decision Tree for “buys_computer”
age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit_rating?
    excellent: no
    fair: yes
Basic algorithm for inducing a decision tree from training tuples
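A minimal Python sketch of the basic greedy, top-down induction strategy described here, assuming categorical attributes and information-gain attribute selection (defined on the next slide); function and variable names are illustrative, not taken from the original algorithm listing.

```python
# Minimal ID3-style decision-tree induction sketch for categorical attributes.
# rows: list of dicts mapping attribute name -> value; labels: class labels.
import math
from collections import Counter

def entropy(labels):
    """I(s1,...,sm): expected information needed to classify a tuple."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain(A) = I(S) - E(A), where E(A) weights the entropy of each partition."""
    total = len(labels)
    e_a = 0.0
    for value in set(r[attr] for r in rows):
        part = [c for r, c in zip(rows, labels) if r[attr] == value]
        e_a += len(part) / total * entropy(part)
    return entropy(labels) - e_a

def induce_tree(rows, labels, attrs):
    if len(set(labels)) == 1:          # all tuples in one class -> leaf
        return labels[0]
    if not attrs:                      # no attributes left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {best: {}}
    for value in set(r[best] for r in rows):   # partition on each value and recurse
        sub = [(r, c) for r, c in zip(rows, labels) if r[best] == value]
        node[best][value] = induce_tree([r for r, _ in sub],
                                        [c for _, c in sub],
                                        [a for a in attrs if a != best])
    return node
```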
Attribute Selection Measure: Information Gain (ID3/C4.5)
Select the attribute with the highest information gain.
Let S contain si tuples of class Ci, for i = 1, …, m.
Information measure (info required to classify any arbitrary tuple):
$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$
Entropy of attribute A with values {a1, a2, …, av}:
$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s}\, I(s_{1j}, \ldots, s_{mj})$
Information gained by branching on attribute A:
$Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)$
Attribute Selection by Information Gain Computation
Class P: buys_computer = “yes”
Class N: buys_computer = “no”
$I(p, n) = I(9, 5) = 0.940$
Compute the entropy for age:
age     pi  ni  I(pi, ni)
<=30    2   3   0.971
30…40   4   0   0
>40     3   2   0.971
$E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$
$\frac{5}{14} I(2,3)$ means “age <= 30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence
$Gain(age) = I(p, n) - E(age) = 0.246$
Similarly,
$Gain(income) = 0.029$
$Gain(student) = 0.151$
$Gain(credit\_rating) = 0.048$
Training data:
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
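As a check on the figures above, here is a small Python sketch that recomputes the information gains from the buys_computer table; the code and names are illustrative, not part of the original slides.

```python
# Recompute Gain(age), Gain(income), Gain(student), Gain(credit_rating)
# for the 14-tuple buys_computer training data shown above.
import math
from collections import Counter

data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
attrs = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def I(labels):
    """Expected information I(s1,...,sm) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(attr):
    """Gain(A) = I(p, n) - E(A)."""
    col, cls = attrs[attr], [row[-1] for row in data]
    E = sum(len(p) / len(data) * I(p)
            for v in set(row[col] for row in data)
            for p in [[row[-1] for row in data if row[col] == v]])
    return I(cls) - E

for a in attrs:
    print(a, round(gain(a), 3))   # age 0.246, income 0.029, student 0.151, credit_rating 0.048
```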
Tree Pruning: Avoid Overfitting in Classification
Overfitting: An induced tree may overfit the training
data
Too many branches, some may reflect anomalies due
to noise or outliers
Poor accuracy for unseen samples
Approaches to Determine
the Final Tree Size
Enhancements to basic decision
tree induction
Allow for continuous-valued attributes
Dynamically define new discrete-valued attributes
that partition the continuous attribute value into a
discrete set of intervals
Handle missing attribute values
Assign the most common value of the attribute
Assign probability to each of the possible values
Attribute construction
Create new attributes based on existing ones that
are sparsely represented
This reduces fragmentation, repetition, and
replication
Classification in Large Databases
Classification—a classical problem extensively studied by
statisticians and machine learning researchers
Scalability: Classifying data sets with millions of examples
and hundreds of attributes with reasonable speed
Why decision tree induction in data mining?
relatively faster learning speed (than other
classification methods)
convertible to simple and easy to understand
classification rules
can use SQL queries for accessing databases
Scalable Decision Tree Induction Methods
in Data Mining Studies
SLIQ
builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT
constructs an attribute list data structure
PUBLIC
integrates tree splitting and tree pruning: stop growing the
tree earlier
RainForest
separates the scalability aspects from the criteria that
determine the quality of the tree
builds an AVC-list (attribute, value, class label)
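For illustration, a minimal Python sketch of the AVC-set idea used by RainForest: for each attribute at a tree node, keep only counts of (attribute value, class label) pairs instead of the raw tuples. The function name and data layout are assumptions, not RainForest's actual implementation.

```python
# Build an AVC-set (Attribute, Value, Class-label counts) for one attribute
# at a tree node; rows are tuples with the class label in the last position.
from collections import defaultdict

def avc_set(rows, attr_index, class_index=-1):
    counts = defaultdict(int)
    for row in rows:
        counts[(row[attr_index], row[class_index])] += 1
    return dict(counts)

# e.g. on the buys_computer table, avc_set(data, 0) would yield
# {("<=30", "no"): 3, ("<=30", "yes"): 2, ("31..40", "yes"): 4, ...}
```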
Bayesian Theorem: Basics
Let
X be a data sample whose class label is unknown
H be the hypothesis that X belongs to class C
Given the training data X, the posterior probability of a hypothesis H, P(H|X), follows the Bayes theorem:
$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$
Informally, this can be written as
posterior = likelihood x prior / evidence
Naive Bayesian classification
Suppose
1. Each data sample, X = (x1, x2, …, xn), depicts n measurements on n attributes A1, A2, …, An, respectively.
2. There are m classes, C1, C2, …, Cm.
By Bayes theorem,
$P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}$
With the naive assumption of class-conditional independence, the probability of X conditioned on Ci is given by
$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$
Training dataset
Class C1: buys_computer = ‘yes’
Class C2: buys_computer = ‘no’
Data sample (unknown class):
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Naïve Bayesian Classifier: Example
Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Compute the prior probability P(Ci) for each class:
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357
Then compute P(X|Ci) for each class and assign X to the class that maximizes P(X|Ci) P(Ci).
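A small Python sketch that completes this example under the class-conditional independence assumption, with conditional probabilities counted from the training table above; the variable names and the final comparison are illustrative, not part of the original slide.

```python
# Naive Bayesian classification of X = (age<=30, income=medium, student=yes,
# credit_rating=fair) using counts read off the buys_computer training table.
# P(X|Ci) is the product of per-attribute conditional probabilities
# (class-conditional independence assumption).

p_yes, p_no = 9 / 14, 5 / 14                       # class priors
# conditional probabilities counted from the table (age, income, student, credit)
px_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)     # P(X | buys_computer = yes)
px_no = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)      # P(X | buys_computer = no)

score_yes = px_yes * p_yes                         # approx. 0.028
score_no = px_no * p_no                            # approx. 0.007
print("yes" if score_yes > score_no else "no")     # X is classified as "yes"
```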
Naïve Bayesian Classifier: Comments
Advantages :
1. Easy to implement
2. Good results obtained in most of the cases
Disadvantages
1. Assumption: class-conditional independence, which can cause a loss of accuracy
2. In practice, dependencies exist among variables
   e.g., hospital patient data: profile attributes (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.)
3. Dependencies among these cannot be modeled by the Naïve Bayesian Classifier
Bayesian Networks
A Bayesian (belief) network is a graphical model of causal relationships; it gives a specification of the joint probability distribution.
Nodes: random variables
Links: dependency
Example: X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P.
The graph has no loops or cycles.
Bayesian Belief Network: An Example
The example network relates the variables FamilyHistory (FH), Smoker (S), LungCancer (LC), PositiveXRay, and Dyspnoea.
Conditional probability table (CPT) for LungCancer, given its parents FamilyHistory and Smoker:
      (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC     0.8      0.5       0.7       0.1
~LC    0.2      0.5       0.3       0.9
The joint probability of a tuple (z1, …, zn) corresponding to the attributes Z1, …, Zn is
$P(z_1, \ldots, z_n) = \prod_{i=1}^{n} P(z_i \mid Parents(Z_i))$
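A brief Python sketch of evaluating this product for a three-variable fragment of the network; the CPT values for LungCancer come from the table above, while the priors for FamilyHistory and Smoker are made-up illustrative values.

```python
# Joint probability P(z1,...,zn) = product over i of P(zi | Parents(Zi))
# for the fragment FamilyHistory, Smoker -> LungCancer.
p_fh = {True: 0.3, False: 0.7}             # assumed prior, not from the slides
p_s = {True: 0.4, False: 0.6}              # assumed prior, not from the slides
p_lc = {(True, True): 0.8, (True, False): 0.5,
        (False, True): 0.7, (False, False): 0.1}   # P(LC = yes | FH, S) from the CPT

def joint(fh, s, lc):
    p = p_fh[fh] * p_s[s]                  # FH and S have no parents
    p_lc_given = p_lc[(fh, s)]             # LC is conditioned on its parents
    return p * (p_lc_given if lc else 1 - p_lc_given)

print(joint(fh=True, s=True, lc=True))     # 0.3 * 0.4 * 0.8 = 0.096
```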
Learning Bayesian Networks
Several cases
1. Given both the network structure and all variables
observable: learn only the CPTs
Neural Network
It is a set of connected input/output units where each connection has a weight associated with it.
During the learning phase, the network learns by adjusting the
weights so as to be able to predict the correct class label of the
input samples.
Multilayer Feed-forward Neural Network
A training sample, X = (x1, x2, …, xi), is fed to the input layer.
The weighted outputs of these units are, in turn, fed simultaneously to a
second layer, known as a hidden layer.
The hidden layer's weighted outputs can be input to another hidden layer,
and so on. The number of hidden layers is arbitrary, although in practice,
usually only one is used.
The weighted outputs of the last hidden layer are input to units making up
the output layer, which emits the network's prediction for given samples.
Weighted connections exist between consecutive layers, where wij denotes the weight of the connection from unit i in one layer to unit j in the next layer.
Benefits and drawbacks of Neural Networks
The network is feed-forward in that none of the weights cycle
back to an input unit or to an output unit of a previous layer.
Benefits:
include their high tolerance to noisy data
able to classify patterns on which they have not been
trained.
Disadvantages:
involve long training times, so they are more suitable for applications where long training is feasible
need to know the network topology or “structure”
poor interpretability, as it is difficult for humans to interpret the symbolic meaning behind the learned weights
Defining a network topology
Before training can begin, the user must decide on
the network topology by specifying
1. the number of units in the input layer,
2. the number of hidden layers (if more than one),
3. the number of units in each hidden layer, and
4. the number of units in the output layer.
Backpropagation
“What is backpropagation?"
It is a neural network learning algorithm.
Multi-Layer Perceptron: Backpropagation Steps
The input vector (x1, …, xn) is fed to the input nodes; weighted connections wij lead to the hidden nodes and then to the output nodes, which emit the output vector.
Given a unit j in a hidden or output layer, the net input Ij to unit j is
$I_j = \sum_i w_{ij} O_i + \theta_j$
where $\theta_j$ is the bias (threshold) of unit j and Oi is the output of unit i in the previous layer.
Given the net input Ij to unit j, the output Oj of unit j is computed as
$O_j = \frac{1}{1 + e^{-I_j}}$
Backpropagate the error: the error is propagated backwards by updating the weights and biases to reflect the error of the network’s prediction. For a unit j in the output layer (with true output Tj),
$Err_j = O_j (1 - O_j)(T_j - O_j)$
For a unit j in a hidden layer, where k ranges over the units in the next layer,
$Err_j = O_j (1 - O_j) \sum_k Err_k\, w_{jk}$
Weights and biases are then updated with learning rate l:
$w_{ij} = w_{ij} + (l)\, Err_j\, O_i$
$\theta_j = \theta_j + (l)\, Err_j$
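A minimal Python sketch of one backpropagation step on a tiny 2-2-1 network, following the update rules above; the initial weights, biases, learning rate, and training sample are illustrative values, not from the slides.

```python
# One forward and backward pass for a 2-input, 2-hidden, 1-output sigmoid network.
import math

def sigmoid(i):
    return 1.0 / (1.0 + math.exp(-i))

l = 0.9                                   # learning rate
x = [1.0, 0.0]; t = 1.0                   # training sample and true output T
w_h = [[0.2, -0.3], [0.4, 0.1]]           # w_h[j][i]: input unit i -> hidden unit j
b_h = [-0.4, 0.2]                         # hidden biases theta_j
w_o = [-0.3, -0.2]                        # hidden unit j -> output unit
b_o = 0.1                                 # output bias

# Forward pass: I_j = sum_i w_ij O_i + theta_j, O_j = sigmoid(I_j)
o_h = [sigmoid(sum(w_h[j][i] * x[i] for i in range(2)) + b_h[j]) for j in range(2)]
o_out = sigmoid(sum(w_o[j] * o_h[j] for j in range(2)) + b_o)

# Backward pass: error of the output unit, then of the hidden units
err_out = o_out * (1 - o_out) * (t - o_out)
err_h = [o_h[j] * (1 - o_h[j]) * err_out * w_o[j] for j in range(2)]

# Updates: w_ij += l * Err_j * O_i, theta_j += l * Err_j
for j in range(2):
    w_o[j] += l * err_out * o_h[j]
    b_h[j] += l * err_h[j]
    for i in range(2):
        w_h[j][i] += l * err_h[j] * x[i]
b_o += l * err_out
```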
Network Training
The ultimate objective of training:
obtain a set of weights that makes almost all the tuples in the training data classified correctly
Steps:
Initialize the weights with random values
Association-Based Classification
Associative classification: mines high-support, high-confidence rules of the form “cond_set => y”, where y is a class label
Other Classification Methods
k-nearest neighbor classifier
case-based reasoning
Genetic algorithm
Rough set approach
Fuzzy set approaches
Instance-Based (Lazy Learners) Methods
Instance-based learning:
Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified
Typical approaches
k-nearest neighbor approach
Case-based reasoning
Uses symbolic representations and knowledge-based inference
Remarks on Lazy vs. Eager Learning
k-nearest neighbor and case-based reasoning: lazy evaluation
Decision-tree induction and Bayesian classification: eager evaluation
Key differences: a lazy learner does less work at training time and more work when classifying a new tuple, while an eager learner commits to a single model before seeing new data.
The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-D space.
The nearest neighbors are defined in terms of Euclidean distance:
$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
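A short Python sketch of a k-nearest-neighbor classifier using this Euclidean distance; the training points and helper names are illustrative.

```python
# Classify a query point by majority vote among its k nearest training points.
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(query, training, k=3):
    # training is a list of (point, class_label) pairs
    nearest = sorted(training, key=lambda pc: euclidean(query, pc[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1.0, 1.0), "yes"), ((1.2, 0.8), "yes"), ((5.0, 5.0), "no"), ((4.8, 5.1), "no")]
print(knn_classify((1.1, 1.0), train, k=3))   # -> "yes"
```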
Case-Based Reasoning
Uses: lazy evaluation + analyze similar instances
Instances are not “points in a Euclidean space”
Methodology
Instances represented by rich symbolic descriptions,
or as cases.
Earlier similar cases are searched for; multiple retrieved cases may be combined
Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
Research issues
Indexing based on syntactic similarity measures and, when retrieval fails, backtracking and adapting to additional cases
Rough Set Approach
Rough sets are used to approximately or “roughly” define equivalence classes
Fuzzy Set Approaches
Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as using a fuzzy membership graph)
For a given new sample, more than one fuzzy value may apply
Regression Analysis and Log-Linear Models in Prediction
Linear regression: $Y = \alpha + \beta X$
The two parameters, $\alpha$ and $\beta$, specify the line and are to be estimated from the data at hand, e.g., by the method of least squares applied to the known values Y1, Y2, … and X1, X2, ….
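A minimal Python sketch estimating alpha and beta by the method of least squares; the sample data points are illustrative.

```python
# Least-squares fit of the linear model Y = alpha + beta * X.
def fit_line(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
           / sum((x - x_bar) ** 2 for x in xs)
    alpha = y_bar - beta * x_bar
    return alpha, beta

alpha, beta = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
print(alpha, beta)   # roughly alpha ≈ 0.15, beta ≈ 1.94
```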
Non-linear models:
Polynomial regression can be modeled by adding polynomial terms to the basic linear model.
Classification Accuracy: Estimating Error Rates
1. Partition: training-and-testing
use two independent data sets, e.g., training set (2/3) and test set (1/3)
useful for data sets with a large number of samples
2. K-fold Cross-validation
divide the data set into k subsamples
use k-1 subsamples as training data and one sub-sample as test
data
Useful for data set with moderate size
3. Bootstrapping (leave-one-out)
for small size data
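A small Python sketch of k-fold cross-validation as described above; train_classifier and accuracy are placeholders for any classifier and evaluation function, not specific APIs.

```python
# Split the data into k subsamples; train on k-1 of them, test on the remaining
# one, and average the accuracy over the k rounds.
def k_fold_cv(rows, labels, k, train_classifier, accuracy):
    n = len(rows)
    fold = n // k
    scores = []
    for i in range(k):
        test_idx = set(range(i * fold, (i + 1) * fold if i < k - 1 else n))
        train_x = [r for j, r in enumerate(rows) if j not in test_idx]
        train_y = [c for j, c in enumerate(labels) if j not in test_idx]
        test_x = [r for j, r in enumerate(rows) if j in test_idx]
        test_y = [c for j, c in enumerate(labels) if j in test_idx]
        model = train_classifier(train_x, train_y)
        scores.append(accuracy(model, test_x, test_y))
    return sum(scores) / k
```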
Increasing Accuracy: Bagging and Boosting
General idea:
Training data -> Classification method (CM) -> Classifier C
Altered training data -> CM -> Classifier C1
Altered training data -> CM -> Classifier C2
……
Aggregation of their predictions -> Combined classifier C*
Bagging
Given a set S of s samples
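A hedged Python sketch of the bagging idea: draw bootstrap samples of S with replacement, train one classifier per sample, and combine predictions by majority vote; train_classifier and predict are placeholders, not part of the original slide.

```python
# Bagging: t bootstrap samples of size s, one classifier per sample, majority vote.
import random
from collections import Counter

def bagging(rows, labels, t, train_classifier):
    models = []
    s = len(rows)
    for _ in range(t):
        idx = [random.randrange(s) for _ in range(s)]   # bootstrap sample (with replacement)
        models.append(train_classifier([rows[i] for i in idx],
                                       [labels[i] for i in idx]))
    return models

def bagged_predict(models, x, predict):
    votes = [predict(m, x) for m in models]             # ask every classifier
    return Counter(votes).most_common(1)[0][0]          # majority vote gives C*
```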
Boosting Technique — Algorithm
For t = 1, 2, …, T Do
Obtain a hypothesis (classifier) h(t) under w(t)
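A hedged Python sketch of the boosting loop; the slide does not spell out the reweighting rule, so an AdaBoost-style update is used here as one concrete choice, with train_weighted and predict as placeholders for a weak learner.

```python
# Boosting loop: train h(t) under weights w(t), measure its weighted error, and
# increase the weights of misclassified samples so the next round focuses on them.
import math

def boost(rows, labels, T, train_weighted, predict):
    n = len(rows)
    w = [1.0 / n] * n                          # initial weights w(1)
    hypotheses = []
    for _ in range(T):
        h = train_weighted(rows, labels, w)    # obtain hypothesis h(t) under w(t)
        wrong = [predict(h, x) != y for x, y in zip(rows, labels)]
        err = sum(wi for wi, bad in zip(w, wrong) if bad)
        if err == 0 or err >= 0.5:             # stop if perfect or too weak
            break
        alpha = 0.5 * math.log((1 - err) / err)
        hypotheses.append((alpha, h))          # keep h(t) with its vote weight
        # reweight: misclassified tuples gain weight, then normalize
        w = [wi * math.exp(alpha if bad else -alpha) for wi, bad in zip(w, wrong)]
        total = sum(w)
        w = [wi / total for wi in w]
    return hypotheses
```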