3-Classification, Clustering and Prediction
Task:
Learn a model that maps each attribute set x into one
of the predefined class labels y
Examples of Classification Task
[Table: examples of classification tasks, listing for each task its attribute set x and its class label y.]
[Figure: general approach — a training set of labeled records (Tid, Attrib1, Attrib2, Attrib3, Class) is fed to a learning algorithm that induces a model; the model is then applied to a test set whose class labels are unknown, e.g. records such as Tid 11 (No, Small, 55K, ?) and Tid 15 (No, Large, 67K, ?).]
Classification—A Two-Step Process
1st step: Model construction – describing a set of
predetermined classes
Each tuple/sample is assumed to belong to a predefined
class, as determined by the class label attribute
The set of tuples used for model construction is the training
set
The model is represented as classification rules, decision
trees, or mathematical formulae
Classification—A Two-Step Process
2nd step: Model usage – for classifying future or
unknown objects
Estimate accuracy of the model
The known label of test sample is compared with the
classified result from the model
Accuracy rate is the percentage of test set samples that are
correctly classified by the model
Test set is independent of training set (otherwise overfitting)
If the accuracy is acceptable, use the model to classify
new data
Note: If the test set is used to select models, it is called
validation (test) set
Process (1): Model Construction
[Figure: training data are fed to a classification algorithm, which constructs a classifier (model); the classifier is then evaluated on testing data and applied to unseen data such as (Jeff, Professor, 4) to predict Tenured = ?]

Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Classification Techniques
Base Classifiers
Decision Tree based Methods
Rule-based Methods
Nearest-neighbor
Neural Networks
Deep Learning
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
Ensemble Classifiers
Boosting, Bagging, Random Forests
Decision Tree
Why decision trees?
Decision trees often mimic human-level reasoning, so it is
simple to understand the data and draw useful
interpretations.
Decision trees let you see the logic used to interpret the
data (unlike black-box algorithms such as SVM, neural
networks, etc.)
Decision Tree
A decision tree is a tree with the following properties
An inner node represents an attribute.
An edge represents a test on the attribute of the parent
node.
A leaf represents one of the classes.
Use of decision tree: Classifying an unknown sample
Test the attribute values of the sample against the
decision tree
Construction of a decision tree
Based on the training data
Top-Down strategy
Decision Tree
Example:
The data set has five attributes.
There is a special attribute: the attribute class is the class label.
The attributes temp (temperature) and humidity are numerical attributes.
The other attributes are categorical, that is, they cannot be ordered.
Decision Tree
Example (cont.):
Based on the training data set, we want to find a set of rules to know
what values of outlook, temperature, humidity and wind, determine
whether or not to play golf.
Decision Tree
Example (cont.):
We have five leaf nodes.
In a decision tree, each leaf node represents a rule.
We have the following rules corresponding to the tree given in Figure.
Decision Tree
Example (cont.): Classification
The classification of an unknown input vector is done by
traversing the tree from the root node to a leaf node.
A record enters the tree at the root node.
At the root, a test is applied to determine which child node
the record will encounter next.
This process is repeated until the record arrives at a leaf
node.
All the records that end up at a given leaf of the tree are
classified in the same way.
There is a unique path from the root to each leaf.
The path is a rule which is used to classify the records.
Decision Tree
Example (cont.):
In our tree, we can carry out the classification of an unknown
record as follows.
Let us assume, for the record, that we know the values of the
first four attributes (but we do not know the value of class
attribute) as
outlook= rain; temp = 70; humidity = 65; and windy= true.
Decision Tree
Example (cont.):
We start from the root node to check the value of the
attribute associated at the root node.
This attribute is the splitting attribute at this node.
For a decision tree, at every node there is an attribute
associated with the node called the splitting attribute.
In our example, outlook is the splitting attribute at root.
Since for the given record, outlook = rain, we move to the
rightmost child node of the root.
At this node, the splitting attribute is windy, and we find that
for the record we want to classify, windy = true.
Hence, we move to the left child node and conclude that the
class label is "no play".
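As an illustration, a minimal sketch of this traversal in Python. The tree below is reconstructed from the rules discussed in this example (the sunny/humidity and rain/windy branches); the humidity threshold of 75 and the overcast branch follow the usual golf weather data set and are assumptions rather than text taken verbatim from the slides.

```python
# Sketch: classify one record by walking a hand-built decision tree.
# Tree structure reconstructed from the example; the threshold 75 is an assumption.

def classify(record):
    """Traverse the tree from the root (outlook) down to a leaf and return the class."""
    if record["outlook"] == "sunny":
        # Rule 1 region: split on humidity
        return "play" if record["humidity"] < 75 else "no play"
    elif record["outlook"] == "overcast":
        return "play"
    else:  # outlook == "rain": split on windy
        return "no play" if record["windy"] else "play"

# The unknown record from the example: outlook = rain, temp = 70, humidity = 65, windy = true
record = {"outlook": "rain", "temp": 70, "humidity": 65, "windy": True}
print(classify(record))  # -> "no play", matching the traversal described above
```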
Decision Tree
Example (cont.):
The accuracy of the classifier is determined by the
percentage of the test data set that is correctly classified.
We can see that for Rule 1 there are two records of the test
data set satisfying outlook= sunny and humidity < 75, and
only one of these is correctly classified as play.
Thus, the accuracy of this rule is 0.5 (or 50%). Similarly,
the accuracy of Rule 2 is also 0.5 (or 50%). The accuracy
of Rule 3 is 0.66.
Decision Tree
Concept of Categorical Attributes:
Consider the following training data
set.
There are three attributes, namely,
age, pincode and class.
The attribute class is used for class
label.
[Figure: test condition for an ordinal attribute, grouping the values Small, Medium, Large, Extra Large into two subsets, e.g. {Small, Large} and {Medium, Extra Large}.]
Test Condition for Continuous Attributes
[Figure: test conditions for the continuous attribute Annual Income — either a binary split (Annual Income > 80K? Yes / No) or a multi-way split into ranges such as < 10K, …, > 80K. Node class counts such as (C0: 5, C1: 5) and (C0: 9, C1: 1) illustrate different degrees of impurity.]
Entropy
Entropy(t) = − Σ_j p(j|t) · log p(j|t)
Misclassification error
Error(t) = 1 − max_i P(i|t)
Finding the Best Split
Compute impurity measure (P) before splitting
Compute impurity measure (M) after splitting
Compute impurity measure of each child node
M is the weighted impurity of children
Choose the attribute test condition that produces the
highest gain
Gain = P – M
[Figure: two candidate splits, A? and B?, each with Yes/No children; their weighted child impurities are M1 and M2, and the split with the larger gain (P − M1 vs. P − M2) is chosen.]
4 steps of Measure of Impurity: GINI
1. If a data set D contains examples from n classes, the gini index gini(D) is defined as:
   gini(D) = 1 − Σ_{j=1}^{n} pj²
   where pj = count of class j in D / total count of D
2. If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as:
   gini_A(D) = (|D1|/|D|)·gini(D1) + (|D2|/|D|)·gini(D2)
3. Reduction in impurity: Δgini(A) = gini(D) − gini_A(D)
4. The attribute whose split gives the smallest gini_A(D) (equivalently, the largest reduction in impurity) is selected as the best attribute.
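A minimal sketch of these formulas in Python; the class counts below are hypothetical and only exercise the computation.

```python
# Sketch of the gini(D), gini_A(D) and reduction-in-impurity formulas above.

def gini(counts):
    """gini(D) = 1 - sum_j p_j^2, where counts are the per-class record counts in D."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(counts_d1, counts_d2):
    """gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2)."""
    n1, n2 = sum(counts_d1), sum(counts_d2)
    n = n1 + n2
    return (n1 / n) * gini(counts_d1) + (n2 / n) * gini(counts_d2)

# Hypothetical parent node with 6 records of class C1 and 6 of class C2,
# split by attribute A into child nodes D1 = (5, 1) and D2 = (1, 5).
parent = [6, 6]
d1, d2 = [5, 1], [1, 5]
print("gini(D)   =", round(gini(parent), 3))                       # 0.5
print("gini_A(D) =", round(gini_split(d1, d2), 3))                 # ~0.278
print("reduction =", round(gini(parent) - gini_split(d1, d2), 3))  # ~0.222
```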
GINI Index: Example
Best binary split for income is {medium, high} or {low} with
minimum gini index.
Now do the same for the attributes age, student and credit_rating
Measure of Impurity: Entropy
Entropy is maximum (log nc) when records are equally distributed among all
classes, implying least information
Entropy is minimum (0.0) when all records belong to one class, implying
most information
Entropy of a split into n partitions: Entropy_split = Σ_{i=1}^{n} (ni/n) · Entropy(i)
Then calculate the entropy of all the remaining attributes: age, competition and type.
Finally, calculate the gain for each attribute; the attribute with the maximum gain
becomes the root node.
The other nodes are constructed using the second-largest gain, the third-largest gain,
and so on.
Computing Information Gain After
Splitting
age    Down   Up
old    3      0
mid    2      2
new    0      3
Computing Information Gain After
Splitting
Example:
Gain(Age) = IG - E(Age) = 1 – 0.4 = 0.6
Gain(competition) = 0.124
Gain(type) = 0
[Figure: the resulting decision tree. The root splits on age: age = old → Down, age = new → Up, and age = mid → a further split on competition (Yes → Down, No → Up).]
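A minimal sketch that reproduces the Gain(Age) computation above from the age/class table; Gain(competition) and Gain(type) would be computed the same way from their own contingency tables, which are not reproduced here.

```python
import math

def entropy(counts):
    """Entropy = -sum p_i * log2 p_i over the class counts of a node."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Contingency table for age vs. class (Down, Up) from the example above.
age_table = {"old": (3, 0), "mid": (2, 2), "new": (0, 3)}

parent_counts = [sum(col) for col in zip(*age_table.values())]   # (5 Down, 5 Up)
n = sum(parent_counts)

# Weighted entropy after splitting on age, then the information gain.
e_age = sum(sum(c) / n * entropy(c) for c in age_table.values())
print("IG (parent entropy) =", entropy(parent_counts))           # 1.0
print("E(Age)              =", e_age)                            # 0.4
print("Gain(Age)           =", entropy(parent_counts) - e_age)   # 0.6
```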
Iterative Dichotomizer (ID3)
Quinlan (1986)
Each node corresponds to a splitting attribute
Each arc is a possible value of that attribute.
At each node the splitting attribute is selected to be the
most informative among the attributes not yet
considered in the path from the root.
Entropy is used to measure how informative a node is.
The algorithm uses the criterion of information gain to
determine the goodness of a split.
The attribute with the greatest information gain is taken as
the splitting attribute, and the data set is split for all
distinct values of the attribute.
Iterative Dichotomizer (ID3) – Example
The class label attribute, buys_computer, has two distinct values; thus there are two distinct classes (m = 2).
Class C1 corresponds to yes and class C2 corresponds to no.
There are 9 samples of class yes and 5 samples of class no.

age      income   student  credit_rating  buys_computer
<=30     high     no       fair           no
<=30     high     no       excellent      no
31…40    high     no       fair           yes
>40      medium   no       fair           yes
>40      low      yes      fair           yes
>40      low      yes      excellent      no
31…40    low      yes      excellent      yes
<=30     medium   no       fair           no
<=30     low      yes      fair           yes
>40      medium   yes      fair           yes
<=30     medium   yes      excellent      yes
31…40    medium   no       excellent      yes
31…40    high     yes      fair           yes
>40      medium   no       excellent      no
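A minimal sketch computing the class distribution and the information gain of age for this training set (the entropy helper mirrors the earlier example; variable names are mine):

```python
import math
from collections import Counter, defaultdict

# The 14 training tuples (age, income, student, credit_rating, buys_computer) from the table above.
data = [
    ("<=30", "high",   "no",  "fair",      "no"),
    ("<=30", "high",   "no",  "excellent", "no"),
    ("31…40","high",   "no",  "fair",      "yes"),
    (">40",  "medium", "no",  "fair",      "yes"),
    (">40",  "low",    "yes", "fair",      "yes"),
    (">40",  "low",    "yes", "excellent", "no"),
    ("31…40","low",    "yes", "excellent", "yes"),
    ("<=30", "medium", "no",  "fair",      "no"),
    ("<=30", "low",    "yes", "fair",      "yes"),
    (">40",  "medium", "yes", "fair",      "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40","medium", "no",  "excellent", "yes"),
    ("31…40","high",   "yes", "fair",      "yes"),
    (">40",  "medium", "no",  "excellent", "no"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

labels = [row[-1] for row in data]
print(Counter(labels))                           # 9 yes, 5 no
print("Info(D)  =", round(entropy(labels), 3))   # ~0.940

# Information gain of the attribute age (column 0).
groups = defaultdict(list)
for row in data:
    groups[row[0]].append(row[-1])
info_age = sum(len(g) / len(data) * entropy(g) for g in groups.values())
print("Gain(age) =", round(entropy(labels) - info_age, 3))   # ~0.247
```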
Iterative Dichotomizer (ID3) – Example
Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand
Extracting Classification Rules from Trees
[Figure: the decision tree learned for buys_computer. The root splits on age (<=30, 31..40, >40); the <=30 branch splits on student (leaves no / yes), the 31..40 branch is a yes leaf, and the >40 branch splits on credit rating (leaves no / yes).]
Iterative Dichotomizer (ID3) – Example
Solution (Rules):
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
Iterative Dichotomizer (ID3) – Algorithm
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-
conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are
discretized in advance)
Examples are partitioned recursively based on selected
attributes
Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
Iterative Dichotomizer (ID3) – Algorithm
Conditions for stopping partitioning
All samples for a given node belong to the same
class
There are no remaining attributes for further
partitioning – majority voting is employed for
classifying the leaf
There are no samples left
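A compact sketch of this recursive procedure (an ID3-style builder using information gain; the function and parameter names are mine, not from the slides, and continuous attributes are assumed to be discretized in advance, as stated above):

```python
import math
from collections import Counter

def entropy(rows, target):
    counts = Counter(row[target] for row in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(rows, attr, target):
    n = len(rows)
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row)
    remainder = sum(len(g) / n * entropy(g, target) for g in groups.values())
    return entropy(rows, target) - remainder

def id3(rows, attributes, target):
    classes = [row[target] for row in rows]
    # Stop: all samples for this node belong to the same class.
    if len(set(classes)) == 1:
        return classes[0]
    # Stop: no remaining attributes -> majority voting for the leaf.
    if not attributes:
        return Counter(classes).most_common(1)[0][0]
    # Greedy choice: the attribute with the greatest information gain.
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = id3(subset, [a for a in attributes if a != best], target)
    return tree

# Usage (rows are dicts, e.g. built from the buys_computer table earlier):
# tree = id3(rows, ["age", "income", "student", "credit_rating"], "buys_computer")
```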
Advantages of Decision Tree
A decision tree construction process is concerned with
identifying the splitting attributes and splitting criterion at
every level of the tree.
Major strengths are:
Decision trees are able to generate understandable rules.
They are able to handle both numerical and categorical
attributes.
They provide clear indication of which fields are most
important for prediction or classification.
Shortcomings of Decision Tree
Weaknesses are:
The process of growing a decision tree is computationally
expensive. At each node, each candidate splitting field is
examined before its best split can be found.
Some decision tree algorithms can only deal with binary-valued
target classes.
Overfitting and Tree Pruning
Overfitting: An induced tree may overfit the training data
Too many branches, some may reflect anomalies due to noise
or outliers
Poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: Halt tree construction early: do not split a node if
this would result in the goodness measure falling below a
threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
Use a set of data different from the training data to decide which is
the “best pruned tree”
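As a hedged illustration of postpruning, a sketch using scikit-learn's cost-complexity pruning, which is one common way to obtain a sequence of progressively pruned trees and pick the best one on held-out data; the slides do not prescribe a particular library or pruning method, and the iris data here is only a stand-in.

```python
# Sketch: cost-complexity (post-)pruning with scikit-learn.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow a full tree, then get the sequence of pruning strengths (alphas).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# One tree per alpha = a sequence of progressively pruned trees.
best_alpha, best_acc = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    acc = tree.score(X_val, y_val)   # evaluate on data different from the training data
    if acc >= best_acc:
        best_alpha, best_acc = alpha, acc

print("best alpha:", best_alpha, "validation accuracy:", best_acc)
```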
Bayesian Classification
A statistical classifier: performs probabilistic prediction,
i.e., predicts class membership probabilities
Foundation: Based on Bayes’ Theorem.
Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct
— prior knowledge can be combined with observed data
Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard of
optimal decision making against which other methods can be
measured
Bayesian Theorem: Basics
Let X be a data sample whose class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X) (the posterior probability): the
probability that the hypothesis holds given the observed data sample X
P(H) (prior probability): the initial probability of the hypothesis
P(X): the probability that the sample data is observed
P(X|H) (likelihood): the probability of observing the sample X,
given that the hypothesis holds
Example: X is a round and red fruit, and H is the hypothesis that X is an apple
Bayesian Theorem
Bayes' theorem: P(H|X) = P(X|H) · P(H) / P(X)
Classification Is to Derive the
Maximum Posteriori
Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
Suppose there are m classes C1, C2, …, Cm.
Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
This can be derived from Bayes’ theorem
P(Ci|X) = P(X|Ci) · P(Ci) / P(X)
Since P(X) is constant for all classes, only P(X|Ci) · P(Ci) needs to be maximized
Under the naïve assumption of class-conditional independence: P(X|Ci) = Π_{k=1}^{n} P(xk|Ci)
If any single P(xk|Ci) is zero, the whole product becomes zero, so zero counts are a problem.
Ex. Suppose a dataset with 1000 tuples: income = low (0),
income = medium (990), and income = high (10)
Use Laplacian correction (or Laplacian estimator)
Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
The “corrected” prob. estimates are close to their
“uncorrected” counterparts
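A minimal sketch of the Laplacian correction arithmetic for this example:

```python
# Laplacian (add-one) correction for the income example above.
counts = {"low": 0, "medium": 990, "high": 10}   # raw counts out of 1000 tuples
total = sum(counts.values())

uncorrected = {k: v / total for k, v in counts.items()}
# Add 1 to each case; the denominator grows by the number of distinct values (3).
corrected = {k: (v + 1) / (total + len(counts)) for k, v in counts.items()}

print(uncorrected)  # low: 0.0 (problematic zero probability)
print(corrected)    # low: 1/1003, medium: 991/1003, high: 11/1003
```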
Naïve Bayesian Classifier
Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence, therefore
loss of accuracy
Practically, dependencies exist among variables
E.g., hospital patients: profile attributes (age, family history, etc.),
symptoms (fever, cough, etc.) and diseases (lung cancer,
diabetes, etc.) are not independent of one another
Dependencies among these cannot be modeled by Naïve
Bayesian Classifier
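To make the classifier concrete, a sketch of naïve Bayes applied to the buys_computer training data shown earlier, scoring a hypothetical unseen tuple X = (age <= 30, income = medium, student = yes, credit_rating = fair); the tuple and the helper names are mine, not from the slides.

```python
from collections import Counter

# buys_computer training data from the ID3 example (age, income, student, credit_rating, class).
data = [
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31…40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31…40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31…40","medium","no","excellent","yes"),
    ("31…40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]

priors = Counter(row[-1] for row in data)        # class counts: 9 yes, 5 no
n = len(data)

def cond_prob(attr_index, value, cls):
    """P(x_k | Ci): fraction of class-cls tuples having this attribute value."""
    in_class = [row for row in data if row[-1] == cls]
    return sum(1 for row in in_class if row[attr_index] == value) / len(in_class)

# Hypothetical unseen tuple (not from the slides).
X = ("<=30", "medium", "yes", "fair")

scores = {}
for cls in priors:
    p = priors[cls] / n                  # prior P(Ci)
    for k, value in enumerate(X):
        p *= cond_prob(k, value, cls)    # naive product of P(x_k | Ci)
    scores[cls] = p

print(scores)                            # 'yes' ~0.028 vs 'no' ~0.007
print(max(scores, key=scores.get))       # predicted class: yes
```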
Classification by Back propagation
Quartiles split the ranked data into 4 segments with an
equal number of values per segment
[Figure: a ranked data line split into four equal segments by Q1, Q2 and Q3.]
The first quartile, Q1, is the value for which 25% of the
observations are smaller and 75% are larger
Q2 is the same as the median (50% of the observations
are smaller and 50% are larger)
Only 25% of the observations are greater than the third
quartile
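A small sketch computing quartiles with NumPy; the data values are hypothetical.

```python
import numpy as np

# Hypothetical ranked data set.
data = np.array([11, 12, 13, 16, 16, 17, 18, 21, 22])

q1, q2, q3 = np.percentile(data, [25, 50, 75])
print("Q1 =", q1)                                     # 25% of observations are smaller
print("Q2 =", q2, "(same as the median:", np.median(data), ")")
print("Q3 =", q3)                                     # only 25% of observations are greater
```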
Support Vector Machines
The solution provided by SVM is:
Theoretically elegant
Computationally efficient
Error rate is 1.1%
Very effective in many large practical problems
It has a simple geometrical interpretation in a high-
dimensional feature space that is nonlinearly related to the
input space
By using kernels, all computations remain simple
Types of Data in Classification
Linearly Separable Data
Linearly Non-separable Data
[Figure 1: Linearly Separable Data — Class 1 and Class 2 can be separated by a straight line. Figure 2: Linearly Non-separable Data.]
Types of Data in Classification
Linear Classifier
Non-linear Classifier
[Figure: a linear classifier separating Class 1 from Class 2, and a non-linear classifier for data that a straight line cannot separate.]
Equation of a hyperplane:
wᵀx + b = 0
In 2-D: w1·x1 + w2·x2 + b = 0
Example: the line through the points (0, 3) and (2, 0):
(y − 0)/(x − 2) = (3 − 0)/(0 − 2) = −3/2  ⇒  2y = −3x + 6  ⇒  3x + 2y − 6 = 0
Here w = (3, 2) are the coefficients and b = −6 is the constant.
Decision Function
f(x) = wᵀx + b
f(x) > 0 → class 1
f(x) < 0 → class 2
How do we find a good w and b? There are many possible choices of (w, b).
[Figure: a separating hyperplane w·x + b = 0 between Class 1 and Class 2, with normal vector w and offset b.]
We are looking for the (w, b) that separates the two classes best.
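A tiny sketch of this decision function, reusing the hyperplane 3x + 2y − 6 = 0 from the worked example above; the test points are hypothetical.

```python
import numpy as np

# Hyperplane 3*x1 + 2*x2 - 6 = 0 from the worked example: w = (3, 2), b = -6.
w = np.array([3.0, 2.0])
b = -6.0

def f(x):
    """Decision function f(x) = w^T x + b."""
    return np.dot(w, x) + b

for point in [np.array([3.0, 2.0]), np.array([0.0, 0.0]), np.array([2.0, 0.0])]:
    value = f(point)
    label = "class 1" if value > 0 else ("class 2" if value < 0 else "on the boundary")
    print(point, "->", round(value, 2), label)
```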
Linear Classifiers
f(x, w, b) = sign(w·x + b)
Points with w·x + b > 0 are labeled +1; points with w·x + b < 0 are labeled −1.
How would you classify this data?
[Figure: a sequence of slides shows the same two-class scatter plot (+1 and −1 points) with several different candidate separating lines.]
Any of these would be fine.. but which is best?
Support Vector Machines
A promising technique for data classification
Statistical learning theory: maximize the distance (margin) between the two
classes
A new classification method for both linear and nonlinear data
It uses a nonlinear mapping to transform the original training data
into a higher dimension
With the new dimension, it searches for the linear optimal
separating hyperplane (i.e., “decision boundary”)
With an appropriate nonlinear mapping to a sufficiently high
dimension, data from two classes can always be separated by a
hyperplane
SVM finds this hyperplane using support vectors (“essential”
training tuples) and margins (defined by the support vectors)
Support Vectors
[Figure: the separating hyperplane w·x + b = 0 with the two margin hyperplanes w·x + b = +1 (boundary of the "predict class +1" zone) and w·x + b = −1 (boundary of the "predict class −1" zone); x+ and x− are support vectors lying on these margins, and M is the margin width.]
What we know:
• w · x+ + b = +1
• w · x− + b = −1
Computing the Margin Width
Maximizing the margin is good according to intuition and
theory.
Implies that only support vectors are important; other
training examples are ignorable.
Computing the Margin Width Cont.
Subtracting w · x− + b = −1 from w · x+ + b = +1 gives w · (x+ − x−) = 2, so the margin width is M = 2 / ||w||.
Formulate the Decision Boundary
Recap of Constrained Optimization
Standard form problem (not necessarily convex):
minimize f(x) subject to gi(x) ≤ 0, i = 1, …, n
Here,
f(x) = (1/2)·||w||²  and  gi(x) = 1 − yi(wᵀx + b), 1 ≤ i ≤ n
Primal to Dual Journey
Lagrangian: L(w, b, α) = (1/2)·||w||² + Σ_{i=1}^{n} αi·(1 − yi(wᵀxi + b)), with αi ≥ 0; setting its gradient with respect to w and b to 0 leads to the dual problem.
Primal to Dual Journey
The Dual Problem
maximize Q(α) = Σαi − ½ΣΣαiαjyiyjxiᵀxj subject to αi ≥ 0 and Σαiyi = 0
We can calculate b by taking any αi such that αi > 0:
such a point must satisfy yi(wᵀxi + b) − 1 = 0, so b = 1/yi − wᵀxi = yi − wᵀxi (since yi ∈ {1, −1})
Calculating w is done using what we have found above:
w = Σi αi yi xi
Solution of this Optimization Problem
f(x) = ΣαiyixiTx + b
Support Vectors
w = Σ_{i=1}^{n} αi yi xi = Σ_{i∈SV} αi yi xi
[Figure: Class 1 and Class 2 training points; most points have αi = 0, and only the support vectors (points with αi > 0, lying on the margin) contribute to w.]
Dataset with noise
Hard Margin: So far we require all data points be classified
correctly
- No training error
What if the training set is noisy?
[Figure: two-class data (+1 / −1) containing noisy points; forcing all points to be classified correctly leads to OVERFITTING.]
Soft Margin Classification
Slack variables ξi can be added to allow misclassification of difficult or
noisy examples.
[Figure: the hyperplanes w·x + b = −1, 0, +1; slack variables ξi measure how far margin-violating or misclassified points lie on the wrong side.]
f(x) = ΣαiyixiTx + b
Linear SVMs: Overview
The classifier is a separating hyperplane.
Most “important” training points are support vectors; they define the
hyperplane.
Quadratic optimization algorithms can identify which training points xi are
support vectors with non-zero Lagrangian multipliers αi.
Both in the dual formulation of the problem and in the solution training
points appear only inside inner products:
Find α1…αN such that f(x) = ΣαiyixiTx + b
Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is
maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi
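A hedged sketch of a soft-margin linear SVM in scikit-learn (one possible toolkit; the slides describe the math, not a library). It exposes exactly the quantities discussed above: the support vectors, the dual coefficients αi·yi, and the margin 2/||w||. The training points are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

# Small hypothetical 2-D, two-class training set.
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=10.0)   # C controls the soft margin (larger C = harder margin)
clf.fit(X, y)

w = clf.coef_[0]                     # the weight vector w
b = clf.intercept_[0]                # the bias b
print("w =", w, "b =", b)
print("support vectors:\n", clf.support_vectors_)   # the 'essential' training tuples
print("alpha_i * y_i  =", clf.dual_coef_[0])        # non-zero only for support vectors
print("margin width M =", 2.0 / np.linalg.norm(w))  # M = 2 / ||w||

print(clf.predict([[1.5, 1.5], [4.5, 4.5]]))        # sign(w.x + b) for new points
```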
Extension to Non-linear Decision Boundary
[Figure: one-dimensional data plotted on the x-axis; the two classes cannot be separated by a single threshold.]
How about… mapping data to a higher-dimensional space, for example x → (x, x²)?
[Figure: after the mapping, the same points plotted against x and x² become linearly separable.]
Non-linear SVMs: Feature spaces
General idea:
The original feature space can always be mapped to some higher-
dimensional feature space where the training set is separable:
Φ: x → φ(x)
Non-linear SVMs: Feature spaces
Example: Mapping To Feature Space
x1, x2, x3 ∈ R¹ with x1 = 0, x2 = 1, x3 = 2 (nonseparable in R¹)
Mapping to a higher dimension: x → φ(x) = (x², √2·x, 1), i.e. R¹ → R³
0 → (0, 0, 1)
1 → (1, √2, 1)
2 → (4, 2√2, 1)
Now separable.
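A sketch that checks this numerically. The slides do not give the class labels, so it is assumed here that the middle point x = 1 belongs to one class and x = 0, x = 2 to the other (the standard "nonseparable in R¹" setup); a linear SVM is then fit on the mapped points.

```python
import numpy as np
from sklearn.svm import SVC

x = np.array([0.0, 1.0, 2.0])
y = np.array([1, -1, 1])          # assumed labels: the middle point is the odd one out

def phi(x):
    """phi(x) = (x^2, sqrt(2)*x, 1): the mapping from R^1 to R^3 above."""
    return np.stack([x**2, np.sqrt(2) * x, np.ones_like(x)], axis=1)

# No single threshold on the original line separates y, but in R^3 a hyperplane does.
clf = SVC(kernel="linear", C=1e6).fit(phi(x), y)
print(phi(x))               # (0, 0, 1), (1, 1.41, 1), (4, 2.83, 1)
print(clf.predict(phi(x)))  # matches y -> now separable
```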
Classification Problem in Feature Space
The training points x1, x2, …, xl ∈ Rⁿ are mapped to φ(x1), φ(x2), …, φ(xl) ∈ Rᵐ
max 2/||w||
s.t. wᵀφ(xi) + b ≥ 1 if yi = 1
     wᵀφ(xi) + b ≤ −1 if yi = −1
Classification Problem in Feature Space
max 2/||w||  ⇔  min ||w||/2  ⇔  min_{w,b} wᵀw / 2
subject to yi(wᵀφ(xi) + b) ≥ 1, i = 1, …, l
Questions:
1. How to choose φ?
2. Is it really better? Yes.
Soft margin Hyperplane
Sometimes, even in high-dimensional spaces, the data may still
not be separable.
Allow training error:
min_{w,b,ξ} (1/2)·wᵀw + C·Σ_{i=1}^{l} ξi
s.t. yi(wᵀφ(xi) + b) ≥ 1 − ξi,
     ξi ≥ 0, i = 1, …, l
Optimization Problem to find W and b
Consider the following primal problem:
minimise_{ξ,w,b}  wᵀw + C·Σ_{i=1}^{l} ξi
subject to yi(wᵀφ(xi) + b) ≥ 1 − ξi, i = 1, …, l
           ξi ≥ 0, i = 1, …, l
Cons:
Slow training (O(n²), n = number of vectors in the training set)
Separates only 2 classes
Clustering
Finding groups of objects such that the objects in a
group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups.
Based on information found in the data that describes the
objects and their relationships.
Also known as unsupervised classification.
Many applications
Understanding: group related documents for browsing or to find
genes and proteins that have similar functionality.
Summarization: Reduce the size of large data sets.
Web Documents are divided into groups based on a
similarity metric.
Most common similarity metric is the dot product between two
document vectors.
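A minimal sketch of that similarity metric on two hypothetical term-frequency vectors (the cosine variant is shown only for comparison):

```python
import numpy as np

# Hypothetical term-frequency vectors for two documents over the same vocabulary.
doc1 = np.array([2, 0, 1, 3, 0])
doc2 = np.array([1, 1, 0, 2, 0])

dot = np.dot(doc1, doc2)                                        # dot-product similarity
cosine = dot / (np.linalg.norm(doc1) * np.linalg.norm(doc2))    # length-normalized variant
print("dot product:", dot)                                      # 2*1 + 1*0 + 3*2 = 8
print("cosine similarity:", round(cosine, 3))
```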
What is not Cluster Analysis?
Supervised classification.
Have class label information.
Simple segmentation.
Dividing students into different registration groups
alphabetically, by last name.
Results of a query.
Groupings are a result of an external specification.
Graph partitioning
Some mutual relevance and synergy, but areas are not
identical.
Notion of a Cluster is Ambiguous
Types of Clusterings
A clustering is a set of clusters.
One important distinction is between hierarchical and
partitional sets of clusters.
Partitional Clustering
A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one
subset.
Hierarchical clustering
A set of nested clusters organized as a hierarchical tree.
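A hedged sketch contrasting the two kinds of clustering (scikit-learn is an assumed toolkit and the points are hypothetical; the slides name the concepts, not an implementation):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

# Hypothetical 2-D points forming two loose groups.
X = np.array([[1, 1], [1.5, 2], [2, 1], [8, 8], [8.5, 9], [9, 8]])

# Partitional clustering: each point ends up in exactly one of k clusters.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Hierarchical clustering: a tree of nested clusters, cut here into 2 groups.
agglo_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

print("k-means:      ", kmeans_labels)
print("agglomerative:", agglo_labels)
```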
Partitional Clustering
Hierarchical Clustering
Variance is a measure of how data points differ from the
mean
Example:
Data Set 1: 3, 5, 7, 10, 10
Data Set 2: 7, 7, 7, 7, 7
What is the mean and median of the above data set?
Data Set 1: mean = 7, median = 7
Data Set 2: mean = 7, median = 7
But we know that the two data sets are not identical! The
variance shows how they are different.
We want to find a way to represent these two data sets
numerically.
Hierarchical Clustering
Clustering by Density Based Methods
Population variance:
σ² = Σ_{i=1}^{N} (Xi − μ)² / N
where μ = population mean, N = population size, and Xi = the i-th value of the variable X
Clustering by Grid-Based Methods
Calculate the Variance for Ungrouped Data
1.Find the Mean.
2.Calculate the difference between each score and the
mean.
3.Square the difference between each score and the
mean.
4.Add up all the squares of the difference between each
score and the mean.
5.Divide the obtained sum by n – 1.
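A small sketch applying these steps to Data Set 1 above (3, 5, 7, 10, 10):

```python
data = [3, 5, 7, 10, 10]                          # Data Set 1 from the example above

mean = sum(data) / len(data)                      # step 1: the mean (= 7)
diffs = [x - mean for x in data]                  # step 2: differences from the mean
squares = [d ** 2 for d in diffs]                 # step 3: squared differences
total = sum(squares)                              # step 4: sum of the squares (= 38)
sample_variance = total / (len(data) - 1)         # step 5: divide by n - 1
print(mean, total, sample_variance)               # 7.0 38.0 9.5
```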
Clustering by Model-Based Methods
Calculate the Variance for Grouped Data
1.Calculate the mean.
2.Get the deviations by finding the difference of each
midpoint from the mean.
3.Square the deviations and find their sum.
4.Substitute in the formula.
Clustering High-Dimensional Data
Outlier analysis
Prediction
Mean deviation for grouped data:
MD = Σ_{i=1}^{k} fi·|xi − x̄| / n
where k = number of classes, xi = midpoint of the i-th class,
fi = frequency of the i-th class, and n = Σfi (total frequency)
Linear Regression
Nonlinear Regression
Measures the variation of observations from the mean
The most common measure of dispersion
Takes into account every observation
Measures the ‘average deviation’ of observations from
mean
Works with squares of residuals not absolute values—
easier to use in further calculations
Is the square root of the variance
Has the same units as the original data
Other Regression-Based Methods of Prediction
Standard deviation of a sample, s
In practice, most populations are very large and it is
more common to calculate the sample standard
deviation:
s = √( Σ(x − x̄)² / (n − 1) )
Standard deviation of a population, σ (the square root of the population variance above):
σ = √( Σ(X − μ)² / N )
Evaluating the Accuracy and error
measures of a Classifier or Predictor
Evaluation metrics: How can we measure accuracy?
Other metrics to consider?
Use a validation/test set of class-labeled tuples instead
of the training set when assessing accuracy
Methods for estimating a classifier’s accuracy:
Holdout method, random subsampling
Cross-validation
Bootstrap
Comparing classifiers:
Confidence intervals
Cost-benefit analysis and ROC Curves
Classifier Evaluation Metrics:
Confusion Matrix
Confusion Matrix:
Actual class \ Predicted class     C1                     ¬C1
C1                                 True Positives (TP)    False Negatives (FN)
¬C1                                False Positives (FP)   True Negatives (TN)
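As a closing sketch, the common metrics derived from this confusion matrix; the counts below are hypothetical.

```python
# Hypothetical confusion-matrix counts.
TP, FN, FP, TN = 90, 10, 5, 95

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # fraction of test tuples correctly classified
error     = 1 - accuracy
precision = TP / (TP + FP)                    # how many predicted-C1 tuples really are C1
recall    = TP / (TP + FN)                    # how many actual-C1 tuples were found (sensitivity)

print(f"accuracy={accuracy:.3f} error={error:.3f} precision={precision:.3f} recall={recall:.3f}")
```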