
Data Mining

Course No: CSE 4221

Topic 3: Classification, Regression and Clustering
Supervised vs. Unsupervised Learning
 Supervised learning (classification)
 Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Prediction Problems: Classification vs.
Numeric Prediction
 Classification
 predicts categorical class labels (discrete or nominal)
 classifies data (constructs a model) based on the training
set and the values (class labels) in a classifying attribute
and uses it in classifying new data
 Numeric Prediction
 models continuous-valued functions, i.e., predicts
unknown or missing values
 Typical applications
 Credit/loan approval
 Medical diagnosis: if a tumor is cancerous or benign
 Fraud detection: if a transaction is fraudulent
 Web page categorization: which category it is
Classification: Definition
 Given a collection of records (training set )
 Each record is characterized by a tuple (x, y),
where x is the attribute set and y is the class label
 x: attribute, predictor, independent variable, input
 y: class, response, dependent variable, output

 Task:
 Learn a model that maps each attribute set x into one
of the predefined class labels y
Examples of Classification Task
Task | Attribute set, x | Class label, y
Categorizing email messages | Features extracted from email message header and content | spam or non-spam
Identifying tumor cells | Features extracted from MRI scans | malignant or benign cells
Cataloging galaxies | Features extracted from telescope images | elliptical, spiral, or irregular-shaped galaxies
General Approach for Building
Classification Model
A learning algorithm is applied to the training set (induction) to learn a model; the model is then applied to the test set (deduction) to predict the missing class labels.

Training Set
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1 | Yes | Large | 125K | No
2 | No | Medium | 100K | No
3 | No | Small | 70K | No
4 | Yes | Medium | 120K | No
5 | No | Large | 95K | Yes
6 | No | Medium | 60K | No
7 | Yes | Large | 220K | No
8 | No | Small | 85K | Yes
9 | No | Medium | 75K | No
10 | No | Small | 90K | Yes

Test Set
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11 | No | Small | 55K | ?
12 | Yes | Medium | 80K | ?
13 | Yes | Large | 110K | ?
14 | No | Small | 95K | ?
15 | No | Large | 67K | ?
Classification—A Two-Step Process
 1st step: Model construction – describing a set of
predetermined classes
 Each tuple/sample is assumed to belong to a predefined
class, as determined by the class label attribute
 The set of tuples used for model construction is the training
set
 The model is represented as classification rules, decision
trees, or mathematical formulae
Classification—A Two-Step Process
 2nd step: Model usage – for classifying future or
unknown objects
 Estimate accuracy of the model
 The known label of test sample is compared with the
classified result from the model
 Accuracy rate is the percentage of test set samples that are
correctly classified by the model
 Test set is independent of training set (otherwise overfitting)
 If the accuracy is acceptable, use the model to classify
new data
 Note: If the test set is used to select models, it is called
validation (test) set
Process (1): Model Construction
The training data is fed to a classification algorithm, which produces the classifier (model).

Training Data
NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no

Resulting classifier (model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Process (2): Using the Model in Prediction

The classifier is first checked on testing data, then applied to unseen data.

Testing Data
NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
Classification Techniques
 Base Classifiers
 Decision Tree based Methods
 Rule-based Methods
 Nearest-neighbor
 Neural Networks
 Deep Learning
 Naïve Bayes and Bayesian Belief Networks
 Support Vector Machines

 Ensemble Classifiers
 Boosting, Bagging, Random Forests
Decision Tree
 Why decision trees?
 Decision trees often mimic human-level thinking, so it is
simple to understand the data and make good
interpretations.
 Decision trees let you see the logic used to interpret the
data (unlike black-box algorithms such as SVMs, neural
networks, etc.).
Decision Tree
 A decision tree is a tree with the following properties
 An inner node represents an attribute.
 An edge represents a test on the attribute of the parent
node.
 A leaf represents one of the classes.
 Use of decision tree: Classifying an unknown sample
 Test the attribute values of the sample against the
decision tree
 Construction of a decision tree
 Based on the training data
 Top-Down strategy
Decision Tree

 Example:
 The data set has five attributes.
 There is a special attribute: the attribute class is the class label.
 The attributes, temp (temperature) and humidity are numerical attributes
 Other attributes are categorical, that is, they cannot be ordered.
Decision Tree

 Example (cont.):
 Based on the training data set, we want to find a set of rules to know
what values of outlook, temperature, humidity and wind, determine
whether or not to play golf.
Decision Tree

 Example (cont.):
 We have five leaf nodes.
 In a decision tree, each leaf node represents a rule.
 We have the following rules corresponding to the tree given in Figure.
Decision Tree
 Example (cont.): Classification
 The classification of an unknown input vector is done by
traversing the tree from the root node to a leaf node.
 A record enters the tree at the root node.
 At the root, a test is applied to determine which child node
the record will encounter next.
 This process is repeated until the record arrives at a leaf
node.
 All the records that end up at a given leaf of the tree are
classified in the same way.
 There is a unique path from the root to each leaf.
 The path is a rule which is used to classify the records.
Decision Tree

 Example (cont.):
 In our tree, we can carry out the classification for an unknown
record as follows.
 Let us assume, for the record, that we know the values of the
first four attributes (but we do not know the value of class
attribute) as
 outlook= rain; temp = 70; humidity = 65; and windy= true.
Decision Tree
 Example (cont.):
 We start from the root node to check the value of the
attribute associated at the root node.
 This attribute is the splitting attribute at this node.
 For a decision tree, at every node there is an attribute
associated with the node called the splitting attribute.
 In our example, outlook is the splitting attribute at root.
 Since for the given record, outlook = rain, we move to the
rightmost child node of the root.
 At this node, the splitting attribute is windy and we find that
for the record we want classify, windy = true.
 Hence, we move to the left child node to conclude that the
class label is "no play".
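To make the traversal concrete, here is a minimal Python sketch, assuming the classic play-golf tree implied by the figure (outlook at the root, a humidity < 75 test on the sunny branch, a windy test on the rain branch); the function name and dictionary keys are illustrative, not part of the original slides.

```python
# Classifying a record by walking a decision tree from root to leaf.
# The tree structure below is an assumption based on the rules discussed
# in the text (Rule 1: outlook = sunny and humidity < 75 -> play, and the
# rain branch splitting on windy).

def classify(record):
    """Return 'play' or 'no play' for a record with keys
    outlook, temp, humidity, windy."""
    if record["outlook"] == "sunny":
        return "play" if record["humidity"] < 75 else "no play"
    elif record["outlook"] == "overcast":
        return "play"
    else:  # outlook == "rain": split on the windy attribute
        return "no play" if record["windy"] else "play"

# The unknown record from the example:
# outlook = rain, temp = 70, humidity = 65, windy = true
record = {"outlook": "rain", "temp": 70, "humidity": 65, "windy": True}
print(classify(record))  # no play
```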
Decision Tree
 Example (cont.):
 The accuracy of the classifier is determined by the
percentage of the test data set that is correctly classified.
 We can see that for Rule 1 there are two records of the test
data set satisfying outlook= sunny and humidity < 75, and
only one of these is correctly classified as play.
 Thus, the accuracy of this rule is 0.5 (or 50%). Similarly,
the accuracy of Rule 2 is also 0.5 (or 50%). The accuracy
of Rule 3 is 0.66.
Decision Tree
 Concept of Categorical Attributes:
 Consider the following training data
set.
 There are three attributes, namely,
age, pincode and class.
 The attribute class is used for class
label.

 The attribute age is a numeric attribute, whereas pincode is a categorical one.
 Though the domain of pincode is numeric, no ordering can be defined among pincode values.
 You cannot derive any useful information if one pincode is greater than another pincode.
Decision Tree
 Concept of Categorical Attributes
(cont.):
 Figure gives a decision tree for the training data.
 The splitting attribute at the root is pincode and the splitting criterion here is pincode = 500 046.
 Similarly, for the left child node, the splitting criterion is age < 48 (the splitting attribute is age).
 Although the right child node has the same attribute as the splitting attribute, the splitting criterion is different.
 At root level, we have 9 records. The associated splitting criterion is pincode = 500 046. As a result, we split the records into two subsets: records 1, 2, 4, 8, and 9 go to the left child node and the remaining records to the right node. The process is repeated at every node.
Decision Tree
 Tree construction principles
 Splitting attribute
 Splitting criterion
 3 main phases
 Construction phase
 Pruning phase
 Processing the pruned tree to improve its
understandability
Decision Tree
 Decision Tree Construction Algorithms
 Hunt’s Algorithm (one of the earliest)
 CART(Classification And Regression Tree) → uses Gini
Index(Classification) as metric.
 ID3(Iterative Dichotomizer 3) → uses Entropy function
and Information gain as metrics.
 C4.5
 SLIQ
 SPRINT
Design Issues of Decision Tree Induction
 How should training records be split?
 Method for specifying test condition
 depending on attribute types
 Measure for evaluating the goodness of a test
condition

 How should the splitting procedure stop?


 Stop splitting if all the records belong to the same
class or have identical attribute values
 Early termination
Methods for Expressing Test Conditions
 Depends on attribute types
 Binary
 Nominal
 Ordinal
 Continuous

 Depends on number of ways to split


 2-way split
 Multi-way split
Test Condition for Ordinal Attributes
 Multi-way split: use as many partitions as distinct values,
e.g. Shirt Size → Small, Medium, Large, Extra Large
 Binary split: divides values into two subsets while preserving the
order property among attribute values,
e.g. {Small, Medium} vs. {Large, Extra Large}, or {Small} vs. {Medium, Large, Extra Large}
 A grouping such as {Small, Large} vs. {Medium, Extra Large}
violates the order property
Test Condition for Continuous Attributes

(i) Binary split: Annual Income > 80K? (Yes / No)
(ii) Multi-way split: Annual Income? → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K


How to determine the Best Split

Before splitting: 10 records of class C0, 10 records of class C1.

Three candidate test conditions:
 Gender (Yes / No): Yes → C0: 6, C1: 4; No → C0: 4, C1: 6
 Car Type (Family / Sports / Luxury): Family → C0: 1, C1: 3; Sports → C0: 8, C1: 0; Luxury → C0: 1, C1: 7
 Customer ID (c1 … c20): each value contains a single record (C0: 1, C1: 0 or C0: 0, C1: 1)

Which test condition is the best?
How to determine the Best Split
 Greedy approach:
 Nodes with purer class distribution are preferred

 Need a measure of node impurity:
 A node with C0: 5, C1: 5 has a high degree of impurity
 A node with C0: 9, C1: 1 has a low degree of impurity
How to determine the Best Split
 Gini Index
GINI(t) = 1 − Σ_j [p(j | t)]²

 Entropy
Entropy(t) = − Σ_j p(j | t) log p(j | t)

 Misclassification error
Error(t) = 1 − max_i P(i | t)
Finding the Best Split
 Compute impurity measure (P) before splitting
 Compute impurity measure (M) after splitting
 Compute impurity measure of each child node
 M is the weighted impurity of the children
 Choose the attribute test condition that produces the highest gain
Gain = P − M
or, equivalently, the lowest impurity measure after splitting (M)
Finding the Best Split
Before splitting: a node with class counts C0: N00, C1: N01 and impurity P.

Candidate split A? (Yes / No) produces nodes N1 (C0: N10, C1: N11) and N2 (C0: N20, C1: N21) with weighted impurity M1.
Candidate split B? (Yes / No) produces nodes N3 (C0: N30, C1: N31) and N4 (C0: N40, C1: N41) with weighted impurity M2.

Compare Gain = P − M1 vs. P − M2.
4 steps of Measure of Impurity: GINI
1. If a data set D contains examples from n classes, the gini index,
gini(D), is defined as:
gini(D) = 1 − Σ_{j=1}^{n} pj²
where pj = (count of class j) / (total count of D)
2. If a data set D is split on attribute A into two subsets D1 and D2, the gini
index gini_A(D) is defined as:
gini_A(D) = (|D1| / |D|) · gini(D1) + (|D2| / |D|) · gini(D2)
3. Reduction in impurity: Δgini(A) = gini(D) − gini_A(D)
4. The attribute with the smallest gini_A(D) (equivalently, the largest
reduction in impurity) is selected as the splitting attribute.
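A small Python sketch of steps 1-3, assuming class labels are given as plain lists; the function names are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini index of a node: 1 - sum_j p_j^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(subsets):
    """Weighted Gini index of a split into the given label subsets."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * gini(s) for s in subsets)

# Example: a parent node with 10 records split into two children.
parent = ["C0"] * 5 + ["C1"] * 5
left = ["C0"] * 4 + ["C1"] * 1
right = ["C0"] * 1 + ["C1"] * 4
print(gini(parent))                              # 0.5
print(gini_split([left, right]))                 # 0.32
print(gini(parent) - gini_split([left, right]))  # reduction in impurity: 0.18
```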
GINI Index
GINI Index: Example
 The best binary split for income is {medium, high} vs. {low}, which has
the minimum gini index.
 Now do the same for the attributes age, student and credit_rating.
 Income is selected as the splitting attribute because it has the minimum
gini index and the highest reduction in impurity.
Measure of Impurity: Entropy
 Entropy at a given node t:
Entropy(t) = − Σ_j p(j | t) log p(j | t)
(NOTE: p(j | t) is the relative frequency of class j at node t.)

 Maximum (log n_c) when records are equally distributed among all
classes, implying the least information
 Minimum (0.0) when all records belong to one class, implying
the most information

 Entropy-based computations are quite similar to the GINI
index computations
Computing Entropy of a Single Node
Entropy(t) = − Σ_j p(j | t) log₂ p(j | t)

 C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = − 0 log 0 − 1 log 1 = − 0 − 0 = 0

 C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6
Entropy = − (1/6) log₂ (1/6) − (5/6) log₂ (5/6) = 0.65

 C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6
Entropy = − (2/6) log₂ (2/6) − (4/6) log₂ (4/6) = 0.92
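A quick Python check of these entropy values (a minimal sketch; the helper name is illustrative):

```python
import math

def entropy(counts):
    """Entropy of a node given a list of class counts (0 log 0 treated as 0)."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(round(entropy([0, 6]), 2))  # 0.0
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92
```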
Computing Information Gain After
Splitting
 Information Gain:
GAIN_split = Entropy(p) − Σ_{i=1}^{k} (ni / n) · Entropy(i)
Parent node p is split into k partitions; ni is the number of records in partition i

 Choose the split that achieves the most reduction (maximizes GAIN)

 Used in the ID3 and C4.5 decision tree algorithms


Computing Information Gain After
Splitting
 Example:
Age | Competition | Type | Profit
old | yes | S/w | Down
old | no | S/w | Down
old | no | H/w | Down
mid | yes | S/w | Down
mid | yes | H/w | Down
mid | no | H/w | Up
mid | no | S/w | Up
new | yes | S/w | Up
new | no | H/w | Up
new | no | S/w | Up

Produce a decision tree from this table.


Computing Information Gain After
Splitting

 First calculate the entropy of the class attribute Profit (5 Down, 5 Up), then calculate the entropy of the remaining attributes: age, competition and type.
 Finally calculate the gain; the attribute with the maximum gain becomes the root node.
 Other nodes are constructed with the second largest value, third largest value, and so on.
Computing Information Gain After
Splitting

Age | Down | Up
old | 3 | 0
mid | 2 | 2
new | 0 | 3
Computing Information Gain After
Splitting
 Example:
 Gain(Age) = IG − E(Age) = 1 − 0.4 = 0.6
 Gain(Competition) = 0.124
 Gain(Type) = 0
 Age has the highest gain, so it becomes the root. The resulting tree:
age = old → Down; age = new → Up; age = mid → split on competition
(competition = yes → Down, competition = no → Up).
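A minimal Python sketch that reproduces these gains from the table above (function and variable names are illustrative):

```python
import math
from collections import Counter

data = [
    ("old", "yes", "S/w", "Down"), ("old", "no", "S/w", "Down"),
    ("old", "no", "H/w", "Down"), ("mid", "yes", "S/w", "Down"),
    ("mid", "yes", "H/w", "Down"), ("mid", "no", "H/w", "Up"),
    ("mid", "no", "S/w", "Up"),  ("new", "yes", "S/w", "Up"),
    ("new", "no", "H/w", "Up"),  ("new", "no", "S/w", "Up"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_index, class_index=3):
    """GAIN_split = Entropy(parent) - weighted entropy of the partitions."""
    parent = entropy([r[class_index] for r in rows])
    expected = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [r[class_index] for r in rows if r[attr_index] == value]
        expected += len(subset) / len(rows) * entropy(subset)
    return parent - expected

for name, idx in [("Age", 0), ("Competition", 1), ("Type", 2)]:
    print(name, round(info_gain(data, idx), 2))
# Age 0.6, Competition 0.12 (~0.124), Type 0.0
```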
Iterative Dichotomizer (ID3)
 Quinlan (1986)
 Each node corresponds to a splitting attribute
 Each arc is a possible value of that attribute.
 At each node the splitting attribute is selected to be the
most informative among the attributes not yet
considered in the path from the root.
 Entropy is used to measure how informative a node is.
 The algorithm uses the criterion of information gain to
determine the goodness of a split.
 The attribute with the greatest information gain is taken as
the splitting attribute, and the data set is split for all
distinct values of the attribute.
Iterative Dichotomizer (ID3) – Example
 The class label attribute, buys_computer, has two distinct values; thus there are two distinct classes (m = 2).
 Class C1 corresponds to yes and class C2 corresponds to no.
 There are 9 samples of class yes and 5 samples of class no.

age | income | student | credit_rating | buys_computer
<=30 | high | no | fair | no
<=30 | high | no | excellent | no
31…40 | high | no | fair | yes
>40 | medium | no | fair | yes
>40 | low | yes | fair | yes
>40 | low | yes | excellent | no
31…40 | low | yes | excellent | yes
<=30 | medium | no | fair | no
<=30 | low | yes | fair | yes
>40 | medium | yes | fair | yes
<=30 | medium | yes | excellent | yes
31…40 | medium | no | excellent | yes
31…40 | high | yes | fair | yes
>40 | medium | no | excellent | no
Iterative Dichotomizer (ID3) – Example
 Represent the knowledge in the form of IF-THEN rules
 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction
 The leaf node holds the class prediction
 Rules are easier for humans to understand

Extracting Classification Rules from Trees
 The tree splits on age? at the root (<=30, 31..40, >40): the <=30 branch splits on student? (no → no, yes → yes), the 31..40 branch predicts yes, and the >40 branch splits on credit rating? (excellent → no, fair → yes).
Iterative Dichotomizer (ID3) – Example
 Solution (Rules):
Iterative Dichotomizer (ID3) – Algorithm
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-
conquer manner
 At start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they are
discretized in advance)
 Examples are partitioned recursively based on selected
attributes
 Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
Iterative Dichotomizer (ID3) – Algorithm
 Conditions for stopping partitioning
 All samples for a given node belong to the same
class
 There are no remaining attributes for further
partitioning – majority voting is employed for
classifying the leaf
 There are no samples left
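A compact Python sketch of this recursive, greedy construction with the stopping conditions above, assuming categorical attributes; the helper functions mirror the earlier information-gain sketch, and all names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, idx, class_index):
    labels = [r[class_index] for r in rows]
    gain = entropy(labels)
    for v in set(r[idx] for r in rows):
        subset = [r[class_index] for r in rows if r[idx] == v]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

def build_tree(rows, attributes, class_index):
    """Top-down recursive construction in the style of ID3 (categorical attributes)."""
    labels = [r[class_index] for r in rows]
    if len(set(labels)) == 1:          # all samples belong to the same class
        return labels[0]
    if not attributes:                 # no remaining attributes: majority voting
        return Counter(labels).most_common(1)[0][0]
    # select the attribute with the greatest information gain
    best = max(attributes, key=lambda a: info_gain(rows, attributes[a], class_index))
    rest = {a: i for a, i in attributes.items() if a != best}
    idx = attributes[best]
    # one branch per distinct value present at this node
    return {best: {v: build_tree([r for r in rows if r[idx] == v], rest, class_index)
                   for v in set(r[idx] for r in rows)}}

data = [("old", "yes", "S/w", "Down"), ("old", "no", "S/w", "Down"),
        ("old", "no", "H/w", "Down"), ("mid", "yes", "S/w", "Down"),
        ("mid", "yes", "H/w", "Down"), ("mid", "no", "H/w", "Up"),
        ("mid", "no", "S/w", "Up"), ("new", "yes", "S/w", "Up"),
        ("new", "no", "H/w", "Up"), ("new", "no", "S/w", "Up")]
print(build_tree(data, {"Age": 0, "Competition": 1, "Type": 2}, class_index=3))
# e.g. {'Age': {'old': 'Down', 'new': 'Up',
#               'mid': {'Competition': {'yes': 'Down', 'no': 'Up'}}}}
```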
Advantages of Decision Tree
 A decision tree construction process is concerned with
identifying the splitting attributes and splitting criterion at
every level of the tree.
 Major strengths are:
 Decision trees are able to generate understandable rules.
 They are able to handle both numerical and categorical
attributes.
 They provide clear indication of which fields are most
important for prediction or classification.
Shortcomings of Decision Tree
 Weaknesses are:
 The process of growing a decision tree is computationally
expensive. At each node, each candidate splitting field is
examined before its best split can be found.
 Some decision tree algorithms can only deal with binary-valued
target classes.
Overfitting and Tree Pruning
 Overfitting: An induced tree may overfit the training data
 Too many branches, some may reflect anomalies due to noise
or outliers
 Poor accuracy for unseen samples
 Two approaches to avoid overfitting
 Prepruning: Halt tree construction early; do not split a node if
this would result in the goodness measure falling below a
threshold
 Difficult to choose an appropriate threshold
 Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
 Use a set of data different from the training data to decide which is
the “best pruned tree”
Bayesian Classification
 A statistical classifier: performs probabilistic prediction,
i.e., predicts class membership probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct
— prior knowledge can be combined with observed data
 Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard of
optimal decision making against which other methods can be
measured
Bayesian Theorem: Basics
 Let X be a data sample whose class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X), the probability that
the hypothesis holds given the observed data sample X
 Posterior Probability
 P(H) (prior probability), the initial probability
 P(X): probability that sample data is observed
 P(X|H) (likelihood): the probability of
observing the sample X given that the hypothesis holds
 X – Round and Red Fruit H - Apple
Bayesian Theorem
Classification Is to Derive the
Maximum Posteriori
 Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
 This can be derived from Bayes’ theorem

P(Ci | X) = P(X | Ci) · P(Ci) / P(X)
 Since P(X) is constant for all classes, only
P(Ci | X) ∝ P(X | Ci) · P(Ci)
needs to be maximized
Bayesian Classification
 Naïve Bayesian Classifier
 Class Conditional Independence
 Effect of an attribute value on a given class is independent
of the values of other attributes
 Simplifies Computations
 Bayesian Belief Networks
 Graphical models
 Represent dependencies among subsets of attributes
Naïve Bayesian Classifier
Naïve Bayes Classifier: Training Dataset
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age | income | student | credit_rating | buys_computer
<=30 | high | no | fair | no
<=30 | high | no | excellent | no
31…40 | high | no | fair | yes
>40 | medium | no | fair | yes
>40 | low | yes | fair | yes
>40 | low | yes | excellent | no
31…40 | low | yes | excellent | yes
<=30 | medium | no | fair | no
<=30 | low | yes | fair | yes
>40 | medium | yes | fair | yes
<=30 | medium | yes | excellent | yes
31…40 | medium | no | excellent | yes
31…40 | high | yes | fair | yes
>40 | medium | no | excellent | no
Naïve Bayes Classifier: An Example
 P(Ci):
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357

 Compute P(X|Ci) for each class:
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
Naïve Bayes Classifier: An Example
 X = (age <= 30, income = medium, student = yes, credit_rating = fair)

P(X|Ci):
P(X | buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(X|Ci) * P(Ci):
P(X | buys_computer = “yes”) * P(buys_computer = “yes”) = 0.044 x 0.643 = 0.028
P(X | buys_computer = “no”) * P(buys_computer = “no”) = 0.019 x 0.357 = 0.007

Therefore, X belongs to class buys_computer = “yes” (0.028 > 0.007).
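A minimal Python sketch of this calculation, hard-coding the prior and conditional probabilities worked out above (dictionary keys are illustrative):

```python
# Naive Bayes for X = (age<=30, income=medium, student=yes, credit_rating=fair),
# using the probabilities computed from the training table above.
priors = {"yes": 9 / 14, "no": 5 / 14}
cond = {
    "yes": {"age<=30": 2 / 9, "income=medium": 4 / 9,
            "student=yes": 6 / 9, "credit=fair": 6 / 9},
    "no":  {"age<=30": 3 / 5, "income=medium": 2 / 5,
            "student=yes": 1 / 5, "credit=fair": 2 / 5},
}

scores = {}
for c in priors:
    p = priors[c]
    for feature_prob in cond[c].values():
        p *= feature_prob  # class-conditional independence assumption
    scores[c] = p

print(scores)                       # {'yes': ~0.028, 'no': ~0.007}
print(max(scores, key=scores.get))  # yes
```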
Avoiding the Zero-Probability Problem
 Naïve Bayesian prediction requires each conditional probability to be
non-zero; otherwise, the predicted probability will be zero:
P(X | Ci) = Π_{k=1}^{n} P(xk | Ci)
 Ex. Suppose a dataset with 1000 tuples: income = low (0),
income = medium (990), and income = high (10)
 Use Laplacian correction (or Laplacian estimator)
 Adding 1 to each case:
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
 The “corrected” prob. estimates are close to their
“uncorrected” counterparts
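A small Python sketch of the Laplacian correction for the income example (the function name is illustrative):

```python
def laplace_smoothed(counts):
    """Laplacian correction: add 1 to each count before normalizing."""
    total = sum(counts.values()) + len(counts)
    return {value: (c + 1) / total for value, c in counts.items()}

income_counts = {"low": 0, "medium": 990, "high": 10}
print(laplace_smoothed(income_counts))
# low: 1/1003, medium: 991/1003, high: 11/1003
```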
Naïve Bayesian Classifier
 Advantages
 Easy to implement
 Good results obtained in most of the cases
 Disadvantages
 Assumption: class conditional independence, therefore
loss of accuracy
 Practically, dependencies exist among variables
 E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer,
diabetes, etc.
 Dependencies among these cannot be modeled by Naïve
Bayesian Classifier
Classification by Back propagation
 Quartiles split the ranked data into 4 segments with an
equal number of values per segment

(Q1, Q2 and Q3 divide the ordered values into four segments of 25% each.)

 The first quartile, Q1, is the value for which 25% of the
observations are smaller and 75% are larger
 Q2 is the same as the median (50% of the observations
are smaller and 50% are larger)
 Only 25% of the observations are greater than the third
quartile
Support Vector Machines
 The solution provided by SVM is
 Theoretically elegant
 Computationally efficient
 Error rate is 1.1%
 Very effective in many large practical problems
 It has a simple geometrical interpretation in a high-
dimensional feature space that is nonlinearly related to
input space
 By using kernels all computations keep simple
Types of Data in Classification
 Linearly Separable Data
 Linearly Non-separable Data

Figure 1: Linearly Separable Data
Figure 2: Linearly Non-separable Data
Types of Data in Classification
 Linear Classifier
 Non-linear Classifier

Figure 3: Linear Classifier
Figure 4: Non-Linear Classifier


Basic Concept of Classification Function
 Consider linear separable case
 Training data in two classes (class 1 and class 2)
 Separating hyperplane: wT x + b = 0
Basic Concept of Classification Function

Equation of a hyperplane: wT x + b = 0
In 2-D: w1·x1 + w2·x2 + b = 0
Example: the line through (0, 3) and (2, 0) has slope (0 − 3)/(2 − 0) = −3/2,
so 2y = −3x + 6, i.e. 3x + 2y − 6 = 0.
w: coefficients, b: constant
Decision Function

f(x) = wT x + b
 f(x) > 0 → class 1
 f(x) < 0 → class 2
 How to find good w and b?
 There are many possible (w, b)
 We are looking for the (w, b) that will:
 Classify the classes correctly
 Give maximum margins
Separating hyperplane: wT x + b = 0
Linear Classifiers
 f(x, w, b) = sign(w·x + b)
 Points on the side where w·x + b > 0 are labeled +1; points where w·x + b < 0 are labeled −1.
 How would you classify this data? Many separating lines classify the training data correctly; any of these would be fine, but which is best?
Support Vector Machines
 A promising technique for data classification
 Statistic learning theorem: maximize the distance between two
classes
 A new classification method for both linear and nonlinear data
 It uses a nonlinear mapping to transform the original training data
into a higher dimension
 With the new dimension, it searches for the linear optimal
separating hyperplane (i.e., “decision boundary”)
 With an appropriate nonlinear mapping to a sufficiently high
dimension, data from two classes can always be separated by a
hyperplane
 SVM finds this hyperplane using support vectors (“essential”
training tuples) and margins (defined by the support vectors)
Support Vectors

 Support vectors are those input points (vectors) closest to the decision boundary
1. They are vectors
2. They “support” the decision hyperplane
 The margin M of the separator is the width of separation between the classes.
Best Linear Separator

 Find the closest points in the convex hulls of the two classes

Best Linear Separator
 The separating plane wT x + b = 0 bisects the closest points: w = d − c (where d and c are the closest points in the two convex hulls)
Linear SVM Mathematically

The hyperplanes w·x + b = +1 and w·x + b = −1 bound the margin of width M around the separating hyperplane w·x + b = 0; points with w·x + b ≥ +1 fall in the “Predict Class = +1” zone and points with w·x + b ≤ −1 fall in the “Predict Class = −1” zone.

What we know:
• w · x+ + b = +1
• w · x- + b = -1
Computing the Margin Width
 Maximizing the margin is good according to intuition and
theory.
 Implies that only support vectors are important; other
training examples are ignorable.
Computing the Margin Width (cont.)
Formulate the Decision Boundary
Recap of Constrained Optimization
 Standard form problem (not necessarily convex)

Lagrangian: augment the objective with a weighted sum of constraints


Lagrange Dual Function

Lower bound property: the Lagrange dual function gives a lower bound on the optimal value of the primal problem

The optimal solutions of the primal and dual problems are related by weak and strong duality:


Weak and Strong Duality
Back to Our Problem

Or

This is our Primal Problem

Here,
f(x) = (1/2)·||w||²,  gi(x) = 1 − yi(wT x + b),  1 ≤ i ≤ n

Primal to Dual Journey

Setting the derivative of the Lagrangian with respect to w to zero:
∂/∂w [ (1/2)·||w||² + Σ_{i=1}^{n} αi (1 − yi(wT xi + b)) ] = 0

Primal to Dual Journey
The Dual Problem

 The new objective function is in terms of the αi only

 It is known as the dual problem:

if we know w, we know all αi
if we know all αi, we know w

 The original problem is known as the primal problem

 The objective function of the dual problem needs to be


maximized!
The Dual Problem (Cont.)

 The dual problem is therefore:

The result when we


differentiate the original
Lagrangian w.r.t. b
Finding “w” and “b” for the boundary wT x + b

 Using the KKT (Karush-Kuhn-Tucker) condition:

 We can calculate “b” by taking any “i” such that αi ≠ 0; for such i it must be that
yi(wT xi + b) − 1 = 0  ⇒  b = 1/yi − wT xi = yi − wT xi   (since yi ∈ {1, −1})

 Calculating “w” is done using what we have found above:
w = Σ_i αi yi xi
Solution of this Optimization Problem

w = Σαiyixi;  b = yk − wTxk for any xk such that αk ≠ 0

f(x) = ΣαiyixiTx + b
Support Vectors
=0
Class 2
n
=0 w    i yi xi
>0 i 1

=0

=0
=0
>0
>0 w  y x
iSV
i i i

=0 =0

Class 1
Dataset with noise
 Hard margin: so far we require all data points to be classified
correctly
- No training error
 What if the training set is noisy?
- OVERFITTING!
Soft Margin Classification
 Slack variables ξi can be added to allow misclassification of difficult or
noisy examples (points that fall on the wrong side of the margin
hyperplanes w·x + b = ±1).

 What should our quadratic optimization criterion be?
Hard Margin Vs. Soft Margin

 The old formulation:


Find w and b such that
Φ(w) =½ wTw is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1

 The new formulation incorporating slack variables:


Find w and b such that
Φ(w) =½ wTw + CΣξi is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1- ξi and ξi ≥ 0 for all i

 Parameter C can be viewed as a way to control overfitting.


The Optimization Problem
(Soft Margin Classification)
 The dual of this new constrained optimization problem is :

Find α1…αN such that


Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi
 This is very similar to the optimization problem in the linearly separable
case, except that there is an upper bound C on αi now. Once again, a
QP solver can be used to find the αi
 Neither the slack variables ξi nor their Lagrange multipliers appear in the dual
problem!
 Again, xi with non-zero αi will be support vectors.
 Solution to the dual problem is:
w = Σαiyixi
b = yk(1 − ξk) − wTxk where k = argmax_k αk
f(x) = ΣαiyixiTx + b
But neither w nor b is needed explicitly for classification!
Linear SVMs: Overview
 The classifier is a separating hyperplane.
 Most “important” training points are support vectors; they define the
hyperplane.
 Quadratic optimization algorithms can identify which training points xi are
support vectors with non-zero Lagrangian multipliers αi.
 Both in the dual formulation of the problem and in the solution training
points appear only inside inner products:
Find α1…αN such that
Q(α) = Σαi − ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi
f(x) = ΣαiyixiTx + b
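For a practical sense of the soft-margin formulation, a minimal sketch using scikit-learn (assuming it is installed); the toy data are illustrative, and C is the slack penalty from the formulation above.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two classes, with one deliberately "noisy" point.
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6], [3, 3]])
y = np.array([-1, -1, -1, 1, 1, 1, -1])  # the last point sits near the other class

# C controls the trade-off between a wide margin and training errors (slack).
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)           # the "essential" training tuples
print(clf.predict([[4, 4], [0, 0]]))  # classify new points
```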
Extension to Non-linear Decision Boundary

 So far, we have only considered large-margin classifier with


a linear decision boundary
 How to generalize it to become nonlinear?
Non-linear SVMs
 Datasets that are linearly separable with some noise work out great (points spread along a 1-D axis x).
 But what are we going to do if the dataset is just too hard?
 How about… mapping data to a higher-dimensional space, e.g. mapping 1-D data to two-dimensional space with Φ(x) = (x, x²)?
Non-linear SVMs: Feature spaces
 General idea:
The original feature space can always be mapped to some higher-
dimensional feature space where the training set is separable:

Φ: x → φ(x)
Non-linear SVMs: Feature spaces
Example: Mapping To Feature Space

x1, x2, x3 ∈ R¹
x1 = 0, x2 = 1, x3 = 2   (non-separable in R¹)
Mapping to a higher dimension:
x → Φ(x) = (x², √2·x, 1),   R¹ → R³
0 → (0, 0, 1)
1 → (1, √2, 1)
2 → (4, 2√2, 1)
Now separable.
Classification Problem in Feature Space

x1, x2, …, xl ∈ Rⁿ
Φ(x1), Φ(x2), …, Φ(xl) ∈ Rᵐ

 Find a linear separating hyperplane:
max 2/||w||
s.t. wT·Φ(xi) + b ≥ 1 if yi = 1
     wT·Φ(xi) + b ≤ −1 if yi = −1
Classification Problem in Feature Space

max 2/||w||  ⇔  min ||w||/2  ⇔  min_{w,b} (wT·w)/2
subject to yi(wT·Φ(xi) + b) ≥ 1,  i = 1, …, l

 Questions:
1. How to choose Φ?
2. Is it really better? Yes.
Soft margin Hyperplane
 Sometimes, even in high-dimensional spaces, the data may still
not be separable.
 Allow training error:
min_{w,b,ξ} (1/2)·wT·w + C·Σ_{i=1}^{l} ξi
subject to yi(wT·Φ(xi) + b) ≥ 1 − ξi,
ξi ≥ 0, i = 1, …, l
Optimization Problem to find W and b
 Consider the following primal problem:
minimize_{w,b,ξ} (1/2)·wT·w + C·Σ_{i=1}^{l} ξi
subject to yi(wT·Φ(xi) + b) ≥ 1 − ξi, i = 1, …, l
ξi ≥ 0, i = 1, …, l

 Find α1…αN such that


How to know the mapping Φ?
 The mapped data only occurs as an inner product in the objectives.
 A kernel function is defined as a function that corresponds to a dot
product of two feature vectors in some expanded feature space:
 Now we only need to compute K(xi, xj), and we don't need to perform
computations in the high-dimensional space explicitly. This is what is
called the Kernel Trick.
 Example:
2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiTxj)²
Need to show that K(xi, xj) = φ(xi)Tφ(xj):
K(xi, xj) = (1 + xiTxj)² = 1 + xi1²xj1² + 2·xi1xj1·xi2xj2 + xi2²xj2² + 2·xi1xj1 + 2·xi2xj2
= [1  xi1²  √2·xi1xi2  xi2²  √2·xi1  √2·xi2]T [1  xj1²  √2·xj1xj2  xj2²  √2·xj1  √2·xj2]
= φ(xi)Tφ(xj), where φ(x) = [1  x1²  √2·x1x2  x2²  √2·x1  √2·x2]
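A quick numerical check of this identity in Python (a sketch; the example vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (1 + x.z)^2."""
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
kernel_value = (1 + xi @ xj) ** 2    # kernel trick: stay in the input space
explicit_value = phi(xi) @ phi(xj)   # same value via the explicit mapping
print(kernel_value, explicit_value)  # both 4.0
```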
Non-linear SVMs Mathematically
 Dual problem formulation:

Find α1…αN such that
Q(α) = Σαi − ½ΣΣαiαjyiyjK(xi, xj) is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi

 The solution is:
f(x) = ΣαiyiK(xi, x) + b

 Optimization techniques for finding the αi’s remain the same!
Non-linear SVM Cont.

 It is observed that to change from a linear to nonlinear


classifier, one must only substitute a kernel evaluation in
the objective instead of the original dot product.

 No algorithmic changes are required from the linear


case other than substitution of a kernel evaluation for the
simple dot product.
Kernel function

Commonly Used Kernel


Positive Semi-definite Function
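The kernel formulas behind these headings were lost in conversion; for reference, the kernels most commonly used with SVMs are the standard ones below (a reconstruction of well-known definitions, not the original slide content):
 Linear kernel: K(x, z) = xTz
 Polynomial kernel of degree d: K(x, z) = (xTz + 1)^d
 Gaussian (RBF) kernel: K(x, z) = exp(−||x − z||² / (2σ²))
 Sigmoid kernel: K(x, z) = tanh(κ·xTz + c)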
SVM Pros and Cons
 Pros:
 Easy to integrate different distance functions
 Fast classification of new objects (depends on SV)
 Good performance even with small set of examples

 Cons:
 Slow training (O(n²), n = number of vectors in the training set)
 Separates only 2 classes
Clustering
 Finding groups of objects such that the objects in a
group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups.
 Based on information found in the data that describes the
objects and their relationships.
 Also known as unsupervised classification.
 Many applications
 Understanding: group related documents for browsing or to find
genes and proteins that have similar functionality.
 Summarization: Reduce the size of large data sets.
 Web Documents are divided into groups based on a
similarity metric.
 Most common similarity metric is the dot product between two
document vectors.
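As a small illustration of the dot-product similarity mentioned above, a Python sketch comparing two toy term-count vectors (the data are made up for illustration):

```python
import numpy as np

# Toy term-count vectors for two documents over the same vocabulary.
doc1 = np.array([3, 0, 1, 2])
doc2 = np.array([2, 1, 0, 2])

dot = float(doc1 @ doc2)  # raw dot-product similarity
cosine = dot / (np.linalg.norm(doc1) * np.linalg.norm(doc2))  # length-normalized
print(dot, round(cosine, 3))
```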
What is not Cluster Analysis?
 Supervised classification.
 Have class label information.
 Simple segmentation.
 Dividing students into different registration groups
alphabetically, by last name.
 Results of a query.
 Groupings are a result of an external specification.
 Graph partitioning
 Some mutual relevance and synergy, but areas are not
identical.
Notion of a Cluster is Ambiguous
Types of Clusterings
 A clustering is a set of clusters.
 One important distinction is between hierarchical and
partitional sets of clusters.
 Partitional clustering
 A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one
subset.
 Hierarchical clustering
 A set of nested clusters organized as a hierarchical tree.
Partitional Clustering
Hierarchical Clustering
 Variance is a measure of how data points differ from the
mean
 Example:
Data Set 1: 3, 5, 7, 10, 10
Data Set 2: 7, 7, 7, 7, 7
What is the mean and median of the above data set?
Data Set 1: mean = 7, median = 7
Data Set 2: mean = 7, median = 7
But we know that the two data sets are not identical! The
variance shows how they are different.
We want a way to represent these two data sets
numerically.
Hierarchical Clustering
Clustering by Density Based Methods
 Population variance:
σ² = Σ_{i=1}^{N} (Xi − μ)² / N
where μ = population mean, N = population size, Xi = i-th value of the variable X
Clustering by Grid-Based Methods
Calculate the Variance for Ungrouped Data
1.Find the Mean.
2.Calculate the difference between each score and the
mean.
3.Square the difference between each score and the
mean.
4.Add up all the squares of the difference between each
score and the mean.
5.Divide the obtained sum by n – 1.
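A minimal Python sketch of these steps for the two data sets above (assuming the sample formula that divides by n − 1):

```python
def sample_variance(values):
    """Variance for ungrouped data: sum of squared deviations / (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

print(sample_variance([3, 5, 7, 10, 10]))  # 9.5 -> the values are spread out
print(sample_variance([7, 7, 7, 7, 7]))    # 0.0 -> no spread at all
```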
Clustering by Model-Based Methods
Calculate the Variance for Grouped Data
1.Calculate the mean.
2.Get the deviations by finding the difference of each
midpoint from the mean.
3.Square the deviations and find its summation.
4.Substitute in the formula.
Clustering High-Dimensional Data
Outlier analysis
Prediction

MD_x = Σ_{i=1}^{k} fi·|xi − x̄| / n
where k = number of classes, xi = midpoint of the i-th class, fi = frequency of the i-th class
Linear Regression
Nonlinear Regression
 Measures the variation of observations from the mean
 The most common measure of dispersion
 Takes into account every observation
 Measures the ‘average deviation’ of observations from
mean
 Works with squares of residuals not absolute values—
easier to use in further calculations
 Is the square root of the variance
 Has the same units as the original data
Other Regression-Based Methods
of prediction
Standard deviation of a sample, s
In practice, most populations are very large and it is
more common to calculate the sample standard
deviation.

s = √( Σ(x − x̄)² / (n − 1) )

where n is the number of observations in the sample.

Standard deviation of a population, σ
Every observation in the population is used.

σ = √( Σ(x − μ)² / N )

where N is the population size and μ is the population mean.
Evaluating the Accuracy and error
measures of a Classifier or Predictor
 Evaluation metrics: How can we measure accuracy?
Other metrics to consider?
 Use validation test set of class-labeled tuples instead
of training set when assessing accuracy
 Methods for estimating a classifier’s accuracy:
 Holdout method, random subsampling
 Cross-validation
 Bootstrap
 Comparing classifiers:
 Confidence intervals
 Cost-benefit analysis and ROC Curves
Classifier Evaluation Metrics:
Confusion Matrix
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)

Example of Confusion Matrix:


Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
buy_computer = yes | 6954 | 46 | 7000
buy_computer = no | 412 | 2588 | 3000
Total | 7366 | 2634 | 10000

 Given m classes, an entry, CMi,j in a confusion matrix


indicates # of tuples in class i that were labeled by the classifier
as class j
 May have extra rows/columns to provide totals
Classifier Evaluation Metrics: Accuracy,
Error Rate, Sensitivity and Specificity
A\P | C | ¬C |
C | TP | FN | P
¬C | FP | TN | N
 | P’ | N’ | All

 Classifier accuracy, or recognition rate: percentage of
test set tuples that are correctly classified
Accuracy = (TP + TN)/All
 Error rate: 1 − accuracy, or
Error rate = (FP + FN)/All
 Class Imbalance Problem:
 One class may be rare, e.g. fraud, or HIV-positive
 Significant majority of the negative class and
minority of the positive class
 Sensitivity: True Positive recognition rate
 Sensitivity = TP/P
 Specificity: True Negative recognition rate
 Specificity = TN/N
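Applying these definitions to the example confusion matrix above, a small Python check:

```python
# Values from the buy_computer confusion matrix above.
TP, FN = 6954, 46   # actual yes
FP, TN = 412, 2588  # actual no
P, N = TP + FN, FP + TN
ALL = P + N

accuracy = (TP + TN) / ALL    # 0.9542
error_rate = (FP + FN) / ALL  # 0.0458
sensitivity = TP / P          # ~0.9934 (true positive recognition rate)
specificity = TN / N          # ~0.8627 (true negative recognition rate)
print(accuracy, error_rate, round(sensitivity, 4), round(specificity, 4))
```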
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
 Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive

 Recall: completeness – what % of positive tuples did the


classifier label as positive?
 Perfect score is 1.0
 Inverse relationship between precision & recall
 F measure (F1 or F-score): harmonic mean of precision and
recall,

 Fß: weighted measure of precision and recall


 assigns ß times as much weight to recall as to precision
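The formulas behind these bullets were lost in conversion; the standard definitions, consistent with the text, are:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 · Precision · Recall / (Precision + Recall)
Fβ = (1 + β²) · Precision · Recall / (β² · Precision + Recall)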
Evaluating Classifier Accuracy:
Cross-Validation Methods
 Cross-validation (k-fold, where k = 10 is most
popular)
 Randomly partition the data into k mutually exclusive
subsets, each approximately equal size
 At i-th iteration, use Di as test set and others as
training set
 Leave-one-out: k folds where k = # of tuples, for small
sized data
 *Stratified cross-validation*: folds are stratified so
that class dist. in each fold is approx. the same as that
in the initial data
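A minimal Python sketch of k-fold cross-validation, assuming a classifier object with fit/predict methods (such as the scikit-learn estimator used earlier); names are illustrative and it assumes at least k tuples.

```python
import numpy as np

def k_fold_accuracy(model, X, y, k=10, seed=0):
    """Randomly partition the data into k folds; at the i-th iteration use
    fold i as the test set and the remaining folds as the training set."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train_idx], y[train_idx])
        scores.append(np.mean(model.predict(X[test_idx]) == y[test_idx]))
    return float(np.mean(scores))  # average accuracy over the k folds
```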
