Classification
Michael I. Jordan
University of California, Berkeley
Classification
• In classification problems, each entity in some domain can be placed
in one of a discrete set of categories: yes/no, friend/foe,
good/bad/indifferent, blue/red/green, etc.
• Given a training set of labeled entities, develop a rule for assigning
labels to entities in a test set
• Many variations on this theme:
• binary classification
• multi-category classification
• non-exclusive categories
• ranking
• Many criteria to assess rules and their predictions
• overall errors
• costs associated with different kinds of errors
• operating points
Representation of Objects
• Each object to be classified is represented as
a pair (x, y):
• where x is a description of the object (see examples
of data types in the following slides)
• where y is a label (assumed binary for now)
• Success or failure of a machine learning
classifier often depends on choosing good
descriptions of objects
• the choice of description can also be viewed as a
learning problem, and indeed we’ll discuss automated
procedures for choosing descriptions in a later lecture
• but good human intuitions are often needed here
Data Types
• Vectorial data:
• physical attributes
• behavioral attributes
• context
• history
• etc.
[Figure: 2D scatter plot of vectorial data from two classes, Class1 and Class2]
Some Issues
• There may be a simple separator (e.g., a straight line in 2D or
a hyperplane in general) or there may not
• There may be “noise” of various kinds
• There may be “overlap”
• One should not be deceived by one’s low-dimensional
geometrical intuition
• Some classifiers explicitly represent separators (e.g., straight
lines), while for other classifiers the separation is done
implicitly
• Some classifiers just make a decision as to which class an
object is in; others estimate class probabilities
Methods
I) Instance-based methods:
1) Nearest neighbor
II) Probabilistic models:
1) Naïve Bayes
2) Logistic Regression
III) Linear Models:
1) Perceptron
2) Support Vector Machine
IV) Decision Models:
1) Decision Trees
2) Boosted Decision Trees
3) Random Forest
Linearly Separable Data
[Figure: two classes (Class1, Class2) separated by a linear decision boundary]
Nonlinearly Separable Data
[Figure: two classes (Class1, Class2) separated by a nonlinear classifier]
Which Separating Hyperplane to Use?
[Figure: the same data with several candidate separating hyperplanes; axes x1 and x2]
Maximizing the Margin
• Select the separating hyperplane that maximizes the margin
[Figure: two classes with the maximum-margin hyperplane and its margin width; axes x1 and x2]
Support Vectors
[Figure: the support vectors are the points lying on the margin boundaries; the margin width is shown; axes x1 and x2]
Setting Up the Optimization Problem
• The maximum margin can be characterized as a solution to an optimization problem:

max_w  2 / ‖w‖
s.t.   w·x + b ≥ 1,   ∀x of class 1
       w·x + b ≤ −1,  ∀x of class 2

[Figure: separating hyperplane w·x + b = 0 with margin boundaries w·x + b = 1 and w·x + b = −1; axes x1 and x2]
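As a quick check of the 2/‖w‖ expression: take any point x0 on the boundary w·x + b = −1 and apply the standard point-to-hyperplane distance formula to the boundary w·x + b = 1,

```latex
\operatorname{dist}\bigl(x_0,\ \{x : w\cdot x + b = 1\}\bigr)
  = \frac{\lvert w\cdot x_0 + b - 1\rvert}{\lVert w\rVert}
  = \frac{\lvert -1 - 1\rvert}{\lVert w\rVert}
  = \frac{2}{\lVert w\rVert}.
```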
Setting Up the Optimization Problem
• So the problem becomes:

max_w  2 / ‖w‖
s.t.   yi (w·xi + b) ≥ 1, ∀xi

or, equivalently,

min_w  (1/2) ‖w‖²
s.t.   yi (w·xi + b) ≥ 1, ∀xi
Linear, Hard-Margin SVM Formulation
• Find w, b that solve

min_w  (1/2) ‖w‖²
s.t.   yi (w·xi + b) ≥ 1, ∀xi
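A minimal sketch of solving this quadratic program directly, assuming the cvxpy library is available and using a tiny hand-made, linearly separable dataset (both are illustrative assumptions, not part of the slides):

```python
import cvxpy as cp
import numpy as np

# Hypothetical, linearly separable toy data (three points per class).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.5], [0.5, 0.0], [-0.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()
# Hard-margin constraints: y_i (w . x_i + b) >= 1 for every training point.
constraints = [cp.multiply(y, X @ w + b) >= 1]
# Objective: minimize (1/2) ||w||^2, i.e. maximize the margin 2 / ||w||.
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
problem.solve()
print(w.value, b.value)
```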
• Allow some instances to fall within the margin, but penalize them
[Figure: points inside the margin with slack ξi; hyperplanes w·x + b = 1, w·x + b = 0, and w·x + b = −1; axes Var1 and Var2]
Formulating the Optimization Problem
• Constraints become:

yi (w·xi + b) ≥ 1 − ξi, ∀xi
ξi ≥ 0

• Objective function penalizes misclassified instances and those within the margin:

min_w  (1/2) ‖w‖² + C Σi ξi

• C trades off margin width and misclassifications
[Figure: slack variables ξi measure how far violating points lie past the margin boundaries w·x + b = ±1; axes Var1 and Var2]
Linear, Soft-Margin SVMs
min_w  (1/2) ‖w‖² + C Σi ξi
s.t.   yi (w·xi + b) ≥ 1 − ξi, ∀xi
       ξi ≥ 0

[Figure: soft-margin solutions showing the separating hyperplane w·x + b = 0 and the slack ξi of violating points; axes Var1 and Var2]
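A minimal training sketch for this objective, using the equivalent hinge-loss form (1/2)‖w‖² + C Σi max(0, 1 − yi(w·xi + b)) and plain subgradient descent; the learning rate and epoch count are illustrative assumptions:

```python
import numpy as np

def train_soft_margin_svm(X, y, C=1.0, lr=1e-3, epochs=2000):
    """Minimize (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)),
    which is equivalent to the slack-variable formulation above (y_i in {-1, +1})."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                              # points violating the margin
        # Subgradient of the objective with respect to w and b.
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```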
Advantages of Nonlinear Surfaces
[Figure: data for which a nonlinear decision surface separates the two classes far better than any line; axes Var1 and Var2]
Linear Classifiers in High-Dimensional Spaces
• Find a function Φ(x) to map the data to a different space
[Figure: data that are not linearly separable in the original space (axes Var1, Var2) become linearly separable in the space of constructed features (axes Constructed Feature 1, Constructed Feature 2)]
Mapping Data to a High-Dimensional Space
• Find a function Φ(x) to map to a different space; the SVM formulation becomes:

min_w  (1/2) ‖w‖² + C Σi ξi
s.t.   yi (w·Φ(xi) + b) ≥ 1 − ξi, ∀xi
       ξi ≥ 0

• Data appear only as Φ(x); the weights w are now weights in the new space
• Explicit mapping is expensive if Φ(x) is very high dimensional
• Solving the problem without explicitly mapping the data is desirable
The Dual of the SVM Formulation
• Original SVM formulation:
  • n inequality constraints
  • n positivity constraints
  • n slack variables ξi

min_{w,b}  (1/2) ‖w‖² + C Σi ξi
s.t.   yi (w·Φ(xi) + b) ≥ 1 − ξi, ∀xi
       ξi ≥ 0

(separating surface: wᵀΦ(x) + b = 0)
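As a sketch of the standard Lagrangian dual of this soft-margin problem (one variable αi per training point), the mapped data enter only through inner products:

```latex
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j\, y_i y_j\, \Phi(x_i)^{\top} \Phi(x_j)
\qquad \text{s.t.}\quad 0 \le \alpha_i \le C,\quad \sum_{i=1}^{n} \alpha_i y_i = 0,
```

with w = Σi αi yi Φ(xi). This dependence on Φ only through inner products Φ(xi)ᵀΦ(xj) is what the kernel trick, described next, exploits.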
Kernel Trick
• Kernel function: a symmetric function
  k : Rᵈ × Rᵈ → R
• Inner product kernels: additionally,
  k(x, z) = Φ(x)ᵀ Φ(z)
• Example: for the degree-2 monomial map Φ (which has O(d²) features), the inner product can be computed in O(d):

Φ(x)ᵀ Φ(z) = Σ_{i,j=1}^{d} (xi xj)(zi zj) = ( Σ_{i=1}^{d} xi zi )² = (xᵀz)² = K(x, z)
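A small numerical check of this identity, comparing the explicit O(d²) degree-2 monomial map with the O(d) kernel evaluation (the vectors below are arbitrary examples):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 monomial map: all products x_i * x_j (O(d^2) features)."""
    return np.outer(x, x).ravel()

def k_poly2(x, z):
    """Same inner product via the kernel trick: (x.z)^2, computed in O(d)."""
    return float(x @ z) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
print(phi(x) @ phi(z))   # inner product in the mapped space
print(k_poly2(x, z))     # identical value, without ever forming phi
```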
Kernel Trick
• Implement an infinite-dimensional mapping implicitly
• Only inner products explicitly needed for training and
evaluation
• Inner products computed efficiently, in finite
dimensions
• The underlying mathematical theory is that of reproducing kernel Hilbert spaces, from functional analysis
Kernel Methods
• If a linear algorithm can be expressed purely in terms of inner products
  • it can be “kernelized”
  • it then finds a linear pattern in the high-dimensional space
  • which corresponds to a nonlinear relation in the original space
• Specific kernel function determines nonlinearity
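As one concrete illustration (the perceptron is chosen here only as a simple kernelizable linear algorithm, not a method from the slides), a minimal kernel perceptron sketch that never forms the weight vector explicitly:

```python
import numpy as np

def kernel_perceptron(X, y, k, epochs=10):
    """Train a kernelized perceptron: alpha[j] counts mistakes on x_j, and the
    implicit decision function is f(x) = sum_j alpha[j] * y[j] * k(X[j], x)."""
    n = X.shape[0]
    alpha = np.zeros(n)
    K = np.array([[k(a, c) for c in X] for a in X])    # pairwise kernel values
    for _ in range(epochs):
        for i in range(n):
            f = np.sum(alpha * y * K[:, i])            # implicit w . phi(x_i)
            if y[i] * f <= 0:                          # mistake: strengthen x_i
                alpha[i] += 1.0
    return alpha
```

Prediction for a new point x again uses only kernel evaluations: sign(Σj αj yj k(xj, x)).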
Kernels
• Some simple kernels:
  • Linear kernel: k(x, z) = xᵀz
    equivalent to the linear algorithm
  • Polynomial kernel: k(x, z) = (1 + xᵀz)^d
    polynomial decision rules
  • RBF kernel: k(x, z) = exp(−‖x − z‖² / 2)
    highly nonlinear decisions
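The three kernels above as a small code sketch; the polynomial degree and the RBF bandwidth sigma are illustrative parameters (with sigma = 1 the RBF kernel matches the expression on the slide):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def poly_kernel(x, z, degree=3):
    return (1.0 + x @ z) ** degree

def rbf_kernel(x, z, sigma=1.0):
    # sigma controls how quickly similarity decays with distance.
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))
```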
Gaussian Kernel: Example
[Figure: a highly nonlinear decision boundary in the input space, corresponding to a hyperplane in some feature space]
Kernel Matrix
• Kernel matrix K defines all pairwise inner products: Kij = k(xi, xj)
• Mercer’s theorem: K is positive semidefinite
• Any symmetric positive semidefinite matrix can be regarded as an inner product matrix in some space
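A minimal sketch of forming a kernel matrix and checking numerically that it is positive semidefinite, using the RBF kernel and random data purely as an example:

```python
import numpy as np

def kernel_matrix(X, k):
    """K[i, j] = k(x_i, x_j) for all pairs of rows of X."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = k(X[i], X[j])
    return K

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
K = kernel_matrix(X, lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0))
# Mercer: all eigenvalues should be non-negative (up to numerical error).
print(np.linalg.eigvalsh(K).min() >= -1e-10)
```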
Kernel-Based Learning
[Diagram: data {(xi, yi)} → embedding via k(x, y) or the kernel matrix K → linear algorithm]
Spatial example: recursive binary splits
[Figure: a 2D scatter of labeled points, progressively partitioned by recursive binary splits]
Spatial example: recursive binary splits
• Once regions are chosen, class probabilities are easy to calculate (e.g., pm = 5/6 in one region)
[Figure: the partitioned scatter with the class proportion pm = 5/6 marked in one region]
How to choose a split
• Impurity measures L(p):
  • Information gain (entropy): −p log p − (1 − p) log(1 − p)
  • Gini index: 2p(1 − p)
  • (0–1 error: 1 − max(p, 1 − p))
• Choose the split s that minimizes N1 L(p1) + N2 L(p2)
[Figure: a candidate split s divides the points into region C1 (N1 = 9, p1 = 8/9) and region C2]
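A minimal sketch of choosing a split on a single feature by this criterion, assuming binary 0/1 labels; the Gini index is used as L(p) here, and entropy would work the same way:

```python
import numpy as np

def gini(p):
    """Gini impurity for a binary class proportion p."""
    return 2.0 * p * (1.0 - p)

def best_split(x, y, impurity=gini):
    """Scan candidate thresholds s on one feature x and return the s that
    minimizes N1 * L(p1) + N2 * L(p2), with p the proportion of class 1."""
    best_s, best_score = None, np.inf
    for s in np.unique(x)[:-1]:              # split between consecutive values
        left, right = y[x <= s], y[x > s]
        score = len(left) * impurity(left.mean()) + len(right) * impurity(right.mean())
        if score < best_score:
            best_s, best_score = s, score
    return best_s, best_score
```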
• To prune, use the 0–1 loss L and minimize Σi L(xi) + λ|T| over trees T (|T| = tree size)
• then choose the penalty λ with cross-validation (increasing λ gives smaller trees)
[Figure: the partitioned scatter of labeled points]
Random Forest
• Randomly sample 2/3 of the data for each tree
• At each node: pick randomly a small number m of input variables to split on
• Combine the trees’ predictions by a VOTE!
[Diagram: Data → random subsamples → trees → VOTE!]
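A minimal sketch of this procedure, assuming scikit-learn's DecisionTreeClassifier for the individual trees and binary 0/1 labels; the 2/3 subsample and the per-node feature count m follow the slide, while the number of trees is an illustrative choice:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, n_trees=100, m=None, seed=0):
    """Grow each tree on a random 2/3 subsample of the data; at every node the
    tree considers only a random subset of m input variables (max_features)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(n_trees):
        idx = rng.choice(n, size=(2 * n) // 3, replace=False)   # ~2/3 of the data
        tree = DecisionTreeClassifier(max_features=m)           # m variables per node
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def forest_predict(trees, X):
    votes = np.stack([t.predict(X) for t in trees])   # one row of votes per tree
    return (votes.mean(axis=0) >= 0.5).astype(int)    # majority vote (0/1 labels)
```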