
Classification

Michael I. Jordan
University of California, Berkeley
Classification
• In classification problems, each entity in some domain can be placed
in one of a discrete set of categories: yes/no, friend/foe,
good/bad/indifferent, blue/red/green, etc.
• Given a training set of labeled entities, develop a rule for assigning
labels to entities in a test set
• Many variations on this theme:
• binary classification
• multi-category classification
• non-exclusive categories
• ranking
• Many criteria to assess rules and their predictions
• overall errors
• costs associated with different kinds of errors
• operating points
Representation of Objects
• Each object to be classified is represented as
a pair (x, y):
• where x is a description of the object (see examples
of data types in the following slides)
• where y is a label (assumed binary for now)
• Success or failure of a machine learning
classifier often depends on choosing good
descriptions of objects
• the choice of description can also be viewed as a
learning problem, and indeed we’ll discuss automated
procedures for choosing descriptions in a later lecture
• but good human intuitions are often needed here
Data Types
• Vectorial data:
• physical attributes
• behavioral attributes
• context
• history
• etc

• We’ll assume for now that such vectors are explicitly represented
  in a table, but later (cf. kernel methods) we’ll relax that assumption
Data Types
• text and hypertext

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">


<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Welcome to FairmontNET</title>
</head>
<STYLE type="text/css">
.stdtext {font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 11px; color: #1F3D4E;}
.stdtext_wh {font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 11px; color: WHITE;}
</STYLE>

<body leftmargin="0" topmargin="0" marginwidth="0" marginheight="0" bgcolor="BLACK">


<TABLE cellpadding="0" cellspacing="0" width="100%" border="0">
<TR>
<TD width=50% background="/TFN/en/CDA/Images/common/labels/decorative_2px_blk.gif">&nbsp;</TD>
<TD><img src="/TFN/en/CDA/Images/common/labels/decorative.gif"></td>
<TD width=50% background="/TFN/en/CDA/Images/common/labels/decorative_2px_blk.gif">&nbsp;</TD>
</TR>
</TABLE>
<tr>
<td align="right" valign="middle"><IMG src="/TFN/en/CDA/Images/common/labels/centrino_logo_blk.gif"></td>
</tr>
</body>
</html>
Data Types
• email
Return-path <[email protected]>Received from relay2.EECS.Berkeley.EDU
(relay2.EECS.Berkeley.EDU [169.229.60.28]) by imap4.CS.Berkeley.EDU (iPlanet Messaging Server
5.2 HotFix 1.16 (built May 14 2003)) with ESMTP id <[email protected]>;
Tue, 08 Jun 2004 11:40:43 -0700 (PDT)Received from relay3.EECS.Berkeley.EDU (localhost
[127.0.0.1]) by relay2.EECS.Berkeley.EDU (8.12.10/8.9.3) with ESMTP id i58Ieg3N000927; Tue, 08
Jun 2004 11:40:43 -0700 (PDT)Received from redbirds (dhcp-168-35.EECS.Berkeley.EDU
[128.32.168.35]) by relay3.EECS.Berkeley.EDU (8.12.10/8.9.3) with ESMTP id i58IegFp007613;
Tue, 08 Jun 2004 11:40:42 -0700 (PDT)Date Tue, 08 Jun 2004 11:40:42 -0700From Robert Miller
<[email protected]>Subject RE: SLT headcount = 25In-reply-
to <[email protected]>To 'Randy Katz'
<[email protected]>Cc "'Glenda J. Smith'" <[email protected]>, 'Gert Lanckriet'
<[email protected]>Message-
id <[email protected]>MIME-version 1.0X-
MIMEOLE Produced By Microsoft MimeOLE V6.00.2800.1409X-Mailer Microsoft Office Outlook, Build
11.0.5510Content-type multipart/alternative; boundary="----
=_NextPart_000_0033_01C44D4D.6DD93AF0"Thread-
index AcRMtQRp+R26lVFaRiuz4BfImikTRAA0wf3Qthe headcount is now 32.
---------------------------------------- Robert Miller, Administrative Specialist University of California,
Berkeley Electronics Research Lab 634 Soda Hall #1776 Berkeley, CA 94720-1776 Phone: 510-
642-6037 fax: 510-643-1289
Data Types
• protein sequences
Data Types
• sequences of Unix system calls
Data Types
• network layout: graph
Data Types
• images
Example: Spam Filter
• Input: email
• Output: spam/ham
• Setup:
  • Get a large collection of example emails, each labeled "spam" or "ham"
  • Note: someone has to hand label all this data
  • Want to learn to predict labels of new, future emails
• Features: The attributes used to make the ham / spam decision
  • Words: FREE!
  • Text Patterns: $dd, CAPS
  • Non-text: SenderInContacts
  • …

Example emails:

"Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …"

"TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT."

"99 MILLION EMAIL ADDRESSES FOR ONLY $99"

"Ok, Iknow this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened."
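To make the feature bullets concrete, here is a minimal sketch (Python, not from the original slides) of how features of this kind might be extracted from one message; the feature names and thresholds are illustrative choices, not a prescribed encoding.

import re

def spam_features(email_text, sender, contacts):
    """Turn one email into a small feature dictionary.

    The features (FREE!, $dd amounts, ALL-CAPS words, sender-in-contacts)
    mirror the kinds of attributes on the slide; the exact encoding here
    is an illustrative assumption.
    """
    words = email_text.split()
    return {
        "has_free": int("FREE!" in words),
        "num_dollar_amounts": len(re.findall(r"\$\d+", email_text)),
        "num_caps_words": sum(1 for w in words if len(w) > 3 and w.isupper()),
        "sender_in_contacts": int(sender in contacts),
    }

# Example usage on one of the messages above
features = spam_features(
    "99 MILLION EMAIL ADDRESSES FOR ONLY $99",
    sender="unknown@example.com",
    contacts={"randy@cs.berkeley.edu"},
)
print(features)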
Example: Digit Recognition
• Input: images / pixel grids
• Output: a digit 0-9
• Setup:
  • Get a large collection of example images, each labeled with a digit
  • Note: someone has to hand label all this data
  • Want to learn to predict labels of new, future digit images
• Features: The attributes used to make the digit decision
  • Pixels: (6,8)=ON
  • Shape Patterns: NumComponents, AspectRatio, NumLoops
  • …
• Current state-of-the-art: Human-level performance?
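As a rough illustration of this pipeline, here is a sketch assuming scikit-learn is available; its bundled digits dataset uses small 8x8 pixel grids rather than the images on the slide, and the classifier and its settings are illustrative choices.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Pixel intensities as the feature vector x, digit label as y
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = SVC(kernel="rbf", gamma=0.001)  # one of the classifiers discussed later
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))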
Other Examples of Real-World
Classification Tasks
• Fraud detection (input: account activity, classes: fraud / no fraud)
• Web page spam detection (input: HTML/rendered page, classes:
spam / ham)
• Speech recognition and speaker recognition (input: waveform,
classes: phonemes or words)
• Medical diagnosis (input: symptoms, classes: diseases)
• Automatic essay grader (input: document, classes: grades)
• Customer service email routing and foldering
• Link prediction in social networks
• Catalytic activity in drug design
• … many many more

• Classification is an important commercial technology


Training and Validation
• Data: labeled instances, e.g. emails marked spam/ham
  • Training set
  • Validation set
  • Test set
• Training
  • Estimate parameters on training set
  • Tune hyperparameters on validation set
  • Report results on test set
  • Anything short of this yields over-optimistic claims
• Evaluation
  • Many different metrics
  • Ideally, the criteria used to train the classifier should be closely
    related to those used to evaluate the classifier
• Statistical issues
  • Want a classifier which does well on test data
  • Overfitting: fitting the training data very closely, but not
    generalizing well
  • Error bars: want realistic (conservative) estimates of accuracy
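A minimal sketch of the train/validation/test discipline described above, assuming scikit-learn; the dataset, the classifier, and the hyperparameter grid are purely illustrative.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Tune a hyperparameter (here the regularization strength C) on the validation set
best_C, best_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    clf = LogisticRegression(C=C, max_iter=5000).fit(X_train, y_train)
    acc = clf.score(X_val, y_val)
    if acc > best_acc:
        best_C, best_acc = C, acc

# Report results once, on the held-out test set
final = LogisticRegression(C=best_C, max_iter=5000).fit(X_train, y_train)
print("test accuracy:", final.score(X_test, y_test))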
Some State-of-the-art
Classifiers
• Support vector machine
• Random forests
• Kernelized logistic regression
• Kernelized discriminant analysis
• Kernelized perceptron
• Bayesian classifiers
• Boosting and other ensemble methods
• (Nearest neighbor)
Intuitive Picture of the Problem
[Figure: a scatter of points from Class1 and Class2]
Some Issues
• There may be a simple separator (e.g., a straight line in 2D or
a hyperplane in general) or there may not
• There may be “noise” of various kinds
• There may be “overlap”
• One should not be deceived by one’s low-dimensional
geometrical intuition
• Some classifiers explicitly represent separators (e.g., straight
lines), while for other classifiers the separation is done
implicitly
• Some classifiers just make a decision as to which class an
object is in; others estimate class probabilities
Methods
I) Instance-based methods:
1) Nearest neighbor
II) Probabilistic models:
1) Naïve Bayes
2) Logistic Regression
III) Linear Models:
1) Perceptron
2) Support Vector Machine
IV) Decision Models:
1) Decision Trees
2) Boosted Decision Trees
3) Random Forest
Linearly Separable Data
[Figure: Class1 and Class2 separated by a linear decision boundary]
Nonlinearly Separable Data
[Figure: Class1 and Class2 separated by a nonlinear classifier]
Which Separating Hyperplane to Use?
[Figure: several candidate separating hyperplanes for the same data in the (x1, x2) plane]
Maximizing the Margin
Select the separating hyperplane that maximizes the margin.
[Figure: two candidate hyperplanes in the (x1, x2) plane, each shown with its margin width]
Support Vectors
[Figure: the maximum-margin hyperplane in the (x1, x2) plane with its margin width; the points lying on the margin boundaries are the support vectors]
Setting Up the Optimization Problem
The maximum margin can be characterized as a solution to an optimization problem:

    max  2 / ||w||
    s.t. (w · x + b) ≥ 1,  ∀x of class 1
         (w · x + b) ≤ −1, ∀x of class 2

[Figure: the hyperplanes w · x + b = 1, w · x + b = 0, and w · x + b = −1 in the (x1, x2) plane; the margin width is 2 / ||w||]
Setting Up the Optimization Problem
• If class 1 corresponds to 1 and class 2 corresponds to −1, we can rewrite

    (w · xi + b) ≥ 1,   ∀xi with yi = 1
    (w · xi + b) ≤ −1,  ∀xi with yi = −1

  as

    yi (w · xi + b) ≥ 1, ∀xi

• So the problem becomes:

    max  2 / ||w||                        or    min  (1/2) ||w||²
    s.t. yi (w · xi + b) ≥ 1, ∀xi               s.t. yi (w · xi + b) ≥ 1, ∀xi
Linear, Hard-Margin SVM Formulation
• Find w, b that solve

    min  (1/2) ||w||²
    s.t. yi (w · xi + b) ≥ 1, ∀xi

• The problem is convex, so there is a unique global minimum value (when feasible)
• There is also a unique minimizer, i.e. the w and b that attain the minimum
• This is a quadratic program
  • very efficient computationally, with procedures that take advantage of the special structure
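The quadratic program above can also be handed to a general-purpose convex solver. A minimal sketch, assuming the cvxpy package and synthetic linearly separable data (both are illustrative assumptions, not part of the slides):

import numpy as np
import cvxpy as cp

# Toy linearly separable data (illustrative)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 0.5, size=(20, 2)),
               rng.normal(-2, 0.5, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

d = X.shape[1]
w = cp.Variable(d)
b = cp.Variable()

# min (1/2)||w||^2  s.t.  y_i (w . x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("margin width =", 2 / np.linalg.norm(w.value))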
Nonlinearly Separable Data
Introduce slack variables ξi: allow some instances to fall within the margin, but penalize them.
[Figure: the hyperplanes w · x + b = 1, w · x + b = 0, and w · x + b = −1 in the (Var1, Var2) plane, with slack ξi measuring how far offending instances fall inside the margin]
Formulating the Optimization Problem
The constraints become:

    yi (w · xi + b) ≥ 1 − ξi, ∀xi
    ξi ≥ 0

The objective function penalizes misclassified instances and those within the margin:

    min  (1/2) ||w||² + C Σi ξi

C trades off margin width and misclassifications.
[Figure: the hyperplanes w · x + b = ±1 and w · x + b = 0 in the (Var1, Var2) plane, with slack variables ξi for instances inside the margin]
Linear, Soft-Margin SVMs

    min  (1/2) ||w||² + C Σi ξi
    s.t. yi (w · xi + b) ≥ 1 − ξi, ∀xi
         ξi ≥ 0

• The algorithm tries to keep the ξi at zero while maximizing the margin
• Notice: the algorithm does not minimize the number of misclassifications (an NP-complete problem) but the sum of distances from the margin hyperplanes
• Other formulations use ξi² instead
• As C → ∞, we approach the hard-margin solution
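To see the role of C in the soft-margin formulation, here is a hedged sketch using scikit-learn's SVC on synthetic overlapping data; the data and the particular C values are illustrative.

import numpy as np
from sklearn.svm import SVC

# Overlapping classes, so a hard margin would be infeasible (illustrative data)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1, 1.0, size=(50, 2)),
               rng.normal(-1, 1.0, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

# Small C tolerates more margin violations; large C approaches the hard margin
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:7.2f}  margin width={2 / np.linalg.norm(w):.3f}  "
          f"support vectors={len(clf.support_)}")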
Robustness of Soft vs Hard Margin SVMs
[Figure: two panels in the (Var1, Var2) plane comparing the decision boundary w · x + b = 0 of a soft-margin SVM, which absorbs difficult points through slack ξi, with that of a hard-margin SVM]
Disadvantages of Linear Decision Surfaces
[Figure: data in the (Var1, Var2) plane that a linear decision surface separates poorly]
Advantages of Nonlinear Surfaces
[Figure: data in the (Var1, Var2) plane separated cleanly by a nonlinear decision surface]
Linear Classifiers in High-Dimensional Spaces
Find a function Φ(x) to map the data to a different space.
[Figure: data that are not linearly separable in the original (Var1, Var2) space become linearly separable in the space of constructed features (Constructed Feature 1, Constructed Feature 2)]
Mapping Data to a High-Dimensional Space
• Find a function Φ(x) to map to a different space; the SVM formulation then becomes:

    min  (1/2) ||w||² + C Σi ξi
    s.t. yi (w · Φ(xi) + b) ≥ 1 − ξi, ∀xi
         ξi ≥ 0

• Data appear as Φ(x); the weights w are now weights in the new space
• Explicit mapping is expensive if Φ(x) is very high dimensional
• Solving the problem without explicitly mapping the data is desirable
The Dual of the SVM Formulation
• Original SVM formulation
  • n inequality constraints
  • n positivity constraints
  • n slack variables ξi

    min over w, b:  (1/2) ||w||² + C Σi ξi
    s.t.            yi (w · Φ(xi) + b) ≥ 1 − ξi, ∀xi
                    ξi ≥ 0

• The (Wolfe) dual of this problem
  • one equality constraint
  • n positivity constraints
  • n variables αi (Lagrange multipliers)
  • objective function more complicated

    min over α:  (1/2) Σi,j αi αj yi yj (Φ(xi) · Φ(xj)) − Σi αi
    s.t.         C ≥ αi ≥ 0, ∀xi
                 Σi αi yi = 0

• NOTE: the data only appear as Φ(xi) · Φ(xj)
The Kernel Trick
• Φ(xi) · Φ(xj) means: map the data into the new space, then take the inner product of the new vectors
• We can find a function such that K(xi, xj) = Φ(xi) · Φ(xj), i.e., the image of the inner product of the data is the inner product of the images of the data
• Then we do not need to explicitly map the data into the high-dimensional space to solve the optimization problem
Example

    X = [x z]        Φ(X) = [x²  z²  xz]

    wᵀΦ(X) + b = 0

    f(X) = sign(w1 x² + w2 z² + w3 xz + b)
Example

    X1 = [x1 z1]     Φ(X1) = [x1²  z1²  √2 x1z1]
    X2 = [x2 z2]     Φ(X2) = [x2²  z2²  √2 x2z2]

    Φ(X1)ᵀ Φ(X2) = [x1²  z1²  √2 x1z1] [x2²  z2²  √2 x2z2]ᵀ      Expensive! O(d²)
                 = x1²x2² + z1²z2² + 2 x1z1x2z2
                 = (x1x2 + z1z2)²
                 = (X1ᵀ X2)²                                      Efficient! O(d)
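The identity above is easy to check numerically; a small sketch in Python with illustrative values:

import numpy as np

def phi(v):
    """Explicit quadratic feature map [x^2, z^2, sqrt(2)*x*z] for a 2-D point."""
    x, z = v
    return np.array([x**2, z**2, np.sqrt(2) * x * z])

X1 = np.array([1.5, -0.5])
X2 = np.array([0.3, 2.0])

explicit = phi(X1) @ phi(X2)     # inner product in the mapped space: O(d^2) features
kernel = (X1 @ X2) ** 2          # same value computed directly in the input space: O(d)

print(explicit, kernel)          # both print the same number
assert np.isclose(explicit, kernel)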
Kernel Trick
• Kernel function: a symmetric function
      k : Rᵈ × Rᵈ → R
• Inner product kernels: additionally,
      k(x, z) = Φ(x)ᵀ Φ(z)
• Example (an O(d²) inner product in feature space computed in O(d)):

    Φ(x)ᵀ Φ(z) = Σ_{i,j=1..d} (xi xj)(zi zj) = ( Σ_{i=1..d} xi zi )² = (xᵀz)² = K(x, z)
Kernel Trick
• Implement an infinite-dimensional mapping implicitly
• Only inner products explicitly needed for training and
evaluation
• Inner products computed efficiently, in finite
dimensions
• The underlying mathematical theory is that of
reproducing kernel Hilbert space from functional
analysis
Kernel Methods
• If a linear algorithm can be expressed only in
terms of inner products
• it can be “kernelized”
• find linear pattern in high-dimensional space
• nonlinear relation in original space
• Specific kernel function determines nonlinearity
Kernels
• Some simple kernels
  • Linear kernel: k(x, z) = xᵀz
    ⇒ equivalent to the linear algorithm
  • Polynomial kernel: k(x, z) = (1 + xᵀz)ᵈ
    ⇒ polynomial decision rules
  • RBF (Gaussian) kernel: k(x, z) = exp(−||x − z||² / (2σ²))
    ⇒ highly nonlinear decisions
Gaussian Kernel: Example
[Figure: a highly nonlinear decision boundary in the input space; it is a hyperplane in some space]
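The three kernels listed above are straightforward to write down explicitly; a minimal sketch, where the default degree and sigma values are arbitrary illustrative choices:

import numpy as np

def linear_kernel(x, z):
    return x @ z

def poly_kernel(x, z, degree=3):
    return (1.0 + x @ z) ** degree

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(linear_kernel(x, z), poly_kernel(x, z), rbf_kernel(x, z))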
Kernel Matrix
• The kernel matrix K, with entries Kij = k(xi, xj), defines all pairwise inner products
• Mercer theorem: K is positive semidefinite
• Any symmetric positive semidefinite matrix can be regarded as an inner product matrix in some space
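A small sketch of building a kernel matrix and checking the Mercer property numerically; the RBF kernel and the random points are illustrative.

import numpy as np

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 4))     # 10 points in R^4 (illustrative)

n = len(X)
K = np.array([[rbf(X[i], X[j]) for j in range(n)] for i in range(n)])

# Symmetric and positive semidefinite, as Mercer's theorem requires
print("symmetric:", np.allclose(K, K.T))
print("min eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 up to round-off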
Kernel-Based Learning
[Diagram: the data {(xi, yi)} are embedded via the kernel k(x, y), i.e. the kernel matrix K, and then passed to a linear algorithm]
Kernel-Based Learning
[Diagram: the same data → embedding → linear algorithm pipeline, split into two modules: kernel design and kernel algorithm]
Kernel Design
• Simple kernels on vector data
• More advanced
• string kernel
• diffusion kernel
• kernels over general structures (sets, trees,
graphs...)
• kernels derived from graphical models
• empirical kernel map
Methods
I) Instance-based methods:
1) Nearest neighbor
II) Probabilistic models:
1) Naïve Bayes
2) Logistic Regression
III) Linear Models:
1) Perceptron
2) Support Vector Machine
IV) Decision Models:
1) Decision Trees
2) Boosted Decision Trees
3) Random Forest
[From Tom Mitchell’s slides]
Spatial example: recursive binary splits
[Figure, shown in several stages: a scatter of points from two classes is partitioned into rectangular regions by successive recursive binary splits]
Once regions are chosen, class probabilities are easy to calculate (e.g. pm = 5/6 in the highlighted region).
How to choose a split
[Figure: a candidate split s divides the points into region C1, with N1 = 9 points and p1 = 8/9, and region C2, with N2 = 6 points and p2 = 5/6]
Impurity measures L(p):
• Information gain (entropy): − p log p − (1 − p) log(1 − p)
• Gini index: 2 p (1 − p)
• ( 0-1 error: 1 − max(p, 1 − p) )

Choose the split s that minimizes

    min over s:  N1 L(p1) + N2 L(p2)

Then choose the region that has the best split.
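A sketch of scoring a candidate split with these impurity measures, using the counts from the figure; the helper names are illustrative.

import numpy as np

def entropy(p):
    """Information-gain impurity -p log p - (1-p) log(1-p), with 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def gini(p):
    return 2 * p * (1 - p)

def split_score(labels_left, labels_right, impurity=gini):
    """N1 * L(p1) + N2 * L(p2) for a candidate split."""
    def region_cost(labels):
        n = len(labels)
        p = np.mean(labels)        # fraction of the positive class in the region
        return n * impurity(p)
    return region_cost(labels_left) + region_cost(labels_right)

# The counts from the slide: p1 = 8/9 in a region of 9 points, p2 = 5/6 in a region of 6
left = np.array([1] * 8 + [0] * 1)
right = np.array([1] * 5 + [0] * 1)
print("Gini score:", split_score(left, right))
print("Entropy score:", split_score(left, right, impurity=entropy))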
Overfitting and pruning
With L the 0-1 loss, prune the tree T by minimizing

    min over T:  Σi L(xi) + λ |T|

and then choose λ with cross-validation (CV).
[Figure: as λ increases, the tree is pruned back and the partition of the plane becomes coarser]
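In scikit-learn this kind of criterion corresponds to cost-complexity pruning, where the penalty is exposed as ccp_alpha; a hedged sketch of choosing it by cross-validation (the dataset and CV settings are illustrative):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate penalty values from the cost-complexity pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.unique(path.ccp_alphas)

# Choose the penalty by cross-validation
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                          X, y, cv=5).mean() for a in alphas]
best = alphas[int(np.argmax(scores))]
print("chosen penalty:", best, "CV accuracy:", max(scores))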
Methods
I) Instance-based methods:
1) Nearest neighbor
II) Probabilistic models:
1) Naïve Bayes
2) Logistic Regression
III) Linear Models:
1) Perceptron
2) Support Vector Machine
IV) Decision Models:
1) Decision Trees
2) Boosted Decision Trees
3) Random Forest
Random Forest
• Randomly sample 2/3 of the data for each tree
• At each node: pick at random a small number m of input variables to split on
• The trees VOTE on the class of each new instance
• Use the Out-Of-Bag samples to:
  - estimate the error
  - choose m
  - estimate variable importance
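A minimal sketch of this recipe with scikit-learn's RandomForestClassifier, where max_features plays the role of m and oob_score enables the out-of-bag error estimate; the dataset and settings are illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# max_features is the number of variables tried at each split (the "m" above);
# oob_score=True uses the out-of-bag samples to estimate the error
forest = RandomForestClassifier(
    n_estimators=300, max_features="sqrt", oob_score=True, random_state=0)
forest.fit(X, y)

print("OOB accuracy estimate:", forest.oob_score_)
print("first few variable importances:", forest.feature_importances_[:5])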
Reading
• All of the methods that we have discussed are presented in the following book:

  Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second Edition). New York: Springer.

• We haven't discussed theory, but if you're interested in the theory of (binary) classification, here's a pointer to get started:

  Bartlett, P., Jordan, M. I., & McAuliffe, J. D. (2006). Convexity, classification and risk bounds. Journal of the American Statistical Association, 101, 138-156.
