Multi-Class Classification
Kai-Wei Chang
CS @ UCLA
[email protected]
Previous Lecture
v Binary linear classification models
v Perceptron, SVMs, Logistic regression
v Prediction is simple:
v Given an example $x$, the prediction is $\text{sgn}(w^T x)$
v Note that all these linear classifiers have the same
inference rule
v In logistic regression, we can further estimate the
probability
v Questions?
This Lecture
What is multiclass
v Output $y \in \{1, 2, 3, \ldots, K\}$
v In some cases, output space can be very large
(i.e., K is very large)
v Each input belongs to exactly one class
(c.f. in multilabel, input belongs to many classes)
Multi-class Applications in NLP?
Two key ideas to solve multiclass
Reduction vs. single classifier
v Reduction
v Future-proof: as binary classification improves, so
does multi-class
v Easy to implement
v Single classifier
v Global optimization: directly minimize the
empirical loss; easier for joint prediction
v Easy to add constraints and domain knowledge
A General Formula
$\hat{y} = \text{argmax}_{y \in \mathcal{Y}} \; f(y; w, x)$
where $x$ is the input, $w$ the model parameters, and $\mathcal{Y}$ the output space.
v Inference/Test: given $w$ and $x$, solve the argmax
v Learning/Training: find a good $w$
v Today: $x \in \mathbb{R}^n$, $\mathcal{Y} = \{1, 2, \ldots, K\}$ (multiclass)
This Lecture
One against all strategy
One against All learning
v Multiclass classifier
v Function $f: \mathbb{R}^n \rightarrow \{1, 2, 3, \ldots, K\}$
v Decompose into binary problems
One-against-All learning algorithm
v Learning: Given a dataset $D = \{(x_i, y_i)\}$,
$x_i \in \mathbb{R}^n$, $y_i \in \{1, 2, 3, \ldots, K\}$
v Decompose into K binary classification tasks
v Learn K models: $w_1, w_2, w_3, \ldots, w_K$
v For class k, construct a binary classification
task as:
v Positive examples: Elements of D with label k
v Negative examples: All other elements of D
v The binary classification can be solved by any
algorithm we have seen
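As a concrete illustration (not from the original slides), here is a minimal NumPy sketch of this decomposition, using a simple perceptron as the binary learner; all function names are illustrative and classes are 0-indexed.

import numpy as np

def train_binary_perceptron(X, y, epochs=10, lr=1.0):
    # X: (N, n) feature matrix; y: (N,) labels in {+1, -1}
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:        # mistake: update the weight vector
                w += lr * yi * xi
    return w

def train_one_vs_all(X, y, K):
    # For each class k, build the binary task "k vs. the rest" and train one model
    weights = []
    for k in range(K):
        y_binary = np.where(y == k, 1.0, -1.0)
        weights.append(train_binary_perceptron(X, y_binary))
    return np.vstack(weights)             # shape (K, n): row k is w_k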
One against All learning
v Multiclass classifier
v Function $f: \mathbb{R}^n \rightarrow \{1, 2, 3, \ldots, K\}$
v Decompose into binary problems
Ideal case: only the correct label will have a positive score
One-against-All inference
v Learning: Given a dataset $D = \{(x_i, y_i)\}$,
$x_i \in \mathbb{R}^n$, $y_i \in \{1, 2, 3, \ldots, K\}$
v Decompose into K binary classification tasks
v Learn K models: $w_1, w_2, w_3, \ldots, w_K$
v Inference: “Winner takes all”
v $\hat{y} = \text{argmax}_{y \in \{1, 2, \ldots, K\}} \; w_y^T x$
For example: $y = \text{argmax}(w_{\text{black}}^T x, \; w_{\text{blue}}^T x, \; w_{\text{green}}^T x)$
v An instance of the general form
$\hat{y} = \text{argmax}_{y \in \mathcal{Y}} \; f(y; w, x)$
with $w = \{w_1, w_2, \ldots, w_K\}$ and $f(y; w, x) = w_y^T x$
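A minimal sketch of the winner-takes-all rule, assuming the (K, n) weight matrix produced by the training sketch above (illustrative):

import numpy as np

def predict_one_vs_all(weights, x):
    # weights: (K, n); pick the class whose score w_k^T x is largest
    scores = weights @ x
    return int(np.argmax(scores))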
One-against-All analysis
v Not always possible to learn
v Assumption: each class individually separable from
all the others
v No theoretical justification
v Need to make sure the range of all classifiers is
the same – we are comparing scores produced
by K classifiers trained independently.
v Easy to implement; works well in practice
One vs. One (All against All) strategy
One vs. One learning
v Multiclass classifier
v Function $f: \mathbb{R}^n \rightarrow \{1, 2, 3, \ldots, K\}$
v Decompose into binary problems
[Figure: “Training” and “Test” panels]
One-vs.-One learning algorithm
v Learning: Given a dataset $D = \{(x_i, y_i)\}$,
$x_i \in \mathbb{R}^n$, $y_i \in \{1, 2, 3, \ldots, K\}$
v Decompose into C(K, 2) binary classification tasks
v Learn C(K, 2) models: $w_1, w_2, w_3, \ldots, w_{K(K-1)/2}$
v For each class pair (i, j), construct a binary
classification task as:
v Positive examples: Elements of D with label i
v Negative examples: Elements of D with label j
v The binary classification can be solved by any
algorithm we have seen
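A minimal NumPy sketch of the pairwise decomposition, reusing the train_binary_perceptron helper from the one-against-all sketch above (illustrative, not from the slides):

import numpy as np
from itertools import combinations

def train_one_vs_one(X, y, K):
    # One binary task per unordered pair (i, j): label i is positive, label j is negative
    pairwise = {}
    for i, j in combinations(range(K), 2):        # C(K, 2) tasks
        mask = (y == i) | (y == j)
        y_ij = np.where(y[mask] == i, 1.0, -1.0)
        pairwise[(i, j)] = train_binary_perceptron(X[mask], y_ij)
    return pairwise                                # K*(K-1)/2 weight vectors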
One-vs.-One inference algorithm
v Decision options:
v More complex; each label gets K-1 votes
v Outputs of the binary classifiers may not cohere
v Majority: classify example x with label i
if i wins on x more often than any j (j = 1, …, K)
v A tournament: start with K/2 pairs; continue with
the winners
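A minimal sketch of the majority-vote rule for the pairwise models above (illustrative):

import numpy as np

def predict_one_vs_one(pairwise, x, K):
    # Classifier (i, j) votes for i if its score is positive, otherwise for j
    votes = np.zeros(K, dtype=int)
    for (i, j), w in pairwise.items():
        votes[i if w @ x > 0 else j] += 1
    return int(np.argmax(votes))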
Classifying with One-vs-one
One-vs.-One assumption
It is possible to separate all K classes with the O(K²) classifiers.
[Figure: decision regions]
Comparisons
v One against all
v O(K) weight vectors to train and store
v Training sets of the binary classifiers may be unbalanced
v Less expressive; makes a strong assumption
Problems with Decompositions
v Learning optimizes over local metrics
v Does not guarantee good global performance
v We don’t care about the performance of the
local classifiers
v Poor decomposition ⇒ poor performance
v Difficult local problems
v Irrelevant local problems
v Efficiency: e.g., All vs. All vs. One vs. All
v Not clear how to generalize multi-class to
problems with a very large number of outputs
Still an ongoing research direction
Key questions:
v How to deal with large number of classes
v How to select the “right” samples to train binary classifiers
v Error-correcting tournaments
[Beygelzimer, Langford, Ravikumar 09]
v…
Decomposition methods: Summary
v General Ideas:
v Decompose the multiclass problem into many
binary problems
v Prediction depends on the decomposition
v Constructs the multiclass label from the output of
the binary classifiers
v Learning optimizes local correctness
v Each binary classifier doesn’t need to be globally
correct and isn’t aware of the prediction procedure
This Lecture
Revisit the One-against-All learning algorithm
Observation
v At training time, we require $w_i^T x$ to be positive for
examples of class $i$.
v Really, all we need is for $w_i^T x$ to be larger than all the
others ⇒ this is a weaker requirement
For examples with label $i$, we need
$w_i^T x > w_j^T x$ for all $j \neq i$
Perceptron-style algorithm
For examples with label $i$, we need
$w_i^T x > w_j^T x$ for all $j \neq i$
Linear Separability with multiple classes
v Models: $w_1, w_2, \ldots, w_K$, each $w_k \in \mathbb{R}^n$
v Input: $x \in \mathbb{R}^n$
Kesler construction
Assume we have a multi-class problem with K classes and n features.
$w^T \phi(x, y) = w_y^T x$
Kesler construction
Assume we have a multi-class problem with K classes and n features.
The multiclass requirement becomes a single binary classification problem:
$w^T \phi(x, i) > w^T \phi(x, j) \quad \forall j$
$\Rightarrow \; w^T [\phi(x, i) - \phi(x, j)] > 0 \quad \forall j$
Here $w = [w_1; w_2; \ldots; w_K] \in \mathbb{R}^{nK \times 1}$ stacks the K weight vectors, and
$\phi(x, i) - \phi(x, j) \in \mathbb{R}^{nK \times 1}$ has $x$ in the $i$-th block, $-x$ in the $j$-th block, and zeros elsewhere.
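A minimal NumPy sketch of this construction (illustrative; classes are 0-indexed and phi places x in the y-th of K blocks):

import numpy as np

def phi(x, y, K):
    # Kesler construction: an nK-vector that is zero everywhere except block y, which holds x
    n = x.shape[0]
    out = np.zeros(n * K)
    out[y * n:(y + 1) * n] = x
    return out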
Linear Separability with multiple classes
What we want:
$w^T \phi(x, i) > w^T \phi(x, j) \quad \forall j$
$\Rightarrow \; w^T [\phi(x, i) - \phi(x, j)] > 0 \quad \forall j$
How can we predict?
For an input $x$, the model predicts
$\text{argmax}_y \; w^T \phi(x, y)$
where $w = [w_1; \ldots; w_K] \in \mathbb{R}^{nK \times 1}$ and $\phi(x, y) \in \mathbb{R}^{nK \times 1}$ has $x$ in the $y$-th block and zeros elsewhere.
[Figure: the points $\phi(x, 1), \ldots, \phi(x, 5)$ and the weight vector $w$; in this example the predicted label is 3.]
This is equivalent to
$\text{argmax}_{y \in \{1, 2, \ldots, K\}} \; w_y^T x$
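A quick numerical check of this equivalence, reusing the phi helper sketched earlier (illustrative):

import numpy as np

rng = np.random.default_rng(0)
K, n = 5, 4
W = rng.normal(size=(K, n))             # per-class weights w_1, ..., w_K as rows
w = W.reshape(-1)                       # the stacked Kesler weight vector
x = rng.normal(size=n)
kesler_scores = [w @ phi(x, y, K) for y in range(K)]
assert int(np.argmax(kesler_scores)) == int(np.argmax(W @ x))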
Constraint Classification
v Goal: $w^T [\phi(x, i) - \phi(x, j)] \geq 0 \quad \forall j$
v Training:
v For each example $(x, i)$:
v Update the model if $w^T [\phi(x, i) - \phi(x, j)] < 0$ for any $j$
[Figure: an example with label 2 is transformed into the constraints 2>1, 2>3, 2>4, each encoded with an $x$ block and a $-x$ block.]
A Perceptron-style Algorithm
Given a training set $\mathcal{D} = \{(x, y)\}$
Initialize $w \leftarrow 0 \in \mathbb{R}^{nK}$
For epoch $1 \ldots T$:
  For $(x, y)$ in $\mathcal{D}$:
    For $y' \neq y$:
      if $w^T [\phi(x, y) - \phi(x, y')] < 0$:
        $w \leftarrow w + \eta \, [\phi(x, y) - \phi(x, y')]$
Return $w$
This needs $|\mathcal{D}| \times K$ updates; do we need all of them?
How to interpret this update rule?
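A minimal NumPy sketch of this update rule, keeping the weights as a K × n matrix rather than the stacked nK vector (the two views are equivalent under the Kesler construction; names are illustrative):

import numpy as np

def train_constraint_perceptron(X, Y, K, epochs=10, lr=1.0):
    # W[k] plays the role of w_k; a violated pair promotes the true label and demotes the violator
    n = X.shape[1]
    W = np.zeros((K, n))
    for _ in range(epochs):
        for x, y in zip(X, Y):
            for y_prime in range(K):
                if y_prime != y and W[y] @ x - W[y_prime] @ x < 0:
                    W[y] += lr * x
                    W[y_prime] -= lr * x
    return W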
An alternative training algorithm
v Goal: $w^T [\phi(x, i) - \phi(x, j)] \geq 0 \quad \forall j$
$\Rightarrow \; w^T \phi(x, i) - \max_{j \neq i} w^T \phi(x, j) \geq 0$
$\Rightarrow \; w^T \phi(x, i) - \max_{j} w^T \phi(x, j) \geq 0$
v Training:
v For each example $(x, i)$:
v Find the prediction of the current model:
$\hat{y} = \text{argmax}_j \; w^T \phi(x, j)$
v Update the model if $w^T [\phi(x, i) - \phi(x, \hat{y})] < 0$
A Perceptron-style Algorithm
Given a training set $\mathcal{D} = \{(x, y)\}$
Initialize $w \leftarrow 0 \in \mathbb{R}^{nK}$
For epoch $1 \ldots T$:
  For $(x, y)$ in $\mathcal{D}$:
    $\hat{y} = \text{argmax}_{y'} \; w^T \phi(x, y')$
    if $w^T [\phi(x, y) - \phi(x, \hat{y})] < 0$:
      $w \leftarrow w + \eta \, [\phi(x, y) - \phi(x, \hat{y})]$
Return $w$
How to interpret this update rule? There are only two situations:
1. $\hat{y} = y$: then $\phi(x, y) - \phi(x, \hat{y}) = 0$
2. $\hat{y} \neq y$: then $w^T [\phi(x, y) - \phi(x, \hat{y})] < 0$
A Perceptron-style Algorithm
Given a training set $\mathcal{D} = \{(x, y)\}$
Initialize $w \leftarrow 0 \in \mathbb{R}^{nK}$
For epoch $1 \ldots T$:
  For $(x, y)$ in $\mathcal{D}$:
    $\hat{y} = \text{argmax}_{y'} \; w^T \phi(x, y')$
    $w \leftarrow w + \eta \, [\phi(x, y) - \phi(x, \hat{y})]$
Return $w$
How to interpret this update rule?
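A minimal NumPy sketch of this argmax-based variant, again with a K × n weight matrix; when the prediction is already correct the update cancels to zero (illustrative):

import numpy as np

def train_multiclass_perceptron(X, Y, K, epochs=10, lr=1.0):
    n = X.shape[1]
    W = np.zeros((K, n))
    for _ in range(epochs):
        for x, y in zip(X, Y):
            y_hat = int(np.argmax(W @ x))   # prediction of the current model
            W[y] += lr * x                  # promote the true label
            W[y_hat] -= lr * x              # demote the predicted label (no-op if y_hat == y)
    return W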
Consider multiclass margin
Marginal constraint classifier
Remarks
v This approach can be generalized to train a
ranker; in fact, any output structure
v We have preference over label assignments
v E.g., rank search results, rank movies / products
A peek at a generalized Perceptron model
Recap: A Perceptron-style Algorithm
Given a training set $\mathcal{D} = \{(x, y)\}$
Initialize $w_1, \ldots, w_K \leftarrow 0 \in \mathbb{R}^n$
For epoch $1 \ldots T$:
  For $(x, y)$ in $\mathcal{D}$:
    For $y' \neq y$:
      if $w_y^T x < w_{y'}^T x$:   (a mistake is made)
        $w_y \leftarrow w_y + \eta x$   (promote y)
        $w_{y'} \leftarrow w_{y'} - \eta x$   (demote y')
Return $w$
Recap: Kesler construction
Assume we have a multi-class problem with K classes and n features.
$w^T \phi(x, y) = w_y^T x$
Geometric interpretation (# features = n; # classes = K)
[Figure: the stacked weight vector $w = [w_1; \ldots; w_K]$ and the points $\phi(x, 1), \ldots, \phi(x, 5)$ in $\mathbb{R}^{nK}$.]
Recap: A Perceptron-style Algorithm
Given a training set $\mathcal{D} = \{(x, y)\}$
Initialize $w \leftarrow 0 \in \mathbb{R}^{nK}$
For epoch $1 \ldots T$:
  For $(x, y)$ in $\mathcal{D}$:
    $\hat{y} = \text{argmax}_{y'} \; w^T \phi(x, y')$
    $w \leftarrow w + \eta \, [\phi(x, y) - \phi(x, \hat{y})]$
Return $w$
How to interpret this update rule?
Multi-category to Constraint Classification
v Multiclass
v (x, A) ⇒ (x, (A>B, A>C, A>D))
v Multilabel
v (x, (A, B)) ⇒ (x, (A>C, A>D, B>C, B>D))
v Label Ranking
v (x, (5>4>3>2>1)) ⇒ (x, (5>4, 4>3, 3>2, 2>1))
Generalized constraint classifiers
v In all cases, we have examples (x, y) with y ∈ $S_k$
v where $S_k$ is a partial order over the class labels {1, ..., k}
v that defines a “preference” relation ( > ) for class labeling
v Consequently, the constraint classifier is h: X → $S_k$
v h(x) is a partial order
v h(x) is consistent with y if (i<j) ∈ y ⇒ (i<j) ∈ h(x)
Multi-category to Constraint Classification
Properties of Construction (Zimak et al., 2002, 2003)
This Lecture
Recall: Margin for binary classifiers
[Figure: positive (+) and negative (-) points on either side of a separating hyperplane; the margin is measured with respect to this hyperplane.]
Multi-class SVM
Multiclass Margin
Multiclass SVM (Intuition)
v Binary SVM
v Maximize margin. Equivalently,
minimize the norm of the weights such that the closest
points to the hyperplane have a score of 1
v Multiclass SVM
v Each label has a different weight vector
(like one-vs-all)
v Maximize multiclass margin. Equivalently,
Minimize total norm of the weights such that the true label
is scored at least 1 more than the second best one
Multiclass SVM in the separable case
Recall the hard binary SVM: the objective is the size of the weights
(effectively, a regularizer), subject to every example satisfying the margin constraint.
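The formula on this slide did not survive the text extraction; a standard hard multiclass SVM consistent with the description above (stated here as an assumption, not recovered from the slide) is
$\min_{w} \; \frac{1}{2} \sum_k w_k^T w_k \quad \text{s.t.} \quad w_{y_i}^T x_i - w_k^T x_i \geq 1 \quad \forall k \neq y_i, \; \forall i$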
Multiclass SVM: General case
v Size of the weights: effectively, a regularizer
v Total slack: effectively, don’t allow too many
examples to violate the margin constraint
v Slack variables $\xi_i$: not all examples need to
satisfy the margin constraint
v The score for the true label is higher than the
score for any other label by at least $1 - \xi_i$
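The formula is likewise missing from the extracted text; a standard soft-margin multiclass SVM matching the annotations above (an assumption) is
$\min_{w, \xi} \; \frac{1}{2} \sum_k w_k^T w_k + C \sum_i \xi_i \quad \text{s.t.} \quad w_{y_i}^T x_i - w_k^T x_i \geq 1 - \xi_i, \;\; \xi_i \geq 0 \quad \forall k \neq y_i, \; \forall i$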
Rewrite it as an unconstrained problem
Let’s define:
$\Delta(y, y') = \delta$ if $y \neq y'$, and $0$ if $y = y'$
$\min_{w} \; \frac{1}{2} \sum_k w_k^T w_k + C \sum_i \max_k \left( \Delta(y_i, k) + w_k^T x_i - w_{y_i}^T x_i \right)$
Multiclass SVM
v Generalizes binary SVM algorithm
v If we have only two classes, this reduces to the
binary SVM (up to a scaling factor)
Exercise!
v Write down SGD for multiclass SVM
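One possible NumPy sketch of a solution, doing stochastic subgradient descent on the unconstrained objective above (the step size, epoch count, and the choice Δ = 1 are assumptions):

import numpy as np

def sgd_multiclass_svm(X, Y, K, C=1.0, lr=0.01, epochs=10):
    # Objective: 1/2 sum_k ||w_k||^2 + C sum_i max_k( Delta(y_i, k) + w_k^T x_i - w_{y_i}^T x_i )
    # with Delta(y, k) = 1 if k != y and 0 otherwise.
    N, n = X.shape
    W = np.zeros((K, n))
    for _ in range(epochs):
        for x, y in zip(X, Y):
            margins = W @ x - W[y] @ x + 1.0    # loss-augmented scores
            margins[y] = 0.0                    # Delta(y, y) = 0
            k_star = int(np.argmax(margins))    # most violating label
            grad = W / N                        # regularizer subgradient, spread over N examples
            if k_star != y:                     # hinge term is active
                grad[k_star] += C * x
                grad[y] -= C * x
            W -= lr * grad
    return W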
This Lecture
Recall: (binary) logistic regression
$\min_{w} \; \frac{1}{2} w^T w + C \sum_i \log\!\left(1 + e^{-y_i (w^T x_i)}\right)$
(multi-class) log-linear model
$P(y \mid x, w) = \dfrac{\exp(w_y^T x)}{\sum_{y' \in \{1, 2, \ldots, K\}} \exp(w_{y'}^T x)}$
v This is a valid probability distribution. Why?
v Another way to write this (with the Kesler construction) is
$P(y \mid x, w) = \dfrac{\exp(w^T \phi(x, y))}{\sum_{y' \in \{1, 2, \ldots, K\}} \exp(w^T \phi(x, y'))}$
This is often called softmax.
Softmax
v Softmax: let $s(y)$ be the score for output $y$;
here $s(y) = w^T \phi(x, y)$ (or $w_y^T x$), but it can be
computed in other ways.
$P(y) = \dfrac{\exp(s(y))}{\sum_{y' \in \{1, 2, \ldots, K\}} \exp(s(y'))}$
With a temperature $\sigma$:
$P(y \mid \sigma) = \dfrac{\exp(s(y)/\sigma)}{\sum_{y' \in \{1, 2, \ldots, K\}} \exp(s(y')/\sigma)}$
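A minimal NumPy sketch of softmax with a temperature σ; subtracting the maximum score is a standard numerical-stability trick that is not on the slide:

import numpy as np

def softmax(scores, sigma=1.0):
    # P(y) proportional to exp(s(y) / sigma)
    z = np.asarray(scores, dtype=float) / sigma
    z -= z.max()                  # for numerical stability
    e = np.exp(z)
    return e / e.sum()

With scores like those in the example that follows, a small temperature (e.g., sigma=0.5) pushes the distribution toward the hard max, while a large one (e.g., sigma=2) flattens it.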
Example
s(dog) = 0.5; s(cat) = 1; s(mouse) = 0.3; s(duck) = -0.2
[Figure: bar chart comparing the raw scores, the (hard) max ($\sigma \to 0$), softmax with $\sigma = 0.5$, and softmax with $\sigma = 2$.]
Log-linear model
$P(y \mid x, w) = \dfrac{\exp(w_y^T x)}{\sum_{y' \in \{1, 2, \ldots, K\}} \exp(w_{y'}^T x)}$
Maximum log-likelihood estimation
v Training can be done by maximum log-likelihood
estimation, i.e., $\max_w \log P(D \mid w)$
$D = \{(x_i, y_i)\}$
$P(D \mid w) = \prod_i \dfrac{\exp(w_{y_i}^T x_i)}{\sum_{y' \in \{1, 2, \ldots, K\}} \exp(w_{y'}^T x_i)}$
Maximum a posteriori
$D = \{(x_i, y_i)\}$   Can you use the Kesler construction to rewrite this formulation?
$P(w \mid D) \propto P(w) \, P(D \mid w)$
$\max_w \; -\frac{1}{2} \sum_y w_y^T w_y + C \sum_i \left[ w_{y_i}^T x_i - \log \sum_{y' \in \{1, 2, \ldots, K\}} \exp(w_{y'}^T x_i) \right]$
or
$\min_w \; \frac{1}{2} \sum_y w_y^T w_y + C \sum_i \left[ \log \sum_{y' \in \{1, 2, \ldots, K\}} \exp(w_{y'}^T x_i) - w_{y_i}^T x_i \right]$
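A minimal NumPy sketch of this regularized negative log-likelihood and its gradient, with the weights as a K × n matrix (illustrative names; the gradient of the data term has the familiar form (P(k | x_i) - 1{k = y_i}) x_i):

import numpy as np

def multiclass_logreg_loss_grad(W, X, Y, C=1.0):
    # W: (K, n); X: (N, n); Y: (N,) integer labels in {0, ..., K-1}
    N = X.shape[0]
    scores = X @ W.T                               # (N, K)
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)      # P(y | x, w) for every class
    log_likelihood = np.log(probs[np.arange(N), Y]).sum()
    loss = 0.5 * np.sum(W * W) - C * log_likelihood
    G = probs.copy()
    G[np.arange(N), Y] -= 1.0                      # P(k | x_i) - 1{k = y_i}
    grad = W + C * (G.T @ X)                       # (K, n)
    return loss, grad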
Comparisons
v Multi-class SVM:
$\min_w \; \frac{1}{2} \sum_k w_k^T w_k + C \sum_i \max_k \left( \Delta(y_i, k) + w_k^T x_i - w_{y_i}^T x_i \right)$
v Binary SVM:
$\min_w \; \frac{1}{2} w^T w + C \sum_i \max\left(0, \, 1 - y_i (w^T x_i)\right)$
v Log-linear model (logistic regression):
$\min_w \; \frac{1}{2} w^T w + C \sum_i \log\!\left(1 + e^{-y_i (w^T x_i)}\right)$