Multi-Class Classification
Kai-Wei Chang
CS @ UCLA
[email protected]
Previous Lecture
v Binary linear classification models
v Perceptron, SVMs, Logistic regression
v Prediction is simple:
v Given an example $x$, the prediction is $\text{sgn}(w^T x)$
v Note that all these linear classifiers have the same
inference rule
v In logistic regression, we can further estimate the
probability
v Questions?
This Lecture
What is multiclass
v Output $y \in \{1, 2, 3, \ldots, K\}$
v In some cases, output space can be very large
(i.e., K is very large)
v Each input belongs to exactly one class
(c.f. in multilabel, input belongs to many classes)
Multi-class Applications in NLP?
Two key ideas to solve multiclass
Reduction vs. single classifier
v Reduction
v Future-proof: as binary classification improves, so
does multi-class
v Easy to implement
v Single classifier
v Global optimization: directly minimize the
empirical loss; easier for joint prediction
v Easy to add constraints and domain knowledge
A General Formula
$\hat{y} = \text{argmax}_{y \in \mathcal{Y}} \; f(y; w, x)$
where $x$ is the input, $w$ the model parameters, and $\mathcal{Y}$ the output space.
v Inference/Test: given $w$ and $x$, solve the argmax
v Learning/Training: find a good $w$
v Today: $x \in \mathbb{R}^n$, $\mathcal{Y} = \{1, 2, \ldots, K\}$ (multiclass)
This Lecture
One against all strategy
One against All learning
v Multiclass classifier
v Function $f: \mathbb{R}^n \rightarrow \{1, 2, 3, \ldots, K\}$
v Decompose into binary problems
One-against-All learning algorithm
v Learning: Given a dataset $D = \{(x_i, y_i)\}$,
$x_i \in \mathbb{R}^n$, $y_i \in \{1, 2, 3, \ldots, K\}$
v Decompose into K binary classification tasks
v Learn K models: $w_1, w_2, w_3, \ldots, w_K$
v For class k, construct a binary classification
task as:
v Positive examples: Elements of D with label k
v Negative examples: All other elements of D
v The binary classification can be solved by any
algorithm we have seen
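As a concrete illustration (not from the original slides), here is a minimal NumPy sketch of this decomposition, using a simple perceptron as the binary learner; all function names are illustrative and classes are 0-indexed.

import numpy as np

def train_binary_perceptron(X, y, epochs=10, lr=1.0):
    # X: (N, n) feature matrix; y: (N,) labels in {+1, -1}
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:        # mistake: update the weight vector
                w += lr * yi * xi
    return w

def train_one_vs_all(X, y, K):
    # For each class k, build the binary task "k vs. the rest" and train one model
    weights = []
    for k in range(K):
        y_binary = np.where(y == k, 1.0, -1.0)
        weights.append(train_binary_perceptron(X, y_binary))
    return np.vstack(weights)             # shape (K, n): row k is w_k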
One against All learning
v Multiclass classifier
v Function $f: \mathbb{R}^n \rightarrow \{1, 2, 3, \ldots, K\}$
v Decompose into binary problems
Ideal case: only the correct label will have a positive score
One-against-All inference
v Learning: Given a dataset $D = \{(x_i, y_i)\}$,
$x_i \in \mathbb{R}^n$, $y_i \in \{1, 2, 3, \ldots, K\}$
v Decompose into K binary classification tasks
v Learn K models: $w_1, w_2, w_3, \ldots, w_K$
v Inference: “Winner takes all”
v $\hat{y} = \text{argmax}_{y \in \{1, 2, \ldots, K\}} \; w_y^T x$
For example: $y = \text{argmax}(w_{\text{black}}^T x, \; w_{\text{blue}}^T x, \; w_{\text{green}}^T x)$
v An instance of the general form
$\hat{y} = \text{argmax}_{y \in \mathcal{Y}} \; f(y; w, x)$
with $w = \{w_1, w_2, \ldots, w_K\}$ and $f(y; w, x) = w_y^T x$
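A minimal sketch of the winner-takes-all rule, assuming the (K, n) weight matrix produced by the training sketch above (illustrative):

import numpy as np

def predict_one_vs_all(weights, x):
    # weights: (K, n); pick the class whose score w_k^T x is largest
    scores = weights @ x
    return int(np.argmax(scores))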
One-against-All analysis
v Not always possible to learn
v Assumption: each class individually separable from
all the others
v No theoretical justification
v Need to make sure the range of all classifiers is
the same – we are comparing scores produced
by K classifiers trained independently.
v Easy to implement; works well in practice
One vs. One (All against All) strategy
One vs. One learning
v Multiclass classifier
v Function $f: \mathbb{R}^n \rightarrow \{1, 2, 3, \ldots, K\}$
v Decompose into binary problems
[Figure: “Training” and “Test” panels]
One-vs.-One learning algorithm
v Learning: Given a dataset $D = \{(x_i, y_i)\}$,
$x_i \in \mathbb{R}^n$, $y_i \in \{1, 2, 3, \ldots, K\}$
v Decompose into C(K, 2) binary classification tasks
v Learn C(K, 2) models: $w_1, w_2, w_3, \ldots, w_{K(K-1)/2}$
v For each class pair (i, j), construct a binary
classification task as:
v Positive examples: Elements of D with label i
v Negative examples: Elements of D with label j
v The binary classification can be solved by any
algorithm we have seen
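A minimal NumPy sketch of the pairwise decomposition, reusing the train_binary_perceptron helper from the one-against-all sketch above (illustrative, not from the slides):

import numpy as np
from itertools import combinations

def train_one_vs_one(X, y, K):
    # One binary task per unordered pair (i, j): label i is positive, label j is negative
    pairwise = {}
    for i, j in combinations(range(K), 2):        # C(K, 2) tasks
        mask = (y == i) | (y == j)
        y_ij = np.where(y[mask] == i, 1.0, -1.0)
        pairwise[(i, j)] = train_binary_perceptron(X[mask], y_ij)
    return pairwise                                # K*(K-1)/2 weight vectors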
One-vs.-One inference algorithm
v Decision options:
v More complex; each label gets K-1 votes
v Outputs of the binary classifiers may not cohere
v Majority: classify example x with label i
if i wins on x more often than any j (j = 1, …, K)
v A tournament: start with K/2 pairs; continue with
the winners
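A minimal sketch of the majority-vote rule for the pairwise models above (illustrative):

import numpy as np

def predict_one_vs_one(pairwise, x, K):
    # Classifier (i, j) votes for i if its score is positive, otherwise for j
    votes = np.zeros(K, dtype=int)
    for (i, j), w in pairwise.items():
        votes[i if w @ x > 0 else j] += 1
    return int(np.argmax(votes))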
Classifying with One-vs-one
One-vs.-One assumption
It is possible to separate all K classes with the O(K²) classifiers.
[Figure: decision regions]
Comparisons
v One against all
v O(K) weight vectors to train and store
v Training sets of the binary classifiers may be unbalanced
v Less expressive; makes a strong assumption
Problems with Decompositions
v Learning optimizes over local metrics
v Does not guarantee good global performance
v We don’t care about the performance of the
local classifiers
v Poor decomposition ⇒ poor performance
v Difficult local problems
v Irrelevant local problems
v Efficiency: e.g., All vs. All vs. One vs. All
v Not clear how to generalize multi-class to
problems with a very large number of outputs
Still an ongoing research direction
Key questions:
v How to deal with large number of classes
v How to select the “right” samples to train binary classifiers
v Error-correcting tournaments
[Beygelzimer, Langford, Ravikumar 09]
v…
Decomposition methods: Summary
v General Ideas:
v Decompose the multiclass problem into many
binary problems
v Prediction depends on the decomposition
v Constructs the multiclass label from the output of
the binary classifiers
v Learning optimizes local correctness
v Each binary classifier doesn’t need to be globally
correct and isn’t aware of the prediction procedure
This Lecture
Revisit the One-against-All learning algorithm
Observation
v At training time, we require $w_i^T x$ to be positive for
examples of class $i$.
v Really, all we need is for $w_i^T x$ to be larger than all the
others ⇒ this is a weaker requirement
For examples with label $i$, we need
$w_i^T x > w_j^T x$ for all $j \neq i$
Perceptron-style algorithm
For examples with label $i$, we need
$w_i^T x > w_j^T x$ for all $j \neq i$
Linear Separability with multiple classes
v Models: $w_1, w_2, \ldots, w_K$, each $w_k \in \mathbb{R}^n$
v Input: $x \in \mathbb{R}^n$
Kesler construction
Assume we have a multi-class problem with K classes and n features.
$w^T \phi(x, y) = w_y^T x$
Kesler construction
Assume we have a multi-class problem with K classes and n features.
The multiclass requirement becomes a single binary classification problem:
$w^T \phi(x, i) > w^T \phi(x, j) \quad \forall j$
$\Rightarrow \; w^T [\phi(x, i) - \phi(x, j)] > 0 \quad \forall j$
Here $w = [w_1; w_2; \ldots; w_K] \in \mathbb{R}^{nK \times 1}$ stacks the K weight vectors, and
$\phi(x, i) - \phi(x, j) \in \mathbb{R}^{nK \times 1}$ has $x$ in the $i$-th block, $-x$ in the $j$-th block, and zeros elsewhere.
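A minimal NumPy sketch of this construction (illustrative; classes are 0-indexed and phi places x in the y-th of K blocks):

import numpy as np

def phi(x, y, K):
    # Kesler construction: an nK-vector that is zero everywhere except block y, which holds x
    n = x.shape[0]
    out = np.zeros(n * K)
    out[y * n:(y + 1) * n] = x
    return out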
Linear Separability with multiple classes
What we want:
$w^T \phi(x, i) > w^T \phi(x, j) \quad \forall j$
$\Rightarrow \; w^T [\phi(x, i) - \phi(x, j)] > 0 \quad \forall j$
How can we predict?
For an input $x$, the model predicts
$\text{argmax}_y \; w^T \phi(x, y)$
where $w = [w_1; \ldots; w_K] \in \mathbb{R}^{nK \times 1}$ and $\phi(x, y) \in \mathbb{R}^{nK \times 1}$ has $x$ in the $y$-th block and zeros elsewhere.
[Figure: the points $\phi(x, 1), \ldots, \phi(x, 5)$ and the weight vector $w$; in this example the predicted label is 3.]
This is equivalent to
$\text{argmax}_{y \in \{1, 2, \ldots, K\}} \; w_y^T x$
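A quick numerical check of this equivalence, reusing the phi helper sketched earlier (illustrative):

import numpy as np

rng = np.random.default_rng(0)
K, n = 5, 4
W = rng.normal(size=(K, n))             # per-class weights w_1, ..., w_K as rows
w = W.reshape(-1)                       # the stacked Kesler weight vector
x = rng.normal(size=n)
kesler_scores = [w @ phi(x, y, K) for y in range(K)]
assert int(np.argmax(kesler_scores)) == int(np.argmax(W @ x))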
Constraint Classification
v Goal: $w^T [\phi(x, i) - \phi(x, j)] \geq 0 \quad \forall j$
v Training:
v For each example $(x, i)$:
v Update the model if $w^T [\phi(x, i) - \phi(x, j)] < 0$ for any $j$
[Figure: an example with label 2 is transformed into the constraints 2>1, 2>3, 2>4, each encoded with an $x$ block and a $-x$ block.]
A Perceptron-style Algorithm
Given a training set $\mathcal{D} = \{(x, y)\}$
Initialize $w \leftarrow 0 \in \mathbb{R}^{nK}$
For epoch $1 \ldots T$:
  For $(x, y)$ in $\mathcal{D}$:
    For $y' \neq y$:
      if $w^T [\phi(x, y) - \phi(x, y')] < 0$:
        $w \leftarrow w + \eta \, [\phi(x, y) - \phi(x, y')]$
Return $w$
This needs $|\mathcal{D}| \times K$ updates; do we need all of them?
How to interpret this update rule?
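A minimal NumPy sketch of this update rule, keeping the weights as a K × n matrix rather than the stacked nK vector (the two views are equivalent under the Kesler construction; names are illustrative):

import numpy as np

def train_constraint_perceptron(X, Y, K, epochs=10, lr=1.0):
    # W[k] plays the role of w_k; a violated pair promotes the true label and demotes the violator
    n = X.shape[1]
    W = np.zeros((K, n))
    for _ in range(epochs):
        for x, y in zip(X, Y):
            for y_prime in range(K):
                if y_prime != y and W[y] @ x - W[y_prime] @ x < 0:
                    W[y] += lr * x
                    W[y_prime] -= lr * x
    return W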
An alternative training algorithm
v Goal: $w^T [\phi(x, i) - \phi(x, j)] \geq 0 \quad \forall j$
$\Rightarrow \; w^T \phi(x, i) - \max_{j \neq i} w^T \phi(x, j) \geq 0$
$\Rightarrow \; w^T \phi(x, i) - \max_{j} w^T \phi(x, j) \geq 0$
v Training:
v For each example $(x, i)$:
v Find the prediction of the current model:
$\hat{y} = \text{argmax}_j \; w^T \phi(x, j)$
v Update the model if $w^T [\phi(x, i) - \phi(x, \hat{y})] < 0$
A Perceptron-style Algorithm
Given a training set $\mathcal{D} = \{(x, y)\}$
Initialize $w \leftarrow 0 \in \mathbb{R}^{nK}$
For epoch $1 \ldots T$:
  For $(x, y)$ in $\mathcal{D}$:
    $\hat{y} = \text{argmax}_{y'} \; w^T \phi(x, y')$
    if $w^T [\phi(x, y) - \phi(x, \hat{y})] < 0$:
      $w \leftarrow w + \eta \, [\phi(x, y) - \phi(x, \hat{y})]$
Return $w$
How to interpret this update rule? There are only two situations:
1. $\hat{y} = y$: then $\phi(x, y) - \phi(x, \hat{y}) = 0$
2. $\hat{y} \neq y$: then $w^T [\phi(x, y) - \phi(x, \hat{y})] < 0$
A Perceptron-style Algorithm
Given a training set $\mathcal{D} = \{(x, y)\}$
Initialize $w \leftarrow 0 \in \mathbb{R}^{nK}$
For epoch $1 \ldots T$:
  For $(x, y)$ in $\mathcal{D}$:
    $\hat{y} = \text{argmax}_{y'} \; w^T \phi(x, y')$
    $w \leftarrow w + \eta \, [\phi(x, y) - \phi(x, \hat{y})]$
Return $w$
How to interpret this update rule?
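A minimal NumPy sketch of this argmax-based variant, again with a K × n weight matrix; when the prediction is already correct the update cancels to zero (illustrative):

import numpy as np

def train_multiclass_perceptron(X, Y, K, epochs=10, lr=1.0):
    n = X.shape[1]
    W = np.zeros((K, n))
    for _ in range(epochs):
        for x, y in zip(X, Y):
            y_hat = int(np.argmax(W @ x))   # prediction of the current model
            W[y] += lr * x                  # promote the true label
            W[y_hat] -= lr * x              # demote the predicted label (no-op if y_hat == y)
    return W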
Consider multiclass margin
Marginal constraint classifier
Remarks
v This approach can be generalized to train a
ranker; in fact, any output structure
v We have preference over label assignments
v E.g., rank search results, rank movies / products
A peek at a generalized Perceptron model
Recap: A Perceptron-style Algorithm
Given a training set $\mathcal{D} = \{(x, y)\}$
Initialize $w_1, \ldots, w_K \leftarrow 0 \in \mathbb{R}^n$
For epoch $1 \ldots T$:
  For $(x, y)$ in $\mathcal{D}$:
    For $y' \neq y$:
      if $w_y^T x < w_{y'}^T x$:   (a mistake is made)
        $w_y \leftarrow w_y + \eta x$   (promote y)
        $w_{y'} \leftarrow w_{y'} - \eta x$   (demote y')
Return $w$
Recap: Kesler construction
Assume we have a multi-class problem with K classes and n features.
$w^T \phi(x, y) = w_y^T x$
Geometric interpretation (# features = n; # classes = K)
[Figure: the stacked weight vector $w = [w_1; \ldots; w_K]$ and the points $\phi(x, 1), \ldots, \phi(x, 5)$ in $\mathbb{R}^{nK}$.]
Recap: A Perceptron-style Algorithm
Given a training set $\mathcal{D} = \{(x, y)\}$
Initialize $w \leftarrow 0 \in \mathbb{R}^{nK}$
For epoch $1 \ldots T$:
  For $(x, y)$ in $\mathcal{D}$:
    $\hat{y} = \text{argmax}_{y'} \; w^T \phi(x, y')$
    $w \leftarrow w + \eta \, [\phi(x, y) - \phi(x, \hat{y})]$
Return $w$
How to interpret this update rule?
Multi-category to Constraint Classification
v Multiclass
v (x, A) ⇒ (x, (A>B, A>C, A>D))
v Multilabel
v (x, (A, B)) ⇒ (x, (A>C, A>D, B>C, B>D))
v Label Ranking
v (x, (5>4>3>2>1)) ⇒ (x, (5>4, 4>3, 3>2, 2>1))
Generalized constraint classifiers
v In all cases, we have examples (x, y) with y ∈ $S_k$
v where $S_k$ is a partial order over the class labels {1, ..., k}
v that defines a “preference” relation ( > ) for class labeling
v Consequently, the constraint classifier is h: X → $S_k$
v h(x) is a partial order
v h(x) is consistent with y if (i<j) ∈ y ⇒ (i<j) ∈ h(x)
Multi-category to Constraint Classification
Properties of Construction (Zimak et al., 2002, 2003)
This Lecture
Recall: Margin for binary classifiers
[Figure: positive (+) and negative (-) points on either side of a separating hyperplane; the margin is measured with respect to this hyperplane.]
Multi-class SVM
Multiclass Margin
Multiclass SVM (Intuition)
v Binary SVM
v Maximize margin. Equivalently,
minimize the norm of the weights such that the closest
points to the hyperplane have a score of 1
v Multiclass SVM
v Each label has a different weight vector
(like one-vs-all)
v Maximize multiclass margin. Equivalently,
Minimize total norm of the weights such that the true label
is scored at least 1 more than the second best one
Multiclass SVM in the separable case
Recall the hard binary SVM: the objective is the size of the weights
(effectively, a regularizer), subject to every example satisfying the margin constraint.
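The formula on this slide did not survive the text extraction; a standard hard multiclass SVM consistent with the description above (stated here as an assumption, not recovered from the slide) is
$\min_{w} \; \frac{1}{2} \sum_k w_k^T w_k \quad \text{s.t.} \quad w_{y_i}^T x_i - w_k^T x_i \geq 1 \quad \forall k \neq y_i, \; \forall i$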
Multiclass SVM: General case
v Size of the weights: effectively, a regularizer
v Total slack: effectively, don’t allow too many
examples to violate the margin constraint
v Slack variables $\xi_i$: not all examples need to
satisfy the margin constraint
v The score for the true label is higher than the
score for any other label by at least $1 - \xi_i$
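The formula is likewise missing from the extracted text; a standard soft-margin multiclass SVM matching the annotations above (an assumption) is
$\min_{w, \xi} \; \frac{1}{2} \sum_k w_k^T w_k + C \sum_i \xi_i \quad \text{s.t.} \quad w_{y_i}^T x_i - w_k^T x_i \geq 1 - \xi_i, \;\; \xi_i \geq 0 \quad \forall k \neq y_i, \; \forall i$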
Rewrite it as an unconstrained problem
Let’s define:
$\Delta(y, y') = \delta$ if $y \neq y'$, and $0$ if $y = y'$
$\min_{w} \; \frac{1}{2} \sum_k w_k^T w_k + C \sum_i \max_k \left( \Delta(y_i, k) + w_k^T x_i - w_{y_i}^T x_i \right)$
Multiclass SVM
v Generalizes binary SVM algorithm
v If we have only two classes, this reduces to the
binary SVM (up to a scaling factor)
Exercise!
v Write down SGD for multiclass SVM
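One possible NumPy sketch of a solution, doing stochastic subgradient descent on the unconstrained objective above (the step size, epoch count, and the choice Δ = 1 are assumptions):

import numpy as np

def sgd_multiclass_svm(X, Y, K, C=1.0, lr=0.01, epochs=10):
    # Objective: 1/2 sum_k ||w_k||^2 + C sum_i max_k( Delta(y_i, k) + w_k^T x_i - w_{y_i}^T x_i )
    # with Delta(y, k) = 1 if k != y and 0 otherwise.
    N, n = X.shape
    W = np.zeros((K, n))
    for _ in range(epochs):
        for x, y in zip(X, Y):
            margins = W @ x - W[y] @ x + 1.0    # loss-augmented scores
            margins[y] = 0.0                    # Delta(y, y) = 0
            k_star = int(np.argmax(margins))    # most violating label
            grad = W / N                        # regularizer subgradient, spread over N examples
            if k_star != y:                     # hinge term is active
                grad[k_star] += C * x
                grad[y] -= C * x
            W -= lr * grad
    return W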
This Lecture
Recall: (binary) logistic regression
$\min_{w} \; \frac{1}{2} w^T w + C \sum_i \log\!\left(1 + e^{-y_i (w^T x_i)}\right)$
(multi-class) log-linear model
$P(y \mid x, w) = \dfrac{\exp(w_y^T x)}{\sum_{y' \in \{1, 2, \ldots, K\}} \exp(w_{y'}^T x)}$
v This is a valid probability distribution. Why?
v Another way to write this (with the Kesler construction) is
$P(y \mid x, w) = \dfrac{\exp(w^T \phi(x, y))}{\sum_{y' \in \{1, 2, \ldots, K\}} \exp(w^T \phi(x, y'))}$
This is often called softmax.
Softmax
v Softmax: let $s(y)$ be the score for output $y$;
here $s(y) = w^T \phi(x, y)$ (or $w_y^T x$), but it can be
computed in other ways.
$P(y) = \dfrac{\exp(s(y))}{\sum_{y' \in \{1, 2, \ldots, K\}} \exp(s(y'))}$
With a temperature $\sigma$:
$P(y \mid \sigma) = \dfrac{\exp(s(y)/\sigma)}{\sum_{y' \in \{1, 2, \ldots, K\}} \exp(s(y')/\sigma)}$
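A minimal NumPy sketch of softmax with a temperature σ; subtracting the maximum score is a standard numerical-stability trick that is not on the slide:

import numpy as np

def softmax(scores, sigma=1.0):
    # P(y) proportional to exp(s(y) / sigma)
    z = np.asarray(scores, dtype=float) / sigma
    z -= z.max()                  # for numerical stability
    e = np.exp(z)
    return e / e.sum()

With scores like those in the example that follows, a small temperature (e.g., sigma=0.5) pushes the distribution toward the hard max, while a large one (e.g., sigma=2) flattens it.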
Example
s(dog) = 0.5; s(cat) = 1; s(mouse) = 0.3; s(duck) = -0.2
[Figure: bar chart comparing the raw scores, the (hard) max ($\sigma \to 0$), softmax with $\sigma = 0.5$, and softmax with $\sigma = 2$.]
Log-linear model
$P(y \mid x, w) = \dfrac{\exp(w_y^T x)}{\sum_{y' \in \{1, 2, \ldots, K\}} \exp(w_{y'}^T x)}$
Maximum log-likelihood estimation
v Training can be done by maximum log-likelihood
estimation, i.e., $\max_w \log P(D \mid w)$
$D = \{(x_i, y_i)\}$
$P(D \mid w) = \prod_i \dfrac{\exp(w_{y_i}^T x_i)}{\sum_{y' \in \{1, 2, \ldots, K\}} \exp(w_{y'}^T x_i)}$
Maximum a posteriori
$D = \{(x_i, y_i)\}$   Can you use the Kesler construction to rewrite this formulation?
$P(w \mid D) \propto P(w) \, P(D \mid w)$
$\max_w \; -\frac{1}{2} \sum_y w_y^T w_y + C \sum_i \left[ w_{y_i}^T x_i - \log \sum_{y' \in \{1, 2, \ldots, K\}} \exp(w_{y'}^T x_i) \right]$
or
$\min_w \; \frac{1}{2} \sum_y w_y^T w_y + C \sum_i \left[ \log \sum_{y' \in \{1, 2, \ldots, K\}} \exp(w_{y'}^T x_i) - w_{y_i}^T x_i \right]$
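A minimal NumPy sketch of this regularized negative log-likelihood and its gradient, with the weights as a K × n matrix (illustrative names; the gradient of the data term has the familiar form (P(k | x_i) - 1{k = y_i}) x_i):

import numpy as np

def multiclass_logreg_loss_grad(W, X, Y, C=1.0):
    # W: (K, n); X: (N, n); Y: (N,) integer labels in {0, ..., K-1}
    N = X.shape[0]
    scores = X @ W.T                               # (N, K)
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)      # P(y | x, w) for every class
    log_likelihood = np.log(probs[np.arange(N), Y]).sum()
    loss = 0.5 * np.sum(W * W) - C * log_likelihood
    G = probs.copy()
    G[np.arange(N), Y] -= 1.0                      # P(k | x_i) - 1{k = y_i}
    grad = W + C * (G.T @ X)                       # (K, n)
    return loss, grad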
Comparisons
v Multi-class SVM:
$\min_w \; \frac{1}{2} \sum_k w_k^T w_k + C \sum_i \max_k \left( \Delta(y_i, k) + w_k^T x_i - w_{y_i}^T x_i \right)$
v Binary SVM:
$\min_w \; \frac{1}{2} w^T w + C \sum_i \max\left(0, \, 1 - y_i (w^T x_i)\right)$
v Log-linear model (logistic regression):
$\min_w \; \frac{1}{2} w^T w + C \sum_i \log\!\left(1 + e^{-y_i (w^T x_i)}\right)$