08 Classification
Computer Vision
Algorithm Generalization Ability
Estimation of Generalization Ability
Illustration
[Figure: three fits to the same data: too simple? too complicated? optimal?]
Practical Conclusion
• A balance is required between model complexity, which provides low empirical risk, and model simplicity, which provides the ability to generalize.
Hold-out
• Let us estimate the total risk of an error on some finite subset $X^k$ that does not intersect with the training sample:

$$R(f, X^k) \approx P\big(f(x) \neq y \mid X^k\big) = \frac{1}{k} \sum_{i=1}^{k} \big[f(x_i) \neq y_i\big].$$

• Let there be a data set $X^L = \{x_1, \dots, x_L\}$ with known answers.
• Split it into two parts, $X^m \cup X^k = X^L$ with $X^m \cap X^k = \emptyset$.
• We will use $X^m$ for training and $X^k$ for control (a sketch follows below):

$$P\big(f(x) \neq y\big) \approx P\big(f(x) \neq y \mid X^k\big).$$
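As an illustration, here is a minimal sketch of the hold-out estimate; the data set and the linear SVM classifier are illustrative assumptions, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Illustrative data: X (features), y (labels in {-1, +1}).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=200) > 0, 1, -1)

# Split X^L into a training part X^m and a control part X^k.
X_m, X_k, y_m, y_k = train_test_split(X, y, test_size=0.3, random_state=0)

# Train on X^m only.
f = SVC(kernel="linear").fit(X_m, y_m)

# Empirical risk on the control sample: (1/k) * sum [f(x_i) != y_i].
R_hat = np.mean(f.predict(X_k) != y_k)
print(f"hold-out error estimate: {R_hat:.3f}")
```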
Hold-out
[Figure: the data set is split into a training part and a control part]
• Characteristics:
1. Quick and easy to compute.
2. Some complex cases may fall entirely into only one of the two parts, in which case the error estimate will be biased.
Sliding Control (Cross-Validation)
• The sample is split into segments; at each iteration one segment serves as the control sample and the rest are used for training.
• The result is the average error over all iterations.
[Figure: at each iteration a different segment plays the control role, the rest are used for training]
Sliding Control
• Properties:
• In the limit, the approximate risk equals the total risk.
• Each case appears in the control sample exactly once.
• The training samples overlap heavily (the more segments, the more they overlap).
• If a group of “complex precedents” falls entirely into one segment, the estimate will be biased (see the sketch below).
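A minimal sketch of this procedure; scikit-learn's SVC is an illustrative stand-in for the trained algorithm.

```python
import numpy as np
from sklearn.svm import SVC

def sliding_control_error(X, y, n_segments=5, seed=0):
    # Shuffle once, then split the indices into disjoint segments.
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, n_segments)
    errors = []
    for k in range(n_segments):
        control = folds[k]  # each example is in the control sample exactly once
        train = np.concatenate([folds[j] for j in range(n_segments) if j != k])
        f = SVC(kernel="linear").fit(X[train], y[train])
        errors.append(np.mean(f.predict(X[control]) != y[control]))
    return np.mean(errors)  # average error over all iterations
```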
5×2 Cross-Validation (CV)
• 5×2 cross-validation:
• Divide the sample randomly in half.
• Train the algorithm on one half and test it on the other, then vice versa.
• Repeat the experiment five times and average the results.
• Property: every case participates in the control (validation) sample at each of the 5 stages (a sketch follows below).
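A minimal sketch of 5×2 CV; scikit-learn is assumed, the linear SVM is illustrative, and the make_clf hook is a hypothetical helper added here so the classifier can be swapped.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def cv_5x2_error(X, y, make_clf=lambda: SVC(kernel="linear")):
    errors = []
    for repeat in range(5):  # five random half-splits
        XA, XB, yA, yB = train_test_split(X, y, test_size=0.5, random_state=repeat)
        # Train on one half and test on the other, and vice versa.
        for Xtr, ytr, Xte, yte in [(XA, yA, XB, yB), (XB, yB, XA, yA)]:
            f = make_clf().fit(Xtr, ytr)
            errors.append(np.mean(f.predict(Xte) != yte))
    return np.mean(errors)  # average over the 10 train/test runs
```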
Classification
Linear Classifier
• Find a linear function (hyperplane) that separates the positive $\{y = +1\}$ and negative $\{y = -1\}$ examples (sketched below):
• $x_i$ positive: $x_i^\top w + b \ge 0$,
• $x_i$ negative: $x_i^\top w + b < 0$.
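A one-function sketch of this decision rule; NumPy is assumed, and $w$, $b$ are taken as given.

```python
import numpy as np

def linear_classify(X, w, b):
    # +1 where x^T w + b >= 0, and -1 otherwise.
    return np.where(X @ w + b >= 0, 1, -1)

# Illustrative usage: the hyperplane x1 + x2 - 1 = 0.
print(linear_classify(np.array([[2.0, 2.0], [0.0, 0.0]]), np.array([1.0, 1.0]), -1.0))
# -> [ 1 -1]
```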
Support Vector Machine (SVM)
• Based on the Vapnik-Chervonenkis (VC) theory.
• An optimal separating hyperplane is a hyperplane whose distance to the nearest point of the training sample (no matter from which class) is maximal.
• The optimal separating hyperplane maximizes the margin:
• the distance from it to the points of the training sample.
Support Vector Machine (SVM)
• Find a hyperplane that maximizes the margin between the positive and negative examples:
• $x_i$ positive ($y_i = +1$): $x_i^\top w + b \ge 1$,
• $x_i$ negative ($y_i = -1$): $x_i^\top w + b \le -1$.
• For the support vectors: $x_i^\top w + b = \pm 1$.
• The distance from a point to the hyperplane is $\dfrac{|x_i^\top w + b|}{\|w\|}$.
• The margin is $\dfrac{2}{\|w\|}$.
[Figure: separating hyperplane with the support vectors and the margin marked]
Support Vector Machine (SVM)
• Finding the hyperplane:
1. Maximize $\dfrac{2}{\|w\|}$.
2. Classify all the data correctly:
• $x_i$ positive ($y_i = +1$): $x_i^\top w + b \ge 1$,
• $x_i$ negative ($y_i = -1$): $x_i^\top w + b \le -1$.
3. Solve the equivalent quadratic optimization problem (sketched below):
• Minimize $\dfrac{1}{2} w^\top w$ subject to $y_i (w^\top x_i + b) \ge 1$.
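A minimal sketch of this optimization via scikit-learn's SVC; a large C approximates the hard-margin problem, and the toy points are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data.
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])

# A very large C approximates: minimize (1/2) w^T w
# subject to y_i (w^T x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b, "margin 2/||w|| =", 2 / np.linalg.norm(w))
```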
Support Vector Machine (SVM)
• The problem is solved using the method of Lagrange multipliers:

$$w = \sum_i \alpha_i y_i x_i,$$

where the $\alpha_i$ are trained weights (Lagrange multipliers) and the $x_i$ are training vectors.
• For most vectors, the weight is zero.
• All vectors whose weight is greater than zero, $\alpha_i > 0$, are called support vectors.
• The solution depends only on the support vectors.
Support Vector Machine (SVM)
• The decision function takes the form:

$$w^\top x + b = \sum_i \alpha_i y_i \, x_i^\top x + b.$$

• The decision function depends on the inner products of the test vector $x$ with the support vectors $x_i$ (a sketch follows below).
• Solving the optimization problem requires computing the inner products $x_i^\top x_j$ between all pairs of vectors in the training set.
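A sketch of this dependence using the dual coefficients scikit-learn stores after training; dual_coef_ holds $\alpha_i y_i$ for the support vectors, and the toy data repeats the sketch above.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_[0] stores alpha_i * y_i for the support vectors only.
alpha_y, sv, b = clf.dual_coef_[0], clf.support_vectors_, clf.intercept_[0]

# Decision function as a weighted sum of inner products with support vectors.
x = np.array([2.0, 2.5])
f_x = alpha_y @ (sv @ x) + b
assert np.isclose(f_x, clf.decision_function([x])[0])
```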
SVM Examples
• On linearly separable data, the support vector machine works well.
• On more complex data, the method does not work well.
Nonlinear SVM
• Idea: map the original parameter space into some higher-dimensional feature space in which the training sample is linearly separable.
Nonlinear SVM
• Problem: computing dot products in a high-dimensional space is computationally expensive.
• To avoid this, there is an approach called the kernel trick, popularized together with SVMs in the 1990s: instead of directly computing the transformation $\varphi(x)$, a kernel function $K$ is defined:

$$K(x_i, x_j) = \varphi(x_i)^\top \varphi(x_j),$$

where the matrix $K(x_i, x_j)$ is non-negative definite.
• Using this kernel, a nonlinear decision function is built in the original space (sketched below):

$$\sum_i \alpha_i y_i K(x_i, x) + b.$$
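A minimal sketch of the kernel trick in practice; scikit-learn's RBF-kernel SVC and the radially separable toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy nonlinear problem: the class depends on the radius, not on a linear function.
X = rng.normal(size=(200, 2))
y = np.where((X ** 2).sum(axis=1) > 1.0, 1, -1)

# RBF kernel K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2); the mapping phi
# is never computed explicitly, only the kernel values are needed.
clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```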
Nonlinear SVM: an Example of Kernels
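For reference, kernels commonly used in practice include:

$$K(x, x') = x^\top x' \ \text{(linear)}, \qquad K(x, x') = (x^\top x' + c)^d \ \text{(polynomial)}, \qquad K(x, x') = \exp\!\big(-\gamma \|x - x'\|^2\big) \ \text{(RBF)}.$$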
Nonlinear SVM Usage Algorithm
Multiclass SVM
• There is no special formulation of SVM for the case of many classes.
• In practice, an SVM for several classes is obtained by combining several two-class SVMs (see the sketch below).
• One against all:
• Training: train an SVM for each class against all the others.
• Validation: apply all the SVMs to the example and assign the class whose SVM gave the most confident decision.
• One against one:
• Training: train an SVM for each pair of classes.
• Validation: each SVM votes for a class; choose the class with the most votes.
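A minimal sketch of both combination schemes via scikit-learn's ready-made wrappers; the three-class blob data is illustrative.

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Three illustrative classes as Gaussian blobs.
X = np.concatenate([rng.normal(loc=c, size=(50, 2)) for c in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 50)

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # one against all
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # one against one
print(ovr.predict([[3.0, 3.0]]), ovo.predict([[3.0, 3.0]]))
```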
Properties of SVM
• Advantages of SVM:
1. Lots of libraries available: http://www.kernel-machines.org/software.
2. Powerful and flexible kernel-based approach.
3. Works very well in practice, even on small training samples.
• Disadvantages of SVM:
1. No direct multiclass methods; two-class SVMs have to be combined.
2. Resource intensity:
• Training requires building the complete kernel matrix over all examples.
• Training takes a long time on big tasks.
Error Types
Error Types
• If we measure the error simply as the probability of giving an incorrect answer, such a measure is not fully informative.
• For example, a 15% error in diagnosis may mean that 15% of sick patients will be considered healthy (and possibly die from lack of treatment),
• or that 15% of healthy patients will be diagnosed as sick (and the money for their treatment will be wasted).
• In such cases, the concepts of errors of the first and second type are introduced, and the two are evaluated separately.
Error Types
• Let there be some main class.
• As a rule, the main class is the class whose detection triggers some action.
• In our example of making a diagnosis, the main class is “sick” and the secondary one is “healthy”.
• A Type I error is the probability of mistaking the main class for the secondary one:
• the probability of missing the desired object.
• A Type II error is the probability of mistaking the secondary class for the main one:
• the probability of taking the background for the desired object.
Error Types
[Figure: the designed hypothesis versus the true hypothesis; the red dots are treated as the main class, and the regions of Type I and Type II errors are marked]
Error Types
• It is especially important to evaluate the errors separately when the classes are strongly imbalanced.
• For example, let $P(y = +1) = 0.01$ and $P(y = -1) = 0.99$.
• Then, with a Type I error of $P\big(f(x) = -1 \mid y = +1\big) = 0.5$ and a zero Type II error, the total error will be just

$$P\big(f(x) \neq y\big) = 0.5 \cdot 0.01 + 0 \cdot 0.99 = 0.005.$$
Error Types
• Based on these errors, the following two measures can be defined (computed in the sketch below):
• Sensitivity: the probability of giving the correct answer on an example of the main class:

$$\mathit{sensitivity} = P\big(f(x) = y \mid y = +1\big).$$

• Specificity: the probability of giving the correct answer on an example of the secondary class:

$$\mathit{specificity} = P\big(f(x) = y \mid y = -1\big).$$
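A minimal sketch computing both measures from predictions; NumPy is assumed, and the labels are taken to be in {-1, +1} with +1 as the main class.

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    pos = y_true == 1
    sensitivity = np.mean(y_pred[pos] == 1)    # P(f(x) = y | y = +1)
    specificity = np.mean(y_pred[~pos] == -1)  # P(f(x) = y | y = -1)
    return sensitivity, specificity

# Illustrative usage.
y_true = np.array([1, 1, 1, -1, -1, -1])
y_pred = np.array([1, -1, 1, -1, -1, 1])
print(sensitivity_specificity(y_true, y_pred))  # (0.666..., 0.666...)
```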
Balance Setting
• Almost all classification algorithms have parameters by varying which you can obtain different levels of Type I and Type II errors.
ROC Curve
• To estimate how sensitivity depends on specificity, you can build an ROC curve (Receiver Operating Characteristic curve).
[Figure: ROC curve; the upper-left corner is the best case, and the diagonal corresponds to a random “coin-flip” classifier]
ROC Curve
• For various values of the tuned parameter, an error table is built.
• The parameter itself does not appear in the table.
• The classifier is built and evaluated each time on different samples.
• Each row of the table gives one point through which the curve is drawn.
ROC Curve
• The area under the curve gives an objective indicator of the quality of the classifier and allows different curves to be compared.
• Often, a particular task imposes constraints on a certain type of error. Using ROC, you can check whether the current solution complies with these requirements.
Example
• Given some points on the plane.
• The parametric family is a threshold along the $x_1$ axis:

$$a(x_1, x_2) = \begin{cases} +1, & x_1 > \Theta, \\ -1, & x_1 \le \Theta. \end{cases}$$
Hold-out
Sliding Control
ROC Curve
• We will vary the threshold, estimate the errors, fill in the ROC table, and draw the curve from it (sketched below):
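A minimal sketch of this threshold sweep for the family above; NumPy is assumed, and roc_table is a hypothetical helper name.

```python
import numpy as np

def roc_table(x1, y, thresholds):
    # One ROC point per threshold: (1 - specificity, sensitivity).
    points = []
    for theta in thresholds:
        pred = np.where(x1 > theta, 1, -1)
        tpr = np.mean(pred[y == 1] == 1)    # sensitivity
        fpr = np.mean(pred[y == -1] == 1)   # 1 - specificity
        points.append((fpr, tpr))
    return points
```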
ROC Curve
• General scheme for selecting classifier parameters (sketched below):
1. Take a training sample.
2. For each candidate set of classifier parameters:
a) Using 5×2 cross-validation, train a classifier with these parameters and estimate its errors.
b) Build the ROC curve.
3. Choose the optimal parameters and train the classifier with them on the entire sample.
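A minimal sketch of the selection loop; scikit-learn is assumed, the cv_5x2_error helper from the cross-validation sketch above is reused, and the RBF SVM with its C grid is illustrative.

```python
from sklearn.svm import SVC

def select_and_train(X, y, grid=(0.01, 0.1, 1.0, 10.0, 100.0)):
    # 2. For each candidate C, estimate the error with 5x2 cross-validation
    #    (cv_5x2_error is defined in the cross-validation sketch above).
    cv_errors = {C: cv_5x2_error(X, y, make_clf=lambda C=C: SVC(kernel="rbf", C=C))
                 for C in grid}
    # 3. Choose the optimal parameter and retrain on the entire sample.
    best_C = min(cv_errors, key=cv_errors.get)
    return SVC(kernel="rbf", C=best_C).fit(X, y)
```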
Test on Lectures 7-8
THANK YOU
FOR YOUR TIME!