08 Classification

Classification

Computer Vision
Algorithm Generalization Ability

2 / 46
Estimation of Generalization Ability

• Vapnik–Chervonenkis (VC) theory:


• an estimate of the complexity of a parametric family of functions;
• assessing the quality of an algorithm through empirical risk and
model complexity.
• Main idea: choose the simplest model among the sufficiently
accurate ones.
• Let there be a sequence of nested parametric families of increasing
complexity: F_1 ⊂ F_2 ⊂ ⋯ ⊂ F_N = F.
• Choose the family of minimal complexity that still provides the
desired accuracy.

3 / 46
Illustration

Too simple?

Too complicated?
Optimal?

4 / 46
Practical Conclusion
• A balance is required between the complexity of the model, which
provides low empirical risk, and the simplicity, which provides the
ability to generalize.

The red line is the training error,
the blue line is the validation error,
and the yellow dot marks the optimal complexity. 5 / 46
Total Risk Estimation
• Minimizing the total risk is the main goal.
• However, it cannot be computed exactly, because the computation
runs over an unbounded set:

R(f, X) = P_{x,y}[f(x) ≠ y] = ∫_X P(x) [f(x) ≠ y] dx.

6 / 46
Holdout
• Let us estimate the total risk by the error on some finite subset X^c
that does not intersect the training sample:

R(f, X) ≈ P(f(x) ≠ y | X^c) = (1/c) Σ_{i=1}^{c} [f(x_i) ≠ y_i].

• Let there be a data set X^l = {x_1, …, x_l} with known answers.
• Split it so that X^t ∪ X^c = X^l and X^t ∩ X^c = ∅.
• Use X^t for training and X^c for control (validation):

P(f(x) ≠ y) ≈ P(f(x) ≠ y | X^c).

7 / 46
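The holdout scheme above can be sketched in a few lines of Python (an illustrative sketch; the function names and the 70/30 split fraction are my own choices, not from the lecture):

```python
import random

def holdout_split(data, train_frac=0.7, seed=0):
    """Randomly split `data` into disjoint training and control parts."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    cut = int(train_frac * len(data))
    return [data[i] for i in idx[:cut]], [data[i] for i in idx[cut:]]

def empirical_error(f, sample):
    """Estimate P(f(x) != y) as the fraction of misclassified examples."""
    return sum(f(x) != y for x, y in sample) / len(sample)
```

The two parts are disjoint by construction, which is exactly the X^t ∩ X^c = ∅ requirement on the slide.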
Holdout
• Characteristics:
1. Quick and easy to compute.
2. Some complex cases may fall entirely into only one of the
samples, and then the error estimate will be biased: the error
arises not through the fault of the classifier, but because of
the split.

(Figure: the sample is divided into a training part and a control
(validation) part.)
8 / 46
Sliding Control
• Divide the sample into d non-overlapping parts and alternately
use one of them for control and the rest for training.
• The partition: {X^j}_{j=1}^{d}, with X^i ∩ X^j = ∅ for i ≠ j and
⋃_{j=1}^{d} X^j = X^l.
• Calculate the approximate risk:

P(f(x) ≠ y) ≈ (1/d) Σ_{j=1}^{d} P(f(x) ≠ y on X^j | trained on ⋃_{i≠j} X^i).

9 / 46
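The sliding-control loop above can be sketched as follows (illustrative; `train_and_classify` is a hypothetical function that fits a model on a sample of (x, y) pairs and returns a classifier):

```python
def sliding_control(data, d, train_and_classify):
    """Split `data` into d non-overlapping parts; alternately use one part
    for control and the rest for training; return the average control error."""
    folds = [data[j::d] for j in range(d)]  # disjoint parts whose union is data
    errors = []
    for j in range(d):
        control = folds[j]
        train = [ex for i, fold in enumerate(folds) if i != j for ex in fold]
        f = train_and_classify(train)
        errors.append(sum(f(x) != y for x, y in control) / len(control))
    return sum(errors) / d
```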
Sliding Control
• The result is the average error over all iterations.

(Figure: at each iteration one part is used for control, the rest for
training.)

10 / 46
Sliding Control
• Properties:
• In the limit, the approximate risk will be equal to the total risk.
• Each case will be present in the control sample.
• The training samples will overlap heavily (the more segments, the
more overlap).
• If one group of “complex precedents” falls completely into one
segment, then the estimation will be biased.

11 / 46
Cross Sliding Control
• CV (cross-validation).
• 5×2 cross-validation:
• Divide the sample randomly in half.
• Train the algorithm on one half, test on the other, and vice
versa.
• Repeat this experiment five times and average the results.
• Property: each example participates in a control (validation)
sample at each of the 5 stages.

12 / 46
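The 5×2 scheme can be sketched directly from the description above (illustrative; `train_and_classify` is again a hypothetical fit-and-return-classifier function):

```python
import random

def five_by_two_cv(data, train_and_classify, seed=0):
    """5x2 CV: five random half/half splits; train on each half and
    validate on the other; return the average of the ten error estimates."""
    rng = random.Random(seed)
    errors = []
    for _ in range(5):
        idx = list(range(len(data)))
        rng.shuffle(idx)
        half = len(data) // 2
        a = [data[i] for i in idx[:half]]
        b = [data[i] for i in idx[half:]]
        for train, control in ((a, b), (b, a)):
            f = train_and_classify(train)
            errors.append(sum(f(x) != y for x, y in control) / len(control))
    return sum(errors) / len(errors)
```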
Classification

13 / 46
Linear Classifier
• Find a linear function (hyperplane)
that separates the positive {y = +1}
and negative {y = −1} examples:
• x_i positive: x_i^T w + b ≥ 0,
• x_i negative: x_i^T w + b < 0.

What is the best hyperplane?

14 / 46
Support Vector Machine (SVM)
• Based on VC theory.
• An optimal separating hyperplane is a hyperplane such that the
distance from it to the nearest point from the training sample
(no matter from which class) is maximum.
• The optimal separating hyperplane maximizes the margin:
• the distance from it to the nearest points of the training
sample.

15 / 46
Support Vector Machine (SVM)
• Find the hyperplane that maximizes the
margin between positive and negative
examples:
• x_i positive (y_i = +1): x_i^T w + b ≥ 1,
• x_i negative (y_i = −1): x_i^T w + b ≤ −1.
• For support vectors: x_i^T w + b = ±1.
• The distance from a point to the
hyperplane is |x^T w + b| / ‖w‖,
• so the margin is 2 / ‖w‖.

(Figure: support vectors and the margin.)

16 / 46
Support Vector Machine (SVM)
• Finding the hyperplane:
1. Maximize 2 / ‖w‖.
2. Classify all the data correctly:
• x_i positive (y_i = +1): x_i^T w + b ≥ 1,
• x_i negative (y_i = −1): x_i^T w + b ≤ −1.
3. Equivalently, solve the quadratic optimization problem:
• minimize (1/2) w^T w subject to y_i (w^T x_i + b) ≥ 1.
%

17 / 46
Support Vector Machine (SVM)
• The problem is solved using the method of Lagrange multipliers:

w = Σ_i α_i y_i x_i,

where α_i are the learned weights (Lagrange multipliers) and x_i are
the training vectors.
• For most vectors, the weight is zero.
• All vectors whose weight is greater than zero (α_i > 0) are called
support vectors.
• The solution depends only on the support vectors.

18 / 46
Support Vector Machine (SVM)
• The decision function takes the form:

w^T x + b = Σ_i α_i y_i x_i^T x + b.

• The decision function depends on the inner products of the test vector
x with the support vectors x_i.
• Solving the optimization problem requires the scalar products
x_i^T x_j between all pairs of vectors from the training set.

19 / 46
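The decision function above can be evaluated directly from the support vectors and their weights (a minimal sketch; the toy numbers in the check below are my own, chosen so that w = Σ_i α_i y_i x_i = (2, 0)):

```python
def decision_function(alphas, labels, support_vectors, b, x):
    """f(x) = sum_i alpha_i * y_i * <x_i, x> + b over the support vectors."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    return sum(a * y * dot(sv, x)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b
```

With α = (1, 1), y = (+1, −1) and support vectors (1, 0) and (−1, 0), this reproduces w·x for w = (2, 0), confirming that only inner products with the support vectors are needed.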
SVM Examples
• On linearly separable data, the support vector machine works
fine.
• On more complex data, the method does not work very well.

(Figure: an example of linearly inseparable data.)

20 / 46
Nonlinear SVM
• Basic approach: map the data into another, higher-dimensional
space and separate them linearly there.

(Figure: the data in another space.)

21 / 46
Nonlinear SVM
• Idea: map the original parameter space into some
higher-dimensional feature space where the training sample is
linearly separable:

22 / 46
Nonlinear SVM
• Problem: computing dot products in a high-dimensional space is
computationally expensive.
• To avoid this, the kernel trick (introduced with SVMs in 1995) is used:
instead of directly computing the transformation φ(x), a kernel
function K is defined:

K(x_i, x_j) = φ(x_i)^T φ(x_j),

where the matrix K(x_i, x_j) is positive semi-definite.
• Using this kernel, a nonlinear decision function is built in the
original space:

Σ_i α_i y_i K(x_i, x) + b.

23 / 46
Nonlinear SVM: an Example of Kernels

• Polynomial kernel example:

K(x, y) = (⟨x, y⟩ + c)^d.

• Let d = 2 and x = (x_1, x_2):

K(x, y) = (⟨x, y⟩ + c)^2 = (x_1 y_1 + x_2 y_2 + c)^2 = φ(x) · φ(y),

where:

φ(x) = (x_1^2, x_2^2, √2 x_1 x_2, √(2c) x_1, √(2c) x_2, c),

i.e., a mapping from the two-dimensional space into a six-dimensional
one.

24 / 46
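The kernel identity on this slide is easy to verify numerically (a sketch; note the √2 and √(2c) factors in the explicit feature map):

```python
import math

def poly_kernel(x, y, c):
    """K(x, y) = (<x, y> + c)^2 for two-dimensional inputs."""
    return (x[0] * y[0] + x[1] * y[1] + c) ** 2

def phi(x, c):
    """Explicit 6-D feature map satisfying phi(x) . phi(y) = K(x, y)."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2,
            math.sqrt(2 * c) * x1, math.sqrt(2 * c) * x2, c)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))
```

Evaluating the kernel costs one 2-D inner product, while the explicit map works in six dimensions; this gap is exactly what the kernel trick exploits.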
Nonlinear SVM Usage Algorithm

1. Create a training sample.


2. Select kernel.
3. Calculate the matrix of kernel values for each pair of examples
from the training set.
4. Load matrix into SVM library to get weights and support
vectors.
5. During inference, calculate the kernel values for the test
sample and each support vector and take the weighted sum to
obtain the value of the decision function.

25 / 46
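Step 3 of the algorithm above (the matrix of kernel values for each pair of training examples) amounts to one nested loop (a sketch; the linear kernel in the check is an illustrative choice):

```python
def kernel_matrix(data, K):
    """Gram matrix: K(x_i, x_j) for every pair of training examples."""
    return [[K(xi, xj) for xj in data] for xi in data]
```

With a linear kernel K(u, v) = ⟨u, v⟩ on the points (1, 0) and (0, 1), the matrix is the 2×2 identity.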
Multiclass SVM
• There is no special formulation of SVM for the case of many
classes.
• In practice, the SVM for several classes is obtained by
combining several two-class SVMs.
• One against all:
• Training: train an SVM for each class against all the others.
• Validation: apply all the SVMs to the sample and assign the class
whose SVM gave the most confident decision.
• One against one:
• Training: train an SVM for each pair of classes.
• Validation: each SVM votes for one of its two classes; choose the
class with the most votes.
26 / 46
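The one-against-one combination above can be sketched as a vote (illustrative; `pairwise` is a hypothetical mapping from a class pair to a trained two-class classifier that returns one of its two classes):

```python
from collections import Counter

def one_vs_one_predict(pairwise, classes, x):
    """Each pairwise classifier votes for one of its two classes;
    the class with the most votes wins (ties broken by class order)."""
    votes = Counter(clf(x) for clf in pairwise.values())
    return max(classes, key=lambda c: votes[c])
```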
Properties of SVM
• Advantages of SVM:
1. Lots of libraries available: http://www.kernel-
machines.org/software.
2. Powerful and flexible kernel-based approach.
3. Works very well in practice, even for small training samples.
• Disadvantages of SVM:
1. No direct multiclass methods, two-class SVMs need to be
combined.
2. Resource intensity:
• Training requires building the complete kernel matrix over all
examples.
• Training takes a long time on large tasks.

27 / 46
Error Types

28 / 46
Error Types
• If we treat the error simply as the probability of giving an incorrect
answer, such a measure is not always sufficient.
• For example, a 15% error in diagnosis may mean that 15% of sick patients
will be considered healthy (and possibly die from lack of treatment),
• or that 15% of healthy patients will be considered sick (and money for
treatment will be wasted).
• In such cases, the concepts of type I and type II errors are
introduced, and the two are evaluated separately.

29 / 46
Error Types
• Let there be some main class.
• As a rule, the main class is the class, upon detection of which some
action is taken.
• In our example, when making a diagnosis, the main class will be
“sick”, and the secondary one will be “healthy”.
• Type I error is the probability of mistaking the primary class for
a secondary one.
• The probability of missing the desired object.
• Type II error is the probability of mistaking a secondary class for
the primary one.
• The probability that the background is taken for the desired
object.

30 / 46
Error Types
(Figure: the designed hypothesis vs. the true hypothesis; the red dots
are considered the main class; the marked regions show the type I and
type II errors.)
31 / 46
Error Types
• It is especially important to evaluate the errors separately when the
classes are strongly imbalanced.
• For example, P(y = +1) = 0.01, P(y = −1) = 0.99.
• Then, with a type I error of P(f(x) = −1 | y = +1) = 0.5 and a
zero type II error, the total error will be just:

P(f(x) ≠ y) = 0.005.

32 / 46
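The total error in the class-imbalance example on this slide decomposes as P(y=+1)·(type I error) + P(y=−1)·(type II error); a two-line check confirms the arithmetic:

```python
# Values from the slide's class-imbalance example.
p_pos, p_neg = 0.01, 0.99   # P(y = +1), P(y = -1)
type1, type2 = 0.5, 0.0     # P(f(x) = -1 | y = +1), P(f(x) = +1 | y = -1)

# Total error: errors on the rare class are diluted by its small prior.
total_error = p_pos * type1 + p_neg * type2
```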
Error Types
• Based on these errors, the following two measures can be
defined:
• Sensitivity: the probability of giving the correct answer on an
example of the main class:

sensitivity = P(f(x) = y | y = +1).

• Specificity: the probability of giving the correct answer on an
example of the secondary class:

specificity = P(f(x) = y | y = −1).

33 / 46
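Both measures above fall out of simple counts over a labelled sample (a sketch; labels are assumed to be ±1 as on the slide):

```python
def sensitivity_specificity(predictions, labels):
    """sensitivity = P(f(x) = y | y = +1), specificity = P(f(x) = y | y = -1)."""
    tp = sum(p == 1 for p, y in zip(predictions, labels) if y == 1)
    tn = sum(p == -1 for p, y in zip(predictions, labels) if y == -1)
    pos = sum(y == 1 for y in labels)
    neg = sum(y == -1 for y in labels)
    return tp / pos, tn / neg
```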
Balance Setting
• Almost all classification algorithms have parameters; by varying
them, you can obtain different levels of type I and type II
errors.

34 / 46
ROC Curve
• To estimate the dependence of sensitivity on specificity, you can
plot an ROC curve (Receiver Operating Characteristic curve).

(Figure: ROC curves; the best case hugs the top-left corner, while a
random "coin-flip" classifier lies on the diagonal.)
35 / 46
ROC Curve
• For various values of the tuned parameter, an error table is
built.
• The parameter itself does not appear in the table.
• The classifier is built and evaluated each time on different samples.
• Each row of the table is a point from which the curve is built.

36 / 46
ROC Curve
• The area under the curve gives an objective indicator of the
quality of the classifier and allows different curves to be
compared.
• Often, a particular task imposes requirements on a certain type of
error. Using the ROC curve, you can check whether the current
solution meets those requirements.

37 / 46
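Building the curve and the area under it can be sketched as a threshold sweep plus the trapezoid rule (illustrative; `scores` are real-valued classifier outputs and labels are ±1, both my own conventions):

```python
def roc_points(scores, labels):
    """Sweep the decision threshold; collect (false-positive rate,
    true-positive rate) points, one per distinct threshold."""
    pos = sum(y == 1 for y in labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores)) + [float('inf')]:
        tpr = sum(s >= t and y == 1 for s, y in zip(scores, labels)) / pos
        fpr = sum(s >= t and y != 1 for s, y in zip(scores, labels)) / neg
        points.append((fpr, tpr))
    return sorted(points)

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A classifier whose scores perfectly separate the classes reaches an area of 1; the diagonal "coin-flip" curve gives 0.5.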
Example
• Given some points on the plane.
• The parametric family is a
threshold along the x_1 axis:

a(x_1, x_2) = +1 if x_1 > Θ, −1 if x_1 ≤ Θ.

38 / 46
Holdout

Left: training sample, error 0.1133;
right: control (validation) sample, error 0.1433.
39 / 46
Repeated Holdout
• Training error:
• {0.1097; 0.1236; 0.1209; 0.1250; 0.1250},
• the average is 0.1208.
• Control error:
• {0.1833; 0.1222; 0.1333; 0.1222; 0.1167},
• the average is 0.1356.

40 / 46
Sliding Control

Left: training sample, error 0.1219;
right: control sample, error 0.1367.
41 / 46
Sliding Control
• Training error:
• {0.1236; 0.1208; 0.1250; 0.1097; 0.1306},
• the average is 0.1219.
• Control error:
• {0.1500; 0.1333; 0.1222; 0.1778; 0.1000},
• the average is 0.1367.

42 / 46
ROC Curve
• We vary the threshold, estimate the errors, build the ROC
table, and plot the curve from the table:

43 / 46
ROC Curve
• General scheme for selecting classifier parameters:
1. Take a training sample.
2. Take different sets of classifier parameters.
a) For each parameter set:
• use 5×2 cross-validation,
• train a classifier with these parameters,
• estimate the errors.
b) Build an ROC curve.
3. Choose the optimal parameters and train the classifier with
them on the entire sample.

44 / 46
Test on Lectures 7-8

CS3 Test AT3 Test

45 / 46
THANK YOU
FOR YOUR TIME!

[email protected]
