STATISTICS AND RESEARCH DESIGN

Support vector machines


Dirk Valkenborg, Axel-Jan Rousseau, Melvin Geubbelmans, and Tomasz Burzykowski
Hasselt, Belgium, and Bialystok, Poland

Affiliations: Data Science Institute and Center for Statistics, Hasselt University, Hasselt, Belgium; Department of Biostatistics and Medical Informatics, Medical University of Bialystok, Bialystok, Poland. This research received funding from the Flemish Government under the "Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen" program. Address correspondence to: Tomasz Burzykowski, Hasselt University - Data Science Institute, Agoralaan 1, Building D, B-3590 Diepenbeek, Belgium; e-mail, [email protected]. Am J Orthod Dentofacial Orthop 2023;164:754-7. © 2023. https://doi.org/10.1016/j.ajodo.2023.08.003

A support vector machine (SVM) is a supervised machine learning (ML) method capable of learning from data and making decisions. The fundamental principles of the SVM were introduced in the 1960s by Vapnik and Chervonenkis1 in a theory that was further developed over the following decades. However, it was only in the 1990s that SVMs attracted greater attention from the scientific community, and this was attributed to 2 significant improvements. The first extension is a kernel trick that allows the SVM to classify highly nonlinear problems.2 The second permitted the extension of the SVM to solve problems in a regression framework,3 called the support vector regression machine. These improvements have resulted in a versatile general approximator that nowadays finds its use in many applications. Typically, the mathematics and theory behind SVMs are complex and require a deep understanding of optimization theory, algebra, and learning theory.

Nonetheless, the main idea can be explained intuitively, and this article will consider a classification problem to illustrate the concepts. In what follows, it can be noticed that SVMs differ from the previously presented methods as they exploit geometries in the data and are not directly rooted in statistics (eg, generalized linear models). Rather, they originate from mathematics and engineering and are often compared with the logistic regression explained in the previous article.

The starting point of an SVM is straightforward: it will try to solve a particular binary classification problem with the simplest model possible, separating the subjects that belong to the 2 different classes by a classification boundary. In 2 dimensions, this classification boundary will form a straight line. In 3 dimensions, this classification boundary will become a plane, the generalization of a line. This boundary will be called a hyperplane for higher dimensions, which can be considered a plane in >3 dimensions and is beyond our imagination. Again, the question is how well such a simple model classifies and how well the learned concepts generalize to previously unseen data.

Figure 1, A shows a separable classification problem. It is perfectly possible to separate the blue from the red subjects by using a straight line as a classification boundary. However, as illustrated in the plot, multiple options are possible. Which line should we select as our boundary to minimize the risk of misclassifying a previously unseen subject? The solution to this question presents itself in Figure 1, B and is called a maximum-margin classifier. The basic idea is simple: to minimize misclassification risk, we want our classification boundary positioned as far as possible from the neighboring subjects belonging to the different classes. The margin between the classification boundary and the training data is maximized, allowing for a tolerance region when predicting a class label for new subjects.

An important observation can be made from the figure. Data points far from the classification line do not influence its position. The only data points that determine the decision boundary are the 3 points in black in Figure 1, B. These points are called support points or support vectors. In other words, if we removed all the subjects from our training dataset apart from these 3 support vectors, the location of the decision boundary would remain unaltered. This example indicates that support vectors strongly influence the decision boundary and that changes in the training data can dramatically impact the decision boundary.

Figure 1, C shows an additional subject, indicated by an arrow, added to the training dataset. Coincidentally, this subject lies close to the decision boundary and is an influential support vector that will modify the maximum-margin problem, resulting in a different classification boundary, as indicated in green. However, when eyeballing the classification problem, we can be pretty satisfied with the previous classifier, indicated by dashes, which yields wider overall margins with respect to the neighboring training subjects.
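For readers who want the formula behind this geometric picture, the maximum-margin classifier can be written as the standard textbook optimization problem below; this formulation is added here only as a reference point and does not appear in the original article.

\[
\min_{\mathbf{w},\,b}\; \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2}
\quad\text{subject to}\quad
y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,n,
\]

where the $\mathbf{x}_i$ are the training subjects, $y_i \in \{-1,+1\}$ their class labels, $\mathbf{w}^{\top}\mathbf{x} + b = 0$ is the separating hyperplane, and the margin being maximized equals $2/\lVert\mathbf{w}\rVert$.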

Fig 1. A and B illustrate the principle of the maximum-margin classifier; C and D show the introduction of the slack variable to allow the support vector classifier to maximize its margin while disregarding the influence of nearby observations, even when the data is inseparable.
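As an illustration of these ideas (not code from the article), the short Python sketch below fits a linear SVM on made-up, separable 2-dimensional data with scikit-learn and prints its support vectors; the data values and the use of a very large C to mimic a hard margin are assumptions for demonstration only.

# Minimal sketch (hypothetical data): fit a linear SVM and inspect which
# points act as support vectors.
import numpy as np
from sklearn.svm import SVC

# Made-up 2-dimensional, linearly separable training data
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.5, 1.5],    # class 0
              [6.0, 6.5], [7.0, 5.5], [6.5, 7.0]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates the hard maximum-margin classifier
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("Support vectors:")
print(clf.support_vectors_)                       # only these points fix the boundary
print("w =", clf.coef_[0], " b =", clf.intercept_[0])   # hyperplane w.x + b = 0

Removing any point that is not listed among the support vectors and refitting would leave the printed hyperplane unchanged, which is the behavior described in the text.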

From the previous articles in this series on ML, we can recognize such behavior as a symptom of overfitting the training data. We want to avoid this overfitting by allowing some support vectors within the maximum margin, thus relaxing their influence on the decision boundary. A trade-off similar in spirit to the least absolute shrinkage and selection operator regression, discussed in a previous article,4 presents itself. In the case of SVMs, a budget-and-penalty system, called slack variables, is proposed to relax some of the support vectors. A fortiori, the same procedure can be used to pay the penalty even when misclassifying subjects from the training dataset. For example, in the case of nonseparable data, as indicated in Figure 1, D, misclassifying the subjects encircled in black is unavoidable and is tolerated by this budget-and-penalty system.

Up to this point, we have only considered simple classification problems, but how will an SVM perform on a complex nonlinear problem, such as the one displayed in Figure 2, A (showing that the decision boundary should be a circle)? However, with SVMs, we are restricted to straight lines, planes, or hyperplanes. The curse of dimensionality may offer a potential solution, as previously discussed.4 The curse of dimensionality states that any classification problem can be perfectly solved when adding enough covariates such that the dimension of the dataset is much increased, allowing for a clear separation of subjects belonging to different classes. However, the example in Figure 2 has only 2 covariates or dimensions, so the SVM has to develop additional virtual dimensions. Toward this aim, the SVM will consider mapping the data onto a higher dimension, which is the opposite of the projection of high-dimensional data objects onto a lower dimension.5 For example, consider a projection of the 2-dimensional data onto a spherical object, as indicated in Figure 2, B. By this projection, the problem becomes separable with a simple plane that maximizes the margin with respect to the neighboring subjects, as indicated in black. This mapping to higher dimensions is called a kernel function. The good news is that this mapping or projection does not need to be performed explicitly, as it is a computationally expensive step. Therefore, this step is often called the kernel trick because the SVM can find the maximum-margin classifier in an unspecified higher dimension.
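To make the kernel trick concrete, the following Python sketch (again not from the article; the dataset and parameter values are hypothetical) trains an SVM with a radial basis function (RBF) kernel on synthetic data whose true boundary is a circle, similar in spirit to Figure 2, A; the parameter C plays the role of the slack-variable budget discussed above.

# Sketch of the kernel trick on synthetic circular data (hypothetical settings)
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two classes lying on concentric circles: not separable by a straight line in 2-D
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel performs the implicit mapping to a higher dimension;
# C acts as the slack-variable budget (smaller C tolerates more margin violations).
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print("Held-out accuracy:", clf.score(X_test, y_test))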


Fig 2. A, After back-transforming the hyperplane, the decision boundary is no longer linear in shape, as indicated by the green; B, Illustration of a support vector machine that maps the binary classification data onto a higher dimension by a kernel trick such that separation by a hyperplane is achievable, as indicated by green.

Note that the decision boundary in Figure 2, B is planar; when back-transformed to the original data space, the border will take a circular form or, sometimes, more complicated nonlinear boundaries.

Let us consider the classification problem introduced by Konstantonis et al6 to predict extraction and nonextraction treatment on the basis of demographic, cephalometric, and model covariates. Other model-building parameters besides the parameters or coefficients that define the decision boundary must be decided on and estimated from the training data. These other model options are typically called hyperparameters and must be fine-tuned to select the best-performing model. This practical example considers different kernels (ie, linear, quadratic, cubic, and Gaussian) and optimizes over a grid of slack variables and kernel sizes using 5-fold cross-validation. This fine-tuning resulted in a linear SVM with a slack-variable budget of 1.3. The resulting error table (often called a confusion matrix) based on the 5-fold cross-validation is displayed in the Table.

Table. Error table or confusion matrix of the SVM classifier from Konstantonis et al6

                        Predicted class
True class           Nonextraction   Extraction
Nonextraction        378             18
Extraction           59              86

An error table is an informative summary of the performance of a classifier. The rows in this table indicate the ground truth values used as training labels, whereas the columns show how the SVM has classified the subjects. A few things can be learned from such an error table. First of all, it can be noticed that 1 subject has been left out of the classification, as the first row only sums up to 396 subjects. Missing values motivated the removal of that subject from the study.

Furthermore, the classification accuracy can be computed by summing the diagonal cells (correct predictions) and dividing this number by the total sum of the cells ([378 + 86]/[378 + 18 + 59 + 86]), which gives an accuracy of 85.77%. The misclassification error can be computed similarly, but now by considering the off-diagonal cells (incorrect predictions), yielding an error of 14.23%. These classification results are reasonably good, but the error table provides us with more insight into the behavior of our SVM classifier. Note that for the largest class in this study (nonextraction), the SVM performed well; however, for extraction, the intraclass performance is weak, as only half of the subjects in this class were correctly classified by our selected classifier. To summarize this intraclass performance, we use the concepts of sensitivity and specificity. Sensitivity is computed as the number of correctly classified subjects that need an extraction divided by the total number of ground-truth extractions in the dataset (86/[59 + 86] = 59.31%), and specificity is computed as the number of correctly classified subjects that do not require extraction divided by the total number of nonextraction subjects in the dataset (378/[378 + 18] = 95.45%). These numbers indicate that the classifier has difficulties correctly classifying the treatment cases. Further investigation is required to clarify the reason for this imbalance. Alternatively, a trade-off can be proposed to improve the intraclass performance at the cost of the overall classification accuracy.
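The percentages above follow directly from the 4 cell counts in the Table; the short Python computation below was added for this edited version and simply restates the arithmetic.

# Reproducing the reported summary statistics from the Table's cell counts
tn, fp = 378, 18    # true nonextraction: correctly / incorrectly classified
fn, tp = 59, 86     # true extraction: incorrectly / correctly classified

total = tn + fp + fn + tp                 # 541 subjects
accuracy = (tn + tp) / total              # 464/541 = 85.77%
error = (fp + fn) / total                 # 77/541  = 14.23%
sensitivity = tp / (tp + fn)              # 86/145  = 59.31%
specificity = tn / (tn + fp)              # 378/396 = 95.45%
print(f"accuracy {accuracy:.2%}, error {error:.2%}, "
      f"sensitivity {sensitivity:.2%}, specificity {specificity:.2%}")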

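For completeness, a generic sketch of the tuning procedure described earlier (a grid over kernels, slack budgets, and kernel sizes evaluated with 5-fold cross-validation) is given below. The grid values are hypothetical, X and y are placeholders for the study covariates and extraction labels, and the commented-out class_weight line only hints at one possible way to trade overall accuracy for intraclass performance, as suggested in the text; this is not the authors' implementation.

# Generic sketch of hyperparameter tuning with 5-fold cross-validation
# (hypothetical grid values; X and y are placeholders for the study data).
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

param_grid = {
    "svc__kernel": ["linear", "poly", "rbf"],   # poly with degree 2/3 = quadratic/cubic
    "svc__degree": [2, 3],                      # used only by the polynomial kernel
    "svc__C": [0.1, 0.5, 1.0, 1.3, 2.0],        # slack-variable budget
    "svc__gamma": ["scale", 0.01, 0.1, 1.0],    # kernel size
    # "svc__class_weight": [None, "balanced"],  # one lever for the intraclass trade-off
}
pipe = make_pipeline(StandardScaler(), SVC())
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
# search.fit(X, y)                        # X, y: covariates and extraction labels
# print(search.best_params_, search.best_score_)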

To conclude, support vector machines are a powerful addition to the arsenal of ML methods, primarily because of the kernel trick. One could argue that projecting the data into unspecified higher dimensions is bad practice in terms of overfitting the data, but the simple hyperplane model, combined with a budget-and-penalty system and sufficient training observations, should prevent large generalization errors. Not surprisingly, the linear SVM classifier was found to be the optimal model for the Konstantonis dataset,6 avoiding overfitting, as the many covariates (36) and limited observations (541) already yield a high-dimensional and sparse data space that does not require a mapping to higher dimensions. As mentioned earlier, SVMs are general approximators well suited for prediction and classification tasks. However, when the focus is on understanding the relationship between the outcome and the treatment predictors, SVMs are less suitable, as no association model between the covariates and the outcome is constructed, as with generalized linear models.7 What is delivered by the SVM is only the notion of support vectors, which are the subjects that lie close to the decision boundary. A post-hoc analysis is needed to further investigate the predictive capacity of the covariates.

REFERENCES

1. Vapnik V, Chervonenkis A. On the uniform convergence of relative frequencies of events to their probabilities. In: Vovk V, Papadopoulos H, Gammerman A, editors. Measures of Complexity: Festschrift for Alexey Chervonenkis. Cham: Springer International Publishing; 2015. p. 11-30.
2. Boser B, Guyon I, Vapnik V. A training algorithm for optimal margin classifiers. In: COLT '92: Proceedings of the Fifth Annual Workshop on Computational Learning Theory. New York: ACM; 1992. p. 144-52.
3. Drucker H, Burges C, Kaufman L, Smola A, Vapnik V. Support vector regression machines. In: Advances in Neural Information Processing Systems. Cambridge: Massachusetts Institute of Technology Press; 1997. p. 155-61.
4. Geubbelmans M, Rousseau AJ, Valkenborg D, Burzykowski T. High-dimensional data. Am J Orthod Dentofacial Orthop 2023;164:453-6.
5. Valkenborg D, Rousseau AJ, Geubbelmans M, Burzykowski T. Unsupervised learning. Am J Orthod Dentofacial Orthop 2023;163:877-82.
6. Konstantonis D, Anthopoulou C, Makou M. Extraction decision and identification of treatment predictors in Class I malocclusions. Prog Orthod 2013;14:47.
7. Burzykowski T, Geubbelmans M, Rousseau AJ, Valkenborg D. Generalized linear models. Am J Orthod Dentofacial Orthop 2023;164:604-6.

