Support Vector Machines

A support vector machine (SVM) is a supervised machine learning (ML) method capable of learning from data and making decisions. The fundamental principles of the SVM were already introduced in the 1960s by Vapnik and Chervonenkis1 in a theory that was further developed throughout the following decades. However, it was only in the 1990s that SVMs attracted greater attention from the scientific community, and this was attributed to 2 significant improvements. The first extension is the kernel trick, which allows the SVM to classify highly nonlinear problems.2 The second extended the SVM to solve problems in a regression framework,3 called the support vector regression machine. These improvements have resulted in a powerful general approximator that nowadays finds its use in many applications. Typically, the mathematics and theory behind SVMs are complex and require a deep understanding of optimization theory, algebra, and learning theory.

Nonetheless, the main idea can be explained intuitively, and this article will consider a classification problem to illustrate the concepts. In what follows, it can be noticed that SVMs differ from the previously presented methods as they exploit geometries in the data and are not directly rooted in statistics (eg, generalized linear models). Instead, they originate from mathematics and engineering and are often compared with the logistic regression explained in the previous article.

The starting point of an SVM is straightforward: it will try to solve a particular binary classification problem with the simplest model possible, separating the subjects that belong to the 2 different classes by a classification boundary. In 2 dimensions, this classification boundary will form a straight line. In 3 dimensions, this classification boundary will become a plane, the generalization of a line. For higher dimensions, this boundary will be called a hyperplane, which can be considered a plane in more than 3 dimensions and is beyond our imagination. Again, the question is how well such a simple model classifies and how well the learned concepts generalize to previously unseen data.

Figure 1, A shows a separable classification problem. It is perfectly possible to separate the blue from the red subjects by using a straight line as a classification boundary. However, as illustrated in the plot, multiple options are possible. Which line should we select as our boundary to minimize the risk of misclassifying a previously unseen subject? The solution to this question presents itself in Figure 1, B and is called a maximum-margin classifier. The basic idea is simple. To minimize misclassification risk, we want our classification boundary positioned as far as possible from the neighboring subjects belonging to the different classes. The margin between the classification boundary and the training data is maximized, allowing for a tolerance region when predicting a class label for new subjects.

An important observation can be made from the figure. Data points far from the classification line do not influence its position. The only data points that determine the decision boundary are the 3 points in black in Figure 1, B. These points are called support points or support vectors. In other words, if we were to remove all the subjects from our training dataset apart from these 3 support vectors, the location of the decision boundary would remain unaltered. This example indicates that support vectors significantly influence the decision boundary, and changes in the training data will dramatically impact the decision boundary.
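To make this concrete, the following minimal sketch (not part of the original article) fits a hard-margin linear SVM with scikit-learn on a small, hypothetical 2-dimensional dataset and lists the subjects that end up as support vectors; the toy data, the very large value of C, and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data: 2 covariates, 2 classes.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],   # class 0
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.5]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates the hard maximum-margin classifier:
# essentially no budget for subjects inside the margin.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("Support vectors:\n", clf.support_vectors_)
print("Hyperplane: w =", clf.coef_.ravel(), ", b =", clf.intercept_[0])
```

With separable data such as these, only the few subjects touching the margin are returned by support_vectors_; removing any other subject would leave the fitted boundary unchanged.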
Figure 1, C shows an additional subject, indicated by an arrow, added to the training dataset. Coincidentally, this subject lies close to the decision boundary and is an influential support vector that will modify the maximum-margin problem, resulting in a different classification boundary, as indicated by the green line. However, when eyeballing the classification problem, we can be pretty satisfied with the previous classifier, indicated by dashes, which yields wider overall margins with respect to the neighboring training subjects.
Fig 1. A and B illustrate the principle of the maximum-margin classifier; C and D show the introduction
of the slack variable to allow the support vector classifier to maximize its margin while disregarding the
influence of nearby observations even when the data is inseparable.
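The sensitivity just described can be illustrated with a small, hypothetical continuation of the previous sketch (again, not part of the original article): one extra subject is placed close to the boundary, and the hyperplane is refitted with and without a slack budget, controlled by the C parameter in scikit-learn. All numbers are illustrative assumptions, not data from Figure 1.

```python
import numpy as np
from sklearn.svm import SVC

# Same hypothetical data as before, plus one class-0 subject lying close to
# the old boundary (in the spirit of the arrowed subject in Figure 1, C).
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.5],
              [3.4, 3.2]])                 # influential extra subject
y = np.array([0, 0, 0, 1, 1, 1, 0])

for C in (1e6, 1.0):  # hard margin vs. a modest slack budget
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:g}: w={clf.coef_.ravel()}, b={clf.intercept_[0]:.2f}, "
          f"support vectors={len(clf.support_vectors_)}")
```

With a very large C, the new subject becomes an influential support vector and shifts the hyperplane, whereas a smaller C tolerates it within (or beyond) the margin and keeps the boundary closer to the original solution.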
From the previous articles in this series on ML, we can recognize such behavior as a symptom of overfitting the training data. We want to avoid this overfitting by allowing some support vectors within the maximum margin, thus relaxing their influence on the decision boundary. A trade-off similar in spirit to the least absolute shrinkage and selection operator regression, discussed in a previous article,4 presents itself. In the case of SVMs, a budget-and-penalty system, based on so-called slack variables, is proposed to relax the influence of some of the support vectors. A fortiori, the same budget can be used to pay the penalty for misclassifying subjects from the training dataset. For example, in the case of nonseparable data, as indicated in Figure 1, D, misclassifying the subjects encircled in black is unavoidable and is tolerated by this budget-and-penalty system.

Up to this point, we have only considered simple classification problems, but how will an SVM perform on a complex nonlinear problem, such as the one displayed in Figure 2, A (showing that the decision boundary should be a circle)? However, with SVMs, we are restricted to straight lines, planes, or hyperplanes. The curse of dimensionality may offer a potential solution, as previously discussed.4 The curse of dimensionality states that any classification problem can be perfectly solved when adding enough covariates such that the dimension of the dataset is much increased, allowing for a clear separation of subjects belonging to different classes. However, the example in Figure 2 has only 2 covariates or dimensions, so the SVM has to create additional virtual dimensions. Toward this aim, the SVM will consider mapping the data onto a higher dimension, which is the opposite of the projection of high-dimensional data objects onto a lower dimension.5 For example, consider a projection of the 2-dimensional data onto a spherical object, as indicated in Figure 2, B. By this mapping, the problem becomes separable with a simple plane that maximizes the margin with respect to the neighboring subjects, as indicated in black. This mapping to higher dimensions is defined by a kernel function. The good news is that the mapping or projection does not need to be performed explicitly, as it is a computationally expensive step. Therefore, this step is often called the kernel trick because the SVM can find the maximum-margin classifier in an unspecified higher dimension. Note that the decision boundary in Figure 2, A, obtained by back-transforming the hyperplane to the original 2 dimensions, is no longer linear, as indicated by the green line.
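As a rough illustration of the kernel trick (again, not part of the original article), the following sketch uses scikit-learn to compare a linear SVM with a radial basis function (RBF) kernel SVM on simulated circular data. The simulated dataset, the choice of the RBF kernel, and the parameter values are assumptions made for illustration only, with C playing the role of the budget-and-penalty system described earlier.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Simulated data: one class forms an inner circle, the other an outer ring,
# so the true decision boundary is a circle rather than a straight line.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.08, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

# The linear classifier cannot draw a circle; the kernelized one can.
print("linear accuracy:", linear_svm.score(X_test, y_test))
print("RBF accuracy:   ", rbf_svm.score(X_test, y_test))
```

The kernelized classifier recovers the circular boundary without ever constructing the higher-dimensional representation explicitly.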
Fig 2. A, After back-transforming the hyperplane, the decision boundary is no longer linear in shape as
indicated by the green; B, Illustration of a support vector machine that maps the binary classification
data onto a higher dimension by a kernel trick such that separation by a hyperplane is achievable,
as indicated by green.
To conclude, support vector machines are a powerful addition to the arsenal of ML methods, primarily because of the kernel trick. One could argue that projecting the data into unspecified higher dimensions is a bad practice in terms of overfitting the data, but the simple hyperplane model combined with a budget-and-penalty system and sufficient training observations should prevent large generalization errors. Not surprisingly, the linear SVM classifier was found to be the optimal model for the Konstantonis dataset,6 avoiding overfitting because the many covariates (36) relative to the limited number of observations (541) already yield a high-dimensional and sparse data space that does not require a mapping to higher dimensions. As mentioned earlier, SVMs are general approximators well-suited for prediction and classification tasks. However, when focusing on understanding the relationship between the outcome and the treatment predictors, SVMs are less suitable, as no association model between the covariates and the outcome is constructed, as with generalized linear models.7 What is delivered by the SVM is only the set of support vectors, ie, the subjects that lie close to the decision boundary. A post-hoc analysis is needed to further investigate the predictive capacity of the covariates.

REFERENCES
1. Vapnik V, Chervonenkis A. On the uniform convergence of relative frequencies of events to their probabilities. In: Vovk V, Papadopoulos H, Gammerman A, editors. Measures of Complexity: Festschrift for Alexey Chervonenkis. Cham: Springer International Publishing; 2015. p. 11-30.
2. Boser B, Guyon I, Vapnik V. A training algorithm for optimal margin classifiers. In: COLT '92: Proceedings of the Fifth Annual Workshop on Computational Learning Theory. New York: ACM; 1992. p. 144-52.
3. Drucker H, Burges C, Kaufman L, Smola A, Vapnik V. Support vector regression machines. In: Advances in Neural Information Processing Systems. Cambridge: Massachusetts Institute of Technology Press; 1997. p. 155-61.
4. Geubbelmans M, Rousseau A-J, Valkenborg D, Burzykowski T. High-dimensional data. Am J Orthod Dentofacial Orthop 2023;164:453-6.
5. Valkenborg D, Rousseau AJ, Geubbelmans M, Burzykowski T. Unsupervised learning. Am J Orthod Dentofacial Orthop 2023;163:877-82.
6. Konstantonis D, Anthopoulou C, Makou M. Extraction decision and identification of treatment predictors in Class I malocclusions. Prog Orthod 2013;14:47.
7. Burzykowski T, Geubbelmans M, Rousseau AJ, Valkenborg D. Generalized linear models. Am J Orthod Dentofacial Orthop 2023;164:604-6.