Reaching Nirvana: Maximizing The Margin in Both Euclidean and Angular Spaces For Deep Neural Network Classification
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 36, NO. 5, MAY 2025
Abstract— The classification loss functions used in deep neural network classifiers can be split into two categories based on maximizing the margin in either Euclidean or angular spaces. Euclidean distances between sample vectors are used during classification for the methods maximizing the margin in Euclidean spaces, whereas the cosine similarity distance is used during the testing stage for the methods maximizing the margin in angular spaces. This article introduces a novel classification loss that maximizes the margin in both the Euclidean and angular spaces at the same time. This way, the Euclidean and cosine distances produce similar and consistent results and complement each other, which in turn improves the accuracies. The proposed loss function enforces the samples of classes to cluster around the centers that represent them. The centers approximating classes are chosen from the boundary of a hypersphere, and the pairwise distances between class centers are always equivalent. This restriction corresponds to choosing centers from the vertices of a regular simplex inscribed in a hypersphere. The proposed loss function can be effortlessly applied to classical classification problems as there is a single hyperparameter that must be set by the user, and setting this parameter is straightforward. Additionally, the proposed method can effectively reject test samples from unfamiliar classes by measuring their distances from the known class centers, since the known class samples are compactly clustered around their corresponding centers. Therefore, the proposed technique is especially suitable for open set recognition problems. Despite its simplicity, experimental studies have demonstrated that the proposed method outperforms other techniques in both open set recognition and classical classification problems. Interested individuals can access the source code for the proposed approach at [Link]

Index Terms— Classification, computer vision, deep learning, neural collapse, open set recognition, simplex classifier.

Manuscript received 27 February 2023; revised 17 October 2023 and 4 April 2024; accepted 29 July 2024. Date of publication 12 August 2024; date of current version 5 May 2025. This work was supported by the Scientific and Technological Research Council of Türkiye (TÜBİTAK) under Grant EEEAG-121E390. (Corresponding author: Hakan Cevikalp.)

Hakan Cevikalp and Bedirhan Uzun are with the Department of Electrical and Electronics Engineering, Eskişehir Osmangazi University, 26040 Eskişehir, Türkiye (e-mail: [Link]@[Link]).

Hasan Saribas is with the AIE Department, Huawei Türkiye Research and Development Center, 34768 Istanbul, Türkiye.

Digital Object Identifier 10.1109/TNNLS.2024.3437641

2162-237X © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See [Link] for more information.

I. INTRODUCTION

DEEP neural network classifiers have been dominating many fields, including computer vision, by achieving state-of-the-art accuracies in many tasks such as visual object, activity, face, and scene classification. Therefore, new deep neural network architectures and different classification losses are constantly being developed. The softmax loss function is the most common function used for classification in deep neural network classifiers. Although the softmax loss yields satisfactory accuracies for general object classification problems, its performance for discriminating instances coming from the same class categories (e.g., face recognition) or for open set recognition (a classification scenario that allows the test samples to come from novel classes) is not satisfactory. The performance decrease is typically attributed to two factors: there is no mechanism for enforcing a large margin between classes, and the softmax loss does not attempt to minimize the within-class scatter, which is critical for obtaining good accuracies in open set recognition problems.

To improve the classification accuracies of deep neural network classifiers, many researchers have focused on maximizing the margin between classes. The recent methods can be roughly grouped into two categories based on maximizing the margin in either Euclidean or angular spaces. The methods targeting margin maximization in Euclidean spaces attempt to minimize the Euclidean distances among the samples coming from the same classes and maximize the distances among the samples coming from different classes. Euclidean distances are used during the testing stage after the network is trained. In contrast, the methods maximizing the margin in angular spaces use the cosine distances for classification.

In this article, we propose a novel method that maximizes the margin in both the Euclidean and angular spaces at the same time. The proposed methodology first selects class centers from the vertices of a regular simplex inscribed in a hypersphere and utilizes a loss function that minimizes the distances between the samples and their corresponding class centers.

A. Related Work

Wen et al. [1], [2] introduced the center loss for face recognition to maximize the margin in the Euclidean space, and they reported significant improvements over the method using the softmax loss function in the context of face recognition. The range loss is combined with the softmax loss function in [3] to maximize the margin in Euclidean spaces. Wei et al. [4] proposed a classifier that combines the softmax loss and center loss functions with the minimum margin loss. A method combining the softmax loss function with the marginal loss is proposed by Deng et al. [5]. Cevikalp et al. [6] proposed a deep neural network based open set
recognition method that returns compact class acceptance regions for each known class. In this framework, hinge loss and polyhedral conic functions are used for the between-class separation. The methods using contrastive loss [7] also return compact class acceptance regions. To this end, they minimize the Euclidean distances of the positive sample pairs and penalize the negative pairs that have a distance smaller than a given margin threshold. In a similar manner, Schroff et al. [8], Hoffer and Ailon [9], Sohn [10], and Roy et al. [11] employ a triplet loss function that uses triplets including a positive sample, a negative sample, and an anchor. An anchor is also a positive sample; thus, the within-class compactness is achieved by minimizing the Euclidean distances between the anchor and positive samples, whereas the distances between the anchor and negative samples are maximized for the between-class separation. The employment of contrastive or triplet loss functions has a significant drawback: the number of sample pairs or triplets grows quadratically or cubically with the total number of samples. This leads to slow convergence and instability in the training process, necessitating cautious data sampling/mining to mitigate these issues. Overall, the majority of the methods maximizing the margin in Euclidean spaces have the shortcoming that they are too complex, since the user has to set many weighting and margin parameters. Furthermore, many of these methods are not suitable for open set recognition problems since they do not return compact acceptance regions for classes.

The methods that enlarge the margin in angular spaces typically revise the classical softmax loss function to maximize the angular margins between rival classes. These methods use either multiplicative or additive margins for the interclass separation in angular spaces. Among these, the SphereFace [12], [13] and RegularFace [14] methods employ multiplicative margins, whereas the CosFace [15] and ArcFace [16] methods use additive margins. The majority of these methods normalize the feature vectors, the classifier weights, or both, since the similarities are computed by using the angles. We would like to point out that almost all methods that maximize the margin in the angular space are proposed for face recognition. As indicated in [6], these methods use subspace approximations for the classes, and the similarities are measured by using the angles between sample vectors. However, subspace approximations work well for classification settings where the number of features is much larger than the number of class-specific samples. This is typically satisfied for face recognition problems, but there are many classification tasks that do not satisfy this criterion. In addition to this problem, these methods are also complex since they have many parameters that must be set by the user, as in the methods that maximize the margin in Euclidean spaces.

The methods that are closest to the proposed methodology are proposed in [17], [18], and [19]. These methods introduce loss functions for learning uniformly distributed representations on the hypersphere manifold through potential energy minimization. However, these studies consider the layer regularization problem rather than the direct classification problem and apply hyperspherical uniformity to the learned weights. The main idea is to learn diverse deep neural network weights that are uniformly distributed on a hypersphere in order to reduce redundancy. Therefore, these methods are more complex (in some sense they are also more sophisticated, since they apply hyperspherical uniformity to all neural network layers). Consequently, there are many hyperparameters that must be fixed in the resulting method. Also, when this idea is used in the classification layer, the distances between the resulting class representative weights are not equivalent as in our proposed method. A related study called UniformFace [20] used the same idea in the classification layer only and introduced a uniform loss function to learn equidistributed representations for face recognition. Another similar method using class centroids is introduced in [21] for distance metric learning. Although this study focuses on distance metric learning, it uses class centers chosen as the basis vectors of C-dimensional space as anchors. Then, as in the triplet loss, it attempts to minimize the distances between the data samples and the corresponding class centers and to maximize the distances between the samples and rival class centers. The selected class centers are fixed as in our proposed method, and it has the restriction that the feature dimension must be larger than or equal to the number of classes, similar to our case. Compared to this method, our proposed method is much simpler and its run-time complexity is significantly lower. Additionally, there are two significant oversights made by the authors in their proposed methodology. The first oversight concerns their choice of centers, which are selected from the surface of a unit hypersphere (a hypersphere with a radius of 1). As expounded upon below, data samples tend to cluster near the surface of an expanding hypersphere as the dimensionality increases. Consequently, setting the hypersphere radius to 1 is not well-suited for high-dimensional feature spaces, a viewpoint that is supported by findings reported in studies such as ArcFace [16] and CosFace [15]. The second concern revolves around the exclusive use of a fully connected layer to increase dimensionality, particularly when the feature dimension is smaller than the number of classes. A fully connected layer just uses a linear combination of the existing features, and the resulting space has the same dimensionality as the original feature space in the best case (this issue is explained in more detail below). As a result, the dimensionality is not increased, and this method will not work for large-scale problems where the number of classes is very large.

There are studies using or mentioning simplex centers as in our proposed method. Among these methods, Papyan et al. [22] show that the samples of different classes cluster around the class centers forming the vertices of a regular simplex (as we proposed in this study) at the last stages of the learning process when linear classifiers are used with the softmax loss function and the feature dimension is higher than the number of classes. They show that the lengths of the vectors of the class means (after centering by their global mean) converge to the same length and the angles between pairwise center vectors become equal during the last training stages (called the terminal phase of training in the study) of the deep neural
networks using linear classifiers. This method is different from our proposed method in the sense that they do not use fixed class centers chosen from the vertices of a simplex. Instead, they directly use the softmax loss function and learn class weights. In general, they simply provide theoretical arguments showing that using the softmax loss function with linear classifiers yields embeddings where the class samples cluster around the vertices of a regular simplex after some kind of normalization. Pernici et al. [23], [24], Kasarla et al. [25], and Bytyqi et al. [26] use fixed centers chosen from the vertices of a regular simplex as in our proposed method. However, all of them utilize variants of the softmax loss function, including hyperparameters that must be fixed by the user. None of them proposes a simple loss function as in our proposed method. Using the softmax loss function yields radial distributions, as illustrated in these studies. Therefore, their success is not satisfactory, especially in open set recognition problems, since the resulting embeddings are not as compact as in our proposed method; please see the related discussion given in Section II-C below. Also, none of these studies considered the case when the dimension is smaller than the number of classes or conducted experiments on this setting. For such cases, we need to increase the dimension of the feature space, and we propose solutions to handle this case. In contrast, none of these methods proposes an effective solution for this case. Yang et al. [27] introduced an alternative loss function named the dot regression loss, which, like our proposed method, utilizes centers selected from the vertices of a regular simplex. However, their approach requires the selection of two parameters, making our method comparatively simpler. Additionally, the loss function described in [27] mandates that feature samples conform to the surface of a hypersphere with a predefined radius, akin to the spherical embeddings used in the ArcFace method [16]. In contrast, our method does not impose such constraints, allowing the samples to occupy the full feature space for embedding.

B. Contributions

The methods that maximize the margin in Euclidean or angular spaces mentioned above have shortcomings: their objective loss functions include many terms that need to be weighted, their class acceptance regions are not compact, or they need additional hard-mining algorithms. In this study, we propose a simple yet effective method that does not have these limitations. Our proposed method maximizes the margin in both the Euclidean and angular spaces. To the best of our knowledge, our proposed method is the first method that maximizes the margin in both spaces. To accomplish this goal, we train a deep neural network that enforces the samples to gather in the vicinity of the class-specific centers that lie on the boundary of a hypersphere whose center is set to the origin. Each class is represented with a single center, and the distances between the class centers are equivalent. This corresponds to the selection of class centers from the vertices of a regular simplex inscribed in a hypersphere. Both the Euclidean distances and angular distances between class centers are equivalent to each other.

Our proposed method has many advantages over other margin-maximizing deep neural network classifiers. These advantages can be summarized as follows.

1) The proposed method is very straightforward in the sense that one needs to fix only one parameter, the hypersphere radius. Prior research on classification methods employing hyperspherical embeddings has already investigated the selection of this parameter, with [15] offering lower bounds for its determination. Therefore, setting this parameter is extremely easy for the users. For open set recognition, the user has to set two parameters if background class samples are used for learning.

2) The proposed method returns compact and interpretable acceptance regions for each class; thus, it is very suitable for open set recognition problems. Other methods utilizing simplex vertices for classification purposes use variants of the softmax loss function and return radial distributions, which are not compact. Therefore, their accuracies are not satisfactory for open set recognition.

3) The distances between the samples and their corresponding centers are minimized independently of each other; thus, the proposed method also works well for imbalanced datasets.

4) We investigate scenarios where the utilization of centers from a regular simplex is unfeasible due to the dimensionality of the feature space being less than the number of classes minus one (d < C − 1). In such instances, neural collapse does not occur, and the case where d < C − 1 remains largely unexplored with no proposed efficient solutions. Here, we address this issue by introducing a new module that augments the dimensionality of the feature space, as elaborated upon below.

Against all these advantages, there is only one limitation of the proposed method: the dimension of the CNN features must be larger than or equal to the total number of classes minus 1. To overcome this limitation, we introduce two solutions: the first solution uses a dimension augmentation module (DAM), whereas the second solution revises the existing deep neural network architectures.

II. METHOD

A. Motivation

In this study, we introduce a simple yet effective deep neural network classifier that maximizes the margin in both Euclidean and angular spaces. To this end, we propose a novel classification loss function that enforces the samples to compactly cluster around class-specific centers that are selected from the outer boundary of a hypersphere. The Euclidean distances and angles between the centers are equivalent. Please note that, in terms of margin maximization, the distances between the class centers are the maximum values for angular distances. In a similar manner, for Euclidean distances, if the class centers are enforced to lie on the boundary of a hypersphere, the distances among the classes again become the best optimal solution we can get. Theoretical proofs of this fact can be
Fig. 1. In the proposed method, class samples are enforced to lie close to the class-specific centers representing them, and the class centers are located on the boundary of a hypersphere. All the distances between the class centers are equivalent; thus, there is no need to tune any margin term. The class centers form the vertices of a regular simplex inscribed in a hypersphere. Therefore, to separate C different classes, the dimensionality of the feature space must be at least C − 1. The figure on the left shows the separation of two classes in 1-D space, the middle figure depicts the separation of three classes in 2-D space, and the figure on the right illustrates the separation of four classes in 3-D space. For all cases, the centers are chosen from a regular C-simplex.
found in both [19] and [26]. Using simplex vertices as class centers is illustrated in Fig. 1. In this figure, the centers representing the classes are denoted by star symbols, whereas the class samples are represented with circles having different colors based on their class memberships. As seen in the figure, all pairwise distances between the class centers are equivalent, and the class centers are located on the boundary of a hypersphere. Moreover, if the hypersphere center is set to the origin, then the angles between the class centers are also the same, and the lengths of the centers are equivalent, i.e., $\|\mathbf{s}_i\| = u$ (u is the length of the center vectors). After the learning stage, if the class samples are compactly clustered around the centers representing them, we can classify the data samples based on the Euclidean or angular distances from the class centers. Both distances yield the same results if the hypersphere center is set to the origin.

At this point, the question of whether enforcing data samples to lie around the simplex vertices is appropriate or not comes to mind. In fact, high-dimensional spaces are quite different from low-dimensional spaces, and there are many studies showing that the data samples lie on the boundary of a hypersphere when the feature dimensionality, d, is high and the number of samples, n, is small. For example, Jimenez and Landgrebe [28] theoretically show that high-dimensional spaces are mostly empty and data concentrate on the outside of a shell (on the outer boundary of a hypersphere). The authors also show that as the number of dimensions increases, the shell increases its distance from the origin. More precisely, the data samples lie near the outer surface of a growing hypersphere in high-dimensional spaces (therefore, setting the hypersphere radius to 1 as in [21] is not suitable for high-dimensional spaces). A more recent study [29] explicitly shows that the data samples lie at the vertices of a regular simplex in high-dimensional spaces. These two studies are not contradictory, and they support each other, since we can always inscribe a regular simplex in a hypersphere, as seen in Fig. 1. In addition to these studies, Kumar et al. [30] and Weber [31] show that the eigenvectors of Laplacian matrices (the matrices computed by operating on similarity matrices in spectral clustering analysis) form a simplex structure, and they use the vertices of the resulting simplex for clustering of the data samples. In other words, they prove that when the data samples are mapped to the Laplacian eigenspace, they concentrate on the vertices of a simplex structure. These studies are also complementary to the studies showing that high-dimensional data samples lie on the boundary of a growing hypersphere: as proved in [32], the normalized cuts (NCuts) [33] clustering algorithm, which is presented as a spectral relaxation of a graph cut problem, maps the data samples onto an infinite-dimensional feature space. Therefore, these data samples naturally concentrate on the vertices of a regular simplex due to the high dimensionality of the feature space.

There are strong arguments that verify that high-dimensional data samples concentrate on the vertices of a regular simplex, as discussed above. Do the same arguments hold for the high-dimensional features produced by deep neural network classifiers? A recent study [22] answers this question and reveals that the samples of different classes cluster around the class centers forming the vertices of a regular simplex (as we proposed in this study) at the last stages of the learning process when the feature dimension is higher than the number of classes. They show that the lengths of the vectors of the class means (after centering by their global mean) converge to the same length and the angles between pairwise center vectors become equal during the last training stages (called the terminal phase of training in the study) of deep neural networks using linear classifiers. They also demonstrate that the within-class scatter converges to zero, indicating that the class-specific samples gather around their corresponding class centers. A geometrical analysis of this study is given in [34]. However, both studies are not complete in the sense that they do not consider the cases when the dimension of the feature space is smaller than the number of classes, so that it is impossible to fit the class centers to the vertices of a regular simplex. Also, the authors do not propose an efficient method as in our proposed method; instead, they use the classical softmax loss function with linear classifiers and learn class weights for classification. In contrast, in this article we propose an efficient method that directly enforces the samples to lie close to the vertices of a regular simplex. We do not learn class weights; instead, we use fixed class centers chosen from the vertices of a regular simplex. In addition, we consider the dimension restriction (when the number of classes is larger
than the feature dimension) and introduce solutions to handle between the classes are again the maximum optimal value one
this problem as explained below. can get. Therefore, there is no need of using a loss term for the
interclass separation. Now, let us assume that the deep neural
network features of training samples are given in the form
B. Maximizing Margin in Euclidean and Angular Spaces
(fi , yi ), i = 1, . . . , n, fi ∈ IRd , yi ∈ { j} where j = 1, . . . , C.
Here, we propose a novel and simple method that enforces Here, C is the total number of known classes, and we assume
the samples of classes to cluster around the centers chosen that the feature dimension d is larger than or equal to C − 1,
from the vertices of a regular simplex. As shown in [22], i.e., d ≥ C − 1. Under these assumptions, the loss function of
all class samples cluster around the class centers forming the the proposed method can be written as
vertices of a regular simplex when the dimension of the feature n
space is larger than the number of classes. Therefore, there is 1X 2
L= fi − s yi . (4)
no need to use complicated classifier layers, and the same n i=1
effect can be accomplished by using much simpler classi-
fication layers as in our proposed method. In the proposed The loss function includes a single term that targets to mini-
method, instead of using more complicated linear classifiers mize the within-class variations by minimizing the distances
and learning class weights for each class, we directly enforce between the samples and their corresponding class centers,
the class samples to compactly cluster around the fixed class which are set to the vertices of a regular simplex. There is
centers chosen from the vertices of a regular simplex. All no need for another loss term for the between-class separa-
the pair-wise distances between the selected class centers are tion since the selected centers have the maximum possible
equivalent. Euclidean and angular distances among them. As a result, there
Let us assume that there are C classes in our dataset. In this is no hyperparameter that must be fixed, and the proposed
case, we first need to create a C-simplex (some researchers call method is extremely easy for the users. Moreover, the data
it C −1 simplex considering the feature dimension, but we will samples compactly cluster around their class centers, therefore
prefer C-simplex definition). The vertices of a regular simplex the proposed method results in compact acceptance regions
inscribed in a hypersphere with radius 1 can be defined as for classes, which is crucial for the success in the context
follows: of the open set recognition. It should be noted that our
( proposed method is quite different than the methods using
(C − 1)−1/2 1, j = 1 vertices of a regular simplex as in our proposed method. It is
vj = (1)
κ1 + ηe j−1 , 2≤ j ≤C because, all these methods use variants of the softmax loss
function that typically require setting margin parameters for
where the interclass separation. Furthermore, these methods return
√ r
1+ C C noncompact radial distributions (see [24, Figs. 2, 4, 5, and
κ=− , η= . (2) 8] and [26, Fig. 2]). Therefore, their performance will not be
(C − 1)3/2 C −1
satisfactory for open-set recognition problems. We call our
Here, 1 is an appropriately sized vector whose elements are proposed method as deep simplex classifier (DSC).
all 1, e j is the natural basis vector in which the j−th entry is The running time of the proposed method will be more
1 and all other entries are 0. Such a C−simplex is in fact a efficient compared to the methods using the softmax loss func-
C−dimensional polyhedron where the distances between the tion and its variants, Arcface [16], Cosface [15], and regular
vertices are equivalent. It must be noted that the distances polytope networks [24]. Because, these methods require to
between the vertices do not change even if the simplex apply exponential function to each logit,(w⊤c fi + bc ), followed
is rotated or translated. But, the dimension of the feature by a normalization by dividing with the sum of all these
space must be at least C − 1 in order to define such a exponentials as seen in the softmax loss function given below
regular C−simplex. Next, we must define the radius, u, of the ⊤
n
hypersphere. This term is similar to the scaling parameter 1X ew yi fi +b yi
L=− log PC . (5)
used in methods such as ArcFace [16] and CosFace [15], n i=1 w⊤j fi +b j
j=1 e
that maximize the margin in angular spaces. As the dimension
increases, it must be also increased since the studies [28] show On the other hand, we just need to extract the CNN features of
that the hypersphere whose outer shells include the data also the test samples during training and testing stages. Then, these
grows as the dimension is increased. Wang et al. [15] provided features are compared to precomputed centers by using the
a lower bound for the determination of this parameter. Then, Euclidean distances. Therefore, the proposed method is more
we set the class centers that will represent the classes as efficient in terms of computational complexity. However, this
does not affect testing times much since the most of the time
s j = uv j , j = 1, . . . , C. (3) is spent on convolutional layers of the deep neural network
classifier during the testing stage.
The order of selection of centers does not matter since the
distances among all centers are equivalent. These distances
are the best optimal values that we can get when the cosine C. Including Background Class for Open Set Recognition
distances are used as theoretically proved in [19] and [26]. In open-set recognition scenarios, the training of classifiers
In a similar manner, when the class centers are restricted to commences by exclusively utilizing samples of known classes.
lie on the boundary of a hypersphere, the Euclidean distances Subsequently, both known and unknown class samples are
Authorized licensed use limited to: SHIV NADAR UNIVERSITY. Downloaded on May 21,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
CEVIKALP et al.: REACHING NIRVANA: MAXIMIZING THE MARGIN IN BOTH EUCLIDEAN AND ANGULAR SPACES 8183
8184 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 36, NO. 5, MAY 2025
layer. The second fully connected layer increases the dimension to the desired feature space size, (C − 1). Then, we apply another PReLU function followed by the last fully connected layer. It should be noted that, following the ReLU (or PReLU) operation, the majority of values may become positive, despite their corresponding centers having negative values. Therefore, the last layer in the module includes a fully connected layer that maps the (C − 1)-dimensional feature space back to a (C − 1)-dimensional feature space so that the sample features may have negative values. The proposed module increases the dimension in two steps as explained above. The dimension can also be directly increased from d to C − 1 in the first fully connected layer. In a similar manner, we can increase the dimension in more than two steps if desired.¹ The main idea of the proposed DAM is similar in spirit to the kernel mapping idea used in kernel methods [42], [43], with the exception that we explicitly map the data to a higher dimensional feature space as in [44] and [45]. It should be noted that Do et al. [21] proposed to use a fully connected layer alone for increasing the dimensionality of the feature space. However, a fully connected layer uses only linear combinations of the existing features, and the resulting space has a dimensionality that is lower than or equal to the original feature space dimension. Therefore, one has to use activation functions to introduce nonlinearity and increase the dimension, as in our proposed module.

2) Revising Network Architecture: We can also solve the dimension problem by slightly changing the existing CNN architectures instead of using our proposed plug-and-play DAM. To this end, we can avoid the fully connected layers that are used for dimension reduction in the last layers of deep CNNs. For example, in the ResNet architectures we used for face recognition in our experiments, the dimension of the feature space is 25 088 just before the fully connected layers, and it is reduced to 512 after the fully connected layers. Instead of reducing the dimension to 512, we can reduce it to values that solve the current problem. If the number of classes is much larger than 25 088, we can use more filters at the last layers to increase this number. In this study, we used the 25 088-dimensional feature space and reduced the feature size to 12 500 by using a fully connected linear layer (without PReLU) for training the large-scale dataset sampled from the MS1MV3 dataset [46] without any need for dimension augmentation.

¹Our shared software allows selecting any desired number of steps for increasing the dimensionality.

III. EXPERIMENTS

A. Illustrations and Ablation Studies

Fig. 3. Illustration of the proposed method: we use well-known architectures (such as ResNet-18 and ResNet-101) as backbones and we only change the classification loss layer. If the dimension of the CNN feature space is smaller than C − 1, we increase the dimension to the desired size by using the DAM module or by revising the network architecture, and then apply the proposed loss function.

Here, we first conducted some experiments to visualize the embedding spaces returned by various loss functions using the vertices of the regular simplex. To this end, we utilized a small deep neural network that yields 2-D CNN features. As training data, we selected three classes from the Cifar-10 dataset, since the maximum number of classes is bounded by 3 in 2-D spaces in the proposed method. We would like to point out that we can use different loss functions in addition to our default loss function given in (4) once we determine the vertices of the simplex that will represent the classes. For this experiment, we used two other loss functions. The first one is the hinge loss that minimizes the distance between a sample and its corresponding class center if the distance is larger than a selected threshold:

    L_hinge = (1/n) ∑_{i=1}^{n} max(0, ‖f_i − s_{y_i}‖ − m)².  (7)

This loss function does not minimize the distances between the samples and their corresponding centers if the distances are already smaller than the selected threshold, m. This way, class-specific samples are compactly clustered in a hypersphere with radius m. For the second loss function, we used the variant of the softmax loss function where the weights are fixed to the simplex vertices, as in

    L_softmax = −(1/n) ∑_{i=1}^{n} log ( e^{s_{y_i}^⊤ f_i + b_{y_i}} / ∑_{j=1}^{C} e^{s_j^⊤ f_i + b_j} ).  (8)

For the softmax loss, we fix the classifier weights to the predefined class centers and only update the features of the samples by using backpropagation. We set the hypersphere radius to u = 5 since this is a simple dataset.

The embeddings returned by the deep neural networks using different loss functions are plotted in Fig. 5. The first figure on the left is obtained by our default loss function that does not need any parameter selection. All data samples are compactly clustered around their class means, as expected. The second loss function using the hinge loss returns spherical distributions based on the selected margin, m, and the classes are still separable by a margin. In contrast, when the softmax is used with the simplex vertices, the data samples are very close and they overlap since there is no margin among the classes. Therefore, our default loss function seems to be the best choice among all tested variants since it does not need fixing any parameter and returns compact class regions.

We also conducted tests on imbalanced datasets. In our proposed method, the distances between the samples and their corresponding class centers are minimized independently of each other. Therefore, we expect the proposed method to be more robust against imbalanced datasets. To verify this, we conducted experiments on the same three classes used before. We used the same deep neural network classifier yielding 2-D feature spaces for this experiment. The number
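The two ablation losses in (7) and (8) can be sketched in NumPy as follows. This is illustrative only: the function names are ours, and in the actual training the features f_i are updated by backpropagation rather than evaluated on fixed arrays.

```python
import numpy as np

def hinge_center_loss(F, S, y, m):
    """Eq. (7): squared hinge on the Euclidean distance to the assigned center;
    samples already within radius m of their center contribute zero loss."""
    d = np.linalg.norm(F - S[y], axis=1)            # ||f_i - s_{y_i}||
    return np.mean(np.maximum(0.0, d - m) ** 2)

def fixed_softmax_loss(F, S, y, b=None):
    """Eq. (8): softmax cross-entropy whose classifier weights are frozen to
    the simplex vertices s_j; only the features F are learned."""
    b = np.zeros(S.shape[0]) if b is None else b
    logits = F @ S.T + b                            # s_j^T f_i + b_j
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(logp[np.arange(len(y)), y])

S = np.array([[1.0, 0.0], [-0.5, 0.9], [-0.5, -0.9]])  # three 2-D class centers
y = np.array([0, 1, 2])
F = S[y].copy()                                        # features exactly on their centers
loss_h = hinge_center_loss(F, S, y, m=0.5)             # 0.0: all distances below m
loss_s = fixed_softmax_loss(F, S, y)                   # > 0: softmax loss never reaches zero
```

The contrast visible here matches the figure: the hinge loss goes exactly to zero once every sample sits within radius m of its center, whereas the fixed-weight softmax keeps pulling samples even when they overlap, since it never enforces a margin.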
Fig. 4. Plug-and-play module that will be used for increasing the feature dimension. It maps d-dimensional feature vectors onto a much higher (C − 1)-dimensional space. The DAM module was specifically designed to allow users to choose any desired number of steps for increasing dimensionality. It is possible to increase the dimension in a single step or gradually increase it using multiple steps. This figure depicts the case when two steps are used for increasing the dimension.
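The two-step DAM mapping shown in Fig. 4 can be sketched as a toy forward pass. This is an illustrative NumPy sketch with tiny layer sizes in place of the paper's d = 512 and 12 500; the real module learns its weights and PReLU slopes, and `dam_forward` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

def prelu(x, a=0.25):
    """PReLU activation; the slope a is a learned parameter in the real network."""
    return np.where(x > 0.0, x, a * x)

def dam_forward(f, d=8, hidden=16, out=9):
    """Toy DAM forward pass: d -> hidden -> out (= C - 1), each linear layer
    followed by PReLU, then a final out -> out linear layer WITHOUT activation
    so the output features can take negative values like the simplex centers."""
    W1 = rng.standard_normal((d, hidden)) * 0.1
    W2 = rng.standard_normal((hidden, out)) * 0.1
    W3 = rng.standard_normal((out, out)) * 0.1
    h = prelu(f @ W1)       # step 1: increase dimension, then PReLU
    h = prelu(h @ W2)       # step 2: reach C - 1 dimensions, then PReLU
    return h @ W3           # final linear layer restores mixed-sign features

features = rng.standard_normal((2, 8))   # a batch of two d-dimensional CNN features
z = dam_forward(features)                # shape (2, 9)
```

The final activation-free layer is the detail the text emphasizes: after PReLU most coordinates are positive, so a last linear map is needed before the features can match centers with negative coordinates.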
TABLE I
AUC Scores (%) of Open Set Recognition Methods on Tested Datasets (n.r. stands for not reported). The best accuracies are shown with red fonts, whereas statistically similar performances are shown with blue fonts. The methods that statistically perform poorly are shown with standard black font. The standard deviation of the Objectosphere method is assumed as 1 for the Cifar-10 dataset.

TABLE II
Closed-Set Accuracies (%) of Open-Set Recognition Methods on Tested Datasets
2) Results: The main goal of open set recognition is to detect and reject the samples that come from the novel classes. The performance of open set recognition is often measured using area under the ROC curve (AUC) scores. Additionally, the closed set accuracy is also reported to evaluate classification performance on known data by disregarding unknown samples, as demonstrated in previous works such as [48] and [52]. We trained our proposed method using the loss function given in (6), which is especially designed for the open-set recognition setting. Our proposed method, DSC, is compared against other state-of-the-art open set recognition methods including the maximally separating matrix method of [25] using simplex vertices, C2AE [53], Softmax, OpenMax [35], OSRCI [52], CAC [37], RPL [50], CROSR [49], ROSR [49], generative-discriminative feature representations (GDFRs) [54], and Objectosphere [55]. Except for the TinyImageNet dataset, we employed the identical network backbone as in [52] for all datasets. To achieve higher accuracies on the TinyImageNet dataset, we utilized a deeper ResNet-50 architecture. The hypersphere radius is set to u = 64 as in the ArcFace method. The proposed methods demonstrated accuracies that are directly comparable to those reported in [52] for most of the tested datasets, as the network weights were randomly initialized during the training stage.

AUC scores are summarized in Table I, which shows that the proposed method achieved the best accuracies across all datasets except for Cifar-10 and SVHN. We also conducted statistical significance tests to assess the variances in accuracy between the proposed method and its competitors listed in Table I. This examination employs a null hypothesis statistical test utilizing the t-distribution. If the obtained significance falls below the predefined significance threshold (set at 0.05), we reject the null hypothesis, indicating that there is a statistically significant difference in performance between the two methods. The highest accuracy scores are highlighted in bold red text, while methods exhibiting statistically similar performance are indicated in bold blue. Results for methods that perform poorly from a statistical perspective are presented in standard black font. Notably, there were significant performance differences observed for the Mnist, Cifar+10, Cifar+50, and TinyImageNet datasets. Our proposed method achieves significantly better accuracies compared to the other tested methods. For the Cifar-10 dataset, our proposed method performs statistically similarly to the best performing method, whereas all tested methods perform worse than the best performing method on the SVHN dataset. Closed-set accuracies for the open-set recognition methods are reported in Table II, where the proposed method achieved the best accuracies among the tested methods, with the exception of the SVHN dataset. Obtaining the best accuracies in terms of both AUC scores and closed-set accuracies indicates that our proposed method can easily identify and reject the novel class samples and correctly classify the known class samples, as expected.

C. Closed-Set Recognition Experiments

1) Experiments on Moderate Sized Datasets: Here, we conducted closed-set recognition experiments on moderate sized datasets. Our proposed method did not need the DAM since the feature dimension is much larger than the number of classes
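The significance test described above can be sketched as follows: a NumPy computation of the Welch two-sample t statistic and its Welch-Satterthwaite degrees of freedom. The run-level AUC scores below are made-up illustrative numbers, and in practice the p-value would be obtained from the t-distribution (e.g., via `scipy.stats.ttest_ind`) and compared against the 0.05 threshold.

```python
import numpy as np

def welch_t(a, b):
    """Two-sample t statistic with unequal variances (Welch's test), plus the
    Welch-Satterthwaite degrees of freedom used to look up the p-value."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

# Hypothetical AUC scores (%) from five runs of two methods:
ours = [96.1, 95.8, 96.4, 96.0, 95.9]
baseline = [94.2, 94.7, 94.1, 94.5, 94.3]
t, dof = welch_t(ours, baseline)   # a large |t| leads to rejecting the null hypothesis
```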
TABLE IV
Classification Accuracies (%) on ImageNet Dataset
in the training set for these experiments. We compared our results to the methods that maximize the margin in Euclidean or angular spaces. We implemented the compared methods by using the source codes provided by their authors, and we used the ResNet-18 architecture [56] as backbone for all tested methods. Therefore, our results are directly comparable. We set the hypersphere radius to u = 64 as before.

Classification accuracies are given in Table III. For the Mnist dataset, the majority of the tested methods yield the same accuracy, but our proposed DSC method outperforms all tested methods on the Cifar-10 and Cifar-100 datasets. The performance difference is significant, especially on the Cifar-100 dataset. These results verify the superiority of margin maximization in both Euclidean and angular spaces. Achieving the best accuracies is encouraging, because our proposed method is very simple and does not need any parameter tuning, yet it outperforms more complex methods.

We also conducted tests on the ImageNet dataset [57]. We used a deeper architecture, ResNet-101, since this is a large-scale dataset including 1000 classes. The results are given in Table IV. We compared our results to the method using the softmax loss function, the large-margin softmax loss function [13], and a closely related method, the maximally separating matrix method of [25], which uses simplex vertices as fixed class centers as in our proposed method. As seen in the table, our proposed method outperforms all methods and achieves the best results in terms of both top-1 and top-5 accuracies.

2) Experiments on Large-Scale Datasets: We also tested the proposed method in the classification setting where the number of classes is much larger than the feature dimensionality. As stated earlier, the dimension restriction occurs in such settings. To overcome this, we utilized the DAM and the revised network architecture as explained in Section II-D. DSC_DAM represents the classifier using the DAM, and DSC_RNA represents the classifier using the revised network architecture. We tested the proposed methods on face verification and recognition problems.

To conduct every face verification test, the standard procedure is followed by employing the same network that has been trained on a large-scale face dataset. The network that is utilized for this purpose has been trained on the MS1MV3 dataset [46], which is a refined variant of the MS-Celeb-1M dataset [58], and incorporates the proposed loss function. The MS1MV3 dataset includes approximately 91K individuals. We used the first 12K individuals having the most samples per class in our experiments (using more classes yielded memory problems with the GPUs we used for the experiments). The ResNet-101 architecture is used as backbone, and this backbone yields CNN features whose dimension is d = 512. Therefore, the number of classes is much larger than the feature dimension, d = 512. For both proposed classifiers, we mapped the feature dimension to 12 500 rather than C − 1 = 11 999. For DSC_DAM, we used only one layer with PReLU activation functions, which required estimating an additional 512 × 12 500 + (12 500)² − 512 × 12 000 weight parameters for the utilized network. We also applied batch normalization after the PReLU layer. For DSC_RNA, we first removed the original fully connected layer that maps the 25 088-dimensional CNN features to the 512-dimensional space. Then, we added a fully connected layer (without PReLU) that maps the 25 088-dimensional CNN features to a 12 500-dimensional feature space. Therefore, this revision requires the estimation of an additional 25 088 × 12 500 − [25 088 × 512 + 512 × 12 000] weights. The hypersphere radius is set to 2000. Training the network DSC_RNA using the revised network architecture took 11 444 s (3.178 h) to finish one epoch, whereas the network using the DAM, DSC_DAM, completed an epoch in 11 137 s (3.093 h). In contrast, a network that uses the 512-D CNN feature space with the classical softmax loss function finishes an epoch in 8962 s (2.489 h). Therefore, DSC_RNA is approximately 1.28 times slower and DSC_DAM is 1.24 times slower compared to a classical network that uses the softmax loss function. Once the networks are trained, we used the resulting architectures to extract deep CNN features of the face images coming from the test datasets.

As test datasets, we used labeled faces in the wild (LFW) [59], the celebrities in frontal-profile dataset (CFP-FP) [60], cross-age LFW (CALFW) [61], AgeDB [60], and cross-pose LFW (CPLFW) [62]. For evaluation, the standard protocol of unrestricted with labeled outside data [59] is used, and the accuracies are obtained by using 6000 testing image pairs on LFW, CALFW, AgeDB, and CPLFW. For the CFP-FP dataset, the accuracies are obtained by using 7000 pairs of testing images following the standard testing setting. Table V reports the accuracies. As seen in the results, the proposed method using the DAM achieves the best accuracies on four of the five tested datasets. The DSC_RNA method also obtains competitive
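The extra-parameter counts and the relative epoch times quoted above can be checked with plain arithmetic, using the numbers taken directly from the text:

```python
# Extra weights for DSC_DAM relative to the 512-D softmax baseline:
# one 512 -> 12 500 PReLU layer plus a 12 500 -> 12 500 layer, minus the
# removed 512 x 12 000 classifier matrix.
dam_extra = 512 * 12500 + 12500 ** 2 - 512 * 12000
assert dam_extra == 156_506_000

# Extra weights for DSC_RNA: a 25 088 -> 12 500 linear layer replacing the
# original 25 088 -> 512 layer and the 512 x 12 000 classifier.
rna_extra = 25088 * 12500 - (25088 * 512 + 512 * 12000)
assert rna_extra == 294_610_944

# Relative epoch times versus the classical softmax network (8962 s):
slowdown_rna = round(11444 / 8962, 2)   # 1.28
slowdown_dam = round(11137 / 8962, 2)   # 1.24
```

These counts make the trade-off concrete: both variants solve the dimension restriction, but each adds well over a hundred million parameters on top of the baseline network.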
TABLE VI
Identification Accuracies (%) on the IJB-B and IJB-C Benchmarks
accuracies, but its accuracies are lower than those of DSC_DAM. These results verify that the proposed techniques successfully resolve the dimension problem. However, the weight parameters of the networks are greatly increased.

We also conducted identification (recognition) tests on the challenging IJB-B and IJB-C datasets [63]. These datasets present considerable difficulties due to their inclusion of full pose variations and wide-ranging imaging conditions. The IJB-B dataset is characterized by its template-based approach, encompassing 1845 subjects with 11 754 images and 55 025 frames from 7011 videos. Images and videos were sourced from the web, showcasing significant variations in pose, illumination, and image quality, among other factors. The IJB-C dataset serves as an extension of the IJB-B dataset, featuring 3531 unique subjects in unconstrained environments. This mixed media set-based dataset comprises 31 334 still images, averaging approximately six images per subject, and 117 542 video frames, averaging about 33 frames per video. Each subject is represented by a template consisting of multiple images, rendering the set-based face recognition approach ideal for subject identification. These datasets are widely recognized as benchmark datasets for evaluating state-of-the-art face recognition methodologies.

For reporting accuracies, we follow the standard benchmark procedure for IJB-B and IJB-C to evaluate the proposed methods on the "search" protocol for 1:N face identification. Here, the Rank-N classification accuracies are reported for identification, and the classification rate is the percentage of probe searches that correctly find the probe's gallery mate within the top N rank-ordered results of the gallery set. In addition, we also report the true positive identification rate (TPIR) accuracies obtained for different false positive identification rate (FPIR) values. The results are given in Table VI, where the red and blue fonts denote the best and the second best accuracies, respectively. As seen in the table, the proposed method using the DAM module achieves the best accuracies in all metrics on the IJB-C dataset, whereas it obtains the second best accuracies in terms of TPIR on the IJB-B dataset.

IV. SUMMARY AND CONCLUSION

This article proposed a neural network classifier that aims to maximize the margin in both the Euclidean and angular spaces. Specifically, the method generates embeddings such that class-specific samples cluster around the class centers chosen from the vertices of a regular simplex. The technique is particularly straightforward, as it requires fixing only a single parameter for classical closed set recognition settings. Despite its simplicity, the proposed method achieves state-of-the-art accuracies on open-set recognition problems by rejecting samples of unknown classes based on their distances from the class-specific centers. Additionally, the proposed method outperforms other current classification methods in closed set recognition settings, particularly with moderate-sized datasets. Nonetheless, the method exhibits a limitation in learning large-scale datasets, which can be addressed by introducing a DAM or by revising existing deep neural network architectures. The proposed classifier using the DAM achieves state-of-the-art accuracies on face verification problems, but the weight parameters of the deep neural network classifier greatly increase. In summary, the proposed method is an ideal choice for open set recognition and classical classification problems, particularly when the feature dimension is larger than the number of classes, and the proposed classifier is straightforward to use, with a single hyperparameter that requires setting. For large-scale datasets with many classes, the proposed method using the DAM still yields good accuracies, but it increases the complexity of the deep neural network architectures.

APPENDIX
IMPLEMENTATION DETAILS

The learning rate is set to 0.1 for the proposed DSC in open-set recognition experiments. We set
TABLE VII
Classification Accuracies (%) for Different u Values on the Cifar-100 Dataset
[6] H. Cevikalp, B. Uzun, O. Köpüklü, and G. Ozturk, "Deep compact polyhedral conic classifier for open and closed set recognition," Pattern Recognit., vol. 119, Nov. 2021, Art. no. 108080.
[7] C. Qi and F. Su, "Contrastive-center loss for deep neural networks," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2017, pp. 2851–2855.
[8] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 815–823.
[9] E. Hoffer and N. Ailon, "Deep metric learning using triplet network," in Proc. Int. Conf. Learn. Represent. (ICLR) Workshops, 2015, pp. 84–92.
[10] K. Sohn, "Improved deep metric learning with multi-class N-pair loss objective," in Proc. Neural Inf. Process. Syst. (NIPS), 2016, pp. 1–9.
[11] S. Roy, M. Harandi, R. Nock, and R. Hartley, "Siamese networks: The tale of two manifolds," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 3046–3055.
[12] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, "SphereFace: Deep hypersphere embedding for face recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6738–6746.
[13] W. Liu, Y. Wen, Z. Yu, and M. Yang, "Large-margin softmax loss for convolutional neural networks," in Proc. Int. Conf. Mach. Learn. (ICML), 2016, pp. 1–10.
[14] K. Zhao, J. Xu, and M.-M. Cheng, "RegularFace: Deep face recognition via exclusive regularization," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1136–1144.
[15] H. Wang et al., "CosFace: Large margin cosine loss for deep face recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 5265–5274.
[16] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 4685–4694.
[17] W. Liu et al., "Learning towards minimum hyperspherical energy," in Proc. Neural Inf. Process. Syst. (NeurIPS), 2018, pp. 1–12.
[18] R. Lin et al., "Regularizing neural networks via minimizing hyperspherical energy," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 6916–6925.
[19] W. Liu, R. Lin, Z. Liu, L. Xiong, B. Scholkopf, and A. Weller, "Learning with hyperspherical uniformity," in Proc. Int. Conf. Artif. Intell. Statist. (AISTATS), 2021, pp. 1–13.
[20] Y. Duan, J. Lu, and J. Zhou, "UniformFace: Learning deep equidistributed representation for face recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 3415–3424.
[21] T.-T. Do, T. Tran, I. Reid, V. Kumar, T. Hoang, and G. Carneiro, "A theoretically sound upper bound on the triplet loss for improving the efficiency of deep distance metric learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 10396–10405.
[22] V. Papyan, X. Y. Han, and D. L. Donoho, "Prevalence of neural collapse during the terminal phase of deep learning training," Proc. Nat. Acad. Sci. USA, vol. 117, no. 40, pp. 24652–24663, Oct. 2020.
[23] F. Pernici, M. Bruni, C. Baecchi, and A. D. Bimbo, "Maximally compact and separated features with regular polytope networks," in Proc. CVPR Workshops, 2019, pp. 46–53.
[24] F. Pernici, M. Bruni, C. Baecchi, and A. D. Bimbo, "Regular polytope networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 9, pp. 4373–4387, Sep. 2022.
[25] T. Kasarla, G. J. Burghouts, M. van Spengler, E. van der Pol, R. Cucchiara, and P. Mettes, "Maximum class separation as inductive bias in one matrix," in Proc. Adv. Neural Inf. Process. Syst., 2022, pp. 1–14.
[26] Q. Bytyqi, N. Wolpert, E. Schomer, and U. Schwanecke, "Prototype softmax cross entropy: A new perspective on softmax cross entropy," in Proc. Scandin. Conf. Image Anal., 2023, pp. 16–31.
[27] Y. Yang, S. Chen, X. Li, L. Xie, Z. Lin, and D. Tao, "Inducing neural collapse in imbalanced learning: Do we really need a learnable classifier at the end of deep neural network?" in Proc. Adv. Neural Inf. Process. Syst., 2022, pp. 37991–38002.
[28] L. O. Jimenez and D. A. Landgrebe, "Supervised classification in high-dimensional space: Geometrical, statistical, and asymptotical properties of multivariate data," IEEE Trans. Syst., Man Cybern., C, vol. 28, no. 1, pp. 39–54, Feb. 1998.
[29] P. Hall, J. S. Marron, and A. Neeman, "Geometric representation of high dimension, low sample size data," J. Roy. Stat. Soc. Ser. B, Stat. Methodol., vol. 67, no. 3, pp. 427–444, Jun. 2005.
[30] P. Kumar, L. Niveditha, and B. Ravindran, "Spectral clustering as mapping to a simplex," in Proc. ICML Workshops, 2013, pp. 1–9.
[31] M. Weber, "Clustering by using a simplex structure," ZIB, Berlin, Germany, ZIB-Rep. 04-03, 2003.
[32] A. Rahimi and B. Recht, "Clustering with normalized cuts is clustering with a hyperplane," in Proc. Stat. Learn. Comput. Vis., 2004, pp. 56–69.
[33] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, Aug. 2000.
[34] Z. Zhu et al., "A geometric analysis of neural collapse with unconstrained features," in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 1–15.
[35] W. J. Scheirer, A. Rocha, A. Sapkota, and T. E. Boult, "Towards open set recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, pp. 1757–1772, 2013.
[36] A. R. Dhamija, M. Gunther, and T. E. Boult, "Reducing network agnostophobia," in Proc. Neural Inf. Process. Syst. (NeurIPS), 2018, pp. 1–12.
[37] D. Miller, N. Sünderhauf, M. Milford, and F. Dayoub, "Class anchor clustering: A loss for distance-based open set recognition," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2021, pp. 3569–3577.
[38] H. Cevikalp, B. Uzun, Y. Salk, H. Saribas, and O. Köpüklü, "From anomaly detection to open set recognition: Bridging the gap," Pattern Recognit., vol. 138, Jun. 2023, Art. no. 109385.
[39] C. Geng, S.-J. Huang, and S. Chen, "Recent advances in open set recognition: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 10, pp. 3614–3631, Oct. 2021.
[40] M. Balko, A. Pór, M. Scheucher, K. Swanepoel, and P. Valtr, "Almost-equidistant sets," Graphs Combinatorics, vol. 36, no. 3, pp. 729–754, May 2020.
[41] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1026–1034.
[42] C. Cortes and V. Vapnik, "Support vector networks," Mach. Learn., vol. 20, pp. 273–297, Sep. 1995.
[43] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K. R. Mullers, "Fisher discriminant analysis with kernels," in Proc. Neural Netw. Signal Process., IEEE Signal Process. Soc. Workshop, Aug. 1999, pp. 41–48.
[44] A. Vedaldi and A. Zisserman, "Efficient additive kernels via explicit feature maps," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 3, pp. 480–492, Mar. 2012.
[45] A. Rahimi and B. Recht, "Random features for large-scale kernel machines," in Proc. NIPS, 2007, pp. 1–8.
[46] J. Deng, J. Guo, T. Liu, M. Gong, and S. Zafeiriou, "Sub-center ArcFace: Boosting face recognition by large-scale noisy web faces," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 741–757.
[47] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: A large data set for nonparametric object and scene recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11, pp. 1958–1970, Nov. 2008.
[48] H.-M. Yang, X.-Y. Zhang, F. Yin, Q. Yang, and C.-L. Liu, "Convolutional prototype network for open set recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 5, pp. 2358–2370, May 2022.
[49] R. Yoshihashi, W. Shao, R. Kawakami, S. You, M. Iida, and T. Naemura, "Classification-reconstruction learning for open-set recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 4011–4020.
[50] G. Chen et al., "Learning open set network with discriminative reciprocal points," in Proc. ECCV, 2020, pp. 507–522.
[51] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[52] L. Neal, M. Olson, X. Fern, W.-K. Wong, and F. Li, "Open set learning with counterfactual images," in Proc. ECCV, 2018, pp. 1–16.
[53] P. Oza and V. M. Patel, "C2AE: Class conditioned auto-encoder for open-set recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 2302–2311.
[54] P. Perera et al., "Generative-discriminative feature representations for open-set recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 11811–11820.
[55] A. Bendale and T. E. Boult, "Towards open set deep networks," in Proc. CVPR, 2016, pp. 1563–1572.
[56] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[57] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
[58] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, "MS-Celeb-1M: A dataset and benchmark for large-scale face recognition," in Computer Vision—ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., Cham, Switzerland: Springer, 2016, pp. 87–102.
[59] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," in Proc. Workshop Faces 'Real-Life' Images, Detection, Alignment, Recognit., 2008, pp. 1–15.
[60] S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kotsia, and S. Zafeiriou, "AgeDB: The first manually collected, in-the-wild age database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 1997–2005.
[61] T. Zheng, W. Deng, and J. Hu, "Cross-age LFW: A database for studying cross-age face recognition in unconstrained environments," 2017, arXiv:1708.08197.
[62] T. Zheng and W. Deng, "Cross-pose LFW: A database for studying cross-pose face recognition in unconstrained environments," Dept. Posts Telecommun., Beijing Univ., Beijing, China, Tech. Rep. 18.01, 2018.
[63] B. Maze et al., "IARPA Janus benchmark-C: Face dataset and protocol," in Proc. Int. Conf. Biometrics (ICB), Feb. 2018, pp. 158–165.
[64] F.-J. Chang, A. T. Tran, T. Hassner, I. Masi, R. Nevatia, and G. Medioni, "FacePoseNet: Making a case for landmark-free face alignment," in Proc. IEEE Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2017, pp. 1599–1608.
[65] B.-N. Kang, Y. Kim, and D. Kim, "Pairwise relational networks for face recognition," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 628–645.
[66] B.-N. Kang, Y. Kim, B. Jun, and D. Kim, "Attentional feature-pair relation networks for accurate face recognition," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 5471–5480.
[67] H. Cevikalp and G. G. Dordinejad, "Video based face recognition by using discriminatively learned convex models," Int. J. Comput. Vis., vol. 128, no. 12, pp. 3000–3014, Dec. 2020.

Hasan Saribas received the bachelor's degree in electrical and electronics engineering from Ataturk University, Erzurum, Türkiye, in 2011, the master's degree from the Department of Avionics, Anadolu University, Eskişehir, Türkiye, in 2015, and the Ph.D. degree from the Department of Avionics, Eskişehir Technical University, Eskişehir, in 2020. He is currently employed as a Senior AI Research Engineer with the Huawei Türkiye Research and Development Center, Istanbul, Türkiye. His research interests include recommendation systems, image processing, machine learning, deep learning, and the control of unmanned aerial vehicles.