EPYNET: Efficient Pyramidal Network for Clothing Segmentation
ABSTRACT Soft biometrics traits extracted from a human body, including the type of clothes, hair color, and accessories, are useful information for people tracking and identification. Semantic segmentation of these traits from images is still a challenge for researchers because of the huge variety of clothing styles, layering, shapes, and colors. To tackle these issues, we propose EPYNET, a framework for clothing segmentation. EPYNET is based on the Single Shot MultiBox Detector (SSD) and the Feature Pyramid Network (FPN), with the EfficientNet model as the backbone. The framework also integrates data augmentation methods and noise reduction techniques to increase the segmentation accuracy. We also propose a new dataset, named UTFPR-SBD3, consisting of 4,500 images manually annotated into 18 classes of objects, plus the background. Unlike available public datasets with imbalanced class distributions, UTFPR-SBD3 has at least 100 instances per class to minimize the training difficulty of deep learning models. We also introduce a new measure of dataset imbalance, motivated by the difficulty of comparing different datasets for clothing segmentation. With such a measure, it is possible to detect the influence of the background, of classes with small items, or of classes with a too high or too low number of instances. Experimental results on UTFPR-SBD3 show the effectiveness of EPYNET, which outperforms state-of-the-art methods for clothing segmentation on public datasets. Based on these results, we believe that the proposed approach can be potentially useful for many real-world applications related to soft biometrics, people surveillance, image description, clothes recommendation, and others.
INDEX TERMS Soft biometrics, clothing segmentation, computer vision, deep learning.
187882 VOLUME 8, 2020
A. de Souza Inácio, H. S. Lopes: EPYNET: Efficient Pyramidal Network for Clothing Segmentation
The segmentation of clothes in images, also known as clothing parsing [15], consists of classifying each image pixel with labels specifically related to clothes (and accessories). Currently, it is still a challenging topic and has aroused interest among researchers due to the inexhaustible variety of types, styles, colors, and shapes of clothes [16].

Approaches developed to localize, classify, and segment clothing-related traits are essential to several tasks, including the identification of a person in surveillance videos, the semantic enrichment of image descriptions, and overall fashion analysis [17]. Although many efforts have been made to develop models for the task of clothing segmentation, such approaches appear to be unsatisfactory for real-world applications.

To tackle the clothing segmentation problem in the context of soft biometrics, we propose a new framework named EPYNET. The framework is focused on human attributes, such as skin, hair, type of clothes, and accessories. The proposed approach uses the SSD (Single Shot MultiBox Detector) [18] model to crop a person from the image, and the Feature Pyramid Network (FPN) architecture [19] with the EfficientNet [20] model to perform the segmentation task. Data augmentation techniques and noise reduction were also applied to improve the performance of the method.

It is known that DL methods usually require large amounts of data to be trained. For the specific problem of clothing segmentation, there are some popular datasets, for instance, CFPD [21] and Fashionista [22]. However, many incorrect labels can be found in the CFPD dataset, since annotated images are based on superpixels. Also, the Fashionista dataset is quite small for DL methods, since it contains only 685 annotated images. Therefore, the need for a new, high-quality benchmark dataset led to the creation of a new one, containing 4,500 images manually annotated into 18 classes, besides the background. This paper is an extension of a preliminary work presented in [23] and differs from it in four points: (i) we propose a new dataset, called UTFPR-SBD3,1 with improved quality over existing datasets, focused on the soft biometrics context; (ii) a new measure, named Instances-Pixels Balance Index (IPBI), is proposed to compare the balance of different datasets in terms of pixels and instances; (iii) we propose a novel framework based on FPN and EfficientNet, which can extract high-quality features at different spatial resolutions; (iv) extensive experiments and comparisons were done with public benchmarks for clothing segmentation. To the best of our knowledge, all the mentioned contributions are new and have not been published elsewhere.

The remainder of this paper is organized as follows: Section II presents a brief description of related works. In Section III, we present a new dataset, named UTFPR-SBD3, for clothing segmentation, a new measure of instances and pixels balance for image segmentation datasets, and the proposed framework for clothing segmentation. The experimental results and their discussion are shown in Section IV. Finally, general conclusions and suggestions for future research directions are presented in Section V.

1 The dataset is available at [Link] [Link]

II. RELATED WORKS
DL methods have been utilized to address many problems in the surveillance environment, including person re-identification [24], face recognition [9], anomaly detection [25], multi-view analysis [26], and traffic monitoring [27].

In the soft biometrics area, several approaches have been proposed to localize, segment, and classify clothes in digital images for many application scenarios, including outfit retrieval, fashion analysis, surveillance, and human identification.

In [12], the authors proposed an approach to classify soft biometrics traits using Convolutional Neural Networks (CNN). The authors used independent classifiers to detect the gender (male or female), upper clothes (short or long sleeves), and lower clothes (shorts or pants) of a person. Although their approach achieved a remarkable generalization capability of the model, they reported difficulties in finding appropriate image datasets regarding size, quality, and variability.

A method for clothing segmentation using an adaptation of the U-Net architecture was presented in [28]. This model was adapted to accommodate multi-class segmentation and was trained with the Clothing Co-Parsing (CCP) dataset [29]. Due to the significant similarity between classes with few instances, 58 different classes were grouped into 14. The authors concluded that the U-Net model could be a reliable way to perform the segmentation task. The authors also pointed out the lack of large datasets annotated at the pixel level.

An approach based on a Fully-Convolutional Network (FCN) for the clothing segmentation task was proposed by [30]. The proposed architecture extends the FCN with a side path called "outfit encoder" to filter inappropriate clothing combinations from the segmentation, and a post-processing step using a Conditional Random Field (CRF) to assign a visually consistent set of clothing labels. These authors introduced a refined annotation of the Fashionista dataset with 25 classes to study the influence of erroneous annotations and ambiguous classes on performance metrics.

[14] also proposed an FCN-based approach to compute the color and the class of pixels. First, a Faster R-CNN (Region-based CNN) model is used to detect and crop people in the image. Then, the cropped image of a person is used to feed the FCN, which computes color and class feature maps. Finally, a logistic pooling layer combines these features, and then the color and class are predicted.

More recently, [31] proposed the superpixels features extractor network (SP-FEN). The proposed model is based on the FCN, with the introduction of a superpixel encoder as a side network that feeds the extracted features into the main segmentation pipeline. Data augmentation techniques such as flip, rotation, and deformation were used during the training step to improve generalization performance.
III. METHODS
This section first presents the proposed dataset for clothing segmentation, named UTFPR-SBD3, and a dataset comparison method. In the sequence, the proposed method for clothing segmentation, called EPYNET, is presented. Finally, the evaluation procedure of the method is shown.

FIGURE 1. Examples of annotation errors in the CFPD dataset. Figures on the left side present annotations with noise. The right side shows images with misannotations and an image with two similar subjects.
A. UTFPR-SBD3 DATASET
Previous works in the literature have proposed datasets for clothing segmentation and classification. They vary in the number of images and clothing categories. Also, some include non-fashion classes, such as hair, skin, face, background, etc. To date, the most popular datasets are presented in Table 1.

TABLE 1. Comparison of popular datasets for clothing segmentation and the new UTFPR-SBD3.

The Clothing Co-Parsing (CCP) dataset contains 2,098 high-resolution fashion photos, of which only 1,004 were annotated at the pixel level. It is a highly imbalanced dataset, including classes without instances and many ambiguous classes (e.g., it has about 10 different types of footwear).

The Colorful Fashion Parsing (CFPD) dataset contains 2,682 images annotated with classes (23 different types, including the background) and colors (13 types). The dataset was annotated using a superpixel method, and it has a significant amount of noise, as shown in Figure 1(b). It is known that the quality of training data in DL methods is essential for achieving a reasonable accuracy of the model. These observed noises, associated with high class imbalance, can drastically affect the segmentation accuracy. Figures 1(a) and 1(c) show the categories belts and sunglasses with partial annotation. Other annotation problems in the CFPD dataset are: items without annotations (e.g., the footwear class in Figures 1(c) and 1(e)), incorrect annotations (e.g., a coat annotated as a skirt in Figure 1(b) and as a scarf in Figure 1(e)), and images with two or more subjects and only one annotation, as depicted in Figure 1(f).

Another dataset used for clothing segmentation is the Fashionista dataset. It consists of 685 images with pixel-level annotations in 56 different classes. To overcome the ambiguous categories of clothes presented in the dataset, [22] proposed the Refined Fashionista dataset by merging some categories (e.g., blazer and jacket). However, both datasets are still small and have imbalanced classes.

ModaNet contains 55,176 street fashion photos annotated with polygons in 13 classes. Despite being the largest dataset in the number of images, it was not considered in this study because it has only 13 classes, aimed mainly at real-world commercial applications. Moreover, it does not include classes useful for soft biometrics contexts, such as skin, hair, glasses, socks, or neck adornments.

In addition to the above-mentioned problems of the datasets, such as highly imbalanced classes, ambiguous labels, and wrong annotations, in the context of soft biometrics a dataset with human attributes, including hair and skin, is strongly desirable, so one could distinguish two individuals.
FIGURE 3. Overview of the proposed approach. Given an input image, the pre-processing step detects and crops the person found in the image.
Then, the EPYNET performs the segmentation task. Finally, in the post-processing step, noise is removed, and the final predicted label is presented.
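The three steps in Figure 3 can be sketched in a few lines. This is an illustrative numpy sketch, not the authors' implementation: `detect_person` stands in for the SSD detector, `segment` for the FPN/EfficientNet model, and the post-processing noise removal is shown here as a simple majority (mode) filter over the predicted label map, one common choice for removing isolated misclassified pixels.

```python
import numpy as np

def mode_filter(labels, size=3):
    """Replace each pixel by the most frequent label in its
    size x size neighborhood -- a simple noise-removal step."""
    h, w = labels.shape
    pad = size // 2
    padded = np.pad(labels, pad, mode="edge")
    out = np.empty_like(labels)
    for i in range(h):
        for j in range(w):
            window = padded[i:i + size, j:j + size].ravel()
            out[i, j] = np.bincount(window).argmax()
    return out

def run_pipeline(image, detect_person, segment):
    """Pre-process (detect and crop), segment, post-process."""
    x0, y0, x1, y1 = detect_person(image)   # SSD-style bounding box
    crop = image[y0:y1, x0:x1]              # step 1: crop the person
    labels = segment(crop)                  # step 2: per-pixel class labels
    return mode_filter(labels)              # step 3: remove isolated noise
```

In a real pipeline the crop would also be resized to the segmentation model's input resolution; that detail is omitted here.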
As a consequence, it is a difficult task to compare the popular datasets presented before with the proposed UTFPR-SBD3. Any measure of comparison should take into account both the distribution of instances over classes and the number of pixels per instance. To meet such requirements, in this work we propose the Instances-Pixels Balance Index (IPBI) to compare the joint balance of instances and pixels of different datasets.

The IPBI is based on the concept of entropy, a common measure used in many fields of science, for which there are several definitions, depending upon the area. In a general sense, it measures the amount of disorder of a system. As mentioned before, for the sake of this work, the ideal dataset should have the same number of instances per class, as well as the same number of pixels in all classes.

For a dataset with c classes (labels), such that the i-th class has s_i instances (samples), and a total of k instances, the Shannon entropy is given by:

H_I = − Σ_{i=1}^{c} (s_i / k) log(s_i / k)    (1)

If the number of instances is exactly the same for all classes, Equation 1 reduces to log c; thus, normalizing H_I makes sense only for c > 1. Therefore, if H_I is divided by log c, it becomes normalized in the range [0..1], and we obtain a measure of the instances balance in the dataset, as follows:

B_I = H_I / log c    (2)

Similar reasoning can be done considering the number of pixels of all samples in a class, so that we can obtain the pixels balance measure for the dataset, B_P. Consequently, to meet the above-mentioned definition of the (hypothetical) ideal dataset, B_I = B_P = 1. Since both B_I and B_P should be maximized, one could interpret them in the Cartesian plane, such that the farther from the origin, the better. Therefore, the Instances-Pixels Balance Index is defined as:

IPBI = √(B_I² + B_P²)    (3)

B. THE PROPOSED METHOD
An overview of the proposed approach is shown in Figure 3, and it includes three steps: (1) Pre-processing, (2) Segmentation using EPYNET, and (3) Post-processing.
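To make the IPBI concrete, Equations 1–3 amount to just a few lines of code given the per-class instance and pixel counts; a minimal numpy sketch (illustrative, not the authors' implementation):

```python
import numpy as np

def balance(counts):
    """Normalized Shannon entropy (Eqs. 1-2) of a count vector."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]                       # 0 * log 0 is taken as 0
    h = -np.sum(p * np.log(p))         # Eq. 1
    return h / np.log(len(counts))     # Eq. 2, normalized to [0, 1]

def ipbi(instances, pixels):
    """Instances-Pixels Balance Index (Eq. 3)."""
    b_i = balance(instances)           # balance over instance counts
    b_p = balance(pixels)              # balance over pixel counts
    return np.sqrt(b_i ** 2 + b_p ** 2)
```

For a perfectly balanced dataset B_I = B_P = 1, so the maximum IPBI is √2 ≈ 1.414; any imbalance in either distribution lowers the index.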
IV. EXPERIMENTAL RESULTS AND DISCUSSION
This Section presents the experimental results obtained by the segmentation using the methods previously described in Section III. First, we describe the implementation details, the data augmentation techniques used during the training step, and the quantitative and qualitative results of EPYNET on the UTFPR-SBD3 dataset. Then, a dataset comparison using the proposed IPBI measure is presented. Finally, the proposed approach is compared with other state-of-the-art approaches on the CFPD, Fashionista, and Refined Fashionista datasets.

TABLE 3. Per-class segmentation performance obtained by the proposed model over the UTFPR-SBD3 dataset, with and without (w/o) including the Background class.
A. IMPLEMENTATION DETAILS
Experiments were performed on a workstation with an Intel Core i7-8700 processor, 32 GB of RAM, and an NVIDIA Titan-Xp GPU. The TensorFlow and Keras libraries were used to train and test the proposed model.
We trained the EPYNET model with the UTFPR-SBD3 dataset. We used the RMSProp optimizer with default parameters and a learning rate of 0.001. During the training process, the learning rate was reduced by a factor of 0.1 whenever the evaluation metric did not improve for 5 epochs.
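This schedule (learning rate reduced by a factor of 0.1 after 5 epochs without improvement, at most 100 epochs, and early stopping after 10 stagnant epochs) boils down to the following control logic. This is a pure-Python sketch of the policy only, where `train_one_epoch` is a hypothetical stand-in that trains for one epoch and returns the validation metric being monitored (lower is better):

```python
def fit(train_one_epoch, max_epochs=100, lr=0.001,
        reduce_patience=5, stop_patience=10, factor=0.1):
    """Plateau-based LR reduction plus early stopping (policy sketch)."""
    best, stagnant = float("inf"), 0
    for epoch in range(max_epochs):
        metric = train_one_epoch(lr)       # e.g. validation loss
        if metric < best:
            best, stagnant = metric, 0     # improvement: reset counter
        else:
            stagnant += 1
            if stagnant % reduce_patience == 0:
                lr *= factor               # reduce LR by a factor of 0.1
            if stagnant >= stop_patience:
                break                      # early stopping
    return best, epoch, lr
```

In Keras this policy corresponds to the `ReduceLROnPlateau` and `EarlyStopping` callbacks with the patience values above.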
The maximum number of training epochs was set to 100; however, training was stopped earlier if the evaluation measure stagnated for 10 consecutive epochs.

The generalization performance of the trained model was assessed by means of the 10-fold cross-validation procedure, as usual in the literature [22], [29]. To ensure the same class distribution in each generated subset of the cross-validation, we used the stratified sampling method proposed by [38]. Therefore, this procedure guarantees a less optimistic generalization estimate of the model.

By the end of each fold, the best model was used to predict the samples in the test set, and the performance was computed using the measures presented in Section III-C.

It is well-known that DL methods require massive volumes of data for training, and data augmentation techniques are frequently used to overcome the lack of large and high-quality datasets [39]. The basic idea is to make slight random changes in the input images to create more variety in the training data. This procedure is known to add robustness to the trained models, since it increases their generalization capability on unseen images (test dataset). Among the many methods that can be applied for image data augmentation [40], we used flip, rotation, and random crop. There are two strategies for data augmentation: offline or online. In the first approach, the data augmentation methods are applied to the original training dataset to create a much larger dataset and, then, the augmented dataset is used for further training of the model. In the other approach, the data augmentation methods are randomly applied each time an image is presented to the model in the training step. We use the online approach, as it requires less storage for the images, at the expense of some extra processing.

B. QUANTITATIVE RESULTS
The quantitative evaluation, presented in Table 3, shows the Precision, Recall, and F1-Score obtained with the proposed approach on the UTFPR-SBD3 dataset. When the Background is included as a class, our approach obtained 81.3%, 76.4%, and 78.3%, respectively. On the other hand, without the Background, these values decrease slightly, suggesting that the inclusion of the Background among the classes of a clothing segmentation problem may lead to distorted results.

Considering the F1-Score, the best results were found for the following classes: Background, Skin, Pants, and Hair. On the other hand, the poorest results were those for classes related to small items: Belt, Sweater, Neckwear, Socks, Eyewear, and Headwear. These items are those with the smallest number of pixels among all classes. Moreover, Neckwear and Headwear are accessories that cover different styles and shapes (e.g., the Neckwear class includes bowties, neckties, and scarves).

We also evaluated the overall segmentation performance of the model using the IoU measure, as shown in Figure 4. The average IoU was 65.6%. According to [41], predictions with an intersection over union above 50% are considered satisfactory. Therefore, for most classes, the results indicate that the proposed approach is efficient for the clothing segmentation task.

Results were not satisfactory for only three classes: Belt, Sweater, and Neckwear. Although the class Belt is present in approximately 30% of the dataset regarding the number of instances, it is a small object and occupies less than 1% of the dataset considering the number of pixels. Also, Neckwear and Sweater are among the classes with the lowest occurrence in the dataset.
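All of the per-class measures used in this section (Precision, Recall, F1-Score, and IoU) can be derived from a single pixel-level confusion matrix; a minimal numpy sketch (illustrative, not the authors' evaluation code):

```python
import numpy as np

def confusion(y_true, y_pred, n_classes):
    """Pixel-level confusion matrix: rows = ground truth, cols = prediction."""
    idx = y_true.ravel() * n_classes + y_pred.ravel()
    return np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)

def per_class_scores(cm):
    """Per-class IoU and F1 from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp           # predicted as c but actually another class
    fn = cm.sum(axis=1) - tp           # truly c but predicted as another class
    iou = tp / np.maximum(tp + fp + fn, 1)         # TP / (TP + FP + FN)
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1)  # harmonic mean of P and R
    return iou, f1
```

Averaging `iou` over the class axis gives the mean IoU reported in the text; excluding row and column 0 reproduces the "without Background" variant of Table 3.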
FIGURE 4. IoU scores for each class.

C. QUALITATIVE RESULTS
In this Section, we present a visual evaluation of the outputs predicted by the proposed approach. Figure 7(a) shows some sample images of the test set with both the ground truth and the output provided by the EPYNET.
The trained model was able to satisfactorily segment different types of attributes, confirming that our approach is robust enough for this task. Notwithstanding, in specific cases, poor segmentation results were also produced by our model, as shown in Figure 7(b). Notice that similar classes may cause misclassifications. For instance, Stocking was occasionally confused with Pants, Skirt was occasionally confused with Dress, and Sweater may be confused with Shirt or Coat.

As mentioned in Section IV-B, for the Sweater class, segmentation results were not satisfactory. This class consists of a piece of clothing, made of knitted or crocheted material, that covers the top part of the body. It can be closed, also called a pullover, or open, usually called a cardigan. Sweaters can have different shapes and styles and, due to such variability, this class was frequently predicted as Shirt or Coat. This may suggest that this category would be better divided into two or more classes (see Figure 6).
FIGURE 7. Segmentation results on the test set based on: a) highest and b) lowest IoU scores. This figure shows the input image, the ground truth, and the predicted labels, from left to right, respectively.
… to the fusion of classes with a small number of instances into larger ones.

E. COMPARISON WITH THE STATE-OF-THE-ART
The proposed approach was compared with the current state-of-the-art methods on the CFPD, Fashionista, and Refined Fashionista datasets. For a fair comparison, we use the same measures (Acc and IoU) reported by [30], available on GitHub.3 Also, since the previous works did not exclude the Background class in the evaluation, the cropping step, described in Section III-B1, was not performed, and the entire image was used as input during the training step.

3 [Link]

The Fashionista dataset was divided into training and test sets, as described in previous works [22], [30], [43], with 10% of the training images left out for validation. On the other hand, the CFPD dataset was randomly divided into 78% for training, 2% for validation, and 20% for testing. Table 5 shows that, for all datasets, EPYNET achieved better results than the other methods in the literature.

Notice that our approach surpasses [32] by 9% IoU on the CFPD dataset, and by 1% on the Refined Fashionista dataset. On the Fashionista dataset, EPYNET also outperforms [44] by 11%. Despite the annotation issues reported in Section III-A, we achieved better results for accuracy on the CFPD and Fashionista datasets, and competitive results on the Refined Fashionista dataset.

V. CONCLUSION
Image segmentation has been one of the most challenging problems in computer vision, and it can be used to improve applications in many areas, including security and surveillance. Recently, soft biometrics traits, including types of clothes, have shown promising results in people re-identification. However, it is still an open problem because of the wide variety of types, shapes, styles, and colors of clothes.

Although semantic segmentation using Deep Learning algorithms has achieved great success in many research fields, it is still difficult for computers to understand and describe a scene as humans naturally do.

As discussed in Section III-A1, the segmentation problem faced in this work is naturally unbalanced. The presence of classes with objects with a small number of pixels (e.g., belts, ties, or sunglasses), or that occur in all images (e.g., skin and hair), makes it impossible to achieve a perfect balance. An imbalanced dataset, whether in terms of instances or pixels, can negatively influence the performance of segmentation methods.

This work has three contributions. First, motivated by the need for a large, high-quality dataset with pixel-level annotations, we created a new dataset, named UTFPR-SBD3. It was designed to overcome the annotation problems frequently found in other popular datasets, and to provide the best possible balance over classes at the instances and pixels levels. Second, due to the difficulty of comparing datasets for clothing segmentation, a new measure of dataset imbalance was introduced: IPBI. With such a measure, it is possible to evaluate the influence of the background, of classes with small items, or of classes with a too high or too low number of instances. The third, and most important, contribution is EPYNET. This framework is based on the pyramidal architecture of FPN and the EfficientNet model, and it is aimed at clothing semantic segmentation in the context of soft biometrics. We presented an extensive comparison of EPYNET with other approaches using several popular datasets, and it outperformed the state-of-the-art methods. Both the quantitative and qualitative results presented show the effectiveness of EPYNET. Based on these results, we believe that the proposed approach can be potentially useful for many real-world applications related to soft biometrics, people surveillance, image description, clothes recommendation, people re-identification, and others.

Despite the good results achieved by EPYNET, we observed that other factors could influence the segmentation task, such as the environment illumination and the quality of the image. Besides, occlusions and similar classes in the dataset can degrade the predicted results. Future work will include improvements in the method to better handle illumination changes, and to enhance the discrimination between similar objects.

ACKNOWLEDGMENT
The authors would like to thank NVIDIA Corporation for the donation of the Titan-Xp GPU board used in this work.

REFERENCES
[1] M. Romero, M. Gutoski, L. T. Hattori, M. Ribeiro, and H. S. Lopes, "Soft biometrics classification in videos using transfer learning and bidirectional long short-term memory networks," Learn. Nonlinear Models, vol. 18, no. 1, pp. 47–59, Sep. 2020.
[2] A. Abdelwhab and S. Viriri, "A survey on soft biometrics for human identification," in Machine Learning and Biometrics, J. Yang, D. S. Park, S. Yoon, Y. Chen, and C. Zhang, Eds. Rijeka, Croatia: IntechOpen, 2018, ch. 3.
[3] X. Zhao, Y. Chen, E. Blasch, L. Zhang, and G. Chen, "Face recognition in low-resolution surveillance video streams," Proc. SPIE, vol. 11017, pp. 147–159, Jul. 2019.
[4] O. A. Arigbabu, S. M. S. Ahmad, W. A. W. Adnan, and S. Yussof, "Integration of multiple soft biometrics for human identification," Pattern Recognit. Lett., vol. 68, pp. 278–287, Dec. 2015.
[5] D. A. Reid, S. Samangooei, C. Chen, M. S. Nixon, and A. Ross, "Soft biometrics for surveillance: An overview," in Handbook of Statistics, vol. 31, C. Rao and V. Govindaraju, Eds. Amsterdam, The Netherlands: Elsevier, 2013, pp. 327–352.
[6] A. Dantcheva, C. Velardo, A. D'Angelo, and J.-L. Dugelay, "Bag of soft biometrics for person identification," Multimedia Tools Appl., vol. 51, no. 2, pp. 739–777, Jan. 2011.
[7] R. Vera-Rodriguez, P. Marin-Belinchon, E. Gonzalez-Sosa, P. Tome, and J. Ortega-Garcia, "Exploring automatic extraction of body-based soft biometrics," in Proc. Int. Carnahan Conf. Secur. Technol. (ICCST), Oct. 2017, pp. 1–6.
[8] E. Gonzalez-Sosa, J. Fierrez, R. Vera-Rodriguez, and F. Alonso-Fernandez, "Facial soft biometrics for recognition in the wild: Recent works, annotation, and COTS evaluation," IEEE Trans. Inf. Forensics Security, vol. 13, no. 8, pp. 2001–2014, Aug. 2018.
[9] S. Bashbaghi, E. Granger, R. Sabourin, and M. Parchami, Deep Learning Architectures for Face Recognition in Video Surveillance. Singapore: Springer, 2019, pp. 133–154.
[10] X. Di and V. M. Patel, "Deep learning for tattoo recognition," in Deep Learning for Biometrics. Cham, Switzerland: Springer, 2017, pp. 241–256.
[11] E. R. H. P. Isaac, S. Elias, S. Rajagopalan, and K. S. Easwarakumar, "Multiview gait-based gender classification through pose-based voting," Pattern Recognit. Lett., vol. 126, pp. 41–50, Sep. 2019.
[12] H. A. Perlin and H. S. Lopes, "Extracting human attributes using a convolutional neural network approach," Pattern Recognit. Lett., vol. 68, pp. 250–259, Dec. 2015.
[13] K. M. A. Raihan, M. Khaliluzzaman, and S. M. Rezvi, "Recognition of pedestrian clothing attributes from far view images using convolutional neural network," in Proc. 10th Int. Conf. Comput., Commun. Netw. Technol. (ICCCNT), Jul. 2019, pp. 1–7.
[14] Z. Chen, S. Liu, Y. Zhai, J. Lin, X. Cao, and L. Yang, "Human parsing by weak structural label," Multimedia Tools Appl., vol. 77, no. 15, pp. 19795–19809, Aug. 2018.
[15] C.-H. Yoo, Y.-G. Shin, S.-W. Kim, and S.-J. Ko, "Context-aware encoding for clothing parsing," Electron. Lett., vol. 55, no. 12, pp. 692–693, Jun. 2019.
[16] W. Ji, X. Li, F. Wu, Z. Pan, and Y. Zhuang, "Human-centric clothing segmentation via deformable semantic locality-preserving network," IEEE Trans. Circuits Syst. Video Technol., early access, Dec. 25, 2019, doi: 10.1109/TCSVT.2019.2962216.
[17] E. S. Jaha and M. S. Nixon, "Soft biometrics for subject identification using clothing attributes," in Proc. IEEE Int. Joint Conf. Biometrics, Piscataway, NJ, USA, Sep. 2014, pp. 1–6.
[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Proc. 14th Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 21–37.
[19] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2117–2125.
[20] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in Proc. 36th Int. Conf. Mach. Learn., vol. 97, 2019, pp. 6105–6114.
[21] S. Liu, J. Feng, C. Domokos, H. Xu, J. Huang, Z. Hu, and S. Yan, "Fashion parsing with weak color-category labels," IEEE Trans. Multimedia, vol. 16, no. 1, pp. 253–265, Jan. 2014.
[22] K. Yamaguchi, M. H. Kiapour, L. E. Ortiz, and T. L. Berg, "Parsing clothing in fashion photographs," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3570–3577.
[23] A. D. S. Inácio, A. Brilhador, and H. S. Lopes, "Semantic segmentation of clothes in the context of soft biometrics using deep learning methods," in Proc. 14th Brazilian Congr. Comput. Intell., Nov. 2020, pp. 1–7.
[24] A. Li, L. Liu, and S. Yan, Person Re-Identification by Attribute-Assisted Clothes Appearance. London, U.K.: Springer, 2014, pp. 119–138.
[25] M. Ribeiro, M. Gutoski, A. E. Lazzaretti, and H. S. Lopes, "One-class classification in images and videos using a convolutional autoencoder with compact embedding," IEEE Access, vol. 8, pp. 86520–86535, 2020.
[26] P. Hu, D. Peng, Y. Sang, and Y. Xiang, "Multi-view linear discriminant analysis network," IEEE Trans. Image Process., vol. 28, no. 11, pp. 5352–5365, Nov. 2019.
[27] J.-S. Zhang, J. Cao, and B. Mao, "Application of deep learning and unmanned aerial vehicle technology in traffic flow monitoring," in Proc. Int. Conf. Mach. Learn. Cybern. (ICMLC), Jul. 2017, pp. 189–194.
[28] T. Hrkac, K. Brkic, and Z. Kalafatic, "Multi-class U-Net for segmentation of non-biometric identifiers," in Proc. 19th Irish Mach. Vis. Image Process. Conf., 2017, pp. 131–138.
[29] W. Yang, P. Luo, and L. Lin, "Clothing co-parsing by joint image segmentation and labeling," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Piscataway, NJ, USA, Jun. 2014, pp. 3182–3189.
[30] P. Tangseng, Z. Wu, and K. Yamaguchi, "Looking at outfit to parse clothing," 2017, arXiv:1703.01386. [Online]. Available: [Link]
[31] A. M. Ihsan, C. K. Loo, S. A. Naji, and M. Seera, "Superpixels features extractor network (SP-FEN) for clothing parsing enhancement," Neural Process. Lett., vol. 51, pp. 1–19, Jan. 2020.
[32] J. Martinsson and O. Mogren, "Semantic segmentation of fashion images using feature pyramid networks," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshop (ICCVW), Oct. 2019, pp. 3133–3136.
[33] S. Zheng, F. Yang, M. H. Kiapour, and R. Piramuthu, "ModaNet: A large-scale street fashion dataset with polygon annotations," in Proc. 26th ACM Int. Conf. Multimedia, New York, NY, USA, 2018, pp. 1670–1678.
[34] S. Seferbekov, V. Iglovikov, A. Buslaev, and A. Shvets, "Feature pyramid network for multi-class land segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2018, pp. 272–275.
[35] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4510–4520.
[36] F. Lateef and Y. Ruichek, "Survey on semantic segmentation using deep learning techniques," Neurocomputing, vol. 338, pp. 321–348, Apr. 2019.
[37] G. E. A. P. A. Batista, A. C. P. L. F. Carvalho, and M. C. Monard, "Applying one-sided selection to unbalanced datasets," in Proc. Mex. Int. Conf. Artif. Intell. Berlin, Germany: Springer, 2000, pp. 315–325.
[38] K. Sechidis, G. Tsoumakas, and I. Vlahavas, "On the stratification of multi-label data," in Proc. Eur. Conf. Mach. Learn. Knowl. Discovery Databases. Berlin, Germany: Springer, 2011, pp. 145–158.
[39] N. Aquino, M. Gutoski, L. Hattori, and H. Lopes, "The effect of data augmentation on the performance of convolutional neural networks," in Proc. 13th Brazilian Conf. Comput. Intell. (SBIC/ABRICOM), 2017, pp. 1–12.
[40] C. Shorten and T. M. Khoshgoftaar, "A survey on image data augmentation for deep learning," J. Big Data, vol. 6, no. 60, pp. 1–48, 2019.
[41] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes challenge: A retrospective," Int. J. Comput. Vis., vol. 111, no. 1, pp. 98–136, Jan. 2015.
[42] Y. Ge, R. Zhang, X. Wang, X. Tang, and P. Luo, "DeepFashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5332–5340.
[43] W. Ji, X. Li, Y. Zhuang, O. E. F. Bourahla, Y. Ji, S. Li, and J. Cui, "Semantic locality-aware deformable network for clothing segmentation," in Proc. 27th Int. Joint Conf. Artif. Intell., Jul. 2018, pp. 764–770.
[44] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3431–3440.
[45] E. Simo-Serra, S. Fidler, F. Moreno-Noguer, and R. Urtasun, "A high performance CRF model for clothes parsing," in Proc. Asian Conf. Comput. Vis. (ACCV). Cham, Switzerland: Springer, 2015, pp. 64–81.

ANDREI DE SOUZA INÁCIO received the [Link]. and [Link]. degrees in computer science from the Federal University of Santa Catarina (UFSC), in 2013 and 2016, respectively. He is currently pursuing the Ph.D. degree in electrical and computer engineering with the Federal University of Technology – Paraná, Brazil. Since 2014, he has been a Lecturer with the Federal Institute of Santa Catarina (IFSC). He has professional experience in information systems design, Web development, and IT project management. His research interests include, but are not limited to, computer vision, machine learning, and data mining.

HEITOR SILVÉRIO LOPES received the [Link]. and [Link]. degrees in electrical engineering from the Federal University of Technology – Paraná (UTFPR), Curitiba, in 1984 and 1990, respectively, and the Ph.D. degree from the Federal University of Santa Catarina, in 1996. Since 2003, he has been a Research Fellow with the Brazilian National Research Council, in the area of computer science. In 2014, he spent a sabbatical year at the Department of Electrical Engineering and Computer Science, The University of Tennessee, Knoxville, USA. He is currently a Tenured Full Professor with the Department of Electronics and the Graduate Program on Electrical Engineering and Applied Computer Science (CPGEI), UTFPR. He is also the Founder and the Current Head of the Bioinformatics and Computational Intelligence Laboratory (LABIC). His research interests include computer vision, deep learning, evolutionary computation, and data mining.