Received October 1, 2020, accepted October 10, 2020, date of publication October 13, 2020, date of current version October 26, 2020.

Digital Object Identifier 10.1109/ACCESS.2020.3030859

EPYNET: Efficient Pyramidal Network for Clothing Segmentation
ANDREI DE SOUZA INÁCIO 1,2 AND HEITOR SILVÉRIO LOPES1
1 Graduate Program in Electrical Engineering and Industrial Informatics, Federal University of Technology – Paraná, Curitiba 80230-901, Brazil
2 Federal Institute of Santa Catarina, Gaspar 89111-009, Brazil

Corresponding author: Andrei de Souza Inácio ([Link]@[Link])


The work of Heitor Silvério Lopes was supported in part by Conselho Nacional de Desenvolvimento Científico e Tecnológico
(CNPq) under Grants 311785/2019-0, 311778/2016-0, and 423872/2016-8; and in part by the Fundação Araucária under
Grant PRONEX 042/2018.

ABSTRACT Soft biometrics traits extracted from a human body, including the type of clothes, hair color, and accessories, are useful information for people tracking and identification. Semantic segmentation of these traits from images is still a challenge for researchers because of the huge variety of clothing styles, layering, shapes, and colors. To tackle these issues, we propose EPYNET, a framework for clothing segmentation. EPYNET is based on the Single Shot MultiBox Detector (SSD) and the Feature Pyramid Network (FPN) with the EfficientNet model as the backbone. The framework also integrates data augmentation methods and noise reduction techniques to increase the accuracy of the segmentation. We also propose a new dataset named UTFPR-SBD3, consisting of 4,500 images manually annotated into 18 classes of objects, plus the background. Unlike available public datasets with imbalanced class distributions, UTFPR-SBD3 has at least 100 instances per class to minimize the training difficulty of deep learning models. We introduce a new measure of dataset imbalance, motivated by the difficulty of comparing different datasets for clothing segmentation. With such a measure, it is possible to detect the influence of the background, of classes with small items, or of classes with a too high or too low number of instances. Experimental results on UTFPR-SBD3 show the effectiveness of EPYNET, outperforming state-of-the-art methods for clothing segmentation on public datasets. Based on these results, we believe that the proposed approach can be potentially useful for many real-world applications related to soft biometrics, people surveillance, image description, clothes recommendation, and others.

INDEX TERMS Soft biometrics, clothing segmentation, computer vision, deep learning.

I. INTRODUCTION
Soft biometrics is an emerging area of research, mainly due to its extensive applicability in human identification and surveillance. Human attributes such as gender, age, hairstyle, tattoos, as well as type and color of clothes, are examples of soft biometrics traits that can be successfully used for distinguishing different people [1].

In the last years, physiological and behavioral human characteristics have been used for the human identification task [2]. However, identifying human beings accurately in a real visual surveillance system is still challenging, mainly due to the low resolution of cameras, poor illumination, and the diversity of camera angles [3].

Unlike traditional biometrics approaches, such as fingerprint and iris, soft biometrics extraction is non-invasive and requires no contact. Therefore, it can be done at a distance and, more importantly, without the explicit cooperation of the targets [4].

Many soft biometrics approaches have explored hand-crafted features on datasets collected under well-constrained environments [5], [6]. In recent years, Deep Learning (DL) methods have been successfully used to learn compact and discriminative features for many complex problems. In the soft biometrics area, DL methods have been used for a wide variety of classification problems, such as body-based features (height, shoulder width, hips width, arms length, body complexion, and hair color) [7], facial traits (gender, ethnicity, age, glasses, beard, and mustache) [8], [9], tattoos [10], gait-based gender [11], and clothes [12]–[14].

The associate editor coordinating the review of this manuscript and approving it for publication was Xi Peng.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see [Link]
187882 VOLUME 8, 2020
The segmentation of clothes in images, also known as clothing parsing [15], consists of classifying each image pixel with labels specifically related to clothes (and accessories). Currently, it is still a challenging topic and has aroused interest among researchers due to the inexhaustible variety of types, styles, colors, and shapes of clothes [16].

Approaches developed to localize, classify, and segment clothing-related traits are essential to several tasks, including the identification of a person in surveillance videos, semantic enrichment of image description, and overall fashion analysis [17]. Although many efforts have been made to develop models to perform the task of clothing segmentation, such approaches appear to be unsatisfactory for real-world applications.

For tackling the clothing segmentation problem in the context of soft biometrics, we propose a new framework named EPYNET. The framework is focused on human attributes, such as skin, hair, type of clothes, and accessories. The proposed approach uses the SSD (Single Shot MultiBox Detector) [18] model to crop a person from the image, and the Feature Pyramid Network (FPN) architecture [19] with the EfficientNet [20] model to perform the segmentation task. Data augmentation techniques and noise reduction were also applied to improve the performance of the method.

It is known that DL methods usually require large amounts of data to be trained. For the specific problem of clothing segmentation, there are some popular datasets, for instance, CFPD [21] and Fashionista [22]. However, many incorrect labels can be found in the CFPD dataset, since annotated images are based on superpixels. Also, the Fashionista dataset is quite small for DL methods, since it contains only 685 annotated images. Therefore, the need for a new, high-quality benchmark dataset led to the creation of a new one, containing 4,500 images manually annotated into 18 classes, besides the background. This paper is an extension of a preliminary work presented in [23] and differs from it in four points: (i) we propose a new dataset, called UTFPR-SBD3,1 with improved quality over existing datasets, focused on the soft biometrics context; (ii) a new measure, named Instances-Pixels Balance Index (IPBI), is proposed to compare the balance of different datasets in terms of pixels and instances; (iii) we propose a novel framework based on FPN and EfficientNet, which can extract high-quality features at different spatial resolutions; (iv) extensive experiments and comparisons were done with public benchmarks for clothing segmentation. To the best of our knowledge, all the mentioned contributions are new and not published elsewhere.

The remainder of this paper is organized as follows: Section II presents a brief description of related works. In Section III, we present a new dataset, named UTFPR-SBD3, for clothing segmentation, a new measure of instances and pixels balance for image segmentation datasets, and the proposed framework for clothing segmentation. The experimental results and their discussion are shown in Section IV. Finally, general conclusions and suggestions for future research directions are presented in Section V.

II. RELATED WORKS
DL methods have been utilized to address many problems in the surveillance environment, including person re-identification [24], face recognition [9], anomaly detection [25], multi-view analysis [26], and traffic monitoring [27].

In the soft biometrics area, several approaches have been proposed to localize, segment, and classify clothes in digital images for many application scenarios, including outfit retrieval, fashion analysis, surveillance, and human identification.

In [12], the authors proposed an approach to classify soft biometrics traits using Convolutional Neural Networks (CNN). The authors used independent classifiers to detect the gender (male or female), upper clothes (short or long sleeves), and lower clothes (shorts or pants) of a person. Although their approach achieved remarkable generalization capability, they reported difficulties in finding appropriate image datasets regarding size, quality, and variability.

A method for clothing segmentation using an adaptation of the U-Net architecture was presented in [28]. This model was adapted to accommodate multi-class segmentation and was trained with the Clothing Co-Parsing (CCP) dataset [29]. Due to the significant similarity between classes with few instances, 58 different classes were grouped into 14. The authors concluded that the U-Net model could be a reliable way to perform the segmentation task. The authors also pointed out the lack of large datasets annotated at the pixel level.

An approach based on a Fully-Convolutional Network (FCN) for the clothing segmentation task was proposed by [30]. The proposed architecture extends the FCN with a side path called ''outfit encoder'' to filter inappropriate clothing combinations from the segmentation, and a post-processing step using a Conditional Random Field (CRF) to assign a visually consistent set of clothing labels. These authors introduced a refined annotation of the Fashionista dataset with 25 classes to study the influence of erroneous annotations and ambiguous classes on performance metrics.

[14] also proposed an FCN-based approach to compute the color and the class of pixels. First, a Faster R-CNN (Region-based CNN) model is used to detect and crop people in the image. Then, the cropped image of a person is used to feed the FCN, which computes color and class feature maps. Finally, a logistic pooling layer combines these features, and then the color and class are predicted.

More recently, [31] proposed the superpixels features extractor network (SP-FEN). The proposed model is based on the FCN with the introduction of a superpixels encoder as a side network that feeds the extracted features into the main segmentation pipeline. Data augmentation techniques such as flip, rotation, and deformation were used during the training step to improve generalization performance.

1 The dataset is available at [Link] [Link]

The authors in [32] propose the ResNeXt-FPN approach, which also uses the FPN architecture with ResNeXt to perform semantic segmentation of clothes. In this work, the authors concluded that the use of an FPN-based approach is feasible for clothing segmentation.

In general, the use of DL has shown promising results for the clothing segmentation task. However, as pointed out before, a common challenge reported by several authors is the difficulty of distinguishing between clothes worn on the same body part, for example, t-shirt and blouse, or hair and hat. Another reported challenge is the lack of annotated datasets required for training DL models. Most of the available datasets have highly imbalanced classes (classes with fewer than 20 instances), inconsistent and noisy annotations, and high similarity between classes. In Section III-A, a new dataset will be presented, aiming at overcoming these problems.

III. METHODS
This section first presents the proposed dataset for clothing segmentation, named UTFPR-SBD3, and a dataset comparison method. In the sequence, the proposed method for clothing segmentation, called EPYNET, is presented. Finally, the evaluation procedure of the method is shown.

FIGURE 1. Examples of annotation errors in the CFPD dataset. Figures on the left side present annotations with noise. The right side shows images with misannotation and an image with two similar subjects.

A. UTFPR-SBD3 DATASET
Previous works in the literature have proposed datasets for clothing segmentation and classification. They vary in the number of images and clothing categories. Also, some include non-fashion classes, such as hair, skin, face, background, etc. To date, the most popular datasets are presented in Table 1.

TABLE 1. Comparison of popular datasets for clothing segmentation and the new UTFPR-SBD3.

The Clothing Co-Parsing (CCP) dataset contains 2,098 high-resolution fashion photos, of which only 1,004 were annotated at the pixel level. It is a highly imbalanced dataset, including classes without instances and many ambiguous classes (e.g., it has about 10 different types of footwear).

The Colorful Fashion Parsing (CFPD) dataset contains 2,682 images annotated with classes (23 different types, including the background) and colors (13 types). The dataset was annotated using a superpixel method, and it has a significant amount of noise, as shown in Figure 1(b). It is known that the quality of training data in DL methods is essential for achieving a reasonable accuracy of the model. These observed noises, associated with high class imbalance, can drastically affect the segmentation accuracy. Figures 1(a) and 1(c) show the categories belts and sunglasses with partial annotation. Other annotation problems in the CFPD dataset are: items without annotations (e.g., the footwear class in Figures 1(c) and 1(e)), incorrect annotations (e.g., a coat annotated as a skirt in Figure 1(b) and as a scarf in Figure 1(e)), and images with two or more subjects and only one annotation, as depicted in Figure 1(f).

Another dataset used for clothing segmentation is the Fashionista dataset. It consists of 685 images with pixel-level annotations in 56 different classes. To overcome the ambiguous categories of clothes present in the dataset, [22] proposed the Refined Fashionista dataset by merging some categories (e.g., blazer and jacket). However, both datasets are still small and have imbalanced classes.

ModaNet contains 55,176 street fashion photos annotated with polygons in 13 classes. Despite being the largest dataset in the number of images, it was not considered in this study because it has only 13 classes, aimed mainly at real-world commercial applications. Moreover, it does not include classes useful for soft biometrics contexts, such as skin, hair, glasses, socks, or neck adornments.

In addition to the above-mentioned problems of the datasets, such as highly imbalanced classes, ambiguous labels, and wrong annotations, in the context of soft biometrics it is strongly desirable to have a dataset with human attributes, including hair and skin, so one could distinguish two individuals.

TABLE 2. Number of images and pixels for each class in UTFPR-SBD3.

To overcome the drawbacks of the existing datasets, we constructed a new one, named UTFPR-SBD3 (Soft Biometrics Dataset #3), intended for clothing segmentation in the context of soft biometrics. It consists of 4,500 images manually annotated into 18 classes plus the background (see Table 2), of which 1,003 were taken from the CCP dataset, 2,679 from the CFPD, and 685 from the Fashionista dataset. Additionally, 133 more images containing instances of the less frequent classes were collected from the fashion sharing website [Link] (the same source as the CFPD and Fashionista datasets) to ensure that each class has at least 100 instances. All images were standardized to 400 × 600 pixels in RGB color. Figure 2 shows the class distribution diagrams of the datasets. These diagrams graphically expose the imbalance of classes present in the datasets.

FIGURE 2. Distribution diagram of the UTFPR-SBD3 dataset compared with the prior datasets.

To guarantee the high quality of the dataset, all images were manually annotated at the pixel level using the JS Segment Annotator,2 a free web-based image annotation tool. Raw images were carefully selected to avoid, as far as possible, classes with a low number of instances. However, it is important to notice that two classes, background and skin, are naturally present in all images and, unfortunately, occupy a significant part of the image. Later in this paper, such information will be important for evaluating the comparative performance of the proposed method.

1) DATASET COMPARISON METHOD
In Table 1, the number of images and classes of the datasets frequently used for clothing segmentation are shown. Ideally, in a given dataset, the number of samples per class should be the same, so that no classifier would be biased towards the majority class. Since we are dealing with images of people dressed in clothes, not clothes alone, it is quite difficult, if not impossible, to achieve a perfect balance between the several classes of clothes in a dataset. Furthermore, considering that the segmentation of the objects (clothes) is accomplished at the pixel level, the number of pixels for each class must be taken into account. As a matter of fact, different pieces of clothes (as well as the non-clothes classes: skin, hair, and the background) may have quite different sizes. For instance, a bow tie or a belt is usually smaller than pants or an overcoat. Therefore, the image segmentation problem we have at hand is naturally unbalanced.

Another important issue is how to deal with the background class. Some published works, when reporting the overall segmentation performance of the proposed methods, include the background class together with all other clothing classes. Keeping in mind that the main purpose of the applications in this area is to identify clothes, not the background, considering the background as a class may lead to distorted results, since it is the majority class in most images (see Section IV-E for more details).

2 [Link]
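As an aside, per-class statistics like those reported in Table 2 can be gathered directly from integer-labeled annotation masks. The sketch below is our own illustration, not code from the paper; the label convention (0 = background) and the function name are assumptions:

```python
import numpy as np

def class_statistics(masks, num_classes):
    """Count, per class, in how many images the class appears (instances)
    and the total number of labeled pixels. `masks` is an iterable of
    2-D integer label arrays, where 0 denotes the background."""
    instances = np.zeros(num_classes, dtype=np.int64)
    pixels = np.zeros(num_classes, dtype=np.int64)
    for mask in masks:
        counts = np.bincount(mask.ravel(), minlength=num_classes)
        pixels += counts
        instances += (counts > 0).astype(np.int64)
    return instances, pixels

# Toy example: two 4x4 masks, 3 classes (0 = background)
m1 = np.zeros((4, 4), dtype=int); m1[0, :] = 1
m2 = np.zeros((4, 4), dtype=int); m2[:, 0] = 2
inst, pix = class_statistics([m1, m2], num_classes=3)
```

Such counts are exactly the inputs needed by the dataset comparison method described next.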

FIGURE 3. Overview of the proposed approach. Given an input image, the pre-processing step detects and crops the person found in the image.
Then, the EPYNET performs the segmentation task. Finally, in the post-processing step, noise is removed, and the final predicted label is presented.

As a consequence, it is a difficult task to compare the popular datasets presented before with the proposed UTFPR-SBD3. Any measure of comparison should take into account both the distribution of instances over classes and the number of pixels per instance. To meet such requirements, in this work we propose the Instances-Pixels Balance Index (IPBI) to compare the joint balance of instances and pixels of different datasets.

The IPBI is based on the concept of entropy, a common measure used in many fields of science, for which there are several definitions, depending upon the area. In a general sense, it measures the amount of disorder of a system. As mentioned before, for the sake of this work, the ideal dataset should have the same number of instances per class, as well as the same number of pixels in all classes.

For a dataset with c classes (labels), such that the i-th class has s_i instances (samples), and a total of k instances, the Shannon entropy is given by:

$$H_I = -\sum_{i=1}^{c} \frac{s_i}{k} \log \frac{s_i}{k} \qquad (1)$$

If the number of instances is exactly the same for all classes, Equation 1 reduces to log c, and it turns out that H_I makes sense only for c > 1. Therefore, if H_I is divided by log c, it becomes normalized in the range [0..1], and we can obtain a measure of instances balance in the dataset, as follows:

$$B_I = \frac{H_I}{\log c} \qquad (2)$$

Similar reasoning can be done considering the number of pixels of all samples in a class, so that we can obtain the pixels balance measure for the dataset, B_P. Consequently, to meet the above-mentioned definition of the (hypothetical) ideal dataset, B_I = B_P = 1. Since both B_I and B_P should be maximized, one could interpret them in the Cartesian plane, such that the farther from the origin, the better. Therefore, the Instances-Pixels Balance Index is defined as:

$$IPBI = \sqrt{B_I^2 + B_P^2} \qquad (3)$$

B. THE PROPOSED METHOD
An overview of the proposed approach is shown in Figure 3, and it includes three steps: (1) Pre-processing, (2) Segmentation using EPYNET, and (3) Post-processing.
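Before detailing the method, note that Equations (1)–(3) reduce to a few lines of code given per-class instance and pixel counts. The following Python sketch is our own; the function names and the toy counts are illustrative:

```python
import math

def balance_index(counts):
    """Normalized Shannon entropy (Eqs. 1-2): 1.0 means perfectly balanced."""
    k = sum(counts)
    c = len(counts)
    h = -sum((s / k) * math.log(s / k) for s in counts if s > 0)
    return h / math.log(c)

def ipbi(instance_counts, pixel_counts):
    """Instances-Pixels Balance Index (Eq. 3)."""
    b_i = balance_index(instance_counts)
    b_p = balance_index(pixel_counts)
    return math.sqrt(b_i ** 2 + b_p ** 2)

# A perfectly balanced toy dataset approaches the maximum IPBI of sqrt(2)
val = ipbi([100, 100, 100], [5000, 5000, 5000])
```

An imbalanced dataset (e.g., one dominant class in instances or pixels) yields a value well below the √2 maximum, which is what makes the index useful for comparing datasets.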

1) PRE-PROCESSING
The first step consists of detecting, localizing, and cropping all people from a raw image. For this task, we used the Single Shot MultiBox Detector (SSD) [18], which uses convolutional networks to generate bounding boxes for each instance of a class, in this case, people only. Then, each person cropped from the raw image was proportionally resized to 320 × 320 pixels (padding with zeros was applied to make it square). These transformations were also applied to the image label to ensure compatibility with the input image. Finally, we used min-max normalization to normalize the input images to the range [0..1], so that they are guaranteed to be on the same scale.

2) EPYNET: EFFICIENT PYRAMIDAL NETWORK
In this work, the clothing segmentation task was formulated as a classification problem. Therefore, a classifier was trained to assign each pixel to a target label. For this task, we propose a new approach, called EPYNET, based on the FPN [19] architecture with the EfficientNet [20] model as the backbone.

FPN has achieved promising results in the object detection task, especially for small objects, by computing feature maps at different spatial resolutions in a pyramidal representation. This representation can also be used for semantic segmentation tasks (see, for instance, [34]). EfficientNet is a family of models, ranging from the light-weight EfficientNet-B0 to the large EfficientNet-B7. It uses an effective compound scaling method to obtain improved efficiency and performance. Recently, the architecture achieved state-of-the-art results in image classification on the ImageNet dataset [20].

EPYNET performs the clothing segmentation task by combining these two approaches, taking advantage of both the features extracted by EfficientNet and the pyramidal representation at different scales from FPN. To better detail our proposed approach, shown in Figure 3, we divided the framework into three main components: Bottom-up Pathway, Top-down Pathway, and Semantic Segmentation Head.

The Bottom-up Pathway is a deep CNN, more specifically, the EfficientNet-B4 model initialized with weights trained on the ImageNet dataset. It consists of seven main convolution blocks, called MBConv, based on the Inverted Residual Block previously introduced in MobileNet [35].

The Top-down Pathway consists of four blocks connected with lateral connections from the Bottom-up Pathway that build high-level semantic feature maps at different scales, using the image pyramid principle suggested by [19]. For the lateral connections, we use the following output layers from EfficientNet-B4: block6a_expand_activation, block4a_expand_activation, block3a_expand_activation, and block2a_expand_activation. Each block performs an upsampling by a factor of 2 on the higher pyramid-level features and adds them to the features from the Bottom-up Pathway via the lateral connections to locate the features more precisely.

Then, in the Semantic Segmentation Head, two convolutions with kernel size 3 × 3 and 128 channels are applied sequentially to each output block, followed by an upsampling operation to compute the final feature maps. For each convolution, we used batch normalization and the ReLU (Rectified Linear Unit) activation function. All four resulting feature maps are concatenated, and a last convolution with batch normalization and ReLU activation is applied, followed by an upsampling operation.

Finally, the whole model was jointly optimized in an end-to-end way with the weighted cross-entropy loss to handle the imbalance in the training data. The model output is a 19-channel softmax layer that assigns to each pixel the probabilities of belonging to each class of the dataset.

3) POST-PROCESSING
The output of the model consists of a softmax probability map of the same size as the input image, with the number of channels defined by the number of classes. The morphological opening operation is applied to each channel of the predicted output to reduce the impact of noise and enhance the detected regions. Finally, the prediction is merged using the argmax operation on each depth-wise pixel to obtain the final mask.

C. EVALUATION
Four measures were used to evaluate the proposed method: Precision, Recall, F1-score, and Intersection over Union (IoU), computed by Equations (4), (5), (6), and (7), respectively. These evaluation measures are commonly used for segmentation tasks in general and, also, for clothing segmentation [36]. All these measures are reported in this work aiming at comparisons with future approaches. It is worth noticing that Accuracy is not an appropriate performance measure for unbalanced datasets, as pointed out by [37]. Anyhow, Accuracy will be shown here only for the sake of comparing our approach with previously published works.

$$\mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i} \qquad (4)$$

$$\mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i} \qquad (5)$$

$$\mathrm{F1\text{-}Score}_i = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (6)$$

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \qquad (7)$$

where TP, FP, and FN indicate, respectively, True Positives, False Positives, and False Negatives for a given class i. Precision indicates the proportion of predictions that are true positives. Recall is a measure of completeness and specifies the proportion of positives that are detected. F1-score combines Precision and Recall as the harmonic mean of the two. IoU is the intersection of the ground truth and the predicted segmentation divided by their union.
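For concreteness, Equations (4)–(7) can be computed per class from a pair of integer-labeled masks. The sketch below is our own illustration (the mask convention and function name are assumptions, not code from the paper):

```python
import numpy as np

def segmentation_metrics(y_true, y_pred, cls):
    """Per-class Precision, Recall, F1-score, and IoU (Eqs. 4-7) for the
    class label `cls`, given integer-labeled ground-truth and predicted
    masks of the same shape."""
    t = (y_true == cls)
    p = (y_pred == cls)
    tp = np.logical_and(t, p).sum()
    fp = np.logical_and(~t, p).sum()
    fn = np.logical_and(t, ~p).sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1, iou

# Tiny 2x2 example: one true positive, one false positive, one false negative
prec, rec, f1, iou = segmentation_metrics(
    np.array([[1, 1], [0, 0]]), np.array([[1, 0], [1, 0]]), cls=1)
```

Averaging these per-class values (excluding or including the background, as discussed above) yields the aggregate figures reported in the experiments.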

IV. EXPERIMENTAL RESULTS AND DISCUSSION
This section presents the experimental results obtained with the methods previously described in Section III. First, we describe the implementation details, the data augmentation techniques used during the training step, and the quantitative and qualitative results of EPYNET on the UTFPR-SBD3 dataset. Then, a dataset comparison using the proposed IPBI measure is presented. Finally, the proposed approach is compared with other state-of-the-art approaches on the CFPD, Fashionista, and Refined Fashionista datasets.

A. IMPLEMENTATION DETAILS
Experiments were performed on a workstation with an Intel Core i7-8700 processor, 32 GBytes of RAM, and an Nvidia Titan-Xp GPU. The TensorFlow and Keras libraries were used to train and test the proposed model.

We trained the EPYNET model with the UTFPR-SBD3 dataset. We used the RMSProp optimizer with default parameters and a learning rate of 0.001. During the training process, the learning rate was reduced by a factor of 0.1 whenever the evaluation metric did not improve for 5 epochs. A limit of 100 epochs was defined; however, training was stopped earlier if the evaluation measure stagnated for 10 consecutive epochs.

The generalization performance of the trained model was assessed by means of the 10-fold cross-validation procedure, as usual in the literature [22], [29]. To ensure the same class distribution in each generated subset of the cross-validation, we used the stratified sampling method proposed by [38]. Therefore, this procedure guarantees a less optimistic generalization estimate of the model. At the end of each fold, the best model was used to predict the samples in the test set, and the performance was computed using the measures presented in Section III-C.

It is well known that DL methods require massive volumes of data for training, and data augmentation techniques are frequently used to overcome the lack of large and high-quality datasets [39]. The basic idea is to make slight random changes in the input images to create more variety in the training data. This procedure is known to bring more robustness to the trained models, since it increases their generalization capability on unseen images (test dataset). Among the many methods that can be applied for image data augmentation [40], we used flip, rotation, and random crop. There are two strategies for data augmentation: offline and online. In the first, the data augmentation methods are applied to the original training dataset to create a much larger dataset, which is then used for training the model. In the second, the data augmentation methods are randomly applied each time an image is presented to the model during training. We used the online approach, as it requires less storage for the images, at the expense of some extra processing.

TABLE 3. Per-class segmentation performance obtained by the proposed model over the UTFPR-SBD3 dataset, with and without (w/o) including the Background class.

B. QUANTITATIVE RESULTS
The quantitative evaluation, presented in Table 3, shows the Precision, Recall, and F1-Score obtained with the proposed approach on the UTFPR-SBD3 dataset. Including the Background as a class, our approach obtained 81.3%, 76.4%, and 78.3%, respectively. On the other hand, without the Background, the values decrease slightly, suggesting that the inclusion of the Background among the classes of a clothing segmentation problem may lead to distorted results.

Considering the F1-Score, the best results were found for the following classes: Background, Skin, Pants, and Hair. On the other hand, the poorest results were those for classes related to small items: Belt, Sweater, Neckwear, Socks, Eyewear, and Headwear. These items are those with the smallest number of pixels among all classes. Moreover, Neckwear and Headwear are accessories that cover different styles and shapes (e.g., the Neckwear class includes bowties, neckties, and scarves).

We also evaluated the overall segmentation performance of the model using the IoU measure, as shown in Figure 4. The average IoU was 65.6%. According to [41], predictions with an intersection over union of more than 50% are considered satisfactory. Therefore, for most classes, the results indicate that the proposed approach is efficient for the clothing segmentation task.

For only three classes were the results not satisfactory: Belt, Sweater, and Neckwear. Although the Belt class is present in approximately 30% of the dataset regarding the number of instances, it is a small object and occupies less than 1% of the dataset considering the number of pixels. Also, Neckwear and Sweater are among the classes with the lowest occurrence in the dataset.
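Returning to the training setup of Section IV-A, the online augmentation strategy (random flip, rotation, and crop, applied on the fly and identically to the image and its label mask) can be sketched as below. The probabilities, the 90-degree rotation simplification, and the crop size here are our own assumptions for illustration, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, mask, crop=288):
    """Randomly flip, rotate (by multiples of 90 degrees here, for
    simplicity), and crop an image together with its label mask, so that
    both stay spatially aligned."""
    if rng.random() < 0.5:                      # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    k = int(rng.integers(0, 4))                 # random 90-degree rotation
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    h, w = mask.shape
    top = int(rng.integers(0, h - crop + 1))    # random crop window
    left = int(rng.integers(0, w - crop + 1))
    return (image[top:top + crop, left:left + crop],
            mask[top:top + crop, left:left + crop])

# Applied online: called once per image, per epoch, during training
img = np.zeros((320, 320, 3)); lbl = np.zeros((320, 320), dtype=int)
a_img, a_lbl = augment(img, lbl)
```

Because a fresh random transform is drawn every time an image is fed to the model, no augmented copies need to be stored, which is the storage advantage of the online strategy mentioned above.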

However, by increasing the number of instances of these classes, an inevitable increase in the number of pixels of other, more frequent classes (such as Skin and Hair) will occur. As a consequence, the class imbalance would increase instead of decreasing. We may also notice that many Sweaters were predicted as Shirt or Coat, since those classes are quite similar to each other. Both types of misclassification highlight the complexity of the semantic segmentation of clothes, still a challenging problem.
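For reference, the IoU measure used above can be computed per class from flattened label maps, as in this sketch (illustrative only, not the evaluation code used in this work):

```python
def iou_per_class(pred, target, num_classes):
    """Per-class intersection over union for two flattened label maps.
    Classes absent from both prediction and ground truth are skipped."""
    ious = {}
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious[c] = inter / union
    return ious

def mean_iou(pred, target, num_classes):
    """Average IoU over the classes present in either map."""
    ious = iou_per_class(pred, target, num_classes)
    return sum(ious.values()) / len(ious)
```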

FIGURE 4. IoU scores for each class.

C. QUALITATIVE RESULTS
In this Section, we present a visual evaluation of the outputs predicted by the proposed approach. Figure 7a) shows some sample images of the test set, with both the ground truth and the output provided by EPYNET.
The trained model was able to satisfactorily segment different types of attributes, confirming that our approach is robust enough for this task. Notwithstanding, in specific cases, poor segmentation results were also produced by our model, as shown in Figure 7b). Notice that similar classes may cause misclassifications. For instance, Stocking was occasionally confused with Pants, Skirt was occasionally confused with Dress, and Sweater was sometimes confused with Shirt or Coat.
As mentioned in Section IV-B, for the Sweater class, segmentation results were not satisfactory. This class consists of a piece of clothing, made of knitted or crocheted material, that covers the top part of the body. It can be closed, usually called a pullover, or open, usually called a cardigan. Sweaters can have different shapes and styles and, due to such variability, this class was frequently predicted as Shirt or Coat. This may suggest that this category would be better divided into two or more classes (see Figure 6).

FIGURE 5. Confusion matrix computed on the UTFPR-SBD3 dataset from the sum of the confusion matrices generated in each fold of the cross-validation. Vertical axis: ground truth classes; horizontal axis: predicted classes; cell values: percent of correct predictions.

The confusion matrix (Figure 5) shows the predicted and target classifications. For most classes, the highest values are found in the main diagonal, thus demonstrating the effectiveness of the proposed approach.
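A confusion matrix of the kind shown in Figure 5 can be accumulated from predicted and ground-truth label maps and row-normalized to percentages. A minimal sketch (again, not the code used in this work):

```python
def confusion_matrix(pred, target, num_classes):
    """Count co-occurrences of ground-truth and predicted labels over two
    flattened label maps, then normalize each row to percentages
    (rows: ground truth, columns: predicted), as in Figure 5."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        counts[t][p] += 1
    # max(..., 1) guards against division by zero for classes never present
    return [[100.0 * v / max(sum(row), 1) for v in row] for row in counts]
```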
Even for classes of objects with small dimensions, such as Socks and Eyewear, the results were satisfactory, despite having some classification errors. For instance, Eyewear was occasionally incorrectly predicted as Skin or Hair, and Socks were wrongly predicted as Skin or Footwear. Although this is reasonable from the semantic point of view, this fact suggests that more data is required to improve the classification results.

FIGURE 6. Segmentation results of the Sweater class.

Furthermore, the illumination variation, shadows, and partially occluded clothes may result in an incorrect segmentation. According to [42], the performance decreases

TABLE 4. Measures of Instances Balance, Pixels Balance, and IPBI with (top lines) and without (bottom lines) the Background class, for several clothing segmentation datasets.

TABLE 5. Comparison with the state-of-the-art approaches. For each dataset, the best results are highlighted.

FIGURE 7. Segmentation results on the test set based on: a) highest and b) lowest IoU score. This Figure shows the input image, ground truth, and the predicted class, from left to right, respectively.

significantly when more than 50% of a garment is occluded. For instance, Figure 7b) presents a dress that was predicted as a skirt, probably because it is partially occluded by a coat. Actually, such a problem is very difficult to address, even for humans.

D. COMPARISON OF DIFFERENT DATASETS
Based on the equations presented in Section III-A1, the Instances Balance (BI), Pixels Balance (BP), and IPBI were computed for all datasets used in this work. All metrics were computed with and without the Background class. This comparison is shown in Table 4.

In this Table, the effect of the Background class on the Pixels Balance (BP) measure is clear. Such class is the largest one (in pixels) in all datasets. Therefore, when present, it significantly decreases BP (and, as a consequence, IPBI too). No important changes are noticed in the Instances Balance (BI) measure with or without the Background class.

Comparing the datasets according to IPBI when including the Background class, notice that UTFPR-SBD3 is the best-balanced dataset, while the others are poorly balanced. Particularly, the imbalance of CCP and Fashionista is due to their large number of classes (more than 50), many of which include small objects (such as gloves and rings) or have few instances. Excluding the Background class, UTFPR-SBD3 is similarly balanced to CFPD and quite close to the Refined Fashionista, which achieved the highest IPBI. However, recall that CFPD has many annotation errors caused by its superpixel-based annotation, as shown in Figure 1. The Refined Fashionista dataset, despite having the same number of images as the Fashionista, improved the balance by around 10% when compared with the original dataset, thanks

to the fusion of classes with a small number of instances into larger ones.

E. COMPARISON WITH THE STATE-OF-THE-ART
The proposed approach was compared with the current state-of-the-art methods on the CFPD, Fashionista, and Refined Fashionista datasets. For a fair comparison, we used the same measures (Acc and IoU) reported by [30], available on GitHub.3 Also, since the previous works did not exclude the Background class in the evaluation, the cropping step described in Section III-B1 was not performed, and the entire image was used as input during the training step.

3 [Link]

The Fashionista dataset was divided into training and test sets, as described in previous works [22], [30], [43], with 10% of the training images left out for validation. On the other hand, the CFPD dataset was randomly divided into 78% for training, 2% for validation, and 20% for testing. Table 5 shows that, for all datasets, EPYNET achieved better results than the other methods in the literature.

Notice that our approach surpasses [32] by 9% of IoU in the CFPD dataset, and by 1% in the Refined Fashionista dataset. In the Fashionista dataset, EPYNET also outperforms [44] by 11%. Despite the annotation issues reported in Section III-A, we achieved better results for accuracy in the CFPD and Fashionista datasets, and competitive results in the Refined Fashionista dataset.

V. CONCLUSION
Image segmentation has been one of the most challenging problems in computer vision, and its solution can improve applications in many areas, including security and surveillance. Recently, soft biometric traits, including types of clothes, have shown promising results in people re-identification. However, it is still an open problem because of the wide variety of types, shapes, styles, and colors of clothes.

Although semantic segmentation using Deep Learning algorithms has achieved great success in many research fields, it is still difficult for computers to understand and describe a scene as humans naturally do.

As discussed in Section III-A1, the segmentation problem faced in this work is naturally unbalanced. The presence of classes of objects with a small number of pixels (e.g., belts, ties, or sunglasses), or that occur in all images (e.g., skin and hair), makes it impossible to achieve a perfect balance. An imbalanced dataset, whether in terms of instances or pixels, can negatively influence the performance of segmentation methods.

This work has three contributions. First, motivated by the need for a large, high-quality dataset with pixel-level annotations, we created a new dataset, named UTFPR-SBD3. It was designed to overcome the annotation problems frequently found in other popular datasets, and to provide the best possible balance over classes at the instance and pixel levels. Second, due to the difficulty in comparing datasets for clothing segmentation, a new measure of dataset imbalance was introduced: IPBI. With such a measure, it is possible to evaluate the influence of the background, classes with small items, or classes with a too high or too low number of instances. The third, and most important, contribution is EPYNET. This framework is based on the pyramidal architecture of FPN and the EfficientNet model. It is aimed at clothing semantic segmentation in the context of soft biometrics. We presented an extensive comparison of EPYNET with other approaches using several popular datasets, and it outperformed the state-of-the-art methods. Both the quantitative and qualitative results presented show the effectiveness of EPYNET. Based on these results, we believe that the proposed approach can be potentially useful for many real-world applications related to soft biometrics, people surveillance, image description, clothes recommendation, people re-identification, and others.

Despite the good results achieved by EPYNET, we observed that other factors could influence the segmentation task, such as the environment illumination and the quality of the image. Besides, occlusions and similar classes in the dataset can degrade the predicted results. Future work will include improvements in the method to better handle illumination changes, and to enhance the discrimination between similar objects.

ACKNOWLEDGMENT
The authors would like to thank NVIDIA Corporation for the donation of the Titan-Xp GPU board used in this work.

REFERENCES
[1] M. Romero, M. Gutoski, L. T. Hattori, M. Ribeiro, and H. S. Lopes, "Soft biometrics classification in videos using transfer learning and bidirectional long short-term memory networks," Learn. Nonlinear Models, vol. 18, no. 1, pp. 47-59, Sep. 2020.
[2] A. Abdelwhab and S. Viriri, "A survey on soft biometrics for human identification," in Machine Learning and Biometrics, J. Yang, D. S. Park, S. Yoon, Y. Chen, and C. Zhang, Eds. Rijeka, Croatia: IntechOpen, 2018, ch. 3.
[3] X. Zhao, Y. Chen, E. Blasch, L. Zhang, and G. Chen, "Face recognition in low-resolution surveillance video streams," Proc. SPIE, vol. 11017, pp. 147-159, Jul. 2019.
[4] O. A. Arigbabu, S. M. S. Ahmad, W. A. W. Adnan, and S. Yussof, "Integration of multiple soft biometrics for human identification," Pattern Recognit. Lett., vol. 68, pp. 278-287, Dec. 2015.
[5] D. A. Reid, S. Samangooei, C. Chen, M. S. Nixon, and A. Ross, "Soft biometrics for surveillance: An overview," in Handbook of Statistics, vol. 31, C. Rao and V. Govindaraju, Eds. Amsterdam, The Netherlands: Elsevier, 2013, pp. 327-352.
[6] A. Dantcheva, C. Velardo, A. D'Angelo, and J.-L. Dugelay, "Bag of soft biometrics for person identification," Multimedia Tools Appl., vol. 51, no. 2, pp. 739-777, Jan. 2011.
[7] R. Vera-Rodriguez, P. Marin-Belinchon, E. Gonzalez-Sosa, P. Tome, and J. Ortega-Garcia, "Exploring automatic extraction of body-based soft biometrics," in Proc. Int. Carnahan Conf. Secur. Technol. (ICCST), Oct. 2017, pp. 1-6.
[8] E. Gonzalez-Sosa, J. Fierrez, R. Vera-Rodriguez, and F. Alonso-Fernandez, "Facial soft biometrics for recognition in the wild: Recent works, annotation, and COTS evaluation," IEEE Trans. Inf. Forensics Security, vol. 13, no. 8, pp. 2001-2014, Aug. 2018.
[9] S. Bashbaghi, E. Granger, R. Sabourin, and M. Parchami, Deep Learning Architectures for Face Recognition in Video Surveillance. Singapore: Springer, 2019, pp. 133-154.

[10] X. Di and V. M. Patel, "Deep learning for tattoo recognition," in Deep Learning for Biometrics. Cham, Switzerland: Springer, 2017, pp. 241-256.
[11] E. R. H. P. Isaac, S. Elias, S. Rajagopalan, and K. S. Easwarakumar, "Multiview gait-based gender classification through pose-based voting," Pattern Recognit. Lett., vol. 126, pp. 41-50, Sep. 2019.
[12] H. A. Perlin and H. S. Lopes, "Extracting human attributes using a convolutional neural network approach," Pattern Recognit. Lett., vol. 68, pp. 250-259, Dec. 2015.
[13] K. M. A. Raihan, M. Khaliluzzaman, and S. M. Rezvi, "Recognition of pedestrian clothing attributes from far view images using convolutional neural network," in Proc. 10th Int. Conf. Comput., Commun. Netw. Technol. (ICCCNT), Jul. 2019, pp. 1-7.
[14] Z. Chen, S. Liu, Y. Zhai, J. Lin, X. Cao, and L. Yang, "Human parsing by weak structural label," Multimedia Tools Appl., vol. 77, no. 15, pp. 19795-19809, Aug. 2018.
[15] C.-H. Yoo, Y.-G. Shin, S.-W. Kim, and S.-J. Ko, "Context-aware encoding for clothing parsing," Electron. Lett., vol. 55, no. 12, pp. 692-693, Jun. 2019.
[16] W. Ji, X. Li, F. Wu, Z. Pan, and Y. Zhuang, "Human-centric clothing segmentation via deformable semantic locality-preserving network," IEEE Trans. Circuits Syst. Video Technol., early access, Dec. 25, 2019, doi: 10.1109/TCSVT.2019.2962216.
[17] E. S. Jaha and M. S. Nixon, "Soft biometrics for subject identification using clothing attributes," in Proc. IEEE Int. Joint Conf. Biometrics, Piscataway, NJ, USA, Sep. 2014, pp. 1-6.
[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Proc. 14th Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 21-37.
[19] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2117-2125.
[20] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in Proc. 36th Int. Conf. Mach. Learn., vol. 97, 2019, pp. 6105-6114.
[21] S. Liu, J. Feng, C. Domokos, H. Xu, J. Huang, Z. Hu, and S. Yan, "Fashion parsing with weak color-category labels," IEEE Trans. Multimedia, vol. 16, no. 1, pp. 253-265, Jan. 2014.
[22] K. Yamaguchi, M. H. Kiapour, L. E. Ortiz, and T. L. Berg, "Parsing clothing in fashion photographs," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3570-3577.
[23] A. D. S. Inácio, A. Brilhador, and H. S. Lopes, "Semantic segmentation of clothes in the context of soft biometrics using deep learning methods," in Proc. 14th Brazilian Congr. Comput. Intell., Nov. 2020, pp. 1-7.
[24] A. Li, L. Liu, and S. Yan, Person Re-Identification by Attribute-Assisted Clothes Appearance. London, U.K.: Springer, 2014, pp. 119-138.
[25] M. Ribeiro, M. Gutoski, A. E. Lazzaretti, and H. S. Lopes, "One-class classification in images and videos using a convolutional autoencoder with compact embedding," IEEE Access, vol. 8, pp. 86520-86535, 2020.
[26] P. Hu, D. Peng, Y. Sang, and Y. Xiang, "Multi-view linear discriminant analysis network," IEEE Trans. Image Process., vol. 28, no. 11, pp. 5352-5365, Nov. 2019.
[27] J.-S. Zhang, J. Cao, and B. Mao, "Application of deep learning and unmanned aerial vehicle technology in traffic flow monitoring," in Proc. Int. Conf. Mach. Learn. Cybern. (ICMLC), Jul. 2017, pp. 189-194.
[28] T. Hrkac, K. Brkic, and Z. Kalafatic, "Multi-class U-Net for segmentation of non-biometric identifiers," in Proc. 19th Irish Mach. Vis. Image Process. Conf., 2017, pp. 131-138.
[29] W. Yang, P. Luo, and L. Lin, "Clothing co-parsing by joint image segmentation and labeling," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Piscataway, NJ, USA, Jun. 2014, pp. 3182-3189.
[30] P. Tangseng, Z. Wu, and K. Yamaguchi, "Looking at outfit to parse clothing," 2017, arXiv:1703.01386. [Online]. Available: [Link]
[31] A. M. Ihsan, C. K. Loo, S. A. Naji, and M. Seera, "Superpixels features extractor network (SP-FEN) for clothing parsing enhancement," Neural Process. Lett., vol. 51, pp. 1-19, Jan. 2020.
[32] J. Martinsson and O. Mogren, "Semantic segmentation of fashion images using feature pyramid networks," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshop (ICCVW), Oct. 2019, pp. 3133-3136.
[33] S. Zheng, F. Yang, M. H. Kiapour, and R. Piramuthu, "ModaNet: A large-scale street fashion dataset with polygon annotations," in Proc. 26th ACM Int. Conf. Multimedia, New York, NY, USA, 2018, pp. 1670-1678.
[34] S. Seferbekov, V. Iglovikov, A. Buslaev, and A. Shvets, "Feature pyramid network for multi-class land segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2018, pp. 272-275.
[35] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4510-4520.
[36] F. Lateef and Y. Ruichek, "Survey on semantic segmentation using deep learning techniques," Neurocomputing, vol. 338, pp. 321-348, Apr. 2019.
[37] G. E. A. P. A. Batista, A. C. P. L. F. Carvalho, and M. C. Monard, "Applying one-sided selection to unbalanced datasets," in Proc. Mex. Int. Conf. Artif. Intell. Berlin, Germany: Springer, 2000, pp. 315-325.
[38] K. Sechidis, G. Tsoumakas, and I. Vlahavas, "On the stratification of multi-label data," in Proc. Eur. Conf. Mach. Learn. Knowl. Discovery Databases. Berlin, Germany: Springer, 2011, pp. 145-158.
[39] N. Aquino, M. Gutoski, L. Hattori, and H. Lopes, "The effect of data augmentation on the performance of convolutional neural networks," in Proc. 13th Brazilian Conf. Comput. Intell. (SBIC/ABRICOM), 2017, pp. 1-12.
[40] C. Shorten and T. M. Khoshgoftaar, "A survey on image data augmentation for deep learning," J. Big Data, vol. 6, no. 60, pp. 1-48, 2019.
[41] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes challenge: A retrospective," Int. J. Comput. Vis., vol. 111, no. 1, pp. 98-136, Jan. 2015.
[42] Y. Ge, R. Zhang, X. Wang, X. Tang, and P. Luo, "DeepFashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5332-5340.
[43] W. Ji, X. Li, Y. Zhuang, O. E. F. Bourahla, Y. Ji, S. Li, and J. Cui, "Semantic locality-aware deformable network for clothing segmentation," in Proc. 27th Int. Joint Conf. Artif. Intell., Jul. 2018, pp. 764-770.
[44] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3431-3440.
[45] E. Simo-Serra, S. Fidler, F. Moreno-Noguer, and R. Urtasun, "A high performance CRF model for clothes parsing," in Proc. Asian Conf. Comput. Vis. (ACCV). Cham, Switzerland: Springer, 2015, pp. 64-81.

ANDREI DE SOUZA INÁCIO received the [Link]. and [Link]. degrees in computer science from the Federal University of Santa Catarina (UFSC), in 2013 and 2016, respectively. He is currently pursuing the Ph.D. degree in electrical and computer engineering with the Federal University of Technology – Paraná, Brazil. Since 2014, he has been a Lecturer with the Federal Institute of Santa Catarina (IFSC). He has professional experiences in information systems design, Web development, and IT project management. His research interests include, but are not limited to, computer vision, machine learning, and data mining.

HEITOR SILVÉRIO LOPES received the [Link]. and [Link]. degrees in electrical engineering from the Federal University of Technology – Paraná (UTFPR), Curitiba, in 1984 and 1990, respectively, and the Ph.D. degree from the Federal University of Santa Catarina, in 1996. Since 2003, he has been a Research Fellow with the Brazilian National Research Council, in the area of computer science. In 2014, he spent a sabbatical year at the Department of Electrical Engineering and Computer Science, The University of Tennessee, Knoxville, USA. He is currently a Tenured Full Professor with the Department of Electronics and the Graduate Program on Electrical Engineering and Applied Computer Science (CPGEI), UTFPR. He is also the Founder and the current Head of the Bioinformatics and Computational Intelligence Laboratory (LABIC). His research interests include computer vision, deep learning, evolutionary computation, and data mining.
