Age and Gender Classification Using CNN CVPR2015
Abstract
Automatic age and gender classification has become relevant to an increasing number of applications, particularly
since the rise of social platforms and social media. Nevertheless, performance of existing methods on real-world
images is still significantly lacking, especially when compared to the tremendous leaps in performance recently reported for the related task of face recognition. In this paper
we show that by learning representations through the use
of deep-convolutional neural networks (CNN), a significant
increase in performance can be obtained on these tasks. To
this end, we propose a simple convolutional net architecture
that can be used even when the amount of learning data
is limited. We evaluate our method on the recent Adience
benchmark for age and gender estimation and show it to
dramatically outperform current state-of-the-art methods.
1. Introduction
Figure 1. Faces from the Adience benchmark for age and gender classification [10]. These images represent some of the challenges of age and gender estimation from real-world, unconstrained images, most notably extreme blur (low resolution), occlusions, out-of-plane pose variations, and expressions.
Age and gender play fundamental roles in social interactions. Languages reserve different salutations and grammar rules for men or women, and very often different vocabularies are used when addressing elders compared to
young people. Despite the basic roles these attributes play
in our day-to-day lives, the ability to automatically estimate
them accurately and reliably from face images is still far
from meeting the needs of commercial applications. This is
particularly perplexing when considering recent claims to
super-human capabilities in the related task of face recognition (e.g., [48]).
Past approaches to estimating or classifying these attributes from face images have relied on differences in facial feature dimensions [29] or tailored face descriptors
(e.g., [10, 15, 32]). Most have employed classification
schemes designed particularly for age or gender estimation
tasks, including [4] and others. Few of these past methods were designed to handle the many challenges of unconstrained imaging conditions [10]. Moreover, the machine learning methods employed by these systems did not fully exploit the huge volumes of image data available online to improve classification capabilities.
2. Related Work
Before describing the proposed method we briefly review related methods for age and gender classification and
provide a cursory overview of deep convolutional networks.
Age classification. Gaussian Mixture Models (GMM) [13] were used to represent the distribution of facial patches. In [54] GMM were
used again for representing the distribution of local facial
measurements, but robust descriptors were used instead of
pixel patches. Finally, instead of GMM, Hidden Markov Model supervectors [40] were used in [56] for representing face patch distributions.
An alternative to the local image intensity patches are robust image descriptors: Gabor image descriptors [32] were
used in [15] along with a Fuzzy-LDA classifier which considers a face image as belonging to more than one age
class. In [20] a combination of Biologically-Inspired Features (BIF) [44] and various manifold-learning methods
were used for age estimation. Gabor [32] and local binary
patterns (LBP) [1] features were used in [7] along with a
hierarchical age classifier composed of Support Vector Machines (SVM) [9] to classify the input image to an age-class
followed by a support vector regression [52] to estimate a
precise age.
Finally, [4] proposed improved versions of relevant component analysis [3] and locally preserving projections [36].
Those methods are used for distance learning and dimensionality reduction, respectively, with Active Appearance
Models [8] as an image feature.
All of these methods have proven effective on small
and/or constrained benchmarks for age estimation. To our
knowledge, the best performing methods were demonstrated on the Group Photos benchmark [14]. In [10]
state-of-the-art performance on this benchmark was presented by employing LBP descriptor variations [53] and a
dropout-SVM classifier. We show our proposed method to
outperform the results they report on the more challenging
Adience benchmark, designed for the same task.
Gender classification. A detailed survey of gender classification methods can be found in [34] and more recently
in [42]. Here we quickly survey relevant methods.
One of the early methods for gender classification [17]
used a neural network trained on a small set of near-frontal
face images. In [37] the combined 3D structure of the
head (obtained using a laser scanner) and image intensities were used for classifying gender. SVM classifiers
were used by [35], applied directly to image intensities.
Rather than using SVM, [2] used AdaBoost for the same
purpose, here again, applied to image intensities. Finally,
viewpoint-invariant age and gender classification was presented by [49].
More recently, [51] used the Weber Local texture Descriptor [6] for gender recognition, demonstrating near-perfect performance on the FERET benchmark [39].
In [38], intensity, shape and texture features were used with
mutual information, again obtaining near-perfect results on
the FERET benchmark.
Figure 2. Illustration of our CNN architecture. The network contains three convolutional layers, each followed by a rectified linear operation and a pooling layer. The first two layers are also followed by local response normalization [28]. The first convolutional layer contains 96 filters of 7×7 pixels, the second convolutional layer contains 256 filters of 5×5 pixels, and the third and final convolutional layer contains 384 filters of 3×3 pixels. Finally, two fully-connected layers are added, each containing 512 neurons. See Figure 3 for a detailed schematic view and the text for more information.
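The layer dimensions above can be checked with standard convolution output-size arithmetic. The following is a minimal sketch of that calculation; the stride, pooling, and padding settings are assumptions for illustration (the caption does not state them), so the exact numbers are indicative only:

```python
def out_size(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer (floor convention)."""
    return (size + 2 * pad - kernel) // stride + 1

s = 227                          # input crop size used by the network
s = out_size(s, 7, stride=4)     # conv1: 96 filters of 7x7 (stride 4 assumed)
s = out_size(s, 3, stride=2)     # 3x3 max pooling, stride 2 (assumed)
s = out_size(s, 5, pad=2)        # conv2: 256 filters of 5x5 (padding 2 assumed)
s = out_size(s, 3, stride=2)     # pooling
s = out_size(s, 3, pad=1)        # conv3: 384 filters of 3x3 (padding 1 assumed)
s = out_size(s, 3, stride=2)     # pooling
flattened = 384 * s * s          # fed to the first 512-neuron fully-connected layer
```

Under these assumed settings the spatial resolution shrinks from 227×227 to a small final feature map, which is then flattened for the fully-connected layers.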
of the problems we are attempting to solve: age classification on the Adience set requires distinguishing between eight classes; gender only two. This, compared to, e.g., the ten thousand identity classes used to train the network used in [48]. Training labels are represented as sparse, binary vectors whose length equals the number of classes (two for gender, eight for the eight age classes of the age classification task), containing 1 in the index of the ground truth and 0 elsewhere.
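This label encoding can be sketched in a few lines of pure Python (the eight-class age setting is used for illustration):

```python
def one_hot(class_index, num_classes):
    """Sparse, binary target vector: 1 at the ground-truth index, 0 elsewhere."""
    vec = [0] * num_classes
    vec[class_index] = 1
    return vec

# Gender uses two classes; age uses eight.
age_target = one_hot(4, 8)      # e.g. a face whose ground-truth group is the fifth age class
gender_target = one_hot(1, 2)
```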
Network training. Aside from our use of a lean network architecture, we apply two additional methods to further limit the risk of overfitting. First, we apply dropout learning [24] (i.e., randomly setting the output value of network neurons to zero). The network includes two dropout layers with a dropout ratio of 0.5 (a 50% chance of setting a neuron's output value to zero). Second, we use data augmentation by taking a random crop of 227×227 pixels from the 256×256 input image and randomly mirroring it in each forward-backward training pass. This is similar to the multiple crop and mirror variations used by [48].
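The random-crop-and-mirror augmentation can be sketched as follows (a single channel is shown as a 2-D list for simplicity; a real pipeline would operate on full color images):

```python
import random

def augment(image, crop=227):
    """Random 227x227 crop of a 256x256 image plus a random horizontal mirror.

    The crop offsets and the mirror decision are redrawn on every
    forward-backward training pass, so the network rarely sees the
    exact same pixels twice.
    """
    size = len(image)
    y = random.randint(0, size - crop)          # vertical crop offset
    x = random.randint(0, size - crop)          # horizontal crop offset
    patch = [row[x:x + crop] for row in image[y:y + crop]]
    if random.random() < 0.5:                   # mirror with probability 0.5
        patch = [row[::-1] for row in patch]
    return patch

img = [[(r + c) % 256 for c in range(256)] for r in range(256)]
out = augment(img)
```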
Training itself is performed using stochastic gradient descent with a batch size of fifty images. The initial learning rate is 1e-3, reduced to 1e-4 after 10K iterations.
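This schedule corresponds to a step learning-rate policy. A minimal sketch follows; the decay factor of 0.1 and the 10K step size are taken from the text, while treating the schedule as a simple step policy is an assumption:

```python
def learning_rate(iteration, base_lr=1e-3, gamma=0.1, step=10000):
    """Step policy: the learning rate is multiplied by gamma every `step` iterations."""
    return base_lr * (gamma ** (iteration // step))
```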
Prediction. We experimented with two methods of using
the network in order to produce age and gender predictions
for novel faces:
Center Crop: Feeding the network with the face image, cropped to 227×227 pixels around the face center.

Over-sampling: We extract five 227×227 pixel crop regions, four from the corners of the 256×256 face image and an additional crop region from the center of the face. The network is presented with all five images, along with their horizontal reflections. Its final prediction is taken to be the average prediction value across all these variations.
We have found that small misalignments in the Adience images, caused by the many challenges of these images (occlusions, motion blur, etc.), can have a noticeable impact on the quality of our results. The second, over-sampling method is designed to compensate for these small misalignments, bypassing the need to improve alignment quality by directly feeding the network with multiple translated versions of the same face.
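The over-sampling procedure can be sketched as below; `predict` is a hypothetical stand-in for the network's forward pass, returning per-class scores:

```python
def five_crops(image, crop=227):
    """Four corner crops plus a center crop from a square input image."""
    size = len(image)
    m = (size - crop) // 2
    offsets = [(0, 0), (0, size - crop), (size - crop, 0),
               (size - crop, size - crop), (m, m)]
    return [[row[x:x + crop] for row in image[y:y + crop]]
            for y, x in offsets]

def oversample_predict(image, predict):
    """Average predictions over the five crops and their horizontal reflections."""
    crops = five_crops(image)
    crops += [[row[::-1] for row in c] for c in crops]   # add mirrored versions
    preds = [predict(c) for c in crops]                  # ten forward passes
    n = len(preds)
    return [sum(p[i] for p in preds) / n for i in range(len(preds[0]))]

img = [[(r + c) % 256 for c in range(256)] for r in range(256)]
crops = five_crops(img)
avg = oversample_predict(img, lambda c: [0.2, 0.8])      # dummy two-class network
```

With a real network, each of the ten translated or mirrored views contributes equally to the final class scores, which is what makes the prediction robust to small misalignments.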
4. Experiments
Our method is implemented using the Caffe open-source
framework [26]. Training was performed on an Amazon
GPU machine with 1,536 CUDA cores and 4GB of video
memory. Training each network required about four hours; predicting age or gender on a single image using our network requires about 200 ms. Prediction running times could conceivably be improved substantially by running the network on image batches.
4.2. Results
Table 2 and Table 3 present our results for gender and
age classification respectively. Table 4 further provides a
confusion matrix for our multi-class age classification results. For age classification, we measure and compare both
the accuracy when the algorithm gives the exact age-group
classification and when the algorithm is off by one adjacent age-group (i.e., the subject belongs to the group immediately older or immediately younger than the predicted
group). This follows others who have done so in the past and reflects the uncertainty inherent to the task: facial features often change very little between the oldest faces in one age class and the youngest faces of the subsequent class.
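The two metrics can be computed directly from age-group indices, as in this minimal sketch (the indices 0-7 correspond to the eight Adience age groups; the example labels are illustrative):

```python
def accuracies(true_classes, predicted_classes):
    """Exact and 1-off accuracy over age-group indices.

    A prediction is 1-off correct when it hits the ground-truth group
    or an immediately adjacent (older or younger) one.
    """
    n = len(true_classes)
    exact = sum(t == p for t, p in zip(true_classes, predicted_classes)) / n
    one_off = sum(abs(t - p) <= 1
                  for t, p in zip(true_classes, predicted_classes)) / n
    return exact, one_off

# Two exact hits; the off-by-one prediction (3 -> 4) still counts for 1-off.
exact, one_off = accuracies([0, 3, 5, 7], [0, 4, 2, 7])
```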
Both tables compare performance with the methods
described in [10]. Table 2 also provides a comparison
with [23] which used the same gender classification pipeline
of [10] applied to more effective alignment of the faces;
faces in their tests were synthetically modified to appear
facing forward.
Figure 4. Gender misclassifications. Top row: Female subjects mistakenly classified as males. Bottom row: Male subjects mistakenly
classified as females.
Figure 5. Age misclassifications. Top row: Older subjects mistakenly classified as younger. Bottom row: Younger subjects mistakenly
classified as older.
         Male    Female   Both
15-20    734     919      1653
25-32    2308    2589     4897
38-43    1294    1056     2350
48-53    392     433      825
60-      442     427      869
Total    8192    9411     19487
Table 1. The AdienceFaces benchmark. Breakdown of the AdienceFaces benchmark into the different Age and Gender classes.
Evidently, the proposed method outperforms the reported state-of-the-art on both tasks with considerable gaps.
Also evident is the contribution of the over-sampling approach, which provides an additional performance boost
over the original network. This implies that better alignment (e.g., frontalization [22, 23]) may provide an additional boost in performance.
We provide a few examples of both gender and age misclassifications in Figures 4 and 5, respectively. These show
that many of the mistakes made by our system are due to extremely challenging viewing conditions of some of the Adience benchmark images. Most notable are mistakes caused
by blur or low resolution and occlusions (particularly from
heavy makeup). Gender estimation mistakes also frequently
occur for images of babies or very young children where
obvious gender attributes are not yet visible.
Method                        Accuracy
Best from [10]                77.8 ± 1.3
Best from [23]                79.3 ± 0.0
Proposed using single crop    85.9 ± 1.4
Proposed using over-sample    86.8 ± 1.4
Table 2. Gender classification accuracy on the Adience benchmark.

Method                        Exact        1-off
Best from [10]                45.1 ± 2.6   79.5 ± 1.4
Proposed using single crop    49.5 ± 4.4   84.6 ± 1.7
Proposed using over-sample    50.7 ± 5.1   84.7 ± 2.2
Table 3. Age classification accuracy on the Adience benchmark.
5. Conclusions
Though many previous methods have addressed the
problems of age and gender classification, until recently,
much of this work has focused on constrained images taken
in lab settings. Such settings do not adequately reflect appearance variations common to the real-world images in social websites and online repositories. Internet images, however, are not simply more challenging: they are also abundant. The easy availability of huge image collections provides modern machine learning based systems with effectively endless training data, though this data is not always suitably labeled for supervised learning.

        0-2    4-6    8-13   15-20  25-32  38-43  48-53  60-
0-2     0.699  0.256  0.027  0.003  0.006  0.004  0.002  0.001
4-6     0.147  0.573  0.223  0.019  0.029  0.007  0.001  0.001
8-13    0.028  0.166  0.552  0.081  0.138  0.023  0.004  0.008
15-20   0.006  0.023  0.150  0.239  0.510  0.058  0.007  0.007
25-32   0.005  0.010  0.091  0.106  0.613  0.149  0.017  0.009
38-43   0.008  0.011  0.068  0.055  0.461  0.293  0.055  0.050
48-53   0.007  0.010  0.055  0.049  0.260  0.339  0.146  0.134
60-     0.009  0.005  0.061  0.028  0.108  0.268  0.165  0.357
Table 4. Age estimation confusion matrix on the Adience benchmark.
Taking example from the related problem of face recognition, we explore how well deep CNNs perform on these tasks using Internet data. We provide results with a lean deep-learning architecture designed to avoid overfitting due to the limited availability of labeled training data. Our network is
shallow compared to some of the recent network architectures, thereby reducing the number of its parameters and
the chance for overfitting. We further inflate the size of the
training data by artificially adding cropped versions of the
images in our training set. The resulting system was tested
on the Adience benchmark of unfiltered images and shown
to significantly outperform recent state of the art.
Two important conclusions can be drawn from our results.
First, CNN can be used to provide improved age and gender classification results, even considering the much smaller
size of contemporary unconstrained image sets labeled for
age and gender. Second, the simplicity of our model implies that more elaborate systems using more training data
may well be capable of substantially improving results beyond those reported here.
References
[53] L. Wolf, T. Hassner, and Y. Taigman. Descriptor based methods in the wild. In post-ECCV Faces in Real-Life Images Workshop, 2008.
[54] S. Yan, M. Liu, and T. S. Huang. Extracting age information from local spatially flexible patches. In Acoustics, Speech and Signal Processing, pages 737-740. IEEE, 2008.
[55] S. Yan, X. Zhou, M. Liu, M. Hasegawa-Johnson, and T. S. Huang. Regression from patch-kernel. In Proc. Conf. Comput. Vision Pattern Recognition. IEEE, 2008.
[56] X. Zhuang, X. Zhou, M. Hasegawa-Johnson, and T. Huang. Face age estimation using patch-based hidden Markov model supervectors. In Int. Conf. Pattern Recognition. IEEE, 2008.