
A Framework for Real-Time Face-Recognition

Samadhi Wickrama Arachchilage and Ebroul Izquierdo


Multimedia and Vision Group, School of Electronic Engineering and Computer Science
Queen Mary University of London, United Kingdom
Email: {s.wickramaarachchilage, ebroul.izquierdo}@qmul.ac.uk

Abstract—The advent and wide use of deep-learning technology has enabled tremendous advancements in the accuracy of face recognition under favourable conditions. Nonetheless, the near-perfect performance reported on classic benchmarks like LFW does not account for the complications of unconstrained applications. The research reported in this paper addresses some of the critical challenges of face recognition under adverse conditions. In this context, we introduce an end-to-end framework for real-time video-based face recognition. The system detects, tracks and recognizes individuals from a live video feed. The proposed system addresses three key challenges of video-based face recognition systems: end-to-end computational complexity, in-the-wild recognition and multi-person recognition. We exploit sophisticated deep neural networks for face detection and facial feature extraction, while minimizing the computational overhead from the rest of the modules in the recognition pipeline. A comprehensive evaluation shows that the proposed system can effectively recognize faces under unconstrained conditions at elevated frames-per-second rates.

I. INTRODUCTION

The face encompasses significant discriminative features and is perceived by the community as a non-intrusive form of person identification. As opposed to traits like fingerprint and iris, the face can be captured without subject cooperation or awareness, and is arguably a dominant and prominent personal trait with great potential in practical applications.

Face recognition (FR) has advanced rapidly in recent years with sophisticated Deep Convolutional Neural Network (DCNN) based architectures. In 2014, an FR system code-named DeepFace [1] reported near-human performance on the LFW [2] benchmark. Its accuracy was later surpassed by studies and implementations such as DeepId3 [3], FaceNet [4] and the DLIB software library [5].

The reported high accuracy implies commercial viability in certain applications, such as traveler verification at border crossing points, that allow reasonable constraints on illumination, frontal pose and controlled emotion through subject cooperation. These evaluations do not provide sufficient evidence of reliability in complicated applications like real-time surveillance [6].

The studies that followed focused on advanced scenarios such as FR in unconstrained images and video FR [6], [7], [8]. Nevertheless, the majority of studies and benchmarks tend to isolate face recognition as an individual discipline and hence do not provide sufficient insight into critical issues arising from the inevitable integration with modules like face detection (e.g., false recognitions resulting from false detections).

Although end-to-end FR appears to be a single discipline, the underlying research goals vary with application-specific requirements. For example, computational complexity (images/frames per second) is a critical issue in real-time recognition, but is generally flexible in offline processing. Open-set face recognition does not benefit from the closed-set assumption that each probe face has a mate in the gallery, and multi-person video FR requires additional face tracking. Therefore, a generic design is not sufficient for this diverse application base.

This paper presents an end-to-end framework for real-time face recognition designed for an intelligent surveillance application. The system detects, tracks and recognizes individuals in the probe video feed against a gallery of still images. Given the use case, our design goal is to implement open-set FR on unconstrained videos while achieving real-time efficiency. To that end, we adopt a DCNN face detection system that recognizes faces under motion blur, pose and illumination variations. To extract features for detected faces, we adopt a DCNN that has proven to be effective in image classification. Aiming to keep the computational complexity to a minimum, we perform joint detection and alignment, and detection-based tracking.

The evaluation that follows analyses the functionality throughout the FR pipeline. We report the face detection accuracy, the end face recognition accuracy and the end-to-end computational complexity, evaluated under unconstrained environments.

The rest of the paper is organized as follows. Section 2 discusses the existing research on face detection, object tracking, face recognition and FR benchmarks. Section 3 presents the system design, followed by the performance evaluation in Section 4. Section 5 presents the conclusion.

II. RELATED WORK

A classic FR process begins with face detection, followed by face recognition and tracking. This section aims to brief the state of the art along the FR pipeline.

1) Face Detection: Early approaches to face detection include the Haar cascade based face detector by Viola and Jones [9] and Histograms of Oriented Gradients (HOG) for human detection [10]. The two detectors have attractive features; both perform well on high-quality images with frontal pose, the Haar cascade based detector runs in real time, and the HOG detector is particularly resistant to false positives. Nevertheless, both are reported to fail under visual variations such as pose, lighting and blur [6], [11]. More recently, a system named Multi-task Cascaded Convolutional Networks (MTCNN) [11] performed joint face alignment and detection. The approach, while reporting a high true positive rate, does not report near-perfect resilience to false positives.
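For illustration only, the short sketch below runs the classical Viola-Jones style detector using OpenCV's bundled frontal-face Haar cascade; this is not the detector used by the proposed system, and the input file name and parameter values are illustrative assumptions.

```python
import cv2

# OpenCV ships a frontal-face Haar cascade (Viola-Jones style detector).
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

frame = cv2.imread("frame.jpg")                  # hypothetical input frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # the cascade operates on grayscale

# scaleFactor and minNeighbors are illustrative values; tuning them trades
# detection rate against false positives.
boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in boxes:
    print(f"face at x={x}, y={y}, w={w}, h={h}")
```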
2) Face Tracking: In light of deep-learning based object detection algorithms with improved accuracy, the need for separate, sophisticated object tracking has been largely obviated. Instead, such trackers are replaced with rather simple detection-based tracking algorithms like the Intersection over Union (IOU) tracker [12]. These trackers can function at high speeds, thus introducing negligible overhead to the overall system.
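A minimal sketch of detection-based tracking in the spirit of the IOU tracker [12] is given below; the overlap threshold and the active-track rule (a minimum number of hits) are illustrative assumptions, not the exact parameters of the cited implementation.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def update_tracks(tracks, detections, iou_thr=0.5, min_hits=3):
    """Greedy frame-to-frame association: extend each track with its best
    overlapping detection, start new tracks for the rest. Tracks that find
    no match are simply left unextended here (the cited tracker ends them)."""
    unmatched = list(detections)
    for track in tracks:
        best = max(unmatched, key=lambda d: iou(track["box"], d), default=None)
        if best is not None and iou(track["box"], best) >= iou_thr:
            track["box"], track["hits"] = best, track["hits"] + 1
            unmatched.remove(best)
    tracks.extend({"box": d, "hits": 1} for d in unmatched)
    # Only tracks detected often enough count as active and move on to
    # recognition; short-lived tracks act as a filter for spurious detections.
    active = [t for t in tracks if t["hits"] >= min_hits]
    return tracks, active

tracks = []
tracks, active = update_tracks(tracks, [(10, 10, 60, 60)])   # frame 1: new track
tracks, active = update_tracks(tracks, [(12, 11, 62, 61)])   # frame 2: same face
tracks, active = update_tracks(tracks, [(13, 12, 63, 62)])   # frame 3: track becomes active
print(len(active))   # -> 1
```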
3) Face Recognition: An FR system generally includes a DCNN model that maps complex facial features to a low-dimensional feature vector. Depending on the application, these features are then compared for face verification or classified for face recognition.

A DCNN model is comprised of two components, namely the base DCNN and the supervisory signal (loss function). These have been addressed as separate research topics, and their combined outputs have been used in FR. DeepFace [1] uses a 9-layer deep net trained with softmax loss. DeepId3 [3] combines concepts from two deep image recognition frameworks, VGGNet [13] and the Inception architecture from GoogLeNet [14], trained with softmax loss. FaceNet uses GoogLeNet trained with triplet loss. DLIB [5] uses a deep residual network [15] with triplet loss.

While the early research [13] suggests going deeper with convolutions as a straightforward approach towards increased accuracy, the concept has limitations in terms of practicality due to the high level of computing resources deeper nets demand. The studies that followed focused more on computational simplicity achieved through architectural innovations such as the inception modules introduced in GoogLeNet [14], residual learning [15], or a combination of inception and residual learning [16].

While triplet loss has reported competitive performance in FR [4], its performance mainly depends on effective triplet mining. In contrast, softmax loss trains as a classification task and hence reportedly simplifies the implementation [1].
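For reference, a minimal NumPy sketch of the triplet loss used in [4] is shown below; the margin value and the toy embeddings are illustrative assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss on L2-normalised embeddings: pull the anchor
    towards the positive and push it away from the negative by at least
    `margin`, using squared Euclidean distances."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# Toy example with 4-dimensional embeddings.
a = np.array([0.5, 0.5, 0.5, 0.5])
p = np.array([0.6, 0.4, 0.5, 0.48])
n = np.array([0.0, 1.0, 0.0, 0.0])
print(triplet_loss(a, p, n))
```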
4) FR on Videos: Studies on video-based FR have employed three types of approaches: (1) FR on a per-frame basis [1], (2) result pooling over a set of frames [8], and (3) linear integration of information across frames for one-time FR [7]. As opposed to the first, the second and third approaches maintain the information across all frames. Nevertheless, the third reports optimum performance only when face image quality remains consistent throughout, which is less probable in unconstrained environments.

5) Face Datasets: While benchmarks like IJB have addressed video FR [6], these protocols do not facilitate strict image-to-video recognition. Even the few benchmarks that include still-to-video protocols [17] are biased in some form (e.g., limited to a single race) or not entirely compatible with our application requirements (e.g., continuous evaluation throughout the video duration, parallel recognition of multiple people).

III. FR SYSTEM

The proposed system is a combination of five sub-modules, each carrying out a unique task. The face detector performs simultaneous detection and alignment. The embedding generation unit (EGU) maps face thumbnails to feature vectors of fixed dimension. The Euclidean distance between the feature vectors is used as a measure of similarity between two faces. Figure 1 shows the system overview and data flow.

Fig. 1: The data flow of the proposed system, where BBox denotes the bounding box coordinates of a detected face and f_i the feature representation of face i.

During training, the input labeled faces are aligned by the detector and the bounding boxes of the aligned faces are fed to the EGU. The feature representations output by the EGU are stored with the corresponding label. During testing, the processing unit breaks the video into frames at the system-specified fps level and forwards the frames for face detection. The detector outputs bounding boxes (bbox) of each detected face. The faces are tracked by the reported bbox coordinates. Active tracks (a track is considered active if the detection is regular over a threshold number of frames) are forwarded for FR. The immediate recognition result and the tracked previous outputs are combined for an optimum current result.
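The sketch below illustrates this enrolment-and-matching flow with plain NumPy: gallery faces are stored as labelled feature vectors, a probe feature is assigned the identity of its nearest gallery vector under a Euclidean-distance threshold (the open-set reject case), and the per-frame results of an active track are combined by majority pooling. The embedding values, the threshold and the pooling fraction are illustrative assumptions, not the system's tuned parameters.

```python
import numpy as np
from collections import Counter

gallery = {}  # identity label -> list of EGU feature vectors stored at enrolment

def enrol(label, feature):
    gallery.setdefault(label, []).append(np.asarray(feature, dtype=float))

def shallow_recognition(probe, threshold=1.1):
    """Closest gallery identity by Euclidean distance, or None (open-set reject)."""
    probe = np.asarray(probe, dtype=float)
    best_label, best_dist = None, np.inf
    for label, feats in gallery.items():
        for f in feats:
            d = np.linalg.norm(probe - f)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label if best_dist <= threshold else None

def track_identity(shallow_results, min_fraction=0.6):
    """Combine the per-frame (shallow) results of an active track: report the
    most frequent identity only if it covers enough of the track's frames."""
    votes = [r for r in shallow_results if r is not None]
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(shallow_results) >= min_fraction else None

enrol("alice", [0.1, 0.9, 0.2])
enrol("bob",   [0.8, 0.1, 0.3])
frames = [shallow_recognition(p) for p in ([0.15, 0.85, 0.25],
                                           [0.12, 0.88, 0.20],
                                           [0.70, 0.20, 0.30])]
print(track_identity(frames))   # -> "alice" (2 of 3 frames agree)
```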
The face detection-alignment unit is an implementation of the MTCNN architecture [11]. Figure 2 shows sample false positives and false negatives mined by the detector. A percentage of the generated false positives are filtered out by tracking; the ones that make it into an active track by appearing regularly result in false recognitions.

Fig. 2: Sample true positives (left) and false positives (right) mined by the MTCNN face detector.

The EGU is a DCNN model of the Inception-ResNet-v1 architecture discussed in [16], trained with softmax loss.
The model is trained with the VGGFace2 dataset, which contains 3.31 million images of 9,131 subjects [18]. Given the limited availability of application-specific data, we perform transfer learning. Generally, transfer learning involves a source domain, which is trained offline, and a target domain for online processing. In this context, the source domain is the pre-trained model and the target domain is the application-specific data.
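A minimal PyTorch-style sketch of this kind of transfer learning is given below. The backbone is a hypothetical stand-in for the pre-trained source-domain network (not the actual Inception-ResNet-v1 weights), and the layer sizes, batch and learning rate are illustrative assumptions: the source-domain weights are frozen and only a new softmax classifier over the target-domain identities is trained.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the pre-trained source-domain feature extractor.
backbone = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 128),                      # 128-D embedding (illustrative size)
)
for p in backbone.parameters():
    p.requires_grad = False                  # keep source-domain weights fixed

classifier = nn.Linear(128, 38)              # one logit per gallery identity (38 here)
optimiser = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()              # softmax loss, as in the EGU training

# One illustrative training step on a dummy batch of aligned face crops.
images = torch.randn(4, 3, 160, 160)
labels = torch.randint(0, 38, (4,))
optimiser.zero_grad()
logits = classifier(backbone(images))
loss = loss_fn(logits, labels)
loss.backward()
optimiser.step()
```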
The IOU [12] based tracking mechanism is essential to the application requirements and beneficial in implementation. It enables the exploitation of the collective information that a set of video frames provides, by tracking each face, and it additionally functions as a filter for false detections that occur due to motion blur and movement patterns; such detections are unlikely to keep recurring and hence are filtered out as dead tracks.

The closest known face calculated by the classifier generates a shallow recognition result. The Euclidean distance between the face in the video and its closest mate in the gallery images is subjected to a threshold. The threshold distance t is determined such that two faces of the same identity lie below the threshold and faces with different identities exceed it. The subject identity at any given point in the video is determined by max pooling over a series of prior shallow results. The highest recorded shallow identity is expected to be reported over a given percentage n of the considered set of frames to qualify as the end recognition result. The percentage n is determined such that it is high enough to rule out false recognitions arising from pose and expression variations, and low enough to extract the true result among random false recognitions.

IV. PERFORMANCE EVALUATION

The evaluations aim to verify the system performance against the ideal specifications. Given the scenario of real-time surveillance, where the cost of false alarms (registered individuals recognized as possible intruders) is high and the cost of missed alarms (possible intruders recognized as registered) is even higher, we strive for a minimum false positive identification rate (FPIR) with reasonable flexibility on false negative rejections.

The two systems in comparison are two state-of-the-art FR software packages, namely DLIB [5] and OpenFace [19]. OpenFace is typically designed for closed-set FR and was evaluated only on a subset of the test protocol.

The ideal experimental setup would be to run the systems under comparison in parallel with the real-time participation of subjects; a less ideal yet not unreasonable approach would be to replicate the scenario with a dataset. Due to practical complications arising from limited computational resources and conflicting availability of test subjects, we decided on the latter. The custom-collected dataset includes a gallery set of 150 still images of 38 individuals and a probe set of 22 videos that contain up to 7 people in a frame. Please refer to Figure 3 for sample data. The probe video frames include 6655 faces, out of which 2793 do not have a mate in the gallery.

Fig. 3: Samples from the evaluation dataset. Top: samples from the still image gallery. Bottom: samples from the probe videos.

Our performance metrics are as follows. Accuracy is measured as a percentage of the reported results. The requirement of real-time surveillance is to recognize a person as and when they appear in the video. Thus, we introduce a rather challenging metric for overall accuracy (Accuracy*), which measures true classifications against the total number of faces appearing in the probe video dataset.

Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%

Accuracy* = \frac{TP + TN}{\text{Number of faces in the probe videos}} \times 100\%

In the equations, TP denotes true positives, TN true negatives, FP false positives and FN false negatives.

Similarly, FPIR is the false positive identifications against the reported recognitions, and FPIR* is the false positive identifications against the number of faces appearing in the probe video dataset.

Figure 4 (left) plots Accuracy* against FPIR*. Our system reports higher levels of Accuracy* at lower levels of FPIR* in comparison to DLIB. The results also indicate the cost of tracking within our system: the detections recorded before a track qualifies as an active track do not produce recognitions, which reduces Accuracy*. Since tracking enables a processed result rather than an immediate shallow recognition, the cost of tracking is an arguably reasonable trade-off. In an attempt to minimize this trade-off, we aim to implement IOU tracking over the generated facial features as possible future research.

As shown in Table I, our system shows considerable superiority in face detection in terms of true acceptance rate (TAR). (Please note that the false acceptance rate was not recorded, since the impact of false detections is included in the end recognition result as FP or FN.) In contrast, DLIB detects the clear faces, filtering out the hard samples. The clear face thumbnails make the FR process more convenient. This is reflected in the Accuracy-FPIR graph in Figure 4 (right), where DLIB reports higher performance when only the reported results are accounted for.

TABLE I: Face detection results

FR System        | Face detection algorithm used by the system | TAR (%)
OpenFace         | HOG [10] with alignment [20]                | 82.5
DLIB             | HOG [10]                                    | 88.01
Proposed system  | MTCNN [11]                                  | 99.75

Fig. 4: Left: Accuracy* against FPIR* plot. Right: Accuracy against FPIR plot.

Figure 5 plots the end-to-end computational complexity recorded on an Intel Core i7-7740X CPU @ 4.30 GHz. The results show that our system outperforms both DLIB and OpenFace, with an average of 0.12 seconds for end-to-end recognition of one face.

Fig. 5: End-to-end computational complexity measured as fps rate (average computational complexity in fps against persons per frame).

V. CONCLUSION

This paper presents a framework for real-time face recognition. We address some of the critical aspects of FR, such as end-to-end computational complexity, recognition in the wild and multi-person recognition. In a series of comprehensive evaluations, our system shows substantial potential for practical application in challenging scenarios. We conclude with a note on possible further enhancements as future research.

ACKNOWLEDGMENT

The research activity leading to this publication has been partially funded by the European Union Horizon 2020 research and innovation program under grant agreement No. 787123 (PERSONA RIA project).

REFERENCES

[1] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "Deepface: Closing the gap to human-level performance in face verification," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2014.
[2] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," in Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, 2008.
[3] Y. Sun, D. Liang, X. Wang, and X. Tang, "Deepid3: Face recognition with very deep neural networks," CoRR, vol. abs/1502.00873, 2015.
[4] F. Schroff, D. Kalenichenko, and J. Philbin, "Facenet: A unified embedding for face recognition and clustering," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015.
[5] "High quality face recognition with deep metric learning." http://blog.dlib.net/2017/02/high-quality-face-recognition-with-deep.html. Accessed: 2019-05-20.
[6] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, and A. K. Jain, "Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015.
[7] J. Yang, P. Ren, D. Zhang, D. Chen, F. Wen, H. Li, and G. Hua, "Neural aggregation network for video face recognition," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017.
[8] A. R. Chowdhury, T. Lin, S. Maji, and E. Learned-Miller, "One-to-many face recognition with bilinear cnns," in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1-9, Mar. 2016.
[9] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), vol. 1, pp. I-I, Dec. 2001.
[10] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Jun. 2005.
[11] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multitask cascaded convolutional networks," IEEE Signal Processing Letters, vol. 23, pp. 1499-1503, Oct. 2016.
[12] E. Bochinski, V. Eiselein, and T. Sikora, "High-speed tracking-by-detection without using image information," in 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1-6, Aug. 2017.
[13] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in International Conference on Learning Representations, Sept. 2015.
[14] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015.
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016.
[16] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, inception-resnet and the impact of residual connections on learning," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[17] Z. Huang, S. Shan, R. Wang, H. Zhang, S. Lao, A. Kuerban, and X. Chen, "A benchmark and comparative study of video-based face recognition on COX face database," IEEE Transactions on Image Processing, vol. 24, pp. 5967-5981, Dec. 2015.
[18] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, "Vggface2: A dataset for recognising faces across pose and age," in International Conference on Automatic Face and Gesture Recognition, 2018.
[19] B. Amos, B. Ludwiczuk, and M. Satyanarayanan, "Openface: A general-purpose face recognition library with mobile applications," tech. rep., CMU-CS-16-118, CMU School of Computer Science, 2016.
[20] V. Kazemi and J. Sullivan, "One millisecond face alignment with an ensemble of regression trees," in 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1867-1874, Jun. 2014.
