Analysis Based On Recent Deep Learning Approaches Applied in Real-Time Multi-Object Tracking A Review


Received February 1, 2021, accepted February 15, 2021, date of publication February 22, 2021, date of current version March 2, 2021.
Digital Object Identifier 10.1109/ACCESS.2021.3060821

Analysis Based on Recent Deep Learning Approaches Applied in Real-Time Multi-Object Tracking: A Review
LESOLE KALAKE 1, WANGGEN WAN 1, (Senior Member, IEEE), AND LI HOU2
1 School of Communications and Information Engineering, Institute of Smart City, Shanghai University, Shanghai 200444, China
2 School of Information Engineering, Huangshan University, Huangshan 245041, China

Corresponding author: Lesole Kalake ([email protected])

This work was supported in part by the Science and Technology Commission of Shanghai Municipality under Grant 18510760300, in part
by the Anhui Natural Science Foundation under Grant 1908085MF178, in part by the China Postdoctoral Science Foundation under
Grant 2020M681264, and in part by the Anhui Excellent Young Talents Support Program under Project gxyqZD2019069.

ABSTRACT The deep learning technique has proven to be effective in the classification and localization
of objects on the image or ground plane over time. The strength of the technique’s features has enabled
researchers to analyze object trajectories across multiple cameras for online multi-object tracking (MOT)
systems. In the past five years, these technical features have gained a reputation in handling several
real-time multiple object tracking challenges. This contributed to the increasing number of proposed deep
learning methods (DLMs) and networks seen by the computer vision community. The technique efficiently
handled various challenges in real-time MOT systems and improved overall tracking performance. However,
it still had difficulty detecting and tracking objects in overcrowded scenes and under motion variations
and confusing appearance variations. Therefore, in this paper, we summarize and analyze the
95 contributions made in the past five years on deep learning-based online MOT methods and networks that
rank highest in the public benchmark. We review their expedition, performance, advantages, and challenges
under different experimental setups and tracking conditions. We also further categorize these methods and
networks into four main themes: Online MOT Based Detection Quality and Associations, Real-Time MOT
with High-Speed Tracking and Low Computational Costs, Modeling Target Uncertainty in Online MOT,
and Deep Convolutional Neural Network (DCNN), Affinity and Data Association. Finally, we discuss the
ongoing challenges and directions for future research.

INDEX TERMS Deep learning, detection quality, high-speed tracking, multi-camera object tracking,
real-time tracking.

The associate editor coordinating the review of this manuscript and approving it for publication was Charith Abhayaratne.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

32650 VOLUME 9, 2021

I. INTRODUCTION
In the past five years, deep learning-based online multi-object tracking (MOT) paradigms have been
inferior to sparse principal component analysis [1], [2]. The emergence and expansion of convolutional
neural networks (CNNs) to DCNNs strengthened DLMs and tracking-by-detection (TBDs), thus contributing to
discernible progress in online MOTs [3]–[7]. The DCNN features and neural layers were used to detect
and track countless objects that move on the streets and public spaces [8], [9]. In contrast, the TBD
is used to optimize the tracker’s discriminative model, locate the target in the next frame based on
detection results, and then generate and link object tracklets accordingly [3], [10]. This improved and
strengthened the detection and tracking processes to address the challenges of online MOTs using
multiple cameras. It also gradually expanded deep learning approaches in real-time MOTs based on the
single-camera tracking technique. However, the approaches implemented with the single-camera tracking
technique seemed more effective for offline MOT [11], [12] and harmed many algorithms due to the view
angle. The view angle had limitations and could not provide multiple angles, hence making the
single-camera technique’s algorithms susceptible to velocity variations and vulnerable to misdetections,
occlusions, and fragmentations [13] due to both camera and object
movements [14], [15]. This led to ineffective localization of multiple objects, feature extraction,
bounding box regression, and tracklet generation, and contributed to inappropriate matching or mapping
of the specific appearance information [6], [16], [17].
Currently, researchers [5], [11], [18], [19] have summarized only the multi-object tracking literature
predicated on general visual tracking and detection techniques based on experimental studies rather
than concentrating on deep learning methods based on online MOT. In the past five years, several
proposed approaches have shown a significant performance enhancement in real-time MOT and were able to
approximate human vision. They have impressively promoted tracking performance by reducing the
misdetection rate with the integration of a tracking-by-detection paradigm [20]–[24]. This led to the
emergence of various efficient and robust algorithms with minimum real-time tracking challenges and
complications in video data processing [1], [5]. Therefore, it is important to summarize and analyze
the existing DLMs and network-based online MOTs to pave the way for further studies. Hence, the present
paper presents a systematic review of progress, challenges, and future research opportunities on
DLM-based online multi-object tracking applications. It further compares and discusses how they
enhanced the performance in online MOTs with various public datasets in various environmental setups.
It then discusses the main functionalities and implementation strategies in detail.
This paper is organized as follows: Section I provides a brief background on online multiple object
tracking (MOT) and problem formulations. Section II presents the methodology for gathering relevant
works. Section III discusses the extensive literature by considering deep learning-based online
multi-object tracking methods’ advantages and persisting challenges. Section IV discusses the
effectiveness of deep learning based on categorized themes: online MOT based on detection quality and
associations, real-time MOT with high-speed tracking and low computational costs, modeling target
uncertainty in online MOT, convolutional neural networks (CNNs), and affinity and data associations.
Section V concludes the study.

A. ONLINE MULTI-OBJECT TRACKING (MOT) PROBLEM FORMATION
Online multi-object tracking (MOT) is the variation of problem estimations based on the given input
video sequence with several moving objects in frames [21]. It plays an essential role in video
surveillance applications by locating moving objects in the video frames taken by either a single
camera or multiple networked cameras. It forms the process of detecting, locating, associating, and
tracking objects over a period by collecting the observations from the initial frame until the last
frame. It then optimizes the sequential states by modeling the maximum a posteriori estimation from the
conditional distribution over all sequential states of all objects from the first frame to the last
frame [25]. Wen et al. [26] capitalized on this theorem by creating CLEAR MOT evaluation metrics that
have been implemented in recent work on deep learning-based real-time MOT methods, multi-camera
tracking techniques (MCTs), and DCNNs with the tracking-by-detection (TBD) approach to track objects
across multiple frames [19], [26]. These evaluation metrics enabled the standard calculation and
presentation of multiple object tracking results on false positive (FP), false negative (FN), false
alarm (FA), fragments of target trajectories (FM), multi-object tracking accuracy (MOTA), and
multi-object tracking precision (MOTP) of public datasets created based on both single-camera and
multi-camera video capturing in different environmental scenes. Therefore, it was necessary for
Wen et al. [26] to further benchmark and define the CLEAR MOT metric formulas for both MOTA and MOTP as
follows:

\[
\mathrm{MOTA} = 1 - \frac{\sum_{v}\sum_{t}\left(FN_{v,t} + FP_{v,t} + IDS_{v,t}\right)}{\sum_{v}\sum_{t} GT_{v,t}} \tag{1}
\]

where $FN_{v,t}$ and $FP_{v,t}$ denote false negatives and false positives, respectively, $IDS_{v,t}$
represents the identity switches of trajectories, and $GT_{v,t}$ is the number of ground truth objects
at time index $t$ of sequence $v$. The MOTP metric is defined as the average dissimilarity between true
positives and ground truth:

\[
\mathrm{MOTP} = \frac{\sum_{i,t} d_{i}^{t}}{\sum_{t} c_{t}} \tag{2}
\]

where $c_{t}$ denotes the number of matches in frame $t$ and $d_{i}^{t}$ is the bounding box overlap of
target $i$ in frame $t$ with its assigned ground truth object.

B. TRADITIONAL SINGLE-CAMERA MULTI-OBJECT TRACKING
The single-camera tracking (SCT) technique, as illustrated in Fig. 1, is a cost-inefficient traditional
technical method used to detect multiple views of different objects. It enables the enhancement of
trackers to track multiple objects in a video frame sequence based on the detection quality [27].
However, it provides a one-sided view and cannot provide multiple views due to its limitations in
handling rotations, scaling, affinity distortions, quick movements, similarities, and occlusions [28],
[29]. These limitations led to degraded overall detector performance, and Lee and Hong [30]
incorporated separate detectors and classifiers for several different viewpoints to improve the
detector performance.

FIGURE 1. Single Camera Multi-Object Tracking Overview [19].
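
As a worked illustration of the CLEAR MOT formulas in Eqs. (1) and (2), the following Python sketch computes MOTA and MOTP from per-frame counts. It is a minimal sketch under stated assumptions, not the evaluation code of any benchmark: the `FrameStats` container, its field names, and the functions `mota` and `motp` are hypothetical names introduced only for this example. The per-match scores are simply averaged, so whether they are overlaps or distances depends on the convention of the benchmark in use.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class FrameStats:
    """Counts for one frame t of one sequence v (hypothetical container)."""
    fn: int   # false negatives, FN_{v,t}
    fp: int   # false positives, FP_{v,t}
    ids: int  # identity switches, IDS_{v,t}
    gt: int   # ground truth objects, GT_{v,t}
    # One score d_i^t per matched target i; its length plays the role of c_t.
    match_scores: List[float] = field(default_factory=list)


def mota(sequences: Sequence[Sequence[FrameStats]]) -> float:
    """Eq. (1): one minus the summed errors over the summed ground truth."""
    frames = [f for seq in sequences for f in seq]
    total_gt = sum(f.gt for f in frames)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground truth objects")
    errors = sum(f.fn + f.fp + f.ids for f in frames)
    return 1.0 - errors / total_gt


def motp(sequences: Sequence[Sequence[FrameStats]]) -> float:
    """Eq. (2): summed match scores divided by the total number of matches."""
    frames = [f for seq in sequences for f in seq]
    total_matches = sum(len(f.match_scores) for f in frames)
    if total_matches == 0:
        raise ValueError("MOTP is undefined without matches")
    return sum(sum(f.match_scores) for f in frames) / total_matches


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from any dataset.
    seq = [
        FrameStats(fn=1, fp=0, ids=0, gt=4, match_scores=[0.5, 0.75, 1.0]),
        FrameStats(fn=1, fp=1, ids=1, gt=4, match_scores=[0.75, 0.5, 0.5]),
    ]
    print(mota([seq]))  # 1 - 4/8 = 0.5
    print(motp([seq]))  # 4.0 / 6, about 0.667
```

Because the numerator of Eq. (1) is not bounded by the denominator, MOTA becomes negative whenever the summed errors exceed the number of ground truth objects.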