
Received February 1, 2021, accepted February 15, 2021, date of publication February 22, 2021, date of current version March 2, 2021.
Digital Object Identifier 10.1109/ACCESS.2021.3060821

Analysis Based on Recent Deep Learning Approaches Applied in Real-Time Multi-Object Tracking: A Review
LESOLE KALAKE 1, WANGGEN WAN 1, (Senior Member, IEEE), AND LI HOU2
1 School of Communications and Information Engineering, Institute of Smart City, Shanghai University, Shanghai 200444, China
2 School of Information Engineering, Huangshan University, Huangshan 245041, China

Corresponding author: Lesole Kalake ([email protected])


This work was supported in part by the Science and Technology Commission of Shanghai Municipality under Grant 18510760300, in part
by the Anhui Natural Science Foundation under Grant 1908085MF178, in part by the China Postdoctoral Science Foundation under
Grant 2020M681264, and in part by the Anhui Excellent Young Talents Support Program under Project gxyqZD2019069.

ABSTRACT The deep learning technique has proven to be effective in the classification and localization
of objects on the image or ground plane over time. The strength of the technique’s features has enabled
researchers to analyze object trajectories across multiple cameras for online multi-object tracking (MOT)
systems. In the past five years, these technical features have gained a reputation in handling several
real-time multiple object tracking challenges. This contributed to the increasing number of proposed deep
learning methods (DLMs) and networks seen by the computer vision community. The technique efficiently
handled various challenges in real-time MOT systems and improved overall tracking performance. However,
it experienced difficulties in detecting and tracking objects in overcrowded scenes and under motion
variations and confusing appearance variations. Therefore, in this paper, we summarize and analyze the
95 contributions made in the past five years on deep learning-based online MOT methods and networks that
rank highest in the public benchmark. We review their expedition, performance, advantages, and challenges
under different experimental setups and tracking conditions. We also further categorize these methods and
networks into four main themes: Online MOT Based Detection Quality and Associations, Real-Time MOT
with High-Speed Tracking and Low Computational Costs, Modeling Target Uncertainty in Online MOT,
and Deep Convolutional Neural Network (DCNN), Affinity and Data Association. Finally, we discuss the
ongoing challenges and directions for future research.

INDEX TERMS Deep learning, detection quality, high-speed tracking, multi-camera object tracking,
real-time tracking.

I. INTRODUCTION
In the past five years, deep learning-based online multi-object tracking (MOT) paradigms have been inferior to sparse principal component analysis [1], [2]. The emergence and expansion of convolutional neural networks (CNNs) to DCNNs strengthened DLMs and tracking-by-detection (TBDs), thus contributing to discernible progress in online MOTs [3]–[7]. The DCNN features and neural layers were used to detect and track countless objects that move on the streets and public spaces [8], [9]. In contrast, the TBD is used to optimize the tracker’s discriminative model, locate the target in the next frame based on detection results, and then generate and link object tracklets accordingly [3], [10]. This improved and strengthened the detection and tracking processes to address the challenges of online MOTs using multiple cameras. It also gradually expanded deep learning approaches in real-time MOTs based on the single-camera tracking technique. However, the approaches implemented with the single-camera tracking technique seemed more effective for offline MOT [11], [12] and harmed many algorithms due to the view angle. The view angle had limitations and could not provide multiple angles, hence making the single-camera technique’s algorithms susceptible to velocity variations and vulnerable to misdetections, occlusions, and fragmentations [13] due to both camera and object

The associate editor coordinating the review of this manuscript and approving it for publication was Charith Abhayaratne.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
32650 VOLUME 9, 2021
L. Kalake et al.: Analysis Based on Recent Deep Learning Approaches Applied in Real-Time MOT

movements [14], [15]. This ineffectively localized multiple objects, extracted features, created bounding box regression detections, generated tracklets, and contributed to inappropriate matching or mapping of the specific appearance information [6], [16], [17].
Currently, researchers [5], [11], [18], [19] have summarized only the multi-object tracking literature predicated on general visual tracking and detection techniques based on experimental studies rather than concentrating on deep learning methods based on online MOT. In the past five years, several proposed approaches have shown a significant performance enhancement in real-time MOT and were able to approximate human vision. They have impressively promoted tracking performance by reducing the misdetection rate with the integration of a tracking-by-detection paradigm [20]–[24]. This led to the emergence of various efficient and robust algorithms with minimum real-time tracking challenges and complications in video data processing [1], [5]. Therefore, it is important to summarize and analyze the existing DLMs and network-based online MOTs to pave the way for further studies. Hence, the present paper presents a systematic review of progress, challenges, and future research opportunities on DLM-based online multi-object tracking applications. It further compares and discusses how they enhanced the performance in online MOTs with various public datasets in various environmental setups. It then discusses the main functionalities and implementation strategies in detail.
This paper is organized as follows: Section I provides a brief background on online multiple object tracking (MOT) and problem formulations. Section II presents the methodology for gathering relevant works. Section III discusses the extensive literature by considering deep learning-based online multi-object tracking methods’ advantages and persisting challenges. Section IV discusses the effectiveness of deep learning based on categorized themes: online MOT based on detection quality and associations, real-time MOT with high-speed tracking and low computational costs, modeling target uncertainty in online MOT, and convolutional neural networks (CNNs), affinity, and data associations. Section V concludes the study.

A. ONLINE MULTI-OBJECT TRACKING (MOT) PROBLEM FORMATION
Online multi-object tracking (MOT) is the variation of problem estimations based on the given input video sequence with several moving objects in frames [21]. It plays an essential role in video surveillance applications by locating moving objects in the video frames taken by either a single camera or multiple networked cameras. It forms the process of detecting, locating, associating, and tracking objects over a period by collecting the observations from the initial frame until the last frame. It then optimizes the sequential states by modeling the maximum a posteriori estimation from the conditional distribution over the sequential states of all objects from the first frame to the last frame [25]. Wen et al. [26] capitalized on this theorem by creating CLEAR MOT evaluation metrics that have been implemented in neoteric work on deep learning-based real-time MOT methods, multi-camera tracking techniques (MCTs), and DCNNs with the tracking-by-detection (TBD) approach to track objects across multiple frames [19], [26]. These evaluation metrics enabled the standard calculations and presentation of multiple object tracking results on false positive (FP), false negative (FN), false alarm (FA), fragments of target trajectories (FM), multi-object tracking accuracy (MOTA), and multi-object tracking precision (MOTP) of public datasets created based on both single-camera and multi-camera video capturing on different environmental scenes. Therefore, it was necessary for Wen et al. [26] to further benchmark and define the CLEAR MOT metric formulas for both MOTA and MOTP as follows:

$$\mathrm{MOTA} = 1 - \frac{\sum_{v}\sum_{t}\left(FN_{v,t} + FP_{v,t} + IDS_{v,t}\right)}{\sum_{v}\sum_{t} GT_{v,t}} \tag{1}$$

where $FN_{v,t}$ and $FP_{v,t}$ denote false negatives and false positives, respectively. Then, $IDS_{v,t}$ represents identity switches of trajectories, and $GT_{v,t}$ is the number of ground truth objects at time index $t$ of sequence $v$. Then, the MOTP metric is defined as the average dissimilarity between true positives and ground truth:

$$\mathrm{MOTP} = \frac{\sum_{i,t} d_i^t}{\sum_{t} c_t} \tag{2}$$

where $c_t$ denotes the number of matches in frame $t$ and $d_i^t$ is the bounding box overlap per frame target with its assigned ground truth objects.

B. TRADITIONAL SINGLE-CAMERA MULTI-OBJECT TRACKING
The single-camera tracking (SCT) technique, as illustrated in Fig. 1, is a cost-inefficient traditional technical method used to detect multiple views of different objects. It enables the enhancement of trackers to track multiple objects in a video frame sequence based on the detection quality [27]. However, it provides a one-sided view and cannot provide multiple views due to its limitations in handling rotations, scaling, affine distortions, quick movements, similarities, and occlusions [28], [29]. These limitations led to degraded overall detector performance, and Lee and Hong [30] incorporated separate detectors and classifiers for several different viewpoints to improve the detector performance.

FIGURE 1. Single Camera Multi-Object Tracking Overview [19].
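As a worked illustration of Eqs. (1) and (2), the following minimal Python sketch computes MOTA and MOTP from per-frame counts. The `FrameStats` container, its field names, and the toy numbers are assumptions made for this example and do not come from any benchmark toolkit. The sketch also pools matches over all sequences when computing MOTP, which Eq. (2) does not state explicitly.

```python
from dataclasses import dataclass, field

@dataclass
class FrameStats:
    """Per-frame counts for one sequence (illustrative field names)."""
    fn: int    # false negatives FN_{v,t}
    fp: int    # false positives FP_{v,t}
    ids: int   # identity switches IDS_{v,t}
    gt: int    # ground-truth objects GT_{v,t}
    overlaps: list = field(default_factory=list)  # d_i^t, one per matched pair

def mota(sequences):
    """Eq. (1): 1 - sum(FN + FP + IDS) / sum(GT) over sequences v and frames t."""
    errors = sum(f.fn + f.fp + f.ids for seq in sequences for f in seq)
    total_gt = sum(f.gt for seq in sequences for f in seq)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    return 1.0 - errors / total_gt

def motp(sequences):
    """Eq. (2): summed overlap of matched pairs divided by the number of matches."""
    total_overlap = sum(d for seq in sequences for f in seq for d in f.overlaps)
    matches = sum(len(f.overlaps) for seq in sequences for f in seq)  # sum of c_t
    if matches == 0:
        raise ValueError("MOTP is undefined without matched pairs")
    return total_overlap / matches

# Toy sequence with two frames (made-up numbers).
seq = [
    FrameStats(fn=1, fp=0, ids=0, gt=4, overlaps=[0.8, 0.7, 0.9]),
    FrameStats(fn=0, fp=1, ids=1, gt=4, overlaps=[0.6, 0.8, 0.7, 0.9]),
]
print(mota([seq]))  # 3 errors over 8 ground-truth objects -> 0.625
print(motp([seq]))  # mean overlap over the 7 matched pairs
```

Note that MOTA can become negative when the number of errors exceeds the number of ground-truth objects, so it is not a percentage in the strict sense.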


However, the combination could not produce satisfactory detection results due to difficulties in handling occlusions and misdetections on each detector and classifier [5]. Then, Fajardo et al. [32] and Azad and Misbahuddin [31] further contributed to enhancing the detector’s performance by labeling the objects on the output of the maximal classifiers. They stretched and reinvigorated the algorithm by estimating the object distance and detection with tracking-by-detection in a deep convolutional neural network (DCNN). They utilized the network layers to extract features from the input video frame sequences with learnable filters and added biases from the parameters of each layer. Then, these filters and biases are represented by $w = \sum_{i=1}^{k} w_i$ and $\sum_{i=1}^{k} b_i$, respectively. The generated feature map was represented by $X_k$ and used to pass the results to the next layer as an element of $\sigma$ repeatedly on each convolutional layer.

$$X_k^t = \sigma\left(W_k^{t-1} \cdot X^{t-1} + b_k^{t-1}\right) \tag{3}$$

The approach successfully overcame tracklet loss by handling multiple object new identities (IDs) and reassigning issues [14]. However, the rotation and one-sided view in the single-camera technique [33] contributed to the lack of robustness and difficulties in handling long occlusions. This resulted in high fragmentation, velocity changes, and appearance changes [24]. The challenges caused the splitting of camera object tracking into two tasks, i.e., single-camera tracking (SCT) and inter-camera object tracking (ICT) [23]. Then, SCT is used to obtain multi-object trajectories in a single camera view connected across multiple camera views through ICT [14]. Therefore, this laid a solid foundation for DLMs with MCT techniques based on online MOT [34], [35].

C. MULTI-CAMERA FOR MULTI-OBJECT TRACKING (MCT)
The technique has capitalized on the foundation laid with SCT approaches [35]. It uses ICT, as shown in Fig. 2, to capture the object across each camera on different angle views despite the velocity and appearance variations [14]. The object detection from different camera views is associated with trajectory objects and tracked independently on each camera view [36]. Then, the velocity and position of object features are computed by grouping trajectories into one cluster that enables the connection among the camera views to handle variations in motion, speed, and direction [37]. However, the multi-camera multi-object tracking technique needs to maintain the identity consistency of each target across multiple views and struggles when objects have similar appearances [38], [95]. In this case, most current deep learning-based online MOT systems introduced the DCNN and tracking-by-detection (TBD) paradigm to solve the problem of associating a target with multiple potential views [39]. They are designed in an end-to-end deep neural network to learn the association between tracks and detections, state updates, initialization, and termination of tracks [40]. They are further employed in the real-time tracking framework so that the associations between tracklets and detections are cascaded from high-confidence tracklets to low-confidence tracklets [41].

II. METHODOLOGY
We performed two systematic electronic searches in Google Scholar and Web of Science according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [42]. An extensive database search was conducted via expressions with the most essential terms, such as “Multi-Object Tracking”, “Real-time Multi-Object Tracking”, “Deep Learning Object Tracking”, “Online Multi-Object Tracking”, “High-Speed Tracking”, “Deep Convolution Neural Network”, and “Target Detection and Tracking”, over the last 5 years, from 2015-2020. The final search in these databases was performed on the 25th of July 2020 and was restricted to peer-reviewed documents, such as journals and conference papers. Then, 80 duplicates within the retrieved articles in the databases were removed. As depicted in Fig. 3, we initialized the search expression with the diverse coalescence of key terms such as “Multi-Object Tracking, Target Detection, Tracking, and Real-time Multi-Object Tracking” that were used on Web of Science and Google Scholar, and this returned 5,000 articles. We further intensified the search expression by adding “Online Multi-Object Tracking”, and 1,500 articles with duplicates were returned. We further restricted and reinvigorated both the search expressions and filter by adding the “Deep Convolutional Neural Network (DCNN)” and “High-Speed Tracking” terms and the publications’ range period (2015-2020), and screened the outcomes (180 articles) to eliminate duplicates while ensuring authenticity and competency. Therefore, the study reviewed 95 peer-reviewed papers published within the past five years that supported the DLM-based online MOT, tracking-by-detection, and DCNN.
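To make the layer update in Eq. (3) of Section I-B more concrete, the sketch below applies it once to a one-dimensional signal. This is only an illustration under stated assumptions: the input has a single channel, the logistic function stands in for $\sigma$, the filters are slid over the input without flipping or padding, and the names `sigmoid` and `conv1d_layer` are hypothetical rather than taken from any of the reviewed networks.

```python
import math

def sigmoid(z):
    """Logistic activation, used here as one possible choice for sigma."""
    return 1.0 / (1.0 + math.exp(-z))

def conv1d_layer(x_prev, filters, biases, activation=sigmoid):
    """Apply Eq. (3) once: feature map k = sigma(W_k . X + b_k).

    x_prev  : list of floats, the previous layer's output X^{t-1}
    filters : list of k learnable filters W_k^{t-1}, each a list of floats
    biases  : list of k biases b_k^{t-1}
    Returns k feature maps computed with a sliding window and no padding.
    """
    feature_maps = []
    for w, b in zip(filters, biases):
        positions = len(x_prev) - len(w) + 1
        fmap = [
            activation(sum(w[j] * x_prev[i + j] for j in range(len(w))) + b)
            for i in range(positions)
        ]
        feature_maps.append(fmap)
    return feature_maps

# Two assumed filters applied to a short toy signal.
signal = [0.0, 1.0, 2.0, 3.0]
maps = conv1d_layer(signal, filters=[[1.0, -1.0], [0.5, 0.5]], biases=[0.0, -1.0])
print(len(maps), len(maps[0]))  # 2 feature maps with 3 positions each
```

In a deep network the same step is repeated layer after layer, with each layer's feature maps serving as the next layer's input.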

FIGURE 2. Multi-Camera for Multi-Object Tracking (MCT) Overview.

III. EVALUATIONS OF DEEP LEARNING METHODS BASED ONLINE MOT
In this section, we explore the DLM-based MOT framework and proven records. The deep learning framework effectively


FIGURE 3. The approach used to extract articles predicated on the diverse coalescence of keys on a search expression.

improved the tracking performance from various tracking predictions and data associations [43], [44]. It automates the capacity learning of appearance features via DCNN to promote discrimination and robustness for occlusion handling in online tracking optimization strategies [45], [46]. Therefore, this has made DLMs more resourceful in promoting the accuracy of motion prediction and the performance of bipartite matching between tracklets and detections [1], [10], [47]. Thus, in Fig. 4, we categorized the approaches into four main themes based on the capabilities and objectives in dealing with various challenges in real-time MOT.

A. DEEP LEARNING TOWARDS ONLINE MULTI-OBJECT TRACKING BASED ON DETECTIONS QUALITY AND ASSOCIATIONS
The detection quality is significant to improve the tracker’s capabilities in handling the object’s appearance similarities, generating and associating the tracklets, reducing false object detections, calculating, grouping the similarity trajectories, and drifting [74], [75]. The TBD and most advanced DLMs primarily rely on the quality of detections to generate and associate the tracklets effectively, as depicted in Fig. 5 [3], [5], [6].
In this section, we used Table 1 and showed an overview of the DLMs that are integrated with CNN to increase the detection quality rate by breaking input video into frames [18], [47]. We further analyzed the anterior work on the deep learning technique towards real-time MOT.
Xiang et al. [47] proposed a multiple online object tracking decision-making strategy using template tracking, optical flow, and data association to strengthen a tracking-by-detection technique by handling the target dynamics and association history. They modeled the objects’ similarity function by combining the different cues: appearance, location, and motion. This triplet loss-based CNN learned the distance metric between trackers and detections and used the long short-term memory (LSTM) prediction module to terminate the object in the next frame. However, the algorithm relied heavily on optical flow with template matching and resulted in poor detections and target data association history. It considered only motion features and omitted appearance features; hence, it experienced low detection quality (25.3%) and tracking results (30.3%) on the real-time MOT dataset (MOT2015). To overcome these shortcomings in detection quality, Milan et al. [16] extended the RNN and introduced a joint tracking and segmentation approach to estimate the state of the tracked object by strengthening the detector response. The network treats the states of objects, current observations, their matching matrix, and existence probabilities as inputs. Then, it outputs the predicted states and updated results, and terminates the object based on new existence probabilities. The proposed algorithm further computes the matching matrix and groups the designed LSTM-based networks to model the matching process between one object’s state and current observations. It then uses low-level image information and super-pixels to specify a target as background. This has enabled them to capitalize on the advantages of both high-level spatial information and low-level motion cues to create a unified graphical model for multi-object tracking and motion segmentation. It strengthened the algorithm to measure the object distance, size, location, and velocity through the implementation of the conditional random field (CRF) model. The strategy used a super-pixel procedure to assign labels to all pixels belonging to the semantic object on the video sequence. It assigned unique IDs to each detection at a super-pixel level in the input video. Although the approach improved the detection quality rate (76.0%), it struggled to accurately associate the target history in crowded scenes. Consequently, better tracking results (65.3%) were recorded in a real-time MOT evaluation of a dataset (PETS2015).


FIGURE 4. A framework of the DLM based MOTs investigated and categorized papers.

However, this result is not very convincing because the training samples were insufficient to learn an optimized model at once.
To manage these problems, Sanchez-Matilla et al. [51] used multiple detectors with high-end and low-end confidence values to improve tracking performance. They used weak (low confidence score) detections to support an existing track when robust detections were missing. Then, a perspective-dependent sampling mechanism is introduced to create newborn particles depending on their distance from the camera. They further used the probability hypothesis density particle (PHDP) framework to collect outputs from detectors. However, their approach failed to discriminate against a target in close range, resulted in low detection quality (14.0%) in real-time MOT evaluations of public datasets (MOT2015 and MOT2016), and could not improve the tracking accuracy (38.8%).
Kutschbach et al. [49] presented an application of the Gaussian mixture probability hypothesis density (GMPHD) filter for multi-object tracking in video data. They extended both the kernelized correlation filters and GMPHD to use the fast scale space tracking (FSST) scheme and two separated models for estimating target translation and scaling. The algorithm extracted the HOG feature from a region of 2.5 times its size and weighted it by a cosine window to highlight the target in the center and to avoid boundary issues. To associate detections with the tracks, the birth covariance was set to a significant value in every possible direction. However,

TABLE 1. Overview of deep learning methods used for online MOT based on detections quality and associations.


their extended GMPHD on the dataset (UA-DETRAC) could not handle the large birth covariance of the first track and resulted in preventing the initialization of new tracks. This led to delayed track extraction and lower tracking performance (14.5%) with a moderate detection quality rate (63.4%).
To improve the detection module’s ability to detect and extract the tracks in a timely manner, Zhao et al. [50] proposed a compressed DCNN feature-based correlation filter and used semantic information [78] inherited from the detector. The approach integrated the two modules for the online MOT approach and enhanced the ability to reidentify (ReID) the tracked object once it is lost. It also generated proposals for small objects in deep layers with semantic information and hence reduced the false detection rate. However, the approach failed to crop the target’s region of interest (ROI) in the detection stage and left the small object proposal generated in the shallow feature layers. This resulted in low computational complexity, a high misdetection rate that caused a low tracking accuracy (32.7%), and a moderate detection quality rate (57.2%) in a real-time MOT evaluation of a public dataset (KITTI). Then, Scheidegger et al. [14] proposed using DCNN and a Poisson multi-Bernoulli mixture (PMBM) filter to produce trajectories of the detected object in a world coordinate system. Their approach used a deep neural network to detect and estimate the distance of objects from a single input image. It fed the detections from the sequential images into the PMBM filter. Then, the existing single-shot multi-box detector (SSD) was incorporated to strengthen the detection of small objects on deeper layers rather than shallow layers. This played a significant role in building a multi-scale object detector that effectively detected small objects with fewer false negatives on datasets (KITTI). Consequently, it improved the tracking accuracy (80.0%) and detection quality rate (91.0%) with a strong estimation function that effectively calculated distances between objects.
The non-static surface gives the impression that motion in the background pixels affects the detection quality [3], [12], [29]. Hence, Ray and Chakraborty [12] used foreground detection and recent dissimilarity frames to strengthen the detector and track associations in a variable background. The approach separates background and foreground information and then removes flickering background or noise by analyzing the pseudo-motion-compensated on the current frame and preceding frames. It uses the estimation function to predict the state of an object and the Kalman filter to track the object. This solved the association issue under occlusions by refining the target region. Although the proposed approach increased the target detection and association across the video frames, it failed to track and differentiate small objects [79] with similarities under complex scenes on real-time MOT datasets (VOT2016 and MOT2016) and achieved moderate tracking accuracy (51.2%) with a high detection quality rate (86.0%).
Extending the work on the detector and unique identifications of the target, Zhang et al. [52] introduced multi-camera multi-object tracking by hierarchical clustering into a Fast R-CNN framework. The approach implemented the hierarchical clustering algorithm to merge trajectories and proposed solving the object similarities via the appearance feature extraction process. However, it was too slow to track objects in a real-time evaluation of a dataset (DukeMTMC) due to hierarchical paths and hence failed to handle appearance variations with more frequent object identity changes. This led to a moderate performance in both tracking accuracy (54.1%) and detection quality rate (55.0%). Then, Li et al. [53] incorporated template branch and bounding box regression into a Siamese region proposal network (SRPN) to improve the detection speed. The distance between tracklet pairs is learned via the extended Siamese network. The extended network extracted features for each detection in tracklets and transferred these features to bidirectional gated recurrent unit (GRU) networks. Then, the algorithm generated the tracklets and split them into short sub-tracklets according to the local distance between bidirectional GRU outputs. The sub-tracklets are reconnected to the long trajectories using similarities between temporal pooling global features. This helped jettison the outliers via a cosine window and a scale range. Consequently, it improved the detection quality rate (83.0%) with weak associations that led to a moderate performance on tracking accuracy (49.6%) in real-time MOT datasets (OTB2015).
Sun et al. [48] suggested tackling object appearance and data association issues with tracking-by-detection via DCNN. The approach combined appearance modeling, affinities, and networks to compute reliable trajectories and object associations on the current frame based on detections from multiple previous frames. The target appearances and affinities in a pair of video frames were jointly learned in an end-to-end fashion. This enabled the softmax layer of the network to separately look forward and backward in time for unidentifiable objects in the frame pairs. It also contributed to handling the appearance and disappearance of multiple objects between video frames. However, the approach’s overall network did not make assumptions for the input frame pairs to appear consecutively in a video. Although this promoted robustness against object occlusions, it could not cope very well with the data association of object fashion in real-time MOT evaluations of datasets (MOT15, MOT17, and UA-DETRAC) with similarities in the frames that were at close locations in the scenes. Consequently, it degraded the detection quality rate (41.1%) with a better tracking performance (52.4%). To address this problem, Ren et al. [77] proposed a deep prediction-decision network in a collaborative deep reinforcement learning (C-DRL) method that simultaneously detected and predicted objects under a unified network via deep reinforcement learning. To solve target associations and location problems, the approach considered each object as an agent and tracked it via the prediction network. It further sought the optimal tracked results by exploiting the collaborative interactions of different agents and environments via the decision network. The network learned the object movement, size, speed, and direction [80] and predicted the next step of


the target on the frame very well on real-time public datasets (MOT2015 and MOT2016). However, it becomes sensitive to appearance features and fragmentation, especially under long occlusions and heavy interactions. Hence, in some videos with high sampling rates, the approach experienced a high number of object losses for relatively more frames due to occlusion that led to a high number in IDS. This degraded both the tracking performance (47.3%) and detection quality rate (30.4%).
Wei et al. [13] developed the learning framework to address the issues in tracking and misdetections under heavy object interactions. They used temporal-spatial information to determine the trajectory confidence in each frame. The approach divides this process into local and global associations: trajectories with high confidence are associated with the detection results of the current frame in the local association, and those with low confidence are associated with the unmatched detection results of the current frame in the global association. This combination of spatial and temporal models of a public dataset (PETS2009) enhanced the tracklet association and misdetection handling in real-time object tracking. Compared to Ren et al. [77], the proposed algorithm improved the tracking accuracy by 9% with a low detection quality rate (17.0%).
Ning et al. [21] introduced a spatially recurrent convolutional neural network (SR-CNN) by extending the spatial and temporal work to learn visual features of the past frame by examining the historical locations. The approach tried to learn from historical visual semantics, detections, and tracklets by enabling automatic learning onto a tracker. It incorporated LSTM and enforced an end-to-end spatial-temporal regression with a single evaluation to enhance efficiency and effectiveness [81] by spatially glimpsing on various regions and regressing on the heat maps. However, the approach could not accurately link the tracklets and could hardly reidentify (ReID) objects under prolonged occlusions. Consequently, it resulted in poor tracking accuracy (43.0%) and detection quality rate (17.0%) in a real-time MOT evaluation with a public dataset (OTB-30). Then, to locate and handle misdetection on a similar object, Fagot-Bouquet et al. [15] formulated a multi-frame data association process based on a sliding window and minimized energy sparsity that represented all detections. The technique implemented the TBD paradigm based on the sliding window and estimated trajectories for best

seemed to be a common issue that mostly led to unsatisfactory overall tracking performance. Therefore, in this section, the deep learning methods in real-time MOTs with high-speed tracking and low computational cost presented in Table 2 are analyzed.
Zamir et al. [60] proposed an algorithm to solve data association issues via generalized minimum clique graphs by finding the detections that correspond to one particular object in different video frames. The approach has expanded the node definition for clustering by grouping the nodes of an input graph into disjoint clusters. It searches for a subgraph set of nodes that requires the minimum cost for the complete graph to be produced. This required the authors to introduce the hypothetical nodes technique that handled the exit or entry problem and long-term occlusion occurrences during the tracking process. However, the assumptions made on object velocity over a short period caused the algorithm to struggle in modeling the motion of one person over the long run without knowing the destination structure of the scene and especially when people were heavily interacting. To construct a hypothesis tree for multiple hypothesis association and tracking, Kim et al. [61] extended the MHT framework with appearance features using a multi-output regularized least square method. Their approach exploited high-order appearance information by incorporating long-term appearances via appearance feature extractions and deep neural networks. The appearance features are dimensionally reduced from deep dimensional features to handle appearance and motion variations during tracking. These features boosted the approach in handling exit-entry issues and long occlusions with high computational costs [44] and a high tracking-speed rate (0.9 seconds/frame) in the real-time dataset (TUD Campus) but failed to model the motion of an object for an extended period. Then, Tang et al. [54] introduced subgraph decomposition to improve the motion model for multiple object tracking through a finite set of hypothesis detections. Their subgraph multi-cut model had the property of jointly addressing the spatial issue (within-frame) and temporal (across-frame) associations. This gave the advantage of using the minimum cost subgraph multi-cut to link and cluster plausible detections jointly across space and time. Although the proposed approach enhanced performance on both tracking speed (0.86 seconds/frame) and tracking accuracy (80.9%)
associating object detections. However, when the number on the public benchmark dataset (TUD campus), it could not
of frames increased on the sliding window, the appearance efficiently track objects with motion variations and hence
model suffered. Consequently, this led to a failure for the incurred high computational costs. Ruchay et al. [23] pro-
proposed approach to associate detection effectively. It also posed an algorithm to track targets based on local adaptive
had trouble handling object tracking under crowded scenes correlation filters and enabled object tracking with motion
on datasets (2DMOT2015 and MOT2016) and resulted in a variations in high-speed scenes. The adaptive procedure is
deteriorated overall tracking performance. applied for a typical scene background and multiple com-
posite filters. The impulse responses of optimum correlation
B. DEEP LEARNING METHODS IN REAL-TIME MOTS WITH filters are used to synthesize composite filters for distortion
HIGH-SPEED TRACKING AND LOW COMPUTATIONAL invariant object tracking. This is employed via a predic-
COST tion scheme that uses composite correlation filters to track
The slow algorithm tends to lose track of many tracked multiple objects with invariance to poses, occlusion, clutter,
objects with speed variations. The object’s speed assumptions and illumination variations. However, it has consequently

VOLUME 9, 2021 32657


L. Kalake et al.: Analysis Based on Recent Deep Learning Approaches Applied in Real-Time MOT

TABLE 2. Overview of deep learning methods in real-time MOTS with high-speed tracking and low computational cost.
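As an illustration of the failure test that Shin et al. [46] add to their KCF-based tracker in this subsection, the following minimal Python sketch compares the peak of a correlation response map with the average of its neighboring values. The function name `tracking_failed`, the neighborhood radius, and the ratio threshold are assumptions made for this example; the exact rule and its parameters are not stated in this review.

```python
def tracking_failed(response, ratio_threshold=4.0, radius=1):
    """Flag a tracking failure from a 2D correlation response map.

    Hypothetical rule: the track is considered lost when the peak is not
    clearly higher than the average of its neighbors.
    """
    rows, cols = len(response), len(response[0])
    # Locate the peak of the correlation response.
    peak_r, peak_c = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: response[rc[0]][rc[1]],
    )
    peak = response[peak_r][peak_c]
    # Collect the values around the peak that lie inside the map.
    neighbors = [
        response[r][c]
        for r in range(max(0, peak_r - radius), min(rows, peak_r + radius + 1))
        for c in range(max(0, peak_c - radius), min(cols, peak_c + radius + 1))
        if (r, c) != (peak_r, peak_c)
    ]
    if not neighbors:
        return False  # a 1x1 map gives no evidence either way
    mean_neighbors = sum(neighbors) / len(neighbors)
    if mean_neighbors <= 0:
        return False  # the peak dominates its surroundings completely
    # A low peak-to-neighborhood ratio indicates a flat, unreliable response.
    return peak / mean_neighbors < ratio_threshold
```

In this sketch a sharply peaked response is treated as a reliable match, whereas a nearly flat response around the peak would trigger the re-tracking step with multiple search windows.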

improved the tracking speed and led to high-speed tracking (0.56 seconds/frame) with difficulties in handling illumination variations and prolonged occlusions. It also contributed to a deteriorated tracking performance (53.3%) with moderate computational costs.

To increase tracking accuracy while preserving processing speed, Shin et al. [46] incorporated three functional modules (tracking failure detection, re-tracking using multiple search windows, and motion vector analysis to select a preferable search window) onto a kernelized correlation filter (KCF)-based tracking method. The technique detects tracking failure by analyzing the peak and the average of neighboring correlation values. It then re-tracks the target after a tracking failure and calculates a motion vector of the target to select the preferred search window. Although the proposed approach registered a high rate of both tracking accuracy (70%) and tracking speed (1.9 seconds/frame), its re-tracking process required an additional computational load for multiple search windows on a public dataset (Visual Tracker Benchmark). This led to a high rate of motion


direction prediction and target losses during the tracking process in crowded scenes.

Sharma et al. [24] proposed an approach that minimizes the computational costs by estimating motions on rough frames based on odometry and implemented features on the background. It adopted the tracking-by-detection paradigm and took input from the monocular video frame sequence. It further used the target information beyond the bounding box image pixels to estimate the 3D shape and pose. The approach illustrated pairwise costs disambiguating across track viewpoint variations and relative target movements but suffered from IDS and fragmentations due to motion variations in a real-time MOT evaluation of a public benchmark dataset (KITTI). Therefore, it resulted in moderate computational costs and high tracking speed (1.6 seconds/frame) with satisfactory tracking accuracy (84.2%).

To model object motion, Keuper et al. [57] and Chen and Ren [58] proposed a motion segmentation technique that combined bottom-up motion segmentation with top-down multiple object tracking. It grouped point trajectories through clustering of bounding boxes to improve tracking accuracy on small dense objects. It then used a supervised CNN to minimize the computational cost and a Faster R-CNN tracker to obtain detections in a video sequence without knowing the object’s category and interest. To enhance the tracking accuracy, the detections from the Faster R-CNN detector were trained in real time using the MOT2016 public dataset. Although the technique enabled the approach to track objects at high speeds (1.8 seconds/frame), it experienced moderate computational costs [34], complexity, and difficulties in handling motion variation and annotated objects that are caused by over-segmentation in real-time MOT evaluations of public datasets (MOT2016 and MOT2017). This led to a deteriorated overall tracking performance (47.1%).

To extend the task of online MOT to segmentation tracking with the creation of dense pixel-level annotations and semi-automatic annotation procedures, Voigtlaender et al. [82] proposed a new baseline method that jointly addressed object detection and segmentation with a single convolutional neural network (SCNN). The approach implements the TrackR-CNN tracker as a baseline to address all aspects of multi-object tracking and segmentation (MOTS) duties in real-time MOT evaluations of public datasets (MOT2016, KITTI, and MOT19). TrackR-CNN extends Mask R-CNN [11] with 3D convolution layers to incorporate temporal information and tracklet associations over time. Then, the TrackR-CNN mask-based detections together with association features are used as input to a tracking algorithm to decide which detections to select and link with bounding boxes. Although this led to a highly satisfactory tracking speed (0.5 seconds/frame), it has also contributed to the algorithm’s failure to handle the segmentation of speedy objects. Hence, the overall tracking performance (47.1%) experienced deteriorations.

Bochinski et al. [2] suggested a tracking-by-detection paradigm to track targets with high speed without using image information. The approach also incorporated a simple IOU tracker to track targets by associating detections with the highest intersection over union (IOU) to the last detections in the previous frame. It removed short tracks to make the algorithm less sensitive to false positives. This contributed to a high tracking speed (0.4 seconds/frame), but more detections on the tracker have caused many mispredictions of detections in real-time evaluations of the DETRAC and MOT16 datasets. Therefore, the approach resulted in a high number of target losses and achieved an unsatisfactory tracking performance (25.3%) with high computational resources.

Real-time object tracking with speed tracking is a crucial technology for visual analysis, object detection, and motion variation handling [2]. Redmon et al. [56] and Ren et al. [76] proposed the region proposal network for object proposals and shared the regional classification through convolutional layers and Fast R-CNN. The technique used Fast R-CNN to produce local proposals with optimized classification and bounding box regression tasks. It enhanced the processing speed by using CNN fully connected layers for region proposals without handcrafted features. However, the approach was complex with noisy detections and suffered from overfitting and false detections in real-time MOT evaluations of public datasets (Picasso and MOT2016). Therefore, it resulted in a high tracking speed (0.01 seconds/frame) and a moderate rate of both computational costs and tracking accuracy (57.9%).

Weng and Kitani [22] tried to minimize the computational costs and system complexity for multiple online object tracking. They proposed an approach that combined two filters with CNN to improve data association and object state estimation. The approach incorporated a vast space of the Kalman filter into a full 3D domain to handle 3D location, size, velocity, and object orientation to minimize the computational cost and system complexity [10]. It succeeded with high tracking speed (39.4%) to reduce the computational costs and system complexity but suffered from high false object detections due to the lack of appearance feature extractions in the real-time MOT evaluation of a public dataset (MOT2016). This contributed to a negative overall tracking performance (39.4%). Then, Wang et al. [59] combined temporal and appearance features to form a unified framework to reduce the computational cost by grouping tracklets together based on similarities. The approach extended the first architecture of the Siamese network to learn the associating affinities between tracklets. It further combined the appearance model with CNN features and improved the tracking speed (0.5 seconds/frame) and tracking accuracy (56.1%) through clustering and assigning unique individual identities. However, the framework suffered from drifting, object interaction, and occlusions in real-time MOT evaluations of public datasets (MOT2016 and MOT2017).

To solve computational efficiency, drifting, and occlusions in online multi-object tracking, Chu et al. [55] proposed an object-specific particle filtering framework for real-time MOT evaluations. The approach tracked each object with


TABLE 3. Overview of deep learning methods on modeling target uncertainty in online MOTS.
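The IOU-based association that Bochinski et al. [2] use in Section III-B can be sketched in a few lines of Python. This is a simplified illustration and not the authors' implementation: the function names, the overlap threshold `sigma_iou`, and the minimum track length `t_min` are assumed values, and boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_track(frames, sigma_iou=0.5, t_min=2):
    """Link per-frame detections into tracks using only box overlap.

    frames: list of lists of boxes, one inner list per frame.
    sigma_iou and t_min are illustrative values, not tuned settings.
    """
    active, finished = [], []
    for detections in frames:
        remaining = list(detections)
        still_active = []
        for track in active:
            # Pick the detection that overlaps most with the track's last box.
            best = max(remaining, key=lambda d: iou(track[-1], d), default=None)
            if best is not None and iou(track[-1], best) >= sigma_iou:
                track.append(best)
                remaining.remove(best)
                still_active.append(track)
            else:
                finished.append(track)  # no overlapping detection: the track ends
        # Every unmatched detection starts a new track.
        active = still_active + [[d] for d in remaining]
    finished.extend(active)
    # Discard short tracks, which are likely to be false positives.
    return [t for t in finished if len(t) >= t_min]
```

Because no image features are computed, the cost per frame is dominated by a handful of box comparisons, which is consistent with the speed advantage reported for this family of trackers; the same simplicity also explains its sensitivity to missed or spurious detections.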

two constructed CNN-based classifiers. To handle occlusion between objects, it learned spatial attention features based on the visible map using convolution and fully connected layers. Then, the spatial attention map weight features were used to promote the accuracy of the classifier. It reduced time-consuming computation by sharing the CNN feature maps. Then, a single object tracker was incorporated into the spatial-temporal attention mechanism (STAM) procedure and enabled target searching in the next frame. It enhanced the tracking speed (0.5 seconds/frame) and handled the interactions very well but could not uniquely differentiate targets that appeared similar in real-time MOT evaluations of public datasets (MOT2015 and MOT2016). This led to high false alarm and misdetection rates under heavily dense scenes and resulted in unsatisfactory tracking accuracy (46.0%).

C. DEEP LEARNING METHODS ON MODELING TARGET UNCERTAINTY IN AN ONLINE MOT
Online MOT uncertainty is mainly caused by ineffectiveness in associating targets with relevant tracks. This affects the performance of many algorithms in handling object discrimination and direction predetermination processes. In this section, we explain the deep learning methods based on modeling target uncertainty in online MOTs, as shown in Table 3.

Bewley et al. [65] used the merits of single object tracking [83] to integrate the Kalman filter features with the Hungarian algorithm to find the association in visual tracks. The approach used CNN-based detection and Faster Region CNN (FR-CNN) in an end-to-end fashion. The FR-CNN shared parameters between two stages to create an efficient framework for detections. However, the approach focused more on efficiency and reliability to handle common frame-to-frame associations than on robustness to detection errors. This led to a failure in handling objects’ appearance variations [2], [23], [58] and high IDS (7,318) with a low tracking accuracy (33.4%) under heavy interaction in a real-time MOT evaluation of a public dataset (MOT2016). To increase discrimination, Wojke et al. [62] employed the deep feature extraction technique based on a wide residual network (WRN) for the person re-identification process. They L2-normalized the 128-dimensional features before the cosine softmax classifier layer [22]. Then, the cosine and motion-based Mahalanobis distances are used to fuse dissimilarities. The approach incorporated the Kalman filter to find the movement and appearance features of the target. It further extracted the appearance feature through DCNN and tracked the target individually. Although it improved tracking performance (61.4%), it struggled to handle target tracking under crowded, distanced views, drifting, and prolonged occlusions. Consequently, this led to high-frequency changes in object IDS rate (12,862) during a tracking process in a real-time MOT evaluation of a public dataset (KITTI).

Inflexible objects have been proven to cause object drifting [29]. Then, Gan et al. [63] used the merits of [55] to develop an online MOT approach to handle the drift and identity (ID) switches caused by occlusions and interaction among targets. The approach used convolutional layers to extract appearance features [47] and fully connected layers to update a distinguished online target from the background. It further used the interaction of appearance motion with the interaction cues of the target and the online ID assignment scheme based on multi-level features to confirm the trajectory of each target. This technique enhanced the model updates and identity association of the appearance model with STAM and CNN. It also contributed to the approach’s capabilities of finding appropriate target detections in the previous frame


and the effectiveness of linking them to the current frame. However, the similarities, long-term occlusions, and velocity changes [85] on a tracked target mostly led to uncertainties in real-time tracking. Therefore, these factors contributed to the approach’s failure to differentiate targets, which led to a high IDS (7,912), fragmentation, and unsatisfactory tracking accuracy (44.0%) in evaluations on public datasets (VOT and OTB). Zhu et al. [41] and Liu et al. [76] tried to address the problems by combining appearance and motion models. The technique integrated the models onto the Siamese network to learn affinities for tracklets and replace previous features from the IDLA. It further employed the online tracking framework to cascade associations between tracklets and detections in two stages based on target confidence levels (high-to-low). However, it could not track small objects in motion that have similar appearance in real-time MOT evaluations of public datasets (MOT2016 and MOT2017). Therefore, it suffered from bearable IDS (1,871) and challenging tracking accuracy (48.3%). Then, Wang et al. [66] tried to learn and track small objects in motion by extending the first architecture of the Siamese network to learn target detection, affinity associations between tracklets, and appearance embedding in a shared model. The approach incorporated the appearance-embedding model into a single-shot detector for simultaneously outputting detections and corresponding embeddings. It further used those detections for localization and tracking and then linked tracks onto the appearance model for data associations. It achieved satisfactory tracking accuracy (62.1%) in real-time MOT evaluations of public datasets (MOT2016, MOT2017, and KITTI), but it could not describe the dependencies between tracklets with a similar appearance.

To effectively differentiate objects with a similar appearance, Fajardo et al. [32] proposed a deep appearance features method to improve the object data association and affinity in different frames that uniquely tracked targets through the CNN framework based on motion and appearance information [80]. The approach struggled to recognize the cropped patches with limited information [86] and hence suffered from false positives and misdetections in real-time MOT evaluations of public datasets (MOT2016 and MOT2017). Therefore, it resulted in moderately frequent object ID changes (4,123) and high tracking accuracy (75.2%). This increased the attention onto tackling the high uncertainty number values in real-time tracking [87]. Kampker et al. [64] presented a real-time framework for multi-object detection and maneuver-aware tracking for 3D LIDAR applications to tackle object uncertainty in cluttered urban environments. It combined a sensor occlusion-aware detection method with computationally efficient rule-based filtering and adaptive probabilistic tracking to handle uncertainties arising from the sensing limitation of 3D LIDAR and the complexity of the targets’ movement. The detection algorithm took a 3D point cloud as input and divided it into non-ground and elevated measurements. This task was accomplished via a slope-based ground removal approach and a subsequent filtering process. It further generated the object hypotheses for the tracking targets in a clustering process. Then, the objects of interest were extracted by means of a subsequent feature-based bounding box fitting and rule-based filtering. The technique handled prolonged occlusions and improved the tracking accuracy (86.1%) with a remarkable reduction in IDS (65) in real-time MOT evaluations of public datasets (MOT2016 and KITTI).

D. DEEP LEARNING METHODS WITH CNN, AFFINITY, AND DATA ASSOCIATION IN ONLINE MOTS
The traditional CNN architecture uses the handcrafting of cost functions that hinder the tracking performance in most recent works. It is mostly expanded and integrated with deep learning techniques, as illustrated in Table 4, to handle object affinities and data associations. Hence, in this section, we explore the deep learning methods with CNN, affinity, and data association in online MOTs.

To enhance target tracklet associations, Schulter et al. [17] proposed a formulation that enabled the learning of arbitrary parameterized cost functions for all variables with association problems and enhanced the MOT in real-time applications. They constructed an end-to-end deep learning min-cost network flow and defined a loss function of the deep architecture as the weighted L2 distance of edge labels. The approach further optimized the algorithm by building network flow with its edges on multilayers to form a deep architecture model. It has been able to track and re-identify the objects under complex scenes in real-time tracking. It further handled the long occlusions and accurately estimated the objects’ affinity scores. Therefore, this contributed to a low IDS rate (65) and high rates of both mostly tracked objects (58.3%) and tracking accuracy (67.4%) in real-time MOT evaluations of public datasets (KITTI, MOT2015, and MOT2016). Kumar et al. [67] constructed a complementary graph function to capture the spatial-temporal and appearance information. They further constructed an exclusion graph function to ensure that detections that occurred simultaneously do not share the same node labels. Then, the appearance information is used to link detections into trajectories. This contributed to a high tracking accuracy and a low IDS rate (5) in real-time MOT evaluations of public datasets (APIDIS, PETS-2009 S2/L1, MOT2015 (TUD Stadtmitte and TUD Crossing)) but struggled to associate objects effectively at crossing scenes.

To solve object association ambiguities in cluttered multi-object scenarios, Scheel et al. [71] suggested implementing the Monte Carlo algorithm with a multi-Bernoulli filter to handle the association measurements between objects. They extended the algorithm’s object filter to work directly on the raw measurements and process multiple measurements per object. Although the approach achieved high tracking accuracy (74.4%) in a real-time MOT evaluation of a public dataset (KITTI), it failed to calculate the association measurements accurately and resulted in filter divergence. Leal-Taixe et al. [68] extended the technique into the Siamese


TABLE 4. Overview of deep learning methods with CNN, affinity, and data association in online MOTS.
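The fusion of appearance and motion dissimilarities described for Wojke et al. [62] in Section III-C can be illustrated with the following sketch. It assumes a diagonal covariance for the Mahalanobis term and a simple weighted sum controlled by an illustrative `weight`; these choices, like the function names, are hypothetical and are not taken from the reviewed work.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cosine_distance(a, b):
    """One minus the cosine similarity of two appearance feature vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))


def mahalanobis_sq(measurement, predicted, variances):
    """Squared Mahalanobis distance, assuming a diagonal covariance."""
    return sum((m - p) ** 2 / s for m, p, s in zip(measurement, predicted, variances))


def fused_cost(appearance_d, motion_d, weight=0.5):
    """Weighted sum of appearance and motion dissimilarities (weight is illustrative)."""
    return weight * appearance_d + (1.0 - weight) * motion_d
```

A cost of this kind would typically be evaluated for every track and detection pair and passed to an assignment step such as the Hungarian algorithm mentioned for Bewley et al. [65]; the appearance term helps after occlusions, while the motion term penalizes physically implausible matches.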

network to learn matching features for MOTs and then determined the affinity score. Their approach used three types of Siamese CNN topologies for computational cost, information distribution, and the streaming of the data to form inputs for CNN layers. It compares these three topologies and uses the third architecture to extract the in-depth features. It further used the deep features and motion information with a gradient boosting algorithm to formulate the tracking as a linear programming problem and solved it efficiently. However, it struggled to detect and associate the object tracks under a dense population. This contributed to the unsatisfactory overall performance with tracking accuracy (29.0%) and


moderate achievements in both the mostly tracked (48.4%) and IDS rates (639) in a real-time MOT evaluation of a public dataset (MOT2015).

To overcome the detection association problem, Son et al. [40] proposed using quadruplet CNN to learn and associate detections across video frames using appearance and motion cues. They extended the Siamese network using quadruplets of image patches as inputs and extracted these patches for three detections from one same object and another different object. The approach further constructed a loss function that temporally learned the smooth appearance embedded with the motion-aware position for metric learning. However, the proposed approach could not associate detections very well under crowded scenes and resulted in high IDS (745). This further resulted in a low mostly tracked rate (14.6%) and tracking accuracy (44.1%) in real-time MOT evaluations of public datasets (2DMOT2015 and MOT2016). Then, Lee et al. [72] used a CNN-based detector and Lucas-Kanade Tracker (LKT)-based motion to compute the likelihood of foreground regions as the detection response of different object classes. The technique separates the dynamic motion model of a Bayesian filter into entity translations and motion cues. Although this contributed to a better tracking accuracy (62.4) and moderate rate for mostly tracked (31.5%) objects in a real-time MOT evaluation of a public dataset (MOT2015), it left the proposed approach struggling to associate the tracklets over a long tracking period in the heavy interactive scenes and resulted in high IDS (1,394).

Kieritz et al. [36] capitalized on the established embedded target appearance process and proposed an online learning appearance model. Their technique combines the appearance model with a simple motion model to estimate the change in position and smooth the trajectory. It used a classifier based on integral channel features to detect persons in each frame. It further used the detector that uses LUV color channels, a histogram of oriented gradients with several bins, and the gradient magnitude to formulate fast detection over every channel. However, the approach experienced deceptive appearances over long track periods and switching between active and inactive states of the trajectories with a low number of associated detections. This hindered the overall performance and resulted in a challenging tracking accuracy (27.1%), a low mostly tracked rate (6.4%), and a moderate IDS rate (1,490) in a real-time MOT evaluation of a public dataset (MOT2015).

Vo et al. [86] implemented a multi-sensor generalized labeled multi-Bernoulli (GLMB) filter with two sensors to reduce uncertainty about object existence and state. Zou et al. [89] used the established platform to update appearance information based on template matching rather than the learning-based approach. However, these approaches experienced frequent occlusions under heavy interactions and failed to handle appearance variations. This resulted in detached detections in real-time MOT evaluations. Huang and Zhou [69] proposed an online multi-object tracking approach and used a recurrent convolutional neural network (R-CNN) to solve detached detections. To address the data association problem in the paradigm, the technique discarded all the unused data in the video sequence. It reduced the data to a few single measurements per frame and ran the detector. Then, tracklets are associated with each measurement of a corresponding target. This led to considerable target loss due to misdetection and tracks associations in crowded scenes. It also contributed to a high IDS rate and a very low tracking performance in a real-time MOT evaluation of a public dataset (MOT2015).

Wu et al. [27] applied a single-camera tracking (SCT) technique to associate detached detections into tracks. They further used the tracks on multi-camera tracking (MCT) to re-identify each track to form trajectories. However, their MCT-based technique could not associate the tracks of different cameras under illumination changes, view angle variation, and object appearance inconsistency. Though this contributed to a satisfactory mostly tracked (51.8%) result, it could not efficiently associate detections and tracks across cameras, hence resulting in poor tracking accuracy (9.65) in a real-time MOT evaluation of a public dataset (MOT2016). To strengthen the target data association across cameras, Le et al. [37] proposed the use of a Markov decision process (MDP) to coordinate object tracking with the camera network. The approach extended the MDP to a multiple-view framework and then introduced a novel target association method across cameras. It further collected and associated the tracking outcomes on each camera onto target tracks. This contributed to the effectiveness in handling the appearance similarities [90] under crowded scenes. It also led to a low IDS (240) with better overall performance in terms of mostly tracked (62.0%) objects and tracking accuracy (69.8%) in real-time MOT evaluations of public datasets (PETS09-(S1 L1 and S2 L2)).

Houssineau et al. [9] proposed a new online scheme for evaluating ReID algorithms for object tracking, aiming to improve the target ReID process within a camera at different times. The approach considered several issues, such as the open set, dynamic, small gallery set, and multiple camera configurations. However, it could not efficiently capture the scenarios of online tracking for camera networks, open sets, and the dynamic nature of the gallery set due to its limitation on considering only camera scenarios. Then, Tesfaye et al. [88] used constrained dominant sets and three hierarchical layers to enhance tracking of individual object appearances in each camera. The approach splits the video into small segments and generates tracklets. It then merged the detection boxes in consecutive frames [92] and applied fast constrained dominant sets (FCDS) in the first layer. In the second layer, it merged the tracklets into tracks with FCDS within each camera. It finally organized all tracks together in the third layer and built a graph of tracklet matching across cameras to verify whether a person appears in one or more cameras. The proposed approach achieved better performance in tracking accuracy (56.6%) and recorded high performance of mostly tracked objects in a


FIGURE 5. Object Tracking based on Quality Detections [76].

FIGURE 6. Analysis of Deep Learning Algorithms based on Detection Quality and Associations in real-time MOTs.

real-time MOT evaluation of a public dataset (MOT2015), but failed to handle similar appearances, could not associate tracklets across cameras, and resulted in most frequent changes in IDS (1,637) under crowded scenes. It also suffered from re-tracking and ReID due to fragmentation.

In ensuring efficient target tracking and data associations in camera networks, Sharma et al. [93] used a camera selection policy to select the candidate camera where the target is likely to appear by forwarding the ReID queries during the target transition. However, the approach brought affinities and computational complexities. Yoon et al. [70] designed an appearance matching network for robust online multiple object tracking to solve the computational bottleneck and affinity issues. The proposed network utilized the structural constraint information to represent the relative information portions and velocity differences between objects and track missed objects under heavy occlusion with the aid of the ReID process. Then, Ristani and Tomasi [4] suggested reducing the computational complexity by incorporating standard hierarchical reasoning and sliding temporal techniques onto a tracker. These approaches reduced the IDS rate, but they could not track the objects for an extended period. This resulted in poor tracking accuracy in real-time MOT evaluations of public datasets (MOT2015 and MOT2016).

Jiang et al. [8] analyzed object trajectories across multiple cameras to allow synthesis data and security analysis of images in various scenarios. Their approach used a multi-camera system without tuning parameters from the ground truth and constructed a graph from 2D observations of all camera pairs with no network configuration.


FIGURE 7. Analysis of Deep Learning Algorithm-based Real-time MOTs with High-Speed Tracking and Low computational Costs.

FIGURE 8. Analysis of Deep Learning Algorithms based on Modeling Target Uncertainty in MOTs.
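Several of the trackers reviewed in Section III, including those of Bewley et al. [65], Weng and Kitani [22], and Wojke et al. [62], rely on a Kalman filter to predict where a target will appear before association. The sketch below shows the predict and update steps for a single coordinate under a constant-velocity model. The class name, the noise values `q` and `r`, and the initial uncertainty `p0` are assumptions chosen for illustration; the reviewed methods use richer state vectors (for example box size or 3D orientation).

```python
class ConstantVelocityKalman1D:
    """Kalman filter over the state [position, velocity] along one axis."""

    def __init__(self, position, q=0.01, r=1.0, p0=100.0):
        self.x, self.v = position, 0.0
        # Covariance P = [[p00, p01], [p10, p11]]; a large p0 encodes an uncertain start.
        self.p00, self.p01, self.p10, self.p11 = p0, 0.0, 0.0, p0
        self.q, self.r = q, r  # process and measurement noise (illustrative)

    def predict(self, dt=1.0):
        # State transition F = [[1, dt], [0, 1]]; covariance becomes F P F^T + Q.
        self.x += self.v * dt
        p00 = self.p00 + dt * (self.p01 + self.p10) + dt * dt * self.p11 + self.q
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11
        return self.x

    def update(self, z):
        # Only the position is measured, so H = [1, 0].
        innovation = z - self.x
        s = self.p00 + self.r
        k0, k1 = self.p00 / s, self.p10 / s
        self.x += k0 * innovation
        self.v += k1 * innovation
        # Covariance update P = (I - K H) P.
        p00 = (1 - k0) * self.p00
        p01 = (1 - k0) * self.p01
        p10 = self.p10 - k1 * self.p00
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11
        return self.x
```

In a tracking-by-detection loop, `predict` would be called once per frame for every track, and `update` only for tracks that were matched to a detection; unmatched tracks keep coasting on the motion model, which is one reason long occlusions lead to drift.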

The proposed approach could not efficiently associate the tracklets, especially when they had the same motion and similarities in size and appearance. This resulted in unsatisfactory tracking performance in real-time MOT evaluations of public datasets (PETS09-(S1 L1 and S2 L2)). Then, Chen et al. [73] proposed handling unreliable detections by collecting candidates from outputs of both detection and tracking processes. Their approach presented a novel scoring function based on a fully convolutional neural network that shares most computations on the entire image. It further adopted a deeply learned appearance representation to improve the identification ability of a tracker. It also presented a hierarchical data association strategy that utilizes the spatial information and deeply learned person re-identification features to compare tracked objects with their historical features to decide whether the same target was previously identified. However, the




FIGURE 9. Performance on handling the affinity and associations with CNN integrated DLMs.

proposed tracker generated more fragmentations when the target suddenly sped up, which affected the
target data association. This contributed to a degraded overall performance with vast target losses
that led to the unsatisfactory rate for mostly tracked (15.2%) objects and tracking accuracy (47.6%)
in a real-time MOT evaluation of a public dataset (MOT2016).

IV. DISCUSSION
In this systematic review, we provide an overview of the different DLMs for online MOTs in various
environments, scenes, and datasets. Based on previous studies, we categorized the proposed
approaches into four themes (1. Online MOTs based on detection quality and associations,
2. Real-time MOTs with high-speed tracking and low computational cost, 3. Modeling target
uncertainty in online MOTs, and 4. CNN, affinity, and data association in online MOTs).
In the methodology, the real-time MOTs based on deep learning techniques were less frequently
accessible. For the performance evaluation, the main challenges were the availability and quality
of the evaluation results to include all parameters as per the new standardized MOT evaluation
benchmark suggested by [26].
The tracking accuracies of deep learning techniques varied between 14.5% and 86% in video
processing under several complex real-world problems [2]. The lowest tracking accuracy of 14.5% was
achieved where the distance between the object detected in the first frame and its detection in the
next frame increased [49]. This is identified on target detections and tracklet associations in
video frames that have complexities in detecting objects with motion variations [37]. It increased
the number of problems encountered due to weak data associations and appearance similarities.
DLMs such as [12]–[16], [21], [47]–[53], [77] were combined to construct features and to learn
appearance similarities between objects to improve detection quality and associations. In contrast,
[2], [22]–[24], [46], [54]–[61], [82] classified and assessed the matches between detections and
tracklets to quickly track the targets. However, the density of uncertainty in the real-time MOTs
persisted. Then, [32], [41], [62]–[66] emerged to learn the control of the problem through
graphical models and flow optimizations but experienced difficulties in handling the affinities.
They used a data-driven mechanism by [8], [17], [27], [36]–[38], [40], [67]–[73], [89], [90] to
learn the affinity models for data association and replaced the handcrafted features with a
real-time MOT framework, such as Siamese CNN. This contributed to the remarkable progress seen in
deep learning techniques based on online MOTs and reduced the numbers of mostly lost (ML) targets
by enforcing a strong foreground and motion differentiation for all moving objects.
The performance outcomes of the previous studies have shown that the deep learning approach is the
most popular paradigm applied by most researchers in real-time MOTs.
In Table 5, we developed general comparisons and summarized the limitations and strengths of
approaches based on each theme. For example, in online MOTs based on detection quality and
association, detection and trajectory




FIGURE 10. Illustration of the detections and feature learning with DLM-based online MOTs. (a) Insufficient detections (poor quality detections) in
MOT2020 enter/exit stadium; (b) sufficient detection in PETS09-S211 sequence; (c) real-time high-speed tracking with a mono-camera in KITTI; and
(d) multi-person tracking results, color features in deep learning.

construction are based on both current frame information and the previous frame. The proposed deep
learning approaches have successfully located detections that correspond to one particular object
in different frame sequences. They also proved that low-level detectors are the main factors that
thwart the capability of a tracker to estimate the state of the target from ambiguous observations.
This is supported by the results of a considerable number of detached detections that could not be
associated and hence degraded detections’ quality response [2], [22], [23], [54]–[56], [59], [60].
However, background subtraction and feedback detections in each frame could be utilized to improve
the tracking performance in deep learning techniques that follow the TBD paradigm. The strategy
could also be extended to other DLMs and make them more effective when nearby objects with similar
appearances occlude each other in video frames.
Second, in real-time MOTs with high-speed tracking and low computational costs, common
restrictions, such as making assumptions on surfaces, objects’ speed, and directions, led to
blurred image capture when objects suddenly sped up. This slowed down the detection rate and led to
poor detections that degraded the overall tracking performance. However, the implementation of
DCNN, Faster R-CNN, filters, and segmentation with DLMs enabled high-speed object tracking with low
computational costs, but the methods experienced more difficulties in associating the tracks,
especially when the estimated uncertainty was at a low level (lower pixel).
Third, in modeling target uncertainty in online MOTs, the proposed methods had difficulty capturing
objects from a mono-camera for online MOTs. This blurred the images and increased uncertainty
issues. Then, the independent self-motion that emerged with GLMB reduced the uncertainty about the
object’s existence state but could not disambiguate between objects and uncertainty due to
insufficient data association.
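
The frame-to-frame association step that the tracking-by-detection discussion above relies on can
be illustrated with a minimal Python sketch. It is not taken from any of the reviewed papers: the
function names `iou` and `associate`, the box format (x1, y1, x2, y2), and the 0.5 overlap
threshold are assumptions chosen for illustration, and greedy intersection-over-union matching
stands in here for the learned affinity models surveyed in this review.

```python
from typing import Dict, List, Tuple

# Hypothetical box format for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(
    tracks: Dict[int, Box],
    detections: List[Box],
    min_iou: float = 0.5,  # assumed threshold, not a value from the review
) -> Tuple[Dict[int, int], List[int]]:
    """Greedily match previous-frame tracks to current-frame detections.

    Returns (matches, unmatched): matches maps track id -> detection index,
    unmatched lists the detection indices left without a track.
    """
    # Score every track/detection pair, best overlap first.
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used = set()
    for score, tid, j in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little to be the same target
        if tid in matches or j in used:
            continue  # each track and each detection is matched at most once
        matches[tid] = j
        used.add(j)
    unmatched = [j for j in range(len(detections)) if j not in used]
    return matches, unmatched
```

In a sketch like this, detections returned as unmatched would typically start new tracks, while
tracks left without a match correspond to the fragmentation and detached detections discussed
above. A learned appearance affinity could replace `iou` without changing the matching loop.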




TABLE 5. Strengths and limitations of various deep learning methods based on online MOTs.

Last, in the context of affinity and data association, different techniques are applied to enhance
CNN for object association. Their affinity computation in multiple frame sequences could not
distinguish objects with similar appearances or pedestrians wearing the same attire. This caused
detections of small objects captured at a distance to be skipped or suppressed. It also caused
difficulties in target tracking and data association issues across multiple cameras [94], whereby
each camera scene needed to be merged in with those of the different cameras on the network.
DLMs had trouble learning the incoming tracks and differentiating various detections, as shown in
Figs. 6, 9, and 10(b)-(d), but they signified promising progress towards real-time MOT systems in
handling the object observation formulation, affinity, and data association problem. They used the
detectors to pass the generated detections to the trackers as the input for data associations.
This compensated for the missing detections shown in Fig. 10(a) but struggled with weak detections
and a lack of tracklet association for small objects. This caused a high volume of mistrusted and
detached detections that degraded tracking accuracy, as depicted in Fig. 9. The approaches
struggled to handle objects’ appearance variations and ReID; they also could not learn the
appearance features very well under crowded scenes and motion variations. However, the proposed
approaches could differentiate various detections and effectively learned the incoming tracks, as
shown in Figs. 6 and 10(b)-(d). The strategy improved the quality of detection and tracking
performance. The approaches with low-quality detections could not maintain the object appearance
features and yielded a poor tracking performance. This point is illustrated in Fig. 7, where
high-speed tracking methods tend to skip the objects traveling slowly and struggle to handle
objects’ appearance variations and motion variations. The detection accuracy inconsistency caused a
failure for the approaches to make decisions on which targets are true incomers or leavers. This
affected the overall tracking accuracy rate compared to that with low-speed tracking.
The complexity and uncertainty in these approaches can be seen in Figs. 8 and 10(a), where the
multi-camera view calculations of the motion trajectories from entering and exiting the views lie
in multifarious aspects. The mistrusted detections degrade the tracking accuracy performance.
Although there is noticeable progress, it is important to note that this review reports only on
deep learning techniques based on real-time MOTs, as categorized in Fig. 4. The DLMs have not been
implemented thoroughly to solve real-world problems. Thus, many challenges persist, and more
studies need to be conducted to include a broadened scope on vehicles and top-view multiple object
tracking using drones.
For further research, it would be advisable to compare the traditional CNN with DCNN techniques
based on experimental evaluations. It would also be important to include other techniques, such as
deep convolutional




generative adversarial networks (DCGANs), for a robust algorithm to handle the challenges that have
been reported under complex environments.

V. CONCLUSION
This review paper analyzes and summarizes the latest progress and challenges in real-time MOTs. We
analyzed several papers on deep learning techniques used in real-time multiple object tracking. We
further described and discussed the best results for the four main themes: online MOTs based on
detection quality and associations, real-time MOTs with high-speed tracking and low computational
cost, modeling target uncertainty in online MOTs, and CNN, affinity, and data association. For each
theme, several papers are considered to illustrate the main challenges of the most popular
solutions proposed by the authors.
Until now, there has been no review of the various recent DLMs for online MOTs. Deep learning
strategies are already widely used in real-time MOTs. Our analysis shows that DLMs improve the
handling of multiple object detections and trajectory associations across sequential frames under
challenging environments. The results could be used for further improvement of the solutions’
efficiency and robustness on surveillance security management systems. They can also be used for
further studies in real-time MOT algorithms to promote the sustainable development goal (SDG) 16 by
contributing to adequate and timely decision-making by committees and justice institutions that
protect and save lives in smart cities.

REFERENCES
[1] S. Moon, J. Lee, D. Nam, H. Kim, and W. Kim, “A comparative study on multi-object tracking methods for sports events,” in Proc. 19th Int. Conf. Adv. Commun. Technol. (ICACT), Bongpyeong, South Korea, 2017, pp. 883–885.
[2] E. Bochinski, V. Eiselein, and T. Sikora, “High-speed tracking-by-detection without using image information,” in Proc. 14th IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), Lecce, Italy, Aug. 2017, pp. 1–6.
[3] R. Girdhar, G. Gkioxari, L. Torresani, M. Paluri, and D. Tran, “Detect-and-track: Efficient pose estimation in videos,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 350–359.
[4] E. Ristani and C. Tomasi, “Features for multi-target multi-camera tracking and re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 6036–6046.
[5] M. Tiwari and R. Singhai, “A review of detection and tracking of object from image and video sequences,” Int. J. Comput. Intell. Res., vol. 13, no. 5, pp. 745–765, Mar. 2017, doi: 10.1109/cis.2009.13.
[6] D. M. Patel, U. K. Jaliya, and H. D. Vasava, “Multiple object detection and tracking: A survey,” Int. J. Res. Appl. Sci. Eng. Technol., vol. 6, no. 2, pp. 809–813, Apr. 2018.
[7] K. Soomro, H. Idrees, and M. Shah, “Predicting the where and what of actors and actions through online action localization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Las Vegas, NV, USA, Jun. 2016, pp. 2648–2657.
[8] X. Jiang, Z. Fang, N. N. Xiong, Y. Gao, B. Huang, J. Zhang, L. Yu, and P. Harrington, “Data fusion-based multi-object tracking for unconstrained visual sensor networks,” IEEE Access, vol. 6, pp. 13716–13728, Apr. 2018, doi: 10.1109/access.2018.2812794.
[9] R. Martín-Nieto, Á. García-Martín, J. M. Martínez, and J. C. Sanmiguel, “Enhancing multi-camera people detection by online automatic parametrization using detection transfer and self-correlation maximization,” Sensors, vol. 18, no. 12, p. 4385, Jul. 2018, doi: 10.3390/s18124385.
[10] S. Tian, F. Yuan, and G.-S. Xia, “Multi-object tracking with inter-feedback between detection and tracking,” Neurocomputing, vol. 171, pp. 768–780, Jan. 2016, doi: 10.1016/j.neucom.2015.07.028.
[11] M. Z. Islam, M. S. Islam, and M. S. Rana, “Problem analysis of multiple object tracking system: A critical review,” IJARCCE, vol. 4, no. 11, pp. 374–377, Nov. 2015, doi: 10.17148/ijarcce.2015.41183.
[12] K. S. Ray and S. Chakraborty, “An efficient approach for object detection and tracking of objects in a video with variable background,” 2017, arXiv:1706.02672. [Online]. Available: http://arxiv.org/abs/1706.02672
[13] J. Wei, M. Yang, and F. Liu, “Learning spatio-temporal information for multi-object tracking,” IEEE Access, vol. 5, pp. 3869–3877, Jan. 2017, doi: 10.1109/access.2017.2686482.
[14] S. Scheidegger, J. Benjaminsson, E. Rosenberg, A. Krishnan, and K. Granstrom, “Mono-camera 3D multi-object tracking using deep learning detections and PMBM filtering,” in Proc. IEEE Intell. Vehicles Symp., Changshu, China, Jun. 2018, pp. 433–440.
[15] L. Fagot-Bouquet, R. Audigier, Y. Dhome, and F. Lerasle, Improving Multi-Frame Data Association With Sparse Representations for Robust Near-Online Multi-Object Tracking (Lecture Notes in Computer Science: Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), B. M. Leibe, J, N. Sebe, and M. Welling, Eds. Cham, Switzerland: Springer, 2016, pp. 774–790.
[16] A. Milan, L. Leal-Taixe, K. Schindler, and I. Reid, “Joint tracking and segmentation of multiple targets,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 5397–5406.
[17] S. Schulter, P. Vernaza, W. Choi, and M. Chandraker, “Deep network flow for multi-object tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, Jul. 2017, pp. 2730–2739.
[18] D. Wang, W. Fang, W. Chen, T. Sun, and T. Chen, “Model update strategies about object tracking: A state of the art review,” Electron., vol. 8, no. 11, pp. 1–31, Oct. 2019, doi: 10.3390/electronics8111207.
[19] M. Fiaz, A. Mahmood, and S. Ki Jung, “Tracking noisy targets: A review of recent object tracking approaches,” 2018, arXiv:1802.03098. [Online]. Available: http://arxiv.org/abs/1802.03098
[20] Z. He, J. Li, D. Liu, H. He, and D. Barber, “Tracking by animation: Unsupervised learning of multi-object attentive trackers,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Long Beach, CA, USA, Jun. 2019, pp. 1318–1327.
[21] G. Ning, Z. Zhang, C. Huang, X. Ren, H. Wang, C. Cai, and Z. He, “Spatially supervised recurrent convolutional neural networks for visual object tracking,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Baltimore, MD, USA, May 2017, pp. 1–4.
[22] X. Weng and K. Kitani, “A baseline for 3D multi-object tracking,” 2019, arXiv:1907.03961. [Online]. Available: http://arxiv.org/abs/1907.03961
[23] A. N. Ruchay, V. I. Kober, and I. E. Chernoskulov, “Real-time tracking of multiple objects with locally adaptive correlation filters,” in Proc. Image Process., Geoinformation Technol. Inf. Secur., Samara Oblast, Russia, 2017, pp. 214–218.
[24] S. Sharma, J. A. Ansari, J. K. Murthy, and K. M. Krishna, “Beyond pixels: Leveraging geometry and shape cues for online multi-object tracking,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), Brisbane, QLD, Australia, May 2018, pp. 3508–3515.
[25] A. Milan, L. Leal-Taixe, I. Reid, S. Roth, and K. Schindler, “MOT16: A benchmark for multi-object tracking,” 2016, arXiv:1603.00831. [Online]. Available: http://arxiv.org/abs/1603.00831
[26] L. Wen, D. Du, Z. Cai, Z. Lei, M.-C. Chang, H. Qi, J. Lim, M.-H. Yang, and S. Lyu, “UA-DETRAC: A new benchmark and protocol for multi-object detection and tracking,” Comput. Vis. Image Understand., vol. 193, Apr. 2020, Art. no. 102907, doi: 10.1016/j.cviu.2020.102907.
[27] C.-W. Wu, M.-T. Zhong, Y. Tsao, S.-W. Yang, Y.-K. Chen, and S.-Y. Chien, “Track-clustering error evaluation for track-based multi-camera tracking system employing human re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Honolulu, HI, USA, Jul. 2017, pp. 1416–1424.
[28] V. P. Bhuvana, M. Schranz, C. S. Regazzoni, B. Rinner, A. M. Tonello, and M. Huemer, “Multi-camera object tracking using surprisal observations in visual sensor networks,” EURASIP J. Adv. Signal Process., vol. 2016, no. 1, p. 50, Apr. 2016, doi: 10.1186/s13634-016-0347-x.
[29] S. Bei, Z. Zhen, L. Wusheng, D. Liebo, and L. Qin, “Visual object tracking challenges revisited: VOT vs. OTB,” PLoS ONE, vol. 13, no. 9, Sep. 2018, Art. no. e0203188, doi: 10.1371/journal.pone.0203188.
[30] S. Lee and H. Hong, “Use of gradient-based shadow detection for estimating environmental illumination distribution,” Appl. Sci., vol. 8, no. 11, pp. 1–13, Nov. 2018, doi: 10.3390/app8112255.




[31] A. K. M. Azad and M. Misbahuddin, “Web-based object tracking using collaborated camera network,” Adv. Internet Things, vol. 8, no. 2, pp. 13–25, 2018, doi: 10.4236/ait.2018.82002.
[32] S. Fajardo, F. R. García-Galvan, V. Barranco, J. C. Galvan, and S. F. Batlle, “Multi-person tracking based on faster R-CNN and deep appearance features,” Vis. Object Tracking Deep Neural Netw., vol. 1, p. 13, Dec. 2016, doi: 10.5772/intechopen.85215.
[33] J. H. Yoon, M.-H. Yang, J. Lim, and K.-J. Yoon, “Bayesian multi-object tracking using motion context from multiple objects,” in Proc. IEEE Winter Conf. Appl. Comput. Vis., Jan. 2015, pp. 33–40, doi: 10.1109/WACV.2015.12.
[34] S. Li, G. Battistelli, L. Chisci, W. Yi, B. Wang, and L. Kong, “Computationally efficient multi-agent multi-object tracking with labeled random finite sets,” IEEE Trans. Signal Process., vol. 67, no. 1, pp. 260–275, Jan. 2019, doi: 10.1109/tsp.2018.2880704.
[35] I. A. Iswanto and B. Li, “Visual object tracking based on mean-shift and particle-Kalman filter,” Procedia Comput. Sci., vol. 116, pp. 587–595, 2017, doi: 10.1016/j.procs.2017.10.010.
[36] H. Kieritz, S. Becker, W. Hubner, and M. Arens, “Online multi-person tracking using integral channel features,” in Proc. 13th IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), Colorado Springs, CO, USA, Aug. 2016, pp. 122–130.
[37] Q. C. Le, D. Conte, and M. Hidane, “Online multiple view tracking: Targets association across cameras,” in Proc. 6th Workshop Activity Monitor. Multiple Distrib. Sens. (AMMDS), Newcastle, U.K., Jul. 2018, pp. 1–12, Paper hal-01880374.
[38] J. Ju, D. Kim, B. Ku, D. K. Han, and H. Ko, “Online multi-object tracking with efficient track drift and fragmentation handling,” J. Opt. Soc. Amer. A, Opt. Image Sci., vol. 34, no. 2, p. 280, Jan. 2017, doi: 10.1364/josaa.34.000280.
[39] J. Wang, X. Zeng, W. Luo, and W. An, “The application of neural network in multiple object tracking,” in Proc. Int. Conf. Comput. Sci. Softw. Eng. (CSSE), Jul. 2018, pp. 258–264, doi: 10.12783/dtcse/csse2018/24504.
[40] J. Son, M. Baek, M. Cho, and B. Han, “Multi-object tracking with quadruplet convolutional neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, Jul. 2017, pp. 3786–3795.
[41] J. Zhu, H. Yang, N. Liu, M. Kim, W. Zhang, and M. H. Yang, Online Multi-Object Tracking With Dual Matching Attention Networks (Lecture Notes in Computer Science: Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds. Cham, Switzerland: Springer, 2018, pp. 379–396.
[42] D. Moher, A. Liberati, J. Tetzlaff, D. G. Altman, and P. Group, “Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement,” Brit. Med. J., vol. 339, no. 7716, pp. 332–336, Nov. 2009, doi: 10.1136/bmj.b2535.
[43] A. Osep, A. Hermans, F. Engelmann, D. Klostermann, M. Mathias, and B. Leibe, “Multi-scale object candidates for generic object tracking in street scenes,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), Stockholm, Sweden, May 2016, pp. 3180–3187.
[44] V. Chari, S. Lacoste-Julien, I. Laptev, and J. Sivic, “On pairwise costs for network flow multi-object tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Boston, MA, USA, Jun. 2015, pp. 5537–5545.
[45] A. Bai and R. Simmons, “Multi-object tracking and identification via particle filtering over sets,” in Proc. 20th Int. Conf. Inf. Fusion (FUSION), Xi’an, China, Mar. 2017, pp. 10–13.
[46] J. Shin, H. Kim, D. Kim, and J. Paik, “Fast and robust object tracking using tracking failure detection in kernelized correlation filter,” Appl. Sci., vol. 10, no. 2, p. 713, Jan. 2020, doi: 10.3390/app10020713.
[47] Y. Xiang, A. Alahi, and S. Savarese, “Learning to track: Online multi-object tracking by decision making,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Santiago, Chile, Dec. 2015, pp. 4705–4713.
[48] S. Sun, N. Akhtar, H. Song, A. S. Mian, and M. Shah, “Deep affinity network for multiple object tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, no. 9, p. 1, May 2019, doi: 10.1109/tpami.2019.2929520.
[49] T. Kutschbach, E. Bochinski, V. Eiselein, and T. Sikora, “Sequential sensor fusion combining probability hypothesis density and kernelized correlation filters for multi-object tracking in video data,” in Proc. 14th IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), Lecce, Italy, Aug. 2017, pp. 1–5.
[50] D. Zhao, H. Fu, L. Xiao, T. Wu, and B. Dai, “Multi-object tracking with correlation filter for autonomous vehicle,” Sensors, vol. 18, no. 7, pp. 1–17, Mar. 2018, doi: 10.3390/s18072004.
[51] R. Sanchez-Matilla, F. Poiesi, and A. Cavallaro, “Online multi-target tracking with strong and weak detections,” in Proc. Comput. Vis. ECCV Workshops, in Lecture Notes in Computer Science, G. Hua and H. Jégou, Eds. Cham, Switzerland: Springer, 2016, pp. 84–99.
[52] Z. Zhang, J. Wu, X. Zhang, and C. Zhang, “Multi-target, multi-camera tracking by hierarchical clustering: Recent progress on DukeMTMC project,” 2017, arXiv:1712.09531. [Online]. Available: http://arxiv.org/abs/1712.09531
[53] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan, “SiamRPN++: Evolution of siamese visual tracking with very deep networks,” 2018, arXiv:1812.11703. [Online]. Available: http://arxiv.org/abs/1812.11703
[54] S. Tang, B. Andres, M. Andriluka, and B. Schiele, “Subgraph decomposition for multi-target tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Boston, MA, USA, Jun. 2015, pp. 5033–5041.
[55] Q. Chu, W. Ouyang, H. Li, X. Wang, B. Liu, and N. Yu, “Online multi-object tracking using CNN-based single object tracker with spatial-temporal attention mechanism,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Venice, Italy, Oct. 2017, pp. 4846–4855.
[56] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Las Vegas, NV, USA, Jun. 2016, pp. 779–788.
[57] M. Keuper, S. Tang, B. Andres, T. Brox, and B. Schiele, “Motion segmentation & multiple object tracking by correlation co-clustering,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 1, pp. 1–13, Oct. 2018, doi: 10.1109/TPAMI.2018.2876253.
[58] L. Chen and M. Ren, “Multi-appearance segmentation and extended 0-1 programming for dense small object tracking,” PLoS ONE, vol. 13, no. 10, pp. 1–14, Aug. 2018, doi: 10.1371/journal.pone.0206168.
[59] G. Wang, Y. Wang, H. Zhang, R. Gu, and J. N. Hwang, “Exploit the connectivity: Multi-object tracking with TrackletNet,” in Proc. 27th ACM Int. Conf. Multimedia (MM), New York, NY, USA, Oct. 2019, pp. 482–490.
[60] A. R. Zamir, A. Dehghan, and M. Shah, GMCP-Tracker: Global Multi-Object Tracking Using Generalized Minimum Clique Graphs (Lecture Notes in Computer Science: Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), A. Fitzgibbon, S. Lazebnik, P. S. Perona, and Y. C. Schmid, Eds. Berlin, Germany: Springer, 2012, pp. 343–356.
[61] C. Kim, F. Li, and J. M. Rehg, Multi-Object Tracking With Neural Gating Using Bilinear LSTM (Lecture Notes in Computer Science: Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds. Cham, Switzerland: Springer, 2018, pp. 208–224.
[62] N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Beijing, China, Sep. 2017, pp. 3645–3649.
[63] W. Gan, S. Wang, X. Lei, M.-S. Lee, and C.-C.-J. Kuo, “Online CNN-based multiple object tracking with enhanced model updates and identity association,” Signal Process., Image Commun., vol. 66, pp. 95–102, Aug. 2018, doi: 10.1016/j.image.2018.05.008.
[64] A. Kampker, M. Sefati, A. S. A. Rachman, K. Kreisköther, and P. Campoy, “Towards multi-object detection and tracking in urban scenario under uncertainties,” in Proc. 4th Int. Conf. Vehicle Technol. Intell. Transp. Syst., Setúbal, Portugal, 2018, pp. 156–167.
[65] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Phoenix, AZ, USA, Sep. 2016, pp. 3464–3468.
[66] Z. Wang, L. Zheng, Y. Liu, Y. Li, and S. Wang, “Towards real-time multi-object tracking,” in Proc. Comput. Vis. ECCV, in Lecture Notes in Computer Science, A. B. Vedaldi, H, T. Brox, J. M. Frahm, Eds. Cham, Switzerland: Springer, 2020, pp. 107–122.
[67] A. Kumar K. C, L. Jacques, and C. De Vleeschouwer, “Discriminative and efficient label propagation on complementary graphs for multi-object tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 1, pp. 61–74, Jan. 2017, doi: 10.1109/TPAMI.2016.2533391.
[68] L. Leal-Taixe, C. Canton-Ferrer, and K. Schindler, “Learning by tracking: Siamese CNN for robust target association,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Las Vegas, NV, USA, Jun. 2016, pp. 418–425.
[69] J. Huang and W. Zhou, “Online multi-target tracking using recurrent neural networks,” in Proc. 31st AAAI Conf. Artif. Intell. (AAAI), San Francisco, CA, USA, Jul. 2019, pp. 4225–4232.
[70] Y.-C. Yoon, Y.-M. Song, K. Yoon, and M. Jeon, “Online multi-object tracking using selective deep appearance matching,” in Proc. IEEE Int. Conf. Consum. Electron. Asia (ICCE-Asia), Jeju, South Korea, Jun. 2018, pp. 206–212.




[71] A. Scheel, C. Knill, S. Reuter, and K. Dietmayer, “Multi-sensor multi-object tracking of vehicles using high-resolution radars,” in Proc. IEEE Intell. Vehicles Symp. (IV), Gothenburg, Sweden, Jun. 2016, pp. 558–565.
[72] B. Lee, E. Erdenee, S. Jin, M. Y. Nam, Y. G. Jung, and P. K. Rhee, Multi-Class Multi-Object Tracking Using Changing Point Detection (Lecture Notes in Computer Science: Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), B. Lee, E. Erdenee, S. Jin, M. Y. Nam, Y. G. Jung, and P. K. Rhee, Eds. Amsterdam, The Netherlands: Springer, 2016, pp. 68–83.
[73] L. Chen, H. Ai, Z. Zhuang, and C. Shang, “Real-time multiple people tracking with deeply learned candidate selection and person re-identification,” in Proc. IEEE Int. Conf. Multimedia Expo (ICME), San Diego, CA, USA, Jul. 2018, pp. 1–6.
[74] L. Hou, W. Wan, J.-N. Hwang, R. Muhammad, M. Yang, and K. Han, “Human tracking over camera networks: A review,” EURASIP J. Adv. Signal Process., vol. 2017, no. 1, p. 43, Jun. 2017, doi: 10.1186/s13634-017-0482-z.
[75] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017, doi: 10.1109/TPAMI.2016.2577031.
[76] P. Liu, X. Li, H. Liu, and Z. Fu, “Online learned siamese network with auto-encoding constraints for robust multi-object tracking,” Electronics, vol. 8, no. 6, p. 595, May 2019, doi: 10.3390/electronics8060595.
[77] L. Ren, J. Lu, Z. Wang, Q. Tian, and J. Zhou, Collaborative Deep Reinforcement Learning for Multi-Object Tracking (Lecture Notes in Computer Science: Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds. Cham, Switzerland: Springer, 2018, pp. 605–621.
[78] A. Osep, W. Mehner, P. Voigtlaender, and B. Leibe, “Track, then decide: Category-agnostic vision-based multi-object tracking,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), Brisbane, QLD, Australia, May 2018, pp. 3494–3501.
[79] L. Xiong, X. Zhang, J. Liao, and G. Yang, “Multi-object tracking based on HOG template matching and non-maximum convergence algorithm,” Int. J. Signal Process., Image Process. Pattern Recognit., vol. 10, no. 1, pp. 233–242, Jan. 2017, doi: 10.14257/ijsip.2017.10.1.23.
[80] V. Carletti, A. Greco, A. Saggese, and M. Vento, “Multi-object tracking by flying cameras based on a forward-backward interaction,” IEEE Access, vol. 6, pp. 43905–43919, May 2018, doi: 10.1109/access.2018.2864672.
[81] N. M. Al-Shakarji, G. Seetharaman, F. Bunyak, and K. Palaniappan, “Robust multi-object tracking with semantic color correlation,” in Proc. 14th IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), Lecce, Italy, Aug. 2017, pp. 1–7.
[82] P. Voigtlaender, M. Krause, A. Osep, J. Luiten, B. B. G. Sekar, A. Geiger, and B. Leibe, “MOTS: Multi-object tracking and segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Long Beach, CA, USA, Jun. 2019, pp. 7934–7943.
[90] K. A. Shiva Kumar, K. R. Ramakrishnan, and G. N. Rathna, “Inter-camera person tracking in non-overlapping networks,” in Proc. 11th Int. Conf. Distrib. Smart Cameras, Sep. 2017, pp. 55–62, doi: 10.1145/3131885.3131912.
[91] J. Houssineau, D. E. Clark, S. Ivekovic, C. S. Lee, and J. Franco, “A unified approach for multi-object triangulation, tracking and camera calibration,” IEEE Trans. Signal Process., vol. 64, no. 11, pp. 2934–2948, Jun. 2016, doi: 10.1109/TSP.2016.2523454.
[92] A. Bathija, “Visual object detection and tracking using YOLO and SORT,” Int. J. Eng. Res. Technol., vol. 8, no. 11, pp. 705–708, Mar. 2019.
[93] A. Sharma, S. Anand, and S. K. Kaul, “Reinforcement learning-based querying in camera networks for efficient target tracking,” in Proc. Int. Conf. Automated Planning Scheduling, vol. 29, no. 1, pp. 555–563, Nov. 2020.
[94] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, “Virtual worlds as proxy for multi-object tracking analysis,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Las Vegas, NV, USA, Jun. 2016, pp. 4340–4349.
[95] N. T. L. Anh, F. M. Khan, F. Negin, and F. Bremond, “Multi-object tracking using multi-channel part appearance representation,” in Proc. 14th IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), Lecce, Italy, Aug. 2017, pp. 1–6.

LESOLE KALAKE received the B.S. degree in computer science and statistics and the B.Sc. degree
(Hons.) in applied population science from the University of KwaZulu-Natal, South Africa, in 2004
and 2015, respectively, and the M.Sc. degree in information systems from the Kobe Institute of
Technology, Japan, in 2017. He is currently pursuing the Ph.D. degree with the School of
Information Engineering, Shanghai University. His research interests include machine learning and
video/image processing.

WANGGEN WAN (Senior Member, IEEE) was born in Nanchang, Jiangxi, China, in 1961. He received the
M.S. and Ph.D. degrees in electronic and information engineering from Xidian University, Xi’an,
China, in 1988 and 1992, respectively. He was a Visiting Scholar with the Department of Computer
Science, Minsk Radio Engineering Institute, formerly USSR, from 1991 to 1992, and a Postdoctoral
Research Fellow with Xian Jiaotong University, from 1993 to 1995.
[83] C. Liu, R. Yao, S. H. Rezatofighi, I. Reid, and Q. Shi, ‘‘Multi-object He became a Visiting Professor with the Hong Kong University of Science
model-free tracking with joint appearance and motion inference,’’ in Proc. and Technology, and Hong Kong Polytechnic University, from 1998 to 2004.
Int. Conf. Digit. Image Comput. Techn. Appl. (DICTA), Sydney, NSW, He joined Shanghai University as a Full Professor, in June 2004. He is cur-
Australia, Nov. 2017, pp. 1–8. rently a full-time Professor and a Deputy Dean with the School of Communi-
[84] J. H. Yoon, C.-R. Lee, M.-H. Yang, and K.-J. Yoon, ‘‘Structural constraint cation and Information Engineering, Shanghai University. He has published
data association for online multi-object tracking,’’ Int. J. Comput. Vis., one book, over 150 articles, and ten patents. His current research interests
vol. 127, no. 1, pp. 1–21, Apr. 2018, doi: 10.1007/s11263-018-1087-1.
include multimedia signal processing, data mining, embedded systems, and
[85] Y.-C. Yoon, A. Boragule, Y.-M. Song, K. Yoon, and M. Jeon, ‘‘Online
system-on-chip design in a multimedia systems, digital audio/video process-
multi-object tracking with historical appearance matching and scene adap-
ing, computer architecture, embedded systems, and system-on-chip design.
tive detection filtering,’’ in Proc. 15th IEEE Int. Conf. Adv. Video Signal
Based Surveill. (AVSS), Auckland, New Zealand, Nov. 2018, pp. 1–6.
He is an IET Fellow and an ACM Professional Member.
[86] B.-N. Vo, B.-T. Vo, and M. Beard, ‘‘Multi-sensor multi-object track-
ing with the generalized labeled multi-Bernoulli filter,’’ IEEE Trans.
LI HOU received the B.S. degree in communi-
Signal Process., vol. 67, no. 23, pp. 5952–5967, Dec. 2019, doi:
cation engineering and the M.S. degree in power
10.1109/TSP.2019.2946023.
electronics from the Liaoning University of Tech-
[87] L. Wen, D. Du, S. Li, X. Bian, and S. Lyu, ‘‘Learning non-uniform
hypergraph for multi-object tracking,’’ in Proc. AAAI Conf. Artif. Intell.,
nology, in 2003 and 2006, respectively, and the
vol. 33, Jul. 2019, pp. 8981–8988, doi: 10.1609/aaai.v33i01.33018981. Ph.D. degree in communication and informa-
[88] Y. T. Tesfaye, E. Zemene, A. Prati, M. Pelillo, and M. Shah, ‘‘Multi- tion systems from Shanghai University, in 2017.
target tracking in multiple non-overlapping cameras using fast-constrained In 2006, she joined the School of Information
dominant sets,’’ Int. J. Comput. Vis., vol. 127, no. 9, pp. 1303–1320, Engineering, Huangshan University, where she has
May 2019, doi: 10.1007/s11263-019-01180-6. been an Associate Professor. Her current research
[89] Y. Zou, W. Zhang, W. Weng, and Z. Meng, ‘‘Multi-vehicle tracking via interests include machine learning, video/image
real-time detection probes and a Markov decision process policy,’’ Sensors, processing, and big data mining.
vol. 19, no. 6, p. 1309, Mar. 2019, doi: 10.3390/s19061309.

