Self-Supervised Deep Correlation Tracking
Abstract— The training of a feature extraction network typically requires abundant manually annotated training samples, making this a time-consuming and costly process. Accordingly, we propose an effective self-supervised learning-based tracker in a deep correlation framework (named: self-SDCT). Motivated by the forward-backward tracking consistency of a robust tracker, we propose a multi-cycle consistency loss as self-supervised information for learning the feature extraction network from adjacent video frames. At the training stage, we generate pseudo-labels of consecutive video frames by forward-backward prediction under a Siamese correlation tracking framework and utilize the proposed multi-cycle consistency loss to learn a feature extraction network. Furthermore, we propose a similarity dropout strategy that allows low-quality training sample pairs to be dropped, and also adopt a cycle trajectory consistency loss in each sample pair to improve the training loss function. At the tracking stage, we employ the pre-trained feature extraction network to extract features and utilize a Siamese correlation tracking framework to locate the target using forward tracking alone. Extensive experimental results indicate that the proposed self-supervised deep correlation tracker (self-SDCT) achieves competitive tracking performance compared with state-of-the-art supervised and unsupervised tracking methods on standard evaluation benchmarks.

Index Terms— Visual tracking, self-supervised learning, multi-cycle consistency loss.

Manuscript received June 7, 2020; revised October 15, 2020; accepted October 30, 2020. Date of publication December 1, 2020; date of current version December 10, 2020. This work was supported in part by the National Natural Science Foundation of China under Grant 61672183, in part by the Shenzhen Research Council under Grant JCYJ2017041310455226946 and Grant JCYJ20170815113552036, in part by the PCL Future Greater-Bay Area Network Facilities for Large-scale Experiments and Applications under Project PCL2018KP001, and in part by the Verification Platform of Multi-tier Coverage Communication Network for Oceans under Project PCL2018KP002. The work of Di Yuan was supported by a scholarship from the China Scholarship Council. The work of Xiaojun Chang was supported in part by the Australian Research Council (ARC) Discovery Early Career Researcher Award (DECRA) under Grant DE190100626. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Sos S. Agaian. (Corresponding author: Zhenyu He.)

Di Yuan and Qiao Liu are with the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China (e-mail: [email protected]; [email protected]).
Xiaojun Chang is with the Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia, and also with the Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia (e-mail: [email protected]).
Po-Yao Huang is with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA (e-mail: [email protected]).
Zhenyu He is with the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China, and also with the Peng Cheng Laboratory, Shenzhen 518055, China (e-mail: [email protected]).
Digital Object Identifier 10.1109/TIP.2020.3037518

I. INTRODUCTION

SINGLE target tracking is both a very hot and important topic, with numerous applications including video surveillance, autonomous vehicles, man-machine interaction, etc. The core task of tracking is to provide the ground-truth of a selected target at the first frame and use a tracker to accurately predict the target position in all consecutive video frames. Recently, trackers that depend on deep convolutional neural networks (CNNs) trained on manually annotated images have shown promising tracking performance. However, it is still a tough problem to train an efficient feature extraction network in the deep learning-based tracking framework because of the limited amount of labeled training data.

Tracking methods based on a deep CNN structure have recently achieved remarkable performance and have become increasingly popular in the tracking community [1]–[5]. Usually, these deep CNN-based trackers exploit a pre-trained network for feature extraction purposes, then use a correlation or similarity function to calculate a similarity score between the template sample and the candidate samples, after which they choose the candidate with the maximum score as the target in the current image frame [1], [2], [6]–[8]. While these methods have improved performance relative to trackers based on hand-crafted features, online tracking without updating limits their generalization capabilities. Although several trackers have attempted to employ deep networks for feature expression, when the target is unknown during training, it is necessary to adapt the weights of the network online by executing Stochastic Gradient Descent (SGD), which significantly affects the tracking speed [9]–[11]. In [12], Bertinetto et al. propose the SiamFC tracker, which focuses on learning a similarity function of target and candidates in the offline phase and achieves remarkable tracking performance compared with other trackers from the same period. The ECO [4] tracker introduces a factorized convolution operator into the discriminative correlation filter model and proposes a generative model to enhance sample diversity, which improves both tracking accuracy and speed.

However, there are two major shortcomings of these deep CNN-based trackers. The first is that the feature extraction network requires numerous manually annotated samples for training. These manually annotated training samples are very limited, and obtaining them is time-consuming and costly, meaning that a feature extraction network trained on limited labeled samples is unable to represent the target features well. The second is that most deep convolutional network-based trackers require a network with multiple layers to extract features and must fine-tune their pre-trained networks in the online tracking phase, which results in high computational complexity. Some deep CNN-based trackers are unable to achieve a real-time tracking speed because of the high dimensionality of the feature extraction network [7], [9], [13].
Fig. 1. Comparison of tracking speed and AUC score of our self-SDCT tracker and other deep learning based trackers on the OTB-100 dataset.

Fig. 2. Tracking example of the proposed self-SDCT tracker and other supervised and unsupervised trackers.
For example, the MDNet [9] tracker needs to pre-train a deep CNN architecture for the similarity-matching task. In the tracking stage, the MDNet tracker uses an SGD strategy to learn a detector with candidates extracted from the current sequence. This approach cannot obtain a real-time tracking speed because of its high computational consumption. As shown in Fig. 1, the computational overhead prevents trackers with deep features from achieving real-time performance (e.g., SINT [2], MCPF [14], and CREST [15]).

To solve the above two problems, in this work, we develop a robust and efficient deep correlation-based tracker with two key components: a self-supervised learning-based pre-trained deep feature extraction network and an efficient deep correlation tracking framework. Unlike most supervised and unsupervised deep trackers, our self-supervised self-SDCT tracker obtains a competitive tracking performance (see Fig. 1). Despite the limitations in the number of labeled training samples, there are abundant unlabeled video sequences available for self-supervised learning. In light of this observation, we propose to train the feature extraction network via self-supervised learning, so that only the label of the target in the initial frame is needed. After the initial target's ground-truth is provided, we use a correlation filter approach to generate pseudo-labels for the other samples, and also use a cycle-consistency loss for network training. The cycle-consistency loss of most network training methods only measures the difference between the original state and the final state after a forward-backward prediction. Different from these methods, we use a multi-cycle consistency loss for our network training, which considers both the final result (Fig. 4: Final-Loss) and the intermediate results (Fig. 4: Mid-Loss). The multi-cycle consistency can improve the robustness of our feature extraction network. In addition, to alleviate the impact of low-quality training sample pairs, we propose a low similarity dropout strategy to drop these training sample pairs. Besides, the target and background can be better distinguished through the cyclic trajectory consistency of the target, which can reduce the influence of background information on the feature extraction network. Both the low similarity dropout strategy and the cycle trajectory consistency loss improve the feature extraction network effectively. Once the network training is completed, we apply it to an efficient Siamese correlation tracking framework to track the target; the average tracking speed is around 48 fps. Compared to other supervised tracking methods (such as CFNet [16] and SiamFC [12]) and unsupervised tracking methods (such as UDT [17]), our self-SDCT tracker achieves competitive tracking results (see Fig. 2).
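To make the training objective concrete, the following is a minimal numpy sketch of the two ideas just described, assuming trajectories are represented as 2-D center positions; the squared-error form and the weights final_weight, mid_weight, and drop_rate are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def multi_cycle_loss(fwd, bwd, final_weight=1.0, mid_weight=0.5):
    """Multi-cycle consistency on 2-D target positions (illustrative form).

    fwd: (T, 2) positions predicted while tracking frame 1 -> T (pseudo-labels).
    bwd: (T, 2) positions predicted while tracking back from frame T -> 1,
         stored so that bwd[t] aligns with frame t.
    The final term is the round-trip error at the first frame; the mid terms
    also penalize disagreement at every intermediate frame.
    """
    fwd, bwd = np.asarray(fwd), np.asarray(bwd)
    final_loss = np.sum((bwd[0] - fwd[0]) ** 2)
    mid = bwd[1:-1] - fwd[1:-1]
    mid_loss = np.mean(np.sum(mid ** 2, axis=1)) if len(mid) else 0.0
    return final_weight * final_loss + mid_weight * mid_loss

def similarity_dropout(pair_losses, pair_similarities, drop_rate=0.1):
    """Drop the fraction of sample pairs with the lowest template/search
    similarity before averaging their losses (hypothetical criterion)."""
    pair_losses = np.asarray(pair_losses)
    order = np.argsort(pair_similarities)          # ascending similarity
    keep = order[int(len(pair_losses) * drop_rate):]
    return float(np.mean(pair_losses[keep]))
```

The essential point is that the objective penalizes the round trip at every visited frame rather than only at the end, while low-similarity pairs are excluded from the averaged loss.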
The main contributions of this paper are as follows:
• We formulate a multi-cycle consistency loss-based self-supervised learning scheme to pre-train the deep feature extraction network, which can take advantage of extensive unlabeled video samples rather than limited manually annotated samples.
• We use a multi-cycle consistency loss, a low similarity dropout, and a cycle trajectory consistency loss to pre-train our feature extraction network jointly, which can effectively improve its representational ability and reduce the overfitting risk.
• We conduct extensive experimental evaluations to demonstrate the competitiveness of our self-SDCT tracker against state-of-the-art supervised and unsupervised trackers on large benchmarks: OTB-2013 [18], OTB-100 [19], UAVDT [20], TColor-128 [21], and UAV-123 [22].

II. RELATED WORKS

In this section, we review the relevant literature regarding deep correlation tracking algorithms, self-supervised learning for feature representation algorithms, and cycle consistency in time series.

A. Deep Correlation Tracking

Trackers based on a deep correlation structure have gained increasing attention. Siamese architecture-based tracking methods formulate the tracking task as a cross-correlation problem [2], [12], [16], [23]–[26]. The SINT [2] tracker proposes to train a Siamese network that determines the target location by finding the maximum similarity between candidate samples and the initial target. The SiamFC [12] tracker incorporates a fully-convolutional network for tracking tasks, which demonstrates the powerful representation ability of the offline-trained feature extraction network. Currently, Siamese network-based trackers [27]–[30] enhance their tracking accuracy by adding a region proposal network (RPN) module. In [27], in order to obtain high accuracy and real-time
tracking performance, Li et al. present the SiamRPN tracker, which discards the multi-scale test and online fine-tuning. However, the SiamRPN tracker is susceptible to interference from similar objects in the tracking scene, which reduces tracking performance. Fan et al. [30] provide a Siamese network-based cascaded RPN tracker (SiamCRPN). The SiamCRPN tracker gradually refines the target's position in each RPN through adjusted anchor boxes, thereby making the target positioning more accurate.
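As a sketch of the cross-correlation formulation that these Siamese trackers share, the snippet below slides an embedded template over an embedded search region and reads off the peak of the response map; the single-channel features and the plain dot-product similarity are simplifying assumptions.

```python
import numpy as np

def cross_correlation_response(template_feat, search_feat):
    """Dense cross-correlation of a template feature map over a larger
    search-region feature map (single channel, 'valid' sliding window).

    template_feat: (h, w) embedded template.
    search_feat:   (H, W) embedded search region, with H >= h and W >= w.
    Returns a (H-h+1, W-w+1) response map; its argmax is the predicted
    target displacement, as in SiamFC-style trackers.
    """
    h, w = template_feat.shape
    H, W = search_feat.shape
    resp = np.empty((H - h + 1, W - w + 1))
    for i in range(resp.shape[0]):
        for j in range(resp.shape[1]):
            resp[i, j] = np.sum(template_feat * search_feat[i:i + h, j:j + w])
    return resp
```

In a learned tracker, both inputs would first pass through the same feature extraction network; here they are taken as given, and `np.unravel_index(resp.argmax(), resp.shape)` yields the peak location.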
Besides, correlation filter (CF)-based tracking methods train a linear template to discriminate between an image patch and its translations. Benefiting from their formulation in the Fourier domain, CF-based trackers can achieve a fast tracking speed [31]–[33]. Therefore, to improve the tracking performance of CF-based trackers, research has been carried out from different aspects, e.g., scale estimation [34], spatio-temporal context [35], [36], learning models [37], non-linear kernels [38] and boundary effects [39]–[41].
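The Fourier-domain formulation that makes CF trackers fast admits a closed-form, element-wise solution; below is a single-channel MOSSE/KCF-style sketch, where the Gaussian label y and the regularizer lam are standard choices rather than details taken from any specific tracker above.

```python
import numpy as np

def train_correlation_filter(x, y, lam=1e-2):
    """Closed-form correlation filter in the Fourier domain.
    x: (H, W) training patch; y: (H, W) desired Gaussian response.
    The solution is element-wise, which is what makes CF training fast."""
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return np.conj(X) * Y / (np.conj(X) * X + lam)

def detect(W, z):
    """Apply the learned filter to a new patch z; the peak of the real
    response map gives the predicted target translation."""
    resp = np.real(np.fft.ifft2(W * np.fft.fft2(z)))
    return np.unravel_index(resp.argmax(), resp.shape)
```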
Meanwhile, some deep learning-based tracking methods have attempted to treat the correlation filter as an additional layer in their network structure to achieve a faster tracking speed. The CFNet [16] tracker integrates the correlation filter into the Siamese network-based tracking framework and gives a closed-form solution. The C-COT [42] tracker introduces an effective expression for training continuous convolution filters; moreover, the ECO [4] tracker proposes a factorized convolution operator, which significantly reduces the number of parameters of the C-COT [42] tracker. All of these deep correlation tracking methods use either an off-the-shelf feature extraction network (e.g., VGGNet or AlexNet) to fine-tune on the tracking task or a large number of manually labeled datasets to train their feature extraction networks. However, the former usually brings high computational complexity to the tracker and makes the tracking speed very slow, while the latter typically produces unsatisfactory tracking results due to insufficient labeled training data. Although some changes in network structure can improve the feature representation capacity [43]–[46], insufficient labeled training data is still a major constraint on network performance. Accordingly, unlike the above deep trackers that use a pre-trained feature extraction network with numerous manually labeled training samples, we adopt a self-supervised learning method that trains the network using training data that requires only the initial target's ground-truth, as the tracking task does.
B. Self-supervised Learning for Feature Representation

The learning of feature representations from numerous videos or images has been extensively studied. Wu and Huang [47] proposed a self-supervised learning approach using both supervised and unsupervised training data. Based on this, they derived a discriminant-EM method that automatically selects good classification features. The human visual system often pays more attention to motion information; inspired by this, Pathak et al. [48] propose a motion-based segmentation on videos to obtain particular segments, which are then used as pseudo-labels to train a segmentation convolutional network. Vondrick et al. [49] consider video coloring as a self-supervised learning problem. This method involves learning to associate an area of a color reference frame with a region of a gray frame by learning an embedding, then copying the reference color of the specified area to the gray image. This represents a departure from other methods that use an off-the-shelf approach for tracking to provide a supervisory signal for training [7], [48], [50]. In [51], the authors try to jointly learn optical flow and tracking and consequently point out that these two problems are complementary. Lai et al. [52] proposed a memory-based method to learn a feature representation, which can guarantee pixel-wise correspondences between frames. Our work is inspired by the unsupervised representation learning method of UDT [17], which integrates the tracking algorithm into unsupervised training. We train our deep network for feature representation using a self-supervised learning approach, which only requires an initial target location without any additional information. The supervised information we use to train the deep feature extraction network comes from the pseudo-labels generated by forward-backward tracking.

C. Cycle Consistency in Time Series

Cycle consistency in time series has been widely explored in the literature [53]–[55]. Wang et al. [51] propose to use cycle consistency to learn visual representations, mainly focusing on unifying optical flow and tracking in a single video to achieve a better embedding representation in a self-supervised learning way. Dwibedi et al. [56] train a network using a differentiable temporal cycle-consistency loss to seek correspondences across time in multiple videos [57]. Li et al. [58] propose to track large image patches and establish associations between consecutive video frames. As a representative of cycle consistency in time series, forward-backward consistency has been widely used in tracking tasks. The TLD [59] tracker proposes a forward-backward error to estimate the reliability of a tracking trajectory. Its tracking result is corrected by verifying the trajectory backward and comparing it to the relevant trajectory. The MTA [60] tracker performs forward tracking by predicting the forward-backward consistency of multiple component trackers and identifying the best tracker through a maximum robustness score. The UDT [17] tracker revisits the forward-backward tracking framework and trains a deep tracker in an unsupervised way. However, the above-mentioned cycle consistency in time series only focuses on the final result; this can lead to inaccurate intermediate results even while the final result is accurate. Therefore, we propose a multi-cycle consistency that also considers the intermediate tracking results in the forward-backward tracking process, yielding improved tracking performance.
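A minimal sketch of the forward-backward reliability check used by TLD-style methods: points are tracked forward several frames and then back, and the round-trip displacement flags unreliable points. The median threshold mirrors common practice and is an assumption here, not the exact criterion of each cited tracker.

```python
import numpy as np

def forward_backward_check(p_orig, p_roundtrip):
    """Forward-backward error: distance between each point's original
    location and its location after tracking forward then backward.

    p_orig, p_roundtrip: (N, 2) arrays of point coordinates.
    Returns the per-point error and a reliability mask."""
    err = np.linalg.norm(np.asarray(p_orig) - np.asarray(p_roundtrip), axis=1)
    return err, err <= np.median(err)
```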
III. SELF-SUPERVISED DEEP CORRELATION TRACKING

In this section, we propose the self-supervised deep correlation tracking network for the tracking task. Firstly, we provide a brief review of deep correlation network-based methods in Sec. III-A. We then present the self-supervised learning approach designed to pre-train the feature extraction network with numerous non-labeled data in Sec. III-B. Furthermore, we adopt the multi-cycle consistency loss, low similarity dropout strategy, and cycle trajectory consistency loss to improve the training loss function in Sec. III-C.
D. Model Update

To adapt to target appearance variations in the tracking stage, a linear model update strategy is adopted to update the correlation filter parameters:

W = (1 − δ)W_{t−1} + δW_t,   (9)

where δ is the learning rate and W_t is the current correlation filter.
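Eq. (9) is simply an exponential moving average of the filter parameters; a direct numpy rendering with the update rate δ used in our experiments:

```python
import numpy as np

def update_filter(w_prev, w_curr, delta=0.025):
    """Linear model update of Eq. (9): W = (1 - delta) * W_{t-1} + delta * W_t."""
    return (1.0 - delta) * w_prev + delta * w_curr
```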
IV. EXPERIMENTS

We first introduce the experimental details and the evaluation criterion, then analyze the effectiveness of each component of the proposed self-supervised learning-based pre-trained feature extraction network. Finally, we evaluate our self-supervised learning based self-SDCT tracker alongside some state-of-the-art supervised and unsupervised trackers on the OTB-2013 [18], OTB-100 [19], UAVDT [20], TColor-128 [21], and UAV-123 [22] datasets.

A. Experimental Details and Evaluation Criterion

1) Experimental Details: We follow UDT [17] and DCFNet [24], which apply stochastic gradient descent (SGD) with a momentum of 0.9 to train the feature extraction network. The weight decay is set to 5e-4, and the learning rate is set to 1e-5. The network is trained for 30 epochs with a mini-batch size of 32. The model update learning rate δ is set to 0.025. Our experiments are performed in Matlab2019 on a PC with an i7 4.2GHz CPU and an NVIDIA GTX 2080Ti GPU. The tracking speed is around 48 fps.
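For reference, the stated optimizer settings map directly onto a standard SGD configuration; the sketch below uses PyTorch, and the two-layer backbone is only a placeholder for the actual feature extraction network (the experiments themselves run in MATLAB).

```python
import torch

# Placeholder backbone; the real feature extraction network is not shown here.
backbone = torch.nn.Sequential(
    torch.nn.Conv2d(3, 32, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(32, 32, kernel_size=3, padding=1),
)

# SGD with momentum 0.9, weight decay 5e-4, and learning rate 1e-5, as stated.
optimizer = torch.optim.SGD(backbone.parameters(),
                            lr=1e-5, momentum=0.9, weight_decay=5e-4)

EPOCHS, BATCH_SIZE, UPDATE_RATE = 30, 32, 0.025  # training schedule and delta
```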
2) Evaluation Criterion: We mainly use the precision and success indices [62] introduced in the OTB benchmark [18], [19] to evaluate the tracking performance of our self-SDCT tracking method. The precision index refers to the average distance precision between the predicted position and the ground-truth under different thresholds. Meanwhile, the success index is measured by the average overlap of the tracking result and the ground-truth, and trackers are ranked using the area-under-the-curve (AUC). Moreover, tracking speed is also a significant index for evaluating a tracker.
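Both criteria are straightforward to compute; a numpy sketch, assuming per-frame center errors for precision (with the usual 20-pixel reporting threshold) and per-frame IoU values for the success plot:

```python
import numpy as np

def precision(pred_centers, gt_centers, threshold=20.0):
    """Fraction of frames whose center location error is within `threshold`
    pixels (20 px is the conventional OTB reporting point)."""
    err = np.linalg.norm(np.asarray(pred_centers) - np.asarray(gt_centers), axis=1)
    return float(np.mean(err <= threshold))

def success_auc(ious, thresholds=np.linspace(0.0, 1.0, 21)):
    """Success rate at each overlap threshold; the AUC used for ranking is
    the average of this curve."""
    ious = np.asarray(ious)
    curve = np.array([np.mean(ious > t) for t in thresholds])
    return float(curve.mean())
```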
B. Ablation Study

We carry out ablation studies on the OTB-2013 [18] and OTB-100 [19] benchmarks to analyze the effect of each component in the training process. The comparison results are reported in Table I. Note that self-SDCT denotes the tracking result of the pre-trained network with every component included; self-SDCTsccl denotes the tracking result of the pre-trained network with only the final consistency loss (shown in Fig. 4); self-SDCTolsd denotes the tracking result of the pre-trained network without low similarity dropout; and self-SDCTotcl denotes the tracking result of the pre-trained network without cycle trajectory consistency loss. From Table I we can see that the tracking performance of our self-SDCT tracking method is significantly better than that of the self-SDCTsccl tracker, which benefits from the multi-cycle consistency loss. Moreover, if any of these three components is removed, the tracking performance is reduced. This directly reflects the effectiveness of multi-cycle consistency, low similarity dropout and cycle trajectory consistency in the network training process. We also report the tracking performance of the pre-trained network under different dropout rates in Table II. From this table, we can see that an appropriate dropout rate (e.g., 5%, 10%) brings certain tracking performance improvements to the pre-trained network. However, a larger dropout rate (e.g., 20%) reduces the diversity of the training samples and causes the tracking performance of the pre-trained network to decline. The dropout rate is set to 10% wherever not specifically mentioned in this paper.

C. State-of-the-Art Comparison

In order to verify the proposed self-SDCT tracker, we compare it with some state-of-the-art trackers on standard benchmark datasets including OTB-100 [19], UAVDT [20], TColor-128 [21], and UAV-123 [22].

1) Experiment on OTB-100 Benchmark: We compare our self-SDCT tracker with other trackers, including ATOM [63], SiamRPN [27], MetaCREST [64], UDT+ [17], TRACA [65], ARCF [66], ACFN [67], SiamTri [68], SiamFC [12], DCFNet [24], CFNet [16], CNT [69] and UDT [17], on the OTB-100 [19] dataset. Fig. 6 presents the experimental results of comparing our self-SDCT tracker with these state-of-the-art trackers. From Fig. 6, we can see that our self-SDCT tracker is comparable with the baseline fully-supervised trackers [12], [16], [24]. Compared to CFNet [16], our proposed tracker achieves a 5.2% improvement in terms of the AUC index. Moreover, the accuracy of our self-SDCT tracker is comparable to that of the SiamRPN [27] tracker.
TABLE III
State-of-the-Art Comparison on the UAVDT [20] Dataset in Terms of Precision Scores, Success Scores and Tracking Speed. The First, Second and Third Best Are Highlighted in Red, Blue and Green, Respectively.
TABLE IV
Precision and AUC Scores of the Proposed Self-SDCT Tracker and Other Trackers on the UAV-123 [22] Dataset. The First, Second and Third Best Scores Are Highlighted in Red, Blue and Green, Respectively.

TABLE V
Precision and AUC Scores of Our Self-SDCT Tracker and Other Trackers on the TColor-128 [21] Dataset. The First, Second and Third Best Scores Are Highlighted in Red, Blue and Green, Respectively.
[10] L. Wang, W. Ouyang, X. Wang, and H. Lu, "STCT: Sequentially training convolutional networks for visual tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1373–1381.
[11] Y. Song et al., "VITAL: VIsual tracking via adversarial learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8990–8999.
[12] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr, "Fully-convolutional Siamese networks for object tracking," in Proc. ECCV Workshop, 2016, pp. 850–865.
[13] D. Yuan, X. Li, Z. He, Q. Liu, and S. Lu, "Visual object tracking with adaptive structural convolutional network," Knowl.-Based Syst., vol. 194, Apr. 2020, Art. no. 105554.
[14] T. Zhang, C. Xu, and M.-H. Yang, "Multi-task correlation particle filter for robust object tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4819–4827.
[15] Y. Song, C. Ma, L. Gong, J. Zhang, R. W. H. Lau, and M.-H. Yang, "CREST: Convolutional residual learning for visual tracking," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2574–2583.
[16] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. S. Torr, "End-to-end representation learning for correlation filter based tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2805–2813.
[17] N. Wang, Y. Song, C. Ma, W. Zhou, W. Liu, and H. Li, "Unsupervised deep tracking," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1308–1317.
[18] Y. Wu, J. Lim, and M.-H. Yang, "Online object tracking: A benchmark," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2411–2418.
[19] Y. Wu, J. Lim, and M.-H. Yang, "Object tracking benchmark," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1834–1848, Sep. 2015.
[20] D. Du et al., "The unmanned aerial vehicle benchmark: Object detection and tracking," in Proc. ECCV, 2018, pp. 370–386.
[21] P. Liang, E. Blasch, and H. Ling, "Encoding color information for visual tracking: Algorithms and benchmark," IEEE Trans. Image Process., vol. 24, no. 12, pp. 5630–5644, Dec. 2015.
[22] M. Mueller, N. Smith, and B. Ghanem, "A benchmark and simulator for UAV tracking," in Proc. ECCV, 2016, pp. 445–461.
[23] Z. Liang and J. Shen, "Local semantic Siamese networks for fast tracking," IEEE Trans. Image Process., vol. 29, pp. 3351–3364, 2020.
[24] Q. Wang, J. Gao, J. Xing, M. Zhang, and W. Hu, "DCFNet: Discriminant correlation filters network for visual tracking," 2017, arXiv:1704.04057. [Online]. Available: http://arxiv.org/abs/1704.04057
[25] Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, and S. Wang, "Learning dynamic Siamese network for visual object tracking," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1763–1771.
[26] Y. Zhang, L. Wang, J. Qi, D. Wang, M. Feng, and H. Lu, "Structured Siamese network for real-time visual tracking," in Proc. ECCV, 2018, pp. 351–366.
[27] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu, "High performance visual tracking with Siamese region proposal network," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8971–8980.
[28] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu, "Distractor-aware Siamese networks for visual object tracking," in Proc. ECCV, 2018, pp. 101–117.
[29] M. H. Abdelpakey and M. S. Shehata, "DP-Siam: Dynamic policy Siamese network for robust object tracking," IEEE Trans. Image Process., vol. 29, pp. 1479–1492, 2020.
[30] H. Fan and H. Ling, "Siamese cascaded region proposal networks for real-time visual tracking," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 7952–7961.
[31] T. Zhang, S. Liu, C. Xu, B. Liu, and M.-H. Yang, "Correlation particle filter for visual tracking," IEEE Trans. Image Process., vol. 27, no. 6, pp. 2676–2687, Jun. 2018.
[32] F. Liu, C. Gong, X. Huang, T. Zhou, J. Yang, and D. Tao, "Robust visual tracking revisited: From correlation filter to template matching," IEEE Trans. Image Process., vol. 27, no. 6, pp. 2777–2790, Jun. 2018.
[33] Z. He, S. Yi, Y.-M. Cheung, X. You, and Y. Y. Tang, "Robust object tracking via key patch sparse representation," IEEE Trans. Cybern., vol. 47, no. 2, pp. 354–364, Feb. 2017.
[34] G. Ding, W. Chen, S. Zhao, J. Han, and Q. Liu, "Real-time scalable visual tracking via quadrangle kernelized correlation filters," IEEE Trans. Intell. Transp. Syst., vol. 19, no. 1, pp. 140–150, Jan. 2018.
[35] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg, "Learning spatially regularized correlation filters for visual tracking," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 4310–4318.
[36] D. Yuan, X. Shu, and Z. He, "TRBACF: Learning temporal regularized correlation filters for high performance online visual object tracking," J. Vis. Commun. Image Represent., vol. 72, Oct. 2020, Art. no. 102882.
[37] B. Zhang et al., "Latent constrained correlation filter," IEEE Trans. Image Process., vol. 27, no. 3, pp. 1038–1048, Mar. 2018.
[38] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-speed tracking with kernelized correlation filters," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 583–596, Mar. 2015.
[39] H. K. Galoogahi, A. Fagg, and S. Lucey, "Learning background-aware correlation filters for visual tracking," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1135–1143.
[40] W. Feng, R. Han, Q. Guo, J. Zhu, and S. Wang, "Dynamic saliency-aware regularization for correlation filter-based object tracking," IEEE Trans. Image Process., vol. 28, no. 7, pp. 3232–3245, Jul. 2019.
[41] F. Li, C. Tian, W. Zuo, L. Zhang, and M.-H. Yang, "Learning spatial-temporal regularized correlation filters for visual tracking," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4904–4913.
[42] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg, "Beyond correlation filters: Learning continuous convolution operators for visual tracking," in Proc. ECCV, 2016, pp. 472–488.
[43] C. Tian, Y. Xu, W. Zuo, B. Zhang, L. Fei, and C.-W. Lin, "Coarse-to-fine CNN for image super-resolution," IEEE Trans. Multimedia, early access, Jun. 1, 2020, doi: 10.1109/TMM.2020.2999182.
[44] S. Luan, C. Chen, B. Zhang, J. Han, and J. Liu, "Gabor convolutional networks," IEEE Trans. Image Process., vol. 27, no. 9, pp. 4357–4366, Sep. 2018.
[45] C. Tian, Y. Xu, Z. Li, W. Zuo, L. Fei, and H. Liu, "Attention-guided CNN for image denoising," Neural Netw., vol. 124, pp. 117–129, Apr. 2020.
[46] X. Li, C. Ma, B. Wu, Z. He, and M.-H. Yang, "Target-aware deep tracking," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1369–1378.
[47] Y. Wu and T. S. Huang, "Self-supervised learning for visual tracking and recognition of human hand," in Proc. AAAI, 2000, pp. 243–248.
[48] D. Pathak, R. Girshick, P. Dollar, T. Darrell, and B. Hariharan, "Learning features by watching objects move," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2701–2710.
[49] C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy, "Tracking emerges by colorizing videos," in Proc. ECCV, 2018, pp. 391–408.
[50] X. Wang, K. He, and A. Gupta, "Transitive invariance for self-supervised visual representation learning," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1329–1338.
[51] X. Wang, A. Jabri, and A. A. Efros, "Learning correspondence from the cycle-consistency of time," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 2566–2576.
[52] Z. Lai, E. Lu, and W. Xie, "MAST: A memory-augmented self-supervised tracker," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 6479–6488.
[53] P.-Y. Huang, G. Kang, W. Liu, X. Chang, and A. G. Hauptmann, "Annotation efficient cross-modal retrieval with adversarial attentive alignment," in Proc. 27th ACM Int. Conf. Multimedia, Oct. 2019, pp. 1758–1767.
[54] C. Liu, X. Chang, and Y.-D. Shen, "Unity style transfer for person re-identification," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 6887–6896.
[55] X. Chang, Y.-L. Yu, Y. Yang, and E. P. Xing, "Semantic pooling for complex event analysis in untrimmed videos," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 8, pp. 1617–1632, Aug. 2017.
[56] D. Dwibedi, Y. Aytar, J. Tompson, P. Sermanet, and A. Zisserman, "Temporal cycle-consistency learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1801–1810.
[57] P. Sermanet et al., "Time-contrastive networks: Self-supervised learning from video," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2018, pp. 1134–1141.
[58] X. Li, S. Liu, S. D. Mello, X. Wang, J. Kautz, and M.-H. Yang, "Joint-task self-supervised learning for temporal correspondence," in Proc. NIPS, 2019, pp. 317–327.
[59] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-Learning-Detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1409–1422, Jul. 2012.
[60] D.-Y. Lee, J.-Y. Sim, and C.-S. Kim, "Multihypothesis trajectory analysis for robust visual tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 5088–5096.
[61] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[62] M. Luo, X. Chang, Z. Li, L. Nie, A. G. Hauptmann, and Q. Zheng, "Simple to complex cross-modal learning to rank," Comput. Vis. Image Understand., vol. 163, pp. 67–77, Oct. 2017.
[63] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, "ATOM: Accurate tracking by overlap maximization," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 4660–4669.
[64] E. Park and A. C. Berg, "Meta-tracker: Fast and robust online adaptation for visual object trackers," in Proc. ECCV, 2018, pp. 1–17.
[65] J. Choi et al., "Context-aware deep feature compression for high-speed visual tracking," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 479–488.
[66] Z. Huang, C. Fu, Y. Li, F. Lin, and P. Lu, "Learning aberrance repressed correlation filters for real-time UAV tracking," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 2891–2900.
[67] J. Choi, H. J. Chang, S. Yun, T. Fischer, Y. Demiris, and J. Y. Choi, "Attentional correlation filter network for adaptive visual tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4828–4837.
[68] X. Dong and J. Shen, "Triplet loss in Siamese network for object tracking," in Proc. ECCV, 2018, pp. 459–474.
[69] K. Zhang, Q. Liu, Y. Wu, and M.-H. Yang, "Robust visual tracking via convolutional networks without training," IEEE Trans. Image Process., vol. 25, no. 4, pp. 1779–1792, 2016.
[70] H. Fan and H. Ling, "Parallel tracking and verifying: A framework for real-time and high accuracy visual tracking," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5486–5494.
[71] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. Torr, "Staple: Complementary learners for real-time tracking," in Proc. CVPR, 2016, pp. 1401–1409.
[72] Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, and D. Tao, "MUlti-store tracker (MUSTer): A cognitive psychology inspired approach to object tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 749–758.
[73] Z. Liu, Z. Lian, and Y. Li, "A novel adaptive kernel correlation filter tracker with multiple feature integration," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2017, pp. 254–265.
[74] J. Zhang, S. Ma, and S. Sclaroff, "MEEM: Robust tracking via multiple experts using entropy minimization," in Proc. ECCV, 2014, pp. 188–203.
[75] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg, "Discriminative scale space tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 8, pp. 1561–1575, Aug. 2017.
[76] H. Ma, S. T. Acton, and Z. Lin, "SITUP: Scale invariant tracking using average peak-to-correlation energy," IEEE Trans. Image Process., vol. 29, pp. 3546–3557, 2020.

Di Yuan received the M.S. degree in applied mathematics from the Harbin Institute of Technology, Shenzhen, China, in 2017, where he is currently pursuing the Ph.D. degree in computer science with the Research Institute of Biocomputing, School of Computer Science and Technology. His current research interests include object tracking, machine learning, and self-supervised learning.

Xiaojun Chang (Member, IEEE) received the Ph.D. degree from the Centre for Artificial Intelligence and the Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, in 2016. He is currently a Senior Lecturer with the Faculty of Information Technology, Monash University Clayton Campus, Australia. He is also a Distinguished Adjunct Professor with the Faculty of Computing and Information Technology, King Abdulaziz University. Before joining Monash, he was a Postdoctoral Research Associate with the School of Computer Science, Carnegie Mellon University, working with Prof. Alex Hauptmann. He has spent most of his time working on exploring multiple signals (visual, acoustic, and textual) for automatic content analysis in unconstrained or surveillance videos. He has achieved top performance in various international competitions, such as TRECVID MED, TRECVID SIN, and TRECVID AVS. He is an ARC Discovery Early Career Researcher Award (DECRA) Fellow from 2019 to 2021.

Po-Yao Huang (Member, IEEE) is currently pursuing the Ph.D. degree with the School of Computer Science, Carnegie Mellon University. His research interest includes multimodal machine learning. He is particularly interested in bridging computer vision and natural language processing for the tasks of multimodal machine translation, cross-modal search and retrieval, and large-scale multimodal data mining and analysis.

Qiao Liu received the B.E. degree in computer science from Guizhou Normal University, Guiyang, China, in 2016. He is currently pursuing the Ph.D. degree with the Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China. His current research interests include thermal infrared object tracking and machine learning.

Zhenyu He (Senior Member, IEEE) received the Ph.D. degree from the Department of Computer Science, Hong Kong Baptist University, Hong Kong, in 2007. From 2007 to 2009, he worked as a Postdoctoral Researcher with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. He is currently a Full Professor with the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China. His research interests include machine learning, computer vision, image processing, and pattern recognition.