
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 30, 2021

Self-Supervised Deep Correlation Tracking


Di Yuan, Xiaojun Chang, Member, IEEE, Po-Yao Huang, Member, IEEE, Qiao Liu, and Zhenyu He, Senior Member, IEEE

Abstract— The training of a feature extraction network typically requires abundant manually annotated training samples, making this a time-consuming and costly process. Accordingly, we propose an effective self-supervised learning-based tracker in a deep correlation framework (named self-SDCT). Motivated by the forward-backward tracking consistency of a robust tracker, we propose a multi-cycle consistency loss as self-supervised information for learning the feature extraction network from adjacent video frames. At the training stage, we generate pseudo-labels of consecutive video frames by forward-backward prediction under a Siamese correlation tracking framework and utilize the proposed multi-cycle consistency loss to learn a feature extraction network. Furthermore, we propose a similarity dropout strategy to enable some low-quality training sample pairs to be dropped and also adopt a cycle trajectory consistency loss in each sample pair to improve the training loss function. At the tracking stage, we employ the pre-trained feature extraction network to extract features and utilize a Siamese correlation tracking framework to locate the target using forward tracking alone. Extensive experimental results indicate that the proposed self-supervised deep correlation tracker (self-SDCT) achieves competitive tracking performance compared with state-of-the-art supervised and unsupervised tracking methods on standard evaluation benchmarks.

Index Terms— Visual tracking, self-supervised learning, multi-cycle consistency loss.

Manuscript received June 7, 2020; revised October 15, 2020; accepted October 30, 2020. Date of publication December 1, 2020; date of current version December 10, 2020. This work was supported in part by the National Natural Science Foundation of China under Grant 61672183, in part by the Shenzhen Research Council under Grant JCYJ2017041310455226946 and Grant JCYJ20170815113552036, in part by the PCL Future Greater-Bay Area Network Facilities for Large-scale Experiments and Applications under Project PCL2018KP001, and in part by The Verification Platform of Multi-tier Coverage Communication Network for Oceans under Project PCL2018KP002. The work of Di Yuan was supported by a scholarship from the China Scholarship Council. The work of Xiaojun Chang was supported in part by the Australian Research Council (ARC) Discovery Early Career Researcher Award (DECRA) under Grant DE190100626. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Sos S. Agaian. (Corresponding author: Zhenyu He.)

Di Yuan and Qiao Liu are with the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China (e-mail: [email protected]; [email protected]).

Xiaojun Chang is with the Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia, and also with the Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia (e-mail: [email protected]).

Po-Yao Huang is with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA (e-mail: [email protected]).

Zhenyu He is with the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China, and also with the Peng Cheng Laboratory, Shenzhen 518055, China (e-mail: [email protected]).

Digital Object Identifier 10.1109/TIP.2020.3037518

I. INTRODUCTION

SINGLE target tracking is a popular and important research topic with numerous applications, including video surveillance, autonomous vehicles, and man-machine interaction. The core task of tracking is, given the ground-truth of a selected target in the first frame, to use a tracker to accurately predict the target position in all consecutive video frames. Recently, trackers based on deep convolutional neural networks (CNNs) trained on manually annotated images have achieved promising tracking performance. However, it is still a tough problem to train an efficient feature extraction network in a deep learning-based tracking framework because of the limited amount of labeled training data.

Tracking methods based on deep CNN structures have recently achieved remarkable performance and have become increasingly popular in the tracking community [1]–[5]. Usually, these deep CNN-based trackers exploit a pre-trained network for feature extraction, then use a correlation or similarity function to calculate a similarity score between the template sample and the candidate samples, after which they choose the candidate with the maximum score as the target in the current image frame [1], [2], [6]–[8]. While these methods have improved performance relative to trackers based on hand-crafted features, online tracking without updating limits their generalization capability. Although several trackers have attempted to employ deep networks for feature expression, when the target is unknown during training it is necessary to adapt the weights of the network online by executing Stochastic Gradient Descent (SGD), which significantly affects the tracking speed [9]–[11]. In [12], Bertinetto et al. propose the SiamFC tracker, which focuses on learning a similarity function between the target and candidates in the offline phase and achieves remarkable tracking performance compared with other trackers from the same period. The ECO [4] tracker introduces a factorized convolution operator into the discriminative correlation filter model and proposes a generative model to enhance sample diversity, which can improve both tracking accuracy and speed.
Fig. 1. Comparison of tracking speed and AUC score of our self-SDCT tracker and other deep learning based trackers on the OTB-100 dataset.

Fig. 2. Tracking example of the proposed self-SDCT tracker and other supervised and unsupervised trackers.

However, there are two major shortcomings of these deep CNN-based trackers. The first is that the feature extraction network requires numerous manually annotated samples for training. Such manually annotated training samples are very limited and obtaining them is also time-consuming and costly, meaning that a feature extraction network trained on limited labeled samples is unable to represent the target features well. The second is that most deep convolutional network-based trackers require a network with multiple layers to extract features and must fine-tune their pre-trained networks in the online tracking phase, which results in high computational complexity. Some deep CNN-based trackers are unable to achieve real-time tracking speed because of the high dimensionality of the feature extraction network [7], [9], [13]. For example, the MDNet [9] tracker needs to pre-train a deep CNN architecture for the similarity-matching task. In the tracking stage, the MDNet tracker uses an SGD strategy to learn a detector with candidates extracted from the current sequence. This approach cannot obtain a real-time tracking speed because of its high computational consumption. As shown in Fig. 1, the computational overhead prevents trackers with deep features from achieving real-time performance (e.g., SINT [2], MCPF [14], and CREST [15]).

To solve the above two problems, in this work we develop a robust and efficient deep correlation-based tracker with two key components: a self-supervised learning-based pre-trained deep feature extraction network and an efficient deep correlation tracking framework. Unlike most supervised and unsupervised deep trackers, our self-supervised self-SDCT tracker obtains competitive tracking performance (see Fig. 1). Despite the limitations in the number of labeled training samples, there are abundant unlabeled video sequences available for self-supervised learning. In light of this observation, we propose to train the feature extraction network via self-supervised learning, so that only the label of the target in the initial frame is needed. After the initial target's ground-truth is provided, we use a correlation filter approach to generate pseudo-labels for other samples, and also use a cycle-consistency loss for network training. The cycle-consistency loss of most network training methods only measures the difference between the original state and the final state after a forward-backward prediction. Different from these methods, we use a multi-cycle consistency loss for our network training, which considers both the final result (Fig. 4: Final-Loss) and intermediate results (Fig. 4: Mid-Loss). The multi-cycle consistency can improve the robustness of our feature extraction network. In addition, to alleviate the impact of low-quality training sample pairs, we propose a low similarity dropout strategy to drop such training sample pairs. Besides, the target and background can be better distinguished through the cyclic trajectory consistency of the target, which can reduce the influence of background information on the feature extraction network. Both the low similarity dropout strategy and the cycle trajectory consistency loss improve the feature extraction network effectively. Once the network training is completed, we apply it to an efficient Siamese correlation tracking framework to track the target, and the average tracking speed is around 48 fps. Compared to other supervised tracking methods (such as CFNet [16] and SiamFC [12]) and unsupervised tracking methods (such as UDT [17]), our self-SDCT tracker achieves competitive tracking results (see Fig. 2).

The main contributions of this paper are as follows:

• We formulate a multi-cycle consistency loss-based self-supervised learning method to pre-train the deep feature extraction network, which can take advantage of extensive unlabeled video samples rather than limited manually annotated samples.

• We use a multi-cycle consistency loss, a low similarity dropout, and a cycle trajectory consistency loss to jointly pre-train our feature extraction network, which can effectively improve its representational ability and reduce the overfitting risk.

• We conduct extensive experimental evaluations to demonstrate the competitiveness of our self-SDCT tracker against state-of-the-art supervised and unsupervised trackers on large benchmarks: OTB-2013 [18], OTB-100 [19], UAVDT [20], TColor-128 [21], and UAV-123 [22].
II. RELATED WORKS

In this section, we review the relevant literature regarding deep correlation tracking algorithms, self-supervised learning for feature representation, and cycle consistency in time series.

A. Deep Correlation Tracking

Trackers based on a deep correlation structure have gained increasing attention. Siamese architecture-based tracking methods formulate the tracking task as a cross-correlation problem [2], [12], [16], [23]–[26]. The SINT [2] tracker proposes to train a Siamese network that determines the target location by finding the maximum similarity between candidate samples and the initial target. The SiamFC [12] tracker incorporates a fully-convolutional network for tracking tasks, which demonstrates the powerful representation ability of the offline-trained feature extraction network. Currently, Siamese network-based trackers [27]–[30] enhance their tracking accuracy by adding a region proposal network (RPN) module. In [27], in order to obtain high accuracy and real-time tracking performance, Li et al. present the SiamRPN tracker, which discards the multi-scale test and online fine-tuning. However, the SiamRPN tracker is susceptible to interference from similar objects in the tracking scene, which reduces tracking performance. Fan et al. [30] propose a Siamese network-based cascaded RPN tracker (SiamCRPN). The SiamCRPN tracker gradually refines the target's position in each RPN through adjusted anchor boxes, thereby making the target positioning more accurate. Besides, correlation filter (CF)-based tracking methods train a linear template to discriminate between an image patch and its translations. Benefiting from their formulation in the Fourier domain, CF-based trackers can achieve a fast tracking speed [31]–[33]. To improve the tracking performance of CF-based trackers, research has been carried out from different aspects, e.g., scale estimation [34], spatio-temporal context [35], [36], learning models [37], non-linear kernels [38] and boundary effects [39]–[41]. Meanwhile, some deep learning-based tracking methods have attempted to treat the correlation filter as an additional layer in their network structure to achieve faster tracking speed. The CFNet [16] tracker integrates the correlation filter into the Siamese network-based tracking framework and gives a closed-form solution. The C-COT [42] tracker introduces an effective expression for training continuous convolution filters; moreover, the ECO [4] tracker proposes a factorized convolution operator, which significantly reduces the number of parameters of the C-COT [42] tracker. All of these deep correlation tracking methods use either an off-the-shelf feature extraction network (e.g., VGGNet or AlexNet) fine-tuned on the tracking task, or a large number of manually labeled datasets to train their feature extraction networks. However, the former usually brings high computational complexity to the tracker and makes the tracking speed very slow, while the latter typically produces unsatisfactory tracking results due to insufficient labeled training data. Although some changes in network structure can improve the feature representation capacity [43]–[46], insufficient labeled training data is still a major constraint on network performance. Accordingly, unlike the above deep trackers that use a feature extraction network pre-trained on numerous manually labeled training samples, we adopt a self-supervised learning method that trains the network using training data that requires only the initial target's ground-truth, just as the tracking task does.
B. Self-supervised Learning for Feature Representation

The learning of feature representations from numerous videos or images has been extensively studied. Wu and Huang [47] proposed a self-supervised learning approach using both supervised and unsupervised training data. Based on this, they derived a Discriminant-EM method that automatically selects good classification features. The human visual system often pays more attention to motion information; inspired by this, Pathak et al. [48] propose motion-based segmentation on videos to obtain particular segments, which are then used as pseudo-labels to train a segmentation convolutional network. Vondrick et al. [49] consider video coloring as a self-supervised learning problem. This method involves learning to associate an area of a color reference frame with a region of a gray frame by learning an embedding, then copying the reference color of the specified area to the gray image. This represents a departure from other methods that use an off-the-shelf approach for tracking to provide a supervisory signal for training [7], [48], [50]. In [51], the authors try to jointly learn optical flow and tracking and consequently point out that these two problems are complementary. Lai et al. [52] propose a memory-based method to learn a feature representation, which can guarantee pixel-wise correspondences between frames. Our work is inspired by the unsupervised representation learning method of UDT [17], which integrates the tracking algorithm into unsupervised training. We train our deep network for feature representation using a self-supervised learning approach, which only requires an initial target location without any additional information. The supervisory information we use to train the deep feature extraction network comes from the pseudo-labels generated by forward-backward tracking.

C. Cycle Consistency in Time Series

Cycle consistency in time series has been widely explored in the literature [53]–[55]. Wang et al. [51] propose to use cycle consistency to learn visual representations, mainly focusing on unifying optical flow and tracking in a single video to achieve a better embedding representation in a self-supervised way. Dwibedi et al. [56] train a network using a differentiable temporal cycle-consistency loss to seek correspondences across time in multiple videos [57]. Li et al. [58] propose to track large image patches and establish associations between consecutive video frames. As a representative of cycle consistency in time series, forward-backward consistency has been widely used in tracking tasks. The TLD [59] tracker proposes a forward-backward error to estimate the reliability of a tracking trajectory. Its tracking result is corrected by verifying the trajectory backward and comparing it to the relevant trajectory. The MTA [60] tracker performs forward tracking by predicting the forward-backward consistency of multiple component trackers and identifying the best tracker through a maximum robustness score. The UDT [17] tracker revisits the forward-backward tracking framework and trains a deep tracker in an unsupervised way. However, the above-mentioned cycle consistency in time series only focuses on the final result; this can lead to inaccurate intermediate results even while the final result is accurate. Therefore, we propose a multi-cycle consistency that also considers the intermediate tracking results in the forward-backward tracking process, yielding improved tracking performance.

III. SELF-SUPERVISED DEEP CORRELATION TRACKING

In this section, we propose the self-supervised deep correlation tracking network for the tracking task. Firstly, we provide a brief review of deep correlation network-based methods in Sec. III-A. We then present the self-supervised learning approach designed to pre-train the feature extraction network with numerous unlabeled data in Sec. III-B. Furthermore, we adopt the multi-cycle consistency loss, low similarity dropout and cycle trajectory consistency loss to improve the pre-trained network effectively. Finally, we outline the training details in Sec. III-C. The architecture of our self-supervised tracker is illustrated in Fig. 3.
Fig. 3. An overview of the self-SDCT architecture. We use a Siamese correlation filters tracking framework as the baseline. The feature extraction network is trained through a forward-backward tracking task under a Siamese correlation framework with a multi-cycle consistency loss. Once the training is complete, we, like other Siamese-based trackers, use only forward tracking to locate the target.

Fig. 4. Example of the proposed multi-cycle consistency loss. The multi-cycle consistency loss not only takes into account the final loss in the forward and backward movement of the target (Final-Loss), but also the loss in the middle of the movement (Mid-Loss).

A. Revisiting Deep Correlation Tracker

Tracking arbitrary targets can be addressed by using a correlation learning method in a deep tracking framework (such as the Siamese framework [16], [24]). Siamese correlation trackers propose to learn a function f(x, z) = g(ϕ(x), ϕ(z)) that compares an exemplar image z to a candidate image x and returns a score that indicates their similarity. Since the discriminative correlation filters framework can be efficiently computed in the Fourier domain, it is often added into the deep tracking framework as a network layer to improve tracking speed. Motivated by this, we use the discriminative correlation filters framework for forward-backward tracking to generate pseudo-labels of training sample pairs.

The discriminative correlation filters framework uses a target X and its label Y to train a filter W:

W = arg min_W ||W ∗ X − Y||^2 + λ||W||^2, (1)

where ∗ is the circular convolution and λ is a regularization parameter. Since the label Y has a Gaussian shape, the filter W trained from the data X contains the coefficients of a Gaussian ridge regression. By using the Fourier transformation to compute this Gaussian ridge regression model, the solution of Eq. (1) can be obtained as follows:

W = F^{-1}((F(X) ⊙ F*(Y)) / (F*(X) ⊙ F(X) + λ)), (2)

where F is the Fourier transformation, F^{-1} is its inverse, F* denotes the complex conjugate, and ⊙ is the element-wise product. At the tracking stage, an image patch Z with the same size as X is cropped from the current frame, and its response score can be computed as:

f(Z) = F^{-1}(F(Z) ⊙ F*(W)), (3)

where f(Z) is the response map of the image patch Z. Once f(Z) is obtained, we can select the location with the maximum response value in f(Z) as the target center and treat it as the label center to generate a pseudo-Gaussian label. The next step is to train the new filter using the pseudo-Gaussian label and the image patch Z. After that, these steps are repeated to generate the pseudo-Gaussian labels for the other samples. Finally, the feature extraction network is improved by repeated forward-backward tracking.
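To make the closed-form solution concrete, the following is a minimal single-channel NumPy sketch of Eqs. (1)–(3) and of the pseudo-Gaussian relabeling step. The function names, the Gaussian bandwidth sigma and the value of λ are our own illustrative choices, not values fixed by the paper, and the filter is kept in the Fourier domain for efficiency:

import numpy as np

def gaussian_label(shape, center, sigma=2.0):
    # Gaussian-shaped regression label centered at `center` (row, col).
    h, w = shape
    ys, xs = np.ogrid[:h, :w]
    return np.exp(-((ys - center[0]) ** 2 + (xs - center[1]) ** 2) / (2 * sigma ** 2))

def train_filter(x, y, lam=1e-4):
    # Eq. (2): closed-form correlation filter, returned in the Fourier domain.
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return (X * np.conj(Y)) / (np.conj(X) * X + lam)

def response(z, W):
    # Eq. (3): response map of a search patch z under (Fourier-domain) filter W.
    return np.real(np.fft.ifft2(np.fft.fft2(z) * np.conj(W)))

# One forward tracking step with pseudo-Gaussian relabeling:
x = np.random.rand(125, 125)                 # template patch (stand-in data)
z = np.random.rand(125, 125)                 # search patch from the next frame
W = train_filter(x, gaussian_label(x.shape, (62, 62)))
r = response(z, W)
cy, cx = np.unravel_index(np.argmax(r), r.shape)
W = train_filter(z, gaussian_label(z.shape, (cy, cx)))  # re-train on the pseudo-label

In the actual tracker the patches are multi-channel CNN features rather than raw pixels, but the per-frequency division has the same form.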
B. Cycle Consistency Regression

Our work is motivated by forward-backward consistency in time, which has been used to evaluate consistency in some tracking methods [59], [60]. Considering that the tracking task involves predicting and locating the target's state in subsequent frames after the initial ground-truth is given, we propose a self-supervised learning method that uses massive unlabeled data to pre-train our feature extraction network. In each video sequence, we choose four image frames as one training sample pair. With the ground-truth in the initial image frame, we use forward-backward tracking under the Siamese correlation framework to generate the pseudo-labels of the other frames for multi-cycle consistency training. To further enhance the capabilities of the feature extraction network, we also use a similarity function to drop out some low-quality training pairs and use a cycle trajectory consistency loss to highlight the role of moving targets in the training process.

1) Multi-Cycle Consistency Loss: Conventional forward-backward tracking is concerned only with the final tracking result; in other words, it only cares about the result of starting from the first frame and finally returning to the first frame (Fig. 4: Final-Loss). Existing work does not directly address the accuracy of the tracking results for the intermediate frames. In fact, a tracker may relocate the target long after losing it; however, the performance of such a tracker is unacceptable. We accordingly propose that both the final tracking result (Fig. 4: Final-Loss) and the results of the intermediate frames (Fig. 4: Mid-Loss) should be considered in the forward-backward tracking process. Therefore, we implement a multi-cycle consistency loss for the training stage. Fig. 4 presents an overview of the proposed method. The multi-cycle consistency loss can be written as follows:

L_total^i = Σ_t L_t, (4)

where L_total^i is the multi-cycle consistency loss of the i-th training sample pair and L_t is the forward-backward loss of each training sample in the same pair (L_t = ||R_t − R̃_t||^2), where R_t and R̃_t denote the response maps of the forward and backward tracking of the t-th image patch.
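As a sketch of Eq. (4): given the per-step response maps produced while tracking forward through a sample pair and then backward to the start, the loss accumulates the forward-backward discrepancy at every step, not only at the final return to the first frame. A minimal PyTorch version, with tensor names of our own choosing:

import torch

def multi_cycle_loss(fwd_maps, bwd_maps):
    # fwd_maps[t] is the forward response map R_t of the t-th patch and
    # bwd_maps[t] the corresponding backward map; summing over all t covers
    # both the Mid-Loss and the Final-Loss terms of Fig. 4 (Eq. (4)).
    return sum(torch.sum((rf - rb) ** 2) for rf, rb in zip(fwd_maps, bwd_maps))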
Fig. 5. (a) Examples of training sample pairs with different similarities. Dropping out some samples with low similarity can reduce training loss and avoid overfitting. (b) Examples of cycle trajectory consistency loss. The target has forward and backward consistency during the movement. In adjacent image frames, the moving part is more likely to be the target than the background.

2) Low Similarity Dropout: The quality of the training samples also greatly affects feature learning for tracking. In the training dataset, sample pairs may contain targets with different similarities (as shown in Fig. 5(a)). Samples of different similarities in each pair have the same effect on the training process, which affects the representational ability of the trained network. Moreover, if the samples of a training pair do not contain the target at the same time, this constitutes a fatal blow to the trained feature extraction network. Therefore, in the training process, we take the similarity between the samples in each training pair into account to improve the robustness of the feature extraction network. High similarity indicates that the sample pair is more important; thus, we retain it in the training process. Sample pairs with low similarity may not contain moving objects simultaneously, and adding them to the training process will undermine the representational ability of the feature extraction network. Therefore, we drop out the low-similarity training pairs to solve this problem. The similarity of the samples in each training pair can be calculated as:

f_s = Similarity(x, y), (5)

where x and y denote training samples in the same pair; x is the first frame and y = {y_1, y_2, y_3} are the other frames. The similarity function can be the Euclidean function, the Mahalanobis function, the Cosine function, etc. In this paper, we use the Euclidean function. To ensure the quality of these training samples and avoid overfitting, we drop out 10% of the training sample pairs:

f_drop = {1, if f_vs > α; 0, otherwise}, (6)

where f_drop denotes the dropout condition and α is a threshold determined by the similarity ranking of all training sample pairs and the dropout rate. f_vs = (f_s(x, y_1) + f_s(x, y_2) + f_s(x, y_3))/3 is the average similarity of each sample pair. After dropping the 10% of training sample pairs with the lowest similarity, our network becomes more suitable for the tracking task.
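A small sketch of Eqs. (5)–(6) under our reading: similarity is the negative Euclidean distance (so larger means more similar), f_vs is averaged over the three search frames, and the threshold α is taken from the batch-wide ranking at the chosen dropout rate. Names and tensor shapes are illustrative assumptions:

import torch

def low_similarity_dropout(x, ys, drop_rate=0.10):
    # x: (B, C, H, W) template patches; ys: (B, 3, C, H, W) search patches.
    b, k = x.size(0), ys.size(1)
    # Eq. (5) with a Euclidean similarity (negated distance), averaged over
    # the search frames to obtain f_vs for every pair.
    f_vs = torch.stack([
        -sum(torch.dist(x[i], ys[i, j]) for j in range(k)) / k
        for i in range(b)
    ])
    # alpha sits at the drop_rate quantile of the batch-wide ranking, so the
    # lowest drop_rate fraction of pairs receives f_drop = 0 (Eq. (6)).
    alpha = torch.quantile(f_vs, drop_rate)
    return (f_vs > alpha).float()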
3) Cycle Trajectory Consistency Loss: In addition to the low similarity of sample pairs, which degrades the performance of the feature extraction network, the inclusion of a large amount of background information in the training samples also affects the network's performance. The trajectory of the target can draw an effective distinction between the target and the background [17], [60]. More specifically, the trajectory of the target moving from the current t-th frame to the next (t+1)-th frame is consistent with the trajectory moving from the (t+1)-th frame back to the t-th frame. Meanwhile, the relative target position between these two frames is also consistent. After considering this trajectory consistency, we design a cycle trajectory consistency loss L_tc that reduces the impact of the background on the tracking performance, and formulate it over all training sample pairs. Every element L_tc^i can be calculated as follows:

L_tc^i = Σ_t L_{t→t+1}^i, with L_{t→t+1}^i = (1/2)(||R_t − R_{t+1}||_2^2 + ||R_{t+1} − R̃_t||_2^2), (7)

where L_tc^i is the cycle trajectory consistency loss of the i-th training sample pair, L_{t→t+1}^i is the cycle trajectory consistency loss from the t-th frame to the (t+1)-th frame in the i-th training pair (see Fig. 5(b)), R_t is the label of the t-th frame, R_{t+1} is the label of the (t+1)-th frame generated by forward tracking, and R̃_t is the label of the t-th frame generated by backward tracking.
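A short sketch of Eq. (7) as we reconstruct it (the primes distinguishing forward- and backward-generated labels are garbled in the scanned text, so the exact pairing of terms follows our reading of the symbol definitions above):

import torch

def trajectory_loss(labels, fwd_labels, bwd_labels):
    # labels[t] = R_t; fwd_labels[t] = label of frame t from forward tracking;
    # bwd_labels[t] = label of frame t from backward tracking (Eq. (7)).
    loss = 0.0
    for t in range(len(labels) - 1):
        loss = loss + 0.5 * (torch.sum((labels[t] - fwd_labels[t + 1]) ** 2)
                             + torch.sum((fwd_labels[t + 1] - bwd_labels[t]) ** 2))
    return loss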
4) Cycle Consistency Regression Loss: Taking into account the multi-cycle consistency loss, the low similarity dropout and the cycle trajectory consistency loss, our cycle consistency regression objective function can be written as follows:

L_cc = Σ_i L_cc^i, (8)

where L_cc is the total cycle consistency regression loss and L_cc^i is the cycle consistency regression loss of the i-th training sample pair (L_cc^i = (L_total^i · f_drop^i)/(L_tc^i + ε)); moreover, ε is a parameter used to ensure that the denominator is not 0 (we set ε = 1 in this work). We ensure the tracking accuracy by narrowing the difference between the forward-backward tracking results of the same image frame. Furthermore, given that the motion probability of the target is greater than that of the background, we make sure that the tracked position is the target rather than the background by increasing the difference between adjacent frames.
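Putting the three components together, Eq. (8) divides the dropout-masked multi-cycle loss by the trajectory term, so minimizing it simultaneously shrinks the forward-backward disagreement on the same frame and enlarges the inter-frame differences that correspond to target motion. A one-function sketch with our own names:

def cycle_consistency_loss(l_total, l_tc, f_drop, eps=1.0):
    # l_total, l_tc, f_drop: per-pair tensors of L_total^i, L_tc^i, f_drop^i.
    # eps = 1 keeps the denominator away from zero, as stated in the paper.
    return ((l_total * f_drop) / (l_tc + eps)).sum()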
C. Self-Supervised Training Details

1) Network Structure: With reference to the DCFNet [24] tracker and the UDT [17] tracker, we use a network with only two convolutional layers to extract features and track the target under the Siamese framework. The filter sizes are 3×3×3×32 and 3×3×32×32, respectively. Since there are only two convolutional layers in the feature extraction network, the number of parameters in this network is very small. The training process only needs 30 iterations, and the model can reach convergence. This lightweight network (less than 40 KB) thus provides competitive and real-time tracking speed.
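The two-layer extractor can be written down directly from the stated filter sizes; everything else below (the nonlinearity between the layers and the absence of padding) is an assumption on our part, since the text does not specify it:

import torch.nn as nn

class FeatureExtractor(nn.Module):
    # Filter sizes 3x3x3x32 and 3x3x32x32, as given in Sec. III-C.
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3),
            nn.ReLU(inplace=True),  # assumed; the activation is not specified
            nn.Conv2d(32, 32, kernel_size=3),
        )

    def forward(self, x):
        return self.features(x)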
2) Training Data: We choose ILSVRC2015 [61] as our training dataset, just like other supervised [12], [16], [24] and unsupervised [17] trackers. Unlike the supervised trackers, however, we do not require the labels of each image frame [16], [24]; instead, we follow the unsupervised UDT [17] tracker, which does not pre-process any training data but rather only crops the center patch in every image frame and resizes it to 125×125. For each video, we choose four cropped patches from continuous frames, then set one as the template image and the others as the search images. We take the target in the center of the template image as the tracking target and give its ground-truth.
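A sketch of this preparation step under our assumptions (a center square crop before resizing, and a random starting frame); the helper name and the use of PIL are illustrative:

import random
from PIL import Image

def make_training_pair(frame_paths, size=125):
    # Crop the center patch of four consecutive frames and resize to 125x125;
    # the first patch is the template, the remaining three are search images.
    start = random.randint(0, len(frame_paths) - 4)
    patches = []
    for path in frame_paths[start:start + 4]:
        im = Image.open(path)
        w, h = im.size
        s = min(w, h)
        box = ((w - s) // 2, (h - s) // 2, (w + s) // 2, (h + s) // 2)
        patches.append(im.crop(box).resize((size, size)))
    return patches[0], patches[1:]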
D. Model Update

To adapt to target appearance variations in the tracking stage, a linear model update strategy is adopted to update the correlation filter parameters:

W = (1 − δ)W_{t−1} + δW_t, (9)

where δ is the learning rate and W_t is the current correlation filter.
IV. EXPERIMENTS

We first introduce the experimental details and the evaluation criteria, then analyze the effectiveness of each component of the proposed self-supervised learning-based pre-trained feature extraction network. Finally, we evaluate our self-supervised learning-based self-SDCT tracker alongside state-of-the-art supervised and unsupervised trackers on the OTB-2013 [18], OTB-100 [19], UAVDT [20], TColor-128 [21], and UAV-123 [22] datasets.

A. Experimental Details and Evaluation Criterion

1) Experimental Details: We follow UDT [17] and DCFNet [24], which apply stochastic gradient descent (SGD) with a momentum of 0.9 to train the feature extraction network. The weight decay is set to 5e-4, and the learning rate is set to 1e-5. The network is trained for 30 epochs with a mini-batch size of 32. The model update learning rate δ is set to 0.025. Our experiments are performed in Matlab2019 on a PC with an i7 4.2GHz CPU and an NVIDIA GTX 2080Ti GPU. The tracking speed is around 48 fps.
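These hyperparameters map directly onto a standard SGD setup; a self-contained sketch of the configuration (the inline two-layer network stands in for the extractor of Sec. III-C):

import torch
from torch import nn

net = nn.Sequential(nn.Conv2d(3, 32, 3), nn.ReLU(), nn.Conv2d(32, 32, 3))
optimizer = torch.optim.SGD(net.parameters(), lr=1e-5,
                            momentum=0.9, weight_decay=5e-4)
# Trained for 30 epochs over mini-batches of 32 training sample pairs.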
2) Evaluation Criterion: We mainly use the precision and success indices [62], introduced in the OTB benchmark [18], [19], to evaluate the tracking performance of our self-SDCT tracking method. The precision index refers to the average distance precision between the predicted position and the ground-truth under different thresholds. Meanwhile, the success index is measured by the average overlap of the tracking result and the ground-truth, and trackers are ranked using the area under the curve (AUC). Moreover, tracking speed is also a significant index for evaluating a tracker.
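For concreteness, the two indices can be computed as below; this follows the usual OTB conventions (e.g., the 20-pixel precision threshold is the customary operating point), which the paper does not restate:

import numpy as np

def precision_at(pred_centers, gt_centers, thresh=20.0):
    # Fraction of frames whose center location error is within `thresh` pixels.
    err = np.linalg.norm(pred_centers - gt_centers, axis=1)
    return float((err <= thresh).mean())

def success_auc(ious, steps=101):
    # Success plot: fraction of frames with overlap (IoU) above each threshold
    # in [0, 1]; trackers are ranked by the area under this curve (AUC).
    thresholds = np.linspace(0.0, 1.0, steps)
    return float(np.mean([(ious > t).mean() for t in thresholds]))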
B. Ablation Study

We carry out ablation studies on the OTB-2013 [18] and OTB-100 [19] benchmarks to analyze the effect of each component in the training process. The comparison results are presented in Table I. Note that self-SDCT denotes the tracking result of the pre-trained network with every component included; self-SDCTsccl denotes the tracking result of the pre-trained network with only the final consistency loss (shown in Fig. 4); self-SDCTolsd denotes the tracking result of the pre-trained network without low similarity dropout; and self-SDCTotcl denotes the tracking result of the pre-trained network without cycle trajectory consistency loss. From Table I we can see that the tracking performance of our self-SDCT tracking method is significantly better than that of the self-SDCTsccl tracker, which benefits from the multi-cycle consistency loss. Moreover, if any of these three components is removed, the tracking performance is reduced. This directly reflects the effectiveness of multi-cycle consistency, low similarity dropout and cycle trajectory consistency in the network training process. We also report the tracking performance of the pre-trained network under different dropout rates, as shown in Table II. From this table, we can see that an appropriate dropout rate (e.g., 5%, 10%) can bring certain tracking performance improvements to the pre-trained network. However, a larger dropout rate (e.g., 20%) will reduce the diversity of the training samples and cause the tracking performance of the pre-trained network to decline. The dropout rate not specifically mentioned in this paper is set to 10%.

TABLE I. Ablation study results on the OTB-2013 and OTB-100 datasets.

TABLE II. Tracking results under different dropout rates.

C. State-of-the-Art Comparison

To verify the proposed self-SDCT tracker, we make experimental comparisons between our tracker and some state-of-the-art trackers on standard benchmark datasets including OTB-100 [19], UAVDT [20], TColor-128 [21], and UAV-123 [22].

1) Experiment on OTB-100 Benchmark: We compare our self-SDCT tracker with other trackers including ATOM [63], SiamRPN [27], MetaCREST [64], UDT+ [17], TRACA [65], ARCF [66], ACFN [67], SiamTri [68], SiamFC [12], DCFNet [24], CFNet [16], CNT [69] and UDT [17] on the OTB-100 [19] dataset. Fig. 6 presents the experimental results of comparing our self-SDCT tracker with these state-of-the-art trackers. From Fig. 6, we can see that our self-SDCT tracker is comparable with the baseline fully-supervised trackers [12], [16], [24]. Compared to CFNet [16], our proposed tracker achieves a 5.2% improvement in terms of the AUC index. Moreover, the accuracy of our self-SDCT tracker is comparable to that of the SiamRPN [27] tracker.
Fig. 6. Precision and success plots on the OTB-100 [19] dataset.

This tracking result demonstrates that the self-supervised learning method for feature extraction network training is very effective. Although the accuracy of our self-SDCT tracker is a little worse than that of the ATOM [63] tracker, this is mainly because the ATOM tracker benefits from an accurate target estimation strategy. The feature extraction network of the ATOM tracker is trained in a supervised learning manner, which requires a lot of labeled training data. Instead, the feature extraction network of our self-SDCT tracker is trained in a self-supervised learning manner, which means our network can be trained without the labels of the training data.

TABLE III. State-of-the-art comparison on the UAVDT [20] dataset in terms of precision scores, success scores and tracking speed. The first, second and third best are highlighted in red, blue and green, respectively.

Fig. 7. Precision and success plots on the UAVDT [20] dataset.

2) Experiment on UAVDT Benchmark: Fig. 7 and Table III illustrate the experimental results of the proposed self-SDCT tracker against other trackers, including ARCF [66], Staple-CA [71], UDT+, UDT [17], SRDCF [35], SiamFC [12], CFNet [16], CREST [15], MCPF [14], PTAV [70], SINT [2], STRCF [41], and HDT [7], on the UAVDT [20] dataset. Among the compared tracking methods, our self-SDCT tracker achieves the second-best scores in both the precision and AUC metrics. Although the tracking accuracy of our self-SDCT tracker is worse than that of the ARCF [66] tracker, the tracking speed of ARCF is far lower than that of our tracker and cannot meet the requirements of real-time tracking (as shown in Table III). Fig. 8 shows the comparative performance of these trackers on the long-term attribute of the UAVDT dataset, which demonstrates that our self-SDCT tracker can achieve competitive tracking results on long-term tracking sequences. These experimental results demonstrate that a self-supervised learning-based tracker can also achieve competitive tracking results.

Fig. 8. Precision and success plots for the long-term attribute on the UAVDT dataset.

3) Experiment on UAV-123 Benchmark: Table IV presents the comparison results of our self-SDCT tracker and other state-of-the-art trackers, including UDT [17], CFNet [16], SRDCF [35], MUSTer [72], SAMF [73], MEEM [74], SiamFC [12], DSST [75], ARCF [66], ARCFH [66], BACF [39] and CNT [69], on the UAV-123 [22] dataset. Among these compared tracking methods, our self-SDCT tracker shows the best scores in both the precision and AUC metrics. Compared with the CF-based tracking methods (e.g., SAMF, DSST), the proposed self-SDCT tracker achieves a remarkable improvement in tracking performance. Compared with the deep learning-based tracking methods (e.g., CFNet, CNT), the proposed self-SDCT tracker also makes some improvements in tracking performance. By taking account of the intermediate states, our multi-cycle consistency loss-based self-SDCT tracker outperforms the unsupervised learning-based UDT [17] tracker. In general, our self-SDCT tracking method performs favorably against these state-of-the-art trackers in terms of tracking performance.

4) Experiment on TColor-128 Benchmark: We also test and verify our self-SDCT tracker on the TColor-128 [21] benchmark against 12 state-of-the-art trackers, including UDT [17], BACF [39], SRDCF [35], Staple [71], MEEM [74], SiamFC [12], CFNet [16], HDT [7], DSST [75], ARCF [66], CNT [69] and SITUP [76]. The experimental comparison results are presented in Table V. Among these 12 compared trackers, the correlation filter-based BACF, ARCF, SRDCF and DSST trackers achieve precision and AUC scores of (66.0%/49.6%), (70.9%/52.5%), (69.6%/51.6%) and (54.9%/38.7%), respectively. By contrast, our self-SDCT tracker performs well in both the precision and AUC metrics (72.9%/54.0%). Moreover, our self-SDCT tracker also achieves competitive tracking performance compared to some supervised learning-based tracking methods (e.g., CFNet, SiamFC).
TABLE IV. Precision and AUC scores of the proposed self-SDCT tracker and other trackers on the UAV-123 [22] dataset. The first, second and third best scores are highlighted in red, blue and green, respectively.

TABLE V. Precision and AUC scores of our self-SDCT tracker and other trackers on the TColor-128 [21] dataset. The first, second and third best scores are highlighted in red, blue and green, respectively.

Compared with the SiamFC [12] tracker, our tracker does not require a large amount of labeled data for training and still achieves more than a 3% improvement in tracking performance. In summary, our self-SDCT tracking method achieves competitive tracking performance compared with other state-of-the-art trackers.

D. Qualitative Comparison

Fig. 9. Qualitative comparison of the self-SDCT tracker and other trackers on some tracking sequences (from top to bottom: skiing, soccer, matrix, skating2-1 and liquor).

We give a qualitative comparison of our self-supervised learning-based self-SDCT tracker with other state-of-the-art tracking methods, including UDT [17], SiamFC [12], CFNet [16], DCFNet [24], and SiamTri [68]. Fig. 9 shows the comparison results of these trackers on some challenging video sequences. The unsupervised learning-based UDT [17] tracker is easily interfered with in scenes with occlusion and fast motion (e.g., matrix and skiing). An explanation for this drawback is that it adopts a feature extraction network trained using a single-cycle consistency loss under an unsupervised learning approach, meaning that it cannot model a suitable target appearance in some complex scenes. In contrast, the proposed self-SDCT tracker adopts a multi-cycle consistency loss to train the feature extraction network, which can extract more robust features. Compared with other trackers, such as SiamFC [12] and CFNet [16], our self-SDCT tracker also achieves competitive tracking results. With only a limited amount of labeled data and numerous self-supervised training pairs to train the feature extraction network, our self-SDCT tracker is still able to achieve competitive tracking performance.

V. CONCLUSION

We propose an effective multi-cycle consistency loss-based self-supervised learning method to train a deep feature extraction network without the need for numerous manually labeled samples. In the proposed self-SDCT tracker, we use forward-backward prediction under a Siamese correlation-based tracking framework to generate pseudo-labels of the training samples and adopt the multi-cycle consistency loss to train the feature extraction network. Meanwhile, we propose a low similarity dropout strategy and a cycle trajectory consistency loss to enhance the robustness of the feature extraction network. The ablation studies validate the effectiveness of each component of the proposed self-SDCT tracker. Moreover, the Siamese correlation-based tracking architecture provides a fast tracking speed, which guarantees that the proposed self-SDCT tracker is able to run in real time. Extensive experiments show the effectiveness of our proposed self-SDCT tracker.
REFERENCES

[1] H. Li, Y. Li, and F. Porikli, "DeepTrack: Learning discriminative feature representations by convolutional neural networks for visual tracking," in Proc. BMVC, 2014, pp. 1–12.
[2] R. Tao, E. Gavves, and A. W. M. Smeulders, "Siamese instance search for tracking," in Proc. CVPR, 2016, pp. 1420–1429.
[3] Q. Liu et al., "Multi-task driven feature models for thermal infrared tracking," in Proc. AAAI, 2020, pp. 11604–11611.
[4] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, "ECO: Efficient convolution operators for tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6638–6646.
[5] J. Choi, J. Kwon, and K. M. Lee, "Deep meta learning for real-time target-aware visual tracking," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 911–920.
[6] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg, "Convolutional features for correlation filter based visual tracking," in Proc. IEEE Int. Conf. Comput. Vis. Workshop (ICCVW), Dec. 2015, pp. 621–629.
[7] Y. Qi et al., "Hedged deep tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4303–4311.
[8] K. Li, Y. Kong, and Y. Fu, "Visual object tracking via multi-stream deep similarity learning networks," IEEE Trans. Image Process., vol. 29, pp. 3311–3320, 2020.
[9] H. Nam and B. Han, "Learning multi-domain convolutional neural networks for visual tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4293–4302.
[10] L. Wang, W. Ouyang, X. Wang, and H. Lu, "STCT: Sequentially training convolutional networks for visual tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1373–1381.
[11] Y. Song et al., "VITAL: VIsual tracking via adversarial learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8990–8999.
[12] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr, "Fully-convolutional Siamese networks for object tracking," in Proc. ECCV Workshop, 2016, pp. 850–865.
[13] D. Yuan, X. Li, Z. He, Q. Liu, and S. Lu, "Visual object tracking with adaptive structural convolutional network," Knowl.-Based Syst., vol. 194, Apr. 2020, Art. no. 105554.
[14] T. Zhang, C. Xu, and M.-H. Yang, "Multi-task correlation particle filter for robust object tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4819–4827.
[15] Y. Song, C. Ma, L. Gong, J. Zhang, R. W. H. Lau, and M.-H. Yang, "CREST: Convolutional residual learning for visual tracking," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2574–2583.
[16] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. S. Torr, "End-to-end representation learning for correlation filter based tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2805–2813.
[17] N. Wang, Y. Song, C. Ma, W. Zhou, W. Liu, and H. Li, "Unsupervised deep tracking," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1308–1317.
[18] Y. Wu, J. Lim, and M.-H. Yang, "Online object tracking: A benchmark," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2411–2418.
[19] Y. Wu, J. Lim, and M. H. Yang, "Object tracking benchmark," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1834–1848, Sep. 2015.
[20] D. Du et al., "The unmanned aerial vehicle benchmark: Object detection and tracking," in Proc. ECCV, 2018, pp. 370–386.
[21] P. Liang, E. Blasch, and H. Ling, "Encoding color information for visual tracking: Algorithms and benchmark," IEEE Trans. Image Process., vol. 24, no. 12, pp. 5630–5644, Dec. 2015.
[22] M. Mueller, N. Smith, and B. Ghanem, "A benchmark and simulator for UAV tracking," in Proc. ECCV, 2016, pp. 445–461.
[23] Z. Liang and J. Shen, "Local semantic siamese networks for fast tracking," IEEE Trans. Image Process., vol. 29, pp. 3351–3364, 2020.
[24] Q. Wang, J. Gao, J. Xing, M. Zhang, and W. Hu, "DCFNet: Discriminant correlation filters network for visual tracking," 2017, arXiv:1704.04057. [Online]. Available: http://arxiv.org/abs/1704.04057
[25] Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, and S. Wang, "Learning dynamic siamese network for visual object tracking," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1763–1771.
[26] Y. Zhang, L. Wang, J. Qi, D. Wang, M. Feng, and H. Lu, "Structured Siamese network for real-time visual tracking," in Proc. ECCV, 2018, pp. 351–366.
[27] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu, "High performance visual tracking with siamese region proposal network," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8971–8980.
[28] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu, "Distractor-aware Siamese networks for visual object tracking," in Proc. ECCV, 2018, pp. 101–117.
[29] M. H. Abdelpakey and M. S. Shehata, "DP-siam: Dynamic policy siamese network for robust object tracking," IEEE Trans. Image Process., vol. 29, pp. 1479–1492, 2020.
[30] H. Fan and H. Ling, "Siamese cascaded region proposal networks for real-time visual tracking," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 7952–7961.
[31] T. Zhang, S. Liu, C. Xu, B. Liu, and M.-H. Yang, "Correlation particle filter for visual tracking," IEEE Trans. Image Process., vol. 27, no. 6, pp. 2676–2687, Jun. 2018.
[32] F. Liu, C. Gong, X. Huang, T. Zhou, J. Yang, and D. Tao, "Robust visual tracking revisited: From correlation filter to template matching," IEEE Trans. Image Process., vol. 27, no. 6, pp. 2777–2790, Jun. 2018.
[33] Z. He, S. Yi, Y.-M. Cheung, X. You, and Y. Yan Tang, "Robust object tracking via key patch sparse representation," IEEE Trans. Cybern., vol. 47, no. 2, pp. 354–364, Feb. 2017.
[34] G. Ding, W. Chen, S. Zhao, J. Han, and Q. Liu, "Real-time scalable visual tracking via quadrangle kernelized correlation filters," IEEE Trans. Intell. Transp. Syst., vol. 19, no. 1, pp. 140–150, Jan. 2018.
[35] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg, "Learning spatially regularized correlation filters for visual tracking," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 4310–4318.
[36] D. Yuan, X. Shu, and Z. He, "TRBACF: Learning temporal regularized correlation filters for high performance online visual object tracking," J. Vis. Commun. Image Represent., vol. 72, Oct. 2020, Art. no. 102882.
[37] B. Zhang et al., "Latent constrained correlation filter," IEEE Trans. Image Process., vol. 27, no. 3, pp. 1038–1048, Mar. 2018.
[38] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-speed tracking with kernelized correlation filters," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 583–596, Mar. 2015.
[39] H. K. Galoogahi, A. Fagg, and S. Lucey, "Learning background-aware correlation filters for visual tracking," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1135–1143.
[40] W. Feng, R. Han, Q. Guo, J. Zhu, and S. Wang, "Dynamic saliency-aware regularization for correlation filter-based object tracking," IEEE Trans. Image Process., vol. 28, no. 7, pp. 3232–3245, Jul. 2019.
[41] F. Li, C. Tian, W. Zuo, L. Zhang, and M.-H. Yang, "Learning spatial-temporal regularized correlation filters for visual tracking," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4904–4913.
[42] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg, "Beyond correlation filters: Learning continuous convolution operators for visual tracking," in Proc. ECCV, 2016, pp. 472–488.
[43] C. Tian, Y. Xu, W. Zuo, B. Zhang, L. Fei, and C.-W. Lin, "Coarse-to-fine CNN for image super-resolution," IEEE Trans. Multimedia, early access, Jun. 1, 2020, doi: 10.1109/TMM.2020.2999182.
[44] S. Luan, C. Chen, B. Zhang, J. Han, and J. Liu, "Gabor convolutional networks," IEEE Trans. Image Process., vol. 27, no. 9, pp. 4357–4366, Sep. 2018.
[45] C. Tian, Y. Xu, Z. Li, W. Zuo, L. Fei, and H. Liu, "Attention-guided CNN for image denoising," Neural Netw., vol. 124, pp. 117–129, Apr. 2020.
[46] X. Li, C. Ma, B. Wu, Z. He, and M.-H. Yang, "Target-aware deep tracking," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1369–1378.
[47] Y. Wu and T. S. Huang, "Self-supervised learning for visual tracking and recognition of human hand," in Proc. AAAI, 2000, pp. 243–248.
[48] D. Pathak, R. Girshick, P. Dollar, T. Darrell, and B. Hariharan, "Learning features by watching objects move," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2701–2710.
[49] C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy, "Tracking emerges by colorizing videos," in Proc. ECCV, 2018, pp. 391–408.
[50] X. Wang, K. He, and A. Gupta, "Transitive invariance for self-supervised visual representation learning," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1329–1338.
[51] X. Wang, A. Jabri, and A. A. Efros, "Learning correspondence from the cycle-consistency of time," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 2566–2576.
[52] Z. Lai, E. Lu, and W. Xie, "MAST: A memory-augmented self-supervised tracker," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 6479–6488.
[53] P.-Y. Huang, G. Kang, W. Liu, X. Chang, and A. G. Hauptmann, "Annotation efficient cross-modal retrieval with adversarial attentive alignment," in Proc. 27th ACM Int. Conf. Multimedia, Oct. 2019, pp. 1758–1767.
[54] C. Liu, X. Chang, and Y.-D. Shen, "Unity style transfer for person re-identification," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 6887–6896.
[55] X. Chang, Y.-L. Yu, Y. Yang, and E. P. Xing, "Semantic pooling for complex event analysis in untrimmed videos," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 8, pp. 1617–1632, Aug. 2017.
[56] D. Dwibedi, Y. Aytar, J. Tompson, P. Sermanet, and A. Zisserman, "Temporal cycle-consistency learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1801–1810.
[57] P. Sermanet et al., "Time-contrastive networks: Self-supervised learning from video," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2018, pp. 1134–1141.
[58] X. Li, S. Liu, S. D. Mello, X. Wang, J. Kautz, and M.-H. Yang, "Joint-task self-supervised learning for temporal correspondence," in Proc. NIPS, 2019, pp. 317–327.
[59] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-learning-detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1409–1422, Jul. 2012.
[60] D.-Y. Lee, J.-Y. Sim, and C.-S. Kim, "Multihypothesis trajectory analysis for robust visual tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 5088–5096.
[61] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[62] M. Luo, X. Chang, Z. Li, L. Nie, A. G. Hauptmann, and Q. Zheng, "Simple to complex cross-modal learning to rank," Comput. Vis. Image Understand., vol. 163, pp. 67–77, Oct. 2017.
[63] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, "ATOM: Accurate tracking by overlap maximization," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 4660–4669.
[64] E. Park and A. C. Berg, "Meta-tracker: Fast and robust online adaptation for visual object trackers," in Proc. ECCV, 2018, pp. 1–17.
[65] J. Choi et al., "Context-aware deep feature compression for high-speed visual tracking," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 479–488.
[66] Z. Huang, C. Fu, Y. Li, F. Lin, and P. Lu, "Learning aberrance repressed correlation filters for real-time UAV tracking," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 2891–2900.
[67] J. Choi, H. J. Chang, S. Yun, T. Fischer, Y. Demiris, and J. Y. Choi, "Attentional correlation filter network for adaptive visual tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4828–4837.
[68] X. Dong and J. Shen, "Triplet loss in Siamese network for object tracking," in Proc. ECCV, 2018, pp. 459–474.
[69] K. Zhang, Q. Liu, Y. Wu, and M. H. Yang, "Robust visual tracking via convolutional networks without training," IEEE Trans. Image Process., vol. 25, no. 4, pp. 1779–1792, 2016.
[70] H. Fan and H. Ling, "Parallel tracking and verifying: A framework for real-time and high accuracy visual tracking," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5486–5494.
[71] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. Torr, "Staple: Complementary learners for real-time tracking," in Proc. CVPR, 2016, pp. 1401–1409.
[72] Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, and D. Tao, "MUlti-store tracker (MUSTer): A cognitive psychology inspired approach to object tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 749–758.
[73] Z. Liu, Z. Lian, and Y. Li, "A novel adaptive kernel correlation filter tracker with multiple feature integration," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2017, pp. 254–265.
[74] J. Zhang, S. Ma, and S. Sclaroff, "MEEM: Robust tracking via multiple experts using entropy minimization," in Proc. ECCV, 2014, pp. 188–203.
[75] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg, "Discriminative scale space tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 8, pp. 1561–1575, Aug. 2017.
[76] H. Ma, S. T. Acton, and Z. Lin, "SITUP: Scale invariant tracking using average peak-to-correlation energy," IEEE Trans. Image Process., vol. 29, pp. 3546–3557, 2020.

Di Yuan received the M.S. degree in applied mathematics from the Harbin Institute of Technology, Shenzhen, China, in 2017, where he is currently pursuing the Ph.D. degree in computer science with the Research Institute of Biocomputing, School of Computer Science and Technology. His current research interests include object tracking, machine learning, and self-supervised learning.

Xiaojun Chang (Member, IEEE) received the Ph.D. degree from the Centre for Artificial Intelligence and the Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, in 2016. He is currently a Senior Lecturer with the Faculty of Information Technology, Monash University Clayton Campus, Australia. He is also a Distinguished Adjunct Professor with the Faculty of Computing and Information Technology, King Abdulaziz University. Before joining Monash, he was a Postdoctoral Research Associate with the School of Computer Science, Carnegie Mellon University, working with Prof. Alex Hauptmann. He has spent most of his time working on exploring multiple signals (visual, acoustic, and textual) for automatic content analysis in unconstrained or surveillance videos. He has achieved top performance in various international competitions, such as TRECVID MED, TRECVID SIN, and TRECVID AVS. He is an ARC Discovery Early Career Researcher Award (DECRA) Fellow from 2019 to 2021.

Po-Yao Huang (Member, IEEE) is currently pursuing the Ph.D. degree with the School of Computer Science, Carnegie Mellon University. His research interest includes multimodal machine learning. He is particularly interested in bridging computer vision and natural language processing for the tasks of multimodal machine translation, cross-modal search and retrieval, and large-scale multimodal data mining and analysis.

Qiao Liu received the B.E. degree in computer science from Guizhou Normal University, Guiyang, China, in 2016. He is currently pursuing the Ph.D. degree with the Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China. His current research interests include thermal infrared object tracking and machine learning.

Zhenyu He (Senior Member, IEEE) received the Ph.D. degree from the Department of Computer Science, Hong Kong Baptist University, Hong Kong, in 2007. From 2007 to 2009, he worked as a Postdoctoral Researcher with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. He is currently a Full Professor with the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China. His research interests include machine learning, computer vision, image processing, and pattern recognition.
