
Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21)

SiamRCR: Reciprocal Classification and Regression for Visual Object Tracking


Jinlong Peng1∗, Zhengkai Jiang1∗, Yueyang Gu1∗, Yang Wu2†,
Yabiao Wang1, Ying Tai1, Chengjie Wang1 and Weiyao Lin3

1 Tencent Youtu Lab
2 Kyoto University
3 Shanghai Jiao Tong University

{jeromepeng, zhengkjiang, yueyanggu, caseywang, yingtai, jasoncjwang}@tencent.com,
[email protected], [email protected]

∗ Equal contribution.
† Corresponding author: Yang Wu ([email protected]).

Abstract

Recently, most siamese network based trackers locate targets via object classification and bounding-box regression. Generally, they select the bounding box with the maximum classification confidence as the final prediction. This strategy may miss the right result due to the accuracy misalignment between classification and regression. In this paper, we propose a novel siamese tracking algorithm called SiamRCR, addressing this problem with a simple, light and effective solution. It builds reciprocal links between the classification and regression branches, which dynamically re-weight their losses for each positive sample. In addition, we add a localization branch to predict the localization accuracy, so that it can serve as the replacement of the regression assistance link during inference. This branch makes training and inference more consistent. Extensive experimental results demonstrate the effectiveness of SiamRCR and its superiority over state-of-the-art competitors on GOT-10k, LaSOT, TrackingNet, OTB-2015, VOT-2018 and VOT-2019. Moreover, our SiamRCR runs at 65 FPS, far above the real-time requirement.

[Figure 1: Case study on the accuracy misalignment problem between classification and regression of siamese network based tracking models, and our solution. Four example frames compare ground-truth IoU, classification score and the proposed tracking score (e.g., GT IoU 0.60 with classification score 0.98 but tracking score 0.63, versus GT IoU 0.94 with classification score 0.75 and tracking score 0.92). The yellow bounding boxes denote the ground-truths, while the red and green bounding boxes are the winners ranked by the classification score and the proposed tracking score, respectively. Clearly, the tracking scores generated by SiamRCR are much more consistent with the localization/regression accuracy values (IoU), leading to better tracking performance.]

1 Introduction

As one of the fundamental research topics in computer vision, visual object tracking (VOT) plays an important role in many applications such as human-computer interaction, visual surveillance, medical image processing, and so on. It aims to locate an object throughout a video given only a ground-truth bounding box for the target in a chosen frame where it appears. There is no prior knowledge about the object class, which is the most distinctive characteristic of object tracking. Although researchers have paid much attention to object tracking, it remains a challenging task whenever any of the following factors is significant: occlusion, deformation, and scale variation.

Recently, siamese network based tracking has attracted increasing interest due to its balance between accuracy and efficiency [Bertinetto et al., 2016; Li et al., 2018; Zhang et al., 2020]. A siamese network consists of two branches sharing the same parameters for feature extraction. The exemplar image (the ground-truth in the first frame) and the search image (the ROI of a frame to be tracked in) are its inputs. After feature extraction and cross-correlation, the network splits into two branches: a classification branch outputs a confidence map for position estimation, and a regression branch predicts the target bounding box corresponding to each position of the confidence map. Such a structure allows a straightforward inference method: find the maximum value on the 2D confidence map (from the classification branch) and then use its position to fetch the corresponding regressed bounding box (from the regression branch). However, such a siamese structure generally has classification and regression optimized independently, and all existing models have failed to make them properly synchronized. This results in the accuracy misalignment between classification and regression. As shown in Figure 1, a predicted box with high classification confidence may not have high regression accuracy in terms of the IoU (Intersection over Union) score. Due to the misalignment, the bounding box which locates the target more accurately than the others might be discarded, leading to inferior tracking performance. Although some recent siamese networks [Danelljan et al., 2019; Xu et al., 2020] have tried to predict the localization/regression accuracy, the misalignment there is still severe, since the independent optimization of classification and regression remains unsolved.


[Figure 2: Our proposed siamese framework for Reciprocal Classification and Regression (SiamRCR). It consists of a feature extractor, a feature combination module, and a three-branch siamese head structure with the novel reciprocal links over the individual losses. The backbone takes the 127x127 exemplar and the 255x255 search image; depth-wise cross-correlation yields 256x25x25 feature maps; the localization-aware anchor-free head outputs a 1x25x25 classification score map (focal loss), a 1x25x25 localization score map (BCE loss) and a 4x25x25 regression map (IoU loss). Note that the three links between the branches (classification assistance link, regression assistance link, localization supervision) are only used for loss calculation during training and do not exist during inference.]
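To make the head structure of Figure 2 concrete, the following is a minimal PyTorch sketch of the three output branches. The layer sizes follow the figure (256-channel, 25x25 correlation features); the tower depths and all module names are our assumptions, not taken from any released SiamRCR code.

```python
import torch
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    """Localization-aware anchor-free head with three outputs (Figure 2)."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # One conv block per tower; the real tower depth is not specified.
        self.cls_tower = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.reg_tower = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.cls_out = nn.Conv2d(channels, 1, 3, padding=1)  # 1x25x25 class scores
        self.reg_out = nn.Conv2d(channels, 4, 3, padding=1)  # (w, h, dx, dy) per location
        # The localization branch grows from the regression tower (Section 3.3).
        self.loc_out = nn.Conv2d(channels, 1, 3, padding=1)  # predicted localization score

    def forward(self, corr_feat: torch.Tensor):
        cls_feat = self.cls_tower(corr_feat)
        reg_feat = self.reg_tower(corr_feat)
        p_cls = torch.sigmoid(self.cls_out(cls_feat))  # B x 1 x 25 x 25
        t_reg = self.reg_out(reg_feat)                 # B x 4 x 25 x 25
        p_loc = torch.sigmoid(self.loc_out(reg_feat))  # B x 1 x 25 x 25
        return p_cls, t_reg, p_loc
```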

In this paper, we propose a novel solution to alleviate the misalignment, which builds a reciprocal relationship between classification and regression so that they can be optimized in a synchronized way and generate accuracy-consistent outputs. Since this reciprocal relationship is the key to its success, we name our model Siamese Network based Reciprocal Classification and Regression, abbreviated SiamRCR. The overall framework is shown in Figure 2. Besides the commonly used classification branch and regression branch, we add two links (the classification assistance link and the regression assistance link) to build the reciprocal relationship between them during model training. Classification assists regression by weighting the regression loss with the classification confidence, so that regression can focus more on highly confident positions for more precise localization. Regression assists classification by weighting the classification loss with the localization score derived from the regressed bounding box and the ground-truth box, forcing the classification score to be more consistent with the regression accuracy. Since no such localization score exists during testing/inference (the ground-truth bounding box is unknown), a localization branch is added to predict the localization score at each position, so that its prediction can be used as an approximation that is consistent with the trained model. Therefore, multiplying the classification confidence with the predicted localization confidence generates a new tracking score/confidence map during testing, which keeps inference consistent with the training process.

Besides the key idea of reciprocal classification and regression, two other designs also contribute to the effectiveness and superiority of our model. One is that we build on the anchor-free tracking mechanism, so that the whole model is one-stage, clean and efficient, with fewer hyper-parameters. The other is that our model predicts the center offset and width/height of the target, which is more straightforward and efficient than other VOT methods.

The main contributions of this work are as follows:
1. We propose a novel tracking model that solves the long-standing classification and regression misalignment problem, with new simple, intuitive and efficient designs.
2. It presents a new way of linking the losses of multiple branches and making the training and inference processes more consistent, which may provide inspiration for other tasks.
3. Our SiamRCR achieves state-of-the-art performance on six public benchmarks, including GOT-10k, TrackingNet, LaSOT, OTB-2015, VOT-2018 and VOT-2019. The framework is built on an anchor-free mechanism with a more direct center offset and width/height prediction, running at 65 FPS.

2 Related Works

2.1 Siamese Network based Framework
Compared with traditional correlation filter tracking methods, recent siamese network based methods have achieved superior performance since the pioneering work SiamFC [Bertinetto et al., 2016]. More recent studies [Li et al., 2018; Li et al., 2019a] introduce advances from object detection into object tracking for more accurate location prediction. Though these works have explored several important aspects, the accuracy misalignment problem between classification and regression has been overlooked. Ocean [Zhang et al., 2020] partially addresses a similar issue and presents a feature alignment module that alleviates it by using the prediction of the regression branch to refine the classification branch. However, this cannot eliminate the misalignment problem, as the alignment is monodirectional. Differently, our SiamRCR focuses on the misalignment problem and proposes a simple, intuitive and more thorough solution, with bidirectional, reciprocal links and a novel complementary branch that makes training and inference consistent.

2.2 Anchor-Free Tracking Mechanism
Anchor-free methods have recently attracted widespread attention in the object detection field [Law and Deng, 2018; Duan et al., 2019; Tian et al., 2019; Zhou et al., 2019] due to their simplicity and efficiency. Naturally, the anchor-free mechanism has also been introduced to the tracking field [Xu et al., 2020; Chen et al., 2020; Zhang et al., 2020].


Multiple object tracking (MOT) is a related area of VOT [Peng et al., 2020a; Peng et al., 2020c]. In the MOT area, CenterTrack [Zhou et al., 2020], built on CenterNet [Zhou et al., 2019], obtains high performance by predicting the center point, width/height and center offset of each object. To the best of our knowledge, SiamRCR is the first VOT method that predicts the center offset and width/height of the target, which is more straightforward and efficient than previous designs.

2.3 Dynamic Sample Re-weighting
Existing trackers [Li et al., 2018; Li et al., 2019a; Xu et al., 2020; Peng et al., 2020b] directly use heuristic rules, e.g., the Focal Loss [Lin et al., 2017], to define the labels of samples and their weights. PrDiMP [Danelljan et al., 2020] models the uncertainty of the labels. Such predefined static weights lead to the accuracy misalignment problem between classification and regression, which harms the final tracking accuracy. In our SiamRCR, by contrast, the sample weights for each loss are dynamic, as they are conditioned on the other branch's outputs, which keep changing during the interaction. Such a dynamic sample re-weighting mechanism is novel and also critical to the effectiveness of our model.

2.4 Localization Prediction Strategy
In the object detection area, IoU-Net [Jiang et al., 2018] predicts the IoU between each detected box and the matched ground-truth to guide the box regression; it is class-specific and thus not directly suitable for VOT. ATOM [Danelljan et al., 2019] trains a target-specific IoU prediction network offline, and SiamFC++ [Xu et al., 2020] estimates the bounding box quality based on centerness [Tian et al., 2019]. However, both the purpose and the implementation of the localization branch in our SiamRCR are different. Our localization branch is a natural auxiliary of the reciprocal classification and regression structure, which is itself a better solution than existing works, whereas in those works the IoU network is the main component. Moreover, our localization branch is simple and lightweight, which ensures the effectiveness and the efficiency of the algorithm simultaneously.

3 Proposed Method

3.1 Overview
The proposed siamese tracking framework is shown in Figure 2. Different from previous anchor-based methods [Li et al., 2018; Li et al., 2019a], which rely on pre-defined anchor sizes and scales, our method is anchor-free. It operates as follows. First, the target template and the current frame are both fed into the shared feature extractor (using the backbone of [He et al., 2016]) to generate their corresponding features. Then, these features are combined through a depth-wise cross-correlation operation to create correlated feature maps, which are further fed into the classification and regression branches of the anchor-free tracking head. The built-in reciprocal links dynamically re-weight the samples when computing the loss of each of the two branches. A new localization branch grows from the regression branch to predict the localization accuracy. Its output serves as the approximation of the localization score during inference, generating a more accurate tracking score together with the classification confidence. The key components are detailed below.

3.2 Anchor-Free Tracking with Box Regression
For the i-th input pair from the training set, let $F_i \in \mathbb{R}^{C \times H \times W}$ denote the feature map of the classification branch and $s$ be the total stride. The ground-truth bounding box for the current frame is defined as $B^*_{x,y} = (x^*_0, y^*_0, x^*_1, y^*_1)$, i.e., the coordinates of the bounding box. Each location $(x, y)$ on the feature map $F_i$ can be mapped back onto the input frame to the image coordinates $(\lfloor s/2 \rfloor + xs, \lfloor s/2 \rfloor + ys)$. Different from anchor-based trackers, which treat each location on the input frame as the center of anchor boxes and regress the target bounding boxes w.r.t. those anchors, we directly regress the target box's width and height values and the center offsets at the location. In this way, our tracker views locations as training samples instead of anchor boxes, following the paradigm of FCNs [Long et al., 2015] for semantic segmentation.

Specifically, the sample at location $(x, y)$ is considered positive if it falls within a radius $r$ of the ground-truth box center, where the radius is a hyper-parameter of the proposed method. Otherwise, it is a negative sample (background). Besides the label (denoted by $c^*_{x,y}$) for foreground-background classification, we also have a 4D real vector $t^*_{x,y} = (w^*, h^*, \Delta x^*, \Delta y^*)$ indicating the regression target for the localization. Here, $w^*$ and $h^*$ are the width and height of the target ground-truth bounding box, while $\Delta x^*$ and $\Delta y^*$ are the center offsets between the current location and the ground-truth box. Formally, if location $(x, y)$ is associated to the ground-truth box $B^*_{x,y}$ with width $w^*$ and height $h^*$, then we have

$$w^* = x^*_1 - x^*_0, \quad h^* = y^*_1 - y^*_0, \quad \Delta x^* = (x^*_0 + x^*_1)/2 - x, \quad \Delta y^* = (y^*_0 + y^*_1)/2 - y. \qquad (1)$$

Corresponding to the training target, SiamRCR predicts a classification confidence score $p^{cls}_{x,y}$, a regressed 4D vector $t_{x,y} = (w, h, \Delta x, \Delta y)$ for the bounding box, and a localization confidence score $p^{loc}_{x,y}$ denoting the predicted localization accuracy. It is worth noting that SiamRCR has 5x fewer network parameters than the popular anchor-based tracker SiamRPN [Li et al., 2018] with 5 anchor boxes per location.

3.3 Reciprocal Classification and Regression
In existing siamese network tracking models, the classification and regression branches operate in parallel and are optimized independently with their own losses, which aggravates the accuracy misalignment of their results. In fact, when a regressed bounding box has low accuracy, the corresponding classification score should not be high, because if that position becomes the winner of classification confidence, the bad localization will lead to bad tracking performance. Conversely, when a bounding box has a low classification score, there is no point in the regression working hard for high localization accuracy, since that box will not be the winner anyway. Therefore, these two branches need to talk to each other to align the accuracy of their results.
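Before detailing the two assistance links, here is a minimal sketch of the label and target assignment from Section 3.2 (Eq. 1) for a single ground-truth box. The paper does not specify the distance measure used for the radius test; we use the per-axis (L-infinity) distance here, and the function name and default values are our assumptions (stride 8 follows from R = 8r in the radius ablation).

```python
import torch

def build_targets(gt_box, size=25, stride=8, radius=2):
    """gt_box = (x0, y0, x1, y1) in input-image pixels.
    Returns c_star (1.0 = positive sample) and t_star = (w*, h*, dx*, dy*)."""
    x0, y0, x1, y1 = gt_box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    # Map feature-map locations back to image coordinates:
    # (floor(s/2) + x*s, floor(s/2) + y*s).
    coords = stride // 2 + torch.arange(size, dtype=torch.float32) * stride
    ys, xs = torch.meshgrid(coords, coords, indexing="ij")
    # Positive if within `radius` feature-map cells of the box center.
    dist = torch.maximum((xs - cx).abs(), (ys - cy).abs())
    c_star = (dist <= radius * stride).float()            # size x size
    # Regression targets of Eq. (1): shared width/height, per-location offsets.
    t_star = torch.stack([
        torch.full_like(xs, x1 - x0),  # w*
        torch.full_like(xs, y1 - y0),  # h*
        cx - xs,                       # dx*
        cy - ys,                       # dy*
    ])                                 # 4 x size x size
    return c_star, t_star
```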


[Figure 3: The head of SiamRCR during inference. The 256x25x25 features feed the classification, localization and regression outputs (1x25x25, 1x25x25 and 4x25x25, respectively); the classification score and localization score are multiplied to generate the final tracking score for ranking the predicted bounding boxes.]

In this paper, we propose a novel strategy called reciprocal classification and regression to make these two branches assist each other. It is implemented by building two links: the regression assistance link and the classification assistance link.

Regression Assistance Link
To eliminate the chance that bounding boxes with low localization accuracy still get high classification scores, a simple yet effective solution is to use the localization accuracy to weight the classification loss. Such an assistance link from regression can be regarded as a kind of dynamic sample re-weighting, as the localization accuracy keeps changing during model optimization. The dynamically re-weighted classification loss can be formulated as:

$$L_{cls} = \frac{1}{N_{pos}} \sum_{x,y} L_{Focal}(p^{cls}_{x,y}, c^*_{x,y}) \cdot IoU(B_{x,y}, B^*_{x,y}), \qquad (2)$$

where $L_{Focal}$ and $IoU$ denote the focal loss [Lin et al., 2017] and the IoU score, respectively, $N_{pos}$ is the number of positive samples, and $B_{x,y} = (x_0, y_0, x_1, y_1)$ is the predicted bounding box at location $(x, y)$ with predicted width/height $(w, h)$ and center offsets $(\Delta x, \Delta y)$:

$$x_0 = x + \Delta x - w/2, \quad y_0 = y + \Delta y - h/2, \quad x_1 = x + \Delta x + w/2, \quad y_1 = y + \Delta y + h/2. \qquad (3)$$

Classification Assistance Link
To avoid low-confidence positions getting highly accurate bounding boxes, the regression branch should be aware of the classification confidence. To this end, $p^{cls}_{x,y}$ is utilized to dynamically re-weight the regression loss as:

$$L_{reg} = \frac{1}{N_{pos}} \sum_{x,y} \mathbb{I}_{\{c^*_{x,y}=1\}} L_{IoU}(t_{x,y}, t^*_{x,y}) \cdot p^{cls}_{x,y}, \qquad (4)$$

where $L_{IoU}$ is the IoU loss as in UnitBox [Yu et al., 2016], and $\mathbb{I}_{\{c^*_{x,y}=1\}}$ is an indicator function which equals 1 if $c^*_{x,y} = 1$ and 0 otherwise.

Localization Score Branch
The regression assistance link makes the classification branch aware of the regression accuracy during training, thanks to the ground-truth bounding box $B^*_{x,y}$ used for computing the localization score. However, in the inference stage there is no such ground-truth. Directly using the classification confidence map $p^{cls}$ to select the winning bounding box may still lead to certain accuracy misalignment, as the localization score was only hands-on during the classification branch's training. This hands-on inductive training makes $p^{cls}$ collaborative with the localization score but not necessarily consistent with it. Therefore, we let the regression branch grow a new branch, called the localization branch, which is trained to predict the localization score from the regression feature maps under the following loss function:

$$L_{loc} = \frac{1}{N_{pos}} \sum_{x,y} \mathbb{I}_{\{c^*_{x,y}=1\}} L_{BCE}(p^{loc}_{x,y}, IoU(B_{x,y}, B^*_{x,y})), \qquad (5)$$

where $L_{BCE}$ is the Binary Cross Entropy (BCE) loss.

As shown in Figure 3, during inference the final tracking score (used for ranking the predicted bounding boxes) is computed by multiplying $p^{cls}_{x,y}$ with $p^{loc}_{x,y}$, making the inference localization-aware. Thus, the localization branch can further suppress low-quality boxes and improve the overall tracking accuracy.

The Overall Training Objective
With the above losses for SiamRCR's three branches, we can define its final training loss function as:

$$L = L_{cls} + \lambda_1 L_{reg} + \lambda_2 L_{loc}, \qquad (6)$$

where $\lambda_1$ and $\lambda_2$ are hyper-parameters for balancing these losses. In our experiments, they are both set to 1.

4 Experiments

4.1 Implementation Details
Training Phase. We utilize ResNet-50 [He et al., 2016] as the backbone of SiamRCR. We remove the last conv-block for a higher-resolution feature map and utilize dilated convolution for a larger receptive field [Li et al., 2019a]. The backbone is initialized with parameters pre-trained on ImageNet [Russakovsky et al., 2015]. The whole network is optimized by Stochastic Gradient Descent (SGD) with momentum 0.9 on the datasets GOT-10k [Huang et al., 2019], TrackingNet [Müller et al., 2018], COCO [Lin et al., 2014], LaSOT [Fan et al., 2019], ImageNet VID [Russakovsky et al., 2015] and ImageNet DET [Russakovsky et al., 2015]. We train the network for 20 epochs in total with a batch size of 128. The learning rate rises from 0.000001 to 0.1 over the first 5 epochs for warm-up and decays from 0.1 to 0.0001 with a cosine schedule over the last 15 epochs. We freeze the backbone in the first 10 epochs and fine-tune it in the remaining 10 epochs with a reduced learning rate (multiplied by 0.1). The sizes of the exemplar image and the search image are 127x127 and 255x255, respectively. Our algorithm is implemented in Python 3.6 and PyTorch 1.1.0. The experiments are conducted on a server with an Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz and an NVIDIA Tesla P40 24GB GPU with CUDA 10.1.
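To tie Section 3.3 together, the following sketch shows one way Eqs. (2)-(6) could be computed, assuming a sigmoid focal loss on classification logits and the -log(IoU) form of the UnitBox IoU loss. The detach() calls reflect our reading that each re-weighting term acts as a constant weight rather than a gradient path, and since the paper does not state how Eq. (2) weights negative samples, we leave them at weight 1. All function names are ours.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def box_iou(b1, b2, eps=1e-6):
    """Element-wise IoU of (N, 4) boxes given as (x0, y0, x1, y1)."""
    lt = torch.maximum(b1[:, :2], b2[:, :2])
    rb = torch.minimum(b1[:, 2:], b2[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area1 = (b1[:, 2:] - b1[:, :2]).clamp(min=0).prod(dim=1)
    area2 = (b2[:, 2:] - b2[:, :2]).clamp(min=0).prod(dim=1)
    return inter / (area1 + area2 - inter + eps)

def siamrcr_loss(cls_logits, boxes, p_loc, c_star, boxes_star,
                 lambda1=1.0, lambda2=1.0):
    """cls_logits, p_loc, c_star: (N,) over all locations; boxes and
    boxes_star: (N, 4), with boxes decoded from (w, h, dx, dy) via Eq. (3)."""
    pos = c_star == 1
    n_pos = pos.sum().clamp(min=1).float()
    iou = box_iou(boxes, boxes_star)
    p_cls = cls_logits.sigmoid()
    # Eq. (2): focal loss re-weighted by the localization score (IoU);
    # negatives keep weight 1 (our assumption).
    focal = sigmoid_focal_loss(cls_logits, c_star.float(), reduction="none")
    weight = torch.where(pos, iou.detach(), torch.ones_like(iou))
    l_cls = (focal * weight).sum() / n_pos
    # Eq. (4): UnitBox-style IoU loss re-weighted by classification confidence.
    l_reg = (-torch.log(iou[pos] + 1e-6) * p_cls[pos].detach()).sum() / n_pos
    # Eq. (5): BCE between the predicted and the actual localization accuracy.
    l_loc = F.binary_cross_entropy(p_loc[pos], iou[pos].detach(),
                                   reduction="sum") / n_pos
    # Eq. (6): overall objective.
    return l_cls + lambda1 * l_reg + lambda2 * l_loc
```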


[Figure 4: The correlation between IoU scores (w.r.t. ground-truth boxes) and the tracking score, together with the Pearson correlation coefficient R. (a) Baseline model, where the tracking score is the classification score (R = 0.38). (b) Using the centerness proposed in FCOS [Tian et al., 2019] as the tracking score (R = 0.61). (c) Baseline + localization branch (R = 0.71). (d) SiamRCR (R = 0.77).]

Table 1: Ablation study on GOT-10k test set.

Variant  Localization Branch  Reciprocal Links  AO↑
I                                               0.594
II       ✓                                      0.615
III                           ✓                 0.611
IV       ✓                    ✓                 0.624

Table 2: Comparative experiments in terms of r on GOT-10k test set. R is the corresponding radius in the original input video frame (8 times r).

r  R   AO↑    SR0.5↑  SR0.75↑
1  8   0.593  0.723   0.458
2  16  0.624  0.752   0.460
3  24  0.619  0.747   0.459
4  32  0.612  0.743   0.446
5  40  0.611  0.740   0.474

Testing Phase. We utilize the same offline testing strategy as [Xu et al., 2020]. The ground-truth of the first frame, after augmentation, is used as the exemplar image, and we keep it unchanged during the whole testing phase. A cosine window [Bertinetto et al., 2016] is multiplied onto the confidence map. We adopt a linear interpolation updating strategy on the scale prediction to make the final box change smoothly over time. We evaluate SiamRCR on six public benchmarks following their corresponding protocols: GOT-10k [Huang et al., 2019], TrackingNet [Müller et al., 2018], LaSOT [Fan et al., 2019], OTB-2015 [Wu et al., 2015], VOT-2018 [Kristan et al., 2018] and VOT-2019 [Kristan et al., 2019].

4.2 Ablation Study
Components. The ablation study results on the key components of SiamRCR are presented in Table 1. The baseline (I), without the localization branch and the reciprocal links, obtains an AO (Average Overlap) of 0.594. With the localization branch, SiamRCR can predict the localization score of the regressed bounding box, making the final tracking score more consistent with the real IoU than the classification score. Multiplying in the localization score alone (II) improves the performance by a relative 3.54% over the baseline, showing the significance of the accuracy misalignment between classification and regression. Building the reciprocal assistance links alone (III) also gains a relative improvement of 2.86% over the baseline, proving that the misalignment between classification and regression can be alleviated. When the two components are adopted together (IV), the relative improvement is more remarkable, 5.05%, which is nearly equal to the direct sum of both performance gains. This confirms that the localization branch is consistent with the reciprocal links, serving well as the replacement of the regression assistance link for inference. To better demonstrate how well our SiamRCR alleviates the accuracy misalignment problem, we illustrate the correlation between the IoU of the regressed bounding box (w.r.t. the matched ground-truth) and the tracking score in Figure 4. As shown in Figure 4(a), the Pearson correlation coefficient between IoU and tracking score is only 0.38, showing that the classification score is indeed not consistent with the real localization accuracy. Figures 4(c) and 4(d) show that both the localization branch and the reciprocal links are effective and necessary, and that they collaborate well with each other.

Predicted IoU vs. Centerness. Centerness is a pre-defined label which indicates the distance between a candidate and the target center. Some object detection [Tian et al., 2019] and object tracking [Xu et al., 2020] algorithms utilize centerness to assist localization. In our SiamRCR, we discard this kind of fixed prior and instead use the predicted IoU as dynamic, supervised localization information. Thus, our localization branch can estimate the localization confidence more accurately. As shown in Figure 4(b) and (c), our localization prediction mechanism alleviates the misalignment between classification and regression better than centerness.

Radius. The radius r is a significant hyper-parameter in our proposed anchor-free framework. It determines the division into positive and negative samples during training. We conduct comparative experiments in terms of r; the results are shown in Table 2, where R is the corresponding radius in the original input video frame, which is 8 times r. When r = 1, the performance on GOT-10k is poor since the number of positive samples is too small. When r = 2, our SiamRCR achieves the best performance. When r = 4 or r = 5, the positive samples are redundant, since some candidates far from the target center are labeled positive; the performance therefore drops compared with r = 2 or r = 3.

4.3 Comparison with the State-of-the-Art
We compare our SiamRCR with 18 state-of-the-art trackers. The datasets and experimental settings are detailed below. Due to space limitations, the experiments on OTB-2015 and VOT-2018 are presented in the supplementary material.
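As a concrete view of the inference step (Figure 3) combined with the cosine-window penalty from the Testing Phase paragraph, here is a minimal sketch; the window mixing weight and all names are our assumptions, as the paper does not give the exact windowing formula.

```python
import torch

def select_box(p_cls, p_loc, boxes, window_influence=0.4):
    """p_cls, p_loc: (25, 25) score maps; boxes: (25, 25, 4) boxes decoded
    via Eq. (3). Returns the box with the best penalized tracking score."""
    tracking_score = p_cls * p_loc  # localization-aware tracking score
    hann = torch.outer(torch.hann_window(25), torch.hann_window(25))
    # Cosine-window penalty favoring positions near the previous target.
    score = (1 - window_influence) * tracking_score + window_influence * hann
    idx = int(torch.argmax(score))
    y, x = divmod(idx, score.shape[1])
    return boxes[y, x], score[y, x]
```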

[Figure 5: Comparison of tracking results (success plots) on the GOT-10k benchmark. AO scores: Ours 0.624, DiMP50 0.611, SiamFC++ 0.595, Ocean 0.592, ATOM18 0.556, SiamRPN++ 0.517, SiamRPN 0.463, SiamFC 0.348, ECO 0.316, MDNet 0.299.]

Table 3: Comparison of tracking results on the TrackingNet benchmark. In the original, red and blue fonts indicate the best and second-best results, respectively.

Tracker                            Succ.↑  Prec.↑  N-Prec.↑
SiamFC [Bertinetto et al., 2016]   0.559   0.518   0.652
ECO [Danelljan et al., 2017]       0.554   0.492   0.618
UPDT [Zhang et al., 2019]          0.611   0.557   0.702
ATOM [Danelljan et al., 2019]      0.703   0.648   0.771
SiamRPN++ [Li et al., 2019a]       0.733   0.694   0.800
DiMP50 [Bhat et al., 2019]         0.740   0.687   0.801
KYS [Bhat et al., 2020]            0.740   0.688   0.800
SiamAttn [Yu et al., 2020]         0.752   0.715   0.817
SiamFC++ [Xu et al., 2020]         0.754   0.705   0.800
SiamRCR (ours)                     0.764   0.716   0.818

Table 4: Comparison of tracking results on the LaSOT benchmark.

Tracker                            Succ.↑  Prec.↑
SiamFC [Bertinetto et al., 2016]   0.339   0.336
MDNet [Nam and Han, 2016]          0.373   0.397
GradNet [Li et al., 2019b]         0.351   0.365
SiamRPN++ [Li et al., 2019a]       0.491   0.496
ATOM [Danelljan et al., 2019]      0.505   0.514
DiMP50 [Bhat et al., 2019]         0.564   0.568
ROAM++ [Yang et al., 2020]         0.447   0.445
SiamBAN [Chen et al., 2020]        0.514   0.518
Ocean [Zhang et al., 2020]         0.526   0.526
SiamFC++ [Xu et al., 2020]         0.544   0.547
SiamRCR (ours)                     0.575   0.599

Table 5: Comparison of tracking results on the VOT-2019 benchmark.

Tracker                            EAO↑   Accuracy↑  Robustness↓
SPM [Wang et al., 2019a]           0.275  0.577      0.507
SiamRPN++ [Li et al., 2019a]       0.285  0.599      0.482
SiamMask [Wang et al., 2019b]      0.287  0.594      0.461
SiamBAN [Chen et al., 2020]        0.327  0.602      0.396
Ocean [Zhang et al., 2020]         0.327  0.590      0.376
SiamRCR (ours)                     0.336  0.602      0.386

GOT-10k. The evaluation follows the protocols in [Huang et al., 2019]. For a fair comparison, we train SiamRCR only on the train subset, which consists of about 10,000 sequences, and test it on the test subset of 180 sequences. As shown in Figure 5, our SiamRCR achieves an AO of 0.624, the best among the evaluated trackers (including the online-updating tracker DiMP). The slightly inferior performance at large overlap thresholds might be due to SiamRCR's strategy of predicting the center offsets and width/height rather than the bounding box coordinate offsets (e.g., SiamFC++), as larger value ranges can lead to less precision. However, our strategy better solves the misalignment problem.

TrackingNet. Its test subset contains 511 sequences and 70 object classes. We again train our model only on the TrackingNet train subset. TrackingNet uses three metrics: Success (Succ.), Precision (Prec.) and Normalized Precision (N-Prec.). We report the results in Table 3. SiamRCR surpasses the other state-of-the-art trackers on all three evaluation metrics. In particular, SiamRCR obtains 0.764 Succ., 0.716 Prec. and 0.818 N-Prec., which further demonstrates its superior tracking performance.

LaSOT. LaSOT is a large-scale long-term tracking benchmark. It contains 1,400 sequences and more than 3.5 million frames. We train our model only on the LaSOT train subset and conduct the evaluation following protocol II in [Fan et al., 2019]. As shown in Table 4, our SiamRCR achieves 0.575 Succ. and 0.599 Prec., outperforming the recent SOTA tracker Ocean by 8.5% and 13.9% in terms of Success and Precision, respectively. It also performs better than the other localization-aware trackers (ATOM and SiamFC++), proving that our reciprocal links with the localization branch are the better design.

VOT-2019. With challenging factors such as occlusion, fast motion and illumination change across its 60 test sequences, VOT-2019 provides a comprehensive evaluation platform for VOT. Its commonly used metrics are Expected Average Overlap (EAO), Accuracy and Robustness. EAO takes both Accuracy and Robustness into account to verify the overall tracking performance. We report the experimental results on VOT-2019 in Table 5. Our SiamRCR achieves the best EAO score, the best Accuracy score and the second-best Robustness score. Ocean performs slightly better in Robustness thanks to its multi-feature combination strategy. As our SiamRCR only uses a single conv-feature for estimation, it is faster than Ocean. Overall, it demonstrates superior effectiveness and efficiency.

5 Conclusion
In this paper, we have proposed a novel anchor-free object tracking framework which is efficient and effective. It addresses the long-standing accuracy misalignment problem of siamese network based models. Elaborate ablation studies have shown the effectiveness of the whole proposed model and its key components. Without bells and whistles, the proposed method achieves state-of-the-art performance on six tracking benchmarks, with a running speed of 65 FPS.

References

[Bertinetto et al., 2016] Luca Bertinetto, Jack Valmadre, Joao F. Henriques, Andrea Vedaldi, and Philip H.S. Torr. Fully-convolutional siamese networks for object tracking. In ECCV, 2016.

[Bhat et al., 2019] Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. Learning discriminative model prediction for tracking. In ICCV, 2019.

[Bhat et al., 2020] Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. Know your surroundings: Exploiting scene information for object tracking. In ECCV, 2020.

[Chen et al., 2020] Zedu Chen, Bineng Zhong, Guorong Li, Shengping Zhang, and Rongrong Ji. Siamese box adaptive network for visual tracking. In CVPR, 2020.

[Danelljan et al., 2017] Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. ECO: Efficient convolution operators for tracking. In CVPR, 2017.

[Danelljan et al., 2019] Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. ATOM: Accurate tracking by overlap maximization. In CVPR, 2019.

[Danelljan et al., 2020] Martin Danelljan, Luc Van Gool, and Radu Timofte. Probabilistic regression for visual tracking. In CVPR, 2020.

[Duan et al., 2019] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. CenterNet: Keypoint triplets for object detection. In CVPR, 2019.

[Fan et al., 2019] Heng Fan, Liting Lin, Fan Yang, et al. LaSOT: A high-quality benchmark for large-scale single object tracking. In CVPR, 2019.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[Huang et al., 2019] Lianghua Huang, Xin Zhao, and Kaiqi Huang. GOT-10k: A large high-diversity benchmark for generic object tracking in the wild. TPAMI, 2019.

[Jiang et al., 2018] Borui Jiang, Ruixuan Luo, Jiayuan Mao, Tete Xiao, and Yuning Jiang. Acquisition of localization confidence for accurate object detection. In ECCV, 2018.

[Kristan et al., 2018] Matej Kristan, Ales Leonardis, Jiri Matas, et al. The sixth visual object tracking VOT2018 challenge results. In ECCV, 2018.

[Kristan et al., 2019] Matej Kristan, Jiri Matas, Ales Leonardis, et al. The seventh visual object tracking VOT2019 challenge results. In ICCVW, 2019.

[Law and Deng, 2018] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In ECCV, 2018.

[Li et al., 2018] Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu. High performance visual tracking with siamese region proposal network. In CVPR, 2018.

[Li et al., 2019a] Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. SiamRPN++: Evolution of siamese visual tracking with very deep networks. In CVPR, 2019.

[Li et al., 2019b] Peixia Li, Boyu Chen, Wanli Ouyang, Dong Wang, Xiaoyun Yang, and Huchuan Lu. GradNet: Gradient-guided network for visual object tracking. In ICCV, 2019.

[Lin et al., 2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, et al. Microsoft COCO: Common objects in context. In ECCV, 2014.

[Lin et al., 2017] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In CVPR, 2017.

[Long et al., 2015] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.

[Müller et al., 2018] Matthias Müller, Adel Bibi, Silvio Giancola, et al. TrackingNet: A large-scale dataset and benchmark for object tracking in the wild. In ECCV, 2018.

[Nam and Han, 2016] Hyeonseob Nam and Bohyung Han. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, 2016.

[Peng et al., 2020a] Jinlong Peng, Yueyang Gu, Yabiao Wang, Chengjie Wang, Jilin Li, and Feiyue Huang. Dense scene multiple object tracking with box-plane matching. In ACM MM, 2020.

[Peng et al., 2020b] Jinlong Peng, Changan Wang, Fangbin Wan, Yang Wu, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Yanwei Fu. Chained-tracker: Chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In ECCV, 2020.

[Peng et al., 2020c] Jinlong Peng, Tao Wang, Weiyao Lin, Jian Wang, John See, Shilei Wen, and Erui Ding. TPM: Multiple object tracking with tracklet-plane matching. PR, 2020.

[Russakovsky et al., 2015] Olga Russakovsky, Jia Deng, Hao Su, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.

[Tian et al., 2019] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In ICCV, 2019.

[Wang et al., 2019a] Guangting Wang, Chong Luo, Zhiwei Xiong, and Wenjun Zeng. SPM-Tracker: Series-parallel matching for real-time visual object tracking. In CVPR, 2019.

[Wang et al., 2019b] Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, and Philip H.S. Torr. Fast online object tracking and segmentation: A unifying approach. In CVPR, 2019.

[Wu et al., 2015] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Object tracking benchmark. TPAMI, 2015.

[Xu et al., 2020] Yinda Xu, Zeyu Wang, Zuoxin Li, Ye Yuan, and Gang Yu. SiamFC++: Towards robust and accurate visual tracking with target estimation guidelines. In AAAI, 2020.

[Yang et al., 2020] Tianyu Yang, Pengfei Xu, Runbo Hu, Hua Chai, and Antoni Chan. ROAM: Recurrently optimizing tracking model. In CVPR, 2020.

[Yu et al., 2016] Jiahui Yu, Yuning Jiang, Zhangyang Wang, Zhimin Cao, and Thomas Huang. UnitBox: An advanced object detection network. In ACM MM, 2016.

[Yu et al., 2020] Yuechen Yu, Yilei Xiong, Weilin Huang, and Matthew R. Scott. Deformable siamese attention networks for visual object tracking. In CVPR, 2020.

[Zhang et al., 2019] Lichao Zhang, Abel Gonzalez-Garcia, Joost van de Weijer, Martin Danelljan, and Fahad Shahbaz Khan. Learning the model update for siamese trackers. In ICCV, 2019.

[Zhang et al., 2020] Zhipeng Zhang, Houwen Peng, Jianlong Fu, Bing Li, and Weiming Hu. Ocean: Object-aware anchor-free tracking. In ECCV, 2020.

[Zhou et al., 2019] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv:1904.07850, 2019.

[Zhou et al., 2020] Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Tracking objects as points. In ECCV, 2020.
