sensors-22-01585
Article
SiamMixer: A Lightweight and Hardware-Friendly Visual
Object-Tracking Network
Li Cheng 1,2 , Xuemin Zheng 1,2 , Mingxin Zhao 1,2 , Runjiang Dou 1, *, Shuangming Yu 1 , Nanjian Wu 1,2,3
and Liyuan Liu 1
Abstract: Siamese networks have been extensively studied in recent years. Most of the previous research focuses on improving accuracy, while only a few works recognize the necessity of reducing parameter redundancy and computation load. Even less work has been done to optimize the runtime memory cost when designing networks, making Siamese-network-based trackers difficult to deploy on edge devices. In this paper, we present SiamMixer, a lightweight and hardware-friendly visual object-tracking network. It uses patch-by-patch inference to reduce memory use in shallow layers, where each small image region is processed individually. It merges and globally encodes feature maps in deep layers to enhance accuracy. Benefiting from these techniques, SiamMixer demonstrates accuracy comparable to other, larger trackers with only 286 kB of parameters and 196 kB of extra memory use for feature maps. Additionally, we verify the impact of various activation functions and replace all activation functions in SiamMixer with ReLU, which reduces the cost of deployment on mobile devices.

Keywords: visual object-tracking; deep features; siamese network; lightweight neural network; edge computing devices

Citation: Cheng, L.; Zheng, X.; Zhao, M.; Dou, R.; Yu, S.; Wu, N.; Liu, L. SiamMixer: A Lightweight and Hardware-Friendly Visual Object-Tracking Network. Sensors 2022, 22, 1585. https://doi.org/10.3390/s22041585

Academic Editors: Yangquan Chen, Subhas Mukhopadhyay, Nunzio Cennamo, M. Jamal Deen, Junseop Lee and Simone Morais
The input of one branch is the initial target image, and that of the other is the search-region image. Siamese network trackers perform target localization according to the similarity between the target and the search region.
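The similarity search at the core of a Siamese tracker can be sketched as a cross-correlation of the template feature map over the search feature map. Below is a minimal NumPy sketch under that assumption; the real tracker correlates learned deep features rather than raw values:

```python
import numpy as np

def xcorr(template, search):
    """Slide the template feature map over the search feature map and
    record a similarity score at every offset (valid cross-correlation).
    template: (C, th, tw), search: (C, sh, sw) -> (sh-th+1, sw-tw+1)."""
    C, th, tw = template.shape
    _, sh, sw = search.shape
    out = np.empty((sh - th + 1, sw - tw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(template * search[:, y:y + th, x:x + tw])
    return out

# the peak of the response map gives the most likely target location
rng = np.random.default_rng(0)
t = rng.standard_normal((2, 3, 3))
s = np.zeros((2, 8, 8))
s[:, 2:5, 4:7] = t                    # embed the target at offset (2, 4)
resp = xcorr(t, s)
dy, dx = np.unravel_index(np.argmax(resp), resp.shape)
```

The response peak lands at the offset where the template and search content align, which is exactly the localization principle used by the tracker.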
The Siamese network tracker eliminates the need for complex descriptor design; it is trained on large amounts of labeled data and learns to distinguish targets from the background. Thanks to the learning and generalization ability of neural networks, the Siamese network tracker can track targets that do not appear in the training set. It thus lowers the design barrier of a general-purpose target tracker while guaranteeing tracking performance.
Benefiting from the compact design, promising generalization ability, and powerful
performance, Siamese network trackers have been a popular research topic in recent years.
Related research can be divided into three main directions: bounding-box prediction, robust backbone network design, and online learning strategies. More specifically, SiamRPN++ [2] and OCEAN [3] use anchor-based or anchor-free methods to generate a precise bounding box. SiamFC++ [4] uses GoogLeNet [5] instead of AlexNet [6] as the backbone network and demonstrates the impact of the backbone on Siamese network trackers. ATOM [7] and DIMP [8] use online template updates in the Siamese network, achieving state-of-the-art performance.
Although these methods can improve tracking accuracy and robustness, they ignore computational overhead and memory footprint, thereby limiting their application on mobile devices. Ideally, if the network parameters and intermediate feature maps fit in the processor cache without data exchange with DDR memory, energy efficiency would increase substantially.
The backbone network contributes directly to the performance of the Siamese network.
Designing efficient and lightweight neural networks for mobile devices has attracted much
attention in the past few years. SqueezeNet [9] was one of the first networks to optimize network size, proposing to reduce it using downsampling and 1 × 1 convolutional kernels. MobileNetV1 [10] and MobileNetV2 [11] introduced a new block that uses depthwise-separable convolutions as an alternative to standard spatial convolutions, further reducing the number of parameters while improving accuracy.
We propose to build lightweight target-tracking algorithms by constructing lightweight
backbone networks. We start from the best practice and build the lightweight network
with the basic block of MobileNetV2. Unlike other lightweight networks, we pay extra
attention to the runtime memory of the network and the impact of the activation function
on the network performance. We manually design the network structure and demonstrate
its merits in building lightweight tracking models.
The main contributions of this paper are summarized below:
1. We propose a novel lightweight and hardware-friendly visual object-tracking model
based on the Siamese tracking scheme, namely SiamMixer.
2. We design a compact backbone consisting of patch-based convolutions and mixer
modules. The patch-based convolution reduces feature map memory use by pro-
cessing each image patch individually. The mixer module enhances the accuracy by
merging and encoding global information of feature maps.
3. We verify the impact of the activation function on tracking accuracy and use ReLU as a satisfactory alternative to exponential-based functions, which is favorable for Single-Instruction Multiple-Data (SIMD) operations.
Extensive experimental results demonstrate that the proposed method has comparable
performance with many off-the-shelf Siamese networks, while the memory footprint is
significantly lower.
The structure of this paper is as follows: Section 2 reviews the Siamese-network-
based trackers most relevant to our approach and the common approaches for building
lightweight neural networks. Sections 3.1–3.3 describe the major components of the proposed network, including the convolutional layers for feature extraction, the
mixer module for global encoding of the feature map, and the cross-correlation for target
localization. The training setup and the loss functions design are described in Section 3.4.
Sensors 2022, 22, 1585 3 of 15
Section 3.5 introduces the datasets and evaluation metrics we used. Section 4.1 intro-
duces our experimental results and compares them with the state-of-the-art algorithms.
In Section 4.2, we analyze the storage overhead of SiamMixer for weights and feature maps.
Section 5 concludes the paper.
2. Related Work
In this section, we review visual trackers based on the Siamese network and popular methods for building lightweight networks, illustrating how our work differs from prior work.
3. Proposed Algorithm
We propose to build a lightweight target-tracking algorithm, namely SiamMixer, by constructing a lightweight backbone network. The network can be divided into two parts: the backbone network for extracting image features and the correlation computation for object searching and locating. The diagram of the proposed tracker is shown in Figure 1.
[Figure 1 shows the two Siamese branches: a 128×128 template image and a 256×256 search image are each processed per-patch, per-layer by a stack of MobileNetV2 blocks (MB4 5×5 and MB6 5×5) followed by mixer modules, and the branch outputs are correlated to produce the response map.]
Figure 1. Diagram of SiamMixer network structure. The MobileNetV2 block is denoted as MB
{expansion ratio} {kernel size}.
The input image has a dimension of C_input × W_input × H_input, where C_input denotes the number of image channels, W_input denotes the image width, and H_input denotes the image height.
We apply an n × n depth-wise convolutional layer followed by a pointwise (1 × 1)
convolutional layer to conduct structural encoding. To preserve the simplicity of the
network structure, the MobileNetV2 blocks used for structural encoding are implemented
with the same kernel size. The architecture of the backbone network is shown in Table 1.
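As an illustration of this structural encoding, here is a minimal NumPy sketch of a depthwise convolution followed by a pointwise (1 × 1) convolution; the shapes and all-ones kernels are illustrative, not the paper's actual configuration:

```python
import numpy as np

def depthwise_conv(x, k):
    """x: (C, H, W); k: (C, n, n), one spatial filter per channel (valid padding)."""
    C, H, W = x.shape
    n = k.shape[1]
    out = np.empty((C, H - n + 1, W - n + 1))
    for c in range(C):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[c, i, j] = np.sum(x[c, i:i + n, j:j + n] * k[c])
    return out

def pointwise_conv(x, w):
    """w: (C_out, C_in); a 1x1 convolution that mixes channels at each position."""
    return np.tensordot(w, x, axes=([1], [0]))

# depthwise encodes spatial structure per channel, pointwise mixes channels
y = pointwise_conv(depthwise_conv(np.ones((4, 8, 8)), np.ones((4, 3, 3))),
                   np.ones((8, 4)))
```

Splitting the spatial and channel steps this way is what makes the MobileNetV2 block much cheaper than a full spatial convolution with the same kernel size.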
Table 1. Architecture of backbone network. The patch-based inference layer is annotated with *.
To reduce the runtime memory cost, we conduct the convolutional layer in a patch-
by-patch order. During convolutional layer inference, one small image patch is processed
at a time. Once the small image patches are processed, the memory space they occupy is
freed so that the peak memory cost can be reduced. The main drawback of this method is that it is spatially constrained and unable to encode the global information of the input image. Lin [22] proposes receptive-field redistribution via neural architecture search (NAS), solving the problem of constrained receptive fields caused by patch-based inference. However, this requires an additional hyperparameter optimization in an already substantial search space, which incurs a considerable search cost.
Therefore, we propose to use the mixer module to globally encode the convolutional
feature maps. A patch-based inference example is shown in Figure 2.
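The patch-based idea can be illustrated with a single-channel valid convolution: if each patch carries a halo of kernel_size − 1 overlapping pixels, the stitched per-patch results are identical to the full convolution, while only one patch needs to be resident at a time. A NumPy sketch under these assumptions:

```python
import numpy as np

def conv2d_valid(x, k):
    """Single-channel valid convolution (cross-correlation convention)."""
    n = k.shape[0]
    H, W = x.shape
    out = np.empty((H - n + 1, W - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + n, j:j + n] * k)
    return out

def conv2d_patchwise(x, k, split):
    """Process the input as two vertical patches. Each patch includes a
    (kernel_size - 1)-pixel halo so the stitched result matches the full
    convolution; the two patches never need to be in memory simultaneously."""
    n = k.shape[0]
    left = conv2d_valid(x[:, :split + n - 1], k)
    right = conv2d_valid(x[:, split:], k)
    return np.concatenate([left, right], axis=1)

rng = np.random.default_rng(1)
x, k = rng.standard_normal((8, 10)), rng.standard_normal((3, 3))
```

The equality of the stitched and full results shows why patch-based inference changes only the peak memory cost, not the output, as long as the receptive field stays inside a patch.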
[Figure 2 illustrates patch-based inference: two input patches (Patch1, Patch2) are convolved separately (Conv1, Conv2) and the results are merged into a single feature map.]
Figure 3. Diagram of the modified Mixer layer. Each Mixer layer contains two MLPs, one called the token-mixing MLP and the other the channel-mixing MLP. Both token mixing and channel mixing use residual connections to ensure that a deep mixer network can be trained to converge. The input to the Mixer layer is a series of image patches that have been flattened into vectors. For each image patch, the vector dimension is D = C × W_P × H_P, where C is the number of channels of the patch, W_P is the width of the patch, and H_P is the height of the patch. The BN in the figure denotes BatchNorm (the original Mixer uses LayerNorm).
We combine patch-based inference with the Mixer layer to free the network from the restricted receptive field. According to our experimental results, the combination of
patch-based convolution and the Mixer layer significantly improves the accuracy of the
network. To simplify the computational process, we modify the basic module of MLP-Mixer
as follows:
1. Replace GELU activation function with ReLU activation function.
2. Replace LayerNorm with BatchNorm.
3. Use Conv1d for channel mixing.
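A minimal NumPy sketch of a Mixer layer with these changes applied (normalization is omitted for brevity, and the weight shapes are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mixer_layer(X, Wt1, Wt2, Wc1, Wc2):
    """X: (P patches, D channels). The token-mixing MLP runs across the
    patch axis and the channel-mixing MLP across the channel axis; both
    use residual connections, and ReLU replaces GELU. BatchNorm is
    omitted here to keep the sketch short."""
    # token mixing: Wt1 (h, P) and Wt2 (P, h) act along the patch dimension
    X = X + Wt2 @ relu(Wt1 @ X)
    # channel mixing: Wc1 (D, h) and Wc2 (h, D) act along the channel
    # dimension; since the weights are shared over all patches, this is
    # exactly a kernel-size-1 Conv1d and needs no transposition
    return X + relu(X @ Wc1) @ Wc2

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 12))
Y = mixer_layer(X, rng.standard_normal((7, 5)), rng.standard_normal((5, 7)),
                rng.standard_normal((12, 9)), rng.standard_normal((9, 12)))
```

The layer preserves the (patches, channels) shape, so several such layers can be stacked directly.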
Exponential arithmetic in mobile devices is usually expensive. To reduce network
deployment costs, we use ReLU to replace the activation function that involves exponential
operations. LayerNorms are commonly used for normalization in RNN networks because
the input to RNN networks usually varies with the length of the sequence. In addition, due
to the large size of the RNN network, it is not practical to use large batch size training to
reduce internal covariate shift. However, LayerNorms require hidden layer statistics during
both training and inference, which can slow down the inference of the networks. Since the
mixer module we use has fixed dimensional inputs and the network size is small enough
to use a large batch size for training, we think it is reasonable to replace the LayerNorms
with BatchNorms. The original mixer network uses feature-map transposition and a fully connected layer to implement the channel-mixing operation. However, the transposition introduces unnecessary memory accesses and brings no computational benefit. Therefore, we use a one-dimensional convolution to implement the channel-mixing process equivalently.
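The equivalence is easy to check numerically: a fully connected layer applied after transposition and a kernel-size-1 one-dimensional convolution compute the same channel mixing. A NumPy sketch with arbitrary shapes:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 10))   # feature map: (channels C, positions L)
W = rng.standard_normal((8, 6))    # mixing weights: (C_out, C_in)

# "transpose + fully connected": move channels last, apply the FC, move back
fc = ((X.T) @ W.T).T

# kernel-size-1 Conv1d over the channel axis: out[o, l] = sum_i W[o, i] * X[i, l]
conv1x1 = np.einsum('oi,il->ol', W, X)
```

Since the two results are identical, the Conv1d formulation can be preferred purely for its memory-access pattern.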
These architectural adjustments make the network easier to deploy on mobile devices, and experimental data show that their impact on network accuracy is acceptable.
response. To find the location of the target in the new image, we search pixel by pixel for
the candidate location most similar to the template image. To obtain the correct scale of
the target, the search area is resized into multiple scales, and the scale with the highest
classification score is chosen to be the final scale. Although no high-level modeling of the
target is performed, it provides a reliable and straightforward method for target localization,
which is beneficial for our evaluation of the backbone network as well as for deployment
on edge computing devices.
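The multi-scale search described above can be sketched as follows; score_fn stands in for the Siamese network, and the nearest-neighbour resize and default scale set are illustrative assumptions rather than the paper's exact values:

```python
import numpy as np

def nn_resize(img, s):
    """Nearest-neighbour resize of a 2-D array by scale factor s (illustrative)."""
    H, W = img.shape
    h, w = max(1, round(H * s)), max(1, round(W * s))
    return img[np.ix_(np.arange(h) * H // h, np.arange(w) * W // w)]

def multiscale_search(score_fn, search_img, scales=(0.96, 1.0, 1.04)):
    """Resize the search region to each scale, score it with the Siamese
    network (score_fn stands in for the real network), and keep the scale
    whose response map has the highest peak."""
    best_scale, best_resp = None, None
    for s in scales:
        resp = score_fn(nn_resize(search_img, s))
        if best_resp is None or resp.max() > best_resp.max():
            best_scale, best_resp = s, resp
    dy, dx = np.unravel_index(np.argmax(best_resp), best_resp.shape)
    return best_scale, (int(dy), int(dx))
```

The scale with the highest classification score is kept as the final target scale, and the peak of its response map gives the target location.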
L_t(y, v_{p_i}, v_{n_j}) = \frac{1}{MN} \sum_{i}^{M} \sum_{j}^{N} \log\!\left( \frac{e^{-y v_{p_i}}}{e^{-y v_{p_i}} + e^{-y v_{n_j}}} \right)   (3)
where M and N are the numbers of positive and negative samples, and y is the ground-truth label.
The parameters of the network can be obtained by stochastic gradient descent:

\arg\min_{\theta} L_t\big(y, f(z, x; \theta)\big)

where z, x, and θ are the target image, the search-region image, and the parameters of the network, respectively.
Image pairs are obtained from the annotated video dataset. Both images in a pair contain the target, and the class of the object is ignored during training. The dataset is augmented using random horizontal flips, random rotations, and random luminance changes: the probability of a random horizontal flip is 0.5, and random rotations range from −2° to 2°, with the center of rotation at the center of the image. The random luminance variation jitters image brightness by a brightness factor chosen uniformly from [0.7, 1.3].
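For concreteness, the augmentation parameters above can be sampled as follows; sample_augmentation is our illustrative helper, not code from the paper:

```python
import random

def sample_augmentation(rng=random):
    """Sample one set of augmentation parameters: a horizontal flip with
    probability 0.5, a rotation in [-2, 2] degrees about the image center,
    and a brightness factor drawn uniformly from [0.7, 1.3]."""
    return {
        "hflip": rng.random() < 0.5,
        "rotation_deg": rng.uniform(-2.0, 2.0),
        "brightness": rng.uniform(0.7, 1.3),
    }
```

In a PyTorch pipeline these correspond roughly to torchvision's RandomHorizontalFlip(p=0.5), RandomRotation(2), and ColorJitter(brightness=0.3).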
4. Experiment Results
Our tracker is implemented in the PyTorch framework on a computer with an Intel Xeon Silver 4114 CPU and four GeForce GTX 1080 GPUs. The training is performed on the
GOT-10k [28] dataset. We evaluated our method on OTB100 [26] and UAV123 [27] bench-
mark datasets, and selected the state-of-the-art algorithms for a quantitative compari-
son, namely LightTrack [21], SiamFC [12], SiamRPN [13], SiamRPN++ [2], OCEAN [3],
GOTURN [29], MUSTer [30], MEEM [31], STRUCK [32], TLD [33], BACF [34] and KCF [35].
Table 3. Success scores of four structural variants of SiamMixer on OTB100; ↑ denotes that a higher value is better.
Our algorithms can run at more than real-time speeds on common GPU devices while
maintaining a low memory footprint. On the Nvidia Jetson Xavier development board, an
edge computing device, our algorithms can run at quasi-real-time speeds.
As shown in Tables 2 and 3, the increase in the depth of the mixer module brings
limited performance improvement while significantly slowing down the network and
increasing the number of parameters in the network. In addition, overly deep networks
degrade network performance, which is consistent with the phenomenon described in
SiamDW [14]. Therefore, we believe that SiamMixer-XS is the optimal candidate for deployment on edge computing devices, and we focus on SiamMixer-XS in the performance comparison.
We record the success scores of the different network structures on the OTB100 [26] dataset, calculate the information density (accuracy per parameter) [37,38], and compare them with the state-of-the-art models. The comparison results are shown in Table 4.
Information density [37,38] is a metric that can effectively evaluate the efficiency of
using network parameters. We want to make the most of limited storage space for edge-side
deployments, so we introduce this metric in the comparison.
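As a sketch of how such a metric is computed (the per-million-parameters normalisation here is our illustrative choice; see [37,38] for the exact definition):

```python
def information_density(success_score, num_params):
    """Accuracy obtained per unit of model size: success score divided by
    the parameter count in millions (illustrative normalisation)."""
    return success_score / (num_params / 1e6)

# a model scoring 0.60 with 300k parameters is far more parameter-efficient
# than one scoring 0.65 with 30M parameters
```

Under this normalisation, a small drop in raw accuracy can still correspond to a large gain in parameter efficiency, which is what matters for storage-constrained edge deployment.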
As can be seen from the comparison results, our SiamMixer-XS has 6.8× fewer parameters than LightTrack-Mobile [21], the state-of-the-art lightweight network, and 8.16× fewer parameters than SiamFC [12], which has a similar accuracy.
Table 4. Model analysis; ↑ denotes that a higher value is better, ↓ denotes that a lower value is better.