

High-Speed Tracking with Kernelized Correlation Filters

João F. Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista

arXiv:1404.7584v3 [cs.CV] 5 Nov 2014

Abstract—The core component of most modern trackers is a discriminative classifier, tasked with distinguishing between the target and the surrounding environment. To cope with natural image changes, this classifier is typically trained with translated and scaled sample patches. Such sets of samples are riddled with redundancies – any overlapping pixels are constrained to be the same. Based on this simple observation, we propose an analytic model for datasets of thousands of translated patches. By showing that the resulting data matrix is circulant, we can diagonalize it with the Discrete Fourier Transform, reducing both storage and computation by several orders of magnitude. Interestingly, for linear regression our formulation is equivalent to a correlation filter, used by some of the fastest competitive trackers. For kernel regression, however, we derive a new Kernelized Correlation Filter (KCF) that, unlike other kernel algorithms, has the exact same complexity as its linear counterpart. Building on it, we also propose a fast multi-channel extension of linear correlation filters, via a linear kernel, which we call Dual Correlation Filter (DCF). Both KCF and DCF outperform top-ranking trackers such as Struck or TLD on a benchmark of 50 videos, despite running at hundreds of frames-per-second and being implemented in a few lines of code (Algorithm 1). To encourage further developments, our tracking framework was made open-source.

Index Terms—Visual tracking, circulant matrices, discrete Fourier transform, kernel methods, ridge regression, correlation filters.

1 INTRODUCTION

Arguably one of the biggest breakthroughs in recent visual tracking research was the widespread adoption of discriminative learning methods. The task of tracking, a crucial component of many computer vision systems, can be naturally specified as an online learning problem [1], [2]. Given an initial image patch containing the target, the goal is to learn a classifier to discriminate between its appearance and that of the environment. This classifier can be evaluated exhaustively at many locations, in order to detect it in subsequent frames. Of course, each new detection provides a new image patch that can be used to update the model.

It is tempting to focus on characterizing the object of interest – the positive samples for the classifier. However, a core tenet of discriminative methods is to give as much importance, or more, to the relevant environment – the negative samples. The most commonly used negative samples are image patches from different locations and scales, reflecting the prior knowledge that the classifier will be evaluated under those conditions.

An extremely challenging factor is the virtually unlimited amount of negative samples that can be obtained from an image. Due to the time-sensitive nature of tracking, modern trackers walk a fine line between incorporating as many samples as possible and keeping computational demand low. It is common practice to randomly choose only a few samples each frame [3], [4], [5], [6], [7].

Although the reasons for doing so are understandable, we argue that undersampling negatives is the main factor inhibiting performance in tracking. In this paper, we develop tools to analytically incorporate thousands of samples at different relative translations, without iterating over them explicitly. This is made possible by the discovery that, in the Fourier domain, some learning algorithms actually become easier as we add more samples, if we use a specific model for translations.

These analytical tools, namely circulant matrices, provide a useful bridge between popular learning algorithms and classical signal processing. The implication is that we are able to propose a tracker based on Kernel Ridge Regression [8] that does not suffer from the "curse of kernelization", which is its larger asymptotic complexity, and even exhibits lower complexity than unstructured linear regression. Instead, it can be seen as a kernelized version of a linear correlation filter, which forms the basis for the fastest trackers available [9], [10]. We leverage the powerful kernel trick at the same computational complexity as linear correlation filters. Our framework easily incorporates multiple feature channels, and by using a linear kernel we show a fast extension of linear correlation filters to the multi-channel case.

2 RELATED WORK

2.1 On tracking-by-detection

A comprehensive review of tracking-by-detection is outside the scope of this article, but we refer the interested reader to two excellent and very recent surveys [1], [2]. The most popular approach is to use a discriminative appearance model [3], [4], [5], [6]. It consists of training a classifier online, inspired by statistical machine learning methods, to predict the presence or absence of the target in an image patch. This classifier is then tested on many candidate patches to find the most likely location. Alternatively, the position can also be predicted directly [7].

• The authors are with the Institute of Systems and Robotics, University of Coimbra. E-mail: {henriques,ruicaseiro,pedromartins,batista}@isr.uc.pt

Figure 1 (panels labeled: Kernelized Correlation Filter (proposed), TLD, Struck): Qualitative results for the proposed Kernelized Correlation Filter (KCF), compared with the top-performing Struck and TLD. Best viewed on a high-resolution screen. The chosen kernel is Gaussian, on HOG features. These snapshots were taken at the midpoints of the 50 videos of a recent benchmark [11]. Missing trackers are denoted by an "x". KCF outperforms both Struck and TLD, despite its minimal implementation and running at 172 FPS (see Algorithm 1, and Table 1).

Regression with class labels can be seen as classification, so we use the two terms interchangeably.

We will discuss some relevant trackers before focusing on the literature that is more directly related to our analytical methods. Canonical examples of the tracking-by-detection paradigm include those based on Support Vector Machines (SVM) [12], Random Forest classifiers [6], or boosting variants [13], [5]. All the mentioned algorithms had to be adapted for online learning, in order to be useful for tracking. Zhang et al. [3] propose a projection to a fixed random basis, to train a Naive Bayes classifier, inspired by compressive sensing techniques. Aiming to predict the target's location directly, instead of its presence in a given image patch, Hare et al. [7] employed a Structured Output SVM and Gaussian kernels, based on a large number of image features. Examples of non-discriminative trackers include the work of Wu et al. [14], who formulate tracking as a sequence of image alignment objectives, and of Sevilla-Lara and Learned-Miller [15], who propose a strong appearance descriptor based on distribution fields. Another discriminative approach by Kalal et al. [4] uses a set of structural constraints to guide the sampling process of a boosting classifier. Finally, Bolme et al. [9] employ classical signal processing analysis to derive fast correlation filters. We will discuss these last two works in more detail shortly.

2.2 On sample translations and correlation filtering

Recall that our goal is to learn and detect over translated image patches efficiently. Unlike our approach, most attempts so far have focused on trying to weed out irrelevant image patches. On the detection side, it is possible to use branch-and-bound to find the maximum of a classifier's response while avoiding unpromising candidate patches [16]. Unfortunately, in the worst case the algorithm may still have to iterate over all patches. A related method finds the most similar patches of a pair of images efficiently [17], but does not directly translate to our setting. Though it does not preclude an exhaustive search, a notable optimization is to use a fast but inaccurate classifier to select promising patches, and only apply the full, slower classifier on those [18], [19].

On the training side, Kalal et al. [4] propose using structural constraints to select relevant sample patches from each new image. This approach is relatively expensive, limiting the features that can be used, and requires careful tuning of the structural heuristics. A popular and related method, though it is mainly used in offline detector learning, is hard-negative mining [20]. It consists of running an initial detector on a pool of images, and selecting any wrong detections as samples for re-training. Even though both approaches reduce the number of training samples, a major drawback is that the candidate patches have to be considered exhaustively, by running a detector.

The initial motivation for our line of research was the recent success of correlation filters in tracking [9], [10]. Correlation filters have proved to be competitive with far more complicated approaches, but using only a fraction of the computational power, at hundreds of frames-per-second. They take advantage of the fact that the convolution of two patches (loosely, their dot-product at different relative translations) is equivalent to an element-wise product in the Fourier domain. Thus, by formulating their objective in the Fourier domain, they can specify the desired output of a linear classifier for several translations, or image shifts, at once.

A Fourier domain approach can be very efficient, and has several decades of research in signal processing to draw from [21]. Unfortunately, it can also be extremely limiting. We would like to simultaneously leverage more recent advances in computer vision, such as more powerful features, large-margin classifiers or kernel methods [22], [20], [23]. A few studies go in that direction, and attempt to apply kernel methods to correlation filters [24], [25], [26], [27].

In these works, a distinction must be drawn between two types of objective functions: those that do not consider the power spectrum or image translations, such as Synthetic Discriminant Function (SDF) filters [25], [26], and those that do, such as Minimum Average Correlation Energy [28], Optimal Trade-Off [27] and Minimum Output Sum of Squared Error (MOSSE) filters [9]. Since the spatial structure can effectively be ignored, the former are easier to kernelize, and Kernel SDF filters have been proposed [26], [27], [25]. However, lacking a clearer relationship between translated images, non-linear kernels and the Fourier domain, applying the kernel trick to other filters has proven much more difficult [25], [24], with some proposals requiring significantly higher computation times and imposing strong limits on the number of image shifts that can be considered [24].

For us, this hinted that a deeper connection between translated image patches and training algorithms was needed, in order to overcome the limitations of direct Fourier domain formulations.

2.3 Subsequent work

Since the initial version of this work [29], an interesting time-domain variant of the proposed cyclic shift model has been used very successfully for video event retrieval [30]. Generalizations of linear correlation filters to multiple channels have also been proposed [31], [32], [33], some of which build on our initial work. This allows them to leverage more modern features (e.g. Histogram of Oriented Gradients – HOG). A generalization to other linear algorithms, such as Support Vector Regression, was also proposed [31]. We must point out that all of these works target off-line training, and thus rely on slower solvers [31], [32], [33]. In contrast, we focus on fast element-wise operations, which are more suitable for real-time tracking, even with the kernel trick.

3 CONTRIBUTIONS

A preliminary version of this work was presented earlier [29]. It demonstrated, for the first time, the connection between Ridge Regression with cyclically shifted samples and classical correlation filters. This enabled fast learning with O(n log n) Fast Fourier Transforms instead of expensive matrix algebra. The first Kernelized Correlation Filter was also proposed, though limited to a single channel. Additionally, it proposed closed-form solutions to compute kernels at all cyclic shifts. These carried the same O(n log n) computational cost, and were derived for radial basis and dot-product kernels.

The present work adds to the initial version in significant ways. All the original results were re-derived using a much simpler diagonalization technique (Sections 4-6). We extend the original work to deal with multiple channels, which allows the use of state-of-the-art features that give an important boost to performance (Section 7). Considerable new analysis and intuitive explanations are added to the initial results. We also extend the original experiments from 12 to 50 videos, and add a new variant of the Kernelized Correlation Filter (KCF) tracker based on Histogram of Oriented Gradients (HOG) features instead of raw pixels. Via a linear kernel, we additionally propose a linear multi-channel filter with very low computational complexity, that almost matches the performance of non-linear kernels. We name it Dual Correlation Filter (DCF), and show how it is related to a set of recent, more expensive multi-channel filters [31]. Experimentally, we demonstrate that the KCF already performs better than a linear filter, without any feature extraction. With HOG features, both the linear DCF and non-linear KCF outperform by a large margin top-ranking trackers, such as Struck [7] or Track-Learn-Detect (TLD) [4], while comfortably running at hundreds of frames-per-second.

4 BUILDING BLOCKS

In this section, we propose an analytical model for image patches extracted at different translations, and work out the impact on a linear regression algorithm. We will show a natural underlying connection to classical correlation filters. The tools we develop will allow us to study more complicated algorithms in Sections 5-7.

4.1 Linear regression

We will focus on Ridge Regression, since it admits a simple closed-form solution, and can achieve performance that is close to more sophisticated methods, such as Support Vector Machines [8]. The goal of training is to find a function f(z) = w^T z that minimizes the squared error over samples x_i and their regression targets y_i,

\min_w \sum_i \left( f(x_i) - y_i \right)^2 + \lambda \|w\|^2.   (1)

The λ is a regularization parameter that controls overfitting, as in the SVM. As mentioned earlier, the minimizer has a closed form, which is given by [8]

w = \left( X^T X + \lambda I \right)^{-1} X^T y,   (2)

where the data matrix X has one sample per row, x_i, and each element of y is a regression target, y_i. I is an identity matrix.

Starting in Section 4.4, we will have to work in the Fourier domain, where quantities are usually complex-valued. They are not harder to deal with, as long as we use the complex version of Eq. 2 instead,

w = \left( X^H X + \lambda I \right)^{-1} X^H y,   (3)

where X^H is the Hermitian transpose, i.e., X^H = (X^*)^T, and X^* is the complex-conjugate of X. For real numbers, Eq. 3 reduces to Eq. 2.

In general, a large system of linear equations must be solved to compute the solution, which can become prohibitive in a real-time setting. Over the next paragraphs we will see a special case of x_i that bypasses this limitation.
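To make this cost concrete, here is what the general solution looks like in code. This is our own illustrative sketch (the variable names are not from the paper's implementation): Eq. 3 on an unstructured dataset, where the linear solve is the cubic-cost step that the following sections eliminate.

% Direct Ridge Regression (Eq. 3) on unstructured data: one sample per row.
n = 256;                              % number of samples (= features, here)
lambda = 1e-4;                        % regularization, as in Eq. 1
X = randn(n, n);                      % dense data matrix, no special structure
y = randn(n, 1);                      % regression targets

% Closed-form minimizer w = (X' X + lambda I)^-1 X' y.
% The backslash solve costs O(n^3), prohibitive in a real-time setting.
w = (X' * X + lambda * eye(n)) \ (X' * y);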

Figure 2 (panels show a base sample and its vertical cyclic shifts by +30, +15, −15 and −30 pixels): Examples of vertical cyclic shifts of a base sample. Our Fourier domain formulation allows us to train a tracker with all possible cyclic shifts of a base sample, both vertical and horizontal, without iterating them explicitly. Artifacts from the wrapped-around edges can be seen (top of the left-most image), but are mitigated by the cosine window and padding.

Figure 3: Illustration of a circulant matrix. The rows are cyclic shifts of a vector image, or its translations in 1D. The same properties carry over to circulant matrices containing 2D images.

4.2 Cyclic shifts

For notational simplicity, we will focus on single-channel, one-dimensional signals. These results generalize to multi-channel, two-dimensional images in a straightforward way (Section 7).

Consider an n × 1 vector representing a patch with the object of interest, denoted x. We will refer to it as the base sample. Our goal is to train a classifier with both the base sample (a positive example) and several virtual samples obtained by translating it (which serve as negative examples). We can model one-dimensional translations of this vector by a cyclic shift operator, which is the permutation matrix

P = \begin{bmatrix}
0 & 0 & 0 & \cdots & 1 \\
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix}.   (4)

The product P x = [x_n, x_1, x_2, \ldots, x_{n-1}]^T shifts x by one element, modeling a small translation. We can chain u shifts to achieve a larger translation by using the matrix power P^u x. A negative u will shift in the reverse direction. A 1D signal translated horizontally with this model is illustrated in Fig. 3, and an example for a 2D image is shown in Fig. 2.

The attentive reader will notice that the last element wraps around, inducing some distortion relative to a true translation. However, this undesirable property can be mitigated by appropriate padding and windowing (Section A.1). The fact that a large percentage of the elements of a signal are still modeled correctly, even for relatively large translations (see Fig. 2), explains the observation that cyclic shifts work well in practice.

Due to the cyclic property, we get the same signal x periodically every n shifts. This means that the full set of shifted signals is obtained with

\{ P^u x \mid u = 0, \ldots, n-1 \}.   (5)

Again due to the cyclic property, we can equivalently view the first half of this set as shifts in the positive direction, and the second half as shifts in the negative direction.

4.3 Circulant matrices

To compute a regression with shifted samples, we can use the set of Eq. 5 as the rows of a data matrix X:

X = C(x) = \begin{bmatrix}
x_1 & x_2 & x_3 & \cdots & x_n \\
x_n & x_1 & x_2 & \cdots & x_{n-1} \\
x_{n-1} & x_n & x_1 & \cdots & x_{n-2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_2 & x_3 & x_4 & \cdots & x_1
\end{bmatrix}.   (6)

An illustration of the resulting pattern is given in Fig. 3. What we have just arrived at is a circulant matrix, which has several intriguing properties [34], [35]. Notice that the pattern is deterministic, and fully specified by the generating vector x, which is the first row.

What is perhaps most amazing and useful is the fact that all circulant matrices are made diagonal by the Discrete Fourier Transform (DFT), regardless of the generating vector x [34]. This can be expressed as

X = F \, \mathrm{diag}(\hat{x}) \, F^H,   (7)

where F is a constant matrix that does not depend on x, and x̂ denotes the DFT of the generating vector, x̂ = \mathcal{F}(x). From now on, we will always use a hat ^ as shorthand for the DFT of a vector.

The constant matrix F is known as the DFT matrix, and is the unique matrix that computes the DFT of any input vector, as \mathcal{F}(z) = \sqrt{n} F z. This is possible because the DFT is a linear operation.

Eq. 7 expresses the eigendecomposition of a general circulant matrix. The shared, deterministic eigenvectors F lie at the root of many uncommon features, such as commutativity or closed-form inversion.

4.4 Putting it all together

We can now apply this new knowledge to simplify the linear regression in Eq. 3, when the training data is composed of cyclic shifts. Being able to work solely with diagonal matrices is very appealing, because all operations can be done element-wise on their diagonal elements.
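As an aside, Eq. 7 is easy to verify numerically before we rely on it. The following sketch is our own (the names are hypothetical, not from the paper's toolbox): it builds C(x) with Matlab's circshift and checks that the unitary DFT matrix diagonalizes it, with eigenvalues given by the DFT of the generating vector.

% Numerical check of Eq. 7: circulant matrices are diagonalized by the DFT.
n = 8;
x = randn(n, 1);                      % base sample (generating vector)

% Build the circulant data matrix of Eq. 6: row u+1 is the cyclic shift P^u x.
X = zeros(n, n);
for u = 0:n-1
    X(u+1, :) = circshift(x, u).';    % wrapped-around translation by u
end

F = fft(eye(n)) / sqrt(n);            % unitary DFT matrix (F(z) = sqrt(n) F z)
xf = fft(x);                          % DFT of the generating vector
disp(norm(X - F * diag(xf) * F', 'fro'))   % zero, up to rounding error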

Take the term X^H X, which can be seen as a non-centered covariance matrix. Replacing Eq. 7 in it,

X^H X = F \, \mathrm{diag}(\hat{x}^*) \, F^H F \, \mathrm{diag}(\hat{x}) \, F^H.   (8)

Since diagonal matrices are symmetric, taking the Hermitian transpose only leaves behind a complex-conjugate, x̂*. Additionally, we can eliminate the factor F^H F = I. This factor equals the identity due to the unitarity of F, and can be canceled out in many expressions. We are left with

X^H X = F \, \mathrm{diag}(\hat{x}^*) \, \mathrm{diag}(\hat{x}) \, F^H.   (9)

Because operations on diagonal matrices are element-wise, we can define the element-wise product as ⊙ and obtain

X^H X = F \, \mathrm{diag}(\hat{x}^* \odot \hat{x}) \, F^H.   (10)

An interesting aspect is that the vector in brackets is known as the auto-correlation of the signal x (in the Fourier domain, also known as the power spectrum [21]). In classical signal processing, it contains the variance of a time-varying process for different time lags, or in our case, space.

The above steps summarize the general approach taken in diagonalizing expressions with circulant matrices. Applying them recursively to the full expression for linear regression (Eq. 3), we can put most quantities inside the diagonal,

\hat{w} = \mathrm{diag}\left( \frac{\hat{x}^*}{\hat{x}^* \odot \hat{x} + \lambda} \right) \hat{y},   (11)

or better yet,

\hat{w} = \frac{\hat{x}^* \odot \hat{y}}{\hat{x}^* \odot \hat{x} + \lambda}.   (12)

The fraction denotes element-wise division. We can easily recover w in the spatial domain with the Inverse DFT, which has the same cost as a forward DFT. The detailed steps of the recursive diagonalization that yields Eq. 12 are given in Appendix A.5.

At this point we just found an unexpected formula from classical signal processing – the solution is a regularized correlation filter [9], [21].

Before exploring this relation further, we must highlight the computational efficiency of Eq. 12, compared to the prevalent method of extracting patches explicitly and solving a general regression problem. For example, Ridge Regression has a cost of O(n³), bound by the matrix inversion and products¹. On the other hand, all operations in Eq. 12 are element-wise (O(n)), except for the DFT, which bounds the cost at a nearly-linear O(n log n). For typical data sizes, this reduces storage and computation by several orders of magnitude.

¹ We remark that the complexity of training algorithms is usually reported in terms of the number of samples n, disregarding the number of features m. Since in our case m = n (X is square), we conflate the two quantities. For comparison, the fastest SVM solvers have "linear" complexity in the samples, O(mn), but under the same conditions m = n they would actually exhibit quadratic complexity, O(n²).

4.5 Relationship to correlation filters

Correlation filters have been a part of signal processing since the 80's, with solutions to a myriad of objective functions in the Fourier domain [21], [28]. Recently, they made a reappearance as MOSSE filters [9], which have shown remarkable performance in tracking, despite their simplicity and high FPS rate.

The solution to these filters looks like Eq. 12 (see Appendix A.2), but with two crucial differences. First, MOSSE filters are derived from an objective function specifically formulated in the Fourier domain. Second, the λ regularizer is added in an ad-hoc way, to avoid division-by-zero. The derivation we showed above adds considerable insight, by specifying the starting point as Ridge Regression with cyclic shifts, and arriving at the same solution.

Circulant matrices allow us to enrich the toolset put forward by classical signal processing and modern correlation filters, and apply the Fourier trick to new algorithms. Over the next section we will see one such instance, in training non-linear filters.

5 NON-LINEAR REGRESSION

One way to allow more powerful, non-linear regression functions f(z) is with the "kernel trick" [23]. The most attractive quality is that the optimization problem is still linear, albeit in a different set of variables (the dual space). On the downside, evaluating f(z) typically grows in complexity with the number of samples.

Using our new analysis tools, however, we will show that it is possible to overcome this limitation, and obtain non-linear filters that are as fast as linear correlation filters, both to train and evaluate.

5.1 Kernel trick – brief overview

This section will briefly review the kernel trick, and define the relevant notation.

Mapping the inputs of a linear problem to a non-linear feature-space ϕ(x) with the kernel trick consists of:

1) Expressing the solution w as a linear combination of the samples:

w = \sum_i \alpha_i \, \varphi(x_i)   (13)

The variables under optimization are thus α, instead of w. This alternative representation α is said to be in the dual space, as opposed to the primal space w (Representer Theorem [23, p. 89]).

2) Writing the algorithm in terms of dot-products ϕ^T(x) ϕ(x') = κ(x, x'), which are computed using the kernel function κ (e.g., Gaussian or Polynomial).

The dot-products between all pairs of samples are usually stored in an n × n kernel matrix K, with elements

K_{ij} = \kappa(x_i, x_j).   (14)

The power of the kernel trick comes from the implicit use of a high-dimensional feature space ϕ(x), without ever instantiating a vector in that space. Unfortunately, this is also its greatest weakness, since the regression function's complexity grows with the number of samples,

f(z) = w^T z = \sum_{i=1}^{n} \alpha_i \, \kappa(z, x_i).   (15)

In the coming sections we will show how most drawbacks of the kernel trick can be avoided, assuming circulant data.
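As a point of reference for the kernel sections that follow, the linear solution of Eq. 12 already fits in a few lines of Matlab. The sketch below is our own (hypothetical names, 1D signals): it trains on all n cyclic shifts of a base signal and then locates a translated copy of it, mirroring the structure that Algorithm 1 (Section 8) uses in 2D. Compared with the direct solve sketched after Section 4.1, only FFTs and element-wise operations remain.

% Linear correlation filter via Eq. 12, plus detection on a shifted copy.
n = 64;
x = randn(n, 1);                          % base sample
true_shift = 10;
z = circshift(x, true_shift);             % test signal: x translated by 10

% Regression target: wrapped Gaussian, peak at zero shift (see Section A.1).
u = (0:n-1)';  u = min(u, n - u);         % cyclic distance to the first element
y = exp(-u.^2 / (2 * 2^2));               % spatial bandwidth 2, arbitrary here

% Training (Eq. 12): two FFTs and element-wise operations only.
lambda = 1e-4;
xf = fft(x);
wf = (conj(xf) .* fft(y)) ./ (conj(xf) .* xf + lambda);

% Detection: the filter response approximates y translated by the true shift.
response = real(ifft(fft(z) .* wf));
[~, peak] = max(response);                % peak == true_shift + 1 (1-based)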

5.2 Fast kernel regression

The solution to the kernelized version of Ridge Regression is given by [8]

\alpha = (K + \lambda I)^{-1} y,   (16)

where K is the kernel matrix and α is the vector of coefficients α_i that represent the solution in the dual space.

Now, if we can prove that K is circulant for datasets of cyclic shifts, we can diagonalize Eq. 16 and obtain a fast solution as in the linear case. This would seem to be intuitively true, but does not hold in general. The arbitrary non-linear mapping ϕ(x) gives us no guarantee of preserving any sort of structure. However, we can impose one condition that will allow K to be circulant. It turns out to be fairly broad, and to apply to most useful kernels.

Theorem 1. Given circulant data C(x), the corresponding kernel matrix K is circulant if the kernel function satisfies κ(x, x') = κ(Mx, Mx'), for any permutation matrix M.

For a proof, see Appendix A.2. What this means is that, for a kernel to preserve the circulant structure, it must treat all dimensions of the data equally. Fortunately, this includes most useful kernels.

Example 2. The following kernels satisfy Theorem 1:

• Radial Basis Function kernels – e.g., Gaussian.
• Dot-product kernels – e.g., linear, polynomial.
• Additive kernels – e.g., intersection, χ² and Hellinger kernels [36].
• Exponentiated additive kernels.

Checking this fact is easy, since reordering the dimensions of x and x' simultaneously does not change κ(x, x') for these kernels. This applies to any kernel that combines dimensions through a commutative operation, such as sum, product, min and max.

Knowing which kernels we can use to make K circulant, it is possible to diagonalize Eq. 16 as in the linear case, obtaining

\hat{\alpha} = \frac{\hat{y}}{\hat{k}^{xx} + \lambda},   (17)

where k^{xx} is the first row of the kernel matrix K = C(k^{xx}), and again a hat ^ denotes the DFT of a vector. A detailed derivation is in Appendix A.3.

To better understand the role of k^{xx}, we found it useful to define a more general kernel correlation. The kernel correlation of two arbitrary vectors, x and x', is the vector k^{xx'} with elements

k_i^{xx'} = \kappa(x', P^{i-1} x).   (18)

In words, it contains the kernel evaluated for different relative shifts of the two arguments. Then k̂^{xx} is the kernel correlation of x with itself, in the Fourier domain. We can refer to it as the kernel auto-correlation, in analogy with the linear case.

This analogy can be taken further. Since a kernel is equivalent to a dot-product in a high-dimensional space ϕ(·), another way to view Eq. 18 is

k_i^{xx'} = \varphi^T(x') \, \varphi(P^{i-1} x),   (19)

which is the cross-correlation of x and x' in the high-dimensional space ϕ(·).

Notice how we only need to compute and operate on the kernel auto-correlation, an n × 1 vector, which grows linearly with the number of samples. This is contrary to the conventional wisdom on kernel methods, which requires computing an n × n kernel matrix, scaling quadratically with the samples. Our knowledge of the exact structure of K allowed us to do better than a generic algorithm.

Finding the optimal α is not the only problem that can be accelerated, due to the ubiquity of translated patches in a tracking-by-detection setting. Over the next paragraphs we will investigate the effect of the cyclic shift model on the detection phase, and even in computing kernel correlations.

5.3 Fast detection

It is rarely the case that we want to evaluate the regression function f(z) for one image patch in isolation. To detect the object of interest, we typically wish to evaluate f(z) on several image locations, i.e., for several candidate patches. These patches can be modeled by cyclic shifts.

Denote by K^z the (asymmetric) kernel matrix between all training samples and all candidate patches. Since the samples and patches are cyclic shifts of the base sample x and the base patch z, respectively, each element of K^z is given by κ(P^{i-1} z, P^{j-1} x). It is easy to verify that this kernel matrix satisfies Theorem 1, and is circulant for appropriate kernels.

Similarly to Section 5.2, we only need the first row to define the kernel matrix:

K^z = C(k^{xz}),   (20)

where k^{xz} is the kernel correlation of x and z, as defined before.

From Eq. 15, we can compute the regression function for all candidate patches with

f(z) = (K^z)^T \alpha.   (21)

Notice that f(z) is a vector, containing the output for all cyclic shifts of z, i.e., the full detection response. To compute Eq. 21 efficiently, we diagonalize it to obtain

\hat{f}(z) = \hat{k}^{xz} \odot \hat{\alpha}.   (22)

Intuitively, evaluating f(z) at all locations can be seen as a spatial filtering operation over the kernel values k^{xz}. Each f(z) is a linear combination of the neighboring kernel values from k^{xz}, weighted by the learned coefficients α. Since this is a filtering operation, it can be formulated more efficiently in the Fourier domain.

6 FAST KERNEL CORRELATION

Even though we have found faster algorithms for training and detection, they still rely on computing one kernel correlation each (k^{xx} and k^{xz}, respectively). Recall that kernel correlation consists of computing the kernel for all relative shifts of two input vectors. This represents the last standing computational bottleneck, as a naive evaluation of n kernels for signals of size n will have quadratic complexity. However, using the cyclic shift model will allow us to efficiently exploit the redundancies in this expensive computation.

6.1 Dot-product and polynomial kernels

Dot-product kernels have the form κ(x, x') = g(x^T x'), for some function g. Then, k^{xx'} has elements

k_i^{xx'} = \kappa(x', P^{i-1} x) = g\left( x'^T P^{i-1} x \right).   (23)

Let g also work element-wise on any input vector. This way we can write Eq. 23 in vector form,

k^{xx'} = g\left( C(x) \, x' \right).   (24)

This makes it an easy target for diagonalization, yielding

k^{xx'} = g\left( \mathcal{F}^{-1}(\hat{x}^* \odot \hat{x}') \right),   (25)

where \mathcal{F}^{-1} denotes the Inverse DFT.

In particular, for a polynomial kernel κ(x, x') = (x^T x' + a)^b,

k^{xx'} = \left( \mathcal{F}^{-1}(\hat{x}^* \odot \hat{x}') + a \right)^b.   (26)

Then, computing the kernel correlation for these particular kernels can be done using only a few DFT/IDFT and element-wise operations, in O(n log n) time.

6.2 Radial Basis Function and Gaussian kernels

RBF kernels have the form κ(x, x') = h(‖x − x'‖²), for some function h. The elements of k^{xx'} are

k_i^{xx'} = \kappa(x', P^{i-1} x) = h\left( \|x' - P^{i-1} x\|^2 \right).   (27)

We will show (Eq. 29) that this is actually a special case of a dot-product kernel. We only have to expand the norm,

k_i^{xx'} = h\left( \|x\|^2 + \|x'\|^2 - 2 \, x'^T P^{i-1} x \right).   (28)

The permutation P^{i-1} does not affect the norm of x, due to Parseval's Theorem [21]. Since ‖x‖² and ‖x'‖² are constant w.r.t. i, Eq. 28 has the same form as a dot-product kernel (Eq. 23). Leveraging the result from the previous section,

k^{xx'} = h\left( \|x\|^2 + \|x'\|^2 - 2 \, \mathcal{F}^{-1}(\hat{x}^* \odot \hat{x}') \right).   (29)

As a particularly useful special case, for a Gaussian kernel κ(x, x') = exp(−(1/σ²) ‖x − x'‖²) we get

k^{xx'} = \exp\left( -\frac{1}{\sigma^2} \left( \|x\|^2 + \|x'\|^2 - 2 \, \mathcal{F}^{-1}(\hat{x}^* \odot \hat{x}') \right) \right).   (30)

As before, we can compute the full kernel correlation in only O(n log n) time.

6.3 Other kernels

The approach from the preceding two sections depends on the kernel value being unchanged by unitary transformations, such as the DFT. This does not hold in general for other kernels, e.g. intersection kernel. We can still use the fast training and detection results (Sections 5.2 and 5.3), but kernel correlation must be evaluated by a more expensive sliding window method.

7 MULTIPLE CHANNELS

In this section, we will see that working in the dual has the advantage of allowing multiple channels (such as the orientation bins of a HOG descriptor [20]) by simply summing over them in the Fourier domain. This characteristic extends to the linear case, simplifying the recently-proposed multi-channel correlation filters [31], [32], [33] considerably, under specific conditions.

7.1 General case

To deal with multiple channels, in this section we will assume that a vector x concatenates the individual vectors for C channels (e.g. 31 gradient orientation bins for a HOG variant [20]), as x = [x_1, ..., x_C].

Notice that all kernels studied in Section 6 are based on either dot-products or norms of the arguments. A dot-product can be computed by simply summing the individual dot-products for each channel. By linearity of the DFT, this allows us to sum the result for each channel in the Fourier domain. As a concrete example, we can apply this reasoning to the Gaussian kernel, obtaining the multi-channel analogue of Eq. 30,

k^{xx'} = \exp\left( -\frac{1}{\sigma^2} \left( \|x\|^2 + \|x'\|^2 - 2 \, \mathcal{F}^{-1}\left( \sum_c \hat{x}_c^* \odot \hat{x}'_c \right) \right) \right).   (31)

It is worth emphasizing that the integration of multiple channels does not result in a more difficult inference problem – we merely have to sum over the channels when computing kernel correlation.

7.2 Linear kernel

For a linear kernel κ(x, x') = x^T x', the multi-channel extension from the previous section simply yields

k^{xx'} = \mathcal{F}^{-1}\left( \sum_c \hat{x}_c^* \odot \hat{x}'_c \right).   (32)

We named it the Dual Correlation Filter (DCF). This filter is linear, but trained in the dual space α. We will discuss the advantages over other multi-channel filters shortly.

A recent extension of linear correlation filters to multiple channels was discovered independently by three groups [31], [32], [33]. They allow much faster training times than unstructured algorithms, by decomposing the problem into one linear system for each DFT frequency, in the case of Ridge Regression. Henriques et al. [31] additionally generalize the decomposition to other training algorithms.

However, Eq. 32 suggests that, by working in the dual with a linear kernel, we can train a linear classifier with multiple channels, but using only element-wise operations.

This may be unexpected at first, since those works require more expensive matrix inversions [31], [32], [33].

We resolve this discrepancy by pointing out that this is only possible because we consider a single base sample x. In this case, the kernel matrix K = XX^T is n × n, regardless of the number of features or channels. It relates the n cyclic shifts of the base sample, and can be diagonalized by the n basis vectors of the DFT. Since K is fully diagonal we can use solely element-wise operations. However, if we consider two base samples, K becomes 2n × 2n and the n DFT basis vectors are no longer enough to fully diagonalize it. This incomplete diagonalization (block-diagonalization) requires more expensive operations to deal with, which were proposed in those works.

With an interestingly symmetric argument, training with multiple base samples and a single channel can be done in the primal, with only element-wise operations (Appendix A.6). This follows by applying the same reasoning to the non-centered covariance matrix X^T X, instead of XX^T. In this case we obtain the original MOSSE filter [9].

In conclusion, for fast element-wise operations we can choose multiple channels (in the dual, obtaining the DCF) or multiple base samples (in the primal, obtaining the MOSSE), but not both at the same time. This has an important impact on time-critical applications, such as tracking. The general case [31] is much more expensive and suitable mostly for offline training applications.

8 EXPERIMENTS

8.1 Tracking pipeline

We implemented in Matlab two simple trackers based on the proposed Kernelized Correlation Filter (KCF), using a Gaussian kernel, and Dual Correlation Filter (DCF), using a linear kernel. We do not report results for a polynomial kernel, as they are virtually identical to those for the Gaussian kernel while requiring more parameters. We tested two further variants: one that works directly on the raw pixel values, and another that works on HOG descriptors with a cell size of 4 pixels, in particular Felzenszwalb's variant [20], [22]. Note that our linear DCF is equivalent to MOSSE [9] in the limiting case of a single channel (raw pixels), but it has the advantage of also supporting multiple channels (e.g. HOG). Our tracker requires few parameters, and we report the values that we used, fixed for all videos, in Table 2.

The bulk of the functionality of the KCF is presented as Matlab code in Algorithm 1. Unlike the earlier version of this work [29], it is prepared to deal with multiple channels, as the 3rd dimension of the input arrays. It implements 3 functions: train (Eq. 17), detect (Eq. 22), and kernel_correlation (Eq. 31), which is used by the first two functions.

The pipeline for the tracker is intentionally simple, and does not include any heuristics for failure detection or motion modeling. In the first frame, we train a model with the image patch at the initial position of the target. This patch is larger than the target, to provide some context. For each new frame, we detect over the patch at the previous position, and the target position is updated to the one that yielded the maximum value. Finally, we train a new model at the new position, and linearly interpolate the obtained values of α and x with the ones from the previous frame, to provide the tracker with some memory.

Algorithm 1: Matlab code, with a Gaussian kernel. Multiple channels (third dimension of image patches) are supported. It is possible to further reduce the number of FFT calls. Implementation with GUI available at: http://www.isr.uc.pt/~henriques/

Inputs
• x: training image patch, m × n × c
• y: regression target, Gaussian-shaped, m × n
• z: test image patch, m × n × c
Output
• responses: detection score for each location, m × n

function alphaf = train(x, y, sigma, lambda)
    k = kernel_correlation(x, x, sigma);
    alphaf = fft2(y) ./ (fft2(k) + lambda);
end

function responses = detect(alphaf, x, z, sigma)
    k = kernel_correlation(z, x, sigma);
    responses = real(ifft2(alphaf .* fft2(k)));
end

function k = kernel_correlation(x1, x2, sigma)
    c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
    d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
    k = exp(-1 / sigma^2 * abs(d) / numel(d));
end

8.2 Evaluation

We put our tracker to the test by using a recent benchmark that includes 50 video sequences [11] (see Fig. 1). This dataset collects many videos used in previous works, so we avoid the danger of overfitting to a small subset.

For the performance criteria, we did not choose average location error or other measures that are averaged over frames, since they impose an arbitrary penalty on lost trackers that depends on chance factors (i.e., the position where the track was lost), making them not comparable. A similar alternative is bounding box overlap, which has the disadvantage of heavily penalizing trackers that do not track across scale, even if the target position is otherwise tracked perfectly.

An increasingly popular alternative, which we chose for our evaluation, is the precision curve [11], [5], [29]. A frame may be considered correctly tracked if the predicted target center is within a distance threshold of ground truth. Precision curves simply show the percentage of correctly tracked frames for a range of distance thresholds. Notice that by plotting the precision for all thresholds, no parameters are required. This makes the curves unambiguous and easy to interpret. A higher precision at low thresholds means the tracker is more accurate, while a lost target will prevent it from achieving perfect precision for a very large threshold range. When a representative precision score is needed, the chosen threshold is 20 pixels, as done in previous works [11], [5], [29].

[Figure 4: precision plot (precision vs. location error threshold) for all 50 sequences; the legend scores at 20 px match the precision column of Table 1.]

Figure 4: Precision plot for all 50 sequences. The proposed trackers (bold) outperform state-of-the-art systems, such as TLD and Struck, which are more complicated to implement and much slower (see Table 1). Best viewed in color.

Algorithm          Feature       Mean precision (20 px)   Mean FPS
KCF (proposed)     HOG           73.2%                    172
DCF (proposed)     HOG           72.8%                    292
KCF (proposed)     Raw pixels    56.0%                    154
DCF (proposed)     Raw pixels    45.1%                    278
Struck [7]         –             65.6%                    20
TLD [4]            –             60.8%                    28
MOSSE [9]          –             43.1%                    615
MIL [5]            –             47.5%                    38
ORIA [14]          –             45.7%                    9
CT [3]             –             40.6%                    64

Table 1: Summary of experimental results on the 50 videos dataset. The reported quantities are averaged over all videos. Reported speeds include feature computation (e.g. HOG).

8.3 Experiments on the full dataset

We start by summarizing the results over all videos in Table 1 and Fig. 4. For comparison, we also report results for several other systems [7], [4], [9], [5], [14], [3], including some of the most resilient trackers available – namely, Struck and TLD. Unlike our simplistic implementation (Algorithm 1), these trackers contain numerous engineering improvements. Struck operates on many different kinds of features and a growing pool of support vectors. TLD is specifically geared towards re-detection, using a set of structural rules with many parameters.

Despite this asymmetry, our Kernelized Correlation Filter (KCF) can already reach competitive performance by operating on raw pixels alone, as can be seen in Fig. 4. In this setting, the rich implicit features induced by the Gaussian kernel yield a distinct advantage over the proposed Dual Correlation Filter (DCF).

We remark that the DCF with single-channel features (raw pixels) is theoretically equivalent to a MOSSE filter [9]. For a direct comparison, we include the results for the authors' MOSSE tracker [9] in Fig. 4. The performance of both is very close, showing that any particular differences in their implementations do not seem to matter much. However, the kernelized algorithm we propose (KCF) does yield a noticeable increase in performance.

Replacing pixels with HOG features allows the KCF and DCF to surpass even TLD and Struck, by a relatively large margin (Fig. 4). This suggests that the most crucial factor for high performance, compared to other trackers that use similar features, is the efficient incorporation of thousands of negative samples from the target's environment, which they do with very little overhead.

Timing. As mentioned earlier, the overall complexity of our closed-form solutions is O(n log n), resulting in their high speed (Table 1). The speed of the tracker is directly related to the size of the tracked region. This is an important factor when comparing trackers based on correlation filters. MOSSE [9] tracks a region that has the same support as the target object, while our implementation tracks a region that is 2.5 times larger (116×170 on average). Reducing the tracked region would allow us to approach their FPS of 615 (Table 1), but we found that it hurts performance, especially for the kernel variants. Another interesting observation from Table 1 is that operating on 31 HOG features per spatial cell can be slightly faster than operating on raw pixels, even though we take the overhead of computing HOG features into account. Since each 4×4 pixels cell is represented by a single HOG descriptor, the smaller-sized DFTs counterbalance the cost of iterating over feature channels. Taking advantage of all 4 cores of a desktop computer, KCF/DCF take less than 2 minutes to process all 50 videos (∼29,000 frames).

8.4 Experiments with sequence attributes

The videos in the benchmark dataset [11] are annotated with attributes, which describe the challenges that a tracker will face in each sequence – e.g., illumination changes or occlusions. These attributes are useful for diagnosing and characterizing the behavior of trackers in such a large dataset, without having to analyze each individual video. We report results for 4 attributes in Figure 5: non-rigid deformations, occlusions, out-of-view target, and background clutter.

The robustness of the HOG variants of our tracker regarding non-rigid deformations and occlusions is not surprising, since these features are known to be highly discriminative [20]. However, the KCF on raw pixels alone still fares almost as well as Struck and TLD, with the kernel making up for the features' shortcomings.

One challenge for the system we implemented is an out-of-view target, due to the lack of a failure recovery mechanism. TLD performs better than most other trackers in this case, which illustrates its focus on re-detection and failure recovery. Such engineering improvements could probably benefit our trackers, but the fact that KCF/DCF can still outperform TLD shows that they are not a decisive factor.

Background clutter severely affects almost all trackers, except for the proposed ones, and to a lesser degree, Struck. For our tracker variants, this is explained by the implicit inclusion of thousands of negative samples around the tracked object.

[Figure 5: four precision plots (precision vs. location error threshold). Mean precision at 20 px per panel – Deformation (19 sequences): KCF on HOG 0.740, DCF on HOG 0.740, Struck 0.521, TLD 0.512, KCF on raw pixels 0.480, MIL 0.455, CT 0.435, MOSSE 0.367, ORIA 0.355, DCF on raw pixels 0.350. Occlusion (29 sequences): KCF on HOG 0.749, DCF on HOG 0.726, Struck 0.564, TLD 0.563, KCF on raw pixels 0.505, ORIA 0.435, MIL 0.427, CT 0.412, DCF on raw pixels 0.410, MOSSE 0.397. Out of view (6 sequences): KCF on HOG 0.650, DCF on HOG 0.632, TLD 0.576, Struck 0.539, MIL 0.393, KCF on raw pixels 0.358, CT 0.336, ORIA 0.315, DCF on raw pixels 0.261, MOSSE 0.226. Background clutter (21 sequences): KCF on HOG 0.733, DCF on HOG 0.719, Struck 0.585, KCF on raw pixels 0.503, MIL 0.456, TLD 0.428, DCF on raw pixels 0.407, ORIA 0.389, CT 0.339, MOSSE 0.339.]

Figure 5: Precision plot for sequences with attributes: occlusion, non-rigid deformation, out-of-view target, and background clutter. The HOG variants of the proposed trackers (bold) are the most resilient to all of these nuisances. Best viewed in color.

Since in this case even the raw pixel variants of our tracker have a performance very close to optimal, while TLD, CT, ORIA and MIL show degraded performance, we conjecture that this is caused by their undersampling of negatives.

We also report results for other attributes in Fig. 7. Generally, the proposed trackers are the most robust to 6 of the 7 challenges, except for low resolution, which affects all trackers equally, other than Struck.

9 CONCLUSIONS AND FUTURE WORK

In this work, we demonstrated that it is possible to analytically model natural image translations, showing that under some conditions the resulting data and kernel matrices become circulant. Their diagonalization by the DFT provides a general blueprint for creating fast algorithms that deal with translations. We have applied this blueprint to linear and kernel ridge regression, obtaining state-of-the-art trackers that run at hundreds of FPS and can be implemented with only a few lines of code. Extensions of our basic approach seem likely to be useful in other problems. Since the first version of this work, circulant data has been exploited successfully for other algorithms in detection [31] and video event retrieval [30]. An interesting direction for further work is to relax the assumption of periodic boundaries, which may improve performance. Many useful algorithms may also be obtained from the study of other objective functions with circulant data, including classical filters such as SDF or MACE [25], [26], and more robust loss functions than the squared loss. We also hope to generalize this framework to other operators, such as affine transformations or non-rigid deformations.

ACKNOWLEDGMENT

The authors acknowledge support by the FCT project PTDC/EEA-CRO/122812/2010, and grants SFRH/BD75459/2010, SFRH/BD74152/2010, and SFRH/BPD/90200/2012.

APPENDIX A

A.1 Implementation details

As is standard with correlation filters, the input patches (either raw pixels or extracted feature channels) are weighted by a cosine window, which smoothly removes discontinuities at the image boundaries caused by the cyclic assumption [9], [21]. The tracked region has 2.5 times the size of the target, to provide some context and additional negative samples.

Recall that the training samples consist of shifts of a base sample, so we must specify a regression target for each one in y. The regression targets y simply follow a Gaussian function, which takes a value of 1 for a centered target, and smoothly decays to 0 for any other shifts, according to the spatial bandwidth s.

KCF/DCF parameters        With raw pixels   With HOG
Feature bandwidth σ       0.2               0.5
Adaptation rate           0.075             0.02
Spatial bandwidth s       √(mn)/10
Regularization λ          10⁻⁴

Table 2: Parameters used in all experiments. In this table, n and m refer to the width and height of the target, measured in pixels or HOG cells.

Figure 6: Regression targets y, following a Gaussian function with spatial bandwidth s (white indicates a value of 1, black a value of 0). (a) Placing the peak in the middle will unnecessarily cause the detection output to be shifted by half a window (discussed in Section A.1). (b) Placing the peak at the top-left element (and wrapping around) correctly centers the detection output.

Gaussian targets are smoother than binary labels, and have the benefit of reducing ringing artifacts in the Fourier domain [21].

A subtle issue is determining which element of y is the regression target for the centered sample, on which we will center the Gaussian function. Although intuitively it may seem to be the middle of the output plane (Fig. 6-a), it turns out that the correct choice is the top-left element (Fig. 6-b). The explanation is that, after computing a cross-correlation between two images in the Fourier domain and converting back to the spatial domain, it is the top-left element of the result that corresponds to a shift of zero [21]. Of course, since we always deal with cyclic signals, the peak of the Gaussian function must wrap around from the top-left corner to the other corners, as can be seen in Fig. 6-b. Placing the Gaussian peak in the middle of the regression target is common in some filter implementations, and leads the correlation output to be unnecessarily shifted by half a window, which must be corrected post-hoc².

Another common source of error is the fact that most implementations of the Fast Fourier Transform³ do not compute the unitary DFT. This means that the L2 norm of the signals is not preserved, unless the output is corrected by a constant factor. With some abuse of notation, we can say that the unitary DFT may be computed as

\mathcal{F}_U(x) = \mathrm{fft2}(x) / \sqrt{mn},   (33)

where the input x has size m × n, and similarly for the inverse DFT,

\mathcal{F}_U^{-1}(x) = \mathrm{ifft2}(x) \cdot \sqrt{mn}.   (34)

² This is usually done by switching the quadrants of the output, e.g. with the Matlab built-in function fftshift. It has the same effect as shifting Fig. 6-b to look like Fig. 6-a.
³ For example Matlab, NumPy, Octave and the FFTW library.

A.2 Proof of Theorem 1

Under the theorem's assumption that κ(x, x') = κ(Mx, Mx'), for any permutation matrix M, then

K_{ij} = \kappa(P^i x, P^j x)   (35)
       = \kappa(P^{-i} P^i x, P^{-i} P^j x).   (36)

Using known properties of permutation matrices, this reduces to

K_{ij} = \kappa(x, P^{j-i} x).   (37)

By the cyclic nature of P, it repeats every nth power, i.e. P^n = P^0. As such, Eq. 37 is equivalent to

K_{ij} = \kappa(x, P^{(j-i) \bmod n} x),   (38)

where mod is the modulus operation (remainder of division by n).

We now use the fact that the elements of a circulant matrix X = C(x) (Eq. 6) satisfy

X_{ij} = x_{((j-i) \bmod n) + 1},   (39)

that is, a matrix is circulant if its elements only depend on (j − i) mod n. It is easy to check that this condition is satisfied by Eq. 6, and in fact it is often used as the definition of a circulant matrix [34].

Because K_{ij} also depends on (j − i) mod n, we must conclude that K is circulant as well, finishing our proof.

A.3 Kernel Ridge Regression with Circulant data

This section shows a more detailed derivation of Eq. 17. We start by replacing K = C(k^{xx}) in the formula for Kernel Ridge Regression, Eq. 16, and diagonalizing it:

\alpha = \left( C(k^{xx}) + \lambda I \right)^{-1} y   (40)
       = \left( F \, \mathrm{diag}(\hat{k}^{xx}) F^H + \lambda I \right)^{-1} y.   (41)

By simple linear algebra, and the unitarity of F (F F^H = I),

\alpha = \left( F \, \mathrm{diag}(\hat{k}^{xx}) F^H + \lambda F I F^H \right)^{-1} y   (42)
       = F \, \mathrm{diag}(\hat{k}^{xx} + \lambda)^{-1} F^H y,   (43)

which is equivalent to

F^H \alpha = \mathrm{diag}(\hat{k}^{xx} + \lambda)^{-1} F^H y.   (44)

Since for any vector \mathcal{F}(z) = ẑ, we have

\hat{\alpha}^* = \mathrm{diag}\left( \frac{1}{\hat{k}^{xx} + \lambda} \right) \hat{y}^*.   (45)

A.6 MOSSE filter


ŷ∗
α̂∗ = . (46) The only difference between Eq. 12 and the MOSSE filter [9]
k̂xx + λ is that the latter minimizes the error over (cyclic shifts of)
A.4 Derivation of fast detection formula

To diagonalize Eq. 21, we use the same properties as in the previous section. We have

    f(z) = (C(k^{xz}))^T α    (47)
         = (F diag(k̂^{xz}) F^H)^T α    (48)
         = F^H diag(k̂^{xz}) F α,    (49)

where the last step uses the symmetry of F (F^T = F). This is equivalent to

    F f(z) = diag(k̂^{xz}) F α.    (50)

Replicating the same final steps from the previous section,

    f̂(z) = k̂^{xz} ⊙ α̂.    (51)
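The following sketch checks Eq. 51 against a direct, O(n²) evaluation of Eq. 47 with one explicit kernel computation per pair of cyclic shifts. A linear kernel is assumed for concreteness, and the identity holds for any coefficient vector α:

```matlab
% Fast detection (Eq. 51) versus direct evaluation of Eq. 47.
n = 16;
x = randn(n, 1); z = randn(n, 1); alpha = randn(n, 1);

% Direct evaluation of f(z) = (C(k^{xz}))' * alpha, element by element.
f_direct = zeros(n, 1);
for i = 1:n
    for j = 1:n
        f_direct(i) = f_direct(i) + alpha(j) * (z' * circshift(x, i - j));
    end
end

% Fast evaluation in the Fourier domain, O(n log n).
kxz = real(ifft(conj(fft(x)) .* fft(z)));    % linear kernel correlation
f_fast = real(ifft(fft(kxz) .* fft(alpha)));

disp(max(abs(f_direct - f_fast)));           % should be ~0
```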
A.5 Linear Ridge Regression with Circulant data

This is a more detailed version of the steps from Section 4.4. It is very similar to the kernel case. We begin by replacing Eq. 10 in the formula for Ridge Regression, Eq. 3:

    w = (F diag(x̂∗ ⊙ x̂) F^H + λI)^{−1} X^H y.    (52)

By simple algebra, and the unitarity of F, we have

    w = (F diag(x̂∗ ⊙ x̂) F^H + λ F^H I F)^{−1} X^H y    (53)
      = F diag(x̂∗ ⊙ x̂ + λ)^{−1} F^H X^H y    (54)
      = F diag(x̂∗ ⊙ x̂ + λ)^{−1} F^H F diag(x̂) F^H y    (55)
      = F diag(x̂ / (x̂∗ ⊙ x̂ + λ)) F^H y.    (56)

Then, conjugating both sides (w and y are real) and multiplying by F, this is equivalent to

    F w = diag(x̂∗ / (x̂∗ ⊙ x̂ + λ)) F y,    (57)

and since for any vector F z = ẑ,

    ŵ = diag(x̂∗ / (x̂∗ ⊙ x̂ + λ)) ŷ.    (58)

We may go one step further, since the product of a diagonal matrix and a vector is just their element-wise product:

    ŵ = (x̂∗ ⊙ ŷ) / (x̂∗ ⊙ x̂ + λ).    (59)
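Eq. 59 can be checked against solving Eq. 3 directly with an explicit circulant data matrix. One caveat in the sketch below: depending on the DFT sign convention, the complex conjugate may fall on x̂ (as written in Eq. 59) or on ŷ; with Matlab's fft, the shift direction used here, and real signals, it falls on neither factor of the numerator. Tracking such conventions is exactly the kind of pitfall discussed in A.1:

```matlab
% Check Eq. 59 against w = (X'X + lambda*I)^{-1} X'y (Eq. 3).
n = 16; lambda = 1e-2;
x = randn(n, 1); y = randn(n, 1);

X = zeros(n);                          % data matrix C(x) (Eq. 6):
for i = 1:n
    X(i, :) = circshift(x, i - 1)';    % each row is a cyclic shift of x
end
w_direct = (X' * X + lambda * eye(n)) \ (X' * y);

% Fourier-domain filter; with Matlab's fft sign convention and real
% signals, no conjugate is needed in the numerator.
xf = fft(x);
w_fast = real(ifft(xf .* fft(y) ./ (conj(xf) .* xf + lambda)));

disp(max(abs(w_direct - w_fast)));     % should be ~0
```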
A.6 MOSSE filter

The only difference between Eq. 12 and the MOSSE filter [9] is that the latter minimizes the error over (cyclic shifts of) multiple base samples x_i, while Eq. 12 is defined for a single base sample x. This was done for clarity of presentation, and the general case is easily derived. Note also that MOSSE does not support multiple channels, which we do through our dual formulation.

The cyclic shifts of each base sample x_i can be expressed in a circulant matrix X_i. Then, replacing the data matrix X′ = [X_1^T, X_2^T, ⋯]^T (the X_i stacked vertically) in Eq. 3 results in

    w = (Σ_i X_i^H X_i + λI)^{−1} Σ_j X_j^H y,    (60)

by direct application of the rule for products of block matrices. Factoring the bracketed expression,

    w = (Σ_i X_i^H X_i + λI)^{−1} (Σ_i X_i^H) y.    (61)

Eq. 61 looks exactly like Eq. 3, except for the sums. It is then trivial to follow the same steps as in Section 4.4 to diagonalize it, and obtain the filter equation

    ŵ = (Σ_i x̂_i∗ ⊙ ŷ) / (Σ_i x̂_i∗ ⊙ x̂_i + λ).    (62)
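A sketch of Eq. 62 against the direct solution of Eq. 60, accumulating the circulant matrices of several base samples (same Matlab fft-convention caveat as in the A.5 sketch; the number of samples is an arbitrary example value):

```matlab
% Check Eq. 62 against Ridge Regression over stacked circulant matrices.
n = 16; m = 5; lambda = 1e-2;
xs = randn(n, m);                    % m base samples, one per column
y = randn(n, 1);

% Direct solution: accumulate X_i^H X_i and X_i^H y over all samples.
A = lambda * eye(n); b = zeros(n, 1);
for s = 1:m
    Xi = zeros(n);
    for i = 1:n
        Xi(i, :) = circshift(xs(:, s), i - 1)';  % rows = cyclic shifts
    end
    A = A + Xi' * Xi;
    b = b + Xi' * y;
end
w_direct = A \ b;

% Fourier-domain filter (Eq. 62), with element-wise sums per frequency.
xf = fft(xs);                            % DFT of each base sample
num = sum(xf, 2) .* fft(y);
den = sum(abs(xf).^2, 2) + lambda;
w_fast = real(ifft(num ./ den));

disp(max(abs(w_direct - w_fast)));       % should be ~0
```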
[Figure 7 appears here: six precision plots (precision vs. location error threshold, 0 to 50 pixels), one per attribute. Only the panel titles and legend scores are recoverable from the plots; they are tabulated below.]

Tracker               OPR(39)  FM(17)  IV(25)  SV(28)  MB(12)  LR(4)
KCF on HOG             0.729    0.602   0.711   0.679   0.650   0.381
DCF on HOG             0.712    0.559   0.699   0.654   0.588   0.354
Struck                 0.597    0.604   0.558   0.639   0.551   0.545
TLD                    0.596    0.551   0.537   0.606   0.518   0.349
KCF on raw pixels      0.541    0.441   0.448   0.492   0.394   0.396
DCF on raw pixels      0.428    0.275   0.309   0.426   0.270   0.396
ORIA                   0.493    0.274   0.421   0.445   0.234   0.195
MIL                    0.466    0.396   0.349   0.471   0.357   0.171
CT                     0.394    0.323   0.359   0.448   0.306   0.152
MOSSE                  0.390    0.213   0.375   0.387   0.244   0.239

Attributes: OPR = out-of-plane rotation (39 sequences), FM = fast motion (17), IV = illumination variation (25), SV = scale variation (28), MB = motion blur (12), LR = low resolution (4).

Figure 7: Precision plots for 6 attributes of the dataset. Best viewed in color. In-plane rotation was left out due to space constraints. Its results are virtually identical to those for out-of-plane rotation (above), since they share almost the same set of sequences.

REFERENCES

[1] A. Smeulders, D. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah, "Visual tracking: an experimental survey," TPAMI, 2013.
[2] H. Yang, L. Shao, F. Zheng, L. Wang, and Z. Song, "Recent advances and trends in visual tracking: A review," Neurocomputing, vol. 74, no. 18, pp. 3823–3831, Nov. 2011.
[3] K. Zhang, L. Zhang, and M.-H. Yang, "Real-time compressive tracking," in ECCV, 2012.
[4] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-learning-detection," TPAMI, 2012.
[5] B. Babenko, M. Yang, and S. Belongie, "Robust object tracking with online multiple instance learning," TPAMI, 2011.
[6] A. Saffari, C. Leistner, J. Santner, M. Godec, and H. Bischof, "On-line random forests," in 3rd IEEE ICCV Workshop on On-line Computer Vision, 2009.
[7] S. Hare, A. Saffari, and P. Torr, "Struck: Structured output tracking with kernels," in ICCV, 2011.
[8] R. Rifkin, G. Yeo, and T. Poggio, "Regularized least-squares classification," Nato Science Series Sub Series III Computer and Systems Sciences, vol. 190, pp. 131–154, 2003.
[9] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, "Visual object tracking using adaptive correlation filters," in CVPR, 2010, pp. 2544–2550.
[10] D. S. Bolme, B. A. Draper, and J. R. Beveridge, "Average of synthetic exact filters," in CVPR, 2009.
[11] Y. Wu, J. Lim, and M.-H. Yang, "Online object tracking: A benchmark," in CVPR, 2013.
[12] S. Avidan, "Support vector tracking," TPAMI, 2004.
[13] H. Grabner, C. Leistner, and H. Bischof, "Semi-supervised on-line boosting for robust tracking," in ECCV, 2008.
[14] Y. Wu, B. Shen, and H. Ling, "Online robust image alignment via iterative convex optimization," in CVPR, 2012.
[15] L. Sevilla-Lara and E. Learned-Miller, "Distribution fields for tracking," in CVPR, 2012.
[16] C. Lampert, M. Blaschko, and T. Hofmann, "Beyond sliding windows: Object localization by efficient subwindow search," in CVPR, 2008.
[17] B. Alexe, V. Petrescu, and V. Ferrari, "Exploiting spatial overlap to efficiently compute appearance distances between image windows," in NIPS, 2011.
[18] H. Harzallah, F. Jurie, and C. Schmid, "Combining efficient object localization and image classification," in ICCV, 2009.
[19] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, "Multiple kernels for object detection," in ICCV, 2009.
[20] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," TPAMI, 2010.
[21] R. C. González and R. E. Woods, Digital image processing. Prentice Hall, 2008.
[22] P. Dollár, R. Appel, S. Belongie, and P. Perona, "Fast feature pyramids for object detection," TPAMI, 2014.
[23] B. Schölkopf and A. Smola, Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press, 2002.
[24] D. Casasent and R. Patnaik, "Analysis of kernel distortion-invariant filters," in Proceedings of SPIE, vol. 6764, 2007, p. 1.
[25] R. Patnaik and D. Casasent, "Fast FFT-based distortion-invariant kernel filters for general object recognition," in Proceedings of SPIE, vol. 7252, 2009, p. 1.
[26] K.-H. Jeong, P. P. Pokharel, J.-W. Xu, S. Han, and J. Principe, "Kernel based synthetic discriminant function for object recognition," in ICASSP, 2006.
[27] C. Xie, M. Savvides, and B. Vijaya-Kumar, "Kernel correlation filter based redundant class-dependence feature analysis (KCFA) on FRGC2.0 data," in Analysis and Modelling of Faces and Gestures, 2005.
[28] A. Mahalanobis, B. Kumar, and D. Casasent, "Minimum average correlation energy filters," Applied Optics, 1987.
[29] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "Exploiting the circulant structure of tracking-by-detection with kernels," in ECCV, 2012.
[30] J. Revaud, M. Douze, C. Schmid, and H. Jégou, "Event retrieval in large video collections with circulant temporal encoding," in CVPR, 2013.
[31] J. F. Henriques, J. Carreira, R. Caseiro, and J. Batista, "Beyond hard negative mining: Efficient detector learning via block-circulant decomposition," in ICCV, 2013.
[32] H. K. Galoogahi, T. Sim, and S. Lucey, "Multi-channel correlation filters," in ICCV, 2013.
[33] V. N. Boddeti, T. Kanade, and B. V. Kumar, "Correlation filters for object alignment," in CVPR, 2013.
[34] R. M. Gray, Toeplitz and Circulant Matrices: A Review. Now Publishers, 2006.
[35] P. J. Davis, Circulant matrices. American Mathematical Soc., 1994.
[36] A. Vedaldi and A. Zisserman, "Efficient additive kernels via explicit feature maps," TPAMI, 2011.

João F. Henriques received his M.Sc. degree in Electrical Engineering from the University of Coimbra, Portugal, in 2009. He is currently a Ph.D. student at the Institute of Systems and Robotics, University of Coimbra. His current research interests include Fourier analysis and machine learning algorithms in general, with computer vision applications in detection and tracking.

Rui Caseiro received the B.Sc. degree in electrical engineering (specialization in automation) from the University of Coimbra, Coimbra, Portugal, in 2005. Since 2007, he has been involved in several research projects, which include the European project "Perception on Purpose" and the National project "Brisa-ITraffic". He is currently a Ph.D. student and researcher with the Institute of Systems and Robotics and the Department of Electrical and Computer Engineering, Faculty of Science and Technology, University of Coimbra. His current research interests include the interplay of differential geometry with computer vision and pattern recognition.
Pedro Martins received both his M.Sc. and Ph.D. degrees in Electrical Engineering from the University of Coimbra, Portugal, in 2008 and 2012, respectively. Currently, he is a Postdoctoral researcher at the Institute of Systems and Robotics (ISR), University of Coimbra, Portugal. His main research interests include non-rigid image alignment, face tracking and facial expression recognition.

Prof. Jorge Batista received the M.Sc. and Ph.D. degrees in Electrical Engineering from the University of Coimbra in 1992 and 1999, respectively. He joined the Department of Electrical Engineering and Computers, University of Coimbra, Coimbra, Portugal, in 1987 as a research assistant, where he is currently an Associate Professor. He was the Head of the Department of Electrical Engineering and Computers of the Faculty of Science and Technology of the University of Coimbra from November 2011 until November 2013, and is a founding member of the Institute of Systems and Robotics (ISR) in Coimbra, where he is a Senior Researcher. His research interests focus on a wide range of computer vision and pattern analysis related issues, including real-time vision, video surveillance, video analysis, non-rigid modeling and facial analysis. More recently, his research activity has also focused on the interplay of differential geometry in computer vision and pattern recognition problems. He has been involved in several national and European research projects, several of them as PI at UC.
