High Speed Tracking With Kernelized Correlation Filters
Abstract—The core component of most modern trackers is a discriminative classifier, tasked with distinguishing between the target
and the surrounding environment. To cope with natural image changes, this classifier is typically trained with translated and scaled
sample patches. Such sets of samples are riddled with redundancies – any overlapping pixels are constrained to be the same. Based
on this simple observation, we propose an analytic model for datasets of thousands of translated patches. By showing that the resulting
data matrix is circulant, we can diagonalize it with the Discrete Fourier Transform, reducing both storage and computation by several
orders of magnitude. Interestingly, for linear regression our formulation is equivalent to a correlation filter, used by some of the fastest
competitive trackers. For kernel regression, however, we derive a new Kernelized Correlation Filter (KCF), that unlike other kernel
algorithms has the exact same complexity as its linear counterpart. Building on it, we also propose a fast multi-channel extension of
linear correlation filters, via a linear kernel, which we call Dual Correlation Filter (DCF). Both KCF and DCF outperform top-ranking
trackers such as Struck or TLD on a benchmark of 50 videos, despite running at hundreds of frames-per-second, and being implemented
in a few lines of code (Algorithm 1). To encourage further developments, our tracking framework was made open-source.
Index Terms—Visual tracking, circulant matrices, discrete Fourier transform, kernel methods, ridge regression, correlation filters.
Figure 1: Qualitative results for the proposed Kernelized Correlation Filter (KCF), compared with the top-performing Struck and
TLD. Best viewed on a high-resolution screen. The chosen kernel is Gaussian, on HOG features. These snapshots were taken at
the midpoints of the 50 videos of a recent benchmark [11]. Missing trackers are denoted by an “x”. KCF outperforms both Struck
and TLD, despite its minimal implementation and running at 172 FPS (see Algorithm 1, and Table 1).
class labels can be seen as classification, so we use the two terms interchangeably.

We will discuss some relevant trackers before focusing on the literature that is more directly related to our analytical methods. Canonical examples of the tracking-by-detection paradigm include those based on Support Vector Machines (SVM) [12], Random Forest classifiers [6], or boosting variants [13], [5]. All the mentioned algorithms had to be adapted for online learning, in order to be useful for tracking. Zhang et al. [3] propose a projection to a fixed random basis, to train a Naive Bayes classifier, inspired by compressive sensing techniques. Aiming to predict the target's location directly, instead of its presence in a given image patch, Hare et al. [7] employed a Structured Output SVM and Gaussian kernels, based on a large number of image features. Examples of non-discriminative trackers include the work of Wu et al. [14], who formulate tracking as a sequence of image alignment objectives, and of Sevilla-Lara and Learned-Miller [15], who propose a strong appearance descriptor based on distribution fields. Another discriminative approach by Kalal et al. [4] uses a set of structural constraints to guide the sampling process of a boosting classifier. Finally, Bolme et al. [9] employ classical signal processing analysis to derive fast correlation filters. We will discuss these last two works in more detail shortly.

2.2 On sample translations and correlation filtering

Recall that our goal is to learn and detect over translated image patches efficiently. Unlike our approach, most attempts so far have focused on trying to weed out irrelevant image patches. On the detection side, it is possible to use branch-and-bound to find the maximum of a classifier's response while avoiding unpromising candidate patches [16]. Unfortunately, in the worst case the algorithm may still have to iterate over all patches. A related method finds the most similar patches of a pair of images efficiently [17], but does not translate directly to our setting. Though it does not preclude an exhaustive search, a notable optimization is to use a fast but inaccurate classifier to select promising patches, and only apply the full, slower classifier on those [18], [19].

On the training side, Kalal et al. [4] propose using structural constraints to select relevant sample patches from each new image. This approach is relatively expensive, limiting the features that can be used, and requires careful tuning of the structural heuristics. A popular and related method, though mainly used in offline detector learning, is hard-negative mining [20]. It consists of running an initial detector on a pool of images, and selecting any wrong detections as samples for re-training. Even though both approaches reduce the number of training samples, a major drawback is that the candidate patches have to be considered exhaustively, by running a detector.

The initial motivation for our line of research was the recent success of correlation filters in tracking [9], [10]. Correlation filters have proved to be competitive with far more complicated approaches, while using only a fraction of the computational power, at hundreds of frames-per-second. They take advantage of the fact that the convolution of two patches (loosely, their dot-product at different relative translations) is equivalent to an element-wise product in the Fourier domain. Thus, by formulating their objective in the Fourier domain, they can specify the desired output of a linear classifier for several translations, or image shifts, at once.

A Fourier domain approach can be very efficient, and has several decades of research in signal processing to draw from [21]. Unfortunately, it can also be extremely limiting. We would like to simultaneously leverage more recent advances in computer vision, such as more powerful features, large-margin classifiers or kernel methods [22], [20], [23].

A few studies go in that direction, and attempt to apply kernel methods to correlation filters [24], [25], [26], [27]. In these works, a distinction must be drawn between two types
of objective functions: those that do not consider the power spectrum or image translations, such as Synthetic Discriminant Function (SDF) filters [25], [26], and those that do, such as Minimum Average Correlation Energy [28], Optimal Trade-Off [27] and Minimum Output Sum of Squared Error (MOSSE) filters [9]. Since the spatial structure can effectively be ignored, the former are easier to kernelize, and Kernel SDF filters have been proposed [26], [27], [25]. However, lacking a clearer relationship between translated images, non-linear kernels and the Fourier domain, applying the kernel trick to other filters has proven much more difficult [25], [24], with some proposals requiring significantly higher computation times and imposing strong limits on the number of image shifts that can be considered [24].

For us, this hinted that a deeper connection between translated image patches and training algorithms was needed, in order to overcome the limitations of direct Fourier domain formulations.

2.3 Subsequent work

Since the initial version of this work [29], an interesting time-domain variant of the proposed cyclic shift model has been used very successfully for video event retrieval [30]. Generalizations of linear correlation filters to multiple channels have also been proposed [31], [32], [33], some of which build on our initial work. This allows them to leverage more modern features (e.g. Histogram of Oriented Gradients – HOG). A generalization to other linear algorithms, such as Support Vector Regression, was also proposed [31]. We must point out that all of these works target off-line training, and thus rely on slower solvers [31], [32], [33]. In contrast, we focus on fast element-wise operations, which are more suitable for real-time tracking, even with the kernel trick.

almost matches the performance of non-linear kernels. We name it Dual Correlation Filter (DCF), and show how it is related to a set of recent, more expensive multi-channel filters [31]. Experimentally, we demonstrate that the KCF already performs better than a linear filter, without any feature extraction. With HOG features, both the linear DCF and non-linear KCF outperform by a large margin top-ranking trackers, such as Struck [7] or Track-Learn-Detect (TLD) [4], while comfortably running at hundreds of frames-per-second.

4 BUILDING BLOCKS

In this section, we propose an analytical model for image patches extracted at different translations, and work out the impact on a linear regression algorithm. We will show a natural underlying connection to classical correlation filters. The tools we develop will allow us to study more complicated algorithms in Sections 5-7.

4.1 Linear regression

We will focus on Ridge Regression, since it admits a simple closed-form solution, and can achieve performance that is close to more sophisticated methods, such as Support Vector Machines [8]. The goal of training is to find a function f(z) = w^T z that minimizes the squared error over samples x_i and their regression targets y_i,

$$\min_{w} \; \sum_i \left( f(x_i) - y_i \right)^2 + \lambda \left\| w \right\|^2 . \qquad (1)$$
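For real data, the minimizer has the standard closed form w = (X^T X + λI)^{-1} X^T y (cf. Eq. 3). The following is a minimal Matlab sketch of this solution on synthetic data; the sizes and values are illustrative assumptions:

% Minimal sketch: closed-form Ridge Regression on synthetic data.
% Each row of X is a sample x_i; y holds the regression targets y_i.
lambda = 1e-2;                              % regularization weight (example value)
X = randn(100, 16);                         % 100 samples, 16 features (synthetic)
y = randn(100, 1);
% Closed-form minimizer of Eq. 1: w = (X' X + lambda I)^(-1) X' y
w = (X' * X + lambda * eye(16)) \ (X' * y);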
Figure 2: Examples of vertical cyclic shifts of a base sample. Our Fourier domain formulation allows us to train a tracker with all possible cyclic shifts of a base sample, both vertical and horizontal, without iterating them explicitly. Artifacts from the wrapped-around edges can be seen (top of the left-most image), but are mitigated by the cosine window and padding.

Figure 3: Illustration of a circulant matrix. The rows are cyclic shifts of a vector image, or its translations in 1D. The same properties carry over to circulant matrices containing 2D images.
4.2 Cyclic shifts

For notational simplicity, we will focus on single-channel, one-dimensional signals. These results generalize to multi-channel, two-dimensional images in a straightforward way (Section 7).

Consider an n × 1 vector representing a patch with the object of interest, denoted x. We will refer to it as the base sample. Our goal is to train a classifier with both the base sample (a positive example) and several virtual samples obtained by translating it (which serve as negative examples). We can model one-dimensional translations of this vector by a cyclic shift operator, which is the permutation matrix

$$P = \begin{bmatrix} 0 & 0 & 0 & \cdots & 1 \\ 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix} . \qquad (4)$$

The product P x = [x_n, x_1, x_2, ..., x_{n-1}]^T shifts x by one element, modeling a small translation. We can chain u shifts to achieve a larger translation by using the matrix power P^u x. A negative u will shift in the reverse direction. A 1D signal translated horizontally with this model is illustrated in Fig. 3, and an example for a 2D image is shown in Fig. 2.

The attentive reader will notice that the last element wraps around, inducing some distortion relative to a true translation. However, this undesirable property can be mitigated by appropriate padding and windowing (Section A.1). The fact that a large percentage of the elements of a signal are still modeled correctly, even for relatively large translations (see Fig. 2), explains the observation that cyclic shifts work well in practice.

Due to the cyclic property, we get the same signal x periodically every n shifts. This means that the full set of shifted signals is obtained with

$$\left\{ P^u x \;\middle|\; u = 0, \ldots, n - 1 \right\} . \qquad (5)$$

Again due to the cyclic property, we can equivalently view the first half of this set as shifts in the positive direction, and the second half as shifts in the negative direction.

4.3 Circulant matrices

To compute a regression with shifted samples, we can use the set of Eq. 5 as the rows of a data matrix X:

$$X = C(x) = \begin{bmatrix} x_1 & x_2 & x_3 & \cdots & x_n \\ x_n & x_1 & x_2 & \cdots & x_{n-1} \\ x_{n-1} & x_n & x_1 & \cdots & x_{n-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_2 & x_3 & x_4 & \cdots & x_1 \end{bmatrix} . \qquad (6)$$

An illustration of the resulting pattern is given in Fig. 3. What we have just arrived at is a circulant matrix, which has several intriguing properties [34], [35]. Notice that the pattern is deterministic, and fully specified by the generating vector x, which is the first row.

What is perhaps most amazing and useful is the fact that all circulant matrices are made diagonal by the Discrete Fourier Transform (DFT), regardless of the generating vector x [34]. This can be expressed as

$$X = F \, \mathrm{diag}(\hat{x}) \, F^H , \qquad (7)$$

where F is a constant matrix that does not depend on x, and x̂ denotes the DFT of the generating vector, x̂ = F(x). From now on, we will always use a hat ^ as shorthand for the DFT of a vector.

The constant matrix F is known as the DFT matrix, and is the unique matrix that computes the DFT of any input vector, as F(z) = √n F z. This is possible because the DFT is a linear operation.

Eq. 7 expresses the eigendecomposition of a general circulant matrix. The shared, deterministic eigenvectors F lie at the root of many uncommon features, such as commutativity or closed-form inversion.
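Eq. 7 is easy to check numerically. A minimal Matlab sketch, assuming the unitary DFT matrix convention (the exact placement of conjugates depends on the shift and DFT sign conventions):

% Sketch: numerical check of Eq. 7. Builds X = C(x) row by row and
% verifies that the unitary DFT matrix diagonalizes it.
n = 8;
x = randn(n, 1);
X = zeros(n);
for u = 0:n-1, X(u+1, :) = circshift(x, u)'; end  % rows are cyclic shifts (Eq. 6)
F = fft(eye(n)) / sqrt(n);                        % unitary DFT matrix
D = F * X * F';                                   % should be diagonal
disp(norm(D - diag(diag(D))))                     % off-diagonal energy, ~1e-15
% Explicit eigendecomposition: the eigenvalues are the DFT of the first
% column (a conjugate of fft(x) here, depending on conventions).
disp(norm(X - F' * diag(fft(X(:, 1))) * F))       % ~1e-14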
4.4 Putting it all together

We can now apply this new knowledge to simplify the linear regression in Eq. 3, when the training data is composed of cyclic shifts. Being able to work solely with diagonal matrices is very appealing, because all operations can be done element-wise on their diagonal elements.

Take the term X^H X, which can be seen as a non-centered covariance matrix. Replacing Eq. 7 in it,

$$X^H X = F \, \mathrm{diag}(\hat{x}^*) \, F^H F \, \mathrm{diag}(\hat{x}) \, F^H . \qquad (8)$$

Since diagonal matrices are symmetric, taking the Hermitian transpose only left behind a complex-conjugate, x̂*.
Additionally, we can eliminate the factor F^H F = I. This property is the unitarity of F and can be canceled out in many expressions. We are left with

$$X^H X = F \, \mathrm{diag}(\hat{x}^*) \, \mathrm{diag}(\hat{x}) \, F^H . \qquad (9)$$

Because operations on diagonal matrices are element-wise, we can define the element-wise product as ⊙ and obtain

$$X^H X = F \, \mathrm{diag}(\hat{x}^* \odot \hat{x}) \, F^H . \qquad (10)$$

An interesting aspect is that the vector in brackets is known as the auto-correlation of the signal x (in the Fourier domain, also known as the power spectrum [21]). In classical signal processing, it contains the variance of a time-varying process for different time lags, or in our case, space.

The above steps summarize the general approach taken in diagonalizing expressions with circulant matrices. Applying them recursively to the full expression for linear regression (Eq. 3), we can put most quantities inside the diagonal,

$$\hat{w} = \mathrm{diag}\left( \frac{\hat{x}^*}{\hat{x}^* \odot \hat{x} + \lambda} \right) \hat{y} , \qquad (11)$$

The solution to these filters looks like Eq. 12 (see Appendix A.2), but with two crucial differences. First, MOSSE filters are derived from an objective function specifically formulated in the Fourier domain. Second, the λ regularizer is added in an ad-hoc way, to avoid division-by-zero. The derivation we showed above adds considerable insight, by specifying the starting point as Ridge Regression with cyclic shifts, and arriving at the same solution.
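As a sanity check, this element-wise Fourier solution can be compared against the generic spatial-domain one. A minimal Matlab sketch; note that whether the conjugate appears on x̂ or ŷ depends on the shift and DFT sign conventions, and the version below matches rows built with circshift:

% Sketch: Ridge Regression with cyclic shifts, solved two ways.
n = 64; lambda = 1e-2;
x = randn(n, 1); y = randn(n, 1);
X = zeros(n);
for u = 0:n-1, X(u+1, :) = circshift(x, u)'; end         % data matrix of Eq. 6
w_slow = (X' * X + lambda * eye(n)) \ (X' * y);          % generic O(n^3) solution
% Element-wise Fourier solution in the spirit of Eq. 11, O(n log n):
w_fast = real(ifft(fft(x) .* fft(y) ./ (abs(fft(x)).^2 + lambda)));
disp(norm(w_slow - w_fast))                              % ~1e-12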
Circulant matrices allow us to enrich the toolset put forward by classical signal processing and modern correlation filters, and apply the Fourier trick to new algorithms. Over the next section we will see one such instance, in training non-linear filters.

5 NON-LINEAR REGRESSION

One way to allow more powerful, non-linear regression functions f(z) is with the "kernel trick" [23]. The most attractive quality is that the optimization problem is still linear, albeit in a different set of variables (the dual space). On the downside, evaluating f(z) typically grows in complexity with the number of samples.

Using our new analysis tools, however, we will show that it is possible to overcome this limitation, and obtain non-linear filters that are as fast as linear correlation filters, or better yet, both to train and evaluate.
5.2 Fast kernel regression

The solution to the kernelized version of Ridge Regression is given by [8]

$$\alpha = (K + \lambda I)^{-1} y , \qquad (16)$$

where K is the kernel matrix and α is the vector of coefficients α_i, that represent the solution in the dual space.

Now, if we can prove that K is circulant for datasets of cyclic shifts, we can diagonalize Eq. 16 and obtain a fast solution as for the linear case. This would seem to be intuitively true, but does not hold in general. The arbitrary non-linear mapping ϕ(x) gives us no guarantee of preserving any sort of structure. However, we can impose one condition that will allow K to be circulant. It turns out to be fairly broad, and applies to most useful kernels.

Theorem 1. Given circulant data C(x), the corresponding kernel matrix K is circulant if the kernel function satisfies κ(x, x′) = κ(Mx, Mx′), for any permutation matrix M.

This analogy can be taken further. Since a kernel is equivalent to a dot-product in a high-dimensional space ϕ(·), another way to view Eq. 18 is

$$k_i^{xx'} = \varphi^T(x') \, \varphi(P^{i-1} x) , \qquad (19)$$

which is the cross-correlation of x and x′ in the high-dimensional space ϕ(·).

Notice how we only need to compute and operate on the kernel auto-correlation, an n × 1 vector, which grows linearly with the number of samples. This is contrary to the conventional wisdom on kernel methods, which requires computing an n × n kernel matrix, scaling quadratically with the samples. Our knowledge of the exact structure of K allowed us to do better than a generic algorithm.

Finding the optimal α is not the only problem that can be accelerated, due to the ubiquity of translated patches in a tracking-by-detection setting. Over the next paragraphs we will investigate the effect of the cyclic shift model on the detection phase, and even in computing kernel correlations.
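To make Theorem 1 and the resulting fast solution concrete, a minimal Matlab sketch follows; the 1D signals and the particular Gaussian kernel normalization are illustrative assumptions. It builds the full kernel matrix of cyclic shifts and checks it against the element-wise Fourier solution:

% Sketch: kernel Ridge Regression over cyclic shifts, solved two ways.
n = 32; sigma = 0.4; lambda = 1e-3;
x = randn(n, 1); y = randn(n, 1);
kappa = @(a, b) exp(-sum((a - b).^2) / (sigma^2 * n));  % permutation-invariant kernel
K = zeros(n);                                           % kernel matrix of all shift pairs
for i = 0:n-1
    for j = 0:n-1
        K(i+1, j+1) = kappa(circshift(x, i), circshift(x, j));
    end
end
alpha_slow = (K + lambda * eye(n)) \ y;                 % generic solution of Eq. 16
kxx = zeros(n, 1);                                      % kernel auto-correlation vector
for u = 0:n-1, kxx(u+1) = kappa(x, circshift(x, u)); end
alpha_fast = real(ifft(fft(y) ./ (fft(kxx) + lambda))); % fast solution (cf. Eq. 17)
disp(norm(alpha_slow - alpha_fast))                     % ~1e-12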
This may be unexpected at first, since those works require more expensive matrix inversions [31], [32], [33].

We resolve this discrepancy by pointing out that this is only possible because we only consider a single base sample x. In this case, the kernel matrix K = XX^T is n × n, regardless of the number of features or channels. It relates the n cyclic shifts of the base sample, and can be diagonalized by the n basis of the DFT. Since K is fully diagonal we can use solely element-wise operations. However, if we consider two base samples, K becomes 2n × 2n and the n DFT basis are no longer enough to fully diagonalize it. This incomplete diagonalization (block-diagonalization) requires more expensive operations to deal with, which were proposed in those works.

With an interestingly symmetric argument, training with multiple base samples and a single channel can be done in the primal, with only element-wise operations (Appendix A.6). This follows by applying the same reasoning to the non-centered covariance matrix X^T X, instead of XX^T. In this case we obtain the original MOSSE filter [9].

In conclusion, for fast element-wise operations we can choose multiple channels (in the dual, obtaining the DCF) or multiple base samples (in the primal, obtaining the MOSSE), but not both at the same time. This has an important impact on time-critical applications, such as tracking. The general case [31] is much more expensive and suitable mostly for offline training applications.

Algorithm 1: Matlab code, with a Gaussian kernel.
Multiple channels (third dimension of image patches) are supported. It is possible to further reduce the number of FFT calls. Implementation with GUI available at: http://www.isr.uc.pt/~henriques/

Inputs
• x: training image patch, m × n × c
• y: regression target, Gaussian-shaped, m × n
• z: test image patch, m × n × c
Output
• responses: detection score for each location, m × n

function alphaf = train(x, y, sigma, lambda)
    % Kernel auto-correlation of the training patch (Eq. 31)
    k = kernel_correlation(x, x, sigma);
    % Dual coefficients in the Fourier domain (Eq. 17)
    alphaf = fft2(y) ./ (fft2(k) + lambda);
end

function responses = detect(alphaf, x, z, sigma)
    % Kernel cross-correlation between the test patch and the model
    k = kernel_correlation(z, x, sigma);
    % Detection scores for all cyclic shifts at once (Eq. 22)
    responses = real(ifft2(alphaf .* fft2(k)));
end

function k = kernel_correlation(x1, x2, sigma)
    % Cross-correlation of all channels in the Fourier domain, then summed
    c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
    % Squared distance between x1 and each cyclic shift of x2
    d = x1(:)' * x1(:) + x2(:)' * x2(:) - 2 * c;
    % Gaussian kernel, evaluated for all shifts
    k = exp(-1 / sigma^2 * abs(d) / numel(d));
end
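As a usage illustration of Algorithm 1, the following is a minimal sketch of the tracking loop described in Section 8.1. It is not the released implementation: get_patch (which should crop a padded, cosine-windowed patch around pos), frames and init_pos are hypothetical stand-ins, y is the precomputed Gaussian target of Algorithm 1, and the peak-to-displacement index arithmetic assumes fftshift-style centering (see footnote 2):

% Minimal tracking-loop sketch reusing train/detect from Algorithm 1.
% get_patch, frames and init_pos are hypothetical stand-ins.
sigma = 0.5; lambda = 1e-4; interp_factor = 0.02;  % example values
pos = init_pos;                                    % target center in the first frame
x = get_patch(frames{1}, pos);                     % padded, cosine-windowed patch
model_alphaf = train(x, y, sigma, lambda);         % y: precomputed Gaussian target
model_x = x;
for t = 2:numel(frames)
    z = get_patch(frames{t}, pos);                 % search around the old position
    responses = fftshift(detect(model_alphaf, model_x, z, sigma));
    [row, col] = find(responses == max(responses(:)), 1);
    pos = pos + [row, col] - floor(size(responses) / 2) - 1;  % peak displacement
    x = get_patch(frames{t}, pos);                 % train at the new position
    alphaf = train(x, y, sigma, lambda);
    % Linear interpolation with the previous model, for some memory
    model_alphaf = (1 - interp_factor) * model_alphaf + interp_factor * alphaf;
    model_x = (1 - interp_factor) * model_x + interp_factor * x;
end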
8 EXPERIMENTS

8.1 Tracking pipeline

We implemented in Matlab two simple trackers based on the proposed Kernelized Correlation Filter (KCF), using a Gaussian kernel, and Dual Correlation Filter (DCF), using a linear kernel. We do not report results for a polynomial kernel as they are virtually identical to those for the Gaussian kernel, and require more parameters. We tested two further variants: one that works directly on the raw pixel values, and another that works on HOG descriptors with a cell size of 4 pixels, in particular Felzenszwalb's variant [20], [22]. Note that our linear DCF is equivalent to MOSSE [9] in the limiting case of a single channel (raw pixels), but it has the advantage of also supporting multiple channels (e.g. HOG). Our tracker requires few parameters, and we report the values that we used, fixed for all videos, in Table 2.

The bulk of the functionality of the KCF is presented as Matlab code in Algorithm 1. Unlike the earlier version of this work [29], it is prepared to deal with multiple channels, as the 3rd dimension of the input arrays. It implements 3 functions: train (Eq. 17), detect (Eq. 22), and kernel_correlation (Eq. 31), which is used by the first two functions.

The pipeline for the tracker is intentionally simple, and does not include any heuristics for failure detection or motion modeling. In the first frame, we train a model with the image patch at the initial position of the target. This patch is larger than the target, to provide some context. For each new frame, we detect over the patch at the previous position, and the target position is updated to the one that yielded the maximum value. Finally, we train a new model at the new position, and linearly interpolate the obtained values of α and x with the ones from the previous frame, to provide the tracker with some memory.

8.2 Evaluation

We put our tracker to the test by using a recent benchmark that includes 50 video sequences [11] (see Fig. 1). This dataset collects many videos used in previous works, so we avoid the danger of overfitting to a small subset.

For the performance criteria, we did not choose average location error or other measures that are averaged over frames, since they impose an arbitrary penalty on lost trackers that depends on chance factors (i.e., the position where the track was lost), making them not comparable. A similar alternative is bounding box overlap, which has the disadvantage of heavily penalizing trackers that do not track across scale, even if the target position is otherwise tracked perfectly.

An increasingly popular alternative, which we chose for our evaluation, is the precision curve [11], [5], [29]. A frame may be considered correctly tracked if the predicted target center is within a distance threshold of ground truth. Precision curves simply show the percentage of correctly tracked frames for a range of distance thresholds. Notice that by plotting the precision for all thresholds, no parameters are required. This makes the curves unambiguous and easy to interpret. A higher precision at low thresholds means the tracker is more accurate, while a lost target will prevent it from achieving perfect precision for a very large threshold range. When a representative precision score is needed, the chosen threshold is 20 pixels, as done in previous works [11], [5], [29].
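The precision curve itself is simple to compute. A minimal Matlab sketch, where pred_centers and gt_centers (per-frame predicted and ground-truth target centers, as T × 2 arrays) are assumed inputs:

% Sketch: precision curve from per-frame center location errors.
err = sqrt(sum((pred_centers - gt_centers).^2, 2));     % center error per frame
thresholds = 0:50;                                      % distance thresholds (pixels)
precision = arrayfun(@(t) mean(err <= t), thresholds);  % fraction of frames within t
plot(thresholds, precision);
score = precision(thresholds == 20);                    % representative score at 20 px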
Table 1:

Algorithm   Feature       Mean precision (20 px)   Mean FPS
KCF         HOG           73.2%                    172
DCF         HOG           72.8%                    292
KCF         Raw pixels    56.0%                    154
DCF         Raw pixels    45.1%                    278
(all four rows are the proposed trackers)
[Figure 5, four panels: Precision vs. location error threshold. Legend scores per panel:
Panel 1: KCF on HOG 0.740, DCF on HOG 0.740, Struck 0.521, TLD 0.512, KCF on raw pixels 0.480, MIL 0.455, CT 0.435, MOSSE 0.367, ORIA 0.355, DCF on raw pixels 0.350.
Panel 2: KCF on HOG 0.749, DCF on HOG 0.726, Struck 0.564, TLD 0.563, KCF on raw pixels 0.505, ORIA 0.435, MIL 0.427, CT 0.412, DCF on raw pixels 0.410, MOSSE 0.397.
Panel 3: KCF on HOG 0.650, DCF on HOG 0.632, TLD 0.576, Struck 0.539, MIL 0.393, KCF on raw pixels 0.358, CT 0.336, ORIA 0.315, DCF on raw pixels 0.261, MOSSE 0.226.
Panel 4: KCF on HOG 0.733, DCF on HOG 0.719, Struck 0.585, KCF on raw pixels 0.503, MIL 0.456, TLD 0.428, DCF on raw pixels 0.407, ORIA 0.389, CT 0.339, MOSSE 0.339.]
Figure 5: Precision plot for sequences with attributes: occlusion, non-rigid deformation, out-of-view target, and background
clutter. The HOG variants of the proposed trackers (bold) are the most resilient to all of these nuisances. Best viewed in color.
tracker have a performance very close to optimal, while TLD, CT, ORIA and MIL show degraded performance; we conjecture that this is caused by their undersampling of negatives.

We also report results for other attributes in Fig. 7. Generally, the proposed trackers are the most robust to 6 of the 7 challenges, except for low resolution, which affects equally all trackers but Struck.

9 CONCLUSIONS AND FUTURE WORK

In this work, we demonstrated that it is possible to analytically model natural image translations, showing that under some conditions the resulting data and kernel matrices become circulant. Their diagonalization by the DFT provides a general blueprint for creating fast algorithms that deal with translations. We have applied this blueprint to linear and kernel ridge regression, obtaining state-of-the-art trackers that run at hundreds of FPS and can be implemented with only a few lines of code. Extensions of our basic approach seem likely to be useful in other problems. Since the first version of this work, circulant data has been exploited successfully for other algorithms in detection [31] and video event retrieval [30]. An interesting direction for further work is to relax the assumption of periodic boundaries, which may improve performance. Many useful algorithms may also be obtained from the study of other objective functions with circulant data, including classical filters such as SDF or MACE [25], [26], and more robust loss functions than the squared loss. We also hope to generalize this framework to other operators, such as affine transformations or non-rigid deformations.

ACKNOWLEDGMENT

The authors acknowledge support by the FCT project PTDC/EEA-CRO/122812/2010, grants SFRH/BD75459/2010, SFRH/BD74152/2010, and SFRH/BPD/90200/2012.

APPENDIX A

A.1 Implementation details

As is standard with correlation filters, the input patches (either raw pixels or extracted feature channels) are weighted by a cosine window, which smoothly removes discontinuities at the image boundaries caused by the cyclic assumption [9], [21]. The tracked region has 2.5 times the size of the target, to provide some context and additional negative samples.

Recall that the training samples consist of shifts of a base sample, so we must specify a regression target for each one in y. The regression targets y simply follow a Gaussian function, which takes a value of 1 for a centered target, and smoothly decays to 0 for any other shifts, according to the spatial bandwidth s. Gaussian targets are smoother
$$K_{ij} = \kappa(P^i x, P^j x) \qquad (35)$$
$$K_{ij} = \kappa(P^{-i} P^i x, P^{-i} P^j x) . \qquad (36)$$

Using known properties of permutation matrices, this reduces to

$$K_{ij} = \kappa(x, P^{j-i} x) . \qquad (37)$$

By the cyclic nature of P, it repeats every nth power, i.e. P^n = P^0. As such, Eq. 37 is equivalent to

2. This is usually done by switching the quadrants of the output, e.g. with the Matlab built-in function fftshift. It has the same effect as shifting Fig. 6-b to look like Fig. 6-a.
3. For example Matlab, NumPy, Octave and the FFTW library.

$$\alpha = \left( F \, \mathrm{diag}(\hat{k}^{xx}) \, F^H + \lambda F I F^H \right)^{-1} y \qquad (42)$$
$$\alpha = F \left( \mathrm{diag}(\hat{k}^{xx} + \lambda) \right)^{-1} F^H y , \qquad (43)$$

which is equivalent to

$$F^H \alpha = \left( \mathrm{diag}(\hat{k}^{xx} + \lambda) \right)^{-1} F^H y . \qquad (44)$$

Since for any vector F z = ẑ, we have

$$\hat{\alpha}^* = \mathrm{diag}\left( \frac{1}{\hat{k}^{xx} + \lambda} \right) \hat{y}^* . \qquad (45)$$

Finally, because the product of a diagonal matrix and a vector is simply their element-wise product,
[Figure 7, six panels: Precision vs. location error threshold. Legend scores per panel:
Out-of-plane rotation (39 sequences): KCF on HOG 0.729, DCF on HOG 0.712, Struck 0.597, TLD 0.596, KCF on raw pixels 0.541, ORIA 0.493, MIL 0.466, DCF on raw pixels 0.428, CT 0.394, MOSSE 0.390.
Fast motion (17 sequences): Struck 0.604, KCF on HOG 0.602, DCF on HOG 0.559, TLD 0.551, KCF on raw pixels 0.441, MIL 0.396, CT 0.323, DCF on raw pixels 0.275, ORIA 0.274, MOSSE 0.213.
Illumination variation (25 sequences): KCF on HOG 0.711, DCF on HOG 0.699, Struck 0.558, TLD 0.537, KCF on raw pixels 0.448, ORIA 0.421, MOSSE 0.375, CT 0.359, MIL 0.349, DCF on raw pixels 0.309.
Scale variation (28 sequences): KCF on HOG 0.679, DCF on HOG 0.654, Struck 0.639, TLD 0.606, KCF on raw pixels 0.492, MIL 0.471, CT 0.448, ORIA 0.445, DCF on raw pixels 0.426, MOSSE 0.387.
Motion blur (12 sequences): KCF on HOG 0.650, DCF on HOG 0.588, Struck 0.551, TLD 0.518, KCF on raw pixels 0.394, MIL 0.357, CT 0.306, DCF on raw pixels 0.270, MOSSE 0.244, ORIA 0.234.
Low resolution (4 sequences): Struck 0.545, KCF on raw pixels 0.396, DCF on raw pixels 0.396, KCF on HOG 0.381, DCF on HOG 0.354, TLD 0.349, MOSSE 0.239, ORIA 0.195, MIL 0.171, CT 0.152.]
Figure 7: Precision plots for 6 attributes of the dataset. Best viewed in color. In-plane rotation was left out due to space constraints.
Its results are virtually identical to those for out-of-plane rotation (above), since they share almost the same set of sequences.
[17] B. Alexe, V. Petrescu, and V. Ferrari, "Exploiting spatial overlap to efficiently compute appearance distances between image windows," in NIPS, 2011.
[18] H. Harzallah, F. Jurie, and C. Schmid, "Combining efficient object localization and image classification," in ICCV, 2009.
[19] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, "Multiple kernels for object detection," in ICCV, 2009.
[20] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," TPAMI, 2010.
[21] R. C. González and R. E. Woods, Digital image processing. Prentice Hall, 2008.
[22] P. Dollár, R. Appel, S. Belongie, and P. Perona, "Fast feature pyramids for object detection," TPAMI, 2014.
[23] B. Schölkopf and A. Smola, Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press, 2002.
[24] D. Casasent and R. Patnaik, "Analysis of kernel distortion-invariant filters," in Proceedings of SPIE, vol. 6764, 2007, p. 1.
[25] R. Patnaik and D. Casasent, "Fast FFT-based distortion-invariant kernel filters for general object recognition," in Proceedings of SPIE, vol. 7252, 2009, p. 1.
[26] K.-H. Jeong, P. P. Pokharel, J.-W. Xu, S. Han, and J. Principe, "Kernel based synthetic discriminant function for object recognition," in ICASSP, 2006.
[27] C. Xie, M. Savvides, and B. Vijaya-Kumar, "Kernel correlation filter based redundant class-dependence feature analysis (KCFA) on FRGC2.0 data," in Analysis and Modelling of Faces and Gestures, 2005.
[28] A. Mahalanobis, B. Kumar, and D. Casasent, "Minimum average correlation energy filters," Applied Optics, 1987.
[29] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "Exploiting the circulant structure of tracking-by-detection with kernels," in ECCV, 2012.
[30] J. Revaud, M. Douze, C. Schmid, and H. Jégou, "Event retrieval in large video collections with circulant temporal encoding," in CVPR, 2013.
[31] J. F. Henriques, J. Carreira, R. Caseiro, and J. Batista, "Beyond hard negative mining: Efficient detector learning via block-circulant decomposition," in ICCV, 2013.
[32] H. K. Galoogahi, T. Sim, and S. Lucey, "Multi-channel correlation filters," in ICCV, 2013.
[33] V. N. Boddeti, T. Kanade, and B. V. Kumar, "Correlation filters for object alignment," in CVPR, 2013.
[34] R. M. Gray, Toeplitz and Circulant Matrices: A Review. Now Publishers, 2006.
[35] P. J. Davis, Circulant matrices. American Mathematical Soc., 1994.
[36] A. Vedaldi and A. Zisserman, "Efficient additive kernels via explicit feature maps," TPAMI, 2011.

João F. Henriques received his M.Sc. degree in Electrical Engineering from the University of Coimbra, Portugal in 2009. He is currently a Ph.D. student at the Institute of Systems and Robotics, University of Coimbra. His current research interests include Fourier analysis and machine learning algorithms in general, with computer vision applications in detection and tracking.

Rui Caseiro received the B.Sc. degree in electrical engineering (specialization in automation) from the University of Coimbra, Coimbra, Portugal, in 2005. Since 2007, he has been involved in several research projects, which include the European project "Perception on Purpose" and the National project "Brisa-ITraffic". He is currently a Ph.D. student and researcher with the Institute of Systems and Robotics and the Department of Electrical and Computer Engineering, Faculty of Science and Technology, University of Coimbra. His current research interests include the interplay of differential geometry with computer vision and pattern recognition.