
Detecting Flying Objects using a Single Moving Camera

Artem Rozantsev, Vincent Lepetit, and Pascal Fua, Fellow, IEEE

Abstract—We propose an approach for detecting flying objects such as Unmanned Aerial Vehicles (UAVs) and aircrafts when they
occupy a small portion of the field of view, possibly moving against complex backgrounds, and are filmed by a camera that itself moves.
We argue that solving such a difficult problem requires combining both appearance and motion cues. To this end we propose a
regression-based approach for object-centric motion stabilization of image patches that allows us to achieve effective classification on
spatio-temporal image cubes and outperform state-of-the-art techniques.
As this problem has not yet been extensively studied, no test datasets are publicly available. We therefore built our own, both for UAVs
and aircrafts, and will make them publicly available so they can be used to benchmark future flying object detection and collision
avoidance algorithms.

Index Terms—Motion compensation, object detection.

1 INTRODUCTION
We are headed for a world in which the skies are occupied not
only by birds and planes but also by unmanned drones ranging
from relatively large Unmanned Aerial Vehicles (UAVs) to much
smaller consumer ones. Some of these will be instrumented and
able to communicate with each other to avoid collisions but not
all. Therefore, the ability to use inexpensive and light sensors
such as cameras for collision-avoidance purposes will become
increasingly important.
This problem has been tackled successfully in the automotive
world, for example there are now commercial products [1], [2]
designed to sense and avoid both pedestrians and other cars.
In the world of flying machines much progress has been made
towards accurate position estimation and navigation from single
or multiple cameras [3], [4], [5], [6], [7], [8], [9], but less in the
field of visually-guided collision avoidance [10]. In particular, it is not possible to simply extend the algorithms used for pedestrian and automobile detection to the world of aircrafts and drones, as flying object detection poses some unique challenges:

• The environment is fully three dimensional, which makes the motions more complex (e.g., objects may move in any direction in 3D space and may appear in any part of the frame).
• Flying objects have very diverse shapes and can be seen against either the ground or the sky, which produces complex and changing backgrounds.
• Given the speeds involved, potentially dangerous objects must be detected when they are still far away, which means they may be very small in the images.

• A. Rozantsev is with the Computer Vision Laboratory, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland. E-mail: [email protected]
• V. Lepetit is with the Institute for Computer Graphics and Vision, Graz University of Technology, Graz, Austria.
• P. Fua is with the Computer Vision Laboratory, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.
Manuscript received June 19, 2015; revised September 17, 2014.

Figure 1: Detecting a small flying object against a complex moving background. (Left) It is almost invisible to the human eye and hard to detect from a single image. (Right) Yet, our algorithm can find it by using appearance and motion cues.

Fig. 1 illustrates some examples, where even for humans it is hard to find a flying object based just on a single image. By contrast, when looking at the sequence of frames, these objects suddenly pop up and are easily spotted, which suggests that motion cues are crucial for detection.

However, these motion cues are difficult to exploit when the images are acquired by a moving camera and feature backgrounds that are challenging to stabilize because they are non-planar and rapidly changing. Furthermore, since there may be other moving objects in the scene, such as a person in the top row of Fig. 1, motion by itself is not enough and appearance must also be taken into account.

In this paper, we detect whether an object of interest is present and constitutes a danger by classifying 3D descriptors computed from spatio-temporal image cubes. We will refer to them as st-cubes. They are formed by stacking motion-stabilized image windows over several consecutive frames, which give more information than using a single image.
Figure 2: Motion compensation for four different st-cubes of flying objects seen against different backgrounds: (a) a UAV against a uniform background, (b) a UAV against a very noisy background, (c) an aircraft against a non-uniform background, and (d) an aircraft against a noisy background. (Top) For each one, we show four consecutive patches before motion stabilization. In the leftmost plot below the patches, the blue dots denote the location of the true center of the drone and the red cross is the patch center over time. The other two plots depict the x and y deviations of the drone center with respect to the patch center. (Middle) The same four st-cubes and corresponding graphs after motion compensation using an optical flow approach, as suggested by [11]. (Bottom) The same four st-cubes and corresponding graphs after motion compensation using our approach.

What makes this approach both practical and effective is a regression-based motion-stabilization algorithm. Unlike those relying on optical flow, it remains effective even when the shape of the object to be detected is blurry or barely visible, as illustrated by Fig. 2. This is because learning-based motion compensation focuses on the object and is more resistant to complicated backgrounds than the optical flow method, as shown in Fig. 2.

St-cubes have been routinely used for action recognition purposes [12], [13], [14] using a monocular camera. By contrast, most current detection algorithms work either on a single frame or by estimating the optical flow from consecutive frames. Our approach can therefore be seen as a way to combine both appearance and motion information to achieve effective detection in a very challenging context. In our experiments, we show that this method achieves higher accuracy than either appearance- or motion-based methods alone.

We first proposed using st-cubes for flying object detection in an earlier conference paper [15]. In that initial version of our processing pipeline, we performed motion compensation using boosted trees. In this paper, we refine this idea by using deep learning techniques that yield better stabilization and, thus, better overall performance.

2 RELATED WORK

Approaches for detecting moving objects can be classified into three main categories: those that rely on appearance in individual frames, those that rely primarily on motion information across frames, and those that combine the two. We briefly review all three types in this section. In the results section, we will demonstrate that we can outperform state-of-the-art representatives of each class.

Appearance-based methods rely on Machine Learning and have proved to be powerful even in the presence of complex lighting variations or cluttered backgrounds. They are typically based on Deformable Part Models (DPM) [16], Convolutional Neural Networks (CNN) [17], or Random Forests [18]. Among them, the Aggregate Channel Features (ACF) algorithm [19] is considered one of the best.

These approaches work best when the target objects are sufficiently large and clearly visible in individual images, which is often not the case in our applications. For example, in the images of Fig. 1, the object is small and almost impossible to make out from the background without motion cues.

Motion-based approaches can themselves be subdivided into two subclasses. The first comprises those that rely on background subtraction [20], [21], [22], [23] and determine objects as groups of pixels that are different from the background. The second includes those that depend on optical flow [24], [25], [26].

Background subtraction works best when the camera is static or its motion is small enough to be easily compensated for, which is not the case for the on-board camera of a fast moving aircraft. Flow-based methods are more reliable in such situations but remain critically dependent on the quality of the flow vectors, which tends to be low when the target objects are small and blurry. Some methods combine both optical flow and background subtraction [27], [28]. However, in our case there may be motion in different parts of the images, for example people or treetops, so motion information alone is not enough for reliable flying object detection.
Figure 3: Object detection pipeline with st-cubes and motion compensation. Given a set of video frames from the camera, we use a multi-scale sliding window approach to extract st-cubes. We then process every patch of the st-cube to compensate for the motion of the aircraft and then run the detector. (best seen in color)

Other methods that combine optical flow and background subtraction, such as [29], [30], [31], [32], still critically depend on optical flow, which is often estimated with [26], and may thus suffer from the low quality of the flow vectors. In addition to this dependence on optical flow, [31] assumes that the camera motion is translational, which is violated in aerial videos.

Hybrid approaches combine information about object appearance and motion patterns and are therefore the closest in spirit to what we propose. For example, in [33], histograms of flow vectors are used as features in conjunction with more standard appearance features and are fed to a statistical learning method. This approach was refined in [11] by first aligning the patches to compensate for motion and then using the differences of frames, which may or may not be consecutive, as additional features. The alignment relies on the Lucas-Kanade optical flow algorithm [25]. The resulting algorithm works very well for pedestrian detection and outperforms most of the single-frame methods. However, when the target objects become smaller and harder to see, the flow estimates become unreliable and this approach, like the purely flow-based ones, becomes less effective.

3 DETECTION FRAMEWORK

Our detection pipeline is illustrated by Fig. 3 and comprises the following steps:
• Divide the video sequence into N-frame overlapping temporal slices. The larger the overlap is, the higher the precision, but only up to a point. Our experiments show that making the overlap more than 50% increases computation time without improving performance. Thus, 50% is what we used.
• Build st-cubes from each slice using a sliding window approach, independently at each scale.
• Apply our motion compensation algorithm to the patches of each of the st-cubes to create stabilized st-cubes.
• Classify each st-cube as containing an object of interest or not.
• Since each scale has been processed independently, we perform non-maximum suppression in scale space. If there are several detections for the same spatial location at different scales, we only retain the highest-scoring one. As an alternative to this simple scheme, we have developed a more sophisticated learning-based one, which we discuss in more detail in Section 6.4. (A code sketch of the overall pipeline is given below.)
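The following sketch illustrates the control flow of these five steps only. It is not the authors' implementation: classify(), compensate_motion(), and nms_in_scale_space() are hypothetical placeholders for the components described in Sections 3 and 4, and the window size, scale set, and temporal overlap are illustrative values.

```python
# Minimal sketch of the detection pipeline described above (not the authors' code).
import numpy as np

def resize(img, scale):
    """Nearest-neighbour down-scaling, good enough for a sketch."""
    step = max(int(round(1.0 / scale)), 1)
    return img[::step, ::step]

def extract_st_cubes(frames, sx=40, sy=40):
    """Slide a non-overlapping window over a slice of frames and stack patches."""
    h, w = frames[0].shape
    for i in range(0, h - sy + 1, sy):
        for j in range(0, w - sx + 1, sx):
            yield (i, j), np.stack([f[i:i + sy, j:j + sx] for f in frames])

def detect(video, classify, compensate_motion, nms_in_scale_space,
           st=7, scales=(1.0, 0.5, 0.25)):
    detections = []
    step = max(st // 2, 1)                        # roughly 50% temporal overlap
    for start in range(0, len(video) - st + 1, step):
        slice_frames = video[start:start + st]
        for s in scales:                          # each scale processed independently
            frames = [resize(f, s) for f in slice_frames]
            for (i, j), cube in extract_st_cubes(frames):
                cube = compensate_motion(cube)    # re-center the object in every patch
                score = classify(cube)            # +1 object / -1 background
                if score > 0:
                    detections.append((start, s, i, j, score))
    return nms_in_scale_space(detections)         # keep the best detection across scales
```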
In this section, we introduce two separate approaches—one based on boosted trees, the other on Convolutional Neural Networks—to deciding whether or not an st-cube contains a target object, and we will compare their respective performance in Section 5. We will discuss motion compensation in Section 4.

More specifically, we want to train a classifier that takes as input st-cubes such as those depicted by Fig. 4 and returns 1 or −1, depending on the presence or absence of a flying object. Let (s_x, s_y, s_t) be the size of our st-cubes. For training purposes, we use a dataset of pairs (b_i, y_i), i ∈ [1, N], where b_i ∈ R^{s_x × s_y × s_t} is an st-cube, in other words s_t image patches of resolution s_x × s_y pixels. The label y_i ∈ {−1, 1} indicates whether or not a target object is present.

3.1 3D HoG with Gradient Boost

The first approach we tested relies on boosted trees [34] to learn a classifier ψ(·) of the form ψ(b) = Σ_{j=1}^{H} α_j h_j(b), where α_{j=1..H} are real-valued weights, b ∈ R^{s_x × s_y × s_t} is the input st-cube, h_j : R^{s_x × s_y × s_t} → R are weak learners, and H is the number of selected weak learners, which controls the complexity of the classifier. The α's and h's are learned in a greedy manner, using the Gradient Boost algorithm [34], which can be seen as an extension of the classic AdaBoost to real-valued weak learners and more general loss functions.

In standard Gradient Boost fashion, we take our weak learners to be regression trees h_j(b) = T(θ_j, HoG3D(b)), where θ_j denotes the tree parameters and HoG3D(b) the 3-dimensional Histograms of Gradients (HoG3D) computed for b. HoG3D was introduced in [14] and can be seen as an extension of the standard HoG [35] with an additional temporal dimension. It is fast to compute, has proved to be robust to illumination changes in many applications, and allows us to combine appearance and motion efficiently.

At each iteration j, the weak learner h_j(·) with the corresponding weight α_j is taken as the one that minimizes the exponential loss function

    (h_j(·), α_j) = argmin_{h(·), α} Σ_{i=1}^{N} exp(−y_i (ψ_{j−1}(b_i) + α h(b_i))) .        (1)

The tests in the nodes of the trees compare one coordinate of the HoG3D vector with a threshold, both selected during the optimization.
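A minimal sketch of this classifier is given below. It is not the implementation used in the paper: scikit-learn's GradientBoostingClassifier stands in for the boosted-tree library the authors cite, and hog3d() is a crude hypothetical placeholder for the HoG3D descriptor of [14]; only the additive, exponential-loss structure of Eq. (1) is illustrated.

```python
# Sketch of psi(b) = sum_j alpha_j * h_j(HoG3D(b)) trained with an exponential loss.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def hog3d(cube, n_bins=24):
    """Placeholder descriptor: histogram of 3D gradient orientations (not the real HoG3D)."""
    gz, gy, gx = np.gradient(cube.astype(np.float64))
    angles = np.arctan2(gy, gx).ravel()
    mags = np.sqrt(gx ** 2 + gy ** 2 + gz ** 2).ravel()
    hist, _ = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi), weights=mags)
    return hist / (hist.sum() + 1e-9)

def train_hbt_detector(cubes, labels):
    """cubes: list of (s_t, s_y, s_x) arrays; labels: +1 / -1."""
    X = np.array([hog3d(c) for c in cubes])
    y = (np.array(labels) > 0).astype(int)
    # 1500 depth-2 trees, the setting reported later for HBT-Detection.
    clf = GradientBoostingClassifier(n_estimators=1500, max_depth=2, loss="exponential")
    return clf.fit(X, y)

def classify(clf, cube):
    return 1 if clf.predict([hog3d(cube)])[0] == 1 else -1
```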
3.2 Convolutional Neural Networks

Since Convolutional Neural Networks (CNN) [36] have proved very successful in many detection problems, we have tested them as an alternative classification method. We use the architecture depicted by Fig. 5, which alternates convolutional layers and pooling layers. Convolutional layers use 3D linear filters while pooling layers apply max-pooling in 2D spatial regions only.
Figure 4: Sample patches of the UAVs and aircrafts. Each row corresponds to a single st-cube and illustrates different possible motions that an aircraft could have.

Figure 5: The structure of the Convolutional Neural Network which we used for flying object detection. CL, PL and FL correspond to Convolution, Pooling, and Fully-connected layers respectively.

Figure 6: Structure of the CNNs used for motion compensation. (Top) The first network (coarse alignment) uses extended patches to correct for the large displacements of the aircraft. (Bottom) The second network (refinement) is applied after rectification by the motion predicted by the first network, and is designed to correct for the small motions.

The last layer is fully connected and outputs the probability that the input st-cube contains an object of interest. We use the hyperbolic tangent function as the non-linear operator [37].

The input to our CNN is a normalized st-cube

    η = (b − µ(b)) / σ(b) ,        (2)

where µ(b) and σ(b) are the mean and standard deviation of the pixel intensities in b, respectively. Normalization is an important step because the optimization of the network parameters fails to converge when using raw image intensities.

During training, we write the probability that an st-cube η contains an object of interest (y = 1) or is part of the background (y = 0) as

    P(Y = y | η) = exp(CNN(η)[y]) / (exp(CNN(η)[0]) + exp(CNN(η)[1])) ,  y ∈ {0, 1} ,        (3)

where CNN(η)[y] denotes the classification score that the network predicts for η as being part of class y and exp(·) denotes the exponential function. We then minimize the negative log-likelihood

    L(W, bias) = − Σ_{k=1}^{N} log P(Y = y_k | η_k)        (4)

with respect to the CNN parameters. Here (η_k, y_k) are pairs of normalized st-cubes and their corresponding labels from the training dataset, as defined in Section 3. To this end, we use the algorithm of [38] combined with Dropout [39] to improve generalization.

We tried many different network configurations, in terms of the number of filters per layer and the size of the filters. However, they all yield similar performance, which suggests that only minor improvements could be obtained by further tweaking the network. We also tried varying the dimensions of the st-cube. These variations have a more significant influence on performance, which will be evaluated in Section 5.
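The three equations above can be illustrated numerically. The sketch below is not the Theano training code; it only transcribes Eqs. (2)-(4), with the two scores standing in for the CNN forward pass.

```python
# Numerical illustration of Eqs. (2)-(4): normalization, softmax probability,
# and the negative log-likelihood minimized during training.
import numpy as np

def normalize(b):
    """Eq. (2): eta = (b - mean(b)) / std(b)."""
    return (b - b.mean()) / (b.std() + 1e-9)

def class_probabilities(scores):
    """Eq. (3): softmax over the two scores CNN(eta)[0], CNN(eta)[1]."""
    e = np.exp(scores - scores.max())       # subtract the max for numerical stability
    return e / e.sum()

def negative_log_likelihood(all_scores, labels):
    """Eq. (4): sum of -log P(Y = y_k | eta_k) over the training pairs."""
    return -sum(np.log(class_probabilities(s)[y]) for s, y in zip(all_scores, labels))

# Toy usage with made-up scores for three st-cubes:
scores = [np.array([0.2, 1.5]), np.array([2.0, -0.3]), np.array([-1.0, 0.4])]
labels = [1, 0, 1]
print(negative_log_likelihood(scores, labels))
```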
4 MOTION COMPENSATION

Neither of the two approaches to classifying st-cubes introduced in the previous section accounts for the fact that both the gradient orientations used to build the 3D HoG and the filter responses in the CNN case are biased by the global object motion. This makes the learning task much more difficult, and we propose to use motion compensation to eliminate this problem. Motion compensation will allow us to accumulate visual evidence from multiple frames, without adding variation due to the object motion. We therefore aim at centering the target object, so that when present in an st-cube, it remains at the center of all its image patches.

More specifically, let I_t denote the t-th frame of the video sequence and (i, j) some pixel position in it. The st-cube b_{i,j,t} is the 3D array of pixel intensities from images I_z with z ∈ [t − s_t + 1, t] at image locations (k, l) with k ∈ [i − s_x + 1, i] and l ∈ [j − s_y + 1, j], as depicted by Fig. 4. Correcting for motion can be formulated as allowing the patches m_{i,j,z}, z ∈ [t − s_t + 1, t], of the st-cube to shift horizontally and vertically in individual images.

In [11], these shifts are computed using optical flow information, which has been shown to be effective for pedestrians occupying a large fraction of the patch and moving relatively slowly from one frame to the next. However, as can be seen in Fig. 4, these assumptions do not hold in our case and we will show in Section 6 that this negatively impacts performance. To overcome this difficulty, we introduce instead a learning-based approach to compensate for motion and keep the object in the center of the m_{i,j,z} patches of the st-cube even when the target object's appearance changes drastically.

More specifically, we treat the motion compensation problem as a regression task: given a single image patch, we want to predict the 2D translation that best centers the target object. By rectifying all the image patches in an st-cube with their predicted translation, we can then align the images of the object of interest together.
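A minimal sketch of this formulation is given below. It assumes a hypothetical per-patch regressor phi() that returns a 2D shift in pixels; the clamping of the window to the image bounds is an implementation detail of the sketch, not of the paper.

```python
# Sketch: build an st-cube from the last s_t frames while shifting each patch
# m_{i,j,z} by the translation a regressor phi() predicts for it.
import numpy as np

def crop(frame, i, j, sy, sx):
    """Patch with its top-left corner clamped to the image bounds."""
    h, w = frame.shape
    i, j = min(max(i, 0), h - sy), min(max(j, 0), w - sx)
    return frame[i:i + sy, j:j + sx]

def stabilized_st_cube(frames, i, j, sy, sx, phi):
    patches = []
    for frame in frames:                       # one patch m_{i,j,z} per frame
        patch = crop(frame, i, j, sy, sx)
        dx, dy = phi(patch)                    # predicted shift of the object
        patch = crop(frame, i + int(round(dy)), j + int(round(dx)), sy, sx)
        patches.append(patch)
    return np.stack(patches)                   # (s_t, s_y, s_x) array
```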
Figure 7: Combining multiple detections in several images of a video sequence. The red square and dots depict the positions of the original detection across the 50 frames preceding two different images. The green square and dots illustrate the position of the same detections after refinement. They are superposed and form much smoother trajectories. (best seen in color)

Figure 8: Sample image patches containing aircrafts or UAVs from our datasets: (a) UAV dataset, (b) Aircraft dataset.

4.1 Boosted tree-based regressors

One way to predict the translation for an input patch m is to train two different boosted trees regressors [40], φ_x(m) and φ_y(m), one for each 2D direction (horizontal and vertical).

As for detection, we use regression trees h_j(m) = T(θ_j, HoG(m)) as weak learners, where HoG(m) denotes the Histograms of Oriented Gradients for patch m. The difference is that we minimize here a quadratic loss function instead of an exponential one,

    L(r, φ*(m)) = (r − φ*(m))² ,        (5)

where m is the input patch, r the corresponding expected 2D vector, and φ*(m) = [φ_x(m), φ_y(m)]ᵀ the 2D vector predicted by the two regression trees.

We then apply these regressors in an iterative way: we obtain a first estimate of the shift of the target object—if present—from the center of the patch. We translate it according to this estimate, and we re-apply the regressors. We iterate until both shift estimates drop to 0 or the algorithm reaches a preset number of iterations. In practice, 4 to 5 iterations are enough to achieve good accuracy.
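The iterative application can be summarized by the short sketch below, where phi_x and phi_y are the two trained regressors (abstract callables in this sketch) and the iteration cap is the 4-5 passes mentioned above.

```python
# Sketch of the iterative re-centring loop: predict a shift, move the window,
# repeat until the predicted shift vanishes or a preset number of iterations is reached.
def recenter(frame, i, j, sy, sx, phi_x, phi_y, max_iter=5):
    for _ in range(max_iter):
        patch = frame[i:i + sy, j:j + sx]
        dx, dy = phi_x(patch), phi_y(patch)
        if round(dx) == 0 and round(dy) == 0:     # both shift estimates dropped to 0
            break
        i, j = i + int(round(dy)), j + int(round(dx))
    return i, j                                   # corrected window position
```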
4.2 CNN-based regressors

Another possible approach is to use a Convolutional Neural Network (CNN) to solve the regression task. CNNs are more flexible, as features are learned directly from the training data, in contrast to the hand-designed HoG features we need to use with our boosted tree-based regressors.

We trained two separate CNNs whose structure is depicted by Fig. 6. Note that there is no pooling layer after the first convolutional one. This is because pooling layers are typically used not only to reduce computational complexity but also to achieve invariance to small motions. In our case, such invariance would be counter-productive because these motions are precisely what we are trying to estimate. Furthermore, the computational complexity remains manageable even without the first pooling layer. We trained the first CNN using examples involving large 2D translations (coarse-CNN) and the second using smaller ones (fine-CNN). In practice we use the latter to refine the predictions of the former. As when using boosted trees, we apply the CNN regressors iteratively until convergence, as described at the end of Section 4.1: we first correct for large displacements by applying coarse-CNN several times, and we then apply fine-CNN, which is trained to compensate for small shifts of the object, for a couple more iterations.

In fact, we also tried training two different boosted-tree regressors such as those discussed in Section 4.1. Unlike in the case of the CNN regressors, this produced no significant improvement. This likely happens because our boosted trees motion compensation algorithm is based on HoG, where histograms are computed over bins of fixed size. This, in fact, introduces invariance to small deviations of objects, which makes it hard to achieve high localization precision.
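The coarse-to-fine schedule just described can be sketched as below. The two networks are abstract callables here, and the numbers of coarse and fine passes are illustrative placeholders for the "several times" and "couple more iterations" of the text.

```python
# Sketch of the coarse-to-fine schedule: several passes of the coarse CNN for
# large displacements, then a couple of passes of the fine CNN for residual shifts.
def coarse_to_fine_shift(patch_at, i, j, coarse_cnn, fine_cnn, n_coarse=3, n_fine=2):
    """patch_at(i, j) returns the (possibly extended) patch currently centred at (i, j)."""
    for _ in range(n_coarse):
        dx, dy = coarse_cnn(patch_at(i, j))
        i, j = i + int(round(dy)), j + int(round(dx))
    for _ in range(n_fine):
        dx, dy = fine_cnn(patch_at(i, j))
        i, j = i + int(round(dy)), j + int(round(dx))
    return i, j
```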
4.3 Motion Compensated st-cubes

Once the regressors have been trained, we use them to compensate for motion and build the st-cubes that we will use as input for classification, as depicted by Fig. 3. Fig. 2 illustrates several st-cubes of a drone from the testing dataset before and after motion compensation, using either optical flow from [11] or our approach. Note that the latter tends to keep the target object much closer to the center, especially when the background is non-uniform and noisy or under lighting changes.

Part of the difficulty in detecting fast moving flying objects is that they can appear anywhere in the 3D environment and that their apparent size can vary enormously. This makes it necessary to scan the whole image at different scales using a sliding window to avoid missing anything, which is computationally expensive.

Fortunately, our motion compensation scheme frees us from the need to evaluate every image position. When there is a target object, our algorithm automatically shifts the patch so it is in the center. As a result, instead of having to test windows centered at every pixel location, we only have to check non-overlapping ones, because the algorithm will automatically shift their location to center the target object when one is present. This also makes it unnecessary to use heuristics such as non-maximum suppression, as all the detections that arise from a single object will be shifted to the same position. The duplicates can therefore easily be removed, leaving us with just a single detection per object, as illustrated by Fig. 7.

As discussed in Section 3, we process each scale independently. We then perform non-maximum suppression in scale-space as a final step.

5 DESIGNING THE OPTIMAL APPROACH

The two key components of our pipeline are motion compensation and classification of the st-cubes, both of which can be implemented using either CNNs or hand-designed features. In this section, we test the various possible combinations and justify the parameter choices we made for the final evaluation of our whole approach against several baselines, as described in Section 6.
Figure 9: An object's apparent size can change enormously depending on its pose and distance to the camera. We therefore use a sliding window approach at different resolutions. The green boxes denote detections by our algorithm, which successfully handles background, lighting, scale, and pose changes. (Top) UAV dataset. (Bottom) Aircraft dataset.

Since the problem of detecting small flying objects has not yet received extensive attention from our community, there is not yet any standard dataset that can be used for testing purposes. We therefore built our own, one for UAVs and one for planes. We first describe them and then describe our testing protocol and the metrics we used for evaluation purposes. Finally, we perform the above-mentioned comparisons and demonstrate that the best results are obtained by using the CNN approach of Section 4.2 for motion compensation and the HoG3D descriptors of Section 3.1 for actual detection.

5.1 Datasets

To evaluate the performance of our approach, we built two separate datasets. They feature many real-world challenges including fast illumination changes and complex backgrounds, such as those created by moving treetops seen against a changing sky. They are as follows.
• UAV dataset. It comprises 20 video sequences, each containing 4000 frames of 752 × 480 pixels on average. They were acquired by a camera mounted on a drone filming similar ones while flying indoors and outdoors. The outdoor sequences present a broad variety of lighting and weather conditions. All these videos contain up to two objects of the same category per frame. However, the shape of the drones is rarely perfectly visible and thus their appearance is extremely variable due to changing altitudes, lighting conditions, and even aliasing and color saturation caused by their small apparent sizes. Fig. 8(a) illustrates the variety of appearance of a drone in this dataset.
• Aircraft dataset. It consists of 20 publicly available videos of radio-controlled planes. Some videos were acquired by a camera on the ground and the rest was filmed by a camera on board an aircraft. These videos vary in length from hundreds to thousands of frames and in resolution from 640 × 480 to 1280 × 720. Fig. 8(b) depicts the variety of plane types. The aircrafts may also appear under different angles, which makes the problem more complex. Fig. 9 shows some examples of the pose variation that a plane could have throughout the video sequence.

Figure 10: Examples of motion compensation on the UAV dataset (top), the Aircraft dataset (middle), and several failure cases (bottom). The first image in each pair shows the middle patch of the original st-cube, coming from the sliding window. The second image corresponds to the same patch after applying our motion compensation algorithm. Failure cases are often due to motion estimation failures, which happen when the appearance of the object is heavily corrupted by noise.

5.2 Training and Testing

In all cases, we used half of the data to train the regressors and detectors. We manually supplied 8000 bounding boxes centered on a UAV and 4000 on a plane.

We used the Boosted trees implementation of [41] for both regression and detection. To compute the HoG3D and HoG descriptors, we used the publicly available implementations of [14] and [42], respectively. We used Theano [17] to build the CNN models for both regression and detection tasks. In both of these cases we used the method described in [38] for optimization.
Figure 11: Influence of the st-cube sizes on the performance of the Boosted trees (HBT-Detection) and CNN (CNN-Detection) detectors with the CNN-based motion compensation method, as described in Section 5.2.3. (Left) UAV dataset. (Right) Aircraft dataset. The plots are colored according to the MR|FPPI=1 criterion (introduced in Section 5.2.2): blue corresponds to a higher MR|FPPI=1, red to a lower one. The darker lines on both plots correspond to the best performing examples of the two different types of machine learning algorithms, according to the same criterion. The evaluation was performed on the validation subsets of the UAV and Aircraft datasets. (best seen in color)

The structures of the CNNs for detection and motion compensation are depicted by Figs. 5 and 6, respectively. The parameters of each layer—the numbers of filters per layer and their dimensions—are given in the figures in the format N × (k_x, k_y, k_t), where N and (k_x, k_y, k_t) are the number of filters and their sizes, respectively.

5.2.1 Training the Motion Regressors

To provide labeled examples where the aircraft or UAV is not in the center of the patch but still at least partially within it, we randomly shifted the ground truth bounding boxes by a translation of magnitude up to half of their sizes. This step was repeated for all the frames of the training database to cover the variety of shapes and backgrounds in front of which the aircraft might appear.

Applying large translations to the training data allows us to run the detection on only non-overlapping patches without missing the target, as explained at the end of Section 4.3. This procedure allows us to generate as much training data as needed for both the Boosted trees (HBT-Regression) and CNN regressors (CNN-Regression), which is important for performance, especially as the latter is known to require large amounts of training data.

The apparent size of the objects in the UAV and Aircraft datasets varies from 10 to 100 pixels. To train the regressor, we used 40 × 40 patches containing the UAV or aircraft shifted from the center.

The CNN-based regressor relies on convolutions of the original patch with filters from different network layers, which may produce artifacts close to the patch borders and degrade performance when the object is only partially visible. To reduce the influence of such artifacts, we extend the input patch by 25% in both the horizontal and vertical directions. This needs to be done only for the coarse alignment CNN, as depicted by the top row of Fig. 6. It is not required for the refinement CNN, which only estimates small motions.

Fig. 10 depicts some examples of motion compensation. Note that even though both aircrafts and drones appear in front of changing backgrounds, the motion compensation algorithm correctly estimates the object location within the patch. Fig. 10 also illustrates some cases in which the motion compensation system is unable to correctly predict the location of the object in the patch. This typically occurs when the patches are very noisy and the object is almost not visible.

To handle the wide range of flying objects' apparent sizes, we use a multi-scale sliding window detector. Fig. 9 shows the same UAV and plane appearing at various distances from the camera throughout the video sequence.

5.2.2 Evaluation Metrics

In our experiments, we consider an object to be correctly detected if there is 50% overlap between the detected bounding box and the ground-truth bounding box.

We report precision-recall curves. Precision is computed as the number of true positives detected by the algorithm divided by the total number of detections. Recall is the number of true positives divided by the number of positive test examples. Additionally we use the Average Precision measure, which we take to be the integral ∫₀¹ p(r) dr, where p is the precision and r the recall.

We also report the log-average miss-rate (MR) with respect to the average number of false positives per image (FPPI). The miss-rate is computed as the number of true positives missed by the detector, divided by the total number of true positives; FPPI is computed as the total number of false positives, divided by the total number of images in the testing dataset:

    MR = 1 − N_d / N_tp ,
    FPPI = N_fd / N_f ,        (6)

where N_d, N_fd, N_tp, N_f are the number of true and false detections, the number of positively labeled examples, and the number of frames in the test set, respectively.
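The measures above translate directly into code; the following sketch only transcribes the definitions (average precision approximated by a trapezoidal integral of p(r)).

```python
# Sketch of the evaluation measures: precision, recall, average precision as the
# integral of p(r), and the MR / FPPI pair of Eq. (6).
import numpy as np

def precision_recall(n_true_det, n_det, n_pos):
    precision = n_true_det / max(n_det, 1)
    recall = n_true_det / max(n_pos, 1)
    return precision, recall

def average_precision(precisions, recalls):
    """Approximate the integral of p(r) dr, with recall sorted in increasing order."""
    order = np.argsort(recalls)
    return float(np.trapz(np.asarray(precisions)[order], np.asarray(recalls)[order]))

def mr_fppi(n_true_det, n_false_det, n_pos, n_frames):
    """Eq. (6): MR = 1 - N_d / N_tp, FPPI = N_fd / N_f."""
    return 1.0 - n_true_det / max(n_pos, 1), n_false_det / max(n_frames, 1)
```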
5.2.3 Motion Compensation Performance Analysis

Prior to evaluating the detection accuracy of the methods, we need to apply motion compensation to the st-cubes, and we therefore need to evaluate which motion compensation method performs best. To this end, we created a validation dataset by selecting one video from each dataset. These videos are then used to generate data, using the method introduced in Section 5.2.1. We use the validation set to tune the parameters and then perform the comparison against competing approaches on the test set.
We compare HBT-Regression and CNN-Regression in terms of Root Mean Square Error (RMSE). More formally, we are given a validation set of pairs (X_i, S_i^a), i ∈ 1..N, where X_i is a patch and S_i^a ∈ R² corresponds to the true shift of the object from the center of the patch. Let also S_i^p ∈ R², S_i^p = φ(X_i), be the prediction of the shift of the object, obtained by the motion compensation system. Then the RMSE is computed using the following equation:

    RMSE = sqrt( (1/N) Σ_{i=1}^{N} (S_i^p − S_i^a)² ) .        (7)

Note that S_i^p and S_i^a do not depend on the size of the patch. Table 1 depicts the results of this comparison. CNN-Regression outperforms HBT-Regression on both datasets. For reference we also provide RMSE_0, which is computed as:

    RMSE_0 = sqrt( (1/N) Σ_{i=1}^{N} (S_i^a)² ) .        (8)

RMSE_0 reflects the case when no motion compensation is applied.

Table 1: Performance of motion compensation methods. The evaluation was performed on the validation subsets of the UAV and Aircraft datasets.

    method                             RMSE, UAV dataset    RMSE, Aircraft dataset
    No motion compensation (RMSE_0)    0.1474               0.1451
    HBT-Regression                     0.0939               0.0805
    CNN-Regression                     0.0669               0.0749
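Eqs. (7) and (8) can be transcribed as follows; the only interpretation made in this sketch is that the squared term is the squared Euclidean norm of the 2D shift difference.

```python
# Sketch of Eqs. (7) and (8): RMSE of predicted shifts against ground-truth shifts,
# and RMSE_0 obtained when no compensation is applied.
import numpy as np

def rmse(pred_shifts, true_shifts):
    pred, true = np.asarray(pred_shifts), np.asarray(true_shifts)
    return float(np.sqrt(np.mean(np.sum((pred - true) ** 2, axis=1))))

def rmse_no_compensation(true_shifts):
    true = np.asarray(true_shifts)
    return float(np.sqrt(np.mean(np.sum(true ** 2, axis=1))))
```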
We therefore used the CNN-Regression algorithm to produce a number of aligned st-cubes of sizes ranging from (s_x, s_y, s_t) = (28, 28, 4) to (s_x, s_y, s_t) = (40, 40, 11), some of which we used for training and others for testing. For patches smaller than 40 × 40, we simply upscale them to 40 × 40 before applying the motion compensation regressors. The choice of s_t controls the trade-off between detecting far away objects using large values and closer ones using smaller ones. This is because, when the object is very close, the apparent motion may become too large for our motion compensation scheme. We found that increasing s_t beyond 11 did not bring any improvement in performance, while decreasing it below 4 left us with too little motion information.

As described above, we have used the same video sequences to select the most appropriate size s_t for the st-cube. Fig. 11 summarizes our experiments in terms of average miss-rate curves. The legend of the plots describes the set-up used during the experiments. The numbers in brackets correspond to the (s_x, s_y, s_t) dimensions of the st-cube. The curves in the legend are ordered by the MR|FPPI=1 measure, so the lowest curve corresponds to the best performing set-up. For the different detection algorithms we show the best performing results by making the curves darker.

The classifiers of Section 3.1 rely on boosted trees operating on HoG3D descriptors [14]. We computed them using the default parameters, that is, 24 orientations per bin of size 4 × 4 × 2 pixels. The Boosted trees detector uses 1500 trees of depth 2. We will further refer to this method as HBT-Detection.

Figure 12: Comparison of motion compensation methods on the test subsets of our datasets ((a) UAV dataset, (b) Aircraft dataset). For all the motion compensation algorithms we have used the same HBT-Detection approach, as it proved to be more accurate than CNN-Detection. Unlike the optical flow-based algorithm, our regression-based ones properly identify the shift in object position and correct for it, even when the background is complex and the object outlines are barely visible. This yields a better precision/recall. The table at the bottom of the figure gives the Average Precision score for the methods presented above.

    Average Precision of the HBT-Detection algorithm
    with different motion compensation methods       UAV dataset    Aircraft dataset
    No motion compensation                            0.485          0.497
    Optical flow                                      0.540          0.652
    HBT-Regression                                    0.751          0.789
    CNN-Regression                                    0.849          0.864

For the CNNs of Section 3.2, we tried different network configurations, with variations of the number and size of filters in the convolutional layers and varying numbers of fully connected layers. In the end, they all yielded very similar results. The final configuration that we used is illustrated by Fig. 5. We will refer to this method as CNN-Detection.

As depicted by Fig. 11, the HBT and CNN detectors perform similarly on the plane dataset, but the former clearly outperforms the latter on the UAV dataset when we allow a single false positive per frame on average. This may seem surprising, but similar behaviors have been reported in [43], where the top four methods rely on decision forests while the Deep Learning approach ranks only sixth. In our case, this may be attributable to the size of the training database not being large enough to take full advantage of the power of CNNs. Furthermore, for tasks that require as few false positives as possible, the CNNs win.

In any event, these experiments suggest that the optimal dimension of the st-cube depends on the task at hand. The apparent size of the UAVs is small, which favors a large temporal dimension. As can be seen in Fig. 11(a), the best results are obtained for s_t = 11. By contrast, the Aircraft dataset comprises examples of planes flying at many different distances from the camera. In this case, s_t = 7 is optimal for both HoG3D descriptors and CNNs.
5.2.4 Detection-Based Evaluation

Another way to evaluate our motion compensation algorithm is to compare detectors trained on data processed with either the HBT-Regression or the CNN-Regression method. This measures the influence our motion compensation algorithm has on the accuracy of the detector, which is what we are interested in. We have chosen the HBT-Detection method for the detection task, as it is faster to train and it showed better accuracy on the validation set in the experiments depicted by Fig. 11. We compared our two methods described in Section 4 with an optical flow based method [11], which is probably the best available.

Fig. 12 illustrates the results of this comparison. We also provide the performance of the same detector, trained and tested on the data without motion stabilization, for reference.

Our methods are able to correctly compensate for the UAV motion even in cases where the background is complex and the drone might not be visible due to image saturation and noise. Fig. 2(b,d) illustrates this hard situation with an example. On the contrary, the optical flow method is more focused on the background, which decreases its performance. Fig. 2(c) shows an example of a relatively easy situation, where the aircraft is clearly visible, but the optical flow algorithm fails to correctly compensate for its movement, while our regression-based approach succeeds. Fig. 2(a) illustrates another situation, where the object is not in the center of the patch for the middle image of the st-cube. Optical flow methods will align the other patches of the st-cube with respect to the middle one, which results in the object being shifted from the center in all the st-cube patches. By contrast, our motion compensation algorithm does not require any reference frame, leading to higher accuracy.

Using motion compensation for alignment of the st-cubes results in a higher performance of the detectors, as the in-class variation of the data is decreased. Fig. 12 shows that we can achieve at least a 15% improvement in average precision on both datasets using our motion compensation algorithm.

Our CNN-based motion compensation algorithm performs best. It yields about a 10% increase in accuracy compared to the boosted trees method. This difference in performance most likely lies in the nature of the features used by these machine learning techniques. The boosted trees regressor uses HoG features, which might not be perfectly suited for the problem, while the filters in the CNN are learned directly from the data. As the CNN obtains better accuracy, we will use the CNN-based motion compensation in our further experiments.

6 COMPARING AGAINST COMPETING METHODS

In this section, we compare the performance of the pipeline of Section 3, optimized as described in Section 5, against several state-of-the-art algorithms on the two challenging datasets introduced in Section 5.1. For these experiments, we therefore use st-cubes whose sizes are (28, 28, 11) for UAVs and (28, 28, 7) for planes, which are those we determined to yield the lowest miss-rates when we use HoG3D descriptors for detection and CNNs for motion compensation.

We first list the algorithms we use as baselines and show that ours outperforms them consistently both for plane and UAV detection. We then demonstrate that motion compensation does not significantly degrade performance in cases when it is not strictly needed, such as when two aircrafts are on a collision course.

6.1 Baselines

To demonstrate the effectiveness of our approach, we compare it against state-of-the-art algorithms. We chose them to be representative of the three different ways the problem of detecting small moving objects can be approached, as discussed in Section 2.

Figure 13: Comparing against appearance-based approaches [16], [17], [19], [44] in terms of precision/recall. For both the UAV and Aircraft datasets, the blue curve depicts our approach and is significantly above the others.

• Appearance-Based Approaches rely on detection in individual frames. We will compare against Deformable Part Models (DPM) [16], single-frame based Convolutional Neural Networks (s-f CNN-based detector) [17], Random Forests [44], and the Aggregate Channel Features method (ACF) [19], the latter being widely considered to be among the best.
Since our algorithm considers st-cubes, for a fair comparison with these single-frame algorithms we proceed as follows. Similarly to our approach, we divide the video sequence into a set of N-frame overlapping slices. We further extract st-cubes using a sliding window approach, but motion compensation is not applied. We then run the single frame based detector on each of the patches of these st-cubes and consider the whole st-cube b as positive if the weighted average of the scores of the patches in b is positive. We use a simple Gaussian kernel G centered on the middle frame of b as a weighting function. G is defined as G = exp(−(i − s_t/2)² / (2σ²)), where s_t is the filter size and σ is taken as σ = 0.3((s_t − 1)/2 − 1) + 0.8, as is often done. We tried simply averaging over the detection scores of the set of patches in b, but it resulted in lower accuracy, because the detectors tend to give a higher score to the middle frame, in which the object appears to be close to the patch center. (A sketch of this weighted aggregation is given after this list.)
• Motion-based Approaches do not use any appearance information and rely purely on the correct estimation of the background motion. Among those, we experimented with MultiCue background subtraction [21], [22] and large displacement optical flow [26].
• Hybrid approaches are closer in spirit to ours and correct for motion using image flow. Among those, the one presented in [11] is the most recent we know of and the one we compare against. The main difference is that it relies on optical flow for motion compensation whereas we use CNNs. To ensure a fair comparison, we used the same patches to construct the st-cubes both for our method and to extract the features [11] requires.
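The Gaussian-weighted aggregation used for the single-frame baselines is illustrated below. This is a sketch of the rule stated above, not the baselines' code; the example scores are made up.

```python
# Label a whole st-cube positive when the Gaussian-weighted average of the
# per-patch single-frame detector scores is positive (weights centred on the middle frame).
import numpy as np

def st_cube_score(patch_scores):
    s_t = len(patch_scores)
    sigma = 0.3 * ((s_t - 1) / 2.0 - 1) + 0.8
    i = np.arange(s_t)
    g = np.exp(-(i - s_t / 2.0) ** 2 / (2.0 * sigma ** 2))
    return float(np.dot(g, patch_scores) / g.sum())

# Example: scores of the s_t = 7 patches of one st-cube from a single-frame detector.
print(st_cube_score([-0.2, 0.1, 0.4, 0.9, 0.5, 0.0, -0.3]) > 0)   # -> True
```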
For all the motion-based [21], [22], [26] and single-frame-based [16], [17], [19], [44] approaches, the code was downloaded from publicly available sources. In particular, for ACF and Random Forests, we used the toolboxes of [45] and [41], respectively.
Figure 14: Comparing against motion-based methods [21], [26] on the UAV dataset (three left columns) and the Aircraft dataset (three right columns). (First row) Our detector detects the objects by relying on motion and appearance, as evidenced by the green rectangles. (Middle row) Background subtraction results of [21]. Only in the leftmost frame of the three on the left is there a blob that corresponds to a UAV, along with one that does not. Similarly, there is a small blob that corresponds to a plane in the central frame of the three right-most ones and many large ones in the others that do not clearly correspond to anything. (Bottom row) Optical flow computed using the algorithm of [26]. The plane and UAV generate a distinctly visible pattern in 2 of the 3 right-most images but in none of the three left-most ones. (best seen in color)

The DPM implementation is publicly available [16]. We also used the open source BGSLibrary [22] for state-of-the-art background subtraction. To compute features, we used default parameter configurations, much as we did in our own pipeline for HoG3D. For algorithms relying on Random Forests, we tried varying the number of trees and kept the number yielding the best results, again much as we did to find the best CNN configurations in our pipeline. For [11], we did not find a publicly available implementation and reimplemented the algorithm ourselves.

6.2 Evaluation against Competing Approaches

We used the same video sequences to train all the methods from the three classes described above. We compare here their results against ours.

6.2.1 Appearance-Based Methods

In Fig. 13, we compare our method with appearance-based ones on our two datasets in terms of precision/recall. Table 2 summarizes the results in terms of Average Precision. For both the UAV and Aircraft datasets we improve on average by 15−20% over ACF [19], which itself outperforms the others.

The CNN approach provided by [17] yields scores comparable to those of the Random Forests and ACF methods. The structure of the network is the one depicted by Fig. 5, except that we replaced the 3D convolutions by standard 2D ones. To boost CNN performance, we used Local Contrast Normalization (LCN) [46] after every convolutional layer and minimized the Hinge Loss at the final layer of the network, which was shown to be effective [47], [48].

The DPM [16] performs worst on average. This likely happens because it depends on using the correct size of the bins for HoG estimation, which makes it hard to generalize to a large variety of flying objects.

Table 2: Average precision of detection methods on our datasets. We can see that in both cases our approach is able to reach higher detection accuracy. We achieve about a 15% increase compared to the best competing algorithms for the UAV and Aircraft datasets.

    Method                          UAV dataset    Aircraft dataset
    Single-frame based approaches
    DPM [16]                        0.573          0.470
    Random Forests [44]             0.618          0.563
    s-f CNN-based detector [17]     0.682          0.647
    ACF [19]                        0.652          0.648
    Hybrid approaches
    Park [11]                       0.568          0.705
    Ours                            0.849          0.864

6.2.2 Motion-Based Methods

Fig. 14 depicts cases where the background subtraction [21] and optical flow computation [26] algorithms, even though they are state-of-the-art, do not work well enough for detecting UAVs or planes in the challenging conditions we consider.

We did not compute precision-recall curves using these motion-based methods because it is unclear how big a moving part of the frame should be to be considered as an aircraft. We tested several potential sizes and the resulting average precision values were much lower than those in Table 2 in all cases.

6.2.3 Hybrid approaches

In Fig. 15, we compare our method against the hybrid approach of [11], which relies on motion compensation using the Lucas-Kanade optical flow method and yields state-of-the-art performance for pedestrian detection.
Figure 15: Comparing against the hybrid method of [11] on (a) the UAV dataset and (b) the Aircraft dataset. Our approach performs better for both UAVs and planes.
As shown in Fig. 2, optical flow motion compensation cannot achieve good performance in our case, mostly because the target object is rather small and its appearance can change significantly due to illumination and background changes. As a result, our regression-based approach achieves higher performance for both the UAV and Aircraft datasets. This suggests that accurate localization of the object in the patch is essential and leads to a significant improvement in detection accuracy. Fig. 16 shows several frames that illustrate the performance of our approach.

6.3 Collision Courses

Motion compensation can be seen as a way to make the st-cube invariant to the motion of the aircraft, as it keeps the flying object in the center of all the patches of the st-cube. To evaluate whether enforcing this kind of invariance negatively impacts performance in situations when it is not required, we applied our approach to the case of aircrafts on collision courses.

As shown in Fig. 17, if an aircraft A1, observed from the camera of another aircraft A2, is on a collision course with A2, then its behavior can be characterized by two important properties:
• A1 remains at a constant angle with respect to A2;
• the apparent size of A1 increases from the point of view of A2.
These properties are invariant to the actual positions of the aircrafts in the 3D environment; the only constraint is that the paths of the aircrafts should intersect, which effectively means collision. In the scope of this paper only the first property is important, which means that motion stabilization is not needed, as A1 will always occupy the same position in the image from the camera of A2, provided A1 and A2 are on a collision course.
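The constant-bearing property is not derived in the text above; a short sketch of why it holds, assuming both aircraft fly with constant velocity, is given below.

```latex
% Sketch (not from the paper): why a collision course implies a constant bearing.
% Let p_1(t), p_2(t) be the positions of A_1 and A_2 moving with constant
% velocities v_1, v_2, and let r(t) = p_1(t) - p_2(t) = r_0 + (v_1 - v_2) t.
% A collision at time t_c means r(t_c) = 0, hence r_0 = -(v_1 - v_2) t_c and
\[
  r(t) \;=\; (v_1 - v_2)\,(t - t_c)
  \quad\Longrightarrow\quad
  \frac{r(t)}{\|r(t)\|} \;=\; \pm\,\frac{v_1 - v_2}{\|v_1 - v_2\|},
\]
% i.e., the direction in which A_2 sees A_1 stays fixed while the distance
% \|r(t)\| shrinks, which is exactly the behaviour exploited above.
```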
We therefore searched publicly available sources for video sequences in which airplanes appear to be on a collision course for a substantial amount of time. We found fourteen, which vary in length from tens to several hundreds of frames. As before, we used half of them to train the detector and the others to test it.

In Fig. 18, we compare our results with and without motion stabilization. As expected, even though the non-stabilized results were poor in the general case, they are much better in this specific scenario. Incorporating motion stabilization very slightly degrades performance, which could be expected because enforcing any kind of invariance always loses some amount of information and is penalizing when such invariance is not required. However, in this case, the loss is almost negligible.

This is significant because, in a practical on-board system, detecting aircrafts on a collision course, which present a clear and immediate danger, would probably take priority over detecting all others. The former does not require motion compensation while the latter does. However, since nothing is lost by having motion compensation on, we can detect all aircrafts, whether on a collision course or not, without performance loss in the crucial case of those that are.

6.4 Scale Adjustment

As discussed in Section 4.3, we must run our detection scheme at different image resolutions to accommodate rapid size changes. This additional computational burden can be reduced by compensating not only for motion but also for size, which makes it possible to reduce the number of scales the system needs to check.

More specifically, we trained a regressor φ_sc(·) to adjust for scale so that the bounding box fits the object of interest, much in the same way as we learned a regressor to compensate for motion. Fig. 19 illustrates this process in two separate cases. Note that in the case of Fig. 19(b), there were originally two different detections, which were collapsed into the same one after adjustment, without having to perform non-maximum suppression.

Since CNNs have proved more effective for motion compensation than HoG based regressors, we used them to implement scale adjustment as well. We found experimentally that using just a single patch to predict the true scale of the object is not enough. As in [49], we therefore used several scales as inputs to the CNN. Fig. 20 illustrates its structure.

The input to this CNN is a set of images of the object at different scales, which are provided as separate channels. Its output is the estimated scale of the object. Since there is no pooling layer after the first convolutional layer, we can estimate the scale with high precision. Furthermore, this CNN can be combined with the motion stabilization one of Section 4 to increase the accuracy of both motion compensation and scale adjustment. The structure of the resulting composite CNN is similar to the one depicted by Fig. 20. However, the output of its fully-connected layer has 3 floating point values instead of only 2. The first two are the shifts from the center of the patch in the spatial domain and the last one is the estimated scale. This replaces NMS in scale space, as described in Section 3, and yields precise object localization. Fig. 21 depicts some scale-adjustment results.
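A possible way of applying the composite regressor is sketched below. The network is an abstract callable here, the number of refinement iterations is an assumption of the sketch, and the third output is interpreted as a relative scale factor for the bounding box, which is one plausible reading of "the estimated scale" above.

```python
# Sketch: a composite regressor maps a multi-scale patch stack to (dx, dy, s);
# the shifts re-centre the window and s rescales its size, replacing NMS in scale space.
def adjust_box(extract_patches, box, composite_cnn, n_iter=3):
    """box = (i, j, size); extract_patches(box) returns the multi-scale input stack."""
    i, j, size = box
    for _ in range(n_iter):
        dx, dy, s = composite_cnn(extract_patches((i, j, size)))
        i, j = i + int(round(dy)), j + int(round(dx))
        size = max(int(round(size * s)), 1)     # s taken as a relative scale factor
    return i, j, size
```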
12

UAV dataset

Aircraft dataset

Figure 16: Some detection results. Thumbnails at the side of each figure show the zoomed-in versions of the detections made by our
algorithm.

Figure 17: Collision courses. (Left) The apparent size of a stan-


(a) (b)
dard glider and its 15 m wingspan flying towards another aircraft
at a relatively slow speed (100 km/h) is very small 33s before Figure 19: Scale adjustment. The red bounding box shows the
impact, but the glider completely fills the field of view only half a original detection and the green one the position adjusted for scale
minute later, 3s before impact. (Right) An aircraft on a collision and motion. The thumbnails on the right are zoomed-in versions of
course is seen in a constant direction but its apparent size grows, the detections, with the top one illustrating the original detection
slowly at first and then faster. and the bottom one showing the one after being motion and scale
are adjusted. (best seen in color)

st-cube               AveP (Average Precision)
W/o compensation      0.907
With compensation     0.904

Figure 18: Performance for aircraft on a collision course. (Left) Precision/recall with and without motion compensation. (Right) Average Precision with and without motion compensation.

Figure 19: Scale adjustment. The red bounding box shows the original detection and the green one the position adjusted for scale and motion. The thumbnails on the right are zoomed-in versions of the detections, with the top one illustrating the original detection and the bottom one showing the detection after motion and scale adjustment. (Best seen in color.)

Figure 20: Structure of the scale adjustment Convolutional Neural Network. Several input channels contain the object at different scales. The output of the CNN is a number that characterizes the true scale of the object. ‘CL’ denotes a convolutional layer, ‘PL’ a pooling layer, and ‘FL’ a fully connected layer.
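As a rough illustration of the architecture sketched in Fig. 20, the following snippet builds a small regressor that takes a multi-channel patch (one channel per scale) and outputs a single scale estimate. It is written in PyTorch for brevity, whereas the original implementation relies on Theano [17]; the layer count, filter sizes, and the 7-scale 40x40 input are assumptions chosen to match the text, not the exact configuration used in the paper.

    import torch
    import torch.nn as nn

    class ScaleRegressionCNN(nn.Module):
        """Regresses the true scale of the object from a multi-channel patch.

        Each input channel contains the same patch resampled at a different
        scale; the output is a single number characterizing the true scale.
        Layer sizes are illustrative assumptions, not the paper's exact ones.
        """
        def __init__(self, n_scales=7, patch_size=40):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(n_scales, 16, kernel_size=5),  # 'CL'
                nn.ReLU(),
                nn.MaxPool2d(2),                         # 'PL'
                nn.Conv2d(16, 32, kernel_size=3),        # 'CL'
                nn.ReLU(),
                nn.MaxPool2d(2),                         # 'PL'
            )
            # Work out the flattened feature size for the given patch size.
            with torch.no_grad():
                n_feat = self.features(torch.zeros(1, n_scales, patch_size, patch_size)).numel()
            self.regressor = nn.Sequential(
                nn.Flatten(),
                nn.Linear(n_feat, 128),                  # 'FL'
                nn.ReLU(),
                nn.Linear(128, 1),                       # 'FL' -> scalar scale estimate
            )

        def forward(self, x):
            return self.regressor(self.features(x)).squeeze(1)

    # Example: a batch of 8 patches, each given at 7 scales as input channels.
    net = ScaleRegressionCNN(n_scales=7, patch_size=40)
    scales = net(torch.randn(8, 7, 40, 40))
    print(scales.shape)  # torch.Size([8])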
For scale adjustment, the network regresses three values: the first two are the shifts from the center of the patch in the spatial domain and the last one is the estimated scale. This replaces NMS in scale space, as described in Section 3, and yields precise object localization. Fig. 21 depicts some scale-adjustment results.
Table 3 compares the time required to process a single st-cube using our approach with and without scale adjustment. In this case, we used st-cubes of size (40, 40, 4) and 7 scales for the scale adjustment algorithm. Note that the number of scales can be selected according to the desired localization quality: more scales yield a more precise estimate of the object size, at the cost of increased computation time. In our experiments we selected 7 scales, which results in high localization precision, as depicted in Fig. 22, while keeping the processing time relatively low. Even though adding scale adjustment to motion compensation increases the processing time per st-cube, it reduces the overall computation time by a factor of about 4, because it replaces NMS across 7 different scales, which takes 0.123 × 7 = 0.861 seconds, with the processing of a single st-cube that accounts for scale, which takes 0.193 seconds.

motion compensation + detection                 0.123 s
motion and scale adjustment + detection         0.193 s

Table 3: Speed comparison between motion compensation alone and combined motion and scale adjustment. We provide the time needed to process a single st-cube using an Intel Xeon CPU E5-2650 v2 running at 2.60 GHz.
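As a back-of-the-envelope check of this trade-off, the small helper below contrasts the per-st-cube cost of running the detector once per scale followed by NMS with the single scale-adjusted pass. It is a sketch built only from the two timings reported in Table 3; the function name and structure are ours.

    def per_stcube_cost(n_scales, t_detect=0.123, t_scale_adjust=0.193):
        """Rough per-st-cube cost model (in seconds) from the timings of Table 3."""
        multi_scale_nms = n_scales * t_detect   # detector run once per scale, merged by NMS
        single_pass = t_scale_adjust            # one pass that regresses position and scale
        return multi_scale_nms, single_pass

    print(per_stcube_cost(7))   # about (0.861, 0.193): the factor-of-4 gain quoted above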
In Table 4, we evaluate our approach on the UAV dataset with and without scale adjustment. Even though HBT-Detection with scale adjustment allows for faster computation, its performance is slightly lower than without scale adjustment. This is mainly due to the artifacts that appear when resizing small, noisy images. Using more scales improves detection accuracy, at the cost of increased computation time.
In the experiments of Section 6.2, we rely on a 50% overlap between detected and ground-truth bounding boxes, so it is unnecessary to localize the target objects very precisely. We therefore use our method without scale adjustment on 8 distinct scales, which yields a good balance between accuracy and computation time.
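The 50% overlap criterion corresponds to the usual PASCAL-style intersection-over-union test; the helper below is a small sketch of that criterion under this assumption (the function and the example boxes are ours, not code from the paper).

    def iou(box_a, box_b):
        """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    # A detection counts as correct when it overlaps the ground truth by at least 50%.
    print(iou((10, 10, 50, 50), (20, 15, 55, 52)) >= 0.5)   # True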
Fig. 22 illustrates the performance of our detection method in combination with motion compensation and scale adjustment. Our algorithm localizes the flying object with great accuracy and yields trajectories that are smooth both in the spatial domain and in scale space. Provided that the camera is calibrated and the true size of the object is known, we can estimate its distance to the camera, which is critical for collision avoidance purposes.
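The distance estimate follows from the standard pinhole-camera relation; the snippet below is a sketch of that computation with hypothetical numbers, since the paper does not give a specific formula or values here.

    def distance_from_size(focal_length_px, true_size_m, apparent_size_px):
        """Pinhole-camera estimate of the distance to an object of known physical size.

        focal_length_px:   focal length in pixels, obtained from camera calibration
        true_size_m:       known size of the object (e.g. wingspan) in meters
        apparent_size_px:  extent of the detected bounding box along the same dimension
        """
        return focal_length_px * true_size_m / apparent_size_px

    # Hypothetical example: a 15 m wingspan spanning 30 px under a 1000 px focal length
    # puts the aircraft roughly 500 m away.
    print(distance_from_size(1000.0, 15.0, 30.0))   # 500.0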

Figure 21: Sample results for simultaneous scale and motion compensation on the UAV and Aircraft datasets. The left image of each pair contains the original patch, where neither scale nor position are corrected. The right image depicts the resulting patch after scale and motion correction.

Figure 22: Precise estimation of the scale of the object allows us to localize it in 3-D space. (Top Left) Scale- and motion-adjusted detection of the aircraft in one frame of a video sequence. (Top Right) Projection of the points of the 3D trajectory over the previous 20 frames onto the image plane. (Bottom Left) Changes of object scale. (Bottom Right) Trajectory of the object in 3D space; it is quite smooth thanks to the motion compensation algorithm, even though neither tracking nor additional smoothing is applied.
neither tracking nor additional smoothing is applied.
Method                                      number of scales processed per frame    average miss-rate for FPPI = 1
HBT-Detection without scale adjustment      4                                       51%
HBT-Detection without scale adjustment      8                                       50%
HBT-Detection with scale adjustment         8                                       54%
HBT-Detection with scale adjustment         16                                      52%
HBT-Detection with scale adjustment         32                                      48%

Table 4: Evaluation of the HBT-Detection method on the UAV dataset with and without scale adjustment. Both variants perform better when more scales are used, at the cost of increased computation time.

Other examples that illustrate the performance of our motion compensation and detection approaches can be found at the following link: http://cvlab.epfl.ch/research/unmanned/detection.

7 CONCLUSION

We showed that temporal information from a sequence of frames plays a vital role in the detection of small, fast-moving objects such as UAVs or aircraft in complex outdoor environments. We therefore developed an object-centric, learning-based motion compensation approach that is robust to changes in the appearance of both the object and the background. Both the CNN and the Boosted Trees methods allow us to outperform state-of-the-art techniques on two challenging datasets. The CNN proved to be more suitable for motion compensation than the Boosted Trees introduced in our previous work [15].
To evaluate our algorithms, we collected two challenging datasets for UAV and aircraft detection. We hope that these datasets will serve as a new benchmark for improving flying object detection and vision-based aerial collision avoidance.

8 ACKNOWLEDGMENTS

This work was conducted in the context of the “Visual detection and tracking of flying objects in Unmanned Aerial Vehicles” project, funded by Honeywell International, Inc.

REFERENCES

[1] “Mercedes-Benz Intelligent Drive,” http://techcenter.mercedes-benz.com/en/intelligent_drive/detail.html/.
[2] “Mobileye Inc.,” http://us.mobileye.com/technology/.
[3] G. Conte and P. Doherty, “An Integrated UAV Navigation System Based on Aerial Image Matching,” in IEEE Aerospace Conference, 2008, pp. 3142–3151.
[4] C. Martínez, I. F. Mondragón, M. Olivares-Méndez, and P. Campoy, “On-Board and Ground Visual Pose Estimation Techniques for UAV Control,” Journal of Intelligent and Robotic Systems, vol. 61, no. 1-4, pp. 301–320, 2011.
[5] L. Meier, P. Tanskanen, F. Fraundorfer, and M. Pollefeys, “PIXHAWK: A System for Autonomous Flight Using Onboard Computer Vision,” in IEEE International Conference on Robotics and Automation, 2011.
[6] C. Hane, C. Zach, J. Lim, A. Ranganathan, and M. Pollefeys, “Stereo Depth Map Fusion for Robot Navigation,” in Proceedings of International Conference on Intelligent Robots and Systems, 2011, pp. 1618–1625.
[7] S. Weiss, M. Achtelik, S. Lynen, M. Achtelik, L. Kneip, M. Chli, and R. Siegwart, “Monocular Vision for Long-Term Micro Aerial Vehicle State Estimation: A Compendium,” Journal of Field Robotics, vol. 30, pp. 803–831, 2013.
[8] S. Lynen, M. Achtelik, S. Weiss, M. Chli, and R. Siegwart, “A Robust and Modular Multi-Sensor Fusion Approach Applied to MAV Navigation,” in Conference on Intelligent Robots and Systems, 2013.
[9] C. Forster, M. Pizzoli, and D. Scaramuzza, “SVO: Fast Semi-Direct Monocular Visual Odometry,” in International Conference on Robotics and Automation, 2014.
[10] T. Zsedrovits, A. Zarándy, B. Vanek, T. Peni, J. Bokor, and T. Roska, “Visual Detection and Implementation Aspects of a UAV See and Avoid System,” in European Conference on Circuit Theory and Design, 2011.
[11] D. Park, C. L. Zitnick, D. Ramanan, and P. Dollár, “Exploring Weak Stabilization for Motion Feature Extraction,” in Conference on Computer Vision and Pattern Recognition, 2013.
[12] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, “Behavior Recognition via Sparse Spatio-Temporal Features,” in VS-PETS, October 2005, pp. 65–72.
[13] I. Laptev, “On Space-Time Interest Points,” International Journal of Computer Vision, 2005.
[14] D. Weinland, M. Ozuysal, and P. Fua, “Making Action Recognition Robust to Occlusions and Viewpoint Changes,” in European Conference on Computer Vision, 2010.
[15] A. Rozantsev, V. Lepetit, and P. Fua, “Flying Objects Detection from a Single Moving Camera,” in Conference on Computer Vision and Pattern Recognition, 2015.

[16] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Object Detection with Discriminatively Trained Part Based Models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
[17] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, and Y. Bengio, “Theano: new features and speed improvements,” 2012.
[18] A. Bosch, A. Zisserman, and X. Munoz, “Image Classification Using Random Forests and Ferns,” in International Conference on Computer Vision, 2007.
[19] P. Dollár, Z. Tu, P. Perona, and S. Belongie, “Integral Channel Features,” in British Machine Vision Conference, 2009.
[20] N. Oliver, B. Rosario, and A. Pentland, “A Bayesian Computer Vision System for Modeling Human Interactions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 831–843, 2000.
[21] N. Seungjong and J. Moongu, “A New Framework for Background Subtraction Using Multiple Cues,” in Asian Conference on Computer Vision. Springer Berlin Heidelberg, 2013, pp. 493–506.
[22] A. Sobral, “BGSLibrary: An OpenCV C++ Background Subtraction Library,” in IX Workshop de Visao Computacional, 2013.
[23] D. Zamalieva and A. Yilmaz, “Background Subtraction for the Moving Camera: A Geometric Approach,” Computer Vision and Image Understanding, vol. 127, pp. 73–85, 2014.
[24] T. Brox and J. Malik, “Object Segmentation by Long Term Analysis of Point Trajectories,” in European Conference on Computer Vision, 2010, pp. 282–295.
[25] B. Lucas and T. Kanade, “An Iterative Image Registration Technique with an Application to Stereo Vision,” in International Joint Conference on Artificial Intelligence, 1981, pp. 674–679.
[26] T. Brox and J. Malik, “Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[27] Y. Zhang, S.-J. Kiselewich, W.-A. Bauson, and R. Hammoud, “Robust Moving Object Detection at Distance in the Visible Spectrum and Beyond Using a Moving Camera,” in Conference on Computer Vision and Pattern Recognition, 2006.
[28] S.-W. Kim, K. Yun, K.-M. Yi, S.-J. Kim, and J.-Y. Choi, “Detection of Moving Objects with a Moving Camera Using Non-Panoramic Background Model,” Machine Vision and Applications, vol. 24, pp. 1015–1028, 2013.
[29] S. Kwak, T. Lim, W. Nam, B. Han, and J. Han, “Generalized Background Subtraction Based on Hybrid Inference by Belief Propagation and Bayesian Filtering,” in International Conference on Computer Vision, 2011, pp. 2174–2181.
[30] A. Elqursh and A. Elgammal, “Online Moving Camera Background Subtraction,” in European Conference on Computer Vision, 2012, pp. 228–241.
[31] M. Narayana, A. Hanson, and E. Learned-Miller, “Coherent Motion Segmentation in Moving Camera Videos Using Optical Flow Orientations,” in International Conference on Computer Vision, 2013.
[32] A. Papazoglou and V. Ferrari, “Fast Object Segmentation in Unconstrained Video,” in International Conference on Computer Vision, 2013.
[33] S. Walk, N. Majer, K. Schindler, and B. Schiele, “New Features and Insights for Pedestrian Detection,” in Conference on Computer Vision and Pattern Recognition, 2010.
[34] J. Friedman, “Stochastic Gradient Boosting,” Computational Statistics & Data Analysis, 2002.
[35] N. Dalal and B. Triggs, “Histograms of Oriented Gradients for Human Detection,” in Conference on Computer Vision and Pattern Recognition, 2005.
[36] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-Based Learning Applied to Document Recognition,” IEEE, 1998.
[37] X. Glorot, A. Bordes, and Y. Bengio, “Deep Sparse Rectifier Neural Networks,” in International Conference on Artificial Intelligence and Statistics, 2011.
[38] M. D. Zeiler, “ADADELTA: An Adaptive Learning Rate Method,” Computing Research Repository, 2012.
[39] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[40] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2009.
[41] R. Sznitman, C. Becker, F. Fleuret, and P. Fua, “Fast Object Detection with Entropy-Driven Evaluation,” in Conference on Computer Vision and Pattern Recognition, 2013, pp. 3270–3277.
[42] A. Vedaldi and B. Fulkerson, “VLFeat: An Open and Portable Library of Computer Vision Algorithms,” http://www.vlfeat.org/, 2008.
[43] R. Benenson, O. Mohamed, J. Hosang, and B. Schiele, “Ten Years of Pedestrian Detection, What Have We Learned?” in European Conference on Computer Vision, 2014.
[44] L. Breiman, “Random Forests,” Machine Learning, 2001.
[45] P. Dollár, “Piotr’s Computer Vision Matlab Toolbox (PMT),” http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html.
[46] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the Best Multi-Stage Architecture for Object Recognition?” in Conference on Computer Vision and Pattern Recognition, 2009.
[47] J. Jin, K. Fu, and C. Zhang, “Traffic Sign Recognition with Hinge Loss Trained Convolutional Neural Networks,” IEEE Transactions on Intelligent Transportation Systems, vol. 15, pp. 1991–2000, 2014.
[48] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial Transformer Networks,” arXiv Preprint, 2015.
[49] M. Oberweger, P. Wohlhart, and V. Lepetit, “Hands Deep in Deep Learning for Hand Pose Estimation,” arXiv Preprint, 2015.

Artem Rozantsev joined EPFL in 2012 as a Ph.D. candidate at the Computer Vision Laboratory, under the supervision of Prof. Pascal Fua and Prof. Vincent Lepetit. He received his Specialist degree in Mathematics and Computer Science in 2012 from Lomonosov Moscow State University. His research interests include object detection, synthetic data generation, and machine learning.

Vincent Lepetit is a Professor at the Institute for Computer Graphics and Vision, TU Graz, and a Visiting Professor at the Computer Vision Laboratory, EPFL. He received the PhD degree in Computer Vision in 2001 from the University of Nancy, France, after working in the ISA INRIA team. He then joined the Virtual Reality Lab at EPFL as a post-doctoral fellow and became a founding member of the Computer Vision Laboratory. He became a Professor at TU Graz in February 2014. His research interests include vision-based Augmented Reality, 3D camera tracking, Machine Learning, object recognition, and 3D reconstruction. He often serves as program committee member and area chair of major vision conferences (CVPR, ICCV, ECCV, ACCV, BMVC). He is an editor for the International Journal of Computer Vision (IJCV) and the Computer Vision and Image Understanding (CVIU) journal.

Pascal Fua received an engineering degree from Ecole Polytechnique, Paris, in 1984 and the Ph.D. degree in Computer Science from the University of Orsay in 1989. He then worked at SRI International and INRIA Sophia-Antipolis as a Computer Scientist. He joined EPFL in 1996, where he is now a Professor in the School of Computer and Communication Science and heads the Computer Vision Laboratory. His research interests include shape modeling and motion recovery from images, analysis of microscopy images, and Augmented Reality. He has (co)authored over 200 publications in refereed journals and conferences. He is an IEEE Fellow and has been an associate editor of the IEEE journal Transactions on Pattern Analysis and Machine Intelligence. He often serves as program committee member, area chair, and program chair of major vision conferences.
