Detecting Flying Objects Using A Single Moving Camera
Artem Rozantsev, Vincent Lepetit, and Pascal Fua, Fellow, IEEE

• A. Rozantsev is with the Computer Vision Laboratory, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland. E-mail: [email protected]
• V. Lepetit is with the Institute for Computer Graphics and Vision, Graz University of Technology, Graz, Austria.
• P. Fua is with the Computer Vision Laboratory, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.
Manuscript received June 19, 2015; revised September 17, 2014.
Abstract—We propose an approach for detecting flying objects such as Unmanned Aerial Vehicles (UAVs) and aircraft when they occupy a small portion of the field of view, possibly move against complex backgrounds, and are filmed by a camera that itself moves. We argue that solving such a difficult problem requires combining both appearance and motion cues. To this end we propose a regression-based approach for object-centric motion stabilization of image patches that allows us to achieve effective classification on spatio-temporal image cubes and outperform state-of-the-art techniques.
As this problem has not yet been extensively studied, no test datasets are publicly available. We therefore built our own, both for UAVs and aircraft, and will make them publicly available so they can be used to benchmark future flying object detection and collision avoidance algorithms.
1 INTRODUCTION
We are headed for a world in which the skies are occupied not
only by birds and planes but also by unmanned drones ranging
from relatively large Unmanned Aerial Vehicles (UAVs) to much
smaller consumer ones. Some of these will be instrumented and
able to communicate with each other to avoid collisions but not
all. Therefore, the ability to use inexpensive and light sensors
such as cameras for collision-avoidance purposes will become
increasingly important.
This problem has been tackled successfully in the automotive world; for example, there are now commercial products [1], [2] designed to sense and avoid both pedestrians and other cars.
In the world of flying machines much progress has been made
towards accurate position estimation and navigation from single
or multiple cameras [3], [4], [5], [6], [7], [8], [9], but less in the
field of visual-guided collision avoidance [10]. In particular, it is not possible to simply extend the algorithms used for pedestrian and automobile detection to the world of aircraft and drones, as flying object detection poses some unique challenges:
• The environment is fully three dimensional, which makes the motions more complex (e.g., objects may move in any direction in 3D space and may appear in any part of the frame).
• Flying objects have very diverse shapes and can be seen against either the ground or the sky, which produces complex and changing backgrounds.
• Given the speeds involved, potentially dangerous objects must be detected when they are still far away, which means they may be very small in the images.

Figure 1: Detecting a small flying object against a complex moving background. (Left) It is almost invisible to the human eye and hard to detect from a single image. (Right) Yet, our algorithm can find it by using appearance and motion cues.

Fig. 1 illustrates some examples where, even for humans, it is hard to find a flying object based on a single image alone. By contrast, when looking at the sequence of frames, these objects suddenly pop up and are easily spotted, which suggests that motion cues are crucial for detection.
However, these motion cues are difficult to exploit when the images are acquired by a moving camera and feature backgrounds that are challenging to stabilize because they are non-planar and rapidly changing. Furthermore, since there may be other moving objects in the scene, such as a person in the top row of Fig. 1, motion by itself is not enough and appearance must also be taken into account.

Figure 2: Motion compensation for four different st-cubes of flying objects seen against different backgrounds (columns: UAVs and aircraft; backgrounds: uniform, very noisy, non-uniform, and noisy). (Top) For each one, we show four consecutive patches before motion stabilization. In the leftmost plot below the patches, the blue dots denote the location of the true center of the drone and the red cross is the patch center over time. The other two plots depict the x and y deviations of the drone center with respect to the patch center. (Middle) The same four st-cubes and corresponding graphs after motion compensation using an optical flow approach, as suggested by [11]. (Bottom) The same four st-cubes and corresponding graphs after motion compensation using our approach.
In this paper, we detect whether an object of interest is present and constitutes a danger by classifying 3D descriptors computed from spatio-temporal image cubes. We will refer to them as st-cubes. They are formed by stacking motion-stabilized image windows over several consecutive frames, which gives more information than using a single image. What makes this approach both practical and effective is a regression-based motion-stabilization algorithm. Unlike algorithms relying on optical flow, it remains effective even when the shape of the object to be detected is blurry or barely visible, as illustrated by Fig. 2. This is because learning-based motion compensation focuses on the object and is more resistant to complicated backgrounds than the optical flow method, as shown in Fig. 2.
St-cubes have been routinely used for action recognition purposes [12], [13], [14] using a monocular camera. By contrast, most current detection algorithms work either on a single frame or by estimating the optical flow from consecutive frames. Our approach can therefore be seen as a way to combine appearance and motion information to achieve effective detection in a very challenging context. In our experiments, we show that this method achieves higher accuracy than either appearance- or motion-based methods on their own.
We first proposed using st-cubes for flying object detection in an earlier conference paper [15]. In that initial version of our processing pipeline, we performed motion compensation using boosted trees. In this paper, we refine this idea by using deep learning techniques that yield better stabilization and, thus, better overall performance.

2 RELATED WORK
Approaches for detecting moving objects can be classified into three main categories: those that rely on appearance in individual frames, those that rely primarily on motion information across frames, and those that combine the two. We briefly review all three types in this section. In the results section, we will demonstrate that we can outperform state-of-the-art representatives of each class.
Appearance-based methods rely on Machine Learning and have proved to be powerful even in the presence of complex lighting variations or cluttered backgrounds. They are typically based on Deformable Part Models (DPM) [16], Convolutional Neural Networks (CNN) [17], or Random Forests [18]. Among them, the Aggregate Channel Features (ACF) algorithm [19] is considered one of the best.
These approaches work best when the target objects are sufficiently large and clearly visible in individual images, which is often not the case in our applications. For example, in the images of Fig. 1, the object is small and almost impossible to make out from the background without motion cues.
Motion-based approaches can themselves be subdivided into two subclasses. The first comprises those that rely on background subtraction [20], [21], [22], [23] and determine objects as groups of pixels that differ from the background. The second includes those that depend on optical flow [24], [25], [26].
Background subtraction works best when the camera is static or its motion is small enough to be easily compensated for, which is not the case for the on-board camera of a fast moving aircraft.
Flow-based methods are more reliable in such situations but remain critically dependent on the quality of the flow vectors, which tends to be low when the target objects are small and blurry. Some methods combine optical flow and background subtraction algorithms [27], [28]. However, in our case there may be motion in different parts of the images, caused for example by people or treetops, so motion information alone is not enough for reliable flying object detection. Other methods that combine optical flow and background subtraction, such as [29], [30], [31], [32], still critically depend on optical flow, which is often estimated with [26] and thus may suffer from the low quality of the flow vectors. In addition to this dependence on optical flow, [31] assumes that the camera motion is translational, which is violated in aerial videos.
Hybrid approaches combine information about object appearance and motion patterns and are therefore the closest in spirit to what we propose. For example, in [33], histograms of flow vectors are used as features in conjunction with more standard appearance features and are fed to a statistical learning method. This approach was refined in [11] by first aligning the patches to compensate for motion and then using the differences of frames, which may or may not be consecutive, as additional features. The alignment relies on the Lucas-Kanade optical flow algorithm [25]. The resulting algorithm works very well for pedestrian detection and outperforms most single-frame methods. However, when the target objects become smaller and harder to see, the flow estimates become unreliable and this approach, like the purely flow-based ones, becomes less effective.

3 DETECTION FRAMEWORK

Figure 3: Object detection pipeline with st-cubes and motion compensation. Given a set of video frames from the camera, we use a multi-scale sliding window approach to extract st-cubes. We then process every patch of the st-cube to compensate for the motion of the aircraft and then run the detector. (best seen in color)

Our detection pipeline is illustrated by Fig. 3 and comprises the following steps (a schematic sketch of the first two steps is given after the list):
• Divide the video sequence into N-frame overlapping temporal slices. The larger the overlap, the higher the precision, but only up to a point. Our experiments show that making the overlap more than 50% increases computation time without improving performance. Thus, 50% is what we used.
• Build st-cubes from each slice using a sliding window approach, independently at each scale.
• Apply our motion compensation algorithm to the patches of each of the st-cubes to create stabilized st-cubes.
• Classify each st-cube as containing an object of interest or not.
• Since each scale has been processed independently, we perform non-maximum suppression in scale space. If there are several detections for the same spatial location at different scales, we only retain the highest-scoring one. As an alternative to this simple scheme, we have developed a more sophisticated learning-based one, which we discuss in more detail in Section 6.4.
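As a concrete illustration of the first two steps, the Python sketch below slices a video into 50%-overlapping temporal windows and extracts st-cubes with a multi-scale sliding window. It is a minimal sketch under simplifying assumptions (grayscale frames as NumPy arrays, a fixed set of scales, the temporal slice length taken equal to s_t, naive bilinear resampling); the function names and default parameters are illustrative, not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import zoom

def temporal_slices(num_frames, s_t, overlap=0.5):
    # Start indices of overlapping temporal slices (50% overlap by default).
    step = max(1, int(round(s_t * (1.0 - overlap))))
    return range(0, num_frames - s_t + 1, step)

def extract_st_cubes(frames, s_x=40, s_y=40, s_t=4, stride=20, scales=(1.0, 0.75, 0.5)):
    # frames: list of 2D grayscale arrays. Returns (scale, t0, i, j, cube) tuples,
    # one per sliding-window position, where cube has shape (s_t, s_y, s_x).
    cubes = []
    for scale in scales:
        scaled = [zoom(f, scale, order=1) for f in frames]   # resample every frame
        h, w = scaled[0].shape
        for t0 in temporal_slices(len(scaled), s_t):
            stack = np.stack(scaled[t0:t0 + s_t], axis=0)    # (s_t, h, w) temporal slice
            for i in range(0, h - s_y + 1, stride):
                for j in range(0, w - s_x + 1, stride):
                    cubes.append((scale, t0, i, j, stack[:, i:i + s_y, j:j + s_x]))
    return cubes
```

In practice, the stride and the set of scales would be tuned to the range of apparent object sizes (10 to 100 pixels in our datasets, see Section 5.2.1).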
In this section, we introduce two separate approaches—one based on boosted trees, the other on Convolutional Neural Networks—to deciding whether or not an st-cube contains a target object, and we compare their respective performance in Section 5. We will discuss motion compensation in Section 4.
More specifically, we want to train a classifier that takes as input st-cubes such as those depicted by Fig. 4 and returns 1 or -1, depending on the presence or absence of a flying object. Let (s_x, s_y, s_t) be the size of our st-cubes. For training purposes, we use a dataset of pairs (b_i, y_i), i ∈ [1, N], where b_i ∈ R^(s_x × s_y × s_t) is an st-cube, in other words s_t image patches of resolution s_x × s_y pixels. The label y_i ∈ {−1, 1} indicates whether or not a target object is present.

3.1 3D HoG with Gradient Boost
The first approach we tested relies on boosted trees [34] to learn a classifier ψ(·) of the form ψ(b) = Σ_{j=1}^{H} α_j h_j(b), where the α_{j=1..H} are real-valued weights, b ∈ R^(s_x × s_y × s_t) is the input st-cube, the h_j : R^(s_x × s_y × s_t) → R are weak learners, and H is the number of selected weak learners, which controls the complexity of the classifier. The α's and h's are learned in a greedy manner, using the Gradient Boost algorithm [34], which can be seen as an extension of the classic AdaBoost to real-valued weak learners and more general loss functions.
In standard Gradient Boost fashion, we take our weak learners to be regression trees h_j(b) = T(θ_j, HoG3D(b)), where θ_j denotes the tree parameters and HoG3D(b) the 3-dimensional Histograms of Gradients (HoG3D) computed for b. HoG3D was introduced in [14] and can be seen as an extension of the standard HoG [35] with an additional temporal dimension. It is fast to compute, has proved robust to illumination changes in many applications, and allows us to combine appearance and motion efficiently.
At each iteration j, the weak learner h_j(·) with the corresponding weight α_j is taken as the one that minimizes the exponential loss function

(h_j(·), α_j) = argmin_{h(·), α} Σ_{i=1}^{N} exp( −y_i ( ψ_{j−1}(b_i) + α h(b_i) ) ).   (1)

The tests in the nodes of the trees compare one coordinate of the HoG3D vector with a threshold, both selected during the optimization.
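The sketch below illustrates the same idea with off-the-shelf components: gradient-boosted regression trees of depth 2 trained on HoG3D-style descriptors, using scikit-learn's exponential-loss gradient boosting. The hog3d() function is only a crude stand-in for the descriptor of [14], and the hyper-parameters mirror the settings reported later in Section 5.2.3 (1500 trees of depth 2); this is an illustrative sketch, not the implementation used in the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def hog3d(cube):
    # Crude stand-in for a real HoG3D descriptor [14]: histograms of the
    # spatio-temporal gradients of the st-cube, concatenated into one vector.
    gt, gy, gx = np.gradient(cube.astype(np.float64))
    return np.concatenate([np.histogram(g, bins=8)[0] for g in (gt, gy, gx)]).astype(np.float64)

def train_hbt_detector(cubes, labels):
    # cubes: list of (s_t, s_y, s_x) arrays; labels: +1 / -1 per cube.
    X = np.stack([hog3d(c) for c in cubes])
    y = np.asarray(labels)
    clf = GradientBoostingClassifier(
        loss="exponential",   # exponential loss, as in Eq. (1)
        n_estimators=1500,    # 1500 trees ...
        max_depth=2,          # ... of depth 2
    )
    clf.fit(X, y)
    return clf
```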
3.2 Convolutional Neural Networks
Since Convolutional Neural Networks (CNN) [36] have proved very successful in many detection problems, we have tested them as an alternative classification method. We use the architecture depicted by Fig. 5, which alternates convolutional layers and pooling layers. Convolutional layers use 3D linear filters, while pooling layers apply max-pooling in 2D spatial regions only. The last layer is fully connected and outputs the probability that the input st-cube contains an object of interest. We use the hyperbolic tangent function as the non-linear operator [37].
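A minimal PyTorch sketch of this kind of architecture is shown below. The filter counts and kernel sizes are placeholders (the actual configuration is the one given in Fig. 5), and the layer dimensions in the comments assume a 4 x 40 x 40 input cube; only the overall pattern follows the description above: 3D convolutions, max-pooling restricted to the two spatial dimensions, hyperbolic tangent non-linearities, and a fully connected output layer.

```python
import torch
import torch.nn as nn

class STCubeClassifier(nn.Module):
    # Input: (batch, 1, s_t, s_y, s_x) normalized st-cubes, here s_t=4, s_y=s_x=40.
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=(3, 5, 5)),   # 3D filters -> (8, 2, 36, 36)
            nn.Tanh(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),      # pool spatial dims only -> (8, 2, 18, 18)
            nn.Conv3d(8, 16, kernel_size=(2, 5, 5)),  # -> (16, 1, 14, 14)
            nn.Tanh(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),      # -> (16, 1, 7, 7)
        )
        self.classifier = nn.Linear(16 * 1 * 7 * 7, 2)  # two scores: background / object

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))
```

The two output scores are turned into class probabilities by the softmax of Eq. (3) below.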
We take the input of our CNN to be a normalized st-cube

η = (b − µ(b)) / σ(b),   (2)

where µ(b) and σ(b) are the mean and standard deviation of the pixel intensities in b, respectively. Normalization is an important step because the optimization of the network parameters fails to converge when using raw image intensities.
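In code, the normalization of Eq. (2) is a one-liner over the whole st-cube; the small epsilon that guards against near-constant patches is an addition of this sketch, not part of the paper's formulation.

```python
import numpy as np

def normalize_st_cube(b, eps=1e-8):
    # Eq. (2): subtract the mean and divide by the standard deviation of the
    # pixel intensities of the whole st-cube.
    b = b.astype(np.float64)
    return (b - b.mean()) / (b.std() + eps)
```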
During training, we write the probability that an st-cube η contains an object of interest (y = 1) or is part of the background (y = 0) as

P(Y = y | η) = exp(CNN(η)[y]) / ( exp(CNN(η)[0]) + exp(CNN(η)[1]) ),   y ∈ {0, 1},   (3)

where CNN(η)[y] denotes the classification score that the network predicts for η as being part of class y and exp(·) denotes the exponential function. We then minimize the negative log-likelihood

L(W, bias) = − Σ_{k=1}^{N} log P(Y = y_k | η_k)   (4)

with respect to the CNN parameters. Here the (η_k, y_k) are pairs of normalized st-cubes and their corresponding labels from the training dataset, as defined in Section 3. To this end, we use the algorithm of [38] combined with Dropout [39] to improve generalization.
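For reference, the two-class softmax of Eq. (3) and the negative log-likelihood of Eq. (4) can be written directly from the network scores, as in the NumPy sketch below; the actual training minimizes this loss with ADADELTA [38] and Dropout [39] rather than the plain evaluation shown here.

```python
import numpy as np

def class_probabilities(scores):
    # scores: (N, 2) array of outputs CNN(eta)[0], CNN(eta)[1] -- Eq. (3).
    e = np.exp(scores - scores.max(axis=1, keepdims=True))  # numerically stabilized softmax
    return e / e.sum(axis=1, keepdims=True)

def negative_log_likelihood(scores, labels):
    # labels: (N,) array with values in {0, 1} -- Eq. (4).
    p = class_probabilities(scores)
    return -np.log(p[np.arange(len(labels)), labels]).sum()
```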
We tried many different network configurations, in terms of the number of filters per layer and the size of the filters. However, they all yield similar performance, which suggests that only minor improvements could be obtained by further tweaking the network. We also tried varying the dimensions of the st-cube. These variations have a more significant influence on performance, which will be evaluated in Section 5.
4 MOTION COMPENSATION
Neither of the two approaches to classifying st-cubes introduced in the previous section accounts for the fact that both the gradient orientations used to build the 3D HoG and the filter responses in the CNN case are biased by the global object motion. This makes the learning task much more difficult, and we propose to use motion compensation to eliminate this problem. Motion compensation allows us to accumulate visual evidence from multiple frames without adding variation due to the object motion. We therefore aim at centering the target object, so that when it is present in an st-cube, it remains at the center of all its image patches.
More specifically, let I_t denote the t-th frame of the video sequence and (i, j) some pixel position in it. The st-cube b_{i,j,t} is the 3D array of pixel intensities from images I_z with z ∈ [t − s_t + 1, t] at image locations (k, l) with k ∈ [i − s_x + 1, i] and l ∈ [j − s_y + 1, j], as depicted by Fig. 4. Correcting for motion can be formulated as allowing the patches m_{i,j,z}, z ∈ [t − s_t + 1, t], of the st-cube to shift horizontally and vertically in individual images.
In [11], these shifts are computed using optical flow information, which has been shown to be effective for pedestrians occupying a large fraction of the patch and moving relatively slowly from one frame to the next. However, as can be seen in Fig. 4, these assumptions do not hold in our case and we will show in Section 6 that this negatively impacts performance. To overcome this difficulty, we introduce instead a learning-based approach to compensate for motion and keep the object in the center of the m_{i,j,z} patches of the st-cube even when the target object's appearance changes drastically.

Figure 6: Structure of the CNNs used for motion compensation. (Top) The first network (coarse alignment) uses extended patches to correct for the large displacements of the aircraft. (Bottom) The second network (refinement) is applied after rectification by the motion predicted by the first network, and is designed to correct for the small motions.

More specifically, we treat motion compensation as a regression task: given a single image patch, we want to predict the 2D translation that best centers the target object. By rectifying all the image patches in an st-cube with their predicted translations, we can then align the images of the object of interest.
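The sketch below shows how such predicted translations could be applied to re-center every patch of an st-cube, using a coarse pass followed by a refinement pass in the spirit of Fig. 6. The two predictors are passed in as black-box functions, and the interpolation and border handling are assumptions of this illustration rather than details taken from the paper.

```python
import numpy as np
from scipy.ndimage import shift as nd_shift

def recenter_patch(patch, predict_shift):
    # predict_shift(patch) -> (dx, dy): estimated offset of the object from the
    # patch center. Translating the content by (-dy, -dx) moves it back to the center.
    dx, dy = predict_shift(patch)
    return nd_shift(patch, shift=(-dy, -dx), order=1, mode='nearest')

def stabilize_st_cube(cube, coarse_predictor, refine_predictor):
    # cube: (s_t, s_y, s_x). Each patch is corrected independently, first for
    # large displacements, then for the remaining small ones.
    out = np.empty_like(cube, dtype=np.float64)
    for t in range(cube.shape[0]):
        p = recenter_patch(cube[t].astype(np.float64), coarse_predictor)
        out[t] = recenter_patch(p, refine_predictor)
    return out
```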
Figure 9: An object's apparent size can change enormously depending on its pose and distance to the camera. We therefore use a sliding window approach at different resolutions. The green boxes denote detections by our algorithm, which successfully handles background, lighting, scale, and pose changes. (UAV Dataset and Aircraft Dataset examples.)

Since the problem of detecting small flying objects has not yet received extensive attention from our community, there is not yet any standard dataset that can be used for testing purposes. We therefore built our own, one for UAVs and one for planes. We first describe them, then describe our testing protocol and the metrics we used for evaluation purposes. Finally, we perform the above-mentioned comparisons and demonstrate that the best results are obtained by using the CNN approach of Section 4.2 for motion compensation and the HoG3D descriptors of Section 3.1 for actual detection.

5.1 Datasets
To evaluate the performance of our approach, we built two separate datasets. They feature many real-world challenges, including fast illumination changes and complex backgrounds, such as those created by moving treetops seen against a changing sky. They are as follows.

[Figure: UAV dataset / Aircraft dataset / Several failure cases]
Figure 11: Influence of the st-cube sizes on the performance of the Boosted trees (HBT-Detection) and CNN (CNN-Detection) detectors with the CNN-based motion compensation method, as described in Section 5.2.3. The plots are colored according to the MR|FPPI=1 criterion (introduced in Section 5.2.2): blue corresponds to a higher MR|FPPI=1 and red to a lower one. The darker lines on both plots correspond to the best performing examples of the two different types of machine learning algorithms, according to the same criterion. The evaluation was performed on the validation subsets of the UAV and Aircraft datasets. (best seen in color)
The structures of the CNNs for detection and motion compensation are depicted by Figs. 5 and 6, respectively. The parameters of each layer—the numbers of filters per layer and their dimensions—are given in the figures in the format N × (k_x, k_y, k_t), where N and (k_x, k_y, k_t) are the number of filters and their sizes, respectively.

5.2.1 Training the Motion Regressors
To provide labeled examples where the aircraft or UAV is not in the center of the patch but still at least partially within it, we randomly shifted the ground truth bounding boxes by a translation of magnitude up to half of their size. This step was repeated for all the frames of the training database to cover the variety of shapes and backgrounds in front of which the aircraft might appear.
Applying large translations to the training data allows us to run the detection on only non-overlapping patches without missing the target, as explained at the end of Section 4.3. This procedure allows us to generate as much training data as needed for both the Boosted trees (HBT-Regression) and CNN (CNN-Regression) regressors, which is important for performance, especially as the latter is known to require large amounts of training data.
The apparent size of the objects in the UAV and Aircraft datasets varies from 10 to 100 pixels. To train the regressor, we used 40 × 40 patches containing the UAV or aircraft shifted from the center.
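A sketch of this training-data generation step is given below: each ground-truth box is jittered by a random translation of up to half its size, and the object's resulting offset from the patch center, normalized by the patch size so that it does not depend on scale, is kept as the regression target. The cropping helper and its border handling are simplifications introduced for the sketch.

```python
import numpy as np

def sample_shifted_patch(frame, box, rng, patch_size=40):
    # box = (cx, cy, w, h): ground-truth object center and size in the frame.
    cx, cy, w, h = box
    dx = rng.uniform(-0.5, 0.5) * w          # random shift of up to half the box size
    dy = rng.uniform(-0.5, 0.5) * h
    half = patch_size // 2
    x0 = int(round(cx + dx)) - half
    y0 = int(round(cy + dy)) - half
    patch = frame[max(y0, 0):y0 + patch_size, max(x0, 0):x0 + patch_size]
    # Regression target: the object now sits at (-dx, -dy) from the patch center,
    # expressed relative to the patch size so that it is scale independent.
    target = np.array([-dx, -dy]) / patch_size
    return patch, target

# Example: rng = np.random.default_rng(0); patch, target = sample_shifted_patch(frame, box, rng)
```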
The CNN-based regressor relies on convolutions of the original patch with filters from different network layers, which may produce artifacts close to the patch borders and degrade performance when the object is only partially visible. To reduce the influence of such artifacts, we extend the input patch by 25% in both the horizontal and vertical directions. This needs to be done only for the coarse alignment CNN, as depicted by the top row of Fig. 6. It is not required for the refinement CNN, which only estimates small motions.
Fig. 10 depicts some examples of motion compensation. Note that even though both aircraft and drones appear in front of changing backgrounds, the motion compensation algorithm correctly estimates the object location within the patch. Fig. 10 also illustrates some cases where the motion compensation system is unable to correctly predict the location of the object in the patch. This typically occurs when the patches are very noisy and the object is barely visible.
To handle the wide range of apparent sizes of flying objects, we use a multi-scale sliding window detector. Fig. 9 shows the same UAV and plane appearing at various distances from the camera throughout the video sequence.

5.2.2 Evaluation Metrics
In our experiments, we consider an object to be correctly detected if there is 50% overlap between the detected bounding box and the ground-truth bounding box.
We report precision-recall curves. Precision is computed as the number of true positives detected by the algorithm divided by the total number of detections. Recall is the number of true positives divided by the number of positive test examples. Additionally, we use the Average Precision measure, which we take to be the integral ∫_0^1 p(r) dr, where p is the precision and r the recall.
We also report the log-average miss-rate (MR) with respect to the average number of false positives per image (FPPI). The miss-rate is computed as the number of true positives missed by the detector, divided by the total number of true positives; FPPI is computed as the total number of false positives, divided by the total number of images in the testing dataset:

MR = 1 − N_d / N_tp,   FPPI = N_fd / N_f,   (6)

where N_d, N_fd, N_tp, and N_f are the number of true detections, the number of false detections, the number of positively labeled examples, and the number of frames in the test set, respectively.
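These definitions translate directly into code. The sketch below assumes that detections have already been matched to ground truth with the 50%-overlap test, so that only the counts of Eq. (6) and a sampled precision-recall curve are needed; it is an illustration of the metrics, not the evaluation scripts used for the paper.

```python
import numpy as np

def detection_metrics(n_true_det, n_false_det, n_positives, n_frames):
    # n_true_det  = Nd : detections that match a ground-truth object (>= 50% overlap)
    # n_false_det = Nfd: detections that match nothing
    # n_positives = Ntp: number of positively labeled (ground-truth) objects
    # n_frames    = Nf : number of frames in the test set
    precision = n_true_det / max(n_true_det + n_false_det, 1)
    recall = n_true_det / max(n_positives, 1)
    miss_rate = 1.0 - n_true_det / max(n_positives, 1)   # Eq. (6)
    fppi = n_false_det / max(n_frames, 1)                # Eq. (6)
    return precision, recall, miss_rate, fppi

def average_precision(precisions, recalls):
    # Trapezoidal approximation of the integral of p(r) over r in [0, 1].
    p = np.asarray(precisions, dtype=float)
    r = np.asarray(recalls, dtype=float)
    order = np.argsort(r)
    p, r = p[order], r[order]
    return float(np.sum(np.diff(r) * (p[1:] + p[:-1]) / 2.0))
```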
5.2.3 Motion Compensation Performance Analysis
Prior to evaluating the detection accuracy of the methods, we need to apply motion compensation to the st-cubes. Thus, we first need to evaluate which motion compensation method performs best. To this end, we created a validation dataset by selecting one video from each dataset. These videos are then used to generate data, using the method introduced in Section 5.2.1. We use the validation set to tune the parameters and then perform the comparison against competing approaches on the test set.
We compare HBT-Regression and CNN-Regression in terms of Root Mean Square Error (RMSE). More formally, we are given a validation set of pairs (X_i, S_i^a), i ∈ 1..N, where X_i is a patch and S_i^a ∈ R^2 corresponds to the true shift of the object from the center of the patch. Let S_i^p = φ(X_i) ∈ R^2 be the prediction of that shift, obtained by the motion compensation system. Then the RMSE is computed as

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (S_i^p − S_i^a)^2 ).   (7)

Note that S_i^p and S_i^a do not depend on the size of the patch.
Table 1 depicts the results of this comparison. CNN-Regression outperforms HBT-Regression on both datasets. For reference, we also provide RMSE_0, which is computed as

RMSE_0 = sqrt( (1/N) Σ_{i=1}^{N} (S_i^a)^2 )   (8)

and reflects the case when no motion compensation is applied.

                                        RMSE
method                             UAV dataset   Aircraft dataset
No motion compensation (RMSE_0)      0.1474         0.1451
HBT-Regression                       0.0939         0.0805
CNN-Regression                       0.0669         0.0749

Table 1: Performance of the motion compensation methods. The evaluation was performed on the validation subsets of the UAV and Aircraft datasets.
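Equations (7) and (8) amount to the following few lines of NumPy, where S_pred and S_true are N x 2 arrays holding the predicted and ground-truth shifts (expressed relative to the patch size, as above):

```python
import numpy as np

def rmse(S_pred, S_true):
    # Eq. (7): root mean square error between predicted and true object shifts,
    # with the squared term taken as the squared Euclidean distance of the 2D shifts.
    d = np.asarray(S_pred, dtype=float) - np.asarray(S_true, dtype=float)
    return float(np.sqrt(np.mean(np.sum(d ** 2, axis=1))))

def rmse0(S_true):
    # Eq. (8): error when no motion compensation is applied (zero predicted shift).
    S_true = np.asarray(S_true, dtype=float)
    return float(np.sqrt(np.mean(np.sum(S_true ** 2, axis=1))))
```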
We therefore used the CNN-Regression algorithm to produce a number of aligned st-cubes of sizes ranging from (s_x, s_y, s_t) = (28, 28, 4) to (s_x, s_y, s_t) = (40, 40, 11), some of which we used for training and others for testing. For patches smaller than 40 × 40, we simply upscale them to 40 × 40 before applying the motion compensation regressors. The choice of s_t controls the trade-off between detecting far away objects using large values and closer ones using smaller ones. This is because, when the object is very close, the apparent motion may become too large for our motion compensation scheme. We found that increasing s_t beyond 11 did not bring any improvement in performance, while decreasing it below 4 left us with too little motion information.
As described above, we used the same video sequences to select the most appropriate size s_t for the st-cube. Fig. 11 summarizes our experiments in terms of average miss-rate curves. The legend of the plot describes the set-up used during the experiments. The numbers in brackets correspond to the (s_x, s_y, s_t) dimensions of the st-cube. The curves in the legend are ordered so that the curve that is highest in terms of the MR|FPPI=1 measure comes first; the lowest curve corresponds to the best performing set-up. For the different detection algorithms, we show the best performing results by making the curves darker.
The classifiers of Section 3.1 rely on boosted trees operating on HoG3D descriptors [14]. We computed them using the default parameters, that is, 24 orientations per bin of size 4 × 4 × 2 pixels. The Boosted trees detector uses 1500 trees of depth 2. We will further refer to this method as HBT-Detection.

Average Precision of the HBT-Detection algorithm together with different motion compensation methods:
motion compensation           UAV dataset   Aircraft dataset
No motion compensation           0.485          0.497
Optical flow                     0.540          0.652
HBT-Regression                   0.751          0.789
CNN-Regression                   0.849          0.864

Figure 12: Comparison of motion compensation methods on the test subsets of our datasets ((a) UAV dataset, (b) Aircraft dataset). For all the motion compensation algorithms we used the same HBT-Detection approach, as it proved to be more accurate than CNN-Detection. Unlike the optical flow-based algorithm, our regression-based ones properly identify the shift in object position and correct for it, even when the background is complex and the object outlines are barely visible. This yields better precision/recall. The table at the bottom of the figure gives the Average Precision scores for the methods presented above.

For the CNNs of Section 3.2, we tried different network configurations, with variations of the number and size of filters in the convolutional layers and varying numbers of fully connected layers. In the end, they all yielded very similar results. The final configuration that we used is illustrated by Fig. 5. We will refer to this method as CNN-Detection.
As depicted by Fig. 11, the HBT and CNN detectors perform similarly on the plane dataset, but the former clearly outperforms the latter on the UAV dataset when we allow a single false positive per frame on average. This may seem surprising, but similar behaviors have been reported by [43], where the top four methods rely on decision forests while the Deep Learning approach ranks only sixth. In our case, this may be attributable to the size of the training database not being large enough to take full advantage of the power of CNNs. Furthermore, for tasks that require as few false positives as possible, the CNNs win.
In any event, these experiments suggest that the optimal dimension of the st-cube depends on the task at hand. The apparent size of the UAVs is small, which favors a large temporal dimension. As can be seen in Fig. 11(a), the best results are obtained for s_t = 11. By contrast, the Aircraft dataset comprises examples of planes flying at many different distances from the camera. In this case, s_t = 7 is optimal for both HoG3D descriptors and CNNs.

5.2.4 Detection-Based Evaluation
Another way to evaluate our motion compensation algorithm is to compare detectors trained on data processed with either the HBT-Regression or the CNN-Regression method. This measures the impact of the quality of motion compensation on the final detection performance.
Figure 14: Comparing against the motion-based methods of [21], [26] (rows: our approach, background subtraction, optical flow). (First row) Our detector finds the objects by relying on motion and appearance, as evidenced by the green rectangles. (Middle row) Background subtraction results of [21]. Only in the leftmost frame of the three on the left is there a blob that corresponds to a UAV, along with one that does not. Similarly, there is a small blob that corresponds to a plane in the central frame of the three right-most ones and many large ones in the others that do not clearly correspond to anything. (Bottom row) Optical flow computed using the algorithm of [26]. The plane and UAV generate a distinctly visible pattern in 2 of the 3 right-most images but in none of the three left-most ones. (best seen in color)
Figure 16: Some detection results on the UAV dataset and the Aircraft dataset. Thumbnails at the side of each figure show zoomed-in versions of the detections made by our algorithm.

st-cube               AveP (Average Precision)
W/o compensation             0.907
With compensation            0.904
HBT-Detection                 # scales
without scale adjustment          4      51%
without scale adjustment          8      50%
with scale adjustment             8      54%
with scale adjustment            16      52%
with scale adjustment            32      48%

Table 4: Evaluation of the HBT-Detection method on the UAV dataset with and without scale adjustment. Both methods perform better when more scales are used, at the cost of increased computation time.

Our datasets are publicly available at the following link: http://cvlab.epfl.ch/research/unmanned/detection.

7 CONCLUSION
We showed that temporal information from a sequence of frames plays a vital role in the detection of small, fast moving objects such as UAVs or aircraft in complex outdoor environments. We therefore developed an object-centric, learning-based motion compensation approach that is robust to changes in the appearance of both the object and the background. Both the CNN and the Boosted trees methods allow us to outperform state-of-the-art techniques on two challenging datasets. The CNN proved to be more suitable for motion compensation than the Boosted trees introduced in our previous work [15].
To evaluate our algorithms, we collected two challenging datasets for UAV and aircraft detection. We hope that these datasets will be used as a new benchmark for improving flying object detection and vision-based aerial collision avoidance.

8 ACKNOWLEDGMENTS
This work was conducted in the context of the "Visual detection and tracking of flying objects in Unmanned Aerial Vehicles" project, funded by Honeywell International, Inc.

REFERENCES
[1] "Mercedes-Benz Intelligent Drive," http://techcenter.mercedes-benz.com/en/intelligent drive/detail.html/.
[2] "Mobileye Inc.," http://us.mobileye.com/technology/.
[3] G. Conte and P. Doherty, "An Integrated UAV Navigation System Based on Aerial Image Matching," in IEEE Aerospace Conference, 2008, pp. 3142–3151.
[4] C. Martínez, I. F. Mondragón, M. Olivares-Méndez, and P. Campoy, "On-Board and Ground Visual Pose Estimation Techniques for UAV Control," Journal of Intelligent and Robotic Systems, vol. 61, no. 1-4, pp. 301–320, 2011.
[5] L. Meier, P. Tanskanen, F. Fraundorfer, and M. Pollefeys, "PIXHAWK: A System for Autonomous Flight Using Onboard Computer Vision," in IEEE International Conference on Robotics and Automation, 2011.
[6] C. Hane, C. Zach, J. Lim, A. Ranganathan, and M. Pollefeys, "Stereo Depth Map Fusion for Robot Navigation," in Proceedings of the International Conference on Intelligent Robots and Systems, 2011, pp. 1618–1625.
[7] S. Weiss, M. Achtelik, S. Lynen, M. Achtelik, L. Kneip, M. Chli, and R. Siegwart, "Monocular Vision for Long-Term Micro Aerial Vehicle State Estimation: A Compendium," Journal of Field Robotics, vol. 30, pp. 803–831, 2013.
[8] S. Lynen, M. Achtelik, S. Weiss, M. Chli, and R. Siegwart, "A Robust and Modular Multi-Sensor Fusion Approach Applied to MAV Navigation," in Conference on Intelligent Robots and Systems, 2013.
[9] C. Forster, M. Pizzoli, and D. Scaramuzza, "SVO: Fast Semi-Direct Monocular Visual Odometry," in International Conference on Robotics and Automation, 2014.
[10] T. Zsedrovits, A. Zarándy, B. Vanek, T. Peni, J. Bokor, and T. Roska, "Visual Detection and Implementation Aspects of a UAV See and Avoid System," in European Conference on Circuit Theory and Design, 2011.
[11] D. Park, C. L. Zitnick, D. Ramanan, and P. Dollár, "Exploring Weak Stabilization for Motion Feature Extraction," in Conference on Computer Vision and Pattern Recognition, 2013.
[12] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior Recognition via Sparse Spatio-Temporal Features," in VS-PETS, October 2005, pp. 65–72.
[13] I. Laptev, "On Space-Time Interest Points," International Journal of Computer Vision, 2005.
[14] D. Weinland, M. Ozuysal, and P. Fua, "Making Action Recognition Robust to Occlusions and Viewpoint Changes," in European Conference on Computer Vision, 2010.
[15] A. Rozantsev, V. Lepetit, and P. Fua, "Flying Objects Detection from a Single Moving Camera," in Conference on Computer Vision and Pattern Recognition, 2015.
[16] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object Detection with Discriminatively Trained Part Based Models," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
[17] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, and Y. Bengio, "Theano: New Features and Speed Improvements," 2012.
[18] A. Bosch, A. Zisserman, and X. Munoz, "Image Classification Using Random Forests and Ferns," in International Conference on Computer Vision, 2007.
[19] P. Dollár, Z. Tu, P. Perona, and S. Belongie, "Integral Channel Features," in British Machine Vision Conference, 2009.
[20] N. Oliver, B. Rosario, and A. Pentland, "A Bayesian Computer Vision System for Modeling Human Interactions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 831–843, 2000.
[21] N. Seungjong and J. Moongu, "A New Framework for Background Subtraction Using Multiple Cues," in Asian Conference on Computer Vision. Springer Berlin Heidelberg, 2013, pp. 493–506.
[22] A. Sobral, "BGSLibrary: An OpenCV C++ Background Subtraction Library," in IX Workshop de Visao Computacional, 2013.
[23] D. Zamalieva and A. Yilmaz, "Background Subtraction for the Moving Camera: A Geometric Approach," Computer Vision and Image Understanding, vol. 127, pp. 73–85, 2014.
[24] T. Brox and J. Malik, "Object Segmentation by Long Term Analysis of Point Trajectories," in European Conference on Computer Vision, 2010, pp. 282–295.
[25] B. Lucas and T. Kanade, "An Iterative Image Registration Technique with an Application to Stereo Vision," in International Joint Conference on Artificial Intelligence, 1981, pp. 674–679.
[26] T. Brox and J. Malik, "Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[27] Y. Zhang, S.-J. Kiselewich, W.-A. Bauson, and R. Hammoud, "Robust Moving Object Detection at Distance in the Visible Spectrum and Beyond Using a Moving Camera," in Conference on Computer Vision and Pattern Recognition, 2006.
[28] S.-W. Kim, K. Yun, K.-M. Yi, S.-J. Kim, and J.-Y. Choi, "Detection of Moving Objects with a Moving Camera Using Non-Panoramic Background Model," Machine Vision and Applications, vol. 24, pp. 1015–1028, 2013.
[29] S. Kwak, T. Lim, W. Nam, B. Han, and J. Han, "Generalized Background Subtraction Based on Hybrid Inference by Belief Propagation and Bayesian Filtering," in International Conference on Computer Vision, 2011, pp. 2174–2181.
[30] A. Elqursh and A. Elgammal, "Online Moving Camera Background Subtraction," in European Conference on Computer Vision, 2012, pp. 228–241.
[31] M. Narayana, A. Hanson, and E. Learned-Miller, "Coherent Motion Segmentation in Moving Camera Videos Using Optical Flow Orientations," in International Conference on Computer Vision, 2013.
[32] A. Papazoglou and V. Ferrari, "Fast Object Segmentation in Unconstrained Video," in International Conference on Computer Vision, 2013.
[33] S. Walk, N. Majer, K. Schindler, and B. Schiele, "New Features and Insights for Pedestrian Detection," in Conference on Computer Vision and Pattern Recognition, 2010.
[34] J. Friedman, "Stochastic Gradient Boosting," Computational Statistics & Data Analysis, 2002.
[35] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in Conference on Computer Vision and Pattern Recognition, 2005.
[36] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proceedings of the IEEE, 1998.
[37] X. Glorot, A. Bordes, and Y. Bengio, "Deep Sparse Rectifier Neural Networks," in International Conference on Artificial Intelligence and Statistics, 2011.
[38] M. D. Zeiler, "ADADELTA: An Adaptive Learning Rate Method," Computing Research Repository, 2012.
[39] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[40] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2009.
[41] R. Sznitman, C. Becker, F. Fleuret, and P. Fua, "Fast Object Detection with Entropy-Driven Evaluation," in Conference on Computer Vision and Pattern Recognition, 2013, pp. 3270–3277.
[42] A. Vedaldi and B. Fulkerson, "VLFeat: An Open and Portable Library of Computer Vision Algorithms," http://www.vlfeat.org/, 2008.
[43] R. Benenson, O. Mohamed, J. Hosang, and B. Schiele, "Ten Years of Pedestrian Detection, What Have We Learned?" in European Conference on Computer Vision, 2014.
[44] L. Breiman, "Random Forests," Machine Learning, 2001.
[45] P. Dollár, "Piotr's Computer Vision Matlab Toolbox (PMT)," http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html.
[46] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, "What is the Best Multi-Stage Architecture for Object Recognition?" in Conference on Computer Vision and Pattern Recognition, 2009.
[47] J. Jin, K. Fu, and C. Zhang, "Traffic Sign Recognition with Hinge Loss Trained Convolutional Neural Networks," IEEE Transactions on Intelligent Transportation Systems, vol. 15, pp. 1991–2000, 2014.
[48] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, "Spatial Transformer Networks," arXiv Preprint, 2015.
[49] M. Oberweger, P. Wohlhart, and V. Lepetit, "Hands Deep in Deep Learning for Hand Pose Estimation," arXiv Preprint, 2015.

Artem Rozantsev joined EPFL in 2012 as a Ph.D. candidate at the Computer Vision Laboratory, under the supervision of Prof. Pascal Fua and Prof. Vincent Lepetit. He received his Specialist degree in Mathematics and Computer Science in 2012 from Lomonosov Moscow State University. His research interests include object detection, synthetic data generation, and machine learning.

Vincent Lepetit is a Professor at the Institute for Computer Graphics and Vision, TU Graz, and a Visiting Professor at the Computer Vision Laboratory, EPFL. He received the PhD degree in Computer Vision in 2001 from the University of Nancy, France, after working in the ISA INRIA team. He then joined the Virtual Reality Lab at EPFL as a post-doctoral fellow and became a founding member of the Computer Vision Laboratory. He became a Professor at TU Graz in February 2014. His research interests include vision-based Augmented Reality, 3D camera tracking, Machine Learning, object recognition, and 3D reconstruction. He often serves as program committee member and area chair of major vision conferences (CVPR, ICCV, ECCV, ACCV, BMVC). He is an editor for the International Journal of Computer Vision (IJCV) and the Computer Vision and Image Understanding (CVIU) journal.

Pascal Fua received an engineering degree from Ecole Polytechnique, Paris, in 1984 and the Ph.D. degree in Computer Science from the University of Orsay in 1989. He then worked at SRI International and INRIA Sophia-Antipolis as a Computer Scientist. He joined EPFL in 1996, where he is now a Professor in the School of Computer and Communication Science and heads the Computer Vision Laboratory. His research interests include shape modeling and motion recovery from images, analysis of microscopy images, and Augmented Reality. He has (co)authored over 200 publications in refereed journals and conferences. He is an IEEE Fellow and has been an associate editor of the IEEE journal Transactions on Pattern Analysis and Machine Intelligence. He often serves as program committee member, area chair, and program chair of major vision conferences.