Vehicle Detection from UAVs by Using SIFT with Implicit Shape Model
Abstract — In recent years, unmanned aerial vehicles (UAVs) have gained great importance in both military and civilian applications. In this paper, we propose a vehicle detection method for UAVs that integrates the Scale Invariant Feature Transform (SIFT) and the Implicit Shape Model (ISM). Firstly, a set of keypoints is detected in the test image using SIFT. Secondly, feature descriptors around the keypoints are generated and integrated by the ISM. Support Vector Machines (SVMs) are applied during keypoint selection. The experiment used a video shot by a UAV over a highway, and the results show the performance and effectiveness of the method.

Keywords: Vehicle detection, Scale Invariant Feature Transform (SIFT), Implicit Shape Model (ISM), Unmanned Aerial Vehicle (UAV).

I. INTRODUCTION

Unmanned Aerial Vehicles (UAVs) have become a new generation in the worldwide aviation industry. UAVs have been developed and used in military missions because they can achieve absolute "zero" casualties. In recent years, UAVs have also been used in civilian applications due to their potential abilities: high mobility, fast deployment and a wide surveillance scope, as well as the ability to operate in extreme environments and weather. UAVs can be equipped with different types of imaging cameras depending on the mission. They also carry GPS on board, together with automatic positioning and stabilization systems. A basic UAV operation involves flying over an area and sending a live video stream back to a human operator. This can easily become tedious for the operator when a target object is as complicated as the background clutter. As a result, the UAV needs a "brain" to think for itself rather than being solely human controlled.

Vehicle detection from aerial images is a growing research topic in surveillance, traffic monitoring and military applications such as border control. There has been considerable research into vehicle detection and tracking in airborne videos; however, moving-vehicle detection from a UAV platform still remains very challenging under some circumstances. The main challenges include complex backgrounds, fast movement of the UAV resulting in blurred images, and limited computational resources. [1] implemented object detection in UAV imagery based on colour and edge detection methods, combining multi-scale mean-shift segmentation with novel histogram enhancement and multi-channel edge information to construct a robust saliency map from a given UAV image. It extracted nine features from different colour channels of the original image and used these features to obtain the edges, ultimately enabling the detection of artificial objects. [2] developed a texture-based method for vehicle detection: first, the background is estimated based on a Gaussian distribution hypothesis; second, the texture is extracted and analysed by the Fast Wavelet Transform (FWT) and the Grey Level Co-occurrence Matrix (GLCM), which produced better detection results than the colour methods. A considerable number of approaches use learning methods, in which object detection is learned from a set of training samples: each training sample is processed to create certain features, and the decision is then made by a trained classifier. The texture descriptor Histograms of Oriented Gradients (HoG) has been proposed for vehicle detection in [3] [4] [5] [6] [7]. This approach divides the image into small rectangular cells and then computes the histogram of the gradient orientations in each cell; these histograms are used as feature vectors for a Support Vector Machine (SVM) to make the decisions. [5] applied disparity maps to improve the detection accuracy. Based on HoG, a Boosting Light and Pyramid Sampling Histogram of Oriented Gradients was proposed in [6], which speeds up the detection process by using fewer gradient samples; a pyramid structure is used for the computation of HoG, and the obtained features are boosted to reduce the dimensionality of the feature vectors. [7] integrated HoG with other feature extraction techniques for vehicle detection. An SVM classifier is applied in [3] [4] [5] [6] [7].

In this paper, we propose a vehicle detection method based on the feature extraction process of the Scale Invariant Feature Transform (SIFT), which is a point matching method. A point carries more specific information than the texture-based methods, which involve region processing [14]. SIFT can identify consistent keypoints between the test image and the training samples. The keypoints are then classified by a Support Vector Machine (SVM) classifier, and the Implicit Shape Model is applied to cluster the keypoints and detect the vehicle. The reason we use the SVM is that it is able to achieve an optimal solution for the detection threshold, which can detect similar keypoints in the test images.
Also, SIFT has a shorter processing time than ASIFT. We therefore chose SIFT for the keypoint detection in our project.
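To make the overall flow concrete, the following is a minimal Python sketch of the proposed pipeline, assuming OpenCV and a scikit-learn-style classifier; `svm` is a pre-trained classifier and `ism_filter` is a placeholder for the ISM voting step described in Section III (both names are illustrative, not from the paper):

```python
# Sketch only: SIFT keypoints -> per-descriptor SVM labels -> ISM-style filtering.
import cv2

def detect_vehicles(image_bgr, svm, ism_filter):
    """Return the keypoints classified as vehicle after ISM filtering."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)  # 128-D descriptors
    if descriptors is None:                  # no keypoints found in this frame
        return []
    labels = svm.predict(descriptors)        # 1 = vehicle, 0 = environment
    candidates = [kp for kp, y in zip(keypoints, labels) if y == 1]
    return ism_filter(candidates)            # voting rejects isolated false positives
```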
III. THE VEHICLE DETECTION METHOD

Since our task is to detect vehicles from UAVs, the typical features of vehicles need to be defined and detected. Because the video is captured by a UAV, the vehicle features need to be invariant to scale, rotation and translation, and unaffected by illumination. The SIFT algorithm is a local feature extraction algorithm that finds keypoints which are invariant to scale, rotation and translation. SIFT keypoints are the extreme points of the differences of the Gaussian scale space: in the Gaussian images, each keypoint is either a maximum or a minimum of the comparison with its 26 neighbourhood pixels across the current, upper and lower scales. The SIFT algorithm removes unstable extreme points and refines the pixel position of each keypoint using the Taylor expansion and the Hessian matrix. The gradient magnitudes and directions of the pixels in the neighbourhood of each keypoint are also calculated to give the keypoint its own scale and direction. Since these keypoints are invariant to such image transformations, the method can achieve a better detection result.
A. SIFT feature extraction

SIFT feature extraction has four main steps:

1) Input the image and detect extreme points
2) Determine and filter the keypoints
3) Orientate each keypoint
4) Generate the SIFT vector for each keypoint
1) Detect extreme points

The keypoints in SIFT are the stable points across the image; this step identifies the possible scale-invariant locations. The scale space of an image is defined as L(x, y, σ), produced by convolving a variable-scale Gaussian function G(x, y, σ) with the input image I(x, y), where σ is the scale factor:

L(x, y, σ) = G(x, y, σ) ∗ I(x, y)    (1)

The Difference of Gaussians (DoG) is used to detect the stable keypoints as scale-space extrema:

D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y)    (2)

D(x, y, σ) = L(x, y, kσ) − L(x, y, σ)    (3)

where k is a constant factor which separates two adjacent scales of the image and D(x, y, σ) is the DoG.

To find the keypoints that are invariant to scale, it is necessary to take the differences of adjacent scale images, so we build a pyramid to calculate the DoG. After generating the DoG, we compare each pixel with its neighbours: every candidate is compared with its 8 neighbours in the same scale and 2 × 9 = 18 neighbours in the upper and lower scales, making 8 + 2 × 9 = 26 neighbours in total. If a point is the maximum or minimum among all its 26 neighbours, it is classified as a keypoint at that image scale.
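As an illustration of this step, here is a small NumPy/SciPy sketch restricted to a single octave; the σ values are illustrative, and a full implementation would repeat this octave by octave:

```python
# Build Gaussian images L, take adjacent differences D (eq. (3)), and keep the
# points that are extrema over their 26 scale-space neighbours.
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_extrema(image, sigmas=(1.6, 2.26, 3.2, 4.53, 6.4)):
    L = np.stack([gaussian_filter(image.astype(float), s) for s in sigmas])
    D = L[1:] - L[:-1]                 # D(x, y, sigma) = L(x, y, k*sigma) - L(x, y, sigma)
    extrema = []
    for s in range(1, D.shape[0] - 1):       # needs one scale above and below
        for y in range(1, D.shape[1] - 1):
            for x in range(1, D.shape[2] - 1):
                cube = D[s-1:s+2, y-1:y+2, x-1:x+2]   # centre plus its 26 neighbours
                v = D[s, y, x]
                if v == cube.max() or v == cube.min():
                    extrema.append((x, y, sigmas[s]))
    return extrema
```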
2) Determine and filter the keypoints

The DoG is sensitive to noise and to edges in the image, so we filter out the low-contrast points and the poorly localized points along edges. By fitting a 3D quadratic function, the location and scale of each keypoint can also be determined more accurately.
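A sketch of the two rejection tests (the sub-pixel 3D quadratic fit is omitted); the threshold values follow the common choices in Lowe [9] and assume intensities scaled to [0, 1]:

```python
# Reject a candidate if its DoG response is weak (low contrast) or if the 2x2
# Hessian of D indicates an elongated, edge-like response.
import numpy as np

def keep_keypoint(D, x, y, contrast_thresh=0.03, r=10.0):
    if abs(D[y, x]) < contrast_thresh:            # low-contrast rejection
        return False
    dxx = D[y, x+1] + D[y, x-1] - 2.0 * D[y, x]   # finite-difference Hessian of D
    dyy = D[y+1, x] + D[y-1, x] - 2.0 * D[y, x]
    dxy = (D[y+1, x+1] - D[y+1, x-1] - D[y-1, x+1] + D[y-1, x-1]) / 4.0
    trace, det = dxx + dyy, dxx * dyy - dxy * dxy
    if det <= 0:                                  # curvatures of opposite sign
        return False
    return trace * trace / det < (r + 1.0) ** 2 / r   # edge-response test
```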
3) Orientate each keypoint

For each keypoint, a direction pointing to the maximum of the gradient-direction histogram is generated, and the subsequent descriptor construction takes this direction as its reference. For an image L(x, y), the gradient magnitude m(x, y) and the orientation θ(x, y) are calculated as:

m(x, y) = √((L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))²)    (4)

θ(x, y) = tan⁻¹((L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y)))    (5)

The sampling area is centred on the keypoint, and a histogram of the gradient directions of the neighbouring pixels is accumulated. The range of the gradient histogram is 0 to 360 degrees (36 bins of 10 degrees each). The main direction of the keypoint is given by the peak of the histogram, and any other bin whose energy reaches 80% of the peak value is kept as a secondary direction of the keypoint. As a result, each keypoint is assigned a total of 8 directions.
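The following sketch implements equations (4) and (5) together with the 36-bin histogram and the 80%-of-peak rule; the window radius is an assumption, and the keypoint is taken to lie far enough from the image border:

```python
import numpy as np

def dominant_orientations(L, x, y, radius=8):
    win = L[y-radius:y+radius+1, x-radius:x+radius+1].astype(float)
    dx = win[1:-1, 2:] - win[1:-1, :-2]             # L(x+1, y) - L(x-1, y)
    dy = win[2:, 1:-1] - win[:-2, 1:-1]             # L(x, y+1) - L(x, y-1)
    m = np.sqrt(dx**2 + dy**2)                      # eq. (4): gradient magnitude
    theta = np.degrees(np.arctan2(dy, dx)) % 360.0  # eq. (5): orientation in [0, 360)
    hist, _ = np.histogram(theta, bins=36, range=(0.0, 360.0), weights=m)
    peak = hist.max()
    # main direction, plus every bin whose energy reaches 80% of the peak
    return [(b + 0.5) * 10.0 for b in range(36) if hist[b] >= 0.8 * peak]
```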
4) Generate the SIFT vector

In the last step, each keypoint is assigned a descriptor built from the gradients, to achieve further invariance. During the construction of the descriptor, its sampling window is rotated to the main direction of the keypoint, which provides the rotational invariance. After that, an 8×8 window around the keypoint is created; Lowe [9] proposed using 4 × 4 = 16 seed points to describe each keypoint in this calculation in order to increase the matching rate. A 4 × 4 × 8 = 128-dimensional vector is thus formed for each keypoint. This vector is then normalized to unit length, which reduces the effect of illumination changes: any uniform change of contrast is cancelled by the vector normalization.

The SIFT feature has now removed the effects of changes in scale, rotation and illumination. We obtain two outputs: the first holds the coordinate position of each keypoint together with its orientation and scale, and the second is the 128-dimensional descriptor. Ultimately, we can locate a keypoint by its coordinates and classify it by its descriptor.
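Since this construction is what standard SIFT implementations provide, a short sketch using OpenCV as a stand-in (the file name is hypothetical) shows the two outputs, the keypoint geometry and the 128-dimensional descriptor:

```python
import cv2
import numpy as np

image = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # hypothetical test frame
keypoints, descriptors = cv2.SIFT_create().detectAndCompute(image, None)

# Assuming the frame produced at least one keypoint:
print(descriptors.shape)                 # (num_keypoints, 128)
unit = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)  # unit length
kp = keypoints[0]
print(kp.pt, kp.size, kp.angle)          # position, scale and orientation
```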
B. Implicit Shape Model

We use the Implicit Shape Model (ISM) to integrate the keypoints in order to increase the detection accuracy. It works as follows:

1. Detect the interest points in the image, in a similar way to SIFT.
2. Generate a codebook (visual vocabulary) from the training data.
3. Perform probabilistic voting for the matches against the codebook.
4. Form the object hypotheses.
5. Segment the objects.

We combine the codebook and the probabilistic voting with the SIFT method. To generate a codebook for the vehicles, we extract 60×60-pixel image patches around the position of each keypoint. Each patch starts as a separate cluster, and similar clusters are merged as long as the average similarity between their constituent patches stays above the threshold. Consequently, the vehicle keypoints should have a high-density distribution in the test image, and the codebook can separate out the environment keypoints, which have a lower-density distribution.
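A minimal sketch of this codebook step, using normalized cross-correlation as the patch similarity and a greedy merge; the similarity threshold is an assumption:

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two equally sized patches."""
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float((a * b).mean())

def build_codebook(patches, sim_thresh=0.7):
    """patches: list of 60x60 arrays cut around training keypoints."""
    clusters = []
    for p in patches:
        for c in clusters:
            # merge while the average similarity to the cluster stays above threshold
            if np.mean([ncc(p, q) for q in c]) > sim_thresh:
                c.append(p)
                break
        else:
            clusters.append([p])          # otherwise the patch opens a new cluster
    return [np.mean(c, axis=0) for c in clusters]   # one codebook entry per cluster
```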
C. SVM Classification

We use an SVM classifier to classify the keypoints as either vehicle or environment. We create a set of N training samples and label each sample with either 1 or 0 depending on its class. During the training process, the SVM finds a hyperplane in a kernel-induced feature space which divides the training samples into the two groups.
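A sketch of this training stage using scikit-learn's SVC, which wraps the LIBSVM library cited by the paper [15]; the descriptor and label files are hypothetical placeholders:

```python
import numpy as np
from sklearn.svm import SVC

X = np.load("train_descriptors.npy")   # (N, 128) SIFT descriptors, hypothetical file
y = np.load("train_labels.npy")        # 1 = vehicle keypoint, 0 = environment

clf = SVC(kernel="rbf")                # hyperplane in a kernel-induced feature space
clf.fit(X, y)
print(clf.score(X, y))                 # training accuracy as a sanity check
```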
IV. EXPERIMENT

A. Training samples

First of all, we created a set of training samples from a video taken by the UAV above a motorway. We also carefully separated the vehicles from the environment to reduce false detections (Fig. 1).
Figure 1. Some of the car training samples with (left) and without (right) environment.

We created 850 car samples and 260 environment samples for the training (Table II).

TABLE II. THE TOTAL NUMBER OF TRAINING SAMPLES

    Training samples    Images    Total keypoints
    Vehicle                850          42617
    Environment            260          16518
    Total                 1110          59135

B. SVM and ISM

We applied these training samples to train the SVM classifier, with label 1 being the vehicle and label 0 the environment or anything else.

During testing, we calculate the SIFT keypoints in each test image; each point has a 128-dimensional descriptor. We feed these descriptors into the trained SVM classifier, and the SVM decides which class each point belongs to. Because of the substantial number of training samples, it is possible that some environment points have descriptor values similar to the vehicle points, so we use the ISM as a filter to further improve the detection. After the SIFT matching (Fig. 3) we obtain the points that are likely to be classified as vehicles by the SVM. We then generate the codebook entries for the matching keypoints (Fig. 4). According to the codebook, the algorithm votes for the best-matching keypoints and produces the final result (Fig. 5).

Figure 3. SIFT points and matching points in the test image.
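The density argument above can be sketched as a simple spatial voting filter over the SVM-positive keypoints; the radius and vote threshold are assumptions, standing in for the paper's codebook-based voting:

```python
import numpy as np

def ism_filter(points, radius=30.0, min_votes=4):
    """points: (N, 2) array of x, y positions of SVM-positive keypoints."""
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        votes = int(np.sum(np.linalg.norm(pts - p, axis=1) < radius)) - 1
        if votes >= min_votes:       # densely supported points are kept as vehicle
            keep.append(i)
    return pts[keep]
```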
Figure 4. The codebook for the matching points.

Vehicle detection information:

    Features     Total car    Positive car    False positive    False negative    Accuracy
    SIFT            107            71               5                 36           77.65%
    SIFT + ISM       88            78               1                  9           90.59%
We compared the detection performance of three feature configurations, HOG, SIFT with ISM, and SIFT with ISM plus AdaBoost, across different sample sizes (Table VI). The results show that AdaBoost can improve the classification process and slightly increases the detection results.
Figure 6. SIFT point detection with different numbers of training samples.

We used six training sets of different sizes to train the classifier. As expected, the more training samples we used, the better the detection results we obtained. However, the growth in detected points declines, which indicates that the training set is approaching saturation (Table V).
V. CONCLUSION

In this paper we developed a method of vehicle detection from a UAV. The method matches keypoints that are invariant to affine transformations against a codebook that rates the features.

The final result shows that the method achieves a 94.13% accuracy rate. This result is affected by the size of the training set: the more training samples we used, the higher the detection accuracy we obtained. The processing speed is still slow compared with real-time processing, because SIFT needs to process every keypoint for the vehicle detection. In future work, we will increase the size of both the vehicle and the environment training sets. We will also reduce the processing time of the SIFT point matching by reducing the 128-dimensional descriptors to 64 dimensions, which can roughly halve the matching time of the SIFT method.
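The planned 128-to-64 reduction is close in spirit to PCA-SIFT [10]; a sketch of such a projection (the file name is hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA

descriptors = np.load("train_descriptors.npy")   # hypothetical (N, 128) array
pca = PCA(n_components=64).fit(descriptors)
reduced = pca.transform(descriptors)             # (N, 64): about half the matching cost
```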
REFERENCES

[1] S. Jan and P. B. Toby, "Automatic salient object detection in UAV imagery," 25th International UAV Systems Conference, School of Engineering, Cranfield University, UK.
[2] P. Lin, J. Xu, and J. Bian, "Robust vehicle detection in vision systems based on fast wavelet transform and texture analysis," IEEE International Conference on Automation and Logistics, Jinan, China, Aug. 2007, pp. 2958-2963.
[3] F. Han, Y. Shan, R. Cekander, H. S. Sawhney, and R. Kumar, "A two-stage approach to people and vehicle detection with HOG-based SVM," Performance Metrics for Intelligent Systems Workshop, 2006, pp. 133-140.
[4] T. Gandhi and M. M. Trivedi, "Video based surround vehicle detection, classification and logging from moving platforms: issues and approaches," IEEE Intelligent Vehicles Symposium, Istanbul, Turkey, June 2007, pp. 1067-1071.
[5] S. Tuermer, F. Kurz, P. Reinartz, and U. Stilla, "Airborne vehicle detection in dense urban areas using HoG features and disparity maps," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 99, pp. 1-11, Feb. 2013.
[6] X. Cao, C. Wu, J. Lan, P. Yan, and X. Li, "Vehicle detection and motion analysis in low-altitude airborne video under urban environment," IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 10, pp. 1522-1533, Oct. 2011.
[7] J. Gleason, A. V. Nefian, X. Bouyssounousse, T. Fong, and G. Bebis, "Vehicle detection from aerial imagery," IEEE International Conference on Robotics and Automation, Shanghai, China, 2011, pp. 2065-2070.
[8] D. G. Lowe, "Object recognition from local scale-invariant features," International Conference on Computer Vision, Corfu, Greece, Sept. 1999, pp. 1150-1157.
[9] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[10] Y. Ke and R. Sukthankar, "PCA-SIFT: a more distinctive representation for local image descriptors," Proc. Conf. Computer Vision and Pattern Recognition, vol. 2, July 2004, pp. II-506 - II-513.
[11] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: speeded up robust features," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346-359, 2008.
[12] J. M. Morel and G. Yu, "ASIFT: a new framework for fully affine invariant image comparison," SIAM Journal on Imaging Sciences, vol. 2, no. 2, pp. 438-469, 2009.
[13] J. Matas, O. Chum, M. Urban, and T. Pajdla, "Robust wide baseline stereo from maximally stable extremal regions," Proc. British Machine Vision Conference, 2002, pp. 384-396.
[14] L. Juan and O. Gwun, "A comparison of SIFT, PCA-SIFT and SURF," International Journal of Image Processing, vol. 3, no. 4, pp. 143-152, 2009.
[15] C.-C. Chang and C.-J. Lin, LIBSVM: A Library for Support Vector Machines. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm
[16] A. Vedaldi and B. Fulkerson, VLFeat. [Online]. Available: http://www.vlfeat.org/index.html