
2013 IEEE International Conference on Systems, Man, and Cybernetics

Vehicle Detection from UAVs by using SIFT with Implicit Shape Model

Xiyan Chen and Qinggang Meng
Department of Computer Science, Loughborough University, Loughborough, UK
E-mail: [email protected], [email protected]

Abstract — In recent years, unmanned aerial vehicles (UAVs) have gained great importance in both military and civilian applications. In this paper, we propose a vehicle detection method for UAVs which integrates the Scale Invariant Feature Transform (SIFT) and the Implicit Shape Model (ISM). Firstly, a set of keypoints is detected in the test image using SIFT. Secondly, feature descriptors around the keypoints are generated using the ISM. Support Vector Machines (SVMs) are applied during keypoint selection. The experiment used a video shot by a UAV over a highway, and the results show the performance and effectiveness of the method.

Keywords: Vehicle detection, Scale Invariant Feature Transform (SIFT), Implicit Shape Model (ISM), Unmanned Aerial Vehicle (UAV).

I. INTRODUCTION

Unmanned Aerial Vehicles (UAVs) have become a new generation in the worldwide aviation industry. UAVs have been developed and used in military missions because they can achieve absolute "zero" casualties. In recent years, UAVs have also been involved in civilian applications due to their potential abilities: high mobility, fast deployment and a wide surveillance scope, as well as being able to be deployed in extreme environments and weather. UAVs can be equipped with different types of imaging cameras depending on the mission. They also carry GPS on board with automatic positioning and stabilization systems. Basic UAV operations involve flying over an area and sending a live video stream back to a human operator. This can easily become tedious for the operator if a target object is as complicated as the background noise. As a result, the UAV needs a "brain" to think for itself rather than being solely human controlled.

Vehicle detection from aerial images is a growing research topic in surveillance, traffic monitoring, and military applications such as border control. There has been research into vehicle detection and tracking in airborne videos. However, moving vehicle detection from a UAV platform still remains very challenging under some circumstances. These challenges include complex backgrounds, fast movement of the UAV resulting in blurred images, and limited computational resources. [1] implemented object detection in UAV imagery based on colour and edge detection methods, combining multi-scale mean-shift segmentation with novel histogram enhancement and multi-channel edge information to construct a robust saliency map from a given UAV image. It extracted 9 features from different colour channels of the original image and used these features to obtain the edges, ultimately enabling detection of artificial objects. [2] developed a texture-based method for vehicle detection. First, the background is estimated based on Gaussian distribution hypotheses. Second, the texture is extracted and analysed by the Fast Wavelet Transform (FWT) and the Grey Level Co-occurrence Matrix (GLCM), which produced better detection results than the colour-based methods. A considerable number of approaches use learning methods. In such approaches, object detection is learned from a set of training samples: each training sample is processed to create certain features, and the decision is then made by a trained classifier. The texture descriptor Histograms of Oriented Gradients (HoG) has been proposed in [3] [4] [5] [6] [7] for vehicle detection. This approach divides the image into small rectangular cells and then computes the histogram of the gradient orientations in each cell. These histograms are used as feature vectors for a Support Vector Machine (SVM) to make the decisions. [5] applied disparity maps to improve the detection accuracy. Based on HoG, a Boosting Light and Pyramid Sampling Histogram of Oriented Gradients was proposed by [6], which speeds up the detection process by using fewer gradient samples; a pyramid structure is used for the computation of HoG, and the obtained features are boosted to reduce the dimensionality of the feature vectors. [7] integrated HoG with other feature extraction techniques for vehicle detection. An SVM classifier has been applied in [3] [4] [5] [6] [7].

In this paper, we propose a vehicle detection method based on the feature extraction process of the Scale Invariant Feature Transform (SIFT), which is a point matching method. A point carries more specific information than a texture processing method, which involves region processing [14]. SIFT can identify the consistent keypoints between the test image and the training samples. The keypoints are then classified by a Support Vector Machine (SVM) classifier, and the Implicit Shape Model is applied to cluster the keypoints to detect the vehicle. The reason we used an SVM is that it is able to achieve the optimal solution for the detection

compared with other learning methods, such as rule-based methods and neural networks, as these are often limited by local minima.

II. RESEARCH ON KEYPOINT DETECTION METHODS

A. SIFT

David G. Lowe first proposed the Scale-Invariant Feature Transform (SIFT) [8] and improved it in [9]. In [9], Lowe not only presented SIFT but also discussed keypoint matching. SIFT is invariant to image scale, rotation and blur, which is a key advantage for vehicle detection from UAVs.

SIFT has four stages: 1. scale-space extrema detection; 2. keypoint localization; 3. orientation assignment; 4. keypoint descriptor. In the first stage, the Difference-of-Gaussian (DoG) function is used to identify the interest points. In the second stage, the low-contrast points and edge responses are rejected: the Hessian matrix is used to calculate the principal curvatures, and a threshold eliminates unstable keypoints. In the third stage, an orientation histogram is created from the gradient orientations of the sample points. In the final stage, a 4×4×8 = 128-dimension descriptor is formed [9].

B. PCA-SIFT

After Lowe, Ke and Sukthankar used Principal Components Analysis (PCA) instead of histograms to normalize the gradient patches [10]. They showed that PCA-based local descriptors were also distinctive and robust to image deformations. PCA is a standard technique for dimensionality reduction and is well suited to describing keypoint patches. However, the speed of extracting robust features with this method is slow.

PCA-SIFT uses PCA instead of a histogram in the gradient normalisation step of SIFT. The PCA feature vector is smaller than the SIFT feature vector and can use the same matching algorithms as SIFT. In PCA-SIFT, 2×39×39 = 3042 elements are created from the horizontal and vertical gradient maps of a 41×41 patch centred on the keypoint [10].

C. SURF

Bay and Tuytelaars proposed Speeded Up Robust Features (SURF), using integral images for image convolutions and a Fast-Hessian detector [11]. The SIFT and SURF algorithms detect feature points in different ways. SIFT builds a pyramid of the image and processes each layer with Gaussians of increasing sigma values before taking the difference. SURF, on the other hand, creates a "stack" without downsampling for the higher levels of the pyramid, resulting in images of the same resolution [11]. It filters the stack using a box-filter approximation of the second-order Gaussian; because it uses integral images, the computation time is greatly reduced, so it works faster while still producing good results.

D. A-SIFT

Jean-Michel Morel and Guoshen Yu proposed Affine-SIFT (A-SIFT) [12], which increases the descriptor information in SIFT. A-SIFT extends SIFT to a fully affine invariant method: it simulates three parameters, namely the scale, the camera longitude angle and the latitude angle, in addition to the rotations, translations and zooms handled by the SIFT detector, and normalizes the translation and rotation. A-SIFT has higher matching accuracy than SIFT, PCA-SIFT and SURF but requires a longer processing time.

E. MSER

In 2002, J. Matas et al. proposed Maximally Stable Extremal Regions (MSERs) [13]. Unlike the point feature algorithms, MSERs are regions that are either darker or brighter than their surroundings and are stable across a range of thresholds of the intensity function. MSER tries to be affine invariant but is not fully scale invariant; its limitation is that it is affected by optical blur under affine transforms.

All five features above are methods for object detection by points. It is thus important to select and use the most appropriate one for the detection. We compared all five methods and found that SIFT gives better results in the detection process. Following [14], we made a table comparing all of the detection features.

TABLE I. COMPARISON OF ALL FEATURES

| Methods      | SIFT | PCA-SIFT | SURF   | A-SIFT | MSER   |
|--------------|------|----------|--------|--------|--------|
| Time         | good | common   | best   | common | good   |
| Scale        | good | common   | good   | best   | common |
| Rotation     | good | good     | common | best   | good   |
| Blur         | good | common   | good   | best   | common |
| Illumination | best | good     | common | good   | common |
| Affine       | good | good     | good   | good   | good   |

According to Table I, A-SIFT has the best matching performance compared to the other methods, but its processing time is longer. SIFT has reasonable performance and a shorter processing time. Comparing A-SIFT and SIFT, the A-SIFT method is better designed for matching the same object in two different images: the threshold for A-SIFT is narrower, so it only matches exactly the same keypoints, and during testing the number of matching points dropped significantly when we used a different video. SIFT has a considerably wider threshold which can detect similar keypoints in the test images. Also, SIFT has a shorter processing time than A-SIFT. So we chose SIFT for the keypoint detection in our project.
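As a concrete illustration of this choice, a minimal sketch of SIFT keypoint detection and matching is shown below. The paper does not state which implementation it used (it lists VLFeat [16] among its references); this sketch assumes OpenCV's `cv2.SIFT_create`, and the file paths are placeholders.

```python
import cv2

# Load a training sample and a UAV test frame in grayscale (placeholder paths).
train_img = cv2.imread("car_sample.png", cv2.IMREAD_GRAYSCALE)
test_img = cv2.imread("uav_frame.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()

# Detect keypoints and compute the 128-dimensional descriptors for both images.
kp_train, des_train = sift.detectAndCompute(train_img, None)
kp_test, des_test = sift.detectAndCompute(test_img, None)

# Match descriptors and keep the distinctive ones with Lowe's ratio test [9].
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des_train, des_test, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(f"{len(kp_test)} test keypoints, {len(good)} good matches")
```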

III. THE VEHICLE DETECTION METHOD

Since our work is to detect vehicles from UAVs, the typical features of vehicles need to be defined and detected. Because the video is captured by a UAV, the vehicle features need to be invariant to scale, rotation and translation, and unaffected by illumination changes. The SIFT algorithm is a local feature extraction algorithm that can find keypoints which are invariant to scale, rotation and translation. SIFT keypoints are the extreme points of the Gaussian scale-space differences: in the Gaussian image, each keypoint is either a maximum or a minimum in the comparison with its 26 neighbourhood pixels across the current, upper and lower scales. The SIFT algorithm rejects unstable extreme points and refines the position of each keypoint using the Taylor expansion and the Hessian matrix. The gradient values and directions of the pixels in the neighbourhood of each keypoint are also calculated to give the keypoint an independent scale and direction. Since these keypoints are invariant to these image changes, the method can achieve a better detection result.

A. SIFT feature extraction has four main steps:

1) Read the image and detect extreme points
2) Determine and filter the keypoints
3) Orientate each keypoint
4) Generate the SIFT vector for each keypoint

1) Detect extreme points

The keypoints in SIFT are the stable points across the image, which identifies the possible scale-invariant locations. The scale space of an image is defined as L(x, y, σ), the convolution of the variable-scale Gaussian function G(x, y, σ) with the input image I(x, y), where σ is the scale factor:

L(x, y, σ) = G(x, y, σ) * I(x, y)  (1)

The Difference of Gaussian (DoG) is used to detect the stable keypoints by finding the scale-space extrema of the DoG:

D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) * I(x, y)  (2)

D(x, y, σ) = L(x, y, kσ) − L(x, y, σ)  (3)

where k is a constant factor which separates two adjacent scales of the original image and D(x, y, σ) is the DoG.

To find the keypoints which are invariant to scale, it is necessary to take the differences of adjacent scale images. We can create a pyramid to calculate the DoG. After generating the DoG, we compare each pixel in the DoG with its neighbours. Each candidate has to be compared with its 8 neighbours in the same scale and the 2×9 = 18 neighbours in the upper and lower scales, making 8 + 2×9 = 26 neighbours in total. If a point is the maximum or minimum among all 26 neighbours, it is classified as a keypoint at that image scale.

2) Determine and filter the keypoints

The DoG method is sensitive to noise and edges in the image, so we have to filter out the low-contrast points and the poor points along edges. By fitting a 3D quadratic function, the location and the scale of the keypoints can be determined more accurately.

3) Orientate each keypoint

For each keypoint, a direction pointing to the maximum of the gradient-orientation histogram is generated, and the subsequent descriptor structure takes this direction as its reference. For each image L(x, y), the gradient magnitude m(x, y) and the orientation θ(x, y) are calculated as:

m(x, y) = √[(L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))²]  (4)

θ(x, y) = tan⁻¹[(L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y))]  (5)

The sampling area is centred on the keypoint, and histogram statistics are taken of the gradient directions of the neighbouring pixels. The range of the gradient histogram is 0 to 360 degrees (36 bins in total, 10 degrees each). The main direction of the keypoint is represented by the peak of the histogram, and any other bin whose energy reaches 80% of the peak value is kept as a secondary direction, so a keypoint may be assigned more than one direction.

4) Generate the SIFT vector

In the last step, each keypoint is assigned a descriptor built from the gradients to achieve further invariance. In the construction of the descriptors, the direction of each descriptor is rotated to the main direction of the keypoint, which provides the rotational invariance. After that, an 8×8 window around the keypoint is created. Lowe [9] proposed 4×4 = 16 seed regions to describe each keypoint in the calculation process in order to increase the matching rate, so a 4×4×8 = 128-dimension vector is formed for each keypoint. This vector is then normalized to unit length, which reduces the effect of illumination changes: any uniform change in contrast is cancelled by the vector normalization.

Now the SIFT feature is invariant to changes in scale, rotation and illumination. We also obtain two vectors: the first is the coordinate position of the keypoint with its orientation and scale value, and the second is the 128-dimension descriptor. Ultimately, we can locate the keypoint by its coordinates and classify it by its descriptor.
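To make step 1 above concrete, here is a small NumPy/SciPy sketch of equations (1)-(3) and the 26-neighbour extremum test. The base sigma, the factor k and the number of levels are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_pyramid(image, sigma=1.6, k=2 ** 0.5, levels=5):
    """Eq. (1)-(3): blur the image at increasing scales (L = G * I), then
    difference adjacent levels to obtain D(x, y, sigma)."""
    blurred = [gaussian_filter(image.astype(float), sigma * k ** i)
               for i in range(levels)]
    return [blurred[i + 1] - blurred[i] for i in range(levels - 1)]

def is_extremum(dog, s, y, x):
    """26-neighbour test: 8 neighbours in the same scale plus 2 x 9 in the
    adjacent scales. The caller must keep s, y, x away from the borders."""
    cube = np.stack([dog[s - 1][y - 1:y + 2, x - 1:x + 2],
                     dog[s][y - 1:y + 2, x - 1:x + 2],
                     dog[s + 1][y - 1:y + 2, x - 1:x + 2]])
    centre = dog[s][y, x]
    return centre == cube.max() or centre == cube.min()
```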

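Step 3 can be sketched the same way: equations (4) and (5) computed over a window around the keypoint, binned into the 36-bin, 10-degree histogram with the 80% secondary-peak rule. This simplified version omits the Gaussian weighting of the window that Lowe [9] applies.

```python
import numpy as np

def orientation_histogram(L, y, x, radius=8):
    """Eq. (4)-(5) over a (2*radius+1)^2 window, binned into 36 bins of 10 deg."""
    patch = L[y - radius:y + radius + 1, x - radius:x + radius + 1].astype(float)
    dx = patch[1:-1, 2:] - patch[1:-1, :-2]          # L(x+1, y) - L(x-1, y)
    dy = patch[2:, 1:-1] - patch[:-2, 1:-1]          # L(x, y+1) - L(x, y-1)
    m = np.sqrt(dx ** 2 + dy ** 2)                   # gradient magnitude, eq. (4)
    theta = np.degrees(np.arctan2(dy, dx)) % 360.0   # orientation, eq. (5)
    hist, _ = np.histogram(theta, bins=36, range=(0.0, 360.0), weights=m)
    main = int(hist.argmax())
    # Any other bin reaching 80% of the peak becomes a secondary direction.
    secondary = [b * 10.0 for b in range(36)
                 if b != main and hist[b] >= 0.8 * hist[main]]
    return main * 10.0, secondary
```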
B. Implicit Shape Model

We used the Implicit Shape Model (ISM) to integrate the keypoints in order to increase the detection accuracy. This works as follows:

1. Detect the interest points in the image, in a similar way to SIFT.
2. Generate a codebook and visual vocabulary from the training data.
3. Probabilistic voting for the matches in the codebook.
4. Object hypotheses.
5. Object segmentation.

We combine the codebook and the probabilistic methods into the SIFT method. In order to generate a codebook for the vehicles, we extracted 60×60 pixel image patches around the position of each keypoint. Each patch starts as a separate cluster, and similar clusters are merged if the average similarity between their constituent patches stays above the threshold. This means the distribution of the vehicle keypoints should have a large density in the test image; the codebook can separate out the environment keypoints, which have a lower density distribution.

C. SVM Classification

We used an SVM classifier to classify the keypoints as either vehicle or environment. We created a set of N training samples and labelled each training sample with either 1 or 0 depending on its class. During the training process, the SVM finds hyperplanes in a kernel-induced feature space which divide the training samples into two separate groups.

IV. IMPLEMENTATION AND TESTING RESULTS

A. Training samples

First of all, we created a set of training samples from a video taken by the UAV above a motorway. We also carefully separated the vehicle and the environment samples to reduce false detections (Fig. 1).

Figure 1. Some of the car training samples with (left) and without (right) environment.

Then we calculated the SIFT keypoints for each sample (Fig. 2).

Figure 2. SIFT keypoints for the training samples.

We created 850 car samples and 260 environment samples for the training (Table II).

TABLE II. THE TOTAL NUMBER OF TRAINING SAMPLES

| Training samples | Images | Total Keypoints |
|------------------|--------|-----------------|
| Vehicle          | 850    | 42617           |
| Environment      | 260    | 16518           |
| Total            | 1110   | 59135           |

B. SVM and ISM

We applied these training samples to train the SVM classifier, with label 1 being the vehicle and label 0 the environment or anything else.

During testing, we calculated the SIFT keypoints in each test image. Each point had a 128-dimension descriptor. We fed those descriptors into the trained SVM classifier, and the SVM decided which class each point should belong to. Due to the substantial number of training samples, it is possible that some environment points have a descriptor value similar to a vehicle point, so we used the ISM as a filter to further improve the detection. After the SIFT matching (Fig. 3), we obtained the points which were likely to be classified as a vehicle by the SVM. We then generated the codebook for each matching keypoint (Fig. 4). According to the codebook, the algorithm voted for the best-matching keypoints and then produced the final results (Fig. 5).

Figure 3. SIFT points and matching points in the test image.
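The paper cites LIBSVM [15] but gives no training code. A sketch of the per-keypoint SVM classification of Sections III-C and IV-B, using scikit-learn instead and random placeholder data in place of the 59135 training descriptors of Table II, might look like this; the RBF kernel is an assumption, as the paper does not state its kernel or parameters.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data standing in for the stacked 128-dimensional SIFT
# descriptors: label 1 for vehicle keypoints, label 0 for environment.
rng = np.random.default_rng(0)
vehicle_desc = rng.random((500, 128))
environment_desc = rng.random((200, 128))

X = np.vstack([vehicle_desc, environment_desc])
y = np.hstack([np.ones(len(vehicle_desc)), np.zeros(len(environment_desc))])

clf = SVC(kernel="rbf")  # kernel choice is an assumption, not from the paper
clf.fit(X, y)

# At test time, every SIFT descriptor in the frame is classified; descriptors
# labelled 1 are candidate vehicle points to be filtered by the ISM stage.
test_desc = rng.random((50, 128))
candidates = test_desc[clf.predict(test_desc) == 1]
print(f"{len(candidates)} candidate vehicle descriptors")
```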

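The ISM voting step itself is described only in prose. One plausible reading of Section III-B — a candidate vehicle point survives only if it is densely supported by other candidates, since vehicle keypoints cluster densely while environment keypoints are sparsely distributed — is sketched below. The radius and vote threshold are assumed values, and this is a simplification of the codebook-based probabilistic voting, not the authors' exact procedure.

```python
import numpy as np

def density_filter(points, radius=30.0, min_votes=4):
    """Keep candidate points whose neighbourhood contains at least `min_votes`
    other candidates within `radius` pixels (a crude stand-in for ISM voting)."""
    points = np.asarray(points, dtype=float)
    kept = []
    for i, p in enumerate(points):
        # Vote: number of other candidate points within `radius` pixels.
        votes = np.sum(np.linalg.norm(points - p, axis=1) < radius) - 1
        if votes >= min_votes:
            kept.append(i)
    return points[kept]

# A tight cluster of car points survives; isolated environment points do not.
candidates = np.array([[100, 100], [104, 98], [98, 103], [102, 101],
                       [106, 104], [101, 97], [400, 50], [10, 300]])
print(density_filter(candidates))
```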
Figure 4. The codebook for the matching points.

Figure 5. Final result.

C. Testing

We used two sets of testing videos. The first set was for extracting the training samples and calibrating the SIFT and SVM parameters; the other set was for testing the actual accuracy of our approach.

We also divided our experiment into three phases. First, we tried different numbers of training samples to check the influence of the training sample size on the detection accuracy; we used five group sizes (1000, 900, 700, 500, 300) of training samples, each including vehicle samples and environment samples. Then we used the pure SIFT feature without the ISM to detect the vehicles, to see the change in accuracy, and compared the detection accuracy of the SIFT and ISM method with the SIFT-only method at the different sample sizes. Finally, we added AdaBoost to the classifier in order to get a better performance.

The testing video is 1 minute 38 seconds long, has a total of 2461 frames, and 85 vehicles appear in it.

TABLE III. TESTING RESULTS FOR SIFT

| Features   | Total Points | Car Points | False Detection | Accuracy |
|------------|--------------|------------|-----------------|----------|
| SIFT       | 1122         | 726        | 396             | 64.71%   |
| SIFT + ISM | 545          | 513        | 32              | 94.13%   |

TABLE IV. TESTING RESULTS FOR VEHICLE DETECTION

| Features   | Total Car | Positive Car | False Negative | False Positive | Accuracy |
|------------|-----------|--------------|----------------|----------------|----------|
| SIFT       | 107       | 71           | 5              | 36             | 77.65%   |
| SIFT + ISM | 88        | 78           | 1              | 9              | 90.59%   |

TABLE V. TESTING RESULTS FOR VEHICLE DETECTION

| Sample Size           | Total Points | Car Points | False Points | Accuracy |
|-----------------------|--------------|------------|--------------|----------|
| 1110 training samples | 545          | 513        | 32           | 94.13%   |
| 1000 training samples | 517          | 465        | 52           | 89.94%   |
| 900 training samples  | 497          | 420        | 77           | 84.51%   |
| 700 training samples  | 462          | 359        | 103          | 77.71%   |
| 500 training samples  | 411          | 256        | 155          | 62.29%   |
| 300 training samples  | 328          | 130        | 198          | 39.63%   |

TABLE VI. DETECTION ACCURACY

| Sample Size           | HOG | SIFT + ISM | SIFT + ISM + AdaBoost |
|-----------------------|-----|------------|-----------------------|
| 1110 training samples | 88% | 94%        | 95%                   |
| 1000 training samples | 86% | 90%        | 91%                   |
| 900 training samples  | 80% | 85%        | 86%                   |
| 700 training samples  | 72% | 77%        | 78%                   |
| 500 training samples  | 59% | 62%        | 62%                   |
| 300 training samples  | 38% | 40%        | 44%                   |

According to the testing results, we conclude that the SIFT and ISM features raise the accuracy of vehicle detection (Table III). The accuracy is low when we used SIFT only: the vehicle keypoints in the test images were well detected, but the number of false detections was too high because the environment can have similar keypoint features. After the ISM feature was applied, the environment points were eliminated by the codebook voting, which significantly increased the detection accuracy (Table IV). We compared the detection performance of three configurations, namely HOG, SIFT and ISM, and SIFT and ISM with AdaBoost, at different sample sizes (Table VI). The results show that AdaBoost improves the classification process and slightly increases the detection results.
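As a quick sanity check, the accuracy column of Table III is consistent with car points divided by total points:

```python
print(f"SIFT:       {726 / 1122:.2%}")  # 64.71%
print(f"SIFT + ISM: {513 / 545:.2%}")   # 94.13%
```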

Figure 6. SIFT points detection with different numbers of training samples.

We used six sets of different-sized training samples to train the classifier (Fig. 6). Obviously, the more training samples we used, the better the detection results we obtained. However, the gain in detected points was declining, which means the training set is approaching saturation (Table V).

V. CONCLUSION

In this paper we developed a method of vehicle detection from a UAV. The method matches points that are invariant to affine transformations against a codebook voting feature.

The final result shows that the method has a 94.13% accuracy rate. This result is affected by the size of the training set: the more training samples we used, the higher the detection accuracy we obtained. The processing speed is slow compared with real-time processing because SIFT needs to process every keypoint for the vehicle detection. In future work, we will increase the training sample sizes of both the vehicle and environment samples. We will also reduce the processing time of SIFT point matching by reducing the 128 features to 64 features, which can halve the processing time of the SIFT method.

REFERENCES

[1] Jan, S. and Toby, P. B., "Automatic Salient Object Detection in UAV Imagery," School of Engineering, Cranfield University, UK, 25th International UAV Systems Conference.
[2] Peiqun Lin, Jianmin Xu and Jianyong Bian, "Robust Vehicle Detection in Vision Systems Based on Fast Wavelet Transform and Texture Analysis," IEEE International Conference on Automation and Logistics, Jinan, China, August 18-21, 2007, pp. 2958-2963.
[3] Feng Han, Ying Shan, Ryan Cekander, Harpreet S. Sawhney, and Rakesh Kumar, "A Two-Stage Approach to People and Vehicle Detection with HOG-Based SVM," Performance Metrics for Intelligent Systems Workshop, 2006, pp. 133-140.
[4] Tarak Gandhi and Mohan M. Trivedi, "Video Based Surround Vehicle Detection, Classification and Logging from Moving Platforms: Issues and Approaches," IEEE Intelligent Vehicles Symposium, Istanbul, Turkey, June 13-15, 2007, pp. 1067-1071.
[5] Sebastian Tuermer, Franz Kurz, Peter Reinartz and Uwe Stilla, "Airborne Vehicle Detection in Dense Urban Areas Using HoG Features and Disparity Maps," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 99, pp. 1-11, 11 February 2013.
[6] Xianbin Cao, Changxia Wu, Jinhe Lan, Pingkun Yan, and Xuelong Li, "Vehicle Detection and Motion Analysis in Low-Altitude Airborne Video Under Urban Environment," IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 10, pp. 1522-1533, October 2011.
[7] Joshua Gleason, Ara V. Nefian, Xavier Bouyssounousse, Terry Fong and George Bebis, "Vehicle Detection from Aerial Imagery," IEEE International Conference on Robotics and Automation, Shanghai, P.R. China, 2011, pp. 2065-2070.
[8] David G. Lowe, "Object recognition from local scale-invariant features," International Conference on Computer Vision, Corfu, Greece, September 1999, pp. 1150-1157.
[9] David G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, 2004, pp. 91-110.
[10] Y. Ke and R. Sukthankar, "PCA-SIFT: A More Distinctive Representation for Local Image Descriptors," Proc. Conf. Computer Vision and Pattern Recognition, vol. 2, pp. II-506 - II-513, July 2004.
[11] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features," Computer Vision and Image Understanding (CVIU), vol. 110, no. 3, pp. 346-359, 2008.
[12] J. M. Morel and G. Yu, "ASIFT: A New Framework for Fully Affine Invariant Image Comparison," SIAM Journal on Imaging Sciences, vol. 2, issue 2, pp. 438-469, 2009.
[13] J. Matas, O. Chum, M. Urban, and T. Pajdla, "Robust wide baseline stereo from maximally stable extremal regions," Proc. of British Machine Vision Conference, pp. 384-396, 2002.
[14] Luo Juan and Oubong Gwun, "A Comparison of SIFT, PCA-SIFT and SURF," International Journal of Image Processing (IJIP), vol. 3, issue 4, pp. 143-152, 2009.
[15] C.-C. Chang and C.-J. Lin, LIBSVM — A Library for Support Vector Machines. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm
[16] A. Vedaldi and B. Fulkerson, VLFeat platform. [Online]. Available: http://www.vlfeat.org/index.html

