...camera POSE, at least one calibrated camera image (though two are preferable), an aerial orthophoto, and an accompanying digital elevation model (DEM) [1].

...allowing us to validate our results using a total station and avoiding some measurement bias.
There are several well-defined relationships that can be leveraged to reconstruct complete or partial POSE parameters [3] given a series of known points in a pair of images. Computing an image homography also provides the parameters of relative camera POSE between the aerial and perspective images. Homographies can be adjusted for imprecision error via iterative least-squares adjustment [1]. Incorporating the DEM with the aerial orthophoto, we can also derive transformation parameters between 3D and 2D space, so that the 2D perspective image maps back into 3D world-space coordinates [1].
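To make the 3D-2D relationship concrete: given a perspective-to-orthophoto homography and the DEM, a perspective pixel can be carried into world coordinates. The following is a minimal sketch, assuming the homography is a 3x3 NumPy array and the orthophoto/DEM grid is described by a simple (origin, pixel-size) geotransform; the function name and storage conventions are ours, not prescribed by the paper.

```python
import numpy as np

def perspective_to_world(px, py, H, geotransform, dem):
    """Map a perspective-image pixel to (easting, northing, elevation).

    H: 3x3 perspective -> orthophoto homography.
    geotransform: (x_origin, pixel_size_x, y_origin, pixel_size_y) of the
        orthophoto/DEM grid (an assumed storage convention).
    dem: 2D elevation array aligned with the orthophoto.
    """
    # The homography carries the pixel into orthophoto pixel coordinates.
    u, v, w = H @ np.array([px, py, 1.0])
    u, v = u / w, v / w
    # Orthophoto pixels -> map coordinates, then sample the DEM for height.
    x0, dx, y0, dy = geotransform
    easting, northing = x0 + u * dx, y0 + v * dy
    elevation = dem[int(round(v)), int(round(u))]
    return easting, northing, elevation
```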
Often issues in object variation (lighting, scale, deformation, etc.) preclude perfectly accurate object detection and tracking. Incorporating the YOLO framework [4] handles most variable presentation aspects. YOLO's flexibility on input size and its speed make it a natural choice over the R-CNN family, especially over an eclectic mix of smaller input images. Subsequently, we explore the efficacy of several traditional tracking algorithms, Tracking-Learning-Detection (TLD), Kernelized Correlation Filters (KCF), and Multiple Instance Learning (MIL) [5, 6, 7], on the outputs of the YOLO network.

3. DATASETS

Perspective video was provided by the TAMUCC University Police Department. For this initial work we analyzed a 1280x720 video file at 20 frames per second from an AXIS Q6044-E PTZ Dome Network Camera. The camera was held mostly in a static POSE. Aerial imagery products of campus, provided by the Measurement Analytics Lab at TAMUCC, were generated from a fixed-wing UAV (senseFly eBee) platform with an RGB camera. The orthophoto and DEM were derived from a point cloud generated using Structure-from-Motion over a masked area of the parking lot, with a ground sample distance of 2.79 cm.

The applied YOLO network was trained on a subset of PASCAL VOC 2007 & 2012 data, specifically on the classes of people, cars, and motorbikes. This allows us to reduce some model overhead for increased performance.

4. METHODS

Fig. 1. Processing Methodology

4.1. Required Inputs

Several inputs are necessary in combination with the perspective video footage, and the processing flow is visualized in Figure 1:

• Keypoints: 400 point pairs used to compute the image homography, collected by hand as our results applying Shi-Tomasi corner detection [8] were surprisingly sparse.
• Registrations: homography parameters computed individually from perspective to aerial and vice versa (inverted) using the keypoints; a minimal sketch follows this list.
• Working Area Geometry: used to check the accuracy of a registration and to create a mask of the working area in the perspective image containing physical occlusions.
• Regions of Interest (ROIs): areas around the perspective view periphery where trackable objects are likely to pass when entering or exiting the frame.
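As referenced in the list above, the registrations and the working-area mask can be produced with standard OpenCV calls. This is a minimal sketch, assuming the keypoints are stored as N x 2 float32 arrays and the working area is a polygon in perspective pixel coordinates; the file names and the example polygon are placeholders, not the paper's data.

```python
import cv2
import numpy as np

# Hand-collected keypoint pairs (placeholder file names).
persp_pts = np.load("keypoints_perspective.npy").astype(np.float32)
aerial_pts = np.load("keypoints_aerial.npy").astype(np.float32)

# Forward registration (perspective -> aerial); RANSAC discards outlier pairs.
H_fwd, _ = cv2.findHomography(persp_pts, aerial_pts, cv2.RANSAC, 5.0)
# Inverse registration (aerial -> perspective), estimated from the swapped order.
H_inv, _ = cv2.findHomography(aerial_pts, persp_pts, cv2.RANSAC, 5.0)

# Working-area mask: rasterise the working-area polygon so that physical
# occlusions can be excluded from detection and tracking.
frame_h, frame_w = 720, 1280
working_area = np.array([[0, 700], [1279, 700], [1279, 300], [0, 300]],
                        dtype=np.int32)          # illustrative polygon only
mask = np.zeros((frame_h, frame_w), dtype=np.uint8)
cv2.fillPoly(mask, [working_area], 255)
```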
4.2. Image Transformations

The process of deriving image transformations applies image homography and is iterated upon with several collections of keypoints, increasing in volume starting with 8 keypoints and doubling until we encompass all 400 points. The efficacy of each transformation is evaluated by measuring the difference between known points in the image and where they should align on the aerial orthophoto, an example of which is shown in Figure 2.
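A concrete way to run this evaluation, shown as a sketch: fit a homography from 8, 16, ..., up to all 400 keypoints and measure how far held-out check points land from their true orthophoto positions, converting pixels to metres with the 2.79 cm ground sample distance from Section 3. The function name and hold-out scheme are our own illustration, not the paper's exact procedure.

```python
import cv2
import numpy as np

def evaluate_registrations(persp_pts, aerial_pts, check_persp, check_aerial,
                           gsd_m=0.0279):
    """Report check-point deviation (metres) for growing keypoint subsets.

    All point arrays are assumed to be N x 2 float32 pixel coordinates.
    """
    n = 8
    while True:
        # Fit the registration from the first n keypoint pairs.
        H, _ = cv2.findHomography(persp_pts[:n], aerial_pts[:n], cv2.RANSAC, 5.0)
        # Project held-out check points and compare against their true positions.
        proj = cv2.perspectiveTransform(check_persp.reshape(-1, 1, 2), H)
        err_px = np.linalg.norm(proj.reshape(-1, 2) - check_aerial, axis=1)
        print(f"{n:3d} keypoints: mean {err_px.mean() * gsd_m:.2f} m, "
              f"max {err_px.max() * gsd_m:.2f} m")
        if n >= len(persp_pts):
            break
        n = min(n * 2, len(persp_pts))
```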
Fig. 3. The transformation of the DEM into the perspective image plane.

Fig. 4. Detection and Tracking Results with Occlusion Mask. a): Perspective View of Detection and Tracking Algorithm with Detected Occlusion Areas. b): Aerial View of Detection and Tracking Algorithm.

5. RESULTS

Registration accuracy decreases with distance from the camera, visible in Figure 2 as crooked lines and disjoint connections. The highest recorded deviation across all registration iterations was ∼3.5 m at the farthest ends of the 118 × 93 m parking lot from the camera origin, ∼3% transformation error at worst. Notably, deviations are not linear. We theorize their cause is image imperfections due to the intentional lack of camera calibration and/or distortion caused by the weather dome.
5.3. Detection & Tracking Accuracy

In over six hours of reviewed video, YOLO detected every vehicle which passed through the designated ROIs around the perspective view periphery. An optimization was made to detection operations by only calling them as the movement areas computed in the ROIs began to decrease; this corresponded to the majority of instances where vehicles presented themselves best for detection. Tentatively, we would rate this application of YOLO as 99.99% accurate at object detection, and its outputs were passed to the tracking algorithms we evaluated.
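One way such a trigger could be realised (our illustration only, not the paper's implementation): maintain a foreground estimate inside each ROI and hand a frame to the detector once the moving area starts to shrink, i.e. once an entering vehicle is assumed to be fully in view. The run_yolo hook and the background-subtraction choice below are placeholders.

```python
import cv2

def detect_on_roi_motion_decrease(frames, roi_mask, run_yolo):
    """Call run_yolo(frame) only when the moving area inside roi_mask decreases.

    frames: iterable of BGR images; roi_mask: uint8 mask of one ROI;
    run_yolo: any callable returning detections for a frame (placeholder).
    """
    subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)
    prev_area, results = 0, []
    for frame in frames:
        fg = subtractor.apply(frame)
        area = int(cv2.countNonZero(cv2.bitwise_and(fg, fg, mask=roi_mask)))
        if prev_area > 0 and area < prev_area:   # movement area began to decrease
            results.append(run_yolo(frame))
        prev_area = area
    return results
```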
Table 1 outlines the performance of existing tracking methods available in OpenCV and our own variation on Lucas-Kanade optical flow. Our variation performs k-means clustering on the detected features between frames and drops points which become stuck on similar features and exceed a distance limit. We define tracking accuracy as lock persistence on a set of 14 vehicles traveling radically different paths until they exit view. Global denotes tracking context in the entirety of the frame, while Patch denotes a moving window around tracked objects as a subset of the frame.
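A rough sketch of that variation, under our own assumptions about thresholds and cluster count (the paper does not specify them): track features with pyramidal Lucas-Kanade, drop points that have effectively stopped moving, cluster the survivors with k-means, and discard points that stray too far from their cluster centre.

```python
import cv2
import numpy as np

def track_step(prev_gray, gray, pts, k=3, min_motion=0.5, max_dist=40.0):
    """One tracking step; pts is an N x 1 x 2 float32 array of feature points."""
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    ok = status.ravel() == 1
    prev_ok, nxt_ok = pts[ok], nxt[ok]

    # Drop points that barely moved between frames ("stuck" on similar features).
    motion = np.linalg.norm(nxt_ok - prev_ok, axis=2).ravel()
    nxt_ok = nxt_ok[motion > min_motion]
    if len(nxt_ok) < k:
        return nxt_ok

    # Cluster the remaining points and discard those exceeding the distance limit.
    data = nxt_ok.reshape(-1, 2).astype(np.float32)
    _, labels, centers = cv2.kmeans(
        data, k, None,
        (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0),
        3, cv2.KMEANS_PP_CENTERS)
    dist = np.linalg.norm(data - centers[labels.ravel()], axis=1)
    return nxt_ok[dist < max_dist]
```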
Table 1. Tracking Algorithm Performance. *At the time of this writing the TLD implementation in OpenCV was not stable under Patch tracking; the value is a conservative estimate from what data could be recorded.

Tracker       | FPS with Coord. Transforms (Global, Patch) | Tracking Accuracy (Global, Patch)
TLD           | 8, 16*                                     | 79%, 65%*
KCF           | 40, 70                                     | 58%, 58%
MIL           | 6, 12                                      | 65%, 65%
Optical-flow  | 70, 60                                     | 85%, 85%
6. CONCLUSIONS & FUTURE WORKS

We conclude that this is a valid approach for accomplishing our outlined goal of integrating video data with UAS products, based on the relatively low degree of registration error over a comparatively large area. Theoretically it is possible to near-fully automate the processing workflow; however, we are curious to investigate adapting alternative methods, such as Boerner's work on automatically computing camera POSE [9], to fully automate image registration. Similarly, regions of interest in perspective images combined with shapefiles of travel networks could isolate ROIs for vehicle detection algorithmically. We also plan to test the system with continually reduced image quality, in order to determine the minimum requirements under which this system could operate with a negligible degree of uncertainty. As an alternative to traditional tracking mechanics, we also look to incorporate other in-progress work based on Bertinetto et al.'s [10] study of generic object tracking using Siamese networks.

7. REFERENCES

[1] C. Bozzini, M. Conedera, and P. Krebs, "A new monoplotting tool to extract georeferenced vector data and orthorectified raster data from oblique non-metric photographs," International Journal of Heritage in the Digital Era, vol. 1, no. 3, 2012.

[2] T. Produit and D. Tuia, "An open tool to register landscape oblique images and generate their synthetic model," Remote Sensing & Spatial Analysis, pp. 170–176, 2012. [Online]. Available: http://2012.ogrs-community.org/2012_papers/d3_2_produit_abstract.pdf

[3] D. A. Strausz Jr., "An application of photogrammetric techniques to the measurement of historic photographs," 2001. [Online]. Available: https://ir.library.oregonstate.edu/downloads/js956g515

[4] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Computer Vision and Pattern Recognition, 2016. [Online]. Available: https://arxiv.org/pdf/1612.08242

[5] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-learning-detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6, no. 1, Jan. 2010.

[6] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-speed tracking with kernelized correlation filters," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 583–596, Mar. 2015.

[7] B. Babenko, M.-H. Yang, and S. Belongie, "Visual tracking with online multiple instance learning," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Jun. 2009.

[8] J. Shi and C. Tomasi, "Good features to track," Computer Vision and Pattern Recognition, 1994. [Online]. Available: http://www.ai.mit.edu/courses/6.891/handouts/shi94good.pdf

[9] R. Boerner and M. Kröhnert, "Brute force matching between camera shots and synthetic images from point clouds," ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. XLI-B5, pp. 771–777, Jun. 2016.

[10] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr, "Fully-convolutional siamese networks for object tracking," Computer Vision and Pattern Recognition, 2016.