DissThesis Temp2 Partialcompleted
Contents
Abstract
Acronyms
List of Figures
Chapter 1 Introduction
1.1 Motivation
2.4 Summary
4.3 Mapping
5.2 ResNet-50
5.3 DenseNet-161
5.4 DenseNet-169
5.5 DenseNet-201
5.6 MobileNet-V2
8.1 Simulators
Chapter 9 Results
10.1 Conclusion
References
Abstract
The utilization of autonomous Unmanned Aerial Vehicles (UAVs) in sectors such as industry, agriculture, and disaster- and calamity-hit environments has grown rapidly. This work presents a system for autonomous navigation of low-cost UAVs in GPS-denied environments. The presented solution is divided into three steps: 1) localization and mapping, 2) scale and depth estimation, and 3) obstacle avoidance and path planning. Localization and mapping is required for creating a map of the unknown environment and can be performed with a vision-based SLAM algorithm; in our case ORB-SLAM is used to obtain a map of the surrounding environment. However, the map generated by ORB-SLAM is a sparse point cloud and cannot provide an accurate measure of the depth of obstacles in the UAV's surroundings. Hence, a CNN-based depth estimation algorithm is used to improve the depth estimates, thereby minimizing the number of collisions with obstacles. Next, a deep learning model is developed to identify key points such as left turns, right turns, and crossroads in the estimated depth map of the scene. Finally, an efficient path planning algorithm is implemented to find an obstacle-free path for the UAV from the starting point to the target location using an occupancy grid map of the environment, and the necessary control commands are sent to the UAV over WiFi. The proposed approach is first evaluated in the Unreal Engine 5.2 simulator, in which a virtual environment with obstacles is created using the AirSim plugin. Finally, to verify the effectiveness of the implementation, real-time experiments are performed with a DJI
List of Tables
List of Figures
Figure 8: Samples of the RGB image and the class labels from the NYU Depth V2 dataset.
Figure 15: Images of selected keypoints and the generated point cloud map for the TUM sequence.
Figure 16: Images of selected keypoints and the generated point cloud map for the KITTI sequence.
Figure 17: Images of selected keypoints and the generated point cloud map for the captured sequence.
Chapter 1: Introduction
This chapter provides a brief insight into the project "Monocular Visual SLAM-based
Mapping and Autonomous Navigation in GPS-denied Environments" and highlights the
key points of the pre-dissertation work carried out as an initial phase of this project. This
chapter covers the motivation, the objectives, the scope, and the organization of the project.
1.1 Motivation
Earlier, UAVs (Unmanned Aerial Vehicles) were mainly used in the military domain and seen as expensive entities, whereas the significant drop in their prices over the last few years has made them commonplace in the consumer market. People are increasingly adopting these consumer drones according to their needs. According to market reports from 8 April 2020, the UAV market is estimated to reach 45.8 billion by 2025. In light of this analysis, the UAV segment is also expected to grow at the highest CAGR (Compound Annual Growth Rate) as demand from both military and non-military sectors increases, that is, the demand for surveillance, reconnaissance, and inspection applications. New applications and use cases for autonomous UAVs are being discovered at a significant rate. For example, aerial photography is used by farmers for high-precision farming: multi-spectral images collected by the aerial vehicle help farmers detect which crops need the most fertilizer and which do not, leading to the efficient use of resources. UAVs are also being used for the inspection of areas that are hard to reach, such as wind turbines, confined spaces, mines, forests, nuclear power plants, and chemical factories.
Being able to set out on a mapping mission beforehand through autonomous flight, collect data, and provide useful information about the target location with minimal human operator interaction has made UAVs an invaluable tool. The ease of use and low price of current UAVs are largely responsible for their prodigious market growth. However, the applications discussed above still require an ample amount of human intervention, limiting the scalability of the proposed solutions. In order to make these systems scalable, a high level of autonomy is required, because sending and training people to use these
systems could be a bottleneck for the growth of UAVs in certain applications. Therefore, the increasing demand for autonomy has attracted considerable research attention.
Areas where autonomous drones have not yet become commonplace include GPS-denied environments, such as indoor spaces, forests, and remote areas. Companies like Flyability have been working on UAVs for indoor inspection. These UAVs are equipped with protective cages so that they can safely bump into obstacles. They are used to inspect caves, gas and oil storage units, and sewers. However, these UAVs still require trained pilots and training of local companies before their deployment.
To overcome these problems, UAVs are often outfitted with a suite of sensors that aid navigation in GPS-denied environments. In order for the UAV to localize itself autonomously without leveraging external positioning systems, the go-to solution is SLAM (Simultaneous Localization and Mapping). Earlier, SLAM algorithms used dedicated range sensors to create a map of the environment and were able to localize within this map: LIDAR sensors were used for performing both localization and mapping. They use rotating laser sensors to measure the surrounding distances. The sensor would be attached to a gimbal below the UAV so that it would stay level. The problem with these sensors is that they are relatively heavy, large, and expensive compared to cameras, making them unsuitable for micro aerial vehicles. In later research work, RGB-D (RGB + Depth) and stereo camera systems were utilized as an improvement over these sensors. Relying on visual SLAM rather than laser-based SLAM makes these camera systems more versatile; however, they often require more computational power. Therefore, in order to reduce the weight, size, and cost of the UAV, we aim to propose a monocular vision-based UAV navigation technique that is capable of constructing metrically scaled 3D maps of GPS-denied environments that can be used for localization, obstacle detection, and avoidance. This task can be accomplished using only a monocular camera installed on a commercial low-cost UAV. The proposed project aims to solve the problem of autonomous UAV navigation in GPS-denied environments using a method that integrates a SLAM algorithm with CNN-based depth estimation to scale the data, densify it, and deliver a map of the environment suitable for exploration in real time.
1.2 Objectives and Scope
The main objective of this proposed work is to develop and evaluate a monocular vision-based system that can be implemented on low-cost UAVs to make them capable of real-time autonomous exploration and navigation in GPS-denied areas. Unmanned Aerial Vehicles (UAVs) with onboard camera sensors have proved to be a vital resource in various situations, as they can be used to record and analyze, from a remote location, real-time video data of environments affected by calamities and disasters such as floods, earthquakes, and forest fires. Furthermore, with the ability of autonomous navigation and collision avoidance, UAVs can go to places where humans cannot and collect video data for further understanding, thereby helping mankind deal with such disastrous situations. However, in scenarios where the GPS precision is too low or the GPS signal is not present at all, such as indoor environments, proximity sensors are utilized to perform the task of autonomous navigation; but many proximity sensors, such as LIDARs, have high power consumption and are heavy, and hence are not suitable for small aerial vehicles. Thus, autonomous UAV navigation aided by monocular vision is an emerging but more challenging task. Several factors, such as illumination effects and texture-less and textured regions, affect the proper deployment of vision-based methods. In order to overcome these limitations, several methodologies have been proposed, such as optical flow methods, stereo vision-based approaches, and the vanishing line technique. However, none of them performs well in all kinds of scenarios. Therefore, after reviewing the existing techniques and the challenges associated with them, we propose an ORB-SLAM-based state-of-the-art autonomous navigation strategy for UAVs in GPS-denied environments. Our proposed approach exploits only the visual sensor, that is, the front camera of the UAV, for localization, mapping, obstacle detection, and collision avoidance.
UAV navigation approaches can be categorized into three types: inertial navigation, satellite navigation, and vision-based navigation. However, none of these approaches is perfect for fully autonomous UAV navigation, and it is therefore difficult to choose an appropriate method for autonomous UAV navigation for a particular task. Nevertheless, after reviewing the methods proposed for UAV navigation under each category, vision-based navigation proves to be a promising and primary research area, driven by the rapid development of computer vision. The visual sensors used in vision-based autonomous navigation have several advantages over traditional sensors. First, visual sensors can provide a large amount of online information about the surrounding environment. Second, visual sensors are well suited to perceiving dynamic environments because they possess a valuable and remarkable anti-interference ability. Third, most visual sensors are passive, that is, they prevent the sensing system from being detected. Research institutions in the United States and Europe, such as the NASA Johnson Space Center (Herwitz et al. 2004), the Massachusetts Institute of Technology (Langelaan and Roy 2009), and the University of Pennsylvania (Hao et al. 2008), along with several other top-ranked institutions, are rapidly developing research on vision-based autonomous UAV navigation and have embodied this technology in next-generation transport systems such as NextGen (Strohmeier et al. 2014) and SESAR (Hrabar 2008).
An illustration of vision-based navigation is presented in Figure 1. Using inputs from proprioceptive and exteroceptive sensors, after processing localization and mapping, obstacle detection and avoidance, and path planning, the navigation system ultimately provides continuous control commands to steer the UAV towards the target location.
Figure 1: Vision-based UAV navigation system
Belmonte et al. (2019) have classified UAVs into the following four types:
Fixed-Wing: These aircraft consist of a rigid wing with a predetermined airfoil that generates lift from the forward airspeed of the UAV, thereby making flight possible. The forward airspeed is produced by the thrust of a propeller acting in the forward direction. In addition, these aircraft are generally characterized by their high cruise speed and long endurance, and are mainly used for long-distance, long-range, and high-altitude flights.
Rotary-Wing: These aircraft are characterized by their ability to carry out missions that require hovering flight. They have rotors composed of blades in constant motion, which generate lift by producing airflow. These flying machines are also known as vertical take-off and landing (VTOL) rotorcraft and allow a heavier payload, easier take-off and landing, and better control than fixed-wing aircraft.
Flapping-Wing: These micro-UAVs are generally known for reproducing the flight of insects or birds. They have extremely low endurance and low payload capability due to their reduced size. However, these UAVs have low power consumption and can perform vertical take-offs and landings with flexibility.
Airship: An airship, also known as a dirigible, is a “lighter-than-air” UAV that is propelled and steered through the air using propellers and rudders or another source of thrust. These aircraft gain lift from a large balloon-like cavity filled with a lifting gas. The three main types of airship are non-rigid (or blimp), semi-rigid, and rigid. A blimp, or non-rigid airship, is a kind of “pressure airship”, that is, a steerable, powered, lighter-than-air vehicle whose shape is maintained by the pressure of the gases within its envelope. These air vehicles are suitable for long-duration flights because no energy is required to lift them, so the saved energy can be used as a power source for the displacement of actuators. Furthermore, these aircraft are capable of navigating safely at low levels, close to local people and buildings. Figure 2 shows the four classes of UAVs discussed here.
Usually, UAVs obtain information about their surroundings and their own state from both proprioceptive and exteroceptive sensors. The conventional sensors used for navigation are gyroscopes, accelerometers, inertial navigation systems (INS), and the Global Positioning System (GPS). However, these sensors have limitations that affect the localization and navigation of UAVs. One of the biggest drawbacks of an inertial navigation system is bias error, which is caused by the integral drift problem and results in some loss of accuracy. The Global Positioning System also has limited reliability, since in some areas, such as indoor environments, its precision is too low or the signal is not available at all. Furthermore, small errors in acceleration and angular velocity are continuously integrated into errors (linear or quadratic in nature) in velocity and position.
Therefore, one of the reasons for the limited use of UAVs in everyday life and production is the imperfect functioning of traditional sensors. In order to cope with these issues, researchers have become increasingly concerned with the use of state-of-the-art techniques for enhancing both the accuracy and robustness of pose estimation. Ideally, one can obtain better state estimation by using multi-sensor data fusion (Carrilo et al. 2012), which integrates the advantages of different types of sensors. However, in specific environments such as GPS-denied areas, the incorporation of multiple sensors into small UAVs is assumed to be unnecessary and impractical. Therefore, a more practical approach is needed for enhancing the UAVs' environmental perception ability.
Compared with ultrasonic sensors, laser ranging, GPS, and other traditional sensors, visual sensors capture rich information about the surroundings, including texture, color, and other visual data. Additionally, they are cheaper and easier to deploy, which is why vision-based autonomous navigation is becoming a relevant topic in current research. The visual sensors commonly used for vision-based navigation are divided into four types, as shown in Figure 3: monocular cameras, stereo cameras, RGB-D cameras, and fisheye cameras.
Monocular cameras are generally used in applications where minimum weight and compactness are required. Furthermore, they are cheap and can be easily deployed on UAVs. However, they are not capable of directly obtaining a depth map of the surrounding environment. When two monocular cameras of the same configuration are mounted on a rig, the system becomes a stereo camera. A stereo camera therefore not only provides everything that a monocular camera can but also offers the extra benefit of two views. Additionally, it can be used to obtain depth maps using the parallax principle without the aid of infrared sensors. RGB-D cameras are capable of simultaneously generating depth maps and visible images using infrared sensors; they are mostly used in indoor environments due to the restricted range of infrared sensors. Fisheye cameras are adapted from monocular cameras and provide a broad viewing angle, which aids obstacle avoidance in complex environments.
Figure 3: Typical visual sensors. (a) monocular camera; (b) stereo camera; (c) RGB-D
camera; (d) fisheye camera
This report starts by describing the basic concepts used in the development of the project in simple language, with real-life examples, to facilitate understanding even for developers with limited experience in the field of computer vision. A brief background on vision-based simultaneous localization and mapping is provided for the reader's understanding, followed by concepts of autonomous exploration and an overview of autonomous UAV navigation and the research going on in this area. The next chapter reviews the literature in the field of vision-based autonomous UAV navigation, where the problem of autonomous navigation is divided into three tasks: visual localization and mapping, obstacle detection and avoidance, and path planning. Chapters 3 and 4 provide an insight into the methodology and the progress made up to the end of the third semester, and discuss the algorithms used to build an autonomous UAV that exploits only an onboard camera for navigating in GPS-denied environments. In Chapter 5 the technical details of the project are specified. Finally, conclusions are provided in Chapter 6.
Chapter 2: Literature Review
2.1 Visual localization and mapping
On the basis of the surrounding world and the prior information collected for autonomous navigation, visual localization and mapping systems can be categorized into three classes: mapless systems, map-building systems, and map-based systems (Desouza and Kak 2002), as shown in Figure 4.
A. Mapless system
robot. First, it estimates the optical flow from each of the two cameras, which are oriented parallel to the walls. If the optical velocities turn out to be equal, the robot proceeds along the central line; otherwise, the robot moves towards the side with the smaller optical flow. However, this approach may perform poorly while navigating in a texture-less environment.
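As a simple illustration of this flow-balancing strategy, the sketch below compares the mean optical-flow magnitude in the left and right halves of a forward-facing view and returns a steering command; the use of OpenCV's Farneback flow, the parameter values, and the yaw-command convention are assumptions for illustration, not the method of the cited work.

import cv2
import numpy as np

def flow_balance_yaw(prev_gray, gray, gain=0.5):
    """Yaw command that steers away from the side with the larger optical flow."""
    # Dense optical flow between two consecutive grayscale frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)                     # per-pixel flow magnitude
    h, w = mag.shape
    left, right = mag[:, :w // 2].mean(), mag[:, w // 2:].mean()
    # Equal flow on both sides -> stay near the corridor centre line (command near 0).
    # Larger flow on one side -> that wall is closer, so yaw towards the other side.
    return gain * (left - right) / max(left + right, 1e-6)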
However, in recent years great advances in optical flow techniques have been made, with several breakthroughs in detection and tracking. Recently, Nourani-Vatani et al. (2014) proposed a novel approach for scene change detection based on describing the optical flow. Furthermore, Herisse et al. (2012) achieved hovering flight and landing maneuvers on a moving platform by fusing an inertial measurement unit (IMU) with optical flow measurements. Even high-level tasks, like surveillance and target tracking, can be achieved with systems using dense optical flow calculation, as they can identify the movements of all moving objects (Maier and Humenberger 2013). In the field of localization and mapping, the feature tracking method (Chao et al. 2013) has become a standard and robust approach. This method can determine the maneuver of an object by detecting invariant features, including lines, corners, and so on, and their corresponding movement in sequential images. The invariant features that have previously been seen in the environment need to be observed again from different perspectives, illumination conditions, and distances during the task of robot navigation (Szenher 2008). Basically, the sparse features that can be used in localization and mapping are not suitable for obstacle avoidance, as they are not dense enough. A behavioral method for navigation that used a system capable of recognizing visual landmarks combined with a fuzzy-based obstacle avoidance system was proposed by Li and Yang (2003). Hui et al. (2018) proposed a vanishing point-based approach for localization and mapping of autonomous UAVs for safe and robust inspection of power transmission lines. In this method the authors regarded the transmission tower as an important landmark by which continuous surveillance and monitoring of the corridor can be achieved. Later on, Bianchi and Barfoot (2021) demonstrated a robust and fast method for localization of a UAV using satellite images, which were used to train an autoencoder. In this work, the authors collected Google Earth (GE) images, and an autoencoder was trained to compress these images into a low-dimensional vector representation without distorting key features. The trained autoencoder is then used to compress a real-time image captured by the UAV, which is compared with the pre-collected autoencoded GE images using an inner-product kernel. Hence,
localization of the UAV can be achieved by distributing weights over the relative GE poses. The evaluation results of this work have shown that the proposed approach is able to achieve a root mean square error (RMSE) of less than 3 m in real-time experiments.
B. Map-building system
Many times, it is difficult for a UAV to navigate with a preexisting map of the surroundings due to certain environmental limitations. In addition, it may be unfeasible to acquire a map of the target location in emergency cases (such as disaster relief). Therefore, in such situations it is a more efficient and attractive solution to build the map during flight. Map-building based robot navigation has been widely used in both semi-autonomous and fully autonomous settings, and this technique for robot localization and mapping is becoming more and more popular with the rapid development of visual simultaneous localization and mapping (visual SLAM) methods (Strasdat et al. 2012; Aulinas et al. 2008). The UAVs available on the market nowadays are becoming much smaller than before, which restricts their payload capacity to a certain degree. Hence, researchers have focused more on the use of simple and lightweight cameras rather than traditional, complex sonar and laser radar. The Stanford CART robot (Moravec 1983) was the first attempt at single-camera map-building. Later on, the detection of the 3D coordinates of images was improved using a modified version of an interest operator algorithm. The expected result of this system was the demonstration of the 3D coordinates of obstacles, which were placed on a grid with 2 m cells. This technology is capable of reconstructing the obstacles in the environment, but it still cannot model a large-scale world environment. Afterward, in order to recover camera poses and environment structure simultaneously, vision-based SLAM techniques have made good progress, resulting in three kinds of map-building methods (indirect, direct, and hybrid) according to the way the images from the visual sensors are processed. Wang et al. (2020) proposed and developed an efficient system for autonomous 3D exploration with a UAV. The authors put forward a systematic approach towards robust and efficient autonomous UAV navigation. For localization of the UAV, the authors used a road map that was built incrementally and maintained during the inspection process. This road map provides the gain and the cost-to-go for a candidate region to be explored, which are the two quantities of the next-best-view (NBV) evaluation. The
authors verified their proposed framework in various 3D environments, and the results obtained exhibit the typical features of NBV selection and a much better performance of their approach in terms of exploration efficiency compared to other methods.
1) Indirect method
Indirect method based systems first detect and extract attributes from images and then use
them as inputs for localization and estimation of motion instead of using the images
directly. Generally, the extracted features are assumed to be invariant to viewpoint
changes and rotation, as well as robust to noise and blur. In the last decades, thorough
research on detection and description of features has been undertaken and several types of
feature detectors and descriptors have been proposed (Tuytelaars and Mikolajczyk 2008;
Li and Allinson 2008). Therefore, most recently proposed SLAM algorithms fall under the feature-based framework. Davison (2003) proposed monocular vision-based localization with real-time performance by mapping a sparse set of distinctive natural features based on a top-down Bayesian framework. This work is a landmark for monocular vision-aided SLAM and has had a great influence on subsequent research. Klein and Murray (2007) proposed an algorithm for parallel tracking and mapping. This algorithm is the first to divide the SLAM system into two independent parallel threads, tracking and mapping, which has become the standard for current feature-based SLAM systems. A
state-of-the-art approach for large-scale navigation was proposed by Mahon et al. (2008).
Drifts of trajectories, which are common in large-scale environments, are corrected by visual loop-closure detection. Celik et al. (2009) proposed a visual SLAM-based system
for navigation in unknown indoor environments without the aids of GPS. UAVs have to
predict the state and range using only a single onboard camera. Representing the
environment using energy-based feature points and straight architectural lines is the heart
of the navigation strategy. Harmat et al. (2015) presented a vision-based attitude
estimation method based on multi-camera parallel tracking and mapping (PTAM) (Klein
and Murray 2007). PTAM first integrates the estimates of ego-motion from multiple
cameras and then correlates the mapping modules and pose estimation. The authors also
proposed a novel approach of external parametric calibration for systems using multiple
visual sensors with non-overlapping fields of view. However, many indirect methods are able to reconstruct only a sparse set of points, as they extract only distinct feature points from images; these are called sparse indirect methods.
Researchers have therefore anticipated that dense indirect methods could reconstruct dense maps of the environment. Valgaerts et al. (2012) exploited an energy-based method for estimating the fundamental matrix and deriving correspondences from it. Ranftl et al. (2016) used a segmented optical flow field to produce a dense depth map from two successive frames, which means that a dense reconstruction of the scene can be achieved under this framework through the optimization of a convex program.
2) Direct method
Indirect methods perform well in ordinary environments; however, they struggle in texture-less environments. Therefore, direct map-building methods came into play. Map-building algorithms based on direct methods optimize geometric parameters by exploiting all of the intensity information present in the images of the surroundings, which provides robustness against geometric and photometric distortions in the images. Furthermore, direct methods can find dense correspondences, so they are capable of reconstructing a dense map at an additional computational cost.
Silveria et al. (2008) presented a novel method for the estimation of scene structure and
pose of the camera. In this work the authors directly utilized intensities of the image as
observations, thereby, leveraging all data present in images. Hence, this approach is more
robust than indirect methods for the surroundings with few feature points. Newcombe et
al. (2011) proposed a real-time monocular visual SLAM algorithm, DTAM, that also estimates the 6-DOF motion of a camera using direct methods. By aligning the entire image at frame rate, this algorithm is capable of generating dense surfaces based on the estimated comprehensive textured depth maps. Engel et al. (2014) proposed and developed an efficient probabilistic direct method to estimate semi-dense maps, which can be further used for image alignment. Rather than optimizing parameters without a scale, LSD-SLAM (Engel et al. 2014) exploits pose graph optimization, which allows loop closure detection and scale drift correction in real time by explicitly taking the scale factor into account. Krul et al.
(2021) presented and developed a visual SLAM-based approach for the localization of a
small UAV with a monocular camera in indoor environments for farming and livestock
management. The authors compared the performance of two visual SLAM algorithms:
LSD-SLAM and ORB-SLAM, and found that ORB-SLAM-based localization suits those workplaces best. Further, this algorithm was tested through several experiments
including waypoint navigation and map generation in cluttered areas of a greenhouse and a dairy farm.
3) Hybrid method
The hybrid method is a combination of both direct and indirect methods. The first step in
the hybrid method includes initialization of feature correspondences by using indirect
methods, then in the next step, the camera poses are refined continuously using direct
methods. Forster et al. (2014) proposed an innovative semi-direct algorithm (SVO), for
the estimation of the states of a UAV. Similar to PTAM, this algorithm also implements
motion estimation and mapping of point clouds in two different threads. By combining
gradient information and pixel brightness with the calibration of feature points and
minimization of reprojection error, more accurate pose estimation can be obtained using
SVO. Subsequently, Forster et al. (2015) developed a computationally efficient algorithm
for real-time landing spot detection and 3D reconstruction of the surroundings. In order to
carry out real-time exploration, SVO needs to execute with a camera of high frame rate,
as it has limited resources for heavy computation.
4) Multi-sensor fusion
Bonin-Font et al. (2008) developed a system for the navigation of ground mobile robots
in which they utilized laser scanners for accessing 3D point clouds of relatively good
quality. Also, nowadays UAVs can be equipped with these laser scanners as their size is
getting smaller. Different kinds of measurements from different types of sensors can be fused together, enabling a more robust and accurate estimation of the UAV state. A general-purpose multi-sensor fusion extended Kalman filter (MSF-EKF)
can deal with multiple delayed measurement signals for different types of sensors and
provide a more robust and accurate estimation of the UAV attitude for robust control and
navigation (Lynen et al. 2013). Magree and Johnson (2014) proposed and developed a
navigation system that exploits the fusion of visual SLAM and laser SLAM with an
EKF-based inertial navigation system. The monocular visual SLAM is responsible for
finding data association and estimating the pose of the UAV, whereas the laser SLAM
system is responsible for scan-to-map matching using a Monte Carlo framework. Hu and Wu (2020) proposed a multi-sensor fusion method based on adaptive error correction using an Extended Kalman Filter (EKF) for localization of a UAV. In their presented approach, first a multi-sensor localization system is built using acceleration sensors,
gyroscopes, mileage sensors, and magnetic sensors. Then, the data obtained from these
sensors is adjusted and compared in order to minimize the error from the estimated value.
Finally, the measurement noise and system noise covariance parameters in the EKF are optimized through the evolutionary iteration mechanism of the genetic algorithm. The authors then determine the degree of adaptation by taking the absolute value of the difference between the predicted value of the EKF and the real value.
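To make the EKF-based fusion concrete, a deliberately simplified predict-update sketch is given below; the linear state-transition model and the matrix inputs are illustrative assumptions and do not reproduce the formulations of the cited systems.

import numpy as np

def ekf_step(x, P, u, z, F, B, H, Q, R):
    """One predict-update cycle of a (linearized) Kalman filter.

    x, P : state estimate and covariance
    u    : control input (e.g. IMU acceleration)
    z    : measurement (e.g. visual-SLAM position)
    F, B : state-transition and control matrices (Jacobians in the EKF case)
    H    : measurement matrix (Jacobian of the measurement model)
    Q, R : process and measurement noise covariances
    """
    # Prediction step
    x_pred = F @ x + B @ u
    P_pred = F @ P @ F.T + Q
    # Update step
    y = z - H @ x_pred                      # innovation
    S = H @ P_pred @ H.T + R                # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new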
C. Map-based system
Map-based systems enable UAVs to explore the environment and plan their movements once the spatial layout of the surroundings has been defined in a map. Typically, there are two types of environmental maps: occupancy grids and octrees.
Each of them includes varying extents of detail, from the 3D model of the environment to
the interrelationship of environmental elements. Fournier et al. (2007) presented a method
that used a 3D volumetric sensor that can enable an autonomous robot to efficiently
explore and map urban environments. In their work, they used a multi-resolution octree
for the construction of the 3D model of the environment. Later on, an open-source
framework for the depiction of 3D environment models was developed (Hornung et al.
2013). Basically, the idea here is to represent the 3D model of the environment as an octree, covering not only occupied space but also free and unknown space. In addition to this, the represented model is made to occupy less space by using an octree map compression technique, which makes the system capable of efficiently storing and updating the 3D models. Gutmann et al. (2008) collected and processed data of the surroundings using a stereo camera, which can be further used to generate a 3D map of the
environment. The key of this method is the extended scan line grouping technique that the
authors used for precise segmentation of the range data into planar segments. This
approach can also effectively deal with noise present in the depth estimated by the stereo
vision algorithm. Dryanovski et al. (2010) proposed a method that represents a 3D environment by using a multi-volume occupancy grid, which is capable of explicitly storing information about both free space and obstacles. It can also correct potentially erroneous former sensor readings by incrementally fusing and filtering in new positive or negative sensor data. Table 1 summarizes the algorithms for localization
and mapping cited in this review.
Authors                 Types          Methods           Sensors
Harmat et al. (2015)    Map-building   Indirect; sparse  Multiple cameras
2.2 Obstacle detection and avoidance
Obstacle avoidance is a crucial step in the process of autonomous navigation, since with this capability a robot can detect nearby objects, avoid collisions with them, and navigate safely without any risk of a crash. Thus, this capability plays an important role in increasing the level of autonomy of UAVs. The main objective of obstacle avoidance is to estimate the distances between the aerial vehicle and obstacles after detecting them. When the UAV gets close to obstacles, it is supposed to stay away or turn around as per the instructions of the obstacle avoidance technique used. One of the
some particular missions. In contrast, SLAM-based techniques can provide a
precise estimation of metric maps using a sophisticated SLAM algorithm, thereby making
UAVs capable of navigating safely with a collision avoidance feature exploiting more
environment information (Gageik et al. 2015). A novel technique for mapping a previously unknown environment using a SLAM system was proposed by Moreno-Armendariz and
Calvo (2014). Furthermore, the authors leveraged a novel artificial potential field
technique to avoid both static and dynamic obstacles. Later on, in order to deal with the poor illumination of indoor environments and the dependency on the number of feature points, Bai et al. (2015) proposed a procedure for generating self-adaptive feature points in the map, based on the PTAM (Parallel Tracking and Mapping) algorithm. Subsequently, Esrafilian and Taghirad (2016) proposed an approach based on oriented FAST and rotated BRIEF SLAM (ORB-SLAM) for UAV navigation with obstacle detection and avoidance. The proposed method first processes the video data by computing the 3D locations of the aerial vehicle and then generates a sparse cloud map. After this, it densifies the sparse map, and finally generates a collision-free road map using potential fields and rapidly exploring random trees (RRT). Yathirajam et al.
(2020) proposed and developed a real-time system that exploits ORB-SLAM for building a 3D map of the environment, with the path for the UAV generated continuously to navigate to the goal position in the shortest time. The main contributions of this approach are the implementation of chain-based path planning with a built-in obstacle avoidance feature and the generation of the UAV path with minimum overhead. Subsequently, Lindqvist et al.
(2020) proposed and demonstrated a Nonlinear Model Predictive Control (NMPC) for
obstacle avoidance and navigation of a UAV. In this work, the authors proposed a scheme
to distinguish between different kinds of trajectories in order to predict the positions of future obstacles. The authors conducted various laboratory experiments to illustrate the efficacy of the proposed architecture and to show that the presented technique delivers computationally stable and fast solutions to obstacle avoidance in dynamic environments.
Table 2 encapsulates the techniques used for obstacle detection and avoidance.
Table 2: Summary of important methods in obstacle detection and avoidance.
2.3 Path planning
Path planning is an essential step in the task of autonomous UAV navigation. It is required for finding the best path from the initial point to the target location on the basis of certain performance indicators, such as the minimum cost of work, the shortest flight time, and the shortest flying route. During path planning, the UAV is also required to avoid obstacles. Based on the environmental information used to compute an optimal path, the problem of path planning can be divided into two categories: global path planning and local path planning. The aim of a global path planning algorithm is to figure out an optimal path using an a priori geographical map of the environment. However, global path planning by itself is not enough for controlling a UAV in real time, especially when several other tasks need to be carried out immediately or unexpected obstacles emerge during the flight. Hence, there is
a need for local path planning to estimate a collision-free path in real time using sensor data from the surrounding environment.
A. Global path planning
A global path planner requires an a priori geographical map of the surrounding environment, with the locations of the starting and target points, to compute a preliminary path; therefore the global map is also known as a non-dynamic, that is, static map. Generally, two kinds of algorithms are used for global path planning: heuristic searching techniques and a series of intelligent algorithms. Li et al. (2019) proposed and developed an algorithm for planning the trajectories of multiple UAVs in static environments. The proposed algorithm includes three main stages: the generation of the initial trajectory, the correction of the trajectory, and smooth trajectory planning. The first phase of the proposed algorithm is achieved through MACO, which incorporates the Metropolis criterion into the node screening method of ant colony optimization (ACO) and can efficiently and effectively avoid falling into stagnation and local optima. For the next phase of the algorithm, the authors proposed three different trajectory correction schemes to solve the problem of collision avoidance. Finally, the discontinuity resulting from acute and sharp turns in trajectory planning is resolved by using the inscribed circle (IC) smoothing method. Results obtained through various laboratory experiments demonstrate the high feasibility and effectiveness of the proposed solution from the perspectives of collision avoidance, optimal solution, and optimized trajectory in the trajectory planning problem for UAVs. Yathirajam et al. (2020) proposed a chain-based path planning approach for the generation of a feasible path for a UAV in the ORB-SLAM framework with dynamic constraints on the path length and the minimum turn radius. The presented path planning algorithm enumerates a set of nodes that can move in a force field, thereby permitting rapid modification of the path in real time as the cost function changes. Subsequently, Jayaweera and Hanoun (2020) demonstrated a path planning algorithm for UAVs that enables them to follow moving ground targets. The proposed technique utilizes a dynamic artificial potential field (D-APF); the path generated by this algorithm is smooth, adapts to dynamic environments with obstructions, and is capable of handling motion profiles of targets moving on the ground, taking into account changes in their direction and speed. The existing path planning techniques, such as graph-based algorithms and swarm intelligence algorithms, are not
capable of incorporating UAV dynamic models and flight time into the evolution process. In order to overcome these limitations of existing methods, Shao et al. (2021) proposed a hierarchical scheme for trajectory optimization with a revised particle swarm optimization (PSO) and the Gauss pseudospectral method (GPM). The proposed scheme is a two-layer approach. In the first layer, the authors designed an improved version of PSO for path planning; in the second layer, the waypoints of the path generated by the improved PSO are used to construct a fitted curve, which serves as the initial values for GPM. After comparing these initial values with randomly generated ones, the authors conclude that the designed curve can significantly improve the efficiency of GPM. Further, the authors validated their presented scheme through numerous simulations, and the results obtained demonstrate that the proposed technique achieves much better efficiency compared to existing path planning methods.
1) Heuristic searching techniques
The well-known heuristic search method, the A-star algorithm, evolved from the classic Dijkstra algorithm. In the past few years, A-star has been greatly developed and improved, and many improved heuristic search methods have been derived from it. A modified A-star algorithm and an orographic database were used by Vachtsevanos et al. (1997) for searching for the best track and building a digital map, respectively. The heuristic A-star algorithm was used by Rouse (1989), who first divided the whole region into several square grids and then used the A-star algorithm to find the best path based on the value function of different grid points along the estimated path. Szczerba et al. (2000) proposed the sparse A-star search (SAS) for computing an optimal path; the presented algorithm successfully reduces the computational complexity by imposing additional constraints and limitations on the space-searching procedure during path planning. A dynamic A-star algorithm, also known as the D-star algorithm, was proposed and developed by Stentz (1994) for computing an optimal path in a partially or completely unknown dynamic environment. The algorithm keeps updating the map obtained from completely unknown environments and replans the path upon detecting new obstacles along its way. Yershova et al. (2005) proposed sampling-based path planning similar to rapidly exploring random trees (RRT) that can generate an optimal collision-free path when no prior information about the surrounding
environment is provided. Sanchez-Lopez et al. (2018) proposed and developed a 3D path planning solution for UAVs which makes them capable of figuring out an optimal, feasible, and collision-free path in complex dynamic environments. In the proposed approach the authors exploit a probabilistic graph in order to sample the admissible space without considering the existing obstacles. Whenever planning is required, the A-star discrete search algorithm explores the generated probabilistic graph to obtain an optimal collision-free path. The authors validated their proposed solution in the V-REP simulator and then deployed it on a real-time UAV. As a common kind of obstacle in complex 3D environments, U-type obstacles can confuse a UAV and thus lead to a collision. Therefore, in order to solve this problem, Zhang et al. (2019) proposed and developed a novel Ant Colony Optimization (ACO)-based method called Self-Heuristic Ant (SHA) for the generation of optimal, collision-free paths in complex 3D environments with dense U-type obstacles. In this approach the authors first construct the whole space using a grid model of the workspace and then design a novel optimization function for UAV path planning. In order to prevent ACO deadlock, that is, the trapping of ants in U-type obstacles in the absence of an optional successor node, the authors designed two different search approaches for selecting the next path node. Furthermore, the Self-Heuristic Ant (SHA) is used for improving the efficacy of the ACO-based method. Finally, results obtained after conducting several thoroughly investigated experiments illustrate that the probability of a deadlock state can be reduced to a great extent with the implementation of the proposed search strategies.
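To make the heuristic-search idea concrete, a compact A-star sketch over a 2D occupancy grid is given below; the 4-connected neighbourhood, unit step cost, and Manhattan-distance heuristic are simplifying assumptions and are not tied to any of the cited works.

import heapq

def astar(grid, start, goal):
    """A-star search on a 2D occupancy grid; grid[r][c] == 1 marks an obstacle."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])   # Manhattan heuristic
    open_set = [(h(start), 0, start, None)]                   # (f, g, node, parent)
    came_from, g_cost = {}, {start: 0}
    while open_set:
        _, g, node, parent = heapq.heappop(open_set)
        if node in came_from:
            continue                                          # already expanded
        came_from[node] = parent
        if node == goal:                                      # reconstruct the path
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1                                    # unit step cost
                if ng < g_cost.get((nr, nc), float("inf")):
                    g_cost[(nr, nc)] = ng
                    heapq.heappush(open_set, (ng + h((nr, nc)), ng, (nr, nc), node))
    return None                                               # no collision-free path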
2) Intelligent algorithms
In the past few years, researchers have tried to solve global path planning problems using intelligent algorithms and have proposed many intelligent searching techniques. The most renowned intelligent algorithms are the simulated annealing algorithm (SAA) and the genetic algorithm. Zhang et al. (2012) used the genetic algorithm and SAA methods in the computation of an optimal path. Crossover and mutation operations of the genetic algorithm and the Metropolis criterion are used for evaluating the adaptation function of the path, thereby improving the effectiveness of path planning. Andert and
Adolf (2009) proposed and developed an optimized global path planning technique using
the improved conjugate direction method and the simulated annealing algorithm.
B. Local path planning
Local path planning methods exploit information about the local environment and the state estimate of the UAV to dynamically plan a local path without failure or collision. Path planning in dynamic environments might become computationally
expensive due to some unexpected factors, such as the motion of objects in the dynamic
environments. In order to overcome this problem, path planning algorithms need to be
adaptable to the dynamic characteristics of the environment, using information such as the shape, size, and location of different parts of the environment obtained through a
variety of sensors.
Conventional local path planning techniques include artificial potential field methods,
neural network methods, fuzzy logic methods, and spatial search methods, etc. Some
general local path planning methods are discussed below. Sugihara and Suzuki (1996) proposed an artificial potential field method that maps the robot's local environment into an abstract artificial potential field. The target location exerts an “attraction” and obstacles exert a “repulsion” on the navigating robot, and these two forces together move the robot towards the target location. An example of the use of the artificial gravitational field method for computing a path through a radar threat area was given by Bortoff (2000). Genetic algorithms provide a general framework for
solving typical and complex problems of optimization, especially the ones that are related
to the computation of an optimal path. These algorithms are inspired by the inheritance
and evolution concepts of biological phenomena. Problems can be solved according to the
principle of “survival of the fittest and survival competition,” to obtain an optimal
solution. A genetic algorithm is composed of five main elements: initial population,
chromosome coding, genetic operation, fitness function, and control parameters. Pellazar
(1998) proposed a path planning solution based on a genetic algorithm for an aircraft.
Neural networks are computational methods inspired by biological nervous systems. An example of a path planning technique implemented using Hopfield networks was given by Gilmore and Czuchry (1992). Parunak et al. (2002) proposed an ant colony algorithm, a type of bionic algorithm inspired by ant activity. It is a stochastic optimization method that mimics the behavioral characteristics of ants. This technique can accomplish good results on a series of hard combinatorial optimization problems. Table 3 provides an outline of the different methods
discussed for the task of path planning.
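As a minimal illustration of the artificial potential field idea discussed above, the sketch below combines an attractive force towards the goal with repulsive forces from nearby obstacles; the gain values and the influence radius are illustrative assumptions.

import numpy as np

def potential_field_step(pos, goal, obstacles, k_att=1.0, k_rep=100.0, d0=2.0):
    """Resultant artificial-potential-field force on the robot at `pos` (2D arrays)."""
    force = k_att * (goal - pos)                      # attraction towards the goal
    for obs in obstacles:
        diff = pos - obs
        d = np.linalg.norm(diff)
        if 1e-6 < d < d0:                             # obstacle inside influence radius
            # Repulsion grows as the robot gets closer to the obstacle.
            force += k_rep * (1.0 / d - 1.0 / d0) / d**2 * (diff / d)
    return force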
2.4 Summary
In the literature review discussed above, vision-based autonomous navigation is illustrated from three facets: localization and mapping, obstacle detection and avoidance, and path planning. Localization and mapping of a UAV is the foundation of autonomous UAV navigation, as it provides information about the location and the environment. Obstacle avoidance and path planning are necessary for safe navigation to the target location without any collision. After reviewing the benchmark studies, we conclude that ORB-SLAM generates sparse maps of the surrounding environment; therefore, obtaining accurate estimates of the UAV's 6-DoF pose is very difficult using only sparse maps. Researchers have proposed certain solutions to this problem, such as fusing IMU measurements with the estimated 6-DoF pose using an Extended Kalman Filter to obtain the non-linear pose of the UAV, and utilizing CNNs to obtain scaled pose and depth estimates. However, several issues are associated with these approaches; hence, for our project we will try to design a new algorithm that leverages both IMU measurements and CNNs for scale and depth estimation, thereby improving the accuracy of the UAV pose estimation.
Chapter 3: Autonomous Navigation Framework
In this chapter, the details of the proposed state-of-the-art approach for solving the monocular vision-based navigation problem are presented. First, an overview of the proposed methodology is given. Then the steps of the proposed algorithm are discussed: the method for UAV localization using a SLAM algorithm; CNN-based scaling and depth estimation; densification of the estimated depth images by fine-tuning the estimation network to take an RGB image and a sparse depth map as an integrated input; a deep learning based technique for identifying key points in the depth image so that appropriate control commands can be provided to the UAV; and the implementation of an efficient path planning algorithm for finding the shortest obstacle-free route in the estimated depth map after converting it to a 2D occupancy grid map.
Figure 5: Simplified overview of the proposed architecture
In order to localize the UAV using only monocular vision, visual SLAM is used. SLAM is
the process of creating a map of an area and localizing oneself in this map. This map can
be thought of as a model or representation of aspects of interest describing the
environment where the robot is operating. SLAM systems are classified into two
categories: laser-based and visual-based. The systems that leverage the laser-based SLAM
approaches are equipped with LIDAR for creating laser scans which are further used to
create a map of the environment, whereas visual SLAM systems use camera images for localization and mapping of the environment. These systems are much cheaper and more efficient than laser-based SLAM systems. There are two state-of-the-art families of methods for solving the monocular vision-based SLAM problem: direct methods and feature-based methods. Feature-based methods work through the extraction of a set of unique features from each image. These extracted features are unique points, known as key points, that can be identified in an image. By matching the same points in multiple views, feature-based algorithms can determine the different positions from which these points were observed. On the other hand, direct methods do not rely on a sub-representation of the image; they compare the entire image for referencing and then use the image intensities to obtain information about the surroundings and the location. Maps created through direct visual SLAM can be represented as point clouds, whereas in the case of feature-based SLAM each point in the point cloud is a triangulated key point. On the basis of their density, these point clouds are classified into the following three categories:
(i) Sparse point clouds: These are often produced by feature-based SLAM and contain only a small subset of the environment, as they are generated through the extraction of blobs, corners, and boundaries of objects.
(ii) Semi-dense point clouds: These point clouds are denser than sparse point clouds and are capable of showing much more information about the area, including smaller details. However, these maps are often unable to represent textureless regions. Point clouds of this kind are produced by direct SLAM methods.
(iii) Dense point clouds: These maps are capable of including every detail of the environment. Although visually pleasing, they are not so useful for SLAM because they are much more computationally expensive than other map generation methods; they are not compatible with real-time requirements in most cases.
The robust and complete feature-based SLAM implementation that we use in the proposed project is ORB-SLAM (Oriented FAST and Rotated BRIEF SLAM), which is based on PTAM (Parallel Tracking and Mapping). It exploits ORB features for all tasks, such as determining the movement, creating a 3D representation, and relocalizing, thus making the system more efficient, straightforward, and reliable.
A significant advantage of using ORB-SLAM is that it is open-source and an active community is still working on its codebase, due to which many modified and improved versions are available, along with proper documentation and detailed descriptions of its inner workings.
ORB-SLAM uses feature matching to determine the position of the robot. It builds an image pyramid with multiple levels so that features can be extracted at different scales. An image pyramid can be described as a multiscale representation of a single image where each layer consists of a downsampled version of the previous layer. ORB detects key points in each layer, making it partially scale-invariant. All the key points found by ORB are then converted into binary feature vectors that uniquely describe the features. After all the features have been extracted, ORB-SLAM undistorts them using the provided camera parameters. The ORB-SLAM system runs three main threads in parallel, each responsible for a separate task in the process. These threads are tracking,
mapping, and loop closure. They are described in more detail in the following sections.
An example of an image pyramid with five levels of scale is shown in Figure 6.
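As an illustration of this multi-scale feature extraction and matching, a minimal OpenCV sketch is given below; the pyramid settings and matching strategy are assumptions for illustration and do not reproduce the exact parameters used inside ORB-SLAM.

import cv2

# ORB detector with an 8-level image pyramid (scale factor 1.2 between levels).
orb = cv2.ORB_create(nfeatures=1000, scaleFactor=1.2, nlevels=8)

def match_frames(img_prev, img_curr):
    """Extract ORB keypoints/descriptors in two frames and match them."""
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_curr, None)
    # Binary ORB descriptors are compared with the Hamming distance.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    return kp1, kp2, matches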
1) Tracking
This thread is responsible for estimating the current pose from new camera frames. It also decides whether an image should be used as a new keyframe or not. Whenever a new frame is received, ORB-SLAM calculates feature points, and the camera movement is determined by feature matching between the current frame and the previous frame. Then a decision has to be made whether the current frame should be used to create a new keyframe or not. For a frame to be considered a keyframe, the following criteria must be met (see the sketch after this list):
● Since tracking was last lost, around 20 frames must have been processed.
● If more than 20 frames have been processed since the last keyframe, the local mapping thread should be paused to allow the insertion of the new keyframe.
● The processed frame should contain at least 50 matched feature points.
● The number of feature points matched between the current frame and the reference frame should be less than 90% of the total feature points.
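A compact sketch of this keyframe decision, following the four criteria listed above, is given below; the counters passed to the function are hypothetical bookkeeping variables.

def should_insert_keyframe(frames_since_tracking_lost, frames_since_last_keyframe,
                           n_matched_points, n_total_points):
    """Simplified keyframe test following the four criteria listed above."""
    return (frames_since_tracking_lost > 20 and        # enough frames since tracking was lost
            frames_since_last_keyframe > 20 and        # enough frames since the last keyframe
            n_matched_points >= 50 and                 # the frame tracks enough feature points
            n_matched_points < 0.9 * n_total_points)   # but is not redundant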
2) Local Mapping
The local mapping thread is responsible for optimizing the map and keeping it up to date. Whenever it receives a new keyframe from the tracking thread, it first checks whether any map points can be removed. It then creates new map points by triangulating feature matches between the new keyframe and the keyframe with which it shares the most map points.
3) Loop Closure
Loop closure handles the problem of position drift: when the camera travels a large distance and returns to the same position, the start and end positions may not exactly overlap due to accumulated small errors. To deal with this, the similarity between the most recent keyframe and each of its neighbors is calculated, and the lowest of these similarity scores is then compared with the scores of the remaining keyframes. If the similarity score of another keyframe is higher than this lowest score, that keyframe is considered a loop closure candidate.
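As an illustration of this candidate-selection step, the following minimal sketch assumes a bag-of-words similarity function (such as the one provided by DBoW2); the names are illustrative and not the actual ORB SLAM API.

def loop_candidates(current_kf, covisible_neighbors, all_keyframes, similarity):
    # lowest similarity between the current keyframe and its covisible neighbors
    min_neighbor_score = min(similarity(current_kf, n) for n in covisible_neighbors)
    # any other keyframe scoring at least as high is a loop-closure candidate
    return [kf for kf in all_keyframes
            if kf is not current_kf
            and kf not in covisible_neighbors
            and similarity(current_kf, kf) >= min_neighbor_score]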
A. CNN-based depth estimation and densification
Convolutional neural networks are used for single-image depth estimation. They take RGB images as input and predict the distance from the camera to each pixel in meters. ORB SLAM also produces depth information, but it covers only a small portion of the environment. By fusing this sparse depth map with the CNN prediction, a denser and more accurate depth map can be obtained. This is achieved by modifying the CNN to take not only an RGB image as input but also a sparse depth map. The inner workings and technical details of the depth estimation network are described in the following subsections.
1) Network Architecture
The proposed convolutional neural network based depth estimation model has an encoder-decoder architecture, as shown in Figure 7. The encoder processes the input information and extracts depth features from the input RGB image; these feature maps are then stacked together to form the encoder output. The goal of the encoding layers is to provide larger receptive fields and thus capture more global information about the image. The output of the encoder is fed to the decoder, which is made up of upsampling blocks and deconvolution layers. These upsampling blocks process the stacked depth features and construct a depth map equal in size to the ground-truth depth image. In this research, five depth estimation models with different backbone networks (ResNet-50, DenseNet-161, DenseNet-169, DenseNet-201, and MobileNet-V2) are first created and evaluated on the NYU-V2 depth dataset. Then, in an attempt to further reduce the depth regression error and improve the quality of predictions, an ensemble learning approach is leveraged. In the presented ensemble learning work, the predictions from the individual depth estimation models are combined in three different ways: using a simple average, and by assigning different weights to the depth predictions from the five individual models and obtaining the optimized values of these weights using two different techniques, a genetic algorithm and particle swarm optimization.
The above discussed approaches to monocular depth estimation are presented in detail in the subsequent chapters.
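As an illustration of this encoder-decoder layout, the Keras sketch below pairs a pre-trained ResNet-50 encoder with a simple bilinear-upsampling decoder. The 480 x 640 input matches the NYU Depth V2 resolution, but the decoder filter sizes are illustrative assumptions, and for brevity the sketch takes only the RGB input (the sparse depth channel described earlier would be concatenated as an additional input).

import tensorflow as tf
from tensorflow.keras import layers, Model

def build_depth_model(input_shape=(480, 640, 3)):
    encoder = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet", input_shape=input_shape)
    x = encoder.output                      # stacked depth features from the encoder
    for filters in (512, 256, 128, 64):     # upsampling blocks of the decoder
        x = layers.UpSampling2D(interpolation="bilinear")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.UpSampling2D(interpolation="bilinear")(x)
    depth = layers.Conv2D(1, 3, padding="same", activation="relu", name="depth")(x)
    return Model(encoder.input, depth)

model = build_depth_model()
model.compile(optimizer="adam", loss="mae")   # MAE loss, as discussed under training loss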
Figure 7: The network architecture builds upon ResNet-50; the fully connected layer at the end of ResNet is replaced by a decoding layer.
2) Training datasets
The CNN model is trained on the NYU Depth V2 dataset, which contains 48,521 indoor images. These images are indoor scenes recorded with a Kinect camera at a resolution of 640 x 480 pixels. It is a labeled dataset in which depth maps are provided as ground-truth images. It includes different types of indoor places such as basements, bathrooms, bedrooms, offices, and dining rooms, taken from 3 different cities. An example is shown in figure 8, which shows several images along with their class labels.
Figure 8: Samples of the RGB image and the class labels from the NYU Depth V2
Dataset.
3) Data Augmentation
The training data is modified in each iteration in order to improve the accuracy and robustness of the model; this also allows more use to be made of the same amount of data. It is achieved in the following ways (a short sketch of these steps is given after the list):
Rotation: images are rotated at random to simulate different observation angles. The angles are chosen randomly in the range r ∈ [-5, 5] degrees.
Color Normalization: the RGB images are normalized using mean subtraction and division by the standard deviation.
Color Jitter: the saturation, contrast, and brightness of the RGB images are scaled at random by ki ∈ [0.6, 1.4].
Flipping: there is a 50% chance that the image is flipped horizontally.
Sample noise: to make the model more robust, a small amount of noise is added to the distance measurements.
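A minimal sketch of a subset of these augmentation steps (rotation, brightness jitter, flipping, and sample noise) is given below; it assumes the RGB image is a NumPy array in [0, 1] and the depth map is in meters. The ranges follow the text, while the helper itself and the noise level are illustrative assumptions.

import numpy as np
from scipy.ndimage import rotate

def augment(rgb, depth, rng=np.random.default_rng()):
    angle = rng.uniform(-5.0, 5.0)                       # random rotation in [-5, 5] degrees
    rgb = rotate(rgb, angle, reshape=False, order=1)
    depth = rotate(depth, angle, reshape=False, order=0)
    rgb = np.clip(rgb * rng.uniform(0.6, 1.4), 0.0, 1.0)  # simple brightness jitter
    if rng.random() < 0.5:                               # 50% chance of horizontal flip
        rgb, depth = rgb[:, ::-1], depth[:, ::-1]
    depth = depth + rng.normal(0.0, 0.01, depth.shape)   # small noise on depth (distance) samples
    return rgb, depth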
4) Depth Sampling
The proposed CNN model uses a sparse depth image and an RGB image as input, and the evaluation is done using depth images as ground truth. The RGB and depth images are provided with the dataset, but the sparse depth images are not, which means they need to be generated during training. To provide the network with sparse samples during training, sparsifiers are used; they convert the ground-truth depth images into sparse subsets that can be used as input during training (a minimal example follows this paragraph). The most common way to design a sparsifier is to randomly sample points from the depth image. An ORB sparsifier must also be developed to make sure the depth estimation network optimally uses the depth samples generated by ORB SLAM. This allows the full advantage of the information provided by ORB points to be leveraged, because ORB uses a feature detector that mainly provides samples at the edges of obstacles. The limitation of the ORB detector, however, is that it does not obtain adequate features in textureless areas, making it harder to get correct depth predictions there. This problem can be mitigated either by filtering out low-texture areas while performing reconstruction or by creating a convex hull around each sparse sample.
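A minimal sketch of a random sparsifier is shown below: it keeps only a few hundred samples of the ground-truth depth map to mimic the sparse depth produced by ORB SLAM. The number of samples is an illustrative assumption.

import numpy as np

def random_sparsifier(depth, num_samples=500, rng=np.random.default_rng()):
    sparse = np.zeros_like(depth)
    valid = np.flatnonzero(depth > 0)                  # only sample valid depth pixels
    chosen = rng.choice(valid, size=min(num_samples, valid.size), replace=False)
    sparse.flat[chosen] = depth.flat[chosen]
    return sparse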
5) Intrinsic parameters of the neural network
A downside of neural networks is that intrinsic parameters, such as the focal length of the camera, are embedded into the network weights. The network also changes the resolution of the image, as it has to make sure that the image size is consistent with its input size. To correct the camera intrinsics for the change in resolution resulting from cropping the image, equation (1) is used, where f is the focal length in the x and y directions of the frame, r is the ratio by which the scale is reduced from the input image to the output of the CNN, and c is the optical center:

A_{nn} = \begin{bmatrix} f_x r_x & 0 & c_x r_x \\ 0 & f_y r_y & c_y r_y \\ 0 & 0 & 1 \end{bmatrix} \qquad (1)
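A small numerical example of equation (1) is given below; the focal lengths, optical center, and resolutions used are illustrative values only (roughly a 640 x 480 camera downscaled by a factor of two).

import numpy as np

fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5      # example intrinsics (pixels)
rx, ry = 320 / 640, 240 / 480                    # CNN output / input resolution ratios

A_nn = np.array([[fx * rx, 0.0,     cx * rx],
                 [0.0,     fy * ry, cy * ry],
                 [0.0,     0.0,     1.0]])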
6) Training loss
We train our depth estimation network with the Mean Absolute Error (MAE) loss function and compare its evaluation results with the existing state-of-the-art monocular depth estimation methods. The MAE loss function is commonly used for optimization in supervised regression problems. It minimizes the mean of the absolute (L1) distance between the estimated and ground-truth values:

\ell_{MAE}(\hat{y}, y) = \frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i|

where ŷ represents the estimated depth and y the ground-truth depth. For our problem MAE performs very well, and the depth maps generated by the model trained with this loss function have sharp boundaries rather than a smooth, low-quality representation of edges. The advantage of the MAE loss function is that it does not penalize outliers in the training dataset heavily, as all of the errors (differences between predicted and ground-truth values) are weighted on one linear scale.
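For reference, the MAE loss above can be written in a few lines of NumPy; in Keras the equivalent built-in loss can be selected with loss="mae". The mask over valid pixels is an assumption for datasets with missing ground-truth depth.

import numpy as np

def mae_loss(pred, gt):
    valid = gt > 0                     # ignore missing ground-truth pixels
    return np.mean(np.abs(pred[valid] - gt[valid]))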
7) Evaluation metrics
We evaluate our depth estimation model with the following metrics:
RMSE
The root mean squared error is the square root of the mean of the squared difference between the predicted and ground-truth values:

rmse = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \qquad (2)

AbsREL
The absolute relative error is the mean of the absolute value of the relative difference between the predicted and ground-truth depths:

rel = \frac{1}{n} \sum_{i=1}^{n} \frac{|\hat{y}_i - y_i|}{y_i} \qquad (3)

In equations (2) and (3), y_i and ŷ_i represent the ground-truth and predicted depths of pixel i respectively, and n is the cardinality of the set of valid pixels over which the mean is computed.
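Both metrics can be computed directly from the predicted and ground-truth depth maps, as in the short sketch below (the valid-pixel mask is an assumption).

import numpy as np

def rmse(pred, gt):
    valid = gt > 0
    return np.sqrt(np.mean((pred[valid] - gt[valid]) ** 2))

def abs_rel(pred, gt):
    valid = gt > 0
    return np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid])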
B. Scale Calculation
Monocular vision-based SLAM operates at an arbitrary scale, which limits its real-time applications. The point clouds generated by ORB SLAM should therefore be scaled using the metric depth information of the scene provided by the trained CNN. To tackle this arbitrary-scale problem, an initialization procedure is introduced which determines the ratio between the SLAM point cloud and the CNN depth estimates. This is achieved by taking the triangulated map points obtained by SLAM over a set number of keyframes, matching them with the depth estimates obtained by the CNN, and computing their ratio to obtain the scale.
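A minimal sketch of this initialization is shown below, assuming paired per-point depths from SLAM and from the CNN at matched image locations; using the median of the ratios (rather than the mean) is an assumption made here for robustness to outliers.

import numpy as np

def estimate_scale(slam_depths, cnn_depths):
    slam_depths = np.asarray(slam_depths, dtype=float)
    cnn_depths = np.asarray(cnn_depths, dtype=float)
    valid = (slam_depths > 0) & (cnn_depths > 0)
    return np.median(cnn_depths[valid] / slam_depths[valid])

# scaled_map = estimate_scale(slam_depths, cnn_depths) * slam_point_cloud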
For outlier removal, the sparse depth map is stored as a KD-tree, a data structure that allows an efficient nearest-neighbor search. The mean distance from each point to its n nearest neighbors is determined; all points whose mean distances fall outside an interval defined by the mean and standard deviation of the entire set are considered outliers and are removed from the image.
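This statistical filter can be sketched with SciPy's KD-tree as below; the neighborhood size and the one-sigma interval are illustrative assumptions.

import numpy as np
from scipy.spatial import cKDTree

def remove_outliers(points, n_neighbors=8, sigma=1.0):
    tree = cKDTree(points)
    # distance to the n nearest neighbours (index 0 is the point itself)
    dists, _ = tree.query(points, k=n_neighbors + 1)
    mean_dist = dists[:, 1:].mean(axis=1)
    lo = mean_dist.mean() - sigma * mean_dist.std()
    hi = mean_dist.mean() + sigma * mean_dist.std()
    keep = (mean_dist >= lo) & (mean_dist <= hi)
    return points[keep]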
This work proposes a method to solve these boundary-related problems by fusing the sparse features obtained from SLAM with the CNN-predicted depth maps so as to obtain a more accurate and dense 3D reconstruction. A simple convolutional neural network with an encoder-decoder architecture is designed to take two inputs, an RGB image and a sparse depth image, for estimating the desired depth map of the scene. The experimental results discussed in this report verify that fusing the sparse map generated by the SLAM algorithm with the depth image estimated by the CNN significantly improves the overall performance of the system and outperforms the existing state-of-the-art depth estimation methods. The point cloud and sparse map from ORB SLAM and the depth image of the RGB scene are visually represented in Figure 9.
Figure 9: Examples of RGB, Sparse depth, ground truth depth, and CNN predicted depth
images. Image source: [ref]
The keypoint detection model used in this work is trained on a dataset of RGB images of the virtual indoor corridor environment. The idea behind this approach is to use an image classification method for the identification of keypoints, where a dataset is created by manually labeling each RGB image of the corridor with an appropriate UAV action such as go straight, turn left, or turn right. The estimated depth map corresponding to the RGB image of the scene is first used to classify the image as a keypoint or non-keypoint image before it is forwarded to the deep learning model that produces the UAV control commands. The flowchart for the discussed keypoint detection method is presented in figure 10.
Figure 10: Flowchart representing the keypoints detection algorithm used in this work
First, the video feed of the UAV flying in an indoor corridor environment is sent over WiFi to the CNN model for depth estimation. Once the depth map for a video frame is obtained, the following tasks are performed to determine whether that frame can be used for detecting keypoints:
● The depth image is divided into a grid of 9 equal-sized cells (3x3), and the average of the pixel values in each cell is taken as the average (mean) depth of that cell.
● If the average depth of the cells in the extreme rows and columns of the depth-map grid is much less than the average depth of the central cells, the UAV is flying safely in the middle of the corridor and the video frame is considered a non-keypoint image.
● If the average depth of the extreme cells is only slightly less than the average depth of the central cells, the UAV is facing a crossroads, and the corresponding video frame is considered a keypoint.
● If the average depth of the extreme cells is almost equal to the average depth of the central cells, there is a chance that the UAV is heading towards a wall; the image is also considered a keypoint in this case.
● If the depth of the upper-row cells is higher than that of the lower-row cells, or vice versa, the UAV is facing stairs, and again the video frame is considered a keypoint.
If the video frame is classified as a keypoint image, it is then forwarded to the deep learning model that produces the UAV control commands. In this work, the model used to classify the keypoint images into the set of control actions leverages a pre-trained DenseNet-161[ref] as its backbone network. The evaluation results presented in the subsequent sections verify the effectiveness of the proposed approach to autonomous UAV navigation in indoor corridor environments. A short sketch of the grid-based keypoint test described above is given below.
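The following sketch implements the 3x3 grid test in NumPy. The margins used to separate "much less", "slightly less", and "almost equal" are illustrative thresholds, not the tuned values of this work, and the stairs check is omitted for brevity.

import numpy as np

def classify_frame(depth_map, near=0.5, close=0.9):
    h, w = depth_map.shape
    cells = depth_map[:h - h % 3, :w - w % 3].reshape(3, h // 3, 3, w // 3)
    mean_depth = cells.mean(axis=(1, 3))              # 3x3 grid of average depths
    centre = mean_depth[1, 1]
    border = np.delete(mean_depth.ravel(), 4).mean()  # mean of the eight outer cells
    if border < near * centre:
        return "non-keypoint"         # flying safely along the corridor
    if border < close * centre:
        return "keypoint"             # likely a crossroads
    return "keypoint"                 # wall, stairs, or similar obstacle ahead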
Path planning is the task of finding an obstacle-free path from a starting point to a target location which can be easily followed by a mobile robot navigating in the environment. Before implementing a path planning algorithm for the autonomous navigation of a mobile robot, one needs to represent the surrounding environment in an appropriate manner. In the presented work, the ORB-SLAM algorithm is used for estimating the pose of the UAV and creating the map of the environment. The map generated by the SLAM algorithm is converted to a 2-dimensional occupancy grid map, and the A-star algorithm is used to plan an efficient path for the robot to reach the target location. An occupancy grid is created by dividing the environment into regular cells of equal dimensions. Each cell is associated with a value in the range [0,1], representing the probability of the presence of obstacles. In our
work, we represent each cell by only ‘0’ or ‘1’, where ‘0’ means obstacle-free space and ‘1’ marks an occupied position. This kind of occupancy map is known as a binary occupancy grid map. The accuracy of the representation depends on the 3-D map obtained using the SLAM algorithm and on the step size of the grid, that is, the dimension of the cells into which the environment is discretized. The smaller the grid step size, the more precise the environment representation; however, this is computationally expensive and can take more time in generating control commands for the navigating robot. Therefore, it is suitable to choose medium-sized cells for transforming the 3-D environment into a 2-D grid map, and in our case we have used a grid step size of 100 cm. For visual representation of the environment map, occupancy grid maps are generated with white and black cells, where a white cell represents empty space and an occupied position is identified with a black cell. In Figure 12, occupancy grid maps for simple and complex environments are presented.
Figure 12: Examples of 2D Occupancy grid maps of simple and complex environments.
In the above figures, the red arrow represents the starting point and the golden star represents the target. Black boundaries represent obstacles (e.g. walls) in the environment. The path discovered by the A-star heuristic-based path planning algorithm is shown by gray dots.
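As an illustration of how such a grid can be obtained from the SLAM map, the sketch below projects 3-D map points onto the ground plane and marks the corresponding cells as occupied. The axis convention and the helper name are assumptions, and the 1 m cell size follows the text.

import numpy as np

def to_occupancy_grid(map_points, cell_size=1.0):
    xy = np.asarray(map_points)[:, :2]                  # project onto the ground plane
    origin = xy.min(axis=0)
    idx = np.floor((xy - origin) / cell_size).astype(int)
    grid = np.zeros(idx.max(axis=0) + 1, dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1]] = 1                      # 1 = occupied, 0 = free
    return grid, origin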
The existing path planning algorithms generally used by researchers for planning the motion of mobile robots are based on heuristics, sampling, and optimization, for example A-star, Dijkstra, RRT, Ant Colony Optimisation (ACO), Particle Swarm Optimisation, etc. Considering the simplicity and efficiency of the method, we preferred the classical A-star algorithm over the others. Sampling- and optimization-based motion planning algorithms are suitable for large and dynamic environments where the agent has to find the shortest obstacle-free route to the goal position and can afford longer computation times. In this work, the test flights are performed in an indoor corridor environment under the assumption that it remains static and consistent and does not change dynamically. Hence, we leverage the simple heuristic-based path planning algorithm A-star, which can find a short and efficient path for the UAV in minimum time. Before implementing the path planning algorithm, however, it is important to define all the actions that can be taken by the UAV while navigating in the corridor environment. These actions are defined in the control function of the UAV, which is activated and communicated to the UAV over WiFi according to the directions of the path identified by the A-star algorithm in the occupancy grid map of the 3D environment. The UAV actions defined in the control script of this implementation are listed and described in Table 4.
Command Description
3.7 Implementation details
In this section, the technical details of the proposed approach for vision-based autonomous UAV navigation are discussed. The section lists all the software tools and libraries leveraged in the proposed work.
A. Software requirements
The software libraries, tools, and frameworks used in the implementation of this project are:
● OpenCV
● Pangolin
● Eigen3
● DBoW2 and g2o
● Tensorflow
● Keras
A. OpenCV
OpenCV is a large library for machine learning and computer vision used in various applications. It can be used to carry out many types of video and image processing, including identifying faces and objects in photos and videos, and it simplifies the process of identifying image patterns and performing mathematical operations on them. OpenCV is released under a free BSD license, and the first version available for academic and commercial use was 1.0. It provides interfaces for various programming languages including C, C++, Python, and Java, and can be easily installed and configured on Linux, Mac, Windows, iOS, and Android. The main focus of the designers of OpenCV was on efficient computation for real-time applications; the code is written in optimized C/C++ to leverage multi-core processing.
Some of the applications that can be addressed using OpenCV are listed below:
● Face detection
● Autonomous inspection and surveillance
● Counting people (foot traffic in a mall, etc.)
● Counting vehicles on a highway along with their speed
● Interactive art installations
● Anomaly detection in manufacturing processes
● Street-view image stitching
● Multimedia search and retrieval
● Navigation and control of self-driving cars and robots
● Recognition of objects in a given image
● Analysis of medical images and scans
● 3D structure from motion in movies
● Recognition of TV channel advertisements
B. Pangolin
Pangolin is a set of portable and lightweight utility libraries for prototyping 3D, video, or numeric based algorithms and programs. It is widely used in computer vision for removing platform-specific boilerplate and for visualizing data. The main purpose of this software is to maximize flexibility and portability through simple factories and interfaces over things such as video and windowing, and to minimize boilerplate. Furthermore, Pangolin provides a suite of utilities for interactive prototyping and debugging, such as variable tweaking, 3D manipulation, and a Quake-like drop-down console for live tweaking and Python scripting.
C. Eigen 3
Eigen is a template library written in C++ programming language for linear algebraic
computations: vectors, matrices, numerical solvers, and other related algorithms. Eigen is
fast, versatile, elegant, reliable, and has good compiler support.
D. DBoW2 and g2o
DBoW2 and g2o are modified third-party libraries used by ORB SLAM. DBoW2 is used for place recognition, whereas g2o is leveraged for carrying out non-linear optimizations.
E. Tensorflow
Tensorflow was developed by Google as an open-source framework for executing deep learning, machine learning, and other predictive and statistical analytics workloads. It is designed to streamline the development and execution of advanced analytics applications for users such as statisticians, predictive modelers, and data scientists. The framework organizes computations as graphs of computing nodes, where the edges connecting the nodes carry tensors, that is, multidimensional vectors or matrices. Tensorflow programs use a dataflow architecture that works with the intermediate results generated during computation, which allows very large applications, such as neural networks, to be executed with parallel processing. Both high-level and low-level APIs are included in the framework; to simplify application programming and data pipeline development, Google recommends using the high-level APIs.
F. Keras
Keras provides an interface for neural networks and is available as an open-source software library; it acts as an interface for the Tensorflow framework. Up to version 2.3, Keras supported multiple backends, including Tensorflow, Microsoft Cognitive Toolkit, Theano, and PlaidML, but from version 2.4 only Tensorflow is supported. It is designed to enable fast experimentation with deep neural networks, with a focus on being modular, extensible, and user-friendly. Francois Chollet, a Google engineer, developed Keras as part of the project ONEIROS (Open-ended Neuro-Electronic Intelligent Robot Operating System). Keras consists of various pre-implemented neural network building blocks such as activation functions, layers, objectives, and optimizers. Furthermore, it provides support for recurrent and convolutional neural networks, along with common utility layers like batch normalization, dropout, and pooling.
Chapter 4: Simultaneous localization
and Mapping
In this chapter, the details of the tasks completed so far in the implementation of the proposed approach for autonomous UAV navigation are presented. In the subsequent sections, the implementation details of the CNN-based depth estimation model are provided first; then, in sections 4.2 and 4.3, the camera calibration procedure, along with the results of the calibration algorithm for the Asus Vivobook webcam, and the generation of the point cloud map using ORB SLAM are discussed respectively. Finally, in section 4.4, an overview of all the software utilities used in the proposed work is presented.
4.3 Mapping
Figure 11: ORB SLAM system overview (Mur-Artal et al. 2015)
Chapter 5: Transfer learning based depth
estimation
This chapter demonstrates the results of the experiments completed so far as independent components of the proposed work. First, the evaluation results of the CNN-based depth estimation model, which is trained and validated on the NYU depth dataset, are presented. Next, ORB SLAM is executed on three different datasets and the outputs generated by the algorithm in each case are shown. Real-time testing of the proposed project through flight experiments is still remaining and is to be completed by the end of the final semester; however, the specifications of the UAV to be used for this task are presented in the third sub-section.
5.2 ResNet-50
5.3 DenseNet-161
5.4 DenseNet-169
5.5 DenseNet-201
5.6 MobileNet-V2
Chapter 7: Path planning and Control
This chapter describes the most important task in the autonomous navigation process. After the successful execution of this step, the UAV is able to fly through the given environment from the starting position to the target location autonomously and safely, without colliding with the surrounding obstacles. The subsections of this chapter explain the algorithms used in this work for UAV path planning and control. First, the method for obtaining keypoints from the RGB and depth images is presented. Then, a brief overview of the 2-D occupancy grid maps used for visualizing and executing the path planning algorithm is provided. Next, the algorithm used to plan a trajectory for autonomous UAV flight in indoor corridor environments is discussed. Finally, the vision-based control approach used in this research work for transferring appropriate navigation commands to the UAV is demonstrated.
● The captured RGB image frame is fed to the depth estimation model.
● The obtained depth map is divided into a grid of 3x3 cells, and the average of the pixel values of each cell is calculated; this determines the average depth of each cell of the frame of the surrounding environment.
● If the average depth of the cells at the centre is only slightly higher than, or close to, the average depth of the cells at the boundaries of the grid, the image frame is classified as a “keypoint image”; if it is much higher, the frame is classified as a “non-keypoint image”.
● If the RGB frame is classified as a “keypoint image”, a pretrained DenseNet-161 model is used to determine the appropriate control command for the UAV, so that it can cross the keypoint safely without colliding with a wall, door, etc.
In order to classify the captured RGB frames according to the different types of keypoints, a pretrained DenseNet-161 model is used with its last layer replaced by several convolutional layers and a fully connected dense layer, with a softmax activation function combining the outputs of these layers. Further details about the architecture of the DenseNet-161-based model used in this work are given in Table 5.
Layer | Filters/Units | Kernel size | Stride | Padding
Conv1 | 3 | 1 | 1 | 0
Conv2 | 1024 | 1 | 1 | 0
Conv3 | 128 | 5 | 1 | 0
Conv4 | 16 | 1 | 1 | 0
Flatten | – | – | – | –
Dense | 2 | – | – | –
7.2 Perception and planning
The 3-D map of the surrounding environment obtained from the ORB SLAM algorithm is converted to a two-dimensional occupancy grid for the execution and visualization of the path planning algorithm leveraged in the autonomous navigation of the UAV in the indoor corridor environment, for the given start and destination locations. Existing path planning algorithms are broadly classified into three types: 1) heuristics-based path planning; 2) sampling-based path planning; 3) optimization-based path planning. Figure 15 represents the categorization of these path planning algorithms with examples.
In time-critical missions such as disaster response, for example, the UAV is required to reach the affected regions in minimum time in order to gather the necessary information about the condition of the calamity-hit area and the living beings present there. In contrast, regular UAV-based applications such as video recording, photo shoots, target following, and surveillance use the length of the discovered path as their performance objective, because in such applications the UAV resources must be managed in the most economical way. Further discussion on the performance objectives of UAV path planning algorithms is given in Table 5.
Table 5: Different performance objectives that must be considered before selecting the path planning algorithm for a UAV-based mission.
Computing time | Overall time required to find a path using a graph.
Path nodes | The set of nodes that a UAV follows during flight.
Problem size | The size of the problem on which the path is determined.
Graph size | The size of the graph (number of nodes and edges) employed to find a path.
Flexibility | The effort/time required to make a solution usable for different missions.
Hyper parameters | The number and variety of parameters needed to find a path.
Endurance | The ability of a UAV to fly for a long period of time with low-cost planning.
In this work, that is, autonomous UAV navigation in an indoor corridor environment, the path searching method is selected with the objective of minimizing the computation time and system memory used by the algorithm. After reviewing the classical path searching methods, we decided to leverage the heuristic-based A-star algorithm for estimating a collision-free and efficient trajectory for the UAV to follow to reach the desired target location. The A-star path planning algorithm finds the shortest path between the start node and an end node of the graph. A cost is associated with each node of the graph, defined as:

f = g + h   (4)

where g is the accumulated cost of the path from the start node to the current node and h is the estimated (heuristic) cost of the distance between the current node and the end node.
The shortest path from the starting node to the target or end node is determined by considering the nodes with minimum cost in the path between the starting and target
locations. Each cell of the binary occupancy grid is assumed to be a node of an undirected graph, and each node has four neighbors, that is, the top, left, right, and bottom positions. The cost of the path between a node and its neighbor (g) is set to 1, and the heuristic distance h between two nodes n1 and n2 is given by the Euclidean norm:

h(n_1, n_2) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}   (5)

In equation (5), x and y are the row and column positions of the given cell in the binary occupancy grid.
During both simulation and real-time flight, the implemented path planning algorithm executed successfully in less than 1 ms. As a static and previously known environment is assumed, the path or required trajectory is computed by the algorithm before UAV take-off. A compact sketch of this search is given below.
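The following sketch implements the search described above over a binary occupancy grid: 4-connected neighbors, unit move cost g, and the Euclidean heuristic h of equation (5). The helper name and interface are illustrative.

import heapq
import math

def astar(grid, start, goal):
    h = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])   # Euclidean heuristic, eq. (5)
    open_set = [(h(start, goal), 0, start, None)]
    parents, g_cost = {}, {start: 0}
    while open_set:
        _, g, node, parent = heapq.heappop(open_set)
        if node in parents:                                  # already expanded
            continue
        parents[node] = parent
        if node == goal:                                     # reconstruct the path
            path = [node]
            while parents[path[-1]] is not None:
                path.append(parents[path[-1]])
            return path[::-1]
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] == 0:
                ng = g + 1                                   # unit move cost g
                if ng < g_cost.get((nr, nc), float("inf")):
                    g_cost[(nr, nc)] = ng
                    heapq.heappush(open_set, (ng + h((nr, nc), goal), ng, (nr, nc), node))
    return None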
The control commands drive the UAV to the next suitable state without collision with the help of the actuators. The control technology used in the DJI Tello is not open source, hence it is difficult to state exactly how the flight controller operates. An important property of the DJI Tello flight controller, however, is that it is not extremely precise in navigating a given distance: when the user issues a command for the drone to move 2 m forward, the drone will move forward a distance of 2 + ε, where ε is a small deviation or random noise caused by measurement errors of the flight controller and actuators or by air drift. This characteristic of the manufacturer’s flight controller has an important implication: it is not sufficient to rely completely on the UAV flight controller for all user-defined actions. For example, if the user executes a command to move 5 m forward and turn 90° anti-clockwise, it is not guaranteed that the UAV will move exactly that far or rotate through the exact angle, and it might hit a wall or another obstacle. Therefore, the user must be cautious about this behavior of the UAV actuators and flight controller, and should design the obstacle avoidance and depth estimation algorithms in such a way that the UAV remains safe even with small deviations in the flight commands. In this
work, a high-level control interface is designed under the assumption that the flight controller of the UAV leads it to navigate safely with only minor errors or deviations. Each of the general control commands of the drone, like take-off, landing, and moving, has its own individual control interface designed by the manufacturer. To avoid implementing the autonomous navigation framework separately for each command interface, a high-level control interface layer is designed in this work which can be used on top of the manufacturer’s control interface to establish communication between the UAV flight controller and the navigation algorithm. The architecture of the implemented high-level control interface layer is represented in figure 16.
The figure shows all the classes and functions implemented in the control script used in this work. The script generates high-level control commands that are communicated to the UAV flight controller over WiFi, which then regulates the movement of the UAV actuators to lead the UAV to the desired state with stabilization.
Figure 16: Architecture of the designed high level control interface
In the implemented control unit, four classes are defined: Controller, AirSimDrone, TelloEDU, and AirSimDroneNoisy. The Controller class inherits from the AirSimDrone and TelloEDU classes, whereas AirSimDroneNoisy inherits from the AirSimDrone class. All the functions required for the desired maneuvers of the UAV are defined in the Controller class. The AirSimDrone class provides the control interface for the simulated Parrot A.R. drone, and the TelloEDU class is designed to act as the high-level control interface for the DJI Tello drone. AirSimDroneNoisy is the child of the AirSimDrone class: it implements all the functions of AirSimDrone along with an additional noise method which is used to evaluate the robustness of the proposed navigation algorithm. In real-world scenarios the conditions of the navigation environment are not ideal; there might be air drift, inaccuracies in the flight controller, hardware issues, etc. To make the simulation of UAV navigation closer to real-world exploration, random noise is added to the simulation environment created using Unreal Engine 5.0 with the AirSim plugin. A simplified sketch of this interface layer is given below.
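A simplified sketch of such a layer is shown below. The class names mirror Figure 16, but the method names and bodies are placeholders, and the Controller is shown wrapping a single drone interface (composition) purely for brevity; the actual implementation may differ.

import random

class Controller:
    """High-level commands used by the navigation algorithm."""
    def __init__(self, drone):
        self.drone = drone
    def take_off(self):            self.drone.take_off()
    def land(self):                self.drone.land()
    def move_forward(self, m):     self.drone.move(dx=m)
    def turn(self, degrees):       self.drone.rotate(degrees)

class AirSimDrone:
    """Low-level interface for the simulated drone (AirSim API calls would go here)."""
    def take_off(self): ...
    def land(self): ...
    def move(self, dx=0.0, dy=0.0, dz=0.0): ...
    def rotate(self, degrees): ...

class AirSimDroneNoisy(AirSimDrone):
    """Same as AirSimDrone, but perturbs every move command with random noise."""
    def move(self, dx=0.0, dy=0.0, dz=0.0):
        super().move(dx + random.gauss(0, 0.05), dy, dz)

class TelloEDU:
    """Low-level interface for the real DJI Tello, sending commands over WiFi."""
    def take_off(self): ...
    def land(self): ...
    def move(self, dx=0.0, dy=0.0, dz=0.0): ...
    def rotate(self, degrees): ...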
Chapter 8: Simulation and Real-time
Experiments
This chapter demonstrates all the simulations and real-time experiments carried out with the Parrot A.R. and DJI Tello Ryze drones, respectively. Furthermore, information on the simulators, environments, and drones used to perform the experiments for evaluating the proposed approach is provided in the subsequent sections.
8.1 Simulators
Simulators are developed to virtually replicate the actions that are to be performed by a mobile robot in the real world. It is necessary to carry out simulation experiments with drones and robots before evaluating them in the physical environment, in order to avoid damage to the hardware and to the test environment. In the case of UAVs, the limited flight time and battery life may affect performance during a test flight, and researchers might not obtain correct evaluation results, hence the need for a simulator. To achieve reasonable and realistic simulation results, the simulator must be capable of cloning and modeling the real-world indoor corridor environment with an appropriate number of walls, turns, and crossroads. Along with this, the simulated environment must also consider other real-time factors, including wind turbulence, shadows, the brightness of the room, etc. Even though many physics simulators exist for simulating robot motion and dynamics, only two of these classical simulators are generally leveraged by researchers for analyzing the performance of autonomous drones and UAV motion: Gazebo and Unreal Engine 5.0.
A. Gazebo
Gazebo[ref] is non-proprietary software developed by Open Robotics[ref] in 2002. It allows researchers to simulate a variety of robots in complex outdoor and indoor environments efficiently and precisely. It is compatible with ROS (Robot Operating System)[ref] and is a very popular simulator, especially for drones. Being lightweight and platform independent, Gazebo can be easily downloaded and used with any operating system. However, the Gazebo simulator is not developed specifically for drone simulation: one needs to manually add and configure all the necessary tools and resources required for successful execution of the application. Furthermore, the documentation of the tool is not specifically focused on drone simulation. Therefore, it is sometimes difficult to leverage the simulator without prior experience or the guidance of a proficient user.
C. Airsim
AirSim[ref] is an open-source, non-proprietary module developed by Microsoft to be used with Unreal Engine for realistic simulation of drones and cars. It provides all the functionality required for a simulated environment, thereby leveraging the high-quality realism rendered by the game engine. AirSim is developed with the objective of simulating UAVs on top of an artificial intelligence research interface. An API developed in the Python programming language provides all the vital functionalities required to carry out high-quality simulation of autonomous and intelligent vehicles and systems. Furthermore, it can also be used for synthetic data collection, as all the sensors and images required for augmentation and annotation are available, for example LIDAR, depth maps, segmented images, etc. The idea behind the development of the plugin is to imitate the behavior of the real environment faithfully and to give researchers the same kind of challenging surroundings as in the physical world, so that they can confidently verify the effectiveness of their proposed approaches.
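For illustration, the snippet below shows how the AirSim Python client is typically used to take off, grab a front-camera frame, and send a velocity command; the camera name "0" and the velocity and duration values are illustrative.

import airsim

client = airsim.MultirotorClient()
client.confirmConnection()
client.enableApiControl(True)
client.armDisarm(True)
client.takeoffAsync().join()

# grab a front-camera RGB frame for the navigation pipeline
responses = client.simGetImages(
    [airsim.ImageRequest("0", airsim.ImageType.Scene, False, False)])

# move forward at 1 m/s for 2 s, then land
client.moveByVelocityAsync(1.0, 0.0, 0.0, 2.0).join()
client.landAsync().join()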
8.2 Simulated Environments
Leveraging the AirSim plugins, realistic simulated environments are created which are faithful to the physical world. In this work, the UAV simulation is carried out in indoor-corridor environments created with the AirSim plugins. The virtual UAV used is the Parrot A.R. drone with a front camera for capturing images of the surroundings. These images are forwarded to the autonomous navigation framework for processing and for the generation of control commands, which are then forwarded back to the UAV as feedback. To verify the efficiency and generalization of the proposed approach, the simulated environment used for the test flights is composed of corridors, turns, walls, doors, etc. A visual representation of the simulated environment is given in Figure 12.
DJI Tello Specifications
Price | 9,000 INR
Dimensions | 99 x 92.5 x 41 mm
Weight | 80 grams
Propeller | 3”
Sensors | Altitude sensor, barometer, IMU, 720p camera
Communication | 2.4 GHz WiFi
Camera
Photo | 5 MP (2592 x 1936)
FOV | 82.6°
Video | HD 720p, 30 FPS
Format | JPEG (photo); MP4 (video)
Image Stabilization | Electronic Image Stabilization (EIS)
Flight Performance
Max Flight Distance | 100 m
Max Speed | 8 m/s
Max Flight Time | 13 min
Max Flight Height | 50 m
8.4 Experiments with DJI Tello Ryze
The step-by-step proposed approach for autonomous UAV navigation is discussed in Chapter 3, where a complete overview of the design framework is presented. To evaluate the presented navigation algorithms, experiments are first carried out in the simulated environment with a virtual drone having a front-facing camera. After the successful execution of the test flights in the simulated environments, this section discusses the results obtained with a physical DJI Tello drone navigating in a real-world indoor corridor environment. The DJI Tello is made to navigate the ‘L’-shaped corridor of the Main Building, Zakir Husain College of Engineering and Technology, Aligarh Muslim University. In this corridor, the UAV flies 16 m forward and 5 m rightwards after taking a clockwise turn. The occupancy grid map representing this ‘L’-shaped corridor is shown in figure 12, and figure 13 shows images of the drone navigating in the corridor.
Figure 13: Pictures of DJI Tello flying autonomously in the corridor of Main Building,
ZHCET, AMU
In the ‘L’-shaped corridor, the UAV first moves forward up to 16 m, then takes a sharp right turn, rotating clockwise over about 0.9 m, and then moves 4 m forward in the east direction. Throughout the corridor, the drone completed its navigation task safely without colliding with the walls. Sometimes during navigation it deviated slightly from the centre of the corridor due to air drift, although the DJI Tello’s controller handles the air drift using its internal sensors and keeps the drone safe from collision with the walls. Therefore, both the simulation and real-time experiments verify the effectiveness of the proposed state-of-the-art approach to autonomous UAV navigation in GPS-denied environments. In the occupancy grid presented in figure 12, black cells represent the walls, whereas the white margins represent the empty space through which the UAV can safely navigate the corridor. Further, in order to avoid collisions with the wall and any other static obstacle present in the corridor, a depth threshold is needed: if the estimated depth to the nearest object is greater than the threshold, the UAV can continue flying in its current direction; otherwise it has to change direction accordingly. In this work, the depth threshold is set to 50 cm.
Chapter 9: Results
This chapter demonstrates the experimental results obtained after evaluating each step in the process of autonomous UAV navigation. First, the ORB SLAM results, that is, the sparse point cloud maps obtained from the RGB images, are presented. Next, the monocular depth estimation results obtained using the deep learning models and their ensembles are discussed, and finally the comparison of the obtained depth estimation results with the existing state-of-the-art monocular depth estimation techniques is presented.
9.1 ORB-SLAM results
In this section, the results obtained after executing ORB SLAM on two different image sequences, TUM and KITTI, are presented.
A. TUM image sequence
(a) (b)
Figure 15: (a) Key points selected in a given frame ; (b) point cloud map of the sequence
Part (a) of figure 15 shows the ORB features identified by ORB SLAM in an RGB image keyframe. These map points are used in the construction of the sparse 3-D map of the environment, which is shown in part (b). As the vehicle with the visual sensor/camera moves forward, ORB SLAM selects keyframes from the sequence, and from these keyframes ORB features are extracted, which are further used for the tasks of map initialization, loop closing, and tracking. The TUM RGB-D dataset was captured specifically for the evaluation of visual SLAM algorithms by the Computer Vision group, TUM School of Computation, Information and Technology, Technical University of Munich, Germany.
B. KITTI image sequence
(a) (b)
Figure 16: (a) Key points selected in a given frame; (b) point cloud map of the sequence
The KITTI dataset [ref] was prepared at the Karlsruhe Institute of Technology for developing real-time computer vision applications such as 3-D scene reconstruction, depth estimation, autonomous driving, virtual reality, etc. The KITTI dataset therefore helps computer vision researchers evaluate the performance of their models and compare the effectiveness of proposed schemes with the state-of-the-art. In this work, the KITTI odometry sequence is used to test the execution of ORB SLAM and to evaluate the performance of the proposed autonomous navigation approach by leveraging the maps generated by ORB SLAM on the TUM RGB-D[ref] and KITTI[ref] image sequences.
9.2 Depth Estimation results
A transfer learning based approach is used for estimating depth from the RGB images captured through a monocular camera. The depth estimation network used in this work has an encoder-decoder architecture, where the encoder is a pre-trained CNN model responsible for extracting dense depth features from the RGB image and generating the encoded feature stack, while the decoder constructs the depth map from the encoded depth features generated by the encoder. The encoder is the backbone network of the depth estimation model, as the depth estimation results of the model solely depend on the type and number of features extracted by the encoder. In this work, five different depth estimation models are developed using five different pre-trained CNN models as their respective encoder networks. The encoder networks used for extracting dense depth features from the RGB images are: ResNet-50, DenseNet-161, DenseNet-169, DenseNet-201, and MobileNet-V2. Table 4 presents the evaluation results obtained with these five models on the NYU V2 depth dataset[ref].
Table 4: Comparison of the evaluation results obtained for the individual depth
estimation models based on pre-trained encoder networks.
Encoder Networks | RMSE | AbsREL | δ1 < 1.25 | δ2 < 1.25² | δ3 < 1.25³
For visual comparison, figure 19 presents a gallery of RGB and ground-truth depth images from the NYU V2 depth dataset, along with the depth maps estimated by the five different depth estimation models discussed above.
Figure 19: From top to bottom: RGB images (first row); ground truth depth images (second row); depth images estimated by ResNet-50 (third row); by DenseNet-161 (fourth row); by DenseNet-169 (fifth row); by DenseNet-201 (sixth row); and by MobileNet-V2 (seventh row).
The minimum depth regression error is obtained with the model based on DenseNet-161. These models differ in architecture, number of convolution layers, and number of parameters; therefore, each network extracts a different number of features from the images, and the extracted features also differ in quality. In order to benefit from the characteristics of each trained depth estimation model and to obtain the maximum number of high-level features from the RGB image, the ensemble learning approach is leveraged in the deep learning based monocular depth estimation task. The evaluation results obtained with the ensemble of all five pretrained depth estimation models are less effective than those obtained from the ensemble of the best four depth estimation models, based on the encoder networks ResNet-50, DenseNet-161, DenseNet-169, and DenseNet-201.
Three ensembling techniques are used, in which the predictions from the four best depth estimation models are combined in three different ways: 1) calculating the average of all the predictions obtained from the four models in the ensemble; 2) assigning different weights to the individual predictions and computing the optimized values of these weights or coefficients using a Genetic Algorithm; and 3) optimizing the weights using Particle Swarm Optimization. A short sketch of these combination schemes is given below.
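A minimal sketch of the average and weighted combinations is shown below; the GA and PSO optimizers themselves are not shown and are assumed to supply the weight vector. The function names are illustrative.

import numpy as np

def average_ensemble(predictions):
    """Simple average of the depth maps predicted by the individual models."""
    return np.mean(predictions, axis=0)

def weighted_ensemble(predictions, weights):
    """Weighted combination; the weights come from GA or PSO optimization."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return np.tensordot(weights, np.asarray(predictions), axes=1)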
In Table 5 the validation results procured using these three models on NYU V2 depth
dataset are presented.
Table 5: Comparison of evaluation results obtained by using three different ensembling
methods
From Table 5, one can observe that the results obtained with all three ensemble approaches are very close; however, the lowest depth regression error and the maximum percentage of correctly predicted depth pixels are obtained with the Genetic-algorithm-based optimization, whereas the minimum relative error is obtained during the validation of the average-based ensemble. Figure 20 compares the depth maps obtained with the different ensembling approaches.
Figure 20: From top to bottom: RGB images (first row); ground truth depth images
(second row); depth images estimated by average based ensemble method (third row);
depth images estimated by GA based optimization technique (fourth row); depth images
estimated by Particle Swarm Optimization algorithm(fifth row)
Liu et al., 2016 [47] | 0.759 | 0.213 | 0.650 | 0.906 | 0.976
From the above results, it can be seen that the proposed ensemble learning based
approach for monocular depth estimation has outperformed all the existing state-of-the-art
methods on all evaluation metrics.
Chapter 10: Conclusion and Future
Research
This chapter concludes the dissertation thesis and mentions certain guidelines for the
future research work that can be carried out for further improvements in the proposed
solution to the problem of Vision based autonomous UAV Navigation in GPS-denied
environments.
10.1 Conclusion
This thesis presented a discussion on vision-based autonomous navigation of UAVs from three main aspects: localization and mapping, obstacle avoidance, and path planning. Localization and mapping form the core of autonomous navigation and are responsible for providing information about the environment and the vehicle's location. Obstacle avoidance and path planning are vital to make the UAV capable of reaching the target location safely and quickly without collision. Several visual SLAM algorithms have been developed by the computer vision community; still, many of them cannot be leveraged directly for UAV navigation due to the limited processing power of UAVs and the algorithms' architecture.
Although UAVs share similar navigation solutions with mobile ground robots, researchers still face many challenges in the implementation of vision-based autonomous UAV navigation. The UAV is required to process a large amount of sensor data in order to achieve safe and steady flight in real time; the processing of visual data in particular increases the computational cost to a great extent. Thus, autonomous navigation of UAVs under limited power and computing resources has become a major research challenge. It should also be noted that, in contrast to ground vehicles, UAVs are not able to simply stop navigating in a state of great uncertainty; the generation of incoherent commands can make the UAV unstable. Along with this, a UAV can exhibit unpredictable behavior whenever the computational resources are not enough to update attitude and velocity in time, or in the case of hardware or mechanical failure. Therefore, researchers must put effort into developing computer vision algorithms capable of responding quickly to the dynamic behavior of the environment. The development of such algorithms will help improve the native ability of UAVs to navigate smoothly in various attitudes and orientations with the sudden
appearance and disappearance of targets and obstacles. Furthermore, autonomous UAV
navigation needs a global or local 3D map of the surrounding environment, thereby
increasing the computation and storage overhead. Hence, there are many challenges for
long-term UAV navigation in complex environments. Besides that, tracking and localization failures can happen during UAV flight due to motion blur caused by rotation and fast movement. Algorithms used for tracking objects should be robust against illumination changes, noise, vehicle disturbances, and occlusions; otherwise, it will be very difficult for the tracker to estimate the trajectory of the target and to function in consonance with the controllers of the UAV. Hence, highly sophisticated and robust control schemes must exist for optimally closing the loop using visual data, and future research on relocalization and loop detection is expected. In addition, we also found that with a partial or full 3D map, one is required not only to find a collision-free path but also to optimize the energy consumption or the length of the resulting path. In contrast to 2D path planning, the challenges and difficulties in 3D map construction of the environment increase exponentially as the complexity of the dynamic constraints and kinematics of UAVs increases. Thus, there are no efficient solutions to this NP-hard problem, and even contemporary path planning algorithms suffer from the same problem of local minima. Therefore, a more efficient and robust global optimization algorithm is still under research and development. The research surveyed in this work illustrates that only a few techniques have been proven experimentally; many of the vision-based SLAM and obstacle avoidance strategies are not yet fully incorporated in the navigation controllers of autonomous UAVs, since the demonstrated methods either operate under constraints in simple environments or are validated only through simulations. Therefore, substantial engineering effort is essential to move the current state of the art a step ahead and to evaluate its performance in real-world environments.
art a step ahead and for the evaluation of their performance in real-world environments.
Another crucial finding from this review is that most of the flight tests discussed in the
presented work were carried out on small commercial UAVs with increased payload for
onboard processing units and multiple sensors. Yet, it is clear that trending research is
focusing more on micro aerial vehicles that are capable of navigating indoors, outdoors
and inspecting and maintaining target infrastructure utilizing their acrobatic maneuvering
competencies.
Hence, in order to apply the theoretical findings on monocular vision based autonomous navigation, we have put forward a state-of-the-art approach that leverages visual ORB SLAM for building the map of the environment and localizing the UAV in it. Because the obtained map is sparse and cannot clearly represent the obstacles present in the environment, deep CNNs are used for the densification of that map, and the obtained depth map is also used for the identification of keypoints in the navigation environment. The generated dense map is then converted to a 2-D occupancy grid map for the execution of the A-star path planning algorithm. A deep learning model is trained to generate appropriate control commands corresponding to each keypoint, and these action commands are sent to the UAV over WiFi to provide it with proper control and guidance during navigation in the GPS-denied environment. In order to achieve the objective of safe UAV navigation with a minimum number of collisions, we analyze the performance of five pre-trained deep CNN models that can be used as depth feature extractors in the depth estimation models. These five pre-trained models that form the encoder networks of the depth estimation models are: ResNet-50, DenseNet-161, DenseNet-169, DenseNet-201, and MobileNet-V2. The best evaluation results are obtained with the DenseNet-161 encoder network; hence, in this work DenseNet-161 is used for generating the depth map of the 3-D RGB scene during UAV navigation.
Furthermore, in order to increase the accuracy of depth estimation, we followed an ensemble learning approach in which the predictions from the best four depth estimation models are combined in three different ways: in the first ensemble model, the average of the predicted depths of all the models in the ensemble is taken as the final predicted depth; the second ensemble approach assigns different weights to the models in the ensemble and optimizes these weights using an evolutionary (Genetic) algorithm; and the third ensemble model optimizes the weights using particle swarm optimization. However, the ensemble-based approach is not leveraged for real-time depth estimation during autonomous UAV navigation; instead, it can be used in future research on the development of autonomous mobile robots that can navigate safely and independently in unknown dynamic environments.
10.2 Future research
In this work, the DJI Tello drone receives the control commands from the workstation over a WiFi network; the need for a WiFi connection therefore imposes a limitation on the proposed vision-based autonomous UAV navigation approach. To overcome this limitation, all the deep learning models used in this work should be transferred to an NVIDIA Jetson Nano board that can be embedded in the UAV hardware for on-board processing, thus eliminating the need for a stable workstation and WiFi connectivity.
References
Known Target.” Paper Presented at the AIAA Guidance, Navigation, and Control
Conference, Boston, MA, August 19–22.
14. Davison, A. J. 2003. “Real-Time Simultaneous Localisation and Mapping with a
Single-Camera.” Paper Presented at the Computer Vision, IEEE International
Conference on, Nice, France, October 13–16.
15. Desouza, G. N., and A. C. Kak. 2002. “Vision for Mobile Robot Navigation: A
Survey.” IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2):
237–267.
16. Dryanovski, I., W. Morris, and J. Xiao. 2010. “Multi-Volume Occupancy Grids:
An Efficient Probabilistic 3D Mapping Model for Micro Aerial Vehicles.” Paper
Presented at the Intelligent Robots and Systems, IEEE/RSJ International
Conference on, Taipei, China, October 18–22.
17. Elmokadem, T., & Savkin, A. V. (2021). A Hybrid Approach for Autonomous
Collision-Free UAV Navigation in 3D Partially Unknown Dynamic
Environments. Drones, 5(3), 57. doi:10.3390/drones5030057.
18. Engel, J., T. Schöps, and D. Cremers. 2014. “LSD-SLAM: Large-scale Direct
Monocular SLAM.” Paper presented at the European Conference on Computer
Vision, Zurich, Switzerland, September 6–12, 834–849.
19. Esrafilian, O., and H. D. Taghirad. 2016. “Autonomous Flight and Obstacle
Avoidance of a Quadrotor by Monocular SLAM.” Paper Presented at the Robotics
and Mechatronics, 4th IEEE International Conference on, Tehran, Iran, October
26–28.
20. Forster, C., M. Pizzoli, and D. Scaramuzza. 2014. “SVO: Fast Semi-direct
Monocular Visual Odometry.” Paper Presented at the Robotics and Automation,
IEEE International Conference on, Hong Kong, China, May 31–June 7.
21. Forster, C., M. Faessler, F. Fontana, M. Werlberger, and D. Scaramuzza. 2015.
“Continuous On-board Monocular-vision-based Elevation Mapping Applied to
Autonomous Landing of Micro Aerial Vehicles.” Paper Presented at the Robotics
and Automation, IEEE International Conference on, Seattle, USA, May 26–30.
22. Fournier, J., B. Ricard, and D. Laurendeau. 2007. “Mapping and Exploration of
Complex Environments Using Persistent 3D Model.” Paper Presented at the
Computer and Robot Vision, Fourth Canadian Conference on, IEEE, Montreal,
Canada, May 28–30.
23. Gageik, N., P. Benz, and S. Montenegro. 2015. “Obstacle Detection and Collision
Avoidance for a UAV with Complementary Low-Cost Sensors.” IEEE Access 3:
599–609.
24. Gaspar, J., N. Winters, and J. Santos-Victor. 2000. “Vision-Based Navigation and
Environmental Representations with an Omnidirectional Camera.” IEEE
Transactions on Robotics and Automation 16 (6): 890–898.
25. Gilmore, J. F., and A. J. Czuchry. 1992. “A Neural Network Model for Route
Planning Constraint Integration.” Paper Presented at the Neural Networks, IEEE
International Joint Conference on, Baltimore, MD, USA, June 7–11.
26. Gosiewski, Z., J. Ciesluk, and L. Ambroziak. 2011. “Vision-Based Obstacle
Avoidance for Unmanned Aerial Vehicles.” Paper Presented at the Image and
Signal Processing, 4th IEEE International Congress on, Shanghai, China, October
15–17.
27. Gutmann, J., M. Fukuchi, and M. Fujita. 2008. “3D Perception and Environment
Map Generation for Humanoid Robot Navigation.” The International Journal of
Robotics Research 27 (10): 1117–1134.
28. Haag, J., W. Denk, and A. Borst. 2004. “Fly Motion Vision is Based on Reichardt
Detectors regardless of the Signal-to-Noise Ratio.” Proceedings of the National
Academy of Sciences of the United States of America 101 (46): 16333–16338.
29. Han, J., and Y. Chen. 2014. “Multiple UAV Formations for Cooperative Source
Seeking and Contour Mapping of a Radiative Signal Field.” Journal of Intelligent
& Robotic Systems 74 (1–2): 323–332.
30. Harmat, A., M. Trentini, and I. Sharf. 2015. “Multi-camera Tracking and Mapping
for Unmanned Aerial Vehicles in Unstructured Environments.” Journal of
Intelligent & Robotic Systems 78 (2): 291–317.
31. Herissé, B., T. Hamel, R. Mahony, and F. Russotto. 2012. “Landing a VTOL
Unmanned Aerial Vehicle on a Moving Platform Using Optical Flow.” IEEE
Transactions on Robotics 28 (1): 77–89.
32. Herwitz, S. R., L. F. Johnson, S. E. Dunagan, R. G. Higgins, D. V. Sullivan, J.
Zheng, and B. M. Lobitz. 2004. “Imaging from an Unmanned Aerial Vehicle:
Agricultural Surveillance and Decision Support.” Computers and Electronics in
Agriculture 44 (1): 49–61.
33. Horn, B. K., and B. G. Schunck. 1981. “Determining Optical Flow.” Artificial Intelligence 17 (1–3): 185–203.
Hornung, A., K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard. 2013. “OctoMap: An Efficient Probabilistic 3D Mapping Framework Based on Octrees.” Autonomous Robots 34 (3): 189–206.
34. How, J. P., B. Bethke, A. Frank, D. Dale, and J. Vian. 2008. “Real-time Indoor
Autonomous Vehicle Test Environment.” IEEE Control Systems 28 (2): 51–64.
35. Hrabar, S. 2008. “3D Path Planning and Stereo-based Obstacle Avoidance for
Rotorcraft UAVs.” Paper Presented at the Intelligent Robots and Systems,
IEEE/RSJ International Conference on, Nice, France, September 22–26.
36. Hu, F., & Wu, G. (2020). Distributed Error Correction of EKF Algorithm in
Multi-Sensor Fusion Localization Model. IEEE Access, 8, 93211-93218.
doi:10.1109/access.2020.2995170.
37. Hui, X., Bian, J., Zhao, X., & Tan, M. (2018). Vision-based autonomous
navigation approach for unmanned aerial vehicle transmission-line inspection.
International Journal of Advanced Robotic Systems, 15(1), 172988141775282.
doi:10.1177/1729881417752821.
38. Jayaweera, H. M., & Hanoun, S. (2020). A Dynamic Artificial Potential Field
(D-APF) UAV Path Planning Technique for Following Ground Moving Targets.
IEEE Access, 8, 192760-192776. doi:10.1109/access.2020.3032929.
39. Jones, E. S. 2009. “Large Scale Visual Navigation and Community Map
Building.” PhD diss., University of California at Los Angeles.
40. Kanellakis, C., & Nikolakopoulos, G. (2017). Survey on Computer Vision for
UAVs: Current Developments and Trends. Journal of Intelligent & Robotic
Systems, 87(1), 141-168. doi:10.1007/s10846-017-0483-z.
41. Klein, G., and D. Murray. 2007. “Parallel Tracking and Mapping for Small AR
Workspaces.” Paper Presented at the Mixed and Augmented Reality, 6th IEEE and
ACM International Symposium on, Nara, Japan, November 13–16.
42. Krul, S., Pantos, C., Frangulea, M., & Valente, J. (2021). Visual SLAM for Indoor
Livestock and Farming Using a Small Drone with a Monocular Camera: A
Feasibility Study. Drones, 5(2), 41. doi:10.3390/drones5020041.
43. Kümmerle, R., G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard. 2011. “g2o: A General Framework for Graph Optimization.” Paper Presented at the
Robotics and Automation, IEEE International Conference on, Shanghai, China,
May 9–13.
44. Langelaan, J. W., and N. Roy. 2009. “Enabling New Missions for Robotic
Aircraft.” Science 326 (5960): 1642–1644.
45. Leutenegger, S., S. Lynen, M. Bosse, R. Siegwart, and P. Furgale. 2015.
“Keyframe-Based Visual–Inertial Odometry Using Nonlinear Optimization.” The
International Journal of Robotics Research 34 (3): 314–334.
46. Laina, I., C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. 2016. “Deeper Depth Prediction with Fully Convolutional Residual Networks.” Technical report.
47. Li, B., Qi, X., Yu, B., & Liu, L. (2020). Trajectory Planning for UAV Based on
Improved ACO Algorithm. IEEE Access, 8, 2995-3006.
doi:10.1109/access.2019.2962340.
48. Li, J., and N. M. Allinson. 2008. “A Comprehensive Review of Current Local
Features for Computer Vision.” Neurocomputing 71 (10): 1771–1787.
49. Li, H., and S. X. Yang. 2003. “A Behavior-based Mobile Robot with a Visual
Landmark-recognition System.” IEEE/ASME Transactions on Mechatronics 8 (3):
390–400.
50. Lindqvist, B., Mansouri, S. S., Agha-Mohammadi, A., & Nikolakopoulos, G.
(2020). Nonlinear MPC for Collision Avoidance and Control of UAVs With
Dynamic Obstacles. IEEE Robotics and Automation Letters, 5(4), 6001-6008.
doi:10.1109/lra.2020.3010730.
51. Lucas, B. D., and T. Kanade. 1981. “An Iterative Image Registration Technique
with an Application to Stereo Vision.” Paper Presented at the DARPA Image
Understanding Workshop, 7th International Joint Conference on Artificial
Intelligence, Vancouver, Canada, August 24–28.
52. Lynen, S., M. W. Achtelik, S. Weiss, M. Chli, and R. Siegwart. 2013. “A Robust
and Modular Multi-Sensor Fusion Approach Applied to Mav Navigation.” Paper
Presented at the Intelligent Robots and Systems, IEEE/RSJ International
Conference on, Tokyo, Japan, November 3–7.
53. Magree, D., and E. N. Johnson. 2014. “Combined Laser and Vision-Aided Inertial
Navigation for an Indoor Unmanned Aerial Vehicle.” Paper Presented at the
American Control Conference, IEEE, Portland, USA, June 4–6.
54. Mahon, I., S. B. Williams, O. Pizarro, and M. Johnson Roberson. 2008. “Efficient
View-based SLAM Using Visual Loop Closures.” IEEE Transactions on Robotics
24 (5): 1002–1014.
55. Maier, J., and M. Humenberger. 2013. “Movement Detection Based on Dense
Optical Flow for Unmanned Aerial Vehicles.” International Journal of Advanced
Robotic Systems 10 (2): 146.
56. Márquez-Gámez, D. 2012. “Towards Visual Navigation in Dynamic and Unknown Environment: Trajectory Learning and Following, with Detection and Tracking of Moving Objects.” PhD diss., l’Institut National des Sciences Appliquées de Toulouse.
Matsumoto, Y., K. Ikeda, M. Inaba, and H. Inoue. 1999. “Visual Navigation Using Omnidirectional View Sequence.” Paper Presented at the Intelligent Robots and Systems, IEEE/RSJ International Conference on, Kyongju, South Korea, October 17–21.
57. Maza, I., F. Caballero, J. Capitán, J. R. Martínez-de-Dios, and A. Ollero. 2011.
“Experimental Results in Multi-UAV Coordination for Disaster Management and
Civil Security Applications.” Journal of Intelligent & Robotic Systems 61 (1):
563–585.
58. Mirowski, P., R. Pascanu, F. Viola, H. Soyer, A. Ballard, A. Banino, and M. Denil.
2017. “Learning to Navigate in Complex Environments.” The 5th International
Conference on Learning Representations, Toulon, France, April 24–26.
59. Moravec, H. P. 1983. “The Stanford Cart and the CMU Rover.” Proceedings of the
IEEE 71 (7): 872–884.
60. Moreno-Armendáriz, M. A., and H. Calvo. 2014. “Visual SLAM and Obstacle
Avoidance in Real-Time for Mobile Robots Navigation.” Paper Presented at the
Mechatronics, Electronics and Automotive Engineering (ICMEAE), IEEE
International Conference on, Cuernavaca, Mexico, November 18–21.
61. Newcombe, R. A., S. J. Lovegrove, and A. J. Davison. 2011. “DTAM: Dense
Tracking and Mapping in Real-time.” Paper Presented at the Computer Vision
(ICCV), IEEE International Conference on, Washington, DC, USA, November
6–13.
62. Nourani-Vatani, N., P. V. K. Borges, J. M. Roberts, and M. V. Srinivasan.
2014. “On the Use of Optical Flow for Scene Change Detection and Description.”
Journal of Intelligent & Robotic Systems 74 (3–4): 817.
63. Padhy, R.M., Choudhry, S.K., Sa, P.K., & Bakshi, S. (2019). Obstacle Avoidance
for Unmanned Aerial Vehicles. IEEE Consumer Electronics Magazine, 19,
2162-2248. doi: 10.1109/MCE.2019.2892280.
64. Parunak, H. V., M. Purcell, and R. O’Connell. 2002. “Digital Pheromones for
Autonomous Coordination of Swarming UAVs.” Paper Presented at the 1st UAV
Conference, Portsmouth, Virginia, May 20–23.
65. Pellazar, M. B. 1998. “Vehicle Route Planning with Constraints Using Genetic
Algorithms”. Paper Presented at the Aerospace and Electronics Conference,
NAECON, 1998, IEEE National, Dayton, USA, July 17.
66. Mur-Artal, R., J. M. M. Montiel, and J. D. Tardós. 2015. “ORB-SLAM: A Versatile and Accurate Monocular SLAM System.” IEEE Transactions on Robotics 31 (5): 1147–1163.
67. Ranftl, R., V. Vineet, Q. Chen, and V. Koltun. 2016. “Dense Monocular Depth
Estimation in Complex Dynamic Scenes.” Paper Presented at the Computer
Vision and Pattern Recognition, IEEE Conference on, Las Vegas, USA, June
27–30.
68. Roberge, V., M. Tarbouchi, and G. Labonté. 2013. “Comparison of Parallel
Genetic Algorithm and Particle Swarm Optimization for Real-time UAV Path
Planning.” IEEE Transactions on Industrial Informatics 9 (1): 132–141.
69. Rogers, B., and M. Graham. 1979. “Motion Parallax as an Independent Cue for
Depth Perception.” Perception 8 (2): 125–134.
70. Rouse, D. M. 1989. “Route Planning Using Pattern Classification and Search
Techniques.” Aerospace and Electronics Conference, IEEE National, Dayton,
USA, May 22–26.
71. Ruffier, F., S. Viollet, S. Amic, and N. Franceschini. 2003. “Bio-inspired Optical
Flow Circuits for the Visual Guidance of Micro Air Vehicles.” Paper Presented at
the Circuits and Systems, IEEE International Symposium on, Bangkok, Thailand,
May 25–28.
72. Sanchez-Lopez, J. L., Wang, M., Olivares-Mendez, M. A., Molina, M., & Voos,
H. (2018). A Real-Time 3D Path Planning Solution for Collision-Free Navigation
of Multirotor Aerial Robots in Dynamic Environments. Journal of Intelligent &
Robotic Systems, 93(1-2), 33-53. doi:10.1007/s10846-018-0809-5.
73. Santos-Victor, J., G. Sandini, F. Curotto, and S. Garibaldi. 1993. “Divergent
Stereo for Robot Navigation: Learning from Bees.” Paper Presented at the
Computer Vision and Pattern Recognition, IEEE Computer Society Conference
on, New York, USA, June 15–17.
74. Seitz, S. M., B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. 2006. “A
Comparison and Evaluation of Multi-view Stereo Reconstruction Algorithms.”
Paper Presented at the Computer Vision and Pattern Recognition, IEEE Computer
Society Conference on, New York, USA, June 17-22.
75. Shao, S., He, C., Zhao, Y., & Wu, X. (2021). Efficient Trajectory Planning for
UAVs Using Hierarchical Optimization. IEEE Access, 9, 60668-60681.
doi:10.1109/access.2021.3073420.
76. Shen, S., Y. Mulgaonkar, N. Michael, and V. Kumar. 2014. “Multi-Sensor Fusion
for Robust Autonomous Flight in Indoor and Outdoor Environments with a
Rotorcraft MAV.” Paper Presented at the Robotics and Automation (ICRA), IEEE
International Conference on, Hong Kong, China, May 31–June 7.
77. Silveira, G., E. Malis, and P. Rives. 2008. “An Efficient Direct Approach to Visual
SLAM.” IEEE Transactions on Robotics 24 (5): 969–979.
78. Srinivasan, M. V., and R. L. Gregory. 1992. “How Bees Exploit Optic Flow:
Behavioural Experiments and Neural Models [and Discussion].” Philosophical
Transactions of the Royal Society of London B: Biological Sciences 337 (1281):
253–259.
79. Stentz, A. 1994. “Optimal and Efficient Path Planning for Partially-Known
Environments.” Paper Presented at the IEEE International Conference on Robotics
and Automation, San Diego, CA, USA, May 8–13.
80. Strasdat, H., J. M. Montiel, and A. J. Davison. 2012. “Visual SLAM: Why Filter?”
Image and Vision Computing 30 (2):65–77.
81. Strohmeier, M., M. Schäfer, V. Lenders, and I. Martinovic. 2014. “Realities and
Challenges of Nextgen Air Traffic Management: The Case of ADS-B.” IEEE
Communications Magazine 52 (5): 111–118.
82. Sugihara, K., and I. Suzuki. 1996. “Distributed Algorithms for Formation of
Geometric Patterns with Many Mobile Robots.” Journal of Robotic Systems 13
(3): 127–139.
83. Szczerba, R. J., P. Galkowski, I. S. Glicktein, and N. Ternullo. 2000. “Robust
Algorithm for Real-time Route Planning.” IEEE Transactions on Aerospace and
Electronic Systems 36 (3): 869–878.
84. Szeliski, R. 2010. Computer Vision: Algorithms and Applications. London:
Springer Science & Business Media.
85. Szenher, M. D. 2008. “Visual Homing in Dynamic Indoor Environments.” Ph.D.
diss. The University of Edinburgh.
86. Trujillo, J., Munguia, R., Urzua, S., Guerra, E., & Grau, A. (2020). Monocular
Visual SLAM Based on a Cooperative UAV–Target System. Sensors, 20(12),
3531. doi:10.3390/s20123531.
87. Tuytelaars, T., and K. Mikolajczyk. 2008. “Local Invariant Feature Detectors: A
Survey.” Foundations and Trends in Computer Graphics and Vision 3 (3):
177–280.
88. Vachtsevanos, G., W. Kim, S. Al-Hasan, F. Rufus, M. Simon, D. Shrage, and J.
Prasad. 1997. “Autonomous Vehicles: From Flight Control to Mission Planning
Using Fuzzy Logic Techniques.” Paper Presented at the Digital Signal Processing,
13th International Conference on, IEEE, Santorini, Greece, July 2–4.
89. Valgaerts, L., A. Bruhn, M. Mainberger, and J. Weickert. 2012. “Dense versus
Sparse Approaches for Estimating the Fundamental Matrix.” International Journal
of Computer Vision 96 (2): 212–234.
90. Wang, C., Ma, H., Chen, W., Liu, L., & Meng, M. Q. (2020). Efficient
Autonomous Exploration With Incrementally Built Topological Map in 3-D
Environments. IEEE Transactions on Instrumentation and Measurement, 69(12),
9853-9865. doi:10.1109/tim.2020.3001816.
91. Yathirajam, B., Vaitheeswaran, S. M., & Ananda, C. M. (2020). Obstacle Avoidance for Unmanned Air Vehicles Using Monocular-SLAM with Chain-Based Path Planning in GPS Denied Environments. Journal of Aerospace System Engineering, 14(2), 1-11.
92. Yershova, A., L. Jaillet, T. Simeon, and S. M. Lavalle. 2005. “Dynamic-Domain
RRTs: Efficient Exploration by Controlling the Sampling Domain.” Paper
Presented at the IEEE International Conference on Robotics and Automation,
Barcelona, Spain, April 18–22.
93. Zhang, C., Hu, C., Feng, J., Liu, Z., Zhou, Y., & Zhang, Z. (2019). A
Self-Heuristic Ant-Based Method for Path Planning of Unmanned Aerial Vehicle
in Complex 3-D Space With Dense U-Type Obstacles. IEEE Access, 7,
150775-150791. doi:10.1109/access.2019.2946448.
94. Zhang, Q., J. Ma, and Q. Liu. 2012. “Path Planning Based Quadtree
Representation for Mobile Robot Using Hybrid-Simulated Annealing and Ant
Colony Optimization Algorithm.” Paper Presented at The Intelligent Control and
Automation (WCICA), 10th World Congress on, IEEE, Beijing, China, July 6–8.