
AIAA SciTech 2021 Forum, 11–15 & 19–21 January 2021, Virtual Event. DOI: 10.2514/6.2021-1751

Acceleration-based Quadrotor Guidance Under Time Delays Using Deep Reinforcement Learning

Kirk Hovell∗ and Steve Ulrich†
Carleton University, Ottawa, Ontario, K1S 5B6, Canada

Murat Bronz‡
ENAC, Université de Toulouse, Toulouse, 31055, France

∗PhD Candidate, Department of Mechanical and Aerospace Engineering, 1125 Colonel By Drive. Student Member AIAA.
†Associate Professor, Department of Mechanical and Aerospace Engineering, 1125 Colonel By Drive. Senior Member AIAA.
‡Assistant Professor, UAV/Optimization Group, ENAC Lab. Member AIAA.

Copyright © 2021 by Kirk Hovell, Steve Ulrich, and Murat Bronz. Published by the American Institute of Aeronautics and Astronautics, Inc., with permission.

This paper investigates the use of deep reinforcement learning to act as closed-loop guidance for quadrotors and the ability of such a system to be trained entirely in simulation before being transferred for use on a real quadrotor. It improves upon previous work where velocity-based deep reinforcement learning was used to guide the motion of spacecraft. Here, an acceleration-based closed-loop deep reinforcement learning guidance system is developed and compared to previous work. In addition, state augmentation is included to account for the dynamics delays present. Simulated results show that acceleration-based deep reinforcement learning closed-loop guidance has significant performance benefits compared to the velocity-based guidance of previous work, namely: a simpler reward function, less overshoot, and lower steady-state error. To evaluate the ability of this system to be used on a real quadrotor, the trained system is deployed to the Paparazzi aircraft simulation software and is implemented on real flight hardware at École Nationale de l'Aviation Civile for an experimental comparison. Experimental results confirm the simulated results: acceleration-based deep guidance outperforms velocity-based deep guidance and should therefore be used in future work.

Nomenclature

A = action space
a = action
a𝐶 = acceleration
𝛼 = policy network learning rate
𝛽 = value network learning rate
𝐵 = number of value distribution bins
𝐷 = dynamics delay, number of timesteps
d = distance between the chaser and the target, m
𝐸 = episode number
E = expectation
𝜖 = weight-smoothing parameter
𝑓 = follower, as a superscript
𝐹 = applied force, N
f = reward field
𝑔 = activation function
𝛾 = discount factor for future rewards
𝐽 = total expected rewards
K𝑝 = proportional gain matrix
K𝐼 = integral gain matrix
K = reward weighting
𝐾 = number of actors used
𝐿 = loss function
𝑀 = mini-batch size
𝑚 = follower mass, kg
𝑁 = N-step return length
N = normal distribution
𝜂, 𝑐 1 = constants
o = observation
O = observation space
𝜋𝜃 = policy neural network with parameters 𝜃
φ′ = exponential moving average of the true value network weights 𝜙
𝜓 = attitude, rad
𝑅 = replay buffer size
𝑟 = reward
𝜎 = exploration noise standard deviation
𝑡 = timestep number as a subscript, target as a superscript
θ′ = exponential moving average of the true policy weights 𝜃
𝜏 = applied torque, Nm
u = control effort
v = velocity, m/s
x = state
X = state space
𝑌 = target value distribution
𝑍𝜙 = value neural network with parameters 𝜙

I. Introduction
Quadrotors have become useful in recent decades due to advances in microcontroller and battery technology, with common applications including research, search and rescue, surveillance, photography, package delivery, and sport racing.
Many quadrotor control theories have been presented [1–3]. Some have applied backstepping control [4], sliding mode
control [5], model predictive control [6], dynamic inversion techniques [7–10], or used cameras for feedback [11, 12].
A number of guidance approaches have also been proposed to guide the motion of a quadrotor [13, 14], including
trajectory calculations for quadrotor swarms [15] and dynamically feasible trajectory calculations [16]. The traditional
guidance and control techniques presented are hand-crafted to solve a particular problem and their design may require
significant engineering effort. As quadrotors are applied to more difficult tasks, the effort required to design these
guidance algorithms and controllers may become infeasible. For example, an autonomous search-and-rescue task where
an unknown building is explored and GPS is unavailable is a difficult problem to hand-craft a response to. Motivated by
difficult guidance tasks, this paper uses deep reinforcement learning to learn, rather than hand-craft, a guidance strategy
for quadrotors.
Deep reinforcement learning consists of an agent that chooses actions to explore an environment. The environment
returns rewards to the agent corresponding to the quality of the action taken from a given observation. Through trial
and error, the agent attempts to learn a policy that creates an optimal mapping from observations to actions in order to
maximize the rewards received. In using this approach, complex behaviour can emerge from a relatively simple reward
structure. For example, deep reinforcement learning has learned how to play video games at a superhuman level, using
only the game score as the reward signal [17]. The policy input observation is the screen pixels and the action output is
the button presses on the controller. Neural networks have become a popular choice for representing the policy, as they
have been shown to be universal function approximators [18]. Deep reinforcement learning has seen many successes in
recent years, from the successful completion of many Atari 2600 games [17] to mastering the game of Go [19].
Training deep reinforcement learning policies on physical robots is time consuming, expensive, and leads to
significant wear-and-tear on the robot because, even with state-of-the-art learning algorithms, the task may have to
be attempted hundreds or thousands of times before learning succeeds. Alternatively, training can be completed
in simulation and the resulting policy can be transferred to a real robot. However, this approach often encounters
problems due to the simulation-to-reality gap—policies trained entirely in simulation tend to overfit the simulated
dynamics, which can never perfectly model reality [20–22]. Algorithms that focus on learning speed are being developed
such that learning can be performed entirely on a robot in a short period of time [23, 24]. However, some high-risk
robotic domains cannot afford any on-board learning due to the possibility of damaging the robot. For this reason,

the development of algorithms that can be trained entirely in simulation and deployed to a robot is an active research
area. Domain randomization is one solution to this problem [25–31], where environmental parameters are randomized
for each training simulation to force the policy to become robust to a variety of environments. Other efforts combine
domain randomization with neural representations of intermediate states [32], while others train in simulation and
fine-tune the policy once deployed to experiment [31–33].
Reinforcement learning has been used in quadrotor applications. In 2005, model-based reinforcement learning was
used to generate a dynamics model of a quadrotor that was used to develop an optimal controller [34]. Reinforcement
learning has also been used as an inner-loop controller for a simulated quadrotor, where it was shown that reinforcement
learning can outperform classical control techniques [35]. Others used quadrotors to carry payloads using reinforcement
learning [36]. A quadrotor stabilization task [37] summed the policy output with a conventional PD controller to guide
the learning process and help with the simulation-to-reality transfer. A simulated fleet of wildfire surveillance aircraft
used deep reinforcement learning to command the flight-path of the aircraft [38]. Domain randomization has enabled
real quadrotor flight after being trained entirely in simulation [30], and domain randomization has since also been used
for drone racing [29].
In terms of quadrotor guidance, reinforcement learning was used to guide a quadrotor in a grid-world [39] and,
along with model predictive control, it was used to guide a quadrotor through an indoor maze [40]. The approach uses a
fixed grid of nodes, each with an associated cost. When each node is visited, the cost is updated. Instead of generating a
grid that uses reinforcement learning to evaluate the desirability of each location, this paper, in contrast, uses deep
reinforcement learning as closed-loop guidance to generate desired acceleration signals in real time that are fed to a
controller. Restricting reinforcement learning to handle guidance-only was suggested by Harris et al. [41]. They argued
that reinforcement learning should not be tasked with learning guidance and control since control theory already has
great success. Hovell and Ulrich’s previous work used deep reinforcement learning for closed-loop guidance and used a
conventional controller to track the guidance signal [42]. The closed-loop guidance was trained entirely in simulation
and transferred to a real spacecraft platform experiment. The controller was able to handle discrepancies between the
simulated training environment and the experimental facility. The technique was named deep guidance, and was shown
to be a possible bridge from simulation to reality for reinforcement learning.
This paper continues the study of the deep guidance technique [42]. It improves the technique, by proposing a novel
implementation where acceleration guidance signals are used, and compares it to the previous implementation where
velocity guidance signals are used. In addition, this paper applies the deep guidance technique to a new quadrotor
domain, where nonlinear dynamics are present and system delays must be accounted for. In light of the above, the novel
contributions of this work are:
1) The application of deep reinforcement learning to real quadrotors using a modified closed-loop deep guidance
technique.
2) Improving the deep guidance technique itself by introducing closed-loop acceleration signals and comparing it to
previous work.
This paper is organized as follows: Sec. II presents background on deep reinforcement learning and the specific
learning algorithm used in this paper, Sec. III describes the quadrotor scenario considered and how the improved deep
guidance technique is applied to it, Sec. IV presents numerical simulations demonstrating the effectiveness of the
technique, Sec. V presents experimental results, and Sec. VI concludes this paper.

II. Deep Reinforcement Learning


Deep reinforcement learning attempts to discover a policy, 𝜋 𝜃 , approximated using a neural network that has
trainable parameters 𝜃, in order to select appropriate actions a ∈ A given the current observation, o ∈ O, that maximize
the scalar rewards received, 𝑟, over time. The observation may be the full underlying state x ∈ X, but in general it contains only partial information from which the state must be inferred. The action is obtained from the policy, at timestep 𝑡, through

a𝑡 = 𝜋 𝜃 (o𝑡 ) (1)

This work uses the Distributed Distributional Deep Deterministic Policy Gradient algorithm [43] (D4PG) to train
the policy. Although there are many algorithms available, the D4PG algorithm was selected because it operates in
continuous state and action spaces, it can be trained distributed across many CPUs, it has a deterministic output, and
because it achieves state-of-the-art performance. The algorithm is explained in brief in the following subsection.

A. D4PG Algorithm
The D4PG [43] algorithm has an actor-critic architecture. This means there is a policy neural network, which is used both during training and once training is complete, and a value neural network, which is used during training only. The policy network accepts observations and calculates actions, and has trainable parameters 𝜃. The value network accepts observations and actions and predicts a probability distribution of the total expected rewards from this observation-action pair, 𝑍 𝜙 (o, a), and has trainable parameters 𝜙. Figure 1 shows the policy and value neural networks.

(Fig. 1 Policy and value neural networks used in the D4PG algorithm: (a) the policy neural network maps o to π_θ(o); (b) the value neural network maps (o, a) to Z_φ(o, a).)

The total reward expected from a given observation is

$J(\theta) = \mathbb{E}\left[ Z_\phi(\mathbf{o}, \pi_\theta(\mathbf{o})) \right]$   (2)
where 𝐽 (𝜃) is the expected rewards from the given state as a function of the policy weights 𝜃 and E denotes the
expectation. Reinforcement learning attempts to systematically adjust 𝜃 in order to discover a policy 𝜋 𝜃 that maximizes
𝐽 (𝜃).
To train the neural networks in order to develop an appropriate policy, data are needed. To generate this data, a
simulation that is representative of the task to be learned must be developed. At each timestep, the policy network is used
to choose an action to take as a function of the observation using Eq. (1). The action is executed on the environment,
which returns the next observation and a reward associated with how well the task is being completed. The observation,
action, reward, and next observation data are continuously generated by repeatedly running 𝐾 simulations in parallel.
The generated data are placed in a replay buffer that holds the most recent 𝑅 timesteps. The learning algorithm randomly
draws batches of size 𝑀 of these data to train the policy and value networks using backpropagation and gradient descent.
Over time, the performance of the policy is expected to increase. To train the value network, gradient-descent is used to
minimize the cross-entropy loss function given by

$L(\phi) = \mathbb{E}\left[ -Y \log(Z_\phi(\mathbf{o}, \mathbf{a})) \right]$   (3)

where 𝑌 is the target value distribution [44]. The target value distribution is treated as a better estimate of the true value
distribution because it is calculated from the simulated data and its own predictions. It is calculated using

$Y_t = \sum_{n=0}^{N-1} \gamma^n r_{t+n} + \gamma^N Z_{\phi'}\left(\mathbf{o}_{t+N}, \pi_{\theta'}(\mathbf{o}_{t+N})\right)$   (4)

where r_{t+n} is the reward received at timestep t + n, γ is the discount factor for future rewards (future rewards are weighted lower than current rewards), and Z_{φ′}(o_{t+N}, π_{θ′}(o_{t+N})) is the value distribution itself evaluated N timesteps into the future. Since we do not have all the data from the entire episode, we must approximate the value distribution for the remainder of the episode after N steps of data are used [45]. Also, θ′ and φ′ are exponential moving averages of the policy network weights θ and value network weights φ, calculated by

$\theta' = (1 - \epsilon)\theta' + \epsilon\,\theta$   (5)

$\phi' = (1 - \epsilon)\phi' + \epsilon\,\phi$   (6)

with ε ≪ 1. Using copies of the policy and value networks with smoothed weights when calculating the value distribution estimate for the remainder of the episode has been shown to have a stabilizing effect on the learning [17].
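As a concrete illustration of Eqs. (5) and (6), a minimal Python sketch of the smoothed-weight update is given below. It is not the authors' implementation; the per-layer weight lists and the function name are assumptions made for illustration only.

```python
# Minimal sketch of the smoothed ("target") network update of Eqs. (5)-(6).
# `smoothed` and `current` are assumed to be lists of NumPy arrays, one per layer.
def update_smoothed_weights(smoothed, current, eps=0.001):
    """Return theta' <- (1 - eps) * theta' + eps * theta, applied layer by layer."""
    return [(1.0 - eps) * w_s + eps * w for w_s, w in zip(smoothed, current)]
```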

To train the value network one iteration, Eq. (4) is used to recursively calculate an updated estimate for the true value
network distribution. The loss function is then minimized using Eq. (3) through adjusting 𝜙 using learning rate 𝛽. The
value network slowly approaches the value distributions dictated by the simulated data. The value network parameters
are then smoothed and used in Eq. (4). This is a recursive process that Sutton and Barto [46] described as: “we learn a
guess from a guess.”
Once the value network is trained one iteration, the policy network is trained one iteration. The intention is to adjust
the policy parameters 𝜃 such that the expected rewards for a given observation, 𝐽 (𝜃), increase. Since neural networks
are differentiable, the chain rule can be used to compute

$\frac{\partial J(\theta)}{\partial \theta} = \frac{\partial J(\theta)}{\partial \mathbf{a}} \frac{\partial \mathbf{a}}{\partial \theta}$   (7)

where $\partial J(\theta) / \partial \mathbf{a}$ is computed from the value network and $\partial \mathbf{a} / \partial \theta$ is computed from the policy network. In a sense, we differentiate through the value network into the policy network. More formally,

$\nabla_\theta J(\theta) = \mathbb{E}\left[ \nabla_\theta \pi_\theta(\mathbf{o}) \, \mathbb{E}\left[ \nabla_{\mathbf{a}} Z_\phi(\mathbf{o}, \mathbf{a}) \right] \big|_{\mathbf{a} = \pi_\theta(\mathbf{o})} \right]$   (8)

describes how the policy parameters, 𝜃, should be updated to increase the expected rewards obtained by the policy when
used. Finally, for a learning rate 𝛼, the parameters 𝜃 are updated via

$\theta = \theta + \nabla_\theta J(\theta)\,\alpha$   (9)
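The sketch below illustrates how the update of Eqs. (7)-(9) could look in TensorFlow; it is not the authors' code. The `policy` and `value` models, the `bin_values` support of the value distribution, and the collapse of the distributional critic output to its expected value before differentiating are all assumptions made for illustration.

```python
import tensorflow as tf

def policy_update(policy, value, policy_optimizer, observations, bin_values):
    """One deterministic policy-gradient step: ascend J(theta) = E[Z_phi(o, pi_theta(o))]."""
    with tf.GradientTape() as tape:
        actions = policy(observations)                       # a = pi_theta(o)
        probabilities = value([observations, actions])       # distribution over B bins
        expected_value = tf.reduce_sum(probabilities * bin_values, axis=-1)
        loss = -tf.reduce_mean(expected_value)               # minimizing -J maximizes J
    gradients = tape.gradient(loss, policy.trainable_variables)   # chain rule of Eq. (7)
    policy_optimizer.apply_gradients(zip(gradients, policy.trainable_variables))
```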

To generate training data, 𝐾 agents run independent episodes using the most up-to-date version of the policy.
Exploration noise is added to the suggested action to encourage exploration of the action space. Trial-and-error of this
sort is the mechanism through which positive rewards are discovered. The training process then reinforces the actions
that led to those positive rewards. The action taken at timestep 𝑡 is

a𝑡 = 𝜋 𝜃 (o𝑡 ) + N (0, 𝜎) (10)

here, N (0, 𝜎) is a normal distribution with zero mean and 𝜎 as the standard deviation of the exploration noise. The
D4PG algorithm is summarized in Algorithm 1.

III. Problem Statement


This section discusses the quadrotor pose tracking environment within which the deep guidance system will be
trained. Though the deep guidance technique may allow for complex behaviours to be learned, a relatively simple task
is attempted here in order to determine whether a velocity- or acceleration-based implementation of the deep guidance
technique is most appropriate. Conclusions from this work will be used to inform attempts at solving difficult guidance
problems in future work.
The task presented here consists of two quadrotors: a target and a follower. The quadrotors start at some initial
conditions, and the follower must move itself to a location three metres offset from the target. The target and follower
positions and velocities are
$\mathbf{x}_{target,t} = \begin{bmatrix} x^t_t & y^t_t & \psi^t_t \end{bmatrix}$   (11)

$\mathbf{v}_{target,t} = \begin{bmatrix} v^t_{x,t} & v^t_{y,t} & \omega^t_t \end{bmatrix}$   (12)

$\mathbf{x}_{follower,t} = \begin{bmatrix} x^f_t & y^f_t \end{bmatrix}$   (13)

$\mathbf{v}_{follower,t} = \begin{bmatrix} v^f_{x,t} & v^f_{y,t} \end{bmatrix}$   (14)

where 𝑥 and 𝑦 represent the positions in Cartesian space, 𝜓 represents the yaw about the 𝑧 axis, 𝜔 represents the yaw
rate, subscript 𝑡 refers to the timestep number, superscript 𝑡 corresponds to the target, superscript 𝑓 corresponds to the
follower, and 𝑣 is the velocity.

Learner
    Initialize policy network weights θ and value network weights φ randomly
    Initialize smoothed policy and value network weights θ′ = θ and φ′ = φ
    Launch K actors and copy policy weights θ to each actor
    repeat
        Sample a batch of M data points from the replay buffer
        Compute the target value distribution used to train the value network:
            Y_t = Σ_{n=0}^{N−1} γ^n r_{t+n} + γ^N Z_{φ′}(o_{t+N}, π_{θ′}(o_{t+N}))
        Update value network weights φ by minimizing the loss function L(φ) = E[−Y log(Z_φ(o, a))] using learning rate β
        Compute policy gradients using ∇_θ J(θ) = E[∇_θ π_θ(o) E[∇_a Z_φ(o, a)]|_{a=π_θ(o)}] and update policy weights via θ = θ + ∇_θ J(θ) α
        Update the smoothed network weights slowly in the direction of the main policy and value network weights:
            θ′ = (1 − ε)θ′ + ε θ and φ′ = (1 − ε)φ′ + ε φ, for ε ≪ 1
    until acceptable performance

Actor
    repeat
        From the given observation, use the policy to calculate an action and add exploration noise with standard deviation σ: a_t = π_θ(o_t) + N(0, σ)
        Step the environment forward one timestep using action a_t
        Record (o_t, a_t, r_t = Σ_{n=0}^{N−1} γ^n r_{t+n}, o_{t+N}) and store it in the replay buffer
        At the end of each episode, obtain the most up-to-date version of the policy π_θ from the Learner and reset the environment
    until acceptable performance

Algorithm 1: D4PG [43]
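For illustration, a minimal Python sketch of the actor loop in Algorithm 1 is given below. The `env`, `policy`, and `replay_buffer` objects and their interfaces are assumptions rather than the authors' code; the N-step return is accumulated over a sliding window before being stored.

```python
import collections
import numpy as np

def run_actor_episode(env, policy, replay_buffer, sigma, action_dim, N=2, gamma=0.99):
    """Collect one episode of N-step transitions with Gaussian exploration noise (Eq. 10)."""
    obs = env.reset()
    window = collections.deque(maxlen=N)          # the last N (o, a, r) tuples
    done = False
    while not done:
        action = policy(obs) + np.random.normal(0.0, sigma, size=action_dim)
        next_obs, reward, done = env.step(action)
        window.append((obs, action, reward))
        if len(window) == N:
            first_obs, first_action, _ = window[0]
            n_step_reward = sum(gamma**n * r for n, (_, _, r) in enumerate(window))
            replay_buffer.append((first_obs, first_action, n_step_reward, next_obs))
        obs = next_obs
```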

(Fig. 2 Velocity-based deep guidance strategy [42]. Block diagram: o_t → Deep Guidance → v_t → Controller → u_t → Dynamics → o_{t+1}, with the state x_t fed back through a Delay block.)

(Fig. 3 Acceleration-based deep guidance strategy. Block diagram: o_t → Deep Guidance → a_{C,t} → Controller → u_t → Dynamics → o_{t+1}, with the state x_t fed back through a Delay block.)

A. Deep Guidance
The deep guidance technique, introduced by Hovell and Ulrich [42], allows for reinforcement learning to be used on
real robot platforms, despite being trained entirely in simulation, by limiting deep reinforcement learning to only learn
the guidance portion of the guidance, navigation, and control process. The learned closed-loop guidance system passes
signals, calculated in real-time, to a conventional controller to track regardless of modelling errors, and is presented as a
possible solution to the simulation-to-reality problem. Figure 2 shows the previous work’s closed-loop deep guidance
approach for guiding velocities. The observation is o𝑡 , the state is x𝑡 , the velocity guidance signal is v𝑡 , the control
effort is u𝑡 , and the next observation is o𝑡+1 .
In this work, a different strategy is proposed: instead of the deep guidance block issuing desired velocities, it issues
desired accelerations, as shown in Fig. 3, where a𝐶𝑡 is the guided acceleration. This modification to the deep guidance
technique is expected to improve performance while simplifying the reward function, since training will now occur
within a domain of the same order to the dynamics where the policy will be used once training is complete. To ensure
the deep guidance policy does not overfit a particular controller during training, an ideal controller is assumed. This
allows for the controller and dynamics blocks to be combined into a single kinematics block, as shown in Fig. 4. Any
controller may be used along with the deep guidance system to control a real robot. As will be shown in this paper, the
controller and dynamics used to evaluate the deep guidance performance in simulation can be significantly different
than the controller and dynamics used in an experiment. The velocity-based and acceleration-based deep guidance
strategies are compared in this paper.

B. Kinematics Model
While the deep guidance policy is being trained, a kinematic model approximates the dynamics model and the ideal
controller, as shown in Fig. 4. When the velocity-based deep guidance model is used, the policy input observation is:
$\mathbf{o}_t = \begin{bmatrix} \mathbf{x}_{target,t} & \mathbf{x}_{follower,t} \end{bmatrix}^{\mathrm{T}}$   (15)

(Fig. 4 Deep guidance with an ideal controller for training purposes in simulation. Block diagram: the Ideal Controller and Dynamics blocks on the left are replaced by a single Kinematics block on the right, with the delayed state fed back into the observation.)
When the acceleration-based deep guidance model is used, the policy input observation is:
$\mathbf{o}_t = \begin{bmatrix} \mathbf{x}_{target,t} & \mathbf{x}_{follower,t} & \mathbf{v}_{target,t} & \mathbf{v}_{follower,t} \end{bmatrix}^{\mathrm{T}}$   (16)

During training, with an ideal controller assumed, the guided velocity signal v𝑡 is directly integrated once to obtain the
next state. The acceleration signal a𝐶𝑡 is integrated twice to obtain the next state. All integration is performed using the
Scipy [47] Adams/Backward differentiation formula methods in Python. When delays are included, as discussed in the
following subsection, the guided velocity or acceleration is stored and an action 𝐷 timesteps old is used instead.
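A minimal sketch of the acceleration-based training kinematics is shown below: the commanded acceleration is integrated twice over one 0.2 s timestep to advance the follower state under the ideal-controller assumption. The function and variable names are illustrative, and the solve_ivp call with a BDF method simply stands in for the Scipy Adams/BDF integration mentioned above.

```python
import numpy as np
from scipy.integrate import solve_ivp

def kinematics_step(position, velocity, commanded_accel, dt=0.2):
    """Advance the follower [x, y] position and velocity by one timestep (ideal controller)."""
    def rates(t, state):
        # d(position)/dt = velocity, d(velocity)/dt = commanded acceleration
        return np.concatenate([state[2:], commanded_accel])
    initial_state = np.concatenate([position, velocity])
    solution = solve_ivp(rates, (0.0, dt), initial_state, method="BDF")
    final_state = solution.y[:, -1]
    return final_state[:2], final_state[2:]
```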

C. Time Delay Consideration


This work includes the possibility for the closed-loop deep guidance model to encounter system delays, which may
occur through actuation delays, signal delays, or measurement delays. When the action taken is not immediately realized
on the quadrotor, the input observation to the system, Eqs. (15) or (16), no longer contain all the information necessary
for the policy to decide on an appropriate action. One response to this problem is to augment the observation with past
actions equal to the length of the delay [48]. This allows the agent to become aware of previous actions it has taken
when making decisions on what action to take next. When augmentation is used, the observation for velocity-based
guidance becomes
$\mathbf{o}_t = \begin{bmatrix} \mathbf{x}_{target} & \mathbf{x}_{follower} & \mathbf{a}_{t-1} & \mathbf{a}_{t-2} & \ldots & \mathbf{a}_{t-D} \end{bmatrix}^{\mathrm{T}}$   (17)

for a delay of D timesteps. When the acceleration-based deep guidance model is used, the augmented observation is:

$\mathbf{o}_t = \begin{bmatrix} \mathbf{x}_{target} & \mathbf{x}_{follower} & \mathbf{v}_{target} & \mathbf{v}_{follower} & \mathbf{a}_{t-1} & \mathbf{a}_{t-2} & \ldots & \mathbf{a}_{t-D} \end{bmatrix}^{\mathrm{T}}$   (18)

Although state augmentation allows for optimal policies to be found despite system delays, if the delay is too long the
observation may grow too large for learning to be tractable.
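The sketch below shows one way the state augmentation of Eqs. (17) and (18) and the D-timestep action delay could be implemented; the class and method names are hypothetical and not taken from the authors' code.

```python
import collections
import numpy as np

class DelayedActionBuffer:
    """Holds the last D actions: augments the observation and supplies the delayed action."""

    def __init__(self, delay_steps, action_dim):
        self.actions = collections.deque(
            [np.zeros(action_dim) for _ in range(delay_steps)], maxlen=delay_steps)

    def augment(self, observation):
        # Build [observation, a_{t-1}, a_{t-2}, ..., a_{t-D}] as in Eqs. (17)-(18)
        return np.concatenate([observation, *self.actions])

    def push(self, action):
        # Newest action first, so index 0 is a_{t-1} and index -1 is a_{t-D}
        self.actions.appendleft(np.asarray(action, dtype=float))

    def delayed_action(self):
        # The action actually applied to the kinematics is D timesteps old
        return self.actions[-1]
```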

D. Dynamics Model
The policy is trained within a kinematics environment to remove any overfitting to simulated dynamics or a specific
controller. To measure the learning performance, the trained policy is periodically evaluated on an environment with full
dynamics and a controller, as shown in Figs. 2 and 3. In other words, the trained deep guidance policy is “deployed” to
another simulation for evaluation in much the same way that it is deployed to an experiment in Sec. V. When a velocity
guidance signal is issued, it is tracked using a proportional controller of the form

$\mathbf{u}_t = \mathbf{K}_p \left( \mathbf{v}_t - \mathbf{v}_{follower,t} \right)$   (19)
where K 𝑝 = diag{0.1, 0.1}. The K 𝑝 values were chosen by trial-and-error until satisfactory performance was achieved.
When the acceleration-based deep guidance configuration is used, an integral controller is used of the form

$\mathbf{u}_t = \mathbf{u}_{t-1} + \mathbf{K}_I \left( \mathbf{a}_{C,t} - \dot{\mathbf{v}}_{follower,t} \right)$   (20)
with K𝐼 = diag{0.5, 0.5}, also chosen by trial-and-error. Note that only planar translational motion is commanded by
the acceleration-based deep guidance due to coupling between the 𝑥, 𝑦, and 𝑧 axes, as discussed in Sec. V.C.
Regardless of which guidance method is used, the associated controller outputs a control effort that is executed on
the same dynamics. A planar double-integrator dynamics model is used to simulate the follower motion. The simplicity
of this model compared to the actual quadrotor dynamics will further show how this technique can handle dynamic
differences between simulation and reality. The accelerations due to the control forces are

$\ddot{x} = F_x / m$   (21)

$\ddot{y} = F_y / m$   (22)
where 𝐹𝑥 and 𝐹𝑦 are the forces applied in the 𝑋 and 𝑌 directions, respectively, 𝑚 is the follower mass, 𝑥¥ and 𝑦¥ are the
accelerations in 𝑋 and 𝑌 , respectively. The accelerations are numerically integrated twice to obtain the position and
velocity at the following timestep.
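A minimal sketch of this evaluation environment is given below: the proportional controller of Eq. (19), the integral controller of Eq. (20), and a simple discrete step of the double-integrator dynamics of Eqs. (21) and (22). The names, the constant-acceleration update, and the 0.2 s default timestep are illustrative assumptions rather than the authors' code.

```python
import numpy as np

K_P = np.diag([0.1, 0.1])    # proportional gains of Eq. (19)
K_I = np.diag([0.5, 0.5])    # integral gains of Eq. (20)

def velocity_controller(v_command, v_follower):
    return K_P @ (v_command - v_follower)                     # Eq. (19)

def acceleration_controller(u_previous, a_command, a_measured):
    return u_previous + K_I @ (a_command - a_measured)        # Eq. (20)

def double_integrator_step(position, velocity, force, mass, dt=0.2):
    """Planar double-integrator follower dynamics, Eqs. (21)-(22)."""
    accel = force / mass
    next_velocity = velocity + accel * dt
    next_position = position + velocity * dt + 0.5 * accel * dt**2
    return next_position, next_velocity
```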
The following subsection discusses the reward function used to incentivize the deep guidance policy to learn the
desired behaviour.

E. Reward Function
At each timestep, the follower receives rewards according to the state and the action taken. It is the designer’s role to
craft this reward function to encourage the desired behaviour. Note that the reward function may be based on the
state even though the policy only receives an observation of the state and not the underlying state itself. In this work,
the follower quadrotor is tasked with moving to a location three metres offset from the target quadrotor. The reward
function has three components:
1) Rewards are given according to the follower position. If the follower moves in the direction of the desired
location, it receives a positive reward.
2) Penalties are given for the follower colliding with the target.
3) For the velocity-based deep guidance only: penalties are given for high velocities near the desired location to
discourage overshooting.
To calculate the position reward, a reward field f is generated.
$\mathbf{f}(\mathbf{x}_t) = -\left| \mathbf{x}_{target,t} + \begin{bmatrix} 3\cos(\psi^t_t) & 3\sin(\psi^t_t) \end{bmatrix} - \mathbf{x}_{follower,t} \right|$   (23)
The first two terms represent the desired location. When the follower is at the desired location, the reward field is zero.
It decreases linearly away from the desired state. The difference in the reward field between the current and previous
timestep is used to calculate the reward given to the agent. A positive reward is therefore given if the action chosen
brings the follower closer to the desired location, and a negative reward otherwise.

$r_t = \left\| \mathbf{K} \left( \mathbf{f}(\mathbf{x}_t) - \mathbf{f}(\mathbf{x}_{t-1}) \right) \right\|$   (24)

where the states are weighted with K = diag{125, 125}, determined by trial-and-error. A penalty, r_collide = 15, is given if the
follower and target collide to encourage the reward-seeking follower to move to the desired location safely.
For the velocity-based guidance approach, the follower often overshoots the desired location. To reduce this
overshoot, a penalty is given proportional to the guided velocity signal. This value is divided by the distance to the
desired location such that high velocities far from the desired location are not severely penalized.
To summarize, the reward function for the velocity-based deep guidance strategy is
$r_t = \begin{cases} \left\| \mathbf{K}(\mathbf{f}(\mathbf{x}_t) - \mathbf{f}(\mathbf{x}_{t-1})) \right\| - c_1 \dfrac{\| \mathbf{v}_{follower,t} \|}{\| \mathbf{f}(\mathbf{x}_t) \| + \eta} - r_{collide} & \text{for } \| \mathbf{d}_t \| \le 0.3 \\[1ex] \left\| \mathbf{K}(\mathbf{f}(\mathbf{x}_t) - \mathbf{f}(\mathbf{x}_{t-1})) \right\| - c_1 \dfrac{\| \mathbf{v}_{follower,t} \|}{\| \mathbf{f}(\mathbf{x}_t) \| + \eta} & \text{otherwise} \end{cases}$   (25)

with a small constant 𝜂 = 0.01, a velocity penalty weight 𝑐 1 = 0.5 such that it does not dominate the reward function,
and d𝑡 is the distance between the follower and the target. The reward function for the acceleration-based deep guidance
strategy does not include the velocity-penalizing term and is therefore
$r_t = \begin{cases} \left\| \mathbf{K}(\mathbf{f}(\mathbf{x}_t) - \mathbf{f}(\mathbf{x}_{t-1})) \right\| - r_{collide} & \text{for } \| \mathbf{d}_t \| \le 0.3 \\ \left\| \mathbf{K}(\mathbf{f}(\mathbf{x}_t) - \mathbf{f}(\mathbf{x}_{t-1})) \right\| & \text{otherwise} \end{cases}$   (26)

As expected, the reward function for the acceleration-based deep guidance approach in Eq. (26) is simpler than the
velocity-based reward function in Eq. (25). A simpler reward function is possible since overshoots will be experienced
during training and therefore the policy will learn how to prevent them, as opposed to needing a user-designed term in
the reward function to prevent overshoots.
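For illustration, the acceleration-based reward of Eqs. (23), (24), and (26) could be coded as below. The function and variable names are hypothetical, and the expressions follow the equations as printed; this is a sketch, not the authors' implementation.

```python
import numpy as np

K = np.diag([125.0, 125.0])   # reward weighting
R_COLLIDE = 15.0              # collision penalty
OFFSET = 3.0                  # desired 3 m offset from the target's front face

def reward_field(x_target, psi_target, x_follower):
    """Eq. (23): zero at the desired location, decreasing linearly away from it."""
    desired = x_target + OFFSET * np.array([np.cos(psi_target), np.sin(psi_target)])
    return -np.abs(desired - x_follower)

def acceleration_reward(field_now, field_prev, follower_target_distance):
    """Eqs. (24) and (26): reward from the change in the field, minus a collision penalty."""
    reward = np.linalg.norm(K @ (field_now - field_prev))
    if follower_target_distance <= 0.3:
        reward -= R_COLLIDE
    return reward
```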
The learning algorithm details are presented in the following subsection.

F. Learning Algorithm Implementation Details


The policy and value neural networks are equipped with 400 and 300 neurons in their first and second hidden layers,
respectively—Fig. 1 is not to scale. The action input to the value network skips the first hidden layer, as this was
empirically shown to yield better results [43, 49]. Rectified linear units are used as the nonlinear activation functions
within each neuron in both hidden layers, shown below
$g(y) = \begin{cases} 0 & \text{for } y < 0 \\ y & \text{for } y \ge 0 \end{cases}$   (27)

A 𝑔(𝑦) = tanh(𝑦) nonlinear activation function is used in the output layer of the policy network to ensure the guided
velocity or acceleration is bounded. It is then scaled to the action range. The output layer of the value network uses a
softmax function to ensure the output is indeed a valid probability distribution

$g(y_i) = \frac{e^{y_i}}{\sum_{k=1}^{B} e^{y_k}} \quad \forall\, i = 1, \ldots, B$   (28)

for each element 𝑦 𝑖 and for 𝐵 bins in the value distribution. Fifty-one evenly-spaced bins are used, 𝐵 = 51, inspired by
the original value distribution paper [44]. The value bounds, within which the bins are evenly divided, were empirically
found to be [−5000, 0] for the velocity-based guidance and [−200, 300] for the acceleration-based guidance. The
stochastic gradient-descent optimization routine named Adam [50] is used to train the policy and value networks.
Learning rates of 𝛼 = 𝛽 = 0.0001 are used. The observations and action inputs are normalized before passing through the networks to avoid the vanishing gradients problem [51]. The replay buffer 𝑅 holds 10⁶ samples, and during training, mini-batches of size 𝑀 = 256 are used. The smoothed network parameters are updated each training iteration with 𝜖 = 0.001. The standard deviation of the noise applied to the actions during training to force exploration is 𝜎 = (1/3)[max(a) − min(a)](0.99998)^𝐸, where 𝐸 is the episode number. Selecting this standard deviation empirically leads to good exploration of the action space, and decaying the exploration as episodes continue refines the search space. Ten actors (𝐾 = 10), a discount factor of 𝛾 = 0.99, a dynamics delay of length 𝐷 = 3, and a timestep of 0.2 seconds were used. N-step return lengths of 𝑁 = 1 and 𝑁 = 2 were used for the velocity- and acceleration-based guidance strategies, respectively. The TensorFlow [52] machine learning framework was used to generate, train, and evaluate the neural networks.
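The sketch below shows how the policy network and the decaying exploration noise described above might be set up with tf.keras; the builder function, the output-scaling layer, and the argument names are assumptions for illustration, not the authors' code.

```python
import tensorflow as tf

def build_policy_network(obs_dim, action_dim, action_scale):
    """Two hidden layers (400 and 300 ReLU neurons) and a tanh output scaled to the action range."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(obs_dim,)),
        tf.keras.layers.Dense(400, activation="relu"),
        tf.keras.layers.Dense(300, activation="relu"),
        tf.keras.layers.Dense(action_dim, activation="tanh"),
        tf.keras.layers.Lambda(lambda x: x * action_scale),   # scale tanh output to action range
    ])

def exploration_sigma(action_range, episode):
    """sigma = (1/3)[max(a) - min(a)] (0.99998)^E, decayed with the episode number E."""
    return (action_range / 3.0) * 0.99998**episode
```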
Every five training episodes, the current policy is “deployed” and run in a full dynamics environment with a
controller, as described in Sec. III.D. During deployment, 𝜎 = 0 in Eq. (10) such that no exploration noise is applied to
the deep guidance velocity or acceleration signals.

IV. Simulation Results


To determine which deep guidance strategy is most effective, both the velocity-based deep guidance and acceleration-
based deep guidance policies are trained in simulation. Both are trained on the same task—that is, to learn how to guide
a quadrotor from a randomized initial position to a position three metres offset from the front-face of the target quadrotor.
While this task is easily accomplished by conventional guidance and control techniques, it serves to determine the
suitability to use deep reinforcement learning for quadrotor guidance and to compare the velocity- and acceleration-based
deep guidance approaches. The initial conditions of the target and follower are
$\mathbf{x}_{target,0} = \begin{bmatrix} 0\ \text{m} & 0\ \text{m} & 0\ \text{rad} \end{bmatrix} + \mathcal{N}(0, 1)$   (29)

$\mathbf{v}_{target,0} = \begin{bmatrix} 0\ \text{m/s} & 0\ \text{m/s} & 0\ \text{rad/s} \end{bmatrix}$   (30)

$\mathbf{x}_{follower,0} = \begin{bmatrix} 0\ \text{m} & 2\ \text{m} \end{bmatrix} + \mathcal{N}(0, 1)$   (31)

$\mathbf{v}_{follower,0} = \begin{bmatrix} 0\ \text{m/s} & 0\ \text{m/s} \end{bmatrix}$   (32)

In other words, the target and follower initial positions are randomized around their nominal locations on each episode.
This forces the policy to become robust to a variety of initial conditions.
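A minimal sketch of this episode initialization is shown below; the function name and the use of NumPy's random generator are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sample_initial_conditions():
    """Randomize the nominal initial states of Eqs. (29)-(32) at the start of each episode."""
    rng = np.random.default_rng()
    x_target = np.array([0.0, 0.0, 0.0]) + rng.normal(0.0, 1.0, size=3)   # [x (m), y (m), psi (rad)]
    v_target = np.zeros(3)                                                # [vx, vy, omega]
    x_follower = np.array([0.0, 2.0]) + rng.normal(0.0, 1.0, size=2)      # [x (m), y (m)]
    v_follower = np.zeros(2)                                              # [vx, vy]
    return x_target, v_target, x_follower, v_follower
```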

A. Velocity-based Guidance Results


The velocity-based deep guidance system was trained on the quadrotor pose tracking task. A learning curve is shown
in Fig. 5a. The learning curve shows the total rewards received per episode, as a function of how many training episodes
were completed. Deep reinforcement learning aims to increase the average rewards per episode as training progresses.
The learning curve increases as training progresses, indicating that the task is being learned. Similarly, the loss function
decreases in Fig. 5b. Sample follower trajectories are shown in Fig. 6. The shaded quadrotor represents the initial
follower position, the solid one represents its final location, and the dashed line shows its trajectory. The shaded rotors
on the target represent its front—the chaser is tasked with moving three metres away from the target in this direction.
Even once the learning curve and loss function reached their plateaus, indicating that training was complete, significant
overshoots and steady-state error are observed when individual velocity-based guidance trajectories are plotted.

(Fig. 5 Velocity-based deep guidance training progress: (a) learning curve (reward vs. episode); (b) loss function (loss vs. training iteration).)

B. Acceleration-based Guidance Results


The acceleration-based deep guidance system was also trained on the identical quadrotor task. A learning curve is
shown in Fig. 7a, and the associated loss function is shown in Fig. 7b. The learning curve increases as expected. Sample
trajectories are shown in Fig. 8, and show that the acceleration-based deep guidance strategy effectively learned to solve
the quadrotor pose tracking task. Figure 9 shows a time-series comparison of the scalar position error between the
follower and the desired location—3 m away from the target. Both the velocity- and acceleration-based deep guidance
strategies are shown for comparison. As expected, the acceleration strategy exhibits less overshoot and steady-state
error than the velocity strategy. In addition, the acceleration-based reward function was simpler as it did not require
additional terms to try and artificially dampen the overshoots. It is likely that the overshoot-dampening term of the
velocity-based reward function contributed to the steady-state offset seen in Fig. 9. It can be concluded that for this
application, and likely all second-order systems, it is more appropriate to use acceleration-based guidance as opposed to
velocity-based guidance. The root cause is hypothesized to be: a deep guidance system that issues desired velocity
signals is trained in a first-order kinematics environment and therefore does not encounter key factors that second-order
systems possess, like momentum. Therefore, since the velocity-based deep guidance system has been trained in an
environment where it can realize any velocity immediately, it is inappropriate to apply such a technique to a second-order
system—large overshoots will inevitably occur even with additional reward function crafting. With acceleration-based
deep guidance, the system learns that it must apply a negative acceleration before reaching the desired location to avoid
overshoots because it sees this during training. In addition, desired accelerations can be realized much more quickly
than desired velocities. Having a system be able to track the deep guidance signals accurately is crucial for the high-level
problem-solving abilities of deep reinforcement learning for robotics to be realized and useful.
The simulations are validated experimentally in the following section.

V. Experimental Validation
To explore whether the proposed deep guidance method detailed above will allow for the policies trained entirely in
simulation to be directly transferred to real robot platforms, the guidance policies trained in simulation are executed
on real quadrotors in an experimental facility at École Nationale de l’Aviation Civile (ENAC). Details of the indoor
quadrotor experimental facility are presented, followed by the experimental setup and results.

A. Experiment Facility
Quadrotors are flown in an indoor facility at ENAC. The quadrotors are named Explorer 1 and 2, and are shown in
Fig. 10a. Their mass is 535 g (including the battery), and their maximum thrust is 40 N. They are powered by a 3-cell battery at 11.1 V and 2300 mAh, which provides 15 minutes of flight time.

(Fig. 6 Visualization of velocity-based follower trajectories at various episodes (78,600; 79,500; 79,600; and 79,900) once training is complete. Each panel plots the follower trajectory relative to the target in the X-Y plane, in metres.)

(Fig. 7 Acceleration-based deep guidance training progress: (a) learning curve; (b) loss function.)

The flight arena is 10 m × 10 m × 9 m with
a mesh exterior, as shown in Fig. 10b. A 16-camera Optitrack system is used to track the motion of the quadrotors at
sub-millimetre resolution in real-time and relays this information directly to the quadrotors. The Explorer vehicles used
are equipped with the Paparazzi Autopilot System [53], an open-sourced software package for unmanned aerial systems.
Paparazzi consists of a ground segment, running on a personal computer, an airborne segment, running on-board the
quadrotor, and a communication link between them. The on-board Tawaki autopilot board, shown in Fig. 10c, has the characteristics listed in Table 1.
Table 1 General characteristics for the Tawaki v1.0 autopilot board

Description | Details
MCU | STM32F7
IMU | ICM20600 (accel, gyro) + LIS3MDL (mag)
Baro | BMP3
Serial | 3 UARTs, I2C (5V + 3.3V), SPI
Servo | 8 PWM/DShot outputs (+ ESC telemetry)
RC | 2 inputs: PPM, SBUS, Spektrum
AUX | 8 multi-purpose auxiliary pins (ADC, timers, UART, flow control, GPIO, ...)
Logger | SD card slot
USB | DFU flash, mass storage, serial over USB
Power | 6 V to 17 V input (2-4S LiPo); 3.3 V and 5 V, 4 A output
Weight | 12 grams

B. Experimental Setup
The simulations designed and presented in Sec. IV are replicated experimentally. The same quadrotor motion task is
presented, and both the velocity-guidance and the acceleration-guidance are compared. The final parameters 𝜃 of the

policy network trained in simulation are exported for use in experiment; no further training is performed.

(Fig. 8 Visualization of acceleration-based follower trajectories at various episodes (62,800; 63,000; 63,100; and 63,400) once training is complete. Each panel plots the follower trajectory relative to the target in the X-Y plane, in metres.)

(Fig. 9 Time-series scalar follower position error for the velocity-based (episode 78,600) and acceleration-based (episode 62,800) deep guidance strategies.)

(Fig. 10 ENAC facility and flight hardware: (a) ground computer and quadrotors; (b) indoor flight arena; (c) Tawaki autopilot board.)
The quadrotors take off under their regular autopilot software and move to their initial conditions. Then, once at
zero velocity, the follower quadrotor is switched into deep guidance mode where it listens for guided desired velocities
or accelerations from the policy network, and then uses feedback control to track that velocity or acceleration signal.
The follower quadrotor is tasked to move from its initial location to three metres offset from the front-face of the
target quadrotor. There are many discrepancies between the simulated environment within which the policy is trained
and the experimental facility where its performance is evaluated. The simulated environment did not model the
vehicles as quadrotors—they were modelled as double-integrator point masses. Rotor dynamics, air disturbances, and
sensor inaccuracies were unmodelled. In addition, mass, size, and the on-board controllers that track the velocity
and acceleration signals were different than the ones used during evaluation of the policy performance in simulation.
Incremental nonlinear dynamic inversion control [9] is used as the on-board controller to track the commanded velocities
or accelerations. Dramatic discrepancies exist between the simulated and experimental environments. Therefore, this is
an excellent test of the simulation-to-reality capabilities of the deep guidance technique, and is an appropriate facility to
compare the two proposed deep guidance solutions.
C. Experimental Results
The experimental results are shown in Fig. 11. Both the velocity- and acceleration-based deep guidance approaches
have successfully transferred from simulation to experiment, as the trajectories in Fig. 11a and 11b show. The
acceleration-based model was expected to outperform the velocity-based one in terms of overshoot, as discussed in
Sec. IV.B. The steady-state error the velocity-based model suffers from is likely due to the velocity-based reward function
in Eq. (25), where a term was included to penalize high velocities near the desired location, weighted by 𝑐 1 , in order
to reduce overshoots. However, it appears that the term also discourages a low final steady-state error, since
the velocity needed to move to the desired location may result in more penalties than rewards. In trying to prevent
overshoot, this term caused steady-state offset, though perhaps with additional tuning of the 𝑐 1 parameter this could
be prevented. The result is further evidence that a simpler reward function is best, and the acceleration-based deep
guidance approach allows for a simpler reward function with better performance.
A time-series view of the results is shown in Fig. 11c, where the acceleration-based experiment outperforms the
velocity-based one both in terms of overshoot and steady-state offset. The target is moved twice in experiment—a
task that was not encountered during training. Both deep guidance systems continue to track the desired location even
when the target is moved. To further test the acceleration deep guidance model’s ability to effectively transfer from
simulation to reality, the same acceleration model used on the quadrotor is also executed on a hexacopter. Results,
shown in Fig. 12, show that the same guidance logic is applicable to an entirely different vehicle—a fact that would not
be possible if deep reinforcement learning was used to directly calculate rotor torques from observations (i.e., if deep
reinforcement learning was tasked with guidance and control).
Both the velocity- and acceleration-based deep guidance models performed worse in experiment than in simulation.
Many differences between the simulator the policy was trained within and the experimental facility exist, such as: the
quadrotors have significantly more complex dynamics than the double-integrator model that was used in training; the
experimental quadrotor mass, thrust, perturbations, and controller were different than the policy was evaluated on in
simulation; significant accelerometer noise was encountered; and dynamic coupling between the axes was experienced
in experiment. Although the experimental results were slightly worse than the simulated ones, they demonstrate the
ability of the deep guidance system to transfer from simulation to reality even when significant dynamic differences are
present. Altitude commands from the deep guidance system were attempted but ultimately led to poor performance due
to the dynamic coupling that exists between the axes. Future work should examine how deep guidance can be applied to
systems where coupling exists.
(Fig. 11 Experimental results: (a) velocity guidance trajectory; (b) acceleration guidance trajectory; (c) time-series scalar follower position error, with markers where the target was moved.)

(Fig. 12 Hexacopter hardware and experimental trajectory.)

Videos from the two experiments can be found at https://youtu.be/36S_sOfTc-0. All the code used in this work is available at https://github.com/Kirkados/AIAA_GNC_2021.

VI. Conclusion
This paper improved on previous work where deep reinforcement learning was applied to the guidance problem for
robotics. It used deep reinforcement learning to perform closed-loop guidance, named deep guidance, and compared
whether it is more appropriate for velocity or acceleration guidance signals to be issued. A simulated quadrotor task was
designed where one quadrotor starts at a randomized initial state and must move itself in front of another quadrotor.
Results conclusively showed that acceleration-based deep guidance is more effective at guiding the motion and it allowed
for a simpler reward function. Training the system using kinematics that are of the same order as the dynamics within
which the policy will be implemented appears to be important. To demonstrate the ability of the deep guidance system
to transfer to reality, the trained policy is transferred from simulation to a real quadrotor facility at École Nationale
de l’Aviation Civile in Toulouse, France. The two quadrotors performed flights in an indoor facility such that the
velocity- and acceleration-based guidance systems could be compared. The experimental results confirmed that the
acceleration-based deep guidance approach is more appropriate, and it confirmed the simulation-to-reality transfer
abilities of the deep guidance approach. Future work should explore the use of the acceleration-based deep guidance
technique on more difficult problems in robotics, and investigate the use of the deep guidance technique in scenarios
where dynamic coupling exists.

Acknowledgments
This research was financially supported in part by the Natural Sciences and Engineering Research Council of Canada
under the Postgraduate Scholarship-Doctoral PGSD3-503919-2017 award and the Ontario Graduate Scholarship.

References
[1] Bouabdallah, S., Murrieri, P., and Siegwart, R., “Design and Control of an Indoor Micro Quadrotor,” IEEE International
Conference on Robotics and Automation, Piscataway, NJ, 2004, pp. 4393 – 4398.
doi:10.3929/ethz-a-010085499
[2] Hoffmann, G. M., Huang, H., Waslander, S. L., and Tomlin, C. J., “Quadrotor Helicopter Flight Dynamics and Control: Theory
and Experiment,” AIAA Guidance, Navigation, and Control Conference, Hilton Head, SC, 2007, AIAA Paper 2007-6461.
doi:10.2514/6.2007-6461
[3] Bouabdallah, S. and Siegwart, R., “Full Control of a Quadrotor,” IEEE/RSJ International Conference on Intelligent Robots and
Systems, Piscataway, NJ, 2007, pp. 153–158.
doi:10.1109/IROS.2007.4399042
[4] Madani, T. and Benallegue, A., “Backstepping Control for a Quadrotor Helicopter,” IEEE International Conference on
Intelligent Robots and Systems, Piscataway, NJ, 2006, pp. 3255–3260.
doi:10.1109/IROS.2006.282433
[5] Xu, R. and Ozguner, U., “Sliding Mode Control of a Quadrotor Helicopter,” IEEE Conference on Decision and Control, IEEE,
Piscataway, NJ, 2006, pp. 4957–4962.
doi:10.1109/CDC.2006.377588

[6] Alexis, K., Nikolakopoulos, G., and Tzes, A., “Model Predictive Quadrotor Control: Attitude, Altitude and Position Experimental
Studies,” IET Control Theory and Applications, Vol. 6, No. 12, 2012, pp. 1812–1827.
doi:10.1049/iet-cta.2011.0348

[7] Das, A., Subbarao, K., and Lewis, F., “Dynamic Inversion with Zero-dynamics Stabilisation for Quadrotor Control,” IET
Control Theory & Applications, Vol. 3, No. 3, 2009, pp. 303–314.
doi:10.1049/iet-cta:20080002

[8] Smeur, E. J., De Croon, G. C., and Chu, Q., “Gust Disturbance Alleviation with Incremental Nonlinear Dynamic Inversion,”
IEEE International Conference on Intelligent Robots and Systems, Piscataway, NJ, 2016, pp. 5626–5631.
doi:10.1109/IROS.2016.7759827

[9] Smeur, E. J., Chu, Q., and De Croon, G. C., “Adaptive Incremental Nonlinear Dynamic Inversion for Attitude Control of Micro
Air Vehicles,” Journal of Guidance, Control, and Dynamics, Vol. 39, No. 3, 2016, pp. 450–461.
doi:10.2514/1.G001490

[10] Smeur, E. J., de Croon, G. C., and Chu, Q., “Cascaded Incremental Nonlinear Dynamic Inversion for MAV Disturbance Rejection,” Control Engineering Practice, Vol. 73, 2018, pp. 79–90.
doi:10.1016/j.conengprac.2018.01.003

[11] Altuǧ, E., Ostrowski, J., and Mahony, R., “Control of a Quadrotor Helicopter Using Visual Feedback,” IEEE International
Conference on Robotics and Automation, IEEE, Piscataway, NJ, 2002, pp. 72–77.
doi:10.1109/ROBOT.2002.1013341

[12] Altuǧ, E., Ostrowski, J. P., and Taylor, C. J., “Quadrotor Control Using Dual Camera Visual Feedback,” IEEE International
Conference on Robotics and Automation, Piscataway, NJ, 2003, pp. 4294–4299.
doi:10.1109/robot.2003.1242264

[13] Hoffmann, G. M., Waslander, S. L., and Tomlin, C. J., “Quadrotor Helicopter Trajectory Tracking Control,” AIAA Guidance,
Navigation and Control Conference and Exhibit, Honolulu, HI, 2008, AIAA Paper 2008-7410.
doi:10.2514/6.2008-7410

[14] Mellinger, D. and Kumar, V., “Minimum Snap Trajectory Generation and Control for Quadrotors,” IEEE International
Conference on Robotics and Automation, Piscataway, NJ, 2011, pp. 2520–2525.
doi:10.1109/ICRA.2011.5980409

[15] Honig, W., Preiss, J. A., Kumar, T. K., Sukhatme, G. S., and Ayanian, N., “Trajectory Planning for Quadrotor Swarms,” IEEE
Transactions on Robotics, Vol. 34, No. 4, 2018, pp. 856–869.
doi:10.1109/TRO.2018.2853613

[16] Mellinger, D., Michael, N., and Kumar, V., “Trajectory Generation and Control for Precise Aggressive Maneuvers with
Quadrotors,” International Journal of Robotics Research, Vol. 31, No. 5, 2012, pp. 664–674.
doi:10.1177/0278364911434236

[17] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A., Veness, J., Bellemare, M., Graves, A., Riedmiller, M., Fidjeland, A., Ostrovski,
G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D.,
“Human-Level Control Through Deep Reinforcement Learning,” Nature, Vol. 518, No. 7540, 2015, pp. 529–533.
doi:10.1038/nature14236

[18] Hornik, K., Stinchcombe, M., and White, H., “Multilayer Feedforward Networks are Universal Approximator,” Neural Networks,
Vol. 2, No. 5, 1989, pp. 359–366.
doi:10.1016/0893-6080(89)90020-8

[19] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A.,
Chen, Y., Lillicrap, T., Hui, F., Sifre, L., Van Den Driessche, G., Graepel, T., and Hassabis, D., “Mastering the Game of Go
Without Human Knowledge,” Nature, Vol. 550, No. 7676, 2017, pp. 354–359.
doi:10.1038/nature24270

[20] Sünderhauf, N., Brock, O., Scheirer, W., Hadsell, R., Fox, D., Leitner, J., Upcroft, B., Abbeel, P., Burgard, W., Milford, M., and
Corke, P., “The Limits and Potentials of Deep Learning for Robotics,” Computing Research Repository, 2018.
arXiv:1804.06557

[21] Kober, J., Bagnell, J. A., and Peters, J., “Reinforcement Learning in Robotics: A Survey,” The International Journal of Robotics
Research, Vol. 32, No. 11, 2015, pp. 1238–1274.
doi:10.1177/0278364913495721

[22] Bousmalis, K., Irpan, A., Wohlhart, P., Bai, Y., Kelcey, M., Kalakrishnan, M., Downs, L., Ibarz, J., Pastor, P., Konolige, K.,
Levine, S., and Vanhoucke, V., “Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping,”
Computing Research Repository, 2017.
arXiv:1709.07857

[23] Yang, Y., Caluwaerts, K., Iscen, A., Zhang, T., Tan, J., and Sindhwani, V., “Data Efficient Reinforcement Learning for Legged
Robots,” Conference on Robotic Learning, Osaka, Japan, 2019.
arXiv:1907.03613

[24] Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., and Levine, S.,
“Soft Actor-Critic Algorithms and Applications,” Computing Research Repository, 2018.
arXiv:1812.05905
[25] OpenAI, “Learning Dexterous In-Hand Manipulation,” Computing Research Repository, 2018.
arXiv:1808.00177

[26] OpenAI, Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G.,
Ribas, R., Schneider, J., Tezak, N., Tworek, J., Welinder, P., Weng, L., Yuan, Q., Zaremba, W., and Zhang, L., “Solving Rubik’s
Cube with a Robot Hand,” Computing Research Repository, 2019.
arXiv:1910.07113

[27] Lee, J., Hwangbo, J., and Hutter, M., “Robust Recovery Controller for a Quadrupedal Robot using Deep Reinforcement
Learning,” Computing Research Repository, 2019.
arXiv:1901.07517

[28] Peng, X. B., Andrychowicz, M., Zaremba, W., and Abbeel, P., “Sim-to-Real Transfer of Robotic Control with Dynamics
Randomization,” IEEE International Conference on Robotics and Automation, Piscataway, NJ, 2018, pp. 3803–3810.
doi:10.1109/ICRA.2018.8460528

[29] Loquercio, A., Kaufmann, E., Ranftl, R., Dosovitskiy, A., Koltun, V., and Scaramuzza, D., “Deep Drone Racing: From
Simulation to Reality with Domain Randomization,” IEEE Transactions on Robotics, Vol. 36, No. 1, 2020, pp. 1–14.
doi:10.1109/TRO.2019.2942989

[30] Sadeghi, F. and Levine, S., “CAD2RL: Real Single-image Flight Without a Single Real Image,” Robotics: Science and Systems,
Cambridge, MA, 2017.
doi:10.15607/RSS.2017.XIII.034

[31] Van Baar, J., Sullivan, A., Cordorel, R., Jha, D., Romeres, D., and Nikovski, D., “Sim-to-real Transfer Learning Using
Robustified Controllers in Robotic Tasks Involving Complex Dynamics,” IEEE International Conference on Robotics and
Automation, Piscataway, NJ, 2019, pp. 6001–6007.
doi:10.1109/ICRA.2019.8793561

[32] James, S., Wohlhart, P., Kalakrishnan, M., Kalashnikov, D., Irpan, A., Ibarz, J., Levine, S., Hadsell, R., and Bousmalis, K.,
“Sim-to-real via Sim-to-sim: Data-efficient Robotic Grasping Via Randomized-to-canonical Adaptation Networks,” IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, Piscataway, NJ, 2019, pp. 12619–12629.
doi:10.1109/CVPR.2019.01291

[33] Cutler, M. and How, J. P., “Autonomous Drifting Using Simulation-aided Reinforcement Learning,” IEEE International
Conference on Robotics and Automation, IEEE, Piscataway, NJ, 2016, pp. 5442–5448.
doi:10.1109/ICRA.2016.7487756

[34] Waslander, S. L., Hoffmann, G. M., Jang, J. S., and Tomlin, C. J., “Multi-agent Quadrotor Testbed Control Design: Integral
Sliding Mode vs. Reinforcement Learning,” IEEE/RSJ International Conference on Intelligent Robots and Systems, Piscataway,
NJ, 2005, pp. 468–473.
doi:10.1109/IROS.2005.1545025

[35] Koch, W., Mancuso, R., West, R., and Bestavros, A., “Reinforcement Learning for UAV Attitude Control,” ACM Transactions
on Cyber-Physical Systems, Vol. 3, No. 2, 2019, pp. 1–21.
doi:10.1145/3301273

[36] Palunko, I., Faust, A., Cruz, P., Tapia, L., and Fierro, R., “A Reinforcement Learning Approach Towards Autonomous Suspended
Load Manipulation Using Aerial Robots,” IEEE International Conference on Robotics and Automation, Piscataway, NJ, 2013,
pp. 4896–4901.
doi:10.1109/ICRA.2013.6631276

[37] Hwangbo, J., Sa, I., Siegwart, R., and Hutter, M., “Control of a Quadrotor with Reinforcement Learning,” IEEE Robotics and
Automation Letters, Vol. 2, No. 4, 2017, pp. 2096–2103.
doi:10.1109/LRA.2017.2720851

[38] Julian, K. D. and Kochenderfer, M. J., “Distributed Wildfire Surveillance with Autonomous Aircraft Using Deep Reinforcement
Learning,” Journal of Guidance, Control, and Dynamics, Vol. 42, No. 8, 2019, pp. 1768–1778.
doi:10.2514/1.G004106

[39] Junell, J. L., Van Kampen, E.-J., de Visser, C. C., and Chu, Q., “Reinforcement Learning Applied to a Quadrotor Guidance Law
in Autonomous Flight,” AIAA Guidance, Navigation, and Control Conference, Kissimmee, FL, 2015, AIAA Paper 2015-1990.
doi:10.2514/6.2015-1990
[40] Greatwood, C. and Richards, A. G., “Reinforcement Learning and Model Predictive Control for Robust Embedded Quadrotor
Guidance and Control,” Autonomous Robots, Vol. 43, 2019, pp. 1681–1693.
doi:10.1007/s10514-019-09829-4

[41] Harris, A., Teil, T., and Schaub, H., “Spacecraft Decision-Making Autonomy using Deep Reinforcement Learning,” AAS/AIAA
Space Flight Mechanics Meeting, Ka’anapali, HI, 2019, AAS Paper 19-447.

[42] Hovell, K. and Ulrich, S., “Deep Reinforcement Learning for Spacecraft Proximity Operations Guidance,” Journal of Spacecraft
and Rockets, Article in Press.

[43] Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., TB, D., Muldal, A., Heess, N., and Lillicrap, T.,
“Distributed Distributional Deterministic Policy Gradients,” International Conference on Learning Representations, Vancouver,
Canada, 2018.
arXiv:1804.08617

[44] Bellemare, M. G., Dabney, W., and Munos, R., “A Distributional Perspective on Reinforcement Learning,” Computing Research
Repository, 2017.
arXiv:1707.06887

[45] Mnih, V., Badia, A., Mirza, M., Graves, A., and Lillicrap, T., “Asynchronous Methods for Deep Reinforcement Learning,”
International Conference on Machine Learning, New York, NY, 2016.
arXiv:1602.01783

[46] Sutton, R. S. and Barto, A. G., Reinforcement Learning: An Introduction, Cambridge: MIT Press, 2nd ed., 1998, pg. 148.

[47] Oliphant, T. E., “Python for Scientific Computing,” Computing in Science & Engineering, Vol. 9, No. 3, 2007, pp. 10–20.
doi:10.1109/MCSE.2007.58

[48] Katsikopoulos, K. V. and Engelbrecht, S. E., “Markov Decision Processes with Delays and Asynchronous Cost Collection,”
IEEE Transactions on Automatic Control, Vol. 48, No. 4, 2003, pp. 568–574.
doi:10.1109/TAC.2003.809799

[49] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D., “Continuous Control with
Deep Reinforcement Learning,” Computing Research Repository, 2016.
arXiv:1509.02971

[50] Kingma, D. P. and Ba, J., “Adam: A Method for Stochastic Optimization,” International Conference on Learning Representations,
San Diego, CA, 2015.
arXiv:1412.6980

[51] Hochreiter, S., “The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions,” International
Journal of Uncertainty, Fuzziness and Knowlege-Based Systems, Vol. 6, No. 2, 1998, pp. 107–116.
doi:10.1142/S0218488598000094

[52] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat,
S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jozefowicz, R., Jia, Y., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D.,
Schuster, M., Monga, R., Moore, S., Murray, D., Olah, C., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke,
V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X., “TensorFlow:
Large-Scale Machine Learning on Heterogeneous Systems,” Computing Research Repository, 2016, Software available from
tensorflow.org.
arXiv:1603.04467

[53] Hattenberger, G., Bronz, M., and Gorraz, M., “Using the Paparazzi UAV System for Scientific Research,” International Micro
Air Vehicles Conference and Flight Competition, Delft, Netherlands, 2014, pp. 247–252.
doi:10.4233/uuid:b38fbdb7-e6bd-440d-93be-f7dd1457be60
