Trajectory Optimization For Autonomous Flying Base Station Via Reinforcement Learning

Fig. 1. UAV BS optimizing its trajectory to maximize the sum rate of the transmission to a group of users, e.g. in case of stationary transmitter failure.

I. INTRODUCTION
Compared to traditional mobile network infrastructure, mounting base stations (BSs) or access points (APs) on unmanned aerial vehicles (UAVs) promises faster and dynamic network deployment, the possibility to extend coverage beyond existing stationary APs, and additional capacity for users in localized areas of high demand, such as concerts and sports events. Fast deployment is especially useful in scenarios where a sudden network failure occurs and delayed re-establishment is not acceptable, e.g. in disaster and search-and-rescue situations [1]. In remote areas where it is not feasible or economically efficient to extend permanent network infrastructure, high-flying balloons or unmanned solar planes (as in Google's project Loon and Facebook's Internet.org initiative) could provide Internet access to the half of the world's population currently without it.

In all mentioned scenarios where flying APs hold promise, a decisive factor for the system's ability to serve the highest possible number of users with the best achievable Quality of Service (QoS) is the UAV's location. Previous work has either addressed the placement problem of finding optimal positions for flying APs (e.g. [2], [3]) or optimized the UAV's trajectory from start to end [4]–[7]. Whereas fixed locations fulfilling a certain communication network's goal are determined in the placement problem, the alternative is to embed the optimization of the communication system with the path planning of the UAV base station. This allows for optimizing the users' QoS during the whole flying time as well as combining it with other mission-critical objectives such as energy conservation by reducing flying time (e.g. [2] and [6]) or integrating landing spots for the UAV in the trajectory [4].

In this work and as depicted in Fig. 1, we consider the UAV acting as a BS serving multiple users and maximizing the sum of the information rate over the flying time, but a multitude of other applications exist. [8] and [9] provide summaries of the general challenges and opportunities. In [4], [10] and [11], the authors investigate an IoT-driven scenario where an autonomous drone gathers data from distant network nodes. The authors of [7] and [12] work on an application where an existing ground-based communications network could be used for beyond line-of-sight (LOS) control of UAVs if the resulting interference within the ground network is managed. In [2] and [5], a scenario similar to this work is considered where a UAV-mounted BS serves a group of users. Whereas the authors in [5] also maximize the sum rate of the users, the goal in [2] is to cover the highest possible number of users while minimizing transmit power.

Recent successes in the application of deep reinforcement learning to problems of control, perception and planning, achieving superhuman performance, e.g. playing Atari video games [13], have created interest in many areas, though RL-based path planning for mobile robots and UAVs in particular has not been investigated widely. Deep learning applications in UAV guidance often focus on perception and have mostly auxiliary functions for the actual path planning, see [14] for a review. In [3], a radio map is learned which is then used to find optimal UAV relay positions. In [7], a deep RL system based on echo state network (ESN) cells is used to guide cellular-connected UAVs towards a destination while minimizing interference.

Our work focuses on a different scenario where the UAV carries a base station and becomes part of the mobile communication infrastructure serving a group of users. Movement decisions to maximize the sum rate over the flying time are made directly by a reinforcement Q-learning system. Previous works not employing machine learning often rely on strict models of the environment or assume the channel state information (CSI) to be predictable. In contrast, the Q-learning algorithm requires no explicit information about the environment and is able to learn the topology of the network to improve the system-wide performance. We compare a standard table-based approach and a neural network as Q-function approximators.

The authors acknowledge the support of the SeCIF project within the French-German Academy for the Industry of the Future as well as the support from the PERFUME project funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 670896).
II. SYSTEM MODEL

A. UAV Model

The UAV has a maximum flying time T, by the end of which it is supposed to return to a final position. During the flying time t ∈ [0, T], the UAV's position is given by (x(t), y(t)) and a constant altitude H. It is moving with a constant velocity V. The initial position of the UAV is (x_0, y_0), whereas (x_f, y_f) is the final position. x(t) and y(t) are smooth functions of class C^∞ and defined as

x : [0, T] \to \mathbb{R}, \; t \mapsto x(t), \qquad y : [0, T] \to \mathbb{R}, \; t \mapsto y(t)    (1)

subjected to

x(0) = x_0, \; y(0) = y_0, \qquad x(T) = x_f, \; y(T) = y_f.

The UAV's constant velocity is enforced over the time derivatives ẋ(t) and ẏ(t) with

\sqrt{\dot{x}^2(t) + \dot{y}^2(t)} = V, \quad t \in [0, T].    (2)
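As a concrete reading of the kinematic model, the following minimal sketch propagates the UAV position at constant speed V in discrete time steps and checks the feasibility of reaching the final position; the step length V·dt, the four movement directions (cf. Fig. 2) and all numeric values are assumptions for illustration, not the simulation parameters of this work.

```python
import numpy as np

# Illustration values only, not the simulation parameters of this work.
V, dt, T = 5.0, 1.0, 20.0                # speed [m/s], time step [s], max flying time [s]
x0, y0, xf, yf = 0.0, 0.0, 40.0, 30.0    # initial and required final position

# Four movement actions (cf. Fig. 2), each covering a distance V*dt per step.
ACTIONS = {"up": (0.0, 1.0), "down": (0.0, -1.0), "left": (-1.0, 0.0), "right": (1.0, 0.0)}

def step(pos, action):
    """Advance the UAV one time step at constant speed V."""
    dx, dy = ACTIONS[action]
    return (pos[0] + V * dt * dx, pos[1] + V * dt * dy)

# A trajectory is feasible only if V*T covers at least the straight-line distance
# between initial and final position (cf. the condition at the end of Section II-B).
assert V * T >= np.hypot(xf - x0, yf - y0)
```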
B. Communication Channel Model

The communication channel between the UAV AP and a number of K users is described by the log-distance path loss model, including small-scale fading and a constant attenuation factor in the shadow of the obstacle.

The communication link is modeled as an orthogonal point-to-point channel. The information rate of the k-th user, k ∈ {1, ..., K}, located at a constant position (a_k, b_k) ∈ R^2 at ground level, is given by

R_k(t) = \log_2\left(1 + \frac{P}{N} \cdot L_k\right)    (3)

with transmit power P, noise power N and pathloss L_k of the k-th user. The UAV-user distance d_k(t), with the UAV at constant altitude H and all users at ground level, is given as

d_k(t) = \sqrt{H^2 + (x(t) - a_k)^2 + (y(t) - b_k)^2}    (4)

With the pathloss exponent set to α = 2 for vacuum, the pathloss for user k is given as

L_k = d_k(t)^{-\alpha} \cdot 10^{X_{\mathrm{Rayleigh}}/10} \cdot \beta_{\mathrm{shadow}}    (5)

where small-scale fading was modeled as a Rayleigh-distributed random variable X_Rayleigh with scaling factor σ = 1. The attenuation through obstacle obstruction was modeled with a discrete factor β_shadow ∈ {1, 0.01}, which is set to β_shadow = 0.01 in the obstacle's shadow and to β_shadow = 1 everywhere else. Using the described model, the maximization problem can be formulated as

\max_{x(t), y(t)} \int_{t=0}^{T} \sum_{k=1}^{K} R_k(t) \, dt    (6)

To guarantee that a feasible solution exists, T and V must be chosen such that the UAV is at least able to travel from initial to final position along the minimum-distance path, i.e. V T \geq \sqrt{(x_f - x_0)^2 + (y_f - y_0)^2}.
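To make (3)-(6) concrete, the sketch below evaluates the per-user rate and the instantaneous sum rate for a given UAV position. It is a minimal illustration: the transmit power, noise power, user positions and the obstacle test are placeholder assumptions, not the simulation setup of this work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder parameters for illustration only.
P, N, H, alpha = 1.0, 1e-3, 20.0, 2.0
users = np.array([[10.0, 5.0], [30.0, 25.0]])   # user positions (a_k, b_k) at ground level

def in_shadow(uav_xy, user_xy):
    """Hypothetical obstacle test: returns True if the link is obstructed."""
    return False  # replace with the actual obstacle geometry

def sum_rate(uav_xy):
    """Instantaneous sum rate of all users for a given UAV position, integrand of (6)."""
    x, y = uav_xy
    rates = []
    for (a_k, b_k) in users:
        d_k = np.sqrt(H**2 + (x - a_k)**2 + (y - b_k)**2)        # UAV-user distance, eq. (4)
        x_ray = rng.rayleigh(scale=1.0)                           # small-scale fading sample
        beta = 0.01 if in_shadow(uav_xy, (a_k, b_k)) else 1.0     # shadowing factor
        L_k = d_k**(-alpha) * 10**(x_ray / 10) * beta             # pathloss, eq. (5)
        rates.append(np.log2(1 + (P / N) * L_k))                  # information rate, eq. (3)
    return sum(rates)

print(sum_rate((0.0, 0.0)))
```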
III. FUNDAMENTALS OF Q-LEARNING

Q-learning is a model-free reinforcement learning method first proposed by Watkins and developed further in 1992 [15]. It is classified as model-free because it has no internal representation of the environment.

Reinforcement learning in general proceeds in a cycle of interactions between an agent and its environment. At time t, the agent observes a state s_t ∈ S, performs an action a_t ∈ A and subsequently receives a reward r_t ∈ R. The time index is then incremented and the environment propagates the agent to a new state s_{t+1}, from where the cycle restarts.

Q-learning specifically allows an agent to learn to act optimally in an environment that can be represented by a Markov decision process (MDP). Consider a finite MDP (S, A, P, R, γ) with state space S, action space A, state transition probability P_a(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a), reward function R_a(s, s') and discount factor γ ∈ [0, 1), which controls the importance of future rewards in relation to the present reward.

The goal for the agent is to learn a behavior rule that maximizes the reward it receives. A behavior rule that tells it how to select actions given a certain state is referred to as a policy and can be stochastic in general. It is given as

\pi(a|s) = \Pr[a_t = a \mid s_t = s]    (7)

Q-learning is based on iteratively improving the state-action value function (or Q-function), which represents an expectation of the future reward when taking action a in state s and following policy π from thereon after. The Q-function is

Q^{\pi}(s, a) = \mathbb{E}_{\pi}\{R_t \mid s_t = s, a_t = a\}    (8)

where the discounted sum of all future rewards at current time t is called the return R_t ∈ R, given by

R_t = \sum_{k=0}^{T-1} \gamma^k r_{t+1+k}    (9)

with discount factor γ ∈ [0, 1) as set in the MDP definition and reaching the terminal state at time t + T. Given the Q-function with perfect information Q^{π*}(s, a), an optimal policy can be derived by selecting actions greedily:

\pi^*(a|s) = \arg\max_a Q^{\pi^*}(s, a)    (10)

From combining (8) and (9) it follows that R_t can be approximated in expectation based on the agent's next step in the environment. The central Q-learning update rule to make iterative improvements on the Q-function is therefore given by

Q(s_t, a_t) \leftarrow (1 - \alpha) Q(s_t, a_t) + \alpha \left[r_t + \gamma \max_a Q(s_{t+1}, a)\right]    (11)

with learning rate α.
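The greedy policy (10) and the update rule (11) translate directly into a table-based implementation. The following sketch is a generic illustration; the learning rate, discount factor, exploration probability and the state/action encoding are assumptions, not the parameters used in this work.

```python
import random
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]
ALPHA, GAMMA = 0.1, 0.95                 # assumed learning rate and discount factor

# Q-table mapping a (state, action) pair to its estimated action value, cf. (8).
Q = defaultdict(float)

def select_action(state, epsilon=0.1):
    """Epsilon-greedy policy: greedy w.r.t. Q as in (10), random with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    """One application of the Q-learning update rule (11)."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] = (1 - ALPHA) * Q[(state, action)] + ALPHA * (reward + GAMMA * best_next)
```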
flying time T. Each new movement decision is evaluated by the environment according to the achieved sum rate computed with the channel model described in II-B, and a numerical reward based on the rate result is issued to the agent. The reward is then used to update the Q-value of the state and chosen action according to the rule defined in (11). As the drone is propagated to its new position and the time index t is incremented, the cycle repeats until the maximum flying time is reached and the learning episode ends. The random action probability is decreased and the drone position and time index are reset for the start of a new episode. The number of episodes must be chosen so that sufficient knowledge of the environment and network topology has accumulated through iterative updates of the Q-table.
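The episode loop described above can be sketched as follows, reusing select_action and q_update from the previous sketch. StubEnv is a hypothetical stand-in for the simulation environment: its state layout, reward and episode length are placeholders, and the real reward would be derived from the achieved sum rate of Section II-B.

```python
import random

class StubEnv:
    """Hypothetical environment stub: state is (x, y, t); the reward is a placeholder."""
    def __init__(self, max_steps=20):
        self.max_steps = max_steps
    def reset(self):
        self.pos, self.t = (0, 0), 0                  # reset drone position and time index
        return (*self.pos, self.t)
    def step(self, action):
        dx, dy = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}[action]
        self.pos = (self.pos[0] + dx, self.pos[1] + dy)
        self.t += 1
        reward = random.random()                      # placeholder for the achieved sum rate
        done = self.t >= self.max_steps               # maximum flying time reached
        return (*self.pos, self.t), reward, done

env, epsilon = StubEnv(), 1.0
for episode in range(1000):                           # number of episodes: assumed value
    state, done = env.reset(), False
    while not done:
        action = select_action(state, epsilon)        # epsilon-greedy action choice
        next_state, reward, done = env.step(action)   # environment evaluates the move
        q_update(state, action, reward, next_state)   # update Q-value via rule (11)
        state = next_state
    epsilon *= 0.995                                  # decrease random action probability
```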
B. NN-based Q-learning (Q-net)
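As a rough illustration of a neural-network Q-function approximator of this kind, the sketch below maps a state vector to one Q-value per movement action and trains it towards the temporal-difference target of rule (11). The input features, layer sizes, optimizer and learning rate are assumptions for illustration and do not describe the Q-net architecture or training procedure used in this work.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small fully connected network: state features -> one Q-value per action."""
    def __init__(self, state_dim=3, n_actions=4, hidden=64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.layers(state)

q_net = QNet()
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
GAMMA = 0.95                                          # assumed discount factor

def td_step(state, action, reward, next_state, done):
    """One gradient step towards the target r + gamma * max_a Q(s', a), cf. rule (11)."""
    q_sa = q_net(state)[action]
    with torch.no_grad():
        target = reward + (0.0 if done else GAMMA * q_net(next_state).max().item())
    loss = (q_sa - target) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with a state vector (x, y, t) and an action index.
td_step(torch.tensor([0.0, 0.0, 0.0]), 0, 1.5, torch.tensor([0.0, 1.0, 1.0]), False)
```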
the UAV autonomously return to its landing spot within the flying time limit. Comparing table-based and neural network approximators for the Q-function showed that using a table is not feasible for large state spaces, but training a NN provides the necessary scalability and proved to be more efficient, using less training data than table-based Q-learning.
B. Limitations and Future Work

The relatively high number of learning episodes needed to obtain the described results shows a limitation of choosing a Q-learning approach in comparison to the methods of previous works. This is a consequence of the generality of Q-learning and of the approach's avoidance of making any assumptions about the environment. Integrating even a coarse model of the environment with an alternative model-based RL method would result in faster convergence, but would also entail a loss in learning universality. The long learning time is put into perspective by the fact that the main training can be completed offline based on the prior distribution, before a shorter adaptation phase to the true setting. Future work will include considerations of dynamically changing environments, as well as a more detailed look at real-world constraints such as the energy efficiency of the learned trajectory.
Fig. 2. The final trajectory for the table-based approach after completed episode n_table = 800,000, depicted in the simulation environment with two users and one obstacle. The gray area is in the shadow of the obstacle. As a visualization for the table, the four Q-values, one for each action, are shown for the start position (0, 0). At time index t = 0, the Q-value for action 'up' was learned to promise the highest future return.

Fig. 3. Sum information rate between UAV BS and users per episode over learning time, comparing table-based and NN approximators of the Q-function. The plot only shows a clipped range of learning episodes, as the table-based solution converges after n_table = 800,000 episodes, whereas Q-net only needs n_NN = 27,000 episodes to come to a similar solution.

REFERENCES

[1] K. Namuduri, "Flying cell towers to the rescue," IEEE Spectrum, vol. 54, no. 9, pp. 38–43, Sep. 2017.
[2] M. Alzenad, A. El-Keyi, F. Lagum, and H. Yanikomeroglu, "3-D placement of an unmanned aerial vehicle base station (UAV-BS) for energy-efficient maximal coverage," IEEE Wireless Communications Letters, vol. 6, no. 4, pp. 434–437, Aug. 2017.
[3] J. Chen and D. Gesbert, "Optimal positioning of flying relays for wireless networks: A LOS map approach," in IEEE International Conference on Communications (ICC), 2017.
[4] R. Gangula, D. Gesbert, D.-F. Külzer, and J. M. Franceschi Quintero, "A landing spot approach to enhancing the performance of UAV-aided wireless networks," in IEEE International Conference on Communications (ICC) (accepted), 2018, Kansas City, MO, USA.
[5] R. Gangula, P. de Kerret, O. Esrafilian, and D. Gesbert, "Trajectory optimization for mobile access point," in Asilomar Conference on Signals, Systems, and Computers, 2017, Pacific Grove, CA, USA.
[6] Y. Zeng and R. Zhang, "Energy-efficient UAV communication with trajectory optimization," IEEE Transactions on Wireless Communications, vol. 16, no. 6, pp. 3747–3760, 2017.
[7] U. Challita, W. Saad, and C. Bettstetter, "Cellular-connected UAVs over 5G: Deep reinforcement learning for interference management," arXiv preprint arXiv:1801.05500, 2018.
[8] Y. Zeng, R. Zhang, and T. J. Lim, "Wireless communications with unmanned aerial vehicles: opportunities and challenges," IEEE Communications Magazine, vol. 54, no. 5, pp. 36–42, 2016.
[9] L. Gupta, R. Jain, and G. Vaszkun, "Survey of important issues in UAV communication networks," IEEE Communications Surveys & Tutorials, vol. 18, no. 2, pp. 1123–1152, 2016.
[10] C. Zhan, Y. Zeng, and R. Zhang, "Energy-efficient data collection in UAV enabled wireless sensor network," submitted to IEEE Wireless Communications Letters, available online at https://arxiv.org/abs/1708.00221, 2017.
[11] J. Gong, T.-H. Chang, C. Shen, and X. Chen, "Aviation time minimization of UAV for data collection over wireless sensor networks," arXiv preprint arXiv:1801.02799, 2018.
[12] B. Van der Bergh, A. Chiumento, and S. Pollin, "LTE in the sky: trading off propagation benefits with interference costs for aerial nodes," IEEE Communications Magazine, vol. 54, no. 5, pp. 44–50, 2016.
[13] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[14] A. Carrio, C. Sampedro, A. Rodriguez-Ramos, and P. Campoy, "A review of deep learning methods and applications for unmanned aerial vehicles," Journal of Sensors, vol. 2017, no. 3296874, 2017.
[15] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[16] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 2nd ed. Cambridge, MA: MIT Press, 2017.