
2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC)

Trajectory Optimization for Autonomous Flying Base Station via Reinforcement Learning

Harald Bayerlein, Paul de Kerret, and David Gesbert
Communication Systems Department, EURECOM
Sophia Antipolis, France
Email: {harald.bayerlein, paul.dekerret, gesbert}@eurecom.fr

Abstract—In this work, we study the optimal trajectory of an unmanned aerial vehicle (UAV) acting as a base station (BS) to serve multiple users. Considering multiple flying epochs, we leverage the tools of reinforcement learning (RL) with the UAV acting as an autonomous agent in the environment to learn the trajectory that maximizes the sum rate of the transmission during flying time. By applying Q-learning, a model-free RL technique, an agent is trained to make movement decisions for the UAV. We compare table-based and neural network (NN) approximations of the Q-function and analyze the results. In contrast to previous works, movement decisions are made directly by the neural network; the algorithm requires no explicit information about the environment and is able to learn the topology of the network to improve the system-wide performance.

I. INTRODUCTION

Fig. 1. UAV BS optimizing its trajectory to maximize the sum rate of the transmission to a group of users, e.g. in case of stationary transmitter failure.

Compared to traditional mobile network infrastructure, mounting base stations (BSs) or access points (APs) on unmanned aerial vehicles (UAVs) promises faster and more dynamic network deployment, the possibility to extend coverage beyond existing stationary APs, and additional capacity for users in localized areas of high demand, such as concerts and sports events. Fast deployment is especially useful in scenarios where a sudden network failure occurs and delayed re-establishment is not acceptable, e.g. in disaster and search-and-rescue situations [1]. In remote areas where it is not feasible or economically efficient to extend permanent network infrastructure, high-flying balloons or unmanned solar planes (as in Google's project Loon and Facebook's Internet.org initiative) could provide Internet access to the half of the world's population currently without it.

In all mentioned scenarios where flying APs hold promise, a decisive factor for the system's ability to serve the highest possible number of users with the best achievable Quality of Service (QoS) is the UAV's location. Previous work has either addressed the placement problem of finding optimal positions for flying APs (e.g. [2], [3]) or optimized the UAV's trajectory from start to end [4]–[7]. Whereas fixed locations fulfilling a certain communication network's goal are determined in the placement problem, the alternative is to embed the optimization of the communication system within the path planning of the UAV base station. This allows for optimizing the users' QoS during the whole flying time, as well as combining it with other mission-critical objectives such as energy conservation by reducing flying time (e.g. [2] and [6]) or integrating landing spots for the UAV into the trajectory [4].

In this work and as depicted in figure 1, we consider the UAV acting as a BS serving multiple users while maximizing the sum of the information rate over the flying time, but a multitude of other applications exist. [8] and [9] provide summaries of the general challenges and opportunities. In [4], [10] and [11], the authors investigate an IoT-driven scenario where an autonomous drone gathers data from distant network nodes. The authors of [7] and [12] work on an application where an existing ground-based communications network could be used for beyond line-of-sight (LOS) control of UAVs if the resulting interference within the ground network is managed. In [2] and [5], a scenario similar to this work is considered where a UAV-mounted BS serves a group of users. Whereas the authors in [5] also maximize the sum rate of the users, the goal in [2] is to cover the highest possible number of users while minimizing transmit power.

Recent successes in the application of deep reinforcement learning to problems of control, perception and planning, achieving superhuman performance, e.g. in playing Atari video games [13], have created interest in many areas, though RL-based path planning for mobile robots and UAVs in particular has not been investigated widely.

The authors acknowledge the support of the SeCIF project within the French-German Academy for the Industry of the Future as well as the support from the PERFUME project funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 670896).




Deep learning applications in UAV guidance often focus on perception and have mostly auxiliary functions for the actual path planning, see [14] for a review. In [3], a radio map is learned which is then used to find optimal UAV relay positions. In [7], a deep RL system based on echo state network (ESN) cells is used to guide cellular-connected UAVs towards a destination while minimizing interference.

Our work focuses on a different scenario where the UAV carries a base station and becomes part of the mobile communication infrastructure serving a group of users. Movement decisions to maximize the sum rate over the flying time are made directly by a reinforcement Q-learning system. Previous works not employing machine learning often rely on strict models of the environment or assume the channel state information (CSI) to be predictable. In contrast, the Q-learning algorithm requires no explicit information about the environment and is able to learn the topology of the network to improve the system-wide performance. We compare a standard table-based approach and a neural network as Q-function approximators.

II. SYSTEM MODEL

A. UAV Model

The UAV has a maximum flying time T, by the end of which it is supposed to return to a final position. During the flying time t ∈ [0, T], the UAV's position is given by (x(t), y(t)) and a constant altitude H. It moves with a constant velocity V. The initial position of the UAV is (x_0, y_0), whereas (x_f, y_f) is the final position. x(t) and y(t) are smooth functions of class C^∞ defined as

$$ x : [0, T] \to \mathbb{R},\; t \mapsto x(t), \qquad y : [0, T] \to \mathbb{R},\; t \mapsto y(t) \tag{1} $$

subject to

$$ x(0) = x_0, \quad y(0) = y_0, \qquad x(T) = x_f, \quad y(T) = y_f. $$

The UAV's constant velocity is enforced through the time derivatives ẋ(t) and ẏ(t) with

$$ \sqrt{\dot{x}^2(t) + \dot{y}^2(t)} = V, \quad t \in [0, T] \tag{2} $$

B. Communication Channel Model

The communication channel between the UAV AP and a number of K users is described by the log-distance path loss model, including small-scale fading and a constant attenuation factor in the shadow of the obstacle. The communication link is modeled as an orthogonal point-to-point channel. The information rate of the k-th user, k ∈ {1, ..., K}, located at a constant position (a_k, b_k) ∈ ℝ² at ground level, is given by

$$ R_k(t) = \log_2\!\left(1 + \frac{P}{N} \, L_k\right) \tag{3} $$

with transmit power P, noise power N and pathloss L_k of the k-th user. The UAV-user distance d_k(t), with the UAV at constant altitude H and all users at ground level, is given as

$$ d_k(t) = \sqrt{H^2 + (x(t) - a_k)^2 + (y(t) - b_k)^2} \tag{4} $$

With the pathloss exponent set to α = 2 for vacuum, the pathloss for user k is given as

$$ L_k = d_k(t)^{-\alpha} \cdot 10^{X_{\mathrm{Rayleigh}}/10} \cdot \beta_{\mathrm{shadow}} \tag{5} $$

where small-scale fading is modeled as a Rayleigh-distributed random variable X_Rayleigh with scaling factor σ = 1. The attenuation through obstacle obstruction is modeled with a discrete factor β_shadow ∈ {1, 0.01}, which is set to β_shadow = 0.01 in the obstacle's shadow and to β_shadow = 1 everywhere else. Using the described model, the maximization problem can be formulated as

$$ \max_{x(t),\,y(t)} \int_{0}^{T} \sum_{k=1}^{K} R_k(t)\, \mathrm{d}t \tag{6} $$

To guarantee that a feasible solution exists, T and V must be chosen such that the UAV is at least able to travel from the initial to the final position along the minimum-distance path, i.e. $V T \geq \sqrt{(x_f - x_0)^2 + (y_f - y_0)^2}$.
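To make the channel model concrete, the following Python/NumPy sketch evaluates equations (3)–(5) for a given UAV position. The transmit power P, noise power N and the per-user shadowing flags are placeholder inputs of ours, not values specified in the paper; only α = 2, σ = 1, β_shadow ∈ {1, 0.01} and H = 20 are taken from the text.

```python
import numpy as np

def pathloss(d, shadowed, alpha=2.0, sigma=1.0):
    """Pathloss of Eq. (5): distance decay with exponent alpha, Rayleigh
    small-scale fading (scale sigma = 1) and the discrete shadow factor."""
    x_rayleigh = np.random.rayleigh(scale=sigma)
    beta_shadow = 0.01 if shadowed else 1.0
    return d ** (-alpha) * 10.0 ** (x_rayleigh / 10.0) * beta_shadow

def user_rate(uav_xy, user_xy, shadowed, H=20.0, P=1.0, N=1e-3):
    """Information rate of Eq. (3); the UAV-user distance follows Eq. (4)
    with the UAV at constant altitude H. P and N are placeholders."""
    d = np.sqrt(H ** 2 + (uav_xy[0] - user_xy[0]) ** 2
                       + (uav_xy[1] - user_xy[1]) ** 2)
    return np.log2(1.0 + (P / N) * pathloss(d, shadowed))

def sum_rate(uav_xy, users_xy, shadowed_flags):
    """Integrand of the objective in Eq. (6): sum of per-user rates,
    given one LOS/shadow flag per user (e.g. from a shadowing map)."""
    return sum(user_rate(uav_xy, u, s)
               for u, s in zip(users_xy, shadowed_flags))
```

With this, the objective (6) corresponds to accumulating sum_rate(...) along the discretized trajectory.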
III. FUNDAMENTALS OF Q-LEARNING

Q-learning is a model-free reinforcement learning method first proposed by Watkins and further developed in 1992 [15]. It is classified as model-free because it has no internal representation of the environment.

Reinforcement learning in general proceeds in a cycle of interactions between an agent and its environment. At time t, the agent observes a state s_t ∈ S, performs an action a_t ∈ A and subsequently receives a reward r_t ∈ ℝ. The time index is then incremented and the environment propagates the agent to a new state s_{t+1}, from where the cycle restarts.

Q-learning specifically allows an agent to learn to act optimally in an environment that can be represented by a Markov decision process (MDP). Consider a finite MDP (S, A, P, R, γ) with state space S, action space A, state transition probability P_a(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a), reward function R_a(s, s') and discount factor γ ∈ [0, 1), which controls the importance of future rewards in relation to the present reward.

The goal for the agent is to learn a behavior rule that maximizes the reward it receives. A behavior rule that tells it how to select actions given a certain state is referred to as a policy and can in general be stochastic. It is given as

$$ \pi(a \mid s) = \Pr\left[a_t = a \mid s_t = s\right] \tag{7} $$

Q-learning is based on iteratively improving the state-action value function (or Q-function), which represents an expectation of the future reward when taking action a in state s and following policy π from then on. The Q-function is

$$ Q^{\pi}(s, a) = \mathbb{E}_{\pi}\{ R_t \mid s_t = s, a_t = a \} \tag{8} $$

where the discounted sum of all future rewards at the current time t is called the return R_t ∈ ℝ, given by

$$ R_t = \sum_{k=0}^{T-1} \gamma^{k} r_{t+1+k} \tag{9} $$

with discount factor γ ∈ [0, 1) as set in the MDP definition and the terminal state being reached at time t + T. Given the Q-function with perfect information Q^{π*}(s, a), an optimal policy can be derived by selecting actions greedily:

$$ \pi^{*}(a \mid s) = \arg\max_{a} Q^{\pi^{*}}(s, a) \tag{10} $$

From combining (8) and (9) it follows that R_t can be approximated in expectation based on the agent's next step in the environment. The central Q-learning update rule for making iterative improvements to the Q-function is therefore given by

$$ Q^{\pi}(s_t, a_t) \leftarrow Q^{\pi}(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q^{\pi}(s_{t+1}, a) - Q^{\pi}(s_t, a_t) \right] \tag{11} $$

with learning rate α ∈ [0, 1] determining to what extent old information is overridden, and discount factor γ ∈ [0, 1) balancing the importance of short-term and long-term reward. A γ approaching 1 makes the agent focus on gaining long-term reward, whereas choosing γ = 0 makes it consider only the immediate reward of an action. A value of γ = 1 could lead to diverging action values. Q-learning converges to the optimal policy regardless of the exploration strategy being followed, under the assumption that each state-action pair is visited an infinite number of times and the learning rate α is decreased appropriately [16].
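A minimal NumPy sketch of the greedy policy (10) and of one application of the update rule (11) is given below, assuming the Q-function is stored as a mapping from states to arrays of per-action values (the data layout is our choice; the default α and γ are the table-based values reported later in Section V-B).

```python
import numpy as np

def greedy_action(Q, s):
    """Greedy policy of Eq. (10): the action with the largest Q-value."""
    return int(np.argmax(Q[s]))

def q_update(Q, s, a, r, s_next, terminal, alpha=0.3, gamma=0.99):
    """One Q-learning update, Eq. (11)."""
    # At a terminal state there is no future reward to bootstrap from.
    bootstrap = 0.0 if terminal else gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (r + bootstrap - Q[s][a])
```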
IV. Q-LEARNING FOR TRAJECTORY OPTIMIZATION

A. Table-based Q-learning

In this section, we describe how the Q-learning algorithm was adapted for trajectory optimization of a UAV BS inside a simulated environment, with the Q-function being approximated by a four-dimensional table of Q-values. Each Q-value thereby represents a unique state-action pair, and a higher value relative to other values promises a higher return according to definition (8).

In order to promote initial exploration of the state space, the Q-table is initialized with high Q-values to entice the agent to visit each state-action pair at least once, a concept known as optimism in the face of uncertainty. After the UAV's position (x_0, y_0) and the time index t = 0 have been initialized, the agent makes its first movement decision according to the ε-greedy policy: with probability ε ∈ [0, 1] a random action is taken to explore the state space, and in all other cases the action that maximizes the Q-function is chosen. A balance must therefore be found between the share of random and non-random actions, which is referred to as the exploration-exploitation trade-off [16]. The probability ε for random actions is exponentially decreased over the learning time.

The agent's initial movement decision starts the first learning episode, which terminates upon reaching the maximum flying time T. Each new movement decision is evaluated by the environment according to the achieved sum rate computed with the channel model described in II-B, and a numerical reward based on the rate result is issued to the agent. The reward is then used to update the Q-value of the state and chosen action according to the rule defined in (11). As the drone is propagated to its new position and the time index t is incremented, the cycle repeats until the maximum flying time is reached and the learning episode ends. The random action probability ε is then decreased, and the drone position and time index are reset for the start of a new episode. The number of episodes must be chosen so that sufficient knowledge of the environment and network topology has accumulated through iterative updates of the Q-table.
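The training cycle just described can be summarized in the following Python sketch. The environment interface (reset/step), the optimistic initial Q-value and the exact form of the exponential ε decay are our assumptions; the number of episodes, the learning rate, the discount factor and the decay constant follow the values given in Section V-B.

```python
import numpy as np
from collections import defaultdict

def train_table(env, n_episodes=800_000, alpha=0.3, gamma=0.99,
                q_init=100.0, lam_action=14.0):
    """Table-based Q-learning for the UAV trajectory task.
    `env` is a hypothetical grid world with reset() -> state and
    step(a) -> (next_state, reward, done); states are (x, y, t) tuples
    and the four actions are up, right, down and left."""
    # Optimistic initialization ("optimism in the face of uncertainty").
    Q = defaultdict(lambda: np.full(4, q_init))
    for episode in range(n_episodes):
        # Exponentially decaying probability of taking a random action
        # (one plausible schedule with decay constant lambda_action = 14).
        eps = np.exp(-lam_action * episode / n_episodes)
        s, done = env.reset(), False
        while not done:
            if np.random.rand() < eps:
                a = np.random.randint(4)       # explore
            else:
                a = int(np.argmax(Q[s]))       # exploit
            s_next, r, done = env.step(a)
            # Update rule (11), without bootstrapping past the episode end.
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```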
B. NN-based Q-learning (Q-net)

Representation of the Q-function by a table is clearly not practical in large state and action spaces, as the table size increases exponentially. Instead, the Q-function can be represented by an alternative nonlinear function approximator, such as a neural network (NN) composed of connected artificial neurons organized in layers.

A model with two hidden layers, each with n_nodes = 100 neurons, proved adequate for a direct comparison with the standard table-based approach. Choosing the right architecture and learning parameters is in general a difficult task and has to be done through heuristics and extensive simulations. The NN input was chosen to contain only the minimal information of one state space sample, feeding the current position (x_t, y_t) of the drone and the time index t into the network, denoted Q^π_θ with NN parameters θ. Four output nodes directly represent the Q-values of the action space. The neural network was implemented using Google's TensorFlow library.

The basic procedure of the NN Q-learning algorithm is the same as in the table-based approach described in the previous section IV-A. However, during training the weights of the network are iteratively updated based on the reward signal such that the output Q-values better represent the achieved reward, using the update rule (11). To avoid the divergence and oscillations typically associated with NN-based Q-learning, the training process makes use of the replay memory and target network improvements described in [13].
V. SIMULATION

A straightforward simulated environment was set up to evaluate the Q-learning algorithm. The state space S = {x, y, t} was chosen to contain the position of the drone and the time. For simplicity, the available actions for the drone were limited to movement in four directions, A = {up, right, down, left}, within the plane of constant altitude H = 20. It follows that there are four Q-values representing the action space for each position in the grid and each time index, as shown in figure 2 exemplarily for one position and time index.

A. Environment

The simulated environment, as depicted in figure 2, is based on a 15 by 15 grid world which is populated at initialization with two static users at ground level and a cuboid obstacle with height equal to the constant altitude of the drone, standing on a 2 by 4 ground plane. The initial and final position of the UAV are set to the lower left corner, (x_0, y_0) = (x_f, y_f) = (0, 0).

The obstacle obstructs the LOS connection between users and UAV BS in part of the area. The signal strength in the shadowed part, shown in gray in figure 2, is reduced by a fixed factor of β_shadow = 0.01. After initialization of the environment and placement of users and obstacle, a shadowing map for the whole area is computed using ray tracing, which is then used as a lookup table during the learning process. In addition, random samples drawn at each new time index from a Rayleigh-distributed random variable X_Rayleigh modeling small-scale fading are used to compute the current information rate for each user according to the channel model equations (3) and (5). The resulting sum rate forms the basis of the reward signal.
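A minimal Python sketch of such an environment is given below; the class interface, the layout of the shadowing map and the omission of the reward-shaping terms of Section V-B are our simplifications, not the authors' implementation.

```python
import numpy as np

class UAVGridEnv:
    """Simplified 15x15 grid-world sketch: static users, a precomputed
    boolean shadowing map (True where the obstacle blocks the LOS to a
    user), episodes of fixed length T, reward = achieved sum rate."""

    def __init__(self, shadow_map, users, sum_rate_fn, size=15, T=50):
        self.shadow_map = shadow_map      # shape (size, size, n_users)
        self.users = users                # [(a_k, b_k), ...] user positions
        self.sum_rate_fn = sum_rate_fn    # e.g. sum_rate() sketch from Sec. II
        self.size, self.T = size, T
        self.moves = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # up, right, down, left

    def reset(self):
        self.x, self.y, self.t = 0, 0, 0  # start in the lower left corner
        return (self.x, self.y, self.t)

    def step(self, action):
        dx, dy = self.moves[action]
        # Keep the drone on the grid (boundary penalty omitted here).
        self.x = int(np.clip(self.x + dx, 0, self.size - 1))
        self.y = int(np.clip(self.y + dy, 0, self.size - 1))
        self.t += 1
        # The shadowing-map lookup replaces ray tracing at run time;
        # fresh Rayleigh fading is drawn inside the rate computation.
        shadowed = self.shadow_map[self.x, self.y]
        reward = self.sum_rate_fn((self.x, self.y), self.users, shadowed)
        return (self.x, self.y, self.t), reward, self.t >= self.T
```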
B. Learning Parameters

The main component of the reward signal is the achieved sum rate between users and BS. An additional negative reward is added if the action chosen by the agent would lead the UAV to step outside the 15 by 15 grid. A third component is added by a safety check that activates when the UAV fails to make the decision to return to the landing position before the maximum flying time T = 50 is reached. The safety system then forces the UAV to return while awarding a negative reward for each necessary activation of the system.

Except when the safety system is activated, movement decisions are made based on the ε-greedy policy described in IV-A. The probability ε for random actions is exponentially decreased over the learning time with decay constant λ_action = 14 for the NN and table-based approach alike.

No explicit rules exist for choosing the learning parameters and the parameters of the update rule (11) in general, which is why they have to be found through a combination of heuristics and search over the parameter space. For the table-based approximation, a combination of constant learning rate α_table = 0.3 and n_table = 800,000 learning episodes was selected. As the goal in our scenario, independent of the approach, is to maximize the sum rate over the whole flying time, the discount factor was set to γ = 0.99 to make the agent focus on long-term reward for both approaches.

The learning rate in the update rule (11) for the NN-based approach is set to α_nn = 1. Instead, the learning speed during NN training is controlled by the gradient descent step size, which is exponentially decayed with decay constant λ_gradient = 5 over the whole training time from a value of 0.005 to 0.00005. A number of n_nn = 27,000 learning episodes proved sufficient for the training of the NN.
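For illustration, the reward composition and the exploration decay described in this subsection could look as follows; the penalty magnitudes and the exact functional form of the ε schedule are not given in the paper and are therefore placeholders.

```python
import numpy as np

def epsilon(episode, n_episodes, lam_action=14.0):
    """Exponentially decaying exploration probability with decay constant
    lambda_action = 14; the normalized-exponential shape is an assumption."""
    return float(np.exp(-lam_action * episode / n_episodes))

def shaped_reward(sum_rate, left_grid, safety_activated,
                  out_penalty=-1.0, safety_penalty=-1.0):
    """Reward components of Section V-B: achieved sum rate, a negative
    reward for trying to step outside the 15x15 grid, and a negative
    reward for each activation of the return-to-landing safety system.
    The penalty values are placeholders."""
    return (sum_rate
            + (out_penalty if left_grid else 0.0)
            + (safety_penalty if safety_activated else 0.0))
```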
VI. RESULTS

The final trajectory learned by the table-based approach is depicted in figure 2, while the development of the resulting sum rate per episode during learning is shown in figure 3 for both approaches. It is important to note that the number of episodes in figure 3 is clipped, due to the fact that the table-based solution only converged after n = 800,000 episodes in comparison to n = 27,000 for Q-net. The NN approximator therefore shows a much higher training data efficiency, mainly because training data can be reused in the NN training. The sum rate shows a steep increase in roughly the first third of the learning phase, when the rough layout of the trajectory is learned. Exploration slows down in the later phases of the learning process, which means that only details in the trajectory change and the absolute impact on the sum rate is consequently small.

The final trajectory shows that the agent is able to infer information about the network topology and environment from the reward signal. Both approaches, the table-based and the NN approximator, converge to a trajectory with the same characteristics. Specifically, the agent's behavior shows that it learned the following:
• The UAV reaches the maximum cumulative rate point between the two users on a short and efficient path.
• It avoids flying through the shadowed area, keeping the sum rate high during the whole flying time.
• While the action space does not allow for hovering in one position, the drone learns to circle around the maximum cumulative rate point.
• The agent decides to return to its landing position in time to avoid crashing, and does so on an efficient trajectory.

In this simple environment, both approaches are able to find efficient trajectories in reasonable computation time. This changes for larger state spaces. Evaluating both approaches in a 30 by 30 grid environment with four randomly placed users and obstacles each showed that table-based learning is not able to find a trajectory outperforming random movement decisions within a realistic computation time. In the same environment, Q-net converges to a high sum rate trajectory within n_NN = 30,000 training episodes and a computation time of about one hour on a basic office computer.

VII. DISCUSSION

A. Summary

We have introduced a novel Q-learning system to directly make movement decisions for a UAV BS serving multiple users. The UAV acts as an autonomous agent in the environment to learn the trajectory that maximizes the sum rate of the transmission over the whole flying time, without the need for explicit information about the environment. We have formulated a maximization problem for the sum rate, which we solved iteratively by approximating the Q-function.

Our simulation has shown that the agent is able to learn the network topology and infer information about the environment to find a trajectory that maximizes the sum rate and lets the UAV autonomously return to its landing spot within the flying time limit. Comparing table-based and neural network approximators for the Q-function showed that using a table is not feasible for large state spaces, whereas training a NN provides the necessary scalability and proved to be more efficient, using less training data than table-based Q-learning.

B. Limitations and Future Work

The relatively high number of learning episodes needed to obtain the described results shows a limitation of choosing a Q-learning approach in comparison to the methods of previous works. This is a consequence of the generality of Q-learning and of the approach avoiding any assumptions about the environment. Integrating even a coarse model of the environment with an alternative model-based RL method would result in faster convergence, but would also entail a loss in learning universality. The long learning time is put into perspective by the fact that the main training can be completed offline, based on the prior distribution, before a shorter adaptation phase to the true setting. Future work will include considerations of dynamically changing environments, as well as a more detailed look at real-world constraints such as the energy efficiency of the learned trajectory.
Fig. 2. The final trajectory for the table-based approach after completed episode n_table = 800,000, depicted in the simulation environment with two users and one obstacle. The gray area is in the shadow of the obstacle. As a visualization of the table, the four Q-values, one for each action, are shown for the start position (0, 0). At time index t = 0, the Q-value for the action 'up' was learned to promise the highest future return.

Fig. 3. Sum information rate between UAV BS and users per episode over the learning time, comparing table-based and NN approximators of the Q-function. The plot only shows a clipped range of learning episodes, as the table-based solution converges after n_table = 800,000 episodes, whereas Q-net only needs n_NN = 27,000 episodes to come to a similar solution.

REFERENCES

[1] K. Namuduri, "Flying cell towers to the rescue," IEEE Spectrum, vol. 54, no. 9, pp. 38–43, Sep. 2017.
[2] M. Alzenad, A. El-Keyi, F. Lagum, and H. Yanikomeroglu, "3-D placement of an unmanned aerial vehicle base station (UAV-BS) for energy-efficient maximal coverage," IEEE Wireless Communications Letters, vol. 6, no. 4, pp. 434–437, Aug. 2017.
[3] J. Chen and D. Gesbert, "Optimal positioning of flying relays for wireless networks: A LOS map approach," in IEEE International Conference on Communications (ICC), 2017.
[4] R. Gangula, D. Gesbert, D.-F. Külzer, and J. M. Franceschi Quintero, "A landing spot approach to enhancing the performance of UAV-aided wireless networks," in IEEE International Conference on Communications (ICC) (accepted), 2018, Kansas City, MO, USA.
[5] R. Gangula, P. de Kerret, O. Esrafilian, and D. Gesbert, "Trajectory optimization for mobile access point," in Asilomar Conference on Signals, Systems, and Computers, 2017, Pacific Grove, CA, USA.
[6] Y. Zeng and R. Zhang, "Energy-efficient UAV communication with trajectory optimization," IEEE Transactions on Wireless Communications, vol. 16, no. 6, pp. 3747–3760, 2017.
[7] U. Challita, W. Saad, and C. Bettstetter, "Cellular-connected UAVs over 5G: Deep reinforcement learning for interference management," arXiv preprint arXiv:1801.05500, 2018.
[8] Y. Zeng, R. Zhang, and T. J. Lim, "Wireless communications with unmanned aerial vehicles: opportunities and challenges," IEEE Communications Magazine, vol. 54, no. 5, pp. 36–42, 2016.
[9] L. Gupta, R. Jain, and G. Vaszkun, "Survey of important issues in UAV communication networks," IEEE Communications Surveys & Tutorials, vol. 18, no. 2, pp. 1123–1152, 2016.
[10] C. Zhan, Y. Zeng, and R. Zhang, "Energy-efficient data collection in UAV enabled wireless sensor network," submitted to IEEE Wireless Communications Letters, available online at https://arxiv.org/abs/1708.00221, 2017.
[11] J. Gong, T.-H. Chang, C. Shen, and X. Chen, "Aviation time minimization of UAV for data collection over wireless sensor networks," arXiv preprint arXiv:1801.02799, 2018.
[12] B. Van der Bergh, A. Chiumento, and S. Pollin, "LTE in the sky: trading off propagation benefits with interference costs for aerial nodes," IEEE Communications Magazine, vol. 54, no. 5, pp. 44–50, 2016.
[13] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[14] A. Carrio, C. Sampedro, A. Rodriguez-Ramos, and P. Campoy, "A review of deep learning methods and applications for unmanned aerial vehicles," Journal of Sensors, vol. 2017, no. 3296874, 2017.
[15] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[16] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 2nd ed. Cambridge, Massachusetts: MIT Press, 2017.
