

Multi-Agent Deep Reinforcement Learning for Persistent Monitoring With Sensing, Communication, and Localization Constraints

Manav Mishra, Prithvi Poddar, Rajat Agrawal, Jingxi Chen, Pratap Tokekar, and P. B. Sujit, Senior Member, IEEE

Abstract—Determining multi-robot motion policies for persistently monitoring a region with limited sensing, communication, and localization constraints in non-GPS environments is a challenging problem. To take the localization constraints into account, in this paper, we consider a heterogeneous robotic system consisting of two types of agents: anchor agents with accurate localization capability and auxiliary agents with low localization accuracy. To localize itself, an auxiliary agent must be within the communication range of an anchor, directly or indirectly. The robotic team's objective is to minimize environmental uncertainty through persistent monitoring. We propose a multi-agent deep reinforcement learning (MARL) based architecture with graph convolution called Graph Localized Proximal Policy Optimization (GALOPP), which incorporates the limited sensor field-of-view, communication, and localization constraints of the agents along with persistent monitoring objectives to determine motion policies for each agent. We evaluate the performance of GALOPP on open maps with obstacles, with different numbers of anchor and auxiliary agents. We further study 1) the effect of communication range, obstacle density, and sensing range on the performance and 2) the performance of GALOPP compared with area partition, greedy search, random search, and random search with communication constraint strategies. For its generalization capability, we also evaluated GALOPP in two different environments – 2-room and 4-room. The results show that GALOPP learns the policies and monitors the area well. As a proof-of-concept, we perform hardware experiments to demonstrate the performance of GALOPP.

Note to Practitioners—Persistent monitoring is performed in various applications like search and rescue, border patrol, wildlife monitoring, etc. Typically, these applications are large-scale, and hence using a multi-robot system helps achieve the mission objectives effectively. Often, the robots are subject to limited sensing range and communication range, and they may need to operate in GPS-denied areas. In such scenarios, developing motion planning policies for the robots is difficult. Due to the lack of GPS, alternative localization mechanisms, like SLAM, highly accurate INS, UWB radio, etc., are essential. Having SLAM or a highly accurate INS system is expensive, and hence we use agents having a combination of expensive, accurate localization systems (anchor agents) and low-cost INS systems (auxiliary agents) whose localization can be made accurate using cooperative localization techniques. To determine efficient motion policies, we use a multi-agent deep reinforcement learning technique (GALOPP) that takes the heterogeneity in the vehicle localization capability, limited sensing, and communication constraints into account. GALOPP is evaluated using simulations and compared with baselines like random search, random search with ensured communication, greedy search, and area partitioning. The results show that GALOPP outperforms the baselines. The GALOPP approach offers a generic solution that can be adopted in various other applications.

Index Terms—Multi-agent deep reinforcement learning (MARL), persistent monitoring (PM), graph neural networks.

Manuscript received 18 November 2023; revised 3 February 2024; accepted 30 March 2024. The work of Manav Mishra was supported by the Prime Minister Research Fellowship (PMRF). The work of Pratap Tokekar was supported in part by NSF under Grant 1943368, in part by ONR under Grant N00014-18-1-2829, and in part by an Amazon Research Award. This article was recommended for publication by Associate Editor H.-J. Kim and Editor J. Li upon evaluation of the reviewers' comments. (Corresponding author: P. B. Sujit.)
Manav Mishra, Rajat Agrawal, and P. B. Sujit are with the Department of Electrical Engineering and Computer Science, IISER Bhopal, Bhopal 462038, India (e-mail: [email protected]; [email protected]; [email protected]).
Prithvi Poddar is with the Department of Mechanical and Aerospace Engineering, University at Buffalo, Buffalo, NY 14068 USA (e-mail: [email protected]).
Jingxi Chen and Pratap Tokekar are with the Department of Computer Science, University of Maryland, College Park, MD 20742 USA (e-mail: [email protected]; [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TASE.2024.3385412.
Digital Object Identifier 10.1109/TASE.2024.3385412

I. INTRODUCTION

Visibility-aware persistent monitoring (PM) involves continuous surveillance of a bounded environment by a single agent or a multi-agent system, taking limited field-of-view (FOV) constraints into account [1], [2], [3], [4], [5], [6], [7], [8], [9]. Various applications, like search and rescue, border patrol, critical infrastructure, etc., require persistent monitoring for timely information. Ideally, persistent monitoring requires spatial and temporal separation of a team of robots in a larger environment to cooperatively carry out effective surveillance. The problem becomes complex as the multi-robot systems are subjected to limited sensing range, communication range, and localization constraints due to non-GPS environments. In this paper, we study the problem of determining motion planning policies for each agent in a multi-agent system for persistently monitoring a given environment, considering all the constraints using a graph communication-based multi-agent deep reinforcement learning (MARL) framework.

Generating motion policies for each agent using deterministic strategies becomes challenging due to the above constraints, as the agents require complete information about all possible interactions with information sharing among the agents. Hence, developing alternate strategies for multi-agent systems to learn to monitor complex environments is imperative.

Fig. 1. Persistent monitoring in a 2-D environment using a team of anchor and auxiliary agents with FOV, localization, and communication range constraints.

We consider a scenario where a team of robots equipped with a limited field-of-view (FOV) sensor and limited communication range is deployed to persistently monitor a GPS-denied environment, as shown in Figure 1. As the environment does not support GPS, one can deploy agents that have expensive sensors such as tactical-grade IMUs or cameras/LIDARs in conjunction with high computational power to carry out onboard SLAM for accurate localization with very low position uncertainty; these agents are called anchor agents. However, such a system becomes highly expensive for deployments. On the other hand, we can deploy agents with low-grade IMUs that are cheaper but exhibit high drift, resulting in poor localization accuracy; these agents are called auxiliary agents. Auxiliary agents can be used in conjunction with external supporting localization units (like UWB ranging or cooperative localization [10], [11], [12]) to reduce localization uncertainty so that they are helpful in performing the coverage. Hence, as a trade-off, in this paper, we consider a robotic team consisting of anchor and auxiliary agents to monitor a region persistently.

The auxiliary agents can localize using the notion of cooperative localization by communicating with the anchor agents directly or indirectly through other auxiliary agents and hence have reduced uncertainty in their positional beliefs. As the auxiliary agents need to be in communication with anchor agents, their motion is restricted, which can result in lower monitoring performance as some areas may not be covered. However, intermittent connection with the anchor agents will enable auxiliary agents to recover from the localization uncertainty while maintaining coverage across all regions [13]. This conflicting objective of monitoring the complete area while periodically maintaining connectivity with the anchor agents makes the problem of determining persistent monitoring strategies for the agents challenging.

Simultaneously addressing numerous constraints is pivotal in this context due to the intricacies involved in persistent monitoring. In GPS-denied scenarios, conventional methods relying on GPS or beacons fall short. Moreover, assuming complete communication between agents is often unrealistic, making it imperative to develop strategies that operate under partial or intermittent connectivity.

Consider applications such as search and rescue in urban/remote regions affected by flooding. In these scenarios, UAVs can be used to monitor and enhance situational awareness. However, the cameras have a finite field-of-view, and the communication range with the base station is also limited. In addition to these two constraints, acquiring GPS becomes challenging in cloudy weather conditions. Therefore, it is necessary to determine motion policies for the agents while considering all these constraints.

In this paper, we propose Graph Localized Proximal Policy Optimization (GALOPP), a multi-agent proximal policy optimization [14] algorithm coupled with a graph convolutional neural network [15] to perform persistent monitoring with such heterogeneous agents subject to sensing, communication, and localization constraints. The persistent monitoring environment is modeled as a two-dimensional discrete grid, and each cell in the grid is allocated a negative reward. When a cell is within the sensing range of any agent, the reward value reduces to zero; otherwise, the negative reward accumulates over time. Thus, the agents must learn a motion strategy that minimizes the net penalty accumulated over time, demonstrating efficient persistent monitoring. We consider PPO in GALOPP because it is known for its stability, high sample efficiency, and resistance to hyperparameter tuning [14].

The approach presented in this paper addresses the challenge of balancing exploration for monitoring while adhering to localization constraints. Its novelty lies in the development of a model for guiding multiple decentralized agents in conducting monitoring in an uncertain environment. The primary contribution involves designing a decision-making strategy based on Multi-Agent Reinforcement Learning for effective area surveillance while considering sensing, communication, and localization constraints.

The main contributions of this paper are:

• Development of a multi-agent deep reinforcement learning algorithm (GALOPP) for persistently monitoring a region, taking the limited sensing range, communication, and localization constraints into account.
• Evaluating the performance of GALOPP for varying parameters – sensing area, communication ranges, the ratio of anchor to auxiliary agents, obstacle density, and centralized vs. decentralized map sharing.
• Comparing the performance of GALOPP to baseline approaches, namely, random search, random search with ensured communication, greedy search, and area partitioning with a lawn-mower sweeping strategy.

The rest of the paper is structured as follows. In Section II, we provide a review of the existing literature on this problem.


In Section III, we define the persistent monitoring problem with multiple agents. In Section IV, we describe the GALOPP architecture, and we evaluate the performance of GALOPP in Section V. In Section VI, the proof-of-concept of GALOPP performance using a team of nanocopters is described, and we conclude in Section VII.

II. RELATED WORK

The persistent monitoring problem can be considered as a persistent area coverage problem or as a persistent routing problem visiting a set of targets periodically. Under the persistent area coverage problem, one can consider the mobile variant of the Art Gallery Problem (AGP) [16], where the objective is to find the minimum number of guards to cover the area. There are several variants of AGP for moving agents under visibility constraints [17]. An alternative way for coverage is to use cellular decomposition methods, where the area can be decomposed into cells, and the agents can be assigned to these cells for coverage [18], [19]. In AGP and its variants, the visibility range is infinite but restricted by environmental constraints such as obstacles or boundaries.

In addressing the persistent routing problem, one can approach it using different variants of the multiple Watchman Route Problem (n-WRP) [20]. In these approaches, the goal is to find a route for each agent for monitoring while minimizing the latency in visit time. Yu et al. [1] propose a method for monitoring events with stochastic arrivals at multiple stations by combining stochastic modeling and control theory to optimize real-time monitoring and resource utilization. Tokekar and Kumar [17] propose a novel method for persistent monitoring using robot teams that employs a coverage path planning algorithm accounting for visibility constraints to optimize the coverage and path length trade-off. Wang et al. [8] propose a method for cooperative persistent surveillance on a road network using multiple Unmanned Ground Vehicles (UGVs) with detection capabilities. Lin and Cassandras [4] apply a decentralized control algorithm to consider agents' dynamics and monitoring constraints in real-time, enabling the solution of the problem by finding the optimal trajectory for each agent. Washington and Schwager [21] propose an RSVI algorithm for real-time drone surveillance policy optimization considering battery life and charging constraints, balancing the trade-off between surveillance coverage and energy consumption. Maini et al. [22] propose a coverage algorithm that considers the visibility constraints of robots to monitor linear features such as roads, pipelines, and power lines on terrains with obstacles. Mersheeva and Friedrich [7] develop a framework for multi-UAV monitoring with priorities that addresses the efficient allocation of UAVs while considering resource limitations.

The above-cited approaches assume agents with either unrestricted sensing and/or communication and with full localization. In contrast, our context involves agents with limited access to information in these aspects. Devising a control policy that allows for persistent area coverage while respecting localization constraints is challenging. In the absence of GPS, the agent is unaware of its true position, and the estimated position covariance steadily increases over time. Therefore, the above approaches cannot be applied directly, and modifying them to accommodate localization constraints is difficult. The interconnection of exploration and localization objectives increases the problem's complexity.

Another approach is to learn from the environment to determine agent paths while considering the sensing, communication, and localization constraints. Reinforcement learning can be one such learning-based approach that can learn to determine paths for multiple agents while considering all the constraints. Multi-agent reinforcement learning (MARL) based path planning literature focuses on developing efficient and effective algorithms for multi-agent systems on cooperative multi-agent tasks covering a broad spectrum of applications [23], [24], [25], [26], [27], [28]. Blumenkamp and Prorok [29] study inter-agent communication for self-interested agents in cooperative path planning but do not account for localization constraints and assume complete connectivity throughout. Omidshafiei et al. [23] formalize the concept of MARL under partial observability, which applies to scenarios with limited sensing range. Chen et al. [30] developed a method to find trajectories for agents to cover an area continuously, but with the assumption that all agents have full access to the environment due to unrestricted communication access among agents. In the above articles, the problem of determining motion policies for the agents considering the localization, sensing, and communication range constraints jointly has not been adequately addressed. In this work, through GALOPP, we address the problem of persistent monitoring considering all three constraints using a deep reinforcement learning framework.

III. PROBLEM STATEMENT

A. Persistent Monitoring Problem

We consider the persistent monitoring problem in a 2D grid world environment G ⊆ R² of size A × B. Each grid cell Gαβ, 1 ≤ α ≤ A and 1 ≤ β ≤ B, has a reward Rαβ(t) associated with it at time t. When the cell Gαβ is within the sensing range of an agent, then Rαβ(t) → 0; otherwise, the reward decays linearly with a decay rate Δαβ > 0. We consider a negative reward as it refers to a penalty on the cell for not being monitored. At time t = 0, Rαβ(t) = 0, ∀(α, β), and

$$R_{\alpha\beta}(t+1) = \begin{cases} \max\{R_{\alpha\beta}(t) - \Delta_{\alpha\beta},\ -R_{\max}\} & \text{if } G_{\alpha\beta} \text{ is not monitored at time } t \\ 0 & \text{if } G_{\alpha\beta} \text{ is monitored at time } t, \end{cases} \qquad (1)$$

where R_max refers to the maximum penalty a grid cell can accumulate so that the negative reward Rαβ is bounded.

The objective of the persistent monitoring problem is to find a policy for the agents that minimizes the neglected time, which in turn maximizes the total reward accumulated over G within a finite time T. The optimal policy is given as

$$\pi^{*} = \arg\max_{\pi} \sum_{t=0}^{T} \sum_{\alpha=1}^{A} \sum_{\beta=1}^{B} R^{\pi}_{\alpha\beta}(t), \qquad (2)$$

where π* is an optimal global joint-policy that dictates the actions of the agents in the multi-agent system, and R^π_αβ is the reward obtained by following a policy π.

Problem: Given a 2D grid world environment G, determine a joint-optimal policy π* for N agents to minimize the neglect time at each cell in G, taking sensing range, communication range, and localization constraints into account.
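To make the cell-reward dynamics concrete, the update in (1) and the accumulated reward that the policy in (2) maximizes can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; the grid size, decay rate, and R_max match the simulation values reported later in Section V, and the monitored mask is assumed to be supplied by the agents' sensing model.

```python
import numpy as np

A, B = 30, 30          # grid dimensions, as in the open-map experiments
DECAY = 1.0            # decay rate Delta_{alpha,beta}
R_MAX = 400.0          # maximum accumulated penalty per cell

R = np.zeros((A, B))   # R_{alpha,beta}(0) = 0 for every cell

def step_rewards(R, monitored_mask):
    """One application of Eq. (1): decay unmonitored cells, reset monitored ones."""
    R = np.maximum(R - DECAY, -R_MAX)   # linear decay, bounded below by -R_max
    R[monitored_mask] = 0.0             # cells seen by a localized agent go back to zero
    return R

# Summing R over an episode approximates the objective maximized in Eq. (2).
total_reward = 0.0
for t in range(1000):
    monitored = np.zeros((A, B), dtype=bool)   # would be filled from the agents' FOVs
    R = step_rewards(R, monitored)
    total_reward += R.sum()
```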


B. Localization for Persistent Monitoring

The grid G consists of N agents performing the monitoring task. The agents have a communication range ρ. At every time step, a connectivity graph G = ⟨V, E⟩ is generated between the agents. An edge connection e_ij is formed between agents i and j if dist(i, j) ≤ ρ, where dist(i, j) is the Euclidean distance between agents i and j. The connectivity of any agent with an anchor agent is checked by using the Depth-First Search (DFS) algorithm. Each agent estimates its position using a Kalman Filter (KF). The anchor agents have high-end localization capabilities; hence, their position uncertainty is negligible. However, the auxiliary agents can localize accurately if they are connected to an anchor agent, either directly or indirectly (multi-hop connection). As we do not consider any loss of information or cost associated with communication to an anchor agent, an auxiliary agent achieves localization upon observing an anchor agent. Consequently, a multi-hop communication to an anchor agent can localize another auxiliary agent not in direct communication contact with an anchor agent [12].

An agent located at position (α, β) has a field of view that covers a square region with dimensions g × g, where g = 2ℓ + 1, and the agent can sense ℓ cells in all cardinal directions.

Fig. 2. Sensing range of the agent: (a) agent position; (b) when the sensing range is ℓ = 1, the vehicle can sense a g = 3 × 3 grid of cells; (c) when ℓ = 2, the sensing grid becomes g = 5 × 5.

As the anchor agents are accurately localized, they can update the rewards Rαβ(t) in the grid world G, that is, set Rαβ(t) = 0. The auxiliary agents connected to the anchor agents either directly or via multi-hop connections can also update the rewards Rαβ(t) = 0. However, those auxiliary agents that are disconnected from the anchor agents can observe the world but cannot update the rewards, due to the localization uncertainty associated with an increase in the covariance of the vehicle. When the vehicle reconnects with the anchor vehicle network, its uncertainty is reduced, and it can update the rewards. The world that the auxiliary agent observes during disconnection is not considered, for simplicity.

An interesting aspect of solving Equation (2) to determine policies for the agents is that it does not explicitly assume that the graph network is always connected. Although a strict connectivity constraint increases the global positional belief of the entire team, it reduces the ability of the team to monitor any arbitrary region persistently due to the communication-constrained motion of the agents. Intermittent connectivity of agents leads to a better exploration of the area, allowing more flexibility [31], [32]. The auxiliary agents, once disconnected, do not contribute to the net rewards obtained by the team. Since the objective is to find a policy that maximizes the rewards, the problem statement enables the agents to learn that connectivity increases the rewards, so they should be connected. The connectivity constraints are indirectly implied through rewards and not hard-coded into the agent decision-making policy. We abstract the localization constraints through the connectivity graph G during decision-making, which is detailed in Section IV.

C. Using a Kalman Filter for State Estimation

In a cooperative localization (CL) setting, one way an auxiliary agent can localize is by observing an anchor agent. We assume that all the agents know their starting positions accurately.

To handle the position uncertainties, we apply a Kalman Filter (KF) [33] to update each agent's state mean and covariance. The KF propagates the uncertainty in the position of the auxiliary agent as long as it is unlocalized, and upon localization, the agent is made aware of its true location. The motion model of the auxiliary agent is

$$\mu_{t+1} = A_t \mu_t + B_t u_t + \epsilon_t, \qquad (3)$$

where μt and μt+1 are the positions of the agent at times t and t + 1, respectively, εt is a random variable representing the error in the prediction, drawn from a normal distribution with zero mean and covariance Rt, At = Bt = I2×2, and ut is the control input at time t. Upon observing another agent, the observation model can be formulated as

$$z_t = C_t \mu_t + \delta_t, \qquad (4)$$

$$C_t = \begin{bmatrix} \dfrac{a}{x_g - a} & 0 \\ 0 & \dfrac{b}{y_g - b} \end{bmatrix}. \qquad (5)$$

Here z_t = [a, b]^T is the relative position of the observed agent in the context of the KF observation model, where a = x_g − x and b = y_g − y. Here, (x_g, y_g) is the global position of the observed agent and (x, y) is the current position of the observing agent. δt is the error in the observation, drawn from a normal distribution with zero mean and covariance Qt. Given the motion and observation models, we can write the KF algorithm as mentioned in Algorithm 1.

Algorithm 1 KF(μ_{t−1}, Σ_{t−1}, u_t, z_t, gotObservation)
  μ̄_t = A_t μ_{t−1} + B_t u_t
  Σ̄_t = A_t Σ_{t−1} A_t^T + R_t
  if gotObservation then
    K_t = Σ̄_t C_t^T (C_t Σ̄_t C_t^T + Q_t)^{−1}
    μ_t = μ̄_t + K_t (z_t − C_t μ̄_t)
    Σ_t = (I − K_t C_t) Σ̄_t
    return μ_t, Σ_t
  else
    return μ̄_t, Σ̄_t
  end if

Based on the environment model, vehicle motion, and localization model, we introduce our proposed GALOPP multi-agent reinforcement learning architecture in the next section.
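The KF update of Algorithm 1 can also be written as a short NumPy function. The sketch below assumes A_t = B_t = I_{2×2} and takes R_t and Q_t as the prediction and observation noise covariances defined above; it is an illustration, not the reference implementation.

```python
import numpy as np

I2 = np.eye(2)

def kf_step(mu_prev, sigma_prev, u, R_t, Q_t, z=None, C=None):
    """One step of Algorithm 1 with A_t = B_t = I_{2x2}.

    mu_prev, sigma_prev : previous position mean (2,) and covariance (2, 2)
    u                   : control input (2,)
    R_t, Q_t            : prediction and observation noise covariances
    z, C                : relative observation and its matrix from Eqs. (4)-(5),
                          or None when no anchor-connected agent is observed
    """
    # Prediction: propagate the mean and covariance with the motion model (3)
    mu_bar = I2 @ mu_prev + I2 @ u
    sigma_bar = I2 @ sigma_prev @ I2.T + R_t

    if z is None:                        # unlocalized: uncertainty keeps growing
        return mu_bar, sigma_bar

    # Correction: Kalman gain and measurement update
    K = sigma_bar @ C.T @ np.linalg.inv(C @ sigma_bar @ C.T + Q_t)
    mu = mu_bar + K @ (z - C @ mu_bar)
    sigma = (np.eye(2) - K @ C) @ sigma_bar
    return mu, sigma
```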

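The connectivity model of Section III-B (edges whenever two agents are within communication range ρ, and localization whenever an agent can reach an anchor through the graph) can be sketched as follows. The positions, range, and anchor indices are illustrative, and the traversal below plays the role of the DFS check mentioned in the text.

```python
import numpy as np

def connectivity_graph(positions, rho):
    """Adjacency list of G = <V, E>: edge (i, j) iff Euclidean distance <= rho."""
    n = len(positions)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(np.asarray(positions[i]) - np.asarray(positions[j])) <= rho:
                adj[i].append(j)
                adj[j].append(i)
    return adj

def localized_agents(adj, anchors):
    """Agents connected (directly or via multi-hop) to at least one anchor."""
    visited = set(anchors)
    stack = list(anchors)            # depth-first traversal from every anchor
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in visited:
                visited.add(v)
                stack.append(v)
    return visited                   # auxiliary agents in this set may reset cell rewards

# Example: agents 0 and 1 are anchors, communication range rho = 20
positions = [(5, 5), (25, 25), (5, 20), (28, 3)]
adj = connectivity_graph(positions, rho=20)
print(localized_agents(adj, anchors=[0, 1]))
```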

Fig. 3. Complete pipeline consisting of GALOPP model with environmental interaction. The observations from each agent are processed by the ConvNet,
and the generated embeddings are passed to the GraphNet following the communication graph formed among the agents. The GraphNet processes the input
embeddings and generates aggregated information vectors that are passed through the actor network. The actor network generates a probability distribution
over the possible actions for each agent, and the agents execute the actions having the highest probability. The critic provides feedback to the actor about the
actions’ expected value with respect to achieving the RL objective.

Fig. 4. (a) Schematic representation of GALOPP architecture. Each agent block of the architecture represents an actor-critic model. (b) The mini-map is the
image of the environment G, resized to g × g. The local map is a g × g slice of the environment G centered around the agent. The mini-map and local map
are concatenated together to form the input oi for agent i.

IV. GRAPH LOCALIZED PPO - GALOPP

The multi-agent persistent monitoring task requires every individual agent to compute its policies using its own and the neighboring agents' observations. This makes computing the policy for an agent a non-stationary problem that can be tackled using either a centralized or a decentralized algorithm. A centralized approach will comprise a single actor-critic network to determine the agents' policy. Such an algorithm is faster to train and execute but is not scalable to many agents. The decentralized approach overcomes these shortcomings by assigning individual actor networks to each agent. However, training multiple networks can be computationally expensive. In this paper, we utilize the Centralized Training and Decentralized Execution (CTDE) [34] strategy. This helps in retaining the computational efficiency of centralized actor-critic and the robustness of decentralized actors.
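As a structural sketch of the CTDE arrangement described above, one decentralized actor per agent can be paired with a single critic that is only used during training. The layer sizes below are placeholders rather than the values given later in Table II; the 38-dimensional input matches the embedding dimension described in Section V-A.

```python
import torch
import torch.nn as nn

class ActorMLP(nn.Module):
    """Decentralized actor: one copy per agent, maps its aggregated vector to action logits."""
    def __init__(self, in_dim=38, n_actions=5):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, zeta_i):
        return torch.distributions.Categorical(logits=self.net(zeta_i))

class CriticMLP(nn.Module):
    """Centralized critic used during training only (CTDE)."""
    def __init__(self, in_dim=38):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, h_i):
        return self.net(h_i).squeeze(-1)

n_agents = 4
actors = nn.ModuleList([ActorMLP() for _ in range(n_agents)])  # decentralized execution
critic = CriticMLP()                                           # shared value estimator
```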


A. Architectural Overview

The complete GALOPP pipeline with environmental interaction is shown in Figure 3, while the GALOPP architecture details are shown in Figure 4a. The GALOPP architecture consists of a multi-agent actor-critic model that implements Proximal Policy Optimization (PPO) [14] to determine individual agent policies. Multi-agent PPO is preferred over other policy gradient methods to avoid large policy updates and achieve better learning stability in monitoring tasks. It also has better stability, high sample efficiency, and resistance to hyperparameter tuning [35].

The observation of agent i is denoted as oi, comprising a 2-channel image: the first channel, termed the local map, represents the locally observed visibility map, while the second channel consists of an independently maintained global map version, compressed to match the dimensions of the local map (as depicted in Fig. 4b) and referred to as the mini-map. The local map values depict a binary map indicating obstacle presence within the agent's geometric visibility constraint. The global map values represent the reward value heatmap of each cell in the grid, which is subsequently compressed to form the mini-map. This image is passed through a Convolutional Neural Network (ConvNet) [36] to generate individual embeddings hi for each agent, which are then augmented with agent i's positional mean μi and covariance Σi, as shown in Figure 4a. This is the complete information ζi of the agent's current state. This information vector ζi forms the node embedding of the graph G. It is then processed by a Graph Convolutional Network (GraphNet) [15] that enforces the relay of messages in the generated connectivity graph G to ensure inter-agent communication. The decentralized actors then use the embeddings generated by GraphNet to learn the policy, while a centralized critic updates the overall value function of the environment. The model is trained end-to-end for the persistent monitoring problem. The local computation involves updating the local map, the mean and covariance of the position, and each agent's maintained global map. The central computation is the computation of the joint policy for the multi-agent RL problem. The components of the GALOPP architecture are described in the subsections below.

B. Embedding Extraction and Message Passing

The GALOPP model inputs the shared global reward values in the 2D grid. The observation of an agent i at time t is the set of cells that are within the sensing range (termed the local map) and also a compressed image of the current grid (termed the mini-map) with the pixel values equal to the penalties accumulated by the grid cells [37]. Each agent has a separate copy of the mini-map. Each agent updates its copy of the mini-map, and the monitoring awareness is updated through inter-agent connectivity. Figure 6 illustrates a representation of how the decentralized map is updated. The connected agents compare and aggregate the global map at each time step for a network graph by taking the element-wise maximum for each grid cell Gαβ in the environment. The element-wise maximum value of each grid cell is shared among the connected agents. The mini-map is resized to the shape of the local map of the agent and then concatenated to form a 2-channel image (shown in Figure 4b). This forms the sensing observation input oi for the model at time t. The ConvNet converts the observation oi into a low-dimensional feature vector hi, termed the embedding vector. The positional mean μi and covariance matrix Σi of agent i are then flattened, and their elements are concatenated with hi to generate a new information vector ζi (as shown in Figure 4a).

The agents are heterogeneous (anchor and auxiliary), and the localization information is a parameter aggregated in the graph network component of GALOPP. An agent's aggregated information vector depends on its current position in the environment, the generated message embedding, and the localization status of each neighboring agent.

GraphNet transfers the information vector ζi to all agents within the communication graph. The agents take in the weighted average of the embeddings of the neighborhood agents. The basic building block of a GraphNet is a graph convolutional layer, which is defined as [15]:

$$H^{(k+1)} = \sigma\left(A_g H^{(k)} W^{(k)}\right), \qquad (6)$$

where H^(k) is the feature matrix of the k-th layer, with each row representing a node in the graph and each column representing a feature of that node. A_g is the graph's adjacency matrix, which encodes the connectivity between nodes. W^(k) is the weight matrix of the k-th layer, which is used to learn a linear transformation of the node features, and σ is a non-linear activation function, such as ReLU or sigmoid.

After the message passing, the aggregated information vector ζi′ for each agent i, for a GraphNet having k hidden layers, is given as

$$\zeta_i' = H^{(k)} = \sigma\left(A_g H^{(k-1)} W^{(k-1)}\right). \qquad (7)$$

The aggregated information vector ζ′ is now passed on to the actor-critic MLP networks. The actor network makes decisions for the agent, and a separate critic network evaluates the actor's actions to provide feedback, allowing the actor to improve its decision-making over time.
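The message-passing step in (6) and (7) can be written directly in PyTorch as a sketch. The paper's implementation uses the Deep Graph Library, whereas the standalone layer below only illustrates the operation; the node features ζ_i (a 32-dimensional ConvNet embedding concatenated with the positional mean and flattened covariance) and the adjacency matrix A_g are stand-ins, and no normalization of A_g is applied.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One layer of Eq. (6): H_{k+1} = sigma(A_g @ H_k @ W_k)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, A_g, H):
        return torch.relu(A_g @ self.W(H))

N, feat_dim = 4, 38                       # 4 agents; zeta_i = [h_i (32) | mu_i (2) | vec(Sigma_i) (4)]
zeta = torch.randn(N, feat_dim)           # stand-in for the per-agent information vectors
A_g = torch.tensor([[1., 1., 0., 0.],     # communication-graph adjacency (self-loops included)
                    [1., 1., 1., 0.],
                    [0., 1., 1., 1.],
                    [0., 0., 1., 1.]])

gnn = GraphConvLayer(feat_dim, feat_dim)
zeta_agg = gnn(A_g, zeta)                 # aggregated vectors zeta'_i of Eq. (7), one row per agent
```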


Fig. 5. (a) Outline of the open map. (b) The agents cannot move into black pixels, while the non-black regions need to be persistently monitored. As the
anchor agents (red stars) and auxiliary agents (dark blue triangles) monitor, their trajectory is shown as the fading white trails for the last 30 steps. The
communication range between the agents is shown in red lines. (c) The trajectories of the anchor and auxiliary agents while monitoring are shown by the red
and blue lines, respectively.

C. Multi-Agent Actor-Critic Method Using PPO

The decentralized actors in the multi-agent PPO take in the aggregated information vector ζi′ and generate the corresponding action probability distribution π. The action space consists of five discrete options: {up, down, left, right, stay}, representing decisions to move in one of the four cardinal directions or to remain in the current location.

The centralized critic estimates the environment's value function to influence the individual actors' policies (Figure 3). The shared reward for all agents is defined in Equation (2). For a defined episode length T, the agent interacts with the environment to generate and collect the trajectory values in the form of states, actions, and rewards {si, ai, ri}. The stored values are then sampled iteratively to update the action probabilities and to fit the value function through back-propagation. Let θ1 be the actor trainable parameter and θ2 be the critic trainable parameter. The discounted return measures the long-term value of a sequence of actions and is given as G(t; θ1) = Σ_{k=0}^{T} γ^k r(t + k + 1; θ1), where γ ∈ [0, 1) is the discount factor and T is the episode time horizon. The value function V(s_t^i; θ2) represents the expected long-term reward that agent i can expect to receive if it starts in state s at time t. It is updated as the agent interacts with the environment and learns from its experiences. The value function estimate, which is defined as V(s_t^i; θ2) = E[G(t) | s_t^i], is provided by the critic network. The advantage estimate function Âi is a measure of how much better a particular action is compared to the average action taken by the current policy. It is defined as the difference between the discounted return and the state value estimate, given by

$$\hat{A}_t^i(\theta_1, \theta_2) = G(t; \theta_1) - V(s_t^i; \theta_2). \qquad (8)$$

PPO uses the advantage function to adjust the probability of selecting an action to make the policy more likely to take actions with a higher advantage. This helps ensure that the policy makes the most efficient use of its resources and maximizes the expected reward over time [14]. The modified multi-agent PPO objective function to be minimized in the GALOPP network is given as

$$L(\theta_1, \theta_2) = \frac{1}{m} \sum_{m} \left[\frac{1}{N} \sum_{i=1}^{N} L_i^{CLIP}(\theta_1, \theta_2)\right], \qquad (9)$$

where N is the total number of agents, m is the mini-batch size, and L_i^{CLIP}(θ1, θ2) refers to the clipped surrogate objective function [14] defined as

$$L_i^{CLIP}(\theta_1, \theta_2) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta_1)\,\hat{A}_t^i(\theta_1, \theta_2),\ \mathrm{clip}\big(r_t(\theta_1), 1-\epsilon, 1+\epsilon\big)\,\hat{A}_t^i(\theta_1, \theta_2)\right)\right], \qquad (10)$$

where r_t(θ1) = π_{θ1}/π_{θ1}^{old} is the ratio of the current policy's (π_{θ1}) action probability to the previous policy distribution π_{θ1}^{old}. The clip function clips the probability ratio r_t(θ1) to the trust-region interval [1 − ε, 1 + ε].

GALOPP is trained end to end by minimizing the modified multi-agent PPO objective function using the trajectory values collected from the interactions with the environment. The algorithm updates the action probabilities and fits the value function through back-propagation, allowing the model to learn from experience and improve its performance over time.
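A minimal PyTorch rendering of the clipped objective in (9) and (10) is sketched below. The log-probabilities, advantages, and clip parameter are placeholders, and the sign is flipped so the quantity can be minimized with a standard optimizer, as stated for Equation (9); this is not the training code.

```python
import torch

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """L_i^CLIP of Eq. (10), averaged over a mini-batch of timesteps."""
    ratio = torch.exp(logp_new - logp_old)                         # r_t(theta_1)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.min(unclipped, clipped).mean()

def galopp_loss(per_agent_logp_new, per_agent_logp_old, per_agent_adv):
    """Eq. (9): average the clipped surrogate over the N agents.

    Each argument is a list with one tensor of shape (m,) per agent.
    The result is negated so minimizing it maximizes the surrogate.
    """
    N = len(per_agent_logp_new)
    total = sum(clipped_surrogate(ln, lo, a)
                for ln, lo, a in zip(per_agent_logp_new, per_agent_logp_old, per_agent_adv))
    return -(total / N)
```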


Fig. 6. (a) Illustration of decentralized map-sharing among agents in persistent monitoring. (b) Overview of how agents within communicable range of one another update their global maps in a decentralized setting. The resultant global map is generated by taking the element-wise maximum value from the individual global maps of the agents.

V. EXPERIMENTS AND ANALYSIS

We evaluate the performance of GALOPP in an open map environment, as shown in Figure 5. The open map has an area of 30 × 30 sq. units, where 5 obstacles having random geometry are placed. The agents have a sensing range of ℓ = 7 in the 2D environment. We use the accumulated reward metric to evaluate the performance. The total reward at time t is defined as R(t) = Σ_{α,β} Rαβ(t). The grid cells' penalties are updated with a decay rate of Δαβ = 1, ∀(α, β). A cell's maximum negative reward is R_max = 400. The simulation parameters used in the experiments are detailed in Table I.

TABLE I
SIMULATION PARAMETERS FOR GALOPP

A. Model

GALOPP was trained and tested using Python 3.6 on a workstation with the Ubuntu 20.04 LTS operating system, with an Intel(R) Core(TM) i9 CPU and an NVIDIA GeForce RTX 3090 GPU (running on CUDA 11.1 drivers). The neural networks were written and trained using PyTorch 1.8 and dgl-cu111 (the Deep Graph Library). We now provide details of the various parameters used in the model. The GALOPP architecture consists of 4 deep neural networks: ConvNet, GraphNet, Actor MLP, and Critic MLP, as shown in Figure 3. The details of this architecture are given in Table II.

TABLE II
PARAMETERS FOR THE NEURAL NETWORKS

1) Embedding Generator (ConvNet): This convolutional neural network takes a 2-channeled 15 × 15 image (local map and mini-map) as the input and generates a 32-dimensional feature vector. We then append a 6-dimensional state vector to this feature vector (positional mean and covariance) to form a 38-dimensional feature vector that acts as the embedding for the graph convolution network. The state vector is derived by flattening the agent's covariance matrix Σt and appending it to the position vector μt.

2) Graph Convolution Network (GraphNet): The embeddings generated by the embedding generator are passed through a single-layered feed-forward graph convolution network to generate the embeddings for the actor networks of the individual agents.

3) Actor MLP: The actor takes the embeddings generated by the ConvNet and the aggregated information vector from the GraphNet as the input and generates the probability distribution over the available actions.

4) Critic MLP: The critic network takes the embeddings generated by the ConvNet for each agent and returns the state-value estimate for the current state.

5) Training: The training is carried out for 30000 episodes, where each episode is of length T = 1000 time steps. The agents are initialized randomly in the environment for every training episode but are localized during initialization. The GALOPP architecture input at time t is the image representing the state of the grid G, which is resized to an image of dimension 15 × 15 using OpenCV's INTER_AREA interpolation method and concatenated with the local visibility map of the agent, forming a 2-channeled image of dimension 15 × 15. The action space has five actions: up, down, left, right, and stay. Each movement action shifts the agent by one pixel.

6) Evaluation: For testing the learned policies, we evaluate them for 100 episodes, each episode for T = 1000 time steps, in their respective environments. The reward for test episode τ is denoted by R_τ^ep = Σ_{t=1}^{T} R(t), and the final reward R_avg after n = 100 episodes is calculated as R_avg = (1/n) Σ_{τ=1}^{n} R_τ^ep. The R_avg is used to evaluate the model's performance. Next, we will evaluate the performance of GALOPP under different parameters.
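Putting the input pipeline of Sections IV-B and V-A together, a sketch of how one agent's 2-channel observation could be assembled is shown below. The cv2.resize call with INTER_AREA follows the interpolation named under Training; the zero-padding and cropping details are assumptions for illustration, not the authors' preprocessing code.

```python
import cv2
import numpy as np

G_SIZE = 15          # local map / mini-map side length (g = 2*l + 1 with l = 7)

def build_observation(reward_map, local_visibility, agent_rc):
    """Assemble the 2-channel observation o_i for one agent.

    reward_map       : (A, B) array of accumulated cell penalties (agent's global-map copy)
    local_visibility : (A, B) binary obstacle/visibility map observed by the agent
    agent_rc         : (row, col) cell position of the agent
    """
    # Channel 1: g x g slice of the environment centred on the agent (zero-padded at borders)
    pad = G_SIZE // 2
    padded = np.pad(local_visibility, pad, mode="constant")
    r, c = agent_rc
    local = padded[r:r + G_SIZE, c:c + G_SIZE]

    # Channel 2: the agent's global reward map compressed to g x g (the mini-map)
    mini = cv2.resize(reward_map.astype(np.float32), (G_SIZE, G_SIZE),
                      interpolation=cv2.INTER_AREA)

    return np.stack([local.astype(np.float32), mini], axis=0)   # shape (2, 15, 15)
```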


Fig. 7. Comparison of the average reward on increasing the communication range of the agents in the open-map environment.

B. Effect of Increase in Communication Range

With an increase in communication range, the agents are able to communicate as well as localize better while reaching various locations in the environment. A lower communication range can keep agents close to each other, and hence the agents are unable to explore and cover different regions, making for an ineffective strategy. We consider a system comprising 2 anchor agents and 2 auxiliary agents and vary the communication range from ρ = 10 units to ρ = 30 units, with an increment of 5 units. We evaluate the performance of GALOPP under different communication ranges as shown in Fig. 7.

From the figure, we can see that with a reduced communication range of 10 and 15, the agents are unable to monitor the region properly, hence resulting in higher negative rewards. As we increase the communication range to 20, the performance improves as the agents can communicate better while maintaining localization accuracy. However, increasing the communication range beyond 20 yields only a marginal improvement in performance at the cost of a higher communication range. These results are intuitive. However, they provide insight into the selection of the communication range for the rest of the simulations. Based on these results, we consider ρ = 20 for the rest of the analysis.

C. Effect of Varying Sensing Range

The size of the local map is dependent on the sensing range ℓ, which we measure in terms of the number of cells that can be observed. As the sensing range ℓ increases, the number of observed cells g × g also increases, where g = 2ℓ + 1, resulting in a decrease in penalties. Intuitively, with an increase in sensing range, the reward improves, which can be seen in Figure 8. The difference in performance between ℓ = 5 and ℓ = 6 is significant; however, the performance improvement is lower when we further increase the sensing range to ℓ = 7. Based on these trends, if we further increase the sensing range, the improvement will be marginal. Hence, we consider a sensing range of ℓ = 7 for the rest of the simulations. Note that during this evaluation, we use a communication range of ρ = 20, as fixed from the previous analysis.

Fig. 8. Comparison of the average reward of the model on decreasing the local sensing map range. The local map is the agent's visibility range in the environment.

D. Effect of an Increasing Number of Agents and Varying Anchor-Auxiliary Ratio

The ability to monitor the environment adequately depends on the number of agents present in the environment and also on the ratio of anchor to auxiliary agents. To understand this effect, we carry out simulations varying the number of agents from 2 to 5. For a given number of agents, we vary the number of anchors to understand the performance-to-cost benefits associated with a higher number of anchors. Figure 9 shows the model performance for a varying number of agents in the environment. First, let us consider the effect of an increase in the number of agents with a single anchor. From the figure, we can see that with an increase in the number of agents, the coverage is higher, and hence there is an improvement in the average rewards. However, as we increase the number from 4 to 5, the improvement is marginal because four agents are sufficient to cover the region, and hence adding more agents does not increase the rewards significantly.

Fig. 9. Effect of increasing the total number of agents in the environment. For a given number of agents, we study the effect of increasing the number of anchor agents k ≤ N for N agents in the environment.

For a given number of agents, let us now analyze the effect of the number of anchor agents. For 2 agents, having both as anchors enables the agents to cover the area better, and since these two agents have high localization accuracy, they can work independently, thus improving on the performance of a single anchor. When we increase the number of anchors for the 3, 4, and 5 agent cases, we can see that increasing the number of anchors shows only a marginal improvement. Hence, we can obtain good coverage accuracy with a lower number of anchor agents while ensuring there are 2 or more auxiliary agents. With a lower number of anchors, the deployment cost can be reduced significantly.

E. Effect of Increasing Obstruction in the Environment

The model should have the robustness to perform well under different percentage obstructions in the environment. However, as the percentage of obstruction increases, the difficulty in monitoring also increases. In order to validate this hypothesis, we perform simulations with varying obstacle percentages in the environment. For each episode, the obstacles for a given percentage are randomly generated and placed. Figure 10 shows the performance of GALOPP for varying percentage obstruction. From the figure, we can see that when the obstruction is low (5-15%), the GALOPP model is able to learn to change the paths so that the rewards are maximized. However, with further increases in obstacle density (20-30%), learning becomes difficult due to environmental constraints and, hence, there is a reduction in performance. When we look at the percentage of disconnections that happen due to environmental changes, for 5-15% obstacle density, the disconnections are less than 10%. However, with an increase in the obstacle density, the motion constraints for the agents also increase. Due to this, the agents are unable to explore remote regions, resulting in reduced performance as shown in Figure 10. Because the agents are unable to disconnect and explore, they remain connected, resulting in a lower percentage of disconnection time.

Fig. 10. Comparison of the average reward on increasing the percentage obstruction in the environment by increasing the number of obstacle blocks.


Fig. 11. Comparison of the percent time of disconnection for auxiliary agents on increasing the percentage occlusion in the environment.

F. Comparison Between Centralized Maps vs. Decentralized Maps

In GALOPP, agents are trained using a decentralized mini-map, where each agent maintains a separate copy of the global map that is updated when agents are within a communicable range. We compare the performance of the decentralized global map approach to a centralized approach, where a shared global map is maintained among all agents. To accomplish this comparison, agents within the communication range of each other compare and aggregate the global map at each time step by taking the element-wise maximum for each grid cell in the environment, as shown in Figure 6. In order to determine the difference in performance between centralized map sharing and decentralized map sharing, simulations were carried out, and Figure 12 shows the performance difference. The simulation setting for the comparison is two anchor agents and two auxiliary agents with a sensing range of 7 cells and a communication range of 20.

Fig. 12. Comparison between centralized and decentralized execution.

From the figure, we can see that the centralized map model performs marginally better than the decentralized map model, but statistically, both strategies perform similarly. The result shows that using decentralized maps is a good alternative to centralized maps. This suggests that the decentralized approach in GALOPP can achieve similar performance to a centralized approach while still providing the benefits of decentralization in maintaining its local observation.

G. Comparison Between GALOPP and Non-RL Baselines

Due to the localization constraints in the persistent monitoring problem, the motion of both anchor and auxiliary agents becomes coupled. This complexity makes it highly challenging to devise deterministic motion strategies for these heterogeneous agents. Given the unique coupled objective of coverage and localization in our approach, existing literature has limitations in simultaneously addressing both aspects, which in turn complicates the identification of suitable baselines. Since there is limited prior research on persistent monitoring with constraints on communication, localization, and sensing range, we compare our approach with heuristic-based algorithms custom-designed for persistent monitoring.

As a result, we assess the performance of our model against four custom-designed non-reinforcement learning baselines: random search (RS), random search with ensured communication (RSEC), greedy search (GS), and lawn-mower area sweep (LMAS).

Fig. 13. Performance comparison of GALOPP with heuristic baselines: Random Search (RS), Random Search with Ensured Communication (RSEC), Greedy Search (GS), and Lawn Mower Area Sweep (LMAS).

1) Random Search (RS): In the RS method, agents make decisions independently at each time step by randomly selecting an action (stay, up, down, right, left). This approach does not require any prior knowledge of the problem domain or any model of the system dynamics. Because of the random decisions, communication may break, resulting in lower performance.

2) Random Search With Ensured Communication (RSEC): RSEC is an extension of the RS method in which each agent randomly selects an action while ensuring that no auxiliary agent becomes unlocalized. In other words, the RSEC approach guarantees that all agents remain localized at all times. If an action is selected that would cause the agent or another auxiliary agent to become unlocalized, the agent randomly selects another action from the remaining action space until a suitable action is found.

3) Greedy Search (GS): In GS, agents act independently and greedily. Assume that agent i is in cell (α, β), and we define Ni as the set of neighboring cells that agent i can reach in one time step (that is, all the cells when ℓ = 1). Agent i selects the cell that has the maximum negative reward, without considering localization constraints. If all the grid cells in Ni have the same reward, then agent i chooses a random action.
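The GS action rule described above can be sketched as follows; the action-to-offset mapping and the random tie-breaking are illustrative interpretations of the text, restricted to the cells reachable with the five available actions.

```python
import random

# action -> (row offset, col offset); "stay" keeps the current cell
MOVES = {"stay": (0, 0), "up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def greedy_action(R, pos):
    """Pick the reachable cell with the largest accumulated penalty.

    R   : 2-D array of cell rewards (non-positive; more negative = more neglected)
    pos : (row, col) of the agent
    """
    A, B = R.shape
    best, best_val = [], None
    for action, (dr, dc) in MOVES.items():
        r, c = pos[0] + dr, pos[1] + dc
        if not (0 <= r < A and 0 <= c < B):
            continue
        val = R[r, c]
        if best_val is None or val < best_val:      # most negative reward = most neglected
            best, best_val = [action], val
        elif val == best_val:
            best.append(action)
    return random.choice(best)                      # random tie-break, as in the text
```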

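For the decentralized map sharing of Section V-F (and Figure 6), merging the global-map copies of one connected set of agents reduces to an element-wise maximum, sketched below. Since cell rewards are non-positive and a freshly monitored cell is reset to zero, the maximum keeps the most recent observation of each cell; the `maps` list and `component` index set are assumptions for illustration.

```python
import numpy as np

def merge_global_maps(maps, component):
    """Element-wise maximum of the global-map copies of one connected component.

    maps      : list of 2-D reward arrays, one per agent
    component : list of agent indices that are currently connected to each other
    """
    merged = maps[component[0]].copy()
    for idx in component[1:]:
        np.maximum(merged, maps[idx], out=merged)   # keep the freshest value per cell
    for idx in component:                            # every connected agent adopts the merged map
        maps[idx] = merged.copy()
    return merged
```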

Fig. 14. Visualization of maps: (a) Illustrates a 2-room map, and (c) Illustrates a 4-room map. The agents cannot move into black pixels, while the non-black
regions need to be persistently monitored. As the anchor agents (red stars) and auxiliary agents (dark blue triangles) monitor, their trajectory is shown as the
fading white trails for the last 30 steps. The communication range between the agents is shown in red lines. Fig (b) and Fig (d) display the trajectories of the
anchor and auxiliary agents while monitoring for the 2-room and 4-room maps, respectively.

4) Area Partition With Lawn-Mower Sweep (LMAS): The LMAS strategy [18], [38] begins with partitioning the area into cells (or sections) equal to the number of agents deployed and placing each agent at a chosen starting position within one of these sections. The agent is then programmed to follow a specific lawn-mower movement pattern within each section, typically involving a back-and-forth motion. The agent repeats this pattern within its section until it has covered the entire area, ensuring complete coverage. Each agent makes use of its onboard local map to avoid obstacles by changing its trajectory in the presence of obstacles.

We carried out 100 simulations for each non-RL baseline strategy, and Figure 13 shows the performance comparison between the baseline strategies and GALOPP. From the figure, we can see that GALOPP consistently outperforms the above-defined baseline strategies. This is attributed to GALOPP's ability to explicitly account for localization and connectivity constraints in its decision-making process. Among the baselines, Random Search (RS) exhibits the poorest performance, as it relies on random actions without considering any context. Random Search with Ensured Communication (RSEC) improves upon RS by enforcing communication and localization, resulting in enhanced performance. Greedy Search (GS) leads to sub-optimal policies as each agent acts greedily and independently at each timestep. Lawn Mower Area Sweep (LMAS) performs better compared to GS, but it does not explicitly incorporate localization considerations and is influenced by the specific geometry of the environment being monitored. GALOPP consistently outperforms all the mentioned baselines.

H. Evaluation in Other Environments

In order to test the ability of GALOPP to perform in other types of complex environments, we evaluate its performance in two-room and four-room environments, as shown in Figures 14a and 14c, respectively.

For the two-room map, the agents learn to maintain contact with each other by spreading across the two rooms and the corridor. In the 2-room map, we notice that our algorithm ends up with the agents in a formation where two of them position themselves in the two rooms while one monitors the corridor. This can be seen in Figure 14a, where the faded cells show the trajectory followed by the agents for the last 30 steps. Figure 14b shows the areas where each agent was present. From this, we can see that the anchor was in the middle region while the two auxiliary agents monitored the two rooms. The anchor agent moves around to maximize rewards, while the auxiliary agents move within the two rooms. In fact, this is the best combination for the agents, and they learn it quickly.

In the four-room map, GALOPP learns a policy in which each of the four agents is responsible for monitoring separate rooms while intermittently monitoring the central corridor region, as shown in Figures 14c and 14d. The anchor agents are positioned to monitor two cells and the central area, while the auxiliary agents are responsible for monitoring the two rooms.

Our results show that GALOPP is capable of adapting to complex environments and learning effective policies for multi-agent coordination. The ability of the agents to maintain contact with each other and cover all areas of the environment is crucial for the successful completion of tasks, and GALOPP demonstrates its ability to achieve this.

VI. HARDWARE IMPLEMENTATION

We implement GALOPP on a real-time hardware setup for proof-of-concept purposes. We use multiple Bitcraze Crazyflie 2.1 [39] nanocopters as agents. The experimental setup consists of four SteamVR Base Stations [40] and the Lighthouse Positioning System [41] to track the location of the vehicles within a 3.5 m × 3.0 m × 2.0 m arena. The agents communicate with a companion computer (running Ubuntu 20.04 with an AMD Ryzen 9 5950X with a base clock speed of 3.4 GHz) via a Crazyradio telemetry module, where the trained GALOPP model is executed. In the experiment, we consider the environment shown in Fig. 15a with 2 auxiliary agents and 1 anchor agent. The companion computer receives the position of each Crazyflie as input via the corresponding rostopics from the Crazyswarm ROS package [42], [43]. The respective agents then execute the actions computed by the actor networks. To avoid inaccuracies in tracking the Crazyflies caused by physical obstacles obstructing the infrared laser beams from the base stations, we opt to simulate the obstacle boundaries. The model policy implemented in the simulation ensures that the agents never collide with any obstacle.

The video of the hardware implementation can be seen in [44].


Fig. 15. Snapshots from the hardware implementation with one anchor and two auxiliary agents. (a) A rendered simulation snapshot of the monitoring task. (b) Real-time decision-making performed by the trained GALOPP network model. (c) Trajectory trails over the previous timesteps taken by the agents during the monitoring task.

The video of the hardware implementation can be seen in [44]. Figure 15(a) shows a snapshot of the simulated environment along with the agent positions (anchor and auxiliary), the current coverage, and the position of the obstacle. We then implement the same scenario with virtual obstacles on the hardware, where the model sends the control signals to the vehicles, as shown in Figure 15(b). In Figure 15(c), we can see that the agent trajectories cover all the regions, thereby achieving persistent monitoring.
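For readers reproducing the setup, the following is a simplified sketch of the hardware-in-the-loop control cycle described above, assuming the pycrazyswarm Python interface of the Crazyswarm package [42]. The galopp_actions placeholder stands in for the trained GALOPP actor networks, and the takeoff height, step duration, and number of steps are illustrative assumptions rather than the exact values used in the experiment.

# Sketch of the companion-computer control loop (see assumptions above).
import numpy as np
from pycrazyswarm import Crazyswarm

NUM_STEPS = 100          # illustrative length of the monitoring run
STEP_DURATION = 1.0      # seconds allotted to each waypoint move

def galopp_actions(positions):
    """Placeholder for the trained GALOPP actor networks: maps the current
    agent positions to one metric (x, y, z) waypoint per agent."""
    return [p + np.array([0.1, 0.0, 0.0]) for p in positions]  # dummy motion

swarm = Crazyswarm()                      # connects to the Crazyflies via Crazyradio/ROS
timeHelper = swarm.timeHelper
cfs = swarm.allcfs.crazyflies

swarm.allcfs.takeoff(targetHeight=0.5, duration=2.0)
timeHelper.sleep(2.5)

for _ in range(NUM_STEPS):
    positions = [cf.position() for cf in cfs]     # positions from the Lighthouse tracking
    waypoints = galopp_actions(positions)         # actions from the actor networks
    for cf, wp in zip(cfs, waypoints):
        cf.goTo(wp, yaw=0.0, duration=STEP_DURATION)
    timeHelper.sleep(STEP_DURATION)

swarm.allcfs.land(targetHeight=0.04, duration=2.0)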
VII. CONCLUSION AND FUTURE WORK

This work developed GALOPP, a MARL algorithm with a graph-based connectivity approach, for persistently monitoring a bounded region while taking the communication, sensing, and localization constraints into account via graph connectivity. The experiments show that agents using GALOPP outperform four custom baseline strategies for persistent area coverage while respecting the connectivity bounds. We also establish the robustness of our approach by varying the sensing map, increasing the percentage of the area occluded by obstacles, and scaling the number of anchor agents in the system. Increasing the number of anchor agents improves performance, but beyond a certain value there are diminishing returns on the rewards obtained. Based on power and resource constraints, one can therefore select a subset of anchor agents to achieve persistent surveillance effectively.

Although our experiments demonstrate that GALOPP surpasses the baseline strategies, future work could investigate the algorithm's scalability as the number of agents significantly increases. Exploring optimal values for the decay rate and the maximum negative reward, with a focus on increasing monitoring efficiency, presents a promising area for further research. Additionally, the algorithm's suitability for diverse sensor types, such as cameras or LIDAR sensors, could be explored to improve the agents' situational awareness. Further research on the impact of different types of obstacles, including moving obstacles, on the algorithm's performance would also be insightful. While the proposed algorithm targets heterogeneous agents in the persistent monitoring problem, future research can investigate its generalizability to other monitoring problems, such as target tracking or environmental monitoring. This work provides a foundation for future investigations of GALOPP's performance and its potential applications in various monitoring scenarios.

REFERENCES

[1] J. Yu, S. Karaman, and D. Rus, "Persistent monitoring of events with stochastic arrivals at multiple stations," IEEE Trans. Robot., vol. 31, no. 3, pp. 521–535, Jun. 2015.
[2] S. L. Smith, M. Schwager, and D. Rus, "Persistent monitoring of changing environments using a robot with limited range sensing," in Proc. IEEE Int. Conf. Robot. Autom., May 2011, pp. 5448–5455.
[3] S. K. K. Hari, S. Rathinam, S. Darbha, K. Kalyanam, S. G. Manyam, and D. Casbeer, "The generalized persistent monitoring problem," in Proc. Amer. Control Conf. (ACC), Jul. 2019, pp. 2783–2788.
[4] X. Lin and C. G. Cassandras, "An optimal control approach to the multi-agent persistent monitoring problem in two-dimensional spaces," IEEE Trans. Autom. Control, vol. 60, no. 6, pp. 1659–1664, Jun. 2015.
[5] S. K. K. Hari, S. Rathinam, S. Darbha, K. Kalyanam, S. G. Manyam, and D. Casbeer, "Optimal UAV route planning for persistent monitoring missions," IEEE Trans. Robot., vol. 37, no. 2, pp. 550–566, Apr. 2021.
[6] T. Wang, P. Huang, and G. Dong, "Cooperative persistent surveillance on a road network by multi-UGVs with detection ability," IEEE Trans. Ind. Electron., vol. 69, no. 11, pp. 11468–11478, Nov. 2022.
[7] V. Mersheeva and G. Friedrich, "Multi-UAV monitoring with priorities and limited energy resources," in Proc. Int. Conf. Automated Planning Scheduling, vol. 25, 2015, pp. 347–355.
[8] Y.-W. Wang, Y.-W. Wei, X.-K. Liu, N. Zhou, and C. G. Cassandras, "Optimal persistent monitoring using second-order agents with physical constraints," IEEE Trans. Autom. Control, vol. 64, no. 8, pp. 3239–3252, Aug. 2019.
[9] E. Arribas, V. Cholvi, and V. Mancuso, "Optimizing UAV resupply scheduling for heterogeneous and persistent aerial service," IEEE Trans. Robot., vol. 39, no. 4, pp. 2639–2653, Aug. 2023.
[10] J. Zhu and S. S. Kia, "Cooperative localization under limited connectivity," IEEE Trans. Robot., vol. 35, no. 6, pp. 1523–1530, Dec. 2019.
[11] J. Liu, J. Pu, L. Sun, and Y. Zhang, "Multi-robot cooperative localization with range-only measurement by UWB," in Proc. Chin. Autom. Congr., Nov. 2018, pp. 2809–2813.
[12] R. Sharma, R. W. Beard, C. N. Taylor, and S. Quebe, "Graph-based observability analysis of bearing-only cooperative localization," IEEE Trans. Robot., vol. 28, no. 2, pp. 522–529, Apr. 2012.
[13] F. Klaesson, P. Nilsson, T. S. Vaquero, S. Tepsuporn, A. D. Ames, and R. M. Murray, "Planning and optimization for multi-robot planetary cave exploration under intermittent connectivity constraints," in Proc. ICAPS Workshop Planning Robot., 2020.


[14] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, arXiv:1707.06347.
[15] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in Proc. Int. Conf. Learn. Represent., 2017.
[16] J. O'Rourke, Art Gallery Theorems and Algorithms, vol. 57. London, U.K.: Oxford Univ. Press, 1987.
[17] P. Tokekar and V. Kumar, "Visibility-based persistent monitoring with robot teams," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Sep. 2015, pp. 3387–3394.
[18] H. Choset, "Coverage for robotics—A survey of recent results," Ann. Math. Artif. Intell., vol. 31, pp. 113–126, Oct. 2001.
[19] E. Galceran and M. Carreras, "A survey on coverage path planning for robotics," Robot. Auton. Syst., vol. 61, no. 12, pp. 1258–1276, 2013.
[20] X. Tan, "Fast computation of shortest watchman routes in simple polygons," Inf. Process. Lett., vol. 77, no. 1, pp. 27–33, Jan. 2001.
[21] P. H. Washington and M. Schwager, "Reduced state value iteration for multi-drone persistent surveillance with charging constraints," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Sep. 2021, pp. 6390–6397.
[22] P. Maini, P. Tokekar, and P. B. Sujit, "Visibility-based persistent monitoring of piecewise linear features on a terrain using multiple aerial and ground robots," IEEE Trans. Autom. Sci. Eng., vol. 18, no. 4, pp. 1692–1704, Oct. 2021.
[23] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian, "Deep decentralized multi-task multi-agent reinforcement learning under partial observability," in Proc. 34th Int. Conf. Mach. Learn. (ICML), Sydney, NSW, Australia, 2017, pp. 2681–2690.
[24] D. Maravall, J. de Lope, and R. Domínguez, "Coordination of communication in robot teams by reinforcement learning," Robot. Auton. Syst., vol. 61, no. 7, pp. 661–666, Jul. 2013.
[25] Q. Li, F. Gama, A. Ribeiro, and A. Prorok, "Graph neural networks for decentralized multi-robot path planning," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2020, pp. 11785–11792.
[26] R. Shah, Y. Jiang, J. Hart, and P. Stone, "Deep R-learning for continual area sweeping," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2020, pp. 5542–5547.
[27] Q. Li, W. Lin, Z. Liu, and A. Prorok, "Message-aware graph attention networks for large-scale multi-robot path planning," IEEE Robot. Autom. Lett., vol. 6, no. 3, pp. 5533–5540, Jul. 2021.
[28] B. Wang, Z. Liu, Q. Li, and A. Prorok, "Mobile robot path planning in dynamic environments through globally guided reinforcement learning," IEEE Robot. Autom. Lett., vol. 5, no. 4, pp. 6932–6939, Oct. 2020.
[29] J. Blumenkamp and A. Prorok, "The emergence of adversarial communication in multi-agent reinforcement learning," in Proc. Conf. Robot Learn., Cambridge, MA, USA, 2020, pp. 1394–1414.
[30] J. Chen, A. Baskaran, Z. Zhang, and P. Tokekar, "Multi-agent reinforcement learning for visibility-based persistent monitoring," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Sep. 2021, pp. 2563–2570.
[31] F. Klaesson, P. Nilsson, A. D. Ames, and R. M. Murray, "Intermittent connectivity for exploration in communication-constrained multi-agent systems," in Proc. ACM/IEEE 11th Int. Conf. Cyber-Phys. Syst. (ICCPS), Apr. 2020, pp. 196–205.
[32] R. Khodayi-Mehr, Y. Kantaros, and M. M. Zavlanos, "Distributed state estimation using intermittently connected robot networks," IEEE Trans. Robot., vol. 35, no. 3, pp. 709–724, Jun. 2019.
[33] S. Thrun, "Probabilistic robotics," Commun. ACM, vol. 45, no. 3, pp. 52–57, 2002.
[34] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, "Multi-agent actor-critic for mixed cooperative-competitive environments," in Proc. 31st Conf. Neural Inf. Process. Syst., Long Beach, CA, USA, 2017, pp. 6382–6393.
[35] C. Yu et al., "The surprising effectiveness of PPO in cooperative multi-agent games," in Proc. Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 24611–24624.
[36] Y. LeCun and Y. Bengio, "Convolutional networks for images, speech, and time series," in The Handbook of Brain Theory and Neural Networks, vol. 3361. Cambridge, MA, USA: MIT Press, 1995, ch. 10.
[37] J. Chen, A. Baskaran, Z. Zhang, and P. Tokekar, "Multi-agent reinforcement learning for visibility-based persistent monitoring," 2020, arXiv:2011.01129.
[38] H. Choset, K. M. Lynch, S. Hutchinson, G. A. Kantor, and W. Burgard, Principles of Robot Motion: Theory, Algorithms, and Implementations. Cambridge, MA, USA: MIT Press, 2005.
[39] Bitcraze. Crazyflie 2.1. Accessed: Apr. 2023. [Online]. Available: https://www.bitcraze.io/products/crazyflie-2-1/
[40] Vive. Basestation. Accessed: Apr. 2023. [Online]. Available: https://www.vive.com/sea/accessory/base-station2/
[41] Bitcraze. Lighthouse Positioning System. Accessed: Apr. 2023. [Online]. Available: https://www.bitcraze.io/documentation/tutorials/getting-started-with-lighthouse/
[42] J. A. Preiss, W. Hönig, G. S. Sukhatme, and N. Ayanian, "Crazyswarm: A large nano-quadcopter swarm," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2017, pp. 3299–3304.
[43] M. Quigley et al., "ROS: An open-source robot operating system," in Proc. ICRA Workshop Open Source Softw., Kobe, Japan, 2009, vol. 3, no. 3, pp. 1–5.
[44] Proof-of-Concept Hardware Experiment. Accessed: Apr. 2023. [Online]. Available: https://moonlab.iiserb.ac.in/research_page/galopp.html
