Multi-Agent Deep Reinforcement Learning For Persistent Monitoring With Sensing, Communication, and Localization Constraints
Abstract— Determining multi-robot motion policies for persistently monitoring a region with limited sensing, communication, and localization constraints in non-GPS environments is a challenging problem. To take the localization constraints into account, in this paper, we consider a heterogeneous robotic system consisting of two types of agents: anchor agents with accurate localization capability and auxiliary agents with low localization accuracy. To localize itself, an auxiliary agent must be within the communication range of an anchor, directly or indirectly. The robotic team's objective is to minimize environmental uncertainty through persistent monitoring. We propose a multi-agent deep reinforcement learning (MARL) based architecture with graph convolution called Graph Localized Proximal Policy Optimization (GALOPP), which incorporates the limited sensor field-of-view, communication, and localization constraints of the agents along with persistent monitoring objectives to determine motion policies for each agent. We evaluate the performance of GALOPP on open maps with obstacles having a different number of anchor and auxiliary agents. We further study 1) the effect of communication range, obstacle density, and sensing range on the performance and 2) compare the performance of GALOPP with area partition, greedy search, random search, and random search with communication constraint strategies. For its generalization capability, we also evaluate GALOPP in two different environments – 2-room and 4-room. The results show that GALOPP learns the policies and monitors the area well. As a proof-of-concept, we perform hardware experiments to demonstrate the performance of GALOPP.

…monitoring, etc. Typically, these applications are large-scale, and hence using a multi-robot system helps achieve the mission objectives effectively. Often, the robots are subject to limited sensing range and communication range, and they may need to operate in GPS-denied areas. In such scenarios, developing motion planning policies for the robots is difficult. Due to the lack of GPS, alternative localization mechanisms, like SLAM, highly accurate INS, or UWB radio, are essential. Having SLAM or a highly accurate INS system is expensive, and hence we use agents having a combination of expensive, accurate localization systems (anchor agents) and low-cost INS systems (auxiliary agents) whose localization can be made accurate using cooperative localization techniques. To determine efficient motion policies, we use a multi-agent deep reinforcement learning technique (GALOPP) that takes the heterogeneity in the vehicle localization capability, limited sensing, and communication constraints into account. GALOPP is evaluated using simulations and compared with baselines like random search, random search with ensured communication, greedy search, and area partitioning. The results show that GALOPP outperforms the baselines. The GALOPP approach offers a generic solution that can be adopted in various other applications.

Index Terms— Multi-agent deep reinforcement learning (MARL), persistent monitoring (PM), graph neural networks.

I. INTRODUCTION
In Section III, we define the persistent monitoring problem with multiple agents. In Section IV, we describe the GALOPP architecture, and we evaluate the performance of GALOPP in Section V. In Section VI, the proof-of-concept of GALOPP performance using a team of nanocopters is described, and we conclude in Section VII.

II. RELATED WORK

The persistent monitoring problem can be considered as a persistent area coverage problem or as a persistent routing problem visiting a set of targets periodically. Under the persistent area coverage problem, one can consider the mobile variant of the Art Gallery Problem (AGP) [16], where the objective is to find the minimum number of guards to cover the area. There are several variants of AGP for moving agents under visibility constraints [17]. An alternative way for coverage is to use cellular decomposition methods, where the area can be decomposed into cells, and the agents can be assigned to these cells for coverage [18], [19]. In AGP and its variants, the visibility range is infinite but restricted by environmental constraints such as obstacles or boundaries.

In addressing the persistent routing problem, one can approach it using different variants of the multiple Watchman Route Problem (n-WRP) [20]. In these approaches, the goal is to find a route for each agent for monitoring while minimizing the latency in visit time. Yu et al. [1] propose a method for monitoring events with stochastic arrivals at multiple stations by combining stochastic modeling and control theory to optimize real-time monitoring and resource utilization. Tokekar and Kumar [17] propose a novel method for persistent monitoring using robot teams that employs a coverage path planning algorithm accounting for visibility constraints to optimize the coverage and path length trade-off. Wang et al. [8] propose a method for cooperative persistent surveillance on a road network using multiple Unmanned Ground Vehicles (UGVs) with detection capabilities. Lin and Cassandras [4] apply a decentralized control algorithm to consider agents' dynamics and monitoring constraints in real-time, enabling the solution of the problem by finding the optimal trajectory for each agent. Washington and Schwager [21] propose an RSVI algorithm for real-time drone surveillance policy optimization considering battery life and charging constraints, balancing the trade-off between surveillance coverage and energy consumption. Maini et al. [22] propose a coverage algorithm that considers the visibility constraints of robots to monitor linear features such as roads, pipelines, and power lines on terrains with obstacles. Mersheeva and Friedrich [7] develop a framework for multi-UAV monitoring with priorities that addresses the efficient allocation of UAVs while considering resource limitations.

The above-cited approaches assume agents with either unrestricted sensing and/or communication and have full localization. In contrast, our context involves agents with limited access to information in these aspects. Devising a control policy that allows for persistent area coverage while respecting localization constraints is challenging. In the absence of GPS, the agent is unaware of its true position, and the estimated position covariance steadily increases over time. Therefore, the above approaches cannot be applied directly, and modifying them to accommodate localization constraints is difficult. The interconnection of exploration and localization objectives increases the problem's complexity.

Another approach is to learn from the environment to determine agent paths while considering the sensing, communication, and localization constraints. Reinforcement learning can be one such learning-based approach that can learn to determine paths for multiple agents while considering all the constraints. Multi-agent reinforcement learning (MARL) based path planning literature focuses on developing efficient and effective algorithms for multi-agent systems on cooperative multi-agent tasks covering a broad spectrum of applications [23], [24], [25], [26], [27], [28]. Blumenkamp and Prorok [29] study inter-agent communication for self-interested agents in cooperative path planning but do not account for localization constraints and assume complete connectivity throughout. Omidshafiei et al. [23] formalize the concept of MARL under partial observability, which applies to scenarios with limited sensing range. Chen et al. [30] developed a method to find trajectories for agents to cover an area continuously but with the assumption that all agents have full access to the environment due to unrestricted communication access among agents. In the above articles, the problem of determining motion policies for the agents considering the localization, sensing, and communication range constraints jointly has not been adequately addressed. In this work, through GALOPP, we address the problem of persistent monitoring considering all three constraints using a deep reinforcement learning framework.

III. PROBLEM STATEMENT

A. Persistent Monitoring Problem

We consider the persistent monitoring problem in a 2D grid world environment G ⊆ R^2 of size A × B. Each grid cell Gαβ, 1 ≤ α ≤ A and 1 ≤ β ≤ B, has a reward Rαβ(t) associated with it at time t. When the cell Gαβ is within the sensing range of an agent, then Rαβ(t) → 0; otherwise, the reward decays linearly with a decay rate Δαβ > 0. We consider negative reward as it refers to a penalty on the cell for not monitoring. At time t = 0, Rαβ(t) = 0, ∀(α, β), and

Rαβ(t + 1) = max{Rαβ(t) − Δαβ, −Rmax}   if Gαβ is not monitored at time t,
Rαβ(t + 1) = 0                          if Gαβ is monitored at time t,     (1)

where Rmax refers to the maximum penalty a grid cell can accumulate so that the negative reward Rαβ is bounded.

The objective of the persistent monitoring problem is to find a policy for the agents to minimize the neglected time, which, in turn, maximizes the total accumulated reward by G over a finite time T. The optimal policy is given as

π* = arg max_π Σ_{t=0}^{T} Σ_{α=1}^{A} Σ_{β=1}^{B} Rαβ^π(t),     (2)
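To make Equations (1) and (2) concrete, below is a minimal NumPy sketch of the per-cell reward update; the grid size, the uniform decay rate, and the coverage mask are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

A, B = 20, 20        # grid dimensions (illustrative)
R_MAX = 100.0        # maximum accumulated penalty Rmax
DELTA = 1.0          # linear decay rate (uniform here for simplicity)

R = np.zeros((A, B))  # Rαβ(0) = 0 for every cell

def step_rewards(R, monitored):
    """Eq. (1): monitored cells reset to 0; unmonitored cells decay
    linearly and are clipped at -R_MAX."""
    decayed = np.maximum(R - DELTA, -R_MAX)
    return np.where(monitored, 0.0, decayed)

# Example: suppose the agents currently sense the top-left 5 x 5 block.
monitored = np.zeros((A, B), dtype=bool)
monitored[:5, :5] = True
R = step_rewards(R, monitored)

# Eq. (2) maximizes the sum of these (non-positive) rewards over all
# cells and over all time steps of an episode; one step's term is:
step_reward_sum = R.sum()
```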
Fig. 3. Complete pipeline consisting of GALOPP model with environmental interaction. The observations from each agent are processed by the ConvNet,
and the generated embeddings are passed to the GraphNet following the communication graph formed among the agents. The GraphNet processes the input
embeddings and generates aggregated information vectors that are passed through the actor network. The actor network generates a probability distribution
over the possible actions for each agent, and the agents execute the actions having the highest probability. The critic provides feedback to the actor about the
actions’ expected value with respect to achieving the RL objective.
Fig. 4. (a) Schematic representation of GALOPP architecture. Each agent block of the architecture represents an actor-critic model. (b) The mini-map is the
image of the environment G, resized to g × g. The local map is a g × g slice of the environment G centered around the agent. The mini-map and local map
are concatenated together to form the input oi for agent i.
Algorithm 1 KF(µt−1, Σt−1, ut, zt, gotObservation)
  µ̄t = At µt−1 + Bt ut
  Σ̄t = At Σt−1 At^T + Ot
  if gotObservation = True then
    Kt = Σ̄t Ct^T (Ct Σ̄t Ct^T + Qt)^−1
    µt = µ̄t + Kt (zt − Ct µ̄t)
    Σt = (I − Kt Ct) Σ̄t
    return µt, Σt
  else
    return µ̄t, Σ̄t
  end
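For concreteness, below is a minimal NumPy sketch of one step of the Kalman filter in Algorithm 1; the matrices A, B, C and the noise covariances O, Q mirror the listing, but the linear models themselves are assumptions for illustration.

```python
import numpy as np

def kf_step(mu_prev, Sigma_prev, u, z, got_observation, A, B, C, O, Q):
    """One Kalman-filter step (Algorithm 1): predict with the motion
    model; correct only when a localization observation is available."""
    # Prediction (motion update)
    mu_bar = A @ mu_prev + B @ u
    Sigma_bar = A @ Sigma_prev @ A.T + O

    if not got_observation:
        # No fix (e.g., an auxiliary agent disconnected from all anchors):
        # the covariance keeps growing.
        return mu_bar, Sigma_bar

    # Correction (measurement update)
    S = C @ Sigma_bar @ C.T + Q
    K = Sigma_bar @ C.T @ np.linalg.inv(S)            # Kalman gain
    mu = mu_bar + K @ (z - C @ mu_bar)
    Sigma = (np.eye(Sigma_bar.shape[0]) - K @ C) @ Sigma_bar
    return mu, Sigma
```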
…tackled using either a centralized or a decentralized algorithm. A centralized approach will comprise a single actor-critic network to determine the agents' policy. Such an algorithm is faster to train and execute but is not scalable to many agents. The decentralized approach overcomes these shortcomings by assigning individual actor networks to each agent. However, training multiple networks can be computationally expensive. In this paper, we utilize the Centralized Training and Decentralized Execution (CTDE) [34] strategy. This helps in retaining the computational efficiency of centralized actor-critic and the robustness of decentralized actors.

A. Architectural Overview

The complete GALOPP pipeline with environmental interaction is shown in Figure 3, while the GALOPP architecture details are shown in Figure 4a. The GALOPP architecture consists of a multi-agent actor-critic model that implements Proximal Policy Optimization (PPO) [14] to determine
individual agent policies. Multi-agent PPO is preferred over other policy gradient methods to avoid having large policy updates and achieve better learning stability in monitoring tasks. It also has better stability, high sample efficiency, and resistance to hyperparameter tuning [35].

The observation of agent i is denoted as oi, comprising a 2-channel image: the first channel, termed the local map, represents the locally observed visibility map, while the second channel consists of an independently maintained global map version, compressed to match the dimensions of the local map (as depicted in Fig. 4b) and referred to as the mini-map. The local map values depict a binary map indicating obstacle presence within the agent's geometric visibility constraint. The global map values represent the reward value heatmap of each cell in the grid, which is subsequently compressed to form a mini-map. This image is passed through a Convolutional Neural Network (ConvNet) [36] to generate an individual embedding hi for each agent, which is then augmented with agent i's positional mean µi and covariance Σi, as shown in Figure 4a. This is the complete information ζi of the agent's current state. This information vector ζi forms the node embedding of the graph G. It is then processed by a Graph Convolutional Network (GraphNet) [15] that enforces the relay of messages in the generated connectivity graph G to ensure inter-agent communication. The decentralized actors then use the embeddings generated by GraphNet to learn the policy, while a centralized critic updates the overall value function of the environment. The model is trained end-to-end for the persistent monitoring problem. The local computation involves updating the local map, the mean and covariance of the position, and each agent's maintained global map. The central computation is the computation of the joint policy for the multi-agent RL problem. The components of the GALOPP architecture are described in the subsections below.

B. Embedding Extraction and Message Passing

The GALOPP model inputs the shared global reward values in the 2D grid. The observation of an agent i at time t is the set of cells that are within the sensing range (termed the local map) and also a compressed image of the current grid (termed the mini-map) with the pixel values equal to the penalties accumulated by the grid cells [37]. Each agent has a separate copy of the mini-map. Each agent updates its copy of the mini-map, and the monitoring awareness is updated through inter-agent connectivity. Figure 6 illustrates a representation of how the decentralized map is updated. The connected agents compare and aggregate the global map at each time step for a network graph by taking the element-wise maximum for each grid cell Gαβ in the environment. The element-wise maximum value of each grid cell is shared among the connected agents. The mini-map is resized to the shape of the local map of the agent and then concatenated to form a 2-channel image (shown in Figure 4b). This forms the sensing observation input oi for the model at time t. The ConvNet converts the observation oi into a low-dimensional feature vector hi, termed the embedding vector. The positional mean µi and covariance matrix Σi of agent i are then flattened, and their elements are concatenated with hi to generate a new information vector ζi (as shown in Figure 4a).
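Below is a minimal sketch, assuming NumPy arrays for the maps, of the element-wise maximum merge and the 2-channel observation oi described above; the map sizes and the nearest-neighbor resize are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

g = 16  # side length of the local map and mini-map (illustrative)

def merge_global_maps(copies):
    """Connected agents reconcile their global-map copies by taking
    the element-wise maximum (the least-decayed, most recent value)."""
    return np.maximum.reduce(copies)

def resize_nearest(img, out_shape):
    """Nearest-neighbor downsampling used to compress the global map
    into a g x g mini-map."""
    rows = (np.arange(out_shape[0]) * img.shape[0]) // out_shape[0]
    cols = (np.arange(out_shape[1]) * img.shape[1]) // out_shape[1]
    return img[np.ix_(rows, cols)]

def build_observation(local_map, global_map):
    """Stack the g x g local map and the compressed mini-map into the
    2-channel observation oi of Fig. 4b."""
    mini_map = resize_nearest(global_map, (g, g))
    return np.stack([local_map, mini_map], axis=0)  # shape (2, g, g)

# Example: merge the copies of two connected agents, then build oi.
own_copy = -np.abs(np.random.randn(50, 50))       # non-positive rewards
neighbor_copy = -np.abs(np.random.randn(50, 50))
merged = merge_global_maps([own_copy, neighbor_copy])
local_map = np.zeros((g, g))                      # binary obstacle/visibility map
o_i = build_observation(local_map, merged)
```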
The agents are heterogeneous (anchor and auxiliary), and the localization information is a parameter aggregated in the graph network component of GALOPP. An agent's aggregated information vector depends on its current position in the environment, the generated message embedding, and the localization status of each neighboring agent.

GraphNet transfers the information vector ζi to all agents within the communication graph. The agents take the weighted average of the embeddings of the neighborhood agents. The basic building block of a GraphNet is a graph convolutional layer, which is defined as [15]:

H^(k+1) = σ(Ag H^(k) W^(k)),     (6)

where H^(k) is the feature matrix of the k-th layer, with each row representing a node in the graph and each column representing a feature of that node. Ag is the graph's adjacency matrix, which encodes the connectivity between nodes. W^(k) is the weight matrix of the k-th layer, which is used to learn a linear transformation of the node features, and σ is a non-linear activation function, such as ReLU or sigmoid.

After the message passing, the aggregated information vector ζi' for each agent i, for a GraphNet having k hidden layers, is given as

ζi' = H^(k) = σ(Ag H^(k−1) W^(k−1)).     (7)

The aggregated information vector ζi' is now passed on to the actor-critic network MLP. The actor network makes decisions for the agent, and a separate critic network evaluates the actor's actions to provide feedback, allowing the actor to improve its decision-making over time.
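Below is a minimal NumPy sketch of the graph convolution in Equations (6) and (7); the row-normalized adjacency (to realize the weighted neighborhood average), the self-loops, the feature width, and the ReLU activation are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gcn_layer(A_g, H, W):
    """One graph convolutional layer, Eq. (6): H^(k+1) = σ(Ag H^(k) W^(k))."""
    return relu(A_g @ H @ W)

def aggregate(A_g, H0, weights):
    """Stack k layers to obtain the aggregated vectors of Eq. (7);
    row i of the result corresponds to agent i."""
    H = H0
    for W in weights:
        H = gcn_layer(A_g, H, W)
    return H

# Example with 3 agents: adjacency of the communication graph, with
# self-loops so each agent retains its own embedding.
A_g = np.array([[1, 1, 0],
                [1, 1, 1],
                [0, 1, 1]], dtype=float)
A_g = A_g / A_g.sum(axis=1, keepdims=True)   # row-normalize: weighted average
H0 = np.random.randn(3, 8)                   # node embeddings ζi (width 8, illustrative)
Ws = [0.1 * np.random.randn(8, 8) for _ in range(2)]
zeta_prime = aggregate(A_g, H0, Ws)          # aggregated information vectors
```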
C. Multi-Agent Actor-Critic Method Using PPO

The decentralized actors in the multi-agent PPO take in the aggregated information vector ζi' and generate the corresponding action probability distribution π. The action space consists of five discrete options: {up, down, left, right, stay}, representing decisions to move in one of the four cardinal directions or to remain in the current location.
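A small sketch of this discrete action space, assuming a (row, column) grid convention in which "up" decreases the row index; greedy execution matches the Figure 3 description, whereas actions would be sampled from π during training.

```python
import numpy as np

ACTIONS = ("up", "down", "left", "right", "stay")
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1), 4: (0, 0)}

def greedy_action(pi):
    """Pick the highest-probability action from the actor's output pi."""
    return int(np.argmax(pi))

def apply_action(pos, a, grid_shape):
    """Move one cell in the chosen direction, clamped at the grid edges
    (obstacle checks against the local map are omitted here)."""
    dr, dc = MOVES[a]
    r = min(max(pos[0] + dr, 0), grid_shape[0] - 1)
    c = min(max(pos[1] + dc, 0), grid_shape[1] - 1)
    return (r, c)
```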
The centralized critic estimates the environment's value function to influence the individual actor's policy (Figure 3). The shared reward for all agents is defined in Equation (2). For a defined episode length T, the agent interacts with the environment to generate and collect the trajectory values in the form of states, actions, and rewards {si, ai, ri}. The stored values are then sampled iteratively to update the action probabilities and to fit the value function through back-propagation. Let θ1 be the actor trainable parameter and θ2 be the critic trainable parameter. The discounted return measures the long-term value of a sequence of actions. The discounted return is given as G(t; θ1) = Σ_{k=0}^{T} γ^k r(t + k + 1; θ1), where γ ∈ [0, 1) is the discount factor and T is the episode time horizon. The value function V(s_t^i; θ2) represents the expected long-term reward that an agent i can expect to receive if it starts in that state s at time t. It is updated as the agent interacts with the environment and learns from its experiences. The value function estimate,
Fig. 5. (a) Outline of the open map. (b) The agents cannot move into black pixels, while the non-black regions need to be persistently monitored. As the
anchor agents (red stars) and auxiliary agents (dark blue triangles) monitor, their trajectory is shown as the fading white trails for the last 30 steps. The
communication range between the agents is shown in red lines. (c) The trajectories of the anchor and auxiliary agents while monitoring are shown by the red
and blue lines, respectively.
TABLE I
SIMULATION PARAMETERS FOR GALOPP

TABLE II
PARAMETERS FOR THE NEURAL NETWORKS

Fig. 7. Comparison of the average reward on increasing the communication range of the agents in the open-map environment.
Fig. 8. Comparison of the average reward of the model on decreasing the local sensing map range. The local map is the agents' visibility range in the environment.

Fig. 10. Comparison of the average reward on increasing the percentage obstruction in the environment by increasing the number of obstacle blocks.
Fig. 11. Comparison of the percent time of disconnection for auxiliary agents on increasing the percentage occlusion in the environment.

Fig. 13. Performance comparison of GALOPP with heuristic baselines: Random Search (RS), Random Search with Ensured Communication (RSEC), Greedy Search (GS), and Lawn Mower Area Sweep (LMAS).
Fig. 14. Visualization of maps: (a) Illustrates a 2-room map, and (c) Illustrates a 4-room map. The agents cannot move into black pixels, while the non-black
regions need to be persistently monitored. As the anchor agents (red stars) and auxiliary agents (dark blue triangles) monitor, their trajectory is shown as the
fading white trails for the last 30 steps. The communication range between the agents is shown in red lines. Fig (b) and Fig (d) display the trajectories of the
anchor and auxiliary agents while monitoring for the 2-room and 4-room maps, respectively.
4) Area Partition With Lawn Mower Sweep (LMAS): The LMAS strategy [18], [38] begins with partitioning the area into cells (or sections) equal to the number of agents deployed and placing each agent in a chosen starting position within one of these sections. The agent is then programmed to follow a specific lawn-mower movement pattern within each section, typically involving a back-and-forth motion. The agent repeats this pattern within each section until it has covered the entire area, ensuring complete coverage. Each agent makes use of its onboard local map to avoid obstacles by changing its trajectory in the presence of obstacles.
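Below is a minimal sketch of generating the back-and-forth sweep inside one rectangular section assigned to an agent; the cell-index bounds are illustrative, and obstacle avoidance is left to the agent's local map as described above.

```python
def lawnmower_path(r0, r1, c0, c1):
    """Return the cells of a back-and-forth sweep over the rectangle
    of rows r0..r1 and columns c0..c1 assigned to one agent."""
    path = []
    for i, r in enumerate(range(r0, r1 + 1)):
        cols = range(c0, c1 + 1) if i % 2 == 0 else range(c1, c0 - 1, -1)
        path.extend((r, c) for c in cols)
    return path

# Example: the agent assigned the top-left 10 x 10 section repeats this
# pattern to persistently cover its partition.
cells = lawnmower_path(0, 9, 0, 9)
```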
We carried out 100 simulations for each non-RL baseline strategy, and Figure 13 shows the performance comparison between the baseline strategies and GALOPP. From the figure, we can see that GALOPP consistently outperforms the above-defined baseline strategies. This is attributed to GALOPP's ability to explicitly account for localization and connectivity constraints in its decision-making process. Among the baselines, Random Search (RS) exhibits the poorest performance, as it relies on random actions without considering any context. Random Search with Ensured Communication (RSEC) improves upon RS by enforcing communication and localization, resulting in enhanced performance. Greedy Search (GS) leads to sub-optimal policies as each agent acts greedily at each timestep independently. Lawn Mower Area Sweep (LMAS) performs better compared to GS, but it does not explicitly incorporate localization considerations and is influenced by the specific geometry of the environment being monitored. GALOPP consistently outperforms all the mentioned baselines.

H. Evaluation in Other Environments

In order to test the ability of GALOPP to perform in other types of complex environments, we evaluate its performance in two-room and four-room environments, as shown in Figures 14a and 14c, respectively.

For the two-room map, the agents learn to maintain contact with each other by spreading across the two rooms and the corridor. In the 2-room map, we notice that our algorithm ends up with the agents in a formation where two of them position themselves in the two rooms while one monitors the corridor. This can be seen in Figure 14a, where the faded cells show the trajectory followed by the agents for the last 30 steps. Figure 14b shows the areas where each agent was present. From this, we can see that the anchor was in the middle region while the two auxiliary agents monitored the two rooms. The anchor agent moves around to maximize rewards, while the auxiliary agents move in the two rooms. In fact, this is the best combination for the agents, and they learn quickly.

In the four-room map, GALOPP learns a policy in which each of the four agents is responsible for monitoring separate rooms while intermittently monitoring the central corridor region, as shown in Figures 14c and 14d. The anchor agents are positioned to monitor two cells and the central area, while the auxiliary agents are responsible for monitoring the two rooms.

Our results show that GALOPP is capable of adapting to complex environments and learning effective policies for multi-agent coordination. The ability of the agents to maintain contact with each other and cover all areas of the environment is crucial for the successful completion of tasks, and GALOPP demonstrates its ability to achieve this.

VI. HARDWARE IMPLEMENTATION

We implement GALOPP on a real-time hardware setup for proof-of-concept purposes. We use multiple BitCraze Crazyflie 2.1 [39] nanocopters as agents. The experimental setup consists of four SteamVR Base stations [40] and the Lighthouse Positioning System [41] to track the location of the vehicles within a 3.5m × 3.0m × 2.0m arena. The agents communicate with a companion computer (running on Ubuntu 20.04 with an AMD Ryzen 9 5950x with a base clock speed of 3.4 GHz) via a Crazyradio telemetry module, where the trained GALOPP model was executed. In the experiment, we consider the environment shown in Fig. 15a with 2 auxiliary agents and 1 anchor agent. The companion computer receives the position of each CrazyFlie as input via the corresponding rostopics from the Crazyswarm ROS package [42], [43]. The respective agents then execute the actions computed by the actor networks. To avoid inaccuracies in tracking the CrazyFlies caused by physical obstacles obstructing the infrared laser beams from the Base stations, we opt to simulate the obstacle boundaries. The model policy implemented in the simulation ensures that the agents never collide with any obstacle.
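Below is a schematic sketch of one decision step on the companion computer; get_crazyflie_positions, send_waypoint, env, and the actor interface are hypothetical placeholders used for illustration, not the Crazyswarm or ROS APIs.

```python
import numpy as np

CELL = 0.25  # metres per grid cell (illustrative)

def world_to_cell(xy):
    """Map a tracked arena position (x, y) in metres to a grid cell."""
    return (int(xy[1] / CELL), int(xy[0] / CELL))

def control_step(agents, actors, env, get_crazyflie_positions, send_waypoint):
    """Read tracked positions, run the trained decentralized actors,
    and command the next waypoint for each vehicle."""
    positions = get_crazyflie_positions()                      # placeholder (rostopic reader)
    cells = [world_to_cell(p) for p in positions]
    observations = [env.build_observation(c) for c in cells]   # local map + mini-map
    for agent, actor, cell, obs in zip(agents, actors, cells, observations):
        action = int(np.argmax(actor(obs)))                    # highest-probability action
        target = env.next_cell(cell, action)                   # placeholder transition
        send_waypoint(agent, target)                           # placeholder command
```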
The video of the hardware implementation can be seen in [44].
Fig. 15. Snapshot of the video for the hardware implementation of vehicles using one anchor and two auxiliary agents. (a) A rendered simulation snapshot of
the monitoring task. (b) Real-time decision-making being performed by the trained GALOPP network model. (c) The trajectory trails of the previous timesteps
that the agent took in the monitoring task.
Figure 15(a) shows the snapshot of the simulated environment along with the agent positions (anchor and auxiliary), current coverage, and the position of the obstacle. We then implement the same scenario with virtual obstacles through the hardware, where the model sends the control signals to the vehicles, as shown in Figure 15(b). In Figure 15(c), we can see that the agent trajectories cover all the regions, hence achieving persistent monitoring.

VII. CONCLUSION AND FUTURE WORK

This work developed a MARL algorithm with a graph-based connectivity approach – GALOPP for persistently monitoring a bounded region, taking the communication, sensing, and localization constraints into account via graph connectivity. The experiments show that the agents using GALOPP can outperform four custom baseline strategies for persistent area coverage while accounting for the connectivity bounds. We also establish the robustness of our approach by varying the sensing map, the effect of obstacle occlusion by increasing the percent amount of obstacles, and by scaling the number of anchor agents in the system. It was seen that increasing the number of anchor agents improves the performance, but beyond a certain value, there are diminishing returns on the rewards obtained. Based on power and resource constraints, one can select a subset of anchor agents to achieve persistent surveillance effectively.

Although our experiments demonstrate that GALOPP surpasses the baseline strategies, future work could investigate the algorithm's scalability as the number of agents significantly increases. Exploring optimal values for the decay rate and maximum negative reward, with a focus on increasing monitoring efficiency, presents a promising area for further research. Additionally, the algorithm's suitability for diverse sensor types, such as cameras or LIDAR sensors, could be explored to improve agents' situational awareness. Further research on the impact of different types of obstacles, including moving obstacles, on the algorithm's performance would also be insightful. While the proposed algorithm targets heterogeneous agents in the persistent monitoring problem, future research can investigate its generalizability to other monitoring problems, such as target tracking or environmental monitoring. This work provides a foundation for future investigations of GALOPP's performance and its potential applications in various monitoring scenarios.

REFERENCES

[1] J. Yu, S. Karaman, and D. Rus, "Persistent monitoring of events with stochastic arrivals at multiple stations," IEEE Trans. Robot., vol. 31, no. 3, pp. 521–535, Jun. 2015.
[2] S. L. Smith, M. Schwager, and D. Rus, "Persistent monitoring of changing environments using a robot with limited range sensing," in Proc. IEEE Int. Conf. Robot. Autom., May 2011, pp. 5448–5455.
[3] S. K. K. Hari, S. Rathinam, S. Darbha, K. Kalyanam, S. G. Manyam, and D. Casbeer, "The generalized persistent monitoring problem," in Proc. Amer. Control Conf. (ACC), Jul. 2019, pp. 2783–2788.
[4] X. Lin and C. G. Cassandras, "An optimal control approach to the multi-agent persistent monitoring problem in two-dimensional spaces," IEEE Trans. Autom. Control, vol. 60, no. 6, pp. 1659–1664, Jun. 2015.
[5] S. K. K. Hari, S. Rathinam, S. Darbha, K. Kalyanam, S. G. Manyam, and D. Casbeer, "Optimal UAV route planning for persistent monitoring missions," IEEE Trans. Robot., vol. 37, no. 2, pp. 550–566, Apr. 2021.
[6] T. Wang, P. Huang, and G. Dong, "Cooperative persistent surveillance on a road network by multi-UGVs with detection ability," IEEE Trans. Ind. Electron., vol. 69, no. 11, pp. 11468–11478, Nov. 2022.
[7] V. Mersheeva and G. Friedrich, "Multi-UAV monitoring with priorities and limited energy resources," in Proc. Int. Conf. Automated Planning Scheduling, vol. 25, 2015, pp. 347–355.
[8] Y.-W. Wang, Y.-W. Wei, X.-K. Liu, N. Zhou, and C. G. Cassandras, "Optimal persistent monitoring using second-order agents with physical constraints," IEEE Trans. Autom. Control, vol. 64, no. 8, pp. 3239–3252, Aug. 2019.
[9] E. Arribas, V. Cholvi, and V. Mancuso, "Optimizing UAV resupply scheduling for heterogeneous and persistent aerial service," IEEE Trans. Robot., vol. 39, no. 4, pp. 2639–2653, Aug. 2023.
[10] J. Zhu and S. S. Kia, "Cooperative localization under limited connectivity," IEEE Trans. Robot., vol. 35, no. 6, pp. 1523–1530, Dec. 2019.
[11] J. Liu, J. Pu, L. Sun, and Y. Zhang, "Multi-robot cooperative localization with range-only measurement by UWB," in Proc. Chin. Autom. Congr., Nov. 2018, pp. 2809–2813.
[12] R. Sharma, R. W. Beard, C. N. Taylor, and S. Quebe, "Graph-based observability analysis of bearing-only cooperative localization," IEEE Trans. Robot., vol. 28, no. 2, pp. 522–529, Apr. 2012.
[13] F. Klaesson, P. Nilsson, T. S. Vaquero, S. Tepsuporn, A. D. Ames, and R. M. Murray, "Planning and optimization for multi-robot planetary cave exploration under intermittent connectivity constraints," in Proc. ICAPS Workshop Planning Robot., 2020.
[14] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, arXiv:1707.06347.
[15] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in Proc. Int. Conf. Learn. Represent., 2017.
[16] J. O'Rourke, Art Gallery Theorems and Algorithms, vol. 57. London, U.K.: Oxford Univ. Press, 1987.
[17] P. Tokekar and V. Kumar, "Visibility-based persistent monitoring with robot teams," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Sep. 2015, pp. 3387–3394.
[18] H. Choset, "Coverage for robotics—A survey of recent results," Ann. Math. Artif. Intell., vol. 31, pp. 113–126, Oct. 2001.
[19] E. Galceran and M. Carreras, "A survey on coverage path planning for robotics," Robot. Auton. Syst., vol. 61, no. 12, pp. 1258–1276, 2013.
[20] X. Tan, "Fast computation of shortest watchman routes in simple polygons," Inf. Process. Lett., vol. 77, no. 1, pp. 27–33, Jan. 2001.
[21] P. H. Washington and M. Schwager, "Reduced state value iteration for multi-drone persistent surveillance with charging constraints," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Sep. 2021, pp. 6390–6397.
[22] P. Maini, P. Tokekar, and P. B. Sujit, "Visibility-based persistent monitoring of piecewise linear features on a terrain using multiple aerial and ground robots," IEEE Trans. Autom. Sci. Eng., vol. 18, no. 4, pp. 1692–1704, Oct. 2021.
[23] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian, "Deep decentralized multi-task multi-agent reinforcement learning under partial observability," in Proc. 34th Int. Conf. Mach. Learn. (ICML), Sydney, NSW, Australia, 2017, pp. 2681–2690.
[24] D. Maravall, J. de Lope, and R. Domínguez, "Coordination of communication in robot teams by reinforcement learning," Robot. Auto. Syst., vol. 61, no. 7, pp. 661–666, Jul. 2013.
[25] Q. Li, F. Gama, A. Ribeiro, and A. Prorok, "Graph neural networks for decentralized multi-robot path planning," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2020, pp. 11785–11792.
[26] R. Shah, Y. Jiang, J. Hart, and P. Stone, "Deep R-learning for continual area sweeping," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2020, pp. 5542–5547.
[27] Q. Li, W. Lin, Z. Liu, and A. Prorok, "Message-aware graph attention networks for large-scale multi-robot path planning," IEEE Robot. Autom. Lett., vol. 6, no. 3, pp. 5533–5540, Jul. 2021.
[28] B. Wang, Z. Liu, Q. Li, and A. Prorok, "Mobile robot path planning in dynamic environments through globally guided reinforcement learning," IEEE Robot. Autom. Lett., vol. 5, no. 4, pp. 6932–6939, Oct. 2020.
[29] J. Blumenkamp and A. Prorok, "The emergence of adversarial communication in multi-agent reinforcement learning," in Proc. Conf. Robot Learn., Cambridge, MA, USA, 2020, pp. 1394–1414.
[30] J. Chen, A. Baskaran, Z. Zhang, and P. Tokekar, "Multi-agent reinforcement learning for visibility-based persistent monitoring," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Sep. 2021, pp. 2563–2570.
[31] F. Klaesson, P. Nilsson, A. D. Ames, and R. M. Murray, "Intermittent connectivity for exploration in communication-constrained multi-agent systems," in Proc. ACM/IEEE 11th Int. Conf. Cyber-Phys. Syst. (ICCPS), Apr. 2020, pp. 196–205.
[32] R. Khodayi-Mehr, Y. Kantaros, and M. M. Zavlanos, "Distributed state estimation using intermittently connected robot networks," IEEE Trans. Robot., vol. 35, no. 3, pp. 709–724, Jun. 2019.
[33] S. Thrun, "Probabilistic robotics," Commun. ACM, vol. 45, no. 3, pp. 52–57, 2002.
[34] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, "Multi-agent actor-critic for mixed cooperative-competitive environments," in Proc. 31st Conf. Neural Inf. Process. Syst., Long Beach, CA, USA, 2017, pp. 6382–6393.
[35] C. Yu et al., "The surprising effectiveness of PPO in cooperative multi-agent games," in Proc. Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 24611–24624.
[36] Y. LeCun and Y. Bengio, "Convolutional networks for images, speech, and time series," in The Handbook of Brain Theory and Neural Networks, vol. 3361. Cambridge, MA, USA: MIT Press, 1995, ch. 10.
[37] J. Chen, A. Baskaran, Z. Zhang, and P. Tokekar, "Multi-agent reinforcement learning for visibility-based persistent monitoring," 2020, arXiv:2011.01129.
[38] H. Choset, K. M. Lynch, S. Hutchinson, G. A. Kantor, and W. Burgard, Principles of Robot Motion: Theory, Algorithms, and Implementations. Cambridge, MA, USA: MIT Press, 2005.
[39] Bitcraze. Crazyflie 2.1. Accessed: Apr. 2023. [Online]. Available: https://www.bitcraze.io/products/crazyflie-2-1/
[40] Vive. Basestation. Accessed: Apr. 2023. [Online]. Available: https://www.vive.com/sea/accessory/base-station2/
[41] Bitcraze. Lighthouse Positioning System. Accessed: Apr. 2023. [Online]. Available: https://www.bitcraze.io/documentation/tutorials/getting-started-with-lighthouse/
[42] J. A. Preiss, W. Hönig, G. S. Sukhatme, and N. Ayanian, "Crazyswarm: A large nano-quadcopter swarm," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2017, pp. 3299–3304.
[43] M. Quigley et al., "ROS: An open-source robot operating system," in Proc. ICRA Workshop Open Source Softw., Kobe, Japan, 2009, vol. 3, no. 3, pp. 1–5.
[44] Proof-of-Concept Hardware Experiment. Accessed: Apr. 2023. [Online]. Available: https://moonlab.iiserb.ac.in/research_page/galopp.html