Graph-Based Skill Acquisition For Reinforcement Learning
In machine learning, Reinforcement Learning (RL) is an important tool for creating intelligent agents that
learn solely through experience. One particular subarea within the RL domain that has received great atten-
tion is how to define macro-actions, which are temporal abstractions composed of a sequence of primitive
actions. This subarea, loosely called skill acquisition, has been under development for several years and has
led to better results in a diversity of RL problems. Among the many skill acquisition approaches, graph-based
methods have received considerable attention. This survey presents an overview of graph-based skill acqui-
sition methods for RL. We cover a diversity of these approaches and discuss how they evolved throughout
the years. Finally, we also discuss the current challenges and open issues in the area of graph-based skill
acquisition for RL.
CCS Concepts: • Computing methodologies → Reinforcement learning; Cluster analysis; • Mathemat-
ics of computing → Spectra of graphs; Graph algorithms;
Additional Key Words and Phrases: Skill acquisition, reinforcement learning, graph analytics, centrality,
clustering
ACM Reference format:
Matheus R. F. Mendonça, Artur Ziviani, and André M. S. Barreto. 2019. Graph-Based Skill Acquisition For
Reinforcement Learning. ACM Comput. Surv. 52, 1, Article 6 (February 2019), 26 pages.
https://doi.org/10.1145/3291045
1 INTRODUCTION
Reinforcement Learning (RL) is a machine-learning approach in which an agent learns to accom-
plish a given task by simply interacting with the environment through a set of actions and receiving
reward signals. This process allows the RL agent to learn how to interact with the environment
such that it maximizes the expected discounted sum of rewards received. Different challenges arise
in this context, such as delayed rewards and problems with a very large or an infinite number of
possible states to analyze, just to name a few. The former challenge mentioned is related to scenar-
ios where the agent receives reward signals several steps after a key action is performed, making
it difficult for RL algorithms to associate such rewards with the corresponding action. The latter
challenge refers to problems with too many or even an infinite number of states (e.g., continuous state spaces), making it infeasible to use exact representations.
M. R. F. Mendonça, A. Ziviani and A. M. S. Barreto thank CNPq for its support (grants 141.761/2016-4, 308.729/2015-3 and
461.739/2014-3, respectively) and also acknowledge the INCT in Data Sciences – INCT-CiD (CNPq no. 465.560/2014-8). A.
M. S. Barreto is currently with DeepMind, London, UK.
Authors’ addresses: M. R. F. Mendonça, A. Ziviani, and A. M. S. Barreto, National Laboratory for Scientific Computing
(LNCC), Petrópolis, RJ, Brazil; emails: {mrfm, ziviani, amsb}@lncc.br.
Fig. 1. Skill chaining example based on a figure from Konidaris and Barto [32]. The goal is represented by the large circle pointed to by the arrows. In the left figure, the agent has not learned any skills yet. In the middle figure, the agent has learned two different skills that are able to reach the goal. Finally, the right figure shows several skills chained together that allow the agent to reach the goal.
It is believed that temporal abstraction may help tackle some of these issues [59, 60, 67]. Tem-
poral abstractions encapsulate a sequence of actions into a single macro-action, allowing the RL
agent to act at different time scales. A macro-action can also be interpreted as a high-level action
that achieves a subgoal within an RL problem. In this survey, we focus on temporal abstraction
approaches combined with RL and Deep Reinforcement Learning (DRL). More specifically, we fo-
cus on the automatic definition of macro-actions, also known as automatic temporal abstractions
or automatic skill acquisition.
The particular focus on automatic skill acquisition is justified given its increasing importance
in RL, especially considering some recent advances suggesting that it can indeed be beneficial in
practice [1, 37, 41, 71]. There are different formalisms for defining macro-actions [18, 25, 57, 67] and
several different methods that automatically define skills [1, 12, 37], although how to automatically
define such skills is still considered an open problem.
For example, Konidaris and Barto [32] present a method for automatically defining skills in
continuous RL domains based on the Options framework [67] by creating a chain of skills. The first
created skill is used to reach the goal of the RL problem. It is constructed using a pseudo-reward function that rewards the agent for reaching the goal, and it can only be initiated from the region of states from which executing it leads to the goal. The next skill constructed uses as its subgoal the state region where the previously created skill can be initiated. This process, illustrated in Figure 1, is repeated until a skill is assigned to the state region where the starting state is located.
Another recent skill acquisition method, presented by Vezhnevets et al. [70], uses an action-plan matrix P_t ∈ R^{|A|×T} to represent macro-actions, where A is the set of possible actions and T is the macro-action duration. Column i of the action-plan matrix P_t thus indicates the probability of taking each action a ∈ A at time step t + i. The action-plan matrix is updated through Attentive
Planning [24], a technique originally created for image generation.
Some of the pioneering work related to automatic skill acquisition has adopted graph-based
metrics and clustering algorithms over the state transition graph (responsible for representing
the environment dynamics) of an RL problem in order to create high-level actions. These graph-
based approaches achieved good performance in the past [12, 13, 42] and continue to outperform
current state-of-the-art methods [37]. This survey covers graph-based skill acquisition methods for
reinforcement learning. We present how these graph-based methods evolved over time, discussing
different approaches for continuous or discrete state spaces.
Graph-based skill acquisition methods can be classified into three classes depending on the ap-
proach they adopt to identify important states or macro-actions in the transition graph (Figure 2):
(1) centrality measures, (2) clustering algorithms, and (3) spectral analysis. The first approach uses
different centrality measures over the state transition graph in order to identify important states (or
state regions) and then create macro-actions that allow the agent to move to these states. Second,
clustering approaches use several different clustering algorithms over the state transition graph
in order to identify states that share a specific similarity. Macro-actions are then created in such a
way that they allow the agent to transition between state clusters. The third approach uses spec-
tral analysis in order to identify macro-actions with their corresponding intrapolicy. There are
also some mixed approaches that use spectral clustering for finding macro-actions, showing that
these three categories can actually intersect. These mixed approaches are discussed in more detail
in Section 6.
To the best of our knowledge, there is no recent survey that thoroughly explores skill acqui-
sition methods, in particular graph-based ones. The only previous survey we know of, by Chiu
and Soo [10], is now outdated. In recent years, graph-based methods applied to different domains
are also receiving ever-increasing attention, aligned with the rise and consolidation of the in-
terdisciplinary field of network science [3, 7]. Indeed, the relation between RL and graph-based
techniques derived from network science emerges as a timely and relevant topic in the current
machine-learning landscape. In this context, this survey aims to be a starting point for the inter-
ested reader who desires to get familiarized with recent graph-based skill acquisition methods for
reinforcement learning.
This survey is organized as follows. Section 2 introduces the basic concepts of Reinforcement
Learning, followed by a description of the skill acquisition problem. This section also describes how
the state transition graph is created in such skill acquisition methods, presenting two of the most
common approaches for creating these graphs. Some basic concepts of graph analysis methods
are then introduced in Section 3, where we describe some key centrality measures and clustering
algorithms that are often used in graph-based skill acquisition methods throughout the literature.
Section 4 presents the most common benchmarks used by graph-based skill acquisition methods.
We then go to the core of this survey: (1) Section 5 describes how to create macro-actions using
centrality measures over the state transition graph, (2) Section 6 presents approaches based on
clustering algorithms, and (3) Section 7 details approaches based on spectral analysis and other
graph-based approaches that substantially differ from the classes presented in Figure 2. Section 8
presents some of the challenges and open issues in the area of graph-based skill acquisition. Finally,
in Section 9, we summarize the article along with some final considerations.
(RNN) [29, 49, 63]. CNNs are typically used to learn image patterns and therefore are commonly
adopted for image classification or image segmentation problems. On the other hand, RNNs are
capable of identifying temporal patterns in the training data, allowing the network to predict the
next input after observing a sequence of inputs. These DL methods, such as CNNs and RNNs,
have recently replaced the simple NN models used to predict Q (s, a), thus giving rise to Deep
Reinforcement Learning methods [50, 51].
Deep Reinforcement Learning methods are capable of dealing with large state spaces as well
as learning complex behaviors. For instance, these methods have already been used for learning
how to play games from the Atari 2600 console at an expert level by only considering the current
game image (pixel vector) and the eventual reward offered by the game, a feat that could not be
accomplished by standard RL methods. This achievement opened many potential perspectives for
new applications based on machine learning. This is also the main reason for the ever-increasing
interest we observe nowadays in Deep Learning in general, and in Deep Reinforcement Learning
in particular.
Following the Options framework [67], an option o is defined by a tuple <I, πo, β>, where:
• I is the initiation set, which indicates the set of states from which the agent can initiate the execution of option o. The definition of this set depends on the skill acquisition method and will be explored further in this survey.
• πo is the policy for option o, also called intraoption. A common approach to define πo is
to use a special reward function (usually called pseudo-reward function) when executing
an option, such as rewarding or punishing the agent when it achieves certain states. This
reward function is only valid when an option is currently in execution.
• β is the termination condition, which is a probability function β (s) that indicates the prob-
ability of option o terminating in state s. A common approach is to use a termination prob-
ability of 1 for important states and 0 for states inside the scope of the given option. The
termination condition depends on the problem being solved and the skill acquisition method
used, which will be better explained further in this survey.
The update of the action-value function of options differs from the update used for primitive actions. For example, while the update of primitive actions is done by the standard Q-Learning algorithm, the update for options can be performed using the Macro-Q-Learning method [47], given by
Q(s_t, o_t) ← Q(s_t, o_t) + α [ r_o + γ^k max_{o′ ∈ O} Q(s_{t+k}, o′) − Q(s_t, o_t) ],
where Q(s_t, o_t) is the action-value function for option o_t executed in state s_t at time step t, k is the number of steps that option o_t has executed so far, O is the set of all possible actions that can be carried out (primitive actions and options alike), and r_o is the cumulative discounted reward received while executing option o_t during the past k steps, given by r_o = Σ_{i=0}^{k−1} γ^i r_{t+i+1}.
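To make this update concrete, the sketch below applies it to a tabular action-value function stored as a Python dictionary. It is a minimal illustration of the Macro-Q-Learning rule, not the original implementation of [47]; the data layout and the function name are assumptions of this example.

def macro_q_update(Q, s_t, o_t, rewards, s_next, options, alpha=0.1, gamma=0.99):
    """One Macro-Q-Learning update after option o_t ran for k steps.

    Q       : dict mapping (state, option) -> value
    rewards : list [r_{t+1}, ..., r_{t+k}] collected while o_t executed
    s_next  : state s_{t+k} reached when o_t terminated
    options : set O of primitive actions and options available in s_next
    """
    k = len(rewards)
    # Cumulative discounted reward r_o gathered while the option executed.
    r_o = sum((gamma ** i) * r for i, r in enumerate(rewards))
    # Bootstrap from the best action or option available at the termination state.
    best_next = max(Q.get((s_next, o), 0.0) for o in options)
    q_old = Q.get((s_t, o_t), 0.0)
    Q[(s_t, o_t)] = q_old + alpha * (r_o + (gamma ** k) * best_next - q_old)
    return Q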
When a new option is created (based on whichever method used to discover such options), it is
necessary to learn its intrapolicy πo before effectively using it in the RL problem. Otherwise, the
agent will not be able to properly execute the new option. A common approach is to learn the intrapolicy through the Experience Replay technique [35], where sequences of actions and states previously experienced by the agent within the given option's initiation set are
used to learn πo . Using a sequence of states previously observed, coupled with a pseudo-reward
function, allows the policy πo to be quickly learned, allowing the insertion of the newly formed
option into the action set of the agent.
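A minimal sketch of this intrapolicy training step is shown below, assuming a tabular setting. The replay-buffer format, the pseudo_reward function, and the initiation-set test are illustrative assumptions, not the interface of the Experience Replay technique in [35].

def train_intrapolicy(replay_buffer, initiation_set, pseudo_reward,
                      actions, alpha=0.1, gamma=0.99, epochs=10):
    """Learn an option's intrapolicy pi_o from stored experience.

    replay_buffer : list of transitions (s, a, s_next) previously experienced
    initiation_set: set of states I in which the option may start
    pseudo_reward : function (s, a, s_next) -> float rewarding subgoal states
    """
    Q_o = {}  # option-specific action values
    for _ in range(epochs):
        for (s, a, s_next) in replay_buffer:
            if s not in initiation_set:
                continue  # only replay experience relevant to this option
            r = pseudo_reward(s, a, s_next)
            q_old = Q_o.get((s, a), 0.0)
            best_next = max(Q_o.get((s_next, b), 0.0) for b in actions)
            Q_o[(s, a)] = q_old + alpha * (r + gamma * best_next - q_old)

    # Greedy intrapolicy: in each state, pick the action with the highest value.
    def pi_o(s):
        return max(actions, key=lambda a: Q_o.get((s, a), 0.0))
    return pi_o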
We consider here that a graph G is formally represented by G = (V , E), where V is the set of
vertices (nodes) and E is the set of edges in the graph. The number of vertices (nodes) in the graph
is n = |V|. The graph's connectivity is represented by an adjacency matrix A, where a_ij is the weight of the edge that connects nodes i and j. For unweighted graphs, a_ij = 1 if there exists an edge between nodes i and j, and a_ij = 0 otherwise.
where λ is an eigenvalue of the adjacency matrix A associated with the eigenvector x. Therefore,
the eigenvector centrality of each node in a graph can be calculated by solving the eigenvector
problem
Ax = λx . (8)
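As a simple illustration, the dominant eigenvector of A, and hence the eigenvector centrality of every node, can be obtained by power iteration. The NumPy sketch below is only meant to illustrate Equation (8); it assumes a connected graph given as a dense adjacency matrix.

import numpy as np

def eigenvector_centrality(A, iterations=100, tol=1e-8):
    """Power iteration on the adjacency matrix A (n x n NumPy array)."""
    n = A.shape[0]
    x = np.ones(n) / n                    # initial guess
    for _ in range(iterations):
        x_new = A @ x                     # one application of A
        x_new = x_new / np.linalg.norm(x_new)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x                              # entries are the centralities (up to scaling)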
This process continues until there is no change in the configuration of clusters between two consecutive iterations. The K-Means algorithm is summarized in Algorithm 1.
ALGORITHM 1: K-Means Clustering
Data: p1 , . . . , pn , k
Result: O 1 , . . . , O k
begin
Select k random points C 1 , . . . , Ck to represent the centroid of each cluster;
while clusters O 1 , . . . , O k change between consecutive iterations do
for i = 1, . . . , n do
Determine j such that C j is the closest centroid to pi ;
Assign point pi to cluster O j ;
end
for a = 1, . . . , k do
C_a = (1/|O_a|) Σ_{p_i ∈ O_a} p_i ;
end
end
end
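For reference, a compact NumPy version of Algorithm 1 is sketched below. The random initialization and the convergence test that compares cluster assignments between consecutive iterations mirror the pseudocode above; everything else (data types, tie-breaking) is an assumption of this example.

import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """points: (n, d) array of data points; returns (labels, centroids)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for it in range(max_iter):
        # Assign each point to its closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break                          # no cluster changed between iterations
        labels = new_labels
        # Recompute each centroid as the mean of the points assigned to it.
        for a in range(k):
            if np.any(labels == a):
                centroids[a] = points[labels == a].mean(axis=0)
    return labels, centroids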
3.2.2 Spectral Clustering. Spectral Clustering is a clustering method that uses as its main tool
the eigenvectors of the Laplacian matrix of a graph. Before briefly describing the Spectral Cluster-
ing method, let us define the degree matrix and the Laplacian matrix:
• Degree Matrix: the degree matrix, represented by D, is a diagonal matrix where D_ij = 0 for i ≠ j and D_ii = d_i, the degree of node i.
• Laplacian Matrix: the Laplacian matrix is a special matrix given by L = D − A that has
n real and nonnegative eigenvalues (where n = |V |, i.e., the number of nodes in G). The
number of eigenvalues of L that equals zero indicates the number of connected components
in the corresponding graph G = (V , E) [52].
The Spectral Clustering algorithm, presented in Algorithm 2, operates as follows: given a graph
G = (V , E) and a parameter k that indicates the number of desired clusters, it computes the Lapla-
cian matrix L and then determines the k eigenvectors u 1 , . . . , uk ∈ Rn (n being the number of
nodes in the graph G) associated with the k smallest eigenvalues of L. It then constructs a matrix
U ∈ R^{n×k}, where the ith column of U is given by the eigenvector u_i. It then considers a set of points y_i ∈ R^k for i = 1, . . . , n, where y_i is given by the ith row of U. The algorithm then uses a clustering method (K-Means, for example) over the set of points y_i, resulting in k clusters C_1, . . . , C_k. Finally, a new set of clusters O_1, . . . , O_k is created such that node j of graph G belongs to O_i if y_j ∈ C_i, that is, O_i = {j | y_j ∈ C_i}.
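A compact sketch of this procedure is shown below, assuming the graph is given as a dense adjacency matrix and reusing an off-the-shelf K-Means routine (SciPy's kmeans2) for the final clustering step; any other K-Means implementation would do.

import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(A, k):
    """A: (n, n) adjacency matrix; returns one cluster label per node."""
    D = np.diag(A.sum(axis=1))               # degree matrix
    L = D - A                                # unnormalized graph Laplacian
    _, eigvecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    U = eigvecs[:, :k]                       # k eigenvectors with smallest eigenvalues
    # Each row y_i of U is the spectral embedding of node i; cluster the rows.
    _, labels = kmeans2(U, k, minit='++', seed=0)
    return labels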
Fig. 3. Two-Room test environment. The RL agent starts at a random position inside the left room and its goal
is to reach the G position inside the right room. Many skill acquisition approaches use this test environment
to compare different approaches. Skills learned in this environment using different techniques tend to focus
on reaching the doorway, showing that this position represents a bottleneck.
The Arcade Learning Environment (ALE) [6] is one of the most recent benchmarks in the skill acquisition community, as well as in the Reinforcement Learning community in general. The ALE provides a collection of Atari games where the agent can only access the current state of the game, given by the current frame, and the rewards. This makes ALE an ideal benchmark since it provides an easy way to compare different methods with other approaches in the literature.
There are different ways to assess the quality of a given graph-based skill acquisition method
based on which benchmark is used. In general, there are two main desirable attributes: (1) perfor-
mance, which measures the total amount of rewards the agent can harness, and (2) training speed.
For the Two-Room environment and its variations, authors are more concerned with the training
speed and the exploration capacity, given that these environments can already be fully mastered
by simple RL methods. Therefore, the objective is to create a method capable of reaching this level
of mastery in fewer steps. The same is also valid for the taxi domain. For the ALE, however, the
performance is also very important: many Atari games are considered to be very difficult for a
simple RL method to master. Therefore, creating skill acquisition methods capable of improving
the total score on a specific game is considered a gain. The training speed is also very important
for this benchmark, given that the training time required for many of these games is high.
Fig. 4. Illustration of the macro-action discovery process using centrality measures. The process starts with
(a) an initial STG of a given problem. It then goes to (b) the identification of the most important states
based on a given centrality (in this figure, we considered the betweenness centrality, with the size of a node
proportional to its betweenness centrality). Finally, (c) the states with the highest centrality (the top x nodes
with the highest centrality or all nodes with a centrality value above a predefined threshold) are identified
as subgoals, thus resulting in macro-actions capable of reaching such states.
the agent leaves the initiation set and 0 for the states in the initiation set. The newly created option o = <I, πo, β> is then trained using experience replay [35] in order to find a good intrapolicy πo that allows the agent to properly execute o.
The main differences between distinct graph-based skill acquisition approaches based on cen-
trality measures, as we show in the remainder of this section, are (1) which centrality measure is
used, (2) how the STG is built, and (3) how the initiation set is defined. The remaining steps are
usually very similar to each other. Algorithm 3 presents the case in which a Local STG is used,
although a Global STG might also be used, as in some approaches presented in the remainder of
this section. Another key concept in many centrality-based methods is that the important states
s_1, . . . , s_e are defined as being those with a centrality measure higher than a predefined thresh-
old t. Hence, the threshold t indirectly controls how many subgoals are identified within the STG.
As a consequence, choosing an appropriate value for t depends on the particular problem being
solved, as well as on which centrality measure is used. We now describe existing graph-based skill
acquisition approaches using centrality measures, divided into three categories: (1) methods that
use a Global STG, (2) methods that use a Local STG, and (3) those that do not use an STG.
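Before presenting these categories, the common skeleton shared by centrality-based methods can be illustrated as follows, here using betweenness centrality from NetworkX and a threshold t. The STG construction from observed transitions and the option-creation step are placeholders for the method-specific choices discussed below.

import networkx as nx

def find_subgoals_by_centrality(transitions, t):
    """transitions: iterable of observed (s, s_next) pairs forming a Local STG.
    Returns the states whose betweenness centrality exceeds the threshold t."""
    stg = nx.DiGraph()
    stg.add_edges_from(transitions)                # build the state transition graph
    centrality = nx.betweenness_centrality(stg)
    return [s for s, c in centrality.items() if c > t]

# Each returned subgoal would then be wrapped into an option <I, pi_o, beta>
# whose pseudo-reward encourages reaching it, as described in the text.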
These states are defined as subgoals. As a result, this approach allows the detection of important
states without the need for several episodes, because if the agent crosses a bottleneck at least
once, the Q-Cut algorithm will already be able to identify it. After identifying bottleneck nodes,
the initiation set is defined as the source states of the identified cuts. Similarly to the previous
approach, the agent based on the Q-Cut method was compared with a baseline agent in the
Two-Room and Six-Room domains, where it was shown that the agent using skills managed to
reach the goal in fewer steps than the baseline agent.
In this subsection, we presented some centrality-based skill acquisition methods that rely on
a Global STG. The main difference between these methods is how the subgoal states are chosen:
betweenness centrality, visitation frequencies, and bottleneck states in a minimum cut problem.
Since these methods use a Global STG, their use is restricted to small problems, given that a Global
STG does not scale well for problems with a large state space.
is a parameter of the algorithm. A baseline agent was compared with an agent using skills with dif-
ferent scoring functions: betweenness, closeness, and the proposed CGS. The adopted benchmarks
were the Two-Room and taxi domains, where the agent using skills based on the CGS centrality
managed to reach the goal in fewer steps, followed by betweenness and closeness centralities. As
expected, the baseline agent required a greater number of steps to reach the goal.
Concept Bridge Centrality [53] extends the previous concept of Connection Graph Stability centrality by considering both local and global information of a given node, allowing it to better filter out redundant bottleneck nodes. Like its predecessor, it identifies bottleneck nodes. The local information is given by the bridge coefficient, which measures how much the degree of a given node differs from the degrees of its neighbors, as follows:
BR(u) = e^{α |d_u − (1/d_u) Σ_{u_j ∈ N(u)} d_{u_j}|}, (12)
where α > 0 is a constant value. High values of the bridge coefficient indicate that a node has a
considerably lower degree with respect to its neighbors, suggesting that it must be a bottleneck
node. The global information is measured by any centrality measure; Moradi et al. [53] used the
Connection Graph Stability in their experiments. As a result, this approach finds a reduced num-
ber of bottleneck states that are then used to create macro-actions through the options framework.
The graph and skill construction used in that work are identical to the approach used in the authors'
previous work [61]. Both previous centrality measures proposed by the authors, namely, the Con-
nection Graph Stability and Concept Bridge centralities, were compared with similar approaches
using betweenness [12] and closeness centralities. Results show that these two new centralities
are capable of better filtering the important nodes, thus achieving a better performance.
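Under the reconstruction of Equation (12) given above, the bridge coefficient of a node can be computed as sketched below (NetworkX, undirected STG assumed). This is an illustration only, not the original implementation of [53].

import math
import networkx as nx

def bridge_coefficient(G, u, alpha=1.0):
    """BR(u) = exp(alpha * |d_u - average degree of u's neighbors|), cf. Eq. (12)."""
    d_u = G.degree(u)
    if d_u == 0:
        return 0.0                    # isolated node: no bridging role
    neighbor_degree_sum = sum(G.degree(v) for v in G.neighbors(u))
    return math.exp(alpha * abs(d_u - neighbor_degree_sum / d_u))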
Fig. 5. Illustration of the macro-action discovery process using clustering algorithms. The process starts with the STG of a given problem (a). It then identifies the clusters (b) of the initial graph (a). Each cluster is associated with an abstract state (c), and the connections between abstract states are associated with a macro-action.
visitation frequency is higher than that of their neighbors. This helps to disregard highly visited states
that are not important. The authors used the Two-Room environment to show that their approach
is capable of identifying only doorways as subgoals, effectively filtering nonimportant states.
Another approach that uses bottleneck states is presented by Goel and Huber [22], where the
authors propose to identify bottlenecks by checking the number of predecessor and successor
states of a given state. Whenever a state has many predecessors and few successor states, the
given state is a bottleneck. The proposed method is tested in a variation of the Four-Room domain,
where it only identifies doorways as subgoals. The skilled agent also learned to reach the goal
with considerably fewer training steps when compared with a baseline agent using only primitive
actions.
Even though the methods presented in the current subsection do not use an STG, they all rely
on statistical information of each node. However, in large or continuous problems, it is difficult to
associate a set of information with each node, given that there are many or even infinite nodes.
Since these approaches were only tested in simple problems, it is difficult to know how well they
perform for more complex problems.
then built or updated (in case it already existed from previous episodes) and is then trained using
experience replay.
The main concept of clustering-based skill acquisition methods is to build an option capable
of transitioning the agent from one cluster Ci to another C j (these cluster transitions are repre-
sented by the edges in Figure 5(c)). Since these cluster transitions are controlled by an option, such
transitions may take several primitive actions to occur. These primitive actions are chosen based
on the intrapolicy πo , learned through experience replay using the pseudo-reward function pre-
viously described. Given that each cluster represents a collection of similar states, an option that
transitions the agent between clusters can be seen as an abstract behavior.
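A minimal way to express such a cluster-transition option is to give it a pseudo-reward that only pays off when the agent enters the target cluster and a termination condition that fires as soon as the agent leaves the source cluster, as sketched below. The cluster labels are assumed to come from whichever clustering algorithm is used.

def make_cluster_option(cluster_of, source, target):
    """Builds the components <I, pseudo_reward, beta> of an option that moves
    the agent from cluster `source` to cluster `target`.
    cluster_of: dict mapping each state to its cluster label."""
    initiation_set = {s for s, c in cluster_of.items() if c == source}

    def pseudo_reward(s, a, s_next):
        # Reward only transitions that enter the target cluster.
        return 1.0 if cluster_of.get(s_next) == target else 0.0

    def beta(s):
        # Terminate (probability 1) as soon as the agent leaves the source cluster.
        return 0.0 if cluster_of.get(s) == source else 1.0

    return initiation_set, pseudo_reward, beta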
Unlike the centrality-based methods, no skill acquisition approach using clustering algorithms adopts a Global STG. All methods described here use a Local STG. Hence, the main differences between the clustering-based methods are the clustering algorithm used and how the importance of a state is defined. Several new clustering techniques were developed in this area
to ensure that the STG is well separated. The quality of the resulting clusters directly impacts the
quality of the skill acquisition step. In the remainder of this section, we will present the clustering
methods proposed for graph-based skill acquisition using Local STGs.
One of the first approaches to actually use clustering algorithms to find skills in an RL problem is
presented by Mannor et al. [42]. The adopted clustering function focuses on creating clusters with
roughly the same size and with similar reinforcement signal transitions. This approach allows the
agent to identify promising clusters by preferring options that lead the agent to clusters with high
reinforcement signals. However, a key question arises: when is the best moment to execute the
clustering algorithm? This is important since it is infeasible to perform this operation after each
episode due to execution time constraints. The authors state that this issue is problem dependent. The approach they suggest is to run the clustering algorithm when no new states have been visited after a certain number of episodes (this number is a parameter). The authors compared
an agent using their method with a baseline agent adopting only primitive actions in a 19-Room
domain, showing that their approach manages to reach the destination in fewer steps.
Following the idea presented by Şimşek and Barto [13] of defining important states as those that
connect different regions of the state space, Şimşek et al. [15] propose a new metric to determine
the states that have such a characteristic. The authors propose the L-Cut algorithm, where the idea
is to determine a cut of the state trajectory graph that separates different regions. An edge’s weight
is associated with the number of times a given transition was experienced. The graph cut metric
used is the Normalized Cut (NCut) [64], which separates the STG into two distinct clusters and
measures the relation between the probability of transitioning between clusters and the probability
of transitioning within the same cluster. Therefore, the goal here is to find a minimum NCut, which
is a cut that effectively identifies the edges that separate the two clusters. Since finding a minimal
NCut is an NP-hard problem, the authors first use the spectral clustering method proposed in [64],
which outputs an approximation of the minimum NCut and an approximate cluster separation, as
we have already shown in Section 3.2.2. The L-Cut algorithm then proceeds to test the possible cuts
suggested by the spectral clustering method by analyzing the second eigenvector and choosing
the cut with the minimal NCut. This procedure is performed once for each newly obtained state
trajectory. Options are designed to take the agent from a given state to a state output by the L-Cut
algorithm. The initiation set is defined based on cuts that identified the given state as a subgoal and
are represented by the states from the opposite cluster. An agent using the L-Cut algorithm was
compared with an agent using the Relative Novelty (RN) method [13] and a baseline Q-Learning
agent with no skills in the Two-Room and taxi domains, where the L-Cut method outperforms both
other methods, although the gain over RN is considerably smaller.
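For concreteness, the NCut value of a candidate two-way partition of a weighted STG can be computed as below (NumPy), following the definition of [64]: the weight crossing the cut, normalized by the total weight touching each side. The L-Cut algorithm evaluates candidate cuts of this kind, obtained from the second Laplacian eigenvector, and keeps the one with the smallest value.

import numpy as np

def ncut(W, in_A):
    """W: (n, n) symmetric weight matrix (edge weights = observed visit counts).
    in_A: boolean array marking the nodes in cluster A (the rest form B)."""
    in_B = ~in_A
    cut = W[np.ix_(in_A, in_B)].sum()      # total weight crossing the cut
    assoc_A = W[in_A, :].sum()             # total weight touching cluster A
    assoc_B = W[in_B, :].sum()             # total weight touching cluster B
    return cut / assoc_A + cut / assoc_B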
Kheradmandian and Rahmati [31] propose an approach based on two clustering algorithms over
the STG G in order to find subgoal states. The first algorithm is a topology clustering algorithm
that iteratively removes specific edges from G in order to identify clusters, resulting in a new graph G′ with nonconnected clusters. The second is a clustering algorithm based on the value function, performed over G′. It starts by defining the distance between two states as being the
difference between their value functions. It then creates clusters such that the distances between
states within the same clusters are minimized, while maximizing the distance between states of
different clusters. An option is then built in one of two ways: (1) action sequences that transition
the agent between clusters or (2) frequent action sequences. The former method creates one option
for each connection between clusters, similar to the approach presented in Algorithm 4, where the shortest path for each transition is learned separately through Dynamic Programming [66]. The latter method uses statistical analysis to identify frequent action sequences used to reach a given subgoal. Each sequence identified as frequent is assigned to an option whose initiation set is given by the starting state of the sequence, whose termination condition is set to 1 for the subgoal state targeted by the specific sequence, and whose subpolicy is determined by the sequence
itself. A drawback of this approach is that the adopted statistical analysis method may not prove
useful in large or continuous state spaces, given that the same sequence of actions may hardly
occur a second time. The authors used the 19-Room and the taxi domains to test their method,
showing that it outperforms the basic Q-Learning algorithm.
Kazemitabar and Beigy [30] present a novel approach where subgoals are identified through
Strongly Connected Components (SCCs). Other approaches of skill acquisition have already used
SCC in some way [12, 25], but [30] present a skill acquisition method that relies solely on the
identification of SCCs. First, Q-Learning is used during a number of iterations in order to obtain
a viable Local STG. Second, SCCs are identified using a linear algorithm proposed by the authors.
Third, subgoal states are identified as being states that connect two SCCs, similar to previous
approaches that consider important states as being those that connect different regions of the state
space [13, 46, 48]. The final step consists of creating options for reaching such subgoals, following
a similar procedure to Algorithm 4. This approach is tested in the Two-Room domain, in which
the skilled agent using the proposed approach outperforms the baseline agent.
A new clustering algorithm created to identify important nodes is presented by Entezari et al.
[21]. The authors propose a new metric to measure the quality of a cluster based on the relation
of (1) connections between border nodes and other nodes of the given cluster and (2) the number
of connections of these same border nodes with their neighbors in different clusters. The metric
is given by
R(X) = ( Σ_{i ∈ X, j ∈ b(X)} a_ij ) / ( Σ_{i ∈ n(X), j ∈ b(X)} a_ij ), (13)
where b(X) is the set of border nodes of cluster X, n(X) is the set of neighbors of border nodes that belong to a different cluster, and a_ij indicates the element (i, j) of the adjacency matrix. The objective is therefore to maximize R(X). This is accomplished by adding nodes that increase the value of R(X); that is, node i is included in cluster X if R(X + i) > R(X). The agent based on the
proposed skill acquisition method is tested in the Four-Room domain, once again outperforming
the baseline agent.
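A rough sketch of this greedy cluster-growing rule, using the reconstruction of Equation (13) above and NetworkX, is given below; the choice of the seed node and the tie-breaking are assumptions of this example, not details of [21].

import networkx as nx

def r_metric(G, X):
    """R(X) from Eq. (13): internal vs. external connectivity of X's border nodes."""
    X = set(X)
    border = {i for i in X if any(j not in X for j in G.neighbors(i))}
    internal = sum(1 for j in border for i in G.neighbors(j) if i in X)
    external = sum(1 for j in border for i in G.neighbors(j) if i not in X)
    return internal / external if external else float('inf')

def grow_cluster(G, seed):
    """Greedily add neighboring nodes to the cluster while they increase R(X)."""
    X = {seed}
    improved = True
    while improved:
        improved = False
        frontier = {j for i in X for j in G.neighbors(i)} - X
        for cand in frontier:
            if r_metric(G, X | {cand}) > r_metric(G, X):
                X.add(cand)
                improved = True
    return X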
Davoodabadi and Beigy [16] propose a clustering algorithm that blends two different ap-
proaches: Edge Betweenness [54] and Label Propagation [62] algorithms. The former algorithm
iteratively removes the edge with the highest betweenness and analyzes the resulting clusters.
At each iteration, it calculates a measure that indicates the quality of the resulting cluster, called
the modularity measure, given by
Q = Σ_i (e_ii − a_i²), (14)
where e_ij is an entry of a k × k matrix (k being the number of existing clusters) corresponding to the fraction of edges connecting clusters i and j, while a_i is the fraction of edge endpoints attached to cluster i. The value of Q varies between 0 and 1, where values near 1 indicate a
strong community structure, while 0 indicates a random structure. The latter algorithm starts by
giving a unique label to each node, and for each iteration, each node receives the most common
label among its neighbors. This is repeated until a stopping criterion is satisfied. The algorithm
proposed in [16] performs the following steps: (1) run the label propagation algorithm and (2) for
each pair of clusters (i, j) resulting from the first step, calculate the modularity variation ΔQ(i, j) of joining clusters i and j, and then merge the pair of clusters associated with the highest ΔQ(i, j). This is repeated until the modularity stops increasing. Subgoals are then identified as being the border nodes
of each cluster. This approach was then compared with the Edge Betweenness and the Label Prop-
agation algorithms for skill acquisition in RL problems, showing that the proposed new algorithm
performed better, since it was capable of achieving better clustering formations. The experimental
results achieved follow the same idea as previous skill acquisition methods, where an agent based
on the proposed method outperforms the baseline agent in the taxi domain.
Following an approach similar to the previous work [16], a new variation of the Label
Propagation Algorithm (LPA) [62], proposed by Bacon and Precup [2], is used in order to separate
similar states into clusters and then identify subgoal states as border states of these clusters. The
novelty proposed by the authors relies on the utilization of a variation of the LPA that consid-
ers a weighted Local STG, where the weight of an edge is equal to the number of times a given
transition was observed. Therefore, the label assigned to a given node is not the label occurring most often among its neighbors, but instead the label with the highest cumulative edge weight among the node's neighbors. After clustering the Local STG, the subgoals are defined as being the
border nodes, similar to previous approaches [12, 16]. The proposed skill acquisition method using
LPA is compared with the basic Q-Learning method and to other skill acquisition methods using
betweenness and closeness centralities in the Four-Room domain. The LPA method outperforms
the Q-Learning method but is outperformed by the betweenness and closeness methods. Despite
this, the authors show how to extend their LPA approach to continuous environments.
Taghizadeh and Beigy [69] propose two clustering approaches for skill acquisition: the first is based on spectral clustering and a variation of the k-means algorithm, called k′-means [72], while the second approach uses Eigenvector Centrality (EVC) to separate clusters. The presented spectral-based subgoal discovery performs a spectral clustering over the Local STG in order to obtain a set of k clusters, with k being a parameter informed beforehand. Since it is difficult to know a priori the number of clusters in the STG of an RL problem, the authors propose to use the k′-means algorithm in order to identify clusters in the eigenvector matrix U (Section 3.2.2). The k′-means algorithm receives as input the maximum number of clusters, given by k, and returns k′ clusters, where k′ < k and k′ is an approximation to the correct number of clusters. Therefore, the number of clusters
does not need to be informed beforehand. The main drawback of this approach, as pointed out by
the authors, is the high computational cost. Hence, the authors propose a second approach, called
EVC-based subgoal discovery, where the eigenvector centrality of each node is used in order to
separate states into different clusters. To do this, the authors consider the EVC variation over the
graph and categorize states into three possible classes: (1) cluster center, composed of the states
with the highest EVC among its neighbors; (2) cluster member, which encompasses those states
with a lower EVC than at least one of its neighbors and only has neighbors of the same cluster;
and (3) cluster border, which includes states with an EVC lower than at least one of its neighbors and
has neighbors from different clusters. Subgoals are then identified as being cluster border states and
skills are then formed in a similar way as in [13, 14]. Finally, both approaches adopt a skill pruning
technique in order to avoid learning inefficient or redundant skills. Whenever an option created has
an initiation set with a value function higher than the subgoal state, this skill is removed. The EVC-
based approach is then compared with other graph-based skill acquisition approaches, namely,
node betweenness [14], strongly connected components [30], edge betweenness, and reform label
propagation [62]; it was shown that the EVC-based approach outperforms the other mentioned
approaches in the Six-Room and taxi domains.
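The EVC-based classification into cluster centers, members, and borders can be sketched as follows with NetworkX. Assigning each node to the cluster of the local EVC maximum reached by hill climbing is an illustrative simplification of the procedure in [69].

import networkx as nx

def evc_subgoals(G, max_iter=1000):
    """Classify nodes by eigenvector centrality (EVC) and return border states."""
    evc = nx.eigenvector_centrality(G, max_iter=max_iter)

    def climb(v):
        # Follow the EVC gradient up to a local maximum, which acts as the
        # cluster center for node v.
        current = v
        while True:
            best = max(G.neighbors(current), key=evc.get, default=current)
            if evc.get(best, 0.0) <= evc[current]:
                return current            # current is a cluster center
            current = best

    cluster = {v: climb(v) for v in G.nodes}
    # Border states have at least one neighbor assigned to a different cluster.
    borders = [v for v in G.nodes
               if any(cluster[u] != cluster[v] for u in G.neighbors(v))]
    return borders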
The approach proposed by Mathew et al. [45] and later expanded by Krishnamurthy et al. [33]
uses a spectral clustering algorithm called PCCA+ [74] in order to identify abstract states that
represent a set of states with similar characteristics. The PCCA+ method receives the transition
matrix of a given graph as an input and returns a membership function X_{i,j} that quantifies the membership of a given state s_i to an abstract state S_j. Each connection between two abstract states S_i and S_j is used to create an option. This transition between abstract states is accomplished by visiting neighbor states following the positive gradient of the membership function of the destination abstract state. The option ends when a state s_y ∈ S_j is reached, S_j being the destination abstract
state. This method is then coupled with any reinforcement learning algorithm in order to learn
the optimal policy for the actions and options. Another novel idea presented in [33] is how to
extend the PCCA+ spectral clustering method to define options for Atari games using the Arcade
Learning Environment [6]. Since running the PCCA+ method for high-dimensional problems is
very expensive, the authors adapted the Deep Learning network architecture used for predicting
future frames in Atari games [55] in order to predict future frames using a high-level represen-
tation, reducing the problem dimension and thus making the execution of the PCCA+ algorithm
more feasible. This method is then coupled with the Deep Q-Network (DQN) [51] RL model to
learn when to use each macro-action identified by the proposed algorithm. PCCA+ is compared
with the L-Cut algorithm [15] in the Two-Room and taxi domains, where it manages to identify
fewer subgoals, resulting in a better overall performance. One of the main novelties of this work is
that it is one of the first graph-based skill acquisition approaches to be tested in the ALE domain,
where the PCCA+ method is shown to enhance the overall performance of the learning agent in
the Seaquest Atari game when compared with the results obtained using DQN.
V^π = α_1 e_1 + · · · + α_{|S|} e_{|S|}, (15)
where α is a vector of size |S| that must be learned and each e_i is a proto-value function, that is, an eigenvector of the graph Laplacian L. This idea was then used by Osentoski
and Mahadevan [56]. Using the MAXQ framework [18] to construct macro-actions, the authors
propose a method for creating different state abstractions for each subtask found. The idea here
is that not all state variables are useful for all macro-actions. In some cases, in order to learn the
optimal policy for a macro-action, it is desirable to remove some nonimportant state variables,
although these same state variables may be useful for other macro-actions. The approach works
as follows: First, it is necessary to build an STG based on a state transition history of the agent
for the past n episodes. The authors then propose a graph reduction method capable of joining
states (represented by nodes) with similar connections and with only one different state variable
into a single abstract state that is represented without this state variable. This results in a graph
with abstract states represented by different sets of state variables. Proto-value functions are then
used over each graph G i representing a subtask i in order to obtain the basis function ϕ i (s, a). The
basis function is given by the first k eigenvectors of the Laplacian of graph G i associated with the
smallest eigenvalues, where k is the number of state variables associated with subtask i.
Value function approximation through proto-value functions was also recently used by Machado
et al. [37] in order to automatically identify options. Each proto-value function, represented by an
eigenvector ei of L, is used to create a reward function for a given option, called eigenpurpose
by the authors. After training, each eigenpurpose gives rise to a policy, called an eigenbehavior, that maps states to actions for the corresponding eigenpurpose. Instead of adding the learned options to the agent's
action set and learning a policy over options, the authors take a different approach, where options
are executed randomly (until completion) in order to better explore the state space, allowing the
agent to reach otherwise inaccessible state regions. It is important to note that options are only
defined after an initial exploratory step where the agent builds the STG, allowing the construction
of the Laplacian matrix L. The authors show through a series of experiments that this approach
achieves good results and that the options found are not always centered on the idea of reaching
bottleneck states, as done by many previous works [22, 46, 48, 61]. It is also shown that the options
found by that method perform better than bottleneck options. The authors conclude their work
by presenting how to use their method with function approximation for large state spaces: instead
of using an adjacency matrix (from which the Laplacian matrix is derived), the authors use a matrix T ∈ R^{t×d}, where t is the number of observed transitions during the sampling process and d is the number of state features. Each row of matrix T stores an observed transition by saving the difference
between the feature vectors of the previous and current states. The eigenvectors associated with T
are then used to create a set of eigenpurposes and eigenbehaviors. This approach is then coupled with a
Deep Reinforcement Learning method (Section 2.1) and tested in the Arcade Learning Environment,
showing promising results.
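As an illustration of the eigenpurpose idea, the sketch below derives intrinsic reward functions from the Laplacian eigenvectors of a tabular STG; training a policy to maximize each such reward would yield the corresponding eigenbehavior. This is a simplified reading of [37], not the authors' code, and the tabular indexing of states is an assumption of the example.

import numpy as np

def eigenpurposes(A, num_options):
    """A: (n, n) adjacency matrix of the STG sampled during exploration.
    Returns one intrinsic reward function per eigenpurpose."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    _, eigvecs = np.linalg.eigh(L)       # eigenvectors sorted by eigenvalue

    rewards = []
    for i in range(num_options):
        e = eigvecs[:, i + 1]            # skip the constant first eigenvector

        def intrinsic_reward(s, s_next, e=e):
            # Reward the change of the eigenpurpose along the transition.
            return e[s_next] - e[s]

        rewards.append(intrinsic_reward)
    return rewards

# Training a policy to maximize each intrinsic_reward yields the corresponding
# eigenbehavior, which can then be executed to improve exploration.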
The previous work was later improved by using Successor Representation (SR) [17]. In brief, SR
indicates the expected future state occupancy for a given state. SR is given by a function that maps a state pair (s, s′) to R. The value function can then be decomposed as a product between the SR of (s, s′) and the reward received during this transition. The SR framework was then extended by
Barreto et al. [4] to problems with nonenumerated states that use function approximation, called
the Successor Features (SF). Stachenfeld et al. [65] showed that there is a strong relationship be-
tween the eigenvectors of the matrix representation of the SR and the eigenvectors of the Laplacian
associated with the STG. Using this relationship, Machado et al. [38] improved their Laplacian Frame-
work [37] by using SR (or SF when using function approximation) as a feature representation of
a state, instead of using raw pixel data. The authors augmented the SF representation by defining
SF over a lower-dimension representation of a state given by the state representation learned by
the NN responsible for predicting future images of a given Atari state [55], as previously used by
Krishnamurthy et al. [33]. Finally, each row of the transition matrix T is now given by the SF rep-
resentation of a given state. Options are then defined similarly to the Laplacian Framework [37].
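For a fixed policy with transition matrix P over an enumerable state space, the SR has the well-known closed form M = (I − γP)^{-1}, whose entry M[s, s′] is the expected discounted number of future visits to s′ when starting from s; the value function then decomposes as V = M r. The sketch below only makes this definition concrete and is not the incremental SF estimator of [4, 38].

import numpy as np

def successor_representation(P, gamma=0.95):
    """P: (n, n) state-transition matrix under a fixed policy.
    Returns M with M[s, s'] = expected discounted future occupancy of s'."""
    n = P.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * P)

def value_from_sr(M, r):
    """Decompose the value function as V = M r, with r the expected reward
    received upon entering each state."""
    return M @ r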
Initially, many graph-based skill acquisition methods worked directly with the Global STG of an
RL problem [12, 48], since these methods were only applied to relatively simple RL problems. As
more complex problems appeared, researchers started focusing on graph-based methods capable
of working with a Local STG, therefore requiring only a full transition history [13, 42, 61]. Many
other methods started to use an abstract STG, derived from the Local STG, in order to work with
smaller graphs [33, 56].
Currently, with the appearance of even more complex problems (ATARI games, general-purpose
AI [5], autonomous vehicles, etc.), RL methods became more demanding [50, 51], requiring much
more data in order to efficiently learn how to solve a given problem. This results in increasingly large Local STGs, to the point where it becomes infeasible to create a graph using the whole transition history. This is a major challenge that graph-based skill acquisition methods must properly address; otherwise, graph-based methods will fail to properly define skills in current complex problems. Therefore, it is necessary to search for new approaches for creating an abstract STG that effectively represents the environment dynamics without requiring the construction of a fully detailed Local STG. Such approaches could, for example, merge the state an RL agent is currently visiting with a previously visited state that has similar characteristics (an approximate state representation, for example), resulting in the construction of an abstract STG while the agent is transitioning between states.
Spectral analysis allows the identification of not only skills but also their respective intrapoli-
cies, achieved through the approximation of the value function for a given reinforcement learning
problem [37]. Other interesting spectral analysis methods are also available in the literature; al-
though they focus on different domains, it may be possible to adapt them to the context of skill
acquisition. For example, the approach presented in [75] to identify critical nodes for network ro-
bustness can be easily adapted to the skill acquisition problem. This method is used for finding
critical nodes in a complex network by using a local scoring function for each node that measures
how critical the corresponding node is to the network robustness. The score given to a node v is
associated with the spectral gap λ_2 [52] of the subnetwork formed by the nodes reachable within h hops from node v. This score lies in [0, 1], where a score S_v = 1 indicates that node v is the most locally critical node in its neighborhood. This also allows navigating from any node to
a critical node by just going to the neighbor node with the highest score. Therefore, this approach
can then be easily adopted for skill acquisition, where critical nodes represent subgoals and the
navigation process can be associated with skills with their respective intrapolicies already defined.
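A rough sketch of this idea is given below, using NetworkX to extract each node's h-hop neighborhood and scoring it by the spectral gap λ_2 of that subnetwork. The normalization that maps these gaps to the score S_v ∈ [0, 1] defined in [75] is not reproduced here, so the raw values only serve as an illustration.

import numpy as np
import networkx as nx

def local_spectral_gap(G, v, h=2):
    """Spectral gap (second-smallest Laplacian eigenvalue) of the subnetwork
    induced by the nodes reachable within h hops from node v."""
    sub = nx.ego_graph(G, v, radius=h)
    L = nx.laplacian_matrix(sub).toarray().astype(float)
    eigvals = np.sort(np.linalg.eigvalsh(L))
    return eigvals[1] if len(eigvals) > 1 else 0.0

# One local gap per node; turning these gaps into the score S_v in [0, 1]
# (and hence into subgoals with ready-made intrapolicies) follows [75].
# gaps = {v: local_spectral_gap(G, v) for v in G.nodes}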
Graph-based skill acquisition applied to complex RL problems results in potentially very large
Local STGs, as we previously mentioned. To deal with these potentially very large graphs, one can
adopt solutions for processing large-scale graphs that recently flourished in the literature [28, 36,
40]. This allows the graph-based skill acquisition methods to use a more complete STG, helping the
learning agent to achieve a better performance. However, this approach would only be justifiable in
cases where the agent’s learning curve is heavily impacted by skill acquisition methods. Otherwise,
the extra processing time might not be worth it. Another approach is to conceive other graph
processing methods capable of reducing the Local STG, or even new approaches to directly create
reduced STGs, as previously mentioned in the current section. The idea is to allow the agent to
use smaller STGs that efficiently approximate the Global STG. These contrasting approaches are
still underdeveloped, with plenty of room for improvement.
As a final remark, a common problem not only in the machine-learning area but also in
many research areas is how to effectively compare one proposed method with other previous
alternatives. This becomes an especially complicated challenge when there is no established frame-
work in the area capable of providing an efficient test bench for researchers. This problem is also
present in the skill acquisition area, where (1) each researcher uses different problems to test his or
her proposed methods or, in other cases, (2) uses different implementations of the same problem
to compare their method with others. As a consequence of the former problem, authors have difficul-
ties in effectively comparing their methods with other methods in the area, given that there is no
established benchmark or baseline problem. With each method running for different problems, a
direct comparison between concurrent methods becomes a challenge. The latter problem refers to
implementation differences of the same method, given that some papers in the area do not fully de-
scribe their experimental setup, making it difficult to replicate their results. However, this scenario
is slowly changing at least in the RL area with the proposal of different benchmarks that imple-
ment several complex problems, such as the Arcade Learning Environment [6], the benchmark for
continuous control [20], OpenAI Gym [8], and, more recently, the DeepMind Lab [5]. Therefore, it
is important that future work related to skill acquisition adopts such benchmarks in order to allow
a fair and easy comparison of any proposal against previous methods.
REFERENCES
[1] Pierre-Luc Bacon, Jean Harb, and Doina Precup. 2017. The option-critic architecture. In Proceedings of the AAAI
Conference on Artificial Intelligence. 1726–1734.
[2] Pierre-Luc Bacon and Doina Precup. 2013. Using label propagation for learning temporally abstract actions in rein-
forcement learning. In Proceedings of the Workshop on Multiagent Interaction Networks. 1–7.
[3] Albert-László Barabási. 2016. Network Science. Cambridge University Press.
[4] Andre Barreto, Will Dabney, Remi Munos, Jonathan J. Hunt, Tom Schaul, David Silver, and Hado P. van Hasselt.
2017. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems
(NIPS’17). Curran Associates, 4058–4068.
[5] Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq,
Simon Green, Víctor Valdés, Amir Sadik, Julian Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain,
Adrian Bolton, Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, and Stig Petersen. 2016. DeepMind lab.
arXiv preprint arXiv:1612.03801 (2016).
[6] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. 2013. The arcade learning environment: An
evaluation platform for general agents. Journal of Artificial Intelligence Research 47, 1 (May 2013), 253–279.
[7] Katy Börner, Soma Sanyal, and Alessandro Vespignani. 2007. Network science. Annual Review of Information Science
and Technology 41, 1 (2007), 537–607.
[8] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba.
2016. OpenAI gym. arXiv preprint arXiv:1606.01540 (2016).
[9] Fei Chen, Yang Gao, Shifu Chen, and Zhenduo Ma. 2007. Connect-based subgoal discovery for options in hierarchical
reinforcement learning. In Proceedings of the International Conference on Natural Computation (ICNC’07), Vol. 4. 698–
702. DOI:https://doi.org/10.1109/ICNC.2007.312
[10] Chung-Cheng Chiu and Von-Wun Soo. 2011. Subgoal identifications in reinforcement learning: A survey. Advances in Reinforcement Learning (2011), 181–188. DOI:https://doi.org/10.5772/13214
[11] Özgür Şimşek. 2008. Behavioral Building Blocks for Autonomous Agents: Description, Identification, and Learning. Ph.D.
Dissertation. University of Massachusetts.
[12] Özgür Şimşek and Andrew G. Barto. 2007. Betweenness centrality as a basis for forming skills. Technical Report, University of
Massachusetts, Department of Computer Science.
[13] Özgür Şimşek and Andrew G. Barto. 2004. Using relative novelty to identify useful temporal abstractions in rein-
forcement learning. In Proceedings of the International Conference on Machine Learning (ICML’04). ACM, New York,
95–103. DOI:https://doi.org/10.1145/1015330.1015353
[14] Özgür Şimşek and Andrew G. Barto. 2008. Skill characterization based on betweenness. In Advances in Neural Infor-
mation Processing Systems (NIPS’08), 1497–1504.
[15] Özgür Şimşek, Alicia P. Wolfe, and Andrew G. Barto. 2005. Identifying useful subgoals in reinforcement learning by
local graph partitioning. In Proceedings of the International Conference on Machine Learning (ICML’05). ACM, New
York, 816–823. DOI:https://doi.org/10.1145/1102351.1102454
[16] Marzieh Davoodabadi and Hamid Beigy. 2011. A new method for discovering subgoals and constructing options in
reinforcement learning. In Indian International Conference on Artificial Intelligence (IICAI’11). 441–450.
[17] Peter Dayan. 1993. Improving generalization for temporal difference learning: The successor representation. Neural
Computation 5, 4 (1993), 613–624.
[18] Thomas G. Dietterich. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition.
Journal of Artificial Intelligence Research 13, 1 (Nov. 2000), 227–303.
[19] Bruce L. Digney. 1998. Learning hierarchical control structures for multiple tasks and changing environments. From
Animals to Animats 5: Proceedings of the International Conference on the Simulation of Adaptive Behavior. 321–330.
[20] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. 2016. Benchmarking deep reinforcement
learning for continuous control. In Proceedings of the International Conference on International Conference on Machine
Learning (ICML’16), Vol. 48. 1329–1338.
[21] Negin Entezari, Mohammad Ebrahim Shiri, and Parham Moradi. 2010. A local graph clustering algorithm for discov-
ering subgoals in reinforcement learning. In Communications in Computer and Information Science. Springer, Berlin,
41–50. DOI:https://doi.org/10.1007/978-3-642-17604-3_5
[22] Sandeep Goel and Manfred Huber. 2003. Subgoal discovery for hierarchical reinforcement learning using learned
policies. In Proceedings of the International Florida Artificial Intelligence Research Society Conference, Ingrid Russell
and Susan M. Haller (Eds.). AAAI Press, 346–350.
[23] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS’14).
Curran Associates, 2672–2680.
[24] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Rezende, and Daan Wierstra. 2015. DRAW: A recurrent neural
network for image generation. In Proceedings of the International Conference on Machine Learning (Proceedings of
Machine Learning Research), Francis Bach and David Blei (Eds.), Vol. 37. PMLR, Lille, France, 1462–1471.
[25] Bernhard Hengst. 2002. Discovering hierarchy in reinforcement learning with HEXQ. In Proceedings of the Interna-
tional Conference on Machine Learning (ICML’02). Morgan Kaufmann Publishers, San Francisco, 243–250.
[26] Bernhard Hengst. 2003. Discovering Hierarchy in Reinforcement Learning. Ph.D. Dissertation. University of New South
Wales.
[27] Bernhard Hengst. 2004. Model approximation for HEXQ hierarchical reinforcement learning. In Proceedings of
the European Conference on Machine Learning (ECML’04). Springer, Berlin,144–155. DOI:https://doi.org/10.1007/
978-3-540-30115-8_16
[28] Alexandru Iosup, Tim Hegeman, Wing Lung Ngai, Stijn Heldens, Arnau Prat-Pérez, Thomas Manhardto, Hassan
Chafio, Mihai Capotă, Narayanan Sundaram, Michael Anderson, Ilie Gabriel Tănase, Yinglong Xia, Lifeng Nai, and
Peter Boncz. 2016. LDBC graphalytics: A benchmark for large-scale graph analysis on parallel and distributed plat-
forms. Proceedings of the VLDB Endowment 9, 13 (2016), 1317–1328. DOI:https://doi.org/10.14778/3007263.3007270
[29] Andrej Karpathy and Li Fei-Fei. 2017. Deep visual-semantic alignments for generating image descriptions. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence 39, 4 (2017), 664–676.
[30] Seyed Jalal Kazemitabar and Hamid Beigy. 2009. Using strongly connected components as a basis for autonomous skill
acquisition in reinforcement learning. In Proceedings of the International Symposium on Neural Networks (ISNN’09).
Springer, Berlin, 794–803. DOI:https://doi.org/10.1007/978-3-642-01507-6_89
ACM Computing Surveys, Vol. 52, No. 1, Article 6. Publication date: February 2019.
Graph-Based Skill Acquisition 6:25
[31] Ghorban Kheradmandian and Mohammad Rahmati. 2009. Automatic abstraction in reinforcement learning using data
mining techniques. Robotics and Autonomous Systems 57, 11 (2009), 1119–1128. DOI:https://doi.org/10.1016/j.robot.
2009.07.002
[32] George Konidaris and Andrew Barto. 2009. Skill discovery in continuous reinforcement learning domains using skill
chaining. In Advances in Neural Information Processing Systems (NIPS’09), Y. Bengio, D. Schuurmans, J. Lafferty, C.
Williams, and A. Culotta (Eds.). 1015–1023.
[33] Ramnandan Krishnamurthy, Aravind S. Lakshminarayanan, Peeyush Kumar, and Balaraman Ravindran. 2016. Hier-
archical reinforcement learning using spatio-temporal abstractions and deep neural networks. In Proceedings of the
International Conference on Machine Learning.
[34] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural
networks. In Advances in Neural Information Processing Systems (NIPS’12). Curran Associates, 1097–1105.
[35] Long-Ji Lin. 1992. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine
Learning 8, 3 (1992), 293–321. DOI:https://doi.org/10.1007/BF00992699
[36] Yi Lu, James Cheng, Da Yan, and Huanhuan Wu. 2014. Large-scale distributed graph computing systems: An experi-
mental evaluation. Proceedings of the VLDB Endowment 8, 3, 281–292. DOI:https://doi.org/10.14778/2735508.2735517
[37] Marlos C. Machado, Marc G. Bellemare, and Michael H. Bowling. 2017. A Laplacian framework for option discovery
in reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML’17). 2295–2304.
[38] Marlos C. Machado, Clemens Rosenbaum, Xiaoxiao Guo, Miao Liu, Gerald Tesauro, and Murray Campbell. 2017.
Eigenoption discovery through the deep successor representation. CoRR abs/1710.11089 (2017). Retrieved from
http://arxiv.org/abs/1710.11089.
[39] Sridhar Mahadevan. 2005. Proto-value functions: Developmental reinforcement learning. In Proceedings of the Inter-
national Conference on Machine Learning (ICML’05). ACM, New York, 553–560. DOI:https://doi.org/10.1145/1102351.
1102421
[40] Grzegorz Malewicz, Matthew H. Austern, and Aart J. C. Bik. 2010. Pregel: A system for large-scale graph processing.
In Proceedings of the ACM SIGMOD International Conference on Management of Data. 135–145.
[41] Daniel J. Mankowitz, Timothy A. Mann, and Shie Mannor. 2016. Adaptive skills adaptive partitions (ASAP). In Ad-
vances in Neural Information Processing Systems (NIPS’16), D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R.
Garnett (Eds.). Curran Associates, 1588–1596.
[42] Shie Mannor, Ishai Menache, Amit Hoze, and Uri Klein. 2004. Dynamic abstraction in reinforcement learning via
clustering. In Proceedings of the International Conference on Machine Learning (ICML’04). ACM, New York, 71–78.
DOI:https://doi.org/10.1145/1015330.1015355
[43] Oded Maron. 1998. Learning from Ambiguity. Ph.D. Dissertation. Massachusetts Institute of Technology.
[44] Oded Maron and Tomás Lozano-Pérez. 1998. A framework for multiple-instance learning. In Proceedings of the Con-
ference on Advances in Neural Information Processing Systems (NIPS’98). MIT Press, Cambridge, MA, 570–576.
[45] Vimal Mathew, Peeyush Kumar, and Balaraman Ravindran. 2012. Abstraction in reinforcement learning in terms of
metastability. In Proceedings of the European Workshop on Reinforcement Learning (EWRL’12).1–14.
[46] Amy McGovern and Andrew G. Barto. 2001. Automatic discovery of subgoals in reinforcement learning using diverse
density. In Proceedings of the International Conference on Machine Learning, 361–368.
[47] Amy McGovern, Richard S. Sutton, and Andrew H. Fagg. 1997. Roles of macro-actions in accelerating reinforcement
learning. In Grace Hopper Celebration of Women in Computing. 13–18.
[48] Ishai Menache, Shie Mannor, and Nahum Shimkin. 2002. Q-cut - dynamic discovery of sub-goals in reinforce-
ment learning. Proceedings of the European Conference on Machine Learning, 295–306. DOI:https://doi.org/10.1007/
3-540-36755-1_25
[49] Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2011. Extensions of recur-
rent neural network language model. In IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP’11). 5528–5531.
[50] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Sil-
ver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the
International Conference on Machine Learning (Proceedings of Machine Learning Research), Maria Florina Balcan and
Kilian Q. Weinberger (Eds.), Vol. 48. PMLR, New York, 1928–1937.
[51] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves,
Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis
Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 2015. Human-level
control through deep reinforcement learning. Nature 518, 7540 (Feb. 2015), 529–533.
[52] Bojan Mohar. 1997. Some Applications of Laplace Eigenvalues of Graphs. Springer Netherlands, Dordrecht, 225–275.
DOI:https://doi.org/10.1007/978-94-015-8937-6_6
ACM Computing Surveys, Vol. 52, No. 1, Article 6. Publication date: February 2019.
6:26 M. R. F. Mendonça et al.
[53] Parham Moradi, Mohammad Ebrahim Shiri, and Negin Entezari. 2010. Automatic skill acquisition in reinforcement
learning agents using connection bridge centrality. In Communications in Computer and Information Science. Springer,
Berlin, 51–62. DOI:https://doi.org/10.1007/978-3-642-17604-3_6
[54] Mark E. J. Newman and Michelle Girvan. 2004. Finding and evaluating community structure in networks. Physical
Review E 69, 2 (2004), 15. DOI:https://doi.org/10.1103/PhysRevE.69.026113
[55] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L. Lewis, and Satinder Singh. 2015. Action-conditional video pre-
diction using deep networks in atari games. In Advances in Neural Information Processing Systems (NIPS’15), C. Cortes,
N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, 2863–2871.
[56] Sarah Osentoski and Sridhar Mahadevan. 2010. Basis function construction for hierarchical reinforcement learning. In
Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS’10). International
Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 747–754.
[57] Ronald Edward Parr. 1998. Hierarchical Control and Learning for Markov Decision Processes. Ph.D. Dissertation. Uni-
versity of California, Berkeley.
[58] Duncan Potts and Bernhard Hengst. 2004. Concurrent discovery of task hierarchies. In AAAI Spring Symposium on
Knowledge Representation and Ontology for Autonomous Systems. 1–8.
[59] Doina Precup. 2000. Temporal Abstraction in Reinforcement Learning. Ph.D. Dissertation. University of Massachusetts.
[60] Martin L. Puterman. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
[61] Ali Ajdari Rad, Martin Hasler, and Parham Moradi. 2010. Automatic skill acquisition in reinforcement learning using
connection graph stability centrality. In Proceedings of IEEE International Symposium on Circuits and Systems. 697–700.
DOI:https://doi.org/10.1109/ISCAS.2010.5537485
[62] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. 2007. Near linear time algorithm to detect community
structures in large-scale networks. Physical Review E 76, 3 (2007), 11. DOI:https://doi.org/10.1103/PhysRevE.76.036106
[63] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with
region proposal networks. In Advances in Neural Information Processing Systems (NIPS’15). Curran Associates, 91–99.
[64] Jianbo Shi and Jitendra Malik. 2000. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis
and Machine Intelligence 22, 8 (Aug. 2000), 888–905. DOI:https://doi.org/10.1109/34.868688
[65] Kimberly L. Stachenfeld, Matthew Botvinick, and Samuel J. Gershman. 2014. Design principles of the hippocampal
cognitive map. In Advances in Neural Information Processing Systems (NIPS’14). Curran Associates, 2528–2536.
[66] Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press.
[67] Richard S. Sutton, Doina Precup, and Satinder Singh. 1999. Between MDPs and semi-MDPs: A framework for temporal
abstraction in reinforcement learning. Artificial Intelligence 112 (1999), 181–211.
[68] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vin-
cent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Computer Vision and Pattern
Recognition (CVPR’15). 1–12.
[69] Nasrin Taghizadeh and Hamid Beigy. 2013. A novel graphical approach to automatic abstraction in reinforcement
learning. Robotics and Autonomous Systems 61, 8 (2013), 821–835. DOI:https://doi.org/10.1016/j.robot.2013.04.010
[70] Alexander Vezhnevets, Volodymyr Mnih, Simon Osindero, Alex Graves, Oriol Vinyals, John Agapiou, and Koray
Kavukcuoglu. 2016. Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing
Systems (NIPS’16), D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, 3486–
3494.
[71] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray
Kavukcuoglu. 2017. FeUdal networks for hierarchical reinforcement learning. In Proceedings of the International Con-
ference on Machine Learning (ICML’17). 1–12.
[72] Krista Rizman Žalik. 2008. An efficient K’-Means clustering algorithm. Pattern Recognition Letters 29, 9 (2008), 1385–
1391. DOI:https://doi.org/10.1016/j.patrec.2008.02.014
[73] Christopher J. C. H. Watkins. 1989. Learning from Delayed Rewards. Ph.D. Dissertation. King’s College, Cambridge,
UK.
[74] Marcus Weber, Wasinee Rungsarityotin, and Alexander Schliep. 2004. Perron cluster analysis and its connection to
graph partitioning for noisy data. Technical Report, Zuse Institute Berlin (ZIB).
[75] Klaus Wehmuth and Artur Ziviani. 2011. Distributed location of the critical nodes to network robustness based
on spectral analysis. In Latin American Network Operations and Management Symposium (LANOMS’11). 1–8. DOI:
https://doi.org/10.1109/LANOMS.2011.6102259
ACM Computing Surveys, Vol. 52, No. 1, Article 6. Publication date: February 2019.