
A Survey on Temporal Graph Representation Learning and Generative Modeling

Shubham Gupta, Department of Computer Science, IIT Delhi ([email protected])
Srikanta Bedathur, Department of Computer Science, IIT Delhi ([email protected])

arXiv:2208.12126v1 [cs.LG] 25 Aug 2022
Abstract

Temporal graphs represent the dynamic relationships among entities and arise in many real-life
applications such as social networks, e-commerce, communication, road networks, and biological
systems. They necessitate research beyond the work on static graphs in terms of their generative
modeling and representation learning. In this survey, we comprehensively review the neural,
time-dependent graph representation learning and generative modeling approaches proposed in
recent times for handling temporal graphs. Finally, we identify the weaknesses of existing
approaches and discuss the research proposal of our recently published paper TIGGER [24].

1 Introduction

Traditionally, static graphs have been the de facto data structure in many real-world settings like social
networks, biological networks, computer networks, routing networks, geographical weather networks,
interaction networks, co-citation networks, traffic networks, and knowledge graphs [2, 1, 12]. These
graphs are used to represent the relationships between various entities. Major tasks like community
detection, graph classification, entity classification, link prediction, and combinatorial optimization
are established research areas in this domain. These tasks have applications in recommendation
systems [37], anomaly detection [51], information retrieval using knowledge graphs [16], drug
discovery [19], traffic prediction [88], molecule fingerprinting [14], protein interface prediction [17],
and combinatorial optimization [55]. Recently, graph neural networks [36, 26, 71, 80, 82] have been
developed to improve the state-of-the-art in these applications. Moreover, much success has been
achieved in terms of quality and scalability by graph generative modeling methods [83, 82, 42].
However, most of these datasets have the added dimension of time. Researchers typically marginalize
out the temporal dimension to obtain a static graph before executing the above tasks. Nonetheless, many
tasks like future link prediction [69], time of future link prediction [66], and dynamic node classification [39]
require temporal attributes as well. This has led to recent advancements in temporal graphs in terms
of defining and computing temporal properties [29]. Moreover, algorithmic problems like the travelling
salesman problem [47], minimum spanning trees [30], core decomposition [76], and maximum clique
[27] have been adapted to temporal graphs. Recent research on graphs has focused on dynamic
representation learning [60, 87, 67, 79] and achieved high fidelity on the downstream tasks. Research
on temporal graph generative models [89, 85] is in its early stages and requires focus, especially on
scalability.
Many surveys exist that separately study techniques for graph representation learning [9, 10, 34,
35, 77], temporal graph representation learning [63, 6, 33], and graph generative modeling [23, 15].
This survey, however, is a first attempt to unify these inter-related areas. It aims to be a starting point for
beginners interested in the temporal graph machine learning domain. In this report, we outline the
following:
• We initiate the discussion with definitions and preliminaries of temporal graphs.
• We then present the summary of graph representation learning methods and the static graph
generative methods that are prerequisites to explore similar approaches for temporal graphs.
• We discuss in-depth the temporal graph representation approaches proposed in recent
literature.
• Subsequently, we outline the existing temporal graph generative methods and highlight their
weaknesses.
• Finally, we propose the problem formulation of our recently published temporal graph
generative model TIGGER [24].

2 Definitions and Preliminaries


This section will formalize the definition of temporal graphs and their various representations.
Furthermore, we will explain a temporal graph’s node and edge attributes. We will then discuss the
various tasks in the temporal graph setting and the metrics frequently used in the literature.

2.1 Temporal Graph

Temporal graphs are an effective data structure to represent the evolving topology/relationships
between various entities across the time dimension. Temporal graphs are used to describe not only
the phenomena that drive evolving links but also the underlying processes that trigger entity addition
to or removal from the topology. In addition, they represent the evolution of entity/edge attributes over
time. For example, temporal graph-based modeling can explain the sign-up behavior of a user in a
shopping network and the causes of churn, apart from their shopping behavior. In this survey, we also
use temporal graphs to characterize dynamic graphs that are no longer evolving.
A temporal graph is represented either in continuous time or in discrete time. In the subsections below,
we describe the distinction between the two.

2.1.1 Continuous Time Temporal Graph


A continuous-time temporal graph is a stream of dyadic events happening sequentially.
G = \{(e_1, t_1, x_1), (e_2, t_2, x_2), (e_3, t_3, x_3), \ldots, (e_n, t_n, x_n)\}    (1)

Each e_i is a temporal event tuple at timestamp t_i with attributes x_i ∈ R^F. These attributes can be
continuous, categorical, both, or none at all. Each event is defined depending on the type of dataset
and task. For interaction networks such as transaction, shopping, and communication networks,
each e_i = (u, v) is an interaction between nodes u and v at time t_i, where u, v ∈ V and
V is the collection of entities/nodes in the network. In a general framework, |V| is a variable across
time. Nodes can be added and deleted, which is represented by the event e. An event e_i = (u, "add")
is a node addition event, where a node u with attributes x_i ∈ R^F is added to the network at timestamp
t_i. A repeated node addition event is interpreted as a node update event with new attributes if the node
already exists in the network; for example, the role of an assistant professor node can change to
associate professor in a university network. Similarly, a node deletion event e_i = (v_i, "delete") is
also possible.
In temporal graphs such as knowledge graphs, co-citation networks, biological networks, and transport
networks, each edge has a time-span, i.e., an edge e can be added to the network at time t_1 and
removed at time t_2, where 0 ≤ t_1, t_2 ≤ T and T is the last observed timestamp in the network. In such
cases, we assume that each edge event is either a link addition event e_i = (u, v, "add") at time t_i in
the network or a link deletion event e_i = (u, v, "delete") at time t_j in the network.
Figure 1 shows an example of an interaction network. The event representation is often designed
per requirement in the existing literature. For example, [67] adds a variable k_i to the event tuple,
giving (e_i, t_i, k_i), where k_i is a binary variable signifying whether the event is an association event
(a permanent link between two nodes in the network) or a communication event (an interaction between
two nodes). We encapsulate such intricacies in the variable e_i to simplify the presentation.

[Figure 1: Temporal Interaction Network. Timestamped interactions among four users.]

2.1.2 Discrete-Time Temporal Graph

Generally, a temporal graph is aggregated across windows of consecutive timestamps to extract the
desired information or to apply static graph modeling techniques. We divide the time axis into intervals
of equal length and aggregate the events in each interval to create one graph per temporal window.
Figure 2 displays one such aggregated temporal graph. This representation is known as a
discrete-time temporal graph.

[Figure 2: Evolving Temporal Network. A sequence of aggregated graph snapshots at times t_1 through t_4.]

A discrete-time temporal graph is defined as follows:

G = \{(G_1, t_1, X^v_1, X^e_1), (G_2, t_2, X^v_2, X^e_2), \ldots, (G_n, t_n, X^v_n, X^e_n)\}    (2)

Here, each G_i = (V_i, E_i) is a static network at time t_i, where V_i is the set of nodes in time-window
t_i and E_i is the set of edges e = (u, v) with u, v ∈ V_i. X^v_i ∈ R^{|V_i| × F_v} is the node feature
matrix, where F_v is the dimension of each node's feature vector. Similarly, X^e_i ∈ R^{|E_i| × F_e} is
the edge feature matrix, where F_e is the dimension of the edge feature vector. Please note that the
meaning of the notation t_i is dataset specific. Essentially, it is a custom representation of the period
over which the graph G_i has been observed. For example, t_i can be a month if a snapshot is collected
for each month. In many cases, it can simply be the mean of the time-window or its last timestamp.
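To make the relationship between the two representations concrete, the following is a minimal sketch (in Python, with illustrative function and variable names) of how a continuous-time event stream of the form in Equation 1 can be aggregated into the discrete snapshots of Equation 2 using fixed-length time windows; attributes are omitted for brevity.

```python
from collections import defaultdict

def build_snapshots(events, window):
    """Aggregate a continuous-time edge stream into discrete snapshots.

    events: list of (u, v, t) interaction tuples, a simplified instance of
            the (e_i, t_i, x_i) stream in Equation 1 (attributes omitted).
    window: length of each aggregation interval on the time axis.
    Returns a dict mapping window index -> set of edges observed in it.
    """
    snapshots = defaultdict(set)
    for u, v, t in events:
        idx = int(t // window)          # which temporal window the event falls into
        snapshots[idx].add((u, v))      # repeated interactions collapse to one edge
    return dict(snapshots)

# Example: four interactions aggregated into two windows of length 10.
events = [(1, 2, 0.5), (2, 3, 4.2), (1, 2, 11.0), (3, 4, 17.8)]
print(build_snapshots(events, window=10.0))
# {0: {(1, 2), (2, 3)}, 1: {(1, 2), (3, 4)}}
```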

2.2 Attributes

Attributes are a rich source of information beyond the structure of a graph. For example, a
non-attributed co-citation network only provides information about frequent co-authors and the
similarities between them, but an attributed network containing the role and title of every node also
offers a holistic view of the evolving interests of co-authors and helps in predicting the next co-author
or the next research area of a co-author. Attributes, in general, can take a categorical form, like gender
and location, which are represented as one-hot vectors. They can also take a continuous form, like
age. For example, in a Wikipedia network [39], each interaction's attributes are the word vector of the
text edit made in the wiki article, and the class (label) of each user indicates whether the user has been
banned from editing the page. Sometimes these attributes are present as meta-data in the network;
the airline network dataset of [18] contains meta-information in the form of each city's latitude, longitude,
and population.

2.3 Tasks and Evaluation Metrics

The primary objective in representation learning is to project each node and edge into a d-dimensional
vector space. This is achieved by learning a time-dependent function f(G, v, t) : G × V × R^+ → R^d
for all v ∈ V, where V is the set of nodes in the temporal network G observed until time T.
Generally t > T for future prediction tasks, but the time-dependence requirement can be dropped to learn
only a node-specific representation [50]. If the function f allows only an existing node v ∈ V as an
argument, then this setting is generally known in the literature as transductive representation learning
[26]. It is often desirable not to have this restriction, since frequent model retraining is not possible
in many real-life systems, and the trained model is usually required to generate representations
for unseen nodes v ∉ V. This setting is recognized as inductive representation learning [26]. The
argument G in the function f encapsulates graph/node/edge attributes and is simplified according
to design choices. If we assume a temporal graph in the continuous domain, then G can
be approximated as a sequence of event streams, or as only those events in which the other argument v
and its neighbors are involved. Neighbors of a node v in a temporal graph do not have a universal
definition as in static graphs. Most papers define the neighbourhood of a node v at a given time
t as the set of nodes that are at most k_d hops away from v in the topology and whose interaction times
are within k_t of the given time t, where k_d and k_t are application-specific parameters [67, 79, 60].
This definition typically selects the recent interactions of node v before time t. [87, 39] use all of node
v's previous interactions to learn its representation at time t; this is the special case of k_d = 1 and k_t = ∞.
Representation learning methods primarily focus on learning node representations and then use
them to derive edge embeddings. The representation of an edge e = (u, v, t) is learnt via a function
g(G, u, v, t) : G × V × V × R^+ → R^d that aggregates the representations of its two end nodes and
their attributes, including the edge under consideration. Most often, g is a
concatenation/min/max/mean operator; it can also be a neural network-based function that learns the
aggregation. We note that t is an argument to g, allowing g to learn a time-dependent
embedding function.
Similarly, discrete-time temporal graph snapshots are encoded into low-dimensional representations:
like the edge aggregator g, a graph-level aggregator is learned that takes all nodes in the graph
as input.
The most frequent tasks using graph/node/edge representations are graph classification, node classifi-
cation, and future link prediction. These are known as downstream tasks. These tasks cover many
applications in recommendation systems, traffic prediction, anomaly detection, and combinatorial
optimization. In recent work, we have also observed additional tasks like event time prediction [67]
and clustering [86]. We will now detail each task and the metrics used to evaluate the model efficiency
for these tasks. Please note that we overload the notation f for each task.

2.3.1 Node Classification


Given a temporal graph G and node v, a function f is trained to output its label. Formally,
f (G, v, t) : G × V × R+ → C
where C is the set of possible categories of a node in the temporal graph G. Since a node v can often be
represented as a time-dependent d-dimensional vector h_v(t), f can be approximated as
f(h_v(t)) : R^d → C.
Node classification can be conducted in both transductive and inductive settings, depending on the
argument node of the function f. In the transductive setting, the argument node has already been seen
during training; it is unseen in the inductive case. Accuracy, F1, and AUC are frequent metrics
for evaluating node classification. AUC is preferred since it provides a reliable evaluation
even under high class imbalance. Anomaly detection is a typical case where class imbalance is
observed, with the positive class typically comprising less than 1% of the population.

2.3.2 Future Link Prediction


Given a temporal graph G, two nodes u and v, and a future timestamp t, a function f is learned that
predicts the probability of these two nodes linking at a time t > T, where T is the latest timestamp
observed in G:
f(G, u, v, t) : G × V × V × R^+ → R

Similarly, we can also predict the attributes of this future link. As in node classification,
since the node representations h_u(t) and h_v(t) depend on G and time t, we can write f as
follows:
f(h_u(t), h_v(t)) : R^d × R^d → R
We further observe that in most future link prediction settings, node representations for t > T are
approximated as
hv (t) ≈ hv (T ) ∀v ∈ V
The argument t is often not required in future link prediction functions for such settings since f is
simply predicting the probability of link formation in the future. In most methods, f is the cosine
similarity function between node embeddings or a neural function that aggregates the information
from these embeddings. In some instances, [67], f is approximated by the temporal point processes
[58], which also allows for predicting the time of link formation as well. Like node classification, we
can categorize the future link prediction task into transductive and inductive settings, depending upon
whether the node v in consideration is already seen during training.
Future link prediction is evaluated in two main settings, and researchers typically choose one of
them. In the first setup, link prediction is treated as a classification task. The test
data consists of an equal number of positive and negative links. Positive links are the actual links
present in the future subgraph or test graph. Generally, the test graph is split chronologically from the
training graph to evaluate model performance, and its edges are therefore considered positive
examples. From the same test graph, an equal number of non-edge node
pairs are sampled as negative links. Accuracy, F1, or AUC is used to evaluate the performance of
f. In the second setting, future link prediction is seen as a ranking problem. For every test node, its
most probable future neighbors are ranked and compared with the actual future neighbors. In a slightly
different framework, the top-K most probable edges in a test graph are ranked and compared with the
ground-truth edges. Preferred metrics in these ranking tasks are mean reciprocal rank (MRR), mean
average precision (MAP) [73], precision@K, and recall@K.
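As a concrete illustration of the ranking-based evaluation, the sketch below computes the reciprocal rank and recall@K for a single test node from a hypothetical dictionary of predicted link scores; in practice these quantities are averaged over all test nodes to obtain MRR and recall@K.

```python
def mrr_and_recall_at_k(scores, true_neighbors, k=10):
    """scores: dict mapping candidate node -> predicted link score.
    true_neighbors: set of nodes that actually link to the test node in the test graph.
    Returns (reciprocal rank of the best-ranked true neighbor, recall@k)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Reciprocal rank of the first ground-truth neighbor in the ranked list.
    rr = 0.0
    for rank, node in enumerate(ranked, start=1):
        if node in true_neighbors:
            rr = 1.0 / rank
            break
    # Fraction of true neighbors recovered in the top-k predictions.
    recall_k = len(set(ranked[:k]) & true_neighbors) / max(len(true_neighbors), 1)
    return rr, recall_k

scores = {"a": 0.9, "b": 0.4, "c": 0.75, "d": 0.1}
print(mrr_and_recall_at_k(scores, true_neighbors={"c", "d"}, k=2))  # (0.5, 0.5)
```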

2.3.3 Event Time Prediction


[67] introduces the novel task of predicting the time of the link under consideration. This task has
applications in recommendation systems: learning which items will be purchased by a user at a
particular time t enables better recommendations, optimized product shipment routing, and a
better user experience. Mean absolute error (MAE) is the metric used to evaluate this task.

3 Literature Review
We first summarize the prevalent static graph representation learning methods. We will later see that
temporal graph representation methods are direct extensions of these approaches.

3.1 Static Graph Representation Learning Methods

These methods are divided into two main categories: (a) random walk based methods and (b) graph
neural networks. There are other categories as well, such as factorization-based approaches [3, 7, 52].
However, these are generally not used due to associated scalability problems and their inability
to use available attributes; moreover, random walk and GNN-based methods are superior in
quality. For this subsection, we assume G = (V, E) is a static graph, where V is the node set and
E = {(u, v) | u, v ∈ V} is the edge set. N is the number of nodes and M the number of edges. We
denote the 1-hop neighbourhood of node v as N_v and the input feature vector of node v as x_v. Also,
bold lowercase variables denote vectors, and bold uppercase variables denote matrices.

3.1.1 Random Walk based Methods


Node representations often reflect the graph structure, i.e., the more similar the representations,
the higher the chance that the corresponding nodes co-occur in random walks. This
intuition provides an unsupervised objective for learning node representations. Building on
this, DEEPWALK [56] provided the first random walk-based method. It runs random walks
RW_v from each node v. Suppose one such random walk sequence of length k is RW_v = {v_1, v_2, ..., v_k}.

Using the skip-gram objective [48], they learn the representation zv ∀v ∈ V by optimizing the
following loss objective.
L_{RW_v} = -\sum_{v_i \in RW_v} \log P(v_{i-w} \ldots v_{i+w} \mid v_i) = -\sum_{v_i \in RW_v} \sum_{v \in \{v_{i-w} \ldots v_{i+w}\},\, v \neq v_i} \log P(v \mid v_i)    (3)

p(v \mid u) = \frac{\exp({z'_v}^{\top} z_u)}{\sum_{v' \in V} \exp({z'_{v'}}^{\top} z_u)}

where z_v is the node representation and z'_v is an auxiliary node representation that is not used in the
downstream tasks; this setup is similar to [48]. w is the size of the window centred at v_i ∈ RW_v.
log p(v | u) is often rewritten using the negative sampling method [49] to avoid the computationally
expensive denominator of the softmax, as follows:
\log p(v \mid u) = \log \sigma(z_v^{\top} z_u) + \sum_{k=1}^{K} \mathbb{E}_{v_n \sim P_n(v)} \log \sigma(-z_{v_n}^{\top} z_u)    (4)

where σ is the sigmoid function, K is the number of negative samples (typically 5), and P_n(v) is a
probability distribution over v ∈ V, often based on the degree of v and the task. DEEPWALK
computes these losses for every node in each random walk and updates z_v ∀v ∈ V using gradient
descent methods [65]. We note that in each random walk the next node is selected uniformly from the
current node's neighborhood. DEEPWALK also shows that these learned representations can be
utilized in downstream tasks like node classification and missing link prediction. LINE is a direct
extension of DEEPWALK. It modifies the DEEPWALK loss by restricting co-occurring nodes to be
directly connected. Furthermore, it adds the following loss:
L = -\sum_{(u,v) \in E} \log f(u, v), \qquad f(u, v) = \frac{1}{1 + \exp(-z_u^{\top} z_v)}    (5)

where E is the edge set. This loss forces neighbouring nodes to have similar representations.
NODE2VEC [22] uses negative sampling [48] instead of hierarchical softmax to approximate the
expensive denominator of the softmax. Furthermore, it introduces Breadth-First
Search (BFS) and Depth-First Search (DFS) biased random walks to learn the node representations. In
BFS-based random walks, nodes nearby in terms of hop distance are sampled more frequently;
this micro-level view leads to similar embeddings for nodes with similar structural roles. In
DFS-based random walks, distant nodes are more likely to be sampled; this macro-level view
captures community structure (homophily), assigning similar embeddings to nodes of the same community.
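A minimal numpy sketch of the negative-sampling objective of Equation 4 for a single (center, context) pair drawn from a random walk; embedding tables, walk sampling, and the optimizer are omitted, and all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(z_u, z_v, z_negatives):
    """Loss for one co-occurring pair (u center, v context), cf. Equation 4.

    z_u, z_v: d-dimensional embeddings of the center and context nodes.
    z_negatives: (K, d) array of embeddings of K nodes drawn from P_n(v).
    Returns the negative of the objective, to be minimized by gradient descent.
    """
    positive_term = np.log(sigmoid(z_v @ z_u))                    # pull co-occurring nodes together
    negative_term = np.sum(np.log(sigmoid(-z_negatives @ z_u)))   # push sampled negatives apart
    return -(positive_term + negative_term)

rng = np.random.default_rng(0)
d, K = 16, 5
print(negative_sampling_loss(rng.normal(size=d), rng.normal(size=d),
                             rng.normal(size=(K, d))))
```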
These methods work directly with node IDs and do not factor in node features or associated
meta-data. Thus, these approaches do not extend to unseen nodes in the network, since new
node IDs are absent during training. Furthermore, these are unsupervised approaches, so embeddings
cannot be learned using available supervision on the nodes/edges. These challenges limit the
practical use of the above methods.

3.1.2 Graph Neural Network based Methods


[36] introduced the Graph Convolutional Network (GCN) to learn node representations based on
the graph adjacency matrix and node features. The equation below represents the layer-wise message
passing in a multi-layer graph neural network.
H^{l+1} = \sigma(\tilde{A} H^{l} W^{l}), \qquad H^{0} = X    (6)
where X is the node feature matrix, i.e., the i-th row corresponds to the feature vector x_i of node i,
H^l is the node representation matrix at the l-th layer, and W^l is a trainable weight matrix for message
passing from layer l to l+1. \tilde{A} = D^{-1/2}(A + I_N)D^{-1/2}, where A is the adjacency matrix of
the graph G and D is a diagonal matrix whose diagonal entries are the degrees of the corresponding
nodes in the matrix A + I_N. σ is a non-linear activation such as sigmoid or ReLU. Note that I_N is
added to include self-loops in the formulation. This formulation allows the node representation at the
next layer to be an aggregation of the node's own features and the features of its neighbour nodes. An
L-layer network therefore makes each node's representation depend on its L-hop neighbourhood,
which is evident from the formula.
The proposed approach requires supervision, as the input network needs a few labeled nodes
to learn W^l ∀l ∈ {0..L−1}. The authors train the W of each layer with a cross-entropy loss after
applying a softmax over the final-layer representation of each labeled node v. This approach cannot
incorporate unseen nodes, since training requires the full adjacency matrix; this requirement also
causes scalability issues for large graphs.
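As a concrete illustration of this propagation rule, the following is a minimal numpy sketch of a single GCN layer (Equation 6) on a toy graph; the weight initialization and feature values are arbitrary, and training is omitted.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H_{l+1} = ReLU(A_hat @ H_l @ W_l), cf. Equation 6."""
    N = A.shape[0]
    A_tilde = A + np.eye(N)                       # add self-loops (A + I_N)
    deg = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))      # D^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # symmetric normalization
    return np.maximum(0.0, A_hat @ H @ W)         # ReLU non-linearity

# Toy graph: a path 0-1-2 with 2-dimensional input features.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W0 = np.random.default_rng(0).normal(size=(2, 4))
print(gcn_layer(A, X, W0).shape)  # (3, 4)
```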
[26] identified that Equation 6 essentially averages the representations of the target node and its
1-hop neighbors to compute the node representation at the next layer, so a complete
matrix formulation is not needed. Furthermore, the W matrix does not depend on node identity.
This relaxation also motivates the inductive setting. The authors of [26] proposed a method, GRAPHSAGE,
with the node-level, layer-wise message propagation given below.
h^{l+1}_v = \sigma\big(W^{l}\, \mathrm{CONCAT}(h^{l}_v, \mathrm{AGGREGATE}^{l}(\{h^{l}_u \mid \forall u \in N_v\}))\big)    (7)
where h^0_v = x_v and x_v is the feature vector of node v. CONCAT and AGGREGATE^l are functions
defined per requirement: AGGREGATE learns a single representation
from the set of neighbour representations (GRAPHSAGE uses MEAN, MAX, and RNN-based
aggregator functions), and CONCAT is simply the concatenation of two embeddings. Also note that
after each layer of propagation, the h^l_v are normalized using the L2 norm. The above formulation is
effective since it allows the weight matrices to adjust the relative importance of the current target node
and its neighbours' representations when learning the next-layer representation of the target node.
Additionally, this formulation is inductive, since neither node identity nor the adjacency matrix is
required. GRAPHSAGE uses a supervised loss on node labels by applying an MLP followed by a
softmax to convert the node embedding into a probability vector over the node label space.
GRAPHSAGE additionally proposes an unsupervised loss similar to Equation 4, where nodes u and v
co-occur on short sampled random walks.
In the above formulation, neighbours are treated equally in aggregator functions like max or mean pooling.
[71] proposed an attention-based aggregator, namely GAT, to compute the relative importance of each
neighbour. Specifically, the authors propose the following to compute the representation of
node v at layer l + 1:
h^{l+1}_v = \sigma\Big(\sum_{u \in N_v \cup \{v\}} \alpha_{vu} W h^{l}_u\Big)    (8)

\alpha_{vu} = \frac{\exp\big(a^{\top} \mathrm{LeakyReLU}(W h^{l}_v \,\|\, W h^{l}_u)\big)}{\sum_{i \in N_v \cup \{v\}} \exp\big(a^{\top} \mathrm{LeakyReLU}(W h^{l}_v \,\|\, W h^{l}_i)\big)}

α_vu indicates the importance of the message from node u to node v. Here, a is a trainable weight vector.
Additionally, GAT introduces multi-head attention, similar to [49], to utilize the self-attention-based
learning process. The final aggregation using K attention heads becomes:

h^{l+1}_v = \big\Vert_{i=1}^{K}\, \sigma\Big(\sum_{u \in N_v \cup \{v\}} \alpha^{i}_{vu} W^{i} h^{l}_u\Big)    (9)
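The sketch below computes a single-head GAT update for one node following the attention form of Equation 8 (LeakyReLU applied to the concatenated features, then a dot product with the attention vector a); multi-head attention as in Equation 9 would concatenate the outputs of K such heads with separate W^i and a^i. All inputs are illustrative.

```python
import numpy as np

def gat_node_update(h, W, a, v, neighbors, alpha_slope=0.2):
    """Single-head GAT update for node v, cf. Equation 8.

    h: (N, d) node features, W: (d, d') weight matrix,
    a: (2*d',) attention vector, neighbors: list of neighbor indices of v.
    """
    idx = neighbors + [v]                            # attend over N_v plus the node itself
    Wh = h @ W                                       # transformed features
    logits = []
    for u in idx:
        concat = np.concatenate([Wh[v], Wh[u]])      # [W h_v || W h_u]
        leaky = np.where(concat > 0, concat, alpha_slope * concat)  # LeakyReLU
        logits.append(a @ leaky)                     # dot product with attention vector a
    logits = np.array(logits)
    alpha = np.exp(logits - logits.max())
    alpha = alpha / alpha.sum()                      # softmax over the neighborhood
    out = sum(alpha_i * Wh[u] for alpha_i, u in zip(alpha, idx))
    return np.tanh(out)                              # sigma: any non-linearity

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))
print(gat_node_update(h, rng.normal(size=(3, 2)), rng.normal(size=4),
                      v=0, neighbors=[1, 2]).shape)  # (2,)
```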

Finally, [81] proves that aggregator functions used in GRAPHSAGE and GCN, such as max-pool and
mean-pool, are less powerful for the graph isomorphism task than the sum aggregator, by showing that
mean-pool and max-pool can produce the same representation for different node multi-sets. They propose
the following simpler GNN formulation:

h^{l+1}_v = \mathrm{MLP}^{l+1}\Big((1 + \epsilon^{l+1})\, h^{l}_v + \sum_{u \in N_v} h^{l}_u\Big)

g = \big\Vert_{l=0}^{L} \Big(\sum_{v \in G=(V,E)} h^{l}_v\Big)    (10)

where g is the graph embedding, L is the number of layers in the GNN, and ε is a learnable scalar
parameter. They also show that this formulation is as powerful as the WL test for graph isomorphism
if the node features come from a countable set [40].
The above formulations assign similar embeddings to nodes that have similar neighborhoods with similar
attributes, even if they are distant in the network. However, in many settings the embeddings need to
account for the node's position in the network. Applications like routing, which involve the number
of hops/distance between nodes in the end objective, are examples of this requirement. Specifically,
the node's position in the network should also be a factor in the learned embeddings. PGNN [82]
proposed the concept of anchor nodes to learn position-aware node embeddings. Assuming K anchor
nodes {v_1, v_2, ..., v_K} and distances of node u from these nodes {d_{uv_1}, d_{uv_2}, ..., d_{uv_K}},
a node u can be represented using a position-encoded vector [d_{uv_1}, d_{uv_2}, ..., d_{uv_K}] of size K. More
anchor nodes provide better location estimates across different regions of the network. PGNN generalizes
the concept of an anchor node to an anchor set, which contains a set of nodes. A node v's distance
from an anchor set is the minimum of its distances from all nodes in the anchor set. We denote the i-th
anchor set as S_i; each S_i contains nodes sampled from G, and each anchor set can contain
a different number of nodes. These K anchor sets create a K-sized position-encoded vector for every node.
These vectors are used along with the original node features to encode each node. However, since each
dimension of the position vector is linked with an anchor set, changing the order of the anchor
sets/position vector should not change the meaning. This constraint requires using a permutation-invariant
aggregator function in the GNN. So, PGNN introduces the following formulation for
position-aware node representation:
h^{l}_v = \sigma(M^{l}_v w)
M^{l}_v[i] = \mathrm{MEAN}(\{f(v, u, z^{l-1}_v, z^{l-1}_u) \mid \forall u \in S_i\})
f(v, u, z^{l-1}_v, z^{l-1}_u) = s(v, u)\, \mathrm{CONCAT}(z^{l-1}_v, z^{l-1}_u)
s(v, u) = \frac{1}{d_{sp}(v, u) + 1}    (11)
z^{l-1}_v = \mathrm{MEAN}(\{M^{l-1}_v[i] \mid \forall i \in (0 \ldots K-1)\})
z^{0}_v = x_v
where h^L_v is the K-sized position-aware representation of node v, and w is a trainable parameter.
d_{sp}(v, u) is the shortest-path distance between nodes v and u; if d_{sp} exceeds a certain threshold, it is
assumed to be infinity. This assumption speeds up the all-pairs shortest path computation.
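To illustrate the anchor-set idea, the sketch below samples K anchor sets, computes truncated BFS shortest-path distances, and builds a K-dimensional position feature per node using the 1/(d+1) scaling of Equation 11; the message function f, the CONCAT with node features, and the trainable w are omitted, and all function names are illustrative.

```python
import random
from collections import deque

def bfs_distances(adj, source, cutoff):
    """Hop distances from source, treating anything beyond `cutoff` as unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] >= cutoff:
            continue
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def anchor_position_features(adj, num_anchor_sets=4, set_size=2, cutoff=3, seed=0):
    """K-dimensional position vector per node: 1 / (min distance to anchor set + 1)."""
    rng = random.Random(seed)
    nodes = list(adj)
    anchor_sets = [rng.sample(nodes, set_size) for _ in range(num_anchor_sets)]
    features = {}
    for v in nodes:
        dist = bfs_distances(adj, v, cutoff)
        row = []
        for S in anchor_sets:
            d = min(dist.get(u, float("inf")) for u in S)   # distance to closest anchor in S
            row.append(1.0 / (d + 1.0))                     # evaluates to 0.0 when d is infinity
        features[v] = row
    return features

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(anchor_position_features(adj))
```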
The above GNN methods work well in graphs with high homophily. [90] defines homophily as
the ratio of the number of edges connecting nodes with the same label to the total number of edges.
Many networks, such as citation networks, have high homophily, whereas networks like dating networks
have very low homophily, i.e., heterophily [90]. They show that state-of-the-art GNN methods work very
well on high-homophily networks but perform worse than simple MLPs on heterophily networks.
Their method, H2GCN, proposes the following three simple modifications to GNNs to improve their
performance on heterophily networks.
1. The target node v's embedding (ego embedding) should not be averaged with the neighbourhood
embeddings to compute the embedding at the next layer, as done in GCN; GRAPHSAGE-style
concatenation is better. This is similar to the skip-connections used in deep neural networks to
increase network depth.

   h^{l}_v = \mathrm{COMBINE}(h^{l-1}_v, \mathrm{AGGR}(\{h^{l-1}_u, \forall u \in N_v\}))    (12)

2. Instead of using only the 1-hop neighbours' embeddings at each layer to compute the ego embedding,
H2GCN proposes to use higher-order neighbours as well:

   h^{l}_v = \mathrm{COMBINE}(h^{l-1}_v, \mathrm{AGGR}(\{h^{l-1}_u, \forall u \in N^{1}_v\}), \mathrm{AGGR}(\{h^{l-1}_u, \forall u \in N^{2}_v\}), \ldots)    (13)

   where N^{i}_v denotes the set of nodes that are i hops away from node v.

3. Instead of using only the last layer's embedding as the final embedding of each node, H2GCN
proposes to combine the representations from all GNN layers:

   h^{final}_v = \mathrm{COMBINE}(h^{0}_v, h^{1}_v, \ldots, h^{L}_v)    (14)

   where h^{0}_v = x_v is the input feature representation of node v.

[62] observed that the GNN architectures proposed so far apply only to homogeneous networks, where
every node and edge is of a single type (for example, friendship networks). Thus, they proposed
a GCN extension, namely RGCN, applicable to heterogeneous networks like knowledge graphs
where entities and relations can be of multiple types. RGCN suggests that, instead of using the same
weight matrix for every neighbor during neighborhood aggregation at each layer, a separate
weight matrix be used for each relation type. Since the number of unique relations and nodes typically runs
into the millions, they propose either using block-diagonal weight matrices or decomposing the weights
into basis matrices and learning the basis coefficients for each relation. Finally, [11] proposed a
method, CLUSTERGCN, to learn GNNs on large-scale networks with millions of nodes and
edges. They propose to first cluster the nodes using any standard clustering approach, then
randomly select multiple node groups from the clusters and create an induced subgraph from the
chosen nodes. The GNN weights are updated by running gradient descent on this subgraph. The
process of randomly sampling node groups, creating an induced subgraph, and training
the GNN is repeated until convergence, as sketched below.
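The training procedure just described can be summarized with the schematic sketch below; cluster_nodes, induced_subgraph, and gnn_step are placeholder callables standing in for a clustering routine (e.g., METIS), subgraph extraction, and a single gradient-descent update, respectively.

```python
import random

def cluster_gcn_train(graph, num_clusters, clusters_per_batch, num_steps,
                      cluster_nodes, induced_subgraph, gnn_step, seed=0):
    """Cluster-GCN style minibatch training (schematic).

    cluster_nodes(graph, num_clusters) -> list of node groups
    induced_subgraph(graph, nodes)     -> subgraph restricted to `nodes`
    gnn_step(subgraph)                 -> runs one gradient-descent update on the GNN
    """
    rng = random.Random(seed)
    partitions = cluster_nodes(graph, num_clusters)           # step 1: cluster once, up front
    for _ in range(num_steps):
        chosen = rng.sample(partitions, clusters_per_batch)   # step 2: sample several clusters
        nodes = [v for part in chosen for v in part]
        subgraph = induced_subgraph(graph, nodes)             # step 3: build the induced subgraph
        gnn_step(subgraph)                                     # step 4: gradient update on it
```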

3.2 Deep Generative Models for Static Graphs

So far, we have reviewed static graph representation approaches. We now briefly summarize
deep generative models for static graphs. Given a collection of input graphs {G_1, G_2, ...}
that are assumed to be sampled from an unknown underlying distribution p_data(G), the goal of
generative methods is to learn a probability distribution p_θ(G) which is highly similar to p_data(G)
and produces graphs with structural properties highly similar to those of the input graphs. Traditional graph
generative models assume some prior structural form of the graph, such as its degree distribution, diameter,
community structure, or clustering coefficients. Examples include Erdős-Rényi graphs [32], small-world
models [75], and scale-free graphs [4]. Prior assumptions about graph structure can be
encoded using these approaches, but they do not apply to practical applications like
drug discovery, molecular property prediction, and modeling a friendship network, since these models
cannot automatically learn from data. Learning a generative model on graphs is a challenging
problem, since the number of nodes varies across the graphs in the input data. Furthermore, the search space
and running-time complexity of generating edges is often quadratic in N and M. Additionally, in a naive
graph representation, any observed node ordering of a graph has probability 1/N!, i.e., a single graph
can be represented using N! possible node orderings. So the learned generative model should be able
to navigate this large space, which is not the case with images, text, and other domains. We initiate
the discussion with NETGAN [8], which learns a generative model from a single input network.
Then we compare the methods of MOLGAN [13], DEEPGMG [41], GRAPHRNN [84], GRAPHGEN
[19], and GRAN [43]. These methods take a collection of graphs as input to learn the generative
model. Among these, GRAPHGEN and GRAN are currently the state-of-the-art methods.
NETGAN samples a collection of random walks of maximum length T using the sampling approach of
NODE2VEC and trains a WGAN [5] on this collection. A GAN architecture primarily consists of a
generator and a discriminator. The discriminator scores the probability that a given random walk is
real. It is trained on sampled real random walks and random walks produced by the
generator, and its task is to assign high probabilities to real walks and low probabilities to synthetic
walks. The discriminator and generator are trained in tandem until the generator produces walks
indistinguishable from real walks and confuses the discriminator. The discriminator architecture
is LSTM-based, where each node in a sequence is encoded using a one-hot vector of size N; once
the LSTM processes a sequence, it outputs a logit that provides the sequence score. The generator
architecture is a bit tricky since it involves a stochastic operation of sampling the next node during
random walk sequence generation, i.e., (v_1, v_2, ..., v_T) ∼ G. Basically, a vector z ∼ Normal(0, 1)
is sampled from the normal distribution and transformed into a memory vector m_0 = f_z(z), where f_z
is an MLP-based function. This m_0 is used to initialize the LSTM cell along with a zero vector, which
outputs the next memory state m_1 and a probability distribution p_1 over the next node v_1. From this
multinomial distribution p_1, the next node v_1 is sampled, which along with m_1 is passed to the LSTM
cell to output (m_2, p_2). This process repeats until the generator has sampled a T-length node sequence.
To enable backpropagation through this sampling, the next-node sample v_i from the multinomial
distribution p_i = (p_i(1), p_i(2), ..., p_i(N)) is replaced with the Gumbel-softmax trick [31], which is
essentially p*_i = softmax((p_i(1) + g_1)/τ, (p_i(2) + g_2)/τ, ..., (p_i(N) + g_N)/τ), where the g_j are
sampled from a Gumbel distribution with location 0 and scale 1. The forward pass is computed using
argmax(p*_i), and backpropagation is computed using the continuous p*_i. τ is a temperature parameter:
a low τ behaves like argmax, and a very high τ approaches a uniform distribution.
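A minimal numpy sketch of the Gumbel-softmax relaxation; following the standard formulation, Gumbel noise is added to the (log-)scores before the temperature-scaled softmax, the forward pass may take the argmax, and gradients flow through the continuous relaxation. Variable names are illustrative.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=0.5, rng=None):
    """Differentiable relaxation of sampling from a categorical distribution.

    The forward pass can take argmax of the returned vector (a 'hard' sample),
    while gradients flow through the continuous softmax relaxation.
    """
    rng = rng or np.random.default_rng()
    gumbel_noise = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) samples
    y = (logits + gumbel_noise) / tau
    y = y - y.max()                       # numerical stability
    soft = np.exp(y) / np.exp(y).sum()    # relaxed one-hot over the N candidate nodes
    return soft, int(np.argmax(soft))     # (continuous sample, hard index)

probs = np.array([0.7, 0.2, 0.1])
soft, hard = gumbel_softmax_sample(np.log(probs), tau=0.1, rng=np.random.default_rng(0))
print(soft.round(3), hard)
```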
Once the WGAN is trained, NETGAN samples a collection of random walk sequences from the generator
and creates a synthetic graph by selecting the top M edges by frequency count. NETGAN uses multiple
graph properties, such as maximum node degree, assortativity, triangle count, power-law exponent,
clustering coefficient, and characteristic path length, to compare the synthetic graph G̃ with the original
graph G. We note that NETGAN's node space is the same as the input graph's node space V. Due to
this limitation, its use is limited to applications requiring samples with similar properties as the input
graph over the same node space. Next, we look at MOLGAN [13], which is similar to NETGAN in its use of a GAN.
MOLGAN takes a collection of graphs {G_1, G_2, ...} as input instead of a single graph. Further,
MOLGAN is specifically designed for molecular graphs, which becomes clear from its design choices.
MOLGAN utilizes the GAN architecture to learn the generator. The generator g takes a noise vector
z ∼ Normal(0, 1) and transforms it using MLPs into a probability-based adjacency matrix and node
feature matrix. Using Gumbel-softmax, the adjacency and node feature matrices are sampled from
these probability matrices. Finally, a GNN-based discriminator tries to distinguish actual graphs from
generated graphs. As we see, the MOLGAN generator produces a whole graph, whereas in NETGAN it
was used to produce a random walk. Finally, as MOLGAN notes, software packages are available
(e.g., http://www.rdkit.org/) to evaluate the generated molecules in terms of desired chemical
properties. These scores act as rewards in MOLGAN to provide additional supervision to the generator.
The model's parameters are trained using a deep deterministic reinforcement-learning framework [44].
This method is not scalable since it requires O(N²) computation and memory. Furthermore, a graph can
be represented using N! adjacency matrices, leading to significant training challenges. We now describe
methods that solve these problems.
DEEPGMG [41] views graph generation as a sequential process. This process generates one node
at a time and decides whether to connect the new node to existing nodes based on the current graph
state and the new node's state. DEEPGMG employs GNNs to model these states. Specifically, assuming an
existing graph G = (V, E) with N nodes and M edges and a newly added node v, it works as follows:

((v_1, v_2, \ldots, v_N), v_G) = \mathrm{GNN}^{L}(G)
v_{addnode} = \mathrm{MLP}(v_G)
v_{addedge} = \mathrm{MLP}((v_1, v_2, \ldots, v_N), v_G)    (15)
s_u = \mathrm{MLP}(v_u, v_v)\ \ \forall u \in V
v_{edges} = \mathrm{softmax}(s)

For each round of new node addition, an L-layer GNN is executed on the existing graph to compute
node and graph embeddings. Using the graph embedding, a decision is first taken on whether to add a new
node. If yes, a decision is taken on whether to add edges from this node. If yes, a score is calculated using
the embeddings of the existing nodes and the new node, and subsequently a probability
distribution is computed over the candidate edges between node v and the existing nodes; edges
are sampled from this distribution. This process repeats until a decision is taken not to add a new node. The
embedding of a new node is initialized using its features and the graph state. This approach is also
computationally expensive, O(N(M + N)), since it runs a GNN and an O(N) softmax over
existing nodes for each new node. It has a lower memory requirement, since it no longer needs to
store the whole adjacency matrix. The model is trained using maximum likelihood over the training
graphs. MOLGAN and DEEPGMG evaluate the generative models' performance by visual inspection
and by using quality scores provided offline by chemical software packages.
GRAPHRNN [84] follows a graph representation approach similar to DEEPGMG but replaces the GNN
with a recurrent neural architecture and uses a Breadth-First Search (BFS) based graph representation.
Furthermore, it introduced a comprehensive evaluation pipeline based on graph structural properties.
Its contributions are summarized as follows:
• Graph Representation: Unlike previous methods, GRAPHRNN considers the BFS node
ordering of each permutation to reduce the number of possible permutations, as many
permutations map to a single BFS ordering. Although a graph can still have multiple BFS
sequences, the permutation space is drastically reduced. This approach has two-fold benefits:
1. Training needs to be performed only over the possible BFS sequences instead of all possible
graph permutations.
2. A major issue with DEEPGMG was that possible edges are computed between each new node
and all previous nodes, causing O(N²) computations. GRAPHRNN makes the
following observation about any BFS sequence: whenever a new node v_i is added to
the BFS node sequence (v_1, v_2, ..., v_{i-1}) and it does not form an edge with a node
v_j for j < i, we can safely say that no node v_k with k ≤ j will form an edge with
v_i. This observation implies that we do not need to consider all previously generated
BFS nodes as possible edge endpoints for the new node. We can empirically estimate a bandwidth W
such that only the latest W generated nodes need to be considered for possible edges with the new node.
This reduces the computation to O(WN).
• Hierarchical Recurrent Architecture: GRAPHRNN uses two-level RNNs to model
each BFS sequence. The primary RNN decides on the new node and its type; node types
include a stop token that signals the completion of the graph generation process. The hidden state
of the primary RNN initializes the hidden state of the secondary RNN, which sequentially
processes the latest W nodes in the BFS-ordered sequence to create possible edges. These
two RNNs are trained together using the maximum likelihood objective.
• Metrics: GRAPHRNN introduced Maximum Mean Discrepancy (MMD) [21] based metrics
to compute the quantitative performance of a graph generator. For I input graphs, a node
degree distribution is calculated for each graph in the input set and in the generated set, and
MMD is used to compute the distance between these two sets of distributions (see the sketch
below). GRAPHRNN reports MMD distances for degree distributions, clustering coefficient
distributions, and four-node orbit counts.
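A minimal sketch of an MMD-style comparison between the degree distributions of an input graph set and a generated graph set; a Gaussian kernel over normalized degree histograms is used here purely for illustration (GRAPHRNN's evaluation uses kernels based on distances between histograms, e.g., the earth mover's distance).

```python
import numpy as np

def degree_histogram(degree_lists, max_degree):
    """Normalized degree histogram per graph, stacked into a (num_graphs, max_degree+1) array."""
    hists = []
    for degrees in degree_lists:
        h = np.bincount(degrees, minlength=max_degree + 1)[: max_degree + 1]
        hists.append(h / max(h.sum(), 1))
    return np.array(hists, dtype=float)

def gaussian_mmd(X, Y, sigma=1.0):
    """Squared MMD between two samples of feature vectors under a Gaussian kernel."""
    def kernel(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma ** 2))
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2 * kernel(X, Y).mean()

real = degree_histogram([[1, 2, 2, 1], [1, 1, 2, 2, 2]], max_degree=4)
generated = degree_histogram([[1, 1, 1, 3], [2, 2, 2, 2]], max_degree=4)
print(gaussian_mmd(real, generated))
```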

GRAN [43] also follows a procedure somewhat similar to DEEPGMG, learning edges between each
new node and the previously generated nodes during the graph generation process. GRAN views
the graph generation process of DEEPGMG as creating the lower-triangular part of the adjacency matrix,
i.e., generating one row at a time starting from the first row; this requires O(N) sequential
computations. To scale this approach to large graphs (∼5k nodes), instead of generating one row at a time,
GRAN generates blocks of B rows, i.e., O(N/B) sequential computations. Furthermore, it drops the
recurrent architecture used to model the sequential steps in DEEPGMG and GRAPHRNN; this
step facilitates parallel training across sequential steps. At each sequential step, the main task is
to discover edges between nodes within the new block and edges between existing nodes and new
nodes. To do this, GRAN creates augmented edges between the nodes of the new block, and also
creates augmented edges between the new nodes and all existing nodes. Finally, it uses the rows of the lower-triangular
adjacency matrix as features for existing nodes and zero vectors for new nodes, and transforms these
features to a low dimension using a transformation matrix. GRAN then runs r rounds of GNN
updates on this graph to compute each node's embedding. Using these embeddings, it learns a
Bernoulli distribution over each augmented edge, and these learned distributions are used to sample
edges. We note that GRAN uses an MLP over the difference of node embeddings during message
passing and during the computation of attention coefficients. The overall time complexity of GRAN is the same
as DEEPGMG, but since the architecture is parallelizable, GRAN can train on and generate graphs for
large-scale datasets compared to GRAPHRNN.
Finally, GRAPHGEN introduces a minimum DFS code-based sequence representation of graphs. The
minimum DFS code is a canonical label of a graph that captures its structure as well as its node/edge labels;
the canonical labels of two isomorphic graphs are identical. A DFS code sequence has length M,
where M is the number of edges. Each edge (u, v), with node labels label(u), label(v) and edge label
label(uv), is represented in the sequence as (t_u, t_v, label(u), label(uv), label(v)), where t_u, t_v are the
discovery times of nodes u and v during the DFS traversal. DFS codes can be ordered lexicographically,
and the smallest DFS code among the possible DFS codes of a graph is called its minimum
DFS code. Minimum DFS codes are an interesting concept, and we refer to
GRAPHGEN [19] for thorough details. This representation drastically reduces the possible permutations of each graph,
which speeds up the training process. Using the maximum likelihood objective, an LSTM-based
sequence generator is trained over the minimum DFS codes generated from the input graphs.

3.3 Temporal Graph Representation Learning Methods

We now summarize the temporal graph representation learning methods. Overall, we can
classify these methods into two major categories: snapshot/discrete graph-based methods [61, 20,
38, 64, 54] and continuous-time/event-stream-based methods [46, 45, 91, 39, 87, 67, 79,
60, 74, 86]. We discuss these two categories separately.

3.3.1 Snapshot/Discrete Graph based Methods


For the following discussion in this subsection, we assume the following notation: a dynamic graph G is
represented as a collection of snapshots {G_1, G_2, ..., G_T}, where each G_t = (V_t, E_t, A_t, X_t); V_t is
the node set at time t, and similarly E_t, A_t, X_t are the edge set, adjacency matrix, and node feature
matrix at time t. N_t and M_t are the numbers of nodes and edges in graph G_t.
A naïve method involves running a static graph embedding approach over each graph snapshot and
aligning these embeddings across snapshots using certain heuristics [25]; this is a very expensive
operation. DYNGEM [20] proposes an autoencoder-based approach that initializes the weights and
node embeddings of G_t using those of G_{t-1}. Mainly, an autoencoder network (MLP-based) takes the two nodes
u, v of an edge in G_t, represented by their neighbourhood vectors s_u ∈ R^{N_t}, s_v ∈ R^{N_t}, and computes
d-dimensional vector representations h_u, h_v. A decoder then takes h_u, h_v as input and reconstructs
the original s_u, s_v as ŝ_u, ŝ_v. The following loss is optimized at each timestamp t ∈ [1 . . . T] to compute
the parameters.
L_t = L^{global}_t + \beta_1 L^{local}_t + \beta_2 L^{1}_t + \beta_3 L^{2}_t

L^{global}_t = \sum_{(u,v) \in E_t} \lVert \hat{s}_u - s_u \rVert^2 + \lVert \hat{s}_v - s_v \rVert^2    (16)

L^{local}_t = \sum_{(u,v) \in E_t} \lVert h_u - h_v \rVert^2
where L^1, L^2 are the L1-norm and L2-norm computed over the network weights to reduce over-fitting,
L^{global}_t is the autoencoder reconstruction loss at snapshot t, and L^{local}_t is a first-order proximity loss
that preserves the local structure. We note that the parameters of the autoencoder for G_t are initialized
using the parameters of the autoencoder for G_{t-1}, inducing stability of graph/node embeddings over
consecutive snapshots and reducing the training computation, under the assumption that the graph structure
does not change drastically between consecutive snapshots. TNODEEMBED [64] is a similar method, but instead of
using an autoencoder it directly learns a node embedding matrix W_t ∈ R^{N×d} for each snapshot by
optimizing a node classification or edge reconstruction task. It also uses the following loss to
align the consecutive W_t, W_{t+1}:
R_{t+1} = \operatorname*{argmin}_{R}\big(\lVert W_{t+1} R - W_t \rVert_F + \lambda \lVert R^{\top} R - I \rVert\big)    (17)
W_{t+1} \leftarrow W_{t+1} R_{t+1}
where the first term forces consecutive embeddings to be stable and the second term requires R_{t+1} to be an
orthogonal (rotation) matrix. Further, they employ a recurrent neural network over the node embeddings at each
timestamp to connect the time-dependent embeddings of each node and learn its final representation.
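When the second term of Equation 17 is enforced exactly, i.e., R is constrained to be orthogonal, the alignment has a closed-form solution (the orthogonal Procrustes problem); the sketch below computes it with an SVD under that assumption, with both snapshots assumed to share the same node ordering.

```python
import numpy as np

def align_embeddings(W_next, W_prev):
    """Rotate W_{t+1} onto W_t, cf. Equation 17.

    With an exact orthogonality constraint R^T R = I, the minimizer of
    ||W_{t+1} R - W_t||_F is R = U V^T where U S V^T = SVD(W_{t+1}^T W_t).
    Assumes the two snapshots share the same node ordering.
    """
    U, _, Vt = np.linalg.svd(W_next.T @ W_prev)
    R = U @ Vt                      # orthogonal alignment matrix
    return W_next @ R               # aligned embeddings for snapshot t+1

rng = np.random.default_rng(0)
W_t = rng.normal(size=(100, 16))
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))   # a random rotation
W_t1 = W_t @ Q                                    # next snapshot = rotated copy
aligned = align_embeddings(W_t1, W_t)
print(np.allclose(aligned, W_t, atol=1e-8))       # True: the rotation is recovered
```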
DYNAMICTRIAD [38] is a similar method based on regularizing the embeddings of consecutive snapshots,
but it additionally models the triadic closure process. Assume three nodes (u, v, w) in an
evolving social network where (u, w) and (v, w) are connected but (u, v) is not, i.e., w is a common
friend of u and v. Depending on w's social habits, w might or might not introduce u and v. This
implies that the higher the number of common neighbors between u and v, the higher the chance of
them being connected at the next snapshot. DYNAMICTRIAD models this phenomenon by defining the
strength of w with respect to u, v at snapshot t as follows:

s^{t}_{uvw} = w^{t}_{uw}(h^{t}_w - h^{t}_u) + w^{t}_{vw}(h^{t}_w - h^{t}_v)    (18)
where s^{t}_{uvw} ∈ R^d and w^{t}_{uw}, w^{t}_{vw} denote the tie strength of w with u and v, respectively, at time t.
Utilizing this, DYNAMICTRIAD defines the following probability of u, v, w becoming a closed triad at
snapshot t+1, given that they form an open triad at snapshot t with w as the common neighbour:

p^{t}(u, v, w) = \frac{1}{1 + \exp(-\langle \theta, s^{t}_{uvw} \rangle)}    (19)
Since u and v can have multiple common neighbours, any of these neighbours can connect u and v, thus
closing the triads with all the neighbours. But in the real world, which neighbour(s) closed the triad is
unknown. To accommodate this, they introduce a vector α^{t}_{uv} of length B, where B is the number of
common neighbours of u, v, i.e., the size of the set B^{t}(u, v) = \{w \mid (w, u) \in E_t \wedge (w, v) \in E_t \wedge (u, v) \notin E_t\}.
Formally, α^{t}_{uv} = (α_{uvw})_{w \in B^{t}(u,v)}, where α_{uvw} = 1 denotes that u, v will connect at time t+1
under the influence of w. Finally, they introduce the probabilities that u, v will connect (or not) at t+1, given
that they are not connected and form open triad(s) at time t:

p^{t}_{+}(u, v) = \sum_{\alpha^{t}_{uv} \neq 0}\ \prod_{w \in B^{t}(u,v)} p^{t}(u, v, w)^{\alpha_{uvw}} \times (1 - p^{t}(u, v, w))^{(1 - \alpha_{uvw})}    (20)

p^{t}_{-}(u, v) = \prod_{w \in B^{t}(u,v)} (1 - p^{t}(u, v, w))^{(1 - \alpha_{uvw})}

where p_{+} and p_{-} denote the probabilities of u, v connecting or not connecting at the next snapshot t+1.
The summation over α denotes iterating over all possible configurations of the common neighbour(s) causing
the connection between u and v. We note that at least one entry in α must be non-zero to enable edge creation
between u and v at the next step. Finally, their loss function is:
L = \sum_{t=1}^{T} L^{t}_{triad} + \beta_0 L^{t}_{ranking} + \beta_1 L^{t}_{smooth}

L^{t}_{triad} = -\sum_{(u,v) \in S^{t}_{+}} p^{t}_{+}(u, v) - \sum_{(u,v) \in S^{t}_{-}} p^{t}_{-}(u, v)    (21)

L^{t}_{ranking} = \sum_{(u,v) \in E_t,\ (\hat{u},\hat{v}) \notin E_t} w^{t}_{uv} \max(0, \lVert h^{t}_u - h^{t}_v \rVert_2^2 - \lVert h^{t}_{\hat{u}} - h^{t}_{\hat{v}} \rVert_2^2)

L^{t}_{smooth} = \sum_{u \in V_t} \lVert h^{t}_u - h^{t-1}_u \rVert_2^2

where S^{t}_{+} is the set of edges that form at t+1 and S^{t}_{-} is the set of edges that do not exist at t+1.
L^{t}_{ranking} is a ranking loss over edges that preserves the structural information of the corresponding snapshot,
and L^{t}_{smooth} is an embedding-stability constraint on the embeddings of each node. These methods cannot
directly utilize node features and focus only on the first-order proximity of graph structures,
ignoring higher-order structures. EVOLVEGCN [54] and DYSAT [61] utilize graph neural
networks to calculate the node embeddings over each snapshot. Thus, these methods are additionally
capable of using node features and of modeling the higher-order neighbourhood structure. We note that these
features are also dynamic, i.e., they can evolve over time.
EVOLVEGCN is a natural temporal extension of GCN. Equation 6 is rewritten as follows:

H^{l+1}_t = \sigma(\tilde{A}_t H^{l}_t W^{l}_t), \qquad H^{0}_t = X_t    (22)
where the subscript t denotes the time of the corresponding snapshot. We note that in the EVOLVEGCN
formulation, the weights W^{l}_t ∀t ∈ [1 . . . T], l ∈ [1..L] are not learned by the GNN itself but are the output of an
external recurrent network that incorporates information from the current as well as past snapshots. The
parameters of the GCN at each snapshot are thus controlled by a recurrent model, while the node embeddings
at the corresponding snapshot are computed using these parameters. Specifically, EVOLVEGCN proposes
two variants for calculating W^{l}_t.
1. This variant treats W^{l}_t as the hidden state of an RNN cell. Specifically,

   W^{l}_t = \mathrm{RNN}(H^{l}_t, W^{l}_{t-1})    (23)

   where H^{l}_t is the input to the RNN and W^{l}_{t-1} is the hidden state at the previous timestamp. This
   variant is useful if node features are strong and play an important role in the end task.

2. This variant treats W^{l}_t as the output of an RNN cell. Specifically,

   W^{l}_t = \mathrm{RNN}(W^{l}_{t-1})    (24)

   where W^{l}_{t-1} is the input, which was the output of the RNN cell at the previous timestamp.

The parameters are trained end-to-end using any loss associated with the node classification/link
classification/link prediction tasks.
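A simplified sketch of the second variant (Equation 24): the GCN weights of a layer are evolved across snapshots by a recurrent update and then used in the propagation of Equation 22. A plain tanh recurrence stands in for the GRU/LSTM cell used by the method, the normalized adjacency is a placeholder, and all shapes are illustrative.

```python
import numpy as np

def evolve_weights(W_prev, U):
    """Recurrent update standing in for W_t = RNN(W_{t-1}) in Equation 24 (simplified cell)."""
    return np.tanh(W_prev @ U)

def gcn_forward(A_hat, X, W):
    """One GCN propagation at a single snapshot, cf. Equation 22."""
    return np.maximum(0.0, A_hat @ X @ W)

rng = np.random.default_rng(0)
d_in, d_out, N, T = 4, 8, 5, 3
W = rng.normal(size=(d_in, d_out)) * 0.1          # W_0, learned jointly with U in practice
U = rng.normal(size=(d_out, d_out)) * 0.1         # parameters of the recurrent update
for t in range(T):
    A_hat = np.eye(N)                              # placeholder normalized adjacency A~_t
    X_t = rng.normal(size=(N, d_in))               # snapshot features
    H_t = gcn_forward(A_hat, X_t, W)               # node embeddings at snapshot t
    W = evolve_weights(W, U)                       # evolve weights for the next snapshot
    print(t, H_t.shape)
```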

Similar to EVOLVEGCN, DYSAT [61] is a GNN-based model, building on GAT. DYSAT runs two
attention blocks. First, it runs a GAT-style GNN over each snapshot separately; this provides
static node embeddings at each snapshot that capture the structural and attribute-based
information. Then, DYSAT runs a temporal self-attention block: for each node at time t, this
module takes all of the node's previous embeddings and computes a new embedding using self-attention,
which also incorporates the temporal modalities. Specifically,
• Structural Attention: For each G_t ∈ G = {G_1 . . . G_T}, h_v in Equation 8 is replaced as:

  h^{l+1}_{vt} = \sigma\Big(\sum_{u \in N_v} \alpha_{vu} W h^{l}_{ut}\Big)    (25)

  \alpha_{vu} = \frac{\exp\big(a_{vu}\, a^{\top} \mathrm{LeakyReLU}(W h^{l}_{vt} \,\|\, W h^{l}_{ut})\big)}{\sum_{i \in N_v} \exp\big(a_{vi}\, a^{\top} \mathrm{LeakyReLU}(W h^{l}_{vt} \,\|\, W h^{l}_{it})\big)}

  where a_{vu} is a graph input denoting the weight of the edge (v, u) ∈ E_t. Please note
  that in GAT a message is also computed from the node itself, which is not the case here.
• Temporal Self-attention: We denote the node representations learned via structural
  attention for each node v ∈ G as H_v = \{h^{L}_{v1}, h^{L}_{v2}, \ldots, h^{L}_{vT}\}, h_{vt} ∈ R^d, and the outputs as
  Z_v = \{z^{L}_{v1}, z^{L}_{v2}, \ldots, z^{L}_{vT}\}, z_{vt} ∈ R^{d'}. H_v ∈ R^{T×d} is the input to the temporal self-attention block and
  Z_v ∈ R^{T×d'} is the output; z_{vt} is the final output of node v at snapshot t, which is
  used for downstream tasks. Following a design similar to [70], H_v is used as query, key,
  and value, hence the name self-attention.
  Specifically, W_q, W_k, W_v ∈ R^{d×d'} are matrices used to transform H_v to the
  corresponding query, key, and value spaces. Essentially, the query and key are used to compute
  an attention value for each timestamp t with respect to the previous timestamps, including t. Using these
  attention values, a final value for timestamp t is computed by aggregating the values of the previous
  timestamps, including t, with the attention weights. Specifically,

  Z_v = \beta_v (H_v W_v), \qquad \beta^{ij}_v = \frac{\exp(\alpha^{ij}_v)}{\sum_{k=1}^{T} \exp(\alpha^{ik}_v)}    (26)

  \alpha^{ij}_v = \frac{\big((H_v W_k)(H_v W_q)^{\top}\big)_{ij}}{\sqrt{d'}} + M_{ij}

  where M_{ij} = 0 ∀i ≤ j and M_{ij} = −∞ ∀i > j.

We note that the entire architecture is trained end-to-end by running random walks over each snapshot
and using an unsupervised loss similar to DEEPWALK for nodes that co-occur in the walks.
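A numpy sketch of the masked temporal self-attention of Equation 26 for one node's sequence of snapshot embeddings H_v ∈ R^{T×d}; the softmax is written explicitly, and the mask is set so that each snapshot attends only to snapshots up to and including itself, per the textual description above. The projection matrices are random placeholders.

```python
import numpy as np

def temporal_self_attention(H, Wq, Wk, Wv):
    """Masked self-attention over one node's snapshot embeddings, cf. Equation 26.

    H: (T, d) embeddings of a single node across T snapshots.
    Wq, Wk, Wv: (d, d') projection matrices for queries, keys and values.
    """
    T = H.shape[0]
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = (Q @ K.T) / np.sqrt(Wq.shape[1])          # (T, T) raw attention scores
    mask = np.triu(np.full((T, T), -np.inf), k=1)      # -inf strictly above the diagonal (future snapshots)
    scores = scores + mask
    scores = scores - scores.max(axis=1, keepdims=True)
    beta = np.exp(scores)
    beta = beta / beta.sum(axis=1, keepdims=True)      # row-wise softmax
    return beta @ V                                    # (T, d') time-aware outputs Z_v

rng = np.random.default_rng(0)
T, d, d_out = 5, 8, 4
H = rng.normal(size=(T, d))
Z = temporal_self_attention(H, rng.normal(size=(d, d_out)),
                            rng.normal(size=(d, d_out)), rng.normal(size=(d, d_out)))
print(Z.shape)  # (5, 4)
```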

3.3.2 Continuous Time Graph/Event-Stream Graph-based Methods


We now turn our focus to temporal embedding methods for continuous-time graphs, where each edge
between two nodes is an instantaneous event. We first discuss methods that model the evolving
network topology [45, 91, 46]. Then we discuss methods that additionally model graph attributes;
these include bipartite-interaction-only methods [87, 39] and methods for general interaction networks
[79, 74, 67, 86, 60]. HTNE [91] notes that snapshot-based methods model the temporal network at
pre-defined windows, thus ignoring the network/neighborhood formation process. HTNE remarks
that neighborhood formation for each node is a vital process that excites the neighborhood formation
of other nodes. For example, in a co-authorship network, the co-authors of a Ph.D. student
will be her advisor or colleagues in the same research lab. However, as that student becomes a
professor in the future, her students might collaborate with the students of her previous colleagues, thus
exciting edges between nodes. Furthermore, the neighborhood sequence of co-authors for that Ph.D.
student might indicate evolving research interests, which will influence future co-authors.
For the discussion below, we denote a temporal graph as G = (V, E, X), where V is the collection
of nodes in the network, E = \{(u, v, t) \mid u, v \in V, t \in R^{+}\}, and X ∈ R^{N×F} is the node feature matrix
with N = |V| and F the feature dimension. HTNE views neighbourhood
formation as a sequence of events, where each event is an edge formation between a source and a target
node at time t. The neighbourhood formation sequence N(v) of a node v can be represented as
\{(u_i, t_i) \mid i = 1, 2, \ldots, I\}, where I is the number of interactions/edges of node v. We observe that a node
can interact multiple times with the same node but at different timestamps. HTNE treats N(v)
as an event sequence for each node v and models these event sequences using marked
temporal point processes (TPPs) [59]. A TPP is a natural tool for modeling a sequential event stream
\{(e_1, t_1), (e_2, t_2), \ldots, (e_T, t_T)\} in which past events can influence the next event. TPPs are
mostly characterized by the conditional intensity function λ(t), which defines the event arrival rate at time t
given the past event history H(t); λ(t)Δt is the expected number of events in an infinitesimal time
interval [t, t + Δt].

\lambda(t \mid H(t)) = \lim_{\Delta t \to 0} \frac{\mathbb{E}[N(t + \Delta t) - N(t) \mid H_t]}{\Delta t}    (27)
Particularly, HTNE utilizes temporal hawkes point process to model the neighbourhood formation
sequence for each node v in the network. Below formulation define the hawkes conditional intensity
for edge between node v and node u at time t.
$$\tilde{\lambda}_{u|v}(t) = \mu_{u,v} + \sum_{(w, t_w) \in N(v),\, t_w < t} \alpha_{w,u}\, \kappa_v(t - t_w)$$
$$\mu_{u,v} = -\|h_u - h_v\|^2, \qquad \kappa_v(t - t_w) = \exp(-\delta_v (t - t_w)), \qquad \alpha_{w,u} = -\gamma_{w,v}\, \|h_u - h_w\|^2 \quad (28)$$
$$\gamma_{w,v} = \frac{\exp(-\|h_w - h_v\|^2)}{\sum_{(w', t_{w'}) \in N(v),\, t_{w'} < t} \exp(-\|h_{w'} - h_v\|^2)}$$

where $\mu_{u,v}$ is the base rate of the edge formation event between nodes $u$ and $v$, $\alpha_{w,u}$ is the importance of node $v$'s historical neighbour $w$ in a possible edge creation with node $u$, and $\kappa$ is a time-decay kernel that reduces the impact of old neighbours on the current edge formation. Finally, the following loss is optimized to train the node embeddings:
$$\log L = \sum_{v \in V} \sum_{(u,t) \in N(v)} \log \frac{\lambda_{u|v}(t)}{\sum_{w \in V} \lambda_{w|v}(t)}, \qquad \lambda_{u|v}(t) = \exp\big(\tilde{\lambda}_{u|v}(t)\big) \quad (29)$$

HTNE applies the exponential over $\tilde{\lambda}$ to keep the conditional intensity positive, as negative event rates are not possible. Furthermore, the exponential yields a softmax loss which, as in previously seen approaches, can be optimized using negative sampling for faster training. MDNE [46] is a similar approach that, in addition to the edge formation process, models the growth of the network, i.e., the number of edges $e(t)$ at each timestamp $t$. Specifically, given the number of nodes $n(t)$ by time $t$, the following defines the number of new edges at time $t$:
$$\Delta e'(t) = r(t)\, n(t)\, \big(\zeta\, (n(t) - 1)^{\gamma}\big), \qquad r(t) = \frac{1}{|E|} \sum_{(u,v,t) \in E} \sigma\big(-\|h_u - h_v\|^2\big) \quad (30)$$

where $\zeta$, $\gamma$, and $\theta$ are learnable parameters and $r(t)$ encodes the linking rate of each node with every other node in the network. Additionally, MDNE adds neighborhood factors of nodes $u$ and $v$ while calculating the intensity function for edge formation between $v$ and $u$ in eq. 28. Finally, it adds a loss term to eq. 29 based on the mean squared error between the predicted and ground-truth number of new edges at each time $t$. FiTNE [45] is a random walk-based method that defines a $k$-length temporal walk $W$ over a temporal graph $G$ as $\{w_1, w_2, \ldots, w_k\}$, $w_i \in E$, without constraining the timestamps of consecutive edges. Essentially, FiTNE follows an approach similar to DeepWalk and node2vec: it collects a set of random walks from the temporal graph $G$ and learns node embeddings with a skip-gram-based learning framework and a similar unsupervised loss. When running a random walk, a transition probability is defined over consecutive edges as follows:
$$p(e_{out} \mid e_{in}) = \frac{w(e_{out} \mid e_{in})}{\sum_{e \in I(v_{e_{in}})} w(e \mid e_{in})} \quad (31)$$

where $I(v_{e_{in}})$ is the set of edges outgoing from $e_{in}$'s endpoint $v_{e_{in}}$. FiTNE proposes both unbiased and time-difference-based formulations for $w$. The methods above do not utilize graph attributes and generate static node embeddings; in particular, the learned node embeddings are not a function of time. We therefore now summarize methods that utilize graph attributes and learn node embeddings as a function of time $t$.
JODIE [39] and TigeCMN [87] are bipartite interaction network-based methods with a specific focus on the user-item interaction paradigm; accordingly, their architectures involve design choices specific to users and items. Consequently, these methods are not applicable to non-bipartite interaction networks. A major difference between bipartite and non-bipartite networks is the absence of interactions between nodes of the same type (user-user, item-item). JODIE remarks that existing temporal embedding methods produce a static embedding from the temporal network, as we saw in the previous approaches. In recommendation applications, this leads to similar recommendations whether the user revisits the site after one hour, one day, or one month. This implies that node embeddings should be a function of time $t$, i.e., the embedding of a user should model the changing intent over time. Additionally, JODIE models the stationary component of a user's intent. Moreover, since users interact with millions of items, recommendation time should be sublinear in the number of items.
JODIE's architecture models all of these aspects. We use the following notation for a bipartite graph $G = (U, I, E)$: $U$ is the collection of user nodes, $I$ is the collection of item nodes, and $E$ is the collection of observed interactions, each represented as $e = (u, i, t, x)$ with $u \in U$, $i \in I$, $t \in \mathbb{R}^+$, $x \in \mathbb{R}^F$. Denoting by $h_v(t)$ the embedding of node $v$ at time $t$, by $h_v(t^-)$ the embedding of node $v$ just before time $t$ (i.e., the embedding after its latest interaction at some time $t' < t$), and by $h_v$ the static embedding of node $v$, JODIE defines the following update of the user and item embeddings after observing an interaction $(u, i, t, x)$:
$$h_u(t) = \sigma\big(W_1^{user} h_u(t^-) + W_2^{user} h_i(t^-) + W_3^{user} x + W_4^{user} \Delta t_u\big)$$
$$h_i(t) = \sigma\big(W_1^{item} h_u(t^-) + W_2^{item} h_i(t^-) + W_3^{item} x + W_4^{item} \Delta t_i\big) \quad (32)$$
where $\Delta t_u$ and $\Delta t_i$ are the time differences between $t$ and the last interactions of $u$ and $i$, respectively. The $W^{user}$ matrices are the parameters of the RNN that updates user embeddings, and the $W^{item}$ matrices of the RNN that updates item embeddings; all user RNNs share the same parameters, and likewise all item RNNs share the same parameters. Finally, JODIE proposes the following projection operator for calculating the projected node embedding at $t + \Delta t$, where $t$ is the last interaction time and $\Delta t$ is the time elapsed since the last interaction:
$$h_v^{projected}(t + \Delta t) = (1 + W_p\,\Delta t)\, h_v(t) \quad (33)$$
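
A minimal NumPy sketch of the JODIE-style update (eq. 32) and projection (eq. 33) for a single user embedding; the weight matrices, feature dimension, and the element-wise form of the projection are illustrative assumptions rather than JODIE's actual parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d, f = 16, 4                          # embedding and interaction-feature sizes (assumed)
rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))   # placeholder user-update weights
W3, w4 = rng.normal(size=(d, f)), rng.normal(size=d)        # feature and time-delta weights
w_p = rng.normal(size=d)                                     # projection weights (eq. 33)

def update_user(h_u, h_i, x, dt_u):
    """User-embedding update after an interaction, in the spirit of eq. (32)."""
    return sigmoid(W1 @ h_u + W2 @ h_i + W3 @ x + w4 * dt_u)

def project(h, dt):
    """Drift the embedding forward by dt since the last interaction (eq. 33),
    implemented here as an element-wise scaling (1 + w_p * dt)."""
    return (1.0 + w_p * dt) * h

h_u, h_i, x = rng.normal(size=d), rng.normal(size=d), rng.normal(size=f)
h_u_new = update_user(h_u, h_i, x, dt_u=0.7)
print(project(h_u_new, dt=2.5)[:4])   # the user embedding changes with elapsed time
```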
To train the network, JODIE utilizes the future interactions of user nodes: given the last interaction of user $u$ with item $i$ at time $t$ and its next interaction with item $j$ at time $t + \Delta t$, can we predict the item embedding just before $t + \Delta t$, which should match item $j$'s actual embedding, i.e., the concatenation $h_j \,\|\, h_j\big((t + \Delta t)^-\big)$ of its static and dynamic embeddings? The predicted item embedding at time $t + \Delta t$ is calculated from the last interaction of node $u$ at time $t$ with node $i$:
$$h_j^{predicted}(t + \Delta t) = W_1 h_u^{projected}(t + \Delta t) + W_2 h_i\big((t + \Delta t)^-\big) + W_3 h_u + W_4 h_i + b \quad (34)$$
The architecture is trained end-to-end using the mean square error between predicted item embedding
and actual item embedding. Further, it adds a regularization term over the change in consecutive
dynamic embeddings for users and items. JODIE notes that since the architecture directly predicts
the future item embedding, using LSH techniques [68], the nearest item can be searched in constant
time. TigeCMN [87] utilizes memory networks to store each node's past interactions, instead of a single latent vector, in order to improve performance. Specifically, TigeCMN creates a value memory matrix $M_u$ for each user $u$ and a value memory matrix $M_i$ for each item $i$; these memory matrices have $K$ slots of dimension $d$. Furthermore, there is a key memory matrix $M_U$ shared by all users and another key memory matrix $M_I$ shared by all items. Given an interaction with feature vector $x_{u,i}$ between user $u$ and item $i$ at time $t$, the memory matrix $M_u$ of user $u$ is updated as follows:
$$h_{u,i} = g\big(W_2(\mathrm{DROPOUT}(W_1 [x_{u,i} \,\|\, h_i \,\|\, \Delta u] + b))\big)$$
$$s_k = \frac{M_U(k)\, h_{u,i}^{\top}}{\|M_U(k)\|_2\, \|h_{u,i}\|_2}, \qquad w_k = \frac{\exp(s_k)}{\sum_{j=1}^{K} \exp(s_j)}, \qquad k = 1, 2, \ldots, K \quad (35)$$
$$e_{u,i} = \sigma(w_e h_{u,i} + b_1), \qquad M_u(k) = M_u(k) \odot (1 - w_k\, e_{u,i}), \qquad k = 1, 2, \ldots, K$$
$$a_{u,i} = \tanh(w_a h_{u,i}), \qquad M_u(k) = M_u(k) + a_{u,i}\, w_k, \qquad k = 1, 2, \ldots, K$$
where $e_{u,i}$ erases values from $M_u$ using the embedding $h_{u,i}$ of interaction $(u, i, t, x_{u,i})$, and $a_{u,i}$ is an add vector that updates $M_u$. The same procedure is followed to update the value memory matrix $M_i$ of item $i$. To compute user $u$'s embedding at time $t$, its value matrix $M_u \in \mathbb{R}^{K \times d}$ is passed through a self-attention layer to provide a context-aware embedding for each slot $k = 1, 2, \ldots, K$, which is then mean-pooled into a dynamic embedding $h'_u$ and concatenated with a static node embedding (computed by transforming the one-hot node representation) to provide $h_u$. Finally, the network is trained end to end using the cosine similarity between the user and item of each interaction, together with negative interaction samples, similar to GraphSAGE's unsupervised loss. We note that the node embeddings learnt by TigeCMN are not a function of time, i.e., the embedding of a node does not change after its last interaction. Another method, IGE [86], uses a skip-gram [48] style loss for training embeddings. It first introduces an induced list $S_u = (u, v_1, t_1), (u, v_2, t_2), \ldots, (u, v_L, t_L)$ containing all interactions of each node $u$. IGE defines a context window $W(v)$ of size $C$ for each node $v \in S_u$ using its neighbours in the induced list $S_u$, similar to word2vec. Finally, IGE uses a skip-gram-like loss with negative sampling over these induced lists; essentially, it treats each induced list as a sentence, i.e., a sequence of words, to train the embeddings.
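
To make the induced lists concrete, here is a toy sketch that builds $S_u$ from a timestamped edge list and extracts skip-gram context pairs from it; the window size and the edge list are illustrative, not taken from IGE.

```python
def induced_list(edges, u):
    """S_u: all interactions of node u, ordered by time (an induced list)."""
    return sorted([(v, t) for (a, v, t) in edges if a == u] +
                  [(a, t) for (a, v, t) in edges if v == u], key=lambda p: p[1])

def context_pairs(seq, C=1):
    """Skip-gram style pairs: each element paired with neighbours within window C."""
    pairs = []
    for i, (v, _) in enumerate(seq):
        for j in range(max(0, i - C), min(len(seq), i + C + 1)):
            if j != i:
                pairs.append((v, seq[j][0]))
    return pairs

edges = [("u", "a", 1.0), ("u", "b", 2.0), ("c", "u", 3.5), ("u", "a", 4.0)]
S_u = induced_list(edges, "u")
print(S_u)                 # [('a', 1.0), ('b', 2.0), ('c', 3.5), ('a', 4.0)]
print(context_pairs(S_u))  # neighbouring entries serve as word2vec-style contexts
```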
We now summarize the temporal graph representation methods, which utilize both dynamic graph
topology and associated attributes to learn the node representation of non-bipartite temporal graphs.
Specifically, DyRep [67] remarks that most temporal graphs exhibit two dynamic processes, realizing at different (or the same) timescales: an association process and a communication process. The association process captures topological changes in the graph, and the communication process captures interaction/information exchange between connected (or possibly disconnected) nodes. DyRep observes that the two processes are interleaved, i.e., an association event impacts future communications between nodes, and a communication event can excite an association event. DyRep notes that association events have a more global impact since they change the topology; in contrast, communication events are local, although they are indirectly capable of global impact by exciting topological changes in the network. We assume $G = (V, E)$ with $E = \{(u, v, t, k)\}$, $u, v \in V$, $t \in [0, T]$, where $k = 0$ denotes an association event and $k = 1$ a communication event; note that permanent topological changes correspond to $k = 0$. The following formulation shows the node embedding update corresponding to an event $e = (u, v, t, k)$:
$$h_v(t) = \sigma\big(W_1 h_v(t_p^v) + W_2 z_u(t^-) + W_3 (t - t_p^v)\big), \qquad h_v(0) = x_v \quad (36)$$
where $t_p^v$ is the time of the last event involving node $v$ and $z_u(t^-)$ is the aggregated embedding of node $u$'s neighbourhood just before time $t$. We note that the node embedding update formulation comprises three main principles:
1. Self-Propagation: A node's representation should evolve from its previous representation.
2. Localized embedding propagation: An event (association or communication) between two nodes must be the outcome of the neighbourhood of the node through which the information propagated.
3. Exogenous drive: Finally, a global process can update a node's representation between successive events involving that node.
zu (t− ) is calculated using the neighbourhood of node u as:
$$z_u(t^-) = \max\big(\{\sigma(q_{ui}\,(W_4 (W_5 h_i(t^-) + b))) \mid i \in N(u)\}\big) \quad (37)$$

where $h_i(t^-)$ is the latest embedding of node $i$ before time $t$. $q_{ui}$ denotes the weight of each structural neighbour of node $u$ and reflects the tendency of node $u$ to communicate more with associated nodes, i.e., $q_{ui}$ is higher for the neighbours with which node $u$ has communicated more frequently. We note, however, that among frequently communicating neighbours, more weight could arguably be given to those with the most recent exchanges. For the detailed calculation of $q_{ui}$, we refer to Algorithm 1 of DyRep. Finally, DyRep utilizes temporal point processes to model an event $e = (u, v, t, k)$. Formally,
$$\lambda_k^{u,v}(t) = f_k\big(W_6\, [h_u(t^-) \,\|\, h_v(t^-)]\big), \qquad f_k(x) = \phi_k \log\Big(1 + \exp\Big(\frac{x}{\phi_k}\Big)\Big) \quad (38)$$
where $h_v(t^-)$ is the most recently updated representation of node $v$ and $f_k$ is a softplus function parameterized separately for each event type $k$. Finally, DyRep uses the following loss to train the model parameters:
$$L = -\sum_{(u,v,t,k) \in E} \log\big(\lambda_k^{u,v}(t)\big) + \int_{t=0}^{T} \sum_{u \in V} \sum_{v \in V} \lambda_k^{u,v}(t)\, dt \quad (39)$$
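
To ground eqs. (38) and (39), here is a small sketch that evaluates the softplus-parameterized intensity for one event type and assembles a single event's negative log-likelihood together with a crude Monte-Carlo estimate of the survival term; the embeddings, weights, and sampling scheme are illustrative placeholders, not DyRep's actual training procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
W6 = rng.normal(size=(2 * d,))   # scoring weights for one event type k (placeholder)
phi_k = 1.0                       # softplus scale parameter phi_k

def intensity(h_u, h_v):
    """lambda_k^{u,v}(t) = f_k(W6 . [h_u || h_v]) with f_k a scaled softplus (eq. 38)."""
    score = W6 @ np.concatenate([h_u, h_v])
    return phi_k * np.log1p(np.exp(score / phi_k))

# One observed event contributes -log(lambda); the survival term integrates lambda over
# non-events, approximated here by averaging over randomly sampled node pairs (eq. 39).
H = rng.normal(size=(10, d))                      # placeholder node embeddings
nll_event = -np.log(intensity(H[0], H[1]))
pairs = rng.integers(0, 10, size=(50, 2))
survival = np.mean([intensity(H[i], H[j]) for i, j in pairs])  # Monte-Carlo estimate
print(nll_event + survival)
```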

We note that the second term of the loss corresponds to the survival probability of events that do not happen until time $T$; this term is computed via sampling techniques, and we refer to Algorithm 2 of DyRep for details of the sampling procedure. The approach handles unseen nodes, i.e., it supports both transductive and inductive settings. We note that DyRep's methodology requires datasets containing both association (evolution) events and communication events, which we see as a major limitation since most available datasets do not have association events. We now focus on methods that require only interaction events and do not differentiate between association and communication events.
TGAT [79] is a self-attention-based method similar to GAT, but it uses a novel functional time encoding technique to embed time. While previous approaches projected time into the embedding space using separate weight matrices, TGAT utilizes Bochner's theorem to propose the following time encoding:
$$h_T(t) = \frac{1}{\sqrt{d}}\,\big(\cos(\omega_1 t), \sin(\omega_1 t), \cos(\omega_2 t), \sin(\omega_2 t), \ldots, \cos(\omega_d t), \sin(\omega_d t)\big) \quad (40)$$
where $d$ is a hyperparameter and $\{\omega_1, \ldots, \omega_d\}$ are learnable parameters. For the derivation of this encoding from Bochner's theorem, we refer to [78]. TGAT defines the neighbourhood of node $u$ at time $t$ as $N_u(t) = \{(v, t') \mid (u, v, t') \in E \wedge t' < t\}$ and the following temporal GAT layer $l$ at time $t$:
$$\tilde{h}_v^{l-1}(t) = h_v^{l-1}(t) \,\|\, h_T(t)$$
$$\alpha_u = \frac{\exp\big((W_{query}\tilde{h}_v^{l-1}(t))(W_{key}\tilde{h}_u^{l-1}(t))^{\top}\big)}{\sum_{w \in N_v(t)} \exp\big((W_{query}\tilde{h}_v^{l-1}(t))(W_{key}\tilde{h}_w^{l-1}(t))^{\top}\big)}, \qquad u \in N_v(t) \quad (41)$$
$$h_v^l(t)_{attention} = \sum_{u \in N_v(t)} \alpha_u\, W_{value}\, \tilde{h}_u^{l-1}(t)$$
$$h_v^l(t) = W_2^l\, \mathrm{ReLU}\big(W_1^l [h_v^l(t)_{attention} \,\|\, x_v] + b_1^l\big) + b_2^l, \qquad h_v^0(t) = x_v$$
TGAT extends this formulation to multi-head attention with $k$ heads by using separate $W_{query}$, $W_{key}$, $W_{value}$ matrices for each attention head:
$$h_v^l(t) = W_2^l\, \mathrm{ReLU}\big(W_1^l [h_v^l(t)_{attention_1} \,\|\, h_v^l(t)_{attention_2} \,\|\, \ldots \,\|\, h_v^l(t)_{attention_k} \,\|\, x_v] + b_1^l\big) + b_2^l \quad (42)$$
Finally, $h_v^L(t)$ is the final representation of node $v$ at time $t$ using an $L$-layer TGAT. The model is trained, similarly to GraphSAGE, with a link prediction or node classification loss. TGAT's temporal GNN layer is inductive, i.e., it can predict embeddings of unseen nodes since its parameters are not node-dependent. Furthermore, it can incorporate edge features and evolving node attributes $x_v(t)$.
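
As an illustration of the functional time encoding in eq. (40), the sketch below maps a time offset to the interleaved cosine/sine feature vector that TGAT concatenates with node features; the frequencies are drawn at random here, whereas TGAT learns them.

```python
import numpy as np

def time_encoding(t, omegas):
    """Functional time encoding of eq. (40): interleaved cos/sin features,
    one frequency per dimension, scaled by 1/sqrt(d)."""
    d = len(omegas)
    feats = np.empty(2 * d)
    feats[0::2] = np.cos(omegas * t)
    feats[1::2] = np.sin(omegas * t)
    return feats / np.sqrt(d)

rng = np.random.default_rng(3)
omegas = rng.uniform(0.1, 10.0, size=16)   # placeholder for the learnable frequencies
print(time_encoding(0.0, omegas)[:4])       # encoding of "now"
print(time_encoding(5.0, omegas)[:4])       # encoding of an older time offset
```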
TGN [60] attempts to unify the ideas proposed in previous approaches and provides a general
framework for representation learning in continuous temporal graphs. This framework is inductive and
consists of independent exchangeable modules. For example, there is a module for embedding time to
a vector. This module is independent of the rest and is replaceable with a domain understanding-based
embedding function. They also define a new event type which is node addition/updation e = (v, t, x)
where a node v with attributes x is added/updated in the temporal graph G at time t. TGN defines
the following modules:

• Memory: A memory vector $s_v$ is kept for each node $v$. This memory stores compressed information about node $v$ and is updated only during an event involving $v$. For a new node, the memory is initialized to 0.
• Message Function: The following messages are computed when an interaction $e = (u, v, t, x_e)$ occurs:
$$m_u(t) = \mathrm{MLP}(s_u, s_v, \Delta t_u, x_e), \qquad m_v(t) = \mathrm{MLP}(s_v, s_u, \Delta t_v, x_e) \quad (43)$$
where $x_e$ denotes the edge attributes and $\Delta t_u$ is the time difference between $t$ and the last interaction time of node $u$. If the event is a node addition/update $e = (v, t, x)$, the message $m_v = \mathrm{MLP}(s_v, \Delta t_v, x)$ is computed.
• Message Aggregation: TGN processes interactions in batches rather than one by one to parallelize training. Given $B$ messages for a node $v$ in a batch spanning $[t_{start}, t]$, its aggregated message at time $t$ is
$$m_v^{agg}(t) = \mathrm{AGGREGATOR}\big(m_v(t_1), m_v(t_2), \ldots, m_v(t_B)\big), \qquad t_{start} \leq t_1 \leq t_2 \leq \ldots \leq t_B \leq t \quad (44)$$

• Memory Updater: After computing the messages for each node $v$ involved in the current batch, the memory is updated as $s_v = \mathrm{MEM}(m_v^{agg}(t), s_v)$, where MEM can be any neural architecture. TGN notes that a recurrent architecture is most suitable since the current memory depends on the previous memory.
• Embedding: This last module computes the embeddings of the required nodes using their neighbourhoods and memory states (a minimal sketch combining these modules follows this list). Specifically, TGN proposes the following general formulation for the embedding of a node $v$ at time $t$:
$$h_v(t) = \sum_{u \in N_v(t)} f\big(s_v, s_u, x_{v,u}(t), x_v(t), x_u(t)\big) \quad (45)$$

where $x_v(t)$ denotes the latest attributes of node $v$ and $x_{v,u}(t)$ the features of the latest interaction between nodes $v$ and $u$. TGN remarks that a TGAT layer can be used here, or any simpler approximation depending on the requirements. TGN also proposes the following simple and fast $L$-layer GNN-based formulation:
$$\tilde{h}_v^l(t) = \mathrm{ReLU}\Big(\sum_{u \in N_v(t)} W_1 \big[h_u^{l-1}(t) \,\|\, x_{v,u} \,\|\, \text{TIME-ENCODER}(t - t_{v,u})\big]\Big) \quad (46)$$
$$h_v^l(t) = W_2 \big[\tilde{h}_v^l(t) \,\|\, h_v^{l-1}(t)\big], \qquad h_v(t) = h_v^L(t)$$
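
To tie these modules together, here is a minimal NumPy sketch of one TGN-style batch step: messages are computed for both endpoints of each interaction, mean-aggregated per node, and used to update the node memories. The linear message function and the tanh memory cell are simplified stand-ins for the MLP of eq. (43) and the recurrent MEM module; all dimensions and weights are placeholders.

```python
import numpy as np

d_mem = 8
d_msg = 3 * d_mem + 1     # [s_src || s_dst || x_e] plus the time delta (edge features assumed d_mem-dim)
rng = np.random.default_rng(4)
memory = {}                              # node id -> memory vector s_v
W_msg = rng.normal(size=(d_mem, d_msg)) * 0.1
W_upd = rng.normal(size=(d_mem, 2 * d_mem)) * 0.1

def get_mem(v):
    return memory.setdefault(v, np.zeros(d_mem))   # new nodes start with zero memory

def message(s_src, s_dst, dt, x_e):
    """Message function: a single linear layer standing in for the MLP of eq. (43)."""
    return np.tanh(W_msg @ np.concatenate([s_src, s_dst, x_e, [dt]]))

def update_memory(v, msgs):
    """Aggregate messages by mean (eq. 44) and update the memory with a simple
    recurrent-style cell standing in for MEM (a GRU in the original framework)."""
    m = np.mean(msgs, axis=0)
    memory[v] = np.tanh(W_upd @ np.concatenate([m, get_mem(v)]))

# One batch of interactions (u, v, time delta, edge features).
batch = [(0, 1, 0.5, rng.normal(size=d_mem)), (0, 2, 1.5, rng.normal(size=d_mem))]
inbox = {}
for u, v, dt, x_e in batch:
    inbox.setdefault(u, []).append(message(get_mem(u), get_mem(v), dt, x_e))
    inbox.setdefault(v, []).append(message(get_mem(v), get_mem(u), dt, x_e))
for v, msgs in inbox.items():
    update_memory(v, msgs)
print({v: s[:3].round(3) for v, s in memory.items()})
```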

This architecture can be trained on link prediction or node classification tasks. TGN observes that, during training, there is a target-leakage issue when predicting links in the current batch, since the memories of the nodes would be updated using the events of that same batch. To counter this, TGN maintains a message store that keeps, for each node, the events from the previous batch. Given the current batch, TGN first updates the memories of its nodes using the events in the message store, computes the link prediction loss on the current batch, and uses this loss to update the model parameters. The message store is then updated for the nodes in the current batch by replacing their stored events with the events of the current batch.
CAW [74] argues that currently proposed methods perform well in the inductive setting only when rich node attributes are available. CAW remarks that any temporal representation learning method, even in the absence of node features and identities, when trained on a temporal graph governed by certain dynamic laws, should perform well on an unseen graph governed by the same laws. For example, if a triadic closure process and a feed-forward loop process$^2$ govern link formation in the training graph, then a model trained on such data should achieve similar performance on an unseen graph governed by these two laws, even when node attributes are unavailable. Current methods rely heavily on attributes and thus, in the inductive setting, cannot model these processes in the absence of node identity/attributes. CAW proposes a method that anonymizes node identities by exploiting temporal random walks. Specifically, a random walk $W = ((w_0, t_0), (w_1, t_1), \ldots, (w_k, t_k))$ with $t_0 > t_1 > \ldots > t_k$ and $(w_{i-1}, w_i) \in E\ \forall i$ backtracks in time, i.e., each consecutive edge in the walk has a lower timestamp than the current one. $S_v(t)$ is a collection of $K$ such random walks of length $k$ starting from node $v$ at time $t$, and $g(w, S_v(t))$ is a $(k+1)$-dimensional vector whose $i$-th entry counts the occurrences of node $w$ at position $i$ across all walks $W \in S_v(t)$. Given a target link $(u, v)$ at time $t$, $I_{caw}(w, (S_u(t), S_v(t))) = \{g(w, S_u(t)), g(w, S_v(t))\}$ denotes the relative identity of node $w$ with respect to $S_u(t)$ and $S_v(t)$; we note that $w$ must occur in at least one random walk $W \in S_u(t) \cup S_v(t)$.
CAW notes that such a representation traces the evolution of motifs in random walks without explicitly using node identities or attributes. Finally, each random walk $W \in S_u(t) \cup S_v(t)$ can be represented as:
$$W = \big((I_{caw}(w_0, (S_u(t), S_v(t))), t_0), (I_{caw}(w_1, (S_u(t), S_v(t))), t_1), \ldots, (I_{caw}(w_k, (S_u(t), S_v(t))), t_k)\big) \quad (47)$$
The following formulation is used to encode the target link involving nodes u, v at time t.
$$h_W = \mathrm{RNN}\big(\{(f_1(I_{caw}(w, (S_u(t), S_v(t)))),\, f_2(t - t')) \mid (w, t') \in W\}\big)$$
$$f_1(I_{caw}(w, (S_u(t), S_v(t)))) = \mathrm{MLP}(g(w, S_u(t))) + \mathrm{MLP}(g(w, S_v(t)))$$
$$f_2(\Delta t) = [\cos(\omega_1 \Delta t), \sin(\omega_1 \Delta t), \ldots, \cos(\omega_d \Delta t), \sin(\omega_d \Delta t)] \quad (48)$$
$$h_{S_u(t), S_v(t)} = \frac{1}{2K} \sum_{W \in S_u(t) \cup S_v(t)} h_W$$

$^2$ If node $a$ excites nodes $b$ and $c$, then nodes $b$ and $c$ will also link in the future.

These target link embeddings can be used for future link prediction. Finally, CAW observes that the $f_1$/RNN layers can additionally incorporate node/edge attributes. We note that the CAW model is not applicable to node classification.
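
A small sketch of the anonymization step described above: given the walk sets $S_u(t)$ and $S_v(t)$, it builds the positional count vectors $g(w, S)$ that replace raw node identities. The walks here are hard-coded toy sequences rather than sampled temporal walks.

```python
from collections import defaultdict

def positional_counts(walks, walk_len):
    """g(w, S): for each node w, a (walk_len+1)-dim vector whose i-th entry counts
    how often w appears at position i across the walk set S."""
    g = defaultdict(lambda: [0] * (walk_len + 1))
    for walk in walks:                      # walk = [(node, time), ...], times decreasing
        for pos, (node, _) in enumerate(walk):
            g[node][pos] += 1
    return g

# Toy backward-in-time walks of length 2 (3 positions) rooted at nodes u and v.
S_u = [[("u", 9.0), ("a", 7.0), ("b", 5.0)], [("u", 9.0), ("b", 6.0), ("a", 4.0)]]
S_v = [[("v", 9.0), ("a", 8.0), ("u", 3.0)], [("v", 9.0), ("b", 6.5), ("a", 2.0)]]
g_u, g_v = positional_counts(S_u, 2), positional_counts(S_v, 2)

# Relative identity I_caw(w, (S_u, S_v)) = {g(w, S_u), g(w, S_v)} for a node w.
w = "a"
print(g_u[w], g_v[w])   # [0, 1, 1] and [0, 1, 1]: position histograms, no raw node IDs
```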

3.4 Deep Generative Models for Temporal Graphs

Research on deep generative models for temporal graphs is still in a nascent stage. We briefly overview the non-neural and neural methods below.
The Erdős-Rényi model and the small-world model are the de facto generative models for static graphs, but comparable models are not available for temporal graphs. A few works have attempted to formulate the generation process of a network in order to understand the spread of diseases via contact over temporal networks. [28] proposes an approach that first generates a static network and, for each link, an active time span and a start time; it then samples a sequence of contact times from an inter-event time distribution and overlays this sequence with the active time span of each link to produce an interaction network. [57] introduces an activity-driven generative approach: at each timestamp it starts with $N$ disconnected nodes, each node $i$ becomes active with probability $p_i$ (its activity potential) and connects to $m$ randomly chosen nodes, and the same process is repeated at the next timestamp. [72] proposes a memory-based method for both node and link activation. It stores the time elapsed since the last interaction, $\tau_i$ for every node $i$ and $\tau_{ij}$ for every link between nodes $i$ and $j$. Initially, in an $N$-node network, all nodes are inactive; a node becomes active with probability $b\, f_{node}(\tau_i)$ and initiates a link with an inactive node $j$ with probability $g_{node}(\tau_j)\, g_{link}(\tau_{ij})$, while a link deactivates with probability $z\, f_{link}(\tau_{ij})$. Here $b$ and $z$ are control parameters, and $f$ and $g$ are power-law memory kernels. More recently, DYMOND [85] presented a non-neural, 3-node motif-based
approach for the same problem. They assume that each type of motif follows a time-independent
exponentially distributed arrival rate and learn the parameters to fit the observed rates. TagGen [89] models a temporal graph by converting it into an equivalent static graph: it merges node IDs with their interaction timestamps (temporal edges) and connects only those nodes in the resulting static graph that satisfy a specified temporal neighborhood constraint. It performs random walks on this transformed graph and modifies them using heuristic local operations to generate many synthetic random walks, and it learns a discriminator to differentiate actual from synthetic random walks. The synthetic random walks classified as real by the discriminator are then collected and combined to construct a synthetic static graph, which is finally converted back into a temporal interaction graph by detaching node identities from time. These approaches suffer from the following limitations:
• Weak Temporal Modeling: DYMOND makes two key assumptions: first, the arrival rate of motifs is exponential; and second, the structural configuration of a motif remains the same throughout the observed time horizon. These assumptions do not hold in practice: motifs may evolve with time and could arrive at time-dependent rates. This type of modeling leads to poor fidelity of structural and temporal properties of the generated graph. TagGen, on the other hand, does not model the graph evolution rate explicitly. It assumes that the timestamps in the input graph are discrete random variables, prohibiting TagGen from generating new (unseen in the source graph) timestamps. More critically, the generated graph duplicates many edges from the source graph; our experiments found up to 80% edge overlap between the generated and the source graph. While TagGen generates graphs that exhibit high fidelity of graph structural and temporal interaction properties, unfortunately, it achieves them by generating graphs indistinguishable from the source graph due to its poor modeling of interaction times.
• Poor Scalability to Large Graphs: Both TagGen and DYMOND are limited to graphs where the number of nodes is less than ≈10000 and the number of unique timestamps is below ≈200. However, real graphs are not only of much larger size but also grow with significantly high interaction frequency [53]. In such scenarios, the critical design choice of TagGen to convert the temporal graph into a static graph fails to scale over long time horizons, since the number of nodes in the resulting static graph grows linearly with the number of timestamps (a toy sketch of this conversion follows this list). Further, TagGen also requires the computation of the inverse of an $N' \times N'$ matrix, where $N'$ is the number of nodes in the equivalent static graph, to impute node-node similarity. This adds a quadratic increase in memory consumption and an even higher cost of matrix inversion, thus making TagGen unscalable. On the other hand, DYMOND has $O(N^3 T)$ complexity, where $N$ is the number of nodes and $T$ is the number of timestamps.
• Lack of Inductive Modelling: Inductivity allows the transfer of knowledge to unseen graphs [26]. In the context of generative graph modeling, inductive modeling is required to (1) upscale or downscale the source graph to a generated graph of a different size, and (2) prevent leakage of node identity from the source graph. Both TagGen and DYMOND rely on a one-to-one mapping from source graph node IDs to the generated graph and hence are non-inductive.
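
To illustrate the blow-up mentioned in the scalability point above, here is a toy sketch of the node-timestamp merging that converts a temporal graph into an equivalent static graph; the temporal neighborhood constraint is reduced to a simple time-window check, which is an illustrative simplification rather than TagGen's exact construction.

```python
from itertools import combinations

def to_static(temporal_edges, window):
    """Merge each (node, timestamp) pair into a super-node; connect super-nodes of
    interacting endpoints, plus super-nodes of the same node within the time window."""
    super_nodes, static_edges = set(), set()
    for u, v, t in temporal_edges:
        super_nodes.update({(u, t), (v, t)})
        static_edges.add(((u, t), (v, t)))
    # Connect occurrences of the same node that fall within the temporal window.
    for a, b in combinations(sorted(super_nodes), 2):
        if a[0] == b[0] and abs(a[1] - b[1]) <= window:
            static_edges.add((a, b))
    return super_nodes, static_edges

edges = [(0, 1, 1), (0, 2, 2), (1, 2, 3), (0, 1, 4)]    # 3 original nodes, 4 timestamps
nodes, statics = to_static(edges, window=2)
print(len(nodes), len(statics))   # super-node count grows with timestamps per node
```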

4 Proposed Work: Scalable Generative Modeling for Temporal Interaction Graphs

We now explain the problem statement for which TIGGER [24] has proposed a solution.

4.1 Problem formulation

Problem 1 (Temporal Interaction Graph Generator)

Definition 1 (Temporal Interaction Graph) A temporal interaction graph is defined as G = (V, E)


where V is a set of N nodes and E is a set of M temporal edges {(u, v, t) | u, v ∈ V, t ∈ [0, T ]}. T
is the maximum time of interaction.

Input: A temporal interaction graph G.


Output: Let there be a hidden joint distribution of structural and temporal properties, from which
the given G has been sampled. Our goal is to learn this hidden distribution. Towards that end, we
want to learn a generative model p(G) that maximizes the likelihood of generating G. This generative
model, in turn, can be used to generate new graphs that come from the same distribution as G, but
not G itself.

The above problem formulation is motivated by the one-shot generative modelling paradigm, i.e., it requires only one temporal graph G to learn the hidden joint distribution of structural and temporal interaction graph properties. Defining this joint distribution is hard. In general, these properties are characterized by the inter-interaction time distribution and the evolution of static graph properties such as the degree distribution, power-law exponent, number of connected components, size of the largest connected component, distribution of pairwise shortest distances, closeness centrality, etc. Typically, a generative model optimizes over one of these properties under the assumption that the remaining properties are correlated and hence implicitly modelled. For example, DYMOND uses small structural motifs, and TagGen uses random walks over the transformed static graph.
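
As a concrete illustration of the kind of statistics such a joint distribution covers, the sketch below computes inter-interaction time gaps and the interaction-count degree distribution from a toy edge stream; these are among the properties on which a generated graph would typically be compared with the source graph.

```python
from collections import Counter
import numpy as np

edges = [(0, 1, 0.5), (1, 2, 1.0), (0, 2, 1.7), (2, 3, 2.0), (0, 1, 3.2)]  # (u, v, t)

# Inter-interaction time distribution: gaps between consecutive events, globally.
times = sorted(t for _, _, t in edges)
inter_event = np.diff(times)
print("mean inter-event time:", inter_event.mean())

# Degree distribution when every interaction contributes to its endpoints' degrees.
deg = Counter()
for u, v, _ in edges:
    deg[u] += 1
    deg[v] += 1
print("degree histogram:", Counter(deg.values()))
```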

5 Conclusion
This survey presents a unified view of graph representation learning, temporal graph representation
learning, graph generative modeling, and temporal graph generative modeling. We first introduced
a taxonomy of temporal graphs and their tasks. Further, we explained static graph embedding
techniques and, building over them, explained the temporal graph embedding techniques. Finally, we
surveyed graph generative methods and highlighted the lack of similar research in temporal graph
generative models. Inspired by this, we introduced the problem statement of our recently published
paper TIGGER.

References
[1] Charu C. Aggarwal. Social Network Data Analytics. Springer Publishing Company, Incorpo-
rated, 1st edition, 2011.
[2] Charu C. Aggarwal and Haixun Wang. Managing and Mining Graph Data. Springer Publishing
Company, Incorporated, 2012.
[3] Amr Ahmed, Nino Shervashidze, Shravan Narayanamurthy, Vanja Josifovski, and Alexander J.
Smola. Distributed large-scale natural graph factorization. In Proceedings of the 22nd Interna-
tional Conference on World Wide Web, WWW ’13, page 37–48, New York, NY, USA, 2013.
Association for Computing Machinery.
[4] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews
of modern physics, 74(1):47, 2002.
[5] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial
networks. In Proceedings of the 34th International Conference on Machine Learning - Volume
70, ICML’17, page 214–223. JMLR.org, 2017.
[6] Claudio D. T. Barros, Matheus R. F. Mendonça, Alex B. Vieira, and Artur Ziviani. A survey on
embedding dynamic graphs. ACM Comput. Surv., 55(1), nov 2021.
[7] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding
and clustering. In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural
Information Processing Systems, volume 14. MIT Press, 2002.
[8] Aleksandar Bojchevski, Oleksandr Shchur, Daniel Zügner, and Stephan Günnemann. Netgan:
Generating graphs via random walks. In Proceedings of the 35th International Conference
on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018,
pages 609–618, 2018.
[9] Fenxiao Chen, Yun-Cheng Wang, Bin Wang, and C.-C. Jay Kuo. Graph representation learning:
a survey. APSIPA Transactions on Signal and Information Processing, 9(1), 2020.
[10] Fenxiao Chen, Yun-Cheng Wang, Bin Wang, and C.-C. Jay Kuo. Graph representation learning:
a survey. APSIPA Transactions on Signal and Information Processing, 9:e15, 2020.
[11] Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. Cluster-gcn.
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining, Jul 2019.
[12] Diane J. Cook and Lawrence B. Holder. Mining Graph Data. John Wiley &; Sons, Inc.,
Hoboken, NJ, USA, 2006.
[13] Nicola De Cao and Thomas Kipf. MolGAN: An implicit generative model for small molecular
graphs. ICML 2018 workshop on Theoretical Foundations and Applications of Deep Generative
Models, 2018.
[14] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel,
Alan Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning
molecular fingerprints. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett,
editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates,
Inc., 2015.
[15] Faezeh Faez, Yassaman Ommi, Mahdieh Soleymani Baghshah, and Hamid R. Rabiee. Deep
graph generators: A survey, 2020.
[16] Michael Färber, Frederic Bartscherer, Carsten Menne, and Achim Rettinger. Linked data quality
of DBpedia, Freebase, Opencyc, Wikidata, and YAGO. Semantic Web, 9(1):77–129, 2018.
[17] Alex Fout, Jonathon Byrd, Basir Shariat, and Asa Ben-Hur. Protein interface prediction using
graph convolutional networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems,
volume 30. Curran Associates, Inc., 2017.
[18] Brendan J. Frey and Delbert Dueck. Clustering by passing messages between data points.
Science, 315(5814):972–976, 2007.
[19] Nikhil Goyal, Harsh Vardhan Jain, and Sayan Ranu. Graphgen: A scalable approach to domain-
agnostic labeled graph generation. In Proceedings of The Web Conference 2020, WWW ’20,
page 1253–1263, New York, NY, USA, 2020. Association for Computing Machinery.

[20] Palash Goyal, Nitin Kamra, Xinran He, and Yan Liu. Dyngem: Deep embedding method for
dynamic graphs. ArXiv, abs/1805.11273, 2018.
[21] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander
Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(25):723–773,
2012.
[22] Aditya Grover and Jure Leskovec. Node2vec: Scalable feature learning for networks. In
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, KDD ’16, page 855–864, New York, NY, USA, 2016. Association for Computing
Machinery.
[23] Xiaojie Guo and Liang Zhao. A systematic survey on deep generative models for graph
generation, 2020.
[24] Shubham Gupta, Sahil Manchanda, Srikanta Bedathur, and Sayan Ranu. T IGGER: Scalable
generative modelling for temporal interaction graphs. Proceedings of the AAAI Conference on
Artificial Intelligence, 36(6):6819–6828, Jun. 2022.
[25] William L. Hamilton, Jure Leskovec, and Dan Jurafsky. Diachronic word embeddings reveal
statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), pages 1489–1501, Berlin, Germany,
August 2016. Association for Computational Linguistics.
[26] William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large
graphs. In Proceedings of the 31st International Conference on Neural Information Processing
Systems, NIPS’17, page 1025–1035, Red Hook, NY, USA, 2017. Curran Associates Inc.
[27] Anne-Sophie Himmel, Hendrik Molter, Rolf Niedermeier, and Manuel Sorge. Enumerating
maximal cliques in temporal graphs. CoRR, abs/1605.03871, 2016.
[28] Petter Holme. Epidemiologically optimal static networks from temporal network data. PLOS
Computational Biology, 9(7):1–10, 07 2013.
[29] Petter Holme. Modern temporal network theory: a colloquium. The European Physical Journal
B, 88(9), Sep 2015.
[30] Silu Huang, Ada Wai-Chee Fu, and Ruifeng Liu. Minimum spanning trees in temporal graphs.
In Proceedings of the 2015 ACM SIGMOD International Conference on Management of
Data, SIGMOD ’15, page 419–430, New York, NY, USA, 2015. Association for Computing
Machinery.
[31] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax.
In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April
24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
[32] Micha Karoński and Andrzej Ruciński. The Origins of the Theory of Random Graphs, pages
311–336. Springer Berlin Heidelberg, Berlin, Heidelberg, 1997.
[33] Seyed Mehran Kazemi, Rishab Goel, Kshitij Jain, Ivan Kobyzev, Akshay Sethi, Peter Forsyth,
and Pascal Poupart. Representation learning for dynamic graphs: A survey. J. Mach. Learn.
Res., 21(70):1–73, 2020.
[34] Shima Khoshraftar and Aijun An. A survey on graph representation learning methods, 2022.
[35] Shima Khoshraftar and Aijun An. A survey on graph representation learning methods, 2022.
[36] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional
networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon,
France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
[37] Ioannis Konstas, Vassilios Stathopoulos, and Joemon M. Jose. On social networks and collabo-
rative recommendation. In Proceedings of the 32nd International ACM SIGIR Conference on
Research and Development in Information Retrieval, SIGIR ’09, page 195–202, New York, NY,
USA, 2009. Association for Computing Machinery.
[38] Le kui Zhou, Yang Yang, Xiang Ren, Fei Wu, and Yueting Zhuang. Dynamic network embedding
by modeling triadic closure process. In AAAI, 2018.
[39] Srijan Kumar, Xikun Zhang, and Jure Leskovec. Predicting dynamic embedding trajectory
in temporal interaction networks. In Proceedings of the 25th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining, pages 1269–1278. ACM, 2019.

[40] AA Leman and B Weisfeiler. A reduction of a graph to a canonical form and an algebra arising
during this reduction. Nauchno-Technicheskaya Informatsiya, 2(9):12–16, 1968.
[41] Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter W. Battaglia. Learning deep
generative models of graphs. ArXiv, abs/1803.03324, 2018.
[42] Renjie Liao, Yujia Li, Yang Song, Shenlong Wang, William L. Hamilton, David Duvenaud,
Raquel Urtasun, and Richard S. Zemel. Efficient graph generation with graph recurrent attention
networks. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-
Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing
Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS
2019, December 8-14, 2019, Vancouver, BC, Canada, pages 4257–4267, 2019.
[43] Renjie Liao, Yujia Li, Yang Song, Shenlong Wang, Charlie Nash, William L. Hamilton, David
Duvenaud, Raquel Urtasun, and Richard Zemel. Efficient graph generation with graph recurrent
attention networks. In NeurIPS, 2019.
[44] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval
Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning.
In Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Repre-
sentations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings,
2016.
[45] Zhining Liu, Dawei Zhou, Yada Zhu, Jinjie Gu, and Jingrui He. Towards fine-grained temporal
network representation via time-reinforced random walk. In AAAI, 2020.
[46] Yuanfu Lu, Xiao Wang, Chuan Shi, Philip S. Yu, and Yanfang Ye. Temporal network embedding
with micro- and macro-dynamics. In Proceedings of the 28th ACM International Conference
on Information and Knowledge Management, CIKM ’19, page 469–478, New York, NY, USA,
2019. Association for Computing Machinery.
[47] Othon Michail and Paul G. Spirakis. Traveling salesman problems in temporal graphs. Theoret-
ical Computer Science, 634:1–23, 2016.
[48] Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Efficient estimation of word
representations in vector space. In ICLR, 2013.
[49] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed rep-
resentations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou,
M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information
Processing Systems, volume 26. Curran Associates, Inc., 2013.
[50] Giang Hoang Nguyen, John Boaz Lee, Ryan A. Rossi, Nesreen K. Ahmed, Eunyee Koh, and
Sungchul Kim. Continuous-time dynamic network embeddings. In Companion Proceedings
of the The Web Conference 2018, WWW ’18, page 969–976, Republic and Canton of Geneva,
CHE, 2018. International World Wide Web Conferences Steering Committee.
[51] Caleb C. Noble and Diane J. Cook. Graph-based anomaly detection. In Proceedings of the
Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
KDD ’03, page 631–636, New York, NY, USA, 2003. Association for Computing Machinery.
[52] Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. Asymmetric transitivity pre-
serving graph embedding. In Proceedings of the 22nd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, KDD ’16, page 1105–1114, New York, NY, USA,
2016. Association for Computing Machinery.
[53] Ashwin Paranjape, Austin R. Benson, and Jure Leskovec. Motifs in temporal networks. Pro-
ceedings of the Tenth ACM International Conference on Web Search and Data Mining, Feb
2017.
[54] Aldo Pareja, Giacomo Domeniconi, Jie Chen, Tengfei Ma, Toyotaro Suzumura, Hiroki Kaneza-
shi, Tim Kaler, Tao B. Schardl, and Charles E. Leiserson. EvolveGCN: Evolving graph convo-
lutional networks for dynamic graphs. In Proceedings of the Thirty-Fourth AAAI Conference on
Artificial Intelligence, 2020.
[55] Yun Peng, Byron Choi, and Jianliang Xu. Graph Learning for Combinatorial Optimization: A
Survey of State-of-the-Art. Data Sci. Eng., 6(2):119–141, Jun 2021.

[56] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social repre-
sentations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, KDD ’14, pages 701–710, New York, NY, USA, 2014. ACM.
[57] N. Perra, B. Gonçalves, and R. Pastor-Satorras. Activity driven modeling of time varying
networks. Sci Rep, 2:469, 2012.
[58] Jakob Gulddahl Rasmussen. Lecture notes: Temporal point processes and the conditional
intensity function. arXiv: Methodology, 2018.
[59] Marian-Andrei Rizoiu, Young Lee, Swapnil Mishra, and Lexing Xie. A tutorial on hawkes
processes for events in social media. CoRR, abs/1708.06401, 2017.
[60] Emanuele Rossi, Ben Chamberlain, Fabrizio Frasca, Davide Eynard, Federico Monti, and
Michael Bronstein. Temporal graph networks for deep learning on dynamic graphs. In ICML
2020 Workshop on Graph Representation Learning, 2020.
[61] Aravind Sankar, Yanhong Wu, Liang Gou, Wei Zhang, and Hao Yang. Dysat: Deep neural
representation learning on dynamic graphs via self-attention networks. In Proceedings of the
13th International Conference on Web Search and Data Mining, WSDM ’20, page 519–527,
New York, NY, USA, 2020. Association for Computing Machinery.
[62] Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and
Max Welling. Modeling relational data with graph convolutional networks. In Aldo Gangemi,
Roberto Navigli, Maria-Esther Vidal, Pascal Hitzler, Raphaël Troncy, Laura Hollink, Anna
Tordai, and Mehwish Alam, editors, The Semantic Web, pages 593–607, Cham, 2018. Springer
International Publishing.
[63] Saeedreza Shehnepoor, Roberto Togneri, Wei Liu, and Mohammed Bennamoun. Spatio-
temporal graph representation learning for fraudster group detection, 2022.
[64] Uriel Singer, Ido Guy, and Kira Radinsky. Node embedding over temporal graphs. In IJCAI,
2019.
[65] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of
initialization and momentum in deep learning. In Sanjoy Dasgupta and David McAllester,
editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of
Proceedings of Machine Learning Research, pages 1139–1147, Atlanta, Georgia, USA, 17–19
Jun 2013. PMLR.
[66] Rakshit Trivedi, Hanjun Dai, Yichen Wang, and Le Song. Know-evolve: Deep temporal
reasoning for dynamic knowledge graphs. In Proceedings of the 34th International Conference
on Machine Learning - Volume 70, ICML’17, page 3462–3471. JMLR.org, 2017.
[67] Rakshit Trivedi, Mehrdad Farajtabar, Prasenjeet Biswal, and Hongyuan Zha. Dyrep: Learning
representations over dynamic graphs. In 7th International Conference on Learning Representa-
tions, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
[68] Yi-Hsuan Tsai and Ming-Hsuan Yang. Locality preserving hashing. In 2014 IEEE International
Conference on Image Processing (ICIP), pages 2988–2992, Oct 2014.
[69] Tomasz Tylenda, Ralitsa Angelova, and Srikanta Bedathur. Towards time-aware link prediction
in evolving social networks. In Proceedings of the 3rd Workshop on Social Network Mining and
Analysis, SNA-KDD ’09, New York, NY, USA, 2009. Association for Computing Machinery.
[70] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg,
S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural
Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[71] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua
Bengio. Graph Attention Networks. International Conference on Learning Representations,
2018. accepted as poster.
[72] Christian L. Vestergaard, Mathieu Génois, and Alain Barrat. How memory generates heteroge-
neous dynamics in temporal networks. Physical Review E, 90(4), Oct 2014.
[73] Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In Proceedings
of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
KDD ’16, page 1225–1234, New York, NY, USA, 2016. Association for Computing Machinery.

[74] Yanbang Wang, Yen-Yu Chang, Yunyu Liu, Jure Leskovec, and Pan Li. Inductive representation
learning in temporal networks via causal anonymous walks. CoRR, abs/2101.05974, 2021.
[75] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of ’small-world’ networks. Nature, 393(6684):440-442, 1998.
[76] Huanhuan Wu, James Cheng, Yi Lu, Yiping Ke, Yuzhen Huang, Da Yan, and Hejun Wu. Core
decomposition in large temporal graphs. In 2015 IEEE International Conference on Big Data
(Big Data), pages 649–658, 2015.
[77] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A
comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and
Learning Systems, 32(1):4–24, January 2021.
[78] Da Xu, Chuanwei Ruan, Evren Korpeoglu, Sushant Kumar, and Kannan Achan. Self-attention
with functional time representation learning. In Advances in Neural Information Processing
Systems, pages 15889–15899, 2019.
[79] Da Xu, Chuanwei Ruan, Evren Körpeoglu, Sushant Kumar, and Kannan Achan. A temporal
kernel approach for deep learning with continuous-time information. In 9th International
Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.
OpenReview.net, 2021.
[80] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural
networks? In 7th International Conference on Learning Representations, ICLR 2019, New
Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
[81] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural
networks? In 7th International Conference on Learning Representations, ICLR 2019, New
Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
[82] Jiaxuan You, Rex Ying, and Jure Leskovec. Position-aware graph neural networks. In Kamalika
Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference
on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97
of Proceedings of Machine Learning Research, pages 7134–7143. PMLR, 2019.
[83] Jiaxuan You, Rex Ying, Xiang Ren, William L. Hamilton, and Jure Leskovec. Graphrnn:
Generating realistic graphs with deep auto-regressive models. In Jennifer G. Dy and Andreas
Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML
2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of
Machine Learning Research, pages 5694–5703. PMLR, 2018.
[84] Jiaxuan You, Rex Ying, Xiang Ren, William L. Hamilton, and Jure Leskovec. Graphrnn:
Generating realistic graphs with deep auto-regressive models. In ICML, 2018.
[85] Giselle Zeno, Timothy La Fond, and Jennifer Neville. Dymond: Dynamic motif-nodes network
generative model. In Proceedings of the Web Conference 2021, WWW ’21, page 718–729, New
York, NY, USA, 2021. Association for Computing Machinery.
[86] Yao Zhang, Yun Xiong, Xiangnan Kong, and Yangyong Zhu. Learning node embeddings
in interaction graphs. In Proceedings of the 2017 ACM on Conference on Information and
Knowledge Management, CIKM ’17, page 397–406, New York, NY, USA, 2017. Association
for Computing Machinery.
[87] Zhen Zhang, Jiajun Bu, Martin Ester, Jianfeng Zhang, Chengwei Yao, Zhao Li, and Can
Wang. Learning temporal interaction graph embedding via coupled memory networks. In
Proceedings of The Web Conference 2020, WWW ’20, page 3049–3055, New York, NY, USA,
2020. Association for Computing Machinery.
[88] Ling Zhao, Yujiao Song, Chao Zhang, Yu Liu, Pu Wang, Tao Lin, Min Deng, and Haifeng Li.
T-gcn: A temporal graph convolutional network for traffic prediction. IEEE Transactions on
Intelligent Transportation Systems, 21(9):3848–3858, 2020.
[89] Dawei Zhou, Lecheng Zheng, Jiawei Han, and Jingrui He. A data-driven graph generative model
for temporal interaction networks. In Proceedings of the 26th ACM SIGKDD International
Conference on Knowledge Discovery; Data Mining, KDD ’20, page 401–411, New York, NY,
USA, 2020. Association for Computing Machinery.
[90] Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra.
Beyond homophily in graph neural networks: Current limitations and effective designs. arXiv:
Learning, 2020.

[91] Yuan Zuo, Guannan Liu, Hao Lin, Jia Guo, Xiaoqian Hu, and Junjie Wu. Embedding temporal
network via neighborhood formation. In Proceedings of the 24th ACM SIGKDD International
Conference on Knowledge Discovery; Data Mining, KDD ’18, page 2857–2866, New York,
NY, USA, 2018. Association for Computing Machinery.

