
Graph-based Semi-supervised Learning: A Comprehensive Review

Zixing Song, Xiangli Yang, Zenglin Xu, Senior Member, IEEE, Irwin King, Fellow, IEEE

Abstract—Semi-supervised learning (SSL) has tremendous value in practice due to its ability to utilize both labeled data and unlabelled data. An important class of SSL methods is to naturally represent data as graphs such that the label information of unlabelled samples can be inferred from the graphs, which corresponds to graph-based semi-supervised learning (GSSL) methods. GSSL methods have demonstrated their advantages in various domains due to their uniqueness of structure, the universality of applications, and their scalability to large scale data. Focusing on this class of methods, this work aims to provide both researchers and practitioners with a solid and systematic understanding of relevant advances as well as the underlying connections among them. This makes our paper distinct from recent surveys that cover an overall picture of SSL methods while neglecting fundamental understanding of GSSL methods. In particular, a major contribution of this paper lies in a new generalized taxonomy for GSSL, including graph regularization and graph embedding methods, with the most up-to-date references and useful resources such as codes, datasets, and applications. Furthermore, we present several potential research directions as future work with insights into this rapidly growing field.

Index Terms—Semi-supervised learning, graph-based semi-supervised learning, graph embedding, graph representation learning.

I. INTRODUCTION

SEMI-SUPERVISED learning (SSL) has achieved great successes in various real-world applications where only a few expensive labeled samples are available and abundant unlabeled samples are easily obtained. Moreover, as a typical class of SSL solutions, graph-based SSL (GSSL) is very promising because the graph structure can be naturally used as a reflection of the significant manifold assumption in SSL. More specifically, GSSL methods start with constructing a graph where the nodes represent all the samples and the weighted edges reflect the similarity between a pair of nodes. This way of graph construction implies that nodes connected by edges with large weights tend to have the same label, which corresponds to the manifold assumption. The manifold assumption suggests that samples located near each other on a low-dimensional manifold should share similar labels. Consequently, the expressive power of graph structure under the manifold assumption contributes to the success of GSSL methods.

In the graph structure commonly used for SSL, each sample is represented by a node, and these nodes are connected by weighted edges that measure the similarity between them. Therefore, the main procedure of GSSL is to create a suitable graph along which the given labels can be easily propagated. To be precise, this goal can be achieved by the following two main steps.

Step 1. Graph construction. A similarity graph is constructed based on all the given data, including both the labeled and unlabeled samples. During this step, the biggest challenge is how to make the relationship between the original samples well represented.

Step 2. Label inference. The label inference is performed so that the label information can be propagated from the labeled samples to the unlabeled ones by incorporating the structure information from the graph constructed in the previous step.

Compared with other SSL methods, which do not involve any graph structure, GSSL has some advantages that are worth noticing. In the following, we list several advantages of graph-based SSL methods.

• Universality. Many common data sets of current interest are represented by graphs, like the World Wide Web (WWW), citation networks, and social networks.
• Convexity. Since an undirected graph is usually involved in the graph construction step, the symmetry of the undirected graph makes it easier to formulate the learning problem as a convex optimization problem, which can be solved with various existing techniques [1].
• Scalability. Many of the GSSL methods are meticulously designed so that the time complexity is linear in the total number of samples. As a result, they are often easily parallelized to handle large scale datasets with ease.

Related work. Several SSL survey papers [2][3] often fail to cover enough methods of GSSL, neglecting its significant role in SSL. Zhu et al. [2] conduct a comprehensive review of classic methods involved in SSL, and GSSL is not explored in detail. Similar earlier work like [3] by Pise et al. also tries to present a whole picture of SSL methods without covering enough work in GSSL. Recent literature reviews [4], [5], and [6] all follow in the footsteps of [2] and [3] by adding more recent research output. However, they do not cover the recent development in GSSL methods. Instead, our work solely focuses on GSSL and combines both earlier studies and recent advances.

Z. Song and I. King are with the Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China (Email: [email protected], [email protected]).
X. Yang is with the SMILE Lab, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China (E-mail: [email protected]).
Z. Xu is with the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China, and also with Peng Cheng Lab, Shenzhen, China (Email: [email protected]).

Figure 1: Taxonomy for graph-based semi-supervised learning (Problem Setting, Sec. II: transductive, inductive; Solution, Sec. III-VII: Step 1, Graph Construction (Sec. III) via unsupervised or supervised methods, and Step 2, Label Inference (Sec. IV-VII) via Graph Regularization (Sec. IV) or Graph Embedding (Sec. V-VII), the latter split into Graph Shallow Embedding (Sec. VI) and Graph Deep Embedding (Sec. VII); Applications, Sec. VIII: datasets, implementations, domains)

The most relevant work to ours is [7] by Chong et al., and it is considered the most up-to-date survey paper on GSSL. However, there are several noticeable drawbacks of this work that are worth mentioning here. First, [7] reviews work from the perspective of transductive, inductive, and scalability learning. This taxonomy fails to show the context of development and thus does not reveal the relationship between different methods or models. As a result, we provide a novel taxonomy from the perspective of the two main steps in GSSL: graph construction and label inference. Secondly, some of the reviewed methods in [7] are not graph-based models, but rather semi-supervised convolutional neural network (CNN) models, as shown in Section 3.4 of the original paper [7]. Most importantly, [7] fails to develop a framework to generalize the methods or models reviewed. This paper fills all these gaps with several noticeable contributions.

Contributions. To sum up, this paper presents an extensive and systematic review of GSSL with the following contributions.
1) Comprehensive review. We provide the most comprehensive and up-to-date overview of GSSL methods. For every approach reviewed in this paper, we present a detailed description with key equations, clarify the context of development behind the algorithms, make the necessary comparisons, and summarize the corresponding strengths or limitations.
2) New taxonomy. We propose a new taxonomy for graph-based semi-supervised learning with a more generalized framework, as shown in Figure 1. We divide the GSSL process into two steps: the first is to construct a similarity graph, and the second is to perform label inference based on this graph. The latter step is much more challenging and is also the main focus of this paper. Label inference methods are then categorized into two main groups: graph regularization methods and graph embedding methods. For the former group, a generalized framework of regularizers from the perspective of the loss function is presented. For the latter group, we provide a new unified representation for graph embedding methods in SSL with the help of the encoder-decoder framework.
3) Abundant resources. We collect abundant resources related to GSSL and build a useful, relevant code base, including the open-source codes for all the reviewed methods or models, some popular benchmark data sets, and pointers to representative practical applications in different areas. This survey can be regarded as a hands-on guide for researchers interested in understanding existing GSSL approaches, using the codes for experiments, and even developing new ideas for GSSL.
4) Future directions. We propose some open problems and point out directions for future research in terms of dynamicity, scalability, noise-resilience, and attack-robustness.

Organization of the paper. The rest of this survey is organized as follows. In Section II, we introduce the background knowledge related to GSSL; the necessary notations are listed, and the relevant terms are properly defined. In Section III, we provide a detailed review of graph construction, the first step of GSSL. From Section IV to Section VII, label inference, the second step of GSSL, is covered, which is the main focus of this paper. Furthermore, a new taxonomy is provided, as shown in Figure 1. Graph regularization methods are reviewed in Section IV, while graph embedding methods are reviewed from Section V to Section VII. Section V discusses the generalized encoder-decoder framework for graph embedding. To provide a more detailed overview of it, we further split it into shallow embedding and deep embedding and review them in Section VI and Section VII respectively. Moreover, in Section IX, four open problems are briefly reviewed as future research directions. Finally, applications of GSSL are extensively explored in the Appendix, along with a list of common datasets and a code base for some popular models.

II. BACKGROUND AND DEFINITION

As mentioned earlier, a majority of GSSL algorithms require solving the following two sub-problems:
• Constructing a graph over the input data (if one is not already available).
• Inferring the labels of the unlabeled samples in the input or estimating the model parameters.

GSSL methods run on a specifically designed graph in which training samples are represented as nodes, and each node pair is linked by a weight to denote the underlying similarity. Some of the nodes are labeled, while others are not.

Figure 2: Comparison between transductive and inductive setting in GSSL. For transductive setting, only the labels of unlabeled nodes in the training dataset need to be inferred, while for inductive setting, the trained model M can predict the label of any unseen node.

As a result, a graph has to be built to make these problems amenable to the following GSSL approaches. However, it is worth mentioning that most of the graph-based algorithms are designed for the label inference step. As a result, in this paper, we mainly focus on the label inference techniques used in GSSL, and we only discuss graph construction in Section III.

Once the graph is constructed, the next step in solving an SSL problem using graph-based methods is to inject labeled data on a subset of the nodes in the graph, followed by inferring the labels for the unlabeled nodes. While a majority of the graph-based inference approaches are transductive, there are some inductive GSSL approaches as well.

Following the framework for SSL, the loss function of GSSL approaches can also be generalized within that of SSL, which contains three parts as shown in Eq. (1),

L(f) = Ls(f, Dl) + λ Lu(f, Du) + µ Lr(f, D),  (1)

where Ls(f, Dl) is the supervised loss on the labeled data, Lu(f, Du) is the unsupervised loss on the unlabeled data, and Lr(f, D) is the regularization loss. Additionally, λ and µ are hyperparameters to balance these terms. However, for GSSL, the unsupervised loss is often absorbed into the regularization loss since no label information is used in the regularization loss term. Therefore, the loss function for GSSL can be generalized as shown in Eq. (2),

L(f) = Ls(f, Dl) + µ Lr(f, D).  (2)

In this paper, more attention will be paid to how to do label inference when the similarity graph has already been constructed from the given datasets under the setting of semi-supervised learning. Two main groups of GSSL are reviewed following Eq. (2): graph regularization and graph embedding. More details will be provided in Section IV to Section VII.
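To make Eq. (2) concrete, the sketch below evaluates the generalized GSSL objective for a candidate prediction vector, instantiating the regularization loss with the quadratic Laplacian smoothness term that most methods in Section IV use; this particular choice of Lr is an illustrative assumption, not the only possibility.

```python
import numpy as np

def gssl_loss(f, y_l, labeled_idx, L, mu=1.0):
    """Generalized GSSL objective of Eq. (2): supervised loss + mu * regularization loss.

    f           : (n,) predicted values for all n nodes (labeled + unlabeled)
    y_l         : (n_l,) given labels of the labeled nodes
    labeled_idx : indices of the labeled nodes within f
    L           : (n, n) graph Laplacian, used here as a quadratic smoothness regularizer
    """
    supervised = np.sum((f[labeled_idx] - y_l) ** 2)   # Ls(f, Dl)
    regularizer = f @ L @ f                            # Lr(f, D) = f^T L f (illustrative choice)
    return supervised + mu * regularizer
```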

A. Related concepts

1) Supervised learning and unsupervised learning: Supervised learning and unsupervised learning can be viewed as two extremes of SSL, because all the training samples are well labeled in supervised learning settings, while unsupervised learning can only have access to unlabeled data. Semi-supervised learning aims to introduce cheap unlabeled samples to enhance the model's performance when only a few costly labeled samples are available. Therefore, the problem setting of SSL is a perfect match for many real-world applications.

2) Other semi-supervised learning methods: Throughout the development of SSL, a great number of successful algorithms or models have emerged in roughly three phases. The first phase is the early stage of SSL before 2000, where classic machine learning algorithms are investigated and improved with unlabeled data. Typical examples are S3VM and Co-training. The second phase is the mature stage of SSL between 2000 and 2015, in which many methods flourished, such as mixture models, pseudo labels, self-training, manifold learning, and GSSL. The third phase is after 2015, with the advance of deep learning and especially Graph Neural Networks (GNN). Since GSSL methods witness all these three stages, reviewing their development and recent progress is necessary.

3) Transductive and inductive settings: Like other SSL methods, GSSL algorithms can be divided into two categories based on whether they predict labels for data samples outside the training data.

Definition II.1 (Transductive setting). Given a training set consisting of labeled and unlabeled data D = {{xi, yi}_{i=1}^{nl}, {xi}_{i=1}^{nu}}, the goal of a transductive algorithm is to learn a function f : X → Y so that f is only able to predict the labels for the unlabeled data {xi}_{i=1}^{nu}.

Definition II.2 (Inductive setting). Given a training set consisting of labeled and unlabeled data D = {{xi, yi}_{i=1}^{nl}, {xi}_{i=1}^{nu}}, the goal of an inductive algorithm is to learn a function f : X → Y so that f is able to predict the output y of any input x ∈ X.

While most of the GSSL approaches are transductive, there are a few inductive GSSL approaches. In most scenarios, transductive SSL methods outperform inductive ones in terms of prediction accuracy, while they often suffer from higher training costs compared to inductive ones, especially in the context of large scale incremental learning. Figure 2 illustrates the difference between the transductive and inductive settings in GSSL.

B. Notations and Definitions

In this section, as a matter of convenience, we first define some useful and common terms used in GSSL, along with relevant notations. Unless otherwise specified, the notations used in this survey paper are listed in Table I. After the list of notations, the minimal set of definitions required to understand this paper is given.

Table I: Notations used in the paper
  G : A graph
  V : The set of nodes (vertices) in a graph
  E : The set of edges in a graph
  i, v : Node i, node v
  (i, j) : The edge between nodes i and j
  W : The weight matrix of a graph
  Wij : The weight associated with edge (i, j)
  A : The adjacency matrix of a graph
  Aij : The entry in the ith row and jth column of the adjacency matrix A
  D : The degree matrix of a graph
  Dii : The degree of node i
  X : The attribute matrix of a graph
  xi : The attribute vector for node i
  N(i) : The neighborhood of a node i
  L : Unnormalized graph Laplacian matrix
  L̃ : Normalized graph Laplacian matrix
  P¹_i : First-order proximity of node i
  P²_{i,j} : Second-order proximity between nodes i and j
  nl : The number of labeled samples
  nu : The number of unlabeled samples
  Dl = {xi, yi}_{i=1}^{nl} : Labeled samples
  Du = {xi}_{i=1}^{nu} : Unlabeled samples
  S : Similarity matrix of a graph
  S[u, v] : Similarity measurement between nodes u and v
  Z : Embedding matrix
  zi : Embedding for node i
  h_v^(k) : Hidden embedding for node v in the kth layer
  m_{N(v)}^(k) : Message aggregated from node v's neighborhood in the kth layer

Definition II.3 (Graph). A graph is an ordered pair G = (V, E) where V = {1, . . . , |V|} is the set of nodes (or vertices) and E ⊆ {V × V} is the set of edges.

GSSL algorithms start by representing the data as a graph. We assume that the node i ∈ V represents the input sample xi. We will be using both i and xi to refer to the ith node in the graph.

Definition II.4 (Directed and Undirected Graphs). A graph whose edges have no starting or ending nodes is called an undirected graph. In the case of a directed graph, the edges have a direction associated with them.

Definition II.5 (Weighted Graph). A graph G is weighted if there is a number or weight associated with every edge in the graph. Given an edge (i, j), where i, j ∈ V, Wij is used to denote the weight associated with the edge (i, j), and these weights form the whole weight matrix W ∈ R^{n×n}. In most cases, we assume Wij ≥ 0, and Wij can be 0 if and only if there is no edge between the node pair (i, j).

Definition II.6 (Degree of a Node). The degree Dii of the node i is given by Dii = Σ_j Wij. Moreover, in the case of an unweighted graph, a node's degree is equal to its number of neighbors.

Definition II.7 (Neighborhood of a Node). The neighborhood of a node v in a graph G is denoted as N(v) to indicate the subgraph of G induced by all nodes adjacent to v.

Definition II.8 (Adjacency Matrix). The adjacency matrix is a matrix with a 1 or 0 in each position (i, j) based on whether node i and node j are adjacent or not. If the given graph is undirected, its corresponding adjacency matrix is a symmetric matrix.

Definition II.9 (Graph Laplacian Matrix). The unnormalized graph Laplacian matrix is given by L = D − W. Here D ∈ R^{n×n} is a diagonal matrix such that Dii is the degree of node i and Dij = 0, ∀ i ≠ j. It is easy to prove that L is a positive semi-definite matrix. The normalized graph Laplacian matrix is given by L̃ = D^{−1/2} L D^{−1/2}, where L is the unnormalized graph Laplacian matrix.
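As a small illustration of Definitions II.6-II.9, the following sketch (assuming NumPy and a symmetric, non-negative weight matrix W as constructed in Section III) builds the degree matrix, the unnormalized Laplacian L = D − W, and the symmetrically normalized Laplacian.

```python
import numpy as np

def graph_laplacians(W):
    """Return (D, L, L_norm) for a symmetric, non-negative weight matrix W."""
    d = W.sum(axis=1)                        # Definition II.6: degrees Dii = sum_j Wij
    D = np.diag(d)
    L = D - W                                # Definition II.9: unnormalized Laplacian
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_norm = d_inv_sqrt @ L @ d_inv_sqrt     # normalized Laplacian D^{-1/2} L D^{-1/2}
    return D, L, L_norm
```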

III. GRAPH CONSTRUCTION

To perform any GSSL method, a graph must be constructed first, where nodes represent data samples, some of which are labeled while others are not, and edges are associated with a certain weight to reflect each node pair's similarity. In some domains, such as citation networks, there is already an implicit underlying graph. Graph-based methods are thus a natural fit for SSL problems in these domains. For most other machine learning tasks, however, the data instances are not conveniently represented as a graph structure, and as a result, a graph has to be built to make these problems appropriate for GSSL approaches. The graph construction techniques are involved in the first step mentioned before.

The goal of graph construction is to discover a graph G = (V, E, W) where V is the set of nodes, E ⊆ V × V are the edges, and W ∈ R^{n×n} are the associated weights on the edges. Each node in the graph represents an input sample, and thus the number of nodes in the graph is |V| = n. As the nodes are fixed (assuming that D is fixed, which is often the case), the task of graph construction involves estimating E and W. The following three assumptions often hold.

Assumption 1. The graph is undirected, so W is symmetric, and all edge weights are non-negative, Wij ≥ 0, ∀ i ≠ j.
Assumption 2. Wij = 0 implies the absence of an edge between nodes i and j.
Assumption 3. There are no self-loops, Wii = 0, ∀ 1 ≤ i ≤ n.

These three assumptions simplify the problem by adding these constraints and lay the foundations for the following unsupervised and supervised methods.

A. Unsupervised methods

Unsupervised graph construction techniques ignore all the given label information of the training data during the construction process. Among all the unsupervised methods for graph construction, the K-nearest neighbor (KNN) graph and b-Matching methods, along with their extensions, are the most popular ones.

1) KNN-based approaches: For KNN-based graph construction approaches [8], every node is connected, based on a pre-configured distance metric, to its k nearest neighbors in the resulting graph. Moreover, KNN-based methods link the k nearest neighbors greedily, generating graphs in which some nodes' degrees are larger than k, which leads to irregular graphs. Note that a graph is said to be regular if every node has the same degree.

The KNN-based method needs a proximity function sim(xi, xj) or distance metric that can quantify the resemblance or disparity between every node pair in the training data. The weight value associated with the edge is given by Eq. (3),

Wij = sim(xi, xj) if i ∈ N(j), and Wij = 0 otherwise.  (3)

In the ε-neighborhood-based graph construction method [8], if the distance between a node pair is smaller than ε, where ε ≥ 0 is a predefined constant, a connecting edge is formed between them. KNN methods enjoy certain favorable properties when compared with ε-neighborhood-based graphs. Specifically, in ε-neighborhood-based graphs, a misleading choice of the parameter ε could lead to generating disconnected graphs [9]. Moreover, KNN-based graphs outperform ε-neighborhood-based ones with better scalability.

In Ozaki et al. [10], it is contended that a hub or a center situated in the sample space can result in a corresponding hub in the classic KNN graphs. This may downgrade the prediction performance on several classification tasks. In order to handle this issue, [10] proposes a new way of constructing a graph by using mutual KNN in combination with a maximum spanning tree (M-KNN). In parallel with this work, Vega et al. [11] also introduce the sequential KNN (S-KNN) to produce graphs under a new relaxed condition in which the resulting graph contains no hubs but is not necessarily regular.

2) b-Matching: As discussed above, KNN graphs, contrary to their name, often lead to graphs where different nodes have different degrees. Jebara et al. [9] propose b-Matching, which guarantees that every node in the resulting graph has exactly b neighbors. Using b-Matching for graph construction involves two steps: (a) graph sparsification and (b) edge re-weighting.

In graph sparsification, edges are removed by estimating a matrix P ∈ {0, 1}^{n×n}. For the entries in P, Pij = 1 signifies an existing edge between a node pair in the generated graph, while Pij = 0 suggests the lack of an edge. b-Matching provides a solution by formulating an optimization problem with the objective,

min_{P ∈ {0,1}^{n×n}} Σ_{i,j} Pij Δij   s.t.   Σ_j Pij = b, Pii = 0, Pij = Pji, ∀ 1 ≤ i, j ≤ n.  (4)

Here, Δ ∈ R+^{n×n} is a symmetric distance matrix.

Once a selection of edges is made in matrix P from the previous step, the next aim is to determine the chosen edges' associated weights to produce the estimated weight matrix W. There are three popular ways to determine the weight matrix W.

• Binary Kernel. The easiest way to estimate W is to set W = P. Thus Wij = Pij, and each entry in W is also either 0 or 1.
• Gaussian Kernel. Here, W can be a little more complex than the previous one. That is, Wij = Pij exp(−d(xi, xj) / (2σ²)).
• Locally Linear Reconstruction (LLR). LLR is derived from the Locally Linear Embedding (LLE) technique by Roweis et al. [12]. The goal is to reconstruct xi from its neighborhood. It can be formulated as the following optimization problem,

min_W Σ_i ‖xi − Σ_j Pij Wij xj‖²   s.t.   Σ_j Wij = 1, Wij ≥ 0, i = 1, . . . , n.  (5)

In summary, the b-Matching method restricts the constructed similarity graph to be regular so that the given labels can be propagated in a more balanced way during the following label inference step.
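The following sketch illustrates one common unsupervised construction discussed above: a symmetrized KNN graph whose surviving edges are re-weighted with the Gaussian kernel. It assumes NumPy and Euclidean distances, and the bandwidth sigma is a hypothetical hyper-parameter; it is a sketch of the general recipe rather than the exact procedure of any single cited method.

```python
import numpy as np

def knn_gaussian_graph(X, k=10, sigma=1.0):
    """Build a symmetric weight matrix W from data X (n x d) using a KNN graph
    with Gaussian re-weighting, in the spirit of Section III-A."""
    n = X.shape[0]
    # pairwise squared Euclidean distances
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        # k nearest neighbors of node i (position 0 is the node itself, so skip it)
        neighbors = np.argsort(sq_dist[i])[1:k + 1]
        W[i, neighbors] = np.exp(-sq_dist[i, neighbors] / (2.0 * sigma ** 2))
    W = np.maximum(W, W.T)    # symmetrize so the resulting graph is undirected (Assumption 1)
    np.fill_diagonal(W, 0.0)  # no self-loops (Assumption 3)
    return W
```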

B. Supervised methods

The existing prevalent strategies for graph construction are unsupervised, i.e., they fail to use any given label information during the construction phase. However, labeled samples can be used as a kind of prior knowledge to refine the generated graph for the downstream learning tasks. Dhillon et al. [13] study the possibility of employing labeled points so as to measure the similarities between node pairs. Rohban et al. [14] suggest another supervised method of graph construction, which demonstrates that the optimal solution for a neighborhood graph can be regarded as a subgraph of a KNN graph as long as the manifold sampling rate is large enough. Driven by previous studies [10], a new method, graph-based on informativeness of labeled instances (GBILI) [15], which also utilizes the label information, is introduced. GBILI not only results in decent accuracy on classification tasks but also stands out with a quadratic time complexity [16]. Moreover, built on GBILI [15], Lilian et al. [17] have upgraded the method to produce more robust graphs by solving an optimization problem with a specific algorithm called Robust Graph that Considers Labeled Instances (RGCLI). More recently, a new SSL method referred to as low-rank semi-supervised representation is proposed [18], which incorporates labeled data into the low-rank representation (LRR). A follow-up work is by Taherkhani et al. [19]. By taking additional supervised information into account, the generated similarity graph can facilitate the following label inference process to a great extent.

IV. GRAPH REGULARIZATION

All the classic GSSL methods can actually be simplified as searching for a function f on the graph. f has to satisfy two criteria simultaneously: 1) it must be as close to the given labels as possible, and 2) it must be smooth on the entire constructed graph.

These two conditions can be further expressed in a general regularization framework in which the loss function is decomposed into two main parts. The first term is a supervised loss enforcing the first criterion, and the second term is a graph regularization loss enforcing the second criterion. Formally, we have,

L(f) = Σ_{(xi,yi)∈Dl} Ls(f(xi), yi) + µ Σ_{xi∈Dl+Du} Lr(f(xi)),  (6)

where f is the prediction function, the first term is the supervised loss, the second term is the regularization loss, and µ is a trade-off hyper-parameter.

In the following sections, we will see that all the graph regularization methods reviewed here are similar. They only differ in the particular choice of the loss function with various regularizers. Table II summarizes all the reviewed graph regularization methods in Section IV from the perspective of decomposing the regularizer. This generalized framework of graph regularization has been carefully examined by Zhou et al. [20], and its theoretical analysis from different perspectives has also been provided by [21] [22] [23].

A. Label propagation

Label Propagation (LP) [31] is the most popular method for label inference in GSSL. Label propagation can be formulated as a problem in which some of the nodes' labels, also referred to as seeds, propagate to unlabeled nodes based on the similarity of each node pair, which is represented by the constructed graph discussed in Section III. Meanwhile, during the propagation process, the given labels need to be fixed. In this way, labeled nodes serve as guides that let label information flow through the edges within the graph so that unlabeled nodes can also be tagged with predicted labels.

The basic version of the label propagation algorithm is as follows:
Step 1. All nodes propagate labels for one step: Y ← TY.
Step 2. Row-normalize Y to maintain the class probability interpretation.
Step 3. Clamp the labeled data. Repeat from Step 2 until Y converges.

1) Gaussian random fields: Gaussian Random Fields (GRF) [24] is a typical example of the early work in GSSL using label propagation algorithms. The strategy is to estimate some prediction function f based on the graph G with some constraints to ensure certain necessary properties, and afterward attach labels to the unlabeled nodes according to f. In fact, the above-mentioned constraint is to take f(xi) = fl(xi) ≡ yi on all the labeled nodes. Intuitively, unlabeled points that are strongly connected should share common labels. This is why the quadratic energy function is designed as shown in Eq. (7),

E(f) = Lr = (1/2) Σ_{i,j} Wij (f(xi) − f(xj))².  (7)

It is noteworthy that the minimizer of the energy function, f = arg min_{f|Dl = fl} E(f), is harmonic; namely, it satisfies the constraint Lf = 0 on the unlabeled nodes and is equal to fl on the labeled nodes Dl, where L is the graph Laplacian matrix. The property of harmonic functions indicates that the value of f at every unlabeled node is the mean value of f at its neighboring nodes: f(xj) = (1/dj) Σ_{i∼j} Wij f(xi), for j = l + 1, . . . , l + u. This constraint is actually compatible with the previous smoothness requirement of f with respect to the graph. It can also be interpreted in an iterative manner as shown in Eq. (8),

f^(t+1) ← P · f^(t),  (8)

where P = D^{−1} W. Furthermore, a closed-form solution of Eq. (8) can be deduced if the weight matrix W is split into four blocks W = [Wll, Wlu; Wul, Wuu]. Then,

fu = (Duu − Wuu)^{−1} Wul fl = (I − Puu)^{−1} Pul fl.  (9)

2) Local and global consistency: Zhou et al. [25] extend the work [24] to the multiclass setting and propose Local and Global Consistency (LGC) to handle a more general semi-supervised problem. The iterative formula is shown in Eq. (10),

Y^(t) = αSY^(t−1) + (1 − α)Y^(0),  (10)

where S = D^{−1/2} A D^{−1/2}, and α is a hyper-parameter. We can also easily derive the closed-form solution for Eq. (10) as shown in Eq. (11),

Ŷ = αSŶ + (1 − α)Y^(0).  (11)

From the perspective of an optimization problem, LGC [25] actually tries to minimize the following objective function, Eq. (12), associated with the prediction function f,

L(f) = (1/2) Σ_{i,j} Wij (f(xi)/√Dii − f(xj)/√Djj)² + µ Σ_{i=1}^{nl} (f(xi) − yi)².  (12)
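A compact sketch of the two inference rules above, assuming NumPy, a symmetric weight matrix W (used in place of A in Eq. (10)), and labels given either as a vector or as a one-hot matrix: Eq. (9) gives the GRF harmonic solution on the unlabeled block, and the LGC prediction is obtained as the fixed point of Eq. (10)/(11).

```python
import numpy as np

def grf_harmonic(W, f_l, labeled_idx):
    """Gaussian Random Fields, Eq. (9): f_u = (D_uu - W_uu)^{-1} W_ul f_l."""
    n = W.shape[0]
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    L = np.diag(W.sum(axis=1)) - W                    # graph Laplacian
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
    f_u = np.linalg.solve(L_uu, W_ul @ f_l)           # harmonic solution on unlabeled nodes
    return unlabeled_idx, f_u

def lgc_closed_form(W, Y0, alpha=0.99):
    """Local and Global Consistency: fixed point of Eq. (10)/(11),
    Y_hat = (1 - alpha) (I - alpha S)^{-1} Y0 with S = D^{-1/2} W D^{-1/2}."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ W @ D_inv_sqrt
    n = W.shape[0]
    return (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, Y0)
```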

Table II: Summary on Graph Regularization Methods
  Method | Supervised loss Ls(f, Dl) | Graph regularization loss Lr(D)
  GRF [24] | Σ_{i=1}^{nl} (f(xi) − yi)² | Σ_{i,j} Wij (f(xi) − f(xj))²
  LGC [25] | Σ_{i=1}^{nl} (f(xi) − yi)² | Σ_{i,j} Wij (f(xi)/√Dii − f(xj)/√Djj)²
  p-Laplacian [26] | Σ_{i=1}^{nl} (f(xi) − yi)² | Σ_{i,j} Wij |f(xi)/√Dii − f(xj)/√Djj|^p
  Directed regularization [27] | Σ_{i=1}^{nl} (f(xi) − yi)² | Σ_{i,j} π(i)p(i,j) (f(xi)/√Dii − f(xj)/√Djj)²
  Manifold regularization [28] | Σ_{i=1}^{nl} (f(xi) − yi)² | γA ‖f‖²_K + γI (1/(nl+nu)²) ŷᵀLŷ
  LPDGL [29] | Σ_{i=1}^{nl} (f(xi) − yi)² | Σ_{i,j} Wij (f(xi) − f(xj))² + Σ_{i=1}^{n} (1 − Dii/Σ_{j=1}^{n} Djj) (f(xi))²
  Poisson learning [30] | Σ_{i=1}^{nl} (f(xi) − yi)² | Σ_{i,j} Wij (f(xi) − Σ_{j∈N(i)} f(xj))²

Compared to the GRF objective above, LGC has two important differences: (a) the inferred labels for the "labeled" nodes are no longer required to be exactly equal to the seed values, and this helps with cases where there may be noise in the seed labels, and (b) the label for each node is penalized by the degree of that node, 1/√Dii, ensuring that in the case of irregular graphs, the influence of high-degree nodes is regularized.

There exist quite a few variants of the LGC method; a representative one is p-Laplacian regularization [26]. The first term in Eq. (12) can be substituted by a more general one, Σ_{i,j} Wij |f(xi)/√Dii − f(xj)/√Djj|^p, where p is a positive integer. Slepcev et al. [26] provide a comprehensive analysis of its theoretical grounds. In addition, many applications based on LGC have proven to be successful in various domains. For example, Iscen et al. [32] utilize the LGC method to facilitate the training process of deep neural networks (DNNs) by generating pseudo-labels for the unlabeled data.

B. Directed regularization

In the previous label propagation methods, only undirected graphs are applicable. A new regularization framework for directed graphs, such as citation networks, is provided to solve this issue [27]. To fully take the directionality of the edges into consideration, the idea of a naive random walk is incorporated into this regularization framework. π is used to denote a unique probability distribution satisfying the following equation,

π(i) = Σ_{j→i} π(j) p(j, i), ∀ i ∈ V,  (13)

where

p(i, j) = Wij / d⁺(i) = Wij / Σ_{j←i} Wij.  (14)

In Eq. (13) and Eq. (14), j → i denotes the set of vertices adjacent to the vertex i, while j ← i denotes the set of vertices adjacent from the vertex i. Thus, we can define a loss function that sums the weighted variation of each edge in the directed graph, as shown in Eq. (15),

L(f) = (1/2) Σ_{i,j} π(i) p(i, j) (f(xi)/√Dii − f(xj)/√Djj)² + µ Σ_{i=1}^{nl} (f(i) − yi)².  (15)

It is also worth noting that Eq. (12) for undirected graphs can be regarded as a specific case of Eq. (15) for directed graphs. The stationary distribution of the random walk in an undirected graph is π(j) = Djj / Σ_{i∈V} Dii. By substituting this expression into Eq. (15), we can easily derive Eq. (12), which is exactly the regularizer of LGC by Zhou et al. [25].
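The quantities in Eqs. (13) and (14) can be obtained numerically as sketched below, assuming NumPy and the weight matrix W of a strongly connected directed graph (otherwise a teleporting variant of the walk would be needed); the stationary distribution π is approximated by power iteration.

```python
import numpy as np

def random_walk_quantities(W, n_iter=1000, tol=1e-10):
    """Eq. (14): p(i, j) = Wij / sum_j Wij; Eq. (13): stationary pi satisfying pi = pi P."""
    out_deg = W.sum(axis=1)
    P = W / out_deg[:, None]             # row-stochastic transition matrix of the random walk
    n = W.shape[0]
    pi = np.full(n, 1.0 / n)             # start from the uniform distribution
    for _ in range(n_iter):
        pi_next = pi @ P                 # one step of the walk
        if np.abs(pi_next - pi).sum() < tol:
            pi = pi_next
            break
        pi = pi_next
    return P, pi
```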

C. Manifold regularization

Manifold regularization [28] [33] is actually a general framework that allows for developing a great number of algorithms ranging from supervised learning to unsupervised learning. However, it is viewed as a natural fit for GSSL since it combines spectral graph theory with manifold learning to search for a low-dimensional representation with a smoothness constraint in the original, commonly high-dimensional space.

The manifold regularization framework fully utilizes the geometric property of the unknown probability distribution that the data samples obey. Therefore, it introduces another term as a regularizer to control the complexity of the prediction function in the intrinsic space, measured by the geometry of the probability distribution.

Formally, for a Mercer kernel K : X × X → R, we denote the associated Reproducing Kernel Hilbert Space (RKHS) of the prediction function f. Then the loss function can be formulated as in Eq. (16),

L(f) = (1/nl) Σ_{i=1}^{nl} (f(xi) − yi)² + γA ‖f‖²_K + γI ‖f‖²_I,  (16)

where γA balances the complexity of the prediction function in the ambient space and γI is the weighting parameter for the smoothness constraint term ‖f‖²_I induced by both labeled and unlabeled samples.

It is noted that the added regularization term ‖f‖²_I usually takes the following form,

‖f‖²_I = (1/(nl + nu)²) ŷᵀLŷ,  (17)

where ŷ = [f(x1), f(x2), . . . , f(x_{nl+nu})]ᵀ and L is the Laplacian matrix of the graph.

According to the Representer Theorem [34], it is well known that Eq. (16) has a closed-form solution when ‖f‖²_I takes the form shown in Eq. (17). However, it suffers from a high computational cost [35], which makes the algorithm unscalable when faced with large graphs. Popular solutions to alleviate this problem accelerate either the construction of the Laplacian graph [36] [37] or the kernel matrix operations [38] [39].

D. LPDGL

The above three methods [24] [25] [27] all prove to be ineffective for handling ambiguous examples [40]. Gong et al. [29] introduce the deformed graph Laplacian (DGL) and provide the corresponding label prediction algorithm via DGL (LPDGL) for SSL. A new smoothness term that considers local information is added to the regularizer. The whole regularizer becomes Eq. (18),

L(f) = (α/2) Σ_{i,j} Wij (f(xi) − f(xj))² + (β/2) Σ_{i=1}^{n} (1 − Dii/Σ_{j=1}^{n} Djj) (f(xi))² + µ Σ_{i=1}^{nl} (f(xi) − yi)²,  (18)

where both α and β are trade-off parameters. It has been shown by theoretical analysis that LPDGL achieves a globally optimal prediction function. Additionally, the performance is robust to the hyper-parameter settings, so this model is not difficult to fine-tune.

E. Poisson learning

The most recent work under the regularization framework is Poisson learning [30], which is motivated by the need to address the degeneracy of previous graph regularization methods when the label rate is meager. The newly proposed approach replaces the given label values with the assignment of sources and sinks, like flow in the graph. Thus, a resulting Poisson equation based on the graph can be nicely solved. The loss function of Poisson learning is shown in Eq. (19),

L(f) = (1/2) Σ_{i,j} Wij (f(xi) − Σ_{j∈N(i)} f(xj))² + µ Σ_{i=1}^{nl} (f(xi) − yi)².  (19)

V. GRAPH EMBEDDING

Generally speaking, there are two types of graph embedding at two levels commonly seen in the literature. The first one is at the entire graph level, while the second one is at the single node level [41]. Both of them aim to represent the target object in a low-dimensional vector space. For GSSL, we focus on node embeddings since such representations can be easily used for SSL tasks. The main objective of node embedding is to encode the nodes as vectors with lower dimensions, which at the same time can reflect their positions and the structure of their local neighborhood.

Formally, we have the following definition for node embedding on graphs. Given a graph G = (V, E), a node embedding on it is a mapping fz : v → zv ∈ R^d, ∀v ∈ V, such that d ≪ |V| and the function fz preserves some proximity measure defined on graph G. The generalized form of the loss function for graph embedding methods is shown in Eq. (20),

L(f) = Σ_{(xi,yi)∈Dl} Ls(f(fz(xi)), yi) + µ Σ_{xi∈Dl+Du} Lr(f(fz(xi))),  (20)

where fz is the embedding function. It is obvious that Eq. (20) is almost the same as Eq. (6) for graph regularization, except that for graph embedding methods, classifiers are trained based on the nodes' embedding results rather than directly on the nodes' attributes.

A. Generalization: Perspective of encoder-decoder

Following the generalization of graph representation learning methods by Hamilton et al. [41], all the node embedding methods mentioned in this section can be generalized under an encoder-decoder framework. From this perspective, the node embedding problem in graphs can be viewed as involving two key steps. First, an encoder model tries to map every node into a low-dimensional vector. Second, a decoder model is constructed to take the low-dimensional node embeddings as input and use them to reconstruct the information related to each node's neighborhood in the original graph, like an adjacency matrix.

1) Encoder: Formally, the encoder can be viewed as a function that maps nodes v ∈ V to vector embeddings zv ∈ R^d. The resulting embeddings are more discriminative in a latent space with more dimensions. Furthermore, they can be transformed back to the original feature vector more easily in the following decoder module. From a mathematical view, we have Enc : V → R^d.

2) Decoder: The decoder module's main goal is to reconstruct certain graph statistics from the node embeddings generated by the encoder in the previous step. For example, given a node embedding zu of a node u, the decoder might attempt to predict u's set of neighbors N(u) or its row A[u] in the graph adjacency matrix. Decoders are often defined in a pairwise form, which can be illustrated as predicting each pair of nodes' similarity. Formally, we have, Dec : R^d × R^d → R+.
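To make the encoder-decoder view concrete before the reconstruction step is formalized, the sketch below pairs the simplest shallow encoder (an embedding lookup table) with an inner-product decoder; both choices are illustrative assumptions, and the other encoders and decoders summarized in Tables III and IV fit the same interface.

```python
import numpy as np

class ShallowEncoder:
    """Shallow encoder: a learnable lookup table Z of node embeddings (Enc: V -> R^d)."""
    def __init__(self, num_nodes, d, seed=0):
        rng = np.random.default_rng(seed)
        self.Z = 0.1 * rng.standard_normal((num_nodes, d))

    def __call__(self, v):
        return self.Z[v]

def inner_product_decoder(z_u, z_v):
    """Pairwise decoder (Dec: R^d x R^d -> R+): a non-negative inner-product score."""
    return float(np.maximum(z_u @ z_v, 0.0))

# Usage: decode the similarity of a node pair from their embeddings.
enc = ShallowEncoder(num_nodes=5, d=16)
score = inner_product_decoder(enc(0), enc(3))
```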

3) Reconstruction: The reconstruction process for a pair of node embeddings zu, zv involves applying the pairwise decoder to them. The overall goal is to solve an optimization problem that minimizes the reconstruction loss so that the similarity measures produced by the decoder are as close to the ones defined in the original graph as possible. More formally, we have

Dec(Enc(u), Enc(v)) = Dec(zu, zv) ≈ S[u, v].  (21)

Here, we assume that S[u, v] is a certain kind of similarity measure between a pair of nodes. For example, the commonly used reconstruction objective of predicting whether two nodes are neighbors would be minimizing the gap between S[u, v] and A[u, v].

To achieve the reconstruction objective Eq. (21), the standard practice is to minimize the empirical reconstruction loss L defined over all the training data D, including both labeled and unlabeled nodes,

L = Σ_{(u,v)∈D} ℓ(Dec(Enc[u], Enc[v]), S[u, v]),  (22)

where ℓ : R × R → R is a loss function for every node pair that measures the inconsistency between the true similarity values and the decoded ones.

B. Shallow embedding and deep embedding

In most of the work on node embedding, the encoder can be classified as a shallow embedding approach, in which the encoder is simply a lookup function based on the node ID. Alternatively, the encoder can use both node features and the local graph structure around each node as input to generate an embedding, as in graph neural networks (GNNs). These methods are further categorized as deep embedding methods.

VI. SHALLOW GRAPH EMBEDDING

Some specialized optimization methods based on matrix factorization can be employed as a deterministic way to solve the optimization problem in Eq. (22). Generally speaking, the whole task can be considered as using matrix factorization methods to learn a low-dimensional approximation of a similarity matrix S, where S encodes the information related to the original adjacency matrix or other matrix measurements. Unlike the deterministic factorization methods, recent years have witnessed a surge of successful methods that use stochastic measures of neighborhood overlap to generate shallow embeddings. The key innovation in these approaches is that node embeddings are optimized under the assumption that if two nodes in the graph co-occur on some short random walks with high probability, they tend to share similar embeddings [42].

A. Factorization-based methods

For the category of factorization-based methods, a matrix that indicates the relationship between every node pair is factorized to obtain the node embedding. This matrix typically contains some underlying structural information about the constructed similarity graph, such as the adjacency matrix or the normalized Laplacian matrix. Different matrix properties can lead to different ways of factorizing these matrices. For instance, it is obvious that the normalized Laplacian matrix is positive semi-definite, so eigenvalue decomposition is a natural fit.

Table III applies the encoder-decoder perspective to summarize some representative factorization-based shallow embedding approaches at the node level for graphs. The most critical benefit of the previously mentioned encoder-decoder framework in Section V-A is that it provides a general overview of their respective components, so that it is much easier to compare different embedding methods.

1) Locally linear embedding (LLE): The most fundamental assumption in LLE [43] is that the embedding of each node is just a linear combination of the nodes in its neighborhood. More specifically, each entry Wij in the weight matrix W of the constructed graph can denote how much node j contributes to the embedding of node i, namely the weight factor for node j in the linear combination for node i. Formally, given the definition of Yi,

Yi ≈ Σ_j Wij Yj, ∀ i ∈ V.  (23)

The embedding can be obtained as,

φ(Y) = Σ_i |Yi − Σ_j Wij Yj|².  (24)

Adding another two constraints, (1/N) YᵀY = I and Σ_i Yi = 0, to the above optimization problem, translational invariance can be eliminated since the embedding is forced to be centered around the origin. It has been proven that the solution to this problem is to compute all the eigenvectors of the sparse matrix (I − W)ᵀ(I − W), sort the corresponding eigenvalues in descending order, and take the first d + 1 eigenvectors as the final embedding result.

2) Laplacian eigenmaps: Laplacian Eigenmaps [44] makes strongly connected nodes close to each other in the embedding space. Unlike LLE [43], the objective function is designed in a pairwise manner,

φ(Y) = (1/2) Σ_{i,j} |Yi − Yj|² Wij = tr(YᵀLY),  (25)

where L is the Laplacian matrix. Similar to LLE [43], it is necessary to add another constraint, YᵀDY = I, so that some trivial solutions are removed. The optimal solution is achieved by choosing the eigenvectors of the normalized Laplacian matrix whose corresponding eigenvalues are among the d smallest ones.
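A minimal sketch of the Laplacian Eigenmaps embedding described above, assuming NumPy: it eigendecomposes the normalized Laplacian and keeps the eigenvectors associated with the d smallest non-trivial eigenvalues.

```python
import numpy as np

def laplacian_eigenmaps(W, d=2):
    """Embed nodes with Laplacian Eigenmaps: eigenvectors of the normalized
    Laplacian with the d smallest non-trivial eigenvalues."""
    deg = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L_norm = np.eye(W.shape[0]) - D_inv_sqrt @ W @ D_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L_norm)      # eigenvalues in ascending order
    # skip the first (trivial) eigenvector, keep the next d columns as the embedding
    return eigvecs[:, 1:d + 1]
```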

Table III: Summary on Factorization-based Shallow Graph Embedding Methods
  Method | Decoder | Similarity measure | Loss function | Time complexity
  LLE [43] | ‖zu − Σv Auv zv‖ | Auv | Σu ‖zu − Σv Auv zv‖²₂ | O(|E|d²)
  Laplacian Eigenmaps [44] | ‖zu − zv‖²₂ | Auv | Σ Dec(zu, zv) · S[u, v] | O(|E|d²)
  Graph Factorization [45] | zuᵀzv | Auv | ‖Dec(zu, zv) − S[u, v]‖²₂ | O(|E|d)
  GraRep [46] | zuᵀzv | Auv, . . . , A^k_uv | ‖Dec(zu, zv) − S[u, v]‖²₂ | O(|V|³)
  HOPE [47] | zuᵀzv | general similarity matrix S | ‖Dec(zu, zv) − S[u, v]‖²₂ | O(|E|d²)

3) Graph factorization: Graph Factorization (GF) [45] is the first algorithm to reduce the time complexity of previous graph embedding algorithms to O(|E|). Instead of factorizing the Laplacian matrix like LLE [43] and Laplacian Eigenmaps [44], GF directly employs the adjacency matrix and minimizes the objective function,

φ(Y, µ) = (1/2) Σ_{(i,j)∈E} (Wij − ⟨Yi, Yj⟩)² + (µ/2) Σ_i ‖Yi‖²,  (26)

where µ is a hyper-parameter for the introduced regularization term. Because the adjacency matrix is not necessarily positive semidefinite, the summation over all the observed edges can be regarded as an approximation for the sake of scalability.

4) GraRep: GraRep [46] utilizes the node transition probability matrix, defined as T = D^{−1}W, and k-order proximity is preserved by minimizing the loss ‖X^k − Y_s^k (Y_t^k)ᵀ‖²_F, where X^k is derived from T^k, and Y_s and Y_t are the source and target embedding vectors respectively. It then concatenates Y_s^k for all k to form Y_s. The drawback of GraRep is scalability, since T^k can have O(|V|²) non-zero entries.

5) HOPE: Similar to GraRep [46], HOPE [47] preserves higher-order proximity by minimizing another objective function, ‖S − Y_s Y_tᵀ‖²_F, where S is now the proximity matrix. The similarity measurement is defined in the form S = M_g^{−1} M_l, where M_g and M_l are both sparse matrices. In this fashion, Singular Value Decomposition (SVD) can be applied so as to acquire node embeddings in an efficient manner.

6) M-NMF: While the previous methods merely center around the microscopic structure (i.e., the first-order and second-order proximity), the mesoscopic community structure is incorporated into the embedding process for Modularized Nonnegative Matrix Factorization (M-NMF) [48]. The cooperation between the microscopic structure and the mesoscopic structure is established by exploiting the consensus relationship between the representations of nodes and the community structure.

B. Random-walk-based methods

The random walk is a powerful tool to gain approximate results about certain properties of the given graph, such as node centrality [49] and similarity [50]. Consequently, random-walk-based node embedding methods are effective in scenarios where only part of the graph is accessible or the graph's scale is too large to handle efficiently.

The key points of random-walk-based node embedding approaches are summarized in Table IV from the encoder-decoder perspective. The similarity function pG(v | u) corresponds to the probability of visiting v on a fixed-length random walk starting from u.

1) DeepWalk: Inspired by the skip-gram model [55], DeepWalk [51] follows the main goal of HOPE [47] and thus preserves the higher-order proximity of each node pair. However, DeepWalk takes another approach by maximizing the probability of encountering the previous k nodes and the following k nodes along one specific random walk centered at vi. In other words, DeepWalk maximizes the log-likelihood function, which is defined as log Pr(v_{i−k}, . . . , v_{i−1}, v_{i+1}, . . . , v_{i+k} | Yi), where 2k + 1 is the length of the random walk. The decoder is a basic form of dot-product to reconstruct graph information from the encoded node embeddings.

2) Planetoid: Yang et al. [52] propose another GSSL method based on the random walk, called Planetoid, where the embedding of a node is jointly trained to predict the class label and also the context in the given graph. The highlight of Planetoid is that it can be employed in both transductive and inductive settings. The context sampling behind Planetoid is actually built upon DeepWalk. In contrast to DeepWalk, Planetoid can handle graphs with real-valued attributes by incorporating negative samples and injecting supervised information. The inductive variant of Planetoid views each node's embedding as a parameterized function of the input feature vector, while the transductive variant only embeds graph structure information.

3) node2vec: Following the same idea as DeepWalk [51], node2vec [56] also tries to preserve higher-order proximity for each node pair but makes full use of biased random walks so that it can balance between breadth-first (BFS) and depth-first (DFS) search on the given graph to generate more expressive node embeddings. To be more specific, many random walks of fixed length are sampled, and then the probability of occurrence of subsequent nodes along these biased random walks is maximized.

4) LINE: The previously mentioned methods do not scale to large real-world networks; Tang et al. propose LINE [53] to fix this issue by preserving both local and global graph structures with scalability. In particular, LINE combines first-order and second-order proximity, and they are optimized using the KL divergence metric. A decoder based on the sigmoid function is used in the first-order objective, while another decoder identical to the one in node2vec and DeepWalk is used in the second-order objective. Unlike node2vec and DeepWalk, LINE explicitly factorizes the proximity measurement instead of implicitly incorporating it with sampled random walks.
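The sketch below illustrates the random-walk sampling that DeepWalk-style methods rely on, assuming NumPy and an adjacency-list representation; the sampled walks would then be fed to a skip-gram model (for example, gensim's Word2Vec with sg=1) to obtain the node embeddings, a training step that is omitted here.

```python
import numpy as np

def sample_random_walks(adj_list, walks_per_node=10, walk_length=40, seed=0):
    """Uniform (unbiased) truncated random walks, as used by DeepWalk.
    adj_list: dict mapping each node to a list of its neighbors."""
    rng = np.random.default_rng(seed)
    walks = []
    nodes = list(adj_list.keys())
    for _ in range(walks_per_node):
        rng.shuffle(nodes)                     # visit start nodes in a random order
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj_list[walk[-1]]
                if not neighbors:              # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append([str(v) for v in walk])   # string tokens for a skip-gram model
    return walks

# The walks can then be treated as "sentences" for a skip-gram model, e.g.
# gensim.models.Word2Vec(walks, vector_size=128, window=5, sg=1, min_count=0).
```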

Table IV: Summary on Random-Walk-based Shallow Graph Embedding Methods


Method Decoder Similarity measure Loss function Time complexity
>
ezu zv
DeepWalk [51] z>
pG (v | u) −S[u, v] log (Dec (zu , zv )) O(|V |d)
u zk
P
k∈V e
z> zv
e u
Evn ∼Pn (V ) log −σ z>
 
Planetoid [52] z>
pG (v | u) u zvn O(|V |d)
u zk
P
k∈V e
z> zv
e u
− log σ z> − γEvn ∼Pn (V ) log −σ z>
P   
node2vec [51] z>
pG (v | u) (u,v)∈D u zv u zvn O(|V |d)
u zk
P
k∈V e
1 > − γEvn ∼Pn (V ) log −σ z>
P   
LINE [53] > pG (v | u) (u,v)∈D − log σ zu zv u zvn O(|V |d)
1−e−zu zk
1
PTE [54] > pG (v | u) −S[u, v] log (pG (v | u)) O(|V |d)
1−e−zu zk

space. This stochastic embedding method preserves the se- 2) No use of node features. Another key problem with
mantic meaning of words and shows a strong representational shallow embedding approaches is that they fail to leverage
power for the particular downstream task. node features. However, rich feature information could
6) HARP: HARP [57] is a general strategy to improve potentially be informative in the encoding process. This is
the above-mentioned solutions [51] [56] [53] by avoiding especially true for SSL tasks where each node represents
local optima with the help of better weight initialization. The hierarchy of nodes is created by node aggregation in HARP using a graph coarsening technique based on the preceding hierarchy layer. After that, the new embedding result can be generated from the coarsened graph, and the refined graph (i.e., the graph in the next level up in the hierarchy) can be initialized with the previous embedding. HARP propagates these node embeddings level by level so that it can be used in combination with random-walk-based approaches in order to achieve better performance.

C. Relationship between random-walk-based and factorization-based methods

Even though shallow embedding methods can be divided into two groups based on whether they are deterministic or stochastic, random-walk-based methods can in general be transformed into the factorization-based group. Qiu et al. [58] provide a theoretical analysis of the aforementioned random-walk-based methods and show that they all essentially perform implicit matrix factorization and have closed-form solutions. Qiu et al. [58] also propose a new framework, NetMF, to factorize these underlying matrices in random-walk-based methods explicitly. An impactful follow-up work, NetSMF [59], extends NetMF [58] to large-scale graphs based on sparse matrix factorization, making it more scalable for large real-world networks.

D. Limitations of shallow embedding

Although shallow embedding methods have achieved impressive success on many SSL-related tasks, it is worth noting that they also have some critical drawbacks that are hard to overcome.
1) Lack of shared parameters. In the encoder module, parameters are not shared between nodes since the encoder directly produces a unique embedding vector for each node. The lack of parameter sharing means that the number of parameters necessarily grows as O(|V|), which can be intractable in massive graphs.
2) No use of node features. The encoder of a shallow embedding method depends only on the graph structure, so node attributes are ignored and the resulting embeddings cannot exploit this potentially valuable feature information.
3) Failure in inductive applications. Shallow embedding methods are inherently transductive [41]. Generating embeddings for new nodes that are observed after the training phase is not possible. This restriction prevents shallow embedding methods from being used in inductive applications.

VII. DEEP GRAPH EMBEDDING

In recent years, a great number of deep embedding approaches have been proposed to handle some of the limitations discussed in Section VI-D. It should be emphasized that these deep embedding approaches differ from the shallow embedding approaches explained in Section VI in that a much more complex encoder, which is often based on deep neural networks (DNN) [60], is constructed and employed. In this manner, the encoder module can incorporate both the structural and attribute information of the graph. For SSL tasks, a top-level classifier needs to be trained to predict class labels for unlabelled nodes under the transductive setting, based on the node embeddings generated by these deep learning models.

A. AutoEncoder-based methods

Apart from the use of deep learning models, autoencoder-based methods also differ from the shallow embedding methods in that a unary decoder is employed instead of a pairwise one. Under the framework of autoencoder-based methods, every node i is represented by a high-dimensional vector extracted from a row of the similarity matrix, namely s_i = the i-th row of S, where S_{i,j} = s_G(i, j). The autoencoder-based methods aim to first encode each node based on the corresponding vector s_i and then reconstruct it from the embedding result, subject to the constraint that the reconstruction should be as close to the original as possible (Figure 3):

Dec(Enc(s_i)) = Dec(z_i) \approx s_i.   (27)

From the perspective of the loss function, autoencoder-based methods commonly take the following form:

L = \sum_{i \in V} \| Dec(z_i) - s_i \|_2^2.   (28)
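As a concrete illustration of the encoder-decoder view in Eqs. (27)-(28), the following minimal PyTorch sketch trains an MLP autoencoder on the rows s_i of a given similarity matrix and keeps the bottleneck vectors z_i as node embeddings. The layer widths, learning rate, and the random similarity matrix are illustrative assumptions and do not correspond to any specific method in Table V.

```python
import torch
import torch.nn as nn

class NodeAutoencoder(nn.Module):
    """Minimal MLP autoencoder over similarity-matrix rows (Eqs. 27-28)."""
    def __init__(self, num_nodes, embed_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(num_nodes, 128), nn.ReLU(),
                                     nn.Linear(128, embed_dim))
        self.decoder = nn.Sequential(nn.Linear(embed_dim, 128), nn.ReLU(),
                                     nn.Linear(128, num_nodes))

    def forward(self, s):
        z = self.encoder(s)           # z_i = Enc(s_i)
        return z, self.decoder(z)     # Dec(z_i) should approximate s_i

# S: (|V|, |V|) similarity matrix, e.g. a normalized adjacency (assumed given).
S = torch.rand(100, 100)
model = NodeAutoencoder(num_nodes=100)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    z, s_hat = model(S)
    loss = ((s_hat - S) ** 2).sum(dim=1).mean()   # Eq. (28), averaged over nodes
    opt.zero_grad(); loss.backward(); opt.step()
embeddings = model.encoder(S).detach()            # final node embeddings z_i
```

For SSL, a top-level classifier would then be trained on the embeddings of the labelled nodes and applied to the unlabelled ones.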

Figure 3: For AutoEncoder-based methods, a high-dimensional vector s_i is extracted and fed into the AutoEncoder to generate a low-dimensional embedding z_i.

Figure 4: Summary on the architectures of AutoEncoder-based methods.

From Eq. (27), it should be pointed out that the encoder module actually depends on the given s_i vector. This allows autoencoder-based deep embedding approaches to incorporate local structural information into the encoder, while this is simply impossible for the shallow embedding approaches. The primary components of these methods are summarized in Table V, and their architectures are compared in Figure 4.

Despite this noticeable enhancement, the autoencoder-based methods may still suffer from some problems. In particular, their computational cost is still intolerable for large-scale graphs. Moreover, the structure of the autoencoder is predefined and unchanged during training, so it is strictly transductive and thus fails to cope with evolving graphs. Up-to-date, relevant representative works that tackle these issues are [67] [68].

1) SDNE: Wang et al. [61] propose Structural Deep Network Embedding (SDNE), which uses deep autoencoders to preserve the first-order and second-order proximity. The first-order proximity describes the similarity between each node pair, while the second-order proximity between each node pair describes the proximity of their neighborhood structures. The method takes advantage of non-linear functions to acquire the embedding results. It contains two modules: 1) the unsupervised part and 2) the supervised part. The former is an autoencoder designed to produce an embedding result for each node that can be used to rebuild its corresponding vector s_i. For the latter part, Laplacian Eigenmaps are utilized so that a penalty is imposed if connected nodes are encoded far away from each other in the embedding space.

2) DNGR: Deep neural networks for learning graph representations (DNGR) [62] integrates random surfing with autoencoders to generate node embeddings. This model has three components: 1) random surfing, 2) estimation of the positive pointwise mutual information (PPMI) matrix, and 3) stacked denoising autoencoders. For the input graph, random surfing is first applied to produce a co-occurrence probability matrix similar to HOPE. This probabilistic matrix is then converted into a PPMI matrix and fed into a stacked denoising autoencoder to generate the final embedding result. The feedback of the PPMI matrix guarantees that the high-order proximity is captured and maintained by the autoencoder. Moreover, the introduction of stacked denoising autoencoders enhances the model's robustness in the presence of noise, as well as its capability to detect the underlying structure required for downstream tasks like node classification.

3) S2S-AE: Unlike the previous methods, whose encoders are all based on MLPs, Taheri et al. [63] extend the form of the encoder to RNN models. S2S-AE [63] uses long short-term memory (LSTM) [69] autoencoders to embed the graph sequences generated from random walks into a continuous vector space. The final representation is computed by averaging its graph sequence representations. The advantage of S2S-AE is that it can support arbitrary-length sequences, unlike others, which often suffer from the limitation of fixed-length inputs.

4) DRNE: Deep recursive network embedding (DRNE) [64] holds the assumption that the embedding of a node needs to approximate the aggregation of the embeddings of the nodes within its neighborhood. It also uses an LSTM [69] to aggregate a node's neighbors, so the reconstruction loss is different from the one in S2S-AE [63], as shown in Table V. In this way, DRNE can address the issue that the LSTM model is not invariant when the given nodes' sequence is permuted in different ways.

5) GAE & VGAE: Both MLP-based and RNN-based methods only consider structural information and ignore the nodes' feature information. GAE [65] leverages GCN [70] to encode both. The encoder takes the form

Enc(A, X) = GraphConv(\sigma(GraphConv(A, X))),   (29)

where GraphConv(·) is a graph convolutional layer defined in [70], σ(·) is the activation function, A is the adjacency matrix, and X is the attribute matrix. The decoder of GAE is defined as

Dec(z_u, z_v) = z_u^T z_v.   (30)

It may have some overfitting issues when the adjacency matrix is reconstructed in such a direct way. Variational GAE (VGAE) [65] instead learns the distribution of the data, in which the variational lower bound L is optimized:

L = E_{q(Z|X,A)}[\log p(A | Z)] - KL[q(Z | X, A) \| p(Z)],   (31)

Table V: Summary on AutoEncoder-based Deep Graph Embedding Methods

Method | Encoder | Decoder | Similarity measure | Loss function | Time complexity
SDNE [61] | MLP | MLP | s_u | \sum_{u \in V} \| Dec(z_u) - s_u \|_2^2 | O(|V||E|)
DNGR [62] | MLP | MLP | s_u | \sum_{u \in V} \| Dec(z_u) - s_u \|_2^2 | O(|V|^2)
S2S-AE [63] | LSTM | LSTM | s_u | \sum_{u \in V} \| Dec(z_u) - s_u \|_2^2 | O(|V|^2)
DRNE [64] | LSTM | LSTM | s_u | \sum_{u \in V} \| z_u - \sum_{v \in N(u)} LSTM(z_v) \|_2^2 | O(|V||E|)
GAE [65] | GCN | z_u^T z_v | A_{uv} | \sum_{u \in V} \| Dec(z_u) - A_u \|_2^2 | O(|V||E|)
VGAE [65] | GCN | z_u^T z_v | A_{uv} | E_{q(Z|X,A)}[\log p(A|Z)] - KL[q(Z|X,A) \| p(Z)] | O(|V||E|)
ARGA [66] | GAE | z_u^T z_v | A_{uv} | \min_G \max_D E_{z \sim p_z}[\log D(Z)] + E_{x \sim p(x)}[\log(1 - D(G(X, A)))] | O(|V||E|)
ARVGA [66] | VGAE | z_u^T z_v | A_{uv} | \min_G \max_D E_{z \sim p_z}[\log D(Z)] + E_{x \sim p(x)}[\log(1 - D(G(X, A)))] | O(|V||E|)

where KL[q(·)‖p(·)] is the Kullback-Leibler divergence between q(·) and p(·). Moreover, we have

q(Z | X, A) = \prod_{i=1}^{N} \mathcal{N}(z_i | \mu_i, \mathrm{diag}(\sigma_i^2)),   (32)

and

p(A | Z) = \prod_{i=1}^{N} \prod_{j=1}^{N} \big[ A_{ij}\,\sigma(z_i^T z_j) + (1 - A_{ij})(1 - \sigma(z_i^T z_j)) \big].   (33)

The most recent follow-up works are RWR-GAE [71], which adds a random walk regularizer to GAE and achieves a noticeable performance improvement, and DGVAE [72], which incorporates graph cluster memberships as latent factors to further improve the internal mechanism of VAE-based graph generation.
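To make the GAE formulation of Eqs. (29)-(30) concrete, the sketch below gives a dense-matrix PyTorch version; the hidden sizes, the toy graph, and the training loop are assumptions for illustration, not the released implementation of [65]. The VGAE variant would additionally sample Z from the Gaussian posterior of Eq. (32) and add the KL term of Eq. (31) to the loss.

```python
import torch
import torch.nn as nn

def normalize_adj(A):
    """Symmetrically normalized adjacency with self-loops, as used by GCN [70]."""
    A_hat = A + torch.eye(A.size(0))
    d = A_hat.sum(dim=1)
    D_inv_sqrt = torch.diag(d.pow(-0.5))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

class GAE(nn.Module):
    """Two-layer GCN encoder (Eq. 29) with an inner-product decoder (Eq. 30)."""
    def __init__(self, in_dim, hid_dim=32, z_dim=16):
        super().__init__()
        self.W1 = nn.Linear(in_dim, hid_dim, bias=False)
        self.W2 = nn.Linear(hid_dim, z_dim, bias=False)

    def encode(self, A_norm, X):
        H = torch.relu(A_norm @ self.W1(X))   # first graph convolution
        return A_norm @ self.W2(H)            # second graph convolution -> Z

    def decode(self, Z):
        return torch.sigmoid(Z @ Z.t())       # reconstructed adjacency

A = (torch.rand(50, 50) > 0.9).float(); A = ((A + A.t()) > 0).float()
X = torch.rand(50, 8)
model, A_norm = GAE(in_dim=8), normalize_adj(A)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(100):
    A_rec = model.decode(model.encode(A_norm, X))
    loss = nn.functional.binary_cross_entropy(A_rec, A)   # reconstruction term
    opt.zero_grad(); loss.backward(); opt.step()
```

The inner-product decoder makes the adjacency matrix itself the reconstruction target, which is why the learned embeddings carry both structural and attribute information.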
6) ARGA & ARVGA: To make the empirical distribution q(Z | X, A) agree with the prior distribution p(Z) used in GAE and VGAE, Pan et al. [66] propose ARGA and ARVGA with the help of generative adversarial networks (GANs) [73], in which they take GAE and VGAE as the encoder, respectively.

B. GNN-based methods

Several up-to-date deep embedding approaches are designed to overcome the main drawbacks of the shallow embedding approaches by constructing specific functions that depend on a node's neighborhood (Figure 5). The graph neural network (GNN), which is heavily utilized in state-of-the-art deep embedding approaches, is considered a general scheme for defining deep neural networks on graph-structured data.

The main idea is that the representation vectors of nodes can depend not only on the structure of the graph but also on any feature information associated with the nodes. Dissimilar to the previously reviewed methods, graph neural networks use the node features, e.g., node information for a citation network or even simple statistics such as node degree, one-hot vectors, etc., to generate the desired node embeddings.

Like other deep node embedding methods, a classifier is trained, explicitly or implicitly, on top of the node embeddings generated by the final hidden state of the GNN-based model. Afterward, it can be applied to the unlabeled nodes for SSL tasks.

Since a GNN consists of two main operations, the Aggregate operation and the Update operation, these methods will be reviewed from the perspective of the specific operation that is changed and improved compared with the basic GNN. The main techniques employed in these methods, together with some representative models, are listed in Table VI.

Figure 5: GNN-based methods can generate node embeddings by aggregating embeddings from their neighbors.

C. Basic GNN

As Gilmer et al. [90] point out, the fundamental feature of a basic GNN is that it takes advantage of neural message passing, in which messages are exchanged and updated between each pair of nodes by using neural networks. More specifically, during each neural message passing iteration in a basic GNN, a hidden embedding h_u^{(k)} corresponding to each node u is updated according to the message or information aggregated from u's neighborhood N(u). This general message passing update rule can be expressed as follows:

h_u^{(k+1)} = \mathrm{Update}^{(k)}\big( h_u^{(k)}, \mathrm{Aggregate}^{(k)}(\{ h_v^{(k)}, \forall v \in N(u) \}) \big) = \mathrm{Update}^{(k)}\big( h_u^{(k)}, m_{N(u)}^{(k)} \big).   (34)

It is noteworthy that in Eq. (34), both the Update and Aggregate operations must be differentiable functions, typically neural networks. Moreover, m_{N(u)} is the exact message that is aggregated from node u's neighborhood N(u) and tends to encode useful local structure information. Combining the message from the neighborhood with the previous hidden embedding state, the new state is generated according to Eq. (34). After a certain preset number of iterative steps, the last hidden

Table VI: Summary on GNN-based Deep Graph Embedding Methods

Improvement | Technique | Model
Basic GNN (Baseline) | Neural Message Passing | Basic GNN [74]
Generalized Aggregate Operation | Neighborhood Normalization | GCN [70], MixHop [75], SGC [76], DGN [77]
Generalized Aggregate Operation | Pooling | Set Pooling [78], Janossy pooling [79]
Generalized Aggregate Operation | Neighborhood Attention | GAT [80], AGNN [81]
Generalized Update Operation | Concatenation | Column networks [82], Scattering GCN [83], GraphSAGE [84], DropEdge [85]
Generalized Update Operation | Gated Updates | GGNN [86], NeuroSAT [87]
Generalized Update Operation | JK connections | JK Networks [88], InfoGraph* [89]

embedding state converges, so this final state is regarded as the embedding output for each node. Formally, we have

z_u = h_u^{(K)}, \forall u \in V.   (35)

It should be stressed that both the basic GNN and many of its variants strictly follow this generalized framework. The relationship among them is summarized in Table VI.

Before reviewing some of the GNN-based methods designed for SSL tasks, the basic version of GNN is introduced, which is a simplification of the original GNN model proposed by Scarselli et al. [74]. The basic GNN message passing is defined as

h_u^{(k)} = \sigma\Big( W_{self}^{(k)} h_u^{(k-1)} + W_{neigh}^{(k)} \sum_{v \in N(u)} h_v^{(k-1)} + b^{(k)} \Big),   (36)

where W_{self}^{(k)} and W_{neigh}^{(k)} are trainable parameters and σ is the activation function. The messages from the neighbors are first summed. Then, the neighborhood information is combined with the node's previous hidden embedding by a basic linear combination. Finally, a non-linear activation function is applied to the combined information. From the perspective of the key components of the GNN framework, the Aggregation operation and the Update operation are defined as shown in Eq. (37) and Eq. (38), respectively:

\mathrm{Aggregate}^{(k)}(\{ h_v^{(k)}, \forall v \in N(u) \}) = \sum_{v \in N(u)} h_v,   (37)

\mathrm{Update}( h_u, m_{N(u)} ) = \sigma\big( W_{self} h_u + W_{neigh} m_{N(u)} \big).   (38)

Furthermore, it is not uncommon to add a self-loop trick to the input graph so as to leave out the explicit update step, which can be considered a straightforward simplification of the neural message passing method used in the basic GNN. To be more specific, the message passing process can then be simply defined as shown in Eq. (39):

h_u^{(k)} = \mathrm{Aggregate}(\{ h_v^{(k-1)}, \forall v \in N(u) \cup \{u\} \}).   (39)

As mentioned before, GNN models have all kinds of variants, which try to improve performance and robustness to some extent. However, regardless of the variant, they all follow the neural message passing framework of Eq. (34) examined earlier. In the following two sections, Sections VII-D and VII-E, some representative improvements on the two main components of the basic GNN, the aggregation operation and the update operation, are reviewed in detail.
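The sketch below shows one round of this basic message passing, i.e., the summing aggregation of Eq. (37) followed by the linear-combination update of Eq. (38). The dense adjacency matrix and the dimensions are illustrative assumptions rather than settings from any particular published model.

```python
import torch
import torch.nn as nn

class BasicGNNLayer(nn.Module):
    """One basic GNN layer: h_u <- sigma(W_self h_u + W_neigh * sum_{v in N(u)} h_v + b)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_self = nn.Linear(in_dim, out_dim, bias=False)
        self.w_neigh = nn.Linear(in_dim, out_dim, bias=True)

    def forward(self, A, H):
        m = A @ H                                              # Aggregate (Eq. 37): sum over neighbors
        return torch.relu(self.w_self(H) + self.w_neigh(m))   # Update (Eq. 38)

# Two stacked layers give each node a two-hop receptive field.
A = (torch.rand(20, 20) > 0.8).float()
H0 = torch.rand(20, 16)
layer1, layer2 = BasicGNNLayer(16, 32), BasicGNNLayer(32, 32)
Z = layer2(A, layer1(A, H0))                                   # final embeddings z_u (Eq. 35)
```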
D. Generalized aggregation operation

In general, the Aggregation operation in GNN models has received the most attention from the literature, as a large number of researchers have proposed novel architectures or variations based on the original GNN model.

1) Neighborhood normalization: As previously stated, the most basic neighborhood aggregation operation, shown in Eq. (37), solely computes the sum of the neighborhood's embedding states. The main problem with this approach is that it can be unstable and highly sensitive to node degrees, since nodes with a large degree tend to receive a larger total value from their many neighbors than nodes with fewer neighbors. One typical and simple solution to this issue is to normalize the aggregation operation based on the degree of the central node. The simplest approach is to take an average rather than a sum, as in Eq. (40),

m_{N(u)} = \frac{ \sum_{v \in N(u)} h_v }{ |N(u)| },   (40)

but methods with other normalization factors based on similar ideas have been proposed and achieve remarkable performance gains, such as the following symmetric normalization employed by Kipf et al. [70] in the GCN model, shown in Eq. (41):

m_{N(u)} = \sum_{v \in N(u)} \frac{ h_v }{ \sqrt{ |N(u)| \, |N(v)| } }.   (41)

Graph convolutional networks (GCNs). One of the most popular and effective baseline GNN variants is the graph convolutional network (GCN) [70], which is inspired by [91] and [92]. GCN makes full use of the neighborhood-normalized aggregation technique as well as the self-loop update operation. Therefore, the GCN model defines the update operation function as shown in Eq. (42). No aggregation operation is defined, since it is implicitly defined within the update operation function:

h_u^{(k)} = \sigma\Big( W^{(k)} \sum_{v \in N(u) \cup \{u\}} \frac{ h_v }{ \sqrt{ |N(u)| \, |N(v)| } } \Big).   (42)

There exist a great number of GCN variants that enhance SSL performance from different aspects. Li et al. [93] are the first to

provide deep insights into GCN's success and failure on SSL tasks. Later on, extensions of GCN for SSL began to proliferate. Jiang et al. [94] explore how to do graph construction based on GCN. Yang et al. [95] combine the classic graph regularization methods with GCN. Abu et al. [96] present a novel N-GCN which marries the random walk with GCN, and a follow-up work, GIL [97], with similar ideas is proposed as well. Other research work on GCN extensions can be found in [98] [99] [100] [101] [102] [103] [104] [105] [106].

MixHop. GCN often fails to learn a generalized class of neighborhood mixing relationships. In order to overcome this limitation, MixHop [75] is proposed to learn these relationships by repeatedly mixing feature representations of neighbors at various distances. GCN's aggregation operator in matrix form is defined as

H^{(k)} = \sigma\big( A H^{(k-1)} W^{(k)} \big),   (43)

where H^{(k-1)} and H^{(k)} are the input and output hidden embedding matrices of layer k. MixHop replaces the Graph Convolution (GC) layer defined in Eq. (43) with

H^{(k)} = \big\Vert_{j \in P} \, \sigma\big( A^j H^{(k-1)} W_j^{(k)} \big),   (44)

where the hyper-parameter P is a set of integer adjacency powers and ‖ denotes column-wise concatenation. Specifically, by setting P = {1}, it exactly recovers the original GC layer. In fact, MixHop is interested in higher-order message passing, where each node receives latent representations from its immediate (one-hop) neighbors and from further N-hop neighbors.

Simple graph convolution networks (SGC). GCNs inherit unnecessary complexity and redundant computation, as they derive their inspiration from deep learning methods. Wu et al. [76] reduce this excess complexity by eliminating the nonlinearities between the GCN layers and collapsing the original nonlinear function into the simple mapping defined in Eq. (45). More importantly, these simplifications do not harm the prediction performance in many downstream applications:

h_u^{(k)} = \sigma\Big( \sum_{v \in N(u) \cup \{u\}} \frac{ h_v }{ \sqrt{ |N(u)| \, |N(v)| } } \Big).   (45)

Differentiable group normalization (DGN). To further mitigate the over-smoothing issue in GCN, DGN [77] applies a new operation between successive graph convolutional layers. Taking each embedding matrix H^{(k)} generated by the k-th graph convolutional layer as the input, DGN assigns each node to different groups and normalizes them independently to output a new embedding matrix for the next layer. Formally, we have

H^{(k+1)} = H^{(k)} + \lambda \sum_{i=1}^{g} \Big( \gamma_i \, \frac{ s_i \circ H^{(k)} - \mu_i }{ \delta_i } + \beta_i \Big),   (46)

where H^{(k)} is the k-th hidden embedding matrix, s_i is the similarity measure, and g is the total number of groups. In particular, µ_i and δ_i denote the running mean and standard deviation vectors of group i, respectively, and γ_i and β_i denote the trainable scale and shift vectors, respectively.
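As an illustration of this line of simplification, the following sketch applies the symmetric normalization of Eq. (41) and the SGC-style idea of collapsing propagation into a preprocessing step followed by a single linear classifier. The toy graph, the choice K = 2, and all sizes are assumptions for illustration, not the settings used in [76].

```python
import torch

def sym_norm(A):
    """Symmetric normalization with self-loops, in the spirit of Eq. (41)."""
    A_hat = A + torch.eye(A.size(0))
    d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A_hat * d_inv_sqrt.unsqueeze(0)

# Propagate features K times with the normalized adjacency (no nonlinearity
# in between), then fit one linear classifier on the few labelled nodes.
A = (torch.rand(100, 100) > 0.9).float(); A = ((A + A.t()) > 0).float()
X, y = torch.rand(100, 16), torch.randint(0, 3, (100,))
train_idx = torch.arange(20)                      # labelled nodes (SSL setting)

S = sym_norm(A)
X_prop = X
for _ in range(2):                                # K = 2 propagation steps
    X_prop = S @ X_prop

W = torch.zeros(16, 3, requires_grad=True)
opt = torch.optim.Adam([W], lr=0.1)
for _ in range(100):
    loss = torch.nn.functional.cross_entropy(X_prop[train_idx] @ W, y[train_idx])
    opt.zero_grad(); loss.backward(); opt.step()
pred = (X_prop @ W).argmax(dim=1)                 # labels for all nodes, incl. unlabelled
```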
2) Pooling: The aggregation operation is essentially a mapping from a set of neighborhood embeddings to a single vector that encodes information about the local structure and the features of the neighboring nodes. In the previously reviewed settings of GNN models, the mapping function in the aggregation operation is simply a basic summation or a linear function over the neighbor embeddings. Some more sophisticated and successful mapping functions used in the aggregation setting are reviewed in this section.

Set pooling. In fact, according to Wu et al. [78], one principal approach for designing an aggregation function is based on the theory of permutation invariant neural networks. Generally, the permutation invariance property deals with problems concerning a set of objects: the target value for a given set is the same regardless of the order of the objects in the set. A typical example of a permutation invariant model is a convolutional neural network that performs a pooling operation over the embeddings extracted from a set's elements. Permutation invariance on graphs in general means that the aggregation function does not depend on the arbitrary order of the rows/columns in the adjacency matrix. For example, Zaheer et al. [107] show that an aggregation function with the following form can be considered a universal set function approximator:

m_{N(u)} = \mathrm{MLP}_\theta\Big( \sum_{v \in N(u)} \mathrm{MLP}_\phi(h_v) \Big).   (47)

Janossy pooling. Another alternative method, called Janossy pooling, also enhances the aggregation operation and is possibly more expressive than simply taking a sum or mean of the neighbor embeddings as in the basic GNN. Janossy pooling [79] uses a completely different approach: instead of a permutation invariant reduction (e.g., a sum or a mean), a permutation-sensitive function is applied, and the outcome is averaged over many potential permutations. Let π_i ∈ Π denote a permutation function that maps the set {h_v, ∀v ∈ N(u)} to a specific sequence (h_{v_1}, h_{v_2}, ..., h_{v_{|N(u)|}})_{π_i}. In other words, π_i ∈ Π takes the unordered set of neighbor embedding states and puts them in a sequence based on some arbitrary ordering. The Janossy pooling approach then performs the neighborhood aggregation operation by Eq. (48),

m_{N(u)} = \mathrm{MLP}_\theta\Big( \frac{1}{|\Pi|} \sum_{\pi \in \Pi} \rho_\phi\big( (h_{v_1}, h_{v_2}, \ldots, h_{v_{|N(u)|}})_{\pi} \big) \Big),   (48)

where Π denotes a collection of permutations and ρ_φ is a permutation-sensitive function, e.g., a neural network that operates on sequential data. In practice, ρ_φ is usually an LSTM, since LSTMs are known to be a powerful architecture for sequences.
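A minimal sketch of Eq. (48) is given below; it approximates the average over Π by sampling a few random permutations, uses an LSTM as the permutation-sensitive function ρ_φ, and a single linear layer as MLP_θ. The dimensions and the number of sampled permutations are assumptions for illustration.

```python
import torch
import torch.nn as nn

class JanossyPooling(nn.Module):
    """Permutation-sensitive aggregation averaged over sampled permutations (Eq. 48)."""
    def __init__(self, dim, num_perms=8):
        super().__init__()
        self.rho = nn.LSTM(input_size=dim, hidden_size=dim, batch_first=True)
        self.theta = nn.Linear(dim, dim)
        self.num_perms = num_perms

    def forward(self, neighbor_states):                # (num_neighbors, dim)
        outs = []
        for _ in range(self.num_perms):                # sample a few orderings of N(u)
            perm = torch.randperm(neighbor_states.size(0))
            seq = neighbor_states[perm].unsqueeze(0)   # (1, num_neighbors, dim)
            _, (h_last, _) = self.rho(seq)             # final LSTM state plays rho_phi
            outs.append(h_last.squeeze(0).squeeze(0))
        return self.theta(torch.stack(outs).mean(dim=0))   # MLP_theta of the average

agg = JanossyPooling(dim=16)
m_u = agg(torch.rand(5, 16))                           # message m_N(u) for one node u
```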
3) Neighborhood attention: A common approach for enhancing the aggregation layer in GNNs is to incorporate an attention mechanism [108], in addition to the more general forms of set aggregation. The basic principle is to assign a weight or importance value to each neighbor, which is used during the aggregation phase to weigh this neighbor's effect.

GAT. The first GNN model to apply this style of attention was the Graph Attention Network (GAT) of Veličković et al. [80], which uses attention weights to define a weighted sum of the neighbors:

m_{N(u)} = \sum_{v \in N(u)} \alpha_{u,v} h_v,   (49)

where α_{u,v} denotes the attention on neighbor v ∈ N(u) when we are aggregating information at node u. In the original GAT paper, the attention weights are defined as

\alpha_{u,v} = \frac{ \exp\big( a^T [ W h_u \oplus W h_v ] \big) }{ \sum_{v' \in N(u)} \exp\big( a^T [ W h_u \oplus W h_{v'} ] \big) },   (50)

where a is a trainable attention vector, W is a trainable matrix, and ⊕ denotes the concatenation operation. A similar parallel work is AGNN [81], which reduces the number of parameters in the GNN with the help of attention mechanisms.
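A single-head, dense-adjacency sketch of the attention weights of Eqs. (49)-(50) is given below. It omits the LeakyReLU and multi-head machinery of the full GAT model, and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SingleHeadAttentionLayer(nn.Module):
    """Attention-weighted neighborhood aggregation in the style of Eqs. (49)-(50)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Parameter(torch.randn(2 * out_dim))

    def forward(self, A, H):
        Wh = self.W(H)                                        # (N, out_dim)
        N = Wh.size(0)
        # score[u, v] = a^T [W h_u (+) W h_v] for every node pair (u, v)
        scores = torch.cat([Wh.unsqueeze(1).expand(N, N, -1),
                            Wh.unsqueeze(0).expand(N, N, -1)], dim=-1) @ self.a
        scores = scores.masked_fill(A == 0, float('-inf'))    # keep only v in N(u)
        alpha = torch.softmax(scores, dim=1)                  # Eq. (50)
        alpha = torch.nan_to_num(alpha)                       # isolated nodes -> zero row
        return alpha @ H                                      # Eq. (49): weighted sum

A = (torch.rand(20, 20) > 0.7).float()
H = torch.rand(20, 16)
messages = SingleHeadAttentionLayer(16, 8)(A, H)              # m_N(u) for every node
```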
E. Generalized update operation

As already noted in Section VII-D, much research focuses on the generalized aggregate operation. This was especially the case after the GraphSAGE framework [84], which implements the idea of generalized neighborhood aggregation. This section concentrates on the more diversified Update operation, which also makes the embeddings more suitable for SSL tasks.

1) Concatenation and skip-connections: Over-smoothing is a major issue for GNNs. Over-smoothing is almost inevitable after many message passing iterations, when the node-specific information becomes "washed away". In such cases, the updated node representations depend too heavily on the incoming message aggregated from the neighbors, at the cost of the node's hidden states from previous layers. One reasonable way to mitigate this problem is to use vector concatenations or skip connections, which aim to retain information directly from previous rounds of the update.

These techniques can be used in combination with several other update operation methods for the GNN. For general purposes, Update_base denotes the simple update rule that is built upon. For instance, the Update_base function can be taken as the one shown in Eq. (38) for the basic GNN.

GraphSAGE. One of the simplest skip-connection updates is GraphSAGE [84], which uses vector concatenation to retain more node-level information during the message passing process:

\mathrm{Update}( h_u, m_{N(u)} ) = \big[ \mathrm{Update}_{base}( h_u, m_{N(u)} ) \oplus h_u \big],   (51)

where the output of the simple update function is concatenated with the node's representation from the previous layer. The core intuition is that the model is encouraged to disentangle information during the message passing.

Column Network (CLN). Besides concatenation methods, other forms of skip-connections can also be applied, such as the linear interpolation method proposed by Pham et al. [82],

\mathrm{Update}( h_u, m_{N(u)} ) = \alpha_1 \circ \mathrm{Update}_{base}( h_u, m_{N(u)} ) + \alpha_2 \circ h_u,   (52)

where α_1, α_2 ∈ [0, 1]^d are gating vectors with α_1 + α_2 = 1 and ◦ denotes the Hadamard product. In this method, the final update is a linear interpolation between the previous output and the current output, modulated by the information in the neighborhood.

Scattering GCN. The most recent work tackling the over-smoothing problem in GNNs is Scattering GCN [83], which uses a geometric scattering transformation that enables band-pass filtering of graph signals. Geometric scattering was originally introduced in the context of whole-graph classification and consists of aggregating scattering features. Similar and concurrent work is DropEdge [85], which removes a certain portion of edges from the given graph at each training epoch, acting like a data augmenter and thus alleviating both the over-smoothing and over-fitting issues at the same time.

These strategies are also beneficial for node classification tasks with relatively deep GNNs in a semi-supervised setting, and they excel at SSL tasks where the prediction at each node is closely correlated with the characteristics of its local neighborhood.

2) Gated updates: In parallel to the above-mentioned work, researchers have also taken inspiration from the approaches used by recurrent neural networks (RNNs) to improve stability. One way to interpret the GNN message passing algorithm is that the aggregation operation collects an observation from the neighbors, which then updates the hidden state of each node. From this perspective, methods for updating the hidden state of RNN architectures can be applied directly based on this observation.

GatedGNN. For example, one of the earliest GNN variants to put this idea into practice is proposed by Li et al. [86], in which the update operation is defined as shown in Eq. (53),

h_u^{(k)} = \mathrm{GRU}\big( h_u^{(k-1)}, m_{N(u)}^{(k)} \big),   (53)

where GRU is the gating mechanism for recurrent neural networks introduced by Cho et al. [109]. Another approach, NeuroSAT [87], has employed updates based on the LSTM architecture as well.
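The gated update of Eq. (53) can be sketched as follows, where the aggregated message plays the role of the GRU input and the previous node state plays the role of its hidden state. The dense adjacency, the sum aggregation, and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Gated update of Eq. (53): one shared GRU cell updates every node state
# from the message aggregated over its neighborhood.
dim = 16
gru = nn.GRUCell(input_size=dim, hidden_size=dim)

A = (torch.rand(30, 30) > 0.8).float()
H = torch.rand(30, dim)                    # h_u^{(k-1)} for all nodes
for _ in range(4):                         # a few message passing rounds
    M = A @ H                              # m_N(u): summed neighbor states
    H = gru(M, H)                          # h_u^{(k)} = GRU(h_u^{(k-1)}, m_N(u))
```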
3) Jumping knowledge (JK) connections: In the previous sections, it is implicitly assumed that the last layer's output is the final embedding result. In other words, the node representations used for a downstream job, such as an SSL task, are identical to the node embeddings of the final GNN layer. Formally, it is presumed that

z_u = h_u^{(K)}, \forall u \in V.   (54)

JK Net. A complementary approach to improve the quality of the final node representations is to use a combination of the representations from every layer of message passing, rather than merely the final layer's output. More formally,

z_u = f_{JK}\big( h_u^{(0)} \oplus h_u^{(1)} \oplus \ldots \oplus h_u^{(K)} \big).   (55)

This technique, called jumping knowledge connections, is originally introduced and tested by Xu et al. [88]. The f_{JK} function can be the identity function for various applications, meaning that a simple concatenation is essentially performed among the node embeddings from each

layer, but Xu et al. [88] also think about other possibilities The latest interesting work based on GSSL in CV is related
such as max pooling. This method also leads to significant to domain adaptation by He et al. [116]. In this work, a novel
progress over a wide range of tasks like SSL classification idea of using graph-based manifold representation to do visual-
and is usually regarded as an effective strategy. audio transformation is proposed and examined.
InfoGraph*. Previous GNN models appear to have low 2) Natural language processing: Amarnag et al. [117] first
generalization performance because of the model’s crafted introduce GSSL into traditional natural language processing
styles. To lessen the downgrade of generalization performance (NLP) tasks and make pioneering work on part-of-speech
in the testing phase, InfoGraph* [89] maximizes the mutual (POS) tagging based on random fields [24]. The proposed
information between the embeddings learned by popular su- algorithm uses a similarity graph to encourage similar n-grams
pervised learning methods and unsupervised representations to have similar POS tags. Later, Aliannejadi et al. [118] and
learned by InfoGraph, where the given graph is encoded to Qiu et al. [59] extend this work and use some GCN-based
produce its corresponding feature map by jumping knowledge methods [70] to make the model more robust on other various
concatenation. The encoder then learns from unlabeled sam- natural language understanding (NLU) tasks.
ples while preserving the implicit semantic information that More recent works on how to combine GSSL and NLP tasks
the downstream task favors. center around graph smoothing problems. Mei et al. [119]
propose a brand-new general optimization framework for
VIII. A PPLICATIONS smoothing language models with graph structures, which can
A. Datasets further enhance the performance of information retrieval tasks.
By constructing a similarity graph of documents and words,
We summarize the commonly-used datasets in graph-based various types of GSSL methods can be performed.
semi-supervised learning according to seven different domains, Unlike the work [119] which studies the long texts, Hu
namely citation networks, co-purchase networks, web page et al. [120] focus on short texts in which the labeled data is
citation networks, and others. As shown in Table VII in sparse and limited. In particular, a flexible model based on
Appendix A, the summary results on selected benchmark GAT [80] with a dual-level attention mechanism is presented.
datasets on GSSL are listed with their respective statistical 3) Social networks: A social network is a set of people with
analysis. some pattern of interactions or “ties” between them and has
graph-structured data explicitly. As is known to all, Twitter is
B. Open-source Implementations one of the most famous and large-scale social networks, so
Here we list some open-source implementations for GSSL various meaningful and interesting tasks based on GSSL can
in Table VIII in Appendix B. be performed on the Twitter dataset. Alam et al. [121] adopt a
graph-based deep learning framework by Yang et al. [122] for
learning an inductive semi-supervised model to classify tweets
C. Domains
in a crisis situation. Later, Anand et al. [123] improve classic
GSSL has a large number of successful applications across GSSL methods to detect fake users from a large volume of
various domains. Some domains have graph-structure data in Twitter data.
nature, while others do not. The former ones would be scenar- Another popular topic in social networks is related to
ios where raw data samples have explicit relational structure POI recommendations, such as friend recommendation and
and can be easily constructed into a graph, such as traffic net- follower suggestion. Yang et al. [124] propose a general
work in the cyber-physical systems (CPS), molecular structure SSL framework to alleviate data scarcity via smoothing
in biomedical engineering, and friend recommendation in so- among users and POIs in the bipartite graph. Moreover, Chen
cial networks. From the non-graph-structured data, however, a et al. [113] employ a user profiling approach to boost the
graph cannot be extracted directly. Typical examples would be performance of POI recommendation based heterogeneous
more common scenarios, like image classification in computer graph attention networks by Wang et al. [125].
vision and text classification in NLP. 4) Biomedical science: Graphs are also ubiquitous in the
1) Computer vision: Among many computer vision tasks area of biomedical science, such as the semantic biomedical
(CV), hyperspectral image classification (HSI) is a represen- knowledge graphs, molecular graphs for drugs, and protein-
tative example for GSSL applications. For one thing, labeled drug interaction for drug proposals. Doostparast et al. [126]
data in HSI is costly and scarce. For another, among all popular use GSSL methods with genomic data integration to do phe-
SSL methods, classic GSSL methods have elegant closed-form notype classification tasks. In the meantime, Luo et al. [127]
solutions and are easy to implement. Shao et al. [110] [111] provide a new graph regularization framework in heteroge-
propose a spatial and class structure regularized sparse repre- neous networks to predict human miRNA-disease [128]. Other
sentation graph for semi-supervised HSI classification. Later, typical applications are disease diagnosis [129], medical image
Fang et al. [112] extend this work [111] by providing a more segmentation [130] and medical Image classification [131].
scalable algorithm based on anchor graph [113].
Pedronette et al. [114] also improve the KNN-based graph
IX. O PEN PROBLEM
construction methods [8] in Section III-A1 to facilitate image
retrieval. Another similar idea by Shi et al. [115] is to use a Here a chronological overview of the mentioned representa-
temporal graph to assist image analysis. tive methods in this survey is provided in Figure 6. From 2000
18

to 2012, the mainstream algorithm was centered around graph on how to generate a successful attack, [149] and [150] defend
regularization and matrix factorization in the early years. After GNN models from these adversarial attacks. Other relevant
the resurgence of deep learning in 2015, the field witnessed works via various approaches [151] [152] [153] can be found
the emergence of AutoEncoder-based methods while, in the as well.
meantime, random-walk-based methods coexisted. However,
with the introduction of GCN [70] in 2017, the GNN-based X. C ONCLUSION
method became the dominant solution and still is a heated To sum up, we conduct a comprehensive review of graph-
topic now. Based on this trend, we list four potential research based semi-supervised learning. A new taxonomy is proposed,
topics. in which all the popular label inference methods are grouped
into two main categories: graph regularization and graph
A. Dynamicity embedding. Moreover, they can be generalized within the
Conventional GSSL methods reviewed in the preceding sec- regularization framework and the encoder-decoder framework
tions all treat the graph as a fixed observation. Ma et al. [132] respectively. Then a wide range of applications of GSSL are
is the first to apply generative models on GSSL. By viewing introduced along with relevant datasets, open-source codes for
the graph as a random variable, the generated joint distribution some of the GSSL methods. A chronological overview of these
can extract more general relationships among attributes, labels, representative GSSL methods is presented in the Appendix.
and the graph structure. Moreover, this kind of model is Finally, four open problems for future research directions are
more robust to missing data. Some latest follow-up works are discussed as well.
[133] [134] and [135]. The most up-to-date work is [136],
which provides a multi-source uncertainty framework for GNN A PPENDIX A
and considers different types of uncertainties associated with DATASETS COLLECTION FOR GSSL
class probabilities.
A PPENDIX B
B. Scalability O PEN - SOURCE I MPLEMENTATIONS
Another open problem is how to make GSSL methods scal- A PPENDIX C
able when the input graph size increases rapidly. Pioneering C HRONOLOGICAL OVERVIEW OF GSSL
work has been done by Liu et al. [137] [138], in which a novel Here a chronological overview of the mentioned representa-
graph is constructed with data points and anchor, namely, tive methods in this survey is provided in Figure 6. From 2000
anchor graph regularization (AGR). Many successful follow- to 2012, the mainstream algorithm was centered around graph
up works are proposed, such as [139] [140] for graph regu- regularization and matrix factorization in the early years. After
larization methods and [141] [142] for GNN methods. These the resurgence of deep learning in 2015, the field witnessed
methods focus more on computational complexity instead of the emergence of AutoEncoder-based methods while, in the
classification accuracy. It is worth noting that the most up- meantime, random-walk-based methods coexisted. However,
to-date work is GBP [143] which invents a new localized with the introduction of GCN [70] in 2017, the GNN-based
bidirectional propagation process from both the feature vectors method became the dominant solution and still is a heated
and the nodes. topic now.

C. Noise-resilience R EFERENCES
Graphs with noise or missing attributes are also heated [1] A. Subramanya and P. P. Talukdar, Graph-Based Semi-Supervised
topics. Most existing GSSL methods end up fully trusting the Learning, ser. Synthesis Lectures on Artificial Intelligence and Machine
given few labels, but in real life, these labels are highly reliable Learning. Morgan & Claypool Publishers, 2014.
[2] X. Zhu, “Semi-supervised learning literature survey,” Computer Sci-
as they are often produced by humans prone to mistakes. ences, University of Wisconsin-Madison, Tech. Rep. 1530, 2005.
Stretcu et al. [144] propose Graph Agreement Models (GAM), [3] N. N. Pise and P. Kulkarni, “A survey of semi-supervised learning
which introduces an auxiliary model that predicts the proba- methods,” in 2008 International Conference on Computational Intelli-
gence and Security, vol. 2, 2008, pp. 30–34.
bility of two nodes sharing the same label. This co-training [4] V. J. Prakash and L. M. Nithya, “A survey on semi-supervised learning
approach makes the model more resilient to noise. Zhao techniques,” CoRR, vol. abs/1402.4645, 2014.
et al. [136] consider different types of uncertainties associated [5] J. E. Van Engelen and H. H. Hoos, “A survey on semi-supervised
learning,” Machine Learning, vol. 109, no. 2, pp. 373–440, 2020.
with class probabilities in the real case scenario. Similar [6] Y. Ouali, C. Hudelot, and M. Tami, “An overview of deep semi-
latest works with the same goal but different approaches supervised learning,” arXiv preprint arXiv:2006.05278, 2020.
are [104] [145] [146]. In addition, Dunlop et al. [147] provide [7] Y. Chong, Y. Ding, Q. Yan, and S. Pan, “Graph-based semi-supervised
learning: A review,” Neurocomputing, vol. 408, pp. 216 – 230, 2020.
theoretical insights into this topic. [8] O. Chapelle, B. Schlkopf, and A. Zien, Semi-Supervised Learning,
1st ed. The MIT Press, 2010.
[9] T. Jebara, J. Wang, and S. Chang, “Graph construction and b-matching
D. Attack-robustness for semi-supervised learning,” in ICML, ser. ACM International Con-
Robustness is always a common concern in machine learn- ference Proceeding Series, vol. 382. ACM, 2009, pp. 441–448.
[10] K. Ozaki, M. Shimbo, M. Komachi, and Y. Matsumoto, “Using the
ing systems. Liu et al. [148] first propose a general framework mutual k-nearest neighbor graphs for semi-supervised classification on
for data poisoning attacks to GSSL. While [148] focuses more natural language data,” in CoNLL. ACL, 2011, pp. 154–162.
19

Table VII: Summary of selected benchmark datasets on GSSL.

Category Dataset # Nodes # Edges # Features # Classes Source

Cora 2,708 5,429 1,433 7 [154]


Citeseer 3,327 4,732 3,703 6 [154]
Citation networks
Pubmed 19,717 44,338 500 3 [154]
DBLP 29,199 133,664 0 4 [155]

Amazon Computers 13,752 245,861 767 10 [156]


Amazon Photo 7,650 119,081 745 8 [156]
Co-purchase networks
Coauthor CS 18,333 81,894 6,805 15 [156]
Coauthor Physics 34,493 247,962 8,415 5 [156]

Cornell 195 286 1,703 5 [157]


Texas 187 298 1,703 5 [157]
Webpage citation networks
Washington 230 417 1,703 5 [157]
Wsicsonsin 265 479 1,703 5 [157]

Zachary’s karate club 34 77 0 2 [158]


Reddit 232965 11606919 602 41 [84]
Social networks BlogCatalog 10312 333983 0 39 [159]
Flickr 1,715,256 22,613,981 0 5 [160]
Youtube 1,138,499 2,990,443 0 47 [160]

Language networks Wikipedia 1,985,098 1,000,924,086 0 7 [161]

PPI 56944 818716 50 121 [162]


NCI-1 29 32 37 2 [163]
MUTAG 17 19 7 2 [164]
Bio-chemical
D&G 284 715 82 2 [165]
PROTEIN 39 72 4 2 [166]
PTC 25 19 2 [167]

Knowledge graph Nell 65755 266144 61278 210 [168]

MNIST 10,000 65,403 128 10 [169]


Image SVHN 10,000 68,844 128 10 [170]
CIFAR10 10,000 70,391 128 10 [171]

[11] D. A. Vega-Oliveros, L. Berton, A. M. Eberle, A. de Andrade Lopes, 238 – 248, 2017.


and L. Zhao, “Regular graph construction for semi-supervised learn- [18] L. Zhuang, Z. Zhou, S. Gao, J. Yin, Z. Lin, and Y. Ma, “Label
ing,” in Journal of physics: Conference series, vol. 490. IOP information guided graph construction for semi-supervised learning,”
Publishing, 2014, p. 012022. IEEE Transactions on Image Processing, vol. 26, no. 9, pp. 4182–4192,
[12] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by 2017.
locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, [19] F. Taherkhani, H. Kazemi, and N. M. Nasrabadi, “Matrix completion
2000. for graph-based deep semi-supervised learning,” in Proceedings of the
[13] P. S. Dhillon, P. P. Talukdar, and K. Crammer, “Learning better data AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 5058–
representation using inference-driven metric learning,” in ACL (Short 5065.
Papers). The Association for Computer Linguistics, 2010, pp. 377– [20] D. Zhou and B. Schölkopf, “A regularization framework for learning
381. from graph data,” in ICML 2004 Workshop on Statistical Relational
[14] M. H. Rohban and H. R. Rabiee, “Supervised neighborhood graph Learning and Its Connections to Other Fields (SRL 2004), 2004, pp.
construction for semi-supervised classification,” Pattern Recognition, 132–137.
vol. 45, no. 4, pp. 1363 – 1372, 2012. [21] R. K. Ando and T. Zhang, “Learning on graph with laplacian regular-
[15] L. Berton and A. d. A. Lopes, “Graph construction based on labeled ization,” in Advances in neural information processing systems, 2007,
instances for semi-supervised learning,” in 2014 22nd International pp. 25–32.
Conference on Pattern Recognition, 2014, pp. 2477–2482. [22] J. Calder and D. Slepčev, “Properly-weighted graph laplacian for semi-
[16] L. Berton and A. de Andrade Lopes, “Graph construction for semi- supervised learning,” Applied Mathematics & Optimization, pp. 1–49,
supervised learning,” in Twenty-Fourth International Joint Conference 2019.
on Artificial Intelligence, 2015. [23] F. Hoffmann, B. Hosseini, Z. Ren, and A. M. Stuart, “Consistency
[17] L. Berton, T. de Paulo Faleiros, A. Valejo, J. Valverde-Rebaza, and of semi-supervised learning algorithms on graphs: Probit and one-hot
A. de Andrade Lopes, “Rgcli: Robust graph that considers labeled methods,” Journal of Machine Learning Research, vol. 21, no. 186, pp.
instances for semi-supervised learning,” Neurocomputing, vol. 226, pp. 1–55, 2020.

Table VIII: A Summary of Open-source Implementations

Model Project Link

GRF [24] https://github.com/parthatalukdar/junto


Graph Regularization Label Propagation
LRC [25] https://github.com/provezano/lgc
Laplacian Eigenmaps [44] https://github.com/thunlp/OpenNE
Graph Factorization [45] https://github.com/thunlp/OpenNE
Matrix Factorization GraRep [46] https://github.com/benedekrozemberczki/GraRep
HOPE [47] https://github.com/ZW-ZHANG/HOPE
M-NMF [48] https://github.com/benedekrozemberczki/M-NMF
DeepWalk [51] https://github.com/phanein/deepwalk
Shallow Graph Embedding Planetoid [52] https://github.com/kimiyoung/planetoid
node2vec [56] https://github.com/eliorc/node2vec
Random Walk
LINE [53] https://github.com/tangjianpku/LINE
PTE [54] https://github.com/mnqu/PTE
HARP [57] https://github.com/GTmac/HARP
NetMF [58] https://github.com/xptree/NetMF
Hybid
NetSMF [59] https://github.com/xptree/NetSMF
SDNE [61] https://github.com/suanrong/SDNE
DNGR [62] https://github.com/ShelsonCao/DNGR
DRNE [64] https://github.com/tadpole/DRNE
AutoEncoder
GAE [65] https://github.com/tkipf/gae
VGAE [65] https://github.com/DaehanKim/vgae pytorch
ARGA [66] https://github.com/Ruiqi-Hu/ARGA
GCN [70] https://github.com/tkipf/gcn
MixHop [75] https://github.com/samihaija/mixhop
SGC [76] https://github.com/Tiiiger/SGC
DGN [77] https://github.com/Kaixiong-Zhou/DGN
Deep Graph Embedding Janossy pooling [79] https://github.com/PurdueMINDS/JanossyPooling
GAT [80] https://github.com/PetarV-/GAT
AGNN [81] https://github.com/dawnranger/pytorch-AGNN
GNN GraphSage [84] https://github.com/williamleif/GraphSAGE
DropEdge [85] https://github.com/DropEdge/DropEdge
Column networks [82] https://github.com/trangptm/Column networks
Scattering GCN [83] https://github.com/dms-net/scatteringGCN
GGNN [86] https://github.com/yujiali/ggnn
NeuroSAT [87] https://github.com/dselsam/neurosat
JK Networks [88] https://github.com/mori97/JKNet-dgl
InfoGraph* [89] https://github.com/fanyun-sun/InfoGraph

[24] X. Zhu, Z. Ghahramani, and J. D. Lafferty, “Semi-supervised learning [28] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization:
using gaussian fields and harmonic functions,” in Proceedings of the A geometric framework for learning from labeled and unlabeled
20th International conference on Machine learning (ICML-03), 2003, examples,” J. Mach. Learn. Res., vol. 7, pp. 2399–2434, 2006.
pp. 912–919. [29] C. Gong, T. Liu, D. Tao, K. Fu, E. Tu, and J. Yang, “Deformed graph
[25] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, laplacian for semisupervised learning,” IEEE Transactions on Neural
“Learning with local and global consistency,” in Advances in neural Networks and Learning Systems, vol. 26, no. 10, pp. 2261–2274, 2015.
information processing systems, 2004, pp. 321–328. [30] J. Calder, B. Cook, M. Thorpe, and D. Slepcev, “Poisson learning:
[26] D. Slepcev and M. Thorpe, “Analysis of p-laplacian regularization in Graph based semi-supervised learning at very low label rates,” in ICML,
semisupervised learning,” SIAM Journal on Mathematical Analysis, ser. Proceedings of Machine Learning Research, vol. 119. PMLR,
vol. 51, no. 3, pp. 2085–2120, 2019. 2020, pp. 1306–1316.
[27] D. Zhou, J. Huang, and B. Schölkopf, “Learning from labeled and [31] X. Zhu and Z. Ghahramani, “Learning from labeled and unlabeled data
unlabeled data on a directed graph,” in Proceedings of the 22nd with label propagation,” Carnegie Mellon University, Tech. Rep., 2002.
International Conference on Machine Learning, ser. ICML ’05. New [32] A. Iscen, G. Tolias, Y. Avrithis, and O. Chum, “Label propagation for
York, NY, USA: Association for Computing Machinery, 2005, p. deep semi-supervised learning,” in Proceedings of the IEEE conference
1036–1043. on computer vision and pattern recognition, 2019, pp. 5070–5079.

Figure 6: Chronological overview of representative GSSL methods. Green indicates graph regularization methods. Red indicates matrix-factorization-based methods. Orange indicates random-walk-based methods. Cyan indicates AutoEncoder-based methods. Blue indicates GNN-based methods. Models shown above the chronological axis involve deep learning techniques, while those below the chronological axis do not. (Better viewed in color.)

[33] Z. Xu, I. King, M. R. Lyu, and R. Jin, “Discriminative semi-supervised locally linear embedding,” science, vol. 290, no. 5500, pp. 2323–2326,
feature selection via manifold regularization,” IEEE Trans. Neural 2000.
Networks, vol. 21, no. 7, pp. 1033–1047, 2010. [44] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques
[34] A. Argyriou, C. A. Micchelli, and M. Pontil, “When is there a rep- for embedding and clustering,” in Advances in neural information
resenter theorem? vector versus matrix regularizers,” J. Mach. Learn. processing systems, 2002, pp. 585–591.
Res., vol. 10, pp. 2507–2529, 2009. [45] A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and
[35] P. Niyogi, “Manifold regularization and semi-supervised learning: some A. J. Smola, “Distributed large-scale natural graph factorization,” in
theoretical analyses,” J. Mach. Learn. Res., vol. 14, no. 1, pp. 1229– Proceedings of the 22nd international conference on World Wide Web,
1250, 2013. 2013, pp. 37–48.
[36] A. Talwalkar, S. Kumar, M. Mohri, and H. A. Rowley, “Large-scale [46] S. Cao, W. Lu, and Q. Xu, “Grarep: Learning graph representations
SVD and manifold learning,” J. Mach. Learn. Res., vol. 14, no. 1, pp. with global structural information,” in Proceedings of the 24th ACM in-
3129–3152, 2013. ternational on conference on information and knowledge management,
2015, pp. 891–900.
[37] Y. Zhang, X. Zhang, X. Yuan, and C. Liu, “Large-scale graph-based
semi-supervised learning via tree laplacian solver,” in AAAI. AAAI [47] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu, “Asymmetric transi-
Press, 2016, pp. 2344–2350. tivity preserving graph embedding,” in Proceedings of the 22nd ACM
SIGKDD international conference on Knowledge discovery and data
[38] X. Chang, S. Lin, and D. Zhou, “Distributed semi-supervised learning
mining, 2016, pp. 1105–1114.
with kernel ridge regression,” J. Mach. Learn. Res., vol. 18, pp. 46:1–
[48] X. Wang, P. Cui, J. Wang, J. Pei, W. Zhu, and S. Yang, “Community
46:22, 2017.
preserving network embedding.” in AAAI, vol. 17, 2017, pp. 203–209.
[39] J. Li, Y. Liu, R. Yin, and W. Wang, “Approximate manifold regu-
[49] M. E. Newman, “A measure of betweenness centrality based on random
larization: Scalable algorithm and generalization analysis,” in IJCAI.
walks,” Social networks, vol. 27, no. 1, pp. 39–54, 2005.
ijcai.org, 2019, pp. 2887–2893.
[50] F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens, “Random-walk
[40] X. Li and Y. Guo, “Adaptive active learning for image classification,” in computation of similarities between nodes of a graph with application
Proceedings of the IEEE Conference on Computer Vision and Pattern to collaborative recommendation,” IEEE Transactions on knowledge
Recognition, 2013, pp. 859–866. and data engineering, vol. 19, no. 3, pp. 355–369, 2007.
[41] W. L. Hamilton, R. Ying, and J. Leskovec, “Representation learning [51] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning
on graphs: Methods and applications,” IEEE Data Eng. Bull., vol. 40, of social representations,” in Proceedings of the 20th ACM SIGKDD
no. 3, pp. 52–74, 2017. international conference on Knowledge discovery and data mining,
[42] F. Spitzer, Principles of random walk. Springer Science & Business 2014, pp. 701–710.
Media, 2013, vol. 34. [52] Z. Yang, W. Cohen, and R. Salakhudinov, “Revisiting semi-supervised
[43] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by learning with graph embeddings,” in International conference on ma-

chine learning. PMLR, 2016, pp. 40–48. [76] F. Wu, A. H. S. Jr., T. Zhang, C. Fifty, T. Yu, and K. Q. Weinberger,
[53] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “Line: “Simplifying graph convolutional networks,” in ICML, ser. Proceedings
Large-scale information network embedding,” in Proceedings of the of Machine Learning Research, vol. 97. PMLR, 2019, pp. 6861–6871.
24th international conference on world wide web, 2015, pp. 1067– [77] K. Zhou, X. Huang, Y. Li, D. Zha, R. Chen, and X. Hu, “Towards
1077. deeper graph neural networks with differentiable group normalization,”
[54] J. Tang, M. Qu, and Q. Mei, “Pte: Predictive text embedding through in NeurIPS, 2020.
large-scale heterogeneous text networks,” in Proceedings of the 21th [78] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip, “A
ACM SIGKDD International Conference on Knowledge Discovery and comprehensive survey on graph neural networks,” IEEE Transactions
Data Mining, 2015, pp. 1165–1174. on Neural Networks and Learning Systems, 2020.
[55] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, [79] R. L. Murphy, B. Srinivasan, V. A. Rao, and B. Ribeiro, “Janossy
“Distributed representations of words and phrases and their composi- pooling: Learning deep permutation-invariant functions for variable-
tionality,” in Advances in neural information processing systems, 2013, size inputs,” in ICLR (Poster). OpenReview.net, 2019.
pp. 3111–3119. [80] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and
[56] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for Y. Bengio, “Graph attention networks,” in ICLR (Poster). OpenRe-
networks,” in Proceedings of the 22nd ACM SIGKDD international view.net, 2018.
conference on Knowledge discovery and data mining, 2016, pp. 855– [81] K. K. Thekumparampil, C. Wang, S. Oh, and L.-J. Li, “Attention-based
864. graph neural network for semi-supervised learning,” arXiv preprint
[57] H. Chen, B. Perozzi, Y. Hu, and S. Skiena, “HARP: hierarchical arXiv:1803.03735, 2018.
representation learning for networks,” in AAAI. AAAI Press, 2018, [82] T. Pham, T. Tran, D. Q. Phung, and S. Venkatesh, “Column networks
pp. 2127–2134. for collective classification,” in AAAI. AAAI Press, 2017, pp. 2485–
2491.
[58] J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang, “Network
[83] Y. Min, F. Wenkel, and G. Wolf, “Scattering GCN: overcoming
embedding as matrix factorization: Unifying deepwalk, line, pte, and