Neural Graph Matching for Pre-training Graph Neural Networks
Anonymous Author(s)
ABSTRACT
Recently, graph neural networks (GNNs) have been shown to be powerful at modeling structured data. However, when adapted to downstream tasks, they usually require abundant task-specific labeled data, which can be extremely scarce in practice. A promising solution to data scarcity is to pre-train a transferable and expressive GNN model on a large amount of unlabeled graphs or coarse-grained labeled graphs. The pre-trained GNNs are then fine-tuned on downstream datasets with task-specific fine-grained labels.

In this paper, we present a novel Graph Matching based GNN Pre-Training framework, called GMPT. Focusing on a pair of graphs, we propose to learn structural correspondences between them via neural graph matching, consisting of both intra-graph message passing and inter-graph message passing. In this way, we can learn adaptive representations for a given graph when paired with different graphs, and both node- and graph-level characteristics are naturally considered in a single pre-training task. The proposed method can be applied to fully self-supervised pre-training and coarse-grained supervised pre-training. We further propose an approximate contrastive training strategy to significantly reduce the time/memory consumption. Extensive experiments on multi-domain, out-of-distribution benchmarks have demonstrated the effectiveness of our approach.

CCS CONCEPTS
• Computing methodologies → Unsupervised learning; Neural networks.
KEYWORDS
graph representation learning; graph neural networks; graph pre-training; graph matching; self-supervised learning; contrastive learning

ACM Reference Format:
Anonymous Author(s). 2021. Neural Graph Matching for Pre-training Graph Neural Networks. In CIKM '21: 30th ACM International Conference on Information and Knowledge Management, November 01–05, 2021, Online. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/1122445.1122456

1 INTRODUCTION
In the past few years, graph neural networks (GNNs) have emerged as a powerful technical approach for graph representation learning [15, 23, 47]. By leveraging graph structure as well as node and edge features, GNNs can effectively learn low-dimensional representation vectors for each node or the entire graph. However, to apply GNNs to downstream applications, abundant task-specific labeled data is usually required, which can be extremely scarce in practice. To alleviate the data scarcity issue [30], pre-training GNNs [18, 19] has been proposed as a promising solution. It first learns a transferable and expressive GNN on a large amount of unlabeled graphs or coarse-grained labeled graphs. Then, the pre-trained GNNs are fine-tuned on downstream datasets with task-specific labels.

For GNN pre-training, existing studies mostly focus on the design of suitable self-supervised tasks, such as graph structure reconstruction [15, 18, 19], mutual information maximization [32, 41, 48] and property prediction [18]. Generally, these tasks can be classified into two main categories: (1) node-level tasks utilize node representations to predict localized properties of the graph (e.g., link prediction); (2) graph-level tasks focus on the entire graph and learn graph representations when designing the optimization goal (e.g., predicting a graph's properties).

Given the two kinds of GNN pre-training tasks, it is essential to combine node- and graph-level optimization goals [18], since they capture the graph characteristics from different views. Existing approaches either adopt a two-stage approach arranging multi-level pre-training tasks sequentially [18], or frame them in a multi-task learning manner [29]. In this way, each individual pre-training task is not aware of all the optimization goals at different levels, which might result in locally optimal representations w.r.t. some specific level (e.g., node- or graph-level). Ideally, a good pre-training task should capture node- and graph-level characteristics simultaneously in order to derive more comprehensive node representations.

Figure 1: An example of neural graph matching and comparison with existing studies of static graph representations.

For this purpose, we attempt to design new GNN pre-training tasks that are able to learn node- and graph-level graph semantics
in one single pre-training task. Our solution is based on neural graph matching [27, 42, 46], a neural approach to learning structural correspondence among graphs. We present an illustrative example of our idea in Figure 1. Each time, a pair of associated graphs (e.g., graphs with the same labels, or augmented views of a graph) is given, and we evaluate whether the two graphs have similar structural properties. As a major advantage, neural graph matching naturally combines node-level correspondence (e.g., v1 to v2) and graph-level properties (e.g., whether the graphs contain a shared substructure) when establishing their correspondence. That is the major reason why we adopt it as the GNN pre-training task. Another merit of this approach is that a graph will correspond to different representations when paired with different graphs. As shown in this example, we derive different representations for graph A when paired with graph A1 or A2, since neural graph matching enforces one graph to refer to another graph's information when learning graph representations. Therefore, we call the learned representations adaptive graph representations. As a comparison, existing graph-level pre-training tasks usually adopt static graph representations.

To this end, in this paper, we propose a novel Graph Matching based GNN Pre-Training method, named GMPT. The key contribution lies in a neural graph matching module, where we pair two associated graphs as input at each time. To learn structural correspondences, we perform intra-graph message passing as well as inter-graph message passing. In this way, the representations of a given graph are learned by referring to another paired graph, which yields adaptive graph representations. Such a method can capture both node- and graph-level characteristics when learning the graph representations. The proposed method can be applied to both fully self-supervised and coarse-grained supervised pre-training settings. In the self-supervised setting, GNNs are optimized by a graph matching based contrastive loss. To accelerate the learning over graph pairs during pre-training, we further propose an approximate contrastive training strategy that significantly reduces the time/memory consumption without loss of accuracy. In the supervised setting, we design different supervised tasks according to different types of coarse-grained labels.

To summarize, our main contributions are as follows:
(1) We design a new GNN pre-training task based on neural graph matching, where both node- and graph-level characteristics are considered in a single pre-training task.
(2) We apply the proposed pre-training method in both self-supervised and coarse-grained supervised settings, where we learn adaptive graph representations and propose an approximate contrastive training strategy to reduce the time/memory consumption.
(3) Extensive experiments on public out-of-distribution benchmarks from multiple domains, with various GNN architectures, demonstrate the effectiveness of our approach.

2 RELATED WORK
In this section, we review the most related work on graph neural networks, GNN pre-training and graph matching.

Graph neural networks. Recent years have witnessed remarkable progress of graph neural networks (GNNs) in characterizing graph-structured data [44]. To learn powerful representations of graph data, a surge of GNNs have been proposed, which roughly fall into two lines. The first line focuses on learning node representations in the spectral domain. As a pioneering and representative work, GCN [23] simplifies the Chebyshev polynomial in graph filtering [2, 8] based on a first-order approximation and naturally captures local structure via the graph Laplacian. Correspondingly, the other line aims at aggregating and updating local structural information in the spatial domain. Hence, a series of strategies have been proposed to guide the message passing in GNNs, including mean/max/LSTM aggregation [15], attention mechanisms [40] and graph structure pooling [11]. Moreover, several recent efforts improve training efficiency through layer sampling [4, 20] and graph partitioning [5, 50]. For a thorough review, please refer to the elaborate surveys [44, 51]. However, training GNNs in an effective and efficient way remains challenging, since domain-specific labels are often scarce.

Pre-training graph neural networks. To alleviate the above issue, pre-training for graph neural networks has drawn much attention recently. It empowers GNNs to capture the structural and semantic information of unlabeled input graphs (or graphs with a few coarse-grained labels), followed by several fine-tuning steps on the downstream tasks of interest. Developing effective supervised (or self-supervised) signals that guide GNNs to exploit structural and semantic properties of the original graphs is therefore at the heart of this line of work. Generally, existing supervised signals can be classified into two main categories. The first is node-level tasks, which aim at predicting localized properties using node representations, such as graph structure reconstruction [15, 18, 19, 26, 29], localized attribute prediction [18, 19, 34] and node representation recovery [16]. The other is graph-level tasks, which define a globalized optimization goal for the entire graph, such as graph property prediction [18, 34] and mutual information maximization [29, 32, 41, 48]. Our proposed framework differs from the above approaches in two aspects, hybrid-level pre-training and adaptive graph representations, as detailed in Section 1.

Graph matching. Graph matching refers to establishing node correspondences between two (or among multiple) graphs [3], such that the similarity between the matched graphs is maximized. Some research focuses on the accuracy of the node correspondence and regards graph matching as a quadratic assignment programming (QAP) problem [28], which is NP-complete [12]. Researchers thus mainly employ approximate techniques to seek inexact solutions, such as spectral approximation [7, 24], doubly-stochastic approximation [6, 14, 25], and learning-based approaches [3, 10, 42, 49]. Other studies care about the calculated similarity between graphs. Early efforts are mainly based on heuristic rules, such as minimal graph edit distance [33, 43] and graph kernel methods [13, 21, 35, 36]. With the development of GNNs, recent works leverage the message passing technique to explore neural graph matching [27, 37, 46]. In this work, we apply neural graph matching to GNN pre-training, to learn adaptive graph representations and encourage GNNs to integrate localized and globalized domain-specific features.
Figure 2: Overall framework of our proposed graph matching based GNN pre-training methods.
(2n views in total). We consider various augmentation techniques, including node/edge perturbation [48], subgraph sampling [48], diffusion [17], and adaptive methods [52]. The selection of graph augmentation techniques depends on the actual data domain [48].

To pre-train an expressive GNN encoder, we consider whether a pair of graph views, denoted by G̃1 and G̃2, are matched or not based on their graph representations. Specifically, we first apply the GNN encoder described in Section 3 to obtain node representations in the two graph views. Let h_s^(1) and h_t^(2) denote the representations of node s from G̃1 and node t from G̃2, respectively.
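For illustration, a minimal PyTorch-style sketch of building two augmented views of one graph is given below. The tensor layout (a node-feature matrix plus an edge-index tensor) and the perturbation ratios are illustrative assumptions, not the exact implementation used in GMPT.

```python
# Sketch of generating two augmented views of one graph (illustrative only).
# A graph is a node-feature matrix x [N, d] plus an edge_index tensor [2, E];
# the 20%/10% perturbation ratios are assumptions for the example.
import torch

def drop_nodes(x, edge_index, drop_ratio=0.2):
    # Randomly remove a fraction of nodes together with their incident edges.
    num_nodes = x.size(0)
    keep = torch.rand(num_nodes) >= drop_ratio
    keep_idx = keep.nonzero(as_tuple=True)[0]
    remap = torch.full((num_nodes,), -1, dtype=torch.long)
    remap[keep_idx] = torch.arange(keep_idx.numel())
    src, dst = edge_index
    mask = keep[src] & keep[dst]
    return x[keep_idx], remap[edge_index[:, mask]]

def perturb_edges(x, edge_index, add_ratio=0.1):
    # Randomly add a fraction of new edges between existing nodes.
    num_nodes, num_edges = x.size(0), edge_index.size(1)
    num_add = max(1, int(add_ratio * num_edges))
    new_edges = torch.randint(0, num_nodes, (2, num_add))
    return x, torch.cat([edge_index, new_edges], dim=1)

def two_views(x, edge_index):
    # A mini-batch of n graphs yields 2n such views in total.
    return drop_nodes(x, edge_index), perturb_edges(x, edge_index)
```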
Neural Graph Matching. Following recent progress in neural graph matching [27, 42], we incorporate message passing within a graph (called intra-graph messages) and between a pair of graphs (called inter-graph messages). Given an intra-graph node pair ⟨s, t⟩ and an inter-graph node pair ⟨s′, t′⟩, we define the two kinds of message passing formally as:

    m_{s \to t} = MSG_{intra}\big(h_s^{(1)}, h_t^{(1)}, e_{st}\big),   (4)
    \mu_{s' \to t'} = MSG_{inter}\big(h_{s'}^{(1)}, h_{t'}^{(2)}\big),   (5)

where m_{s→t} and μ_{s′→t′} are intra-graph and inter-graph messages, respectively. Intra-graph message passing can be defined in a similar way to standard GNN architectures, like GIN [45], while for MSG_inter we adopt a cross-graph attention mechanism:

    \mu_{s' \to t'} = a_{s' \to t'} \cdot h_{s'}^{(1)},   (6)

where

    a_{s' \to t'} = \frac{\exp\big(sim(h_{s'}^{(1)}, h_{t'}^{(2)})\big)}{\sum_{k \in \tilde{G}_2} \exp\big(sim(h_{s'}^{(1)}, h_k^{(2)})\big)}

and sim(·) is a similarity function, such as the dot product or cosine similarity. The above attention mechanism allows an adaptive message exchange between the paired graphs. Intuitively, messages passed between similar substructures of the two graphs receive higher attention weights. Here, we normalize the attention weights of messages from the same source node, which means \sum_{t'} a_{s \to t'} = 1. It is also optional to normalize the attention weights of messages to the same target node (\sum_{t'} a_{t' \to s} = 1).
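For illustration, the sketch below implements the cross-graph attention of Eq. (6) with dot-product similarity. The variable names (h1, h2) and the dense attention matrices are illustrative assumptions rather than our released code.

```python
# Sketch of the cross-graph attention in Eq. (6) with dot-product similarity.
# h1 [N1, d] and h2 [N2, d] are the node representations of the paired views.
import torch

def inter_graph_messages(h1, h2):
    sim = h1 @ h2.t()                  # pairwise similarities, [N1, N2]
    attn_12 = torch.softmax(sim, dim=1)    # normalize over targets: sum_t a_{s->t} = 1
    # Message received by each node t of view 2: sum_s a_{s->t} * h1[s]  -> [N2, d]
    msg_to_view2 = attn_12.t() @ h1
    # Symmetrically, messages received by view 1 from the nodes of view 2.
    attn_21 = torch.softmax(sim.t(), dim=1)
    msg_to_view1 = attn_21.t() @ h2        # [N1, d]
    return msg_to_view1, msg_to_view2
```

The aggregated inter-graph messages returned here are exactly what is summed over N_inter in the update step described next.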
Match Enhanced Graph Representations. After passing intra-graph messages from nodes' neighbors (denoted by N_intra) and inter-graph messages from all the nodes of the other graph (denoted by N_inter), we aggregate the messages together and update to obtain the contextual node features Z. For a node t, we update its original representation h_t as:

    z_t = Update\Big(h_t, \sum_{s \in N_{intra}} m_{s \to t}, \sum_{s' \in N_{inter}} \mu_{s' \to t}\Big),   (7)

where we use the sum operation for aggregation. Finally, we obtain the entire graph's adaptive representation by employing a permutation-invariant function READOUT to pool the contextual node representations:

    z_{\tilde{G}_1} = READOUT\big(\{ z_v \mid v \in V_1 \}\big),   (8)
    z_{\tilde{G}_2} = READOUT\big(\{ z_{v'} \mid v' \in V_2 \}\big).   (9)

Note that, when involved in different pairs, a given graph will correspond to different representations in our approach. This is a key merit for the subsequent pre-training task, since it can adaptively capture structural correspondences instead of using static graph representations as in previous studies [18, 48].

4.1.2 Contrastive Learning with Adaptive Graph Representations. Contrastive learning is a commonly used technique to learn with augmented graph views in pairs [48]. It aims to increase similarity scores for positive pairs and decrease similarity scores for negative pairs. However, existing graph contrastive learning methods mainly adopt static graph representations [17, 48, 52], where node-level interaction across graphs is not explicitly modeled. As a comparison, given a pair of graph views, we first apply the neural graph matching technique (Section 4.1.1) to characterize inter-graph interaction, and then construct the contrastive loss based on the adaptive graph representations.

Formally, given a positive pair (G̃_i, G̃_j), we first adaptively encode them into contextual graph representations z_{G̃_i} and z_{G̃_j} (Eq. (8) and Eq. (9)), and then formalize the contrastive loss as:

    \ell_{i,j} = -\log \frac{\exp(s_{i,j}/\tau)}{\sum_{k \neq i} \exp(s_{i,k}/\tau)},   (10)

where s_{i,j} = sim(z_{G̃_i}, z_{G̃_j}) and τ is a temperature parameter. In practice, we usually have a batch of graph views, and we enumerate all the possible pairs for the denominator of Eq. (10). Although the above contrastive loss is defined at the graph level (whether two views are augmented from the same graph), the derived graph representations z_{G̃_i} and z_{G̃_j} are enhanced with inter-graph node interaction via neural graph matching. As such, optimizing \ell_{i,j} encourages GNNs to capture both node- and graph-level characteristics in the graph representations.
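For illustration, a minimal sketch of Eq. (10) is shown below, assuming the pairwise matching has already produced a similarity matrix over the 2n views of a batch; the helper and variable names are hypothetical.

```python
# Sketch of the contrastive loss in Eq. (10). `sim` is a [2n, 2n] matrix whose
# (i, j) entry is sim(z_{G~i}, z_{G~j}) computed after matching views i and j;
# `pos` gives the index of each view's positive partner.
import torch
import torch.nn.functional as F

def matching_contrastive_loss(sim, pos, tau=0.07):
    logits = sim / tau
    self_mask = torch.eye(sim.size(0), dtype=torch.bool)
    logits = logits.masked_fill(self_mask, float('-inf'))   # enforces k != i in Eq. (10)
    return F.cross_entropy(logits, pos)

# Toy usage: 2 graphs -> 4 views; view 0 pairs with 1, view 2 pairs with 3.
sim = torch.randn(4, 4)
pos = torch.tensor([1, 0, 3, 2])
loss = matching_contrastive_loss(sim, pos)
```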
4.1.3 Approximate Contrastive Training. A major issue with graph matching is that it incurs a quadratic time and space cost in the number of nodes. Here, we propose an approximate contrastive training strategy to improve the algorithm's efficiency. We consider the setting with a mini-batch of n graphs. As mentioned before, we generate 2n augmented graph views for graph matching and contrastive learning. Typically, for a mini-batch of n graphs, GMPT-CL considers 2n × 2n graph comparisons (two augmented views per graph) in total. Each comparison performs a node-to-node similarity calculation (refer to the cross-graph attention in Eq. (6)), taking an additional cost of O(m^2 · d) time and space, where m = \sum_{i=1}^{2n} |V_i| denotes the total number of nodes in the 2n graph views and d is the dimensionality of the representation vectors.

In order to reduce the time and memory consumption, the key idea is to perform an approximate calculation of the proposed contrastive loss (Eq. (10)): we sample q out of the 2n graph views to contrast with all the other views (q × 2n comparisons in total). In this way, the expected additional time complexity is reduced to O(\frac{q}{2n} · m^2 · d). For a further reduction of space complexity, we adopt the gradient accumulation technique. For each sampled view, we perform 1 × 2n comparisons. After calculating the contrastive loss, the model backpropagates the prediction error and computes the gradients, but does not update the model parameters immediately. Instead, the gradients are accumulated until all the q samples have been processed. In this way, the q sampled views only require an additional space complexity of O(\frac{1}{2n} · m^2 · d).

We give a brief theoretical proof in Section 4.1.4 that the proposed approximate contrastive training fits the formulation of the InfoNCE loss [38, 39] in expectation. Empirically, the experiments in Section 5.3 show that the performance of the fine-tuned GNNs on downstream datasets is not affected (and is even improved) by the proposed approximate contrastive training. The overall pre-training algorithm of GMPT-CL for one mini-batch is presented in Algorithm 1. Lines 1–4 show the graph augmentation process (Section 4.1.1), and line 5 shows the sampling step of approximate contrastive training. In lines 7–9, we build the positive pairs of views in Eq. (10), and in lines 11–12, we show the gradient accumulation and parameter optimization.
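For illustration, the sketch below combines the q-view sampling with gradient accumulation. `match_and_score` is a hypothetical helper standing in for the GNN encoder plus the neural graph matching module; it is not part of our released code.

```python
# Sketch of approximate contrastive training: sample q of the 2n views and
# accumulate gradients before a single parameter update. `match_and_score`
# (hypothetical) returns the similarity of one view to every view in the batch.
import random
import torch

def approx_contrastive_step(views, pos, match_and_score, optimizer, q=4, tau=0.07):
    optimizer.zero_grad()
    sampled = random.sample(range(len(views)), q)           # q out of 2n views
    for i in sampled:
        scores = match_and_score(views[i], views) / tau      # [2n]
        mask = torch.arange(len(views)) == i
        scores = scores.masked_fill(mask, float('-inf'))     # exclude the self term
        loss = -torch.log_softmax(scores, dim=0)[pos[i]] / q
        loss.backward()        # accumulate gradients; no update yet
    optimizer.step()           # one update after all q sampled views
```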
graph matching based score function. Thus, Eq. (13) fits the formulation of the InfoNCE loss [38, 39]. Minimizing Eq. (10) results in a tighter lower bound on the mutual information.

Furthermore, we prove that approximate contrastive training has the same optimization lower bound as the original objective, in expectation. We sample q views to contrast with all the n views. The loss can be rewritten in an expectation form:

    \ell' = -\frac{1}{q} \sum_{i \sim U(1,n)} \log \frac{\exp(f_{i,j})}{\sum_{k \neq i} \exp(f_{i,k})}   (14)
          = \mathbb{E}\Big[\mathbb{E}_{i \sim U(1,n)} \log \frac{\sum_{k \neq i} \exp(f_{i,k})}{\exp(f_{i,j})}\Big]   (15)
          = \mathbb{E}\Big[\log\Big(\frac{1}{n} \sum_{i}^{n} \sum_{k \neq i,j}^{n} \exp(f_{i,k} - f_{i,j})\Big)\Big] + \log(n-2),   (16)

where we can see that Eq. (16) is the same as Eq. (13). The key point is that, by uniform sampling, the expectation of the contrastive loss of one single sample is equal to the averaged contrastive loss of the whole batch, as in Eq. (15).
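The key step (Eq. (15)) can also be checked numerically: under uniform sampling of the anchor view, the Monte-Carlo average of the per-sample loss matches the full-batch average. Below is a small sanity-check sketch with toy scores f and an arbitrary positive-partner assignment (both are illustrative, not data from our experiments).

```python
# Numerical check of Eq. (15): with uniformly sampled anchor views, the
# Monte-Carlo estimate of the loss matches the full-batch average.
import torch

torch.manual_seed(0)
n, q, trials = 8, 2, 5000
f = torch.randn(n, n)                    # toy matching-based scores f_{i,k}
j = (torch.arange(n) + 1) % n            # arbitrary positive-partner assignment

def loss_i(i):
    mask = torch.arange(n) != i          # denominator runs over k != i
    return f[i][mask].logsumexp(0) - f[i, j[i]]

full_batch = torch.stack([loss_i(i) for i in range(n)]).mean()
monte_carlo = torch.stack([
    torch.stack([loss_i(i) for i in torch.randint(n, (q,))]).mean()
    for _ in range(trials)
]).mean()
print(full_batch.item(), monte_carlo.item())   # agree up to sampling noise
```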
Table 1: Statistics of the datasets. PT denotes Pre-Training and FT denotes Fine-Tuning. * denotes the sum over the 8 downstream datasets.

Dataset                              | Bio   | Chem
#(sub)graphs for self-supervised PT  | 307K  | 2,000K
#(sub)graphs for supervised PT/FT    | 88K   | 456K
#Coarse-grained labels for PT        | 5,000 | 1,310
#Downstream FT tasks                 | 40    | 678*

Based on the adaptive representations of the two graphs in a pair (z_{G_1} and z_{G_2}), we jointly predict the corresponding coarse-grained labels y_1 and y_2. The loss for discrete labels \ell_d is calculated as

    \ell_d = \sum_{i=1,2} BCE\big(y_i, W_i \cdot z_{G_i} + b_i\big),   (18)

where BCE(·, ·) denotes the binary cross-entropy loss, and y_1 and y_2 are vectorized representations of the labels y_1 and y_2, respectively. Although the losses of the two graphs in a pair are calculated separately, their representations are obtained from the graph matching module (e.g., the cross-graph message passing in Eq. (5)).
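For illustration, a minimal sketch of the supervised objective in Eq. (18) is given below; a single shared linear head is assumed for both graphs in a pair, and the label dimensionality follows the Bio statistics in Table 1.

```python
# Sketch of the coarse-grained supervised loss in Eq. (18); a single shared
# linear head is assumed, and BCEWithLogitsLoss applies the binary
# cross-entropy directly to W * z_G + b.
import torch
import torch.nn as nn

hidden_dim, num_labels = 300, 5000            # e.g., coarse-grained labels of Bio
head = nn.Linear(hidden_dim, num_labels)
bce = nn.BCEWithLogitsLoss()

def supervised_pair_loss(z_g1, z_g2, y1, y2):
    # z_g1, z_g2: match-enhanced graph representations of the paired graphs;
    # y1, y2: multi-hot vectors of their coarse-grained labels.
    return bce(head(z_g1), y1) + bce(head(z_g2), y2)

# Toy usage with a batch of 8 graph pairs.
z_g1, z_g2 = torch.randn(8, hidden_dim), torch.randn(8, hidden_dim)
y1 = torch.randint(0, 2, (8, num_labels)).float()
y2 = torch.randint(0, 2, (8, num_labels)).float()
loss = supervised_pair_loss(z_g1, z_g2, y1, y2)
```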
5 EXPERIMENTS
In this section, we conduct extensive experiments to verify the effectiveness of our proposed methods in both the self-supervised and the supervised setting. Moreover, we give an in-depth analysis of the training strategy, key parameters and transferability.

5.1 Experimental Setup
5.1.1 Datasets. We conduct experiments on two public (sub)graph classification benchmarks from different domains. Dataset statistics are summarized in Table 1. Below are the descriptions of the datasets:
• Bio [18] is a biology dataset for protein function prediction, where nodes are proteins and different edge types denote the kinds of relationships that exist between a pair of proteins. To evaluate the pre-training methods' transferability and out-of-distribution generalization, we split the downstream dataset by species, which means the pre-trained model is tested on new species.
• Chem [18] is a chemistry dataset for molecular property prediction, where nodes and edges represent atoms and chemical bonds, respectively. As above, the downstream dataset is split by scaffold (molecular substructure).
The pre-training dataset and downstream dataset are from different research fields.
Note that in Bio, graph instances are subgraphs sampled from the original giant network via random walks [18], while in Chem, graphs are individual molecules. We strictly adopt the same splitting and pre-processing of these benchmarks as previous work. Please refer to Hu et al. [18] for more details about the benchmarks.

5.1.2 Baselines. We compare our pre-training methods with the following representative GNN pre-training methods:
• Infomax [41] maximizes mutual information between patch representations and corresponding high-level summaries of graphs.
• EdgePred [15] directly predicts the connectivity of node pairs, a.k.a. the link prediction task.
• ContextPred [18] samples each node's neighborhood subgraph and the surrounding context subgraph. The task is to predict whether a given pair of neighborhood graph and context graph is generated by the same center node.
• AttrMasking [18] predicts node or edge attributes that are randomly masked.
• GraphCL [48] contrasts the static representations of augmented views and judges whether they are generated from the same graph.
• L2P-GNN [29] selects link prediction as the node-level task and Infomax as the graph-level task, and then utilizes meta-learning to alleviate the divergence between the multi-task pre-training and fine-tuning objectives.
• PropPred [18] predicts the coarse-grained labels of graphs in the pre-training datasets.
To our knowledge, the selected baselines comprehensively cover recently proposed pre-training methods for GNNs. Among them, PropPred is designed for the supervised pre-training setting, while the others target the self-supervised pre-training setting. Moreover, results of the non-pre-trained model are also reported.

5.1.3 Parameter Settings. To enhance reproducibility, we elaborate on the implementation details as follows.
GNN architecture. We mainly experiment with Graph Isomorphism Networks (GINs) [45], the most expressive GNN architecture for graph-level prediction tasks. We also experiment with other popular architectures: GCN [23], GraphSAGE [15] and GAT [40]. Following previous work [18, 29], we select the same hyper-parameters: 300-dimensional hidden units and 5 GNN layers (K = 5). As for the READOUT function, following the guidance of previous works [18, 48], we select average pooling for both the Bio and Chem datasets (except for L2P-GNN). Note that a learnable graph pooling layer is adopted for the baseline L2P-GNN, according to the restriction for pre-training in the original paper.
For our proposed graph matching methods, we adopt the dot product as the sim(·) function. In GMPT-CL, we set τ = 0.07 in Eq. (10).
Pre-training and fine-tuning settings. Results of baselines on different datasets are taken directly from the literature if they have been reported; for the others, the optimization of pre-training and fine-tuning is as follows. We implement all models using PyTorch [31] and PyTorch Geometric [9], and train them with the Adam optimizer [22]. We pre-train the models with a learning rate of 0.001, and fine-tune the GNNs with a learning rate tuned in {0.01, 0.001, 0.0001} for all the methods. For self-supervised pre-training, we use a batch size of 16 for 5 epochs on the Bio dataset, and a batch size of 32 for 10 epochs on the Chem dataset.
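For reference, the settings above can be gathered in a single configuration; the dictionary keys below are our own naming and do not correspond to an official configuration file.

```python
# The reported pre-training/fine-tuning settings collected in one place.
GMPT_CL_SETTINGS = {
    "gnn": "GIN",                        # also evaluated: GCN, GraphSAGE, GAT
    "num_layers": 5,                     # K = 5
    "hidden_dim": 300,
    "readout": "mean",                   # average pooling (L2P-GNN uses its own)
    "similarity": "dot_product",         # sim(.) in the matching module
    "temperature": 0.07,                 # tau in Eq. (10)
    "optimizer": "Adam",
    "pretrain_lr": 1e-3,
    "finetune_lr_grid": [1e-2, 1e-3, 1e-4],
    "bio": {"batch_size": 16, "epochs": 5},
    "chem": {"batch_size": 32, "epochs": 10},
}
```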
Table 2: Evaluation in the self-supervised setting on Bio. We report ROC-AUC (%) performance using different pre-training methods with different GNN architectures. The macro-average results over all the popular GNN architectures are also reported in the rightmost column. In L2P-GNN, the GNN is fine-tuned with a parameterized global pooling layer, while the others use average pooling. The improvement of our method over the best baseline is significant at the 0.01 level with a paired t-test (except GraphSAGE).

Pre-training methods | GCN           | GraphSAGE     | GAT           | GIN           | Average
–                    | 63.20 (±1.00) | 65.70 (±1.20) | 68.20 (±1.10) | 64.80 (±1.00) | 65.48
Infomax              | 62.83 (±1.22) | 67.21 (±1.84) | 66.94 (±2.61) | 64.10 (±1.5)  | 65.27
EdgePred             | 63.18 (±1.12) | 66.05 (±0.78) | 65.72 (±1.17) | 65.7 (±1.3)   | 65.16
ContextPred          | 62.81 (±1.87) | 66.47 (±1.27) | 67.86 (±1.19) | 65.20 (±1.60) | 65.59
AttrMasking          | 62.40 (±1.35) | 63.32 (±1.01) | 61.72 (±2.70) | 64.40 (±1.30) | 62.96
GraphCL              | 67.05 (±1.16) | 71.53 (±0.46) | 65.68 (±3.98) | 67.88 (±0.85) | 68.04
L2P-GNN              | 66.48 (±1.59) | 69.89 (±1.63) | 69.15 (±1.86) | 70.13 (±0.95) | 68.91
GMPT-CL              | 70.65 (±0.53) | 70.29 (±0.21) | 71.07 (±0.14) | 72.53 (±0.42) | 71.13

Table 2 presents the performance comparison in the self-supervised pre-training setting on Bio, with the most expressive GNN architecture, GIN. Applying GMPT-CL to the currently popular GNN architectures, as shown in Table 2, GMPT-CL achieves the best macro-average result (71.13%) over all the compared baseline methods. In particular, we can see that GMPT-CL is powerful even with less expressive GNN architectures like GCN or GAT, where it brings 3.60% and 1.92% absolute gains over the best baseline, respectively.

Table 3 presents the performance comparison with GIN in the self-supervised pre-training setting over the 8 downstream datasets of Chem. Generally, none of the compared pre-training methods achieves the best performance on all the downstream tasks of Chem. We notice that GMPT-CL obtains the best macro-average result (71.5%) compared to all the baselines, although it cannot achieve the best performance on every one of the 8 downstream datasets.

In sum, we make the following observations.
(1) On average, the proposed GMPT-CL yields the best performance on benchmarks from different domains (72.53% on Bio and 71.50% on Chem, both with GIN). As GMPT-CL is a hybrid-level pre-training task, it encourages GNNs to capture both localized and globalized domain-specific semantics. The graph matching module of GMPT-CL can generate adaptive graph representations, in which shared substructures are enhanced.
(2) Pre-training GNNs with a large amount of unlabeled data is clearly helpful for downstream tasks, as GMPT-CL brings 6.69% and 4.5% absolute gains over non-pre-trained models on the macro-average results of the two domains, respectively.
Table 3: Evaluation in self-supervised setting on Chem. Here we test ROC-AUC (%) performance on molecular prediction benchmarks using different pre-training methods with GIN. The rightmost column averages the mean of test performance across the 8 datasets. Bold numbers indicate the best performance and underline numbers indicate the second best performance.

Pre-training methods | BBBP          | Tox21         | ToxCast       | SIDER         | ClinTox       | MUV           | HIV           | BACE          | Average
–                    | 65.80 (±4.50) | 74.00 (±0.80) | 63.40 (±0.60) | 57.30 (±1.60) | 58.00 (±4.40) | 71.80 (±2.50) | 75.30 (±1.90) | 70.10 (±5.40) | 67.00
Infomax              | 68.80 (±0.80) | 75.30 (±0.50) | 62.70 (±0.40) | 58.40 (±0.80) | 69.90 (±3.00) | 75.30 (±2.50) | 76.00 (±0.70) | 75.90 (±1.60) | 70.30
EdgePred             | 67.30 (±2.40) | 76.00 (±0.60) | 64.10 (±0.60) | 60.40 (±0.70) | 64.10 (±3.70) | 74.10 (±2.10) | 76.30 (±1.00) | 79.90 (±0.90) | 70.30
AttrMasking          | 64.30 (±2.80) | 76.70 (±0.40) | 64.20 (±0.50) | 61.00 (±0.70) | 71.80 (±4.10) | 74.70 (±1.40) | 77.20 (±1.10) | 79.30 (±1.60) | 71.10
ContextPred          | 68.00 (±2.00) | 75.70 (±0.70) | 63.90 (±0.60) | 60.90 (±0.60) | 65.90 (±3.80) | 75.80 (±1.70) | 77.30 (±1.00) | 79.60 (±1.20) | 70.90
GraphCL              | 69.68 (±0.67) | 73.87 (±0.66) | 62.40 (±0.57) | 60.53 (±0.88) | 75.99 (±2.65) | 69.80 (±2.66) | 78.47 (±1.22) | 75.38 (±1.44) | 70.80
L2P-GNN              | 66.42 (±1.10) | 75.51 (±0.37) | 63.13 (±0.50) | 58.86 (±0.79) | 66.15 (±5.00) | 74.71 (±1.84) | 77.31 (±1.22) | 81.05 (±0.75) | 70.40
GMPT-CL              | 66.68 (±1.96) | 74.63 (±0.65) | 63.13 (±0.37) | 60.13 (±0.67) | 76.00 (±1.32) | 76.57 (±2.08) | 77.54 (±1.18) | 77.04 (±1.38) | 71.50
Results in the supervised pre-training setting (ROC-AUC, %):

Pre-training methods | Bio          | Chem
–                    | 64.8 ± 1.0   | 67.0
PropPred             | 69.0 ± 2.4   | 70.0
GMPT-Sup             | 70.84 ± 0.59 | –

[Figure: ROC-AUC (%) and memory consumption (GB) under different batch sizes (1–32).]

Figure 4: Analysis of transferability for L2P-GNN (a) and GMPT-CL (b) over 40 out-of-distribution sub-tasks of Bio.

Figure 5: Tuning of different hyper-parameters of the GNN pre-trained with GMPT-CL on Bio dataset.
REFERENCES
[1] Andrew P. Bradley. 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30, 7 (1997), 1145–1159.
[2] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2013. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203 (2013).
[3] Tibério S. Caetano, Julian J. McAuley, Li Cheng, Quoc V. Le, and Alex J. Smola. 2009. Learning graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 6 (2009), 1048–1058.
[4] Jie Chen, Tengfei Ma, and Cao Xiao. 2018. FastGCN: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247 (2018).
[5] Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. 2019. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In SIGKDD. 257–266.
[6] Minsu Cho, Jungmin Lee, and Kyoung Mu Lee. 2010. Reweighted Random Walks for Graph Matching. In ECCV. 492–505.
[7] Timothée Cour, Praveen Srinivasan, and Jianbo Shi. 2006. Balanced Graph Matching. In NIPS. 313–320.
[8] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. arXiv preprint arXiv:1606.09375 (2016).
[9] Matthias Fey and Jan Eric Lenssen. 2019. Fast Graph Representation Learning with PyTorch Geometric. CoRR abs/1903.02428 (2019).
[10] Matthias Fey, Jan Eric Lenssen, Christopher Morris, Jonathan Masci, and Nils M. Kriege. 2020. Deep Graph Matching Consensus. In ICLR.
[11] Hongyang Gao and Shuiwang Ji. 2019. Graph U-Nets. In ICML. 2083–2092.
[12] M. R. Garey and David S. Johnson. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman.
[13] Thomas Gärtner, Tamás Horváth, and Stefan Wrobel. 2010. Graph Kernels. In Encyclopedia of Machine Learning, Claude Sammut and Geoffrey I. Webb (Eds.). 467–469.
[14] Steven Gold and Anand Rangarajan. 1996. A Graduated Assignment Algorithm for Graph Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence 18, 4 (1996), 377–388.
[15] William L. Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In NIPS. 1024–1034.
[16] Bowen Hao, Jing Zhang, Hongzhi Yin, Cuiping Li, and Hong Chen. 2021. Pre-Training Graph Neural Networks for Cold-Start Users and Items Representation. In WSDM. ACM, 265–273.
[17] Kaveh Hassani and Amir Hosein Khas Ahmadi. 2020. Contrastive Multi-View Representation Learning on Graphs. In ICML. 4116–4126.
[18] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay S. Pande, and Jure Leskovec. 2020. Strategies for Pre-training Graph Neural Networks. In ICLR.
[19] Ziniu Hu, Yuxiao Dong, Kuansan Wang, Kai-Wei Chang, and Yizhou Sun. 2020. GPT-GNN: Generative Pre-Training of Graph Neural Networks. In SIGKDD. 1857–1867.
[20] Wenbing Huang, Tong Zhang, Yu Rong, and Junzhou Huang. 2018. Adaptive sampling towards fast graph representation learning. arXiv preprint arXiv:1809.05343 (2018).
[21] Hisashi Kashima, Koji Tsuda, and Akihiro Inokuchi. 2003. Marginalized Kernels Between Labeled Graphs. In ICML, Tom Fawcett and Nina Mishra (Eds.). 321–328.
[22] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
[23] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR.
[24] Marius Leordeanu and Martial Hebert. 2005. A Spectral Technique for Correspondence Problems Using Pairwise Constraints. In ICCV. 1482–1489.
[25] Marius Leordeanu, Martial Hebert, and Rahul Sukthankar. 2009. An Integer Projected Fixed Point Method for Graph Matching and MAP Inference. In NIPS. 1114–1122.
[26] Feng Li, Bencheng Yan, Qingqing Long, Pengjie Wang, Wei Lin, Jian Xu, and Bo Zheng. 2021. Explicit Semantic Cross Feature Learning via Pre-trained Graph Neural Networks for CTR Prediction. In SIGIR.
[27] Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals, and Pushmeet Kohli. 2019. Graph Matching Networks for Learning the Similarity of Graph Structured Objects. In ICML. 3835–3845.
[28] Eliane Maria Loiola, Nair Maria Maia de Abreu, Paulo Oswaldo Boaventura Netto, Peter Hahn, and Tania Maia Querido. 2007. A survey for the quadratic assignment problem. Eur. J. Oper. Res. 176, 2 (2007), 657–690.
[29] Yuanfu Lu, Xunqiang Jiang, Yuan Fang, and Chuan Shi. 2021. Learning to Pre-train Graph Neural Networks. In AAAI.
[30] Sinno Jialin Pan and Qiang Yang. 2010. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering 22, 10 (2010), 1345–1359.
[31] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NIPS. 8024–8035.
[32] Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. 2020. GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training. In SIGKDD. 1150–1160.
[33] John W. Raymond, Eleanor J. Gardiner, and Peter Willett. 2002. RASCAL: Calculation of Graph Similarity using Maximum Common Edge Subgraphs. Comput. J. 45, 6 (2002), 631–644.
[34] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. 2020. Self-Supervised Graph Transformer on Large-Scale Molecular Data. In NIPS.
[35] Nino Shervashidze and Karsten M. Borgwardt. 2009. Fast subtree kernels on graphs. In NIPS. 1660–1668.
[36] Nino Shervashidze, S. V. N. Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten M. Borgwardt. 2009. Efficient graphlet kernels for large graph comparison. In AISTATS. 488–495.
[37] Yixin Su, Rui Zhang, Sarah Erfani, and Junhao Gan. 2021. Neural Graph Matching based Collaborative Filtering. In SIGIR.
[38] Michael Tschannen, Josip Djolonga, Paul K. Rubenstein, Sylvain Gelly, and Mario Lucic. 2020. On Mutual Information Maximization for Representation Learning. In ICLR.
[39] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation Learning with Contrastive Predictive Coding. CoRR abs/1807.03748 (2018).
[40] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In ICLR.
[41] Petar Velickovic, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R. Devon Hjelm. 2019. Deep Graph Infomax. In ICLR.
[42] Runzhong Wang, Junchi Yan, and Xiaokang Yang. 2019. Learning Combinatorial Embedding Networks for Deep Graph Matching. In ICCV. 3056–3065.
[43] Peter Willett, John M. Barnard, and Geoffrey M. Downs. 1998. Chemical Similarity Searching. J. Chem. Inf. Comput. Sci. 38, 6 (1998), 983–996.
[44] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. 2020. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems (2020).
[45] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How Powerful are Graph Neural Networks?. In ICLR.
[46] Kun Xu, Liwei Wang, Mo Yu, Yansong Feng, Yan Song, Zhiguo Wang, and Dong Yu. 2019. Cross-lingual Knowledge Graph Alignment via Graph Matching Neural Network. In ACL. 3156–3161.
[47] Jiaxuan You, Zhitao Ying, and Jure Leskovec. 2020. Design Space for Graph Neural Networks. In NIPS.
[48] Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. 2020. Graph Contrastive Learning with Augmentations. In NIPS.
[49] Andrei Zanfir and Cristian Sminchisescu. 2018. Deep Learning of Graph Matching. In CVPR. 2684–2693.
[50] Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. 2019. GraphSAINT: Graph sampling based inductive learning method. arXiv preprint arXiv:1907.04931 (2019).
[51] Ziwei Zhang, Peng Cui, and Wenwu Zhu. 2020. Deep learning on graphs: A survey. IEEE Transactions on Knowledge and Data Engineering (2020).
[52] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2021. Graph Contrastive Learning with Adaptive Augmentation. (2021).