
Neural Graph Matching for Pre-training Graph Neural Networks

Anonymous Author(s)
ABSTRACT
Recently, graph neural networks (GNNs) have been shown to be powerful at modeling structural data. However, when adapted to downstream tasks, they usually require abundant task-specific labeled data, which can be extremely scarce in practice. A promising solution to data scarcity is to pre-train a transferable and expressive GNN model on a large amount of unlabeled graphs or coarse-grained labeled graphs, and then fine-tune the pre-trained GNNs on downstream datasets with task-specific fine-grained labels.

In this paper, we present a novel Graph Matching based GNN Pre-Training framework, called GMPT. Focusing on a pair of graphs, we propose to learn structural correspondences between them via neural graph matching, consisting of both intra-graph message passing and inter-graph message passing. In this way, we can learn adaptive representations for a given graph when paired with different graphs, and both node- and graph-level characteristics are naturally considered in a single pre-training task. The proposed method can be applied to fully self-supervised pre-training and coarse-grained supervised pre-training. We further propose an approximate contrastive training strategy to significantly reduce the time/memory consumption. Extensive experiments on multi-domain, out-of-distribution benchmarks have demonstrated the effectiveness of our approach.

CCS CONCEPTS
• Computing methodologies → Unsupervised learning; Neural networks.

KEYWORDS
graph representation learning; graph neural networks; graph pre-training; graph matching; self-supervised learning; contrastive learning

ACM Reference Format:
Anonymous Author(s). 2021. Neural Graph Matching for Pre-training Graph Neural Networks. In CIKM '21: 30th ACM International Conference on Information and Knowledge Management, November 01–05, 2021, Online. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/1122445.1122456

1 INTRODUCTION
In the past few years, graph neural networks (GNNs) have emerged as a powerful technical approach for graph representation learning [15, 23, 47]. By leveraging graph structure as well as node and edge features, GNNs can effectively learn low-dimensional representation vectors for each node or for the entire graph. However, applying GNNs to downstream applications usually requires abundant task-specific labeled data, which can be extremely scarce in practice. To alleviate the data scarcity issue [30], pre-training GNNs [18, 19] has been proposed as a promising solution. It first learns a transferable and expressive GNN on a large amount of unlabeled graphs or coarse-grained labeled graphs. Then, the pre-trained GNNs are fine-tuned on downstream datasets with task-specific labels.

For GNN pre-training, existing studies mostly focus on the design of suitable self-supervised tasks, such as graph structure reconstruction [15, 18, 19], mutual information maximization [32, 41, 48] and property prediction [18]. Generally, these tasks can be classified into two main categories: (1) node-level tasks utilize node representations to predict localized properties in the graph (e.g., link prediction); (2) graph-level tasks focus on the entire graph and learn graph representations when designing the optimization goal (e.g., graph property prediction).

Given the two kinds of GNN pre-training tasks, it is essential to combine node- and graph-level optimization goals [18], since they capture graph characteristics from different views. Existing approaches either adopt a two-stage scheme arranging multi-level pre-training tasks sequentially [18], or frame them in a multi-task learning manner [29]. In either case, each individual pre-training task is not aware of all the optimization goals at different levels, which might result in locally optimal representations w.r.t. some specific level (e.g., node- or graph-level). Ideally, a good pre-training task should capture node- and graph-level characteristics simultaneously in order to derive more comprehensive representations.

[Figure 1 here: it contrasts static pooled graph representations (top), where the shared substructure is pooled together with the rest of the graph, against graph-matching-based adaptive contextual graph representations (bottom), on positive and negative pairs of graphs sharing a substructure.]
Figure 1: An example of neural graph matching and comparison with existing studies of static graph representations.

For this purpose, we attempt to design new GNN pre-training tasks that are able to learn node- and graph-level graph semantics in one single pre-training task.

Our solution is based on neural graph matching [27, 42, 46], a neural approach to learning structural correspondence among graphs. We present an illustrative example of our idea in Figure 1. At each time, a pair of associated graphs (e.g., graphs with the same labels, or two augmented graph views) is given, and we evaluate whether the two graphs have similar structural properties. As a major advantage, neural graph matching naturally combines node-level correspondence (e.g., v_1 to v_2) and graph-level properties (e.g., whether the graphs contain a shared substructure) when establishing their correspondence. That is the major reason why we adopt it as the GNN pre-training task. Another merit of this approach is that a graph will correspond to different representations when paired with different graphs. As shown in this example, we derive different representations for graph A when it is paired with graph A_1 or A_2, since neural graph matching enforces one graph to refer to the other graph's information when learning graph representations. Therefore, we call the learned representations adaptive graph representations. As a comparison, existing graph-level pre-training tasks usually adopt static graph representations.

To this end, in this paper, we propose a novel Graph Matching based GNN Pre-Training method, named GMPT. The key contribution lies in a neural graph matching module, where we pair two associated graphs as input at each time. To learn structural correspondences, we perform intra-graph message passing as well as inter-graph message passing. In this way, the representations of a given graph are learned by referring to the other paired graph, which yields adaptive graph representations. Such a method can capture both node- and graph-level characteristics when learning the graph representations. The proposed method can be applied to both the fully self-supervised and the coarse-grained supervised pre-training setting. In the self-supervised setting, GNNs are optimized by a graph matching based contrastive loss. To accelerate the learning over graph pairs during pre-training, we further propose an approximate contrastive training strategy that significantly reduces the time/memory consumption without loss of accuracy. In the supervised setting, we design different supervised tasks according to different kinds of coarse-grained labels.

To summarize, our main contributions can be presented as:
(1) We design a new GNN pre-training task based on neural graph matching, where both node- and graph-level characteristics are considered in a single pre-training task.
(2) We apply the proposed pre-training method in both self-supervised and coarse-grained supervised settings, where we learn adaptive graph representations and propose an approximate contrastive training strategy to reduce the time/memory consumption.
(3) Extensive experiments on public out-of-distribution benchmarks from multiple domains, with various GNN architectures, have demonstrated the effectiveness of our approach.

2 RELATED WORK
In this section, we review the most related work on graph neural networks, pre-training and graph matching.

Graph neural networks. Recent years have witnessed remarkable progress of graph neural networks (GNNs) in characterizing graph-structured data [44]. To learn powerful representations of graph data, a surge of GNNs have been proposed, which roughly fall into two lines. The first line focuses on learning node representations in the spectral domain. As a pioneering and representative work, GCN [23] simplifies the Chebyshev polynomial in graph filtering [2, 8] based on a first-order approximation and naturally captures local structure via the graph Laplacian. Correspondingly, the other line aims at aggregating and updating local structural information in the spatial domain. Hence, a series of strategies have been proposed to guide the message passing in GNNs, including mean/max/LSTM aggregation [15], attention mechanisms [40] and graph structure pooling [11]. Moreover, several recent efforts have been made to improve training efficiency through layer sampling [4, 20] and graph partitioning [5, 50]. For a thorough review, please refer to the elaborate surveys [44, 51]. However, training GNNs in an effective and efficient way remains greatly challenging, since domain-specific labels are always scarce.

Pre-training graph neural networks. To alleviate the above issues, pre-training for graph neural networks has drawn much attention recently. It empowers GNNs to capture the structural and semantic information of the input unlabeled graphs (or graphs with a few coarse-grained labels), followed by several fine-tuning steps on the downstream tasks of interest. Obviously, developing effective supervised (self-supervised) signals that guide GNNs to exploit structural and semantic properties of the original graphs is at the heart of this line of work. Generally, existing supervised signals can be classified into two main categories. The first is node-level tasks, which aim at predicting localized properties utilizing node representations, such as graph structure reconstruction [15, 18, 19, 26, 29], localized attribute prediction [18, 19, 34] and node representation recovery [16]. The other is graph-level tasks, which define a globalized optimization goal for the entire graph, such as graph property prediction [18, 34] and mutual information maximization [29, 32, 41, 48]. Our proposed framework differs from the above approaches in two aspects: hybrid-level pre-training and adaptive graph representations, as discussed in detail in Section 1.

Graph matching. Graph matching refers to establishing node correspondences between two (or among multiple) graphs [3], such that the similarity between the matched graphs is maximized. Some research focuses on the accuracy of the node correspondence and regards graph matching as a quadratic assignment programming (QAP) problem [28], which is NP-complete [12]. Thus, researchers mainly employ approximate techniques to seek inexact solutions, such as spectral approximation [7, 24], doubly-stochastic approximation [6, 14, 25], and learning based approaches [3, 10, 42, 49]. Other works care about the calculated similarity between graphs. Early efforts are mainly based on heuristic rules, such as minimal graph edit distance [33, 43] and graph kernel methods [13, 21, 35, 36]. With the development of GNNs, recent works leverage the message passing technique to explore neural-based graph matching [27, 37, 46]. In this work, we apply neural graph matching to GNN pre-training, to learn adaptive graph representations and encourage GNNs to integrate localized and globalized domain-specific features.
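For concreteness, the QAP view of graph matching mentioned above can be written as the following Lawler-style objective. This is a standard formulation summarized here for reference (cf. [28]); it is not a formula taken from this paper.

```latex
% Graph matching as a quadratic assignment program (QAP):
% X is a (partial) permutation matrix assigning nodes of G_1 to nodes of G_2,
% and K encodes node-to-node and edge-to-edge affinities between the two graphs.
\max_{X}\ \operatorname{vec}(X)^{\top} K \operatorname{vec}(X)
\quad \text{s.t.}\quad X \in \{0,1\}^{|\mathcal{V}_1|\times|\mathcal{V}_2|},\;
X\mathbf{1} \le \mathbf{1},\; X^{\top}\mathbf{1} \le \mathbf{1}.
```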

[Figure 2 here: a pair of graphs is encoded by a GNN; the neural graph matching module exchanges intra-graph and inter-graph messages to produce adaptive graph representations; these feed a contrastive loss for self-supervised GNN pre-training, or regression/classification losses through a linear head for supervised pre-training, combining localized and globalized signals in a hybrid task; the pre-trained GNN is then fine-tuned on downstream tasks with fine-grained properties.]

Figure 2: Overall framework of our proposed graph matching based GNN pre-training methods.

3 PRELIMINARIES
In this section, we first introduce the task of graph supervised learning and then present an overview of the general GNN architecture [15].

Graph supervised learning. A graph can be represented as G = (\mathcal{V}, \mathcal{E}, X, E), where \mathcal{V} = \{v_1, v_2, \ldots, v_{|\mathcal{V}|}\} denotes the node set, \mathcal{E} \subseteq \mathcal{V} \times \mathcal{V} denotes the edge set, and X \in \mathbb{R}^{|\mathcal{V}| \times d_v} and E \in \mathbb{R}^{|\mathcal{E}| \times d_e} represent the d_v- and d_e-dimensional attribute matrices for nodes and edges, respectively. Furthermore, each graph is possibly associated with some label y from a label set \mathcal{Y}. Given a set of graphs with labels \{(G_1, y_1), (G_2, y_2), \ldots, (G_N, y_N)\}, the purpose of graph supervised learning is to learn a graph representation h_G for the entire graph G and further utilize h_G to predict the corresponding label y. By sampling a node (called the central node) and its neighborhood as a subgraph, tasks defined on the central node can be viewed as graph classification tasks, e.g., domain classification on citation networks [23, 29, 40]. In this case, the labels of the sampled subgraph are actually the central node's labels.

Graph Neural Networks (GNNs). GNNs leverage graph structure as well as node and edge features to learn a representation vector h_v for each node v \in G, or h_G for the entire graph. Modern GNNs adopt a neighborhood aggregation strategy: each node's representation is iteratively updated by aggregating the representations of its neighboring nodes and edges [15]. After k iterations of aggregation, a node's representation captures the structural features of its k-hop network neighborhood. Formally, the k-th layer of a GNN can be summarized by the following expressions:

g_v^{(k)} = \mathrm{Aggregate}^{(k)}\big(\{(h_v^{(k-1)}, h_u^{(k-1)}, e_{uv}) \mid u \in \mathcal{N}_v\}\big),  (1)
h_v^{(k)} = \mathrm{Update}^{(k)}\big(h_v^{(k-1)}, g_v^{(k)}\big),  (2)

where h_v^{(k)} is the representation of node v at the k-th layer/iteration, e_{uv} is the feature vector of the edge between nodes u and v, \mathcal{N}_v is the set of nodes adjacent to node v, and h_v^{(0)} is initialized as node v's attributes x_v. To obtain the representation of the entire graph, a permutation-invariant function READOUT pools node features after the last iteration K:

h_G = \mathrm{READOUT}\big(\{h_v^{(K)} \mid v \in G\}\big).  (3)
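As a concrete illustration of Eqs. (1)-(3), the following is a minimal PyTorch sketch of one aggregation/update layer with sum aggregation and a mean READOUT. It is only an illustrative instance of the generic interface above, not the exact encoder used in the paper (which follows GIN [45] and related architectures); the tensor layouts and module names are our own assumptions.

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """One message-passing layer: g_v = Aggregate({h_v, h_u, e_uv}), h_v = Update(h_v, g_v)."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(3 * dim, dim)   # message built from (h_v, h_u, e_uv)
        self.upd = nn.GRUCell(dim, dim)      # Update(h_v^{k-1}, g_v^{k})

    def forward(self, h, edge_index, edge_attr):
        # h: [N, dim] node states, edge_index: [2, E] with rows (source u, target v), edge_attr: [E, dim]
        u, v = edge_index
        m = self.msg(torch.cat([h[v], h[u], edge_attr], dim=-1))   # one message per edge
        g = torch.zeros_like(h).index_add_(0, v, m)                # sum-aggregate at target nodes
        return self.upd(g, h)                                      # updated node states

def readout(h, batch, num_graphs):
    """Permutation-invariant READOUT (mean pooling), Eq. (3); batch[i] = graph id of node i."""
    summed = torch.zeros(num_graphs, h.size(-1), device=h.device).index_add_(0, batch, h)
    counts = torch.zeros(num_graphs, device=h.device).index_add_(
        0, batch, torch.ones(h.size(0), device=h.device))
    return summed / counts.clamp(min=1).unsqueeze(-1)
```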
In this work, we focus on pre-training GNNs: GNNs that are initialized with pre-trained parameters are fine-tuned according to various downstream tasks. Given the graph learning task defined above, we consider two kinds of GNN pre-training paradigms, based on whether graph labels are used during pre-training: the self-supervised setting (without graph labels) and the supervised setting (with graph labels).

4 THE PROPOSED METHOD
In this section, we present the proposed graph matching based GNN pre-training method GMPT, for both the self-supervised setting and the coarse-grained supervised setting. Our approach takes pairs of graphs as input. With a carefully designed cross-graph message passing mechanism, one graph can be adaptively encoded into different graph representations when paired with different graphs. Figure 2 presents the overall architecture of our proposed framework. Next, we describe each part in detail.

4.1 Self-supervised Pre-training
In the self-supervised setting, we have no labeled data available for pre-training. The pre-training task is to evaluate whether a pair of augmented graph views is generated from the same graph. We adopt contrastive learning to construct the learning objective, and name our approach in this setting GMPT-CL.

4.1.1 Graph Representation Learning via Graph Matching. We first present how to learn graph representations via graph matching.

Graph Augmentation and Encoding. Given a list of n graph examples, we first apply stochastic data augmentation to transform each graph example into two correlated views at random (2n views in total). We consider various augmentation techniques, including node/edge perturbation [48], subgraph sampling [48], diffusion [17], and adaptive methods [52]. The selection of graph augmentation techniques depends on the actual data domain [48]. To pre-train an expressive GNN encoder, we consider whether a pair of graph views, denoted by \tilde{G}_1 and \tilde{G}_2, are matched or not based on their graph representations. Specifically, we first apply the GNN encoder described in Section 3 to obtain node representations in the two graph views. Let h_s^{(1)} and h_t^{(2)} denote the representations of node s from \tilde{G}_1 and node t from \tilde{G}_2, respectively.
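To make the augmentation step concrete, here is a minimal sketch of one commonly used technique, random edge perturbation, on a PyTorch Geometric style graph. The function name and the 10% drop ratio are illustrative assumptions; the augmentations actually used in the experiments follow [48] and Section 5.1.3.

```python
import torch
from torch_geometric.data import Data

def edge_perturbation(graph: Data, ratio: float = 0.1) -> Data:
    """Return an augmented view by randomly dropping a fraction of edges
    (a simple edge-perturbation augmentation; other choices include node
    dropping, subgraph sampling or graph diffusion)."""
    num_edges = graph.edge_index.size(1)
    keep = torch.rand(num_edges) >= ratio          # boolean mask of edges to keep
    return Data(
        x=graph.x,
        edge_index=graph.edge_index[:, keep],
        edge_attr=None if graph.edge_attr is None else graph.edge_attr[keep],
    )

# Two correlated views of the same graph g form one positive pair:
# view_1, view_2 = edge_perturbation(g), edge_perturbation(g)
```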
Neural Graph Matching. Following recent progress in neural graph matching [27, 42], we incorporate message passing within a graph (called intra-graph messages) and between a pair of graphs (called inter-graph messages). Given an intra-graph node pair ⟨s, t⟩ and an inter-graph node pair ⟨s', t'⟩, we define the two kinds of message passing formally as:

m_{s \to t} = \mathrm{MSG}_{\mathrm{intra}}\big(h_s^{(1)}, h_t^{(1)}, e_{st}\big),  (4)
\mu_{s' \to t'} = \mathrm{MSG}_{\mathrm{inter}}\big(h_{s'}^{(1)}, h_{t'}^{(2)}\big),  (5)

where m_{s \to t} and \mu_{s' \to t'} are intra-graph and inter-graph messages, respectively. Intra-graph message passing can be defined in a similar way to standard GNN architectures, like GIN [45]. For \mathrm{MSG}_{\mathrm{inter}}, we adopt a cross-graph attention mechanism:

\mu_{s' \to t'} = a_{s' \to t'} \cdot h_{s'}^{(1)}, \quad
a_{s' \to t'} = \frac{\exp\big(\mathrm{sim}(h_{s'}^{(1)}, h_{t'}^{(2)})\big)}{\sum_{k \in \tilde{G}_2} \exp\big(\mathrm{sim}(h_{s'}^{(1)}, h_k^{(2)})\big)},  (6)

where \mathrm{sim}(\cdot) is a similarity function, such as dot product or cosine similarity. The above attention mechanism allows adaptive message exchange between the paired graphs. Intuitively, messages passed between similar substructures of the two graphs receive higher attention weights. Here, we normalize the attention weights of the messages from the same source node, which means \sum_{t'} a_{s \to t'} = 1. Alternatively, one can normalize the attention weights of the messages to the same target node (\sum_{t'} a_{t' \to s} = 1).
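The cross-graph attention in Eq. (6) can be sketched as follows. This is a minimal dense implementation assuming dot-product similarity and source-side normalization, with our own tensor names; it is not the authors' released code.

```python
import torch

def inter_graph_messages(h1: torch.Tensor, h2: torch.Tensor) -> torch.Tensor:
    """Cross-graph attention messages of Eq. (6).
    h1: [n1, d] node states of view G1 (sources s'), h2: [n2, d] node states of view G2 (targets t').
    Returns the aggregated inter-graph message received by each node t' of G2, shape [n2, d]."""
    sim = h1 @ h2.t()                  # [n1, n2], dot-product similarity sim(h_{s'}, h_{t'})
    a = torch.softmax(sim, dim=1)      # normalize over targets t' for each source node s'
    # message arriving at t' is sum_{s'} a_{s'->t'} * h_{s'} (attention-weighted sum of source states)
    return a.t() @ h1                  # [n2, d]
```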
Match Enhanced Graph Representations. After passing intra-graph messages from each node's neighbors (denoted by \mathcal{N}_{\mathrm{intra}}) and inter-graph messages from all the nodes of the other graph (denoted by \mathcal{N}_{\mathrm{inter}}), we aggregate the messages together and update the node states to obtain the contextual node features Z. For a node t, we update its original representation h_t as:

z_t = \mathrm{Update}\Big(h_t,\; \sum_{s \in \mathcal{N}_{\mathrm{intra}}} m_{s \to t},\; \sum_{s' \in \mathcal{N}_{\mathrm{inter}}} \mu_{s' \to t}\Big),  (7)

where we use the sum operation for aggregation. Finally, we obtain the entire graph's adaptive representation z_G by employing a permutation-invariant function READOUT to pool the contextual node features:

z_{\tilde{G}_1} = \mathrm{READOUT}\big(\{z_v \mid v \in \mathcal{V}_1\}\big),  (8)
z_{\tilde{G}_2} = \mathrm{READOUT}\big(\{z_{v'} \mid v' \in \mathcal{V}_2\}\big).  (9)

Note that, when involved in different pairs, a given graph will correspond to different representations in our approach. This is a key merit for the subsequent pre-training task, since it adaptively captures structural correspondences instead of using static graph representations as in previous studies [18, 48].

4.1.2 Contrastive Learning with Adaptive Graph Representations. Contrastive learning is a commonly used technique for learning with augmented graph views in pairs [48]. It aims to increase the similarity scores of positive pairs and decrease the similarity scores of negative pairs. However, existing graph contrastive learning methods mainly adopt static graph representations [17, 48, 52], where node-level interaction across graphs is not explicitly modeled. As a comparison, given a pair of graph views, we first apply the neural graph matching technique (Section 4.1.1) to characterize inter-graph interaction, and then construct the contrastive loss based on the adaptive graph representations.

Formally, given a positive pair (\tilde{G}_i, \tilde{G}_j), we first adaptively encode them into contextual graph representations z_{\tilde{G}_i} and z_{\tilde{G}_j} (Eq. (8) and Eq. (9)), and then formalize the contrastive loss as:

\ell_{i,j} = -\log \frac{\exp(s_{i,j}/\tau)}{\sum_{k \neq i} \exp(s_{i,k}/\tau)},  (10)

where s_{i,j} = \mathrm{sim}(z_{\tilde{G}_i}, z_{\tilde{G}_j}) and \tau is a temperature parameter. In practice, we usually have a batch of graph views, and we enumerate all possible pairs for the denominator of Eq. (10). Although the above contrastive loss is also defined at the graph level (whether two views are augmented from the same graph), the derived graph representations z_{\tilde{G}_i} and z_{\tilde{G}_j} are enhanced with inter-graph node interaction via neural graph matching. As such, optimizing \ell_{i,j} encourages GNNs to capture both node- and graph-level characteristics in the graph representations.
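A minimal sketch of the objective in Eq. (10) over a batch of views might look as follows. It assumes that the pairwise graph-matching similarities s_{i,k} have already been computed (each entry requires one matching pass over the pair), which hides the adaptive-encoding bookkeeping of the full method.

```python
import torch
import torch.nn.functional as F

def matching_contrastive_loss(sim: torch.Tensor, pos_index: torch.Tensor, tau: float = 0.07):
    """Contrastive loss of Eq. (10).
    sim: [2n, 2n] matrix with sim[i, k] = s_{i,k}, the matching-based similarity of views i and k.
    pos_index: [2n] long tensor, pos_index[i] = j, the view augmented from the same graph as view i."""
    logits = sim / tau
    # a view is not contrasted with itself: exclude the diagonal from the denominator
    self_mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    logits = logits.masked_fill(self_mask, float("-inf"))
    log_prob = F.log_softmax(logits, dim=1)                    # log( exp(s_ij/tau) / sum_{k!=i} exp(s_ik/tau) )
    loss = -log_prob[torch.arange(sim.size(0)), pos_index]
    return loss.mean()
```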

4.1.3 Approximate Contrastive Training. A major issue with graph matching is that it incurs a quadratic time and space cost in the number of nodes. Here, we propose an approximate contrastive training strategy to improve the algorithm's efficiency. We consider the setting with a mini-batch of n graphs. As mentioned before, we generate 2n augmented graph views for graph matching and contrastive learning. Typically, for a mini-batch of n graphs, GMPT-CL considers 2n × 2n graph comparisons (two augmented views per graph) in total. Each comparison performs a node-to-node similarity calculation (refer to Eq. (4)), taking an additional cost of O(m^2 · d) time and space, where m = \sum_{i=1}^{2n} |\mathcal{V}_i| denotes the total number of nodes in the 2n graph views and d is the dimensionality of the representation vectors.

In order to reduce the time and memory consumption, the key idea is to perform an approximate calculation of the proposed contrastive loss (Eq. (10)): we sample q out of the 2n graph views to contrast with all the other views (q × 2n comparisons in total). In this way, the additional expected time complexity is reduced to O((q/2n) · m^2 · d). For a further reduction of the space complexity, we adopt the gradient accumulation technique. For each sampled view, we perform 1 × 2n comparisons. After calculating the contrastive loss, the model backpropagates the prediction error and computes the gradients, but does not update the model parameters immediately. Instead, the gradients are accumulated until all q samples have been processed. In this way, the q sampled views only require an additional space complexity of O((1/2n) · m^2 · d).

We give a brief theoretical argument in Section 4.1.4 that the proposed approximate contrastive training fits the formulation of the InfoNCE loss [38, 39] in expectation. Empirically, the experiments in Section 5.3 show that the performance of the fine-tuned GNNs on downstream datasets is not hurt (and is even improved) by the proposed approximate contrastive training. The overall pre-training algorithm of GMPT-CL for one mini-batch is presented in Algorithm 1. Lines 1-4 show the graph augmentation procedure (Section 4.1.1), and line 5 shows the sampling step of approximate contrastive training. In lines 7-9, we build the positive pair of views used in Eq. (10), and lines 11-12 show the gradient accumulation and parameter optimization.
Algorithm 1: The pre-training algorithm of GMPT-CL for one mini-batch of n graphs.
Input: a mini-batch of n graphs {G_1, G_2, ..., G_n}
1   S_view ← ∅
2   foreach G ∈ {G_1, G_2, ..., G_n} do
3       Generate two views G̃_1 and G̃_2
4       S_view ← S_view ∪ {G̃_1, G̃_2}
5   S'_view ← q views randomly sampled from S_view
6   foreach G̃_1 ∈ S'_view do
7       foreach G̃ ∈ S_view do
8           if G̃ is generated from the same graph as G̃_1 then
9               G̃_2 ← G̃
10      Calculate ℓ_{1,2} with Eq. (10)
11      Backpropagate and accumulate the gradients
12  Update model parameters according to the accumulated gradients
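A condensed PyTorch-style sketch of Algorithm 1, sampling q anchor views and accumulating gradients before a single optimizer step, is given below. Here encode_pair, which runs the GNN plus the matching module on a pair of views and returns their similarity s, is a hypothetical helper standing in for Section 4.1.1, and augment is any augmentation such as the earlier edge_perturbation sketch.

```python
import random
import torch
import torch.nn.functional as F

def train_minibatch(graphs, augment, encode_pair, optimizer, q=4, tau=0.07):
    """One GMPT-CL mini-batch with approximate contrastive training (Algorithm 1).
    graphs: list of n graphs; augment(g) returns one stochastic view;
    encode_pair(a, b) returns the scalar, differentiable matching similarity s_{a,b}."""
    views, origin = [], []
    for gid, g in enumerate(graphs):                   # lines 1-4: two views per graph
        views += [augment(g), augment(g)]
        origin += [gid, gid]

    anchors = random.sample(range(len(views)), q)      # line 5: sample q anchor views
    optimizer.zero_grad()
    for i in anchors:                                  # lines 6-11: one 1 x 2n comparison pass per anchor
        others = [k for k in range(len(views)) if k != i]
        sims = torch.stack([encode_pair(views[i], views[k]) for k in others])
        pos = [origin[k] for k in others].index(origin[i])   # the view from the same source graph
        loss = F.cross_entropy((sims / tau).unsqueeze(0), torch.tensor([pos]))
        (loss / q).backward()                          # accumulate gradients, do not step yet
    optimizer.step()                                   # line 12: single parameter update
```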
Continuous Labels. Labels with continuous properties can be
4.1.4 Theoretical Analysis. In this section, we provide a theoretical analysis to reveal the connection between GMPT-CL with approximate contrastive training and mutual information maximization.

First, we show that minimizing Eq. (10) is equivalent to maximizing a lower bound of the mutual information between the latent representations of the two views of a graph. The contrastive loss \ell_{i,j} in Eq. (10) can be rewritten for a batch of graphs (e.g., n views) as

\ell = -\frac{1}{n} \sum_{i}^{n} \log \frac{\exp(s_{i,j}/\tau)}{\sum_{k \neq i} \exp(s_{i,k}/\tau)}  (11)
    = \mathbb{E}\left[\frac{1}{n} \sum_{i}^{n} \log \frac{\sum_{k \neq i} \exp f_{i,k}}{\exp(f_{i,j})}\right]  (12)
    = \mathbb{E}\left[\frac{1}{n} \sum_{i}^{n} \log \sum_{k \neq i,j} \exp(f_{i,k} - f_{i,j})\right] + \log(n-2),  (13)

where the expectation \mathbb{E} is taken over N independent view pairs \{(G_i, G_j)\} from the joint distribution p(G_i, G_j), and f_{i,j} = s_{i,j}/\tau is a learnable graph matching based score function. Thus, Eq. (13) fits the formulation of the InfoNCE loss [38, 39], and minimizing Eq. (10) results in a tighter lower bound of the mutual information.

Furthermore, we show that approximate contrastive training has, in expectation, the same optimization lower bound as the original objective. We sample q views to contrast with all the n views. The loss can be rewritten in an expectation form:

\ell' = -\frac{1}{q} \sum_{i \sim U(1,n)} \log \frac{\exp f_{i,j}}{\sum_{k \neq i} \exp f_{i,k}}  (14)
     = \mathbb{E}\left[\mathbb{E}_{i \sim U(1,n)} \log \frac{\sum_{k \neq i} \exp f_{i,k}}{\exp f_{i,j}}\right]  (15)
     = \mathbb{E}\left[\frac{1}{n} \sum_{i}^{n} \log \sum_{k \neq i,j} \exp(f_{i,k} - f_{i,j})\right] + \log(n-2),  (16)

where Eq. (16) is identical to Eq. (13). The key point is that, by uniform sampling, the expectation of the contrastive loss of one single sample equals the averaged contrastive loss of the whole batch, as shown in Eq. (15).
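For reference, the standard InfoNCE bound from [38, 39] that underlies the argument above can be stated as follows; this is our paraphrase of the known result, not a derivation from this paper.

```latex
% InfoNCE lower bound on mutual information (cf. [38, 39]):
% with n - 1 candidate views in the denominator of Eq. (10), minimizing the
% contrastive loss tightens a lower bound on the mutual information between
% the representations of the two views of the same graph.
I\!\left(z_{\tilde{G}_i};\, z_{\tilde{G}_j}\right) \;\ge\; \log(n-1) \;-\; \ell_{i,j}.
```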

4.2 Supervised Pre-training
Besides a large amount of unlabeled graphs for self-supervised pre-training, we can sometimes obtain graphs with coarse-grained labels. Different from elaborately created fine-grained labels, coarse-grained labels have a relatively weak correlation with downstream task goals, but can be obtained more easily. For example, in molecular property prediction, we can easily collect the properties of molecules that have been experimentally measured so far [18]. In the supervised pre-training setting, we pre-train GNNs on graphs with coarse-grained labels, and the pre-trained GNNs are then fine-tuned according to fine-grained labels in downstream tasks. Note that the coarse-grained labels used for supervised pre-training are not the real labels of the downstream tasks. Based on whether the coarse-grained labels are continuous or discrete, we propose two variants of GMPT in the supervised setting, named GMPT-Sup and GMPT-Sup++, respectively.

Continuous Labels. Labels with continuous properties can be regarded as real-valued vectors. We assume that similar pairs of graphs also correspond to similar labels. Based on this consideration, we propose GMPT-Sup, which learns the similarity between graphs via the graph matching module and then minimizes the difference between the learned similarity and the actual label similarity. Given a pair of graphs G_1 and G_2 with coarse-grained labels y_1 and y_2, we first obtain their adaptive graph representations z_{G_1} and z_{G_2} as in Eq. (8) and Eq. (9). We then define the loss function as

\ell_c = \mathrm{MSE}(s_p, s_g),  (17)

where s_p = \mathrm{sim}(y_1, y_2), s_g = \mathrm{sim}(z_{G_1}, z_{G_2}), and \mathrm{MSE}(\cdot, \cdot) denotes the standard mean squared error loss. This loss drives the representation similarity of the two graphs to be close to their label similarity.

Discrete Labels. For discrete labels, we do not enforce a direct comparison of the two graphs in a pair. Instead, we construct a classification based approach, GMPT-Sup++, that associates graph representations with the corresponding coarse-grained labels. For a given pair of graphs G_1 and G_2 (encoded by the graph matching module into z_{G_1} and z_{G_2}), we jointly predict the corresponding coarse-grained labels y_1 and y_2. The loss for discrete labels \ell_d is calculated as

\ell_d = \sum_{i=1,2} \mathrm{BCE}(y_i, W_i \cdot z_{G_i} + b_i),  (18)

where \mathrm{BCE}(\cdot, \cdot) denotes the binary cross-entropy loss, and y_1 and y_2 are vectorized representations of the labels y_1 and y_2, respectively. Although the losses of the two graphs in a pair are calculated separately, their representations are obtained from the graph matching module (e.g., the cross-graph message passing in Eq. (5)).
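A minimal sketch of the two supervised objectives in Eqs. (17) and (18) is shown below; the dot-product label similarity, the logit-based binary cross-entropy and the per-graph linear heads are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def gmpt_sup_loss(z1, z2, y1, y2):
    """GMPT-Sup, Eq. (17): match representation similarity to label similarity (continuous labels).
    z1, z2: [d] adaptive graph representations; y1, y2: [c] real-valued coarse-grained labels."""
    s_g = torch.dot(z1, z2)        # similarity of the matched graph representations
    s_p = torch.dot(y1, y2)        # similarity of the coarse-grained labels
    return F.mse_loss(s_g, s_p)

def gmpt_sup_pp_loss(z1, z2, y1, y2, head1: torch.nn.Linear, head2: torch.nn.Linear):
    """GMPT-Sup++, Eq. (18): predict each graph's multi-label coarse-grained labels from its
    matching-enhanced representation with a linear head and binary cross-entropy."""
    return (F.binary_cross_entropy_with_logits(head1(z1), y1) +
            F.binary_cross_entropy_with_logits(head2(z2), y2))
```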
5 EXPERIMENTS
In this section, we conduct extensive experiments to verify the effectiveness of our proposed methods in both the self-supervised and supervised settings. Moreover, we give an in-depth analysis of the training strategy, key parameters and transferability.

5.1 Experimental Setup
5.1.1 Datasets. We conduct experiments on two public (sub)graph classification benchmarks from different domains. Dataset statistics are summarized in Table 1. Below are the descriptions of the datasets:

• Bio [18] is a biology dataset for protein function prediction, where nodes are proteins and different edge types denote the kinds of relationships that exist between a pair of proteins. To evaluate the pre-training methods' transferability and out-of-distribution generalization, the downstream dataset is split by species, which means the pre-trained model is tested on new species.
• Chem [18] is a chemistry dataset for molecular property prediction, where nodes and edges represent atoms and chemical bonds, respectively. As above, the downstream dataset is split by scaffold (molecular substructure), so the pre-training data and the downstream data follow different distributions.

Table 1: Statistics of the datasets. PT denotes Pre-Training and FT denotes Fine-Tuning. * denotes the summation over 8 downstream datasets.

                                        Bio      Chem
#(sub)graphs for self-supervised PT     307K     2,000K
#(sub)graphs for supervised PT/FT       88K      456K
#Coarse-grained labels for PT           5,000    1,310
#Downstream FT tasks                    40       678*

Note that in Bio, graph instances are subgraphs sampled from the original giant network via random walk [18], while in Chem, graphs are individual molecules. We strictly adopt the same splitting and pre-processing of these benchmarks as previous work. Please refer to Hu et al. [18] for more details about the benchmarks.

5.1.2 Baselines. We compare our pre-training methods with the following representative GNN pre-training methods:

• Infomax [41] maximizes the mutual information between patch representations and corresponding high-level summaries of graphs.
• EdgePred [15] directly predicts the connectivity of node pairs, a.k.a. the link prediction task.
• ContextPred [18] samples each node's neighborhood subgraph and its surrounding context subgraph. The task is to predict whether a given pair of neighborhood graph and context graph is generated by the same center node.
• AttrMasking [18] predicts node or edge attributes that are randomly masked.
• GraphCL [48] contrasts the static representations of augmented views and judges whether they are generated from the same graph.
• L2P-GNN [29] selects link prediction as the node-level task and Infomax as the graph-level task, and then utilizes meta-learning to alleviate the divergence between multi-task pre-training and fine-tuning objectives.
• PropPred [18] predicts the coarse-grained labels of graphs in the pre-training datasets.

To our knowledge, the selected baselines comprehensively cover recently proposed pre-training methods for GNNs. Among them, PropPred is designed for the supervised pre-training setting, while the others target the self-supervised pre-training setting. Moreover, the results of the non-pre-trained model are also reported.

5.1.3 Parameter Settings. To enhance reproducibility, we present the implementation details as follows.

GNN architecture. We mainly experiment on Graph Isomorphism Networks (GINs) [45], the most expressive GNN architecture for graph-level prediction tasks. We also experiment with other popular architectures: GCN [23], GraphSAGE [15] and GAT [40]. Following previous work [18, 29], we select the same hyper-parameters: 300-dimensional hidden units and 5 GNN layers (K = 5). As for the READOUT function, following the guidance of previous works [18, 48], we select average pooling for the Bio and Chem datasets (except for L2P-GNN). Note that a learnable graph pooling layer is adopted for the baseline L2P-GNN according to the restriction for pre-training in its original paper. For our proposed graph matching methods, we adopt the dot product for the sim(·) function. In GMPT-CL, we set τ = 0.07 in Eq. (10).

Pre-training and fine-tuning settings. Results of baselines on different datasets are taken directly from the literature when available; for the others, the optimization of pre-training and fine-tuning is as follows. We implement all models using PyTorch [31] and PyTorch Geometric [9], and train them with the Adam optimizer [22]. We pre-train the models with a learning rate of 0.001, and fine-tune the GNNs with a learning rate tuned in {0.01, 0.001, 0.0001} for all the methods. For self-supervised pre-training, we use a batch size of 16 for 5 epochs on the Bio dataset, and a batch size of 32 for 10 epochs on the Chem dataset, which has fewer nodes per graph. We sample q = 4 views per batch in approximate contrastive training (Section 4.1.3). As You et al. [48] suggest, we use edge perturbation for graph augmentation on Bio, and node dropping and subgraph augmentation on Chem. For supervised pre-training, we use the same hyper-parameters as PropPred [18]. We report ROC-AUC [1] for Bio and Chem. The downstream experiments are run with 10 random seeds, and we report the mean and standard deviation of the reported metrics. Our code and pre-trained models will be released after the review period.
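The main hyper-parameters reported above can be gathered in one place. This is a convenience summary of the stated settings (illustrative grouping and key names), not a verbatim configuration file from the authors' code.

```python
# Summary of the pre-training / fine-tuning settings reported in Section 5.1.3.
GMPT_CL_CONFIG = {
    "gnn": {"hidden_dim": 300, "num_layers": 5, "readout": "mean"},   # GIN/GCN/GraphSAGE/GAT backbones
    "matching": {"sim": "dot_product", "tau": 0.07},                  # temperature in Eq. (10)
    "pretrain": {
        "optimizer": "Adam", "lr": 1e-3,
        "bio": {"batch_size": 16, "epochs": 5, "augmentation": "edge_perturbation"},
        "chem": {"batch_size": 32, "epochs": 10, "augmentation": ["node_drop", "subgraph"]},
        "sampled_views_q": 4,                                         # approximate contrastive training
    },
    "finetune": {"lr_grid": [0.01, 0.001, 0.0001], "metric": "ROC-AUC", "seeds": 10},
}
```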

Table 2: Evaluation in the self-supervised setting on Bio. We report ROC-AUC (%) performance using different pre-training methods with different GNN architectures. The macro-average results over all the popular GNN architectures are reported in the rightmost column. In L2P-GNN, the GNN is fine-tuned with a parameterized global pooling layer, while the others use average pooling. The improvement of our method over the best baseline is significant at the 0.01 level with a paired t-test (except for GraphSAGE).

Pre-training method   GCN             GraphSAGE       GAT             GIN             Average
-                     63.20 (±1.00)   65.70 (±1.20)   68.20 (±1.10)   64.80 (±1.00)   65.48
Infomax               62.83 (±1.22)   67.21 (±1.84)   66.94 (±2.61)   64.10 (±1.50)   65.27
EdgePred              63.18 (±1.12)   66.05 (±0.78)   65.72 (±1.17)   65.70 (±1.30)   65.16
ContextPred           62.81 (±1.87)   66.47 (±1.27)   67.86 (±1.19)   65.20 (±1.60)   65.59
AttrMasking           62.40 (±1.35)   63.32 (±1.01)   61.72 (±2.70)   64.40 (±1.30)   62.96
GraphCL               67.05 (±1.16)   71.53 (±0.46)   65.68 (±3.98)   67.88 (±0.85)   68.04
L2P-GNN               66.48 (±1.59)   69.89 (±1.63)   69.15 (±1.86)   70.13 (±0.95)   68.91
GMPT-CL               70.65 (±0.53)   70.29 (±0.21)   71.07 (±0.14)   72.53 (±0.42)   71.13

5.2 Performance Comparison
5.2.1 Self-supervised Setting. Table 2 presents the performance comparison in the self-supervised pre-training setting between GMPT-CL and the baselines on Bio.

We observe that node-level methods (i.e., EdgePred, ContextPred and AttrMasking) and graph-level methods (i.e., Infomax and GraphCL) both perform similarly to the non-pre-trained GNN, since these methods capture only one side of the features (either localized or globalized) during pre-training. Among these baselines, GraphCL achieves better performance, showing the strong transferability of graph contrastive learning, as also empirically shown in You et al. [48]. L2P-GNN frames the node-level task and graph-level task in a multi-task manner and, as shown, achieves the best average performance among the baselines. However, as discussed in Section 1, the multi-task manner might result in locally optimal representations.

The proposed pre-training method GMPT-CL achieves the new best performance of 72.53% in the self-supervised pre-training setting on Bio with the most expressive GNN architecture, GIN. When applying GMPT-CL to the other currently popular GNN architectures, as shown in Table 2, GMPT-CL achieves the best macro-average result (71.13%) over all the compared baseline methods. In particular, GMPT-CL is powerful even with less expressive GNN architectures like GCN or GAT, where it brings 3.60% and 1.92% absolute gains over the best baseline, respectively.

Table 3 presents the performance comparison with GIN in the self-supervised pre-training setting over the 8 downstream datasets of Chem. Generally, none of the compared pre-training methods achieves the best performance on every downstream task of Chem. GMPT-CL obtains the best macro-average result (71.5%) compared to all the baselines, although it cannot achieve the best performance on all 8 downstream datasets.

In sum, we make the following observations.
(1) On average, the proposed GMPT-CL yields the best performance on benchmarks from different domains (72.53% on Bio and 71.50% on Chem, both with GIN). As GMPT-CL is a hybrid-level pre-training task, it encourages GNNs to capture both localized and globalized domain-specific semantics. The graph matching module of GMPT-CL can generate adaptive graph representations, in which shared substructures are enhanced.
(2) Pre-training GNNs with a large amount of unlabeled data is clearly helpful for downstream tasks, as GMPT-CL brings 6.69% and 4.5% absolute gains over the non-pre-trained models on the macro-average results of the two domains, respectively.

5.2.2 Supervised Setting. Table 4 presents the performance comparison between GMPT-Sup, GMPT-Sup++ and the baselines (i.e., non-pre-trained GIN and PropPred) in the supervised pre-training setting. Although the coarse-grained labels in the Bio dataset are multi-hot vectors, we still view them as continuous properties in GMPT-Sup. In contrast, the coarse-grained labels in the Chem dataset contain plenty of missing values. As similarity over missing values is hard to define, GMPT-Sup is not a suitable method for labels with missing values, and we only report the result of GMPT-Sup++ on Chem.

The compared baseline PropPred always outperforms the non-pre-trained method, reflecting that GNNs pre-trained on coarse-grained labels can characterize domain-specific semantics. Compared with PropPred, which encodes graphs into static representations with a graph-level pre-training task, the proposed hybrid-level methods GMPT-Sup and GMPT-Sup++ generate adaptive graph representations (Section 4.1.1). Consequently, the proposed methods achieve the best performance on Bio and Chem, respectively, demonstrating the effectiveness of our pre-training methods. Besides, we notice that GMPT-Sup slightly outperforms GMPT-Sup++ on the Bio dataset. A possible reason is that directly predicting each single property of the coarse-grained labels (as GMPT-Sup++ does) may cause overfitting and limit the transferability of the pre-trained model, due to losing precious domain-specific semantics derived from the coarse-grained labels.

Table 3: Evaluation in the self-supervised setting on Chem. We report ROC-AUC (%) performance on molecular prediction benchmarks using different pre-training methods with GIN. The rightmost column averages the mean test performance across the 8 datasets.

Pre-training method   BBBP            Tox21           ToxCast         SIDER           ClinTox         MUV             HIV             BACE            Average
-                     65.80 (±4.50)   74.00 (±0.80)   63.40 (±0.60)   57.30 (±1.60)   58.00 (±4.40)   71.80 (±2.50)   75.30 (±1.90)   70.10 (±5.40)   67.00
Infomax               68.80 (±0.80)   75.30 (±0.50)   62.70 (±0.40)   58.40 (±0.80)   69.90 (±3.00)   75.30 (±2.50)   76.00 (±0.70)   75.90 (±1.60)   70.30
EdgePred              67.30 (±2.40)   76.00 (±0.60)   64.10 (±0.60)   60.40 (±0.70)   64.10 (±3.70)   74.10 (±2.10)   76.30 (±1.00)   79.90 (±0.90)   70.30
AttrMasking           64.30 (±2.80)   76.70 (±0.40)   64.20 (±0.50)   61.00 (±0.70)   71.80 (±4.10)   74.70 (±1.40)   77.20 (±1.10)   79.30 (±1.60)   71.10
ContextPred           68.00 (±2.00)   75.70 (±0.70)   63.90 (±0.60)   60.90 (±0.60)   65.90 (±3.80)   75.80 (±1.70)   77.30 (±1.00)   79.60 (±1.20)   70.90
GraphCL               69.68 (±0.67)   73.87 (±0.66)   62.40 (±0.57)   60.53 (±0.88)   75.99 (±2.65)   69.80 (±2.66)   78.47 (±1.22)   75.38 (±1.44)   70.80
L2P-GNN               66.42 (±1.10)   75.51 (±0.37)   63.13 (±0.50)   58.86 (±0.79)   66.15 (±5.00)   74.71 (±1.84)   77.31 (±1.22)   81.05 (±0.75)   70.40
GMPT-CL               66.68 (±1.96)   74.63 (±0.65)   63.13 (±0.37)   60.13 (±0.67)   76.00 (±1.32)   76.57 (±2.08)   77.54 (±1.18)   77.04 (±1.38)   71.50

Table 4: Evaluation in the supervised setting. We report ROC-AUC (%) performance on Bio and Chem using different supervised pre-training methods with GIN.

Pre-training method   Bio             Chem
-                     64.8 (±1.0)     67.0
PropPred              69.0 (±2.4)     70.0
GMPT-Sup              70.84 (±0.59)   -
GMPT-Sup++            70.73 (±0.42)   70.4

5.3 Training Strategy Analysis
As mentioned above, we propose an approximate contrastive training strategy to reduce the time and space consumption of GMPT-CL; the corresponding theoretical analysis of time and memory complexity can be found in Section 4.1.3. Here we conduct detailed experiments to show how the fine-tuned GNNs' performance is affected by the batch size n and the number of sampled views q, and what the actual running time and memory are when the proposed approximate training strategy is applied.

Performance Comparison w.r.t. Batch Size and the Number of Sampled Views. We pre-train GIN models with different batch sizes n ∈ {4, 8, 16} and numbers of sampled views q ∈ {1, 2, 4, 8, 16, 32}, and compare the fine-tuned models' performance on the downstream dataset of Bio. Note that for a certain batch size n, the maximum number of sampled views is q = 2n (2n views in total). In other words, the proposed approximate training strategy is not actually applied when q = 2n.

[Figure 3 here: panel (a) plots ROC-AUC (%) against the number of sampled views for batch sizes 4, 8 and 16; panel (b) plots time per epoch (h) and memory (G) against the number of sampled views for batch size 32, with the histogram labeled with memory values between roughly 3.8 and 5.5 GB.]
Figure 3: Tuning of approximate contrastive training with different numbers of sampled views on the Bio dataset. (a) Performance with different batch sizes. (b) Time/memory consumption with batch size 32. The labels on the histogram indicate the memory consumption.

As Figure 3(a) shows, with different batch sizes and numbers of sampled views, our method always reaches at least 69% ROC-AUC. Generally, GMPT-CL reaches its best performance when the number of sampled views is q = 4, and we notice that applying approximate contrastive training usually improves the performance of GMPT-CL. Besides, we find that even with a small sampling number of q = 1, our method still achieves competitive performance.

[Figures 4 and 5 here: Figure 4 shows, for L2P-GNN (a) and GMPT-CL (b), scatter plots of pre-trained versus non-pre-trained AUC over the 40 Bio sub-tasks; Figure 5 shows ROC-AUC (%) curves of GMPT-CL and L2P-GNN when varying (a) the hidden dimension in {10, 50, 100, 300, 500} and (b) the number of GNN layers in {2, 3, 4, 5}.]
Figure 4: Analysis of transferability for L2P-GNN (a) and GMPT-CL (b) over 40 out-of-distribution sub-tasks of Bio.
Figure 5: Tuning of different hyper-parameters of the GNN pre-trained with GMPT-CL on the Bio dataset.

Time and Memory Consumption w.r.t. the Number of Sampled Views. Here, we take an intuitive look at the actual time and space cost of the proposed approximate contrastive training strategy. As Figure 3(b) shows, as q decreases, the training time of GMPT-CL per epoch also decreases, which verifies that choosing a relatively small q can dramatically reduce the training time of GMPT-CL. Besides, we find that the memory consumption does not change much with different choices of q, verifying that the space complexity is not strongly affected by q.

In summary, we suggest adopting the proposed approximate contrastive training strategy when pre-training GNNs with GMPT-CL, as it has been shown to reduce the time/space consumption while even improving performance. For the choice of the number of sampled views q given the batch size n, we suggest selecting a relatively small q (i.e., 4 ≤ q ≤ n).

5.4 Transferability Analysis
The out-of-distribution problem widely exists in real-world applications, meaning that graphs in the training set are structurally very different from graphs in the test set [18]. Existing studies show that improperly designed GNN pre-training tasks may cause serious negative transfer, which means the pre-trained GNN obtains worse results than the non-pre-trained GNN on certain tasks. Thus, it is crucial to analyze the transfer status over the individual sub-tasks of out-of-distribution datasets.

Here we analyze the proposed GMPT-CL and the best baseline L2P-GNN, which are both self-supervised pre-training methods. The two methods are pre-trained and fine-tuned with GIN on the Bio dataset, and the transfer status is shown in Figure 4. The upper-left area indicates positive transfer while the lower-right area indicates negative transfer; the middle line marks the borderline, and another line marks the worst negative transfer across the 40 sub-tasks. As shown in Figure 4, compared to the best baseline L2P-GNN, the proposed GMPT-CL has fewer negative transfer cases (12 vs. 17), as well as a milder negative transfer extent (-0.07 vs. -0.13). Besides, the GNN pre-trained by GMPT-CL obtains an AUC result > 0.5 on all downstream sub-tasks of Bio. All the observations above show the good transferability of the proposed GMPT-CL method.

5.5 Parameter Tuning
In this part, we examine the robustness of our method and perform a detailed parameter analysis. For simplicity, we only report the performance of the self-supervised pre-training setting on the Bio dataset, and include the best baseline L2P-GNN from Table 2 as a comparison.

Varying the dimension d. Here we let the graph matching module of the proposed GMPT have the same dimension d as the pre-trained GNN, and we vary d in {10, 50, 100, 300, 500} to analyze the sensitivity of GMPT to the dimension. As shown in Figure 5(a), our method consistently outperforms the best baseline, and achieves its best performance at d = 300. Overall, the performance is stable for d ≥ 300, and it decreases sharply when d < 100.

Varying the number of GNN layers L. The number of GNN layers L is also an essential hyper-parameter in the design of GNNs [47]. Here we vary L from 2 to 5. As shown in Figure 5(b), our method reaches its best performance when L = 5. When L ≤ 4, our method obtains relatively stable results and marginally outperforms the best baseline.

6 CONCLUSION
In this work, we propose GMPT, a general graph matching based GNN pre-training framework. Through the neural graph matching module, with message passing both within and across graphs, we generate adaptive representations for the matched graphs, encouraging GNNs to learn both globalized and localized domain-specific semantics in a single pre-training task. We apply GMPT to both self-supervised pre-training and coarse-grained supervised pre-training. We also propose an approximate contrastive training strategy, which significantly reduces the time/memory consumption brought by the graph matching module. Extensive experiments on multi-domain, out-of-distribution benchmarks show the effectiveness and transferability of our method.

Besides the proposed neural graph matching approach, more hybrid-level GNN pre-training tasks can be explored in the future. In addition, we will consider generalizing our framework to more complicated graph structures (e.g., dynamic graphs, knowledge graphs and heterogeneous graphs).

REFERENCES
[1] Andrew P. Bradley. 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30, 7 (1997), 1145–1159.
[2] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2013. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203 (2013).
[3] Tibério S. Caetano, Julian J. McAuley, Li Cheng, Quoc V. Le, and Alex J. Smola. 2009. Learning graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 6 (2009), 1048–1058.
[4] Jie Chen, Tengfei Ma, and Cao Xiao. 2018. FastGCN: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247 (2018).
[5] Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. 2019. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In SIGKDD. 257–266.
[6] Minsu Cho, Jungmin Lee, and Kyoung Mu Lee. 2010. Reweighted Random Walks for Graph Matching. In ECCV. 492–505.
[7] Timothée Cour, Praveen Srinivasan, and Jianbo Shi. 2006. Balanced Graph Matching. In NIPS. 313–320.
[8] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. arXiv preprint arXiv:1606.09375 (2016).
[9] Matthias Fey and Jan Eric Lenssen. 2019. Fast Graph Representation Learning with PyTorch Geometric. CoRR abs/1903.02428 (2019).
[10] Matthias Fey, Jan Eric Lenssen, Christopher Morris, Jonathan Masci, and Nils M. Kriege. 2020. Deep Graph Matching Consensus. In ICLR.
[11] Hongyang Gao and Shuiwang Ji. 2019. Graph U-Nets. In ICML. 2083–2092.
[12] M. R. Garey and David S. Johnson. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman.
[13] Thomas Gärtner, Tamás Horváth, and Stefan Wrobel. 2010. Graph Kernels. In Encyclopedia of Machine Learning, Claude Sammut and Geoffrey I. Webb (Eds.). 467–469.
[14] Steven Gold and Anand Rangarajan. 1996. A Graduated Assignment Algorithm for Graph Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence 18, 4 (1996), 377–388.
[15] William L. Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In NIPS. 1024–1034.
[16] Bowen Hao, Jing Zhang, Hongzhi Yin, Cuiping Li, and Hong Chen. 2021. Pre-Training Graph Neural Networks for Cold-Start Users and Items Representation. In WSDM. ACM, 265–273.
[17] Kaveh Hassani and Amir Hosein Khas Ahmadi. 2020. Contrastive Multi-View Representation Learning on Graphs. In ICML. 4116–4126.
[18] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay S. Pande, and Jure Leskovec. 2020. Strategies for Pre-training Graph Neural Networks. In ICLR.
[19] Ziniu Hu, Yuxiao Dong, Kuansan Wang, Kai-Wei Chang, and Yizhou Sun. 2020. GPT-GNN: Generative Pre-Training of Graph Neural Networks. In SIGKDD. 1857–1867.
[20] Wenbing Huang, Tong Zhang, Yu Rong, and Junzhou Huang. 2018. Adaptive sampling towards fast graph representation learning. arXiv preprint arXiv:1809.05343 (2018).
[21] Hisashi Kashima, Koji Tsuda, and Akihiro Inokuchi. 2003. Marginalized Kernels Between Labeled Graphs. In ICML, Tom Fawcett and Nina Mishra (Eds.). 321–328.
[22] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
[23] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR.
[24] Marius Leordeanu and Martial Hebert. 2005. A Spectral Technique for Correspondence Problems Using Pairwise Constraints. In ICCV. 1482–1489.
[25] Marius Leordeanu, Martial Hebert, and Rahul Sukthankar. 2009. An Integer Projected Fixed Point Method for Graph Matching and MAP Inference. In NIPS. 1114–1122.
[26] Feng Li, Bencheng Yan, Qingqing Long, Pengjie Wang, Wei Lin, Jian Xu, and Bo Zheng. 2021. Explicit Semantic Cross Feature Learning via Pre-trained Graph Neural Networks for CTR Prediction. In SIGIR.
[27] Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals, and Pushmeet Kohli. 2019. Graph Matching Networks for Learning the Similarity of Graph Structured Objects. In ICML. 3835–3845.
[28] Eliane Maria Loiola, Nair Maria Maia de Abreu, Paulo Oswaldo Boaventura Netto, Peter Hahn, and Tania Maia Querido. 2007. A survey for the quadratic assignment problem. Eur. J. Oper. Res. 176, 2 (2007), 657–690.
[29] Yuanfu Lu, Xunqiang Jiang, Yuan Fang, and Chuan Shi. 2021. Learning to Pre-train Graph Neural Networks. In AAAI.
[30] Sinno Jialin Pan and Qiang Yang. 2010. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering 22, 10 (2010), 1345–1359.
[31] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NIPS. 8024–8035.
[32] Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. 2020. GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training. In SIGKDD. 1150–1160.
[33] John W. Raymond, Eleanor J. Gardiner, and Peter Willett. 2002. RASCAL: Calculation of Graph Similarity using Maximum Common Edge Subgraphs. Comput. J. 45, 6 (2002), 631–644.
[34] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. 2020. Self-Supervised Graph Transformer on Large-Scale Molecular Data. In NIPS.
[35] Nino Shervashidze and Karsten M. Borgwardt. 2009. Fast subtree kernels on graphs. In NIPS. 1660–1668.
[36] Nino Shervashidze, S. V. N. Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten M. Borgwardt. 2009. Efficient graphlet kernels for large graph comparison. In AISTATS. 488–495.
[37] Yixin Su, Rui Zhang, Sarah Erfani, and Junhao Gan. 2021. Neural Graph Matching based Collaborative Filtering. In SIGIR.
[38] Michael Tschannen, Josip Djolonga, Paul K. Rubenstein, Sylvain Gelly, and Mario Lucic. 2020. On Mutual Information Maximization for Representation Learning. In ICLR.
[39] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation Learning with Contrastive Predictive Coding. CoRR abs/1807.03748 (2018).
[40] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In ICLR.
[41] Petar Velickovic, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R. Devon Hjelm. 2019. Deep Graph Infomax. In ICLR.
[42] Runzhong Wang, Junchi Yan, and Xiaokang Yang. 2019. Learning Combinatorial Embedding Networks for Deep Graph Matching. In ICCV. 3056–3065.
[43] Peter Willett, John M. Barnard, and Geoffrey M. Downs. 1998. Chemical Similarity Searching. J. Chem. Inf. Comput. Sci. 38, 6 (1998), 983–996.
[44] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. 2020. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems (2020).
[45] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How Powerful are Graph Neural Networks?. In ICLR.
[46] Kun Xu, Liwei Wang, Mo Yu, Yansong Feng, Yan Song, Zhiguo Wang, and Dong Yu. 2019. Cross-lingual Knowledge Graph Alignment via Graph Matching Neural Network. In ACL. 3156–3161.
[47] Jiaxuan You, Zhitao Ying, and Jure Leskovec. 2020. Design Space for Graph Neural Networks. In NIPS.
[48] Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. 2020. Graph Contrastive Learning with Augmentations. In NIPS.
[49] Andrei Zanfir and Cristian Sminchisescu. 2018. Deep Learning of Graph Matching. In CVPR. 2684–2693.
[50] Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. 2019. GraphSAINT: Graph sampling based inductive learning method. arXiv preprint arXiv:1907.04931 (2019).
[51] Ziwei Zhang, Peng Cui, and Wenwu Zhu. 2020. Deep learning on graphs: A survey. IEEE Transactions on Knowledge and Data Engineering (2020).
[52] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2021. Graph Contrastive Learning with Adaptive Augmentation. (2021).
