
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

XSimGCL: Towards Extremely Simple Graph Contrastive Learning for Recommendation

Junliang Yu, Xin Xia, Tong Chen, Lizhen Cui, Nguyen Quoc Viet Hung, Hongzhi Yin∗

arXiv:2209.02544v2 [[Link]] 8 Sep 2022

Abstract—Contrastive learning (CL) has recently been demonstrated to be critical in improving recommendation performance. The fundamental idea of CL-based recommendation models is to maximize the consistency between representations learned from different graph augmentations of the user-item bipartite graph. In such a self-supervised manner, CL-based recommendation models are expected to extract general features from the raw data to tackle the data sparsity issue. Despite the effectiveness of this paradigm, we still have no clue what underlies the performance gains. In this paper, we first reveal that CL enhances recommendation through endowing the model with the ability to learn more evenly distributed user/item representations, which can implicitly alleviate the pervasive popularity bias and promote long-tail items. Meanwhile, we find that the graph augmentations, which were considered a necessity in prior studies, are relatively unreliable and less significant in CL-based recommendation. On top of these findings, we put forward an eXtremely Simple Graph Contrastive Learning method (XSimGCL) for recommendation, which discards the ineffective graph augmentations and instead employs a simple yet effective noise-based embedding augmentation to create views for CL. A comprehensive experimental study on three large and highly sparse benchmark datasets demonstrates that, though the proposed method is extremely simple, it can smoothly adjust the uniformity of learned representations and outperforms its graph augmentation-based counterparts by a large margin in both recommendation accuracy and training efficiency. The code is released at [Link]

Index Terms—Recommendation, Self-Supervised Learning, Contrastive Learning, Data Augmentation.

1 INTRODUCTION

Recently, a revival of contrastive learning (CL) [1], [2], [3] has swept across many fields of deep learning, leading to a series of major advances [4], [5], [6], [7], [8]. Since the ability of CL to learn general features from unlabeled raw data is a silver bullet for addressing the data sparsity issue [9], [10], [11], it also pushes forward the frontier of recommendation. A flurry of enthusiasm on CL-based recommendation [12], [13], [14], [15], [16], [17], [18] has recently been witnessed, followed by a string of promising results. Based on these studies, a paradigm of contrastive recommendation can be clearly profiled. It mainly includes two steps: first augmenting the original user-item bipartite graph with structure perturbations (e.g., edge/node dropout at a certain ratio), and then maximizing the consistency of representations learned from different graph augmentations [3] under a joint learning framework (shown in Fig. 1).

Fig. 1: Graph contrastive learning with edge dropout for recommendation.

Despite the effectiveness of this paradigm, we still have no clue what underlies the performance gains. Intuitively, encouraging the agreement between related graph augmentations can learn representations invariant to slight structure perturbations and capture the essential information of the original user-item bipartite graph [1], [19]. However, it is surprising that several recent studies [17], [20], [21] have reported that the performances of their CL-based recommendation models are not sensitive to graph augmentations with different edge dropout rates, and even a large drop rate (e.g., 0.9) can somehow benefit the model. Such a finding is elusive, because a large dropout rate will result in a huge loss of the raw structural information and should have had a negative effect only. This naturally raises an intriguing and fundamental question: Are graph augmentations really a necessity for CL-based recommendation models?

To answer this question, we first conduct experiments with and without graph augmentations and compare the performances. The results show that when graph augmentations are detached, there is only a slight performance drop. We then investigate the representations learned by non-CL and CL-based recommendation methods. By visualizing the representations, we observe that the jointly optimized contrastive loss InfoNCE [22] is what really matters, rather than graph augmentations.

• J. Yu, X. Xia, T. Chen, and H. Yin are with the School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, Queensland, Australia. E-mail: {[Link], [Link], [Link], h.yin1}@[Link]
• H. Nguyen is with the Institute for Integrated and Intelligent Systems, Griffith University, Gold Coast, Australia. E-mail: quocviethung1@[Link]
• Lizhen Cui is with the School of Software, Shandong University, Jinan, China. E-mail: clz@[Link]
∗ Corresponding author.
Optimizing this contrastive loss always leads to more evenly distributed user/item representations regardless of graph augmentations, which implicitly alleviates the pervasive popularity bias [23] and promotes long-tail items. On the other hand, though not as effective as expected, some types of graph augmentations indeed improve the recommendation performance, which is analogous to the cherry on the cake. However, to pick out the useful ones, a long process of trial-and-error is needed. Otherwise, a random selection may degrade the recommendation performance. Besides, it should be noted that repeatedly creating graph augmentations and constructing adjacency matrices brings extra expense to model training. In view of these weaknesses, it is sensible to substitute graph augmentations with better alternatives. A follow-up question then arises: Are there more effective and efficient augmentation approaches?

In our preliminary study [24], we had given an affirmative response to this question. On top of our finding that learning more evenly distributed representations is critical for boosting recommendation performance, we proposed a graph-augmentation-free CL method which makes the uniformity more controllable, and named it SimGCL (short for Simple Graph Contrastive Learning). SimGCL conforms to the paradigm presented in Fig. 1, but it discards the ineffective graph augmentations and instead adds uniform noises to the learned representations for a far more efficient representation-level data augmentation. We empirically demonstrated that this noise-based augmentation can directly regularize the embedding space towards a more even representation distribution. Meanwhile, by modulating the magnitude of noises, SimGCL can smoothly adjust the uniformity of representations. Benefitting from these characteristics, SimGCL shows superiorities over its graph augmentation-based counterparts in both recommendation accuracy and training efficiency.

However, in spite of these advantages, the cumbersome architecture of SimGCL makes it less than perfect. In addition to the forward/backward pass for the recommendation task, it requires two extra forward/backward passes for the contrastive task in a mini-batch (shown in Fig. 2). Actually, this is a universal problem for all the CL-based recommendation models [12], [25], [17], [26], [27] under the paradigm in Fig. 1. What makes it worse is that these methods require all nodes in the user-item bipartite graph to be present during training, which means their computational complexity is almost triple that of conventional recommendation models. This flaw greatly limits the use of CL-based models at scale.

In order to address this issue, in this work we put forward an eXtremely Simple Graph Contrastive Learning method (XSimGCL) for recommendation. XSimGCL not only inherits SimGCL's noise-based augmentation but also drastically reduces the computational complexity by streamlining the propagation process. As shown in Fig. 2, the recommendation task and the contrastive task of XSimGCL share the forward/backward propagation in a mini-batch instead of owning separate pipelines. To be specific, both SimGCL and XSimGCL are fed with the same input: the initial embeddings and the adjacency matrix. The difference is that SimGCL contrasts two final representations learned with different noises and relies on the ordinary representations for recommendation, whereas XSimGCL uses the same perturbed representations for both tasks and replaces the final-layer contrast in SimGCL with the cross-layer contrast. With this new design, XSimGCL is nearly as lightweight as the conventional recommendation model LightGCN [28]. And to top it all, XSimGCL even outperforms SimGCL with its simpler architecture.

Fig. 2: The architectures of SimGCL and XSimGCL.

Overall, as an extension to our conference paper [24], this work makes the following contributions:
• We reveal that CL enhances graph recommendation models by learning more evenly distributed representations, where the InfoNCE loss is far more important than graph augmentations.
• We propose a simple yet effective noise-based augmentation approach, which can smoothly adjust the uniformity of the representation distribution through contrastive learning.
• We put forward a novel CL-based recommendation model, XSimGCL, which is even more effective and efficient than its predecessor SimGCL.
• We conduct a comprehensive experimental study on three large and highly sparse benchmark datasets (two of which were not used in our preliminary study) to demonstrate that XSimGCL is an ideal alternative to its graph augmentation-based counterparts.

The rest of this paper is organized as follows. Section 2 investigates the necessity of graph augmentations in contrastive recommendation and explores how CL enhances recommendation. Section 3 proposes the noise-based augmentation approach and the CL-based recommendation model XSimGCL. The experimental study is presented in Section 4. Section 5 provides a brief review of the related literature. Finally, we conclude this work in Section 6.

2 REVISITING GRAPH CONTRASTIVE LEARNING FOR RECOMMENDATION

2.1 Contrastive Recommendation with Graph Augmentations

Generally, data augmentations are the prerequisite for CL-based recommendation models [15], [12], [13], [25]. In this section, we investigate the widely used dropout-based augmentations on graphs [12], [4]. They assume that the learned representations which are invariant to partial structure perturbations are of high quality. We target a representative state-of-the-art CL-based recommendation model, SGL [12], which performs the node/edge dropout to augment the user-item graph. The joint learning scheme in SGL is formulated as:

L = L_{rec} + \lambda L_{cl},   (1)

which consists of the recommendation loss L_{rec} and the contrastive loss L_{cl}. Since the goal of SGL is to recommend items, the CL task plays an auxiliary role and its effect is modulated by a hyperparameter λ. As for the concrete forms of these two losses, SGL adopts the standard BPR loss [29] for recommendation and the InfoNCE loss [22] for CL. The standard BPR loss is defined as:

L_{rec} = -\sum_{(u,i) \in B} \log \sigma(e_u^\top e_i - e_u^\top e_j),   (2)

where σ is the sigmoid function, e_u is the user representation, e_i is the representation of an item that user u has interacted with, e_j is the representation of a randomly sampled item, and B is a mini-batch. The InfoNCE loss [22] is formulated as:

L_{cl} = -\sum_{i \in B} \log \frac{\exp(z_i'^\top z_i'' / \tau)}{\sum_{j \in B} \exp(z_i'^\top z_j'' / \tau)},   (3)

where i and j are users/items in B, z_i' and z_i'' are L2-normalized representations learned from two different dropout-based graph augmentations (namely z_i' = e_i' / \|e_i'\|_2), and τ > 0 (e.g., 0.2) is the temperature which controls the strength of penalties on hard negative samples. The InfoNCE loss encourages the consistency between z_i' and z_i'', which are the positive sample of each other, whilst minimizing the agreement between z_i' and z_j'', which are negative samples of each other. Optimizing the InfoNCE loss is actually maximizing a tight lower bound of mutual information.
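To make Eq. (3) concrete, the following minimal PyTorch-style sketch computes the InfoNCE loss over a mini-batch; the function and tensor names (`info_nce`, `view1`, `view2`) are illustrative and not taken from the released code:

```python
import torch
import torch.nn.functional as F

def info_nce(view1: torch.Tensor, view2: torch.Tensor, temperature: float = 0.2) -> torch.Tensor:
    """InfoNCE over a mini-batch: view1[i] and view2[i] are two views of the same node."""
    z1 = F.normalize(view1, dim=1)                        # z' = e' / ||e'||_2
    z2 = F.normalize(view2, dim=1)
    logits = z1 @ z2.t() / temperature                    # pairwise similarities scaled by 1/tau
    labels = torch.arange(z1.size(0), device=z1.device)   # the positive of node i is row i of the other view
    return F.cross_entropy(logits, labels)                # Eq. (3), averaged over the batch
```

Here the diagonal entries of `logits` correspond to the positive pairs (z_i', z_i'') and the off-diagonal entries to the in-batch negatives (z_i', z_j'').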
To learn representations from the user-item graph, SGL employs LightGCN [28] as its encoder, whose message passing process is defined as:

E = \frac{1}{1+L}\left(E^{(0)} + AE^{(0)} + \dots + A^{L}E^{(0)}\right),   (4)

where E^{(0)} \in \mathbb{R}^{|N| \times d} is the initial node embeddings to be learned, E is the final representations for prediction, |N| is the number of nodes, L is the number of layers, and A \in \mathbb{R}^{|N| \times |N|} is the normalized undirected adjacency matrix without self-connections. By replacing A with the adjacency matrix of the corrupted graph augmentations \tilde{A}, z' and z'' can be learned via Eq. (4). It should be noted that \tilde{A} is reconstructed in every epoch. For the sake of brevity, here we just present the core ingredients of SGL and LightGCN. More details can be found in the original papers [12], [28].
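As a reference for Eq. (4), a minimal sketch of the layer-averaged propagation, assuming a pre-built sparse normalized adjacency matrix `adj` and input embeddings `emb0` (both names are hypothetical):

```python
import torch

def lightgcn_propagate(adj: torch.Tensor, emb0: torch.Tensor, num_layers: int) -> torch.Tensor:
    """E = (E^(0) + A E^(0) + ... + A^L E^(0)) / (1 + L), with adj a sparse |N| x |N| matrix."""
    layers = [emb0]
    h = emb0
    for _ in range(num_layers):
        h = torch.sparse.mm(adj, h)        # one more hop of neighborhood aggregation: A^l E^(0)
        layers.append(h)
    return torch.stack(layers, dim=0).mean(dim=0)
```

Running the same routine on the adjacency matrix of an augmented graph \tilde{A} (rebuilt every epoch) yields the views z' and z'' used by SGL.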
TABLE 1: Performance comparison of different SGL variants.

Method      Yelp2018            Kindle              iFashion
            Recall    NDCG      Recall    NDCG      Recall    NDCG
LightGCN    0.0639    0.0525    0.2053    0.1315    0.0955    0.0461
SGL-ND      0.0644    0.0528    0.2069    0.1328    0.1032    0.0498
SGL-ED      0.0675    0.0555    0.2090    0.1352    0.1093    0.0531
SGL-RW      0.0667    0.0547    0.2105    0.1351    0.1095    0.0531
SGL-WA      0.0671    0.0550    0.2084    0.1347    0.1065    0.0519

2.2 Necessity of Graph Augmentations

The phenomenon reported in [17], [20], [21] that even very sparse graph augmentations can somehow benefit the recommendation model suggests that CL-based recommendation may work in a way different from what we reckon. To demystify how CL enhances recommendation, we first investigate the necessity of the graph augmentation in SGL. In the original paper of SGL [12], three variants are proposed, including SGL-ND (-ND denotes node dropout), SGL-ED (-ED denotes edge dropout), and SGL-RW (-RW denotes random walk, i.e., multi-layer edge dropout). For a control group, we construct a new variant of SGL, termed SGL-WA (-WA stands for w/o augmentation), in which the CL loss is defined as:

L_{cl} = -\sum_{i \in B} \log \frac{\exp(1/\tau)}{\sum_{j \in B} \exp(z_i^\top z_j / \tau)}.   (5)

Because no augmentations are used in SGL-WA, we have z_i' = z_i'' = z_i. The performance comparison is conducted on three benchmark datasets: Yelp2018 [28], Amazon-Kindle [30] and Alibaba-iFashion [12]. A 3-layer setting is adopted and the hyperparameters are tuned according to the original paper of SGL (more experimental settings can be found in Section 4.1). The results are presented in Table 1, where the highest values are marked in bold type.

As can be observed, all the graph augmentation-based variants of SGL outperform LightGCN, which demonstrates the effectiveness of CL. However, to our surprise, SGL-WA is also competitive. Its performance is on par with that of SGL-ED and SGL-RW and is even better than that of SGL-ND on all the datasets. Given these results, we can draw two conclusions: (1) graph augmentations indeed work, but they are not as effective as expected; the large proportion of performance gains derives from the contrastive loss InfoNCE, and this can explain why even very sparse graph augmentations seem to be informative in [17], [20], [21]; (2) not all graph augmentations have a positive impact; a long process of trial-and-error is required to pick out the useful ones. As for (2), the possible reason could be that some graph augmentations highly distort the original graph. For example, the node dropout is very likely to drop the key nodes (e.g., hubs) and their associated edges and hence break the correlated subgraphs into disconnected pieces. Such graph augmentations share little learnable invariance with the original graph, and therefore it is unreasonable to encourage the consistency between them. By contrast, the edge dropout is at a lower risk of largely perturbing the original graph, so that SGL-ED/RW can hold a slim advantage over SGL-WA. However, in view of the expense of regular reconstruction of the adjacency matrices during training, it is sensible to search for better alternatives.

Fig. 3: The distribution of representations learned from three datasets: (a) Yelp2018, (b) Amazon-Kindle, (c) Alibaba-iFashion. Columns correspond to LightGCN, SGL-WA, SGL-ED, CL Only, SimGCL, and XSimGCL; hot/cold users and hot/cold items are distinguished in each panel. The top of each figure plots the learned 2D features and the bottom plots the Gaussian kernel density estimation of atan2(y, x) for each point (x, y) ∈ S^1.
2.3 Uniformity Is What Really Matters

The last section has revealed that the contrastive loss InfoNCE is the key. However, we still have no idea how it operates. Previous research [31] on visual representation learning has identified that pre-training with the InfoNCE loss intensifies two properties: alignment of features from positive pairs, and uniformity of the normalized feature distribution on the unit hypersphere. It is unclear if the CL-based recommendation methods exhibit similar patterns under a joint learning setting. Since in recommendation the goal of L_{rec} is to align the interacted user-item pairs, here we focus on investigating the uniformity.

In our preliminary study [24], we have displayed the distribution of 2,000 randomly sampled users after optimizing the InfoNCE loss. For a thorough understanding, in this version we sample both users and items. Specifically, we first rank users and items according to their popularity. Then 500 hot items are randomly sampled from the item group with the top 5% interactions; the other 500 cold items are randomly sampled from the group with the bottom 80% interactions, and so are the users. Afterwards, we map the learned representations to 2-dimensional space with t-SNE [32]. All the representations are collected when the models reach their best performance. Then we plot the 2D feature distributions in Fig. 3. For a clearer presentation, the Gaussian kernel density estimation [33] of atan2(feature y, feature x) on the unit hypersphere S^1 is also visualized.
From Fig. 3, an obvious contrast between the features/density estimations learned by LightGCN and the CL-based recommendation models can be observed. In the leftmost column, LightGCN learns highly clustered features and the density curves show steep rises and falls. Besides, it is easy to notice that the hot users and hot items have similar distributions, and the cold users also cling to the hot items; only a small number of users are scattered among the cold items. Technically, this is a biased pattern that will lead the model to continually expose hot items to most users and generate run-of-the-mill recommendations. We think that two issues may cause this biased distribution. One is that in recommender systems a fraction of items often account for most interactions [34], and the other is the notorious over-smoothing problem [35], which makes embeddings become locally similar and hence aggravates the Matthew Effect. By contrast, in the second and the third columns, the features learned by the SGL variants are more evenly distributed and the density estimation curves are less sharp, regardless of whether the graph augmentations are applied. For reference, in the fourth column we plot the features learned only by optimizing the InfoNCE loss in SGL-ED. Without the effect of L_{rec}, the features are almost subject to uniform distributions. The following inference provides a theoretical justification for this pattern. By rewriting Eq. (3) we can derive:

L_{cl} = \sum_{i \in B} \left( -z_i'^\top z_i''/\tau + \log \sum_{j \in B} \exp(z_i'^\top z_j''/\tau) \right).   (6)

When the representations of different augmentations of the same node are perfectly aligned (SGL-WA is analogous to this case), we have

L_{cl} = \sum_{i \in B} \left( -1/\tau + \log\left( \sum_{j \in B \setminus \{i\}} \exp(z_i'^\top z_j''/\tau) + \exp(1/\tau) \right) \right).   (7)

Since 1/τ is a constant, optimizing the CL loss is actually minimizing the cosine similarity between different node representations, which will push different nodes away from each other.
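The argument above can be checked numerically. The toy sketch below (not from the paper's code) optimizes only the InfoNCE loss with z' = z'', i.e., the perfectly aligned case of Eq. (7), starting from a deliberately clustered random embedding table, and prints the average pairwise cosine similarity before and after; the value drops, confirming that the loss pushes node representations apart:

```python
import torch
import torch.nn.functional as F

def avg_pairwise_cosine(e: torch.Tensor) -> torch.Tensor:
    z = F.normalize(e, dim=1)
    sim = z @ z.t()
    return (sim.sum() - sim.trace()) / (sim.numel() - sim.size(0))  # mean off-diagonal similarity

torch.manual_seed(0)
base = torch.randn(1, 16)                                   # shared direction -> clustered start
emb = torch.nn.Parameter(base + 0.1 * torch.randn(64, 16))
opt = torch.optim.Adam([emb], lr=0.01)

print("before:", avg_pairwise_cosine(emb).item())
for _ in range(200):
    z = F.normalize(emb, dim=1)
    logits = z @ z.t() / 0.2                                # both views identical: z' = z''
    loss = F.cross_entropy(logits, torch.arange(z.size(0)))
    opt.zero_grad()
    loss.backward()
    opt.step()
print("after:", avg_pairwise_cosine(emb).item())
```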
By aligning the results in Table 1 with the distributions in Fig. 3, it is natural to speculate that the increased uniformity of the learned distribution is what really matters to the performance gains. It implicitly alleviates the popularity bias and promotes the long-tail items (discussed in Section 4.2), because more evenly distributed representations can preserve the intrinsic characteristics of nodes and improve the generalization ability. This also justifies the unexpectedly remarkable performance of SGL-WA. Finally, it should also be noted that a positive correlation between the uniformity and the performance only holds in a limited scope. The excessive pursuit of uniformity will weaken the effect of the recommendation loss to align interacted pairs and similar users/items, and hence degrades the recommendation performance.

3 TOWARDS EXTREMELY SIMPLE GRAPH CONTRASTIVE LEARNING FOR RECOMMENDATION

In this section we propose a substitute for graph augmentations and develop a lightweight architecture for CL-based recommendation.

3.1 Noise-Based Augmentation

Based on the findings above, we speculate that by adjusting the uniformity of the learned representation in a certain scope, contrastive recommendation models can reach a better performance. Since manipulating the graph structure for controllable uniformity is intractable and time-consuming, we shift attention to the embedding space. Inspired by adversarial examples [36], which are constructed through adding imperceptible perturbations to images, we propose to directly add random noises to the representation for an efficient augmentation.

Formally, given a node i and its representation e_i in the d-dimensional embedding space, we can implement the following representation-level augmentation:

e_i' = e_i + \Delta_i', \quad e_i'' = e_i + \Delta_i'',   (8)

where the added noise vectors \Delta_i' and \Delta_i'' are subject to \|\Delta\|_2 = \epsilon, and \epsilon is a small constant. Technically, this constraint of magnitude makes \Delta numerically equivalent to points on a hypersphere with radius \epsilon. Besides, it is required that:

\Delta = \bar{X} \odot \mathrm{sign}(e_i), \quad \bar{X} \in \mathbb{R}^{d} \sim U(0, 1),   (9)

which forces e_i, \Delta', and \Delta'' to be in the same hyperoctant, so that adding the noises to e_i will not result in a large deviation that constructs less informative augmentations of e_i. Geometrically, by adding the scaled noise vectors to e_i, we rotate it by two small angles (θ1 and θ2 shown in Fig. 4). Each rotation corresponds to a deviation of e_i and generates an augmented representation (e_i' and e_i''). Since the rotation is small enough, the augmented representation retains most information of the original representation and meanwhile also brings some difference. Particularly, we also hope the learned representations can spread out in the entire embedding space so as to fully utilize the expressive power of the space. Zhang et al. [37] proved that the uniform distribution has such a property. We therefore choose to generate the noises from the uniform distribution. Though it is technically difficult to make the learned distribution approximate the uniform distribution in this way, it can statistically bring a hint of uniformity to the augmentation.

Fig. 4: An illustration of the proposed random noise-based data augmentation (in the embedding space, noise vectors Δ' and Δ'' of norm r = ε rotate e_i by small angles θ1 and θ2 towards e_i' and e_i'').
3.2 Proposed Contrastive Recommendation Model

3.2.1 A Review of SimGCL

Before presenting XSimGCL, we first briefly review SimGCL, proposed in our conference paper [24], which will help understand the new contributions.
As shown in Fig. 2, SimGCL follows the paradigm of graph CL-based recommendation portrayed in Fig. 1. It consists of three encoders: one is for the recommendation task and the other two are for the contrastive task. SimGCL likewise adopts LightGCN as the backbone to learn graph representations. Since LightGCN is network-parameter-free, the input user/item embeddings are the only parameters to be learned. The ordinary graph encoding is used for recommendation, which follows Eq. (4) to propagate node information. Meanwhile, in the other two encoders SimGCL employs the proposed noise-based augmentation approach and adds different uniform random noises to the aggregated embeddings at each layer to obtain perturbed representations. This noise-involved representation learning can be formulated as:

E' = \frac{1}{L}\Big( (AE^{(0)} + \Delta^{(1)}) + \big(A(AE^{(0)} + \Delta^{(1)}) + \Delta^{(2)}\big) + \dots + \big(A^{L}E^{(0)} + A^{L-1}\Delta^{(1)} + \dots + A\Delta^{(L-1)} + \Delta^{(L)}\big) \Big).   (10)

Note that we skip the input embedding E^{(0)} in all three encoders when calculating the final representations, because we experimentally find that CL cannot consistently improve non-aggregation-based models and skipping it leads to a slight performance improvement. Finally, we substitute the learned representations into the joint loss presented in Eq. (1) and then use Adam to optimize it.
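A minimal sketch of the noise-involved propagation in Eq. (10): it mirrors the LightGCN sketch above but injects fresh signed uniform noise after every convolution, skips E^(0) in the average, and keeps the per-layer outputs so that a chosen layer can later be contrasted with the final one (all names are illustrative):

```python
import torch
import torch.nn.functional as F

def perturbed_propagate(adj, emb0, num_layers, eps):
    layer_outputs = []
    h = emb0
    for _ in range(num_layers):
        h = torch.sparse.mm(adj, h)                            # graph convolution: A * E
        noise = F.normalize(torch.rand_like(h), dim=1) * eps   # ||Delta^(l)||_2 = eps
        h = h + noise * torch.sign(h)                          # signed noise, as in Eq. (9)
        layer_outputs.append(h)
    final = torch.stack(layer_outputs, dim=0).mean(dim=0)      # average of the L layers, E^(0) excluded
    return final, layer_outputs
```

SimGCL runs this routine (with independent noise) twice for the contrastive task and once without noise for recommendation; XSimGCL, described next, runs it only once.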
3.2.2 XSimGCL - Simpler Than Simple

Compared with SGL, SimGCL is much simpler because the constant graph augmentation is no longer required. However, the cumbersome architecture of SimGCL makes it less than perfect. For each computation, it requires three forward/backward passes to obtain the loss and then backpropagate the error to update the input node embeddings. Though it seems a convention to separate the pipelines of the recommendation task and the contrastive task in CL-based recommender systems [12], [24], [27], [25], we question the necessity of this architecture.

As suggested by [38], there is a sweet spot when utilizing CL where the mutual information between correlated views is neither too high nor too low. However, in the architecture of SimGCL, given a pair of views of the same node, the mutual information between the two could always be very high since the two corresponding embeddings both contain information from L hops of neighbors. Due to the minor difference between them, contrasting them with each other may be less effective in learning general features. In fact, this is also a universal problem for many CL-based recommendation models under the paradigm in Fig. 1. It is natural to ask: what if we contrast different layer embeddings? They share some common information but differ in aggregated neighbors and added noises, which conforms to the sweet-spot theory. Besides, considering that the magnitude of added noises is small enough, we can directly use the perturbed representations for the recommendation task. The noises are analogous to the widely used dropout trick and are only applied in training. In the test phase, the model switches to the ordinary mode without noises.

Benefitting from this design of cross-layer contrast, we can streamline the architecture of SimGCL by merging its encoding processes. As a result, the new architecture has only a one-time forward/backward pass in a mini-batch computation. We name this new method XSimGCL, which is short for eXtremely Simple Graph Contrastive Learning. We illustrate it in Fig. 2. The perturbed representation learning of XSimGCL is the same as that of SimGCL. The joint loss of XSimGCL is formulated as:

L = -\sum_{(u,i) \in B} \log \sigma(e_u'^\top e_i' - e_u'^\top e_j') + \lambda \sum_{i \in B} \left( -\log \frac{\exp(z_i'^\top z_i^{l^*} / \tau)}{\sum_{j \in B} \exp(z_i'^\top z_j^{l^*} / \tau)} \right),   (11)

where l^* denotes the layer to be contrasted with the final layer. Contrasting two intermediate layers is optional, but the experiments in Section 4.3 show that involving the final layer leads to the optimal performance.
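The joint objective in Eq. (11) can be sketched as follows; `final_emb` and `mid_emb` are assumed to stack user rows before item rows (outputs of the perturbed propagation above), and all index tensors and names are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.2):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature
    return F.cross_entropy(logits, torch.arange(z1.size(0), device=z1.device))

def xsimgcl_loss(final_emb, mid_emb, num_users, users, pos_items, neg_items,
                 cl_rate=0.2, temperature=0.2):
    u_final, i_final = final_emb[:num_users], final_emb[num_users:]
    u_mid, i_mid = mid_emb[:num_users], mid_emb[num_users:]
    # BPR term computed on the same perturbed representations used for recommendation
    pos = (u_final[users] * i_final[pos_items]).sum(dim=1)
    neg = (u_final[users] * i_final[neg_items]).sum(dim=1)
    rec_loss = -F.logsigmoid(pos - neg).mean()
    # cross-layer contrast: the final-layer view against the l*-th layer view of the same nodes
    cl_loss = info_nce(u_final[users], u_mid[users], temperature) + \
              info_nce(i_final[pos_items], i_mid[pos_items], temperature)
    return rec_loss + cl_rate * cl_loss
```

Here `mid_emb` would be the l*-th entry of the per-layer outputs from the propagation sketch; at test time the model simply switches back to the noise-free propagation.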
3.2.3 Adjusting Uniformity Through Changing ε

In XSimGCL, we can explicitly control how far the augmented representations deviate from the original by changing the value of ε. Intuitively, a larger ε will result in a more uniform representation distribution. This is because, when optimizing the contrastive loss, the added noises are also propagated as part of the gradients. As the noises are sampled from the uniform distribution, the original representation is roughly regularized towards higher uniformity. We conduct the following experiment to demonstrate it.

According to [31], the logarithm of the average pairwise Gaussian potential (a.k.a. the Radial Basis Function (RBF) kernel) can well measure the uniformity of representations, which is defined as:

L_{uniform}(f) = \log \, \mathbb{E}_{u,v \overset{i.i.d.}{\sim} p_{node}} \left[ e^{-2\|f(u)-f(v)\|_2^2} \right],   (12)

where f(u) outputs the L2-normalized embedding of u. We choose the popular items (with more than 200 interactions) and randomly sample 5,000 users in the Yelp2018 dataset to form the user-item pairs, and then measure the uniformity of representations learned by XSimGCL with Eq. (12). We fix λ = 0.2 and use a 3-layer setting, and then tune ε to observe how the uniformity changes. The uniformity is checked after every epoch until XSimGCL reaches convergence.
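Eq. (12) can be computed directly from a batch of embeddings; a minimal sketch (names are illustrative) is:

```python
import torch
import torch.nn.functional as F

def uniformity(emb: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """log E[ exp(-t * ||f(u) - f(v)||_2^2) ] over all pairs; lower means more uniform."""
    z = F.normalize(emb, dim=1)              # f(.) outputs L2-normalized embeddings
    sq_dists = torch.pdist(z, p=2).pow(2)    # squared pairwise Euclidean distances
    return sq_dists.mul(-t).exp().mean().log()
```

With t = 2 this matches Eq. (12); in our setting `emb` would hold the sampled user and popular-item representations from Yelp2018.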
Fig. 5: Trends of uniformity with different ε (0.01, 0.05, 0.1, 0.2). Lower values on the y-axis are better. We present the Recall@20 values of XSimGCL with different ε when it reaches convergence.

As clearly shown in Fig. 5, similar trends are observed on all the curves. At the beginning, all curves have highly evenly-distributed representations because we use Xavier initialization, which is a special uniform distribution, to initialize the input embeddings. With the training proceeding, the uniformity declines due to the effect of L_{rec}. After reaching the peak, the uniformity increases again till convergence. It is also obvious that with the increase of ε, XSimGCL tends to learn a more even representation distribution. Meanwhile, better performance is achieved, which evidences our claim that more uniformity brings better performance within a certain scope. Besides, we notice that there is a correlation between the convergence speed and the magnitude of noises, which will be discussed in Section 4.3.

3.3 Complexity

In this section, we analyze the theoretical time complexity of XSimGCL and compare it with LightGCN, SGL-ED and its predecessor SimGCL. The discussion is within the scope of a single batch since in-batch negative sampling is a widely used trick in CL [5]. Here we let |A| be the edge number in the user-item bipartite graph, d be the embedding dimension, B denote the batch size, M represent the node number in a batch, L be the layer number, and ρ denote the edge keep rate in SGL-ED. We can derive:

TABLE 2: The comparison of time complexity

                    LightGCN     SGL-ED            SimGCL      XSimGCL
Adjacency Matrix    O(2|A|)      O(2|A|+4ρ|A|)     O(2|A|)     O(2|A|)
Graph Encoding      O(2|A|Ld)    O((2+4ρ)|A|Ld)    O(6|A|Ld)   O(2|A|Ld)
Prediction          O(2Bd)       O(2Bd)            O(2Bd)      O(2Bd)
Contrast            -            O(BMd)            O(BMd)      O(BMd)

• Since LightGCN, SimGCL and XSimGCL do not need graph augmentations, they only construct the normalized adjacency matrix, which has 2|A| non-zero elements. For SGL-ED, two graph augmentations are used and each has 2ρ|A| non-zero elements in its adjacency matrix.
• In the phase of graph encoding, a three-encoder architecture is adopted in both SGL-ED and SimGCL to learn two different augmentations, so the encoding expenses of SGL-ED and SimGCL are almost three times that of LightGCN. In contrast, the encoding expense of XSimGCL is the same as that of LightGCN.
• As for the prediction, all methods are trained with the BPR loss and each batch contains B interactions, so they have exactly the same time cost in this regard.
• The computational cost of CL comes from the contrast between the positive/negative samples, which are O(Bd) and O(BMd), respectively, because each node regards the views of itself as the positives and views of other nodes as the negatives. For the sake of brevity, we mark it as O(BMd) since M ≫ 1.

In these four models, SGL-ED and SimGCL are obviously the two with the highest computation costs. SimGCL needs more time in the graph encoding, but SGL-ED requires constant graph augmentations. Since this part is usually performed on CPUs, it brings SGL-ED more expense of time in practice. By comparison, XSimGCL needs neither graph augmentations nor extra encoders. Without considering the computation for the contrastive task, XSimGCL is theoretically as lightweight as LightGCN and only spends one-third of SimGCL's training expense in graph encoding. When the actual number of epochs for training is considered, XSimGCL shows more efficiency beyond what we can observe from this theoretical analysis.

TABLE 3: Dataset Statistics

Dataset             #User      #Item     #Feedback    Density
Yelp2018            31,668     38,048    1,561,406    0.13%
Amazon-Kindle       138,333    98,572    1,909,965    0.014%
Alibaba-iFashion    300,000    81,614    1,607,813    0.007%

4 EXPERIMENTS

4.1 Experimental Settings

Datasets. For reliable and convincing results, we conduct experiments on three public large-scale datasets: Yelp2018 [28], Amazon-Kindle [12] and Alibaba-iFashion [12] to evaluate XSimGCL/SimGCL. The statistics of these datasets are presented in Table 3. We split the datasets into three parts (training set, validation set, and test set) with a [Link] ratio. Following [12], [28], we first search the best hyperparameters on the validation set, and then we merge the training set and the validation set to train the model and evaluate it on the test set, where the relevancy-based metric Recall@20 and the ranking-aware metric NDCG@20 are used. For a rigorous and unbiased evaluation, the reported results are the average values of 5 runs, with all the items being ranked.

Baselines. Besides LightGCN and the SGL variants, the following recent data augmentation-based/CL-based recommendation models are compared.
• DNN+SSL [27] is a recent DNN-based recommendation method which adopts an architecture similar to that in Fig. 1 and conducts feature masking for CL.
• BUIR [21] has a two-branch architecture which consists of a target network and an online network, and only uses positive examples for self-supervised recommendation.
• MixGCF [39] designs the hop-mixing technique to synthesize hard negatives for graph collaborative filtering by embedding interpolation.
• NCL [18] is a very recent contrastive model which designs a prototypical contrastive objective to capture the correlations between a user/item and its context.

Hyperparameters. For a fair comparison, we referred to the best hyperparameter settings reported in the original papers of the baselines and then fine-tuned them with grid search. As for the general settings, we create the user and item embeddings with the Xavier initialization of dimension 64; we use Adam to optimize all the models with the learning rate 0.001; the L2 regularization coefficient 10^{-4} and the batch size 2048 are used, which are common in many papers [28], [12], [40]. In SimGCL, XSimGCL and SGL, we empirically let the temperature τ = 0.2 because this value is often reported as a great choice in papers on CL [12], [31].
An exception is that we let τ = 0.15 for XSimGCL on Yelp2018, which brings a slightly better performance. Note that although the paper of SGL [12] uses Yelp2018 and Alibaba-iFashion as well, we cannot reproduce their results on Alibaba-iFashion with their given hyperparameters under the same experimental setting. So we re-searched the hyperparameters of SGL and choose to present our results on this dataset in Table 4.

TABLE 5: The best hyperparameters of compared methods.

Dataset     Yelp2018               Amazon-Kindle          Alibaba-iFashion
SGL         λ=0.1, ρ=0.1           λ=0.05, ρ=0.1          λ=0.05, ρ=0.2
SimGCL      λ=0.5, ε=0.1           λ=0.1, ε=0.1           λ=0.05, ε=0.1
XSimGCL     λ=0.2, ε=0.2, l*=1     λ=0.2, ε=0.1, l*=1     λ=0.05, ε=0.05, l*=3

TABLE 4: Performance comparison for different CL methods on three benchmarks.

         Method     Yelp2018                           Amazon-Kindle                      Alibaba-iFashion
                    Recall@20        NDCG@20           Recall@20        NDCG@20           Recall@20        NDCG@20
1-Layer  LightGCN   0.0631           0.0515            0.1871           0.1186            0.0845           0.0390
         SGL-ND     0.0643 (+1.9%)   0.0529 (+2.7%)    0.1880 (+0.5%)   0.1192 (+0.5%)    0.0896 (+6.0%)   0.0432 (+10.8%)
         SGL-ED     0.0637 (+1.0%)   0.0526 (+2.1%)    0.1936 (+3.5%)   0.1231 (+3.8%)    0.0932 (+10.3%)  0.0447 (+14.6%)
         SGL-RW     0.0637 (+1.0%)   0.0526 (+2.1%)    0.1936 (+3.5%)   0.1231 (+3.8%)    0.0932 (+10.3%)  0.0447 (+14.6%)
         SGL-WA     0.0628 (-0.4%)   0.0525 (+1.9%)    0.1918 (+2.5%)   0.1221 (+2.9%)    0.0913 (+8.0%)   0.0440 (+12.8%)
         SimGCL     0.0689 (+9.2%)   0.0572 (+11.1%)   0.2087 (+11.5%)  0.1361 (+14.8%)   0.1036 (+22.6%)  0.0505 (+29.5%)
         XSimGCL    0.0692 (+9.7%)   0.0582 (+13.0%)   0.2071 (+10.7%)  0.1339 (+12.9%)   0.1069 (+26.5%)  0.0527 (+35.1%)
2-Layer  LightGCN   0.0622           0.0504            0.2033           0.1284            0.1053           0.0505
         SGL-ND     0.0658 (+5.8%)   0.0538 (+6.7%)    0.2020 (-0.6%)   0.1307 (+1.8%)    0.0993 (-5.7%)   0.0484 (-4.2%)
         SGL-ED     0.0668 (+7.4%)   0.0549 (+8.9%)    0.2084 (+2.5%)   0.1341 (+4.4%)    0.1062 (+0.8%)   0.0514 (+1.8%)
         SGL-RW     0.0644 (+3.5%)   0.0530 (+5.2%)    0.2088 (+2.7%)   0.1345 (+4.8%)    0.1053 (+0.0%)   0.0512 (+1.4%)
         SGL-WA     0.0653 (+5.0%)   0.0544 (+7.9%)    0.2068 (+1.7%)   0.1330 (+3.6%)    0.1028 (-2.4%)   0.0501 (-0.8%)
         SimGCL     0.0719 (+15.6%)  0.0601 (+19.2%)   0.2071 (+1.9%)   0.1341 (+4.4%)    0.1119 (+6.3%)   0.0548 (+8.5%)
         XSimGCL    0.0722 (+16.1%)  0.0604 (+19.8%)   0.2114 (+4.0%)   0.1382 (+7.6%)    0.1143 (+8.5%)   0.0559 (+10.7%)
3-Layer  LightGCN   0.0639           0.0525            0.2057           0.1315            0.0955           0.0461
         SGL-ND     0.0644 (+0.8%)   0.0528 (+0.6%)    0.2069 (+0.6%)   0.1328 (+1.0%)    0.1032 (+8.1%)   0.0498 (+8.0%)
         SGL-ED     0.0675 (+5.6%)   0.0555 (+5.7%)    0.2090 (+1.6%)   0.1352 (+2.8%)    0.1093 (+14.5%)  0.0531 (+15.2%)
         SGL-RW     0.0667 (+4.4%)   0.0547 (+4.5%)    0.2105 (+2.3%)   0.1351 (+2.7%)    0.1095 (+14.7%)  0.0531 (+15.2%)
         SGL-WA     0.0671 (+5.0%)   0.0550 (+4.8%)    0.2084 (+1.3%)   0.1347 (+2.4%)    0.1065 (+11.5%)  0.0519 (+12.6%)
         SimGCL     0.0721 (+12.8%)  0.0601 (+14.5%)   0.2104 (+2.3%)   0.1374 (+4.5%)    0.1151 (+20.5%)  0.0567 (+23.0%)
         XSimGCL    0.0723 (+13.1%)  0.0604 (+15.0%)   0.2147 (+4.4%)   0.1415 (+7.6%)    0.1196 (+25.2%)  0.0586 (+27.1%)

4.2 SGL vs. XSimGCL: From A Comprehensive Perspective

In this part, we compare XSimGCL with SGL in a comprehensive way. The experiments focus on three important aspects: recommendation performance, training time, and the ability to promote long-tail items.

4.2.1 Performance Comparison

We first show the performances of SGL and XSimGCL/SimGCL with different layers. Their best hyperparameters are provided in Table 5 for an easy reproduction of our results. The figures of the best performance are presented in bold and the runner-ups are presented with underline. The improvements are calculated based on the performance of LightGCN. Note that we contrast the final layer with itself in the 1-layer XSimGCL. Based on Table 4, we have the following observations:
• In the vast majority of cases, the SGL variants, SimGCL and XSimGCL can largely outperform LightGCN. The largest improvements are observed on Alibaba-iFashion, where the performance of XSimGCL surpasses that of LightGCN by more than 25% under the 1-layer and 3-layer settings.
• SGL-ED and SGL-RW have very close performances and outperform SGL-ND by large margins. In many cases, SGL-WA shows advantages over SGL-ND but still falls behind SGL-ED and SGL-RW. These results further corroborate that the InfoNCE loss is the primary factor which accounts for the performance gains, and meanwhile heuristic graph augmentations are not as effective as expected, and some of them even degrade the performance.
• XSimGCL/SimGCL show the best/second best performance in almost all the cases, which demonstrates the effectiveness of the random noise-based data augmentation. Particularly, on the largest and sparsest dataset, Alibaba-iFashion, they significantly outperform the SGL variants. In addition, it is obvious that the evolution from SimGCL to XSimGCL is successful, bringing non-negligible performance gains.

To further demonstrate XSimGCL's outstanding performance, we also compare it with a few recent augmentation-based or CL-based recommendation models. The implementations of these methods are available in our GitHub repository SELFRec as well. According to Table 6, XSimGCL and SimGCL still outperform with a great lead, achieving the best and the second best performance, respectively. NCL and MixGCF, which employ LightGCN as their backbones, also show their competitiveness. By contrast, DNN+SSL and BUIR are not as powerful as expected and are even not comparable to LightGCN. We attribute their failure to: (1) DNNs are proved effective when abundant user/item features are provided. In our datasets, features are unavailable and the self-supervision signals are created by masking item embeddings, so DNN+SSL cannot fulfill itself in this situation.
(2) In the paper of BUIR, the authors removed long-tail users and items to guarantee a good result, but we use all the data. We also notice that BUIR performs very well on suggesting popular items but poorly on long-tail items. This may explain why the original paper uses a biased experimental setting.

TABLE 6: Performance comparison with other models.

Method      Yelp2018            Kindle              iFashion
            Recall    NDCG      Recall    NDCG      Recall    NDCG
LightGCN    0.0639    0.0525    0.2057    0.1315    0.1053    0.0505
NCL         0.0670    0.0562    0.2090    0.1348    0.1088    0.0528
BUIR        0.0487    0.0404    0.0922    0.0528    0.0830    0.0384
DNN+SSL     0.0483    0.0382    0.1520    0.0989    0.0818    0.0375
MixGCF      0.0713    0.0589    0.2098    0.1355    0.1085    0.0520
SimGCL      0.0721    0.0601    0.2104    0.1374    0.1151    0.0567
XSimGCL     0.0723    0.0604    0.2147    0.1415    0.1196    0.0586

4.2.2 Comparison of Training Efficiency

As has been claimed, XSimGCL is almost as lightweight as LightGCN in theory. In this part, we report the actual training time, which is more informative than the theoretical analysis. The reported figures are collected on a workstation with an Intel(R) Xeon(R) Gold 5122 CPU and a GeForce RTX 2080Ti GPU. These methods are implemented with TensorFlow 1.14, and a 2-layer setting is applied to all.

Fig. 6: The training speed of compared methods (batch time, number of epochs needed, and total training time on Yelp2018, Kindle, and iFashion).

According to Fig. 6, we have the following observations:
• SGL-ED takes the longest time to finish the computation in a single batch, which is almost four times that of LightGCN on all the datasets. SimGCL ranks second due to its three-encoder architecture, which is almost two times that of LightGCN. Since SGL-WA, XSimGCL and LightGCN have the same architecture, their training costs for a batch are very close. The former two need a bit of extra time for the contrastive task.
• LightGCN is trained with hundreds of epochs, which is at least an order of magnitude more than the epochs that other methods need. By contrast, XSimGCL needs the fewest epochs to reach convergence, and its predecessor SimGCL falls behind by several epochs. SGL-WA and SGL-ED require the same number of epochs to get converged and are slower than SimGCL. When it comes to the total training time, LightGCN is still the method trained for the longest time, followed by SGL-ED and SimGCL. Due to the simple architecture, SGL-WA and XSimGCL are the last two, but XSimGCL only needs about half of the cost SGL-WA spends in total.

With these observations, we can easily draw some conclusions. First, CL can tremendously accelerate the training. Second, graph augmentations cannot contribute to the training efficiency. Third, the cross-layer contrast not only brings performance improvement but also leads to faster convergence. By analyzing the gradients from the CL loss, we find that the noises in XSimGCL and SimGCL will add a small increment to the gradients, which works like a momentum and can explain the speedup. Compared with the final-layer contrast, the cross-layer contrast has a shorter route for gradient propagation. This can explain why XSimGCL needs fewer epochs compared with SimGCL.

4.2.3 Comparison of Ability to Promote Long-tail Items

Fig. 7: The ability to promote long-tail items (Recall@20 per popularity group on Yelp2018 and Alibaba-iFashion).

Optimizing the InfoNCE loss is found to learn more evenly distributed representations, which is supposed to alleviate the popularity bias. To verify that XSimGCL upgrades this ability with the noise-based augmentation, we follow [12] to divide the test set into ten groups, labeled with IDs from 1 to 10. Each group includes the same number of interactions, and the larger the ID of a group, the more popular the items it contains. We then conduct experiments with a 2-layer setting to check the Recall@20 value that each group achieves. The results are illustrated in Fig. 7.
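One plausible way to build these popularity groups is sketched below; the exact protocol follows [12] and the names here are illustrative:

```python
from collections import Counter

def split_items_by_popularity(interactions, num_groups=10):
    """Group items so that every group covers roughly the same number of interactions.

    `interactions` is an iterable of (user, item) pairs; group 1 holds the least
    popular (long-tail) items and group `num_groups` the most popular ones.
    """
    pop = Counter(item for _, item in interactions)
    items = sorted(pop, key=pop.get)                    # least popular first
    budget = sum(pop.values()) / num_groups
    groups, current, acc = [], [], 0
    for item in items:
        current.append(item)
        acc += pop[item]
        if acc >= budget and len(groups) < num_groups - 1:
            groups.append(current)
            current, acc = [], 0
    groups.append(current)                              # last (most popular) group
    return groups
```

The per-group Recall@20 reported in Fig. 7 is then obtained by restricting the test interactions to the items of each group.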
According to Fig. 7, LightGCN is inclined to recommend popular items and achieves the highest recall value on the last group. By contrast, XSimGCL and SimGCL do not show outstanding performance on group 10, but they have distinct advantages over LightGCN on the other groups. Particularly, SimGCL is the standout on Yelp2018 and XSimGCL keeps strong on iFashion. Their extraordinary performance in recommending long-tail items largely compensates for their loss on the popular item group. As for the SGL variants, they fall between LightGCN and SimGCL on exploring long-tail items and exhibit similar recommendation performance on Yelp2018. SGL-ED shows a slight advantage over SGL-WA on iFashion. Combining Fig. 3 with Fig. 7, we can easily find that the ability to promote long-tail items seems to positively correlate with the uniformity of representations. Since a good recommender system should suggest items that are most pertinent to a particular user instead of recommending popular items that might have been known already, SimGCL and XSimGCL significantly outperform other methods in this regard.
4.3 Hyperparameter Investigation

XSimGCL has three important hyperparameters: λ, the coefficient of the contrastive task; ε, the magnitude of added noises; and l*, the layer to be contrasted. In this part, we investigate the model's sensitivity to these hyperparameters.

4.3.1 Influence of λ and ε

We try different combinations of λ and ε with the set [0.01, 0.05, 0.1, 0.2, 0.5, 1] for λ and [0, 0.01, 0.05, 0.1, 0.2, 0.5] for ε. We fix l*=1 and conduct experiments with a 2-layer setting, but we found that the best values of these two hyperparameters are also applicable to other settings. As shown in Fig. 8, on all the datasets XSimGCL reaches its best performance when ε is in [0.05, 0.1, 0.2]. Without the added noises (ε=0), we can see an obvious performance drop. When ε is too large (e.g., ε=0.5) or too small (e.g., ε=0.01), the performance declines as well. A similar trend is also observed by changing the value of λ. The performance is at its peak when λ=0.2 on Yelp2018, λ=0.2 on Amazon-Kindle, and λ=0.05 on Alibaba-iFashion. According to our experience, XSimGCL (SimGCL) is more sensitive to λ, and ε = 0.1 is usually a good and safe choice on most datasets. Besides, we also find that a larger ε leads to faster convergence. But when it is overlarge (e.g., greater than 1), it will act like a large learning rate, causing progressive zigzag optimization which will overshoot the minimum.

Fig. 8: The influence of λ and ε (Recall@20 and NDCG@20 on Yelp2018, Amazon-Kindle, and Alibaba-iFashion).

4.3.2 Layer Selection for Contrast

In XSimGCL, two layers are chosen to be contrasted. We report the results of different choices in Fig. 9, where a 3-layer setting is used. Since these matrix-like heat maps are symmetric, we only display the lower triangular parts. The figures in the diagonal cells represent the results of contrasting the same layer. As can be seen, although the best choice varies from dataset to dataset, it always appears as the contrast between the final layer and one of the previous layers. We analyzed the similarities between representations of different layers and tried to find if l* is related to the similarity, but no evidence was found. Fortunately, XSimGCL usually achieves the best performance with a 3-layer setting, which means three attempts are enough. The amount of manual work for tuning l* is therefore greatly reduced. A compromise without tuning l* is to randomly choose a layer in every mini-batch and contrast its embeddings with the final embeddings. We report the results of this random selection in the upper right of the heat maps. They are acceptable but much lower than the best performance.

Fig. 9: The influence of the layer selection for contrast (Recall@20 and NDCG@20 heat maps over layer pairs L1/L2/L3/Final, plus random selection, on the three datasets).

4.4 Applicability Investigation

The noise-based CL has been proved effective when combined with LightGCN. We further wonder whether this method is applicable to other common backbones such as MF and GCN. Besides, whether uniform noises are the best choice remains unknown. In this part, we investigate the applicability of the noise-based augmentation and different types of noises.

4.4.1 Noise-Based CL on Other Structures

We choose MF and the vanilla GCN as the backbones to be tested because these two simple backbones are widely used in practice. For MF, which cannot adopt the cross-layer contrast, we add different uniform noises to the input embeddings for different augmentations. We tried many combinations of λ and ε on these two structures and report the best results in Table 7, where NBC is short for noise-based CL. As can be seen, NBC can likewise improve GCN. We guess this is because GCN also has an aggregation mechanism. However, NBC cannot consistently improve MF. On the dataset of Amazon-Kindle it works, whereas on the other two datasets it lowers the performance. This inconsistent effect cannot be easily concluded, and we leave it to our future work.

TABLE 7: Performance comparison of different backbones.

Method      Yelp2018            Kindle              iFashion
            Recall    NDCG      Recall    NDCG      Recall    NDCG
MF          0.0543    0.0445    0.1751    0.1068    0.0996    0.0468
MF + NBC    0.0517    0.0433    0.1878    0.1175    0.0975    0.0453
GCN         0.0556    0.0452    0.1833    0.1137    0.0952    0.0458
GCN + NBC   0.0632    0.0530    0.1989    0.1290    0.1017    0.0486
4.4.2 XSimGCL with Other Types of Noises

We test three other types of noises in this experiment, including adversarial perturbations obtained by following FGSM [36] (denoted by XSimGCL_a), positive uniform noises without the sign of the learned embeddings (denoted by XSimGCL_p), and Gaussian noises (denoted by XSimGCL_g). We tried many combinations of λ and ε for the different types of noises and present the results in Table 8. As observed, the vanilla XSimGCL with signed uniform noises outperforms the other variants. Although positive uniform noises and Gaussian noises also bring hefty performance gains compared with LightGCN, adding adversarial noises unexpectedly leads to a large drop in performance. This indicates that only a few particular distributions can generate helpful noises. Besides, the result that XSimGCL outperforms XSimGCL_p demonstrates the necessity of the sign constraint.

TABLE 8: Performance comparison of different XSimGCL variants.

Method      Yelp2018            Kindle              iFashion
            Recall    NDCG      Recall    NDCG      Recall    NDCG
LightGCN    0.0639    0.0525    0.2057    0.1315    0.1053    0.0505
XSimGCL_a   0.0558    0.0464    0.1267    0.0833    0.0158    0.0065
XSimGCL_p   0.0714    0.0596    0.2121    0.1398    0.1183    0.0577
XSimGCL_g   0.0722    0.0602    0.2140    0.1410    0.1190    0.0583
XSimGCL     0.0723    0.0604    0.2147    0.1415    0.1196    0.0586
et al. [55] integrated CL into a self-supervised knowledge
distillation framework to transfer more knowledge from
the server-side large recommendation model to resource-
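For reference, the noise types compared in Table 8 can be generated as follows. This is a schematic sketch with our own function names; the adversarial variant only indicates where the FGSM-style perturbation [36] comes from (the sign of the loss gradient with respect to the embeddings) rather than reproducing a full adversarial training loop.

import torch
import torch.nn.functional as F

def signed_uniform_noise(emb, eps):      # vanilla XSimGCL
    return eps * F.normalize(torch.rand_like(emb), dim=-1) * torch.sign(emb)

def positive_uniform_noise(emb, eps):    # XSimGCL_p: drops the sign constraint
    return eps * F.normalize(torch.rand_like(emb), dim=-1)

def gaussian_noise(emb, eps):            # XSimGCL_g
    return eps * F.normalize(torch.randn_like(emb), dim=-1)

def adversarial_noise(emb, cl_loss, eps):  # XSimGCL_a: FGSM-style perturbation [36]
    # emb must require gradients and participate in cl_loss.
    grad = torch.autograd.grad(cl_loss, emb, retain_graph=True)[0]
    return eps * torch.sign(grad)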
5 RELATED WORK

5.1 GNNs-Based Recommendation Models

In recent years, graph neural networks (GNNs) [41], [42] have superseded conventional DNNs and become a routine choice in recommender systems for their extraordinary ability to model user behavior data. A great number of recommendation models developed from GNNs have achieved unprecedented performance in different recommendation scenarios [40], [13], [28], [43], [44]. Among the numerous variants of GNNs, GCN [45] is the most prevalent one and drives many state-of-the-art graph neural recommendation models such as NGCF [46], LightGCN [28], LR-GCCF [47] and LCF [48]. Despite varying implementation details, all these GCN-based models share a common scheme, which is to aggregate information from the neighborhood in the user-item graph layer by layer [42]. Benefiting from its simple structure, LightGCN has become one of the most popular GCN-based recommendation models. It follows SGC [49] in removing the redundant operations of the vanilla GCN, including transformation matrices and nonlinear activation functions. This design has proved efficient and effective for recommendation, where only the user-item interactions are provided. It also inspires many CL-based recommendation models such as SGL [12], NCL [18] and SimGCL [24].
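As a concrete illustration of this shared scheme, the message passing of LightGCN [28] can be written in a few lines: each layer multiplies the symmetrically normalized user-item adjacency matrix with the current embeddings, with no transformation matrices or nonlinear activations, and the per-layer outputs are averaged at the end. The sketch below uses our own variable names and is simplified for brevity.

import torch

def lightgcn_propagate(norm_adj, ego_embeddings, num_layers):
    # norm_adj: [N, N] symmetrically normalized adjacency (D^-1/2 A D^-1/2)
    # ego_embeddings: [N, d] initial user/item embeddings (the only learnable parameters)
    all_embeddings = [ego_embeddings]
    emb = ego_embeddings
    for _ in range(num_layers):
        emb = torch.sparse.mm(norm_adj, emb) if norm_adj.is_sparse else norm_adj @ emb
        all_embeddings.append(emb)
    # Layer combination is a simple mean; there are no weights or activations to learn.
    return torch.stack(all_embeddings, dim=0).mean(dim=0)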
5.2 Contrastive Learning for Recommendation

Contrastive learning [1], [2] has recently drawn considerable attention in many fields due to its ability to deal with massive unlabeled data [6], [5], [4]. As CL usually works in a self-supervised manner [3], it is inherently a silver bullet for the data sparsity issue [50] in recommender systems. Inspired by the success of CL in other fields, the community has also started to integrate CL into recommendation [15], [12], [14], [17], [13], [51], [52], [16]. The fundamental idea behind existing contrastive recommendation methods is to regard every instance (e.g., user or item) as a class, and then pull views of the same instance closer and push views of different instances apart when learning representations, where the views are augmented by applying different transformations to the original data. In brief, the goal is to maximize the mutual information between views of the same instance. Through this process, the recommendation model is expected to learn the essential information of the user-item interactions.
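In existing contrastive recommendation methods, this objective is typically instantiated as the InfoNCE loss [22]. A minimal batched version could look as follows, where z1 and z2 are the representations of two augmented views of the same mini-batch of users or items and tau is the temperature (all names are ours for illustration).

import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.2):
    # z1, z2: [batch, dim] representations of two views of the same instances.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau            # pairwise cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    # Diagonal entries are the positive pairs; the rest of each row serves as negatives.
    return F.cross_entropy(logits, labels)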
To the best of our knowledge, S3-Rec [15] is the first work that combines CL with sequential recommendation. It first randomly masks part of the attributes and items to create sequence augmentations, and then pre-trains a Transformer [53] by encouraging consistency between different augmentations. A similar idea is found in the concurrent work CL4SRec [25], where more augmentation approaches, including item reordering and cropping, are used. Besides, S2-DHCN [14] and ICL [18] adopt advanced augmentation strategies that re-organize/cluster the sequential data for more effective self-supervised signals. Qiu et al. proposed DuoRec [54], which adopts a model-level augmentation by conducting dropout on the encoder. Xia et al. [55] integrated CL into a self-supervised knowledge distillation framework to transfer more knowledge from the server-side large recommendation model to resource-constrained on-device models to enhance next-item recommendation. In the same period, CL was also introduced to different graph-based recommendation scenarios. S2-MHCN [13] and SMIN [56] integrate CL into social recommendation. HHGR [26] proposes a double-scale augmentation approach for group recommendation and develops a finer-grained contrastive objective for users and groups. CCDR [57] and CrossCBR [58] have explored the use of CL in cross-domain and bundle recommendation. Yao et al. [27] proposed a feature dropout-based two-tower architecture for large-scale item recommendation. NCL [18] designs a prototypical contrastive objective to capture the correlations between a user/item and its context. SEPT [17] and COTREC [59] further propose to mine multiple positive samples with semi-supervised learning on the perturbed graph for social/session-based recommendation. The most widely used model is SGL [12], which relies on edge/node dropout to augment the graph data. Although these methods have demonstrated their effectiveness, they pay little attention to why CL can enhance recommendation.

6 CONCLUSION

In this paper, we revisit graph CL in recommendation and investigate how it enhances graph recommendation models. Our findings are surprising: the InfoNCE loss is the decisive factor that accounts for most of the performance gains, whilst the elaborate graph augmentations only play a secondary role. Optimizing the InfoNCE loss leads to a more even representation distribution, which helps to promote long-tail items in the recommendation scenario. In light of this, we propose a simple yet effective noise-based augmentation approach, which can smoothly adjust the uniformity of the representation distribution through CL. An extremely simple model, XSimGCL, is also put forward, which brings an ultralight architecture for CL-based recommendation. Extensive experiments on three large and highly sparse datasets demonstrate that XSimGCL is an ideal alternative to its graph augmentation-based counterparts.
REFERENCES

[1] A. Jaiswal, A. R. Babu, M. Z. Zadeh, D. Banerjee, and F. Makedon, “A survey on contrastive self-supervised learning,” Technologies, vol. 9, no. 1, p. 2, 2021.
[2] X. Liu, F. Zhang, Z. Hou, Z. Wang, L. Mian, J. Zhang, and J. Tang, “Self-supervised learning: Generative or contrastive,” arXiv preprint arXiv:2006.08218, vol. 1, no. 2, 2020.
[3] J. Yu, H. Yin, X. Xia, T. Chen, J. Li, and Z. Huang, “Self-supervised learning for recommender systems: A survey,” arXiv preprint arXiv:2006.07733, 2022.
[4] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen, “Graph contrastive learning with augmentations,” NeurIPS, vol. 33, 2020.
[5] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in ICML, 2020, pp. 1597–1607.
[6] T. Gao, X. Yao, and D. Chen, “Simcse: Simple contrastive learning of sentence embeddings,” in EMNLP, 2021, pp. 6894–6910.
[7] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in CVPR, 2020, pp. 9729–9738.
[8] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar et al., “Bootstrap your own latent: A new approach to self-supervised learning,” NeurIPS, 2020.
[9] M. Singh, “Scalability and sparsity issues in recommender datasets: a survey,” Knowledge and Information Systems, vol. 62, no. 1, pp. 1–43, 2020.
[10] T. T. Nguyen, M. Weidlich, D. C. Thang, H. Yin, and N. Q. V. Hung, “Retaining data from streams of social platforms with minimal regret,” in IJCAI, 2017, pp. 2850–2856.
[11] T. Chen, H. Yin, Q. V. H. Nguyen, W.-C. Peng, X. Li, and X. Zhou, “Sequence-aware factorization machines for temporal predictive analytics,” in ICDE, 2020, pp. 1405–1416.
[12] J. Wu, X. Wang, F. Feng, X. He, L. Chen, J. Lian, and X. Xie, “Self-supervised graph learning for recommendation,” in SIGIR, 2021, pp. 726–735.
[13] J. Yu, H. Yin, J. Li, Q. Wang, N. Q. V. Hung, and X. Zhang, “Self-supervised multi-channel hypergraph convolutional network for social recommendation,” in WWW, 2021, pp. 413–424.
[14] X. Xia, H. Yin, J. Yu, Q. Wang, L. Cui, and X. Zhang, “Self-supervised hypergraph convolutional networks for session-based recommendation,” in AAAI, 2021, pp. 4503–4511.
[15] K. Zhou, H. Wang, W. X. Zhao, Y. Zhu, S. Wang, F. Zhang, Z. Wang, and J.-R. Wen, “S3-rec: Self-supervised learning for sequential recommendation with mutual information maximization,” in CIKM, 2020, pp. 1893–1902.
[16] C. Zhou, J. Ma, J. Zhang, J. Zhou, and H. Yang, “Contrastive learning for debiased candidate generation in large-scale recommender systems,” in KDD, 2021, pp. 3985–3995.
[17] J. Yu, H. Yin, M. Gao, X. Xia, X. Zhang, and N. Q. V. Hung, “Socially-aware self-supervised tri-training for recommendation,” in KDD, 2021, pp. 2084–2092.
[18] Z. Lin, C. Tian, Y. Hou, and W. X. Zhao, “Improving graph collaborative filtering with neighborhood-enriched contrastive learning,” in WWW, 2022, pp. 2320–2329.
[19] P. Bachman, R. D. Hjelm, and W. Buchwalter, “Learning representations by maximizing mutual information across views,” NeurIPS, pp. 15509–15519, 2019.
[20] X. Zhou, A. Sun, Y. Liu, J. Zhang, and C. Miao, “Selfcf: A simple framework for self-supervised collaborative filtering,” arXiv preprint arXiv:2107.03019, 2021.
[21] D. Lee, S. Kang, H. Ju, C. Park, and H. Yu, “Bootstrapping user and item representations for one-class collaborative filtering,” in SIGIR, 2021, pp. 1513–1522.
[22] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
[23] J. Chen, H. Dong, X. Wang, F. Feng, M. Wang, and X. He, “Bias and debias in recommender system: A survey and future directions,” arXiv preprint arXiv:2010.03240, 2020.
[24] J. Yu, H. Yin, X. Xia, T. Chen, L. Cui, and Q. V. H. Nguyen, “Are graph augmentations necessary? Simple graph contrastive learning for recommendation,” in SIGIR, 2022, pp. 1294–1303.
[25] X. Xie, F. Sun, Z. Liu, S. Wu, J. Gao, J. Zhang, B. Ding, and B. Cui, “Contrastive learning for sequential recommendation,” in ICDE, 2022, pp. 1259–1273.
[26] J. Zhang, M. Gao, J. Yu, L. Guo, J. Li, and H. Yin, “Double-scale self-supervised hypergraph learning for group recommendation,” in CIKM, 2021, pp. 2557–2567.
[27] T. Yao, X. Yi, D. Z. Cheng, F. Yu, T. Chen, A. Menon, L. Hong, E. H. Chi, S. Tjoa, J. Kang et al., “Self-supervised learning for large-scale item recommendations,” in CIKM, 2021, pp. 4321–4330.
[28] X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, and M. Wang, “Lightgcn: Simplifying and powering graph convolution network for recommendation,” in SIGIR, 2020, pp. 639–648.
[29] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, “Bpr: Bayesian personalized ranking from implicit feedback,” in UAI, 2009, pp. 452–461.
[30] R. He and J. McAuley, “Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering,” in WWW, 2016, pp. 507–517.
[31] T. Wang and P. Isola, “Understanding contrastive representation learning through alignment and uniformity on the hypersphere,” in ICML, 2020, pp. 9929–9939.
[32] L. Van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 11, 2008.
[33] Z. I. Botev, J. F. Grotowski, and D. P. Kroese, “Kernel density estimation via diffusion,” The Annals of Statistics, vol. 38, no. 5, pp. 2916–2957, 2010.
[34] H. Yin, B. Cui, J. Li, J. Yao, and C. Chen, “Challenging the long tail recommendation,” Proc. VLDB Endow., vol. 5, no. 9, pp. 896–907, 2012.
[35] D. Chen, Y. Lin, W. Li, P. Li, J. Zhou, and X. Sun, “Measuring and relieving the over-smoothing problem for graph neural networks from the topological view,” in AAAI, vol. 34, no. 04, 2020, pp. 3438–3445.
[36] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in ICLR, 2015.
[37] X. Zhang, F. X. Yu, S. Kumar, and S.-F. Chang, “Learning spread-out local feature descriptors,” in CVPR, 2017, pp. 4595–4603.
[38] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola, “What makes for good views for contrastive learning?” NeurIPS, vol. 33, pp. 6827–6839, 2020.
[39] T. Huang, Y. Dong, M. Ding, Z. Yang, W. Feng, X. Wang, and J. Tang, “Mixgcf: An improved training method for graph neural network-based recommender systems,” in KDD, 2021, pp. 665–674.
[40] X. Wang, H. Jin, A. Zhang, X. He, T. Xu, and T.-S. Chua, “Disentangled graph collaborative filtering,” in SIGIR, 2020, pp. 1001–1010.
[41] C. Gao, X. Wang, X. He, and Y. Li, “Graph neural networks for recommender system,” in WSDM, 2022, pp. 1623–1625.
[42] S. Wu, F. Sun, W. Zhang, X. Xie, and B. Cui, “Graph neural networks in recommender systems: a survey,” CSUR, 2020.
[43] J. Yu, H. Yin, J. Li, M. Gao, Z. Huang, and L. Cui, “Enhance social recommendation with adversarial graph convolutional networks,” IEEE Transactions on Knowledge and Data Engineering, 2020.
[44] S. Wu, Y. Tang, Y. Zhu, L. Wang, X. Xie, and T. Tan, “Session-based recommendation with graph neural networks,” in AAAI, vol. 33, no. 01, 2019, pp. 346–353.
[45] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in ICLR, 2017.
[46] X. Wang, X. He, M. Wang, F. Feng, and T.-S. Chua, “Neural graph collaborative filtering,” in SIGIR, 2019, pp. 165–174.
[47] L. Chen, L. Wu, R. Hong, K. Zhang, and M. Wang, “Revisiting graph based collaborative filtering: A linear residual graph convolutional network approach,” in AAAI, vol. 34, no. 01, 2020, pp. 27–34.
[48] W. Yu and Z. Qin, “Graph convolutional network for recommendation with low-pass collaborative filters,” in ICML, 2020, pp. 10936–10945.
[49] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger, “Simplifying graph convolutional networks,” in ICML, 2019, pp. 6861–6871.
[50] J. Yu, M. Gao, J. Li, H. Yin, and H. Liu, “Adaptive implicit friends identification over heterogeneous network for social recommendation,” in CIKM, 2018, pp. 357–366.
[51] J. Ma, C. Zhou, H. Yang, P. Cui, X. Wang, and W. Zhu, “Disentangled self-supervision in sequential recommenders,” in KDD, 2020, pp. 483–491.
[52] Y. Wei, X. Wang, Q. Li, L. Nie, Y. Li, X. Li, and T.-S. Chua, “Contrastive learning for cold-start recommendation,” in ACM Multimedia, 2021, pp. 5382–5390.
[53] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” NeurIPS, vol. 30, 2017.
[54] R. Qiu, Z. Huang, H. Yin, and Z. Wang, “Contrastive learning for representation degeneration problem in sequential recommendation,” in WSDM, 2022, pp. 813–823.
[55] X. Xia, H. Yin, J. Yu, Q. Wang, G. Xu, and Q. V. H. Nguyen, “On-device next-item recommendation with self-supervised knowledge distillation,” in SIGIR, 2022, pp. 546–555.
[56] X. Long, C. Huang, Y. Xu, H. Xu, P. Dai, L. Xia, and L. Bo, “Social recommendation with self-supervised metagraph informax network,” in CIKM, 2021, pp. 1160–1169.
[57] R. Xie, Q. Liu, L. Wang, S. Liu, B. Zhang, and L. Lin, “Contrastive cross-domain recommendation in matching,” in KDD, 2022, pp. 4226–4236.
[58] Y. Ma, Y. He, A. Zhang, X. Wang, and T.-S. Chua, “Crosscbr: Cross-view contrastive learning for bundle recommendation,” in KDD, 2022, pp. 1233–1241.
[59] X. Xia, H. Yin, J. Yu, Y. Shao, and L. Cui, “Self-supervised graph co-training for session-based recommendation,” in CIKM, 2021, pp. 2180–2190.

Junliang Yu received his B.S. and M.S. degrees in Software Engineering from Chongqing University, China. Currently, he is a final-year Ph.D. candidate at the School of Information Technology and Electrical Engineering, The University of Queensland. His research interests include recommender systems, social media analytics, and self-supervised learning.

Xin Xia received her B.S. degree in Software Engineering from Jilin University, China. Currently, she is a third-year Ph.D. candidate at the School of Information Technology and Electrical Engineering, The University of Queensland. Her research interests include knowledge distillation, sequence modeling, and self-supervised learning.

Tong Chen received his Ph.D. degree in computer science from The University of Queensland in 2020. He is currently a Lecturer with the Data Science research group, School of Information Technology and Electrical Engineering, The University of Queensland. His research interests include data mining, recommender systems, user behavior modelling, and predictive analytics.

Lizhen Cui is a full professor with Shandong University. He is the appointed dean and deputy party secretary of the School of Software, co-director of the Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), and director of the Research Center of Software and Data Engineering, Shandong University. His main interests include big data intelligence theory, data mining, wisdom science, and medical health big data AI applications.

Nguyen Quoc Viet Hung is a senior lecturer and an ARC DECRA Fellow at Griffith University. He earned his Master's and Ph.D. degrees from EPFL in 2010 and 2014, respectively. His research focuses on data integration, data quality, information retrieval, trust management, recommender systems, machine learning, and big data visualization, with special emphasis on web data, social data, and sensor data.

Hongzhi Yin is an associate professor and ARC Future Fellow at The University of Queensland. He received his Ph.D. degree from Peking University in 2014. His research interests include recommender systems, deep learning, social media mining, and federated learning. He is currently serving as an Associate Editor/Guest Editor/Editorial Board member for ACM Transactions on Information Systems (TOIS), ACM Transactions on Intelligent Systems and Technology (TIST), etc.
