XSimGCL: Towards Extremely Simple Graph Contrastive Learning for Recommendation
Abstract—Contrastive learning (CL) has recently been demonstrated to be critical for improving recommendation performance. The
fundamental idea of CL-based recommendation models is to maximize the consistency between representations learned from different
graph augmentations of the user-item bipartite graph. In such a self-supervised manner, CL-based recommendation models are
expected to extract general features from the raw data to tackle the data sparsity issue. Despite the effectiveness of this paradigm, we
still have no clue what underlies the performance gains. In this paper, we first reveal that CL enhances recommendation through
endowing the model with the ability to learn more evenly distributed user/item representations, which can implicitly alleviate the
pervasive popularity bias and promote long-tail items. Meanwhile, we find that the graph augmentations, which were considered a
necessity in prior studies, are relatively unreliable and less significant in CL-based recommendation. On top of these findings, we put
forward an eXtremely Simple Graph Contrastive Learning method (XSimGCL) for recommendation, which discards the ineffective
graph augmentations and instead employs a simple yet effective noise-based embedding augmentation to create views for CL. A
comprehensive experimental study on three large and highly sparse benchmark datasets demonstrates that, though the proposed
method is extremely simple, it can smoothly adjust the uniformity of learned representations and outperforms its graph
augmentation-based counterparts by a large margin in both recommendation accuracy and training efficiency. The code is released at
[Link]
1 INTRODUCTION
As learning from the unlabeled raw data in such a self-supervised manner is a silver bullet for addressing the data sparsity issue [9], [10], [11], CL also pushes forward the frontier of recommendation. A flurry of enthusiasm for CL-based recommendation [12], [13], [14], [15], [16], [17], [18] has recently been witnessed, followed by a string of promising results. Based on these studies, a paradigm of contrastive recommendation can be clearly profiled. It mainly includes two steps: first augmenting the original user-item bipartite graph with structure perturbations (e.g., edge/node dropout at a certain ratio), and then maximizing the consistency of representations learned from different graph augmentations [3] under a joint learning framework (shown in Fig. 1).

Fig. 1: Graph contrastive learning with edge dropout for recommendation.

Despite the effectiveness of this paradigm, we still have no clue what underlies the performance gains. Intuitively, encouraging the agreement between related graph augmentations can learn representations that are invariant to slight structure perturbations and capture the essential information of the original user-item bipartite graph [1], [19]. However, it is surprising that several recent studies [17], [20], [21] have reported that the performances of their CL-based recommendation models are not sensitive to graph augmentations with different edge dropout rates, and even a large drop rate (e.g., 0.9) can somehow benefit the model. Such a finding is elusive, because a large dropout rate results in a huge loss of the raw structural information and should have had a negative effect only. This naturally raises an intriguing and fundamental question: Are graph augmentations really a necessity for CL-based recommendation models?

To answer this question, we first conduct experiments with and without graph augmentations and compare the performances. The results show that when graph augmentations are detached, there is only a slight performance drop. We then investigate the representations learned by non-CL and CL-based recommendation methods. By visualizing the representations, we observe that the jointly optimized contrastive loss InfoNCE [22] is what really matters, rather than the graph augmentations.

• J. Yu, X. Xia, T. Chen, and H. Yin are with the School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, Queensland, Australia. E-mail: {[Link], [Link], [Link], h.yin1}@[Link]
• H. Nguyen is with the Institute for Integrated and Intelligent Systems, Griffith University, Gold Coast, Australia. E-mail: quocviethung1@[Link]
• Lizhen Cui is with the School of Software, Shandong University, Jinan, China. E-mail: clz@[Link]
∗ Corresponding author.
2 REVISITING GRAPH CONTRASTIVE LEARNING FOR RECOMMENDATION

TABLE 1: Performance comparison of different SGL variants.

Method     Yelp2018          Kindle            iFashion
           Recall   NDCG     Recall   NDCG     Recall   NDCG
LightGCN   0.0639   0.0525   0.2053   0.1315   0.0955   0.0461
SGL-ND     0.0644   0.0528   0.2069   0.1328   0.1032   0.0498
SGL-ED     0.0675   0.0555   0.2090   0.1352   0.1093   0.0531
SGL-RW     0.0667   0.0547   0.2105   0.1351   0.1095   0.0531
SGL-WA     0.0671   0.0550   0.2084   0.1347   0.1065   0.0519

2.1 Contrastive Recommendation with Graph Augmentations

Generally, data augmentations are the prerequisite for CL-based recommendation models [15], [12], [13], [25]. In this section, we investigate the widely used dropout-based augmentations on graphs [12], [4], which assume that representations that are invariant to partial structure perturbations are of high quality. We target a representative state-of-the-art CL-based recommendation model, SGL [12], which performs node/edge dropout to augment the user-item graph. The joint learning scheme in SGL is formulated as:

$\mathcal{L} = \mathcal{L}_{rec} + \lambda \mathcal{L}_{cl}$,  (1)

which consists of the recommendation loss $\mathcal{L}_{rec}$ and the contrastive loss $\mathcal{L}_{cl}$. Since the goal of SGL is to recommend items, the CL task plays an auxiliary role and its effect is modulated by a hyperparameter $\lambda$. As for the concrete forms of these two losses, SGL adopts the standard BPR loss [29] for recommendation and the InfoNCE loss [22] for CL. The standard BPR loss is defined as:

$\mathcal{L}_{rec} = \sum_{(u,i)\in\mathcal{B}} -\log \sigma(\mathbf{e}_u^{\top}\mathbf{e}_i - \mathbf{e}_u^{\top}\mathbf{e}_j)$,  (2)

where $\sigma$ is the sigmoid function, $\mathbf{e}_u$ is the user representation, $\mathbf{e}_i$ is the representation of an item that user $u$ has interacted with, $\mathbf{e}_j$ is the representation of a randomly sampled item, and $\mathcal{B}$ is a mini-batch. The InfoNCE loss [22] is formulated as:

$\mathcal{L}_{cl} = \sum_{i\in\mathcal{B}} -\log \frac{\exp(\mathbf{z}_i'^{\top}\mathbf{z}_i''/\tau)}{\sum_{j\in\mathcal{B}}\exp(\mathbf{z}_i'^{\top}\mathbf{z}_j''/\tau)}$,  (3)

where $i$ and $j$ are users/items in $\mathcal{B}$, $\mathbf{z}_i'$ and $\mathbf{z}_i''$ are L2-normalized representations learned from two different dropout-based graph augmentations (namely $\mathbf{z}_i' = \mathbf{e}_i'/\|\mathbf{e}_i'\|_2$), and $\tau > 0$ (e.g., 0.2) is the temperature which controls the strength of penalties on hard negative samples. The InfoNCE loss encourages the consistency between $\mathbf{z}_i'$ and $\mathbf{z}_i''$, which are the positive sample of each other, whilst minimizing the agreement between $\mathbf{z}_i'$ and $\mathbf{z}_j''$, which are negative samples of each other. Optimizing the InfoNCE loss actually maximizes a tight lower bound of mutual information.
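To make the scheme concrete, the following is a minimal PyTorch sketch of Eqs. (1)-(3). It is an illustration rather than the released SGL code; the tensors (user/item representations and the two augmented views) are assumed to be produced by the encoder, and the losses are averaged instead of summed over the mini-batch.

```python
import torch
import torch.nn.functional as F


def bpr_loss(user_e, pos_e, neg_e):
    """Eq. (2): BPR loss over a mini-batch of (user, positive item, negative item) triples."""
    scores = (user_e * pos_e).sum(-1) - (user_e * neg_e).sum(-1)
    return -F.logsigmoid(scores).mean()


def info_nce(view1, view2, tau=0.2):
    """Eq. (3): the two views of the same node are positives; all other in-batch nodes are negatives."""
    z1, z2 = F.normalize(view1, dim=-1), F.normalize(view2, dim=-1)
    pos = (z1 * z2).sum(-1) / tau          # z_i'^T z_i'' / tau
    logits = z1 @ z2.t() / tau             # z_i'^T z_j'' / tau for every j in the batch
    return (torch.logsumexp(logits, dim=-1) - pos).mean()


def sgl_joint_loss(user_e, pos_e, neg_e, view1, view2, lam=0.1, tau=0.2):
    """Eq. (1): recommendation loss plus the lambda-weighted auxiliary contrastive loss."""
    return bpr_loss(user_e, pos_e, neg_e) + lam * info_nce(view1, view2, tau)
```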
To learn representations from the user-item graph, SGL employs LightGCN [28] as its encoder, whose message passing process is defined as:

$\mathbf{E} = \frac{1}{1+L}\left(\mathbf{E}^{(0)} + \mathbf{A}\mathbf{E}^{(0)} + \dots + \mathbf{A}^{L}\mathbf{E}^{(0)}\right)$,  (4)

where $\mathbf{E}^{(0)} \in \mathbb{R}^{|N|\times d}$ is the matrix of node embeddings to be learned, $\mathbf{E}$ is the final representation matrix used for prediction, $|N|$ is the number of nodes, $L$ is the number of layers, and $\mathbf{A} \in \mathbb{R}^{|N|\times|N|}$ is the normalized undirected adjacency matrix without self-connections. By replacing $\mathbf{A}$ with the adjacency matrix $\tilde{\mathbf{A}}$ of the corrupted graph augmentations, $\mathbf{z}'$ and $\mathbf{z}''$ can be learned via Eq. (4). It should be noted that $\tilde{\mathbf{A}}$ is reconstructed in every epoch. For the sake of brevity, we only present the core ingredients of SGL and LightGCN here. More details can be found in the original papers [12], [28].
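A minimal sketch of the propagation rule in Eq. (4) follows; it is not the official LightGCN implementation, and it assumes `adj` is the precomputed normalized adjacency matrix (sparse or dense) and `emb0` the trainable embedding table.

```python
import torch


def lightgcn_propagate(adj, emb0, num_layers):
    """Eq. (4): average E^(0), A E^(0), ..., A^L E^(0) with no transformation
    matrices and no nonlinear activations."""
    out, layer_emb = emb0, emb0
    for _ in range(num_layers):
        layer_emb = torch.sparse.mm(adj, layer_emb) if adj.is_sparse else adj @ layer_emb
        out = out + layer_emb
    return out / (num_layers + 1)
```

Feeding the adjacency matrix of a corrupted graph augmentation instead of A would yield the two augmented views used in Eq. (3).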
2.2 Necessity of Graph Augmentations

The phenomenon reported in [17], [20], [21] that even very sparse graph augmentations can somehow benefit the recommendation model suggests that CL-based recommendation may work in a way different from what we reckon. To demystify how CL enhances recommendation, we first investigate the necessity of the graph augmentation in SGL. In the original paper of SGL [12], three variants are proposed, including SGL-ND (-ND denotes node dropout), SGL-ED (-ED denotes edge dropout), and SGL-RW (-RW denotes random walk, i.e., multi-layer edge dropout). As a control group, we construct a new variant of SGL, termed SGL-WA (-WA stands for w/o augmentation), in which the CL loss is defined as:

$\mathcal{L}_{cl} = \sum_{i\in\mathcal{B}} -\log \frac{\exp(1/\tau)}{\sum_{j\in\mathcal{B}}\exp(\mathbf{z}_i^{\top}\mathbf{z}_j/\tau)}$.  (5)

Because no augmentations are used in SGL-WA, we have $\mathbf{z}_i' = \mathbf{z}_i'' = \mathbf{z}_i$ (and, since the representations are L2-normalized, the numerator of Eq. (3) reduces to $\exp(1/\tau)$). The performance comparison is conducted on three benchmark datasets: Yelp2018 [28], Amazon-Kindle [30] and Alibaba-iFashion [12]. A 3-layer setting is adopted and the hyperparameters are tuned according to the original paper of SGL (more experimental settings can be found in Section 4.1). The results are presented in Table 1, where the highest values are marked in bold.

As can be observed, all the graph augmentation-based variants of SGL outperform LightGCN, which demonstrates the effectiveness of CL. However, to our surprise, SGL-WA is also competitive. Its performance is on par with that of SGL-ED and SGL-RW and is even better than that of SGL-ND on all the datasets. Given these results, we can draw two conclusions: (1) graph augmentations indeed work, but they are not as effective as expected; a large proportion of the performance gains derives from the contrastive loss InfoNCE, which can explain why even very sparse graph augmentations seem to be informative in [17], [20], [21]; (2) not all graph augmentations have a positive impact, and a long process of trial-and-error is required to pick out the useful ones. As for (2), the possible reason is that some graph augmentations highly distort the original graph. For example, node dropout is very likely to drop key nodes (e.g., hubs) and their associated edges and hence break the correlated subgraphs into disconnected pieces. Such graph augmentations share little learnable invariance with the original graph, and it is therefore unreasonable to encourage the consistency between them. By contrast, edge dropout is at a lower risk of largely perturbing the original graph, so SGL-ED/RW can hold a slim advantage over SGL-WA. However, in view of the expense of regularly reconstructing the adjacency matrices during training, it is sensible to search for better alternatives.
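A minimal sketch (not the authors' code) of this degenerate contrastive loss, which only pushes the in-batch representations apart:

```python
import torch
import torch.nn.functional as F


def sgl_wa_cl_loss(z, tau=0.2):
    """Eq. (5): with identical views, the positive term is the constant 1/tau,
    so the loss only separates the in-batch representations from each other."""
    z = F.normalize(z, dim=-1)
    logits = z @ z.t() / tau
    return (torch.logsumexp(logits, dim=-1) - 1.0 / tau).mean()
```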
2.3 Uniformity Is What Really Matters

The last section has revealed that the contrastive loss InfoNCE is the key. However, we still have no idea how it operates. Previous research [31] on visual representation learning has identified that pre-training with the InfoNCE loss intensifies two properties: alignment of features from positive pairs, and uniformity of the normalized feature distribution on the unit hypersphere. It is unclear if CL-based recommendation methods exhibit similar patterns under a joint learning setting. Since in recommendation the goal of $\mathcal{L}_{rec}$ is to align the interacted user-item pairs, here we focus on investigating the uniformity.

In our preliminary study [24], we displayed the distribution of 2,000 randomly sampled users after optimizing the InfoNCE loss. For a thorough understanding, in this version we sample both users and items. Specifically, we first rank users and items according to their popularity. Then 500 hot items are randomly sampled from the item group with the top 5% of interactions; another 500 cold items are randomly sampled from the group with the bottom 80% of interactions, and so are the users. Afterwards, we map the learned representations to 2-dimensional space with t-SNE [32]. All the representations are collected when the models reach their best performance. We then plot the 2D feature distributions in Fig. 3. For a clearer presentation, the Gaussian kernel density estimation [33] of arctan(feature y / feature x) on the unit hypersphere $\mathcal{S}^1$ is also visualized.

Fig. 3: The distribution of representations learned from three datasets. The top of each figure plots the learned 2D features and the bottom of each figure plots the Gaussian kernel density estimation of atan2(y, x) for each point (x, y) ∈ $\mathcal{S}^1$.
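A sketch of this visualization pipeline using scikit-learn's t-SNE and SciPy's Gaussian KDE; `emb` is assumed to be the matrix of learned embeddings for the sampled users or items, and the exact t-SNE settings are not specified in the paper.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.manifold import TSNE


def angle_density(emb, grid_size=200):
    """Project embeddings to 2D with t-SNE, then estimate the density of the
    angles atan2(y, x), i.e., the feature distribution on the circle S^1."""
    feats_2d = TSNE(n_components=2, init="pca", random_state=0).fit_transform(emb)
    angles = np.arctan2(feats_2d[:, 1], feats_2d[:, 0])   # one angle per 2D point
    grid = np.linspace(-np.pi, np.pi, grid_size)
    density = gaussian_kde(angles)(grid)                  # Gaussian kernel density estimate
    return feats_2d, grid, density
```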
From Fig. 3, an obvious contrast between the features/density estimations learned by LightGCN and by CL-based recommendation models can be observed. In the leftmost column, LightGCN learns highly clustered features and the density curves have steep rises and falls. Besides, it is easy to notice that the hot users and hot items have similar distributions and the cold users also cling to the hot items; only a small number of users are scattered among the cold items. Technically, this is a biased pattern that will lead the model to continually expose hot items to most users and generate run-of-the-mill recommendations. We think that two issues may cause this biased distribution. One is that in recommender systems a fraction of items often account for most interactions [34], and the other is the notorious over-smoothing problem [35], which makes embeddings become locally similar and hence aggravates the Matthew Effect. By contrast, in the second and third columns, the features learned by the SGL variants are more evenly distributed and the density estimation curves are less sharp, regardless of whether graph augmentations are applied. For reference, in the fourth column we plot the features learned by optimizing only the InfoNCE loss in SGL-ED. Without the effect of $\mathcal{L}_{rec}$, the features are almost subject to uniform distributions. The following inference provides a theoretical justification for this pattern. By rewriting Eq. (3), we can derive:

$\mathcal{L}_{cl} = \sum_{i\in\mathcal{B}} \left( -\mathbf{z}_i'^{\top}\mathbf{z}_i''/\tau + \log \sum_{j\in\mathcal{B}} \exp(\mathbf{z}_i'^{\top}\mathbf{z}_j''/\tau) \right)$.  (6)

Fig. 4: An illustration of the proposed random noise-based data augmentation.

3.1 Noise-Based Augmentation

Based on the findings above, we speculate that by adjusting the uniformity of the learned representations within a certain scope, contrastive recommendation models can reach a better performance. Since manipulating the graph structure for controllable uniformity is intractable and time-consuming, we shift our attention to the embedding space. Inspired by adversarial examples [36], which are constructed by adding imperceptible perturbations to images, we propose to directly add random noises to the representations for an efficient augmentation. Formally, given a node $i$ and its representation $\mathbf{e}_i$ in the $d$-dimensional embedding space, we can implement the following representation-level augmentation:
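The exact augmentation equations are given in the original paper. As a rough sketch consistent with Fig. 4 (noise of radius ε) and the signed-uniform-noise discussion in Section 4.4.2, the perturbation can be implemented along the following lines; the sign alignment and normalization used here are assumptions, not the authoritative formulation.

```python
import torch
import torch.nn.functional as F


def add_random_noise(e, eps):
    """Perturb an embedding with uniform noise of magnitude eps whose signs are
    aligned with the embedding, so the perturbed point stays in the same hyperoctant."""
    noise = F.normalize(torch.rand_like(e), dim=-1) * eps   # noise on a sphere of radius eps
    return e + noise * torch.sign(e)
```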
The input user/item embeddings are the only parameters to be learned. The ordinary graph encoding, which follows Eq. (4) to propagate node information, is used for recommendation. Meanwhile, in the other two encoders, SimGCL employs the proposed noise-based augmentation approach and adds different uniform random noises to the aggregated embeddings at each layer to obtain perturbed representations. This noise-involved representation learning can be formulated as:

$\mathbf{E}' = \frac{1}{L}\Big((\mathbf{A}\mathbf{E}^{(0)} + \Delta^{(1)}) + (\mathbf{A}(\mathbf{A}\mathbf{E}^{(0)} + \Delta^{(1)}) + \Delta^{(2)}) + \dots + (\mathbf{A}^{L}\mathbf{E}^{(0)} + \mathbf{A}^{L-1}\Delta^{(1)} + \dots + \mathbf{A}\Delta^{(L-1)} + \Delta^{(L)})\Big)$.  (10)

Fig. 5: Trends of uniformity with different ε. Lower values on the y-axis are better. We present the Recall@20 values of XSimGCL with different ε when it reaches convergence.
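A sketch of the noise-perturbed propagation in Eq. (10), under the same assumed noise form as above; `adj` is the normalized adjacency matrix and `emb0` the input embedding table.

```python
import torch
import torch.nn.functional as F


def perturbed_propagate(adj, emb0, num_layers, eps):
    """Eq. (10): add a fresh noise vector after every propagation layer and average
    the L perturbed layer outputs; E^(0) itself is excluded from the average."""
    layer_emb = emb0
    layers = []
    for _ in range(num_layers):
        layer_emb = torch.sparse.mm(adj, layer_emb) if adj.is_sparse else adj @ layer_emb
        noise = F.normalize(torch.rand_like(layer_emb), dim=-1) * eps   # radius-eps noise (assumed form)
        layer_emb = layer_emb + noise * torch.sign(layer_emb)
        layers.append(layer_emb)
    return torch.stack(layers).mean(dim=0), layers   # final representation and per-layer views
```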
Note that we skip the input embedding $\mathbf{E}^{(0)}$ in all three encoders when calculating the final representations, because we experimentally find that CL cannot consistently improve non-aggregation-based models and skipping it leads to a slight performance improvement. Finally, we substitute the learned representations into the joint loss presented in Eq. (1) and use Adam to optimize it.

3.2.2 XSimGCL - Simpler Than Simple

Compared with SGL, SimGCL is much simpler because the constant graph augmentation is no longer required. However, the cumbersome architecture of SimGCL makes it less than perfect. For each computation, it requires three forward/backward passes to obtain the loss and then back-propagate the error to update the input node embeddings. Though it seems a convention to separate the pipelines of the recommendation task and the contrastive task in CL-based recommender systems [12], [24], [27], [25], we question the necessity of this architecture.

As suggested by [38], there is a sweet spot when utilizing CL where the mutual information between correlated views is neither too high nor too low. However, in the architecture of SimGCL, given a pair of views of the same node, the mutual information between the two could always be very high, since the two corresponding embeddings both contain information from $L$ hops of neighbors. Due to the minor difference between them, contrasting them with each other may be less effective in learning general features. In fact, this is also a universal problem for many CL-based recommendation models under the paradigm in Fig. 1. It is natural to ask: what if we contrast different layer embeddings? They share some common information but differ in aggregated neighbors and added noises, which conforms to the sweet spot suggested by [38]. Besides, cross-layer contrast reuses the recommendation task's perturbed encoding processes. As a result, the new architecture has only a one-time forward/backward pass in a mini-batch computation. We name this new method XSimGCL, which is short for eXtremely Simple Graph Contrastive Learning, and illustrate it in Fig. 2. The perturbed representation learning of XSimGCL is the same as that of SimGCL. The joint loss of XSimGCL is formulated as:

$\mathcal{L} = \sum_{(u,i)\in\mathcal{B}} -\log \sigma(\mathbf{e}_u'^{\top}\mathbf{e}_i' - \mathbf{e}_u'^{\top}\mathbf{e}_j') + \lambda \sum_{i\in\mathcal{B}} -\log \frac{\exp(\mathbf{z}_i'^{\top}\mathbf{z}_i^{l^*}/\tau)}{\sum_{j\in\mathcal{B}}\exp(\mathbf{z}_i'^{\top}\mathbf{z}_j^{l^*}/\tau)}$,  (11)

where $l^*$ denotes the layer to be contrasted with the final layer. Contrasting two intermediate layers is optional, but the experiments in Section 4.3 show that involving the final layer leads to the optimal performance.
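A sketch of the joint loss in Eq. (11): the BPR term uses the perturbed final-layer representations, and the contrastive term contrasts the normalized final layer with layer l* from the same forward pass. Tensor names are illustrative, not from the released code.

```python
import torch
import torch.nn.functional as F


def xsimgcl_loss(user_f, pos_f, neg_f, z_final, z_lstar, lam=0.2, tau=0.2):
    """Eq. (11): BPR on the perturbed final-layer representations plus a cross-layer
    InfoNCE term that contrasts the final layer with layer l* of the same pass."""
    rec = -F.logsigmoid((user_f * pos_f).sum(-1) - (user_f * neg_f).sum(-1)).mean()
    z1 = F.normalize(z_final, dim=-1)
    z2 = F.normalize(z_lstar, dim=-1)
    pos = (z1 * z2).sum(-1) / tau
    logits = z1 @ z2.t() / tau
    cl = (torch.logsumexp(logits, dim=-1) - pos).mean()
    return rec + lam * cl
```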
3.2.3 Adjusting Uniformity Through Changing ε

In XSimGCL, we can explicitly control how far the augmented representations deviate from the original ones by changing the value of ε. Intuitively, a larger ε will result in a more uniform representation distribution. This is because, when optimizing the contrastive loss, the added noises are also propagated as part of the gradients. As the noises are sampled from a uniform distribution, the original representation is roughly regularized towards higher uniformity. We conduct the following experiment to demonstrate it.

According to [31], the logarithm of the average pairwise Gaussian potential (a.k.a. the Radial Basis Function (RBF) kernel) can well measure the uniformity of representations, which is defined as:

$\mathcal{L}_{uniform} = \log \mathbb{E}_{(u,v)\sim p_{data}}\, e^{-2\|\mathbf{z}_u - \mathbf{z}_v\|_2^2}$.
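A sketch of this uniformity measure, following the standard definition in [31] with the Gaussian potential's scale set to 2 (the exact constant used by the authors is an assumption here):

```python
import torch
import torch.nn.functional as F


def uniformity(z, t=2.0):
    """Log of the average pairwise Gaussian (RBF) potential between normalized
    representations; lower values indicate a more uniform distribution."""
    z = F.normalize(z, dim=-1)
    sq_dists = torch.pdist(z, p=2).pow(2)     # squared distances over all pairs
    return sq_dists.mul(-t).exp().mean().log()
```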
We then tune ε to observe how the uniformity changes. The uniformity is checked after every epoch until XSimGCL reaches convergence.

As clearly shown in Fig. 5, similar trends are observed on all the curves. At the beginning, all curves have highly evenly-distributed representations because we use Xavier initialization, which is a special uniform distribution, to initialize the input embeddings. As training proceeds, the uniformity declines due to the effect of $\mathcal{L}_{rec}$. After reaching the peak, the uniformity increases again until convergence. It is also obvious that with the increase of ε, XSimGCL tends to learn a more even representation distribution. Meanwhile, better performance is achieved, which evidences our claim that more uniformity, within a certain scope, brings better performance. Besides, we notice that there is a correlation between the convergence speed and the magnitude of the noises, which will be discussed in Section 4.3.

SGL-ED, by contrast, needs to constantly reconstruct graph augmentations. This part is usually performed on CPUs, which brings SGL-ED more expense of time in practice. By comparison, XSimGCL needs neither graph augmentations nor extra encoders. Without considering the computation for the contrastive task, XSimGCL is theoretically as lightweight as LightGCN and only spends one-third of SimGCL's training expense on graph encoding. When the actual number of epochs needed for training is considered, XSimGCL shows even more efficiency beyond what we can observe from this theoretical analysis.

TABLE 3: Dataset Statistics

Dataset            #User     #Item     #Feedback    Density
Yelp2018           31,668    38,048    1,561,406    0.13%
Amazon-Kindle      138,333   98,572    1,909,965    0.014%
Alibaba-iFashion   300,000   81,614    1,607,813    0.007%
[31]. An exception is that we let τ = 0.15 for XSimGCL on Yelp2018, which brings a slightly better performance. Note that although the paper of SGL [12] uses Yelp2018 and Alibaba-iFashion as well, we cannot reproduce its results on Alibaba-iFashion with the given hyperparameters under the same experimental setting. We therefore re-search the hyperparameters of SGL and choose to present our results on this dataset in Table 4.

TABLE 5: The best hyperparameters of compared methods.

Dataset    Yelp2018              Amazon-Kindle         Alibaba-iFashion
SGL        λ=0.1, ρ=0.1          λ=0.05, ρ=0.1         λ=0.05, ρ=0.2
SimGCL     λ=0.5, ε=0.1          λ=0.1, ε=0.1          λ=0.05, ε=0.1
XSimGCL    λ=0.2, ε=0.2, l*=1    λ=0.2, ε=0.1, l*=1    λ=0.05, ε=0.05, l*=3
4.2 SGL vs. XSimGCL: From A Comprehensive Perspective

In this part, we compare XSimGCL with SGL in a comprehensive way. The experiments focus on three important aspects: recommendation performance, training time, and the ability to promote long-tail items.

4.2.1 Performance Comparison

We first show the performances of SGL and XSimGCL/SimGCL with different numbers of layers. Their best hyperparameters are provided in Table 5 for easy reproduction of our results. The figures of the best performance are presented in bold and the runner-ups are underlined. The improvements are calculated relative to the performance of LightGCN. Note that we contrast the final layer with itself in the 1-layer XSimGCL. Based on Table 4, we have the following observations:

• In the vast majority of cases, the SGL variants, SimGCL and XSimGCL largely outperform LightGCN. The largest improvements are observed on Alibaba-iFashion, where the performance of XSimGCL surpasses that of LightGCN by more than 25% under the 1-layer and 3-layer settings.

• SGL-ED and SGL-RW have very close performances and outperform SGL-ND by large margins. In many cases, SGL-WA shows advantages over SGL-ND but still falls behind SGL-ED and SGL-RW. These results further corroborate that the InfoNCE loss is the primary factor which accounts for the performance gains, and meanwhile heuristic graph augmentations are not as effective as expected; some of them even degrade the performance.

• XSimGCL/SimGCL show the best/second best performance in almost all the cases, which demonstrates the effectiveness of the random noise-based data augmentation. Particularly, on the largest and sparsest dataset, Alibaba-iFashion, they significantly outperform the SGL variants. In addition, it is obvious that the evolution from SimGCL to XSimGCL is successful, bringing non-negligible performance gains.

To further demonstrate XSimGCL's outstanding performance, we also compare it with a few recent augmentation-based or CL-based recommendation models. The implementations of these methods are available in our GitHub repository SELFRec as well. According to Table 6, XSimGCL and SimGCL still lead by a large margin, achieving the best and the second best performance, respectively. NCL and MixGCF, which employ LightGCN as their backbone, also show their competitiveness. By contrast, DNN+SSL and BUIR are not as powerful as expected and are even not comparable to LightGCN. We attribute their failure to: (1) DNNs are proved effective when abundant user/item features are provided; in our datasets, features are unavailable and the self-supervision signals are created by masking
TABLE 6: Performance comparison with other models.

Method     Yelp2018          Kindle            iFashion
           Recall   NDCG     Recall   NDCG     Recall   NDCG
LightGCN   0.0639   0.0525   0.2057   0.1315   0.1053   0.0505
NCL        0.0670   0.0562   0.2090   0.1348   0.1088   0.0528
BUIR       0.0487   0.0404   0.0922   0.0528   0.0830   0.0384

4.2.2 Comparison of Training Efficiency

As has been claimed, XSimGCL is almost as lightweight as LightGCN in theory. In this part, we report the actual training time, which is more informative than the theoretical analysis. The reported figures are collected on a workstation

(Figure: batch time, number of epochs needed, and total time in seconds for LightGCN, SGL-WA, SGL-ED, SimGCL, and XSimGCL.)
XSimGCL involves three important hyperparameters: λ (the weight of the CL loss), ε (the magnitude of the added noises), and l* (the layer to be contrasted). In this part, we investigate the model's sensitivity to these hyperparameters.

4.3.1 Influence of λ and ε

We try different combinations of λ and ε, with the set [0.01, 0.05, 0.1, 0.2, 0.5, 1] for λ and [0, 0.01, 0.05, 0.1, 0.2, 0.5] for ε. We fix l*=1 and conduct experiments with a 2-layer setting, but we found that the best values of these two hyperparameters are also applicable to other settings. As shown in Fig. 8, on all the datasets XSimGCL reaches its best performance when ε is in [0.05, 0.1, 0.2]. Without the added noises (ε=0), we can see an obvious performance drop. When ε is too large (ε=0.5) or too small (ε=0.01), the performance declines as well. A similar trend is also observed when changing the value of λ. The performance is at its peak when λ=0.2 on Yelp2018, λ=0.2 on Amazon-Kindle, and λ=0.05 on Alibaba-iFashion. According to our experience, XSimGCL (SimGCL) is more sensitive to λ, and ε=0.1 is usually a good and safe choice on most datasets. Besides, we also find that a larger ε leads to faster convergence.

Fig. 8: The influence of λ and ε.

4.3.2 Layer Selection for Contrast

(Figure: Recall@20 and NDCG@20 (%) of XSimGCL when different pairs of layers, i.e., L1, L2, L3, the final layer, and a random layer, are contrasted on Yelp2018, Amazon-Kindle, and Alibaba-iFashion.)
4.4.2 XSimGCL with Other Types of Noises

In this experiment, we test three other types of noises: adversarial perturbations obtained by following FGSM [36] (denoted by XSimGCL_a), positive uniform noises without the sign of the learned embeddings (denoted by XSimGCL_p), and Gaussian noises (denoted by XSimGCL_g). We tried many combinations of λ and ε for the different types of noises and present the results in Table 8. As observed, the vanilla XSimGCL with signed uniform noises outperforms the other variants. Although positive uniform noises and Gaussian noises also bring hefty performance gains compared with LightGCN, adding adversarial noises unexpectedly leads to a large drop in performance. This indicates that only a few particular distributions can generate helpful noises. Besides, the result that XSimGCL outperforms XSimGCL_p demonstrates the necessity of the sign constraint.
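For illustration, minimal sketches of the noise types compared here; `grad` in the FGSM-style variant denotes the gradient of the loss with respect to the embedding, which is assumed to be computed elsewhere, and none of these snippets are the exact configurations used in the experiments.

```python
import torch
import torch.nn.functional as F


def signed_uniform_noise(e, eps):    # vanilla XSimGCL: uniform, radius eps, sign-aligned with e
    return F.normalize(torch.rand_like(e), dim=-1) * eps * torch.sign(e)


def positive_uniform_noise(e, eps):  # XSimGCL_p: drops the sign constraint
    return F.normalize(torch.rand_like(e), dim=-1) * eps


def gaussian_noise(e, eps):          # XSimGCL_g: Gaussian instead of uniform, rescaled to radius eps
    return F.normalize(torch.randn_like(e), dim=-1) * eps


def adversarial_noise(grad, eps):    # XSimGCL_a: FGSM-style step along the sign of the loss gradient
    return eps * torch.sign(grad)
```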
TABLE 8: Performance comparison of different XSimGCL variants.

Method      Yelp2018          Kindle            iFashion
            Recall   NDCG     Recall   NDCG     Recall   NDCG
LightGCN    0.0639   0.0525   0.2057   0.1315   0.1053   0.0505
XSimGCL_a   0.0558   0.0464   0.1267   0.0833   0.0158   0.0065
XSimGCL_p   0.0714   0.0596   0.2121   0.1398   0.1183   0.0577
XSimGCL_g   0.0722   0.0602   0.2140   0.1410   0.1190   0.0583
XSimGCL     0.0723   0.0604   0.2147   0.1415   0.1196   0.0586
5 RELATED WORK

5.1 GNNs-Based Recommendation Models

In recent years, graph neural networks (GNNs) [41], [42] have brought the regime of plain DNNs to an end and become a routine in recommender systems owing to their extraordinary ability to model user behavior data. A great number of recommendation models developed from GNNs have achieved unprecedented performance in different recommendation scenarios [40], [13], [28], [43], [44]. Among the numerous variants of GNNs, GCN [45] is the most prevalent one and drives many state-of-the-art graph neural recommendation models such as NGCF [46], LightGCN [28], LR-GCCF [47] and LCF [48]. Despite varying implementation details, all these GCN-based models share a common scheme, which is to aggregate information from the neighborhood in the user-item graph layer by layer [42]. Benefiting from its simple structure, LightGCN has become one of the most popular GCN-based recommendation models. It follows SGC [49] to remove the redundant operations in the vanilla GCN, including transformation matrices and nonlinear activation functions. This design is proved efficient and effective for recommendation where only the user-item interactions are provided. It also inspires a lot of CL-based recommendation models such as SGL [12], NCL [18] and SimGCL [24].

5.2 Contrastive Learning for Recommendation

Contrastive learning [1], [2] has recently drawn considerable attention in many fields due to its ability to deal with massive unlabeled data [6], [5], [4]. As CL usually works in a self-supervised manner [3], it is inherently a silver bullet for the data sparsity issue [50] in recommender systems. Inspired by the success of CL in other fields, the community has also started to integrate CL into recommendation [15], [12], [14], [17], [13], [51], [52], [16]. The fundamental idea behind existing contrastive recommendation methods is to regard every instance (e.g., a user or an item) as a class, and then pull views of the same instance closer and push views of different instances apart when learning representations, where the views are augmented by applying different transformations to the original data. In brief, the goal is to maximize the mutual information between views of the same instance. Through this process, the recommendation model is expected to learn the essential information of the user-item interactions.

To the best of our knowledge, S3-Rec [15] is the first work that combines CL with sequential recommendation. It first randomly masks part of the attributes and items to create sequence augmentations, and then pre-trains a Transformer [53] by encouraging the consistency between different augmentations. A similar idea is also found in the concurrent work CL4SRec [25], where more augmentation approaches, including item reordering and cropping, are used. Besides, S2-DHCN [14] and ICL [18] adopt advanced augmentation strategies by re-organizing/clustering the sequential data for more effective self-supervised signals. Qiu et al. proposed DuoRec [54], which adopts a model-level augmentation by conducting dropout on the encoder. Xia et al. [55] integrated CL into a self-supervised knowledge distillation framework to transfer more knowledge from a server-side large recommendation model to resource-constrained on-device models to enhance next-item recommendation. In the same period, CL was also introduced to different graph-based recommendation scenarios. S2-MHCN [13] and SMIN [56] integrate CL into social recommendation. HHGR [26] proposes a double-scale augmentation approach for group recommendation and develops a finer-grained contrastive objective for users and groups. CCDR [57] and CrossCBR [58] have explored the use of CL in cross-domain and bundle recommendation. Yao et al. [27] proposed a feature dropout-based two-tower architecture for large-scale item recommendation. NCL [18] designs a prototypical contrastive objective to capture the correlations between a user/item and its context. SEPT [17] and COTREC [59] further propose to mine multiple positive samples with semi-supervised learning on the perturbed graph for social/session-based recommendation. The most widely used model is SGL [12], which relies on edge/node dropout to augment the graph data. Although these methods have demonstrated their effectiveness, they pay little attention to why CL can enhance recommendation.

6 CONCLUSION

In this paper, we revisit graph CL in recommendation and investigate how it enhances graph recommendation models. The findings are surprising: the InfoNCE loss is the decisive factor that accounts for most of the performance gains, whilst the elaborate graph augmentations only play a secondary role. Optimizing the InfoNCE loss leads to a more even representation distribution, which helps to promote long-tail items in the scenario of recommendation. In light of this, we propose a simple yet
effective noise-based augmentation approach, which can smoothly adjust the uniformity of the representation distribution through CL. An extremely simple model, XSimGCL, is also put forward, which brings an ultralight architecture for CL-based recommendation. The extensive experiments on three large and highly sparse datasets demonstrate that XSimGCL is an ideal alternative to its graph augmentation-based counterparts.

REFERENCES

[1] A. Jaiswal, A. R. Babu, M. Z. Zadeh, D. Banerjee, and F. Makedon, “A survey on contrastive self-supervised learning,” Technologies, vol. 9, no. 1, p. 2, 2021.
[2] X. Liu, F. Zhang, Z. Hou, Z. Wang, L. Mian, J. Zhang, and J. Tang, “Self-supervised learning: Generative or contrastive,” arXiv preprint arXiv:2006.08218, vol. 1, no. 2, 2020.
[3] J. Yu, H. Yin, X. Xia, T. Chen, J. Li, and Z. Huang, “Self-supervised learning for recommender systems: A survey,” arXiv preprint arXiv:2006.07733, 2022.
[4] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen, “Graph contrastive learning with augmentations,” NeurIPS, vol. 33, 2020.
[5] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in ICML, 2020, pp. 1597–1607.
[6] T. Gao, X. Yao, and D. Chen, “Simcse: Simple contrastive learning of sentence embeddings,” in EMNLP, 2021, pp. 6894–6910.
[7] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in CVPR, 2020, pp. 9729–9738.
[8] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar et al., “Bootstrap your own latent: A new approach to self-supervised learning,” NeurIPS, 2020.
[9] M. Singh, “Scalability and sparsity issues in recommender datasets: a survey,” Knowledge and Information Systems, vol. 62, no. 1, pp. 1–43, 2020.
[10] T. T. Nguyen, M. Weidlich, D. C. Thang, H. Yin, and N. Q. V. Hung, “Retaining data from streams of social platforms with minimal regret,” in IJCAI, 2017, pp. 2850–2856.
[11] T. Chen, H. Yin, Q. V. H. Nguyen, W.-C. Peng, X. Li, and X. Zhou, “Sequence-aware factorization machines for temporal predictive analytics,” in ICDE. IEEE, 2020, pp. 1405–1416.
[12] J. Wu, X. Wang, F. Feng, X. He, L. Chen, J. Lian, and X. Xie, “Self-supervised graph learning for recommendation,” in SIGIR, 2021, pp. 726–735.
[13] J. Yu, H. Yin, J. Li, Q. Wang, N. Q. V. Hung, and X. Zhang, “Self-supervised multi-channel hypergraph convolutional network for social recommendation,” in WWW, 2021, pp. 413–424.
[14] X. Xia, H. Yin, J. Yu, Q. Wang, L. Cui, and X. Zhang, “Self-supervised hypergraph convolutional networks for session-based recommendation,” in AAAI, 2021, pp. 4503–4511.
[15] K. Zhou, H. Wang, W. X. Zhao, Y. Zhu, S. Wang, F. Zhang, Z. Wang, and J.-R. Wen, “S3-rec: Self-supervised learning for sequential recommendation with mutual information maximization,” in CIKM, 2020, pp. 1893–1902.
[16] C. Zhou, J. Ma, J. Zhang, J. Zhou, and H. Yang, “Contrastive learning for debiased candidate generation in large-scale recommender systems,” in KDD, 2021, pp. 3985–3995.
[17] J. Yu, H. Yin, M. Gao, X. Xia, X. Zhang, and N. Q. V. Hung, “Socially-aware self-supervised tri-training for recommendation,” in KDD. ACM, 2021, pp. 2084–2092.
[18] Z. Lin, C. Tian, Y. Hou, and W. X. Zhao, “Improving graph collaborative filtering with neighborhood-enriched contrastive learning,” in WWW, 2022, pp. 2320–2329.
[19] P. Bachman, R. D. Hjelm, and W. Buchwalter, “Learning representations by maximizing mutual information across views,” NeurIPS, pp. 15509–15519, 2019.
[20] X. Zhou, A. Sun, Y. Liu, J. Zhang, and C. Miao, “Selfcf: A simple framework for self-supervised collaborative filtering,” arXiv preprint arXiv:2107.03019, 2021.
[21] D. Lee, S. Kang, H. Ju, C. Park, and H. Yu, “Bootstrapping user and item representations for one-class collaborative filtering,” in SIGIR, 2021, pp. 1513–1522.
[22] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
[23] J. Chen, H. Dong, X. Wang, F. Feng, M. Wang, and X. He, “Bias and debias in recommender system: A survey and future directions,” arXiv preprint arXiv:2010.03240, 2020.
[24] J. Yu, H. Yin, X. Xia, T. Chen, L. Cui, and Q. V. H. Nguyen, “Are graph augmentations necessary? simple graph contrastive learning for recommendation,” in SIGIR, 2022, pp. 1294–1303.
[25] X. Xie, F. Sun, Z. Liu, S. Wu, J. Gao, J. Zhang, B. Ding, and B. Cui, “Contrastive learning for sequential recommendation,” in ICDE. IEEE, 2022, pp. 1259–1273.
[26] J. Zhang, M. Gao, J. Yu, L. Guo, J. Li, and H. Yin, “Double-scale self-supervised hypergraph learning for group recommendation,” in CIKM, 2021, pp. 2557–2567.
[27] T. Yao, X. Yi, D. Z. Cheng, F. Yu, T. Chen, A. Menon, L. Hong, E. H. Chi, S. Tjoa, J. Kang et al., “Self-supervised learning for large-scale item recommendations,” in CIKM, 2021, pp. 4321–4330.
[28] X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, and M. Wang, “Lightgcn: Simplifying and powering graph convolution network for recommendation,” in SIGIR. ACM, 2020, pp. 639–648.
[29] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, “Bpr: Bayesian personalized ranking from implicit feedback,” in UAI. AUAI Press, 2009, pp. 452–461.
[30] R. He and J. McAuley, “Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering,” in WWW, 2016, pp. 507–517.
[31] T. Wang and P. Isola, “Understanding contrastive representation learning through alignment and uniformity on the hypersphere,” in ICML, 2020, pp. 9929–9939.
[32] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, no. 11, 2008.
[33] Z. I. Botev, J. F. Grotowski, and D. P. Kroese, “Kernel density estimation via diffusion,” The Annals of Statistics, vol. 38, no. 5, pp. 2916–2957, 2010.
[34] H. Yin, B. Cui, J. Li, J. Yao, and C. Chen, “Challenging the long tail recommendation,” Proc. VLDB Endow., vol. 5, no. 9, pp. 896–907, 2012.
[35] D. Chen, Y. Lin, W. Li, P. Li, J. Zhou, and X. Sun, “Measuring and relieving the over-smoothing problem for graph neural networks from the topological view,” in AAAI, vol. 34, no. 04, 2020, pp. 3438–3445.
[36] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in ICLR, 2015.
[37] X. Zhang, F. X. Yu, S. Kumar, and S.-F. Chang, “Learning spread-out local feature descriptors,” in CVPR, 2017, pp. 4595–4603.
[38] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola, “What makes for good views for contrastive learning?” NeurIPS, vol. 33, pp. 6827–6839, 2020.
[39] T. Huang, Y. Dong, M. Ding, Z. Yang, W. Feng, X. Wang, and J. Tang, “Mixgcf: An improved training method for graph neural network-based recommender systems,” pp. 665–674, 2021.
[40] X. Wang, H. Jin, A. Zhang, X. He, T. Xu, and T.-S. Chua, “Disentangled graph collaborative filtering,” in SIGIR, 2020, pp. 1001–1010.
[41] C. Gao, X. Wang, X. He, and Y. Li, “Graph neural networks for recommender system,” in WSDM, 2022, pp. 1623–1625.
[42] S. Wu, F. Sun, W. Zhang, X. Xie, and B. Cui, “Graph neural networks in recommender systems: a survey,” CSUR, 2020.
[43] J. Yu, H. Yin, J. Li, M. Gao, Z. Huang, and L. Cui, “Enhance social recommendation with adversarial graph convolutional networks,” IEEE Transactions on Knowledge and Data Engineering, 2020.
[44] S. Wu, Y. Tang, Y. Zhu, L. Wang, X. Xie, and T. Tan, “Session-based recommendation with graph neural networks,” in AAAI, vol. 33, no. 01, 2019, pp. 346–353.
[45] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in ICLR, 2017.
[46] X. Wang, X. He, M. Wang, F. Feng, and T.-S. Chua, “Neural graph collaborative filtering,” in SIGIR, 2019, pp. 165–174.
[47] L. Chen, L. Wu, R. Hong, K. Zhang, and M. Wang, “Revisiting graph based collaborative filtering: A linear residual graph convolutional network approach,” in AAAI, vol. 34, no. 01, 2020, pp. 27–34.
[48] W. Yu and Z. Qin, “Graph convolutional network for recommendation with low-pass collaborative filters,” in ICML, 2020, pp. 10936–10945.
[49] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger, “Simplifying graph convolutional networks,” in ICML, 2019, pp. 6861–6871.
[50] J. Yu, M. Gao, J. Li, H. Yin, and H. Liu, “Adaptive implicit friends identification over heterogeneous network for social recommendation,” in CIKM. ACM, 2018, pp. 357–366.
[51] J. Ma, C. Zhou, H. Yang, P. Cui, X. Wang, and W. Zhu, “Disentangled self-supervision in sequential recommenders,” in KDD, 2020, pp. 483–491.
[52] Y. Wei, X. Wang, Q. Li, L. Nie, Y. Li, X. Li, and T.-S. Chua, “Contrastive learning for cold-start recommendation,” in ACM Multimedia, 2021, pp. 5382–5390.
[53] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” NeurIPS, vol. 30, 2017.
[54] R. Qiu, Z. Huang, H. Yin, and Z. Wang, “Contrastive learning for representation degeneration problem in sequential recommendation,” in WSDM, 2022, pp. 813–823.
[55] X. Xia, H. Yin, J. Yu, Q. Wang, G. Xu, and Q. V. H. Nguyen, “On-device next-item recommendation with self-supervised knowledge distillation,” in SIGIR, 2022, pp. 546–555.
[56] X. Long, C. Huang, Y. Xu, H. Xu, P. Dai, L. Xia, and L. Bo, “Social recommendation with self-supervised metagraph informax network,” in CIKM, 2021, pp. 1160–1169.
[57] R. Xie, Q. Liu, L. Wang, S. Liu, B. Zhang, and L. Lin, “Contrastive cross-domain recommendation in matching,” in KDD, 2022, pp. 4226–4236.
[58] Y. Ma, Y. He, A. Zhang, X. Wang, and T.-S. Chua, “Crosscbr: Cross-view contrastive learning for bundle recommendation,” in KDD, 2022, pp. 1233–1241.
[59] X. Xia, H. Yin, J. Yu, Y. Shao, and L. Cui, “Self-supervised graph co-training for session-based recommendation,” in CIKM, 2021, pp. 2180–2190.

Tong Chen received his PhD degree in computer science from The University of Queensland in 2020. He is currently a Lecturer with the Data Science research group, School of Information Technology and Electrical Engineering, The University of Queensland. His research interests include data mining, recommender systems, user behavior modelling and predictive analytics.

Lizhen Cui is a full professor with Shandong University. He is appointed dean and deputy party secretary for the School of Software, co-director of the Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), and director of the Research Center of Software and Data Engineering, Shandong University. His main interests include big data intelligence theory, data mining, wisdom science, and medical health big data AI applications.