Disentangled Representation With Cross Experts Covariance Loss For Multi-Domain Recommendation
Zhutian Lin, Junwei Pan, Haibin Yu, Xi Xiao, Ximei Wang, Zhixiang Feng, Shifeng Wen, Shudong Huang, Lei Xiao, Jie Jiang
ABSTRACT

Multi-domain learning (MDL) has emerged as a prominent research direction, capturing commonalities across domains while preserving the distinct characteristics of each domain. This goal, however, leads to a dilemma. On one hand, a model needs to leverage domain-specific modules, such as experts or embeddings, to preserve the uniqueness of each domain. On the other hand, due to the long-tailed distributions observed in real-world domains, some tail domains may lack sufficient samples to fully learn their corresponding modules. Unfortunately, existing approaches have not adequately addressed this dilemma. To address this issue, we propose a novel model called Crocodile, which stands for Cross-experts Covariance Loss for Disentangled Learning. Crocodile adopts a multi-embedding paradigm to facilitate model learning and employs a Covariance Loss on these embeddings to disentangle them. This disentanglement enables the model to capture diverse user interests across domains effectively. Additionally, we introduce a novel gating mechanism to further enhance the capabilities of Crocodile. Through empirical analysis, we demonstrate that our proposed method successfully resolves these two challenges and outperforms all state-of-the-art methods on publicly available datasets. We firmly believe that the analytical perspectives and design concept of disentanglement presented in our work can pave the way for future research in the field of MDL.

KEYWORDS

Multi-domain Learning; Disentangled Representation Learning; Dimensional collapse

Figure 1: The proposed Crocodile successfully resolves the dilemma of interest diversity v.s. collapse prevention and achieves the best performance regarding gAUC. (a) Dilemma of interest diversity (Diversity Index) v.s. collapse prevention (Information Abundance); (b) gAUC in Kuairand1k and AliCCP.

1 INTRODUCTION

Online recommendation systems [2] are now essential tools for businesses to effectively reach their target audiences, servicing billions of internet users daily. Empowered by advanced algorithms and vast traffic, these systems play a crucial role in modeling user experiences and generating revenue for advertisers. However, traditional methods still face substantial challenges due to the dynamic nature of online platforms, which are marked by diverse user preferences across domains. To address these challenges, researchers have increasingly turned to multi-domain learning (MDL) [5, 18, 31, 39, 43] as a promising approach to enhance performance by involving information from multiple related domains.

Unfortunately, MDL for recommendation often confronts the challenge of preference conflict, where a user might exhibit divergent behaviors across domains. To remedy this, the MDL community develops domain-specific components; STAR [31], HiNet [43], and PEPNet [5] have progressively integrated such components.

Interestingly, we observe a trend in MDL methods towards the development of bottom-level domain-specific modules. In this work, we uncover the mechanisms behind this trend. To the best of our knowledge, this is the first study to reveal the mechanism of increasing interest diversity to resolve preference conflict. Furthermore, we propose a novel metric, the Diversity Index (DI), to quantify interest diversity, as shown in Fig. 1(a).

Based on our discovery, we notice that domain-specific embedding [12, 33] incurs the highest diversity and hence should be able to
CIKM’ 24, October 21-25, 2024, Boise, Idaho, USA Lin Z. and Pan J., et al.
resolve preference conflict. However, naively deploying domain-specific embeddings does not turn out to be fruitful: we observe negative transfer in domains with fewer data. Delving further, we find that the core issue is sub-optimal embeddings for the domains with less data, i.e., they are dimensionally collapsed.¹

¹ We adopt the notion of dimensional collapse from the work of Chi et al. [8] and utilize information abundance (IA) [13] to quantify dimensional collapse.

As shown in Fig. 1(a), our analysis demonstrates that existing works struggle to maintain these two properties, maintaining diversity and avoiding dimensional collapse, thus resulting in the sub-optimal performance shown in Fig. 1(b). As a result, it remains an open question whether we can incorporate both properties.

To this end, we propose a novel Cross-experts Covariance Loss for Disentangled Learning (Crocodile) framework, in which we build a Covariance Loss (CovLoss) to alleviate embedding collapse while enabling the model to learn diverse user interests across various domains. To further unleash the power of Crocodile, we develop a novel Prior Informed Element-wise Gating (PEG) mechanism to work with CovLoss.

Empirical analysis on two public datasets confirms that Crocodile achieves state-of-the-art results. Further analysis also demonstrates that Crocodile can simultaneously mitigate embedding collapse and learn diverse representations. Ablation experiments reveal that each proposed component contributes positively to the overall performance. Our contributions can be briefly summarized as follows:

• We discover the mechanism of previous works in maintaining diversity, especially the trend of focusing more on bottom-level modules. We propose a novel metric, the Diversity Index (DI), to quantify the diversity and conclude that the trend of previous works is optimizing to increase DI.
• We reveal that existing works fall into the dilemma of maintaining diversity and avoiding collapse, hence resulting in inferior performance.
• We propose the Crocodile approach, which successfully balances both properties. Comprehensive analysis and empirical evidence on two public datasets demonstrate that Crocodile achieves state-of-the-art performance and resolves the aforementioned dilemma.

2 RELATED WORK

2.1 Multi-domain Recommendation

Multi-domain learning (MDL) is a promising topic that has garnered attention from industrial recommender systems. Many MDL methods have made great efforts to model complex multi-domain interests. Among them, MoE-based methods are widely applied in industrial recommendation systems, such as Taobao [31], Meituan [5] and Kuaishou [43].

STAR [31] utilizes the dot product of domain-specific and domain-shared weight matrices to capture domain-specific interests. HiNet [43] proposes two expert-like components to capture cross-domain correlations and within-domain information explicitly. PEPNet [5] introduces PPNet and EPNet, which incorporate domain information through gating. These methods often innovate from the perspective of making more bottom-level modules domain-personalized within the model.

Recently, STEM [33] proposed a multi-embedding (ME) structure in multi-task learning (MTL) and stepped forward in allocating a more bottom-level domain-personalized structure, i.e., the embedding. It achieved better results compared to both single- and multi-embedding versions of MMoE [22] and PLE [34] in MTL experiments. However, STEM fails in multi-domain recommendation. We will discuss the reasons in detail in the subsequent sections.

Besides, some other paradigms are applied for multi-domain recommendation. AdaSparse [39] and ADL [18] applied adaptive learning mechanisms. EDDA [26], COAST [41] and UniCDR [3] provided graph learning-based solutions.

2.2 Disentangled Representation Learning

In the realm of disentangled representation learning (DRL), methods based on spatial or statistical relationships have been proposed.

On the one hand, methods based on spatial relationships achieve disentangling by manipulating the distances or angles between representations. OLE [17], dot product [38] and cosine similarity [20] push representations towards orthogonality. GAN-based models [6, 15, 40] promote the discrimination among representations. However, they are computationally expensive and not conducive to online recommendation.

On the other hand, statistical methods view representations as instances drawn from high-dimensional distributions and design losses based on statistical measures. Distance correlation (dCorr) [19] encourages lower dCorr between different factors. ETL [7] models the joint distribution of user behaviors across domains with an equivalent transformation. VAE-based methods [21, 23, 37] encourage higher KL divergence between dimensions. However, they often assume a normal prior distribution, which is challenging to satisfy in the context of sparse interactions [29], especially under the sample volume imbalance of multi-domain recommendation.

Besides, some methods integrate both aspects mentioned above. VICReg [1, 32] employs covariance and variance as loss components within representations to promote independence across dimensions and prevent collapse, while utilizing the Euclidean distance between representations to ensure semantic similarity among representations of different patches of an image.

2.3 Dimensional Collapse

Dimensional collapse [8] is defined as representations ending up spanning a lower-dimensional space, which is evident through the pronounced imbalance in the distribution of singular values. Chi et al. [8] found that the SMoE structure [10, 16] may lead to dimensional collapse due to the influence of gating. The work of [13] claimed that this problem is also observed in popular models (e.g., DCN V2 [36] and FinalMLP [25]). To measure the collapse, they proposed the Information Abundance (IA) [13] metric, defined as the ratio between the sum and the maximum l1-norm value of the target singular values. A small IA indicates severe insignificance of tail dimensions and potential dimensional collapse.

Beyond a single perspective, some methods [14] built the bridge between representation disentanglement and dimensional collapse prevention. It is also claimed that feature decorrelation helps alleviate dimensional collapse by reducing strong correlations between axes and promoting more diverse and informative representations.
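To make the later contrast with our cross-expert loss concrete, the within-representation covariance term of VICReg [1] mentioned in Sec. 2.2 can be sketched as below. This is our paraphrase of the published formulation, not code from this paper: it decorrelates the dimensions inside a single representation rather than across experts.

```python
import torch

def vicreg_covariance_penalty(z: torch.Tensor) -> torch.Tensor:
    """VICReg-style covariance term: penalize the off-diagonal entries of the
    covariance matrix of one representation z with shape [N, d]."""
    n, d = z.shape
    z = z - z.mean(dim=0, keepdim=True)
    cov = (z.T @ z) / (n - 1)                     # [d, d] covariance matrix
    off_diag = cov - torch.diag(torch.diag(cov))  # zero out the diagonal
    return off_diag.pow(2).sum() / d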
3 ON THE CHALLENGES OF MULTI-DOMAIN RECOMMENDATION

As we have discussed in Sec. 1, existing works fall into the dilemma of avoiding dimensional collapse and maintaining interest diversity. In this section, we discuss these two properties thoroughly.

3.1 Preventing Dimensional Collapse

According to Fig. 1(a), domain-specific embeddings demonstrate their effectiveness in capturing diverse user interests. However, the naive adoption of domain-specific embeddings does not yield the expected outcomes. In fact, the methods with domain-specific embeddings, i.e., ME-PLE [33, 34] and STEM [33], both deteriorate in the domain with less data. For example, in domain S6 of the Kuairand1k [11] dataset, ME-PLE and STEM show -0.4% and -0.1% lower AUC than the single-domain method. This phenomenon has sparked our deep investigation into its underlying mechanism.

Intuitively, the domain-specific embeddings are trained solely on data from the corresponding domain, which results in insufficient samples for domains with less data. Insufficient training data fails to provide adequate variance for high-dimensional representations, resulting in high-dimensional representations residing within a lower-dimensional subspace, called dimensional collapse [8, 13, 14].

To measure the collapse, we adopt the IA metric [13], defined as $IA(E) = \|\sigma\|_1 / \|\sigma\|_\infty$, where $E$ is the embedding of a feature (e.g., item id), $\sigma$ is its vector of singular values, and $\|\cdot\|_1, \|\cdot\|_\infty$ represent the $l_1$-norm and $\infty$-norm, respectively.

Remark 1. Intuitively, the singular value $\sigma_i$ represents the significance of the $i$-th singular dimension, and IA captures the balance of the singular value distribution. Thus, a small IA indicates severe insignificance of tail dimensions and potential dimensional collapse.

Figure 2: Information Abundance ($log(IA)$) dynamics of STEM's bottom layer of experts and item embeddings on Kuairand1k. High and Low denote the items with the highest and lowest frequencies, respectively. (a) $log(IA)$ of Experts; (b) $log(IA)$ of Item Embeddings.

Taking STEM as the example of methods applying domain-specific embeddings, we analyzed its training IA dynamics of experts and embeddings in Fig. 2. In Kuairand1k, S1 and S6 are the domains with the most and least data. In Fig. 2(a), the S6-specific expert exhibits a lower IA compared to S1, indicating that domains with a smaller data volume are more susceptible to dimensional collapse at the expert level. In Fig. 2(b), we further categorized items into five groups based on their exposure frequency and observed the collapse of item ID embeddings for both the highest- and lowest-frequency groups. We discovered that domain S6 exhibits a lower IA than S1, indicating a more severe collapse, regardless of the group. Within each domain, the low-exposure group consistently shows a lower IA than the high-exposure group. This further corroborates the notion that a lower exposure count (i.e., data volume) is more prone to dimensional collapse.

Both analytical and empirical evidence suggest that inadequate data in a domain can lead to dimensional collapse of the domain-specific embedding, hence leading to inferior performance.

3.2 Interest Diversity

Intuitively, the ability to capture various user interests across domains can be directly characterized by the diversity of representations. When facing conflicting preferences across domains, a set of diverse representations is able to encode both prefer/non-prefer semantics and thus provide distinct vectors to the towers of those conflicting domains.

To this end, we propose a novel metric, the Diversity Index (DI), to quantify the representations' ability to capture preference conflict across domains. Specifically, DI depicts the ratio of samples that contain both large- and small-norm representations within the whole set of conflict samples, which is formulated as²:

$$DI_{AF}(i,j) = \frac{\sum_{k \in \mathcal{D}_{con(i,j)}} \mathbb{I}\big[\exists\, p,q \in M:\ (\|O_k^{(p)}\|_2 \ge \tau_t) \cap (\|O_k^{(q)}\|_2 \le \tau_b)\big]}{|\mathcal{D}_{con(i,j)}|}, \quad (1)$$

$$DI_{SF}(i,j) = \frac{\sum_{k \in \mathcal{D}_{con(i,j)}} \mathbb{I}\big[(\|O_k^{(i)}\|_2 \ge \tau_t) \cap (\|O_k^{(j)}\|_2 \le \tau_b)\big]}{|\mathcal{D}_{con(i,j)}|}, \quad (2)$$

where $\mathcal{D}_{con(i,j)}$ contains the preference conflict samples between domains $i$ and $j$, $\mathbb{I}[\text{condition}]$ is the indicator function, which is assigned to 1 if the condition is true and 0 otherwise, $M$ represents the set of experts, $O_k^{(p)}$ is the output of the $p$-th expert for the $k$-th sample, and $\tau_t$ and $\tau_b$ are the thresholds for large and small $l_2$-norms.

² Here AF and SF are abbreviations for all forward and specific forward respectively, aligned with the concepts in [33].

Remark 2. As shown in Fig. 1(a), ME-MMoE [12] exhibits significantly lower diversity compared to STEM [33], and its improvement in diversity is relatively limited when compared to MMoE [22]. This suggests that the representations learned by ME-MMoE [12] lack sufficient ability to solve preference conflict challenges.
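For concreteness, the two analysis quantities used in this section can be sketched in a few lines of PyTorch. The function names, tensor shapes, and the use of max/min norms to check the "∃ p, q" condition of Eq. (1) (which assumes τ_t > τ_b) are illustrative choices of ours, not the authors' implementation.

```python
import torch

def information_abundance(emb: torch.Tensor) -> float:
    """IA(E) = ||sigma||_1 / ||sigma||_inf over the singular values of the embedding matrix E."""
    sigma = torch.linalg.svdvals(emb)           # singular values of E
    return (sigma.sum() / sigma.max()).item()   # small IA -> tail dimensions carry little information

def diversity_index_af(expert_outputs: torch.Tensor,
                       conflict_mask: torch.Tensor,
                       tau_t: float, tau_b: float) -> float:
    """DI_AF of Eq. (1): fraction of conflict samples whose expert-output l2-norms
    contain both a 'large' expert (>= tau_t) and a 'small' expert (<= tau_b).
    expert_outputs: [N, M, d]; conflict_mask: boolean [N] marking D_con(i, j)."""
    norms = expert_outputs.norm(dim=-1)[conflict_mask]   # [N_con, M]
    hit = (norms.max(dim=1).values >= tau_t) & (norms.min(dim=1).values <= tau_b)
    return hit.float().mean().item()
```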
CIKM’ 24, October 21-25, 2024, Boise, Idaho, USA Lin Z. and Pan J., et al.
Figure 3: Crocodile Architecture, which consists of a Multi-Embedding (ME) layer, a Cross-expert Covariance Loss (CovLoss), and a Prior
informed Element-wise Gating (PEG) mechanism.
4 METHODOLOGY

We propose a multi-embedding architecture with Cross-experts Covariance Loss for Disentangled Learning (Crocodile) to resolve the dilemma, containing the Covariance Loss (CovLoss) and Prior informed Element-wise Gating (PEG), as shown in Fig. 3.

4.1 Multi-Embedding Layer

The raw input features $x = \{x_1, x_2, \cdots, x_F\}$ contain $F$ fields. We employ a multi-embedding (ME) architecture [33], which utilizes $M$ sets of embedding lookup tables. Each embedding looked up from each respective table is fed into the corresponding expert. Thus, in the ME architecture, there are $M$ sets of experts. For a given $p$-th embedding table, the embedding looked up for the $k$-th sample and $f$-th field is represented as $e_{k,f}^{(p)}$. Thus, the $p$-th embedding corresponding to the $k$-th sample is:

$$e_k^{(p)} = [e_{k,0}^{(p)}, e_{k,1}^{(p)}, \cdots, e_{k,F}^{(p)}]. \quad (3)$$

4.2 Experts and Towers

Methods based on the Mixture of Experts (MoE) have been widely applied for multi-task [22, 33, 34] or multi-domain [5, 43] recommendation. Thus, we follow the same architecture and employ several sets of Multi-Layer Perceptrons (MLPs) as the experts and towers. Besides, we utilize a gating mechanism to compute the weights for the representation output from each expert. The output of the $p$-th expert can be represented as:

$$O_k^{(p)} = MLP^{(p)}(e_k^{(p)}), \quad (4)$$

where $O_k^{(p)} \in \mathbb{R}^{1 \times d}$ and ReLU serves as the inter-layer activation function. Weighted by the gating, the outputs $O_k^{(\cdot)}$ of the experts are aggregated into a vector $t_k^{(s)} \in \mathbb{R}^{1 \times d}$, the input of the $s$-th domain's tower. The details of the gating will be introduced in Section 4.4.

Finally, each domain-specific tower outputs the final predicted value for the corresponding domain as:

$$\hat{y}_k^{(s)} = \sigma(MLP^{(s)}(t_k^{(s)})), \quad (5)$$

where $\hat{y}_k^{(s)}$ is the predicted value of the $k$-th sample in the $s$-th domain and $\sigma(\cdot)$ is the sigmoid function.

4.3 Cross-expert Covariance Loss

Many pioneering works [1, 9, 32] have highlighted the efficiency of tailoring the covariance into objective functions. However, they aimed to decorrelate the dimensions within each representation rather than disentangle the representations of all experts from one another. To learn diverse user interests at the expert level, we propose the cross-expert covariance loss (CovLoss) to explicitly disentangle representations among experts, defined as:

$$\mathcal{L}_{Cov} = \frac{1}{d^2} \sum_{p,q \in M \times M,\ p>q} \Big\| \big[O^{(p)} - \bar{O}^{(p)}\big]^{T} \big[O^{(q)} - \bar{O}^{(q)}\big] \Big\|_1, \quad (6)$$

where $O^{(p)}, O^{(q)} \in \mathbb{R}^{N \times d}$ are the outputs of the $p$-th and $q$-th experts, $\bar{O}^{(p)} \in \mathbb{R}^{1 \times d}$ is the average value of each dimension across all samples, and $\|\cdot\|_1$ is the $l_1$-norm.

Finally, the overall loss function is defined as:

$$\mathcal{L} = \sum_{s=1}^{S} \Big[ \frac{1}{N_s} \sum_{k=1}^{N_s} \mathcal{L}_{BCE}\big(\hat{y}_k^{(s)}, y_k^{(s)}\big) \Big] + \alpha \mathcal{L}_{Cov}, \quad (7)$$

where $N_s$ is the number of the $s$-th domain's samples, $y_k^{(s)}$ is the label of the $k$-th sample in domain $s$, and $\alpha$ is the weight of CovLoss.

4.4 Prior Informed Element-wise Gating Mechanism

CovLoss aims to disentangle different representations by minimizing the covariance between dimensions among different experts. However, the existing gating structures hinder the effectiveness of CovLoss. The gating mechanism of MMoE [22] and PLE [34] assigns an identical weight to all dimensions within each expert, so the dimensions within each expert are mutually constrained. PEPNet [5]'s proposed Gate Neural Unit operates at the dimension level as well; however, it is designed to scale each dimension separately within the representation, not to disentangle these representations.

Consequently, we propose an Element-wise Gating mechanism (EG), which implements weight control over corresponding dimensions between experts, as shown in Fig. 3:

$$t_k^{s} = g_k^{s} \odot O_k, \quad (8)$$

where $O_k \in \mathbb{R}^{K \times d}$ is the concatenation across all experts with respect to the $k$-th sample, and $\odot$ is the element-wise multiplication.
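A minimal PyTorch sketch of the two Crocodile-specific pieces defined above, CovLoss (Eq. 6) and the element-wise gate (Eq. 8), is given below. The tensor layout, the explicit loop over expert pairs, and the final sum over experts (implied by $t_k^{(s)} \in \mathbb{R}^{1 \times d}$) are our own assumptions rather than the exact implementation.

```python
import torch

def cov_loss(expert_outputs: torch.Tensor) -> torch.Tensor:
    """CovLoss of Eq. (6). expert_outputs: [M, N, d] (M experts, N batch samples, d dims)."""
    M, N, d = expert_outputs.shape
    centered = expert_outputs - expert_outputs.mean(dim=1, keepdim=True)  # O^(p) minus per-dimension mean
    loss = expert_outputs.new_zeros(())
    for p in range(M):
        for q in range(p):                             # all pairs with p > q
            cross = centered[p].T @ centered[q]        # [d, d] cross-covariance-like matrix
            loss = loss + cross.abs().sum() / (d * d)  # l1-norm scaled by 1/d^2
    return loss

def element_wise_gate(expert_outputs: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
    """Element-wise gating of Eq. (8). Both tensors: [M, N, d]; the gate weights every
    dimension of every expert separately (e.g., a softmax over the expert axis)."""
    return (gate * expert_outputs).sum(dim=0)          # [N, d] tower input t^(s)
```

In practice the pairwise loop can be vectorized, and, as discussed later in Sec. 5.6, the covariance can be estimated on a small subsample of the batch without a measurable change in the result.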
Moreover, we are also concerned about the parameter burden of the ME structure. In the original ME-MMoE structure, the number of embeddings is tied to the number of domains. This is because the gating is performed using $g_k^s = Softmax(e_k^s W_g^s)$, which means the embedding $e_k^s$ is still tied to the $s$-th gating. Thus, when it is deployed in situations with a large number of domains, the scale of the embedding tables grows accordingly. PEG instead computes the gating weights from a prior embedding shared across the embedding sets:

$$g_k^{s} = Softmax(r_k W_g^{s}), \quad (10)$$

where $r_k \in \mathbb{R}^{1 \times l}$ is the prior embedding of the $k$-th sample, $W_g^{s} \in \mathbb{R}^{l \times K \times d}$, and $Softmax(\cdot)$ is operated across the second dimension.

Furthermore, PEG can be understood as an attention [35]-like mechanism, comprising query, key, and value components corresponding to the prior embedding $r_k$, the gating weight matrix $W_g^s$ of the $s$-th domain, and the expert outputs $O_k$, respectively. Its significance lies in weighting the learnt user interests through the interaction of user, item, and domain information provided by the query and key.

4.5 Understanding the CovLoss

CovLoss decorrelates the representations of different experts in an element-wise fashion. Beyond the statistical understanding, CovLoss also directly addresses the core objective of cross-domain modeling: enhancing the diversity of representations.

For simplification, we consider a model involving only two experts, outputting x and y drawn from two high-dimensional distributions X and Y, respectively; the $i$-th elements of x and y are $x_i$ and $y_i$, respectively. Following [32], we assume the randomness comes from the input of the network.

To maximize the diversity of the outputs from both experts, the objective for each pair of samples can be expressed as the expectation of the squared difference of $l_2$-norms:

$$
\begin{aligned}
\mathbb{E}\big[(\|x\|_2 - \|y\|_2)^2\big] &= \mathbb{E}\big[\|x\|_2^2 + \|y\|_2^2 - 2\|x\|_2\|y\|_2\big] \\
&= \sum_{i=1}^{d}\big[\mu_{x_i}^2 + \sigma_{x_i}^2\big] + \sum_{j=1}^{d}\big[\mu_{y_j}^2 + \sigma_{y_j}^2\big] - 2\,\mathbb{E}\bigg[\sqrt{\sum_{i=1}^{d}\sum_{j=1}^{d}(x_i y_j)^2}\bigg] \\
&\ge -2\,\mathbb{E}\bigg[\sqrt{\sum_{i=1}^{d}\sum_{j=1}^{d}(x_i y_j)^2}\bigg] \\
&\ge -2\,\mathbb{E}\bigg[\sum_{i=1}^{d}\sum_{j=1}^{d} x_i y_j\bigg] = -2\sum_{i=1}^{d}\sum_{j=1}^{d}\mathbb{E}[x_i y_j] \\
&= -2\sum_{i=1}^{d}\sum_{j=1}^{d}\big[\mu_{x_i}\mu_{y_j} + cov(x_i, y_j)\big],
\end{aligned} \quad (11)
$$

where $x_i, y_j \ge 0$ because the outputs from the experts have been activated by ReLU. According to the above equation, minimizing $cov(x_i, y_j)$ maximizes the lower bound of the diversity objective.³

³ Although $\mu_{x_i}\mu_{y_j} + cov(x_i, y_j)$ simultaneously incorporates both the first and second moments, they are nearly independent of each other in optimization.

5 EXPERIMENTS

We employ two metrics to measure model performance: AUC and gAUC. AUC measures the overall ranking performance, and gAUC evaluates the quality of intra-user item ranking.

5.1 Datasets

In this study, to examine the performance of the model in realistic recommendation, we carefully selected the Kuairand1k [11] and AliCCP [24] datasets for our experiments. For Kuairand1k, we chose the five domains with the top-5 data volumes. For AliCCP, we selected all three domains. Tab. 1 shows the statistics for the two datasets after low-frequency filtering: we removed features with fewer than 10 exposures and replaced them with default values.

5.2 Baselines

To establish a performance benchmark for comparison, we select comparative methods from both MTL and MDL for multi-domain recommendation. We apply the SharedBottom [4] method in single/multi-domain contexts and subsequently introduce popular MTL methods such as MMoE [22], PLE [34] (both Single-Embedding and Multi-Embedding [33] versions), and STEM [33]. Besides, we choose representative MDL methods, including STAR [31], PEPNet [5], HiNet [43], and AdaSparse [39].

5.3 Performance Evaluation (RQ1 and RQ2)

We compare Crocodile with state-of-the-art recommenders on the Kuairand1k [11] and AliCCP [24] datasets. All experiments are repeated 6 times and averaged results are reported, as shown in Tab. 2 and Tab. 3, respectively.

5.3.1 Overall Performance. Our proposed Crocodile obtains the best overall AUC and gAUC on both datasets. In the Kuairand1k dataset, Crocodile outperforms the second-best method significantly by 0.09% and 0.19% in terms of overall AUC and gAUC, respectively. In the AliCCP dataset, Crocodile surpasses the second-best method significantly by 0.04% in AUC and 0.07% in gAUC.

Crocodile achieves larger AUC and gAUC for almost all domains in the two datasets compared to the single-domain method, with p-value<0.05. Such positive transfer can be attributed to Crocodile's better usage of multi-domain information and multi-embedding to enhance performance in each domain, without raising dimensional collapse.
CIKM’ 24, October 21-25, 2024, Boise, Idaho, USA Lin Z. and Pan J., et al.
Table 2: Performance on Kuairand1k dataset. Bold and underline highlight the best and second-best results, respectively. * indicates that the performance difference against the second-best result is statistically significant at p-value<0.05.

| Group | Methods | AUC S0 | AUC S1 | AUC S2 | AUC S4 | AUC S6 | AUC Overall | gAUC S0 | gAUC S1 | gAUC S2 | gAUC S4 | gAUC S6 | gAUC Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| – | Single Domain | 0.73272 | 0.74202 | 0.80179 | 0.72505 | 0.75559 | – | 0.60336 | 0.59610 | 0.54581 | 0.55085 | 0.60562 | – |
| MTL (SE) for MDR | Shared Bottom | 0.73535 | 0.74271 | 0.82033 | 0.72552 | 0.75945 | 0.78574 | 0.60335 | 0.59694 | 0.56337 | 0.56299 | 0.62500 | 0.66087 |
| MTL (SE) for MDR | MMoE | 0.73639 | 0.74279 | 0.82090 | 0.72793 | 0.76005 | 0.78594 | 0.60188 | 0.59658 | 0.56278 | 0.56325 | 0.62721 | 0.66057 |
| MTL (SE) for MDR | PLE | 0.73716 | 0.74271 | 0.81759 | 0.72737 | 0.75807 | 0.78565 | 0.60585 | 0.59710 | 0.56234 | 0.56276 | 0.62908 | 0.66093 |
| MTL (ME) for MDR | ME-MMoE | 0.73842 | 0.74281 | 0.81897 | 0.72637 | 0.75985 | 0.78586 | 0.60649 | 0.59874 | 0.56620 | 0.56080 | 0.63001 | 0.66188 |
| MTL (ME) for MDR | ME-PLE | 0.73645 | 0.74240 | 0.80785 | 0.72727 | 0.75138 | 0.78499 | 0.61084 | 0.59717 | 0.55982 | 0.56170 | 0.61744 | 0.66075 |
| MTL (ME) for MDR | STEM | 0.73741 | 0.74267 | 0.80526 | 0.72776 | 0.75449 | 0.78527 | 0.60820 | 0.59701 | 0.56250 | 0.56365 | 0.62595 | 0.66096 |
| MDL | PEPNET | 0.73732 | 0.74275 | 0.81396 | 0.72767 | 0.76239 | 0.78574 | 0.60812 | 0.59589 | 0.55435 | 0.55931 | 0.62733 | 0.66018 |
| MDL | HiNet | 0.73285 | 0.74264 | 0.81955 | 0.72698 | 0.76451 | 0.78563 | 0.60303 | 0.59830 | 0.56416 | 0.56267 | 0.62790 | 0.66131 |
| MDL | STAR | 0.71890 | 0.74071 | 0.81586 | 0.72203 | 0.75954 | 0.78336 | 0.58028 | 0.59396 | 0.55229 | 0.55896 | 0.62419 | 0.65652 |
| MDL | AdaSparse | 0.73445 | 0.74235 | 0.81575 | 0.71845 | 0.76151 | 0.78481 | 0.60653 | 0.59729 | 0.56316 | 0.56128 | 0.63150 | 0.66031 |
| – | Crocodile | 0.73874 | 0.74430* | 0.81646 | 0.72838 | 0.75817 | 0.78683* | 0.61075 | 0.60068* | 0.56964 | 0.56366 | 0.63579 | 0.66373* |
Table 3: Performance on AliCCP dataset. Bold and underline highlight the best and second-best results, respectively. * indicates that the performance difference against the second-best result is statistically significant at p-value<0.05.

| Group | Methods | AUC S1 | AUC S2 | AUC S3 | AUC Overall | gAUC S1 | gAUC S2 | gAUC S3 | gAUC Overall |
|---|---|---|---|---|---|---|---|---|---|
| – | Single Domain | 0.61299 | 0.56544 | 0.61735 | – | 0.58077 | 0.53023 | 0.58739 | – |
| MTL (SE) for MDR | Shared Bottom | 0.61925 | 0.59355 | 0.61952 | 0.61885 | 0.58908 | 0.57252 | 0.59076 | 0.58982 |
| MTL (SE) for MDR | MMoE | 0.61951 | 0.59324 | 0.61969 | 0.61935 | 0.58907 | 0.57269 | 0.59065 | 0.58992 |
| MTL (SE) for MDR | PLE | 0.62152 | 0.59476 | 0.61949 | 0.61931 | 0.58911 | 0.57256 | 0.59070 | 0.58947 |
| MTL (ME) for MDR | ME-MMoE | 0.62271 | 0.59741 | 0.62179 | 0.62131 | 0.59002 | 0.57258 | 0.59148 | 0.59048 |
| MTL (ME) for MDR | ME-PLE | 0.62352 | 0.58727 | 0.62063 | 0.62132 | 0.58974 | 0.55777 | 0.59130 | 0.59049 |
| MTL (ME) for MDR | STEM | 0.61978 | 0.58187 | 0.61978 | 0.61907 | 0.58941 | 0.55101 | 0.59132 | 0.59036 |
| MDL | PEPNET | 0.62273 | 0.59239 | 0.61988 | 0.62034 | 0.58851 | 0.56080 | 0.59085 | 0.58975 |
| MDL | HiNet | 0.62042 | 0.59515 | 0.62003 | 0.62022 | 0.58881 | 0.57191 | 0.59015 | 0.58970 |
| MDL | STAR | 0.62039 | 0.59204 | 0.61886 | 0.61949 | 0.58705 | 0.56527 | 0.58932 | 0.58855 |
| MDL | AdaSparse | 0.62192 | 0.59656 | 0.62173 | 0.62189 | 0.58819 | 0.57170 | 0.58972 | 0.58924 |
| – | Crocodile | 0.62448 | 0.59951 | 0.62192 | 0.62233* | 0.59039 | 0.57085 | 0.59204* | 0.59123* |
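Both tables report gAUC alongside AUC. The paper does not spell out the exact weighting, so the sketch below assumes the common formulation: the average of per-user AUC weighted by each user's number of samples, skipping users with only one label class.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def gauc(labels: np.ndarray, scores: np.ndarray, user_ids: np.ndarray) -> float:
    """Sample-count-weighted average of per-user AUC (single-class users are skipped)."""
    total, weight = 0.0, 0.0
    for u in np.unique(user_ids):
        mask = user_ids == u
        if labels[mask].min() == labels[mask].max():   # user has only positives or only negatives
            continue
        total += mask.sum() * roc_auc_score(labels[mask], scores[mask])
        weight += mask.sum()
    return total / weight
```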
5.3.2 Comparison with MTL for MDR methods. Multi-task learning (MTL) methods for Multi-Domain Recommendation (MDR) can be primarily categorized into single-embedding (SE) and multi-embedding (ME) paradigms [33]. We observed that some methods allocating domain-specific embeddings underperform compared to SE methods, despite the fact that they can capture diverse user interests, as shown in Fig. 5(a). For instance, in the comparison between PLE and ME-PLE on the Kuairand1k dataset in Tab. 2, PLE achieved an overall AUC of 0.78565 and gAUC of 0.66093. In contrast, ME-PLE's AUC and gAUC were 0.78499 and 0.66075, respectively, both lower than those of PLE.

To investigate the underlying reason, we conducted a domain-by-domain performance comparison and found that ME-PLE exhibited notably poorer results in the smallest domain, S6, with a negative transfer in AUC of -0.4% compared to the single-domain method. A similar negative transfer was also observed in STEM. We analyzed their embeddings via singular spectrum analysis [13] and show the results in Fig. 4. We discovered that ME-PLE and STEM, in comparison to other methods, exhibited a clear dominance of top factors with much higher singular values over tail factors in the S6 embeddings, indicating the dimensional collapse issue.

In contrast, Crocodile's embeddings exhibit a more balanced importance of singular values, thereby better supporting the high-dimensional representational space. Consequently, within domain S6, Crocodile outperforms ME-PLE by 0.68% in AUC and 1.8% in gAUC, and surpasses STEM by 0.37% in AUC and 0.98% in gAUC. Even though STEM learns the most diverse user interests, it still fails to surpass Crocodile due to dimensional collapse.
Figure 5: Diversity Index (DI) of representations in the Kuairand1k dataset. There are five domains in Kuairand1k, resulting in 19 unique combinations. We plot the DIs on all combinations in descending order. We compare Crocodile with MDL and MTL methods in (a) and with other disentanglement losses in (b). (a) Crocodile vs MDL and MTL; (b) Crocodile vs other Losses.

5.3.3 Comparison with MDL methods. The Multi-domain Learning (MDL) approaches typically evolve along a trend of progressively allocating domain-personalized components from top to bottom to capture more diverse user interests.

To empirically support this, we analyze the Diversity Index (DI) of both MDL and MTL methods in Fig. 5(a). Firstly, along the trend from top to bottom, the DI of STAR [31], HiNet [43], and STEM [33] reveals an increasing trend. Secondly, methods with an ME structure typically exhibit higher DI than their SE versions, e.g., ME-PLE has a higher DI than PLE.

However, as discussed in Section 5.3.2, the allocation of domain-specific embeddings can lead to the side effect of dimensional collapse, which in turn may result in negative transfer. Consequently, we are curious as to whether methods that do not employ domain-specific embeddings can achieve favorable outcomes.

For example, HiNet [43] only allocates domain-specific components at the expert level and displayed the best ability to prevent dimensional collapse in Fig. 4. It thus got the best AUC of 0.76451 in domain S6 of Kuairand1k. However, HiNet still falls into suboptimal performance in overall AUC and gAUC on both datasets compared to Crocodile. This is because of its low ability to capture diverse user interests in Fig. 5(a).

Conversely, Crocodile achieves the second-best DI, only slightly trailing behind STEM, which employs domain-specific embeddings.

5.3.4 Comparison with Other Forms of Losses. Through comparison with MTL and MDL approaches, we found that Crocodile captures the diverse interests of users (Fig. 5(b)) without introducing the issue of dimensional collapse (Fig. 4), thereby achieving the best overall AUC and gAUC scores in Kuairand1k and AliCCP.

Thus, can other disentangled losses also achieve favorable diversity and corresponding positive outcomes compared to those without them? We compare CovLoss with several other disentangling losses, including Dot Product (Dot) [38], Cosine Similarity (Cos) [20], Distance Correlation (dCorr) [19] and OLE [17] (F-norm form).

We first analyzed the ability to capture users' diverse interests in Fig. 5(b). We observed that all disentangled losses achieve a higher DI than ME-MMoE-PEG, the basic structure without any disentangled loss.

Among them, we found that the dCorr [19] loss achieves the smallest improvement of only 0.01% and 0.01% in AUC and gAUC lift, respectively, in Tab. 4. This is coherent with its relatively poor diversity compared to other disentangled losses in Fig. 5(b). In contrast, utilizing CovLoss achieves the best AUC and gAUC, with lifts of 0.07% and 0.17% compared to the Base model. This evidence supports that the ability to learn diverse interests is important to improve performance.

Furthermore, we are interested in exploring whether the transfer of other types of loss functions, e.g., load balance [30] and transfer learning [7], could yield performance enhancements. As shown in Tab. 4, the Importance loss [30] resulted in a limited improvement, whereas Trans5, one of the effective versions of the equivalent transformation losses in [7], had a detrimental effect. This can be attributed to the fact that these loss functions do not explicitly promote diversity in learning. Instead, they focus on balancing the contributions of various experts.
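For reference, the two geometric baselines compared above (Dot and Cos) penalize pairwise expert outputs directly. The sketch below is our reading of how such penalties are typically instantiated between a pair of expert outputs, not the exact code used in these experiments.

```python
import torch
import torch.nn.functional as F

def dot_penalty(o_p: torch.Tensor, o_q: torch.Tensor) -> torch.Tensor:
    """Dot-product penalty: push paired expert outputs ([N, d]) towards orthogonality."""
    return (o_p * o_q).sum(dim=-1).abs().mean()

def cos_penalty(o_p: torch.Tensor, o_q: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity penalty: the scale-invariant variant of the same idea."""
    return F.cosine_similarity(o_p, o_q, dim=-1).abs().mean()
```

Unlike these pairwise geometric penalties, CovLoss operates on covariances estimated across the whole batch, which is what ties it to the diversity bound derived in Sec. 4.5.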
CIKM’ 24, October 21-25, 2024, Boise, Idaho, USA Lin Z. and Pan J., et al.
Table 5: Ablation study in Kuairand1k and AliCCP. ME and SE are short for multi- and single-embedding, respectively. B and B+C are short for BCE Loss and BCE Loss with CovLoss, respectively. * indicates that the performance difference against Base is significant at p-value<0.05.
5.4 Ablation Study (RQ3)

In this section, we empirically test the effectiveness of our proposed components with an ablation study. We eliminate CovLoss, PEG, and ME one by one and compare their performance on Kuairand1k and AliCCP, as shown in Tab. 5.

All three components are critical, since removing any of them leads to a significant performance drop on both datasets. In particular, CovLoss is more critical in the Kuairand1k dataset, while CovLoss and EG are equally important in AliCCP.

5.5 Hyper-Parameters and Embedding Sets (RQ4)

In this section, we investigate the sensitivity of the model's performance to hyperparameters and the number of embeddings. For hyperparameters, the primary parameter is the weight of CovLoss in Eq. 7, i.e., $\alpha$. Taking Kuairand1k as an example, we find that as long as $\alpha$ is not too small (below $2 \times 10^{-5}$), Crocodile's AUC and gAUC remain >0.78590 and >0.66174, thus maintaining its status as the second-best method.

Additionally, we analyze the impact of model capacity on performance by adjusting the number of embedding sets. At the minimum number of MEs (i.e., 2 sets), Crocodile achieves a gAUC of 0.66198 and an AUC of 0.78595, which still surpasses other MTL and MDL methods, even when the MTL (ME) methods use 5 sets of embeddings. Moreover, as the number of embeddings increases, the performance gains of Crocodile over ME-MMoE-PG are robust (+0.06% ∼ +0.20%), as illustrated in Tab. 6.

5.6 Computational Complexity

The Multi-Embedding (ME) paradigm has already been widely adopted in industry [13, 27, 28, 33, 42]. Compared to these ME architectures, our proposed Crocodile introduces the CovLoss upon them, which only influences the training process and not the inference at all. Its complexity is not only small and negligible when compared with the ME-based backbone, but it can be further reduced by sampling.

The complexity of CovLoss is $\frac{M(M-1)}{2} \times (d^2 N)$, which is lower than that of dCorr [19], $\frac{M(M-1)}{2} \times (d^3 N + d^2 N)$, where $M$ and $N$ are the numbers of experts and samples, and $d$ is the dimension of the expert output. Besides, the complexity of the experts in the MMoE structure is $M \times \sum_{i=1}^{l} N d_i d_{i-1}$, where $d_i$ is the hidden-layer dimension of the $i$-th layer among all $l$ layers. This is much higher than CovLoss in practice.

Moreover, we can reduce the complexity of calculating the covariance by sampling. We discovered that selecting as few as 32 samples yields no significant difference in the computed covariance norms (p-value > 0.1) on the testing dataset, contributing as much as a 99.2% complexity reduction, as we originally selected $N = 1024$.

6 CONCLUSION

Multi-domain learning is vital in enhancing personalized recommendations. Existing MTL methods fail to learn both dimensionally robust representations and diverse user interests. We propose Crocodile, which employs a Covariance Loss (CovLoss) and Prior informed Element-wise Gating (PEG) based on the multi-embedding (ME) paradigm. Comprehensive experimental evaluations validate the effectiveness of Crocodile.
Table 6: gAUC metric along with different numbers of embeddings (#Emb) on ME-MMoE-PG and Crocodile. * indicates that the performance difference against ME-MMoE-PG is significant at p-value<0.05.

| #Emb | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| ME-MMoE-PG | 0.66138 | 0.66157 | 0.66206 | 0.66232 | 0.66203 | 0.66196 | 0.66286 | 0.66279 | 0.66285 |
| Crocodile | 0.66198* | 0.66284* | 0.66324* | 0.66373* | 0.6637* | 0.66394* | 0.66395* | 0.66432* | 0.66397* |
| Lift (%) | 0.06% | 0.13% | 0.12% | 0.14% | 0.17% | 0.20% | 0.11% | 0.15% | 0.11% |
REFERENCES

[1] Adrien Bardes, Jean Ponce, and Yann LeCun. 2021. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906 (2021).
[2] Andrei Z Broder. 2008. Computational advertising and recommender systems. In Proceedings of the 2008 ACM conference on Recommender systems. 1–2.
[3] Jiangxia Cao, Shaoshuai Li, Bowen Yu, Xiaobo Guo, Tingwen Liu, and Bin Wang. 2023. Towards universal cross-domain recommendation. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining. 78–86.
[4] Rich Caruana. 1997. Multitask learning. Machine learning 28, 1 (1997), 41–75.
[5] Jianxin Chang, Chenbin Zhang, Yiqun Hui, Dewei Leng, Yanan Niu, Yang Song, and Kun Gai. 2023. Pepnet: Parameter and embedding personalized network for infusing with personalized prior information. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3795–3804.
[6] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in neural information processing systems 29 (2016).
[7] Xu Chen, Ya Zhang, Ivor W Tsang, Yuangang Pan, and Jingchao Su. 2023. Toward equivalent transformation of user preferences in cross domain recommendation. ACM Transactions on Information Systems 41, 1 (2023), 1–31.
[8] Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, et al. 2022. On the representation collapse of sparse mixture of experts. Advances in Neural Information Processing Systems 35 (2022), 34600–34613.
[9] Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, and Nicu Sebe. 2021. Whitening for self-supervised representation learning. In International Conference on Machine Learning. PMLR, 3015–3024.
[10] William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39.
[11] Chongming Gao, Shijun Li, Yuan Zhang, Jiawei Chen, Biao Li, Wenqiang Lei, Peng Jiang, and Xiangnan He. 2022. KuaiRand: An Unbiased Sequential Recommendation Dataset with Randomly Exposed Videos. In Proceedings of the 31st ACM International Conference on Information and Knowledge Management (Atlanta, GA, USA) (CIKM '22). 3953–3957. https://doi.org/10.1145/3511808.3557624
[12] Xingzhuo Guo, Junwei Pan, Ximei Wang, Baixu Chen, Jie Jiang, and Mingsheng Long. 2023. On the Embedding Collapse when Scaling up Recommendation Models. arXiv preprint arXiv:2310.04400 (2023).
[13] Xingzhuo Guo, Junwei Pan, Ximei Wang, Baixu Chen, Jie Jiang, and Mingsheng Long. 2023. On the Embedding Collapse when Scaling up Recommendation Models. arXiv preprint arXiv:2310.04400 (2023).
[14] Tianyu Hua, Wenxiao Wang, Zihui Xue, Sucheng Ren, Yue Wang, and Hang Zhao. 2021. On feature decorrelation in self-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9598–9608.
[15] Valentin Khrulkov, Leyla Mirvakhabova, Ivan Oseledets, and Artem Babenko. 2021. Disentangled representations from non-disentangled models. arXiv preprint arXiv:2102.06204 (2021).
[16] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668 (2020).
[17] José Lezama, Qiang Qiu, Pablo Musé, and Guillermo Sapiro. 2018. OLE: Orthogonal low-rank embedding, a plug and play geometric loss for deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8109–8118.
[18] Jinyun Li, Huiwen Zheng, Yuanlin Liu, Minfang Lu, Lixia Wu, and Haoyuan Hu. 2023. ADL: Adaptive Distribution Learning Framework for Multi-Scenario CTR Prediction. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1786–1790.
[19] Fan Liu, Huilin Chen, Zhiyong Cheng, Anan Liu, Liqiang Nie, and Mohan Kankanhalli. 2022. Disentangled multimodal representation learning for recommendation. IEEE Transactions on Multimedia (2022).
[20] Liu Liu, Jiangtong Li, Li Niu, Ruicong Xu, and Liqing Zhang. 2021. Activity image-to-video retrieval by disentangling appearance and motion. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 2145–2153.
[21] Weiming Liu, Xiaolin Zheng, Jiajie Su, Mengling Hu, Yanchao Tan, and Chaochao Chen. 2022. Exploiting variational domain-invariant user embedding for partially overlapped cross domain recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 312–321.
[22] Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H. Chi. 2018. Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts. In KDD. ACM, 1930–1939.
[23] Jianxin Ma, Chang Zhou, Peng Cui, Hongxia Yang, and Wenwu Zhu. 2019. Learning disentangled representations for recommendation. Advances in neural information processing systems 32 (2019).
[24] Xiao Ma, Liqin Zhao, Guan Huang, Zhi Wang, Zelin Hu, Xiaoqiang Zhu, and Kun Gai. 2018. Entire space multi-task model: An effective approach for estimating post-click conversion rate. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 1137–1140.
[25] Kelong Mao, Jieming Zhu, Liangcai Su, Guohao Cai, Yuru Li, and Zhenhua Dong. 2023. FinalMLP: An Enhanced Two-Stream MLP Model for CTR Prediction. arXiv preprint arXiv:2304.00902 (2023).
[26] Wentao Ning, Xiao Yan, Weiwen Liu, Reynold Cheng, Rui Zhang, and Bo Tang. 2023. Multi-domain Recommendation with Embedding Disentangling and Domain Alignment. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 1917–1927.
[27] Wentao Ning, Xiao Yan, Weiwen Liu, Reynold Cheng, Rui Zhang, and Bo Tang. 2023. Multi-domain Recommendation with Embedding Disentangling and Domain Alignment. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 1917–1927.
[28] Junwei Pan, Wei Xue, Ximei Wang, Haibin Yu, Xun Liu, Shijie Quan, Xueming Qiu, Dapeng Liu, Lei Xiao, and Jie Jiang. 2024. Ad Recommendation in a Collapsed and Entangled World. arXiv preprint arXiv:2403.00793 (2024).
[29] Steffen Rendle. 2010. Factorization machines. In 2010 IEEE International Conference on Data Mining (ICDM). IEEE, 995–1000.
[30] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017).
[31] Xiang-Rong Sheng, Liqin Zhao, Guorui Zhou, Xinyao Ding, Binding Dai, Qiang Luo, Siran Yang, Jingshan Lv, Chi Zhang, Hongbo Deng, et al. 2021. One model to serve all: Star topology adaptive recommender for multi-domain CTR prediction. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 4104–4113.
[32] Ravid Shwartz-Ziv, Randall Balestriero, Kenji Kawaguchi, Tim GJ Rudner, and Yann LeCun. 2024. An Information Theory Perspective on Variance-Invariance-Covariance Regularization. Advances in Neural Information Processing Systems 36 (2024).
[33] Liangcai Su, Junwei Pan, Ximei Wang, Xi Xiao, Shijie Quan, Xihua Chen, and Jie Jiang. 2023. STEM: Unleashing the Power of Embeddings for Multi-task Recommendation. arXiv preprint arXiv:2308.13537 (2023).
[34] Hongyan Tang, Junning Liu, Ming Zhao, and Xudong Gong. 2020. Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations. In RecSys. ACM, 269–278.
[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
[36] Ruoxi Wang, Rakesh Shivanna, Derek Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed Chi. 2021. DCN V2: Improved deep & cross network and practical lessons for web-scale learning to rank systems. In Proceedings of the Web Conference 2021. 1785–1797.
[37] Xin Wang, Hong Chen, Si'ao Tang, Zihao Wu, and Wenwu Zhu. 2022. Disentangled representation learning. arXiv preprint arXiv:2211.11695 (2022).
[38] Aming Wu, Rui Liu, Yahong Han, Linchao Zhu, and Yi Yang. 2021. Vector-decomposed disentanglement for domain-invariant object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9342–9351.
[39] Xuanhua Yang, Xiaoyu Peng, Penghui Wei, Shaoguo Liu, Liang Wang, and Bo Zheng. 2022. AdaSparse: Learning Adaptively Sparse Structures for Multi-Domain Click-Through Rate Prediction. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 4635–4639.
[40] Hengshi Yu and Joshua D Welch. 2021. MichiGAN: sampling from disentangled representations of single-cell data using generative adversarial networks. Genome biology 22, 1 (2021), 1–26.
[41] Chuang Zhao, Hongke Zhao, Ming He, Jian Zhang, and Jianping Fan. 2023. Cross-domain recommendation via user interest alignment. In Proceedings of the ACM Web Conference 2023. 887–896.
[42] Xiangyu Zhao, Haochen Liu, Wenqi Fan, Hui Liu, Jiliang Tang, Chong Wang, Ming Chen, Xudong Zheng, Xiaobing Liu, and Xiwang Yang. 2021. AutoEmb: Automated embedding dimensionality search in streaming recommendations. In 2021 IEEE International Conference on Data Mining (ICDM). IEEE, 896–905.
[43] Jie Zhou, Xianshuai Cao, Wenhao Li, Kun Zhang, Chuan Luo, and Qian Yu. 2023. HiNet: A Novel Multi-Scenario & Multi-Task Learning Approach with Hierarchical Information Extraction. arXiv preprint arXiv:2303.06095 (2023).