0% found this document useful (0 votes)
23 views11 pages

Representation Learning For Attributed Multiplex Heterogeneous Network.

This paper presents a unified framework for representation learning in Attributed Multiplex Heterogeneous Networks (AMHEN), addressing the limitations of existing methods that focus on single-typed nodes and edges. The proposed model, GATNE, supports both transductive and inductive learning, demonstrating significant improvements in link prediction tasks across various datasets, including Amazon and Alibaba. The framework has been successfully implemented in a recommendation system, showcasing its effectiveness and scalability for real-world applications.

Uploaded by

bingzexu2
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
23 views11 pages

Representation Learning For Attributed Multiplex Heterogeneous Network.

This paper presents a unified framework for representation learning in Attributed Multiplex Heterogeneous Networks (AMHEN), addressing the limitations of existing methods that focus on single-typed nodes and edges. The proposed model, GATNE, supports both transductive and inductive learning, demonstrating significant improvements in link prediction tasks across various datasets, including Amazon and Alibaba. The framework has been successfully implemented in a recommendation system, showcasing its effectiveness and scalability for real-world applications.

Uploaded by

bingzexu2
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

Representation Learning for Attributed Multiplex

Heterogeneous Network

Yukuo Cen† , Xu Zou† , Jianwei Zhang‡ , Hongxia Yang‡∗ , Jingren Zhou‡ , Jie Tang†∗
† Department of Computer Science and Technology, Tsinghua University
‡ DAMO Academy, Alibaba Group

{cyk18,zoux18}@mails.tsinghua.edu.cn
{zhangjianwei.zjw,yang.yhx,jingren.zhou}@alibaba-inc.com
arXiv:1905.01669v2 [cs.SI] 20 May 2019

[email protected]

ABSTRACT ACM Reference Format:


Network embedding (or graph embedding) has been widely used in Yukuo Cen, Xu Zou, Jianwei Zhang, Hongxia Yang, Jingren Zhou and Jie
Tang. 2019. Representation Learning for Attributed Multiplex Heteroge-
many real-world applications. However, existing methods mainly
neous Network. In The 25th ACM SIGKDD Conference on Knowledge Discov-
focus on networks with single-typed nodes/edges and cannot scale ery and Data Mining (KDD ’19), August 4–8, 2019, Anchorage, AK, USA. ACM,
well to handle large networks. Many real-world networks consist New York, NY, USA, 11 pages. https://doi.org/10.1145/3292500.3330964
of billions of nodes and edges of multiple types, and each node
is associated with different attributes. In this paper, we formalize
the problem of embedding learning for the Attributed Multiplex 1 INTRODUCTION
Heterogeneous Network and propose a unified framework to ad- Network embedding [4], or network representation learning, is
dress this problem. The framework supports both transductive and a promising method to project nodes in a network to a low-
inductive learning. We also give the theoretical analysis of the pro- dimensional continuous space while preserving network structure
posed framework, showing its connection with previous works and and inherent properties. It has attracted tremendous attention re-
proving its better expressiveness. We conduct systematical eval- cently due to significant progress in downstream network learning
uations for the proposed framework on four different genres of tasks such as node classification [1], link prediction [39], and com-
challenging datasets: Amazon, YouTube, Twitter, and Alibaba1 . Ex- munity detection [8]. DeepWalk [27], LINE [35], and node2vec [10]
perimental results demonstrate that with the learned embeddings are pioneering works that introduce deep learning techniques into
from the proposed framework, we can achieve statistically signif- network analysis to learn node embeddings. NetMF [29] gives a
icant improvements (e.g., 5.99-28.23% lift by F1 scores; p ≪ 0.01, theoretical analysis of equivalence for the different network embed-
t−test) over previous state-of-the-art methods for link prediction. ding algorithms, and later NetSMF [28] gives a scalable solution via
The framework has also been successfully deployed on the recom- sparsification. Nevertheless, they were designed to handle only the
mendation system of a worldwide leading e-commerce company, homogeneous network with single-typed nodes and edges. More re-
Alibaba Group. Results of the offline A/B tests on product recom- cently, PTE [34], metapath2vec [7], and HERec [31] are proposed for
mendation further confirm the effectiveness and efficiency of the heterogeneous networks. However, real-world network-structured
framework in practice. applications, such as e-commerce, are much more complicated, com-
prising not only multi-typed nodes and/or edges but also a rich set
CCS CONCEPTS of attributes. Due to its significant importance and challenging re-
• Mathematics of computing → Graph algorithms; • Com- quirements, there have been tremendous attempts in the literature
puting methodologies → Learning latent representations. to investigate embedding learning for complex networks. Depend-
ing on the network topology (homogeneous or heterogeneous)
KEYWORDS and attributed property (with or without attributes), we catego-
rize six different types of networks and summarize their relative
Network embedding; Multiplex network; Heterogeneous network
comprehensive developments, respectively, in Table 1. These six
∗ Hongxia Yang and Jie Tang are the corresponding authors. categories include HOmogeneous N etwork (or HON), Attributed
1 Code is available at https://github.com/cenyk1230/GATNE. HOmogeneous N etwork (or AHON), HEterogeneous N etwork (or
HEN), Attributed HEterogeneous N etwork (or AHEN), Multiplex
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed HEterogeneous N etwork (or MHEN), and Attributed Multiplex
for profit or commercial advantage and that copies bear this notice and the full citation HEterogeneous N etwork (or AMHEN). As can be seen, among
on the first page. Copyrights for components of this work owned by others than ACM them the AMHEN has been least studied.
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a In this paper, we focus on embedding learning for AMHENs,
fee. Request permissions from [email protected]. where different types of nodes might be linked with multiple dif-
KDD ’19, August 4–8, 2019, Anchorage, AK, USA ferent types of edges, and each node is associated with a set of
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6201-6/19/08. different attributes. This is common in many online applications.
https://doi.org/10.1145/3292500.3330964 For example, in the four datasets that we are working with, there
… …

Consider all information


… …

… …

click AMHEN
add-to-preference
add-to-cart 89.94
Gender: male 90
Age: 23 conversion Price: $1000 86.65
Location: Beijing Brand: Lenovo
… … 85

Gender: female Ignore node

F1-score
Age: 26 Price: $800 atrributes 80
Location: Bangalore Brand:Apple
… …
75
Gender: male
Age: 35 Price: $50 70.14
Location: Boston Brand: Nike MHEN 70
… …

users items 65
DeepWalk GATNE-T GATNE-I

Results on Alibaba dataset

Ignore edge multiplicity and node atrributes

HON

Figure 1: The left illustrates an example of an attributed multiplex heterogeneous network. Users in the left part of the fig-
ure are associated with attributes, including gender, age, and location. Similarly, items in the left part of the figure include
attributes such as price and brand. The edge types between users and items are from four interactions, including click, add-
to-preference, add-to-cart and conversion. The three subfigures in the middle represent three different ways of setting up the
graphs, including HON, MHEN, and AMHEN from the bottom to the top. The right part shows the performance improvement
of the proposed models over DeepWalk on the Alibaba dataset. As can be seen, GATNE-I achieves a +28.23% performance lift
compared to DeepWalk.

are 20.3% (Twitter), 21.6% (YouTube), 15.1% (Amazon) and 16.3% Table 1: The network types handled by different methods.
(Alibaba) of the linked node pairs having more than one type of
edges respectively. As an instance, in an e-commerce system, users Network Type Method
Heterogeneity
Attribute
Node Type Edge Type
may have several types of interactions with items, such as click,
DeepWalk [27]
conversion, add-to-cart, add-to-preference. Figure 1 illustrates such LINE [35]
Homogeneous
an example. Obviously, “users” and “items” have intrinsically differ- node2vec [10] Single Single /
Network (HON)
NetMF [29]
ent properties and shall not be treated equally. Moreover, different NetSMF [28]
user-item interactions imply different levels of interests and should TADW [41]
be treated differently. Otherwise, the system cannot precisely cap- LANE [16]
Attributed
AANE [15]
ture the user’s behavioral patterns and preferences and would be Homogeneous
SNE [20]
Single Single Attributed
Network (AHON)
insufficient for practical use. DANE [9]
Not merely because of the heterogeneity and multiplicity, in ANRL [44]
PTE [34]
practice, dealing with AMHEN poses several unique challenges: Heterogeneous
metapath2vec [7] Multi Single /
Network (HEN)
HERec [31]
Attributed
HNE [3] Multi Single Attributed
• Multiplex Edges. Each node pair may have multiple different types HEN (AHEN)
of relationships. It is important to be able to borrow strengths PMNE [22]
Multiplex MVE [30]
from different relationships and learn unified embeddings. Single Multi
Heterogeneous MNE [43] /
• Partial Observations. The real networked data are usually par- Network (MHEN) mvn2vec [32]
GATNE-T Multi Multi
tially observed. For example, a long-tailed customer may only Attributed
present few interactions with some products. Most existing net- GATNE-I Multi Multi Attributed
MHEN (AMHEN)
work embedding methods focus on the transductive settings, and
cannot handle the long-tailed or cold-start problems.
• Scalability. Real networks usually have billions of nodes and tens To address the above challenges, we propose a novel approach
or hundreds of billions of edges [40]. It is important to develop to capture both rich attributed information and to utilize multiplex
learning algorithms that can scale well to large networks. topological structures from different node types, namely General
Attributed Multiplex HeTerogeneous Network Embedding, or ab- different objects in heterogeneous networks to unified vector rep-
breviated as GATNE. The key features of GATNE are the following: resentations. PTE [34] constructs large-scale heterogeneous text
network from labeled information and different levels of word
• We formally define the problem of attributed multiplex heteroge- co-occurrence information, which is then embedded into a low-
neous network embedding, which is a more general representa- dimensional space. metapath2vec [7] formalizes meta-path based
tion for real-world networks. random walk to construct the heterogeneous neighborhood of a
• GATNE supports both transductive and inductive embeddings node and then leverages a heterogeneous skip-gram model to per-
learning for attributed multiplex heterogeneous networks. We form node embeddings. HERec [31] uses a meta-path based ran-
also give the theoretical analysis to prove that our transduc- dom walk strategy to generate meaningful node sequences to learn
tive model is a more general form than existing models (e.g., network embeddings that are first transformed by a set of fusion
MNE [43]). functions and subsequently integrated into an extended matrix
• Efficient and scalable learning algorithms for GATNE have been factorization (MF) model.
developed. Our learning algorithms are able to handle hundreds
of million nodes and billions of edges efficiently. Multiplex Heterogeneous Network Embedding. Existing ap-
proaches usually study networks with a single type of proximity
We conduct extensive experiments to evaluate the proposed mod- between nodes, which only captures a single view of a network.
els on four different genres of datasets: Amazon, YouTube, Twitter, However, in reality, there usually exist multiple types of proxim-
and Alibaba. Experimental results show that the proposed frame- ities between nodes, yielding networks with multiple views, or
work can achieve statistically significant improvements (∼5.99- multiplex network embedding. PMNE [22] proposes three meth-
28.23% lift by F1 scores on Alibaba dataset; p ≪0.01, t−test) over ods to project a multiplex network into a continuous vector space.
state-of-the-art methods. We have deployed the proposed model MVE [30] embeds networks with multiple views in a single collab-
on Alibaba’s distributed system and apply the method to Alibaba’s orated embedding using attention mechanism. MNE [43] uses one
recommendation engine. Offline A/B tests further confirm the ef- common embedding and several additional embeddings of each
fectiveness and efficiency of our proposed models. edge type for each node, which are jointly learned by a unified
network embedding model. Mvn2vec [32] explores the feasibility
2 RELATED WORK to achieve better embedding results by simultaneously modeling
preservation and collaboration to represent semantic meanings of
In this section, we review related state-of-the-arts for network edges in different views respectively.
embedding, heterogeneous network embedding, multiplex hetero-
geneous network embedding, and attributed network embedding. Attributed Network Embedding. Attributed network embed-
ding aims to seek for low-dimensional vector representations for
Network Embedding. Works in network embedding mainly con- nodes in a network, such that original network topological struc-
sist of two categories, graph embedding (GE) and graph neural net- ture and node attribute proximity can be preserved in such repre-
work (GNN). Representative works for GE include DeepWalk [27] sentations. TADW [41] incorporates text features of vertices into
which generates a corpus on graphs by random walk and then network representation learning under the framework of matrix
trains a skip-gram model on the corpus. LINE [35] learns node factorization. LANE [16] smoothly incorporates label information
presentations on large-scale networks while preserving both first- into the attributed network embedding while preserving their cor-
order and second-order proximities. node2vec [10] designs a biased relations. AANE [15] enables a joint learning process to be done
random walk procedure to efficiently explore diverse neighbor- in a distributed manner for accelerated attributed network embed-
hoods. NetMF [29] is a unified matrix factorization framework for ding. SNE [20] proposes a generic framework for embedding social
theoretically understanding and improving DeepWalk and LINE. networks by capturing both the structural proximity and attribute
For popular works in GNN, GCN [19] incorporates neighbors’ fea- proximity. DANE [9] can capture the high nonlinearity and pre-
ture representations into the node feature representation using serve various proximities in both topological structure and node
convolutional operations. GraphSAGE [11] provides an inductive attributes. ANRL [44] uses a neighbor enhancement autoencoder
approach to combine structural information with node features. It to model the node attribute information and an attribute-aware
learns functional representations instead of direct embeddings for skip-gram model based on the attribute encoder to capture the
each node, which helps it work inductively on unobserved nodes network structure.
during training.
Heterogeneous Network Embedding. Heterogeneous networks
examine scenarios with nodes and/or edges of various types. Such
networks are notoriously difficult to mine because of the bewil- 3 PROBLEM DEFINITION
dering combination of heterogeneous contents and structures. The Denote a network G = (V, E), where V is a set of n nodes and E is
creation of a multidimensional embedding of such data opens the a set of edges between nodes. Each edge ei j = (vi , v j ) is associated
door to the use of a wide variety of off-the-shelf mining techniques with a weight w i j ≥ 0, indicating the strength of the relationship
for multidimensional data. Despite the importance of this problem, between vi and v j . In practice, the network could be either directed
limited efforts have been made on embedding a scalable network or undirected. If G is directed, we have ei j . e ji and w i j . w ji ;
of dynamic and heterogeneous data. HNE [3] jointly considers the if G is undirected, we have ei j ≡ e ji and w i j ≡ w ji . Notations are
contents and topological structures in networks and represents summarized in Table 2.
Table 2: Notations. Base Embedding

Notation Description Output Layer


GATNE-T
G the input network GATNE-I Input Layer
V, E the node/edge set of G 0 Hidden

0 Layer
O, R the node/edge type set of G 0
Node Attributes |V1|×K1
A the attribute set of G 0
0
n the number of nodes 10010
0

m 01001 1
the number of edge types 0

r an edge type 01010 0


|V2|×K2
0
d the dimension of base/overall embeddings 0

s the dimension of edge embeddings 0


0
Edge Embedding
v a node in the graph |V|-dim

N the neighborhood set of a node on an edge type


b, u, c, v the base/edge/context/overall embedding of a node |V3|×K3

h, g transformation functions in our inductive approach


x the attribute of a node Heterogeneous Skip-Gram
Generating Overall Node Embedding

Figure 2: Illustration of GATNE-T and GATNE-I models.


Definition 1 (Heterogeneous Network). A heterogeneous
GATNE-T only uses network structure information while
network [3, 33] is a network G = (V, E) associated with a node type
GATNE-I considers both structure information and node at-
mapping function ϕ : V → O and an edge type mapping function
tributes. The output layer of heterogeneous skip-gram speci-
ψ : E → R, where O and R represent the set of all node types and
fies one set of multinomial distributions for each node type
the set of all edge types, respectively. Each node v ∈ V belongs to
in the neighborhood of the input node v. In this example,
a particular node type. Similarly, each edge e ∈ E is categorized
V = V1 ∪ V2 ∪ V3 and K1, K2, K3 specify the size of v’s
into a specific edge type. If |O| + |R| > 2, the network is called
neighborhood on each node type respectively.
heterogeneous; otherwise homogeneous.
Notice that in a heterogeneous network, an edge e can no longer
be denoted as ei j since there may be multiple types of edges between GATNE-T. We also give a theoretical discussion about the connec-
(r ) tion of GATNE-T with the newly trending models, e.g., MNE. To
node vi and v j . Under such situations, an edge is denoted as ei j ,
deal with the partial observation problem, we further extend the
where r corresponds to a certain edge type. model to the inductive context [42] and present a new inductive
Definition 2 (Attributed Network). An attributed network model named as GATNE-I. For both models, we present efficient
[3, 16] is a network G endowed with an attribute representation A, optimization algorithms.
i.e., G = (V, E, A). Each node vi ∈ V is associated with some types
of feature vectors. A = {xi |vi ∈ V} is the set of node features for all 4.1 Transductive Model: GATNE-T
nodes, where xi is the associated node feature of node vi . We begin with embedding learning for attributed multiplex het-
erogeneous networks in the transductive context, and present our
Definition 3 (Attributed Multiplex Heterogeneous Net-
model GATNE-T. More specifically, in GATNE-T, we split the overall
work). An attributed multiplex heterogeneous network is a net-
embedding of a certain node vi on each edge type r into two parts:
work G = (V, E, A), E = r ∈R Er , where Er consists of all edges
Ð
base embedding, edge embedding as shown in Figure 2. The base
with edge type r ∈ R, and |R| > 1. We separate the network for every
embedding of node vi is shared between different edge types. The
edge type or view r ∈ R as G r = (V, Er , A).
k-th level edge embedding ui,r ∈ Rs , (1 ≤ k ≤ K) of node vi on
(k)
An example of AMHEN is illustrated in Figure 1, which con- edge type r is aggregated from neighbors’ edge embeddings as:
tains 2 node types and 4 edge types. The two node types are user (k ) (k −1)
and item with different attributes. Given the above definitions, we ui,r = aддreдator ({uj,r , ∀v j ∈ Ni,r }), (1)
can formally define our problem for representation learning on where Ni,r is the neighbors of node vi on edge type r . The ini-
(0)
networks. tial edge embedding ui,r for each node vi and each edge type r is
Problem 1 (AMHEN Embedding). Given an AMHEN G = randomly initialized in our transductive model. Following Graph-
(V, E, A), the problem of AMHEN embedding is to give a uni- SAGE [11], the aддreдator function can be a mean aggregator as:
fied low-dimensional space representation of each node v on every (k ) (k−1)
ui,r = σ (Ŵ(k) · mean({uj,r , ∀v j ∈ Ni,r })), (2)
edge type r . The goal is to find a function fr : V → Rd for every
edge type r , where d ≪ |V |. or other pooling aggregators such as max-pooling aggregator:
(k ) (k ) (k −1) (k )
ui,r = max({σ (Ŵpool uj,r + b̂pool ), ∀v j ∈ Ni,r }), (3)
4 METHODOLOGY
In this section, we first explain the proposed GATNE framework where σ is an activation function. We denote the K-th level edge
(K )
in the transductive context [19]. The resultant model is named as embedding ui,r as edge embedding ui,r , and concatenate all the
edge embeddings for node vi as Ui with size s-by-m, where s is the Thus the parameter space of MNE is almost included by our
dimension of edge-embeddings: model’s parameter space and we can conclude that GATNE-T is a
generalization of MNE, if edge embeddings can be trained directly.
Ui = (ui,1 , ui,2 , . . . , ui,m ). (4)
However, in our model, the edge embedding u is generated from
We use self-attention mechanism [21] to compute the coefficients single or multiple layers of aggregation. We give more discussions
ai,r ∈ Rm of linear combination of vectors in Ui on edge type r as: about the aggregation case.
ai,r = softmax(wTr tanh (Wr Ui ))T , (5) Effects of Aggregation. In the GATNE-T model, the edge embed-
ding u(k ) is computed by aggregating the edge embedding u(k −1)
where wr and Wr are trainable parameters for edge type r with size
of its neighbors as:
da and da × s respectively and the superscript T denotes the trans-
(k ) (k −1)
position of the vector or the matrix. Thus, the overall embedding ui,r = σ (Ŵ(k ) · mean({uj,r , v j ∈ Ni,r })). (11)
of node vi for edge type r is:
The mean aggregator is basically a matrix multiplication,
vi,r = bi + α r MTr Ui ai,r , (6)
(k −1) (k−1)
meank (vi ) = mean({uj,r , v j ∈ Ni,r }) = (Ur Nr )i , (12)
where bi is the base embedding for node vi , α r is a hyper-parameter
denoting the importance of edge embeddings towards the overall (k )
where Nr is the neighborhood matrix on edge type r , Ur =
embedding and Mr ∈ Rs×d is a trainable transformation matrix. (k ) (k)
(u1,r , ..., un,r ) is the k-th level edge embedding of all nodes in
Connection with Previous Work. We choose MNE [43], a recent
Nr )i denotes the i th column
(k −1)
representative work for MHEN, as the base model for multiplex the graph on edge type r , and (Ur
(k −1)
heterogeneous networks to discuss the connection between our of Ur Nr . Nr can be a normalized adjacency matrix. The mean
proposed model and previous work. In GATNE-T, we use the atten- operator of Equation (11) can be weighted and the neighborhood
tion mechanism to capture the influential factors between different matrix Ni,r can be sampled. Take Ŵ(k ) = I, where I is an identity
(k )
edge types. We theoretically prove that our transductive model is a matrix, then ui,r = σ (meank (vi )). If Nr is of full rank, then for any
more general form of MNE and improves the expressiveness. For (k−1) (k−1)
Or = (o1,r , . . . , on,r ), there exists Ur such that Ur Nr = Or .
MNE, the overall embedding ṽi,r of node vi on edge type r ∈ R is
If the activation function σ is an automorphism, i.e., σ : R → R
ṽi,r = bi + α r XTr oi,r , (7) and Nr is of full rank, we can use the construction method described
in Theorem 4.1 to construct ui,r and the above method to construct
where Xr is a edge-specific transformation matrix. And for GATNE- (K −1) (0)
T, the overall embedding for node vi on edge type r is: each level of edge embeddings ui,r , ..., ui,r subsequently. There-
fore, our model is still a more general form that can generalize
m
the MNE model, when all the neighborhood matrices Nr and the
vi,r = bi + α r MTr Ui ai,r = bi + α r MTr
Õ
λp ui,p , (8) activation function σ are invertible in all levels of aggregation.
p=1

where λp denotes the p-th element of ai,r and is computed as: 4.2 Inductive Model: GATNE-I
exp(wTr tanh (Wr ui,p )) The limitation of GATNE-T is that it cannot handle unobserved data.
λp = Í T
. (9) However, in many real-world applications, the networked data is
t exp(wr tanh (Wr ui,t ))
often partially observed [42]. We then extend our model to the
Theorem 4.1. For any r ∈ R, there exist parameters wr and inductive context and present a new model named GATNE-I. The
Wr , such that for any oi,1 , . . . , oi,m , and corresponding matrix Xr , model is also illustrated in Figure 2. Specifically, we define the base
with dimension s for each oi, j and size s × d for Xr , there exist embedding bi as a parameterized function of vi ’s attribute xi as
ui,1 , . . . , ui,m , and corresponding matrix Mr , with dimension s + m bi = hz (xi ), where hz is a transformation function and z = ϕ(i) is
for each ui, j and size (s + m) × d for Mr , that satisfy vi,r ≈ ṽi,r . node vi ’s corresponding node type. Notice that nodes with different
types may have different dimensions of their attributes xi . The
Proof. For any t, let ui,t = (oTi,t , ui,t
′T )T concatenated by two
transformation function hz can have different forms such as a
vectors, where the first s dimension is oi,t , and ui,t ′ is an m-
multi-layer perceptron [26]. Similarly, the initial edge embeddings
dimensional vector. Let ui,t,k denote the k-th dimension of ui,t
′ ′ , (0)
ui,r for node i on edge type r should be also parameterized as
then take ui,t,k = δ t k , where δ is the Kronecker delta function as

(0)
the function of attributes xi as ui,r = gz,r (xi ), where gz,r is also
δi j = 1, i = j; δi j = 0, i , j. (10) a transformation function that transforms the feature to an edge
embedding of node vi on the edge type r and z is vi ’s corresponding
Let Wr be all zero, except for the element on row 1 and column s +r node type. To be more specific, for the inductive model, we also
with a large enough number M; therefore tanh(Wr ui,p ) becomes a add an extra attribute term to the overall embedding of node vi on
vector with its 1st dimension approximately being δr p , and other type r :
dimensions being 0. Take wr as a vector with its 1st dimension
being M, then exp(wTr tanh (Wr ui,p )) ≈ exp(Mδr p ), so λp ≈ δr p . vi,r = hz (xi ) + α r MTr Ui ai,r + βr DTz xi , (13)
Finally we take Mr being Xr at its first s × d dimension, and 0 on where βr is a coefficient and Dz is a feature transformation matrix
the following m × d dimension, and we can get vi,r ≈ ṽi,r . □ on vi ’s corresponding node type z. The difference between our
Algorithm 1: GATNE Table 3: Statistics of Datasets.
Input: network G = (V, E, A), embedding dimension d, edge
embedding dimension s, learning rate η, negative Dataset # nodes # edges # n-types # e-types
samples L, coefficient α, β. Amazon 10,166 148,865 1 2
Output: overall embeddings vi,r for all nodes on every edge YouTube 2,000 1,310,617 1 5
type r Twitter 10,000 331,899 1 4
1 Initialize all the model parameters θ . Alibaba-S 6,163 17,865 2 4
2 Generate random walks on each edge type r as Pr . Alibaba 41,991,048 571,892,183 2 4
3 Generate training samples {(v i , v j , r )} from random walks Pr
on each edge type r .
4 while not converged do
Thus, given a node vi with its context C of a path, our objective
5 foreach (vi , v j , r ) ∈ training samples do is to minimize the following negative log-likelihood:
Õ
6 Calculate vi,r using Equation (6) or (13) − log Pθ {v j |v j ∈ C}|vi =

− log Pθ (v j |vi ), (15)
7 Sample L negative samples and calculate objective v j ∈C
function E using Equation (17)
where θ denotes all the parameters. Following metapath2vec [7] we
8 Update model parameters θ by ∂E ∂θ . use the heterogeneous softmax function which is normalized with
respect to the node type of node v j . Specifically, the probability of
v j given vi is defined as:
exp(cTj · vi,r )
transductive and inductive model mainly lies on how the base em- Pθ (v j |vi ) = Í , (16)
(0) k ∈Vt exp(cTk · vi,r )
bedding bi and initial edge embeddings ui,r are generated. In our
transductive model, the base embedding bi and initial edge embed- where v j ∈ Vt , ck is the context embedding of node vk and vi is
(0)
ding ui,r are directly trained for each node based on the network the overall embedding of node vi for edge type r .
structure, and the transductive model cannot handle nodes that are Finally, we use heterogeneous negative sampling to approximate
not seen during training. As for our inductive model, instead of the objective function − log Pθ (v j |vi ) for each node pair (vi , v j ) as:
(0)
training bi and ui,r directly for each node, we train transformation L
E = − log σ (cTj · vi,r ) − Evk ∼Pt (v) [log σ (−cTk · vi,r )],
Õ
functions hz and gz,r that transforms the raw feature xi to bi and (17)
(0) l =1
ui,r , which works for nodes that did not appear during training as
long as they have corresponding raw features. where σ (x) = 1/(1 + exp(−x)) is the sigmoid function, L is the num-
ber of negative samples correspond to a positive training sample,
and vk is randomly drawn from a noise distribution Pt (v) defined
4.3 Model Optimization on node v j ’s corresponding node set Vt .
We discuss how to learn the proposed transductive and inductive We summarize our algorithm in Algorithm 1. The time complex-
models. Following [10, 27, 35], we use random walk to generate ity of our random walk based algorithm is O(nmdL) where n is the
node sequences and then perform skip-gram [24, 25] over the node number of nodes, m is the number of edge types, d is overall embed-
sequences to learn embeddings. Since each view of the input net- ding size, L is the number of negative samples per training sample
work is heterogeneous, we use meta-path-based random walks [7]. (L ≥ 1). The memory complexity of our algorithm is O(n(d +m ×s))
To be specific, given a view r of the network, i.e., G r = (V, Er , A) with s being the size of edge embedding.
and a meta-path scheme T : V1 → V2 → · · · Vt · · · → Vl , where
l is the length of the meta-path scheme, the transition probability 5 EXPERIMENTS
at step t is defined as follows: In this section, we first introduce the details of four evaluation
datasets and the competitor algorithms. We focus on the link pre-
 1
 |Ni, r ∩Vt +1 | (vi , v j ) ∈ Er , v j ∈ Vt +1 ,
 diction task to evaluate the performances of our proposed methods
p(v j |vi , T ) =

0 (vi , v j ) ∈ Er , v j < Vt +1 , (14) compared to other state-of-the-art methods. Parameter sensitiv-
(vi , v j ) < Er ,


 0 ity, convergence, and scalability are then discussed. Finally, we
report the results of offline A/B tests of our method on Alibaba’s
where vi ∈ Vt and Ni,r denotes the neighborhood of node vi recommendation system.
on edge type r . The flow of the walker is conditioned on the pre-
defined meta path T . The meta-path-based random walk strategy 5.1 Datasets
ensures that the semantic relationships between different types
We work on three public datasets and the Alibaba dataset for the link
of nodes can be properly incorporated into skip-gram model [7].
prediction task. Amazon Product Dataset2 [13, 23] includes product
Supposing the random walk with length l on edge type r follows
metadata and links between products; YouTube dataset3 [36, 38]
a path P = (vp1 , . . . , vpl ) such that (vpt −1 , vpt ) ∈ Er (t = 2 . . . l),
denote vpt ’s context as C = {vpk |vpk ∈ P, |k −t | ≤ c, t , k }, where 2 http://jmcauley.ucsd.edu/data/amazon/

c is the radius of the window size. 3 http://socialcomputing.asu.edu/datasets/YouTube


consists of various types of interactions; Twitter dataset4 [6] also MNE uses one common embedding and several additional embed-
contains various types of links. Alibaba dataset has two node types, dings for each edge type, which are jointly learned by a unified
user and item (or product), and includes four types of interactions network embedding model.
between users and items. Since some of the baselines cannot scale
Attributed Network Embedding Methods. The compared
to the whole graph, we evaluate performances on sampled datasets.
method is ANRL [44]. ANRL uses a neighbor enhancement auto-
The statistics of these four sampled datasets are summarized in
encoder to model the node attribute information and an attribute-
Table 3. Notice that n-types and e-types in the table denote node
aware skip-gram model based on the attribute encoder to capture
types and edge types, respectively.
the network structure.
Amazon. In our experiments, we only use the product metadata
Our Methods. Our proposed methods include GATNE-T and
of Electronics category, including the product attributes and co-
GATNE-I. GATNE-T considers the network structure and uses base
viewing, co-purchasing links between products. The product at-
embeddings and edge embeddings to capture the influential factors
tributes include the price, sales-rank, brand, and category.
between different edge types. GATNE-I considers both the network
YouTube. YouTube dataset is a multiplex bidirectional network structure and the node attributes, and learns an inductive trans-
dataset that consists of five types of interactions between 15,088 formation function instead of learning base embeddings and meta
YouTube users. The types of edges include contact, shared friends, embeddings for each node directly. For Alibaba dataset, we use the
shared subscription, shared subscriber, and shared favorite videos same meta-path schemes as metapath2vec. For some datasets with-
between users. out node attributes, we also generate node features for them. Due
to the size of the Alibaba dataset with more than 40 million nodes
Twitter. Twitter dataset is about tweets related to the discovery
and 500 million edges and the scalabilities of the other competitors,
of the Higgs boson between 1st and 7th, July 2012. It is made up of
we only compare our GATNE model with DeepWalk, MVE, and
four directional relationships between more than 450,000 Twitter
MNE. Specific implementations can be found in the Appendix.
users. The relationships are re-tweet, reply, mention, and friend-
ship/follower relationships between Twitter users.
5.3 Performance Analysis
Alibaba. Alibaba dataset consists of four types of interactions
including click, add-to-preference, add-to-cart, and conversion be- Link prediction is a common task in both academia and industry.
tween two types of nodes, user and item. The sampled Alibaba For academia, it is widely used to evaluate the quality of network
dataset is denoted as Alibaba-S. We also provide the evaluation of embeddings obtained by different methods. In the industry, link
the whole dataset on Alibaba’s distributed cloud platform; the full prediction is a very demanding task since in real-world scenarios
dataset is denoted as Alibaba. we are usually facing graphs with partial links, especially for e-
commerce companies that rely on the links between their users
and items for recommendations. We hide a set of edges/non-edges
5.2 Competitors from the original graph and train on the remaining graph. Follow-
We categorize our competitors into the following four groups. The ing [2, 18], we create a validation/test set that contains 5%/10%
overall embedding size is set to 200 for all methods. The specific randomly selected positive edges respectively with the equivalent
hyper-parameter settings for different methods are listed in the number of randomly selected negative edges for each edge type.
Appendix. The validation set is used for hyper-parameter tuning and early
Network Embedding Methods. The compared methods include stopping. The test set is used to evaluate the performance and is
DeepWalk [27], LINE [35], and node2vec [10]. As these methods only run once under the tuned hyper-parameter. We use some com-
can only deal with HON, we feed separate graphs with different monly used evaluation criteria, i.e., the area under the ROC curve
edge types to them and obtain different node embeddings for each (ROC-AUC) [12] and the PR curve (PR-AUC) [5] in our experiments.
separate graph. We also use F1 score as the other metric for evaluation. To avoid
the thresholding effect [37], we assume that the number of hidden
Heterogeneous Network Embedding Methods. We focus on edges in the test set is given [27, 29, 37]. All of these metrics are
the representative work metapath2vec [7], which is designed to uniformly averaged among the selected edge types.
deal with the node heterogeneity. When there is only one node type
in the network, metapath2vec degrades to DeepWalk. For Alibaba Quantitative Results. The experimental results of three public
dataset, the meta-path schemes are set to U − I − U and I − U − I , datasets and Alibaba-S are shown in Table 4. GATNE outperforms
where U and I denote User and Item respectively. all sorts of baselines in the various datasets. GATNE-T obtains
better performance than GATNE-I on Amazon dataset as the node
Multiplex Heterogeneous Network Embedding Methods. attributes are limited. The node attributes of Alibaba dataset are
The compared methods include PMNE [22], MVE [30], MNE [43]. abundant so that GATNE-I obtains the best performance. ANRL is
We denote the three methods of PMNE as PMNE(n), PMNE(r) and very sensitive to the weak node attributes and obtains the worst
PMNE(c) respectively. MVE uses collaborated context embeddings result on Amazon dataset. The different node attributes of users
and applies an attention mechanism to view-specific embedding. and items also limit the performance of ANRL on Alibaba-S dataset.
On YouTube and Twitter datasets, GATNE-I performs similarly to
GATNE-T as the node attributes of these two datasets are the node
4 https://snap.stanford.edu/data/higgs-twitter.html embeddings of DeepWalk, which are generated by the network
Table 4: Performance comparison of different methods on four datasets.

Amazon YouTube Twitter Alibaba-S


ROC-AUC PR-AUC F1 ROC-AUC PR-AUC F1 ROC-AUC PR-AUC F1 ROC-AUC PR-AUC F1
DeepWalk 94.20 94.03 87.38 71.11 70.04 65.52 69.42 72.58 62.68 59.39 60.62 56.10
node2vec 94.47 94.30 87.88 71.21 70.32 65.36 69.90 73.04 63.12 62.26 63.40 58.49
LINE 81.45 74.97 76.35 64.24 63.25 62.35 62.29 60.88 58.18 53.97 54.65 52.85
metapath2vec 94.15 94.01 87.48 70.98 70.02 65.34 69.35 72.61 62.70 60.94 61.40 58.25
ANRL 71.68 70.30 67.72 75.93 73.21 70.65 70.04 67.16 64.69 58.17 55.94 56.22
PMNE(n) 95.59 95.48 89.37 65.06 63.59 60.85 69.48 72.66 62.88 62.23 63.35 58.74
PMNE(r) 88.38 88.56 79.67 70.61 69.82 65.39 62.91 67.85 56.13 55.29 57.49 53.65
PMNE(c) 93.55 93.46 86.42 68.63 68.22 63.54 67.04 70.23 60.84 51.57 51.78 51.44
MVE 92.98 93.05 87.80 70.39 70.10 65.10 72.62 73.47 67.04 60.24 60.51 57.08
MNE 90.28 91.74 83.25 82.30 82.18 75.03 91.37 91.65 84.32 62.79 63.82 58.74
GATNE-T 97.44 97.05 92.87 84.61 81.93 76.83 92.30 91.77 84.96 66.71 67.55 62.48
GATNE-I 96.25 94.77 91.36 84.47 82.32 76.83 92.04 91.95 84.38 70.87 71.65 65.54

Table 5: The experimental results on Alibaba dataset.


72 72

70 70
ROC-AUC PR-AUC F1

ROC-AUC

ROC-AUC
DeepWalk 65.58 78.13 70.14 68 68

MVE 66.32 80.12 72.14 66 66

MNE 79.60 93.01 84.86 GATNE-T GATNE-T


64 GATNE-I 64 GATNE-I
GATNE-T 81.02 93.39 86.65
20 50 100 200 500 2 5 10 20 50
GATNE-I 84.20 95.04 89.94 Base Embedding Dimension Edge Embedding Dimension

(a) Base embedding dimension (b) Edge embedding dimension

90
7
GATNE-T
85
Figure 4: The performance of GATNE-T and GATNE-I on
Training Time (Hours)

80
6 GATNE-I
5 Alibaba-S when changing base/edge embedding dimensions
ROC-AUC

75

70
4 exponentially.
65 3

60 2
GATNE-I
55 GATNE-T 1
achieves better performance than GATNE-T on extremely large-
50 0
0 5 10 15 20 25 40 60 80 100 120 140 160 scale real-world datasets.
Millions of Training Batches Number of Workers

(a) Convergence (b) Scalability Scalability Analysis. We investigate the scalability of GATNE
that has been deployed on multiple workers for optimization. Fig-
ure 3(b) shows the speedup w.r.t. the number of workers on the
Figure 3: (a) The convergence curve for GATNE-T and Alibaba dataset. The figure shows that GATNE is quite scalable on
GATNE-I models on Alibaba dataset. The inductive model the distributed platform, as the training time decreases significantly
converges faster and achieves better performance than the when we add up the number of workers, and finally, the inductive
transductive model. (b) The training time decreases as the model takes less than 2 hours to converge with 150 distributed
number of workers increases. GATNE-I takes less training workers. We also find that GATNE-I’s training speed increases al-
time to converge compared with GATNE-T. most linearly as the number of workers increases but less than 150.
While GATNE-T converges slower and its training speed is about
structure. Table 5 lists the experimental results of Alibaba dataset. to reach a limit when the number of workers being larger than 100.
GATNE scales very well and achieves state-of-the-art performance Besides the state-of-the-art performance, GATNE is also scalable
on Alibaba dataset, with 2.18% performance lift in PR-AUC, 5.78% enough to be adopted in practice.
in ROC-AUC, and 5.99% in F1-score, compared with best results
Parameter Sensitivity. We investigate the sensitivity of different
from previous state-of-the-art algorithms. The GATNE-I performs
hyper-parameters in GATNE, including overall embedding dimen-
better than GATNE-T model in the large-scale dataset, suggesting
sion d and edge embedding dimension s. Figure 4 illustrates the
that the inductive approach works better on large-scale attributed
performance of GATNE when altering the base embedding dimen-
multiplex heterogeneous networks, which is usually the case in
sion d or edge embedding dimension s from the default setting
real-world situations.
(d = 200, s = 10). We can conclude that the performance of GATNE
Convergence Analysis. We analyze the convergence properties is relatively stable within a large range of base/edge embedding
of our proposed models on Alibaba dataset. The results, as shown dimensions, and the performance drops when the base/edge em-
in Figure 3(a), demonstrate that GATNE-I converges faster and bedding dimension is either too small or too large.
5.4 Offline A/B Tests [13] Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual
evolution of fashion trends with one-class collaborative filtering. In WWW’16.
We deploy our inductive model GATNE-I on Alibaba’s distributed 507–517.
cloud platform for its recommendation system. The training dataset [14] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural
computation 9, 8 (1997), 1735–1780.
has about 100 million users and 10 million items, with 10 billion [15] Xiao Huang, Jundong Li, and Xia Hu. 2017. Accelerated attributed network
interactions between them per day. We use the model to generate embedding. In SDM’17. SIAM, 633–641.
embedding vectors for users and items. For every user, we use K [16] Xiao Huang, Jundong Li, and Xia Hu. 2017. Label informed attributed network
embedding. In WSDM’17. ACM, 731–739.
nearest neighbor (KNN) with Euclidean distance to calculate the [17] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic opti-
top-N items that the user is most likely to click. The experimental mization. arXiv preprint arXiv:1412.6980 (2014).
goal is to maximize top-N hit-rate. Under the framework of A/B [18] Thomas N Kipf and Max Welling. 2016. Variational graph auto-encoders. arXiv
preprint arXiv:1611.07308 (2016).
tests, we conduct an offline test on GATNE-I, MNE, and DeepWalk. [19] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with
The results demonstrate that GATNE-I improves hit-rate by 3.26% Graph Convolutional Networks. In ICLR’17.
[20] Lizi Liao, Xiangnan He, Hanwang Zhang, and Tat-Seng Chua. 2018. Attributed
and 24.26% compared to MNE and DeepWalk, respectively. social network embedding. TKDE 30, 12 (2018), 2257–2270.
[21] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang,
Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence
6 CONCLUSION embedding. ICLR’17.
[22] Weiyi Liu, Pin-Yu Chen, Sailung Yeung, Toyotaro Suzumura, and Lingli Chen.
In our paper, we formalized the attributed multiplex heterogeneous 2017. Principled multilayer network embedding. In ICDMW’17. IEEE, 134–141.
network embedding problem and proposed GATNE to solve it with [23] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel.
both transductive and inductive settings. We split the overall node 2015. Image-based recommendations on styles and substitutes. In SIGIR’15. ACM,
43–52.
embedding of GATNE-I into three parts: base embedding, edge [24] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient
embedding, and attribute embedding. The base embedding and estimation of word representations in vector space. ICLR’13.
attribute embedding are shared among edges of different types, [25] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013.
Distributed representations of words and phrases and their compositionality. In
while the edge embedding is computed by aggregation of neighbor- NIPS’13. 3111–3119.
hood information with the self-attention mechanism. Our proposed [26] Sankar K Pal and Sushmita Mitra. 1992. Multilayer Perceptron, Fuzzy Sets,
Classifiaction. (1992).
methods achieve significantly better performances compared to [27] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning
previous state-of-the-art methods on link prediction tasks across of social representations. In KDD’14. ACM, 701–710.
multiple challenging datasets. The approach has been successfully [28] Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Chi Wang, Kuansan Wang, and
Jie Tang. 2019. NetSMF: Large-Scale Network Embedding as Sparse Matrix
deployed and evaluated on Alibaba’s recommendation system with Factorization. In WWW’19.
excellent scalability and effectiveness. [29] Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. 2018.
Network embedding as matrix factorization: Unifying deepwalk, line, pte, and
node2vec. In WSDM’18. ACM, 459–467.
ACKNOWLEDGMENTS [30] Meng Qu, Jian Tang, Jingbo Shang, Xiang Ren, Ming Zhang, and Jiawei Han.
2017. An Attention-based Collaboration Framework for Multi-View Network
We thank Qibin Chen, Ming Ding, Chang Zhou, and Xiaonan Fang Representation Learning. In CIKM’17. ACM, 1767–1776.
[31] Chuan Shi, Binbin Hu, Xin Zhao, and Philip Yu. 2018. Heterogeneous Information
for their comments. The work is supported by the NSFC for Distin- Network Embedding for Recommendation. TKDE (2018).
guished Young Scholar (61825602), NSFC (61836013), and a research [32] Yu Shi, Fangqiu Han, Xinran He, Carl Yang, Jie Luo, and Jiawei Han. 2018.
fund supported by Alibaba Group. mvn2vec: Preservation and Collaboration in Multi-View Network Embedding.
arXiv preprint arXiv:1801.06597 (2018).
[33] Yizhou Sun, Brandon Norick, Jiawei Han, Xifeng Yan, Philip S Yu, and Xiao
Yu. 2013. Pathselclus: Integrating meta-path selection with user-guided object
REFERENCES clustering in heterogeneous information networks. TKDD 7, 3 (2013), 11.
[1] Smriti Bhagat, Graham Cormode, and S Muthukrishnan. 2011. Node classification [34] Jian Tang, Meng Qu, and Qiaozhu Mei. 2015. Pte: Predictive text embedding
in social networks. In Social network data analytics. Springer, 115–148. through large-scale heterogeneous text networks. In KDD’15. ACM, 1165–1174.
[2] Aleksandar Bojchevski and Stephan GÃijnnemann. 2018. Deep Gaussian Embed- [35] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015.
ding of Graphs: Unsupervised Inductive Learning via Ranking. In ICLR’18. Line: Large-scale information network embedding. In WWW’15. 1067–1077.
[3] Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C Aggarwal, and [36] Lei Tang and Huan Liu. 2009. Uncovering cross-dimension group structures in
Thomas S Huang. 2015. Heterogeneous network embedding via deep archi- multi-dimensional networks. In SDM workshop on Analysis of Dynamic Networks.
tectures. In KDD’15. ACM, 119–128. ACM, 568–575.
[4] Peng Cui, Xiao Wang, Jian Pei, and Wenwu Zhu. 2018. A survey on network [37] Lei Tang, Suju Rajan, and Vijay K Narayanan. 2009. Large scale multi-label
embedding. TKDE (2018). classification via metalabeler. In WWW’09. ACM, 211–220.
[5] Jesse Davis and Mark Goadrich. 2006. The relationship between Precision-Recall [38] Lei Tang, Xufei Wang, and Huan Liu. 2009. Uncoverning groups via heteroge-
and ROC curves. In ICML’06. ACM, 233–240. neous interaction analysis. In ICDM’09. IEEE, 503–512.
[6] Manlio De Domenico, Antonio Lima, Paul Mougel, and Mirco Musolesi. 2013. [39] Ben Taskar, Ming-Fai Wong, Pieter Abbeel, and Daphne Koller. 2004. Link
The anatomy of a scientific rumor. Scientific reports 3 (2013), 2980. prediction in relational data. In NIPS’04. 659–666.
[7] Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. 2017. metapath2vec: [40] Jizhe Wang, Pipei Huang, Huan Zhao, Zhibo Zhang, Binqiang Zhao, and Dik Lun
Scalable representation learning for heterogeneous networks. In KDD’17. ACM, Lee. 2018. Billion-scale Commodity Embedding for E-commerce Recommendation
135–144. in Alibaba. KDD’18, 839–848.
[8] Santo Fortunato. 2010. Community detection in graphs. Physics reports 486, 3-5 [41] Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Y Chang. 2015.
(2010), 75–174. Network representation learning with rich text information.. In IJCAI’15. 2111–
[9] Hongchang Gao and Heng Huang. 2018. Deep Attributed Network Embedding.. 2117.
In IJCAI’18. 3364–3370. [42] Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. 2016. Revisiting
[10] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for semi-supervised learning with graph embeddings. In ICML’16. 40–48.
networks. In KDD’16. ACM, 855–864. [43] Hongming Zhang, Liwei Qiu, Lingling Yi, and Yangqiu Song. 2018. Scalable
[11] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation Multiplex Network Embedding. In IJCAI’18. 3082–3088.
learning on large graphs. In NIPS’17. 1024–1034. [44] Zhen Zhang, Hongxia Yang, Jiajun Bu, Sheng Zhou, Pinggang Yu, Jianwei Zhang,
[12] James A Hanley and Barbara J McNeil. 1982. The meaning and use of the area Martin Ester, and Can Wang. 2018. ANRL: Attributed Network Representation
under a receiver operating characteristic (ROC) curve. Radiology 143, 1 (1982), Learning via Deep Neural Networks.. In IJCAI’18. 3155–3161.
29–36.
A APPENDIX We use the default setting of Adam optimizer in TensorFlow; the
In the appendix, we first give the implementation notes of our learning rate is set to lr = 0.001. For offline A/B test in section 5.4,
proposed models. The detailed descriptions of datasets and the we use N = 50.
parameter configurations of all methods are then given. Finally, we Code and Dataset Releasing Details. The codes of our pro-
discuss the questions about fair comparison and our future work. posed models on the single Linux server (based on Tensorflow
1.12), together with our partition of the three public datasets and
A.1 Implementation Notes the Alibaba-S dataset are available.
Running Environment. The experiments in this paper can be
divided into two parts. One is conducted on four datasets using a A.2 Compared Methods
single Linux server with 4 Intel(R) Xeon(R) Platinum 8163 CPU @ We give the detailed running configuration about all compared
2.50GHz, 512G RAM and 8 NVIDIA Tesla V100-SXM2-16GB. The methods as follows. The embedding size is set to 200 for all methods.
codes of our proposed models in this part are implemented with For random-walk based methods, the number of walks for each
TensorFlow5 1.12 in Python 3.6. The other part is conducted on node is set to 20 and the length of walks is set to 10. The window
the full Alibaba dataset using Alibaba’s distributed cloud platform size is set to 5 for generating node contexts. The number of negative
which contains thousands of workers. Every two workers share an samples for each training pairs is set to 5. The number of iterations
NVIDIA Tesla P100 GPU with 16GB memory. Our proposed models for training the skip-gram model is set to 100. The code sources
are implemented with TensorFlow 1.4 in Python 2.7 in this part. and other specific hyper-parameter settings of compared methods
are explained below.
Implementation Details. Our codes used by single Linux server
can be split into three parts: random walk, model training and A.2.1 Network Embedding Methods.
evaluation. The random walk part is implemented referring to the
• DeepWalk [27]. For public and Alibaba-S datasets, we use
corresponding part of DeepWalk6 and metapath2vec7 . The training
the codes from the corresponding author’s GitHub6 . For
part of the model is implemented referring to the word2vec part
Alibaba dataset, we re-implemented DeepWalk on Alibaba
of TensorFlow tutorials8 . The evaluation part uses some metric
distributed cloud platform.
functions from scikit-learn9 including roc_auc_score, f1_score, preci-
• LINE [35]. The codes of LINE are from the corresponding
sion_recall_curve, auc. Our model parameters are updated and opti-
author’s GitHub10 . We use the LINE(1st+2nd) as the overall
mized by stochastic gradient descent with Adam updating rule [17].
embeddings. The embedding size is set to 100 both for first-
The distributed version of our proposed models is implemented
order and second-order embeddings. The number of samples
based on the coding rules of Alibaba’s distributed cloud platform
is set to 1000 million.
in order to maximize the distribution efficiency. High-level APIs,
• node2vec [10]. The codes of node2vec are from the corre-
such as tf.estimator and tf.data, are used for the higher coefficient
sponding author’s GitHub11 . Node2vec adds two parameters
of utilization of computation resources in the Alibaba’s distributed
to control the random walk process. The parameter p is set
cloud platform.
to 2 and the parameter q is set to 0.5 in our experiments.
Function Selection. Many different aggregator functions in Equa-
tion (1), such as the mean aggregator (Cf. Equation (2)) or pooling A.2.2 Heterogeneous Network Embedding Methods.
aggregator (Cf. Equation (3)), achieve similar performance in our • metapath2vec [7]. The codes provided by the correspond-
experiments. Mean aggregator is finally used to be reported in the ing author are only for specific datasets and could not directly
quantitative experiments in our model. We use the linear transfor- generalize to other datasets. We re-implement metapath2vec
mation function as the parameterized function of attributes hz and for networks with arbitrary node types in Python based
gz,r in Equation (13) of our inductive model GATNE-I. on the original C++ codes12 . As the number of node types
Parameter Configuration. Our base/overall embedding dimension d is set to 200 and the edge embedding dimension s is set to 10. The number of walks for each node is set to 20 and the length of each walk is set to 10. The window size is set to 5 for generating node contexts. The number of negative samples L for each positive training sample is set to 5. The maximum number of epochs is set to 50, and our models stop early if the ROC-AUC on the validation set does not improve within 1 training epoch. The coefficients α_r and β_r are all set to 1 for every edge type r. For the Alibaba dataset, the node types include U and I, representing User and Item respectively, and the meta-path schemes of our methods are set to U-I-U and I-U-I.
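The early-stopping rule can be sketched as follows; train_one_epoch and validation_auc are hypothetical stand-ins for the actual training pass and validation evaluation.

```python
import random

def train_one_epoch(model):
    # Placeholder for one Adam-optimized pass over the training pairs.
    model["epoch"] += 1

def validation_auc(model):
    # Placeholder: pretend ROC-AUC improves for a few epochs, then plateaus.
    return min(0.9, 0.5 + 0.1 * model["epoch"]) + random.uniform(-0.01, 0.01)

def fit(model, max_epochs=50, patience=1):
    """Stop when validation ROC-AUC fails to improve for `patience` epochs."""
    best_auc, waited = 0.0, 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        val_auc = validation_auc(model)
        if val_auc > best_auc:
            best_auc, waited = val_auc, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_auc

print(fit({"epoch": 0}))
```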
For random-walk based methods, the number of walks for each node is set to 20 and the length of each walk is set to 10. The window size is set to 5 for generating node contexts. The number of negative samples for each training pair is set to 5. The number of iterations for training the skip-gram model is set to 100. The code sources and other specific hyper-parameter settings of the compared methods are explained below.
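To make these shared settings concrete, the sketch below generates uniform random walks and skip-gram (center, context) training pairs with the parameters above; the adjacency-dict graph representation is an illustrative assumption, not the code of any particular baseline.

```python
import random

def random_walks(adj, num_walks=20, walk_length=10, seed=42):
    """Generate uniform random walks starting from every node.

    adj: dict mapping each node to a list of its neighbors.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            while len(walk) < walk_length and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks

def skipgram_pairs(walks, window=5):
    """Emit (center, context) pairs within the given window size."""
    for walk in walks:
        for i, center in enumerate(walk):
            for j in range(max(0, i - window), min(len(walk), i + window + 1)):
                if i != j:
                    yield center, walk[j]

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
pairs = list(skipgram_pairs(random_walks(adj)))
print(len(pairs), pairs[:5])
```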
A.2.1 Network Embedding Methods.
• DeepWalk [27]. For the public and Alibaba-S datasets, we use the code from the corresponding author's GitHub⁶. For the Alibaba dataset, we re-implemented DeepWalk on the Alibaba distributed cloud platform.
• LINE [35]. The code of LINE is from the corresponding author's GitHub¹⁰. We use LINE(1st+2nd) as the overall embeddings. The embedding size is set to 100 for both the first-order and second-order embeddings. The number of samples is set to 1000 million.
• node2vec [10]. The code of node2vec is from the corresponding author's GitHub¹¹. Node2vec adds two parameters, p and q, to control the random walk process; p is set to 2 and q is set to 0.5 in our experiments (a sketch of the resulting walk bias follows this list).
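As a reminder of what p and q control, the sketch below gives the unnormalized transition weights of node2vec's second-order walk; with p = 2 and q = 0.5 the walk is biased toward moving outward. This is a minimal illustration, not the authors' implementation.

```python
def node2vec_weight(prev, curr, nxt, adj, p=2.0, q=0.5):
    """Unnormalized probability of stepping curr -> nxt, given that the
    walk arrived at curr from prev (second-order random walk)."""
    if nxt == prev:          # returning to the previous node
        return 1.0 / p
    if nxt in adj[prev]:     # nxt is at distance 1 from prev
        return 1.0
    return 1.0 / q           # nxt moves further away from prev

adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
# Walk arrived at node 1 from node 0; weights for each candidate next hop:
for nxt in adj[1]:
    print(nxt, node2vec_weight(0, 1, nxt, adj))
```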
A.2.2 Heterogeneous Network Embedding Methods.
• metapath2vec [7]. The code provided by the corresponding author only supports specific datasets and does not directly generalize to other datasets, so we re-implement metapath2vec for networks with arbitrary node types in Python, based on the original C++ code¹². As the three public datasets have only one node type, metapath2vec degrades to DeepWalk on them. For the Alibaba dataset, the node types include U and I, representing User and Item respectively, and the meta-path schemes are set to U-I-U and I-U-I (see the sketch after this list).
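A minimal sketch of a meta-path-guided walk for the U-I-U scheme is given below; the node-type bookkeeping is simplified relative to our re-implementation.

```python
import random

def metapath_walk(start, adj, node_type, schema=("U", "I"), length=10, seed=0):
    """Walk that only follows neighbors whose type matches the repeating
    meta-path schema, e.g. U-I-U-I-... for schema ("U", "I").

    adj: node -> list of neighbors; node_type: node -> type label.
    """
    rng = random.Random(seed)
    assert node_type[start] == schema[0]
    walk = [start]
    while len(walk) < length:
        wanted = schema[len(walk) % len(schema)]
        candidates = [v for v in adj[walk[-1]] if node_type[v] == wanted]
        if not candidates:
            break
        walk.append(rng.choice(candidates))
    return walk

adj = {"u1": ["i1", "i2"], "u2": ["i1"], "i1": ["u1", "u2"], "i2": ["u1"]}
node_type = {"u1": "U", "u2": "U", "i1": "I", "i2": "I"}
print(metapath_walk("u1", adj, node_type))
```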
A.2.3 Multiplex Heterogeneous Network Embedding Methods.
• PMNE [22]. PMNE proposes three different methods to apply node2vec to multiplex networks. We denote their network aggregation, result aggregation, and layer co-analysis algorithms as PMNE(n), PMNE(r), and PMNE(c) respectively, in accord with the denotations of MNE [43]. We use the code from MNE's GitHub¹³. The probability of traversing layers in PMNE(c) is set to 0.5.
• MVE [30]. MVE uses collaborated context embeddings and applies an attention mechanism to view-specific embeddings. The code of MVE was received from the corresponding author by email. The embedding dimension for each view is set to 200. The number of training samples for each epoch is set to 100 million and the number of epochs is set to 10. For the Alibaba dataset, we re-implemented this method on the Alibaba distributed cloud platform.
• MNE [43]. MNE uses one common embedding and several additional edge-type-specific embeddings for each node, which are jointly learned by a unified network embedding model. The additional embedding size for MNE is set to 10. We use the code released by the corresponding author on GitHub¹³; for the Alibaba dataset, we re-implemented it on the Alibaba distributed cloud platform. A sketch of this embedding structure follows the list.
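To illustrate the embedding structure described for MNE, the sketch below combines a common embedding with a small edge-type-specific embedding through a per-type projection, using the sizes above (common 200, additional 10). The projection matrices, mixing weights, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, d_common, d_extra, num_edge_types = 100, 200, 10, 4

base = rng.normal(size=(num_nodes, d_common))                  # common embedding
extra = rng.normal(size=(num_edge_types, num_nodes, d_extra))  # per-edge-type part
X = rng.normal(size=(num_edge_types, d_extra, d_common))       # per-type projections
w = np.ones(num_edge_types)                                    # mixing weights

def overall_embedding(node, edge_type):
    """Common embedding plus projected edge-type-specific embedding."""
    return base[node] + w[edge_type] * (extra[edge_type, node] @ X[edge_type])

print(overall_embedding(0, 2).shape)  # (200,)
```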
A.2.4 Attributed Network Embedding Methods.
• ANRL [44]. We use the code from Alibaba's GitHub¹⁴. As the YouTube and Twitter datasets do not have node attributes, we generate node attributes for them: specifically, we use the node embeddings (200 dimensions) produced by DeepWalk as the input node features for ANRL on these datasets. For the Alibaba-S and Amazon datasets, we use the raw features as attributes.

A.3 Datasets
Our experiments evaluate on five datasets: four public datasets and the Alibaba dataset. Due to the limited memory and computation resources of a single Linux server, the four public datasets are subgraphs sampled from the original datasets for training and evaluation. Table 6 shows the statistics of the original public datasets.

Table 6: Statistics of Original Datasets.

Dataset   # nodes   # edges      # n-types   # e-types
Amazon    312,320   7,500,100    1           4
YouTube   15,088    13,628,895   1           5
Twitter   456,626   15,367,315   1           4

• Amazon is a dataset of product reviews and metadata from Amazon. In our experiments, we only use the product metadata, including the product attributes and the co-viewing and co-purchasing links between products. The node type set of Amazon is O = {product} and the edge type set is R = {also_bought, also_viewed}, denoting that two products are co-bought or co-viewed by the same user, respectively. The products of Amazon are split into many categories; since the total number of products across all categories is very large, we use the Electronics category for experimentation. The number of products in Electronics is still too large for many algorithms; therefore, we extract a connected subgraph from the whole graph.
• YouTube is a multi-dimensional bidirectional network dataset consisting of 5 types of interactions between 15,088 YouTube users. The edge types are contact, shared friends, shared subscription, shared subscriber, and shared favorite videos between users. It is a multiplex network with |O| = 1 and |R| = 5.
• Twitter is a dataset of tweets posted on Twitter about the discovery of the Higgs boson between the 1st and 7th of July 2012. It is made up of 4 directional relationships between more than 450,000 Twitter users: re-tweet, reply, mention, and friendship/follower relationships. It is a multiplex network with |O| = 1 and |R| = 4.
• Alibaba consists of 4 types of interactions between two types of nodes, user and item: click, add-to-preference, add-to-cart, and conversion. The node type set of Alibaba is O = {user, item} and the size of the edge type set is |R| = 4. The whole Alibaba graph is so large that we cannot evaluate the performance of the different methods on it with a single machine, so we extract a subgraph from the whole graph for comparing the methods, denoted Alibaba-S. We also provide an evaluation on the whole graph using Alibaba's distributed cloud platform; the full graph is denoted Alibaba.

A.4 Discussion
In research on network embedding, link prediction and node classification tasks are commonly used to evaluate the quality of learned representations. However, although there are many commonly used public datasets, such as the Twitter and YouTube datasets, none of them provides a "standard" separation into train, validation, and test sets for the different tasks. This leads to different results on the same dataset under different evaluation separations, so results from previous papers cannot be reused directly, and researchers have to re-implement and run all baselines themselves, which takes attention away from improving their own models.

We therefore appeal to researchers to provide standardized datasets that ship a standard separation into train, validation, and test sets alongside the full dataset. Researchers could then evaluate their methods in a standard environment, and results across papers could be compared directly. This would also help to increase the reproducibility of research.
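As a minimal illustration of the kind of standardized separation we have in mind, the sketch below splits an edge list once with a fixed seed so that every user of the dataset obtains exactly the same train/validation/test sets; the ratios and function name are illustrative assumptions.

```python
import random

def standard_split(edges, seed=2019, ratios=(0.85, 0.05, 0.10)):
    """Deterministically split edges into train/validation/test sets so that
    every paper evaluating on the dataset uses the same separation."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    edges = sorted(edges)               # fixed order before shuffling
    random.Random(seed).shuffle(edges)  # same seed -> same split everywhere
    n = len(edges)
    n_train, n_valid = int(ratios[0] * n), int(ratios[1] * n)
    return (edges[:n_train],
            edges[n_train:n_train + n_valid],
            edges[n_train + n_valid:])

edges = [(u, v, "click") for u in range(5) for v in range(5) if u != v]
train, valid, test = standard_split(edges)
print(len(train), len(valid), len(test))
```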
Future Work. Apart from the heterogeneity of networks, the dynamics of networks are also crucial to network representation learning. There are three ways to capture the dynamic information of networks. Firstly, we can add dynamic information to the node attributes; for example, we can use methods like LSTM [14] to capture the dynamic activities of users. Secondly, the dynamic information, such as the timestamp of each interaction, can be treated as edge attributes. Thirdly, we may consider several snapshots of the network that represent its dynamic evolution. We leave representation learning for the dynamic attributed multiplex heterogeneous network as our future work.
⁵ https://www.tensorflow.org/
⁶ https://github.com/phanein/deepwalk
⁷ https://ericdongyx.github.io/metapath2vec/m2v.html
⁸ https://www.tensorflow.org/tutorials/representation/word2vec
⁹ https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
¹⁰ https://github.com/tangjianpku/LINE
¹¹ https://github.com/aditya-grover/node2vec
¹² https://ericdongyx.github.io/metapath2vec/m2v.html
¹³ https://github.com/HKUST-KnowComp/MNE
¹⁴ https://github.com/cszhangzhen/ANRL