GEMSEC: Graph Embedding with Self Clustering
arXiv:1802.03997v3 [cs.SI] 25 Jul 2019

Abstract—Modern graph embedding procedures can efficiently process graphs with millions of nodes. In this paper, we propose GEMSEC – a graph embedding algorithm which learns a clustering of the nodes simultaneously with computing their embedding. GEMSEC is a general extension of earlier work in the domain of sequence-based graph embedding. GEMSEC places nodes in an abstract feature space where the vertex features minimize the negative log-likelihood of preserving sampled vertex neighborhoods, and it incorporates known social network properties through a machine learning regularization. We present two new social network datasets and show that by simultaneously considering the embedding and clustering problems with respect to social properties, GEMSEC extracts high-quality clusters competitive with or superior to other community detection algorithms. In experiments, the method is found to be computationally efficient and robust to the choice of hyperparameters.

Index Terms—community detection, clustering, node embedding, network embedding, feature extraction.

I. INTRODUCTION

Community detection is one of the most important problems in network analysis due to its wide applications ranging from the analysis of collaboration networks to image segmentation, the study of protein-protein interaction networks in biology, and many others [1], [2], [3]. Communities are usually defined as groups of nodes that are connected to each other more densely than to the rest of the network. Classical approaches to community detection depend on properties such as graph metrics, spectral properties and density of shortest paths [4]. Random walks and randomized label propagation [5], [6] have also been investigated.

Embedding the nodes in a low dimensional Euclidean space enables us to apply standard machine learning techniques. This space is sometimes called the feature space – implying that it represents abstract structural features of the network. Embeddings have been used for machine learning tasks such as labeling nodes, regression, link prediction, and graph visualization; see [7] for a survey. Graph embedding processes usually aim to preserve certain predefined differences between nodes encoded in their embedding distances. For social network embedding, a natural priority is to preserve community membership and enable community detection.

Recently, sequence-based methods have been developed as a way to convert complex, non-linear network structures into formats more compatible with vector spaces. These methods sample sequences of nodes from the graph using a randomized mechanism (e.g. random walks), with the idea that nodes that are "close" in the graph connectivity will also frequently appear close in a sampling of random walks. The methods then proceed to use this random-walk-proximity information as a basis to embed nodes such that socially close nodes are placed nearby. In this category, Deepwalk [8] and Node2Vec [9] are two popular methods.

While these methods preserve the proximity of nodes in the graph sense, they do not have an explicit preference for preserving social communities. Thus, in this paper, we develop a machine learning approach that considers clustering when embedding the network and includes a parameter to control the closeness of nodes in the same community. Figure 1(a) shows the embedding obtained by the standard Deepwalk method, where communities are coherent, but not clearly separated in the embedding. The method described in this paper, called GEMSEC, is able to produce clusters that are tightly embedded and separated from each other (Fig. 1(b)).

Fig. 1. Zachary's Karate club graph [10]. White nodes: instructor's group; blue nodes: president's group. GEMSEC produces embedding with more tightly clustered communities. (a) DeepWalk; (b) GEMSEC.

A. Our Contributions

GEMSEC is an algorithm that considers the two problems of embedding and community detection simultaneously, and as a result, the two solutions of embedding and clustering can inform and improve each other. Through iterations, the embedding converges toward one where nodes are placed close to their neighbors in the network, while at the same time clusters in the embedding space are well separated.

The algorithm is based on the paradigm of sequence-based node embedding procedures that create d dimensional feature representations of nodes in an abstract feature space. Sequence-based node embeddings embed pairs of nodes close to each other if they occur frequently within a small window of each other in a random walk. This problem can be formulated as minimizing the negative log-likelihood of observed neighborhood samples (Sec. III) and is called the skip-gram
optimization [11]. We extend this objective function to include a clustering cost. The formal description is presented in Subsection III-A. The resulting optimization problem is solved with a variant of mini-batch gradient descent [12]. The detailed algorithm is presented in Subsection III-B. By enforcing clustering on the embedding, GEMSEC reveals the natural community structure (e.g. Figure 1). Our approach improves over existing methods of simultaneous embedding and clustering [13], [14], [15] and shows that community sensitivity can be directly incorporated into the skip-gram style optimization to obtain greater accuracy and efficiency.

In social networks, nodes in the same community tend to have similar groups of friends, which is expressed as high neighborhood overlap. This fact can be leveraged to produce clusters that are better aligned with the underlying communities. We achieve this effect using a regularization procedure – a smoothness regularization added to the basic optimization achieves more coherent community detection. The effect can be seen in Figure 3, where a somewhat uncertain community affiliation suggested by the randomized sampling is sharpened by the smoothness regularization. This technique is described in Subsection III-C.

In experimental evaluation we demonstrate that GEMSEC outperforms – in clustering quality – the state-of-the-art neighborhood based [8], [9], multi-scale [16], [17] and community aware embedding methods [13], [14], [15]. We present new social datasets from the streaming service Deezer and show that the clustering can improve music recommendations. The clustering performance of GEMSEC is found to be robust to hyperparameter changes, and the runtime complexity of our method is linear in the size of the graphs.

To summarize, the main contributions of our work are:
1) GEMSEC: a sequence sampling-based learning model which learns an embedding of the nodes at the same time as it learns a clustering of the nodes.
2) Clustering in GEMSEC can be aligned to network neighborhoods by a smoothness regularization added to the optimization. This enhances the algorithm's sensitivity to natural communities.
3) Two new large social network datasets are introduced – from Facebook and Deezer data.
4) Experimental results show that the embedding process runs linearly in the input size. It generally performs well in quality of embedding and in particular outperforms existing methods on cluster quality measured by modularity and subsequent recommendation tasks.

We start with reviewing related work in the area and relation to our approach in the next section. A high-performance Tensorflow reference implementation of GEMSEC and the datasets that we collected can be accessed online at https://github.com/benedekrozemberczki/GEMSEC.

II. RELATED WORK

There is a long line of research in metric embedding – for example, embedding discrete metrics into trees [18] and into vector spaces [19]. Optimization-based representation of networks has been used for routing and navigation in domains such as sensor networks and robotics [20], [21]. Representations in hyperbolic spaces have emerged as a technique to preserve richer network structures [22], [23], [24].

Recent advances in node embedding procedures have made it possible to learn vector features for large real-world graphs [8], [16], [9]. Features extracted with these sequence-based node embedding procedures can be used for predicting social network users' missing age [7], the category of scientific papers in citation networks [17] and the function of proteins in protein-protein interaction networks [9]. Besides supervised learning tasks on nodes, the extracted features can be used for graph visualization [7], link prediction [9] and community detection [13].

Sequence-based embedding commonly considers variations in the sampling strategy that is used to obtain vertex sequences – truncated random walks being the simplest strategy [8]. More involved methods include second-order random walks [9], skips in random walks [17] and diffusion graphs [25]. It is worth noting that these models implicitly approximate matrix factorizations for different matrices that are expensive to factorize explicitly [26].

Our work extends the literature of node embedding algorithms which are community aware. Earlier works in this category did not directly extend the skip-gram embedding framework. M-NMF [14] applies computationally expensive non-negative matrix factorization with a modularity constraint term. The procedure DANMF [15] uses hierarchical non-negative matrix factorization to create community-aware node embeddings. ComE [13] is a more scalable approach, but it assumes that in the embedding space the communities fit a Gaussian structure, and aims to model them by a mixture of Gaussians. In comparison to these methods, GEMSEC provides greater control over community sensitivity of the embedding process, it is independent of the specific neighborhood sampling methods, and is computationally efficient.

III. GRAPH EMBEDDING WITH SELF CLUSTERING

For a graph G = (V, E), a node embedding is a mapping f : V → R^d where d is the dimensionality of the embedding space. For each node v ∈ V we create a d dimensional representation; alternatively, the embedding f is a |V| × d real-valued matrix. In sequence-based embedding, sequences of neighboring nodes are sampled from the graph. Within a sequence, a node v occurs in the context of a window ω within the sequence. Given a sample S of sequences, we refer to the collection of windows containing v as N_S(v). Earlier works have proposed random walks, second-order random walks or branching processes to obtain N_S(v). In our experiments, we used unweighted first- and second-order random walks for node sampling [8], [9].

Our goal is to minimize the negative log-likelihood of observing neighborhoods of source nodes conditional on feature vectors that describe the position of nodes in the embedding space. Formally, the optimization objective is:

    min_f  −  Σ_{v∈V}  log P(N_S(v) | f(v))        (1)

Modeling the conditional probabilities with a softmax over all nodes would lead to O(|V|^2) runtime complexity. Because of this, we approximate the partition function term with negative sampling, which is a form of noise contrastive estimation [11], [27].
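As a concrete illustration of the negative-sampling approximation of Eq. (1), the following is a minimal NumPy sketch. It is illustrative only: the function name, the single shared embedding matrix, and the caller-supplied noise indices are our simplifications; the reference Tensorflow implementation differs in details such as separate source/context parameters and the noise distribution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(f, v, context, noise):
    """Approximate -log P(context | f(v)) for one (source, context) pair.

    f       : |V| x d embedding matrix
    v       : index of the source node
    context : index of one node from a window in N_S(v)
    noise   : indices of sampled noise nodes; these k terms stand in
              for the full partition function over all |V| nodes.
    Returns the sample loss and its gradient with respect to f[v].
    """
    pos = sigmoid(f[context] @ f[v])
    loss = -np.log(pos)
    grad_v = (pos - 1.0) * f[context]
    for u in noise:
        neg = sigmoid(f[u] @ f[v])
        loss -= np.log(1.0 - neg)   # push noise nodes away from f[v]
        grad_v += neg * f[u]
    return loss, grad_v
```

A gradient step f[v] -= lr * grad_v decreases this sample's loss; summing such terms over all windows in N_S(v) and all v ∈ V recovers the objective of Eq. (1) in its negative-sampling form.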
TABLE II. Mean modularity of clusterings on the Facebook datasets. Each embedding experiment was repeated ten times. Errors in the parentheses correspond to two standard deviations. In terms of modularity, Smooth GEMSEC2 outperforms the baselines.
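Cluster quality in these tables is scored by Newman modularity. As a concrete reference for the metric, here is a minimal Python sketch (a toy O(|V|^2) implementation for illustration, not the paper's evaluation code):

```python
import numpy as np

def modularity(adj, clusters):
    """Newman modularity Q = (1/2m) * sum_ij (A_ij - k_i k_j / 2m) * delta(c_i, c_j).

    adj      : symmetric adjacency matrix (numpy array)
    clusters : cluster label per node
    """
    adj = np.asarray(adj, dtype=float)
    degrees = adj.sum(axis=1)       # k_i
    two_m = degrees.sum()           # 2m = sum of all degrees
    q = 0.0
    for i in range(len(adj)):
        for j in range(len(adj)):
            if clusters[i] == clusters[j]:
                q += adj[i, j] - degrees[i] * degrees[j] / two_m
    return q / two_m
```

For example, two triangles joined by a single bridge edge, clustered triangle-by-triangle, give Q = 5/14 ≈ 0.357, while placing every node in one cluster gives Q = 0.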
are concatenated to form a multi-scale representation of nodes.
6) ComE [13]: Uses a Gaussian mixture model to learn an embedding and clustering jointly using random walk features.
7) M-NMF [14]: Factorizes a matrix which is a weighted sum of the first two proximity matrices with a modularity based regularization constraint.
8) DANMF [15]: Decomposes a weighted sum of the first two proximity matrices hierarchically to obtain cluster memberships with an autoencoder-like non-negative matrix factorization model.

Smooth GEMSEC, GEMSEC2 and Smooth GEMSEC2 consistently outperform the neighborhood conserving node embedding methods and the competing community aware methods. The relative advantage of Smooth GEMSEC2 over the benchmarks is highest on the Athletes dataset, where the clustering's modularity is 3.44% higher than the best performing baseline. It is lowest on the Media dataset, with a disadvantage of 0.35% compared to the strongest baseline. Use of smoothness regularization has a sometimes non-significant, but definitely positive effect on the clustering performance of Deepwalk, GEMSEC and GEMSEC2.

D. Sensitivity Analysis for hyperparameters

We tested the effect of hyperparameter changes on clustering performance. The Politicians Facebook graph is embedded with the standard parameter settings, while the initial and final learning rates are set to 10^−2 and 5·10^−3 respectively, the clustering cost coefficient is 0.1, and we perturb certain hyperparameters. The second-order random walks used in-out and return parameters of 4. In Figure 4 each data point represents the mean modularity calculated from 10 experiments. Based on the experimental results we make two observations. First, GEMSEC model variants give high-quality clusters for a wide range of parameter settings. Second, introducing smoothness regularization makes GEMSEC models more robust to hyperparameter changes. This is particularly apparent across varying numbers of clusters. Above a certain threshold, the length of truncated random walks and the number of random walks per source node have only a marginal effect on community detection performance.
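The two sampling hyperparameters discussed above correspond directly to the loops of a first-order truncated random walk sampler. A minimal sketch (our illustrative code, not the reference implementation; the dict-of-neighbor-lists graph format is an assumption):

```python
import random

def sample_walks(graph, walk_length, walks_per_node, seed=0):
    """Sample truncated first-order random walks.

    graph          : dict mapping node -> list of neighbor nodes
    walk_length    : number of nodes per walk (truncation length)
    walks_per_node : how many walks start from each source node
    Each step moves to a uniformly random neighbor; a walk stops
    early only at a node with no neighbors.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for source in graph:
            walk = [source]
            while len(walk) < walk_length:
                neighbors = graph[walk[-1]]
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks
```

Sliding a window of radius ω over these walks yields the neighborhood sample N_S(v) used in the objective; second-order (Node2Vec-style) walks replace the uniform neighbor choice with one biased by the in-out and return parameters.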
Fig. 4. Sensitivity of cluster quality to parameter changes measured by modularity. (Panels: Dimensions, Walk length, Cluster cost coefficient, Number of random walks; curves: GEMSEC, Smooth GEMSEC, GEMSEC2, Smooth GEMSEC2.)

E. Music Genre Recommendation

Node embeddings are often used for extracting features of nodes for downstream predictive tasks. In order to investigate this, we use social networks of Deezer users collected from European countries. We predict the genres (out of 84) of music liked by people. Following the embedding, we used logistic regression with ℓ2 regularization to predict each of the labels; 90% of the nodes were randomly selected for training, and we evaluated performance on the remaining users. Numbers reported in Table III are F1 scores calculated from 10 experimental repetitions. GEMSEC2 significantly outperforms the other methods on all three countries' datasets. The performance advantage varies between 3.03% and 4.95%. We also see that Smooth GEMSEC2 has lower accuracy, but it is able to outperform DeepWalk, LINE, Node2Vec, Walklets, ComE, M-NMF and DANMF on all datasets.

TABLE III. Multi-label node classification performance of the embedding extracted features on the Deezer genre likes datasets. Performance is measured by average F1 score values. Models were trained on 90% of the data and evaluated on the remaining 10%. Errors in the parentheses correspond to two standard deviations. GEMSEC models consistently have good performance.

F. Scalability and computational efficiency

To create graphs of various sizes, we used the Erdos-Renyi model with an average degree of 20. Figure 5 shows the log of mean runtime against the log of the number of nodes. Most importantly, we can conclude that doubling the size of the graph doubles the time needed for optimizing GEMSEC; thus the growth is linear. We also observe that embedding algorithms that incorporate clustering have a higher cost, and regularization also produces a higher cost, but similar growth.

Fig. 5. Sensitivity of optimization runtime to graph size measured by seconds. The dashed lines are linear references. (Horizontal axis: log2 vertex number; curves: DeepWalk, Smooth DeepWalk, GEMSEC, Smooth GEMSEC, M-NMF, ComE.)

V. CONCLUSIONS

We described GEMSEC – a novel algorithm that learns a node embedding and a clustering of nodes jointly. It extends existing embedding models. We showed that smoothness regularization can be used to incorporate social network properties and produce natural embedding and clustering. We presented new social datasets, and experimentally, our methods outperform a number of strong community aware node embedding baselines.

VI. ACKNOWLEDGEMENTS

Benedek Rozemberczki and Ryan Davies were supported by the Centre for Doctoral Training in Data Science, funded by EPSRC (grant EP/L016427/1).

REFERENCES

[1] T. Van Laarhoven and E. Marchiori, "Robust community detection methods with resolution parameter for complex detection in protein protein interaction networks," Pattern Recognition in Bioinformatics, pp. 1–13, 2012.
[2] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan, "Group formation in large social networks: Membership, growth, and evolution," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006, pp. 44–54.
[3] S. Papadopoulos, Y. Kompatsiaris, A. Vakali, and P. Spyridonos, "Community detection in social media," Data Mining and Knowledge Discovery, vol. 24, no. 3, pp. 515–554, 2012.
[4] J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of Massive Datasets. Cambridge University Press, 2014.
[5] P. Pons and M. Latapy, "Computing communities in large networks using random walks," in International Symposium on Computer and Information Sciences. Springer Berlin Heidelberg, 2005, pp. 284–293.
[6] S. Gregory, "Finding overlapping communities in networks by label propagation," New Journal of Physics, vol. 12, no. 10, p. 103018, 2010.
[7] P. Goyal and E. Ferrara, "Graph embedding techniques, applications, and performance: A survey," arXiv preprint arXiv:1705.02801, 2017.
[8] B. Perozzi, R. Al-Rfou, and S. Skiena, "Deepwalk: Online learning of social representations," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014.
[9] A. Grover and J. Leskovec, "Node2vec: Scalable feature learning for networks," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 855–864.
[10] W. W. Zachary, "An information flow model for conflict and fission in small groups," Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.
[11] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013.
[12] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
[13] S. Cavallari, V. W. Zheng, H. Cai, K. C.-C. Chang, and E. Cambria, "Learning community embedding with community detection and node embedding on graphs," in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017, pp. 377–386.
[14] X. Wang, P. Cui, J. Wang, J. Pei, W. Zhu, and S. Yang, "Community preserving network embedding," in AAAI, 2017, pp. 203–209.
[15] F. Ye, C. Chen, and Z. Zheng, "Deep autoencoder-like nonnegative matrix factorization for community detection," in Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM, 2018, pp. 1393–1402.
[16] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, "Line: Large-scale information network embedding," in Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 1067–1077.
[17] B. Perozzi, V. Kulkarni, H. Chen, and S. Skiena, "Don't walk, skip!: Online learning of multi-scale network embeddings," in Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017, 2017, pp. 258–265.
[18] J. Fakcharoenphol, S. Rao, and K. Talwar, "A tight bound on approximating arbitrary metrics by tree metrics," Journal of Computer and System Sciences, vol. 69, no. 3, pp. 485–497, 2004.
[19] J. Matoušek, Lectures on Discrete Geometry. Springer, 2002, vol. 108.
[20] X. Yu, X. Ban, W. Zeng, R. Sarkar, X. Gu, and J. Gao, "Spherical representation and polyhedron routing for load balancing in wireless sensor networks," in 2011 Proceedings IEEE INFOCOM. IEEE, 2011, pp. 621–625.
[21] K. Huang, C.-C. Ni, R. Sarkar, J. Gao, and J. S. Mitchell, "Bounded stretch geographic homotopic routing in sensor networks," in IEEE INFOCOM 2014 - IEEE Conference on Computer Communications. IEEE, 2014, pp. 979–987.
[22] R. Sarkar, "Low distortion delaunay embedding of trees in hyperbolic plane," in International Symposium on Graph Drawing. Springer, 2011, pp. 355–366.
[23] C. De Sa, A. Gu, C. Ré, and F. Sala, "Representation tradeoffs for hyperbolic embeddings," Proceedings of Machine Learning Research, vol. 80, p. 4460, 2018.
[24] W. Zeng, R. Sarkar, F. Luo, X. Gu, and J. Gao, "Resilient routing for sensor networks using hyperbolic embedding of universal covering space," in 2010 Proceedings IEEE INFOCOM. IEEE, 2010, pp. 1–9.
[25] B. Rozemberczki and R. Sarkar, "Fast sequence-based embedding with diffusion graphs," in International Workshop on Complex Networks. Springer, 2018, pp. 99–107.
[26] J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang, "Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec," in Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 2018, pp. 459–467.
[27] M. Gutmann and A. Hyvärinen, "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 297–304.
[28] J.-P. Onnela, J. Saramäki, J. Hyvönen, G. Szabó, D. Lazer, K. Kaski, J. Kertész, and A.-L. Barabási, "Structure and tie strengths in mobile communication networks," Proceedings of the National Academy of Sciences, vol. 104, no. 18, pp. 7332–7336, 2007.
[29] A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and A. J. Smola, "Distributed large-scale natural graph factorization," in Proceedings of the 22nd International Conference on World Wide Web, 2013, pp. 37–48.