
Appl Intell (0000) 42:738–750

DOI 10.1007/s10489-014-0631-0

Link prediction in dynamic social networks by integrating different types of information
Nahla Mohamed Ahmed Ibrahim · Ling Chen

Published online: 23 December 2014


© Springer Science+Business Media New York 2014

N. M. A. Ibrahim · L. Chen (✉)
College of Information Engineering, Yangzhou University, Yangzhou 225009, China
e-mail: [email protected]

N. M. A. Ibrahim
College of Mathematical Sciences, Khartoum University, Khartoum 11115, Sudan
e-mail: [email protected]

L. Chen
State Key Lab of Novel Software Tech, Nanjing University, Nanjing 210093, China

Abstract Link prediction in social networks has attracted increasing attention in various fields such as sociology, anthropology, information science, and computer science. Most existing methods adopt a static graph representation to predict new links. However, these methods lose some important topological information of dynamic networks. In this work, we present a method for link prediction in dynamic networks by integrating temporal information, community structure, and node centrality in the network. Information on all of these aspects is highly beneficial in predicting potential links in social networks. Temporal information offers link occurrence behavior in the dynamic network, while community clustering shows how strong the connection between two individual nodes is, based on whether they share the same community. The centrality of a node, which measures its relative importance within a network, is closely related to future links in social networks. We predict a node’s future importance by eigenvector centrality, and use this for link prediction. Merging the topological information, including community structure and centrality, with temporal information generates a more realistic model for link prediction in dynamic networks. Experimental results on real datasets show that our method based on the integrated time model can predict future links efficiently in temporal social networks, and achieves higher quality results than traditional methods.

Keywords Temporal networks · Community clustering · Eigenvector centrality · Link prediction

1 Introduction

Many social, biological, and information systems can be naturally described as networks, where vertices represent the entities and links denote relations or interactions between the vertices. A social network consists of individuals and their relations such as friendship or partnership. In the field of sociology, there is a long history of social network analysis that investigates the relations among social entities. In recent years, social network analysis has attracted considerable attention in various business ventures, such as marketing and business process modeling. In social networks, links among entities can vary dynamically; for example, email communications and cooperative interactions change over time.

Link prediction is an important task in social network analysis. It detects hidden links from the observed part of the network, and predicts future links given the current structure of the network. Link prediction has several applications including predicting both relations among individuals such as friendship or partnership, and their future behavior such as communications and collaborations. In a social security network, link prediction is used to identify hidden groups of terrorists or criminals [1], while in networks of human behavior, link prediction is used to detect and classify the behavior and motion of people [2]. Link prediction also has many applications in domains outside
social networks. In sensor networks, link prediction is used to explore dynamic temporal properties [3], to ensure information transfer secrecy [4], and to realize optimal routing [5].

Since relations among social members change continuously over time, links in real world social networks are constantly varying and evolving. New links may appear and existing links may vanish from the network. Therefore, link prediction algorithms should be capable of detecting dynamic relationships between members in a temporal social network. Recently, various approaches have been proposed to detect potential or future links in temporal social networks.

Some of these methods are based on analysis of the topological features of the network. Purnamrita et al. [6] proposed a nonparametric link prediction algorithm for a sequence of graph snapshots over time. The algorithm predicts links based on the features of its endpoints, as well as those of the local neighbor nodes around the endpoints. Murata et al. [7] explored the connection between link prediction and graph topology. They proposed an improved method for predicting links based on weighted proximity measures of social networks. Kim et al. [8] presented a method to predict future network topology using node centrality, which can identify important nodes in the future.

Machine learning strategies have also been exploited in temporal network link prediction methods. Pujari et al. [9] applied a supervised rank aggregation method for link prediction in temporal complex networks. Vu et al. [10] introduced a continuous-time regression model for temporal network link prediction. This model can be incorporated with both time-dependent network statistics and time-varying regression coefficients. Bringmann et al. [11] proposed an approach to link prediction in temporal networks based on techniques for association-rule mining and frequent-pattern detection. O’Madadhain et al. [12] presented a link prediction method for event-based temporal networks. Using techniques for data mining and machine learning, the method can predict future co-participation of individuals in social events.

Some methods for link prediction in temporal networks are based on a probabilistic model. Hanneke et al. [13] proposed a family of statistical models for temporal social network link prediction by extending the exponential random graph model, while Liu et al. [14] presented a method for link prediction in a user-object network, which considers both time attenuation and diversion delay. Gao et al. [15] proposed a model that exploits multiple information sources in the network to obtain link occurrence probabilities as a function of time. To increase the accuracy of link prediction, Potgieter et al. [16] presented a method based on Bayesian networks to model temporal trends emerging in dynamic networks. Hu et al. [17] presented a probabilistic model to detect human motion in a social network, and advanced a method for labeling human motion using a constraint-based genetic algorithm to optimize the model. Zhu et al. [18] proposed a method based on the Markov model to predict links in the World Wide Web. In this method, an algorithm for transition probability matrix compression is used to cluster Web pages with similar transition behaviors and compress the transition matrix to an optimal size for efficient probability calculation in link prediction. However, such a probabilistic model requires a predefined distribution of link occurrences, which is difficult to know in advance for a given temporal network.

Various other methods for link prediction in temporal networks are based on matrix or tensor analysis [19–21]. Hayashi et al. [19] proposed an approach to model the link structure of a temporal network based on an extension of matrix factorization. To estimate the true links behind a sequence of noise-corrupted relational matrices effectively, dynamic evolutions are modeled in a space of low-rank matrices. Dunlavy et al. [20] presented a matrix and tensor-based method for predicting future links in a temporal bipartite network. They extended the Katz method to bipartite graphs, and approximated the results in a scalable way using truncated singular value decomposition. Acar et al. [21] used a matrix and tensor-based method for predicting links in temporal bipartite graphs by tracing changes in the network over time. Since matrix and tensor computations require vast amounts of time, such methods are not suitable for real-time link prediction in large-scale networks.

In this work, we present efficient approaches for predicting potential links in temporal social networks. One approach is based on a reduced static graph using a modified reduced adjacency matrix to reflect the frequency of each link, while the other approach integrates similarity indices of the nodes to exploit both temporal and topological information such as community structure and centrality of the nodes. To emphasize the importance of more recent topological information, the approach uses a damping factor to integrate the similarity indices in the time steps. To consider existing links, we make use of an augmented adjacency matrix in calculating the similarity indices at each time step. Experimental results show that our methods yield higher quality results for link prediction in temporal social networks.

The rest of this paper is organized as follows: Section 2 discusses the static graph model, and introduces the method based on a modified reduced adjacency matrix. Section 3 presents a time series model integrating community information, while Section 4 discusses a time series model integrating node centrality for link prediction. Section 5
presents our link prediction algorithm based on a time series model that exploits both temporal and topological information. In Section 6, we present and analyze the experimental results of our link prediction methods on real datasets. Finally, Section 7 gives our conclusions.

2 Reduced static graph approaches

In this section, we first give a brief review of traditional reduced static graph approaches for link prediction in temporal networks, and then present a new approach based on a modified reduced adjacency matrix.

2.1 Original reduced static graph method

We use an undirected binary graph to represent the network. Let $V = \{v_1, v_2, \ldots, v_n\}$ be a set of vertices, and $G_{t_0}, G_{t_0+1}, \ldots, G_{t_0+T-1}$ be a sequence of graphs on $V$ at time steps $t = t_0, t_0+1, \ldots, t_0+T-1$. Define a list of symmetric matrices $A_{t_0}, A_{t_0+1}, \ldots, A_{t_0+T-1}$ to be the adjacency matrices of graphs $G_{t_0}, G_{t_0+1}, \ldots, G_{t_0+T-1}$, respectively. The binary value of $A_t(i,j)$ indicates the existence of an edge between nodes $v_i$ and $v_j$, $i, j = 1, 2, \ldots, n$, during time period $t$, $t = t_0, t_0+1, \ldots, t_0+T-1$. Given such a graph series $G_{t_0}, G_{t_0+1}, \ldots, G_{t_0+T-1}$, the time series link prediction problem is to predict the occurrence probabilities of edges at time $t_0+T$, where $T$ is the window size. In this work, we specify the output of the link prediction problem as an $n \times n$ matrix $S_{t_0,T}$ with each element $S_{t_0,T}(i,j)$ being a link occurrence score proportional to the predicted occurrence probability of edge $(v_i, v_j)$ at time $t_0+T$.

In the commonly used algorithms for link prediction in temporal networks, networks in the time series are first reduced to a static graph representation, and then a static graph link prediction algorithm is applied to detect potential links in the reduced static graph. In other words, graph series $G_{t_0}, G_{t_0+1}, \ldots, G_{t_0+T-1}$ is reduced to a single binary graph $G_{t_0,T}$ with a corresponding adjacency matrix $A_{t_0,T}$, with the entries in $A_{t_0,T}$ given by

$$A_{t_0,T}(i,j) = \begin{cases} 1 & \text{if } \exists k \in [t_0, t_0+T-1] : A_k(i,j) = 1 \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

Then, static network link prediction methods are applied to the reduced static graph $G_{t_0,T}$, with the results on $G_{t_0,T}$ output as the solution of the temporal link prediction on $G_{t_0+T}$.

In the similarity-based method for link prediction, each pair of nodes is assigned an index that can be used as the similarity measure. Links connecting a greater number of similar nodes should have a higher probability of existence. The node similarity index is defined by considering the fact that two nodes are more similar if they have more common neighbors or other topological features. To examine the performance of our proposed model, we employed five commonly used similarity indices in link prediction experiments in Section 6. Of these, four indices use local topological structures, namely, Common Neighbor, Jaccard, Adamic/Adar and Preferential Attachment [1]. The remaining index, Katz, uses a global topological structure [1].

2.2 Improved reduced static graph approach

Traditional reduced static graph methods for temporal network link prediction have the drawback of missing some important information. First, since the reduced adjacency matrix $A_{t_0,T}$ defined in (1) is a binary one, it does not consider the frequency of the links in graphs $G_{t_0}, G_{t_0+1}, \ldots, G_{t_0+T-1}$. It is obvious that if a link occurs more frequently in the temporal network, it has a higher probability of occurring in the future. Moreover, the reduced adjacency matrix $A_{t_0,T}$ ignores time information, which is important in link prediction for temporal networks. Since recent links in the network should have more influence in predicting future links, they should have greater weights in calculating the similarity indices. Since static graph model predictors such as Common Neighbor, Jaccard, and Adamic/Adar treat all historical links with equal importance, they have relatively low accuracy in predicting future links for temporal social networks. Here, we present a new model to represent the structure of temporal networks. Using this model, we can achieve marked improvement in the accuracy of the results.

In the proposed model, we reduce the sequence of adjacency matrices $A_{t_0}, A_{t_0+1}, \ldots, A_{t_0+T-1}$ to a weighted matrix $M_{t_0,T}$, where a damping factor $\alpha$ is used to assign greater importance to more recent topological information. In this model, we also add a self-loop (an edge from node $i$ to $i$) to each node in the reduced static graph. Thus, if link $(i,j)$ has appeared previously, the link occurrence probability of $(i,j)$ can exploit the occurrences of two link pairs $((i,i),(i,j))$ and $((i,j),(j,j))$. Therefore, the reduced adjacency matrix $M_{t_0,T}$ is defined as:

$$M_{t_0,T}(i,j) = \begin{cases} 1 & \text{if } i = j \\ \sum_{t=0}^{T-1} \alpha^{T-t} A_{t_0+t}(i,j) & \text{otherwise} \end{cases} \tag{2}$$

where $0 < \alpha < 1$ is the damping factor. The basic fact considered in this model is that recent links have more accurate information than older ones. In this model, the weight of each edge is reduced by the damping factor at each time step. From (2), we can see that the
value of $M_{t_0,T}(i,j)$ reflects the frequency of the appearance of link $(v_i, v_j)$. Since more frequently occurring links have a higher probability of appearing in the future, and existing links have a higher probability than missing links of appearing in the future, the reduced static graph with adjacency matrix $M_{t_0,T}$ is more reliable for temporal link prediction. In our modified method, we consider both the frequency and the time of the link occurrence; the more frequently and more recently link $(v_i, v_j)$ occurs, the greater is the weight assigned to it according to (2).

As seen from the experimental results, our approach achieves higher performance and yields more accurate results for future link prediction.

3 Exploiting community structure for link prediction

Social networks have interesting and attractive properties, such as the small world property and local structures, also known as communities. A community in a network is a cluster of nodes such that nodes within the same community are densely connected while links between communities are relatively sparse. The attributes of individuals in the same community are more similar than those between individuals from different communities, which means that they may share more personal information, interests, experiences, and other useful resources. Individuals in the same community have a higher probability of communicating in the future than individuals from different communities. Therefore, we exploit the community structure of a network to predict links. To use community information in the prediction of future links, communities are first detected based on the improved reduced graph representation introduced in Section 2, which is more reliable than the traditional reduced representation. Let $G^*_{t_0,T}$ be a weighted graph with a corresponding adjacency matrix $M_{t_0,T}$ defined in (2). Graph $G^*_{t_0,T}$ is partitioned into a set of communities, $C_{t_0,T} = \{C^1_{t_0,T}, C^2_{t_0,T}, \ldots, C^{t_0}_{t_0,T}\}$, maximizing communication between nodes from the same community and minimizing communications between nodes from different communities. In our model, we integrate temporal link information with community information. In addition, the communications between individuals from different communities are also considered. It does not seem reasonable to ignore these communications completely, because nodes in different current communities will probably have future interactions. We consider two facts: first, recently active nodes are more likely to communicate in the future; and second, there may be links in the future between nodes in different current communities. These links provide the possibility of their connection in the future. Therefore, a community-based similarity matrix is defined as:

$$N_{com}(i,j) = \begin{cases} \rho & \text{if } v_i, v_j \in C^{k}_{t_0,T},\ k = 1, 2, \ldots, t_0 \\ 1 & \text{if } v_i \in C^{k}_{t_0,T},\ v_j \in C^{k'}_{t_0,T},\ k \neq k' \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

In (3), a parameter $\rho > 1$ is used to assign a greater weight to a pair of nodes belonging to the same community. Since the improved reduced graph representation $M_{t_0,T}$ is used for community detection, recent links have a greater influence on the link prediction results than older ones. Although node pairs in the same current community are assigned a higher probability of connection in the future, the interactions between nodes in different communities are also considered. These inter-community node pairs are assigned a smaller probability of connection in the future.

3.1 Method for detecting communities

Many approaches for community detection focusing on various topological criteria, including betweenness, modularity, normalized cut, structural density, and partition density, have been studied. In this work, we use the modularity optimization method to detect communities.

In our method, we first build the modified static graph representation for the given temporal graphs from time $t_0$ to $t_0+T-1$. Then, we detect communities using this representation. The detection process is performed using some existing modularity-based algorithms [22–26] such as the algorithm for the Louvain method [25, 26]. These algorithms attempt to partition the network into communities by optimizing modularity. These algorithms for optimizing modularity are performed in two steps. First, they look for small communities by optimizing modularity locally. Then, they aggregate nodes belonging to the same community into a super-node and build a new network to represent the communities. The super-nodes are then merged into larger communities to obtain greater modularity. In such iterative algorithms, if an initial local optimized state can be provided for the algorithm, the number of iterations is reduced, and hence, the global optimized state will be reached faster. For temporal networks, we detect the dynamic community structure in an incremental way. Since the topological structures in networks $G_t$ and $G_{t+1}$ have only slight differences, the community structure does not change rapidly. Therefore, community detection can be enhanced by providing the community structure of $G_t$ at the current time step as the initial state for the network at the next time step $G_{t+1}$.

3.2 Algorithm for generating the community matrix

Pseudo-code for our algorithm that generates the community matrix is given below.
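
As a rough illustration of (2) and (3), and not a reproduction of the authors' Algorithm 1, the following Python sketch builds the weighted reduced matrix and the community-based similarity matrix. The function names, the list-of-lists matrix layout, and the `labels` encoding (one community id per node, `None` for a node left out of every community) are assumptions made for this example. Community detection itself is assumed to happen elsewhere, for instance in a Louvain implementation, and the summand in (2) is read here as α^(T−t) · A_{t0+t}(i, j).

```python
def reduced_weighted_matrix(adjs, alpha):
    """Weighted reduction in the spirit of (2).

    adjs  : list of T adjacency matrices (lists of lists of 0/1), oldest first.
    alpha : damping factor, 0 < alpha < 1.
    Assumption: the summand is alpha**(T - t) * A_{t0+t}(i, j), so the most
    recent snapshot (t = T - 1) gets weight alpha and the oldest alpha**T.
    """
    T = len(adjs)
    n = len(adjs[0])
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i][j] = 1.0  # self-loop added to every node
            else:
                M[i][j] = sum(alpha ** (T - t) * adjs[t][i][j] for t in range(T))
    return M


def community_matrix(labels, rho):
    """Community-based similarity in the spirit of (3).

    labels : labels[i] is the community id of node i, or None if the node
             belongs to no community (hypothetical encoding).
    rho    : weight > 1 given to pairs that share a community.
    """
    n = len(labels)
    N = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if labels[i] is None or labels[j] is None:
                continue  # "otherwise" case: score stays 0
            N[i][j] = rho if labels[i] == labels[j] else 1.0
    return N
```

With `alpha = 0.5` and a window of two snapshots, a link seen only in the older snapshot gets weight 0.25, while a link seen only in the newer one gets 0.5, which is the recency effect described above.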
In Algorithm 1 for generating the community matrix, most of the computational time is consumed by steps 2 and 3. The model estimation in step 3 requires $O(n^2)$ time, while the time complexity of step 2 depends on the community detection algorithm used. There are several community detection algorithms requiring less time than $O(n^2)$. For instance, the algorithm by Clauset et al. [27] runs in $O(n \log^2 n)$ time, while the Louvain algorithm [25] runs even faster. The generalized Louvain algorithm by De Meo et al. [26] runs in no more than $O(\gamma \cdot |E|)$ time, where $|E|$ is the number of edges and $\gamma$ is a small positive constant. Therefore, the time complexity of Algorithm 1 is $O(n^2)$.

4 Exploiting node centrality for link prediction

In predicting the future topology of a dynamic social network, a key problem is identifying nodes that will be in more important positions in the future. In different applications, ad hoc prediction functions are used to evaluate node importance. Since more important nodes have a higher probability of being connected, we present an approach for predicting potential links using the node’s future importance under a centrality matrix. The centrality matrix integrates the topological information in the history, thereby presenting a combination of the node’s relative importance over time. Since more recent important nodes have a higher probability than older ones of participating in the future, the proposed combination allocates greater weights for their corresponding centrality scores. In this model, eigenvector centrality is used to measure a node’s importance within a network.

4.1 Eigenvector centrality

Eigenvector centrality is a measure of the influence of a node in a network. It assigns relative scores to all nodes in the network based on the principle that connections to high-scoring nodes contribute more to the score of the node in question than similar connections to low-scoring nodes [1]. The eigenvector centrality score of node $i$ is defined as:

$$x_i = \frac{1}{\lambda} \sum_{j \in N(i)} x_j = \frac{1}{\lambda} \sum_{j \in V} A(i,j)\, x_j \tag{4}$$

Here, $N(i)$ is the set of neighbors of node $i$, and $\lambda$ is a constant. Equation (4) can be rewritten in matrix and vector notation as the eigenvector equation:

$$A x = \lambda x$$

In general, for an $n \times n$ matrix $A$, there are $n$ different eigenvalues $\lambda_i$ and $n$ eigenvectors $x_i$ for $i = 1, 2, \ldots, n$. However, by the Perron–Frobenius theorem we know that the additional requirement that all entries in the eigenvector are positive implies that only the greatest eigenvalue results in the desired centrality measure [2]. Then, the $i$-th component of the related eigenvector of the largest eigenvalue gives the centrality score of vertex $i$ in the network.

4.2 Centrality matrix derivation

For a given graph series $G_{t_0}, G_{t_0+1}, \ldots, G_{t_0+T-1}$, let the series of $n \times n$ matrices $q_{t_0}, q_{t_0+1}, \ldots, q_{t_0+T-1}$ be the list of its corresponding centrality score matrices. Centrality matrix $N_{cen}$ is an $n \times n$ matrix, where $n$ is the number of nodes in $V$. $N_{cen}$ is defined as a linear combination of centrality scores; moreover, the time factor is considered to provide a more accurate model. $N_{cen}$ is defined as:

$$N_{cen}(i,j) = \sum_{t=t_0}^{t_0+T-1} \gamma^{\,t_0+T-1-t} \left[ q_t(i) + q_t(j) \right], \quad 0 < \gamma < 1 \tag{5}$$

In (5), $\gamma$ is a damping factor, used to assign greater weights to more recent centrality values. The centrality matrix includes topological information of both the time and a node’s importance.

As confirmed by the experimental results (see Section 6), our approach using centrality information achieves higher performance and obtains more accurate results for future link prediction.
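
Equations (4) and (5) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' Algorithm 2: the leading eigenvector is approximated by plain power iteration instead of the Lanczos method, the iteration is run on A + I (a shift chosen for this example, which leaves the eigenvectors unchanged and avoids oscillation on bipartite graphs), and scores are normalised to sum to 1. The function names and the list-of-lists layout are hypothetical.

```python
def eigenvector_centrality(A, iters=200, tol=1e-10):
    """Approximate the leading eigenvector of a non-negative matrix A.

    Power iteration on A + I; the returned scores sum to 1.
    """
    n = len(A)
    x = [1.0 / n] * n
    for _ in range(iters):
        # y = (A + I) x
        y = [x[i] + sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        s = sum(y)  # s >= 1 because x sums to 1 and A has no negative entries
        y = [v / s for v in y]
        if max(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x


def centrality_matrix(adjs, gamma):
    """N_cen(i, j) as in (5), with snapshots indexed k = 0..T-1, oldest first.

    The weight of snapshot k is gamma**(T - 1 - k), so the most recent
    snapshot has weight 1 and older ones are damped.
    """
    T = len(adjs)
    n = len(adjs[0])
    N = [[0.0] * n for _ in range(n)]
    for k, A in enumerate(adjs):
        q = eigenvector_centrality(A)
        w = gamma ** (T - 1 - k)
        for i in range(n):
            for j in range(n):
                N[i][j] += w * (q[i] + q[j])
    return N
```

For a three-node star graph, the hub receives the largest score and the two leaves receive equal scores, so pairs that include the hub obtain the largest entries of the centrality matrix.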
4.3 Algorithm for generating the centrality matrix

Pseudo-code for the algorithm that generates the time series centrality matrix is given below.

For generating the time series centrality matrix, the main computation time for Algorithm 2 is consumed by steps 2 and 3. The time complexity of step 2 depends on the computation of node centrality. Computing the eigenvector centrality of all the nodes requires less than $O(n^2)$ computation time using the Lanczos method. Since the model estimation in step 3 also requires $O(n^2)$ time, the time complexity of Algorithm 2 is $O(n^2)$.

5 Integrated time series model for temporal link prediction

5.1 Overview of the algorithm

Many researchers present the problem of link prediction as one of identifying important nodes or important edges that will appear in the future. In this work, we use the importance of both nodes and edges to predict future links in dynamic social networks. In addition, the time factor is also considered. In our proposed algorithm, ITM (Integrated Time Series Model), analysis of the social network from different aspects provides more realistic information, and eliminates the influence of biased data. Our ITM algorithm combines three types of information: community information, node centrality information, and time series information.

In our ITM algorithm, we first generate a modified static graph representation for the given temporal graphs $G_1, G_2, \ldots, G_T$ with window size $T$. Then, we detect communities using this representation and various existing modularity optimization algorithms. In the next window consisting of $G_2, G_3, \ldots, G_{T+1}$, detecting communities is enhanced by iteratively providing the community structure of the first window as the initial state for the next one. In addition, we compute the importance of all nodes in the temporal graphs using eigenvector centrality to estimate such importance.

Let $M_{t_0,T}$ be the adjacency matrix of the modified static graph for temporal graphs $G_{t_0}, G_{t_0+1}, \ldots, G_{t_0+T-1}$, and $C_{t_0,T} = \{C^1_{t_0,T}, C^2_{t_0,T}, \ldots, C^{t_0}_{t_0,T}\}$ be the set of communities of the static graph. In partitioning the communities, isolated communities of size less than two are ignored. Let $N_{com}$ be the symmetric matrix defined in (3), where element $N_{com}(i,j) = \rho > 1$, if its corresponding nodes $v_i$, $v_j$ belong to the same community at the same time step, $N_{com}(i,j) = 1$, if $v_i$, $v_j$ belong to the same community but at different time steps, and $N_{com}(i,j) = 0$ otherwise. Let $q_{t_0}, q_{t_0+1}, \ldots, q_{t_0+T-1}$ be the node centrality lists for graphs $G_{t_0}, G_{t_0+1}, \ldots, G_{t_0+T-1}$, respectively, where $q_t(i)$ is the centrality value of node $v_i$ at time step $t$. We define a symmetric matrix $Q_t$, such that for every pair of nodes $v_i$ and $v_j$, $Q_t(i,j) = q_t(i) + q_t(j)$ for each time step $t$.

Matrices $N_{com}$ and $Q_t$, $t = t_0, t_0+1, \ldots, t_0+T-1$, include inner-link and global topological information and can strongly reflect the occurrence probabilities of edges in graphs $G_{t_0}, G_{t_0+1}, \ldots, G_{t_0+T-1}$.

Next, we integrate these matrices to construct matrix $S_{t_0,T}$, which consists of occurrence probabilities for future links.

$$S_{t_0,T} = \beta\, N_{com} + \sum_{t=t_0}^{t_0+T-1} \gamma^{\,t_0+T-1-t}\, Q_t = \beta\, N_{com} + N_{cen}, \quad 0 < \beta, \gamma < 1 \tag{6}$$

In (6), $0 < \gamma < 1$ is the damping factor, which is used to assign greater importance to more recent inner-links, and $\beta$ is a community matrix integration parameter. The probability matrix $S_{t_0,T}$ carries time, inner-link, and community information.

5.2 Algorithm for integrated time series information

Pseudo-code for the algorithm for temporal link prediction by integrating time series information is given in Algorithm 3.
5.3 Computational complexity analysis

In the time series link prediction algorithm ITM, most of the computational time is consumed by steps 2 and 3. If the number of nodes in the network is $n$, the model estimation in step 2 requires $O(n^2)$ time, while the time complexity of step 3 depends on the computational time required to generate the community matrix using Algorithm 1 and centrality matrix using Algorithm 2.

Since Algorithms 1 and 2 each require $O(n^2)$ computation time, the overall computational time required to implement the proposed ITM is $O(n^2)$. Since temporal social networks exhibit the sparseness property, using other data structures, such as lists or trees, can provide even better performance.

6 Experimental results

To evaluate our proposed methods for link prediction in a temporal social network, we applied them in a series of experiments to several temporal social networks. All the experiments were performed on an Intel Core i3 computer running Windows 7, with 4 GB memory, and using Python programming language. First, we introduce the six datasets used in the experiments, and explain the experimental setup.

Table 1 Main features of the datasets

Datasets      #Nodes  #Unique Links  #Unique Links-T  Period length (H)  TN
Irvine Msg    896     2201           3486             55.556             25
Enron         704     925            1634             240                33
Inf. Pattern  200     714            1733             0.0833             93
News Words    503     15638          29898            24                 60
Nodobo        395     453            2016             24                 61
Man. Email    167     3250           29847            55.556             117

Table 2 Time series link prediction methods

Temporal information                        Abbreviation
Link occurrence                             $ITM_1 = L$
Link and node community                     $ITM_2 = L + N_{com}$
Node centrality                             $ITM_3 = L + N_{cen}$
Link, node centrality, and node community   $ITM_4 = L + N_{com} + N_{cen}$

6.1 Test datasets

I. Irvine Message dataset
This network contains messages between users of an online community of students at the University of California, Irvine. Each node represents a user, and each edge represents a message between two users.

II. Enron Email dataset (http://konect.uni-koblenz.de/networks/enron)
The Enron email network consists of emails between employees of Enron between 1999 and 2003. Nodes in the network denote individual employees and edges represent their email communications. We performed link prediction analysis on the monthly email graphs.

III. Infectious SocioPatterns (Inf. Pattern) dataset (http://www.sociopatterns.org/datasets/infectious-sociopatterns-dynamic-contact-networks/)
The Infectious SocioPatterns dataset represents a network reflecting human mobility. It contains the daily cumulated networks represented in the Infectious SocioPatterns visualization system. The nodes denote visitors to the Science Gallery, while the edges
represent close-range face-to-face proximity during 20-second intervals between pairs of visitors.
After pre-processing, we chose the dataset on 1 May, which includes eight active hours partitioned into 93 periods. Since the duration of each period was 5 minutes, we applied the link prediction analysis to 5-minute periodical graphs.

Fig. 1 Performance of the four ITM methods on the six datasets

Fig. 2 Average AUC values of four ITMs on the six datasets

IV. News Words dataset
Nodes in this dataset represent words in a newspaper. A link between two nodes implies that the
corresponding words appear in the same text unit (sentence).

V. Nodobo dataset (http://nodobo.com/release.html)
The Nodobo dataset represents a mobile phone call network of high-school students, from September 2010 to February 2011. The nodes represent students, and each link represents a call between a pair of students.
After pre-processing, we chose 61 days from 1 October to 30 November 2010. The link prediction algorithm was applied to daily call graphs.

VI. Manufacturing Emails (ME) dataset (http://konect.uni-koblenz.de/networks/radoslaw_email)
The Manufacturing Emails dataset is the internal email communication network between employees of a mid-sized manufacturing company. The nodes represent employees, and an edge between two nodes represents an email communication between the respective employees.

Table 1 shows the main features of the datasets, including the number of nodes (#Nodes), total number of unique links (#Unique Links), where repeated links between the same node pair are counted only once, length of the time series sequence ($TN$), total number of unique links at each time step (#Unique Links-T), and the length of the time period in hours (Period length (H)).

6.2 Experiment setup

For every dataset, we took $TN$ snapshot graphs $G_1, G_2, \ldots, G_{TN}$. At each time step $t_0$, $t_0 = 1, 2, \ldots, TN - T$, we used the next $T$ graphs, $G_{t_0}, G_{t_0+1}, \ldots, G_{t_0+T-1}$, to test the static and time series link prediction methods for predicting $G_{t_0+T}$. Since the topological structure of $G_{t_0+T}$ was already known, our prediction result could be evaluated by comparing it with the actual presence of links in $G_{t_0+T}$. In the reduced static graph representation tests, we merged these $T$ graphs

Fig. 3 Changes in the average AUC values for ITM with varying values of β
Link prediction in dynamic social networks by integrating different types of information 747

Fig. 4 Changes in the average AUC values for ITM with varying values of γ

to build a reduced static graph $G_{t_0,T}$ with a corresponding adjacency matrix $A_{t_0,T}$ as defined in Section 2.
In our experiments, we tested the proposed algorithm ITM based on an integrated time series model, and compared the quality of the results with the traditional methods based on a reduced static graph. For the traditional methods, we used the similarity measurements of Common Neighbor, Jaccard, Adamic/Adar, Katz, and Preferential Attachment, denoted as CN, JC, AA, Katz, and PA, respectively. We also tested our method with different combinations of information. The first combination, called ITM1, contained only link occurrence information provided by the modified reduced graph. The second combination, ITM2, used link occurrence information and community information but ignored information on node centrality. The third combination, ITM3, included only the time series combination of node centrality. The final combination, ITM4, represented the complete model, integrating the three different types of information. Table 2 shows the four tested time series link prediction methods with their abbreviations. We also tested our model by varying the parameter values.
We used area-under-curve (AUC) scores to evaluate the quality of the results for the algorithms tested. After each algorithm calculates and ranks the similarities of all node pairs representing all existing and nonexistent links, the AUC value can be interpreted as the probability that a randomly chosen existing link is given a higher score than a randomly chosen nonexistent link. At each time, we randomly pick an existing link and a nonexistent link in $G_{t_0+T}$ to compare their scores. If among $n$ independent comparisons, existing links have a higher score $n'$ times and the same score $n''$ times, the AUC value is given by

$$\mathrm{AUC} = \frac{n' + 0.5\,n''}{n} \qquad (7)$$

In general, a larger AUC value indicates better performance, and hence, the AUC value of the perfect result is 1.0, while the AUC of the result by a random predictor is 0.5.

Fig. 5 Performance of the traditional methods and ITM

For detecting communities, we use the Louvain method [25], which is a greedy optimization method that attempts to
optimize the modularity of a partition of the network. The optimization is performed in two steps. In the first step, each node is assigned to a community so as to maximize network modularity. The second step builds a new network whose nodes are the communities found in the first step. Then, the process iterates until the network modularity cannot be improved further.

Fig. 6 Average AUC values of traditional methods and ITM

To find the node centrality scores, we used ARPACK software, which is a collection of Fortran77 subroutines designed to solve large-scale eigenvalue problems. The software was designed to compute a few, say $k$, eigenvalues with user-specified features such as those of the largest real part or largest magnitude using $n \cdot O(k) + O(k^2)$ storage. Since we only needed the largest eigenvalue, only $O(n)$ storage
was required. This software is based on an algorithmic variant of the Arnoldi process, called the implicitly restarted Arnoldi method (IRAM). When matrix A is symmetric, as in our case, it reduces to a variant of the Lanczos process called the implicitly restarted Lanczos method (IRLM), which is suitable for large-scale problems.

6.3 Experimental results and analysis

In our experiments, results were obtained by fixing the parameter values for all datasets as follows: α = 0.8, β = 0.05, T = 5.
First, we compared the performance of the four proposed time series methods shown in Table 2 on the six datasets. Figure 1 shows the AUC values of the results by the general method ITM4 (dashed line), and the AUC values of the results by other partial ITMs (solid lines). The figure shows good improvement of AUCs by general method ITM4. The reason that the general method ITM4 achieves higher quality results is that it integrates different time series, inner-link, and topological information. It precisely reflects the occurrence probabilities for the given temporal graphs, and eliminates the influence of biased data. Figure 2 illustrates the average AUC scores of the results by the four ITM methods. It can be seen from Fig. 2 that the general ITM4 achieves the best performance on all datasets. We can also see that algorithm ITM2, which merges time series link occurrences with node community information, always achieves higher performance than ITM1, which only uses link occurrences.
In our experiments, we also varied the value of β in the range [0, 1] to test the performance of the community-based method ITM2 and general method ITM4 on the six datasets. Figure 3 shows that the value of β has little impact on the prediction performance of general method ITM4, but it has a stronger influence on the prediction performance of ITM2. The performance of ITM2 improves markedly as the value of β increases. The optimal value of β for all datasets is around 0.05. Based on our empirical results, we chose the optimal value β = 0.05.
We also varied the value of parameter γ in the range [0, 1] to examine the impact of node centrality information in general method ITM4 on the six datasets. The results are shown in Fig. 4. With γ = 0, the method achieved its lowest performance on most of the datasets. If γ = 0, general method ITM4 degenerates to ITM2; we can see that the performance of ITM2 is much lower than that of ITM4. It can also be seen that the centrality of nodes is indeed an important additional information source, since the performance improves sharply when this information is introduced. After the performance has reached a maximum, it deteriorates slightly as the value of γ increases further. This can be explained by the need for a tradeoff between centrality and other integrated information in general method ITM4. The average optimal value of γ for all datasets is around 0.2. Based on the experimental results, we chose the optimal value γ = 0.2.
We also tested the traditional link prediction methods CN, JC, PA, and Katz, and compared their performance with that of general method ITM4. Figure 5 compares the average AUC values by the general method ITM4 with those by CN, JC, PA, and Katz. From the figure, we can see that on the Irvine Msg, Enron Email, Inf. Pattern, and Nodobo datasets, method ITM4 achieves better performance than the traditional methods including Katz. Unlike methods CN, JC, and PA, which use only local topological information on the nodes, method Katz uses global topological information for link prediction and yields higher accuracy results at the cost of much greater computation time. Despite method Katz’s deep consideration of global information of common paths, method ITM4 yields even better results on four datasets. However, on the News Words and Man. Email datasets, the results of method ITM4 are not significantly different from those of Katz.
Figure 6 illustrates the average AUC values for methods ITM4, CN, JC, AA, PA, and Katz on the six datasets. It can be seen from this figure that ITM4 achieves the best performance on the Irvine Msg, Enron Email, Inf. Pattern, and Nodobo datasets, while Katz obtains the best AUC values on the News Words and Man. Email datasets. It is clear that ITM4 yields higher AUC values than the traditional methods on most of the datasets. This shows that our proposed algorithm ITM4 based on the integrated time series model improves the quality of link prediction in temporal social networks.

7 Conclusion

We investigated the problem of link prediction in temporal social networks. In this work, we achieved higher quality link prediction results by providing greater weights for frequently occurring links. We also presented a time series model that exploits temporal information on evolving social networks for link prediction. We proposed a method based on node community information that utilizes both temporal and topological information of the temporal networks. In addition, integration of node importance was also exploited by the proposed integrated time series model.
We conducted extensive experimental evaluation of our methods on different datasets. The experimental results show that our methods obtain higher quality results for link prediction in temporal social networks.

Acknowledgments This research was supported in part by the Chinese National Natural Science Foundation under grant Nos. 61379066,
61070047, 61379064, and 61472344, Natural Science Foundation of Jiangsu Province under contracts BK20130452, BK2012672, and BK2012128, and the Natural Science Foundation of Education Department of Jiangsu Province under contracts 12KJB520019, 13KJB520026, and 09KJB20013.

References

1. Lü LY, Zhou T (2011) Link prediction in complex networks: A survey. Phys A 390:1150–1170
2. Ahmed HS, Faouzi BM, Caelen J (2013) Detection and classification of the behavior of people in an intelligent building by camera. Int J Smart Sens Intell Syst 6(4):1317–1342
3. Yang LT, Wang S, Jiang H (2011) Cyclic temporal network density and its impact on information diffusion for delay tolerant networks. Int J Smart Sens Intell Syst 4(1):35–52
4. Liu ZH, Ma JF, Zeng Y (2013) Secrecy transfer for sensor networks: from random graphs to secure random geometric graphs. Int J Smart Sens Intell Syst 6(1):77–94
5. Jia X, Xin F, Chuan WR (2013) Adaptive spray routing for opportunistic networks. Int J Smart Sens Intell Syst 6(1):95–119
6. Purnamrita S, Chakrabarti D, Jordan M (2012) Nonparametric link prediction in dynamic networks. arXiv:12066394
7. Murata T, Moriyasu S (2008) Link prediction based on structural properties of online social networks. New Gener Comput 26(3):245–257
8. Kim H, Tang J, Anderson R, Mascolo C (2012) Centrality prediction in dynamic human contact networks. Comput Netw 56:983–996
9. Pujari M, Kanawati R (2012) Supervised Rank Aggregation Approach for Link Prediction in Complex Networks, WWW 2012 Companion, Lyon, France, pp 1189–1196, April 16–20
10. Vu DQ, Asuncion AU, Hunter DR, Smyth P (2011) Continuous-time regression models for longitudinal networks. In: Proceedings of the 25th annual conference on neural information processing systems, advances in Neural information processing systems 24, pp 1–9
11. Bringmann B, Berlingerio M, Bonchi F, Gionis A (2010) Learning and predicting the evolution of social networks. IEEE Intell Syst 26–34
12. O’Madadhain J, Hutchins J, Smyth P (2005) Prediction and ranking algorithms for event-based network data. In: Proceeding of the ACM SIGKDD international conference on knowledge discovery and data mining. pp 23–30
13. Hanneke S, Fu WJ, Xing EP (2010) Discrete temporal models of social networks. Electron J Stat 4:585–605
14. Liu Ji, Denga G (2009) Link prediction in a user object network based on time-weighted resource allocation. Phys A 388:3643–3650
15. Gao S, Denoyer L, Gallinari P (2011) Temporal link prediction by integrating content and structure information. CIKM’11, Glasgow, Scotland, pp 1169–1174
16. Potgieter A, April K, Cooke R, Osunmakinde IO (2006) Temporality in link prediction: understanding social complexity. J Trans Eng Manag 11(1):83–96
17. Hu FY, Wong HS (2013) Labelling of Human Motion Based on CBGA and Probabilistic Model. Int J Smart Sens Intell Syst 6(2):583–609
18. Zhu J, Hong J, Hughes JG (2002) Using Markov Chains for Link prediction in Adaptive Web Sites. In: Bustard D, Liu W, Sterritt R (eds) Soft-Ware 2002, vol 2311. LNCS, pp 60–73
19. Hayashi K, Hirayama J, Ishii S (2009) Dynamic exponential family matrix factorization. In: Proceedings of PAKDD’09, pp 452–462
20. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. Proceedings of the ACM TKDD’11 15(12):10:1–10:27
21. Acar E, Dunlavy M (2009) Link Prediction on Evolving Data using Matrix and Tensor Factorizations. In: Proceedings of 2009 IEEE international conference on data mining workshops. pp 262–269
22. Song HH, Cho TW, Dave V, Zhang Y, Qiu L (2009) Scalable Proximity Estimation And Link Prediction in Online Social Networks. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, ACM. pp 322–335
23. Girvan M, Newman MEJ (2002) Community Structure in Social and Biological Networks. P Natl Acad Sci USA 99(12):7821–7826
24. Newman MEJ (2006) Modularity and Community Structure in Networks. Proc Natl Acad Sci USA 103(23):8577–8582
25. Blondel V, Guillaume J, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech: Theory Exp P10008
26. De Meo P, Ferrara E, Fiumara G, Provetti A (2012) Generalized Louvain method for community detection in large networks. arXiv:1108.1502v2 [cs.SI]
27. Clauset A, Newman MEJ, Moore C (2004) Finding community structure in very large networks. J Phys Rev E Stat Nonlin Soft Matter Phys 70(6 Pt 2):066111

Nahla Mohamed Ahmed Ibrahim is currently a PhD student in the Computer Science Department, Yangzhou University, China. Her research interests are in complex network analysis, mathematical and computer modeling. She received her master's degree in mathematics from the Institute of Mathematical Sciences, Cape Town, South Africa. She is a lecturer in the College of Mathematical Sciences, Khartoum University, Sudan.

Ling Chen is currently a professor in the Computer Science Department, Yangzhou University, China. His research interests include parallel and distributed computing, data mining and computational intelligence. He has co-edited 6 books/proceedings, and published more than 200 research papers including over 80 journal papers.
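
As an illustration of the experiment setup in Section 6.2, the sliding-window protocol and the reduced static graph baseline might be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, each snapshot is assumed to be a list of undirected edges without self-loops, and the reduced static graph is assumed here to be the plain union of the edges in the window (the paper's exact definition is given in its Section 2).

```python
from itertools import combinations


def sliding_windows(snapshots, T):
    """Yield (t0, window, target) for t0 = 1, ..., TN - T (1-based).

    window holds G_t0 .. G_{t0+T-1}; target is G_{t0+T}.
    snapshots[0] plays the role of G_1.
    """
    TN = len(snapshots)
    for t0 in range(1, TN - T + 1):
        start = t0 - 1
        yield t0, snapshots[start:start + T], snapshots[start + T]


def reduced_static_graph(window):
    """Merge a window of snapshots into one edge set.

    Assumption: the merge is the union of all edges in the window.
    """
    merged = set()
    for edges in window:
        merged |= {frozenset(e) for e in edges}
    return merged


def common_neighbor_scores(edges, nodes):
    """Common Neighbor (CN) score for every unordered node pair."""
    nbrs = {v: set() for v in nodes}
    for e in edges:
        u, v = tuple(e)
        nbrs[u].add(v)
        nbrs[v].add(u)
    return {
        frozenset((u, v)): len(nbrs[u] & nbrs[v])
        for u, v in combinations(sorted(nodes), 2)
    }
```

Each (window, target) pair would then be scored on the merged graph and the resulting ranking compared with the links actually present in the target snapshot.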

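
The sampled AUC estimate of Eq. (7) in Section 6.2 can also be illustrated with a short sketch. The function name, the dictionary-of-scores input, and the fixed random seed are assumptions made for illustration only; the counters `higher` and `ties` correspond to the number of comparisons in which the existing link scores higher and the number in which the two scores are equal.

```python
import random


def sampled_auc(scores, existing, nonexistent, n, seed=0):
    """Estimate AUC as (higher + 0.5 * ties) / n from n random comparisons.

    scores      -- dict mapping a node pair to its similarity score
    existing    -- list of pairs that are linked in the target snapshot
    nonexistent -- list of pairs that are not linked in the target snapshot
    """
    rng = random.Random(seed)
    higher = 0  # existing link scored strictly higher
    ties = 0    # both links received the same score
    for _ in range(n):
        pos = rng.choice(existing)
        neg = rng.choice(nonexistent)
        if scores[pos] > scores[neg]:
            higher += 1
        elif scores[pos] == scores[neg]:
            ties += 1
    return (higher + 0.5 * ties) / n
```

When every existing link outscores every nonexistent link the estimate is 1.0, and when all scores are tied it is 0.5, matching the perfect and random-predictor values quoted in the text.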