04 Communities
What is community?
Finding communities
CONCOR
Edge removal (Girvan-Newman)
Outline
• What does “network community” mean?
• Community detection versus graph partitioning
versus hierarchical clustering
• Graph partitioning algorithms
– Spectral partitioning (Fiedler’s method based on
graph Laplacian)
• Modularity metric for community detection
– Spectral-based modularity optimization
– Other methods for modularity optimization
• Community detection methods that do not rely on
modularity metric
– Betweenness-Centrality method
– Radicchi et al. method
• Hierarchical agglomerative clustering
Cliques Review
• Dfn: A clique is a maximal,
completely connected subgraph of a
given graph.
[Figure: three example graphs, each on nodes A, B, C, D]
Clique Review
http://sebastian.doc.gold.ac.uk/
• Note: the notion of clique here
does not necessarily refer to a
complete subgraph.
Clique Percolation Method (CPM)
• Clique is a very strict definition, unstable
• Normally use cliques as a core to find larger communities
• CPM is such a method to find overlapping communities
– Input
• A parameter k, and a network
– Procedure
1. Find all cliques of size k in the given network
2. Construct a clique graph. Two cliques are adjacent if
they share k-1 nodes
3. Each connected component in the clique graph forms
a community
Example: Clique Percolation Method
Step 1: Find all Cliques of size 3
{1, 2, 3}, {1, 3, 4}, {4, 5, 6}, {5, 6, 7}, {5, 6, 8}, {5, 7, 8},
{6, 7, 8}
Step 2: Construct Clique Graph
Step 3: Finding Communities
Two cliques are adjacent if
they share k-1 nodes (i.e. k-1=2)
Communities:
{1, 2, 3, 4}
{4, 5, 6, 7, 8}
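The three CPM steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the edge list is an assumption inferred from the size-3 cliques listed in Step 1 (the slide's graph drawing is not available), and the function names `k_cliques` and `clique_percolation` are hypothetical. Enumerating every k-subset of nodes is only practical for small graphs.

```python
from itertools import combinations

def k_cliques(adj, k):
    """Step 1: every set of k nodes that is completely connected."""
    return [frozenset(c) for c in combinations(sorted(adj), k)
            if all(b in adj[a] for a, b in combinations(c, 2))]

def clique_percolation(adj, k):
    cliques = k_cliques(adj, k)
    parent = list(range(len(cliques)))  # union-find over clique indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Step 2: two cliques are adjacent if they share k-1 nodes.
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    # Step 3: each connected component of the clique graph is a community.
    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return sorted(sorted(c) for c in communities.values())

# Assumed edge list, reconstructed from the cliques listed in Step 1.
EDGES = [(1, 2), (1, 3), (2, 3), (1, 4), (3, 4), (4, 5), (4, 6), (5, 6),
         (5, 7), (6, 7), (5, 8), (6, 8), (7, 8)]
ADJ = {}
for a, b in EDGES:
    ADJ.setdefault(a, set()).add(b)
    ADJ.setdefault(b, set()).add(a)

communities = clique_percolation(ADJ, k=3)
```

Node 4 ends up in both communities, which is how CPM expresses overlap.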
Communities cont..
• In the study of complex networks, a network is said to
have community structure if the nodes of the network can
be easily grouped into (potentially overlapping) sets of
nodes such that each set of nodes is densely connected
internally.
• In the particular case of non-overlapping community
finding, this implies that the network divides naturally into
groups of nodes with dense connections internally and
sparser connections between groups.
• But overlapping communities are also allowed. The more
general definition is based on the principle that pairs of
nodes are more likely to be connected if they are both
members of the same community (or communities), and less
likely to be connected if they do not share communities.
Network Communities cont..
• One of the most relevant features of graphs
representing real systems is community structure, or
clustering.
• Community structure is the organization of vertices into
clusters, with many edges joining vertices of the same
cluster and comparatively few edges joining vertices of
different clusters.
• Such clusters or communities can be considered as
fairly independent compartments of a graph, playing
a role similar to “tissues or the organs in the human
body”.
Communities cont..
• Detecting communities is of great importance
in sociology, biology and computer science.
• Detection of communities is the task of defining
and identifying communities in social and
information networks.
• In these graphs, the nodes represent
underlying social entities and the edges
represent interactions between pairs of nodes.
GROUP DISCUSSION:
family, friends, colleagues
Existing Methods
• Node-based: A node overlaps if more than one of its belonging
coefficients is larger than some threshold
– Label Propagation (COPRA) [Gregory 2010, Subelj and Bajec 2011]
• Structure-based: A node overlaps if it participates in
multiple base structures with different memberships
– Clique Percolation (CPM) [Palla et al. 2005, Derenyi et al. 2005]
– Link Partition [Evans and Lambiotte 2009 , Ahn et al. 2010]
[Figure: a node i under each approach]
– Node-based: f(i,c1)=0.35, f(i,c2)=0.05, f(i,c3)=0.4, …, where f(i,c)=mean(f(j,c))
– Structure-based: base structure: links
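The node-based rule can be illustrated with the belonging coefficients shown for node i. This is only a sketch: the threshold value 0.3 and the function name `memberships_above` are assumptions made for illustration, not values from the slide.

```python
def memberships_above(coefficients, threshold):
    """Communities whose belonging coefficient exceeds the threshold."""
    return sorted(c for c, f in coefficients.items() if f > threshold)

# Belonging coefficients of node i as shown on the slide.
node_i = {"c1": 0.35, "c2": 0.05, "c3": 0.4}

# Hypothetical threshold, chosen only for illustration.
members = memberships_above(node_i, threshold=0.3)
is_overlapping = len(members) > 1  # node i belongs to more than one community
```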
Limitations of Existing Methods
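[Figure: three cases in which an existing method fails for node i]
– COPRA fails: i is overlapping, with f(i,c1)=0.2, f(i,c2)=0.15, f(i,c3)=0.25, f(i,c4)=0.2, …
– CPM fails: i is non-overlapping
– Link partition fails: i is non-overlapping
(Other labels in the figure: communities c1, c2, c3, c4; weak tie)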
Convergence of Iterated Correlations (CONCOR)
CONCOR Intuition
• In a partition, nodes are similar (w.r.t. edges)
• Find similar nodes via correlation
– Pattern of edges: adjacency matrix
Review: Adjacency Matrix
• N(v) pre-calc A
[Table: 0/1 adjacency matrix A with rows and columns labeled A to M; the cell positions did not survive extraction]
Background: Pearson Correlation
• How related are variables X and Y?
+1: positively related
0: not related
-1: inversely related
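For reference, the standard Pearson correlation coefficient for n paired observations can be written as below. The symbols (x_i, y_i for the observations, x̄ and ȳ for their means) are assumptions chosen for this sketch. In CONCOR, X and Y would be the rows of two nodes in the adjacency matrix.

```latex
r_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
               \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```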
C(1)
       A     B     C     D     E     F     G     H     I     J     K     L     M
A   1.00 -0.08 -0.08 -0.08 -0.12 -0.12 -0.16 -0.16 -0.12 -0.16 -0.16 -0.12 -0.12
B  -0.08  1.00 -0.08 -0.08 -0.12 -0.12 -0.16 -0.16 -0.12 -0.16 -0.16 -0.12 -0.12
C  -0.08 -0.08  1.00  1.00 -0.12 -0.12 -0.16 -0.16 -0.12 -0.16 -0.16 -0.12 -0.12
D  -0.08 -0.08  1.00  1.00 -0.12 -0.12 -0.16 -0.16 -0.12 -0.16 -0.16 -0.12 -0.12
E  -0.12 -0.12 -0.12 -0.12  1.00 -0.18 -0.23 -0.23 -0.18 -0.23 -0.23 -0.18 -0.18
F  -0.12 -0.12 -0.12 -0.12 -0.18  1.00 -0.23 -0.23  0.41  0.78 -0.23 -0.18  0.41
G  -0.16 -0.16 -0.16 -0.16 -0.23 -0.23  1.00  0.57 -0.23 -0.30  0.57 -0.23 -0.23
H  -0.16 -0.16 -0.16 -0.16 -0.23 -0.23  0.57  1.00 -0.23 -0.30  0.13  0.27 -0.23
I  (row not recovered)
J  (row not recovered)
K  -0.16 -0.16 -0.16 -0.16 -0.23 -0.23  0.57  0.13 -0.23 -0.30  1.00 -0.23  0.27
L  -0.12 -0.12 -0.12 -0.12 -0.18 -0.18 -0.23  0.27  0.41  0.27 -0.23  1.00 -0.18
M  -0.12 -0.12 -0.12 -0.12 -0.18  0.41 -0.23 -0.23 -0.18  0.27  0.27 -0.18  1.00

C(t)
     A  B  C  D  E  F  G  H  I  J  K  L  M
A    1  1  1  1  1 -1  1  1 -1 -1  1 -1 -1
B    1  1  1  1  1 -1  1  1 -1 -1  1 -1 -1
C    1  1  1  1  1 -1  1  1 -1 -1  1 -1 -1
D    1  1  1  1  1 -1  1  1 -1 -1  1 -1 -1
E    1  1  1  1  1 -1  1  1 -1 -1  1 -1 -1
F   -1 -1 -1 -1 -1  1 -1 -1  1  1 -1  1  1
G    1  1  1  1  1 -1  1  1 -1 -1  1 -1 -1
H    1  1  1  1  1 -1  1  1 -1 -1  1 -1 -1
I  (row not recovered)
J  (row not recovered)
K    1  1  1  1  1 -1  1  1 -1 -1  1 -1 -1
L   -1 -1 -1 -1 -1  1 -1 -1  1  1 -1  1  1
M   -1 -1 -1 -1 -1  1 -1 -1  1  1 -1  1  1

The +1/-1 pattern in C(t) splits the nodes into {A, B, C, D, E, G, H, K} and {F, I, J, L, M}: communities!
CONCOR Summary
• Node similarity by correlation
• Iterate to stability, bisect graph
• Repeat!
• Issues:
– Meaningful bisection?
– How many bisections?
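One CONCOR bisection can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names (`pearson`, `correlate_rows`, `concor_bisect`), the tolerance and the iteration cap are all hypothetical. The code assumes no row is ever constant, since a constant row has zero variance and the correlation is then undefined.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # fails on constant rows (zero variance)

def correlate_rows(matrix):
    """Matrix of pairwise correlations between the rows of `matrix`."""
    return [[pearson(r, s) for s in matrix] for r in matrix]

def concor_bisect(adj, max_iter=100, tol=1e-9):
    """Iterate row correlation to stability, then split by sign."""
    corr = correlate_rows(adj)
    for _ in range(max_iter):
        nxt = correlate_rows(corr)
        change = max(abs(a - b)
                     for ra, rb in zip(nxt, corr) for a, b in zip(ra, rb))
        corr = nxt
        if change < tol:
            break
    # Nodes positively correlated with node 0 form one block.
    first = [i for i, c in enumerate(corr[0]) if c > 0]
    second = [i for i, c in enumerate(corr[0]) if c <= 0]
    return first, second

# Toy input: two disjoint triangles on nodes 0-2 and 3-5.
TRIANGLES = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
blocks = concor_bisect(TRIANGLES)
```

To "Repeat!", the same function would be applied again to the sub-matrix of each block; how many times to do so is the open issue listed above.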
Edge Removal: Girvan-Newman Method
Computing Betweenness E&K 3.6.B
1. BFS from v
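2. # shortest paths from
v to each node
[Figure: BFS layers below v labeled with shortest-path counts: 1, 1, 1, 1 (first layer); 2, 1, 2 (second layer); 3, 3 (third layer); 6 (fourth layer)]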
3. Propagate flow
A. Count flow from below + 1 at each node
B. Split flow up among parent nodes
according to # of shortest paths
[Figure: the same BFS layers (path counts 1, 1, 1, 1 / 2, 1, 2 / 3, 3 / 6), each node marked "Flow +1"; the bottom node sends 1/2 to each of its two parents]
Girvan-Newman Method
1. Calculate betweenness of all edges
2. Cut (remove max betweenness)
3. Repeat!
When do
we stop???
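The betweenness computation (BFS, shortest-path counts, flow propagation) and the cut-and-repeat loop can be sketched together. This is a minimal sketch, not the authors' code: it assumes an undirected, unweighted graph without self-loops stored as a dict of neighbour sets, and the function names are hypothetical. The loop shown stops as soon as the graph falls into one more component; choosing a better stopping point is the question raised on this slide.

```python
from collections import deque

def edge_betweenness(adj):
    score = {frozenset((u, w)): 0.0 for u in adj for w in adj[u]}
    for root in adj:
        # Steps 1 and 2: BFS from root, counting shortest paths to each node.
        dist, paths = {root: 0}, {root: 1}
        parents = {v: [] for v in adj}
        order, queue = [], deque([root])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w], paths[w] = dist[v] + 1, 0
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    parents[w].append(v)
        # Step 3: flow at a node is 1 plus the flow from below; it is split
        # among parents in proportion to their shortest-path counts.
        flow = {v: 1.0 for v in order}
        for w in reversed(order):
            for v in parents[w]:
                share = flow[w] * paths[v] / paths[w]
                score[frozenset((v, w))] += share
                flow[v] += share
    # Every pair of endpoints was counted once from each side.
    return {edge: s / 2 for edge, s in score.items()}

def components(adj):
    seen, found = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        found.append(comp)
    return found

def girvan_newman_split(adj):
    """Remove max-betweenness edges until one more component appears."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        bet = edge_betweenness(adj)
        if not bet:  # no edges left to remove
            break
        u, w = max(bet, key=bet.get)
        adj[u].discard(w)
        adj[w].discard(u)
    return components(adj)
```

For example, on two triangles joined by a single bridge edge, the bridge carries all 9 shortest paths between the two sides, so it is removed first and the triangles separate.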
Modularity (Stopping Criterion)
• Intuition: More edges inside a
community than random chance
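$$Q = \frac{1}{2m} \sum_{v,u} \left[ \mathbb{1}_{v,u} - \frac{k_v k_u}{2m} \right] \delta(c_v, c_u)$$
– 1_{v,u} = 1 if there is an edge v, u
– δ(c_v, c_u): c_v and c_u are the communities v and u are assigned to; 1 if equal
– 1/(2m): divide over all edges m
– k_v k_u / 2m: expected edges in a random version (preview)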
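The modularity Q on this slide can be computed directly from its definition. This is a minimal sketch assuming an undirected graph without self-loops stored as a dict of neighbour sets, with k_v taken to be the degree of v. The function name `modularity` and the `community` mapping are hypothetical names for this illustration.

```python
def modularity(adj, community):
    """Q = 1/(2m) * sum over v,u of (1_{v,u} - k_v*k_u/(2m)) * delta(c_v, c_u)."""
    two_m = sum(len(neighbours) for neighbours in adj.values())  # 2m
    q = 0.0
    for v in adj:
        for u in adj:
            if community[v] != community[u]:
                continue  # delta(c_v, c_u) = 0
            has_edge = 1.0 if u in adj[v] else 0.0
            q += has_edge - len(adj[v]) * len(adj[u]) / two_m
    return q / two_m

# Two triangles joined by one bridge edge (2-3).
GRAPH = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {v: "a" for v in GRAPH}

q_split = modularity(GRAPH, split)    # 5/14, about 0.357
q_merged = modularity(GRAPH, merged)  # 0: no better than random chance
```

Used as a stopping criterion, one would compute Q after each split produced by edge removal and keep the partition with the highest value.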
Edge removal summary
• Remove edges
– E.g., pick by betweenness
– This bisects the graph
• Modularity as stopping criterion
• In practice:
– Ok for several thousand nodes
– Bigger – need to approximate
betweenness
• Hierarchical clustering
• Stochastic Block Models
– Brief review of probability theory
Intuition: Hierarchical Clustering
• Node similarity (distance) metric
• Find communities on blank nodes
– Apply threshold t0 and add edges
– Draw graph of communities G(t0)
G(1)
(0)
Π(1) A Distance = 1
INDIVIDUAL EXERCISE:
What is Π(2)?
[Figure: graph drawing of nodes 1 to 10]
     1  2  3  4  5  6  7  8  9 10
 1   1  1  1  1  0  0  0  0  0  0
 2   1  1  1  0  0  0  0  0  0  0
 3   1  1  1  0  0  0  0  0  0  0
 4   1  0  0  1  1  0  0  0  0  0
 5   0  0  0  1  1  1  0  0  0  0
 6   0  0  0  0  1  1  1  0  1  0
 7   0  0  0  0  0  1  1  1  1  0
 8   0  0  0  0  0  0  1  1  1  0
 9   0  0  0  0  0  1  1  1  1  0
10   0  0  0  0  0  0  0  0  0  1
Hier. Clustering Example (3)
exclude
2 nodes
• Node distance: Manhattan distance
[Figure: graph drawing of nodes 1 to 10]
     1  2  3  4  5  6  7  8  9 10
 1   1  1  1  1  0  0  0  0  0  0
 2   1  1  1  0  0  0  0  0  0  0
 3   1  1  1  0  0  0  0  0  0  0
 4   1  0  0  1  1  0  0  0  0  0
 5   0  0  0  1  1  1  0  0  0  0
 6   0  0  0  0  1  1  1  0  1  0
 7   0  0  0  0  0  1  1  1  1  0
 8   0  0  0  0  0  0  1  1  1  0
 9   0  0  0  0  0  1  1  1  1  0
10   0  0  0  0  0  0  0  0  0  1
G(1)
(2)
Π(2) A Distance = 2
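The thresholding idea can be sketched on the 10-node matrix from this example. Several choices here are assumptions rather than the slide's stated method: "exclude 2 nodes" is read as skipping the columns of the two nodes being compared, nodes whose distance is at most the threshold are merged transitively (single linkage), and the names `manhattan` and `threshold_clusters` are hypothetical.

```python
ROWS = [
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
]

def manhattan(rows, i, j):
    """Manhattan distance between rows i and j, skipping columns i and j."""
    return sum(abs(a - b)
               for k, (a, b) in enumerate(zip(rows[i], rows[j]))
               if k not in (i, j))

def threshold_clusters(rows, t):
    """Link nodes at distance <= t; return connected groups (1-based labels)."""
    n = len(rows)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if manhattan(rows, i, j) <= t:
                parent[find(i)] = find(j)

    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v + 1)
    return sorted(groups.values())

clusters_at_1 = threshold_clusters(ROWS, 1)
```

Calling `threshold_clusters` with increasing thresholds shows how clusters merge; under the assumptions above, the result is only as meaningful as the chosen distance and linkage rule.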
Cluster Dendrogram
• Clusters merge over time
• Issues:
– When do you stop?
(Just report the tree?
Stopping criteria?)
– Ignoring structure of G(t)
Clusters
Probabilities!
Generative vs. Discriminative
• Given network G, find best partition Π*
Π* = argmax_Π P(Π | G)
Generative Models
• Use Bayes’ Rule
P(Π | G) = P(G | Π) · P(Π) / P(G)
• Model the generation of G
• Estimate parameters for
the model from data
Discriminative Models
• Directly calculate
P(Π | G)
• Discriminate between
possible Π values
• Define features, find
patterns implying Π
General formulation
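$$L(G \mid \Pi, \eta) = \prod_{ij \in G} \eta_{\pi_i \pi_j} \prod_{ij \notin G} \left(1 - \eta_{\pi_i \pi_j}\right)$$
– L(G | Π, η): likelihood function; i.e., prob. of G assuming params η
– Π: possible partition
– η: parameters: probabilities of edges
– first product: prob. of edges observed in G
– second product: prob. of not producing edges (that weren’t observed in G)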
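A likelihood of this form can be evaluated for a candidate partition as sketched below. The sketch assumes that the edge probability for a pair of nodes depends only on the groups the partition assigns them to, and it works with the log of the likelihood to avoid tiny products. The function name, the `group` list and the `eta` table are hypothetical, and the probabilities in the example are made up.

```python
from itertools import combinations
from math import exp, log

def log_likelihood(n, edges, group, eta):
    """log L(G | partition, eta) for an undirected graph on nodes 0..n-1.

    group[i] is the block of node i; eta[a][b] is the assumed probability
    of an edge between a node in block a and a node in block b.
    Entries of eta must lie strictly between 0 and 1.
    """
    edge_set = {frozenset(e) for e in edges}
    total = 0.0
    for i, j in combinations(range(n), 2):
        p = eta[group[i]][group[j]]
        if frozenset((i, j)) in edge_set:
            total += log(p)        # edge observed in G
        else:
            total += log(1.0 - p)  # edge not produced
    return total

# Made-up example: 4 nodes, two edges, dense-inside / sparse-between blocks.
EDGES = [(0, 1), (2, 3)]
ETA = [[0.8, 0.1],
       [0.1, 0.8]]
good = log_likelihood(4, EDGES, [0, 0, 1, 1], ETA)
bad = log_likelihood(4, EDGES, [0, 1, 0, 1], ETA)
```

The partition that groups the endpoints of each edge together scores higher, which is the sense in which a likelihood can rank candidate partitions.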