02-nodeemb
¡ Colab 0 will be released today by 9PM on our
course website
¡ Colab 0:
§ Overview of NetworkX and PyTorch Geometric
§ Does not need to be handed in
§ TAs will hold a recitation session to walk you
through Colab 0:
§ Time: Friday (09/29), 3-4pm PT
§ Location: Zoom, link will be posted on Ed
§ Session will be recorded
¡ Colab 1 will be released today by 9PM on our
course website
¡ Colab 1:
§ Will cover material from Lectures 1-2,
so you can get started right away!
§ Due on Thursday 10/12 (2 weeks from today)
§ Submit written answers and code on Gradescope
CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
Given an input graph, extract node, link
and graph-level features, then learn a
model (SVM, neural network, etc.) that
maps features to labels.
Graph Representation Learning alleviates
the need to do feature engineering every
single time.
(Pipeline: Input Graph → Structured Features → Learning Algorithm → Prediction)
Goal: Efficient task-independent feature
learning for machine learning with graphs!
f: u → ℝ^d maps each node u to a d-dimensional vector in ℝ^d: its feature representation, or embedding.
¡ Task: Map nodes into an embedding space
§ Similarity of embeddings between nodes indicates
their similarity in the network. For example:
§ Both nodes are close to each other (connected by an edge)
§ Encode network information
§ Potentially used for many downstream predictions
Tasks that can use node embeddings (in ℝ^d):
• Node classification
• Link prediction
• Graph classification
• Anomalous node detection
• Clustering
• …
Example
¡ 2D embedding of the nodes of Zachary's Karate Club network:
Image from: Perozzi et al. DeepWalk: Online Learning of Social Representations. KDD 2014.
¡ Assume we have an (undirected) graph G:
§ V is the vertex set.
§ A is the adjacency matrix (assume binary).
§ For simplicity: No node features or extra
information is used
Example: V = {1, 2, 3, 4}

        ⎛ 0 1 0 1 ⎞
    A = ⎜ 1 0 0 1 ⎟
        ⎜ 0 0 0 1 ⎟
        ⎝ 1 1 1 0 ⎠
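As a minimal sketch (using NetworkX and NumPy, which Colab 0 introduces), this toy graph and its binary adjacency matrix could be built as follows; the node ordering [1, 2, 3, 4] is chosen here to match the matrix above:

import networkx as nx
import numpy as np

# Undirected example graph with V = {1, 2, 3, 4}
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 4), (2, 4), (3, 4)])

# Binary adjacency matrix, rows/columns ordered by node label
A = nx.to_numpy_array(G, nodelist=[1, 2, 3, 4], dtype=int)
print(A)
# [[0 1 0 1]
#  [1 0 0 1]
#  [0 0 0 1]
#  [1 1 1 0]]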
¡ Goal is to encode nodes so that similarity in
the embedding space (e.g., dot product)
approximates similarity in the graph
Goal: similarity(u, v) ≈ 𝐳_vᵀ 𝐳_u
(left: similarity in the original network, which we still need to define; right: similarity of the embeddings)
1. Encoder maps from nodes to embeddings
2. Define a node similarity function (i.e., a
measure of similarity in the original network)
3. Decoder 𝐃𝐄𝐂 maps from embeddings to the
similarity score
4. Optimize the parameters of the encoder so that:
DEC(𝐳_v, 𝐳_u) = 𝐳_vᵀ 𝐳_u ≈ similarity(u, v)
(similarity in the original network ≈ similarity of the embeddings)
¡ Encoder: maps each node to a low-dimensional vector:
ENC(v) = 𝐳_v, where v is a node in the input graph and 𝐳_v is its d-dimensional embedding
¡ Similarity function: specifies how the relationships in vector space map to the relationships in the original network:
similarity(u, v) ≈ 𝐳_vᵀ 𝐳_u
(similarity of u and v in the original network ≈ dot product between node embeddings; the dot product acts as the decoder)
Simplest encoding approach: the encoder is just an embedding lookup:
ENC(v) = 𝐳_v = 𝐙 ⋅ v
where 𝐙 is the d × |V| embedding matrix (d = dimension/size of the embeddings; each column is one node's embedding) and v is an indicator vector with a one in the column corresponding to node v.
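A minimal sketch of such a lookup encoder, assuming PyTorch; the sizes are illustrative, and torch.nn.Embedding stores the matrix with one row per node (the transpose of the d × |V| layout above), so ENC(v) is simply a row lookup:

import torch
import torch.nn as nn

num_nodes, embedding_dim = 4, 16            # illustrative sizes

# The embedding matrix Z is a learnable parameter; one row per node.
encoder = nn.Embedding(num_embeddings=num_nodes, embedding_dim=embedding_dim)
nn.init.normal_(encoder.weight)             # random init; optimized later

v = torch.tensor([2])                       # index of a node (0-based)
z_v = encoder(v)                            # its embedding z_v, shape (1, 16)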
¡ The key design choice among these methods is how they define node similarity.
¡ This is an unsupervised/self-supervised way of learning node embeddings.
§ We are not utilizing node labels
§ We are not utilizing node features
§ The goal is to directly estimate a set of coordinates
(i.e., the embedding) of a node so that some aspect
of the network structure (captured by DEC) is
preserved.
¡ These embeddings are task independent:
§ They are not trained for a specific task but can be
used for any task.
¡ Vector 𝐳_u:
§ The embedding of node u (what we aim to find).
¡ Probability P(v | 𝐳_u): our model's prediction based on 𝐳_u
§ The (predicted) probability of visiting node v on random walks starting from node u.
(Figure: a random walk on an example graph, visiting a sequence of nodes step by step.)
Given a graph and a starting point, we select a neighbor of it at random, and move to this neighbor; then we select a neighbor of this point at random, and move to it, etc. The (random) sequence of points visited this way is a random walk on the graph.
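A sketch of one such unbiased walk, assuming a NetworkX graph; the starting node and walk length are illustrative:

import random
import networkx as nx

def random_walk(G, start, walk_length):
    """Uniform random walk: repeatedly hop to a randomly chosen neighbor."""
    walk = [start]
    for _ in range(walk_length):
        neighbors = list(G.neighbors(walk[-1]))
        if not neighbors:                   # isolated node: stop early
            break
        walk.append(random.choice(neighbors))
    return walk

G = nx.karate_club_graph()                  # Zachary's karate club, as above
print(random_walk(G, start=0, walk_length=5))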
𝐳_uᵀ 𝐳_v ≈ probability that u and v co-occur on a random walk over the graph
1. Estimate the probability of visiting node v on a random walk starting from node u, using some random walk strategy R
2. Optimize embeddings to encode these random-walk statistics: similarity in the embedding space (the dot product) should approximate the random-walk probabilities
Putting it all together:

ℒ = Σ_{u∈V} Σ_{v∈N_R(u)} −log( exp(𝐳_uᵀ 𝐳_v) / Σ_{n∈V} exp(𝐳_uᵀ 𝐳_n) )
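For a small graph the loss can be evaluated directly; a sketch assuming NumPy, where Z is an illustrative |V| × d embedding matrix and walk_neighborhoods maps each node u to the multiset N_R(u) collected from its walks:

import numpy as np

def full_softmax_loss(Z, walk_neighborhoods):
    """L = sum_u sum_{v in N_R(u)} -log( exp(z_u.z_v) / sum_n exp(z_u.z_n) )"""
    scores = Z @ Z.T                                  # all pairwise z_u^T z_v
    log_denom = np.log(np.exp(scores).sum(axis=1))    # log sum_n exp(z_u^T z_n)
    loss = 0.0
    for u, neighborhood in walk_neighborhoods.items():
        for v in neighborhood:                        # multiset: repeats count
            loss -= scores[u, v] - log_denom[u]
    return loss

Z = np.random.randn(4, 8)                             # toy embeddings, 4 nodes
print(full_softmax_loss(Z, {0: [1, 3, 1], 1: [0, 3], 2: [3], 3: [0, 1, 2]}))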
But doing this naively is too expensive! The softmax normalization (the inner sum over all nodes n) makes evaluating ℒ cost O(|V|²).
Solution: negative sampling. Instead of normalizing against all nodes in the softmax denominator, we only normalize against k random negative samples.
Why is the approximation valid? Technically, this is a different objective, but negative sampling is a form of Noise Contrastive Estimation (NCE), which approximately maximizes the log-probability of the softmax.
Gradient descent: a simple way to minimize ℒ:
§ Initialize 𝐳_u at some randomized value for all nodes u.
§ Iterate until convergence:
§ For all u, compute the derivative ∂ℒ/∂𝐳_u.
§ For all u, make a step in the reverse direction of the derivative: 𝐳_u ← 𝐳_u − η ∂ℒ/∂𝐳_u, where η is the learning rate.
§ Stochastic Gradient Descent: Instead of evaluating the gradient over all examples, evaluate it for each individual training example.
§ Initialize 𝐳_u at some randomized value for all nodes u.
1. Run short fixed-length random walks starting
from each node on the graph
2. For each node u, collect N_R(u), the multiset of nodes visited on random walks starting from u.
3. Optimize embeddings using Stochastic
Gradient Descent:
ℒ = Σ_{u∈V} Σ_{v∈N_R(u)} −log( P(v | 𝐳_u) )
We can efficiently approximate this using
negative sampling!
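A compact sketch of this training loop with negative sampling, assuming PyTorch and NetworkX; the walk length, number of negatives, learning rate, and epoch count are illustrative, and negatives are drawn uniformly at random for simplicity (in practice they are sampled proportionally to degree):

import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import networkx as nx

G = nx.karate_club_graph()
nodes = list(G.nodes())
emb = nn.Embedding(len(nodes), 16)                 # the embeddings z_u
opt = torch.optim.SGD(emb.parameters(), lr=0.05)

def walk(start, length=6):                         # unbiased fixed-length walk
    w = [start]
    for _ in range(length):
        w.append(random.choice(list(G.neighbors(w[-1]))))
    return w

for epoch in range(10):
    for u in nodes:
        for v in walk(u)[1:]:                      # v in N_R(u)
            neg = torch.randint(len(nodes), (5,))  # k = 5 negative samples
            z_u, z_v, z_neg = emb(torch.tensor(u)), emb(torch.tensor(v)), emb(neg)
            # Negative sampling: pull (u, v) together, push the negatives apart
            loss = -F.logsigmoid(z_u @ z_v) - F.logsigmoid(-(z_neg @ z_u)).sum()
            opt.zero_grad()
            loss.backward()
            opt.step()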
¡ So far we have described how to optimize
embeddings given a random walk strategy R
¡ What strategies should we use to run these
random walks?
§ Simplest idea: Just run fixed-length, unbiased random walks starting from each node (i.e., DeepWalk from Perozzi et al., 2014)
§ The issue is that such a notion of similarity is too constrained
¡ How can we generalize this?
Reference: Perozzi et al. 2014. DeepWalk: Online Learning of Social Representations. KDD.
¡ Goal: Embed nodes with similar network
neighborhoods close in the feature space.
¡ We frame this goal as a maximum-likelihood optimization problem, independent of the downstream prediction task.
¡ Key observation: A flexible notion of the network neighborhood N_R(u) of node u leads to rich node embeddings.
¡ Develop a biased 2nd-order random walk R to generate the network neighborhood N_R(u) of node u.
Reference: Grover et al. 2016. node2vec: Scalable Feature Learning for Networks. KDD.
(Figure 1 from the node2vec paper: BFS and DFS search strategies from node u (k = 3).)
BFS: Micro-view of neighbourhood
DFS: Macro-view of neighbourhood
Biased fixed-length random walk R that, given a node u, generates its neighborhood N_R(u)
¡ Two parameters:
§ Return parameter 𝒑:
§ Return back to the previous node
§ In-out parameter 𝒒:
§ Moving outwards (DFS) vs. inwards (BFS) from the
previous node
§ Intuitively, 𝑞 is the “ratio” of BFS vs. DFS
Biased 2nd-order random walks explore network neighborhoods:
§ The random walk just traversed edge (s₁, w) and is now at w
§ Insight: The neighbors of w can only be: at the same distance to s₁ (e.g., s₂), farther from s₁ (e.g., s₃), or back to s₁
¡ The walker came over edge (s₁, w) and is now at w. Where to go next?

Unnormalized transition probabilities, segmented based on the distance from s₁:

  Target t    Prob.    Dist.(s₁, t)
  s₁          1/p      0
  s₂          1        1
  s₃          1/q      2
  s₄          1/q      2

§ BFS-like walk: low value of p
§ DFS-like walk: low value of q
¡ The node2vec algorithm: 1) compute edge transition probabilities, 2) simulate random walks starting from each node, 3) optimize the node2vec objective with SGD
¡ Linear-time complexity
¡ All 3 steps are individually parallelizable
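A sketch of the unnormalized second-order transition rule above, assuming NetworkX; p and q are the return and in-out parameters, and the alias-table preprocessing used by the real node2vec implementation is omitted here:

import random
import networkx as nx

def biased_step(G, prev, curr, p, q):
    """One node2vec step: the walker arrived at `curr` via edge (prev, curr)."""
    candidates, weights = [], []
    for t in G.neighbors(curr):
        if t == prev:                   # distance 0: return to the previous node
            w = 1.0 / p
        elif G.has_edge(t, prev):       # distance 1: same distance from prev
            w = 1.0
        else:                           # distance 2: move farther away
            w = 1.0 / q
        candidates.append(t)
        weights.append(w)
    return random.choices(candidates, weights=weights, k=1)[0]

G = nx.karate_club_graph()
print(biased_step(G, prev=0, curr=1, p=1.0, q=0.5))   # low q: DFS-like behavior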
¡ Different kinds of biased random walks:
§ Based on node attributes (Dong et al., 2017).
§ Based on learned weights (Abu-El-Haija et al., 2017)
¡ Core idea: Embed nodes so that distances in
embedding space reflect node similarities in
the original network.
¡ Different notions of node similarity:
§ Naïve: Similar if two nodes are connected
§ Random walk approaches (covered today)
¡ So what method should I use..?
¡ No one method wins in all cases….
§ E.g., node2vec performs better on node classification
while alternative methods perform better on link
prediction (Goyal and Ferrara, 2017 survey).
¡ Random walk approaches are generally more
efficient.
¡ In general: Must choose definition of node
similarity that matches your application.
¡ Goal: Want to embed a subgraph or an entire graph G. Graph embedding: 𝐳_G.
¡ Tasks:
§ Classifying toxic vs. non-toxic molecules
§ Identifying anomalous graphs
Simple (but effective) approach 1:
¡ Run a standard graph embedding
technique on the (sub)graph 𝐺.
¡ Then just sum (or average) the node
embeddings in the (sub)graph 𝐺.
𝐳_G = Σ_{v∈G} 𝐳_v

¡ Used by Duvenaud et al., 2016 to classify molecules based on their graph structure
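A one-line sketch of this aggregation, assuming PyTorch and a node-embedding matrix Z (one row per node) produced by any of the methods above; the node indices of the (sub)graph are illustrative:

import torch

Z = torch.randn(10, 16)                 # toy node embeddings for 10 nodes
node_ids = torch.tensor([0, 2, 5])      # nodes belonging to the (sub)graph G

z_G_sum = Z[node_ids].sum(dim=0)        # z_G as the sum of node embeddings
z_G_avg = Z[node_ids].mean(dim=0)       # or the average, as noted above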
¡ Approach 2: Introduce a “virtual node” to
represent the (sub)graph and run a standard
graph embedding technique
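A sketch of the graph-side preparation for this approach, assuming NetworkX; the subgraph nodes and the label of the virtual node are illustrative, and any node-embedding technique from earlier would then be run on the augmented graph:

import networkx as nx

G = nx.karate_club_graph()
subgraph_nodes = [0, 1, 2, 3]                     # the (sub)graph to embed

# Add a virtual node connected to every node of the subgraph; after running a
# standard node-embedding method on the augmented graph, the virtual node's
# embedding serves as the embedding of the (sub)graph.
virtual = "virtual_subgraph_node"                 # hypothetical node label
G.add_node(virtual)
G.add_edges_from((virtual, v) for v in subgraph_nodes)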
¡ DiffPool: We can also hierarchically cluster
nodes in graphs, and sum/avg the node
embeddings according to these clusters.
¡ Recall: the encoder as an embedding lookup
𝐙 = the embedding matrix; each column is the embedding vector of a specific node, and the number of rows is the dimension/size of the embeddings.
¡ Simplest node similarity: Nodes u, v are similar if they are connected by an edge
¡ This means: 𝐳_vᵀ 𝐳_u = A_{u,v}, which is the (u, v) entry of the graph adjacency matrix A
¡ Therefore, 𝐙ᵀ𝐙 = A
(Figure: for the 4-node example graph, the product 𝐙ᵀ𝐙 reproduces the adjacency matrix A; entry (u, v) is the dot product 𝐳_uᵀ 𝐳_v.)
¡ The embedding dimension d (number of rows in 𝐙) is much smaller than the number of nodes n.
¡ Exact factorization A = 𝐙ᵀ𝐙 is generally not possible
¡ However, we can learn 𝐙 approximately
¡ Objective: min_𝐙 ‖A − 𝐙ᵀ𝐙‖₂
§ We optimize 𝐙 such that it minimizes the L2 (Frobenius) norm of A − 𝐙ᵀ𝐙
§ Note: today we used a softmax loss instead of the L2 norm, but the goal of approximating A with 𝐙ᵀ𝐙 is the same.
¡ Conclusion: Inner product decoder with node
similarity defined by edge connectivity is
equivalent to matrix factorization of 𝐴.
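A sketch of this optimization with gradient descent, assuming PyTorch; the embedding dimension, learning rate, and step count are illustrative:

import torch

A = torch.tensor([[0, 1, 0, 1],
                  [1, 0, 0, 1],
                  [0, 0, 0, 1],
                  [1, 1, 1, 0]], dtype=torch.float)

d, n = 2, A.shape[0]
Z = torch.randn(d, n, requires_grad=True)          # d x |V| embedding matrix
opt = torch.optim.Adam([Z], lr=0.05)

for step in range(500):
    loss = torch.linalg.norm(A - Z.T @ Z)          # Frobenius norm of A - Z^T Z
    opt.zero_grad()
    loss.backward()
    opt.step()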
¡ DeepWalk and node2vec have a more
complex node similarity definition based on
random walks
¡ DeepWalk is equivalent to matrix
factorization of the following complex matrix
expression:
log( vol(G) · (1/T) Σ_{r=1}^{T} (D⁻¹A)ʳ · D⁻¹ ) − log b
where vol(G) = Σᵢ Σⱼ A_{i,j} is the volume of the graph, D is the diagonal degree matrix with D_{u,u} = deg(u), T is the context window size of the random walks, and b is the number of negative samples.
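A sketch of forming and factorizing this matrix with NumPy, following the DeepWalk-as-matrix-factorization result (Qiu et al., NetMF); the window size T, the number of negative samples b, the embedding dimension, and the element-wise log clipped at zero are all illustrative choices not spelled out on the slide:

import numpy as np
import networkx as nx

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
D_inv = np.diag(1.0 / A.sum(axis=1))            # D^-1 (inverse degree matrix)
vol = A.sum()                                   # vol(G) = sum_i sum_j A_ij
T, b, d = 10, 1, 8                              # window size, #negatives, dim

P = D_inv @ A                                   # random-walk transition matrix
power_sum = sum(np.linalg.matrix_power(P, r) for r in range(1, T + 1))
M = (vol / (T * b)) * power_sum @ D_inv         # matrix inside the log
M = np.log(np.maximum(M, 1.0))                  # element-wise log, clipped at 0

U, S, _ = np.linalg.svd(M)                      # factorize M to obtain embeddings
Z = U[:, :d] * np.sqrt(S[:d])                   # one d-dimensional row per node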
¡ Limitation: cannot obtain embeddings for nodes not in the training set.
§ Example: a newly added node 5 appears at test time (e.g., a new user in a social network). We cannot compute its embedding with DeepWalk / node2vec; we would need to recompute all node embeddings.
¡ Cannot capture structural similarity:
§ Example: two nodes in different parts of the graph can play the same structural role, but because they are far apart, random walks starting from them never overlap, so DeepWalk / node2vec give them very different embeddings.
¡ Cannot utilize node, edge, and graph features:
§ Example: each node may carry a feature vector (e.g., protein properties in a protein-protein interaction graph), but DeepWalk / node2vec embeddings do not incorporate such node features.