02-nodeemb

CS224W is a course on Machine Learning with Graphs taught by Jure Leskovec at Stanford University, focusing on graph representation learning and feature extraction for machine learning tasks. The course includes assignments using tools like NetworkX and PyTorch Geometric, with specific lectures on advanced topics in graph neural networks and link prediction. Key concepts include encoding nodes into embeddings to capture network similarities and utilizing random walks for estimating node relationships.

CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
¡ No class on November 7th (Election Day)
§ Lectures 13 (Advanced Topics in GNNs) to 17 (Link
Prediction and Causality) will be pushed back by one
class
§ Lecture 18 (Frontiers of GNN Research) will be
skipped
¡ First assignments released on course website:
Colab 0 and Colab 1
§ Links can be found under the Schedule section of
the website

¡ Colab 0 will be released today by 9PM on our
course website
¡ Colab 0:
§ Overview of NetworkX and PyTorch Geometric
§ Does not need to be handed in
§ TAs will hold a recitation session to walk you
through Colab 0:
§ Time: Friday (09/29), 3-4pm PT
§ Location: Zoom, link will be posted on Ed
§ Session will be recorded

¡ Colab 1 will be released today by 9PM on our
course website
¡ Colab 1:
§ Will cover material from Lectures 1-2,
so you can get started right away!
§ Due on Thursday 10/12 (2 weeks from today)
§ Submit written answers and code on Gradescope

CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
Given an input graph, extract node, link
and graph-level features, then learn a
model (SVM, neural network, etc.) that
maps features to labels.

[Pipeline figure: Input Graph → Structured Features → Learning Algorithm → Prediction.
Feature engineering (node-level, edge-level, and graph-level features) feeds the
downstream prediction task.]
Graph Representation Learning alleviates
the need to do feature engineering every
single time.
[Pipeline figure: Input Graph → Structured Features → Learning Algorithm → Prediction.
Representation learning automatically learns the features, replacing manual feature
engineering for the downstream prediction task.]
Goal: Efficient task-independent feature
learning for machine learning with graphs!

node vector
𝑢
𝑓: 𝑢 → ℝ!
ℝ!
Feature representation,
embedding

9/27/23 Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu 9
¡ Task: Map nodes into an embedding space
§ Similarity of embeddings between nodes indicates
their similarity in the network. For example:
§ Both nodes are close to each other (connected by an edge)
§ Encode network information
§ Potentially used for many downstream predictions

The embeddings in ℝ^d can serve many downstream tasks:
• Node classification
• Link prediction
• Graph classification
• Anomalous node detection
• Clustering
• …
Example
¡ 2D embedding of the nodes of Zachary's Karate Club network:

Image from: Perozzi et al. DeepWalk: Online Learning of Social Representations. KDD 2014.
CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
¡ Assume we have an (undirected) graph G:
§ V is the vertex set.
§ A is the adjacency matrix (assume binary).
§ For simplicity: No node features or extra
information is used
Example: V = {1, 2, 3, 4}, edges {(1,2), (1,4), (2,4), (3,4)}

A = [[0, 1, 0, 1],
     [1, 0, 0, 1],
     [0, 0, 0, 1],
     [1, 1, 1, 0]]
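For concreteness, here is a small sketch (not part of the slides) that builds this example graph with NetworkX and recovers the same binary adjacency matrix:

```python
import networkx as nx

# The 4-node example graph from the slide: edges (1,2), (1,4), (2,4), (3,4)
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 4), (2, 4), (3, 4)])

# Binary adjacency matrix with nodes ordered 1..4
A = nx.to_numpy_array(G, nodelist=[1, 2, 3, 4], dtype=int)
print(A)
# [[0 1 0 1]
#  [1 0 0 1]
#  [0 0 0 1]
#  [1 1 1 0]]
```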
¡ Goal is to encode nodes so that similarity in
the embedding space (e.g., dot product)
approximates similarity in the graph

Goal: similarity(u, v) ≈ z_v^T z_u
(similarity in the original network ≈ similarity of the embeddings)

We still need to define the similarity function!
1. Encoder maps from nodes to embeddings
2. Define a node similarity function (i.e., a
measure of similarity in the original network)
3. Decoder DEC maps from embeddings to the
similarity score
4. Optimize the parameters of the encoder so
that: DEC(z_v^T z_u) ≈ similarity(u, v)
(similarity in the original network ≈ similarity of the embeddings)
¡ Encoder: maps each node to a low-dimensional
(d-dimensional) embedding vector:
ENC(v) = z_v,   where v is a node in the input graph
¡ Similarity function: specifies how the
relationships in vector space map to the
relationships in the original network:
similarity(u, v) ≈ z_v^T z_u
(similarity of u and v in the original network ≈ dot product between node
embeddings; the dot product acts as the decoder)
Simplest encoding approach: Encoder is just an
embedding-lookup
ENC(v) = z_v = Z · v

Z ∈ ℝ^(d × |V|): matrix whose columns are the node
embeddings (this is what we learn / optimize)
v ∈ 𝕀^|V|: indicator (one-hot) vector, all zeroes
except a one in the column indicating node v
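In code, such a shallow encoder is nothing more than an embedding lookup. A minimal PyTorch sketch (the sizes and node indices below are illustrative, not from the slides); note that torch.nn.Embedding stores one row per node, whereas the slides write Z with one column per node:

```python
import torch
import torch.nn as nn

num_nodes, d = 4, 16              # illustrative sizes
Z = nn.Embedding(num_nodes, d)    # each row of Z.weight is a node embedding z_v

def ENC(v: int) -> torch.Tensor:
    """Shallow encoder: simply look up the embedding of node v."""
    return Z(torch.tensor(v))

z_u, z_v = ENC(0), ENC(3)
score = torch.dot(z_u, z_v)       # decoder: dot-product similarity z_u^T z_v
```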
Simplest encoding approach: encoder is just an
embedding-lookup
Z is the embedding matrix: one column per node; each
column is the embedding vector of a specific node, and the
number of rows is the dimension/size of the embeddings.
Simplest encoding approach: Encoder is just an
embedding-lookup

Each node is assigned a unique embedding vector
(i.e., we directly optimize the embedding of each node).

Many methods: DeepWalk, node2vec
¡ Encoder + Decoder Framework
§ Shallow encoder: embedding lookup
§ Parameters to optimize: Z, which contains the node
embeddings z_u for all nodes u ∈ V
§ We will cover deep encoders in the lectures on GNNs
§ Decoder: based on node similarity
§ Objective: maximize z_v^T z_u for node pairs (u, v)
that are similar
¡ The key design choice across methods is how they
define node similarity.
¡ Should two nodes have a similar embedding if
they…
§ are linked?
§ share neighbors?
§ have similar “structural roles”?
¡ We will now learn a node similarity definition that uses
random walks, and how to optimize embeddings for
such a similarity measure.
¡ This is an unsupervised/self-supervised way of
learning node embeddings.
§ We are not utilizing node labels
§ We are not utilizing node features
§ The goal is to directly estimate a set of coordinates
(i.e., the embedding) of a node so that some aspect
of the network structure (captured by DEC) is
preserved.
¡ These embeddings are task independent:
§ They are not trained for a specific task but can be
used for any task.
CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
¡ Vector z_u:
§ The embedding of node u (what we aim to find).
¡ Probability P(v | z_u): our model's prediction based on z_u
§ The (predicted) probability of visiting node v on
random walks starting from node u.

Non-linear functions used to produce predicted probabilities:

¡ Softmax function:
§ Turns a vector of K real values (model predictions) into
K probabilities that sum to 1:
σ(z)[i] = exp(z[i]) / Σ_{j=1}^{K} exp(z[j])
¡ Sigmoid function:
§ S-shaped function that turns real values into the range (0, 1).
Written as S(x) = 1 / (1 + e^(−x)).
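As a quick sanity check, here is a minimal NumPy sketch of the two functions exactly as defined above (not from the slides):

```python
import numpy as np

def softmax(z):
    """Turn K real scores into K probabilities that sum to 1."""
    e = np.exp(z - np.max(z))      # subtracting the max is only for numerical stability
    return e / e.sum()

def sigmoid(x):
    """Squash a real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(softmax(np.array([2.0, 1.0, 0.1])))   # ≈ [0.659 0.242 0.099]
print(sigmoid(0.0))                          # 0.5
```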
[Figure: a short random walk (5 steps) over a small example graph with nodes 1–12.]

Given a graph and a starting point, we select a neighbor
of it at random and move to this neighbor; then we select a
neighbor of this point at random and move to it, etc.
The (random) sequence of points visited this way is a
random walk on the graph.
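A uniform random walk is only a few lines of code. A minimal sketch with NetworkX (not from the slides; the Karate Club graph and walk length are just for illustration):

```python
import random
import networkx as nx

def random_walk(G: nx.Graph, start, walk_length: int):
    """Uniform random walk: repeatedly move to a randomly chosen neighbor."""
    walk = [start]
    for _ in range(walk_length):
        neighbors = list(G.neighbors(walk[-1]))
        if not neighbors:            # dead end (isolated node)
            break
        walk.append(random.choice(neighbors))
    return walk

G = nx.karate_club_graph()           # Zachary's Karate Club, as in the earlier example
print(random_walk(G, start=0, walk_length=5))
```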
z_u^T z_v ≈ probability that u and v co-occur on
a random walk over the graph
1. Estimate probability of visiting node 𝒗 on a
random walk starting from node 𝒖 using
some random walk strategy 𝑹

2. Optimize embeddings to encode these
random walk statistics:
similarity in the embedding space (here the dot product,
which equals cos(θ) for normalized embeddings) encodes
random walk “similarity”
1. Expressivity: Flexible stochastic definition of
node similarity that incorporates both local
and higher-order neighborhood information
Idea: if a random walk starting from node u
visits v with high probability, then u and v are
similar (higher-order, multi-hop information)

2. Efficiency: Do not need to consider all node
pairs when training; only need to consider
pairs that co-occur on random walks
¡ Intuition: Find an embedding of nodes in
d-dimensional space that preserves similarity

¡ Idea: Learn node embeddings such that nodes that are
nearby in the network are close together in the embedding space

¡ Given a node u, how do we define nearby
nodes?
§ N_R(u) … neighbourhood of u obtained by some
random walk strategy R
¡ Given G = (V, E),
¡ our goal is to learn a mapping f: u → ℝ^d:
f(u) = z_u
¡ Log-likelihood objective:

max_f Σ_{u∈V} log P(N_R(u) | z_u)

§ N_R(u) is the neighborhood of node u obtained by strategy R

¡ Given node u, we want to learn feature
representations that are predictive of the nodes
in its random walk neighborhood N_R(u).
1. Run short fixed-length random walks
starting from each node u in the graph using
some random walk strategy R.
2. For each node u collect N_R(u), the multiset*
of nodes visited on random walks starting
from u.
3. Optimize embeddings according to: given
node u, predict its neighbors N_R(u).

max_f Σ_{u∈V} log P(N_R(u) | z_u)    (maximum likelihood objective)

*N_R(u) can have repeat elements since nodes can be visited multiple times on random walks
Equivalently,

ℒ = Σ_{u∈V} Σ_{v∈N_R(u)} −log(P(v | z_u))

• Intuition: Optimize embeddings z_u to minimize the
negative log-likelihood of random walk co-occurrences.
• Parameterize P(v | z_u) using the softmax:

P(v | z_u) = exp(z_u^T z_v) / Σ_{n∈V} exp(z_u^T z_n)

Why softmax? We want node v to be the most similar to
node u (out of all nodes n).
Intuition: Σ_i exp(x_i) ≈ max_i exp(x_i)
Putting it all together:

ℒ = Σ_{u∈V} Σ_{v∈N_R(u)} −log( exp(z_u^T z_v) / Σ_{n∈V} exp(z_u^T z_n) )

The outer sum runs over all nodes u; the inner sum runs over the nodes v
seen on random walks starting from u; the fraction is the predicted
probability of u and v co-occurring on a random walk.

Optimizing random walk embeddings =
finding embeddings z_u that minimize ℒ
But doing this naively is too expensive!

ℒ = Σ_{u∈V} Σ_{v∈N_R(u)} −log( exp(z_u^T z_v) / Σ_{n∈V} exp(z_u^T z_n) )

The nested sum over all nodes gives O(|V|²) complexity!
But doing this naively is too expensive!

ℒ = Σ_{u∈V} Σ_{v∈N_R(u)} −log( exp(z_u^T z_v) / Σ_{n∈V} exp(z_u^T z_n) )

The normalization term from the softmax (the denominator) is
the culprit… can we approximate it?
¡ Solution: Negative sampling

Why is the approximation valid? Technically, this is a different objective,
but negative sampling is a form of Noise Contrastive Estimation (NCE),
which approximately maximizes the log-probability of the softmax. The new
formulation corresponds to using logistic regression (the sigmoid function)
to distinguish the target node v from nodes n_i sampled from a background
distribution P_V. More at https://arxiv.org/pdf/1402.3722.pdf

−log( exp(z_u^T z_v) / Σ_{n∈V} exp(z_u^T z_n) )
  ≈ −( log σ(z_u^T z_v) + Σ_{i=1}^{k} log σ(−z_u^T z_{n_i}) ),   n_i ~ P_V

Here σ is the sigmoid function (it makes each term a “probability”
between 0 and 1) and P_V is a random distribution over nodes.
Instead of normalizing w.r.t. all nodes, we just
normalize against k random “negative samples” n_i.
¡ Negative sampling allows for quick likelihood calculation.
log( exp(z_u^T z_v) / Σ_{n∈V} exp(z_u^T z_n) )
  ≈ log σ(z_u^T z_v) + Σ_{i=1}^{k} log σ(−z_u^T z_{n_i}),   n_i ~ P_V
(P_V is a random distribution over nodes)

§ Sample k negative nodes n_i, each with probability
proportional to its degree.
§ Two considerations for k (the number of negative samples):
1. Higher k gives more robust estimates
2. Higher k corresponds to higher bias toward negative events
In practice k = 5–20.
Can a negative sample be any node, or only nodes not on the walk?
People often use any node (for efficiency); however, the
most “correct” way is to use nodes not on the walk.
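The per-pair loss above is straightforward to write down. A minimal PyTorch sketch (not the course's reference implementation; the sizes, degrees, and the single (u, v) pair are illustrative), with negatives drawn proportional to node degree as described:

```python
import torch
import torch.nn.functional as F

def neg_sampling_loss(Z, u, v, deg, k=5):
    """-[log σ(z_u^T z_v) + Σ_i log σ(-z_u^T z_{n_i})] for one co-occurring pair (u, v).

    Z:   (num_nodes, d) embedding matrix (row i is z_i)
    deg: (num_nodes,) node degrees; negatives n_i are sampled proportional to degree
    """
    neg = torch.multinomial(deg.float(), k, replacement=True)  # n_i ~ P_V
    pos_score = Z[u] @ Z[v]                                    # z_u^T z_v
    neg_score = Z[neg] @ Z[u]                                  # z_u^T z_{n_i}, one per negative
    return -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum())

# Toy usage: one SGD step on a single pair that co-occurred on a random walk
num_nodes, d = 34, 16
Z = torch.randn(num_nodes, d, requires_grad=True)
deg = torch.ones(num_nodes)                 # placeholder degrees, for illustration only
opt = torch.optim.SGD([Z], lr=0.01)

loss = neg_sampling_loss(Z, u=0, v=1, deg=deg)
loss.backward()
opt.step()
```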
§ After we have obtained the objective function, how do
we optimize (minimize) it?

ℒ = Σ_{u∈V} Σ_{v∈N_R(u)} −log(P(v | z_u))

§ Gradient descent: a simple way to minimize ℒ:
§ Initialize z_u at some random value for all nodes u.
§ Iterate until convergence:
§ For all u, compute the derivative ∂ℒ/∂z_u.
§ For all u, take a step in the direction opposite to the derivative:
z_u ← z_u − η ∂ℒ/∂z_u    (η: learning rate)
§ Stochastic gradient descent: instead of evaluating the
gradient over all examples, evaluate it for each
individual training example.
§ Initialize z_u at some random value for all nodes u.
§ Iterate until convergence:

ℒ^(u) = Σ_{v∈N_R(u)} −log(P(v | z_u))

§ Sample a node u; for all v, calculate the derivative ∂ℒ^(u)/∂z_v.
§ For all v, update: z_v ← z_v − η ∂ℒ^(u)/∂z_v.
1. Run short fixed-length random walks starting
from each node on the graph
2. For each node 𝑢 collect 𝑁& (𝑢), the multiset of
nodes visited on random walks starting from 𝑢.
3. Optimize embeddings using Stochastic
Gradient Descent:
ℒ = Σ_{u∈V} Σ_{v∈N_R(u)} −log(P(v | z_u))

We can efficiently approximate this using negative sampling!
¡ So far we have described how to optimize
embeddings given a random walk strategy R
¡ What strategies should we use to run these
random walks?
§ Simplest idea: just run fixed-length, unbiased
random walks starting from each node (i.e.,
DeepWalk from Perozzi et al., 2014); a short code sketch follows below
§ The issue is that such a notion of similarity is too constrained
¡ How can we generalize this?
Reference: Perozzi et al. 2014. DeepWalk: Online Learning of Social Representations. KDD.
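A common way to prototype DeepWalk is to treat random walks as "sentences" of node IDs and train a skip-gram model on them. A minimal sketch using gensim's Word2Vec (gensim is an assumption here, not a course tool; walk counts, lengths, and dimensions are illustrative):

```python
import random
import networkx as nx
from gensim.models import Word2Vec     # assumes gensim >= 4 is installed

def random_walk(G, start, length):
    walk = [start]
    for _ in range(length):
        nbrs = list(G.neighbors(walk[-1]))
        if not nbrs:
            break
        walk.append(random.choice(nbrs))
    return [str(n) for n in walk]       # Word2Vec expects string tokens

G = nx.karate_club_graph()
# Steps 1-2: short fixed-length, unbiased walks starting from every node
walks = [random_walk(G, u, length=10) for u in G.nodes() for _ in range(20)]

# Step 3: skip-gram with negative sampling over the walk "sentences"
model = Word2Vec(walks, vector_size=64, window=5, sg=1, negative=5, min_count=0, epochs=5)
z_u = model.wv["0"]                     # learned embedding of node 0
```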
¡ Goal: Embed nodes with similar network
neighborhoods close in the feature space.
¡ We frame this goal as a maximum likelihood
optimization problem, independent of the
downstream prediction task.
¡ Key observation: Flexible notion of network
neighborhood 𝑁& (𝑢) of node 𝑢 leads to rich node
embeddings
¡ Develop biased 2nd order random walk 𝑅 to
generate network neighborhood 𝑁& (𝑢) of node 𝑢
Reference: Grover et al. 2016. node2vec: Scalable Feature Learning for Networks. KDD.
Idea: use flexible, biased random walks that can
trade off between local and global views of the
network (Grover and Leskovec, 2016).

[Figure 1 from the node2vec paper: BFS and DFS search strategies from node u (k = 3).]
Two classic strategies to define a neighborhood
N_R(u) of a given node u:

[Figure 1 from the node2vec paper: BFS and DFS search strategies from node u (k = 3).]

Walks of length 3 (N_R(u) of size 3):
N_BFS(u) = {s1, s2, s3}   Local, microscopic view
N_DFS(u) = {s4, s5, s6}   Global, macroscopic view
BFS gives a micro-view of the neighbourhood of u;
DFS gives a macro-view of the neighbourhood of u.
Biased fixed-length random walk R that, given a
node u, generates neighborhood N_R(u)
¡ Two parameters:
§ Return parameter 𝒑:
§ Return back to the previous node
§ In-out parameter 𝒒:
§ Moving outwards (DFS) vs. inwards (BFS) from the
previous node
§ Intuitively, 𝑞 is the “ratio” of BFS vs. DFS

Biased 2nd-order random walks explore network
neighborhoods:
§ Suppose the walk just traversed edge (s1, w) and is now at w
§ Insight: the neighbors of w can only be
• at the same distance to s1 (e.g., s2),
• farther from s1 (e.g., s3), or
• back at s1 itself.

Idea: remember where the walk came from
¡ The walker came over edge (s1, w) and is at w.
Where to go next?

[Figure: from w, the unnormalized transition probabilities are 1/p back to s1,
1 to s2 (same distance from s1), and 1/q to s3 and s4 (farther from s1).]

¡ p, q model the transition probabilities
§ p … return parameter
§ q … “walk away” parameter
¡ The walker came over edge (s1, w) and is at w.
Where to go next?

Unnormalized transition probabilities from w, segmented by distance from s1:

Target t | Prob. | Dist(s1, t)
s1       | 1/p   | 0
s2       | 1     | 1
s3       | 1/q   | 2
s4       | 1/q   | 2

§ BFS-like walk: low value of p
§ DFS-like walk: low value of q

N_R(u) are the nodes visited by the biased walk
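One step of this second-order walk can be sketched directly from the table above (a simple illustration, not the node2vec reference implementation, which precomputes and alias-samples these probabilities):

```python
import random
import networkx as nx

def biased_step(G, prev, curr, p=1.0, q=1.0):
    """One node2vec step: pick the next node, given the walk came from prev to curr."""
    candidates, weights = [], []
    for nxt in G.neighbors(curr):
        if nxt == prev:                   # distance 0 from prev: return
            w = 1.0 / p
        elif G.has_edge(nxt, prev):       # distance 1 from prev: stay close (BFS-like)
            w = 1.0
        else:                             # distance 2 from prev: move away (DFS-like)
            w = 1.0 / q
        candidates.append(nxt)
        weights.append(w)
    return random.choices(candidates, weights=weights, k=1)[0]

G = nx.karate_club_graph()
print(biased_step(G, prev=0, curr=1, p=0.5, q=2.0))
```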
¡ 1) Compute random walk probabilities
¡ 2) Simulate 𝑟 random walks of length 𝑙 starting
from each node 𝑢
¡ 3) Optimize the node2vec objective using
Stochastic Gradient Descent

¡ Linear-time complexity
¡ All 3 steps are individually parallelizable

¡ Different kinds of biased random walks:
§ Based on node attributes (Dong et al., 2017).
§ Based on learned weights (Abu-El-Haija et al., 2017)

¡ Alternative optimization schemes:


§ Directly optimize based on 1-hop and 2-hop random walk
probabilities (as in LINE from Tang et al. 2015).

¡ Network preprocessing techniques:
§ Run random walks on modified versions of the original
network (e.g., Ribeiro et al. 2017's struc2vec, Chen et al.
2016's HARP).
¡ Core idea: Embed nodes so that distances in
embedding space reflect node similarities in
the original network.
¡ Different notions of node similarity:
§ Naïve: Similar if two nodes are connected
§ Random walk approaches (covered today)

¡ So what method should I use?
¡ No single method wins in all cases…
§ E.g., node2vec performs better on node classification
while alternative methods perform better on link
prediction (Goyal and Ferrara, 2017 survey).
¡ Random walk approaches are generally more
efficient.
¡ In general: Must choose definition of node
similarity that matches your application.

CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
¡ Goal: we want to embed a subgraph or an entire
graph G. Graph embedding: z_G.

¡ Example tasks:
§ Classifying toxic vs. non-toxic molecules
§ Identifying anomalous graphs
Simple (but effective) approach 1:
¡ Run a standard graph embedding
technique on the (sub)graph G.
¡ Then just sum (or average) the node
embeddings in the (sub)graph G:

z_G = Σ_{v∈G} z_v

¡ Used by Duvenaud et al., 2016 to classify
molecules based on their graph structure.
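Approach 1 is a one-liner once node embeddings are available. A minimal sketch (illustrative sizes; "sum" vs. "mean" is the only choice to make):

```python
import torch

def graph_embedding(Z: torch.Tensor, node_ids, reduce: str = "sum") -> torch.Tensor:
    """Approach 1: aggregate the node embeddings of a (sub)graph into z_G."""
    z = Z[list(node_ids)]                       # (|G|, d) node embeddings
    return z.sum(dim=0) if reduce == "sum" else z.mean(dim=0)

Z = torch.randn(34, 16)                         # e.g., embeddings learned with node2vec
z_G = graph_embedding(Z, node_ids=range(34))    # embedding of the whole graph
```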
¡ Approach 2: Introduce a “virtual node” to
represent the (sub)graph and run a standard
graph embedding technique

¡ Proposed by Li et al., 2016 as a general
technique for subgraph embedding.
We discussed three ideas for graph embeddings:

¡ Approach 1: Embed nodes and sum/average them.

¡ Approach 2: Create a super-node that spans the
(sub)graph and then embed that node.

(The third idea, DiffPool, follows on the next slide.)
¡ DiffPool: We can also hierarchically cluster
nodes in graphs, and sum/avg the node
embeddings according to these clusters.

CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
¡ Recall: the encoder is an embedding lookup.
Z is the embedding matrix with one column per node; each
column is the embedding vector of a specific node, and the
number of rows is the embedding dimension.

Objective: maximize z_v^T z_u for node pairs (u, v) that are similar
¡ Simplest node similarity: nodes u, v are
similar if they are connected by an edge.
¡ This means: z_v^T z_u = A_{u,v},
which is the (u, v) entry of the graph
adjacency matrix A.
¡ Therefore, Z^T Z = A.

[Figure: the 4-node example graph, its adjacency matrix A, and the product
Z^T Z, whose (u, v) entry is the dot product z_v^T z_u.]
¡ The embedding dimension d (the number of rows in Z)
is much smaller than the number of nodes n.
¡ Exact factorization A = Z^T Z is generally not possible.
¡ However, we can learn Z approximately.
¡ Objective: min_Z ‖A − Z^T Z‖₂
§ We optimize Z such that it minimizes the L2 norm
(Frobenius norm) of A − Z^T Z.
§ Note: today we used the softmax instead of the L2 norm, but the goal of
approximating A with Z^T Z is the same.
¡ Conclusion: an inner-product decoder with node
similarity defined by edge connectivity is
equivalent to matrix factorization of A.
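This objective can be optimized directly with gradient descent. A minimal PyTorch sketch on the 4-node example graph (d = 2 and the optimizer settings are illustrative):

```python
import torch

# Adjacency matrix of the 4-node example graph
A = torch.tensor([[0., 1., 0., 1.],
                  [1., 0., 0., 1.],
                  [0., 0., 0., 1.],
                  [1., 1., 1., 0.]])

d, n = 2, A.shape[0]                         # embedding dimension d << n in general
Z = torch.randn(d, n, requires_grad=True)    # one column per node, as in the slides
opt = torch.optim.Adam([Z], lr=0.05)

for step in range(500):
    loss = torch.norm(A - Z.T @ Z, p="fro") ** 2   # ||A - Z^T Z||_F^2
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item())    # residual of the approximate factorization
```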
¡ DeepWalk and node2vec have a more
complex node similarity definition based on
random walks.
¡ DeepWalk is equivalent to matrix
factorization of the following, more complex matrix
expression:

log( vol(G) · (1/T) · Σ_{r=1}^{T} (D⁻¹A)^r · D⁻¹ ) − log b

§ The terms of this expression are explained on the next slide.

Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec, WSDM 18
In the expression
log( vol(G) · (1/T) · Σ_{r=1}^{T} (D⁻¹A)^r · D⁻¹ ) − log b :
§ vol(G) = Σ_i Σ_j A_{i,j} is the volume of the graph.
§ D is the diagonal degree matrix with D_{u,u} = deg(u);
(D⁻¹A)^r is the r-th power of the normalized adjacency matrix (see Lecture 3, slide 30).
§ T = |N_R(u)| is the context window size.
§ b is the number of negative samples.

¡ node2vec can also be formulated as a matrix
factorization (albeit of a more complex matrix).
¡ Refer to the paper for more details:
Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec, WSDM 18
¡ How to use embeddings z_i of nodes:
§ Clustering/community detection: cluster the points z_i.
§ Node classification: predict the label of node i based on z_i.
§ Link prediction: predict edge (i, j) based on (z_i, z_j).
§ We can concatenate, average, take a product, or take a difference
of the two embeddings:
§ Concatenate: f(z_i, z_j) = g([z_i, z_j])
§ Hadamard: f(z_i, z_j) = g(z_i ∗ z_j)   (per-coordinate product)
§ Sum/Avg: f(z_i, z_j) = g(z_i + z_j)
§ Distance: f(z_i, z_j) = g(‖z_i − z_j‖₂)
§ Graph classification: graph embedding z_G via aggregating
node embeddings or a virtual node;
predict the label based on the graph embedding z_G.
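These edge-level operators are easy to sketch; any of them can feed a downstream classifier g(·) for link prediction (illustrative dimensions, NumPy used for brevity):

```python
import numpy as np

def edge_features(z_i: np.ndarray, z_j: np.ndarray, op: str) -> np.ndarray:
    """Combine two node embeddings into an edge representation for link prediction."""
    if op == "concat":
        return np.concatenate([z_i, z_j])
    if op == "hadamard":
        return z_i * z_j                              # per-coordinate product
    if op == "avg":
        return (z_i + z_j) / 2
    if op == "distance":
        return np.array([np.linalg.norm(z_i - z_j, ord=2)])
    raise ValueError(f"unknown operator: {op}")

z_i, z_j = np.random.randn(16), np.random.randn(16)
x_edge = edge_features(z_i, z_j, "hadamard")          # feed x_edge into a classifier g(·)
```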
We discussed graph representation learning, a way to
learn node and graph embeddings for downstream
tasks, without feature engineering.
¡ Encoder-decoder framework:
§ Encoder: embedding lookup
§ Decoder: predict score based on embedding to match
node similarity
¡ Node similarity measure: (biased) random walk
§ Examples: DeepWalk, Node2Vec

¡ Extension to graph embeddings: node embedding
aggregation
Limitations of node embeddings via matrix
factorization and random walks
§ Transductive (not inductive) method: Cannot
obtain embeddings for nodes not in the training
set. Cannot apply to new graphs, evolving graphs.

[Figure: a training graph on nodes 1–4, with a new node 5 added at test time
(e.g., a new user in a social network).]

We cannot compute the embedding of the newly added node with
DeepWalk / node2vec; we would need to recompute all node embeddings.
¡ Cannot capture structural similarity:

[Figure: two distant parts of a graph, one containing nodes 1–5 and the other
containing nodes 10–13.]

§ Nodes 1 and 11 are structurally similar – each is part of a
triangle, has degree 2, …
§ However, they have very different embeddings:
it is unlikely that a random walk will reach node 11 from node 1.

¡ DeepWalk and node2vec do not capture
structural similarity.
¡ Cannot utilize node, edge, and graph features:

[Figure: the example graph with a feature vector attached to each node
(e.g., protein properties in a protein–protein interaction graph).]

DeepWalk / node2vec embeddings do not incorporate
such node features.

Solution to these limitations: deep representation
learning and graph neural networks
(to be covered in depth next).
