03 Matrix
node/link
• Level 1: diameter, connectivity, graph-level classification, graph-level embedding, graph kernel, graph structure learning, graph generator,…
• Level 2: frequent subgraphs, clustering, community detection, motif, teams, dense subgraphs, subgraph matching, NetFair, …
• Level 3: node proximity, node classification, link prediction, anomaly detection, node embedding, network alignment, NetFair, …
• Beyond: network of X, …
2
Matrix & Tensor Tools
• Matrix Tools
– Proximity (covered in Lecture 2)
– Low-rank approximation
– Co-clustering
• Tensor Tools
3
Motivation
• Q: How to find patterns?
– e.g., communities, anomalies, etc.
• A (Common Approach): Low-Rank
Approximation (LRA) for Adjacency Matrix.
A ≈ L × M × R
4
Hanghang Tong, Spiros Papadimitriou, Jimeng Sun, Philip S. Yu, Christos Faloutsos: Colibri: fast
mining of large static and dynamic graphs. KDD 2008: 686-694
LRA for Graph Mining
Example: Author × Conference adjacency matrix A
(authors: John, Tom, Bob, Carl, Van, Roy; conferences: ICDM, KDD, ISMB, RECOMB)

        ICDM  KDD  ISMB  RECOMB
John      1    1    0     0
Tom       1    1    0     0
Bob       1    1    0     0
Carl      0    1    1     1
Van       0    0    1     1
Roy       0    0    1     1
5
LRA for Graph Mining: Communities
Adjacency matrix A (authors John, Tom, Bob, Carl, Van, Roy × conferences ICDM, KDD, ISMB, RECOMB) ≈ L × M × R
– L: Author-Group matrix
– M: Group-Group Interaction matrix
– R: Conf.-Group matrix
– The groups found in L and R are the communities
6
LRA for Graph Mining: Anomalies
The same Author × Conf. adjacency matrix: A ≈ L × M × R
Recon. error is high
→ ‘Carl’ is abnormal
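To make this concrete, here is a minimal Python/numpy sketch on the toy author-conference matrix above; it uses a rank-2 truncated SVD as the low-rank approximation (Colibri/CUR would instead build L from actual columns of A), and the per-row reconstruction error singles out 'Carl':

    import numpy as np

    authors = ['John', 'Tom', 'Bob', 'Carl', 'Van', 'Roy']
    # Rows: authors; columns: ICDM, KDD, ISMB, RECOMB (the toy matrix above)
    A = np.array([[1, 1, 0, 0],
                  [1, 1, 0, 0],
                  [1, 1, 0, 0],
                  [0, 1, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

    # Rank-2 low-rank approximation (truncated SVD stands in for L x M x R here)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]

    # Per-author (row-wise) reconstruction error: 'Carl' has the largest one
    err = np.linalg.norm(A - A2, axis=1)
    print(dict(zip(authors, err.round(3))))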
7
Challenges – Problem 1
• Prob. 1: Given a static graph A
– (C1) How to get (L, M, R) efficiently?
  • Both time and space
– (C2) What is the interpretation of (L, M, R)?
8
Challenges – Problem 2
• Prob. 2: Given a dynamic graph A_t (t = 1, 2, …)
– (C3) How to get (L_t, M_t, R_t) incrementally?
  • Track patterns over time
9
Roadmap - LRA
• Motivation
• Survey: Existing Methods
– SVD
– CUR/CX
– CMD
• Proposed Methods: Colibri
• Experimental Results
• Conclusion
10
Overview
A ≈ L × M × R
– Find L: this is where the different methods differ
– M and R: the projection of A onto the subspace spanned by L (same for different methods)
11
Matrix & Vector
• Matrix B: rows = authors (Philip Yu, William Cohen, John Smith), columns = conferences (SIGMOD, ICML); each entry counts an author's papers at that venue (e.g., Philip Yu: 3 SIGMOD papers, 1 ICML paper; John Smith: 0 and 0). Each author is thus a point/vector in the SIGMOD-ICML plane.
12
Column Space
• The same matrix B: its column space is the subspace spanned by B's columns (the SIGMOD column and the ICML column).
13
Projection & Projection Matrix
The projection of a vector v (e.g., a KDD column) onto the column space of B:
  ṽ = B (B^T B)^+ B^T v

Core Matrix
Projecting every column of A in the same way gives the low-rank approximation
  Ã = B (B^T B)^+ B^T A = L × M × R,  with L = B, M = (B^T B)^+ (the "core matrix"), R = B^T A
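A small numpy sketch of this projection and of the (L, M, R) factors; the matrix B below is only in the spirit of the running example (its entries are placeholders, not necessarily the slide's exact values):

    import numpy as np

    # Toy author-venue matrix B (rows: three authors; columns: SIGMOD, ICML)
    B = np.array([[3.0, 1.0],
                  [1.0, 1.0],
                  [0.0, 0.0]])

    # Core matrix and projection onto the column space of B: P = B (B^T B)^+ B^T
    M = np.linalg.pinv(B.T @ B)          # core matrix M = (B^T B)^+
    P = B @ M @ B.T

    v = np.array([1.0, 2.0, 3.0])        # some new column, e.g. a "KDD" column
    v_tilde = P @ v                      # its projection onto span(B)

    # For a whole matrix A: A ~ L M R with L = B, M = (B^T B)^+, R = B^T A
    A = np.random.rand(3, 5)
    L, R = B, B.T @ A
    A_tilde = L @ M @ R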
15
Roadmap
• Motivation
• Survey: Existing Methods
– SVD
– CUR/CX
– CMD
• Proposed Methods: Colibri
• Experimental Results
• Conclusion
16
Singular-Value-Decomposition (SVD)
A = [a_1 a_2 … a_m] ≈ σ_1 u_1 v_1^T + … + σ_k u_k v_k^T = U_k Σ_k V_k^T
– u_1, …, u_k: left singular vectors; v_1, …, v_k: right singular vectors
– σ_1 ≥ … ≥ σ_k: singular values
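For concreteness, a minimal numpy sketch of a truncated (rank-k) SVD on a toy random matrix (not the lecture's data):

    import numpy as np

    A = np.random.rand(100, 80)                   # toy matrix
    k = 5                                         # target rank

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation

    # Relative reconstruction error in the Frobenius norm
    print(np.linalg.norm(A - A_k, 'fro') / np.linalg.norm(A, 'fro'))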
18
SVD: advantages
• Optimal Low-Rank Approximation
– In both the L2 (spectral) norm and the Frobenius norm (L_F)
19
SVD: drawbacks
• (C1) Efficiency (A = U Σ V^T)
– Time: O(min(n^2 m, n m^2)) [footnote: or O(|E| · #iterations) for iterative methods on sparse graphs]
– Space: (U, V) are dense
• (C2) Interpretation
20
SVD: drawbacks
• (C3) Dynamic: not easy
A_t = U_t Σ_t V_t^T  →  A_{t+1} = U_{t+1} Σ_{t+1} V_{t+1}^T (recomputed from scratch)
21
Roadmap
• Motivation
• Survey: Existing Methods
– SVD
– CUR/CX
– CMD
• Proposed Methods: Colibri
• Experimental Results
• Conclusion
22
CUR (CX) decomposition
[Drineas+ 2005]
A ≈ C (C^T C)^+ C^T A
• Sample columns from A (A: n × m; C: the sampled columns)
• Project A onto them
– Left matrix: C
– Middle matrix U: (C^T C)^+
– Right matrix R: C^T A
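A minimal sketch of the CX idea, using squared-column-norm sampling (a common choice; the published CUR/CX algorithms also rescale the sampled columns, which is omitted here, so this is illustrative rather than the exact procedure of [Drineas+ 2005]):

    import numpy as np

    def cx_decomposition(A, c, seed=0):
        # Sample c columns with probability proportional to squared column norms
        rng = np.random.default_rng(seed)
        p = (A ** 2).sum(axis=0)
        p = p / p.sum()
        idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
        C = A[:, idx]                      # left matrix: sampled columns
        U = np.linalg.pinv(C.T @ C)        # middle matrix: (C^T C)^+
        R = C.T @ A                        # right matrix: C^T A
        return C, U, R

    A = np.random.rand(200, 150)
    C, U, R = cx_decomposition(A, c=30)
    A_tilde = C @ U @ R                    # low-rank reconstruction of A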
CUR (CX): advantages
• (C0) Quality: Near-Optimal
• (C1) Efficiency (better than SVD)
– Time: O(c^2 n) or O(c^3 + c m)
  • (c is # of sampled columns)
– Space (C, R) are sparse
• (C2) Interpretation
24
CUR (CX): drawbacks
• (C1) Redundancy in C
– e.g., the sampled C may contain 3 copies of one column ('green'), 2 copies of another ('red'), and 2 copies of a third ('purple'), where purple = 0.5 × green + red, i.e., the columns are linearly dependent
25
Redundant Columns Do Not Help
(e.g., sampling the KDD column several times alongside the ICML, SIGMOD, VLDB columns spans exactly the same subspace)
Observations:
#1: Does not help the approximation
#2: Wastes time & space
26
CUR (CX): drawbacks
• (C3) Dynamic: not easy
– The columns sampled at time t (C_t) do not directly give the columns needed at time t+1 (C_{t+1})
27
Roadmap
• Motivation
• Survey: Existing Methods
– SVD
– CUR/CX
– CMD
• Proposed Methods: Colibri
• Experimental Results
• Conclusion
28
CMD [Sun+ 2007]
CUR (CX) → CMD: remove duplicate columns from the sampled C
– CUR keeps, e.g., 3 copies of green, 2 copies of red, 2 copies of purple (purple = 0.5 × green + red)
– CMD keeps a single copy of each distinct sampled column; linearly dependent but non-identical columns (like purple) remain
– Left matrix: the deduplicated C; Middle matrix: (C^T C)^+; Right matrix: C^T A
30
Roadmap
• Motivation
• Survey: Existing Methods
• Proposed Methods: Colibri
– Colibri-S for static graphs (Problem 1)
– Colibri-D for dynamic graphs (Problem 2)
• Experimental Results
• Conclusion
31
Colibri-S: Basic Idea
CUR (CX) → Colibri-S: keep only linearly independent sampled columns
– Left matrix L: the linearly independent columns chosen from the sampled C
– Middle matrix M: (L^T L)^{-1}
– Right matrix R: L^T A
33
A: Find L & M incrementally!
– Start from the initially sampled matrix C and scan its columns one by one
– Test whether the next column is redundant (already in the span of the current L)
– If No: expand L & M; if Yes: skip it
34
Step 1: How to test if KDD is redundant ?
– Project the KDD column onto the subspace spanned by the columns already in L (here ICML and SIGMOD), using the current core matrix:
  projection = L × M_old × (L^T × KDD)
– residual = KDD − projection
– If the residual is (numerically) zero, KDD lies in the span of {ICML, SIGMOD} and is redundant
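A small numpy sketch of this redundancy test (column names, tolerance, and toy numbers are illustrative):

    import numpy as np

    def is_redundant(L, x, tol=1e-10):
        # Project the candidate column x onto the subspace spanned by the
        # columns already in L; if the residual is numerically zero, x is
        # redundant and can be skipped.
        M = np.linalg.pinv(L.T @ L)          # current core matrix
        residual = x - L @ (M @ (L.T @ x))
        return float(residual @ residual) <= tol

    L = np.array([[1.0, 0.0],                # e.g., ICML and SIGMOD columns
                  [0.0, 1.0],
                  [1.0, 1.0]])
    kdd = np.array([1.0, 1.0, 2.0])          # "KDD" column = ICML + SIGMOD
    print(is_redundant(L, kdd))              # True: it lies in the current span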
35
Step 2: How to update core matrix ?
– Current core matrix: M_old = ([ICML SIGMOD]^T [ICML SIGMOD])^{-1}
– After appending the non-redundant KDD column:
  M_new = ([ICML SIGMOD KDD]^T [ICML SIGMOD KDD])^{-1}
– Question: how to get M_new from M_old without a fresh matrix inversion?
36
Q: How to update the core matrix?
A: Incrementally.
Theorem 1 [Tong et al. KDD 2008]: M_new has a closed form in terms of M_old, the new (KDD) column, and its residual after projection onto the current subspace; no inversion from scratch is needed. We only need to know the new column's projection coefficients and its residual.
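A sketch of such an incremental core update via the standard block-matrix-inverse identity; this conveys the idea behind Theorem 1, but it is not a verbatim transcription of the paper's formula:

    import numpy as np

    def expand_core(L, M_old, x):
        # Append a non-redundant column x to L and update M = (L^T L)^{-1}
        # using the block-matrix inverse (no inversion from scratch).
        y = M_old @ (L.T @ x)                      # projection coefficients
        delta = float(x @ x - x @ (L @ y))         # squared residual norm
        top = np.hstack([M_old + np.outer(y, y) / delta, -y[:, None] / delta])
        bottom = np.hstack([-y[None, :] / delta, np.array([[1.0 / delta]])])
        return np.hstack([L, x[:, None]]), np.vstack([top, bottom])

    L = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    M = np.linalg.inv(L.T @ L)
    x = np.array([0.0, 1.0, 0.0])                  # a new, non-redundant column
    L_new, M_new = expand_core(L, M, x)
    print(np.allclose(M_new, np.linalg.inv(L_new.T @ L_new)))   # True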
37
Colibri-S vs. CUR(CMD)
• (C0) Quality: Colibri-S = CUR (CMD)
• (C1) Efficiency, example: if c = 1,000 columns are sampled but only c̃ = 200 of them are linearly independent, Colibri-S is about (1000/200)^3 = 125× faster
44
Problem Definition
• Given (e.g., Author-Conference Graphs)
A_1, A_2, A_3, …
• Find, incrementally:
(L_1, M_1, R_1), (L_2, M_2, R_2), (L_3, M_3, R_3), …
45
Colibri-D for dynamic graphs
– At time t: A_t ≈ L_t × M_t × R_t
– At time t+1: how to obtain L_{t+1}, M_{t+1}, R_{t+1} from (L_t, M_t, R_t)?
47
Colibri-D: How-To
– At time t: some of the initially sampled columns were selected into L_t (they span the subspace), the rest were redundant
– At time t+1: only the columns that changed from t need to be re-examined; the unchanged columns of L_t still span their part of the subspace and are reused for L_{t+1} and M_{t+1}
48
How to Get the Core Matrix for the Unchanged Columns?
– At time t we already have M_t = [(L_t)^T L_t]^{-1}
– Let L̃_t be L_t restricted to its unchanged columns; we need M̃_t = [(L̃_t)^T L̃_t]^{-1}, ideally without inverting from scratch
49
How to Get the Core Matrix for the Unchanged Columns? (cont.)
Let s: # of changed columns in L_t (a code sketch of one such downdating step follows below)
51
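A sketch of one way to 'downdate' the core matrix when s columns are dropped, using the Schur-complement identity for the inverse of a principal submatrix; again this is the flavor of update Colibri-D relies on, not the paper's exact formula:

    import numpy as np

    def downdate_core(M, changed):
        # Given M = (L^T L)^{-1} and the indices of the changed columns,
        # return (Ltilde^T Ltilde)^{-1} for the unchanged columns only.
        keep = np.setdiff1d(np.arange(M.shape[0]), changed)
        M11 = M[np.ix_(keep, keep)]
        M12 = M[np.ix_(keep, changed)]
        M22 = M[np.ix_(changed, changed)]
        return M11 - M12 @ np.linalg.inv(M22) @ M12.T

    L = np.random.rand(50, 6)
    M = np.linalg.inv(L.T @ L)
    changed = np.array([1, 4])                     # s = 2 changed columns
    L_keep = np.delete(L, changed, axis=1)
    print(np.allclose(downdate_core(M, changed),
                      np.linalg.inv(L_keep.T @ L_keep)))        # True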
Comparison SVD, CUR/CMD vs. Colibri
52
Roadmap
• Motivation
• Survey: Existing Methods
• Proposed Methods: Colibri
• Experimental Results
• Conclusion
53
Experimental Setup
• Data set: network traffic
– 21,837 sources/destinations
– 1,222 consecutive hours (~2 months)
– ~22,800 edges per hour
• Metrics: reconstruction accuracy and space cost
54
Performance
of Colibri-S (vs. SVD, CUR, CMD)
• Accuracy: same as the baselines (91%+)
• Time: 12× faster than CMD, 28× faster than CUR
• Space: ~1/3 of CMD, ~10% of CUR
55
Performance of Colibri-D
(vs. CMD, the prior best method, and Colibri-S)
• Data: network traffic (21,837 nodes, 1,220 hours, ~22,800 edges/hr)
• Accuracy: same (93%+)
• Time vs. # of changed columns: Colibri-D achieves up to 112× speedups
56
Conclusion: Colibri
• Colibri-S (for static graphs)
– Idea: remove redundancy
– Up to 52x speedup; 2/3 space saving
– No quality loss (w.r.t. CUR/CMD)
• Colibri-D (for dynamic graphs)
– Idea: leverage “smoothness”
– Up to 112× speedup over CMD
57
optional
58
Graph Mining by Low-Rank Approximation
More on LRA
• Q0: SVD + example-based LRA
• Q1: Nonnegative Matrix Factorization
• Q2: Non-negative Residual Matrix Factorization
• Q3: Nuclear norm related technologies
60
Low Rank Approximation
• Nonnegative Matrix Factorization (NMF)
Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (21 October 1999).

Nonnegative Matrix Factorization (NMF)
X ≈ F G^T with F (m × r) ≥ 0 and G (n × r) ≥ 0; each row of G holds the r nonnegative coefficients for one column of X
62
NMF Solutions: Multiplicative Updates
• Multiplicative update method
Daniel D. Lee and H. Sebastian Seung (2001). Algorithms for Non-negative Matrix Factorization. NIPS 2001.
H. Zhou, K. Lange, and M. Suchard (2010). Graphical processing units and high-dimensional optimization. Statistical Science, 25:311-324.
63
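A minimal sketch of Lee & Seung-style multiplicative updates for X ≈ F G^T (the epsilon guard, iteration count, and random data are illustrative):

    import numpy as np

    def nmf_multiplicative(X, r, n_iter=200, eps=1e-9, seed=0):
        # Multiplicative updates keep F, G nonnegative if initialized so;
        # no convergence test or regularization in this sketch.
        rng = np.random.default_rng(seed)
        m, n = X.shape
        F = rng.random((m, r))
        G = rng.random((n, r))
        for _ in range(n_iter):
            F *= (X @ G) / (F @ (G.T @ G) + eps)
            G *= (X.T @ F) / (G @ (F.T @ F) + eps)
        return F, G

    X = np.random.rand(50, 40)
    F, G = nmf_multiplicative(X, r=5)
    print(np.linalg.norm(X - F @ G.T, 'fro') / np.linalg.norm(X, 'fro'))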
NMF Solutions: Alternating Nonnegative Least Squares
• Initialize F and G with nonnegative values
• Iterate the following procedure:
– Fixing F, solve min_{G ≥ 0} ||X − F G^T||_F for G
– Fixing G, solve min_{F ≥ 0} ||X − F G^T||_F for F
(a minimal code sketch follows after the references below)
P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(1):111-126, 1994.
C.-J. Lin. Projected gradient methods for non-negative matrix factorization. Neural Computation, 19 (2007), 2756-2779.
D. Kim, S. Sra, I. S. Dhillon. Fast Newton-type Methods for the Least Squares Nonnegative Matrix Approximation Problem. SDM 2007.
J. Kim and H. Park. Toward Faster Nonnegative Matrix Factorization: A New Algorithm and Comparisons. ICDM 2008.
64
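The promised sketch of ANLS for X ≈ F G^T, using scipy's nnls solver row by row and column by column; practical implementations use the faster methods cited above (projected gradient, block pivoting, Newton-type), so this only shows the alternating structure:

    import numpy as np
    from scipy.optimize import nnls

    def nmf_anls(X, r, n_iter=30, seed=0):
        rng = np.random.default_rng(seed)
        m, n = X.shape
        F = rng.random((m, r))
        G = rng.random((n, r))
        for _ in range(n_iter):
            # Fix G: each row of F solves a small nonnegative least-squares problem
            F = np.vstack([nnls(G, X[i, :])[0] for i in range(m)])
            # Fix F: same for each row of G (i.e., each column of X)
            G = np.vstack([nnls(F, X[:, j])[0] for j in range(n)])
        return F, G

    X = np.random.rand(30, 20)
    F, G = nmf_anls(X, r=4)
    print(np.linalg.norm(X - F @ G.T, 'fro') / np.linalg.norm(X, 'fro'))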
Application of NMF: Privacy-Aware On-line User
Role Tracking [AAAI11]
• Problem Definitions
– Given: the user-activity log that changes over time
– Monitor: (1) the user role/cluster; and (2) the role/cluster description.
• Design Objective
– (1) Privacy-aware; and (2) Efficiency (in both time and space).
65
Key Ideas
• Minimize an upper bound of the original/exact objective function
  min ||(X + ΔX) − (F + ΔF)(G + ΔG)^T||_F  (= min ||X + ΔX − F G^T − ΔF G^T − F ΔG^T − ΔF ΔG^T||_F)
  (X changes by ΔX at each time stamp; F and G are updated by ΔF and ΔG)
H. Tong, C. Lin: Non-Negative Residual Matrix Factorization with Application to Graph Anomaly Detection. SDM 2011.
70
Visual Comparisons
[Figure: visual comparison of Original vs. NrMF vs. SVD reconstructions]
71
Low Rank Approximation
• Nonnegative Matrix Factorization
• Non-negative Residual Matrix Factorization
• Nuclear norm related technologies
• Convex relaxation
M. Fazel, H. Hindi, S. Boyd. A Rank Minimization Heuristic with Application to Minimum Order System Approximation. Proceedings American Control Conference, 6:4734-4739, June 2001.
73
Nuclear Norm Minimization
• Singular Value Thresholding
– http://svt.stanford.edu/
• Accelerated gradient
– http://www.public.asu.edu/~jye02/Software/SLEP/index.htm
• Interior point methods
– http://abel.ee.ucla.edu/cvxopt/applications/nucnrm/
J-F. Cai, E.J. Candès and Z. Shen. A Singular Value Thresholding Algorithm for Matrix Completion. SIAM Journal on
Optimization. Volume 20 Issue 4, January 2010 Pages 1956-1982.
Shuiwang Ji and Jieping Ye. An Accelerated Gradient Method for Trace Norm Minimization. The Twenty-Sixth International
Conference on Machine Learning (ICML 2009)
Z. Liu, Lieven Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. SIAM Journal on Matrix Analysis and Applications (2009).
74
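A minimal sketch of the singular value thresholding (shrinkage) operator at the core of the SVT approach; tau and the toy matrix are illustrative:

    import numpy as np

    def svt_shrink(X, tau):
        # D_tau(X): soft-threshold the singular values of X, i.e. the
        # proximal operator of tau * (nuclear norm).
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

    X = np.random.rand(20, 15)
    X_low = svt_shrink(X, tau=3.0)
    # Thresholding drives small singular values to zero, so the rank drops
    print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(X_low))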
◼ From LRA to Co-clustering
Co-clustering
• Let X and Y be discrete random variables
– X and Y take values in {1, 2, …, m} and {1, 2, …, n}
– p(X, Y) denotes the joint probability distribution—if
not known, it is often estimated based on co-occurrence
data
– Application areas: text mining, market-basket analysis,
analysis of browsing behavior, etc.
• Key Obstacles in Clustering Contingency Tables
– High Dimensionality, Sparsity, Noise
– Need for robust and scalable algorithms
Reference:
1. Dhillon et al. Information-Theoretic Co-clustering, KDD’03
76
P(X, Y) (m × n, e.g., terms × documents):

.05 .05 .05   0    0    0
.05 .05 .05   0    0    0
 0   0   0   .05  .05  .05
 0   0   0   .05  .05  .05
.04 .04  0   .04  .04  .04
.04 .04 .04   0   .04  .04

Co-clustering approximates P(X, Y) with P(X | X̂) · P(X̂, Ŷ) · P(Y | Ŷ)^T,
where P(X | X̂) is m × k, P(X̂, Ŷ) is k × l, and P(Y | Ŷ)^T is l × n (here k = 3, l = 2):

P(X | X̂):        P(X̂, Ŷ):      P(Y | Ŷ)^T:
.5  0  0          .3  0          .36 .36 .28  0    0    0
.5  0  0           0 .3           0   0   0  .28  .36  .36
 0 .5  0          .2 .2
 0 .5  0
 0  0 .5
 0  0 .5

The resulting approximation of P(X, Y):

.054 .054 .042  0    0    0
.054 .054 .042  0    0    0
 0    0    0   .042 .054 .054
 0    0    0   .042 .054 .054
.036 .036 .028 .028 .036 .036
.036 .036 .028 .028 .036 .036
77
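A quick numpy check that the factors above reproduce the approximation (values exactly as reconstructed on this slide):

    import numpy as np

    # p(X | Xhat): 6 terms x 3 term-groups
    PX = np.array([[.5, 0, 0], [.5, 0, 0],
                   [0, .5, 0], [0, .5, 0],
                   [0, 0, .5], [0, 0, .5]])
    # p(Xhat, Yhat): 3 term-groups x 2 doc-groups
    PXY = np.array([[.3, 0], [0, .3], [.2, .2]])
    # p(Y | Yhat), stored transposed: 2 doc-groups x 6 docs
    PYT = np.array([[.36, .36, .28, 0, 0, 0],
                    [0, 0, 0, .28, .36, .36]])

    Q = PX @ PXY @ PYT      # co-clustering approximation of p(X, Y)
    print(np.round(Q, 3))
    # rows 1-2: .054 .054 .042 0 0 0 ; rows 5-6: .036 .036 .028 .028 .036 .036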
Same example, annotated: the two column clusters correspond to two document groups (e.g., 'med. doc' vs. 'cs doc'); the left factor is the term × term-group matrix, the right factor is the doc-group × doc matrix, and the middle factor couples the groups, so the approximation is (term × term-group) × (group × group) × (doc-group × doc), with the same numeric values as above.
78
Co-clustering
Observations
• uses KL divergence, instead of L2 or LF
• the middle matrix is not diagonal
– we’ll see that again in the Tucker tensor
decomposition
79
Matrix & Tensor Tools
• Matrix Tools
• Tensor Tools
– Tensor Basics
– Tucker
• Tucker 1
• Tucker 2
• Tucker 3
– PARAFAC
80
Tensor Basics
Reminder: SVD
A (m × n) ≈ U Σ V^T
– Best rank-k approximation in L2 or LF
82
Reminder: SVD
A (m × n) = σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + …  (a sum of rank-1 matrices)
Analogously, an I × J × K tensor can be written as a sum of rank-1 terms, using factor matrices of size I × R and J × R (and K × R for the third mode) with an R × R × R core.
84
Main points:
• 2 major types of tensor decompositions:
PARAFAC and Tucker
• both can be solved with "alternating least squares" (ALS)
• Details follow – we start with terminology:
85
[T. Kolda,’07]
A tensor is a multidimensional array.
An I × J × K (3rd-order) tensor X has entries x_ijk:
– mode 1 has dimension I, mode 2 has dimension J, mode 3 has dimension K
– Fibers: Column (Mode-1), Row (Mode-2), Tube (Mode-3)
– Slices: Horizontal, Lateral, Frontal
Note: the focus is on 3rd-order tensors, but everything can be extended to higher orders.
86
details [T. Kolda,’07]
Matricization: Converting a Tensor to a Matrix
– X_(n): the mode-n fibers are rearranged to be the columns of a matrix
– Matricize (unfold): (i, j, k) → (i′, j′); Reverse matricize folds back: (i′, j′) → (i, j, k)
– e.g., a 2 × 2 × 2 tensor with frontal slices [1 3; 2 4] and [5 7; 6 8] has mode-1 unfolding [1 3 5 7; 2 4 6 8]
87
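A small numpy sketch of mode-n matricization under the ordering convention used above (the earlier of the remaining indices varies fastest):

    import numpy as np

    # A 2 x 2 x 2 tensor: frontal slice k=0 is [[1, 3], [2, 4]], k=1 is [[5, 7], [6, 8]]
    X = np.zeros((2, 2, 2))
    X[:, :, 0] = [[1, 3], [2, 4]]
    X[:, :, 1] = [[5, 7], [6, 8]]

    def unfold(X, mode):
        # Mode-n matricization: the mode-n fibers become the columns of a matrix
        return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order='F')

    print(unfold(X, 0))
    # [[1. 3. 5. 7.]
    #  [2. 4. 6. 8.]]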
details: [Figure (T. Kolda, '07): a Time × Location data tensor decomposed into Time clusters and Location clusters]
89
details: [Figure (T. Kolda, '07): Time × Location example, continued]
90
details
Outer, Kronecker, & Khatri-Rao Products
– 3-way outer product: a ∘ b ∘ c is a rank-1 tensor
– Review, matrix Kronecker product: (M × N) ⊗ (P × Q) → MP × NQ
– Matrix Khatri-Rao product: column-wise Kronecker product, (M × R) ⊙ (N × R) → MN × R
91 [T. Kolda,’07]
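Small numpy sketches of the three products (shapes are toy values):

    import numpy as np

    # 3-way outer product: a rank-1 tensor
    a, b, c = np.array([1.0, 2.0]), np.array([1.0, 0.0, 1.0]), np.array([2.0, 3.0])
    rank1 = np.einsum('i,j,k->ijk', a, b, c)                  # shape (2, 3, 2)

    # Kronecker product: (M x N) kron (P x Q) -> (MP x NQ)
    K = np.kron(np.random.rand(2, 3), np.random.rand(4, 5))   # shape (8, 15)

    def khatri_rao(A, B):
        # Column-wise Kronecker product: (M x R), (N x R) -> (MN x R)
        M, R = A.shape
        N, _ = B.shape
        return np.einsum('ir,jr->ijr', A, B).reshape(M * N, R)

    print(khatri_rao(np.random.rand(3, 4), np.random.rand(5, 4)).shape)   # (15, 4)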
Specially Structured Tensors
• Tucker tensor: X = G ×_1 A ×_2 B ×_3 C, with a dense "core" tensor G
• Kruskal tensor: X = Σ_r a_r ∘ b_r ∘ c_r, a sum of R rank-1 tensors (equivalently, a Tucker tensor whose core is superdiagonal)
[T. Kolda,’07]
93
Outline: Part 2
• Matrix Tools
• Tensor Tools
– Tensor Basics
– Tucker
• Tucker 1
• Tucker 2
• Tucker 3
– PARAFAC
95
Tensor Decompositions
Tucker Decomposition - intuition
Recall the co-clustering example: the matrix is approximated as (term × term-group) × (group-group) × (doc-group × doc), with the same numeric factors as before. Tucker generalizes this picture to tensors: one factor matrix per mode plus a (generally non-diagonal) core.
98
Tucker Decomposition
99
details
Tucker Variations
See Kroonenberg & De Leeuw, Psychometrika, 1980 for discussion.
• Tucker2: X (I × J × K) ≈ core (R × S × K) ×_1 A (I × R) ×_2 B (J × S); the mode-3 factor is the identity matrix
• Tucker1: X (I × J × K) ≈ core (R × J × K) ×_1 A (I × R); finding principal components in only mode 1, which can be solved via a rank-R matrix SVD
100
details
Solving for Tucker [T. Kolda,'07]
• Given orthonormal A (I × R), B (J × S), C (K × T), the optimal core is G = X ×_1 A^T ×_2 B^T ×_3 C^T
  (the tensor norm is the square root of the sum of all the elements squared)
• Minimize ||X − G ×_1 A ×_2 B ×_3 C|| s.t. A, B, C orthonormal; eliminating the core, this is equivalent to maximizing ||X ×_1 A^T ×_2 B^T ×_3 C^T|| with the other factors fixed
• If B & C are fixed, solve for A: A = R leading left singular vectors of X_(1)(C ⊗ B)
• Alternating algorithm:
  – Initialize: choose R, S, T; calculate A, B, C via HO-SVD
  – Until converged:
    A = R leading left singular vectors of X_(1)(C ⊗ B)
    B = S leading left singular vectors of X_(2)(C ⊗ A)
    C = T leading left singular vectors of X_(3)(B ⊗ A)
  – Solve for the core: G = X ×_1 A^T ×_2 B^T ×_3 C^T
104
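A compact numpy sketch of this alternating scheme (HO-SVD initialization followed by HOOI-style updates); it is a sketch only, with a fixed iteration count instead of a convergence test:

    import numpy as np

    def unfold(X, mode):
        return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order='F')

    def mode_mult(X, A, mode):
        # Mode-n product of tensor X with matrix A (A has X.shape[mode] columns)
        return np.moveaxis(np.tensordot(A, np.moveaxis(X, mode, 0), axes=1), 0, mode)

    def tucker_hooi(X, ranks, n_iter=20):
        # HO-SVD init: leading left singular vectors of each unfolding
        factors = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :r]
                   for n, r in enumerate(ranks)]
        for _ in range(n_iter):
            for n in range(3):
                Y = X
                for m in range(3):
                    if m != n:
                        Y = mode_mult(Y, factors[m].T, m)   # project the other modes
                factors[n] = np.linalg.svd(unfold(Y, n),
                                           full_matrices=False)[0][:, :ranks[n]]
        G = X
        for m in range(3):
            G = mode_mult(G, factors[m].T, m)               # core G = X x_n A_n^T
        return G, factors

    X = np.random.rand(10, 12, 14)
    G, (A, B, C) = tucker_hooi(X, ranks=(3, 4, 5))
    print(G.shape, A.shape, B.shape, C.shape)   # (3, 4, 5) (10, 3) (12, 4) (14, 5)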
Outline: Part 2
• Matrix Tools
• Tensor Tools
– Tensor Basics
– Tucker
• Tucker 1
• Tucker 2
• Tucker 3
– PARAFAC
105
CANDECOMP/PARAFAC (CP) Decomposition
X (I × J × K) ≈ a_1 ∘ b_1 ∘ c_1 + … + a_R ∘ b_R ∘ c_R, with factor matrices A (I × R), B (J × R), C (K × R)
• Solved with ALS: find all the vectors in one mode at a time, keeping the other two fixed
• The per-mode update uses the Khatri-Rao product (column-wise Kronecker product) of the two fixed factor matrices and the Hadamard (element-wise) product of their Gram matrices
109
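A minimal CP-ALS sketch in numpy, using the standard least-squares update A = X_(1) (C ⊙ B) [(C^T C) * (B^T B)]^+ and its analogues for B and C (iteration count and sizes are illustrative):

    import numpy as np

    def unfold(X, mode):
        return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order='F')

    def khatri_rao(A, B):
        M, R = A.shape
        N, _ = B.shape
        return np.einsum('ir,jr->ijr', A, B).reshape(M * N, R)

    def cp_als(X, R, n_iter=50, seed=0):
        # Alternating least squares: solve for one factor matrix at a time,
        # keeping the other two fixed.
        rng = np.random.default_rng(seed)
        A, B, C = (rng.random((X.shape[n], R)) for n in range(3))
        for _ in range(n_iter):
            A = unfold(X, 0) @ khatri_rao(C, B) @ np.linalg.pinv((C.T @ C) * (B.T @ B))
            B = unfold(X, 1) @ khatri_rao(C, A) @ np.linalg.pinv((C.T @ C) * (A.T @ A))
            C = unfold(X, 2) @ khatri_rao(B, A) @ np.linalg.pinv((B.T @ B) * (A.T @ A))
        return A, B, C

    X = np.random.rand(8, 9, 10)
    A, B, C = cp_als(X, R=4)
    X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)       # sum of R rank-1 tensors
    print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))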
Tensor tools - summary
• Two main tools
– PARAFAC
– Tucker
• Both find row-, column-, tube-groups
– but in PARAFAC the three groups are identical
• To solve: Alternating Least Squares
110
Tensor tools - resources
• Toolbox: from Tamara Kolda:
csmr.ca.sandia.gov/~tgkolda/TensorToolbox/
• T. G. Kolda and B. W. Bader. Tensor
Decompositions and Applications. SIAM
Review 2008
• csmr.ca.sandia.gov/~tgkolda/pubs/bibtgkfiles/TensorReview-preprint.pdf
111
Key Papers
Core Papers
• Hanghang Tong, Spiros Papadimitriou, Jimeng Sun, Philip S. Yu, Christos Faloutsos: Colibri: fast mining
of large static and dynamic graphs. KDD 2008: 686-694
• Dhillon et al. Information-Theoretic Co-clustering, KDD’03
• T. G. Kolda and B. W. Bader. Tensor Decompositions and Applications. SIAM Review 2008
Further Reading
• Chih-Jen Lin: Projected Gradient Methods for Non-negative Matrix Factorization.
https://www.csie.ntu.edu.tw/~cjlin/papers/pgradnmf.pdf
• Candès, Emmanuel J., and Benjamin Recht. "Exact matrix completion via convex optimization."
Foundations of Computational mathematics 9, no. 6 (2009): 717.
• Rendle, S. (2010, December). Factorization machines. In 2010 IEEE International Conference on Data
Mining (pp. 995-1000). IEEE.
• Tamara G. Kolda, Brett W. Bader, Joseph P. Kenny: Higher-Order Web Link Analysis Using Multilinear
Algebra. ICDM 2005: 242-249
• U Kang, Evangelos E. Papalexakis, Abhay Harpale, Christos Faloutsos: GigaTensor: scaling tensor analysis
up by 100 times - algorithms and discoveries. KDD 2012: 316-324
• Deepayan Chakrabarti, Spiros Papadimitriou, Dharmendra S. Modha, Christos Faloutsos: Fully automatic
cross-associations. KDD 2004: 79-88
• Trigeorgis, G., Bousmalis, K., Zafeiriou, S., & Schuller, B. (2014, January). A deep semi-nmf model for
learning hidden representations. In International Conference on Machine Learning (pp. 1692-1700).
• Risi Kondor, Nedelina Teneva, and Vikas Garg. 2014. Multiresolution matrix factorization. In International
Conference on Machine Learning. 1620–1628
112