CS 514 Advanced Topics in Network Science

Lecture 3. Matrix and Tensor


Hanghang Tong, Computer Science, Univ. of Illinois at Urbana-Champaign, 2024
Network Science: An Overview
[Figure: "we are here" overview of network science tasks at three levels]
- network level (e.g., patterns, laws, connectivity, etc.)
- subgraph level (e.g., clusters, communities, dense subgraphs, etc.)
- node/link level (e.g., ranking, link prediction, embedding, etc.)
• Level 1: diameter, connectivity, graph-level classification, graph-level embedding, graph kernel, graph structure learning, graph generator,…
• Level 2: frequent subgraphs, clustering, community detection, motif, teams, dense subgraphs, subgraph matching, NetFair, …
• Level 3: node proximity, node classification, link prediction, anomaly detection, node embedding, network alignment, NetFair,
• Beyond: network of X, …

2
Matrix & Tensor Tools
• Matrix Tools
– Proximity (covered in Lecture 2)
– Low-rank approximation
– Co-clustering
• Tensor Tools

3
Motivation
• Q: How to find patterns?
– e.g., communities, anomalies, etc.
• A (Common Approach): Low-Rank
Approximation (LRA) for Adjacency Matrix.
A ~ L x M x R
4
Hanghang Tong, Spiros Papadimitriou, Jimeng Sun, Philip S. Yu, Christos Faloutsos: Colibri: fast
mining of large static and dynamic graphs. KDD 2008: 686-694
LRA for Graph Mining
[Example: author-conference bipartite graph]
Adjacency matrix A (authors x conferences ICDM, KDD, ISMB, RECOMB):

John  1 1 0 0
Tom   1 1 0 0
Bob   1 1 0 0
Carl  0 1 1 1
Van   0 0 1 1
Roy   0 0 1 1
5
LRA for Graph Mining: Communities
[Figure: the adjacency matrix A is factorized as A ~ L x M x R]
- L: author-group matrix
- M: group-group interaction matrix
- R: conference-group matrix

6
LRA for Graph Mining: Anomalies
[Figure: the same factorization A ~ L x M x R used for anomaly detection]
The reconstruction error is high for the row of 'Carl' → 'Carl' is abnormal.
7
Challenges – Problem 1
• Prob. 1: Given a static graph A,
– (C1) How to get (L, M, R) efficiently (in both time and space)?
– (C2) What is the interpretation of (L, M, R)?

8
Challenges – Problem 2
• Prob. 2: Given a dynamic graph A_t (t = 1, 2, …),
– (C3) How to get (L_t, M_t, R_t) incrementally, to track patterns over time?

9
Roadmap - LRA
• Motivation
• Survey: Existing Methods
– SVD
– CUR/CX
– CMD
• Proposed Methods: Colibri
• Experimental Results
• Conclusion
10
Overview

[Figure: A ~ L x M x R]
General recipe (the same for all the methods below): (1) find the left matrix L; (2) project A onto the column space of L.
11
Matrix & Vector
• Matrix B (authors x conferences):

                SIGMOD  ICML
Philip Yu          3      1
William Cohen      1      1
John Smith         0      0

• Each column is a vector: ICML = [1, 1, 0]', SIGMOD = [3, 1, 0]'

12
Column Space
• The same matrix B as above (authors x conferences SIGMOD, ICML)
• Column space of B: all linear combinations of its columns, e.g.,
  VLDB = SIGMOD – ICML = [2, 0, 0]'

13
Projection & Projection Matrix
[Figure: projecting the KDD vector onto the span of ICML and SIGMOD]

v~ = B (B^T B)^+ B^T v

- v: an arbitrary vector
- (B^T B)^+: the core matrix
- B (B^T B)^+ B^T: the projection matrix of B
- v~: the projection of v onto the column space of B
14


Projection of a Matrix

A~ = B (B^T B)^+ B^T A = L M R

- A: an arbitrary matrix (with matching row dimension)
- L = B, M = (B^T B)^+ (core matrix), R = B^T A
- A~: the projection of A onto the column space of B
15
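To make the projection step concrete, here is a minimal numpy sketch (not from the lecture): it builds the core matrix (B^T B)^+ and projects a vector and a matrix onto the column space of B. The matrix B reuses the toy Philip Yu / William Cohen example above; the vector v and the matrix A are arbitrary illustrative values.

```python
import numpy as np

# Toy author-by-conference matrix from the running example (columns: SIGMOD, ICML).
B = np.array([[3.0, 1.0],   # Philip Yu
              [1.0, 1.0],   # William Cohen
              [0.0, 0.0]])  # John Smith

core = np.linalg.pinv(B.T @ B)   # core matrix M = (B^T B)^+
P = B @ core @ B.T               # projection matrix of B

v = np.array([1.0, 2.0, 3.0])    # an arbitrary vector
v_proj = P @ v                   # projection of v onto the column space of B

A = np.random.rand(3, 5)         # an arbitrary matrix with matching row dimension
A_proj = P @ A                   # projection of A: L = B, M = core, R = B^T A
```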
Roadmap
• Motivation
• Survey: Existing Methods
– SVD
– CUR/CX
– CMD
• Proposed Methods: Colibri
• Experimental Results
• Conclusion
16
Singular-Value-Decomposition (SVD)
A ≈ U Σ V^T = σ_1 u_1 v_1^T + … + σ_k u_k v_k^T

- A: n x m data matrix (columns a_1, a_2, …, a_m)
- U = [u_1 … u_k]: left singular vectors
- Σ = diag(σ_1, …, σ_k): singular values
- V = [v_1 … v_k]: right singular vectors
17


SVD: definitions
• #1: Find the left matrix U, where

  u_i = A v_i / σ_i = (a_1 v_{i,1} + a_2 v_{i,2} + … + a_m v_{i,m}) / σ_i

• #2: Project A onto the column space of U:

  A = U (U^T U)^+ U^T A = … = U Σ V^T

18
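A quick numpy illustration (a sketch, with an arbitrary random matrix standing in for the adjacency matrix A) of the truncated SVD as a rank-k approximation:

```python
import numpy as np

A = np.random.rand(100, 80)                    # any (e.g., adjacency) matrix
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-k approximation in L2 / Frobenius

err = np.linalg.norm(A - A_k, 'fro')           # reconstruction error
```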
SVD: advantages
• Optimal low-rank approximation, in both the L2 (spectral) and Frobenius norms:
  for any rank-k matrix A_k,

  || A – A~_k ||_{2,F} <= || A – A_k ||_{2,F}

  where A~_k is the rank-k truncated SVD of A.

19
SVD: drawbacks
• (C1) Efficiency A U  V
2 2
– Time O (min( n m, nm ))
[footnote: or O( E • Iter ) ] =
– Space (U, V) are dense

• (C2) Interpretation

20
SVD: drawbacks
• (C3) Dynamic: not easy
At Ut t Vt At+1 Ut+1 t+1 Vt+1

21
Roadmap
• Motivation
• Survey: Existing Methods
– SVD
– CUR/CX
– CMD
• Proposed Methods: Colibri
• Experimental Results
• Conclusion
22
CUR (CX) decomposition
[Drineas+ 2005]

A ≈ C x (C^T C)^+ x C^T A

• Sample columns from A to form C
• Project A onto the column space of C
  – Left matrix: C
  – Middle matrix: (C^T C)^+
  – Right matrix: C^T A
23
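Below is a minimal numpy sketch of the CX idea, assuming column sampling with probability proportional to squared column norms (the rescaling step of the full Drineas et al. algorithm is omitted; it does not change the span of C):

```python
import numpy as np

def cx_decomposition(A, c, seed=None):
    """Sample c columns of A (prob. ~ squared column norms), then project A onto their span:
    A ~ C @ pinv(C.T @ C) @ C.T @ A."""
    rng = np.random.default_rng(seed)
    col_norms = np.sum(A ** 2, axis=0)
    probs = col_norms / col_norms.sum()
    idx = rng.choice(A.shape[1], size=c, replace=True, p=probs)
    C = A[:, idx]                    # left matrix: sampled (possibly duplicated) columns
    U = np.linalg.pinv(C.T @ C)      # middle (core) matrix
    R = C.T @ A                      # right matrix
    return C, U, R

A = np.random.rand(200, 150)
C, U, R = cx_decomposition(A, c=20, seed=0)
err = np.linalg.norm(A - C @ U @ R, 'fro') / np.linalg.norm(A, 'fro')
```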
CUR (CX): advantages
• (C0) Quality: near-optimal
• (C1) Efficiency (better than SVD)
  – Time: O(c^2 n) or O(c^3 + c m), where c is the number of sampled columns
  – Space: (C, R) are sparse

• (C2) Interpretation

24
CUR (CX): drawbacks
• (C1) Redundancy in C

• 3 copies of green,
• 2 copies of red,
• 2 copies of purple
• purple=0.5*green + red…

25
Redundant Columns Do Not Help

[Figure: projecting KDD onto {ICML, SIGMOD} vs. onto {ICML, SIGMOD, VLDB} gives the same result]

Observations:
#1: Adding a redundant column (e.g., VLDB) does not change the projection
#2: It wastes time & space
26
CUR (CX): drawbacks
• (C3) Dynamic: not easy

[Figure: the sampled columns C at time t cannot easily be reused to obtain C at time t+1]

27
Roadmap
• Motivation
• Survey: Existing Methods
– SVD
– CUR/CX
– CMD
• Proposed Methods: Colibri
• Experimental Results
• Conclusion
28
CMD [Sun+ 2007]
[Figure: CUR (CX) vs. CMD on the same original matrix]
- CUR keeps duplicate sampled columns (3 copies of green, 2 copies of red, 2 copies of purple; also purple = 0.5*green + red)
- CMD deletes the exact duplicates in C
- Left matrix: C; middle matrix: (C^T C)^+; right matrix: C^T A


29
Challenges
• Can we do even better than CMD, by removing the other types of redundancy?
• Can we efficiently track LRA for time-evolving graphs?

30
Roadmap
• Motivation
• Survey: Existing Methods
• Proposed Methods: Colibri
– Colibri-S for static graphs (Problem 1)
– Colibri-D for dynamic graphs (Problem 2)
• Experimental Results
• Conclusion

31
Colibri-S: Basic Idea
[Figure: CUR (CX) vs. Colibri-S on the same original matrix]
- CUR keeps redundant columns (duplicates and linear combinations, e.g., purple = 0.5*green + red)
- Colibri-S keeps only linearly independent columns in L
- Left matrix: L; middle matrix: M = (L^T L)^{-1}; right matrix: L^T A

We want the columns in L to be linearly independent!
32
Q: How to find L & M from C efficiently?

33
A: Find L & M incrementally!
[Flow chart]
Start from the initially sampled matrix C. For each column v in C:
- project v onto the current L
- if v is redundant (the residual is ~0): discard v
- otherwise: expand L with v and update the core matrix M
34
Step 1: How to test if KDD is redundant ?

Project KDD onto the span of the columns already in L (here ICML and SIGMOD):

KDD~ = L M_old L^T KDD,   residual = KDD – KDD~

If the residual is (numerically) zero, KDD is redundant and is discarded.

35
Step 2: How to update core matrix ?

M_old = ( [ICML SIGMOD]^T [ICML SIGMOD] )^{-1}

After adding the new column KDD:

M_new = ( [ICML SIGMOD KDD]^T [ICML SIGMOD KDD] )^{-1}

Q: do we have to recompute this inverse from scratch?

36
Q: How to update core matrix?
A: Incrementally.
Theorem 1 [Tong et al. KDD 2008]: the new core matrix M_new can be assembled from the old core
matrix M_old, the projection KDD~ of the new column, and the residual norm δ = ||KDD – KDD~||^2
(a block-matrix update; no inverse of the full matrix is needed).

We only need to know KDD~ and δ!
37
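Below is a numpy sketch of the Colibri-S expansion loop. The redundancy test and the block update of M = (L^T L)^{-1} follow the structure described above (in the spirit of Theorem 1); the exact bookkeeping in the paper may differ, and the tolerance `tol` is an assumed parameter.

```python
import numpy as np

def colibri_s(C, tol=1e-10):
    """Sketch: scan the sampled columns C, keep only those that are linearly independent of the
    columns kept so far, and maintain the core matrix M = (L^T L)^{-1} incrementally."""
    L = C[:, [0]]                               # start with the first sampled column
    M = np.array([[1.0 / (L.T @ L)[0, 0]]])     # core matrix (L^T L)^{-1}
    for j in range(1, C.shape[1]):
        v = C[:, [j]]
        x = M @ (L.T @ v)                       # coefficients of the projection of v onto span(L)
        r = v - L @ x                           # residual
        delta = float(r.T @ r)
        if delta < tol:                         # redundant column: discard
            continue
        # Expand L and update the core matrix via a block-inverse update (no full re-inversion).
        M = np.block([[M + (x @ x.T) / delta, -x / delta],
                      [-x.T / delta, np.array([[1.0 / delta]])]])
        L = np.hstack([L, v])
    return L, M

C = np.random.rand(50, 10)
C = np.hstack([C, C[:, :3], 0.5 * C[:, [0]] + C[:, [1]]])   # add redundant copies/combinations
L, M = colibri_s(C)
assert np.allclose(M, np.linalg.inv(L.T @ L))               # M tracks (L^T L)^{-1}
```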
Colibri-S vs. CUR(CMD)
• (C0) Quality: Colibri-S = CUR(CMD)
• (C1) Time: O(c~^3 + c~ m~) vs. O(c^3 + c m), where c~ <= c, m~ <= m
  – Example: if c~ = 200 and c = 1000, Colibri-S is 125x faster!
• (C1) Space: Colibri-S better than or equal to CUR(CMD)
• (C2) Interpretation: Colibri-S = CUR(CMD)
38
A Pictorial Comparison
[Scatter plot: each dot is a conference, plotted by its count for Philip Yu (x-axis) and
William Cohen (y-axis), i.e., the rows of the author x conference matrix B]


39
A Pictorial Comparison: SVD
[Same scatter plot, with the 1st and 2nd singular vectors drawn; each dot is a conference]


40
A Pictorial Comparison: CUR
[Drineas+ 2005]

[Same scatter plot; CUR samples actual conferences (dots), possibly multiple times:
some points are picked 1x, 2x, 3x, or 4x; each dot is a conference]


41
A Pictorial Comparison: CMD
[Sun+ 2007]

[Same scatter plot; CMD keeps each sampled conference once (duplicates removed); each dot is a conference]


42
A Pictorial Comparison: Colibri-S
[Tong+ 2008]

[Same scatter plot; Colibri-S keeps only a linearly independent subset of the sampled conferences; each dot is a conference]


43
Roadmap
• Motivation
• Survey: Existing Methods
• Proposed Methods: Colibri
– Colibri-S for static graphs (Problem 1)
– Colibri-D for dynamic graphs (Problem 2)
• Experimental Results optional

• Conclusion

44
Problem Definition
• Given (e.g., author-conference graphs): A_1, A_2, A_3, …
• Find incrementally: (L_1, M_1, R_1), (L_2, M_2, R_2), (L_3, M_3, R_3), …
45
Colibri-D for dynamic graphs

[Figure: at time t, A_t ≈ L_t M_t R_t, where L_t is built from the initially sampled matrix;
at time t+1, how do we obtain L_{t+1} and M_{t+1}?]

Q: How to update L and M efficiently?
46


Colibri-D: How-To
[Figure: at time t the sampled columns split into "selected" (kept in L_t) and "redundant";
at time t+1 some of the selected columns change]
47
Colibri-D: How-To

[Figure: the columns of L_t that are unchanged at time t+1 span a subspace that can be reused;
only the changed columns need new work to obtain L_{t+1} and M_{t+1}]
48
How to Get the Core Matrix for the Unchanged Columns?

M_t = [(L_t)^T L_t]^{-1} is already known.
Let L~_t be the unchanged columns of L_t. How do we get M~_t = [(L~_t)^T L~_t]^{-1}
without inverting from scratch?
49
How to Get the Core Matrix for the Unchanged Columns?

Let s be the number of changed columns in L_t, and partition M_t with the changed columns last:

M_t = [ M_{1,1}  M_{1,2} ]
      [ M_{2,1}  M_{2,2} ]

Theorem 2 [Tong et al. KDD 2008]:

M~_t = M_{1,1} – M_{1,2} (M_{2,2})^{-1} M_{2,1}

We only need an s x s matrix inverse!
50
How to Get the Core Matrix for the Unchanged Columns

Let t be the number of unchanged columns in L_t and s the number of changed columns.

We only need a matrix inverse of size
- s x s, instead of t x t
- if s << t (a.k.a. "smooth"), we are faster
- example: if s = 10 and t = 100, we are 1000x faster!

51
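A small numpy check of the identity behind Theorem 2 (a sketch, assuming the s changed columns are ordered last in L_t): the core matrix restricted to the unchanged columns is obtained from the blocks of the old core matrix with only an s x s inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, s = 100, 20, 3                       # k kept columns, s of them change at time t+1
L_t = rng.standard_normal((n, k))
M_t = np.linalg.inv(L_t.T @ L_t)           # full core matrix at time t

u = k - s                                  # number of unchanged columns (ordered first here)
M11, M12 = M_t[:u, :u], M_t[:u, u:]
M21, M22 = M_t[u:, :u], M_t[u:, u:]

# Core matrix of the unchanged columns, using only an s x s inverse.
M_tilde = M11 - M12 @ np.linalg.inv(M22) @ M21

# Direct (more expensive) computation for comparison.
M_direct = np.linalg.inv(L_t[:, :u].T @ L_t[:, :u])
print(np.allclose(M_tilde, M_direct))      # True
```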
Comparison: SVD vs. CUR/CMD vs. Colibri

Wish list            SVD [Golub+ 1989]   CUR/CMD [Drineas+ 2005, Sun+ 2007]   Colibri [Tong+ 2008]
(C0) Quality         optimal             near-optimal                         same as CUR/CMD
(C1) Efficiency      poor (dense)        better                               best
(C2) Interpretation  hard                actual columns                       actual columns
(C3) Dynamics        not easy            not easy                             yes (Colibri-D)

52
Roadmap
• Motivation
• Survey: Existing Methods
• Proposed Methods: Colibri
• Experimental Results
• Conclusion

53
Experimental Setup
• Data set: network traffic
  – 21,837 sources/destinations
  – 1,222 consecutive hours (~2 months)
  – ~22,800 edges per hour
• Accuracy
• Space cost

54
Performance
[Bar charts: time and space of SVD, CUR, CMD, and Colibri-S ("Ours")]
• Accuracy: same, 91%+
• Time: 12x of CMD, 28x of CUR
• Space: ~1/3 of CMD, ~10% of CUR
55
Performance
[Plot: running time vs. number of changed columns, for CMD (the prior best method), Colibri-S, and Colibri-D]
• Data: network traffic (21,837 nodes, 1,220 hours, ~22,800 edges/hour)
• Accuracy: same, 93%+
• Colibri-D achieves up to 112x speedups
56
Conclusion: Colibri
• Colibri-S (for static graphs)
– Idea: remove redundancy
– Up to 52x speedup; 2/3 space saving
– No quality loss (w.r.t., CUR/CMD)
• Colibri-D (for dynamic graphs)
– Idea: leverage “smoothness”
– Up to 112x speedup over CMD

57
optional

• More on Matrix Low Rank Approximations

58
Graph Mining by Low-Rank Approximation

Q: How to get the low-rank matrix approximations?


59
optional

More on LRA
• Q0: SVD + example-based LRA
• Q1: Nonnegative Matrix Factorization
• Q2: Non-negative Residual Matrix Factorization
• Q3: Nuclear norm related technologies

60
Low Rank Approximation
• Nonnegative Matrix Factorization (NMF)

Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (21 October 1999)
61
Nonnegative Matrix Factorization (NMF)

• Factorize a nonnegative matrix X (n x m) into the product of two low-rank nonnegative matrices:
  X ≈ F G^T, with F of size n x r and G of size m x r (r << n, m)
  – each data point is reconstructed as the entire F matrix combined with one row of G (its coefficients)
62
NMF Solutions: Multiplicative Updates
• Multiplicative update method

Daniel D. Lee and H. Sebastian Seung (2001). Algorithms for Non-negative Matrix Factorization. NIPS 2001.
H. Zhou, K. Lange, and M. Suchard (2010). Graphical processing units and high-dimensional optimization. Statistical Science, 25:311-324.
63
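For concreteness, a minimal numpy sketch of the Lee-Seung multiplicative updates for X ≈ F G^T (the small epsilon added to the denominators is an assumption to avoid division by zero; the iteration count is arbitrary):

```python
import numpy as np

def nmf_multiplicative(X, r, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative updates for X ~ F @ G.T with F, G >= 0 (minimal sketch)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    F = rng.random((n, r))
    G = rng.random((m, r))
    for _ in range(n_iter):
        G *= (X.T @ F) / (G @ (F.T @ F) + eps)   # update G with F fixed
        F *= (X @ G) / (F @ (G.T @ G) + eps)     # update F with G fixed
    return F, G

X = np.random.rand(100, 40)                      # any nonnegative data matrix
F, G = nmf_multiplicative(X, r=5)
err = np.linalg.norm(X - F @ G.T, 'fro') / np.linalg.norm(X, 'fro')
```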
NMF Solutions: Alternating Nonnegative
Least Squares
• Initialize F and G with nonnegative values
• Iterate the following procedure:
  – Fixing G, solve a nonnegative least squares problem for F
  – Fixing F, solve a nonnegative least squares problem for G

(1) Projected Gradient: http://www.csie.ntu.edu.tw/~cjlin/nmf/
(2) Newton-type methods: http://www.cs.utexas.edu/users/dmkim/Source/software/nnma/index.html
(3) Block Principal Pivoting: https://sites.google.com/site/jingukim/nmf_bpas.zip?attredirects=0

P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(1):111-126, 1994.
C.-J. Lin. Projected gradient methods for non-negative matrix factorization. Neural Computation, 19 (2007), 2756-2779.
D. Kim, S. Sra, I. S. Dhillon. Fast Newton-type Methods for the Least Squares Nonnegative Matrix Approximation Problem. SDM 2007.
J. Kim and H. Park. Toward Faster Nonnegative Matrix Factorization: A New Algorithm and Comparisons. ICDM 2008.
64
Application of NMF: Privacy-Aware On-line User
Role Tracking [AAAI11]
• Problem Definitions
– Given: the user-activity log that changes over time
– Monitor: (1) the user role/cluster; and (2) the role/cluster description.
• Design Objective
– (1) Privacy-aware; and (2) Efficiency (in both time and space).

65
Key Ideas
• Minimize the upper bound of the original/exact objective function
min || (X + ΔX) – (F + ΔF)(G + ΔG)^T ||_F

  <=  || X – F G^T ||_F                                  (dependent on X, but fixed)
    + || ΔX – ΔF G^T – F ΔG^T – ΔF ΔG^T ||_F             (independent of X)

  subject to: F + ΔF >= 0, G + ΔG >= 0

So instead solve:

min || ΔX – ΔF G^T – F ΔG^T – ΔF ΔG^T ||_F    subject to  F + ΔF >= 0, G + ΔG >= 0

which can be solved by the projected gradient descent method.


Fei Wang, Hanghang Tong, Ching-Yung Lin: Towards Evolutionary Nonnegative Matrix
66
Factorization. AAAI 2011
Experimental Results

[Plots: running time vs. time stamp on several data sets; red: our method, blue: the off-line method]


Fei Wang, Hanghang Tong, Ching-Yung Lin: Towards Evolutionary Nonnegative Matrix
67
Factorization. AAAI 2011
NMF: Extensions
• General loss
– Bregman Divergence
• Different constraints
– Semi-NMF, Convex NMF, Symmetric NMF
• Incorporating supervisions
– Pairwise constraints, label
• Multiple factorized matrices
– Tri-factorization
I. S. Dhillon and S. Sra. Generalized Nonnegative Matrix Approximations with Bregman Divergences. NIPS 2005.
Chris H. Q. Ding, Tao Li, Michael I. Jordan: Convex and Semi-Nonnegative Matrix Factorizations. IEEE Trans. Pattern Anal.
Mach. Intell. 32(1): 45-55 (2010)
Chris H. Q. Ding, Tao Li, Wei Peng, Haesun Park: Orthogonal nonnegative matrix t-factorizations for clustering. KDD 2006.
Fei Wang, Tao Li, Changshui Zhang: Semi-Supervised Clustering via Matrix Factorization. SDM 2008: 1-12
Yuheng Hu, Fei Wang, Subbarao Kambhampati. Listen to the Crowd: Automated Analysis of Live Events via Aggregated Twitter Sentiment. IJCAI 2013.
68
Graph Mining by Low-Rank Approximation

Q: How to get the low-rank matrix approximations?


69
A2: Non-negative Residual MF
• Observations: anomalies → actual activities
• Examples: popularity contest, port scanner, etc
• NrMF formulation

[NrMF formulation: a weighted Frobenius-norm matrix factorization objective]
- the weighted Frobenius form and the weight matrix are common to any MF formulation
- the non-negativity constraint on the residual is unique to NrMF

H. Tong, C.-Y. Lin: Non-Negative Residual Matrix Factorization with Application to Graph Anomaly Detection. SDM 2011.
70
Visual Comparisons
[Figure: two example graphs, each shown as Original, NrMF reconstruction, and SVD reconstruction]

71
Low Rank Approximation
• Nonnegative Matrix Factorization
• Non-negative Residual Matrix Factorization
• Nuclear norm related technologies

72


Rank Minimization and Nuclear Norm
• Matrix completion with rank minimization:
  minimize rank(X)  subject to  X_ij = M_ij for the observed entries (i, j)   — NP-hard
• Convex relaxation: replace rank(X) with the nuclear norm ||X||_* (the sum of the singular values):
  minimize ||X||_*  subject to  X_ij = M_ij for the observed entries (i, j)

M. Fazel, H. Hindi, S. Boyd. A Rank Minimization Heuristic with Application to Minimum Order System Approximation. Proceedings of the American Control Conference, 6:4734-4739, June 2001.
73
Nuclear Norm Minimization
• Singular Value Thresholding
– http://svt.stanford.edu/
• Accelerated gradient
  – http://www.public.asu.edu/~jye02/Software/SLEP/index.htm
• Interior point methods
  – http://abel.ee.ucla.edu/cvxopt/applications/nucnrm/
J-F. Cai, E.J. Candès and Z. Shen. A Singular Value Thresholding Algorithm for Matrix Completion. SIAM Journal on
Optimization. Volume 20 Issue 4, January 2010 Pages 1956-1982.
Shuiwang Ji and Jieping Ye. An Accelerated Gradient Method for Trace Norm Minimization. The Twenty-Sixth International
Conference on Machine Learning (ICML 2009)
Z. Liu, Lieven Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. SIAM Journal on Matrix Analysis and Applications (2009).
74
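The key computational step in these solvers is soft-thresholding of the singular values. Below is a minimal, assumed proximal-gradient sketch for nuclear-norm matrix completion (not the code of any package above): each iteration takes a gradient step on the observed entries and then applies singular value thresholding. Step size, lambda, and the synthetic data are illustrative.

```python
import numpy as np

def svt_complete(M_obs, mask, lam=0.1, step=1.0, n_iter=200):
    """Proximal gradient on 0.5*||mask*(X - M_obs)||_F^2 + lam*||X||_* (minimal sketch)."""
    X = np.zeros_like(M_obs)
    for _ in range(n_iter):
        G = mask * (X - M_obs)                    # gradient of the data-fit term
        U, s, Vt = np.linalg.svd(X - step * G, full_matrices=False)
        s = np.maximum(s - step * lam, 0.0)       # soft-threshold the singular values
        X = (U * s) @ Vt
    return X

# Example: recover a random rank-3 matrix from ~50% of its entries.
rng = np.random.default_rng(0)
M = rng.standard_normal((60, 3)) @ rng.standard_normal((3, 40))
mask = (rng.random(M.shape) < 0.5).astype(float)
X_hat = svt_complete(mask * M, mask)
```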
◼ From LRA to Co-clustering
Co-clustering
• Let X and Y be discrete random variables
– X and Y take values in {1, 2, …, m} and {1, 2, …, n}
– p(X, Y) denotes the joint probability distribution—if
not known, it is often estimated based on co-occurrence
data
– Application areas: text mining, market-basket analysis,
analysis of browsing behavior, etc.
• Key Obstacles in Clustering Contingency Tables
– High Dimensionality, Sparsity, Noise
– Need for robust and scalable algorithms

Reference:
1. Dhillon et al. Information-Theoretic Co-clustering, KDD’03
76
[Worked example from Dhillon et al., KDD'03: a 6 x 6 joint distribution p(X, Y) (e.g., terms x documents)
with a block structure of entries around .05/.04 is summarized by a small co-cluster joint distribution
p(X^, Y^) and the conditionals p(X | X^), p(Y | Y^); the product p(X | X^) p(X^, Y^) p(Y | Y^) closely
reconstructs the original matrix]
77
[Same example, annotated: the rows split into med. terms, cs terms, and common terms; the columns into
med. docs and cs docs. The approximation factors as (term x term-group) x (term-group x doc-group) x
(doc-group x doc), so the term groups and document groups are found simultaneously]
78
Co-clustering
Observations
• uses KL divergence, instead of L2 or LF
• the middle matrix is not diagonal
– we’ll see that again in the Tucker tensor
decomposition

79
Matrix & Tensor Tools
• Matrix Tools
• Tensor Tools
– Tensor Basics
– Tucker
• Tucker 1
• Tucker 2
• Tucker 3
– PARAFAC

80
Tensor Basics
Reminder: SVD

[Figure: A (m x n) ≈ U Σ V^T]
– Best rank-k approximation in L2 or LF

82
Reminder: SVD

A ≈ σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + …

– Best rank-k approximation in L2

83 (see also PARAFAC)


Goal: extension to >=3 modes

[Figure: an I x J x K tensor approximated by factor matrices A (I x R), B (J x R), C (K x R),
i.e., a sum of R rank-1 terms with an R x R x R (super)diagonal core]

84
Main points:
• 2 major types of tensor decompositions:
PARAFAC and Tucker
• both can be solved with ``alternating least
squares’’ (ALS)
• Details follow – we start with terminology:

85
[T. Kolda,’07]
A tensor is a multidimensional array
An I x J x K tensor X = (x_ijk) is a 3rd-order tensor:
- mode 1 has dimension I, mode 2 has dimension J, mode 3 has dimension K
- column (mode-1) fibers, row (mode-2) fibers, tube (mode-3) fibers
- horizontal, lateral, and frontal slices

Note: the focus here is on 3rd-order tensors, but everything can be extended to higher orders.

86
details [T. Kolda,’07]
Matricization: Converting a Tensor to
a Matrix
X_(n): the mode-n fibers are rearranged to be the columns of a matrix ("matricize"/"unfolding",
mapping (i, j, k) to (i', j')); the reverse operation maps the matrix entries back to the tensor.

[Small example: a 2 x 2 x 2 tensor with entries 1..8 unfolds to the 2 x 4 matrix
  1 3 5 7
  2 4 6 8 ]

87
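A short numpy sketch of unfolding and folding. Note: the column ordering produced by C-order reshaping differs from the Fortran-order convention used in Kolda & Bader, but unfold and fold here are mutually consistent.

```python
import numpy as np

def unfold(X, mode):
    """Mode-n matricization X_(n): the mode-n fibers become the columns of a matrix."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def fold(M, mode, shape):
    """Inverse of unfold: rebuild the tensor of the given shape from X_(n)."""
    full_shape = (shape[mode],) + tuple(s for i, s in enumerate(shape) if i != mode)
    return np.moveaxis(M.reshape(full_shape), 0, mode)

X = np.arange(24).reshape(2, 3, 4)           # a small 2 x 3 x 4 tensor
X1 = unfold(X, 0)                            # shape (2, 12)
assert np.array_equal(fold(X1, 0, X.shape), X)
```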
details

Tensor Mode-n Multiplication

• Tensor times matrix: multiply each row (mode-2) fiber by B
• Tensor times vector: compute the dot product of a and each column (mode-1) fiber

[T. Kolda,’07]
88
details

Mode-n product Example


• Tensor times a matrix

[Example: multiplying a tensor with modes such as location and time by a matrix along one mode
replaces that mode's dimension by the number of clusters]

[T. Kolda,’07]
89
details

Mode-n product Example


• Tensor times a vector

[Example: multiplying the same tensor by a vector along one mode removes that mode entirely,
leaving a smaller array over the remaining modes]

[T. Kolda,’07]
90
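A numpy sketch of the mode-n product via tensordot; the shapes and the location/type/time naming are illustrative assumptions.

```python
import numpy as np

def mode_n_product(X, B, mode):
    """Tensor-times-matrix: contract mode `mode` of X with the columns of B,
    so the result has B.shape[0] along that mode."""
    # tensordot puts the new axis last; move it back to position `mode`.
    return np.moveaxis(np.tensordot(X, B, axes=(mode, 1)), -1, mode)

X = np.random.rand(4, 5, 6)          # e.g., location x type x time
B = np.random.rand(3, 5)             # maps the 5 "type" coordinates to 3 clusters
Y = mode_n_product(X, B, mode=1)     # shape (4, 3, 6)

# Tensor-times-vector: the same contraction, but the mode is removed entirely.
v = np.random.rand(6)
Z = np.tensordot(X, v, axes=(2, 0))  # shape (4, 5)
```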
details
Outer, Kronecker, &
Khatri-Rao Products
• 3-way outer product: a ∘ b ∘ c is a rank-1 tensor
• Matrix Kronecker product: (M x N) ⊗ (P x Q) → (MP x NQ)
• Matrix Khatri-Rao product: column-wise Kronecker product, (M x R) ⊙ (N x R) → (MN x R)

91 [T. Kolda,’07]
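A quick numpy/scipy illustration of the three products (scipy.linalg.khatri_rao is available in SciPy >= 1.3; the sizes are arbitrary):

```python
import numpy as np
from scipy.linalg import khatri_rao   # column-wise Kronecker product

A = np.random.rand(3, 4)   # M x R
B = np.random.rand(5, 4)   # N x R

K = np.kron(A, B)          # Kronecker product: (3*5) x (4*4)
KR = khatri_rao(A, B)      # Khatri-Rao product: (3*5) x 4, column r is kron(A[:, r], B[:, r])

# A rank-1 three-way outer product a o b o c as a dense tensor:
a, b, c = np.random.rand(3), np.random.rand(4), np.random.rand(5)
T = np.einsum('i,j,k->ijk', a, b, c)
```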
Specially Structured Tensors
• Tucker tensor: an I x J x K tensor written as a "core" tensor (R x S x T) multiplied by a factor
  matrix in each mode, e.g., U (I x R), V (J x S), and a third factor
• Kruskal tensor: the special case of a superdiagonal (R x R x R) core, i.e., a weighted sum of R
  rank-1 tensors w_1 u_1 ∘ v_1 ∘ … + … + w_R u_R ∘ v_R ∘ …
[T. Kolda,’07]
93
details

Specially Structured Tensors


• Tucker tensor (X ≈ G x_1 A x_2 B x_3 C) in matrix form:  X_(1) = A G_(1) (C ⊗ B)^T
• Kruskal tensor (X ≈ Σ_r w_r a_r ∘ b_r ∘ c_r) in matrix form:  X_(1) = A diag(w) (C ⊙ B)^T

[T. Kolda,’07]
94
Outline: Part 2
• Matrix Tools
• Tensor Tools
– Tensor Basics
– Tucker
• Tucker 1
• Tucker 2
• Tucker 3
– PARAFAC

95
Tensor Decompositions
Tucker Decomposition - intuition

[Figure: X (I x J x K) ≈ G (R x S x T) x_1 A (I x R) x_2 B (J x S) x_3 C (K x T)]

• author x keyword x conference


• A: author x author-group
• B: keyword x keyword-group
• C: conf. x conf-group
• G: how groups relate to each other
97
Reminder
[The co-clustering example from before: the (term x doc) distribution is approximated as
(term x term-group) x (term-group x doc-group) x (doc-group x doc); note the middle matrix
is small but not diagonal]
98
Tucker Decomposition

[Figure: X (I x J x K) ≈ G x_1 A x_2 B x_3 C, with A (I x R), B (J x S), C (K x T), core G (R x S x T)]

Given A, B, C, the optimal core is G = X x_1 A^T x_2 B^T x_3 C^T
(recall the equations for converting a tensor to a matrix)

• Proposed by Tucker (1966)
• AKA: three-mode factor analysis, three-mode PCA, orthogonal array decomposition
• A, B, and C are generally assumed to be orthonormal (at least to have full column rank)
• The core G is not diagonal
• Not unique

99
details

Tucker Variations
See Kroonenberg & De Leeuw, Psychometrika, 1980 for discussion.
• Tucker2: the mode-3 factor is the identity matrix; X (I x J x K) ≈ G (R x S x K) x_1 A (I x R) x_2 B (J x S)
• Tucker1: only one mode is factorized; X (I x J x K) ≈ G (R x J x K) x_1 A (I x R)
  – finding principal components in only mode 1 can be solved via a rank-R matrix SVD

100
details
Solving for Tucker
Given A, B, C orthonormal, the optimal core is G = X x_1 A^T x_2 B^T x_3 C^T.
(The tensor norm is the square root of the sum of all the elements squared.)

Eliminating the core: minimizing ||X – G x_1 A x_2 B x_3 C|| subject to A, B, C orthonormal
is equivalent to maximizing ||X x_1 A^T x_2 B^T x_3 C^T||.

If B and C are fixed, we can solve for A:
the optimal A is the R leading left singular vectors of X_(1) (C ⊗ B).


101
details

Higher Order SVD (HO-SVD)


[HO-SVD: each factor matrix is taken as the R (resp. S, T) leading left singular vectors of the
corresponding mode-n unfolding; observe the connection to Tucker1]

Not optimal, but often used to initialize the Tucker-ALS algorithm.

De Lathauwer, De Moor, & Vandewalle, A Multilinear Singular Value Decomposition, SIMAX, 2000


102
Tucker-Alternating Least Squares (ALS)
Successively solve for each component (A,B,C).

• Initialize
  – Choose R, S, T
  – Calculate A, B, C via HO-SVD
• Until converged do…
  – A = R leading left singular vectors of X_(1) (C ⊗ B)
  – B = S leading left singular vectors of X_(2) (C ⊗ A)
  – C = T leading left singular vectors of X_(3) (B ⊗ A)
• Solve for the core: G = X x_1 A^T x_2 B^T x_3 C^T

Kroonenberg & De Leeuw, Psychometrika, 1980


103
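The following is a compact, assumed implementation of this scheme in numpy (HO-SVD initialization followed by ALS sweeps, sometimes called HOOI); the helper functions, ranks, and iteration count are illustrative, not the reference code.

```python
import numpy as np

def unfold(X, mode):
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_prod(X, B, mode):
    return np.moveaxis(np.tensordot(X, B, axes=(mode, 1)), -1, mode)

def tucker_als(X, ranks, n_iter=20):
    """Tucker-ALS sketch: HO-SVD init, then for each mode take the leading left singular
    vectors of the unfolding of X projected on all the other modes; finally compute the core."""
    N = X.ndim
    factors = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :ranks[n]]
               for n in range(N)]                      # HO-SVD initialization
    for _ in range(n_iter):
        for n in range(N):
            Y = X
            for m in range(N):
                if m != n:
                    Y = mode_prod(Y, factors[m].T, m)  # project all the other modes
            factors[n] = np.linalg.svd(unfold(Y, n), full_matrices=False)[0][:, :ranks[n]]
    core = X
    for n in range(N):
        core = mode_prod(core, factors[n].T, n)        # G = X x_1 A^T x_2 B^T x_3 C^T
    return core, factors

X = np.random.rand(10, 12, 14)
core, (A, B, C) = tucker_als(X, ranks=(3, 4, 5))
X_hat = mode_prod(mode_prod(mode_prod(core, A, 0), B, 1), C, 2)   # G x_1 A x_2 B x_3 C
err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```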
details
Tucker is Not Unique

[Figure: X ≈ G x_1 A x_2 B x_3 C]

The Tucker decomposition is not unique. Let Y be an R x R orthonormal matrix; then replacing
A by A Y and the core G by G x_1 Y^T gives exactly the same fit.

[T. Kolda,’07]
104
Outline: Part 2
• Matrix Tools
• Tensor Tools
– Tensor Basics
– Tucker
• Tucker 1
• Tucker 2
• Tucker 3
– PARAFAC

105
CANDECOMP/PARAFAC
Decomposition

[Figure: X (I x J x K) ≈ sum of R rank-1 terms, with factor matrices A (I x R), B (J x R), C (K x R)]

• CANDECOMP = Canonical Decomposition (Carroll & Chang, 1970)
• PARAFAC = Parallel Factors (Harshman, 1970)
• The core is diagonal (specified by the weight vector λ)
• Columns of A, B, and C are not orthonormal
• If R is minimal, then R is called the rank of the tensor (Kruskal 1977)
• Can have rank(X) > min{I, J, K}
106
details

PARAFAC-Alternating Least Squares (ALS)


Successively solve for each component (A,B,C).

[X (I x J x K) ≈ sum of R rank-1 terms; find all the vectors in one mode at a time]

If C, B, and λ are fixed, the optimal A is given by a least-squares problem involving the
Khatri-Rao product (column-wise Kronecker product) C ⊙ B and the Hadamard product (C^T C) * (B^T B):

  A = X_(1) (C ⊙ B) [(C^T C) * (B^T B)]^+

Repeat for B, C, etc.

[T. Kolda,'07]
107
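A minimal CP-ALS sketch in numpy/scipy for a 3-way tensor. Note that the Khatri-Rao ordering below matches the C-order unfolding used in this sketch (it differs from the Fortran-order convention of Kolda & Bader), and the column weights λ are folded into the factors rather than tracked separately; the test tensor and rank are illustrative.

```python
import numpy as np
from scipy.linalg import khatri_rao

def unfold(X, mode):
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def cp_als(X, R, n_iter=50, seed=0):
    """CP/PARAFAC-ALS sketch: fix two factor matrices, solve least squares for the third."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A, B, C = rng.random((I, R)), rng.random((J, R)), rng.random((K, R))
    for _ in range(n_iter):
        A = unfold(X, 0) @ np.linalg.pinv(khatri_rao(B, C)).T
        B = unfold(X, 1) @ np.linalg.pinv(khatri_rao(A, C)).T
        C = unfold(X, 2) @ np.linalg.pinv(khatri_rao(A, B)).T
    return A, B, C

# Recover an exact rank-3 tensor (synthetic example).
rng = np.random.default_rng(1)
A0, B0, C0 = rng.random((6, 3)), rng.random((7, 3)), rng.random((8, 3))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = cp_als(X, R=3)
X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```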
details

PARAFAC is often unique


[X (I x J x K) = a_1 ∘ b_1 ∘ c_1 + … + a_R ∘ b_R ∘ c_R; assume the PARAFAC decomposition is exact]

Sufficient condition for uniqueness (Kruskal, 1977):

  k_A + k_B + k_C >= 2R + 2

where k_A (the k-rank of A) is the maximum number k such that every set of k columns of A
is linearly independent.
108
Tucker vs. PARAFAC Decompositions
• Tucker
  – Variable transformation in each mode
  – Core G may be dense
  – A, B, C generally orthonormal
  – Not unique
• PARAFAC
  – Sum of rank-1 components
  – No core, i.e., superdiagonal core
  – A, B, C may have linearly dependent columns
  – Generally unique

[Figures: X ≈ G x_1 A x_2 B x_3 C (Tucker) vs. X ≈ a_1 ∘ b_1 ∘ c_1 + … + a_R ∘ b_R ∘ c_R (PARAFAC)]

109
Tensor tools - summary
• Two main tools
– PARAFAC
– Tucker
• Both find row-, column-, tube-groups
– but in PARAFAC the three groups are identical
• To solve: Alternating Least Squares

110
Tensor tools - resources
• Toolbox: from Tamara Kolda:
csmr.ca.sandia.gov/~tgkolda/TensorToolbox/
• T. G. Kolda and B. W. Bader. Tensor
Decompositions and Applications. SIAM
Review 2008
• csmr.ca.sandia.gov/~tgkolda/pubs/bibtgkfiles/TensorReview-preprint.pdf

111
Key Papers
Core Papers
• Hanghang Tong, Spiros Papadimitriou, Jimeng Sun, Philip S. Yu, Christos Faloutsos: Colibri: fast mining
of large static and dynamic graphs. KDD 2008: 686-694
• Dhillon et al. Information-Theoretic Co-clustering, KDD’03
• T. G. Kolda and B. W. Bader. Tensor Decompositions and Applications. SIAM Review 2008

Further Reading
• Chih-Jen Lin: Projected Gradient Methods for Non-negative Matrix Factorization.
https://www.csie.ntu.edu.tw/~cjlin/papers/pgradnmf.pdf
• Candès, Emmanuel J., and Benjamin Recht. "Exact matrix completion via convex optimization."
Foundations of Computational mathematics 9, no. 6 (2009): 717.
• Rendle, S. (2010, December). Factorization machines. In 2010 IEEE International Conference on Data
Mining (pp. 995-1000). IEEE.
• Tamara G. Kolda, Brett W. Bader, Joseph P. Kenny: Higher-Order Web Link Analysis Using Multilinear
Algebra. ICDM 2005: 242-249
• U Kang, Evangelos E. Papalexakis, Abhay Harpale, Christos Faloutsos: GigaTensor: scaling tensor analysis
up by 100 times - algorithms and discoveries. KDD 2012: 316-324
• Deepayan Chakrabarti, Spiros Papadimitriou, Dharmendra S. Modha, Christos Faloutsos: Fully automatic
cross-associations. KDD 2004: 79-88
• Trigeorgis, G., Bousmalis, K., Zafeiriou, S., & Schuller, B. (2014, January). A deep semi-nmf model for
learning hidden representations. In International Conference on Machine Learning (pp. 1692-1700).
• Risi Kondor, Nedelina Teneva, and Vikas Garg. 2014. Multiresolution matrix factorization. In International
Conference on Machine Learning. 1620–1628

112
