Tutorial Part I: Information Theory Meets Machine Learning

Emmanuel Abbe (Princeton University)
Martin Wainwright (UC Berkeley)

June 2015
Introduction
Era of massive data sets
Fascinating problems at the interface between information theory and statistical machine learning.
Fundamental issues
[Figure: Claude Shannon and Andrey Kolmogorov, each shown with data samples X_1^n.]
Three factorizations over p = 5 variables x = (x_1, ..., x_5):

Independent model:
Q(x) \propto \prod_{s=1}^{5} \exp(\theta_s x_s)

Pairwise model with edge set C:
Q(x) \propto \prod_{s=1}^{5} \exp(\theta_s x_s) \prod_{(s,t) \in C} \exp(\theta_{st} x_s x_t)

Fully connected pairwise model:
Q(x) \propto \prod_{s=1}^{5} \exp(\theta_s x_s) \prod_{s \neq t} \exp(\theta_{st} x_s x_t)
Underlying graphs G = (V, E): each factorization corresponds to a distribution
Q(x; \theta, G) \propto \exp\Bigl( \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \Bigr)
Error criterion:
\underbrace{Q^n\bigl[\widehat{G}(X_1^n) \neq G\bigr]}_{\text{prob. that estimated graph differs from truth}}
Channel decoding:
Think of graphs as codewords, and the graph family as a codebook.
Gaussian and binary cases:
information-theoretic bounds (Santhanam & W., 2012)
computational lower bounds (Dagum & Luby, 1993; Bresler et al., 2014)
phase transitions and performance of neighborhood regression (Bento & Montanari, 2009); see the sketch below
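As a concrete illustration of neighborhood regression for binary graphical models (a minimal sketch, not the slides' own implementation; the function name and the regularization level lam are assumptions chosen for exposition): each node is regressed on all others with an \ell_1-penalized logistic regression, and nonzero coefficients are read off as edges.

# Neighborhood regression for Ising-type graph selection (illustrative sketch).
# X: (n x p) array with entries in {-1, +1}; lam: assumed regularization level.
import numpy as np
from sklearn.linear_model import LogisticRegression

def neighborhood_regression(X, lam=0.1):
    n, p = X.shape
    edges = set()
    for s in range(p):
        y = (X[:, s] > 0).astype(int)            # node s as binary response
        Z = np.delete(X, s, axis=1)              # remaining nodes as covariates
        clf = LogisticRegression(penalty="l1", solver="liblinear",
                                 C=1.0 / (lam * n))
        clf.fit(Z, y)
        others = [t for t in range(p) if t != s]
        for j, t in enumerate(others):
            if abs(clf.coef_[0, j]) > 1e-6:      # nonzero weight -> candidate edge
                edges.add((min(s, t), max(s, t)))
    return edges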
[Figure: example graphs with p = 18 nodes and degrees d = 3 and d = 6.]

[Figure: probability of success of graph recovery for p in {64, 100, 225}, plotted against the number of samples n (0 to 600) and against the control parameter n / (d^2 log p) (0 to 2).]
Fano-type bound: given n samples X_1, X_2, ..., X_n drawn from Q(X | G), consistent graph recovery requires
I(X_1^n; G) \ge (1 - o(1)) \log |\mathcal{G}|.
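For completeness, the standard argument behind this condition (assuming G is drawn uniformly at random from the family \mathcal{G}) is Fano's inequality:

\mathbb{P}\bigl[\widehat{G}(X_1^n) \neq G\bigr] \;\ge\; 1 - \frac{I(X_1^n; G) + \log 2}{\log |\mathcal{G}|},

so driving the error probability to zero forces I(X_1^n; G) \ge (1 - o(1)) \log |\mathcal{G}|.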
[Figure: base graph G built from (d+1)-cliques on groups 1, 2, 3, ..., together with graphs G_{uv} and G_{st} obtained by deleting a single edge.]

Divide the vertex set V into p/(d+1) groups of size d + 1, and form the base graph G by making a (d+1)-clique within each group. Form graph G_{uv} by deleting edge (u, v) from G.
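A quick count, implicit in this construction: each of the p/(d+1) cliques contains \binom{d+1}{2} edges, so the ensemble \{G_{uv}\} of single-edge deletions has

\frac{p}{d+1}\binom{d+1}{2} \;=\; \frac{pd}{2}

members, i.e. \log|\mathcal{G}| is of order \log(pd).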
Spiked covariance:
\Sigma \;=\; \underbrace{\gamma}_{\text{signal-to-noise}}\,\underbrace{\theta^*(\theta^*)^T}_{\text{rank-one spike}} \;+\; I_p

Sample covariance:
\widehat{\Sigma} \;:=\; \frac{1}{n}\sum_{i=1}^{n} X_i X_i^T
[Figure: diagonal values \widehat{\Sigma}_{jj} plotted against index j = 1, ..., 50.]
Diagonal thresholding (DT): given n i.i.d. samples X_i with zero mean and spiked covariance \Sigma = \gamma\,\theta^*(\theta^*)^T + I:
1. Compute the diagonal entries of the sample covariance: \widehat{\Sigma}_{jj} = \frac{1}{n}\sum_{i=1}^{n} X_{ij}^2.
2. Apply a threshold to the vector \{\widehat{\Sigma}_{jj},\; j = 1, \ldots, p\}.
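A minimal Python sketch of these two steps (illustrative only; the specific threshold rule, one plus a multiple of \sqrt{\log p / n}, is an assumption rather than a prescription from the slides):

# Diagonal thresholding for sparse PCA support recovery (illustrative sketch).
# X: (n x p) array of zero-mean samples from the spiked covariance model.
import numpy as np

def diagonal_thresholding(X, c=4.0):
    n, p = X.shape
    diag = (X ** 2).mean(axis=0)              # step 1: diagonal of sample covariance
    tau = 1.0 + c * np.sqrt(np.log(p) / n)    # assumed threshold: noise level + buffer
    return np.nonzero(diag > tau)[0]          # step 2: indices retained as support estimate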
[Figure: probability of success of diagonal thresholding for p in {100, 200, 300, 600, 1200}, plotted against the number of observations n (0 to 800) and against the control parameter n / (k^2 log p) (0 to 15).]
Spiked covariance model:
\Sigma \;=\; \underbrace{\gamma}_{\text{signal-to-noise}}\,\underbrace{\theta^*(\theta^*)^T}_{\text{rank-one spike}} \;+\; I_{p \times p},
\quad\text{where}\quad
\theta^* \in \underbrace{B_0(k) \cap B_2(1)}_{\text{$k$-sparse and unit norm}}.
Sample-size thresholds:
\underbrace{\frac{n}{k^2 \log p} > \nu_{\mathrm{DT}}}_{\text{DT succeeds w.h.p.}}
\qquad\text{versus}\qquad
\underbrace{\frac{n}{k \log p} > \nu_{\mathrm{ES}}}_{\text{exhaustive search succeeds w.h.p.}}
SDP relaxation:
\max_{\substack{Z \in \mathbb{R}^{p \times p} \\ Z = Z^T,\; Z \succeq 0}} \operatorname{trace}(\Sigma Z)
\quad \text{s.t.} \quad \operatorname{trace}(Z) = 1.
In practice (d'Aspremont et al., 2008):
apply this relaxation using the sample covariance matrix \widehat{\Sigma}
add the \ell_1-constraint \sum_{i,j=1}^{p} |Z_{ij}| \le k.
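A minimal sketch of this relaxation written with cvxpy (the function name sdp_sparse_pca is an assumption for illustration; Sigma_hat and k are the inputs described above):

# SDP relaxation for sparse PCA, sketched with cvxpy.
# Sigma_hat: (p x p) sample covariance; k: assumed sparsity level.
import cvxpy as cp

def sdp_sparse_pca(Sigma_hat, k):
    p = Sigma_hat.shape[0]
    Z = cp.Variable((p, p), symmetric=True)
    constraints = [Z >> 0,                    # positive semidefinite
                   cp.trace(Z) == 1,          # unit trace
                   cp.sum(cp.abs(Z)) <= k]    # l1 constraint implied by k-sparsity
    problem = cp.Problem(cp.Maximize(cp.trace(Sigma_hat @ Z)), constraints)
    problem.solve()
    return Z.value

The leading eigenvector of the returned Z can then be thresholded to read off an estimate of the support of \theta^*.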
[Figure: probability of success of the SDP relaxation for p in {100, 200, 300}, plotted against the control parameter n / (k log p) (0 to 15).]
A natural question

Can the logarithmic-sparsity or rank-one conditions be removed?
Detection formulation: with
\Sigma \;=\; \underbrace{\gamma}_{\text{signal-to-noise}}\,\underbrace{(\theta^*)(\theta^*)^T}_{\text{rank-one spike}} \;+\; I_{p \times p},
\quad\text{where}\quad
\theta^* \in \underbrace{B_0(k) \cap B_2(1)}_{\text{$k$-sparse and unit norm}},
distinguish
H_0 (no signal): X_i \sim D(0, I_{p \times p})
H_1 (spiked signal): X_i \sim D(0, \Sigma).
The critical scaling for polynomial-time methods is n \asymp k^2 \log p.
Connection to planted structures:
Erdős–Rényi graph versus planted k-clique
Random entries versus planted k × k sub-matrix
[Figure: two panels of function value versus design value x over [-0.5, 0.5].]
Linear regression model:
y = \underbrace{\sum_{j=1}^{p} \theta_j x_j}_{\langle \theta, x \rangle} + w

Error scalings:
without sparsity: \sigma^2\, p / n  (linear in p)
with sparsity k \ll p: \sigma^2\, \frac{k \log(ep/k)}{n}  (logarithmic in p)
nonparametric regression in dimension p: (1/n)^{\frac{2\alpha}{2\alpha + p}}  (exponential slow-down)
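To make the contrast concrete (numbers chosen purely for illustration, not taken from the slides): with p = 10{,}000 and k = 10,

\frac{\sigma^2 p}{n} = \frac{10{,}000\,\sigma^2}{n}
\qquad\text{versus}\qquad
\frac{\sigma^2\, k \log(ep/k)}{n} \approx \frac{79\,\sigma^2}{n},

so exploiting sparsity reduces the required sample size by a factor of roughly 126.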
Minimax risk

For a given sample size n, the minimax risk is
\inf_{\widehat{f}} \mathcal{R}^n_{\mathrm{worst}}(\widehat{f}; \mathcal{F}) \;=\; \inf_{\widehat{f}} \sup_{f \in \mathcal{F}} \mathbb{E}\bigl[\rho^2(\widehat{f}, f)\bigr],
where the infimum ranges over all estimators \widehat{f} and \rho is a given error metric.
[Figure: functions f^1, f^2, f^3, f^4, ..., f^M forming a 2\delta-separated packing of the function class \mathcal{F}.]
Error metrics:
\|\widehat{f} - f\|_n^2 := \frac{1}{n}\sum_{i=1}^{n}\bigl(\widehat{f}(x_i) - f(x_i)\bigr)^2,
\qquad\text{or}\qquad
\|\widehat{f} - f\|_2^2 = \mathbb{E}\bigl(\widehat{f}(\widetilde{X}) - f(\widetilde{X})\bigr)^2.
Sparse vectors: \theta \in \mathbb{R}^p with \|\theta\|_0 \le k and \|\theta\|_2 \le 1. This class admits a packing with on the order of (ep/k)^k elements, leading to the minimax rate
\frac{k \log(ep/k)}{n}.
Polynomial-time achievability: via \ell_1-relaxations under restricted eigenvalue (RE) conditions
(Candès & Tao, 2007; Bickel et al., 2009; Bühlmann & van de Geer, 2011)
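For concreteness, a minimal scikit-learn sketch of such an \ell_1-relaxation (the Lasso); the regularization level follows the usual \sigma\sqrt{\log p / n} scaling, with the constant 2 chosen arbitrarily here:

# Lasso (l1-relaxation) for sparse linear regression, illustrative sketch.
# X: (n x p) design matrix; y: length-n response; sigma: assumed noise level.
import numpy as np
from sklearn.linear_model import Lasso

def sparse_regression(X, y, sigma=1.0):
    n, p = X.shape
    lam = 2.0 * sigma * np.sqrt(np.log(p) / n)   # theoretical scaling of the penalty
    model = Lasso(alpha=lam)
    model.fit(X, y)
    return model.coef_                           # sparse estimate of theta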
For \alpha-smooth functions, the metric entropy scales as
\log M(2\delta; \mathcal{F}) \asymp (1/\delta)^{1/\alpha},
which yields the minimax rate
\delta_n^2 \asymp (1/n)^{\frac{2\alpha}{2\alpha + 1}}.
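One way to see this scaling (a standard balancing argument, not spelled out on the slide): the critical \delta_n solves

n\,\delta_n^2 \;\asymp\; \log M(\delta_n; \mathcal{F}) \;\asymp\; (1/\delta_n)^{1/\alpha}
\quad\Longrightarrow\quad
\delta_n^{2 + 1/\alpha} \asymp \frac{1}{n}
\quad\Longrightarrow\quad
\delta_n^2 \asymp n^{-\frac{2\alpha}{2\alpha+1}}.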
Sparse additive models:

Additively decomposable: f(x) = \sum_{j=1}^{p} g_j(x_j)
Sparse: at most k of (g_1, \ldots, g_p) are non-zero
Smooth: each g_j is \alpha-smooth

Minimax rate:
\underbrace{k\,(1/n)^{\frac{2\alpha}{2\alpha+1}}}_{\text{$k$-component estimation}}
\;+\;
\underbrace{\frac{k \log(ep/k)}{n}}_{\text{search complexity}}
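As an interpretation of the two terms (not stated explicitly on the slide): the estimation term dominates for moderate dimensions, while the search term takes over once

\frac{k \log(ep/k)}{n} \;\gtrsim\; k\, n^{-\frac{2\alpha}{2\alpha+1}}
\quad\Longleftrightarrow\quad
\log\frac{ep}{k} \;\gtrsim\; n^{\frac{1}{2\alpha+1}}.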
Summary
To be provided during tutorial....