Computational Information Geometry For Binary Classification of High-Dimensional Random Tensors
1. Introduction
the Bayesian detector chooses the alternative hypothesis H1 if Pr(H1|y) > Pr(H0|y) for a given N-dimensional measurement vector y, or the null hypothesis H0 otherwise. Consequently, the optimal decision rule can often only be derived at the price of a costly numerical computation of the log posterior-odds ratio [3], since an exact calculation of the minimal Bayes' error probability $P_e^{(N)}$ is often intractable [3,8]. To circumvent this problem, it is standard to exploit well-known bounds on $P_e^{(N)}$ based on information theory [9–13]. In particular, the Chernoff information [14,15] is asymptotically (in N) related to the exponential decay rate of $P_e^{(N)}$. It turns out that the Chernoff information is very useful in many practically important problems such as, for instance, distributed sparse detection [16], sparse support recovery [17], energy detection [18], multi-input and multi-output (MIMO) radar processing [19,20], network secrecy [21], angular resolution limit in array processing [22], and detection performance for informed communication systems [23], just to name a few. In addition, the Chernoff information bound can be made tight by optimizing the s-divergence over the parameter s ∈ (0, 1). Generally, this step requires solving an optimization problem numerically [24] and often leads to a complicated and uninformative expression for the optimal value of s. To circumvent this difficulty, the simplified case s = 1/2 is often used, corresponding to the well-known Bhattacharyya divergence [13], at the price of a less accurate prediction of $P_e^{(N)}$. In information geometry, the parameter s is often called α, and the s-divergence is the so-called Chernoff α-divergence [24].
The tensor decomposition theory is a timely and prominent research topic [25,26]. When confronting the problem of extracting useful information from a massive and multidimensional volume of measurements, tensors have been shown to be extremely relevant. In the standard literature, two main families of tensor decomposition are prominent, namely the Canonical Polyadic Decomposition (CPD) [26] and the Tucker decomposition (TKD)/HOSVD (High-Order SVD) [27,28]. These approaches are two possible multilinear generalizations of the Singular Value Decomposition (SVD). The CPD provides a natural generalization to tensors of the usual concept of matrix rank. The tensorial/canonical rank of a P-order tensor is the minimal positive integer, say R, of unit-rank tensors that must be summed up for perfect recovery; a unit-rank tensor is the outer product of P vectors. In addition, the CPD has remarkable uniqueness properties [26] and involves only a reduced number of free parameters due to the constraint of minimality on R. Unfortunately, unlike the matrix case, the set of tensors with fixed (tensorial) rank is not closed [29,30]. This singularity implies that the computation of the CPD is a mathematically ill-posed problem. As a consequence, its numerical computation remains nontrivial and is usually done using suboptimal iterative algorithms [31]. Note that this problem can sometimes be avoided by exploiting natural hidden structures in the physical model [32].
The TKD [28] and the HOSVD [27] are two popular decompositions that serve as alternatives to the CPD. In this context, an alternative definition of rank is required, since the tensorial rank based on the CPD is no longer appropriate. In particular, the standard definition of the multilinear rank is the set of positive integers {R_1, ..., R_P} where each integer R_p is the usual rank of the p-th mode. Following the Eckart-Young theorem at each mode level [33], this construction is non-iterative, optimal and practical. It has also been shown to be suitable for real-time [34] or adaptive [35] computation. However, in general, the low (multilinear) rank tensor obtained by this procedure is suboptimal [27]. More precisely, for tensors of order strictly greater than two, a generalization of the Eckart-Young theorem does not exist.
The classification performance of a multilinear tensor following the CPD and the TKD can be derived and studied. It is interesting to note that classification theory for tensors is very understudied. To the best of our knowledge, only the publication [36] tackles this problem, in the context of radar multidimensional data detection. A major difference from that publication is that its analysis is based on the performance of a low-rank detector after matched filtering.
More precisely, we consider two cases where the observations are either (1) a noisy rank-R tensor admitting a Q-order CPD with large factors of size $N_q \times R$, i.e., for $1 \leq q \leq Q$, $R, N_q \to \infty$ with $R^{1/Q}/N_q$ converging towards a finite constant, or (2) a noisy tensor admitting a TKD of multilinear $(M_1, \ldots, M_Q)$-rank with large factors of size $N_q \times M_q$, i.e., for $1 \leq q \leq Q$, where $N_q, M_q \to \infty$ with $M_q/N_q$ converging towards a finite constant. A standard approach for zero-mean independent Gaussian core and noise tensors is to define the Signal to Noise Ratio by SNR = σ_s²/σ², where σ_s² and σ² are the variances of the vectorized core and noise tensors, respectively. So, the binary classification can be described in the following way:
Under the null hypothesis H0, SNR = 0, meaning that the observed tensor contains only noise. Conversely, the alternative hypothesis H1 corresponds to SNR ≠ 0, meaning that there exists a multilinear signal of interest. First note that contributions dealing with classification performance for tensors are lacking. Since the exact derivation of the error probability is intractable, the performance of the classification of the core tensor random entries is hard to evaluate. To circumvent this difficulty, based on computational information geometry, we consider the Chernoff Upper Bound (CUB) and the Fisher information in the context of massive measurement vectors. The error exponent is optimized at s⋆, which corresponds to the tightest reachable CUB. In general, due to the asymmetry of the s-divergence, the Bhattacharyya Upper Bound (BUB), i.e., the Chernoff information calculated at s = 1/2, cannot solve this problem effectively. As a consequence, we rely on a costly numerical optimization strategy to find s⋆. However, with respect to different Signal to Noise Ratios (SNR), we provide simple analytical expressions of s⋆ thanks to the so-called Random Matrix Theory (RMT). For low SNR, analytical expressions of the Fisher information are given. Note that the analysis of the Fisher information in the context of RMT has only been studied in recent contributions [37–39], for parameter estimation. For larger SNR, simple analytic expressions of the CUB for the CPD and the TKD are provided.
We note that Random Matrix Theory (RMT) has attracted both mathematicians and physicists since random matrices were first introduced in mathematical statistics by Wishart in 1928 [40]. When Wigner [41] introduced the concept of the statistical distribution of nuclear energy levels, the subject started to gain prominence. However, it took until 1955 before Wigner [42] introduced ensembles of random matrices. Since then, many important results in RMT have been developed and analyzed, see for instance [43–46] and the references therein. In the last two decades, research on RMT has been constantly published.
Finally, let us underline that many arguments of this paper differ from the works presented in [47,48]. In [47], we tackled the detection problem for matrix-type data using the Chernoff Upper Bound in the double asymptotic regime. In [48], we addressed the detection problem for tensor data by analyzing the Chernoff Upper Bound; assuming that the tensor follows the Canonical Polyadic Decomposition (CPD), we gave some analysis of the Chernoff Upper Bound when the rank of the tensor is much smaller than its dimensions. Since [47,48] are conference papers, some proofs had been omitted due to limited space. Therefore, this full paper shares the ideas of [47,48] on Information Geometry (s-divergence, Chernoff Upper Bound, Fisher Information, etc.), but completes [48] in a more general asymptotic regime. Moreover, in this work, we give a new analysis in both scenarios (small and large SNR), whereas [48] did not, and the important and difficult new tensor scenario of the Tucker decomposition is considered. This is, in our view, the main difference, because the CPD is a particular case of the more general Tucker decomposition. Indeed, in the CPD, the core tensor is assumed to be diagonal.
$$x_h = [\mathcal{X}]_{m_1,\ldots,m_Q}$$
$$[\mathcal{X} \times_q \mathbf{U}]_{m_1,\ldots,m_{q-1},k,m_{q+1},\ldots,m_Q} = \sum_{m_q=1}^{M_q} [\mathcal{X}]_{m_1,\ldots,m_Q}\,[\mathbf{U}]_{k,m_q}$$
where $1 \leq k \leq K$.
Definition 4. The q-mode unfolding matrix of size $M_q \times \prod_{k=1, k\neq q}^{Q} M_k$, denoted by $\mathbf{X}_{(q)} = \mathrm{unfold}_q(\mathcal{X})$, of a tensor $\mathcal{X} \in \mathbb{R}^{M_1 \times \ldots \times M_Q}$ is defined according to
$$\mathcal{X} = \sum_{r=1}^{R} s_r \underbrace{\phi_r^{(1)} \circ \ldots \circ \phi_r^{(Q)}}_{\mathcal{X}_r} \quad \text{with} \quad \mathrm{rank}\{\mathcal{X}_r\} = 1$$
where $\circ$ is the outer product [25], $\phi_r^{(q)} \in \mathbb{R}^{N_q \times 1}$ and $s_r$ is a real scalar.
An equivalent formulation using the q-mode product defined in Definition 3 is
$$\mathcal{X} = \mathcal{S} \times_1 \boldsymbol{\Phi}^{(1)} \times_2 \ldots \times_Q \boldsymbol{\Phi}^{(Q)}$$
where $\mathcal{S}$ is the $R \times \cdots \times R$ diagonal core tensor with $[\mathcal{S}]_{r,\ldots,r} = s_r$ and $\boldsymbol{\Phi}^{(q)} = [\phi_1^{(q)} \ldots \phi_R^{(q)}]$ is the q-th factor matrix of size $N_q \times R$.
where $\mathbf{S} = \mathrm{diag}(\mathbf{s})$ with $\mathbf{s} = [s_1, \ldots, s_R]^T$ and $\odot$ stands for the Khatri-Rao product [25].
$$\mathcal{X} = \sum_{m_1=1}^{M_1} \sum_{m_2=1}^{M_2} \ldots \sum_{m_Q=1}^{M_Q} s_{m_1 m_2 \ldots m_Q}\, \phi_{m_1}^{(1)} \circ \phi_{m_2}^{(2)} \circ \cdots \circ \phi_{m_Q}^{(Q)}$$
where $\phi_{m_q}^{(q)} \in \mathbb{R}^{N_q \times 1}$, $q = 1, \ldots, Q$, and $s_{m_1 m_2 \ldots m_Q}$ is a real scalar.
The q-mode product of $\mathcal{X}$ is similar to the CPD case; however, the q-mode unfolding matrix of tensor $\mathcal{X}$ is slightly different:
$$\mathbf{X}_{(q)} = \boldsymbol{\Phi}^{(q)} \mathbf{S}_{(q)} \left(\boldsymbol{\Phi}^{(Q)} \otimes \ldots \otimes \boldsymbol{\Phi}^{(q+1)} \otimes \boldsymbol{\Phi}^{(q-1)} \otimes \ldots \otimes \boldsymbol{\Phi}^{(1)}\right)^T$$
where $\mathbf{S}_{(q)} \in \mathbb{R}^{M_q \times M_1 M_2 \ldots M_{q-1} M_{q+1} \ldots M_Q}$ is the q-mode unfolding matrix of tensor $\mathcal{S}$, $\boldsymbol{\Phi}^{(q)} = [\phi_1^{(q)} \ldots \phi_{M_q}^{(q)}] \in \mathbb{R}^{N_q \times M_q}$ and $\otimes$ stands for the Kronecker product. See Figure 1.
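For illustration purposes, the q-mode product and unfolding introduced above can be checked numerically. The following minimal Python sketch (with arbitrary small dimensions; the helper functions mode_product and unfold are not part of the paper) builds a TKD tensor and verifies the 1-mode unfolding identity $\mathbf{X}_{(1)} = \boldsymbol{\Phi}^{(1)} \mathbf{S}_{(1)} (\boldsymbol{\Phi}^{(3)} \otimes \boldsymbol{\Phi}^{(2)})^T$ for Q = 3.

```python
import numpy as np

def mode_product(T, U, q):
    """q-mode product T x_q U (q is 0-based): contracts mode q of T with the columns of U."""
    return np.moveaxis(np.tensordot(U, T, axes=(1, q)), 0, q)

def unfold(T, q):
    """q-mode unfolding (0-based). Columns are ordered so that earlier remaining modes vary
    fastest, which matches the Kronecker ordering Phi(Q) x ... x Phi(q+1) x Phi(q-1) ... x Phi(1)."""
    return np.reshape(np.moveaxis(T, q, 0), (T.shape[q], -1), order="F")

rng = np.random.default_rng(0)
(M1, M2, M3), (N1, N2, N3) = (2, 3, 4), (5, 6, 7)    # illustrative multilinear ranks / dimensions
S = rng.normal(size=(M1, M2, M3))                     # core tensor
Phi = [rng.normal(size=(N, M)) for N, M in [(N1, M1), (N2, M2), (N3, M3)]]

# TKD synthesis: X = S x_1 Phi(1) x_2 Phi(2) x_3 Phi(3)
X = S
for q, U in enumerate(Phi):
    X = mode_product(X, U, q)

# 1-mode unfolding identity: X_(1) = Phi(1) S_(1) (Phi(3) kron Phi(2))^T
lhs = unfold(X, 0)
rhs = Phi[0] @ unfold(S, 0) @ np.kron(Phi[2], Phi[1]).T
print(np.allclose(lhs, rhs))                          # True
```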
Following the definitions, we note that the CPD and TKD scenarios imply that vector $\mathbf{x}$ in Equation (11) is related either to the structured linear system $\boldsymbol{\Phi}_\odot = \boldsymbol{\Phi}^{(Q)} \odot \ldots \odot \boldsymbol{\Phi}^{(2)} \odot \boldsymbol{\Phi}^{(1)}$ or to $\boldsymbol{\Phi}_\otimes = \boldsymbol{\Phi}^{(Q)} \otimes \ldots \otimes \boldsymbol{\Phi}^{(2)} \otimes \boldsymbol{\Phi}^{(1)}$.
$$\frac{1}{N} \sum_{n=1}^{N} \mathbf{v}_n \mathbf{v}_n^T$$
converges towards $\sigma^2 \mathbf{I}_M$ in the spectral norm sense. In the high dimensional asymptotic regime defined by
$$M \to +\infty, \quad N \to +\infty, \quad c_N = \frac{M}{N} \to c > 0$$
it is well understood that $\mathbf{W}_N \mathbf{W}_N^T - \sigma^2 \mathbf{I}_M$ does not converge towards $\mathbf{0}$. In particular, the empirical distribution $\hat{\nu}_N = \frac{1}{M} \sum_{m=1}^{M} \delta_{\hat{\lambda}_{m,N}}$ of the eigenvalues $\hat{\lambda}_{1,N} \geq \ldots \geq \hat{\lambda}_{M,N}$ of $\mathbf{W}_N \mathbf{W}_N^T$ does not converge towards the Dirac measure at point $\lambda = \sigma^2$. More precisely, we denote by $\nu_{c,\sigma^2}$ the Marchenko-Pastur distribution of parameters $(c, \sigma^2)$ defined as the probability measure
$$\nu_{c,\sigma^2}(d\lambda) = \delta_0 \left[1 - \frac{1}{c}\right]_+ + \frac{\sqrt{(\lambda - \lambda^-)(\lambda^+ - \lambda)}}{2\sigma^2 c \pi \lambda}\, \mathbb{1}_{[\lambda^-, \lambda^+]}(\lambda)\, d\lambda \qquad (1)$$
with $\lambda^- = \sigma^2(1 - \sqrt{c})^2$ and $\lambda^+ = \sigma^2(1 + \sqrt{c})^2$. Then, the following result holds.
Theorem 1 ([45]). The empirical eigenvalue distribution $\hat{\nu}_N$ converges weakly almost surely towards $\nu_{c,\sigma^2}$ when both M and N converge towards $+\infty$ in such a way that $c_N = \frac{M}{N}$ converges towards $c > 0$. Moreover, it holds that
$$\hat{\lambda}_{1,N} \to \sigma^2(1 + \sqrt{c})^2 \quad \text{a.s.} \qquad (2)$$
$$\hat{\lambda}_{\min(M,N)} \to \sigma^2(1 - \sqrt{c})^2 \quad \text{a.s.} \qquad (3)$$
Figure 2. Histogram of the eigenvalues of $\mathbf{W}_N \mathbf{W}_N^T / N$ (with $M = 256$, $c_N = \frac{M}{N} = \frac{1}{256}$, $\sigma^2 = 1$).
Figure 3. Histogram of the eigenvalues of $\mathbf{W}_N \mathbf{W}_N^T / N$ (with $M = 256$, $c_N = \frac{M}{N} = \frac{1}{4}$, $\sigma^2 = 1$).
We also observe that Theorem 1 remains valid if $\mathbf{W}_N$ is not a Gaussian matrix, provided that its i.i.d. elements have a finite fourth-order moment (see e.g., [43]). Theorem 1 means that when the ratio $\frac{M}{N}$ is not small enough, the eigenvalues of the empirical spatial covariance matrix of a temporally and spatially white noise tend to spread out around the variance of the noise, and that almost surely, for N large enough, all the eigenvalues are located in a neighbourhood of the interval $[\lambda^-, \lambda^+]$. See Figures 2 and 3.
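As a minimal numerical sketch of Theorem 1 (with illustrative sizes, not those used for Figures 2 and 3), one can compare the extreme eigenvalues of $\mathbf{W}_N \mathbf{W}_N^T / N$ with the Marchenko-Pastur edges $\lambda^-, \lambda^+$ of Equation (1); the density computed below can be overlaid on a histogram of the eigenvalues.

```python
import numpy as np

# Illustrative sizes (not the ones used in the paper's figures).
M, N, sigma2 = 256, 1024, 1.0
c = M / N

# W_N: M x N matrix with i.i.d. N(0, sigma2) entries.
rng = np.random.default_rng(0)
W = rng.normal(scale=np.sqrt(sigma2), size=(M, N))

# Eigenvalues of the empirical covariance W W^T / N (ascending order).
eigs = np.linalg.eigvalsh(W @ W.T / N)

# Marchenko-Pastur support edges and density on [lam_minus, lam_plus] (Equation (1)).
lam_minus = sigma2 * (1 - np.sqrt(c)) ** 2
lam_plus = sigma2 * (1 + np.sqrt(c)) ** 2
lam = np.linspace(lam_minus + 1e-9, lam_plus - 1e-9, 500)
mp_density = np.sqrt((lam - lam_minus) * (lam_plus - lam)) / (2 * np.pi * sigma2 * c * lam)
# mp_density can be overlaid on a histogram of eigs, as in Figures 2 and 3.

print("largest eigenvalue :", eigs[-1], "  MP upper edge:", lam_plus)
print("smallest eigenvalue:", eigs[0], "  MP lower edge:", lam_minus)
```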
is the alternative hypothesis (H1) data-space. Following the above expression, the log-likelihood ratio test $\Lambda(\mathbf{y}_N)$ and the binary classification threshold $\tau'$ are given by
$$\Lambda(\mathbf{y}_N) = \frac{\mathbf{y}_N^T \boldsymbol{\Phi} \left(\boldsymbol{\Phi}^T \boldsymbol{\Phi} + \mathrm{SNR}^{-1} \times \mathbf{I}\right)^{-1} \boldsymbol{\Phi}^T \mathbf{y}_N}{\sigma^2}, \qquad \tau' = -\log\det\left(\mathrm{SNR} \times \boldsymbol{\Phi}\boldsymbol{\Phi}^T + \mathbf{I}_N\right)$$
where $\det(\cdot)$ and $\log(\cdot)$ are respectively the determinant and the natural logarithm.
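A minimal sketch of this binary test, assuming the zero-mean Gaussian model with $\boldsymbol{\Sigma}_0 = \sigma^2\mathbf{I}$, $\boldsymbol{\Sigma}_1 = \sigma^2(\mathrm{SNR}\times\boldsymbol{\Phi}\boldsymbol{\Phi}^T + \mathbf{I})$ and equal priors: the log-likelihood ratio is computed directly from the two covariance matrices. All dimensions and values below are arbitrary illustrative choices, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N_dim, R, sigma2, snr = 60, 10, 1.0, 2.0

Phi = rng.normal(scale=1/np.sqrt(N_dim), size=(N_dim, R))     # illustrative structured-like factor
Sigma0 = sigma2 * np.eye(N_dim)                                # H0: noise only
Sigma1 = sigma2 * (snr * Phi @ Phi.T + np.eye(N_dim))          # H1: signal plus noise

def log_gauss(y, Sigma):
    """Log-density of a zero-mean Gaussian, up to the common (2*pi)^(-N/2) factor."""
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (logdet + y @ np.linalg.solve(Sigma, y))

# Draw one observation under H1 and apply the log-likelihood ratio test (equal priors).
s = rng.normal(scale=np.sqrt(snr * sigma2), size=R)
y = Phi @ s + rng.normal(scale=np.sqrt(sigma2), size=N_dim)

llr = log_gauss(y, Sigma1) - log_gauss(y, Sigma0)
print("decide H1" if llr > 0 else "decide H0")
```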
$$\mathbb{E}_{\mathbf{y}_N|\hat{\mathcal{H}}}\left[\Lambda(\mathbf{y}_N)\right] = \int_{\mathcal{X}} p(\mathbf{y}_N|\hat{\mathcal{H}})\log\frac{p_1(\mathbf{y}_N)}{p_0(\mathbf{y}_N)}\, d\mathbf{y}_N = \mathrm{KL}(\hat{\mathcal{H}}\|\mathcal{H}_0) - \mathrm{KL}(\hat{\mathcal{H}}\|\mathcal{H}_1) = \frac{1}{\sigma^2}\mathrm{Tr}\left[\left(\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \mathrm{SNR}^{-1}\times\mathbf{I}\right)^{-1}\boldsymbol{\Phi}^T\boldsymbol{\Sigma}\boldsymbol{\Phi}\right]$$
where
$$\mathrm{KL}(\hat{\mathcal{H}}\|\mathcal{H}_i) = \int_{\mathcal{X}} p(\mathbf{y}_N|\hat{\mathcal{H}})\log\frac{p(\mathbf{y}_N|\hat{\mathcal{H}})}{p_i(\mathbf{y}_N)}\, d\mathbf{y}_N$$
is the Kullback-Leibler Divergence (KLD) [10]. The expected log-likelihood ratio test admits a simple geometric characterization based on the difference of two KLDs [8]. However, it is often difficult to evaluate the performance of the test via the minimal Bayes' error probability $P_e^{(N)}$, since its expression cannot be determined analytically in closed form [3,8].
$$\Pr(\mathrm{Error}|\mathbf{y}_N) = \frac{1}{2}\min\{P_{1,0}, P_{0,1}\}$$
3.3. CUB
According to [24], the relation between the Chernoff Upper Bound and the (average) minimal Bayes' error probability $P_e^{(N)} = \mathbb{E}\,\Pr(\mathrm{Error}|\mathbf{y}_N)$ is given by
$$P_e^{(N)} \leq \frac{1}{2}\times\exp[-\tilde{\mu}_N(s)] \qquad (5)$$
in which $M_X(t) = \mathbb{E}\exp[t \times X]$ is the moment generating function (mgf) of variable X. The error exponent, denoted by $\tilde{\mu}(s)$, is given by the Chernoff information, which is an asymptotic characterization of the exponential decay of the minimal Bayes' error probability. The error exponent is derived thanks to Stein's lemma according to [13]
$$-\lim_{N\to\infty}\frac{\log P_e^{(N)}}{N} = \lim_{N\to\infty}\frac{\tilde{\mu}_N(s)}{N} \overset{\mathrm{def.}}{=} \tilde{\mu}(s).$$
As the parameter s ∈ (0, 1) is free, the CUB can be tightened by optimizing over this parameter:
Finally, using Equations (5) and (7), the Chernoff Upper Bound (CUB) is obtained. Instead of solving Equation (7), the Bhattacharyya Upper Bound (BUB) is calculated from Equation (5) by fixing s = 1/2. Therefore, we have the following ordering relation:
$$P_e^{(N)} \leq \frac{1}{2}\times\exp[-\tilde{\mu}_N(s^\star)] \leq \frac{1}{2}\times\exp[-\tilde{\mu}_N(1/2)].$$
Lemma 1. The log-moment generating function given by Equation (6) for the test of Equation (4) is given by
$$\tilde{\mu}_N(s) = -\frac{1-s}{2}\log\det\left(\mathrm{SNR}\times\boldsymbol{\Phi}\boldsymbol{\Phi}^T + \mathbf{I}\right) + \frac{1}{2}\log\det\left(\mathrm{SNR}\times(1-s)\boldsymbol{\Phi}\boldsymbol{\Phi}^T + \mathbf{I}\right). \qquad (8)$$
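As an illustration (a sketch under the same Gaussian assumptions, with arbitrary sizes and SNR), $\tilde{\mu}_N(s)$ of Lemma 1 can be evaluated directly and maximized over s to obtain the tightest CUB, which is then compared with the BUB at s = 1/2; this reproduces the ordering relation displayed above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
N_dim, R, snr = 200, 80, 5.0                   # illustrative values
Phi = rng.normal(scale=1/np.sqrt(N_dim), size=(N_dim, R))
G = snr * Phi @ Phi.T                          # SNR * Phi Phi^T

def mu_tilde(s):
    """s-divergence of Lemma 1 (Equation (8))."""
    _, ld_full = np.linalg.slogdet(G + np.eye(N_dim))
    _, ld_s = np.linalg.slogdet((1 - s) * G + np.eye(N_dim))
    return -(1 - s) / 2 * ld_full + 0.5 * ld_s

# Tightest Chernoff Upper Bound: maximize mu_tilde over s in (0, 1).
res = minimize_scalar(lambda s: -mu_tilde(s), bounds=(1e-6, 1 - 1e-6), method="bounded")
s_star = res.x

cub = 0.5 * np.exp(-mu_tilde(s_star))          # Chernoff Upper Bound at s*
bub = 0.5 * np.exp(-mu_tilde(0.5))             # Bhattacharyya Upper Bound (s = 1/2)
print(f"s* = {s_star:.3f},  CUB = {cub:.3e} <= BUB = {bub:.3e}")
```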
From now on, to simplify the presentation and the numerical results later on, we denote by µ_N(s) and µ(s), for all s ∈ [0, 1], the opposites of the log-moment generating function and of its limit, respectively.
Remark 1. The functions µ N (s), µ(s) are negative, since the s-divergence µ̃ N (s) is positive for all s ∈ [0, 1].
$$\mu_N(s) = \frac{1-s}{2}\log\det\left[\boldsymbol{\Sigma}(\delta\mathrm{SNR})\right] - \frac{1}{2}\log\det\left[\boldsymbol{\Sigma}(\delta\mathrm{SNR}\times(1-s))\right]$$
Lemma 2. The s-divergence in the small deviation regime can be approximated according to
$$J_F(x) = \frac{1}{2}\mathrm{Tr}\left(\left(\mathbf{I} + x\times\boldsymbol{\Phi}\boldsymbol{\Phi}^T\right)^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^T\left(\mathbf{I} + x\times\boldsymbol{\Phi}\boldsymbol{\Phi}^T\right)^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^T\right).$$
$$s^\star \overset{\mathrm{SNR}\gg 1}{\approx} 1 - \frac{1}{\log\mathrm{SNR} + \frac{1}{K}\sum_{n=1}^{K}\log\lambda_n}. \qquad (9)$$
$$\mathcal{Y} = \mathcal{X} + \mathcal{N} \qquad (10)$$
where $\mathcal{N}$ is the noise tensor whose entries are assumed to be centered i.i.d. Gaussian, i.e., $[\mathcal{N}]_{n_1,\ldots,n_Q} \sim \mathcal{N}(0, \sigma^2)$, and tensor $\mathcal{X}$ follows either the CPD or the TKD given in Sections 2.1.2 and 2.1.3, respectively. The vectorization of Equation (10) is given by
$$\mathbf{y}_N = \mathrm{vec}(\mathbf{Y}_{(1)}) = \mathbf{x} + \mathbf{n} \qquad (11)$$
where $\mathbf{n} = \mathrm{vec}(\mathbf{N}_{(1)})$ and $\mathbf{x} = \mathrm{vec}(\mathbf{X}_{(1)})$. Note that $\mathbf{Y}_{(1)}$, $\mathbf{N}_{(1)}$ and $\mathbf{X}_{(1)}$ are respectively the first unfolding matrices, given by Definition 4, of tensors $\mathcal{Y}$, $\mathcal{N}$ and $\mathcal{X}$. Two cases are considered:
1. When tensor $\mathcal{X}$ follows a Q-order CPD of rank R, we have
$$\mathbf{x} = \boldsymbol{\Phi}_\odot\, \mathbf{s}$$
where $\boldsymbol{\Phi}_\odot = \boldsymbol{\Phi}^{(Q)} \odot \ldots \odot \boldsymbol{\Phi}^{(1)}$ is an $N \times R$ structured matrix and $\mathbf{s} = \left[s_1 \ldots s_R\right]^T$ where $s_r \sim \mathcal{N}(0, \sigma_s^2)$, i.i.d., and $N = N_1 \cdots N_Q$.
2. When tensor $\mathcal{X}$ follows a Q-order TKD of multilinear rank $\{M_1, \ldots, M_Q\}$, we have
$$\mathbf{x} = \mathrm{vec}\left(\boldsymbol{\Phi}^{(1)} \mathbf{S}_{(1)} \left(\boldsymbol{\Phi}^{(Q)} \otimes \ldots \otimes \boldsymbol{\Phi}^{(2)}\right)^T\right) = \boldsymbol{\Phi}_\otimes\, \mathrm{vec}(\mathcal{S})$$
Result 1. In the asymptotic regime where $N_1, \ldots, N_Q$ converge towards $+\infty$ at the same rate and where $R \to +\infty$ in such a way that $c_R = \frac{R}{N}$ converges towards a finite constant $c > 0$, it holds that
Remark 2. In [49], the Central Limit Theorem (CLT) for the linear eigenvalue statistics of the tensor version of the sample covariance matrix of type $\boldsymbol{\Phi}_\odot(\boldsymbol{\Phi}_\odot)^T$ is established for $\boldsymbol{\Phi}_\odot = \boldsymbol{\Phi}^{(2)} \odot \boldsymbol{\Phi}^{(1)}$, i.e., the tensor order is Q = 2.
Result 2. In the small SNR scenario, the Fisher information for the CPD is given as
$$\mu\left(\frac{1}{2}\right) \overset{\mathrm{SNR}\ll 1}{\approx} -\frac{(\mathrm{SNR})^2}{16}\times c(1+c).$$
$$\frac{J_F(0)}{N} = \frac{1}{2}\,\frac{R}{N}\,\frac{1}{R}\,\mathrm{Tr}\left[\left(\boldsymbol{\Phi}_\odot(\boldsymbol{\Phi}_\odot)^T\right)^2\right]$$
and that
$$\frac{1}{R}\,\mathrm{Tr}\left[\left(\boldsymbol{\Phi}_\odot(\boldsymbol{\Phi}_\odot)^T\right)^2\right]$$
converges a.s. towards the second moment of the Marchenko-Pastur distribution, which is $1 + c$ (see for instance [43]).
Note that $\mu\left(\frac{1}{2}\right)$ is the error exponent related to the Bhattacharyya divergence.
$$s^\star \overset{\mathrm{SNR}\gg 1}{\approx} 1 - \frac{1}{\log\mathrm{SNR} - 1 - \frac{1-c}{c}\log(1-c)}. \qquad (14)$$
$$\frac{1}{K}\sum_{n=1}^{K}\log(\lambda_n) \longrightarrow \int_{0}^{+\infty}\log(\lambda)\, d\nu_c(\lambda) = -1 - \frac{1-c}{c}\log(1-c).$$
The last equality can be obtained as in [50]. Using Lemma 3, we get immediately Equation (14).
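The above limit of the average log-eigenvalue can also be checked numerically. The sketch below uses an unstructured Gaussian matrix for simplicity (by [53], the Khatri-Rao structured case discussed in the Appendix has the same limiting spectrum); the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
N_dim, c = 4000, 0.3
K = int(c * N_dim)

# K columns of length N with covariance I/N: Gram matrix spectrum ~ Marchenko-Pastur, ratio c.
Phi = rng.normal(scale=1/np.sqrt(N_dim), size=(N_dim, K))
lam = np.linalg.eigvalsh(Phi.T @ Phi)        # the K nonzero eigenvalues

empirical = np.mean(np.log(lam))
theory = -1 - (1 - c) / c * np.log(1 - c)
print(empirical, theory)                      # close for large N
```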
Remark 3. It is interesting to note that for $c \to 0$ or $1$, the optimal s-value follows the same approximated relation given by
$$s^\star \overset{\mathrm{SNR}\gg 1}{\approx} 1 - \frac{1}{\log\mathrm{SNR}}$$
since
$$\frac{1-c}{c}\log(1-c) \overset{c\to 1}{\longrightarrow} 0, \qquad \text{and} \qquad \frac{1-c}{c}\log(1-c) \overset{c\to 0}{\longrightarrow} -1.$$
Using Equation (14) and the condition $\mathrm{SNR} \gg \exp[1]$, the desired result is proved.
Result 4. Under this regime, the error exponent can be approximated as follows:
$$\mu(s) \overset{c\ll 1}{\approx} \frac{c}{2}\left[(1-s)\log(1+\mathrm{SNR}) - \log(1+(1-s)\mathrm{SNR})\right].$$
It is easy to notice that the second-order derivative of µ(s) is strictly positive. Therefore, µ(s) is a strictly convex function over the interval (0, 1). As a consequence, µ(s) admits at most one global minimum. We denote by s⋆ the global minimizer, obtained by zeroing the first-order derivative of the error exponent. This optimal value is expressed as
$$s^\star \overset{c\ll 1}{\approx} 1 + \frac{1}{\mathrm{SNR}} - \frac{1}{\log(1+\mathrm{SNR})}. \qquad (15)$$
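As a quick consistency check (not part of the original experiments), the closed-form value of Equation (15) can be compared with the numerical minimizer of the µ(s) of Result 4 for a small c and a few arbitrary SNR values.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def mu(s, snr, c):
    """Error exponent of Result 4 (small-c regime)."""
    return (c / 2) * ((1 - s) * np.log(1 + snr) - np.log(1 + (1 - s) * snr))

c = 0.05
for snr_db in (10, 20, 30, 40):
    snr = 10 ** (snr_db / 10)
    s_num = minimize_scalar(mu, bounds=(1e-6, 1 - 1e-6), args=(snr, c), method="bounded").x
    s_ana = 1 + 1 / snr - 1 / np.log(1 + snr)        # Equation (15)
    print(f"SNR = {snr_db} dB:  numerical s* = {s_num:.4f},  Equation (15): {s_ana:.4f}")
```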
• At low SNR, µ(s⋆), the error exponent associated with the tightest CUB, coincides with the error exponent associated with the BUB. To see this, when $c \ll 1$, we derive the second-order approximation of the optimal value s⋆ in Equation (15):
$$s^\star \approx 1 + \frac{1}{\mathrm{SNR}}\left[1 - \left(1 + \frac{\mathrm{SNR}}{2}\right)\right] = \frac{1}{2}.$$
Result 1 and the above approximation allow us to get the best error exponent at low SNR and $c \ll 1$,
$$\mu\left(\frac{1}{2}\right) \overset{\mathrm{SNR}\ll 1}{\approx} \frac{1}{4}\Psi_{c\ll 1}(\mathrm{SNR}) - \frac{1}{2}\Psi_{c\ll 1}\left(\frac{\mathrm{SNR}}{2}\right) = \frac{c}{2}\log\frac{\sqrt{1+\mathrm{SNR}}}{1+\frac{\mathrm{SNR}}{2}}.$$
• Contrarily, when SNR → ∞, s⋆ → 1. As a consequence, the optimal error exponent in this regime is not the BUB anymore. Assuming that $\frac{\log\mathrm{SNR}}{\mathrm{SNR}} \to 0$, Equation (15) in Result 4 provides the following approximation of the optimal error exponent for large SNR:
$$\mu(s^\star) \overset{\mathrm{SNR}\gg 1}{\approx} \frac{c}{2}\left(1 - \log\mathrm{SNR} + \log\log(1+\mathrm{SNR})\right).$$
Result 5. In the asymptotic regime where $M_q < N_q$, $1 \leq q \leq Q$, and $M_q, N_q$ converge towards $+\infty$ at the same rate such that $\frac{M_q}{N_q} \to c_q$, where $0 < c_q < 1$, it holds
$$\frac{\mu_N(s)}{N} \overset{\text{a.s.}}{\longrightarrow} \mu(s) = c_1\cdots c_Q\left[\frac{1-s}{2}\int_0^{+\infty}\!\!\!\cdots\int_0^{+\infty}\log(1+\mathrm{SNR}\times\lambda_1\cdots\lambda_Q)\,d\nu_{c_1}(\lambda_1)\cdots d\nu_{c_Q}(\lambda_Q) - \frac{1}{2}\int_0^{+\infty}\!\!\!\cdots\int_0^{+\infty}\log(1+(1-s)\mathrm{SNR}\times\lambda_1\cdots\lambda_Q)\,d\nu_{c_1}(\lambda_1)\cdots d\nu_{c_Q}(\lambda_Q)\right] \qquad (16)$$
where $\nu_{c_q}$ are Marchenko-Pastur distributions of parameters $(c_q, 1)$ defined as in Equation (1).
Remark 4. We can notice that for Q = 1, Result 5 is similar to Result 1. However, when Q ≥ 2, the integrals in Equation (16) are not tractable in closed form. For instance, for Q = 2, consider the integral
$$\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}\log(1+\mathrm{SNR}\times\lambda_1\lambda_2)\,\nu_{c_1}(d\lambda_1)\,\nu_{c_2}(d\lambda_2) = \int_{\lambda^-_{c_1}}^{\lambda^+_{c_1}}\int_{\lambda^-_{c_2}}^{\lambda^+_{c_2}}\log(1+\mathrm{SNR}\times\lambda_1\lambda_2)\,\frac{\sqrt{(\lambda_1-\lambda^-_{c_1})(\lambda^+_{c_1}-\lambda_1)}}{2\pi c_1\lambda_1}\,\frac{\sqrt{(\lambda_2-\lambda^-_{c_2})(\lambda^+_{c_2}-\lambda_2)}}{2\pi c_2\lambda_2}\,d\lambda_1\, d\lambda_2$$
where $\lambda^\pm_{c_i} = (1\pm\sqrt{c_i})^2$, $i = 1, 2$. We can notice that this integral involves elliptic integrals (see e.g., [51]). As a consequence, it cannot be expressed in closed form. However, numerical computations can be exploited to solve efficiently the minimization problem of Equation (7).
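As a possible illustration of such a numerical computation (a sketch, not the scheme used for the figures of Section 5), the double integrals of Equation (16) for Q = 2 can be approximated by averaging over the empirical eigenvalues of large Gram matrices whose spectra approximate $\nu_{c_1}$ and $\nu_{c_2}$; the optimal s can then be located on a grid. The values of $c_1$, $c_2$ and SNR below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)

def mp_eigenvalues(c, n=1000):
    """Eigenvalues of a Gram matrix whose spectrum approximates the Marchenko-Pastur law nu_c."""
    m = int(c * n)
    X = rng.normal(scale=1/np.sqrt(n), size=(n, m))
    return np.linalg.eigvalsh(X.T @ X)

c1, c2, snr = 0.5, 0.6, 100.0
lam1, lam2 = mp_eigenvalues(c1), mp_eigenvalues(c2)
prod = np.outer(lam1, lam2).ravel()              # empirical samples of lambda1 * lambda2

# Double integrals of Equation (16), Q = 2, approximated by empirical averages.
I_full = np.mean(np.log(1 + snr * prod))         # does not depend on s

def mu(s):
    I_s = np.mean(np.log(1 + (1 - s) * snr * prod))
    return c1 * c2 * ((1 - s) / 2 * I_full - 0.5 * I_s)

s_grid = np.linspace(0.01, 0.99, 99)
s_star = s_grid[np.argmin([mu(s) for s in s_grid])]
print("numerical s* ≈", round(float(s_star), 3))
```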
$$s^\star \overset{\mathrm{SNR}\gg 1}{\approx} 1 - \frac{1}{\log\mathrm{SNR} - Q - \sum_{i=1}^{Q}\frac{1-c_i}{c_i}\log(1-c_i)}. \qquad (17)$$
Result 7. For a small SNR deviation, the Chernoff information for the TKD is given by
$$\mu\left(\frac{1}{2}\right) \overset{\delta\mathrm{SNR}\ll 1}{\approx} -\frac{(\delta\mathrm{SNR})^2}{16}\prod_{q=1}^{Q} c_q\times(1+c_q).$$
$$\frac{J_F(0)}{N} = \frac{1}{2}\,\frac{M}{N}\,\frac{1}{M}\,\mathrm{Tr}\left[\left(\boldsymbol{\Phi}_\otimes(\boldsymbol{\Phi}_\otimes)^T\right)^2\right] = \frac{1}{2}\,\frac{M}{N}\prod_{q=1}^{Q}\frac{\mathrm{Tr}\left[\left(\boldsymbol{\Phi}^{(q)}(\boldsymbol{\Phi}^{(q)})^T\right)^2\right]}{M_q}.$$
Each term in the product converges a.s. towards the second moment of the Marchenko-Pastur distribution $\nu_{c_q}$, which is $1 + c_q$, and $\frac{M}{N}$ converges to $\prod_{q=1}^{Q} c_q$. This proves the desired result.
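The factorization $\mathrm{Tr}[(\boldsymbol{\Phi}_\otimes(\boldsymbol{\Phi}_\otimes)^T)^2] = \prod_q \mathrm{Tr}[(\boldsymbol{\Phi}^{(q)}(\boldsymbol{\Phi}^{(q)})^T)^2]$ used above follows from the mixed-product property of the Kronecker product; the short sketch below (with arbitrary small dimensions) verifies it numerically.

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(5)
dims = [(6, 3), (5, 4), (7, 2)]                      # (N_q, M_q), illustrative values
factors = [rng.normal(size=d) for d in dims]

# The trace identity does not depend on the Kronecker ordering of the factors.
Phi_kron = reduce(np.kron, factors)                  # Phi_otimes
lhs = np.trace(np.linalg.matrix_power(Phi_kron @ Phi_kron.T, 2))
rhs = np.prod([np.trace(np.linalg.matrix_power(F @ F.T, 2)) for F in factors])
print(np.isclose(lhs, rhs))                          # True
```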
Remark 5. Contrary to Remark 3, it is interesting to note that for $c_1 = c_2 = \ldots = c_Q = c$ and $c \to 0$ or $1$, the optimal s-value follows different approximated relations, given by
$$s^\star \underset{c\to 0}{\overset{\mathrm{SNR}\gg 1}{\approx}} 1 - \frac{1}{\log\mathrm{SNR}}$$
$$s^\star \underset{c\to 1}{\overset{\mathrm{SNR}\gg 1}{\approx}} 1 - \frac{1}{\log\mathrm{SNR} - Q}$$
which depend on Q.
In practice, when c is close to 1, we have to carefully check whether Q is in the neighbourhood of log(SNR). As we can see, when $\log\mathrm{SNR} - Q < 0$ or $0 < \log\mathrm{SNR} - Q < 1$, following the above approximation, $s^\star \notin [0, 1]$.
5. Numerical Illustrations
In this section, we consider tensors of order Q = 3 with $N_1 = 10$, $N_2 = 20$, $N_3 = 30$, $R = 3000$ following a CPD, and $M_1 = 100$, $M_2 = 120$, $M_3 = 140$, $N_1 = N_2 = N_3 = 200$ for the TKD, respectively.
Figure 4. Canonical Polyadic Decomposition (CPD) scenario: Optimal s-parameter versus Signal to Noise Ratio (SNR) in dB.
Firstly, for the CPD model, in Figure 4 the parameter s⋆ is drawn with respect to the SNR in dB. The parameter s⋆ is obtained by three different methods. The first one is based on the brute-force/exhaustive computation of the CUB by minimizing the expression in Equation (8) with $\boldsymbol{\Phi} = \boldsymbol{\Phi}_\odot$. This approach has a very high computational cost, especially in our asymptotic regime (for a standard computer with an Intel Xeon E5-2630 2.3 GHz and 32 GB RAM, it requires 183 h to carry out 10,000 simulations). The second approach is based on the numerical optimization of the closed-form expression of µ(s) given in Result 4. In this scenario, the drawback in terms of computational cost is largely mitigated, since it consists of the minimization of a univariate regular function. Finally, under the hypothesis that the SNR is large, typically >30 dB, the optimal s-value, s⋆, is given by the analytic expression of Equation (15). We can check that the proposed semi-analytic and analytic expressions are in good agreement with the brute-force method, at a much lower computational cost.
Moreover, we compute the mean square relative error $\frac{1}{L}\sum_{l=1}^{L}\left(\frac{\hat{s}^\star_l - s^\star}{s^\star}\right)^2$, where L = 10,000 is the number of Monte Carlo runs. It turns out that the mean square relative errors are on average of order −40 dB. We can conclude that the estimator $\hat{s}^\star$ is a consistent estimator of s⋆.
In Figure 5, we draw various s-divergences: $\mu\left(\frac{1}{2}\right)$, $\mu(s^\star)$, $\frac{1}{N}\mu_N\left(\frac{1}{2}\right)$, $\frac{1}{N}\mu_N(\hat{s})$. We can observe the good agreement with the proposed theoretical results. The s-divergence obtained by fixing $s = \frac{1}{2}$ is accurate only at small SNR but degrades when SNR grows large.
In Figure 6, we fix SNR = 45 dB and draw s⋆ obtained by Equation (14) versus values of $c \in \{10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 0.25, 0.5, 0.75, 0.9, 0.99\}$, together with the expression obtained by Equation (15). The two curves approach each other as c goes to zero, as predicted by our theoretical analysis.
For the TKD scenario, we follow the same methodology as above for the CPD; Figures 7 and 8 all agree with the analysis provided in Section 4.3.
Figure 5. s-divergence µ(s) versus SNR in dB (CPD scenario).
Figure 6. CPD scenario: s⋆ vs. c, SNR = 45 dB. The two curves correspond to $s^\star = 1 - \frac{1}{\log(\mathrm{SNR}) - 1 - \frac{1-c}{c}\log(1-c)}$ (Equation (14)) and $s^\star = 1 + \frac{1}{\mathrm{SNR}} - \frac{1}{\log(1+\mathrm{SNR})}$ (Equation (15)).
Figure 7. Tucker Decomposition (TKD) scenario: Optimal s-parameter vs. SNR in dB.
Figure 8. s-divergence µ(s) versus SNR in dB (TKD scenario).
For the TKD scenario, the mean square relative error is on average of order −40 dB. So, we numerically verify the consistency of the estimator of the optimal s-value.
We can also notice that the convergence of $\frac{\mu_N(s)}{N}$ towards its deterministic equivalent µ(s) is faster in the TKD case than in the CPD case, since the dimension of matrix $\boldsymbol{\Phi}_\otimes$ is $(200 \times 200 \times 200) \times (100 \times 120 \times 140)$, i.e., $N = 200^3$, which is much larger than the dimension $6000 \times 3000$ of $\boldsymbol{\Phi}_\odot$ ($N = 6000$).
6. Conclusions
In this work, we derived and studied the limit performance in terms of minimal Bayes’ error
probability for the binary classification of high-dimensional random tensors using both the tools
of Information Geometry (IG) and of Random Matrix Theory (RMT). The main results on Chernoff Bounds and Fisher Information are illustrated by Monte Carlo simulations that corroborate our theoretical analysis.
For future work, we would like to study the rate of convergence and the fluctuation of the statistics $\frac{\mu_N(s)}{N}$ and $\hat{s}$.
Acknowledgments: The authors would like to thank Philippe Loubaton (UPEM, France) for the fruitful discussions.
This research was partially supported by Labex DigiCosme (project ANR-11-LABEX-0045-DIGICOSME) operated by
ANR (The French National Research Agency) as part of the program “Investissement d’Avenir” Idex Paris-Saclay
(ANR-11-IDEX-0003-02).
Author Contributions: Gia-Thuy Pham, Rémy Boyer and Frank Nielsen contributed to the research results
presented in this paper. Gia-Thuy Pham and Rémy Boyer performed the numerical experiments. All authors have
read and approved the final manuscript.
Conflicts of Interest: The authors declare no conflict of interest.
is given by [15]:
$$\tilde{\mu}_N(s) = \frac{1}{2}\log\frac{\det(s\boldsymbol{\Sigma}_0 + (1-s)\boldsymbol{\Sigma}_1)}{[\det\boldsymbol{\Sigma}_0]^s\,[\det\boldsymbol{\Sigma}_1]^{1-s}}. \qquad (A1)$$
Using the expressions of the covariance matrices $\boldsymbol{\Sigma}_0$ and $\boldsymbol{\Sigma}_1$, the logarithm of the numerator in Equation (A1) is given by
$$N\log\sigma^2 + \log\det\left(\mathrm{SNR}\times(1-s)\boldsymbol{\Phi}\boldsymbol{\Phi}^T + \mathbf{I}\right)$$
and the two log-terms from its denominator are $\log[\det\boldsymbol{\Sigma}_0]^s = sN\log\sigma^2$ and
$$\log[\det\boldsymbol{\Sigma}_1]^{1-s} = (1-s)\left[N\log\sigma^2 + \log\det\left(\mathrm{SNR}\times\boldsymbol{\Phi}\boldsymbol{\Phi}^T + \mathbf{I}\right)\right].$$
$$\mu_N(s) = \frac{1-s}{2}\log\det\left[\mathbf{I} + (\delta\mathrm{SNR})\times\boldsymbol{\Phi}\boldsymbol{\Phi}^T\right] - \frac{1}{2}\log\det\left[\mathbf{I} + \delta\mathrm{SNR}\times(1-s)\times\boldsymbol{\Phi}\boldsymbol{\Phi}^T\right]$$
Now, using Equation (8) and the following approximation:
$$\frac{1}{N}\log\det(\mathbf{I} + x\mathbf{A}) = \frac{1}{N}\mathrm{Tr}\log(\mathbf{I} + x\mathbf{A}) \approx x\times\frac{1}{N}\mathrm{Tr}\mathbf{A} - \frac{x^2}{2}\times\frac{1}{N}\mathrm{Tr}\mathbf{A}^2$$
we obtain
where the Fisher information for $\mathbf{y}|\delta\mathrm{SNR} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}(\delta\mathrm{SNR}))$ is given by [3]:
$$J_F(\delta\mathrm{SNR}) = -\mathbb{E}\left[\frac{\partial^2\log p(\mathbf{y}|\delta\mathrm{SNR})}{\partial(\delta\mathrm{SNR})^2}\right] = \frac{1}{2}\mathrm{Tr}\left\{\boldsymbol{\Sigma}(\delta\mathrm{SNR})^{-1}\,d\boldsymbol{\Sigma}(\delta\mathrm{SNR})\,\boldsymbol{\Sigma}(\delta\mathrm{SNR})^{-1}\,d\boldsymbol{\Sigma}(\delta\mathrm{SNR})\right\} = \frac{1}{2}\mathrm{Tr}\left(\left(\mathbf{I} + \delta\mathrm{SNR}\times\boldsymbol{\Phi}\boldsymbol{\Phi}^T\right)^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^T\left(\mathbf{I} + \delta\mathrm{SNR}\times\boldsymbol{\Phi}\boldsymbol{\Phi}^T\right)^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^T\right).$$
The second step is to derive a closed-form expression in the high SNR regime using the following approximation (see [52] for instance): $\left(x\times\boldsymbol{\Phi}\boldsymbol{\Phi}^T + \mathbf{I}\right)^{-1} \overset{x\gg 1}{\approx} \boldsymbol{\Pi}^\perp_{\boldsymbol{\Phi}} = \mathbf{I}_N - \boldsymbol{\Phi}\boldsymbol{\Phi}^\dagger$, where $\boldsymbol{\Pi}^\perp_{\boldsymbol{\Phi}}$ is an orthogonal projector and $\boldsymbol{\Phi}^\dagger$ denotes the Moore-Penrose pseudo-inverse of $\boldsymbol{\Phi}$. The matrix $\left[(1-s)\boldsymbol{\Sigma}_0^{-1} + s\boldsymbol{\Sigma}_1^{-1}\right]^{-1}$ is then given by
$$\left[(1-s)\boldsymbol{\Sigma}_0^{-1} + s\boldsymbol{\Sigma}_1^{-1}\right]^{-1} \overset{\mathrm{SNR}\gg 1}{\approx} \sigma^2\left[\mathbf{I}_N - s\mathbf{I}_N + s\boldsymbol{\Pi}^\perp_{\boldsymbol{\Phi}}\right]^{-1} = \sigma^2\left[\mathbf{I}_N - s\boldsymbol{\Phi}\boldsymbol{\Phi}^\dagger\right]^{-1}.$$
As $s\boldsymbol{\Phi}\boldsymbol{\Phi}^\dagger$ is a rank-K projector matrix scaled by the factor $s > 0$, its eigen-spectrum is given by $\{\underbrace{s, \ldots, s}_{K}, \underbrace{0, \ldots, 0}_{N-K}\}$. In addition, as the rank-N identity matrix and the scaled projector $s\boldsymbol{\Phi}\boldsymbol{\Phi}^\dagger$ can be diagonalized in the same orthonormal basis, the n-th eigenvalue of the inverse of matrix $\mathbf{I}_N - s\boldsymbol{\Phi}\boldsymbol{\Phi}^\dagger$ is given by
$$\lambda_n\left\{\left(\mathbf{I}_N - s\boldsymbol{\Phi}\boldsymbol{\Phi}^\dagger\right)^{-1}\right\} = \frac{1}{\lambda_n\{\mathbf{I}_N\} - s\,\lambda_n\{\boldsymbol{\Phi}\boldsymbol{\Phi}^\dagger\}} = \begin{cases}\frac{1}{1-s}, & 1\leq n\leq K,\\ 1, & K+1\leq n\leq N\end{cases}$$
$$\log\det\left[\mathbf{I}_N - s\boldsymbol{\Phi}\boldsymbol{\Phi}^\dagger\right]^{-1} = \log\prod_{n=1}^{N}\lambda_n\left\{\left(\mathbf{I}_N - s\boldsymbol{\Phi}\boldsymbol{\Phi}^\dagger\right)^{-1}\right\} = -K\log(1-s).$$
In addition, we have
$$\log\det\left(\mathrm{SNR}\times\boldsymbol{\Phi}\boldsymbol{\Phi}^T + \mathbf{I}\right) \overset{\mathrm{SNR}\gg 1}{\approx} \mathrm{Tr}\log\left(\mathrm{SNR}\times\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right) = K\times\log\mathrm{SNR} + \sum_{n=1}^{K}\log\lambda_n$$
Finally, to obtain s⋆ in Equation (9), we solve $\frac{\partial\mu_s(\mathrm{SNR})}{\partial s} = 0$.
$$t_c(z) = \left(-z + \frac{c}{1 + t_c(z)}\right)^{-1}.$$
When $z \in \mathbb{R}^{-*}$, i.e., $z = -\rho$ with $\rho > 0$, it is well known that $t_c(-\rho)$ is given by
$$t_c(-\rho) = \frac{2}{\rho - (1-c) + \sqrt{(\rho + \lambda^-_c)(\rho + \lambda^+_c)}}. \qquad (A3)$$
It was established for the first time in [45] that if $\mathbf{X}$ represents a $K \times P$ random matrix with zero-mean and $\frac{1}{K}$-variance i.i.d. entries, and if $(\lambda_k)_{k=1,\ldots,K}$ represent the eigenvalues of $\mathbf{X}\mathbf{X}^T$ arranged in decreasing order, then $\frac{1}{K}\sum_{k=1}^{K}\delta(\lambda - \lambda_k)$, the empirical eigenvalue distribution of $\mathbf{X}\mathbf{X}^T$, converges weakly almost surely towards $\nu_c$ under the regime $K \to +\infty$, $P \to +\infty$, $\frac{P}{K} \to c$. In addition, we have the following property for each continuous function $f(\lambda)$:
$$\frac{1}{K}\sum_{k=1}^{K} f(\lambda_k) \overset{\text{a.s.}}{\longrightarrow} \int_{\mathbb{R}^+} f(\lambda)\,\nu_c(d\lambda). \qquad (A4)$$
Practically, when K and P are large enough, the histogram of the eigenvalues of each realization of $\mathbf{X}\mathbf{X}^T$ accumulates around the graph of the probability density of $\nu_c$.
The columns $(\phi_r)_{r=1,\ldots,R}$ of $\boldsymbol{\Phi}_\odot$ are the vectors $(\phi_r^{(Q)} \otimes \ldots \otimes \phi_r^{(1)})_{r=1,\ldots,R}$, which are mutually independent, identically distributed, and satisfy $\mathbb{E}(\phi_r\phi_r^T) = \frac{\mathbf{I}_N}{N}$. However, since the components of each column $\phi_r$ are not independent, the entries of $\boldsymbol{\Phi}_\odot$ are not mutually independent. Applying the results of [53] (see also [54]), we can establish that the empirical eigenvalue distribution of $\boldsymbol{\Phi}_\odot(\boldsymbol{\Phi}_\odot)^T$ still converges almost surely towards $\nu_c$ under the asymptotic regime $\frac{R}{N} \to c$. For the continuous function $f(\lambda) = \log(1 + \lambda/\rho)$, we apply Equation (A4); since $\int_{\mathbb{R}^+}\log(1 + \lambda/\rho)\,\nu_c(d\lambda)$ can be expressed in terms of $t_c(-\rho)$ given by Equation (A3) (see e.g., [50]), we finish the proof.
$$\Psi_{c\ll 1}(x) \approx c\times\frac{x}{1+x} + c\log(1+x) - c\,\frac{x}{1+x} = c\log(1+x).$$
$$\frac{1}{N}\log\det\left(\mathrm{SNR}\times\boldsymbol{\Phi}_\otimes(\boldsymbol{\Phi}_\otimes)^T + \mathbf{I}\right) = \frac{1}{N}\sum_{n_1=1}^{N_1}\ldots\sum_{n_Q=1}^{N_Q}\log\left(\mathrm{SNR}\times\lambda_{n_1}^{(1)}\cdots\lambda_{n_Q}^{(Q)} + 1\right) = \frac{M}{N}\,\frac{1}{M}\sum_{n_1=1}^{M_1}\ldots\sum_{n_Q=1}^{M_Q}\log\left(\mathrm{SNR}\times\lambda_{n_1}^{(1)}\cdots\lambda_{n_Q}^{(Q)} + 1\right)$$
and that
$$\frac{1}{M}\sum_{n_1=1}^{M_1}\ldots\sum_{n_Q=1}^{M_Q}\log\left(\mathrm{SNR}\times\lambda_{n_1}^{(1)}\cdots\lambda_{n_Q}^{(Q)} + 1\right) \overset{\text{a.s.}}{\longrightarrow} \int_0^{+\infty}\ldots\int_0^{+\infty}\log(1+\mathrm{SNR}\times\lambda_1\ldots\lambda_Q)\,d\nu_{c_1}(\lambda_1)\ldots d\nu_{c_Q}(\lambda_Q)$$
References
1. Besson, O.; Scharf, L.L. CFAR matched direction detector. IEEE Trans. Signal Process. 2006, 54, 2840–2844.
2. Bianchi, P.; Debbah, M.; Maida, M.; Najim, J. Performance of Statistical Tests for Source Detection using
Random Matrix Theory. IEEE Trans. Inf. Theory 2011, 57, 2400–2419.
3. Kay, S.M. Fundamentals of Statistical Signal Processing, Volume II: Detection Theory; PTR Prentice-Hall:
Englewood Cliffs, NJ, USA, 1993.
4. Loubaton, P.; Vallet, P. Almost Sure Localization of the Eigenvalues in a Gaussian Information Plus Noise
Model. Application to the Spiked Models. Electron. J. Probab. 2011, 16, 1934–1959.
5. Mestre, X. Improved Estimation of Eigenvalues and Eigenvectors of Covariance Matrices Using Their Sample
Estimates. IEEE Trans. Inf. Theory 2008, 54, 5113–5129.
6. Baik, J.; Silverstein, J. Eigenvalues of large sample covariance matrices of spiked population models.
J. Multivar. Anal. 2006, 97, 1382–1408.
7. Silverstein, J.W.; Combettes, P.L. Signal detection via spectral theory of large dimensional random matrices.
IEEE Trans. Signal Process. 1992, 40, 2100–2105.
8. Cheng, Y.; Hua, X.; Wang, H.; Qin, Y.; Li, X. The Geometry of Signal Detection with Applications to Radar
Signal Processing. Entropy 2016, 18, 381.
9. Ali, S.M.; Silvey, S.D. A General Class of Coefficients of Divergence of One Distribution from Another.
J. R. Stat. Soc. Ser. B (Methodol.) 1966, 28, 131–142.
10. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012.
11. Kailath, T. The Divergence and Bhattacharyya Distance Measures in Signal Selection. IEEE Trans. Commun. Technol.
1967, 15, 52–60.
12. Nielsen, F. Hypothesis Testing, Information Divergence and Computational Geometry; Geometric Science of
Information; Springer: Berlin, Germany, 2013; pp. 241–248.
13. Sinanovic, S.; Johnson, D.H. Toward a theory of information processing. Signal Process. 2007, 87, 1326–1344.
14. Chernoff, H. A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the sum of Observations.
Ann. Math. Stat. 1952, 23, 493–507.
15. Nielsen, F. Chernoff information of exponential families. arXiv 2011, arXiv:1102.2684.
16. Chepuri, S.P.; Leus, G. Sparse sensing for distributed Gaussian detection. In Proceedings of the 2015
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia,
19–24 April 2015.
17. Tang, G.; Nehorai, A. Performance Analysis for Sparse Support Recovery. IEEE Trans. Inf. Theory 2010,
56, 1383–1399.
18. Lee, Y.; Sung, Y. Generalized Chernoff Information for Mismatched Bayesian Detection and Its Application
to Energy Detection. IEEE Signal Process. Lett. 2012, 19, 753–756.
19. Grossi, E.; Lops, M. Space-time code design for MIMO detection based on Kullback-Leibler divergence.
IEEE Trans. Inf. Theory 2012, 58, 3989–4004.
20. Sen, S.; Nehorai, A. Sparsity-Based Multi-Target Tracking Using OFDM Radar. IEEE Trans. Signal Process.
2011, 59, 1902–1906.
21. Boyer, R.; Delpha, C. Relative-entropy based beamforming for secret key transmission. In Proceedings of
the 2012 IEEE 7th Sensor Array and Multichannel Signal Processing Workshop (SAM), Hoboken, NJ, USA,
17–20 June 2012.
22. Tran, N.D.; Boyer, R.; Marcos, S.; Larzabal, P. Angular resolution limit for array processing: Estimation
and information theory approaches. In Proceedings of the 20th European Signal Processing Conference
(EUSIPCO), Bucharest, Romania, 27–31 August 2012.
23. Katz, G.; Piantanida, P.; Couillet, R.; Debbah, M. Joint estimation and detection against independence.
In Proceedings of the Annual Conference on Communication Control and Computing (Allerton), Monticello,
IL, USA, 30 September–3 October 2014; pp. 1220–1227.
24. Nielsen, F. An information-geometric characterization of Chernoff information. IEEE Signal Process. Lett.
2013, 20, 269–272.
25. Cichocki, A.; Mandic, D.; De Lathauwer, L.; Zhou, G.; Zhao, Q.; Caiafa, C.; Phan, H.A. Tensor decompositions
for signal processing applications: From two-way to multiway component analysis. IEEE Signal Process. Mag.
2015, 32, 145–163.
26. Comon, P. Tensors: A brief introduction. IEEE Signal Process. Mag. 2014, 31, 44–53.
27. De Lathauwer, L.; Moor, B.D.; Vandewalle, J. A Multilinear Singular Value Decomposition. SIAM J. Matrix
Anal. Appl. 2000, 21, 1253–1278.
28. Tucker, L.R. Some mathematical notes on three-mode factor analysis. Psychometrika 1966, 31, 279–311.
29. Comon, P.; Berge, J.T.; De Lathauwer, L.; Castaing, J. Generic and Typical Ranks of Multi-Way Arrays.
Linear Algebra Appl. 2009, 430, 2997–3007.
30. De Lathauwer, L. A survey of tensor methods. In Proceedings of the IEEE International Symposium on
Circuits and Systems, ISCAS 2009, Taipei, Taiwan, 24–27 May 2009.
31. Comon, P.; Luciani, X.; De Almeida, A.L.F. Tensor decompositions, alternating least squares and other tales.
J. Chemom. 2009, 23, 393–405.
32. Goulart, J.H.D.M.; Boizard, M.; Boyer, R.; Favier, G.; Comon, P. Tensor CP Decomposition with Structured
Factor Matrices: Algorithms and Performance. IEEE J. Sel. Top. Signal Process. 2016, 10, 757–769.
33. Eckart, C.; Young, G. The approximation of one matrix by another of lower rank. Psychometrika 1936, 1, 211–218.
34. Badeau, R.; Richard, G.; David, B. Fast and stable YAST algorithm for principal and minor subspace tracking.
IEEE Trans. Signal Process. 2008, 56, 3437–3446.
35. Boyer, R.; Badeau, R. Adaptive multilinear SVD for structured tensors . In Proceedings of the 2006 IEEE
International Conference on Acoustics, Speech, and Signal Processing (ICASSP’06), Toulouse, France,
14–19 May 2006.
36. Boizard, M.; Ginolhac, G.; Pascal, F.; Forster, P. Low-rank filter and detector for multidimensional data based
on an alternative unfolding HOSVD: Application to polarimetric STAP. EURASIP J. Adv. Signal Process. 2014,
2014, 119.
37. Bouleux, G.; Boyer, R. Sparse-Based Estimation Performance for Partially Known Overcomplete Large-Systems.
Signal Process. 2017, 139, 70–74.
38. Boyer, R.; Couillet, R.; Fleury, B.-H.; Larzabal, P. Large-System Estimation Performance in Noisy Compressed
Sensing with Random Support—A Bayesian Analysis. IEEE Trans. Signal Process. 2016, 64, 5525–5535.
39. Ollier, V.; Boyer, R.; El Korso, M.N.; Larzabal, P. Bayesian Lower Bounds for Dense or Sparse (Outlier) Noise
in the RMT Framework. In Proceedings of the 2016 IEEE Sensor Array and Multichannel Signal Processing
Workshop (SAM 16), Rio de Janerio, Brazil, 10–13 July 2016.
40. Wishart, J. The generalized product moment distribution in samples. Biometrika 1928, 20A, 32–52.
41. Wigner, E.P. On the statistical distribution of the widths and spacings of nuclear resonance levels. Proc. Camb.
Philos. Soc. 1951, 47, 790–798.
42. Wigner, E.P. Characteristic vectors of bordered matrices with infinite dimensions. Ann. Math. 1955, 62, 548–564.
43. Bai, Z.D.; Silverstein, J.W. Spectral Analysis of Large Dimensional Random Matrices, 2nd ed.; Springer Series in
Statistics; Springer: Berlin, Germany, 2010.
44. Girko, V.L. Theory of Random Determinants; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1990.
45. Marchenko, V.A.; Pastur, L.A. Distribution of eigenvalues for some sets of random matrices. Math. Sb. (N.S.)
1967, 72, 507–536.
46. Voiculescu, D. Limit laws for random matrices and free products. Invent. Math. 1991, 104, 201–220.
47. Boyer, R.; Nielsen, F. Information Geometry Metric for Random Signal Detection in Large Random Sensing
Systems. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017.
48. Boyer, R.; Loubaton, P. Large deviation analysis of the CPD detection problem based on random tensor
theory. In Proceedings of the 2017 25th European Association for Signal Processing (EUSIPCO), Kos, Greece,
28 August–2 September 2017.
49. Lytova, A. Central Limit Theorem for Linear Eigenvalue Statistics for a Tensor Product Version of Sample
Covariance Matrices. J. Theor. Prob. 2017, 1–34.
50. Tulino, A.M.; Verdu, S. Random Matrix Theory and Wireless Communications; Now Publishers Inc.: Hanover,
MA, USA, 2004; Volume 1.
51. Milne-Thomson, L.M. “Elliptic Integrals” (Chapter 17). In Handbook of Mathematical Functions with Formulas,
Graphs, and Mathematical Tables, 9th printing; Abramowitz, M., Stegun, I.A., Eds.; Dover Publications:
New York, NY, USA, 1972; pp. 587–607.
52. Behrens, R.T.; Scharf, L.L. Signal processing applications of oblique projection operators. IEEE Trans.
Signal Process. 1994, 42, 1413–1424.
53. Pajor, A.; Pastur, L.A. On the Limiting Empirical Measure of the sum of rank one matrices with log-concave
distribution. Stud. Math. 2009, 195, 11–29.
54. Ambainis, A.; Harrow, A.W.; Hastings, M.B. Random matrix theory: Extending random matrix theory to
mixtures of random product states. Commun. Math. Phys. 2012, 310, 25–74.
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).