
ORIGINAL CONTRIBUTION

Neural Networks and Principal Component Analysis:
Learning from Examples Without Local Minima

PIERRE BALDI AND KURT HORNIK*
University of California, San Diego

*Permanent address: Institut für Statistik und Wahrscheinlichkeitstheorie, Technische Universität Wien, Wiedner Hauptstr. 8-10/1107, A-1040 Wien, Austria.
The final stages of this work were supported by NSF grant DMS-8800323 to P.B. Requests for reprints should be sent to Pierre Baldi, JPL 198-330, California Institute of Technology, Pasadena, CA 91109.

(Received 18 May 1988; revised and accepted 16 August 1988)

Abstract: We consider the problem of learning from examples in layered linear feed-forward neural networks using optimization methods, such as back propagation, with respect to the usual quadratic error function E of the connection weights. Our main result is a complete description of the landscape attached to E in terms of principal component analysis. We show that E has a unique minimum corresponding to the projection onto the subspace generated by the first principal vectors of a covariance matrix associated with the training patterns. All the additional critical points of E are saddle points (corresponding to projections onto subspaces generated by higher order vectors). The auto-associative case is examined in detail. Extensions and implications for the learning algorithms are discussed.

Keywords: Neural networks, Principal component analysis, Learning, Back propagation.

1. INTRODUCTION

Neural networks can be viewed as circuits of highly interconnected units with modifiable interconnection weights. They can be classified, for instance, according to their architecture, algorithm for adjusting the weights, and the type of units used in the circuit. We shall assume that the reader is familiar with the basic concepts of the field; general reviews, complements, and references can be found in Rumelhart, McClelland, and the PDP Research Group (1986a), Lippmann (1987), and Grossberg (1988).

The network architecture considered here is of the type often described in Rumelhart, Hinton, and Williams (1986b), namely layered feed-forward networks with one layer of input units, one layer of output units, and one or several layers of hidden units. We assume that there are T input patterns x_t (1 ≤ t ≤ T) and T corresponding target output patterns y_t which are used to train the network. For this purpose, a quadratic error function is defined as usual to be E = Σ_t ||y_t - F(x_t)||², where F is the current function implemented by the network. During the training phase, the weights (and hence F) are successively modified, according to one of several possible algorithms, in order to reduce E. Back propagation, the best known of such algorithms, is just a way of implementing a gradient descent method for E. The main thrust of this paper is not the study of a specific algorithm but rather a precise description of the salient features of the landscape attached to E when the units are linear.

Linear units are the simplest one can use in these circuits. They are often considered as uninteresting for two reasons: (a) only linear functions can be computed in linear networks (and most "interesting" functions are nonlinear); and (b) a network with several layers of linear units can always be collapsed into a linear network without any hidden layer by multiplying the weights in the proper fashion.

As a result, nonlinear units are most commonly used: linear threshold gates or, when continuity or differentiability is required, units with a sigmoid input-output function. In this setting, the results of numerous simulations have led several people to believe that descent methods, such as back propagation, applied to the error function E are not seriously plagued by the problem of local minima (either because global minima are found, or because the local minima encountered are "good enough" for practical purposes) and that, for instance, the solutions obtained have remarkable generalization properties.
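As a minimal illustration of this setup (not from the original paper; the random data, the single linear layer F(x) = Wx, and all names are ad hoc choices), the following NumPy sketch evaluates the quadratic error E and reduces it by plain gradient descent, which is what back propagation amounts to for such a network:

    import numpy as np

    rng = np.random.default_rng(0)
    n, T = 4, 50
    X = rng.standard_normal((n, T))          # input patterns x_t as columns
    Y = rng.standard_normal((n, T))          # target patterns y_t as columns

    W = 0.1 * rng.standard_normal((n, n))    # weights of a single linear layer F(x) = Wx

    def error(W):
        # E = sum_t ||y_t - W x_t||^2
        return np.sum((Y - W @ X) ** 2)

    eta = 1e-3                               # learning rate (step width)
    for _ in range(2000):
        grad = 2.0 * (W @ X - Y) @ X.T       # dE/dW for the quadratic error
        W -= eta * grad

    print(error(W))                          # decreases toward the least squares optimum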
The complete absence, to this date, of any analytical result supporting these claims would by itself justify a careful investigation of the simpler linear case.

In addition, recent work of Linsker (1986a, 1986b, 1986c) and Cottrell, Munro, and Zipser (in press) seems to indicate that, for some tasks, linear units can still be of interest, not so much for the global map they implement but for the internal representation of the input data and the transformations that occur in the different layers during the learning period.

Linsker, for instance, has shown that in a layered feed-forward network of linear units with random inputs and a Hebb type of algorithm for adjusting the synaptic weights, spatial opponent and orientation selective units spontaneously emerge in successive hidden layers, in a way which does not contradict what is observed in the early visual system of higher animals. Cottrell et al. (in press) have used linear units together with the technique of auto-association to realize image compression. Auto-association, which is also called auto-encoding or identity mapping (see Ackley, Hinton, & Sejnowski, 1985; Elman & Zipser, 1988), is a simple trick intended to avoid the need for having a teacher, that is, for knowing the target values y_t, by setting x_t = y_t. In this mode, the network will tend to learn the identity map, which in itself is not too exciting. However, if this is done using one narrow layer of hidden units, one expects the network to find efficient ways of compressing the information contained in the input patterns. An analysis of linear auto-association has been provided by Bourlard and Kamp (1988) based on singular value decomposition of matrices. However, their results for the linear case, which are comprised by ours, do not give a description of the landscape of E.

Our notation will be as follows. All vectors are column vectors and prime superscripts denote transposition. To begin with, we shall assume that both x_t and y_t are n-dimensional vectors and that the network consists of one input layer with n inputs, one hidden layer with p (p ≤ n) units, and one output layer with n units (see Figure 1). The weights connecting the inputs to the hidden layer are described by a p × n real matrix B and those from the hidden layer to the output by an n × p real matrix A. With these assumptions, the error function can be written

E(A, B) = Σ_t ||y_t - ABx_t||².   (1)

FIGURE 1. The network (n input units, p hidden units, n output units).

We define the usual sample covariance matrices Σ_XX = Σ_t x_t x_t', Σ_XY = Σ_t x_t y_t', Σ_YY = Σ_t y_t y_t', and Σ_YX = Σ_t y_t x_t'. We consider the problem of finding the matrices A and B so as to minimize E. In Section 2 we use spectral analysis to describe the properties of the landscape attached to E in the general situation. The auto-associative case and its relations to principal component analysis follow immediately as a special case. In Section 3, we briefly examine some consequences for the optimization algorithms. All mathematical proofs are deferred to the Appendix.

It is important to notice from the onset that if C is any p × p invertible matrix, then AB = (AC)(C⁻¹B). Therefore the matrices A and B are never unique, since they can always be multiplied by appropriate invertible matrices. Whenever uniqueness occurs, it is in terms of the global map W = AB (equivalently, one could partition the matrices into equivalence classes). Notice also that W has rank at most p, and recall that if Σ_XX is invertible the solution to the problem of minimizing E(L) = Σ_t ||y_t - Lx_t||², where L is an n × n matrix without any rank restrictions, is unique and given by L = Σ_YX Σ_XX⁻¹, which is the usual slope matrix for the ordinary least squares regression of Y on X. Finally, if M is an n × p (p ≤ n) matrix, we shall denote by P_M the matrix of the orthogonal projection onto the subspace spanned by the columns of M. It is well known that P_M' = P_M and P_M² = P_M. If in addition M is of full rank p, then P_M = M(M'M)⁻¹M'.

2. MAIN RESULTS: THE LANDSCAPE OF E

Our main result is that:

E has, up to equivalence, a unique local and global minimum corresponding to an orthogonal projection onto the subspace spanned by the first principal eigenvectors of a covariance matrix associated with the training patterns. All other critical points of E are saddle points.

More precisely, one has the following four facts.

Fact 1: For any fixed n × p matrix A, the function E(A, B) is convex in the coefficients of B and attains its minimum for any B satisfying the equation

A'ABΣ_XX = A'Σ_YX.   (2)

If Σ_XX is invertible and A is of full rank p, then E is strictly convex and has a unique minimum reached when

B = B̂(A) = (A'A)⁻¹A'Σ_YX Σ_XX⁻¹.   (3)

In the auto-associative case, (3) becomes

B = B̂(A) = (A'A)⁻¹A'.   (3')
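The closed forms above are easy to check numerically. The following sketch (illustrative only; the random data and names are assumptions, not taken from the paper) builds the sample covariance matrices, the ordinary least squares slope matrix Σ_YX Σ_XX⁻¹, and B̂(A) from equation (3), and verifies that B̂(A) satisfies the normal equation (2):

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, T = 6, 2, 100
    X = rng.standard_normal((n, T))                    # columns are the x_t
    Y = rng.standard_normal((n, T))                    # columns are the y_t

    Sxx = X @ X.T                                      # Sigma_XX
    Syx = Y @ X.T                                      # Sigma_YX
    L = Syx @ np.linalg.inv(Sxx)                       # ordinary least squares slope matrix

    A = rng.standard_normal((n, p))                    # a fixed full rank n x p matrix
    Bhat = np.linalg.inv(A.T @ A) @ A.T @ Syx @ np.linalg.inv(Sxx)   # B_hat(A), eq. (3)

    # B_hat(A) satisfies the normal equation (2): A'A B Sigma_XX = A' Sigma_YX
    print(np.allclose(A.T @ A @ Bhat @ Sxx, A.T @ Syx))   # True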


Fact 2: For any fixed p × n matrix B, the function E(A, B) is convex in the coefficients of A and attains its minimum for any A satisfying the equation

ABΣ_XX B' = Σ_YX B'.   (4)

If Σ_XX is invertible and B is of full rank p, then E is strictly convex and has a unique minimum reached when

A = Â(B) = Σ_YX B'(BΣ_XX B')⁻¹.   (5)

In the auto-associative case, (5) becomes

A = Â(B) = Σ_XX B'(BΣ_XX B')⁻¹.   (5')

Fact 3: Assume that Σ_XX is invertible. If two matrices A and B define a critical point of E (i.e., a point where ∂E/∂a_ij = ∂E/∂b_ij = 0), then the global map W = AB is of the form

W = P_A Σ_YX Σ_XX⁻¹   (6)

with A satisfying

P_A Σ = P_A Σ P_A = Σ P_A,   (7)

where Σ = Σ_YX Σ_XX⁻¹ Σ_XY. In the auto-associative case, Σ = Σ_XX and (6) and (7) become

W = AB = P_A   (6')

P_A Σ_XX = P_A Σ_XX P_A = Σ_XX P_A.   (7')

If A is of full rank p, then A and B define a critical point of E if and only if A satisfies (7) and B = B̂(A), or equivalently if and only if A and W satisfy (6) and (7).

Notice that in (6), the matrix Σ_YX Σ_XX⁻¹ is the slope matrix for the ordinary least squares regression of Y on X. It is easily seen that Σ is the sample covariance matrix of the best unconstrained linear approximation ŷ_t = Σ_YX Σ_XX⁻¹ x_t of Y based on X.

Fact 4: Assume that Σ is full rank with n distinct eigenvalues λ_1 > ... > λ_n. If I = {i_1, ..., i_p} (1 ≤ i_1 < ... < i_p ≤ n) is any ordered p-index set, let U_I = [u_{i_1}, ..., u_{i_p}] denote the matrix formed by the orthonormal eigenvectors of Σ associated with the eigenvalues λ_{i_1}, ..., λ_{i_p}. Then two full rank matrices A and B define a critical point of E if and only if there exist an ordered p-index set I and an invertible p × p matrix C such that

A = U_I C   (8)

B = C⁻¹U_I'Σ_YX Σ_XX⁻¹.   (9)

For such a critical point we have

W = P_{U_I} Σ_YX Σ_XX⁻¹   (10)

E(A, B) = tr Σ_YY - Σ_{i∈I} λ_i.   (11)

Therefore a critical W of rank p is always the product of the ordinary least squares regression matrix followed by an orthogonal projection onto the subspace spanned by p eigenvectors of Σ. The critical map W associated with the index set {1, 2, ..., p} is the unique local and global minimum of E. The remaining (n choose p) - 1 p-index sets correspond to saddle points. All additional critical points defined by matrices A and B which are not of full rank are also saddle points and can be characterized in terms of orthogonal projections onto subspaces spanned by q eigenvectors, with q < p (see Figure 2). In the auto-associative case, (8), (9), and (10) become

A = U_I C   (8')

B = C⁻¹U_I'   (9')

W = P_{U_I}   (10')

and therefore the unique locally and globally optimal map W is the orthogonal projection onto the space spanned by the first p eigenvectors of Σ_XX.

FIGURE 2. The landscape of E (the unique minimum and the saddle points).

Remark: At the global minimum, if C is the identity I_p, then the activities of the units in the hidden layer are given by u_1'ŷ_t, ..., u_p'ŷ_t, the so-called principal components of the ŷ_t's (see for instance Kshirsagar, 1972). In the auto-associative case, these activities are given by u_1'x_t, ..., u_p'x_t. They are the coordinates of the vector x_t along the first p eigenvectors of Σ_XX.

The assumptions on the rank or eigenvalues of the matrices appearing in the statements of the facts are by no means restrictive. They are satisfied in most practical situations and also in the case of random matrices with probability one. For instance, a non-invertible Σ_XX corresponds to a poor choice of the training patterns with linear dependencies, and a rank deficient matrix A (or B) to a very poor utilization of the units in the network. For back propagation, the initial weights are usually set at random, which yields, with probability one, matrices A and B of full rank. Σ is a covariance matrix and therefore its eigenvalues are always non-negative. To assume that they are all strictly positive is equivalent to assuming that both Σ_XX and Σ are of full rank. Full rank matrices are dense, and in a realistic environment with noise and finite precision we can always slightly perturb the conditions so as to make Σ invertible and with distinct eigenvalues. Furthermore, in the proofs in Appendix B, we describe the structure of the critical points with deficient rank and what happens in the case where some of the eigenvalues of Σ are equal.
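Fact 4 and formula (11) can also be checked numerically. In the sketch below (illustrative; the random data and all names are assumptions), each ordered p-index set yields a critical map W = P_{U_I} Σ_YX Σ_XX⁻¹ whose error matches tr Σ_YY - Σ_{i∈I} λ_i, and the set of the first p eigenvectors gives the smallest error:

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(2)
    n, p, T = 5, 2, 200
    X = rng.standard_normal((n, T))
    Y = rng.standard_normal((n, T))
    Sxx, Sxy = X @ X.T, X @ Y.T
    Syx, Syy = Y @ X.T, Y @ Y.T
    L = Syx @ np.linalg.inv(Sxx)                      # least squares slope matrix
    Sigma = Syx @ np.linalg.inv(Sxx) @ Sxy            # Sigma = Sigma_YX Sigma_XX^-1 Sigma_XY

    lam, U = np.linalg.eigh(Sigma)                    # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]                     # sort descending: lambda_1 > ... > lambda_n
    lam, U = lam[order], U[:, order]

    def E_of(W):
        return np.sum((Y - W @ X) ** 2)

    for idx in combinations(range(n), p):             # every ordered p-index set
        UI = U[:, list(idx)]
        W = UI @ UI.T @ L                             # critical map, eq. (10)
        # eq. (11): E = tr Sigma_YY - sum of the selected eigenvalues
        print(idx, E_of(W), np.trace(Syy) - lam[list(idx)].sum())

    # The index set (0, 1), i.e., the first p eigenvectors, gives the smallest error.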
We have also restricted our analysis to the case of linear units without bias and to networks containing a single hidden layer. The generalization of our result to the affine case is straightforward, either by pre-subtracting the mean from the input and target data, or by adding a unit which is kept at a fixed value. A rigorous extension to the nonlinear sigmoid case or to the case involving linear threshold units seems more difficult. However, our results, and in particular the main features of the landscape of E, hold true in the case of linear networks with several hidden layers.

One of the central issues in learning from examples is the problem of generalization, that is, how does the network perform when exposed to a pattern never seen previously? In our setting, a precise quantitative answer can be given to this question. For instance, in the auto-associative case, the distortion on a new pattern is exactly given by its distance to the subspace generated by the first p eigenvectors of Σ_XX.

It is reasonable to think that for most solutions found by running a gradient descent algorithm on the function E, the final matrix C will not be the identity I_p. In fact, we even expect C to be rather "random" looking. This is the main reason why the relation of auto-association to principal component analysis was not apparent in earlier simulations described in the literature and why, in the solutions found by back propagation, the work load seems to be evenly distributed among the units of the hidden layer. If in (9') we take C = I_p, then B = U_I'. Therefore the synaptic vector corresponding to the "first" hidden unit is exactly equal to the dominant eigenvector of the input correlation matrix. This is in fact exactly the same result as the one obtained by Oja (1982) in a different setting, using differential equations to approximate a constrained form of Hebbian learning on a single linear unit with n stochastic inputs. In other words, up to equivalence, the solution sought by a back propagation type of algorithm in the auto-associative case and by Hebbian learning are identical on one single linear "neuron." It remains to be checked whether simultaneous Hebbian learning on p units, probably with some appropriate form of lateral inhibition, leads to the same results as those encountered here for the auto-association.
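A minimal numerical sketch of this auto-associative behaviour is given below (illustrative only; the data scaling, learning rate, and iteration count are arbitrary choices, not taken from the paper). Gradient descent on E drives the global map W = AB toward the orthogonal projection onto the first p eigenvectors of Σ_XX, while B itself is only an invertible mixture C⁻¹U_I' of those eigenvectors:

    import numpy as np

    rng = np.random.default_rng(3)
    n, p, T = 8, 3, 300
    # anisotropic inputs so that the principal subspace is well defined
    X = np.diag(np.linspace(3.0, 0.5, n)) @ rng.standard_normal((n, T)) / np.sqrt(T)
    Sxx = X @ X.T

    A = 0.01 * rng.standard_normal((n, p))
    B = 0.01 * rng.standard_normal((p, n))
    eta = 5e-3
    for _ in range(20000):
        R = A @ B @ X - X                       # residual in the auto-associative case (y_t = x_t)
        gA = 2.0 * R @ X.T @ B.T                # dE/dA
        gB = 2.0 * A.T @ R @ X.T                # dE/dB
        A -= eta * gA
        B -= eta * gB

    lam, U = np.linalg.eigh(Sxx)
    Up = U[:, np.argsort(lam)[::-1][:p]]        # first p eigenvectors of Sigma_XX
    P = Up @ Up.T                               # orthogonal projector onto the principal subspace
    print(np.linalg.norm(A @ B - P))            # small: the global map W = AB approaches P
    print(np.round(B @ Up, 2))                  # approximately C^-1: invertible but "random" looking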
3. CONCLUDING REMARKS ON THE ALGORITHMS

One of the nice features of the landscape of E is the existence, up to equivalence, of a unique local and global minimum which, in addition, can be described in terms of principal component analysis and least squares regression. Consequently, this minimum could also be obtained from several well-known algorithms for computing the eigenvalues and eigenvectors of symmetric positive definite matrices (see for instance Atkinson, 1978). By numerical analysis standards, these algorithms are superior to gradient methods for the class of problems considered here. However, though efficiency considerations are of importance, one should not disregard back propagation on this sole basis, for its introduction in the design of neural networks was guided by several other considerations. In particular, in addition to its simplicity, error back-propagation can be applied to nonlinear networks and to a variety of problems without having any detailed a priori knowledge of their structure or of the mathematical properties of the optimal solutions.

A second nice feature of the landscape of E is that if we fix A (resp. B) with full rank, then E is a strictly convex quadratic form and there exists a unique minimum reached for B = B̂(A) (resp. A = Â(B)). In this case, gradient descent with appropriate step width (or "learning rate") leads to convergence with a residual error decaying exponentially fast. Of course, B̂(A) (resp. Â(B)) can also be obtained directly by solving the linear system in (2) (resp. (4)). This also suggests another optimization strategy which consists of successively computing, starting for instance from a random A, B̂(A), Â(B̂(A)), and so forth, which in fact is a Newton's type of method. In any case, from a theoretical standpoint, one should notice that, although E has no local minima, both gradient descent and Newton's type of methods could get stuck in a saddle point. However, as exemplified by simulations (Cottrell et al., in press), this seems unlikely to happen, especially with the way error back-propagation is usually implemented, with a descent direction computed by differentiating E after presentation of one or just a few training patterns. Such a direction is clearly distinct from the true gradient.
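The alternating strategy mentioned above is straightforward to sketch (again illustrative; the names and the fixed iteration count are assumptions, not the paper's code). Starting from a random full rank A, one repeatedly applies the closed forms (3) and (5):

    import numpy as np

    rng = np.random.default_rng(4)
    n, p, T = 6, 2, 150
    X = rng.standard_normal((n, T))
    Y = rng.standard_normal((n, T))
    Sxx, Syx = X @ X.T, Y @ X.T
    Sxx_inv = np.linalg.inv(Sxx)

    def B_hat(A):                                  # eq. (3)
        return np.linalg.inv(A.T @ A) @ A.T @ Syx @ Sxx_inv

    def A_hat(B):                                  # eq. (5)
        return Syx @ B.T @ np.linalg.inv(B @ Sxx @ B.T)

    A = rng.standard_normal((n, p))                # random full rank start
    for _ in range(50):                            # successively compute B_hat(A), A_hat(B_hat(A)), ...
        B = B_hat(A)
        A = A_hat(B)

    W = A @ B
    E = np.sum((Y - W @ X) ** 2)
    print(E)       # should approach tr Sigma_YY - (lambda_1 + ... + lambda_p) from a generic start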
REFERENCES

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147-169.

Atkinson, K. E. (1978). An introduction to numerical analysis. New York: John Wiley & Sons.

Bourlard, H., & Kamp, Y. (1988). Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59, 291-294.

Cottrell, G. W., Munro, P. W., & Zipser, D. (in press). Image compression by back propagation: A demonstration of extensional programming. In N. E. Sharkey (Ed.), Advances in cognitive science (Vol. 2). Norwood, NJ: Ablex.
Elman, J. L., & Zipser, D. (1987). Learning the hidden structure of speech (Tech. Rep. No. 8701). San Diego: Institute for Cognitive Science, University of California.

Grossberg, S. (1988). Nonlinear neural networks: Principles, mechanisms and architectures. Neural Networks, 1, 17-61.

Kshirsagar, A. N. (1972). Multivariate analysis. New York: Marcel Dekker, Inc.

Linsker, R. (1986a). From basic network principles to neural architecture: Emergence of spatial opponent cells. Proceedings of the National Academy of Sciences USA, 83, 7508-7512.

Linsker, R. (1986b). From basic network principles to neural architecture: Emergence of orientation selective cells. Proceedings of the National Academy of Sciences USA, 83, 8390-8394.

Linsker, R. (1986c). From basic network principles to neural architecture: Emergence of orientation columns. Proceedings of the National Academy of Sciences USA, 83, 8779-8783.

Lippmann, R. P. (1987). An introduction to computing with neural nets. IEEE Acoustics, Speech, and Signal Processing Magazine, 4-22.

Magnus, J. R., & Neudecker, H. (1986). Symmetry, 0-1 matrices and Jacobians. Econometric Theory, 2, 157-190.

Oja, E. (1982). A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15, 267-273.

Pollock, D. S. G. (1979). The algebra of econometrics. New York: John Wiley & Sons.

Rumelhart, D. E., McClelland, J. L., & the PDP Research Group (1986a). Parallel distributed processing: Explorations in the microstructure of cognition (Vols. 1 & 2). Cambridge, MA: MIT Press.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986b). Learning internal representations by error propagation. In Parallel distributed processing: Explorations in the microstructure of cognition (pp. 318-362). Cambridge, MA: MIT Press.

APPENDIX A: MATHEMATICAL PROOFS

We have tried to write proofs which are self-contained up to very basic results of linear algebra. Slightly less elementary results which are often used in the proofs (sometimes without explicit mentioning) are listed below as a reminder for the reader. For any matrices P, Q, R we have tr(PQR) = tr(RPQ) = tr(QRP), provided that these quantities are defined. Thus in particular, if P is idempotent, that is, P² = P, then

tr(PQP) = tr(P²Q) = tr(PQ).   (a)

If U is orthogonal, that is U'U = I, then

tr(UQU') = tr(U'UQ) = tr(Q).   (b)

The Kronecker product P ⊗ Q of any two matrices P and Q is the matrix obtained from the matrix P by replacing each entry p_ij of P with the matrix p_ij Q. If P is any m × n matrix and p_i its ith column, then vec P is the mn × 1 vector vec P = [p_1', ..., p_n']'. Thus the vec operation transforms a matrix into a column vector by stacking the columns of the matrix one underneath the other. We then have (see for instance Magnus & Neudecker, 1986), for any matrices P, Q, R, S:

tr(PQ') = (vec P)' vec Q   (c)

vec(PQR') = (R ⊗ P) vec Q   (d)

(P ⊗ Q)(R ⊗ S) = PR ⊗ QS   (e)

(P ⊗ Q)⁻¹ = P⁻¹ ⊗ Q⁻¹   (f)

(P ⊗ Q)' = P' ⊗ Q'   (g)

whenever these quantities are defined. Also:

if P and Q are symmetric and positive semidefinite (resp. positive definite), then P ⊗ Q is symmetric and positive semidefinite (resp. positive definite).   (h)

Finally, let us introduce the input data matrix X = [x_1, ..., x_T] and the output data matrix Y = [y_1, ..., y_T]. It is easily seen that XX' = Σ_XX, XY' = Σ_XY, YY' = Σ_YY, YX' = Σ_YX and E(A, B) = ||vec(Y - ABX)||². In the proofs of facts 1 and 2, we shall use the following well known lemma.

Lemma: The quadratic function

F(z) = ||c - Mz||² = c'c - 2c'Mz + z'M'Mz

is convex. A point z corresponds to a global minimum of F if and only if it satisfies the equation ∇F = 0, or equivalently M'Mz = M'c. If in addition M'M is positive definite, then F is strictly convex and the unique minimum of F is attained for z = (M'M)⁻¹M'c.

Proof of Fact 1: For fixed A, use (d) to write vec(Y - ABX) = vec Y - vec(ABX) = vec Y - (X' ⊗ A) vec B and thus E(A, B) = ||vec Y - (X' ⊗ A) vec B||². By the above lemma, E is convex in the coefficients of B, and B corresponds to a global minimum if and only if (X' ⊗ A)'(X' ⊗ A) vec B = (X' ⊗ A)' vec Y. Now on one hand (X' ⊗ A)'(X' ⊗ A) vec B = (X ⊗ A')(X' ⊗ A) vec B = (XX' ⊗ A'A) vec B = (Σ_XX ⊗ A'A) vec B = vec(A'ABΣ_XX). On the other hand (X' ⊗ A)' vec Y = (X ⊗ A') vec Y = vec(A'YX') = vec(A'Σ_YX). Therefore

A'ABΣ_XX = A'Σ_YX,

which is (2). If A is full rank, A'A is symmetric and positive definite. As a covariance matrix, Σ_XX is symmetric and positive semidefinite; if, in addition, Σ_XX is invertible, then Σ_XX is also positive definite. Because of (h), (X' ⊗ A)'(X' ⊗ A) = Σ_XX ⊗ A'A is also symmetric and positive definite. Applying the above lemma, we conclude that if Σ_XX is invertible and A is a fixed full rank matrix, then E is strictly convex in the coefficients of B and attains its unique minimum at the unique solution B = B̂(A) = (A'A)⁻¹A'Σ_YX Σ_XX⁻¹ of (2), which is (3). In the auto-associative case, x_t = y_t. Therefore Σ_YX = Σ_XY = Σ_XX = Σ_YY and the above expression simplifies to (3').

Proof of Fact 2: For fixed B, use (d) to write vec(Y - ABX) = vec Y - vec(ABX) = vec Y - (X'B' ⊗ I) vec A and so E(A, B) = ||vec Y - (X'B' ⊗ I) vec A||². By the above lemma, E is convex in the coefficients of A, and A corresponds to a global minimum if and only if (X'B' ⊗ I)'(X'B' ⊗ I) vec A = (X'B' ⊗ I)' vec Y. Since (X'B' ⊗ I)'(X'B' ⊗ I) vec A = (BXX'B' ⊗ I) vec A = (BΣ_XX B' ⊗ I) vec A = vec(ABΣ_XX B') and (X'B' ⊗ I)' vec Y = (BX ⊗ I) vec Y = vec(YX'B') = vec(Σ_YX B'), we have

ABΣ_XX B' = Σ_YX B',

which is (4). If B and Σ_XX are full rank, then the symmetric and positive semidefinite matrix BΣ_XX B' becomes full rank and therefore positive definite. Because of (h), (X'B' ⊗ I)'(X'B' ⊗ I) = (BΣ_XX B' ⊗ I) is also positive definite, and (5) and (5') are easily derived as at the end of the proof of fact 1.

Notice that from facts 1 and 2, two full rank matrices A and B define a critical point for E if and only if (2) and (4) are simultaneously satisfied. In all cases of practical interest where Σ_YX is full rank, both Â(B) and B̂(A) are full rank. In what follows, we shall always assume that A is of full rank p. The case rank(A) < p is, although intuitively of no practical interest, slightly more technical and its treatment will be postponed to Appendix B.

Proof of Fact 3: Assume first that A and B define a critical point of E, with A full rank. Then from fact 1 we get B = B̂(A) and thus

W = AB = A(A'A)⁻¹A'Σ_YX Σ_XX⁻¹ = P_A Σ_YX Σ_XX⁻¹,

which is (6). Multiplication of (4) by A' on the right yields

WΣ_XX W' = ABΣ_XX B'A' = Σ_YX B'A' = Σ_YX W'

or

P_A Σ_YX Σ_XX⁻¹ Σ_XY P_A = Σ_YX Σ_XX⁻¹ Σ_XY P_A,

or equivalently P_A Σ P_A = Σ P_A. Since both Σ and P_A are symmetric, P_A Σ P_A = Σ P_A is also symmetric and therefore Σ P_A = (Σ P_A)' = P_A Σ' = P_A Σ. So P_A Σ = P_A Σ P_A = Σ P_A, which is (7). Hence if A and B correspond to a critical point and A is full rank, then (6) and (7) must hold and B = B̂(A).

Conversely, assume that A and W satisfy (6) and (7), with A full rank. Multiplying (6) by (A'A)⁻¹A' on the left yields B = (A'A)⁻¹A'Σ_YX Σ_XX⁻¹ = B̂(A) and (2) is satisfied. From P_A Σ P_A = Σ P_A and using (6) we immediately get ABΣ_XX B'A' = Σ_YX B'A', and multiplication of both sides by A(A'A)⁻¹ on the right yields ABΣ_XX B' = Σ_YX B', which is (4). Thus A and B satisfy (2) and (4) and therefore they define a critical point of E.
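As a quick sanity check of the vec and Kronecker machinery used in these proofs (a hypothetical snippet, not part of the paper), one can verify identity (d) and the rewriting E(A, B) = ||vec Y - (X' ⊗ A) vec B||² on random matrices:

    import numpy as np

    rng = np.random.default_rng(5)
    n, p, T = 4, 2, 7
    A = rng.standard_normal((n, p))
    B = rng.standard_normal((p, n))
    X = rng.standard_normal((n, T))
    Y = rng.standard_normal((n, T))

    vec = lambda M: M.reshape(-1, order="F")          # stack the columns

    # identity (d) with P = A, Q = B, R' = X (so R = X'): vec(ABX) = (X' kron A) vec B
    lhs = vec(A @ B @ X)
    rhs = np.kron(X.T, A) @ vec(B)
    print(np.allclose(lhs, rhs))                      # True

    # hence E(A, B) = ||vec Y - (X' kron A) vec B||^2
    E_direct = np.sum((Y - A @ B @ X) ** 2)
    E_vec = np.sum((vec(Y) - np.kron(X.T, A) @ vec(B)) ** 2)
    print(np.allclose(E_direct, E_vec))               # True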
Proof of Fact 4: First notice that since Σ is a real symmetric covariance matrix, it can always be written as Σ = UΛU', where U is an orthogonal column matrix of eigenvectors of Σ and Λ is the diagonal matrix with non-increasing eigenvalues on its diagonal. Also if Σ is full rank, then Σ_XX, Σ_YX and Σ_XY are full rank too.

Now clearly if A and B satisfy (8) and (9) for some C and some I, then A and B are full rank p and satisfy (3) and (5). Therefore they define a critical point of E.

For the converse, we have

P_{U'A} = U'A(A'UU'A)⁻¹A'U = U'A(A'A)⁻¹A'U = U'P_A U,

or, equivalently, P_A = UP_{U'A}U'. Hence (7) yields

UP_{U'A}U'UΛU' = P_A Σ = Σ P_A = UΛU'UP_{U'A}U'

and so P_{U'A}Λ = ΛP_{U'A}. Since λ_1 > ... > λ_n ≥ 0, it is readily seen that P_{U'A} is diagonal. P_{U'A} is an orthogonal projector of rank p and its eigenvalues are 1 (p times) and 0 (n - p times). Therefore there exists a unique index set I = {i_1, ..., i_p} with 1 ≤ i_1 < ... < i_p ≤ n such that P_{U'A} = I_I, where I_I is the diagonal matrix with ith entry 1 if i ∈ I and 0 otherwise. It follows that

P_A = UP_{U'A}U' = [u_{i_1}, ..., u_{i_p}][u_{i_1}, ..., u_{i_p}]' = U_I U_I',

where U_I = [u_{i_1}, ..., u_{i_p}]. Thus P_A is the orthogonal projection onto the subspace spanned by the columns of U_I. Since the column space of A coincides with the column space of U_I, there exists an invertible p × p matrix C such that A = U_I C. Moreover, B = B̂(A) = C⁻¹U_I'Σ_YX Σ_XX⁻¹ and (8) and (9) are satisfied. There are (n choose p) possible choices for I and therefore, up to equivalence, (n choose p) critical points with full rank A.

From (8) and (9), (10) results immediately.

Remark: In the most general case with n-dimensional inputs x_t and m-dimensional outputs y_t, Σ has r (r ≤ m) distinct eigenvalues λ_1 > ... > λ_r ≥ 0 with multiplicities m_1, ..., m_r. Using the above arguments, it is easily seen that P_{U'A} will now be block-diagonal [P_1, ..., P_r], where P_1, ..., P_r are orthogonal projectors of dimension m_1, ..., m_r, and thus A is of the form A = (UV)_I C, where V is block-diagonal [V_1, ..., V_r], V_1, ..., V_r being orthogonal matrices of dimension m_1, ..., m_r. For all such choices of V, UV is a matrix of normalized eigenvectors of Σ corresponding to ordered eigenvalues of Σ. The geometric situation, as expected, does not really change, but the parameterization becomes more involved as V is no longer unique.

To prove (11), use (c) to write E(A, B) = (vec(Y - ABX))' vec(Y - ABX) = (vec Y)' vec Y - 2(vec(ABX))' vec Y + (vec ABX)' vec ABX = tr YY' - 2 tr ABXY' + tr ABXX'B'A' = tr Σ_YY - 2 tr WΣ_XY + tr WΣ_XX W'. If A is full rank and B = B̂(A), then W = AB̂(A) = P_A Σ_YX Σ_XX⁻¹ and therefore tr(WΣ_XX W') = tr(P_A Σ P_A) = tr(P_A Σ) = tr(UP_{U'A}U'UΛU') = tr(P_{U'A}U'UΛ) = tr(P_{U'A}Λ) and tr(WΣ_XY) = tr(P_A Σ) = tr(P_{U'A}Λ). So for an arbitrary A of rank p,

E(A, B̂(A)) = tr Σ_YY - tr(P_{U'A}Λ).

If A is of the form U_I C, then P_{U'A} = I_I. Therefore

E(A, B̂(A)) = tr Σ_YY - tr(I_I Λ) = tr Σ_YY - Σ_{i∈I} λ_i,

which is (11).

We shall now establish that whenever A and B satisfy (8) and (9) with I ≠ {1, 2, ..., p}, there exist matrices Ã, B̃ arbitrarily close to A, B such that E(Ã, B̃) < E(A, B). For this purpose it is enough to slightly perturb the column space of A in the direction of an eigenvector associated with one of the first p eigenvalues of Σ which is not contained in {λ_i, i ∈ I}. More precisely, fix two indices j and k with j ∈ I, k ∉ I and λ_k > λ_j (such indices exist since I ≠ {1, 2, ..., p}). For any ε, put ũ_j = (1 + ε²)^(-1/2)(u_j + εu_k), construct Ũ_I from U_I by replacing u_j with ũ_j, and put Ã = Ũ_I C and B̃ = B̂(Ã). Therefore

E(Ã, B̃) = tr Σ_YY - Σ_{i∈I, i≠j} λ_i - (λ_j + ε²λ_k)/(1 + ε²)

and E(Ã, B̃) - E(A, B) = ε²(λ_j - λ_k)/(1 + ε²) < 0. By taking values of ε arbitrarily small, we see that any neighborhood of A, B contains points of the form Ã, B̃ with a strictly smaller error function. Thus if I ≠ {1, 2, ..., p}, then (8) and (9) define a saddle point and not a local minimum. Notice that, in any case, it could not be a local maximum, because of the strict convexity of E, with fixed full rank A, in the coefficients of B.

APPENDIX B: THE RANK DEFICIENT CASE

We now complete the proof of fact 3 (equations (6) and (7)) and fact 4 in the case where A is not of full rank. Using the Moore-Penrose inverse A⁺ of the matrix A (see for instance Pollock, 1979), the general solution to equation (2) can be written as

B = A⁺Σ_YX Σ_XX⁻¹ + (I - A⁺A)L,

where L is an arbitrary p × n matrix. We have P_A = AA⁺ and AA⁺A = A, and so W = AB = AA⁺Σ_YX Σ_XX⁻¹ + A(I - A⁺A)L = P_A Σ_YX Σ_XX⁻¹ + (A - AA⁺A)L = P_A Σ_YX Σ_XX⁻¹, which is (6). Multiplication of (4) by A' on the right yields WΣ_XX W' = Σ_YX W' and (7) follows as usual. Observe that in order for A and B to determine a critical point of E, L must in general be constrained by (4); L = 0 is always a solution.

In any case, as in the proof of fact 4 for full rank A, if rank A = r we conclude that P_{U'A} is an orthogonal projector of rank r commuting with Λ, so that P_{U'A} = I_J for an index set J = {j_1, ..., j_r} with 1 ≤ j_1 < ... < j_r ≤ n and P_A = UI_J U' = U_J U_J'. Again, as the column space of A is identical to the column space of U_J, we can write A in the form

A = [U_J, 0]C,

where 0 denotes a matrix of dimension n × (p - r) with all entries 0. At any critical point A, B of E, A will be of the above form and, from (2), B will be of the form

B = A⁺Σ_YX Σ_XX⁻¹ + (I - A⁺A)L,

where L is constrained by (4). No matter what L actually is, using the above form of A we obtain that

CB = [ U_J'Σ_YX Σ_XX⁻¹
       (last p - r rows of CL) ].

Now, by assumption, Σ has full rank n, and so U_J'Σ_YX Σ_XX⁻¹ has full rank r. Upon slightly perturbing the last p - r rows of CB (which are also the last p - r rows of CL), we can always obtain B̃ arbitrarily close to B such that B̃ has maximal rank and W = AB̃ = AB, and thus E(A, B̃) = E(A, B). Now B̃ has full rank and so E is strictly convex in the elements of A. Putting Ã = (1 - ε)A + εÂ(B̃) with 0 < ε < 1, we have E(Ã, B̃) < E(A, B̃) = E(A, B). If ε → 0, Ã → A and therefore (A, B) is a saddle point for E.
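The key observation of this appendix, that any solution B = A⁺Σ_YX Σ_XX⁻¹ + (I - A⁺A)L of (2) yields the same global map W = P_A Σ_YX Σ_XX⁻¹ when A is rank deficient, can also be checked numerically (an illustrative sketch; the random rank deficient A and all names are assumptions):

    import numpy as np

    rng = np.random.default_rng(6)
    n, p, T = 5, 3, 80
    X = rng.standard_normal((n, T))
    Y = rng.standard_normal((n, T))
    Sxx_inv = np.linalg.inv(X @ X.T)
    Syx = Y @ X.T

    A = rng.standard_normal((n, 2)) @ rng.standard_normal((2, p))   # rank deficient: rank 2 < p
    A_pinv = np.linalg.pinv(A)                                      # Moore-Penrose inverse A+
    P_A = A @ A_pinv                                                # orthogonal projector onto the column space of A

    W_ref = P_A @ Syx @ Sxx_inv                                     # eq. (6)
    for _ in range(3):
        L_arb = rng.standard_normal((p, n))                         # arbitrary p x n matrix L
        B = A_pinv @ Syx @ Sxx_inv + (np.eye(p) - A_pinv @ A) @ L_arb
        print(np.allclose(A @ B, W_ref))                            # True for every choice of L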
