Originally defined by Hotelling in 1935 (Hotelling 1935; Hotelling 1936; see also Bartlett 1948), canonical correlation analysis (CCA) is a statistical method whose goal is to extract the information common to two data tables that measure quantitative variables on the same set of observations. To do so, CCA computes two sets of linear combinations – called latent variables – (one for each data table) that have maximum correlation. A convenient way to visualize the common information extracted by the analysis is (1) to plot the latent variables of one set against the other set (this creates plots akin to plots of factor scores in principal component analysis); (2) to plot the coefficients of the linear combinations (this creates plots akin to plotting the loadings in principal component analysis); and (3) to plot the correlations between the original variables and the latent variables (this creates "correlation circle" plots like in principal component analysis).

CCA generalizes many standard statistical techniques (e.g., multiple regression, analysis of variance, discriminant analysis) and also gives rise to several related methods that address slightly different types of problems (e.g., different normalization conditions, different types of data).

Key Points

CCA extracts the information common to two data tables measuring quantitative variables on the same set of observations. For each data table, CCA computes a set of linear combinations of the variables of this table, called latent variables or canonical variates, with the constraints that a latent variable from one table has maximal correlation with one latent variable of the other table and no correlation with the remaining latent variables of the other table. The results of the analysis are interpreted using different types of graphical displays that plot the latent variables and the coefficients of the linear combinations used to create the latent variables.

After principal component analysis (PCA), CCA is one of the oldest multivariate techniques; it was first defined in 1935 by Hotelling. In addition to being the first method created for the statistical analysis of two data tables, CCA is also of theoretical interest because a very large number of multivariate analytic tools are particular cases of CCA. Like most multivariate statistical techniques, CCA became practically feasible only with the advent of modern computers. Recent developments involve the generalization of CCA to more than two tables, cross-validation approaches to select important variables, and ways to assess the stability and reliability of the solution obtained on a given sample.

Canonical Correlation Analysis

Notations

Matrices are denoted by upper case bold letters, vectors by lower case bold letters, and their elements by lower case italic letters. Matrices, vectors, and elements from the same matrix all use the same letter (e.g., A, a, a). The transpose operation is denoted by the superscript ⊺, the inverse operation by the superscript −1. The identity matrix is denoted I, vectors or matrices of ones are denoted 1, and matrices or vectors of zeros are denoted 0. When provided with a square matrix, the diag operator gives a vector with the diagonal elements of this matrix. When provided with a vector, the diag operator gives a diagonal matrix with the elements of the vector as the diagonal elements of this matrix. When provided with a square matrix, the trace operator gives the sum of the diagonal elements of this matrix.

The data tables to be analyzed by CCA, of size N × I and N × J respectively, are denoted X and Y; they collect two sets of, respectively, I and J quantitative measurements obtained on the same N observations. Except if mentioned otherwise, matrices X and Y are column centered and normalized, and so:
1⊺X = 0,  1⊺Y = 0,   (1)

(with 1 being a conformable vector of 1s and 0 a conformable vector of 0s), and

diag{X⊺X} = 1,  diag{Y⊺Y} = 1.   (2)

Note that because X and Y are centered and normalized matrices, their inner products are correlation matrices that are denoted:

R_X = X⊺X,  R_Y = Y⊺Y,  and  R = X⊺Y.   (3)

Optimization Problem

In CCA, the problem is to find two latent variables, denoted f and g, obtained as linear combinations of the columns of, respectively, X and Y. The coefficients of these linear combinations are stored, respectively, in the I × 1 vector p and the J × 1 vector q; and, so, we are looking for

f = Xp  and  g = Yq   (4)

such that the correlation between the latent variables is maximal:

δ = arg max_{p, q} corr(f, g).   (5)

Because the correlation is insensitive to the scaling of f and g, the latent variables can be normalized; with X and Y centered and normalized, this amounts to requiring that

f⊺f = p⊺X⊺Xp = p⊺R_X p = 1 = g⊺g = q⊺Y⊺Yq = q⊺R_Y q.   (6)

With these notations, the maximization problem from Eq. 5 becomes:

arg max_{p, q} {f⊺g = p⊺Rq}   under the constraints that   p⊺R_X p = 1 = q⊺R_Y q.   (7)

Equivalent Optimum Criteria

The maximization problem expressed in Eq. 5 can also be expressed as the following equivalent minimization problem:

arg min_{p, q} ‖Xp − Yq‖² = arg min_{p, q} trace{(Xp − Yq)⊺(Xp − Yq)}   under the constraints that   p⊺R_X p = 1 = q⊺R_Y q.   (8)
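To make the notation concrete, here is a minimal R sketch (not part of the original entry or its companion repository): it builds two simulated centered and normalized tables and the three correlation matrices of Eq. 3. All object names (X, Y, RX, RY, R) and the simulated data are purely illustrative.

## Minimal sketch: two tables measured on the same N observations.
## scale() centers and standardizes; dividing by sqrt(N - 1) gives each
## column a unit sum of squares, so Eq. 2 holds exactly.
set.seed(1)
N <- 36; I <- 5; J <- 9
X <- scale(matrix(rnorm(N * I), N, I)) / sqrt(N - 1)   # N x I
Y <- scale(matrix(rnorm(N * J), N, J)) / sqrt(N - 1)   # N x J

RX <- crossprod(X)      # t(X) %*% X : I x I correlation matrix (Eq. 3)
RY <- crossprod(Y)      # t(Y) %*% Y : J x J correlation matrix
R  <- crossprod(X, Y)   # t(X) %*% Y : I x J cross-correlation matrix

## For weight vectors p and q satisfying the constraints of Eq. 7,
## the criterion of Eqs. 5 and 7 is the scalar t(p) %*% R %*% q.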
The constrained maximization problem of Eq. 7 can be solved with Lagrange multipliers (here denoted a and b, one per constraint); the Lagrangian is

ℒ = p⊺Rq − a(p⊺R_X p − 1) − b(q⊺R_Y q − 1),   (9)

and its partial derivatives with respect to p and q are:

∂ℒ/∂p = Rq − 2aR_X p   (10)

∂ℒ/∂q = R⊺p − 2bR_Y q.   (11)

The Normal Equations

Setting Eqs. 10 and 11 to zero gives the normal equations:

Rq = 2aR_X p   (12)

R⊺p = 2bR_Y q.   (13)

Solution of the Normal Equations

The first step to solve the normal equations is to show that a = b. This is done by premultiplying Eq. 12 by p⊺ and Eq. 13 by q⊺ to obtain (using the constraints from Eq. 7):

p⊺Rq = 2a p⊺R_X p = 2a   (14)

q⊺R⊺p = 2b q⊺R_Y q = 2b.   (15)

Because p⊺Rq = q⊺R⊺p, it follows that 2a = 2b; calling this common value δ and rewriting Eqs. 12 and 13 gives:

R_X⁻¹Rq = δp   (16)

R_Y⁻¹R⊺p = δq.   (17)

Replacing q (respectively p) in Eq. 16 (respectively Eq. 17) by its expression from Eq. 17 (respectively Eq. 16) gives the following two eigen-equations (see Abdi 2007b for a refresher about the eigen-decomposition):

R_X⁻¹RR_Y⁻¹R⊺p = δ²p   (18)

R_Y⁻¹R⊺R_X⁻¹Rq = δ²q,   (19)

which shows that p (respectively q) is the eigenvector of the nonsymmetric matrix R_X⁻¹RR_Y⁻¹R⊺ (respectively, R_Y⁻¹R⊺R_X⁻¹R) associated with the first eigenvalue λ₁ = δ², and that the maximum correlation (i.e., the canonical correlation) is equal to δ. Note that, in order to make explicit the constraints expressed in Eq. 7, the vectors p and q are normalized (respectively) in the metrics R_X and R_Y (i.e., p⊺R_X p = 1 and q⊺R_Y q = 1).

Additional Pairs of Latent Variables

After the first pair of latent variables has been found, additional pairs of latent variables can be extracted. The criterion from Eqs. 5 and 7 is still used for the subsequent pairs of latent variables, along with the requirement that the new latent variables are orthogonal to the previous ones. Specifically, if fℓ and gℓ denote the ℓ-th pair of latent variables, the orthogonality condition becomes:

fℓ⊺fℓ′ = 0  and  gℓ⊺gℓ′ = 0  for ℓ ≠ ℓ′,   (20)

or, equivalently, in terms of the weight vectors,

pℓ⊺R_X pℓ′ = 0  and  qℓ⊺R_Y qℓ′ = 0  for ℓ ≠ ℓ′.   (21)

For convenience, latent variables and eigenvectors can be stored in matrices F, G, P, and Q. With these notations, the normalization (from Eq. 7) and orthogonality (from Eq. 21) conditions are written as

F⊺F = I,  P⊺R_X P = I   (22)

G⊺G = I,  Q⊺R_Y Q = I.   (23)

The matrices of eigenvectors P and Q are respectively called R_X- and R_Y-orthogonal (the proof of this property is given in the next section, see Eq. 27). The eigen-decompositions for P and Q can then be expressed in a matrix form as:
R_X⁻¹RR_Y⁻¹R⊺P = PΛ   and   R_Y⁻¹R⊺R_X⁻¹RQ = QΛ.   (24)

R_X^{−1/2}RR_Y⁻¹R⊺R_X^{−1/2} = P̃ΛP̃⊺   with   P̃⊺P̃ = I.   (25)

This can be shown by first defining P̃ = R_X^{1/2}P, replacing P by R_X^{−1/2}P̃ in Eq. 24, and then simplifying:

R_X⁻¹RR_Y⁻¹R⊺P = PΛ
R_X⁻¹RR_Y⁻¹R⊺R_X^{−1/2}P̃ = R_X^{−1/2}P̃Λ   (because P = R_X^{−1/2}P̃)
R_X^{1/2}R_X⁻¹RR_Y⁻¹R⊺R_X^{−1/2}P̃ = R_X^{1/2}R_X^{−1/2}P̃Λ   (multiplying both sides by R_X^{1/2})
R_X^{−1/2}RR_Y⁻¹R⊺R_X^{−1/2}P̃ = P̃Λ.   (26)

This shows that P̃ is the matrix of the eigenvectors of the symmetric matrix R_X^{−1/2}RR_Y⁻¹R⊺R_X^{−1/2}, which also implies that P̃⊺P̃ = I. The eigenvectors of the asymmetric matrix R_X⁻¹RR_Y⁻¹R⊺ are then recovered as P = R_X^{−1/2}P̃. A simple substitution shows that P is R_X-orthogonal:

P⊺R_X P = P̃⊺R_X^{−1/2}R_X R_X^{−1/2}P̃ = P̃⊺P̃ = I.   (27)

Solution from One Singular Value Decomposition

The weight matrices can also be obtained from a single singular value decomposition:

R_X^{−1/2}RR_Y^{−1/2} = P̃DQ̃⊺   with   P̃⊺P̃ = Q̃⊺Q̃ = I,

where P̃, Q̃, and D denote (respectively) the left singular vectors, the right singular vectors, and the diagonal matrix of the singular values of the matrix R_X^{−1/2}RR_Y^{−1/2}. The matrices P and Q (containing the vectors p and q) are then computed as

P = R_X^{−1/2}P̃   and   Q = R_Y^{−1/2}Q̃.   (30)

From the Eigen-Decomposition to the Singular Value Decomposition

To show that p can be found from the eigen-decompositions from Eqs. 18 and 19, we first use the fact that p = R_X^{−1/2}p̃ to rewrite Eq. 18 as:

R_X⁻¹RR_Y⁻¹R⊺R_X^{−1/2}p̃ = δ²p,   (31)

then premultiplying both sides of Eq. 31 by R_X^{1/2} and simplifying (because R_X^{1/2}p = p̃) gives

R_X^{1/2}R_X⁻¹RR_Y⁻¹R⊺R_X^{−1/2}p̃ = δ²R_X^{1/2}p
⟺ R_X^{−1/2}RR_Y⁻¹R⊺R_X^{−1/2}p̃ = δ²p̃.   (32)
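The derivations above translate directly into a few lines of code. The following R sketch is not the authors' implementation (that is in the repository cited in the next section); it assumes the illustrative objects X, Y, RX, RY, and R defined in the earlier sketch, and that RX and RY are invertible (with more variables than observations, some regularization would be needed). The helper name mat_inv_sqrt and the objects Fx and Gy are likewise only illustrative.

## Inverse square root of a symmetric positive-definite matrix.
mat_inv_sqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
}

RXh <- mat_inv_sqrt(RX)            # RX^(-1/2)
RYh <- mat_inv_sqrt(RY)            # RY^(-1/2)

sv <- svd(RXh %*% R %*% RYh)       # RX^(-1/2) R RY^(-1/2) = Ptilde D Qtilde'
P     <- RXh %*% sv$u              # weight matrix for X (Eq. 30)
Q     <- RYh %*% sv$v              # weight matrix for Y (Eq. 30)
delta <- sv$d                      # canonical correlations

Fx <- X %*% P                      # latent variables F for X
Gy <- Y %*% Q                      # latent variables G for Y

## Sanity checks implied by Eqs. 7, 22, and 23 (up to rounding error):
## crossprod(Fx) and crossprod(Gy) are identity matrices, and
## diag(crossprod(Fx, Gy)) reproduces the canonical correlations delta.

The eigenvector route of Eqs. 18 and 19 (for example, eigen(solve(RX) %*% R %*% solve(RY) %*% t(R))) returns the same weight vectors up to scaling, with eigenvalues equal to the squared canonical correlations.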
An Example: The Colors and Grapes of Wines

To illustrate CCA we use the data set presented in Table 1. These data describe thirty-six red, rosé, or white wines produced in three different countries (Chile, Canada, and the USA) from several different grape varietals. These wines are described by two different sets of variables. The first set of variables (i.e., matrix X) describes the objective properties of the wines: Price, Acidity, Alcohol content, Sugar, and Tannin (in what follows we capitalize these descriptors). The second set of variables (i.e., matrix Y) describes the subjective properties of the wines as evaluated by a professional wine taster; it consists of ratings, on a nine-point rating scale, of eight aspects of taste – fruity, floral, vegetal, spicy, woody, sweet, astringent, acidic – plus an overall evaluation of the hedonic aspect of the wine (i.e., how much the taster liked the wine).

Canonical Correlation Analysis, Table 1 An example for CCA. Thirty-six wines are described by two sets of variables: objective descriptors (Matrix X) and subjective descriptors (Matrix Y). Only the first row is reproduced here; the table continues for the remaining thirty-five wines.

Wine  Origin  Color  Varietal | Price  Acidity  Alcohol  Sugar  Tannin | Fruity  Floral  Vegetal  Spicy  Woody  Sweet  Astringent  Acidic  Hedonic
CH01  Chile   Red    Merlot   | 11     5.33     13.80    2.75   559    | 6       2       1        4      5      3      5           4       2

The analysis of this example was performed using the statistical programming language R and is available to download from https://github.com/vguillemot/2Tables.

Figures 1, 2, and 3 show heatmaps of the correlation matrices R_X, R_Y, and R. As shown in Fig. 3 (for matrix R), the objective variables Alcohol and Tannin are positively correlated with the perceived qualities astringent and woody; by contrast, the perceived hedonic aspect of the wine is negatively correlated with Alcohol, Tannin (and Price, so our taster liked inexpensive wines) and positively correlated with the sugar content of the wines. Unsurprisingly, the objective amount of Sugar is correlated with the perceived quality sweet.

The CCA of these data found five pairs of latent variables [in general, CCA will find a maximum of min(I, J) pairs of latent variables]. The values of the canonical correlations are shown in Fig. 4. The first and second canonical correlations are very high (.98 and .85, see Table 2), and so we will only consider them here. As shown in Figs. 5 and 6, the latent variables extracted by the analysis are very sensitive to "the color" of the wines: The first pair of latent variables (Fig. 5) isolates the red wines, whereas the second pair of latent variables (Fig. 6) roughly orders the wines according to their concentration of red pigment (i.e., white, rosé, and red; similar plots using the grape varietal or the origin of the wines did not show any interesting patterns and are therefore not shown).

To understand the contribution of the variables of X and Y to the latent variables, two types of plots are used: (1) a plot of the correlations between the latent variables and the original variables and (2) a plot of the loadings of the variables. Figure 7 (respectively, Figure 8) shows the correlations between the original variables (both X and Y) and F, the latent variables from X (respectively, G, the latent variables from Y). Figures 9 and 10 display the loadings for, respectively, X and Y (i.e., matrices P and Q) for the first two dimensions of the analysis. Together these figures indicate that the first dimension reflects the negative correlation between Alcohol (from X) and the subjective hedonic evaluation of the wines (from Y), whereas the second dimension combines (low) Alcohol, (high) Acidity, and (high) Sugar (from X) to reflect their correlations with the subjective variables astringent and hedonic. Figures 7 and 8 show very similar pictures (because the latent variables are very correlated), and this suggests that the first pair of latent variables opposes "bitterness" (i.e., astringent, alcohol, etc.) to sweetness, whereas the second pair of latent variables opposes bitterness (from astringent) to the "burning" effect of Alcohol.

Variations over CCA

By imposing normalization and orthogonality conditions slightly different from the ones described in Eqs. 7 and 21, different (but related) alternative methods can be defined to analyze two data tables.

Inter-Battery Analysis (IBA) Et Alia

The oldest alternative – originally proposed by Tucker in 1958 – called inter-battery analysis (IBA) (Tucker 1958), is also known under a variety of different names such as coinertia analysis (Dolédec and Chessel 1994), partial least square SVD (PLSVD) (Bookstein 1994), partial least square correlation (PLSC) (Krishnan et al. 2010;
Abdi and Williams 2013), singular value decomposition of the covariance between two fields (Bretherton et al. 1992), maximum covariance analysis (von Storch and Zwiers 2002), or even, recently, "multivariate genotype-phenotype" (MGP) analysis (Mitteroecker et al. 2016). It is particularly popular in brain imaging and related domains (McIntosh et al. 1996). In IBA (like in CCA), the latent variables are linear combinations of the columns of X and Y, but instead of having maximum correlation (as described in Eqs. 5 and 7), the latent variables are required to have maximum covariance. So we are looking for vectors p and q satisfying:

δ = arg max_{p, q} {cov(f, g)} = arg max_{p, q} {f⊺g = p⊺Rq}   (34)

under the constraints that

p⊺p = 1 = q⊺q.   (35)

The solution is then given by the singular value decomposition of R:

R = PDQ⊺ = Σ_ℓ δℓ pℓ qℓ⊺   with   P⊺P = Q⊺Q = I.   (36)

Canonical Correlation Analysis, Fig. 1 Heatmap of correlation matrix R_X (i.e., between the variables of matrix X)
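As a contrast with the CCA code sketched earlier, the following R lines sketch the IBA/PLSC solution of Eqs. 34–36, again using the illustrative simulated X, Y, and R from above (the object names are not from the entry): the weights now come from the plain SVD of R and are orthonormal rather than R_X- and R_Y-orthogonal.

sv_iba    <- svd(R)          # R = P D Q' (Eq. 36)
P_iba     <- sv_iba$u        # t(P_iba) %*% P_iba = I
Q_iba     <- sv_iba$v        # t(Q_iba) %*% Q_iba = I
delta_iba <- sv_iba$d        # singular values = maximized covariances
F_iba     <- X %*% P_iba     # latent variables for X
G_iba     <- Y %*% Q_iba     # latent variables for Y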
Canonical Correlation Analysis, Fig. 2 Heatmap of correlation matrix R_Y (i.e., between the variables of matrix Y)
Canonical Correlation Analysis, Fig. 3 Heatmap of correlation matrix R (i.e., between the variables of matrices X and Y)
Canonical Correlation Analysis, Fig. 4 Barplot of the canonical correlations (i.e., correlations between pairs of latent variables for a given dimension)
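Displays of this kind can be approximated with base R graphics. The lines below are only a sketch (the published figures were produced differently) and reuse the illustrative RX, RY, R, and delta objects from the earlier sketches.

heatmap(RX, Rowv = NA, Colv = NA, scale = "none", main = "RX")  # cf. Fig. 1
heatmap(RY, Rowv = NA, Colv = NA, scale = "none", main = "RY")  # cf. Fig. 2
heatmap(R,  Rowv = NA, Colv = NA, scale = "none", main = "R")   # cf. Fig. 3
barplot(delta, names.arg = seq_along(delta),
        xlab = "Dimensions", ylab = "Canonical Correlations")   # cf. Fig. 4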
Canonical Correlation Analysis, Table 2 An example for CCA. Canonical correlations (δℓ) and loadings for matrices X (objective descriptors, loading matrix P) and Y (subjective descriptors, loading matrix Q) for the five dimensions extracted by CCA

                     Matrix P: Objective                       Matrix Q: Subjective
Dimension  δℓ   Price  Acidity  Alcohol  Sugar  Tannin | Fruity  Floral  Vegetal  Spicy  Woody  Sweet  Astringent  Acidic  Hedonic
1          .98  0.026  0.088    0.489    0.134  0.002  | 0.000   0.150   0.030    0.068  0.109  0.082  0.095       0.059   0.586
2          .85  0.062  0.306    0.927    0.188  0.006  | 0.274   0.247   0.150    0.114  0.307  0.285  1.688       0.085   1.179
3          .65  0.024  0.678    0.243    0.288  0.001  | 0.044   0.605   0.366    0.024  0.390  0.133  0.465       0.083   0.327
4          .48  0.057  0.574    1.382    0.428  0.002  | 0.565   0.368   0.076    0.427  0.505  0.584  0.531       0.423   0.870
5          .22  0.157  0.360    0.131    0.627  0.000  | 0.744   0.205   0.087    0.877  0.290  0.568  0.296       0.065   0.185
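Base R also ships a canonical correlation routine. The sketch below applies stats::cancor() to the simulated X and Y of the earlier sketches; applied instead to the two wine tables of Table 1, cc$cor should reproduce the canonical correlations δℓ of Table 2. Note that cancor() centers but does not normalize the columns, so its coefficient matrices are scaled differently from the loadings P and Q reported there.

cc <- cancor(X, Y)   # stats::cancor
cc$cor               # canonical correlations (should match delta above)
cc$xcoef             # weights for the X variables (analogous to P, up to scaling)
cc$ycoef             # weights for the Y variables (analogous to Q, up to scaling)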
Canonical Correlation Analysis, Fig. 5 CCA. Latent variables: first latent variable from X (1st LV, objective properties) plotted against the first latent variable from Y (1st LV, subjective properties)

Canonical Correlation Analysis, Fig. 6 CCA. Latent variables: second latent variable from X (2nd LV, objective properties) plotted against the second latent variable from Y (2nd LV, subjective properties)
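Latent-variable displays in the style of Figs. 5 and 6 only require plotting matching columns of F and G against each other; a minimal sketch, using the illustrative Fx and Gy computed in the earlier CCA sketch (with the wine data the axes would correspond to the objective and subjective properties):

plot(Fx[, 1], Gy[, 1], xlab = "1st LV (X)", ylab = "1st LV (Y)")  # cf. Fig. 5
plot(Fx[, 2], Gy[, 2], xlab = "2nd LV (X)", ylab = "2nd LV (Y)")  # cf. Fig. 6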
RA can be interpreted as searching for the best predictive linear combinations of the columns of X or, equivalently, RA searches for the subspace of X where the projection of Y has the largest variance.

In the regression-oriented variants (e.g., PLS regression), the first latent variable from X is used, in a regression step, to predict Y. After the first latent variable has been used, its effect is partialled out of X and Y, and the procedure is re-iterated to find subsequent latent variables and loadings.
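The redundancy-analysis (RA) description above can be made concrete with a hedged sketch of one standard formulation (not spelled out in this entry): with X and Y centered and normalized as before, the first RA weight vector maximizes the summed squared correlations between Xp and the columns of Y, which amounts to an eigen-analysis of R_X⁻¹RR⊺.

## Sketch of one standard redundancy-analysis computation (illustrative names).
ra   <- eigen(solve(RX) %*% R %*% t(R))
p_ra <- Re(ra$vectors[, 1])        # first RA weight vector (up to scaling)
t_ra <- X %*% p_ra                 # corresponding latent variable of X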
Canonical Correlation Analysis, Fig. 9 Loadings of the second LV versus the first LV for matrix X

Canonical Correlation Analysis, Fig. 10 Loadings of the second LV versus the first LV for matrix Y
In a group matrix, the rows represent the observations (just like in plain CCA) and the columns represent a set of exclusive groups (i.e., an observation belongs to one and only one group). The group assigned to an observation has a value of 1 in the row representing this observation, and all the other columns for this observation (representing the groups not assigned to this observation) have a value of 0. When both X and Y are noncentered and nonnormalized group matrices, the CCA of these matrices gives correspondence analysis – a technique developed to analyze contingency tables (see the entry on correspondence analysis and, e.g., Greenacre 1984). When X and Y are composed of the concatenation of several noncentered and nonnormalized group matrices, the CCA of these two tables is equivalent to partial least squares correspondence analysis (PLSCA; Beaton et al. 2016) – a technique originally developed to analyze the information shared by two tables storing qualitative data. In the particular case when X is composed of the concatenation of several noncentered and nonnormalized group matrices, the CCA of X with itself is equivalent to multiple correspondence analysis.

Some Other Particular Cases of CCA

CCA is a very general method, and so a very large number of methods are particular cases of CCA (Abdi 2003). For example, when Y has only one column, CCA becomes (simple and multiple) linear regression. If X is a group matrix and Y stores one quantitative variable, CCA becomes analysis of variance. If Y is a group matrix, CCA becomes discriminant analysis. This versatility of CCA makes it of particular theoretical interest.

Key Applications

CCA and its derivatives – or variations thereof – are used whenever the analytic problem is to relate two data tables, and this makes these techniques ubiquitous in almost any domain of inquiry, from marketing to brain imaging and network analysis (see Abdi et al. 2016 for examples).
Future Directions

CCA is still a domain of intense research, with future developments likely to be concerned with multi-table extensions (e.g., Horst 1961; Tenenhaus et al. 2014), "robustification," and sparsification (Witten et al. 2009). All these approaches will make CCA and its related techniques even more suitable for the analysis of the very large data sets that are becoming prevalent in analytics.

Cross-References

▶ Barycentric Discriminant Analysis
▶ Correspondence Analysis
▶ Eigenvalues, Singular Value Decomposition
▶ Iterative Methods for Eigenvalues/Eigenvectors
▶ Least Squares
▶ Matrix Algebra, Basics of
▶ Matrix Decomposition
▶ Principal Component Analysis
▶ Regression Analysis
▶ Spectral Analysis

References

Abdi H (2003) Multivariate analysis. In: Lewis-Beck M, Bryman A, Futing T (eds) Encyclopedia for research methods for the social sciences. Sage, Thousand Oaks, pp 699–702
Abdi H (2007a) Singular value decomposition (SVD) and generalized singular value decomposition (GSVD). In: Salkind NJ (ed) Encyclopedia of measurement and statistics. Sage, Thousand Oaks, pp 907–912
Abdi H (2007b) Eigen-decomposition: eigenvalues and eigenvectors. In: Salkind NJ (ed) Encyclopedia of measurement and statistics. Sage, Thousand Oaks, pp 304–308
Abdi H (2010) Partial least squares regression and projection on latent structure regression (PLS regression). Wiley Interdisciplinary Reviews: Computational Statistics 2:97–106
Abdi H, Williams LJ (2013) Partial least squares methods: partial least squares correlation and partial least square regression. Computational Toxicology II:549–579
Abdi H, Vinzi VE, Russolillo G, Saporta G, Trinchera L (eds) (2016) The multiple facets of partial least squares methods. Springer Verlag, New York
Bartlett MS (1948) External and internal factor analysis. Br J Psychol 1:73–81
Beaton D, Dunlop J, Abdi H, ADNI (2016) Partial least squares correspondence analysis: a framework to simultaneously analyze behavioral and genetic data. Psychol Methods 21:621–651
Bookstein F (1994) Partial least squares: a dose response model for measurement in the behavioral and brain sciences. Psycoloquy 5(23)
Bretherton CS, Smith C, Wallace JM (1992) An intercomparison of methods for finding coupled patterns in climate data. J Clim 5:541–560
Dolédec S, Chessel D (1994) Co-inertia analysis: an alternative method for studying species-environment relationships. Freshw Biol 31:277–294
Fortier JJ (1966) Simultaneous linear prediction. Psychometrika 31:369–381
Gittins R (2012) Canonical analysis: a review with applications in ecology. Springer Verlag, New York
Greenacre MJ (1984) Theory and applications of correspondence analysis. Academic Press, London
Grellmann C, Bitzer S, Neumann J, Westlye LT, Andreassen OA, Villringer A, Horstmann A (2015) Comparison of variants of canonical correlation analysis and partial least squares for combined analysis of MRI and genetic data. NeuroImage 107:289–310
Horst P (1961) Relations among m sets of measures. Psychometrika 26:129–149
Hotelling H (1935) The most predictable criterion. J Educ Psychol 26:139–142
Hotelling H (1936) Relations between two sets of variates. Biometrika 28:321–377
Krishnan A, Williams LJ, McIntosh AR, Abdi H (2010) Partial least squares (PLS) methods for neuroimaging: a tutorial and review. NeuroImage 56:455–475
Mardia KV, Kent JT, Bibby JM (1980) Multivariate analysis. Academic Press, London
McIntosh AR, Bookstein FL, Haxby JV, Grady CL (1996) Spatial pattern analysis of functional brain images using partial least squares. NeuroImage 3:143–157
Mitteroecker P, Cheverud JM, Pavlicev M (2016) Multivariate analysis of genotype-phenotype association. Genetics. doi:10.1534/genetics.115.181339
Rao CR (1964) The use and interpretation of principal component analysis in applied research. Sankhyā A26:329–358
von Storch H, Zwiers FW (2002) Statistical analysis in climate research. Cambridge University Press, Cambridge
Takane Y (2013) Constrained principal component analysis and related techniques. CRC Press, Boca Raton
Tenenhaus M (1998) La régression PLS: théorie et pratique. Editions Technip, Paris
Tenenhaus A, Philippe C, Guillemot V, Le Cao KA, Grill J, Frouin V (2014) Variable selection for generalized canonical correlation analysis. Biostatistics 15:569–583
Tucker LR (1958) An inter-battery method of factor analysis. Psychometrika 23:111–136
Van Den Wollenberg AL (1977) Redundancy analysis: an alternative for canonical correlation analysis. Psychometrika 42:207–219
Witten DM, Tibshirani R, Hastie T (2009) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10:515–534