An Introduction to Algebraic Statistics with Tensors
UNITEXT - La Matematica per il 3+2
Volume 118
Editor-in-Chief
Alfio Quarteroni, Politecnico di Milano, Milan, Italy; EPFL, Lausanne, Switzerland
Series Editors
Luigi Ambrosio, Scuola Normale Superiore, Pisa, Italy
Paolo Biscari, Politecnico di Milano, Milan, Italy
Ciro Ciliberto, Università di Roma “Tor Vergata”, Rome, Italy
Camillo De Lellis, Institute for Advanced Study, Princeton, NJ, USA
Victor Panaretos, Institute of Mathematics, EPFL, Lausanne, Switzerland
Wolfgang J. Runggaldier, Università di Padova, Padova, Italy
The UNITEXT – La Matematica per il 3+2 series is designed for undergraduate
and graduate academic courses, and also includes advanced textbooks at a research
level. Originally released in Italian, the series now publishes textbooks in English
addressed to students in mathematics worldwide. Some of the most successful
books in the series have evolved through several editions, adapting to the evolution
of teaching curricula.
An Introduction to Algebraic Statistics with Tensors

Cristiano Bocci
Dipartimento di Ingegneria dell’Informazione e Scienze Matematiche
Università di Siena, Siena, Italy

Luca Chiantini
Dipartimento di Ingegneria dell’Informazione e Scienze Matematiche
Università di Siena, Siena, Italy
Cover illustration: A decomposable 3-dimensional tensor of type 3 × 5 × 2.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
We, the authors, dedicate this book to our
great friend Tony Geramita. When the project
started, Tony was one of the promoters and
he should be among us in the list of authors
of the text. Tony passed away when the book
was at an early stage. We finished the book
following the pattern traced in collaboration
with him, and we always felt as if his
encouragement to continue the project never
faded.
Preface
The initial concern of Classical Statistics is the behavior of one random variable X.
Usually X is identified with a function with values in the real numbers. This is
clearly an approximation. For example, if one records the height of the members of
a population, it is unlikely that the measure goes much further than the second
decimal digit (assume that the unit is 1 m). So, the corresponding graph is a
histogram, with a basic interval of 0.01 m. This is translated into a continuous
variable by sending the length of the basic interval to zero (as the size of the
population increases).
vii
viii Preface
For random variables of this type, the first natural distribution that one expects is
the celebrated Gaussian distribution, which corresponds to the function
X(t) = (1 / (σ √(2π))) · e^(−(t−μ)² / (2σ²)),

where μ and σ are parameters which describe the shape of the curve (of course,
other types of distributions are possible, in connection with special behaviors of the
random variable X(t)).
The first goal of Classical Statistics is the study of the shape of the function X(t),
together with the determination of its numerical parameters.
When two or more variables are considered in the framework of Classical
Statistics, their interplay can be studied with several techniques. For instance, if we
consider both the heights and the weights of the members of a population and our
goal is a proof of the (obvious) fact that the two variables are deeply connected,
then we can consider the distribution over pairs (height, weight), which is repre-
sented by a bivariate Gaussian, in order to detect the existence of the connection.
The starting point of Algebraic Statistics is quite different. Instead of considering
variables as continuous functions, Algebraic Statistics prefers to deal with a finite
(and possibly small) range of values for the variable X. So, Algebraic Statistics
emphasizes the discrete nature of the starting histogram, and tends to group together
values in wider ranges, instead of splitting them. A distribution over the variable X
is thus identified with a discrete function (to begin with, over the integers).
Algebraic Statistics is rarely interested in situations where just one random
variable is concerned.
Instead, networks containing several random variables are considered and some
relevant questions raised in this perspective are
• Are there connections between the two or more random variables of the
network?
• Which kind of connection is suggested by a set of data?
• Can one measure the complexity of the connections in a given network of
interacting variables?
Since, from the new point of view, we are interested in determining the relations
between discrete variables, in Algebraic Statistics a distribution over a set of
variables is usually represented by matrices, when two variables are involved, or
multidimensional matrices (i.e., tensors), as the number of variables increases.
It is a natural consequence of the previous discussion that while the main
mathematical tools for Classical Statistics are based on multivariate analysis and
measure theory, the underlying mathematical machinery for Algebraic Statistics is
principally based on the Linear and Multi-linear Algebra of tensors (over the
integers, at the start, but quickly one considers both real and complex tensors).
Just to give an example, let us consider the behavior of a population after the
introduction of a new medicine.
Assume that a population is affected by a disease, which dangerously alters the
value of a glycemic indicator in the blood. This dangerous condition is partially
treated with the new drug. Assume that the purpose of the experiment is to detect
the existence of a substantial improvement in the health of the patients.
In Classical Statistics, one considers the distribution of the random variable X1 =
the value of the glycemic indicator over a selected population of patients before the
delivery of the drug, and the random variable X2 = the value of the glycemic
indicator of patients after the delivery of the drug. Both distributions are likely to be
represented by Gaussians, the first one centered at an abnormally high value
of the glycemic indicator, the second one centered at a (hopefully) lower value. The
comparison between the two distributions aims to detect if (and how far) the descent
of the recorded values of the glycemic indicator is statistically meaningful, i.e., if it
can be distinguished from the natural underlying ground noise. The celebrated
Student’s t-test is the world-accepted tool for comparing the means of two Gaussian
distributions and for determining the existence of a statistically significant response.
In many experiments, the response variable is binary or categorical with k levels,
leading to a 2 × 2, or a 2 × k, contingency table. Moreover, when there is more
than one response variable and/or other control variables, the resulting data are
summarized in a multiway contingency table, i.e., a tensor.
This structure may also come from the discretization of a continuous variable.
As an example, consider a population divided into two subsets, one of which is
treated with the drug while the other is treated with traditional methods. Then, the
values of the glycemic indicator are divided into classes (in the roughest case just
two classes, i.e., a threshold which separates two classes is established). After some
passage of time, one records the distribution of the population in the four resulting
categories (treated + under-threshold, treated + over-threshold, . . .) which determines
a 2 × 2 matrix, whose properties encode the existence of a relation between
the new treatment and an improved normalization of the value of the glycemic
indicator (this is just to give an example: in the real world, a much more sophis-
ticated analysis is recommended!).
Another celebrated model, which is different from the Gaussian distribution and is
often introduced at the beginning of a course in Statistics, is the so-called Bernoulli
model over one binary variable.
Assume we are given an object that can assume only two states. A coin, with the
two traditional states H (heads) and T (tails), is a good representation. One has to
bear in mind, however, that in the real world, binary objects usually correspond to
biased coins, i.e., coins for which the expected distribution over the two states is not
even.
If p is the probability of obtaining a given result (say H) by throwing the coin, then one
can roughly estimate p by throwing the coin several times and determining the ratio
of heads over the total number of throws, but this is usually considered too naïve.
Instead, one divides the total set of throws into several packages, each consisting
of r throws, and determines for how many packages, denoted q(t), one obtained H
exactly t times. The value of the constant p is thus determined by Bernoulli’s formula:

q(t) = C(r, t) · p^t · (1 − p)^(r−t),

where C(r, t) denotes the binomial coefficient.
By increasing the number of total throws (and thus increasing the number of
packages and the number of throws r in each package), the function q(t) tends to a
real function, which can be treated with the usual analytic methods.
Notice that in this way, at the end of the process, the discrete variable Coin is
substituted by a continuous variable q(t). Usually one even goes one step further,
by substituting the variable q with its logarithm, ending up with a linear description.
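The package-counting procedure just described can be made concrete in a few lines of code. The following Python sketch (with an arbitrarily chosen bias p = 0.3 and package size r = 10, both hypothetical values used only for illustration) simulates the throws, records the counts q(t) and compares their frequencies with Bernoulli's formula.

```python
import random
from math import comb

p_true = 0.3        # hypothetical bias of the coin (used only to simulate data)
r = 10              # number of throws in each package
n_packages = 5000   # number of packages

# q[t] = number of packages in which H appeared exactly t times
q = [0] * (r + 1)
for _ in range(n_packages):
    heads = sum(random.random() < p_true for _ in range(r))
    q[heads] += 1

# rough estimate of p: overall ratio of heads over the total number of throws
p_hat = sum(t * q[t] for t in range(r + 1)) / (r * n_packages)

# compare the empirical frequencies q(t)/n_packages with Bernoulli's formula
for t in range(r + 1):
    bernoulli = comb(r, t) * p_hat**t * (1 - p_hat)**(r - t)
    print(t, q[t] / n_packages, round(bernoulli, 4))
```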
Algebraic Statistics is scarcely interested in knowing how a single given coin is
biased. Instead, the main goal of Algebraic Statistics is to understand the connec-
tions between the behavior of two coins. Or, better, the connections between the
behavior of a collection of coins.
Consequently, in Algebraic Statistics one defines a collection of variables, one
for each coin, and defines a distribution by counting the records in which the
variables X1, X2, . . . , Xn have a fixed combination of states. The distribution is
thus transformed into a tensor of type 2 × 2 × · · · × 2 (one factor for each coin). All coins can be biased, with
different loads: this does not matter too much. In fact, the main questions that one
expects to solve are
• Are there connections between the outputs of two or more coins?
• Which kind of connection is suggested by the distribution?
• Can one divide the collection of coins into clusters, such that the behavior of
coins of the same cluster is similar?
Answers are expected from an analysis of the associated tensor, i.e., in the
framework of Multi-linear Algebra.
The importance of the last question can be better understood if one replaces
coins with positions in a composite digital signal. Each position has, again, two
possible states, 0 and 1. If the signal is the result of the superposition of many
elementary signals, coming from different sources, and digits coming from the same
source behave similarly, then the division of the signal into clusters yields the
reconstruction of the original message that each source issued.
Of course, the separation of several phenomena that are mixed together in a given
distribution is also possible using methods of Classical Statistics.
In a famous analysis of 1894, the biologist Karl Pearson made a statistical study
of the shape of a population of crabs (see [1]). He constructed the histogram for the
ratio between the “forehead” breadth and the body length for 1000 crabs, sampled
in Naples, Italy by W. F. R. Weldon. The resulting approximating curve was quite
different from a Gaussian and presented a clear asymmetry around the average
value. The shape of the function suggested the existence of two distinct types of
crab, each determining its own Gaussian, that were mixed together in the observed
histogram. Pearson succeeded in separating the two Gaussians with the method of
moments. Roughly speaking, he introduced new statistical variables, induced by the
same collection of data, and separated the types by studying the interactions
between the Gaussians of these new variables.
This is the first instance of a computation which takes care of several parameters
of the population under analysis, though the variables are derived from the same set
of data. Understanding the interplay between the variables provides the funda-
mental step for a qualitative description of the population of crabs.
From the point of view of Algebraic Statistics, one could obtain the same
description of the two types which compose the population by adding variables
representing other ratios between lengths in the body of crabs, and analyzing the
resulting tensor.
Mixture Models
Summarizing, Algebraic Statistics becomes useful when the existence and the
nature of the relations between several random variables are explored.
We stress that knowing the shape of the interaction between random variables is
a central problem for the description of phenomena in Biology, Chemistry, Social
Sciences, etc. Models for the description of the interactions are often referred to as
Mixture Models. Thus, mixture models are a fundamental object of study in
Algebraic Statistics.
Perhaps the most famous and easily described mixture models are the Markov
chains, in which the set of variables is organized in a totally ordered chain, and the
behavior of the variable Xi is only influenced by the behavior of the variable X(i−1)
(usually, this interaction depends on a given matrix).
Of course, much more complicated types of networks are expected when the
complexity of the collection of variables under analysis increases. So, when one
studies composite signals in the real world, or pieces of a DNA chain, or regions in
a neural tissue, higher level models are likely to be necessary for an accurate
description of the phenomenon.
One thus moves from the study of Markov chains

[Figure: a Markov chain, with variables X1, X2, X3, X4 linked in a row by matrices M1, M2, M3]

to the study of more structured networks, such as trees

[Figure: a tree with root X1, joined by matrices M1, M2 to X2, X3, which are in turn joined by matrices M3, M4, M5, M6 to X4, X5, X6, X7]

or networks containing loops

[Figure: variables X1, X2, X3 joined in a cycle by matrices M1, M2, M3].
Conclusion
The way we think about Algebraic Statistics focuses on aspects of the theory of
random variables which are different from the targets of Classical Statistics. This is
reflected in the point of view introduced in the book. Our general setting differs
from the classical one and is closer to the one implicitly introduced in the books of
Pachter and Sturmfels [2] and Sullivant [3]. Our aim is not to create a new formulation
of the whole statistical theory, but only to present a natural algebraic way
in which Statistics can handle problems related to mixture models.
The discipline is currently living in a rapidly expanding network of new insights
and new areas of application. Our knowledge of what we can do in this area is
constantly increasing and it is reasonable to hope that many of the problems
introduced in this book will soon be solved or, if they cannot be solved completely,
then they will at least be better understood. We feel that the time is right to provide
a systematic foundation, with special attention to the application of tensor theory,
for a field that promises to act as a stimulus for mathematical research in Statistics,
and also as a source of suggestions for further developments in Multi-linear Algebra
and Algebraic Geometry.
Acknowledgements The authors want to warmly thank Fabio Rapallo, who made several fruitful
remarks and suggestions to improve the exposition, especially regarding the connections with
Classical Statistics.
References
1. Pearson K.: Contributions to the mathematical theory of evolution. Phil. Trans. Roy. Soc.
London A, 185, 71–110 (1894)
2. Pachter, L., Sturmfels, B.: Algebraic Statistics for Computational Biology. Cambridge
University Press, New York (2005)
3. Sullivant, S.: Algebraic Statistics. Graduate Studies in Mathematics, vol. 194, AMS,
Providence (2018)
Part I
Algebraic Statistics
Chapter 1
Systems of Random Variables and Distributions
This section contains the basic definitions with which we will construct our statistical
theory.
It is important to point out right away that in the field of Algebraic Statistics, a
still rapidly developing area of study, the basic definitions are not yet standardized.
Therefore, the definitions which we shall use in this text can differ significantly (more
in form than in substance) from those of other texts.
1.1 Systems of Random Variables

Definition 1.1.1 A random variable is a variable x taking values in a finite non-
empty set of symbols, denoted A(x). The set A(x) is called the alphabet of x or the
set of states of x. We will say that every element of A(x) is a state of the variable x.
A system of random variables (or random system) S is a finite set of random
variables.
The condition of finiteness, required both for the alphabet of a random variable
and the number of variables of a system, is typical of Algebraic Statistics. In other
statistical situations this hypothesis is often not present.
Definition 1.1.2 A subsystem of a system S of random variables is a system defined
by a subset S′ ⊂ S.
Example 1.1.3 The simplest examples of a system of random variables are those
containing a single random variable. A typical example is obtained by thinking
of a die x as a random variable, i.e. as the unique element of S. Its alphabet is
A(x) = {1, 2, 3, 4, 5, 6}.
Another familiar example comes by thinking of the only element of S as a coin c
with alphabet A(c) = {H, T } (heads and tails).
Example 1.1.4 On internet sites about soccer betting one finds systems in which
each random variable has three states. More precisely, the set S of random variables
consists of (say) all the professional soccer games in a given country. For each random
variable x (i.e., each game), its alphabet is A(x) = {1, 2, T}. The random variable takes
value “1” if the game was won by the home team, value “2” if the game was won by
the visiting team and value “T” if the game was a tie.
Example 1.1.5 (a) We can, similarly to Example 1.1.3, construct a system S with two
random variables, namely with two dice {x1 , x2 }, both having alphabet A(xi ) =
{1, 2, 3, 4, 5, 6}.
(b) An example of another system of random variables T , closely related to
the previous one but different, is given by taking a single random variable
as the ordered pair of dice x = (x1 , x2 ) and, as alphabet A(x), all possible
values obtained by throwing the dice simultaneously: {(1, 1), . . . (1, 6), . . . ,
(6, 1), (6, 2), . . . , (6, 6)}.
(c) Another example W , still related to the two above (but different), is given by
taking as system the unique random variable the set consisting of two dice
z = {x1 , x2 } and as alphabet, A(z), the sum of the values of the two dice after
throwing them simultaneously: A(z) = {2, 3, 4, . . . , 12}.
Remark 1.1.6 The random variables of the systems S, T and W might seem, at first
glance, to be the same, but it is important to make clear that they are very different.
In (a) there are two random variables while in (b) and (c) there is only one. Also
notice that in T we have chosen an ordering of the two dice, while in W the random
variable is an unordered set of two dice. With example (a) there is nothing stopping
us from throwing the die x1 , say, twenty times and the die x2 ten times. However, in
both (b) and (c) the dice are each thrown the same number of times.
Example 1.1.7 There are many naturally occurring examples of systems with many
random variables. In fact, some of the most significant ones come from applications
in Economics and Biology and have an astronomical number of variables.
For example, in Economics and in market analysis, there are systems with one
random variable for each company which trades in a particular market. It is easy to
see that, in this case, we can have thousands, even tens of thousands, of variables.
In Biology, very important examples come from studying systems in which the
random variables represent hundreds (or thousands) of positions in the DNA sequence
of one or several species. The alphabet of each variable consists of the four basic
ingredients of DNA: Adenine, Cytosine, Guanine and Thymine. As a shorthand
notation, one usually denotes the alphabet of such random variables as {A, C, G, T }.
In this book, we will refer to the systems arising from DNA sequences, as DNA-
systems.
Example 1.1.8 For cultural reasons (one of the authors was born and lives in Siena!),
we will have several examples in the text of systems describing probabilistic events
related to the famous and colourful Sienese horse race called the Palio di Siena.
Horses which run in the Palio represent the various medieval neighbourhoods of the
city (called contrade) and the Palio is a substitute for the deadly feuds which existed
between the various sections of the city.
The names of the neighbourhoods, with a shorthand letter abbreviation for each of
them, can be found in Table 1.1 below.
Remark 1.1.15 It is very important to notice that the definition of the total correlation
uses the concept of cartesian product. Moreover the concept of cartesian product
requires that we fix an ordering of the variables in S.
Thus, the total correlation of a system is not uniquely determined, but it changes
as the chosen ordering of the random variables changes.
It is easy to see, however, that all the possible total correlations of the system S
are isomorphic.
Example 1.1.16 If S is a system with two coins c1 , c2 , each having alphabet {H, T },
then the only random variable in its total correlation has an alphabet with four
elements {(T, T ), (T, H ), (H, T ), (H, H )}, i.e. we have to distinguish between the
states (H, T ) and (T, H ). This is how the ordering of the coins enters into the
definition of the random variable (c1 , c2 ) of the total correlation.
Example 1.1.17 Let S be the system of random variables consisting of two dice, D1
and D2, each having as alphabet the set {1, 2, 3, 4, 5, 6}. The total correlation of this
system is the system with a unique random variable D = (D1, D2) and alphabet
the set {(i, j) | 1 ≤ i, j ≤ 6}. So, the alphabet consists of 36 elements.
Now let T be the system whose unique random variable is the set x = {D1, D2}
and whose alphabet consists of the eleven numbers {2, 3, . . . , 11, 12}.
We can consider the surjective morphism of systems φ from the total correlation
of S to T which takes its unique random variable D to the unique random variable x
of T and takes the element (i, j) of the alphabet of D to i + j in the alphabet of x.
Undoubtedly this morphism is familiar to us all!
1.2 Distributions
One of the basic notions in the study of systems of random variables is the idea of a
distribution. Making the definition of a distribution precise will permit us to explain
clearly the idea of an observation on the random variables of a system. This latter
concept is extremely useful for the description of real phenomena.
Definition 1.2.1 Let K be any set. A K -distribution on a system S with random
variables x1 , . . . , xn , is a set of functions D = {D1 , . . . , Dn }, where for 1 ≤ i ≤ n,
Di is a function from A(xi ) to K .
Remark 1.2.2 In most concrete examples, K will be a numerical set, i.e. some subset
of C (the complex numbers).
The usual use of the idea of a distribution is to associate to each state of a variable
xi in the system S, the number of times (or the percentage of times) such a state is
observed in a sequence of observations.
Example 1.2.3 Let S be the system having as unique random variable a coin c, with
alphabet A(c) = {T, H} (the coin need not be fair!).
Suppose we throw the coin n times and observe the state T exactly dT times and
the state H exactly d H times (dT + d H = n). We can use those observations to get an
N-distribution (N is the set of natural numbers), denoted Dc , where Dc : {T, H } → N
by
Dc (T ) = dT , Dc (H ) = d H .
Example 1.2.4 Now consider the system S with two coins c1 , c2 , and with alphabets
A(ci ) = {T, H }.
Again, suppose we simultaneously throw both coins n times and observe that the
first coin comes up with state T exactly d1 times and with state H exactly e1 times,
while the second coin comes up T exactly d2 times and comes up H exactly e2 times.
From these observations we can define an N-distribution, D = (D1 , D2 ), on S
defined by the functions
D1 : {T, H } → N, D1 (T ) = d1 , D1 (H ) = e1 ,
D2 : {T, H } → N, D2 (T ) = d2 , D2 (H ) = e2
Thus D can be identified with the element ((d1, e1), (d2, e2)) ∈ N^2 × N^2.
Example 1.2.6 Consider the DNA-system S with random variables precisely 100
fixed positions (or sites) p1 , . . . , p100 on the DNA strand of a given organism. As
usual, each variable has alphabet {A, C, G, T }. Since each alphabet has exactly four
members, the space of Z-distributions on S is D(S) = Z^4 × · · · × Z^4 (100 times) =
Z^400.
Suppose we now collect 1,000 organisms and observe which DNA component
occurs in site i. With the data so obtained we can construct a Z-distribution D =
{D1 , . . . , D100 } on S where Di associates to each of the members of the alphabet
A( pi ) = {A, C, G, T } the number of occurrences of the corresponding component
in the i-th position. Note that for each Di we have Di(A) + Di(C) + Di(G) + Di(T) = 1,000, the total number of organisms observed.
Remark 1.2.7 Suppose that S is a system with random variables x1 , . . . , xn and that
the cardinality of each alphabet A(xi ) is exactly ai . As we have said before, ai is
simply the number of states that the random variable xi can assume.
With this notation, the K-distributions on S can be seen as points in the space
K^{a1} × · · · × K^{an}.
Remark 1.2.8 If S is a system with two variables x1, x2, whose alphabets have
cardinality (respectively) a1 and a2, then the unique random variable in the total
correlation of S has a1 · a2 states; hence a K-distribution on the total correlation
can be identified with an a1 × a2 matrix with entries in K.
Example 1.2.10 In the city of Siena (Italy) two spectacular horse races have been
run every year since the seventeenth century, with a few interruptions caused by the
World Wars. Each race is called a Palio, and the Palio takes place in the main square
of the city. In addition, some extraordinary Palios have been run from
time to time. From the last interruption, which ended in 1945, up to now (2014), a
total number of 152 Palios have taken place. Since the main square is large, but not
enormous, not every contrada can participate in every Palio. There is a method, partly
based on chance, that decides whether or not a contrada can participate in a particular
Palio.
Table 1.1 Participation of the contrade at the 152 Palii (up to 2014)
x Name Dx (1) Dx (0) x Name Dx (1) Dx (0)
A Aquila 88 64 B Bruco 92 60
H Chiocciola 84 68 C Civetta 90 62
D Drago 95 57 G Giraffa 89 63
I Istrice 84 68 E Leocorno 99 52
L Lupa 89 63 N Nicchio 84 68
O Oca 87 65 Q Onda 84 68
P Pantera 96 56 S Selva 89 63
R Tartuca 91 61 T Torre 90 62
M Valdimontone 89 63
Let’s build a system with 17 boolean random variables, one for each contrada.
For each variable we consider the alphabet {0, 1}. The space of Z-distributions of
this system is Z^2 × · · · × Z^2 (17 times) = Z^34.
Let us define a distribution by indicating, for each contrada x, Dx (1) = number
of Palios where contrada x took part and Dx (0) = number of Palios where contrada
x did not participate. Thus we must always have Dx (0) + Dx (1) = 152. The data
are given in Table 1.1.
We see that the Leocorno (unicorn) contrada participated in the most Palios, while
the contrade Istrice (crested porcupine), Nicchio (conch), Onda (wave) and Chiocciola
(snail) participated in the fewest.
On the same system, we can consider another distribution E, where E x (1) =
number of Palios that contrada x won and E x (0) = number of Palios that contrada x
lost (non-participation is considered a loss). The Win-Loss table is given in Table 1.2.
From the two tables we see that more participation in the Palios does not neces-
sarily imply more victories.
Table 1.2 Win-Loss table of contrade at the 152 Palii (up to 2014)
x Name Ex(1) Ex(0) x Name Ex(1) Ex(0)
A Aquila 8 144 B Bruco 5 147
H Chiocciola 9 143 C Civetta 8 144
D Drago 11 141 G Giraffa 12 140
I Istrice 8 144 E Leocorno 9 143
L Lupa 5 147 N Nicchio 9 143
O Oca 14 138 Q Onda 9 143
P Pantera 8 144 S Selva 15 137
R Tartuca 10 142 T Torre 3 149
M Valdimontone 9 143
1.3 Measurements on a Distribution
We now introduce the concepts of sampling and scaling on a distribution for a system
of random variables.
Definition 1.3.1 Let D = (D1, . . . , Dn) be a K-distribution on a system S with
random variables x1, . . . , xn, where K is a set of numbers. For every variable xi, the number

c_D(xi) = Σ_{s ∈ A(xi)} Di(s)

is called the sampling of the variable xi in D. We will say that D has constant
sampling if all variables in S have the same sampling in D.
A K-distribution D on S is called probabilistic if each xi ∈ S has sampling equal
to 1.
Remark 1.3.2 Let S be a system with random variables {x1, . . . , xn} and let D =
(D1, . . . , Dn) be a K-distribution on S, where K is a numerical field.
If every variable xi has sampling c_D(xi) ≠ 0, we can obtain from D an associated
probabilistic distribution D̃ = (D̃1, . . . , D̃n) defined as follows:
for all i and for all states s ∈ A(xi) set

D̃i(s) = Di(s) / c_D(xi).
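As a small illustration of this remark, the following Python sketch (the data are made up for the example) computes the sampling of each variable and the associated probabilistic distribution.

```python
# A distribution on a system with two variables, one function (dict) per variable
D = {
    "coin": {"H": 30, "T": 70},
    "die":  {1: 10, 2: 10, 3: 20, 4: 20, 5: 20, 6: 20},
}

# sampling c_D(x): sum of the values of D_x over the states of x
sampling = {x: sum(Dx.values()) for x, Dx in D.items()}

# associated probabilistic distribution: divide each value by the sampling
D_tilde = {x: {s: v / sampling[x] for s, v in Dx.items()} for x, Dx in D.items()}

print(sampling)          # {'coin': 100, 'die': 100}
print(D_tilde["coin"])   # {'H': 0.3, 'T': 0.7}
```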
Remark 1.3.3 In Example 1.2.3, the probabilistic distribution associated to Dc (seen
as a Q-distribution) is given by D̃c(T) = dT/n and D̃c(H) = dH/n.
Convention. To simplify the notation in what follows and since we will always be
thinking of the set K as some set of numbers, usually clear from the context, we
won’t mention K again but will speak simply of a distribution on a system S of
random variables.
Warning. We want to remind the reader again that the basic notation in Algebraic
Statistics is far from being standardized. In particular, the notation for a distribution
is quite varied in the literature and in other texts.
E.g., if sij is the j-th state of the i-th variable xi of the system S, and D is a
distribution on S, we will write Di(sij) for the value of D on that state.
You will also find this number Di(sij) denoted by D_{xi = sij}.
Example 1.3.4 Suppose we have a tennis tournament with 8 players where a player
is eliminated as soon as that player loses a match. So, in the first set of matches four
players are eliminated and in the second two more are eliminated and then we have
the final match between the remaining two players.
For each player xi we take a boolean random variable with alphabet {0, 1} (1 = match
won, 0 = match lost), and we let D be the distribution which records, for every player,
the number of matches won and lost.
Clearly the sampling c(xi) of every player xi represents the number of matches
played. For example, c(xi) = 3 if and only if xi is a finalist, while c(xi) = 1 for the
four players eliminated at the end of the first match. Hence D is not a distribution
with constant sampling.
Notice that this distribution doesn’t have any variable with sampling equal to 0
and hence there is an associated probabilistic distribution D̃, which represents the
statistics of victories. For example, for the winner xk one has D̃k(0) = 0 and D̃k(1) = 1,
while for a player xj eliminated in the second round one has

D̃j(0) = D̃j(1) = 1/2.

For a player xi eliminated after the first match we have D̃i(0) = 1 and D̃i(1) = 0.
Note moreover that, given a scaling D′ of D, if D and D′ have the same sampling, then
they must coincide.
Remark 1.3.7 In the next chapters we will see that scaling doesn’t substantially
change a distribution. Using a projectivization method, we will consider two distri-
butions “equal” if they differ only by a scaling.
Proposition 1.3.8 Let f : S → T be a map of systems which is a bijection on the
sets of variables. Let D be a distribution on S and D′ a scaling of D. Then f∗D′ is a
scaling of f∗D.
Proof Let y be a variable of T and let t ∈ A(y). Since f is a bijection there is a
unique x ∈ S for which f(x) = y. Then by definition we have

(f∗D′)y(t) = Σ_{s∈A(x)} D′(s) = Σ_{s∈A(x)} λx D(s) = λx (f∗D)y(t).
1.4 Exercises
Exercise 1 Let us consider the random system associated with the tennis tourna-
ment, see Example 1.3.4.
Compute the probabilistic distribution for the finalist who did not win the tourna-
ment.
Compute the probabilistic distribution for a tournament with 16 participants.
Exercise 2 Let S be a random system with variables x1, . . . , xn and assume that all
the variables have the same alphabet A = {s1, . . . , sm}. Then one can create the dual
system S′ by taking s1, . . . , sm as variables, each si with alphabet X = {x1, . . . , xn}.
Determine the relation between the dimensions of the spaces of K-distributions of
S and S′.
Exercise 3 Let S be a random system and let S′ be a subsystem of S.
Determine the relation between the spaces of K-distributions of the total correlations
of S and S′.
Exercise 4 Let f : S → S′ be a surjective map of random systems.
Prove that if a distribution D on S has constant sampling, then the same is true
for f∗D.
Exercise 5 One can define a partial correlation over a system S, by connecting only
some of the variables.
For instance, if S has variables x1 , . . . , xn and m < n, one can consider the
partial correlation on the variables x1 , . . . , xm as a system T whose variables are
Y, xm+1 , . . . xn , where Y stands for the variable x1 × · · · × xm , with alphabet the
product A(x1 ) × · · · × A(xm ).
If S has variables c1 , c2 , c3 , all of them with alphabet {T, H } (see Example 1.1.3),
determine the space of K -distributions of the partial correlation T with random
variables c1 × c2 and c3 .
Chapter 2
Basic Statistics
2.1 Basic Probability

Example 2.1.1 Let’s consider the data associated with the Italian Serie A soccer
games of 2005/2006. There were 380 games played that season with 176 games won
by the home team, 108 games which resulted in a tie and the remaining 96 games
won by the visiting team.
Suppose now we want to construct a random system S with a unique random
variable x as in Example 1.1.4. Recall that the alphabet for x is A(x) = {1, 2, T }
where 1 means the game was won by the home team, 2 means the game was won by
the visiting team and T means the game ended in a tie. The 2005/2006 Season offers
us a distribution D such that Dx(1) = 176, Dx(T) = 108, Dx(2) = 96.
The associated probabilistic distribution is

D̃x(1) = 176/380 ≈ 46.3%, D̃x(T) = 108/380 ≈ 28.4%, D̃x(2) = 96/380 ≈ 25.3%.
So, making reference to a previous season, we have gathered some insight into
the probabilities to assign to the various states.
Before we continue our discussion of the probability of the states of a random
variable in a random system, we want to introduce some more terminology for
distributions.
Example 2.1.2 Let S be a system of random variables.
The equidistribution, denoted E, on S is the distribution that associates to every
state of every random variable in the system, the value 1.
Now suppose the random variables of S are denoted by x1, . . . , xr and that the
states of the variable xi are {si1, . . . , sini}; then the associated probabilistic distribution
of E, denoted Ẽ (see Remark 1.3.2), is defined by

Ẽxi(sij) = 1/ni for every j, 1 ≤ j ≤ ni.
Notice that this distribution is the one which gives us equal probability for all
states in the alphabet of a fixed random variable.
This same probability distribution is clearly obtained if we start with the distribution
cE, c ∈ R, c ≠ 0, which associates to each state of each variable in the system the
value c: in that case each variable xi has sampling c·ni, and the associated probabilistic
distribution takes the value c/(c·ni) = 1/ni on each of its states.
Clearly, the equidistribution has constant sampling if and only if all variables have
the same number of states.
Remark 2.1.3 There is a famous, but imprecise, formula for calculating the proba-
bility for something to occur. It is usually written as
(number of positive cases) / (number of possible cases).
The problem with this formula is best illustrated by the following example. Assume
that a person likes to play tennis very much. If he were asked: If you had a chance to
play Roger Federer for one point what is your probability of winning that point? Using
the formula above he would offer as a reply that there are two possible outcomes
and one of those has the person winning the point, so the probability is 0.5 (but
anyone who has seen this person play tennis would appreciate the absurdity of
that reply!).
The problem with the formula is that it is only a reasonable formula if all outcomes
are equally probable, i.e. if we have an equidistribution on the possible states—
something far from the case in the example above.
Example 2.1.4 The Palio of Siena usually takes place twice a year, in July and
August. Only 10 of the 17 contrade can participate in a Palio. How are the ten
chosen?
First, the 7 contrade which did not participate in the previous year’s Palio are
automatically chosen for this year’s Palio. The remaining three places are chosen,
by a draw, from among the 10 contrade which did participate in the previous year’s
Palio.
What is the probability that a contrada x, which participated in the previous year’s
Palio, will participate in the current year’s Palio?
To answer this question we construct two systems of random variables. The first
system, T , will have only one random variable, which we will denote by e (for
extraction). What are the elements in the alphabet A(e) of e? In particular, how
many states does e have? We need to choose 3 contrade from a set of 10 contrade.
We have 10 choices for the first extracted contrada, then 9 choices for the second
extracted contrada and finally 8 choices for the third extracted contrada. The total
is 10 · 9 · 8 = 720 states, where each state is an ordered triplet which tells us which
contrada was chosen first, which was chosen second and which was chosen third.
Since we will assume that the extractions will be made honestly, we can consider
the equidistribution on A(e) whose probability distribution associates to each triplet
in the alphabet the value 1/720.
Also the second system S has only one random variable corresponding to exactly
one of the ten contrade from which we will make the extraction. Such a variable,
which we will call c (where c really is the name of one of the contrade), is boolean.
Its alphabet can be considered as Z2 , where 1 signifies that c participates in the Palio
and 0 signifies that it does not participate.
Consider the map fc : T → S, sending e to c and each triplet t ∈ A(e) to 1 or
0, depending on whether or not c is in t.
The probability that c will participate in the Palio of next July is defined by the
probability distribution associated to D = ( f c )∗E on S.
D(1) is equal to the number of triplets containing c. How many of these triplets
are there?
The triplets where c is the first element are obtained by choosing the second
element among the other 9 contrade and the third element among the remaining 8,
hence we have 72 such triplets. Similarly for the number of triplets where c is in the
second or in the third position, for a total of 72 · 3 = 216 triplets. Hence D(1) = 216
and, obviously, D(0) = 720 − 216 = 504. Thus, the probability for a contrada c to
participate in the Palio of next July, assuming it already ran in the Palio of the previous
July, is

p = D(1) / (D(0) + D(1)) = 216/720 = 3/10 = 30%.
In this example we could use the intuitive formula since we assumed the 720
possible states for the random variable e were equally distributed. With this in mind,
the previous procedure yields a general construction which, for elementary situations,
is enough to compute the probability that a given event occurs.
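The counting in this example can also be checked by brute force; the following Python sketch (with the contrade simply labelled 0, . . . , 9, an arbitrary encoding) enumerates all 720 ordered triplets and recovers D(1) = 216, hence the probability 3/10.

```python
from itertools import permutations

contrade = list(range(10))   # the ten contrade that ran the previous year's Palio
c = 0                        # the contrada we are interested in (any label works)

# states of the variable e: ordered triplets of distinct contrade
triplets = list(permutations(contrade, 3))
assert len(triplets) == 720

# push the equidistribution forward along f_c: count the triplets containing c
D1 = sum(1 for t in triplets if c in t)
D0 = len(triplets) - D1

print(D1, D0)              # 216 504
print(D1 / (D0 + D1))      # 0.3
```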
Can we apply the naïve formula

(number of positive cases) / (number of possible cases)

to a coin with the two states H and T, and claim that, with probability 1/2, by
tossing the coin we obtain one of the two possibilities? A priori, no one can affirm
with certainty that the probability of each state is 1/2, since the coin may not be a
fair coin.
In a certain sense, the physical, biological and economic worlds are filled with “unfair”
coins. For example, on examining the first place in the DNA chain of large numbers
of organisms, one observes that the four basic elements do not appear equally often;
the element T appears much more frequently than the element A. In like manner, if
one were attempting to understand the possible results of a soccer match, there is no
way one would say that the three possibilities 1, 2, T are equally probable.
Example 2.2.1 Consider now the two Palios of 2016, and assume that the contrada c
took part in both Palios of the previous year, so that for each of the two extractions it
is one of the 10 contrade among which the 3 additional participants are drawn. To
compute the probability that c participates in both Palios, consider the system T whose
unique variable y has as states the pairs of triplets (one for the extraction of July, one
for that of August): there are 720 · 720 = 518400 such states. The map sending a pair
of triplets to 1 if c appears in both triplets, and to 0 otherwise, pushes the equidistribution
forward to a distribution D with D(1) = 216 · 216 = 46656, so that the probability
that c participates in both Palios is

D̃(1) = D(1) / (D(0) + D(1)) = 46656/518400 = 9/100 = 9%.
To compute the probability that c participates in at least one Palio, we build the
map u : T → S where this time u sends each state s of y, corresponding to a pair
of triplets, to 1 or 0, depending on whether or not c appears in at least one of the
triplets of the pair.
How many states s are sent to 1 now? There are 216 triplets, among the 720
possible ones, where c appears in the first element of s. Among the 720 − 216 = 504
remaining ones, there are 504 · 216 cases where c appears in the second element of s.
Then, the pairs of triplets having c in at least one triplet are 216 · 720 + 504 · 216 =
264384.
The equidistribution E on T induces on S the distribution R = u∗E such that
R(1) = 264384 and R(0) = 518400 − 264384 = 254016. Hence, the probability
for the contrada c to participate in at least one Palio in 2016 is

R̃(1) = R(1) / (R(0) + R(1)) = 264384/518400 = 51/100 = 51%.
Example 2.2.3 The most famous elementary logic connectors are AND, OR and
AUT (which means “one of them, but not both”), described respectively by parts (a),
(b) and (c) of Table 2.1.
Definition 2.2.4 Let S be a system with two random variables x1, x2, and consider
a booleanization E of S. Let ⋄ be a logic connector. Then ⋄ defines a booleanization
E⋄ of the total correlation of S by setting, for each pair s1, s2 of states of x1, x2:

E⋄(s1, s2) = E(s1) ⋄ E(s2).
Table 2.1 Truth tables of AND (a), OR (b) and AUT (c)

(a) AND | 0 1      (b) OR | 0 1      (c) AUT | 0 1
     0  | 0 0           0 | 0 1           0  | 0 1
     1  | 0 1           1 | 1 1           1  | 1 0
The reader can easily rephrase Example 2.2.1 in terms of the booleanization induced
by a connector ⋄.
Remark 2.2.5 One can extend the notion of logic connectors to n-tuples, by taking
n-ary operations on Z2, i.e. maps (Z2)^n → Z2. This yields an extension of a
booleanization on a system S, with an arbitrary number of random variables, to a
booleanization of the total correlation of S.
The theory of logic connectors has useful applications. Just to mention one, it intro-
duces a theoretical study of queries on databases.
However, since a systematic study of logic connectors goes far beyond the scope
of this book, we will not continue the discussion here.
2.3 Independence Connections and Marginalization

The numbers appearing in Example 2.2.1 are quite large and, in similar but slightly
more complicated situations, rapidly become intractable.
Fortunately, there is a simpler setting in which to reproduce the computations of probability
performed in Example 2.2.1. It is based on the notion of “independence”.
In order to make this definition precise we first show a connection between tensors and
distributions, which extends the connection between distributions on a system with
two variables and matrices introduced in Remark 1.2.8.
two variables and matrices, introduced in Remark 1.2.8.
Tensors are fundamental objects in Multi-linear Algebra, and they are introduced
in Chap. 6 of the related part of the book. We introduce here the way tensors enter
into the study of random systems, and this will provide the fundamental connection
between Algebraic Statistics and Multi-linear Algebra.
Remark 2.3.1 Let S be a system of random variables, say x1, . . . , xn, suppose
that xi has ai states, and let T be the total correlation of S. Recall that a
of T , a number. But the states of x correspond to ordered n-tuples (s1 j1 , . . . , sn jn )
where si ji is a state of the variable xi , i.e. a state of T corresponds to a particular
location in a multidimensional array, and a distribution on T puts a number in that
location. I.e. a distribution on T can be identified with an n-dimensional tensor of
type a1 × · · · × an and conversely.
The entry T_{i1 ... in} corresponds to the states i1 of the variable x1, i2 of the variable
x2, . . . , in of the variable xn.
Remark 2.3.2 When the original system S has only two variables (random dipole),
then the distributions on the total correlation of S are represented as 2-dimensional
tensors, i.e. usual matrices.
Example 2.3.3 Assume we have a system S of three coins, that we toss 15 times and
record their outputs. Assume that we get the following result:
• 1 time we get HHH;
• 2 times we get HHT and TTH;
• 3 times we get HTH and HTT;
• 4 times we get TTT;
• we never get THH and THT
Then the corresponding distribution, in the space of R-distributions of S, which
is D(S) = R^2 × R^2 × R^2, is ((9, 6), (3, 12), (6, 9)), where (9, 6) corresponds to 9
Heads and 6 Tails for the first coin, (3, 12) corresponds to 3 Heads and 12 Tails for
the second coin, and (6, 9) corresponds to 6 Heads and 9 Tails for the third coin.
The distribution over the total correlation T that corresponds to our data is the tensor
of Example 6.4.15, namely the 2 × 2 × 2 tensor D whose two slices with respect to
the third coin are

(third coin = H)   1 3        (third coin = T)   2 3
                   0 2                           0 4

where rows are indexed by the state (H, T) of the first coin and columns by the state
(H, T) of the second coin.
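To see concretely how the recorded outcomes fill a 2 × 2 × 2 tensor and how the distribution ((9, 6), (3, 12), (6, 9)) on S is recovered by summing over the other indices, here is a minimal Python (numpy) sketch of the computation; the encoding of H, T as indices 0, 1 is of course an arbitrary choice.

```python
import numpy as np

# recorded outcomes of the 15 throws of the three coins
counts = {"HHH": 1, "HHT": 2, "TTH": 2, "HTH": 3,
          "HTT": 3, "TTT": 4, "THH": 0, "THT": 0}

idx = {"H": 0, "T": 1}
D = np.zeros((2, 2, 2), dtype=int)        # distribution on the total correlation
for outcome, n in counts.items():
    D[idx[outcome[0]], idx[outcome[1]], idx[outcome[2]]] = n

# summing over the other two indices recovers the distribution on S
print(D.sum(axis=(1, 2)))   # [9 6]   first coin:  9 Heads, 6 Tails
print(D.sum(axis=(0, 2)))   # [ 3 12] second coin: 3 Heads, 12 Tails
print(D.sum(axis=(0, 1)))   # [6 9]   third coin:  6 Heads, 9 Tails
```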
In general, as we can immediately see from the previous Example, knowing the
distribution on a system S of three coins is not enough to reconstruct the corresponding
distribution on the total correlation T.
On the other hand, there is a special class of distributions on T for which the
reconstruction is possible. It is the class of independence distributions defined below.
Definition 2.3.4 Let S be a system of random variables x1, . . . , xn, where each xi
has ai states. Then DK(S) is identified with K^{a1} × · · · × K^{an}. Let T be the
total correlation of S.
Define a function

Π : DK(S) → DK(T),

called the independence connection (or Segre connection), in the following way: Π
sends the distribution D = (D1, . . . , Dn) ∈ DK(S) to the distribution Π(D) ∈ DK(T)
defined, on every state (s1j1, . . . , snjn) of the unique variable of T, by

Π(D)(s1j1, . . . , snjn) = D1(s1j1) · · · Dn(snjn);

in other words, as a tensor, Π(D) is the decomposable tensor D1 ⊗ · · · ⊗ Dn.
Since Σ_{s1j1 ∈ A(x1)} Dx1(s1j1) = 1, the claim follows by induction on the number of
random variables of S.
Example 2.3.9 Let S be a boolean system with two random variables x, y, both with
alphabet {0, 1}. Let D be the distribution defined as

Dx(0) = 1/6, Dx(1) = 5/6, Dy(0) = 1/6, Dy(1) = 5/6

(which is clearly a probabilistic distribution).
Its product distribution on the total correlation, with variable z = x × y, is defined
as

Π(D)z(0, 0) = (1/6) · (1/6) = 1/36,
Π(D)z(0, 1) = (1/6) · (5/6) = 5/36,
Π(D)z(1, 0) = (5/6) · (1/6) = 5/36,
Π(D)z(1, 1) = (5/6) · (5/6) = 25/36.
The independence connection can be, in some sense, inverted. For this aim, we need
the definition of (total) marginalization.
Definition 2.3.10 Let S be a system of random variables and T its total
correlation. Define a function M : DK(T) → DK(S), called (total) marginalization,
in the following way. Given a distribution D′ on T (thought of as a tensor), M(D′) is
the distribution (M(D′)1, . . . , M(D′)n) on S such that, for every variable xi and every
state s ∈ A(xi),

M(D′)i(s) = Σ D′(s1j1, . . . , snjn),

where the last sum runs over the states of the unique random variable of T whose
i-th component equals s.
Example 2.3.13 Let us now see how the independence connection and the marginalization
work together.
To this aim we introduce a (admittedly rather rough) example on the effectiveness of
a medicine, which we will also use in the chapter on statistical models.
Consider a pharmaceutical company that wants to verify whether a specific product is
effective against a given pathology.
The company tries to verify this effectiveness by hiring volunteers (the population)
affected by the pathology and giving the drug to some of them and a placebo to the others.
The conclusions must be drawn from the recorded numbers of healings.
Suppose, for instance, that out of 100 volunteers 20 receive the placebo and 80 receive
the drug, and that at the end of the experiment 30 patients are not healed while 70 are
healed. We thus have a boolean dipole S, with a variable x for the treatment (say 0 =
placebo, 1 = drug) and a variable y for the outcome (0 = not healed, 1 = healed), and
the distribution D given by Dx(0) = 20, Dx(1) = 80, Dy(0) = 30, Dy(1) = 70. The
independence connection gives the matrix

Π(D) = ( 600  1400 )
       ( 2400 5600 )

(rows indexed by the states of x, columns by the states of y).
We easily observe that Π(D) is a matrix of rank 1: it represents the situation in which
the administration of the medicine has no effect on (is independent of) the healing of
the patients.
The marginalization of Π(D) gives the distribution D′ on S:

D′x(0) = 600 + 1400 = 2000, D′x(1) = 2400 + 5600 = 8000,
D′y(0) = 600 + 2400 = 3000, D′y(1) = 1400 + 5600 = 7000.

We notice that D′ is a scaling of D, with scaling factor equal to 100, which is the
sampling.
Already from the previous Example 2.3.13, it is clear that the marginalization is in
general not injective; in other words, the set Co(D) of distributions on the total
correlation which are coherent with a given distribution D on S (i.e. whose marginalization
is D) need not be a singleton.
Indeed, in the Example, it is easy to produce distinct distributions on the total
correlation which are coherent with D′, as one can immediately compute.
Example 2.3.14 Consider a boolean system S with two variables representing two coins
c1, c2, each of them with alphabet {H, T}. We throw each coin, separately, 100 times,
obtaining for the first coin c1 30 times H and 70 times T, and for the second
coin c2 60 times H and 40 times T. Thus one has a distribution D given by
((30, 70), (60, 40)).
Using the independence connection Π, we get, on the unique variable of the total
correlation T of S, the distribution

Π(D)(H, H) = 1800, Π(D)(H, T) = 1200,
Π(D)(T, H) = 4200, Π(D)(T, T) = 2800.
Marginalizing back, we find M(Π(D))c1(H) = 1800 + 1200 = 3000 = 100 · 30 and
M(Π(D))c1(T) = 4200 + 2800 = 7000 = 100 · 70. Similar equalities hold for the other
variable of S, thus M(Π(D)) is a scaling of D. The same computation works for any
distribution D on any system S: the marginalization M(Π(D)) is always a scaling of D.
Example 2.3.16 The converse of this fact is not valid in general.
As a matter of fact, in the situation of Example 2.3.14, consider a distribution D′ on T
defined as

D′(H, H) = 6, D′(H, T) = 1, D′(T, H) = 3, D′(T, T) = 1.

Its marginalization is M(D′)c1 = (7, 4) and M(D′)c2 = (9, 2), hence

Π(M(D′))(H, H) = 63, Π(M(D′))(H, T) = 14,
Π(M(D′))(T, H) = 36, Π(M(D′))(T, T) = 8.
In the previous Example, the initial distribution D′ does not lie in the image of the
independence connection (as a matrix it has rank 2), hence Π(M(D′)), which obviously
lies in that image, cannot be equal to D′.
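The computations of Examples 2.3.14 and 2.3.16 are easy to reproduce: the independence connection of a dipole is an outer product and the total marginalization is a pair of sums. The sketch below (Python with numpy; the function names are our own choice) shows that M(Π(D)) is a scaling of D, while Π(M(D′)) need not return the original D′.

```python
import numpy as np

def independence(Dx, Dy):
    """Independence (Segre) connection of a dipole: the rank-1 matrix Dx_i * Dy_j."""
    return np.outer(Dx, Dy)

def marginalize(T):
    """Total marginalization of a matrix: row sums and column sums."""
    return T.sum(axis=1), T.sum(axis=0)

# Example 2.3.14: coins with D = ((30, 70), (60, 40))
Dx, Dy = np.array([30, 70]), np.array([60, 40])
P = independence(Dx, Dy)
print(P)                       # [[1800 1200] [4200 2800]]
print(marginalize(P))          # ([3000 7000], [6000 4000]) = (100*Dx, 100*Dy)

# Example 2.3.16: a distribution on the total correlation which is not rank 1
D2 = np.array([[6, 1], [3, 1]])
Mx, My = marginalize(D2)
print(independence(Mx, My))    # [[63 14] [36  8]], not equal to D2
```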
When we start from a distribution D which lies in the image of the independence
connection, then the marginalization works (up to scaling) as the inverse of the
independence connection.
Proposition 2.3.17 Let S be a system and T its total correlation. Denote, as usual,
by Π the independence connection from S to T and by M the (total) marginalization
from T to S.
If D′ = Π(D) then Π(M(D′)) is a scaling of D′.
Proof Suppose S has n variables x1, . . . , xn, such that the variable xi has ai states.
By our assumption on D′, there exist vectors vi = (vi1, . . . , viai) ∈ K^{ai} such that, as
a tensor, D′ = v1 ⊗ · · · ⊗ vn.
Then M(D′) associates to the variable xj the vector

M(D′)j = ( Σ_{ij = 1} v1i1 · · · vnin , . . . , Σ_{ij = aj} v1i1 · · · vnin ),

where the k-th sum runs over all the multi-indices (i1, . . . , in) with ij = k. Setting
cm = vm1 + · · · + vmam, this means M(D′)j = c1 · · · cj−1 · cj+1 · · · cn · vj, hence

Π(M(D′)) = (c1 · · · cn)^(n−1) D′,

which is a scaling of D′.
Let us go back to see how the use of the independence connection can simplify our
computations when we know that two systems are independent.
Example 2.3.19 Consider again Example 2.2.1 and let us use the previous construc-
tions to simplify the computations.
We have a distribution D on the system T whose unique variable is the contrada
c, with alphabet Z2. We know that the normalization of D sends 1 to 3/10 and 0
to 7/10.
Applying the independence connection to the two copies of this variable given by the
two Palios of 2016 (July and August), we obtain on the total correlation the distribution P:

P(1, 1) = (3/10) · (3/10) = 9/100,   P(1, 0) = (3/10) · (7/10) = 21/100,
P(0, 1) = (7/10) · (3/10) = 21/100,   P(0, 0) = (7/10) · (7/10) = 49/100.   (2.3.2)
If we consider the logic connector AND, this sends (1, 1) to 1 and the other pairs
to 0. Thus, the distribution induced by P sends 1 to 9/100 and 0 to (21 + 21 +
49)/100 = 91/100.
If we consider the logic connector OR, this sends (0, 0) to 0 and the other pairs to
1. Thus, the distribution induced by P sends 1 to (9 + 21 + 21)/100 = 51/100
and 0 to 49/100.
The logic connector AUT sends (0, 0) and (1, 1) to 0 and the other pairs to 1.
Thus, the distribution induced by P sends 1 to (21 + 21)/100 = 42/100 and 0
to (9 + 49)/100 = 58/100.
And so on.
It is important to observe that the results are consistent with the ones of Example
2.2.1, but the computations are simpler.
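The same bookkeeping can be done mechanically. The short Python sketch below hard-codes the probabilities of (2.3.2) and pushes the distribution forward along the connectors AND, OR and AUT, recovering the percentages computed above.

```python
# probabilities from (2.3.2): states are (July participation, August participation)
P = {(1, 1): 9/100, (1, 0): 21/100, (0, 1): 21/100, (0, 0): 49/100}

connectors = {
    "AND": lambda a, b: a & b,
    "OR":  lambda a, b: a | b,
    "AUT": lambda a, b: a ^ b,   # exclusive or: one of them, but not both
}

for name, op in connectors.items():
    induced = {0: 0.0, 1: 0.0}
    for (a, b), prob in P.items():
        induced[op(a, b)] += prob   # push the probability onto the boolean state
    print(name, induced)
# AND {0: 0.91, 1: 0.09}   OR {0: 0.49, 1: 0.51}   AUT {0: 0.58, 1: 0.42}
```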
Let us finish by showing some properties of the space Co(D) of distributions coherent
with a given distribution D on a system S.
Theorem 2.3.20 For every distribution D with constant sampling on a system S,
Co(D) is an affine subspace (i.e. a translate of a vector subspace) of the space of
distributions on the total correlation of S.
Proof We provide the proof only in the case when S is a random dipole, i.e. it has
only two variables, leaving to the reader the straightforward extension to the general
case.
Let x, y be the random variables of S, with states (s1, . . . , sm) and (t1, . . . , tn)
respectively. We will prove that Co(D) is an affine subspace of dimension mn −
m − n + 1. Let D′ be a distribution on the total correlation, identified with the matrix
D′ = (dij) ∈ R^{m,n}. Since D′ ∈ Co(D), the row sums of the matrix of D′ must give
the values Dx(s1), . . . , Dx(sm) of D on the states of x, while similarly the column sums
must give the values Dy(t1), . . . , Dy(tn) of D on the states of y. Thus D′ is in Co(D) if
and only if it is a solution of the linear system with n + m equations and nm unknowns:
if it is solution of the linear system with n + m equations and nm unknowns:
⎧
⎪
⎪d11 + · · · + d1n = Dx (s1 )
⎪
⎪
⎪
⎪ .. .
⎪
⎪
⎪ . = ..
⎪
⎨d + · · · + d
m1 mn = Dx (sm )
⎪
⎪d11 + · · · + dm1 = D y (t1 )
⎪
⎪
⎪
⎪ .. .
⎪
⎪
⎪ . = ..
⎪
⎩d + · · · + d
1n mn = D y (tn )
It follows that Co(D) is an affine subspace of the space R^{m,n} of distributions on the
total correlation. The matrix H associated to the previous linear system has a block
structure given by

H = ( M1 M2 · · · Mm | Dx )
    ( I   I  · · · I  | Dy )

where Mi is the m × n matrix whose i-th row consists of ones and whose other entries
are 0, I is the n × n identity matrix, and Dx, Dy denote the columns of constant terms.
We observe that the m + n rows of H are not independent, since the vector
(1, 1, . . . , 1) can be obtained both as the sum of the first m rows and as the sum of the
last n rows. Hence the rank of H is at most n + m − 1.
In particular, the system has a solution if and only if the constant terms satisfy
Σi Dx(si) = Σj Dy(tj), i.e. if and only if D has constant sampling; in this case the rank
of the system equals m + n − 1, so that Co(D) has dimension mn − (m + n − 1) =
mn − m − n + 1.
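The dimension count in the proof can be checked numerically. The sketch below (Python with numpy) builds the (m + n) × (mn) coefficient matrix of the linear system for a few values of m, n and verifies that its rank is m + n − 1, so that Co(D) has dimension mn − m − n + 1.

```python
import numpy as np

def coefficient_matrix(m, n):
    """Coefficient matrix of the marginal equations: m row-sum equations and
    n column-sum equations in the mn unknowns d_ij (flattened row by row)."""
    H = np.zeros((m + n, m * n))
    for i in range(m):                  # d_i1 + ... + d_in = D_x(s_i)
        H[i, i * n:(i + 1) * n] = 1
    for j in range(n):                  # d_1j + ... + d_mj = D_y(t_j)
        H[m + j, j::n] = 1
    return H

for m, n in [(2, 2), (3, 4), (4, 6)]:
    H = coefficient_matrix(m, n)
    rank = np.linalg.matrix_rank(H)
    print(m, n, rank, m * n - rank)     # rank = m+n-1, dim Co(D) = mn-m-n+1
```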
Definition 2.3.22 In the space of distributions on the total correlation of S, define
the unitary simplex U as the subset formed by the tensors whose coefficients sum to 1.
U is an affine subspace of codimension 1, which contains all the probabilistic
distributions on the total correlation of S.
We previously said that if D′ is coherent with D and D has constant sampling k, then
the sampling of D′, on the unique variable of the total correlation, is also k. In other
terms, the matrix (dij) associated to D′ satisfies Σ_{i,j} dij = k. From this fact we get
the following:
Proposition 2.3.23 For every distribution D on S with constant sampling, the affine
space Co(D) is parallel to U .
We finish by showing that, for a fixed distribution D over S, one can find in Co(D)
distributions with rather different properties.
Example 2.3.24 Assume one needs to find properties of a distribution D′ on the total
correlation of S, but knows only the marginalization D = M(D′). Even if Co(D) is in
general infinite, sometimes only a little further information is needed to determine some
property of D′.
For instance, let S be a random dipole, whose variables x, y represent one position
in the DNA chain of a population of cells in two different moments, so that both
variables have alphabet {A, C, G, T }. After treating the cells with some procedure, a
researcher wants to know if some systematic change occurred in the DNA sequence
(which can change also randomly). The total correlation of S has one variable with
16 states. A distribution D′ on the total correlation corresponds to a 4 × 4 matrix.
If the researcher cannot trace the evolution of every single cell, but can only
determine the total number of DNA bases that occur in the given position before
and after the treatment, i.e. if the researcher can only know the distribution D on
S, any conclusion on the dependence of the final base on the initial one is
impossible.
But assume that one can trace the behaviour of the cells having an A in the fixed
position, so that one can record the value of the distribution D′ on the state (A, A).
Since there exists only one distribution of independence, call it D′′, which is coherent
with D, if the observed value of D′(A, A) does not coincide with D′′(A, A), then the
researcher has evidence towards the non-independence of the variables in the
distribution D′.
Notice that, on the contrary, if D′(A, A) = D′′(A, A), one cannot conclude that
D′ is a distribution of independence, since infinitely many distributions in Co(D)
assume that same fixed value on (A, A).
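In this situation, the unique independence distribution coherent with the marginal data is the rescaled outer product of the marginals. The following Python sketch uses made-up counts (all numbers below are hypothetical) to show the comparison at the state (A, A).

```python
import numpy as np

# hypothetical marginal counts at the fixed position (1000 cells in total)
before = np.array([300, 250, 250, 200])   # D_x over A, C, G, T before treatment
after  = np.array([280, 260, 255, 205])   # D_y over A, C, G, T after treatment
k = before.sum()                          # constant sampling

# the unique independence distribution coherent with these marginal counts
indep = np.outer(before, after) / k
print(indep[0, 0])        # 84.0 : expected number of A -> A cells under independence

observed_AA = 240         # hypothetical traced value of D'(A, A)
if not np.isclose(observed_AA, indep[0, 0]):
    print("evidence against independence of the two positions")
```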
Example 2.3.25 Consider the following situation. In a bridge club, two assiduous
players A, B follow this rule: they play alternately, one day a match as partners,
the next day a match in opposing pairs. After 100 days, the situation is as follows:
A won 30 games and lost 70, while B won 40 and lost 60. Can one determine
analytically how the wins and defeats are distributed between the two kinds of matches?
Can one check whether the wins and defeats of the two players depend on whether
or not they play as partners?
We can give an affirmative answer to both questions. Here we have a boolean
dipole S, with two variables A, B and states 1 = victory, 0 = defeat. The distribution
D on S is defined by

DA(1) = 30, DA(0) = 70, DB(1) = 40, DB(0) = 60.
Clearly D has constant sampling equal to 100. From these data, we want to determine
a distribution D′ on the total correlation of S, coherent with D, which explains the
situation. We already know that there are infinitely many distributions coherent with
D. By Theorem 2.3.20, these distributions D′ fill an affine subspace of R^{2,2} having
dimension 2 · 2 − 2 − 2 + 1 = 1.
The extra datum with respect to Example 2.3.13 is that the players played
alternately as partners and as opponents. Among the 100 matches played, 50 times they
played together, so that the observed result could only be (0, 0) or (1, 1), while 50 times
they were opposed, so that the observed result could only be (0, 1) or (1, 0). Hence, the
matrix (dij) of the required distribution D′ must satisfy the condition

d11 + d22 = d12 + d21 = 50.
All others distributions, coherent with D , are obtained summing to D the solution
of the homogeneous system ⎧
⎪
⎪ d11 + d12 = 0
⎪
⎨d + d
11 21 = 0
⎪
⎪ d + d 12 = 0
⎪
⎩
22
d22 + d21 = 0
Requiring d11 + d22 = d12 + d21 , we get z = 2, then the only possible matrix is:
10 20
D = .
30 40
Then playing in pairs A and B won 10 times and lost 40, while playing against A
has won 20 times, B has won 30 times.
2.3 Independence Connections and Marginalization 33
Finally, the percentage of wins depends on the players A and B, because the
determinant of D is −200 = 0 (both have advantage to not play in pairs with each
other).
In D1 and D2 the fact that A gets a grant is a vantage for section B, while in D3 it
represents a disadvantage.
2.4 Exercises
Exercise 6 Following Example 2.2.1 compute the probability that a contrada c par-
ticipates to one Palio in 2016, but not both, under the assumption that c didn’t
participate to any Palio in 2015.
Chapter 3
Statistical Models
3.1 Models
In this chapter, we introduce the concept of model, essential point of statistical infer-
ence. The concept is reviewed here by our algebraic interpretation. The general
definition is very simple:
Definition 3.1.1 A model on a system S of random variables is a subset M of the
space of distributions D(S).
Of course, in its total generality, the previous definition it is not so significant.
The Algebraic Statistics consists of practice in focus only on certain particular
types of models.
Definition 3.1.2 A model M on a system S is called algebraic model if, in the coordi-
nates of D(S), M corresponds to the set of solutions of a finite system of polynomial
equations. If the polynomials are homogeneous, then M si called a homogeneous
algebraic model.
It occurs that the algebraic models are those mainly studied by the proper methods
of Algebra and Algebraic Geometry.
In the statistical reality, it occurs that many models, which are important for the
study of stochastic systems (discrete), are actually algebraic models.
Example 3.1.3 On a system S, the set of distributions with constant sampling is a
model M. Such a model is a homogeneous algebraic one.
As a matter of fact, if x1 , . . . , xn are the random variables in S and we identify
DR (S) with Ra1 × · · · × Ran , where the coordinates are y11 , . . . , y1a1 , y21 , . . . , y2a2 ,
. . . , yn1 , . . . , ynan , then M is defined by the homogeneous equations:
The probabilistic distributions form a submodel of the previous model, which is still
algebraic, but not homogeneous!
© Springer Nature Switzerland AG 2019 35
C. Bocci and L. Chiantini, An Introduction to Algebraic Statistics
with Tensors, UNITEXT 118, https://doi.org/10.1007/978-3-030-24624-2_3
36 3 Statistical Models
The most famous class of algebraic models is the one given by independence mod-
els. Given a system S, the independence model on S is, in effect, a subset of the
space of distributions of the total correlation T = S, containing the distributions
in which the variables are independent among them. The basic example is Example
2.3.13, where we consider a Boolean system S, whose two variables x, y represent,
respectively, the administration of the drug and the healing.
This example justifies the definition of a model of independence, for the ran-
dom systems with two variables (dipoles), already in fact introduced in the previous
chapters.
Definition 3.2.1 Let S be a system with two random variables x1 , x2 and let T = S.
The space of K −distributions on T is identified with the space of matrices K a1 ,a2 ,
where ai is the number of states of the variable xi .
We recall that a distribution D ∈ D K (T ) is a distribution of independence if D,
as matrix, has rank ≤ 1.
The independence model on S is the subset of D K (T ) of distributions of rank
≤ 1.
To extend the definition of independence to systems with more variables, consider
the following example.
Example 3.2.2 Let S be a random system, having three random variables x1 , x2 , x3
representing, respectively, a die and two coins (this time not loaded). Let T = S
and consider the R-distribution D on T defined by the tensor
1 1 1 1 1 1
24 24 24 24 24 24
1 1 1 1 1 1
24 24 24 24 24 24
D= 1 1 1 1 1 1
24 24 24 24 24 24
1 1 1 1 1 1
24 24 24 24 24 24
0 0
−4 12
−1 −9
It is clear, by construction, that the image of the connection is formed exactly by the
distributions of independence on S.
Clearly there are other interesting types of connection. A practical example is the
following:
Example 3.3.4 Consider a population of microorganisms in which we have elements
of two types, A, B, that can pair together randomly. In the end of the couplings, we
will have microorganisms with genera of type A A or B B, or mixed type AB = B A.
The initial situation corresponds to a Boolean system with a variable (the initial
type t0 ) which assumes the values A, B. At the end, we still have a system with only
one variable (the final type t) that can assume the 3 values A A, AB, B B.
If we initially insert a distribution with a = D(A) elements of type A and b =
D(B) elements of type B, which distribution we can expect on the final variable t?
An individual has a chance to meet another individual of type A or B which is
proportional to (a, b), then the final distribution on t will be D given by D (A A) =
a 2 , D (AB) = 2ab, D (B B) = b2 . This procedure corresponds to the connection
: R2 → R3 (a, b) = (a 2 , 2ab, b2 ).
The motivation for defining parametric models should be clear from the repre-
sentation of a connection. If s1 , . . . , sa are all possible states of the variables in S,
and ti1 , . . . , tidi are the possible states of the variables yi of T , then in the parametric
model defined by the connection we have
⎧
⎪
⎨ti1 = i1 (s1 , . . . , sa )
... = ...
⎪
⎩
tidi = idi (s1 , . . . , sa )
The tensors T of the independence model have in fact coefficients that satisfy
parametric equations ⎧
⎪
⎨. . .
Ti1 ...in = v1i1 v2i2 · · · vnin (3.3.1)
⎪
⎩
...
From its parametric equations, (3.3.1), we quickly realize that the independence
model is a toric model.
Example 3.3.7 The model of Example 3.3.4 is a toric model, since it is defined by
the equations ⎧
⎪
⎨x = a
2
y = 2ab .
⎪
⎩
z = b2
Remark 3.3.8 It is evident, but it is good to underline it, that for the definitions
we gave, being an algebraic or polynomial parametric model is independent from
changes in coordinates. Being a toric model instead it can depend on the choice of
coordinates.
Definition 3.3.9 The term linear model denotes, in general, a model on S defined
in D(S) by linear equations.
Obviously, every linear model is algebraic and also polynomial parametric,
because you can always parametrize a linear space.
Example 3.3.10 Even if a connection , between the K -distributions of two random
systems S and T , is defined by polynomials, the polynomial parametric model that
defines it is not necessarily algebraic!.
In fact, if we consider K = R and two random systems S and T each having one
single random variable with a single state, the connection : R → R, (s) = s 2
certainly determines a polynomial parametric model (even toric) which corresponds
to R≥0 ⊂ R, so it can not be defined in R as vanishing of polynomials.
We will see, however, that by widening the field of definition of distributions, as
we will do in the next chapter switching to distributions on C, under a certain point
of view all polynomial parametric models will, in fact, be algebraic models.
The following counterexample is a milestone in the development of so much
part of the Modern mathematics. Unlike Example 3.3.10, it cannot be recovered by
enlarging our field of action.
Example 3.3.11 Not all algebraic models are polynomial parametric.
We consider in fact a random system S with only one variable having three states.
In the distribution space D(S) = R3 , we consider the algebraic model V defined by
the unique equation x 3 + y 3 − z 3 = 0.
There cannot be a polynomial connection from a system S to S whose image
is V .
3.3 Connections and Parametric Models 41
The solution p 2 (t), q 2 (t), r 2 (t) must be proportional to the 2 × 2-minors of the
matrix, hence p 2 (t) is proportional to q(t)r (t) − q (t)r (t), and so on. Consider-
ing the equality p 2 (t)( p(t)r (t) − p (t)r (t)) = q 2 (t)(q(t)r (t) − q (t)r (t)), we get
that p 2 (t) divides q(t)r (t) − q (t)r (t), hence 2 deg( p(t)) ≤ deg(q) + deg(r ) − 1
which contradicts the fact that deg( p) ≥ deg(q) ≥ deg(r ).
Naturally, there are examples of models that arise from connections that they do
not relate a system and its total correlation.
Example 3.3.12 Let us say we have a bacterial culture in which we insert bacteria
corresponding to two types of genome, which we will call A, B.
Suppose, according to the genetic makeup, the bacteria can develop characteristics
concerning the thickness of the membrane and of the core. To simplify, let us say
that in this example, cells can develop nucleus and membrane large or small.
According to the theory to be verified, the cells of type A develop, in the descent,
a thick membrane in 20% of cases and develop large core in 40% of cases. Cells of
type B develop thick membrane in the 25% of cases and a large core in one-third
of the cases. Moreover, the theory expects that the two phenomena are independent,
in the intuitive sense that developing a thick membrane is not influenced by, nor
influences, the development of a large core.
We build two random systems. The first S, which is Boolean, has only one variable
random c (= cell) with A, B states. The second T with two boolean variables, m (=
membrane) and n (= core). We denote for both with 0 the status “big” and with 1 the
status “small”.
The theory induces a connection between S and T . In the four states of the two
variables of T , which we will indicate with x0 , x1 , y0 , y1 , this connection is defined
by
42 3 Statistical Models
⎧
⎪
⎪ x0 = 1
a + 1
b
⎪
⎨x
5 4
1 = 4
5
a + 3
4
b
⎪
⎪ y = 2
a + 1
b
⎪
⎩
0 5 3
y1 = 3
5
a + 2
3
b
This reflects the fact that in the cell population (reported at 160) we expect to even-
tually observe 35 cells with large membrane and 60 cells with a large core.
If the experiment, more realistically, manages to capture the percentage of cells
with the two characteristics (shuffled), then we can consider a connection that links
S with the total correlation T : indicating with x00 , x01 , x10 , x11 the variables cor-
responding to the four states of the only variable of T , then such connection is
defined by ⎧
( 15 a+ 41 b)( 25 a+ 13 b)
⎪
⎪ x =
⎪
⎪
00 (a+b)2
⎪
⎪
⎪
⎪
⎪
⎪
⎪x01 = ( 5 a+ 4 b)( 5 a+
1 1 3 2
3 b)
⎪
⎨ (a+b)2
⎪
⎪
⎪
⎪ x10 =
( 45 a+ 43 b)( 25 a+ 13 b)
⎪
⎪ (a+b)2
⎪
⎪
⎪
⎪
⎪
⎪
⎩x ( 45 a+ 43 b)( 35 a+ 23 b)
11 = (a+b)2
From an algebraic point of view, an experiment will be in agreement with the model
if the percentages observed will be exactly those described by the latter connection:
8.2% of cells with a thick membrane, nucleus, etc.
In the real world, of course, some tolerance should be expected from experimental
data error. The control of this experimental tolerance will be not addressed in this
book, as it is a part of standard statistical theories.
3.4 Toric Models and Exponential Matrices 43
a ≥0 c(i )ai
z=
i
a j <0 c( j )a j
Note that the polynomial equations obtained previously, are in fact binomial.
Definition 3.4.3 The polynomial equations associated with the linear relations
between the rows of the exponential matrix of a toric model W define an alge-
braic model containing W . This model takes the name of algebraic model generated
by W .
It is clear from Example 3.3.10 that the algebraic model generated by a toric
model W always contains W , but it does not always coincide with W . Let us see a
couple of examples about it.
44 3 Statistical Models
from which we can see all relations, between rows, of the form
which define, as equations in Rmn = Rm,n , exactly the 2 × 2-minors of the matrices
in the model.
It follows that the algebraic model associated to this connection coincides with
the space of matrices of rank ≤ 1, which is exactly the image of the connection of
independence.
Example 3.4.5 Come back to the connection of Example 3.3.4. It defines a polyno-
mial parametric model W on R3 given by the parametric equations
⎧
⎪
⎨x = a
2
y = 2ab
⎪
⎩
z = b2
which, as unique relation between rows, satisfies R1 + R3 = 2R2 . Using the formula
for the coefficients, we get the equation in R3 :
4x z = y 2 .
3.4 Toric Models and Exponential Matrices 45
The algebraic model W defined by this equation does not coincide with W . As a
matter of fact, it is clear that the points in W have nonnegative x, z, while the point
(−1, 2, −1) is in W .
However, one has W = W ∩ B where B is the subset of the points in R3 with non-
negative coordinates.
√ In fact, if (x, y, z) is a point in B which satisfies the equation,
√
then, posing a = x, b = z, one has y = ab.
4.1 Motivations
We have seen as many models of interest in the fields of statistics are defined, in
the space of the distributions of a system S, by (polynomials) algebraic equations of
degree greater than one. To understand such models, a mathematical approach is to
initially study the subsets of a space defined by the vanishing of polynomials. This
is equivalent to studying the theory of solutions of systems of polynomial equations
of arbitrary degree, which goes under the name of Algebraic Geometry.
The methods of Algebraic Geometry are based on various theories: certainly, on
Linear and Multi-linear Algebra, but also on rings theory (in particular, on the theory
of rings of polynomials) and on Complex Analysis. We will repeat here only a part of
the main results that can be applied to statistical problems. It must be kept in mind,
however, that the theory in question is rather developed, and topics that will not be
introduced here, they could be important in the future also from a statistical point of
view.
The first step to take is to define the ambient space in which we move. Being to
study solutions of nonlinear equations, from an algebraic point of view, it is natural
to pass from the real field, fundamental for applications but without some algebraic
elements, to the complex field which, being algebraically closed, allows a complete
understanding of the solutions of polynomial equations.
We will then have to expand, from a theoretical point of view, the space of dis-
tributions, to admit points with complex coordinates. These points, corresponding
During this chapter, we will use many topics from the Part III. We strongly recom-
mend to the reader which is not expert in Algebraic Geometry to study such chapters
before to start the treatment of projective algebraic models.
Example 4.2.4 The projective algebraic models on a system S with a single variable
(as in the case of the total correlation) are strictly related with the cones of the
vector space D(S). Every cone defines a projective model on S. Vice versa, given a
projective model on S, its preimage in the projection D(S) → P(S) is a cone.
Example 4.2.5 Consider the system S formed by two ordinary dice. The projective
space of distributions is P5 × P5 . Instead, the projective space of distribution of S
is P35 , corresponding to the space of 6 × 6 matrices.
If S is a system where the two variables represent a die and a coin, the projective
space of distributions is P5 × P1 . In this case, it is easy to observe that the only
variable in the total correlation S has 12 states, hence the projective space of
distributions of S is P11 .
If a system S has n Boolean variables, then its projective space of distributions is
a product of n copies of P1 , and the space of distributions of S is P2 −1 .
n
We are now able to define projective parametric models on A. For this aim, we will
use Definition 9.3.1 of projective map, in Sect. 9.3.
Definition 4.3.1 If S, T are random systems, we call projective connection any pro-
jective map : P(D(S)) → P(D(T )). Notice, in particular, that if is a projective
connection, then the image of any scaling D of a distribution D is a scaling of (D).
We say that a model M is projective parametric if it is the image P(D(S)) of a
projective connection .
4.3 Parametric Models 51
three variables, corresponding to (x1 , x2 ), (x1 , x3 ), and (x2 , x3 ). The models without
triple correlation are obtained from the connection from S to S, which sends every
triplet of matrices (A, B, C) ∈ D(S ), with A ∈ Cd1 ,d2 , B ∈ Cd1 ,d3 , C ∈ Cd2 ,d3 , to the
tensor D ∈ D(S) defined by
It is clear that all components of this map are multi-homogeneous of the same degree,
but, in general, they do not define a map
because even if A, B, C are all different from zero, however it is not clear that their
image is not zero.
In order to obtain our model as the image of a well-defined projective map, we
must restrict to a subvariety of Pd1 d2 −1 × Pd1 d3 −1 × Pd2 d3 −1 .
For instance, when the three variables are Boolean, we must restrict the model to
a suitable model X of distributions on S , so that we get a well-defined map from
a variety X ⊂ P3 × P3 × P3 to P7 . This map is obtained by composing the Segre
map P3 × P3 × P3 → P63 with a suitable projection P63 → P7 . It is a matter of
computations that the subvariety X should not intersect a configuration of products
of linear spaces, containing, for instance,
The fact that the image of a Segre map can be interpreted as (projective) model of
independence of an random system, by Theorem 6.4.13, guarantees us that the Segre
varieties are all projective varieties.
Let us see how, in general, there are projective parametric models which are not
algebraic models.
Example 4.3.4 Consider two systems S, S , both with a single random variable.
We identify both the projective spaces of distribution over R, P(D(S)) and
P(D(S )), with P1R . We can define a projective connection : P(D(S)) → P(D(S ))
by posing (x0 , x1 ) = (x02 , x12 ).
It is easy to check that the image W of contains infinitely many points of P1R .
However, it does not contain all points: as a matter of fact, the point with homogenous
coordinates (1, −1) is not in the image.
On the other hand, each projective variety in P1R , being defined by the vanishing
of a homogeneous polynomial in two variables, or coincides with P1R , or it can only
contain a finite number of points.
So W can not be a projective variety.
4.3 Parametric Models 53
An intermediate case between total independence and generic situations of the depen-
dence of random variables concerns the so-called conditional independence.
To understand the practical meaning of models of conditional independence, let us
start with two examples. In the examples, we will use the fact that the naïve notion of
independence between two variables corresponds to distributions whose associated
matrices have rank 1 (see Definition 3.2.1 and the thereby discussion), while the
conditional independence corresponds to have rank 1 in the slices of the associated
tensor.
B\ A 1 2 3
1 72 41 15
D=
2 60 55 45
3 40 70 82
investigation. Since D has not rank 1, that is, it does not belong to model of inde-
pendence on S, the two variables are not independent.
In other words, being football fans affects hair loss.
The result is surprising, although the unequivocal consequence of the collected
data and the magazine launched into a series of interpretations of the case.
In reality, the true interpretation was very simple. A clue to the solution of the
mystery was contained in the fact that the matrix D has rank 2.
In fact, the magazine had mixed, in the result of the investigation, the data related
to two distinct groups: Men and W omen. The M group is more prone to being a
football fan and losing hair compared to the W group. The lack of homogeneity
of the sample led to a false result, in fact dividing the results of the investigation
with respect to an additional Boolean variable (the gender G of the sample) you get
a tensor 3 × 3 × 2 whose scan along the third index (front–back) is made of two
matrices of rank 1.
2 6 8
70 35 7
10 30 40
T =
50 25 5
20 60 80
20 10 2
The previous matrix D represents the marginalization relative to the first index.
So D is the sum of two matrices of rank 1, and in fact has rank 2.
Note that the two starting variables A, B are really dependent on each other, as
follows: if a person is subject to hair loss, he is more likely to be a man, so he is more
likely to be a football fan (in the example cited, in fact a bit dated, it was taken as
hypothesis that men are more susceptible than women to hair loss, and they are also
more likely to follow football).
The fact that D has rank smaller than the maximum, though it does not imply inde-
pendence of the two variables, should suggest to researchers a connection between
the two variables, mediated by a hidden variable G.
Example 5.0.2 This example is another classic of the algebraic statistical study: the
example of a scientific research that leads to a result only apparently significant.
Osteoporosis is a bone disease that mainly affects elderly people. Let us face
the problem: having a driving license affects the vulnerability to osteoporosis? The
question is apparently idiotic: how can the sensitivity to a bone disease be influenced
by the possession of the license? Yet paradoxically the results would seem to state
the opposite.
In fact, a researcher builds a system formed by two Boolean variables to study the
phenomenon: possession of the license and the state of illness. Then he considers a
population of older people, say 100 individuals, examines them with respect to the
possession of the license and the state of the bones, and builds a D distribution on
the total correlation. The result is expressed by the matrix:
13 37
22 28
The matrix expresses the fact that 13 people have at the same time driving license
and osteoporosis, 37 have a license but not osteoporosis, etc.
The result is incontrovertible! The matrix of D has determinant −450, so it is a far
away from having rank 1. Therefore, there is correlation between having a driving
license and contracting osteoporosis. In the specific case it is clear, from the examina-
tion of the results, that having the driving license makes less likely the manifestation
of osteoporosis. Great unexpected discovery. Such research risks ending up in some
scientific journal serious (let’s hope not!) and to be picked up by media of half the
world. You could create unfounded expectations of healing, with crows of old men
and old ladies to the assault of the driving schools. There would be some clinicians
ready to explain that driving vehicles, cause the movement of the pedals and steering
wheel, is a beneficial workout that tones the bones and makes them more resistant
to osteoporosis.
Unfortunately, we have to switch off easy enthusiasms, because the reality is a
little different.
The weak point of the statistical experiment lies in the fact that the chosen sample
is not homogeneous. In fact among the selected individuals there are mixed elderly
men and women. If the sample selection is random, it is likely that there is an equal
split: 50 men and 50 women. However, osteoporosis does not affect the two sexes in
a homogeneous way. Women are a lot more susceptible to the disease than men. On
the other hand, especially in the elderly population, for a man it is much more usual
to have a driving license compared to a woman.
The situation is clarified if the chosen random system has 3 variables: at the
possession of the license x1 and at the osteoporosis x2 we add the Boolean variable
x3 indicating sex (0 = man, 1 = woman). In the total correlation of this system,
which is a tensor of size 3 and type (2, 2, 2), the real distribution is:
58 5 Conditional Independence
8 32
5 5
D =
2 8
20 20
which has not rank 1, as there are submatrices with determinant different from 0.
The tensor tells us that x1 is independent from x2 given x3 (x1 x2 |{x3 }), because the
front and back matrices have both rank 1, i.e., fixing the male or female population,
in both we see that the possession of the license does not affect on the likelihood of
contracting osteoporosis, as was widely expected.
Note that D represents the marginalization of D along x3 , therefore it is not true
that x1 x2 . In other words, x1 and x2 are actually dependent on each other. What is
the meaning of this statement? It must be read this way. We charge a subject z who
has a driving license. As the percentage of licensed persons who are men rather than
women is higher, it is more likely that z is a man. As such, it is less likely to develop
osteoporosis. Conversely, if a subject has osteoporosis, it is more likely to be female,
so it is less likely to have a driving license.
Our perception still remains a bit perplexed. The reason lies in the psychological
fact that the property of being a man or a woman, for an individual, it is obviously seen
as far more fundamental than having a driving license or even to the development of
osteoporosis.
The previous examples explain the usefulness of introducing the concepts of con-
ditional independence of random variables and also the concept of hidden variables.
Remark 5.1.1 As usual, we are more interested to show the geometric or algebraic
structure of probability distributions, hence the explicit computation of conditional
probability is out of our scopes. Consequently we will not introduce, in a formal
way, the celebrated Bayes Formula (but see Example 5.1.13 for an instance of how
the formula can be recovered in our setting).
We will refer to the concepts of Tensorial Algebra contained in Chap. 8 about scan
and contraction. In particular, see Definitions 8.2.1 and 8.3.1.
5.1 Models of Conditional Independence 59
1 0
1 3
D=
0 1
1 −1
describes a distribution on the total correlation of a Boolean system with three vari-
ables x1 , x2 , x3 .
In D, one has x2 x3 , since the contraction along x1 gives
22
.
11
6 3
2 1
D =
2 2
1 1
of rank 2.
Examples 5.0.1 and 5.0.2 represent situations, where two initial variables are
independent, given the third one (the gender).
Example 5.1.6 Consider the transmission chain of a Boolean signal, with a head-
quarter A and two locations B, C non-connected, represented by the oriented graph
B C
Suppose that the following matrices are associated with the edges AB, AC
⎛2 1⎞ ⎛4 1⎞
3 3 5 5
M AB = ⎝ ⎠ M AC = ⎝ ⎠
1 2 1 4
3 3 5 5
These matrices represent the transmission of the signal, in the sense that if A transmits
30 times the signal 0, B transcribes the distribution M AB · (30, 0) = (20, 10), that is,
it transcripts 20 times the signal 0 and 10 times the signal 1. Similarly, C transcribes
the distribution M AC · (30, 0) = (24, 6).
5.1 Models of Conditional Independence 61
If A sends a signal of 30 bits 0 and 30 bits 1, the distribution resulting from the
graphic model, in the Boolean system with three variables A, B, C, is given by the
tensor:
4 2
16 8
8 16
2 4
Observe that the tensor has not rank 1, in fact the three variables are not independent.
On the other hand, the two submatrices that they get by fixing A = 0 and A = 1 both
have null determinant, so that (B C|A) holds. Instead, the marginalization of the
tensor in the direction of A gives the matrix:
12 18
18 12
Definition 5.1.7 The matrices used in the previous example are of a type widely
used in applications of Algebraic Statistics, especially for the theory of strings of
symbols (digital signals, DNA, etc.). They are called Jukes–Cantor’s matrices.
In general, a Jukes–Cantor matrix is a square matrix n × n whose diagonal ele-
ments are all equal to a value a, while all the other elements are equal to a value
b.
These matrices represent the fact that, for example, in the transmission of a signal,
if the transmitter A issues a value xi , the probability of the station B to receive
correctly xi is proportional to a, independently from xi , while the probability to
receive a wrong digit (proportional to (n − 1)b) is distributed uniformly on all the
other values x j , j = i.
Proof We prove the statement by induction on the rank. The cases n = 1, 2 are trivial.
For n generic, notice that deleting the last row and column we get a Jukes–Cantor
matrix of order (n − 1) × (n − 1). Hence, we can suppose, by induction that the first
n − 1 rows of M are linearly independent.
If the last row Rn is linear combination of the previous ones, that is, there exists
a relation
Rn = a1 R1 + · · · + an−1 Rn−1 ,
then comparing the last entries we get a = (a1 + · · · + an−1 )b, from which (a1 +
· · · + an−1 ) > 1. Thus, at least one of the ai ’s is positive. Suppose, that a1 > 0.
Comparing the first entry of the rows, we get
giving a contradiction.
Example 5.1.9 Let us go back to the Example 2.3.26 of the school with two sections
A, B, where scholarships are distributed. Let us say the situation after 25 years is
given by
96
D= .
64
The matrix defines a distribution on the total correlation of the Boolean system which
has two variables A, B, corresponding to the two sections. As the matrix has rank
1, this distribution indicates independence among the possibilities of A, B to have a
scholarship.
We introduce a third random variable N , which is 0 if the year is normal, i.e., one
scholarship is assigned, and 1 if the year is exceptional, that is, or 2 scholarships are
distributed or no scholarships are distributed at all. In the total correlation of the new
system, we necessarily obtain the distribution defined by the tensor
0 6
9 0
D =
6 0
0 4
since in the normal years only one of the two sections has the scholarship, something
that cannot happen in exceptional years.
The tensor D clearly does not have rank 1. Also note that the elements of the
scan of D along N do not have rank 1. As a matter of fact, both in exceptional years
5.1 Models of Conditional Independence 63
and in normal years, knowing if the A section has had or not the scholarship even
determines whether or not B has had it.
On the other hand, A B, because the marginalization of D along the variable
N gives the matrix D, which is a matrix of independence.
Definition 5.1.10 Given a set of conditions (Ai |Bi ) as above, the distributions
that satisfy all of them form a model in D(S). These models are called models of
conditional independence.
Proof By Theorem 6.4.13, we know that imposing a tensor to have rank 1 corre-
sponds to the vanishing of certain 2 × 2 determinants. The equations that are obtained
are homogeneous polynomial (of degree two). Therefore, every condition (A|B) is
defined by the composition of quadratic equations with a marginalization, hence from
the composition of quadratic and linear equations. Therefore, the resulting model
is algebraic. To prove the second statement, we note that if D satisfies a condition
(A|B) with A ∪ B = {1, . . . n} (i.e., no marginalization) then for each element D
of the scan of D along B there must exist v1 . . . , va , where a is the cardinality of A,
such that D = v1 ⊗ · · · ⊗ va . It is clear that such condition is polynomial parametric,
in fact toric. When A ∪ B = {1, . . . n}, the same fact holds on the coefficients that are
obtained from the marginalization that depend linearly on the coefficients of D.
This last parameterization, in the new coordinates D (i, j, k), with D (1, j, k) =
D(1, j, k), D (2, j, k) = D(1, j, k) + D(2, j, k), becomes
⎧
⎪
⎪ D (1, 1, 1) =x
⎪
⎪
⎪
⎪ D (2, 1, 1) = ac
⎪
⎪
⎪
⎪
⎪
⎪ D (1, 2, 2) =y
⎪
⎨ D (2, 2, 2) = bd
⎪ D (1, 2, 1)
⎪ =z
⎪
⎪
⎪
⎪ D (2, 2, 1) = ad
⎪
⎪
⎪
⎪
⎪
⎪ D (1, 1, 2) =t
⎪
⎩
D (2, 1, 2) = bc
αa α+β α a
p(A|B) = = (α a) = .
αa + βb (αa + βb) α a + βb
It follows that
p(A) p(B|A) = p(B) p(A|B).
the state and c times in the state , the variable xi+1 must assume d + d times the
state ξ.
This motivates the following:
vi+1 = Mi vi , i = 1, . . . , n − 1.
We simply call Markov model the model on the total correlation of S, formed by
distributions D, satisfying a Markov model, for some choice of the matrices.
If A transmits 60 times 0 and 120 times 1, the distribution observed on the total
correlation is
15 10
30 5
D =
10 60
20 30
The total marginalization of D is given by (60, 120), (75, 105), (85, 95). Since
one has
60 75 75 85
M = N =
120 105 105 95
it follows that D is the distribution of the Markov model associated to the matrices
M, N .
5.2 Markov Chains and Trees 67
From the previous example it is clear that, when we have three variables, the
Markov model with matrices M, N is formed by distributions D = (Di jk ) such that
Di jk = d j Mi j N jk , where (d1 , . . . , dn ) represents the marginalization of D over the
variables x2 , x3 .
Proposition 5.2.3 The distributions of the Markov model are exactly the distribu-
tions satisfying all conditional independences
xi xk |x j
Given two numbers h, k such that hk = D212 /D222 , we multiply the second column
of M by h and the second row of N by k.
The two matrices thus obtained, appropriately scaled, determine matrices M, N
describing the Markov model satisfied by D.
In the general case, the procedure is similar but more complicated.
For a more complete discussion, the reader is referred to the article by Eriksson,
Ranestad, Sturmfels, and Sullivant in [1].
Corollary 5.2.4 Markov’s chain models are algebraic models and also polynomial
parametric models (since, in general, many conditional independences are involved,
these models are not generally toric).
In this case, the obtained distributions are the same that we obtain by considering
the Markov model on the same system, ordered so that x3 → x2 → x1 , with matrices
N −1 , M −1 .
Thus the Markov chains, when the transition matrices are invertible, cannot distin-
guish who transmits the signal or receives it. From the point of view of distributions,
the two chains
M N N −1 M −1
x1 x2 x3 x3 x2 x1
Markov chains can be generalized to models which are defined over trees.
v j = Mi j vi , ∀i, j.
We simply call Markov model on G the model on the total correlation of S, formed
by distributions D satisfying the condition of a Tree Markov model on G for some
choice of the matrices Mi j .
Example 5.2.7 The Markov chains models are obviously examples of tree Markov
models.
The simplest example of tree Markov model, beyond chains, is the one shown in
the Example 5.1.6.
Remaining in the Example 5.1.6, it is immediate to understand that, for the same
motivations expressed in the Remark 5.2.5, when the matrix M AB is invertible, the
model associated with the scheme
A
MAB MAC
B C
−1
MAB MAC
B A C
The previous argument suggests that the tree Markov models are described by models
of conditional independence. In fact, the hint is valid because in a tree, given two
vertices xi , x j , there exists at most a minimal path that connects them.
Theorem 5.2.8 Given a tree G and a random system S whose variables are the
vertices x1 , . . . , xn of G, then a distribution D on the total correlation of S is in the
tree Markov model associated with G, for some choice of matrices Mi j , if and only
if D satisfies all conditional independencies
xi x j |xk
For the proof, refer to the book [2] or to the aforementioned article by Eriksson,
Ranestad, Sturmfels, and Sullivant in [1].
Example 5.2.9 Both the tree Markov model associated to the following tree
B C
B C|A.
A
G1 =
B
C D E
A
G2 =
B
C D E
Let us go back to the initial Examples 5.0.1 and 5.0.2 of this chapter.
The situation presented in these examples foresees the presence of hidden vari-
ables, that is, variables whose presence was not known at the beginning, but which
condition the dependence between the observable variables.
Also in the Example 5.2.10 a similar situation can occur. If the species A, B from
which the others derive are only hypothesized in the past, it is clear that one cannot
hope to observe the DNA, then the distributions on the variables A, B are not known,
so what we observe is not the true original tensor, but only its marginalization along
the variables A, B.
How can we hope to determine the presence of hidden variables?
One way is suggested by the Example 5.0.1 and uses the concept of rank (see the
Definition 6.3.3). In that situation, the distributions on the two observable variables
(A = football fan, B = hair loss) were represented by 3 × 3-matrices. The existence
of the hidden variable (G = gender) implied that the matrix of the distribution D was
the marginalization of a tensor T of type 3 × 3 × 2, whose scan along the hidden
5.3 Hidden Variables 71
D = D1 + · · · + Dr ,
with each Di of rank ≤ 1, hence the tensor whose elements along the first direction
are D1 , . . . , Dr represents the distribution D we looked for.
Definition 5.3.2 On the total correlation of a random system S we will call hidden
variable model with r states the subset of P(D(S)) formed by points corresponding
to tensors of rank ≤ r .
Because the rank of a tensor T is invariant when you multiply T for a constant
= 0, the definition is well placed in the world of projective distributions.
The model of independence is a particular (and degenerate) case of hidden variable
models.
Example 5.3.3 Consider a random dipole S, consisting of the variables A, B, having
a, b states, respectively. The distributions on S are represented by matrices Mof
type a × b.
When r < min{a, b}, the hidden variable model with r states is equal to the subset
of the matrices of rank ≤ r . It is clear that this model is (projective) algebraic, because
it is described by the vanishing of all subdeterminants (r + 1) × (r + 1), which are
homogeneous polynomials of degree r + 1 in the coefficients of the matrix M.
When r ≥ min{a, b}, the hidden variable model with r states can still be defined,
but it becomes trivial: all a × b-matrices have rank ≤ r .
The previous example can be generalized. The hidden variable models with r
states become trivial, i.e., they coincide with the entire space of distributions, for r
big enough. Furthermore, they are all projective parametric models, therefore also
projective algebraic, for Chow’s Theorem. The hidden variables models are in fact
linked to the geometric concept of secant variety of a subset of a projective space.
Here, we recall the basic definitions, but a more general treatment will be given in
Chap. 12.
72 5 Conditional Independence
Proposition 5.3.6 In the space of tensors P = P(K a1 ···an ), a tensor has rank ≤ r if
and only if it is in the r -secant variety of the Segre variety X .
It follows that the model with a hidden variable with r state corresponds to the
secant variety Sr0 (X ) of the Segre variety.
Proof By definition (see Proposition 10.5.12), the Segre variety X is exactly the set
of tensors of rank 1.
In general, if a tensor has rank ≤ r , then it is sum of r tensors of rank 1, hence it
lies in the r -secant variety of X .
Vice versa, if T lies in the r -secant variety of X , then there exist tensors T1 , . . . , Tr
of X (hence tensors or rank 1) such that
T = α1 T1 + · · · + αr Tr .
Secant varieties have long been studied in Projective Geometry for their applica-
tions to the study of projections of algebraic varieties. Their use in hidden variable
models is one of the largest points of contact between Algebraic Statistics and Alge-
braic Geometry.
An important fact in the study of hidden variable models is that (unfortunately)
such models are not algebraic models (and therefore not even projective parametric).
Proof We will use the tensor of rank 3 of Example 6.4.15, which proves, moreover,
that Y do not coincide with P.
Consider the tensors of the form D = uT1 + t T2 , where
5.3 Hidden Variables 73
2 3 0 0
0 3 1 0
T1 = T2 =
0 4 0 0
0 2 0 0
Such tensors span a space of dimension 2 in the vector space of tensors, hence a line
L ⊂ P. For (u, t) = (1, 1) we get the tensor D of Example 6.4.15, that has rank 3.
Hence L ⊂ Y .
We check now that all other points in L, different from D, are in Y . In fact if
D ∈ L \ {D}, then D = uT1 + t T2 , where (u, t) is not proportional to (1, 1), that
is u = t. Then D can be decomposed in the sum of two tensors of rank 1 as follows:
0 6t−12u
2t−2u 2u 6u
2t−2u
0 3t−6u
2t−2u t 3t
2t−2u
+
0 4u 0 0
0 2u 0 0
To overcome this problem, we define the algebraic secant variety and consequently
the algebraic model of hidden variable.
Q = αP1 + β P2 a, b ∈ C P1 , P2 ∈ X
where P1 = (a1 , a2 ) ⊗ (b1 , b2 ) ⊗ (c1 , c2 ) and P2 = (a1 , a2 ) ⊗ (b1 , b2 ) ⊗ (c1 , c2 ).
Unluckily this parameterization cannot be defined globally.
As a matter of fact, moving freely the parameters, we must consider also the cases
when P1 = P2 . In this situation, for some choice of α, β,the image would be the point
(0, . . . , 0), which does not exist in the projective setting. Hence, the parameterization
is only partial.
If we exclude the parameter values for which the image would give (0, . . . , 0), we
get a well-defined function on a Zariski open set of (P1 )7 . The image Y of that open,
however, is not a Zariski closed set of P7 . Zariski’s closure of Y in P7 coincides with
the whole P7 .
Part of the study of secant varieties is based on the calculation of the dimension.
From what has just been said in the previous remark, a limitation of the dimension
of algebraic secant varieties is always possible.
Proposition 5.3.11 The algebraic r -secant variety of the Segre variety X , image of
the Segre map of the product Pa1 × · · · × Pan , has dimension bounded by
Proof That the dimension of Sr (X ) is at most N depends on the fact that the dimen-
sion of an algebraic variety in P N can not exceed the dimension of the ambient space
(see Proposition 11.2.14).
5.3 Hidden Variables 75
If instead of the Segre map we take the Veronese map, a similar situation is
obtained.
Proposition 5.3.12 The algebraic r -secant variety of the Veronese variety X , image
of the Veronese map of degree d on Pn , has dimension bounded by
In both situations, we will call expected r -secant dimension of the Segre variety
(respectively, of the Veronese variety) the second member of the inequality (5.3.1)
(respectively of the inequality (5.3.2)).
N +1
rg ≥ .
a+1
Given N = n+d d
− 1, the generic symmetric rank rgs of symmetric tensors of type
(n + 1) × · · · × (n + 1) (d times) satisfies
N +1
rgs ≥ .
n+1
76 5 Conditional Independence
Note that, in general, there are tensors whose rank is lower than the generic rank,
but there may also be tensors whose rank is greater than the generic rank (this cannot
happen in the case of matrices). See the Example 6.4.15.
Example 5.3.16 In general, we could expect that the generic rank rg is exactly equal
to the smaller integer ≥ (N + 1)/(n + 1). This does not always occur. This is already
obvious, in the case of matrix spaces.
For tensors of higher dimension, consider the case of 3 × 3 × 3 tensors, for which
N = 26 and n = 6. The minimum integer ≥ (N + 1)/(n + 1) is 4, but the generic
rank is 5.
The tensors for which the generic rank is larger than the minimum integer greater
than or equal to (N + 1)/(n + 1) are called defective.
We know a few examples of defective tensors, but a complete classification of
them is not known. A discussion of the defectiveness (as a proof of the statement on
3 × 3 × 3 tensors) is beyond the scope of this Introduction and we refer to the text
of Landsberg [7].
The importance of the generic rank in the study of hidden variables is evident.
Given a random system S with variables x1 , . . . , xn , where xi has ai + 1 states,
the algebraic model of hidden variable with r states, on the total correlation of S,
is equivalent to the algebraic secant variety Sr (X ) where X is the Segre variety
of Pa1 × · · · × Pan . The distributions that are in this model should suggest that the
phenomenon under observation it is actually driven by a variable (in fact: hidden)
with r -states.
However, if r ≥ rg , this suggestion is null. In fact, in this case, Sr (X ) is equal
to the whole space of the distributions, then practically all of distributions suggest
the presence of such a variable. This, from the practical side, simply means that the
information given the additional hidden variable is null. In practice, therefore, the
existence or nonexistence of the hidden variable does not add any useful information
to the understanding of the phenomenon.
Example 5.3.17 Consider the study of DNA strings. If we observe the distribution
of the bases on 3 positions of the string, we get distributions described by 4 × 4 × 4
tensors. The tensors of this type are not defective, so being n = 9, N = 63, the
generic rank is 7.
The observation of a rank 6 distribution then suggests the presence of a hidden
variable with 6 states (as the subdivision of our sample into 6 different species).
The observation of a rank 7 distribution does not, therefore, give us any practical
evidence of the real existence of a hidden variable with 7 states.
If we really suspect the existence of a hidden variable (the species) with 7 or more
states, how can we verify this?
The answer is that such an observation is not possible considering only three
positions of DNA. However, if we go on to observe four positions, we get a 4 ×
4 × 4 × 4 tensor. The tensors of this type (which are not even them defective) have
5.3 Hidden Variables 77
generic rank equal to 256/13 = 20. If in this case we still get distributions of rank
7, which is much less than 20, our assumption received a formidable experimental
evidence.
References
1. Eriksson, N., Ranestad, K., Sullivant S., Sturmfels, B.: Phylogenetic algebraic geometry. In:
Ciliberto, C., Geramita, A., Harbourne, B., Roig, R-M., Ranestad, K. (eds.) Projective Varieties
with Unexpected Properties, pp. 237–255. De Gruyter, Berlin (2005)
2. Drton M., Sullivant S., Sturmfels B.: Lectures on Algebraic Statistics, Oberwolfach Seminars,
vol. 40. Birkhauser, Basel (2009)
3. Allman, E.S., Rhodes, J.A.: Phylogenetic invariants, chapter 4. In: Gascuel, O., Steel, M. (eds.)
Reconstructing Evolution New Mathematical and Computational Advances (2007)
4. Allman, E.S., Rhodes, J.A.: Molecular phylogenetics from an algebraic viewpoint. Stat. Sin.
17(4), 1299–1316 (2007)
5. Allman, E.S., Rhodes, J.A.: Phylogenetic ideals and varieties for the general Markov model.
Adv. Appl. Math. 40(2), 127–148 (2008)
6. Bocci, C.: Topics in phylogenetic algebraic geometry. Exp. Math. 25, 235–259 (2007)
7. Landsberg, J.M.: The geometry of tensors with applications. Graduate Studies in Mathematics
128. Providence, AMS (2012)
Part II
Multi-linear Algebra
Chapter 6
Tensors
The main objects of multi-linear algebra that we will use in the study of Algebraic
Statistics are multidimensional matrices, that we will call tensors.
One begins by observing that matrices are very versatile objects! One can use
them for keeping track of information in a systematic way. In this case, the entries in
the matrix are “place holders” for the information. Any elementary book on Matrix
Theory will be filled with examples (ranging from uses in Accounting, Biology, and
Combinatorics to uses in Zoology) which illustrate how thinking of matrices in this
way gives a very important perspective for certain types of applied problems.
On the other hand, from the first course on Linear Algebra, we know that matrices
can be used to describe important mathematical objects. For example, one can use
matrices to describe linear transformations between vector spaces or to represent
quadratic forms. Coupled with the calculus these ideas form the backbone of much
of mathematical thinking.
We want to now mention yet another way that matrices can be used: namely to
describe bilinear forms. To see this let M be an m × n matrix with entries from
the field K . Consider the two vector spaces K m and K n and suppose they have the
standard basis. If v ∈ K m and w ∈ K n we will represent them as 1 × m and 1 × n
matrices, respectively, where the entries in the matrices are the coordinates of v and
w with respect to the chosen basis. So, let
v = α1 · · · αm
and
w = β1 · · · βn .
Km × Kn → K
described by
(v, w) → v Mw t
where the expression on the right is simply the multiplication of three matrices
(t denoting matrix transpose). Notice that this function is linear both in K m and in
K n , and hence is called a bilinear form.
On the other hand, given any bilinear form B : K m × K n → K , i.e., a function
which is linear in both arguments, and choosing a basis for both K m and K n , we can
associate to that bilinear form an m × n, matrix, N , as follows: if {v1 , . . . , vm } is the
basis chosen for K m and {w1 , . . . , wn } is the basis chosen for K n then we form the
m × n matrix N = (n i, j ) where n i, j := B(vmi , w j ).
It is easy to see that if v ∈ K m , v = i=1 αi vi and w ∈ K n , w = nj=1 β j w j
then ⎛ ⎞
β1
⎜.⎟
B(v, w) = α1 · · · αm N ⎝ .. ⎠ .
βn
(where this time the 1 occurs in the jth place in this 1 × n matrix) then the product
v Mw t is precisely the (i, j) entry in the matrix M. But, as we noted above, this is the
value of the distribution on the (i, j) element in the alphabet of the unique random
variable in the total correlation we described above.
6.1 Basic Definitions 83
So, although the matrix M started out being considered simply as a place holder
for information, we see that considering it as a bilinear form on an appropriate pair of
vector spaces it can also be used to give us information about the original distribution.
Tensors will give us a way to generalize what we have just seen for two random
variables to any finite number of random variables. So, tensors will encode infor-
mation about the connections between distinct variables in a random system. As
the study of the properties of such connections is a fundamental goal in Algebraic
Statistics, it is clear that the role of tensors is ubiquitous in this book.
From the discussion above concerning bilinear forms and matrices, we see that
we have a choice as to how to proceed. We can define tensors as multidimensional
arrays or we can define tensors as multi-linear functions on a cartesian product of
a finite number of vector spaces. Both points of view are equally valid and will
eventually bring us to the same place. The two ways are equivalent, as we saw above
for bilinear forms, although sometimes one point of view is preferable to the other.
We will continue with both points of view but, for low dimensional tensors, we will
usually prefer to deal with the multidimensional arrays.
Before we get too involved in studying tensors, this is probably a good time
to forewarn the reader that although matrices are very familiar objects for which
there are well-understood tools to aid in their study, that is far from the case for
multidimensional matrices, i.e., tensors. The search for appropriate tools to study
tensors is part of ongoing research. The abundance of research on tensors (research
being carried out by mathematicians, computer scientists, statisticians, and engineers
as well as by people in other scientific fields) attests to the importance that these
objects have nowadays in real-life applications.
Notation. For every positive integer i, we will denote by [i] the set {1, . . . , i}.
For the rest of the section, K can indicate any set, but in practice, K will always
be a set of numbers (like N, Z, Q, R, or C).
a1 × · · · × an
T : [a1 ] × · · · × [an ] → K .
T : K a1 × · · · × K an → K
Remark 6.1.4 If we think of T as a multi-linear map and suppose that for each
1 ≤ i ≤ n, {eij | 1 ≤ j ≤ ai } is the standard basis for K ai then the entry in the
multidimensional array representation of T corresponding to the multi-index
(i 1 , . . . , i n ) is
T (ei11 , ei22 , . . . , einn ) .
4 0
T =
−1 3
4 7
Notation. Although we have written a 2 × 2 × 2 tensor above, we have not made
clear which place in that array corresponds to T (ei11 , ei22 , ei33 ). We will have to make
a convention about that. Again, the conventions in the case of three-dimensional
tensors are not uniform across all books on Multi-linear Algebra, but we will attempt
to motivate the notation that we use, and is most common, by looking at the cases
in which there is widespread agreement, i.e., the cases of one-dimensional and two-
dimensional tensors.
Let’s start by recalling the conventions for how to represent a one-dimensional
tensor, i.e., a linear function
T : Kn → K .
T : Km × Kn → K .
then
A = (αi, j ) where αi, j := T (ei1 , e2j ) .
So ⎛ ⎞
α1,1 α1,2 · · · α1,n
⎜ . ⎟
A = ⎝ ... ..
. · · · .. ⎠
αm,1 αm,2 · · · αm,n
be the standard basis for K m and K n respectively (as above) and let
Then
A = (αi, j,k ) where α(i, j,k) := T (ei1 , e2j , ek3 ) .
How will we arrange these values in a rectangular box? We let the front (or first )
face of the box be the m × n matrix whose (i, j) entry is T (ei1 , e2j , e13 ). The second
face, parallel to the first face, is the m × n matrix whose (i, j) entry is T (ei1 , e2j , e23 ).
We continue in this way so that the back face (the r th face), parallel to the first face
is the m × n matrix whose (i, j) entry is T (ei1 , e2j , er3 ).
2 4
1 2
4 8
T1 = 2 4
6 12
3 6
To be assured that you have the conventions straight for trilinear forms, verify that
the three-dimensional tensor of type 3 × 2 × 2 whose multidimensional matrix rep-
resentation has entries (i, j, k) = i + j + k, looks like
4 5
3 4
5 6
T2 = 4 5
6 7
5 6
Remark 6.1.7 We saw above that elements of K n can be considered as tensors of
dimension 1 and type n. Notice that they can also be considered as tensors of dimen-
sion 2 and type 1 × n, or tensors of dimension 3 and type 1 × 1 × n, etc.
Similarly, n × m matrices are tensors of dimension 2 but they can also be seen as
tensors of dimension 3 and type 1 × n × m, etc.
Elements of K can be seen as tensors of dimension 0.
As a generalization of what we can do with matrices, we mention the following
easy fact.
Proposition 6.1.8 When K is a field, the set of all tensors of fixed dimension n and
type a1 × · · · × an is a vector space where the operations are defined over elements
with corresponding multi-indices.
This space, whose dimension is the product a1 · · · an , will be denoted by K a1 ,...,an .
One basis for this vector space is obtained by considering all the multidimensional
6.1 Basic Definitions 87
matrices with a 1 in precisely one place and a zero in every other place. If that unique
1 is in the position (i 1 , . . . , i n ), we refer to that basis vector as e(i1 ,...,in ) .
The null element of a space of tensors is the tensor having all entries equal to 0.
Now that we have established our convention about how the entries in a multidi-
mensional array can be thought of, it remains to be precise about how a multidimen-
sional array gives us a multi-linear map.
So, suppose we have a tensor T which is a tensor of dimension n and type a1 ×
· · · × an . Let A = (αi1 ,i2 ,...,in ), where 1 ≤ i j ≤ a j , be the multidimensional array
which represents this tensor. We want to use A to define a multi-linear map
T : K a1 × · · · × K an → K
[ j]
Now if {ei | 1 ≤ i ≤ a j , 1 ≤ j ≤ n} is the standard basis for K a j then it is easy to
see that
T (ei[1]
1
, . . . , ei[n]
n
) = αi1 ,i2 ,...,in
Besides the natural operations (addition and scalar multiplication) between tensors of
the same type, there is another operation, the tensor product, which combines tensors
of any type. This tensor product is fundamental for our analysis of the properties of
tensors.
The simplest way to define the tensor product is to think of tensors as multi-linear
maps. With that in mind, we make the following definition.
Definition 6.2.1 Let T ∈ K a1 ,...,an , U ∈ K a1 ,...,am be tensors. We define the tensor
product T ⊗ U as the tensor W ∈ K a1 ,...,an ,a1 ,...,am such that
if vi ∈ K ai , w j ∈ K a j then
We extend this definition to consider more factors. So, for any finite collection
of tensors T j ∈ K a j1 ,...,a jn j , j = 1, . . . , m, one can define their tensor product as the
tensor
W = T1 ⊗ · · · ⊗ Tm ∈ K a11 ,...,a1n1 ,...,am1 ,...,amnm
such that
W (i 11 , . . . , i 1n 1 , . . . , i m1 , . . . , i mn m ) = T1 (i 11 , . . . , i 1n 1 ) · · · Tm (i m1 , . . . , i mn m ).
This innocent looking definition actually contains some new and wonderful ideas.
The following examples will illustrate some of the things that come from the defi-
nition. The reader should keep in mind how different this multiplication is from the
usual multiplication that we know for matrices.
m
v : K m → K defined by: v(x1 , . . . , xm ) = αi xi
i=1
n
w : K n → K defined by: w(y1 , . . . , yn ) = βi yi .
i=1
v ⊗ w : Km × Kn → K
defined by
m
n
v ⊗ w : ((x1 , . . . , xm ), (y1 , . . . , yn )) → ( αi xi )( βi yi ).
i=1 i=1
If we let {e1 , . . . , em } be the standard basis for K m and {e1 , . . . , en } be the standard
basis for K n then
v ⊗ w : (ei , ej ) → αi β j
We could just as well have considered the tensor w ⊗ v. In the specific example
we just considered, notice that
⎞ ⎛ ⎛ ⎞
2 2 4
w ⊗ v = w t v = ⎝−1⎠ 1 2 = ⎝−1 −2⎠ = (v t w)t .
3 3 6
We see here that the tensor product is not commutative. In fact, the two multipli-
cations did not even give us tensors of the same type.
Example 6.2.3 Let’s now consider a slightly more complicated example. This time
we will take the tensor product of v, a one-dimensional tensor of type 2, and multiply
it by w, a two-dimensional tensor of type 2 × 2. We can represent v by a 1 × 2 matrix
and w by a 2 × 2 matrix. So, let
2 −1
v = (2, −3) ∈ R2 and w =
4 3.
v ⊗ w : (K 2 ) × (K 2 × K 2 ) → K
defined by
−2 6
4 8
v⊗w =
3 −9
−6 (−12)
4 −2
w⊗v =
−12 −9
8 6
As we just noted, the tensor product does not define an internal operation in the
spaces of tensors of the same dimension and same type. It is possible, however,
to define something called the tensor algebra on which the tensor product behaves
like a product. We will just give the definition of the tensor algebra, but won’t have
occasion to use it in this text.
Definition 6.2.5 Let K be a field. The tensor algebra over the space K n is the direct
sum
T (n) = K ⊕ K n ⊕ K n,n ⊕ · · · ⊕ K n,...,n ⊕ · · ·
Remark 6.2.6 It is an easy (but messy) consequence of our definition that the tensor
product is an associative product, i.e., if T, U, V are tensors, then
T ⊗ (U ⊗ V ) = (T ⊗ U ) ⊗ V.
Notice that the tensor product is not, in general, a commutative product (see
Example 6.2.3 above). Indeed, in that example we saw that even the spaces in which
T ⊗ U and U ⊗ T lie can be different.
Remark 6.2.7 The tensor product of tensors has the following properties: for any
T, T ∈ K a1 ,...,an , U, U ∈ K a1 ,...,am and λ ∈ K , one has
• T ⊗ (U + U ) = T ⊗ U + T ⊗ U ;
• (T + T ) ⊗ U = T ⊗ U + T ⊗ U ;
• (λT ) ⊗ U = T ⊗ (λU ) = λ(T ⊗ U ).
This can be expressed by saying that the tensor product is linear over the two factors.
which is linear in any factor. For this reason, we say that the tensor product is a
multi-linear product in its factors.
The following useful proposition holds for the tensor product.
determined by the tensor product is not injective (as the Vanishing Law clearly
shows). However, we can characterize tensors T, T ∈ K a1 ,...,an and U, U ∈ K a1 ,...,am
such that T ⊗ U = T ⊗ U .
Proposition 6.2.9 Let T, T ∈ K a1 ,...,an and U, U ∈ K a1 ,...,am satisfy
T ⊗ U = T ⊗ U = 0.
Ti1 ,...,in
Uk1 ,...,km = Uk1 ,...,km = βUk1 ,...,km ,
Ti1 ,...,in
By using the associativity of the tensor product and slightly modifying the proof
of the preceding proposition one can prove, by induction on the number of factors,
the following result:
T1 ⊗ T2 ⊗ · · · ⊗ Ts = U1 ⊗ U2 ⊗ · · · ⊗ Us = 0.
Then there exist nonzero scalars α1 , . . . , αs ∈ K such that Ui = αi Ti for all i, and
moreover α1 · · · αs = 1.
Remark 6.2.11 We mentioned above that the tensor product of two bilinear forms,
represented by matrices M and N , respectively, doesn’t correspond to the product of
6.2 The Tensor Product 93
the two matrices M and N . Indeed, in most cases, we cannot even take the product
of the two matrices!
However, when M is an n × m matrix and N is an m × s matrix we can form
their product as matrices and also form their tensor product. It turns out that there is
a relation between these two objects.
The tensor product is an element of the vector space K n × K m × K m × K s while
the matrix product can be considered as an element of K n × K s . How can we recover
the regular product from the tensor product?
Now the tensor product is the tensor Q of dimension 4 and type (n, m, m, s), such
that Q(i, j, k, l) = M(i, j)N (k, l). The row-by-column product of M, N is obtained
by sending Q to the matrix Z ∈ K n,s defined by
Z (i, l) = Q(i, j, j, l).
j
So, the ordinary matrix product is obtained, in this case, by taking the tensor
product and following that by a projection onto the space K n × K s = K n,s .
Thus, one can define matrices of rank 1 in terms of the tensor product of vectors.
94 6 Tensors
Although the rank of a matrix M is usually defined as the dimension of either the
row space or column space of M, we now give a neat characterization of rank(M)
in terms of matrices of rank 1.
M = AB
and so the rows of M are linear combinations of the rows of B. Since B has only r
rows we obtain that rank(M) ≤ r .
Conversely, assume that M has rank r . Then we can find r linearly inde-
pendent vectors in K n which generate the row space of M. Call those vectors
w1 , . . . , wr . Suppose that the ith row of M is ci,1 w1 + · · · + ci,r wr . Form the vector
vi = (ci,1 , . . . , ci,r ) and construct a matrix A
whose ith column is vit . If B is the
matrix whose jth row is w j then M = AB = ri=1 vit wi is a sum of r matrices of
rank 1 and we are done.
The two previous results on matrices allow us to extend the definition of rank to
tensors of any type.
Definition 6.3.3 A nonzero tensor T ∈ K a1 ,...,an has rank 1 if there are vectors vi ∈
K ai such that T = v1 ⊗ · · · ⊗ vn . (since the tensor product is associative, there is no
need to specify the order in which the tensor products in the formula are performed).
We define the rank of a nonzero tensor T to be the minimum r such that there
exist r tensors T1 , . . . , Tr of rank 1 with
T = T1 + · · · + Tr . (6.3.1)
By convention we say that null tensors, i.e., tensors whose entries are all 0, have
rank 0.
Remark 6.3.5 Let T be a tensor of rank 1 and let α = 0, α ∈ K . Then, using the
multi-linearity of the tensor product, we see that αT also has rank 1. More generally,
if T has rank r then αT also has rank r . Then (exactly as for matrices), the union
of the null tensor with all the tensors in K a1 ,...,an of rank r is closed under scalar
multiplication.
6.3 Rank of Tensors 95
Subsets of vector spaces that are closed under scalar multiplication are called
cones. Thus the set of tensors in K a1 ,...,an of fixed rank (plus 0) is a cone.
On the other hand (again exactly as happens for matrices), in general the sum of
two tensors in K a1 ,...,an of rank r need not have rank r . Thus, the set of tensors in
K a1 ,...,an having fixed rank (union the null tensor) is not a subspace of K a1 ,...,an .
f : [m] → [n]
With this technical definition made we are now able to define the notion of a
subtensor of a given tensor.
T : [a1 ] × · · · × [an ] → K .
where
Ti1 ...in = Ti f1 (i1 ) ...i fn (in ) .
Remark 6.4.3 This is a formal (and perhaps a bit odd) way to say that we are fixing
a few values for the indices i 1 , . . . , i n and forgetting the elements of T whose kth
index is not in the range of the map f k .
Since we usually think of a tensor of type 1 × a2 × · · · × an as a tensor of type
a2 × · · · × an , when a ak = 1 we simply forget the kth index in T . In this case, the
dimension of T is n − m, where m is the number of indices for which ak = 1.
96 6 Tensors
T111 T121
T212 T222
T =
T211 T221
T312 T322
T311 T321
and an instance is
1 0
−2 4
2 3
T =
0 1
−3 4
2 1
−2 4
T =
−3 4
2 1
i.e. one just cancels the layer corresponding to the elements whose first index is 2.
If, instead, one takes f 2 = f 3 = identity, f 1 : [1] → [3] defined as f 1 (1) = 1,
then one gets the matrix in the top face:
−2 4
T =
1 0
6.4 Tensors of Rank 1 97
is not a submatrix of T .
Proof Assume that T ∈ K a1 ,...,an has rank 1. Then there exist vectors vi ∈ K ai such
that T = v1 ⊗ · · · ⊗ vn . Eliminating from T the elements whose kth index has some
value q corresponds to eliminating the qth component in the vector vk . Thus, the
corresponding subtensor T is the tensor product of the vectors v1 , . . . , vn , where vi =
vi if i = k, and vk is the vector obtained from vk by eliminating the qth component.
Thus T has rank ≤ 1 (it has rank 0 if vk = 0). For a general subtensor T ∈ K a1 ,...,an
of T , we obtain the result arguing step by step, by deleting each time one value for
one index of T , i.e., arguing by induction on (a1 + · · · + an ) − (a1 + · · · + an ).
The second claim in the statement of the theorem is immediate from what we
have just said and the fact that a 2 × 2 matrix of rank 1 has determinant 0.
Corollary 6.4.7 The rank of a subtensor of T cannot be bigger than the rank of T .
Proof If T has rank 1, the claim follows from Proposition 6.4.6. For tensors T of
higher rank r , the claim follows since if T = T1 + · · · + Tk , with Ti of rank 1, then a
subtensor T of T is equal to T1 + · · · + Tk , where Ti is the subtensor of Ti obtained
by eliminating all the elements corresponding to elements of T eliminated in the
passage from T to T . Thus, by Proposition 6.4.6 each Ti is either 0 or it has rank 1,
and the claim follows.
Example 6.4.8 Recall that a nonzero matrix has rank 1 if and only if all of its 2 × 2
submatrices have determinant equal to zero. This is not true for tensors of dimension
greater than 2, as the following example shows. Recall our earlier warning about the
subtle differences between matrices and tensors of dimension greater than 2.
Consider the 2 × 2 × 2 tensor T , defined by
0 1
0 0
T =
0 0
1 0
We want to find a set of conditions which describe the set of all tensors of rank 1.
To this aim, we need to introduce some new piece of notation.
Notation. Recall that we denote by [n] the set {1, . . . , n}.
Fix a subset J ⊂ [n]. Then for any fixed pair of multi-indices I1 = (k1 , . . . , kn )
and I2 = (l1 , . . . , ln ), we denote by J (I1 , I2 ) the multi-index (m 1 , . . . , m n ) where
kj if j ∈ J,
mj =
lj otherwise.
Example 6.4.9 Let n = 4 and set J = {2, 3} ⊂ [4]. Consider the two multi-indices
I1 = (1, 3, 3, 2) and I2 = (2, 1, 3, 4). Then J (I1 , I2 ) = (2, 3, 3, 4). Notice that if
J = [n] \ J = {1, 4} then J (I1 , I2 ) = (1, 1, 3, 2).
Remark 6.4.10 If T has rank 1, then for any pair of multi-indices I1 = (k1 , . . . , kn )
and I2 = (l1 , . . . , ln ) and for any subset J ⊂ [n], the entries of T satisfy:
where J = [n] \ J .
To see why this is so recall that since T has rank 1 we can write T = v1 ⊗ · · · ⊗ vn ,
with vi = (vi1 , vi2 , . . . ). In this case both of the products in (6.4.1) are equal
Remark 6.4.11 When I1 , I2 differ only in two indices, the equality (6.4.1) simply
says that the determinant of a 2 × 2 submatrix of T is 0.
6.4 Tensors of Rank 1 99
Example 6.4.12 Look back to Example 6.4.8, and notice that if one takes I1 =
(1, 1, 1), I2 = (2, 2, 2) and J = {1} ⊂ [3], then J (I1 , I2 ) = (1, 2, 2) and J (I1 , I2 ) =
(2, 1, 1) so that formula (6.4.1) does not hold, since
Proof Thanks to Remark 6.4.10, we need only prove that if all the equalities (6.4.1)
hold, then T has rank 1.
Let us argue by induction on the dimension n of T ∈ K a1 ,...,an . The case n = 2 is
well known: a matrix has rank 1 if and only if all its 2 × 2 minors vanish.
For n > 2, pick an entry TI1 = Tk1 ,...,kn = 0 in T .
Let J1 ⊂ [a1 ] where J1 = {1} and let f 1 : J1 → [ai ] be defined by f 1 (1) = k1 .
For 2 ≤ i ≤ n, let f i = identit y. Let T be the subtensor corresponding to these
data. T is a tensor of dimension n − 1 and hence satisfies the equalities (6.4.1). By
induction, we obtain that rank(T ) = 1, so there are vectors v2 , . . . vn such that, for
any choice of i 2 , . . . , i n , one gets
Tm,k2 ,...,kn
pm = . (b)
Tk1 ,k2 ,...,kn
TI1 TI2 = TJ (I1 ,I2 ) TJ (I1 ,I2 ) = Tk1 ,l2 ,...,ln Tl1 ,k2 ,...,kn = Tk1 ,l2 ,...,ln · pl1 Tk1 ,k2 ,...,kn .
Using the terms at the beginning and end of this string of equalities and also taking
into account (a) and (b) above, we obtain
Since TI1 = 0, and hence v2k2 , . . . , vnkn = 0, we can divide both sides of this
equality by v2k2 , . . . , vnkn and finally get
Observe that the rank 1 analysis of a tensor reduces to compute if finitely many
flattening matrices have rank one (see Definition 8.3.4 and Proposition 8.3.7), and
this can be accomplished with Gaussian elimination as well, without the need to
compute all 2 × 2 minors.
The equations corresponding to the equalities (6.4.1) determine a set of polyno-
mial (quadratic) equations, in the space of tensors K a1 ,...,an , which describe the locus
of decomposable tensors (interestingly enough, it turns out that in many cases this
set of equations is not minimal).
In any event, Theorem 6.4.13 provides a finite procedure which allows us to decide
if a given tensor has rank 1 or not. We simply plug the coordinates of the given tensor
into the equations we just described and see if all the equations vanish or not.
Moreover, in Theorem 6.4.13 it is enough to take subsets J given by one single
element, and even are sufficient n − 1 of them. Unfortunately, as the dimension
grows, the number of operations required in the algorithm rapidly becomes quite
large!
Recall that for matrices there is a much simpler method for calculating the rank
of the matrix: one uses Gaussian reduction to find out how many nonzero rows
that reduction has. That number is the rank. We really don’t have to calculate the
determinants of all the 2 × 2 submatrices of the original matrix.
There is nothing like the simple and well-known Gaussian reduction algorithm
(which incidentally calculates the rank for a tensor of dimension 2) for calculating
the rank of tensors of dimension greater than 2. All known procedures for calculating
the rank of such a tensor quickly become not effective.
There are many other ways in which the behavior of rank for tensors having
dimension greater than 2 differs considerably from the behavior of rank for matrices
(tensors of dimension exactly 2). For example, although a matrix of size m × n (a two-
dimensional tensor of type (m, n)) cannot have rank which exceeds the minimum
of m and n, tensors of type a1 × · · · × an (for n > 2) may have rank bigger than
max{ai }. Although the general matrix of size m × n has rank = min{m, n} (the
maximum possible rank) there are often special tensors of a given dimension and
type whose rank is bigger than the rank of a general tensor of that dimension and
type.
The attempt to get a clearer picture of how rank behaves for tensors of a given
dimension and type has many difficult problems associated to it. Is there some nice
geometric structure for the set of tensors having a given rank? Is the set of tensors
of prescribed rank not empty? What is the maximum rank for a tensor of given
dimension and type? These questions, and several variants of them, are the subject
of research for many mathematicians and other scientists today.
We conclude this section with some examples which illustrate that although there
is no algorithm for finding the rank of a given tensor, one can sometimes decide,
using ad hoc methods, exactly what is the rank of the tensor.
6.4 Tensors of Rank 1 101
3 −1
2 −6
3 −5
Indeed it cannot have rank 1, because some of its 2 × 2 submatrices have deter-
minant different from 0. T has rank 2 because it is the sum of two tensors of rank 1
(one can check, using the algorithm, that the summands have rank 1):
2 2 2 −2
1 1 2 −2
+
−2 −2 4 −4
−1 −1 4 −4
1 3
D=
0 4
0 2
has rank 3, i.e., one cannot write D as a sum of two tensors of rank 1. Let us see why.
Let’s assume that D is the sum of two tensors T = (Ti jk ) e T = (Tijk ) of rank 1
and let’s try to derive a contradiction from that assumption.
Notice that the vector (D211 , D212 ) = (0, 0) would have to be equal to the sum
of the vectors (T211 , T212 ) + (T211 , T212 ), Consequently, the two vectors (T211 , T212 )
and (T211 , T212 ) are opposite of each other and hence span a subspace W ⊂ K 2 of
dimension ≤ 1.
102 6 Tensors
If one (hence both) of these vectors is nonzero, then also the vectors (T111 , T112 ),
(T221 , T222 ), and (T111 , T112 ), (T221 , T222 ), would also have to belong to W because all
the 2 × 2 determinants of T and T vanish. But notice that (T121 , T122 ) and (T121 , T122 )
must also belong to W by Remark 6.4.10 (take J = {3} ⊂ [3]).
It follows that both vectors (D111 , D112 ) = (1, 2) and (D121 , D122 ) = (3, 3), must
belong to W . This is a contradiction, since dim(W ) = 1 and (1, 2), (3, 3) are linearly
independent.
So, we are forced to the conclusion that (T211 , T212 ) = (T211 , T212 ) = (0, 0). Since
the sum of (T111 , T112 ) and (T111 , T112 ) is (1, 2) = (0, 0) we may assume that one
of them, say (T111 , T112 ), is nonzero. As T has rank 1, there exists a ∈ K such that
(T221 , T222 ) = a(T111 , T112 ) (we are again using Remark 6.4.10).
Now, the determinant of the front face of the tensor T is 0, i.e.,
1 3 0 0 0 0
+ +
0 0 0 0 0 4
0 0 0 0 0 2
6.5 Exercises
2 2
6 3
T = 4 2
9 3
6 2
Example 7.1.2 When T is a square matrix, then the condition for the symmetry of
T simply requires that Ti, j = T j,i for any choice of the indices. In other words, our
definition of symmetric tensor coincides with the plain old definition of symmetric
matrix, when T has dimension 2.
If T is a cubic tensor of type 2 × 2 × 2, then T is symmetric if and only if the
following equalities hold:
T1,1,2 = T1,2,1 = T2,1,1
T2,2,1 = T2,1,2 = T1,2,2
3 2
1 3
2 0
3 2
Remark 7.1.3 The set of symmetric tensors is a linear subspace of K d,...,d . Namely,
it is defined by a set of linear equations:
Next step is the study of the behavior of symmetric tensors with respect to the rank. It
is easy to realize that there are symmetric tensors of rank 1, i.e., the space Sym n (K d )
intersects the set of decomposable tensors. Just to give an instance, look at:
4 8
2 4
D=
2 4
1 2
If moreover K is an algebraically closed field (as the complex field C), then we
may assume λ = 1.
thus T is symmetric.
Conversely, assume that T is symmetric of rank 1, say T = v1 ⊗ · · · ⊗ vn , where
no vi ∈ K d can be 0, by Proposition 6.2.8. Write vi = (vi,1 , . . . , vi,d ) and fix a multi-
index (i 1 , . . . , i n ) such that v1,i1 = 0, …, vn,in = 0. Then Ti1 ,...,in = v1,i1 · · · vn,in can-
not vanish. Define b2 = v2,i1 /v1,i1 . Then we claim that v2 = b2 v1 . Namely, for all j
we have, by symmetry:
v1,i1 v2, j v3,i3 · · · vn,in = Ti1 , j,i3 ,...,in = T j,i1 ,i3 ,...,in = v1, j v2,i1 v3,i3 · · · vn,in ,
which means that v1,i1 v2, j = v1, j v2,i1 , so that v2, j = b2 v1, j . Similarly we can define
b3 = v3,i1 /v1,i1 ,…, bd = vd,i1 /v1,i1 , and obtain that v3 = b3 v1 , …, vd = bd v1 . Thus,
if λ = b2 · b3 · · · · · bd , then
When K is algebraically close, then take a dth root β of λ ∈ K and define v = βv1 .
Then T = β d (v1 ⊗ v1 ⊗ · · · ⊗ v1 ) = v ⊗ v ⊗ · · · ⊗ v.
Notice that purely algebraic properties of K can be relevant in determining the shape
of a decomposition of a tensor.
Remark 7.2.2 In the sequel, we will often write v ⊗d for v ⊗ v ⊗ · · · ⊗ v, d times.
If K is algebraically closed, then a symmetric tensor T ∈ Sym n (K d ) of rank 1
has a finite number (exactly: d) decompositions as a product T = v ⊗d .
Namely if w ⊗ · · · ⊗ w = v ⊗ · · · ⊗ v, then by Proposition 6.2.9 there exists a
scalar β such that w = βv and moreover β d = 1, thus w is equal to v multiplied by
a dth root of unity.
Passing from rank 1 to higher ranks, the situation becomes suddenly more involved.
108 7 Symmetric Tensors
T = T1 + · · · + Tr .
Then, the natural question is about which choice gives the correct definition. Here,
correct definition means the definition which proves to be most useful, for the appli-
cations to Multi-linear Algebra and random systems.
The reader could be disappointed in knowing that there is no clear preference
between the two options: each can be preferable, depending on the point of view.
Thus, we will leave the word rank for the minimum r for which one has a decom-
position T = T1 + · · · + Tr , with the Ti ’s not necessarily symmetric (i.e., the first
choice above).
Then, we give the following:
Definition 7.2.3 The symmetric rank srank(T ) of a symmetric tensor T ∈ Sym n
(K d ) is the minimum r for which one has r symmetric decomposable tensors T1 ,…,Tr
with
T = T1 + · · · + Tr .
0 2
T =
0 2
2 0
has not rank 1, as one can compute by taking the determinant of some face.
T has rank 2, because it is expressible as the sum of two decomposable tensors
T = T1 + T2 , where
7.2 The Rank of a Symmetric Tensor 109
1 1
1 1
T1 =
1 1
1 1
1 −1
−1 1
T2 =
−1 1
1 −1
1 0
T =
7 8
0 7
is not decomposable. Let us prove that the symmetric rank is bigger than 2.
Assume that T = (a, b)⊗3 + (c, d)⊗3 . Then we have
⎧ 3 3
⎪
⎪a c =1
⎪ 2
⎨a b + c2 d =0
⎪ab2 + cd 2
⎪ =7
⎪
⎩ 3 3
b d = 8.
srank(T ) ≥ rank(T ).
110 7 Symmetric Tensors
Very recently, Shitov found an example where the strict inequality holds (see [1]).
Shitov’s example is quite peculiar: the tensor has dimension 3 and type 800 ×
800 × 800, whose rank is very high with respect of general tensors of same dimension
and type.
The difficulty in finding examples where the two ranks are different, despite the
large number of concrete tensors tested, suggested to the French mathematician Pierre
Comon to launch the following:
Problem 7.2.7 (Comon 2000) Find conditions such that the symmetric rank and rank
of a symmetric tensor coincide.
In other words, find conditions for T ∈ Sym n (Cd ) such that if there exists a
decomposition T = T1 + · · · + Tr in terms of tensors of rank 1, then there exists also
a decomposition with the same number of summands, in which each Ti is symmetric,
of rank 1.
The condition is known for some types of tensors. For instance, it is easy to prove
that the Comon Problem holds for any symmetric matrix T (and this is left as an
exercise at the end of the chapter).
The reader could wonder that such a question, which seems rather elementary
in its formulation, could yield a problem which is still open, after being studied by
many mathematicians, with modern techniques.
This explains a reason why, at the beginning of the chapter, we warned the reader
that problems that are simple for Linear Algebra and matrices can suddenly become
prohibitive, as the dimension of the tensors grows.
Homogeneous polynomials and symmetric tensors are two apparently rather different
mathematical objects, that indeed have a strict interaction, so that one can skip from
each other, translating properties of tensors to properties of polynomials, and vice
versa.
The main construction behind this interaction is probably well known to the
reader, for the case of polynomials of degree 2. It is a standard fact that one can
associate a symmetric matrix to quadratic homogeneous polynomial, in a one-to-one
correspondence, so that properties of the quadratic form (as well as properties of
quadratic hypersurfaces) can be read as properties of the associated matrix.
The aim of this section is to point out that a similar correspondence holds, more
generally, between homogeneous forms of any degree and symmetric tensors of
higher dimension.
Definition 7.3.1 There is a natural map between a space K n,...,n of cubic tensors of
dimension d and the space of homogeneous polynomials of degree d in n variables
(i.e., the dth graded piece Rd of the ring of polynomials R = K [x1 , . . . , xn ]), defined
by sending a tensor T to the polynomial FT such that
7.3 Symmetric Tensors and Polynomials 111
FT = Ti1 ,...,in xi1 · · · xin .
i 1 ,...,i n
It is clear that the previous correspondence is not one to one, as soon as general
tensors are considered. Namely, for the case n, d = 2, one immediately sees that the
two matrices
2 3 20
−1 1 21
define the same polynomial of degree 2 in two variables F = 2x12 + 2x1 x2 + x22 .
The correspondence becomes one to one (and onto) when restricted to symmetric
tensors. To see this, we need to introduce a piece of notation.
Definition 7.3.2 For any multi-index (i 1 , . . . , i d ), we will define the multiplicity
m(i 1 , . . . , i d ) as the number of different permutations of the multi-index.
Definition 7.3.3 Let R = K [x1 , . . . , xn ] be the ring of polynomials, with coeffi-
cients in K , with n variables. Then there are linear isomorphisms
p : Sym d (K n ) → Rd t : Rd → Sym d (K n )
defined as follows. The map p is the restriction to Sym d (K n ) of the previous map
p(T ) = Ti1 ,...,id xi1 · · · xid .
i 1 ,...,i d
The map t is defined by sending the polynomial F to the tensor t (F) such that
1
t (F)i1 ,...,id = (the coefficient of xi1 · · · xid in F).
m(i 1 , . . . , i d )
1 −1
T =
3 −2
−1 3
It is an easy exercise to prove that the two maps p and t defined above are inverse to
each other.
Once the correspondence is settled, one can easily speak about the rank or the the
symmetric rank of a polynomial.
Definition 7.3.6 For any homogeneous polynomial G ∈ K [x1 , . . . , xn ], we define
the rank (respectively, the symmetric rank) of G as the rank (respectively, the sym-
metric rank) of the associated tensor t (G).
Example 7.3.7 The polynomial G = x13 + 21x1 x22 + 8x23 has rank 3, since the asso-
ciated tensor t (G) is exactly the 2 × 2 × 2 symmetric tensor of Example 7.2.5.
d
n+d −a−2
dim(Sym d (K n )) = ,
a=0
d −a
n+d−1
and the sum is d
, by standard facts on binomial coefficients.
7.4 The Complexity of Polynomials 113
In this section, we rephrase the results on the rank of symmetric tensors in terms of
the associated polynomials.
It will turn out that the rank decomposition of a polynomial is the analogue of a
long-standing series of problems in Number Theory, for the expression of integers
as a sum of powers.
In principle, from the point of view of Algebraic Statistic, the complexity of
a polynomial is the complexity of the associated symmetric tensor. So, the most
elementary case of polynomials corresponds to symmetric tensor of rank 1. We start
with a description of polynomials of this type.
Remark 7.4.1 Before we proceed, we need to come back to the multiplicity of a
multi-index J = (i 1 , . . . , i d ), introduced in Definition 7.3.2.
In the correspondence between polynomials and tensors, the element Ti1 ,...,id is
linked with the coefficient of the monomial xi1 · · · xid . Notice that i 1 , . . . , i d need not
be distinct, so the monomial xi1 · · · xid could be written unproperly. The usual way
in which xi1 · · · xid is written is:
G = L d1 + · · · + L rd . (7.1)
The symmetric rank is the number that computes the complexity of symmetric ten-
sors, hence the complexity of homogeneous polynomials, from the point of view of
Algebraic Statistics. Hence, it turns out that the simplest polynomials, in this sense,
are powers of linear forms. We guess that nobody will object to the statement that
powers are rather simple!
We should mention, however, that sometimes the behavior of polynomials with
respect to the complexity can be much less intuitive.
For instance, the rank of monomials is usually very high, so that the complexity
of general monomials is over the average (and we expect that most people will be
surprised). Even worse, efficient formulas for the rank of monomials were obtained
only very recently by Enrico Carlini, Maria Virginia Catalisano, and Anthony V.
Geramita (see [2]). For other famous polynomials, as the determinant of a matrix
of indeterminates, we do not even know the rank. All we have are lower and upper
bounds, not matching.
We finish the chapter by mentioning that the problem of finding the rank of
polynomials reflects a well-known problem in Number Theory. Solving a question
posed by Diophantus, the Italian mathematician Giuseppe Lagrange proved that any
positive integer N can be written as a sum of four squares, i.e., for any positive
integer G, there are integers L 1 , L 2 , L 3 , L 4 such that G = L 21 + L 22 + L 23 + L 24 . The
problem has been generalized by the English mathematician Edward Waring, who
asked in 1770 for the minimum integer r (k) such that any positive integer G can be
written as a sum of r (k) powers L ik . In other words, find the minimum r (k) such that
any positive integers are of the form
G = L k1 + · · · L rk(k) .
The analogy with the decomposition (7.1) that computes the symmetric rank of a
polynomial is evident.
The determination of r (k) is called, from then, the Waring problem for integers.
Because of the analogy, the symmetric rank of a polynomial is also called the War-
ing rank.
For integers, few values of r (k) are known, e.g., r (2) = 4, r (3) = 9, r (4) = 19.
There are also variations on the Waring problem, as asking for the minimum r (k)
such that all positive integers, except for a finite subset, are the sum of r (k) kth
powers (the little Waring problem).
Going back to the polynomial case, as for integers, a complete description of the
maximal complexity that a homogeneous polynomial of the given degree in a given
number of variables can have, is not known. We have only upper bounds for the
maximal rank. On the other hand, we know the solution of an analogue to the little
Waring problem, for polynomials over the complex field.
7.4 The Complexity of Polynomials 115
Theorem 7.4.4 (Alexander-Hirschowitz 1995) Over the complex field, the symmet-
ric rank of a general homogeneous polynomial of degree d in n variables (here
general means: all polynomials outside a set of measure 0 in C[x1 , . . . , xn ]d ; or
also: all polynomials outside a Zariski closed subset of the space C[x1 , . . . , xn ]d ,
see Remark 9.1.10) is n+d−1
r = d
n
except for the following cases:
• d = 2, any n, where r = n;
• d = 3, n = 5, where r = 8;
• d = 4, n = 3, where r = 6.
• d = 4, n = 4, where r = 10.
• d = 4, n = 5, where r = 15.
The original proof of this theorem requires the Horace method. It is long and difficult
and occupies a whole series of papers [3–7].
For specific tensors, an efficient way to compute the rank requires the use of
inverse systems, which will be explained in the next chapter.
7.5 Exercises
Exercise 11 Prove that the two maps p and t introduced in Definition 7.3.3 are
linear and inverse to each other.
Exercise 12 Prove Comon’s Problem for matrices: a symmetric matrix M has rank
r if and only if there are r symmetric matrices of rank 1, M1 ,…, Mr , such that
M = M1 + · · · + Mr .
Exercise 13 Prove that the tensor T of Example 7.2.5 cannot have rank 2.
Exercise 14 Prove that the tensor T of Example 7.2.5 has symmetric rank
srank(T ) = 3 (so, after Exercise 13, also the rank is 3).
References
4. Alexander, J., Hirschowitz, A.: Un lemme d’Horace différentiel: application aux singularités
hyperquartiques de P 5 . J. Algebr. Geom. 1, 411–426 (1992)
5. Alexander, J., Hirschowitz, A.: Polynomial interpolation in several variables. J. Algebr. Geom.
4, 201–222 (1995)
6. Alexander, J., Hirschowitz, A.: Generic hypersurface singularities. Proc. Indian Acad. Sci. Math.
Sci. 107(2), 139–154 (1997)
7. Alexander, J., Hirschowitz, A.: An asympotic vanishing theorem for generic unions of multiple
points. Invent. Math. 140, 303–325 (2000)
Chapter 8
Marginalization and Flattenings
We collect in this chapter some of the most useful operations on tensors, in view of
the applications to Algebraic Statistics.
8.1 Marginalization
Definition 8.1.1 For matrices of given type n × m over a field K , which can be seen
as points of the vector space K n,m , the marginalization is the linear map μ : K n,m →
K n × K m which sends the matrix A = (ai j ) to the pair ((v1 , . . . , vn ), (w1 , . . . , wm ))
∈ K n × K m , where
m n
vi = ai j , wj = ai j .
j=1 i=1
The notion can be extended (with some complication only for the notation) to
tensors of any dimension.
Definition 8.1.2 For tensors of given type a1 × · · · × an over a field K , which can
be seen as points of the vector space K a1 ,...,an , the marginalization is the linear map
μ : K a1 ,...,an → K a1 × · · · × K an which sends the tensor A = (αq1 ...qn ) to the n-uple
((v11 , . . . , v1a1 ), . . . , (vn1 , . . . , vnan )) ∈ K a1 × · · · × K an , where
vi j = αq1 ...qn ,
qi = j
i.e., in each sum we fix the ith index and take the sum over all the elements of the
tensor in which the ith index is equal to j.
1 2
T =
4 8
2 4
6 12
3 6
Since the marginalization μ is a linear map, we can analyze its linear properties.
It is immediate to realize that, except for trivial cases, μ is not injective.
Even the surjectivity of μ fails in general. This is obvious for 2 × 2 matrices, since
μ is a noninjective linear map between K 4 and itself.
8.1 Marginalization 119
v1 + · · · + vn = w1 + · · · + wm .
Proof The fact that the condition is necessary follows from the few lines before
the
proposition.
We need to prove that the condition is sufficient. So, assume that
vi = w j . A way to prove the claim, which can be easily extended even to higher
dimensional tensors, is the following. Write ei for the element in K n with 1 in the
ith position and 0 elsewhere, so that e1 , . . . , en is the canonical basis of K n . Define
similarly e1 , . . . , em ∈ K m . It is clear that any pair (ei , ej ) belongs to the image of
μ: just take the marginalization of the matrix having ai j = 1 and all the remaining
entries equal to 0. So, it is sufficient to prove that if v1 + · · · + vn = w1 + · · · + wm ,
then (v, w) = ((v1 , . . . , vn ), (w1 , . . . , wm )) belongs to the subspace generated by
the (ei , ej )’s. Assume that n ≤ m (if the converse holds, just take the transpose).
Then notice that
n
(v, w) = v1 (e1 , e1 ) + (v1 + · · · + vi − w1 − · · · − wi−1 )(ei , ei )+
i=2
n−1
(w1 + · · · + wi − v1 − · · · − vi )(ei+1 , ei ).
i=1
Corollary 8.1.6 The image of μ has dimension n + m − 1. Thus the kernel of μ has
dimension nm − n − m + 1.
We can extend the previous computation to tensors of arbitrary dimension.
Proposition 8.1.7 A vector
hence u i is a multiple of wi (by w1 j1 · · · ŵi ji · · · wd jd ).
Remark 8.1.10 We can be even more precise about the scaling factor of the previous
Namely, assume T = w1 ⊗ · · · ⊗ wn where wi = (wi1 , . . . , wiai ), and
proposition.
set Wi = wi1 + · · · + wiai . Then u 1 ⊗ · · · ⊗ u n = W T , where
W = i=1
n
W1 · · · Ŵi · · · Wn .
Example 8.1.11 Let T ∈ K 2,2,2 be the rank 1 tensor, product of (1, 2) ⊗ (3, 4) ⊗
(5, 6). Then,
30 40
15 20
T =
36 48
18 24
so that the the marginalization of T is (u 1 , u 2 , u 3 )=((77, 154), (99, 132), (105, 136)).
Clearly, (77, 154) = 77(1, 2), (99, 132) = 33(3, 4) and (105, 136) = 21(5, 6). Here
W1 = 3, W2 = 7, W3 = 11, so that
When T has rank > 1, then clearly it cannot coincide with a multiple of u 1 ⊗
· · · ⊗ u n . For general T , the product of its contractions u 1 ⊗ · · · ⊗ u n determines a
good rank 1 approximation of T .
6 9
T =
16 20
12 18
8.1 Marginalization 121
which is an approximation of (1, 2) ⊗ (2, 3) ⊗ (3, 4) (only one entry has been
changed). The marginalization of T is (u 1 , u 2 , u 3 ) = ((35, 66), (42, 59), (45, 56)).
The product u 1 ⊗ u 2 ⊗ u 3 , divided by 35 · 42 · 45 = 66150 gives the rank 1 tensor
(approximate):
7.5 10.5
6 8.4
T =
14.1 19.8
11.3 15.9
Indeed, for some purposes, the product of the contractions can be considered as
a good rank 1 approximation of a given tensor. We warn the reader that, on the other
hand, there are other methods for the rank 1 approximation of tensors which, in many
cases, produce a result much closer to the best possible rank 1 approximation. See,
for instance, Remark 8.3.12.
8.2 Contractions
Definition 8.2.1 For a tensor T of type a1 × · · · × an and for any subset J of the set
[n] = {1, . . . , n}, of cardinality n − q, define the J -contraction T J as follows. Set
{1, . . . , n} \ J = { j1 , . . . , jq }. For any choice of k1 , . . . , kq with 1 ≤ ks ≤ a js , put
TkJ1 ,...,kq = Ti1 ,...,in ,
where the sum ranges on all the entries Ti1 ,...,in in which i js = ks for s = 1, . . . , q.
18 20
12 16
T =
9 12
6 8
122 8 Marginalization and Flattenings
The contraction of T along J = {2} means that we take the sum of the left face with
the right face, so that, e.g., T11J = T111 + T121 , and so on. We get
28 38
T = J
.
14 21
The contraction of T along J = {3} means that we take the sum of the front face
with the rear face, so that, e.g., T11J = T111 + T112 , and so on. We get
J 30 36
T = .
15 20
Instead, if we take the contraction along J = {2, 3}, we get a vector T J ∈ K 2 ,
whose entries are the sums of the upper and the lower faces. Indeed T1J = T111 +
T112 + T121 + T122 , so that
T J = (66, 35).
In this case, T J is the 1-st contraction of T .
The last observation of the previous example generalizes.
Remark 8.2.4 The contraction along J ⊂ {1, . . . , n} determines a linear map between
spaces of tensors.
If J ⊂ J ⊂ {1, . . . , n}, then the contraction T J is a contraction of T J .
The relation on the ranks of a tensor and its contractions is expressed as follows:
rank(TJ ) ≤ rankT.
Proof In view of Remark 8.2.4, it is enough to prove the first statement when J is a
singleton. Assume, for simplicity, that J = {n}. Then by definition
an
Ti1J...in−1 = T11 ...in−1 j .
j=1
Proof The proof is straightforward when we take the (maximal) partition where
m = n and Q i = {i} for all i. Indeed in this case T Ji is the ith contraction of T , and
we can apply Proposition 8.1.9.
For the general case, we can use Proposition 8.1.9 again and induction on n.
Indeed, by assumption, if ji = min{Q i } − 1, then the kth contraction of T Ji is equal
to the ( ji + k)th contraction of T .
2 4
1 2
T =
8 16
4 8
and consider the partition Q 1 = {1, 2}, Q 2 = {3}, so that J1 = {3} and J2 = {1, 2}.
Then
3 6
T J1 = ,
12 24
90 180
45 90
J1 J2
T ⊗T =
360 720
180 360
2
hence T J1 ⊗ T J = 45T .
124 8 Marginalization and Flattenings
The following operations is natural for tensors and often allow a direct computation
of the rank.
2 4
2 3
T =
4 8
1 1
5 6
3 4
Remark 8.3.3 There is an obvious relation between the scan and the contraction of a
tensor T . If J ⊂ {1, . . . , n} is any subset of the set of indices and J = {1, . . . , n} \ J
then the J −contraction of T equals the sum of the tensors of the scan of the tensor
T along J .
8.3 Scan and Flattening 125
We define the flattening of a tensor by taking the scan along one index, and
arranging the resulting tensors in one big tensor.
Definition 8.3.4 Let T ∈ K a1 ,...,an be a tensor. The flattening of T along the last
index is defined as follows. For any positive q ≤ an−1 an , one finds uniquely defined
integers α, β such that q − 1 = (β − 1)an−1 + (α − 1), with 1 ≤ β ≤ an , 1 ≤ α ≤
an−1 . Then the flattening of T along the last index is the tensor F T ∈ K a1 ,...,an−2 ,an−1 an
with:
F Ti1 ...in−2 q = Ti1 ...in−2 αβ .
3 8
1 1
T =
2 6
3 4
4 8
1 2
T =
8 16
2 4
Remark 8.3.6 One can apply the flattening procedure after a permutation of the
indices. In this way, in fact, one can define the flattening along any of the indices.
We leave the details to the reader.
Moreover, one can take an ordered series of indices and perform a sequence of
flattening procedures, in order to reduce the dimension of the tensor.
126 8 Marginalization and Flattenings
The final target which is usually the most useful for applications is the flattening
reduction of a tensor to a (usually rather huge) matrix, by performing n − 2 flattenings
of an n-dimensional tensor. If we do not use permutations, the final output is a matrix
of size a1 × (a2 · · · an ).
The reason why the flattenings are useful, for the analysis of tensors, is based on
the following property.
Proposition 8.3.7 A tensor T has rank 1, if and only if all its flattening has rank 1.
Conversely, recall that, from Theorem 6.4.13, a tensor T has rank 1 when, for all
choices p, q = 1, . . . , n and numbers 1 ≤ α, γ ≤ a p , 1 ≤ β, δ ≤ aq one has
Ti1 ···i p−1 αi p+1 ...iq−1 βiq+1 ...in · Ti1 ···i p−1 γi p+1 ...iq−1 δiq+1 ...in −
(8.1)
−Ti1 ···i p−1 αi p+1 ...iq−1 δiq+1 ...in · Ti1 ···i p−1 γi p+1 ...iq−1 βiq+1 ...in = 0
If we take the flattening over the two indices p and q, the left term of previous
equation is a 2 × 2 minor of the flattening.
2 4
1 2
T =
4 8
8 16
has rank > 1, because some determinants of its faces are not 0. On the other hand,
its flattening is the 2 × 4 matrix
8.3 Scan and Flattening 127
⎛ ⎞
1 2
⎜8 16⎟
⎜ ⎟
⎝2 4⎠
4 8
T = v1 ⊗ · · · ⊗ vn−2 ⊗ vn−1 ⊗ vn .
Proposition 8.3.10 If a tensor T has rank r , then its flattening has rank ≤ r .
Of course, the rank of the flattening F T can be strictly smaller than the rank of
T . For instance, we know from Example 6.4.15 that there are 2 × 2 × 2 tensors T of
rank 3. The flattening F T , which is a 2 × 4 matrix, cannot have rank bigger than 2.
Remark 8.3.11 Here is one application of the flattening procedure to the computation
of the rank.
Assume we are given a tensor T ∈ K a1 ,...,an and assume we would like to know
if the rank of T is r . If r < a1 , then we can perform a series of flattenings along the
last indices, obtaining a matrix F ∗ T of size a1 × (a2 · · · an ). Then, we can compute
the rank of the matrix (and we have plenty of fast procedures to do this). If F ∗ T has
rank > r , then there is no chance that the original tensor T has rank r . If F ∗ T has
rank r , then this can be considered as a cue toward the fact that rank(T ) = r .
Of course, a similar process is possible, by using permutations on the indices,
when r ≥ a1 but r < ai for some i.
The flattening process is clearly invertible, so that one can reconstruct the original
tensor T from the flattening F T , thus also from the matrix F ∗ T resulting from a
process of n − 2 flattenings.
On the other hand, since a matrix of rank r > 1 has infinitely many decompositions
as a sum of r matrices of rank 1, then by taking one decomposition of F ∗ T as a
128 8 Marginalization and Flattenings
sum of r matrices of rank 1 one cannot hope to reconstruct automatically from that
decomposition of T as a sum of r tensors of rank 1.
Indeed, the existence of a decomposition for T is subject to the existence of a
decomposition for F ∗ T in which every summand satisfies the condition of Proposi-
tion 8.3.9.
Remark 8.3.12 One can try to find an approximation of a given tensor T with a
tensor of prescribed, small rank r < a1 by taking the matrix F ∗ T , resulting from
a process of n − 2 flattenings, and considering the rank r approximation for F ∗ T
obtained by the standard SVD approximation process for matrices (see [1]).
For instance, one can find in this way a rank 1 approximation for a tensor, which
in principle is not equal to the rank 1 approximation obtained by the marginalization
(see Example 8.1.12).
Example 8.3.13 Consider the tensor:
0 0
1 0
T =
0 1
0 0
of Example 8.1.12. The contractions of T are (1, 1), (1, 1), (1, 1), whose tensor
product, divided by 4 = (1 + 1) + (1 + 1) determines a rank 1 approximation:
1/4 1/4
1/4 1/4
T1 =
1/4 1/4
1/4 1/4
The flattening of T is the matrix:
⎛ ⎞
1 0
⎜0 0⎟
FT = ⎜
⎝0
⎟.
0⎠
0 1
1 0
T2 =
0 0
0 0
8.4 Exercises
Exercise 15 Prove the assertion in Example 8.1.4: the matrix M defined there gen-
erates the Kernel of the marginalization.
Exercise 17 Find generators for the image of the marginalization map of tensors
and prove Proposition 8.1.7.
Reference
The scope of this part of the book is to provide a quick introduction to the main tools
of the Algebraic Geometry of projective spaces that are necessary to understand some
aspects of algebraic models in Statistics.
The material collected here is not self-contained. For many technical results, as
the Nullstellensatz or the theory of fields extensions, we will refer to specific texts
on the subject.
We assume in the sequel that the reader knows the basic definitions of algebraic
structures, as rings, ideals, homomorphisms, etc., as well as the main properties of
polynomial rings.
This part of the book could also be used for a short course or a cutway through
the main results of algebraic and projective geometry which are relevant in the study
of Statistics.
The projective dimension of P(V ) is the number dim(V ) − 1 (it is a constant fact
that the dimension of a projective space is always 1 less than its linear dimension).
When V = Cn+1 , we will denote the projective space P(V ) also with Pn .
Points of the projective space are thus equivalent classes of vectors, in the
relation ∼, hence are formed by a vector v = 0 together with all its multiples.
In particular, P ∈ Pn is an equivalence class of (n + 1)-tuples of complex num-
bers. We will denote with homogeneous coordinates of P any representative of the
equivalence class.
Notice that the coordinates, in a projective space, are no longer uniquely defined,
but only defined modulo scalar multiplication. We will also write P = [ p0 : · · · : pn ]
when ( p0 , . . . , pn ) is a representative for the homogeneous coordinates of P.
Remark 9.1.2 Pn contains several subsets in natural one-to-one correspondence with
Cn .
Indeed, take the subset Ui of points with homogeneous coordinates [x0 : · · · : xn ]
whose i-th coordinate xi is nonzero. The condition is clearly independent from the
representative of the class that we choose. There is a one-to-one correspondence
Ui ↔ Cn , obtained as follows:
x0 x1 xˆi xn
[x0 : · · · : xn ] → ( , ,..., ,..., )
xi xi xi xi
f (x) = f d + f d−1 + · · · + f 0 ,
with f i homogeneous of degree i for all i. The previous sum is called the homoge-
neous decomposition of f (x).
f (x) = f d + f d−1 + · · · + f 0 .
Since f (x) is not homogeneous, we may assume f d , f i = 0 for some i < d. Take
the minimal i with f i = 0. Choose y = (y0 , . . . , yn ) ∈ Cn+1 with f d (y) = 0 (it ex-
ists by Lemma 9.1.5). Then f (ay) = a d f d (y) + a d−1 f d−1 (y) + · · · + a i f i (y) is a
polynomial of degree d > 0 in the unique variable a, which can be divided by a i , i.e.
136 9 Elements of Projective Algebraic Geometry
Definition 9.1.7 We call projective variety of Pn every subset of PnK defined by the
vanishing of a family J = { f j } of homogeneous polynomials f j ∈ C[x0 , . . . , xn ].
In other words, projective varieties are subsets of Pn whose equivalence classes
are the solutions of a system of homogeneous polynomial equations.
When V is any linear space of dimension d, we define the projective varieties in
P(V ) by taking an identification V ∼ Cd (hence by fixing a basis of V ).
We will denote with X (J ) the projective variety defined by the family J of ho-
mogeneous polynomials.
Remark 9.1.10 Projective varieties provide a system of closed sets for a topology,
called the Zariski topology on Pn .
Namely, ∅ and Pn are both projective varieties, defined respectively by the fam-
ilies of polynomials {1} and {0}. If {Wi } is a family of projective varieties,with
Wi = X (Ji ), then {Wi } is the projective variety defined by the family J = {Ji }
of homogeneous polynomials. Finally, if W1 = X (J1 ) and W2 = X (J2 ) are pro-
jective varieties, then W1 ∪ W2 is the projective variety defined by the family of
homogeneous polynomials:
J1 J2 = { f g : f ∈ J1 , g ∈ J2 }.
9.1 Projective Varieties 137
I = { p0 xi − pi x0 , . . . , pn xi − pi xn }
defines {P} ⊂ Pn .
In particular, the Zariski topology satisfies the first separation axiom T1 .
β
f = ex1 (x0 − α0 x1 )m 0 · · · (x0 − αk x1 )m k .
It follows immediately that f vanishes only at the points [α0 : 1], . . . , [αk : 1], with
the addition of [1 : 0] if β > 0.
Thus, the open sets in the Zariski topology on P1 are ∅ and the cofinite sets, i.e.
sets whose complement is finite. In other words, the Zariski topology on P1 coincides
with the cofinite topology.
Example 9.1.13 In higher projective spaces there are nontrivial closed subsets which
are infinite. Thus the Zariski topology on Pn , n > 1, is not the cofinite topology.
Indeed, let f = 0 be a homogeneous polynomial in C[x0 , . . . , xn ], of degree
bigger than 0, and assume n > 1. We prove the variety X ( f ), which is not Pn by
Lemma 9.1.5, has infinitely many points.
To see this, notice that if all points Q = [q0 : q1 : · · · : qn ] with q0 = 0 belong to
X ( f ), then we are done. So we can assume that there exists Q = [1 : q1 : · · · : qn ] ∈ /
X ( f ). For any choice of m = (m 2 , . . . , m n ) ∈ Cn−1 consider the line L m , passing
through Q, defined by the vanishing of the linear polynomials
x2 − m 2 (x1 − q1 x0 ) − q2 x0 , . . . , xn − m n (x1 − q1 x0 ) − qn x0 .
Since the polynomial f m is homogeneous of the same degree than f , then it vanishes
at some point, so that X ( f ) ∩ L m = ∅. Since two different lines L m , L m meet only
at Q ∈
/ X ( f ), the claim follows.
I = {h 1 f 1 + · · · + h m f m : h 1 , . . . , h m ∈ R, f 1 , . . . , f m ∈ J }.
f i = h 1,i−d1 q1 + · · · + h m,i−dm qm
It follows that every projective variety can be defined as the vanishing locus of a
homogeneous ideal.
9.1 Projective Varieties 139
Before stating the basic result in the correspondence between projective varieties
and homogeneous ideals (i.e. the homogeneous version of the celebrated Hilbert’s
Nullstellensatz), we need some more piece of notation.
Definition 9.1.17 For any ideal I ⊂ R, define the radical of I as the set
√
I = { f : f m ∈ I for some exponent m}.
√
I is an ideal of R and contains I . √
When I is a homogeneous √ ideal, then also I is homogeneous, and the projective
varieties X (I ) and X ( I ) are equal (see Exercise √ 21). √
We say that
√ an ideal I in R is radical if I = I . For any ideal I , I is a radical
√
ideal, since I = I.
We call irrelevant ideal the ideal of R = C[x0 , . . . , xn ] generated by the indeter-
minates x0 , . . . , xn .
The irrelevant ideal is a radical ideal that defines the empty set in Pn . Indeed,
no points of Pn can annihilate all the variables, as no points in Pn have all the
homogeneous coordinates equal to 0.
Example 9.1.18 In C[x, y] consider the homogeneous element x 2
. The radical of
the ideal I = x is the ideal generated by x. Indeed x belongs to x 2 , moreover
2
if f n ∈ I for some polynomial f , then f cannot have a vanishing constant term, thus
f ∈ x.
The three sets {x 2 }, x 2 , x 2 = x all define the same projective subvariety
of P1 : the point of homogeneous coordinates [0 : 1].
Now we are ready to state the famous Hilbert’s Nullstellensatz, which clarifies
the relations between different sets of polynomials that define the same projective
variety.
Theorem 9.1.19 (Homogeneous Nullstellensatz) Two homogeneous ideals I1 , I2 in
the polynomial ring R = C[x0 , . . . , xn ] define the same projective variety X if and
only if
I1 = I2 ,
Another fundamental result in the study of projective varieties, still due to Hilbert,
is encoded in the following algebraic result:
Theorem 9.1.22 (Basis Theorem) Let J be a set of polynomials and let I be the
ideal generated by J . Then there exists a finite subset J ⊂ J that generates the ideal
I.
In particular, any projective variety can be defined by the vanishing of a finite set
of homogeneous polynomials.
A proof of a weaker version of this theorem will be given in Chap. 13 (or see,
also, Sect. 4 of [1]). Let us list some consequences of the Basis Theorem.
Definition 9.1.23 We call hypersuperface any projective variety defined by the van-
ishing of a single homogeneous polynomial. By abuse, often we will write X ( f )
instead of X ({ f }).
When f has degree 1, then X ( f ) is called a hyperplane.
X = X (J ) = X (J ) = X (J ) = X (J ).
X = X ( f 1 ) ∩ · · · ∩ X ( f m ).
9.1 Projective Varieties 141
Remark 9.1.26 One could think that the homogeneous ideal of every projective vari-
ety in Pn can be generated by a finite set of homogeneous polynomials of cardinality
bounded by a function of n.
F.S. Macaulay proved that this guess is false.
Indeed, in [2] he showed that for every integer m there exists a subset (curve) in P3
whose homogeneous ideal cannot be generated by a set of less than m homogeneous
polynomials.
The Basis Theorem provides a tool for the study of some aspects of the Zariski
topology.
The following Proposition is easy and we will leave it as an exercise (see Exercise
22).
Corollary 9.1.29 Any projective space Pn is irreducible and compact in the Zariski
topology.
Proof Let A1 , A2 be non-empty open subsets, in the Zariski topology, and assume
that Ai is the complement of the projective variety X i = X (Ji ), where J1 , J2 are two
subsets of homogeneous polynomials in C[x0 , . . . , xn ]. We may assume, by the Basis
Theorem, that both J1 , J2 are finite. Notice that none of X (J1 ), X (J2 ) can coincide
with Pn , thus both J1 , J2 contain a nonzero element.
To prove that Pn is irreducible, we must show that A1 ∩ A2 cannot be empty, i.e.
that X 1 ∪ X 2 cannot coincide with Pn . By Remark 9.1.10, X 1 ∪ X 2 is the projective
variety defined by the set of products J1 J2 . If we take f 1 = 0 in J1 and f 2 = 0
142 9 Elements of Projective Algebraic Geometry
Closed subsets in a compact space are compact. Thus any projective variety X ⊂
Pn is compact in the topology induced by the Zariski topology of Pn .
Notice that irreducible topological spaces are far from being Hausdorff spaces.
Thus no nontrivial projective space satisfies the Hausdorff separation axiom T2 .
Another important consequence of the Basis Theorem is the following.
Proof Let the claim fail. Then one can find an infinite chain of elements of F ,
X0 ⊃ X1 ⊃ · · · ⊃ Xi ⊃ . . .
where all the inclusions are strict. Consider for all i the ideal I (X i ) generated by the
homogeneous polynomials which vanish at X i . Then one gets an ascending chain of
ideals
I (X 0 ) ⊂ I (X 1 ) ⊂ · · · ⊂ I (X i ) ⊂ . . .
where again all the inclusions are strict. Let I = I (X i ). It is immediate to see
that I is a homogeneous ideal. By the Basis Theorem, there existsa finite set of
homogeneous generators g1 , . . . , gk for I . Since every g j belongs to I (X i ), for i 0
sufficiently large we have g j ∈ I (X i0 ) for all j. Thus I = I (X i0 ), so that I (X i0 ) =
I (X i0 +1 ), a contradiction.
Theorem 9.1.32 Let X be any projective variety. Then the irreducible components
of X exist and their number is finite.
Moreover there exists a unique decomposition of X as the union
X = X1 ∪ · · · ∪ Xk
Proof First, let us prove that irreducible components exist. To do that, consider the
family F p of closed irreducible subsets containing a point P. F P is not empty, since
it contains {P}. If X 1 ⊂ . . . X i ⊂ . . . is an ascending chain of elements of F p , then
the union Y = X i is irreducible by 9.1.28 (iv), thus the closure of Y sits in F p (by
9.1.28 (ii)) and it is an upper bound for the chain. Then the family F p has maximal
elements, by the Zorn’s Lemma. These elements are irreducible components of X .
Notice that we also proved that every point of X sits in some irreducible com-
ponent, i.e. X is the union of its irreducible components. If Y is an irreducible
component, by 9.1.28 (ii) also the closure of Y is irreducible. Thus, by maximality,
Y must be closed.
Next, we prove that X is a finite union of irreducible closed subsets. For, assume
this is false. Call F the family of closed subsets of X which are not a finite union of
irreducible subsets. F is non-empty, since it contains X . By Theorem 9.1.30, F has
some minimal element X . As X ∈ F , then X cannot be irreducible. Thus there
are two closed subsets X 1 , X 2 , properly contained in X , whose union is X . Since
X is minimal in F , none of X 1 , X 2 is in F , thus both X 1 , X 2 are union of a finite
number of irreducible closed subsets. But then also X would be a finite union of
closed irreducible subsets. As X ∈ F , this is a contradiction.
Thus, there are irreducible closed subsets X 1 , . . . , X k , whose union is X . Then, if
Y is any irreducible component of X , we have Y ⊂ X = X 1 ∪ · · · ∪ X k . By 9.1.28
(iii), Y is contained in some X i . By maximality, we get that Y coincides with some
X i . This proves that the number of irreducible components of X is finite.
We just proved that X decomposes in the union of its irreducible components
Y1 , . . . , Ym . By 9.1.28 (iii), none of the Yi can be contained in the union of the
remaining components. Thus the decomposition is unique.
Example 9.1.33 Let X be the variety in P2 defined by the vanishing of the homoge-
neous polynomial g = x0 x2 − x12 . Then X is irreducible.
Proving the irreducibility of a projective variety, in general, is not an easy task.
We do that, in this case, introducing a method that we will refine later.
Assume that X is the union of two proper closed subsets X 1 , X 2 , where X i is
defined by the vanishing of homogeneous polynomials in the set Ji .
We consider the map f : P1 → P2 defined by sending each point P = [y0 : y1 ]
to the point f (P) = [y02 : y0 y1 : y12 ] of P2 . It is immediate to check, indeed, that
the point f (P) does not depend on the choice of a particular pair of homogeneous
coordinates for P. Here f is simply a set-theoretic map. We will see, later, that f
has relevant geometric properties.
144 9 Elements of Projective Algebraic Geometry
The map f is one-to-one. To see this, assume f ([b : c]) = f ([b : c ]). Then
(b2 , b c , c2 ) is equal to (b2 , bc, c2 ) multiplied by some nonzero scalar z ∈ C. Tak-
ing a suitable square root w of z, we may assume b = wb. We have c = ±wc, but
if c = wc then b c = −zbc = zbc, a contradiction. Thus also c = wc and (b , c ),
(b, c) define the same point in P1 .
In conclusion, f is a bijective map f : P1 → X .
Next, we prove that Z1 = f −1(X1) is closed in P1. Indeed for any polynomial
p = p(y0, y1, y2) ∈ J1 consider the polynomial q = p(x02, x0x1, x12) ∈ C[x0, x1]. It
is immediate to check that any P ∈ P1 satisfies q(P) = 0 if and only if p( f (P)) = 0.
Thus Z1 is the projective variety in P1 associated to the set of homogeneous polynomials

{p(x02, x0x1, x12) : p ∈ J1}.
Similarly Z 2 = f −1 (X 2 ) is closed in P1 .
Since f is bijective, then Z1, Z2 are proper closed subsets of P1, whose union is
P1. This contradicts the irreducibility of P1.
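The parametric description used in this argument is easy to test symbolically. The following small Python/SymPy sketch (SymPy is not used in the book; this is only an illustration) verifies that every point of the form f([y0 : y1]) satisfies g = x0 x2 − x1^2 = 0.

    from sympy import symbols, expand

    y0, y1 = symbols('y0 y1')
    # the map f : P^1 -> P^2 of the example, [y0 : y1] |-> [y0^2 : y0*y1 : y1^2]
    x0, x1, x2 = y0**2, y0*y1, y1**2
    # g = x0*x2 - x1^2 vanishes identically on the image of f
    print(expand(x0*x2 - x1**2))   # prints 0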
Example 9.1.34 Let X be the variety in P2 defined by the set of homogeneous poly-
nomials J = {x0 x1 , x0 (x0 − x2 )}. Then X is the union of the sets L 1 = {[x0 : x1 :
x2 ] : x0 = 0} and L 2 = {[x0 : x1 : x2 ] : x1 = 0, x0 = x2 }. These are both linear vari-
eties, hence they are irreducible (L 2 is indeed a singleton). Moreover L 1 ∩ L 2 = ∅.
It follows that X is not irreducible: L 1 , L 2 are its irreducible components.
with each g j irreducible, then h = s and, after a possible permutation, there are
scalars c1 , . . . , ch ∈ C with gi = ci f i for all i.
If f is homogeneous, also the irreducible factors of f are homogeneous.
Notice that the irreducible factors of f need not be distinct. In any event, the
irreducible factors of a product f g are the union of the irreducible factors of f and
the irreducible factors of g.
It follows that f 1 f 2 belongs to the radical of the ideal generated by f , thus some
power of f 1 f 2 belongs to the ideal generated by f , i.e. there is an equality
( f 1 f 2 )n = f h
for some exponent n and some polynomial h. It follows that f is an irreducible
factor of either f1 or f2. In the former case f1 = f h1, hence X( f1) contains X. In
the latter, X( f2) contains X.
In particular, X is irreducible if and only if f has a unique irreducible factor. This
clearly happens when f is irreducible, but also when f is a power of an irreducible
polynomial.
Let us move to consider products of projective spaces, which we will call also mul-
tiprojective spaces.
The nonexpert reader may be surprised, at first, to learn that a product of
projective spaces is not trivially a projective space itself.
For instance, consider the product P1 × P1 , whose points have a pair of homoge-
neous coordinates ([x0 : x1 ], [y0 : y1 ]). These pairs can be multiplied separately by
two different scalars. Thus, ([1 : 1], [1 : 2]) and ([2 : 2], [1 : 2]) represent the same
point of the product. On the other hand, the most naïve association with a point in a
projective space would be to relate ([x0 : x1], [y0 : y1]) with [x0 : x1 : y0 : y1] (which,
by the way, sits in P3), but [1 : 1 : 1 : 2] and [2 : 2 : 1 : 2] are different points in
P3.
We will see in the next chapters how a product can be identified with a subset
(indeed, with a projective variety) of a higher dimensional projective space.
For now, we develop independently a theory for products of projective spaces and
their relevant subsets: multiprojective varieties.
belong to the same class when there are scalars k1 , . . . kn ∈ C (all of them necessarily
nonzero) such that, for all i, j, qi j = ki pi j .
We will denote the elements of the equivalence class that define P as sets of
multihomogeneous coordinates for P, writing
at a point P of the product above. This time, it is not sufficient that f is homogeneous,
because subsets of coordinates referring to factors of the product can be scaled
independently.
Example 9.2.3 Consider the polynomial ring C[x0, x1, y0, y1], with the partition
{x0, x1}, {y0, y1}, and consider the two homogeneous polynomials
It is immediate to verify that given two representatives of the same class in Pa1 ×
· · · × Pan :
Example 9.2.6 Consider the product P = Pa1 × · · · × Pan and consider, for all i, a
projective variety X i in Pai . Then the product X 1 × · · · × X n is a multiprojective
variety in P.
Indeed, assume that X i is defined by a finite subset Ji of homogeneous polynomials
in the variables xi,0 , . . . , xi,ai . For all i extend Ji to a finite set K i which defines the
empty set in Pai (e.g. just add to Ji all the coordinates in Pai ). Then X 1 × · · · × X n
is defined by the (finite) set of products of homogeneous polynomials:
J = { f 1 · · · f n : f j ∈ K j ∀ j and ∃i with f i ∈ Ji }.
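The construction of the set J can be tested on a small example. The following Python/SymPy sketch (our illustration only; the specific varieties X1 = {[0 : 1]} and X2 = {[1 : 1]} in P1 × P1 are not taken from the book) checks that the products in J vanish exactly at the points of X1 × X2.

    from sympy import symbols

    x10, x11, x20, x21 = symbols('x10 x11 x20 x21')
    # X1 = {[0:1]} in P^1 and X2 = {[1:1]} in P^1
    J1, J2 = [x10], [x20 - x21]
    K1 = J1 + [x10, x11]          # J1 plus all coordinates of the first factor
    K2 = J2 + [x20, x21]
    # products with at least one factor taken from one of the original sets J_i
    J = [f1 * f2 for f1 in K1 for f2 in K2 if f1 in J1 or f2 in J2]

    def vanishes_on(point):
        subs = dict(zip([x10, x11, x20, x21], point))
        return all(f.subs(subs) == 0 for f in J)

    print(vanishes_on([0, 1, 1, 1]))   # True:  ([0:1],[1:1]) lies in X1 x X2
    print(vanishes_on([1, 0, 1, 1]))   # False: the first factor is not in X1
    print(vanishes_on([0, 1, 1, 2]))   # False: the second factor is not in X2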
We will also write that, over U , the map f is given parametrically by the system:
y0 = f0(x0, . . . , xn)
. . .
ym = fm(x0, . . . , xn)
(g0 ( p0 , . . . , pn ), . . . , gm ( p0 , . . . , pn )) = α P (h 0 ( p0 , . . . , pn ), . . . , h m ( p0 , . . . , pn )).
g j h i − gi h j
vanish at all the points of U . Thus they must vanish at all the points of Pn , since
U is dense. In particular they vanish at all the points of U1 ∪ U2 . It follows im-
mediately that for any P ∈ U1 ∪ U2 , P = [ p0 : · · · : pn ], the sets of coordinates
[g0 ( p0 , . . . , pn ) : · · · : gm ( p0 , . . . , pn )] and [h 0 ( p0 , . . . , pn ) : · · · : h m ( p0 , . . . , pn )]
determine the same, well defined point of Pm . The claim follows.
After Proposition 9.3.2 one may wonder if the local definition of projective maps
is really necessary.
Well, it is, as illustrated in the following Example 9.3.4.
The fundamental point is that the polynomials f0, . . . , fm that define
the projective map f over U cannot have a common zero Q ∈ U; otherwise, the
map would not be defined at Q. Sometimes this property cannot be obtained globally
by a unique set of polynomials. It is necessary to use an open cover and vary the
polynomials, passing from one open subset to another one.
Example 9.3.3 Assume n ≤ m and consider the map between projective spaces f :
Pn → Pm , defined globally by polynomials f 0 (x0 , . . . , xn ) , . . . , f m (x0 , . . . , xn ),
where:
fi(x0, . . . , xn) = xi if i ≤ n,   fi(x0, . . . , xn) = 0 otherwise.
[Figure/table residue: the points [0 : 1 : 1], [0 : −1 : 1] and the lines x1 − x2 = 0, x0 = 0, x1 = 0, x1 + x2 = 0.]
It is immediate to check, indeed, that both g ◦ f and f ◦ g are the identity on the
respective spaces.
Remark 9.3.8 We are now able to prove that the map f of Example 9.3.4 cannot be
defined globally by a pair of polynomials:
y0 = p0 (x0 , x1 , x2 )
y1 = p1 (x0 , x1 , x2 )
Otherwise, since the map g defined in the previous example is the inverse of
f, we would have that for any choice of Q = (b0, b1) ≠ (0, 0), the homogeneous
polynomial

hb0,b1 = b1 p0(y0 y1, y12 − y22, −y12 − y22) − b0 p1(y0 y1, y12 − y22, −y12 − y22),

whose vanishing defines f(g(Q)), vanishes at a single point of P1. Notice that the
degree d of any hb0,b1 is at least 2.
Since C is algebraically closed, a homogeneous polynomial in two variables that
vanishes at a single point is a power of a linear form. Thus any polynomial hb0,b1 is a
d-th power of a linear form. In particular there are scalars a0, a1, c0, c1 such that:

Notice that the point Q′ = [a1 : a0] cannot be equal to [c1 : c0], otherwise both p0, p1
would vanish at g(Q′) ∈ X. Then h1,−1 = (a0 y0 − a1 y1)d − (c0 y0 − c1 y1)d vanishes
at two different points, namely [a1 + c1 : a0 + c0 ] and [ea1 + c1 : ea0 + c0 ], where
e is any d-th root of unity, different from 1.
In the case of multiprojective varieties, most definitions and properties above can
be rephrased and proved straightforwardly.
We will say that f : X → Pb1 × · · · × Pbm is a multiprojective map if all of its com-
ponents are.
Remark 9.3.10 The composition of two multiprojective maps is a multiprojective
map.
The identity from a multiprojective variety to itself is a multiprojective map.
Multiprojective maps are continuous in the Zariski topology.
Proposition 9.3.11 Let f : Pa1 × · · · × Pan → Pm be a multiprojective map. Then
there exists a set of m + 1 multihomogeneous polynomials f 0 , . . . , f m of the same
multidegree, such that f (Q) is defined by f 0 , . . . , f m for all Q ∈ Pa1 × · · · × Pan .
9.4 Exercises
Exercise 24 Prove that if X 1 and X 2 are topological spaces, and X 1 has the sep-
aration property T1 , then for any Q ∈ X 1 the fiber {Q} × X 2 is a closed subset of
X 1 × X 2 which is homeomorphic to X 2 .
Prove that if X 1 and X 2 are irreducible, and one of them has the property T1 , then
also the product X 1 × X 2 is irreducible.
Exercise 26 Prove that if X is a finite projective variety, then the irreducible com-
ponents of X are its singletons.
Exercise 30 Prove that the composition of two projective maps is a projective map.
Prove that the identity from a projective variety to itself is a projective map.
Chapter 10
Projective Maps and Chow's Theorem
The chapter contains the proof of Chow's Theorem, a fundamental result for
algebraic varieties with an important consequence for the study of statistical models.
It states that, over an algebraically closed field, like C, the image of a projective (or
multiprojective) variety X under a projective map is a Zariski closed subset of the
target space, i.e., it is itself a projective variety.
The proof of Chow’s Theorem requires an analysis of projective maps, which can
be reduced to a composition of linear maps, Segre maps and Veronese maps.
The proof will also require the introduction of a basic concept of elimination
theory, namely the resultant of two polynomials.
It is clear that the point φ(P) does not depend on the choice of a set of coordinates
for P, since φ is linear.
Notice that we cannot define a projective map in the same way when φ is not
injective. Indeed, in this case, the image of a point P whose coordinates lie in the
kernel of φ would be indeterminate.
Since any linear map Cn+1 → Cm+1 is defined by linear homogeneous polyno-
mials, then it is clear that the induced map between projective spaces is indeed a
projective map.
Example 10.1.2 Assume that the linear map φ is an isomorphism of Cn+1 . Then the
corresponding linear projective map is called a change of coordinates.
Indeed φ corresponds to a change of basis inside Cn+1 .
The associated map φ : Pn → Pn is an isomorphism, since the inverse isomor-
phism φ−1 determines a projective map which is the inverse of φ.
From now on, when dealing with projective varieties, we will freely apply
changes of coordinates to them.
The previous remark generalizes to any linear projective map.
Proposition 10.1.4 For every injective map φ : Cn+1 → Cm+1 , m ≥ n, the asso-
ciated linear projective map φ : Pn → Pm sends projective subvarieties of Pn to
projective subvarieties of Pm .
In topological terms, any linear projective map is closed in the Zariski topology,
i.e., it sends closed sets to closed sets.
The definition of linear projective maps, which requires that φ is injective, becomes
much more complicated if we drop the injectivity assumption.
Let φ : Cn+1 → Cm+1 be a non-injective linear map. In this case, we cannot define
through φ a projective map Pn → Pm as above, since for any vector ( p0 , . . . , pn )
in the kernel of φ, the image of the point [ p0 : · · · : pn ] is undefined, because
φ( p0 , . . . , pn ) vanishes.
On the other hand, the kernel of φ defines a projective linear subspace of Pn , the
projective kernel, which will be denoted by K φ .
If X ⊂ Pn is a subvariety which does not meet K φ , then the restriction of φ to the
coordinates of the points of X determines a well-defined map from X to Pm .
Example 10.1.6 Let φ : Cm+1 → Cn+1 be any surjective map, with kernel L 1 (of
dimension m − n). We can always assume, up to a change of coordinates, that L 1
coincides with the subspace N defined in Example 10.1.5. Then considering the
linear subspace M ⊂ Cm+1 defined in Example 10.1.5, we can find an isomorphism of
vector spaces ψ from M to Cn+1 such that φ = ψ ◦ φ0 , where φ0 is the map introduced
in Example 10.1.5. Thus, after an isomorphism and a change of coordinates, φ acts
on points of Pm \ K φ as a geometric projection.
Definition 10.1.7 Given a linear surjective map φ : Cm+1 → Cn+1 and a subvariety
X ⊂ Pm which does not meet K φ , the restriction map φ|X : X → Pn is a well defined
projective map, which will be denoted as a projection of X from K φ . The subspace
K φ is also called the center of the projection.
Notice that φ|X is a projective map, since it is defined, up to isomorphisms and
change of coordinates, by (simple) homogeneous polynomials (see Exercise 31).
In this section, we introduce the basic concept of elimination theory: the resultant
of two polynomials.
The resultant provides an answer to the following problem:
– assume we are given two (not necessarily homogeneous) polynomials f, g ∈ C[x].
Clearly both f and g factorize into a product of linear factors. Which algebraic condi-
tion must f, g satisfy in order to share a common factor, hence a common root?
where the a’s are repeated m times and the b’s are repeated n times.
Notice that when f is constant and g has degree d > 0, then by definition
R( f, g) = f d .
When both f, g are constant, the previous definition of resultant makes no sense.
In this case we set:
R( f, g) = 0 if f = g = 0, and R( f, g) = 1 otherwise.
Proposition 10.2.3 With the previous notation, f and g have a common root if and
only if R( f, g) = 0.
Proof The proof is immediate when either f or g is constant (Exercise 33).
Otherwise write C[x]i for the vector space of polynomials of degree ≤ i in C[x].
Then the transpose of S( f, g) is the matrix of the linear map:
From Proposition 10.2.3, with an easy induction on the number of variables, one
finds that the resultant of homogeneous polynomials in several variables has the
following property:
Proposition 10.2.5 Let f, g be homogeneous polynomials in C[x0 , x1 , . . . , xr ].
Then R0 ( f, g) vanishes at (α1 , . . . , αr ) if and only if there exists α0 ∈ C with:
f (α0 , α1 , . . . , αr ) = g(α0 , α1 , . . . , αr ) = 0.
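Resultants of this kind are available in standard computer algebra systems. The following Python/SymPy sketch (our illustration; the specific binary forms f, g, h are made up for the example) eliminates x0 from two homogeneous forms and shows that the resultant vanishes exactly when they share a common root, in line with Propositions 10.2.3 and 10.2.5.

    from sympy import symbols, resultant, factor

    x0, x1 = symbols('x0 x1')
    f = (2*x0 - x1) * (x0 + x1)        # roots [1 : 2] and [1 : -1]
    g = (2*x0 - x1) * (x0 - 3*x1)      # shares the root [1 : 2] with f
    h = (x0 - 3*x1) * (x0 - 5*x1)      # no common root with f
    # R0(f, g) eliminates x0: it vanishes iff f and g have a common root
    print(resultant(f, g, x0))          # 0
    print(factor(resultant(f, h, x0)))  # a nonzero multiple of x1**4 (homogeneous, cf. Prop. 10.2.6)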
Less obvious, but useful, is the following remark on the resultant of two homo-
geneous polynomials.
Proposition 10.2.6 Let f, g be homogeneous polynomials in C[x0 , x1 , . . . , xr ].
Then R0 ( f, g) is homogeneous.
Proof The entries si j of the Sylvester matrix S0( f, g) are homogeneous and their
degrees increase by 1 passing from one element si j to the next element si j+1 in the
same row (unless some of them is 0). Thus for any nonzero entry si j of the matrix,
the number deg si j − j depends only on the row i. Call it u i . Then the summands
given by any permutation, in the computation of the determinant, are homogeneous
of the same degree:

d = (1/2)(n + m + 1)(n + m) + Σi ui,
R0 ( f, g)(α0 , α1 , . . . , αr ) = R0 ( f, g)(α1 , . . . , αr ) = 0,
Remark 10.3.1 For all Q ∈ π(X ), the inverse image π −1 (Q) is finite.
Indeed π −1(Q) is a Zariski closed set in the line P0 Q, and it does not contain P0,
since P0 ∉ X. The claim follows since the Zariski topology on a line is the cofinite
topology.
J0 = J ∩ C[x1 , . . . , xn ].
In other words, J0 is the set of elements in J which are constant with respect to the
variable x0 . In Chap. 13 we will talk again about Elimination Theory, but from the
point of view of Groebner Basis; there, the ideal J0 will be called the first elimination
ideal of J (Definition 13.5.1).
We prove that π(X ) is the projective variety defined by J0 . We will need the
following refinement of Lemma 9.1.5:
Theorem 10.3.4 The variety defined in Pn−1 by the ideal J0 coincides with π(X ).
Proof Let Q ∈ π(X ), Q = [q1 : · · · : qn ]. Then there exists q0 ∈ C such that the
point P = π −1 (Q) = [q0 : q1 : · · · : qn ] ∈ Pn belongs to X . Thus g(P) = 0 for all
g ∈ J . In particular, this is true for all g ∈ J0 . On the other hand, if g ∈ J0 then g
does not contain x0 , thus:
We can repeat all the constructions of this section by selecting any variable xi
instead of x0 and performing the elimination of xi . Thus we can define the i-th
resultant Ri ( f, g) and use it to prove that projections with center any coordinate
point [0 : · · · : 0 : 1 : 0 : · · · : 0] are closed maps.
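In practice, the elimination ideal J0 can be computed with a Groebner basis in a lex order, anticipating Chap. 13. The following Python/SymPy sketch (our illustration; SymPy and the twisted cubic example are not taken from the book) projects the twisted cubic in P3 from the coordinate point [0 : 1 : 0 : 0], which does not lie on it, by eliminating the variable x1.

    from sympy import groebner, symbols

    x0, x1, x2, x3 = symbols('x0 x1 x2 x3')
    # homogeneous ideal J of the twisted cubic {[s^3 : s^2 t : s t^2 : t^3]} in P^3
    J = [x0*x2 - x1**2, x1*x3 - x2**2, x0*x3 - x1*x2]
    # lex Groebner basis with x1 largest: its elements not involving x1
    # generate the elimination ideal J0 = J ∩ C[x0, x2, x3]
    G = groebner(J, x1, x0, x2, x3, order='lex')
    J0 = [g for g in G.exprs if x1 not in g.free_symbols]
    print(J0)   # expect [x0*x3**2 - x2**3] (up to sign): the projection is a cuspidal cubic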
In this section, we prove that projective maps defined by linear maps of projective
spaces are closed in the Zariski topology.
Proposition 10.4.2 Let φ : Cm+1 → Cn+1 be a linear map. Let K φ be the projective
kernel of φ and let X ⊂ Pm be a projective subvariety such that X ∩ K φ = ∅. Then
φ induces a projective map X → Pn (that we will denote again with φ) which is a
closed map in the Zariski topology.
Proof The map φ factors through a linear surjection φ1 : Cm+1 → Cm+1 /K er (φ)
followed by a linear injection φ2 . After the choice of a basis, the space Cm+1 /K er (φ)
can be identified with C N +1 , where N = m − dim(K er (φ)), so that φ1 can be con-
sidered as a map Cm+1 → C N +1 and φ2 as a map C N +1 → Cn+1 . Since X does
not meet the kernel of φ1 , by Definition 10.1.7 φ1 induces a projection X → P N .
The injective map φ2 defines a projective map P N → Pn , by Definition 10.1.1. The
composition of these two maps is the projective map φ : X → Pn of the claim. It is
closed since it is the composition of two closed maps.
We introduce now two fundamental projective and multiprojective maps, which are
the cornerstone, together with linear maps, of the construction of projective maps. The
first map, the Veronese map, is indeed a generalization of the map built in Example
9.1.33.
We recall that a monomial is monic if its coefficient is 1.
Definition 10.5.1 Fix n, d and set N = \binom{n+d}{d} − 1. There are exactly N + 1 monic
monomials of degree d in n + 1 variables x0 , . . . , xn . Let us call M0 , . . . , M N these
monomials, for which we fixed an order.
The Veronese map of degree d in Pn is the map vn,d : Pn → P N which sends a
point [ p0 : · · · : pn ] to [M0 ( p0 , . . . , pn ) : · · · : M N ( p0 , . . . , pn )].
Notice that a change in the choice of the order of the monic monomials produces
simply the composition of the Veronese map with a change of coordinates. After
choosing an order of the variables, e.g., x0 < x1 < · · · < xn , a very popular order of
the monic monomials is the order in which x0d0 · · · xndn precedes x0e0 · · · xnen if in the
smallest index i for which di ≠ ei we have di > ei. This order is called lexicographic
order, because it reproduces the way in which words are listed in a dictionary. In
Sect. 13.1 we will discuss different types of monomial orderings.
Notice that we can define an analogue of a Veronese map by choosing arbitrary
(nonzero) coefficients for the monomials M j's. This is equivalent to choosing a weight
for the monomials. The resulting map has the same fundamental property of our
Veronese map, for which we choose to take all the coefficients equal to 1.
Remark 10.5.2 The Veronese maps are well defined, since for any P = [p0 : · · · :
pn] ∈ Pn there exists an index i with pi ≠ 0, and among the monomials there exists
the monomial M = xid, which satisfies M(p0, . . . , pn) = pid ≠ 0.
The Veronese map is injective. Indeed if P = [p0 : · · · : pn] and Q = [q0 : · · · :
qn] have the same image, then the powers of the pi's and the qi's are equal, up to a
scalar multiplication. Thus, up to a scalar multiplication, one may assume pid = qid
for all i, so that qi = ei pi, for some choice of a d-th root of unity ei. If the ei's are not all
equal to 1, then there exists a monic monomial M such that M(e0, e1, . . . , en) ≠ 1,
thus M(p0, . . . , pn) ≠ M(q0, . . . , qn), which contradicts vn,d(P) = vn,d(Q).
Because of its injectivity, sometimes we will refer to a Veronese map as a Veronese
embedding.
The images of Veronese embeddings will be denoted as Veronese varieties.
Example 10.5.3 The Veronese map v1,3 sends the point [x0 : x1 ] of P1 to the point
[x03 : x02 x1 : x0 x12 : x13 ] ∈ P3 .
The Veronese map v2,2 sends the point [x0 : x1 : x2 ] ∈ P2 to the point [x02 : x0 x1 :
x0 x2 : x12 : x1 x2 : x22 ] (notice the lexicographic order).
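For concrete computations it is convenient to generate the Veronese coordinates mechanically. The following Python sketch (ours, not from the book) lists the monic monomials of degree d evaluated at a point, in the same lexicographic order used above.

    from functools import reduce
    from itertools import combinations_with_replacement
    from operator import mul
    from sympy import symbols

    def veronese(point, d):
        # combinations_with_replacement of the indices 0..n produces exactly the
        # monic monomials of degree d, in the order used in Example 10.5.3
        return [reduce(mul, (point[i] for i in combo))
                for combo in combinations_with_replacement(range(len(point)), d)]

    x0, x1, x2 = symbols('x0 x1 x2')
    print(veronese([x0, x1], 3))       # [x0**3, x0**2*x1, x0*x1**2, x1**3]   (v_{1,3})
    print(veronese([x0, x1, x2], 2))   # [x0**2, x0*x1, x0*x2, x1**2, x1*x2, x2**2]   (v_{2,2})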
Proposition 10.5.4 The image of a Veronese map is a projective subvariety of P N .
Proof We define equations for Y = vn,d (Pn ).
Consider (n + 1)-tuples of nonnegative integers A = (a0, . . . , an), B = (b0, . . . , bn)
and C = (c0, . . . , cn), with the following property:
(*) Σ ai = Σ bi = Σ ci = d and ai + bi ≥ ci for all i. Define D = (d0, . . . , dn),
where di = ai + bi − ci. Clearly Σ di = d. For any choice of A = (a0, . . . , an), the
monic monomial x0a0 · · · xnan corresponds to a coordinate in P N . Call M A the coordi-
nate corresponding to A. Define in the same way M B , M C and then also M D . The
polynomial:
f ABC = M A M B − M C M D
But we have M A(Q) = M B(Q) = mq, while M C(Q) = 0, since x0c0 · · · xncn precedes
x0a0 · · · xnan in the lexicographic order. It follows mq2 = 0, a contradiction.
Then at least one coordinate corresponding to a power is nonzero. Just to fix the
ideas, assume that m 0 , which corresponds to x0d in the lexicographic order, is different
from 0. After multiplying the coordinates of Q by 1/m 0 , we may assume m 0 = 1.
Then consider the coordinates corresponding to the monomials x0d−1 x1 , . . . x0d−1 xn .
In the lexicographic order, they turn out to be m 1 , . . . , m n , respectively. Put P = [1 :
m 1 : · · · : m n ] ∈ Pn . We claim that Q is exactly vn,d (P).
The claim means that for any coordinate m of Q, corresponding to the monomial
x0^{a0} · · · xn^{an}, we have m = m1^{a1} · · · mn^{an}. We prove the claim by descending induction
on a0 . The cases a0 = d and a0 = d − 1 are clear by construction. Assume that the
claim holds when a0 > d − s and take m such that a0 = d − s. In this case there
exists some index j > 0 such that a j > 0. Put A = (a0, . . . , an), B = (d, 0, . . . , 0)
and C = (c0, . . . , cn) where c0 = a0 + 1 = d − s + 1, c j = a j − 1, and ck = ak for
k ≠ 0, j. The (n + 1)-tuples A, B, C satisfy condition (*). Thus f ABC(Q) = 0, i.e.,
M A(Q)M B(Q) = M C(Q)M D(Q), and since M B(Q) = m0 = 1, we get
m = M A(Q) = M C(Q)M D(Q) = m1^{a1} · · · mn^{an}, and the claim follows.
We observe that all the forms M A M B − M C M D are quadratic forms in the vari-
ables Mi ’s of P N . Thus the Veronese varieties are defined in P N by quadratic equa-
tions.
Example 10.5.5 Consider the Veronese map v1,3 : P1 → P3. The monic monomials
of degree three in 2 variables are (in lexicographic order):

x0^3, x0^2 x1, x0 x1^2, x1^3.

The equations for the image are obtained from the triples A, B, C satisfying condition
(*) above. Up to trivialities, these triples yield the equations:

M0M2 − M12 = 0,  M0M3 − M1M2 = 0,  M1M3 − M22 = 0.
Example 10.5.6 Equations for the image of v2,2 ⊂ P5 (the classical Veronese sur-
face) are given by:

M0M4 − M1M2 = 0
M3M2 − M1M4 = 0
M5M1 − M2M4 = 0
M3M5 − M42 = 0
M0M5 − M22 = 0
M0M3 − M12 = 0
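These six equations can be checked mechanically: substituting the parametrization Mi ↦ (monomial of degree 2) makes each of them vanish identically. A short Python/SymPy check (ours, purely illustrative):

    from sympy import expand, symbols

    x0, x1, x2 = symbols('x0 x1 x2')
    # coordinates of v_{2,2} in the lexicographic order of Example 10.5.3
    M0, M1, M2, M3, M4, M5 = x0**2, x0*x1, x0*x2, x1**2, x1*x2, x2**2
    quadrics = [M0*M4 - M1*M2, M3*M2 - M1*M4, M5*M1 - M2*M4,
                M3*M5 - M4**2, M0*M5 - M2**2, M0*M3 - M1**2]
    print([expand(q) for q in quadrics])   # [0, 0, 0, 0, 0, 0]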
Theorem 10.5.7 All the Veronese maps are closed in the Zariski topology.
Proof Let X ⊂ Pn be a projective variety, defined by homogeneous polynomials
f1, . . . , fs, where fi has degree ei; for each i let ki d be the smallest multiple of d
bigger than or equal to ei. Then consider all the products x_j^{ki d − ei} fi,
j = 0, . . . , n. These products are homogeneous forms of degree ki d in the x j's.
Moreover a point P ∈ Pn satisfies all the equations x_j^{ki d − ei} fi = 0 if and only if it
satisfies fi = 0, since at least one coordinate x j of P is nonzero.
With the procedure introduced above, transform arbitrarily each form x_j^{ki d − ei} fi =
0 into a form Fi j of degree ki in the variables M j's. Then we claim that vn,d(X) is the
subvariety of vn,d(Pn) defined by the equations Fi j = 0. Since vn,d(Pn) is closed in
P N , this will complete the proof.
Indeed let Q be a point of vn,d (X ). The coordinates of Q are obtained by the coordi-
nates of its preimage P = [ p0 : · · · : pn ] ∈ X ⊂ Pn by computing in P all the mono-
mials of degree d in the x j's. Thus Fi j(Q) = 0 for all i, j if and only if x_j^{ki d − ei} fi(P) =
0 for all i, j, i.e., if and only if fi(P) = 0 for all i. The claim follows.
Example 10.5.8 Consider the map v2,2 and let X be the line in P2 defined by the
equation f = (x0 + x1 + x2 ) = 0. Since f has degree 1, consider the products:
x0 f = x02 + x0 x1 + x0 x2 ,
x1 f = x0 x1 + x12 + x1 x2 ,
x2 f = x0 x2 + x1 x2 + x22 .
In the coordinates of P5 (with the lexicographic order of Example 10.5.3), these products correspond to the linear forms M0 + M1 + M2, M1 + M3 + M4, M2 + M4 + M5.
Thus the image of X is the variety defined in P5 by the previous three linear forms
and the six quadratic forms of Example 10.5.6, that define v2,2 (P2 ).
have the same image. Fix indices such that p1 j1, . . . , pn jn ≠ 0. The monomial M =
x1, j1 · · · xn, jn does not vanish at P, hence also q1 j1, . . . , qn jn ≠ 0.
Call α = q1 j1/p1 j1. Our first task is to show that α = q1i/p1i for i = 0, . . . , a1, so
that [p10 : · · · : p1a1] = [q10 : · · · : q1a1]. Define β = (q2 j2 · · · qn jn)/(p2 j2 · · · pn jn).
Then β ≠ 0 and:
αβ = (q1 j1 · · · qn jn)/(p1 j1 · · · pn jn).
Since P, Q have the same image in the Segre map, then for all i = 0, . . . , a1, the
monomials Mi = x1,i x2, j2 · · · xn, jn satisfy:
αβ Mi(P) = Mi(Q).
It follows immediately that αβ(p1i p2 j2 · · · pn jn) = (q1i q2 j2 · · · qn jn), so that α p1i = q1i
for all i. Thus [p10 : · · · : p1a1] = [q10 : · · · : q1a1].
We can repeat the argument for the remaining factors [ pi0 : · · · : piai ] = [qi0 :
· · · : qiai ] of P, Q (i = 2, . . . , n), obtaining P = Q.
Because of its injectivity, sometimes we will refer to a Segre map as a Segre
embedding.
The images of Segre embeddings will be denoted as Segre varieties.
Example 10.5.11 The Segre embedding s1,1 of P1 × P1 to P3 sends the point ([x0 :
x1 ], [y0 : y1 ]) to [x0 y0 : x0 y1 : x1 y0 : x1 y1 ].
The Segre embedding s1,2 of P1 × P2 to P5 sends the point ([x10 : x11 ], [x20 : x21 :
x22 ]) to the point:
[x10 x20 : x10 x21 : x10 x22 : x11 x20 : x11 x21 : x11 x22 ].
Similarly, the Segre embedding s1,1,1 of P1 × P1 × P1 to P7 sends the point ([x10 : x11], [x20 : x21], [x30 : x31]) to the point:
[x10 x20 x30 : x10 x20 x31 : x10 x21 x30 : x10 x21 x31 : x11 x20 x30 : x11 x20 x31 : x11 x21 x30 : x11 x21 x31].
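As with the Veronese map, the Segre coordinates can be generated mechanically by taking one coordinate from each factor. A small Python sketch (ours, not from the book):

    from functools import reduce
    from itertools import product
    from operator import mul
    from sympy import symbols

    def segre(*points):
        # one coordinate from each factor, in the order used in Example 10.5.11
        return [reduce(mul, choice) for choice in product(*points)]

    x10, x11, x20, x21, x22 = symbols('x10 x11 x20 x21 x22')
    print(segre([x10, x11], [x20, x21, x22]))
    # [x10*x20, x10*x21, x10*x22, x11*x20, x11*x21, x11*x22]   (the map s_{1,2})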
Recall the general notation that with [n] we denote the set {1, . . . , n}.
Proposition 10.5.12 The image of a Segre map is a projective subvariety of P N .
Since the set of tensors of rank one corresponds to the image of a Segre map, the
proof of the proposition is essentially the same as the proof of Theorem 6.4.13. We
give the proof here, in the terminology of maps, for the sake of completeness.
Proof We define equations for Y = sa1 ,...,an (Pa1 × · · · × Pan ).
For any n-tuple A = (α1, . . . , αn) define the form M A of multidegree (1, . . . , 1)
as follows:

M A = x1,α1 · · · xn,αn.
Then consider any subset J ⊂ [n] and two n-tuples of nonnegative integers A =
(α1, . . . , αn) and B = (β1, . . . , βn). Define C = C^J_{AB} as the n-tuple (γ1, . . . , γn)
such that:

γi = αi if i ∈ J,   γi = βi if i ∉ J.

Define D as D = C^{Jc}_{AB}, where J c is the complement of J in [n]. Thus D =
(δ1, . . . , δn), where:

δi = βi if i ∈ J,   δi = αi if i ∉ J.
Define f^J_{AB} = M A M B − M C M D. Every f^J_{AB} is homogeneous of degree 2 in the
coordinates of P N. We claim that the projective variety defined by the forms f^J_{AB},
for all possible choices of A, B, J as above, is exactly equal to Y.
One direction is simple. If:
then it is easy to see that both M A M B (Q) and M C M D (Q) are equal to the product
If we take A = (0, 0), B = (1, 1) and J = {1}, we get that C = (0, 1), D = (1, 0).
Thus M A corresponds to x10 x20 = M0 , M B corresponds to x11 x21 = M3 , M C cor-
responds to x10 x21 = M1 and M D corresponds to x11 x20 = M2 . We get thus the
equation:
M0 M3 − M1 M2 = 0.
The other choices for A, B, J yield either trivialities or the same equation.
Hence the image of s1,1 is the variety defined in P3 by the equation M0 M3 −
M1 M2 = 0. It is a quadric surface (see Fig. 10.1).
Example 10.5.14 Equations for the image of s1,2 ⊂ P5 (up to trivialities) are given
by:

M0M4 − M1M3 = 0   for A = (0, 0), B = (1, 1), J = {1}
M0M5 − M2M3 = 0   for A = (0, 0), B = (1, 2), J = {1}
M5M1 − M2M4 = 0   for A = (0, 1), B = (1, 2), J = {1},
where M0 = x10 x20 , M1 = x10 x21 , M2 = x10 x22 , M3 = x11 x20 , M4 = x11 x21 , M5 =
x11 x22 .
(Fig. 10.1: the Segre embedding of P1 × P1 as the quadric surface X ⊂ P3, sending a pair (v, w) to v ⊗ w.)
Example 10.5.15 We can give a more direct representation of the equations defining
the Segre embedding of the product of two projective spaces Pa1 × Pa2 .
Namely, we can plot the coordinates of Q ∈ P N in a (a1 + 1) × (a2 + 1) matrix,
putting in the entry (i, j) the coordinate corresponding to x1,i−1 x2, j−1.
Conversely, any (a1 + 1) × (a2 + 1) matrix (except for the null matrix) corre-
sponds uniquely to a set of coordinates for a point Q ∈ P N . Thus we can identify P N
with the projective space over the linear space of matrices of type (a1 + 1) × (a2 + 1)
over C.
In this identification, the choice of A = (i, j), B = (k, l) and J = {1} (choosing
J = {2} we get the same equation, up to the sign) produces a form equivalent to the
2 × 2 minor:
m i j m kl − m il m k j
of the matrix.
Thus, the image of a Segre embedding of two projective spaces can be identified
with the set of matrices of rank 1 (up to scalar multiplication) in a projective space
of matrices.
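This identification is easy to test: the matrix attached to a point of the Segre image is an outer product, so all of its 2 × 2 minors vanish and its rank is 1. A Python/SymPy sketch (ours, purely illustrative):

    from sympy import Matrix, symbols

    x10, x11, x20, x21, x22 = symbols('x10 x11 x20 x21 x22')
    # the 2 x 3 matrix m_ij attached to the image of ([x10:x11],[x20:x21:x22])
    m = Matrix(2, 3, lambda i, j: [x10, x11][i] * [x20, x21, x22][j])
    minors = [m.extract([0, 1], [j, k]).det().expand()
              for j in range(3) for k in range(j + 1, 3)]
    print(minors)                                                    # [0, 0, 0]
    print(m.subs({x10: 1, x11: 2, x20: 3, x21: 5, x22: 7}).rank())   # 1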
Theorem 10.5.16 All the Segre maps are closed in the Zariski topology.
Proof We need to prove that the image in sa1 ,...,an of a multiprojective subvariety X
of V = Pa1 × · · · × Pan is a projective subvariety of P N .
First notice that if F is a monomial of multidegree (d, . . . , d) in the variables xi j
of V, then it can be written (usually in several ways) as a product of d multilinear
forms in the xi j's, which corresponds to a monomial of degree d in the coordinates
M0, . . . , M N of P N. Thus, any form f of multidegree (d, . . . , d) in the xi j's can be
rewritten as a form of degree d in the coordinates M j's.
Take now a projective variety X ⊂ V and let f 1 , . . . , f s be multihomogeneous
generators for the ideal of X . Call (dk1 , . . . , dkn ) the multidegree of f k and let dk =
max{dk1, . . . , dkn}. Consider all the products x_{1, j1}^{dk − dk1} · · · x_{n, jn}^{dk − dkn} f k. These products are
multihomogeneous forms of multidegree (dk , . . . , dk ) in the xi j ’s. Moreover a point
x10 f = x10^2 x21^2 + x10 x11 x21^2 = (x10 x21)^2 + (x10 x21)(x11 x21) = M12 + M1 M3,
x11 f = x10 x11 x21^2 + x11^2 x21^2 = (x10 x21)(x11 x21) + (x11 x21)^2 = M1 M3 + M32.
These two forms, together with the form M0 M3 − M1 M2 that defines s1,1 (P1 × P1 )
in P3 , define the image of X in the Segre embedding.
Remark 10.5.18 Even if we take a minimal set of forms f k ’s that define X ⊂ Pa1 ×
· · · × Pan , with the procedure of Theorem 10.5.16 we do not find, in general, a
minimal set of forms that define sa1 ,...,an (X ).
Indeed the ideal generated by the forms Fk j1 ,..., jn constructed in the proof of
Theorem 10.5.16 need not, in general, be radical or even saturated.
We end this section by pointing out a relation between the Segre and the Veronese
embeddings of projective and multiprojective spaces.
Definition 10.5.19 A multiprojective space Pa1 × · · · × Pan is cubic if ai = a for
all i.
We can embed Pa into the cubic multiprojective space Pa × · · · × Pa (n times) by
sending each point P to (P, . . . , P). We will refer to this map as the diagonal em-
bedding. It is easy to see that the diagonal embedding is an injective multiprojective
map.
Example 10.5.20 Consider the cubic product P1 × P1 and the diagonal embedding
δ : P1 → P1 × P1 .
The point P = [ p0 : p1 ] of P1 is mapped to ([ p0 : p1 ], [ p0 : p1 ]) ∈ P1 × P1 .
Thus the Segre embedding of P1 × P1 , composed with δ, sends P to the point [ p02 :
p0 p1 : p1 p0 : p12 ] ∈ P3 .
We see that the coordinates of the image have a repetition: the second and the third
coordinates are equal, due to the commutativity of the product of complex numbers.
In other words the image s1,1 ◦ δ(P1 ) satisfies the linear equation M1 − M2 = 0 in
P3 .
Proof For any P = [p0 : · · · : pn] ∈ Pn the point sn,...,n ◦ δ(P) has repeated coordi-
nates. Indeed for any permutation σ on [r] the coordinate corresponding to x1,i1 · · · xr,ir
of sn,...,n ◦ δ(P) is equal to pi1 · · · pir, hence it is equal to the coordinate correspond-
ing to x1,iσ(1) · · · xr,iσ(r). To get rid of these repetitions, we can consider coordinates
corresponding to multilinear forms x1,i1 · · · xr,ir that satisfy:
(**) i1 ≤ i2 ≤ · · · ≤ ir.
We can decompose f as the Veronese map v1,3, followed by the linear isomorphism
g(a, b, c, d) = (a, b − c, c − d, d) and then followed by the projection π to the first,
second and fourth coordinate.
Namely:
Then the image s(P1 × P1 ) corresponds to the quadric Q in P3 defined by the van-
ishing of the homogeneous polynomial g = z 0 z 3 − z 1 z 2 .
The image of Y is a projective subvariety of P3 , which is contained in Q, but it
is no longer defined by g and another polynomial: we need two polynomials, other
than g.
Namely, Y is defined also by the two multihomogeneous polynomials, of multi-
degree (1, 1), f 0 = f y0 = x0 y0 − x1 y0 and f 1 = f y1 = x0 y1 − x1 y1 . Thus s(Y ) is
defined in P3 by g, g0 = z 0 − z 1 , g1 = z 2 − z 3 . (Indeed, in this case, g0 , g1 alone are
sufficient to determine s(Y ), which is a line).
10.7 Exercises
Exercise 31 Recall that a map between topological spaces is closed if it sends closed
sets to closed sets.
Prove that the composition of closed maps is a closed map, the product of closed
maps is a closed map, and the restriction of a closed map to a closed set is itself a
closed map.
Chapter 11
Dimension Theory

The first step is rather technical: we need some algebraic properties of irreducible
varieties. We recall that the definition of irreducible topological spaces, together with
examples, can be found in Definition 9.1.27 of Chap. 9.
So, from now on, dealing with projective varieties, we will always refer to
reducibility or irreducibility with respect to the induced Zariski topology.
Let us start with a characterization of irreducible varieties, in terms of the associ-
ated homogeneous ideal (see Corollary 9.1.15).
Proof Assume Y = Y1 ∪ Y2 , where the Yi ’s are proper closed subsets. Then there
exist polynomials f1, f2 such that fi vanishes on Yi but not on Y. Thus f1, f2 ∉ J,
while f1 f2 vanishes at any point of Y, i.e., f1 f2 ∈ J.
The previous argument can be inverted to show that the existence of f1, f2 ∉ J
such that f1 f2 ∈ J implies that Y is reducible.
Example 11.1.4 The space P N itself is defined by the ideal J = (0), thus the pro-
jective function field of Y is the field of fractions C(x0 , . . . , x N ).
11.2 Dimension
There are several definitions of dimension of an irreducible variety. All of them have
some difficult aspect. In some cases, it is laborious even to prove that the definition
makes sense. For most approaches, it is not immediate to see that the geometric naïve
notion of dimension corresponds to the algebraic notion.
Our choice is to make use, as far as possible, of the geometric approach, entering
deeply in the algebraic background only to justify some computational aspect.
The final target is the theorem on the dimension of general fibers (Theorem 11.3.5),
which allows one to manage the computation of the dimension for most applications.
Remark 11.2.2 Since projective maps are continuous in the Zariski topology, and
singletons are projective varieties, then the fiber over any point P ∈ Y is closed in
the Zariski topology, hence it is a projective variety.
Proposition 11.2.3 For every projective variety X ⊊ P N there exists a linear sub-
space L ⊂ P N, not intersecting X, such that the projection of X from L to a linear
subspace L′ is surjective, with finite fibers.
Now we are ready for the definition of dimension. Notice that we already have
the notion of projective dimension r of a linear subspace of P N , which corresponds
to the projectivization of a linear subspace (of dimension r + 1) of the vector space
C N +1 .
In the rest of the section, since by elementary linear algebra any linear subspace
L of P N of dimension n is isomorphic to Pn, by abuse we will consider the projection
X → L as a map X → Pn.
It is clear that two projective varieties which are isomorphic under a change of
coordinates share the same dimension.
Example 11.2.5 Since P0 has just one point, clearly singletons have a surjective
projection to P0 , with finite fibers. So singletons have dimension 0.
Finite projective varieties are reducible, unless they are singletons (see Exercise
26). Thus, by definition, singletons are the only projective irreducible varieties of
dimension 0.
Proposition 11.2.8 Fix a linear space L ⊂ P N, which does not meet a variety X ⊂
P N, and fix a linear subspace L′, disjoint from L, such that dim(L′) + dim(L) =
N − 1. Fix a point P ∈ L, a linear subspace M ⊂ L of dimension dim(L) − 1 and
disjoint from P, and a hyperplane H containing L′ and disjoint from P.
Then the projection φ of X from L to L′ is equal to the projection φP of X from
P to H, followed by the projection φ′ (in H = P N −1) of φP(X) from φP(M) to L′.
By now, we do not know yet that the dimension of an irreducible projective variety
is uniquely defined, since we did not exclude the existence of two different surjective
projections of X to two linear subspaces of different dimensions, both with finite
fibers.
It is not easy to face the problem directly. Instead, we show that the existence of a
map X → Pn with finite fibers is related to a numerical invariant of the irreducible
variety X .
The invariant which defines the dimension is connected with a notion in the
algebraic theory of field extensions: the transcendence degree. We recall some basic
facts in the following remark. For the proofs, we refer to section II of the book [1]
(see also some exercises, at the end of the chapter).
The set of all the elements of K2 which are algebraic over K1 is a field K′ ⊃
K1. We call K′ the algebraic closure of K1 in K2. If K′ = K1, we say that K1 is
algebraically closed in K2. In this case any element of K2 \ K1 is transcendental over
K1. A field K1 is algebraically closed if, in any extension K2 of K1, every element
which is algebraic over K1 already belongs to K1, i.e., if K1 is algebraically closed
in any extension. C is the most popular example of an algebraically closed field.
If K2 = K1(x) is the field of fractions of the polynomial ring K1[x], then K2 is a
transcendental extension. Conversely, if e is any transcendental element over K1, then
K1(e) is isomorphic to K1(x).
A set of elements e1, . . . , en ∈ K2 such that for all i, ei is transcendental over
K1(e1, . . . , ei−1), and K2 is an algebraic extension of K1(e1, . . . , en), is a transcen-
dence basis of the extension. All the transcendence bases have the same number of
elements, which is called the transcendence degree of the extension.
If K 2 has transcendence degree d over K 1 and K 3 is an algebraic extension of K 2 ,
then K 3 has transcendence degree d over K 1 .
Proof Assume that the map has finite fibers. We will indeed prove that the class of
any variable xi in the quotient ring R X is algebraic over C(x0 , . . . , xn ). Since these
classes generate the quotient field of R X , the claim will follow.
First assume that N = n + 1, so that φ is the projection from a point. We may also
assume, after a change of coordinates, that φ is the projection from P = [0 : · · · :
0 : 1] ∉ X. Consider the homogeneous ideal J of X and an element g ∈ J such that
g(P) ≠ 0. Write g as a polynomial in x N = xn+1 with coefficients in C(x0, . . . , xn):

g = xn+1^d ad + xn+1^{d−1} ad−1 + · · · + a0,
Proof The second statement follows since the transcendence degree of the quotient
field of X does not depend on the projections φ, φ′.
From the definition of dimension and its characterization in terms of field exten-
sions, one can prove the following properties.
Proof The first claim is clear: fix an irreducible component Y of X and take a
projection φ′ of φ(Y) to some linear space Pn, whose fibers are finite (thus n =
dim(φ(Y))). Then the composition φ′ ◦ φ maps Y to Pn with finite fibers, so that
n = dim(Y).
The proof of (c) is straightforward from the definition of dimension. Then (b)
follows since a surjective projection φ with finite fibers from X to some Pn (n =
dim(X)) maps X′, with finite fibers, to a subvariety of Pn.
Finally, to see (d) assume X ≠ P N and fix a point P ∉ X. Consider the projection
φ from P. The fiber of φ which contains any point Q ∈ X is a projective subvariety
of the line Q P and misses P, thus it is finite. Hence φ maps X to P N −1 with finite
fibers. Thus dim(X) = dim(φ(X)) ≤ N − 1.
Lemma 11.2.16 Let X be a subvariety of P N, with infinitely many points. Then for
any hyperplane H of P N we have X ∩ H ≠ ∅.
Proof Since the number of irreducible components of X is finite, there exists some
component of X which contains infinitely many points. So, without loss of generality,
we may assume that X is irreducible.
If X ⊂ P1 , then X = P1 because X is closed in the Zariski topology, which is the
cofinite topology, and the claim is trivial. Then assume X ⊂ P N , N > 1, and proceed
by induction on N .
Fix a hyperplane H and assume that H ∩ X = ∅. Fix a linear subspace L ⊂ H of
dimension N − 2 and consider the projection φ of X from L to a general line ℓ. By
Chow's Theorem, the image φ(X) is a closed irreducible subvariety of ℓ, and it does
not contain, by construction, the point H ∩ ℓ. Thus φ(X) is finite and irreducible,
hence it is a point Q (see Exercise 26). Then X is contained in the hyperplane H′
spanned by L and Q, which is a P N −1, and it does not meet the hyperplane H ∩ H′
of H′. This contradicts the inductive assumption.
Proof If n = 1 then X is infinite and the claim follows from Lemma 11.2.16. Then
we proceed by induction on N. The result is clear if N = 1. For n, N > 1, take a general
hyperplane H containing L and identify H with P N −1. Fix an irreducible component
X′ of dimension n in X. Then either X′ ⊂ H or dim(H ∩ X′) = n − 1 by Theorem
11.2.17. In any case, the claim follows by induction.
The previous result has an important consequence, that will simplify the compu-
tation of the dimension of projective varieties.
Corollary 11.2.19 Let L be a linear subspace which does not intersect a projective
variety X and consider the projection φ of X from L to some linear space Pn , disjoint
from L.
Then φ has finite fibers. Thus, φ is surjective if and only if n = dim(X ).
Proof Assume that there exists a fiber φ−1(Q) which is infinite, for some Q ∈ Pn.
Then φ−1(Q) is an infinite subvariety of the span L′ of L and Q. In particular
dim(φ−1(Q)) ≥ 1. Since L′ is a linear subspace of dimension n + 1 which contains
both L and φ−1(Q), then by the previous corollary we have L ∩ φ−1(Q) ≠ ∅. This
contradicts the assumption that L and X are disjoint.
Corollary 11.2.20 Let X be a variety of dimension n in P N and let Y be the inter-
section of m hypersurfaces Y = X (g1 ) ∩ · · · ∩ X (gm ). Then:
dim(X ∩ Y ) ≥ n − m.
M0M3 − M1M2 = 0,   M0M2 − M12 = 0,   M1M3 − M22 = 0.
On the other hand, by Example 11.2.22, X has dimension 1. Thus X is not a complete
intersection.
Remark 11.2.27 The definition of dimension given in this section can be used to
provide a definition for the degree of a projective variety.
Namely, one can prove that given a surjective projection φ : X → Pn , with finite
fibers, the cardinality d of the general fiber is fixed, and it corresponds to the (linear)
dimension of the residue field k(X), considered as a vector space over C(x1, . . . , xn).
Thus d does not depend on the choice of φ. The number d is called the degree of X .
In general, the computation of the degree of a projective variety is not straight-
forward. Since the results on which the very definition of degree is based, as well
as the computation of the degree for elementary examples, require a theory which
goes beyond the aims of the book, we do not include it here. The interested reader
can find an account of the theory in section VII of [1] and in the book [2].
By Definition 9.3.1, a general projective map f : X → Y between projective
varieties is only locally equivalent, up to changes of coordinates, to a Veronese map
followed by a projection.
It is possible to construct examples of maps f for which an equivalent version of
Theorem 11.2.24 does not hold. In particular, one can have dim(Y ) < dim(X ).
In this section we give an account of some relations between the dimension of
projective varieties connected by projective maps and the dimension of the general
fibers of the map.
f = g0 + x0 g1 + · · · + x0d gd ,
thus π −1 (π(Q)) contains an open subset in the line P Q which is non-empty, since
Q belongs to it. Hence the closure of π −1 (π(Q)) coincides with the whole line P Q.
It follows that if we take a set of generators f 1 , . . . , f k for the homogeneous ideal
of X and for each j we write f j = g0 j + x0 g1 j + · · · + x0d gd j , then the set of Q ∈ U
such that P belongs to the closure of π −1 (π(Q)), which is equal to the set such that
π −1 (π(Q)) is not finite, is defined by the vanishing of all the polynomials gi j ’s. The
claims follow.
Proof There exists a finite open cover {Ui } of X such that the restriction of f to each
Ui coincides with a Veronese embedding followed by a change of coordinates and a
projection πi . We claim that U ∩ Ui is open for each i. Indeed πi can be viewed as the
composition of a finite chain of projections from points P j ’s, and in each
projection,
by Lemma 11.3.1, the fibers are finite in an open subset. Thus U = ∪i(U ∩ Ui) is
open.
Next, we give a formula for the dimension of projective varieties, which is the most
useful and used formula in effective computations. It is based on the link between the
dimensions of two varieties X, Y , when there exists a projective map f : X → Y .
The first step toward the formula is the notion of semicontinuous maps on projec-
tive varieties.
The most important upper semicontinuous function that we will study in this
section is constructed as follows. Take a projective map f between projective varieties
f : X → Y and define μ f by:
is open in Y by the inductive assumption. For all these points the dimension of the
fiber of f is e. Thus the set of points such that the dimension of the fiber is bigger
than e is contained in a closed subvariety Y′. The inverse image of Y′ is a proper
closed subvariety X′ of X. Restricting f to X′ and using induction, we see that the
set of points of Y whose fibers have dimension bigger than e is a closed subset Y1 of
Z0 = Y ⊃ Z1 ⊃ Z2 ⊃ · · · ,
11.4 Exercises
Exercise 37 Prove that every homomorphism between two fields is either trivial or
injective.
References
1. Zariski, O., Samuel, P.: Commutative Algebra I. Graduate Texts in Mathematics, vol. 28.
Springer, Berlin (1958)
2. Harris, J.: Algebraic Geometry, a First Course. Graduate Texts in Mathematics, vol. 133.
Springer, Berlin (1992)
Chapter 12
Secant Varieties
The study of the rank of tensors has a natural geometric counterpart in the study of
secant varieties. Secant varieties or, more generally, joins are relevant objects in
much research on the properties of projective varieties.
We introduce here the study of join varieties and secant varieties, mainly pointing
out the most important aspects for applications to the theory of tensors.
12.1 Definitions
Proof A point (P1 , . . . , Pk , Q) belongs to the total join T J (Y1 , . . . , Yk ) if and only
if
(i) each Pi satisfies the equations of Yi in the i-th set of multihomogeneous coordinates
in (Pn )k ; and
(ii) taking homogeneous coordinates (yi0, . . . , yin) for Pi and (x0, . . . , xn) for Q,
then all the (k + 1) × (k + 1) minors of the matrix whose rows are
(y10, . . . , y1n), . . . , (yk0, . . . , ykn), (x0, . . . , xn) vanish.
Since these last minors are multilinear polynomials in the multihomogeneous
coordinates of (Pn )k+1 , the claim follows.
Notice that the total join coincides trivially with the product Y1 × · · · × Yk × Pn
when k > n + 1. Thus, in order to avoid trivialities:
Notice that, in the definition of total join, we are not excluding the case in which
some of the Yi ’s, or even all of them, coincide.
Proof It is enough to observe that the set is defined by the (multilinear) k × k minors of
the matrix obtained by taking as rows a set of coordinates of the Pi ’s.
We will now consider the behavior of a total join with respect to the natural
projections of (Pn )k to its factors. Let us recall the following fact.
We put the adjective embedded in parentheses since we will (often) drop it and
say simply that J (Y1 , . . . , Yk ) is the join of Y1 , . . . , Yk .
Notice that while the abstract join is an element of the product (Pn )k × Pn , the
join is a subset of Pn , which is Zariski closed, since the last projection is closed (see
Proposition 10.4.4).
irreducible variety AS2(Y) has P ≠ Q (notice that the limit of a family of points in
AS2(Y) necessarily belongs to AS2(Y), since the abstract secant variety is closed).
For instance, consider the limit in the case P = x^d, Q = (x + t y)^d, T = P − Q.
For t ≠ 0, we have

T = −( d t x^{d−1} y + \binom{d}{2} t^2 x^{d−2} y^2 + · · · + t^d y^d ).

Projectively, for t ≠ 0, T is equivalent to d x^{d−1} y + \binom{d}{2} t x^{d−2} y^2 + · · · +
t^{d−1} y^d. Thus for t → 0, T goes to x^{d−1} y. This implies that (x^d, x^d, x^{d−1} y) ∈
AS2(Y), i.e., x^{d−1} y ∈ S2(Y).
Notice that x d−1 y cannot be written as L d + M d , for any choice of linear forms
L , M. So x d−1 y is a genuinely new point of S2 (Y ).
With a similar trick, one can prove that for any choice of two linear forms L , M
in x, y, the form L d−1 M belongs to S2 (Y ).
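The limit computation above is easy to reproduce symbolically, say for d = 4. The following Python/SymPy sketch (ours; the concrete degree is chosen only for illustration) shows that P − Q, rescaled by 1/t, tends to a multiple of x^3 y as t → 0.

    from sympy import expand, limit, symbols

    x, y, t = symbols('x y t')
    d = 4
    T = expand(x**d - (x + t*y)**d)     # T = P - Q
    # projectively T is equivalent to T/t for t != 0; take the limit for t -> 0
    print(limit(T / t, t, 0))           # -4*x**3*y, i.e. x^(d-1)*y up to a scalar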
While the dimension of the abstract secant variety is always easy to compute, for
the embedded secant variety the computation of the dimension can be complicated. To
give an example, let us show first how the interpretation of secant varieties introduced
in Example 12.1.10 can be extended.
Example 12.1.12 Consider the Veronese variety Y obtained by the Veronese embed-
ding vn,d : Pn → P N , where N = d+n n
− 1.
Y can be considered as the set of symmetric tensors of rank 1 and type (n + 1) ×
· · · × (n + 1), d times, i.e., the set of forms of degree d in n + 1 variables, which
are powers of linear forms.
Consider a form T of rank k. Then T can be written as a sum T1 + · · · + Tk of
forms of rank 1, with k minimal. The minimality of k implies that T1 , . . . , Tk are
linearly independent, since otherwise we could simply drop one of them and write
T as a sum of k − 1 powers. In particular k ≤ N + 1. We get that (T1 , . . . , Tk , T )
belongs to the k-th abstract secant variety of Y , so that T is a point of Sk (Y ).
The secant variety Sk(Y) also contains forms whose rank is not k. For instance, if
T has rank k′ < k, then we can write T = T1 + · · · + Tk′, with T1, . . . , Tk′ linearly
independent. Consider points Tk′+1, . . . , Tk ∈ Y such that T1, . . . , Tk are linearly
independent. Then consider the form
T(t) = T1 + · · · + Tk′ + t Tk′+1 + · · · + t Tk,
We can generalize the construction given in the previous example to show that
Example 12.1.14 Let Y = v2,2 (P2 ) be the Veronese variety of rank 1 forms of degree
2 in 3 variables, which is a subvariety of P5 . The abstract secant variety AS2 (Y ),
corresponding to triplets (L 2 , M 2 , T ) such that L , M are linear forms and T =
a L 2 + bM 2 , has dimension 2 + 2 + 1 = 5, by Proposition 12.1.11.
Let us prove that the dimension of the secant variety S2 (Y ) is 4. To do that, it
is enough to prove that the general fiber of the projection π : AS2 (Y ) → S2 (Y ) is
1-dimensional. To this aim, consider T = aL2 + bM2, with L, M general linear forms.
L, M correspond to general points of the projective plane P2 of linear forms in 3
variables. Let ℓ ⊂ P2 be the line joining L, M. After a change of coordinates, without
loss of generality, we may assume that L = x0, M = x1, so that ℓ has equation
x2 = 0 and T becomes a form of degree 2 in the variables x0, x1, i.e., T = x02 +
x12. It is easy to prove that π−1(T) has infinitely many points. Indeed for every
point P = ax0 + bx1 ∈ ℓ there exists exactly one Q = a′x0 + b′x1 ∈ ℓ such that
(v2(P), v2(Q), T) ∈ AS2(Y), i.e.,
T = (ax0 + bx1)2 + (a′x0 + b′x1)2,
(we will drop the reference to Y , in the case of tensors of some given type, because
by default we consider Y to be the variety of rank 1 tensors).
One can find examples in which the previous inequality is strict.
The latter property listed in the previous remark explains why the computation of
the border rank k of a tensor T can be important for applications: we can realize T
as a limit of tensors of rank k. In other words, T can be slightly modified to obtain a
tensor of rank k.
Varieties Y for which the abstract secant variety ASk (Y ) and the secant variety
Sk (Y ) have the same dimension are of particular interest in the theory of secant
spaces. For instance, if Y ⊂ P N is a variety of tensors of rank 1, then dim(ASk (Y )) =
dim(Sk (Y )) implies that the general fiber of the projection ASk (Y ) → Sk (Y ) is finite,
which means that for a general tensor T ∈ P N of rank k, there are (projectively) only a
finite number of presentations T = T1 + · · · + Tk , with Ti ∈ Y for all i. In particular,
tensors for which the presentation is unique are called identifiable tensors.
We say that T is finitely r -identifiable if there are only a finite number of decom-
positions of T of length r. We say that T is r-identifiable if there is only one decom-
position of T of length r.
We say that Y is generically finitely r -identifiable if a general element of Sr (Y ) is
finitely r -identifiable. Notice that Y is generically finitely r -identifiable if and only
if ASr (Y ) and Sr (Y ) have the same dimension.
We say that Y is generically r -identifiable if a general element of Sr (Y ) is r -
identifiable.
The first method to compute the identifiability is based on the computation of the
tangent space at a generic point.
Roughly speaking, the tangent space to a projective variety X ⊂ P N at a general
point P ∈ X can be defined by considering that a sufficiently small Zariski open
subset U of X around P is a differential subvariety of P N , for which the notion of
tangent space is a well established, differential object. Yet, we will give an algebraic
definition of tangent vectors, which are suitable for the computation of the dimension.
We will base the notion of tangent space first by giving the definition of embedded
tangent space of a hypersurface at the origin, and then extending the notion to any
(regular) point of any projective variety.
f = x0d−1 g1 + · · · + gd ,
It is clear that P0 ∈ TX (P0 ). Notice that the previous definition admits the case
g1 = 0: when this happens, then TX (P0 ) coincides with P N . Otherwise TX (P0 ) is a
linear subspace of dimension N − 1 = dim(X ).
We are ready for a rough definition of tangent space of any variety X .
We leave to the reader the proof that the definition of TX (P) does not depend on
the choice of the change of coordinates φ.
Unfortunately, in a certain sense as it happens for Groebner basis (Chap. 13), it
is not guaranteed that if f 1 , . . . , f s generate the ideal I X , the intersection of the cor-
responding tangent spaces TX ( fi ) (P0 ) determines TX (P). The situation can however
be controlled as follows.
Example 12.2.6 The special case in which X = Pn is easy: the tangent space of X
at any point coincides with Pn . Thus any point of X is regular.
Next, we provide examples of tangent spaces to relevant varieties for tensor anal-
ysis.
Inverse systems are a method to compute the rank and the identifiability of a given
symmetric tensor T, identified with a form of given degree d in some polynomial ring
R = C[x0, . . . , xn].
The method is based on the remark that if

T = L1^d + · · · + Lr^d,

where the Li's are linear forms in R, then every derivative of T is also spanned by
powers of L1, . . . , Lr.
One can easily prove that, for all i, ∂i (xim ) = (∂i xi )xim−1 = xim−1 . It follows:
∂i^a xi^b = xi^{b−a} if a ≤ b, and ∂i^a xi^b = 0 if a > b.
Example 12.2.18 If F is a form of rank 1, i.e., F = L d for some linear form L, then
it is easy to compute F ⊥ . Write L = a0 x0 + · · · + an xn . Put v0 = (a0 , . . . , an ) and
extend to a basis v0 , v1 , . . . , vn of Cn+1 , where each vi , i > 0, is orthogonal to v0 .
Set vi = (ai0 , . . . , ain ). Define
The previous example provides a way in which one can use F ⊥ to determine the
rank of a symmetric tensor, i.e., of a polynomial form.
Namely, if F = L1^d + · · · + Lr^d, where Li is a linear form associated to the point
Pi ∈ Pn , then clearly F ⊥ contains the intersection of the homogeneous ideals in S
associated to the points Pi ’s.
It follows that
Proposition 12.2.19 The rank r of F is the minimum such that there exists a finite
set of points {P1, . . . , Pr} ⊂ Pn such that the intersection of the ideals associated to
the Pi ’s is contained in F ⊥ .
The linear forms associated to the points Pi's provide a decomposition for F.
Example 12.2.20 Consider the case of two variables x_0, x_1, i.e., n = 1, and let F =
x_0^2 x_1^2. F is thus a monomial, but this by no means implies that it is easy to compute
a decomposition of F in terms of (quartic) powers of linear forms.
In this specific case, one computes that F^⊥ is the ideal generated by ∂_0^3, ∂_1^3. Since
the ideal of any point [a : b] ∈ P^n = P^1 is generated by one linear form b∂_1 − a∂_0,
the intersection of two ideals of points is simply the ideal generated by the
product of the two generators. By induction, it follows that in order to determine
a decomposition of F with r summands, we must find r distinct linear forms in S
whose product is contained in the ideal generated by ∂_0^3, ∂_1^3. It is almost immediate
to see that we cannot find such linear forms if r = 2, thus the rank of F is bigger
than 2. It is possible to find a product of three linear forms which lies in F^⊥, namely:
(∂_0 − ∂_1)(∂_0 + ((1 − i√3)/2) ∂_1)(∂_0 + ((1 + i√3)/2) ∂_1) = ∂_0^3 − ∂_1^3 .
It follows from the theory that F has rank 3, and it is a linear combination of the 4-th
powers of the linear forms (x_0 − x_1), (x_0 + ((1 − i√3)/2) x_1), (x_0 + ((1 + i√3)/2) x_1).
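This decomposition can be checked by direct computation. Here is a minimal sketch, assuming Python with the SymPy library (the verification strategy, imposing F = c_1 L_1^4 + c_2 L_2^4 + c_3 L_3^4 coefficient by coefficient, is our own illustration and not part of the original text).

from sympy import symbols, sqrt, I, expand, linsolve

x0, x1, c1, c2, c3 = symbols('x0 x1 c1 c2 c3')

F = x0**2 * x1**2
L1 = x0 - x1
L2 = x0 + (1 - I*sqrt(3))/2 * x1
L3 = x0 + (1 + I*sqrt(3))/2 * x1

# collect the coefficients of x0^(4-k) x1^k in c1*L1^4 + c2*L2^4 + c3*L3^4 - F
residual = expand(c1*L1**4 + c2*L2**4 + c3*L3**4 - F)
eqs = [residual.coeff(x0, 4 - k).coeff(x1, k) for k in range(5)]
print(linsolve(eqs, [c1, c2, c3]))   # a nonzero solution exists, so rank(F) <= 3

A solution of the linear system exhibits F explicitly as a combination of the three fourth powers, in accordance with the discussion above.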
12.3 Exercises
Exercise 44 Prove the statement of Proposition 12.1.13: for every Y ⊂ P^n and for
k' < k ≤ n + 1, we always have S_{k'} ⊆ S_k.
Exercise 45 Prove that when Y ⊂ P14 is the image of the 4-veronese map of P2 ,
then dim(AS5 (Y )) = 14, but dim(S5 (Y )) < 14.
Chapter 13
Groebner Bases
Groebner bases represent the most powerful tool for computational algebra, in par-
ticular for the study of polynomial ideals. In this chapter, based on [1, Chap. 2], we
give a brief survey on the subject. For a deeper study of it, we suggest [1, 2].
Before treating Groebner bases, it is necessary to recall several concepts, starting
from that of a monomial ordering.
x^α > x^β ,   x^α = x^β ,   x^β > x^α .
(ii) to take into consideration the effects of the sum and product operations on the
monomials. When we add polynomials, after collecting the terms, we can simply
rearrange them. Multiplication could give problems if, when multiplying a polynomial
by a monomial, the ordering of the terms changed. In order to avoid this, we
require that if x^α > x^β and x^γ is a monomial, then x^α x^γ > x^β x^γ.
Remark 13.1.1 If we consider an ordering on Zn≥0 , then property (ii) means that if
α > β then, for any γ ∈ Zn≥0 , α + γ > β + γ.
Remark 13.1.3 Property (iii) is equivalent to the fact that > is a well ordering on
Z^n_{≥0}, that is, any non-empty subset of Z^n_{≥0} has a smallest element with respect to >. It
is not difficult to prove that this implies that every strictly decreasing sequence in Z^n_{≥0}
terminates. This fact will be fundamental when we want to prove that certain
algorithms stop after a finite number of steps, because some terms decrease strictly.
|α| = ∑_{i=1}^{n} α_i > |β| = ∑_{i=1}^{n} β_i ,   or   |α| = |β| and α >lex β.
We write x^α >grlex x^β if α >grlex β (graded lexicographic ordering).
|α| = ∑_{i=1}^{n} α_i > |β| = ∑_{i=1}^{n} β_i ,   or   |α| = |β|
and the first nonzero entry of α − β, starting from the right, is negative. We write
x^α >grevlex x^β if α >grevlex β (reverse graded lexicographic ordering).
Example 13.1.5
(1) (1, 2, 5, 4) >lex (0, 2, 4, 6) since (1, 2, 5, 4) − (0, 2, 4, 6) = (1, 0, 1, −2);
(2) (3, 2, 2, 4) >lex (3, 2, 2, 3) since (3, 2, 2, 4) − (3, 2, 2, 3) = (0, 0, 0, 1);
(3) (1, 3, 1, 4) <lex (2, 3, 2, 1) since the first entry, from the left, of (1, 3, 1, 4) −
(2, 3, 2, 1) = (−1, 0, −1, 3) is negative;
(4) (1, 2, 3, 5) <grlex (0, 1, 5, 6) since |(1, 2, 3, 5)| = 11 < 12 = |(0, 1, 5, 6)|;
(5) (4, 2, 2, 4) >grlex (4, 2, 1, 5) since |(4, 2, 2, 4)| = |(4, 2, 1, 5)| = 12 and
(4, 2, 2, 4) >lex (4, 2, 1, 5) (in fact (4, 2, 2, 4) − (4, 2, 1, 5) = (0, 0, 1, −1));
(6) (4, 3, 2, 1) <gr evlex (0, 2, 4, 6) since |(4, 3, 2, 1)| = 10 < 12 = |(0, 2, 4, 6)|;
(7) (1, 3, 4, 4) >grevlex (2, 3, 2, 5) since |(1, 3, 4, 4)| = |(2, 3, 2, 5)| = 12 and the
first nonzero entry, from the right, of (1, 3, 4, 4) − (2, 3, 2, 5) = (−1, 0, 2, −1) is
negative.
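These comparisons are easy to automate. The following is a minimal sketch, assuming plain Python and representing exponent vectors as tuples (the functions are our own illustration); it re-checks items (5) and (7) of Example 13.1.5.

def lex(a, b):
    # a >lex b : the first nonzero entry of a - b, from the left, is positive
    nonzero = [ai - bi for ai, bi in zip(a, b) if ai != bi]
    return bool(nonzero) and nonzero[0] > 0

def grlex(a, b):
    # a >grlex b : larger total degree, ties broken by lex
    return sum(a) > sum(b) or (sum(a) == sum(b) and lex(a, b))

def grevlex(a, b):
    # a >grevlex b : larger total degree, ties broken by the rightmost
    # nonzero entry of a - b being negative
    if sum(a) != sum(b):
        return sum(a) > sum(b)
    nonzero = [ai - bi for ai, bi in zip(a, b) if ai != bi]
    return bool(nonzero) and nonzero[-1] < 0

print(grlex((4, 2, 2, 4), (4, 2, 1, 5)))      # True, as in item (5)
print(grevlex((1, 3, 4, 4), (2, 3, 2, 5)))    # True, as in item (7)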
To each variable x_i we associate the vector e_i ∈ Z^n_{≥0} with 1 in the i-th position and
zero elsewhere. It is easy to check that
e_1 >lex e_2 >lex · · · >lex e_n ,
from which we get x_1 >lex x_2 >lex · · · >lex x_n. In practice, if we are working
with variables x, y, z, . . . we assume that the alphabetic ordering among the variables,
x > y > z > · · ·, is used to define the lexicographic ordering among monomials.
In the lexicographic ordering, each variable dominates any monomial com-
posed only of smaller variables. For example, x_1 >lex x_2^4 x_3^5 x_4 since (1, 0, 0, 0) −
(0, 4, 5, 1) = (1, −4, −5, −1). Roughly speaking, the lexicographic ordering does
not take into account the total degree of the monomials and, for this reason, we introduce
the graded lexicographic ordering and the reverse graded lexicographic ordering. The
two orderings behave in a different way: both use the total degree of the monomials,
but grlex uses the ordering lex and therefore “favors” the greater power of the first
variable, while gr evlex, looking at the first negative entry from right, “favors” the
smaller power of the last variable. For example,
x14 x2 x32 >grlex x13 x23 x3 and x13 x23 x3 >gr evlex x14 x2 x32 .
It is important to notice that there are n! orderings of type grlex and gr evlex
according to the ordering we give to the monomials of degree 1. For example, for
two variables we can have x1 < x2 or x2 < x1 .
Example 13.1.6 Let us show how monomial orderings are applied to polynomials.
If f ∈ k[x_1, . . . , x_n] and we have chosen a monomial ordering >, then we can order
the terms of f with respect to > in an unambiguous way. Consider, for example, f =
2x_1^2 x_2^2 + 3x_2^4 x_3 − 5x_1^4 x_2 x_3^2 + 7x_1^3 x_2^3 x_3. With respect to the lexicographic ordering, f
is written as
f = −5x_1^4 x_2 x_3^2 + 7x_1^3 x_2^3 x_3 + 2x_1^2 x_2^2 + 3x_2^4 x_3 .
LC(f) = a_{multideg(f)} ∈ k.
LM(f) = x^{multideg(f)} .
LT(f) = LC(f) · LM(f).
If we consider the polynomial f = 2x_1^2 x_2^2 + 3x_2^4 x_3 − 5x_1^4 x_2 x_3^2 + 7x_1^3 x_2^3 x_3 of
Example 13.1.6, once the reverse graded lexicographic ordering is chosen, one has
multideg(f) = (3, 3, 1), LC(f) = 7, LM(f) = x_1^3 x_2^3 x_3 and LT(f) = 7x_1^3 x_2^3 x_3.
From now on we will always assume that a particular monomial ordering has been
chosen and therefore that LC( f ), L M( f ) and L T ( f ) are calculated relative to that
monomial ordering only.
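As an illustration, the leading data of the polynomial f of Example 13.1.6 can be computed under the three orderings with the following minimal sketch, assuming Python with the SymPy library (the calls below reflect SymPy's polynomial interface, not the book's notation).

from sympy import symbols, LT, LM, LC

x1, x2, x3 = symbols('x1 x2 x3')
f = 2*x1**2*x2**2 + 3*x2**4*x3 - 5*x1**4*x2*x3**2 + 7*x1**3*x2**3*x3

for order in ('lex', 'grlex', 'grevlex'):
    # leading term, leading monomial and leading coefficient of f
    print(order, LT(f, x1, x2, x3, order=order),
          LM(f, x1, x2, x3, order=order), LC(f, x1, x2, x3, order=order))

In particular, under the reverse graded lexicographic ordering the leading term is 7x_1^3 x_2^3 x_3, in accordance with the computation above.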
The concepts introduced in Definition 13.1.7 permit us to extend the classical division
algorithm for polynomials in one variable, i.e., f ∈ k[x], to the case of polynomials
in several variables, f ∈ k[x_1, . . . , x_n]. In the general case, this means dividing
f ∈ k[x_1, . . . , x_n] by f_1, . . . , f_t ∈ k[x_1, . . . , x_n], which is equivalent to writing f as
f = a1 f 1 + · · · + at f t + r,
where the a_i's and r are elements of k[x_1, . . . , x_n]. The idea of this division algorithm
is the same as in the case of a single variable: we multiply f_1 by a suitable a_1 so as
to cancel the leading term of f, obtaining f = a_1 f_1 + r_1. Then we multiply
f_2 by a suitable a_2 so as to cancel the leading term of r_1, obtaining r_1 =
a_2 f_2 + r_2, and hence f = a_1 f_1 + a_2 f_2 + r_2. We proceed in the same manner for
the other polynomials f_3, . . . , f_t. The following theorem assures the correctness of
the algorithm.
Theorem 13.1.9 Let > be a fixed monomial ordering on Z^n_{≥0} and let
F = (f_1, . . . , f_t) be an ordered t-tuple of polynomials in k[x_1, . . . , x_n]. Then any
f ∈ k[x_1, . . . , x_n] can be written as
f = a_1 f_1 + · · · + a_t f_t + r,
where either r = 0 or no term of r is divisible by any of the leading terms LT(f_1), . . . , LT(f_t).
The proof of Theorem 13.1.9, which we do not include here (see [1, Theorem 3,
p. 61]), is based on the fact that the division algorithm terminates after a finite
number of steps, which is a consequence of the fact that > is a well ordering (see
Remark 13.1.3).
a_1 = LT(f)/LT(f_1) = x,
g = f − a_1 f_1 = x^3 y^2 + 3xy − 2y − x(xy + y) = −x − 2y.
The leading term of this polynomial, LT(g) = −x, is divisible by the leading term of f_2 and
hence we compute
a_2 = LT(g)/LT(f_2) = −1,   r = g − a_2 f_2 = −x − 2y + x + y = −y.
Unfortunately, the division algorithm of Theorem 13.1.9 does not behave as well as
in the case of a single variable. This is shown in the following two examples.
Example 13.1.11 Divide f = x^2 y + 4xy^2 − 2x by f_1 = xy + y and f_2 = x + y, in
this order. We compute
a_1 = LT(f)/LT(f_1) = x^2 y/(xy) = x,   g = f − a_1 f_1 = 4xy^2 − xy − 2x
a_2 = LT(g)/LT(f_2) = 4xy^2/x = 4y^2,   r = g − a_2 f_2 = −xy − 2x − 4y^3.
Notice that the leading term of the remainder, LT(r) = −xy, is still divisible by the
leading term of f_1. Hence we can again divide by f_1, getting
a_1 = LT(r)/LT(f_1) = −xy/(xy) = −1,   r = r − a_1 f_1 = −2x − 4y^3 + y.
Again the leading term of the new remainder, LT(r) = −2x, is still divisible by
the leading term of f_2. Hence we can again divide by f_2, getting
a_2 = LT(r)/LT(f_2) = −2x/x = −2,   r = r − a_2 f_2 = −4y^3 + 3y.
Example 13.1.12 We now divide the same f by the same polynomials, but in the
opposite order, that is, first by f_2 = x + y and then by f_1 = xy + y. We compute
a_2 = LT(f)/LT(f_2) = x^2 y/x = xy,   g = f − a_2 f_2 = 3xy^2 − 2x
a_1 = LT(g)/LT(f_1) = 3xy^2/(xy) = 3y,   r = g − a_1 f_1 = −2x − 3y^2.
Notice that the leading term of the remainder, LT(r) = −2x, is still divisible by the
leading term of f_2. Hence we can again divide by f_2, getting
a_2 = LT(r)/LT(f_2) = −2x/x = −2,   r = r − a_2 f_2 = −3y^2 + 2y,
giving a remainder, −3y^2 + 2y, which is different from the one obtained in Example
13.1.11.
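The dependence of the remainder on the order of the divisors can also be reproduced with a computer algebra system. Here is a minimal sketch, assuming Python with the SymPy library; the dividend and divisors are the ones read off from the computations above, and SymPy's reduction strategy need not follow the hand computation term by term, but the two remainders it returns are still different, which is the point of the examples.

from sympy import symbols, reduced

x, y = symbols('x y')
f  = x**2*y + 4*x*y**2 - 2*x      # the dividend of Examples 13.1.11 and 13.1.12
f1 = x*y + y
f2 = x + y

q12, r12 = reduced(f, [f1, f2], x, y, order='lex')   # divide by (f1, f2)
q21, r21 = reduced(f, [f2, f1], x, y, order='lex')   # divide by (f2, f1)
print(r12)
print(r21)   # the two remainders differ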
From the previous examples, we can conclude that the division algorithm in
k[x_1, . . . , x_n] is an imperfect generalization of the one-variable case. To overcome
these problems it is necessary to introduce Groebner bases. The basic idea is the
following: when we work with a set of polynomials f_1, . . . , f_t, we are really working
with the ideal I = ⟨f_1, . . . , f_t⟩ generated by them. This gives
us the ability to switch from f_1, . . . , f_t to a different set of generators of I
with better properties with respect to the division algorithm. Before introducing
Groebner bases we recall some concepts and results that will be useful to us.
x^β = ∑_{i=1}^{t} h_i x^{α_i} ,    (13.2.1)
α + Zn≥0 = {α + γ : γ ∈ Zn≥0 }
consists of the exponents of the monomials which are divisible by x^α. This fact, together
with Lemma 13.2.2, permits us to give a graphical description of the monomials in
a given monomial ideal. For example, if I = ⟨x^5 y^2, x^3 y^3, x^2 y^4⟩, then the exponents
of the monomials in I form the set
((5, 2) + Z^2_{≥0}) ∪ ((3, 3) + Z^2_{≥0}) ∪ ((2, 4) + Z^2_{≥0}).
We can visualize this set as the union of the integer points in three translated copies
of the first quadrant in the plane, as shown in Fig. 13.1.
The following lemma allows one to decide whether a polynomial f lies in a monomial ideal I
by looking at the monomials of f; a computational check is sketched right after the lemma.
Lemma 13.2.3 Let I be a monomial ideal and consider f ∈ k[x1 , . . . , xn ]. Then
the following conditions are equivalent:
(i) f ∈ I ;
(ii) every term of f is in I ;
(iii) f is a linear combination of monomials in I .
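Here is a minimal computational check of condition (ii), assuming Python with the SymPy library; the polynomial and the generators below are chosen only for illustration.

from sympy import symbols, Poly

x, y = symbols('x y')
gens = [x**5*y**2, x**3*y**3, x**2*y**4]      # monomial generators of I
f = 3*x**6*y**3 - x**4*y**5                    # a polynomial to be tested

def in_monomial_ideal(f, gens, *variables):
    # f lies in the monomial ideal iff every term of f is divisible
    # by some generator (condition (ii) of Lemma 13.2.3)
    exps = [m for m, _ in Poly(f, *variables).terms()]
    gexps = [Poly(g, *variables).monoms()[0] for g in gens]
    return all(any(all(e >= ge for e, ge in zip(exp, gexp)) for gexp in gexps)
               for exp in exps)

print(in_monomial_ideal(f, gens, x, y))        # True: every term is divisible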
One of the main results on monomial ideals is the so-called Dickson's Lemma,
which assures us that every monomial ideal is generated by a finite number of monomials.
For the proof, the interested reader can consult [1, Theorem 5, Chap. 2.4].
Lemma 13.2.4 (Dickson's Lemma) A monomial ideal I = ⟨x^α : α ∈ A⟩ ⊂ k[x_1,
. . . , x_n] can be written as I = ⟨x^{α_1}, x^{α_2}, . . . , x^{α_t}⟩, where α_1, α_2, . . . , α_t ∈ A. In
particular, I has a finite basis.
In practice, Dickson’s Lemma follows immediately from the Basis Theorem 9.1.22,
which has an independent proof. Since we did not provide a proof of Theorem 9.1.22,
for the sake of completeness we show how, conversely, Dickson’s Lemma can pro-
vide, as a corollary, a proof of a weak version of the Basis Theorem.
Fig. 13.1 The set ((5, 2) + Z^2_{≥0}) ∪ ((3, 3) + Z^2_{≥0}) ∪ ((2, 4) + Z^2_{≥0})
Theorem 13.2.5 (Basis Theorem, weak version) Any ideal I ⊂ k[x_1, . . . , x_n] has a
finite basis, that is, I = ⟨g_1, . . . , g_t⟩ for some g_1, . . . , g_t ∈ I.
Before proving the Hilbert Basis Theorem we introduce some concepts.
Definition 13.2.6 Let I ⊂ k[x_1, . . . , x_n] be an ideal different from the zero ideal
{0}.
(i) We denote by LT(I) the set of the leading terms of the nonzero elements of I, that is,
LT(I) = {LT(f) : f ∈ I \ {0}}.
(ii) We denote by ⟨LT(I)⟩ the ideal generated by the elements of LT(I).
Example 13.2.7 Consider the ideal I = ⟨f_1, f_2⟩, with f_1 = x^3 y − x^2 + x and
f_2 = x^2 y^2 − xy, with respect to the lexicographic ordering with x > y. The combination
y · (x^3 y − x^2 + x) − x · (x^2 y^2 − xy) = xy
belongs to I, so xy = LT(xy) ∈ ⟨LT(I)⟩, while xy ∉ ⟨LT(f_1), LT(f_2)⟩ = ⟨x^3 y, x^2 y^2⟩.
Hence ⟨LT(I)⟩ can be strictly larger than the ideal generated by the leading terms of a
given set of generators.
Proof For (i), notice that the leading monomials LM(g) of the elements g ∈ I \ {0}
generate the monomial ideal J := ⟨LM(g) : g ∈ I \ {0}⟩. Since LM(g) and LT(g)
differ only by a nonzero constant, one has J = ⟨LT(g) : g ∈ I \ {0}⟩ = ⟨LT(I)⟩.
Hence ⟨LT(I)⟩ is a monomial ideal.
For (ii), since ⟨LT(I)⟩ is generated by the monomials LM(g) with g ∈ I \ {0}, by
Dickson's Lemma, we know that ⟨LT(I)⟩ = ⟨LM(g_1), LM(g_2), . . . , LM(g_t)⟩ for
a finite number of polynomials g_1, g_2, . . . , g_t ∈ I. Since LM(g_i) and LT(g_i) differ
only by a nonzero constant, for i = 1, . . . , t, one has ⟨LT(I)⟩ = ⟨LT(g_1), LT(g_2),
. . . , LT(g_t)⟩.
Using Proposition 13.2.8 and the division algorithm of Theorem 13.1.9 we can
prove Theorem 13.2.5.
Proof of Hilbert Basis Theorem. If I = {0} then, as a set of generators, we take
{0}, which is clearly finite. If I contains some nonzero polynomial, then a set of
generators g_1, . . . , g_t for I can be built in the following way. By Proposition 13.2.8
there exist g_1, . . . , g_t ∈ I such that ⟨LT(I)⟩ = ⟨LT(g_1), LT(g_2), . . . , LT(g_t)⟩. We
prove that I = ⟨g_1, . . . , g_t⟩.
Clearly ⟨g_1, . . . , g_t⟩ ⊂ I since, for any i = 1, . . . , t, g_i ∈ I. On the other hand,
let f ∈ I be a polynomial. We apply the division algorithm of Theorem 13.1.9 to
divide f by g_1, . . . , g_t. We get
f = a_1 g_1 + · · · + a_t g_t + r,
where no term of r is divisible by any of the leading terms LT(g_i). We show
that r = 0. To this aim, we observe, first of all, that
r = f − a_1 g_1 − · · · − a_t g_t ∈ I.
If r ≠ 0, then LT(r) ∈ ⟨LT(I)⟩ = ⟨LT(g_1), . . . , LT(g_t)⟩, so LT(r) would be divisible
by some LT(g_i), contradicting the property of the remainder. Hence r = 0 and
f = a_1 g_1 + · · · + a_t g_t + 0 ∈ ⟨g_1, . . . , g_t⟩,
which proves that I ⊆ ⟨g_1, . . . , g_t⟩.
A Groebner basis is a “good” basis for the division algorithm of Theorem 13.1.9. Here
“good” means that the problems of Examples 13.1.11 and 13.1.12 do not happen.
Let us think about Theorem 13.2.5: the basis used in the proof has the particular
property that ⟨LT(g_1), . . . , LT(g_t)⟩ = ⟨LT(I)⟩. It is not true that any basis of I has
this property, and so we give a specific name to the bases having this property.
The following result guarantees us that every ideal has a Groebner basis.
Proof Given an ideal I , different from the zero ideal, the set G = {g1 , . . . , gt }, built as
in the proof of Theorem 13.2.5, is a Groebner basis by definition. To prove the second
part of the statement it is enough to observe that, again, the proof of Theorem 13.2.5
assures us that I = g1 , . . . , gt , that is G is a basis for I .
Remark 13.3.4 The remainder r of Proposition 13.3.3 is usually called the normal form
of f. Proposition 13.3.3 tells us that Groebner bases can be characterized through
the uniqueness of the remainder. Observe that, even though the remainder is unique,
independently of the order in which we divide f by the g_i's, the coefficients a_i,
in f = a_1 g_1 + · · · + a_t g_t + r, are not unique.
Let us now explain how it is possible to build a Groebner basis for an ideal
I from a set of generators f_1, . . . , f_t of I. As we saw before, one of the reasons why
{f_1, . . . , f_t} may fail to be a Groebner basis is that there is a
combination of the f_i's whose leading term does not lie in the ideal generated by the
LT(f_i)'s. This happens, for example, when the leading terms of a combination
a x^α f_i − b x^β f_j cancel, leaving only terms of lower degree. On the other hand,
a x^α f_i − b x^β f_j ∈ I and therefore its leading term belongs to ⟨LT(I)⟩. To study this
cancelation phenomenon, we introduce the concept of S-polynomial.
S(f, g) = (x^γ / LT(f)) · f − (x^γ / LT(g)) · g.
S(f, g) = (x^3 y^3 z^2 / (3x^3 z^2)) · f − (x^3 y^3 z^2 / (x^2 y^3 z)) · g = x^2 y^4 z − (1/3) x^2 y^3 z + (1/3) x y^4 z − x z^3 .
Using the concept of S-polynomial and the previous lemma, we can prove the
following criterion to establish whether a basis of an ideal is a Groebner basis.
Proof The “only if” direction is simple: if G is a Groebner basis, then, since
the S(g_i, g_j) belong to I, their remainders in the division by G are zero, by Corollary 13.3.5. It
remains to prove the “if” direction.
Let f ∈ I = ⟨g_1, . . . , g_t⟩ be a nonzero polynomial. Hence there exist polynomials
h_i ∈ k[x_1, . . . , x_n] such that
f = ∑_{i=1}^{t} h_i g_i .    (13.3.1)
The monomials appearing in the second and third sums of the second line all
have multidegree < δ. So, our hypothesis multideg(f) < δ tells us that the first sum
also has multidegree < δ. Let LT(h_i) = c_i x^{α_i}; then the first sum
∑_{m_i = δ} LT(h_i) g_i = ∑_{m_i = δ} c_i x^{α_i} g_i
has exactly the form described in Lemma 13.3.10 with f i = x αi gi . Hence, again by
Lemma 13.3.10, this sum is a linear combination of the S-polynomials S(x α j g j ,
x αk gk ). Moreover one has
S(x^{α_j} g_j, x^{α_k} g_k) = (x^δ / (x^{α_j} LT(g_j))) x^{α_j} g_j − (x^δ / (x^{α_k} LT(g_k))) x^{α_k} g_k = x^{δ−γ_{jk}} S(g_j, g_k),
where x^{γ_{jk}} is the least common multiple of LM(g_j) and LM(g_k). Hence there
exist constants c_{jk} ∈ k such that
∑_{m_i = δ} LT(h_i) g_i = ∑_{j,k} c_{jk} x^{δ−γ_{jk}} S(g_j, g_k).    (13.3.4)
S(g_j, g_k) = ∑_{i=1}^{t} a_{ijk} g_i    (13.3.5)
for any choice of i, j and k. This means that, when the remainder is zero, we can find
an expression for S(g_j, g_k) in terms of G in which the leading terms do not all cancel.
As a matter of fact, multiplying the expression of S(g_j, g_k) by x^{δ−γ_{jk}} we obtain
x^{δ−γ_{jk}} S(g_j, g_k) = ∑_{i=1}^{t} b_{ijk} g_i ,
If we replace the previous expression of x^{δ−γ_{jk}} S(g_j, g_k) in (13.3.4), we get the
following equation:
∑_{m_i = δ} LT(h_i) g_i = ∑_{j,k} c_{jk} x^{δ−γ_{jk}} S(g_j, g_k) = ∑_{j,k} c_{jk} ∑_{i=1}^{t} b_{ijk} g_i = ∑_i h̃_i g_i
Example 13.3.12 (from [1], page 84) Consider the ideal I = ⟨y − x^2, z − x^3⟩ of the
twisted cubic in R^3. Fix the lexicographic ordering with y > z > x. We prove that
G = {y − x^2, z − x^3} is a Groebner basis for I. Compute the S-polynomial
S(y − x^2, z − x^3) = (yz/y)(y − x^2) − (yz/z)(z − x^3) = −z x^2 + y x^3 .
Dividing −z x^2 + y x^3 by G, the remainder is zero (indeed −z x^2 + y x^3 = x^3 (y − x^2) − x^2 (z − x^3)), hence G is a Groebner basis.
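The S-polynomial itself is easy to compute directly from the definition. The following is a minimal sketch, assuming Python with the SymPy library (the helper function is our own illustration); it re-checks the S-polynomial of Example 13.3.12, with the variables passed in the order y > z > x used there.

from sympy import symbols, LT, LM, lcm, expand

def s_poly(f, g, *gens, order='lex'):
    # S(f, g) = (x^gamma / LT(f)) f - (x^gamma / LT(g)) g,
    # where x^gamma = lcm(LM(f), LM(g))
    m = lcm(LM(f, *gens, order=order), LM(g, *gens, order=order))
    return expand(m / LT(f, *gens, order=order) * f
                  - m / LT(g, *gens, order=order) * g)

x, y, z = symbols('x y z')
print(s_poly(y - x**2, z - x**3, y, z, x, order='lex'))   # x**3*y - x**2*z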
We have seen, by Corollary 13.3.2, that every ideal admits a Groebner basis, but
unfortunately, the corollary does not tell us how to build it. So let’s see now how this
problem can be solved via the Buchberger algorithm.
Theorem 13.4.1 Let I = ⟨f_1, . . . , f_s⟩ ≠ {0} be an ideal in k[x_1, . . . , x_n]. A Groeb-
ner basis for I can be built, in a finite number of steps, by the following algorithm.
Input: F = ( f 1 , . . . , f s )
Output: a Groebner basis G = (g1 , . . . , gt ) for I , with F ⊂ G.
G := F
REPEAT
    G' := G
    FOR each pair {p, q}, p ≠ q, in G' DO
        S := the remainder of S(p, q) on division by G'
        IF S ≠ 0 THEN G := G ∪ {S}
UNTIL G = G'
For the proof, the reader can see [1].
Example 13.4.2 Consider again the ideal I = ⟨f_1, f_2⟩ of Example 13.2.7. We
already know that {f_1, f_2} = {x^3 y − x^2 + x, x^2 y^2 − xy} is not a Groebner basis,
since y · (x^3 y − x^2 + x) − x · (x^2 y^2 − xy) = xy and LT(xy) = xy ∉ ⟨LT(f_1), LT(f_2)⟩.
We fix G = G' = {f_1, f_2} and compute
S(f_1, f_2) = (x^3 y^2 / (x^3 y)) f_1 − (x^3 y^2 / (x^2 y^2)) f_2 = xy.
Since the remainder of S(f_1, f_2) on division by G' is xy, we add f_3 = xy to G. We
repeat the cycle with the new set of polynomials, obtaining
S(f_1, f_2) = xy,   S(f_1, f_3) = −x^2 + x,   S(f_2, f_3) = −xy,
whose remainders on division by G' are, respectively,
0,   −x^2 + x,   0.
Since the remainder of S(f_1, f_3) is −x^2 + x ≠ 0, we add f_4 = x^2 − x to G and repeat
the cycle, obtaining
S(f_1, f_2) = xy,   S(f_1, f_3) = −x^2 + x,   S(f_1, f_4) = x^2 y − x^2 + x,
S(f_2, f_3) = −xy,   S(f_2, f_4) = xy^2 − xy,   S(f_3, f_4) = xy,
and this time all the remainders on division by G' are zero.
Thus we can exit the cycle, obtaining the Groebner basis
G = {x^3 y − x^2 + x, x^2 y^2 − xy, xy, x^2 − x}.
Its leading terms are
LT(x^3 y − x^2 + x) = x^3 y,   LT(x^2 y^2 − xy) = x^2 y^2,   LT(xy) = xy,   LT(x^2 − x) = x^2.
Notice that the first two leading terms are multiples of LT(xy) = xy, so the corresponding
elements can be removed and {xy, x^2 − x} is still a Groebner basis of I.
An ideal can have many minimal Groebner bases. However, we can find one which
is better than the others.
Definition 13.4.7 A reduced Groebner basis for an ideal I ⊂ k[x1 , . . . , xn ] is a
Groebner basis G for I such that
(i) LC( p) = 1 for all p ∈ G.
(ii) For all p ∈ G, no monomial of p is in L T (G \ { p}).
The reduced Groebner bases have the following important property.
Proposition 13.4.8 Let I ⊂ k[x1 , . . . , xn ] be an ideal different from {0}. Then, given
a monomial ordering, I has a unique reduced Groebner basis.
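In practice one rarely runs Buchberger's algorithm by hand. Here is a minimal sketch, assuming Python with the SymPy library, applied to the ideal of Example 13.4.2; SymPy's groebner routine directly returns the reduced Groebner basis of Proposition 13.4.8, which is smaller than the basis built step by step above.

from sympy import symbols, groebner

x, y = symbols('x y')
G = groebner([x**3*y - x**2 + x, x**2*y**2 - x*y], x, y, order='lex')
print(G.exprs)   # expected: the reduced basis, e.g. [x**2 - x, x*y]

The elements x^3 y − x^2 + x and x^2 y^2 − xy are discarded because their leading terms are multiples of xy and of x^2, consistently with the remark after Example 13.4.2.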
Elimination Theory, as shown in Sects. 10.2 and 10.3, is a systematic way to
eliminate variables from a system of polynomial equations. The central part of this
method is based on the so-called Elimination Theorem and Extension Theorem. We
now define the concept of “eliminating variables” in terms of ideals and Groebner
bases.
Definition 13.5.1 Given an ideal I = ⟨f_1, . . . , f_t⟩ ⊂ k[x_1, . . . , x_n], the l-th elimination
ideal I_l of I is the ideal of k[x_{l+1}, . . . , x_n] defined as
I_l = I ∩ k[x_{l+1}, . . . , x_n].
G l = G ∩ k[xl+1 , . . . , xn ]
The Elimination Theorem shows that a Groebner basis with respect to the lexicographic
order eliminates not only the first variable, but also the first two variables, the first
three variables, and so on. Often, however, we want to eliminate only certain variables,
while we are not interested in the others. In these cases, it may be difficult to calculate a
Groebner basis using the lexicographic ordering, especially because this ordering
can produce Groebner bases which are not particularly nice. For versions of the
Elimination Theorem based on other orderings, we refer to [1].
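As a small illustration of the Elimination Theorem, here is a minimal sketch, assuming Python with the SymPy library, for the ideal of the twisted cubic: a lex Groebner basis with x > y > z immediately exhibits generators of the first elimination ideal.

from sympy import symbols, groebner

x, y, z = symbols('x y z')
G = groebner([y - x**2, z - x**3], x, y, z, order='lex')   # lex with x > y > z
G1 = [g for g in G.exprs if not g.has(x)]   # G ∩ k[y, z], by the Elimination Theorem
print(G1)   # expected: [y**3 - z**2], a generator of the first elimination ideal I_1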
Now let us introduce the Extension Theorem. Suppose we have an ideal I ⊂
k[x1 , . . . , xn ] that defines the affine variety
The g_i(x_l, a_{l+1}, . . . , a_n)'s are polynomials in one variable, hence their common solutions
are the roots of the greatest common divisor of these s polynomials. Obviously,
it can happen that the g_i(x_l, a_{l+1}, . . . , a_n)'s have no common solutions, depending
on the choice of a_{l+1}, . . . , a_n. Hence our aim, at the moment, is to try to determine,
a priori, which partial solutions extend to complete solutions. We restrict ourselves to
the case in which we eliminated the first variable x_1, and hence we want to know whether a
partial solution (a_2, . . . , a_n) ∈ V(I_1) extends to a solution (a_1, . . . , a_n) ∈ V(I). The
following theorem tells us when this is possible.
Notice that the Extension Theorem requires the complex field. Consider the equations
x^2 = y,   x^2 = z.
If we eliminate x we get y = z and, hence, all partial solutions (a, a) for a ∈ R.
Since the leading coefficient of x in x^2 = y and x^2 = z never vanishes, Theorem 13.5.4
guarantees that we can extend (a, a), under the condition that we are working over C.
As a matter of fact, over R, x^2 = a has no real solutions if a is negative, hence the
only partial solutions (a, a) that we can extend are those with a ∈ R_{≥0}.
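A quick check of this phenomenon can be carried out by computer. Here is a minimal sketch, assuming Python with the SymPy library (the two values of a are chosen only for illustration): for the partial solution (a, a), one solves x^2 = a over C and over R.

from sympy import symbols, solveset, S

x = symbols('x')
for a in (4, -4):
    print(a,
          solveset(x**2 - a, x, domain=S.Complexes),   # always two solutions over C
          solveset(x**2 - a, x, domain=S.Reals))       # empty when a < 0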
Remark 13.5.5 Although the Extension Theorem gives a statement only in the case where the
first variable is eliminated, it can still be used to eliminate any number of variables.
The idea is to extend solutions one variable at a time: first to x_l, then to x_{l−1}, and
so on up to x_1.
The Extension Theorem is particularly useful when one of the leading coefficients
is constant.
where c ∈ C is different from zero and N > 0. If I1 is the first elimination ideal of I
and (a2 , . . . , an ) ∈ V (I1 ), then there exists a1 ∈ C such that (a1 , . . . , an ) ∈ V (I ).
We end this section by recalling again that the process of elimination corresponds
to the projection of varieties onto subspaces of lower dimension. For the rest of this
section, we work over C.
Let V = V(f_1, . . . , f_t) ⊂ C^n be an affine variety. To eliminate the first l variables
x_1, . . . , x_l we consider the projection map
π_l : C^n → C^{n−l} ,   (a_1, . . . , a_n) ↦ (a_{l+1}, . . . , a_n).
The following lemma explains the link between πl (V ) and the l−th elimination ideal.
πl (V ) ⊂ V (Il ).
Hence, πl (V ) consists exactly of the partial solutions that can be extended to complete
solutions. Then, we can give a geometric version of the Extension Theorem.
The previous theorem tells us that π_1(V) covers the affine variety V(I_1), with the
possible exception of a part contained in V(g_1, . . . , g_t). Unfortunately, we do not know
how big this part is, and there are cases where V(g_1, . . . , g_t) is huge. However, the
following result permits us to understand even better the link between π_1(V) and
V(I_1).
The Closure Theorem gives a partial description of πl (V ) that covers V (Il ) except
for the points lying in a variety strictly smaller than V (Il ).
Finally, we have also the geometric version of Corollary 13.5.6 that represents a
good situation for elimination.
Corollary 13.5.10 Let V = V ( f 1 , . . . , f t ) ⊂ Cn and suppose that for some i, f i
can be written as
f_i = c_i x_1^N + (terms in x_1 of degree < N),
where ci ∈ C is different from zero and N > 0. If I1 is the first elimination ideal of
I , then, in Cn−1 ,
π1 (V ) = V (I1 ),
13.6 Exercises
Exercise 48 Check whether the following sets are Groebner bases. In case of a negative
answer, compute a Groebner basis of the ideal they generate.
(a) {x1 − x2 + x3 − x4 + x5 , x1 − 2x2 + 3x3 − x4 + x5 , 4x3 + 4x4 + 5x5 }, using
>lex ;
(b) {x1 − x2 + x3 − x4 + x5 , x1 − 2x2 + 3x3 − x4 + x5 , 4x3 + 4x4 + 5x5 }, using
>grlex ;
(c) {x1 + x2 + x3 , x2 x3 , x22 + x32 , x33 }, using >gr evlex .
References
1. Cox, D., Little, J., O’Shea, D.: Ideals, Varieties, and Algorithms: An Introduction to Compu-
tational Algebraic Geometry and Commutative Algebra. Undergraduate Texts in Mathematics,
Springer, New York (2007)
2. Greuel, G.-M., Pfister G.: A Singular Introduction to Commutative Algebra. Springer, Berlin
(2007)
Index
A or a hypersurface, 185
Algebraic Dipole, 22, 36
closure, 183 Distribution, 7
element, 182 coherent, 25
Algebraically closed field, 183 induced, 9
Algebraic model of hidden variable, 73 of independence, 36, 37
Alphabet, 3 probabilistic, 11
associated, 11, 16
without triple correlation, 37
B DNA-systems, 4
Bilinear form, 82 Dual ring, 207
Booleanization, 18
Buchberger’s algorithm, 225, 226
E
Elementary logic connector, 20
C Equidistribution, 16
Change of coordinates, 156 Exponential matrix, 43
Cone, 134 Extension
Connection, 38 algebraic, 182
projective, 50
Contraction, 121
i-contraction, 119
F
J -contraction, 121
Flattening, 125
partial, 121
Correlation
partial, 13 G
total, 6 Groebner basis, 221
minimal, 227
reduced, 227, 228
D
Defectiveness, 76
Dickson’s Lemma, 218 H
Dichotomy, 18
Dimension, 181 model, 71
expected, 75, 205 Homogeneous coordinates, 134
of Segre variety, 188 Hyperplane, 140
of Veronese variety, 187 Hypersurface, 140
I parametric, 35
Ideal linear, 40
associated to a variety, 140 of conditional independence, 63
first elimination, 161 parametric, 39
generated by, 138 projective, 50
homogeneous, 138 projective algebraic, 50
irrelevant, 139 projective parametric, 50
l-th elimination, 228 toric, 39, 43
monomial, 217 without triple correlation, 51
prime, 179 Monomial ordering, 212
radical, 139 graded lexicographic, 212
Identifiable, 203 lexicographic, 165, 212
finitely, 203 reverse graded lexicographic, 213
generically, 203 Multihomogeneous coordinates, 146
Image distribution, 9 Multi-linear map, 83, 87
Independence connection, 22, 38 Multiprojective space, 145
Independence model, 36, 37 cubic, 173
Independent set of varieties, 196 of distributions, 49
Inverse system, 207
P
J Polynomial
Join Homogeneous decomposition of a, 135
abstract, 198 irreducible, 144
dimension of the, 199 leading coefficient of a, 214
embedded, 198 leading monomial of a, 214
total, 195 leading term of a, 214
Jukes–Cantor’s matrix, 61 multidegree of a, 214
multihomogeneous, 146
normal form of a , 222
K Preimage distribution, 9
K-distribution, see Distribution Projection
as projective map, 155
center of the, 158
M from a projective linear subspace, 157
Map from a projective point, 157
diagonal embedding, 173 of a random system, 38
dominant, 190 Projective function field, 180
isomorphism, 152 Projective kernel, 157
multiprojective, 153 Projective space, 133
projective, 149 dimension of a, 134
fiber of a, 180
linear, 156
Segre, 168 Q
upper semicontinuous, 190 Quotient field, 180
Veronese, 164
Marginalisation
of a matrix, 117 R
of a tensor, 118 Random variable, 3
Marginalization, 24 boolean, 5
Markov chains, 65 state of a, 3
Markov model, 66 Rank
Model, 35 border, 201
algebraic, 35 generic, 75, 201
Index 235