

Cristiano Bocci · Luca Chiantini

An Introduction to Algebraic Statistics with Tensors
UNITEXT - La Matematica per il 3+2

Volume 118

Editor-in-Chief
Alfio Quarteroni, Politecnico di Milano, Milan, Italy; EPFL, Lausanne, Switzerland

Series Editors
Luigi Ambrosio, Scuola Normale Superiore, Pisa, Italy
Paolo Biscari, Politecnico di Milano, Milan, Italy
Ciro Ciliberto, Università di Roma “Tor Vergata”, Rome, Italy
Camillo De Lellis, Institute for Advanced Study, Princeton, NJ, USA
Victor Panaretos, Institute of Mathematics, EPFL, Lausanne, Switzerland
Wolfgang J. Runggaldier, Università di Padova, Padova, Italy
The UNITEXT – La Matematica per il 3+2 series is designed for undergraduate
and graduate academic courses, and also includes advanced textbooks at a research
level. Originally released in Italian, the series now publishes textbooks in English
addressed to students in mathematics worldwide. Some of the most successful
books in the series have evolved through several editions, adapting to the evolution
of teaching curricula.

More information about this subseries at http://www.springer.com/series/5418


Cristiano Bocci and Luca Chiantini
Dipartimento di Ingegneria dell'Informazione e Scienze Matematiche,
Università di Siena, Siena, Italy

ISSN 2038-5714 ISSN 2532-3318 (electronic)
UNITEXT - La Matematica per il 3+2
ISSN 2038-5722 ISSN 2038-5757 (electronic)
ISBN 978-3-030-24623-5 ISBN 978-3-030-24624-2 (eBook)
https://doi.org/10.1007/978-3-030-24624-2

© Springer Nature Switzerland AG 2019


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with regard
to jurisdictional claims in published maps and institutional affiliations.

Cover illustration: A decomposable 3-dimensional tensor of type 3 × 5 × 2.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
We, the authors, dedicate this book to our
great friend Tony Geramita. When the project
started, Tony was one of the promoters and
he should be among us in the list of authors
of the text. Tony passed away when the book
was at an early stage. We finished the book
following the pattern traced in collaboration
with him, and we always felt as if his
encouragement to continue the project never
faded.
Preface

Statistics and Algebraic Statistics

At the beginning of a book on Algebraic Statistics, it is undoubtedly a good idea to give the reader some idea of the goals of the discipline.
A reader who is already familiar with the basics of Statistics and Probability is
probably curious about what the prefix “Algebraic” might mean. As we will see,
Algebraic Statistics has its own way of approaching statistical problems, exploiting
algebraic, geometric, or combinatorial properties. These problems are somewhat
different from the ones studied by Classical Statistics.
We will illustrate this point of view with some examples, which consider
well-known statistical models and problems. At the same time, we will point out the
difference between the two approaches to these examples.

The Treatment of Random Variables

The initial concern of Classical Statistics is the behavior of one random variable X.
Usually X is identified with a function with values in the real numbers. This is
clearly an approximation. For example, if one records the height of the members of
a population, it is unlikely that the measurement goes much beyond the second
decimal digit (assume that the unit is 1 m). So, the corresponding graph is a
histogram, with a basic interval of 0.01 m. This is translated into a continuous
variable by sending the length of the basic interval to zero (as the size of the
population increases).


For random variables of this type, the first natural distribution that one expects is
the celebrated Gaussian distribution, which corresponds to the function

X(t) = (1/(σ√(2π))) e^{−(t−μ)²/(2σ²)}

where μ and σ are parameters which describe the shape of the curve (of course,
other types of distributions are possible, in connection with special behaviors of the
random variable X(t)).
The first goal of Classical Statistics is the study of the shape of the function X(t),
together with the determination of its numerical parameters.
When two or more variables are considered in the framework of Classical
Statistics, their interplay can be studied with several techniques. For instance, if we
consider both the heights and the weights of the members of a population and our
goal is a proof of the (obvious) fact that the two variables are deeply connected,
then we can consider the distribution over pairs (height, weight), which is repre-
sented by a bivariate Gaussian, in order to detect the existence of the connection.
The starting point of Algebraic Statistics is quite different. Instead of considering
variables as continuous functions, Algebraic Statistics prefers to deal with a finite
(and possibly small) range of values for the variable X. So, Algebraic Statistics
emphasizes the discrete nature of the starting histogram, and tends to group together
values in wider ranges, instead of splitting them. A distribution over the variable X
is thus identified with a discrete function (to begin with, over the integers).
Algebraic Statistics is rarely interested in situations where just one random
variable is concerned.
Instead, networks containing several random variables are considered and some
relevant questions raised in this perspective are
• Are there connections between the two or more random variables of the
network?
• Which kind of connection is suggested by a set of data?
• Can one measure the complexity of the connections in a given network of
interacting variables?
Since, from the new point of view, we are interested in determining the relations
between discrete variables, in Algebraic Statistics a distribution over a set of
variables is usually represented by matrices, when two variables are involved, or
multidimensional matrices (i.e., tensors), as the number of variables increases.
It is a natural consequence of the previous discussion that while the main
mathematical tools for Classical Statistics are based on multivariate analysis and
measure theory, the underlying mathematical machinery for Algebraic Statistics is
principally based on the Linear and Multi-linear Algebra of tensors (over the
integers, at the start, but quickly one considers both real and complex tensors).

Relations Among Variables

Just to give an example, let us consider the behavior of a population after the
introduction of a new medicine.
Assume that a population is affected by a disease, which dangerously alters the
value of a glycemic indicator in the blood. This dangerous condition is partially
treated with the new drug. Assume that the purpose of the experiment is to detect
the existence of a substantial improvement in the health of the patients.
In Classical Statistics, one considers the distribution of the random variable X1 =
the value of the glycemic indicator over a selected population of patients before the
delivery of the drug, and the random variable X2 = the value of the glycemic
indicator of patients after the delivery of the drug. Both distributions are likely to be
represented by Gaussians, the first one centered at an abnormally high value
of the glycemic indicator, the second one centered at a (hopefully) lower value. The
comparison between the two distributions aims to detect if (and how far) the descent
of the recorded values of the glycemic indicator is statistically meaningful, i.e., if it
can be distinguished from the natural underlying ground noise. The celebrated
Student’s t-test is the world-accepted tool for comparing the means of two Gaussian
distributions and for determining the existence of a statistically significant response.
In many experiments, the response variable is binary or categorical with k levels,
leading to a 2 × 2, or a 2 × k, contingency table. Moreover, when there is more
than one response variable and/or other control variables, the resulting data are
summarized in a multiway contingency table, i.e., a tensor.
This structure may also come from the discretization of a continuous variable.
As an example, consider a population divided into two subsets, one of which is
treated with the drug while the other is treated with traditional methods. Then, the
values of the glycemic indicator are divided into classes (in the roughest case just
two classes, i.e., a threshold which separates two classes is established). After some
passage of time, one records the distribution of the population in the four resulting
categories (treated + under-threshold, treated + over-threshold, . . .) which deter-
mines a 2 × 2 matrix, whose properties encode the existence of a relation between
the new treatment and an improved normalization of the value of the glycemic
indicator (this is just to give an example: in the real world, a much more sophis-
ticated analysis is recommended!).
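To make the algebraic point of view concrete: for a 2 × 2 table, independence of the two binary variables amounts to the normalized matrix of counts having rank 1, i.e., vanishing determinant. A minimal sketch in Python, with invented counts:

```python
import numpy as np

# Invented counts: rows = treated / not treated,
# columns = under-threshold / over-threshold.
counts = np.array([[30.0, 10.0],
                   [15.0, 45.0]])

# Independence of the two binary variables corresponds to the
# normalized 2 x 2 matrix having rank 1, i.e. zero determinant.
P = counts / counts.sum()
print(np.linalg.det(P))  # far from 0 here: treatment and outcome look related
```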

Bernoulli Binary Models

Another celebrated model, which is different from the Gaussian distribution and is
often introduced at the beginning of a course in Statistics, is the so-called Bernoulli
model over one binary variable.
Assume we are given an object that can assume only two states. A coin, with the
two traditional states H (heads) and T (tails), is a good representation. One has to

bear in mind, however, that in the real world, binary objects usually correspond to
biased coins, i.e., coins for which the expected distribution over the two states is not
even.
If p is the probability of obtaining a result (say H) by throwing the coin, then one
can roughly estimate p by throwing the coin several times and determining the ratio

number of throws giving H


total number of throws

but this is usually considered too naïve. Instead, one divides the total set of throws
into several packages, each consisting of r throws, and determines for how many
packages, denoted q(t), one obtained H exactly t times. The value of the constant p
is thus determined by Bernoulli's formula:

q(t) = (r choose t) p^t (1 − p)^{r−t}.

By increasing the number of total throws (and thus increasing the number of
packages and the number of throws r in each package), the function q(t) tends to a
real function, which can be treated with the usual analytic methods.
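A minimal sketch (Python, with a hypothetical bias p and package size r) of the expected fraction of packages showing exactly t heads:

```python
from math import comb

p, r = 0.6, 10  # hypothetical bias and package size
q = [comb(r, t) * p**t * (1 - p)**(r - t) for t in range(r + 1)]

for t, qt in enumerate(q):
    print(f"expected fraction of packages with {t} heads: {qt:.4f}")
print(sum(q))  # 1.0: comparing these fractions with the observed q(t) estimates p
```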
Notice that in this way, at the end of the process, the discrete variable Coin is
substituted by a continuous variable q(t). Usually one even goes one step further,
by substituting the variable q with its logarithm, ending up with a linear description.
Algebraic Statistics is scarcely interested in knowing how a single given coin is
biased. Instead, the main goal of Algebraic Statistics is to understand the connec-
tions between the behavior of two coins. Or, better, the connections between the
behavior of a collection of coins.
Consequently, in Algebraic Statistics one defines a collection of variables, one
for each coin, and defines a distribution by counting the records in which the
variables X1, X2, . . . , Xn have a fixed combination of states. The distribution is
transformed into a tensor of type 2 × 2 × · · · × 2. All coins can be biased, with
different loads: this does not matter too much. In fact, the main questions that one
expects to solve are
• Are there connections between the outputs of two or more coins?
• Which kind of connection is suggested by the distribution?
• Can one divide the collection of coins into clusters, such that the behavior of
coins of the same cluster is similar?
Answers are expected from an analysis of the associated tensor, i.e., in the
framework of Multi-linear Algebra.
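For instance, a sketch (Python, with invented throws) of how the records for three coins are stored as a 2 × 2 × 2 tensor of counts:

```python
import numpy as np

# Invented simultaneous throws of three coins; 0 = T, 1 = H.
records = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 0), (1, 1, 0)]

T = np.zeros((2, 2, 2))
for state in records:
    T[state] += 1  # the distribution lives in a 2 x 2 x 2 tensor

print(T.sum(axis=(1, 2)))  # marginal counts (T, H) of the first coin
```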
The importance of the last question can be better understood if one replaces
coins with positions in a composite digital signal. Each position has, again, two
possible states, 0 and 1. If the signal is the result of the superposition of many

elementary signals, coming from different sources, and digits coming from the same
source behave similarly, then the division of the signal into clusters yields the
reconstruction of the original message that each source issued.

Splitting into Types

Of course, the separation of several phenomena that are mixed together in a given
distribution is also possible using methods of Classical Statistics.
In a famous analysis of 1894, the biologist Karl Pearson made a statistical study
of the shape of a population of crabs (see [1]). He constructed the histogram for the
ratio between the “forehead” breadth and the body length for 1000 crabs, sampled
in Naples, Italy by W. F. R. Weldon. The resulting approximating curve was quite
different from a Gaussian and presented a clear asymmetry around the average
value. The shape of the function suggested the existence of two distinct types of
crab, each determining its own Gaussian, that were mixed together in the observed
histogram. Pearson succeeded in separating the two Gaussians with the method of
moments. Roughly speaking, he introduced new statistical variables, induced by the
same collection of data, and separated the types by studying the interactions
between the Gaussians of these new variables.
This is the first instance of a computation which takes care of several parameters
of the population under analysis, though the variables are derived from the same set
of data. Understanding the interplay between the variables provides the funda-
mental step for a qualitative description of the population of crabs.
From the point of view of Algebraic Statistics, one could obtain the same
description of the two types which compose the population by adding variables
representing other ratios between lengths in the body of crabs, and analyzing the
resulting tensor.
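Pearson's original computation used moments; a modern numerical shortcut fits the two-component Gaussian mixture by maximum likelihood instead. A sketch, assuming scikit-learn is available, with synthetic data standing in for the crab ratios:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for the forehead/length ratios: two overlapping types.
data = np.concatenate([rng.normal(0.58, 0.02, 600),
                       rng.normal(0.64, 0.02, 400)]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(data)
print(gm.means_.ravel())  # close to the two types' means, 0.58 and 0.64
print(gm.weights_)        # close to the mixing proportions, 0.6 and 0.4
```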

Mixture Models

Summarizing, Algebraic Statistics becomes useful when the existence and the
nature of the relations between several random variables are explored.
We stress that knowing the shape of the interaction between random variables is
a central problem for the description of phenomena in Biology, Chemistry, Social
Sciences, etc. Models for the description of the interactions are often referred to as
Mixture Models. Thus, mixture models are a fundamental object of study in
Algebraic Statistics.
Perhaps the most famous and easily described mixture models are the Markov
chains, in which the set of variables is organized in a totally ordered chain, and the
behavior of the variable Xi is influenced only by the behavior of the variable Xi−1
(usually, this interaction is encoded by a given matrix).

Of course, much more complicated types of networks are expected when the
complexity of the collection of variables under analysis increases. So, when one
studies composite signals in the real world, or pieces of a DNA chain, or regions in
a neural tissue, higher level models are likely to be necessary for an accurate
description of the phenomenon.
One thus moves from the study of Markov chains

        M1         M2         M3
  X1 -------> X2 -------> X3 -------> X4

to the study of Markov trees

                  X1
              M1 /  \ M2
                /    \
              X2      X3
          M3 / \ M4  M5 / \ M6
            /   \      /   \
          X4     X5  X6     X7

and the study of neural nets

               M1
         X1 -------- X2
           \        /
         M3 \      / M2
             \    /
               X3
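To tie these pictures to the tensors discussed above, here is a minimal sketch (Python, with invented matrices) of how a short Markov chain determines a joint distribution tensor:

```python
import numpy as np

# Invented initial distribution for X1 and transition matrices M1, M2
# (entry [a, b] = probability of passing from state a to state b).
p1 = np.array([0.5, 0.5])
M1 = np.array([[0.9, 0.1], [0.2, 0.8]])
M2 = np.array([[0.7, 0.3], [0.4, 0.6]])

# Joint distribution of the chain X1 -> X2 -> X3:
# P(a, b, c) = p1[a] * M1[a, b] * M2[b, c], a 2 x 2 x 2 tensor.
P = np.einsum('a,ab,bc->abc', p1, M1, M2)
print(P.sum())  # 1.0
```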

In Classical Statistics, the structure of the connections among variables is often a
postulate. In Algebraic Statistics, determining the combinatorics and the topology
of the network is a fundamental task. On the other hand, the time-dependent
of the network is a fundamental task. On the other hand, the time-dependent
activating functions that transfer information from one variable to the next ones,
deeply studied by Classical Statistics, are of no immediate interest for Algebraic
Statistics which, at first, considers steady states of the configuration of variables.
The Multi-linear Algebra behind the aforementioned models is not completely
understood. It requires a deep analysis of subsets of linear spaces described by
parametric or implicit polynomial equations. This is the reason why, at a certain
point, methods of Algebraic Geometry are invoked to push the analysis further.

Conclusion

The way we think about Algebraic Statistics focuses on aspects of the theory of
random variables which are different from the targets of Classical Statistics. This is
reflected in the point of view introduced in the book. Our general setting differs
from the classical one and is closer to the one implicitly introduced in the books of
Pachter and Sturmfels [2] and Sullivant [3]. Our aim is not to create a new
formulation of the whole statistical theory, but only to present a natural algebraic way
in which Statistics can handle problems related to mixture models.
The discipline is currently living in a rapidly expanding network of new insights
and new areas of application. Our knowledge of what we can do in this area is
constantly increasing and it is reasonable to hope that many of the problems
introduced in this book will soon be solved or, if they cannot be solved completely,
then they will at least be better understood. We feel that the time is right to provide
a systematic foundation, with special attention to the application of tensor theory,
for a field that promises to act as a stimulus for mathematical research in Statistics,
and also as a source of suggestions for further developments in Multi-linear Algebra
and Algebraic Geometry.

Siena, Italy                                        Cristiano Bocci
May 2019                                            Luca Chiantini

Acknowledgements The authors want to warmly thank Fabio Rapallo, who made several fruitful
remarks and suggestions to improve the exposition, especially regarding the connections with
Classical Statistics.

References

1. Pearson K.: Contributions to the mathematical theory of evolution. Phil. Trans. Roy. Soc.
London A, 185, 71–110 (1894)
2. Pachter, L., Sturmfels, B.: Algebraic Statistics for Computational Biology. Cambridge
University Press, New York (2005)
3. Sullivant, S.: Algebraic Statistics. Graduate Studies in Mathematics, vol. 194, AMS,
Providence (2018)
Contents

Part I Algebraic Statistics


1 Systems of Random Variables and Distributions . . . . . . . . . . . . . . . 3
1.1 Systems of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Measurements on a Distribution . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Basic Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1 Basic Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Booleanization and Logic Connectors . . . . . . . . . . . . . . . . . . . 19
2.3 Independence Connections and Marginalization . . . . . . . . . . . . 21
2.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3 Statistical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Independence Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Connections and Parametric Models . . . . . . . . . . . . . . . . . . . . . 38
3.4 Toric Models and Exponential Matrices . . . . . . . . . . . . . . . . . . 43
4 Complex Projective Algebraic Statistics . . . . . . . . . . . . . . . . . . . . . 47
4.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Projective Algebraic Models . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Parametric Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5 Conditional Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.1 Models of Conditional Independence . . . . . . . . . . . . . . . . . . . . 58
5.2 Markov Chains and Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3 Hidden Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77


Part II Multi-linear Algebra


6 Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.1 Basic Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.2 The Tensor Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.3 Rank of Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.4 Tensors of Rank 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7 Symmetric Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.1 Generalities and Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.2 The Rank of a Symmetric Tensor . . . . . . . . . . . . . . . . . . . . . . 106
7.3 Symmetric Tensors and Polynomials . . . . . . . . . . . . . . . . . . . . 110
7.4 The Complexity of Polynomials . . . . . . . . . . . . . . . . . . . . . . . . 113
7.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8 Marginalization and Flattenings . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
8.1 Marginalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
8.2 Contractions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
8.3 Scan and Flattening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
8.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

Part III Commutative Algebra and Algebraic Geometry


9 Elements of Projective Algebraic Geometry . . . . . . . . . . . . . . . . . . 133
9.1 Projective Varieties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
9.1.1 Associated Ideals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
9.1.2 Topological Properties of Projective Varieties . . . . . . . 141
9.2 Multiprojective Varieties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
9.3 Projective and Multiprojective Maps . . . . . . . . . . . . . . . . . . . . 148
9.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
10 Projective Maps and Chow's Theorem . . . . . . . . . . . . . . . . . . . . . 155
10.1 Linear Maps and Change of Coordinates . . . . . . . . . . . . . . . . . 155
10.2 Elimination Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
10.3 Forgetting a Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
10.4 Linear Projective and Multiprojective Maps . . . . . . . . . . . . . . . 163
10.5 The Veronese Map and the Segre Map . . . . . . . . . . . . . . . . . . 164
10.6 Chow's Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
10.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

11 Dimension Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179


11.1 Complements on Irreducible Varieties . . . . . . . . . . . . . . . . . . . 179
11.2 Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
11.3 General Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
11.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
12 Secant Varieties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
12.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
12.2 Methods for Identifiability . . . . . . . . . . . . . . . . . . . . . . . . . . 203
12.2.1 Tangent Spaces and Terracini's Lemma . . . . . . . . . . . . . . . . . 203
12.2.2 Inverse Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
12.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
13 Groebner Bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
13.1 Monomial Orderings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
13.2 Monomial Ideals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
13.3 Groebner Basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
13.4 Buchberger’s Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
13.5 Groebner Bases and Elimination Theory . . . . . . . . . . . . . . . . . 228
13.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
About the Authors

Prof. Cristiano Bocci is Assistant Professor of Geometry at the University of
Siena (Italy). His research concerns Algebraic Geometry, Commutative Algebra,
and their applications. In particular, his current interests are focused on symbolic
powers of ideals, Hadamard product of varieties, and the study of secant spaces. He
also works in two interdisciplinary teams in the fields of Electronic Measurements
and Sound Synthesis.

Prof. Luca Chiantini is Full Professor of Geometry at the University of Siena
(Italy). His research interests focus mainly on Algebraic Geometry and Multi-linear
Algebra, and include the theory of vector bundles on varieties and the study of
secant spaces, which are the geometric counterpart of the theory of tensor ranks. In
particular, he recently studied the relations between Multi-linear Algebra and the
theory of finite sets in projective spaces.

Part I
Algebraic Statistics
Chapter 1
Systems of Random Variables
and Distributions

1.1 Systems of Random Variables

This section contains the basic definitions with which we will construct our statistical
theory.
It is important to point out right away that in the field of Algebraic Statistics, a
still rapidly developing area of study, the basic definitions are not yet standardized.
Therefore, the definitions which we shall use in this text can differ significantly (more
in form than in substance) from those of other texts.
Definition 1.1.1 A random variable is a variable x taking values in a finite non-
empty set of symbols, denoted A(x). The set A(x) is called the alphabet of x or the
set of states of x. We will say that every element of A(x) is a state of the variable x.
A system of random variables (or random system) S is a finite set of random
variables.
The condition of finiteness, required both for the alphabet of a random variable
and the number of variables of a system, is typical of Algebraic Statistics. In other
statistical situations this hypothesis is often not present.
Definition 1.1.2 A subsystem of a system S of random variables is a system defined
by a subset S′ ⊂ S.

Example 1.1.3 The simplest examples of a system of random variables are those
containing a single random variable. A typical example is obtained by thinking
of a die x as a random variable, i.e. as the unique element of S. Its alphabet is
A(x) = {1, 2, 3, 4, 5, 6}.
Another familiar example comes by thinking of the only element of S as a coin c
with alphabet A(c) = {H, T } (heads and tails).

Example 1.1.4 On internet sites about soccer betting one finds systems in which
each random variable has three states. More precisely the set S of random variables
are (say) all the professional soccer games in a given country. For each random
variable x (i.e. game), its alphabet is A(x) = {1, 2, T }. The random variable takes
value “1” if the game was won by the home team, value “2” if the game was won by
the visiting team and value “T” if the game was a tie.

Example 1.1.5 (a) We can, similar to Example 1.1.3, construct a system S with two
random variables, namely with two dice {x1 , x2 }, both having alphabet A(xi ) =
{1, 2, 3, 4, 5, 6}.
(b) An example of another system of random variables T , closely related to
the previous one but different, is given by taking a single random variable
as the ordered pair of dice x = (x1 , x2 ) and, as alphabet A(x), all possible
values obtained by throwing the dice simultaneously: {(1, 1), . . . (1, 6), . . . ,
(6, 1), (6, 2), . . . , (6, 6)}.
(c) Another example W , still related to the two above (but different), is given by
taking as system the unique random variable the set consisting of two dice
z = {x1 , x2 } and as alphabet, A(z), the sum of the values of the two dice after
throwing them simultaneously: A(z) = {2, 3, 4, . . . , 12}.
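A throwaway sketch of the three systems as Python dictionaries mapping each random variable to its alphabet (the names S, T, W mirror the example):

```python
from itertools import product

# (a) Two random variables (two dice), each with its own alphabet.
S = {"x1": list(range(1, 7)), "x2": list(range(1, 7))}

# (b) One random variable: the ordered pair of dice.
T = {"x": list(product(range(1, 7), repeat=2))}

# (c) One random variable: the sum of the two dice.
W = {"z": list(range(2, 13))}

print(len(T["x"]), len(W["z"]))  # 36 11
```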

Remark 1.1.6 The random variables of the systems S, T and W might seem, at first
glance, to be the same, but it is important to make clear that they are very different.
In (a) there are two random variables while in (b) and (c) there is only one. Also
notice that in T we have chosen an ordering of the two dice, while in W the random
variable is an unordered set of two dice. With example (a) there is nothing stopping
us from throwing the die x1 , say, twenty times and the die x2 ten times. However, in
both (b) and (c) the dice are each thrown the same number of times.

Example 1.1.7 There are many naturally occurring examples of systems with many
random variables. In fact, some of the most significant ones come from applications
in Economics and Biology and have an astronomical number of variables.
For example, in Economics and in market analysis, there are systems with one
random variable for each company which trades in a particular market. It is easy to
see that, in this case, we can have thousands, even tens of thousands, of variables.
In Biology, very important examples come from studying systems in which the
random variables represent hundreds (or thousands) of positions in the DNA sequence
of one or several species. The alphabet of each variable consists of the four basic
ingredients of DNA: Adenine, Cytosine, Guanine and Thymine. As a shorthand
notation, one usually denotes the alphabet of such random variables as {A, C, G, T }.
In this book, we will refer to the systems arising from DNA sequences as DNA-
systems.

Example 1.1.8 For cultural reasons (one of the authors was born and lives in Siena!),
we will have several examples in the text of systems describing probabilistic events
related to the famous and colourful Sienese horse race called the Palio di Siena.
Horses which run in the Palio represent the various medieval neighbourhoods of the
city (called contrade) and the Palio is a substitute for the deadly feuds which existed
between the various sections of the city.

The names of the neighbourhoods are listed below with a shorthand letter abbre-
viation for each of them:

Aquila (eagle) (symbol: A) Bruco (caterpillar) (symbol: B)


Chiocciola (snail) (symbol: H) Civetta (little owl) (symbol: C)
Drago (dragon) (symbol: D) Giraffa (giraffe) (symbol: G)
Istrice (crested porcupine) (symbol: I) Leocorno (unicorn) (symbol: E)
Lupa (she-wolf) (symbol: L) Nicchio (conch) (symbol: N)
Oca (goose) (symbol: O) Onda (wave) (symbol: Q)
Pantera (panther) (symbol: P) Selva (forest) (symbol: S)
Tartuca (tortoise) (symbol: R) Torre (tower) (symbol: T)
Valdimontone (valley of the ram) (symbol: M).

Definition 1.1.9 A random variable x of a system S is called a boolean variable
if its alphabet has cardinality 2. A system is boolean if all its random variables are
boolean.
Remark 1.1.10 The states of a boolean random variable can be thought of as the
pair of conditions (true, false). As a matter of fact the standard alphabet of a boolean
random variable can be thought of as the elements of the finite field Z2 , where 1 =
true and 0 = false (this is our convention; be careful: in some texts this notation is
reversed!). Other alphabets, such as heads-tails or even-odd, are also often used for
the alphabets of boolean random variables.
Definition 1.1.11 A map or morphism between systems S and T of random variables
is a pair f = (F, G) where F is a function F : S → T and, for all x ∈ S, G defines
a function between alphabets G(x) : A(x) → A(F(x)).
The terminology used for functions can be transferred to maps of system of random
variables. Thus we can have injective maps (in which case both F and each of the
G(x) are injective), surjective maps (in which case both F and each of the G(x)
are surjective), and isomorphisms (in which case both F and all the maps G(x) are 1–1
correspondences). With respect to these definitions, the systems of random variables
form a category.
Example 1.1.12 If S′ is a subsystem of S, the inclusion function S′ → S defines, in
an obvious way, an injective map of systems. In this case, the maps between alphabets
are always represented by the identity map.
Example 1.1.13 Let S = {x} be the system defined by a die as in Example 1.1.3
with alphabet {1, 2, 3, 4, 5, 6}. Let T be the system defined by T = {y}, with
A(y) = {E, O} (E = even, O = odd). The function F : S → T , defined by F(x) =
y, and the function G : A(x) → A(y) defined by G(1) = G(3) = G(5) = O and
G(2) = G(4) = G(6) = E, define a surjective map from S to T of systems of ran-
dom variables.
The following definition will be of fundamental importance for the study of the
relationship between systems of random variables.

Definition 1.1.14 The (total) correlation of a system S of random variables
{x1 , . . . , xn } is the system S̄ = {x}, with a unique random variable x = (x1 , . . . , xn )
(the cartesian product of the elements x1 , . . . , xn of S). Its alphabet is given by
A(x1 ) × · · · × A(xn ), the cartesian product of the alphabets of the individual ran-
dom variables.

Remark 1.1.15 It is very important to notice that the definition of the total correlation
uses the concept of cartesian product. Moreover the concept of cartesian product
requires that we fix an ordering of the variables in S.
Thus, the total correlation of a system is not uniquely determined, but it changes
as the chosen ordering of the random variables changes.
It is easy to see, however, that all the possible total correlations of the system S
are isomorphic.

Example 1.1.16 If S is a system with two coins c1 , c2 , each having alphabet {H, T },
then the only random variable in its total correlation has an alphabet with four
elements {(T, T ), (T, H ), (H, T ), (H, H )}, i.e. we have to distinguish between the
states (H, T ) and (T, H ). This is how the ordering of the coins enters into the
definition of the random variable (c1 , c2 ) of the total correlation.

Example 1.1.17 Let S be the system of random variables consisting of two dice, D1
and D2 each having alphabet the set {1, 2, 3, 4, 5, 6}. The total correlation of this
system, S̄, is the system with a unique random variable D = (D1 , D2 ) and alphabet
the set {(i, j) | 1 ≤ i, j ≤ 6}. So, the alphabet consists of 36 elements.
Now let T be the system whose unique random variable is the set x = {D1 , D2 }
and whose alphabet consists of the eleven numbers {2, 3, . . . , 11, 12}.
We can consider the surjective morphism of systems φ : S̄ → T which takes
the unique random variable of S̄ to the unique random variable of T and takes the
element (i, j) of the alphabet of the unique variable of S̄ to i + j in the alphabet
of the unique variable of T .
Undoubtedly this morphism is familiar to us all!
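A sketch of this morphism on states (plain Python):

```python
from itertools import product

# Alphabet of the unique variable of the total correlation: ordered pairs.
pairs = list(product(range(1, 7), repeat=2))  # 36 states

# The morphism phi sends the state (i, j) to the state i + j of T.
phi = {(i, j): i + j for (i, j) in pairs}
print(sorted(set(phi.values())))  # [2, 3, ..., 12]: the 11 states of T
```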

Clearly, if S is a system containing a single random variable, then S coincides
with its total correlation.

Definition 1.1.18 Let f : S → T be a map of systems of random variables, defined
by F : S → T and G(x) : A(x) → A(F(x)) for all random variables x ∈ S, and
suppose that F is bijective, i.e. S and T have the same number of random variables.
Then f defines, in a natural way, a map f̄ : S̄ → T̄ between the total corre-
lations as follows: for each state s = (s1 , . . . , sn ) of the unique variable (x1 , . . . , xn )
of S̄,

f̄(s) = (G(x1 )(s1 ), . . . , G(xn )(sn )).

1.2 Distributions

One of the basic notions in the study of systems of random variables is the idea of a
distribution. Making the definition of a distribution precise will permit us to explain
clearly the idea of an observation on the random variables of a system. This latter
concept is extremely useful for the description of real phenomena.
Definition 1.2.1 Let K be any set. A K -distribution on a system S with random
variables x1 , . . . , xn , is a set of functions D = {D1 , . . . , Dn }, where for 1 ≤ i ≤ n,
Di is a function from A(xi ) to K .

Remark 1.2.2 In most concrete examples, K will be a numerical set, i.e. some subset
of C (the complex numbers).

The usual use of the idea of a distribution is to associate to each state of a variable
xi in the system S, the number of times (or the percentage of times) such a state is
observed in a sequence of observations.
Example 1.2.3 Let S be the system having as unique random variable a coin c, with
alphabet A(c) = {T, H } (the coin need not be fair!).
Suppose we throw the coin n times and observe the state T exactly dT times and
the state H exactly d H times (dT + d H = n). We can use those observations to get an
N-distribution (N is the set of natural numbers), denoted Dc , where Dc : {T, H } → N
by
Dc (T ) = dT , Dc (H ) = d H .

One can identify this distribution with the element (dT , d H ) ∈ N2 .


We can define a different distribution, D′c , on S (using the same series of observa-
tions) as follows:

D′c : {T, H } → Q (the rational numbers),

where D′c (T ) = dT /n and D′c (H ) = d H /n.

Example 1.2.4 Now consider the system S with two coins c1 , c2 , and with alphabets
A(ci ) = {T, H }.
Again, suppose we simultaneously throw both coins n times and observe that the
first coin comes up with state T exactly d1 times and with state H exactly e1 times,
while the second coin comes up T exactly d2 times and comes up H exactly e2 times.
From these observations we can define an N-distribution, D = (D1 , D2 ), on S
defined by the functions

D1 : {T, H } → N, D1 (T ) = d1 , D1 (H ) = e1 ,
D2 : {T, H } → N, D2 (T ) = d2 , D2 (H ) = e2

It is also possible to identify this distribution with the element

((d1 , e1 ), (d2 , e2 )) ∈ N2 × N2 .

It is also possible to use this series of observations to define a distribution on the
total correlation S̄.
That system has a unique variable c = c1 × c2 = (c1 , c2 ) with alphabet A(c) =
{T T, T H, H T, H H }. The N-distribution on S̄ we have in mind associates to each
of the four states how often that state appeared in the series of throws.
Notice that if we only had the first distribution we could not calculate the second
one, since we would not have known (solely from the functions D1 and D2 ) how
often each of the states in the second system was observed.
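The point is easy to check numerically; a sketch with invented throws:

```python
from collections import Counter

# Invented outcomes of six simultaneous throws of the two coins.
throws = [("T", "H"), ("H", "H"), ("T", "T"),
          ("H", "T"), ("T", "H"), ("H", "H")]

joint = Counter(throws)             # distribution on the total correlation
D1 = Counter(a for a, b in throws)  # distribution of the first coin
D2 = Counter(b for a, b in throws)  # distribution of the second coin

print(joint)   # counts of TT, TH, HT, HH
print(D1, D2)  # the marginals alone do not determine the joint counts
```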

Definition 1.2.5 The set of K -distributions of a system S of random variables forms


the space of distributions D K (S).

Example 1.2.6 Consider the DNA-system S with random variables precisely 100
fixed positions (or sites) p1 , . . . , p100 on the DNA strand of a given organism. As
usual, each variable has alphabet {A, C, G, T }. Since each alphabet has exactly four
members, the space of Z-distributions on S is D(S) = Z^4 × · · · × Z^4 (100 times) =
Z^400 .
Suppose we now collect 1,000 organisms and observe which DNA component
occurs in site i. With the data so obtained we can construct a Z-distribution D =
{D1 , . . . , D100 } on S where Di associates to each of the members of the alphabet
A( pi ) = {A, C, G, T } the number of occurrences of the corresponding component
in the i-th position. Note that for each Di we have

Di (A) + Di (C) + Di (G) + Di (T ) = 1,000.

Remark 1.2.7 Suppose that S is a system with random variables x1 , . . . , xn and that
the cardinality of each alphabet A(xi ) is exactly ai . As we have said before, ai is
simply the number of states that the random variable xi can assume.
With this notation, the K -distributions on S can be seen as points in the space

K^{a1} × · · · × K^{an} .

We will often identify D K (S) with this space.


It is also certainly true that K^{a1} × · · · × K^{an} = K^{a1 +···+an} , and so it might seem
reasonable to say that this last is the set of distributions on S. However, since there
are so many ways to make this last identification, we could easily lose track of what
a particular distribution did on a member of the alphabet of one of the variables in S.

Remark 1.2.8 If S is a system with two variables x1 , x2 , whose alphabets have cardi-
nality (respectively) a1 and a2 , then the unique random variable in the total correlation
S̄ has a1 ·a2 states. Hence, as we said above, the space of K -distributions on S̄
could be identified with K^{a1 a2} .
Since we also wish to remember that the unique variable of S̄ arises as the
cartesian product of the variables of S, it is even more convenient to think of
D K (S̄) = K^{a1 a2} as the set of a1 × a2 matrices with coefficients in K .
Thus, for a distribution D on S̄, we denote by Di j the value associated to the
state (i, j) of the unique variable, which corresponds to the states i of x1 and j of x2 .

For systems with a larger number of variables, we need to use multidimensional
matrices, commonly called tensors (see Definition 6.1.3).
The study of tensors is thus strongly connected to the study of systems of ran-
dom variables when we want to fix relationships among the variables (i.e. look at
distributions on the system). In fact, the algebra (and geometry) of spaces of tensors
represents the point of connection between the study of statistics on discrete sets and
other disciplines, such as Algebraic Geometry. The exploration of this connection is
our main goal in this book. We will take up that connection in another chapter.
Definition 1.2.9 Let S and T be two systems of random variables and f = (F, G) :
S → T a map of systems where F is a surjection. Let D be a distribution on S, and
D′ a distribution on T .
The induced distribution f_*D on T (called the image distribution) is defined as
follows: for t a state of the variable y ∈ T :

( f_*D ) y (t) = Σ_{x∈F^{−1}(y), s∈G(x)^{−1}(t)} Dx (s).

The induced distribution f^*D′ on S (called the preimage distribution) is defined as
follows: for s a state of the variable x in S:

( f^*D′ )x (s) = D′_{F(x)} (G(x)(s)).
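A sketch of the image distribution for the die-to-parity map of Example 1.1.13 (plain Python, invented counts):

```python
# Invented counts of 20 throws of a die: D(s) for s = 1..6.
D = {1: 4, 2: 3, 3: 2, 4: 5, 5: 1, 6: 5}

# G sends each face to its parity, as in Example 1.1.13.
G = {s: "O" if s % 2 else "E" for s in D}

# Image distribution on {E, O}: sum D over the preimage of each state.
image = {"E": 0, "O": 0}
for s, count in D.items():
    image[G[s]] += count

print(image)  # {'E': 13, 'O': 7}
```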

We want to emphasize that distributions on a system of random variables should,
from a certain point of view, be considered as data on a problem: data from which one
hopes to deduce other distributions or infer certain physical, biological or economic
facts about the system. We illustrate this idea with the following example.

Example 1.2.10 In the city of Siena (Italy) two spectacular horse races have been
run every year since the seventeenth century, with a few interruptions caused by the
World Wars. Each race is called a Palio, and the Palio takes place in the main square
of the city. There have also been some extraordinary Palios run from time to time.
From the last interruption, which ended in 1945, up to now (2014), a
total number of 152 Palios have taken place. Since the main square is large, but not
enormous, not every contrada can participate in every Palio. There is a method, partly
based on chance, that decides whether or not a contrada can participate in a particular
Palio.

Table 1.1 Participation of the contrade at the 152 Palii (up to 2014)
x Name Dx (1) Dx (0) x Name Dx (1) Dx (0)
A Aquila 88 64 B Bruco 92 60
H Chiocciola 84 68 C Civetta 90 62
D Drago 95 57 G Giraffa 89 63
I Istrice 84 68 E Leocorno 99 52
L Lupa 89 63 N Nicchio 84 68
O Oca 87 65 Q Onda 84 68
P Pantera 96 56 S Selva 89 63
R Tartuca 91 61 T Torre 90 62
M Valdimontone 89 63

Let’s build a system with 17 boolean random variables, one for each contrada.
For each variable we consider the alphabet {0, 1}. The space of Z-distributions of
this system is Z^2 × · · · × Z^2 = Z^34 .
Let us define a distribution by indicating, for each contrada x, Dx (1) = number
of Palios where contrada x took part and Dx (0) = number of Palios where contrada
x did not participate. Thus we must always have Dx (0) + Dx (1) = 152. The data
are given in Table 1.1.
We see that the Leocorno (unicorn) contrada participated in the most Palios while
the contrada Istrice (crested porcupine), Nicchio (conch), Onda (wave), Chiocciola
(snail) participated in the fewest.
On the same system, we can consider another distribution E, where E x (1) =
number of Palios that contrada x won and E x (0) = number of Palios that contrada x
lost (non-participation is considered a loss). The Win-Loss table is given in Table 1.2.
From the two tables we see that more participation in the Palios does not neces-
sarily imply more victories.

Table 1.2 Win-Loss table of contrade at the 152 Palii (up to 2014)
x Name E x (1) E x (0) x Name E x (1) E x (0)
A Aquila 8 144 B Bruco 5 147
H Chiocciola 9 143 C Civetta 8 144
D Drago 11 141 G Giraffa 12 140
I Istrice 8 144 E Leocorno 9 143
L Lupa 5 147 N Nicchio 9 143
O Oca 14 138 Q Onda 9 143
P Pantera 8 144 S Selva 15 137
R Tartuca 10 142 T Torre 3 149
M Valdimontone 9 143

1.3 Measurements on a Distribution

We now introduce the concepts of sampling and scaling on a distribution for a system
of random variables.

Definition 1.3.1 Let K be a numerical set and let D = (D1 , . . . , Dn ) be a distribu-
tion on the system of random variables S = {x1 , . . . , xn }. The number

c_D (xi ) = Σ_{s∈A(xi )} Di (s)

is called the sampling of the variable xi in D. We will say that D has constant
sampling if all variables in S have the same sampling in D.
A K -distribution D on S is called probabilistic if each xi ∈ S has sampling equal
to 1.

Remark 1.3.2 Let S be a system with random variables {x1 , . . . , xn } and let D =
(D1 , . . . , Dn ) be a K -distribution on S, where K is a numerical field.
If every variable xi has sampling c_D (xi ) ≠ 0, we can obtain from D an associated
probabilistic distribution D̃ = ( D̃1 , . . . , D̃n ) defined as follows:

for all i and for all states s ∈ A(xi ), set D̃i (s) = Di (s) / c_D (xi ).
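A sketch of the two notions in plain Python (counts invented):

```python
def sampling(Di):
    """The sampling c_D(x_i): the sum of D_i over the alphabet of x_i."""
    return sum(Di.values())

def probabilistic(Di):
    """The associated probabilistic distribution (needs sampling != 0)."""
    c = sampling(Di)
    return {s: v / c for s, v in Di.items()}

D_coin = {"T": 12, "H": 8}    # invented counts for a (biased) coin
print(sampling(D_coin))       # 20
print(probabilistic(D_coin))  # {'T': 0.6, 'H': 0.4}
```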

Remark 1.3.3 In Example 1.2.3, the distribution D′c is exactly the probabilistic dis-
tribution associated to Dc (seen as a Q-distribution).

Convention. To simplify the notation in what follows and since we will always be
thinking of the set K as some set of numbers, usually clear from the context, we
won’t mention K again but will speak simply of a distribution on a system S of
random variables.
Warning. We want to remind the reader again that the basic notation in Algebraic
Statistics is far from being standardized. In particular, the notation for a distribution
is quite varied in the literature and in other texts.
E.g. if si j is the j-th state of the i-th variable xi of the system S, and D is a
distribution on S, we will denote this by writing Di (si j ) as the value of D on that
state.
You will also find this number Di (si j ) denoted by D_{xi = si j} .

Example 1.3.4 Suppose we have a tennis tournament with 8 players where a player
is eliminated as soon as that player loses a match. So, in the first set of matches four
players are eliminated and in the second two more are eliminated and then we have
the final match between the remaining two players.

We can associate to this tournament a system with 8 boolean random variables,
one variable for each player. We denote by D the distribution that, for each player
xi , is defined as:

Di (0) = number of matches lost;

Di (1) = number of matches won.

Clearly the sampling c(xi ) of every player xi represents the number of matches
played. For example, c(xi ) = 3 if and only if xi is a finalist, while c(xi ) = 1 for the
four players eliminated at the end of the first match. Hence D is not a distribution
with constant sampling.
Notice that this distribution doesn’t have any variable with sampling equal to 0
and hence there is an associated probabilistic distribution D̃, which represents the
statistics of victories. For example, for the winner xk , one has

D̃k (0) = 0, D̃k (1) = 1.

Instead, for a player x j eliminated in the semi-final,

D̃ j (0) = D̃ j (1) = 1/2.
While for a player xi eliminated after the first round we have

D̃i (0) = 1, D̃i (1) = 0.

The concept of the probabilistic distribution associated to a distribution D is quite
important in texts concerned with the analytic Theory of Probability. This is true to
such an extent that those texts work directly only with probabilistic distributions.
This is not the path we have chosen in this text. For us the concept that will be
more important than a probabilistic distribution is the concept of scaling. This latter
idea is more useful in connecting the space of distributions with the usual spaces in
which Algebraic Geometry is done.

Definition 1.3.5 Let D = (D1 , . . . , Dn ) be a distribution on a system S with random
variables {x1 , . . . , xn }. A distribution D′ = (D′1 , . . . , D′n ) is a scaling of D if, for any
x = xi ∈ S, there exists a constant λx ∈ K \ {0} such that, for all states s ∈ A(x),
D′x (s) = λx Dx (s).

Remark 1.3.6 Notice that the probabilistic distribution, D̃, associated to a distribu-
tion D is an example of a scaling of D, where λx = 1/c_D (x).
Note moreover that, given a scaling D′ of D, if D and D′ have the same sampling,
then they must coincide.

Remark 1.3.7 In the next chapters we will see that scaling doesn’t substantially
change a distribution. Using a projectivization method, we will consider two distri-
butions “equal” if they differ only by a scaling.
Proposition 1.3.8 Let f : S → T be a map of systems which is a bijection on the
sets of variables. Let D be a distribution on S and D′ a scaling of D. Then f_*D′ is a
scaling of f_*D.
Proof Let y be a variable of T and let t ∈ A(y). Since f is a bijection there is a
unique x ∈ S for which F(x) = y. Then by definition we have

( f_*D′ ) y (t) = Σ_{s∈G(x)^{−1}(t)} D′x (s) = Σ_{s∈G(x)^{−1}(t)} λx Dx (s) = λx ( f_*D ) y (t). □

1.4 Exercises

Exercise 1 Let us consider the random system associated with the tennis tourna-
ment, see Example 1.3.4.
Compute the probabilistic distribution for the finalist who did not win the tourna-
ment.
Compute the probabilistic distribution for a tournament with 16 participants.
Exercise 2 Let S be a random system with variables x1 , . . . , xn and assume that all
the variables have the same alphabet A = {s1 , . . . , sm }. Then one can create the dual
system S′ by taking s1 , . . . , sm as variables, each si with alphabet X = {x1 , . . . , xn }.
Determine the relation between the dimension of the spaces of K -distributions of
S and S′.
Exercise 3 Let S be a random system and let S′ be a subsystem of S.
Determine the relation between the spaces of K -distributions of the correlations
of S and S′.
Exercise 4 Let f : S → S′ be a surjective map of random systems.
Prove that if a distribution D on S′ has constant sampling, then the same is true
for f^*D.
Exercise 5 One can define a partial correlation over a system S, by connecting only
some of the variables.
For instance, if S has variables x1 , . . . , xn and m < n, one can consider the
partial correlation on the variables x1 , . . . , xm as a system T whose variables are
Y, xm+1 , . . . xn , where Y stands for the variable x1 × · · · × xm , with alphabet the
product A(x1 ) × · · · × A(xm ).
If S has variables c1 , c2 , c3 , all of them with alphabet {T, H } (see Example 1.1.3),
determine the space of K -distributions of the partial correlation T with random
variables c1 × c2 and c3 .
Chapter 2
Basic Statistics

In this chapter, we focus on some basic examples in Probability and Statistics. We
phrase these concepts using the language and definitions we have given in the previous
chapter.

2.1 Basic Probability

Given a system S of random variables, we introduce the concept of the probability
of each of the states s1 , . . . , sn of a random variable x in S.
The first thing to keep in mind is that if we have n states then, in general, there is
no reason to believe, a priori, that each of the states has probability 1/n. Usually, in
fact, we do not know what the probability for each of the states is when we begin to
study a system of random variables.
Consider, for example, the case in which S has one random variable x which
represents one running of the Palio di Siena (see Example 1.1.8). Then x has ten
states, one for each of the ten contrade which will participate in that Palio. The
probability that a particular contrada will win is not equal for each of the ten contrade
participating in the Palio, but will depend on (among other things) the strength of the
contrada’s horse, the ability of the jockey, the historical deviousness of the contrada,
etc.

Example 2.1.1 Let’s consider the data associated with the Italian Series A Soccer
games of 2005/2006. There were 380 games played that season with 176 games won
by the home team, 108 games which resulted in a tie and the remaining 96 games
won by the visiting team.
Suppose now we want to construct a random system S with a unique random
variable x as in Example 1.1.4. Recall that the alphabet for x is A(x) = {1, 2, T }
where 1 means the game was won by the home team, 2 means the game was won by
the visiting team and T means the game ended in a tie. The 2005/2006 Season offers
us a distribution D such that Dx (1) = 176, Dx (T ) = 108, Dx (2) = 96.

The normalization of D, which we will denote D̃, gives us some reasonable probabilities for the new season games. Namely:

D̃x(1) = 176/380 ≈ 46.3%,   D̃x(T) = 108/380 ≈ 28.4%,   D̃x(2) = 96/380 ≈ 25.3%.

So, by referring to a previous season, we have gathered some insight into the probabilities to assign to the various states.
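In computational terms, the normalization above is a one-line operation. The following is a minimal Python sketch using the data of Example 2.1.1 (the dictionary encoding of D is our own choice, not a notation of the book):

D = {"1": 176, "T": 108, "2": 96}      # counts from the 2005/2006 season

total = sum(D.values())                 # the sampling of the variable x
D_tilde = {s: v / total for s, v in D.items()}
print(D_tilde)   # {'1': 0.4631..., 'T': 0.2842..., '2': 0.2526...}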
Before we continue our discussion of the probability of the states of a random
variable in a random system, we want to introduce some more terminology for
distributions.
Example 2.1.2 Let S be a system of random variables.
The equidistribution, denoted E, on S is the distribution that associates to every
state of every random variable in the system, the value 1.
Now suppose the random variables of S are denoted by x1 , . . . , xr and that the
states of the variable xi are {si1 , . . . , sini }, then the associated probabilistic distribu-
tion of E, denoted Ẽ (see Remark 1.3.2) is defined by

p^E_{xi}(sij) = 1/ni   for every j, 1 ≤ j ≤ ni.

Notice that this distribution is the one which gives us equal probability for all
states in the alphabet of a fixed random variable.
This same probability distribution is clearly obtained if we start with the distribution cE, c ∈ R, c ≠ 0, which associates to each state of each variable in the system the value c: its normalization gives, on every state sij, the value c/(c ni) = 1/ni.
Clearly, the equidistribution has constant sampling if and only if all variables have
the same number of states.
Remark 2.1.3 There is a famous, but imprecise, formula for calculating the proba-
bility for something to occur. It is usually written as

(positive cases) / (possible cases).

The problem with this formula is best illustrated by the following example. Assume that a person likes to play tennis very much. If he were asked: if you had a chance to play Roger Federer for one point, what is your probability of winning that point? Using the formula above, he would reply that there are two possible outcomes, and one of those has him winning the point, so the probability is 0.5 (but anyone who has seen this person play tennis would appreciate the absurdity of that reply!).
The problem with the formula is that it is only a reasonable formula if all outcomes
are equally probable, i.e. if we have an equidistribution on the possible states—
something far from the case in the example above.

Example 2.1.4 The Palio of Siena usually takes place twice a year, in July and
August. Only 10 of the 17 contrade can participate in a Palio. How are the ten
chosen?
First, the 7 contrade which did not participate in the previous year’s Palio are
automatically chosen for this year’s Palio. The remaining three places are chosen,
by a draw, from among the 10 contrade which did participate in the previous year’s
Palio.
What is the probability that a contrada x, which participated in the previous year’s
Palio, will participate in the current year’s Palio?
To answer this question we construct two systems of random variables. The first
system, T , will have only one random variable, which we will denote by e (for
extraction). What are the elements in the alphabet A(e) of e? In particular, how
many states does e have? We need to choose 3 contrade from a set of 10 contrade.
We have 10 choices for the first extracted contrada, then 9 choices for the second
extracted contrada and finally 8 choices for the third extracted contrada. The total
is 10 · 9 · 8 = 720 states, where each state is an ordered triplet which tells us which
contrade was chosen first, which was chosen second and which was chosen third.
Since we will assume that the extractions will be made honestly, we can consider
the equidistribution on A(e) whose probability distribution associates to each triplet
in the alphabet the value 1/720.
The second system S also has only one random variable, corresponding to exactly one of the ten contrade from which we will make the extraction. Such a variable, which we will call c (where c really is the name of one of the contrade), is boolean. Its alphabet can be considered as Z2, where 1 signifies that c participates in the Palio and 0 signifies that it does not participate.
Consider the map fc : T → S, sending e to c and each triplet t ∈ A(e) to 1 or 0, depending on whether or not c is in t.
The probability that c will participate in the Palio of next July is defined by the probability distribution associated to D = (fc)∗E on S.
D(1) is equal to the number of triplets containing c. How many of these triplets
are there?
The triplets where c is the first element are obtained by choosing the second
element among the other 9 contrade and the third element among the remaining 8,
hence we have 72 such triplets. Similarly for the number of triplets where c is in the
second or in the third position, for a total of 72 · 3 = 216 triplets. Hence D(1) = 216
and, obviously, D(0) = 720 − 216 = 504. Thus, the probability for a contrada c to participate in the Palio of next July, given that it already ran in the Palio of the previous July, is:

p = D(1) / (D(0) + D(1)) = 216/720 = 3/10 = 30%.

In this example we could use the intuitive formula since we assumed the 720
possible states for the random variable e were equally distributed. With this in mind,
the previous procedure yields a general construction which, for elementary situations,
is enough to compute the probability that a given event occurs.
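The construction of Example 2.1.4 can also be checked by brute force, enumerating the alphabet A(e). Here is a minimal Python sketch (the encoding of the contrade as integers is our own, hypothetical, choice):

from itertools import permutations

# The 10 · 9 · 8 = 720 ordered extractions of 3 contrade out of 10.
contrade = range(10)
states = list(permutations(contrade, 3))    # the alphabet A(e)

c = 0                                       # the contrada we follow
# Booleanization f_c: a triplet is sent to 1 iff c appears in it.
positive = sum(1 for t in states if c in t)
print(positive, len(states), positive / len(states))   # 216 720 0.3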

We now introduce the definition of booleanization of a system of random variables. As a matter of fact, in the previous example we passed from a system T of random variables to an associated boolean system S. Practically speaking, we divide all possible states of every random variable into good states and bad states, sending the former to 1 and the latter to 0 (according to whether a fixed contrada c is extracted or not).
Definition 2.1.5 We call booleanization or dichotomy of a system T, with variables {x1, . . . , xn}, any pair (S, f) where S is a boolean system, with variables {y1, . . . , yn}, and f is a map f : T → S which determines a bijection on the sets of variables.
Thus, the output of Example 2.1.4 is a consequence of the following result.
Proposition 2.1.6 Let T be a system with a unique variable x and let E be the equidistribution on T. Consider a booleanization f : T → S of T and let D be the image distribution of E under f. The probability induced by D on S corresponds to the ratio whose denominator is the number of all states of the variable of T (the possible cases) and whose numerator is the number of states of the variable of T sent to 1 by f (the positive cases).
Remark 2.1.7 It is obvious that, given a rational distribution D on a system S with a unique variable, we can always find a system T and a map f : T → S in such a way that D is the image, through f, of the equidistribution on T.
Example 2.1.8 Let us give another example to stress how the formula

(positive cases) / (possible cases)

is rephrased in our setting.


Let T be a random system with one variable x representing a (fair) die. Thus x has 6 states. What is the probability that, rolling the die, one gets an even number?
Yes, everybody expects the answer to be 1/2 = 50%, but let us derive the result in our setting.
We construct a boolean system S with one variable y and states {0, 1} and the map f : T → S which sends 1, 3, 5 to 0 (bad cases) and 2, 4, 6 to 1 (good cases). Then the image under f of the equidistribution over T is the distribution D such that D(0) = D(1) = 3. Thus we get the expected answer D(1)/(D(0) + D(1)) = 3/6 = 1/2.
Remark 2.1.9 Notice that the previous construction of probability, in practice, is
strongly influenced by the choice of the equidistribution on T . It is reliable only when
we have good reasons (which, often, cannot be mathematical reasons) to believe that
the states of any variable of T are equiprobable.
We cannot emphasize enough that in many concrete cases it is absolutely impos-
sible to know, ahead of time, if the choice of the equidistribution on T makes sense.
It’s enough to consider the example of a coin, about which we know nothing
except that it has two states, H = heads or T = tails. What is the probability that after

tossing the coin we obtain one of the two possibilities? A priori, no one can affirm
with certainty that the probability of each state is 1/2 since the coin may not be a
fair coin.
In a certain sense, physical, biological and economic worlds are filled with “unfair”
coins. For example, on examining the first place in the DNA chain of large numbers
of organisms, one observes that the four basic elements do not appear equally often;
the element T appears much more frequently than the element A. In like manner, if
one were attempting to understand the possible results of a soccer match, there is no
way one would say that the three possibilities 1, 2, T are equally probable.

2.2 Booleanization and Logic Connectors

The elementary construction of probability obtained in the previous section becomes more involved when one wants to consider a series of events. Let us start with an example, taken again from the Palio.
Example 2.2.1 The procedure for the choice of the ten contrade which run in the Palio of July is repeated verbatim for the Palio of August. In August the race is open to the seven contrade that did not participate in the Palio of August of the previous year, plus three contrade chosen by a draw.
The procedures for the participation in the two Palii run every year are independent. Hence it is possible that a contrada, in a given year, participates in both Palii, or does not participate at all.
Assume that one contrada c did not participate in either Palio in 2014. Then it will participate in both Palii in 2015. What is the probability that it participates in both Palii in 2016? What is the probability that it participates in at least one Palio in 2016?
Observe that the problem can be easily solved by applying the results of Example 2.1.4, obtaining the expected solutions (3/10)² = 9/100 and 1 − (7/10)² = 51/100.
However, to give a more formal answer to these questions, we define a system T
with two random variables J = July and A = August. For each variable, the alphabet
is the set of possible triplets of extracted contrade. Hence each variable has 720 states
(see Example 2.1.4).
Consider now T′ = ΣT, the total correlation of T. This furnishes all possible results of the extractions for the two Palii of 2016. T′ has only one variable, with 720² = 518400 states.
Consider the boolean system S with unique variable c and alphabet {0, 1}.
To compute the probability that c participates in both Palii, we build the map i : T′ → S, where i sends (obviously) the variable y ∈ T′ to c and sends each state s of y, corresponding to a pair of triplets, to 1 or 0, depending on whether c appears in both triplets or not.
How many states s are sent to 1? There are 216 triplets, among the 720 possible
ones, where c appears. Then the pairs of triplets, in both of which c appears, are
216 · 216 = 46656.

The equidistribution E on T′ induces on S the distribution D = i∗E such that D(1) = 46656 and D(0) = 518400 − 46656 = 471744. Hence, the probability for the contrada c to participate in both Palii in 2016 is

D̃(1) = D(1) / (D(0) + D(1)) = 46656/518400 = 9/100 = 9%.

To compute the probability that c participates in at least one Palio, we build the map u : T′ → S, where this time u sends each state s of y, corresponding to a pair of triplets, to 1 or 0, depending on whether or not c appears in at least one of the triplets in the pair.
How many states s are sent to 1 now? There are 216 · 720 pairs where c appears in the first element of s. Among the pairs whose first element is one of the 720 − 216 = 504 remaining triplets, there are 504 · 216 pairs where c appears in the second element of s. Then the pairs of triplets having c in at least one triplet are 216 · 720 + 504 · 216 = 264384.
The equidistribution E on T′ induces on S the distribution R = u∗E such that R(1) = 264384 and R(0) = 518400 − 264384 = 254016. Hence, the probability for the contrada c to participate in at least one Palio in 2016 is

R̃(1) = R(1) / (R(0) + R(1)) = 264384/518400 = 51/100 = 51%.

A different approach to this problem can be found in Example 2.3.19.
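The two probabilities of this example can also be verified by direct enumeration of the 518400 states of T′. A minimal Python sketch, under the same integer encoding of the contrade as before:

from itertools import permutations, product

triplets = list(permutations(range(10), 3))   # 720 states of each variable
pairs = product(triplets, repeat=2)           # the 518400 states of T'

c = 0
both = at_least_one = 0
for s, t in pairs:
    both += (c in s) and (c in t)
    at_least_one += (c in s) or (c in t)
print(both / 518400, at_least_one / 518400)   # 0.09 0.51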


Notice that the example suggests a way to define, in general, what happens for the composition of two events. It shows, however, that there are several ways in which the composition can be constructed, depending on which kind of logical operator one uses to combine the two events.
In the example, we used the two operators "in both" and "in at least one", but other choices are possible (see Exercise 6).
Let us review, briefly, the setting of logic connectors.

Definition 2.2.2 Define an elementary logic connector simply as a map Z2 × Z2 → Z2. In other words, an elementary logic connector is an operation on Z2. Clearly there are exactly 16 elementary logic connectors.

Example 2.2.3 The most famous elementary logic connectors are AND, OR and
AUT (which means “one of them, but not both”), described respectively by parts (a),
(b) and (c) of Table 2.1.

Definition 2.2.4 Let S be a system with two random variables x1, x2, and consider a booleanization E of S. Let Λ be a logic connector. Then Λ defines a booleanization E_Λ of the total correlation ΣS by setting, for each pair s, t of states of x1, x2:

E_Λ(s, t) = Λ(E(s), E(t)).

Table 2.1 Truth tables of AND (a), OR (b) and AUT (c)

AND | 0 1      OR | 0 1      AUT | 0 1
 0  | 0 0       0 | 0 1       0  | 0 1
 1  | 0 1       1 | 1 1       1  | 1 0
   (a)            (b)            (c)

The reader can easily rephrase Example 2.2.1 in terms of the booleanization induced by a connector Λ.
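As an illustration, the three connectors of Table 2.1 and the induced booleanization of Definition 2.2.4 can be coded directly. In this minimal Python sketch the booleanization E (detecting a fixed contrada in a triplet) is the one of Example 2.2.1; the names are our own:

# Elementary logic connectors as operations on Z2.
AND = lambda a, b: a & b
OR  = lambda a, b: a | b
AUT = lambda a, b: a ^ b        # "one of them, but not both"

def E(s, c=0):
    """Booleanization of a triplet s: 1 iff the contrada c appears in s."""
    return int(c in s)

def E_Lambda(Lam, s, t):
    """The induced booleanization: E_Lambda(s, t) = Lambda(E(s), E(t))."""
    return Lam(E(s), E(t))

print(E_Lambda(OR, (0, 3, 7), (1, 2, 5)))    # 1: c = 0 is in the first triplet
print(E_Lambda(AND, (0, 3, 7), (1, 2, 5)))   # 0: c = 0 is not in both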
Remark 2.2.5 One can extend the notion of logic connectors to n-tuples, by taking n-ary operations on Z2, i.e. maps (Z2)^n → Z2. This yields an extension of a booleanization on a system S, with an arbitrary number of random variables, to a booleanization of ΣS.
The theory of logic connectors has useful applications. Just to mention one, it introduces a theoretical study of queries on databases.
However, since a systematic study of logic connectors goes far beyond the scope of this book, we will not continue the discussion here.

2.3 Independence Connections and Marginalization

The numbers appearing in Example 2.2.1 are quite large and, in similar but slightly more complicated situations, rapidly become intractable.
Fortunately, there is a simpler setting in which to reproduce the computations of probability performed in Example 2.2.1. It is based on the notion of "independence".
In order to give this definition, we first show a connection between tensors and distributions, which extends the connection between distributions on a system with two variables and matrices, introduced in Remark 1.2.8.
Tensors are fundamental objects in Multi-linear Algebra, and they are introduced
in Chap. 6 of the related part of the book. We introduce here the way tensors enter
into the study of random systems, and this will provide the fundamental connection
between Algebraic Statistics and Multi-linear Algebra.
Remark 2.3.1 Let S be a system of random variables, say x1, . . . , xn, suppose that xi has ai states, and let T = ΣS be the total correlation of S. Recall that a distribution D on T assigns a number to every state of the unique variable x = (x1, . . . , xn) of T. But the states of x correspond to ordered n-tuples (s1j1, . . . , snjn), where siji is a state of the variable xi, i.e. a state of T corresponds to a particular location in a multidimensional array, and a distribution on T puts a number in that location. That is, a distribution on T can be identified with an n-dimensional tensor of type a1 × · · · × an, and conversely.

The entry T_{i1,...,in} corresponds to the states i1 of the variable x1, i2 of the variable x2, . . . , in of the variable xn.

Remark 2.3.2 When the original system S has only two variables (random dipole), then the distributions on the total correlation ΣS are represented as 2-dimensional tensors, i.e. usual matrices.

Example 2.3.3 Assume we have a system S of three coins, that we toss 15 times, recording the outputs. Assume that we get the following result:
• 1 time we get HHH;
• 2 times we get HHT and TTH;
• 3 times we get HTH and HTT;
• 4 times we get TTT;
• we never get THH or THT.
Then the corresponding distribution, in the space of R-distributions of S, which is D(S) = R2 × R2 × R2, is ((9, 6), (3, 12), (6, 9)), where (9, 6) corresponds to 9 Heads and 6 Tails for the first coin, (3, 12) corresponds to 3 Heads and 12 Tails for the second coin, and (6, 9) corresponds to 6 Heads and 9 Tails for the third coin.
The distribution over T = ΣS that corresponds to our data is the tensor of Example 6.4.15, namely the 2 × 2 × 2 tensor D whose slices, with respect to the third coin, are

D(·, ·, 1) = | 1 3 |      D(·, ·, 2) = | 2 3 |
             | 0 2 |                   | 0 4 |

(we identify H = 1 and T = 2; the first index refers to the first coin, the second index to the second coin).

In general, as we can immediately see from the previous Example, knowing the distribution on a system S of three coins is not enough to reconstruct the corresponding distribution on the total correlation T = ΣS.
On the other hand, there is a special class of distributions on ΣS for which the reconstruction is possible. It is the class of independence distributions defined below.
Definition 2.3.4 Let S be a system of random variables x1, . . . , xn, where each xi has ai states. Then D_K(S) is identified with K^{a1} × · · · × K^{an}. Set T = ΣS, the total correlation of S.
Define a function

Π : D_K(S) → D_K(T),

called the independence connection (or Segre connection), in the following way: Π sends the distribution

D = ((d11, . . . , d1a1), . . . , (dn1, . . . , dnan))



to the distribution D′ = Π(D) on ΣS (thought of as a tensor) such that

D′_{i1,...,in} = d1i1 · · · dnin.

Coherently with the notation of Chap. 6, we will denote the image of D as v1 ⊗ · · · ⊗ vn, where vi = (di1, . . . , diai).
Example 2.3.5 The distribution on ΣS in Example 2.3.3 above is not the image of D = ((9, 6), (3, 12), (6, 9)) under the independence connection Π, even after scaling. Namely, the image of D under Π is the tensor D′ with slices

D′(·, ·, 1) = | 162 648 |      D′(·, ·, 2) = | 243 972 |
              | 108 432 |                    | 162 648 |

where 162 = 9 · 3 · 6, 648 = 6 · 12 · 9 and so on.
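In terms of arrays, the independence connection is an iterated outer product. A minimal sketch assuming NumPy is available, reproducing the tensor above:

import numpy as np
from functools import reduce

# Pi(D) = v1 ⊗ v2 ⊗ v3 for D = ((9, 6), (3, 12), (6, 9)).
v = [np.array([9, 6]), np.array([3, 12]), np.array([6, 9])]
T = reduce(np.multiply.outer, v)     # a 2 × 2 × 2 tensor of rank 1

print(T[0, 0, 0], T[1, 1, 1])        # 162 648
# T differs from the observed tensor of Example 2.3.3 (even after
# scaling), so that distribution is not one of independence.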


Remark 2.3.6 The image of the independence connection is, by definition, exactly
the set of tensors of rank 1, as introduced in Definition 6.3.3.
The elements of the image will be called distributions of independence.
Proposition 2.3.7 If D is a probabilistic distribution, then Π(D) is a probabilistic distribution.

Proof Let S be a system with random variables X = {x1, . . . , xn} and let ΣS be the total correlation of S. Let D be a probabilistic distribution on S and denote by y the unique random variable of ΣS. We need to prove that

Σ_{s1j1 ∈ A(x1), ..., snjn ∈ A(xn)} Π(D)y(s1j1, . . . , snjn) = 1.

We can observe that

Σ_{s1j1 ∈ A(x1), ..., snjn ∈ A(xn)} Π(D)y(s1j1, . . . , snjn)
  = Σ Dx1(s1j1) · Dx2(s2j2) · · · Dxn(snjn)
  = ( Σ_{s1j1 ∈ A(x1)} Dx1(s1j1) ) · ( Σ_{s2j2 ∈ A(x2), ..., snjn ∈ A(xn)} Dx2(s2j2) · · · Dxn(snjn) ).

Since Σ_{s1j1 ∈ A(x1)} Dx1(s1j1) = 1, the claim follows by induction on the number of random variables of S. □

Proposition 2.3.8 Let f = (F, G) : S → T be a map between systems and D a distribution on S. Let D′ = f∗D be the induced distribution on T. Then the distribution Π(D), induced on the total correlation of S, has, as image under the map of total correlations determined by f, the distribution Π(D′).

Proof Denote by x1, . . . , xn the random variables in S and by y1, . . . , yn the random variables in T, with yi = F(xi). For each state t = (t1, . . . , tn) of the variable (y1 × · · · × yn) of ΣT, one has Π(D′)(t) = Π_{i=1,...,n} D′(ti) and

D′(ti) = Σ_{sij ∈ A(xi), G(xi)(sij) = ti} D(sij).

We know that Π(D)(s1j1, . . . , snjn) = D(s1j1) · · · D(snjn), hence

(Σf)∗Π(D)(t1, . . . , tn) = Σ_{(s1j1,...,snjn) → (t1,...,tn)} Π(D)(s1j1, . . . , snjn)

coincides with Π(D′)(t1, . . . , tn). □

Example 2.3.9 Let S be a boolean system with two random variables x, y, both with alphabet {0, 1}. Let D be the distribution defined as

Dx(0) = 1/6, Dx(1) = 5/6, Dy(0) = 1/6, Dy(1) = 5/6

(which is clearly a probabilistic distribution).
Its product distribution on the total correlation with variable z = x × y is defined as

Π(D)z(0, 0) = 1/6 · 1/6 = 1/36
Π(D)z(0, 1) = 1/6 · 5/6 = 5/36
Π(D)z(1, 0) = 5/6 · 1/6 = 5/36
Π(D)z(1, 1) = 5/6 · 5/6 = 25/36

which is still a probabilistic distribution, since 1/36 + 5/36 + 5/36 + 25/36 = 1.

The independence connection can be, in some sense, inverted. For this aim, we need
the definition of (total) marginalization.
Definition 2.3.10 Let S be a system of random variables and T = ΣS its total correlation. Define a function M : D_K(T) → D_K(S), called (total) marginalization, in the following way. Given a distribution D′ on T (thought of as a tensor), M(D′) sends the j-th state of the variable xi of S to the number Σ D′_{a1,...,an}, where the sum is taken over all entries of the tensor whose i-th index is equal to j.
We say that a distribution D on S is coherent with the distribution D′ on ΣS if M(D′) = D, i.e. D′ ∈ M⁻¹(D).
For a distribution D on S, we denote:

Co(D) = { distributions D′ on ΣS coherent with D } = M⁻¹(D).
Assume that S is a system with random variables x1, . . . , xn, let D be a distribution on S and let D′ be a distribution on ΣS. Then D′ is coherent with D if, for all i = 1, . . . , n and for each state siji of xi, one has

D(siji) = Σ D′(s1j1, . . . , siji, . . . , snjn)

where the sum ranges over all choices of states skjk of xk, with k ≠ i.
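Reading a distribution on ΣS as a tensor, the marginalization is just a sum along all axes but one. A minimal sketch assuming NumPy, applied to the tensor of Example 2.3.3:

import numpy as np

def marginalize(T):
    """Total marginalization: one vector for each variable of S."""
    axes = range(T.ndim)
    return [T.sum(axis=tuple(k for k in axes if k != i)) for i in axes]

# The 2 × 2 × 2 tensor of Example 2.3.3 (index 0 = H, index 1 = T).
D1 = np.array([[[1, 2], [3, 3]],
               [[0, 0], [2, 4]]])
print(marginalize(D1))   # [array([9, 6]), array([3, 12]), array([6, 9])]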


Proposition 2.3.11 If there exists a distribution D′ on ΣS coherent with the distribution D on S, then D has constant sampling.

Proof It is sufficient to show that the sampling of D at any variable xi is equal to the sampling of D′ at the unique random variable of ΣS. One has:

c(xi) = Σ_{siji ∈ A(xi)} Dxi(siji)
      = Σ_{siji ∈ A(xi)} ( Σ_{k ≠ i, skjk ∈ A(xk)} D′(s1j1, . . . , siji, . . . , snjn) )
      = Σ D′(s1j1, . . . , snjn)                          (2.3.1)

where the last sum runs over the states of the unique random variable of ΣS. □

Remark 2.3.12 From the previous proof it follows that if a distribution D on S is probabilistic, then every distribution D′ on ΣS, coherent with D, is still probabilistic.

Example 2.3.13 Let us now see how the independence connection and the marginalization work together.
To this aim, we introduce an example (honestly, rather crude) on the effectiveness of a drug, which we will also use in the chapter on statistical models.
Consider a pharmaceutical company that wants to verify whether a specific product is effective against a given pathology.
The company will try to verify this effectiveness by hiring volunteers (the population) affected by the pathology and giving the drug to some of them and a placebo to the others. The conclusions must be drawn from the record of the number of healings.

The situation is described by a boolean system S, whose two variables x, y represent, respectively, the administration of the drug and the healing (as usual, 1 = yes, 0 = no).
On this system we introduce the following distribution D (with constant sampling)

Dx (0) = 20, Dx (1) = 80, D y (0) = 30, D y (1) = 70.

This corresponds to an experiment with 100 persons affected by the pathology: 80 persons receive the drug, while the other 20 receive the placebo. At the end of the observation, 30 subjects are still sick, while the remaining 70 are healed.
If Π is the independence connection, then Π(D) is the 2 × 2 tensor (matrix):

|  600 1400 |
| 2400 5600 |

We easily observe that Π(D) is a matrix of rank 1; a joint distribution of this shape says that the drug has no effect on (is independent of) the healing of the patients.
The marginalization of Π(D) gives the distribution D′ on S:

D′x(0) = 600 + 1400 = 2000,  D′x(1) = 2400 + 5600 = 8000,
D′y(0) = 600 + 2400 = 3000,  D′y(1) = 1400 + 5600 = 7000.

We notice that D′ is a scaling of D, with scaling factor equal to 100, which is the sampling.

Already from the previous Example 2.3.13 it is clear that the marginalization is, in general, not injective; in other words, Co(D) need not be a singleton. Indeed, in the Example, one immediately finds distributions on ΣS other than (1/100)Π(D) which are coherent with D.
Example 2.3.14 Consider a boolean system with two variables representing two coins c1, c2, each with alphabet {H, T}. We toss the coins separately 100 times, obtaining, for the first coin c1, 30 times H and 70 times T, and, for the second coin c2, 60 times H and 40 times T. Thus one has a distribution D given by ((30, 70), (60, 40)).
Using the independence connection Π, we get, on the unique variable of the total correlation ΣS, the distribution

Π(D)(H, H) = 1800
Π(D)(H, T) = 1200
Π(D)(T, H) = 4200
Π(D)(T, T) = 2800

Normalizing this distribution, we notice that the probability of getting (H, T) is 1200/10000 = 12%. The marginalization is the distribution M(Π(D)) with values ((3000, 7000), (6000, 4000)).

Clearly M(Π(D)) is a scaling of D.


The previous examples can be generalized.
Proposition 2.3.15 Let S be a system and T its total correlation. Denote, as usual, by Π the independence connection from S to T and by M the marginalization from T to S.
If D is a distribution on S, then M(Π(D)) is a scaling of D.

Proof If s11 is the first state of the first variable of S, then

M(Π(D))(s11) = Σ Π(D)(s11, s2j2, . . . , snjn)
             = Σ D(s11) D(s2j2) · · · D(snjn)
             = D(s11) ( Σ D(s2j2) · · · D(snjn) ),

where the sums range over all choices of states of the variables x2, . . . , xn, and the same equality holds for all other states.
Hence, setting c1 = Σ D(s2j2) · · · D(snjn), for each state s1i of the first variable of S we get

M(Π(D))(s1i) = D(s1i) c1.

Similar equalities hold for the other variables of S, thus M(Π(D)) is a scaling of D. □
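A quick numerical check of the proposition, with the data of Example 2.3.14 (a sketch assuming NumPy):

import numpy as np
from functools import reduce

D = [np.array([30, 70]), np.array([60, 40])]
T = reduce(np.multiply.outer, D)              # Pi(D)
print([T.sum(axis=1), T.sum(axis=0)])
# [array([3000, 7000]), array([6000, 4000])]: exactly 100 · D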

Example 2.3.16 The converse of the previous proposition is not valid in general. As a matter of fact, in the setting of Example 2.3.14, consider a distribution D′ on T defined as

D′(H, H) = 6
D′(H, T) = 1
D′(T, H) = 3
D′(T, T) = 1

obtained by tossing the two coins 11 times.
The marginalization M gives the distribution ((7, 4), (9, 2)) on S. If we apply the independence connection Π, we get on T

Π(M(D′))(H, H) = 63
Π(M(D′))(H, T) = 14
Π(M(D′))(T, H) = 36
Π(M(D′))(T, T) = 8

which is not a scaling of D′.

In the previous Example, the initial distribution D′ does not lie in the image of the independence connection, hence Π(M(D′)), which obviously lies in the image, cannot be equal to D′.

When we start from a distribution D′ which lies in the image of the independence connection, the marginalization works (up to scaling) as the inverse of the independence connection.
Proposition 2.3.17 Let S be a system and T its total correlation. Denote, as usual, by Π the independence connection from S to T and by M the (total) marginalization from T to S.
If D′ = Π(D), then Π(M(D′)) is a scaling of D′.

Proof Suppose S has n variables x1, . . . , xn, such that the variable xi has ai states. By our assumption on D′, there exist vectors vi = (vi1, . . . , viai) ∈ K^{ai} such that, as a tensor, D′ = v1 ⊗ · · · ⊗ vn.
Then M(D′) associates the vector

M(D′)_j = ( Σ_{ij = 1} v1i1 · · · vnin, . . . , Σ_{ij = aj} v1i1 · · · vnin )

to the states of the variable xj. Hence, setting ci = sampling of v1 ⊗ · · · ⊗ v̂i ⊗ · · · ⊗ vn, M(D′) associates the value vij ci to the j-th state of xi. Thus

Π(M(D′)) = (c1 · · · cn) D′.

If one of the ci's is 0, we get the empty distribution, otherwise we get a scaling of D′. □

Corollary 2.3.18 For every distribution D on S with nonzero constant sampling, there exists one and only one distribution of independence D′ on ΣS = T such that D is the marginalization of D′. In other words, Co(D) intersects the image of the independence connection in a singleton.

Proof Let D′, D′′ be two distributions of independence with the same marginalization D. Then D′, D′′ have the same sampling c, which is equal to the sampling of the variables in D. By the previous proposition, D′, D′′ are both equal to a scaling of Π(D). Since they have the same sampling, they must coincide.
For the existence, if c is the sampling of D and n the number of variables, it is enough to consider (1/c^{n−1})Π(D), whose marginalization is D by the proof of Proposition 2.3.15. □

Let us go back to see how the use of the independence connection can simplify our
computations when we know that two systems are independent.
Example 2.3.19 Consider again Example 2.2.1 and let us use the previous constructions to simplify the computations.
We have a distribution D on the system T whose unique variable is the contrada c, with alphabet Z2. We know that the normalization D̃ of D sends 1 to 3/10 and 0 to 7/10.
This defines, on ΣT, the distribution Π(D̃):

Π(D̃)(1, 1) = 3/10 · 3/10 = 9/100,    Π(D̃)(1, 0) = 3/10 · 7/10 = 21/100,
Π(D̃)(0, 1) = 7/10 · 3/10 = 21/100,   Π(D̃)(0, 0) = 7/10 · 7/10 = 49/100.    (2.3.2)

If we consider the logic connector AND, this sends (1, 1) to 1 and the other pairs to 0. Thus, the distribution induced by Π(D̃) sends 1 to 9/100 and 0 to (21 + 21 + 49)/100 = 91/100.
If we consider the logic connector OR, this sends (0, 0) to 0 and the other pairs to 1. Thus, the distribution induced by Π(D̃) sends 1 to (9 + 21 + 21)/100 = 51/100 and 0 to 49/100.
The logic connector AUT sends (0, 0) and (1, 1) to 0 and the other pairs to 1. Thus, the distribution induced by Π(D̃) sends 1 to (21 + 21)/100 = 42/100 and 0 to (9 + 49)/100 = 58/100.
And so on.
It is important to observe that the results are consistent with the ones of Example
2.2.1, but the computations are simpler.
Let us finish by showing some properties of the space Co(D) of distributions coherent
with a given distribution D on a system S.
Theorem 2.3.20 For every distribution D, with constant sampling, on a system S,
Co(D) is an affine subspace (i.e. a translate of a vector subspace) in D(S).

Proof We provide the proof only in the case when S is a random dipole, i.e. it has only two variables, leaving to the reader the straightforward extension to the general case.
Let x, y be the random variables of S, with states (s1, . . . , sm) and (t1, . . . , tn) respectively. We will prove that Co(D) is an affine subspace of dimension mn − m − n + 1. Let D′ be a distribution on ΣS, identified with the matrix D′ = (dij) ∈ R^{m,n}. Since D′ ∈ Co(D), the row sums of the matrix of D′ must give the values D(s1), . . . , D(sm) of D on the states of x, while similarly the column sums must give the values D(t1), . . . , D(tn) of D on the states of y. Thus D′ is in Co(D) if and only
if it is a solution of the linear system with n + m equations and nm unknowns:

d11 + · · · + d1n = Dx(s1)
  ⋮
dm1 + · · · + dmn = Dx(sm)
d11 + · · · + dm1 = Dy(t1)
  ⋮
d1n + · · · + dmn = Dy(tn)

It follows that Co(D) is an affine subspace of D(ΣS) = R^{m,n}. The matrix associated to the previous linear system has a block structure given by
 
| M1 M2 . . . Mm | Dx |
| I  I  . . . I  | Dy |

where I is the n × n identity matrix, Mi is the m × n matrix whose entries are 1 in the i-th row and 0 otherwise, and Dx and Dy represent, respectively, the column vectors (Dx(s1), . . . , Dx(sm)) and (Dy(t1), . . . , Dy(tn)).
Denote by H the matrix

| M1 M2 . . . Mm |
| I  I  . . . I  |

We observe that the m + n rows of H are not independent, since the vector (1, 1, . . . , 1) can be obtained both as the sum of the first m rows and as the sum of the last n rows. Hence the rank of H is at most n + m − 1.
In particular, the system has a solution if and only if the constant terms satisfy

Dx(s1) + · · · + Dx(sm) = Dy(t1) + · · · + Dy(tn),

which is equivalent to the hypothesis that D has constant sampling.


To conclude the proof, it is enough to check that H has rank at least n + m − 1,
that is H has a (n + m − 1) × (n + m − 1) submatrix of maximal rank.
Observe that the n × n block in the left bottom corner is an identity matrix,
with rank n. Deleting the last n rows and the first n columns of H , we get the
m × (mn − n) matrix H  = (M2 M3 . . . Mm ) that has null first row, but rank equal
to m − 1, since the columns in position 1, n + 1, 2n + 1, . . . , (m − 2)n + 1 contain
an identity matrix of size (m − 1) × (m − 1). 
Example 2.3.21 Let us illustrate the previous theorem, verifying that H has rank m + n − 1 in some numerical cases.
For example, if m = 2, n = 3, the matrix H is:

| 1 1 1 0 0 0 |
| 0 0 0 1 1 1 |
| 1 0 0 1 0 0 |
| 0 1 0 0 1 0 |
| 0 0 1 0 0 1 |

If m = 3, n = 2, the matrix H is:

| 1 1 0 0 0 0 |
| 0 0 1 1 0 0 |
| 0 0 0 0 1 1 |
| 1 0 1 0 1 0 |
| 0 1 0 1 0 1 |
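The rank of H can be checked mechanically for small m, n; the following sketch builds H with Kronecker products (this encoding of the blocks is our own) and verifies, assuming NumPy, that the rank is m + n − 1:

import numpy as np

def H(m, n):
    top = np.kron(np.eye(m), np.ones((1, n)))      # row-sum equations
    bottom = np.kron(np.ones((1, m)), np.eye(n))   # column-sum equations
    return np.vstack([top, bottom])

for m, n in [(2, 3), (3, 2), (4, 4)]:
    print(m, n, np.linalg.matrix_rank(H(m, n)))    # ranks 4, 4, 7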

We can say a little more on the Geometry of Co(D).



Definition 2.3.22 In D(ΣS) define the unitary simplex U as the subspace formed by the tensors whose coefficients sum to 1.
U is an affine subspace of codimension 1, which contains all the probabilistic distributions on ΣS.
We previously said that if D′ is coherent with D and D has constant sampling k, then also the sampling of D′, on the unique variable of ΣS, is k. In other terms, the matrix (dij) associated to D′ satisfies Σ_{i,j} dij = k. From this fact we get the following.
Proposition 2.3.23 For every distribution D on S with constant sampling, the affine
space Co(D) is parallel to U .
We finish by showing that, for a fixed distribution D over S, one can find in Co(D)
distributions with rather different properties.
Example 2.3.24 Assume one needs to find properties of a distribution D′ on ΣS, but knows only the marginalization D = M(D′). Even if Co(D) is in general infinite, sometimes only a little further information is needed to determine some property of D′.
For instance, let S be a random dipole whose variables x, y represent one position in the DNA chain of a population of cells at two different moments, so that both variables have alphabet {A, C, G, T}. After treating the cells with some procedure, a researcher wants to know if some systematic change occurred in the DNA sequence (which can also change randomly). The total correlation of S has one variable with 16 states. A distribution D′ on ΣS corresponds to a 4 × 4 matrix.
If the researcher cannot trace the evolution of every single cell, but can only determine the total numbers of DNA bases that occur in the given position before and after the treatment, i.e. if the researcher can only know the distribution D on S, any conclusion on the dependence of the final basis on the initial one is impossible.
But assume that one can trace the behaviour of the cells having an A in the fixed position, so that one can record the value of the distribution D′ on the state (A, A). There exists only one distribution of independence D′′ coherent with D; if the observed value D′(A, A) does not coincide with D′′(A, A), then the researcher has evidence towards the non-independence of the variables in the distribution D′.
Notice that, on the contrary, if D′(A, A) = D′′(A, A), one cannot conclude that D′ is a distribution of independence, since infinitely many distributions in Co(D) assume a fixed value on (A, A).
Example 2.3.25 Consider the following situation. In a bridge game, two assiduous players A, B follow this rule: they play alternately, one day a match paired together, the other day a match in opposing pairs. After 100 days, the situation is as follows: A won 30 games and lost 70, while B won 40 and lost 60. Can one determine analytically the total numbers of wins and defeats? Can one check whether the win or the defeat of each of them depends on playing in pairs with the other?
We can give an affirmative answer to both questions. Here we have a boolean dipole S, with two variables A, B and states 1 = victory, 0 = defeat. The distribution D on S is defined by

D A (1) = 30, D A (0) = 70, D B (1) = 40, D B (0) = 60.

Clearly D has constant sampling equal to 100. From these data, we want to determine a distribution D′ on ΣS, coherent with D, which explains the situation. We already know that we have infinitely many distributions on ΣS coherent with D. By Theorem 2.3.20, these distributions D′ fill an affine subspace of R^{2,2} having dimension 2 · 2 − 2 − 2 + 1 = 1.
The extra datum, with respect to Example 2.3.13, is the knowledge that the players played alternately in pairs and opposed. Among the 100 played matches, 50 times they played together, and the observed result could only be (0, 0) or (1, 1), while 50 times they were opposed, and the observed result could only be (0, 1) or (1, 0). Hence, the matrix (dij) of the required distribution D′ must satisfy the condition:

d11 + d22 = d12 + d21 (= 50).

The matrix of the distribution of independence D′′, coherent with D, is given by the product of the column vector (30, 70) with the row vector (40, 60), divided by the sampling of D, i.e. 100. One has

D′′ = | 12 18 |
      | 28 42 |.

All other distributions coherent with D are obtained by summing to D′′ the solutions of the homogeneous system

d11 + d12 = 0
d11 + d21 = 0
d22 + d12 = 0
d22 + d21 = 0

which are the multiples of

| −1  1 |
|  1 −1 |.

Then a generic distribution coherent with D has matrix:

D′ = | 12 − z  18 + z |
     | 28 + z  42 − z |.

Requiring d11 + d22 = d12 + d21, we get z = 2, so the only possible matrix is:

D′ = | 10 20 |
     | 30 40 |.

Then, playing in pairs, A and B won 10 times and lost 40, while playing against each other A won 20 times and B won 30 times.
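The computation of z can be left to a computer algebra system; a minimal sketch assuming SymPy is available:

from sympy import symbols, solve

# Coherent distributions: D'' + z · [[-1, 1], [1, -1]] (Example 2.3.25).
z = symbols('z')
d11, d12, d21, d22 = 12 - z, 18 + z, 28 + z, 42 - z
print(solve(d11 + d22 - (d12 + d21), z))   # [2]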

Finally, whether each of A and B wins depends on playing in pairs with the other, because the determinant of D′ is −200 ≠ 0 (and both have an advantage in not playing in pairs with each other).

Example 2.3.26 Sometimes even extra information is not enough to determine a unique D′ ∈ Co(D).
Consider the following situation. There are two sections A, B in a school. The
Director can assign, every year, some grants for some students, depending on the
financial situation of the school. Every year, the Director assigns at most two grants,
but when the balance of the school is poor, he can assign only one grant, or no grants
at all.
No section can have a privilege with respect to the other: if two grants are assigned,
they are divided one for each section. If one year only one grant is assigned to a
section, then the next year in which only one grant is assigned, the grant will be
reserved for the other section.
After 25 years the situation is: section A obtained 15 grants, and also section B
obtained 15 grants.
Can one deduce, from these data, in how many years no grants were assigned?
Can one deduce that the fact that section A gets a grant is an advantage or not for
section B?
Both answers to these questions are negative.
Create a random system S, which is a boolean dipole, with variables A, B whose states are 1 = obtains a grant, 0 = obtains no grant. We know that the distribution of grants that we observe in S is D = ((10, 15), (10, 15)). We would like to know the distribution D′ on ΣS such that D′(1, 0) = the number of years in which A got a grant and B did not, and so on. Since both sections obtained 15 grants, the number of
years in which only one grant was assigned is even. Moreover the number of years
in which section A obtained a grant and section B did not is equal to the number
of years in which section B obtained a grant and section A did not. In other words
D′(1, 0) = D′(0, 1), i.e. the matrix representing D′ is symmetric.
But this is still not enough to determine D′!
Indeed, all distributions in Co(D) are symmetric, because the values of D on the states of A and B coincide, so that if D′ = (dij) then

d11 + d12 = d11 + d21,    d21 + d22 = d12 + d22,

thus d12 = d21.


For instance, the unique distribution of independence coherent with D is d11 = 4,
d12 = 6, d21 = 6, d22 = 9, which is symmetric. In geometric terms, Co(D) is parallel
to the space of symmetric matrices.
In the example, Co(D) contains the matrices:

D1 = | 3 7 |    D2 = | 5  5 |    D3 = | 2 8 |
     | 7 8 |         | 5 10 |         | 8 7 |.

In D2 the fact that A gets a grant is an advantage for section B (10 > 9, the value for the distribution of independence), while in D1 and D3 it represents a disadvantage (8 < 9 and 7 < 9).

2.4 Exercises

Exercise 6 Following Example 2.2.1, compute the probability that a contrada c participates in exactly one of the two Palii of 2016, but not in both, under the assumption that c did not participate in any Palio in 2014.
Chapter 3
Statistical Models

3.1 Models

In this chapter, we introduce the concept of model, an essential point of statistical inference. The concept is reviewed here through our algebraic interpretation. The general definition is very simple:
Definition 3.1.1 A model on a system S of random variables is a subset M of the space of distributions D(S).
Of course, in its full generality, the previous definition is not so significant. The practice of Algebraic Statistics consists in focusing only on certain particular types of models.
Definition 3.1.2 A model M on a system S is called an algebraic model if, in the coordinates of D(S), M corresponds to the set of solutions of a finite system of polynomial equations. If the polynomials are homogeneous, then M is called a homogeneous algebraic model.
It turns out that the algebraic models are those mainly studied by the proper methods of Algebra and Algebraic Geometry.
In statistical practice, it turns out that many models which are important for the study of (discrete) stochastic systems are actually algebraic models.
Example 3.1.3 On a system S, the set of distributions with constant sampling is a
model M. Such a model is a homogeneous algebraic one.
As a matter of fact, if x1 , . . . , xn are the random variables in S and we identify
DR (S) with Ra1 × · · · × Ran , where the coordinates are y11 , . . . , y1a1 , y21 , . . . , y2a2 ,
. . . , yn1 , . . . , ynan , then M is defined by the homogeneous equations:

y11 + · · · + y1a1 = y21 + · · · + y2a2 = · · · = yn1 + · · · + ynan .

The probabilistic distributions form a submodel of the previous model, which is still
algebraic, but not homogeneous!

3.2 Independence Models

The most famous class of algebraic models is the one given by independence models. Given a system S, the independence model on S is, in effect, a subset of the space of distributions of the total correlation T = ΣS, containing the distributions in which the variables are independent of each other. The basic example is Example 2.3.13, where we considered a boolean system S, whose two variables x, y represent, respectively, the administration of the drug and the healing.
This example justifies the definition of a model of independence for the random systems with two variables (dipoles), already in fact introduced in the previous chapters.
Definition 3.2.1 Let S be a system with two random variables x1, x2 and let T = ΣS. The space of K-distributions on T is identified with the space of matrices K^{a1,a2}, where ai is the number of states of the variable xi.
We recall that a distribution D ∈ D_K(T) is a distribution of independence if D, as a matrix, has rank ≤ 1.
The independence model on S is the subset of D_K(T) of distributions of rank ≤ 1.
To extend the definition of independence to systems with more variables, consider
the following example.
Example 3.2.2 Let S be a random system having three random variables x1, x2, x3 representing, respectively, a die and two coins (this time not loaded). Let T = ΣS and consider the R-distribution D on T defined by the tensor of type 6 × 2 × 2 all of whose 24 entries are equal to 1/24.

It is clear that D is a probabilistic distribution of independence. One can read it as the fact that the probability that a value d comes out from the die at the same time as given faces (for example, T and H) come out from the two coins is the product of the probability 1/6 that d comes out of the die, times the probability 1/2 that T comes out of one coin, times the probability 1/2 that H comes out of the other coin.
Hence, we can use Definition 6.3.3 to define the independence model.
Definition 3.2.3 Let S be a system with random variables x1, . . . , xn and let T = ΣS. The space of K-distributions on T is identified with the space of tensors K^{a1,...,an}, where ai is the number of states of the variable xi.

A distribution D ∈ D_K(T) is a distribution of independence if D lies in the image of the independence connection (see Definition 2.3.4), i.e., as a tensor, it has rank 1 (see Definition 6.3.3).
The independence model on S is the subset of D_K(T) consisting of all distributions of independence (that is, of all tensors of rank 1).
The model of independence, therefore, corresponds to the subset of simple (or
decomposable) tensors in a tensor space (see Remark 6.3.4).
We have seen, in Theorem 6.4.13 of the chapter on Tensorial Algebra, how such a subset can be described. Since all the relations (6.4.1) correspond to the vanishing of one (quadratic) polynomial expression in the coefficients of the tensor, we have
Corollary 3.2.4 The model of independence is an algebraic model.
Note that for 2 × 2 × 2 tensors, the independence model is defined by 12 quadratic equations (6 face equations + 6 diagonal ones).
The equations corresponding to the equalities (6.4.1) describe a set of equations for the model of independence. However such a set is, in general, not minimal.
The distributions of independence represent situations in which there is no link
between the behavior of the various random variables of S, which are, therefore,
independent.
There are, of course, intermediate cases between a total link and a null link, as
seen in the following:
Example 3.2.5 Let S be a system with 3 random variables. The space of distributions D(ΣS) consists of tensors of dimension 3 and type (d1, d2, d3). We say that a distribution D ∈ D(ΣS) is without triple correlation if there exist three matrices A ∈ R^{d1,d2}, B ∈ R^{d1,d3}, C ∈ R^{d2,d3} such that, for all i, j, k:

D(i, j, k) = A(i, j)B(i, k)C( j, k).

An example, when S is boolean, is given by the tensor with slices

D(·, ·, 1) = |  0  0 |      D(·, ·, 2) = | −4  2 |
             | −1 −9 |                   | −4 12 |

which is obtained from the matrices

A = | 2 1 |    B = |  0 1 |    C = | 1 −2 |
    | 1 3 |        | −1 2 |        | 3  2 |
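The factorization can be verified directly; a minimal sketch assuming NumPy, whose einsum call encodes exactly the index pattern A(i, j)B(i, k)C(j, k):

import numpy as np

A = np.array([[2, 1], [1, 3]])
B = np.array([[0, 1], [-1, 2]])
C = np.array([[1, -2], [3, 2]])

D = np.einsum('ij,ik,jk->ijk', A, B, C)   # D(i,j,k) = A(i,j) B(i,k) C(j,k)
print(D[:, :, 0])   # [[ 0  0] [-1 -9]]
print(D[:, :, 1])   # [[-4  2] [-4 12]]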

3.3 Connections and Parametric Models

Another important example of models in Algebraic Statistics is provided by the parametric models. They are models whose elements have coefficients that vary according to certain parameters. To be able to define parametric models, it is necessary first to fix the concept of connection between two random systems.
Definition 3.3.1 Let S, T be systems of random variables. We call K-connection between S and T any function Γ from the space of K-distributions D_K(S) to the space of K-distributions D_K(T).
As usual, when the field K is understood, we will omit it in the notation.
After all, therefore, connections are nothing more than functions between a space K^s and a space K^t. The name we gave, in reference to the fact that these are two spaces attached to random systems, serves to emphasize the use we will make of connections: to transport distributions from the system S to the system T.
In this regard, if T has n random variables y1, . . . , yn, and the alphabet of each variable yi has di elements, then D_K(T) can be identified with K^{d1} × · · · × K^{dn}. In this case, it is sometimes useful to think of a connection Γ as a set of functions Γi : D(S) → K^{di}.
If s1, . . . , sa are all possible states of the variables in S, and ti1, . . . , tidi are the possible states of the variable yi, then we will also write

ti1 = Γi1(s1, . . . , sa)
. . .
tidi = Γidi(s1, . . . , sa)

The definition of connection given here is, in principle, extremely general: no particular properties are required of the functions Γi, not even continuity. Of course, in concrete cases, we will study in particular certain connections having well-defined properties.
It is clear that, in the absence of any assumption, we cannot hope that the most general connections satisfy many properties.
Let us look at some significant examples of connections.
Example 3.3.2 Let S be a random system and S′ a subsystem of S. One gets a connection from S to S′, called projection, simply by forgetting the components of the distributions which correspond to random variables not contained in S′.
Example 3.3.3 Let S be a random system and T = ΣS its total correlation. Assume that S has random variables x1, . . . , xn, and each variable xi has ai states; then D_K(S) is identified with K^{a1} × · · · × K^{an}. In Definition 2.3.4, we defined a connection Π : D_K(S) → D_K(T), called the connection of independence or Segre connection, in the following way: Π sends the distribution

D = ((d11, . . . , d1a1), . . . , (dn1, . . . , dnan))


to the tensor (thought of as a distribution on ΣS) D′ = Π(D) such that

D′_{i1,...,in} = d1i1 · · · dnin.

It is clear, by construction, that the image of the connection is formed exactly by the distributions of independence on ΣS.

Clearly there are other interesting types of connection. A practical example is the
following:
Example 3.3.4 Consider a population of microorganisms in which we have elements of two types, A, B, that can pair together randomly. At the end of the couplings, we will have microorganisms with genomes of type AA or BB, or of mixed type AB = BA.
The initial situation corresponds to a boolean system with one variable (the initial type t0) which assumes the values A, B. At the end, we still have a system with only one variable (the final type t), which can assume the 3 values AA, AB, BB.
If we initially insert a distribution with a = D(A) elements of type A and b = D(B) elements of type B, which distribution can we expect on the final variable t?
An individual has a chance of meeting another individual of type A or B which is proportional to (a, b), so the final distribution on t will be D′ given by D′(AA) = a², D′(AB) = 2ab, D′(BB) = b². This procedure corresponds to the connection Γ : R² → R³, Γ(a, b) = (a², 2ab, b²).

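As a sketch, the connection Γ is a three-line Python function, and one can already observe on examples the binomial relation satisfied by its image (anticipating Sect. 3.4); the function name is our own:

def gamma(a, b):
    """The connection of Example 3.3.4: (a, b) -> (a^2, 2ab, b^2)."""
    return (a * a, 2 * a * b, b * b)

x, y, z = gamma(100.0, 60.0)
print(x, y, z)                 # 10000.0 12000.0 3600.0
print(y ** 2 == 4 * x * z)     # True: every image point satisfies y^2 = 4xz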
Definition 3.3.5 We say that a model V ⊂ D(T) is parametric if there exist a random system S and a connection Γ from S to T such that V is the image of D(S) under Γ in D(T), i.e., V = Γ(D(S)).
A model is polynomial parametric if Γ is defined by polynomials.
A model is toric if Γ is defined by monomials.

The motivation for defining parametric models should be clear from the representation of a connection. If s1, . . . , sa are all possible states of the variables in S, and ti1, . . . , tidi are the possible states of the variables yi of T, then in the parametric model defined by the connection Γ we have

ti1 = Γi1(s1, . . . , sa)
. . .
tidi = Γidi(s1, . . . , sa)

where the Γij's represent the components of Γ.


The model definition we initially gave is so vast as to be generally of little use. In reality, the models we will use in the following will always be algebraic or polynomial parametric models.
Example 3.3.6 It is clear from the Example 3.3.3 that the model of independence is
given by the image of the independence connection, defined by the Segre map (see
the Definition 10.5.9), so it is a parametric model.

The tensors T of the independence model have in fact coefficients that satisfy the parametric equations

T_{i1...in} = v1i1 v2i2 · · · vnin        (3.3.1)

(one equation for each entry of the tensor). From the parametric equations (3.3.1), we quickly realize that the independence model is a toric model.
Example 3.3.7 The model of Example 3.3.4 is a toric model, since it is defined by the equations

x = a²
y = 2ab
z = b².

Remark 3.3.8 It is evident, but it is good to underline it, that with the definitions we gave, being an algebraic or a polynomial parametric model is independent of changes of coordinates. Being a toric model, instead, can depend on the choice of coordinates.
Definition 3.3.9 The term linear model denotes, in general, a model on S defined in D(S) by linear equations.
Obviously, every linear model is algebraic and also polynomial parametric, because one can always parametrize a linear space.
Example 3.3.10 Even if a connection Γ between the K-distributions of two random systems S and T is defined by polynomials, the polynomial parametric model that Γ defines is not necessarily algebraic!
In fact, if we consider K = R and two random systems S and T, each having one single random variable with a single state, the connection Γ : R → R, Γ(s) = s², certainly determines a polynomial parametric model (even a toric one) which corresponds to R≥0 ⊂ R, so it cannot be defined in R as the vanishing locus of polynomials.
We will see, however, that by widening the field of definition of distributions, as we will do in the next chapter switching to distributions over C, from a certain point of view all polynomial parametric models will, in fact, be algebraic models.
The following counterexample is a milestone in the development of a large part of modern Mathematics. Unlike Example 3.3.10, it cannot be recovered by enlarging our field of action.
Example 3.3.11 Not all algebraic models are polynomial parametric.
We consider in fact a random system S with only one variable having three states. In the distribution space D(S) = R³, we consider the algebraic model V defined by the unique equation x³ + y³ − z³ = 0.
There cannot be a polynomial connection Γ from a system S′ to S whose image is V.

In fact, suppose there exist three polynomials p, q, r such that x = p, y = q, z = r. The three polynomials must satisfy identically the equation p³ + q³ − r³ = 0, so it is enough to verify that there are no three polynomials satisfying the previous relation. Up to fixing the values of the other variables, we can assume that p, q, r are polynomials in a single variable t. We can also suppose that the three polynomials have no common factors. Let us say that deg(p) ≥ deg(q) ≥ deg(r). Deriving the equation

p(t)³ + q(t)³ − r(t)³ = 0

with respect to t, we get

p²(t)p′(t) + q²(t)q′(t) − r²(t)r′(t) = 0.

The two previous equations say that (p²(t), q²(t), r²(t)) is a solution of the homogeneous linear system with matrix

| p(t)  q(t)  −r(t)  |
| p′(t) q′(t) −r′(t) |.

The solution (p²(t), q²(t), r²(t)) must then be proportional to the vector of 2 × 2 minors of the matrix, hence p²(t) is proportional to q(t)r′(t) − q′(t)r(t), and so on. Considering the equality p²(t)(p(t)r′(t) − p′(t)r(t)) = q²(t)(q(t)r′(t) − q′(t)r(t)), we get that p²(t) divides q(t)r′(t) − q′(t)r(t), hence 2 deg(p(t)) ≤ deg(q) + deg(r) − 1, which contradicts the fact that deg(p) ≥ deg(q) ≥ deg(r).
Naturally, there are also examples of models arising from connections that do not relate a system and its total correlation.
Example 3.3.12 Let us say we have a bacterial culture in which we insert bacteria corresponding to two types of genome, which we will call A, B.
Suppose that, according to the genetic makeup, the bacteria can develop characteristics concerning the thickness of the membrane and of the nucleus. To simplify, let us say that in this example cells can develop a nucleus and a membrane which are either big or small.
According to the theory to be verified, the cells of type A develop, in the descent, a big membrane in 20% of cases and a big nucleus in 40% of cases. Cells of type B develop a big membrane in 25% of the cases and a big nucleus in one third of the cases. Moreover, the theory expects the two phenomena to be independent, in the intuitive sense that developing a big membrane is not influenced by, nor influences, the development of a big nucleus.
We build two random systems. The first, S, which is boolean, has only one random variable c (= cell) with states A, B. The second, T, has two boolean variables, m (= membrane) and n (= nucleus). We denote for both with 0 the status "big" and with 1 the status "small".
The theory induces a connection Γ between S and T. On the four states of the two variables of T, which we will indicate with x0, x1, y0, y1, this connection is defined by

x0 = (1/5) a + (1/4) b
x1 = (4/5) a + (3/4) b
y0 = (2/5) a + (1/3) b
y1 = (3/5) a + (2/3) b

where a, b correspond to the two states of S. As a matter of fact, suppose we introduce 160 cells, 100 of type A and 60 of type B. This leads to considering the distribution D on S given by D = (100, 60) ∈ R².
The distribution on T defined by the connection is given by

Γ(D) = ((35, 125), (60, 100)) ∈ (R²) × (R²).

This reflects the fact that in the cell population (normalized to 160) we expect to even-
tually observe 35 cells with a thick membrane and 60 cells with a large nucleus.
If the experiment, more realistically, only manages to capture the percentages of cells
with the two characteristics combined, then we can consider a connection that links
S with the total correlation of T: indicating with x_00, x_01, x_10, x_11 the variables
corresponding to the four states of the only variable of the total correlation, such a
connection Φ' is defined by

$$\begin{cases} x_{00} = \frac{(\frac{1}{5}a + \frac{1}{4}b)(\frac{2}{5}a + \frac{1}{3}b)}{(a+b)^2} \\[4pt] x_{01} = \frac{(\frac{1}{5}a + \frac{1}{4}b)(\frac{3}{5}a + \frac{2}{3}b)}{(a+b)^2} \\[4pt] x_{10} = \frac{(\frac{4}{5}a + \frac{3}{4}b)(\frac{2}{5}a + \frac{1}{3}b)}{(a+b)^2} \\[4pt] x_{11} = \frac{(\frac{4}{5}a + \frac{3}{4}b)(\frac{3}{5}a + \frac{2}{3}b)}{(a+b)^2} \end{cases}$$

This connection, starting at D, determines the (approximate) probabilistic distribution
on the total correlation of T:

$$\Phi'(D) = (0.082, 0.137, 0.293, 0.488) \in \mathbb{R}^4.$$

From an algebraic point of view, an experiment will be in agreement with the model
if the observed percentages are exactly those described by the latter connection:
8.2% of cells with a thick membrane and a large nucleus, and so on.
In the real world, of course, some tolerance should be allowed for experimental
error. The control of this experimental tolerance will not be addressed in this
book, as it is part of standard statistical theories.
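
The computation of Φ(D) and Φ'(D) above is easy to reproduce. The following minimal sketch assumes a Python interpreter with NumPy (our own choice of tool and names, not the book's):

import numpy as np

a, b = 100.0, 60.0                # cells of type A and of type B

# the connection Phi: marginal counts for membrane (x) and nucleus (y)
x0 = a/5 + b/4                    # thick membrane
x1 = 4*a/5 + 3*b/4                # thin membrane
y0 = 2*a/5 + b/3                  # large nucleus
y1 = 3*a/5 + 2*b/3                # small nucleus
print((x0, x1), (y0, y1))         # ((35.0, 125.0), (60.0, 100.0))

# the connection Phi' to the total correlation: joint probabilities
joint = np.array([x0*y0, x0*y1, x1*y0, x1*y1]) / (a + b)**2
print(joint.round(3))             # [0.082 0.137 0.293 0.488]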

3.4 Toric Models and Exponential Matrices

Recall that a toric model is a parametric model on a system T corresponding to a
connection from S to T which is defined by monomials.
Definition 3.4.1 Let W be a toric model defined by a connection Φ from S to T. Let
s_1, ..., s_q be all possible states of all the variables of S and let t_1, ..., t_p be the states
of all the variables of T. One has, for every i, t_i = Φ_i(s_1, ..., s_q), where each Φ_i is
a monomial in the s_j.
We will call the exponential matrix of W the matrix E = (e_{ij}), where e_{ij} is the
exponent of s_j in t_i.
E is, therefore, a p × q array of nonnegative integers. We will call the complex
associated with W the subset of Z^q formed by the points corresponding to the rows
of E.

Proposition 3.4.2 Let W be a toric model defined by a monomial connection Φ from
S to T and let E be its exponential matrix.
Each linear relationship Σ a_i R_i = 0 among the rows R_i of E corresponds to
an implicit polynomial equation that is satisfied by all points of W.

Proof Consider a relation Σ a_i R_i = 0 among the rows of E. We associate to this
relation the polynomial equation

$$\prod_{a_i \ge 0} t_i^{a_i} - z \prod_{a_j < 0} t_j^{-a_j} = 0$$

where, indicating with c(Φ_i) the coefficient of the monomial Φ_i,

$$z = \frac{\prod_{a_i \ge 0} c(\Phi_i)^{a_i}}{\prod_{a_j < 0} c(\Phi_j)^{-a_j}}.$$

We verify that this polynomial relationship is satisfied by all points of W.
In effect, by replacing t_1, ..., t_p with their expressions in terms of Φ, we get two
monomials with equal exponents and opposite coefficients, which cancel out. □

Note that the polynomial equations obtained previously are, in fact, binomial.
Definition 3.4.3 The polynomial equations associated with the linear relations
between the rows of the exponential matrix of a toric model W define an alge-
braic model containing W. This model takes the name of algebraic model generated
by W.
It is clear from Example 3.3.10 that the algebraic model generated by a toric
model W always contains W, but it does not always coincide with W. Let us see a
couple of examples about this.

Example 3.4.4 We consider again the example of the model of independence on a
dipole S.
Resuming the terminology of the previous section, we indicate with t_1, ..., t_n the
states of the first variable and with u_1, ..., u_m the states of the second variable. The
resulting model is parametrically defined, on the total correlation of S, by y_{(t_i, u_j)} = t_i u_j.
It is, therefore, a toric model, whose exponential matrix is given by

$$\begin{pmatrix} R_1 \\ R_2 \\ \vdots \\ R_m \\ R_{m+1} \\ R_{m+2} \\ \vdots \\ R_{mn} \end{pmatrix} = \begin{pmatrix} 1 & 0 & \dots & 0 & 1 & 0 & \dots & 0 \\ 1 & 0 & \dots & 0 & 0 & 1 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & \dots & 0 & 0 & 0 & \dots & 1 \\ 0 & 1 & \dots & 0 & 1 & 0 & \dots & 0 \\ 0 & 1 & \dots & 0 & 0 & 1 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 1 & 0 & 0 & \dots & 1 \end{pmatrix}$$

from which we can read off all the relations between rows of the form

$$R_{qm+h} + R_{pm+k} = R_{qm+k} + R_{pm+h}$$

which correspond, as equations in R^{mn} = R^{m,n}, exactly to the vanishing of the
2 × 2 minors of the matrices in the model.
It follows that the algebraic model associated to this connection coincides with
the space of matrices of rank ≤ 1, which is exactly the image of the connection of
independence.

Example 3.4.5 Come back to the connection of Example 3.3.4. It defines a polyno-
mial parametric model W on R^3 given by the parametric equations

$$\begin{cases} x = a^2 \\ y = 2ab \\ z = b^2 \end{cases}$$

The associated exponential matrix is

$$\begin{pmatrix} 2 & 0 \\ 1 & 1 \\ 0 & 2 \end{pmatrix}$$

which satisfies, as unique relation between rows, R_1 + R_3 = 2R_2. Using the formula
for the coefficients, we get the equation in R^3:

$$4xz = y^2.$$

The algebraic model W' defined by this equation does not coincide with W. As a
matter of fact, it is clear that the points of W have nonnegative x, z, while the point
(−1, 2, −1) is in W'.
However, one has W = W' ∩ B, where B is the subset of the points of R^3 with non-
negative coordinates. In fact, if (x, y, z) is a point of B which satisfies the equation,
then, setting a = √x, b = √z, one has y = 2ab.
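
The relation between the parametric and the implicit description of W can be tested numerically. A small sketch, under the same assumptions as above (Python with NumPy; names are ours):

import numpy as np

rng = np.random.default_rng(0)
for _ in range(100):
    a, b = rng.normal(size=2)
    x, y, z = a**2, 2*a*b, b**2        # parametric equations of W
    assert np.isclose(4*x*z, y**2)     # the binomial equation of W'
    assert x >= 0 and z >= 0           # W lies in the region B

# the point (-1, 2, -1) satisfies the equation but is not in W
assert 4*(-1)*(-1) == 2**2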

Remark 3.4.6 The scientific method.
Given a parametric model, defined by a connection Φ from S to T, if we know an
"initial" distribution D over S (the experiment data) and we measure the distribution
D' = Φ(D) induced on T (the result of the experiment), we can easily check whether
the hypothesized model fits reality or not.
But if we have no way of knowing the distribution D and we can only measure
the distribution D', as happens in many real cases, then it would be a great help to
know some polynomial F that vanishes on the model, that is, to know its implicit
equations. In this case, the simple check of whether F(D') = 0 can give
us many indications: if the vanishing does not occur, our model is clearly inadequate;
if instead it occurs, it gives a clue in favor of the validity of the model.
If we moreover knew that the model is algebraic and we knew its equations, their
verification on many distributions, the results of experiments, would give good scientific
evidence for the validity of the model itself.
Chapter 4
Complex Projective Algebraic Statistics

No, we are not exaggerating. We are instead simplifying.
In fact, many of the phenomena associated with the main statistical models are
better understood if studied, at least at first, from the projective point of
view and over an algebraically closed numerical field.
The main link between Algebraic Statistics and Projective Algebraic Geometry
is based on the constructions of this chapter.

4.1 Motivations

We have seen that many models of interest in statistics are defined, in
the space of the distributions of a system S, by (polynomial) algebraic equations of
degree greater than one. To understand such models, a mathematical approach is to
study the subsets of a space defined by the vanishing of polynomials. This
is equivalent to studying the theory of solutions of systems of polynomial equations
of arbitrary degree, which goes under the name of Algebraic Geometry.
The methods of Algebraic Geometry are based on various theories: certainly on
Linear and Multi-linear Algebra, but also on ring theory (in particular, on the theory
of rings of polynomials) and on Complex Analysis. We will recall here only a part of
the main results that can be applied to statistical problems. It must be kept in mind,
however, that the theory in question is quite developed, and topics that will not be
introduced here could become important in the future also from a statistical point of
view.
The first step to take is to define the ambient space in which we work. Since we
study solutions of nonlinear equations, it is natural, from an algebraic point of view,
to pass from the real field, fundamental for applications but lacking some algebraic
properties, to the complex field which, being algebraically closed, allows a complete
understanding of the solutions of polynomial equations.
We will then have to expand, from a theoretical point of view, the space of dis-
tributions, to admit points with complex coordinates. These points, corresponding
to distributions with complex values, will allow us a more immediate character-
ization of the sets of solutions of algebraic systems. Naturally, when we want to
reread the results within common statistical theory, we will have to return to models
defined exclusively over the real field, by intersecting with the real space contained in
every complex space. This final step, which in general poses technical problems that
are absolutely nontrivial, can, however, be overlooked in a first reading, where the
indications that we will get over the complex numbers will still help us to understand
the real phenomena.
Once this first enlargement is accepted, to arrive at an even deeper understanding
of algebraic phenomena, it is worthwhile to take a second step, perhaps
apparently even more challenging: the passage from the affine spaces C^n to the associated
projective spaces.
The reasons for this second enlargement are justified, from a geometric point
of view, by the need to work in compact ambient spaces. Compactness is indeed an
essential property for our geometric understanding. In very descriptive terms, thanks
to the introduction of points at infinity, we will avoid losing solutions when,
passing to the limit, we encounter phenomena of parallelism or other asymptotic
phenomena. The possibility of following an argument through a passage to the limit is one
of the winning cards that geometry in the projective setting offers, compared to geometry
in the affine setting.
Of course, projective compactification, to make sense in statistical problems,
must be carried out properly, differentiating, for example, the passage to the limit of
the various random variables.
For those who find the use of homogeneous coordinates to describe distributions
on random systems too cumbersome, it is perhaps worth remembering
that a similar procedure is always present in statistics: normalization. In prac-
tice, if we have a distribution D on a random variable x having states s_1, ..., s_n, then
it is natural to replace D with the distribution D̄ obtained by dividing each D_x(s_i) by
the sampling Σ_j D_x(s_j) (in the event that x is not neutral with respect to D). Note
that, in doing so, in the space of distributions that concerns the variable x, we
replace the point (D_x(s_1), ..., D_x(s_n)) with the point (D̄_x(s_1), ..., D̄_x(s_n)). If,
in the affine space, the point is changed, in the projective space, where the n-tuple
represents homogeneous coordinates, passing to the normalization the point does not
change! In effect, every point of the projective space can always be represented by
homogeneous coordinates (a_1, ..., a_n) such that Σ a_j = 1.
From another point of view, classical statistical theory, in the space R^n of the
distributions of a random variable x as above, tends to restrict to the hyperplane
defined by the equation Σ a_j = 1, as we have seen, for example, in the statement
of Varchenko's Theorem. Instead, projective statistical theory works on distributions
up to scaling, hence it does not need this restriction, since from a projective point of
view normalization, like any other scaling, is irrelevant.
Therefore, it is quite simple to convince ourselves that, in the end, these are two
equivalent approaches. The difficulty of passing from one to the other basically resides
in habit. The advantage of using the projective language consists in gaining direct
access to the vast literature on Algebraic Geometry which, in many ways,
mainly uses this terminology.
During this chapter, we will use many topics from Part III. We strongly recom-
mend that readers who are not expert in Algebraic Geometry study those chapters
before starting the treatment of projective algebraic models.

4.2 Projective Algebraic Models

What is the link between all of this and Algebraic Statistics?
When we meet a distribution D ∈ D(S), where S is a system of random variables,
we are, in practice, collecting a set of observed data. For the purpose of our interpretation
of the data, the sampling of the variables is usually irrelevant (within certain reasonable
limits).
If, for example, we are evaluating the efficiency of a medicine, administering the
medicine to 100 sick people and having 50 healed suggests that the medicine is
efficient in about 50% of cases. From this point of view, we obtain the same
conclusion if we administer the medicine to 120 sick people and record 60 healed.
So, in our setting, we will consider that a distribution gives the same information
on the analyzed phenomenon as any of its scalings.
In Statistics, the problem is solved by choosing, among all possible scalings of a
distribution, the associated probabilistic distribution, introduced in Definition 1.3.2.
Such a distribution is uniquely determined, given a distribution D, but it exists only
when all variables have sampling different from 0 in D.
The associated probabilistic distribution lies in a linear subspace of D(S) =
K^{a_1} × · · · × K^{a_n}, defined by the vectors v = ((v_{11}, ..., v_{1a_1}), ..., (v_{n1}, ..., v_{na_n}))
satisfying the equations

$$v_{11} + \dots + v_{1a_1} = \dots = v_{n1} + \dots + v_{na_n} = 1.$$

Instead, in our setting, we prefer to consider the multi-projective space of distri-
butions.
Definition 4.2.1 Given a system S, with random variables x_1, ..., x_n, we call multi-
projective space of distributions the multi-projective space

$$P(D(S)) = \mathbb{P}^{a_1} \times \dots \times \mathbb{P}^{a_n}$$

where x_i has a_i + 1 states (note the shift by 1!).
The elements of P(D(S)) are thus equivalence classes, each of them containing a
distribution D and all of its scalings. In this way we can link more easily (Projective)
Algebraic Geometry with the study of significant statistical models.
In this new view, only the statistical models which are independent of the scaling
(multicones) in the space of distributions are significant.
Since the overwhelming majority of important models (if correctly interpreted)
are independent of the scaling, hence are cones, the depth of investigation of our theory
will not be affected.
Definition 4.2.2 A K-projective model on S is a subset of P(D_K(S)). A K-projective
algebraic model on S is a model corresponding to a (multi)projective variety, i.e.,
defined by the vanishing of multi-homogeneous polynomials.
There is a natural surjection from D(S) \ {0} to P(D(S)). The preimage of a projective
model on S, under such projection, is a model M on S with the following property:

if D ∈ M and D' is a scaling of D, with D' ≠ 0, then D' ∈ M.

Example 4.2.3 The independence model can be thought of as a projective algebraic
model, because it is defined by many multi-homogeneous equations (see Theorem
6.4.13).
A linear model is a projective algebraic model when its defining linear polynomials
do not have a constant term.

Example 4.2.4 The projective algebraic models on a system S with a single variable
(as in the case of a total correlation) are strictly related to the cones of the
vector space D(S). Every cone defines a projective model on S. Vice versa, given a
projective model on S, its preimage under the projection D(S) → P(D(S)) is a cone.

Example 4.2.5 Consider the system S formed by two ordinary dice. The projective
space of distributions is P^5 × P^5. Instead, the projective space of distributions of
the total correlation of S is P^35, corresponding to the space of 6 × 6 matrices.
If S is a system where the two variables represent a die and a coin, the projective
space of distributions is P^5 × P^1. In this case, it is easy to observe that the only
variable of the total correlation of S has 12 states, hence its projective space of
distributions is P^11.
If a system S has n Boolean variables, then its projective space of distributions is
a product of n copies of P^1, and the projective space of distributions of the total
correlation of S is P^{2^n − 1}.

4.3 Parametric Models

We are now able to define projective parametric models. For this aim, we will
use Definition 9.3.1 of a projective map, in Sect. 9.3.
Definition 4.3.1 If S, T are random systems, we call projective connection any pro-
jective map Φ : P(D(S)) → P(D(T)). Notice, in particular, that if Φ is a projective
connection, then the image of any scaling D' of a distribution D is a scaling of Φ(D).
We say that a model M is projective parametric if it is the image Φ(P(D(S))) of a
projective connection Φ.
Many interesting parametric models have a projective parametric counterpart.

Example 4.3.2 The independence model is projective parametric. As a matter of fact,
let S be a system with random variables x_1, ..., x_n and let a_i + 1 be the number of
states of the variable x_i. Hence, the total correlation of S has a unique variable, with
Π(a_i + 1) states.
The model of independence of S corresponds to the map

$$P(D(S)) = \mathbb{P}^{a_1} \times \dots \times \mathbb{P}^{a_n} \to \mathbb{P}^M$$

(M = −1 + Π(a_i + 1)) defined by




$$t_{i_1, \dots, i_n} = s_{1 i_1} s_{2 i_2} \cdots s_{n i_n}$$

where we have numbered the coordinates of an element of the target space, as usual,
identifying this element as a tensor.
It is evident from the very definition that the model of independence corresponds
to a Segre variety (compare with Definition 10.5.9).
Notice that, in general, M is much bigger than Σ a_i. For example, if n = 3 and a_1 = a_2 =
a_3 = 3, then M = 63 and the model corresponds to the Segre variety of P^3 × P^3 × P^3
embedded in P^63.
We recall that the product P^1 × P^1 is not isomorphic to P^2. Through the Segre
map, P^1 × P^1 corresponds to a surface in P^3, the image of the map

$$((x_1, x_2), (y_1, y_2)) \mapsto (x_1 y_1, x_1 y_2, x_2 y_1, x_2 y_2)$$

that is, in parametric form,

$$\begin{cases} a_{11} = x_1 y_1 \\ a_{12} = x_1 y_2 \\ a_{21} = x_2 y_1 \\ a_{22} = x_2 y_2 \end{cases}$$

This surface, which represents the (projective) model of independence of a Boolean
system with two variables, is defined by a single equation (the determinant of the corre-
sponding 2 × 2 matrix): a_{11} a_{22} = a_{12} a_{21}.
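
A quick numerical confirmation of the last statement (our own sketch, same Python/NumPy assumptions as before): every point in the image of the Segre map is a rank-1 matrix and satisfies the determinantal equation.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=2)
y = rng.normal(size=2)
A = np.outer(x, y)                          # a_{ij} = x_i * y_j
assert np.isclose(A[0, 0]*A[1, 1], A[0, 1]*A[1, 0])
assert np.linalg.matrix_rank(A) <= 1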
Example 4.3.3 On a random system with three variables x_1, x_2, x_3, the model without
triple correlation of Example 3.2.5 is not, strictly speaking, projective parametric.
In fact, taking up the terminology of that example, this model is defined by con-
sidering the system S' given by the union of the total correlations of the three sub-
systems of S that are obtained by deleting one of the variables in turn. S' also has
three variables, corresponding to (x_1, x_2), (x_1, x_3), and (x_2, x_3). The models without
triple correlation are obtained from the connection from S' to the total correlation of S,
which sends every triplet of matrices (A, B, C) ∈ D(S'), with A ∈ C^{d_1, d_2}, B ∈ C^{d_1, d_3},
C ∈ C^{d_2, d_3}, to the tensor D defined by

$$D(i, j, k) = A(i, j)B(i, k)C(j, k).$$

It is clear that all components of this map are multi-homogeneous of the same degree,
but, in general, they do not define a map

$$\mathbb{P}^{d_1 d_2 - 1} \times \mathbb{P}^{d_1 d_3 - 1} \times \mathbb{P}^{d_2 d_3 - 1} \to \mathbb{P}^{d_1 d_2 d_3 - 1},$$

because even if A, B, C are all different from zero, it is not guaranteed that their
image is not zero.
In order to obtain our model as the image of a well-defined projective map, we
must restrict to a subvariety of P^{d_1 d_2 − 1} × P^{d_1 d_3 − 1} × P^{d_2 d_3 − 1}.
For instance, when the three variables are Boolean, we must restrict the model to
a suitable model X of distributions on S', so that we get a well-defined map from
a variety X ⊂ P^3 × P^3 × P^3 to P^7. This map is obtained by composing the Segre
map P^3 × P^3 × P^3 → P^63 with a suitable projection P^63 → P^7. It is a matter of
computations that the subvariety X should not intersect a configuration of products
of linear spaces, containing, for instance,

L_1 × P^3 × L_1'  where  L_1 = {x_1 = x_3 = 0},  L_1' = {z_3 = z_4 = 0},
L_2 × L_2' × P^3  where  L_2 = {x_1 = x_2 = 0},  L_2' = {y_3 = y_4 = 0},
...

The fact that the image of a Segre map can be interpreted as the (projective) model of
independence of a random system, by Theorem 6.4.13, guarantees that the Segre
varieties are all projective varieties.
Let us see how, in general, there are projective parametric models over R which are not
algebraic models.
Example 4.3.4 Consider two systems S, S', both with a single random variable.
We identify both projective spaces of distributions over R, P(D(S)) and
P(D(S')), with P^1_R. We can define a projective connection Φ : P(D(S)) → P(D(S'))
by setting Φ(x_0, x_1) = (x_0^2, x_1^2).
It is easy to check that the image W of Φ contains infinitely many points of P^1_R.
However, it does not contain all points: as a matter of fact, the point with homogeneous
coordinates (1, −1) is not in the image.
On the other hand, each projective variety in P^1_R, being defined by the vanishing
of homogeneous polynomials in two variables, either coincides with P^1_R or contains
only a finite number of points.
So W cannot be a projective variety.
Example 4.3.5 Let us go back to the situation of Example 3.3.4.
Recall that the initial situation corresponds to a Boolean system S with one variable
(with states A, B), while the final situation corresponds to a system S' with only one
variable that can take the 3 values AA, AB, BB.
The connection Φ, defined by Φ(a, b) = (a^2, 2ab, b^2), is clearly a projective map
between P(D(S)) = P^1_R and P(D(S')) = P^2_R. The image corresponds to the subset
W ⊂ P^2_R defined by the points satisfying the equation y^2 = 4xz.
It should be noted, however, that not all the homogeneous coordinates of these
points can be obtained through the map. In fact, the point P of coordinates (1, 2, 1) is in
the image (we get it for (a, b) = (1, 1)), but no pair of R^2 gives (−1, −2, −1), which
are also coordinates of P.

From Chow's Theorem (see Theorem 10.6.3), it immediately follows:

Theorem 4.3.6 Each projective parametric model is a projective algebraic model.

This theorem generalizes the situation already seen for the model of independence
and explains how each projective parametric model can be defined by homogeneous
polynomial equations.
The proof of Chow's Theorem also explains theoretically how the homogeneous
equations of a projective parametric model can be found.
In practice, as one can imagine, when the number of variables grows, it is not easy
to follow the procedure step by step and find an effective set of equations, even with
the help of computational tools. The use of Groebner bases, which we will introduce
later, allows an optimization of the procedure (see Chap. 13).
The advantage of presenting a model with homogeneous equations (implicit equa-
tions), rather than through parametric equations, should instead be clear in the daily
practice of Algebraic Statistics: to test whether a given phenomenon, i.e., a given
distribution, falls within the model imagined by a theory (in more imaginative words:
whether an experiment confirms a theory or not), once the implicit equations are known,
it is sufficient to check whether the distribution satisfies them. Such a computation is
elementary, for every single equation. In everyday practice, the complication derives
only from the fact that normally each model is described by an astronomical number
of equations, sometimes with approximate coefficients. However, these problems
can be managed by methods of sample searching and error checking.
If instead we have to show that a given distribution belongs to a model of which
only parametric equations are known, the problem becomes showing the existence of
parameters for which the parameterization returns the given distribution. Such
a problem of existence is extremely difficult to control, even in the presence of a
few, precise equations. Imagine when there are thousands of equations, with approximate
coefficients!
Chapter 5
Conditional Independence

An intermediate case between total independence and generic dependence of random
variables concerns the so-called conditional independence.
To understand the practical meaning of models of conditional independence, let us
start with two examples. In the examples, we will use the fact that the naïve notion of
independence between two variables corresponds to distributions whose associated
matrices have rank 1 (see Definition 3.2.1 and the discussion thereafter), while
conditional independence corresponds to having rank 1 in the slices of the associated
tensor.

Example 5.0.1 Let us take up an example, presented by B. Sturmfels in a confer-
ence, and cited as an urban legend.
In England, a magazine specializing in curious statistics commissioned a study
on the following problem: does being a football fan increase hair loss?
The authors of the study interviewed many people, recording the answers to two
questions:
(A) Are you a football fan? (possible replies: 1 = no, 2 = a little, 3 = a lot).
(B) Do you lose your hair? (possible replies: 1 = no, 2 = a little, 3 = a lot).
The results were then listed in the following 3 × 3 matrix:

         B\A    1    2    3
          1    72   41   15
    D =   2    60   55   45
          3    40   70   82

As we can easily see, the matrix does not have rank 1.


We interpret the fact in terms of Algebraic Statistics.
The starting random system S includes two variables (A = football fan, B = hair
loss), each with three states. D is the distribution on the total correlation of S that
arises from the investigation. Since D does not have rank 1, that is, it does not belong
to the model of independence, the two variables are not independent.
In other words, being a football fan affects hair loss.
The result is surprising, although it is the unequivocal consequence of the collected
data, and the magazine launched into a series of interpretations of the case.
In reality, the true interpretation was very simple. A clue to the solution of the
mystery was contained in the fact that the matrix D has rank 2.
In fact, the magazine had mixed, in the results of the investigation, the data related
to two distinct groups: Men and Women. The M group is more prone to being
football fans and to losing hair compared to the W group. The lack of homogeneity
of the sample led to a false result: indeed, dividing the results of the investigation
with respect to an additional Boolean variable (the gender G of the sample) one gets
a 3 × 3 × 2 tensor whose scan along the third index (front–back) is made of two
matrices of rank 1.
The scan of T consists of the two matrices

$$M_1 = \begin{pmatrix} 70 & 35 & 7 \\ 50 & 25 & 5 \\ 20 & 10 & 2 \end{pmatrix}, \qquad M_2 = \begin{pmatrix} 2 & 6 & 8 \\ 10 & 30 & 40 \\ 20 & 60 & 80 \end{pmatrix},$$

each of rank 1.
The previous matrix D represents the marginalization of T along the gender variable G.
So D is the sum of the two rank-1 matrices M_1, M_2, and in fact it has rank 2.
Note that the two starting variables A, B are really dependent on each other, as
follows: if a person is subject to hair loss, he is more likely to be a man, so he is more
likely to be a football fan (in the example cited, in fact a bit dated, it was taken as a
hypothesis that men are more susceptible than women to hair loss, and that they are also
more likely to follow football).
The fact that D has rank smaller than the maximum, though it does not imply inde-
pendence of the two variables, should suggest to researchers a connection between
the two variables, mediated by a hidden variable G.
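
The ranks appearing in this example are easy to check; here is a sketch under the usual Python/NumPy assumptions (the split of T into the two gender slices follows the discussion above):

import numpy as np

M1 = np.array([[70, 35, 7], [50, 25, 5], [20, 10, 2]])    # one gender group
M2 = np.array([[2, 6, 8], [10, 30, 40], [20, 60, 80]])    # the other group
D  = M1 + M2                                              # marginalization along G

assert np.linalg.matrix_rank(M1) == 1
assert np.linalg.matrix_rank(M2) == 1
assert (D == [[72, 41, 15], [60, 55, 45], [40, 70, 82]]).all()
assert np.linalg.matrix_rank(D) == 2      # rank 2: A, B are not independent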

A similar example is the following.



Example 5.0.2 This example is another classic of algebraic-statistical studies: a
piece of scientific research that leads to a result which is only apparently significant.
Osteoporosis is a bone disease that mainly affects elderly people. Let us face
the problem: does having a driving license affect the vulnerability to osteoporosis? The
question is apparently idiotic: how can the sensitivity to a bone disease be influenced
by the possession of a license? Yet, paradoxically, the results would seem to state
the opposite.
In fact, a researcher builds a system formed by two Boolean variables to study the
phenomenon: possession of the license and the state of illness. Then he considers a
population of older people, say 100 individuals, examines them with respect to the
possession of the license and the state of the bones, and builds a distribution D on
the total correlation. The result is expressed by the matrix:

$$\begin{pmatrix} 13 & 37 \\ 22 & 28 \end{pmatrix}$$

The matrix expresses the fact that 13 people have at the same time a driving license
and osteoporosis, 37 have a license but not osteoporosis, etc.
The result is incontrovertible! The matrix of D has determinant −450, so it is far
away from having rank 1. Therefore, there is a correlation between having a driving
license and contracting osteoporosis. In the specific case it is clear, from the examina-
tion of the results, that having a driving license makes the manifestation of osteoporosis
less likely. A great unexpected discovery! Such research risks ending up in some
serious scientific journal (let's hope not!) and being picked up by media across half the
world. It could create unfounded expectations of healing, with crowds of old men
and old ladies storming the driving schools. There would be clinicians
ready to explain that driving vehicles, because of the movement of the pedals and steering
wheel, is a beneficial workout that tones the bones and makes them more resistant
to osteoporosis.
Unfortunately, we have to dampen easy enthusiasms, because the reality is a
little different.
The weak point of the statistical experiment lies in the fact that the chosen sample
is not homogeneous. In fact, among the selected individuals, elderly men and women
are mixed. If the sample selection is random, it is likely that there is an equal
split: 50 men and 50 women. However, osteoporosis does not affect the two sexes in
a homogeneous way: women are a lot more susceptible to the disease than men. On
the other hand, especially in the elderly population, it is much more usual for a man
to have a driving license than for a woman.
The situation is clarified if the chosen random system has 3 variables: to the
possession of the license x_1 and the osteoporosis x_2 we add the Boolean variable
x_3 indicating the sex (0 = man, 1 = woman). In the total correlation of this system,
whose distributions are tensors of dimension 3 and type (2, 2, 2), the real distribution is:
$$D' : \quad \begin{pmatrix} 8 & 32 \\ 2 & 8 \end{pmatrix} \ (x_3 = 0), \qquad \begin{pmatrix} 5 & 5 \\ 20 & 20 \end{pmatrix} \ (x_3 = 1)$$

(the two slices along the gender variable x_3),

which does not have rank 1, as there are submatrices with determinant different from 0.
The tensor tells us that x_1 is independent from x_2 given x_3 (x_1 ⊥ x_2 | {x_3}), because the
two matrices above both have rank 1, i.e., fixing the male or the female population,
in both cases we see that the possession of the license does not affect the likelihood of
contracting osteoporosis, as was widely expected.
Note that D represents the marginalization of D' along x_3, and therefore it is not true
that x_1 ⊥ x_2. In other words, x_1 and x_2 are actually dependent on each other. What is
the meaning of this statement? It must be read this way. Take a subject z who
has a driving license. As the percentage of licensed persons who are men rather than
women is higher, it is more likely that z is a man. As such, z is less likely to develop
osteoporosis. Conversely, if a subject has osteoporosis, the subject is more likely to be female,
and so less likely to have a driving license.
Our perception still remains a bit perplexed. The reason lies in the psychological
fact that the property of being a man or a woman, for an individual, is obviously seen
as far more fundamental than having a driving license or developing
osteoporosis.
The previous examples explain the usefulness of introducing the concept of con-
ditional independence of random variables and also the concept of hidden variables.

5.1 Models of Conditional Independence

In this section, we introduce the concept of conditional independence and we show
its basic properties.

Remark 5.1.1 As usual, we are more interested in showing the geometric or algebraic
structure of probability distributions, hence the explicit computation of conditional
probabilities is outside our scope. Consequently, we will not introduce, in a formal
way, the celebrated Bayes formula (but see Example 5.1.13 for an instance of how
the formula can be recovered in our setting).

We will refer to the concepts of Tensor Algebra contained in Chap. 8 about scans
and contractions. In particular, see Definitions 8.2.1 and 8.3.1.
Definition 5.1.2 Let S be a random system with variables x_1, ..., x_n.
Let A ⊂ {1, ..., n}. A distribution D on the total correlation of S satisfies the
condition ⊥A (we also say that A is independent) if the contraction of D along {1, ..., n} \
A has rank 1.
Given B = {1, ..., n} \ A, we will say that D satisfies the condition ⊥A|B (which
means: A is independent, given B) if all the elements of the scan of D along B have
rank 1.

The previous definitions can be generalized and combined in the following.

Definition 5.1.3 If A, B are disjoint subsets of {1, ..., n}, we will say that D satisfies
the condition ⊥A|B (A is independent, given B) if, given C = {1, ..., n} \ (A ∪ B),
the contraction D' of D along C satisfies the ⊥A|B condition, that is, all the elements
of the scan of D' along B have rank 1.
When A = {x_i, x_j} has two elements, we will also write x_i ⊥ x_j and x_i ⊥ x_j | B
instead of ⊥A and ⊥A|B, respectively. It is clear that ⊥A is equivalent to ⊥A|B with
B = ∅.

Example 5.1.4 The 2 × 2 × 2 tensor D whose slices along x_3 are

$$\begin{pmatrix} 1 & 3 \\ 1 & -1 \end{pmatrix}, \qquad \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

describes a distribution on the total correlation of a Boolean system with three vari-
ables x_1, x_2, x_3.
In D, one has x_2 ⊥ x_3, since the contraction along x_1 gives

$$\begin{pmatrix} 2 & 2 \\ 1 & 1 \end{pmatrix}.$$

However, in D, x_1 ⊥ x_3 does not hold, since the contraction along x_2 gives

$$\begin{pmatrix} 4 & 1 \\ 0 & 1 \end{pmatrix}.$$
Example 5.1.5 The 2 × 2 × 2 tensor D whose slices along x_3 are

$$\begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}, \qquad \begin{pmatrix} 6 & 3 \\ 2 & 2 \end{pmatrix}$$

on the total correlation of a Boolean system with three variables x_1, x_2, x_3, describes
a distribution for which one has (x_2 ⊥ x_3 | {x_1}), since the two elements of the scan
along x_1 (obtained by fixing x_1 = 0 and x_1 = 1) have rank 1. Notice that in D, x_1 ⊥ x_2
does not hold, since the contraction along x_3 gives

$$\begin{pmatrix} 8 & 4 \\ 3 & 3 \end{pmatrix}$$

of rank 2.
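
Contractions and scans as in the last two examples can be computed with sums over tensor axes. A sketch (Python/NumPy, with our own index convention D[i, j, k] for the states of x_1, x_2, x_3):

import numpy as np

# Example 5.1.4
D = np.zeros((2, 2, 2))
D[:, :, 0] = [[1, 3], [1, -1]]   # the two slices along x3
D[:, :, 1] = [[1, 0], [0, 1]]
print(np.linalg.matrix_rank(D.sum(axis=0)))   # contraction along x1: rank 1
print(np.linalg.matrix_rank(D.sum(axis=1)))   # contraction along x2: rank 2

# Example 5.1.5
E = np.zeros((2, 2, 2))
E[:, :, 0] = [[2, 1], [1, 1]]
E[:, :, 1] = [[6, 3], [2, 2]]
print([np.linalg.matrix_rank(E[i]) for i in range(2)])   # scan along x1: [1, 1]
print(np.linalg.matrix_rank(E.sum(axis=2)))              # contraction along x3: rank 2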

Examples 5.0.1 and 5.0.2 represent situations where two initial variables are
independent, given the third one (the gender).

Example 5.1.6 Consider the transmission chain of a Boolean signal, with a head-
quarters A and two locations B, C that are not connected to each other, represented
by the oriented graph

B ← A → C

Suppose that the following matrices are associated with the edges AB, AC:

$$M_{AB} = \begin{pmatrix} 2/3 & 1/3 \\ 1/3 & 2/3 \end{pmatrix}, \qquad M_{AC} = \begin{pmatrix} 4/5 & 1/5 \\ 1/5 & 4/5 \end{pmatrix}$$

These matrices represent the transmission of the signal, in the sense that if A transmits
the signal 0 30 times, B transcribes the distribution M_AB · (30, 0) = (20, 10), that is,
it transcribes the signal 0 20 times and the signal 1 10 times. Similarly, C transcribes
the distribution M_AC · (30, 0) = (24, 6).
If A sends a signal of 30 bits 0 and 30 bits 1, the distribution resulting from the
graphical model, in the Boolean system with three variables A, B, C, is given by the
2 × 2 × 2 tensor whose slices along C are

$$\begin{pmatrix} 16 & 8 \\ 2 & 4 \end{pmatrix} \ (C = 0), \qquad \begin{pmatrix} 4 & 2 \\ 8 & 16 \end{pmatrix} \ (C = 1)$$

(rows indexed by A, columns by B).
Observe that the tensor does not have rank 1; in fact, the three variables are not independent.
On the other hand, the two submatrices obtained by fixing A = 0 and A = 1 both
have null determinant, so that (B ⊥ C | A) holds. Instead, the marginalization of the
tensor in the direction of A gives the matrix

$$\begin{pmatrix} 12 & 18 \\ 18 & 12 \end{pmatrix}$$

which has rank 2, so B ⊥ C does not hold.


In fact, if one does not consider the contribution of A, the fact that B receives 0
makes it likely that the bit emitted was really 0, so it is more likely that C also
receives a 0. So, if we do not fix the status of A, then B and C are actually dependent.
On the other hand, if we know the status of A, then B and C can receive correctly or
erroneously, independently of each other.
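
The whole example can be regenerated from the two Jukes–Cantor matrices; the following sketch (Python/NumPy, names ours) builds the tensor D[a, b, c] and checks both claims:

import numpy as np

M_AB = np.array([[2/3, 1/3], [1/3, 2/3]])
M_AC = np.array([[4/5, 1/5], [1/5, 4/5]])
sent = np.array([30, 30])            # A sends 30 zeros and 30 ones

D = np.zeros((2, 2, 2))
for a in range(2):
    for b in range(2):
        for c in range(2):
            D[a, b, c] = sent[a] * M_AB[b, a] * M_AC[c, a]

print([np.linalg.matrix_rank(D[a]) for a in range(2)])   # [1, 1]: (B ⊥ C | A)
print(np.linalg.matrix_rank(D.sum(axis=0)))              # 2: B ⊥ C fails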

Definition 5.1.7 The matrices used in the previous example are of a type widely
used in applications of Algebraic Statistics, especially in the theory of strings of
symbols (digital signals, DNA, etc.). They are called Jukes–Cantor matrices.
In general, a Jukes–Cantor matrix is a square n × n matrix whose diagonal ele-
ments are all equal to a value a, while all the other elements are equal to a value
b.
These matrices represent the fact that, for example, in the transmission of a signal,
if the transmitter A emits a value x_i, the probability that the station B receives
x_i correctly is proportional to a, independently of x_i, while the probability of
receiving a wrong digit (proportional to (n − 1)b) is distributed uniformly over all the
other values x_j, j ≠ i.

Proposition 5.1.8 Let M be a real-valued Jukes–Cantor matrix of order n × n,
having a on the main diagonal and b elsewhere, with a > b > 0. Then M has
rank n.

Proof We prove the statement by induction on n. The cases n = 1, 2 are trivial.
For generic n, notice that by deleting the last row and column we get a Jukes–Cantor
matrix of order (n − 1) × (n − 1). Hence, we can suppose, by induction, that the first
n − 1 rows of M are linearly independent.
If the last row R_n is a linear combination of the previous ones, that is, there exists
a relation

$$R_n = a_1 R_1 + \dots + a_{n-1} R_{n-1},$$

then comparing the last entries we get a = (a_1 + ... + a_{n−1})b, from which (a_1 +
... + a_{n−1}) > 1. Thus, at least one of the a_i's is positive. Suppose that a_1 > 0.
Comparing the first entries of the rows, we get

$$b = a_1 a + (a_2 + \dots + a_{n-1})b > (a_1 + a_2 + \dots + a_{n-1})b > b,$$

giving a contradiction. □
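
The proposition can also be illustrated numerically, as a sanity check (a sketch under the usual Python/NumPy assumptions):

import numpy as np

def jukes_cantor(n, a, b):
    return b * np.ones((n, n)) + (a - b) * np.eye(n)

for n in range(1, 7):
    M = jukes_cantor(n, a=0.7, b=0.1)     # a > b > 0
    assert np.linalg.matrix_rank(M) == n  # full rank, as stated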

Example 5.1.9 Let us go back to Example 2.3.26 of the school with two sections
A, B, where scholarships are distributed. Let us say the situation after 25 years is
given by

$$D = \begin{pmatrix} 9 & 6 \\ 6 & 4 \end{pmatrix}.$$

The matrix defines a distribution on the total correlation of the Boolean system which
has two variables A, B, corresponding to the two sections. As the matrix has rank
1, this distribution indicates independence between the possibilities of A and B to get a
scholarship.
We introduce a third random variable N, which is 0 if the year is normal, i.e., one
scholarship is assigned, and 1 if the year is exceptional, that is, either 2 scholarships are
distributed or no scholarship is distributed at all. In the total correlation of the new
system, we necessarily obtain the distribution defined by the tensor

$$D' : \quad \begin{pmatrix} 0 & 6 \\ 6 & 0 \end{pmatrix} \ (N = 0), \qquad \begin{pmatrix} 9 & 0 \\ 0 & 4 \end{pmatrix} \ (N = 1)$$

(the two slices along the variable N),

since in the normal years only one of the two sections has the scholarship, something
that cannot happen in exceptional years.
The tensor D' clearly does not have rank 1. Also note that the elements of the
scan of D' along N do not have rank 1. As a matter of fact, both in exceptional years
and in normal years, knowing whether section A has had the scholarship or not completely
determines whether or not B has had it.
On the other hand, A ⊥ B holds, because the marginalization of D' along the variable
N gives the matrix D, which is a matrix of independence.

Definition 5.1.10 Given a set of conditions ⊥A_i|B_i as above, the distributions
that satisfy all of them form a model in D(S). These models are called models of
conditional independence.

Proposition 5.1.11 All the models of conditional independence are homogeneous
algebraic models, defined by equations of degree ≤ 2.
All the models of conditional independence are polynomial parametric models.
Each model defined by a single condition ⊥A|B is a toric model, up to a
homogeneous change of coordinates.

Proof By Theorem 6.4.13, we know that imposing that a tensor has rank 1 corre-
sponds to the vanishing of certain 2 × 2 determinants. The equations obtained
are homogeneous polynomials (of degree two). Therefore, every condition ⊥A|B is
defined by the composition of quadratic equations with a marginalization, hence by
the composition of quadratic and linear equations. Therefore, the resulting model
is algebraic. To prove the second statement, we note that if D satisfies a condition
⊥A|B with A ∪ B = {1, ..., n} (i.e., no marginalization), then for each element D'
of the scan of D along B there must exist v_1, ..., v_a, where a is the cardinality of A,
such that D' = v_1 ⊗ · · · ⊗ v_a. It is clear that such a condition is polynomial parametric,
in fact toric. When A ∪ B ≠ {1, ..., n}, the same fact holds on the coefficients obtained
from the marginalization, which depend linearly on the coefficients of D. □

Example 5.1.12 Consider a Boolean system S with three variables {x_1, x_2, x_3}, so
that the distribution space corresponds to the space of tensors of type (2, 2, 2).
The model determined by (x_1 ⊥ x_2 | x_3) contains all distributions D satisfying

$$\begin{cases} D(1,1,1)D(1,2,2) - D(1,2,1)D(1,1,2) = 0 \\ D(2,1,1)D(2,2,2) - D(2,2,1)D(2,1,2) = 0 \end{cases}$$

The same model can be described in parametric form by

$$\begin{cases} D(1,1,1) = ac \\ D(1,2,1) = ad \\ D(1,1,2) = bc \\ D(1,2,2) = bd \\ D(2,1,1) = a'c' \\ D(2,2,1) = a'd' \\ D(2,1,2) = b'c' \\ D(2,2,2) = b'd' \end{cases}$$

showing that it is toric.
The model determined by x_1 ⊥ x_2 contains all distributions D satisfying

$$(D(1,1,1) + D(2,1,1))(D(1,2,2) + D(2,2,2)) - (D(1,2,1) + D(2,2,1))(D(1,1,2) + D(2,1,2)) = 0 \quad (5.1.1)$$

that is, all the distributions defined by

$$\begin{cases} D(1,1,1) + D(2,1,1) = ac \\ D(1,2,2) + D(2,2,2) = bd \\ D(1,2,1) + D(2,2,1) = ad \\ D(1,1,2) + D(2,1,2) = bc \end{cases}$$

hence it corresponds to the polynomial parametric model

$$\begin{cases} D(1,1,1) = x \\ D(2,1,1) = ac - x \\ D(1,2,2) = y \\ D(2,2,2) = bd - y \\ D(1,2,1) = z \\ D(2,2,1) = ad - z \\ D(1,1,2) = t \\ D(2,1,2) = bc - t \end{cases}$$

This last parameterization, in the new coordinates D'(i, j, k), with D'(1, j, k) =
D(1, j, k), D'(2, j, k) = D(1, j, k) + D(2, j, k), becomes

$$\begin{cases} D'(1,1,1) = x \\ D'(2,1,1) = ac \\ D'(1,2,2) = y \\ D'(2,2,2) = bd \\ D'(1,2,1) = z \\ D'(2,2,1) = ad \\ D'(1,1,2) = t \\ D'(2,1,2) = bc \end{cases}$$

which represents a toric model.
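
Both parameterizations are straightforward to test. A sketch (Python/NumPy, our names): random parameters are substituted into the parametric form of (x_1 ⊥ x_2 | x_3), and the two quadratic equations are verified.

import numpy as np

rng = np.random.default_rng(2)
a, b, c, d, a1, b1, c1, d1 = rng.normal(size=8)

D = np.zeros((2, 2, 2))
D[0] = [[a*c, a*d], [b*c, b*d]]          # slice for the first state of x1
D[1] = [[a1*c1, a1*d1], [b1*c1, b1*d1]]  # slice for the second state

for i in range(2):
    det = D[i, 0, 0]*D[i, 1, 1] - D[i, 1, 0]*D[i, 0, 1]
    assert np.isclose(det, 0)            # the two defining equations hold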

Example 5.1.13 (Bayes formula) We illustrate in this setting a special instance of
Bayes' formula.
Consider a system in which A sends a digital signal to B. The connection between
A and B is not perfect, so that what B receives is modified by multiplication by
a Jukes–Cantor matrix

$$\begin{pmatrix} a & b \\ b & a \end{pmatrix}.$$

We assume that A sends a string of {0, 1}, containing α occurrences of 0 and β
occurrences of 1.
Write α' = α/(α + β) and β' = β/(α + β), so that α' + β' = 1. Also, after scal-
ing, we may assume that the Jukes–Cantor matrix satisfies a + b = 1.
It follows from the setting that B receives a string with about αa + βb 0's and
αb + βa 1's. Notice that (α'a + β'b) + (α'b + β'a) = 1.
The probability that A sends 0 is thus p(A) = α', while the probability that B
receives 0, assuming that A sends 0, is p(B|A) = a. The probability that B receives
0 is

$$p(B) = \frac{\alpha a + \beta b}{(\alpha a + \beta b) + (\alpha b + \beta a)} = \frac{\alpha a + \beta b}{(\alpha + \beta)(a + b)} = \alpha' a + \beta' b$$

while the probability that A did send 0 when B receives 0 is

$$p(A|B) = \frac{\alpha a}{\alpha a + \beta b} = (\alpha' a)\,\frac{\alpha + \beta}{\alpha a + \beta b} = \frac{\alpha' a}{\alpha' a + \beta' b}.$$

It follows that

$$p(A)\, p(B|A) = p(B)\, p(A|B).$$
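
The closing identity can be verified directly; a sketch in plain Python (the numbers are our own choices):

alpha, beta = 100.0, 60.0      # occurrences of 0 and 1 sent by A
a, b = 0.9, 0.1                # Jukes-Cantor entries, a + b = 1

ap = alpha / (alpha + beta)    # alpha'
bp = beta / (alpha + beta)     # beta'

pA = ap                        # A sends 0
pB_given_A = a                 # B receives 0, given that A sent 0
pB = ap*a + bp*b               # B receives 0
pA_given_B = ap*a / (ap*a + bp*b)

assert abs(pA*pB_given_A - pB*pA_given_B) < 1e-12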

5.2 Markov Chains and Trees

Among all the situations concerning conditional independence, an important special
case is represented by Markov chains.
In common practice, a Markov chain is a random system in which the variables are strictly
ordered and the state assumed by each variable is determined exclusively by the
state assumed by the previous variable.
If exclusivity is intended strictly, other conditions, such as time or factors external
to the system, do not influence the passage from a variable to the next one.
Therefore, if in a distribution D, with sampling c, the variable x_i is always
in the state σ and the variable x_{i+1} is d times in the state ξ, then in a distribution of
sampling 2c, in which the variable x_i is always in the state σ, the variable x_{i+1} must
assume 2d times the state ξ. And if in another distribution D', with sampling c', in
which the variable x_i is always in the state σ' and the variable x_{i+1} is d' times in the state
ξ, then in a distribution with sampling c + c', in which the variable x_i is c times in
the state σ and c' times in the state σ', the variable x_{i+1} must assume d + d' times the
state ξ.
This motivates the following:

Definition 5.2.1 Let S be a random system with variables x_1, ..., x_n (thus, the
variables have a natural ordering). Let a_i be the number of states of the variable x_i.
Consider matrices M_1, ..., M_{n−1}, where each matrix M_i has a_i columns and a_{i+1}
rows.
We call Markov model with matrices M_1, ..., M_{n−1} the model on the total cor-
relation of S formed by the distributions D whose total marginalization (v_1, ..., v_n),
v_i ∈ K^{a_i}, satisfies all the following conditions:

$$v_{i+1} = M_i v_i, \qquad i = 1, \dots, n-1.$$

We simply call Markov model the model on the total correlation of S formed by the
distributions D satisfying a Markov model for some choice of the matrices.

Example 5.2.2 Consider a system formed by three headquarters A, B, C transmit-
ting a Boolean signal. A transmits the signal to B, which in turn retransmits it to
C. The signal is disturbed according to the Jukes–Cantor matrices

$$M = \begin{pmatrix} 3/4 & 1/4 \\ 1/4 & 3/4 \end{pmatrix}, \qquad N = \begin{pmatrix} 2/3 & 1/3 \\ 1/3 & 2/3 \end{pmatrix}$$

If A transmits 0 60 times and 1 120 times, the distribution observed on the total
correlation is the 2 × 2 × 2 tensor D whose slices along C are

$$\begin{pmatrix} 30 & 5 \\ 20 & 30 \end{pmatrix} \ (C = 0), \qquad \begin{pmatrix} 15 & 10 \\ 10 & 60 \end{pmatrix} \ (C = 1)$$

(rows indexed by A, columns by B).
The total marginalization of D is given by (60, 120), (75, 105), (85, 95). Since
one has

$$M \begin{pmatrix} 60 \\ 120 \end{pmatrix} = \begin{pmatrix} 75 \\ 105 \end{pmatrix}, \qquad N \begin{pmatrix} 75 \\ 105 \end{pmatrix} = \begin{pmatrix} 85 \\ 95 \end{pmatrix},$$

it follows that D is a distribution of the Markov model associated to the matrices
M, N.
M, N .
From the previous example it is clear that, when we have three variables, the
Markov model is formed by distributions D = (D_{ijk}) such that
D_{ijk} = d_j M_{ij} N_{jk} for suitable matrices M, N, where (d_1, ..., d_{a_2}) represents the
marginalization of D on the variable x_2.

Proposition 5.2.3 The distributions of the Markov model are exactly the distribu-
tions satisfying all the conditional independences

$$x_i \perp x_k \,|\, x_j$$

for any choice of i, j, k with i < j < k.

Proof We give only a sketch of the proof.
We prove one direction for n = 3. If D satisfies a Markov model, relative to the
matrices M, N, then, denoting by (v_1, v_2, v_3) the total marginalization of D, consider
R = {2} and Q : R → {1, ..., a_2}, Q(2) = j. By what has been said following Example
5.2.2, the element (R, Q) of the scan of D is given by a multiple of C_j ⊗ R_j, where C_j is the jth
column of M while R_j is the jth row of N. Therefore, all these elements have
rank 1.
The general case is solved by marginalizing the distribution D in such a way as to
restrict it to the variables x_i, x_j, x_k.
For the other direction, we describe what happens for a system of three Boolean
variables. Given a distribution D = (D_{ijk}) ≠ 0 that satisfies x_1 ⊥ x_3 | x_2, up to renumber-
ing the states, we can assume D_{222} ≠ 0. Consider the submatrices of D

$$M' = \begin{pmatrix} D_{112} & D_{122} \\ D_{212} & D_{222} \end{pmatrix}, \qquad N' = \begin{pmatrix} D_{211} & D_{212} \\ D_{221} & D_{222} \end{pmatrix}.$$

Given two numbers h, k such that hk = D_{212}/D_{222}, we multiply the second column
of M' by h and the second row of N' by k.
The two matrices thus obtained, appropriately scaled, determine matrices M, N
describing the Markov model satisfied by D.
In the general case, the procedure is similar but more complicated. □

For a more complete discussion, the reader is referred to the article by Eriksson,
Ranestad, Sturmfels, and Sullivant in [1].

Corollary 5.2.4 Markov chain models are algebraic models and also polynomial
parametric models (since, in general, many conditional independences are involved,
these models are not toric in general).

Remark 5.2.5 Consider a system consisting of three variables x_1, x_2, x_3 with the
same number of states.
In practice, the Markov chain model is almost always associated to matrices M, N
that are invertible.
In this case, the obtained distributions are the same ones that we obtain by considering
the Markov model on the same system, ordered so that x_3 → x_2 → x_1, with matrices
N^{−1}, M^{−1}.
Thus Markov chains, when the transition matrices are invertible, cannot distin-
guish who transmits the signal and who receives it. From the point of view of distributions,
the two chains

x_1 →(M) x_2 →(N) x_3        and        x_3 →(N^{−1}) x_2 →(M^{−1}) x_1

are, in fact, indistinguishable.

Markov chains can be generalized to models which are defined over trees.

Definition 5.2.6 Let G be a directed tree.
Consider a random system S whose variables x_1, ..., x_n are the vertices of G
(and hence are partially ordered by the direction on the tree). Let a_i be the number
of states of the variable x_i. For any (directed) edge (x_i, x_j), we consider a matrix M_{ij}
with a_i columns and a_j rows.
We call tree Markov model on G, with matrices {M_{ij}}, the model on the total
correlation of S formed by the distributions D whose total marginalization (v_1, ..., v_n),
v_i ∈ K^{a_i}, satisfies the conditions:

$$v_j = M_{ij} v_i \quad \text{for every edge } (x_i, x_j).$$

We simply call Markov model on G the model on the total correlation of S, formed
by the distributions D satisfying the condition of a tree Markov model on G for some
choice of the matrices M_{ij}.

Example 5.2.7 Markov chain models are obviously examples of tree Markov
models.
The simplest example of a tree Markov model, beyond chains, is the one shown in
Example 5.1.6.

Remaining in Example 5.1.6, it is immediate to understand that, for the same
motivations expressed in Remark 5.2.5, when the matrix M_AB is invertible, the
model associated with the scheme

B ←(M_AB) A →(M_AC) C

is indistinguishable from the following Markov chain model

B →(M_AB^{−1}) A →(M_AC) C

The previous argument suggests that tree Markov models can be described by models
of conditional independence. Indeed, this is the case because in a tree, given two
vertices x_i, x_j, there exists at most one minimal path connecting them.

Theorem 5.2.8 Given a tree G and a random system S whose variables are the
vertices x_1, ..., x_n of G, a distribution D on the total correlation of S is in the
tree Markov model associated with G, for some choice of matrices M_{ij}, if and only
if D satisfies all the conditional independencies

$$x_i \perp x_j \,|\, x_k$$

whenever x_k is in the minimal path joining x_i, x_j.

For the proof, refer to the book [2] or to the aforementioned article by Eriksson,
Ranestad, Sturmfels, and Sullivant in [1].

Example 5.2.9 Both the tree Markov model associated to the tree

B ← A → C

and the Markov chain model

B → A → C

are equivalent to the conditional independence model

B ⊥ C | A.

Example 5.2.10 An interesting example of application of tree Markov models is
the study of phylogenetics, where one tries to reconstruct the genealogical tree of
an evolution (which can be biological, but also chemical, linguistic, etc.). See,
for example, [3–6].
For example, suppose we have to determine the evolutionary situation of five species
A, B, C, D, E, starting from the ancestor A. We can hypothesize two different evo-
lutionary situations, represented by the graphs G_1, G_2, where in G_1 the species B, E
directly descend from A while C, D descend from B:

G_1 :  A → B,  A → E,  B → C,  B → D

and in G_2 the species B, C directly descend from A while D, E descend from B:

G_2 :  A → B,  A → C,  B → D,  B → E

We build a random system on the variables A, B, C, D, E, which we can also
consider Boolean. If the situation concerns biological evolution, the two states could
represent the presence of purine or pyrimidine bases at the positions of the DNA
chain of the species. In this case, a distribution is represented by a tensor of type
2 × 2 × 2 × 2 × 2.
The models associated with the two graphs G_1, G_2 can be distinguished since, for
example, in the first one A ⊥ C | B holds, which does not happen in the second
case.

5.3 Hidden Variables

Let us go back to the initial Examples 5.0.1 and 5.0.2 of this chapter.
The situation presented in these examples foresees the presence of hidden vari-
ables, that is, variables whose presence was not known at the beginning, but which
condition the dependence between the observable variables.
Also in Example 5.2.10 a similar situation can occur. If the species A, B from
which the others derive are only hypothesized in the past, it is clear that one cannot
hope to observe their DNA; then the distributions on the variables A, B are not known,
so what we observe is not the true original tensor, but only its marginalization along
the variables A, B.
How can we hope to detect the presence of hidden variables?
One way is suggested by Example 5.0.1 and uses the concept of rank (see
Definition 6.3.3). In that situation, the distributions on the two observable variables
(A = football fan, B = hair loss) were represented by 3 × 3 matrices. The existence
of the hidden variable (G = gender) implied that the matrix of the distribution D was
the marginalization of a tensor T of type 3 × 3 × 2 whose scan along the hidden
variable is formed by two matrices M_1, M_2 of rank 1. Therefore, D = M_1 + M_2 had
rank ≤ 2.

Remark 5.3.1 Let S be a system of random variables y, x_1, ..., x_n, where y has r
states while every x_i has a_i states. A distribution D on the total correlation of S in the
conditional independence model ⊥{x_1, ..., x_n}|y is represented by a tensor of type
r × a_1 × · · · × a_n whose scan along the first variable is formed by elements of rank 1.
Therefore, the marginalization of D along the variable y has rank ≤ r.
Vice versa, consider a system S' with random variables x_1, ..., x_n as above and
let D' be a distribution of rank ≤ r on the total correlation of S'. Then there is a
distribution D on the total correlation of S (not necessarily unique!) whose marginalization
along y is D'. In fact, we can write

$$D' = D_1 + \dots + D_r,$$

with each D_i of rank ≤ 1, hence the tensor whose elements along the first direction
are D_1, ..., D_r represents the distribution D we were looking for.
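
The "vice versa" construction is elementary to carry out; a sketch (Python/NumPy, names ours) for r = 2 and 3 × 3 matrices:

import numpy as np

rng = np.random.default_rng(3)
D1 = np.outer(rng.random(3), rng.random(3))   # rank-1 summands
D2 = np.outer(rng.random(3), rng.random(3))
Dp = D1 + D2                                  # observed distribution, rank <= 2

T = np.stack([D1, D2])           # a 2 x 3 x 3 tensor: y is the hidden variable
assert np.allclose(T.sum(axis=0), Dp)         # marginalizing y gives back D'
assert np.linalg.matrix_rank(Dp) <= 2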

The previous remark justifies the definition of the hidden variable model.

Definition 5.3.2 On the total correlation of a random system S, we will call hidden
variable model with r states the subset of the projective space of distributions formed
by the points corresponding to tensors of rank ≤ r.

Since the rank of a tensor T is invariant when we multiply T by a constant
≠ 0, the definition is well posed in the world of projective distributions.
The model of independence is a particular (and degenerate) case of hidden variable
models.
Example 5.3.3 Consider a random dipole S, consisting of the variables A, B, having
a and b states, respectively. The distributions on the total correlation of S are represented
by matrices M of type a × b.
When r < min{a, b}, the hidden variable model with r states is equal to the subset
of the matrices of rank ≤ r. It is clear that this model is (projective) algebraic, because
it is described by the vanishing of all (r + 1) × (r + 1) subdeterminants, which are
homogeneous polynomials of degree r + 1 in the coefficients of the matrix M.
When r ≥ min{a, b}, the hidden variable model with r states can still be defined,
but it becomes trivial: all a × b matrices have rank ≤ r.
The previous example can be generalized: the hidden variable models with r
states become trivial, i.e., they coincide with the entire space of distributions, for r
big enough. One could expect that they are always projective parametric models, hence
also projective algebraic by Chow's Theorem; we will see below that, in general, this
is not the case. The hidden variable models are in fact linked to the geometric concept
of the secant variety of a subset of a projective space. Here, we recall the basic
definitions, but a more general treatment will be given in Chap. 12.

Definition 5.3.4 Let Y be a subset of a projective space P^n. We will say that P ∈ P^n
belongs to a space r-secant to Y if there are points P_1, ..., P_r ∈ Y (not necessarily
distinct) such that the homogeneous coordinates of P are a linear combination of the
homogeneous coordinates of P_1, ..., P_r. It is clear that the definition is well posed,
because it is invariant when multiplying the coordinates of P by a nonzero constant.
We will call the r-secant variety of Y, denoted by S_r^0(Y), the subset of P^n formed by
the points belonging to a space r-secant to Y.

Remark 5.3.5 It is clear that S^0_1(Y) = Y. Moreover, S^0_i(Y) ⊆ S^0_{i+1}(Y) (and equality
could hold).
When the cone over Y spans the vector space K n+1, then Y contains n + 1 points
whose coordinates are linearly independent, hence S^0_{n+1}(Y) = Pn. In fact, it is clear
that S^0_{n+1}(Y) ≠ Pn if and only if the cone over Y is contained in a proper subspace
of K n+1, that is, if and only if Y is contained in (at least) one hyperplane of Pn.
Notice that we can have S^0_r(Y) = Pn also for r much smaller than n + 1.

Proposition 5.3.6 In the space of tensors P = P(K a1 ···an ), a tensor has rank ≤ r if
and only if it is in the r-secant variety of the Segre variety X.
It follows that the model with a hidden variable with r states corresponds to the
secant variety S^0_r(X) of the Segre variety.

Proof By definition (see Proposition 10.5.12), the Segre variety X is exactly the set
of tensors of rank 1.
In general, if a tensor has rank ≤ r, then it is a sum of r tensors of rank 1, hence it
lies in the r-secant variety of X.
Vice versa, if T lies in the r-secant variety of X, then there exist tensors T1, . . . , Tr
of X (hence tensors of rank 1) such that

T = α1 T1 + · · · + αr Tr.

Hence, since each αi Ti either is 0 or has rank 1 (the latter exactly when αi ≠ 0), we
get that T is a sum of ≤ r tensors of rank 1. □

Secant varieties have long been studied in Projective Geometry for their applica-
tions to the study of projections of algebraic varieties. Their use in hidden variable
models is one of the most significant points of contact between Algebraic Statistics
and Algebraic Geometry.
An important fact in the study of hidden variable models is that (unfortunately)
such models are not algebraic models (and therefore not even projective parametric).

Proposition 5.3.7 In the projective space P = P7 of tensors of type 2 × 2 × 2 over
C, the subset Y of tensors of rank ≤ 2 is not an algebraic variety.

Proof We will use the tensor of rank 3 of Example 6.4.15, which proves, moreover,
that Y does not coincide with P.
Consider the tensors of the form D = uT1 + tT2, where T1 and T2 are the 2 × 2 × 2
tensors with entries

(T1)112 = 2, (T1)121 = (T1)122 = 3, (T1)221 = 2, (T1)222 = 4, all other entries 0;
(T2)111 = 1, all other entries 0.

Such tensors span a space of dimension 2 in the vector space of tensors, hence a line
L ⊂ P. For (u, t) = (1, 1) we get the tensor D of Example 6.4.15, that has rank 3.
Hence L ⊂ Y .
We check now that all other points in L, different from D, are in Y . In fact if
D  ∈ L \ {D}, then D  = uT1 + t T2 , where (u, t) is not proportional to (1, 1), that
is u = t. Then D  can be decomposed in the sum of two tensors of rank 1 as follows:

0 6t−12u
2t−2u 2u 6u
2t−2u

0 3t−6u
2t−2u t 3t
2t−2u

+
0 4u 0 0

0 2u 0 0

If Y were an algebraic variety, there would be at least one homogeneous polynomial
which vanishes on all points of L except D. By restricting this polynomial to L,
one would obtain a homogeneous polynomial p ∈ C[u, t] that vanishes everywhere
on L except at the point of coordinates (u, u), corresponding to D. On the other hand,
every nonzero homogeneous polynomial in C[u, t] decomposes into the product of a
finite number of homogeneous linear factors, so p, which is not the null polynomial
because it does not vanish at D, can vanish only at a finite number of points of the
projective line with coordinates u, t, that is, of L. This is a contradiction, since L
contains infinitely many points different from D. □
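The rank 1 decomposition displayed in the proof can be verified numerically; in the sketch below (our own check, with an arbitrarily chosen pair u ≠ t) the two summands are exactly the tensors P and Q written above, with NumPy's 0-based indexing.

import numpy as np

def rank1(a, b, c):
    # the 2 x 2 x 2 tensor a (x) b (x) c
    return np.einsum('i,j,k->ijk', a, b, c)

u, t = 2.0, 5.0                           # any pair with u != t works
T1 = np.zeros((2, 2, 2))
T2 = np.zeros((2, 2, 2))
# entries of T1, T2; T1[0,0,1] is (T1)_112 in the text
T1[0, 0, 1] = 2
T1[0, 1, 0] = T1[0, 1, 1] = 3
T1[1, 1, 0] = 2
T1[1, 1, 1] = 4
T2[0, 0, 0] = 1
D1 = u * T1 + t * T2                      # a point of L different from D

P = rank1(np.array([1.0, 0.0]),
          np.array([t, 3*u*t/(2*t - 2*u)]),
          np.array([1.0, 2*u/t]))
Q = rank1(np.array([3*u*(t - 2*u)/(2*t - 2*u), 2*u]),
          np.array([0.0, 1.0]),
          np.array([1.0, 2.0]))
assert np.allclose(D1, P + Q)             # D' = P + Q, both of rank 1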

To overcome this problem, we define the algebraic secant variety and consequently
the algebraic model of hidden variable.

Definition 5.3.8 Let Y be a subset of a projective space Pn . We will call algebraic


r-secant variety of Y, denoted by Sr(Y), the closure, in the Zariski topology, of S^0_r(Y).
This closure corresponds to the smallest algebraic variety containing S^0_r(Y).
On the total correlation of a random system S we will call algebraic model of
hidden variable with r states the subset of P(D(S)) formed by the algebraic r -
secant variety of the Segre variety corresponding to the tensors of rank 1.

Example 5.3.9 In the projective space P = P7 of tensors of type 2 × 2 × 2 over C,
let X be the Segre variety given by the embedding of P1 × P1 × P1.
The algebraic 2-secant variety of X coincides with P7. Indeed, every
tensor of rank > 2 is a limit of tensors of rank 2.
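The limit phenomenon of Example 5.3.9 can be made concrete with the classical sequence below; this numerical sketch is our own illustration (not taken from the book): the rank 3 tensor W is approximated arbitrarily well by tensors that are differences of two rank 1 tensors.

import numpy as np

def cube(v):
    return np.einsum('i,j,k->ijk', v, v, v)

e0, e1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
# W has rank 3, but it is a limit of the rank 2 tensors T_eps below
W = (np.einsum('i,j,k->ijk', e0, e0, e1)
     + np.einsum('i,j,k->ijk', e0, e1, e0)
     + np.einsum('i,j,k->ijk', e1, e0, e0))

for eps in [1.0, 0.1, 0.01]:
    T_eps = (cube(e0 + eps * e1) - cube(e0)) / eps   # difference of two rank 1 tensors
    print(eps, np.abs(T_eps - W).max())              # the error goes to 0 with eps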

Remark 5.3.10 We can try to characterize hidden variable models as parametric
models. Consider, for example, the product P1 × P1 × P1 and its embedding X in
P7. The 2-secant variety can, at first glance, be obtained as a parametric variety
defined by the equations

Q = αP1 + βP2,   α, β ∈ C,   P1, P2 ∈ X,

which, combined with the parametric equations of X, lead to the overall parametric
equations

x111 = αa1b1c1 + βa′1b′1c′1
x112 = αa1b1c2 + βa′1b′1c′2
. . .
x222 = αa2b2c2 + βa′2b′2c′2

where P1 = (a1, a2) ⊗ (b1, b2) ⊗ (c1, c2) and P2 = (a′1, a′2) ⊗ (b′1, b′2) ⊗ (c′1, c′2).
Unluckily, this parameterization cannot be defined globally.
As a matter of fact, moving the parameters freely, we must consider also the cases
when P1 = P2. In this situation, for some choices of α, β, the image would be the point
(0, . . . , 0), which does not exist in the projective setting. Hence, the parameterization
is only partial.
If we exclude the parameter values for which the image would give (0, . . . , 0), we
get a well-defined function on a Zariski open set of (P1)^7. The image Y of that open
set, however, is not a Zariski closed subset of P7. The Zariski closure of Y in P7
coincides with the whole P7.
Part of the study of secant varieties is based on the calculation of their dimension.
From what has just been said in the previous remark, an upper bound for the dimension
of algebraic secant varieties is always available.

Proposition 5.3.11 The algebraic r-secant variety of the Segre variety X, image of
the Segre map of the product Pa1 × · · · × Pan, has dimension bounded by

dim(Sr(X)) ≤ min{N, ar + r − 1}     (5.3.1)

where N = (a1 + 1) · · · (an + 1) − 1 is the dimension of the space where X is embedded
and a = a1 + · · · + an is the dimension of X.

Proof That the dimension of Sr(X) is at most N depends on the fact that the dimen-
sion of an algebraic variety in PN cannot exceed the dimension of the ambient space
(see Proposition 11.2.14).
The second bound, dim(Sr(X)) ≤ ar + r − 1, follows from Theorem 11.2.24
since, generalizing the previous remark, Sr(X) is the closure of the image of a
polynomial map from X × · · · × X (r times) × Pr−1, a product of dimension
ar + r − 1, to PN. □

If instead of the Segre map we take the Veronese map, a similar situation is
obtained.

Proposition 5.3.12 The algebraic r-secant variety of the Veronese variety X, image
of the Veronese map of degree d on Pn, has dimension bounded by

dim(Sr(X)) ≤ min{N, nr + r − 1}     (5.3.2)

where N = (n+d choose d) − 1 is the dimension of the space where X is embedded.

In both situations, we will call expected r-secant dimension of the Segre variety
(respectively, of the Veronese variety) the right-hand side of the inequality (5.3.1)
(respectively, of the inequality (5.3.2)).

Definition 5.3.13 We call generic rank of tensors of type (a1 + 1) × · · · × (an + 1)
the minimum r such that, if X is the Segre variety of Pa1 × · · · × Pan in PN, one has
Sr(X) = PN.
We call generic symmetric rank of symmetric tensors of type (n + 1) × · · · × (n + 1)
(d times) the minimum r such that, if X is the Veronese variety of degree d of Pn in
PN, one has Sr(X) = PN.

Example 5.3.14 The generic rank of n × n matrices is n.
The generic rank of 2 × 2 × 2 tensors is 2.
The generic rank of 3 × 3 × 3 tensors cannot be 3. As a matter of fact, the
tensors of rank 1 of this type correspond to the Segre embedding X of P2 × P2 × P2
in P26. The dimension of the algebraic 3-secant variety S3(X), by Proposition 5.3.11,
is bounded by 6 · 3 + 3 − 1 = 20; hence dim(S3(X)) ≤ 20 < 26, so that S3(X)
cannot be the whole P26.

The last part of the previous example provides a general principle.

Proposition 5.3.15 Given a = a1 + · · · + an and N = (a1 + 1) · · · (an + 1) − 1,
the generic rank rg of tensors of type (a1 + 1) × · · · × (an + 1) satisfies

rg ≥ (N + 1)/(a + 1).

Given N = (n+d choose d) − 1, the generic symmetric rank rgs of symmetric tensors of type
(n + 1) × · · · × (n + 1) (d times) satisfies

rgs ≥ (N + 1)/(n + 1).
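The bound of Proposition 5.3.15 is immediate to evaluate; the small helper below is our own sketch (the function name is our choice) and reproduces the numbers used in the examples that follow.

from math import ceil

def segre_bound(dims):
    # lower bound ceil((N+1)/(a+1)) for the generic rank of tensors
    # of type (a1+1) x ... x (an+1), where dims = (a1, ..., an)
    N = 1
    for a in dims:
        N *= a + 1
    N -= 1
    a = sum(dims)
    return ceil((N + 1) / (a + 1))

print(segre_bound((2, 2, 2)))      # 4 (but the true generic rank of 3x3x3 is 5)
print(segre_bound((3, 3, 3)))      # 7
print(segre_bound((3, 3, 3, 3)))   # 20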

Note that, in general, there are tensors whose rank is lower than the generic rank,
but there may also be tensors whose rank is greater than the generic rank (this cannot
happen in the case of matrices); see Example 6.4.15.

Example 5.3.16 In general, we could expect the generic rank rg to be exactly equal
to the smallest integer ≥ (N + 1)/(a + 1). This does not always occur, and it is already
obvious in the case of matrix spaces.
For tensors of higher dimension, consider the case of 3 × 3 × 3 tensors, for which
N = 26 and a = 6. The minimum integer ≥ (N + 1)/(a + 1) is 4, but the generic
rank is 5.
The tensors for which the generic rank is larger than the minimum integer greater
than or equal to (N + 1)/(a + 1) are called defective.
We know a few examples of defective tensors, but a complete classification of
them is not known. A discussion of defectiveness (as well as a proof of the statement on
3 × 3 × 3 tensors) is beyond the scope of this Introduction, and we refer to the text
of Landsberg [7].

The importance of the generic rank in the study of hidden variables is evident.
Given a random system S with variables x1, . . . , xn, where xi has ai + 1 states,
the algebraic model of hidden variable with r states, on the total correlation of S,
is equivalent to the algebraic secant variety Sr(X), where X is the Segre variety
of Pa1 × · · · × Pan. The distributions that are in this model suggest that the
phenomenon under observation is actually driven by a variable (in fact: hidden)
with r states.
However, if r ≥ rg, this suggestion is empty. In fact, in this case, Sr(X) is equal
to the whole space of distributions, so practically every distribution suggests
the presence of such a variable. From the practical side, this simply means that the
information given by the additional hidden variable is null. In practice, therefore, the
existence or nonexistence of the hidden variable does not add any useful information
to the understanding of the phenomenon.

Example 5.3.17 Consider the study of DNA strings. If we observe the distribution
of the bases on 3 positions of the string, we get distributions described by 4 × 4 × 4
tensors. The tensors of this type are not defective, so, being a = 9 and N = 63, the
generic rank is 7.
The observation of a rank 6 distribution then suggests the presence of a hidden
variable with 6 states (such as the subdivision of our sample into 6 different species).
The observation of a rank 7 distribution does not, instead, give us any practical
evidence of the real existence of a hidden variable with 7 states.
If we really suspect the existence of a hidden variable (the species) with 7 or more
states, how can we verify this?
The answer is that such an observation is not possible considering only three
positions of DNA. However, if we pass to observing four positions, we get a 4 ×
4 × 4 × 4 tensor. The tensors of this type (which are not defective either) have
generic rank equal to ⌈256/13⌉ = 20. If in this case we still get distributions of rank
7, which is much less than 20, our assumption receives formidable experimental
evidence.

References

1. Eriksson, N., Ranestad, K., Sullivant, S., Sturmfels, B.: Phylogenetic algebraic geometry. In:
Ciliberto, C., Geramita, A., Harbourne, B., Miró-Roig, R.M., Ranestad, K. (eds.) Projective Varieties
with Unexpected Properties, pp. 237–255. De Gruyter, Berlin (2005)
2. Drton, M., Sturmfels, B., Sullivant, S.: Lectures on Algebraic Statistics. Oberwolfach Seminars,
vol. 40. Birkhäuser, Basel (2009)
3. Allman, E.S., Rhodes, J.A.: Phylogenetic invariants, chapter 4. In: Gascuel, O., Steel, M. (eds.)
Reconstructing Evolution: New Mathematical and Computational Advances. Oxford University Press, Oxford (2007)
4. Allman, E.S., Rhodes, J.A.: Molecular phylogenetics from an algebraic viewpoint. Stat. Sin.
17(4), 1299–1316 (2007)
5. Allman, E.S., Rhodes, J.A.: Phylogenetic ideals and varieties for the general Markov model.
Adv. Appl. Math. 40(2), 127–148 (2008)
6. Bocci, C.: Topics in phylogenetic algebraic geometry. Expo. Math. 25, 235–259 (2007)
7. Landsberg, J.M.: Tensors: Geometry and Applications. Graduate Studies in Mathematics,
vol. 128. American Mathematical Society, Providence (2012)
Part II
Multi-linear Algebra
Chapter 6
Tensors

6.1 Basic Definitions

The main objects of multi-linear algebra that we will use in the study of Algebraic
Statistics are multidimensional matrices, that we will call tensors.
One begins by observing that matrices are very versatile objects! One can use
them for keeping track of information in a systematic way. In this case, the entries in
the matrix are “place holders” for the information. Any elementary book on Matrix
Theory will be filled with examples (ranging from uses in Accounting, Biology, and
Combinatorics to uses in Zoology) which illustrate how thinking of matrices in this
way gives a very important perspective for certain types of applied problems.
On the other hand, from the first course on Linear Algebra, we know that matrices
can be used to describe important mathematical objects. For example, one can use
matrices to describe linear transformations between vector spaces or to represent
quadratic forms. Coupled with calculus, these ideas form the backbone of much
of mathematical thinking.
We want to now mention yet another way that matrices can be used: namely to
describe bilinear forms. To see this let M be an m × n matrix with entries from
the field K . Consider the two vector spaces K m and K n and suppose they have the
standard basis. If v ∈ K m and w ∈ K n we will represent them as 1 × m and 1 × n
matrices, respectively, where the entries in the matrices are the coordinates of v and
w with respect to the chosen basis. So, let
 
v = α1 · · · αm

and  
w = β1 · · · βn .

The matrix M above can be used to define a function

Km × Kn → K


described by
(v, w) → v Mw t

where the expression on the right is simply the multiplication of three matrices
(t denoting matrix transpose). Notice that this function is linear both in K m and in
K n , and hence is called a bilinear form.
On the other hand, given any bilinear form B : K m × K n → K, i.e., a function
which is linear in both arguments, and choosing a basis for both K m and K n, we can
associate to that bilinear form an m × n matrix N as follows: if {v1, . . . , vm} is the
basis chosen for K m and {w1, . . . , wn} is the basis chosen for K n, then we form the
m × n matrix N = (ni,j) where ni,j := B(vi, wj).
It is easy to see that if v ∈ K m, v = Σ_{i=1}^m αi vi, and w ∈ K n, w = Σ_{j=1}^n βj wj,
then

B(v, w) = (α1 · · · αm) N (β1, . . . , βn)^t.

Thus, bilinear forms mapping K m × K n → K (and a choice of basis for both K m


and for K n ) are in 1-1 correspondence with m × n matrices with entries from K .
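In coordinates, evaluating a bilinear form is one line of linear algebra. The quick check below is our own sketch (the matrix and vectors are arbitrary toy numbers, not taken from the text).

import numpy as np

M = np.array([[1, 0, 2],
              [3, 1, 1]])                 # a 2 x 3 matrix over R

def B(v, w):
    # the bilinear form K^2 x K^3 -> K attached to M
    return v @ M @ w

v = np.array([2, -1])
w = np.array([1, 1, 1])
print(B(v, w))                            # 1
# bilinearity in the first argument:
assert B(2 * v, w) == 2 * B(v, w)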
Remark 6.1.1 One should note that although K m × K n is a vector space of dimen-
sion m + n, the bilinear map defined above from that vector space to K is not a linear
map. In fact, any vector in the cartesian product of the form (v, 0) or (0, w) (where
0 is the zero vector) is sent to 0 under the bilinear form, but the sum of those two
vectors is (v, w) which does not necessarily get sent to 0 by the bilinear form.
Example 6.1.2 Recall that if S is a system with two random variables, say x and
y, where A(x) contains m elements and A(y) contains n elements, then we used an
m × n matrix M to encode all the information of a distribution on the total correlation
S. The (i, j) entry in M was the value of the distribution on the (i, j)th element in
the alphabet of the unique random variable (x, y) of the system S (see Definition
1.1.14). This is an example where we used a matrix as a convenient place to store
the information of a distribution on S.
However, if we consider the ith element of the alphabet of the random variable x
as corresponding to the matrix
 
v = 0 ··· 0 1 0 ··· 0

(where the 1 occurs in the ith place in this 1 × m matrix) and


 
w = 0 ··· 0 1 0 ··· 0

(where this time the 1 occurs in the jth place in this 1 × n matrix) then the product
v Mw t is precisely the (i, j) entry in the matrix M. But, as we noted above, this is the
value of the distribution on the (i, j) element in the alphabet of the unique random
variable in the total correlation we described above.

So, although the matrix M started out being considered simply as a place holder
for information, we see that considering it as a bilinear form on an appropriate pair of
vector spaces it can also be used to give us information about the original distribution.
Tensors will give us a way to generalize what we have just seen for two random
variables to any finite number of random variables. So, tensors will encode infor-
mation about the connections between distinct variables in a random system. As
the study of the properties of such connections is a fundamental goal in Algebraic
Statistics, it is clear that the role of tensors is ubiquitous in this book.
From the discussion above concerning bilinear forms and matrices, we see that
we have a choice as to how to proceed. We can define tensors as multidimensional
arrays or we can define tensors as multi-linear functions on a cartesian product of
a finite number of vector spaces. Both points of view are equally valid and will
eventually bring us to the same place. The two ways are equivalent, as we saw above
for bilinear forms, although sometimes one point of view is preferable to the other.
We will continue with both points of view but, for low dimensional tensors, we will
usually prefer to deal with the multidimensional arrays.
Before we get too involved in studying tensors, this is probably a good time
to forewarn the reader that although matrices are very familiar objects for which
there are well-understood tools to aid in their study, that is far from the case for
multidimensional matrices, i.e., tensors. The search for appropriate tools to study
tensors is part of ongoing research. The abundance of research on tensors (research
being carried out by mathematicians, computer scientists, statisticians, and engineers
as well as by people in other scientific fields) attests to the importance that these
objects have nowadays in real-life applications.
Notation. For every positive integer i, we will denote by [i] the set {1, . . . , i}.
For the rest of the section, K can indicate any set, but in practice, K will always
be a set of numbers (like N, Z, Q, R, or C).

Definition 6.1.3 A tensor T over K , of dimension n and type

a1 × · · · × an

is a multidimensional table of elements of K , in which any element is determined


by a multi-index (i 1 , . . . , i n ), where i j ranges between 1 and a j .
In more formal terms, a tensor T as above is a map:

T : [a1 ] × · · · × [an ] → K .

Equivalently (when K is a field) such a tensor T is a multi-linear map

T : K a1 × · · · × K an → K

where we consider the standard basis for each of the K ai .



Remark 6.1.4 If we think of T as a multi-linear map and suppose that for each
1 ≤ i ≤ n, {e^i_j | 1 ≤ j ≤ ai} is the standard basis for K ai, then the entry in the
multidimensional array representation of T corresponding to the multi-index
(i1, . . . , in) is

T(e^1_{i1}, e^2_{i2}, . . . , e^n_{in}).

Tensors are a natural generalization of matrices. Indeed matrices of real numbers


and of type m × n correspond exactly to tensors over R of dimension 2 and type
m × n.
Example 6.1.5 An example of a tensor over R, of dimension 3 and type 2 × 2 × 2
is:
2 1

4 0
T =
−1 3

4 7
Notation. Although we have written a 2 × 2 × 2 tensor above, we have not made
clear which place in that array corresponds to T(e^1_{i1}, e^2_{i2}, e^3_{i3}). We will have to make
a convention about that. Again, the conventions in the case of three-dimensional
tensors are not uniform across all books on Multi-linear Algebra, but we will attempt
to motivate the notation that we use, and is most common, by looking at the cases
in which there is widespread agreement, i.e., the cases of one-dimensional and two-
dimensional tensors.
Let’s start by recalling the conventions for how to represent a one-dimensional
tensor, i.e., a linear function
T : Kn → K .

Recall that such a tensor can be represented by a 1 × n matrix as follows: let e1, . . . , en
be the standard basis for K n and suppose that T(ei) = αi; then the matrix for this
linear map is

(α1 · · · αn).

So, if v = Σ_{i=1}^n γi ei is any vector in K n, then

T(v) = (α1 · · · αn)(γ1, . . . , γn)^t.

Now suppose that we have a two-dimensional tensor T of type m × n, i.e., a


bilinear form

T : Km × Kn → K .

Recall that such a tensor is represented by an m × n matrix A as follows: let

{e^1_j | 1 ≤ j ≤ m} be the standard basis for K m;

{e^2_j | 1 ≤ j ≤ n} be the standard basis for K n;

then

A = (αi,j) where αi,j := T(e^1_i, e^2_j).

So

A = ( α1,1  α1,2  · · ·  α1,n )
    ( ...   ...   · · ·  ...  )
    ( αm,1  αm,2  · · ·  αm,n )

Now suppose we have a three-dimensional tensor T of type m × n × r , i.e., a


trilinear form
T : Km × Kn × Kr → K .

This tensor is represented by an m × n × r box A as follows: let

{e^1_j | 1 ≤ j ≤ m} and {e^2_j | 1 ≤ j ≤ n}

be the standard bases for K m and K n, respectively (as above), and let

{e^3_j | 1 ≤ j ≤ r} be the standard basis for K r.

Then

A = (αi,j,k) where αi,j,k := T(e^1_i, e^2_j, e^3_k).

How will we arrange these values in a rectangular box? We let the front (or first)
face of the box be the m × n matrix whose (i, j) entry is T(e^1_i, e^2_j, e^3_1). The second
face, parallel to the first face, is the m × n matrix whose (i, j) entry is T(e^1_i, e^2_j, e^3_2).
We continue in this way, so that the back face (the rth face), parallel to the first face,
is the m × n matrix whose (i, j) entry is T(e^1_i, e^2_j, e^3_r).

Example 6.1.6 Let T be the three-dimensional tensor of type 3 × 2 × 2, whose


(i, j, k) entry is equal to i jk (the product of the three numbers). Then the 3 × 2 × 2
rectangle has first face a 3 × 2 matrix whose (i, j) entry is (i j) · 1. The second (or
back) face is a 3 × 2 matrix whose (i, j) entry is (i j) · 2. We put this all together to
get our 3 × 2 × 2 tensor.

2 4

1 2

4 8

T1 = 2 4

6 12

3 6
To be assured that you have the conventions straight for trilinear forms, verify that
the three-dimensional tensor of type 3 × 2 × 2 whose multidimensional matrix rep-
resentation has the number i + j + k as its (i, j, k) entry, looks like
4 5

3 4

5 6

T2 = 4 5

6 7

5 6
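Such multidimensional arrays are conveniently produced by software. The sketch below is our own illustration (note that NumPy indexes entries from 0 while the book indexes from 1, hence the +1 shifts); it rebuilds T1 and T2.

import numpy as np

T1 = np.fromfunction(lambda i, j, k: (i+1)*(j+1)*(k+1), (3, 2, 2), dtype=int)
T2 = np.fromfunction(lambda i, j, k: (i+1)+(j+1)+(k+1), (3, 2, 2), dtype=int)

print(T1[:, :, 0])   # front face: the (i, j) entry is i*j*1
print(T1[:, :, 1])   # back face:  the (i, j) entry is i*j*2
print(T2[0, 1, 1])   # the (1, 2, 2) entry of T2, namely 1 + 2 + 2 = 5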
Remark 6.1.7 We saw above that elements of K n can be considered as tensors of
dimension 1 and type n. Notice that they can also be considered as tensors of dimen-
sion 2 and type 1 × n, or tensors of dimension 3 and type 1 × 1 × n, etc.
Similarly, n × m matrices are tensors of dimension 2 but they can also be seen as
tensors of dimension 3 and type 1 × n × m, etc.
Elements of K can be seen as tensors of dimension 0.
As a generalization of what we can do with matrices, we mention the following
easy fact.
Proposition 6.1.8 When K is a field, the set of all tensors of fixed dimension n and
type a1 × · · · × an is a vector space where the operations are defined over elements
with corresponding multi-indices.
This space, whose dimension is the product a1 · · · an , will be denoted by K a1 ,...,an .
One basis for this vector space is obtained by considering all the multidimensional

matrices with a 1 in precisely one place and a zero in every other place. If that unique
1 is in the position (i 1 , . . . , i n ), we refer to that basis vector as e(i1 ,...,in ) .
The null element of a space of tensors is the tensor having all entries equal to 0.
Now that we have established our convention about how the entries in a multidi-
mensional array can be thought of, it remains to be precise about how a multidimen-
sional array gives us a multi-linear map.
So, suppose we have a tensor T of dimension n and type a1 ×
· · · × an. Let A = (αi1,i2,...,in), where 1 ≤ ij ≤ aj, be the multidimensional array
which represents this tensor. We want to use A to define a multi-linear map

T : K a1 × · · · × K an → K

whose multidimensional matrix representation is precisely A. Let vj ∈ K aj, where
vj has coordinates (vj,1, . . . , vj,aj) with respect to the standard basis for K aj. Then
define

T(v1, v2, . . . , vn) = Σ_{i1,...,in} αi1,i2,...,in · v1,i1 · v2,i2 · · · vn,in.

Now if {e^[j]_i | 1 ≤ i ≤ aj} is the standard basis for K aj, then it is easy to
see that

T(e^[1]_{i1}, . . . , e^[n]_{in}) = αi1,i2,...,in.

Since a multi-linear map is determined by its values on the tuples of standard basis
vectors (e^[1]_{i1}, . . . , e^[n]_{in}), and T is the unique multi-linear map with values equal to
the entries of the multidimensional matrix A, we are done.
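The evaluation of the multi-linear map attached to an array amounts to the sum just written; the quick check below is our own sketch (the array and vectors are arbitrary toy data).

import numpy as np

A = np.arange(8).reshape(2, 2, 2)         # the array of a 2 x 2 x 2 tensor
v1, v2, v3 = np.array([1., 2.]), np.array([0., 1.]), np.array([3., -1.])

# the multi-linear map attached to A: sum over all multi-indices
value = np.einsum('ijk,i,j,k->', A, v1, v2, v3)
print(value)

# sanity check on basis vectors: T(e_i, e_j, e_k) returns the entry A[i,j,k]
e = np.eye(2)
assert np.einsum('ijk,i,j,k->', A, e[1], e[0], e[1]) == A[1, 0, 1]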

6.2 The Tensor Product

Besides the natural operations (addition and scalar multiplication) between tensors of
the same type, there is another operation, the tensor product, which combines tensors
of any type. This tensor product is fundamental for our analysis of the properties of
tensors.
The simplest way to define the tensor product is to think of tensors as multi-linear
maps. With that in mind, we make the following definition.
 
Definition 6.2.1 Let T ∈ K a1,...,an, U ∈ K a′1,...,a′m be tensors. We define the tensor
product T ⊗ U as the tensor W ∈ K a1,...,an,a′1,...,a′m such that,

if vi ∈ K ai and wj ∈ K a′j, then

W(v1, . . . , vn, w1, . . . , wm) = T(v1, . . . , vn) U(w1, . . . , wm).



We extend this definition to consider more factors. So, for any finite collection
of tensors T j ∈ K a j1 ,...,a jn j , j = 1, . . . , m, one can define their tensor product as the
tensor
W = T1 ⊗ · · · ⊗ Tm ∈ K a11 ,...,a1n1 ,...,am1 ,...,amnm

such that

W (i 11 , . . . , i 1n 1 , . . . , i m1 , . . . , i mn m ) = T1 (i 11 , . . . , i 1n 1 ) · · · Tm (i m1 , . . . , i mn m ).

This innocent looking definition actually contains some new and wonderful ideas.
The following examples will illustrate some of the things that come from the defi-
nition. The reader should keep in mind how different this multiplication is from the
usual multiplication that we know for matrices.

Example 6.2.2 Given 2 one-dimensional tensors v and w of type m and n, respec-


tively, we write v = (α1 , . . . , αm ) ∈ K m and w = (β1 , . . . , βn ) ∈ K n . Then v defines
a linear map (which we’ll also call v)


v : K m → K defined by: v(x1, . . . , xm) = Σ_{i=1}^m αi xi

and w a linear map (again abusively denoted w)


w : K n → K defined by: w(y1, . . . , yn) = Σ_{i=1}^n βi yi.

By definition, the tensor product v ⊗ w is the bilinear map:

v ⊗ w : Km × Kn → K

defined by


v ⊗ w : ((x1, . . . , xm), (y1, . . . , yn)) → (Σ_{i=1}^m αi xi)(Σ_{j=1}^n βj yj).

If we let {e1, . . . , em} be the standard basis for K m and {e′1, . . . , e′n} be the standard
basis for K n, then

v ⊗ w : (ei, e′j) → αi βj

and so the matrix for this bilinear form is v^t w.
To give a very specific example of this, let v = (1, 2) ∈ R2 and w = (2, −1, 3) ∈ R3.
Then:

v ⊗ w = v^t w = ( 2 −1 3 )
                ( 4 −2 6 )

We could just as well have considered the tensor w ⊗ v. In the specific example
we just considered, notice that
w ⊗ v = w^t v = ( 2  4 )
                (−1 −2 )
                ( 3  6 )  = (v^t w)^t.

We see here that the tensor product is not commutative. In fact, the two multipli-
cations did not even give us tensors of the same type.

Example 6.2.3 Let’s now consider a slightly more complicated example. This time
we will take the tensor product of v, a one-dimensional tensor of type 2, and multiply
it by w, a two-dimensional tensor of type 2 × 2. We can represent v by a 1 × 2 matrix
and w by a 2 × 2 matrix. So, let

v = (2, −3) ∈ R2 and w = ( 2 −1 )
                         ( 4  3 ).

Then v defines a linear map

v : K 2 → K given by v(x1 , x2 ) = 2x1 − 3x2

and w defines a bilinear map



w : K 2 × K 2 → K given by w((y1, y2), (z1, z2)) = (y1  y2) w (z1, z2)^t =

= 2y1z1 + 4y2z1 − y1z2 + 3y2z2.

Putting these all together we have a trilinear form,

v ⊗ w : (K 2 ) × (K 2 × K 2 ) → K

defined by

v ⊗ w((x1 , x2 ), (y1 , y2 ), (z 1 , z 2 )) = (2x1 − 3x2 )(2y1 z 1 + 4y2 z 1 − y1 z 2 + 3y2 z 2 ) =

= 4x1 y1 z 1 + 8x1 y2 z 1 − 2x1 y1 z 2 + 6x1 y2 z 2 − 6x2 y1 z 1 − 12x2 y2 z 1 + 3x2 y1 z 2 − 9x2 y2 z 2 .

From this, we express v ⊗ w as a 2 × 2 × 2 multidimensional matrix, namely,



−2 6

4 8
v⊗w =
3 −9

−6 −12

On the other hand,

w ⊗ v((x1 , x2 ), (y1 , y2 ), (z 1 , z 2 )) = w((x1 , x2 ), (y1 , y2 ))v(z 1 , z 2 ) =

(2x1 y1 + 4x2 y1 − x1 y2 + 3x2 y2 )(2z 1 − 3z 2 ) =

= 4x1 y1 z 1 + 8x2 y1 z 1 − 2x1 y2 z 1 + 6x2 y2 z 1 − 6x1 y1 z 2 − 12x2 y1 z 2 + 3x1 y2 z 2 − 9x2 y2 z 2 .

So, the multidimensional array for w ⊗ v is


−6 3

4 −2
w⊗v =
−12 −9

8 6
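The computations of this example can be reproduced mechanically. The sketch below is our own illustration; it also exhibits the failure of commutativity of the tensor product.

import numpy as np

v = np.array([2, -3])
w = np.array([[2, -1],
              [4,  3]])

vw = np.einsum('i,jk->ijk', v, w)   # v (x) w, of type 2 x 2 x 2
wv = np.einsum('jk,i->jki', w, v)   # w (x) v, of type 2 x 2 x 2

print(vw[0, 0, 0], wv[0, 0, 0])     # 4 4
print(np.array_equal(vw, wv))       # False: the tensor product is not commutative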

Example 6.2.4 Observe that if T, U are n × n matrices, the tensor product T ⊗ U


does not coincide with their row-by-column product. The tensor product of these
two matrices is a tensor of dimension 4, of type n × n × n × n.

As we just noted, the tensor product does not define an internal operation in the
spaces of tensors of the same dimension and same type. It is possible, however,
to define something called the tensor algebra on which the tensor product behaves
like a product. We will just give the definition of the tensor algebra, but won’t have
occasion to use it in this text.

Definition 6.2.5 Let K be a field. The tensor algebra over the space K n is the direct
sum
T (n) = K ⊕ K n ⊕ K n,n ⊕ · · · ⊕ K n,...,n ⊕ · · ·

The tensor product defines a homogeneous operation inside T (n).



Remark 6.2.6 It is an easy (but messy) consequence of our definition that the tensor
product is an associative product, i.e., if T, U, V are tensors, then

T ⊗ (U ⊗ V ) = (T ⊗ U ) ⊗ V.

Notice that the tensor product is not, in general, a commutative product (see
Example 6.2.3 above). Indeed, in that example we saw that even the spaces in which
T ⊗ U and U ⊗ T lie can be different.

Remark 6.2.7 The tensor product of tensors has the following properties: for any
T, T′ ∈ K a1,...,an, U, U′ ∈ K a′1,...,a′m and λ ∈ K, one has
• T ⊗ (U + U′) = T ⊗ U + T ⊗ U′;
• (T + T′) ⊗ U = T ⊗ U + T′ ⊗ U;
• (λT) ⊗ U = T ⊗ (λU) = λ(T ⊗ U).
This can be expressed by saying that the tensor product is linear over the two factors.

More generally, the tensor product defines a map

K a11 ,...,a1n1 × · · · × K am1 ,...,amnm −→ K a11 ,...,a1n1 ,...,am1 ,...,amnm

which is linear in any factor. For this reason, we say that the tensor product is a
multi-linear product in its factors.
The following useful proposition holds for the tensor product.

Proposition 6.2.8 (Vanishing Law) Let T, U be tensors. Then:

- If T = 0 or U = 0, then T ⊗ U = 0.
- Conversely, if T ⊗ U = 0, then either T = 0 or U = 0.

Proof Assume T ∈ K a1,...,an, U ∈ K a′1,...,a′m.
If T = 0, then for any choice of the indices i1, . . . , in, j1, . . . , jm one has

(T ⊗ U)i1,...,in,j1,...,jm = Ti1,...,in · Uj1,...,jm = 0 · Uj1,...,jm = 0.

A similar computation holds when U = 0.
Conversely, if T ≠ 0 and U ≠ 0, then there exist two sets of indices, i1, . . . , in
and j1, . . . , jm, such that Ti1,...,in ≠ 0 and Uj1,...,jm ≠ 0. Thus

(T ⊗ U)i1,...,in,j1,...,jm = Ti1,...,in · Uj1,...,jm ≠ 0. □

The bilinear map

K a1,...,an × K a′1,...,a′m → K a1,...,an,a′1,...,a′m

determined by the tensor product is not injective (as the Vanishing Law clearly
shows). However, we can characterize the tensors T, T′ ∈ K a1,...,an and U, U′ ∈ K a′1,...,a′m
such that T ⊗ U = T′ ⊗ U′.
 
Proposition 6.2.9 Let T, T′ ∈ K a1,...,an and U, U′ ∈ K a′1,...,a′m satisfy

T ⊗ U = T′ ⊗ U′ ≠ 0.

Then there exists a nonzero scalar α ∈ K such that T′ = αT and U′ = (1/α)U.
In particular, if U = U′, then T = T′ (and conversely).

Proof Put Z = T ⊗ U = T′ ⊗ U′. Since Z ≠ 0, there exists a choice of indices such
that

Zi1,...,in,j1,...,jm = Ti1,...,in · Uj1,...,jm = T′i1,...,in · U′j1,...,jm ≠ 0.

Thus T′i1,...,in ≠ 0.
Let

β = Ti1,...,in / T′i1,...,in.

Since β ≠ 0, it is easy to show that

U′k1,...,km = (Ti1,...,in / T′i1,...,in) Uk1,...,km = β Uk1,...,km,

for all k1, . . . , km, i.e., U′ = βU.
Similarly, since U′j1,...,jm ≠ 0, we can let α ≠ 0 be the quotient Uj1,...,jm / U′j1,...,jm.
As above, one shows that T′ = αT.
Finally, by multi-linearity, we get Z = T′ ⊗ U′ = (αβ)(T ⊗ U). Hence αβ = 1,
i.e., β = 1/α.
The final statement of the proposition follows from the previous one, since in that
case α = 1. □

By using the associativity of the tensor product and slightly modifying the proof
of the preceding proposition one can prove, by induction on the number of factors,
the following result:

Proposition 6.2.10 Let T1, U1 ∈ K a11,...,a1n1, . . . , Ts, Us ∈ K as1,...,asns satisfy

T1 ⊗ T2 ⊗ · · · ⊗ Ts = U1 ⊗ U2 ⊗ · · · ⊗ Us ≠ 0.

Then there exist nonzero scalars α1, . . . , αs ∈ K such that Ui = αi Ti for all i, and
moreover α1 · · · αs = 1.

Remark 6.2.11 We mentioned above that the tensor product of two bilinear forms,
represented by matrices M and N , respectively, doesn’t correspond to the product of

the two matrices M and N . Indeed, in most cases, we cannot even take the product
of the two matrices!
However, when M is an n × m matrix and N is an m × s matrix we can form
their product as matrices and also form their tensor product. It turns out that there is
a relation between these two objects.
The tensor product is an element of the vector space K n × K m × K m × K s while
the matrix product can be considered as an element of K n × K s . How can we recover
the regular product from the tensor product?
Now the tensor product is the tensor Q of dimension 4 and type (n, m, m, s), such
that Q(i, j, k, l) = M(i, j)N (k, l). The row-by-column product of M, N is obtained
by sending Q to the matrix Z ∈ K n,s defined by

Z(i, l) = Σ_j Q(i, j, j, l).

So, the ordinary matrix product is obtained, in this case, by taking the tensor
product and following that by a projection onto the space K n × K s = K n,s .
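The contraction just described is a one-line computation; the sketch below is ours (with arbitrary integer matrices), recovering the row-by-column product from the tensor product.

import numpy as np

M = np.arange(6).reshape(2, 3)       # a 2 x 3 matrix
N = np.arange(12).reshape(3, 4)      # a 3 x 4 matrix

Q = np.einsum('ij,kl->ijkl', M, N)   # the tensor product, of type 2 x 3 x 3 x 4
Z = np.einsum('ijjl->il', Q)         # contract the two middle indices

assert np.array_equal(Z, M @ N)      # this recovers the row-by-column product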

6.3 Rank of Tensors

In the next two sections we generalize, to tensors of any dimension, a definition


which is basic in the theory of matrices, namely the notion of the rank of a matrix.
To find the appropriate generalization to tensors, we will have to choose among
the many equivalent ways one can define the rank of a matrix. It turns out that it is not
convenient to choose, as the definition of rank, its characterization as the dimension
of either the row space or the column space of a matrix. We will use, instead, a
characterization of the rank of a matrix which is probably less familiar to the reader,
but which turns out to be perfect for a generalization to arbitrary tensors. The starting
point is a simple characterization of matrices of rank 1.

Proposition 6.3.1 Let M = (m i j ) be a nonzero m × n matrix with coefficients in a


field K. M has rank 1 if and only if there are nonzero vectors v ∈ K m, w ∈ K n
such that
M = v ⊗ w = v t w.

Proof Assume that v and w exist. Since M = v t w every row of M is a multiple of


w and so the row space of M has dimension 1 and hence the rank of M is 1.
Conversely, if the rank of M is 1 then every row of M is a multiple of some nonzero
vector, which we will call w. I.e. the ith row of M is ci w. If we set v = (c1 , . . . , cm )
then clearly M = v t w = v ⊗ w. 

Thus, one can define matrices of rank 1 in terms of the tensor product of vectors.

Although the rank of a matrix M is usually defined as the dimension of either the
row space or column space of M, we now give a neat characterization of rank(M)
in terms of matrices of rank 1.

Proposition 6.3.2 Let M ≠ 0 be an m × n matrix. Then the rank of M is equal to
the smallest integer r such that M is a sum of r matrices of rank 1.

Proof Assume M = M1 + · · · + Mr , where every Mi has rank 1. Then we may write


Mi = (vi )t wi where vi ∈ K m , wi ∈ K n . Form the matrix A whose columns are the
vectors vit , and the matrix B whose rows are the vectors wi . It is easy to see that

M = AB

and so the rows of M are linear combinations of the rows of B. Since B has only r
rows we obtain that rank(M) ≤ r .
Conversely, assume that M has rank r . Then we can find r linearly inde-
pendent vectors in K n which generate the row space of M. Call those vectors
w1 , . . . , wr . Suppose that the ith row of M is ci,1 w1 + · · · + ci,r wr . Form the vector
vi = (ci,1, . . . , ci,r) and construct a matrix A whose ith column is vi^t. If B is the
matrix whose jth row is wj, then M = AB = Σ_{i=1}^r vi^t wi is a sum of r matrices of
rank 1 and we are done. □
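Over the real numbers, one concrete way to produce such a sum of rank 1 matrices is the singular value decomposition; the following sketch is ours (only one of many possible rank decompositions) and illustrates Proposition 6.3.2 numerically.

import numpy as np

rng = np.random.default_rng(1)
M = rng.random((4, 5)) @ rng.random((5, 3)) @ rng.random((3, 6))  # rank <= 3

U, s, Vt = np.linalg.svd(M)
r = int(np.sum(s > 1e-10))                      # numerical rank of M
summands = [s[i] * np.outer(U[:, i], Vt[i]) for i in range(r)]

assert r == 3
assert np.allclose(M, sum(summands))            # M as a sum of r rank 1 matrices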

The two previous results on matrices allow us to extend the definition of rank to
tensors of any type.

Definition 6.3.3 A nonzero tensor T ∈ K a1,...,an has rank 1 if there are vectors vi ∈
K ai such that T = v1 ⊗ · · · ⊗ vn (since the tensor product is associative, there is no
need to specify the order in which the tensor products in the formula are performed).
We define the rank of a nonzero tensor T to be the minimum r such that there
exist r tensors T1 , . . . , Tr of rank 1 with

T = T1 + · · · + Tr . (6.3.1)

Remark 6.3.4 A tensor of rank 1 is also called a simple or decomposable tensor.


For any tensor T of rank r , the expression (6.3.1) is called a (decomposable)
decomposition of T .
We will sometimes just refer to the decomposable decomposition of T as a decom-
position of T or a rank decomposition of T .

By convention we say that null tensors, i.e., tensors whose entries are all 0, have
rank 0.

Remark 6.3.5 Let T be a tensor of rank 1 and let α ∈ K, α ≠ 0. Then, using the
multi-linearity of the tensor product, we see that αT also has rank 1. More generally,
if T has rank r, then αT also has rank r. Then (exactly as for matrices), the union
of the null tensor with all the tensors in K a1,...,an of rank r is closed under scalar
multiplication.

Subsets of vector spaces that are closed under scalar multiplication are called
cones. Thus the set of tensors in K a1 ,...,an of fixed rank (plus 0) is a cone.
On the other hand (again exactly as happens for matrices), in general the sum of
two tensors in K a1 ,...,an of rank r need not have rank r . Thus, the set of tensors in
K a1 ,...,an having fixed rank (union the null tensor) is not a subspace of K a1 ,...,an .

6.4 Tensors of Rank 1

In this section, we give a useful characterization of tensors of rank 1. There exists a


generalization for matrices of higher rank but, unfortunately, there does not exist a
similar characterization for tensors of higher rank and having dimension ≥ 3.
Recall that we are using the notation [i] = {1, 2, . . . , i − 1, i}.

Definition 6.4.1 Let 0 < m ≤ n be integers. An injective nondecreasing function

f : [m] → [n]

is a function with the property that

whenever a, b ∈ [m] and a < b then f (a) < f (b) .

With this technical definition made we are now able to define the notion of a
subtensor of a given tensor.

Definition 6.4.2 Let T be a tensor in K a1,...,an. We consider T as a map

T : [a1] × · · · × [an] → K.

For any choice of positive integers a′j ≤ aj (1 ≤ j ≤ n) and for any choice of
injective, nondecreasing maps fj : [a′j] → [aj], we define the tensor T′ ∈ K a′1,...,a′n
as follows:

T′ : [a′1] × · · · × [a′n] → K

where

T′i1,...,in = Tf1(i1),...,fn(in).

Remark 6.4.3 This is a formal (and perhaps a bit odd) way to say that we are fixing
a few values for the indices i1, . . . , in and forgetting the elements of T whose kth
index is not in the range of the map fk.
Since we usually think of a tensor of type 1 × a2 × · · · × an as a tensor of type
a2 × · · · × an, when a′k = 1 we simply forget the kth index in T′. In this case, the
dimension of T′ is n − m, where m is the number of indices for which a′k = 1.

Example 6.4.4 A 3 × 2 × 2 tensors T can be denoted as follows:


T112 T122

T111 T121

T212 T222
T =
T211 T221

T312 T322

T311 T321

and an instance is
1 0

−2 4

2 3
T =
0 1

−3 4

2 1

If one takes the maps f 2 = f 3 = identity, f 1 : [2] → [3] defined as f 1 (1) =


1, f 1 (2) = 3, then the corresponding subtensor is
1 0

−2 4

T′ =
−3 4

2 1

i.e. one just cancels the layer corresponding to the elements whose first index is 2.
If, instead, one takes f 2 = f 3 = identity, f 1 : [1] → [3] defined as f 1 (1) = 1,
then one gets the matrix in the top face:

−2 4
T′ =
1 0

Definition 6.4.5 A subtensor of T of dimension 2 is called a submatrix of T. Note
that any 2 × 2 submatrix of T is a 2 × 2 matrix inside T (considered as a multidimensional
array) which is parallel to one of the faces of the array. So, for instance, in
Example 6.4.4, the array

T112 T122
T211 T221

is not a submatrix of T.

Proposition 6.4.6 If T has rank 1 and T′ is a subtensor of T, then either T′ is the
null tensor or T′ has rank 1.
In particular, if T has rank 1, then the determinant of any 2 × 2 submatrix of T
vanishes.

Proof Assume that T ∈ K a1,...,an has rank 1. Then there exist vectors vi ∈ K ai such
that T = v1 ⊗ · · · ⊗ vn. Eliminating from T the elements whose kth index has some
value q corresponds to eliminating the qth component in the vector vk. Thus, the
corresponding subtensor T′ is the tensor product of the vectors v′1, . . . , v′n, where v′i =
vi if i ≠ k, and v′k is the vector obtained from vk by eliminating the qth component.
Thus T′ has rank ≤ 1 (it has rank 0 if v′k = 0). For a general subtensor T′ ∈ K a′1,...,a′n
of T, we obtain the result arguing step by step, by deleting each time one value for
one index of T, i.e., arguing by induction on (a1 + · · · + an) − (a′1 + · · · + a′n).
The second claim in the statement of the theorem is immediate from what we
have just said and the fact that a 2 × 2 matrix of rank 1 has determinant 0. □

Corollary 6.4.7 The rank of a subtensor of T cannot be bigger than the rank of T.

Proof If T has rank 1, the claim follows from Proposition 6.4.6. For tensors T of
higher rank r, the claim follows since if T = T1 + · · · + Tk, with each Ti of rank 1, then a
subtensor T′ of T is equal to T′1 + · · · + T′k, where T′i is the subtensor of Ti obtained
by eliminating all the elements corresponding to elements of T eliminated in the
passage from T to T′. Thus, by Proposition 6.4.6, each T′i is either 0 or it has rank 1,
and the claim follows. □

Example 6.4.8 Recall that a nonzero matrix has rank 1 if and only if all of its 2 × 2
submatrices have determinant equal to zero. This is not true for tensors of dimension
greater than 2, as the following example shows. Recall our earlier warning about the
subtle differences between matrices and tensors of dimension greater than 2.
Consider the 2 × 2 × 2 tensor T , defined by

T1,1,1 = 0 T1,2,1 = 0 T2,1,1 = 1 T2,2,1 = 0 (front face)


T1,1,2 = 0 T1,2,2 = 1 T2,1,2 = 0 T2,2,2 = 0 (back face) .

0 1

0 0
T =
0 0

1 0

It is clear that all the 2 × 2 faces of T have determinant equal to 0. On the
other hand, if T has rank 1, i.e., T = (α1, α2) ⊗ (β1, β2) ⊗ (γ1, γ2), then T2,1,1 =
α2β1γ1 = 1 ≠ 0, which implies α2, β1, γ1 ≠ 0. Then T2,1,2 = T1,1,1 = T2,2,1 = 0
implies α1 = β2 = γ2 = 0. But then 1 = T1,2,2 = α1β2γ2 = 0 yields a contradic-
tion.

We want to find a set of conditions which describe the set of all tensors of rank 1.
To this aim, we need to introduce some new piece of notation.
Notation. Recall that we denote by [n] the set {1, . . . , n}.
Fix a subset J ⊂ [n]. Then for any fixed pair of multi-indices I1 = (k1 , . . . , kn )
and I2 = (l1 , . . . , ln ), we denote by J (I1 , I2 ) the multi-index (m 1 , . . . , m n ) where

kj if j ∈ J,
mj =
lj otherwise.

Example 6.4.9 Let n = 4 and set J = {2, 3} ⊂ [4]. Consider the two multi-indices
I1 = (1, 3, 3, 2) and I2 = (2, 1, 3, 4). Then J(I1, I2) = (2, 3, 3, 4). Notice that if
J′ = [n] \ J = {1, 4}, then J′(I1, I2) = (1, 1, 3, 2).

Remark 6.4.10 If T has rank 1, then for any pair of multi-indices I1 = (k1, . . . , kn)
and I2 = (l1, . . . , ln) and for any subset J ⊂ [n], the entries of T satisfy:

TI1 TI2 = TJ(I1,I2) TJ′(I1,I2)     (6.4.1)

where J′ = [n] \ J.
To see why this is so, recall that since T has rank 1 we can write T = v1 ⊗ · · · ⊗ vn,
with vi = (vi1, vi2, . . . ). In this case both of the products in (6.4.1) are equal to

v1k1 v1l1 · · · vnkn vnln

and so the result is obvious.

Remark 6.4.11 When I1 , I2 differ only in two indices, the equality (6.4.1) simply
says that the determinant of a 2 × 2 submatrix of T is 0.

Example 6.4.12 Look back at Example 6.4.8, and notice that if one takes I1 =
(1, 1, 1), I2 = (2, 2, 2) and J = {1} ⊂ [3], then J(I1, I2) = (1, 2, 2) and J′(I1, I2) =
(2, 1, 1), so that formula (6.4.1) does not hold, since

TI1 TI2 = 0 ≠ 1 = TJ(I1,I2) TJ′(I1,I2).

Theorem 6.4.13 A tensor T ≠ 0 of dimension n has rank 1 if and only if it satisfies
all the equalities (6.4.1), for any choice of multi-indices I1, I2, and J ⊂ [n].

Proof Thanks to Remark 6.4.10, we need only prove that if all the equalities (6.4.1)
hold, then T has rank 1.
Let us argue by induction on the dimension n of T ∈ K a1 ,...,an . The case n = 2 is
well known: a matrix has rank 1 if and only if all its 2 × 2 minors vanish.
For n > 2, pick an entry TI1 = Tk1,...,kn ≠ 0 in T.
Let J1 = {1} ⊂ [a1] and let f1 : J1 → [a1] be defined by f1(1) = k1.
For 2 ≤ i ≤ n, let fi = identity. Let T′ be the subtensor corresponding to these
data. T′ is a tensor of dimension n − 1 and hence satisfies the equalities (6.4.1). By
induction, we obtain that rank(T′) = 1, so there are vectors v2, . . . , vn such that, for
any choice of i2, . . . , in, one gets

Tk1,i2,...,in = T′i2,...,in = v2i2 · · · vnin.     (a)

For all m ∈ [a1] define the number

pm = Tm,k2,...,kn / Tk1,k2,...,kn.     (b)

We use these numbers to define the vector v1 = (p1, . . . , pa1).


We now claim that T = v1 ⊗ v2 ⊗ · · · ⊗ vn .
Indeed, for any I2 = (l1, . . . , ln), by setting J = {1}, and hence J′ = {2, . . . , n},
one obtains from the equalities (6.4.1) that

TI1 TI2 = TJ(I1,I2) TJ′(I1,I2) = Tk1,l2,...,ln Tl1,k2,...,kn = Tk1,l2,...,ln · pl1 Tk1,k2,...,kn.

Using the terms at the beginning and end of this string of equalities and also taking
into account (a) and (b) above, we obtain

v2k2 · · · vnkn TI2 = v2l2 · · · vnln · v1l1 · v2k2 · · · vnkn.

Since TI1 ≠ 0, and hence v2k2, . . . , vnkn ≠ 0, we can divide both sides of this
equality by v2k2 · · · vnkn and finally get

TI2 = v1l1 · v2l2 · · · vnln,

which proves the claim. □



Observe that the rank 1 analysis of a tensor reduces to checking whether finitely many
flattening matrices have rank 1 (see Definition 8.3.4 and Proposition 8.3.7), and
this can be accomplished with Gaussian elimination as well, without the need to
compute all 2 × 2 minors.
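A sketch of this flattening test follows; it is our own illustration (it assumes NumPy and uses a numerical tolerance in place of exact arithmetic).

import numpy as np

def has_rank_one(T, tol=1e-10):
    # the flattening test: T != 0 has rank 1 iff every flattening
    # (unfolding along one index) is a rank 1 matrix
    if not np.any(T):
        return False                      # the null tensor has rank 0
    for axis in range(T.ndim):
        F = np.moveaxis(T, axis, 0).reshape(T.shape[axis], -1)
        if np.linalg.matrix_rank(F, tol=tol) > 1:
            return False
    return True

a, b, c = np.array([1., 2.]), np.array([3., 1.]), np.array([1., -1.])
T = np.einsum('i,j,k->ijk', a, b, c)
print(has_rank_one(T))                    # True
T[0, 0, 0] += 1
print(has_rank_one(T))                    # False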
The equations corresponding to the equalities (6.4.1) determine a set of polyno-
mial (quadratic) equations, in the space of tensors K a1 ,...,an , which describe the locus
of decomposable tensors (interestingly enough, it turns out that in many cases this
set of equations is not minimal).
In any event, Theorem 6.4.13 provides a finite procedure which allows us to decide
if a given tensor has rank 1 or not. We simply plug the coordinates of the given tensor
into the equations we just described and see if all the equations vanish or not.
Moreover, in Theorem 6.4.13 it is enough to take subsets J given by one single
element, and indeed n − 1 of them suffice. Unfortunately, as the dimension
grows, the number of operations required in the algorithm rapidly becomes quite
large!
Recall that for matrices there is a much simpler method for calculating the rank
of the matrix: one uses Gaussian reduction to find out how many nonzero rows
that reduction has. That number is the rank. We really don’t have to calculate the
determinants of all the 2 × 2 submatrices of the original matrix.
There is nothing like the simple and well-known Gaussian reduction algorithm
(which incidentally calculates the rank for a tensor of dimension 2) for calculating
the rank of tensors of dimension greater than 2. All known procedures for calculating
the rank of such a tensor quickly become ineffective.
There are many other ways in which the behavior of rank for tensors having
dimension greater than 2 differs considerably from the behavior of rank for matrices
(tensors of dimension exactly 2). For example, although a matrix of size m × n (a two-
dimensional tensor of type (m, n)) cannot have rank which exceeds the minimum
of m and n, tensors of type a1 × · · · × an (for n > 2) may have rank bigger than
max{ai }. Although the general matrix of size m × n has rank = min{m, n} (the
maximum possible rank) there are often special tensors of a given dimension and
type whose rank is bigger than the rank of a general tensor of that dimension and
type.
The attempt to get a clearer picture of how rank behaves for tensors of a given
dimension and type has many difficult problems associated to it. Is there some nice
geometric structure for the set of tensors having a given rank? Is the set of tensors
of prescribed rank not empty? What is the maximum rank for a tensor of given
dimension and type? These questions, and several variants of them, are the subject
of research for many mathematicians and other scientists today.
We conclude this section with some examples which illustrate that although there
is no algorithm for finding the rank of a given tensor, one can sometimes decide,
using ad hoc methods, exactly what is the rank of the tensor.

Example 6.4.14 The following tensor of type 2 × 2 × 2 has rank 2:


4 0

3 −1

2 −6

3 −5

Indeed it cannot have rank 1, because some of its 2 × 2 submatrices have deter-
minant different from 0. T has rank 2 because it is the sum of two tensors of rank 1
(one can check, using the algorithm, that the summands have rank 1):
        2  2                 2 −2
T1 =    1  1        T2 =     2 −2
       −2 −2                 4 −4
       −1 −1                 4 −4

with T = T1 + T2 (each summand is drawn, as usual, with the rows of the back face
above the corresponding rows of the front face).

Example 6.4.15 The tensor


2 3

1 3
D=
0 4

0 2

has rank 3, i.e., one cannot write D as a sum of two tensors of rank 1. Let us see why.
Let us assume that D is the sum of two tensors T = (Tijk) and T′ = (T′ijk) of rank 1
and let us try to derive a contradiction from that assumption.
Notice that the vector (D211, D212) = (0, 0) would have to be equal to the sum
of the vectors (T211, T212) + (T′211, T′212). Consequently, the two vectors (T211, T212)
and (T′211, T′212) are opposite to each other and hence span a subspace W ⊂ K 2 of
dimension ≤ 1.

If one (hence both) of these vectors is nonzero, then the vectors (T111, T112),
(T221, T222), and (T′111, T′112), (T′221, T′222) would also have to belong to W because all
the 2 × 2 determinants of T and T′ vanish. But notice that (T121, T122) and (T′121, T′122)
must also belong to W by Remark 6.4.10 (take J = {3} ⊂ [3]).
It follows that both vectors (D111, D112) = (1, 2) and (D121, D122) = (3, 3) must
belong to W. This is a contradiction, since dim(W) ≤ 1 and (1, 2), (3, 3) are linearly
independent.
So, we are forced to the conclusion that (T211, T212) = (T′211, T′212) = (0, 0). Since
the sum of (T111, T112) and (T′111, T′112) is (1, 2) ≠ (0, 0), we may assume that one
of them, say (T111, T112), is nonzero. As T has rank 1, there exists a ∈ K such that
(T221, T222) = a(T111, T112) (we are again using Remark 6.4.10).
Now, the determinant of the front face of the tensor T is 0, i.e.,

0 = T111T221 − T121T211.

Since T211 = 0 and T221 = aT111, we get 0 = aT111². Doing the same argument on the
back face of the tensor T, we get 0 = aT112². Since (T111, T112) ≠ (0, 0), it follows
that a = 0, and so the bottom face of the tensor T consists only of zeroes.


 
It follows that (T′221, T′222) = (2, 4). Since the tensor T′ has rank 1, it follows that
the vector (T′111, T′112) is also a multiple of (2, 4), as is the vector (T′121, T′122).
Since (T111, T112) = (1, 2) − (T′111, T′112), and both (1, 2) and (T′111, T′112) are mul-
tiples of (2, 4), it follows that the vector (T111, T112) (which we assumed was not 0) is
also a multiple of (2, 4). Thus, since the tensor T has rank 1, the vector (T121, T122)
is also a multiple of (2, 4). Since we already noted that (T′121, T′122) is a multiple
of (2, 4), it follows that the vector (3, 3) = (T121, T122) + (T′121, T′122) is a multiple
of (2, 4), which is the final contradiction.
Notice that a decomposition of the tensor is given by

      0 0            2 3            0 0
      1 3      +     0 0      +     0 0
      0 0            0 0            0 4
      0 0            0 0            0 2

(the three summands have rank 1; each is drawn, as usual, with the rows of the back
face above the corresponding rows of the front face).
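The decomposition can be checked numerically; in the sketch below (our own verification) the three summands are exactly those displayed above, with indices shifted from the book's 1-based convention to NumPy's 0-based one.

import numpy as np

D = np.zeros((2, 2, 2))
# the nonzero entries of D
D[0, 0, 0], D[0, 1, 0] = 1, 3             # D111, D121
D[0, 0, 1], D[0, 1, 1] = 2, 3             # D112, D122
D[1, 1, 0], D[1, 1, 1] = 2, 4             # D221, D222

r1 = lambda a, b, c: np.einsum('i,j,k->ijk',
                               np.array(a), np.array(b), np.array(c))
A = r1((1, 0), (1, 3), (1, 0))            # first summand: front top row (1, 3)
B = r1((1, 0), (2, 3), (0, 1))            # second summand: back top row (2, 3)
C = r1((0, 1), (0, 1), (2, 4))            # third summand: bottom rows (0,2), (0,4)
assert np.allclose(D, A + B + C)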

6.5 Exercises

Exercise 7 Give a graphical representation of the following tensors:


(a) T is the three-dimensional tensor of type 3 × 2 × 2 whose (i, j, k) entry is equal
to i · ( j + k);

(b) T is the three-dimensional tensor of type 2 × 3 × 2 whose (i, j, k) entry is equal


to i · ( j + k);
(c) T is the three-dimensional tensor of type 2 × 2 × 3 whose (i, j, k) entry is equal
to i · ( j + k).

Exercise 8 Find two tensors of rank 1 whose sum is the 2 × 2 × 2 tensor v ⊗ w of


Example 6.2.3.

Exercise 9 Show that the tensor:


3 3

2 2

6 3

T = 4 2

9 3

6 2

is the tensor product of a matrix times the vector (2, 3).

Exercise 10 Prove that the tensor T of Exercise 9 has rank 2.


Chapter 7
Symmetric Tensors

In this chapter, we make a specific analysis of the behavior of symmetric tensors
with respect to rank and decomposition.
We will see, indeed, that besides their utility in understanding some models of random
systems, symmetric tensors have a relevant role in the study of the algebra and the
computational complexity of polynomials.

7.1 Generalities and Examples

Definition 7.1.1 A cubic tensor is a tensor of type a1 × · · · × an where all the ai ’s


are equal, i.e., a tensor of type d × · · · × d (n times).
We say that a cubic tensor T is symmetric if for any multi-index (i1, . . . , in) and
for any permutation σ of {1, . . . , n}, it satisfies

Ti1,...,in = Tiσ(1),...,iσ(n).

Example 7.1.2 When T is a square matrix, then the condition for the symmetry of
T simply requires that Ti, j = T j,i for any choice of the indices. In other words, our
definition of symmetric tensor coincides with the plain old definition of symmetric
matrix, when T has dimension 2.
If T is a cubic tensor of type 2 × 2 × 2, then T is symmetric if and only if the
following equalities hold:

T1,1,2 = T1,2,1 = T2,1,1
T2,2,1 = T2,1,2 = T1,2,2


An example of a 2 × 2 × 2 symmetric tensor is the following:

3 2

1 3

2 0

3 2
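Symmetry is easy to test mechanically; the sketch below is our own illustration, comparing a tensor with all transposes of its index positions.

import numpy as np
from itertools import permutations

def is_symmetric(T):
    # check T_{i1..in} = T_{i_sigma(1)..i_sigma(n)} for all permutations sigma
    n = T.ndim
    return all(np.array_equal(T, np.transpose(T, p))
               for p in permutations(range(n)))

v = np.array([1., 2.])
print(is_symmetric(np.einsum('i,j,k->ijk', v, v, v)))                    # True
print(is_symmetric(np.einsum('i,j,k->ijk', v, v, np.array([1., 3.]))))   # False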

Remark 7.1.3 The set of symmetric tensors is a linear subspace of K d,...,d. Namely,
it is defined by a set of linear equations:

Ti1,...,in = Tiσ(1),...,iσ(n)

in the coordinates of K d,...,d.


As a vector space itself, the space of symmetric tensors of type d × · · · × d, n
times, is usually denoted by Sym n (K d ).
Of course, from the point of view of multi-linear forms, Sym n (K d ) coincides
with the space of symmetric multi-linear maps (K d )n → K .
As we will see later (see Proposition 7.3.8), the dimension of Symn(K d) is

dim(Symn(K d)) = (n + d − 1 choose n) = (n + d − 1 choose d − 1).

7.2 The Rank of a Symmetric Tensor

The next step is the study of the behavior of symmetric tensors with respect to the rank. It
is easy to realize that there are symmetric tensors of rank 1, i.e., the space Symn(K d)
intersects the set of decomposable tensors. Just to give an instance, look at:

      2 4
D =   1 2
      4 8
      2 4

which is (1, 2) ⊗ (1, 2) ⊗ (1, 2).

The following proposition shows how one constructs decomposable symmetric
tensors.

Proposition 7.2.1 Let T be a cubic tensor of type d × · · · × d, n times. Then T is


symmetric, of rank 1, if and only if there exist a nonzero scalar λ ∈ K and a nonzero
vector v ∈ K d with:
T = λ(v ⊗ v ⊗ · · · ⊗ v).

If moreover K is an algebraically closed field (as the complex field C), then we
may assume λ = 1.

Proof If T = λ(v ⊗ v ⊗ · · · ⊗ v), v ≠ 0, then T cannot be zero by Proposition 6.2.8,
thus it has rank 1. Moreover, if v = (α1, . . . , αd), then for any multi-index (i1, . . . , in)
and for any permutation σ:

Ti1,...,in = λαi1 · · · αin = Tiσ(1),...,iσ(n),

thus T is symmetric.
Conversely, assume that T is symmetric of rank 1, say T = v1 ⊗ · · · ⊗ vn, where
no vi ∈ K d can be 0, by Proposition 6.2.8. Write vi = (vi,1, . . . , vi,d) and fix a multi-
index (i1, . . . , in) such that v1,i1 ≠ 0, …, vn,in ≠ 0. Then Ti1,...,in = v1,i1 · · · vn,in can-
not vanish. Define b2 = v2,i1/v1,i1. Then we claim that v2 = b2v1. Namely, for all j
we have, by symmetry:

v1,i1 v2,j v3,i3 · · · vn,in = Ti1,j,i3,...,in = Tj,i1,i3,...,in = v1,j v2,i1 v3,i3 · · · vn,in,

which means that v1,i1 v2,j = v1,j v2,i1, so that v2,j = b2v1,j. Similarly, we can define
b3 = v3,i1/v1,i1, …, bn = vn,i1/v1,i1, and obtain that v3 = b3v1, …, vn = bnv1. Thus,
if λ = b2 · b3 · · · bn, then

T = v1 ⊗ v2 ⊗ · · · ⊗ vn = v1 ⊗ (b2v1) ⊗ · · · ⊗ (bnv1) = λ(v1 ⊗ v1 ⊗ · · · ⊗ v1).

When K is algebraically closed, take an nth root β of λ ∈ K and define v = βv1.
Then T = λ(v1 ⊗ v1 ⊗ · · · ⊗ v1) = β^n(v1 ⊗ v1 ⊗ · · · ⊗ v1) = v ⊗ v ⊗ · · · ⊗ v. □
Notice that purely algebraic properties of K can be relevant in determining the shape of a decomposition of a tensor.

Remark 7.2.2 In the sequel, we will often write v^{⊗n} for v ⊗ v ⊗ · · · ⊗ v, n times.
If K is algebraically closed, then a symmetric tensor T ∈ Sym^n(K^d) of rank 1 has a finite number (exactly: n) of decompositions as a power T = v^{⊗n}.
Namely if w ⊗ · · · ⊗ w = v ⊗ · · · ⊗ v, then by Proposition 6.2.9 there exists a scalar β such that w = βv and moreover β^n = 1, thus w is equal to v multiplied by an nth root of unity.
Passing from rank 1 to higher ranks, the situation suddenly becomes more involved.
The definition itself of the rank of a symmetric tensor is not completely trivial, as we have two natural choices for it:
• First choice. The rank of a symmetric tensor T ∈ Sym^n(K^d) is simply its rank as a tensor in K^{d,...,d}, i.e., it is the minimum r for which one has r decomposable tensors T_1, …, T_r with

    T = T_1 + · · · + T_r.

• Second choice. The rank of a symmetric tensor T ∈ Sym^n(K^d) is the minimum r for which one has r symmetric decomposable tensors T_1, …, T_r with

    T = T_1 + · · · + T_r.
Then, the natural question is which choice gives the correct definition. Here, "correct definition" means the definition which proves to be most useful for the applications to Multi-linear Algebra and random systems.
The reader could be disappointed to learn that there is no clear preference between the two options: each can be preferable, depending on the point of view.
Thus, we will reserve the word rank for the minimum r for which one has a decomposition T = T_1 + · · · + T_r, with the T_i's not necessarily symmetric (i.e., the first choice above).
Then, we give the following:

Definition 7.2.3 The symmetric rank srank(T) of a symmetric tensor T ∈ Sym^n(K^d) is the minimum r for which one has r symmetric decomposable tensors T_1, …, T_r with

    T = T_1 + · · · + T_r.
Example 7.2.4 The symmetric tensor

       2  0
    0  2
T =    0  2
    2  0

does not have rank 1, as one can check by taking the determinant of some face.
T has rank 2, because it is expressible as the sum of two decomposable tensors T = T_1 + T_2, where
         1  1
      1  1
T_1 =    1  1
      1  1

         1 −1
     −1  1
T_2 =   −1  1
      1 −1

and T_1 = (1, 1)^{⊗3}, T_2 = (−1, 1)^{⊗3}.
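This decomposition is easy to verify numerically. The following sketch (ours, using numpy's multiply.outer as the tensor product of vectors) builds the two tensor powers and sums them:

    import numpy as np

    def power(v, n):
        # n-fold tensor power v ⊗ ... ⊗ v, via iterated outer products
        T = v
        for _ in range(n - 1):
            T = np.multiply.outer(T, v)
        return T

    T1 = power(np.array([1, 1]), 3)
    T2 = power(np.array([-1, 1]), 3)
    T = T1 + T2
    print(T[0, 0, 0], T[0, 0, 1], T[1, 1, 0], T[1, 1, 1])  # 0 2 0 2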
Example 7.2.5 The tensor (over C):

       0  7
    1  0
T =    7  8
    0  7

is not decomposable. Let us prove that the symmetric rank is bigger than 2.
Assume that T = (a, b)^{⊗3} + (c, d)^{⊗3}. Then we have

    a^3 + c^3 = 1
    a^2 b + c^2 d = 0
    a b^2 + c d^2 = 7
    b^3 + d^3 = 8.
Notice that none of a, b, c, d can be 0. Moreover we have ac = ε and bd = 2ε′, where ε, ε′ are two cubic roots of unity, not necessarily equal. But then c = ε/a and d = 2ε′/b, so that a^2 b + c^2 d = 0 yields 1 + 2ε^2 ε′ = 0, which cannot hold, because −1 is not a cubic root of unity.
Remark 7.2.6 Proposition 7.2.1 shows in particular that any symmetric tensor of
(general) rank 1 has also the symmetric rank equal to 1.
The relations between the rank and the symmetric rank of a tensor are not obvious
at all, when the ranks are bigger than 1. It is clear that

srank(T ) ≥ rank(T ).
Very recently, Shitov found an example where the strict inequality holds (see [1]). Shitov's example is quite peculiar: the tensor has dimension 3 and type 800 × 800 × 800, and its rank is very high with respect to general tensors of the same dimension and type.
The difficulty of finding examples where the two ranks differ, despite the large number of concrete tensors tested, suggested to the French mathematician Pierre Comon the following:
Problem 7.2.7 (Comon 2000) Find conditions under which the symmetric rank and the rank of a symmetric tensor coincide.
In other words, find conditions on T ∈ Sym^n(C^d) such that if there exists a decomposition T = T_1 + · · · + T_r in terms of tensors of rank 1, then there exists also a decomposition with the same number of summands, in which each T_i is symmetric, of rank 1.
The condition is known for some types of tensors. For instance, it is easy to prove that the Comon Problem holds for any symmetric matrix T (and this is left as an exercise at the end of the chapter).
The reader could be surprised that such a question, which seems rather elementary in its formulation, yields a problem which is still open, after being studied by many mathematicians with modern techniques.
This explains why, at the beginning of the chapter, we warned the reader that problems that are simple for Linear Algebra and matrices can suddenly become prohibitive as the dimension of the tensors grows.
7.3 Symmetric Tensors and Polynomials

Homogeneous polynomials and symmetric tensors are two apparently rather different mathematical objects, which indeed have a strict interaction, so that one can pass from one to the other, translating properties of tensors into properties of polynomials, and vice versa.
The main construction behind this interaction is probably well known to the reader for the case of polynomials of degree 2. It is a standard fact that one can associate a symmetric matrix to a quadratic homogeneous polynomial, in a one-to-one correspondence, so that properties of the quadratic form (as well as properties of quadratic hypersurfaces) can be read off the associated matrix.
The aim of this section is to point out that a similar correspondence holds, more generally, between homogeneous forms of any degree and symmetric tensors of higher dimension.

Definition 7.3.1 There is a natural map between the space K^{n,...,n} of cubic tensors of dimension d and the space of homogeneous polynomials of degree d in n variables (i.e., the dth graded piece R_d of the ring of polynomials R = K[x_1, . . . , x_n]), defined by sending a tensor T to the polynomial F_T such that

    F_T = Σ_{i_1,...,i_d} T_{i_1,...,i_d} x_{i_1} · · · x_{i_d}.

It is clear that the previous correspondence is not one-to-one, as soon as general tensors are considered. Namely, for the case n = d = 2, one immediately sees that the two matrices

    ⎛ 2  3 ⎞      ⎛ 2  0 ⎞
    ⎝ −1 1 ⎠      ⎝ 2  1 ⎠

define the same polynomial of degree 2 in two variables, F = 2x_1^2 + 2x_1x_2 + x_2^2.
The correspondence becomes one-to-one (and onto) when restricted to symmetric tensors. To see this, we need to introduce a piece of notation.

Definition 7.3.2 For any multi-index (i_1, . . . , i_d), we will define the multiplicity m(i_1, . . . , i_d) as the number of different permutations of the multi-index.

Definition 7.3.3 Let R = K[x_1, . . . , x_n] be the ring of polynomials with coefficients in K, in n variables. Then there are linear isomorphisms

    p : Sym^d(K^n) → R_d,    t : R_d → Sym^d(K^n),

defined as follows. The map p is the restriction to Sym^d(K^n) of the previous map:

    p(T) = Σ_{i_1,...,i_d} T_{i_1,...,i_d} x_{i_1} · · · x_{i_d}.

The map t is defined by sending the polynomial F to the tensor t(F) such that

    t(F)_{i_1,...,i_d} = (1/m(i_1, . . . , i_d)) · (the coefficient of x_{i_1} · · · x_{i_d} in F).
Example 7.3.4 If G is a quadratic homogeneous polynomial in 3 variables, G = Ax^2 + Bxy + Cy^2 + Dxz + Eyz + Fz^2, then t(G) is the symmetric matrix

    t(G) = ⎛ A    B/2  D/2 ⎞
           ⎜ B/2  C    E/2 ⎟
           ⎝ D/2  E/2  F   ⎠

which is the usual matrix of the bilinear form associated to G.
Example 7.3.5 Consider the homogeneous cubic polynomial in two variables

    F(x_1, x_2) = x_1^3 − 3x_1^2 x_2 + 9x_1 x_2^2 − 2x_2^3.

Since one easily computes that

    m(1, 1, 1) = m(2, 2, 2) = 1,   m(1, 1, 2) = m(1, 2, 1) = m(2, 1, 1) = 3,
    m(1, 2, 2) = m(2, 1, 2) = m(2, 2, 1) = 3,

then the symmetric tensor t(F) is:

       −1  3
    1  −1
T =    3  −2
   −1  3

It is an easy exercise to prove that the two maps p and t defined above are inverse to each other.
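The two maps can also be implemented directly. The sketch below (our illustration; the encoding of a form as a dictionary of exponent tuples is an assumption of the sketch) builds t(F) by dividing each coefficient by the multiplicity m(i_1, . . . , i_d), and p recovers the coefficients by summing entries:

    import math
    import numpy as np
    from itertools import product
    from collections import Counter

    def t(coeffs, n, d):
        # symmetric tensor of a degree-d form in n variables;
        # coeffs maps exponent tuples (m_1,...,m_n), sum d, to coefficients
        T = np.zeros((n,) * d)
        for idx in product(range(n), repeat=d):
            counts = Counter(idx)
            mult = math.factorial(d)
            for c in counts.values():
                mult //= math.factorial(c)
            expo = tuple(counts.get(i, 0) for i in range(n))
            T[idx] = coeffs.get(expo, 0) / mult
        return T

    def p(T):
        # inverse map: sum the entries of T into the coefficients of a form
        n, d = T.shape[0], T.ndim
        coeffs = {}
        for idx in product(range(n), repeat=d):
            expo = tuple(Counter(idx).get(i, 0) for i in range(n))
            coeffs[expo] = coeffs.get(expo, 0) + T[idx]
        return coeffs

    # F = x1^3 - 3 x1^2 x2 + 9 x1 x2^2 - 2 x2^3, as in Example 7.3.5
    F = {(3, 0): 1, (2, 1): -3, (1, 2): 9, (0, 3): -2}
    T = t(F, 2, 3)
    print(T[0, 0, 0], T[0, 0, 1], T[0, 1, 1], T[1, 1, 1])  # 1.0 -1.0 3.0 -2.0
    print(p(T))  # recovers the coefficients of F (as floats)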
Once the correspondence is settled, one can easily speak about the rank or the symmetric rank of a polynomial.

Definition 7.3.6 For any homogeneous polynomial G ∈ K[x_1, . . . , x_n], we define the rank (respectively, the symmetric rank) of G as the rank (respectively, the symmetric rank) of the associated tensor t(G).

Example 7.3.7 The polynomial G = x_1^3 + 21x_1x_2^2 + 8x_2^3 has rank 3, since the associated tensor t(G) is exactly the 2 × 2 × 2 symmetric tensor of Example 7.2.5.
Proposition 7.3.8 The linear space Sym^d(K^n) has dimension

    dim(Sym^d(K^n)) = \binom{n+d-1}{d} = \binom{n+d-1}{n-1}.

Proof This is obvious once one knows that \binom{n+d-1}{d} is the dimension of the space R_d of homogeneous polynomials of degree d in n variables. We prove it for the sake of completeness.
Since monomials of degree d in n variables are a basis for R_d, it is enough to count the number of such monomials.
The proof goes by induction on n. For n = 2 the statement is easy: we have d + 1 monomials, namely x_1^d, x_1^{d-1}x_2, . . . , x_2^d.
Assume the formula holds for the n − 1 variables x_2, . . . , x_n. Every monomial of degree d in n variables is obtained by multiplying a power x_1^a by a monomial of degree d − a in x_2, . . . , x_n. Thus, we have 1 monomial containing x_1^d, n − 1 monomials containing x_1^{d-1}, …, \binom{n+d-a-2}{d-a} monomials containing x_1^a, and so on. Summing up,

    dim(Sym^d(K^n)) = Σ_{a=0}^{d} \binom{n+d-a-2}{d-a},

and the sum is \binom{n+d-1}{d}, by standard facts on binomial coefficients. □
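The count can be checked by brute force: monomials of degree d in n variables correspond to multisets of size d drawn from the variables. A quick sketch (ours):

    import math
    from itertools import combinations_with_replacement

    def dim_sym(n, d):
        # count the monomials of degree d in n variables directly
        return sum(1 for _ in combinations_with_replacement(range(n), d))

    n, d = 4, 3
    print(dim_sym(n, d), math.comb(n + d - 1, d))  # 20 20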
7.4 The Complexity of Polynomials

In this section, we rephrase the results on the rank of symmetric tensors in terms of the associated polynomials.
It will turn out that the rank decomposition of a polynomial is the analogue of a long-standing series of problems in Number Theory, on the expression of integers as sums of powers.
In principle, from the point of view of Algebraic Statistics, the complexity of a polynomial is the complexity of the associated symmetric tensor. So, the most elementary case of polynomials corresponds to symmetric tensors of rank 1. We start with a description of polynomials of this type.
Remark 7.4.1 Before we proceed, we need to come back to the multiplicity of a multi-index J = (i_1, . . . , i_d), introduced in Definition 7.3.2.
In the correspondence between polynomials and tensors, the element T_{i_1,...,i_d} is linked with the coefficient of the monomial x_{i_1} · · · x_{i_d}. Notice that i_1, . . . , i_d need not be distinct, so the monomial x_{i_1} · · · x_{i_d} could be written improperly. The usual way in which x_{i_1} · · · x_{i_d} is written is:

    x_{i_1} · · · x_{i_d} = x_1^{m_J(1)} x_2^{m_J(2)} · · · x_n^{m_J(n)},

where m_J(i) indicates the number of times that i occurs in the multi-index J.
With the notation just introduced, one can describe the multiplicity m(i_1, . . . , i_d). Indeed a permutation changes the multi-index, unless it simply switches indices i_a, i_b which are equal. Since the number of permutations of a set with m elements is m!, one finds that

    m(J) = m(i_1, . . . , i_d) = d! / (m_J(1)! · · · m_J(n)!).
Proposition 7.4.2 Let G be a homogeneous polynomial of degree d in n variables, so that t(G) ∈ Sym^d(K^n).
Then t(G) has rank 1 if and only if there exists a homogeneous linear polynomial L ∈ K[x_1, . . . , x_n] such that G = L^d.

Proof It is sufficient to prove that t(G) = v^{⊗d}, where v = (α_1, . . . , α_n) ∈ K^n, if and only if G = (α_1 x_1 + · · · + α_n x_n)^d.
To this aim, just notice that the coefficient of the monomial x_1^{m_1} · · · x_n^{m_n} in p(v^{⊗d}) is the sum of the entries of the tensor v^{⊗d} whose multi-index J satisfies m_J(1) = m_1, . . . , m_J(n) = m_n. These entries are all equal to α_1^{m_1} · · · α_n^{m_n}, and their number is m(J). On the other hand, by the well-known Newton formula, m(J)(α_1^{m_1} · · · α_n^{m_n}) is exactly the coefficient of the monomial x_1^{m_1} · · · x_n^{m_n} in the power (α_1 x_1 + · · · + α_n x_n)^d. □
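The computation in the proof is easy to check numerically; e.g., for v = (2, 3) and d = 3, the entries of v^{⊗3}, grouped by multiplicity, give the coefficients of (2x + 3y)^3 = 8x^3 + 36x^2y + 54xy^2 + 27y^3 (a sketch of ours):

    import numpy as np

    def power(v, d):
        # d-fold tensor power of the vector v
        T = v
        for _ in range(d - 1):
            T = np.multiply.outer(T, v)
        return T

    T = power(np.array([2, 3]), 3)
    print(T[0, 0, 0])                            # 8  (coefficient of x^3)
    print(T[0, 0, 1] + T[0, 1, 0] + T[1, 0, 0])  # 36 (coefficient of x^2 y)
    print(T[0, 1, 1] + T[1, 0, 1] + T[1, 1, 0])  # 54 (coefficient of x y^2)
    print(T[1, 1, 1])                            # 27 (coefficient of y^3)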
Corollary 7.4.3 The symmetric rank of a homogeneous polynomial G ∈ K[x_1, . . . , x_n]_d is the minimum r for which there are r linear homogeneous forms L_1, . . . , L_r ∈ K[x_1, . . . , x_n], with

    G = L_1^d + · · · + L_r^d.    (7.1)
The symmetric rank is the number that computes the complexity of symmetric tensors, hence the complexity of homogeneous polynomials, from the point of view of Algebraic Statistics. Hence, it turns out that the simplest polynomials, in this sense, are powers of linear forms. We guess that nobody will object to the statement that powers are rather simple!
We should mention, however, that sometimes the behavior of polynomials with respect to the complexity can be much less intuitive.
For instance, the rank of monomials is usually very high, so that the complexity of general monomials is above the average (and we expect that most people will be surprised). Even worse, efficient formulas for the rank of monomials were obtained only very recently, by Enrico Carlini, Maria Virginia Catalisano, and Anthony V. Geramita (see [2]). For other famous polynomials, such as the determinant of a matrix of indeterminates, we do not even know the rank. All we have are lower and upper bounds, which do not match.
We finish the chapter by mentioning that the problem of finding the rank of polynomials reflects a well-known problem in Number Theory. Solving a question posed by Diophantus, the Italian-born mathematician Joseph-Louis Lagrange proved that any positive integer can be written as a sum of four squares, i.e., for any positive integer G, there are integers L_1, L_2, L_3, L_4 such that

    G = L_1^2 + L_2^2 + L_3^2 + L_4^2.

The problem has been generalized by the English mathematician Edward Waring, who asked in 1770 for the minimum integer r(k) such that any positive integer can be written as a sum of r(k) kth powers. In other words, find the minimum r(k) such that any positive integer is of the form

    G = L_1^k + · · · + L_{r(k)}^k.

The analogy with the decomposition (7.1) that computes the symmetric rank of a polynomial is evident.
The determination of r(k) has been called, since then, the Waring problem for integers. Because of the analogy, the symmetric rank of a polynomial is also called the Waring rank.
For integers, few values of r(k) are known, e.g., r(2) = 4, r(3) = 9, r(4) = 19.
There are also variations on the Waring problem, such as asking for the minimum r′(k) such that all positive integers, except for a finite subset, are the sum of r′(k) kth powers (the little Waring problem).
Going back to the polynomial case, as for integers, a complete description of the maximal complexity that a homogeneous polynomial of given degree in a given number of variables can have is not known. We have only upper bounds for the maximal rank. On the other hand, we know the solution of an analogue of the little Waring problem, for polynomials over the complex field.
Theorem 7.4.4 (Alexander–Hirschowitz 1995) Over the complex field, the symmetric rank of a general homogeneous polynomial of degree d in n variables (here general means: all polynomials outside a set of measure 0 in C[x_1, . . . , x_n]_d; or also: all polynomials outside a Zariski closed subset of the space C[x_1, . . . , x_n]_d, see Remark 9.1.10) is

    r = ⌈ \binom{n+d-1}{d} / n ⌉,

except for the following cases:
• d = 2, any n, where r = n;
• d = 3, n = 5, where r = 8;
• d = 4, n = 3, where r = 6;
• d = 4, n = 4, where r = 10;
• d = 4, n = 5, where r = 15.

The original proof of this theorem requires the Horace method. It is long and difficult, and occupies a whole series of papers [3–7].
For specific tensors, an efficient way to compute the rank requires the use of inverse systems, which will be explained in the next chapter.
7.5 Exercises

Exercise 11 Prove that the two maps p and t introduced in Definition 7.3.3 are
linear and inverse to each other.

Exercise 12 Prove Comon’s Problem for matrices: a symmetric matrix M has rank
r if and only if there are r symmetric matrices of rank 1, M1 ,…, Mr , such that
M = M1 + · · · + Mr .

Exercise 13 Prove that the tensor T of Example 7.2.5 cannot have rank 2.

Exercise 14 Prove that the tensor T of Example 7.2.5 has symmetric rank
srank(T ) = 3 (so, after Exercise 13, also the rank is 3).
References

1. Shitov, Y.: A counterexample to Comon's conjecture. arXiv:1705.08740
2. Carlini, E., Catalisano, M.V., Geramita, A.V.: The solution to Waring's problem for monomials and the sum of coprime monomials. J. Algebra 370, 5–14 (2012)
3. Alexander, J., Hirschowitz, A.: La méthode d'Horace éclatée: application à l'interpolation en degré quatre. Invent. Math. 107, 582–602 (1992)
4. Alexander, J., Hirschowitz, A.: Un lemme d'Horace différentiel: application aux singularités hyperquartiques de P^5. J. Algebr. Geom. 1, 411–426 (1992)
5. Alexander, J., Hirschowitz, A.: Polynomial interpolation in several variables. J. Algebr. Geom. 4, 201–222 (1995)
6. Alexander, J., Hirschowitz, A.: Generic hypersurface singularities. Proc. Indian Acad. Sci. Math. Sci. 107(2), 139–154 (1997)
7. Alexander, J., Hirschowitz, A.: An asymptotic vanishing theorem for generic unions of multiple points. Invent. Math. 140, 303–325 (2000)
Chapter 8
Marginalization and Flattenings

We collect in this chapter some of the most useful operations on tensors, in view of
the applications to Algebraic Statistics.
8.1 Marginalization

The concept of marginalization goes back to the beginnings of the statistical analysis of discrete random variables, when, mainly, only pairs of variables were compared and correlated. In this case, distributions in the total correlation corresponded to matrices, and it was natural to annotate the sums of rows and columns at the edges (margins) of the matrices.

Definition 8.1.1 For matrices of given type n × m over a field K, which can be seen as points of the vector space K^{n,m}, the marginalization is the linear map μ : K^{n,m} → K^n × K^m which sends the matrix A = (a_{ij}) to the pair ((v_1, . . . , v_n), (w_1, . . . , w_m)) ∈ K^n × K^m, where

    v_i = Σ_{j=1}^{m} a_{ij},    w_j = Σ_{i=1}^{n} a_{ij}.

In practice, the v_i's correspond to the sums of the rows, and the w_j's correspond to the sums of the columns.

Notice that we can define as well the marginalization of A as ((1, . . . , 1)A^t, (1, . . . , 1)A), where the two vectors (1, . . . , 1) have lengths m and n, respectively.
Below there is an example of marginalization of a 3 × 3 matrix.

        ⎛ 1  2  3 ⎞   6
    M = ⎜ 0  2 −1 ⎟   1
        ⎝ 2  4  7 ⎠   13
          3  8  9

    μ(M) = ((6, 1, 13), (3, 8, 9)).
The notion can be extended (with some complication only in the notation) to tensors of any dimension.

Definition 8.1.2 For tensors of given type a_1 × · · · × a_n over a field K, which can be seen as points of the vector space K^{a_1,...,a_n}, the marginalization is the linear map μ : K^{a_1,...,a_n} → K^{a_1} × · · · × K^{a_n} which sends the tensor A = (α_{q_1...q_n}) to the n-tuple ((v_{11}, . . . , v_{1a_1}), . . . , (v_{n1}, . . . , v_{na_n})) ∈ K^{a_1} × · · · × K^{a_n}, where

    v_{ij} = Σ_{q_i = j} α_{q_1...q_n},

i.e., in each sum we fix the ith index and take the sum over all the entries of the tensor in which the ith index is equal to j.
Example 8.1.3 The marginalization of the 3 × 2 × 2 tensor of Example 6.1.6

       2  4
    1  2
       4  8
T = 2  4
       6 12
    3  6

is ((9, 18, 27), (18, 36), (18, 36)).
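In a numerical environment the marginalization is a one-liner. The following sketch (ours; axis i of the numpy array carries the ith index) reproduces the example above:

    import numpy as np

    def marginalization(T):
        # the i-th contraction is the sum over all the axes except i
        return tuple(T.sum(axis=tuple(j for j in range(T.ndim) if j != i))
                     for i in range(T.ndim))

    # the 3 x 2 x 2 tensor of Example 8.1.3 (axis 0 has length 3)
    T = np.array([[[1, 2], [2, 4]],
                  [[2, 4], [4, 8]],
                  [[3, 6], [6, 12]]])
    for u in marginalization(T):
        print(u)   # [ 9 18 27], then [18 36], then [18 36]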
Since the marginalization μ is a linear map, we can analyze its linear properties. It is immediate to realize that, except for trivial cases, μ is not injective.
Example 8.1.4 Even for 2 × 2 matrices, the matrix

    M = ⎛  1 −1 ⎞
        ⎝ −1  1 ⎠

belongs to the kernel of μ. Indeed, M generates the kernel of μ.

Even the surjectivity of μ fails in general. This is obvious for 2 × 2 matrices, since μ is a noninjective linear map between K^4 and itself.
Indeed, if ((v_1, . . . , v_n), (w_1, . . . , w_m)) is the marginalization of the matrix A = (a_{ij}), then clearly v_1 + · · · + v_n is equal to the sum of all the entries of A, thus it is also equal to w_1 + · · · + w_m. We prove that this is the only restriction for a vector in K^n × K^m to belong to the image of the marginalization.

Proposition 8.1.5 A vector ((v_1, . . . , v_n), (w_1, . . . , w_m)) ∈ K^n × K^m is the marginalization of a matrix A = (a_{ij}) ∈ K^{n,m} if and only if

    v_1 + · · · + v_n = w_1 + · · · + w_m.

Proof The fact that the condition is necessary follows from the few lines before the proposition.
We need to prove that the condition is sufficient. So, assume that Σv_i = Σw_j. A way to prove the claim, which can be easily extended even to higher dimensional tensors, is the following. Write e_i for the element in K^n with 1 in the ith position and 0 elsewhere, so that e_1, . . . , e_n is the canonical basis of K^n. Define similarly e′_1, . . . , e′_m ∈ K^m. It is clear that any pair (e_i, e′_j) belongs to the image of μ: just take the marginalization of the matrix having a_{ij} = 1 and all the remaining entries equal to 0. So, it is sufficient to prove that if v_1 + · · · + v_n = w_1 + · · · + w_m, then (v, w) = ((v_1, . . . , v_n), (w_1, . . . , w_m)) belongs to the subspace generated by the (e_i, e′_j)'s. Assume that n ≤ m (if the converse holds, just take the transpose). Then notice that

    (v, w) = v_1(e_1, e′_1) + Σ_{i=2}^{n} (v_1 + · · · + v_i − w_1 − · · · − w_{i−1})(e_i, e′_i)
             + Σ_{i=1}^{n−1} (w_1 + · · · + w_i − v_1 − · · · − v_i)(e_{i+1}, e′_i). □
Corollary 8.1.6 The image of μ has dimension n + m − 1. Thus the kernel of μ has
dimension nm − n − m + 1.
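The dimension count can be verified by writing down the matrix of μ in the canonical bases. The following sketch (ours) does it for n = 3, m = 4:

    import numpy as np
    from itertools import product

    n, m = 3, 4
    rows = []
    for i, j in product(range(n), range(m)):
        # mu maps the elementary matrix E_ij to the pair (e_i, e'_j)
        v = np.zeros(n + m)
        v[i] = 1.0
        v[n + j] = 1.0
        rows.append(v)
    M = np.array(rows)   # the (nm) x (n+m) matrix of mu
    r = np.linalg.matrix_rank(M)
    print(r, n + m - 1)   # 6 6: the image has dimension n + m - 1
    print(n * m - r)      # 6 = nm - n - m + 1: the dimension of the kernel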
We can extend the previous computation to tensors of arbitrary dimension.

Proposition 8.1.7 A vector

    ((v_{11}, . . . , v_{1a_1}), . . . , (v_{n1}, . . . , v_{na_n})) ∈ K^{a_1} × · · · × K^{a_n}

is the marginalization of a tensor of type a_1 × · · · × a_n if and only if

    v_{11} + · · · + v_{1a_1} = v_{21} + · · · + v_{2a_2} = · · · = v_{n1} + · · · + v_{na_n}.
Definition 8.1.8 If the marginalization μ(T) of a tensor T is the element ((v_{11}, . . . , v_{1a_1}), . . . , (v_{n1}, . . . , v_{na_n})) of K^{a_1} × · · · × K^{a_n}, then the vector μ(T)_i = (v_{i1}, . . . , v_{ia_i}) ∈ K^{a_i} is called the i-contraction of T.

Next, we will study the connections between the marginalization of a tensor T and the tensor product of its contractions.
Proposition 8.1.9 Let T ∈ K^{a_1,...,a_n} be a tensor of rank 1. Let u_i ∈ K^{a_i} be the i-contraction of T. Then T is a scaling of u_1 ⊗ · · · ⊗ u_n.

Proof Assume T = w_1 ⊗ · · · ⊗ w_n, where w_i = (w_{i1}, . . . , w_{ia_i}). Then T_{i_1,...,i_n} = w_{1i_1} · · · w_{ni_n}, hence u_i = (u_{i1}, . . . , u_{ia_i}) where

    u_{ij} = Σ_{j_i = j} w_{1j_1} · · · w_{nj_n} = w_{ij} · Σ w_{1j_1} · · · ŵ_{ij_i} · · · w_{nj_n},

hence u_i is a multiple of w_i (by the scalar Σ w_{1j_1} · · · ŵ_{ij_i} · · · w_{nj_n}, where the sum ranges over all the indices j_k with k ≠ i). □

Remark 8.1.10 We can be even more precise about the scaling factor of the previous proposition. Namely, assume T = w_1 ⊗ · · · ⊗ w_n, where w_i = (w_{i1}, . . . , w_{ia_i}), and set W_i = w_{i1} + · · · + w_{ia_i}. Then u_1 ⊗ · · · ⊗ u_n = W T, where

    W = Π_{i=1}^{n} (W_1 · · · Ŵ_i · · · W_n).
Example 8.1.11 Let T ∈ K^{2,2,2} be the rank 1 tensor, product of (1, 2) ⊗ (3, 4) ⊗ (5, 6). Then,

       30 40
    15 20
T =    36 48
    18 24

so that the marginalization of T is (u_1, u_2, u_3) = ((77, 154), (99, 132), (105, 126)). Clearly, (77, 154) = 77(1, 2), (99, 132) = 33(3, 4) and (105, 126) = 21(5, 6). Here W_1 = 3, W_2 = 7, W_3 = 11, so that

    u_1 ⊗ u_2 ⊗ u_3 = (3 · 7)(3 · 11)(7 · 11) T = 53361 T.
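A short check of the proposition and of the scaling factor on this example (a sketch of ours):

    import numpy as np

    v1, v2, v3 = np.array([1, 2]), np.array([3, 4]), np.array([5, 6])
    T = np.multiply.outer(np.multiply.outer(v1, v2), v3)

    # the three contractions of T
    u = [T.sum(axis=tuple(j for j in range(3) if j != i)) for i in range(3)]
    print(u[0], u[1], u[2])   # [ 77 154] [ 99 132] [105 126]

    U = np.multiply.outer(np.multiply.outer(u[0], u[1]), u[2])
    print(np.array_equal(U, 53361 * T))   # True, and 53361 = (3 * 7 * 11)^2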

When T has rank > 1, then clearly it cannot coincide with a multiple of u 1 ⊗
· · · ⊗ u n . For general T , the product of its contractions u 1 ⊗ · · · ⊗ u n determines a
good rank 1 approximation of T .
Example 8.1.12 Consider the tensor

        8 12
     6  9
T =    16 20
    12 18

which is an approximation of (1, 2) ⊗ (2, 3) ⊗ (3, 4) (only one entry has been changed). The marginalization of T is (u_1, u_2, u_3) = ((35, 66), (42, 59), (45, 56)). The product u_1 ⊗ u_2 ⊗ u_3, divided by 105^2 = 11025 (here 105 = 3 · 5 · 7 is the total sum of the entries of the nearby rank 1 tensor; see Remark 8.1.10), gives the approximate rank 1 tensor:

        7.5 10.5
      6  8.4
T′ =   14.1 19.8
    11.3 15.9

not really far from T.

Indeed, for some purposes, the product of the contractions can be considered as
a good rank 1 approximation of a given tensor. We warn the reader that, on the other
hand, there are other methods for the rank 1 approximation of tensors which, in many
cases, produce a result much closer to the best possible rank 1 approximation. See,
for instance, Remark 8.3.12.
8.2 Contractions

The concept of i-contraction of a tensor can be extended to any subset of indices. In this case, we will talk about partial contractions. The definition is straightforward, though it can look odd at first sight, since it requires many levels of indices.

Definition 8.2.1 For a tensor T of type a_1 × · · · × a_n and for any subset J of the set [n] = {1, . . . , n}, of cardinality n − q, define the J-contraction T^J as follows. Set {1, . . . , n} \ J = {j_1, . . . , j_q}. For any choice of k_1, . . . , k_q with 1 ≤ k_s ≤ a_{j_s}, put

    T^J_{k_1,...,k_q} = Σ T_{i_1,...,i_n},

where the sum ranges over all the entries T_{i_1,...,i_n} in which i_{j_s} = k_s for s = 1, . . . , q.

Some examples are needed.

Example 8.2.2 Consider the tensor:

       18 20
    12 16
T =     9 12
     6  8
The contraction of T along J = {2} means that we take the sum of the left face with the right face, so that, e.g., T^J_{11} = T_{111} + T_{121}, and so on. We get

    T^J = ⎛ 28 38 ⎞
          ⎝ 14 21 ⎠.

The contraction of T along J′ = {3} means that we take the sum of the front face with the rear face, so that, e.g., T^{J′}_{11} = T_{111} + T_{112}, and so on. We get

    T^{J′} = ⎛ 30 36 ⎞
             ⎝ 15 20 ⎠.

Instead, if we take the contraction along J″ = {2, 3}, we get a vector T^{J″} ∈ K^2, whose entries are the sums of the upper and the lower faces. Indeed T^{J″}_1 = T_{111} + T_{112} + T_{121} + T_{122}, so that

    T^{J″} = (66, 35).

In this case, T^{J″} is the 1st contraction of T.
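In numpy, the J-contraction is a sum over the axes corresponding to J. The following sketch (ours; axes are 0-based, while the indices of Definition 8.2.1 are 1-based) reproduces the three contractions above:

    import numpy as np

    def contraction(T, J):
        # sum over the indices in J (J given with 1-based indices)
        return T.sum(axis=tuple(j - 1 for j in J))

    # the 2 x 2 x 2 tensor of Example 8.2.2
    T = np.array([[[12, 18], [16, 20]],
                  [[ 6,  9], [ 8, 12]]])
    print(contraction(T, {2}))      # [[28 38] [14 21]]
    print(contraction(T, {3}))      # [[30 36] [15 20]]
    print(contraction(T, {2, 3}))   # [66 35]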
The last observation of the previous example generalizes.

Proposition 8.2.3 If J = {1, . . . , n} \ {i}, then T^J is the ith contraction of T.

Proof Just compare the two definitions. □

Remark 8.2.4 The contraction along J ⊂ {1, . . . , n} determines a linear map between spaces of tensors.
If J′ ⊂ J ⊂ {1, . . . , n}, then the contraction T^J is a contraction of T^{J′}.

The relation between the rank of a tensor and the ranks of its contractions is expressed as follows:

Proposition 8.2.5 If T has rank 1, then every contraction of T is either 0 or has rank 1. As a consequence, for every J ⊂ {1, . . . , n},

    rank(T^J) ≤ rank(T).

Proof In view of Remark 8.2.4, it is enough to prove the first statement when J is a singleton. Assume, for simplicity, that J = {n}. Then by definition

    T^J_{i_1...i_{n−1}} = Σ_{j=1}^{a_n} T_{i_1...i_{n−1}j}.

If T = v_1 ⊗ · · · ⊗ v_n, where v_i = (v_{i1}, . . . , v_{ia_i}), then T_{i_1...i_{n−1}j} = v_{1i_1} · · · v_{n−1,i_{n−1}} v_{nj}. It follows that, setting w = v_{n1} + · · · + v_{na_n}, then

    T^J_{i_1...i_{n−1}} = w · v_{1i_1} · · · v_{n−1,i_{n−1}},

so that T^J = w(v_1 ⊗ · · · ⊗ v_{n−1}).
The second statement follows immediately from the first one and from the linearity of the contraction. Indeed, if T = T_1 + · · · + T_r, where each T_i has rank 1, then T^J = T_1^J + · · · + T_r^J. □

The following proposition is indeed an extension of Proposition 8.1.9.

Proposition 8.2.6 Let T ∈ K^{a_1,...,a_n} be a tensor of rank 1. Let Q_1, . . . , Q_m be a partition of {1, . . . , n}, with the property that every element of Q_i is smaller than every element of Q_{i+1}, for all i. Define J_i = {1, . . . , n} \ Q_i.
Then the tensor product T^{J_1} ⊗ · · · ⊗ T^{J_m} is a scalar multiple of T.

Proof The proof is straightforward when we take the (maximal) partition where m = n and Q_i = {i} for all i. Indeed in this case T^{J_i} is the ith contraction of T, and we can apply Proposition 8.1.9.
For the general case, we can use Proposition 8.1.9 again and induction on n. Indeed, by assumption, if j_i = min{Q_i} − 1, then the kth contraction of T^{J_i} is equal to the (j_i + k)th contraction of T. □
Example 8.2.7 Consider the rank 1 tensor:

       2  4
    1  2
T =    8 16
    4  8

and consider the partition Q_1 = {1, 2}, Q_2 = {3}, so that J_1 = {3} and J_2 = {1, 2}. Then

    T^{J_1} = ⎛  3  6 ⎞
              ⎝ 12 24 ⎠,

while T^{J_2} = (15, 30). We get:

                       90 180
                    45  90
T^{J_1} ⊗ T^{J_2} =   360 720
                   180 360

hence T^{J_1} ⊗ T^{J_2} = 45 T.
8.3 Scan and Flattening

The following operation is natural for tensors and often allows a direct computation of the rank.

Definition 8.3.1 Let T ∈ K^{a_1,...,a_n} be a tensor. For any j = 1, . . . , n we denote by Scan(T)_j the set of a_j tensors obtained by fixing the index j in all the possible ways.
Technically, Scan(T)_j is the set of tensors T_1, . . . , T_{a_j} in K^{a_1,...,â_j,...,a_n} such that

    (T_q)_{i_1...i_{n−1}} = T_{i_1...q...i_{n−1}},

where the index q occurs in the jth place.
By applying the definition recursively, we obtain the definition of the scan of a tensor T along a subset J ⊂ {1, . . . , n} of two, three, etc., indices.
Example 8.3.2 Consider the tensor:

       2  4
    2  3
       4  8
T = 1  1
       5  6
    3  4

The scan Scan(T)_1 is given by the three matrices

    ⎛ 2 3 ⎞   ⎛ 1 1 ⎞   ⎛ 3 4 ⎞
    ⎝ 2 4 ⎠   ⎝ 4 8 ⎠   ⎝ 5 6 ⎠

while Scan(T)_2 is the couple of matrices

    ⎛ 2 2 ⎞   ⎛ 3 4 ⎞
    ⎜ 1 4 ⎟   ⎜ 1 8 ⎟
    ⎝ 3 5 ⎠   ⎝ 4 6 ⎠

Remark 8.3.3 There is an obvious relation between the scan and the contraction of a tensor T: if J ⊂ {1, . . . , n} is any subset of the set of indices, then the J-contraction of T equals the sum of the tensors of the scan of the tensor T along J.
We define the flattening of a tensor by taking the scan along one index, and arranging the resulting tensors in one big tensor.

Definition 8.3.4 Let T ∈ K^{a_1,...,a_n} be a tensor. The flattening of T along the last index is defined as follows. For any positive q ≤ a_{n−1}a_n, one finds uniquely determined integers α, β such that q − 1 = (β − 1)a_{n−1} + (α − 1), with 1 ≤ β ≤ a_n, 1 ≤ α ≤ a_{n−1}. Then the flattening of T along the last index is the tensor FT ∈ K^{a_1,...,a_{n−2},a_{n−1}a_n} with:

    FT_{i_1...i_{n−2}q} = T_{i_1...i_{n−2}αβ}.
Example 8.3.5 Consider the tensor:

       3  8
    1  1
T =    2  6
    3  4

Its flattening is the 2 × 4 matrix:

    FT = ⎛ 1 3 3 2 ⎞
         ⎝ 1 4 8 6 ⎠

The flattening of the rank 1 tensor:

       4  8
    1  2
T =    8 16
    2  4

is the 2 × 4 matrix of rank 1:

    ⎛ 1 2 4 8 ⎞
    ⎝ 2 4 8 16 ⎠
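Definition 8.3.4 can be implemented with a transposition followed by a reshape: since the next-to-last index α runs fastest along the new index q, one swaps the last two axes before flattening them. A sketch of ours (the entry assignments follow our reading of the first picture above):

    import numpy as np

    def flattening(T):
        # flattening along the last index: q - 1 = (beta - 1) a_{n-1} + (alpha - 1)
        axes = list(range(T.ndim))
        axes[-2], axes[-1] = axes[-1], axes[-2]
        new_shape = T.shape[:-2] + (T.shape[-2] * T.shape[-1],)
        return np.transpose(T, axes).reshape(new_shape)

    # the first tensor of Example 8.3.5
    T = np.zeros((2, 2, 2), dtype=int)
    T[0, 0, 0], T[0, 1, 0], T[0, 0, 1], T[0, 1, 1] = 1, 3, 3, 2
    T[1, 0, 0], T[1, 1, 0], T[1, 0, 1], T[1, 1, 1] = 1, 4, 8, 6
    FT = flattening(T)
    print(FT)                          # [[1 3 3 2] [1 4 8 6]]
    print(np.linalg.matrix_rank(FT))   # 2

    # the rank 1 tensor of the same example flattens to a rank 1 matrix
    T1 = np.zeros((2, 2, 2), dtype=int)
    T1[0, 0, 0], T1[0, 1, 0], T1[0, 0, 1], T1[0, 1, 1] = 1, 2, 4, 8
    T1[1, 0, 0], T1[1, 1, 0], T1[1, 0, 1], T1[1, 1, 1] = 2, 4, 8, 16
    print(np.linalg.matrix_rank(flattening(T1)))   # 1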

Remark 8.3.6 One can apply the flattening procedure after a permutation of the
indices. In this way, in fact, one can define the flattening along any of the indices.
We leave the details to the reader.
Moreover, one can take an ordered series of indices and perform a sequence of
flattening procedures, in order to reduce the dimension of the tensor.

The final target which is usually the most useful for applications is the flattening
reduction of a tensor to a (usually rather huge) matrix, by performing n − 2 flattenings
of an n-dimensional tensor. If we do not use permutations, the final output is a matrix
of size a1 × (a2 · · · an ).
The reason why flattenings are useful for the analysis of tensors is based on the following property.

Proposition 8.3.7 A tensor T has rank 1 if and only if all of its flattenings have rank 1.

Proof Let T ∈ K^{a_1,...,a_n} be a tensor of rank 1, T = v_1 ⊗ · · · ⊗ v_n, where v_i = (v_{i1}, . . . , v_{ia_i}). Since T ≠ 0, then also FT ≠ 0. Then one computes directly that the flattening FT is equal to v_1 ⊗ · · · ⊗ v_{n−2} ⊗ w, where

    w = (v_{n−1,1}v_{n1}, . . . , v_{n−1,a_{n−1}}v_{n1}, v_{n−1,1}v_{n2}, . . . , v_{n−1,a_{n−1}}v_{na_n}).

Conversely, recall that, from Theorem 6.4.13, a tensor T has rank 1 when, for all choices p, q = 1, . . . , n and numbers 1 ≤ α, γ ≤ a_p, 1 ≤ β, δ ≤ a_q, one has

    T_{i_1···i_{p−1}αi_{p+1}...i_{q−1}βi_{q+1}...i_n} · T_{i_1···i_{p−1}γi_{p+1}...i_{q−1}δi_{q+1}...i_n} −
      − T_{i_1···i_{p−1}αi_{p+1}...i_{q−1}δi_{q+1}...i_n} · T_{i_1···i_{p−1}γi_{p+1}...i_{q−1}βi_{q+1}...i_n} = 0.    (8.1)

If we take the flattening over the two indices p and q, the left term of the previous equation is a 2 × 2 minor of the flattening. □

The second tensor in Example 8.3.5 shows the flattening of a 2 × 2 × 2 tensor of rank 1.
Notice that one implication of the Proposition works indeed for any rank. Namely, if T has rank r, then all of its flattenings have rank ≤ r.
On the other hand, the converse does not hold, at least when the rank is big. For instance, there are tensors of type 2 × 2 × 2 and rank 3 (see Example 6.4.15) for which, of course, the 2 × 4 flattenings cannot have rank 3.
Example 8.3.8 The tensor

       2  4
    1  2
T =    4  8
    8 16

has rank > 1, because some determinants of its faces are not 0. On the other hand, its flattening is the 2 × 4 matrix

    ⎛ 1  8 2 4 ⎞
    ⎝ 2 16 4 8 ⎠

which has rank 1.
We have the following, straightforward:

Proposition 8.3.9 Let FT have rank 1, FT = v_1 ⊗ · · · ⊗ v_{n−2} ⊗ w, with w ∈ K^{a_{n−1}a_n}. Assume that we can split w into a_n pieces of length a_{n−1} which are proportional, i.e., assume that there is v_n = (α_1, . . . , α_{a_n}) with

    w = (β_1α_1, . . . , β_{a_{n−1}}α_1, β_1α_2, . . . , β_{a_{n−1}}α_{a_n}).

Then T has rank 1 and, setting v_{n−1} = (β_1, . . . , β_{a_{n−1}}),

    T = v_1 ⊗ · · · ⊗ v_{n−2} ⊗ v_{n−1} ⊗ v_n.
As a corollary of Proposition 8.3.7, we get:

Proposition 8.3.10 If a tensor T has rank r, then its flattening has rank ≤ r.

Proof The flattening is a linear operation on the space of tensors. Thus, if T = T_1 + · · · + T_r with each T_i of rank 1, then also FT = FT_1 + · · · + FT_r, and each FT_i has rank 1. The claim follows. □

Of course, the rank of the flattening FT can be strictly smaller than the rank of T. For instance, we know from Example 6.4.15 that there are 2 × 2 × 2 tensors T of rank 3. The flattening FT, which is a 2 × 4 matrix, cannot have rank bigger than 2.
Remark 8.3.11 Here is one application of the flattening procedure to the computation of the rank.
Assume we are given a tensor T ∈ K^{a_1,...,a_n} and assume we would like to know whether the rank of T is r. If r < a_1, then we can perform a series of flattenings along the last indices, obtaining a matrix F*T of size a_1 × (a_2 · · · a_n). Then, we can compute the rank of the matrix (and we have plenty of fast procedures to do this). If F*T has rank > r, then there is no chance that the original tensor T has rank r. If F*T has rank r, then this can be considered as a cue toward the fact that rank(T) = r.
Of course, a similar process is possible, by using permutations of the indices, when r ≥ a_1 but r < a_i for some i.
The flattening process is clearly invertible, so that one can reconstruct the original tensor T from the flattening FT, thus also from the matrix F*T resulting from a process of n − 2 flattenings.
On the other hand, since a matrix of rank r > 1 has infinitely many decompositions as a sum of r matrices of rank 1, by taking one decomposition of F*T as a sum of r matrices of rank 1 one cannot hope to reconstruct automatically from it a decomposition of T as a sum of r tensors of rank 1.
Indeed, the existence of a decomposition for T is subject to the existence of a decomposition for F*T in which every summand satisfies the condition of Proposition 8.3.9.
Remark 8.3.12 One can try to find an approximation of a given tensor T with a
tensor of prescribed, small rank r < a1 by taking the matrix F ∗ T , resulting from
a process of n − 2 flattenings, and considering the rank r approximation for F ∗ T
obtained by the standard SVD approximation process for matrices (see [1]).
For instance, one can find in this way a rank 1 approximation for a tensor, which
in principle is not equal to the rank 1 approximation obtained by the marginalization
(see Example 8.1.12).
Example 8.3.13 Consider the tensor:

       0  0
    1  0
T =    0  1
    0  0

The contractions of T are (1, 1), (1, 1), (1, 1), whose tensor product, divided by 4 = (1 + 1)(1 + 1), determines a rank 1 approximation:

         1/4 1/4
      1/4 1/4
T_1 =    1/4 1/4
      1/4 1/4

The flattening of T is the matrix:

    FT = ⎛ 1 0 0 0 ⎞
         ⎝ 0 0 0 1 ⎠.

The rank 1 approximation of FT, obtained by the SVD process, is:

    ⎛ 1 0 0 0 ⎞
    ⎝ 0 0 0 0 ⎠.

This matrix defines the tensor:

        0  0
     1  0
T_2 =   0  0
     0  0

which is another rank 1 approximation of T.
If we consider the tensors as vectors in K^8, then the natural distance d(T, T_1) is √(3/2), while the natural distance d(T, T_2) is 1. So T_2 is "closer" to T than T_1. On the other hand, for some purposes, one could consider T_1 as a better rank 1 approximation of T than T_2 (e.g., it preserves the marginalization).
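The two approximations can be reproduced as follows (a sketch of ours; note that here the two singular values of FT are equal, so the rank 1 truncation of the SVD is not unique, and numpy returns one admissible choice):

    import numpy as np

    # T has entries 1 at positions (1,1,1) and (2,2,2) (0-based: (0,0,0), (1,1,1))
    T = np.zeros((2, 2, 2))
    T[0, 0, 0] = T[1, 1, 1] = 1.0

    # rank 1 approximation from the marginalization: all entries equal to 1/4
    u = [T.sum(axis=tuple(j for j in range(3) if j != i)) for i in range(3)]
    T1 = np.multiply.outer(np.multiply.outer(u[0], u[1]), u[2]) / T.sum() ** 2

    # rank 1 approximation through the flattening and a truncated SVD
    F = T.transpose(0, 2, 1).reshape(2, 4)
    U, s, Vt = np.linalg.svd(F)
    F1 = s[0] * np.outer(U[:, 0], Vt[0])
    T2 = F1.reshape(2, 2, 2).transpose(0, 2, 1)

    print(np.linalg.norm(T - T1))   # 1.2247... = sqrt(3/2)
    print(np.linalg.norm(T - T2))   # 1.0 (for the choice returned above)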

8.4 Exercises

Exercise 15 Prove the assertion in Example 8.1.4: the matrix M defined there generates the kernel of the marginalization.

Exercise 16 Find generators for the kernel of the marginalization map of 3 × 3 matrices.

Exercise 17 Find generators for the image of the marginalization map of tensors and prove Proposition 8.1.7.

Exercise 18 Prove the statements of Remark 8.2.4.

Reference

1. Markovsky, I.: Low-Rank Approximation: Algorithms, Implementation, Applications. Springer, Berlin (2012)
Part III
Commutative Algebra and Algebraic
Geometry
Chapter 9
Elements of Projective Algebraic
Geometry

The scope of this part of the book is to provide a quick introduction to the main tools of the Algebraic Geometry of projective spaces that are necessary to understand some aspects of algebraic models in Statistics.
The material collected here is not self-contained. For many technical results, such as the Nullstellensatz or the theory of field extensions, we will refer to specific texts on the subject.
We assume in the sequel that the reader knows the basic definitions of algebraic structures, such as rings, ideals, homomorphisms, etc., as well as the main properties of polynomial rings.
This part of the book could also be used for a short course, or as a quick path through the main results of algebraic and projective geometry which are relevant in the study of Statistics.

9.1 Projective Varieties

The first step is the definition of the ambient space.
Since we need to deal with subsets defined by polynomial equations, the starting point is the polynomial ring over the complex field, the field where solutions of polynomial equations live properly. Several claims that we are going to illustrate also work over any algebraically closed field. We deal only with the complex field, in order to avoid details on the structure of fields of arbitrary characteristic.
The main feature of Projective Geometry is that the coordinates of a point are defined only up to scaling.

Definition 9.1.1 Let V be a linear space over C. Define on V* = V \ {0} the equivalence relation ∼ which associates v, v′ if and only if there exists α ∈ C with v′ = αv. The quotient P(V) = V*/∼ is the projective space associated to V.
The projective dimension of P(V) is the number dim(V) − 1 (it is a standard convention that the dimension of a projective space is always 1 less than its linear dimension).
When V = C^{n+1}, we will denote the projective space P(V) also by P^n.
Points of the projective space are thus equivalence classes of vectors, in the relation ∼, hence are formed by a vector v ≠ 0 together with all its multiples. In particular, P ∈ P^n is an equivalence class of (n+1)-tuples of complex numbers. We will call homogeneous coordinates of P any representative of the equivalence class.
Notice that the coordinates, in a projective space, are no longer uniquely defined, but only defined modulo scalar multiplication. We will also write P = [p_0 : · · · : p_n] when (p_0, . . . , p_n) is a representative for the homogeneous coordinates of P.

Remark 9.1.2 P^n contains several subsets in natural one-to-one correspondence with C^n.
Indeed, take the subset U_i of points with homogeneous coordinates [x_0 : · · · : x_n] whose ith coordinate x_i is nonzero. The condition is clearly independent of the representative of the class that we choose. There is a one-to-one correspondence U_i ↔ C^n, obtained as follows:

    [x_0 : · · · : x_n] → (x_0/x_i, x_1/x_i, . . . , x̂_i/x_i, . . . , x_n/x_i),

where the hat means that the ith entry is omitted. U_i is called the ith affine subspace.
Notice that when P = [p_0 : · · · : p_n] ∈ U_i, hence p_i ≠ 0, then there exists a unique representative of P with p_i = 1. The previous process identifies P ∈ U_i with the point of C^n whose coordinates correspond to such representative of P, excluding the ith coordinate.
Definition 9.1.3 A subset C of a linear space V is a cone if for any v ∈ C and any scalar a ∈ C, then av ∈ C.

Remark 9.1.4 Cones in the linear space C^{n+1} define subsets in the associated projective space P^n.
Indeed there is an obvious map π : C^{n+1} \ {0} → P^n that sends (n+1)-tuples to their equivalence classes. If W ⊂ P^n is any subset, then π^{−1}(W) ∪ {0} is a cone in C^{n+1}.
Conversely, every cone in C^{n+1} is the inverse image under π of a subset of P^n (one must add 0).
The same argument applies if we substitute C^{n+1} and P^n with, respectively, a generic linear space V and its associated projective space P(V).

In general, one cannot expect that a polynomial f ∈ C[x_0, . . . , x_n] defines a cone in C^{n+1}. This turns out to be true when f is homogeneous. Indeed, if f is homogeneous of degree d and a ∈ C is any scalar, then

    f(ax_0, . . . , ax_n) = a^d f(x_0, . . . , x_n),

thus for a ≠ 0 the vanishing of f(ax_0, . . . , ax_n) is equivalent to the vanishing of f(x_0, . . . , x_n).
The observation can be reversed, as follows.
Lemma 9.1.5 Let f be a polynomial in C[x_0, . . . , x_n], of degree bigger than 0. Then there exist a point P = (p_0, . . . , p_n) ∈ C^{n+1} with f(P) = 0 and infinitely many points Q = (q_0, . . . , q_n) ∈ C^{n+1} with f(Q) ≠ 0.

Proof Make induction on the number of variables.
When f has only one variable, the first claim is exactly the definition of algebraically closed field. The second claim holds because every nonzero polynomial of degree d has at most d roots.
If we know both claims for n variables, then write f ∈ C[x_0, . . . , x_n] as a polynomial in x_0, with coefficients in C[x_1, . . . , x_n]:

    f = f_d x_0^d + f_{d−1} x_0^{d−1} + · · · + f_0,

where each f_i is a polynomial in x_1, . . . , x_n. We may assume that f_d ≠ 0 with d > 0, otherwise f involves only the variables x_1, . . . , x_n, and the claim holds by induction. By induction, there are infinitely many points (q_1, . . . , q_n) ∈ C^n which are not a solution of f_d = 0 (notice that such points exist trivially if f_d is a nonzero constant). Then for any such (q_1, . . . , q_n) the polynomial g = f(x_0, q_1, . . . , q_n) has just one variable and degree d > 0, hence there are infinitely many q_0 ∈ C with g(q_0) = f(q_0, q_1, . . . , q_n) ≠ 0 and at least one p_0 ∈ C such that g(p_0) = f(p_0, q_1, . . . , q_n) = 0. □
We recall that any polynomial f(x) ∈ C[x_0, . . . , x_n] can be written uniquely as a sum of homogeneous polynomials

    f(x) = f_d + f_{d−1} + · · · + f_0,

with f_i homogeneous of degree i for all i. The previous sum is called the homogeneous decomposition of f(x).
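The homogeneous decomposition is easy to compute symbolically; a quick sketch with sympy (our illustration) groups the monomials of a polynomial by total degree:

    from sympy import symbols, Poly

    x, y = symbols('x y')
    f = Poly(x**3 + 2*x*y + x + 5, x, y)

    # group the monomials of f by total degree
    parts = {}
    for exps, c in f.terms():
        d = sum(exps)
        parts[d] = parts.get(d, 0) + c * x**exps[0] * y**exps[1]
    print(parts)   # {3: x**3, 2: 2*x*y, 1: x, 0: 5}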
Proposition 9.1.6 Let f = f(x) be a polynomial in C[x_0, . . . , x_n] of degree d > 0. Assume that f(x) is not homogeneous.
Then there exist P = (p_0, . . . , p_n) ∈ C^{n+1} and a scalar α ∈ C \ {0} such that f(P) = 0 and f(αP) ≠ 0.

Proof Take the homogeneous decomposition of f(x),

    f(x) = f_d + f_{d−1} + · · · + f_0.

Since f(x) is not homogeneous, we have f_d ≠ 0 and f_i ≠ 0 for some i < d. Take the minimal i with f_i ≠ 0. Choose y = (y_0, . . . , y_n) ∈ C^{n+1} with f_d(y) ≠ 0 (it exists by Lemma 9.1.5). Then f(ay) = a^d f_d(y) + a^{d−1} f_{d−1}(y) + · · · + a^i f_i(y) is a polynomial of degree d > 0 in the unique variable a, which can be divided by a^i, i.e., f(ay) = a^i g(a), where g(a) is a polynomial of degree d − i > 0 in a, whose constant term is nonzero. By Lemma 9.1.5 again, there exist a_1, a_2 ∈ C with g(a_1) = 0 and g(a_2) ≠ 0; since the constant term of g does not vanish, we have a_1 ≠ 0, and we may also choose a_2 ≠ 0. The claim now holds by taking P = a_1y and α = a_2/a_1. □
The previous Proposition shows that the vanishing of a polynomial in C[x_0, . . . , x_n] is not defined over a cone, hence over a subset of P^n, unless the polynomial is homogeneous. Conversely, if f ∈ C[x_0, . . . , x_n] is homogeneous, then the vanishing of f at one set of projective coordinates [p_0 : · · · : p_n] of P ∈ P^n implies the vanishing of f at any set of homogeneous coordinates of P, because the vanishing set of f is a cone.
Consequently, we give the following, basic:

Definition 9.1.7 We call projective variety of P^n every subset of P^n defined by the vanishing of a family J = {f_j} of homogeneous polynomials f_j ∈ C[x_0, . . . , x_n].
In other words, projective varieties are subsets of P^n whose equivalence classes are the solutions of a system of homogeneous polynomial equations.
When V is any linear space of dimension d, we define the projective varieties in P(V) by taking an identification V ≅ C^d (hence by fixing a basis of V).
We will denote by X(J) the projective variety defined by the family J of homogeneous polynomials.

Example 9.1.8 Let { f 1 , . . . , f m } be a family of linear homogeneous polynomials in


C[x0 , . . . , xn ]. The projective variety X defined by { f 1 , . . . , f m } is called a linear
projective variety.
The polynomials f 1 , . . . , f m define also a linear subspace W ⊂ Cn+1 . It is easy
to prove that there is a canonical identification of X with the projective space P(W )
(see Exercise 19).

Remark 9.1.9 Let X be a projective variety defined by a set J of homogeneous polynomials and take a subset J′ ⊂ J. Then the projective variety X′ = X(J′) defined by J′ contains X.
One can easily find examples (even of linear varieties) with X′ = X even if J′ is properly contained in J (see Exercise 20).
Remark 9.1.10 Projective varieties provide a system of closed sets for a topology, called the Zariski topology on P^n.
Namely, ∅ and P^n are both projective varieties, defined respectively by the families of polynomials {1} and {0}. If {W_i} is a family of projective varieties, with W_i = X(J_i), then ∩W_i is the projective variety defined by the family J = ∪J_i of homogeneous polynomials. Finally, if W_1 = X(J_1) and W_2 = X(J_2) are projective varieties, then W_1 ∪ W_2 is the projective variety defined by the family of homogeneous polynomials:

    J_1J_2 = { fg : f ∈ J_1, g ∈ J_2 }.
Example 9.1.11 Every singleton in P^n is a projective variety.
Namely, if [p_0 : · · · : p_n] are homogeneous coordinates of a point P, with p_i ≠ 0, then the set of linear homogeneous polynomials

    I = {p_0x_i − p_ix_0, . . . , p_nx_i − p_ix_n}

defines {P} ⊂ P^n.
In particular, the Zariski topology satisfies the first separation axiom T1.
Example 9.1.12 Every Zariski closed subset of P^1 is finite, except P^1 itself.
Namely, let f be any nonzero homogeneous polynomial of degree d in C[x_0, x_1]. Then setting x_1 = 1, we get a polynomial f̄ ∈ C[x_0] which, by the Fundamental Theorem of Algebra, can be uniquely decomposed into a product f̄ = e(x_0 − α_0)^{m_0} · · · (x_0 − α_k)^{m_k}, where α_0, . . . , α_k are the roots of f̄ and e ∈ C.
Going back to f, we see that there exists a power x_1^β (maybe β = 0) such that

    f = e x_1^β (x_0 − α_0x_1)^{m_0} · · · (x_0 − α_kx_1)^{m_k}.

It follows immediately that f vanishes only at the points [α_0 : 1], . . . , [α_k : 1], with the addition of [1 : 0] if β > 0.
Thus, the open sets in the Zariski topology on P^1 are ∅ and the cofinite sets, i.e., sets whose complement is finite. In other words, the Zariski topology on P^1 coincides with the cofinite topology.
Example 9.1.13 In higher projective spaces there are nontrivial closed subsets which are infinite. Thus the Zariski topology on P^n, n > 1, is not the cofinite topology.
Indeed, let f ≠ 0 be a homogeneous polynomial in C[x_0, . . . , x_n], of degree bigger than 0, and assume n > 1. We prove that the variety X(f), which is not P^n by Lemma 9.1.5, has infinitely many points.
To see this, notice that if all the points Q = [q_0 : q_1 : · · · : q_n] with q_0 ≠ 0 belong to X(f), then we are done. So we can assume that there exists Q = [1 : q_1 : · · · : q_n] ∉ X(f). For any choice of m = (m_2, . . . , m_n) ∈ C^{n−1} consider the line L_m, passing through Q, defined by the vanishing of the linear polynomials

    x_2 − m_2(x_1 − q_1x_0) − q_2x_0,  . . . ,  x_n − m_n(x_1 − q_1x_0) − q_nx_0.

Define the polynomial

    f_m = f(x_0, x_1, m_2(x_1 − q_1x_0) + q_2x_0, . . . , m_n(x_1 − q_1x_0) + q_nx_0).

If (α_0, α_1) is a solution of the equation f_m = 0, then the intersection X(f) ∩ L_m contains the point

    [α_0 : α_1 : m_2(α_1 − q_1α_0) + q_2α_0 : · · · : m_n(α_1 − q_1α_0) + q_nα_0].

Since the polynomial f_m is homogeneous of the same degree as f, it vanishes at some point, so that X(f) ∩ L_m ≠ ∅. Since two different lines L_m, L_{m′} meet only at Q ∉ X(f), the claim follows.

9.1.1 Associated Ideals

Definition 9.1.14 Let I be an ideal of the polynomial ring R = C[x_0, . . . , x_n]. We say that I is generated by J ⊂ R, and write I = ⟨J⟩, when

    I = {h_1f_1 + · · · + h_mf_m : h_1, . . . , h_m ∈ R, f_1, . . . , f_m ∈ J}.

We say that I is a homogeneous ideal if there is a set of homogeneous elements J ⊂ R such that I = ⟨J⟩.
Notice that not every element of a homogeneous ideal is homogeneous. For instance, in C[x] the homogeneous ideal I = ⟨x⟩ contains the nonhomogeneous element x + x^2.
Proposition 9.1.15 The ideal I is homogeneous if and only if for any polynomial f ∈ I, whose homogeneous components are f_d, . . . , f_0, every f_i belongs to I.

Proof Assume that I is generated by a set of homogeneous elements J and take f ∈ I. Consider the decomposition into homogeneous components f = f_d + · · · + f_0. There are homogeneous polynomials q_1, . . . , q_m ∈ J such that f = h_1q_1 + · · · + h_mq_m, for some polynomials h_j ∈ R. Denote by d_j the degree of q_j and denote by h_{i,j} the homogeneous component of degree i in h_j (with h_{i,j} = 0 whenever i < 0). Then, by comparing the homogeneous components, one gets for every degree i

    f_i = h_{1,i−d_1}q_1 + · · · + h_{m,i−d_m}q_m,

and thus f_i ∈ ⟨J⟩ = I for every i.
Conversely, I is always contained in the ideal generated by the homogeneous components of its elements. Thus, when these components are also in I for all f ∈ I, then I is generated by homogeneous polynomials. □
Remark 9.1.16 If W is a projective variety defined by the vanishing of a set J of homogeneous polynomials, then W is also defined by the vanishing of all the polynomials in the ideal I = ⟨J⟩.
Indeed if P is a point of W, then for all f ∈ I, write f = h_1f_1 + · · · + h_mf_m for some f_i's in J. We have

    f(P) = h_1(P)f_1(P) + · · · + h_m(P)f_m(P) = 0.

It follows that every projective variety can be defined as the vanishing locus of a homogeneous ideal.
Before stating the basic result in the correspondence between projective varieties and homogeneous ideals (i.e., the homogeneous version of the celebrated Hilbert's Nullstellensatz), we need one more piece of notation.

Definition 9.1.17 For any ideal I ⊂ R, define the radical of I as the set

    √I = { f : f^m ∈ I for some exponent m }.

√I is an ideal of R and contains I.
When I is a homogeneous ideal, then also √I is homogeneous, and the projective varieties X(I) and X(√I) are equal (see Exercise 21).
We say that an ideal I in R is radical if I = √I. For any ideal I, √I is a radical ideal, since √(√I) = √I.
We call irrelevant ideal the ideal of R = C[x_0, . . . , x_n] generated by the indeterminates x_0, . . . , x_n.
The irrelevant ideal is a radical ideal that defines the empty set in P^n. Indeed, no point of P^n can annihilate all the variables, as no point in P^n has all of its homogeneous coordinates equal to 0.
Example 9.1.18 In C[x, y] consider the homogeneous element x^2. The radical of the ideal I = ⟨x^2⟩ is the ideal generated by x. Indeed x belongs to √⟨x^2⟩; moreover, if f^n ∈ I for some polynomial f, then f must have a vanishing constant term, thus f ∈ ⟨x⟩.
The three sets {x^2}, ⟨x^2⟩, √⟨x^2⟩ = ⟨x⟩ all define the same projective subvariety of P^1: the point of homogeneous coordinates [0 : 1].
Now we are ready to state the famous Hilbert's Nullstellensatz, which clarifies the relations between different sets of polynomials that define the same projective variety.

Theorem 9.1.19 (Homogeneous Nullstellensatz) Two homogeneous ideals I_1, I_2 in the polynomial ring R = C[x_0, . . . , x_n] define the same projective variety X if and only if

    √I_1 = √I_2,

with the unique exception I_1 = R, I_2 = the irrelevant ideal.

Thus, if two radical homogeneous ideals I_1, I_2 define the same projective variety X, and none of them is the whole ring R, then I_1 = I_2.
Moreover, two sets J_1, J_2 of homogeneous polynomials define the same projective variety X if and only if √⟨J_1⟩ = √⟨J_2⟩.

A proof of the homogeneous Nullstellensatz can be found in the book [1].
We should notice that the Theorem works because our varieties are defined over C, which is algebraically closed. The statement is definitely not true over a non-algebraically closed field, such as the real field R. This is itself a good reason to define projective varieties over an algebraically closed field, as C.
We list below other consequences of the Nullstellensatz.
Theorem 9.1.20 Let J ⊂ C[x_0, . . . , x_n] be a set of homogeneous polynomials which define a projective variety X ≠ ∅. Then the set

    J(X) = { f ∈ C[x_0, . . . , x_n] : f is homogeneous and f(P) = 0 for all P ∈ X }

coincides with the radical of the ideal generated by J.

We will call J(X) the homogeneous ideal associated with X.

Corollary 9.1.21 Let I ⊂ R = C[x_0, . . . , x_n] be a homogeneous ideal. I defines the empty set in P^n if and only if, for some m, all the powers x_i^m belong to I.

Proof Indeed we get from the homogeneous Nullstellensatz that √I is either R or the irrelevant ideal. In the former case, the claim is obvious. In the latter, for every i there exists m_i such that x_i^{m_i} ∈ I, and one can take m to be the maximum of the m_i's. □

Another fundamental result in the study of projective varieties, still due to Hilbert,
is encoded in the following algebraic result:

Theorem 9.1.22 (Basis Theorem) Let J be a set of polynomials and let I be the
ideal generated by J . Then there exists a finite subset J  ⊂ J that generates the ideal
I.
In particular, any projective variety can be defined by the vanishing of a finite set
of homogeneous polynomials.

A proof of a weaker version of this theorem will be given in Chap. 13 (or see,
also, Sect. 4 of [1]). Let us list some consequences of the Basis Theorem.

Definition 9.1.23 We call hypersurface any projective variety defined by the vanishing of a single homogeneous polynomial. By abuse, often we will write X(f) instead of X({f}).
When f has degree 1, then X(f) is called a hyperplane.

Corollary 9.1.24 Every projective variety is the intersection of a finite number of hypersurfaces. Equivalently, every open set in the Zariski topology is a finite union of complements of hypersurfaces.

Proof Let X = X(J) be a projective variety, defined by the set J of homogeneous polynomials. Find a finite subset J′ ⊂ J such that ⟨J′⟩ = ⟨J⟩. Then:

    X = X(J) = X(⟨J⟩) = X(⟨J′⟩) = X(J′).

If J′ = {f_1, . . . , f_m} then, by Remark 9.1.10:

    X = X(f_1) ∩ · · · ∩ X(f_m). □
Example 9.1.25 If L ⊂ P^n is a linear variety which corresponds to a linear subspace of dimension m + 1 in C^{n+1}, then L can be defined by n − m linear homogeneous polynomials, i.e., L is the intersection of n − m hyperplanes.

Remark 9.1.26 One could think that the homogeneous ideal of every projective variety in P^n can be generated by a finite set of homogeneous polynomials of cardinality bounded by a function of n.
F. S. Macaulay proved that this guess is false.
Indeed, in [2] he showed that for every integer m there exists a subset (a curve) in P^3 whose homogeneous ideal cannot be generated by a set of less than m homogeneous polynomials.

9.1.2 Topological Properties of Projective Varieties

The Basis Theorem provides a tool for the study of some aspects of the Zariski
topology.

Definition 9.1.27 A topological space Y is irreducible when any pair of non-empty open subsets has a non-empty intersection.
Equivalently, Y is irreducible if it is not the union of two proper closed subsets.
Equivalently, Y is irreducible if the closure of every non-empty open subset A is Y itself, i.e. every non-empty open subset is dense in Y.

The following Proposition is easy and we will leave it as an exercise (see Exercise
22).

Proposition 9.1.28 (i) Every singleton is irreducible.
(ii) If Y is an irreducible subset, then the closure of Y is irreducible.
(iii) If an irreducible subset Y is contained in a finite union of closed subsets X1 ∪ · · · ∪ Xm, then Y is contained in some Xi.
(iv) If Y1 ⊂ . . . ⊂ Yi ⊂ . . . is an ascending chain of irreducible subsets, then ∪Yi is irreducible.

Corollary 9.1.29 Any projective space Pn is irreducible and compact in the Zariski
topology.

Proof Let A1, A2 be non-empty open subsets, in the Zariski topology, and assume that Ai is the complement of the projective variety Xi = X(Ji), where J1, J2 are two subsets of homogeneous polynomials in C[x0, . . . , xn]. We may assume, by the Basis Theorem, that both J1, J2 are finite. Notice that none of X(J1), X(J2) can coincide with Pn, thus both J1, J2 contain a nonzero element.
To prove that Pn is irreducible, we must show that A1 ∩ A2 cannot be empty, i.e. that X1 ∪ X2 cannot coincide with Pn. By Remark 9.1.10, X1 ∪ X2 is the projective variety defined by the set of products J1J2. If we take f1 ≠ 0 in J1 and f2 ≠ 0 in J2, then f = f1 f2 is a nonzero element in J1J2. By Lemma 9.1.5 there exist points P ∈ Pn such that f(P) ≠ 0. Thus P does not belong to X1 ∪ X2, and the irreducibility of Pn is settled.
For the compactness, we prove that Pn enjoys the Finite Intersection Property: if the intersection of a family of closed subsets is empty, then there exists a finite subfamily with empty intersection.
Let {Xi} be any family of closed subsets such that ∩Xi = ∅. Assume Xi = X(Ji) and define J = ∪Ji. By Remark 9.1.10, ∩Xi = X(J), thus X(J) = ∅.
By the Basis Theorem, there exists a finite subset J′ of J such that ⟨J′⟩ = ⟨J⟩. Thus there exists a finite subfamily Ji1, . . . , Jik such that ⟨Ji1 ∪ · · · ∪ Jik⟩ = ⟨J⟩. Thus

∅ = X(J) = X(⟨Ji1 ∪ · · · ∪ Jik⟩) = X(Ji1 ∪ · · · ∪ Jik) = Xi1 ∩ · · · ∩ Xik

and the Finite Intersection Property holds. □

Closed subsets in a compact space are compact. Thus any projective variety X ⊂
Pn is compact in the topology induced by the Zariski topology of Pn .
Notice that irreducible topological spaces are far from being Hausdorff spaces.
Thus no nontrivial projective space satisfies the Hausdorff separation axiom T2 .
Another important consequence of the Basis Theorem is the following.

Theorem 9.1.30 Any non-empty family F of closed subsets of Pn (i.e. of projective


varieties), partially ordered by inclusion, has a minimal element.

Proof Let the claim fail. Then one can find an infinite chain of elements of F ,

X0 ⊃ X1 ⊃ · · · ⊃ Xi ⊃ . . .

where all the inclusions are strict. Consider for all i the ideal I (X i ) generated by the
homogeneous polynomials which vanish at X i . Then one gets an ascending chain of
ideals
I (X 0 ) ⊂ I (X 1 ) ⊂ · · · ⊂ I (X i ) ⊂ . . .

where again all the inclusions are strict. Let I = ∪I(Xi). It is immediate to see that I is a homogeneous ideal. By the Basis Theorem, there exists a finite set of homogeneous generators g1, . . . , gk for I. Since every gj belongs to ∪I(Xi), for i0 sufficiently large we have gj ∈ I(Xi0) for all j. Thus I = I(Xi0), so that I(Xi0) = I(Xi0+1), a contradiction. □

Definition 9.1.31 For any projective variety X, a subset X′ of X is an irreducible component of X if it is closed (in the Zariski topology, thus X′ is a projective variety itself), irreducible, and maximal with respect to these two properties.

It is clear that X is irreducible if and only if X itself is the unique irreducible


component of X .

Theorem 9.1.32 Let X be any projective variety. Then the irreducible components
of X exist and their number is finite.
Moreover there exists a unique decomposition of X as the union

X = X1 ∪ · · · ∪ Xk

where X 1 , . . . , X k are precisely the irreducible components of X .

Proof First, let us prove that irreducible components exist. To do that, consider the family FP of closed irreducible subsets containing a point P. FP is not empty, since it contains {P}. If X1 ⊂ . . . ⊂ Xi ⊂ . . . is an ascending chain of elements of FP, then the union Y = ∪Xi is irreducible by 9.1.28 (iv), thus the closure of Y sits in FP (by 9.1.28 (ii)) and it is an upper bound for the chain. Then the family FP has maximal elements, by Zorn's Lemma. These elements are irreducible components of X.
Notice that we also proved that every point of X sits in some irreducible component, i.e. X is the union of its irreducible components. If Y is an irreducible component, by 9.1.28 (ii) also the closure of Y is irreducible. Thus, by maximality, Y must be closed.
Next, we prove that X is a finite union of irreducible closed subsets. Assume this is false. Call F the family of closed subsets of X which are not a finite union of irreducible subsets. F is non-empty, since it contains X. By Theorem 9.1.30, F has some minimal element X′. As X′ ∈ F, X′ cannot be irreducible. Thus there are two closed subsets X1, X2, properly contained in X′, whose union is X′. Since X′ is minimal in F, none of X1, X2 is in F, thus both X1, X2 are unions of a finite number of irreducible closed subsets. But then also X′ would be a finite union of closed irreducible subsets. As X′ ∈ F, this is a contradiction.
Thus, there are irreducible closed subsets X1, . . . , Xk, whose union is X. Then, if Y is any irreducible component of X, we have Y ⊂ X = X1 ∪ · · · ∪ Xk. By 9.1.28 (iii), Y is contained in some Xi. By maximality, we get that Y coincides with some Xi. This proves that the number of irreducible components of X is finite.
We just proved that X decomposes in the union of its irreducible components Y1, . . . , Ym. By 9.1.28 (iii), none of the Yi can be contained in the union of the remaining components. Thus the decomposition is unique. □

Example 9.1.33 Let X be the variety in P2 defined by the vanishing of the homogeneous polynomial g = x0x2 − x1². Then X is irreducible.
Proving the irreducibility of a projective variety, in general, is not an easy task. We do that, in this case, by introducing a method that we will refine later.
Assume that X is the union of two proper closed subsets X 1 , X 2 , where X i is
defined by the vanishing of homogeneous polynomials in the set Ji .
We consider the map f : P1 → P2 defined by sending each point P = [y0 : y1] to the point f(P) = [y0² : y0y1 : y1²] of P2. It is immediate to check, indeed, that
the point f (P) does not depend on the choice of a particular pair of homogeneous
coordinates for P. Here f is simply a set-theoretic map. We will see, later, that f
has relevant geometric properties.

The image of f is contained in X, for any point with homogeneous coordinates [x0 : x1 : x2] = [y0² : y0y1 : y1²] annihilates g. Moreover the image of f is exactly X. Indeed let Q = [q0 : q1 : q2] be a point of X. Fix elements b, c ∈ C such that b² = q0 and c² = q2. Then we cannot have both b, c equal to 0, for in this case q0 = q2 = 0 and also q1 = 0, because g(Q) = 0, a contradiction. Moreover (bc)² = q0q2 = q1². Thus, after possibly changing the sign of one between b and c, we may also assume bc = q1. Then:

f([b : c]) = [b² : bc : c²] = [q0 : q1 : q2] = Q.

The map f is one-to-one. To see this, assume f([b′ : c′]) = f([b : c]). Then (b′², b′c′, c′²) is equal to (b², bc, c²) multiplied by some nonzero scalar z ∈ C. Taking a suitable square root w of z, we may assume b′ = wb. We have c′ = ±wc, but if c′ = −wc then b′c′ = −zbc ≠ zbc, a contradiction. Thus also c′ = wc and (b′, c′), (b, c) define the same point in P1.
In conclusion, f is a bijective map f : P1 → X .
Next, we prove that Z1 = f⁻¹(X1) is closed in P1. Indeed for any polynomial p = p(x0, x1, x2) ∈ J1 consider the polynomial q = p(y0², y0y1, y1²) ∈ C[y0, y1]. It is immediate to check that any P ∈ P1 satisfies q(P) = 0 if and only if p(f(P)) = 0. Thus Z1 is the projective variety in P1 associated to the set of homogeneous polynomials

J′ = { p(y0², y0y1, y1²) : p ∈ J1 } ⊂ C[y0, y1].

Similarly Z2 = f⁻¹(X2) is closed in P1.
Since f is bijective, Z1, Z2 are proper closed subsets of P1, whose union is P1. This contradicts the irreducibility of P1.
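A quick symbolic check of the parametrization used above (our sketch, not part of the book's text): substituting f into g confirms that the image of f lies on X.

```python
# Hedged sketch: the image of f([y0:y1]) = [y0^2 : y0*y1 : y1^2] satisfies
# g = x0*x2 - x1^2, so it lies on the conic X of Example 9.1.33.
from sympy import symbols, expand

y0, y1, x0, x1, x2 = symbols('y0 y1 x0 x1 x2')
g = x0*x2 - x1**2
print(expand(g.subs({x0: y0**2, x1: y0*y1, x2: y1**2})))  # prints 0
```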

We will see (Exercise 28) that any linear variety is irreducible.

Example 9.1.34 Let X be the variety in P2 defined by the set of homogeneous poly-
nomials J = {x0 x1 , x0 (x0 − x2 )}. Then X is the union of the sets L 1 = {[x0 : x1 :
x2 ] : x0 = 0} and L 2 = {[x0 : x1 : x2 ] : x1 = 0, x0 = x2 }. These are both linear vari-
eties, hence they are irreducible (L 2 is indeed a singleton). Moreover L 1 ∩ L 2 = ∅.
It follows that X is not irreducible: L 1 , L 2 are its irreducible components.

Definition 9.1.35 We say that a polynomial f ∈ C[x0, . . . , xn] is irreducible when f = g1g2 implies that either g1 or g2 is constant.

Irreducible polynomials are the basic elements that determine a factorization of


every polynomial. We refer to [1] I.14 for a proof of the following statement, which
establishes a link between irreducible hypersurfaces and irreducible polynomials.

Theorem 9.1.36 (Unique factorization) Any polynomial f can be written as a prod-


uct f = f 1 f 2 · · · f h where the f i ’s are irreducible polynomials. The f i ’s are called
irreducible factors of f and are unique up to scalar, in the sense that if f = g1 · · · gs ,

with each g j irreducible, then h = s and, after a possible permutation, there are
scalars c1 , . . . , ch ∈ C with gi = ci f i for all i.
If f is homogeneous, also the irreducible factors of f are homogeneous.

Notice that the irreducible factors of f need not be distinct. In any event, the
irreducible factors of a product f g are the union of the irreducible factors of f and
the irreducible factors of g.

Example 9.1.37 Let X = X ( f ) be a hypersurface and take a decomposition f =


f 1 · · · f h of f into irreducible factors. Then the X ( f i )’s are precisely the irreducible
components of X .
To prove this, first notice that when f is irreducible, then X is irreducible. Indeed
assume that X = X 1 ∪ X 2 , where X 1 , X 2 are closed and none of them contains X .
Then take f 1 (respectively f 2 ) in the radical ideal of X 1 (respectively X 2 ) and such
that X is not contained in X ( f 1 ) (respectively in X ( f 2 )). We have X 1 ⊂ X ( f 1 ) and
X 2 ⊂ X ( f 2 ), thus:
X ( f ) ⊂ X ( f 1 ) ∪ X ( f 2 ) = X ( f 1 f 2 ).

It follows that f1 f2 belongs to the radical of the ideal generated by f, thus some power of f1 f2 belongs to the ideal generated by f, i.e. there is an equality

(f1 f2)^n = f h

for some exponent n and some polynomial h. It follows that f is an irreducible factor of either f1 or f2. In the former case f1 = f h1, hence X(f1) contains X. In the latter, X(f2) contains X.
In particular, X is irreducible if and only if f has a unique irreducible factor. This
clearly happens when f is irreducible, but also when f is a power of an irreducible
polynomial.
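Decompositions into irreducible factors can be computed symbolically; the following sketch (ours, with a made-up polynomial f) uses sympy's factor_list and, by the example above, reads off the irreducible components of X(f).

```python
# Hedged sketch: factoring f gives the irreducible components of X(f)
# (Example 9.1.37); a repeated factor contributes a single component.
from sympy import symbols, factor_list

x0, x1, x2 = symbols('x0 x1 x2')
f = (x0*x2 - x1**2) * (x0 + x1)**2
print(factor_list(f))
# e.g. (1, [(x0 + x1, 2), (x0*x2 - x1**2, 1)]): the components of X(f)
# are the line X(x0 + x1) and the conic X(x0*x2 - x1^2).
```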

9.2 Multiprojective Varieties

Let us move to consider products of projective spaces, which we will call also mul-
tiprojective spaces.
The nonexpert reader may be surprised, at first, to learn that a product of projective spaces is not trivially a projective space itself.
For instance, consider the product P1 × P1, whose points have a pair of homogeneous coordinates ([x0 : x1], [y0 : y1]). These pairs can be multiplied separately by two different scalars. Thus, ([1 : 1], [1 : 2]) and ([2 : 2], [1 : 2]) represent the same point of the product. On the other hand, the most naïve association with a point in a projective space would relate ([x0 : x1], [y0 : y1]) with [x0 : x1 : y0 : y1] (which, by the way, sits in P3), but [1 : 1 : 1 : 2] and [2 : 2 : 1 : 2] are different points in P3.

We will see in the next chapters how a product can be identified with a subset
(indeed, with a projective variety) of a higher dimensional projective space.
By now, we develop independently a theory for products of projective spaces and
their relevant subsets: multiprojective varieties.

Remark 9.2.1 Consider a product P = Pa1 × · · · × Pan . A point P ∈ P corresponds


to an equivalence class whose elements are n-tuples

(( p1,0 , . . . , p1,a1 ), . . . , ( pn,0 , . . . , pn,an ))

where, for all i, (pi,0, . . . , pi,ai) ≠ (0, . . . , 0). Two such elements

P = (( p1,0 , . . . , p1,a1 ), . . . , ( pn,0 , . . . , pn,an ))


Q = ((q1,0 , . . . , q1,a1 ), . . . , (qn,0 , . . . , qn,an ))

belong to the same class when there are scalars k1, . . . , kn ∈ C (all of them necessarily nonzero) such that, for all i, j, qij = ki pij.
We will denote the elements of the equivalence class that define P as sets of
multihomogeneous coordinates for P, writing

P = ([ p1,0 : · · · : p1,a1 ], . . . , [ pn,0 : · · · : pn,an ]).

Since we want to construct a projective geometry for multiprojective spaces, we


need to define the vanishing of a polynomial

f ∈ C[x1,0 , . . . , x1,a1 , . . . , xn,0 , . . . , xn,an ]

at a point P of the product above. This time, it is not sufficient that f is homogeneous,
because subsets of coordinates referring to factors of the product can be scaled
independently.

Definition 9.2.2 A polynomial f ∈ C[x1,0 , . . . , x1,a1 , . . . , xn,0 , . . . , xn,an ] is multi-


homogeneous of multidegree (d1 , . . . , dn ) if f , considered as a polynomial in the
variables xi,0 , . . . , xi,ai , is homogeneous of degree di , for every i.

Strictly speaking, the definition of a multihomogeneous polynomial in a polynomial ring C[x0, . . . , xN] makes sense only after we have fixed a partition of the set of variables. Moreover, if we change the partition, the notion of multihomogeneous polynomial also changes.
Notice, however, that a partition is canonically determined when we consider the
polynomial ring C[x1,0 , . . . , x1,a1 , . . . , xn,0 , . . . , xn,an ] associated to the multiprojec-
tive space Pa1 × · · · × Pan .
Multihomogeneous polynomials are homogeneous, but the converse is false.

Example 9.2.3 Consider the polynomial ring C[x0, x1, y0, y1], with the partition {x0, x1}, {y0, y1}, and consider the two homogeneous polynomials

f1 = x0²y0 + 2x0x1y1 − 3x1²y0,    f2 = x0³ − 2x1y0y1 + x0y1².

Then f1 is multihomogeneous (of multidegree (2, 1)) while f2 is not multihomogeneous.
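One can test multihomogeneity by scaling each block of variables independently; here is a small sketch of ours for the two polynomials above.

```python
# Hedged sketch: f is multihomogeneous of multidegree (d1, d2) exactly when
# scaling {x0, x1} by s and {y0, y1} by t rescales f by s^d1 * t^d2.
from sympy import symbols, expand

x0, x1, y0, y1, s, t = symbols('x0 x1 y0 y1 s t')
f1 = x0**2*y0 + 2*x0*x1*y1 - 3*x1**2*y0
f2 = x0**3 - 2*x1*y0*y1 + x0*y1**2

def scaled(f):
    return f.subs({x0: s*x0, x1: s*x1, y0: t*y0, y1: t*y1},
                  simultaneous=True)

print(expand(scaled(f1) - s**2*t*f1))  # 0: f1 has multidegree (2, 1)
print(expand(scaled(f2) - s**3*f2))    # nonzero: f2 is not multihomogeneous
```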

Example 9.2.4 In C[x1,0 , . . . , x1,a1 , . . . , xn,0 , . . . , xn,an ] the homogeneous linear


polynomial x1,0 + · · · + x1,a1 + · · · + xn,0 + · · · + xn,an is never multihomogeneous,
except for the trivial partition.
For the trivial partition, homogeneous and multihomogeneous polynomials coin-
cide.
If for any i one takes a homogeneous polynomial f i ∈ C[xi,0 , . . . , xi,ai ] of degree
di , then the product f 1 · · · f n is multihomogeneous of multidegree (d1 , . . . , dn ), in
the ring C[x1,0 , . . . , x1,a1 , . . . , xn,0 , . . . , xn,an ] with the natural partition.

It is immediate to verify that given two representatives of the same class in Pa1 ×
· · · × Pan :

P = (( p1,0 , . . . , p1,a1 ), . . . , ( pn,0 , . . . , pn,an ))


Q = ((q1,0 , . . . , q1,a1 ), . . . , (qn,0 , . . . , qn,an )),

when f ∈ C[x1,0 , . . . , x1,a1 , . . . , xn,0 , . . . , xn,an ] is multihomogeneous of any mul-


tidegree, then f (P) = 0 if and only if f (Q) = 0.
As a consequence, one can define the vanishing of f at a point P = ([ p1,0 : · · · :
p1,a1 ], . . . , [ pn,0 : · · · : pn,an ]) of the product, as the vanishing of f at any set of
multihomogeneous coordinates.

Definition 9.2.5 We call multiprojective variety every subset X ⊂ Pa1 × · · · × Pan


defined by the vanishing of a family J of multihomogeneous polynomials

J ⊂ C[x1,0 , . . . , x1,a1 , . . . , xn,0 , . . . , xn,an ].

We will, as usual, write X (J ) to denote the multiprojective variety defined by J .

Example 9.2.6 Consider the product P = Pa1 × · · · × Pan and consider, for all i, a
projective variety X i in Pai . Then the product X 1 × · · · × X n is a multiprojective
variety in P.
Indeed, assume that X i is defined by a finite subset Ji of homogeneous polynomials
in the variables xi,0 , . . . , xi,ai . For all i extend Ji to a finite set K i which defines the
empty set in Pai (e.g. just add to Ji all the coordinates in Pai ). Then X 1 × · · · × X n
is defined by the (finite) set of products of homogeneous polynomials:

J = { f 1 · · · f n : f j ∈ K j ∀ j and ∃i with f i ∈ Ji }.

Namely, take P = (P1, . . . , Pn) ∈ P. Clearly if P ∈ X1 × · · · × Xn then P annihilates all the elements in J. Conversely, if P ∉ X1 × · · · × Xn, say Pj ∉ Xj, then for every i ≠ j take gi any homogeneous polynomial in Ki such that gi(Pi) ≠ 0, and take gj ∈ Jj such that gj(Pj) ≠ 0. Clearly the product g1 · · · gn belongs to J and it does not vanish at P.
Example 9.2.7 There are multiprojective varieties that are not a product of projective
varieties.
For instance, consider the multiprojective variety X defined by x0 y1 − x1 y0 in the
product P1 × P1 . It is easy to see that X does not coincide with P1 × P1 (Exercise
29), but for each point P ∈ P1 the point (P, P) of the product sits in X . Thus X
cannot be the product of two subsets of P1 , one of which is a proper subset.
Most properties introduced in the previous section for projective varieties also
hold for multiprojective varieties. We give here a short survey (the proofs are left as
an exercise).
Remark 9.2.8 Let X(J) be a multiprojective variety, defined by a set J of multihomogeneous polynomials. Then for any J′ ⊂ J, the variety X(J′) contains X(J). One can have X(J′) = X(J) even if J′ is properly contained in J.
For instance, X(J) is also defined by the ideal ⟨J⟩ and by its radical √⟨J⟩.
Remark 9.2.9 Multiprojective varieties define a family of closed subset for a topol-
ogy on a product P = Pa1 × · · · × Pan . We call this topology the Zariski topology on
P.
P is irreducible and compact, in the Zariski topology. Thus every multiprojective
variety is itself compact, in the induced topology.
Remark 9.2.10 The Basis Theorem 9.1.22 guarantees that every multiprojective variety is the zero locus of a finite set of multihomogeneous polynomials.
Every multiprojective variety is indeed the intersection of a finite number of hypersurfaces in P = Pa1 × · · · × Pas, where a hypersurface is defined as a multiprojective variety X(J), where J is a singleton.
Theorem 9.2.11 Let X be a multiprojective variety. Then there exists a unique de-
composition of X as the union

X = X1 ∪ · · · ∪ Xk

where X 1 , . . . , X k are irreducible multiprojective varieties: the irreducible compo-


nents of X .

9.3 Projective and Multiprojective Maps

A theory of algebraic objects in Mathematics cannot be considered complete unless one also introduces the notion of good maps between the objects.
We define in this section a class of maps between projective and multiprojective varieties that are good for our purposes. We will call them projective or multiprojective maps.

In principle, a projective map is a map which is described by polynomials. Unfortunately, one cannot take this as a global definition, because it is too restrictive and would introduce some undesirable phenomena.
Instead, we define projective maps in terms of a local description by polynomials.
Definition 9.3.1 Let X ⊂ Pn and Y ⊂ Pm be projective varieties. We say that a map
f : X → Y is projective if the following property holds:
for any P ∈ X there exist an open set U of X (in the Zariski topology), containing
P, and m + 1 polynomials f 0 , . . . , f m ∈ C[x0 , . . . , xn ], homogeneous of the same
degree, such that for all Q = [q0 : · · · : qn ] ∈ U :

f (Q) = [ f 0 (q0 , . . . , qn ) : · · · : f m (q0 , . . . , qn )].

We will also write that, over U, the map f is given parametrically by the system:

y0 = f0(x0, . . . , xn)
. . .
ym = fm(x0, . . . , xn)

Thus, X is covered by open subsets on which f is defined by polynomials. Since


X is compact, we can always assume that the cover is finite.
Notice that if we take another set of homogeneous coordinates for the point
Q ∈ U , i.e. we write Q = [cq0 : · · · : cqn ], where c is a nonzero scalar, then
since the polynomials are homogeneous of the same degree, say degree d, we
get fi(cq0, . . . , cqn) = c^d fi(q0, . . . , qn) for all i. Thus f(Q) is independent of the choice of a specific set of homogeneous coordinates for Q.
We may always consider a projective map f : X → Y ⊂ Pm as a map from X to
the projective space Pm . The following proposition shows that when the domain X
of f is a projective space itself, then the localization to open sets, in the definition
of projective maps, is useless.
Proposition 9.3.2 Let f : Pn → Pm be a projective map. Then there exists a set of
m + 1 homogeneous polynomials f 0 , . . . , f m ∈ C[x0 , . . . , xn ] of the same degree,
such that f (Q) is defined by f 0 , . . . , f m for all Q ∈ Pn .
In other words, in the definition we can always take just one open subset U = Pn .
Proof Take two open subsets U1 , U2 where f is defined, respectively, by homo-
geneous polynomials g0 , . . . , gm and h 0 , . . . , h m of the same degree. Since Pn is
irreducible, then U = U1 ∩ U2 is a non-empty, dense open subset. For any point
P ∈ U there exists a scalar α P ∈ C − {0} such that, if P = [ p0 : · · · : pn ], then

(g0 ( p0 , . . . , pn ), . . . , gm ( p0 , . . . , pn )) = α P (h 0 ( p0 , . . . , pn ), . . . , h m ( p0 , . . . , pn )).

It follows that the homogeneous polynomials

g j h i − gi h j
150 9 Elements of Projective Algebraic Geometry

vanish at all the points of U . Thus they must vanish at all the points of Pn , since
U is dense. In particular they vanish in all the points of U1 ∪ U2 . It follows im-
mediately that for any P ∈ U1 ∪ U2 , P = [ p0 : · · · : pn ], the sets of coordinates
[g0 ( p0 , . . . , pn ) : · · · : gm ( p0 , . . . , pn )] and [h 0 ( p0 , . . . , pn ) : · · · : h m ( p0 , . . . , pn )]
determine the same, well defined point of Pm. The claim follows. □

After Proposition 9.3.2 one may wonder if the local definition of projective maps
is really necessary.
Well, it is, as illustrated in the following Example 9.3.4.
The fundamental point is that the polynomials f0, . . . , fm that define the projective map f over U cannot have a common zero Q ∈ U; otherwise, the map would not be defined at Q. Sometimes this property cannot be obtained globally by a unique set of polynomials. It is then necessary to use an open cover and vary the polynomials, passing from one open subset to another one.

Example 9.3.3 Assume n ≤ m and consider the map between projective spaces f : Pn → Pm, defined globally by polynomials f0(x0, . . . , xn), . . . , fm(x0, . . . , xn), where:

fi(x0, . . . , xn) = xi if i ≤ n, and fi(x0, . . . , xn) = 0 otherwise.

Then f is a projective map. It is obvious that f is injective.


Be careful that the map f would not exist for n > m.
Indeed, if for instance n = m + 1, then the image of the point P ∈ Pn , with
coordinates [0 : · · · : 0 : 1], would be the point of coordinates [0 : · · · : 0], which
does not exist in Pm .

Example 9.3.4 Let X be the hypersurface of the projective plane P2 defined by g(x0, x1, x2) = x0² + x1² − x2². One immediately realizes that X corresponds to the usual circle of analytic geometry.
We define a projective map (the stereographic projection) f : X → P1 (see Figure 9.1).
Consider the two hypersurfaces X (h 1 ), X (h 2 ), where h 1 , h 2 are respectively
equal to x1 − x2 and x1 + x2 . Notice that X (h 1 ) ∩ X (h 2 ) is just the point of co-
ordinates [1 : 0 : 0], which does not belong to X . Thus the open subsets of the plane
(X (h 1 ))c , (X (h 2 ))c cover X . Define two open subsets of X by U1 = X ∩ (X (h 1 ))c ,
U2 = X ∩ (X (h 2 ))c .
Then define the map f as follows:

on U1: y0 = x0, y1 = x1 − x2;    on U2: y0 = x1 + x2, y1 = −x0.

We need to prove that the definition is consistent on the intersection U1 ∩ U2. Notice that if Q = [0 : q1 : q2] belongs to X, then q1² − q2² = 0, so that Q belongs either to U1 or to U2, but not to both, since one cannot have q1 = q2 = 0.
Fig. 9.1 Stereographic projection (the original figure shows the conic X, the lines x0 = 0, x1 = 0, x1 − x2 = 0, x1 + x2 = 0, and the points [0 : 1 : 1], [0 : −1 : 1])

Thus any point Q = [q0 : q1 : q2] ∈ U1 ∩ U2 satisfies q0 ≠ 0. Since clearly Q also satisfies q1 + q2 ≠ 0, then:

[q0 : q1 − q2] = [q0(q1 + q2) : (q1 − q2)(q1 + q2)] = [q0(q1 + q2) : q1² − q2²] = [q0(q1 + q2) : −q0²] = [q1 + q2 : −q0].

Thus f is a well defined projective map.


Notice that the two polynomials that define the map on U1 cannot define the map globally, because X \ U1 contains the point [0 : 1 : 1], where both x0 and x1 − x2 vanish.
The map f is one-to-one and onto. Indeed if B = [b0 : b1] ∈ P1, then f⁻¹(B) consists of the unique point [2b0b1 : −b0² + b1² : −b0² − b1²], as one can easily compute.
Proposition 9.3.5 Projective maps are continuous in the Zariski topology.
Proof Consider X ⊂ Pn and a projective map f : X → Y ⊂ Pm . We may assume
indeed Y = Pm , to prove the continuity. Let U be an open subset of Pm , which is the
complement of a hypersurface X(g), for some g ∈ C[y0, . . . , ym]. Let A be an open subset of X where f is defined by the polynomials f0, . . . , fm. Then f⁻¹(U) ∩ A is the intersection of A with the complement of the hypersurface defined by the homogeneous polynomial g(f0, . . . , fm) ∈ C[x0, . . . , xn]. It follows that f⁻¹(U) is a union of open sets, hence it is open.
Since every open subset of Pm is a (finite) union of complements of hypersurfaces (see Exercise 27), the claim follows. □

Definition 9.3.6 We will say that a projective map f : X → Y is an isomorphism


if there is a projective map g : Y → X such that g ◦ f = identity on X and f ◦ g =
identity on Y .
Equivalently, a projective map f is an isomorphism if it is one-to-one and onto,
and the set-theoretic inverse g is itself a projective map.
Example 9.3.7 Let us prove that the map f defined in Example 9.3.4, from the hypersurface X ⊂ P2 defined by the polynomial x0² + x1² − x2², to P1, is an isomorphism.
We already know, indeed, what is the inverse of f: it is the map g : P1 → X defined parametrically by

x0 = 2y0y1
x1 = −y0² + y1²
x2 = −y0² − y1²

It is immediate to check, indeed, that both g ◦ f and f ◦ g are the identity on the
respective spaces.
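This check can also be carried out symbolically; the sketch below (ours, not from the book) verifies that, on U1, g(f(P)) returns P up to the nonzero scalar 2(x1 − x2), modulo the equation of X.

```python
# Hedged sketch: on U1 we have g(f(P)) = 2*(x1 - x2) * P modulo the relation
# rel = x0^2 + x1^2 - x2^2, which vanishes on X.
from sympy import symbols, expand, cancel

x0, x1, x2 = symbols('x0 x1 x2')
rel = x0**2 + x1**2 - x2**2

fy0, fy1 = x0, x1 - x2                                # f on U1
gf = [2*fy0*fy1, -fy0**2 + fy1**2, -fy0**2 - fy1**2]  # g(f(P))

lam = 2*(x1 - x2)                                     # nonzero scalar on U1
diffs = [expand(c - lam*x) for c, x in zip(gf, [x0, x1, x2])]
print([cancel(d / rel) for d in diffs])               # e.g. [0, -1, -1]
```

Each difference is a constant multiple of the relation rel, hence vanishes on X, confirming that g ∘ f is the identity on U1.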
Remark 9.3.8 We are now able to prove that the map f of Example 9.3.4 cannot be defined globally by a pair of polynomials:

y0 = p0(x0, x1, x2)
y1 = p1(x0, x1, x2)

Otherwise, since the map g defined in the previous example is the inverse of f, we would have that for any choice of Q = (b0, b1) ≠ (0, 0), the homogeneous polynomial

hb0,b1 = b1 p0(2y0y1, y1² − y0², −y0² − y1²) − b0 p1(2y0y1, y1² − y0², −y0² − y1²),

whose vanishing defines f(g(Q)), vanishes at a single point of P1. Notice that the degree d of any hb0,b1 is at least 2.
Since C is algebraically closed, a homogeneous polynomial in two variables that vanishes at a single point is a power of a linear form. Thus any polynomial hb0,b1 is a d-th power of a linear form. In particular there are scalars a0, a1, c0, c1 such that:

h1,0 = p0(2y0y1, y1² − y0², −y0² − y1²) = (a0y0 − a1y1)^d
h0,1 = p1(2y0y1, y1² − y0², −y0² − y1²) = (c0y0 − c1y1)^d

Notice that the point Q′ = [a1 : a0] cannot be equal to [c1 : c0], otherwise both p0, p1 would vanish at g(Q′) ∈ X. Then h1,−1 = (a0y0 − a1y1)^d − (c0y0 − c1y1)^d vanishes at two different points, namely [a1 + c1 : a0 + c0] and [ea1 + c1 : ea0 + c0], where e is any d-th root of unity different from 1.
In the case of multiprojective varieties, most definitions and properties above can
be rephrased and proved straightforwardly.

Definition 9.3.9 Let X ⊂ Pa1 × · · · × Pan be a multiprojective variety. A map f : X → Pm is a projective map if there exists an open cover {Ui} of the domain X such that f is defined over each Ui by multihomogeneous polynomials, all of the same multidegree.
In other words, f is projective if for any Ui of a cover there are multihomogeneous polynomials f0, . . . , fm in C[x1,0, . . . , xn,an], of the same multidegree, such that for all P ∈ Ui, P = ([p1,0 : · · · : p1,a1], . . . , [pn,0 : · · · : pn,an]), then f(P) has coordinates

f(P) = [f0((p1,0, . . . , p1,a1), . . . , (pn,0, . . . , pn,an)) : · · · : fm((p1,0, . . . , p1,a1), . . . , (pn,0, . . . , pn,an))].

We will write in parametric form:

y0 = f0(x1,0, . . . , xn,an)
. . .
ym = fm(x1,0, . . . , xn,an)

We will say that f : X → Pb1 × · · · × Pbm is a multiprojective map if all of its components are.
Remark 9.3.10 The composition of two multiprojective maps is a multiprojective
map.
The identity from a multiprojective variety to itself is a multiprojective map.
Multiprojective maps are continuous in the Zariski topology.
Proposition 9.3.11 Let f : Pa1 × · · · × Pan → Pm be a multiprojective map. Then
there exists a set of m + 1 multihomogeneous polynomials f 0 , . . . , f m of the same
multidegree, such that f (Q) is defined by f 0 , . . . , f m for all Q ∈ Pa1 × · · · × Pan .

9.4 Exercises

Exercise 19 Prove the last assertion of Example 9.1.8. If { p1 , . . . , pm } is a collection


of linear homogeneous polynomials in C[t0 , . . . , tn ], then the projective variety X
defined by { p1 , . . . , pm } can be canonically identified with the projective space P(W )
over the linear subspace W ⊂ Cn+1 defined by the pi ’s.
Exercise 20 Prove that if X is the projective variety defined by a set J of homoge-
neous polynomials and J  ⊂ J , then X  = X (J  ) contains X .
Find an example of different subsets J ≠ J′ of linear polynomials such that X(J) = X(J′).

Exercise 21 Prove that if I is a homogeneous ideal, then √I is also a homogeneous ideal and X(I) = X(√I).

Exercise 22 Prove that if Y is an irreducible subset, then the closure of Y is irre-


ducible.
Prove that if an irreducible subset Y is contained in a finite union of closed subsets
X 1 ∪ · · · ∪ X m , then Y is contained in some X i .
Prove that if Y1 ⊂ . . . ⊂ Yi ⊂ . . . is an ascending chain of irreducible subsets, then ∪Yi is irreducible.

Exercise 23 Prove that if X is an irreducible subset of a topological space, then X


is not the union of a finite number of subsets Yi ⊂ X which are closed in the induced
topology.

Exercise 24 Prove that if X 1 and X 2 are topological spaces, and X 1 has the sep-
aration property T1 , then for any Q ∈ X 1 the fiber {Q} × X 2 is a closed subset of
X 1 × X 2 which is homeomorphic to X 2 .
Prove that if X 1 and X 2 are irreducible, and one of them has the property T1 , then
also the product X 1 × X 2 is irreducible.

Exercise 25 Determine which Hausdorff topological spaces can be irreducible.

Exercise 26 Prove that if X is a finite projective variety, then the irreducible com-
ponents of X are its singletons.

Exercise 27 Prove that any open subset of a projective variety X is covered by a


finite union of open subsets which are the intersection of X with the complement of
a hypersurface.

Exercise 28 Prove that if p is an irreducible polynomial and c ∈ C, then also cp is


irreducible.
Prove that every linear polynomial is irreducible.

Exercise 29 Prove that the multiprojective variety X defined by x0 y1 − x1 y0 in the


product Y = P1 × P1 does not coincide with Y .

Exercise 30 Prove that the composition of two projective maps is a projective map.
Prove that the identity from a projective variety to itself is a projective map.

References

1. Zariski, O., Samuel, P.: Commutative Algebra I. Graduate Texts in Mathematics, Vol. 28. Springer, Berlin (1958)
2. Macaulay, F.S.: The Algebraic Theory of Modular Systems. Cambridge University Press, Cambridge (1916)
Chapter 10
Projective Maps and the Chow's Theorem

The chapter contains the proof of Chow's Theorem, a fundamental result for algebraic varieties with an important consequence for the study of statistical models. It states that, over an algebraically closed field like C, the image of a projective (or multiprojective) variety X under a projective map is a Zariski closed subset of the target space, i.e., it is itself a projective variety.
The proof of Chow's Theorem requires an analysis of projective maps, which can be reduced to a composition of linear maps, Segre maps and Veronese maps.
The proof will also require the introduction of a basic concept of elimination theory, i.e., the resultant of two polynomials.

10.1 Linear Maps and Change of Coordinates

We start by analyzing projective maps induced by linear maps of vector spaces.
The nontrivial case concerns linear maps which are surjective but not injective. After a change of coordinates, such maps induce maps between projective varieties that can be described as projections.
Despite the fact that the words projective and projection have a common origin (in the paintings of the Italian Renaissance, e.g.), projections do not always give rise to projective maps.
The description of the image of a projective variety under projections relies indeed on nontrivial algebraic tools: the rudiments of elimination theory.
Let us start with a generalization of Example 9.3.3.

Definition 10.1.1 Consider a linear map φ : Cn+1 → Cm+1 which is injective.


Then φ defines a projective map (which, by abuse, we will still denote by φ)
between the projective spaces Pn → Pm , as follows:


– for all P ∈ Pn , consider a set of homogeneous coordinates [x0 : · · · : xn ] and


send P to the point φ(P) ∈ Pm with homogeneous coordinates [φ(x0 , . . . , xn )].
Such maps are called linear projective maps.

It is clear that the point φ(P) does not depend on the choice of a set of coordinates
for P, since φ is linear.
Notice that we cannot define a projective map in the same way when φ is not
injective. Indeed, in this case, the image of a point P whose coordinates lie in the
Kernel of φ would be indeterminate.
Since any linear map Cn+1 → Cm+1 is defined by linear homogeneous polyno-
mials, then it is clear that the induced map between projective spaces is indeed a
projective map.

Example 10.1.2 Assume that the linear map φ is an isomorphism of Cn+1 . Then the
corresponding linear projective map is called a change of coordinates.
Indeed φ corresponds to a change of basis inside Cn+1 .
The associated map φ : Pn → Pn is an isomorphism, since the inverse isomor-
phism φ−1 determines a projective map which is the inverse of φ.

Remark 10.1.3 By construction, any change of coordinates in a projective space is


a homeomorphism of the corresponding topological space, in the Zariski topology.
So, the image of a projective variety under a change of coordinates is still a
projective variety.

From now on, when dealing with projective varieties, we will freely act with the
change of coordinates on them.
The previous remark generalizes to any linear projective map.

Proposition 10.1.4 For every injective map φ : Cn+1 → Cm+1 , m ≥ n, the asso-
ciated linear projective map φ : Pn → Pm sends projective subvarieties of Pn to
projective subvarieties of Pm .
In topological terms, any linear projective map is closed in the Zariski topology,
i.e., it sends closed sets to closed sets.

Proof The linear map φ factorizes in a composition φ = ψ ∘ φ′, where φ′ is the inclusion which sends [x0 : · · · : xn] to [x0 : · · · : xn : 0 : · · · : 0], with m − n zeroes, and ψ is a change of coordinates (notice that we are identifying the coordinates in Pn with the first n + 1 coordinates in Pm).
Thus, up to a change of coordinates, any linear projective map can be reduced
to the map that embeds Pn into Pm as the linear space defined by equations xn+1 =
· · · = xm = 0.
It follows that if X ⊂ Pn is the subvariety defined by homogeneous polynomials
f 1 , . . . , f s then, up to a change of coordinates, the image of X is the projective
subvariety defined in Pm by the polynomials f1, . . . , fs, xn+1, . . . , xm. □

The definition of linear projective maps, which requires that φ is injective, becomes
much more complicated if we drop the injectivity assumption.

Let φ : Cn+1 → Cm+1 be a non injective linear map. In this case, we cannot define
through φ a projective map Pn → Pm as above, since for any vector ( p0 , . . . , pn )
in the kernel of φ, the image of the point [ p0 : · · · : pn ] is undefined, because
φ( p0 , . . . , pn ) vanishes.
On the other hand, the kernel of φ defines a projective linear subspace of Pn , the
projective kernel, which will be denoted by K φ .
If X ⊂ Pn is a subvariety which does not meet K φ , then the restriction of φ to the
coordinates of the points of X determines a well-defined map from X to Pm .

Example 10.1.5 Consider the point P0 ∈ Pm of projective coordinates [1 : 0 : 0 : · · · : 0] and let M be the linear subspace of Cm+1 of the points with first coordinate equal to 0, i.e., M = X(x0). Let φ0 : Cm+1 → M be the linear surjective (but not injective) map which sends a vector (x0, x1, . . . , xm) to (0, x1, . . . , xm).
Notice that M defines a linear projective subspace P(M) ⊂ Pm, of projective dimension m − 1 (i.e., a hyperplane), and P0 ∉ P(M). Moreover P0 is exactly the projective kernel of φ0. Let Q be any point of Pm different from P0. If Q = [q0 : q1 : · · · : qm] then φ0(Q) = (0, q1, . . . , qm) determines a well defined projective point, which corresponds to the intersection of P(M) with the line P0Q. This is the reason why we call φ0 the projection from P0 to P(M). Notice that we cannot define a global projection Pm → P(M), since it would not be defined at P0. What we get is a set-theoretic map Pm \ {P0} → P(M). For any other choice of a point P ∈ Pm and a hyperplane H, not containing P, there exists a change of coordinates which sends P to P0 and H to P(M). Thus the geometric projection Pm \ {P} → H from P to H is equal to the map described above, up to a change of coordinates.
We can generalize the construction to projections from positive dimensional linear subspaces.
Namely, for a fixed n < m consider the subspace N ⊂ Cm+1, of dimension m − n < m + 1, formed by the (m + 1)-tuples of type (0, . . . , 0, xn+1, . . . , xm), and let M be the (n + 1)-dimensional linear subspace of (m + 1)-tuples of type (x0, . . . , xn, 0, . . . , 0). Let φ0 : Cm+1 → M be the linear surjective (but not injective) map which sends any (x0, . . . , xm) to (x0, . . . , xn, 0, . . . , 0).
Notice that N and M define disjoint linear projective subspaces, respectively P(N), of projective dimension m − n − 1, and P(M), of projective dimension n. Let Q be a point of Pm \ P(N) (this means exactly that Q has coordinates [q0 : · · · : qm], with qi ≠ 0 for some index i between 0 and n). Then the image of Q under φ0 is a well defined projective point, which corresponds to the intersection of P(M) with the projective linear subspace spanned by P(N) and Q. This is why we get from φ0 a set-theoretic map Pm \ P(N) → P(M), which we call the projection from P(N) to P(M).
For any choice of two disjoint linear subspaces L1, of dimension m − n − 1, and L2, of dimension n, there exists a change of coordinates which sends L1 to P(N) and L2 to P(M). Thus the geometric projection Pm \ L1 → L2 from L1 to L2 is equal to the map described above, up to a change of coordinates.
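In coordinates the first construction is elementary; a minimal sketch of ours (plain Python, hypothetical helper name) for the projection from P0:

```python
# Hedged sketch: the projection from P0 = [1:0:...:0] forgets the first
# homogeneous coordinate; it is undefined exactly at the center P0.
def project_from_P0(p):
    q = list(p[1:])
    if all(c == 0 for c in q):
        raise ValueError("projection undefined at the center P0")
    return q

print(project_from_P0([5, 1, 2]))   # [1, 2]: the image of [5:1:2] in P(M)
```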

Example 10.1.6 Let φ : Cm+1 → Cn+1 be any surjective map, with kernel L 1 (of
dimension m − n). We can always assume, up to a change of coordinates, that L 1
coincides with the subspace N defined in Example 10.1.5. Then considering the
linear subspace M ⊂ Cm+1 defined in Example 10.1.5, we can find an isomorphism of
vector spaces ψ from M to Cn+1 such that φ = ψ ◦ φ0 , where φ0 is the map introduced
in Example 10.1.5. Thus, after an isomorphism and a change of coordinates, φ acts
on points of Pm \ K φ as a geometric projection.

Example 10.1.6 suggests the following definition.

Definition 10.1.7 Given a linear surjective map φ : Cm+1 → Cn+1 and a subvariety
X ⊂ Pm which does not meet K φ , the restriction map φ|X : X → Pn is a well defined
projective map, which will be denoted as a projection of X from K φ . The subspace
K φ is also called the center of the projection.
Notice that φ|X is a projective map, since it is defined, up to isomorphisms and
change of coordinates, by (simple) homogeneous polynomials (see Exercise 31).

Thus, linear surjective maps define projections from suitable subvarieties of Pm to Pn. The next section is devoted to proving that projections are closed in the Zariski topology.

10.2 Elimination Theory

In this section, we introduce the basic concept of the elimination theory: the resultant
of two polynomials.
The resultant provides an answer to the following problem:
– assume we are given two (not necessarily homogeneous) polynomials f, g ∈ C[x].
Clearly both f and g factorize in a product of linear factors. Which algebraic condi-
tion must f, g satisfy to share a common factor, hence a common root?

Definition 10.2.1 Let f, g be nonconstant (nonhomogeneous) polynomials in one variable x, with coefficients in C. Write f = a0 + a1x + a2x² + · · · + anx^n and g = b0 + b1x + b2x² + · · · + bmx^m. The resultant R(f, g) of f, g is the determinant of the Sylvester matrix S(f, g), which in turn is the (m + n) × (m + n) matrix defined as follows:

S(f, g) =
⎛ a0 a1 a2 . . . an 0  0  0  . . . 0  ⎞
⎜ 0  a0 a1 a2 . . . an 0  0  . . . 0  ⎟
⎜ . . . . . . . . . . . . . . . . . . ⎟
⎜ 0  . . . 0  0  0  a0 a1 a2 . . . an ⎟
⎜ b0 b1 . . . bm 0  0  0  0  . . . 0  ⎟
⎜ 0  b0 b1 . . . bm 0  0  0  . . . 0  ⎟
⎜ . . . . . . . . . . . . . . . . . . ⎟
⎝ 0  . . . 0  0  0  0  b0 b1 . . . bm ⎠

where the rows of a's are repeated m times and the rows of b's are repeated n times.
Notice that when f is constant and g has degree d > 0, then by definition R(f, g) = f^d.
When both f, g are constant, the previous definition of resultant makes no sense. In this case we set:

R(f, g) = 0 if f = g = 0, and R(f, g) = 1 otherwise.

Example 10.2.2 Just to give an instance, if f = x² − 3x + 2 and g = x − 1 (both vanishing at x = 1), the resultant R(f, g) is

R(f, g) = det S(f, g) = det
⎛ 2  −3  1 ⎞
⎜ −1  1  0 ⎟ = 0
⎝ 0  −1  1 ⎠
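The same computation can be delegated to a computer algebra system; a hedged sketch of ours with sympy's built-in resultant, which computes the determinant of the Sylvester matrix:

```python
# Hedged sketch: checking Example 10.2.2 with sympy's resultant.
from sympy import symbols, resultant

x = symbols('x')
f = x**2 - 3*x + 2             # roots 1 and 2
g = x - 1                      # root 1
print(resultant(f, g, x))      # 0: f and g share the root x = 1
print(resultant(f, x - 3, x))  # nonzero: f and x - 3 have no common root
```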

Proposition 10.2.3 With the previous notation, f and g have a common root if and
only if R( f, g) = 0.
Proof The proof is immediate when either f or g is constant (Exercise 33).
Otherwise write C[x]i for the vector space of polynomials of degree ≤ i in C[x].
Then the transpose of S( f, g) is the matrix of the linear map:

φ : C[x]m−1 × C[x]n−1 → C[x]m+n−1

which sends ( p, q) to p f + qg (matrix computed with respect to the natural basis


defined by the monomials). Thus R( f, g) = 0 if and only if the map has a nontrivial
kernel.
Let ( p0 , q0 ) be a nontrivial element of the kernel, i.e., p0 f + q0 g = 0. Consider
the factors (x − αi ) of f , where the αi ’s are the roots of f (possibly some factor is
repeated). Then, all these factors must divide q0 g. Since deg q0 < deg f , at least one
factor x − αi must divide g. Thus αi is a common root of f and g.
Conversely, if α is a common root of f and g, then x − α divides both f and
g. Hence setting p0 = g/(x − α), q0 = − f /(x − α), one finds a nontrivial element
(p0, q0) of the kernel of φ, so that det S(f, g) = 0. □
We have the analogue construction if f, g are homogeneous polynomials in two
or more variables.
Definition 10.2.4 If f, g are homogeneous polynomials in C[x0 , x1 , . . . , xr ] one
can define the 0th resultant R0 ( f, g) of f, g just by considering f and g as poly-
nomials in x0 , with coefficients in C[x1 , . . . , xr ], and taking the determinant of the
corresponding Sylvester matrix S0 ( f, g).
R0 ( f, g) is thus a polynomial in x1 , . . . , xr .
For instance, if f = x0²x1 + x0x1x2 + x2³ and g = 2x0x1² + x0x2² + 3x1²x2, then the 0th resultant is:

R0(f, g) = det S0(f, g) = det
⎛ x1          x1x2        x2³    ⎞
⎜ 2x1² + x2²  3x1²x2      0      ⎟
⎝ 0           2x1² + x2²  3x1²x2 ⎠

= 3x1⁵x2² + 4x1⁴x2³ − 3x1³x2⁴ + 4x1²x2⁵ + x2⁷.
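Again, this can be reproduced symbolically (our sketch; sympy's sign convention agrees with the determinant above in this case):

```python
# Hedged sketch: the 0th resultant of the example above, eliminating x0.
from sympy import symbols, resultant, expand

x0, x1, x2 = symbols('x0 x1 x2')
f = x0**2*x1 + x0*x1*x2 + x2**3
g = 2*x0*x1**2 + x0*x2**2 + 3*x1**2*x2
print(expand(resultant(f, g, x0)))
# 3*x1**5*x2**2 + 4*x1**4*x2**3 - 3*x1**3*x2**4 + 4*x1**2*x2**5 + x2**7
```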

From Proposition 10.2.3, with an easy induction on the number of variables, one
finds that the resultant of homogeneous polynomials in several variables has the
following property:
Proposition 10.2.5 Let f, g be homogeneous polynomials in C[x0 , x1 , . . . , xr ].
Then R0 ( f, g) vanishes at (α1 , . . . , αr ) if and only if there exists α0 ∈ C with:

f (α0 , α1 , . . . , αr ) = g(α0 , α1 , . . . , αr ) = 0.

Less obvious, but useful, is the following remark on the resultant of two homo-
geneous polynomials.
Proposition 10.2.6 Let f, g be homogeneous polynomials in C[x0 , x1 , . . . , xr ].
Then R0 ( f, g) is homogeneous.
Proof The entries sij of the Sylvester matrix S0(f, g) are homogeneous and their degrees decrease by 1 passing from one element sij to the next element si,j+1 in the same row (unless some of them is 0). Thus for any nonzero entry sij of the matrix, the number deg sij + j depends only on the row i. Call it ui. Then the summands given by any permutation, in the computation of the determinant, are homogeneous of the same degree

d = Σi ui − (1/2)(n + m)(n + m + 1),

(the columns being indexed by j = 1, . . . , n + m), so that R0(f, g) is homogeneous, and its degree is equal to d. □


Next, a fundamental property of the resultant R0 ( f, g) is that it belongs to the
ideal generated by f and g. We will not give a full proof of this property, and refer
to the book [1] for it.
Instead, we just prove that R0 ( f, g) belongs to the radical of the ideal generated
by f and g, which is sufficient for our aims.
Proposition 10.2.7 Let f, g be homogeneous polynomials in C[x0 , x1 , . . . , xr ].
Then R0 ( f, g) belongs to the radical of the ideal generated by f and g.
Proof In view of the Nullstellensatz, it is sufficient to prove that R0 ( f, g) vanishes
at all points of the variety defined by f and g. But this is obvious from Proposition
10.2.5: if [α0 : α1 : · · · : αr ] are homogeneous coordinates of P ∈ V ( f, g), then

R0 ( f, g)(α0 , α1 , . . . , αr ) = R0 ( f, g)(α1 , . . . , αr ) = 0,

since f(x0, α1, . . . , αr) and g(x0, α1, . . . , αr), polynomials in C[x0], have a common root α0. □

10.3 Forgetting a Variable

In Sect. 10.1 we introduced the projection maps as projectifications of surjective


linear maps φ : Cm+1 → Cn+1 . It is important to recall that when φ has nontrivial
kernel (i.e., when n < m) the projection is not defined as a map between the two
projective spaces Pm and Pn . On the other hand, for any subvariety X ⊂ Pm which
does not intersect the projectification of K er (φ), the map φ corresponds to a well
defined projective map X → Pn .
In this section, we describe the image of a variety under a projection π from a point, i.e., when the center of projection has dimension 0. It turns out, in particular, that π(X) is itself an algebraic variety.
Through this section, consider the surjective linear map φ : Cn+1 → Cn which sends (α0, α1, . . . , αn) to (α1, . . . , αn). The kernel of the map is generated by (1, 0, . . . , 0). Thus, if X is a projective variety in Pn which misses the point P0 = [1 : 0 : · · · : 0], then the map induces a well defined projective map π : X → Pn−1: the projection from P0 (see Definition 10.1.7).
For any point Q ∈ π(X), Q = [q1 : · · · : qn], the inverse image of Q in X is the intersection of X with the line joining P0 and Q. Thus π−1(Q) is the set of points in X with coordinates [q0 : q1 : · · · : qn], for some q0 ∈ C.

Remark 10.3.1 For all Q ∈ π(X ), the inverse image π −1 (Q) is finite.
Indeed π −1 (Q) is a Zariski closed set in the line P0 Q, and it does not contain P0 ,
since P0 ∈/ X . The claim follows since the Zariski topology on a line is the cofinite
topology.

Let J ⊂ C[x0 , . . . , xn ] be the homogeneous ideal associated to X . Define:

J0 = J ∩ C[x1 , . . . , xn ].

In other words, J0 is the set of elements in J which are constant with respect to the
variable x0 . In Chap. 13 we will talk again about Elimination Theory, but from the
point of view of Groebner Basis; there, the ideal J0 will be called the first elimination
ideal of J (Definition 13.5.1).

Remark 10.3.2 J0 is a homogeneous radical ideal in C[x1, . . . , xn].
Indeed J0 is obviously an ideal. Moreover for any g ∈ J0, any homogeneous component gd of g belongs to J, because J is homogeneous, and does not contain x0. Thus gd ∈ J0, and this is sufficient to conclude that J0 is homogeneous (see Proposition 9.1.15).
If g^d ∈ J0 for some g ∈ C[x1, . . . , xn], then g ∈ J, because J is radical; moreover g does not contain x0. Thus also g ∈ J0.
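In practice J0 can be computed with a lexicographic Groebner basis, as anticipated above for Chap. 13. A hedged sympy sketch on a toy ideal (our choice of generators, not from the book):

```python
# Hedged sketch: the first elimination ideal J0 = J ∩ C[x1, x2], read off a
# lex Groebner basis with x0 first by keeping elements not involving x0.
from sympy import symbols, groebner

x0, x1, x2 = symbols('x0 x1 x2')
J = [x0**2 - x1*x2, x0*x1 - x2**2]        # a toy homogeneous ideal
G = groebner(J, x0, x1, x2, order='lex')
J0 = [p for p in G.exprs if x0 not in p.free_symbols]
print(J0)                                  # generators of J0
```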

We prove that π(X ) is the projective variety defined by J0 . We will need the
following refinement of Lemma 9.1.5:

Lemma 10.3.3 Let Q1, . . . , Qk be a finite set of points in Pn. Then there exists a linear form ℓ ∈ C[x0, . . . , xn] such that ℓ(Qi) ≠ 0 for all i.
If none of the Qi's belong to the variety defined by a homogeneous ideal J, then there exists g ∈ J such that g(Qi) ≠ 0 for all i.

Proof Fix a set of homogeneous generators g1, . . . , gs of J.
First assume that all the gi's are linear. Then the gi's define a subspace L of the space of linear homogeneous polynomials in C[x0, . . . , xn]. For each Qi, the set Li of linear forms in L that vanish at Qi is a linear subspace of L, which is properly contained in L, because some gj does not vanish at Qi. Since a nontrivial complex linear space cannot be the union of a finite number of proper subspaces, we get that for a general choice of b1, . . . , bs ∈ C, the linear form ℓ = b1g1 + · · · + bsgs does not belong to any Li, thus ℓ(Qi) ≠ 0 for all i. This proves the second claim for ideals generated by linear forms.
The first claim now follows soon, since the (irrelevant) ideal J generated by all the linear forms defines the empty set.
For general g1, . . . , gs, call di the degree of gi and d = max{di}. If ℓ is a linear form that does not vanish at any Qi, then hj = ℓ^(d−dj) gj is a form of degree d that vanishes at Qi precisely when gj vanishes. The forms h1, . . . , hs define a subspace L′ of the space of forms of degree d. For all i, the set of forms in L′ that vanish at Qi is a proper subspace of L′. Thus, as before, for a general choice of b1, . . . , bs ∈ C, the form g = b1h1 + · · · + bshs is an element of J which does not vanish at any Qi. □

Theorem 10.3.4 The variety defined in Pn−1 by the ideal J0 coincides with π(X ).

Proof Let Q ∈ π(X), Q = [q1 : · · · : qn]. Then there exists q0 ∈ C such that the point P = [q0 : q1 : · · · : qn] ∈ Pn belongs to X. Thus g(P) = 0 for all g ∈ J. In particular, this is true for all g ∈ J0. On the other hand, if g ∈ J0 then g does not contain x0, thus:

0 = g(P) = g(q0, q1, . . . , qn) = g(q1, . . . , qn) = g(Q).

This proves that the variety defined by J0 contains π(X).
Conversely, identify Pn−1 with the hyperplane x0 = 0, and fix a point Q = [q1 : · · · : qn] ∉ π(X). Consider an element f ∈ J that does not vanish at P0 and let W be the variety defined by f. The intersection of W with the line P0Q is a finite set Q1, . . . , Qk. Moreover no Qi can belong to X, since π−1(Q) is empty. Thus, by Lemma 10.3.3, there exists g ∈ J that does not vanish at any Qi. Consider the resultant h = R0(f, g). By Proposition 10.2.7, h belongs to the radical of the ideal generated by f, g, thus it belongs to J, which is a radical ideal. Moreover h does not contain the variable x0. Thus h ∈ J0. Finally, from Proposition 10.2.5 it follows that h(Q) ≠ 0. Then Q does not belong to the variety defined by J0. □

Remark 10.3.5 A direct consequence of Theorem 10.3.4 is that the projection π is


a closed map, in the Zariski topology.

Indeed any closed subset Y of a projective variety X is itself a projective variety, thus by Theorem 10.3.4 the image of Y under π is Zariski closed.

We can repeat all the constructions of this section by selecting any variable xi
instead of x0 and performing the elimination of xi . Thus we can define the i-th
resultant Ri ( f, g) and use it to prove that projections with center any coordinate
point [0 : · · · : 0 : 1 : 0 : · · · : 0] are closed maps.

10.4 Linear Projective and Multiprojective Maps

In this section, we prove that projective maps defined by linear maps of projective
spaces are closed in the Zariski topology.

Remark 10.4.1 Let V, W be linear spaces of dimension n + 1 and m + 1, respectively.


The choice of a basis for V corresponds to fixing an isomorphism between V and
Cn+1 . Thus we can identify, after a choice of the basis, the projective space P(V )
with Pn . We will use this identification to introduce all the concepts of Projective
Geometry into P(V ). Notice that two such identifications differ by a change of basis
in Pn , thus they are equivalent, up to an isomorphism of Pn .
Similarly, the choice of a basis for W corresponds to fixing an isomorphism
between W and Cm+1 .
A linear map W → V corresponds, under the choice of a basis, to a linear map
Cm+1 → Cn+1 . Thus, the study of projective maps in Pm and Pn induced by linear
maps Cm+1 → Cn+1 corresponds to the study of projective maps in P(W ), P(V )
induced by linear maps W → V .

Proposition 10.4.2 Let φ : Cm+1 → Cn+1 be a linear map. Let K φ be the projective
kernel of φ and let X ⊂ Pm be a projective subvariety such that X ∩ K φ = ∅. Then
φ induces a projective map X → Pn (that we will denote again with φ) which is a
closed map in the Zariski topology.

Proof The map φ factors through a linear surjection φ1 : Cm+1 → Cm+1 /K er (φ)
followed by a linear injection φ2 . After the choice of a basis, the space Cm+1 /K er (φ)
can be identified with C N +1 , where N = m − dim(K er (φ)), so that φ1 can be con-
sidered as a map Cm+1 → C N +1 and φ2 as a map C N +1 → Cn+1 . Since X does
not meet the kernel of φ1 , by Definition 10.1.7 φ1 induces a projection X → P N .
The injective map φ2 defines a projective map P N → Pn , by Definition 10.1.1. The
composition of these two maps is the projective map φ : X → Pn of the claim. It is closed since it is the composition of two closed maps. □

Notice that the projective map φ is only defined up to a change of coordinates, since it relies on the choice of a basis in Cm+1/Ker(φ).
since it relies on the choice of a basis in Cm+1 /K er (φ).
The previous result can be extended to maps from multiprojective varieties to
multiprojective spaces.

Example 10.4.3 Consider a multiprojective product Pa1 × · · · × Pas and an injective


linear map φ : Ca1 +1 → Cm+1 . Then the induced linear map:

Pa1 × Pa2 × · · · × Pas → Pm × Pa2 × · · · × Pas

is multiprojective and closed, because the product of closed maps is closed.


Of course, the same statement holds if we replace the first index with any other index, or if we mix up the indices. Moreover we can apply it repeatedly.
Similarly, consider a linear map φ : Ca1+1 → Cm+1 and a multiprojective subvariety X ⊂ Pa1 × · · · × Pas such that X is disjoint from P(ker φ) × Pa2 × · · · × Pas. Then there is an induced linear map X → Pm × Pa2 × · · · × Pas which is multiprojective and closed.
Proposition 10.4.4 Any projection πi from a multiprojective space Pa1 × · · · × Pas to any of its factors Pai is a closed projective map.
Proof The map πi is defined by sending P = ([p10 : · · · : p1a1], . . . , [ps0 : · · · : psas]) to [pi0 : · · · : piai]. Thus the map is defined by multihomogeneous polynomials (of multidegree 1 in the i-th set of variables and 0 in the other sets).
To prove that πi is closed, we show that the image in πi of any multiprojective variety is a projective subvariety of Pai. Let X ⊂ Pa1 × · · · × Pas be a multiprojective subvariety and let Y = πi(X). If Y = Pai, there is nothing to prove. In particular the claim holds if ai = 0, i.e., if Pai is a point. We will proceed then by induction on ai, assuming that Y ≠ Pai.
Let Q be a point of Pai \ Y. Then no points of type (P1, . . . , Ps) with Pi = Q can belong to X. Thus X does not contain points (P1, . . . , Ps), with Pi in the projective kernel of the projection from Q.
The claim now follows from Example 10.4.3. □
Corollary 10.4.5 Any projection πi from a multiprojective space Pa1 × · · · × Pas to
a product of some of its factors is a closed projective map.
10.5 The Veronese Map and the Segre Map

We introduce now two fundamental projective and multiprojective maps, which are the cornerstone, together with linear maps, of the construction of projective maps. The first map, the Veronese map, is indeed a generalization of the map built in Example 9.1.33.
We recall that a monomial is monic if its coefficient is 1.
Definition 10.5.1 Fix n, d and set N = \binom{n+d}{d} − 1. There are exactly N + 1 monic monomials of degree d in the n + 1 variables x0, . . . , xn. Let us call M0, . . . , MN these monomials, for which we fixed an order.
The Veronese map of degree d in Pn is the map vn,d : Pn → PN which sends a point [p0 : · · · : pn] to [M0(p0, . . . , pn) : · · · : MN(p0, . . . , pn)].
Notice that a change in the choice of the order of the monic monomials produces simply the composition of the Veronese map with a change of coordinates. After choosing an order of the variables, e.g., x0 < x1 < · · · < xn, a very popular order of the monic monomials is the order in which x0^d0 · · · xn^dn precedes x0^e0 · · · xn^en if at the smallest index i for which di ≠ ei we have di > ei. This order is called lexicographic order, because it reproduces the way in which words are listed in a dictionary. In Sect. 13.1 we will discuss different types of monomial orderings.
Notice that we can define an analogue of a Veronese map by choosing arbitrary (nonzero) coefficients for the monomials Mj's. This is equivalent to choosing a weight for the monomials. The resulting map has the same fundamental properties as our Veronese map, for which we chose to take all the coefficients equal to 1.
Remark 10.5.2 The Veronese maps are well defined, since for any P = [p0 : · · · : pn] ∈ Pn there exists an index i with pi ≠ 0, and among the monomials there exists the monomial M = xi^d, which satisfies M(p0, . . . , pn) = pi^d ≠ 0.
The Veronese map is injective. Indeed if P = [p0 : · · · : pn] and Q = [q0 : · · · : qn] have the same image, then the powers of the pi's and the qi's are equal, up to a scalar multiplication. Thus, up to a scalar multiplication, one may assume pi^d = qi^d for all i, so that qi = ei pi, for some choice of a d-th root of unity ei. If the ei's are not all equal to 1, then there exists a monic monomial M such that M(e0, e1, . . . , en) ≠ 1, thus M(p0, . . . , pn) ≠ M(q0, . . . , qn), which contradicts vn,d(P) = vn,d(Q).
Because of its injectivity, sometimes we will refer to a Veronese map as a Veronese
embedding.
The images of Veronese embeddings will be denoted as Veronese varieties.
Example 10.5.3 The Veronese map v1,3 sends the point [x0 : x1 ] of P1 to the point
[x03 : x02 x1 : x0 x12 : x13 ] ∈ P3 .
The Veronese map v2,2 sends the point [x0 : x1 : x2 ] ∈ P2 to the point [x02 : x0 x1 :
x0 x2 : x12 : x1 x2 : x22 ] (notice the lexicographic order).
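The images under a Veronese map are easy to generate by machine. Here is a minimal sketch in Python (the helper name veronese and the use of itertools to enumerate monomials in the lexicographic order are our own conventions):

```python
# Compute the Veronese image of a point, listing the monic monomials
# of degree d in lexicographic order.
from itertools import combinations_with_replacement
from sympy import symbols

def veronese(point, d):
    image = []
    # a monomial x_{i_1}...x_{i_d} with i_1 <= ... <= i_d
    for idx in combinations_with_replacement(range(len(point)), d):
        value = 1
        for i in idx:
            value *= point[i]
        image.append(value)
    return image

x0, x1, x2 = symbols('x0 x1 x2')
print(veronese([x0, x1], 3))      # [x0**3, x0**2*x1, x0*x1**2, x1**3]
print(veronese([x0, x1, x2], 2))  # [x0**2, x0*x1, x0*x2, x1**2, x1*x2, x2**2]
```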
Proposition 10.5.4 The image of a Veronese map is a projective subvariety of P N .
Proof We define equations for Y = vn,d(Pn).
Consider (n + 1)-tuples of nonnegative integers A = (a0, . . . , an), B = (b0, . . . , bn) and C = (c0, . . . , cn), with the following property:
(*) a0 + · · · + an = b0 + · · · + bn = c0 + · · · + cn = d and ai + bi ≥ ci for all i.
Define D = (d0, . . . , dn), where di = ai + bi − ci. Clearly d0 + · · · + dn = d. For any choice of A = (a0, . . . , an), the monic monomial x0^a0 · · · xn^an corresponds to a coordinate in PN. Call M A the coordinate corresponding to A. Define in the same way M B, M C and then also M D. The polynomial:

f ABC = M A M B − M C M D

is homogeneous of degree 2 and vanishes at the points of Y . Indeed for any Q =


vn,d ([α0 : · · · : αn ]) it is easy to see that:

M A M B = M C M D = α0a0 +b0 · · · αnan +bn ,


166 10 Projective Maps and the Chow’s Theorem

since any xk appears in M C M D with exponent ak + bk . It follows that Y is contained


in the projective subvariety W defined by the forms f ABC , when A, B, C vary in the
set of (n + 1)-tuples with the property (*) above.
To see that Y = W, take Q = [m0 : · · · : mN] ∈ W. Each mi corresponds to a monic monomial in the xj's, and we assume they are ordered in the lexicographic order.
First, we claim that at least one coordinate mi, corresponding to a power xi^d, must be nonzero. Indeed, on the contrary, assume that all the coordinates corresponding to powers vanish, and consider a minimal q such that the coordinate mq of Q is nonzero. Let mq correspond to the monomial x0^a0 · · · xn^an. Since mq is not a power, there are at least two indices j > i such that ai, aj > 0. Put A = (a0, . . . , an) = B and C = (c0, . . . , cn) where ci = ai + 1, cj = aj − 1 and ck = ak for k ≠ i, j. The (n + 1)-tuples A, B, C satisfy condition (*). One computes that D = (d0, . . . , dn) with di = ai − 1, dj = aj + 1 and dk = ak for k ≠ i, j. Moreover, since Q ∈ W, then:

M A (Q)M B (Q) = M C (Q)M D (Q).

But we have M A (Q) = M B (Q) = mq, while M C (Q) = 0, since x0^c0 · · · xn^cn precedes x0^a0 · · · xn^an in the lexicographic order. It follows mq^2 = 0, a contradiction.
Then at least one coordinate corresponding to a power is nonzero. Just to fix the
ideas, assume that m 0 , which corresponds to x0d in the lexicographic order, is different
from 0. After multiplying the coordinates of Q by 1/m 0 , we may assume m 0 = 1.
Then consider the coordinates corresponding to the monomials x0d−1 x1 , . . . x0d−1 xn .
In the lexicographic order, they turn out to be m 1 , . . . , m n , respectively. Put P = [1 :
m 1 : · · · : m n ] ∈ Pn . We claim that Q is exactly vn,d (P).
The claim means that for any coordinate m of Q, corresponding to the monomial x0^a0 · · · xn^an, we have m = m1^a1 · · · mn^an. We prove the claim by descending induction on a0. The cases a0 = d and a0 = d − 1 are clear by construction. Assume that the claim holds when a0 > d − s and take m such that a0 = d − s. In this case there exists some index j > 0 such that aj > 0. Put A = (a0, . . . , an), B = (d, 0, . . . , 0) and C = (c0, . . . , cn) where c0 = a0 + 1 = d − s + 1, cj = aj − 1, and ck = ak for k ≠ 0, j. The (n + 1)-tuples A, B, C satisfy condition (*). Thus

M A (Q)M B (Q) = M C (Q)M D (Q),

where D = (d0, . . . , dn) and d0 = d − 1, dj = 1, and dk = 0 for k ≠ 0, j. It follows by induction that M B (Q) = 1, M D (Q) = mj and

M C (Q) = m1^c1 · · · mn^cn = m1^a1 · · · mn^an / mj.

Then m = M A (Q) = M C (Q)M D (Q) = m1^a1 · · · mn^an, and the claim follows. □
We observe that all the forms M A M B − M C M D are quadratic forms in the vari-
ables Mi ’s of P N . Thus the Veronese varieties are defined in P N by quadratic equa-
tions.
Example 10.5.5 Consider the Veronese map v1,3 : P1 → P3. The monic monomials of degree three in two variables are (in lexicographic order):

M0 = x0^3, M1 = x0^2 x1, M2 = x0 x1^2, M3 = x1^3.
The equations for the image are obtained by the triples A, B, C satisfying condition (*) above. Up to trivialities, these triples are:

A = (3, 0), B = (0, 3), C = (1, 2), so D = (2, 1) and we get M0 M3 − M1 M2 = 0,
A = (3, 0), B = (1, 2), C = (2, 1), so D = (2, 1) and we get M0 M2 − M1^2 = 0,
A = (2, 1), B = (0, 3), C = (1, 2), so D = (1, 2) and we get M1 M3 − M2^2 = 0.
Consider the point Q = [3 : 6 : 12 : 24], which satisfies the previous equations. Since the coordinate m0 corresponding to x0^3 is equal to 3 ≠ 0, we divide the coordinates of Q by 3 and obtain Q = [1 : 2 : 4 : 8]. Then Q = v1,3([1 : 2]), as one can check directly.
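A quick numerical check of the computation above (a sketch; the helper name is ours):

```python
# Verify that Q = [3 : 6 : 12 : 24] satisfies the three quadrics and
# recover its preimage under v_{1,3}.
def twisted_cubic_eqs(M0, M1, M2, M3):
    return (M0*M3 - M1*M2, M0*M2 - M1**2, M1*M3 - M2**2)

Q = (3, 6, 12, 24)
print(twisted_cubic_eqs(*Q))        # (0, 0, 0)
Qn = [q / Q[0] for q in Q]          # normalize so that m0 = 1
print(Qn)                           # [1.0, 2.0, 4.0, 8.0] = v_{1,3}([1 : 2])
```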
Example 10.5.6 Equations for the image of v2,2 ⊂ P5 (the classical Veronese surface) are given by:

M0 M4 − M1 M2 = 0
M3 M2 − M1 M4 = 0
M5 M1 − M2 M4 = 0
M3 M5 − M4^2 = 0
M0 M5 − M2^2 = 0
M0 M3 − M1^2 = 0
The point Q = [0 : 0 : 0 : 1 : −2 : 4] satisfies the equations, and indeed it is equal to v2,2([0 : 1 : −2]). Notice that in this case, to recover the preimage of Q, one needs to replace x0 with x1 in the procedure of the proof of Proposition 10.5.4, since the coordinate corresponding to x0^2 is 0.
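Again, the verification is mechanical; a minimal sketch (variable names ours):

```python
# Check the six quadrics on Q = [0 : 0 : 0 : 1 : -2 : 4] and recover the
# preimage using the column of x1, since the coordinate of x0^2 vanishes.
M0, M1, M2, M3, M4, M5 = 0, 0, 0, 1, -2, 4
eqs = [M0*M4 - M1*M2, M3*M2 - M1*M4, M5*M1 - M2*M4,
       M3*M5 - M4**2, M0*M5 - M2**2, M0*M3 - M1**2]
print(eqs)                          # [0, 0, 0, 0, 0, 0]
# M3 = x1^2 is nonzero: normalize x1 = 1, then x0 = M1/M3, x2 = M4/M3.
print([M1 / M3, M3 / M3, M4 / M3])  # [0.0, 1.0, -2.0], i.e. [0 : 1 : -2]
```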
As a consequence of Proposition 10.5.4, one gets the following result.

Theorem 10.5.7 All the Veronese maps are closed in the Zariski topology.

Proof We need to prove that the image in vn,d of a projective subvariety of Pn is a projective subvariety of PN.
First notice that if F is a monomial of degree kd in the variables x0, . . . , xn of Pn, then it can be written (usually in several ways) as a product of k monomials of degree d in the xi's, which corresponds to a monomial of degree k in the coordinates M0, . . . , MN of PN. Thus, any form f of degree kd in the xj's can be rewritten as a form of degree k in the coordinates Mj's.
Take now a projective variety X ⊂ Pn and let f1, . . . , fs be homogeneous generators for the homogeneous ideal of X. Call ei the degree of fi and let ki d be the smallest multiple of d greater than or equal to ei. Then consider all the products xj^(ki d − ei) fi, j = 0, . . . , n. These products are homogeneous forms of degree ki d in the xj's. Moreover a point P ∈ Pn satisfies all the equations xj^(ki d − ei) fi = 0 if and only if it satisfies fi = 0, since at least one coordinate xj of P is nonzero.
With the procedure introduced above, transform arbitrarily each form xj^(ki d − ei) fi into a form Fij of degree ki in the variables Mj's. Then we claim that vn,d(X) is the subvariety of vn,d(Pn) defined by the equations Fij = 0. Since vn,d(Pn) is closed in PN, this will complete the proof.
Indeed let Q be a point of vn,d(X). The coordinates of Q are obtained by the coordinates of its preimage P = [p0 : · · · : pn] ∈ X ⊂ Pn by computing in P all the monomials of degree d in the xj's. Thus Fij(Q) = 0 for all i, j if and only if xj^(ki d − ei) fi(P) = 0 for all i, j, i.e., if and only if fi(P) = 0 for all i. The claim follows. □
Example 10.5.8 Consider the map v2,2 and let X be the line in P2 defined by the equation f = x0 + x1 + x2 = 0. Since f has degree 1, consider the products:

x0 f = x0^2 + x0 x1 + x0 x2,
x1 f = x0 x1 + x1^2 + x1 x2,
x2 f = x0 x2 + x1 x2 + x2^2.

They can be transformed, respectively, into the linear forms

M0 + M1 + M2, M1 + M3 + M4, M2 + M4 + M5.

Thus the image of X is the variety defined in P5 by the previous three linear forms and the six quadratic forms of Example 10.5.6, which define v2,2(P2).
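A symbolic check of this example (a sketch with sympy; the substitution x2 = −x0 − x1 restricts to the line X):

```python
# On the line x0 + x1 + x2 = 0, the three linear forms in the M's
# vanish identically on the Veronese image.
from sympy import symbols, expand

x0, x1 = symbols('x0 x1')
x2 = -x0 - x1                      # parametrize the line f = 0
M = [x0**2, x0*x1, x0*x2, x1**2, x1*x2, x2**2]   # v_{2,2}, lex order
forms = [M[0] + M[1] + M[2], M[1] + M[3] + M[4], M[2] + M[4] + M[5]]
print([expand(F) for F in forms])  # [0, 0, 0]
```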
Next, let us turn to the Segre embeddings.

Definition 10.5.9 Fix a1, . . . , an and N = (a1 + 1) · (a2 + 1) · · · (an + 1) − 1. There are exactly N + 1 monic monomials of multidegree (1, . . . , 1) (i.e., multilinear forms) in the variables x1,0, . . . , x1,a1, x2,0, . . . , x2,a2, . . . , xn,0, . . . , xn,an. Let us choose an order and denote with M0, . . . , MN these monomials.
The Segre map of a1, . . . , an is the map sa1,...,an : Pa1 × · · · × Pan → PN which sends a point P = ([p10 : · · · : p1a1], . . . , [pn0 : · · · : pnan]) to [M0(P) : · · · : MN(P)].
The map is well defined, since for any i = 1, . . . , n there exists pi ji ≠ 0, and among the monomials there is M = x1,j1 · · · xn,jn, which satisfies M(P) = p1j1 · · · pnjn ≠ 0.
Notice that when n = 1, then the Segre map is the identity.
Proposition 10.5.10 The Segre maps are injective.

Proof Make induction on n, the case n = 1 being trivial.
For the general case, assume that

P = ([p10 : · · · : p1a1], . . . , [pn0 : · · · : pnan]),
Q = ([q10 : · · · : q1a1], . . . , [qn0 : · · · : qnan])
have the same image. Fix indices such that p1j1, . . . , pnjn ≠ 0. The monomial M = x1,j1 · · · xn,jn does not vanish at P, hence also q1j1, . . . , qnjn ≠ 0.
Call α = q1j1/p1j1. Our first task is to show that α = q1i/p1i for i = 0, . . . , a1, so that [p10 : · · · : p1a1] = [q10 : · · · : q1a1]. Define β = (q2j2 · · · qnjn)/(p2j2 · · · pnjn). Then β ≠ 0 and:

αβ = (q1j1 · · · qnjn)/(p1j1 · · · pnjn).

Since P, Q have the same image in the Segre map, then for all i = 0, . . . , a1, the monomials Mi = x1,i x2,j2 · · · xn,jn satisfy:

αβ Mi(P) = Mi(Q).

It follows immediately αβ(p1i p2j2 · · · pnjn) = (q1i q2j2 · · · qnjn), so that α p1i = q1i for all i. Thus [p10 : · · · : p1a1] = [q10 : · · · : q1a1].
We can repeat the argument for the remaining factors [pi0 : · · · : piai] = [qi0 : · · · : qiai] of P, Q (i = 2, . . . , n), obtaining P = Q. □
Because of its injectivity, sometimes we will refer to a Segre map as a Segre
embedding.
The images of Segre embeddings will be denoted as Segre varieties.
Example 10.5.11 The Segre embedding s1,1 of P1 × P1 to P3 sends the point ([x0 :
x1 ], [y0 : y1 ]) to [x0 y0 : x0 y1 : x1 y0 : x1 y1 ].
The Segre embedding s1,2 of P1 × P2 to P5 sends the point ([x10 : x11 ], [x20 : x21 :
x22 ]) to the point:

[x10 x20 : x10 x21 : x10 x22 : x11 x20 : x11 x21 : x11 x22 ].

The Segre embedding s1,1,1 : P1 × P1 × P1 → P7 sends the point P = ([x10 : x11 ],


[x20 : x21 ], [x30 : x31 ]) to the point:

[x10 x20 x30 : x10 x20 x31 : x10 x21 x30 : x10 x21 x31 : x11 x20 x30 : x11 x20 x31 : x11 x21 x30 : x11 x21 x31 ].
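Segre images are products over all choices of one coordinate per factor; itertools.product enumerates them in the order used above. A minimal sketch (the helper name segre is ours):

```python
# Segre image of a tuple of (affine representatives of) projective points.
from itertools import product

def segre(*factors):
    image = []
    for idx in product(*[range(len(f)) for f in factors]):
        value = 1
        for f, i in zip(factors, idx):
            value *= f[i]
        image.append(value)
    return image

print(segre([1, 2], [3, 5]))           # [3, 5, 6, 10], a point of P^3
print(segre([1, 2], [3, 5], [7, 11]))  # a point of P^7
```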
Recall the general notation that with [n] we denote the set {1, . . . , n}.
Proposition 10.5.12 The image of a Segre map is a projective subvariety of P N .
Since the set of tensors of rank one corresponds to the image of a Segre map, the
proof of the proposition is essentially the same as the proof of Theorem 6.4.13. We
give the proof here, in the terminology of maps, for the sake of completeness.
Proof We define equations for Y = sa1,...,an(Pa1 × · · · × Pan).
For any n-tuple A = (α1, . . . , αn) define the form M A of multidegree (1, . . . , 1) as follows:

M A = x1,α1 · · · xn,αn.
Then consider any subset J ⊂ [n] and two n-tuples of nonnegative integers A = (α1, . . . , αn) and B = (β1, . . . , βn). Define C = C^J_{AB} as the n-tuple (γ1, . . . , γn) such that:

γi = αi if i ∈ J, γi = βi if i ∉ J.

Define D as D = C^{Jc}_{AB}, where Jc is the complement of J in [n]. Thus D = (δ1, . . . , δn), where:

δi = βi if i ∈ J, δi = αi if i ∉ J.
Consider the polynomials

f^J_{AB} = M A M B − M C M D.

Every f^J_{AB} is homogeneous of degree 2 in the coordinates of PN. We claim that the projective variety defined by the forms f^J_{AB}, for all possible choices of A, B, J as above, is exactly equal to Y.
One direction is simple. If:

Q = sa1,...,an([q10 : · · · : q1a1], . . . , [qn0 : · · · : qnan])

then it is easy to see that both M A M B (Q) and M C M D (Q) are equal to the product

q1α1 q1β1 q2α2 q2β2 · · · qnαn qnβn.

It follows that Y is contained in the projective subvariety W defined by the forms f^J_{AB}.
To see the converse, we make induction on the number n of factors. The claim is obvious if n = 1, for in this case the equations f^J_{AB} are trivial and the Segre map is the identity on Pa1.
Assume that the claim holds for n − 1 factors. Take Q = [m0 : · · · : mN] ∈ W. Each mi corresponds to a monic monomial x1i1 · · · xnin of multidegree (1, . . . , 1) in the xij's. Fix a coordinate m of Q different from 0. Just to fix the ideas, we assume that m corresponds to x10 · · · xn−1,0 xn0. If m corresponds to another multilinear form, the argument remains valid, it just requires heavier notation.
Consider the point Q′ obtained from Q by deleting all the coordinates corresponding to multilinear forms in which the last factor is not xn0. If we set N′ = (a1 + 1) · · · (an−1 + 1) − 1, then Q′ can be considered as a point in PN′; moreover the coordinates of Q′ satisfy all the equations f^{J′}_{A′B′} = 0, where A′, B′ are (n − 1)-tuples (α1, . . . , αn−1), (β1, . . . , βn−1) and J′ ⊂ [n − 1]. It follows by induction that Q′
corresponds to the image of some P′ ∈ Pa1 × · · · × Pan−1 in the Segre embedding in PN′.
Write P′ = ([p10 : · · · : p1a1], . . . , [pn−1,0 : · · · : pn−1,an−1]). Since m ≠ 0, i.e., the coordinate of Q′ corresponding to m is nonzero, then we must have pj0 ≠ 0 for all j = 1, . . . , n − 1. Let mi, i = 0, . . . , an, be the coordinate of Q corresponding to the multilinear form x10 · · · xn−1,0 xn,i. Then we prove that the coordinate m′′ corresponding to x1i1 · · · xnin satisfies

m′′ = (m_{in}/m) · p1i1 · · · pn−1,in−1.

This will prove that Q is the image of the point:

P = ([p10 : · · · : p1a1], . . . , [pn−1,0 : · · · : pn−1,an−1], [pn0 : · · · : pnan]),

where pni = mi/m for all i = 0, . . . , an.
To prove the claim, take A = (i1, . . . , in), B = (0, . . . , 0) and J = {1, . . . , n − 1}. Then we have γk = αk and δk = 0 for k = 1, . . . , n − 1, while γn = 0 and δn = in. Thus M A (Q) = m′′, M B (Q) = m, M D (Q) = m_{in} and, by induction, M C (Q) = p1i1 · · · pn−1,in−1. Since M A (Q)M B (Q) = M C (Q)M D (Q), the claim follows. □
We observe that all the forms M A M B − M C M D are quadratic forms in the vari-
ables Mi ’s of P N . Thus the Segre varieties are defined in P N by quadratic equations.
Example 10.5.13 Consider the Segre embedding s1,1 : P1 × P1 → P3 . The 4 vari-
ables M0 , M1 , M2 , M3 in P3 correspond, respectively, to the multilinear forms

M0 = x10 x20 , M1 = x10 x21 , M2 = x11 x20 , M3 = x11 x21 .

If we take A = (0, 0), B = (1, 1) and J = {1}, we get that C = (0, 1), D = (1, 0).
Thus M A corresponds to x10 x20 = M0 , M B corresponds to x11 x21 = M3 , M C cor-
responds to x10 x21 = M1 and M D corresponds to x11 x20 = M2 . We get thus the
equation:
M0 M3 − M1 M2 = 0.
The other choices for A, B, J yield either trivialities or the same equation.
Hence the image of s1,1 is the variety defined in P3 by the equation M0 M3 −
M1 M2 = 0. It is a quadric surface (see Fig. 10.1).
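A one-line numerical check of this example (data of our own choosing):

```python
# The Segre image of any pair of points satisfies M0*M3 - M1*M2 = 0.
a, b = [1, 2], [3, 5]                       # points of P^1 x P^1
M0, M1, M2, M3 = [ai * bj for ai in a for bj in b]
print(M0 * M3 - M1 * M2)                    # 0
```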
Example 10.5.14 Equations for the image of s1,2 ⊂ P5 (up to trivialities) are given by:

M0 M4 − M1 M3 = 0 for A = (0, 0), B = (1, 1), J = {1},
M0 M5 − M2 M3 = 0 for A = (0, 0), B = (1, 2), J = {1},
M5 M1 − M2 M4 = 0 for A = (0, 1), B = (1, 2), J = {1},

where M0 = x10 x20, M1 = x10 x21, M2 = x10 x22, M3 = x11 x20, M4 = x11 x21, M5 = x11 x22.
Fig. 10.1 Segre embedding of P1 × P1 in P3
Example 10.5.15 We can give a more direct representation of the equations defining the Segre embedding of the product of two projective spaces Pa1 × Pa2.
Namely, we can plot the coordinates of Q ∈ PN in a (a1 + 1) × (a2 + 1) matrix, putting in the entry ij the coordinate corresponding to x1,i−1 x2,j−1.
Conversely, any matrix of type (a1 + 1) × (a2 + 1) (except for the null matrix) corresponds uniquely to a set of coordinates for a point Q ∈ PN. Thus we can identify PN with the projective space over the linear space of matrices of type (a1 + 1) × (a2 + 1) over C.
In this identification, the choice of A = (i, j), B = (k, l) and J = {1} (choosing J = {2} we get the same equation, up to the sign) produces a form equivalent to the 2 × 2 minor:

mij mkl − mil mkj

of the matrix.
Thus, the image of a Segre embedding of two projective spaces can be identified with the set of matrices of rank 1 (up to scalar multiplication) in a projective space of matrices.
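The matrix point of view is easy to test numerically; a sketch using numpy (variable names ours):

```python
# Segre coordinates of a pair of points arranged as a matrix: the
# result is a rank-1 matrix, and all its 2x2 minors vanish.
import numpy as np

a = np.array([1.0, 2.0])          # point of P^1
b = np.array([3.0, 5.0, 7.0])     # point of P^2
Q = np.outer(a, b)                # (a1+1) x (a2+1) matrix of coordinates
print(np.linalg.matrix_rank(Q))   # 1
print(Q[0,0]*Q[1,1] - Q[0,1]*Q[1,0],  # 2x2 minors: both 0
      Q[0,0]*Q[1,2] - Q[0,2]*Q[1,0])
```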
As a consequence of Proposition 10.5.12, one gets the following result.

Theorem 10.5.16 All the Segre maps are closed in the Zariski topology.
Proof We need to prove that the image in sa1,...,an of a multiprojective subvariety X of V = Pa1 × · · · × Pan is a projective subvariety of PN.
First notice that if F is a monomial of multidegree (d, . . . , d) in the variables xij of V, then it can be written (usually in several ways) as a product of d multilinear forms in the xij's, which corresponds to a monomial of degree d in the coordinates M0, . . . , MN of PN. Thus, any form f of multidegree (d, . . . , d) in the xij's can be rewritten as a form of degree d in the coordinates Mj's.
Take now a projective variety X ⊂ V and let f1, . . . , fs be multihomogeneous generators for the ideal of X. Call (dk1, . . . , dkn) the multidegree of fk and let dk = max{dk1, . . . , dkn}. Consider all the products x_{1j1}^(dk−dk1) · · · x_{njn}^(dk−dkn) fk. These products are multihomogeneous forms of multidegree (dk, . . . , dk) in the xij's. Moreover a point
P ∈ V satisfies all the equations x_{1j1}^(dk−dk1) · · · x_{njn}^(dk−dkn) fk = 0 if and only if it satisfies fk = 0, since for all i at least one coordinate xij of P is nonzero.
With the procedure introduced above, transform arbitrarily each form x_{1j1}^(dk−dk1) · · · x_{njn}^(dk−dkn) fk into a form F_{k,j1,...,jn} of degree dk in the variables of PN. Then we claim that sa1,...,an(X) is the subvariety of sa1,...,an(V) defined by the equations F_{k,j1,...,jn} = 0. Since sa1,...,an(V) is closed in PN, this will complete the proof.
To prove the claim, let Q be a point of sa1,...,an(X). The coordinates of Q are obtained by the coordinates of its preimage P = ([p10 : · · · : p1a1], . . . , [pn0 : · · · : pnan]) ∈ X ⊂ V by computing all the multilinear forms in the xij's at P. Thus F_{k,j1,...,jn}(Q) = 0 for all k, j1, . . . , jn if and only if fk(P) = 0 for all k. The claim follows. □
Example 10.5.17 Consider the variety X in P1 × P1 defined by the multihomogeneous form f = x10 x21^2 + x11 x20^2 of multidegree (1, 2). Then we have:

x10 f = x10^2 x21^2 + x10 x11 x20^2 = (x10 x21)^2 + (x10 x20)(x11 x20) = M1^2 + M0 M2,
x11 f = x10 x11 x21^2 + x11^2 x20^2 = (x10 x21)(x11 x21) + (x11 x20)^2 = M1 M3 + M2^2.

These two forms, together with the form M0 M3 − M1 M2 that defines s1,1(P1 × P1) in P3, define the image of X in the Segre embedding.
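The two rewritings above can be confirmed symbolically; a sketch with sympy:

```python
# Check that x10*f and x11*f agree with the forms in the M's after
# substituting M0 = x10*x20, M1 = x10*x21, M2 = x11*x20, M3 = x11*x21.
from sympy import symbols, expand

x10, x11, x20, x21 = symbols('x10 x11 x20 x21')
f = x10*x21**2 + x11*x20**2
M0, M1, M2, M3 = x10*x20, x10*x21, x11*x20, x11*x21
print(expand(M1**2 + M0*M2 - x10*f))   # 0
print(expand(M1*M3 + M2**2 - x11*f))   # 0
```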
Remark 10.5.18 Even if we take a minimal set of forms fk's that define X ⊂ Pa1 × · · · × Pan, with the procedure of Theorem 10.5.16 we do not find, in general, a minimal set of forms that define sa1,...,an(X).
Indeed the ideal generated by the forms F_{k,j1,...,jn} constructed in the proof of Theorem 10.5.16 need not, in general, be radical or even saturated.
We end this section by pointing out a relation between the Segre and the Veronese
embeddings of projective and multiprojective spaces.
Definition 10.5.19 A multiprojective space Pa1 × · · · × Pan is cubic if ai = a for
all i.
We can embed Pa into the cubic multiprojective space Pa × · · · × Pa (n times) by sending each point P to (P, . . . , P). We will refer to this map as the diagonal embedding. It is easy to see that the diagonal embedding is an injective multiprojective map.
Example 10.5.20 Consider the cubic product P1 × P1 and the diagonal embedding
δ : P1 → P1 × P1 .
The point P = [ p0 : p1 ] of P1 is mapped to ([ p0 : p1 ], [ p0 : p1 ]) ∈ P1 × P1 .
Thus the Segre embedding of P1 × P1 , composed with δ, sends P to the point [ p02 :
p0 p1 : p1 p0 : p12 ] ∈ P3 .
We see that the coordinates of the image have a repetition: the second and the third
coordinates are equal, due to the commutativity of the product of complex numbers.
In other words the image s1,1 ◦ δ(P1 ) satisfies the linear equation M1 − M2 = 0 in
P3 .
We can get rid of the repetition if we project P3 → P2 by forgetting the third coordinate, i.e., by taking the map C4 → C3 that maps (M0, M1, M2, M3) to (M0, M1, M3). The projective kernel of this map is the point [0 : 0 : 1 : 0], which does not belong to s1,1 ◦ δ(P1), since P = [p0 : p1] ∈ P1 cannot have p0^2 = p1^2 = 0. Thus we obtain a well defined projection π : s1,1 ◦ δ(P1) → P2.
The composition π ◦ s1,1 ◦ δ : P1 → P2 corresponds to the map which sends [p0 : p1] to [p0^2 : p0 p1 : p1^2]. In other words π ◦ s1,1 ◦ δ is the Veronese embedding v1,2 of P1 in P2.
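A short symbolic check of the composition (a sketch; names ours):

```python
# s_{1,1} composed with the diagonal, then forgetting the repeated
# coordinate, reproduces the Veronese map v_{1,2}.
from sympy import symbols

p0, p1 = symbols('p0 p1')
s_diag = [p0*p0, p0*p1, p1*p0, p1*p1]    # s_{1,1}(delta([p0 : p1]))
print(s_diag[1] == s_diag[2])            # True: the repetition M1 = M2
print([s_diag[0], s_diag[1], s_diag[3]]) # [p0**2, p0*p1, p1**2] = v_{1,2}
```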
The previous example generalizes to any cubic Segre product.

Theorem 10.5.21 Consider a cubic multiprojective space Pn × · · · × Pn, with r > 1 factors. Then the Veronese embedding vn,r of degree r corresponds to the composition of the diagonal embedding δ, the Segre embedding sn,...,n and one projection.
Proof For any P = [p0 : · · · : pn] ∈ Pn the point sn,...,n ◦ δ(P) has repeated coordinates. Indeed for any permutation σ on [r] the coordinate corresponding to x1i1 · · · xrir of sn,...,n ◦ δ(P) is equal to pi1 · · · pir, hence it is equal to the coordinate corresponding to x1iσ(1) · · · xriσ(r). To get rid of these repetitions, we can consider coordinates corresponding to multilinear forms x1i1 · · · xrir that satisfy:
(**) i1 ≤ i2 ≤ · · · ≤ ir.
By easy combinatorial computations, the number of these forms is equal to \binom{n+r}{r}. Forgetting the variables corresponding to multilinear forms that do not satisfy condition (**) is equivalent to taking a projection φ : CN+1 → CN′+1, where N′ + 1 = \binom{n+r}{r}. The kernel of this projection is the set of (N + 1)-tuples in which the coordinates corresponding to multilinear forms that satisfy (**) are all zero. Among these coordinates there are those for which i1 = i2 = · · · = ir = i, i = 0, . . . , n. So sn,...,n ◦ δ(P) cannot meet the projective kernel of φ, because that would imply p0^r = · · · = pn^r = 0.
Thus φ ◦ sn,...,n ◦ δ(P) is well defined for all P ∈ Pn. The coordinate of sn,...,n ◦ δ(P) corresponding to x1i1 · · · xrir is equal to p0^d0 · · · pn^dn, where, for i = 0, . . . , n, di is the number of times that i appears among i1, . . . , ir. Then d0 + · · · + dn = r.
It is clear then that computing φ ◦ sn,...,n ◦ δ(P) corresponds to computing (once) in P all the monomials of degree r in x0, . . . , xn. □
10.6 The Chow's Theorem

We prove in this section the Chow's theorem: every projective or multiprojective map is closed in the Zariski topology.

Proposition 10.6.1 Every projective map f : Pn → Pm factors through a Veronese map, a change of coordinates and a projection.
10.6 The Chow’s Theorem 175

Proof By Proposition 9.3.2, there are homogeneous polynomials f 0 , . . . , f m ∈


C[x0 , . . . , xn ] of the same degree d, which do not vanish simultaneously at any
point P ∈ Pn , and such that f is defined by the f j ’s. Each f j is a linear combination
of monic monomials of degree d. Hence, there exists a change of coordinates g in
the target space P N of vn,d such that f is equal to vn,d followed by g and by the
projection to the first m + 1 coordinates. Notice that since ( f 0 (P), . . . , f m (P)) = 0
for all P ∈ Pn , then the projection is well defined on the image of g ◦ vn,d . 
A similar procedure holds to describe a canonical decomposition of multiprojec-
tive maps.
Proposition 10.6.2 Every multiprojective map f : Pa1 × · · · × Pan → P N factors
through Veronese maps, a Segre map, a change of coordinates and a projection.
Proof By Proposition 9.3.11, there are multihomogeneous polynomials f1, . . . , fs in the ring C[x1,0, . . . , x1,a1, . . . , xn,0, . . . , xn,an] of the same multidegree (d1, . . . , dn), which do not vanish simultaneously at any point P ∈ Pa1 × · · · × Pan, and such that f is defined by the fj's. Each fj is a linear combination of products of monic monomials, of degrees d1, . . . , dn, in the sets of coordinates (x1,0, . . . , x1,a1), . . . , (xn,0, . . . , xn,an), respectively. If vai,di denotes the Veronese embedding of degree di of Pai into the corresponding space PNi, then f factors through va1,d1 × · · · × van,dn followed by a multiprojective map F : PN1 × · · · × PNn → PN, which in turn is defined by multihomogeneous polynomials F1, . . . , Fs of multidegree (1, . . . , 1) (multilinear forms). Each Fj is a linear combination of products of n coordinates, one from each of the sets of coordinates of PN1, . . . , PNn. Hence F factors through a Segre map sN1,...,Nn, followed by a change of coordinates in PM, M = (N1 + 1) · · · (Nn + 1) − 1, which sends the linear forms associated to the Fj's to the first s coordinates, and then followed by a projection to the first s coordinates. □
Now we are ready to state and prove the Chow’s Theorem.
Theorem 10.6.3 (Chow’s Theorem) Every projective map f : Pn → P N is Zariski
closed, i.e., the image of a projective subvariety is a projective subvariety.
Every multiprojective map f : Pa1 × · · · × Pan → P M is Zariski closed.
Proof In view of the two previous propositions, this is just an obvious consequence
of Theorems 10.5.7 and 10.5.16. 
We will see, indeed, in Corollary 11.3.7, that the conclusion of Chow’s Theorem
holds for any projective map f : X → Y between any projective varieties.
Example 10.6.4 Let us consider the projective map f : P1 → P2 defined by

f(x1, x2) = (x1^3, x1^2 x2 − x1 x2^2, x2^3).

We can decompose f as the Veronese map v1,3, followed by the linear isomorphism g(a, b, c, d) = (a, b − c, c − d, d) and then followed by the projection π to the first, second and fourth coordinates.
Namely:

(π ◦ g ◦ v1,3)(x1, x2) = (π ◦ g)(x1^3, x1^2 x2, x1 x2^2, x2^3) =
= π(x1^3, x1^2 x2 − x1 x2^2, x1 x2^2 − x2^3, x2^3) = (x1^3, x1^2 x2 − x1 x2^2, x2^3).
The image of f is a projective curve in P2, whose equation can be obtained by elimination theory. One can see that, in the coordinates z0, z1, z2 of P2, f(P1) is the zero locus of

z1^3 − z0 z2 (z0 − 3 z1 − z2).
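The equation of the image can be verified symbolically; a minimal sketch with sympy:

```python
# Substitute the parametrization of f into the claimed equation of f(P^1).
from sympy import symbols, expand

x1, x2 = symbols('x1 x2')
z0, z1, z2 = x1**3, x1**2*x2 - x1*x2**2, x2**3
print(expand(z1**3 - z0*z2*(z0 - 3*z1 - z2)))   # 0
```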
Example 10.6.5 Let us consider the subvariety Y of P1 × P1, defined by the multihomogeneous polynomial f = x0 − x1, of multidegree (1, 0) in the coordinates (x0, x1), (y0, y1) of P1 × P1. Y corresponds to [1 : 1] × P1.
Take the Segre embedding s : P1 × P1 → P3,

s((x0, x1), (y0, y1)) = (x0 y0, x0 y1, x1 y0, x1 y1).

Then the image s(P1 × P1) corresponds to the quadric Q in P3 defined by the vanishing of the homogeneous polynomial g = z0 z3 − z1 z2.
The image of Y is a projective subvariety of P3, which is contained in Q, but it is not defined by g together with a single additional polynomial: we need two polynomials, other than g.
Namely, Y is defined also by the two multihomogeneous polynomials, of multidegree (1, 1), f0 = f y0 = x0 y0 − x1 y0 and f1 = f y1 = x0 y1 − x1 y1. Thus s(Y) is defined in P3 by g, g0 = z0 − z2, g1 = z1 − z3. (Indeed, in this case, g0, g1 alone are sufficient to determine s(Y), which is a line.)
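Again a quick symbolic check (a sketch; we parametrize Y by setting x1 = x0):

```python
# On Y = {x0 = x1}, the Segre image satisfies g, g0 and g1.
from sympy import symbols, expand

x0, y0, y1 = symbols('x0 y0 y1')
z = [x0*y0, x0*y1, x0*y0, x0*y1]        # s((x0, x0), (y0, y1))
g, g0, g1 = z[0]*z[3] - z[1]*z[2], z[0] - z[2], z[1] - z[3]
print([expand(g), g0, g1])              # [0, 0, 0]
```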
Other examples and applications are contained in the exercise section.

10.7 Exercises
Exercise 31 Recall that a map between topological spaces is closed if it sends closed
sets to closed sets.
Prove that the composition of closed maps is a closed map, the product of closed
maps is a closed map, and the restriction of a closed map to a closed set is itself a
closed map.
Exercise 32 Given a linear surjective map φ : Cm+1 → Cn+1 and a subvariety X ⊂ Pm which does not meet Kφ, find the polynomials that define the projection of X from Kφ, in terms of the matrix associated to φ.
Exercise 33 Let f, g be nonzero polynomials in C[x], with f constant. Prove that the resultant R(f, g) is nonzero.
Exercise 34 Consider the Veronese map P1 → P3 defined in Example 10.5.5 and call X the image. Show that the three equations for X found in Example 10.5.5 are nonredundant: for any choice of two of them, there exists a point Q which satisfies the two equations but does not belong to X.
Reference

1. Walker, R.J.: Algebraic Curves. Princeton University Press, Princeton (1950)
Chapter 11
Dimension Theory
The concept of dimension of a projective variety is a fairly intuitive but surprisingly delicate invariant, from an algebraic point of view.
If one considers projective varieties over C with their natural structure of complex or holomorphic varieties, then the algebraic definition of dimension coincides with the usual (complex) dimension.
On the other hand, for many purposes, it is necessary to deal with the concept from a completely algebraic point of view, so the definition of dimension that we give below is fundamental for our analysis.
The point of view that we take is mainly concerned with a geometric, projective definition, though at a certain point, for the sake of completeness, we cannot avoid invoking some deeper algebraic results.
11.1 Complements on Irreducible Varieties

The first step is rather technical: we need some algebraic properties of irreducible varieties. We recall that the definition of irreducible topological spaces, together with examples, can be found in Definition 9.1.27 of Chap. 9.
So, from now on, dealing with projective varieties, we will always refer to reducibility or irreducibility with respect to the induced Zariski topology.
Let us start with a characterization of irreducible varieties, in terms of the associated homogeneous ideal (see Corollary 9.1.15).
Definition 11.1.1 An ideal J of a polynomial ring R = C[x0, . . . , xN] is a prime ideal if f1 f2 ∈ J implies that either f1 ∈ J or f2 ∈ J.
Equivalently, J is prime if and only if the quotient ring R/J is a domain, i.e., for a, b ∈ R/J, ab = 0 implies that either a = 0 or b = 0.
Proposition 11.1.2 Let Y ⊂ PN be a projective variety and call J the homogeneous ideal defined by Y. Then Y is irreducible if and only if J is a prime ideal.

Proof Assume Y = Y1 ∪ Y2, where the Yi's are proper closed subsets. Then there exist polynomials f1, f2 such that fi vanishes on Yi but not on Y. Thus f1, f2 ∉ J, while f1 f2 vanishes at any point of Y, i.e., f1 f2 ∈ J.
The previous argument can be inverted to show that the existence of f1, f2 ∉ J such that f1 f2 ∈ J implies that Y is reducible. □
Definition 11.1.3 Let Y ⊂ PN be an irreducible projective variety and let J ⊂ C[x0, . . . , xN] be its homogeneous ideal. Then J is a prime ideal and RY = C[x0, . . . , xN]/J is a domain. So, one can construct the quotient field k(RY) as the field of all quotients {a/b : a, b ∈ RY, b ≠ 0}, where a/b = a′/b′ if and only if ab′ = a′b.
We call k(RY) the projective function field of Y and we will indicate this field with KY.
Example 11.1.4 The space PN itself is defined by the ideal J = (0), thus the projective function field of PN is the field of fractions C(x0, . . . , xN).
11.2 Dimension

There are several definitions of dimension of an irreducible variety. All of them have some difficult aspect. In some cases, it is laborious even to prove that the definition makes sense. For most approaches, it is not immediate to see that the naïve geometric notion of dimension corresponds to the algebraic notion.
Our choice is to make use, as far as possible, of the geometric approach, entering deeply into the algebraic background only to justify some computational aspects.
The final target is the theorem on the dimension of general fibers (Theorem 11.3.5), which makes it possible to compute the dimension in most applications.
Definition 11.2.1 Given a projective map f : X → Y we call fiber of f over the point P ∈ Y the inverse image f −1(P).
Remark 11.2.2 Since projective maps are continuous in the Zariski topology, and
singletons are projective varieties, then the fiber over any point P ∈ Y is closed in
the Zariski topology, hence it is a projective variety.
Proposition 11.2.3 For every projective variety X ⊊ PN there exists a linear subspace L ⊂ PN, not intersecting X, such that the projection of X from L to a linear subspace L′ is surjective, with finite fibers.
Proof We make induction on N. If N = 1 the claim follows immediately, since in this case X is a finite set.
For N > 1, fix a point P ∉ X. A change of coordinates will not make any difference for our claim, so we assume P = [1 : 0 : · · · : 0]. The projection π with center P
maps X to a subvariety of the hyperplane L′ = PN−1 defined by x0 = 0 (see Example 10.1.5). Since the fibers of π are closed subvarieties of a line and do not contain the point P of the line, then the fibers of π are finite. By induction we know the existence of a surjective projection φ of π(X) from a linear subspace L0 ⊂ PN−1 \ π(X) to a linear subspace, such that φ has finite fibers. The composition φ ◦ π is surjective, has finite fibers and corresponds to the projection of X from the span L of L0 and P. Notice that L cannot intersect X, since L0 does not intersect π(X). □
Now we are ready for the definition of dimension. Notice that we already have the notion of projective dimension r of a linear subspace of PN, which corresponds to the projectivization of a linear subspace (of dimension r + 1) of the vector space CN+1.

Definition 11.2.4 We say that an irreducible projective variety X ⊂ PN has dimension n if there exists a linear subspace L ⊂ PN of (projective) dimension N − n − 1, which does not meet X, such that the projection with center L, which maps X to a linear subspace L′ of dimension n, is surjective, with finite fibers.
We assign dimension −1 to the empty set.
Also, when X = PN, we consider it valid to take L = ∅ and the projection equal to the identity. This implies that PN has dimension N.
In the rest of the section, since by elementary linear algebra any linear subspace L′ of PN of dimension n is isomorphic to Pn, by abuse we will consider the projection X → L′ as a map X → Pn.
It is clear that two projective varieties which are isomorphic under a change of coordinates share the same dimension.
Example 11.2.5 Since P0 has just one point, clearly singletons have a surjective projection to P0, with finite fibers. So singletons have dimension 0.
Finite projective varieties are reducible, unless they are singletons (see Exercise 26). Thus, by definition, singletons are the only irreducible projective varieties of dimension 0.
Example 11.2.6 The linear subspace L in PN defined by xn+1 = · · · = xN = 0 is isomorphic to Pn, and the projection from the subspace L′ defined by x0 = · · · = xn = 0 maps L isomorphically to itself. Thus L has dimension n.
Base changes provide isomorphisms between L and any linear variety of projective dimension n. Thus, for linear subspaces, the projective dimension agrees with the dimension, as defined above.
Example 11.2.7 Let C be a projective variety in P2, defined by the vanishing of one irreducible homogeneous polynomial g = 0. Then C is an irreducible variety (Example 9.1.37) and it contains infinitely many points (Example 9.1.13).
By Lemma 9.1.5, there exists a point P0 ∈ P2 which does not belong to C. The projection π from P0 maps C to P1. Every fiber of π is a proper projective subvariety of a line, since it cannot contain P0. Since the Zariski topology on a line is the cofinite topology (Example 9.1.12), then every fiber of π is finite. The image of π, which is
a projective subvariety of P1 (by the Chow's Theorem), cannot be finite (since C is infinite), so it coincides with P1. Hence π is surjective.
We just proved that C has dimension 1.
The following result, on the structure of projections, will be useful to produce inductive arguments for the dimension of a variety.
Proposition 11.2.8 Fix a linear space L ⊂ PN, which does not meet a variety X ⊂ PN, and fix a linear subspace L′, disjoint from L, such that dim(L′) + dim(L) = N − 1. Fix a point P ∈ L, a linear subspace M ⊂ L of dimension dim(L) − 1 and disjoint from P, and a hyperplane H containing L′ and disjoint from P.
Then the projection φ of X from L to L′ is equal to the projection φP of X from P to H, followed by the projection φ′ (in H = PN−1) of φP(X) from φP(M) to L′.
Proof A point Q ∈ X is sent by φP to the intersection of the line QP with H. Similarly, since the span of M and P is L, then M is sent by φP to the intersection L ∩ H. In turn, φP(Q) is sent by φ′ to the intersection of L′ with the span of φP(Q) and φP(M) in H.
Since, by elementary linear algebra, the span of the line QP and L ∩ H is equal to the intersection of H with the span of L and Q, the claim follows. □
By now, we do not know yet that the dimension of an irreducible projective variety is uniquely defined, since we did not exclude the existence of two different surjective projections of X to two linear subspaces of different dimensions, both with finite fibers.
It is not easy to face the problem directly. Instead, we show that the existence of a map X → Pn with finite fibers is related to a numerical invariant of the irreducible variety X.
The invariant which defines the dimension is connected with a notion in the algebraic theory of field extensions: the transcendence degree. We recall some basic facts in the following remark. For the proofs, we refer to section II of the book [1] (see also some exercises at the end of the chapter).
Remark 11.2.9 (Field extensions) Let K1, K2 be fields, with a nonzero homomorphism φ : K1 → K2. Then φ is injective, since the kernel is an ideal of K1, hence it is trivial. So we can consider φ as an inclusion which realizes K2 as an extension of K1.
The extension is algebraic when K2 is finitely generated as a vector space over K1. Otherwise the extension is transcendent. If K2 is an algebraic extension of K1, then for any e ∈ K2 the powers e, e^2, e^3, . . . , e^h, . . . become eventually linearly dependent over K1. This means that there exists a polynomial p(x), with coefficients in K1, such that p(e) = 0.
Conversely, given any extension K2 of K1, for any e ∈ K2 define K1(e) as the minimal subfield of K2 which contains K1 and e. We say that e is an algebraic element over K1 if K1(e) is an algebraic extension of K1, otherwise e is a transcendental element.
The set of all the elements of K2 which are algebraic over K1 is a field K′ ⊃ K1. We call K′ the algebraic closure of K1 in K2. If K′ = K1, we say that K1 is algebraically closed in K2. In this case any element of K2 \ K1 is transcendent over K1. A field K1 is algebraically closed if any non-trivial extension K2 of K1 contains only transcendental elements, i.e., if K1 is algebraically closed in any extension. C is the most popular example of an algebraically closed field.
If K2 = K1(x) is the field of fractions of the polynomial ring K1[x], then K2 is a transcendent extension. Conversely, if e is any transcendental element over K1, then K1(e) is isomorphic to K1(x).
A set of elements e1, . . . , en ∈ K2 such that for all i, ei is transcendent over K1(e1, . . . , ei−1), and K2 is an algebraic extension of K1(e1, . . . , en), is a transcendence basis of the extension. All the transcendence bases have the same number of elements, which is called the transcendence degree of the extension.
If K2 has transcendence degree d over K1 and K3 is an algebraic extension of K2, then K3 has transcendence degree d over K1.
Proposition 11.2.10 A surjective projection φ : X → Pn determines an inclusion κφ of the field KPn = C(x0, . . . , xn) into the projective function field KX (see Definition 11.1.3).
Proof Assume in the proof that X ⊂ Pm.
The map φ is defined by n + 1 homogeneous polynomials F0, . . . , Fn of the same degree d in the coordinates of Pm, by Proposition 9.3.2. For any element f/g ∈ C(x0, . . . , xn) define F = f(F0, . . . , Fn) and G = g(F0, . . . , Fn). Notice that we cannot have G(P) = 0 for all P ∈ X. Indeed this would imply g(φ(P)) = 0 for all P ∈ X, hence g(Q) = 0 for all Q ∈ Pn, since φ is surjective. By Lemma 9.1.5 this implies g = 0, a contradiction.
It follows that the quotient q of the equivalence classes of F and G is a well defined element of KX. We define κφ(f/g) = q.
Notice that q is 0 only if F vanishes on every point of X. As above this implies that f vanishes on every point of Pn, i.e., f = 0. Thus there are elements f/g ∈ C(x0, . . . , xn) such that κφ(f/g) ≠ 0, i.e., κφ is not the zero map. Thus κφ is injective. □
From now on, when there exists a surjective projective map φ : X → Pn we will identify C(x0, . . . , xn) with the subfield κφ(C(x0, . . . , xn)) of KX.
Theorem 11.2.11 Assume there exists a surjective projection φ : X → Pn from the irreducible variety X ⊂ PN to a linear space Pn, with finite fibers. Then the quotient field KX is an algebraic extension of KPn = C(x0, . . . , xn).
Proof Assume that the map has finite fibers. We will indeed prove that the class of any variable xi in the quotient ring RX is algebraic over C(x0, . . . , xn). Since these classes generate the quotient field of RX, the claim will follow.
First assume that N = n + 1, so that φ is the projection from a point. We may also assume, after a change of coordinates, that φ is the projection from P = [0 : · · · : 0 : 1] ∉ X.
Consider the homogeneous ideal J of X and an element g ∈ J such that g(P) ≠ 0. Write g as a polynomial in xN = xn+1 with coefficients in C[x0, . . . , xn]:

g = ad xn+1^d + ad−1 xn+1^(d−1) + · · · + a0,

where each ai = ai(x0, . . . , xn) is a polynomial in C[x0, . . . , xn]. We cannot have d = 0, otherwise g ∈ C[x0, . . . , xn] and g would vanish at P. Since the class of g vanishes in the quotient ring RX, we get that the class of xn+1 is algebraic over C(x0, . . . , xn). As the classes of x0, . . . , xn are clearly algebraic over C(x0, . . . , xn), we are done in this case.
Then make induction on N − n. Let φ be the projection from L and fix a point P ∈ L. We may assume that P = [0 : · · · : 0 : 1] ∉ X. The projection φ factorizes through the projection φP from P to PN−1 followed by the projection from φP(L) (see Proposition 11.2.8). By induction we know that the classes of x0, . . . , xN−1 are algebraic over C(x0, . . . , xn). Arguing as above, we get that xN is algebraic over C(x0, . . . , xN−1). This concludes the proof. □
We can now prove that our definition of dimension is unambiguous.

Corollary 11.2.12 Let X ⊂ PN be an irreducible variety. If there exists a surjective projection φ : X → Pn with finite fibers, then the transcendence degree of the quotient field of X is n.
In particular, if m ≠ n, then one cannot have a surjective projection φ′ : X → Pm with finite fibers.
Proof The second statement follows since the transcendence degree of the quotient field of X does not depend on the projections φ, φ′. □

For reducible varieties, the definition of dimension is a straightforward extension of the definition for irreducible varieties.
Definition 11.2.13 Let X1, . . . , Xm be the irreducible components of a variety X (we recall that, by Theorem 9.1.32, the number of irreducible components of X is finite). Then we define:

dim(X) = max{dim(Xi)}.
From the definition of dimension and its characterization in terms of field extensions, one can prove the following properties.
Proposition 11.2.14 Let X ⊂ PN be any variety and let X′ be a subvariety of X. Let φ be a projection of X to some linear subspace, whose fibers are finite. Then:
(a) dim(X) = dim(φ(X)).
(b) dim(X′) ≤ dim(X).
(c) In particular dim(X) ≤ N.
(d) If X ⊂ PN has dimension N, then X = PN.
Proof The first claim is clear: fix an irreducible component Y of X and take a projection φ′ of φ(Y) to some linear space Pn, whose fibers are finite (thus n = dim(φ(Y))). Then the composition φ′ ◦ φ maps Y to Pn with finite fibers, so that n = dim(Y).
The proof of (c) is straightforward from the definition of dimension. Then (b) follows since a surjective projection φ with finite fibers from X to some Pn (n = dim(X)) maps X′, with finite fibers, to a subvariety of Pn.
Finally, to see (d) assume X ≠ PN and fix a point P ∉ X. Consider the projection φ from P. The fiber of φ which contains any point Q ∈ X is a projective subvariety of the line QP and misses P, thus it is finite. Hence φ maps X to PN−1 with finite fibers. Thus dim(X) = dim(φ(X)) ≤ N − 1. □
Example 11.2.15 Hypersurfaces X in PN have dimension N − 1.
Indeed take a point P ∉ X and consider the projection π of X from P to PN−1. The fibers of the projection are closed proper subvarieties of lines, hence they are finite. Moreover, the projection is surjective. Indeed let p ∈ C[x0, . . . , xN] be a polynomial which defines X, i.e., X = X(p). Take any point Q ∈ PN−1 and call L the line PQ. After a change of coordinates, we may assume P = [1 : 0 : · · · : 0] and Q = [0 : 1 : 0 : · · · : 0]. Then the restriction of p to L is a polynomial p̄ in the coordinates x0, x1, of the same degree as p, unless it is 0. In any case, p̄ has some nontrivial solution, corresponding to points of X which are mapped to Q by the projection. The claim follows.
We generalize the computation of the dimension of hypersurfaces as follows.

Lemma 11.2.16 Let X be a subvariety of PN, with infinitely many points. Then for any hyperplane H of PN we have X ∩ H ≠ ∅.
Proof Since the number of irreducible components of X is finite, there exists some component of X which contains infinitely many points. So, without loss of generality, we may assume that X is irreducible.
If X ⊂ P1, then X = P1 because X is closed in the Zariski topology, which is the cofinite topology, and the claim is trivial. Then assume X ⊂ PN, N > 1, and proceed by induction on N.
Fix a hyperplane H and assume that H ∩ X = ∅. Fix a linear subspace L ⊂ H of dimension N − 2 and consider the projection φ of X from L to a general line ℓ. By Chow's Theorem, the image φ(X) is a closed irreducible subvariety of ℓ, and it does not contain, by construction, the point H ∩ ℓ. Thus φ(X) is finite and irreducible, hence it is a point Q (see Exercise 26). Then X is contained in the hyperplane H′ spanned by L and Q, which is a PN−1, and it does not meet the hyperplane H′ ∩ H of H′. This contradicts the inductive assumption. □
The next result can be viewed as a generalization of Example 11.2.15.

Theorem 11.2.17 Let X ⊂ PN be an irreducible variety and consider a homogeneous polynomial g ∈ C[x0, . . . , xN] which does not belong to the ideal of X. Then dim(X ∩ X(g)) = dim(X) − 1.
Proof Set n = dim(X). If n = N, then X = PN (Proposition 11.2.14 part (d)) and X ∩ X(g) = X(g), so the claim holds in this case.
If X ≠ PN, fix a surjective projection with finite fibers φ : X → Pn from some linear subspace L ⊂ PN, of dimension N − n − 1.
Assume first that g is a general linear form, so that X(g) is a general hyperplane. Fix a general point Q ∈ Pn and consider the span L′ of Q and L. The intersection of L′ with X is a general fiber of φ, thus it is finite. Since X(g) is a general hyperplane, it will contain none of the points of L′ ∩ X. This means that Q is not in the image φ(X ∩ X(g)). Thus φ maps (with finite fibers) X ∩ X(g) to a proper subvariety of Pn. This proves that dim(X ∩ X(g)) < n.
Next, take a general hyperplane H′ = Pn−1 of Pn and consider the projection φ′ of φ(X ∩ X(g)) from Q to H′. Notice that since Q ∉ φ(X ∩ X(g)), then φ′ has finite fibers, as in the proof of Proposition 11.2.14. The composition φ0 of φ and φ′ corresponds to the projection of X ∩ X(g) from L′. For all Q′ ∈ Pn−1, the fiber of φ0 over Q′ corresponds to the intersection of X ∩ X(g) with the span L′′ of Q′ and L′, which is also the span of L, Q, Q′, i.e., the span of L and the line QQ′. Since φ : X → Pn surjects, the inverse image of the line QQ′ in φ is a subvariety Y ⊂ X which contains infinitely many points. Thus, by Lemma 11.2.16, the intersection of Y with the hyperplane X(g) is not empty. Any point P ∈ Y ∩ X(g) is a point of X ∩ X(g) mapped to Q′ by the projection φ0. It follows that φ0 maps X ∩ X(g) surjectively to Pn−1, with finite fibers. Thus dim(X ∩ X(g)) = n − 1.
Now we prove the general case, by induction on N. If N = 1, then X is either P1 or a point, and the claim is obvious. If N > 1, use the fact that for a general hyperplane H of PN, the intersection X(g) ∩ H is a hypersurface in PN−1 which does not contain H ∩ X. Then, using the inductive hypothesis and the intersection with hyperplanes:

dim(X) = dim(X ∩ H) + 1 = dim((X ∩ H) ∩ (X(g) ∩ H)) + 2
= dim((X ∩ X(g)) ∩ H) + 2 = dim(X ∩ X(g)) + 1. □
Corollary 11.2.18 Let X be a variety of dimension n in PN and let L be a linear subspace of dimension d ≥ N − n. Then L ∩ X ≠ ∅.
Proof If n = 1 then X is infinite and the claim follows from Lemma 11.2.16. Then make induction on N. The result is clear if N = 1. For n, N > 1, take a general hyperplane H containing L and identify H with PN−1. Fix an irreducible component X′ of dimension n in X. Then either X′ ⊂ H or dim(H ∩ X′) = n − 1 by Theorem 11.2.17. In any case, the claim follows by induction. □

The previous result has an important consequence, that will simplify the compu-
tation of the dimension of projective varieties.
11.2 Dimension 187

Corollary 11.2.19 Let L be a linear subspace which does not intersect a projective
variety X and consider the projection φ of X from L to some linear space Pn , disjoint
from L.
Then φ has finite fibers. Thus, φ is surjective if and only if n = dim(X ).
Proof Assume that there exists a fiber φ−1(Q) which is infinite, for some Q ∈ Pn. Then φ−1(Q) is an infinite subvariety of the span L′ of L and Q; in particular dim(φ−1(Q)) ≥ 1. Since L′ is a linear subspace of dimension N − n which contains both L and φ−1(Q), by the previous corollary we have L ∩ φ−1(Q) ≠ ∅. This contradicts the assumption that L and X are disjoint. □
Corollary 11.2.20 Let X be a variety of dimension n in P N and let Y be the inter-
section of m hypersurfaces Y = X (g1 ) ∩ · · · ∩ X (gm ). Then:

dim(X ∩ Y ) ≥ n − m.
Proof Call X1, . . . , Xk the irreducible components of X. First assume m = 1. For i = 1, . . . , k either Y contains Xi or dim(Y ∩ Xi) = dim(Xi) − 1, by Theorem 11.2.17. It follows that either dim(Y ∩ X) = dim(X) or dim(Y ∩ X) = dim(X) − 1. The general claim follows by induction. □
Example 11.2.21 By Corollary 11.2.20, a variety Y = X (g1 ) ∩ · · · ∩ X (gm ) has
dimension at least N − m, because X (g1 ) has dimension N − 1.
The inequality is indeed an equality if m = 1, by Example 11.2.15.
We observe that the dimension of an intersection Y = X (g1 ) ∩ · · · ∩ X (gm ), m >
1, can be strictly bigger than N − m (see Example 11.2.23).
A variety Y is called complete intersection if dim(Y ) = N − m.
Example 11.2.22 Let X be a Veronese variety in PN, image of the d-Veronese embedding vn,d of Pn. Then dim(X) = n.
Indeed we can produce a projection of X to Pn, with finite fibers, as follows. Consider coordinates x0, . . . , xn for Pn, so that the coordinates of PN correspond to monomials of degree d in the xi's. After a change of coordinates, we may arrange the monomials so that the first n + 1 coordinates correspond to x0^d, . . . , xn^d. Then take the projection φ of X to the linear space Pn spanned by the first n + 1 coordinates. For each Q = [q0 : · · · : qn], the inverse image of Q in φ is the set of points in X which are of the form vn,d([p0 : · · · : pn]) with pi^d = qi for i = 0, . . . , n. Since we have only a finite number of choices for each of the numbers pi, the fibers of φ are finite.
Example 11.2.23 We know by Example 10.5.5 that the image X of the Veronese map of degree 3, P1 → P3, is minimally defined in C[M0, . . . , M3] by the three quadric equations

M0 M3 − M1 M2 = 0, M0 M2 − M1^2 = 0, M1 M3 − M2^2 = 0.

On the other hand, by Example 11.2.22, X has dimension 1. Thus X is not a complete intersection.
188 11 Dimension Theory

Next, we have the following consequence of the description of projective maps


Pn → P N , obtained in the proof of the Chow’s Theorem.
Theorem 11.2.24 Let f : Pn → P N be a nonconstant projective map. Then the
dimension of f (Pn ) is n. Hence N ≥ n.
Proof We know that f is equivalent to a Veronese map vn,d , followed by a change
of coordinates and a projection π. Call Y the image of f and consider a surjective
projection π  : Y → Pm with finite fibers, with m = dim(Y ). Then π  ◦ π is a sur-
jective projection of vn,d (Pn ) to Pm , with finite fibers, by Corollary 11.2.19. Thus
dim(vn,d (Pn )) = m, hence m = n by the previous example. 
Example 11.2.25 Let X be a Segre variety in P N , image of the Segre embedding
sn,m of Pn × Pm . Then dim(X ) = n + m.
Indeed we can produce a projection of X to Pn+m , with finite fibers, as follows.
Consider coordinates x0 , . . . , xn for Pn and coordinates y0 , . . . , ym for Pm , so that
the coordinates z i j of P N correspond to the products z i j = xi y j , with i = 0, . . . , n
and j = 0, . . . , m.
Thus the difference i − j of the two indices of z i j ranges between −m and n. We
define the map φ : P N → Pn+m by sending the point [z i j ] to [q0 : · · · : qn+m ] where

qk = zi j .
i− j=k−m

Since φ is linear and surjective, it defines a projection of X , provided that X does


not intersect the kernel of φ. This last fact can be proved by induction on n, m.
It is clear if either m or n are 0. In the general case, if for some P ∈ X , P =
[xi y j ] one has qk = 0 for all k = 0, . . . , n + m, then in particular 0 = z 0m = x0 ym .
If x0 = 0, then the image of φ(P) corresponds to the image of the point s(Q) of
Q = ([x1 : · · · : xn ], [y0 : · · · : ym ]) in the Segre embedding of Pn−1 × Pm , so that
we get a contradiction with the inductive assumption. A similar argument works
when ym = 0.
From Corollary 11.2.19, we obtain that the general fiber of the restriction φ X =
φ|X is finite. It remains to prove that φ X is surjective. We prove it by induction
on the number n. If n = 1, the claim is obvious. Assume that the claim holds for
n − 1, m. Fix a general point [1 : q1 : · · · : qn+m ] ∈ Pn+m . By induction the map φ X 
from the Segre embedding X  of Pn−1 × Pm to Pn+m−1 surjects. This implies that
we can find a point Q  = ([x0 : · · · : xn−1 ], [1 : y1 : · · · : ym ]) whose image in φ X 
is the point [q1 : · · · : qn+m

], where qi = qi − yi . Then the image in φ X of the point
Q = ([x0 : · · · : xn−1 : 1], [1 : y1 : · · · : ym ]) is [1 : q1 : · · · : qn+m ].
As in Theorem 11.2.24, one proves the following corollary on the dimension of
the image of a multiprojective map.
Theorem 11.2.26 Let X = Pa1 × · · · × Pak be a product of projective spaces and
let f : X → P N be a multiprojective map, which is nonconstant on any factor. Then
the dimension of f (X ) is a1 + · · · + ak . Hence N ≥ a1 + · · · + ak .
11.2 Dimension 189

Remark 11.2.27 The definition of dimension given in this section can be used to
provide a definition for the degree of a projective variety.
Namely, one can prove that given a surjective projection φ : X → Pn , with finite
fibers, the cardinality d of the general fiber is fixed, and it corresponds to the (linear)
dimension of the residue field k(X ), considered as a vector field over C[x1 , . . . , xn ].
Thus d does not depend on the choice of φ. The number d is called the degree of X .
In general, the computation of the degree of a projective variety is not straight-
forward. Since the results on which the very definition of degree is based, as well
as the computation of the degree for elementary examples, requires a theory which
goes beyond the aims of the book, we do not include it here. The interested reader
can find an account of the theory in section VII of [1] and in the book [2].

11.3 General Maps

From Definition 9.3.1, only locally a general projective map between projective
varieties f : X → Y is equivalent, up to changes of coordinates, to a Veronese map
followed by a projection.
It is possible to construct examples of maps f for which an equivalent version of
Theorem 11.2.24 does not hold. In particular, one can have dim(Y ) < dim(X ).
We account in this section some relations between the dimension of projective
varieties connected by projective maps and the dimension of general fibers of the
map.

Lemma 11.3.1 Let X ∈ Pn be a variety. Fix P ∈ Pn and let U be a Zariski open


subset of X not containing P. Consider the projection π : U → Pn−1 and let U  be
the image. Let X  ⊂ U be the set of points Q such that P belongs to the closure of
the fiber π −1 (π(Q)). Then X  is closed, in the Zariski topology of U . Moreover the
set of points Q ∈ U such that the fiber π −1 (π(Q)) is finite is a (possibly empty) open
subset of U .

Proof After a change of coordinates, we may assume that P = [1 : 0 : · · · : 0] and


π maps a point [ p0 : p1 : · · · : pn ] to [ p1 : · · · : pn ]. Let f be a polynomial in the
coordinates x0 , . . . , xn of Pn , which vanishes at the points of X , and write

f = g0 + x0 g1 + · · · + x0d gd ,

where each gi is a polynomial in x1 , . . . , xn . Fix a point Q = [q0 : q1 : · · · : qn ] ∈ U


and assume that gi (Q) = gi (q1 , . . . , qn ) does not vanish, for some i. Then the fiber
π −1 (π(Q)) is contained in the subset of the line P Q defined by the polynomial
f 0 = f (x0 , q1 , . . . , qn ) = 0, which is finite, since f 0 is nontrivial. Thus the fiber
π −1 (π(Q)) is finite, hence closed, and P cannot belong to the closure of π −1 (π(Q)).
Conversely, assume that for all the generators f of the homogeneous ideal of X one
has gi (Q) = 0. Then any point Q  = [b : q1 : · · · : qn ] of the line P Q belongs to X ,
190 11 Dimension Theory

thus π −1 (π(Q)) contains an open subset in the line P Q which is non-empty, since
Q belongs to it. Hence the closure of π −1 (π(Q)) coincides with the whole line P Q.
It follows that if we take a set of generators f 1 , . . . , f k for the homogeneous ideal
of X and for each j we write f j = g0 j + x0 g1 j + · · · + x0d gd j , then the set of Q ∈ U
such that P belongs to the closure of π −1 (π(Q)), which is equal to the set such that
π −1 (π(Q)) is not finite, is defined by the vanishing of all the polynomials gi j ’s. The
claims follow. 

Corollary 11.3.2 Let f : X → Y be a projective map. Then the set U of points


Q ∈ Y such that the fiber f −1 (Q) of the map f over Q is finite is a (possibly empty)
Zariski open set in X . Thus the set of points P ∈ X such that the fiber f −1 ( f (P)))
is finite is open in X .

Proof There exists a finite open cover {Ui } of X such that the restriction of f to each
Ui coincides with a Veronese embedding followed by a change of coordinates and a
projection πi . We claim that U ∩ Ui is open for each i. Indeed πi can be viewed as the
composition of a finite chain of projections from points P j ’s, and in each
 projection,
by Lemma 11.3.1, the fibers are finite in an open subset. Thus U = (U ∩ Ui ) is
open. 

Next, we give a formula for the dimension of projective varieties, which is the most
useful and used formula in effective computations. It is based on the link between the
dimensions of two varieties X, Y , when there exists a projective map f : X → Y .
The first step toward the formula is the notion of semicontinuous maps on projec-
tive varieties.

Definition 11.3.3 Let X be a projective variety, and consider a map g : X → Z. We


say that g is upper semicontinuous if for every z ∈ Z the set of points P ∈ X such
that g(P) ≥ z is closed in the Zariski topology.

The most important upper semicontinuous function that we will study in this
section is constructed as follows. Take a projective map f between projective varieties
f : X → Y and define μ f by:

μ f (Q) = the dimension of the fiber f −1 (Q).

Definition 11.3.4 A continuous map f : X → Y is dominant if the image is dense


in Y .

Theorem 11.3.5 Let X be irreducible and let f : X → Y be a dominant projective


map. Then f is surjective, the function μ f is upper semicontinuous and its minimum
is dim(X ) − dim(Y ).

Proof Since the statement is trivial if dim(X ) = 0, we proceed by induction on


dim(X ).
First let us prove that dim(X ) ≥ dim(Y ). Let Y ⊂ Pm . Notice that f (X ) cannot
miss any component of Y , since it is dense in Y . Consider thus a hyperplane H not
11.3 General Maps 191

containing a point of f (X ) chosen in any component of Y . Then dim(H ∩ Y ) =


dim(Y ) − 1, by Theorem 11.2.17. Consider an irreducible component Y  of H ∩ Y ,
of dimension dim(Y ) − 1. The inverse image f −1 (Y  ) is a proper subvariety of X , so
it has finitely many irreducible components. Since Y  is irreducible, it cannot be the
union of a finite number of proper closed subsets (see Exercise 23). Thus there exists
a component X  of f −1 (Y  ) such that f |X  : X  → Y  is dominant. As X  is a proper
closed subvariety of X , then dim(X  ) < dim(X ), by Theorem 11.2.17 again. Thus,
by induction, dim(X ) > dim(X  ) ≥ dim(Y  ) = dim(Y ) − 1, and the claim follows.
Next we prove that f is surjective. Assume there exists P ∈ Y not belonging
to f (X ). As above, fix a general hyperplane H of Pm containing P and missing
some point of any component of Y in f (X ). Take a component Y  of Y ∩ H , which
contains P. As above, f −1 (Y  ) is a proper closed subvariety of X and there exists a
component X  of f −1 (Y  ) which dominates Y  . By induction the map f |X  : X  → Y 
is dominant, which contradicts that P ∈ / f (X ).
Next, assume there exists a point P ∈ X such that the fiber f P = f −1 ( f (P))
is finite. We know by Corollary 11.3.2 that the set of points P  ∈ X , such that the
fiber f P  = f −1 ( f (P  )) is finite is open in X . We claim that dim(Y ) = dim(X ).
Indeed let X ⊂ Pn and fix a hyperplane H of Pn which misses all the points of f P
and set X  = X ∩ H . Then there exists an irreducible component X  of X ∩ H of
dimension dim(X  ) = dim(X ) − 1. Moreover the restriction of f to X  maps X  to
a closed subvariety Y  of Y . Since X is irreducible and f is surjective, then Y is
irreducible, thus dim(Y  ) < dim(Y ). By induction, since f |X  has some finite fibers,
then dim(X  ) = dim(Y  ). One has

dim(Y ) ≤ dim(X ) = dim(X  ) + 1 = dim(Y  ) + 1 ≤ dim(Y )

and the claim follows.


Finally, we prove the last statement by induction on the minimum of the dimension
of the fibers of f . If the minimum is 0, the claim has been proved above. Assume that
the minimum e > 0. Then by Theorem 11.2.17, the hyperplanes H of Pn intersect
all the fibers of f . Fix a fiber F of f of dimension e and fix a general H which
misses a point of each irreducible component of F. The restriction of f to X ∩ H
still surjects onto Y , thus, being Y irreducible, there exists as above a component X 
of X ∩ H such that f |X  still surjects onto Y . Thus X  contains some point Q ∈ F
and the fiber F  of f |X  which contains Q has dimension equal to e − 1. It follows
by induction that:

dim(X ) − dim(Y ) = dim(X  ) − dim(Y ) + 1 = (e − 1) + 1 = e.


−1
The set of points Q ∈ Y such that the dimension of f |X  (Q) has dimension e − 1

is open in Y by the inductive assumption. For all these points the dimension of the
fiber of f is e. Thus the set of points such that the dimension of the fiber is bigger
than e is bounded to a closed subvariety Y  . The inverse image of Y  is a proper
closed subvariety X  of X . Restricting f to X  and using induction, we see that the
set of points of Y whose fibers have dimension bigger than e is a closed subset Y1 of
192 11 Dimension Theory

Y  , hence a closed subset of Y . It corresponds to the set of points Q ∈ Y such that


dim( f −1 (Q) > e. Call e1 the minimal dimension of a fiber of F|X 1 , where X 1 is the
inverse image of Y1 . Then, arguing as above, we find a closed subset Y2 of Y such
that the fibers of f over the points of Y2 have dimension greater than e1 . And so on.
It follows that the map μ f is semicontinuous. 

Remark 11.3.6 By Theorem 9.1.30, the chain of closed subsets

Z0 = Y ⊃ Z1 ⊃ Z2 · · · ,

such that Z i = {P ∈ Y : μ f (P) ≥ e + i}, becomes constant after a finite number of


steps. Thus the dimension of the fibers of f is bounded.

As a consequence of Theorem 11.3.5, we find the following extension of the


Chow’s Theorem.

Corollary 11.3.7 The image of any projective map f : X → Y is closed in Y .

Examples and applications are contained in the exercise section.

11.4 Exercises

Exercise 35 Prove the second characterization of prime ideals, given in Definition


11.1.1: the ideal J is prime if and only if the quotient ring R/J is a domain, i.e., for
all a, b ∈ R/J , ab = 0 implies that either a = 0 or b = 0.

Exercise 36 Let P, Q be points of a projective space Pn and let H be a hyperplane


in Pn not containing P. Prove that, for any linear subspace L ⊂ Pn containing P,
the linear span of P, Q and the intersection L ∩ H is equal to the intersection of H
with the linear span of L and Q.

Exercise 37 Prove that every homomorphism between two fields is either trivial of
injective.

Exercise 38 Prove that if K ⊂ K  is a field extension, then the set of elements of


K  which are algebraic over K is a subfield of K  (containing K ).

Exercise 39 Prove that if e is a transcendental element over K then the extension


K (e) is isomorphic to the field of fractions of the polynomial ring K [x].

Exercise 40 Let L ⊂ P N be a linear subspace of dimension m. Let M be another


linear subspace of P N , which does not intersect L. Prove that the projection of M
from L to some space P N −m−1 is a linear subspace of the same dimension of M.

Exercise 41 Prove that any semicontinuous map from a projective variety to Z is


bounded.
11.4 Exercises 193

Exercise 42 Find examples of maps C → C which are continuous and dominant


(in the cofinite topology of C, which is the restriction to C of the Zariski topology
of P1 ) and not surjective.

References

1. Zariski, O., Samuel, P.: Commutative Algebra I. Graduate Texts in Mathematics, vol. 28.
Springer, Berlin (1958)
2. Harris, J.: Algebraic Geometry, a First Course. Graduate Texts in Mathematics, vol. 133,
Springer, Berlin (1992)
Chapter 12
Secant Varieties

The study of the rank of tensors has a natural geometric counterpart in the study of
secant varieties. Secant varieties or, more generally, joins are a relevant object for
several researches on the properties of projective varieties.
We introduce here the study of join varieties and secant varieties, mainly pointing
out the most important aspects for applications to the theory of tensors.

12.1 Definitions

Consider k projective varieties Y1 , . . . , Yk in some projective space Pn . Roughly


speaking, the join of Y1 , . . . , Yk is the set obtained by taking points P1 ∈ Y1 , . . . , Pk ∈
Yk and taking the span of the points.
The formal definition, however, requires some caution, for otherwise one ends up
with an object which is rather difficult to handle.

Definition 12.1.1 Let Y1 , . . . , Yk be projective subvarieties of Pn . Then Y1 × · · · ×


Yk is a multiprojective subvariety of (Pn )k = Pn × · · · × Pn . The total join of
Y1 , . . . , Yk is the subset T J (Y1 , . . . , Yk ) ⊂ Y1 × · · · × Yk × Pn of all (k + 1)-tuples
(P1 , . . . , Pk , Q) such that the points P1 , . . . , Pk , Q, as points in Pn , are linearly
dependent.

Proposition 12.1.2 The total join of Y1 , . . . , Yk is a multiprojective subvariety of


(Pn )k+1 .

Proof A point (P1 , . . . , Pk , Q) belongs to the total join T J (Y1 , . . . , Yk ) if and only
if
(i) each Pi satisfies the equations of Yi in the i-set of multihomogeneous coordinates
in (Pn )k ; and
(ii) taking homogeneous coordinates (yi0 , . . . , yin ) for Pi and (x0 , . . . , xn ) for Q,
then all the (k + 1) × (k + 1) minors of the matrix

© Springer Nature Switzerland AG 2019 195


C. Bocci and L. Chiantini, An Introduction to Algebraic Statistics
with Tensors, UNITEXT 118, https://doi.org/10.1007/978-3-030-24624-2_12
196 12 Secant Varieties
⎛ ⎞
x0 x1 ... xn
⎜ y10 y11 ... y1n ⎟
⎜ ⎟
⎝. . . ... ... . . .⎠
yk0 yk1 ... ykn

vanish.
Since these last minors are multilinear polynomials in the multihomogeneous
coordinates of (Pn )k+1 , the claim follows. 

Notice that the total join coincides trivially with the product Y1 × · · · × Yk × Pn
when k > n + 1. Thus, in order to avoid trivialities:

we assume f r om now on in this chapter that k ≤ n + 1.

Definition 12.1.3 A set of varieties Y1 , . . . , Yk is independent if one can find linearly


independent points P1 , . . . , Pk such that P1 ∈ Y1 , . . . , Pk ∈ Yk .

Notice that, in the definition of total join, we are not excluding the case in which
some of the Yi ’s, or even all of them, coincide.

Remark 12.1.4 If we take the same variety Y k times, i.e., we take Y1 = · · · = Yk =


Y , then the set Y1 , . . . , Yk is independent exactly when Y is not contained in a linear
subspace of (projective) dimension k − 1.
In particular, when Y is not contained in any hyperplane, i.e., when Y is nonde-
generate, then for any k ≤ n + 1 we obtain a set of k independent varieties by taking
k copies of Y .

Notice also that, by definition, if the points P1 ∈ Y1 , . . . , Pk ∈ Yk are linearly


dependent (e.g., if some of them coincide), then for any Q ∈ Pn the point (P1 , . . . , Pk ,
Q) belongs to the total join. This is a quite degenerate situation, and we would like
to exclude the case of linearly dependent k-tuples P1 , . . . , Pk . We warn immediately
that will not be able to exclude all of them. Yet, we will find a refined definition of
joins, which excludes most of (k + 1)-tuples (P1 , . . . , Pk , Q) in which the Pi ’s are
dependent.

Proposition 12.1.5 If Y1 , . . . , Yk are varieties, then the set of k-tuples of points


(P1 , . . . , Pk ) ∈ Y1 × · · · × Yk such that P1 , . . . , Pk are dependent is a subvariety of
the product.

Proof Enough to observe that the set is defined by the (multilinear) k × k minors of
the matrix obtained by taking as rows a set of coordinates of the Pi ’s. 

We will now consider the behavior of a total join with respect to the natural
projections of (Pn )k to its factors. Let us recall the following fact.

Proposition 12.1.6 The product of a finite number of irreducible projective varieties


is irreducible.
12.1 Definitions 197

Proof Follows immediately by a straightforward application of Exercise 24. 

Theorem 12.1.7 Consider an independent set of irreducible varieties Y1 , . . . , Yk .


Then there exists a unique irreducible component Z of the total join T J (Y1 , . . . , Yk )
such that the restriction to Z of the projection π of Y1 × · · · × Yk × Pn to the first k
factors surjects onto Y1 × · · · × Yk .

Proof Since, by construction, the projection of T J (Y1 , . . . , Yk ) to Y1 × · · · × Yk


surjects, then Y1 × · · · × Yk is contained in the finite union π(Z 1 ) ∪ . . . π(Z m ), where
Z 1 , . . . Z m are the irreducible components of T J (Y1 , . . . , Yk ). Since each Z i is closed
and π is a closed map (see Proposition 10.4.4), then each π(Z j ) is closed. Since by
Proposition 12.1.6 the product Y1 × · · · × Yk is irreducible, it follows that there exists
at least one component Z j such that π(Z j ) = Y1 × · · · × Yk .
Assume that there are two such components Z i , Z j . Then consider a set of points
P1 ∈ Y1 , . . . , Pk ∈ Yk , such that P1 , . . . , Pk are linearly independent. The set exists,
because we are assuming that the set Y1 , . . . , Yk is independent. It is easy to see
that the fiber π −1 (P1 , . . . , Pk ) is a subset of {(P1 , . . . , Pk )} × Pn which is naturally
isomorphic to {P1 , . . . , Pk } × L, where L is the linear span of the Pi ’s. Thus L is a
linear space of dimension k − 1, so that π −1 (P1 , . . . , Pk ) is irreducible. It follows that
if the points P1 ∈ Y1 , . . . , Pk ∈ Yk are independent, then the fiber π −1 (P1 , . . . , Pk )
is contained in one irreducible component of T J (Y1 , . . . , Yk ).
We claim that for any irreducible component Z i of the join, the set Wi ⊂ Y1 ×
· · · × Yk of k-tuples (P1 , . . . , Pk ) such that (P1 , . . . , Pk , Q) ⊂ Z i for all Q in the
span of P1 , . . . , Pk , is a subvariety.
To prove the claim, take any (multihomogeneous) equation f for Z i as a subva-
riety of (Pn )k × Pn . Consider f as a polynomial in the variables of the last factor
Pn , with coefficients f i ’s which are multihomogeneous polynomials in the coordi-
nates of (P1 . . . , Pk ) ∈ (Pn )k . Then (P1 , . . . , Pk , Q) ⊂ Z i for all Q in the span of
P1 , . . . , Pk if and only if (P1 , . . . , Pk ) annihilates all the f i ’s. This provides a set of
multihomogeneous equations that defines Wi .
Now we show that the claim proves the statement. Assume that there are several
irreducible components of the join, Z 1 , . . . , Z m , which map onto Y1 × · · · × Yk in π
and consider the sets W1 , . . . , Wm as above. We know that the fibers π −1 (P1 , . . . , Pk )
belong to some Z i , whenever the points P1 , . . . , Pk are independent. So, the union

π(Z i ) contains the subset U of k-tuples (P1 , . . . , Pk ) in Y1 × · · · × Yk such that
the Pi ’s linearly independent. Since U is open, by Proposition 12.1.5, and non-empty,
since the Yi ’s are independent, then U is dense in Y1 × · · · × Yk , which is irreducible,
by Proposition 12.1.6.  It follows that Y1 × · · · × Yk , i.e., the closure of U , is con-
tained in the union π(Z i ), hence it is contained in some π(Z i ), say in π(Z 1 ). In
particular, π(Z 1 ) contains an open, non-empty, hence dense, subset of Y1 × · · · × Yk .
Thus π(Z 1 ) = Y1 × · · · × Yk . Assume that Z 2 in another component which satisfies
π(Z 2 ) = Y1 × · · · × Yk . We prove that Z 2 ⊂ Z 1 , which contradicts the maximality
of irreducible components. Namely W = (Y1 × · · · × Yk ) \ U is closed in the prod-
uct, so Z 2 ∩ (π −1 (W )) is closed in Z 2 and it is a proper subset, since π restricted to
Z 2 surjects. Hence Z 2 \ (π −1 (W )) is dense in Z 2 , which is irreducible. On the other
198 12 Secant Varieties

hand if (P1 , . . . , Pk , Q) ∈ Z 2 \ (π −1 (W )), then P1 , . . . , Pk are linearly independent,


thus (P1 , . . . , Pk , Q) ∈ Z 1 because Z 1 contains the fiber of π over (P1 , . . . , Pk ). It
follows that Z 2 \ (π −1 (W )) ⊂ Z 1 , hence Z 2 ⊂ Z 1 , a contradiction. 

Definition 12.1.8 Consider an independent set Y1 , . . . , Yk of irreducible varieties.


The abstract join of the Yi ’s is the unique irreducible component A J (Y1 , . . . , Yk ) of
the total join T J (Y1 , . . . , Yk ) which maps onto Y1 × · · · × Yk in the natural projec-
tion.
The (embedded) join J (Y1 , . . . , Yk ) is the image of A J (Y1 , . . . , Yk ) under the
projection of (Pn )k × Pn to the last copy of Pn .
Since the image of an irreducible variety is irreducible, then J (Y1 , . . . , Yk ) is
irreducible.

We put the adjective embedded in parenthesis since we will (often) drop it and
say simply that J (Y1 , . . . , Yk ) is the join of Y1 , . . . , Yk .
Notice that while the abstract join is an element of the product (Pn )k × Pn , the
join is a subset of Pn , which is Zariski closed, since the last projection is closed (see
Proposition 10.4.4).

Definition 12.1.9 If we apply the previous definitions to the case Y1 = · · · = Yk =


Y , where Y is an irreducible variety in Pn , we get the definitions of the abstract k-th
secant variety ASk (Y ) and the (embedded) k-th secant variety Sk (Y ) (which are both
irreducible).

Example 12.1.10 Let Y be the d-th Veronese embedding of P1 in Pd . The space


Pd can be identified with the space of all forms of degree d in 2 variables x, y, by
identifying the coordinates z 0 , . . . , z d with monomials of degree d in two variables,
e.g., z i = x i y d−i . In this representation, Y can be identified with the set of powers,
i.e., forms that can be written as (ax + by)d , for some choice of the scalars a, b.
With this in mind, let us determine a representation of the secant variety S2 (Y ).
A general point of the abstract secant variety AS2 (Y ) is a triplet (P, Q, T ) where
T lies on the line joining P, Q ∈ Y . Choose scalars a P , b P , a Q , b Q such that P =
(a P x + b P y)d and Q = (a Q x + b Q y)d . Then T belongs to the line P, Q if and only if
there are scalars u, v such that T = u P + v Q = u(a P x + b P y)d + v(a Q x + b Q y)d .
In particular, when P = x d and Q = y d , then (P, Q, T ) belongs to the abstract secant
variety AS2 (Y ) if and only if T is a binomial of type T = ux d + vy d .
Notice that given two independent linear forms L = (a P x + b P y), M = (a P x +
b P y), then after a change of coordinates in P1 we may always assume that L = x
and M = y.
Summarizing, it follows that the secant variety S2 (Y ) contains all the points T ∈
Pd which, after a change of coordinates, miss all the mixed monomials, i.e., are of
the form T = ux d + vy d .
There are special points in AS2 (Y ) corresponding to triplets (P, P, T ), i.e., with
P = Q. They can arise as limit of families (P, Q(t), T ) ∈ AS2 (Y ) for families of
points Q(t) ∈ Y which tend toward P for t going to 0, since the general point of the
12.1 Definitions 199

irreducible variety AS2 (Y ) has P = Q (notice that the limit of a family of points in
AS2 (Y ) necessarily belongs to AS2 (Y ), since the abstract secant variety is closed).
For instance, consider the limit in the case P = x d , Q = (x + t y)d , T = P − Q.
For t = 0, we have

d − 2 2 d−2 2
T = (d − 1)t x d−1
y+ t x y + · · · + t d yd .
2
d−2 2
Projectively, for t = 0, T is equivalent to (d − 1)x d−1 y + d−2 2
tx y + ··· +
t d−1 y d . Thus for t → 0, T goes to x d−1 y. This implies that (x d , x d , x d−1 y) ∈
AS2 (Y ), i.e., x d−1 y ∈ S2 (Y ).
Notice that x d−1 y cannot be written as L d + M d , for any choice of linear forms
L , M. So x d−1 y is a genuinely new point of S2 (Y ).
With a similar trick, one can prove that for any choice of two linear forms L , M
in x, y, the form L d−1 M belongs to S2 (Y ).

Proposition 12.1.11 The abstract join A J (Y1 , . . . , Yk ) has dimension (k − 1) +


dim(Y1 ) + · · · + dim(Yk ). In particular, the abstract secant variety ASk (Y ) of a
variety Y of dimension n has dimension k − 1 + nk.

Proof The projection A J (Y1 , . . . , Yk ) → Y1 × · · · × Yk has general fibers, over a


point (P1 , . . . , Pk ) of the product such that the Pi ’s are linearly independent, cor-
responding to the projective span of P1 , . . . , Pk , which has dimension k − 1. The
claim follows from Theorem 11.3.5. 

While the dimension of the abstract secant variety is always easy to compute, for
the embedded secant variety the computation of the dimension can be complicated. To
give an example, let us show first how the interpretation of secant varieties introduced
in Example 12.1.10 can be extended.

Example 12.1.12 Consider the Veronese variety Y obtained by the Veronese embed-
ding vn,d : Pn → P N , where N = d+n n
− 1.
Y can be considered as the set of symmetric tensors of rank 1 and type (n + 1) ×
· · · × (n + 1), d times, i.e., the set of forms of degree d in n + 1 variables, which
are powers of linear forms.
Consider a form T of rank k. Then T can be written as a sum T1 + · · · + Tk of
forms of rank 1, with k minimal. The minimality of k implies that T1 , . . . , Tk are
linearly independent, since otherwise we could barely forget one of them and write
T as a sum of k − 1 powers. In particular k ≤ N + 1. We get that (T1 , . . . , Tk , T )
belongs to the k-th abstract secant variety of Y , so that T is a point of Sk (Y ).
The secant variety Sk (Y ) also contains forms whose rank is not k. For instance, if
T has rank k < k, then we can write T = T1 + · · · + Tk , with T1 , . . . , Tk linearly
independent. Consider points Tk +1 , . . . , Tk ∈ Y such that T1 , . . . , Tk are linearly
independent. Then consider the form

T (t) = T1 + · · · + Tk + t Tk +1 + · · · + t Tk ,
200 12 Secant Varieties

where t ∈ C is a parameter. It is clear that for t = 0 the point (T1 , . . . , Tk , T (t))


belongs to the abstract secant variety ASk (Y ). Thus T (t) ∈ Sk (Y ) for t = 0. It follows
that the limit of T (t) for t going to 0, which is T (0) = T , also belongs to Sk (Y ),
which is closed.
By arguing as in Example 12.1.10, one can prove that the form T = x0d−1 x1 +
x2 + · · · + xnd , whose rank is bigger than n, also belongs to Sn (Y ).
d

We can generalize the construction given in the previous example to show that

Proposition 12.1.13 For every variety Y ⊂ Pn and for k < k ≤ n + 1, we always


have Sk ⊆ Sk .

Example 12.1.14 Let Y = v2,2 (P2 ) be the Veronese variety of rank 1 forms of degree
2 in 3 variables, which is a subvariety of P5 . The abstract secant variety AS2 (Y ),
corresponding to triplets (L 2 , M 2 , T ) such that L , M are linear forms and T =
a L 2 + bM 2 , has dimension 2 + 2 + 1 = 5, by Proposition 12.1.11.
Let us prove that the dimension of the secant variety S2 (Y ) is 4. To do that, it
is enough to prove that the general fiber of the projection π : AS2 (Y ) → S2 (Y ) is
1-dimensional. To this aim, consider T = a L 2 + bM 2 , L , M general linear forms.
L , M correspond to general points of the projective plane P2 of linear forms in 3
variables. Let  ⊂ P2 be the line joining L , M. After a change of coordinates, without
loss of generality, we may assume that L = x02 , M = x12 , so that  has equation
x2 = 0 and T becomes a form of degree 2 in the variables x0 , x1 , i.e., T = x02 +
x12 . It is easy to prove that π −1 (T ) has infinitely many points. Indeed for every
point P = ax0 + bx1 ∈  there exists exactly one Q = a x0 + b x1 ∈  such that
(v2 (P), v2 (Q), T ) ∈ AS2 (Y ), i.e.,

T = (ax0 + bx1 )2 + (a x0 + b x1 )2 ,

as a projective point: enough to take a = b and b = −a (which is the unique choice,


modulo scalar multiplication, for a, b general).
Thus we obtain a projective map f from  to the fiber π −1 (T ) by send-
ing P = ax0 + bx1 to (v2 (P), v2 (Q), T ), where Q = bx0 − ax1 . The map f is
clearly injective. We show that f is surjective, by proving that, when T is general,
(v2 (P), v2 (Q), T ) cannot stay in AS2 (Y ), unless P ∈ . Namely if T = (ax0 +
bx1 + cx2 )2 + (a x0 + b x1 + c x2 )2 with c = 0, then one computes c = −ic , so
that a = −ia and b = −ib , thus T = (−i)2 (a x0 + b x1 + c x2 )2 + (a x0 + b x1 +
c x2 )2 = 0, a contradiction. It follows that π −1 (T ) is isomorphic to , hence it has
dimension 1.

Example 12.1.14 is a special case of the Alexander–Hirschowitz Theorem which


determines the dimension of secant varieties of a Veronese variety Y (see Theorems
7.4.4 and 12.2.13). The example illustrates one of the few cases in which ASk (Y )
and Sk (Y ) have different dimension.
Similar examples can be found by looking at projective spaces of matrices.
12.1 Definitions 201

Example 12.1.15 Consider the Segre embedding Y ⊂ P8 of P2 × P2 . If we interpret


P8 as the space of matrices of type 3 × 3, then S2 (Y ) contains matrices of rank 2.
Namely matrices T of type v ⊗ w + v ⊗ w have rank 2, for a general choice of
v, v , w, w ∈ K 3 . Notice that the space of rows of a matrix T = v ⊗ w + v ⊗ w
is generated by v, v . Thus a general element of S2 (Y ) has rank 2.
Conversely, as we saw in the proof of Proposition 6.3.2, if v, v are generators for
space of rows of a 3 × 3 matrix T of rank 2, then one can find vectors w, w ∈ K 3
such that T = v ⊗ w + v ⊗ w .
It follows that while AS2 (Y ) has dimension 2 dim(Y ) + 1 = 9, yet S2 (Y ) has
the dimension of the projective space of matrices of rank 2 in P8 , which is the
hypersurface of P8 defined by the vanishing of the determinant. Thus, dim(S2 (Y )) =
7.

Definition 12.1.16 Let Y ⊂ P N be any irreducible variety, which spans P N . We say


that T ∈ P N has Y -border rank k if k is the minimum such that T ∈ Sk (Y ).
Thus the set of points of Y -border rank k corresponds to the projective variety
Sk (Y ).
The generic Y -rank is the minimum k such that Sk (Y ) = P N .

The previous definition is particularly interesting when Y corresponds to the d-


th Veronese embedding of some projective space Pn , or to the Segre embedding
of a product Pa1 × · · · × Pas . In the latter case, points T ∈ P N can be identified
with tensors of type (a1 + 1) × · · · × (as + 1), while in the former case they can be
identified with symmetric tensors of type (n + 1) × · · · × (n + 1), d times.
Points of Y correspond to tensors of rank 1. Every point corresponding to a tensor
of rank k belongs to the secant variety Sk (Y ), so it has Y -border rank ≤ k. Thus in
general, for a tensor T one has

border rank of T ≤ rank of T

(we will drop the reference to Y , in the case of tensors of some given type, because
by default we consider Y to be the variety of rank 1 tensors).
One can find examples in which the previous inequality is strict.

Example 12.1.17 Go back to Example 12.1.10, in which Y is the d-th Veronese


embedding of P1 in Pd , and Pd can be identified with the space of all forms of degree
d in 2 variables x, y.
We proved there that the form x d−1 y cannot be written as L d + M d , for any choice
of linear forms L , M. So x d−1 y has rank bigger than 2. On the other hand, we also
proved that x d−1 y belongs to S2 (Y ). Hence

border rank of x d−1 y < rank of x d−1 y.

Some properties of the Y -border rank are listed in the following:

Remark 12.1.18 A general tensor of border rank k also has rank k.


202 12 Secant Varieties

Namely, by definition, the set of points (P1 , . . . , Pk , P) ∈ ASk (Y ) such that


P1 , . . . , Pk are linearly independent (in particular, distinct) is open dense in ASk (Y ).
Thus a general tensor T ∈ Sk (Y ) has rank ≤ k, hence it has rank k.
If r is the generic Y -rank, no points P ∈ P N can have border rank bigger than
r , because P ∈ Sr (X ) = P N . It is, however, possible that some tensors T have rank
bigger than the generic border rank.
In any event, if T is a tensor of border rank k, i.e., T ∈ Sk (Y ) for the variety Y of
tensors of rank 1, then, since Sk (Y ) is irreducible, there exists a neighborhood of T
in Sk (Y ) whose generic point is a tensor of rank k.

The latter property listed in the previous remark explains why the computation of
the border rank k of a tensor T can be important for applications: we can realize T
as a limit of tensors of rank k. In other words, T can be slightly modified to obtain a
tensor of rank k.

Remark 12.1.19 Let Y ⊂ Pn be a projective variety and let P ∈ Pn be a point, not


belonging to Y . Then the projection from P determines a well defined map π : Y →
Pn−1 . Call Y the image of π. Assume that π is not injective. Then there are two
points P1 , P2 ∈ Y which are sent to the same point of Pn−1 . This means that the
points P, P1 , P2 are aligned. Thus P belongs to the secant variety S2 (Y ).
It follows that the projection from a general point defines an injective map (i.e.,
a bijection from Y to Y ), provided that S2 (Y ) = Pn , i.e., provided that the generic
Y -rank is bigger than 2.
In fact, one can prove (but the proof uses algebraic tools which go beyond the
theory introduced in this book) that the projection from P determines an isomorphism
between Y and Y if and only if P does not belong to S2 (Y ). Thus, if S2 (Y ) = Pn ,
there is no chance of projecting isomorphically Y to a projective space of smaller
dimension.
The study of projection of varieties to spaces Pn whose dimension n is only slightly
bigger than the dimension of Y is a long-standing problem in classical Algebraic
Geometry, which indeed stimulated the first studies on secant varieties. For instance,
we know only a few examples of varieties of dimension n − 3 > 1 which can be
isomorphically projected to Pn−1 .

Varieties Y for which the abstract secant variety ASk (Y ) and the secant variety
Sk (Y ) have the same dimension are of particular interest in the theory of secant
spaces. For instance, if Y ⊂ P N is a variety of tensors of rank 1, then dim(ASk (Y )) =
dim(Sk (Y )) implies that the general fiber of the projection ASk (Y ) → Sk (Y ) is finite,
which means that for a general tensor T ∈ P N of rank k, there are (projectively) only a
finite number of presentations T = T1 + · · · + Tk , with Ti ∈ Y for all i. In particular,
tensors for which the presentation is unique are called identifiable tensors.

Definition 12.1.20 For an irreducible variety Y ⊂ P N and for any T ∈ P N , a Y -


decomposition of T of length r is a set of linearly independent points T1 , . . . , Tr ∈ Y
such that T = T1 + · · · + Tr , i.e., (T1 , . . . , Tr , T ) ∈ ASr (Y ).
If r = rank(T ), we say that the decomposition computes the rank.
12.1 Definitions 203

We say that T is finitely r -identifiable if there are only a finite number of decom-
positions of T of length r . We say that T is r -identifiable if there is only one decom-
positions of T of length r .
We say that Y is generically finitely r -identifiable if a general element of Sr (Y ) is
finitely r -identifiable. Notice that Y is generically finitely r -identifiable if and only
if ASr (Y ) and Sr (Y ) have the same dimension.
We say that Y is generically r -identifiable if a general element of Sr (Y ) is r -
identifiable.

We refer to the chapters on Multi-linear Algebra for a discussion on the importance


of the identifiability properties of tensors.

12.2 Methods for Identifiability

Finding whether a given tensor is identifiable or not, or whether tensors of a given


type are generically identifiable or not, is in general a difficult question. Results
on these problems are not yet complete, and the matter is an important subject of
ongoing investigations.
We briefly introduce in this section a couple of methods that are universally used
to detect the identifiability, or the generic identifiability, of tensors of a given type.

12.2.1 Tangent Spaces and the Terracini’s Lemma

The first method to compute the identifiability is based on the computation of the
tangent space at a generic point.
Roughly speaking, the tangent space to a projective variety X ⊂ P N at a general
point P ∈ X can be defined by considering that a sufficiently small Zariski open
subset U of X around P is a differential subvariety of P N , for which the notion of
tangent space is a well established, differential object. Yet, we will give an algebraic
definition of tangent vectors, which are suitable for the computation of the dimension.
We will base the notion of tangent space first by giving the definition of embedded
tangent space of a hypersurface at the origin, and then extending the notion to any
(regular) point of any projective variety.

Definition 12.2.1 Let X ⊂ P N by the hypersurface defined by the form

f = x0d−1 g1 + · · · + gd ,

where each gi is a form of degree i in x1 , . . . , x N . Clearly the point P0 = [1 : 0 : · · · :


0] belongs to X . The (embedded) tangent space to X at P0 is the linear subspace
TX (P0 ) of P N defined by the equation g1 = 0.
204 12 Secant Varieties

It is clear that P0 ∈ TX (P0 ). Notice that the previous definition admits the case
g1 = 0: when this happens, then TX (P0 ) coincides with P N . Otherwise TX (P0 ) is a
linear subspace of dimension N − 1 = dim(X ).
We are ready for a rough definition of tangent space of any variety X .

Definition 12.2.2 Let X be any variety in P N , containing the point P0 = [1 : 0 :


· · · : 0]. Each element f in the homogeneous ideal I X of X defines a hypersurface
X ( f ) which contains P0 .
The (embedded) tangent space to X at P0 is the intersection of the tangent spaces
TX ( f ) (P0 ), where f ranges among the elements of the ideal I X .
If P ∈ X is any point, then we define the tangent space to X at P as

TX (P) = φ−1 (Tφ(X ) (P0 ),

where φ is any change of coordinates which sends P to P0 .

We leave to the reader the proof that the definition of TX (P) does not depend on
the choice of the change of coordinates φ.
Unfortunately, in a certain sense as it happens for Groebner basis (Chap. 13), it
is not guaranteed that if f 1 , . . . , f s generate the ideal I X , the intersection of the cor-
responding tangent spaces TX ( fi ) (P0 ) determines TX (P). The situation can however
be controlled as follows.

Definition 12.2.3 We say that P ∈ X is a regular point if TX (P) has dimension


equal to dim(X ).

Example 12.2.4 If X is a hypersurface defined by a form f as above, then P0 is a


regular point of X if and only if g1 = 0.

We give without proof the following:

Theorem 12.2.5 For every P ∈ X it holds:

dim TX (P) ≥ dim(X ).

Moreover the equality holds in a Zariski open subset of X .

Example 12.2.6 The special case in which X = Pn is easy: the tangent space of X
at any point coincides with Pn . Thus any point of X is regular.

Next, we provide examples of tangent spaces to relevant varieties for tensor anal-
ysis.

Example 12.2.7 Consider


the image X of the Veronese embedding of degree d of
Pn into P N , N = n+d n
− 1. Let Q 0 ∈ Pn be the point of coordinates [1 : 0 : · · · : 0],
which corresponds to the linear form x0 . Its image in P N is the point P0 = [1 : 0 :
· · · : 0], which corresponds to the monomial M0 = x0d .
12.2 Methods for Identifiability 205

Consider a quadratic equation M0 Mi − M j Mk = 0 in the ideal of X . This equation


exists unless Mi is divided by x0d−1 , in which case the equation becomes null. Thus, the
ideal of TX (P0 ) contains all the equations of type M j = 0, where M j is a monomial in
which the exponent of x0 does not exceed d − 2. It follows that TX (P0 ) is contained
in the space generated by the monomial corresponding to x0d , x0d−1 x1 , . . . , x0d−1 xn .
Since this last space has dimension n = dim(X ) and the dimension of TX (P0 ) cannot
be smaller than n, by Theorem 12.2.5, we get that TX (P0 ) is the subspace generated by
forms in K [x0 , . . . , xn ]d which sit in the ideal spanned by x0d , x0d−1 x1 , . . . , x0d−1 xn .
A similar computation determines the tangent space to X at a general point P ∈ X
corresponding to a power L d , where L is a linear form in K [x0 , . . . , xn ]d : TX (P)
corresponds to the subspace generated by forms in K [x0 , . . . , xn ]d which sit in the
ideal spanned by L d , L d−1 x1 , . . . L d−1 xn .
Example 12.2.8 Consider the image X of the Segre embedding of Pa1 × · · · × Pas
into P N , N = (ai + 1) − 1. The image of the point Q = ([1 : 0 : · · · : 0], . . . , [1 :
0 : · · · : 0]) is the point P0 ∈ P N of coordinates [1 : 0 : · · · : 0]. If we call xi j the
coordinates in Pai , then P0 is the point corresponding to the multihomogeneous
polynomial M0 = x10 · · · xs0 .
For every multihomogeneous polynomial M = x1q1 · · · xsqs , except those for
which at least s − 1 of the qi ’s are 0, we can find an equation M0 M − Ma Mb = 0 for
a suitable choice of Ma , Mb . Thus all the corresponding coordinates of P N vanish on
TX (P0 ). It follows that TX (P0 ) is the linear subspace of the coordinates representing
multihomogeneous polynomial M = x1q1 · · · xsqs for which at least s − 1 of the qi ’s
are 0.
If we call i the coordinate space in Pa1 × · · · × Pas of points (Q 1 , . . . , Q s ) in
which all the Q j ’s, j = i, have coordinates [1 : 0 : · · · : 0], then it turns out that
TX (P0 ) is spanned by the images of the i ’s, i = 1, . . . , s.
Similarly, for a general point P ∈ X which is the image of the point (Q 1 , . . . , Q s )
in Pa1 × · · · × Pas , the tangent space TX (P) is spanned by the images of the spaces
i of points (R1 , . . . , Rs ) in which R j = Q j for all j = i.
The tangent spaces to secant varieties are then described by the following:
Lemma 12.2.9 (Terracini’s Lemma) Let U ∈ Sk (X ) be a general point of the k-th
secant variety of X . Assume that U belongs to the span of P1 , . . . , Pk ∈ X . Then
the tangent space to Sk (X ) at the point U is equal to the linear span of the tangent
spaces TX (P1 ), . . . , TX (Pk ).
The computation of tangent spaces allows us to study the dimension of secant
varieties.
Remark 12.2.10 Thanks to the Terracini’s Lemma, we can associate to each secant
variety an expected dimension. Namely, we may assume that the tangent spaces at
general points of X ⊂ P N are as independent as they can. With this in mind, we
define the expected dimension of Sr (X ) as

ex pdim r (X ) = min{N , r dim(X ) + r − 1}.


206 12 Secant Varieties

Example 12.2.11 Let X be the image in P5 of the 2-Veronese embedding of P2 .


Since X is a surface, the dimension of the abstract secant variety AS2 (X ) is 5. On
the other hand, S2 (X ) has dimension 4. Indeed, let T ∈ S2 (X ) belong to the span of
P1 = v2 (L 1 ) and P2 = v2 (L 2 ). After a change of coordinates, we may always assume
that L 1 corresponds to x and L 2 corresponds to y, in P(C[x, y, z]1 ) = P2 . Thus, the
tangent space to X at T corresponds to the projectification of the linear subspace W
of C[x, y, z]2 , generated by x 2 , x y, x z, y 2 , yz. This last space has dimension 4.
Example 12.2.12 Let X be the image in P8 of the Segre embedding s2,2 of P2 × P2 .
Since X has dimension 4, the dimension of the abstract secant variety AS2 (X ) is 9.
On the other hand, S2 (X ) has dimension 7. Indeed, let T ∈ S2 (X ) belong to the span
of P1 = s(x1 , y1 ) and P2 = s(x2 , y2 ). The tangent spaces to X at P1 and P2 share
the two points (x1 , y2 ) and (x2 , y1 ). Thus they share a line. Consequently their span
has dimension 7.
Notice that if we identify P8 with the projective space of 3 × 3 matrices, then X
corresponds to the variety of matrices of rank ≤ 2, which is the hypersurface whose
equation is the determinant of a generic matrix.
In both the examples above, the dimension of the secant variety is smaller than
the expected value. Such varieties are called defective.
Defective varieties, in fact, are quite rare in Algebraic Geometry. For instance,
curves are never defective.
Moreover, reading Theorem 7.4.4 in this context, we have a complete list of
defective Veronese varieties.
Theorem 12.2.13 (Alexander-Hirschowitz) Let X = vn,d (Pn ) be a Veronese variety.
Then the dimension of Sr (X ) equals the expected dimension, unless:
• n > 1, d = 2, 2 ≤ r ≤ n;
• n = 2, d = 4, r = 5;
• n = 3, d = 4, r = 9;
• n = 4, d = 3, r = 7;
• n = 4, d = 4, r = 14.

12.2.2 Inverse Systems

Inverse systems are a method to compute the rank and the identifiability of a given
symmetric tensor T , identified as forms of given degree d in some polynomial ring
R = C[x0 , . . . , xn ].
The method is based on the remark that if

T = L d1 + · · · + L rd ,

where the L i ’s are linear forms in R, then every derivative of T is also spanned by
L 1, . . . , Lr .
12.2 Methods for Identifiability 207

Thus, the method starts by considering a dual ring of coordinates S = C[∂0 , . . . ,


∂n ], where each ∂i should be considered as the (divided) partial derivative with
respect to xi . In other words, ∂i acts on R by the linear, associative action which
satisfies:
1 if i = j;
∂i (x j ) = .
0 if i = j,

One can easily prove that, for all i, ∂i (xim ) = (∂i xi )xim−1 = xim−1 . It follows:

xib−a if a ≤ b
∂ia xib = .
0 if a > b

We also have that if L = a0 x0 + · · · + an xn is a linear form, then by taking D =


ā0 ∂0 + · · · + ān ∂n (where āi is the conjugate of ai , then D(L) = 1. More generally,
D d (L d ) = 1.
Definition 12.2.14 For every form F ∈ R, the inverse system of F is the set of all
D ∈ S such that D(F) = 0. The inverse system of F is indicated with the symbol
F ⊥.
Proposition 12.2.15 The inverse system of a form F is a homogeneous ideal in the
ring S.
Example 12.2.16 Let us consider the form F = x03 + x0 x1 x2 + 2x0 x22 − 2x1 x22 − x23
in three variables. For any linear combination ∂ = a∂0 + b∂1 + c∂2 , one computes

∂(F) = a(3x02 + 2x22 ) + b(−2x22 ) + c(4x0 x2 − 4x1 x2 − 3x22 )

so that ∂(F) = 0 if and only if a = b = c = 0. Thus no linear form ∂ = 0, in S,


belongs to F ⊥ . On the other hand, ∂12 ∈ F ⊥ .
It is clear that every homogeneous element in S of degree at least 4 belongs to
F ⊥ . Notice that ∂o3 (F) = 1, so that ∂03 ∈
/ F ⊥.
Example 12.2.17 In the case n = 1, i.e., R = C[x0 , xn ] is a polynomial ring in two
variables, consider F = x0d + x1d . Then F ⊥ is the ideal generated by

∂0 ∂1 , ∂0d − ∂1d , ∂0d+1 .

Example 12.2.18 If F is a form of rank 1, i.e., F = L d for some linear form L, then
it is easy to compute F ⊥ . Write L = a0 x0 + · · · + an xn . Put v0 = (a0 , . . . , an ) and
extend to a basis v0 , v1 , . . . , vn of Cn+1 , where each vi , i > 0, is orthogonal to v0 .
Set vi = (ai0 , . . . , ain ). Define

D0 = ā0 ∂0 + · · · + ān ∂n , Di = ai0 ∂0 + · · · + ain ∂n .

Then F ⊥ is the ideal generated by D0d+1 , D1 , . . . , Dn .


208 12 Secant Varieties

Notice that if we consider S as a normal polynomial ring, then F ⊥ contains


the ideal of all polynomials whose evaluation on (a0 , . . . , an ), which is a set of
coordinates for the point P ∈ Pn associated with L, vanishes.

The previous example provides a way in which one can use F ⊥ to determine the
rank of a symmetric tensor, i.e., of a polynomial form.
Namely, if F = L d1 + · · · + L rd , where L i is a linear form associated to the point
Pi ∈ Pn , then clearly F ⊥ contains the intersection of the homogeneous ideals in S
associated to the points Pi ’s.
It follows that

Proposition 12.2.19 The rank r of F is the minimum such that there exist a finite
set of points {P1 , . . . , Pr } ⊂ Pn such that the intersection of the ideals associated to
the Pi ’s is contained in F ⊥ .
The linear forms associated to the points Pi ’s provide a decomposition for L.

Example 12.2.20 Consider the case of two variables x0 , x1 , i.e., n = 1, and let F =
x02 x12 . F is thus a monomial, but this by no means implies that it is easy to compute
a decomposition of F in terms of (quartic) powers of linear forms.
In the specific case, one computes that F ⊥ is the ideal generated by ∂03 , ∂13 . Since
the ideal of any point [a : b] ∈ Pn = P1 is generated by one linear form b∂1 − a∂0 ,
then the intersection of two ideals of points is simply the ideal generated by the
product of the two generators. By induction, it follows that in order to determine
a decomposition of F with r summands, we must find r distinct linear forms in S
whose product is contained in the ideal generated by ∂03 , ∂13 . It is almost immediate
to see that we cannot find such linear forms if r = 2, thus the rank of F is bigger
than 2. It is possible to find a product of three linear forms which lies in F ⊥ , namely:
√ √
1+i 3 3+i 3
(∂0 − ∂1 )(∂0 − ( )∂1 )(∂0 + ( )∂1 ).
2 2
It follows from the theory that F has rank 3, and √ it is a linear combination

of the 4-th
powers of the linear forms (x0 − x1 ), (x0 − ( 2 )x1 ), (x0 + ( 2 )x1 ).
1+i 3 3+i 3

It is laborious, but possible, to prove that indeed:


√ √
−1 + i 3 1+i 3
x02 x12 = a(x0 − x1 )4 + b(x0 − ( )x1 )4 + c(x0 + ( )x1 )4 ,
2 2
√ √
−1+i 3
where a = 1
18
, b= 36
and c = − 1+i36 3 .

In general, it is not easy to determine which intersections of ideals of points are


contained in a given ideal F ⊥ . The solution is known, by now, only for a few special
types of forms F.
12.3 Exercises 209

12.3 Exercises

Exercise 43 Prove the first claim of Remark 12.1.4: if Y1 = · · · = Yk = Y , then the


set of varieties Y1 , . . . , Yk is independent exactly when Y is not contained in a linear
subspace of (projective) dimension k − 1.

Exercise 44 Prove the statement of Proposition 12.1.13: for every Y ⊂ Pn and for
k < k ≤ n + 1, we always have Sk ⊆ Sk .

Exercise 45 Prove that when Y ⊂ P14 is the image of the 4-veronese map of P2 ,
then dim(AS5 (Y )) = 14, but dim(S5 (Y )) < 14.
Chapter 13
Groebner Bases

Groebner bases represent the most powerful tool for computational algebra, in par-
ticular for the study of polynomial ideals. In this chapter, based on [1, Chap. 2], we
give a brief survey on the subject. For a deeper study of it, we suggest [1, 2].
Before treating Groebner bases, it is mandatory to recall many concepts, starting
from the one of monomial ordering.

13.1 Monomial Orderings

In general, when we are working with polynomials in a single variable x, there is an


underlined ordering among monomials in function of the degree of each monomial:
x α > x β if α > β. To be more precise we can say that we are working with an
ordering, on the degree, on the monomials in one variable:

· · · > x m+1 > x m > · · · > x 2 > x > 1.

When we are working in a polynomial ring with n variables x1 , x2 , . . . , xn the


situation is more complicated. As a matter of fact, it is not obvious to have an
underlined ordering, since this would mean to fix, first of all, an order of “importance”
on the variables.
We define now different orderings on k[x1 , . . . , xn ], which will be useful in differ-
ent contexts. Before to do that is is important to observe that we can always recover a
monomial x α = x1α1 · · · xnαn from its n−tuple of exponents (α1 , . . . , αn ) ∈ Zn≥0 . This
fact establishes an injective correspondence between the monomials in k[x1 , . . . , xn ]
and Zn≥0 . In addition, each ordering between the vectors of Zn≥0 defines an ordering
between monomials: if α > β, where > is a given ordering on Zn≥0 , then we will say
that x α > x β .

© Springer Nature Switzerland AG 2019 211


C. Bocci and L. Chiantini, An Introduction to Algebraic Statistics
with Tensors, UNITEXT 118, https://doi.org/10.1007/978-3-030-24624-2_13
212 13 Groebner Bases

Since a polynomial is a sum of monomials, we want to be able to write its terms by


ordering them in ascending or descending order (and, of course, in an unambiguous
way). To do this we need:
(i) that it is possible to compare any two monomials. This requires that the ordering
is a total ordering: given the monomials x α and x β , only one of the following
statement must be true

x α > x β , x α = x β , x β > x α.

(ii) to take into consideration the effects of the sum and product operations on the
monomials. When we add polynomials, after collecting the terms, we can simply
rearrange the terms. Multiplication could give problems if multiplying a poly-
nomial for a monomial, the ordering of terms changed. In order to avoid this, we
require that if x α > x β and x γ are monomials, then x α x γ > x β x γ .

Remark 13.1.1 If we consider an ordering on Zn≥0 , then property (ii) means that if
α > β then, for any γ ∈ Zn≥0 , α + γ > β + γ.

Definition 13.1.2 A monomial ordering on k[x1 , . . . , xn ] is any relation > on Zn≥0


or, equivalently, any relation on the set of monomials x α , α ∈ Zn≥0 satisfying
(i) > is a total ordering on Zn≥0 ;
(ii) If α > β and γ ∈ Zn≥0 , then α + γ > β + γ;
(iii) x α > 1 for every nonzero α.

Remark 13.1.3 Property (iii) is equivalent to the fact that > is a well ordering on
Zn≥0 , that is any non-empty subset of Zn≥0 has a smallest element with respect to >. It
is not difficult to prove that this implies that each sequence in Zn≥0 , strictly decreasing,
at some point ends. This fact will be fundamental when we want to prove that some
algorithm stops in a finite number of steps as some terms decrease strictly.

We pass now to introduce the most frequently used orderings.

Definition 13.1.4 Let α = (α1 , . . . , αn ) and β = (β1 , . . . , βn ) be elements of Zn≥0 .


(lex) We say that α >lex β if, in the vector α − β ∈ Zn≥0 , the first nonzero
entry, starting from left, is positive. We write x α >lex x β if α >lex β
(lexicographic ordering).
(grlex) We say that α >grlex β if,


n 
n
|α| = αi > |β| = βi or |α| = |β| and α >lex β.
i=1 i=1

We write x α >grlex x β if α >grlex β (graded lexicographic ordering).


13.1 Monomial Orderings 213

(grevlex) We say that α >gr evlex β if,


n 
n
|α| = αi > |β| = βi or |α| = |β|
i=1 i=1

and the first nonzero entry, starting from right, is negative. We write
x α >gr evlex x β if α >gr evlex β (reverse graded lexicographic ordering).

Example 13.1.5
(1) (1, 2, 5, 4) >lex (0, 2, 4, 6) since (1, 2, 5, 4) − (0, 2, 4, 6) = (1, 0, 1, −2);
(2) (3, 2, 2, 4) >lex (3, 2, 2, 3) since (3, 2, 2, 4) − (3, 2, 2, 3) = (0, 0, 0, 1);
(3) (1, 3, 1, 4) <lex (2, 3, 2, 1) since the first entry, from left of (1, 3, 1, 4) −
(2, 3, 2, 1) = (−1, 0, 0, 3) is negative;
(4) (1, 2, 3, 5) <grlex (0, 1, 5, 6) since |(1, 2, 3, 5)| = 11 < 12 = |(0, 1, 5, 6)|;
(5) (4, 2, 2, 4) >grlex (4, 2, 1, 5) since |(4, 2, 2, 4)| = |(4, 2, 1, 5)| = 12 and
(4, 2, 2, 4) >lex (4, 2, 1, 5)) (in fact (4, 2, 2, 4) − (4, 2, 1, 5)) = (0, 0, 1, −1);
(6) (4, 3, 2, 1) <gr evlex (0, 2, 4, 6) since |(4, 3, 2, 1)| = 10 < 12 = |(0, 2, 4, 6)|;
(7) (1, 3, 4, 4) >gr evlex (2, 3, 2, 5) since |(3, 1, 2, 4)| = |(3, 1, 1, 5)| = 12 and the
first nonzero entry, from right, of (1, 3, 4, 4) − (2, 3, 2, 5) = (−1, 0, 2, −1) is
negative.

To each variable $x_i$ we associate the vector of $\mathbb{Z}^n_{\geq 0}$ with $1$ in the $i$-th position and zero elsewhere. It is easy to check that
$$(1, 0, \ldots, 0) >_{lex} (0, 1, \ldots, 0) >_{lex} \cdots >_{lex} (0, 0, \ldots, 0, 1),$$
from which we get $x_1 >_{lex} x_2 >_{lex} \cdots >_{lex} x_n$. In practice, if we are working with variables $x, y, z, \ldots$ we assume that the alphabetic ordering among variables, $x > y > z > \cdots$, is used to define the lexicographic ordering among monomials.
In the lexicographic ordering, each variable dominates any monomial composed only of smaller variables. For example, $x_1 >_{lex} x_2^4 x_3^5 x_4$ since $(1, 0, 0, 0) - (0, 4, 3, 1) = (1, -4, -3, -1)$. Roughly speaking, the lexicographic ordering does not take into account the total degree of a monomial and, for this reason, we introduce the graded lexicographic ordering and the reverse graded lexicographic ordering. The two orderings behave in different ways: both use the total degree of the monomials, but grlex uses the ordering lex and therefore "favors" the greater power of the first variable, while grevlex, looking at the first negative entry from the right, "favors" the smaller power of the last variable. For example,
$$x_1^4 x_2 x_3^2 >_{grlex} x_1^3 x_2^3 x_3 \qquad \text{and} \qquad x_1^3 x_2^3 x_3 >_{grevlex} x_1^4 x_2 x_3^2.$$
It is important to notice that there are $n!$ orderings of type grlex and grevlex, according to the ordering we give to the monomials of degree $1$. For example, for two variables we can have $x_1 < x_2$ or $x_2 < x_1$.
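These comparisons are easy to check by machine. The following is a small computational sketch, not part of the original text and assuming the Python library SymPy; it compares the exponent vectors of $x_1^4 x_2 x_3^2$ and $x_1^3 x_2^3 x_3$ under the three orderings just defined.

    # A sketch (assuming SymPy): comparing exponent vectors under lex, grlex, grevlex.
    from sympy.polys.orderings import lex, grlex, grevlex

    a = (4, 1, 2)   # exponent vector of x1**4 * x2 * x3**2
    b = (3, 3, 1)   # exponent vector of x1**3 * x2**3 * x3

    # Each ordering maps an exponent vector to a sort key;
    # a monomial is greater exactly when its key is greater.
    print(lex(a) > lex(b))          # True
    print(grlex(a) > grlex(b))      # True (equal total degree 7, lex breaks the tie)
    print(grevlex(a) > grevlex(b))  # False (grevlex favors the smaller power of x3)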
Example 13.1.6 Let us show how monomial orderings are applied to polynomials. If $f \in k[x_1, \ldots, x_n]$ and we have chosen a monomial ordering $>$, then we can order the terms of $f$ with respect to $>$ in an unambiguous way. Consider, for example, $f = 2x_1^2x_2^2 + 3x_2^4x_3 - 5x_1^4x_2x_3^2 + 7x_1^3x_2^3x_3$. With respect to the lexicographic ordering, $f$ is written as
$$f = -5x_1^4x_2x_3^2 + 7x_1^3x_2^3x_3 + 2x_1^2x_2^2 + 3x_2^4x_3.$$
With respect to the graded lexicographic ordering, $f$ is written as
$$f = -5x_1^4x_2x_3^2 + 7x_1^3x_2^3x_3 + 3x_2^4x_3 + 2x_1^2x_2^2.$$
With respect to the reverse graded lexicographic ordering, $f$ is written as
$$f = 7x_1^3x_2^3x_3 - 5x_1^4x_2x_3^2 + 3x_2^4x_3 + 2x_1^2x_2^2.$$



Definition 13.1.7 Let $f = \sum_\alpha a_\alpha x^\alpha$ be a nonzero polynomial in $k[x_1, \ldots, x_n]$ and let $>$ be a monomial ordering.
(i) The multidegree of $f$ is
$$\mathrm{multideg}(f) = \max\{\alpha \in \mathbb{Z}^n_{\geq 0} : a_\alpha \neq 0\}.$$
(ii) The leading coefficient of $f$ is
$$LC(f) = a_{\mathrm{multideg}(f)} \in k.$$
(iii) The leading monomial of $f$ is
$$LM(f) = x^{\mathrm{multideg}(f)}.$$
(iv) The leading term of $f$ is
$$LT(f) = LC(f) \cdot LM(f).$$

If we consider the polynomial $f = 2x_1^2x_2^2 + 3x_2^4x_3 - 5x_1^4x_2x_3^2 + 7x_1^3x_2^3x_3$ of Example 13.1.6, once the reverse graded lexicographic ordering is chosen, one has:
$$\mathrm{multideg}(f) = (3, 3, 1), \quad LC(f) = 7, \quad LM(f) = x_1^3x_2^3x_3, \quad LT(f) = 7x_1^3x_2^3x_3.$$
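These quantities can be computed mechanically. Here is a small sketch, not part of the original text and assuming SymPy, which recovers the data above for the reverse graded lexicographic ordering.

    # A sketch (assuming SymPy): multidegree and leading data of f under grevlex.
    from sympy import symbols, LC, LM, LT, Poly

    x1, x2, x3 = symbols('x1 x2 x3')
    f = 2*x1**2*x2**2 + 3*x2**4*x3 - 5*x1**4*x2*x3**2 + 7*x1**3*x2**3*x3

    print(Poly(f, x1, x2, x3).monoms(order='grevlex')[0])  # multideg(f): (3, 3, 1)
    print(LC(f, x1, x2, x3, order='grevlex'))              # 7
    print(LM(f, x1, x2, x3, order='grevlex'))              # x1**3*x2**3*x3
    print(LT(f, x1, x2, x3, order='grevlex'))              # 7*x1**3*x2**3*x3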
Lemma 13.1.8 Let $f, g \in k[x_1, \ldots, x_n]$ be nonzero polynomials. Then:
(i) $\mathrm{multideg}(fg) = \mathrm{multideg}(f) + \mathrm{multideg}(g)$;
(ii) if $f + g \neq 0$, then $\mathrm{multideg}(f + g) \leq \max\{\mathrm{multideg}(f), \mathrm{multideg}(g)\}$. If, moreover, $\mathrm{multideg}(f) \neq \mathrm{multideg}(g)$, then equality holds.

From now on we will always assume that a particular monomial ordering has been chosen, and therefore that $LC(f)$, $LM(f)$ and $LT(f)$ are computed relative to that monomial ordering only.
The concepts introduced in Definition 13.1.7 permit us to extend the classical division algorithm for polynomials in one variable, i.e., $f \in k[x]$, to the case of polynomials in several variables, $f \in k[x_1, \ldots, x_n]$. In the general case, this means dividing $f \in k[x_1, \ldots, x_n]$ by $f_1, \ldots, f_t \in k[x_1, \ldots, x_n]$, which is equivalent to writing $f$ as
$$f = a_1 f_1 + \cdots + a_t f_t + r,$$
where the $a_i$'s and $r$ are elements of $k[x_1, \ldots, x_n]$. The idea of this division algorithm is the same as in the case of a single variable: we multiply $f_1$ by a suitable $a_1$ in such a way as to cancel the leading term of $f$, obtaining $f = a_1 f_1 + r_1$. Then we multiply $f_2$ by a suitable $a_2$ in such a way as to cancel the leading term of $r_1$, obtaining $r_1 = a_2 f_2 + r_2$, and hence $f = a_1 f_1 + a_2 f_2 + r_2$. We proceed in the same manner for the other polynomials $f_3, \ldots, f_t$. The following theorem assures the correctness of the algorithm.

Theorem 13.1.9 Let $>$ be a fixed monomial ordering on $\mathbb{Z}^n_{\geq 0}$ and let $F = (f_1, \ldots, f_t)$ be an ordered $t$-uple of polynomials in $k[x_1, \ldots, x_n]$. Then any $f \in k[x_1, \ldots, x_n]$ can be written as
$$f = a_1 f_1 + \cdots + a_t f_t + r,$$
where $a_i, r \in k[x_1, \ldots, x_n]$ and either $r = 0$ or $r$ is a linear combination of monomials, with coefficients in $k$, none of them divisible by any of the leading terms $LT(f_1), \ldots, LT(f_t)$. We say that $r$ is the remainder of the division of $f$ by $F$. Moreover, if $a_i f_i \neq 0$, then
$$\mathrm{multideg}(f) \geq \mathrm{multideg}(a_i f_i).$$

The proof of Theorem 13.1.9, which we do not include here (see [1, Theorem 3, p. 61]), is based on the fact that the division algorithm terminates after a finite number of steps, which is a consequence of the fact that $>$ is a well ordering (see Remark 13.1.3).

Example 13.1.10 We divide $f = x^2y + xy - x - 2y$ by $F = (f_1, f_2)$, where $f_1 = xy + y$ and $f_2 = x + y$, using the lexicographic ordering. Both the leading terms $LT(f_1) = xy$ and $LT(f_2) = x$ divide the leading term of $f$, $LT(f) = x^2y$. Since $F$ is ordered, we start dividing by $f_1$:
$$a_1 = \frac{LT(f)}{LT(f_1)} = x.$$
Then we subtract $a_1 f_1$ from $f$:
$$g = f - a_1 f_1 = x^2y + xy - x - 2y - x(xy + y) = -x - 2y.$$
The leading term of this polynomial, $LT(g) = -x$, is divisible by that of $f_2$, and hence we compute
$$a_2 = \frac{LT(g)}{LT(f_2)} = -1, \qquad r = g - a_2 f_2 = -x - 2y + x + y = -y.$$
Hence one has
$$f = x \cdot (xy + y) - 1 \cdot (x + y) - y.$$
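In SymPy this division is available through the function reduced(), which returns both the quotients and the remainder. The following sketch, ours and not the book's, reproduces the computation above.

    # A sketch (assuming SymPy): dividing f by the ordered pair (f1, f2) in lex order.
    from sympy import symbols, reduced

    x, y = symbols('x y')
    f  = x**2*y + x*y - x - 2*y
    f1 = x*y + y
    f2 = x + y

    quotients, r = reduced(f, [f1, f2], x, y, order='lex')
    print(quotients, r)  # expected: [x, -1] and -y, i.e. f = x*f1 - f2 - y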

Unfortunately, the division algorithm of Theorem 13.1.9 does not behave as well as in the case of a single variable. This is shown in the following two examples.

Example 13.1.11 We divide $f = x^2y + 4xy^2 - 2x$ by $F = (f_1, f_2)$, where $f_1 = xy + y$ and $f_2 = x + y$, again using the lexicographic ordering. Proceeding as in the previous example we get
$$a_1 = \frac{LT(f)}{LT(f_1)} = \frac{x^2y}{xy} = x, \qquad g = f - a_1 f_1 = 4xy^2 - xy - 2x,$$
$$a_2 = \frac{LT(g)}{LT(f_2)} = \frac{4xy^2}{x} = 4y^2, \qquad r = g - a_2 f_2 = -xy - 2x - 4y^3.$$
Notice that the leading term of the remainder, $LT(r) = -xy$, is still divisible by the leading term of $f_1$. Hence we can divide by $f_1$ again, getting
$$a_1' = \frac{LT(r)}{LT(f_1)} = \frac{-xy}{xy} = -1, \qquad r' = r - a_1' f_1 = -2x - 4y^3 + y.$$
Again the leading term of the new remainder, $LT(r') = -2x$, is still divisible by the leading term of $f_2$. Hence we can divide by $f_2$ again, getting
$$a_2' = \frac{LT(r')}{LT(f_2)} = \frac{-2x}{x} = -2, \qquad r'' = r' - a_2' f_2 = -4y^3 + 3y.$$
Hence one has
$$f = x(xy + y) + 4y^2(x + y) - 1 \cdot (xy + y) - 2(x + y) - 4y^3 + 3y = (x - 1)(xy + y) + (4y^2 - 2)(x + y) - 4y^3 + 3y.$$
Example 13.1.12 Another problem of the division algorithm in $k[x_1, \ldots, x_n]$ concerns the fact that, changing the order of the $f_i$'s, the values of the $a_i$ and $r$ can change. In particular, the remainder $r$ is not uniquely determined. Consider, for example, the polynomial $f = x^2y + 4xy^2 - 2x$ of the previous example, dividing first by $f_2 = x + y$ and then by $f_1 = xy + y$:
$$a_2 = \frac{LT(f)}{LT(f_2)} = \frac{x^2y}{x} = xy, \qquad g = f - a_2 f_2 = 3xy^2 - 2x,$$
$$a_1 = \frac{LT(g)}{LT(f_1)} = \frac{3xy^2}{xy} = 3y, \qquad r = g - a_1 f_1 = -2x - 3y^2.$$
Notice that the leading term of the remainder, $LT(r) = -2x$, is still divisible by the leading term of $f_2$. Hence we can divide by $f_2$ again, getting
$$a_2' = \frac{LT(r)}{LT(f_2)} = \frac{-2x}{x} = -2, \qquad r' = r - a_2' f_2 = -3y^2 + 2y,$$
giving a remainder, $-3y^2 + 2y$, which is different from the one obtained in Example 13.1.11.
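The dependence of the remainder on the order of the divisors is easy to observe by machine; in the following sketch (ours, assuming SymPy) the two calls differ only in the order of the divisor list. Note that SymPy's own divisor-selection strategy at each step may differ from the step-by-step choices made by hand above, so the exact remainders need not coincide with those computed in the two examples; the point is only that they differ from each other.

    # A sketch (assuming SymPy): the remainder depends on how the divisors are listed.
    from sympy import symbols, reduced

    x, y = symbols('x y')
    f  = x**2*y + 4*x*y**2 - 2*x
    f1 = x*y + y
    f2 = x + y

    _, r12 = reduced(f, [f1, f2], x, y, order='lex')  # try f1 first
    _, r21 = reduced(f, [f2, f1], x, y, order='lex')  # try f2 first
    print(r12)
    print(r21)  # in general r12 != r21: the remainder is not uniquely determined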

From the previous examples we can conclude that the division algorithm in $k[x_1, \ldots, x_n]$ is an imperfect generalization of the one-variable case. To overcome these problems it will be necessary to introduce Groebner bases. The basic idea is that, when we work with a set of polynomials $f_1, \ldots, f_t$, we are really working with the ideal $I = \langle f_1, \ldots, f_t \rangle$ generated by them. This gives us the ability to switch from $f_1, \ldots, f_t$ to a different set of generators of $I$ with better properties with respect to the division algorithm. Before introducing Groebner bases we recall some concepts and results that will be useful to us.

13.2 Monomial Ideals

Definition 13.2.1 An ideal $I \subset k[x_1, \ldots, x_n]$ is a monomial ideal if there exists a subset $A \subset \mathbb{Z}^n_{\geq 0}$ (possibly infinite) such that $I$ consists of all polynomials which are finite sums of the form $\sum_{\alpha \in A} h_\alpha x^\alpha$, where $h_\alpha \in k[x_1, \ldots, x_n]$. In such a case, we write $I = \langle x^\alpha : \alpha \in A \rangle$.

An example of a monomial ideal is given by $I = \langle x^5y^2, x^3y^3, x^2y^4 \rangle$.


It is possible to characterize all the monomials that are in a given monomial ideal.

Lemma 13.2.2 Let $I = \langle x^\alpha : \alpha \in A \rangle$ be a monomial ideal. Then a monomial $x^\beta$ is in $I$ if and only if $x^\beta$ is divisible by $x^\alpha$ for some $\alpha \in A$.
Proof Let $x^\beta$ be a multiple of $x^\alpha$ for some $\alpha \in A$; then $x^\beta \in I$, by definition of ideal. On the other hand, if $x^\beta \in I$ then
$$x^\beta = \sum_{i=1}^t h_i x^{\alpha_i}, \tag{13.2.1}$$
where $h_i \in k[x_1, \ldots, x_n]$ and $\alpha_i \in A$. Writing each $h_i$ as a combination of monomials, we can observe that each term on the right side of (13.2.1) is divisible by some $x^{\alpha_i}$. Hence the left side $x^\beta$ of (13.2.1) must have the same property, i.e., it is divisible by some $x^{\alpha_i}$. $\square$

Observe that $x^\beta$ is divisible by $x^\alpha$ when $x^\beta = x^\alpha \cdot x^\gamma$ for some $\gamma \in \mathbb{Z}^n_{\geq 0}$, which is equivalent to requiring $\beta = \alpha + \gamma$. Hence, the set
$$\alpha + \mathbb{Z}^n_{\geq 0} = \{\alpha + \gamma : \gamma \in \mathbb{Z}^n_{\geq 0}\}$$
consists of the exponents of the monomials which are divisible by $x^\alpha$. This fact, together with Lemma 13.2.2, permits us to give a graphical description of the monomials in a given monomial ideal. For example, if $I = \langle x^5y^2, x^3y^3, x^2y^4 \rangle$, then the exponents of the monomials in $I$ form the set
$$\left((5, 2) + \mathbb{Z}^2_{\geq 0}\right) \cup \left((3, 3) + \mathbb{Z}^2_{\geq 0}\right) \cup \left((2, 4) + \mathbb{Z}^2_{\geq 0}\right).$$
We can visualize this set as the union of the integer points in three translated copies of the first quadrant in the plane, as shown in Fig. 13.1.
The following lemma allows us to decide whether a polynomial $f$ lies in a monomial ideal $I$ by looking at the monomials of $f$.

Lemma 13.2.3 Let $I$ be a monomial ideal and consider $f \in k[x_1, \ldots, x_n]$. Then the following conditions are equivalent:
(i) $f \in I$;
(ii) every term of $f$ is in $I$;
(iii) $f$ is a linear combination of monomials in $I$.
One of the main results on monomial ideals is the so-called Dickson's Lemma, which assures us that every monomial ideal is generated by a finite number of monomials. For the proof, the interested reader can consult [1, Theorem 5, Chap. 2.4].

Lemma 13.2.4 (Dickson's Lemma) A monomial ideal $I = \langle x^\alpha : \alpha \in A \rangle \subset k[x_1, \ldots, x_n]$ can be written as $I = \langle x^{\alpha_1}, x^{\alpha_2}, \ldots, x^{\alpha_t} \rangle$, where $\alpha_1, \alpha_2, \ldots, \alpha_t \in A$. In particular, $I$ has a finite basis.
In practice, Dickson’s Lemma follows immediately from the Basis Theorem 9.1.22,
which has an independent proof. Since we did not provide a proof of Theorem 9.1.22,
for the sake of completeness we show how, conversely, Dickson’s Lemma can pro-
vide, as a corollary, a proof of a weak version of the Basis Theorem.

Fig. 13.1 $\left((5, 2) + \mathbb{Z}^2_{\geq 0}\right) \cup \left((3, 3) + \mathbb{Z}^2_{\geq 0}\right) \cup \left((2, 4) + \mathbb{Z}^2_{\geq 0}\right)$

Theorem 13.2.5 (Basis Theorem, weak version) Any ideal $I \subset k[x_1, \ldots, x_n]$ has a finite basis, that is, $I = \langle g_1, \ldots, g_t \rangle$ for some $g_1, \ldots, g_t \in I$.
Before proving the Hilbert Basis Theorem we introduce some concepts.
Definition 13.2.6 Let $I \subset k[x_1, \ldots, x_n]$ be an ideal different from the zero ideal $\{0\}$.
(i) We denote by $LT(I)$ the set of leading terms of $I$:
$$LT(I) = \{c x^\alpha : \text{there exists } f \in I \text{ with } LT(f) = c x^\alpha\}.$$
(ii) We denote by $\langle LT(I) \rangle$ the ideal generated by the elements of $LT(I)$.


Given an ideal $I = \langle f_1, \ldots, f_t \rangle$, we observe that $\langle LT(f_1), \ldots, LT(f_t) \rangle$ is not necessarily equal to $\langle LT(I) \rangle$. It is true that $LT(f_i) \in LT(I) \subset \langle LT(I) \rangle$, from which it follows that $\langle LT(f_1), \ldots, LT(f_t) \rangle \subset \langle LT(I) \rangle$. However, $\langle LT(I) \rangle$ can strictly contain $\langle LT(f_1), \ldots, LT(f_t) \rangle$.
Example 13.2.7 Let $I = \langle f_1, f_2 \rangle$ with $f_1 = x^3y - x^2 + x$ and $f_2 = x^2y^2 - xy$, where the ordering grlex is chosen. Since
$$y \cdot (x^3y - x^2 + x) - x \cdot (x^2y^2 - xy) = xy,$$
one has $xy \in I$, from which $xy = LT(xy) \in \langle LT(I) \rangle$. However, $xy$ is divisible neither by $LT(f_1) = x^3y$ nor by $LT(f_2) = x^2y^2$, and hence, by Lemma 13.2.2, $xy \notin \langle LT(f_1), LT(f_2) \rangle$.
Proposition 13.2.8 Let $I \subset k[x_1, \ldots, x_n]$ be an ideal.
(i) $\langle LT(I) \rangle$ is a monomial ideal.
(ii) There exist $g_1, \ldots, g_t \in I$ such that $\langle LT(I) \rangle = \langle LT(g_1), \ldots, LT(g_t) \rangle$.

Proof For (i), notice that the leading monomials $LM(g)$ of the elements $g \in I \setminus \{0\}$ generate the monomial ideal $J := \langle LM(g) : g \in I \setminus \{0\} \rangle$. Since $LM(g)$ and $LT(g)$ differ only by a nonzero constant, one has $J = \langle LT(g) : g \in I \setminus \{0\} \rangle = \langle LT(I) \rangle$. Hence $\langle LT(I) \rangle$ is a monomial ideal.
For (ii), since $\langle LT(I) \rangle$ is generated by the monomials $LM(g)$ with $g \in I \setminus \{0\}$, by Dickson's Lemma we know that $\langle LT(I) \rangle = \langle LM(g_1), LM(g_2), \ldots, LM(g_t) \rangle$ for a finite number of polynomials $g_1, g_2, \ldots, g_t \in I$. Since $LM(g_i)$ and $LT(g_i)$ differ only by a nonzero constant, for $i = 1, \ldots, t$, one has $\langle LT(I) \rangle = \langle LT(g_1), LT(g_2), \ldots, LT(g_t) \rangle$. $\square$

Using Proposition 13.2.8 and the division algorithm of Theorem 13.1.9 we can prove Theorem 13.2.5.

Proof of the Hilbert Basis Theorem. If $I = \{0\}$ then, as a set of generators, we take $\{0\}$, which is clearly finite. If $I$ contains some nonzero polynomial, then a set of generators $g_1, \ldots, g_t$ for $I$ can be built in the following way. By Proposition 13.2.8 there exist $g_1, \ldots, g_t \in I$ such that $\langle LT(I) \rangle = \langle LT(g_1), LT(g_2), \ldots, LT(g_t) \rangle$. We prove that $I = \langle g_1, \ldots, g_t \rangle$.
Clearly $\langle g_1, \ldots, g_t \rangle \subset I$ since, for any $i = 1, \ldots, t$, $g_i \in I$. On the other hand, let $f \in I$ be a polynomial. We apply the division algorithm of Theorem 13.1.9 to divide $f$ by $g_1, \ldots, g_t$. We get
$$f = a_1 g_1 + \cdots + a_t g_t + r,$$
where no term of $r$ is divisible by any of the leading terms $LT(g_i)$. We show that $r = 0$. To this aim, we observe, first of all, that
$$r = f - a_1 g_1 - \cdots - a_t g_t \in I.$$
If $r \neq 0$ then $LT(r) \in \langle LT(I) \rangle = \langle LT(g_1), LT(g_2), \ldots, LT(g_t) \rangle$ and, by Lemma 13.2.2, it follows that $LT(r)$ must be divisible by at least one leading term $LT(g_i)$. This contradicts the definition of the remainder of the division, and hence $r$ must be equal to zero, from which one has
$$f = a_1 g_1 + \cdots + a_t g_t + 0 \in \langle g_1, \ldots, g_t \rangle,$$
which proves $I \subset \langle g_1, \ldots, g_t \rangle$. $\square$


13.3 Groebner Basis

A Groebner basis is a "good" basis for the division algorithm of Theorem 13.1.9. Here "good" means that the problems of Examples 13.1.11 and 13.1.12 do not occur. Let us think back to Theorem 13.2.5: the basis used in the proof has the particular property that $\langle LT(g_1), \ldots, LT(g_t) \rangle = \langle LT(I) \rangle$. It is not true that every basis of $I$ has this property, and so we give a specific name to the bases having it.

Definition 13.3.1 Fix a monomial ordering. A finite subset $G = \{g_1, \ldots, g_t\}$ of an ideal $I$ is a Groebner basis if
$$\langle LT(g_1), \ldots, LT(g_t) \rangle = \langle LT(I) \rangle.$$

The following result guarantees us that every ideal has a Groebner basis.

Corollary 13.3.2 Fix a monomial ordering. Then every ideal $I \subset k[x_1, \ldots, x_n]$ different from $\{0\}$ admits a Groebner basis. Moreover, every Groebner basis for an ideal $I$ is a basis of $I$.

Proof Given an ideal $I$, different from the zero ideal, the set $G = \{g_1, \ldots, g_t\}$ built as in the proof of Theorem 13.2.5 is a Groebner basis by definition. To prove the second part of the statement it is enough to observe that, again, the proof of Theorem 13.2.5 assures us that $I = \langle g_1, \ldots, g_t \rangle$, that is, $G$ is a basis for $I$. $\square$

Consider the ideal $I = \langle f_1, f_2 \rangle$ of Example 13.2.7. According to Definition 13.3.1, $\{f_1, f_2\} = \{x^3y - x^2 + x, x^2y^2 - xy\}$ is not a Groebner basis.
Before showing how to find a Groebner basis for a given ideal, we point out some of their properties.

Proposition 13.3.3 Let $G = \{g_1, \ldots, g_t\}$ be a Groebner basis for an ideal $I \subset k[x_1, \ldots, x_n]$ and let $f \in k[x_1, \ldots, x_n]$ be a polynomial. Then there exists a unique $r \in k[x_1, \ldots, x_n]$ such that:
(i) no monomial of $r$ is divisible by $LT(g_1), \ldots, LT(g_t)$;
(ii) there exists $g \in I$ such that $f = g + r$.
In particular, $r$ is the remainder of the division of $f$ by $G$ using the division algorithm, no matter how the elements of $G$ are listed.

Proof The division algorithm applied to $f$ and $G$ gives $f = a_1g_1 + \cdots + a_tg_t + r$, where $r$ satisfies (i). In order that (ii) be satisfied as well, it is sufficient to take $g = a_1g_1 + \cdots + a_tg_t \in I$. This proves the existence of $r$. To prove uniqueness, suppose that $f = g + r = \tilde{g} + \tilde{r}$ both satisfy (i) and (ii). Then $\tilde{r} - r = g - \tilde{g} \in I$ and hence, if $\tilde{r} \neq r$, then $LT(\tilde{r} - r) \in \langle LT(I) \rangle = \langle LT(g_1), \ldots, LT(g_t) \rangle$. By Lemma 13.2.2 it follows that $LT(\tilde{r} - r)$ is divisible by some $LT(g_i)$. This is impossible, since no term of $r$ and $\tilde{r}$ is divisible by $LT(g_1), \ldots, LT(g_t)$. Hence $\tilde{r} - r$ must be zero, and uniqueness is proved. $\square$
Remark 13.3.4 The remainder $r$ of Proposition 13.3.3 is usually called the normal form of $f$. Proposition 13.3.3 tells us that Groebner bases can be characterized through the uniqueness of the remainder. Observe that, even though the remainder is unique, independently of the order in which we divide $f$ by the $LT(g_i)$'s, the coefficients $a_i$ in $f = a_1g_1 + \cdots + a_tg_t + r$ are not unique.

As a corollary of Proposition 13.3.3 we get the following criterion to establish whether a polynomial is contained in a given ideal.

Corollary 13.3.5 Let $G = \{g_1, \ldots, g_t\}$ be a Groebner basis for an ideal $I \subset k[x_1, \ldots, x_n]$ and let $f \in k[x_1, \ldots, x_n]$. Then $f \in I$ if and only if the remainder of the division of $f$ by $G$ is zero.
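This criterion is exactly how computer algebra systems test ideal membership. A small sketch, ours and assuming SymPy, for the ideal of Example 13.2.7:

    # A sketch (assuming SymPy): ideal membership via reduction modulo a Groebner basis.
    from sympy import symbols, groebner

    x, y = symbols('x y')
    G = groebner([x**3*y - x**2 + x, x**2*y**2 - x*y], x, y, order='grlex')
    print(G.contains(x**2*y - x*y))  # True: x**2*y - x*y = (x - 1)*x*y lies in the ideal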
Definition 13.3.6 We write $\overline{f}^F$ for the remainder of the division of $f$ by an ordered $t$-uple $F = (f_1, \ldots, f_t)$. If $F$ is a Groebner basis for $\langle f_1, \ldots, f_t \rangle$, then we can regard $F$ as an unordered set, by Proposition 13.3.3.

Example 13.3.7 Consider the polynomial $f = x^2y + 4xy^2 - 2x$ and $F = (f_1, f_2)$ with $f_1 = xy + y$ and $f_2 = x + y$. From Example 13.1.11 we know that
$$\overline{f}^F = -4y^3 + 3y.$$
On the other side, if we consider $F' = (f_2, f_1)$, then, from Example 13.1.12, we get
$$\overline{f}^{F'} = -3y^2 + 2y.$$

Let us now explain how it is possible to build a Groebner basis for an ideal $I$ from a set of generators $f_1, \ldots, f_t$ of $I$. As we saw before, one of the reasons why $\{f_1, \ldots, f_t\}$ may fail to be a Groebner basis is that there can be a combination of the $f_i$'s whose leading term does not lie in the ideal generated by the $LT(f_i)$'s. This happens, for example, when the leading terms of a combination $ax^\alpha f_i - bx^\beta f_j$ cancel, leaving only terms of lower degree. On the other hand, $ax^\alpha f_i - bx^\beta f_j \in I$, and therefore its leading term belongs to $\langle LT(I) \rangle$. To study this cancelation phenomenon, we introduce the concept of S-polynomial.

Definition 13.3.8 Let $f, g \in k[x_1, \ldots, x_n]$ be two nonzero polynomials.
(i) If $\mathrm{multideg}(f) = \alpha$ and $\mathrm{multideg}(g) = \beta$, then we define $\gamma = (\gamma_1, \ldots, \gamma_n)$, where $\gamma_i = \max\{\alpha_i, \beta_i\}$, and we call $x^\gamma$ the least common multiple of $LM(f)$ and $LM(g)$, writing $x^\gamma = \mathrm{lcm}(LM(f), LM(g))$.
(ii) The S-polynomial of $f$ and $g$ is the combination
$$S(f, g) = \frac{x^\gamma}{LT(f)} \cdot f - \frac{x^\gamma}{LT(g)} \cdot g.$$
Example 13.3.9 Consider the polynomials $f = 3x^3z^2 + x^2yz + xyz$ and $g = x^2y^3z + xy^3 + z^2$ in $k[x, y, z]$. If we choose the lexicographic ordering, we get
$$\mathrm{multideg}(f) = (3, 0, 2), \qquad \mathrm{multideg}(g) = (2, 3, 1),$$
hence $\gamma = (3, 3, 2)$ and
$$S(f, g) = \frac{x^3y^3z^2}{3x^3z^2} \cdot f - \frac{x^3y^3z^2}{x^2y^3z} \cdot g = \frac{1}{3}x^2y^4z - x^2y^3z + \frac{1}{3}xy^4z - xz^3.$$
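The S-polynomial can be computed directly from its definition. In the following sketch (ours, assuming SymPy; spoly is our own helper, not a SymPy built-in) we reproduce the example above.

    # A sketch (assuming SymPy): the S-polynomial of Definition 13.3.8.
    from sympy import symbols, LM, LT, lcm, expand

    x, y, z = symbols('x y z')

    def spoly(f, g, gens, order):
        # S(f, g) = (x^gamma / LT(f)) * f - (x^gamma / LT(g)) * g,
        # where x^gamma = lcm(LM(f), LM(g)).
        xg = lcm(LM(f, *gens, order=order), LM(g, *gens, order=order))
        return expand(xg / LT(f, *gens, order=order) * f
                      - xg / LT(g, *gens, order=order) * g)

    f = 3*x**3*z**2 + x**2*y*z + x*y*z
    g = x**2*y**3*z + x*y**3 + z**2
    print(spoly(f, g, (x, y, z), 'lex'))
    # x**2*y**4*z/3 - x**2*y**3*z + x*y**4*z/3 - x*z**3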

An S-polynomial $S(f, g)$ is used to produce the cancelation of leading terms. In fact, any cancelation of leading terms between polynomials of the same multidegree is obtained from this type of polynomial combination, as guaranteed by the following result.
result.
Lemma 13.3.10 Suppose we have a sum of polynomials $\sum_{i=1}^t c_i f_i$, where $c_i \in k$ and $\mathrm{multideg}(f_i) = \delta \in \mathbb{Z}^n_{\geq 0}$ for all $i$. If $\mathrm{multideg}\left(\sum_{i=1}^t c_i f_i\right) < \delta$, then $\sum_{i=1}^t c_i f_i$ is a linear combination, with coefficients in $k$, of the S-polynomials $S(f_i, f_j)$, for $1 \leq i, j \leq t$. Moreover, each $S(f_i, f_j)$ has multidegree $< \delta$.

Using the concept of S-polynomial and the previous lemma, we can prove the following criterion to establish whether a basis of an ideal is a Groebner basis.

Theorem 13.3.11 (Buchberger S-pair criterion) Let $I$ be an ideal in $k[x_1, \ldots, x_n]$. Then a basis $G = \{g_1, \ldots, g_t\}$ for $I$ is a Groebner basis for $I$ if and only if, for any pair of indices $i \neq j$, the remainder of the division of $S(g_i, g_j)$ by $G$ is zero.

Proof The "only if" direction is simple: if $G$ is a Groebner basis then, since $S(g_i, g_j) \in I$, its remainder in the division by $G$ is zero, by Corollary 13.3.5. It remains to prove the "if" direction.
Let $f \in I = \langle g_1, \ldots, g_t \rangle$ be a nonzero polynomial. Hence there exist polynomials $h_i \in k[x_1, \ldots, x_n]$ such that
$$f = \sum_{i=1}^t h_i g_i. \tag{13.3.1}$$
By Lemma 13.1.8 we know that
$$\mathrm{multideg}(f) \leq \max_i \left(\mathrm{multideg}(h_i g_i)\right). \tag{13.3.2}$$
Let $m_i = \mathrm{multideg}(h_i g_i)$ and define $\delta = \max(m_1, \ldots, m_t)$. Hence the previous inequality can be written as $\mathrm{multideg}(f) \leq \delta$. If we change, in (13.3.1), the way of writing $f$ in terms of $G$, we may get a different value for $\delta$. Since a monomial ordering is a well ordering, we can choose an expression for $f$, of the form (13.3.1), for which $\delta$ is minimal.
Let us show that, if $\delta$ is minimal, then $\mathrm{multideg}(f) = \delta$.
Suppose that $\mathrm{multideg}(f) < \delta$ and write $f$ so as to isolate the terms of multidegree $\delta$:
$$f = \sum_{m_i = \delta} h_i g_i + \sum_{m_i < \delta} h_i g_i = \sum_{m_i = \delta} LT(h_i) g_i + \sum_{m_i = \delta} (h_i - LT(h_i)) g_i + \sum_{m_i < \delta} h_i g_i. \tag{13.3.3}$$
The monomials appearing in the second and third sums of the right-hand side all have multidegree $< \delta$. So, our hypothesis $\mathrm{multideg}(f) < \delta$ tells us that the first sum also has multidegree $< \delta$. Let $LT(h_i) = c_i x^{\alpha_i}$; then the first sum
$$\sum_{m_i = \delta} LT(h_i) g_i = \sum_{m_i = \delta} c_i x^{\alpha_i} g_i$$
has exactly the form described in Lemma 13.3.10 with $f_i = x^{\alpha_i} g_i$. Hence, again by Lemma 13.3.10, this sum is a linear combination of the S-polynomials $S(x^{\alpha_j} g_j, x^{\alpha_k} g_k)$. Moreover one has
$$S(x^{\alpha_j} g_j, x^{\alpha_k} g_k) = \frac{x^\delta}{x^{\alpha_j} LT(g_j)}\, x^{\alpha_j} g_j - \frac{x^\delta}{x^{\alpha_k} LT(g_k)}\, x^{\alpha_k} g_k = x^{\delta - \gamma_{jk}}\, S(g_j, g_k),$$
where $x^{\gamma_{jk}}$ is the least common multiple of $LM(g_j)$ and $LM(g_k)$. Hence there exist constants $c_{jk} \in k$ such that
$$\sum_{m_i = \delta} LT(h_i) g_i = \sum_{j,k} c_{jk} x^{\delta - \gamma_{jk}} S(g_j, g_k). \tag{13.3.4}$$

By hypothesis, we know that the remainder of the division of $S(g_j, g_k)$ by $g_1, \ldots, g_t$ is zero; hence, by the division algorithm, each S-polynomial can be written as
$$S(g_j, g_k) = \sum_{i=1}^t a_{ijk} g_i, \tag{13.3.5}$$
where $a_{ijk} \in k[x_1, \ldots, x_n]$. The division algorithm also tells us that
$$\mathrm{multideg}(a_{ijk} g_i) \leq \mathrm{multideg}(S(g_j, g_k)) \tag{13.3.6}$$
for any choice of $i$, $j$ and $k$. This means that, when the remainder is zero, we can find an expression for $S(g_j, g_k)$ in terms of $G$ in which the leading terms do not all cancel. Indeed, multiplying the expression of $S(g_j, g_k)$ by $x^{\delta - \gamma_{jk}}$ we obtain
$$x^{\delta - \gamma_{jk}} S(g_j, g_k) = \sum_{i=1}^t b_{ijk} g_i,$$
where $b_{ijk} = x^{\delta - \gamma_{jk}} a_{ijk}$. Hence, by (13.3.6) and Lemma 13.3.10 we get
$$\mathrm{multideg}(b_{ijk} g_i) \leq \mathrm{multideg}(x^{\delta - \gamma_{jk}} S(g_j, g_k)) < \delta. \tag{13.3.7}$$
If we replace the previous expression of $x^{\delta - \gamma_{jk}} S(g_j, g_k)$ in (13.3.4), we get the equation
$$\sum_{m_i = \delta} LT(h_i) g_i = \sum_{j,k} c_{jk} x^{\delta - \gamma_{jk}} S(g_j, g_k) = \sum_{j,k} c_{jk} \sum_i b_{ijk} g_i = \sum_i \tilde{h}_i g_i,$$
which, by (13.3.7), satisfies
$$\mathrm{multideg}(\tilde{h}_i g_i) < \delta, \quad \text{for all } i.$$
Finally, we substitute $\sum_{m_i = \delta} LT(h_i) g_i = \sum_i \tilde{h}_i g_i$ in (13.3.3), obtaining an expression for $f$ as a linear combination of the $g_i$'s in which all terms have multidegree strictly less than $\delta$. This contradicts the minimality of $\delta$, and hence $\mathrm{multideg}(f) = \delta$.
Then $\mathrm{multideg}(f) = \mathrm{multideg}(h_i g_i)$ for some $i$, from which it follows that $LT(f)$ is divisible by $LT(g_i)$. Hence $LT(f) \in \langle LT(g_1), \ldots, LT(g_t) \rangle$ and the theorem is proved. $\square$

Example 13.3.12 (from [1], page 84) Consider the ideal $I = \langle y - x^2, z - x^3 \rangle$ of the twisted cubic in $\mathbb{R}^3$. Fix the lexicographic ordering with $y > z > x$. We prove that $G = \{y - x^2, z - x^3\}$ is a Groebner basis for $I$. Compute the S-polynomial
$$S(y - x^2, z - x^3) = \frac{yz}{y}(y - x^2) - \frac{yz}{z}(z - x^3) = -zx^2 + yx^3.$$
By the division algorithm we get
$$-zx^2 + yx^3 = x^3 \cdot (y - x^2) + (-x^2) \cdot (z - x^3) + 0,$$
hence $\overline{S(y - x^2, z - x^3)}^G = 0$ and, by Theorem 13.3.11, $G$ is a Groebner basis for $I$. The reader can check that, for the lexicographic ordering with $x > y > z$, $G$ is not a Groebner basis for $I$.
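Both claims can be checked with SymPy's groebner() function, as in the following sketch of ours.

    # A sketch (assuming SymPy): the twisted cubic ideal under two lex orderings.
    from sympy import symbols, groebner

    x, y, z = symbols('x y z')
    I = [y - x**2, z - x**3]

    # With y > z > x the given generators already form a (reduced) Groebner basis.
    print(list(groebner(I, y, z, x, order='lex')))
    # With x > y > z new elements are needed, e.g. y**3 - z**2 appears.
    print(list(groebner(I, x, y, z, order='lex')))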
13.4 Buchberger’s Algorithm

We have seen, by Corollary 13.3.2, that every ideal admits a Groebner basis but, unfortunately, the corollary does not tell us how to build one. Let us now see how this problem can be solved via the Buchberger algorithm.
Theorem 13.4.1 Let $I = \langle f_1, \ldots, f_s \rangle \neq \{0\}$ be an ideal in $k[x_1, \ldots, x_n]$. A Groebner basis for $I$ can be built, in a finite number of steps, by the following algorithm.

Input: $F = (f_1, \ldots, f_s)$
Output: a Groebner basis $G = (g_1, \ldots, g_t)$ for $I$, with $F \subset G$

G := F
REPEAT
    G' := G
    FOR any pair {p, q}, p ≠ q in G' DO
        S := $\overline{S(p, q)}^{G'}$
        IF S ≠ 0 THEN G := G ∪ {S}
UNTIL G = G'

For the proof, the reader can see [1].
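The rudimentary algorithm above is short enough to be implemented directly. The following sketch is ours, assuming SymPy; buchberger and spoly are our own helpers, not the refined Buchberger algorithm used by computer algebra systems.

    # A sketch (assuming SymPy) of the rudimentary Buchberger algorithm of Theorem 13.4.1.
    from itertools import combinations
    from sympy import symbols, LM, LT, lcm, expand, reduced

    def spoly(f, g, gens, order):
        # The S-polynomial of Definition 13.3.8.
        xg = lcm(LM(f, *gens, order=order), LM(g, *gens, order=order))
        return expand(xg / LT(f, *gens, order=order) * f
                      - xg / LT(g, *gens, order=order) * g)

    def buchberger(F, gens, order):
        G = list(F)
        while True:
            G_prime = list(G)
            for p, q in combinations(G_prime, 2):
                S = spoly(p, q, gens, order)
                if S == 0:
                    continue
                _, r = reduced(S, G_prime, *gens, order=order)  # remainder mod G'
                if r != 0 and r not in G:
                    G.append(r)
            if G == G_prime:   # no new remainders were added: G is a Groebner basis
                return G

    x, y = symbols('x y')
    print(buchberger([x**3*y - x**2 + x, x**2*y**2 - x*y], (x, y), 'grlex'))
    # a (non-reduced) Groebner basis; compare with Example 13.4.2, up to signs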
Example 13.4.2 Consider again the ideal $I = \langle f_1, f_2 \rangle$ of Example 13.2.7. We already know that $\{f_1, f_2\} = \{x^3y - x^2 + x, x^2y^2 - xy\}$ is not a Groebner basis, since $y \cdot (x^3y - x^2 + x) - x \cdot (x^2y^2 - xy) = xy$ and $LT(xy) \notin \langle LT(f_1), LT(f_2) \rangle$. We fix $G' = G = \{f_1, f_2\}$ and compute
$$S(f_1, f_2) := \frac{x^3y^2}{x^3y} f_1 - \frac{x^3y^2}{x^2y^2} f_2 = xy.$$
Since $\overline{S(f_1, f_2)}^G = xy$, we add $f_3 = xy$ to $G$. We repeat the cycle with the new set of polynomials, obtaining
$$S(f_1, f_2) = xy, \qquad S(f_1, f_3) = -x^2 + x, \qquad S(f_2, f_3) = -xy$$
and
$$\overline{S(f_1, f_2)}^G = 0, \qquad \overline{S(f_1, f_3)}^G = -x^2 + x, \qquad \overline{S(f_2, f_3)}^G = 0.$$
Hence we add $f_4 = x^2 - x$ to $G$. Iterating the cycle again, one has
$$S(f_1, f_2) = xy, \qquad S(f_1, f_3) = -x^2 + x, \qquad S(f_1, f_4) = x^2y - x^2 + x,$$
$$S(f_2, f_3) = -xy, \qquad S(f_2, f_4) = xy^2 - xy, \qquad S(f_3, f_4) = xy,$$
from which we get
$$\overline{S(f_1, f_2)}^G = 0, \qquad \overline{S(f_1, f_3)}^G = 0, \qquad \overline{S(f_1, f_4)}^G = 0,$$
$$\overline{S(f_2, f_3)}^G = 0, \qquad \overline{S(f_2, f_4)}^G = 0, \qquad \overline{S(f_3, f_4)}^G = 0.$$
Thus we can exit from the cycle, obtaining the Groebner basis
$$G = \{x^3y - x^2 + x,\ x^2y^2 - xy,\ xy,\ x^2 - x\}.$$

Remark 13.4.3 The algorithm of Theorem 13.4.1 is just a rudimentary version of the Buchberger algorithm, as it is not very practical from a computational point of view. In fact, once a remainder $\overline{S(p, q)}^{G'}$ is equal to zero, it will remain zero even if we add additional generators to $G'$. So there is no reason to recalculate those remainders that have already been analyzed in the main loop. In fact, if we add the new generators $f_j$ one at a time, the only remainders to be checked are those of the type $\overline{S(f_i, f_j)}^{G'}$, where $i \leq j - 1$. The interested reader can find a refined version of the Buchberger algorithm in [1, Chap. 2.9].
The Groebner bases obtained through Theorem 13.4.1 are often larger than necessary. We can eliminate some generators using the following result.
Lemma 13.4.4 Let $G$ be a Groebner basis for an ideal $I \subset k[x_1, \ldots, x_n]$ and let $p \in G$ be a polynomial such that $LT(p) \in \langle LT(G \setminus \{p\}) \rangle$. Then $G \setminus \{p\}$ is still a Groebner basis for $I$.
Proof We know that $\langle LT(G) \rangle = \langle LT(I) \rangle$. If $LT(p) \in \langle LT(G \setminus \{p\}) \rangle$, then $\langle LT(G \setminus \{p\}) \rangle = \langle LT(G) \rangle$. Hence, by definition, $G \setminus \{p\}$ is still a Groebner basis for $I$. $\square$
If we multiply each polynomial in $G$ by a suitable constant so that all leading coefficients are equal to $1$, and then we remove from $G$ any $p$ such that $LT(p) \in \langle LT(G \setminus \{p\}) \rangle$, we obtain a so-called minimal Groebner basis.
Definition 13.4.5 A minimal Groebner basis for an ideal $I$ is a Groebner basis $G$ for $I$ such that
(i) $LC(p) = 1$ for all $p \in G$;
(ii) for all $p \in G$, $LT(p) \notin \langle LT(G \setminus \{p\}) \rangle$.
Example 13.4.6 Consider the Groebner basis $G = \{x^3y - x^2 + x, x^2y^2 - xy, xy, x^2 - x\}$ of Example 13.4.2 (with ordering grlex). The leading coefficients are all equal to $1$, so condition (i) is satisfied (otherwise it would be enough to multiply the polynomials of the basis by suitable constants). Observe that
$$LT(x^3y - x^2 + x) = x^3y, \quad LT(x^2y^2 - xy) = x^2y^2, \quad LT(xy) = xy, \quad LT(x^2 - x) = x^2.$$
Thus the leading terms of $x^3y - x^2 + x$ and $x^2y^2 - xy$ are contained in the ideal $\langle xy, x^2 \rangle = \langle LT(xy), LT(x^2 - x) \rangle$, and hence a minimal Groebner basis for the ideal $I = \langle x^3y - x^2 + x, x^2y^2 - xy \rangle$ is given by $\{xy, x^2 - x\}$.
An ideal can have many minimal Groebner bases. However, we can find one which
is better than the others.
Definition 13.4.7 A reduced Groebner basis for an ideal $I \subset k[x_1, \ldots, x_n]$ is a Groebner basis $G$ for $I$ such that
(i) $LC(p) = 1$ for all $p \in G$;
(ii) for all $p \in G$, no monomial of $p$ is in $\langle LT(G \setminus \{p\}) \rangle$.
The reduced Groebner bases have the following important property.
Proposition 13.4.8 Let I ⊂ k[x1 , . . . , xn ] be an ideal different from {0}. Then, given
a monomial ordering, I has a unique reduced Groebner basis.
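For instance, SymPy's groebner() returns precisely this reduced Groebner basis; the following sketch of ours recovers, for the ideal of Example 13.4.2, the minimal basis found in Example 13.4.6.

    # A sketch (assuming SymPy): the unique reduced Groebner basis of Proposition 13.4.8.
    from sympy import symbols, groebner

    x, y = symbols('x y')
    G = groebner([x**3*y - x**2 + x, x**2*y**2 - x*y], x, y, order='grlex')
    print(list(G))  # expected: [x**2 - x, x*y]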

13.5 Groebner Bases and Elimination Theory

Elimination theory, as shown in Sects. 10.2 and 10.3, is a systematic way to eliminate variables from a system of polynomial equations. The central part of this method is based on the so-called Elimination Theorem and Extension Theorem. We now define the concept of "eliminating variables" in terms of ideals and Groebner bases.
Definition 13.5.1 Given an ideal $I = \langle f_1, \ldots, f_t \rangle \subset k[x_1, \ldots, x_n]$, the $l$-th elimination ideal $I_l$ of $I$ is the ideal of $k[x_{l+1}, \ldots, x_n]$ defined as
$$I_l = I \cap k[x_{l+1}, \ldots, x_n].$$

It is easy to prove that $I_l$ is an ideal of $k[x_{l+1}, \ldots, x_n]$. Obviously, the ideal $I_0$ coincides with $I$ itself. It should also be noted that different monomial orderings give different elimination ideals.
Remark 13.5.2 The reader should notice that the notation here is different from that of Sect. 10.3, where we defined $J_0$ as the first elimination ideal, in the first variable $x_0$; with the convention of Definition 13.5.1 the same ideal would be denoted $J_1$. For this reason our variables now start from $x_1$, so that we have a more convenient notation for the indices.
It is clear, at this point, that eliminating $x_1, \ldots, x_l$ means finding the nonzero polynomials contained in the $l$-th elimination ideal. This can easily be done through Groebner bases (once a proper monomial ordering has been fixed!).
Theorem 13.5.3 (Elimination Theorem) Let $I \subset k[x_1, \ldots, x_n]$ be an ideal and $G$ a Groebner basis for $I$ with respect to the lexicographic ordering with $x_1 > x_2 > \cdots > x_n$. Then, for any $0 \leq l \leq n$, the set
$$G_l = G \cap k[x_{l+1}, \ldots, x_n]$$
is a Groebner basis for the $l$-th elimination ideal $I_l$.



Proof Fix $l$ with $0 \leq l \leq n$. By construction $G_l \subset I_l$, hence it is sufficient to prove that $\langle LT(I_l) \rangle = \langle LT(G_l) \rangle$. The inclusion $\langle LT(G_l) \rangle \subset \langle LT(I_l) \rangle$ is obvious. To prove the other inclusion, observe that if $f \in I_l$ then $f \in I$, hence $LT(f)$ is divisible by $LT(g)$ for some $g \in G$. Since $f \in I_l$, the term $LT(f)$, and therefore also $LT(g)$, contains only the variables $x_{l+1}, \ldots, x_n$. Since we are using the lexicographic ordering with $x_1 > x_2 > \cdots > x_n$, any monomial involving one of the variables $x_1, \ldots, x_l$ is greater than all monomials in $k[x_{l+1}, \ldots, x_n]$, and hence $LT(g) \in k[x_{l+1}, \ldots, x_n]$ implies $g \in k[x_{l+1}, \ldots, x_n]$. This proves that $g \in G_l$, from which it follows that $LT(I_l) \subset \langle LT(G_l) \rangle$. $\square$

The Elimination Theorem shows that a Groebner basis with respect to the lexicographic ordering eliminates not only the first variable, but also the first two variables, the first three variables, and so on. Often, however, we want to eliminate only certain variables, while we are not interested in others. In these cases, it may be difficult to compute a Groebner basis using the lexicographic ordering, especially because this ordering can produce Groebner bases that are not particularly nice. For versions of the Elimination Theorem based on other orderings, we refer to [1].
Now let us introduce the Extension Theorem. Suppose we have an ideal $I \subset k[x_1, \ldots, x_n]$ that defines the affine variety
$$V(I) = \{(a_1, \ldots, a_n) \in k^n : f(a_1, \ldots, a_n) = 0 \text{ for all } f \in I\}.$$
Consider the $l$-th elimination ideal $I_l$. We denote by $(a_{l+1}, \ldots, a_n) \in V(I_l)$ a partial solution of the original system of equations. To extend $(a_{l+1}, \ldots, a_n)$ to a complete solution in $V(I)$, first of all we have to add a coordinate: this means finding $a_l$ in such a way that $(a_l, a_{l+1}, \ldots, a_n) \in V(I_{l-1})$, that is, $(a_l, a_{l+1}, \ldots, a_n)$ is in the variety defined by the previous elimination ideal. More precisely, suppose that $I_{l-1} = \langle g_1, \ldots, g_s \rangle \subset k[x_l, \ldots, x_n]$. Hence we want to find solutions $x_l = a_l$ of the equations
$$g_1(x_l, a_{l+1}, \ldots, a_n) = 0, \quad \ldots, \quad g_s(x_l, a_{l+1}, \ldots, a_n) = 0.$$
The $g_i(x_l, a_{l+1}, \ldots, a_n)$'s are polynomials in one variable, hence their common solutions are the zeros of the greatest common divisor of these $s$ polynomials. Obviously, it can happen that the $g_i(x_l, a_{l+1}, \ldots, a_n)$'s have no common solutions, depending on the choice of $a_{l+1}, \ldots, a_n$. Hence our aim, at the moment, is to try to determine, a priori, which partial solutions extend to complete solutions. We restrict ourselves to the case in which we have eliminated the first variable $x_1$, and hence we want to know whether a partial solution $(a_2, \ldots, a_n) \in V(I_1)$ extends to a solution $(a_1, \ldots, a_n) \in V(I)$. The following theorem tells us when this is possible.

Theorem 13.5.4 (Extension Theorem) Consider $I = \langle f_1, \ldots, f_t \rangle \subset \mathbb{C}[x_1, \ldots, x_n]$ and let $I_1$ be the first elimination ideal of $I$. For each $1 \leq i \leq t$ we write $f_i$ as
$$f_i = g_i(x_2, \ldots, x_n)\, x_1^{N_i} + \text{terms in } x_1 \text{ of degree} < N_i,$$
where $N_i \geq 0$ and $g_i \in \mathbb{C}[x_2, \ldots, x_n]$ is different from zero. Suppose there exists a partial solution $(a_2, \ldots, a_n) \in V(I_1)$. If $(a_2, \ldots, a_n) \notin V(g_1, \ldots, g_t)$, then there exists $a_1 \in \mathbb{C}$ such that $(a_1, \ldots, a_n) \in V(I)$.

Notice that the Extension Theorem requires the complex field. Consider the equations
$$x^2 = y, \qquad x^2 = z.$$
If we eliminate $x$ we get $y = z$ and, hence, the partial solutions $(a, a)$ for all $a \in \mathbb{R}$. Since the leading coefficient of $x$ in $x^2 = y$ and in $x^2 = z$ never vanishes, Theorem 13.5.4 guarantees that we can extend $(a, a)$, under the condition that we are working over $\mathbb{C}$. As a matter of fact, over $\mathbb{R}$, $x^2 = a$ has no real solutions if $a$ is negative, hence the only partial solutions $(a, a)$ that we can extend are the ones with $a \in \mathbb{R}_{\geq 0}$.
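The elimination step in this example can be carried out with a lex Groebner basis, as prescribed by Theorem 13.5.3; the following sketch is ours, assuming SymPy.

    # A sketch (assuming SymPy): eliminating x from <x**2 - y, x**2 - z> via lex.
    from sympy import symbols, groebner

    x, y, z = symbols('x y z')
    G = groebner([x**2 - y, x**2 - z], x, y, z, order='lex')
    print(list(G))  # expected: [x**2 - z, y - z]
    # G1 = G intersected with k[y, z] is {y - z}: the partial solutions are the pairs (a, a)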

Remark 13.5.5 Although the Extension Theorem gives a statement only in the case where the first variable is eliminated, it can still be used to eliminate any number of variables. The idea is to extend solutions one variable at a time: first adding $x_l$, then $x_{l-1}$, and so on down to $x_1$.

The Extension Theorem is particularly useful when one of the leading coefficients is constant.

Corollary 13.5.6 Let $I = \langle f_1, \ldots, f_t \rangle \subset \mathbb{C}[x_1, \ldots, x_n]$ and assume that for some $i$, $f_i$ can be written as
$$f_i = c_i x_1^N + \text{terms in } x_1 \text{ of degree} < N,$$
where $c_i \in \mathbb{C}$ is different from zero and $N > 0$. If $I_1$ is the first elimination ideal of $I$ and $(a_2, \ldots, a_n) \in V(I_1)$, then there exists $a_1 \in \mathbb{C}$ such that $(a_1, \ldots, a_n) \in V(I)$.

We end this section by recalling again that the process of elimination corresponds to the projection of varieties into subspaces of lower dimension. For the rest of this section, we work over $\mathbb{C}$.
Let $V = V(f_1, \ldots, f_t) \subset \mathbb{C}^n$ be an affine variety. To eliminate the first $l$ variables $x_1, \ldots, x_l$ we consider the projection map
$$\pi_l : \mathbb{C}^n \to \mathbb{C}^{n-l}, \qquad (a_1, \ldots, a_n) \mapsto (a_{l+1}, \ldots, a_n).$$
The following lemma explains the link between $\pi_l(V)$ and the $l$-th elimination ideal.

Lemma 13.5.7 Let $I_l = \langle f_1, \ldots, f_t \rangle \cap \mathbb{C}[x_{l+1}, \ldots, x_n]$ be the $l$-th elimination ideal of $I$. Then, in $\mathbb{C}^{n-l}$, one has
$$\pi_l(V) \subset V(I_l).$$
Observe that we can write $\pi_l(V)$ as
$$\pi_l(V) = \{(a_{l+1}, \ldots, a_n) \in V(I_l) : \exists\, a_1, \ldots, a_l \in \mathbb{C} \text{ with } (a_1, \ldots, a_l, a_{l+1}, \ldots, a_n) \in V\}.$$
Hence, $\pi_l(V)$ consists exactly of the partial solutions that can be extended to complete solutions. We can then give a geometric version of the Extension Theorem.

Theorem 13.5.8 Given $V = V(f_1, \ldots, f_t) \subset \mathbb{C}^n$, let $g_i$ be as in Theorem 13.5.4. If $I_1$ is the first elimination ideal of $\langle f_1, \ldots, f_t \rangle$, then the following equality holds in $\mathbb{C}^{n-1}$:
$$V(I_1) = \pi_1(V) \cup (V(g_1, \ldots, g_t) \cap V(I_1)),$$
where $\pi_1 : \mathbb{C}^n \to \mathbb{C}^{n-1}$ is the projection onto the last $n - 1$ components.

The previous theorem tells us that $\pi_1(V)$ covers the affine variety $V(I_1)$, with the possible exception of a part lying in $V(g_1, \ldots, g_t)$. Unfortunately, we do not know how big this part is, and there are cases where $V(g_1, \ldots, g_t)$ is enormous. However, the following result permits us to understand even better the link between $\pi_l(V)$ and $V(I_l)$.

Theorem 13.5.9 (Closure Theorem) Given $V = V(f_1, \ldots, f_t) \subset \mathbb{C}^n$, let $I_l$ be the $l$-th elimination ideal of $I = \langle f_1, \ldots, f_t \rangle$. Then:
(i) $V(I_l)$ is the smallest affine variety containing $\pi_l(V) \subset \mathbb{C}^{n-l}$;
(ii) when $V \neq \emptyset$, there exists an affine variety $W \subsetneq V(I_l)$ such that $V(I_l) \setminus W \subset \pi_l(V)$.

The Closure Theorem gives a partial description of $\pi_l(V)$: it covers $V(I_l)$, except possibly for points lying in a variety strictly smaller than $V(I_l)$.
Finally, we also have a geometric version of Corollary 13.5.6, which represents a good situation for elimination.

Corollary 13.5.10 Let $V = V(f_1, \ldots, f_t) \subset \mathbb{C}^n$ and suppose that for some $i$, $f_i$ can be written as
$$f_i = c_i x_1^N + \text{terms in } x_1 \text{ of degree} < N,$$
where $c_i \in \mathbb{C}$ is different from zero and $N > 0$. If $I_1$ is the first elimination ideal of $I$, then, in $\mathbb{C}^{n-1}$,
$$\pi_1(V) = V(I_1),$$
where $\pi_1$ is the projection onto the last $n - 1$ components.


13.6 Exercises

Exercise 46 Prove Lemma 13.2.3.

Exercise 47 Compute the S-polynomials $S(f, g)$ where
(a) $f = x^2y + xy^2 + y^3$, $g = x^2y^3 + x^3y^2$, using $>_{lex}$;
(b) $f = y^2 + xy^2 + y^3$, $g = x^2 + y^3 + xy^2$, using $>_{grlex}$;
(c) $f = y^2 + x$, $g = x^2 + y$, using $>_{grevlex}$.

Exercise 48 Check whether the following sets are Groebner bases. In case of a negative answer, compute a Groebner basis of the ideal they generate.
(a) $\{x_1 - x_2 + x_3 - x_4 + x_5,\ x_1 - 2x_2 + 3x_3 - x_4 + x_5,\ 4x_3 + 4x_4 + 5x_5\}$, using $>_{lex}$;
(b) $\{x_1 - x_2 + x_3 - x_4 + x_5,\ x_1 - 2x_2 + 3x_3 - x_4 + x_5,\ 4x_3 + 4x_4 + 5x_5\}$, using $>_{grlex}$;
(c) $\{x_1 + x_2 + x_3,\ x_2x_3,\ x_2^2 + x_3^2,\ x_3^3\}$, using $>_{grevlex}$.

References

1. Cox, D., Little, J., O’Shea, D.: Ideals, Varieties, and Algorithms: An Introduction to Compu-
tational Algebraic Geometry and Commutative Algebra. Undergraduate Texts in Mathematics,
Springer, New York (2007)
2. Greuel, G.-M., Pfister G.: A Singular Introduction to Commutative Algebra. Springer, Berlin
(2007)