An Invitation to 3-D Vision
Yi Ma
Jana Kosecka
Stefano Soatto
Shankar Sastry
November 19, 2001
Preface
1 To our knowledge, there are also two other books on computer vision currently in
preparation: Ponce and Forsyth (expected in 2002), Pollefeys and van Gool (expected
in 2002).
Contents

Preface

I  Introductory material

1  Introduction
   1.1  Visual perception: from 2-D images to 3-D models
   1.2  A historical perspective
   1.3  A mathematical approach
   1.4  Organization of the book

2  Representation of a three-dimensional moving scene

3  Image formation
   3.1  Representation of images
   3.2  Lenses, surfaces and light
        3.2.1  Imaging through lenses
        3.2.2  Imaging through pin-hole
   3.3  A first model of image formation
        3.3.1  Basic photometry
        3.3.2  The basic model of imaging geometry
        3.3.3  Ideal camera
        3.3.4  Camera with intrinsic parameters
        3.3.5  Spherical projection
        3.3.6  Approximate camera models
   3.4  Summary
   3.5  Exercises

4  Image primitives and correspondence

II

III

   8.5  Experiments
        8.5.1  Setup
        8.5.2  Comparison with the 8 point algorithm
        8.5.3  Error as a function of the number of frames
        8.5.4  Experiments on real images
   8.6  Summary
   8.7  Exercises

IV  Reconstruction algorithms

Appendices

References

Glossary of notations

Index
Chapter 1
Introduction
Figure 1.1. Some pictorial cues for three-dimensional structure: texture (left),
shading and T-junctions (right).
Figure 1.2. Stereo as a cue in a random dot stereogram. When the image is fused binocularly, it reveals a depth structure. Note that stereo is the only cue, as all pictorial aspects are absent in random dot stereograms.
Binocular stereo is one such cue, exploited by the human visual system to infer the depth structure of the scene at close range. More generally, if we consider a stream of images taken from a moving viewpoint, the two-dimensional image-motion can be exploited to infer information about the three-dimensional structure of the scene as well as its motion relative to the viewer.
That image-motion is a strong cue is easily seen by eliminating all pictorial cues until the scene reduces to a cloud of points: a still image looks like a random collection of dots but, as soon as it starts moving, we are able to perceive the three-dimensional shape and motion of the scene (see Figure 1.3). The use of motion as a cue relies upon the assumption that we are able to assess which point corresponds to which across time (the correspondence problem again).
Figure 1.3. 2-D image-motion is a cue to 3-D scene structure: a number of dots
are painted on the surface of a transparent cylinder. An image of the cylinder,
which is generated by projecting it onto a wall, looks like a random collection
of points. If we start rotating the cylinder, just by looking at its projection we
can clearly perceive the existence of a three-dimensional structure underlying the
two-dimensional motion of the projection.
This implies that, while the scene and/or the sensor moves, certain properties of the scene remain constant. For instance, we may assume that neither the reflective properties nor the distance between any two points on the scene change.
In this book we concentrate on the motion/stereo problem, that is, the reconstruction of three-dimensional properties of a scene from collections of two-dimensional images taken from different vantage points.
This choice does not imply that pictorial cues are unimportant: the fact that we use and like photographs so much suggests the contrary. However, the problem of stereo/motion reconstruction has now reached the point where it can be formulated in a precise mathematical sense, and effective software packages are available for solving it.
Rigid body motion has been a classic subject of study in geometry, physics (especially mechanics), and robotics. Perspective projection, with its roots traced back to Renaissance art, has been widely studied in projective geometry and computer graphics. Important in their own right, it is the study of computer vision and computer graphics that has brought these two separate subjects together, generating an intriguing, challenging, and yet beautiful new subject: multiple view geometry.
Part I
Introductory material
Chapter 2
Representation of a three-dimensional moving scene
Note that we use the same symbol $v$ for a vector and for its coordinates.
The quantity $\|v\| = \sqrt{\langle v, v\rangle}$ is called the Euclidean norm (or 2-norm) of the vector $v$. It can be shown that, by a proper choice of the Cartesian frame, any inner product in $\mathbb{E}^3$ can be converted to the following familiar form:
$$\langle u, v\rangle = u^T v = u_1 v_1 + u_2 v_2 + u_3 v_3. \qquad (2.1)$$
The cross product of two vectors $u, v \in \mathbb{R}^3$ is defined as
$$u \times v = \begin{bmatrix} u_2 v_3 - u_3 v_2\\ u_3 v_1 - u_1 v_3\\ u_1 v_2 - u_2 v_1\end{bmatrix} \in \mathbb{R}^3.$$
It is immediate from this definition that the cross product of two vectors is linear: $u \times (\alpha v + \beta w) = \alpha\, u \times v + \beta\, u \times w$, $\forall \alpha, \beta \in \mathbb{R}$. Furthermore, it is immediate to verify that
$$\langle u \times v, u\rangle = \langle u \times v, v\rangle = 0, \qquad u \times v = -v \times u.$$
The length of a curve $\gamma(\cdot)$ traced by a point with coordinates $X(t)$ is
$$l(\gamma(\cdot)) = \int \|\dot X(t)\|\, dt, \qquad \text{where } \dot X(t) = \frac{d}{dt}\big(X(t)\big).$$
The cross product can equivalently be written as a matrix-vector product, $u \times v = \hat u\, v$, where $\hat u$ is the skew-symmetric matrix
$$\hat u \doteq \begin{bmatrix} 0 & -u_3 & u_2\\ u_3 & 0 & -u_1\\ -u_2 & u_1 & 0\end{bmatrix} \in \mathbb{R}^{3\times 3}. \qquad (2.2)$$
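As a quick numerical illustration (ours, not part of the book), the hat operator and its relation to the cross product can be checked in a few lines of Python; the function name hat() is our own choice.

import numpy as np

def hat(u):
    # Map a vector u in R^3 to the skew-symmetric matrix u_hat of equation (2.2).
    return np.array([[0.0, -u[2], u[1]],
                     [u[2], 0.0, -u[0]],
                     [-u[1], u[0], 0.0]])

u = np.array([1.0, 2.0, 3.0])
v = np.array([0.5, -1.0, 2.0])
assert np.allclose(np.cross(u, v), hat(u) @ v)      # u x v = u_hat v
assert np.allclose(hat(u).T, -hat(u))                # u_hat is skew symmetric
assert abs(np.dot(np.cross(u, v), u)) < 1e-12        # u x v is orthogonal to u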
A rigid object moving in space preserves the distance between any pair of points $p$ and $q$ on it:
$$\|X_p(t) - X_q(t)\| = \text{constant}, \qquad \forall t \in \mathbb{R}. \qquad (2.3)$$
In other words, if $v$ is the vector defined by the two points $p$ and $q$, then the norm (or length) of $v$ remains the same as the object moves: $\|v(t)\| =$ constant. A rigid body motion is then a family of transformations that describe how the coordinates of every point on the object change as a function of time. We denote it by $g$:
$$g(t) : \mathbb{R}^3 \to \mathbb{R}^3;\qquad X \mapsto g(t)(X).$$
If, instead of looking at the entire continuous moving path of the object, we concentrate on the transformation between its initial and final configurations, this transformation is usually called a rigid body displacement and is denoted by a single mapping:
$$g : \mathbb{R}^3 \to \mathbb{R}^3;\qquad X \mapsto g(X).$$
Besides transforming the coordinates of points, $g$ also induces a transformation on vectors. Suppose $v$ is a vector defined by two points $p$ and $q$ with coordinates $Y$ and $X$, respectively: $v = Y - X$; then, after the transformation $g$, we obtain a new vector
$$g_*(v) \doteq g(Y) - g(X).$$
That $g$ preserves the distance between any two points can be expressed in terms of vectors as $\|g_*(v)\| = \|v\|$ for all $v \in \mathbb{R}^3$.
However, preserving distances between points is not the only requirement
that a rigid object moving in space satisfies. In fact, there are transformations that preserve distances, and yet they are not physically realizable.
For instance, the mapping
$$f : [X_1, X_2, X_3]^T \mapsto [X_1, X_2, -X_3]^T$$
preserves distances but not orientation: it corresponds to a reflection of points about the $XY$ plane, as in a double-sided mirror. To rule out this type
of mapping, we require that any rigid body motion, besides preserving
distance, preserves orientation as well. That is, in addition to preserving
the norm of vectors, it must preserve their cross product. The coordinate
transformation induced by a rigid body motion is called a special Euclidean
transformation. The word special indicates the fact that it is orientation-preserving.
Definition 2.3 (Rigid body motion or special Euclidean transformation). A mapping $g : \mathbb{R}^3 \to \mathbb{R}^3$ is a rigid body motion or a special Euclidean transformation if it preserves the norm and the cross product of any two vectors:
1. Norm: $\|g_*(v)\| = \|v\|$, $\forall v \in \mathbb{R}^3$;
2. Cross product: $g_*(u) \times g_*(v) = g_*(u \times v)$, $\forall u, v \in \mathbb{R}^3$.
Since the inner product can be recovered from the norm (by the polarization identity), preservation of the norm is equivalent to preservation of the inner product:
$$u^T v = g_*(u)^T g_*(v), \qquad \forall u, v \in \mathbb{R}^3. \qquad (2.4)$$
In other words, a rigid body motion can also be defined as one that preserves
both inner product and cross product.
How do these properties help us describe rigid motion concisely? The fact that distances and orientations are preserved means that individual points on the object cannot translate relative to each other. They can rotate relative to each other, but only collectively, so as not to alter the mutual distances between points. Therefore, a rigid body motion
can be described by the motion of any one point on the body, and the
rotation of a coordinate frame attached to that point. In order to do this,
we represent the configuration of a rigid body by attaching a Cartesian
coordinate frame to some point on the rigid body and keeping track of the
motion of this coordinate frame relative to a fixed frame.
To see this, consider a fixed (world) coordinate frame given by three orthonormal vectors $e_1, e_2, e_3 \in \mathbb{R}^3$; that is, they satisfy
$$e_i^T e_j = \delta_{ij} \doteq \begin{cases}1, & i = j,\\ 0, & i \neq j.\end{cases} \qquad (2.6)$$
Typically, the vectors are ordered so as to form a right-handed frame: $e_1 \times e_2 = e_3$. Then, after a rigid body motion $g$, we have
$$g_*(e_i)^T g_*(e_j) = \delta_{ij}, \qquad g_*(e_1) \times g_*(e_2) = g_*(e_3). \qquad (2.7)$$
That is, the resulting three vectors still form a right-handed orthonormal
frame. Therefore, a rigid object can always be associated with a righthanded, orthonormal frame, and its rigid body motion can be entirely
specified by the motion of such a frame, which we call the object frame. In Figure 2.1 we show an object (in this case a camera) moving relative to a
fixed world coordinate frame W . In order to specify the configuration of
the camera relative to the world frame W , one may pick a fixed point o on
the camera and attach to it an orthonormal frame, the camera coordinate
frame C. When the camera moves, the camera frame also moves as if it
were fixed to the camera. The configuration of the camera is then determined by (1) the vector between the origin of the world frame o and the
camera frame, g(o), called the translational part and denoted as T , and
(2) the relative orientation of the camera frame C, with coordinate axes
15
g (e1 ), g (e2 ), g (e3 ), relative to the fixed world frame W with coordinate
axes $e_1, e_2, e_3$, called the rotational part and denoted by $R$.
In the case of vision, there is no obvious choice of the origin o and the
reference frame e1 , e2 , e3 . Therefore, we could choose the world frame to
be attached to the camera and specify the translation and rotation of the
scene (assuming it is rigid!), or we could attach the world frame to the
scene and specify the motion of the camera. All that matters is the relative
motion between the scene and the camera, and what choice of reference
frame to make is, from the point of view of geometry, arbitrary.
Remark 2.1. The set of rigid body motions, or special Euclidean transformations, is a (Lie) group, the so-called special Euclidean group, typically denoted by SE(3). Algebraically, a group is a set $G$ with an operation of (binary) multiplication $\circ$ on elements of $G$ which is:
- closed: if $g_1, g_2 \in G$, then $g_1 \circ g_2 \in G$;
- associative: $(g_1 \circ g_2) \circ g_3 = g_1 \circ (g_2 \circ g_3)$, for all $g_1, g_2, g_3 \in G$;
- with unit element $e$: $e \circ g = g \circ e = g$, for all $g \in G$;
- invertible: for every element $g \in G$, there exists an element $g^{-1} \in G$ such that $g \circ g^{-1} = g^{-1} \circ g = e$.
In the next few sections, we will focus on studying in detail how to represent
the special Euclidean group SE(3). More specifically, we will introduce a
way to realize elements in the special Euclidean group SE(3) as elements
in a group of $n \times n$ non-singular (real) matrices whose multiplication is simply the matrix multiplication. Such a group of matrices is usually called a general linear group, denoted by $GL(n)$, and such a realization is called a matrix representation. A representation is a map
$$\mathcal{R} : SE(3) \to GL(n);\qquad g \mapsto \mathcal{R}(g)$$
which preserves the group structure of SE(3). That is, the inverse of a rigid body motion and the composition of two rigid body motions are preserved by the map in the following way:
$$\mathcal{R}(g^{-1}) = \mathcal{R}(g)^{-1}, \qquad \mathcal{R}(g \circ h) = \mathcal{R}(g)\mathcal{R}(h), \qquad \forall g, h \in SE(3). \qquad (2.8)$$
Figure 2.1. A rigid body motion $g = (R, T)$ which, in this instance, is between a camera and a world coordinate frame.
Figure 2.2. Rotation of a rigid body about a fixed point o. The solid coordinate frame W is fixed and the dashed coordinate frame C is attached to the rotating rigid body.
The coordinates $r_1, r_2, r_3 \in \mathbb{R}^3$ of the three axes of the rotating frame $C$, expressed relative to the fixed frame $W$, can be stacked as the columns of a matrix $R_{wc} \doteq [r_1, r_2, r_3]$, which satisfies $R_{wc}^T R_{wc} = I$. Any matrix which satisfies this identity is called an orthogonal matrix. It follows from the definition that the inverse of an orthogonal matrix is simply its transpose: $R_{wc}^{-1} = R_{wc}^T$. Since $r_1, r_2, r_3$ form a right-handed frame, the determinant of $R_{wc}$ must be $+1$. This can be seen by looking at the determinant of the rotation matrix,
$$\det R = r_1^T (r_2 \times r_3),$$
which is equal to $+1$ for right-handed coordinate systems. Hence $R_{wc}$ is a special orthogonal matrix where, as before, the word special indicates that it is orientation-preserving. The space of all such special orthogonal matrices in $\mathbb{R}^{3\times3}$ is usually denoted by
$$SO(3) \doteq \{R \in \mathbb{R}^{3\times3} \mid R^T R = I, \; \det(R) = +1\}.$$
Traditionally, $3 \times 3$ special orthogonal matrices are called rotation matrices for obvious reasons. It is straightforward to verify that SO(3) has a group structure; that is, it satisfies all four axioms of a group mentioned in the previous section. We leave the proof to the reader as an exercise. Therefore, the space SO(3) is also referred to as the special orthogonal group of $\mathbb{R}^3$,
or simply the rotation group. Directly from the definition of the rotation
matrix, we can show that rotation indeed preserves both the inner product
and the cross product of vectors. We also leave this as an exercise to the
reader.
Going back to Figure 2.2, every rotation matrix $R_{wc} \in SO(3)$ represents a possible configuration of the object rotated about the point $o$. Besides this, $R_{wc}$ plays a second role as the matrix that represents the coordinate transformation from the frame $C$ to the frame $W$. To see this, suppose that, for a given point $p \in \mathbb{E}^3$, its coordinates with respect to the frame $W$ are $X_w = [X_{1w}, X_{2w}, X_{3w}]^T \in \mathbb{R}^3$. Since $r_1, r_2, r_3$ also form a basis for $\mathbb{R}^3$, $X_w$ can be expressed as a linear combination of these three vectors, say $X_w = X_{1c}r_1 + X_{2c}r_2 + X_{3c}r_3$ with $[X_{1c}, X_{2c}, X_{3c}]^T \in \mathbb{R}^3$. Obviously, $X_c = [X_{1c}, X_{2c}, X_{3c}]^T$ are the coordinates of the same point $p$ with respect to the frame $C$, and therefore $X_w = R_{wc}X_c$.
Now consider a camera frame that rotates relative to the world frame over time $t$.
Then, for a rotating camera, the world coordinates Xw of a fixed 3-D point
p are transformed to its coordinates relative to the camera frame C by:
Xc (t) = Rcw (t)Xw .
Alternatively, if a point $p$ is fixed with respect to the camera frame and has coordinates $X_c$ there, its world coordinates $X_w(t)$ as a function of $t$ are given by:
Xw (t) = Rwc (t)Xc .
Now consider the rotation $R(t)$ as a continuous function of time. Since $R(t)R^T(t) = I$ for all $t$, differentiating both sides with respect to time yields
$$\dot R(t)R^T(t) + R(t)\dot R^T(t) = 0 \quad\Longrightarrow\quad \dot R(t)R^T(t) = -\big(\dot R(t)R^T(t)\big)^T.$$
The resulting constraint reflects the fact that the matrix $\dot R(t)R^T(t) \in \mathbb{R}^{3\times3}$ is skew symmetric (see Appendix A). Then, as we have seen, there must exist a vector, say $\omega(t) \in \mathbb{R}^3$, such that
$$\hat\omega(t) = \dot R(t)R^T(t),$$
or, multiplying both sides by $R(t)$ on the right,
$$\dot R(t) = \hat\omega(t)R(t). \qquad (2.9)$$
If $\omega(t) = \omega$ is constant, this becomes a linear ordinary differential equation (ODE) with constant coefficients:
$$\dot R(t) = \hat\omega R(t). \qquad (2.10)$$
Consider first the analogous linear ODE on $\mathbb{R}^3$,
$$\dot x(t) = \hat\omega\, x(t), \qquad x(t) \in \mathbb{R}^3.$$
It is then immediate to verify that the solution to the above ODE is given by
$$x(t) = e^{\hat\omega t}x(0), \qquad (2.11)$$
where $e^{\hat\omega t}$ is the matrix exponential
$$e^{\hat\omega t} = I + \hat\omega t + \frac{(\hat\omega t)^2}{2!} + \cdots + \frac{(\hat\omega t)^n}{n!} + \cdots. \qquad (2.12)$$
By the same argument, the solution of (2.10) with initial condition $R(0) = I$ is
$$R(t) = e^{\hat\omega t}. \qquad (2.13)$$
To confirm that the matrix $e^{\hat\omega t}$ is indeed a rotation matrix, one can directly show from the definition of the matrix exponential that
$$\big(e^{\hat\omega t}\big)^{-1} = e^{-\hat\omega t} = e^{\hat\omega^T t} = \big(e^{\hat\omega t}\big)^T,$$
so that $(e^{\hat\omega t})^T e^{\hat\omega t} = I$. The matrix exponential therefore defines a map from the space of skew-symmetric matrices to the rotation matrices:
$$\exp : so(3) \to SO(3);\qquad \hat\omega \mapsto e^{\hat\omega} \in SO(3).$$
Note that we obtained the expression (2.13) by assuming that $\omega(t)$ in (2.9) is constant. This is, however, not always the case. So a question naturally arises: can every rotation matrix $R \in SO(3)$ be expressed in an exponential form as in (2.13)? The answer is yes, and the fact is stated in the following theorem:
Theorem 2.1 (Surjectivity of the exponential map onto SO(3)). For any $R \in SO(3)$, there exists a (not necessarily unique) $\omega \in \mathbb{R}^3$ with $\|\omega\| = 1$ and $t \in \mathbb{R}$ such that $R = e^{\hat\omega t}$.
Proof. The proof of this theorem is by construction: given a rotation matrix $R$ with entries $r_{ij}$, the rotation angle $t$ and the rotation axis $\omega$ can be computed as
$$t = \arccos\!\left(\frac{\mathrm{trace}(R) - 1}{2}\right), \qquad \omega = \frac{1}{2\sin(t)}\begin{bmatrix} r_{32} - r_{23}\\ r_{13} - r_{31}\\ r_{21} - r_{12}\end{bmatrix}.$$
One can then verify that $R = e^{\hat\omega t}$.
The significance of this theorem is that any rotation matrix can be realized by rotating about some fixed axis by a certain angle. The theorem, however, only guarantees the surjectivity of the exponential map from so(3) to SO(3). Unfortunately, this map is not injective, hence not one-to-one. This will become clear after we have introduced the so-called Rodrigues formula for computing $R = e^{\hat\omega t}$.
From the constructive proof of Theorem 2.1, we now know how to compute the exponential coordinates $(\omega, t)$ for a given rotation matrix $R \in SO(3)$. In the opposite direction, given $(\omega, t)$, how do we effectively compute the corresponding rotation matrix $R = e^{\hat\omega t}$? One can certainly use the series (2.12) from the definition. The following theorem, however, simplifies the computation dramatically:
Theorem 2.2 (Rodrigues formula for rotation matrix). Given $\omega \in \mathbb{R}^3$ with $\|\omega\| = 1$ and $t \in \mathbb{R}$, the matrix exponential $R = e^{\hat\omega t}$ is given by the following formula:
$$e^{\hat\omega t} = I + \hat\omega\sin(t) + \hat\omega^2\big(1 - \cos(t)\big). \qquad (2.14)$$
Proof (sketch). For a unit vector $\omega$, the powers of $\hat\omega$ satisfy
$$\hat\omega^2 = \omega\omega^T - I, \qquad \hat\omega^3 = -\hat\omega,$$
so the series (2.12) can be regrouped into one term in $\hat\omega$ and one term in $\hat\omega^2$; what appears in the brackets are exactly the series for $\sin(t)$ and $(1 - \cos(t))$. Hence $e^{\hat\omega t} = I + \hat\omega\sin(t) + \hat\omega^2(1 - \cos(t))$.
Notice that $e^{\hat\omega(t + 2k\pi)} = e^{\hat\omega t}$ for all $k \in \mathbb{Z}$. Hence, for a given rotation matrix $R \in SO(3)$ there are typically infinitely many exponential coordinates $(\omega, t)$ such that $e^{\hat\omega t} = R$. The exponential map $\exp : so(3) \to SO(3)$ is therefore not one-to-one. It is also useful to know that the exponential map is not commutative: for two $\hat\omega_1, \hat\omega_2 \in so(3)$, usually
$$e^{\hat\omega_1}e^{\hat\omega_2} \neq e^{\hat\omega_1 + \hat\omega_2}$$
unless $\hat\omega_1\hat\omega_2 = \hat\omega_2\hat\omega_1$.
The quantity that measures this non-commutativity is the Lie bracket
$$[\hat\omega_1, \hat\omega_2] \doteq \hat\omega_1\hat\omega_2 - \hat\omega_2\hat\omega_1, \qquad \hat\omega_1, \hat\omega_2 \in so(3).$$
Obviously, $[\hat\omega_1, \hat\omega_2]$ is also a skew-symmetric matrix in so(3). The linear structure of so(3), together with the Lie bracket, forms the Lie algebra of the (Lie) group SO(3). For more details on the Lie group structure of SO(3), the reader may refer to [MLS94]. The set of all rotation matrices $e^{\hat\omega t}$, $t \in \mathbb{R}$, is called a one-parameter subgroup of SO(3), and the multiplication in such a subgroup is commutative: for the same $\omega \in \mathbb{R}^3$, we have
$$e^{\hat\omega t_1}e^{\hat\omega t_2} = e^{\hat\omega t_2}e^{\hat\omega t_1} = e^{\hat\omega(t_1 + t_2)}, \qquad \forall t_1, t_2 \in \mathbb{R}.$$
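A minimal sketch (ours, not the book's) of the Rodrigues formula (2.14) and of the exponential coordinates recovered as in the proof of Theorem 2.1; it assumes $\|\omega\| = 1$ and a rotation angle strictly between 0 and $\pi$.

import numpy as np

def hat(w):
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def rodrigues(w, t):
    # R = exp(w_hat * t) via equation (2.14); w must be a unit vector.
    W = hat(w)
    return np.eye(3) + W * np.sin(t) + W @ W * (1.0 - np.cos(t))

def log_SO3(R):
    # Recover (w, t) from R as in the proof of Theorem 2.1 (valid for 0 < t < pi).
    t = np.arccos((np.trace(R) - 1.0) / 2.0)
    w = np.array([R[2, 1] - R[1, 2],
                  R[0, 2] - R[2, 0],
                  R[1, 0] - R[0, 1]]) / (2.0 * np.sin(t))
    return w, t

w = np.array([1.0, 2.0, 2.0]) / 3.0     # a unit rotation axis
t = 0.7                                  # rotation angle
R = rodrigues(w, t)
assert np.allclose(R.T @ R, np.eye(3)) and np.isclose(np.linalg.det(R), 1.0)
w2, t2 = log_SO3(R)
assert np.allclose(w, w2) and np.isclose(t, t2)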
Another popular representation of rotations uses quaternions. The space of quaternions $\mathbb{H}$ can be viewed as $\mathbb{R}^4$ endowed with a multiplication: a quaternion is written as
$$q = q_0 + q_1 i + q_2 j + q_3 ij, \qquad q_0, q_1, q_2, q_3 \in \mathbb{R}, \qquad (2.16)$$
where the symbols $i$ and $j$ satisfy $i^2 = j^2 = -1$ and $ij = -ji$. The conjugate of $q$ is
$$\bar q = q_0 - q_1 i - q_2 j - q_3 ij, \qquad (2.17)$$
its norm is $\|q\|^2 = q\bar q = q_0^2 + q_1^2 + q_2^2 + q_3^2$, and the inverse of a nonzero quaternion is $q^{-1} = \bar q/\|q\|^2$.
The multiplication and inverse rules defined above in fact endow the space $\mathbb{R}^4$ with the algebraic structure of a skew field. $\mathbb{H}$ is often called the Hamiltonian field, besides its more common name, the quaternion field.
One important use of the quaternion field $\mathbb{H}$ is that we can embed the rotation group SO(3) into it. To see this, let us focus on a special subgroup of $\mathbb{H}$, the so-called unit quaternions:
$$\mathbb{S}^3 = \{q \in \mathbb{H} \mid \|q\|^2 = q_0^2 + q_1^2 + q_2^2 + q_3^2 = 1\}. \qquad (2.20)$$
It is obvious that the set of all unit quaternions is simply the unit sphere
in R4 . To show that S3 is indeed a group, we simply need to prove that
it is closed under the multiplication and inverse of quaternions, i.e. the
multiplication of two unit quaternions is still a unit quaternion and so is
the inverse of a unit quaternion. We leave this simple fact as an exercise to
the reader.
Given a rotation matrix $R = e^{\hat\omega t}$ with $\omega = [\omega_1, \omega_2, \omega_3]^T \in \mathbb{R}^3$, $\|\omega\| = 1$, and $t \in \mathbb{R}$, we can associate to it a unit quaternion as follows:
$$q(R) = \cos(t/2) + \sin(t/2)\,(\omega_1 i + \omega_2 j + \omega_3 ij) \in \mathbb{S}^3. \qquad (2.21)$$
One may verify that this association preserves the group structure between SO(3) and $\mathbb{S}^3$:
$$q(R^{-1}) = q^{-1}(R), \qquad q(R_1 R_2) = q(R_1)q(R_2), \qquad \forall R, R_1, R_2 \in SO(3). \qquad (2.22)$$
Further study shows that this association is also faithful: different rotation matrices are associated with different unit quaternions. In the opposite direction, given a unit quaternion $q = q_0 + q_1 i + q_2 j + q_3 ij \in \mathbb{S}^3$, we can use the following formulas to find the corresponding rotation matrix $R(q) = e^{\hat\omega t}$:
$$t = 2\arccos(q_0), \qquad \omega_m = \begin{cases} q_m/\sin(t/2), & t \neq 0,\\ 0, & t = 0,\end{cases} \qquad m = 1, 2, 3. \qquad (2.23)$$
However, one must notice that, according to the above formulas, there are two unit quaternions that correspond to the same rotation matrix: $R(q) = R(-q)$, as shown in Figure 2.3. Therefore, topologically, $\mathbb{S}^3$ is a double covering of SO(3).
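The conversions (2.21) and (2.23) between exponential coordinates and unit quaternions are easy to sketch in code (our illustration; it assumes $\|\omega\| = 1$ and $0 < t < 2\pi$).

import numpy as np

def quat_from_axis_angle(w, t):
    # Unit quaternion q = [q0, q1, q2, q3] from (w, t) as in (2.21).
    return np.concatenate(([np.cos(t / 2.0)], np.sin(t / 2.0) * w))

def axis_angle_from_quat(q):
    # Recover (w, t) from a unit quaternion as in (2.23).
    t = 2.0 * np.arccos(q[0])
    if np.isclose(t, 0.0):
        return np.zeros(3), 0.0
    return q[1:] / np.sin(t / 2.0), t

w = np.array([0.0, 0.6, 0.8])            # unit rotation axis
t = 1.2                                   # rotation angle
q = quat_from_axis_angle(w, t)
assert np.isclose(np.dot(q, q), 1.0)      # q lies on the unit sphere S^3
w2, t2 = axis_angle_from_quat(q)
assert np.allclose(w, w2) and np.isclose(t, t2)
# Note: q and -q represent the same rotation (axis -w, angle 2*pi - t).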
Both the exponential coordinates and the unit quaternions are global parameterizations: they represent every rotation matrix in practically the same way. On the other hand, the Lie-Cartan coordinates introduced below fall into the category of local parameterizations: they are only good for a portion of SO(3), not for the entire space. The advantage of such local parameterizations is that we usually need only three parameters to describe a rotation matrix, instead of four as for both the exponential coordinates $(\omega, t) \in \mathbb{R}^4$ and the unit quaternions $q \in \mathbb{S}^3 \subset \mathbb{R}^4$.
In the space of skew-symmetric matrices so(3), pick a basis $(\hat\omega_1, \hat\omega_2, \hat\omega_3)$, i.e. the three vectors $\omega_1, \omega_2, \omega_3$ are linearly independent. Define a mapping (a parameterization) from $\mathbb{R}^3$ to SO(3) as
$$\alpha : (\alpha_1, \alpha_2, \alpha_3) \mapsto \exp(\alpha_1\hat\omega_1 + \alpha_2\hat\omega_2 + \alpha_3\hat\omega_3). \qquad (2.24)$$
We now study how to represent a rigid body motion in general, that is, a motion with both rotation and translation.
Figure 2.4 illustrates a moving rigid object with a coordinate frame $C$ attached to it.
Figure 2.4. A rigid body motion between a moving frame C and a world frame
W.
To describe the coordinates of a point $p$ on the object with respect to the world frame $W$, it is clear from the figure that the vector $X_w$ is simply the sum of the translation $T_{wc} \in \mathbb{R}^3$ of the center of frame $C$ relative to that of frame $W$ and the vector $X_c$, but expressed relative to the frame $W$. Since $X_c$ are the coordinates of the point $p$ relative to the frame $C$, with respect to the world frame $W$ they become $R_{wc}X_c$, where $R_{wc} \in SO(3)$ is the relative rotation between the two frames. Hence the coordinates $X_w$ are given by:
$$X_w = R_{wc}X_c + T_{wc}. \qquad (2.25)$$
Usually, we denote the full rigid motion by $g_{wc} = (R_{wc}, T_{wc})$, or simply $g = (R, T)$ if the frames involved are clear from the context. Then $g$ represents not only a description of the configuration of the rigid body but also a transformation of coordinates between the frames. In compact form we may simply write:
$$X_w = g_{wc}(X_c).$$
The set of all possible configurations of a rigid body can then be described as
$$SE(3) \doteq \{g = (R, T) \mid R \in SO(3),\; T \in \mathbb{R}^3\} = SO(3) \times \mathbb{R}^3,$$
the so-called special Euclidean group SE(3). Note that $g = (R, T)$ is not yet a matrix representation of the group SE(3) as defined in Section 2.2. To obtain such a representation, we must introduce the so-called homogeneous coordinates.
Appending a 1 to the coordinates $X = [X_1, X_2, X_3]^T \in \mathbb{R}^3$ of a point $p$ yields its homogeneous coordinates:
$$\bar X \doteq \begin{bmatrix} X_1\\ X_2\\ X_3\\ 1\end{bmatrix} \in \mathbb{R}^4.$$
The homogeneous coordinates of a vector $v = Y - X$ are correspondingly defined as the difference between the homogeneous coordinates of the two points, and therefore have a 0 as their last entry:
$$\bar v \doteq \bar Y - \bar X = \begin{bmatrix} v_1\\ v_2\\ v_3\\ 0\end{bmatrix} \in \mathbb{R}^4.$$
Notice that, in $\mathbb{R}^4$, vectors of the above form span a subspace, hence all linear structures of the original vectors $v \in \mathbb{R}^3$ are perfectly preserved by the new representation. Using this notation, the transformation (2.25) can be rewritten as
$$\bar X_w = \begin{bmatrix} R_{wc} & T_{wc}\\ 0 & 1\end{bmatrix}\bar X_c \;\doteq\; \bar g_{wc}\bar X_c.$$
The homogeneous representation of $g = (R, T) \in SE(3)$ is therefore the $4\times4$ matrix
$$g = \begin{bmatrix} R & T\\ 0 & 1\end{bmatrix} \in \mathbb{R}^{4\times4},$$
and its inverse is again an element of SE(3):
$$g^{-1} = \begin{bmatrix} R & T\\ 0 & 1\end{bmatrix}^{-1} = \begin{bmatrix} R^T & -R^T T\\ 0 & 1\end{bmatrix} \in SE(3).$$
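In code, the homogeneous representation and its closed-form inverse look as follows (a minimal sketch of ours, not a library API).

import numpy as np

def se3_matrix(R, T):
    # Homogeneous 4x4 representation g = [[R, T], [0, 1]] of g = (R, T).
    g = np.eye(4)
    g[:3, :3] = R
    g[:3, 3] = T
    return g

def se3_inverse(g):
    # Closed-form inverse [[R^T, -R^T T], [0, 1]], cheaper than a generic matrix inverse.
    R, T = g[:3, :3], g[:3, 3]
    return se3_matrix(R.T, -R.T @ T)

# Example: rotation of 90 degrees about the z-axis plus a translation.
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
T = np.array([1.0, 2.0, 3.0])
g = se3_matrix(R, T)
assert np.allclose(g @ se3_inverse(g), np.eye(4))

# Transforming a point: append 1 to its coordinates (homogeneous coordinates).
Xc = np.array([0.5, 0.0, 2.0, 1.0])
Xw = g @ Xc     # equation (2.25) written as a single matrix product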
In the homogeneous representation, the action of a rigid body motion on a vector $v = X(q) - X(p)$ becomes linear:
$$g_*(\bar v) = g\bar X(q) - g\bar X(p) = g\bar v.$$
Consider now a rigid body motion that varies continuously over time, $g(t) \in SE(3)$, and the matrix $\dot g(t)g^{-1}(t)$:
$$\dot g(t)g^{-1}(t) = \begin{bmatrix} \dot R(t)R^T(t) & \dot T(t) - \dot R(t)R^T(t)T(t)\\ 0 & 0\end{bmatrix}. \qquad (2.27)$$
Defining $\hat\omega(t) \doteq \dot R(t)R^T(t)$ and $v(t) \doteq \dot T(t) - \hat\omega(t)T(t)$, we can write
$$\dot g(t)g^{-1}(t) = \begin{bmatrix} \hat\omega(t) & v(t)\\ 0 & 0\end{bmatrix} \;\doteq\; \hat\xi(t) \in \mathbb{R}^{4\times4},$$
so that
$$\dot g(t) = \big(\dot g(t)g^{-1}(t)\big)g(t) = \hat\xi(t)g(t). \qquad (2.28)$$
The matrix $\hat\xi(t)$ can be viewed as the tangent vector along the curve $g(t)$ and can be used to approximate $g(t)$ locally:
$$g(t + dt) \approx g(t) + \hat\xi(t)g(t)\,dt = \big(I + \hat\xi(t)\,dt\big)g(t).$$
A matrix of the form $\hat\xi$ is called a twist. The set of all twists is denoted by
$$se(3) \doteq \left\{\hat\xi = \begin{bmatrix} \hat\omega & v\\ 0 & 0\end{bmatrix} \;\middle|\; \hat\omega \in so(3),\; v \in \mathbb{R}^3\right\} \subset \mathbb{R}^{4\times4}.$$
se(3) is called the tangent space (or Lie algebra) of the matrix group SE(3). We also define two operators, $\vee$ and $\wedge$, to convert between a twist $\hat\xi \in se(3)$ and its twist coordinates $\xi \in \mathbb{R}^6$ as follows:
$$\begin{bmatrix} \hat\omega & v\\ 0 & 0\end{bmatrix}^{\vee} \doteq \begin{bmatrix} v\\ \omega\end{bmatrix} \in \mathbb{R}^6, \qquad \begin{bmatrix} v\\ \omega\end{bmatrix}^{\wedge} \doteq \begin{bmatrix} \hat\omega & v\\ 0 & 0\end{bmatrix} \in \mathbb{R}^{4\times4}.$$
If the twist $\hat\xi$ is constant, equation (2.28) becomes the linear ODE $\dot g(t) = \hat\xi g(t)$, whose solution is
$$g(t) = e^{\hat\xi t}g(0).$$
Assuming the initial condition $g(0) = I$, we may conclude that
$$g(t) = e^{\hat\xi t},$$
where the twist exponential is
$$e^{\hat\xi t} = I + \hat\xi t + \frac{(\hat\xi t)^2}{2!} + \cdots + \frac{(\hat\xi t)^n}{n!} + \cdots. \qquad (2.29)$$
Using the identities of $\hat\omega$ discussed above, this series can be summed in closed form (for $\|\omega\| = 1$):
$$e^{\hat\xi t} = \begin{bmatrix} e^{\hat\omega t} & \big(I - e^{\hat\omega t}\big)\hat\omega v + \omega\omega^T v\, t\\ 0 & 1\end{bmatrix}. \qquad (2.30)$$
It is clear from this expression that the exponential of $\hat\xi t$ is indeed a rigid body transformation matrix in SE(3). Therefore the exponential map defines a mapping from the space se(3) to SE(3),
$$\exp : se(3) \to SE(3);\qquad \hat\xi \in se(3) \mapsto e^{\hat\xi} \in SE(3),$$
and the twist $\hat\xi \in se(3)$ is also called the exponential coordinates for SE(3), as $\hat\omega \in so(3)$ is for SO(3).
One question remains: can every rigid body motion $g \in SE(3)$ be represented in such an exponential form? The answer is yes, and it is formulated in the following theorem:
Theorem 2.3 (Surjectivity of the exponential map onto SE(3)). For any $g \in SE(3)$, there exist (not necessarily unique) twist coordinates $\xi = (v, \omega)$ and $t \in \mathbb{R}$ such that $g = e^{\hat\xi t}$.
The proof is again constructive: the rotational part $(\omega, t)$ is recovered as in Theorem 2.1, and the translational part then follows from (2.30) as
$$v = \big[(I - e^{\hat\omega t})\hat\omega + \omega\omega^T t\big]^{-1}T.$$
The Lie bracket $[\hat\xi_1, \hat\xi_2] \doteq \hat\xi_1\hat\xi_2 - \hat\xi_2\hat\xi_1$ makes se(3) the Lie algebra of SE(3). Two rigid body motions $g_1 = e^{\hat\xi_1}$ and $g_2 = e^{\hat\xi_2}$ commute with each other, $g_1g_2 = g_2g_1$, only if $[\hat\xi_1, \hat\xi_2] = 0$.
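A sketch (ours) of the twist exponential (2.30), assuming $\|\omega\| = 1$; SciPy's general matrix exponential is used only as a numerical check. For $\omega = 0$ the formula degenerates to a pure translation.

import numpy as np
from scipy.linalg import expm    # generic matrix exponential, used here only to verify (2.30)

def hat(w):
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_twist(v, w, t):
    # e^{xi_hat t} for twist coordinates xi = (v, w) with ||w|| = 1, equation (2.30).
    W = hat(w)
    eWt = np.eye(3) + W * np.sin(t) + W @ W * (1.0 - np.cos(t))   # Rodrigues formula
    p = (np.eye(3) - eWt) @ W @ v + np.outer(w, w) @ v * t
    g = np.eye(4)
    g[:3, :3] = eWt
    g[:3, 3] = p
    return g

v = np.array([0.3, -0.2, 0.5])
w = np.array([0.0, 0.0, 1.0])
t = 0.9

xi_hat = np.zeros((4, 4))
xi_hat[:3, :3] = hat(w)
xi_hat[:3, 3] = v
assert np.allclose(exp_twist(v, w, t), expm(xi_hat * t))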
Consider now a point $p$ attached to a moving rigid body (or, equivalently, viewed from a moving camera). Its coordinates as a function of time are given by a rigid body motion $g(t) = (R(t), T(t))$ acting on the initial coordinates $X_0$:
$$X(t) = R(t)X_0 + T(t), \qquad (2.31)$$
or, in homogeneous representation,
$$\bar X(t) = g(t)\bar X_0. \qquad (2.32)$$
When several time instants $t_1, t_2 \in \mathbb{R}$ are involved, we denote by $g(t_2, t_1)$ the coordinate transformation from time $t_1$ to time $t_2$.
Then the following relationship holds among the coordinates of the same point $p$ at three time instants $t_1, t_2, t_3$:
$$X(t_3) = g(t_3, t_2)X(t_2) = g(t_3, t_2)g(t_2, t_1)X(t_1).$$
Comparing with the direct relationship between the coordinates at $t_3$ and $t_1$,
$$X(t_3) = g(t_3, t_1)X(t_1),$$
the following composition rule for consecutive motions must hold:
$$g(t_3, t_1) = g(t_3, t_2)g(t_2, t_1).$$
The composition rule describes the coordinates $X$ of the point $p$ relative to any camera position, if they are known with respect to a particular one. The same composition rule implies the rule of inverse,
$$g^{-1}(t_2, t_1) = g(t_1, t_2),$$
since $g(t_2, t_1)g(t_1, t_2) = g(t_2, t_2) = I$. In case time has no physical meaning for a particular problem, we may use $g_{ij}$ as a shorthand for $g(t_i, t_j)$, with $t_i, t_j \in \mathbb{R}$.
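A two-line numerical check (ours) of the composition and inverse rules, using randomly generated test poses in homogeneous form.

import numpy as np

rng = np.random.default_rng(1)

def random_pose(rng):
    # A random rigid body motion in homogeneous form, for testing only.
    A = rng.standard_normal((3, 3))
    Q, _ = np.linalg.qr(A)
    Q *= np.sign(np.linalg.det(Q))     # force det = +1 so that Q is a rotation
    g = np.eye(4)
    g[:3, :3] = Q
    g[:3, 3] = rng.standard_normal(3)
    return g

g21 = random_pose(rng)                       # g(t2, t1)
g32 = random_pose(rng)                       # g(t3, t2)
g31 = g32 @ g21                              # composition rule g(t3, t1) = g(t3, t2) g(t2, t1)
X1 = np.append(rng.standard_normal(3), 1.0)  # a point in homogeneous coordinates at time t1
assert np.allclose(g31 @ X1, g32 @ (g21 @ X1))
assert np.allclose(np.linalg.inv(g21) @ (g21 @ X1), X1)   # inverse rule g(t1, t2) = g(t2, t1)^{-1}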
Having understood the transformation of coordinates, we now study what happens to velocity. We know that the coordinates $X(t)$ of a point $p \in \mathbb{E}^3$ relative to a moving camera are a function of time $t$:
$$X(t) = g_{cw}(t)X_0.$$
Then the velocity of the point $p$ relative to the (instantaneous) camera frame is
$$\dot X(t) = \dot g_{cw}(t)X_0. \qquad (2.33)$$
Substituting $X_0 = g_{cw}^{-1}(t)X(t)$, the above equation can be rewritten as
$$\dot X(t) = \dot g_{cw}(t)g_{cw}^{-1}(t)X(t) \;\doteq\; \hat V^c_{cw}(t)X(t), \qquad (2.34)$$
where an expression for $\dot g_{cw}(t)g_{cw}^{-1}(t)$ can be found in (2.27). Since $\hat V^c_{cw}(t)$ is of the form
$$\hat V^c_{cw}(t) = \begin{bmatrix} \hat\omega(t) & v(t)\\ 0 & 0\end{bmatrix},$$
we can also write the velocity of the point in 3-D coordinates (instead of in homogeneous coordinates) as
$$\dot X(t) = \hat\omega(t)X(t) + v(t). \qquad (2.35)$$
The physical interpretation of the symbol $\hat V^c_{cw}$ is the velocity of the world frame moving relative to the camera frame, as viewed in the camera frame; the subscript and superscript of $\hat V^c_{cw}$ indicate exactly that. Usually, to clearly specify the physical meaning of a velocity, we need to specify the velocity of which frame, moving relative to which frame, and viewed in which frame. If we change where we view the velocity, the expression changes accordingly. For example, suppose that a viewer is in another coordinate frame, displaced relative to the camera frame by a rigid body transformation $g \in SE(3)$. Then the coordinates of the same point $p$ relative to this frame are $Y(t) = gX(t)$. Computing the velocity in the new frame, we have
$$\dot Y(t) = g\,\dot g_{cw}(t)g_{cw}^{-1}(t)\,g^{-1}Y(t) = g\hat V^c_{cw}g^{-1}Y(t),$$
so the new velocity is represented by
$$\hat V = g\hat V^c_{cw}g^{-1}.$$
This is simply the same physical quantity viewed from a different vantage point. We see that the two velocities are related through a mapping defined by the relative motion $g$:
$$\mathrm{ad}_g : se(3) \to se(3);\qquad \hat\xi \mapsto g\hat\xi g^{-1}.$$
This is the so-called adjoint map on the space se(3). Using this notation, in the previous example we have $\hat V = \mathrm{ad}_g(\hat V^c_{cw})$. Clearly, the adjoint map transforms a velocity from one frame to another. Using the fact that $g_{cw}(t)g_{wc}(t) = I$, it is straightforward to verify that
$$\hat V^c_{cw} = \dot g_{cw}g_{cw}^{-1} = -g_{wc}^{-1}\dot g_{wc} = -g_{cw}\big(\dot g_{wc}g_{wc}^{-1}\big)g_{cw}^{-1} = \mathrm{ad}_{g_{cw}}\big(-\hat V^w_{wc}\big).$$
Hence $\hat V^c_{cw}$ can also be interpreted as the negated velocity of the camera moving relative to the world frame, viewed in the (instantaneous) camera frame.
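A numerical illustration (ours) of the adjoint map: transforming a velocity by ad_g preserves its twist structure.

import numpy as np

def hat(w):
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def twist(v, w):
    xi = np.zeros((4, 4))
    xi[:3, :3] = hat(w)
    xi[:3, 3] = v
    return xi

def adjoint(g, xi):
    # ad_g(xi_hat) = g xi_hat g^{-1}: the same velocity viewed in a displaced frame.
    return g @ xi @ np.linalg.inv(g)

# A displacement g between the two viewing frames (90 degrees about z, plus a translation).
g = np.eye(4)
g[:3, :3] = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
g[:3, 3] = np.array([1.0, 0.0, 2.0])

V = twist(np.array([0.1, 0.2, 0.3]), np.array([0.0, 0.0, 0.5]))
V_new = adjoint(g, V)

# The result is again a twist: its rotational block is skew symmetric and its last row is zero.
assert np.allclose(V_new[:3, :3], -V_new[:3, :3].T)
assert np.allclose(V_new[3, :], 0.0)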
2.6 Summary
A rigid body motion is an element $g \in SE(3)$. The two most commonly used representations of elements of SE(3) are:
- Homogeneous representation: $g = \begin{bmatrix} R & T\\ 0 & 1\end{bmatrix} \in \mathbb{R}^{4\times4}$ with $R \in SO(3)$ and $T \in \mathbb{R}^3$;
- Twist representation: $g = e^{\hat\xi t}$, with twist coordinates $\xi = (v, \omega) \in \mathbb{R}^6$.
The velocity of a point with respect to a moving camera frame is $\dot X(t) = \hat V^c_{cw}(t)X(t)$, where $\hat V^c_{cw} = \dot g_{cw}g_{cw}^{-1}$ and $g_{cw}(t)$ is the configuration of the camera with respect to the world frame. Using the actual 3-D coordinates, the velocity of a 3-D point yields the familiar relationship
$$\dot X(t) = \hat\omega(t)X(t) + v(t).$$
2.7 References
The presentation of the material in this chapter follows the development
in [?]. More details on the abstract treatment of the material as well as
further references can also be found there.
2.8 Exercises
1. Linear vs. nonlinear maps
Suppose $A, B, C, X \in \mathbb{R}^{n\times n}$. Consider the following maps from $\mathbb{R}^{n\times n}$ to $\mathbb{R}^{n\times n}$ and determine whether they are linear or not. Give a brief proof if true and a counterexample if false:
(a) $X \mapsto AX + XB$
(b) $X \mapsto AX + BXC$
(c) $X \mapsto AXA - B$
(d) $X \mapsto AX + XBX$
2. Skew-symmetric matrices
Given $\omega = [\omega_1, \omega_2, \omega_3]^T \in \mathbb{R}^3$, define
$$\hat\omega = \begin{bmatrix} 0 & -\omega_3 & \omega_2\\ \omega_3 & 0 & -\omega_1\\ -\omega_2 & \omega_1 & 0\end{bmatrix}. \qquad (2.36)$$
According to the definition, $\hat\omega$ is skew-symmetric, i.e. $\hat\omega^T = -\hat\omega$. Now, for any matrix $A \in \mathbb{R}^{3\times3}$ with determinant $\det A = 1$, show that the following equation holds:
$$A^T\hat\omega A = \widehat{A^{-1}\omega}. \qquad (2.37)$$
3. Consider a rigid body motion $g = \begin{bmatrix} R & T\\ 0 & 1\end{bmatrix} \in SE(3)$ and a twist $\hat\xi = \begin{bmatrix} \hat\omega & v\\ 0 & 0\end{bmatrix} \in se(3)$.
Chapter 3
Image formation
This chapter introduces simple mathematical models of the image formation process. In a broad figurative sense, vision is the inverse problem of
image formation: the latter studies how objects give rise to images, while
the former attempts to use images to recover a description of objects in
space. Therefore, designing vision algorithms requires first developing a
suitable model of image formation. Suitable in this context does not necessarily mean physically accurate: the level of abstraction and complexity in
modeling image formation must trade off physical constraints and mathematical simplicity in order to result in a manageable model (i.e. one that
can be easily inverted). Physical models of image formation easily exceed
the level of complexity necessary and appropriate to this book, and determining the right model for the problem at hand is a form of engineering
art.
It comes as no surprise, then, that the study of image formation has for centuries been the domain of artistic reproduction and composition, more so than of mathematics and engineering. A rudimentary understanding of the geometry of image formation, which includes various models for projecting the three-dimensional world onto a plane (e.g., a canvas), is implicit in various forms of visual art from all ancient civilizations. However, the roots of formalizing the geometry of image formation into a mathematical model can be traced back to the work of Euclid in the 4th century B.C. Examples of correct perspective projection are visible in the stunning
frescoes and mosaics of Pompeii (Figure 3.1) from the first century B.C.
Unfortunately, these skills seem to have been lost with the fall of the Roman
empire, and it took over a thousand years for correct perspective projection
to dominate paintings again, in the late 14th century. It was the early renaissance painters who developed systematic methods for determining the
perspective projection of three-dimensional landscapes. The first treatise
on perspective was published by Leon Battista Alberti [?], who emphasized the eye's view of the world and capturing correctly the geometry of
the projection process. The renaissance coincided with the first attempts
to formalize the notion of perspective and place it on a solid analytical
footing. It is no coincidence that the early attempts to formalize the rules
of perspective came from artists proficient in architecture and engineering,
such as Alberti and Brunelleschi. Geometry, however, is only part of the story.
Figure 3.1. Frescoes from the first century B.C. in Pompeii. More (left) or less
(right) correct perspective projection is visible in the paintings. The skill was lost
during the middle ages, and it did not reappear in paintings until fifteen centuries
later, in the early renaissance.
(Figure 3.2: the image represented as a two-dimensional array of brightness values, displayed as a surface plot.)
The values taken by the map I depend upon physical properties of the
scene being viewed, such as its shape, its material reflectance properties
and the distribution of the light sources. Despite the fact that Figure 3.2
(Table 3.1: the brightness values of the same image, listed as a table of positive integers, one per pixel.)
and Table 3.1 do not seem very indicative of the properties of the scene
they portray, this is how they are represented in a computer. A different
representation of the same image that is better suited for interpretation by
the human visual system is obtained by generating a picture. A picture is
a scene - different from the true one - that produces on the imaging sensor
(the eye in this case) the same image as the original scene. In this sense
pictures are controlled illusions: they are scenes different from the true
ones (they are flat), that produce in the eye the same image as the original
scenes. A picture of the same image I described in Figure 3.2 and Table
3.1 is shown in Figure 3.3. Although the latter seems more informative on
the content of the scene, it is merely a different representation and contains
exactly the same information.
(Figure 3.3: the same image displayed as a picture.)
Figure 3.4. The rays parallel to the optical axis intersect at the focus.
Consider a point p at distance Z from a thin lens with focal length f, and two rays leaving it: one parallel to the optical axis, and one through the optical center (Figure 3.5). The first one intersects the optical axis at the focus; the second remains undeflected (by the defining properties of the thin lens). Call x the point where the two rays intersect, and let z be its distance from the optical center. By decomposing any other ray from p into a component parallel to the optical axis and one through the optical center, we can argue that all rays from p intersect at x on the opposite side of the lens. In particular, a ray from x parallel to the optical axis must go through p. Using similar triangles, from Figure 3.5 we obtain the fundamental equation of the thin lens:
$$\frac{1}{Z} + \frac{1}{z} = \frac{1}{f}.$$
The point x is called the image of the point p.
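As a quick numerical illustration (our made-up values, not from the book), the thin-lens equation gives the image distance z for a few object depths Z; with a 50 mm focal length, a point 5 m away focuses roughly 50.5 mm behind the lens.

f = 0.050                                  # focal length in meters
for Z in (0.5, 5.0, 50.0):                 # object distances in meters
    z = 1.0 / (1.0 / f - 1.0 / Z)          # image distance behind the lens, from 1/Z + 1/z = 1/f
    print(f"Z = {Z:5.1f} m  ->  z = {z * 1000:.2f} mm")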
Figure 3.5. Image of the point p is the point x of the intersection of rays going parallel to the optical axis and the ray through the optical center.
Under the ideal pin-hole model, in which all rays are forced to go through the optical center $o$ and remain undeflected, a point $p$ with coordinates $X = [X, Y, Z]^T$ relative to the camera frame projects onto the image point
$$x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z}. \qquad (3.2)$$
Note that any other point on the line through $p$ and the optical center projects onto the same coordinates $[x, y]^T$. This imaging model is the so-called ideal pin-hole model.
Figure 3.6. Image of the point p is the point x of the intersection of the ray going through the optical center o and an image plane at a distance f from the optical center.
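A minimal sketch (ours) of the ideal pin-hole projection (3.2), illustrating that all points on the same ray through the optical center project to the same image point.

import numpy as np

def pinhole_project(X, f=1.0):
    # Project 3-D points (rows of X, in the camera frame) with x = f X/Z, y = f Y/Z.
    X = np.atleast_2d(X)
    return f * X[:, :2] / X[:, 2:3]

p1 = np.array([1.0, 2.0, 4.0])
p2 = 3.0 * p1                      # same direction, three times farther away
assert np.allclose(pinhole_project(p1), pinhole_project(p2))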
In general, we denote the projection of a point in space onto the image by a map $\pi$:
$$X \mapsto x = \pi(X), \qquad (3.4)$$
where $x \in \mathbb{R}^2$ with $\pi(X) \doteq X/Z$ in the case of planar projection (e.g., onto the CCD), or $x \in \mathbb{S}^2$ with $\pi(X) \doteq X/\|X\|$ in the case of a spherical projection (e.g., onto the retina). We will not make a distinction between the two models, and will indicate the projection simply by $\pi$.
In order to express the direction $x_p$ in the camera frame, we consider the change of coordinates from the local coordinate frame at the point $p$ to the camera frame. For simplicity, we let the inertial frame coincide with the camera frame, so that $X \doteq g_p(o) = p$ and $x \sim g_p(x_p)$, where we note that $g_p$ is a rotation, so that $x$ depends on $x_p$, while $X$ depends on $p$. The reader should be aware that the transformation $g_p$ itself depends on the local shape of the surface at $p$, in particular its tangent plane and its normal at the point $p$. Once we substitute $x$ for $x_p$ into $E$ in (3.3), we obtain the radiance
$$R_1(p) \doteq E\big(g_p^{-1}(x), p\big) \qquad \text{where } x = \pi(p). \qquad (3.5)$$
2 We will systematically study the model of projection in the next few sections, but what is given below suffices for our discussion here.
3 The symbol $\sim$ indicates projective equivalence, that is, equality up to a scalar. Strictly speaking, $x$ and $x_p$ do not represent the same vector, but only the same direction (they have opposite sign and different lengths). However, they represent the same point in the projective plane, and therefore we will regard them as one and the same. In order to obtain the same embedded representation (i.e. a vector in $\mathbb{R}^3$ with the same coordinates), we would have to write $x = \pi(g_p(x_p))$. The same holds if we model the projective plane using the sphere with antipodal points identified.
Our (ideal) sensor can measure the amount of energy received along the direction $x$, assuming a pin-hole model:
$$I_1(x) = R_1(p) \qquad \text{where } x = \pi(p). \qquad (3.6)$$
If the optical system is not well modeled by a pin-hole, one would have to explicitly model the thin lens, and therefore integrate not just along the direction $x_p$, but along all directions in the cone determined by the current point and the geometry of the lens. For simplicity, we restrict our attention to the pin-hole model.
Notice that $R_1$ in (3.5) depends upon the shape of the surface $S$, represented by its location $p$ and surface normal $\nu_p$, but it also depends upon the light source $L$, its energy distribution $E$, and the reflectance properties of the surface $S$, represented by the BRDF $\beta$. Making this dependency explicit, we write
$$I_1(x) = \int_L \beta\big(g_p^{-1}(x), x_p'\big)\, dE(x_p') \qquad \text{where } x = \pi(p). \qquad (3.7)$$
(3.8)
for k = 1 . . . M.
This model can be considerably simplified if we restrict our attention
to a class of materials, called Lambertian, that do not change appearance
depending on the viewpoint. Marble and other matte surfaces are to a
large extent well approximated by the Lambertian model. Metal, mirrors
and other shiny surfaces are not.
According to Lambert's model, the BRDF only depends on how the surface faces the light source, not on how it is viewed. Therefore, $\beta(x_p, x_p')$ is actually independent of $x_p$, and we can think of the radiance function as being glued, or painted, on the surface $S$, so that at each point $p$ the radiance $R$ depends only on the geometry of the surface and not explicitly on the light source. In particular, we have $\beta(x_p, x_p') = \langle \nu_p, x_p'\rangle$, independent of $x_p$, and
$$R(p) \doteq \int_L \langle \nu_p, x_p'\rangle\, dE(x_p').$$
Since $\nu_p$ is the normal vector, which is determined by the geometry of the surface at $p$, knowing the position of the generic point $p \in S$ one can differentiate it to compute the tangent plane. Therefore, effectively, the radiance $R$ depends only on the surface $S$, described by its generic point $p$:
$$I(x) = R(p). \qquad (3.9)$$
In homogeneous coordinates, the change of coordinates from the camera frame to the world frame, $g_{wc} = (R_{wc}, T_{wc})$, reads
$$\bar X_w = \begin{bmatrix} R_{wc} & T_{wc}\\ 0 & 1\end{bmatrix}\begin{bmatrix} X_c\\ 1\end{bmatrix}.$$
The inverse transformation, which maps coordinates in the world frame onto coordinates in the camera frame, is given by
$$g_{cw} = (R_{cw}, T_{cw}) = \big(R_{wc}^T,\; -R_{wc}^T T_{wc}\big). \qquad (3.10)$$
With this notation, the ideal perspective projection (3.2) can be written in matrix form as
$$Z\begin{bmatrix} x\\ y\\ 1\end{bmatrix} = \begin{bmatrix} f & 0 & 0 & 0\\ 0 & f & 0 & 0\\ 0 & 0 & 1 & 0\end{bmatrix}\begin{bmatrix} X\\ Y\\ Z\\ 1\end{bmatrix}, \qquad (3.11)$$
where $\bar X \doteq [X, Y, Z, 1]^T$ is the representation of a 3-D point in homogeneous coordinates and $\bar x \doteq [x, y, 1]^T$ are the homogeneous (projective) coordinates of the point in the retinal plane. Since the Z-coordinate (or the depth of the point $p$) is usually unknown, we may simply denote it as an arbitrary positive scalar $\lambda \in \mathbb{R}^+$. Also notice that the above matrix can be decomposed as
$$\begin{bmatrix} f & 0 & 0 & 0\\ 0 & f & 0 & 0\\ 0 & 0 & 1 & 0\end{bmatrix} = \begin{bmatrix} f & 0 & 0\\ 0 & f & 0\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\end{bmatrix}.$$
Define the two matrices
$$A_f \doteq \begin{bmatrix} f & 0 & 0\\ 0 & f & 0\\ 0 & 0 & 1\end{bmatrix} \in \mathbb{R}^{3\times3}, \qquad P \doteq \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\end{bmatrix} \in \mathbb{R}^{3\times4}.$$
Also notice that from the coordinate transformation we have, for $\bar X = [X, Y, Z, 1]^T$,
$$\begin{bmatrix} X\\ Y\\ Z\\ 1\end{bmatrix} = \begin{bmatrix} R & T\\ 0 & 1\end{bmatrix}\begin{bmatrix} X_0\\ Y_0\\ Z_0\\ 1\end{bmatrix}. \qquad (3.12)$$
Putting the pieces together, we obtain
$$\lambda\begin{bmatrix} x\\ y\\ 1\end{bmatrix} = \begin{bmatrix} f & 0 & 0\\ 0 & f & 0\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\end{bmatrix}\begin{bmatrix} R & T\\ 0 & 1\end{bmatrix}\begin{bmatrix} X_0\\ Y_0\\ Z_0\\ 1\end{bmatrix},$$
or in matrix form
$$\lambda\bar x = A_f P\bar X = A_f P g\bar X_0. \qquad (3.13)$$
The ideal image coordinates $x = [x, y]^T$ are related to the actual pixel coordinates by a scaling of the axes and a translation of the image reference frame, where $(o_x, o_y)$ are the coordinates (in pixels) of the principal point, i.e. the point where the Z-axis intersects the image plane. The actual image coordinates are therefore given by the vector $x_{im} = [x_{im}, y_{im}]^T \in \mathbb{R}^2$ instead of the ideal image coordinates $x = [x, y]^T$. In homogeneous coordinates, these steps can be written in matrix form as
$$\bar x_{im} = \begin{bmatrix} x_{im}\\ y_{im}\\ 1\end{bmatrix} = \begin{bmatrix} s_x & 0 & o_x\\ 0 & s_y & o_y\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix} x\\ y\\ 1\end{bmatrix}, \qquad (3.15)$$
where $x_{im}$ and $y_{im}$ are the actual image coordinates in pixels. When $s_x = s_y$ the pixels are square. In case the pixels are not rectangular, a more general form of the scaling matrix can be considered:
$$S = \begin{bmatrix} s_x & s_\theta\\ 0 & s_y\end{bmatrix} \in \mathbb{R}^{2\times2}.$$
The scaling and translation together are captured by the matrix
$$A_s \doteq \begin{bmatrix} s_x & s_\theta & o_x\\ 0 & s_y & o_y\\ 0 & 0 & 1\end{bmatrix} \in \mathbb{R}^{3\times3}, \qquad (3.16)$$
so that the overall transformation from 3-D coordinates to pixel coordinates is
$$\lambda\begin{bmatrix} x_{im}\\ y_{im}\\ 1\end{bmatrix} = \begin{bmatrix} s_x & s_\theta & o_x\\ 0 & s_y & o_y\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix} f & 0 & 0\\ 0 & f & 0\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\end{bmatrix}\begin{bmatrix} X\\ Y\\ Z\\ 1\end{bmatrix}.$$
Notice that in the above equation the effect of a real camera is in fact carried through two stages. The first stage is a standard perspective projection with respect to a normalized coordinate system (as if the focal length were $f = 1$); this is characterized by the standard projection matrix $P = [I_{3\times3}, 0]$. The second stage is an additional transformation (on the image $x$ so obtained) which depends on parameters of the camera such as the focal length $f$, the scaling factors $s_x, s_y, s_\theta$, and the center offsets $o_x, o_y$. The second transformation is characterized by the combination of the two matrices $A_s$ and $A_f$:
$$A \doteq A_s A_f = \begin{bmatrix} s_x & s_\theta & o_x\\ 0 & s_y & o_y\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix} f & 0 & 0\\ 0 & f & 0\\ 0 & 0 & 1\end{bmatrix} = \begin{bmatrix} f s_x & f s_\theta & o_x\\ 0 & f s_y & o_y\\ 0 & 0 & 1\end{bmatrix}.$$
The projection then reads
$$\lambda\begin{bmatrix} x_{im}\\ y_{im}\\ 1\end{bmatrix} = \begin{bmatrix} f s_x & f s_\theta & o_x\\ 0 & f s_y & o_y\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\end{bmatrix}\begin{bmatrix} X\\ Y\\ Z\\ 1\end{bmatrix}, \qquad \text{i.e.}\quad \lambda\bar x_{im} = AP\bar X. \qquad (3.17)$$
The $3\times3$ matrix $A$ collects all the parameters that are intrinsic to the particular camera, which are therefore called intrinsic parameters; the matrix $P$ represents the perspective projection. The matrix $A$ is usually called the intrinsic parameter matrix, or simply the calibration matrix, of the camera. When $A$ is known, the normalized coordinates $x$ can be obtained from the pixel coordinates $x_{im}$ by simple inversion of $A$:
$$\lambda\bar x = \lambda A^{-1}\bar x_{im} = P\bar X = \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\end{bmatrix}\begin{bmatrix} X\\ Y\\ Z\\ 1\end{bmatrix}. \qquad (3.18)$$
The information about the matrix $A$ can be obtained through the process of camera calibration described in Chapter 6.
The normalized coordinate system corresponds to the ideal pinhole camera model with the image plane located in front of the center of projection
and the focal length f equal to 1. Given this geometric interpretation, the
individual entries of the matrix A correspond to:
- $1/s_x$: size of a horizontal pixel in meters [m/pixel];
- $1/s_y$: size of a vertical pixel in meters [m/pixel];
- $\alpha_x = s_x f$: focal length in horizontal pixels [pixel];
- $\alpha_y = s_y f$: focal length in vertical pixels [pixel];
- $\alpha_x/\alpha_y$: aspect ratio.
Combining the projection with the rigid body motion $g = (R, T)$ from the world frame, we finally obtain
$$\lambda\begin{bmatrix} x_{im}\\ y_{im}\\ 1\end{bmatrix} = \begin{bmatrix} f s_x & f s_\theta & o_x\\ 0 & f s_y & o_y\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\end{bmatrix}\begin{bmatrix} R & T\\ 0 & 1\end{bmatrix}\begin{bmatrix} X_0\\ Y_0\\ Z_0\\ 1\end{bmatrix},$$
or in matrix form
$$\lambda\bar x_{im} = AP\bar X = APg\bar X_0. \qquad (3.19)$$
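Equation (3.19) can be sketched numerically as follows (our code; the parameter values are made up for illustration, loosely echoing a 16 mm lens sampled at 500 pixels per 16 mm).

import numpy as np

def calibration_matrix(f, sx, sy, s_theta, ox, oy):
    # Intrinsic parameter matrix A of equation (3.17).
    return np.array([[f * sx, f * s_theta, ox],
                     [0.0,    f * sy,      oy],
                     [0.0,    0.0,         1.0]])

def project(A, g, X0):
    # Pixel coordinates of a world point X0 via lambda * x_im = A P g X0_bar, equation (3.19).
    P = np.hstack([np.eye(3), np.zeros((3, 1))])      # standard projection matrix
    X0_bar = np.append(X0, 1.0)                       # homogeneous coordinates
    x = A @ P @ g @ X0_bar
    return x[:2] / x[2]                               # divide out the unknown scale lambda

A = calibration_matrix(f=0.016, sx=31250.0, sy=31250.0, s_theta=0.0, ox=320.0, oy=240.0)
g = np.eye(4)                                         # camera frame coincides with the world frame
x_pix = project(A, g, np.array([0.1, -0.05, 2.0]))    # a point 2 m in front of the camera
x_norm = np.linalg.inv(A) @ np.append(x_pix, 1.0)     # normalized coordinates, equation (3.18)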
In the case of spherical projection, the perspective division is simply replaced by projection onto the unit sphere,
$$X \mapsto \tilde X = \frac{X}{\|X\|},$$
and the overall model remains
$$\lambda\bar x_{im} = AP\bar X, \qquad (3.20)$$
where the scale, $\lambda = Z$ in the case of perspective projection, becomes $\lambda = \|X\|$.
A further simplification is the orthographic projection, which simply drops the Z-coordinate:
$$\begin{bmatrix} x\\ y\end{bmatrix} = \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\end{bmatrix}\begin{bmatrix} X\\ Y\\ Z\end{bmatrix}, \qquad (3.21)$$
or simply, in matrix form,
$$x = P_2 X. \qquad (3.22)$$
When the depth variation of the scene is small compared to its average distance $\bar Z$ from the camera, the perspective projection can be approximated by
$$x = f\frac{X}{\bar Z}, \qquad y = f\frac{Y}{\bar Z},$$
where $\bar Z$ is the average distance of the points viewed by the camera. This model is appropriate when all points lie approximately in a frontoparallel plane; the scaling factor then corresponds to the distance of the plane from the origin. Denoting the scaling factor by $s = f/\bar Z$, we can express the weak-perspective (scaled orthographic) camera model as
$$\begin{bmatrix} x\\ y\end{bmatrix} = s\begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\end{bmatrix}\begin{bmatrix} X\\ Y\\ Z\end{bmatrix}, \qquad (3.23)$$
or, in matrix form,
$$x = sP_2 X. \qquad (3.24)$$
3.4 Summary
The perspective projection was introduced as a model of the image formation process. In the case of an idealized projection (when the intrinsic parameters of the camera system are known), the coordinates of the image points are related to their 3-D counterparts by an unknown scale $\lambda$:
$$\lambda\bar x = P\bar X.$$
In the absence of knowledge of the camera parameters, the perspective projection is augmented by an additional transformation $A$, which captures the intrinsic parameters of the camera system and yields the following relationship between image quantities and their 3-D counterparts:
$$\lambda\bar x_{im} = AP\bar X = APg\bar X_0.$$
3.5 Exercises
1. Show that any point on a line through p projects onto the same
coordinates (equation (3.4)).
2. Show how perspective projection approximates orthographic projection when the scene occupies a volume whose diameter is small compared to its distance from the camera. Characterize other conditions under which the two projection models produce similar results (equal in the limit).
3. Consider a thin round lens imaging a plane parallel to the lens at a
distance d from the focal plane. Determine the region of this plane
that contributes to the image I at the point x. (Hint: consider first
a one-dimensional imaging model, then extend to a two-dimensional
image).
4. Perspective projection vs. orthographic projection
It is common sense that, with a perspective camera, one can not
tell an object from another object which is exactly twice as big but
twice as far. This is a classic ambiguity introduced by perspective
projection. Please use the ideal camera model to explain why this is
true. Is the same also true for orthographic projection? Please explain.
5. Vanishing points
A straight line in the 3-D world is projected onto a straight line in
the image. The projections of two parallel lines intersect in the image
at the so-called vanishing point.
a) Show that projections of parallel lines in the image intersect
at a point.
b) Compute, for a given family of parallel lines, where in the
image the vanishing point will be.
c) When does the vanishing point of the lines in the image lie
at infinity (i.e. they do not intersect)?
6. Field of view
An important parameter of the imaging system is the field of view
(FOV). The field of view is twice the angle between the optical axis
(z-axis) and the edge of the retinal plane (CCD array). Imagine that you have a camera system with a focal length of 16 mm, that the retinal plane (CCD array) is 16 mm x 12 mm, and that your digitizer samples the imaging surface at 500 x 500 pixels in each direction.
7. Calibration matrix
Compute the calibration matrix $A$ that represents the transformation from image $I$ to $I'$ as shown in Figure 3.8. Note that, from the definition of the calibration matrix, you need to use homogeneous coordinates to represent image points.
(Figure 3.8: image I, labeled with the corner point (-1, 1), and image I', labeled with the corner point (640, 480).)
Chapter 4
Image primitives and correspondence
The previous chapter described how images are formed depending on the
position of the camera, the geometry of the environment and the light
distribution. In this chapter we begin our journey of understanding how to
undo this process and process measurements of light in order to infer the
geometry of the environment.
The geometry of the environment and the light distribution combine to
give an image in a way that is inextricable: from a single image, in the absence of prior information, we cannot tell light and geometry apart. One way to
see that is to think of an image as coming from a certain object with a
certain light distribution: one can always construct a different object and
a different light distribution that would give rise to the same image. One
example is the image itself! It is an object different from the true scene (it
is flat) that gives rise to the same image (itself). In this sense, images are
illusions: they are objects different than the true one, but that generate
on the retina of the viewer the same stimulus (or almost the same, as we
will see) that the original scene would.
However, if instead of looking at one single image we look at two or more
images of the same scene taken for instance from different vantage points,
then we can hope to be able to extract geometric information about the
scene. In fact, each image depends upon the (changing) camera position,
but they all depend upon the same geometry and light distribution, assuming that the scene has not changed between snapshots. Consider, for instance, the case of a scene and a photograph of that scene. When viewed from different viewpoints, the original scene will appear different depending upon the depth of the objects within it: distant objects will move little and near objects will move more. If, on the other hand, we move in front of a photograph of the scene, every point will have the same image motion.
This intuitive discussion suggests that, in order to be able to extract
geometric information about the environment, we first have to be able to
establish how points move from one image to the next. Or, more in
general, we have to establish which points corresponds to which in different images of the same scene. This is called the correspondence problem.
The correspondence problem is at the heart of the process of converting
measurements of light into estimates of geometry.
In its full generality, this problem cannot be solved. If the light distribution and the geometry of the scene can change arbitrarily from one image
to the next, there is no way to establish which point corresponds to which in different images. For instance, if we take a rotating white marble sphere,
we have no way to tell which point corresponds to which on the sphere
since each point looks the same. If instead we take a still mirror sphere
and move the light, even though points are still (and therefore points correspond to themselves), their appearance changes from one image to the
next. Even if light and geometry do not change, if objects have anisotropic
reflectance properties, so that they change their appearance depending on
the viewpoint, establishing correspondence can be very hard.
In this chapter we will study under what conditions the correspondence
problem can be solved, and can be solved easily. We will start with the most
naive approach, that consists in considering the image irradiance I(x) as a
label attached to each pixel x. Although sound in principle, this approach leads nowhere, due to a variety of factors. We will then see how to extend
the label idea to more and more complex labels, and what assumptions
on the shape of objects different labels impose.
An image is a map
$$I : \mathbb{R}^2 \to \mathbb{R}^+;\qquad x \mapsto I(x).$$
Under the pin-hole model, only one point along each projection ray contributes to the irradiance at a given pixel. This point has coordinates $X \in \mathbb{R}^3$, corresponding to a particular value of $\lambda$ determined by the first intersection of the ray with a visible surface: $\lambda x = X$. Therefore $I_1(x) = E(\lambda x) = E(X)$, where $\lambda \in \mathbb{R}^+$ represents the distance of the point along the projection ray and $E(X) : \mathbb{R}^3 \to \mathbb{R}^+$ is the radiant distribution of the object at $X$. This is the fundamental irradiance equation:
$$I_1(x) = E(\lambda x) = E(X). \qquad (4.1)$$
Now suppose that a different image of the same scene becomes available,
I2 , for instance one taken from a different vantage point. Naturally,
$$I_2 : \mathbb{R}^2 \to \mathbb{R}^+;\qquad x \mapsto I_2(x).$$
However, I2 will in general be different from I1 . The first step in establishing correspondence is to understand how such a difference occurs. We
distinguish between changes that occur in the image domain and changes
that occur in the value I2 .
(4.3)
where we have emphasized the fact that the scales i , i = 1, 2 depend upon
the coordinates of the point in space X. Therefore, under the admittedly
restrictive conditions of this section, a model of the deformation between
two images of the same scene is given by:
I1 (x1 ) = I2 (h(x1 )).
(4.4)
60
(4.5)
where the fact that h depends upon the shape of the scene is made explicit
in the argument of u. Note that the dependence of h(x) on the position
of the point X comes through the scales 1 , 2 . In general, therefore, h
is a function in an infinite-dimensional space (the space of surfaces), and
solving for image correspondence is as difficult as estimating the shape of
visible objects.
While the reader reflects on how inappropriate it is to use the brightness
value of the image at a point as a label to match pixels across different
images, we reflect on the form of the image motion h in some interesting
special cases. We leave it as an exercise to the reader to prove the statements
below.
Translational image model. Consider the case where each image point moves with the same motion, $u(X) = \mathrm{const}$, or $h(x) = x + u$ for all $x$, where $u \in \mathbb{R}^2$. This model is correct only if the scene is flat, parallel to the image plane, far enough from it, and the viewer is moving slowly in a direction parallel to the image. However, if instead of considering the whole image we consider a small window around a point $x$, $W(x) = \{\tilde x \mid \|\tilde x - x\| \leq k\}$, then a simple translational model could be a reasonable approximation, provided that the window is not the image of an occluding boundary (why?). Correspondence based on this model is at the core of most optical flow and feature tracking algorithms. We will adopt that of Lucas and Kanade [?] as the prototypical example of a model in this class.
Affine image deformation. Consider the case where $h(x) = Ax + d$, with $A \in \mathbb{R}^{2\times2}$ and $d \in \mathbb{R}^2$. This model is a good approximation for small planar patches parallel to the image plane, moving under an arbitrary translation and rotation about the optical axis, and a modest rotation about an axis parallel to the image plane. Correspondence based on this model has been addressed by Shi and Tomasi [?]. A toy sketch of window matching under the simpler translational model is given below.
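As a toy illustration (ours, not the book's algorithm), the translational model can be used to match a window between two images by minimizing a sum-of-squared-differences cost over a small set of candidate integer displacements.

import numpy as np

def ssd_match(I1, I2, x, y, half, search):
    # Find the integer displacement u = (dx, dy) that best aligns the window of I1
    # centered at (x, y) with I2, under the purely translational model h(x) = x + u.
    w1 = I1[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    best, best_u = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            w2 = I2[y + dy - half:y + dy + half + 1,
                    x + dx - half:x + dx + half + 1].astype(float)
            cost = np.sum((w1 - w2) ** 2)    # sum-of-squared-differences matching cost
            if cost < best:
                best, best_u = cost, (dx, dy)
    return best_u

# Synthetic test: I2 is I1 shifted by 3 pixels horizontally and 2 pixels vertically.
rng = np.random.default_rng(0)
I1 = rng.integers(0, 255, size=(64, 64))
I2 = np.roll(I1, shift=(2, 3), axis=(0, 1))
assert ssd_match(I1, I2, x=32, y=32, half=7, search=5) == (3, 2)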
Now, let us see why attaching scalar labels to pixels is not a good idea.
61
(4.6)
More fundamental departures from the model (4.4) occur when one considers that points that are visible from one vantage point may become occluded
from another. Occlusions could be represented by a factor multiplying I2
that depends upon the shape of the surface being imaged:
I1 (x1 ) = fo (X, x)I2 (h(x1 )) + n(h(x1 )).
(4.7)
For instance, for the case where only one point on the surface is emitting
light, fo (X, x) = 1 when the point is visible, and fo (X, x) = 0 when not.
The equation above should make very clear the fact that associating the
label I1 to the point x1 is not a good idea, since the value of I1 depends
upon the noise n and the shape of the surfaces in space X, which we cannot
control.
There is more: in most natural scenes, objects do not emit light of their own but, rather, reflect ambient light in a way that depends upon the properties of the material. Even in the absence of occlusions, different materials may scatter or reflect light by different amounts in different directions. A limit case consists of materials that scatter light uniformly in all directions, such as marble and opaque matte surfaces, for which $f_o(X, x)$ could simply be multiplied by a factor $\rho(X)$ independent of the viewpoint $x$. Such materials are called Lambertian, and the function $\rho(X)$ defined on the surface of visible objects is called the albedo. However, for many other surfaces, for instance the other extreme case of a perfectly specular material,
light rays hitting the surface are reflected by different amounts in different
directions. Images of specular objects depend upon the light distribution of
the ambient space, which includes light sources as well as every other object
through inter-reflections. In general, few materials exhibit perfect Lambertian or specular reflection, and even more complex reflection models, such
as translucent or anisotropic materials, are commonplace in natural and
man-made scenes.
The reader can therefore appreciate that the situation can get very
complicated even for relatively simple objects. Consider, for instance, the
example of the marble sphere and the mirror sphere in the header of this
chapter. Indeed, in the most general case an image depends upon the (unknown) light distribution, material properties, and shape and pose of objects in space, and images alone, no matter how many, do not provide sufficient information to recover all the unknowns.
So how do we proceed? How can we establish correspondence despite the
fact that labels are ambiguous and that point correspondence cannot be
established for general scenes?
The second question can be easily dismissed by virtue of modesty: in
order to recover a model of the scene, it is not necessary to establish the
correspondence for every single point on a collection of images. Rather, in
order to recover the pose of cameras and a simple model of an environment,
correspondence of a few fiducial points is sufficient. It would be particularly clever if we could choose such fiducial points, or features, in such a way as to guarantee that they are easily put in correspondence.
The first question can be addressed by proceeding as one often does to counteract the effects of noise: by integrating. Instead of considering Equation (4.2) in terms of points on an image, we can consider it as defining correspondence in terms of regions. This can be done by integrating each side over a window W(x) around each point x, and using the resulting equation to characterize the correspondence at x. Another way of thinking of the same procedure is to associate to each pixel x not just the scalar label I(x), but instead a vector label
l(x) ≐ {I(x̃) | x̃ ∈ W(x)}.
This way, rather than trying to match scalar values, in order to establish correspondence we will need to match vectors, or regions.
Of course, if we considered the transformation of each point in the region as independent, h(x) = x + u(X), we would be back where we started, having to figure out two unknowns for each pixel. However, if we make the assumption that the whole region W(x) undergoes the same motion, for instance h(x̃) = x̃ + u for all x̃ ∈ W(x), then the problem may become easier to solve.
At this stage the reader should go back to consider under what assumptions a window undergoes a pure translation, and when it may be more
appropriate to model the motion of the region as an affine deformation, or
when neither approximation is appropriate.
In the remainder of this chapter we will explore in greater detail how
to match regions based on simple motion models, such as pure translation
or affine deformations. Notice that since we are comparing the appearance
of regions, a notion of discrepancy or matching cost must be agreed
upon.
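To make the notion of region matching concrete, the following sketch compares a window around a point in one image against candidate windows in a second image using a sum-of-squared-differences score. It is an illustrative example only (the images, window size, and search range are hypothetical choices), not the algorithm developed later in this chapter.

import numpy as np

def match_window_ssd(I1, I2, x, y, half=7, search=10):
    """Find the integer displacement of the window W(x, y) from image I1
    into image I2 by exhaustive search over a small range, scoring each
    candidate with the sum of squared differences (SSD)."""
    patch = I1[y - half:y + half + 1, x - half:x + half + 1].astype(np.float64)
    best_d, best_cost = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = I2[y + dy - half:y + dy + half + 1,
                      x + dx - half:x + dx + half + 1].astype(np.float64)
            if cand.shape != patch.shape:      # window fell off the image
                continue
            cost = np.sum((cand - patch) ** 2)
            if cost < best_cost:
                best_cost, best_d = cost, (dx, dy)
    return best_d, best_cost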
A point x can be called a feature point if the system of equations
I1(x̃) = I2(h(x̃; α)),  ∀ x̃ ∈ W(x),  (4.8)
uniquely determines the parameters α. From the example of the white wall, it is intuitive that such conditions require that I1 and I2 have nonzero gradient. In the sections to follow we will derive an explicit form of this condition for the case of the translational model. Notice also that, due to noise, there may exist no parameters α for which the equation above is satisfied exactly. Therefore, we may have to solve the equation above as an optimization problem: after choosing a discrepancy measure φ(I1, I2) in α, we can seek
α̂ = arg min_α φ(I1(x̃), I2(h(x̃; α))),  x̃ ∈ W(x).  (4.9)
Similarly, one may define a feature line as a line segment with a support region and a collection of labels, such that the orientation and normal displacement of the transformed line can be uniquely determined from the equation above.
In the next two sections we will see how to efficiently solve the problem above for the cases where α are translation parameters or affine parameters.
Comment 4.1. The definition of feature points and lines allows us to abstract our discussion from pixels and images to abstract entities such as
points and lines. However, as we will discuss in later chapters, this separation is more conceptual than factual. Indeed, all the constraints among
geometric entities that we will derive in Chapters 8, 9, 10, and 11 can be
rephrased in terms of constraints on the irradiance values on collections of
images.
For two consecutive images, the translational model together with the brightness constancy assumption gives
I(x(t), t) = I(x(t) + u dt, t + dt),  (4.11)
where we have taken the liberty of calling t1 = t and t2 = t + dt to emphasize the fact that the motion is infinitesimal. In the limit dt → 0, the equation can be written as
∇I(x(t), t)^T u + It(x(t), t) = 0,  (4.12)
where
∇I(x, t) ≐ [ ∂I/∂x (x, t), ∂I/∂y (x, t) ]^T ∈ R²  and  It(x, t) ≐ ∂I/∂t (x, t).  (4.13)
Another way of writing the equation is in terms of the total derivative with respect to time,
dI(x(t), y(t), t)/dt = 0,  (4.14)
that is,
(∂I/∂x)(dx/dt) + (∂I/∂y)(dy/dt) + ∂I/∂t = 0,  (4.15)
which is identical to (4.12) once we let u = [ux, uy]^T ≐ [dx/dt, dy/dt]^T ∈ R².
This equation is called the image brightness constancy constraint. Depending on where the constraint is evaluated, this equation can be used to
compute what is called optical flow, or to track photometric features in a
sequence of moving images.
Before we delve into the study of optical flow and feature tracking, notice
that (4.12), if computed at each point, only provides one equation for two
unknowns (the two components of u). It is only when the equation is evaluated at each point x̃ ∈ W(x) of a region, and the motion is assumed to be constant over that region, that it provides enough constraints on u.
For convenience we omit the time dependence of (x(t), y(t)) in I(x(t), y(t), t) and write simply I(x, y, t). With this notation the brightness constancy constraint reads
∇I(x, y, t)^T u + It(x, y, t) = 0.  (4.16)
There are two ways to exploit this constraint, corresponding to an Eulerian and a Lagrangian approach to the problem of determining image motion. When we fix our attention at a particular image location x̄ and use (4.16) to compute the velocity of particles flowing through that pixel, u(x̄, t) is called the optical flow. When instead the attention is on a particular particle x(t), and (4.16) is computed at the location x(t) as it moves
through the image domain, we refer to the computation of u(x(t), t) as
feature tracking. Optical flow and feature tracking are obviously related by
x(t + dt) = x(t) + u(x(t), t)dt. The only difference, at the conceptual level,
is where the vector u(x, t) is computed: in optical flow it is computed at a
fixed location on the image, whereas in feature tracking it is computed at
the point x(t).
The brightness constancy constraint captures the relationship between the image velocity u of an image point (x, y) and the spatial and temporal derivatives ∇I, It, which are directly measurable. As we have already noticed, the equation provides a single constraint for the two unknowns u = [ux, uy]^T. From the linear-algebraic point of view there are infinitely many solutions u which satisfy this equation. All we can compute is the projection of the actual optical flow vector onto the direction of the image gradient ∇I. This component is also referred to as the normal flow, and can be thought of as the minimum-norm vector un ∈ R² which satisfies the brightness constancy constraint. It is given by the projection of the true motion field u onto the gradient direction:
un ≐ (∇I^T u / ‖∇I‖) (∇I / ‖∇I‖) = − (It / ‖∇I‖) (∇I / ‖∇I‖).  (4.17)
Figure 4.1. In spite of the fact that the dark square moved diagonally between the two consecutive frames, by observing only the cut-out patch we can measure motion in the horizontal direction alone; as far as the patch is concerned, the observed pattern may have moved arbitrarily along the direction of the edge.
To deal with this ambiguity, one assumes that over a small window centered at the image location x the optical flow is the same for all points in the window W(x). This is equivalent to assuming a purely translational deformation model:
h(x̃) = x̃ + u(x̃) = x̃ + u(x)  for all  x̃ ∈ W(x).  (4.18)
Hence, in order to compute the image velocity u, we seek the velocity which is most consistent with all the point constraints. Due to the effect of noise in the model (4.6), this now over-constrained system may not have an exact solution. The matching process is therefore formulated as the minimization of the following quadratic error function based on the brightness constancy constraint:
Eb(u) = Σ_{x̃ ∈ W(x,y)} ( ∇I(x̃, t)^T u + It(x̃, t) )²,  (4.19)
whose minimizer satisfies, in matrix form,
G u + b = 0,  (4.21)
where
G = Σ_{W(x,y)} ∇I ∇I^T  and  b = Σ_{W(x,y)} It ∇I.  (4.22)
Alternatively, the displacement d of the window between the two frames can be found by minimizing a sum-of-squared-differences (SSD) criterion,
E_SSD(d) = Σ_{x̃ ∈ W(x,y)} ( I(x̃ + d, t + dt) − I(x̃, t) )²,
where the summation ranges over the image window W(x, y) centered at the feature of interest. Compared to the error function (4.19), an advantage of the SSD criterion is that in principle we no longer need to compute derivatives of I(x, y, t): one alternative for computing the displacement is to enumerate the function at each candidate location and choose the one which gives the minimum error. This formulation is due to Lucas and Kanade [LK81]; it was originally proposed in the context of stereo algorithms and was later refined by Tomasi and Kanade [TK92] in the more general feature tracking context.
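As an illustration of the window-based solution of (4.21)-(4.22), the following sketch computes the flow of a single window by accumulating G and b from finite-difference image derivatives and solving the 2 × 2 linear system. Function and variable names are our own, and this is a minimal sketch of the idea rather than a full tracker.

import numpy as np

def lucas_kanade_flow(I0, I1, x, y, half=7):
    """Estimate the translational flow u of the window W(x, y) between
    frames I0 and I1 by solving G u = -b (cf. (4.21)-(4.22))."""
    I0 = I0.astype(np.float64)
    I1 = I1.astype(np.float64)
    # Spatial derivatives by central differences, temporal by frame difference.
    Ix = (np.roll(I0, -1, axis=1) - np.roll(I0, 1, axis=1)) / 2.0
    Iy = (np.roll(I0, -1, axis=0) - np.roll(I0, 1, axis=0)) / 2.0
    It = I1 - I0
    win = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))
    gx, gy, gt = Ix[win].ravel(), Iy[win].ravel(), It[win].ravel()
    G = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    b = np.array([np.sum(gx * gt), np.sum(gy * gt)])
    # The window is a good feature only if G is well conditioned.
    u = np.linalg.solve(G, -b)
    return u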
Under the affine deformation model, the brightness constancy assumption becomes
I(x, t) = I(Ax + d, t + dt).  (4.25)
Enforcing the above assumption over a region of the image, we can estimate the unknown affine parameters A and d by integrating the above constraint over all the points in the region W(x):
Ea(A, d) = Σ_{x̃ ∈ W(x)} ( I(x̃, t) − I(Ax̃ + d, t + dt) )²,  (4.26)
where the subscript a indicates the affine deformation model. By solving the above minimization problem (after linearization) one can obtain a linear least-squares estimate of the affine parameters A and d directly from image measurements of the spatial and temporal gradients. The solution was first derived by Shi and Tomasi in [?].
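To make the preceding paragraph concrete, here is a minimal sketch of the linearized least-squares estimate: writing A = I + D and linearizing I(Ax̃ + d, t + dt) about x̃ turns (4.26) into a linear system in the six unknowns (D, d). The helper below is illustrative only; the names and the choice of inputs (precomputed derivatives over the window pixels) are our own.

import numpy as np

def affine_flow(Ix, Iy, It, xs, ys):
    """Least-squares estimate of the affine motion parameters.
    Model: u(x, y) = D [x, y]^T + d, from the linearized constraint
    Ix*u + Iy*v + It = 0 at every window pixel (xs[i], ys[i])."""
    # Each row multiplies the unknown vector p = [D11, D12, D21, D22, d1, d2].
    M = np.stack([Ix * xs, Ix * ys, Iy * xs, Iy * ys, Ix, Iy], axis=1)
    p, *_ = np.linalg.lstsq(M, -It, rcond=None)
    D = p[:4].reshape(2, 2)
    d = p[4:]
    return np.eye(2) + D, d   # A = I + D, translation d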
Both schemes require the spatial derivatives of the image,
Ix(x, y) ≐ ∂I/∂x (x, y),   Iy(x, y) ≐ ∂I/∂y (x, y).  (4.27)
While the notion of derivative is well defined for smooth functions, additional steps need to be taken when computing the derivatives of digital
images.
The most straightforward and commonly used approximation, first-order differences of the raw image, usually results in a rough approximation of ∇I. Due to the presence of noise and digitization error, as well as the fact that only discrete samples of the function I(x, y) are available, a common practice is to smooth I(x, y) before computing first differences to approximate the gradient. Smoothing can be accomplished by convolving the image with a smoothing filter. A suitable low-pass filter, which attenuates high-frequency noise and at the same time serves as a sort of interpolating function, is for instance the 2-D Gaussian
gσ(x, y) = (1 / (2πσ²)) e^{−(x² + y²)/(2σ²)},  (4.28)
where the choice of σ is related to the effective size of the window over which the intensity function is averaged. Since the Gaussian filter is separable, the convolution can be accomplished in two steps, convolving the image with a 1-D Gaussian kernel first in the horizontal and then in the vertical direction. The smoothed image is then
Ĩ(x, y) = ∫∫ I(u, v) gσ(x − u) gσ(y − v) du dv.  (4.29)
Figures 4.2 and 4.3 demonstrate the effect of smoothing a noisy image by convolution with a Gaussian.
Figure 4.2. An image (Lena) corrupted by white random noise. Figure 4.3. The image smoothed by convolution with a Gaussian.
Since convolution is linear, the derivatives of the smoothed image can be computed by convolving the original image directly with the derivative of the Gaussian:
Ix(x, y) = ∫∫ I(u, v) g'σ(x − u) gσ(y − v) du dv,  (4.30)
Iy(x, y) = ∫∫ I(u, v) gσ(x − u) g'σ(y − v) du dv,  (4.31)
where g'σ(x) = −(x / (√(2π) σ³)) e^{−x²/(2σ²)} is the derivative of the Gaussian function. Of
course, in principle, linear filters other than the Gaussian can be designed to smooth the image or to compute the gradient. Over digital images we cannot perform continuous integration, so the above convolutions are commonly approximated by finite summations over a region of interest Ω:
Ix(x, y) = Σ_{(u,v)∈Ω} I(u, v) g'σ(x − u) gσ(y − v),  (4.32)
Iy(x, y) = Σ_{(u,v)∈Ω} I(u, v) gσ(x − u) g'σ(y − v).  (4.33)
Ideally Ω should be the entire image, but in practice one can usually choose its size just big enough that the values of the Gaussian kernel are negligibly small outside of the region.
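The following short sketch implements the discrete approximations (4.32)-(4.33) with separable 1-D kernels truncated to a finite support, in the spirit of the remark above; the truncation radius of three standard deviations is our own choice.

import numpy as np

def gaussian_kernels(sigma):
    """1-D Gaussian and its derivative, truncated at about 3*sigma."""
    r = int(np.ceil(3 * sigma))
    x = np.arange(-r, r + 1, dtype=np.float64)
    g = np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    dg = -x / (np.sqrt(2 * np.pi) * sigma**3) * np.exp(-x**2 / (2 * sigma**2))
    return g, dg

def image_gradient(I, sigma=1.0):
    """Ix, Iy by separable convolution with the derivative of a Gaussian."""
    g, dg = gaussian_kernels(sigma)
    I = I.astype(np.float64)
    # Convolve rows and columns separately ('same' keeps the image size).
    conv_rows = lambda A, k: np.apply_along_axis(np.convolve, 1, A, k, 'same')
    conv_cols = lambda A, k: np.apply_along_axis(np.convolve, 0, A, k, 'same')
    Ix = conv_cols(conv_rows(I, dg), g)   # derivative along x, smoothing along y
    Iy = conv_rows(conv_cols(I, dg), g)   # derivative along y, smoothing along x
    return Ix, Iy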
Figures 4.4 and 4.5 demonstrate edge pixels detected by the Canny edge
detector on a gray-level image. Edges are important features which give
away some local geometric information of the image. However, as we mentioned in previous sections, they are typically not good features to track
due to the aperture problem. Hence in order to select features which are
for all pixels in a window W around (x, y), compute the matrix
G = [ Σ Ix²    Σ Ix Iy
      Σ Ix Iy  Σ Iy²  ],  (4.38)
and if the smallest singular value σmin(G) is bigger than a prefixed threshold τ, then mark the pixel as a feature (or corner) point.
Although we have used the word corner, the reader should observe that
the test above only guarantees that the irradiance function I is changing
enough in two independent directions within the window of interest. In
other words, the window contains sufficient texture. Besides using the
norm (inner product) or outer product of the gradient I to detect edge or
corner, criteria for detecting both edge and corner together have also been
explored. One of them is the well-known Harris corner and edge detector
[HS88]. The main idea is to threshold the following quantity
C(G) = det(G) + k trace²(G)
(4.39)
Figure 4.7. An example of the response of the Harris feature detector using a 5 × 5 integration window and parameter k = 0.04.
Suppose the eigenvalues of G (which in this case coincide with its singular values) are λ1, λ2. Then
C(G) = λ1λ2 + k(λ1 + λ2)² = (1 + 2k)λ1λ2 + k(λ1² + λ2²).  (4.40)
Note that if k > 0 and either one of the eigenvalues is large, so will be C(G); that is, both distinctive edge and corner features are likely to pass the threshold. If k < 0, then both eigenvalues need to be big enough to make C(G) pass the threshold; in this case, corner features are favored.
A simple thresholding operation often does not yield satisfactory results and leads to the detection of too many corners which are not well localized. Partial improvements can be obtained by searching for local maxima of the response in the regions where the detector response is high. Alternatively, more sophisticated techniques can be used which rely on edge detection and search for high-curvature points of the detected contours.
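As an illustration of the two criteria above, the sketch below computes, for every pixel, the matrix G of (4.38) over a square window and derives both the smallest-eigenvalue score and the Harris-style score C(G). The window size, and the choice k = -0.04 (which favors corners under the sign convention of (4.39)), are our own illustrative choices.

import numpy as np

def corner_scores(Ix, Iy, half=2, k=-0.04):
    """Return (smallest eigenvalue of G, Harris response det(G) + k*trace(G)^2)
    at every pixel, with G accumulated over a (2*half+1)^2 window."""
    Ixx, Ixy, Iyy = Ix * Ix, Ix * Iy, Iy * Iy

    def box_sum(A):
        # Sum over the window by convolving rows and columns with a box filter.
        ones = np.ones(2 * half + 1)
        A = np.apply_along_axis(np.convolve, 1, A, ones, 'same')
        return np.apply_along_axis(np.convolve, 0, A, ones, 'same')

    Sxx, Sxy, Syy = box_sum(Ixx), box_sum(Ixy), box_sum(Iyy)
    trace = Sxx + Syy
    det = Sxx * Syy - Sxy * Sxy
    # Smallest eigenvalue of the symmetric 2x2 matrix G.
    lam_min = trace / 2 - np.sqrt(np.maximum((trace / 2) ** 2 - det, 0.0))
    harris = det + k * trace ** 2
    return lam_min, harris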
Σ_{i=1}^m u_i u_i^T ∈ R^{n×n}  (4.41)
e⁰ = e,   e^{i+1} = ∫_W ∇I (∇I^T u + It)(ũ + d1^i, ṽ + d2^i, t) dũ dṽ.
At each step (u + d1^i, v + d2^i) is in general not on the pixel grid, so it is necessary to interpolate the brightness values to obtain the image intensity at that location. The matrix G is the discrete version of what we defined to be the SSD in the first section of this chapter.
7. Multi-scale implementation
One problem common to all differential techniques is that they fail
as the displacement across frames is bigger than a few pixels. One
possible way to overcome this inconvenience is to use a coarse-to-fine
strategy:
- build a pyramid of images by smoothing and sub-sampling the original images (see for instance [BA83]);
- select features at the desired level of definition and then propagate the selection up the pyramid;
- track the features at the coarser level;
- propagate the displacement to finer resolutions and use that displacement as an initial step for the sub-pixel iteration described in the previous section.
The whole procedure can be very slow for a full-resolution image if
implemented on conventional hardware. However, it is highly parallel
computation, so that the image could be sub-divided into regions
which are then assigned to different processors. In many applications,
where a prediction of the displacement of each feature is available,
it is possible to process only restricted regions of interest within the
image.
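A minimal sketch of the coarse-to-fine strategy follows. It assumes a single-window flow estimator such as the lucas_kanade_flow sketch given earlier in this chapter (a hypothetical helper here), and the number of levels, the 2x2 block-average downsampling, and the crude warping by np.roll are arbitrary illustrative choices.

import numpy as np

def build_pyramid(I, levels=3):
    """Smooth-and-subsample pyramid; level 0 is the original image."""
    pyr = [I.astype(np.float64)]
    for _ in range(1, levels):
        J = pyr[-1]
        h, w = (J.shape[0] // 2) * 2, (J.shape[1] // 2) * 2
        # 2x2 block average: crude smoothing followed by sub-sampling.
        pyr.append(J[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return pyr

def track_coarse_to_fine(I0, I1, x, y, flow_fn, levels=3):
    """Refine the displacement of the feature at (x, y) from the coarsest
    pyramid level down to full resolution. flow_fn(J0, J1, x, y) returns
    the residual flow of one window (e.g. a Lucas-Kanade style estimator)."""
    p0, p1 = build_pyramid(I0, levels), build_pyramid(I1, levels)
    d = np.zeros(2)
    for lvl in range(levels - 1, -1, -1):
        d *= 2.0                                   # displacement doubles at the finer level
        xs, ys = int(round(x / 2 ** lvl)), int(round(y / 2 ** lvl))
        # Shift the second image by the current estimate so only a small residual remains.
        shifted = np.roll(p1[lvl], (-int(round(d[1])), -int(round(d[0]))), axis=(0, 1))
        d += flow_fn(p0[lvl], shifted, xs, ys)
    return d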
Part II
Geometry of pairwise views
Chapter 5
Reconstruction from two calibrated views
i = 1, . . . , N
(5.1)
j = 1, . . . , N.
(5.2)
x2^T T̂ R x1 = 0.  (5.3)
Figure 5.1. Two projections x1, x2 ∈ R³ of a 3-D point p from two vantage points. The relative Euclidean transformation between the two vantage points is given by (R, T) ∈ SE(3). The vectors x1, x2 and T are coplanar, and therefore their triple product (5.3) must be zero.
Proof. Since both u ↦ A^T û A and u ↦ (A^{−1}u)^ are linear maps from R³ to R³ˣ³, one may directly verify that these two linear maps agree on the basis vectors [1, 0, 0]^T, [0, 1, 0]^T and [0, 0, 1]^T (using the fact that A ∈ SL(3) implies that det(A) = 1).
The following theorem, due to Huang and Faugeras [HF89], captures the
algebraic structure of essential matrices:
Theorem 5.1 (Characterization of the essential matrix). A nonzero matrix E ∈ R³ˣ³ is an essential matrix if and only if E has a singular value decomposition (SVD) E = UΣV^T with
Σ = diag{σ, σ, 0}
for some σ ∈ R+ and U, V ∈ SO(3).
Proof. We first prove the necessity. By definition, for any essential matrix E there exists (at least one pair) (R, T), R ∈ SO(3), T ∈ R³, such that T̂R = E. For T there exists a rotation matrix R0 such that R0 T = [0, 0, ‖T‖]^T. Denote this vector by a ∈ R³. Since det(R0) = 1, we know T̂ = R0^T â R0 from Lemma 5.1. Then EE^T = T̂RR^T T̂^T = T̂T̂^T = R0^T â â^T R0. It is direct to verify that
â â^T = diag{‖T‖², ‖T‖², 0}.
So the singular values of the essential matrix E = T̂R are (‖T‖, ‖T‖, 0). In general, in the SVD of E = UΣV^T, U and V are only unitary matrices, that is, matrices whose columns are orthonormal but whose determinants can be ±1. We still need to prove that U, V ∈ SO(3) (i.e. have determinant +1) to establish the theorem. We already have E = T̂R = R0^T â R0 R. Let RZ(θ) be the matrix which represents a rotation around the Z-axis (the X3-axis) by an angle of θ radians, i.e. RZ(θ) = e^{ê3 θ} with e3 = [0, 0, 1]^T ∈ R³. Then
RZ(+π/2) = [ 0 −1 0 ; 1 0 0 ; 0 0 1 ].
Then â = RZ(+π/2) RZ^T(+π/2) â = RZ(+π/2) diag{‖T‖, ‖T‖, 0}. Therefore
E = T̂R = R0^T RZ(+π/2) diag{‖T‖, ‖T‖, 0} R0 R.  (5.5)
yields
e^{2θω̂} = I + ω̂ sin(2θ) + ω̂²(1 − cos(2θ)),  and hence  ω̂² sin(2θ) + ω̂³(1 − cos(2θ)) = 0.  (5.6)
Since ω̂² and ω̂³ are linearly independent (Lemma 2.3 in [MLS94]), we have sin(2θ) = 1 − cos(2θ) = 0; that is, θ is equal to 2kπ or 2kπ + π, k ∈ Z. Therefore, R is equal to I or e^{ω̂π}. Now if ω = u = T/‖T‖, then it is direct from the geometric meaning of the rotation e^{ω̂π} that e^{ω̂π}T = T. On the other hand, if ω = −u = −T/‖T‖, then it follows that e^{ω̂π}T̂ = −T̂. Thus, in either case the conclusion of the lemma follows.
The following theorem shows how to extract rotation and translation
from an essential matrix as given in closed-form in equation (5.7) at the
end of the theorem.
Theorem 5.2 (Pose recovery from the essential matrix). There exist exactly two relative poses (R, T) with R ∈ SO(3) and T ∈ R³ corresponding to a non-zero essential matrix E ∈ E.
Proof. Assume that (R1, T1) ∈ SE(3) and (R2, T2) ∈ SE(3) are both solutions of the equation T̂R = E. Then T̂1R1 = T̂2R2, which yields T̂1 = T̂2R2R1^T. Since T̂1, T̂2 are both skew-symmetric matrices and R2R1^T is a rotation matrix, from the preceding lemma we have that either (R2, T2) = (R1, T1) or (R2, T2) = (e^{û1π}R1, −T1) with u1 = T1/‖T1‖. Therefore, given an essential matrix E there are exactly two pairs (R, T) such that T̂R = E. Further, if E has the SVD E = UΣV^T with U, V ∈ SO(3), the following formulas give the two distinct solutions:
(T̂1, R1) = (U RZ(+π/2) Σ U^T, U RZ^T(+π/2) V^T),
(T̂2, R2) = (U RZ(−π/2) Σ U^T, U RZ^T(−π/2) V^T).  (5.7)
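A small numerical sketch of (5.7): given an essential matrix, the two candidate (R, T̂) pairs can be computed from its SVD. This is an illustrative implementation under the stated convention E = T̂R, with determinant signs fixed so that U, V ∈ SO(3); function names are our own.

import numpy as np

def poses_from_essential(E):
    """Return the two (R, T_hat) pairs of equation (5.7) for a given E."""
    U, S, Vt = np.linalg.svd(E)
    # Make sure U and V are proper rotations (det = +1).
    if np.linalg.det(U) < 0:
        U[:, -1] *= -1
    if np.linalg.det(Vt) < 0:
        Vt[-1, :] *= -1
    sigma = (S[0] + S[1]) / 2.0            # an ideal E has singular values (s, s, 0)
    Sigma = np.diag([sigma, sigma, 0.0])
    Rz = lambda th: np.array([[np.cos(th), -np.sin(th), 0.0],
                              [np.sin(th),  np.cos(th), 0.0],
                              [0.0,         0.0,        1.0]])
    solutions = []
    for th in (np.pi / 2, -np.pi / 2):
        T_hat = U @ Rz(th) @ Sigma @ U.T   # skew-symmetric by construction
        R = U @ Rz(th).T @ Vt
        solutions.append((R, T_hat))
    return solutions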
Write the essential matrix E as
E = [ e1 e2 e3 ; e4 e5 e6 ; e7 e8 e9 ]  (5.8)
and stack its entries into the essential vector
e = [e1, e2, e3, e4, e5, e6, e7, e8, e9]^T ∈ R⁹.  (5.9)
The epipolar constraint (5.3) can then be rewritten as the inner product of a and e,
a^T e = 0.
Now, given a set of corresponding image points (x1^j, x2^j), j = 1, . . . , n, define a matrix A ∈ R^{n×9} associated with these measurements to be
A = [a1, a2, . . . , an]^T,  (5.10)
where the j-th row aj is constructed from each pair (x1^j, x2^j) using (6.21). In
the absence of noise, the essential vector e has to satisfy
Ae = 0.
(5.11)
This linear equation may now be solved for the vector e. For the solution to be unique (up to scale, ruling out the trivial solution e = 0), the rank of the matrix A ∈ R^{n×9} needs to be exactly eight. This should be the case given n ≥ 8 ideal corresponding points. In general, however, since correspondences may be noisy, there may be no exact solution to (5.11). In such a case, one may choose e to minimize the function ‖Ae‖², i.e. choose e to be the eigenvector of A^T A corresponding to its smallest eigenvalue. Another condition to be cognizant of is when the rank of A is less than 8, allowing for multiple solutions of equation (5.11). This happens when the feature points are not in general position, for example when they all lie in a plane. Again in the presence of noise when the feature points are in general
(5.12)
Then ‖E − F‖²_f
P = Q = [ cos(θ)  −sin(θ)  0 ; sin(θ)  cos(θ)  0 ; 0  0  1 ].
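The linear estimation step just described is easy to implement. The sketch below builds the measurement matrix A from calibrated correspondences, takes the singular vector associated with the smallest singular value as the essential vector, and then projects the resulting matrix onto the essential space by replacing its singular values with their ideal pattern. It is a bare-bones illustration with our own function names, not the full algorithm with all of its refinements.

import numpy as np

def estimate_essential(x1, x2):
    """Eight-point estimate of E from calibrated correspondences.
    x1, x2: arrays of shape (n, 3) of homogeneous image points, n >= 8,
    ideally satisfying x2^T E x1 = 0 for every row pair."""
    n = x1.shape[0]
    A = np.zeros((n, 9))
    for j in range(n):
        A[j] = np.kron(x2[j], x1[j])           # a^j such that a^T e = x2^T E x1
    _, _, Vt = np.linalg.svd(A)
    e = Vt[-1]                                 # right singular vector of the least singular value
    E = e.reshape(3, 3)
    # Project onto the essential space: equal non-zero singular values, third one zero.
    U, S, Vt = np.linalg.svd(E)
    sigma = (S[0] + S[1]) / 2.0
    return U @ np.diag([sigma, sigma, 0.0]) @ Vt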
into account such coplanar information and modify the essential constraint
accordingly.
λ2^j x2^j = λ1^j R x1^j + γ T,  1 ≤ j ≤ n.  (5.13)
Notice that, since (R, T) are known, the equations given by (5.13) are linear in both the structural scales λ's and the motion scale γ, and can therefore be easily solved. In the presence of noise, these scales can be retrieved by solving a standard linear least-squares estimation problem. Arrange all the unknown scales in (5.13) into an extended scale vector
λ⃗ = [λ1¹, . . . , λ1ⁿ, λ2¹, . . . , λ2ⁿ, γ]^T ∈ R^{2n+1}.
Then all the constraints given by (5.13) can be expressed as a single linear
equation
M λ⃗ = 0
(5.14)
x̃1^j = x1^j + w1^j,   x̃2^j = x2^j + w2^j,   j = 1, . . . , n,  (5.15)
where x1^j and x2^j are the ideal image coordinates and w1^j = [w1^{j1}, w1^{j2}, 0]^T and w2^j = [w2^{j1}, w2^{j2}, 0]^T are localization errors in the correspondence. Notice that it is the (unknown) ideal image coordinates xi^j, i = 1, 2, that satisfy the epipolar constraint x2^{jT} T̂ R x1^j = 0, and not the (measured) noisy ones x̃i^j, i = 1, 2. One could think of the ideal coordinates as a model that depends upon the unknown parameters (R, T), and of wi^j as the discrepancy between the model and the measurements: x̃i^j = xi^j(R, T) + wi^j. Therefore, one seeks the parameters (R, T) that minimize the discrepancy from the data, i.e. wi^j. It is clear that, in order to proceed, we need to define a discrepancy criterion.
The simplest choice is to assume that the wi^j are unknown errors and to minimize their norm, for instance the standard Euclidean two-norm ‖w‖2 ≐ √(w^T w). Indeed, minimizing the squared norm is equivalent and often results in simpler algorithms: R̂, T̂ = arg min_{R,T} φ(R, T), where
φ(R, T) ≐ Σ_{i,j} ‖wi^j‖²₂ = Σ_{i,j} ‖x̃i^j − xi^j(R, T)‖²₂.
This corresponds to a least-squares (LS) criterion. Notice that the unknowns are constrained to live in a non-linear space: R ∈ SO(3) and T ∈ S², the unit sphere, after normalization due to the unrecoverable global scale factor, as discussed in the eight-point algorithm.
Alternatively, one may assume that the wi^j are samples of a stochastic process with a certain known distribution that depends upon the unknown parameters (R, T), wi^j ∼ p(w|R, T), and maximize the (log-)likelihood function with respect to the parameters: R̂, T̂ = arg max_{R,T} φ(R, T), where
φ(R, T) ≐ Σ_{i,j} log p( (x̃i^j − xi^j) | R, T ).
The estimate is then obtained by solving the constrained minimization problem
min Σ_{j=1}^n Σ_{i=1}^2 ‖wi^j‖²₂  subject to
x̃i^j = xi^j + wi^j,  i = 1, 2,  j = 1, . . . , n,
x2^{jT} T̂ R x1^j = 0,  j = 1, . . . , n,
x1^{jT} e3 = 1,  x2^{jT} e3 = 1,  j = 1, . . . , n,
R ∈ SO(3),  T ∈ S².  (5.16)
Using Lagrange multipliers λ^j, γ^j, η^j, we can convert the above minimization problem into an unconstrained minimization, over R, T, x1^j, x2^j and the multipliers, of
Σ_{j=1}^n ( ‖x̃1^j − x1^j‖² + ‖x̃2^j − x2^j‖² + λ^j x2^{jT} T̂ R x1^j + γ^j (x1^{jT} e3 − 1) + η^j (x2^{jT} e3 − 1) ).  (5.17)
j
j2 21 j ebT3 eb3 TbRxj1
x2
=
x
xjT TbRxj =
0
2
1
(5.18)
b xj + x
b j
jT
2(xjT
2 T R
1
2 T Rx1 )
xjT RT TbT ebT eb3 TbRxj + xjT TbRb
eT eb3 RT TbT xj
1
or
j =
(5.19)
b xj
b j
2xjT
2
xjT
2 T R
1
2 T Rx1
=
.
T bT bT e
b eT eb3 RT TbT xj
b j
xjT
xjT
1 R T e
2 T Rb
3
3 b3 T Rx1
2
(5.20)
Substituting back we obtain
Σ_{j=1}^n (x2^{jT} T̂ R x̃1^j + x̃2^{jT} T̂ R x1^j)² / ( ‖ê3 T̂ R x1^j‖² + ‖x2^{jT} T̂ R ê3^T‖² ),  (5.21)
or, expressed directly in terms of the measured points,
Σ_{j=1}^n [ (x̃2^{jT} T̂ R x̃1^j)² / ‖ê3 T̂ R x̃1^j‖² + (x̃2^{jT} T̂ R x̃1^j)² / ‖x̃2^{jT} T̂ R ê3^T‖² ].  (5.22)
Figure 5.2. Two noisy image points x̃1, x̃2 ∈ R³. L2 is the so-called epipolar line, which is the intersection of the second image plane with the plane formed by the first image point x1 and the line connecting the two camera centers o1, o2. The distance d2 is the geometric distance between the second image point x̃2 and the epipolar line in the second image plane. Symmetrically, one can define a similar geometric distance d1 in the first image plane.
each such iteration decreases the cost function, and therefore convergence
to a local extremum is guaranteed since the cost function is bounded below
by zero.
Algorithm 5.2 (Optimal triangulation).
1. Initialization
Initialize x1^j(R, T) and x2^j(R, T) as x̃1^j and x̃2^j, respectively.
2. Motion estimation
Update (R, T) by minimizing φ(R, T, x1^j(R, T), x2^j(R, T)) given by
(5.21) or (5.22).
3. Structure triangulation
Solve for xj1 (R, T ) and xj2 (R, T ) which minimize the objective function with respect to a fixed (R, T ) computed from the previous
step.
4. Return to Step 2 until the decrement in the value of φ is below a
threshold.
In step 3 above, for a fixed (R, T), x1^j(R, T) and x2^j(R, T) can be computed by minimizing the distance ‖x1^j − x̃1^j‖² + ‖x2^j − x̃2^j‖² for each pair of image points. Let t2^j ∈ R³ be the normal vector (of unit length) to the (epipolar) plane spanned by (x2^j, T). Given such a t2^j, x1^j and x2^j are determined by:
xj1 (tj1 )
bj T bj
T j
eb3 tj1 tjT
e
b
x
+
t
3 1
1
1 t1 e 3
bT b
eT3 tj1 tj1 e3
xj2 (tj2 )
bj T bj
T j
eb3 tj2 tjT
e
b
x
+
t
3 2
2
2 t2 e 3
bT b
eT3 tj2 tj2 e3
where tj1 = RT tj2 R3 . Then the distance can be explicitly expressed as:
j2 k2 + kxj1 x
j1 k2
kxj2 x
= k
xj2 k2 +
j j
tjT
2 A t2
j j
tjT
2 B t2
+ k
xj1 k2 +
j j
tjT
1 C t1
j j
tjT
1 D t1
c
c
jT
j2 x
bT3 + xj2 eb3 + eb3 xj2 ),
I (b
e3 x
2 e
c
c
j1 x
jT
= I (b
e3 x
bT3 + xj1 eb3 + eb3 xj1 ),
1 e
B j = ebT3 eb3
Dj = ebT3 eb3
(5.23)
Then the problem of finding x1^j(R, T) and x2^j(R, T) becomes one of finding the t2^j which minimizes a sum of two singular Rayleigh quotients:
min_{t2^{jT} T = 0, t2^{jT} t2^j = 1}  V(t2^j) = (t2^{jT} A^j t2^j)/(t2^{jT} B^j t2^j) + (t2^{jT} R C^j R^T t2^j)/(t2^{jT} R D^j R^T t2^j).  (5.24)
This is an optimization problem on the unit circle S¹ in the plane orthogonal to the vector T (therefore, geometrically, motion and structure recovery from n pairs of image correspondences is an optimization problem on the space SO(3) × S² × Tⁿ, where Tⁿ is an n-torus, i.e. an n-fold product of S¹). If N1, N2 ∈ R³ are vectors such that T, N1, N2 form an orthonormal basis of R³, then t2^j = cos(θ)N1 + sin(θ)N2 with θ ∈ R. We only need to find the θ which minimizes the function V(t2^j(θ)). From the geometric interpretation of the optimal solution, we also know that the global minimum should lie between two values θ1 and θ2 such that t2^j(θ1) and t2^j(θ2) correspond to the normal vectors of the two planes spanned by (x̃2^j, T) and (Rx̃1^j, T) respectively (if x̃1^j, x̃2^j are already triangulated, these two planes coincide). The problem now becomes a simple bounded minimization problem for a scalar function and can be efficiently solved using standard optimization routines (such as fmin in Matlab or Newton's algorithm).
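Triangulation itself can be illustrated with a much simpler linear scheme than the optimal one described above: given (R, T) and a pair of image points, the depths λ1, λ2 satisfy λ2 x2 = λ1 R x1 + T, which can be solved in the least-squares sense. The sketch below is this simple linear alternative (not the optimal bounded minimization), with our own function names.

import numpy as np

def triangulate_linear(R, T, x1, x2):
    """Recover the depths (lambda1, lambda2) of a correspondence from
    lambda2*x2 = lambda1*R*x1 + T, solved by linear least squares,
    and return the 3-D point in the first camera frame."""
    A = np.column_stack((R @ x1, -x2))      # unknowns [lambda1, lambda2]
    lam, *_ = np.linalg.lstsq(A, -T, rcond=None)
    lambda1, lambda2 = lam
    return lambda1 * x1, (lambda1, lambda2)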
of the geometry of points in space as seen from a moving camera, and deriving a conceptual algorithm for reconstructing camera motion and scene
structure. In light of the fact that the camera motion is slow relative to the
sampling frequency we will treat the motion of the camera as continuous.
While the derivations proceed in parallel it is an important point of caution
to the reader that there are some subtle differences.
Ẋ(t) = ω̂(t)X(t) + v(t).  (5.25)
From now on, for convenience, we will drop the time dependency from the notation. The image of the point p taken by the camera is the vector x which satisfies λx = X. Denote the velocity of the image point x by u = ẋ ∈ R³; u is also called the image motion field, which under the brightness constancy assumption discussed in Chapter 3 can be approximated by the optical flow. Consider now the inner product of the vectors in (5.25) with the vector (v × x):
Ẋ^T(v × x) = (ω̂X + v)^T(v × x) = X^T ω̂^T v̂ x.  (5.26)
Further note that
Ẋ = λ̇x + λẋ  and  x^T(v × x) = 0.  (5.27)
Defining the symmetric matrix s ≐ ½(ω̂v̂ + v̂ω̂) ∈ R³ˣ³, we then have
uT vbx + xT sx = 0.
(5.28)
This equation suggests some redundancy in the continuous epipolar constraint (5.27): indeed, the matrix ω̂v̂ can only be recovered up to its symmetric component s = ½(ω̂v̂ + v̂ω̂).⁴ This structure is substantially different from the discrete case, and it cannot be derived as a first-order approximation of the essential matrix T̂R. In fact a naive discretization would lead to a matrix of the form v̂ω̂, whereas in the true continuous case we have to deal with only its symmetric component s = ½(ω̂v̂ + v̂ω̂)! The
set of interest in this case is the space of 6 × 3 matrices of the form
E' ≐ { [ v̂ ; ½(ω̂v̂ + v̂ω̂) ]  |  ω ∈ R³, v ∈ R³ } ⊂ R⁶ˣ³,
which we call the continuous essential space. A matrix in this space is called
a continuous essential matrix. Note that the continuous epipolar constraint
(5.28) is homogeneous in the linear velocity v. Thus v may be recovered only
up to a constant scale. Consequently, in motion recovery, we will concern
ourselves with matrices belonging to the normalized continuous essential
space with v normalized to be 1:
E1' ≐ { [ v̂ ; ½(ω̂v̂ + v̂ω̂) ]  |  ω ∈ R³, v ∈ S² } ⊂ R⁶ˣ³.
The skew-symmetric part of a continuous essential matrix simply corresponds to the velocity v. The characterization of the (normalized) essential matrix focuses only on its symmetric part s = ½(ω̂v̂ + v̂ω̂). We call the space of all matrices of this form the symmetric epipolar space
S ≐ { ½(ω̂v̂ + v̂ω̂)  |  ω ∈ R³, v ∈ S² } ⊂ R³ˣ³.
⁴ This redundancy is the very reason why different forms of the continuous epipolar constraint exist in the literature [ZH84, PG98, VF95, May93, BCB97], and, accordingly, various approaches have been proposed to recover ω and v (see [TTH96]).
A matrix in this space is called a symmetric epipolar component. The motion estimation problem is now reduced to the one of recovering the velocity
(, v) with R3 and v S2 from a given symmetric epipolar component
s.
The characterization of symmetric epipolar components depends on a
characterization of matrices of the form
b vb R33 , which is given in the
following lemma. We define the matrix RY () to be the rotation around
the Y -axis by an angle R, i.e. RY () = eeb2 with e2 = [0, 1, 0]T R3 .
Lemma 5.4. A matrix Q R33 has the form Q =
b vb with R3 , v S2
if and only if
Q = V RY ()diag{, cos(), 0}V T
(5.29)
b vbq = (v q).
V = (eb 2 v, b, v),
then Q has the given form (5.29).
We now prove the sufficiency. Given a matrix Q which can be decomposed
into the form (5.29), define the orthogonal matrix U = V RY () O(3).5
Let the two skew symmetric matrices
b and vb given by the formulae
(5.30)
b = U RZ ( ) U T , vb = V RZ ( )1 V T
2
2
where = diag{, , 0} and 1 = diag{1, 1, 0}. Then
b vb = U RZ ( ) U T V RZ ( )1 V T
2
2
T
= U RZ ( ) (RY ())RZ ( )1 V T
2
2
= U diag{, cos(), 0}V T
= Q.
(5.31)
Since and v have to be, respectively, the left and the right zero
eigenvectors of Q, the reconstruction given in (5.30) is unique.
Since (b
vb)T = vb
b , it yields
s=
1
V RY ()diag{, cos(), 0} diag{, cos(), 0}RYT () V T .
2
2 cos()
0
sin()
.
0
2 cos()
0
=
sin()
0
0
(5.33)
(5.33)
= 1 3 ,
= arccos(2 /),
0
[0, ]
Define a matrix V SO(3) to be V = V1 RYT 2 2 . Then s =
1
T
2
2 V D(, )V . According to Lemma 5.4, there exist vectors v S and
3
R such that
Therefore, 12 (b
vb + vb
b ) = 21 V D(, )V T = s.
[Kan93] but it has never been really exploited in the literature for designing algorithms. The constructive proof given above is important in this
regard, since it gives an explicit decomposition of the symmetric epipolar
component s, which will be studied in more detail next.
Following the proof of Theorem 5.4, if we already know the eigenvector
decomposition of a symmetric epipolar component s, we certainly can find
at least one solution (, v) such that s = 12 (b
vb + vb
b ). We now discuss
uniqueness, i.e. how many solutions exist for s = 12 (b
vb + vb
b ).
Theorem 5.5 (Velocity recovery from the symmetric epipolar
component). There exist exactly four 3-D velocities (, v) with R 3
and v S2 corresponding to a non-zero s S.
vb + vb
b ).
Proof. Suppose (1 , v1 ) and (2 , v2 ) are both solutions for s = 21 (b
Then we have
vb1
b1 +
b1 vb1 = vb2
b2 +
b2 vb2 .
(5.34)
(5.35)
D(1 , 1 ) = W D(2 , 2 )W T .
(5.36)
Since both sides of (5.36) have the same eigenvalues, according to (5.32),
we have
1 = 2 ,
2 = 1 .
cos() 0 sin()
cos()
0 sin()
or
.
0
1
0
0
1
0
sin()
0 cos()
sin() 0 cos()
From the geometric meaning of V1 and V2 , all the cases give either
b 1 vb1 =
b2 vb2 or
b1 vb1 = vb2
b2 . Thus, according to the proof of Lemma 5.4, if (, v)
is one solution and
b vb = U diag{, cos(), 0}V T , then all the solutions
are given by
b = U RZ ( 2 ) U T , vb = V RZ ( 2 )1 V T ;
(5.37)
b = V RZ ( 2 ) V T , vb = U RZ ( 2 )1 U T
where = diag{, , 0} and 1 = diag{1, 1, 0}.
Suppose v̂ and s are of the form
v̂ = [ 0  −v3  v2 ; v3  0  −v1 ; −v2  v1  0 ],   s = [ s1  s2  s3 ; s2  s4  s5 ; s3  s5  s6 ],
and define the (continuous) essential vector
e = [v1 , v2 , v3 , s1 , s2 , s3 , s4 , s5 , s6 ]T .
(5.38)
(5.39)
(5.41)
where a are defined for each pair (x , u ) using (5.40). In the absence of
noise, the essential vector e has to satisfy
Ae = 0.
(5.42)
In order for this equation to have a unique solution for e, the rank of the
matrix A has to be eight. Thus, for this algorithm, the optical flow vectors
of at least eight points are needed to recover the 3-D velocity, i.e. n 8,
although the minimum number of optical flows needed is actually 5 (see
Maybank [May93]).
When the measurements are noisy, there might be no solution of e for
Ae = 0. As in the discrete case, one may approximate the solution by minimizing the error function ‖Ae‖².
Since the continuous essential vector e is recovered from noisy measurements, the symmetric part s of E directly recovered from e is not necessarily
a symmetric epipolar component. Thus one can not directly use the previously derived results for symmetric epipolar components to recover the
3-D velocity. Similarly to what we have done for the discrete case, we can
first estimate the symmetric matrix s, and then project it onto the space
of symmetric epipolar components.
Theorem 5.6 (Projection to the symmetric epipolar space). If a
real symmetric matrix F R33 is diagonalized as F = V diag{1 , 2 , 3 }V T
⁶ For perspective projection, z = 1 and u3 = 0; thus the expression for a can be simplified.
σ1 = (2λ1 + λ2 − λ3)/3,   σ2 = (λ1 + 2λ2 + λ3)/3,   σ3 = (2λ3 + λ2 − λ1)/3.  (5.43)
w1
W = w4
w7
Then
kE F k2f
w2 w3
w5 w6 .
w8 w9
= k W W T k2f
(5.44)
(5.45)
Substituting (5.44) into the second term, and using the fact that 2 =
1 + 3 and W is a rotation matrix, we get
tr(W W T )
j = 1, . . . , n.
σ1 = (2λ1 + λ2 − λ3)/3,   σ2 = (λ1 + 2λ2 + λ3)/3,   σ3 = (2λ3 + λ2 − λ1)/3;
b = U RZ ( 2 ) U T , vb = V RZ ( 2 )1 V T
b = V RZ ( 2 ) V T , vb = U RZ ( 2 )1 U T
where = diag{, , 0} and 1 = diag{1, 1, 0};
v = v0 .
Remark 5.2. Since both E, E E10 satisfy the same set of continuous
epipolar constraints, both (, v) are possible solutions for the given set
of optical flows. However, as in the discrete case, one can get rid of the
ambiguous solution by enforcing the positive depth constraint.
In situations where the motion of the camera is partially constrained,
the above linear algorithm can be further simplified. The following example
illustrates how it can be done.
Example 5.1 (Kinematic model of an aircraft). This example shows
how to utilize the so called nonholonomic constraints (see Murray, Li and
Sastry [MLS94]) to simplify the proposed linear motion estimation algorithm in the continuous case. Let g(t) SE(3) represent the position
and orientation of an aircraft relative to the spatial frame, the inputs
1 , 2 , 3 R stand for the rates of the rotation about the axes of the
aircraft and v1 R the velocity of the aircraft. Using the standard homogeneous representation for g (see Murray, Li and Sastry [MLS94]), the
kinematic equations of the aircraft motion are given by
0 3 2 v1
3
0 1 0
g
g =
2 1
0 0
0
where 1 stands for pitch rate, 2 for roll rate, 3 for yaw rate and v1
the velocity of the aircraft. Then the 3-D velocity (, v) in the continuous
epipolar constraint (5.28) has the form = [1 , 2 , 3 ]T , v = [v1 , 0, 0]T . For
the algorithm given above, we here have extra constraints on the symmetric
vb + vb
b ): s1 = s5 = 0 and s4 = s6 . Then there are only
matrix s = 12 (b
four different essential parameters left to determine and we can re-define
the essential parameter vector e R4 to be e = [v1 , s2 , s3 , s4 ]T . Then the
measurement vector a R4 is to be a = [u3 y u2 z, 2xy, 2xz, y 2 + z 2 ]T . The
continuous epipolar constraint can then be rewritten as
aT e = 0.
If we define the matrix A from a as in (5.41), the matrix AT A is a 4 4
matrix rather than a 9 9 one. For estimating the velocity (, v), the dimensions of the problem is then reduced from 9 to 4. In this special case,
the minimum number of optical flow measurements needed to guarantee a
unique solution of e is reduced to 3 instead of 8. Furthermore, the symmetric matrix s recovered from e is automatically in the space S and the
remaining steps of the algorithm can thus be dramatically simplified. From
this simplified algorithm, the angular velocity = [1 , 2 , 3 ]T can be fully
recovered from the images. The velocity information can then be used for
controlling the aircraft.
As in the discrete case, the linear algorithm proposed above is not optimal since it does not enforce the structure of the parameter space during
the minimization. Therefore, the recovered velocity does not necessarily
minimize the originally chosen error function kAe(, v)k2 on the space E10 .
Again like in the discrete case, we have to assume that translation is not
zero. If the motion is purely rotational, then one can prove that there are
infinitely many solutions to the epipolar constraint related equations. We
leave this as an exercise to the reader.
Ẋ(t) = ω̂ X(t) + v(t),
r(t) ≐ λ̇(t)/λ(t) is independent of the choice of λ(t). Notice that d/dt (ln λ) = λ̇/λ. Let the logarithm of the structural scale λ be y ≐ ln λ. Then a time-consistent
estimation (t) needs to satisfy the following ordinary differential equation,
which we call the dynamic scale ODE
ẏ(t) = r(t).
Given y(t0 ) = y0 = ln((t0 )), solve this ODE and obtain y(t) for t [t0 , tf ].
Then we can recover a consistent scale (t) given by
(t) = exp(y(t)).
Hence (structure and motion) scales estimated at different time instances
now are all relative to the same scale at time t0 . Therefore, in the continuous
case, we are also able to recover all the scales as functions of time up to a
universal scale. The reader must be aware that the above scheme is only
conceptual. In practice, the ratio function r(t) would never be available for
all time instances in [t0 , tf ].
Comment 5.1 (Universal scale ambiguity). In both the discrete and
continuous cases, in principle, the proposed schemes can reconstruct both
the Euclidean structure and motion up to a universal scale.
5.5 Summary
The seminal work of Longuet-Higgins [LH81] on the characterization of the
so called epipolar constraint, has enabled the decoupling of the structure
and motion problems and led to the development of numerous linear and
nonlinear algorithms for motion estimation from two views, see [Fau93,
Kan93, May93, WHA93] for overviews.
5.6 Exercises
1. Linear equation
Solve x Rn from the following equation
Ax = b
where A Rmn and b Rm . In terms of conditions on the matrix
A and vector b, describe: When does a solution exist or not exist, and
when is the solution unique or not unique? In case the solution is not
unique, describe the whole solution set.
2. Properties of skew symmetric matrix
(a) Prove Lemma 5.1.
(b) Prove Lemma 5.3.
3. A rank condition for epipolar constraint
Show that
x2^T T̂ R x1 = 0
if and only if
rank [ x̂2 R x1,  x̂2 T ] ≤ 1.
4. Rotational motion
Assume that camera undergoes pure rotational motion, i.e. it rotates
around its center. Let R SO(3) be the rotation of the camera and
so(3) be the angular velocity. Show that, in this case, we have:
(a) Discrete case: x2^T T̂ R x1 ≡ 0, ∀ T ∈ R³;
(b) Continuous case: x^T ω̂ v̂ x + u^T v̂ x ≡ 0, ∀ v ∈ R³.
5. Projection to O(3)
Given an arbitrary 3 3 matrix M R33 with positive singular
values, find the orthogonal matrix R O(3) such that the error
kR M k2f is minimized. Is the solution unique? Note: Here we allow
det(R) = 1.
6. Geometric distance to epipolar line
Given two image points x1, x̃2 with respect to camera frames with their relative motion (R, T), show that the geometric distance d2 defined in Figure 5.2 is given by the formula
d2² = (x̃2^T T̂ R x1)² / ‖ê3 T̂ R x1‖²,
where e3 = [0, 0, 1]^T ∈ R³.
Figure 5.4. Geometry of camera frames 1 and 2 relative to a plane P.
assume that the focal length of the camera is 1. That is, if x is the
image of a point p P with 3-D coordinates X = [X1 , X2 , X3 ]T , then
x = X/X3 = [X1 /X3 , X2 /X3 , 1]T R3 .
Follow the following steps to establish the so-called homography
between two images of the plane P:
(a) Verify the following simple identity
x̂ X = 0.
(5.48)
(b) Suppose that the (R, T ) SE(3) is the rigid body transformation from frame 1 to 2. Then the coordinates X1 , X2 of a fixed
point p P relative to the two camera frames are related by
X2 = ( R + (1/d1) T N1^T ) X1,  (5.49)
where d1 is the perpendicular distance of camera frame 1 to
the plane P and N1 S2 is the unit surface normal of P
relative to camera frame 1. In the equation, the matrix H =
(R + d11 T N1T ) is the so-called homography matrix in the computer vision literature. It represents the transformation from
X1 to X2 .
(c) Use the above identities to show that: Given the two images
x1 , x2 R3 of a point p P with respect to camera frames 1
and 2 respectively, they satisfy the constraint
x̂2 ( R + (1/d1) T N1^T ) x1 = 0.
(5.50)
(d) Prove that in order to solve uniquely (up to a scale) H from the
above equation, one needs the (two) images of at least 4 points
in P in general position.
8. Singular values of the homography matrix
Prove that any matrix of the form H = R + uv T with R SO(3)
and u, v R3 must have a singular value 1. (Hint: prove that the
matrix uv T + vuT + vuT uv T has a zero eigenvalue by constructing
the corresponding eigenvector.)
9. The symmetric component of the outer product of two vectors
Suppose u, v R3 , and kuk2 = kvk2 = . If u 6= v, the matrix
D = uv T + vuT R33 has eigenvalues {1 , 0, 3 }, where 1 > 0,
and 3 < 0. If u = v, the matrix D has eigenvalues {2, 0, 0}.
10. The continuous homography matrix
The continuous version of the homography matrix H introduced
above is H' = ω̂ + (1/d1) v N1^T, where ω, v ∈ R³ are the angular and linear
velocities of the camera respectively. Suppose that you are given a
matrix M which is also known to be of the form M = H 0 + I for
some R. But you are not told the actual value of . Prove that
you can uniquely recover H 0 from M and show how.
11. Implementation of the SVD-based pose estimation algorithm
Implement a version of the three step pose estimation algorithm for
two views. Your MATLAB codes are responsible for:
Initialization: Generate a set of N ( 8) 3-D points; generate
a rigid body motion (R, T ) between two camera frames and
project (the coordinates of) the points (relative to the camera
frame) onto the image plane correctly. Here you may assume the
focal length is 1. This step will give you corresponding images
as input to the algorithm.
Motion Recovery: use the corresponding images and the algorithm to compute the motion (R̂, T̂) and compare it to the
ground truth (R, T ).
After you get the correct answer from the above steps, here are a few
suggestions for you to play with the algorithm (or improve it):
A more realistic way to generate these 3-D points is to make sure
that they are all indeed in front of the image plane before and
after the camera moves. If a camera has a field of view, how
to make sure that all the feature points are shared in the view
before and after.
Chapter 6
Camera calibration and self-calibration
trinsic constraints only. For the cases when the conditions are not satisfied
we will characterize the associated ambiguities and suggest practical ways
to resolve them.
If we could measure metric coordinates of points on the image plane (as
opposed to pixel coordinates), we could model the projection as we have
done in previous chapters:
x = P gX
(6.1)
where P = [I, 0] and g SE(3) is the pose of the camera in the world
reference frame. In practice, however, calibration parameters are not known
ahead of time and, therefore, a more realistic model of the geometry of the
image formation process takes the form
x0 = AP gX
(6.2)
where A ∈ R³ˣ³ is the matrix of intrinsic parameters¹
A = [ f sx   f sθ   ox
      0      f sy   oy
      0      0      1  ].  (6.3)
be easily specified relative to, say, the top-left corner of the checkerboard.
Let us call these coordinates X. The change of coordinates between the
reference frame on the calibration rig and the camera frame is denoted by
g and the calibration matrix by A, so that when we image the calibration
rig we measure pixel coordinates x0 of points on calibration grid, which
satisfy
x0 = AP gX = P X.
This can be written for each point on the calibration rig as:
λ [x^i, y^i, 1]^T = [ p1^T ; p2^T ; p3^T ] [X^i, Y^i, Z^i, 1]^T.  (6.4)
For clarity we omit the primes from the pixel coordinates of image points in the above equation. As in the previous chapters, we can eliminate λ from the above equation by multiplying both sides by x̂'. This yields three equations, of which only two are independent. Hence for each point we have the following two constraints:
xi (pT3 X) = pT1 X
y i (pT3 X) = pT2 X
(6.5)
subject to kpk2 = 1
(6.6)
A ∈ R³ˣ³ (in its upper triangular form) and the rotation matrix R ∈ SO(3) using a routine QR decomposition, and
T = A^{−1} [p14, p24, p34]^T.
Several free software packages are available to perform calibration with a
rig. Most also include compensation for radial distortion and other lens
artifacts that we do not address here. In Chapter ?? we walk the reader
through use of one of these packages in a step-by-step procedure to calibrate
a camera.
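A compact sketch of the rig-based procedure just outlined: stack the two constraints (6.5) for every 3-D/2-D correspondence, solve for the projection matrix by SVD, and split it into intrinsics and pose with a QR-based decomposition. Function names and the use of NumPy are our own, and a real implementation would also handle radial distortion, as the packages mentioned above do.

import numpy as np

def calibrate_with_rig(X, x):
    """X: (n, 3) known rig points; x: (n, 2) measured pixel coordinates, n >= 6.
    Returns (A, R, T); the projection matrix is recovered only up to scale."""
    n = X.shape[0]
    M = np.zeros((2 * n, 12))
    for i in range(n):
        Xh = np.append(X[i], 1.0)
        M[2 * i, 0:4] = Xh
        M[2 * i, 8:12] = -x[i, 0] * Xh        # x^i (p3^T X) = p1^T X
        M[2 * i + 1, 4:8] = Xh
        M[2 * i + 1, 8:12] = -x[i, 1] * Xh    # y^i (p3^T X) = p2^T X
    _, _, Vt = np.linalg.svd(M)
    P = Vt[-1].reshape(3, 4)                   # least-squares solution, ||p|| = 1
    # Split P[:, :3] = A R, with A upper triangular and R a rotation:
    # a QR factorization of its inverse gives inv(AR) = R^T A^{-1}.
    Q, U = np.linalg.qr(np.linalg.inv(P[:, :3]))
    R, A = Q.T, np.linalg.inv(U)
    S = np.diag(np.sign(np.diag(A)))           # fix signs on the diagonal of A
    A, R = A @ S, S @ R
    T = np.linalg.inv(A) @ P[:, 3]
    return A, R, T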
(6.7)
(6.8)
Alternatively, the transformation from the calibrated space to the uncalibrated given by x = A1 x0 , can be directly applied to epipolar constraint
by substituting for x:
xT2 TbRx1 = 0
(6.9)
consider first just the rigid body motion equation and multiply both sides by A:
λ2 A x2 = λ1 A R x1 + A T,  (6.10)
where λx = X, x' = Ax and T' = AT. In order to obtain an intrinsic constraint in terms of image measurements only, similarly to the calibrated case, we can eliminate the unknown depths by multiplying both sides of (6.7) by (x2' × T'), yielding the following constraint:
x2'^T T̂' A R A^{−1} x1' = 0.
(6.11)
Alternatively the transformation from the calibrated space to the uncalibrated given by x = A1 x0 , can be directly applied to epipolar constraints
by substituting for x:
xT2 TbRx1 = 0
(6.12)
The above cases both correspond to the uncalibrated version of the epipolar
constraint:
x2'^T F x1' = 0
(6.13)
where the fundamental matrix F is given by F ≐ T̂' A R A^{−1}.  (6.14)
We will study these different forms more closely in Section 6.3 and show
that the second form turns out to be more convenient to use for camera selfcalibration. Before getting more insights into the theory of self-calibration,
let us have a look at some geometric intuition behind the fundamental
matrix, its algebraic properties and algorithm for its recovery.
Figure 6.1. Two projections x1', x2' ∈ R³ of a 3-D point p from two vantage points. The relative Euclidean transformation between the two vantage points is given by (R, T) ∈ SE(3).
T
b M ) = (U RZ ( )diag{1, 1, 0}U T , U RZ
( )V T ) (6.15)
(m,
2
2
1 0 0
= 0 2 0
(6.16)
then we have:
T
T
b M ) = (U RZ ( )diag{1, 1, 0}U T , U RZ
(m,
( )V
) (6.17)
2
2
give the rise to the same fundamental matrix, for arbitrary choice of
parameters , , .
So, as we can see in the uncalibrated case, while the fundamental matrix
is uniquely determined from point correspondences, the factorization of F
into a skew-symmetric part and a non-singular matrix is not unique. There
is a three-parameter family of transformations consistent with the point
correspondences, which gives rise to the same fundamental matrix. This
fact can be observed directly from the epipolar constraint by noticing that:
x2'^T T̂' A R A^{−1} x1' = x2'^T T̂' (A R A^{−1} + T' v^T) x1' = 0.
(6.18)
F = [ f1 f2 f3 ; f4 f5 f6 ; f7 f8 f9 ]  (6.19)
and a corresponding vector f ∈ R⁹,
f = [f1, f2, f3, f4, f5, f6, f7, f8, f9]^T ∈ R⁹.
(6.20)
where a is given by
a = [x2'x1', x2'y1', x2'z1', y2'x1', y2'y1', y2'z1', z2'x1', z2'y1', z2'z1']^T ∈ R⁹.  (6.21)
Given a set of corresponding points, and forming the matrix of measurements A = [a1, a2, . . . , an]^T, f can be obtained as the minimizer of the least-squares objective ‖Af‖². Such a solution corresponds to the eigenvector associated with the smallest eigenvalue of A^T A. In the context of the above linear algorithm, since the point correspondences are given in pixel coordinates, the individual entries of the matrix A may be unbalanced and affect the conditioning of A^T A. An improvement can be achieved by normalizing transformations T1 and T2 in the respective images, which make the measurements zero-mean and of unit variance [?].
By the same arguments as in the calibrated case, the linear least-squares
solution is not an optimal one, and it is typically followed by a nonlinear
refinement step. More details on the nonlinear refinement strategies are
described [HZ00].
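The normalization and the linear solve described above fit in a few lines. The sketch below implements a plain normalized eight-point estimate of F, including the usual enforcement of rank 2 on the result; function names and the exact normalization (zero mean, unit average norm) are our own choices.

import numpy as np

def normalize_points(x):
    """Similarity transform taking the points to zero mean and unit average norm."""
    mean = x[:, :2].mean(axis=0)
    scale = np.sqrt(2.0) / np.mean(np.linalg.norm(x[:, :2] - mean, axis=1))
    T = np.array([[scale, 0.0, -scale * mean[0]],
                  [0.0, scale, -scale * mean[1]],
                  [0.0, 0.0, 1.0]])
    return (T @ x.T).T, T

def estimate_fundamental(x1, x2):
    """x1, x2: (n, 3) homogeneous pixel coordinates, n >= 8, with x2^T F x1 = 0."""
    x1n, T1 = normalize_points(x1)
    x2n, T2 = normalize_points(x2)
    A = np.array([np.kron(p2, p1) for p1, p2 in zip(x1n, x2n)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt    # enforce rank 2
    return T2.T @ F @ T1                       # undo the normalization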
: R3
X
R3
7 X0 = AX
Q K, R SO(3).
(6.24)
(6.25)
From equation (6.25) it is clear that an uncalibrated camera with calibration matrix A undergoing the motion (R(t), T (t)) observing the point
p E3 can never be distinguished from an uncalibrated camera with calibration matrix B undergoing the motion (R0 R(t)R0T , R0 T (t)) observing
the point R0 (p) E3 . The effect of R0 is nothing but a rotation of the
overall configuration space.
Therefore, without knowing camera motion and scene structure, the matrix A associated with an uncalibrated camera can only be recovered up to
an equivalence class A in the space SL(3)/SO(3). The subgroup K of all
upper-triangular matrices in SL(3) is one representation of such a space, as
is the space of 3 3 symmetric positive definite matrices with determinant
1. Thus, SL(3)/SO(3) does provide an intrinsic geometric interpretation
for the unknown camera parameters. In general, the problem of camera selfcalibration is then equivalent to the problem of recovering the symmetric
matrix = AT A1 , from which the upper-triangular representation of
the intrinsic parameters can be easily obtained from the Cholesky factorization. The coordinate transformation in the uncalibrated space is given
by
AX(t) = AR(t)X(t0 ) + AT (t)
0
(6.26)
From the ideal camera model, the image of the point p0 with respect to
such a camera (i.e. with an identity calibration matrix) is given by
(t)x0 (t) = P X0 (t)
(6.29)
(6.30)
(6.31)
p 7 Rp + T.
(6.32)
v 7 Rv.
(6.33)
g : T R3 R 3 T R 3 R 3 ;
v T 7 v T R
(6.34)
since this will be useful later. We equip E3 with the usual inner product
.
h, iS : T R3 T R3 7 R ; (u, v) 7 hu, vi = uT v
(6.35)
and we verify that such an inner product is invariant under the action of
the group, as hRu, Rvi = uT RT Rv = uT v = hu, vi. The inner product
.
operates on covectors exactly in the same way: huT , v T i = uT v, so that
T
T
the group action on covectors is also invariant: hu R, v Ri = uT RRT v =
uT v = huT , v T i. If we are given three (column) vectors, u, v, w, it is easy
to verify that their volume, i.e. the determinant of the matrix having them
as columns, is also invariant under the group action:
det[Ru, Rv, Rw] = det(R) det[u, v, w] = det[u, v, w].
(6.36)
Now consider a transformation of the Euclidean space E3 via a (nonsingular) linear map into another copy of the Euclidean space, which we
call E3 (it will soon be clear why we want to distinguish the two copies)
A : E3 E3 ;
p 7 Ap = q.
(6.37)
q 7 ARA1 q + AT
(6.38)
) where
U = AT R3 .
(6.39)
u 7 ARA1 u
(6.40)
uT 7 uT ARA1 .
(6.41)
The bilinear, positive definite form that is invariant under the action of h,
and plays the role of the inner product, is given by
.
h, iS : T R3 T R3 R ; (u, v) 7 hu, viS = uT AT A1 v
(6.42)
so that we have hARA1 u, ARA1 viS = uT AT RT AT AT A1 ARA1 v
= uT AT A1 v = hu, viS . The above bilinear form is determined uniquely
by its values on the elements of the basis, i.e. the elements of the matrix
.
S = AT A1 , so that we have
.
hu, viS = uT Sv
(6.43)
The bilinear form above acts on covectors via
h, iS : T R3 T R3 R ;
.
(uT , v T ) 7 huT , v T iS = uT AAT v
(6.44)
(6.45)
.
huT , v T iS = uT S 1 v which is different from hu, viS ; this shows that we
cannot quite treat row and column vectors alike. The difference in the
value of the inner product on vectors and covectors of E3 was not visible
on h, i since in that case S = I. If we assume that the transformation A has
determinant 1 (we can do this simply by scaling A, since we have assumed
it is nonsingular, so det(A) 6= 0) then, given three vectors u, v, w, their
volume is invariant under the group action, as
det[ARA1 u, ARA1 v, ARA1 w] = det(ARA1 ) det[u, v, w] = det[u, v, w].
(6.46)
and on covectors via
h : T R 3 T R 3 ;
uT 7 uT ARA1 .
(6.47)
(6.51)
.
where we define the Essential matrix Q = TbR T SO(3). The same volume
constraint, in the uncalibrated space, is given by
hq(t), F q0 iS = 0
(6.52)
(6.53)
The importance of the above constraints is that, since they are homogeneous, we can substitute readily x = 1 p for p in the calibrated space,
and x = 1 q for q in the uncalibrated one. Since x can be measured from
images, the above equations give constraints either on the Essential matrix
Q, in the calibrated case, or on the Fundamental matrix F , in the uncalibrated case. A more common expression of the fundamental matrix is given
by AT QA1 , which can be derived using the fact that
so that
d = AT TbA1
AT
(6.54)
(6.55)
Note that the vector U = AT , also called the epipole, is the (right) nullspace of the F matrix, which can be easily (i.e. linearly) computed from
image measurements. As in the previous section, let A be a generic, nonsingular matrix A SL(3) and write its Q-R decomposition as A = BR0 .
We want to make sure that the fundamental matrix does not depend on R0 :
F = (BR0 )T TbR(BR0 )1 = B T R0 TbR0T R0 RR0T B 1 . Therefore, we have
bRB
1 , where T = R0 T and R
= R0 RR0T .
that F = AT TbRA1 = B T T
In other words, as we expected, we cannot distinguish a point moving
with
with (T, R) with calibration matrix A from one moving with (T, R)
calibration matrix B.
Enforcing the properties of the inner product: Kruppa equations
Since the Fundamental matrix can be recovered only up to a scale, let us
assume, for the moment, that kU k = kAT k = 1. The following derivation
is taken from [?]. Let R0 SO(3) be a rotation matrix such that
U = R 0 e3
(6.56)
where e3 = [0 0 1]T . Note that at least one such matrix always exists (in
geometric terms, one would say that this is a consequence of SO(3) acting
transitively on the sphere S 2 ). Consider now the matrix D = R0 F R0T . A
convenient expression for it is given by
T
d1
.
(6.57)
D = R0 F R0T = eb3 (R0 A)T R(R0 A)T = dT2 .
0
= R0 A
\
A1
R0 e3 RA1 R0T
Since e3 is in the (left) null space of D, we can arrange the rows so that
the last one is zero. The remaining two columns are given by
(
dT1 = eT2 R0 AR(R0 A)1
(6.58)
dT2 = eT1 R0 AR(R0 A)1 .
Now, the crucial observation comes from the fact that the covectors d T1
and dT2 are obtained through a transformation of the covectors eT1 and eT2
= (U
, AR
A1 ), as indicated in equathrough the action of the group h
T
T
T
T
hd1 , d2 iS = he1 , e2 iS
T
T
T
(6.59)
hd1 , d1 iS = he2 , eT2 iS
T T
T
T
hd2 , d2 iS = he1 , e1 iS .
T 1
2 T 1
d 1 S d 2 = e 1 S e 2
T 1
(6.60)
d1 S d1 = 2 eT2 S 1 e2
T 1
2 T 1
d2 S d2 = e1 S e1
.
where 1 = kAT k. In order to eliminate the dependency on , we can
consider ratios of inner products. We obtain therefore 2 independent constraints on the 6 independent components of the inner product matrix S
which are called Kruppa equations.
hdT1 , dT2 iS
hdT , dT iS
hdT , dT iS
= 1T T1
= 2T T2 .
T
T
he1 , e2 iS
he2 , e2 iS
he1 , e1 iS
(6.61)
(6.62)
1T 1 1
2T 1 2
1T 1 2
=
=
2T 1 2
1T 1 1
1T 1 2
(6.63)
F 1 F T = c
T 0 1 c
T0 .
(6.64)
scaled by R, i.e. F = c
T 0 ARA1 ,5 we then have the matrix Kruppa
equation
c0 1 c
F 1 F T = 2 T
T0
(6.65)
way
T
c
T 0 (1 ARA1 1 AT RT AT )c
T 0 = 0.
(6.66)
(6.67)
(6.68)
(6.69)
(6.70)
1 C1 C T = 0.
(6.71)
6 Note that from the equation we can only solve ARA 1 up to an arbitrary scale. But
we also know that det(ARA1 ) must be 1.
129
C33
7
X CXC T .
(6.72)
130
The proof of this lemma is simply algebraic. This simple lemma, however, states a very important fact: given a set of fundamental matrices
Fi = c
Ti0 ARi A1 with Ti0 = ATi , i = 1, . . . , n, there is a one-to-one
correspondence between the set of solutions of the equations
T
Fi XFiT = 2i c
Ti0 X c
Ti0 ,
i = 1, . . . , n
T
Ei Y EiT = 2i Tbi Y Tbi ,
i = 1, . . . , n
can always be written of the form R = eub for some [0, ] and u S2 .
131
c0 T
c0 is a projection matrix to the plane spanned
Proof. Note that, since T
T
by the column vectors of c
T 0 , we have the identity c
T0 c
T 0c
T0 = c
T 0 . First
T 0F =
we prove the parallel case. It can be verified that, in general, F T c
\
T T . Since the axis of R is parallel to T , we have R T T = T hence
2 AR
FTc
T 0 F = 2 c
T 0 . For the perpendicular case, let u R3 be the axis of R.
By assumption T = A1 T 0 is perpendicular to u. Then there exists v R3
such that u = TbA1 v. Then it is direct to check that c
T 0 v is the eigenvector
of F T c
T 0 corresponding to the eigenvalue .
132
Then for these two types of special motions, the associated fundamental matrix can be immediately normalized by being divided by the scale
. Once the fundamental matrices are normalized, the problem of finding
the calibration matrix 1 from the normalized matrix Kruppa equations
(6.64) becomes a simple linear one. A normalized matrix Kruppa equation
in general imposes three linearly independent constraints on the unknown
calibration matrix given by (??). However, this is no longer the case for
the special motions that we are considering here.
Theorem 6.4 (Degeneracy of the normalized Kruppa equations).
Consider a camera motion (R, T ) SE(3) where R = eub has the angle
(0, ). If the axis u R3 is parallel or perpendicular to T , then the
normalized matrix Kruppa equation: TbRY RT TbT = TbY TbT imposes only
two linearly independent constraints on the symmetric matrix Y .
Proof. For the parallel case, by restricting Y to the plane spanned by the
column vectors of Tb, it yields a symmetric matrix Y in R22 . The rotation
SO(2). The
matrix R SO(3) restricted to this plane is a rotation R
Y R
T = 0.
normalized matrix Kruppa equation is then equivalent to Y R
Since 0 < < , this equation imposes exactly two constraints on the threedimensional space of 22 symmetric real matrices. The identity I22 is the
only solution. Hence the normalized Kruppa equation imposes exactly two
linearly independent constraints on Y . For the perpendicular case, since u
is in the plane spanned by the column vectors of Tb, there exist v R3 such
that (u, v) form an orthonormal basis of the plane. Then the normalized
matrix Kruppa equation is equivalent to
TbRY RT TbT = TbY TbT
v T RY RT v = v T Y v.
(6.73)
These are the only two constraints on Y imposed by the normalized Kruppa
equation.
According to this theorem, although we can renormalize the fundamental matrix when rotation axis and translation are parallel or perpendicular,
we only get two independent constraints from the resulting (normalized)
Kruppa equation corresponding to a single fundamental matrix, i.e. one of
the three linear equations in (??) depends on the other two. So degeneracy
does occur to normalized Kruppa equations. Hence for these motions, in
general, we still need three such fundamental matrices to uniquely determine the unknown calibration. What happens to the unnormalized Kruppa
equations? If we do not renormalize the fundamental matrix and directly
use the unnormalized Kruppa equations (6.63) to solve for calibration, the
two nonlinear equations in (6.63) may become algebraically dependent. The
following corollary shows that this is at least true for the case when the
133
Proof. From the proof of Theorem 6.4, we know that any other linear constraint on Y imposed by the normalized Kruppa equations must be a linear
combination the two given in (6.73)
v T RY u + v T RY RT v = v T Y u + v T Y v,
, R.
Cases
( 6= 0, 2 )
( = 0)
( =
2)
Type of Constraints
Unnormalized
Normalized
Unnormalized
Normalized
Unnormalized
Normalized
# of Constraints on 1
2
3
2
2
1
2
134
Figure 6.2. Two consecutive orbital motions with independent rotations: The
camera optical axis always pointing to the center of the globe. Even if all pairwise
fundamental matrices among the three views are considered, one only gets at
most 1 + 1 + 2 = 4 effective constraints on the camera intrinsic matrix if one
uses the three unnormalized matrix Kruppa equations. At least one more view
is need if one wishes to uniquely determine the unknown calibration. However,
using renormalization instead, we may get back to 2 + 2 + 2 5 constraints from
only the given three views.
135
T = 4 T 0 T 0 .
(6.74)
2
2
1
.
(6.75)
2 = 1T 1 = 2T 1 = 1T 1 =
2 2
1 1
1 2
T 0 T T 0
Is the last equation algebraically independent of the two Kruppa equations?
Although it seems to be quite different from Kruppas equations, it is in
fact dependent on them. This can be shown either numerically or using
simple algebraic tools such as Maple. Thus, it appears that our effort to
look for extra independent constraints on A from the fundamental matrix
has failed.8 In the following, we will give an explanation to this by showing
that not all which satisfy the Kruppa equations may give valid Euclidean
reconstructions of both the camera motion and scene structure. The extra
8 Nevertheless, extra implicit constraints on A may still be obtained from other algebraic facts. For example, the so-called modulus constraints give three implicit constraints
on A by introducing three extra unknowns, for more details see [PG99].
136
(6.76)
(6.77)
137
L3
o3
(R2, T2)
p
L2
o1
o2
(R1, T1 )
L1
Figure 6.3. A camera undergoes two motions (R1 , T1 ) and (R2 , T2 ) observing a
rig consisting of three straight lines L1 , L2 , L3 . Then the camera calibration is
uniquely determined as long as R1 and R2 have independent rotation axes and
rotation angles in (0, ), regardless of T1 , T2 . This is because, for any invalid
solution A, the associated plane N (see the proof of Theorem 6.5) must intersect
the three lines at some point, say p. Then the reconstructed depth of point p with
respect to the solution A would be infinite (points beyond the plane N would
have negative recovered depth). This gives us a criteria to exclude all such invalid
solutions.
two constraints from one fundamental matrix even in the two special cases
when Kruppas equations can be renormalized extra ones are imposed by
the structure, not the motion. The theorem also resolves the discrepancy
between Kruppas equations and the necessary and sufficient condition for
a unique calibration: Kruppas equations, although convenient to use, do
not provide sufficient conditions for a valid calibration which allows a valid
138
So far, we have mainly considered camera self-calibration when the motion of the camera is discrete positions of the camera are specified as
discrete points in SE(3). In this section, we study its continuous version.
T
(6.78)
where X0 = AX.
Uncalibrated continuous epipolar geometry
By the general case we mean that both the angular and linear velocities
0 = x
+ x.
Then (6.78)
and v are non-zero. Note that X0 = x yields X
gives an uncalibrated version of the continuous epipolar constraint
x T AT vbA1 x + xT AT
b vbA1 x = 0
(6.79)
This will still be called the continuous epipolar constraint. As before, let
s R33 to be s = 12 (b
vb + vb
b ). Define the continuous fundamental matrix
F R63 to be:
T 1
A vbA
F =
.
(6.80)
AT sA1
9 Some possible ways of harnessing the constraints provided by chirality have been
discussed in [Har98]. Basically they give inequality constraints on the possible solutions
of the calibration.
10 This section may be skipped at a first reading
139
c repeatedly, we obtain
this property of the hat operator ()
1 T
vb + vb
b )A1
A (b
2
1 T
1
=
b AT vb0 + vb0 Ab
(A
A1 ) = (b0 1 vb0 + vb0 1 b0 ).
2
2
Then the continuous epipolar constraint (6.79) is equivalent to
AT sA1
1
x T vb0 x + xT (b0 1 vb0 + vb0 1 b0 )x = 0.
(6.81)
2
Suppose 1 = BB T for another B SL(3), then A = BR0 for some
R0 SO(3). We have
1
x T vb0 x + xT (b0 1 vb0 + vb0 1 b0 )x = 0
2
T b0
T 1 b0
x v x + x ( BB T vb0 + vb0 BB T b0 )x = 0
2
1
d
d
d 1 x = 0.
x T B T R
vB
x + xT B T R
0
0 R0 vB
(6.82)
Comparing to (6.79), one cannot tell the camera A with motion (, v) from
the camera B with motion (R0 , R0 v). Thus, like the discrete case, without
knowing the camera motion the calibration can only be recovered in the
space SL(3)/SO(3), i.e. only the symmetric matrix 1 hence can be
recovered.
However, unlike the discrete case, the matrix cannot be fully recovered
in the continuous case. Since 1 = AAT is a symmetric matrix, it can be
diagonalized as
1 = R1T R1 ,
R1 SO(3)
00
(6.83)
00
1 b0 1 b0 b0 1 b0
1 c00 c00 c00 c00
( v + v ) = R1T (
v + v )R1 .
(6.84)
2
2
Thus the continuous epipolar constraint (6.79) is also equivalent to:
1 c00 c00 c00 c00
T vc00 (R1 x) + (R1 x)T (
v + v )(R1 x) = 0.
(6.85)
(R1 x)
2
From this equation, one can see that there is no way to tell a camera
A with AAT = R1T R1 from a camera B = R1 A. Therefore, only the
diagonal matrix can be recovered as camera parameters if both the scene
structure and camera motion are unknown. Note that is in SL(3) hence
1 2 3 = 1. The singular values only have two degrees of freedom. Hence
we have:
Theorem 6.6 (Singular values of calibration matrix from continuous motion). Consider a camera with an unknown calibration matrix
140
A SL(3) undergoing an unknown motion (, v). Then only the eigenvalues of AAT , i.e. singular values of A, can be recovered from the bilinear
continuous epipolar constraint.
If we define that two matrices in SL(3) are equivalent if and only if
they have the same singular values. The intrinsic parameter space is then
reduced to the space SL(3)/ where represents this equivalence relation.
The fact that only two camera parameters can be recovered was known to
Brooks et al. [BCB97]. They have also shown how to do self-calibration for
matrices A with only two unknown parameters using the above continuous
method.
Comment 6.5. It is a little surprising to see that the discrete and continuous cases have some significant difference for the first time, especially
knowing that in the calibrated case these two cases have almost exactly parallel sets of theory and algorithms. We believe that this has to do with the
map:
A : R33
B
R33
7 ABAT
141
(6.87)
C33
7 CX + XC T
(6.88)
If 6= 0, the eigenvalues of
b have the form 0, i, i with R. Let
the corresponding eigenvectors are , u, u
C3 . According to Callier and
Desoer [CD91], the kernel of the map L has three dimensions and is given
by:
Ker(L) = span{X1 = A AT , X2 = Auu AT , X3 = A
uu
AT }. (6.89)
As in the discrete case, the symmetric real X is of the form X = X1 +
(X2 + X3 ) for some , R. That is the symmetric real kernel of L is
only two-dimensional. We denote this space as SRKer(L). We thus have:
Lemma 6.5. Given a matrix C = Ab
A1 with kk = 1, the symmetric real kernel associated with the Lyapunov map L : CX + XC T is
two-dimensional.
Similar to the discrete case we have:
Theorem 6.7 (Two continuous rotational motions determine calibration). Given matrices Ci := Ab
i A1 R33 , i = 1, . . . , n with
ki k = 1. The symmetric matrix = AT A1 SL(3) is uniquely determined if and only if at least two of the n vectors i , i = 1, . . . , n are linearly
independent.
11 If
142
(6.90)
since Tb0 (T 0 vT ) = 0.
Below we demonstrate that these possible decompositions of F yield projection matrices, which are related to each other by non-singular matrices
H GL(4). The following theorem is due to Hartley [HZ00]. Without loss
of generality we assume that the projection matrix associated with the first
camera is P1 = [I, 0].
Theorem 6.8 (Projective reconstruction). Let F be the fundamental matrix and (P1 , P2 ) and (P1 , P2 ) are two possible pairs of projection
matrices, yielding the same fundamental matrix F . Then there exists a
nonsingular transformation H GL(4) such that P1 = H P1 and P2 =
H P2 .
Suppose that there are two possible decompositions of F such that F =
b
b
b
aA and F = bB,
then b = a and B = 1 (A + avT ). Since b
aA = bB,
then b
a(B A) = 0, so B A = avT , hence B = 1 (A + avT ). What
remains to be verified that given the projections matrices P = [A, a] and
P = [1 (A + avT ), a] they are indeed related by the matrix H, such that
P H = P , with H:
H=
1 I
1 vT
(6.91)
143
Canonical decomposition of F
The previous argument demonstrated that there is an entire 4 parameter
family of transformations parameterized by (, v), consistent with F . In order to be able to proceed with the reconstruction we need a systematic way
to choose a representative from this family. Such a canonical decomposition
of F consistent with a set of measurements has the following form:
P1 = [I, 0] and P2 = [Tb0 F, T 0 ].
(6.94)
1T
y1i (p1T
3 Xp ) = p 2 Xp
2T
xi2 (p2T
3 Xp ) = p 1 Xp
2T
y2i (p2T
3 Xp ) = p 2 Xp
where P1 = [p11 , p12 , p13 ]T and P2 = [p21 , p22 , p23 ]T are the projection matrices
described in terms of their rows. Structure can be then recovered as a linear
least squares solution to the system of homogeneous equations
AXp = 0
144
Figure 6.4. The true Euclidean Structure X of the scene and the projective
structure Xp obtained by the algorithm described above.
(6.95)
This can be done in several steps as the remainder of this chapter outlines.
The projective transformation in equation (6.91) characterizes the ambiguity relating the choice of projection matrices P and P . Even when we can
exactly decompose F into parts related to rotation ARA1 and translation
AT and use it for reconstruction, the structure obtained in such way, is related to the original Euclidean structure by an unknown matrix A. Hence
the projective transformation H upgrading projective structure Xp to Euclidean structure X has to resolve all these unknowns. For this process it
is instrumental to consider following decomposition of the transformation
H:
1
sR T
A
0
I 0
H = H e Ha Hp =
(6.96)
0 1
0
1
vT 1
The first part He depends simply on the choice of coordinate system and
scale a factor, the second part Ha corresponds to so called affine upgrade,
and is directly related to the knowledge of intrinsic parameters of the camera and the third transformation is so called projective upgrade Hp , which
transforms points lying on a particular plane [vT 1]Xp = 0 to the points
at infinity. This basic observation sets the stage for the so called stratified
approach to Euclidean reconstruction and self-calibration. In such setting
the projective structure Xp is computed first, followed by computation of
Hp and Ha and gradual upgrade of Xp to to Xa and xX in the following
145
way:
X = H a Xa
and Xa = Hp Xp
146
X1p , X2p , X3p obtained in the projective reconstruction step, we can compute
the unknown vector v, which corresponds to the plane in projective space,
where all vanishing points lie.
[v, 1]T Xip = 0
Once v has been determined the projective upgrade Hp is defined and the
projective structure Xp can be upgraded to Xa .
I 0
Xa =
Xp
(6.98)
vT 1
Hence the projective upgrade can be computed when images of at least
three vanishing points are available.
The situation is not completely lost, in case the above structure constraints
are not available. Pollefeys in [?] proposed a technique, where additional so
called modulus constraints on v are imposed, which capture the fact that
v, needs to be chosen in a way that A + T 0 vT is a rotation matrix. This
approach requires a solution to a set of fourth order polynomial equations
and multiple views (four) are necessary, for the solution to be well conditioned. In Hartley [?] previously introduced chirality constraints are used
to determine possible range of values for v, where the optimum is achieved
using linear programming approach.
Direct affine reconstruction
There are several ways how to achieve affine reconstruction Xa , which will
bring is in the context of stratified approach one step closer to the Euclidean
reconstruction. One commonly used assumption in order to obtain affine
reconstruction is to assume scaled orthographic camera projection model [?,
TK92].
When general perspective projection model is used and the motion is
pure translation the affine structure can be recovered directly.
147
Pure translation
From two views related by pure translation only, we have:
2 Ax2
2 x02
= 1 Ax1 + AT
= 1 x01 + T 0
(6.99)
Once T 0 has been computed, note that the unknown scales are now linear functions of known quantities and can be computed directly from the
equation (6.100) as
1 =
b02 T
(b
x02 x01 )T x
0
0
kb
x2 x1 k 2
One can easily verified that the structure Xa obtained in this manner is
indeed related to the Euclidean structure X by some unknown general affine
transformation:
1
A
0
sR T
A T
=
(6.100)
Ha =
0
1
0 1
0 1
= M 1 Xa + m 1
= M 2 Xa + m 2
(6.101)
= M 1 x01 + m
(6.102)
= 1 ARA1 x01 + T 0
(6.103)
148
More details
0. More views are needed 1. Compute S 2. Cholesky 3. Get A.
Structure constraints
Natural additional constraints we can employ in order to solve for the unknown metric S are constraints on the structure; in particular distances
between points and angles between lines. Angles are distances are directly
related to notion of the inner product in the uncalibrated space and hence
when known, can be used to solve for S. Commonly encountered constraint
in man-made environments are orthogonality constraints between sets of
lines in 3D. These are reflected in the image plane in terms of angles between vanishing directions, for example v1 , v2 in the image plane, which
correspond to a set of orthogonal lines in 3D have to satisfy the following
constraint:
v1T S 1 v2 = 0
(6.104)
1
since the inner product for orthogonal lines is 0. Note since S is a symmetric matrix with six degrees of freedom, we need at least five pairs of
perpendicular lines to solve for S 1 .
More details
Point of difference between S, S 1 etc, how to get A.
Constraints on A
Computation of S 1 and consequently A can be simplified when some of
the intrinsic parameters are known. The most commonly used assumptions
are zero skew and known aspect ratios. In such case one can demonstrate
that S 1 can be parameterized by fewer then six parameters and hence
smaller number of constraints is necessary.
6.7 Summary
6.8 Exercises
1. Lyapunov maps
Let {}ni=1 be the (distinct) eigenvalues of A Cnn . Find all the n2
eigenvalues and eigenvectors of the linear maps:
(a) L1 : X Cnn 7 AT P P A Cnn .
(b) L2 : P Cnn 7 AT P A P Cnn .
6.8. Exercises
149
Note that the first identity showed in the proof of Theorem 3.3 (in
handout 7); the second identity in fact can be used to derive the socalled modulus constraints on calibration matrix A (See Hartley and
Zissermann, page 458). (Hint: (a) is easy; for (b), the key step is to
show that the matrix on the left hand side is of rank 1, for which you
T
c0 T
c0 is a projection matrix.)
may need to use the fact that T
R SO(3), S T = S R33 ,
prove Lemma 7.3.1 for the case that rotation angle of R is between
0 and . What are the two typical solutions for S? (Hint: You may
need eigenvectors of the rotation matrix R. Properties of Lyapunov
map from the previous homework should give you enough ideas how
to find all possible S from eigenvectors (constructed from those of
R) of the Lyapunov map L : S 7 S RSRT .) Since a symmetric
real matrix S is usually used to represent a conic xT Sx = 1, S which
satisfies the above equation obviously gives a conic that is invariant
under the rotation R.
Part III
Geometry of multiple
views
Chapter 7
Introduction to multiple view
reconstruction
154
155
.
Then the (perspective) image x(t) = [x(t), y(t), z(t)]T R3 of p, taken by
a moving camera at time t satisfies the relationship
(t)x(t) = A(t)P g(t)X,
(7.1)
(7.3)
PSfrag replacements
L
p
x1
lb1
V
po
lb2
l2
x2
o2
(R, T )
o1
l1
Figure 7.1. Images of a point p on a line L. Planes extended from the images
lb1 , lb2 should intersect at the line L in 3-D. Lines extended from the image points
x1 , x2 intersect at the point p. l1 , l2 are the two co-images of the line.
156
(7.5)
33
7.2. Preliminaries
157
b R33 ,
x
l R3 .
(7.6)
xi = x(ti ),
li = l(ti ),
i = A(ti )P g(ti ).
(7.7)
We will call the matrix i the projection matrix relative to the ith camera
frame. The matrix i is then a 3 4 matrix which relates the ith image of
the point p to its world coordinates X by:
i x i = i X
and the i
th
(7.8)
(7.9)
(7.10)
33
7.2 Preliminaries
We first observe that in above equations the unknowns, i s, X, Xo and V,
which encode the information about location of the point p or the line L are
not intrinsically available from the images. By eliminating these unknowns
from the equations we obtain the remaining intrinsic relationships between
x, l and only, i.e. between the image measurements and the camera configuration. Of course there are many different, but algebraically equivalent
ways for elimination of these unknowns. This has in fact resulted in different kinds (or forms) of multilinear (or multifocal) constraints that exist in
the computer vision literature. Our goal here should be a more systematic
158
way of eliminating all the above unknowns that results in a complete set of
conditions and a clear geometric characterization of all constraints.
To this end, we can re-write the above equations in matrix form as
1
1
x1 0
0
0 x2
0
2 2
(7.11)
=
..
. X
.
..
..
..
.
.
.
. .. ..
0
x1 0
0 x2
.
I = .
..
..
..
.
.
0
becomes
xm
0
0
..
.
xm
.
~ =
1
2
..
.
m
I ~ = X.
.
=
1
2
..
.
m
(7.12)
(7.13)
which is just a matrix version of (7.8). For obvious reasons, we call ~ R3m
the depth scale vector, and R3m4 the projection matrix associated
to the image matrix I R3mm . The reader will notice that, with the
exception of xi s, everything else in this equation is unknown! Solving for
the depth scales and the projection matrices directly from such equations is
by no means straightforward. Like we did in the two-view case, therefore,
we will decouple the recovery of the camera configuration matrices i s
from recovery of scene structure, i s and X.
In order for equation (7.13) to be satisfied, it is necessary for the columns
of I and to be linearly dependent. In other words, the matrix
1 x1 0
0
..
2 0 x 2 . . .
.
.
R3m(m+4)
Np = [, I] = .
(7.14)
.
.
.
..
..
..
..
0
m 0
0 xm
must have a non-trivial null-space. Hence
rank(Np ) m + 3.
(7.15)
.
In fact, from equation (7.13) it is immediate to see that the vector u =
[XT , ~T ]T Rm+4 is in the null space of the matrix Np , i.e. Np u = 0.
Similarly, for line features the following matrix:
T
l1 1
T
. l2 2
(7.16)
Nl = . Rm4
..
lTm m
159
(7.17)
since it is clear that the vectors Xo and V are both in the null space of
the matrix Nl due to (7.9). In fact, any X R4 in the null space of Nl
represents the homogeneous coordinates of some point lying on the line L,
and vice versa.
The above rank conditions on Np and Nl are merely the starting point.
There is some difficulty in using them directly since: 1. the lower bounds
on their rank are not yet clear; 2. their dimensions are high and hence the
above rank conditions contain a lot of redundancy. This apparently obvious
observation is the basis for the characterization of all constraints among
corresponding point and line in multiple views. To put our future development in historic context, in the rest of this chapter we will concentrated on
only the pairwise and triple-wise images.
(7.19)
for some matrix F R33 . In our case, the entries of F are the 44 minors
of the matrix
1
=
R64 .
2
For instance, for the case of two calibrated cameras, the matrix F is exactly
the essential matrix.2 To see this, without loss of generality, let us assume
2 In the tensorial jargon, the epipolar constraint is sometimes referred to as the bilinear
constraint, and the essential or fundamental matrix F as the bifocal tensor. Mathematically, F can be interpreted as a covariant 2-tensor whose contraction with x 1 and x2 is
zero.
160
that the first camera frame coincides with the world frame and the camera
motion between the two views is (R, T ). Then we have
1 = [I, 0],
2 = [R, T ]
R34 .
0
Rx1
(7.20)
0
x2
For the last step, we use the fact from linear algebra that det[v1 , v2 , v3 ] =
v1T (v2 v3 ). This yields
Therefore, we obtain
det(Np ) = 0
xT2 TbRx1 = 0
(7.21)
which says that, for the two-view case, the rank condition (7.15) is equivalent to the epipolar constraint. The advantage of the rank condition is
that it generalizes to multiple views in a straightforward manner, unlike
the epipolar constraint. A similar derivation can be followed for the uncalibrated case by just allowing R R3 to be general and not a rotation
matrix. The resulting matrix F simply becomes the fundamental matrix.
However, for two (co-)images l1 , l2 of a line L, the above condition (7.17)
for the matrix Nl becomes
T
l
rank(Nl ) = rank 1T 1 2.
(7.22)
l2 2
But here Nl is a 2 4 matrix which automatically has a rank less and
equal to 2. Hence there is essential no intrinsic constraints on two images
of a line. This is not so surprising since for arbitrarily given two vectors
l1 , l2 R3 , they represent two planes relative to the two camera frames
respectively. These two planes will always intersect at some line in 3-D,
and l1 , l2 can be interpreted as two images of this line. A natural question
hence arises: are there any non-trivial constraints for images of a line in
more than 2 views? The answer will be yes which to some extent legitimizes
the study of triple-wise images. Another advantage from having more than
2 views is to avoid some obvious degeneracy inherent in the pairwise view
geometry. As demonstrated later in Figure 8.2, there are cases when all
image points satisfy pairwise epipolar constraint but they do not correspond
to a unique 3-D point. But as we will establish rigorously in the next
chapter, constraints among triple-wise views can eliminate such degenerate
cases very effectively.
In the next section, for the purpose of example, we show how to derive
all multilinear constraints between corresponding images in m = 3 views.
161
We however do not pursue this direction any further for m 4 views, since
we will see in Chapters 8, 9, and 10 that there are no more multilinear
constraints among more than 4 views! However, a much more subtle picture
of the constraints among multiple images of points and lines cannot be
revealed untill we study those more general cases.
1 x1 0
0
(7.23)
Np = 2 0 x2 0 R97
3 0
0 x3
then has more rows than columns. Therefore, the rank condition (7.15)
implies that all 7 7 sub-matrices are singular. As we explore in the exercises, the only irreducible minor of Np must contain at least two rows from
each image. For example, if we choose all three rows of the first image and
the {1, 2}th rows from the second and the {1, 2}th rows from the third, we
obtain a 7 7 sub-matrix of Np as
1 x1
0
0
12 0 x21
0
2
R77
N033 =
0
x
0
(7.24)
22
12
3 0
0 x31
23 0
0 x32
where the subscript rst of Nrst indicates that the r th , sth and tth rows
of images 1, 2 and 3 respectively are omitted from the original matrix Np
when Nrst is formed. The rank condition (7.15) implies that
det(N033 ) = 0
(7.25)
2 = [R2 , T2 ],
I
0
Np = R 2 T 2
R 3 T3
3 = [R3 , T3 ]
R34 ,
33
(7.26)
x1 0
0
0 x2 0 R97 .
(7.27)
0
0 x3
162
This matrix should satisfy the rank condition. After a column manipulation
(which eliminates x1 from the first row), it is easy to see that Np has the
same rank as the matrix
I
0
0
0
0
(7.28)
Np0 = R2 T2 R2 x1 x2 0 R97 .
R 3 T3 R 3 x1 0 x 3
T
x2
0
x
c2
0
86
D=
0 xT3 R
c3
0 x
by Np00 yields
xT2 T2
x
c2 T2
DNp00 =
xT3 T3
c3 T3
x
xT2 R2 x1
c2 R2 x1
x
xT3 R3 x1
c3 R3 x1
x
xT2 x2
0
0
0
0
0
xT3 x3
0
(7.29)
(7.30)
R84 .
(7.31)
rank(Np00 ) = rank(DNp00 ).
Hence the original matrix Np is rank deficient if and only if the following
sub-matrix of DNp00
c2 R2 x1
c2 T2 x
x
R62
(7.32)
Mp =
c3 R3 x1
c3 T3 x
x
a 1 b1
..
..
.
.
an
bn
vectors ai , bi R3 , i = 1, . . . n, the
3n2
R
(7.33)
(7.34)
163
Here Nl is a 34 matrix and the above rank condition is no longer void and
it indeed imposes certain non-trivial constraints on the quantities involved.
In the case we choose the first view to be the reference, i.e. 1 = [I, 0], we
have
T T
l1
0
l1 1
lT2 2 = lT2 R2 lT2 T2 .
(7.36)
lT3 R3 lT2 T3
lT3 3
T
l1 l1
0
0
l1
0
b
lT2 R2 lT2 T2 l1 l1 0 = lT2 R2 l1 lT2 R2 lb1 lT2 T2 .
(7.37)
0 0 1
lT3 R3 lT3 T3
lT3 R3 l1 lT3 R3 lb1 lT3 T3
(7.38)
has a rank less and equal to 1. That is the two row vectors of Ml are linearly
dependent. In case lT2 T2 , lT3 T3 are non-zero, this linear dependency can be
written as
lT3 (T3 lT2 R2 R3 lT2 T2 )lb1 = 0.
(7.39)
164
(7.40)
where l2 , l3 are respectively co-images of two lines that pass through the
same point whose image in the first view is x1 . Here, l2 and l3 do not have
to correspond to the coimages of the same line in 3-D. This is illustrated
in Figure 7.2.
L1
L2
PSfrag replacements
l3
o3
x1
o1
l2
o2
The above trilinear relation (7.40) plays a similar role for triple-wise view
geometry as the fundamental matrix for pairwise view geometry. 3 Here
the fundamental matrix must be replaced by the notion of trifocal tensor,
traditionally denoted as T . Like the fundamental matrix, the trifocal tensor
depends (nonlinearly) on the motion paramters R and T s, but in a more
complicated way. The above trilinear relation can then be written formally
3 A more detailed study of the relationship between the trilinear constraints and the
bilinear epipolar constraints is given in the next Chapter.
7.5. Summary
165
3,3,3
X
i,j,k=1
(7.41)
Intuitively, the tensor T consists of 333 entries since the above trilinear
relation has as many coefficients to be determined. Like the fundamental
matrix which only has 7 free parameters (det(F ) = 0), there are in fact only
18 free parameters in T . If one is only interested in recovering T linearly
from triple-wise images as did we for the fundamental matrix from pairwise
ones, by ignoring their internal dependency, we need at least 26 equations
of the above type (7.41) to recover all the coefficients T (i, j, k)s up to a
scale. We hence needs at least 26 the triplets (x1 , l2 , l3 ) with correspondence
specified by Figure 7.2. We leave as exercise at least how many point or line
correspondences across three views are needed in order to have 26 linearly
independent equations for solving T .
7.5 Summary
7.6 Exercises
1. Image and co-image of points and lines
Suppose p1 , p2 are two points on the line L; and L1 , L2 are two lines
intersecting at the point p. Let x, x1 , x2 be the images of the points
p, p1 , p2 respectively and l, l1 , l2 be the co-images of the lines L, L1 , L2
respectively.
(a) Show that for some , R:
c1 x2 ,
l = x
x = lb1 l2 .
bv,
l2 = x
x1 = blr,
x2 = bls.
(c) Please draw a picture and convince yourself about the above
relationships.
2. Linear estimating of the trifocal tensor
(a) Show that the corresponding three images x1 , x2 , x3 of a point
p in 3-D give rise to 4 linearly independent constraints of the
type (7.41).
(b) Show that the corresponding three co-images l1 , l2 , l3 of a line L
in 3-D give rise to 2 linearly independent constraints of the type
(7.41).
166
(c) Conclude that in general one needs n points and m lines across
three views with 4n + 2m 26 to recover the trifocal tensor T
up to a scale.
3. Multi-linear function
A function f (, ) : Rn Rn R is called bilinear if f (x, y) is linear
in x Rn if y Rn is fixed, and vice versa. That is,
f (x1 + x2 , y) = f (x1 , y) + f (x2 , y),
f (x, y1 + y2 ) = f (x, y1 ) + f (x, y2 ),
for any x, x1 , x2 , y, y1 , y2 Rn and , R. Show that any bilinear
function f (x, y) can be written as
f (x, y) = xT M y
for some matrix M Rnn . Notice that in the epipolar constraint
equation
xT2 F x1 = 0,
the left hand side is a bilinear form in x1 , x2 R3 . Hence epipolar
constraint is also sometimes referred as bilinear constraint. Similarly
we can define trilinear or quadrilinear functions. In the most general
case, we can define a multi-linear function f (x1 , . . . , xi , . . . , xm ) which
is linear in each xi Rn if all other xj s with j 6= i are fixed. Such
a function is called m-linear. In mathematics, another name for such
multi-linear object is tensor. Try to describe, while we can use matrix to represent a bilinear function, how to represent a multi-linear
function then?
4. Minors of a matrix
Suppose M is a mn matrix with m n. Then M is a rectangular
matrix with m rows and n columns. We can pick arbitrary n (say
th
ith
1 , . . . , in ) distinctive rows of M and form a n n sub-matrix of
M . If we denote such n n sub-matrix as Mi1 ,...,in , its determinant
det(Mi1 ,...,in ) is called a minor of M . Prove that the n column vectors
are linearly dependent if and only if all the minors of M are identically
zero, i.e.
det(Mi1 ,...,in ) = 0,
i1 , i2 , . . . , in [1, m].
By the way, totally how many different minors can you possibly gotten
there? (Hint: For the if and only if problem, the only if part is
easy; for the if part, proof by induction makes it easier.) This fact
was used to derive the traditional multi-linear constraints in some
computer vision literature.
5. Linear dependency of two vectors
Given any four non-zero vectors a1 , . . . , an , b1 , . . . , bn R3 , the
7.6. Exercises
following matrix
a1
..
.
an
b1
.. R3n2
.
bn
167
(7.42)
Chapter 8
Geometry and reconstruction from
point features
i = 1, . . . , m
(8.1)
169
.
time ti . We call i = A(ti )P g(ti ). Collecting all the indices, we can write
(8.1) in matrix form as
x1
0
..
.
0
x2
..
.
..
.
0
0
..
.
xm
I ~
1
2
..
.
X
1
2
..
.
m
X.
(8.2)
1 x1 0
0
..
2 0 x 2 . . .
.
.
Np = [, I] = .
(8.3)
.
.
.
..
..
..
..
0
m 0
0 xm
(8.4)
Remark 8.1 (Null space of Np ). Even though equations (8.2) and (8.4)
are equivalent, if rank(Np ) m + 2 then equation Np v = 0 has more
than one solution. In the next section, we will show that rank(Np ) is either
m + 2 or m + 3 and that the first case happens if and only if the point being
observed and all the camera centers lie on the same line.
Remark 8.2 (Positive depth). Even if rank(Np ) = m + 3, there is no
guarantee that ~ in (8.2) will have positive entries.1 In practice, if the point
being observed is always in front of the camera, then I, and X in (8.2)
will be such that the entries of ~ are positive. Since the solution to Np v = 0
.
is unique and v = [XT , ~T ]T is a solution, then the last m entries of v
have to be of the same sign.
It is a basic fact of linear algebra that a matrix being rank-deficient can be
expressed in terms of the determinants of certain submatrices being zero.
1 In a projective setting, this is not a problem since all points on the same line through
the camera center are equivalent. But it does matter if we want to choose appropriate
reference frames such that the depth for the recovered 3-D point is physically meaningful,
i.e. positive.
170
(8.5)
R34 ,
(8.6)
I 0
0
0 0
0
I
R2
R2 T2 R2 x1 x2 . . . ..
=
.
.
0
. .
.
.
.
N
..
p
.
.. ..
0 .. 0
Rm
R m Tm R m x1 0 0 x m
2 Depending on the context, the reference frame could be either an Euclidean, affine or
projective reference frame. In any case, the projection matrix for the first image becomes
[I, 0] R34 .
171
T
x2 0
0
x
0
c2 0
.. . .
.. R4(m1)3(m1)
.
..
(8.7)
Dp = .
.
.
T
0
0 xm
0
0 xc
m
x2 T2 xT2 R2 x1 xT2 x2 0
0
0
x
c2 R2 x1
x
0
0
0
0
c2 T2
..
..
.
.
.
.
.
0
0
0
.
.
..
..
..
.
.
0
0
0
T
xT Tm xT R m x1
0
0
0
x
x
m
m
m m
c
xc
T
x
R
x
0
0
0
0
m m
m m 1
Since Dp has full rank 3(m 1), we have rank(Np0 ) = rank(Dp Np0 ). Hence
the original matrix Np is rank-deficient if and only if the following submatrix of Dp Np0 is rank-deficient
c2 T2
c2 R2 x1
x
x
x
c3 T3
x
c3 R3 x1
Mp =
(8.8)
R3(m1)2 .
..
..
.
.
xc
xc
m R m x1
m Tm
(8.9)
172
Due to the rank equality (8.9) for Mp and Np , we conclude that the rankdeficiency of Mp is equivalent to all multi-linear constraints that may arise
among the m images. To see this more explicitly, notice that for Mp to be
rank-deficient, it is necessary for the any pair of the vectors xbi Ti , xbi Ri x1 to
be linearly dependent. This gives us the well-known bilinear (or epipolar)
constraints
xTi Tbi Ri x1 = 0.
(8.10)
(8.11)
between the ith and 1st images. Hence the constraint rank(Mp ) 1 consistently generalizes the epipolar constraint (for 2 views) to arbitrary m
views. Using Lemma 7.1. we can conclude that
173
c2 R2 x1 = 0, xT3 T
c3 R3 x1 = 0,
xT2 T
T T
c2 (T2 x1 R3 R2 x1 T3T )c
x
x3 = 0.
Now the inverse problem: If the three vectors x1 , x2 , x3 satisfy either the
bilinear constraints or trilinear constraints, are they necessarily images of
some single point in 3-D, the so-called pre-image?
Let us first study whether bilinear constraints are sufficient to determine
a unique pre-image in 3-D. For the given three vectors x1 , x2 , x3 , suppose
that they satisfy three pairwise epipolar constraints
xT2 F21 x1 = 0,
xT3 F31 x1 = 0,
xT3 F32 x2 = 0,
(8.12)
th
with Fij = Tc
and j th images.
ij Rij the fundamental matrix between the i
Note that each image (as a point on the image plane) and the corresponding optical center uniquely determine a line in 3-D that passes through
174
PSfrag replacements
x3
x1
o3
x2
o1
o2
Figure 8.1. Three rays extended from the three images x1 , x2 , x3 intersect at one
point p in 3-D, the pre-image of x1 , x2 , x3 .
them. This gives us a total of three lines. Geometrically, the three epipolar constraints simply imply that each pair of the three lines are coplanar.
So when do three pairwise coplanar lines intersect at exactly one point
in 3-D? If these three lines are not coplanar, the intersection is uniquely
determined, so is the pre-image. If all of them do lie on the same plane,
such a unique intersection is not always guaranteed. As shown in Figure
8.2, this may occur when the lines determined by the images lie on the
plane spanned by the three optical centers o1 , o2 , o3 , the so-called trifocal
plane, or when the three optical centers lie on a straight line regardless of
the images, the so-called rectilinear motion. The first case is of less pracPSfrag replacements
o3
p?
x3
o1
x1
x2
o2
p?
x1
o1
x2
o2
x3
o3
Figure 8.2. Two cases when the three lines determined by the three images
x1 , x2 , x3 lie on the same plane, in which case they may not necessarily intersect
at a unique point p.
tical effect since 3-D points generically do not lie on the trifocal plane.
The second case is more important: regardless of what 3-D feature points
one chooses, pairwise epipolar constraints alone do not provide sufficient
constraints to determine a unique 3-D point from any given three image
vectors. In such a case, extra constraints need to be imposed on the three
images in order to obtain a unique pre-image.
175
That is
(8.13)
If T3 and R3 x1 are linearly independent, then (8.13) holds if and only if the
vectors R2 x1 , T2 , x2 are linearly dependent. This condition simply means
that the line associated to the first image x1 coincide with the line determined by the optical centers o1 , o2 .5 If T3 and R3 x1 are linearly dependent,
x3 is determinable since R3 x1 lies on the line determined by the optical centers o1 , o3 . Hence we have shown, that x3 cannot be uniquely determined
from x1 , x2 by the trilinear constraint if and only if
c2 R2 x1 = 0,
T
and
c3 R3 x1 = 0,
T
and
c2 x2 = 0.
T
(8.14)
c3 x3 = 0.
T
(8.15)
We still need to show that these three images indeed determine a unique
pre-image in 3-D if either one of the images can be determined from the
other two by the trilinear constraint. This is obvious. Without loss of generality, suppose it is x3 that can be uniquely determined from x1 and x2 .
Simply take the intersection p0 E3 of the two lines associated to the first
two images and project it back to the third image plane such intersection
exists since the two images satisfy epipolar constraint.6 Call this image
x03 . Then x03 automatically satisfies the trilinear constraint. Hence x03 = x3
due to its uniqueness. Therefore, p0 is the 3-D point p where all the three
lines intersect in the first place. As we have argued before, the trilinear
constraint (8.11) actually implies bilinear constraint (8.10). Therefore, the
3-D pre-image p is uniquely determined if either x3 can be determined from
5 In other words, the pre-image point p lies on the epipole between the first and second
camera frames.
6 If these two lines are parallel, we take the intersection in the plane at infinity.
176
i, j = 1, 2, 3,
i, j, k = 1, 2, 3,
then they determine a unique pre-image p E3 except when the three lines
associated to the three images are collinear.
The two cases (which are essentially one case) in which the bilinear conPSfrag replacements
straints may become degenerate are shown in Figure 8.2. Figure 8.3 shows
the only case in which trilinear constraint may become degenerate. In simp?
x1
o1
x2
o2
x3
o3
Figure 8.3. If the three images and the three optical centers lie on a straight line,
any point on this line is a valid pre-image that satisfies all the constraints.
ple terms, bilinear fails for coplanar and trilinear fails for collinear. For
more than 3 views, in order to check the uniqueness of the pre-image, one
needs to apply the above lemma to every pairwise or triple-wise views. The
possible number of combinations of degenerate cases make it very hard to
draw any consistent conclusion. However, in terms of the rank condition on
the multiple view matrix, Lemma 8.1 can be generalized to multiple views
in a much more concise and unified way:
Theorem 8.3 (Uniqueness of the pre-image). Given m vectors on the
image planes with respect to m camera frames, they correspond to the same
point in the 3-D space if the rank of the Mp matrix relative to any of the
camera frames is 1. If its rank is 0, the point is determined up to the line
where all the camera centers must lie on.
Hence both the largest and smallest singular values of Mp have meaningful geometric interpretation: the smallest being zero is necessary for
7 Although there seem to be total of nine possible (i, j, k), there are in fact only three
different trilinear constraints due to the symmetry in the trilinear constraint equation.
177
1
1
i = 2, . . . , m,
= 0.
(8.16)
(8.17)
So the coefficient that relates the two columns of the Mp matrix is simply
the depth of the point p in 3-D space with respect to the center of the
first camera frame (the reference).8 Hence, from the Mp matrix alone, we
may know the distance from the point p to the camera center o1 . One can
further prove that for any two points of the same distance from o1 , there
always exist a set of camera frames for each point such that their images
give exactly the same Mp matrix.9 Hence we can interpret Mp as a map
from a point in 3-D space to a scalar
Mp : p R3 7 d R+ ,
where d = kp o1 k. This map is certainly surjective but not injective.
Points that may give rise to the same Mp matrix hence lie on a sphere S2
centered around the camera center o1 . The above scalar d is exactly the
radius of the sphere. We may summarize our discussion into the following
theorem:
Theorem 8.4 (Geometry of the multiple view matrix Mp ). The
matrix Mp associated to a point p in 3-D maps this point to a unique
scalar d. This map is surjective but not injective and two points may give
rise to the same Mp matrix if and only if they are of the same distance to
the center o1 of the reference camera frame. That distance is given by the
coefficients which relate the two columns of Mp .
Therefore, knowing Mp (i.e. the distance d), we know that p must lie on
a sphere of radius d = kp o1 k. Given three camera frames, we can choose
either camera center as the reference, then we essentially have three Mp
matrix for each point. Each Mp matrix gives a sphere around each camera
center in which the point p should stay. The intersection of two such spheres
in 3-D space is generically a circle, as shown in Figure 8.4. Then one would
8 Here we implicitly assume that the image surface is a sphere with radius 1. If it is
a plane instead, statements below should be interpreted accordingly.
9 We omit the detail of the proof here for simplicity.
178
S2
PSfrag replacements
S1
o1
o2
that there are some profound relationships between the Mp matrix for a
point and that for a line.
8.4.1 Correspondence
Notice that Mp R3(m1)2 being rank-deficient is equivalent to the
determinant of MpT Mp R22 being zero
det(MpT Mp ) = 0.
(8.18)
179
R3 indeed satisfy all the constraints that m images of a single 3-D preimage should, we only need to test if the above determinant is zero. A
more numerically robust algorithm would be:
Algorithm 8.1 (Multiple view matching test). Suppose the projection matrix associated to m camera frames are given. Then for given m
vectors x1 , . . . , xm R3 ,
1. Compute the matrix Mp R3(m1)2 according to (8.8).
8.4.2 Reconstruction
Now suppose that m images xi1 , . . . , xim of n points pi , i = 1, . . . , n are
given and we want to use them to estimate the unknown projection matrix
. The rank condition of the Mp matrix can be written as
ci R xi
ci T
x
x
2
2
1
2
2
ci
ci
x T x
R3 xi1
3
i 3 3
= 0 R3(m1)1 ,
(8.19)
. +
..
.
.
i T
xc
m m
i R xi
xc
m m 1
for proper i R, i = 1, . . . n.
ci we obFrom (8.1) we have ij xij = i1 Rj xi1 + Tj . Multiplying by x
j
c
i
i
i
i
i
tain x (R x + T / ) = 0. Therefore, = 1/ can be interpreted
j
as the inverse of the depth of point pi with respect to the first frame.
The set of equations in (8.19) is equivalent to finding vectors R~j =
[r11 , r12 , r13 , r21 , r22 , r23 , r31 , r32 , r33 ]T R9 and T~j = Tj R3 , j =
2, . . . , m, such that
c1 x
c1 x1 T
1 x
1
j
j
2 c2 c2
xj xj x21 T T~j
T~j
3n
Pj ~ = .
(8.20)
R~ = 0 R ,
..
Rj
.
j
.
.
cnj xn1 T
cnj x
n x
where A B is the Kronecker product of A and B. It can be shown that if
i s are known, the matrix Pj is of rank 11 if more than n 6 points in
general position are given. In that case, the kernel of Pj is unique, and so
is (Rj , Tj ).
Euclidean reconstruction
For simplicity, we here assume that the camera is perfectly calibrated.
Therefore, is A(t) = I, Ri is a rotation matrix in SO(3) and Ti is a trans-
180
lation vector in R3 . Given the first two images of (at least) eight points in
general configuration, T2 R3 and R2 SO(3) can be estimated using the
classic eight point algorithm. The equation given by the first row in (8.19)
ci T = x
ci R xi , whose least squares solution up to scale (recall
implies i x
2 2
2 2 1
that T2 is recovered up to scale from the eight point algorithm) is given by
i =
ci T )T x
ci R xi
(x
2 2
2 2 1
,
c
i
||x T ||2
i = 1, . . . , n.
(8.21)
2 2
sign(det(Uj VjT ))
p
Tj R3 ,
3
det(Sj )
(8.22)
(8.23)
In the presence of noise, solving for i using only the first two frames may
not necessarily be the best thing to do. Nevertheless, this arbitrary choice
of i allows us to compute all the motions (Rj , Tj ), j = 2, . . . , m. Given all
the motions, the least squares solution for i from (8.19) is given by
Pm ci
Tc
i
i
j=2 (xj Tj ) xj Rj x1
i
, i = 1, . . . , n.
(8.24)
=
Pm ci
||x T ||2
j=2
j j
Note that these i s are the same as those in (8.21) if m = 2. One can
then recompute the motion given this new i s, until the error in is small
enough.
We then have the following linear algorithm for multiple view motion
and structure estimation:
Algorithm 8.2 (Multiple view six-eight point algorithm). Given
m images xi1 , . . . , xim of n points pi , i = 1, . . . , n, estimate the motions
(Rj , Tj ), j = 2, . . . , m as follows:
1. Initialization: k = 0
181
(a) Compute (R2 , T2 ) using the eight point algorithm for the first
two views.
(b) Compute ik = i /1 from (8.21).
j , Tj ) as the eigenvector associated to the smallest
2. Compute (R
singular value of Pj , j = 2, . . . , m.
3. Compute (Rj , Tj ) from (8.22) and (8.23) for j = 2, . . . , m.
4. Compute the new i = ik+1 from (8.24). Normalize so that 1k+1 =
1.
5. If ||k k+1 || > , for a pre-specified > 0, then k = k + 1 and goto
2. Else stop.
The camera motion is then (Rj , Tj ), j = 2, . . . , m and the structure of the
points (with respect to the first camera frame) is given by the converged
depth scalar i1 = 1/i , i = 1, . . . , n. There are a few notes for the proposed
algorithm:
1. It makes use of all multilinear constraints simultaneously for motion
and structure estimation. (Rj , Tj )s seem to be estimated using pairwise views only but that is not exactly true. The computation of the
matrix Pj depends on all i each of which is in turn estimated from
the Mpi matrix involving all views. The reason to set 1k+1 = 1 is to
fix the universal scale. It is equivalent to putting the first point at a
distance of 1 to the first camera center.
2. It can be used either in a batch fashion or a recursive one: initializes with two views, recursively estimates camera motion and
automatically updates scene structure when new data arrive.
3. It may effectively eliminate the effect of occluded feature points - if a
point is occluded in a particular image, simply drop the corresponding
group of three rows in the Mp matrix without affecting the condition
on its rank.
Remark 8.3 (Euclidean, affine or projective reconstruction). We
must point out that, although the algorithm seems to be proposed for the
calibrated camera (Euclidean) case only, it works just the same if the camera
is weakly calibrated (affine) or uncalibrated (projective). The only change
needed is at the initialization step. In order to initialize i s, either an
Euclidean, affine or projective choice for (R2 , T2 ) needs to be specified. This
is where our knowledge of the camera calibration enters the picture. In the
worst case, i.e. when we have no knowledge on the camera at all, any choice
of a projective frame will do. In that case, the relative transformation among
camera frames and the scene structure are only recovered up to a projective
transformation. Once that is done, the rest of the algorithm runs exactly
the same.
182
8.5 Experiments
In this section, we show by simulations the performance of the proposed
algorithm. We compare motion and structure estimates with those of the
eight point algorithm for two views and then we compare the estimates as
a function of the number of frames.
8.5.1 Setup
The simulation parameters are as follows: number of trials is 1000, number
of feature points is 20, number of frames is 3 or 4 (unless we vary it on
purpose), field of view is 90o , depth variation is from 100 to 400 units of
focal length and image size is 500 500. Camera motions are specified
by their translation and rotation axes. For example, between a pair of
frames, the symbol XY means that the translation is along the X-axis and
rotation is along the Y -axis. If n such symbols are connected by hyphens,
it specifies a sequence of consecutive motions. The ratio of the magnitude
of translation ||T || and rotation , or simply the T /R ratio, is compared at
the center of the random cloud scattered in the truncated pyramid specified
by the given field of view and depth variation (see Figure 8.5). For each
simulation, the amount of rotation between frames is given by a proper
angle and the amount of translation is then automatically given by the
T /R ratio. We always choose the amount of total motion such that all
feature points will stay in the field of view for all frames. In all simulations,
independent Gaussian noise with std given in pixels is added
to each image
T
point. Error measure for rotation is arccos tr(RR2 )1 in degrees where
is an estimate of the true R. Error measure for translation is the angle
R
between T and T in degrees where T is an estimate of the true T .
Z
Points
Depth
Variation
T/R = ||T||/r
r
Field of View
Camera Center
XY
8.5. Experiments
183
2
1.5
1
0.5
0
0
1
2
3
4
5
Noise level in pixels for image size 500x500
2
1.5
1
0.5
0
0
1
2
3
4
5
Noise level in pixels for image size 500x500
Two frame
Multiframe
Two frame
Multiframe
6
0
1
2
3
4
5
Noise level in pixels for image size 500x500
Motion estimates between frames 3 and 2
Two frame
Multiframe
0
1
2
3
4
5
Noise level in pixels for image size 500x500
Figure 8.6. Motion estimate error comparison between 8 point algorithm and
multiple view linear algorithm.
Structure error
300
350
250
200
150
100
50
0
0.7
0.8
0.9
1
1.1
1.2
Relative scale ||p || / ||p ||
13
12
1.3
Two frame
Multiframe
12
10
8
6
4
2
0
1
2
3
4
Noise level in pixels for image size 500x500
Figure 8.7. Relative scale and structure estimate error comparison. Motion is
XX-Y Y and relative scale is 1.
184
1.1
1.05
0.95
0.9
4
5
Number of frames
3.6
3.4
3.2
2.8
1.05
0.95
0.9
4
5
Number of frames
1.1
Structure estimate error [%]
4
5
Number of frames
4
5
Number of frames
Figure 8.8. Estimation error as a function of the number of frames. Noise level is
3 pixels and relative scale is 1.
8.5. Experiments
185
Frame 1
Frame 2
52
+30
100
39
38
+
200
50 42
23
+ 35
54
+
++
+ 48
+
300
25
15
400
16
+40
+
100
200
+ ++ +
24
+
10
+
300
+ 48
+
+
300
26 13
17+ +
33
+
+11
600
100
34
52
400
50 42
23
++ 35
54
++
300
400
3
49
+
9 37
7
12
+547
+
+ +++
+
36
32
27
+
+
++
+
34
22
4 45
+
26 13
43
2
+
+
+
41
17+ +
29 + 44
33
++
+
+ 14
6+ 20 2819
+11
+ + ++ +
+
24
16
+
+40
10
100
200
300
400
50 35
23
+ +42
+
+
54
52
500
39
38
+
200
31
51
+
300
15
18
26 13
400
17+ +
33
+
+11
+
25
48
3
49
+
9 37
7
+547
12
+
+ +++
+
36
32
27
++
+
34
22
+
4 45
43 2+
+
+
41
29 + 44
++
+ 14
6+ 20 2819
+ +
+
16
+40
600
53
+21
+
46
+
15
600
8
25
+ 48
+
18
500
+30
+
100
8
39
38
+
53
21
++
46
+
+
Frame 4
+30
+
200
++
++
31
51
+
300
Frame 3
100
+
+
41
+
+ 14
6+ 20 2819
+ + ++ +
24
16
+
+40
10
200
25
3
49
+
9 37
7
12
+547
+
+ 32+++
+
36
+ 27
++
22 43 4 45
2
44
29
15
18
500
39
38
+
200
400
400
3
53
49
+
9 37
+21
7
12
+
+547
++
46
+
+ 27
+
+
36
32 +
+
+
++
+
31
34
51
22 43 4 45
26 13
+
++
2+ +
+
+
17
+
44
41
33
29 +
+
+
+11
+ 14 +
6+ 20 2819
18
50 42
23
++ 35
54
++
52
+30
+
100
100
200
53
21
++
46
+
31
51
+
+ +
24
+
10
+
300
400
500
600
186
Translation Estimates
9
Nonlinear Multiview
Linear Two Views
Linear Multiview
1.8
1.6
7
6
Error [degrees]
Error [degrees]
1.4
1.2
1
0.8
5
4
3
0.6
0.4
0.2
0
12
23
Frame pairs
34
90
40
80
35
70
30
60
25
23
Frame pairs
34
50
20
40
15
30
10
12
45
Error [%]
Error [%]
20
5
0
Nonlinear Multiview
Linear Two Views
Linear Multiview
10
Nonlinear
Two View
Multiview
Nonlinear
Two View
Multiview
Figure 8.10. Motion and structure estimates for indoor rectilinear motion image
sequence
8.6 Summary
8.7 Exercises
1. Rectilinear motion
Use the rank condition of the Mp matrix to show that if the camera
is calibrated and only translating on a straight line, then the relative
translational scale between frames cannot be recovered from pairwise
bilinear constraints, but it can be recovered from trilinear constraints.
(Hint: use Ri = I to simplify the constraints.)
2. Pure rotational motion
In fact the rank condition on the matrix Mp is so fundamental that
it works for the pure rotational motion case too. Please show that
if there is no translation, the rank condition on the matrix Mp is
equivalent to the constraints that we may get in the case of pure
rotational motion
xbi Ri x1 = 0.
(8.25)
(Hint: you need to convince yourself that in this case the rank of Np
is no more than m + 2.)
3. Points in the plane at infinity
Show that the multiple view matrix Mp associated to m images
of a point p in the plane at infinity satisfies the rank condition:
rank(Mp ) 1. (Hint: the homogeneous coordinates for a point p
at infinity are of the form X = [X, Y, Z, 0]T R4 .)
Chapter 9
Geometry and reconstruction from line
features
Our discussion so far has been based on the assumption that point correspondence is available in a number m 2 of images. However, as we
have seen in Chapter 4, image correspondence is subject to the aperture
problem, that causes the correspondence to be undetermined along the direction of brightness edges. One may therefore consider matching edges,
rather than individual points. In particular, man-made scenes are abundant with straight lines features that can be matched in different views
using a variety of techniques. Therefore, assuming that we are given correspondence between lines in several images, how can we exploit them to
recover camera pose as well as the 3-D structure of the environment? In
order to answer this question, we need to study the geometry of line correspondences in multiple views. Fortunately, as we see in this chapter, most
of the concept described in the last chapter for the case of points carry
through with lines. Therefore, we follow the derivation in Chapter 8 to
derive all the independent constraints between line correspondences.
R3 ,
188
R3 ,
This is illustrated in Figure 9.1. We will call l the co-image of the line
L
v
X
PSfrag replacements
l
o
l
Figure 9.1. Image of a line L. The intersection of P and the image plane represents
the physical projection of L. But it is more conviniently represented by a normal
vector l of the plane P.
189
Chapter 11. Using this notation, a co-image l(t) = [a(t), b(t), c(t)]T R3
of a L E3 taken by a moving camera satisfies the following equation
l(t)T x(t) = l(t)T A(t)P g(t)X = 0,
(9.1)
where x(t) is the image of the point X L at time t, A(t) SL(3) is the
camera calibration matrix (at time t), P = [I, 0] R34 is the constant
projection matrix and g(t) SE(3) is the coordinate transformation from
the world frame to the camera frame at time t. Note that l(t) is the normal
vector of the plane formed by the optical center o and the line L (see Figure
9.1). Since the above equation holds for any point X on the line L, it yields
l(t)T A(t)P g(t)Xo = l(t)T A(t)P g(t)v = 0.
(9.2)
i = A(ti )P g(ti ).
(9.3)
The matrix i is then a 3 4 matrix which relates the ith image of the line
L to its world coordinates (Xo , v) by
lTi i Xo = lTi i v = 0
(9.4)
Nl v = 0,
(9.5)
lT1 1
lT2 2
Nl = . Rm4 .
.
.
(9.6)
rank(Nl ) 2.
(9.7)
lTm m
...,
m = [Rm , Tm ]
R34 ,
(9.8)
190
l1
0
lT2 R2 lT2 T2
m4
Nl = .
.
(9.9)
.. R
..
.
lTm Rm lTm Tm
0
lT2 R2 lb1
Nl0 =
..
.
lT Rm lb1
m
lT1 l1
T
l2 R 2 l1
..
.
lTm Rm l1
0
lT2 T2
..
.
Rm5 .
(9.11)
lTm Tm
rank(Nl0 ) = rank(Nl ) 2.
T b
l3 R3 l1 lT3 T3
(m1)4
(9.12)
Ml =
..
..
R
.
.
lT Rm lb1 lT Tm
m
has rank no more than one. We call the matrix Ml the multiple view matrix
associated to a line feature L. Hence we have proved the following:
Theorem 9.1 (Rank deficiency equivalence condition). For the two
matrices Nl and Ml , we have
rank(Ml ) = rank(Nl ) 1 1.
(9.13)
191
The rank condition certainly implies all trilinear constraints among the
given m images of the line. To see this more explicitly, notice that for
rank(Ml ) 1, it is necessary for any pair of row vectors of Ml to be
linearly dependent. This gives us the well-known trilinear constraints
lTj Tj lTi Ri lb1 lTi Ti lTj Rj lb1 = 0
(9.14)
among the first, ith and j th images. Hence the constraint rank(Ml ) 1 is a
natural generalization of the trilinear constraint (for 3 views) to arbitrary
m views since when m = 3 it is equivalent to the trilinear constraint for
lines, except for some rare degenerate cases, e.g., lTi Ti = 0 for some i.
It is easy to see from the rank of matrix Ml that there will be no more
independent relationship among either pairwise or quadruple-wise image
lines. Trilinear constraints are the only non-trivial ones for all the images
that correspond to a single line in 3-D. So far we have essentially proved the
following facts regarding multilinear constraints among multiple images of
a line:
Theorem 9.2 (Linear relationships among multiple views of a
line). For any given m images corresponding to a line L in E3 relative
to m camera frames, the rank-deficient matrix Ml implies that any algebraic constraints among the m images can be reduced to only those among 3
images at a time, characterized by the so-called trilinear constraints (9.14).
Although trilinear constraints are necessary for the rank of matrix Ml
(hence Nl ) to be 1, they are not sufficient. For equation (9.14) to be nontrivial, it is required that the entry lTi Ti in the involved rows of Ml need
to be non-zero. This is not always true for certain degenerate cases - such
as the when line is parallel to the translational direction. The rank condition on Ml , therefore, provides a more complete account of the constraints
among multiple images and it avoids artificial degeneracies that could be
introduced by using algebraic equations.
192
L
P3
PSfrag replacements
P1
P2
l3
l1
o3
l2
o1
l3
l1
l2
o2
Figure 9.2. Three planes extended from the three images l1 , l2 , l3 intersect at one
line L in 3-D, the pre-image of l1 , l2 , l3 .
Like we did for points, we now ask the inverse question: if the three vectors
l1 , l2 , l3 satisfy the trilinear constraints, are they necessarily images of some
single line in 3-D, the so-called pre-image? As shown in Figure 9.3, we
denote the planes formed by the optical center oi of the ith frame and
image line li to be Pi , i = 1, 2, 3. Denote the intersection line between P1
and P2 as L2 and the intersection line between P1 and P3 as L3 . As pointed
out at the beginning, li R3 is also the normal vector of Pi . Then without
loss of generality, we can assume that li is the unit normal vector of plane
Pi , i = 1, 2, 3 and the trilinear constraint still holds. Thus, lTi Ti = di
is the distance from o1 to the plane Pi and (lTi Ri )T = RiT li is the unit
normal vector of Pi expressed in the 1th frame. Furthermore, (lTi Ri lb1 )T is
a vector parallel to Li with length being sin(i ), where i [0, ] is the
angle between the planes P1 and Pi , i = 2, 3. Therefore, in the general
case, the trilinear constraint implies two things: first, as (lT2 R2 lb1 )T is linear
dependent of (lT3 R3 lb1 )T , L2 is parallel to L3 . Secondly, as d2 sin(3 ) =
193
L3
L2
PSfrag replacements
P3
P1
l1
o1
l3
P2
l3
o3
l2
l1
l2
o2
Figure 9.3. Three planes extended from the three images l1 , l2 , l3 intersect at lines
L2 and L3 , which actually coincides.
d3 sin(2 ), the distance from o1 to L2 is the same with the distance from o1
to L3 . They then imply that L2 coincides with L3 , or in other words, the line
L in 3-D space is uniquely determined. However, we have degeneracy when
P1 coincides with P2 and P3 . In that case, d2 = d3 = 0 and (lT2 R2 lb1 )T =
(lT3 R3 lb1 )T = 031 . There are infinite number of lines in P1 = P2 = P3 that
generate the same set of images l1 , l2 and l3 . The case when P1 coincides
with only P2 or only P3 is more tricky. For example, if P1 coincides with
P2 but not with P3 , then d2 = 0 and (lT2 R2 lb1 )T = 031 . However, if we
re-index the images (frames), then we still can obtain a non-trivial trilinear
constraint, from which L can be deduced as the intersection line between
P1 and P3 . So we have in fact proved the following fact:
Lemma 9.1 (Properties of trilinear constraints for lines). Given
three camera frames with distinct optical centers and any three vectors
l1 , l2 , l3 R3 that represents three lines on each image plane. If the three
image lines satisfy trilinear constraints
lTj Tji lTk Rki lbi lTk Tki lTj Rji lbi = 0,
i, j, k = 1, 2, 3
194
normals all coincide with each other. For this only degenerate case, the
matrix Ml becomes zero. 1
For more than 3 views, in order to check the uniqueness of the pre-image,
one needs to apply the above lemma to every triplet of views. The possible
combinations of degenerate cases make it very hard to draw any consistent
conclusion. However, in terms of the rank condition on the multiple view
matrix, the lemma can be generalized to multiple views in a much more
concise and unified form:
Theorem 9.3 (Uniqueness of the pre-image). Given m vectors in R3
representing lines on the image planes with respect to m camera frames,
they correspond to the same line in the 3-D space if the rank of the matrix
Ml relative to any of the camera frames is 1. If its rank is 0 (i.e. the
matrix Ml is zero), then the line is determined up to a plane on which all
the camera centers must lie.
The proof follows directly from Theorem 9.1. So the case that the line
in 3-D shares the same plane as all the centers of the camera frames is the
only degenerate case when one will not be able to determine the exact 3-D
location of the line from its multiple images. As long as the camera frames
have distinct centers, the set of lines that are coplanar with these centers is
of only zero measure. Hence, roughly speaking, trilinear constraints among
images of a line rarely fail in practice. Even the rectilinear motion will not
pose a problem as long as enough number of lines in 3-D are observed, the
same as in the point case. On the other hand, the theorem also suggests
a criteria to tell from the matrix Ml when a degenerate configuration is
present: exactly when the largest singular value of the Ml matrix (with
respect to any camera frame) is close to zero.
195
point in 3-D. This point is not necessarily on the image plane though. If
we call this point p, then the vector defined by p o1 is obviously parallel
to the original line L in 3-D. We also call it v. Therefore the so-defined Ml
matrix in fact gives us a map from lines in 3-D to vectors in 3-D:
Ml : L R3 7 v R3 .
This map is certainly not injective but surjective. That is, from Ml matrix
alone, one will not be able to recover the exact 3-D location of the line L,
but it will give us most of the information that we need to know about its
images. Moreover, the vector gives the direction of the line, and the norm
kvk is exactly the ratio:
kvk = sin(i )/di ,
i = 2, . . . , m.
Roughly speaking, the farther away is the line L from o1 , the smaller this
ratio is. In fact, one can show that the family of parallel lines in 3-D that
all map to the same vector v form a cylinder centered around o1 . The
above ratio is exactly the inverse of the radius of such a cylinder. We may
summarize our discussion into the following theorem:
Theorem 9.4 (Geometry of the matrix Ml ). The matrix Ml associated
to a line L in 3-D maps this line to a unique vector v in 3-D. This map is
surjective and two lines are mapped to the same vector if and only if they
are: 1. parallel to each other and 2. of the same distance to the center o of
the reference camera frame. That distance is exactly 1/kvk.
Therefore, knowing Ml (i.e. the vector v), we know that L must lie on a
circle of radius r = 1/kvk, as shown in Figure 9.4. So if we can obtain the Ml
r
PSfrag replacements
Figure 9.4. An equivalent family of parallel lines that give the same Ml matrix.
matrix for L with respect to another camera center, we get two families of
parallel lines lying on two cylinders. In general, these two cylinders intersect
at two lines, unless the camera centers all lie on a line parallel to the line
196
L. Hence, two Ml matrices (for the same line in 3-D) with respect to two
distinct camera centers determine the line L up to two solutions. One would
imagine that, in general, a third Ml matrix respect to a third camera center
will then uniquely determine the 3-D location of the line L.
c2 R2 x1
c2 T2
x
x
..
.. R3(m1)2 .
Mp =
.
.
xc
m R m x1
xc
m Tm
The apparent similarities between both the rank conditions and the forms
of Ml and Mp are expected due to the geometric duality between lines and
point. In the 3-D space, a point can be uniquely determined by two lines,
while a line can be uniquely determined by two points. So our question
now is that if the two set of rank conditions can be derived from each other
based on the geometric duality.
First we show that we can derive the rank condition for line from rank
condition for point. Let p1 and p2 be two distinct points on a line L in the
3-D space. Denote the m images of p1 , p2 under m views to be x11 , . . . , x1m
and x21 , . . . , x2m , respectively. Hence, the ith (i = 1, . . . , m) view of L can be
c2 x1 or lb = x2 x1T x1 x2T , i = 1, . . . , m. From the rank
expressed as li = x
i
i i
i i
i i
1
c1 R x1 = x
c1 T and x
c2 R x2 = x
c2 T
deficiency of Mp and Mp2 , we have x
i i 1
i i
i i 1
i i
for some , R and i = 1, . . . , m. This gives
2T
1T
c2
lTi Ri lb1 = x1T
i xi Ti (x1 x1 ),
c2
lTi Ti = x1T
i xi Ti .
(9.15)
which means that each row of Ml is spanned by the same vector [(x2T
1
T
4
x1T
1 ), 1] R . Therefore,
rank(Mp1 ) 1 & rank(Mp2 ) 1
rank(Ml ) 1.
197
already reveal some interesting duality between the camera center o and the
point p in 3-D. For example, if the camera center moves on a straight line
(the rectilinear motion), from the Mp matrix associated to a point p, the 3D location of the point p can only be determined up to a circle. On the other
hand, if the camera center is fixed but the point p can move on a straight
line L, from the Ml matrix associated to the line L, the 3-D location of this
line can only be determined up to a circle too. Mathematically speaking,
matrices Mp and Ml define an equivalence relationship for points and lines
in the 3-D space, respectively. They both group points and lines according
to their distance to the center of the reference camera frame. Numerically,
the sensitivity for Mp and Ml as rank-deficient matrices depends very much
on such distance. Roughly speaking, the farther away is the point or line,
the more sensitive the matrix is to noise or disturbance. Hence, in practice,
one may view Mp and Ml as a natural metric for the quality of the
measurements associated to the multiple images of a 3-D point or line.
9.4.1 Matching
In general we like to test if given m vectors l1 , . . . , lm R3 with known
indeed satisfy all the constraints that m images of a single 3-D line (preimage) should. There are two ways of performing this. One is based on the
fact that Ml matrix has rank no more than 1. Another is based on the
geometric interpretation of Ml matrix.
2 Such
198
Since rank(Ml ) 1, the 4 4 matrix MlT Ml also has rank no more than
1. Hence, ideally with given l1 , . . . , lm and , the eigenvalues of MlT Ml
should satisfy 1 0, 2 = 3 = 4 = 0. A more numerically robust
algorithm would be:
Algorithm 9.1 (Multiple view matching test). Suppose the projection matrix associated to m camera frames are given. Then for given m
vectors l1 , . . . , lm R3 ,
1. Compute the matrix Ml Rm4 according to (9.12).
2. Compute second largest eigenvalue 2 of the 4 4 matrix MlT Ml ;
3. If 2 1 for some pre-fixed threshold 1 , then we say the m image
vectors match. Otherwise discard it as outliers.
9.4.2 Reconstruction
Now suppose that m( 3) images lj1 , . . . , ljm of n lines Lj , j = 1, . . . , n
are given and we want to use them to estimate the unknown projection
matrix . From the rank condition of the Ml matrix, we know that the
kernel of Ml should have 3 dimensions. Denote Mlj to be the Ml matrix
T
b
for the j th line, j = 1, . . . , n. Since lji Ri lj1 is a vector parallel to Lj ,
j
j
3
then for any two linear
j vectors u , w R lying in the plane
independent
j
j
perpendicular to L , u0 and w0 are two base vectors in the kernel of
j
j
j
Mlj . Let 1 ker(Mlj ) be a vector orthogonal to both u0 and w0 .
Then
j = [j1 , j2 , j3 ]T = k j ubj wj ,
(9.16)
T
T
l1i
1 (lb11 l1i )
.
..
..
nT
T
li
n T (lbn1 lni )
T
T
0
u1 (lb11 l1i )
..
Pi = ...
(9.17)
R3n12 ,
.
T
nT
n
n
u (lb1 li )
T
T
0
w1 (lb11 l1i )
..
..
.
nT
nT
n
b
0 w (l l )
1
199
(9.18)
~ i = [r11 , r21 , r31 , r12 , r22 , r32 , r13 , r23 , r33 ]T , with rkl
where T~i = Ti and R
th
being the (kl) entry of Ri , k, l = 1, 2, 3. It can be shown that if j , uj , wj s
are known, the matrix Pi is of rank 11 if more than n 12 lines in general
position are given. In that case, the kernel of Pi is unique, and so is (Ri , Ti ).
Hence if we know j for each j = 1, . . . , n, then we can estimate (Ri , Ti )
by performing singular value decomposition (SVD) on Pi . In practice, the
initial estimations of j , uj , wj s may be noisy and only depend on local
data, so we can use iteration method. First, we get estimates for (Ri , Ti )s,
then we use the estimated motions to re-calculate j s, and iterate in this
way till the algorithm converges. However, currently there are several ways
to update j , uj , wj s in each iteration. Initialization of j , uj , wj and the
overall algorithm are described in the next section.
Euclidean reconstruction
Here we illustrate the overall algorithm for the case when the camera is
assumed to be calibrated. That is A(t) = I hence the projection matrix
i = [Ri , Ti ] represents actual Euclidean transformation between camera
frames. We will discuss later on what happens if this assumption is violated.
Given the first three views of (at least) 12 lines in the 3-D space. Let
j
Ml3
be the matrix formed by the first 3 columns of Mlj associated to the
j
j
first three views. From Ml3
, we can already estimate uj , wj ker(Ml3
),
j
j
where u , w are defined above. So after we obtain an initial estimation of
i , Ti ) for i = 2, 3, we can calculate M
j and compute the SVD of it such
(R
l3
j
= U2 S2 V T , then pick u
that M
j , w
j being the 2nd and 3rd columns of
2
l3
V2 respectively. Then the estimation for j is
j = k j (
uj w
j ) for some
j
k R to be determined. k can be estimated by least squares method such
that
kj =
Pm
jT
j T bj bj j
w
)
i=2 (li Ti )(li Ri l1 u
Pm j T bj bj j 2 .
w
)
i=2 (li Ri l1 u
(9.19)
200
(9.20)
(9.21)
The j , uj , wj can then be re-estimated from the full matrix Mlj computed
from the motion estimates (Ri , Ti ).
Based on above arguments, we can summarize our algorithm as the
following:
Algorithm 9.2 (Structure and motion from multiple views of
lines). Given a set of m images lj1 , . . . , ljm of n lines Lj , j = 1, . . . , n,
we can estimate the motions (Ri , Ti ), i = 2, . . . , m as the following:
1. Initialization
(a) Set step counter s = 0.
(b) Compute initial estimates for (Ri , Ti ), i = 2, 3 for the first three
views.
(c) Compute initial estimates of js , ujs , wsj for j = 1, . . . , n from the
first three views.
i , Ti ) from
2. Compute Pi from (9.17) for i = 2, . . . , m, and obtain (R
the eigenvector associated to its smallest singular value.
i , Ti ) from (9.20) and (9.21), i = 2, . . . , m.
3. Compute (R
j
based on the M j matrix calculated by using
4. Compute js+1 , ujs+1 , ws+1
j
1
i , Ti ). Normalize
(R
s+1 so that ks+1 k = 1.
9.5. Experiments
201
9.5 Experiments
In this section, we show in simulation the performance of the above algorithm. We test it on two different scenarios: sensitivity of motion and
structure estimates with respect to the level of noise on the line measurements; and the effect of number of frames on motion and structure
estimates.
9.5.1 Setup
The simulation parameters are as follows: number of trials is 500, number
of feature lines is typically 20, number of frames is 4 (unless we vary it
on purpose), field of view is 90o , depth variation is from 100 to 400 units
of focal length and image size is 500 500. Camera motions are specified
by their translation and rotation axes. For example, between a pair of
frames, the symbol XY means that the translation is along the X-axis and
rotation is along the Y -axis. If n such symbols are connected by hyphens,
it specifies a sequence of consecutive motions. The ratio of the magnitude
of translation ||T || and rotation , or simply the T /R ratio, is compared at
the center of the random cloud scattered in the truncated pyramid specified
by the given field of view and depth variation. For each simulation, the
amount of rotation between frames is given by a proper angle and the
amount of translation is then automatically given by the T /R ratio. We
202
always choose the amount of total motion such that all feature (lines) will
stay in the field of view for all frames. In all simulations, each image line is
perturbed by independent noise witha standarddeviation given in degrees.
T
is an
Error measure for rotation is arccos tr(RR2 )1 in degrees where R
estimate of the true R. Error measure for translation is the angle between
T and T in degrees where T is an estimate of the true T . Error measure for
structure is approximately measured by the angle between and
where
9.5. Experiments
motion between frame 3 and 1
2
Rotation error in degrees
3
2
1
0
0.5
1
1.5
2
noise level in degrees
1.5
1
0.5
0
2.5
0.5
1
1.5
2
noise level in degrees
2.5
203
0.5
1
1.5
2
noise level in degrees
2.5
1.5
1
0.5
0
0.5
1
1.5
2
noise level in degrees
2.5
Figure 9.5. Motion estimate error from four views. The number of trials is 500.
T /R ratio is 1. The amount of rotation between frames is 20o .
4
3.5
2.5
1.5
0.5
0.5
1
1.5
noise level in degrees
2.5
Figure 9.6. Structure estimate error from four views. The number of trials is 500.
T /R ratio is 1. The amount of rotation between frames is 20o .
204
expected. In fact, the error converges to the given level of noise on the line
measurements. However, according to Figure 9.7, motion estimates do not
necessarily improve with an increase number of frames since the number
of motion parameters increase linearly with the number of frames. In fact,
we see that after a few frames, additional frames will have little effect on
the previous motion estimates.
8
10
Number of frames
1.5
12
12
8
10
Number of frames
3.5
2.5
2
1.5
1
0.5
0
8
10
Number of frames
12
8
10
Number of frames
12
Figure 9.7. Motion (between frames 4-1 and 5-1) estimate error versus number
of frames for an orbital motion. The number of trials is 500.
4.5
3.5
2.5
1.5
7
8
Number of Frames
10
11
12
Figure 9.8. Structure estimate error versus number of frames for an orbital
motion. The number of trials is 500.
9.6. Summary
205
9.6 Summary
9.7 Exercises
Chapter 10
Geometry and reconstruction with
incidence relations
The previous two chapters described the constraints involving corresponding points or lines. In this chapter we address the case when correspondence
of points and lines is available. This results in a formal condition that has
all rank conditions discussed in previous chapters as special cases. We also
discuss how to enforce the constraint that all geometric features lie on a
plane, which is a special case of considerable practical importance.
(10.1)
207
direction of the line. Then the image of the line L at time t is simply
the collection of images x(t) of all points p L. It is clear that all such
x(t)s span a plane in R3 , as shown in Figure 10.1. The projection of the
line is simply the intersection of this plane with the image plane. Usually
it is more convenient to specify a plane by its normal vector, denoted as
l(t) = [a(t), b(t), c(t)]T R3 . We call this vector l the co-image of the line
L, which satisfies the following equation:
l(t)T x(t) = l(t)T A(t)P g(t)X = 0
(10.3)
for any image x(t) of any point p on the line L. Notice that the column (or
row) vectors of the matrix bl span the (physical) image of the line L, i.e. they
span the plane which is orthogonal to l.1 This is illustrated in Figure 10.1.
Similarly, if x is the image of a point p, its co-image (the plane orthogonal
b. So in this chapter, we will use the following
to x) is given by the matrix x
notation:
Image of a point: x R3 ,
Co-image of a point:
bl R33 , Co-image of a line:
Image of a line:
b R33 ,
x
(10.4)
l R3 .
xi = x(ti ),
li = l(ti ),
i = A(ti )P g(ti ).
(10.5)
th
(10.6)
and the ith co-image of the line L to its world coordinates (Xo , v) by:
lTi i Xo = lTi i v = 0,
(10.7)
(10.8)
for i = 1, . . . , m.
In the above systems of equations, the unknowns, i s, X, Xo and v,
which encode the information about 3-D location of the point p or the line
1 In fact, there is some redundancy using b
l to describe the plane: the three column (or
row) vectors in bl are not linearly independent. They only span a two-dimensional space.
However, we here use it anyway to simplify the notation.
208
PSfrag replacements
L
p
x1
lb1
v
po
lb2
l2
x2
o2
(R, T )
o1
l1
Figure 10.1. Images of a point p on a line L. Planes extended from the image
lines lb1 , lb2 should intersect at the line L in 3-D. Lines extended from the image
points x1 , x2 intersect at the point p.
209
T
l1
0
R(3m+1)3m .
Dl0 = lb1
(10.9)
0
0 I3(m1)3(m1)
We obtain:
lT1
lb1
0
D l Np =
R2
.
..
Rm
0
0
0
b
l1 x1
T2
..
.
Tm
0
0
0
..
.
x2
..
.
..
.
..
0
0
0
..
.
R(3m+1)(m+4) . (10.10)
0
xm
I44 0
0
0
0
c2
x
0
0
T
0
x2
0
0
..
..
..
R(4m+1)(3m+1) .
.
.
.
0
0
(10.11)
Dp0 =
.
.
.
.
..
..
..
..
0
0
0
0 xc
m
T
0
0
0 xm
It is direct to verify that the rank of the resulting matrix Dp0 Dl0 Np is related
to the rank of its sub-matrix:
T
l1
0
x
c2 T2
x
c2 R2
(3m2)4
(10.12)
Np00 = .
.. R
..
.
xc
m Rm
xc
m Tm
l1
0
lb1
0
0
1
R45
(10.13)
210
yields:
lT1 l1
c2 R2 l1
x
..
.
xc
m R m l1
0
c2 R2 lb1
x
..
.
b
xc
m R m l1
c2 R2 lb1
x
.
..
Mlp =
.
b
xc
R
m m l1
0
c2 T2
x
..
.
xc
m Tm
c2 T2
x
..
.
xc
m Tm
R(3m2)5 .
R[3(m1)]4
(10.14)
(10.15)
the multiple view matrix for a point included by a line. Its rank is related
to that of Np by the expression:
rank(Mlp ) rank(Np ) (m + 1) 2.
(10.16)
(10.17)
The rank condition on Mlp then captures the incidence condition in which
a line with co-image l1 includes a point p in E3 with images x2 , . . . , xm
with respect to m 1 camera frames. What kind of equations does this
rank condition give rise to? Without loss of generality, let us look at the
sub-matrix
#
"
c2 T2
c2 R2 lb1 x
x
R64 .
(10.18)
Mlp =
c3 T3
c3 R3 lb1 x
x
The rank condition implies that every 3 3 sub-matrix of Mlp has determinant zero. The first three rows are automatically of rank 2 and so are
the last three rows and the first three columns.2 Hence any non-trivial determinant consists of two and only two rows from either the first three or
last three rows, and consists of two and only two columns from the first
three columns. For example, if we choose two rows from the first three,
it is direct to see that such a determinant is a polynomial of degree 5 on
(the entries of) l1 , x2 and x3 which is quadratic on l1 and also quadratic in
2 This
211
c2 T2
c2 R2 x1
x
x
x
c3 T3
x
c3 R3 x1
(10.19)
Mp =
R[3(m1)]2 .
..
..
.
.
xc
xc
m Tm
m R m x1
This matrix should have a rank of no more than 1. Since the point belongs
to all the lines, we have
lTi xi = 0, i = 1, . . . , m.
Hence li range(xbi ). That is, there exist ui R3 such that li = xbi T ui , i =
1, . . . , m. Since rank(Mp ) 1, so should be the rank of following matrix
(whose rows are simply linear combinations of those of Mp ):
T
T
c2 R2 x1
c2 T2
u2 x
uT2 x
l2 R2 x1 lT2 T2
..
..
..
.. R(m1)2(10.20)
.
=
.
.
.
.
uTm xc
m R m x1
uTm xc
m Tm
lTm Rm x1
lTm Tm
3 Hence it is a constraint that is not in any of the multilinear (or multifocal) constraint
lists previously studied in the computer vision literature. For a list of these multilinear
constraints, see [HZ00].
212
lT2 R2 x1
.
..
=
.
lTm Rm x1
lT2 T2
..
.
lTm Tm
R(m1)2
(10.21)
the multiple view matrix for lines intersecting at a point. Then we have
proved:
Lemma 10.2 (Rank condition with intersection). Given the image
of a point p and multiple images of lines intersecting at p, the multiple view
matrix Mpl defined above satisfies:
0 rank(Mpl ) 1.
(10.22)
The above rank condition on the matrix Mpl captures the incidence condition between a point and lines which intersect at the same point. It is
worth noting that for the rank condition to be true, it is necessary that
all 2 2 minors of Mpl be zero, i.e. the following constraints hold among
arbitrary triplets of given images:
[lTi Ri x1 ][lTj Tj ] [lTi Ti ][lTj Rj x1 ] = 0
R,
i, j = 2, . . . , m.
(10.23)
l2 R2 lb1 lT2 T2
.
..
.. R(m1)4 .
l =
M
(10.24)
.
.
T
T
b
l R m l1 l Tm
m
However, here li in the ith view can be the co-image of any line out of a
family which intersect at a point p. It is then easy to derive from Lemma
10.1 and the proof of Lemma 10.2 that:
Corollary 10.2 (An intersecting family of lines). Given m imasge
l defined above
of a family lines intersecting at a 3D point p, the matrix M
satisfies:
l ) 2.
1 rank(M
(10.25)
213
We leave the proof as exercise to the reader. The reader also should
be aware of the difference between this corollary and the Theorem 9.1:
Here l1 , . . . , lm do not have to be the coimages of a single 3D line, but
instead each of them may randomly correspond to any line in an intersecting
family. It is easy to see that the above rank condition imposes non-trivial
constraints among up to four images, which conforms to what Corollary
10.1 says.
(10.26)
(10.27)
1 x1 0
0
l1 1
..
lT2 2
2 0 x 2 . . .
.
.
.
.
and Nl =
..
..
..
Np =
.. .
...
.
.
0
.
lTm m
m 0
0 xm
0
0
214
lT2 R2 lb1
c2 R2 x1
c2 T2
x
x
x
c3 T3
x
lT3 R3 lb1
c3 R3 x1
..
..
..
Mp =
, Ml =
.
.
.
xc
c
lTm Rm lb1
m R m x1 x
m Tm
1 x1
2
1 lb1
become:
lT2 T2
lT3 T3
..
. .
lTm Tm
2
(10.28)
Then the rank condition rank(Mp ) 1 implies not only the multilinear
constraints as before, but also the following equations (by considering the
sub-matrix consisting of the ith group of three rows of Mp and its last row)
xbi Ti 1 x1 xbi Ri x1 2 = 0,
i = 2, . . . , m.
(10.29)
When the plane P does not cross the camera center o1 , i.e. 2 =
6 0, these
equations give exactly the well-known homography constraints for planar
image feature points
1
xbi Ri 2 Ti 1 x1 = 0
(10.30)
between the 1st and the ith views. The matrix Hi = Ri 12 Ti 1 in
the equation is the well-known homography matrix between the two views.
Similarly from the rank condition on Ml , we can obtain homography in
terms of line features
1
T
1
li Ri 2 Ti lb1 = 0
(10.31)
215
PSfrag replacements
p1
L2
L3
p
p3
L1
Figure 10.2. Duality between a set of three points and three lines in a plane P:
the rank conditions associated to p1 , p2 , p3 are exactly equivalent those associated
to L1 , L2 , L3 .
PSfrag replacements
p3
p1
L2
p
p4
L1
p2
Figure 10.3. p1 , . . . , p4 are four points on the same plane P, if and only if the two
associated (virtual) lines L1 and L2 intersect at a (virtual) point p5 .
216
c3 x4 ,
l2 = x
x5 = lb1 l2 .
(10.32)
1T
l2 R2 x51 l1T
2 T2
5
2T
l2T
2 R 2 x1 l2 T2
.
.
pl =
..
..
M
(10.33)
R[2(m1)]2
1T
5
1T
lm R m x1 lm Tm
5
l2T
l2T
m R m x1
m Tm
33
Di = xbi R
or lTi R3 .
217
D2 R2 D1 D2 T2
. D3 R 3 D1 D3 T 3
(10.34)
M =
.. .
..
.
.
Dm
R m D1 Dm
Tm
Depending on the particular choice for each Di or D1 , the dimension of the
matrix M may vary. But no matter what the choice for each individual Di
or D1 is, M will always be a valid matrix of certain dimension. Then after
elimination of the unknowns i s, X, Xo and v in the system of equations
in (10.6) and (10.7), we obtain:
Theorem 10.1 (Multiple view rank conditions). Consider a point p
lying on a line L and its images x1 , . . . , xm R3 and co-images l1 , . . .,
lm R3 relative to m camera frames whose relative configuration is given
by (Ri , Ti ) for i = 2, . . . , m. Then for any choice of Di and D1 in the
definition of the multiple view matrix M , the rank of the resulting M belongs
to and only to one of the following two cases:
1. If D1 = lb1 and Di = xbi for some i 2, then:
1 rank(M ) 2.
(10.35)
0 rank(M ) 1.
(10.36)
2. Otherwise:
fact, there are many equivalent matrix representations for Di and Di . We choose
xbi and lbi here because they are the simplest forms representing the orthogonal subspaces
of xi and li and also linear in xi and li respectively.
218
the multilinear constraints in the literature. The instantiations corresponding to case 1, as we have seen before, give rise to constraints that are not
necessarily multilinear. The completeness of Theorem 10.1 also implies that
there would be no multilinear relationship among quadruple-wise views,
even in the mixed feature scenario.5 Therefore, quadrilinear constraints
and quadrilinear tensors are clearly redundant for multiple view analysis.
However, as we mentioned before (in Section 10.2.1), nonlinear constraints
may still exist up to four views.
As we have demonstrated in the previous sections, other incidence conditions such as all features belonging to a plane in E3 can also be expressed
in terms of the same set of rank conditions:
Corollary 10.5 (Planar features and homography). Suppose that all
features are in a plane and coordinates X of any point on it satisfy the
equation X = 0 for some vector T R4 . Denote = [ 1 , 2 ] with
T
1 R3 , 2 R. Then simply append the matrix
1
D1 2
(10.37)
to the matrix M in its formal definition (10.34). The rank condition on the
new M remains exactly the same as in Theorem 10.1.
The rank condition on the new matrix M then implies all constraints
among multiple images of these planar features, as well as the special
constraint previously known as homography. Of course, the above representation is not intrinsic it depends on parameters that describe the 3-D
location of the plane. Following the process in Section 10.2.3, the above
corollary reduces to rank conditions on matrices of the type in (10.33),
which in turn, give multi-quadratic constraints on the images involved.
Remark 10.1 (Features at infinity). In Theorem 10.1, if the point p
and the line L are in the plane at infinity P3 \ E3 , the rank condition on the
multiple view matrix M is just the same. Hence the rank condition extends
to multiple view geometry of the entire projective space P3 , and it does not
discriminate between Euclidean, affine or projective space. In fact, the rank
conditions are invariant under a much larger group of transformations: It
allows any transformation that preserves all the incidence relations among
a given set of features, these transformations do not even have to be linear.
Remark 10.2 (Occlusion). If any feature is occluded in a particular view,
the corresponding row (or a group of rows) is simply omitted from M ; or if
only the point is occluded but not the entire line(s) on which the point lies,
then simply replace the missing image of the point by the corresponding
5 In fact, this is quite expected: While the rank condition geometrically corresponds
to the incidence condition that lines intersect at a point and that planes intersect at a
line, incidence condition that three-dimensional subspaces intersect at a plane is a void
condition in E3 .
219
[c
x2 R2 x1 ][c
x3 T3 ]T [c
x3 R3 x1 ][c
x 2 T 2 ]T = 0
R33 .
(10.39)
R3 .
(10.41)
R.
(10.43)
220
Then rank(M ) 2 implies that all 3 3 sub-matrices of M have determinant zero. These equations give the line-point-point type of constraints on
three images.
Let us first consider the more general case, i.e. case 2 in Theorem 10.1,
when rank(M ) 1. We will discuss case 1 afterwards. For case 2, there are
only two interesting sub-cases, depending on the value of the rank of M ,
are:
a) rank(M ) = 1,
and b) rank(M ) = 0.
(10.45)
Case a), when rank of M is 1, corresponds to the generic case for which,
regardless of the particular choice of features in M , all these features satisfy
the incidence condition. More explicitly, all the point features (if at least
2 are present in M ) come from a unique 3-D point p, all the lines features
(if at least 3 are present in M ) come from a unique 3-D line L, and if both
point and line features are present, the point p then must lie on the line L
in 3-D. This is illustrated in Figure 10.4.
What happens if there are not enough point or line features present in
M ? If, for example, there is only one point feature x1 present in Mpl , then
the rank of Mpl being 1 implies that the line L is uniquely determined by
221
PSfrag replacements
L
p
P3
P1
l1
o3
x2
l1
l3
l3
l2
x1
o1
x3
P2
l2
o2
Figure 10.4. Generic configuration for the case rank(M ) = 1. Planes extended
from the (co-)images l1 , l2 , l3 intersect at one line L in 3-D. Lines extended from
the images x1 , x2 , x3 intersect at one point p. p must lie on L.
222
T
l2 R2 x1 lT2 T2
lT3 R3 x1 lT3 T3
T
92
T
(10.46)
M =
l4 R 4 x1 l4 T4 R .
x
c5 R5 x1 x
c5 T5
c6 R6 x1 x
c6 T6
x
PSfrag replacements
The geometric
configuration of the point and line features corresponding
L
to the condition rank(M ) = 0 is illustrated in Figure 10.5. But notice that,
among all the possible solutions for L and p, if they both happen to be at
infinity, the incidence condition then would hold for all the images involved.
o2
o3
lb2
lb4
o4
x1
o1
o5
lb3
x6
x5
o6
and b) rank(M ) = 1.
(10.48)
Case a), when the rank of M is 2, corresponds to the generic cases for
which the incidence condition among the features is effective. The essential
c2 R2 lb1
c2 T2
x
x
..
.. R[3(m1)]4 .
Mlp =
.
.
b c
xc
m R m l1 x
m Tm
223
(10.49)
L
PSfrag replacements
xm
om
P
p
l1
x1
x3
x2
o1
o3
l1
o2
in some M , the point p then must lie on every plane associated to every
line feature. Hence p must be on the intersection of these planes. Notice
that, even in this case, adding more rows of line features to M will not
be enough to uniquely determine L in 3-D. This is because the incidence
condition for multiple line features requires that the rank of the associated
matrix Ml be 1.6 If we only require rank 2 for the overall matrix M , the
line can be determined only up to a family of lines intersections of the
planes associated to all the line features which all should intersect at the
same point p.
6 Matrix
224
c2 R2 lb1 x
c2 T2
x
cR lb x
c3 T3
x
M = 3 3 b1
(10.50)
R104 .
x
c4 R4 l1 x
c4 T4
lT5 R5 lb1 lT5 T5
PSfrag replacements
x2
o2
L
x3
o3
o1
lb1
x4
o4
lb5
o5
1: a
Notice that in this case, we no longer have an incidence condition for the
point features. However, one can view them as if they intersected at a point
p at infinity. In general, we no longer have the incidence condition between
the point p and the line L, unless both the point p and line L are in the
plane at infinity in the first place. But since the rank condition is effective
for line features, the incidence condition for all the line features still holds.
225
To summarize the above discussion, we see that the rank conditions indeed allow us to carry out meaningful global geometric analysis on the
relationship among multiple point and line features for arbitrarily many
views. There is no doubt that this extends existing methods based on multifocal tensors that can only be used for analyzing up to three views at a
time. Since there is yet no systematic way to extend triple-wise analysis
to multiple views, the multiple view matrix seems to be a more natural
tool for multiple-view analysis. Notice that the rank conditions implies all
previously known multilinear constraints, but multilinear constraints do
not necessarily imply the rank conditions. This is because the use of algebraic equations may introduce certain artificial degeneracies that make a
global analysis much more complicated and sometimes even intractable. On
the other hand, the rank conditions have no problem in characterizing all
the geometrically meaningful degeneracies in a multiple-view mixed-feature
scenario. All the degenerate cases simply correspond to a further drop of
rank for the multiple view matrix.
226
L1
p
L3
L2
PSfrag replacements
om
o1
o2
Figure 10.8. A standard cube. The three edges L1 , L2 , L3 intersect at the corner p.
The coordinate frames indicate that m images are taken at these vantage points.
and L3j , j = 1, . . . , 8. From the m images of the cube, we have the multiple
view matrix M j associated to pj :
c
xj2 R2 xj1
1jT
l2 R2 xj1
2jT
l2 R2 xj1
3jT
l2 R2 xj1
..
Mj =
.
c
xjm Rm xj1
1jT
lm Rm xj1
l2jT Rm xj
m
1
j
l3jT
m R m x1
c
xj2 T2
l1jT
2 T2
2jT
l2 T2
l3jT
2 T2
.. R[6(m1)]2
.
c
j
xm Tm
l1jT
m Tm
2jT
lm Tm
l3jT
m Tm
(10.51)
where xji R3 means the image of the j th corner in the ith view and
3
th
lkj
edge associated to the j th corner in
i R means the image of the k
th
the i view. Theorem 10.1 says that rank(M j ) = 1. One can verify that
j = [j1 , 1]T R2 is in the kernel of M j . In addition to the multiple images
xj1 , xj2 , xj3 of the j th corner pj itself, the extra rows associated to the line
features lkj
i , k = 1, 2, 3, i = 1, . . . , m also help to determine the depth scale
j1 .
227
We can already see one advantage of the rank condition: It can simultaneously handle multiple incidence conditions associated to the same feature.7
In principle, by using (10.33) or Corollary 10.5, one can further take into account that the four vertices and edges on each face are coplanar.8 Since such
incidence conditions and relations among points and lines occur frequently
in practice, especially for man-made objects, such as buildings and houses,
the use of multiple view matrix for mixed features is going to improve the
quality of the overall reconstruction by explicitly taking into account all
incidence relations among features of various types.
In order to estimate j we need to know the matrix M j , i.e. we need
to know the motion (R2 , T2 ), . . . , (Rm , Tm ). From the geometric meaning
of j = [j1 , 1]T , j can be solved already if we know only the motion
(R2 , T2 ) between the first two views, which can be initially estimated using
the standard 8 point algorithm. Knowing j s, the equations:
M j j = 0,
j = 1, . . . , 8
(10.52)
become linear in (R2 , T2 ), . . . , (Rm , Tm ). We can use them to solve for the
motions (again). Define the vectors:
~ i = [r11 , r12 , r13 , r21 , r22 , r23 , r31 , r32 , r33 ]T R9
R
and T~i = Ti R3 , i = 2, . . . , m. Solving (10.52) is then equivalent to finding
the solution to the following equations for i = 2, . . . , m:
c1 x1 T x
c1
11 x
1
i
i
1 11T
1 li x11 T l11T
i
T
1 l21T x1 l21T
1
i
1i
1 31T
1 T 31T
1 li x1 li
~
~
R
..
..
Ri = 0 R48 ,
(10.53)
Pi ~ i =
.
.
T~i
Ti
8 c8
T
c8
1 x x81
x
i
i
8 l18T x8 T l18T
1i
1
i
8 28T
T
1 li x81 l28T
i
8 T 38T
81 l38T
x
l
1
i
i
Let Ti R and Ri R
be the (unique) solution of (10.53) in matrix
form. Such a solution can be obtained numerically as the eigenvector of Pi
7 In fact, any algorithm extracting point feature essentially relies on exploiting local
incidence condition on multiple edge features. The structure of the M matrix simply
reveals a similar fact within a larger scale.
8 We here omit doing it for simplicity.
228
i = Ui Si V T be the SVD of
associated to the smallest singular value. Let R
i
3
Ti
Ri
(10.54)
(10.55)
We then have the following linear algorithm for motion and structure
estimation from three views of a cube:
Algorithm 10.1 (Motion and structure from mixed features).
Given m (= 3) images xj1 , . . ., xjm of n(= 8) points pj , j = 1, . . . , n (as
the corners of a cube), and the images lkj
i , k = 1, 2, 3 of the three edges
intersecting at pj , estimate the motions (Ri , Ti ), i = 2, . . . , m as follows:
1. Initialization: s = 0
(a) Compute (R2 , T2 ) using the 8 point algorithm for the first two
views [LH81].
(b) Compute js = [j1 /11 , 1]T where j1 is the depth of the j th point
relative to the first camera frame.
i , Ti ) as the eigenvector associated to the smallest
2. Compute (R
singular value of Pi , i = 2, . . . , m.
3. Compute (Ri , Ti ) from (10.54) and (10.55) for i = 2, . . . , m.
4. Compute new js+1 = j from (10.52). Normalize so that 11,s+1 = 1.
5. If ||s s+1 || > , for a pre-specified > 0, then s = s + 1 and goto
2. Else stop.
The camera motion is then the converged (Ri , Ti ), i = 2, . . . , m and
the structure of the points (with respect to the first camera frame) is the
converged depth scalar j1 , j = 1, . . . , n.
We have a few comments on the proposed algorithm:
1. The reason to set 11,s+1 = 1 is to fix the universal scale. It is equivalent to putting the first point at a relative distance of 1 to the first
camera center.
2. Although the algorithm is based on the cube, considers only three
views, and utilizes only one type of multiple view matrix, it can be
easily generalized to any other objects and arbitrarily many views
whenever incidence conditions among a set of point features and line
features are present. One may also use the rank conditions on different
types of multiple view matrices provided by Theorem 10.1.
3. The above algorithm is a straightforward modification of the algorithm proposed for the pure point case given in Chapter 8. All the
measurements of line features directly contribute to the estimation of
229
the camera motion and the structure of the points. Throughout the
algorithm, there is no need to initialize or estimate the 3-D structure
of lines.
The reader must be aware that the above algorithm is only conceptual (and
naive in many ways). It by no means suggests that the resulting algorithm
would work better in practice than some existing algorithms in every situation. The reason is, there are many possible ways to impose the rank
conditions and each of them, although maybe algebraically equivalent, can
have dramatically different numerical stability and sensitivity. To make the
situation even worse, under different conditions (e.g., long baseline or short
baseline), correctly imposing these rank conditions does require different
numerical recipes.9 A systematic characterization of numerical stability of
the rank conditions remains largely open at this point. It is certainly the
next logical step for future research.
is true even for the standard 8 point algorithm in the two view case.
line features can be measured more reliably than point features, lower noise
level is added to them in simulations.
10 Since
230
0.5
0.5
View 2
View 1
0.5
1
1
0.5
0.5
0
x
0.5
1
1
0.5
0.5
0.5
0.5
0.5
View 4
View 3
1
0.5
1
1
0
x
0.5
0.5
0
x
0.5
1
1
0.5
0
x
Figure 10.9. Four views of four 3-D cubes in (normalized) image coordinates. The
circle and the dotted lines are the original images, the dots and the solid lines
are the noisy observations under 5 pixels noise on point features and 0.5 degrees
noise on line features.
We run the algorithm for 1000 trials with the noise level on the point
features from 0 pixel to 5 pixels and a corresponding noise level on the
line features from 0 degree to 1 degrees. Relative to the given amount of
translation, 5 pixels noise is rather high because we do want to compare
how all the algorithms perform over a large range of noise levels. The
results of the motion estimate error and structure estimate error are given
in Figure 10.10 and 10.11 respectively. The Point feature only algorithm
essentially uses the multiple view matrix M in (10.52) without all the
10.7. Summary
231
rows associated to the line features; and the Mixed features algorithm
uses essentially the same M as in (10.52). Both algorithms are initialized
by the standard 8 point algorithm. The Mixed features algorithm gives
a significant improvement in all the estimates as a result of the use of
both point and line features in the recovery. Also notice that, at a high
noise levels, even though the 8 point algorithm gives rather off initialization
values, the two iterative algorithms manage to converge back to reasonable
estimates.
Motion 12
Motion 14
7
8point algorithm
Point feature only
Mixed features
6
5
4
3
2
1
0
1
2
3
4
Point feature noises (pixels)
6
5
4
3
2
1
0
Motion 12
80
Translation Error in degrees
Motion 14
70
60
50
40
30
20
10
0
1
2
3
4
Point feature noises (pixels)
1
2
3
4
Point feature noises (pixels)
60
40
20
1
2
3
4
Point feature noises (pixels)
Figure 10.10. Motion estimates error versus level of noises. Motion x-y means
the estimate for the motion between image frames x and y. Since the results are
very much similar, we only plotted Motion 1-2 and Motion 1-4.
10.7 Summary
10.8 Exercises
232
150
|| ||/|| || (%)
200
100
50
0.5
1.5
2
2.5
3
Point feature noises (pixels)
3.5
4.5
Figure 10.11. Structure estimates error versus level of noises. Here 0 represents
the true structure and the estimated.
Chapter 11
Multiple view geometry in high
dimensional space
234
geometric constraints that govern multiple k-dimensional images of hyperplanes in an n-dimensional space. We show that these constraints can again
be captured through the rank condition techniques. Such techniques also
simultaneously capture all types of geometric relationships among hyperplanes, such as inclusion, intersection and restriction. The importance of
this study is at least two-fold: 1. In many applications, objects involved are
indeed multi-faceted (polygonal) and their shape can be well modeled (or
approximated) as a combination of hyperplanes; 2. In some cases, there is
not enough information or it is not necessary to locate the exact location
of points in a high-dimensional space and instead, we may still be interested in identifying them up to some hyperplane. As we will point out later,
for the special case n = 3, the results naturally reduce to what is known
so far for points, lines and planes in preceding chapters. Since reconstruction is not the main focus of this chapter, the reader is referred to other
chapters for how to use these constraints to develop algorithms for various
reconstruction purposes.
235
(11.1)
(11.2)
(11.3)
236
R3
p
PSfrag replacements
K
o
o0
I
(11.4)
R T
R(n+1)(n+1) belongs to GL(n + 1, R). The ho0 1
mogeneous
coordinate
of the image of p is x = [x1 , . . . , xk , 1]T . Let
P 0
P0 = T
R(k+1)(n+1) . Then we have
~
c
where g =
x = P 0 X.
(11.5)
i = 1, . . . , m,
(11.6)
237
i = 1, . . . , m.
(11.7)
2 The above equations are associated to only one feature point. We usually need
images of multiple points in order to recover all the unknowns.
3 We later will see this can also be relaxed when we allow to depend on time.
238
A0 X = 0,
where A0 =
(11.8)
A
R(n+1)(nd) and the equation of its image S q is
bT
T
A0I x = 0,
(11.9)
AI
R(k+1)(kq) .
bTI
Note the set of X Rn+1 satisfying (11.8) is a subspace H d+1 of dimension d + 1 and the original hyperplane H d is the intersection of this
set with the hyperplane in Rn+1 : Xn+1 = 1. Hence, instead of studying
the object in Rn , we can study the corresponding object of one dimension higher in Rn+1 . Consequently, the image S q becomes a subspace
T
S q+1 = {x Rk+1 : A0I x = 0} Rk+1 . The space Rk+1 can be interpreted as the space I k+1 formed by the camera center o and the image
plane I k , which we denote as image space. So S q is the intersection of
S q+1 with the image plane I k in the image space I k+1 . Define the set
T
F = {X Rn+1 : A0I P 0 X = 0} to be the preimage of S q+1 , then
H d+1 F and F is the largest set in Rn+1 that can give rise to the
same image of H d+1 . If we consider the I k+1 as a subspace of Rn+1 , then
S q+1 is the intersection of the image space with F or H d+1 . Since it is
apparent that the space summation F + I k+1 = Rn+1 , the dimension of F
is
where A0I =
239
image are not necessarily unique and may even have different dimensions.
More specifically, such hyperplanes may be of dimensions d = q, q+1, . . ., or
q +(nk 1).4 Nonetheless, if the camera and the object hyperplane are of
general configuration, typically the dimension of the image of a hyperplane
H d is
q = min{d, k}.
(11.11)
R3
L
F
PSfrag replacements
bl
l
Figure 11.2. The image s = bl and co-image s = l of a line L in R3 . The three
column vectors of the image bl span the two-dimensional subspace (in the image
space I 2+1 ) formed by o and L (in this case it coincides with the preimage F ); and
the column vector of the co-image l is the vector orthogonal to that subspace.
240
241
(Theorem 8.1). Here we first study how to generalize this result to points
in Rn .
Take multiple, say m, images of the same point p with respect to multiple
camera frames. We obtain a family of equations
xi i = P 0 gi X = i X,
i = 1, . . . , m.
(11.13)
(k+1)(n+1)
i Ti R
To simplify notation, let i = P gi = R
, where
i R(k+1)(k+1) and Ti R(k+1)(nk) .5 In the above equations, except
R
for the xi s, everything else is unknown and subject to recovery. However,
solving for the i s, i s and X simultaneously from such equations is
extremely difficult at least nonlinear. A traditional way to simplify the
task is to decouple the recovery of the matrices i s from that of i s and X.
Then the remaining relationship would be between the images xi s and the
camera configuration i s only. Since such a relationship does not involve
knowledge on the location of the hyperplane in Rn , it is referred to as
intrinsic. It constitutes as the basis for any image-only based techniques.
For that purpose, let us rewrite the system of equations (11.13) in a single
matrix form
0
x1
0
..
.
0
0
x2
..
.
..
.
0
0
..
.
xm
I ~ = X
1
1
2 2
.. = .. X.
. .
(11.14)
(11.15)
1 x1 0
0
..
2 0 x 2 . . .
.
.
R[m(k+1)](m+n+1) .
Np = [, I] = .
.
.
.
..
..
..
..
0
m 0
0 xm
(11.16)
Apparently, the rank of Np is no less than k + m (assuming the generic
configuration such that xi is not zero for all 1 i m). It is clear that
.
there exists v = [XT , ~T ]T Rm+n+1 in the null space of Np . Then the
equation (11.14) is equivalent to
m + k rank(Np ) m + n
5 Only
i = Ri and Ti = Ti .
in the special case that k = n 1, we have R
(11.17)
242
x1
0
0
T
(x
0
1 ) 0
..
.. R[m(k+1)][m(k+1)]
..
..
Dp = .
.
.
.
0
0
xTm
T
0
0 (x
m)
1
xT1 R
T
(x
)
1 R1
..
.
D p Np =
..
xT R
m m
T
(xm ) Rm
xT1 T1
T
(x
1 ) T1
..
.
..
.
xTm Tm
T
(x
m ) Tm
xT1 x1
0
0
0
0
..
.
0
0
0
0
0
0
0
0
0
..
.
0
0
0
0
0
.
xTm xm
0
T
1 (x )T T1
(x1 ) R
1
T
T
(x
2 ) R2 (x2 ) T2
(11.18)
Cp =
R(mk)(n+1)
..
..
.
.
T
T
(x
m ) Rm (xm ) Tm
by the following equation
k rank(Cp ) = rank(Np ) m n.
(11.19)
This rank condition can be further reduced if we choose the world coordinate frame to be the first camera frame, i.e. [R1 , T1 ] = [I, 0]. Then
1 = I(k+1)(k+1) and T1 = 0(k+1)(nk) . Note that, algebraically, we do
R
not lose any generality in doing so. Now the matrix Cp becomes
T
(x
0
1 )
T
T
(x
2 ) R2 (x2 ) T2
Cp =
(11.20)
R(mk)(n+1) .
..
..
.
.
T
T
(x
m ) Rm (xm ) Tm
Cp Dp1
0
T
(x
)
2 R 2 x1
=
..
.
T
(x
)
R m x1
m
T
(x
1 ) x1
T
(x2 ) R2 x
1
..
.
m x
(x )T R
m
0
T
(x
2 ) T2
..
.
T
(x ) Tm
243
R(mk)(n+1) .
(11.21)
T
Note that Dp1 is of full rank n + 1 and (x
)
x
is
of
rank
k.
Let
us
define
1
1
the so-called multiple view matrix Mp associated to a point p to be
T
2 x1 (x )T T2
(x2 ) R
2
.
..
..
[(m1)k](nk+1)
.
(11.22)
Mp =
R
.
.
m x1 (x )T Tm
(x )T R
m
(11.23)
244
(11.24)
.
s = [v1 , . . . , vkq ] R(k+1)(kq) .
(11.25)
or its co-image
d
Now given m images s1 , . . . , sm (hence co-images s
1 , . . . , sm ) of H relative
to m camera frames, let qi + 1 be the dimension of si , i = 1, . . . , m, note for
generic motions, qi = d, i = 1, . . . , m. We then obtain the following matrix
T
1 (s )T T1
(s1 ) R
1
T
T
(s
2 ) R2 (s2 ) T2
(11.26)
Ch =
R[m(kd)](n+1) .
..
..
.
.
m (s )T Tm
(s )T R
m
Let S d+1 be the subspace spanned by H d and the origin of the world
coordinate frame. Then it is immediately seen that S d+1 in its homogeneous
representation is the kernel of Ch . In other words, the rank of Ch is
(k d) rank(Ch ) (n d).
(11.27)
T
(s
0
1)
T
T
(s
2 ) R2 (s2 ) T2
Ch =
(11.28)
R[m(kd)](n+1) .
..
..
.
.
T
T
(s
m ) Rm (sm ) Tm
T
0
(s
1 ) s1
2 s
(s2 ) R2 s1 (s2 ) R
1
Ch Dh1 =
..
..
.
.
T
T
(s
m ) Rm s1 (sm ) Rm s1
0
T
(s
2 ) T2
..
.
T
(sm ) Tm
245
R[m(kd)](n+1) .
(11.29)
T
Note that Dp1 R(n+1)(n+1) is of full rank n + 1 and (s
1 ) s1
R(kd)(kd) is of rank k d. Let us define the so-called multiple view
matrix Mh associated to the hyperplane H d to be
T
T
2 s1 (s
(s2 ) R
2 ) T2
.
..
..
Mh =
R[(m1)(kd)](nk+d+1) . (11.30)
.
.
T
T
(s
m ) Rm s1 (sm ) Tm
(11.31)
(11.32)
It is easy to see that, in the case that the hyperplane is a point, we have
d = 0. Hence 0 rank(Mh ) (nk). So the result (11.23) given in Section
11.2 is trivially implied by this theorem. The significance of this theorem
is that the rank of the matrix Mh defined above does not depend on the
dimension of the hyperplane H d . Its rank only depends on the difference
between the dimension of the ambient space Rn and that of the image
plane Rk . Therefore, in practice, if many features of a scene are exploited
for reconstruction purposes, it is possible to design algorithms that do not
discriminate different features.
Example 11.2 (Trilinear constraints on three views of a line).
When n = 3, k = 2, m = 3 and d = 1, it gives the known result on three
views of a line:
"
#
lT2 R2 lb1 lT2 T2
rank T b T
1
(11.33)
l3 R 3 l1 l3 T3
where li R3 , i = 1, 2, 3 are the three coimages of the line, and (R2 , T2 )
and (R3 , T2 ) are the relative motions among the three views.
246
(11.34)
T
(s
0
1)
T
T
(s
2 ) R2 (s2 ) T2
Ch1 =
R[m(kd1 )](n+1) .
..
..
.
.
T
T
(sm ) Rm (sm ) Tm
(11.35)
Multiplying
k d1 rank(Ch1 ) n d1 .
Ch1
1
Ch1 Dh1
Let
(11.36)
0
T
(s
)
2 R 2 x1
=
..
.
T
R m x1
(s
)
m
T
(s
2 ) R 2 x1
.
..
Mh1 =
.
T
(s
m ) R m x1
T
(s
1 ) x1
T
R 2 x
(s
)
1
2
..
.
T
R m x
(s
)
1
m
T
(s
2 ) T2
..
.
T
(sm ) Tm
0
T
(s
2 ) T2
..
.
T
(sm ) Tm
R[m(kd1 )](n+1) .
R[(m1)(kd1 )](nk+d2 ) .
(11.37)
(11.38)
T
Since (s
1 ) x1 is of rank at least k d1 , we must have
0 rank(Mh1 ) rank(Ch1 ) (k d1 ) n k.
(11.39)
T
(x
0
1)
T
T
(x
2 ) R2 (x2 ) T2
Ch2 =
R[m(kd2 )](n+1) .
..
..
.
.
T
T
(x ) Rm (x ) Tm
(11.40)
247
(11.41)
k d1 rank(Ch2 ) rank(Ch2 ) n d2 .
(11.43)
T
(s
0
1)
T
T
(x
2 ) R2 (x2 ) T2
2
(11.42)
Ch =
R[m(kd2 )(d1 d2 )](n+1)
..
..
.
.
T
T
(x
m ) Rm (xm ) Tm
satisfies
T
(x
2 ) R2 s1
2
Ch2 Dh1
=
..
.
T
(x
m ) Rm s1
Let
T
(x
2 ) R2 s1
.
..
Mh2 =
.
m s1
(x )T R
m
Since
T
(s
1 ) s1
T
(s
1 ) s1
T
(x2 ) R2 s1
..
.
T
(x
m ) Rm s1
T
(x
2 ) T2
..
.
(x )T Tm
0
T
(x
2 ) T2
.
..
.
T
(xm ) Tm
(11.44)
(11.46)
T
2 D1 (D )T T2
(D2 ) R
2
T
T
. (D3 ) R3 D1 (D3 ) T3
(11.47)
M =
..
..
.
.
T
T
) Tm
(Dm
) Rm D1 (Dm
where the Di s and Di s stand for images and co-images of some hyperplanes respectively. The actual values of Di s and Di s are to be determined
in context.
Therefore, we have proved the following statement which further
generalizes Theorem 11.1:
Theorem 11.2 (Rank condition with inclusion). Consider a $d_2$-dimensional hyperplane $H^{d_2}$ belonging to a $d_1$-dimensional hyperplane $H^{d_1}$ in $\mathbb{R}^n$. The $m$ images $x_i \in \mathbb{R}^{(k+1)\times(d_2+1)}$ of $H^{d_2}$ and $m$ images $s_i \in \mathbb{R}^{(k+1)\times(d_1+1)}$ of $H^{d_1}$ relative to the $i$th ($i = 1,\dots,m$) camera frame are given, and the relative transformation from the 1st camera frame to the $i$th is $(R_i, T_i)$ with $R_i \in \mathbb{R}^{n\times n}$ and $T_i \in \mathbb{R}^n$, $i = 2,\dots,m$. Let the $D_i$'s and $D_i^\perp$'s in the multiple view matrix $M$ take the following values:
$$D_i^\perp \doteq x_i^\perp \in \mathbb{R}^{(k+1)\times(k-d_2)} \ \ \text{or} \ \ s_i^\perp \in \mathbb{R}^{(k+1)\times(k-d_1)}, \qquad D_i \doteq x_i \in \mathbb{R}^{(k+1)\times(d_2+1)} \ \ \text{or} \ \ s_i \in \mathbb{R}^{(k+1)\times(d_1+1)}. \qquad (11.48)$$
Then for all possible instances of the matrix $M$, we have two cases:
1. Case one: if $D_1 = s_1$ and $D_i^\perp = x_i^\perp$ for some $i \ge 2$, then
$$(d_1 - d_2) \le \mathrm{rank}(M) \le (n-k) + (d_1 - d_2);$$
2. Case two: otherwise,
$$0 \le \mathrm{rank}(M) \le n - k.$$
Since $\mathrm{rank}(AB) \ge \mathrm{rank}(A) + \mathrm{rank}(B) - n$ for all $A \in \mathbb{R}^{m\times n}$, $B \in \mathbb{R}^{n\times k}$, we have $\mathrm{rank}\big[(x_i^\perp)^T R_i s_1\big] \ge (d_1 - d_2)$, which gives the lower bound in Case one.
For instance, for three views of a point lying on a line in $\mathbb{R}^3$ ($n = 3$, $k = 2$), where $x_i \in \mathbb{R}^3$ and $l_i \in \mathbb{R}^3$ are the images and coimages of the point and the line respectively, the rank condition is equivalent to what is known as the line-point-point relation.
The above theorem can also be easily generalized to any set of cascading hyperplanes
$$H^{d_l} \subseteq H^{d_{l-1}} \subseteq \cdots \subseteq H^{d_1}$$
for some $l \in \mathbb{Z}^+$. We omit the details for simplicity.
For a hyperplane $H^{d_3}$ contained in the intersection of two hyperplanes $H^{d_1}$ and $H^{d_2}$, with images $x_i$, $r_i$, and $s_i$ respectively, let the entries of the multiple view matrix $M$ take the values
$$D_i^\perp \doteq x_i^\perp,\ r_i^\perp,\ \text{or}\ s_i^\perp, \qquad D_1 \doteq x_1. \qquad (11.50)$$
Then we have
$$0 \le \mathrm{rank}(M) \le (n - k).$$
It is straightforward to generalize this theorem to a hyperplane which lies in the intersection of more than two hyperplanes,
$$H^{d_l} \subseteq H^{d_{l-1}} \cap \cdots \cap H^{d_1}$$
for some $l \in \mathbb{Z}^+$: simply allow the matrix $D_i^\perp$ to take the co-image of any hyperplane involved, and the rank condition on $M$ always holds.
Suppose now that the hyperplanes of interest are restricted to lie on a fixed ambient hyperplane $H^d$ in $\mathbb{R}^n$, described (relative to the first camera frame) by a matrix
$$\Pi = [\,\Pi_1 \ \ \Pi_2\,] \in \mathbb{R}^{(n-d)\times(n+1)}, \quad \text{with} \quad \Pi_1 \in \mathbb{R}^{(n-d)\times(k+1)},\ \ \Pi_2 \in \mathbb{R}^{(n-d)\times(n-k)}. \qquad (11.52)$$
The corresponding extended multiple view matrix appends one more block row to $M$:
$$M = \begin{bmatrix} (D_2^\perp)^T R_2 D_1 & (D_2^\perp)^T T_2 \\ (D_3^\perp)^T R_3 D_1 & (D_3^\perp)^T T_3 \\ \vdots & \vdots \\ (D_m^\perp)^T R_m D_1 & (D_m^\perp)^T T_m \\ \Pi_1 D_1 & \Pi_2 \end{bmatrix}, \qquad (11.53)$$
where the $D_i$'s and $D_i^\perp$'s stand for images and co-images of some hyperplanes, respectively; their actual values are to be determined in context.
Then the rank conditions given in Theorems 11.2 and 11.3 become the following two corollaries, respectively:

Corollary 11.1 (Rank condition with restricted inclusion). Consider hyperplanes $H^{d_2} \subseteq H^{d_1} \subseteq H^d$ in $\mathbb{R}^n$, where $d_1, d_2 < k$ and $d < n$. The hyperplane $H^d$ is described by a matrix $\Pi \in \mathbb{R}^{(n-d)\times(n+1)}$ (relative to the first camera frame). Then the rank conditions in Theorem 11.2 do not change at all when the multiple view matrix there is replaced by the extended multiple view matrix.

Corollary 11.2 (Rank condition with restricted intersection). Consider hyperplanes $H^{d_1}$, $H^{d_2}$, and $H^{d_3}$ with $H^{d_3} \subseteq H^{d_1} \cap H^{d_2}$, which all belong to an ambient hyperplane $H^d$, where $d_1, d_2, d_3 < k$ and $d < n$. The hyperplane $H^d$ is described by a matrix $\Pi \in \mathbb{R}^{(n-d)\times(n+1)}$ (relative to the first camera frame). Then the rank conditions in Theorem 11.3 do not change at all when the multiple view matrix there is replaced by the extended multiple view matrix.

One can easily prove these corollaries by modifying the proofs of the theorems accordingly.
Example 11.4 (Planar homography). Consider the standard perspective projection from $\mathbb{R}^3$ to $\mathbb{R}^2$. Suppose, as illustrated in Figure 11.3, all the points of interest lie on a two-dimensional plane $\Pi \subset \mathbb{R}^3$.

Figure 11.3. Two views of points on a plane $\Pi \subset \mathbb{R}^3$. For a point $p \in \Pi$, its two (homogeneous) images are the two vectors $x_1$ and $x_2$ with respect to the two vantage points $o_1$ and $o_2$, related by the motion $(R, T)$.
then $m \ge \left\lceil \frac{n-k}{k} \right\rceil + 1$. When $n = 3$, $k = 2$, this number reduces to $2$, which is the well-known case of 3-D stereopsis.
For a perspective projection from $\mathbb{R}^n$ to $\mathbb{R}^{n-1}$, the rank condition leaves only two possibilities:
$$\mathrm{rank}(M_h) = 1 \quad \text{or} \quad \mathrm{rank}(M_h) = 0. \qquad (11.56)$$
Suppose its rank is 1 for an $M_h$ associated to some $d$-dimensional hyperplane $H^d$ in $\mathbb{R}^n$. Let $s_i$, $i = 1,\dots,m$, be the $m$ images of $H^d$ relative to $m$ camera frames specified by the relative transformations $(R_i, T_i)$, $i = 2,\dots,m$ (from the 1st frame to the $i$th). Consider the matrix $C_h$ defined in Section 11.3.1,
$$C_h = \begin{bmatrix} (s_1^\perp)^T & 0 \\ (s_2^\perp)^T R_2 & (s_2^\perp)^T T_2 \\ \vdots & \vdots \\ (s_m^\perp)^T R_m & (s_m^\perp)^T T_m \end{bmatrix} \in \mathbb{R}^{[m(n-d-1)]\times(n+1)}. \qquad (11.57)$$
Then, from the derivation in Section 11.3.1, we have the following rank inequality:
$$(n-d) \ge \mathrm{rank}(C_h) = \mathrm{rank}(M_h) + (n-d-1) = (n-d). \qquad (11.58)$$
Now consider two hyperplanes $H^{d_2} \subseteq H^{d_1}$ with images $x_i$ and $s_i$ respectively, and the two multiple view matrices
$$M_{h1} = \begin{bmatrix} (s_2^\perp)^T R_2 x_1 & (s_2^\perp)^T T_2 \\ \vdots & \vdots \\ (s_m^\perp)^T R_m x_1 & (s_m^\perp)^T T_m \end{bmatrix} \in \mathbb{R}^{[(m-1)(n-d_1-1)]\times(d_2+2)}, \qquad M_{h2} = \begin{bmatrix} (x_2^\perp)^T R_2 s_1 & (x_2^\perp)^T T_2 \\ \vdots & \vdots \\ (x_m^\perp)^T R_m s_1 & (x_m^\perp)^T T_m \end{bmatrix}. \qquad (11.60)$$
Suppose $\mathrm{rank}(M_{h1}) = 1$. Since each row of $(s_i^\perp)^T$ is a linear combination of the rows of $(x_i^\perp)^T$, we have
$$\mathrm{rank}\begin{bmatrix} (s_2^\perp)^T R_2 x_1 & (s_2^\perp)^T T_2 \\ \vdots & \vdots \\ (s_m^\perp)^T R_m x_1 & (s_m^\perp)^T T_m \end{bmatrix} \le \mathrm{rank}\begin{bmatrix} (x_2^\perp)^T R_2 x_1 & (x_2^\perp)^T T_2 \\ \vdots & \vdots \\ (x_m^\perp)^T R_m x_1 & (x_m^\perp)^T T_m \end{bmatrix}.$$
So the rank of the second matrix must be 1; hence the location of the hyperplane $H^{d_2}$ is uniquely determined according to the discussion in Section 11.4.2. However, if $\mathrm{rank}(M_{h1}) = 0$, we only know
$$(n - d_1 - 1) \le \mathrm{rank}(C_{h1}) \le (n - d_1). \qquad (11.61)$$
In this case, nothing definite can be said about the uniqueness of $H^{d_1}$ from the rank test on this particular multiple view matrix. On the other hand, the entire matrix $M_{h1}$ being zero indeed reveals some degeneracy in the configuration: all the camera centers and $H^{d_1}$ span a $(d_1+1)$-dimensional hyperplane in $\mathbb{R}^n$, since $(s_i^\perp)^T T_i = 0$ for all $i = 2,\dots,m$. The only case in which the exact location of $H^{d_1}$ is determinable (from this particular rank test) is when all the camera centers actually lie in $H^{d_1}$. Furthermore, we are not able to say anything about the location of $H^{d_2}$ in this case either, except that it is indeed contained in $H^{d_1}$.
Now for the matrix $M_{h2}$: if $\mathrm{rank}(M_{h2}) = (d_1 - d_2 + 1)$, then the matrix $C_{h2}$ defined in Section 11.3.2 has $\mathrm{rank}(C_{h2}) = n - d_2$. Hence the kernel of the matrix $C_{h2}$ uniquely determines the hyperplane $H^{d_2}$, i.e. its location relative to the first camera frame. Or, simply use the following rank inequality (obtained by comparing the numbers of columns):
$$\mathrm{rank}\begin{bmatrix} (x_2^\perp)^T R_2 s_1 & (x_2^\perp)^T T_2 \\ \vdots & \vdots \\ (x_m^\perp)^T R_m s_1 & (x_m^\perp)^T T_m \end{bmatrix} \le \mathrm{rank}\begin{bmatrix} (x_2^\perp)^T R_2 x_1 & (x_2^\perp)^T T_2 \\ \vdots & \vdots \\ (x_m^\perp)^T R_m x_1 & (x_m^\perp)^T T_m \end{bmatrix} + (d_1 - d_2).$$
So the second matrix will have rank 1; hence the location of $H^{d_2}$ is unique. However, in general we are not able to say anything about whether $H^{d_1}$ is uniquely determinable from such a rank test, except that it contains $H^{d_2}$.
Now, if $\mathrm{rank}(M_{h2}) = (d_1 - d_2)$, we only know
$$(n - d_2 - 1) \le \mathrm{rank}(C_{h2}) \le (n - d_2). \qquad (11.62)$$
Then this particular rank test does not imply the uniqueness of the hyperplane $H^{d_2}$. From the matrix $M_{h2}$ being of rank $(d_1 - d_2)$, we can still derive that all the camera centers and $H^{d_1}$ span a $(d_1+1)$-dimensional hyperplane $H^{d_1+1}$, since $(x_i^\perp)^T T_i$ is linearly dependent on $(x_i^\perp)^T R_i s_1$ for $i = 2,\dots,m$. So $H^{d_1}$ is in general not determinable unless all the camera centers lie in $H^{d_1}$. In the special case that all the camera centers actually lie in $H^{d_2}$, the location of $H^{d_2}$ is then determinable. If some camera centers are off the hyperplane $H^{d_2}$ but still in the hyperplane $H^{d_1+1}$, we simply cannot tell the uniqueness of $H^{d_2}$ from this particular rank test. Examples of either scenario exist. Generally speaking, the bigger the difference between the dimensions $d_1$ and $d_2$, the less information we get about $H^{d_2}$ from the rank test on $M_{h2}$ alone. For this reason, in practice, other types of multiple view matrices are preferable.
We summarize our discussion in this section in Table 11.1.

  Matrix        Rank value        H^{d_1}        H^{d_2}        Camera centers
  M_{h1}        1                 unique         unique         R^n
  M_{h1}        0                 in H^{d_1+1}   in H^{d_1}     in H^{d_1+1}
  M_{h2}        d_1 - d_2 + 1     contains H^{d_2}   unique     R^n
  M_{h2}        d_1 - d_2         in H^{d_1+1}   in H^{d_1}     in H^{d_1+1}

Table 11.1. Locations of hyperplanes or camera centers implied by each single rank test. Whenever it is well-defined, $H^{d_1+1}$ is the $(d_1+1)$-dimensional hyperplane spanned by $H^{d_1}$ and all the camera centers.
Intersection
Finally, we study the multiple view matrix in Theorem 11.3. Here we know that three hyperplanes are related by $H^{d_3} \subseteq H^{d_1} \cap H^{d_2}$. Let us first study the case in which such a matrix has rank 1. Since each row of $(D_i^\perp)^T$ is a linear combination of the rows of $(x_i^\perp)^T$, we have
$$\mathrm{rank}\begin{bmatrix} (D_2^\perp)^T R_2 x_1 & (D_2^\perp)^T T_2 \\ \vdots & \vdots \\ (D_m^\perp)^T R_m x_1 & (D_m^\perp)^T T_m \end{bmatrix} \le \mathrm{rank}\begin{bmatrix} (x_2^\perp)^T R_2 x_1 & (x_2^\perp)^T T_2 \\ \vdots & \vdots \\ (x_m^\perp)^T R_m x_1 & (x_m^\perp)^T T_m \end{bmatrix},$$
where $D_i^\perp = x_i^\perp$, $r_i^\perp$, or $s_i^\perp$. Therefore, the rank of the second matrix in the above rank inequality is 1. That is, the hyperplane $H^{d_3}$ can always be determined. As before, nothing really can be said about the other two hyperplanes except that they both contain $H^{d_3}$. Now if $\mathrm{rank}(M) = 0$, then nothing really can be said about the locations of any hyperplane, except that all the camera centers and the hyperplanes $H^{d_1}$, $H^{d_2}$ span a $(d_1 + d_2 + 1)$-dimensional hyperplane (a degenerate configuration). The analysis goes very much the same for any number of hyperplanes which all intersect at $H^{d_3}$.
Figure 11.4. A degenerate case: the image of a line $L$ is a point under the perspective projection from $\mathbb{R}^3$ to $\mathbb{R}^1$ ($= I^1$). Let $K$ be the subspace orthogonal to the plane spanned by $o$ and $I^1$. This happens when $L$ is parallel to $K$.

Here $q_1$ denotes the dimension of the image of $H^d$ with respect to the 1st camera frame. Notice that, since $(n-k) - (d - q_1) < (n-k)$, the rank condition says that we may only determine the hyperplane $H^d$ up to an $H^{2d-q_1}$ subspace (with respect to the first view). This is expected, since a degenerate view is chosen as the reference. If we choose a different (non-degenerate) frame as the reference, the rank may vary and give a better description of the overall configuration.
In view of this, we see that the rank of the corresponding matrix $C_h$ (defined during the derivation of $M$ in previous sections) is in general more informative about the overall configuration.
Consider now a point moving in $\mathbb{R}^3$, whose coordinates, as a function of time, admit a Taylor expansion
$$X(t) = X(0) + \sum_{j=1}^{\infty} \frac{X^{(j)}(0)}{j!}\, t^j. \qquad (11.64)$$
Truncating the expansion at order $N$ gives
$$X(t) \approx X(0) + \sum_{j=1}^{N} \frac{X^{(j)}(0)}{j!}\, t^j. \qquad (11.65)$$

7. The rank of $M$ could depend on the difference of the dimensions of the hyperplanes in the case of hyperplane inclusion. But that only occurs in Case one (see Theorem 11.2), which is usually of least practical importance.

The $i$th image of the moving point, taken at time $t_i$, then satisfies
$$\lambda_i x_i = P_0\, g_i\, X(t_i) = P_0\, g_i\Big(X(0) + \sum_{j=1}^{N} \frac{X^{(j)}(0)}{j!}\, t_i^j\Big), \qquad (11.66)$$
where $P_0 \in \mathbb{R}^{3\times4}$. If we define a new projection matrix to be
$$\Pi_i = [\,R_i t_i^N,\ \dots,\ R_i t_i,\ R_i,\ T_i\,] \in \mathbb{R}^{3\times(3N+4)}, \qquad (11.67)$$
and stack the initial position and its (scaled) derivatives into the homogeneous coordinate
$$\bar{X} \doteq \Big[\tfrac{1}{N!}X^{(N)}(0)^T,\ \dots,\ \dot{X}(0)^T,\ X(0)^T,\ 1\Big]^T \in \mathbb{R}^{3N+4}, \qquad (11.68)$$
then
$$\lambda_i x_i = \Pi_i \bar{X}, \qquad \bar{X} \in \mathbb{R}^{3N+4},\ x_i \in \mathbb{R}^3. \qquad (11.69)$$
This is indeed a multiple-view projection from $\mathbb{R}^{3(N+1)}$ to $\mathbb{R}^2$ (in homogeneous coordinates). The rank conditions then give basic constraints that can be further used to segment these features or reconstruct their coordinates. The choice of the time-base $(t^N, \dots, t, 1)$ in the construction of $\Pi$ above is not unique. If we change the time-base to $(\sin(t), \cos(t), 1)$, we can embed points with elliptic trajectories into a higher dimensional linear space.
As a simple example, let us study in more detail a scene consisting of point features, each moving at a constant velocity. According to the above discussion, this can be viewed as a projection from a six-dimensional space to a two-dimensional one: the six-dimensional coordinate consists of both a point's 3-D location and its velocity. We are interested in the set of algebraic (and geometric) constraints that govern its multiple images. Since we now have a projection from $\mathbb{R}^6$ to $\mathbb{R}^2$, for given $m$ images the multiple view matrix $M_p$ (as defined in (11.22)) associated to a point $p$ has dimension $2(m-1) \times 5$ and satisfies the rank condition $0 \le \mathrm{rank}(M_p) \le 6 - 2 = 4$. Then any $5\times5$ minor (involving images from 4 views) of this matrix has determinant zero. These minors exactly correspond to the only type of constraints discovered in [?] (using tensor terminology). However, the rank conditions provide many other types of constraints that are independent of the above. In particular, since each point moves on a straight line in 3-D under the constant-velocity assumption, it is natural to consider the incidence condition that a point lies on a line. According to Theorem 11.2, for $m$ images of a point on a line, if we choose $D_1$ to be the image of the point and $D_i^\perp$ to be the co-image of the line, the multiple view matrix $M$ has dimension $(m-1) \times 5$ and satisfies $\mathrm{rank}(M) \le 4$. It is direct to check that the constraints obtained from its minors being zero involve up to 6 (not 4) views. If we consider the rank conditions for other types of multiple view matrices $M$ defined in Theorem 11.2, different constraints among points and lines can be obtained. Therefore, to complete and correct the result in [?] (where only the types of projection $\mathbb{R}^n \to \mathbb{R}^2$, $n = 4, \dots, 6$ were studied), we have in general:

Corollary 11.4 (Multilinear constraints for $\mathbb{R}^n \to \mathbb{R}^k$). For projection from $\mathbb{R}^n$ to $\mathbb{R}^k$, non-trivial algebraic constraints involve up to $(n-k+2)$ views. The tensor associated to the $(n-k+2)$-view relationship in fact induces all the other types of tensors associated to smaller numbers of views.

The proof directly follows from the rank conditions. In the classic case $n = 3$, $k = 2$, this corollary reduces to the well-known fact in computer vision that irreducible constraints exist up to triple-wise views, and furthermore, the associated (trifocal) tensor induces all (bifocal) tensors (i.e. the essential matrix) associated to pairwise views. In the case $n = 6$, $k = 2$, the corollary is consistent with our discussion above. Of course, besides the embedding (11.69) of a dynamic scene into a higher dimensional space $\mathbb{R}^{3(N+1)}$, knowledge of the motion of features sometimes allows us to embed the problem into a lower dimensional space. These special motions have been studied in [?], and our results here also complete their analysis.
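The following short sketch (Python/NumPy, not from the book; all names and synthetic data are illustrative) shows the constant-velocity embedding at work. It builds the new projection matrices $\Pi_i = [R_i t_i, R_i, T_i]$ of (11.67), stacks the constraints $\widehat{x}_i \Pi_i \bar{X} = 0$, and recovers the point's position and velocity from the one-dimensional kernel, as the rank condition predicts.

```python
import numpy as np

def hat(w):
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

rng = np.random.default_rng(1)

# A point moving with constant velocity: X(t) = X0 + V * t  (embedding into R^6).
X0 = np.array([0.2, -0.1, 4.0])
V = np.array([0.05, 0.02, -0.03])
Xbar = np.hstack([V, X0, 1.0])           # homogeneous coordinates in R^7

m = 6
rows = []
for i in range(m):
    t = float(i)
    if i == 0:
        R_i, T_i = np.eye(3), np.zeros(3)          # reference view
    else:
        Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
        if np.linalg.det(Q) < 0:
            Q[:, 0] = -Q[:, 0]
        R_i, T_i = Q, rng.standard_normal(3)
    Pi = np.hstack([R_i * t, R_i, T_i.reshape(3, 1)])   # Pi_i in R^{3x7}, cf. (11.67)
    x_i = Pi @ Xbar                       # image of the point in view i (up to scale)
    rows.append(hat(x_i) @ Pi)            # hat(x_i) Pi_i annihilates Xbar

C = np.vstack(rows)
_, s, Vt = np.linalg.svd(C)
print("rank of C:", int(np.sum(s > 1e-8 * s[0])))   # expected: 6 (one-dimensional kernel)
Xbar_hat = Vt[-1] / Vt[-1, -1]
print("recovered [V, X0]:", Xbar_hat[:6])           # should match np.hstack([V, X0])
```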
A similar embedding applies to articulated objects. Consider a two-link rigid body, as illustrated in Figure 11.5, whose links $l_1$ and $l_2$ rotate with angular velocities $\omega_1$ and $\omega_2$ about their respective joints, the base being at $X_o$.

Figure 11.5. A two-link rigid body: link $l_1$ rotates with angular velocity $\omega_1$ about the base $X_o$, and link $l_2$ rotates with angular velocity $\omega_2$ about the far end of $l_1$.

A point at distance $r_2$ along the second link has world coordinates
$$X_2(r_2, t) = \big[X_1(l_1, t) + r_2(I + \widehat{e}_3^{\,2})e_1\big] + \big[r_2\,\widehat{e}_3 e_1\big]\sin(\omega_2 t) - \big[r_2\,\widehat{e}_3^{\,2} e_1\big]\cos(\omega_2 t).$$
If we choose the time-base $(\sin(\omega_1 t), \cos(\omega_1 t), \sin(\omega_2 t), \cos(\omega_2 t), 1)$ for the new projection matrix $\Pi$, we may embed the links into $\mathbb{R}^{15}$. Hence the projection of a point on the links is a projection from $\mathbb{R}^{15}$ to $\mathbb{R}^2$, and can be expressed in homogeneous coordinates as
$$\lambda(t)\,x(t) = \Pi(t)\bar{X}, \qquad \bar{X} \in \mathbb{R}^{16},\ x \in \mathbb{R}^3. \qquad (11.70)$$
It is clear that points from each link lie on a one-dimensional hyperplane (a line) in the ambient space $\mathbb{R}^{15}$. These two links together span a two-dimensional hyperplane, unless $\omega_1 = \omega_2$ (i.e. the two links essentially move together as one rigid body). The rank conditions that we have for such a projection are able to exploit the fact that the two links intersect at the second joint, and that points on the links belong to the two-dimensional hyperplane $Z = 0$ (w.r.t. the world coordinate frame).
In general, using similar techniques, we may embed a linked rigid body with multiple joints or links (possibly of other types) into a high dimensional space. If a linked rigid body consists of $N$ links, in general we can embed its configuration into $\mathbb{R}^{O(6N)}$ under the assumption of constant translational and rotational velocities. This makes perfect sense, since each link potentially introduces 6 degrees of freedom: three for rotation and three for translation. In principle, the rank conditions associated to the projection from $\mathbb{R}^{O(6N)}$ to $\mathbb{R}^2$ then apply in exactly the same way.
11.6 Summary
In this chapter, we have studied algebraic and geometric relationships among multiple low-dimensional projections (also called images) of hyperplanes in a high-dimensional linear space. To a large extent, the work is a systematic generalization of classic multiple view geometry in computer vision to higher dimensional spaces. The abstract formulation of the projection model in this chapter allows us to treat uniformly a large class of projections that are normally encountered in many practical applications, such as perspective projection and orthographic projection. We have demonstrated in detail how incidence conditions among multiple images of multiple hyperplanes can be uniformly and concisely described in terms of certain rank conditions on the so-called multiple view matrix. These incidence conditions include: inclusion, intersection, and restriction (to a fixed hyperplane).
The significance of such rank conditions is multi-fold. 1. The multiple view matrix is the result of a systematic elimination of the unknowns that depend on the $n$-dimensional location of the hyperplanes involved. The rank conditions essentially describe incidence conditions in $\mathbb{R}^n$ in terms of only low-dimensional images (of the hyperplanes) and the relative configuration of the vantage points. These are the only constraints that are available for any reconstruction whose input data are the images only. 2. As an alternative to the multi-focal tensor techniques that are widely used in the computer vision literature for studying multiple view geometry, the rank conditions provide a more complete, uniform and precise description of all the incidence conditions in multiple view geometry. They are therefore easier to use in reconstruction algorithms for multiple view analysis and synthesis. 3. A crucial observation is that the rank conditions are invariant under any transformations that preserve incidence relationships. Hence they do not discriminate among Euclidean, affine, projective, or any other transformation groups, as long as these preserve the incidence conditions of the hyperplanes involved. Since these transformations were traditionally studied separately in computer vision, the rank conditions provide a way of treating them in a unified fashion.
Part IV
Reconstruction algorithms
Chapter 12
Batch reconstruction from multiple
views
12.1 Algorithms
12.1.1 Optimal methods
12.1.2 Factorization methods
12.2 Implementation
Chapter 13
Recursive estimation from image sequences

Consider $N$ feature points $X^i \in \mathbb{R}^3$, $i = 1,\dots,N$, collected into a vector $X \in \mathbb{R}^{3N}$, and let them move under the action of a rigid motion between two adjacent time instants: $g(t+1) = \exp(\widehat{\xi}(t))\,g(t)$, $\widehat{\xi}(t) \in se(3)$. We assume that we can measure the (noisy) projection
$$y^i(t) = \pi\big(g(t)X^i\big) + n^i(t) \in \mathbb{R}^2, \qquad i = 1,\dots,N. \qquad (13.2)$$
The complete model is then
$$\begin{cases} X(t+1) = X(t), & X(0) = X_0 \in \mathbb{R}^{3N} \\ g(t+1) = \exp(\widehat{\xi}(t))\,g(t), & g(0) = g_0 \in SE(3) \\ \xi(t+1) = \xi(t) + \alpha(t), & \xi(0) = \xi_0 \in se(3) \\ y^i(t) = \pi\big(g(t)X^i(t)\big) + n^i(t), & n^i(t) \sim \mathcal{N}(0, \Sigma_n), \end{cases} \qquad (13.3)$$
where $\mathcal{N}(M, S)$ indicates a normal distribution with mean $M$ and covariance $S$. In the above model, $\alpha$ is the relative acceleration between the viewer and the scene. If some prior modeling information is available (for instance when the camera is mounted on a vehicle or on a robot arm), this is the place to use it. Otherwise, a statistical model can be employed. In particular, one way to formalize the fact that no information is available is to model $\alpha$ as a Brownian motion process. This is what we do in this chapter¹. In principle one would like, at least for this simplified formalization of SFM, to find the optimal solution, that is, the description of the state of the above system $\{X, g, \xi\}$ given a sequence of output measurements (correspondences) $y^i(t)$ over an interval of time. Since the measurements are noisy and described using a stochastic model, the description of the state consists in its probability density conditioned on the measurements. We call an algorithm that delivers the conditional density of the state at time $t$ causally (i.e. based upon measurements up to time $t$) the optimal filter.
13.2 Observability
To what extent can the 3-D shape and motion of a scene be reconstructed causally from measurements of the motion of its projection onto the sensor? This is the subject of this section, which we start by establishing some notation that will be used throughout the rest of the chapter.
Let a rigid motion $g \in SE(3)$ be represented by a translation vector $T \in \mathbb{R}^3$ and a rotation matrix $R \in SO(3)$, and let $\alpha \neq 0$ be a scalar. The similarity group, which we indicate by $g_\alpha \in SE(3) \times \mathbb{R}^+$, is the composition of a rigid motion and a scaling, which acts on points in $\mathbb{R}^3$ as follows:
$$g_\alpha(X) = \alpha R X + T. \qquad (13.4)$$

1. We wish to emphasize that this choice is not crucial to the results of this chapter. Any other model would do, as long as the overall system is observable.

The action extends componentwise to the collection $X \in \mathbb{R}^{3N}$, and the corresponding equivalence class of configurations is
$$[X] = \{\,Y \in \mathbb{R}^{3N} \mid \exists\, g_\alpha : Y = g_\alpha(X)\,\}. \qquad (13.7)$$
Preliminaries
Consider a discrete-time nonlinear dynamical system of the general form
$$\begin{cases} x(t+1) = f(x(t)), & x(t_0) = x_0 \\ y(t) = h(x(t)), \end{cases} \qquad (13.8)$$
and let $y(t; t_0, x_0)$ indicate the output of the system at time $t$, starting from the initial condition $x_0$ at time $t_0$. In this section we want to characterize the states $x_0$ that can be reconstructed from the measurements $y$. Such a characterization depends upon the structure of the system $(f, h)$ and not on the measurement noise, which is therefore assumed to be absent for the purpose of the analysis in this section.

Definition 13.1. Consider a system in the form (13.8) and a point in the state space $x_0$. We say that $x_0$ is indistinguishable from $x_0'$ if $y(t; t_0, x_0') = y(t; t_0, x_0)$ for all $t, t_0$. We indicate with $I(x_0)$ the set of initial conditions that are indistinguishable from $x_0$.
Given the linearization $\{F, H\}$ of the system, define the observability Grammian over a window of length $k$ as
$$M_k(t) \doteq \sum_{i=t}^{t+k} \Phi_i^T(t)\, H^T(i) H(i)\, \Phi_i(t), \qquad (13.11)$$
where $\Phi_t(t) = I$ and $\Phi_i(t) = F(i-1)\cdots F(t)$. The following definition will come in handy in section 13.4:

Definition 13.3. We say that the system (13.10) is uniformly observable if there exist real numbers $m_1 > 0$, $m_2 > 0$ and an integer $k > 0$ such that $m_1 I \le M_k(t) \le m_2 I$ for all $t$.

The following proposition revisits the well-known fact that, under constant velocity, structure and motion are observable up to a (global) similarity transformation.

Proposition 13.1. The model (13.3), where the points $X$ are in general position, is observable up to a similarity transformation of $X$ provided that $v_0 \neq 0$. In particular, the set of initial conditions that are indistinguishable from $\{X_0, g_0, \xi_0\}$, where $g_0 = \{T_0, R_0\}$ and $e^{\widehat{\xi}_0} = \{v_0, U_0\}$, is given by $\{\alpha\bar{R}X_0 + \bar{T},\ \bar{g}_0,\ \bar{\xi}_0\}$, where $\bar{g}_0 = \{\alpha T_0 - R_0\bar{R}^T\bar{T},\ R_0\bar{R}^T\}$ and $e^{\widehat{\bar{\xi}}_0} = \{\alpha v_0,\ U_0\}$, for some rotation $\bar{R}$, translation $\bar{T}$ and scale $\alpha$.
In the indistinguishability conditions, $g$ only appears at the initial time. Consequently, we drop the time index and write simply $g_1$ and $g_2$ as points in $SE(3)$. At the generic time instant $t$, the indistinguishability condition can therefore be written as $e^{\widehat{\xi}_1 t} g_1 X_1 = \big(e^{\widehat{\xi}_2 t} g_2 X_2\big)\,A(t+1)$. Therefore, given $X_2, g_2, \xi_2$, in order to find the initial conditions that are indistinguishable from it, we need to find $X_1, g_1, \xi_1$ and $A(k)$, $k \ge 1$, such that, after some substitutions, we have
$$\begin{cases} g_1 X_1 = (g_2 X_2)\,A(1) \\ e^{\widehat{\xi}_1} e^{(k-1)\widehat{\xi}_2} g_2 X_2 = e^{\widehat{\xi}_2} e^{(k-1)\widehat{\xi}_2} g_2 X_2\, A(k+1), \quad k \ge 1. \end{cases} \qquad (13.12)$$
The $3 \times N$ matrix on the right-hand side has rank at most 2, while the left-hand side has rank 3, following the general-position conditions, unless $A(k)A^{-1}(k+1) = I$ and $U_1^T U_2 = I$, in which case it is identically zero. Therefore, both terms in the above equations must be identically zero. From $U_1^T U_2 = I$ we conclude that $U_1 = U_2$, while from $A(k)A^{-1}(k+1) = I$ we conclude that $A(k)$ is constant. However, the right-hand side imposes that $v_2 A = v_1$, or in vector form $v_2\, a^T = v_1\, \mathbf{1}^T$ where $A = \mathrm{diag}\{a\}$, which implies that $A = \alpha I$, i.e. a multiple of the identity. Now, going back to the first equation in (13.12), we conclude that $R_1 = R_2 \bar{R}^T$ for some $\bar{R} \in SO(3)$.
For three of the points we then have
$$\lambda^i = \alpha R\, y^i \beta^i + T, \qquad i = 1, 2, 3. \qquad (13.16)$$
Solve the first equation for $T$,
$$T = \lambda^1 - \alpha R\, y^1 \beta^1, \qquad \alpha \neq 0, \qquad (13.17)$$
and substitute into the second and third equations to get
$$\begin{cases} \lambda^2 - \lambda^1 = \alpha R\,(y^2\beta^2 - y^1\beta^1) \\ \lambda^3 - \lambda^1 = \alpha R\,(y^3\beta^3 - y^1\beta^1). \end{cases} \qquad (13.18)$$
The scale $\alpha > 0$ can be solved for as a function of the unknown scales $\beta^2$ and $\beta^3$:
$$\alpha = \frac{\|\lambda^2 - \lambda^1\|}{\|y^2\beta^2 - y^1\beta^1\|} = \frac{\|\lambda^3 - \lambda^1\|}{\|y^3\beta^3 - y^1\beta^1\|}. \qquad (13.19)$$
On the right-hand side of each (normalized) equation there is a unit-norm vector parameterized by $\beta^2$ and $\beta^3$ respectively. In particular, the right-hand side of the first equation in (13.20) is a vector on the unit circle of the plane spanned by $y^1$ and $y^2$, while the right-hand side of the second equation is a vector on the unit circle of the plane spanned by $y^1$ and $y^3$. By the assumption of non-collinearity, these two planes do not coincide. We write the above equations in a more compact form as
$$\begin{cases} u^1 = R\, w^1 \\ u^2 = R\, w^2, \end{cases} \qquad (13.21)$$
where $u^1, u^2$ are the fixed unit vectors on the left and $w^1(\beta^2), w^2(\beta^3)$ the parameterized unit vectors on the right. Now $R$ must preserve the angle between $u^1$ and $u^2$, which we indicate as $\angle u^1 u^2$, and therefore $\beta^2$ and $\beta^3$ must be chosen accordingly. If no admissible choice matches this angle, the configurations are distinguishable. Otherwise, there exists a one-dimensional interval set of $\beta^2, \beta^3$ for which one can find a rotation $R$ that preserves the angle. However, $R$ must also preserve the cross product, so that we have
13.3 Realization
In order to design a finite-dimensional approximation to the optimal filter, we need an observable realization of the original model. How to obtain it is the subject of this section.

Local coordinates
Our first step consists in characterizing the local-coordinate representation of the model (13.3). To this end, we represent $SO(3)$ locally in canonical exponential coordinates: let $\Omega$ be a three-dimensional real vector ($\Omega \in \mathbb{R}^3$); $\Omega/\|\Omega\|$ specifies the direction of rotation and $\|\Omega\|$ specifies the angle of rotation in radians. Then a rotation matrix can be represented by its exponential coordinates $\widehat{\Omega} \in so(3)$, such that $R = \exp(\widehat{\Omega}) \in SO(3)$, as described in chapter 2. The three-dimensional coordinate $X^i$ is represented by its projection onto the image plane $y^i$ and its depth $\lambda^i$, so that
$$y^i = \pi(X^i) = \Big[\frac{X_1^i}{X_3^i},\ \frac{X_2^i}{X_3^i}\Big]^T, \qquad \lambda^i = X_3^i. \qquad (13.23)$$
The model (13.3), written in these local coordinates, becomes
$$\begin{cases} y_0^i(t+1) = y_0^i(t), & y_0^i(0) = y_0^i, \quad i = 1,\dots,N \\ \lambda^i(t+1) = \lambda^i(t), & \lambda^i(0) = \lambda_0^i, \quad i = 1,\dots,N \\ T(t+1) = \exp(\widehat{\omega}(t))\,T(t) + v(t), & T(0) = T_0 \\ \Omega(t+1) = \mathrm{Log}_{SO(3)}\big(\exp(\widehat{\omega}(t))\exp(\widehat{\Omega}(t))\big), & \Omega(0) = \Omega_0 \\ v(t+1) = v(t) + \alpha_v(t), & v(0) = v_0 \\ \omega(t+1) = \omega(t) + \alpha_\omega(t), & \omega(0) = \omega_0 \\ y^i(t) = \pi\big(\exp(\widehat{\Omega}(t))\,y_0^i\,\lambda^i(t) + T(t)\big) + n^i(t), & i = 1,\dots,N. \end{cases} \qquad (13.24)$$
The notation $\mathrm{Log}_{SO(3)}(R)$ stands for the $\Omega$ such that $R = e^{\widehat{\Omega}}$, and is computed by inverting Rodrigues' formula as described in chapter 2.
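As a numerical sketch (not part of the original text), the following Python functions implement the exponential coordinates and the $\mathrm{Log}_{SO(3)}$ map used in (13.24) via Rodrigues' formula and its inverse. The names are illustrative, and the logarithm as written is only valid away from rotations by exactly $\pi$.

```python
import numpy as np

def hat(w):
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def exp_so3(Omega):
    # Rodrigues' formula: R = I + sin(theta) K + (1 - cos(theta)) K^2, K = hat(Omega)/theta.
    theta = np.linalg.norm(Omega)
    if theta < 1e-12:
        return np.eye(3) + hat(Omega)              # first-order approximation near identity
    K = hat(Omega / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * K @ K

def log_so3(R):
    # Inverse of Rodrigues' formula (valid for 0 <= theta < pi).
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if theta < 1e-12:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta / (2.0 * np.sin(theta)) * w

# The composition used in the state equation for Omega in (13.24):
Omega = np.array([0.1, -0.2, 0.3])
omega = np.array([0.02, 0.01, -0.03])
R_next = exp_so3(omega) @ exp_so3(Omega)
Omega_next = log_so3(R_next)                       # LogSO(3)(exp(omega^) exp(Omega^))
print(np.allclose(exp_so3(Omega_next), R_next))    # True
```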
Minimal realization
In linear time-invariant systems one can decompose the state-space into an
observable subspace and its (unobservable) complement. In the case of our
system, which is nonlinear and observable up to a group transformation,
we can exploit the bundle structure of the state-space to realize a similar
decomposition: the base of the fiber bundle is observable, while individual
fibers are not. Therefore, in order to restrict our attention to the observable
component of the system, we only need to choose a base of the fiber bundle,
that is a particular (representative) point on each fiber.
Proposition 13.2 suggests a way to render the model (13.24) observable
by eliminating the states that fix the unobservable subspace.
Corollary 13.1. The model
$$\begin{cases} y_0^i(t+1) = y_0^i(t), & i = 4,\dots,N \\ \lambda^i(t+1) = \lambda^i(t), & \lambda^i(0) = \lambda_0^i, \quad i = 2,\dots,N \\ T(t+1) = \exp(\widehat{\omega}(t))\,T(t) + v(t) \\ \Omega(t+1) = \mathrm{Log}_{SO(3)}\big(\exp(\widehat{\omega}(t))\exp(\widehat{\Omega}(t))\big), & \Omega(0) = \Omega_0 \\ v(t+1) = v(t) + \alpha_v(t), & v(0) = v_0 \\ \omega(t+1) = \omega(t) + \alpha_\omega(t), & \omega(0) = \omega_0 \\ y^i(t) = \pi\big(\exp(\widehat{\Omega}(t))\,y_0^i\,\lambda^i(t) + T(t)\big) + n^i(t), & i = 1,\dots,N, \end{cases} \qquad (13.25)$$
which is obtained by eliminating $y_0^1$, $y_0^2$, $y_0^3$ and $\lambda^1$ from the state of the model (13.24), is observable.
Definition 13.4. We say that a motion is admissible if $v(t)$ is not identically zero, i.e. there is an open set $(a, b)$ such that $v(t) \neq 0$ for all $t \in (a, b)$, and the corresponding trajectory of the system (??) is such that $c \le Z^i(t) \le C$, $i = 1,\dots,N$, for all $t > 0$ and for some constants $c > 0$, $C < \infty$.

Proposition 13.3. Let $F(t) \doteq \frac{\partial f}{\partial x}(x)$ and $H(t) \doteq \frac{\partial h}{\partial x}(x)$ denote the linearizations of the state and measurement equations in (??), respectively; let $N > 5$ and assume the motion is admissible; then the linearized system is uniformly observable.

Proof. Let $k = 2$; that there exists an $m_2 < \infty$ such that the Grammian satisfies $M_2(t) \le m_2 I$ follows from the fact that $F(t)$ and $H(t)$ are bounded for all $t$, as can be easily verified provided that the motion is admissible. We now need to verify that $M_2(t)$ is strictly positive definite for all $t$. To this end, it is sufficient to show that the matrix
$$U_2(t) \doteq \begin{bmatrix} H(t) \\ H(t+1)F(t) \end{bmatrix} \qquad (13.26)$$
has full column rank equal to $3N + 5$ for all values of $t$, which can be easily verified whenever $N \ge 5$.
13.4 Stability
Following the derivation in previous sections, the problem of estimating the motion, velocity and point-wise structure of the scene can be converted into the problem of estimating the state of the model (??). We propose to solve the task using a nonlinear filter, properly designed to account for the observability properties of the model. The implementation, which we report in detail in section 13.5, results in a sub-optimal filter. However, it is important to guarantee that the estimation error, while different from the optimal one, remains bounded. The filter gain is obtained from the usual Riccati equation, which uses the linearization of the model $\{F, H\}$ computed at the current estimate of the state, as described in [?].
Note that we call $\Sigma_{n0}$, $\Sigma_{w0}$ the variances of the measurement and model noises, and $\Sigma_n$, $\Sigma_w$ the tuning parameters that appear in the Riccati equation. The latter are free for the designer to choose, as described in section 13.5.

Stability analysis
The aim of this section is to prove that the estimation error generated by the filter just described is bounded. In order to do so, we need a few definitions.

Definition 13.5. A stochastic process $\tilde{x}(t)$ is said to be exponentially bounded in mean-square (or MS-bounded) if there are real numbers $\eta, \nu > 0$ and $0 < \vartheta < 1$ such that $E\|\tilde{x}(t)\|^2 \le \eta\,\|\tilde{x}(0)\|^2\vartheta^t + \nu$ for all $t \ge 0$. $\tilde{x}(t)$ is said to be bounded with probability one (or bounded WP1) if $P[\sup_{t\ge0}\|\tilde{x}(t)\| < \infty] = 1$.

Definition 13.6. The filter (13.27) is said to be stable if there exist positive real numbers $\varepsilon$ and $\delta$ such that
$$\|\tilde{x}(0)\| \le \varepsilon, \quad \Sigma_n(t) \le \delta I, \quad \Sigma_w(t) \le \delta I \quad \Longrightarrow \quad \tilde{x}(t) \text{ is bounded.} \qquad (13.31)$$
Proof. The proof follows from corollary 5.2 of [?], using proposition 13.3 on the uniform observability of the linearization of (??).

Now for the proof of the proposition:

Proof. The proposition follows from theorem 3.1 in [?], making use in the assumptions of the boundedness of $F(t)$ and $H(t)$, of lemma 13.1, and of the differentiability of $f$ and $h$ when $0 < \lambda^i < \infty$ for all $i$.
Non-minimal models
Most recursive schemes for causally reconstructing structure and motion available in the literature represent structure using only one state per point (either its depth in an inertial frame, or its inverse, or other variations on the theme). This corresponds to reducing the state of the model (??), with the states $y_0^i$ substituted by the measurements $y^i(0)$, which causes the model noise $n(t)$ to be non-zero-mean². When the zero-mean assumption implicit in the use of the Kalman filter is violated, the filter settles at a severely biased estimate. In this case we say that the model is sub-minimal.
On the other hand, when the model is non-minimal (such is the case when we do not force it to evolve on a base of the state-space bundle), the filter is free to wander on the non-observable space (i.e. along the fibers of the state-space bundle), thereby causing the explosion of the variance of the estimation error along the unobservable components.
13.5 Implementation
In the previous section we have seen that the model (13.25) is observable. Notice that in that model the index for $y_0^i$ starts at 4, while the index for $\lambda^i$ starts at 2. This corresponds to choosing the first three points as reference for the similarity group, and is necessary (and sufficient) for guaranteeing that the representation is minimal. In the model (13.25), we are free to choose the initial conditions $\Omega_0, T_0$, which we will therefore set to $\Omega_0 = T_0 = 0$, thereby choosing the camera reference at the initial time instant as the world reference.
Partial autocalibration
As we have anticipated, the models proposed can be extended to account for changes in calibration. For instance, consider an imaging model with focal length $f$³:
$$\pi_f(X) = \frac{f}{X_3}\begin{bmatrix} X_1 \\ X_2 \end{bmatrix}, \qquad (13.33)$$
where the focal length can change in time, but no prior knowledge of how it does so is available. One can then model its evolution as a random walk
$$f(t+1) = f(t) + \alpha_f(t), \qquad \alpha_f(t) \sim \mathcal{N}(0, \sigma_f^2), \qquad (13.34)$$
and insert it into the state of the model (13.3). As long as the overall system is observable, the conclusions reached in the previous section will hold. The following claim shows that this is the case for the model (13.34) above. Another imaging model proposed in the literature is the following [?]:
$$\pi_\beta(X) = \frac{1}{1 + \beta X_3}\begin{bmatrix} X_1 \\ X_2 \end{bmatrix}, \qquad (13.35)$$
for which similar conclusions can be drawn.

Proposition 13.5. Let $g = \{T, R\}$ and $\xi = \{v, \omega\}$. The model
$$\begin{cases} X(t+1) = X(t), & X(0) = X_0 \\ g(t+1) = e^{\widehat{\xi}}\,g(t), & g(0) = g_0 \\ \xi(t+1) = \xi(t), & \xi(0) = \xi_0 \\ f(t+1) = f(t), & f(0) = f_0 \\ y(t) = \pi_f\big(g(t)X(t)\big) \end{cases} \qquad (13.36)$$
is observable up to the action of the group represented by $\bar{T}, \bar{R}, \alpha$ acting on the initial conditions.

3. This $f$ is not to be confused with the generic state equation of the filter in section 13.6.
Proof. Consider the diagonal matrix $F(t) = \mathrm{diag}\{f(t), f(t), 1\}$ and the matrix of scalings $A(t)$ as in the proof in section 13.2. Consider then two initial conditions $\{X_1, g_1, \xi_1, f_1\}$ and $\{X_2, g_2, \xi_2, f_2\}$. For them to be indistinguishable there must exist matrices of scalings $A(k)$ and of focus $F(k)$ such that
$$\begin{cases} g_1 X_1 = F(1)(g_2 X_2)\,A(1) \\ e^{\widehat{\xi}_1} e^{(k-1)\widehat{\xi}_1} g_1 X_1 = F(k+1)\, e^{\widehat{\xi}_2} e^{(k-1)\widehat{\xi}_2} g_2 X_2\, A(k+1), \quad k \ge 1. \end{cases} \qquad (13.37)$$
Making the representation explicit we obtain
$$\begin{cases} R_1 X_1 + T_1 = F(1)(R_2 X_2 + T_2)A(1) \\ U_1 F(k)\bar{X}_k A(k) + v_1 = F(k+1)\big(U_2 \bar{X}_k + v_2\big)A(k+1), \end{cases} \qquad (13.38)$$
which can be re-written as
$$\bar{X}_k A(k)A^{-1}(k+1) - F^{-1}(k)U_1^T F(k+1)U_2 \bar{X}_k = F(k)^{-1} U_1^T\big(F(k+1)\,v_2 A(k+1) - v_1\big)A^{-1}(k+1).$$
The two sides of the equation have equal rank only if both are equal to zero, which leads us to conclude that $A(k)A^{-1}(k+1) = I$, and hence $A$ is constant. From $F^{-1}(k)U_1^T F(k+1)U_2 = I$ we get that $F(k+1)U_2 = U_1 F(k)$, and, since $U_1, U_2 \in SO(3)$, taking the norm of both sides we have $2f^2(k+1) + 1 = 2f^2(k) + 1$, where $f$ must be positive and therefore constant: $FU_2 = U_1 F$. From the right-hand side we have that $F v_2 A = v_1$, from which we conclude that $A = \alpha I$, so that in vector form we have $v_1 = \alpha F v_2$. Therefore, from the second equation we have that, for any $f$ and any $\alpha$, we can have
$$v_1 = \alpha F v_2, \qquad U_1 = F U_2 F^{-1}. \qquad (13.39)\text{--}(13.40)$$
Saturation
Instead of eliminating states to render the model observable, it is possible to design a nonlinear filter directly on the (unobservable) model (13.3) by saturating the filter along the unobservable component of the state space, as we show in this section. In other words, it is possible to design the initial variance of the state of the estimator as well as its model error in such a way that it will never move along the unobservable component of the state space.
As suggested in the previous section, one can saturate the states corresponding to $y_0^1, y_0^2, y_0^3$ and $\lambda^1$. We have to guarantee that the filter initialized at $\hat{y}_0^i, \hat{\lambda}_0, \hat{g}_0, \hat{\xi}_0$ evolves in such a way that $\hat{y}_0^1(t) = \hat{y}_0^1$, $\hat{y}_0^2(t) = \hat{y}_0^2$, $\hat{y}_0^3(t) = \hat{y}_0^3$, $\hat{\lambda}^1(t) = \hat{\lambda}_0^1$. It is simple, albeit tedious, to prove the following proposition.

Proposition 13.6. Let $P_{y^i}(0)$, $P_{\lambda^i}(0)$ denote the variances of the initial conditions corresponding to the states $y_0^i$ and $\lambda^i$ respectively, and $\Sigma_{y^i}$, $\Sigma_{\lambda^i}$ the variances of the model error corresponding to the same states. Then
$$P_{y^i}(0) = 0,\ i = 1,\dots,3, \quad P_{\lambda^1}(0) = 0, \quad \Sigma_{y^i} = 0,\ i = 1,\dots,3, \quad \Sigma_{\lambda^1} = 0 \qquad (13.41)$$
implies that $\hat{y}_0^i(t|t) = \hat{y}_0^i(0)$, $i = 1,\dots,3$, and $\hat{\lambda}^1(t|t) = \hat{\lambda}^1(0)$ for all $t$.
Pseudo-measurements
Yet another alternative to render the model observable is to add pseudo-measurement equations with zero error variance.

Proposition 13.7. The model
$$\begin{cases} y_0^i(t+1) = y_0^i(t), & i = 1,\dots,N \\ \lambda^i(t+1) = \lambda^i(t), & \lambda^i(0) = \lambda_0^i, \quad i = 1,\dots,N \\ T(t+1) = \exp(\widehat{\omega}(t))\,T(t) + v(t) \\ \Omega(t+1) = \mathrm{Log}_{SO(3)}\big(\exp(\widehat{\omega}(t))\exp(\widehat{\Omega}(t))\big), & \Omega(0) = \Omega_0 \\ v(t+1) = v(t) + \alpha_v(t), & v(0) = v_0 \\ \omega(t+1) = \omega(t) + \alpha_\omega(t), & \omega(0) = \omega_0 \\ y^i(t) = \pi\big(\exp(\widehat{\Omega}(t))\,y_0^i\,\lambda^i(t) + T(t)\big) + n^i(t), & i = 1,\dots,N \\ \lambda^1(t) = \bar{\lambda}^1 \\ y_0^i(t) = \bar{y}^i, & i = 1,\dots,3, \end{cases} \qquad (13.42)$$
where $\bar{\lambda}^1$ is an arbitrary (positive) constant and the $\bar{y}^i$ are three non-collinear points on the plane, is observable.

The implementation of an extended Kalman filter based upon the model (??) is straightforward. However, for the sake of completeness we report it in section 13.6. The only issue that needs to be dealt with is the disappearing and appearing of feature points, a common trait of sequences of images of natural scenes. Visible feature points may become occluded (and therefore their measurements become unavailable), or previously occluded points may become visible again.
Occlusions
When a feature point, say $X^i$, becomes occluded, the corresponding measurement $y^i(t)$ becomes unavailable. It is possible to model this phenomenon by setting the corresponding variance to infinity or, in practice, $\Sigma_{n^i} = M\,I_2$ for a suitably large scalar $M > 0$. By doing so, we guarantee that the corresponding states $\hat{y}_0^i(t)$ and $\hat{\lambda}^i(t)$ are not updated:

Proposition 13.8. If $\Sigma_{n^i} = \infty$, then $\hat{y}_0^i(t+1) = \hat{y}_0^i(t)$ and $\hat{\lambda}^i(t+1) = \hat{\lambda}^i(t)$.

An alternative, which is actually preferable in order to avoid useless computation and ill-conditioned inverses, is to eliminate the states $\hat{y}_0^i$ and $\hat{\lambda}^i$ altogether, thereby reducing the dimension of the state space. This is simple due to the diagonal structure of the model (??): the states $\lambda^i$, $y_0^i$ are decoupled, and therefore it is sufficient to remove them and delete the corresponding rows from the gain matrix $K(t)$ and the variance $\Sigma_w(t)$ for all $t$ past the disappearance of the feature (see section 13.6).
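The following small sketch (Python/NumPy, not the book's implementation; the indexing convention is illustrative) shows the second option in practice: removing the state entries of an occluded feature together with the corresponding rows and columns of the covariance. The variance-inflation option amounts instead to setting the feature's measurement covariance block to a large multiple of the identity.

```python
import numpy as np

def remove_feature(x, P, idx):
    """Remove the state entries with indices `idx` (e.g. the y_0^i and lambda^i
    entries of an occluded feature) from the state estimate x and covariance P."""
    keep = np.setdiff1d(np.arange(x.size), idx)
    return x[keep], P[np.ix_(keep, keep)]

# Toy example: an 8-dimensional state where entries 2 and 3 belong to an occluded feature.
rng = np.random.default_rng(2)
x = rng.standard_normal(8)
A = rng.standard_normal((8, 8))
P = A @ A.T                                   # some positive-definite covariance
x_red, P_red = remove_feature(x, P, np.array([2, 3]))
print(x_red.shape, P_red.shape)               # (6,) (6, 6)

# Alternative: keep the state but inflate the measurement covariance of the feature,
# so that the update leaves its states (approximately) untouched.
M = 1e8
Sigma_ni = M * np.eye(2)
```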
When a new feature point appears, on the other hand, it is not possible to simply insert it into the state of the model, since its initial condition is unknown. Any initialization error will disturb the current estimate of the remaining states, since it is fed back into the update equation of the filter, and generates a spurious transient. We address this problem by running a separate filter in parallel for each point, using the current estimates of motion from the main filter, in order to reconstruct the initial condition. Such a subfilter is based upon the following model, where we assume that $N_{\mathrm{new}}$ features appear at time $\tau$:
$$\begin{cases} y_0^i(t+1) = y_0^i(t) + \alpha_{y^i}(t), & y_0^i(\tau) \sim \mathcal{N}\big(y^i(\tau), \Sigma_{n^i}\big) \\ \lambda^i(t+1) = \lambda^i(t) + \alpha_{\lambda^i}(t), & \lambda^i(\tau) \sim \mathcal{N}\big(1, P_\lambda(\tau)\big) \\ y^i(t) = \pi\Big(\exp\big(\widehat{\Omega}(t|t)\big)\big[\exp\big(\widehat{\Omega}(\tau|\tau)\big)\big]^{-1}\big(y_0^i(t)\lambda^i(t) - T(\tau|\tau)\big) + T(t|t)\Big) + n^i(t), & t > \tau, \end{cases} \qquad (13.43)$$
for $i = 1,\dots,N_{\mathrm{new}}$, where $\Omega(t|t)$ and $T(t|t)$ are the current best estimates of $\Omega$ and $T$, and $\Omega(\tau|\tau)$ and $T(\tau|\tau)$ are the best estimates of $\Omega$ and $T$ at $t = \tau$. In practice, rather than initializing $\lambda^i$ to 1, one can compute a first approximation by triangulating on two adjacent views, and compute the covariance of the initialization error from the covariance of the current estimates of motion.
Several heuristics can be employed in order to decide when the estimate of the initial condition is good enough to be inserted into the main filter. The most natural criterion is when the variance of the estimation error becomes smaller than a given threshold.
Drift
The only case in which losing a feature constitutes a problem is when it is used to fix the observable component of the state space (in our notation, $i = 1, 2, 3$). The most obvious choice consists in associating the reference to any other visible point. This can be done by saturating the corresponding state and assigning as reference value the current best estimate. In particular, if feature $i$ is lost at time $\tau$, and we want to switch the reference index to feature $j$, we eliminate $y_0^i, \lambda^i$ from the state, and set the diagonal blocks of $\Sigma_w$ and $P(\tau)$ with indices $3j-3$ to $3j$ to zero. Therefore, by proposition 13.6, we have that
$$\hat{y}_0^j(\tau + t) = \hat{y}_0^j(\tau) \qquad \forall\, t > 0. \qquad (13.45)$$
Initialization
The filter is initialized with
$$y_0^i = y^i(0), \quad \lambda_0^i = 1, \quad T_0 = 0, \quad \Omega_0 = 0, \quad v_0 = 0, \quad \omega_0 = 0, \qquad i = 1,\dots,N. \qquad (13.46)$$
For the initial variance $P_0$, choose it to be block diagonal with blocks $\Sigma_{n^i}(0)$ corresponding to $y_0^i$, a large positive number $M$ (typically 100-1000 units of focal length) corresponding to $\lambda^i$, and zeros corresponding to $T_0$ and $\Omega_0$ (fixing the inertial frame to coincide with the initial reference frame). We also choose a large positive number $W$ for the blocks corresponding to $v_0$ and $\omega_0$.
The variance $\Sigma_n(t)$ is usually available from the analysis of the feature tracking algorithm. We assume that the tracking error is independent for each point, and therefore $\Sigma_n$ is block diagonal. We choose each block to be the covariance of the measurement $y^i(t)$ (in the current implementation they are diagonal and equal to 1 pixel std.). The variance $\Sigma_w(t)$ is a design parameter that is available for tuning. We describe the procedure in section 13.6. Finally, set
$$\begin{cases} \hat{x}(0|0) = \big[\,y_0^{4\,T}, \dots, y_0^{N\,T},\ \lambda_0^2, \dots, \lambda_0^N,\ T_0^T,\ \Omega_0^T,\ v_0^T,\ \omega_0^T\,\big]^T \\ P(0|0) = P_0. \end{cases} \qquad (13.47)$$
Transient
During the first transient of the filter, we do not allow new features to be acquired. Whenever a feature is lost, its state is removed from the model and its best current estimate is placed in a storage vector. If the feature was associated with the scale factor, we proceed as in section 13.5. The end of the transient can be tested as either a threshold on the innovation, a threshold on the variance of the estimates, or a fixed time interval. We choose a combination, with the time set to 30 frames, corresponding to one second of video.
The recursion to update the state $\hat{x}$ and the variance $P$ proceeds as follows. Let $f$ and $h$ denote the state and measurement model, so that
$$x(t+1) = f(x(t)) + w(t), \qquad y(t) = h(x(t)) + n(t). \qquad (13.48)$$
We then have:

Prediction:
$$\begin{cases} \hat{x}(t+1|t) = f\big(\hat{x}(t|t)\big) \\ P(t+1|t) = F(t)P(t|t)F^T(t) + \Sigma_w \end{cases} \qquad (13.49)$$

Update:
$$\begin{cases} \hat{x}(t+1|t+1) = \hat{x}(t+1|t) + L(t+1)\big(y(t+1) - h(\hat{x}(t+1|t))\big) \\ P(t+1|t+1) = \Gamma(t+1)P(t+1|t)\Gamma^T(t+1) + L(t+1)\Sigma_n(t+1)L^T(t+1) \end{cases} \qquad (13.50)$$

Gain:
$$\begin{cases} \Gamma(t+1) = I - L(t+1)H(t+1) \\ L(t+1) = P(t+1|t)H^T(t+1)\Lambda^{-1}(t+1), \end{cases} \qquad (13.51)$$
where $\Lambda(t+1) = H(t+1)P(t+1|t)H^T(t+1) + \Sigma_n(t+1)$ is the innovation variance.

Linearization:
$$\begin{cases} F(t) \doteq \frac{\partial f}{\partial x}\big(\hat{x}(t|t)\big) \\ H(t+1) \doteq \frac{\partial h}{\partial x}\big(\hat{x}(t+1|t)\big). \end{cases} \qquad (13.52)$$
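As an illustration only (not the book's implementation), the following Python function carries out one generic prediction/update cycle of the extended Kalman filter exactly as in (13.49)-(13.52); the functions f, h and their Jacobians F, H are assumed to be supplied by the user, here with a trivial scalar example.

```python
import numpy as np

def ekf_step(x, P, y, f, h, F, H, Sigma_w, Sigma_n):
    """One EKF prediction/update cycle following (13.49)-(13.52).
    f, h: state and measurement maps; F(x), H(x): their Jacobians."""
    # Prediction (13.49)
    x_pred = f(x)
    Fx = F(x)
    P_pred = Fx @ P @ Fx.T + Sigma_w
    # Gain (13.51); Lambda is the innovation variance
    Hx = H(x_pred)
    Lam = Hx @ P_pred @ Hx.T + Sigma_n
    L = P_pred @ Hx.T @ np.linalg.inv(Lam)
    Gamma = np.eye(len(x)) - L @ Hx
    # Update (13.50)
    x_new = x_pred + L @ (y - h(x_pred))
    P_new = Gamma @ P_pred @ Gamma.T + L @ Sigma_n @ L.T
    return x_new, P_new

# Tiny example: a scalar random-walk state observed directly.
f = lambda x: x
h = lambda x: x
F = lambda x: np.eye(1)
H = lambda x: np.eye(1)
x, P = np.zeros(1), np.eye(1)
x, P = ekf_step(x, P, np.array([0.7]), f, h, F, H, 0.01 * np.eye(1), 0.1 * np.eye(1))
print(x, P)
```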
Let $e_i$ be the $i$-th canonical vector in $\mathbb{R}^3$ and define $Y^i(t) \doteq e^{\widehat{\Omega}(t)}\,y_0^i(t)\lambda^i(t) + T(t)$ and $Z^i(t) \doteq e_3^T Y^i(t)$. The $i$-th block row $H_i(t)$ of the matrix $H(t)$ ($i = 1,\dots,N$) can be written as
$$H_i = \frac{\partial y^i}{\partial Y^i}\,\frac{\partial Y^i}{\partial x} = \Pi^i\,\frac{\partial Y^i}{\partial x},$$
where the time argument $t$ has been omitted for simplicity of notation. It is easy to check that $\Pi^i = \frac{1}{Z^i}\,[\,I_2 \ \ -\pi(Y^i)\,]$, and that the only nonzero blocks of $\partial Y^i/\partial x$ (the blocks of the state corresponding to $y_0$, $\lambda$, $T$, $\Omega$, $v$ and $\omega$ have $2N-6$, $N-1$, $3$, $3$, $3$ and $3$ columns, respectively) are
$$\frac{\partial Y^i}{\partial y_0^i} = e^{\widehat{\Omega}}\,\lambda^i\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}, \qquad \frac{\partial Y^i}{\partial \lambda^i} = e^{\widehat{\Omega}}\,y_0^i, \qquad \frac{\partial Y^i}{\partial T} = I,$$
while the block $\partial Y^i/\partial\Omega$ collects the derivatives of $e^{\widehat{\Omega}}\,y_0^i\lambda^i$ with respect to the three components of $\Omega$.
The derivative of $\mathrm{Log}_{SO(3)}$ is computed by differentiating the inverse of Rodrigues' formula; the calculation is reported in [?] and will not be repeated here. We shall use the following notation:
$$\frac{\partial\,\mathrm{Log}_{SO(3)}(R)}{\partial R} \doteq \left[\frac{\partial\,\mathrm{Log}_{SO(3)}(R)}{\partial r_{11}},\ \frac{\partial\,\mathrm{Log}_{SO(3)}(R)}{\partial r_{21}},\ \dots,\ \frac{\partial\,\mathrm{Log}_{SO(3)}(R)}{\partial r_{33}}\right].$$
The linearization $F$ of the state equation is then block structured: identity blocks $I_{2N-6}$, $I_{N-1}$, $I$ and $I$ for the constant states $y_0^i$, $\lambda^i$, $v$ and $\omega$, the block $e^{\widehat{\omega}}$ (with the corresponding derivatives $\widehat{e}_1 e^{\widehat{\omega}}T$, $\widehat{e}_2 e^{\widehat{\omega}}T$, $\widehat{e}_3 e^{\widehat{\omega}}T$ with respect to $\omega$) for the translation equation, and blocks obtained via the chain rule through $\partial\,\mathrm{Log}_{SO(3)}(R)/\partial R$ for the rotation equation, where $R = e^{\widehat{\omega}}e^{\widehat{\Omega}}$ and
$$\frac{\partial R}{\partial \Omega} \doteq \big[\,e^{\widehat{\omega}}\,\widehat{e}_1 e^{\widehat{\Omega}},\ e^{\widehat{\omega}}\,\widehat{e}_2 e^{\widehat{\Omega}},\ e^{\widehat{\omega}}\,\widehat{e}_3 e^{\widehat{\Omega}}\,\big], \qquad \frac{\partial R}{\partial \omega} \doteq \big[\,\widehat{e}_1 e^{\widehat{\omega}} e^{\widehat{\Omega}},\ \widehat{e}_2 e^{\widehat{\omega}} e^{\widehat{\Omega}},\ \widehat{e}_3 e^{\widehat{\omega}} e^{\widehat{\Omega}}\,\big],$$
and the bracket $(\cdot)^\vee$ indicates that the content has been organized into a column vector.
Regime
Whenever a feature disappears, we simply remove it from the state as during the transient. However, after the transient a feature selection module works in parallel with the filter to select new features, so as to maintain a roughly constant number of them (equal to the maximum that the hardware can handle in real time) and to maintain a distribution as uniform as possible across the image plane. We implement this by randomly sampling points on the plane and then searching around each point for a feature with enough brightness gradient (we use an SSD-type test [?]).
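As an illustration of one common selection criterion of this type (not necessarily the exact test used in the implementation described here), the following Python sketch scores a candidate patch by the smallest eigenvalue of its gradient second-moment matrix, which is large only when there is contrast along two independent directions, and selects the best among randomly sampled candidates. All names and the synthetic image are illustrative.

```python
import numpy as np

def corner_score(patch):
    """Smallest eigenvalue of the gradient second-moment matrix of an image patch."""
    gy, gx = np.gradient(patch.astype(float))
    G = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    return np.linalg.eigvalsh(G)[0]

rng = np.random.default_rng(3)
image = rng.random((120, 160))            # stand-in for a real image
half = 7
best, best_score = None, -1.0
for _ in range(50):                       # randomly sampled candidate locations
    r = int(rng.integers(half, image.shape[0] - half))
    c = int(rng.integers(half, image.shape[1] - half))
    s = corner_score(image[r - half:r + half + 1, c - half:c + half + 1])
    if s > best_score:
        best, best_score = (r, c), s
print(best, best_score)
```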
Once a new point-feature is found (one with enough contrast along two independent directions), a new filter (which we call a subfilter) is initialized based on the model (13.43). Its evolution is given by

Initialization:
$$\hat{y}_0^i(\tau|\tau) = y^i(\tau), \qquad \hat{\lambda}^i(\tau|\tau) = 1, \qquad P^i(\tau|\tau) = \mathrm{diag}\big\{\Sigma_{n^i}(\tau),\ P_\lambda(\tau)\big\}. \qquad (13.53)$$

Prediction:
$$\hat{y}_0^i(t+1|t) = \hat{y}_0^i(t|t), \qquad \hat{\lambda}^i(t+1|t) = \hat{\lambda}^i(t|t), \qquad t > \tau.$$

Update:
$$\begin{bmatrix} \hat{y}_0^i(t+1|t+1) \\ \hat{\lambda}^i(t+1|t+1) \end{bmatrix} = \begin{bmatrix} \hat{y}_0^i(t+1|t) \\ \hat{\lambda}^i(t+1|t) \end{bmatrix} + L^i(t+1)\Big(y^i(t) - \pi\Big(\exp\big(\widehat{\Omega}(t)\big)\big[\exp\big(\widehat{\Omega}(\tau)\big)\big]^{-1}\big(\hat{y}_0^i(t)\hat{\lambda}^i(t) - T(\tau)\big) + T(t)\Big)\Big). \qquad (13.54)$$
Tuning
The variance $\Sigma_w(t)$ is a design parameter. We choose it to be block diagonal, with the blocks corresponding to $T(t)$ and $\Omega(t)$ equal to zero (a deterministic integrator). We choose the remaining parameters using standard statistical tests, such as the cumulative periodogram of Bartlett [?]. The idea is that the parameters in $\Sigma_w$ are changed until the innovation process $\epsilon(t) \doteq y(t) - h(\hat{x}(t))$ is as close as possible to being white. The periodogram is one of many ways to test the whiteness of a stochastic process. In practice, we choose the blocks corresponding to $y_0^i$ equal to the variance of the measurements, and the elements corresponding to $\lambda^i$ all equal to a common value. We then choose the blocks corresponding to $v$ and $\omega$ to be diagonal with a common element each, and change the one for $v$ relative to the one for $\omega$ depending on whether we want to allow for more or less regular motions. We then change both, relative to the variance of the measurement noise, depending on the level of desired smoothness in the estimates.
Tuning nonlinear filters is an art, and this is not the proper venue to discuss the issue. Suffice it to say that we have performed the procedure only once and for all. We then keep the same tuning parameters no matter what the motion, structure and noise in the measurements are.
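As a sketch of the whiteness check on which this tuning is based (not the book's code; the acceptance bands of Bartlett's test are omitted and all names are illustrative), the following Python snippet computes the normalized cumulative periodogram of an innovation sequence and its maximum deviation from the straight line expected for a white process.

```python
import numpy as np

def cumulative_periodogram(eps):
    """Normalized cumulative periodogram of a scalar innovation sequence.
    For a white sequence it stays close to a straight line from 0 to 1."""
    eps = np.asarray(eps, dtype=float) - np.mean(eps)
    spec = np.abs(np.fft.rfft(eps))[1:] ** 2        # drop the DC term
    return np.cumsum(spec) / np.sum(spec)

rng = np.random.default_rng(4)
white = rng.standard_normal(512)
colored = np.convolve(white, np.ones(8) / 8.0, mode="same")   # low-pass filtered, not white
for name, e in [("white", white), ("colored", colored)]:
    C = cumulative_periodogram(e)
    line = np.linspace(0.0, 1.0, C.size)
    print(name, "max deviation from the straight line:", np.max(np.abs(C - line)))
```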
Chapter 14
Step-by-step building of a 3-D model
from images
Chapter 15
Extensions, applications and further
research directions
Part V
Appendices
Appendix A
Basic facts from linear algebra

Since in computer vision we mostly deal with real vector spaces, we here introduce some important facts about linear algebra for real vector spaces.¹ $\mathbb{R}^n$ is a natural model for an $n$-dimensional real vector space. If $\mathbb{R}^n$ is considered as a linear space, its elements, the so-called vectors, are closed under two basic operations: scalar multiplication and vector summation. That is, given any two vectors $x, y \in \mathbb{R}^n$ and any two real scalars $\alpha, \beta \in \mathbb{R}$, we may obtain a new vector $z = \alpha x + \beta y$ in $\mathbb{R}^n$. Linear algebra mainly studies the properties of the so-called linear transformations among different real vector spaces. Since such transformations can typically be represented as matrices, linear algebra to a large extent studies the properties of matrices.
A map $L : \mathbb{R}^n \to \mathbb{R}^m$ is called linear if
$$L(x + y) = L(x) + L(y), \quad \forall\, x, y \in \mathbb{R}^n, \qquad L(\alpha x) = \alpha L(x), \quad \forall\, x \in \mathbb{R}^n,\ \alpha \in \mathbb{R}.$$
Clearly, with respect to the standard bases of $\mathbb{R}^n$ and $\mathbb{R}^m$, the map $L$ can be represented by a matrix $A \in \mathbb{R}^{m\times n}$ such that
$$L(x) = Ax, \qquad \forall\, x \in \mathbb{R}^n. \qquad (A.1)$$
The set of all (real) $m \times n$ matrices is denoted by $\mathcal{M}(m, n)$. If viewed as a linear space, $\mathcal{M}(m, n)$ can be identified with the space $\mathbb{R}^{mn}$. By abuse of language, we sometimes simply refer to a linear map $L$ by its representation (matrix) $A$.
If $n = m$, the set $\mathcal{M}(n, n) \doteq \mathcal{M}(n)$ forms an algebraic structure called a ring (over the field $\mathbb{R}$). That is, matrices in $\mathcal{M}(n)$ are closed under matrix multiplication and summation: if $A, B$ are two $n \times n$ matrices, so are $C = AB$ and $D = A + B$. If we consider the ring of all $n \times n$ matrices, its group of units $GL(n)$, which consists of all $n \times n$ invertible matrices and is called the general linear group, can be identified with the set of invertible linear maps from $\mathbb{R}^n$ to $\mathbb{R}^n$:
$$\{\,L : \mathbb{R}^n \to \mathbb{R}^n;\ x \mapsto L(x) = Ax \mid A \in GL(n)\,\}. \qquad (A.2)$$
Now let us further consider $\mathbb{R}^n$ with its standard inner product structure. That is, given two vectors $x, y \in \mathbb{R}^n$, we define their inner product to be $\langle x, y\rangle = x^T y$. We say that a linear transformation $A$ (from $\mathbb{R}^n$ to itself) is orthogonal if it preserves this inner product:
$$\langle Ax, Ay\rangle = \langle x, y\rangle, \qquad \forall\, x, y \in \mathbb{R}^n. \qquad (A.3)$$
Proof. Contrary to the convention used elsewhere, within this proof all vectors stand for row vectors. That is, if $v$ is an $n$-dimensional row vector, it is of the form $v = [v_1, v_2, \dots, v_n]$ and carries no transpose. Denote the $i$th row vector of the given matrix $A$ by $a_i$, $i = 1,\dots,n$. The proof consists in constructing $Q$ and $R$ iteratively from the row vectors $a_i$:
$$\begin{aligned} q_1 &\doteq a_1, & r_1 &\doteq q_1/\|q_1\| \\ q_2 &\doteq a_2 - \langle a_2, r_1\rangle r_1, & r_2 &\doteq q_2/\|q_2\| \\ &\ \ \vdots & &\ \ \vdots \\ q_n &\doteq a_n - \textstyle\sum_{i=1}^{n-1}\langle a_n, r_i\rangle r_i, & r_n &\doteq q_n/\|q_n\|. \end{aligned}$$
Then $R = [r_1^T\ \dots\ r_n^T]^T$ and the matrix $Q$ is obtained as
$$Q = \begin{bmatrix} \|q_1\| & 0 & \cdots & 0 \\ \langle a_2, r_1\rangle & \|q_2\| & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ \langle a_n, r_1\rangle & \cdots & \langle a_n, r_{n-1}\rangle & \|q_n\| \end{bmatrix}.$$
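The construction in this proof is directly implementable. The following short Python function (illustrative only, assuming a square nonsingular matrix and no pivoting) applies Gram-Schmidt to the rows of A and returns the lower-triangular Q and the matrix R with orthonormal rows, so that A = QR as above.

```python
import numpy as np

def row_gram_schmidt(A):
    """Gram-Schmidt on the rows of A: returns (Q, R) with R having orthonormal
    rows and Q lower triangular, such that A = Q @ R."""
    n = A.shape[0]
    R = np.zeros((n, A.shape[1]))
    Q = np.zeros((n, n))
    for i in range(n):
        q = A[i].astype(float).copy()
        for j in range(i):
            Q[i, j] = A[i] @ R[j]        # <a_i, r_j>
            q -= Q[i, j] * R[j]
        Q[i, i] = np.linalg.norm(q)      # ||q_i||
        R[i] = q / Q[i, i]
    return Q, R

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
Q, R = row_gram_schmidt(A)
print(np.allclose(A, Q @ R), np.allclose(R @ R.T, np.eye(3)))   # True True
```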
The induced 2-norm of a matrix $A$ is defined as $\|A\|_2 \doteq \max_{\|x\|_2 = 1}\|Ax\|_2$.

Remark A.3.
- Similarly, other induced operator norms on $A$ can be defined, starting from different norms on the domain and co-domain spaces on which $A$ operates.
- Let $A$ be as above; then $A^T A$ is clearly symmetric and positive semi-definite, so it can be diagonalized by an orthogonal matrix $V$. The eigenvalues, being non-negative, can be written as $\sigma_i^2$. By ordering the columns of $V$ so that the eigenvalue matrix has decreasing eigenvalues on the diagonal, we see, from point (e) of the previous theorem, that $A^T A = V\,\mathrm{diag}(\sigma_1^2,\dots,\sigma_n^2)\,V^T$ and $\|A\|_2 = \sigma_1$.

We also have the decomposition $\mathbb{R}^m = \mathrm{Ra}(A) \oplus \mathrm{Ra}(A)^\perp$, together with:
b) $\mathrm{Ra}(A)^\perp = \mathrm{Nu}(A^T)$;
c) $\mathrm{Nu}(A^T) = \mathrm{Nu}(AA^T)$;
d) $\mathrm{Ra}(A) = \mathrm{Ra}(AA^T)$.

Proof. To prove $\mathrm{Nu}(AA^T) = \mathrm{Nu}(A^T)$, we have:
- $AA^T x = 0 \Rightarrow \langle x, AA^T x\rangle = \|A^T x\|^2 = 0 \Rightarrow A^T x = 0$, hence $\mathrm{Nu}(AA^T) \subseteq \mathrm{Nu}(A^T)$;
- $A^T x = 0 \Rightarrow AA^T x = 0$, hence $\mathrm{Nu}(A^T) \subseteq \mathrm{Nu}(AA^T)$.
or equivalently,
$$A = \sum_{i=1}^{r} \sigma_i u_i v_i^T. \qquad (A.6)$$
Then $U_1^T U_1 = I_{r\times r}$. Since $A^T A$ and $AA^T$ both have exactly $r$ nonzero eigenvalues, it follows that the columns of $U_1$ form an orthonormal basis for $\mathrm{Ra}(AA^T)$ and $\mathrm{Ra}(A)$. Thus the properties of $U_1$ listed in 2 hold, and we obtain $A = U\Sigma V^T$.
After we have gone through all the trouble of proving this theorem, you should know that the SVD has become a numerical routine available in many computational software packages such as MATLAB. Within MATLAB, to compute the SVD of a given $m \times n$ matrix $A$, simply use the command $[U, S, V] = \mathrm{svd}(A)$, which returns matrices $U, S, V$ satisfying $A = USV^T$ (where $S$ represents $\Sigma$ as defined above).
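The same computation is available in Python/NumPy (this snippet is not in the original text, which uses MATLAB); note that NumPy returns $V^T$ rather than $V$, and the singular values as a vector.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=True)   # A = U @ S @ Vt
S = np.zeros(A.shape)
S[:len(s), :len(s)] = np.diag(s)
print(np.allclose(A, U @ S @ Vt))                 # True
print(np.isclose(s[0], np.linalg.norm(A, 2)))     # the induced 2-norm equals sigma_1
```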
Let $x \in S^{n-1}$ (i.e. $\|x\|_2 = 1$) and let $y = Ax$. Using the SVD,
$$y = Ax = \sum_{i=1}^{n} \sigma_i u_i\langle v_i, x\rangle = \sum_{i=1}^{n} y_i u_i, \qquad y_i \doteq \sigma_i\langle v_i, x\rangle,$$
hence $\langle v_i, x\rangle = y_i/\sigma_i$. Now $\|x\|_2^2 = \sum_{i=1}^{n}\langle v_i, x\rangle^2 = 1$, from which we conclude $\sum_{i=1}^{n} y_i^2/\sigma_i^2 = 1$, which represents the equation of an ellipsoid with half-axes of length $\sigma_i$.
The pseudo-inverse of $A$ can be written in terms of the SVD as $A^\dagger = V\Sigma^\dagger U^T$, where
$$\Sigma^\dagger = \mathrm{diag}\big(\sigma_1^{-1},\dots,\sigma_r^{-1},0,\dots,0\big).$$
Finally, for an invertible matrix $A$, the quantity $k(A) \doteq \|A\|\,\|A^{-1}\| = \sigma_1(A)/\sigma_n(A)$ is called the condition number of $A$; it measures how much a relative perturbation in the data of a linear system $Ax = b$ can be amplified in the solution.
Appendix B
Least-square estimation and filtering

Consider the problem of estimating an unknown random vector $X \in \mathbb{R}^n$ from measurements of a related random vector $Y \in \mathbb{R}^m$, by means of an estimator $\hat{X} = T(Y)$ chosen to minimize a cost:
$$\hat{T} = \arg\min_{T \in \mathcal{T}} C(T), \qquad (B.1)$$
where $\mathcal{T}$ is a suitably chosen class of functions and $C(\cdot)$ some cost in the $X$-space.
We concentrate on one of the simplest possible choices, which corresponds to minimum-variance affine estimators:
$$\mathcal{T} \doteq \{\,T \mid \exists\, A \in \mathbb{R}^{n\times m},\ b \in \mathbb{R}^n : T(Y) = AY + b\,\}, \qquad C(T) \doteq E\|X - T(Y)\|^2, \qquad (B.2)\text{--}(B.3)$$
where the latter operator takes the expectation of the squared Euclidean norm of the random vector $X - T(Y)$. Therefore, we seek
$$(\hat{A}, \hat{b}) \doteq \arg\min_{A, b}\ E\|X - (AY + b)\|^2. \qquad (B.4)$$
We call $\mu_X \doteq E[X]$ and $\Sigma_X \doteq E[XX^T]$, and similarly for $Y$. First notice that if $\mu_X = \mu_Y = 0$, then $b = 0$. Therefore, consider the centered vectors $\tilde{X} \doteq X - \mu_X$ and $\tilde{Y} \doteq Y - \mu_Y$ and the reduced problem
$$\hat{A} \doteq \arg\min_A\ E\|\tilde{X} - A\tilde{Y}\|^2. \qquad (B.5)$$
Indeed,
$$E\|X - (AY + b)\|^2 = E\|A\tilde{Y} - \tilde{X} + (A\mu_Y + b - \mu_X)\|^2 = E\|\tilde{X} - A\tilde{Y}\|^2 + \|A\mu_Y + b - \mu_X\|^2. \qquad (B.6)$$
Hence, if we assume for a moment that we have found $\hat{A}$ that solves the problem (B.5), then trivially
$$\hat{b} = \mu_X - \hat{A}\mu_Y. \qquad (B.7)$$
Let $H(Y) \doteq \mathrm{span}\{Y_1,\dots,Y_m\}$, where the span is intended over the reals. We say that the subspace $H(Y)$ is full rank if $\Sigma_Y = E[YY^T] > 0$.
The structure of a Hilbert space allows us to make use of the concept of orthogonal projection of a random variable onto the span of a random vector:
$$\hat{Z} = \mathrm{pr}_{H(Y)}(X) \iff \langle X - \hat{Z},\ Z\rangle_H = 0 \quad \forall\, Z \in H(Y) \iff \langle X - \hat{Z},\ Y_i\rangle_H = 0, \quad i = 1,\dots,m, \qquad (B.10)\text{--}(B.12)$$
and we write $\hat{Z} \doteq \hat{E}[X|Y] \doteq \hat{X}(Y)$.
Here the norm in $H$ is $\|\cdot\|_H^2 \doteq E\|\cdot\|^2$. Imposing that the estimation error $\tilde{X} - A\tilde{Y}$ be orthogonal to the components of $\tilde{Y}$ yields
$$E[\tilde{X}\tilde{Y}^T] = A\,E[\tilde{Y}\tilde{Y}^T] \iff \Sigma_{\tilde{X}\tilde{Y}} = A\,\Sigma_{\tilde{Y}}, \qquad (B.16)$$
which, provided that $H(Y)$ is full rank, gives $\hat{A} = \Sigma_{\tilde{X}\tilde{Y}}\Sigma_{\tilde{Y}}^{-1}$. (B.17)

2. The resemblance with a conditional expectation is due to the fact that, in the presence of Gaussian random vectors, such a projection is indeed the conditional expectation.

When the means are nonzero, the same computation applied to the centered vectors gives
$$\hat{Z} = \mu_X + \Sigma_{\tilde{X}\tilde{Y}}\Sigma_{\tilde{Y}}^{-1}(Y - \mu_Y), \qquad (B.19)$$
which is the least-variance affine estimator
$$\hat{Z} = \hat{E}[X|Y] = \hat{A}Y + \hat{b}, \qquad (B.20)$$
where
$$\hat{A} = \Sigma_{\tilde{X}\tilde{Y}}\Sigma_{\tilde{Y}}^{-1}, \qquad \hat{b} = \mu_X - \Sigma_{\tilde{X}\tilde{Y}}\Sigma_{\tilde{Y}}^{-1}\mu_Y. \qquad (B.21)$$
It is an easy exercise to compute the variance of the estimation error $\tilde{X} \doteq X - \hat{Z}$:
$$\Sigma_{\tilde{X}} = \Sigma_X - \Sigma_{\tilde{X}\tilde{Y}}\Sigma_{\tilde{Y}}^{-1}\Sigma_{\tilde{Y}\tilde{X}}. \qquad (B.23)$$
If we interpret the variance of $X$ as the prior uncertainty, and the variance of $\tilde{X}$ as the posterior uncertainty, we may interpret the second term (which is positive semi-definite) of the above equation as a decrease of the uncertainty.
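The following Python sketch (illustrative only; the joint distribution is synthetic) estimates (B.21) and (B.23) empirically from samples: it computes the least-variance affine estimator of X given Y and checks that the empirical error covariance matches the predicted posterior covariance up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 20000
# Jointly distributed X (2-dim) and Y (3-dim): Y depends linearly on X plus noise.
X = rng.standard_normal((N, 2)) @ np.array([[1.0, 0.5], [0.0, 2.0]]) + np.array([1.0, -2.0])
Cyx = np.array([[1.0, 0.0], [0.5, 1.0], [0.2, -0.3]])
Y = X @ Cyx.T + 0.3 * rng.standard_normal((N, 3))

mu_X, mu_Y = X.mean(0), Y.mean(0)
Xc, Yc = X - mu_X, Y - mu_Y
Sigma_Y = Yc.T @ Yc / N
Sigma_XY = Xc.T @ Yc / N

A_hat = Sigma_XY @ np.linalg.inv(Sigma_Y)          # (B.21)
b_hat = mu_X - A_hat @ mu_Y
Z = Y @ A_hat.T + b_hat                            # least-variance affine estimate of X

err = X - Z
Sigma_err = err.T @ err / N
Sigma_post = (Xc.T @ Xc / N) - Sigma_XY @ np.linalg.inv(Sigma_Y) @ Sigma_XY.T   # (B.23)
print(np.allclose(Sigma_err, Sigma_post, atol=0.05))    # True, up to sampling error
```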
Suppose now that the measurement vector is partitioned into two blocks, $Y = [Y_1^T, Y_2^T]^T$, and that we would like to write
$$\hat{E}[X|Y] = \hat{E}[X|Y_1] + \hat{E}[X|Y_2]. \qquad (B.27)$$
After an easy calculation one can see that the above is true iff
$$E[Y_1 Y_2^T] = 0, \qquad (B.28)$$
which is to say when
$$H(Y_1) \perp H(Y_2). \qquad (B.29)$$
Change of basis
Suppose that instead of measuring the instances of a random vector $Y$ we measure another random vector $Z$, which is related to $Y$ via a change of basis: $Z = TY$, $T \in GL(m)$. If we call $\hat{E}[X|Y] = \hat{A}Y$, then it is immediate to see that
$$\hat{E}[X|Z] = \Sigma_{XZ}\Sigma_Z^{-1}Z = \Sigma_{XY}T^T\big(T\Sigma_Y T^T\big)^{-1}Z = \Sigma_{XY}\Sigma_Y^{-1}T^{-1}Z. \qquad (B.30)$$
Innovations
The linear least-variance estimator involves the computation of the inverse of the output covariance matrix $\Sigma_Y$. It may be interesting to look for changes of basis $T$ that transform the output $Y$ into $Z = TY$ such that $\Sigma_Z = I$. In such a case the optimal estimator is simply
$$\hat{E}[X|Z] = \Sigma_{XZ}Z. \qquad (B.31)$$
Let us pretend for a moment that the components of the vector $Y$ are samples of a process taken over time, $Y_i = y(i)$, and call $y^t = [Y_1,\dots,Y_t]^T$ the history of the process up to time $t$. Each component (sample) is an element of the Hilbert space $H$, which has a well-defined notion of orthogonality, and where we can apply the Gram-Schmidt procedure in order to make the vectors $y(i)$ orthogonal (uncorrelated):
$$\begin{aligned} v_1 &\doteq y(1), & e_1 &\doteq v_1/\|v_1\| \\ v_2 &\doteq y(2) - \langle y(2), e_1\rangle e_1, & e_2 &\doteq v_2/\|v_2\| \\ &\ \ \vdots & &\ \ \vdots \\ v_t &\doteq y(t) - \textstyle\sum_{i=1}^{t-1}\langle y(t), e_i\rangle e_i, & e_t &\doteq v_t/\|v_t\|. \end{aligned}$$
The process $\{e\}$, whose instances up to time $t$ are collected into the vector $e^t = [e_1,\dots,e_t]^T$, has a number of important properties:
in particular, the innovation at time $t$ is the one-step prediction error,
$$v_t = y(t) - \hat{E}\big[y(t)\,\big|\,y^{t-1}\big]. \qquad (B.34)$$
Consider now a causal linear estimator of a process $\{x\}$ from the measurements $\{y\}$,
$$\hat{x}(t|t) = \sum_{k=-\infty}^{t} h(t, k)\,y(k). \qquad (B.38)$$
The design of the least-variance estimator involves finding the kernel $\hat{h}$ such that the estimation error $\tilde{x}(t) \doteq x(t) - \hat{x}(t|t)$ has minimum variance. This is found, as in the previous sections for the static case, by imposing that the estimation error be orthogonal to the history of the process $\{y\}$ up to time $t$:
$$\langle x(t) - \hat{x}(t|t),\ y(s)\rangle_H = 0, \quad \forall\, s \le t \qquad \Longleftrightarrow \qquad E[x(t)y^T(s)] - \sum_{k=-\infty}^{t} h(t, k)\,E[y(k)y^T(s)] = 0, \quad \forall\, s \le t, \qquad (B.39)$$
which is equivalent to
$$\Sigma_{xy}(t-s) = \sum_{k=-\infty}^{t} h(t, k)\,\Sigma_y(k-s). \qquad (B.40)$$
Assuming that the processes are (jointly) stationary and the kernel time-invariant, $h(t, k) = h(t-k)$, this becomes the convolution equation
$$\Sigma_{xy}(t) = \sum_{s=0}^{\infty} h(s)\,\Sigma_y(t-s), \qquad t \ge 0. \qquad (B.41)$$
In the domain of the $z$-transform, the above convolution equation becomes
$$S_{xy}(z) = H(z)S_y(z), \qquad (B.42)$$
so that, if causality is not enforced, the optimal kernel is simply
$$\hat{H}(z) = S_{xy}(z)S_y^{-1}(z). \qquad (B.43)$$
To obtain a causal estimator ($h(t) = 0$ for $t < 0$), one factorizes the output spectral density as $S_y(z) = L(z)L^T(z^{-1})$, with $L$ causal and causally invertible, and whitens the measurements through the innovation $e = L^{-1}(z)\,y$. Since $S_{xe}(z) = S_{xy}(z)L(z^{-1})^{-1}$, the final expression of our linear, least-variance estimator is (in the $z$-domain) $\hat{x} = \hat{H}(z)y(z)$, where the kernel $\hat{H}$ is given by
$$\hat{H}(z) = \big[S_{xy}(z)L^{-1}(z^{-1})\big]_+\, L^{-1}(z). \qquad (B.48)$$
The corresponding filter is known as the Wiener filter. Again, we can recover the meaning of the innovation as the one-step prediction error for the measurements: in fact, the best prediction of the process $\{y\}$, indicated with $\hat{y}(t|t-1)$, is defined as the projection of $y(t)$ onto the span of $\{y\}$ up to $t-1$, indicated with $H_{t-1}(y)$. Such a projection is therefore defined so that
$$y(t) = \hat{y}(t|t-1) + e(t). \qquad (B.49)$$
For a linear state-space model, the state evolves according to
$$x(t) = \Phi(t, t_0)\,x(t_0) + \sum_{k=t_0}^{t-1}\Phi(t, k+1)\,v(k), \qquad (B.51)$$
where $\Phi$ denotes a fundamental set of solutions, which is the flow of the difference equation
$$\Phi(t+1, s) = A(t)\Phi(t, s), \qquad \Phi(t, t) = I. \qquad (B.52)$$
In the case of a time-invariant system, $A(t) = A$ for all $t$, and $\Phi(t, s) = A^{t-s}$.
For such a model the state is a Markov process, and the best predictor of the state given its own history satisfies
$$\hat{E}\big[x(t)\,\big|\,H_s(x)\big] = \hat{E}\big[x(t)\,\big|\,x(s)\big] = \Phi(t, s)\,x(s), \qquad t \ge s. \qquad (B.54)$$
The conditions for stationarity impose that $\mu_x(t) = \text{const}$ and $\Sigma_x(t) = \text{const}$. It is easy to prove the following

Theorem B.4. Let $A$ be stable (have all eigenvalues inside the unit circle of the complex plane); then $\Sigma_x(t - t_0) \to \bar{\Sigma}$, where $\bar{\Sigma} = \sum_{k=0}^{\infty} A^k BB^T A^{T\,k}$ is the unique equilibrium solution of the above Lyapunov equation, and $\{x\}$ asymptotically describes a stationary process. If $x_0$ is such that $\Sigma_x(t_0) = \bar{\Sigma}$, then the process is stationary for all $t \ge t_0$.

Remark B.4. The condition of stability for $A$ is sufficient, but not necessary, for generating a stationary process. If, however, the pair $(A, B)$ is completely controllable, so that the noise input affects all of the components of the state, then such a stability condition becomes also necessary.
The filter we derive has the property of having the least error variance. In order to derive the expression for the filter, we write the LFDSP as follows:
$$\begin{cases} x(t+1) = Ax(t) + v(t), & x(t_0) = x_0 \\ y(t) = Cx(t) + w(t), \end{cases} \qquad (B.58)$$
where we have neglected the time argument in the matrices $A(t)$ and $C(t)$ (all considerations can be carried through for time-varying systems as well). $v(t) = Bn(t)$ is a white, zero-mean Gaussian noise with variance $Q$; $w(t) = Dn(t)$, also a white, zero-mean noise, has variance $R$, so that we could write
$$v(t) = \sqrt{Q}\,n(t), \qquad w(t) = \sqrt{R}\,n(t),$$
where $n$ is a unit-variance noise. In general $v$ and $w$ will be correlated, and in particular we will call
$$S(t) = E\big[v(t)w^T(t)\big]. \qquad (B.59)$$
The first step is to modify the above model so that the model error $v$ is uncorrelated from the measurement error $w$.

Uncorrelating the model from the measurements
In order to uncorrelate the model error from the measurement error we can just substitute $v$ with the complement of its projection onto the span of $w$. Let us call
$$\hat{v}(t) \doteq \hat{E}\big[v(t)\,\big|\,w(t)\big] = SR^{-1}w(t); \qquad (B.61)$$
the last equivalence is due to the fact that $w$ is a white noise. We can now use the results from section B.1 to conclude that
$$\tilde{v}(t) = v(t) - SR^{-1}w(t) \qquad (B.62)$$
is uncorrelated from $w$. Substituting $v(t) = \tilde{v}(t) + SR^{-1}\big(y(t) - Cx(t)\big)$ into the model, we obtain
$$x(t+1) = Fx(t) + SR^{-1}y(t) + \tilde{v}(t), \qquad (B.63)$$
where $F = A - SR^{-1}C$. The model error $\tilde{v}$ in the above model is uncorrelated from the measurement noise $w$, and the cost is that we had to add an output-injection term $SR^{-1}y(t)$.
Prediction step
Suppose at some point in time we are given a current estimate of the state $\hat{x}(t|t)$ and a corresponding estimate of the variance of the estimation error $P(t|t) = E[\tilde{x}(t)\tilde{x}(t)^T]$, where $\tilde{x} = x - \hat{x}$. At the initial time $t_0$ we can take $\hat{x}(t_0|t_0) = x_0$ with some bona fide variance matrix. Then it is immediate to compute
$$\hat{x}(t+1|t) = F\hat{x}(t|t) + SR^{-1}y(t) + \hat{E}\big[\tilde{v}(t)\,\big|\,H_t(y)\big], \qquad (B.65)$$
where the last term is zero, since $\tilde{v}(t) \perp x(s)$ for all $s \le t$ and $\tilde{v}(t) \perp w(s)$, and therefore $\tilde{v}(t) \perp y(s)$ for all $s \le t$. The estimation error is therefore
$$\tilde{x}(t+1|t) = F\tilde{x}(t|t) + \tilde{v}(t), \qquad (B.66)$$
$$P(t+1|t) = FP(t|t)F^T + Q. \qquad (B.67)$$
Update step
Once a new measurement is acquired, we can update our prediction so as to take the new measurement into account. The update is defined as $\hat{x}(t+1|t+1) \doteq \hat{E}[x(t+1)\,|\,H_{t+1}(y)]$. Now, as we have seen in section B.1.4, we can decompose the span of the measurements into the orthogonal sum
$$H_{t+1}(y) = H_t(y) \oplus \{e(t+1)\}, \qquad (B.68)$$
so that
$$\hat{x}(t+1|t+1) = \hat{E}\big[x(t+1)\,\big|\,H_t(y)\big] + \hat{E}\big[x(t+1)\,\big|\,e(t+1)\big], \qquad (B.69)$$
where the last term can be computed using the results from section B.1:
$$\hat{x}(t+1|t+1) = \hat{x}(t+1|t) + L(t+1)e(t+1), \qquad (B.70)$$
where $L(t+1) \doteq \Sigma_{xe}(t+1)\Sigma_e^{-1}(t+1)$ is called the Kalman gain. Substituting the expression for the innovation, we have
$$\hat{x}(t+1|t+1) = \hat{x}(t+1|t) + L(t+1)\big(y(t+1) - C\hat{x}(t+1|t)\big), \qquad (B.71)$$
from which we see that the update consists in a linear correction weighted by the Kalman gain.

Computation of the gain
In order to compute the gain $L(t+1) = \Sigma_{xe}(t+1)\Sigma_e^{-1}(t+1)$, we derive an alternative expression for the innovation:
$$e(t+1) = y(t+1) - Cx(t+1) + Cx(t+1) - C\hat{x}(t+1|t) = w(t+1) + C\tilde{x}(t+1|t). \qquad (B.72)$$
From this expression, the innovation variance and the cross-covariance are
$$\Lambda(t+1) \doteq \Sigma_e(t+1) = CP(t+1|t)C^T + R, \qquad \Sigma_{xe}(t+1) = P(t+1|t)C^T, \qquad (B.73)\text{--}(B.74)$$
and therefore
$$L(t+1) = P(t+1|t)C^T\big(CP(t+1|t)C^T + R\big)^{-1}. \qquad (B.75)$$

Variance update
From the update of the estimation error
$$\tilde{x}(t+1|t+1) = \tilde{x}(t+1|t) - L(t+1)e(t+1), \qquad (B.76)$$
we can easily compute the update for the variance. We first observe that $\tilde{x}(t+1|t+1)$ is by definition orthogonal to $H_{t+1}(y)$, while the correction term $L(t+1)e(t+1)$ is contained in the history of the innovation, which is by construction equal to the history of the process $y$, i.e. $H_{t+1}(y)$. Then it is immediate to see that
$$P(t+1|t+1) = P(t+1|t) - L(t+1)\Lambda(t+1)L^T(t+1). \qquad (B.77)$$
In the one-step-predictor form, the gain can equivalently be written as
$$FL(t) + SR^{-1} = \big(AP(t|t-1)C^T + S\big)\Lambda^{-1}(t). \qquad (B.81)\text{--}(B.82)$$
Appendix C
Basic facts from optimization
References
[Ast96]    K. Åström. Invariancy Methods for Points, Curves and Surfaces in Computational Vision. PhD thesis, Department of Mathematics, Lund University, 1996.
[BA83]
[BCB97]    M. J. Brooks, W. Chojnacki, and L. Baumela. Determining the ego-motion of an uncalibrated camera from instantaneous optical flow. In press, 1997.
[Bou98]
[CD91]
[Fau93]
[GH86]
[Har98]
[HF89]     T. Huang and O. Faugeras. Some properties of the E matrix in two-view motion estimation. IEEE PAMI, 11(12):1310-1312, 1989.
[HS88]
[HZ00]
[Jaz70]
[Kan93]
[LF97]     Q.-T. Luong and O. Faugeras. Self-calibration of a moving camera from point correspondences and fundamental matrices. IJCV, 22(3):261-289, 1997.
[LH81]
[LK81]
[May93]
[McL99]
[MF92]     S. Maybank and O. Faugeras. A theory of self-calibration of a moving camera. International Journal of Computer Vision, 8(2):123-151, 1992.
[MKS00]    Y. Ma, J. Košecká, and S. Sastry. Linear differential algorithm for motion recovery: A geometric approach. International Journal of Computer Vision, 36(1):71-89, 2000.
[MLS94]
[MS98]     Y. Ma and S. Sastry. Vision theory in spaces of constant curvature. Electronic Research Laboratory Memorandum, UC Berkeley, UCB/ERL(M98/36), June 1998.
[PG98]     J. Ponce and Y. Genc. Epipolar geometry and linear subspace methods: a new approach to weak calibration. International Journal of Computer Vision, 28(3):223-243, 1998.
[PG99]     M. Pollefeys and L. Van Gool. Stratified self-calibration with the modulus constraint. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(8):707-724, 1999.
[SA90]
[Stu97]
[TK92]
[Tri97]
[TTH96]
[VF95]
[VMHS01]   R. Vidal, Y. Ma, S. Hsu, and S. Sastry. Optimal motion estimation from multiview normalized epipolar constraint. In ICCV'01, Vancouver, Canada, 2001.
[WHA93]
[ZF96]     C. Zeller and O. Faugeras. Camera self-calibration from video sequences: the Kruppa equations revisited. Research Report 2793, INRIA, France, 1996.
[ZH84]
[Zha98]
Glossary of notations
(R, T)
(ω, v)
E ≐ T̂R
L ⊂ E³
X̄ ≐ [X, Y, Z, W]ᵀ ∈ R⁴
Xᵢ
Xᵢⱼ
l ≐ [a, b, c]ᵀ
x ≐ [x, y, z]ᵀ ∈ R³
xᵢ
xᵢⱼ
g ∈ SE(3)
p ∈ E³
Index
Cross product, 11
Euclidean metric, 10
Exercise
Adjoint transformation on twist, 35
Group structure of SO(3), 33
Properties of rotation matrices, 34
Range and null space, 34
Skew symmetric matrices, 34
Theorem
Rodrigues' formula for rotation matrix, 21
Surjectivity of the exponential map onto SE(3), 29
Surjectivity of the exponential map onto SO(3), 20
Group, 15
matrix representation, 15
rotation group, 17
special Euclidean group, 15
special orthogonal group, 17
Inner product, 10
Matrix
orthogonal matrix, 17
rotation matrix, 17
special orthogonal matrix, 17
Polarization identity, 14
Rigid body displacement, 13
Rigid body motion, 13