An Invitation to 3-D Vision
Yi Ma
Jana Kosecka
Stefano Soatto
Shankar Sastry
November 19, 2001
Preface
1 To our knowledge, there are also two other books on computer vision currently in
preparation: Ponce and Forsyth (expected in 2002), Pollefeys and van Gool (expected
in 2002).
Contents

Preface

I  Introductory material

1  Introduction
   1.1  Visual perception: from 2-D images to 3-D models
   1.2  A historical perspective
   1.3  A mathematical approach
   1.4  Organization of the book

2  Representation of a three-dimensional moving scene

3  Image formation
   3.1  Representation of images
   3.2  Lenses, surfaces and light
        3.2.1  Imaging through lenses
        3.2.2  Imaging through pin-hole
   3.3  A first model of image formation
        3.3.1  Basic photometry
        3.3.2  The basic model of imaging geometry
        3.3.3  Ideal camera
        3.3.4  Camera with intrinsic parameters
        3.3.5  Spherical projection
        3.3.6  Approximate camera models
   3.4  Summary
   3.5  Exercises

4  Image primitives and correspondence

II

III

   8.5  Experiments
        8.5.1  Setup
        8.5.2  Comparison with the 8 point algorithm
        8.5.3  Error as a function of the number of frames
        8.5.4  Experiments on real images
   8.6  Summary
   8.7  Exercises

IV  Reconstruction algorithms

Appendices

References

Glossary of notations

Index
Chapter 1
Introduction
Figure 1.1. Some pictorial cues for three-dimensional structure: texture (left),
shading and T-junctions (right).
Figure 1.2. Stereo as a cue in a random dot stereogram. When the image is fused binocularly, it reveals a depth structure. Note that stereo is the only cue, as all pictorial aspects are absent in random dot stereograms.
Binocular stereo is one such cue, exploited by the human visual system to infer the depth structure of the scene at close range. More generally, if we consider a stream of images taken from a moving viewpoint, the two-dimensional image-motion can be exploited to infer information about the three-dimensional structure of the scene as well as its motion relative to the viewer.
That image-motion is a strong cue is easily seen by eliminating all pictorial cues until the scene reduces to a cloud of points: a still image looks like a random collection of dots but, as soon as it starts moving, we are able to perceive the three-dimensional shape and motion of the scene (see Figure 1.3). The use of motion as a cue relies upon the assumption that we are able to assess which point corresponds to which across time (the correspondence problem again).
Figure 1.3. 2-D image-motion is a cue to 3-D scene structure: a number of dots
are painted on the surface of a transparent cylinder. An image of the cylinder,
which is generated by projecting it onto a wall, looks like a random collection
of points. If we start rotating the cylinder, just by looking at its projection we
can clearly perceive the existence of a three-dimensional structure underlying the
two-dimensional motion of the projection.
This implies that, while the scene and/or the sensor moves, certain properties of the scene remain constant. For instance, we may assume that neither the reflective properties nor the distance between any two points on the scene change.
In this book we concentrate on the motion/stereo problem, that is, the reconstruction of three-dimensional properties of a scene from collections of two-dimensional images taken from different vantage points.
This choice does not imply that pictorial cues are unimportant: the fact that we use and like photographs so much suggests the contrary. However, the problem of stereo/motion reconstruction has now reached the point where it can be formulated in a precise mathematical sense, and effective software packages are available for solving it.
Rigid body motion has been a classic subject of study in geometry, physics (especially mechanics), and robotics. Perspective projection, with its roots traced back to Renaissance art, has been widely studied in projective geometry and computer graphics. Important in their own right, it is the study of computer vision and computer graphics that has brought these two separate subjects together, generating an intriguing, challenging, and yet beautiful new subject: multiple view geometry.
Part I
Introductory material
Chapter 2
Representation of a three-dimensional moving scene
Note that we use the same symbol $v$ for a vector and for its coordinates.
The quantity $\|v\| = \sqrt{\langle v, v\rangle}$ is called the Euclidean norm (or 2-norm) of the vector $v$. It can be shown that, by a proper choice of the Cartesian frame, any inner product in $\mathbb{E}^3$ can be converted to the following familiar form:
$$\langle u, v\rangle = u^T v = u_1 v_1 + u_2 v_2 + u_3 v_3. \qquad (2.1)$$
The cross product of two vectors $u, v \in \mathbb{R}^3$ is defined as
$$u \times v = \begin{bmatrix} u_2 v_3 - u_3 v_2\\ u_3 v_1 - u_1 v_3\\ u_1 v_2 - u_2 v_1\end{bmatrix} \in \mathbb{R}^3.$$
It is immediate from this definition that the cross product of two vectors is linear: $u \times (\alpha v + \beta w) = \alpha\, u \times v + \beta\, u \times w$, $\forall \alpha, \beta \in \mathbb{R}$. Furthermore, it is immediate to verify that
$$\langle u \times v, u\rangle = \langle u \times v, v\rangle = 0, \qquad u \times v = -v \times u.$$
The length of a curve $\gamma(\cdot)$ traced by a point with coordinates $X(t)$ is
$$l(\gamma(\cdot)) = \int \|\dot X(t)\|\, dt, \qquad \text{where } \dot X(t) = \frac{d}{dt}\big(X(t)\big).$$
The cross product can equivalently be written as a matrix-vector product, $u \times v = \hat u\, v$, where $\hat u$ is the skew-symmetric matrix
$$\hat u \doteq \begin{bmatrix} 0 & -u_3 & u_2\\ u_3 & 0 & -u_1\\ -u_2 & u_1 & 0\end{bmatrix} \in \mathbb{R}^{3\times 3}. \qquad (2.2)$$
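As a quick numerical illustration (ours, not part of the book), the hat operator and its relation to the cross product can be checked in a few lines of Python; the function name hat() is our own choice.

import numpy as np

def hat(u):
    # Map a vector u in R^3 to the skew-symmetric matrix u_hat of equation (2.2).
    return np.array([[0.0, -u[2], u[1]],
                     [u[2], 0.0, -u[0]],
                     [-u[1], u[0], 0.0]])

u = np.array([1.0, 2.0, 3.0])
v = np.array([0.5, -1.0, 2.0])
assert np.allclose(np.cross(u, v), hat(u) @ v)      # u x v = u_hat v
assert np.allclose(hat(u).T, -hat(u))                # u_hat is skew symmetric
assert abs(np.dot(np.cross(u, v), u)) < 1e-12        # u x v is orthogonal to u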
A rigid object moving in space preserves the distance between any pair of points $p$ and $q$ on it:
$$\|X_p(t) - X_q(t)\| = \text{constant}, \qquad \forall t \in \mathbb{R}. \qquad (2.3)$$
In other words, if $v$ is the vector defined by the two points $p$ and $q$, then the norm (or length) of $v$ remains the same as the object moves: $\|v(t)\| =$ constant. A rigid body motion is then a family of transformations that describe how the coordinates of every point on the object change as a function of time. We denote it by $g$:
$$g(t) : \mathbb{R}^3 \to \mathbb{R}^3;\qquad X \mapsto g(t)(X).$$
If, instead of looking at the entire continuous moving path of the object, we concentrate on the transformation between its initial and final configurations, this transformation is usually called a rigid body displacement and is denoted by a single mapping:
$$g : \mathbb{R}^3 \to \mathbb{R}^3;\qquad X \mapsto g(X).$$
Besides transforming the coordinates of points, $g$ also induces a transformation on vectors. Suppose $v$ is a vector defined by two points $p$ and $q$ with coordinates $Y$ and $X$, respectively: $v = Y - X$; then, after the transformation $g$, we obtain a new vector
$$g_*(v) \doteq g(Y) - g(X).$$
That $g$ preserves the distance between any two points can be expressed in terms of vectors as $\|g_*(v)\| = \|v\|$ for all $v \in \mathbb{R}^3$.
However, preserving distances between points is not the only requirement
that a rigid object moving in space satisfies. In fact, there are transformations that preserve distances, and yet they are not physically realizable.
For instance, the mapping
$$f : [X_1, X_2, X_3]^T \mapsto [X_1, X_2, -X_3]^T$$
preserves distances but not orientation: it corresponds to a reflection of points about the $XY$ plane, as in a double-sided mirror. To rule out this type
of mapping, we require that any rigid body motion, besides preserving
distance, preserves orientation as well. That is, in addition to preserving
the norm of vectors, it must preserve their cross product. The coordinate
transformation induced by a rigid body motion is called a special Euclidean
transformation. The word special indicates the fact that it is orientation-preserving.
Definition 2.3 (Rigid body motion or special Euclidean transformation). A mapping $g : \mathbb{R}^3 \to \mathbb{R}^3$ is a rigid body motion or a special Euclidean transformation if it preserves the norm and the cross product of any two vectors:
1. Norm: $\|g_*(v)\| = \|v\|$, $\forall v \in \mathbb{R}^3$;
2. Cross product: $g_*(u) \times g_*(v) = g_*(u \times v)$, $\forall u, v \in \mathbb{R}^3$.
Since the inner product can be recovered from the norm (by the polarization identity), preservation of the norm is equivalent to preservation of the inner product:
$$u^T v = g_*(u)^T g_*(v), \qquad \forall u, v \in \mathbb{R}^3. \qquad (2.4)$$
In other words, a rigid body motion can also be defined as one that preserves
both inner product and cross product.
How do these properties help us describe rigid motion concisely? The fact that distances and orientations are preserved means that individual points on the object cannot translate relative to each other. They can rotate relative to each other, but only collectively, so as not to alter the mutual distances between points. Therefore, a rigid body motion
can be described by the motion of any one point on the body, and the
rotation of a coordinate frame attached to that point. In order to do this,
we represent the configuration of a rigid body by attaching a Cartesian
coordinate frame to some point on the rigid body and keeping track of the
motion of this coordinate frame relative to a fixed frame.
To see this, consider a fixed (world) coordinate frame given by three orthonormal vectors $e_1, e_2, e_3 \in \mathbb{R}^3$; that is, they satisfy
$$e_i^T e_j = \delta_{ij} \doteq \begin{cases}1, & i = j,\\ 0, & i \neq j.\end{cases} \qquad (2.6)$$
Typically, the vectors are ordered so as to form a right-handed frame: $e_1 \times e_2 = e_3$. Then, after a rigid body motion $g$, we have
$$g_*(e_i)^T g_*(e_j) = \delta_{ij}, \qquad g_*(e_1) \times g_*(e_2) = g_*(e_3). \qquad (2.7)$$
That is, the resulting three vectors still form a right-handed orthonormal
frame. Therefore, a rigid object can always be associated with a righthanded, orthonormal frame, and its rigid body motion can be entirely
specified by the motion of such a frame, which we call the object frame. In Figure 2.1 we show an object (in this case a camera) moving relative to a
fixed world coordinate frame W . In order to specify the configuration of
the camera relative to the world frame W , one may pick a fixed point o on
the camera and attach to it an orthonormal frame, the camera coordinate
frame C. When the camera moves, the camera frame also moves as if it
were fixed to the camera. The configuration of the camera is then determined by (1) the vector between the origin of the world frame o and the
camera frame, g(o), called the translational part and denoted as T , and
(2) the relative orientation of the camera frame C, with coordinate axes
15
g (e1 ), g (e2 ), g (e3 ), relative to the fixed world frame W with coordinate
axes $e_1, e_2, e_3$, called the rotational part and denoted by $R$.
In the case of vision, there is no obvious choice of the origin o and the
reference frame e1 , e2 , e3 . Therefore, we could choose the world frame to
be attached to the camera and specify the translation and rotation of the
scene (assuming it is rigid!), or we could attach the world frame to the
scene and specify the motion of the camera. All that matters is the relative
motion between the scene and the camera, and what choice of reference
frame to make is, from the point of view of geometry, arbitrary.
Remark 2.1. The set of rigid body motions, or special Euclidean transformations, is a (Lie) group, the so-called special Euclidean group, typically denoted by SE(3). Algebraically, a group is a set $G$ with an operation of (binary) multiplication $\circ$ on elements of $G$ which is:
- closed: if $g_1, g_2 \in G$, then $g_1 \circ g_2 \in G$;
- associative: $(g_1 \circ g_2) \circ g_3 = g_1 \circ (g_2 \circ g_3)$, for all $g_1, g_2, g_3 \in G$;
- with unit element $e$: $e \circ g = g \circ e = g$, for all $g \in G$;
- invertible: for every element $g \in G$, there exists an element $g^{-1} \in G$ such that $g \circ g^{-1} = g^{-1} \circ g = e$.
In the next few sections, we will focus on studying in detail how to represent
the special Euclidean group SE(3). More specifically, we will introduce a
way to realize elements in the special Euclidean group SE(3) as elements
in a group of $n \times n$ non-singular (real) matrices whose multiplication is simply the matrix multiplication. Such a group of matrices is usually called a general linear group, denoted by $GL(n)$, and such a realization is called a matrix representation. A representation is a map
$$\mathcal{R} : SE(3) \to GL(n);\qquad g \mapsto \mathcal{R}(g)$$
which preserves the group structure of SE(3). That is, the inverse of a rigid body motion and the composition of two rigid body motions are preserved by the map in the following way:
$$\mathcal{R}(g^{-1}) = \mathcal{R}(g)^{-1}, \qquad \mathcal{R}(g \circ h) = \mathcal{R}(g)\mathcal{R}(h), \qquad \forall g, h \in SE(3). \qquad (2.8)$$
Figure 2.1. A rigid body motion $g = (R, T)$ which, in this instance, is between a camera and a world coordinate frame.
Figure 2.2. Rotation of a rigid body about a fixed point o. The solid coordinate frame W is fixed and the dashed coordinate frame C is attached to the rotating rigid body.
The coordinates $r_1, r_2, r_3 \in \mathbb{R}^3$ of the three axes of the rotating frame $C$, expressed relative to the fixed frame $W$, can be stacked as the columns of a matrix $R_{wc} \doteq [r_1, r_2, r_3]$, which satisfies $R_{wc}^T R_{wc} = I$. Any matrix which satisfies this identity is called an orthogonal matrix. It follows from the definition that the inverse of an orthogonal matrix is simply its transpose: $R_{wc}^{-1} = R_{wc}^T$. Since $r_1, r_2, r_3$ form a right-handed frame, the determinant of $R_{wc}$ must be $+1$. This can be seen by looking at the determinant of the rotation matrix,
$$\det R = r_1^T (r_2 \times r_3),$$
which is equal to $+1$ for right-handed coordinate systems. Hence $R_{wc}$ is a special orthogonal matrix where, as before, the word special indicates that it is orientation-preserving. The space of all such special orthogonal matrices in $\mathbb{R}^{3\times3}$ is usually denoted by
$$SO(3) \doteq \{R \in \mathbb{R}^{3\times3} \mid R^T R = I, \; \det(R) = +1\}.$$
Traditionally, $3 \times 3$ special orthogonal matrices are called rotation matrices for obvious reasons. It is straightforward to verify that SO(3) has a group structure; that is, it satisfies all four axioms of a group mentioned in the previous section. We leave the proof to the reader as an exercise. Therefore, the space SO(3) is also referred to as the special orthogonal group of $\mathbb{R}^3$,
or simply the rotation group. Directly from the definition of the rotation
matrix, we can show that rotation indeed preserves both the inner product
and the cross product of vectors. We also leave this as an exercise to the
reader.
Going back to Figure 2.2, every rotation matrix $R_{wc} \in SO(3)$ represents a possible configuration of the object rotated about the point $o$. Besides this, $R_{wc}$ plays a second role as the matrix that represents the coordinate transformation from the frame $C$ to the frame $W$. To see this, suppose that, for a given point $p \in \mathbb{E}^3$, its coordinates with respect to the frame $W$ are $X_w = [X_{1w}, X_{2w}, X_{3w}]^T \in \mathbb{R}^3$. Since $r_1, r_2, r_3$ also form a basis for $\mathbb{R}^3$, $X_w$ can be expressed as a linear combination of these three vectors, say $X_w = X_{1c}r_1 + X_{2c}r_2 + X_{3c}r_3$ with $[X_{1c}, X_{2c}, X_{3c}]^T \in \mathbb{R}^3$. Obviously, $X_c = [X_{1c}, X_{2c}, X_{3c}]^T$ are the coordinates of the same point $p$ with respect to the frame $C$, and therefore $X_w = R_{wc}X_c$.
Now consider a camera frame that rotates relative to the world frame over time $t$.
Then, for a rotating camera, the world coordinates Xw of a fixed 3-D point
p are transformed to its coordinates relative to the camera frame C by:
Xc (t) = Rcw (t)Xw .
Alternatively, if a point $p$ is fixed with respect to the camera frame and has coordinates $X_c$ there, its world coordinates $X_w(t)$ as a function of $t$ are given by:
Xw (t) = Rwc (t)Xc .
Now consider the rotation $R(t)$ as a continuous function of time. Since $R(t)R^T(t) = I$ for all $t$, differentiating both sides with respect to time yields
$$\dot R(t)R^T(t) + R(t)\dot R^T(t) = 0 \quad\Longrightarrow\quad \dot R(t)R^T(t) = -\big(\dot R(t)R^T(t)\big)^T.$$
The resulting constraint reflects the fact that the matrix $\dot R(t)R^T(t) \in \mathbb{R}^{3\times3}$ is skew symmetric (see Appendix A). Then, as we have seen, there must exist a vector, say $\omega(t) \in \mathbb{R}^3$, such that
$$\hat\omega(t) = \dot R(t)R^T(t),$$
or, multiplying both sides by $R(t)$ on the right,
$$\dot R(t) = \hat\omega(t)R(t). \qquad (2.9)$$
If $\omega(t) = \omega$ is constant, this becomes a linear ordinary differential equation (ODE) with constant coefficients:
$$\dot R(t) = \hat\omega R(t). \qquad (2.10)$$
Consider first the analogous linear ODE on $\mathbb{R}^3$,
$$\dot x(t) = \hat\omega\, x(t), \qquad x(t) \in \mathbb{R}^3.$$
It is then immediate to verify that the solution to the above ODE is given by
$$x(t) = e^{\hat\omega t}x(0), \qquad (2.11)$$
where $e^{\hat\omega t}$ is the matrix exponential
$$e^{\hat\omega t} = I + \hat\omega t + \frac{(\hat\omega t)^2}{2!} + \cdots + \frac{(\hat\omega t)^n}{n!} + \cdots. \qquad (2.12)$$
By the same argument, the solution of (2.10) with initial condition $R(0) = I$ is
$$R(t) = e^{\hat\omega t}. \qquad (2.13)$$
To confirm that the matrix $e^{\hat\omega t}$ is indeed a rotation matrix, one can directly show from the definition of the matrix exponential that
$$\big(e^{\hat\omega t}\big)^{-1} = e^{-\hat\omega t} = e^{\hat\omega^T t} = \big(e^{\hat\omega t}\big)^T,$$
so that $(e^{\hat\omega t})^T e^{\hat\omega t} = I$. The matrix exponential therefore defines a map from the space of skew-symmetric matrices to the rotation matrices:
$$\exp : so(3) \to SO(3);\qquad \hat\omega \mapsto e^{\hat\omega} \in SO(3).$$
Note that we obtained the expression (2.13) by assuming that $\omega(t)$ in (2.9) is constant. This is, however, not always the case. So a question naturally arises: can every rotation matrix $R \in SO(3)$ be expressed in an exponential form as in (2.13)? The answer is yes, and the fact is stated in the following theorem:
Theorem 2.1 (Surjectivity of the exponential map onto SO(3)). For any $R \in SO(3)$, there exists a (not necessarily unique) $\omega \in \mathbb{R}^3$ with $\|\omega\| = 1$ and $t \in \mathbb{R}$ such that $R = e^{\hat\omega t}$.
Proof. The proof of this theorem is by construction: given a rotation matrix $R$ with entries $r_{ij}$, the rotation angle $t$ and the rotation axis $\omega$ can be computed as
$$t = \arccos\!\left(\frac{\mathrm{trace}(R) - 1}{2}\right), \qquad \omega = \frac{1}{2\sin(t)}\begin{bmatrix} r_{32} - r_{23}\\ r_{13} - r_{31}\\ r_{21} - r_{12}\end{bmatrix}.$$
One can then verify that $R = e^{\hat\omega t}$.
The significance of this theorem is that any rotation matrix can be realized by rotating about some fixed axis by a certain angle. The theorem, however, only guarantees the surjectivity of the exponential map from so(3) to SO(3). Unfortunately, this map is not injective, hence not one-to-one. This will become clear after we have introduced the so-called Rodrigues formula for computing $R = e^{\hat\omega t}$.
From the constructive proof of Theorem 2.1, we now know how to compute the exponential coordinates $(\omega, t)$ for a given rotation matrix $R \in SO(3)$. In the opposite direction, given $(\omega, t)$, how do we effectively compute the corresponding rotation matrix $R = e^{\hat\omega t}$? One can certainly use the series (2.12) from the definition. The following theorem, however, simplifies the computation dramatically:
Theorem 2.2 (Rodrigues formula for rotation matrix). Given $\omega \in \mathbb{R}^3$ with $\|\omega\| = 1$ and $t \in \mathbb{R}$, the matrix exponential $R = e^{\hat\omega t}$ is given by the following formula:
$$e^{\hat\omega t} = I + \hat\omega\sin(t) + \hat\omega^2\big(1 - \cos(t)\big). \qquad (2.14)$$
Proof (sketch). For a unit vector $\omega$, the powers of $\hat\omega$ satisfy
$$\hat\omega^2 = \omega\omega^T - I, \qquad \hat\omega^3 = -\hat\omega,$$
so the series (2.12) can be regrouped into one term in $\hat\omega$ and one term in $\hat\omega^2$; what appears in the brackets are exactly the series for $\sin(t)$ and $(1 - \cos(t))$. Hence $e^{\hat\omega t} = I + \hat\omega\sin(t) + \hat\omega^2(1 - \cos(t))$.
Notice that $e^{\hat\omega(t + 2k\pi)} = e^{\hat\omega t}$ for all $k \in \mathbb{Z}$. Hence, for a given rotation matrix $R \in SO(3)$ there are typically infinitely many exponential coordinates $(\omega, t)$ such that $e^{\hat\omega t} = R$. The exponential map $\exp : so(3) \to SO(3)$ is therefore not one-to-one. It is also useful to know that the exponential map is not commutative: for two $\hat\omega_1, \hat\omega_2 \in so(3)$, usually
$$e^{\hat\omega_1}e^{\hat\omega_2} \neq e^{\hat\omega_1 + \hat\omega_2}$$
unless $\hat\omega_1\hat\omega_2 = \hat\omega_2\hat\omega_1$.
The quantity that measures this non-commutativity is the Lie bracket
$$[\hat\omega_1, \hat\omega_2] \doteq \hat\omega_1\hat\omega_2 - \hat\omega_2\hat\omega_1, \qquad \hat\omega_1, \hat\omega_2 \in so(3).$$
Obviously, $[\hat\omega_1, \hat\omega_2]$ is also a skew-symmetric matrix in so(3). The linear structure of so(3), together with the Lie bracket, forms the Lie algebra of the (Lie) group SO(3). For more details on the Lie group structure of SO(3), the reader may refer to [MLS94]. The set of all rotation matrices $e^{\hat\omega t}$, $t \in \mathbb{R}$, is called a one-parameter subgroup of SO(3), and the multiplication in such a subgroup is commutative: for the same $\omega \in \mathbb{R}^3$, we have
$$e^{\hat\omega t_1}e^{\hat\omega t_2} = e^{\hat\omega t_2}e^{\hat\omega t_1} = e^{\hat\omega(t_1 + t_2)}, \qquad \forall t_1, t_2 \in \mathbb{R}.$$
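A minimal sketch (ours, not the book's) of the Rodrigues formula (2.14) and of the exponential coordinates recovered as in the proof of Theorem 2.1; it assumes $\|\omega\| = 1$ and a rotation angle strictly between 0 and $\pi$.

import numpy as np

def hat(w):
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def rodrigues(w, t):
    # R = exp(w_hat * t) via equation (2.14); w must be a unit vector.
    W = hat(w)
    return np.eye(3) + W * np.sin(t) + W @ W * (1.0 - np.cos(t))

def log_SO3(R):
    # Recover (w, t) from R as in the proof of Theorem 2.1 (valid for 0 < t < pi).
    t = np.arccos((np.trace(R) - 1.0) / 2.0)
    w = np.array([R[2, 1] - R[1, 2],
                  R[0, 2] - R[2, 0],
                  R[1, 0] - R[0, 1]]) / (2.0 * np.sin(t))
    return w, t

w = np.array([1.0, 2.0, 2.0]) / 3.0     # a unit rotation axis
t = 0.7                                  # rotation angle
R = rodrigues(w, t)
assert np.allclose(R.T @ R, np.eye(3)) and np.isclose(np.linalg.det(R), 1.0)
w2, t2 = log_SO3(R)
assert np.allclose(w, w2) and np.isclose(t, t2)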
Another popular representation of rotations uses quaternions. The space of quaternions $\mathbb{H}$ can be viewed as $\mathbb{R}^4$ endowed with a multiplication: a quaternion is written as
$$q = q_0 + q_1 i + q_2 j + q_3 ij, \qquad q_0, q_1, q_2, q_3 \in \mathbb{R}, \qquad (2.16)$$
where the symbols $i$ and $j$ satisfy $i^2 = j^2 = -1$ and $ij = -ji$. The conjugate of $q$ is
$$\bar q = q_0 - q_1 i - q_2 j - q_3 ij, \qquad (2.17)$$
its norm is $\|q\|^2 = q\bar q = q_0^2 + q_1^2 + q_2^2 + q_3^2$, and the inverse of a nonzero quaternion is $q^{-1} = \bar q/\|q\|^2$.
The multiplication and inverse rules defined above in fact endow the space $\mathbb{R}^4$ with the algebraic structure of a skew field. $\mathbb{H}$ is often called the Hamiltonian field, besides its more common name, the quaternion field.
One important use of the quaternion field $\mathbb{H}$ is that we can embed the rotation group SO(3) into it. To see this, let us focus on a special subgroup of $\mathbb{H}$, the so-called unit quaternions:
$$\mathbb{S}^3 = \{q \in \mathbb{H} \mid \|q\|^2 = q_0^2 + q_1^2 + q_2^2 + q_3^2 = 1\}. \qquad (2.20)$$
It is obvious that the set of all unit quaternions is simply the unit sphere
in R4 . To show that S3 is indeed a group, we simply need to prove that
it is closed under the multiplication and inverse of quaternions, i.e. the
multiplication of two unit quaternions is still a unit quaternion and so is
the inverse of a unit quaternion. We leave this simple fact as an exercise to
the reader.
Given a rotation matrix $R = e^{\hat\omega t}$ with $\omega = [\omega_1, \omega_2, \omega_3]^T \in \mathbb{R}^3$, $\|\omega\| = 1$, and $t \in \mathbb{R}$, we can associate to it a unit quaternion as follows:
$$q(R) = \cos(t/2) + \sin(t/2)\,(\omega_1 i + \omega_2 j + \omega_3 ij) \in \mathbb{S}^3. \qquad (2.21)$$
One may verify that this association preserves the group structure between SO(3) and $\mathbb{S}^3$:
$$q(R^{-1}) = q^{-1}(R), \qquad q(R_1 R_2) = q(R_1)q(R_2), \qquad \forall R, R_1, R_2 \in SO(3). \qquad (2.22)$$
Further study shows that this association is also faithful: different rotation matrices are associated with different unit quaternions. In the opposite direction, given a unit quaternion $q = q_0 + q_1 i + q_2 j + q_3 ij \in \mathbb{S}^3$, we can use the following formulas to find the corresponding rotation matrix $R(q) = e^{\hat\omega t}$:
$$t = 2\arccos(q_0), \qquad \omega_m = \begin{cases} q_m/\sin(t/2), & t \neq 0,\\ 0, & t = 0,\end{cases} \qquad m = 1, 2, 3. \qquad (2.23)$$
However, one must notice that, according to the above formulas, there are two unit quaternions that correspond to the same rotation matrix: $R(q) = R(-q)$, as shown in Figure 2.3. Therefore, topologically, $\mathbb{S}^3$ is a double covering of SO(3).
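The conversions (2.21) and (2.23) between exponential coordinates and unit quaternions are easy to sketch in code (our illustration; it assumes $\|\omega\| = 1$ and $0 < t < 2\pi$).

import numpy as np

def quat_from_axis_angle(w, t):
    # Unit quaternion q = [q0, q1, q2, q3] from (w, t) as in (2.21).
    return np.concatenate(([np.cos(t / 2.0)], np.sin(t / 2.0) * w))

def axis_angle_from_quat(q):
    # Recover (w, t) from a unit quaternion as in (2.23).
    t = 2.0 * np.arccos(q[0])
    if np.isclose(t, 0.0):
        return np.zeros(3), 0.0
    return q[1:] / np.sin(t / 2.0), t

w = np.array([0.0, 0.6, 0.8])            # unit rotation axis
t = 1.2                                   # rotation angle
q = quat_from_axis_angle(w, t)
assert np.isclose(np.dot(q, q), 1.0)      # q lies on the unit sphere S^3
w2, t2 = axis_angle_from_quat(q)
assert np.allclose(w, w2) and np.isclose(t, t2)
# Note: q and -q represent the same rotation (axis -w, angle 2*pi - t).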
Both the exponential coordinates and the unit quaternions are global parameterizations: they represent every rotation matrix in practically the same way. On the other hand, the Lie-Cartan coordinates introduced below fall into the category of local parameterizations: they are only good for a portion of SO(3), not for the entire space. The advantage of such local parameterizations is that we usually need only three parameters to describe a rotation matrix, instead of four as for both the exponential coordinates $(\omega, t) \in \mathbb{R}^4$ and the unit quaternions $q \in \mathbb{S}^3 \subset \mathbb{R}^4$.
In the space of skew-symmetric matrices so(3), pick a basis $(\hat\omega_1, \hat\omega_2, \hat\omega_3)$, i.e. the three vectors $\omega_1, \omega_2, \omega_3$ are linearly independent. Define a mapping (a parameterization) from $\mathbb{R}^3$ to SO(3) as
$$\alpha : (\alpha_1, \alpha_2, \alpha_3) \mapsto \exp(\alpha_1\hat\omega_1 + \alpha_2\hat\omega_2 + \alpha_3\hat\omega_3). \qquad (2.24)$$
We now study how to represent a rigid body motion in general, that is, a motion with both rotation and translation.
Figure 2.4 illustrates a moving rigid object with a coordinate frame $C$ attached to it.
Figure 2.4. A rigid body motion between a moving frame C and a world frame
W.
To describe the coordinates of a point $p$ on the object with respect to the world frame $W$, it is clear from the figure that the vector $X_w$ is simply the sum of the translation $T_{wc} \in \mathbb{R}^3$ of the center of frame $C$ relative to that of frame $W$ and the vector $X_c$, but expressed relative to the frame $W$. Since $X_c$ are the coordinates of the point $p$ relative to the frame $C$, with respect to the world frame $W$ they become $R_{wc}X_c$, where $R_{wc} \in SO(3)$ is the relative rotation between the two frames. Hence the coordinates $X_w$ are given by:
$$X_w = R_{wc}X_c + T_{wc}. \qquad (2.25)$$
Usually, we denote the full rigid motion by $g_{wc} = (R_{wc}, T_{wc})$, or simply $g = (R, T)$ if the frames involved are clear from the context. Then $g$ represents not only a description of the configuration of the rigid body but also a transformation of coordinates between the frames. In compact form we may simply write:
$$X_w = g_{wc}(X_c).$$
The set of all possible configurations of a rigid body can then be described as
$$SE(3) \doteq \{g = (R, T) \mid R \in SO(3),\; T \in \mathbb{R}^3\} = SO(3) \times \mathbb{R}^3,$$
the so-called special Euclidean group SE(3). Note that $g = (R, T)$ is not yet a matrix representation of the group SE(3) as defined in Section 2.2. To obtain such a representation, we must introduce the so-called homogeneous coordinates.
Appending a 1 to the coordinates $X = [X_1, X_2, X_3]^T \in \mathbb{R}^3$ of a point $p$ yields its homogeneous coordinates:
$$\bar X \doteq \begin{bmatrix} X_1\\ X_2\\ X_3\\ 1\end{bmatrix} \in \mathbb{R}^4.$$
The homogeneous coordinates of a vector $v = Y - X$ are correspondingly defined as the difference between the homogeneous coordinates of the two points, and therefore have a 0 as their last entry:
$$\bar v \doteq \bar Y - \bar X = \begin{bmatrix} v_1\\ v_2\\ v_3\\ 0\end{bmatrix} \in \mathbb{R}^4.$$
Notice that, in $\mathbb{R}^4$, vectors of the above form span a subspace, hence all linear structures of the original vectors $v \in \mathbb{R}^3$ are perfectly preserved by the new representation. Using this notation, the transformation (2.25) can be rewritten as
$$\bar X_w = \begin{bmatrix} R_{wc} & T_{wc}\\ 0 & 1\end{bmatrix}\bar X_c \;\doteq\; \bar g_{wc}\bar X_c.$$
The homogeneous representation of $g = (R, T) \in SE(3)$ is therefore the $4\times4$ matrix
$$g = \begin{bmatrix} R & T\\ 0 & 1\end{bmatrix} \in \mathbb{R}^{4\times4},$$
and its inverse is again an element of SE(3):
$$g^{-1} = \begin{bmatrix} R & T\\ 0 & 1\end{bmatrix}^{-1} = \begin{bmatrix} R^T & -R^T T\\ 0 & 1\end{bmatrix} \in SE(3).$$
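In code, the homogeneous representation and its closed-form inverse look as follows (a minimal sketch of ours, not a library API).

import numpy as np

def se3_matrix(R, T):
    # Homogeneous 4x4 representation g = [[R, T], [0, 1]] of g = (R, T).
    g = np.eye(4)
    g[:3, :3] = R
    g[:3, 3] = T
    return g

def se3_inverse(g):
    # Closed-form inverse [[R^T, -R^T T], [0, 1]], cheaper than a generic matrix inverse.
    R, T = g[:3, :3], g[:3, 3]
    return se3_matrix(R.T, -R.T @ T)

# Example: rotation of 90 degrees about the z-axis plus a translation.
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
T = np.array([1.0, 2.0, 3.0])
g = se3_matrix(R, T)
assert np.allclose(g @ se3_inverse(g), np.eye(4))

# Transforming a point: append 1 to its coordinates (homogeneous coordinates).
Xc = np.array([0.5, 0.0, 2.0, 1.0])
Xw = g @ Xc     # equation (2.25) written as a single matrix product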
In the homogeneous representation, the action of a rigid body motion on a vector $v = X(q) - X(p)$ becomes linear:
$$g_*(\bar v) = g\bar X(q) - g\bar X(p) = g\bar v.$$
Consider now a rigid body motion that varies continuously over time, $g(t) \in SE(3)$, and the matrix $\dot g(t)g^{-1}(t)$:
$$\dot g(t)g^{-1}(t) = \begin{bmatrix} \dot R(t)R^T(t) & \dot T(t) - \dot R(t)R^T(t)T(t)\\ 0 & 0\end{bmatrix}. \qquad (2.27)$$
Defining $\hat\omega(t) \doteq \dot R(t)R^T(t)$ and $v(t) \doteq \dot T(t) - \hat\omega(t)T(t)$, we can write
$$\dot g(t)g^{-1}(t) = \begin{bmatrix} \hat\omega(t) & v(t)\\ 0 & 0\end{bmatrix} \;\doteq\; \hat\xi(t) \in \mathbb{R}^{4\times4},$$
so that
$$\dot g(t) = \big(\dot g(t)g^{-1}(t)\big)g(t) = \hat\xi(t)g(t). \qquad (2.28)$$
The matrix $\hat\xi(t)$ can be viewed as the tangent vector along the curve $g(t)$ and can be used to approximate $g(t)$ locally:
$$g(t + dt) \approx g(t) + \hat\xi(t)g(t)\,dt = \big(I + \hat\xi(t)\,dt\big)g(t).$$
A matrix of the form $\hat\xi$ is called a twist. The set of all twists is denoted by
$$se(3) \doteq \left\{\hat\xi = \begin{bmatrix} \hat\omega & v\\ 0 & 0\end{bmatrix} \;\middle|\; \hat\omega \in so(3),\; v \in \mathbb{R}^3\right\} \subset \mathbb{R}^{4\times4}.$$
se(3) is called the tangent space (or Lie algebra) of the matrix group SE(3). We also define two operators, $\vee$ and $\wedge$, to convert between a twist $\hat\xi \in se(3)$ and its twist coordinates $\xi \in \mathbb{R}^6$ as follows:
$$\begin{bmatrix} \hat\omega & v\\ 0 & 0\end{bmatrix}^{\vee} \doteq \begin{bmatrix} v\\ \omega\end{bmatrix} \in \mathbb{R}^6, \qquad \begin{bmatrix} v\\ \omega\end{bmatrix}^{\wedge} \doteq \begin{bmatrix} \hat\omega & v\\ 0 & 0\end{bmatrix} \in \mathbb{R}^{4\times4}.$$
If the twist $\hat\xi$ is constant, equation (2.28) becomes the linear ODE $\dot g(t) = \hat\xi g(t)$, whose solution is
$$g(t) = e^{\hat\xi t}g(0).$$
Assuming the initial condition $g(0) = I$, we may conclude that
$$g(t) = e^{\hat\xi t},$$
where the twist exponential is
$$e^{\hat\xi t} = I + \hat\xi t + \frac{(\hat\xi t)^2}{2!} + \cdots + \frac{(\hat\xi t)^n}{n!} + \cdots. \qquad (2.29)$$
Using the identities of $\hat\omega$ discussed above, this series can be summed in closed form (for $\|\omega\| = 1$):
$$e^{\hat\xi t} = \begin{bmatrix} e^{\hat\omega t} & \big(I - e^{\hat\omega t}\big)\hat\omega v + \omega\omega^T v\, t\\ 0 & 1\end{bmatrix}. \qquad (2.30)$$
It is clear from this expression that the exponential of $\hat\xi t$ is indeed a rigid body transformation matrix in SE(3). Therefore the exponential map defines a mapping from the space se(3) to SE(3),
$$\exp : se(3) \to SE(3);\qquad \hat\xi \in se(3) \mapsto e^{\hat\xi} \in SE(3),$$
and the twist $\hat\xi \in se(3)$ is also called the exponential coordinates for SE(3), as $\hat\omega \in so(3)$ is for SO(3).
One question remains: can every rigid body motion $g \in SE(3)$ be represented in such an exponential form? The answer is yes, and it is formulated in the following theorem:
Theorem 2.3 (Surjectivity of the exponential map onto SE(3)). For any $g \in SE(3)$, there exist (not necessarily unique) twist coordinates $\xi = (v, \omega)$ and $t \in \mathbb{R}$ such that $g = e^{\hat\xi t}$.
The proof is again constructive: the rotational part $(\omega, t)$ is recovered as in Theorem 2.1, and the translational part then follows from (2.30) as
$$v = \big[(I - e^{\hat\omega t})\hat\omega + \omega\omega^T t\big]^{-1}T.$$
The Lie bracket $[\hat\xi_1, \hat\xi_2] \doteq \hat\xi_1\hat\xi_2 - \hat\xi_2\hat\xi_1$ makes se(3) the Lie algebra of SE(3). Two rigid body motions $g_1 = e^{\hat\xi_1}$ and $g_2 = e^{\hat\xi_2}$ commute with each other, $g_1g_2 = g_2g_1$, only if $[\hat\xi_1, \hat\xi_2] = 0$.
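A sketch (ours) of the twist exponential (2.30), assuming $\|\omega\| = 1$; SciPy's general matrix exponential is used only as a numerical check. For $\omega = 0$ the formula degenerates to a pure translation.

import numpy as np
from scipy.linalg import expm    # generic matrix exponential, used here only to verify (2.30)

def hat(w):
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_twist(v, w, t):
    # e^{xi_hat t} for twist coordinates xi = (v, w) with ||w|| = 1, equation (2.30).
    W = hat(w)
    eWt = np.eye(3) + W * np.sin(t) + W @ W * (1.0 - np.cos(t))   # Rodrigues formula
    p = (np.eye(3) - eWt) @ W @ v + np.outer(w, w) @ v * t
    g = np.eye(4)
    g[:3, :3] = eWt
    g[:3, 3] = p
    return g

v = np.array([0.3, -0.2, 0.5])
w = np.array([0.0, 0.0, 1.0])
t = 0.9

xi_hat = np.zeros((4, 4))
xi_hat[:3, :3] = hat(w)
xi_hat[:3, 3] = v
assert np.allclose(exp_twist(v, w, t), expm(xi_hat * t))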
Consider now a point $p$ attached to a moving rigid body (or, equivalently, viewed from a moving camera). Its coordinates as a function of time are given by a rigid body motion $g(t) = (R(t), T(t))$ acting on the initial coordinates $X_0$:
$$X(t) = R(t)X_0 + T(t), \qquad (2.31)$$
or, in homogeneous representation,
$$\bar X(t) = g(t)\bar X_0. \qquad (2.32)$$
When several time instants $t_1, t_2 \in \mathbb{R}$ are involved, we denote by $g(t_2, t_1)$ the coordinate transformation from time $t_1$ to time $t_2$.
Then the following relationship holds among the coordinates of the same point $p$ at three time instants $t_1, t_2, t_3$:
$$X(t_3) = g(t_3, t_2)X(t_2) = g(t_3, t_2)g(t_2, t_1)X(t_1).$$
Comparing with the direct relationship between the coordinates at $t_3$ and $t_1$,
$$X(t_3) = g(t_3, t_1)X(t_1),$$
the following composition rule for consecutive motions must hold:
$$g(t_3, t_1) = g(t_3, t_2)g(t_2, t_1).$$
The composition rule describes the coordinates $X$ of the point $p$ relative to any camera position, if they are known with respect to a particular one. The same composition rule implies the rule of inverse,
$$g^{-1}(t_2, t_1) = g(t_1, t_2),$$
since $g(t_2, t_1)g(t_1, t_2) = g(t_2, t_2) = I$. In case time has no physical meaning for a particular problem, we may use $g_{ij}$ as a shorthand for $g(t_i, t_j)$, with $t_i, t_j \in \mathbb{R}$.
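A two-line numerical check (ours) of the composition and inverse rules, using randomly generated test poses in homogeneous form.

import numpy as np

rng = np.random.default_rng(1)

def random_pose(rng):
    # A random rigid body motion in homogeneous form, for testing only.
    A = rng.standard_normal((3, 3))
    Q, _ = np.linalg.qr(A)
    Q *= np.sign(np.linalg.det(Q))     # force det = +1 so that Q is a rotation
    g = np.eye(4)
    g[:3, :3] = Q
    g[:3, 3] = rng.standard_normal(3)
    return g

g21 = random_pose(rng)                       # g(t2, t1)
g32 = random_pose(rng)                       # g(t3, t2)
g31 = g32 @ g21                              # composition rule g(t3, t1) = g(t3, t2) g(t2, t1)
X1 = np.append(rng.standard_normal(3), 1.0)  # a point in homogeneous coordinates at time t1
assert np.allclose(g31 @ X1, g32 @ (g21 @ X1))
assert np.allclose(np.linalg.inv(g21) @ (g21 @ X1), X1)   # inverse rule g(t1, t2) = g(t2, t1)^{-1}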
Having understood the transformation of coordinates, we now study what happens to velocity. We know that the coordinates $X(t)$ of a point $p \in \mathbb{E}^3$ relative to a moving camera are a function of time $t$:
$$X(t) = g_{cw}(t)X_0.$$
Then the velocity of the point $p$ relative to the (instantaneous) camera frame is
$$\dot X(t) = \dot g_{cw}(t)X_0. \qquad (2.33)$$
Substituting $X_0 = g_{cw}^{-1}(t)X(t)$, the above equation can be rewritten as
$$\dot X(t) = \dot g_{cw}(t)g_{cw}^{-1}(t)X(t) \;\doteq\; \hat V^c_{cw}(t)X(t), \qquad (2.34)$$
where an expression for $\dot g_{cw}(t)g_{cw}^{-1}(t)$ can be found in (2.27). Since $\hat V^c_{cw}(t)$ is of the form
$$\hat V^c_{cw}(t) = \begin{bmatrix} \hat\omega(t) & v(t)\\ 0 & 0\end{bmatrix},$$
we can also write the velocity of the point in 3-D coordinates (instead of in homogeneous coordinates) as
$$\dot X(t) = \hat\omega(t)X(t) + v(t). \qquad (2.35)$$
The physical interpretation of the symbol $\hat V^c_{cw}$ is the velocity of the world frame moving relative to the camera frame, as viewed in the camera frame; the subscript and superscript of $\hat V^c_{cw}$ indicate exactly that. Usually, to clearly specify the physical meaning of a velocity, we need to specify the velocity of which frame, moving relative to which frame, and viewed in which frame. If we change where we view the velocity, the expression changes accordingly. For example, suppose that a viewer is in another coordinate frame, displaced relative to the camera frame by a rigid body transformation $g \in SE(3)$. Then the coordinates of the same point $p$ relative to this frame are $Y(t) = gX(t)$. Computing the velocity in the new frame, we have
$$\dot Y(t) = g\,\dot g_{cw}(t)g_{cw}^{-1}(t)\,g^{-1}Y(t) = g\hat V^c_{cw}g^{-1}Y(t),$$
so the new velocity is represented by
$$\hat V = g\hat V^c_{cw}g^{-1}.$$
This is simply the same physical quantity viewed from a different vantage point. We see that the two velocities are related through a mapping defined by the relative motion $g$:
$$\mathrm{ad}_g : se(3) \to se(3);\qquad \hat\xi \mapsto g\hat\xi g^{-1}.$$
This is the so-called adjoint map on the space se(3). Using this notation, in the previous example we have $\hat V = \mathrm{ad}_g(\hat V^c_{cw})$. Clearly, the adjoint map transforms a velocity from one frame to another. Using the fact that $g_{cw}(t)g_{wc}(t) = I$, it is straightforward to verify that
$$\hat V^c_{cw} = \dot g_{cw}g_{cw}^{-1} = -g_{wc}^{-1}\dot g_{wc} = -g_{cw}\big(\dot g_{wc}g_{wc}^{-1}\big)g_{cw}^{-1} = \mathrm{ad}_{g_{cw}}\big(-\hat V^w_{wc}\big).$$
Hence $\hat V^c_{cw}$ can also be interpreted as the negated velocity of the camera moving relative to the world frame, viewed in the (instantaneous) camera frame.
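A numerical illustration (ours) of the adjoint map: transforming a velocity by ad_g preserves its twist structure.

import numpy as np

def hat(w):
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def twist(v, w):
    xi = np.zeros((4, 4))
    xi[:3, :3] = hat(w)
    xi[:3, 3] = v
    return xi

def adjoint(g, xi):
    # ad_g(xi_hat) = g xi_hat g^{-1}: the same velocity viewed in a displaced frame.
    return g @ xi @ np.linalg.inv(g)

# A displacement g between the two viewing frames (90 degrees about z, plus a translation).
g = np.eye(4)
g[:3, :3] = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
g[:3, 3] = np.array([1.0, 0.0, 2.0])

V = twist(np.array([0.1, 0.2, 0.3]), np.array([0.0, 0.0, 0.5]))
V_new = adjoint(g, V)

# The result is again a twist: its rotational block is skew symmetric and its last row is zero.
assert np.allclose(V_new[:3, :3], -V_new[:3, :3].T)
assert np.allclose(V_new[3, :], 0.0)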
2.6 Summary
A rigid body motion is an element $g \in SE(3)$. The two most commonly used representations of elements of SE(3) are:
- Homogeneous representation: $g = \begin{bmatrix} R & T\\ 0 & 1\end{bmatrix} \in \mathbb{R}^{4\times4}$ with $R \in SO(3)$ and $T \in \mathbb{R}^3$;
- Twist representation: $g = e^{\hat\xi t}$, with twist coordinates $\xi = (v, \omega) \in \mathbb{R}^6$.
The velocity of a point with respect to a moving camera frame is $\dot X(t) = \hat V^c_{cw}(t)X(t)$, where $\hat V^c_{cw} = \dot g_{cw}g_{cw}^{-1}$ and $g_{cw}(t)$ is the configuration of the camera with respect to the world frame. Using the actual 3-D coordinates, the velocity of a 3-D point yields the familiar relationship
$$\dot X(t) = \hat\omega(t)X(t) + v(t).$$
2.7 References
The presentation of the material in this chapter follows the development
in [?]. More details on the abstract treatment of the material as well as
further references can also be found there.
2.8 Exercises
1. Linear vs. nonlinear maps
Suppose $A, B, C, X \in \mathbb{R}^{n\times n}$. Consider the following maps from $\mathbb{R}^{n\times n}$ to $\mathbb{R}^{n\times n}$ and determine whether they are linear or not. Give a brief proof if true and a counterexample if false:
(a) $X \mapsto AX + XB$
(b) $X \mapsto AX + BXC$
(c) $X \mapsto AXA - B$
(d) $X \mapsto AX + XBX$
2. Skew-symmetric matrices
Given $\omega = [\omega_1, \omega_2, \omega_3]^T \in \mathbb{R}^3$, define
$$\hat\omega = \begin{bmatrix} 0 & -\omega_3 & \omega_2\\ \omega_3 & 0 & -\omega_1\\ -\omega_2 & \omega_1 & 0\end{bmatrix}. \qquad (2.36)$$
According to the definition, $\hat\omega$ is skew-symmetric, i.e. $\hat\omega^T = -\hat\omega$. Now, for any matrix $A \in \mathbb{R}^{3\times3}$ with determinant $\det A = 1$, show that the following equation holds:
$$A^T\hat\omega A = \widehat{A^{-1}\omega}. \qquad (2.37)$$
3. Consider a rigid body motion $g = \begin{bmatrix} R & T\\ 0 & 1\end{bmatrix} \in SE(3)$ and a twist $\hat\xi = \begin{bmatrix} \hat\omega & v\\ 0 & 0\end{bmatrix} \in se(3)$.
Chapter 3
Image formation
This chapter introduces simple mathematical models of the image formation process. In a broad figurative sense, vision is the inverse problem of
image formation: the latter studies how objects give rise to images, while
the former attempts to use images to recover a description of objects in
space. Therefore, designing vision algorithms requires first developing a
suitable model of image formation. Suitable in this context does not necessarily mean physically accurate: the level of abstraction and complexity in
modeling image formation must trade off physical constraints and mathematical simplicity in order to result in a manageable model (i.e. one that
can be easily inverted). Physical models of image formation easily exceed
the level of complexity necessary and appropriate to this book, and determining the right model for the problem at hand is a form of engineering
art.
It comes as no surprise, then, that the study of image formation has for centuries been the domain of artistic reproduction and composition, more so than of mathematics and engineering. A rudimentary understanding of the geometry of image formation, which includes various models for projecting the three-dimensional world onto a plane (e.g., a canvas), is implicit in various forms of visual art from all ancient civilizations. However, the roots of formalizing the geometry of image formation into a mathematical model can be traced back to the work of Euclid in the 4th century B.C. Examples of correct perspective projection are visible in the stunning
frescoes and mosaics of Pompeii (Figure 3.1) from the first century B.C.
Unfortunately, these skills seem to have been lost with the fall of the Roman
empire, and it took over a thousand years for correct perspective projection
to dominate paintings again, in the late 14th century. It was the early renaissance painters who developed systematic methods for determining the
perspective projection of three-dimensional landscapes. The first treatise
on perspective was published by Leon Battista Alberti [?], who emphasized the eye's view of the world and capturing correctly the geometry of
the projection process. The renaissance coincided with the first attempts
to formalize the notion of perspective and place it on a solid analytical
footing. It is no coincidence that the early attempts to formalize the rules
of perspective came from artists proficient in architecture and engineering,
such as Alberti and Brunelleschi. Geometry, however, is only part of the story.
Figure 3.1. Frescoes from the first century B.C. in Pompeii. More (left) or less
(right) correct perspective projection is visible in the paintings. The skill was lost
during the middle ages, and it did not reappear in paintings until fifteen centuries
later, in the early renaissance.
(Figure 3.2: the image represented as a two-dimensional array of brightness values, displayed as a surface plot.)
The values taken by the map I depend upon physical properties of the
scene being viewed, such as its shape, its material reflectance properties
and the distribution of the light sources. Despite the fact that Figure 3.2
(Table 3.1: the brightness values of the same image, listed as a table of positive integers, one per pixel.)
and Table 3.1 do not seem very indicative of the properties of the scene
they portray, this is how they are represented in a computer. A different
representation of the same image that is better suited for interpretation by
the human visual system is obtained by generating a picture. A picture is
a scene - different from the true one - that produces on the imaging sensor
(the eye in this case) the same image as the original scene. In this sense
pictures are controlled illusions: they are scenes different from the true
ones (they are flat), that produce in the eye the same image as the original
scenes. A picture of the same image I described in Figure 3.2 and Table
3.1 is shown in Figure 3.3. Although the latter seems more informative on
the content of the scene, it is merely a different representation and contains
exactly the same information.
(Figure 3.3: the same image displayed as a picture.)
Figure 3.4. The rays parallel to the optical axis intersect at the focus.
Consider a point p at distance Z from a thin lens with focal length f, and two rays leaving it: one parallel to the optical axis, and one through the optical center (Figure 3.5). The first one intersects the optical axis at the focus; the second remains undeflected (by the defining properties of the thin lens). Call x the point where the two rays intersect, and let z be its distance from the optical center. By decomposing any other ray from p into a component parallel to the optical axis and one through the optical center, we can argue that all rays from p intersect at x on the opposite side of the lens. In particular, a ray from x parallel to the optical axis must go through p. Using similar triangles, from Figure 3.5 we obtain the fundamental equation of the thin lens:
$$\frac{1}{Z} + \frac{1}{z} = \frac{1}{f}.$$
The point x is called the image of the point p.
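As a quick numerical illustration (our made-up values, not from the book), the thin-lens equation gives the image distance z for a few object depths Z; with a 50 mm focal length, a point 5 m away focuses roughly 50.5 mm behind the lens.

f = 0.050                                  # focal length in meters
for Z in (0.5, 5.0, 50.0):                 # object distances in meters
    z = 1.0 / (1.0 / f - 1.0 / Z)          # image distance behind the lens, from 1/Z + 1/z = 1/f
    print(f"Z = {Z:5.1f} m  ->  z = {z * 1000:.2f} mm")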
Figure 3.5. Image of the point p is the point x of the intersection of rays going parallel to the optical axis and the ray through the optical center.
Under the ideal pin-hole model, in which all rays are forced to go through the optical center $o$ and remain undeflected, a point $p$ with coordinates $X = [X, Y, Z]^T$ relative to the camera frame projects onto the image point
$$x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z}. \qquad (3.2)$$
Note that any other point on the line through $p$ and the optical center projects onto the same coordinates $[x, y]^T$. This imaging model is the so-called ideal pin-hole model.
Figure 3.6. Image of the point p is the point x of the intersection of the ray going through the optical center o and an image plane at a distance f from the optical center.
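A minimal sketch (ours) of the ideal pin-hole projection (3.2), illustrating that all points on the same ray through the optical center project to the same image point.

import numpy as np

def pinhole_project(X, f=1.0):
    # Project 3-D points (rows of X, in the camera frame) with x = f X/Z, y = f Y/Z.
    X = np.atleast_2d(X)
    return f * X[:, :2] / X[:, 2:3]

p1 = np.array([1.0, 2.0, 4.0])
p2 = 3.0 * p1                      # same direction, three times farther away
assert np.allclose(pinhole_project(p1), pinhole_project(p2))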
In general, we denote the projection of a point in space onto the image by a map $\pi$:
$$X \mapsto x = \pi(X), \qquad (3.4)$$
where $x \in \mathbb{R}^2$ with $\pi(X) \doteq X/Z$ in the case of planar projection (e.g., onto the CCD), or $x \in \mathbb{S}^2$ with $\pi(X) \doteq X/\|X\|$ in the case of a spherical projection (e.g., onto the retina). We will not make a distinction between the two models, and will indicate the projection simply by $\pi$.
In order to express the direction $x_p$ in the camera frame, we consider the change of coordinates from the local coordinate frame at the point $p$ to the camera frame. For simplicity, we let the inertial frame coincide with the camera frame, so that $X \doteq g_p(o) = p$ and $x \sim g_p(x_p)$, where we note that $g_p$ is a rotation, so that $x$ depends on $x_p$, while $X$ depends on $p$. The reader should be aware that the transformation $g_p$ itself depends on the local shape of the surface at $p$, in particular its tangent plane and its normal at the point $p$. Once we substitute $x$ for $x_p$ into $E$ in (3.3), we obtain the radiance
$$R_1(p) \doteq E\big(g_p^{-1}(x), p\big) \qquad \text{where } x = \pi(p). \qquad (3.5)$$
2 We will systematically study the model of projection in the next few sections, but what is given below suffices for our discussion here.
3 The symbol $\sim$ indicates projective equivalence, that is, equality up to a scalar. Strictly speaking, $x$ and $x_p$ do not represent the same vector, but only the same direction (they have opposite sign and different lengths). However, they represent the same point in the projective plane, and therefore we will regard them as one and the same. In order to obtain the same embedded representation (i.e. a vector in $\mathbb{R}^3$ with the same coordinates), we would have to write $x = \pi(g_p(x_p))$. The same holds if we model the projective plane using the sphere with antipodal points identified.
Our (ideal) sensor can measure the amount of energy received along the direction $x$, assuming a pin-hole model:
$$I_1(x) = R_1(p) \qquad \text{where } x = \pi(p). \qquad (3.6)$$
If the optical system is not well modeled by a pin-hole, one would have to explicitly model the thin lens, and therefore integrate not just along the direction $x_p$, but along all directions in the cone determined by the current point and the geometry of the lens. For simplicity, we restrict our attention to the pin-hole model.
Notice that $R_1$ in (3.5) depends upon the shape of the surface $S$, represented by its location $p$ and surface normal $\nu_p$, but it also depends upon the light source $L$, its energy distribution $E$, and the reflectance properties of the surface $S$, represented by the BRDF $\beta$. Making this dependency explicit, we write
$$I_1(x) = \int_L \beta\big(g_p^{-1}(x), x_p'\big)\, dE(x_p') \qquad \text{where } x = \pi(p). \qquad (3.7)$$
(3.8)
for k = 1 . . . M.
This model can be considerably simplified if we restrict our attention
to a class of materials, called Lambertian, that do not change appearance
depending on the viewpoint. Marble and other matte surfaces are to a
large extent well approximated by the Lambertian model. Metal, mirrors
and other shiny surfaces are not.
According to Lambert's model, the BRDF only depends on how the surface faces the light source, not on how it is viewed. Therefore, $\beta(x_p, x_p')$ is actually independent of $x_p$, and we can think of the radiance function as being glued, or painted, on the surface $S$, so that at each point $p$ the radiance $R$ depends only on the geometry of the surface and not explicitly on the light source. In particular, we have $\beta(x_p, x_p') = \langle \nu_p, x_p'\rangle$, independent of $x_p$, and
$$R(p) \doteq \int_L \langle \nu_p, x_p'\rangle\, dE(x_p').$$
Since $\nu_p$ is the normal vector, which is determined by the geometry of the surface at $p$, knowing the position of the generic point $p \in S$ one can differentiate it to compute the tangent plane. Therefore, effectively, the radiance $R$ depends only on the surface $S$, described by its generic point $p$:
$$I(x) = R(p). \qquad (3.9)$$
In homogeneous coordinates, the change of coordinates from the camera frame to the world frame, $g_{wc} = (R_{wc}, T_{wc})$, reads
$$\bar X_w = \begin{bmatrix} R_{wc} & T_{wc}\\ 0 & 1\end{bmatrix}\begin{bmatrix} X_c\\ 1\end{bmatrix}.$$
The inverse transformation, which maps coordinates in the world frame onto coordinates in the camera frame, is given by
$$g_{cw} = (R_{cw}, T_{cw}) = \big(R_{wc}^T,\; -R_{wc}^T T_{wc}\big). \qquad (3.10)$$
With this notation, the ideal perspective projection (3.2) can be written in matrix form as
$$Z\begin{bmatrix} x\\ y\\ 1\end{bmatrix} = \begin{bmatrix} f & 0 & 0 & 0\\ 0 & f & 0 & 0\\ 0 & 0 & 1 & 0\end{bmatrix}\begin{bmatrix} X\\ Y\\ Z\\ 1\end{bmatrix}, \qquad (3.11)$$
where $\bar X \doteq [X, Y, Z, 1]^T$ is the representation of a 3-D point in homogeneous coordinates and $\bar x \doteq [x, y, 1]^T$ are the homogeneous (projective) coordinates of the point in the retinal plane. Since the Z-coordinate (or the depth of the point $p$) is usually unknown, we may simply denote it as an arbitrary positive scalar $\lambda \in \mathbb{R}^+$. Also notice that the above matrix can be decomposed as
$$\begin{bmatrix} f & 0 & 0 & 0\\ 0 & f & 0 & 0\\ 0 & 0 & 1 & 0\end{bmatrix} = \begin{bmatrix} f & 0 & 0\\ 0 & f & 0\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\end{bmatrix}.$$
Define the two matrices
$$A_f \doteq \begin{bmatrix} f & 0 & 0\\ 0 & f & 0\\ 0 & 0 & 1\end{bmatrix} \in \mathbb{R}^{3\times3}, \qquad P \doteq \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\end{bmatrix} \in \mathbb{R}^{3\times4}.$$
Also notice that from the coordinate transformation we have, for $\bar X = [X, Y, Z, 1]^T$,
$$\begin{bmatrix} X\\ Y\\ Z\\ 1\end{bmatrix} = \begin{bmatrix} R & T\\ 0 & 1\end{bmatrix}\begin{bmatrix} X_0\\ Y_0\\ Z_0\\ 1\end{bmatrix}. \qquad (3.12)$$
Putting the pieces together, we obtain
$$\lambda\begin{bmatrix} x\\ y\\ 1\end{bmatrix} = \begin{bmatrix} f & 0 & 0\\ 0 & f & 0\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\end{bmatrix}\begin{bmatrix} R & T\\ 0 & 1\end{bmatrix}\begin{bmatrix} X_0\\ Y_0\\ Z_0\\ 1\end{bmatrix},$$
or in matrix form
$$\lambda\bar x = A_f P\bar X = A_f P g\bar X_0. \qquad (3.13)$$
The ideal image coordinates $x = [x, y]^T$ are related to the actual pixel coordinates by a scaling of the axes and a translation of the image reference frame, where $(o_x, o_y)$ are the coordinates (in pixels) of the principal point, i.e. the point where the Z-axis intersects the image plane. The actual image coordinates are therefore given by the vector $x_{im} = [x_{im}, y_{im}]^T \in \mathbb{R}^2$ instead of the ideal image coordinates $x = [x, y]^T$. In homogeneous coordinates, these steps can be written in matrix form as
$$\bar x_{im} = \begin{bmatrix} x_{im}\\ y_{im}\\ 1\end{bmatrix} = \begin{bmatrix} s_x & 0 & o_x\\ 0 & s_y & o_y\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix} x\\ y\\ 1\end{bmatrix}, \qquad (3.15)$$
where $x_{im}$ and $y_{im}$ are the actual image coordinates in pixels. When $s_x = s_y$ the pixels are square. In case the pixels are not rectangular, a more general form of the scaling matrix can be considered:
$$S = \begin{bmatrix} s_x & s_\theta\\ 0 & s_y\end{bmatrix} \in \mathbb{R}^{2\times2}.$$
The scaling and translation together are captured by the matrix
$$A_s \doteq \begin{bmatrix} s_x & s_\theta & o_x\\ 0 & s_y & o_y\\ 0 & 0 & 1\end{bmatrix} \in \mathbb{R}^{3\times3}, \qquad (3.16)$$
so that the overall transformation from 3-D coordinates to pixel coordinates is
$$\lambda\begin{bmatrix} x_{im}\\ y_{im}\\ 1\end{bmatrix} = \begin{bmatrix} s_x & s_\theta & o_x\\ 0 & s_y & o_y\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix} f & 0 & 0\\ 0 & f & 0\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\end{bmatrix}\begin{bmatrix} X\\ Y\\ Z\\ 1\end{bmatrix}.$$
Notice that in the above equation the effect of a real camera is in fact carried through two stages. The first stage is a standard perspective projection with respect to a normalized coordinate system (as if the focal length were $f = 1$); this is characterized by the standard projection matrix $P = [I_{3\times3}, 0]$. The second stage is an additional transformation (on the image $x$ so obtained) which depends on parameters of the camera such as the focal length $f$, the scaling factors $s_x, s_y, s_\theta$, and the center offsets $o_x, o_y$. The second transformation is characterized by the combination of the two matrices $A_s$ and $A_f$:
$$A \doteq A_s A_f = \begin{bmatrix} s_x & s_\theta & o_x\\ 0 & s_y & o_y\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix} f & 0 & 0\\ 0 & f & 0\\ 0 & 0 & 1\end{bmatrix} = \begin{bmatrix} f s_x & f s_\theta & o_x\\ 0 & f s_y & o_y\\ 0 & 0 & 1\end{bmatrix}.$$
The projection then reads
$$\lambda\begin{bmatrix} x_{im}\\ y_{im}\\ 1\end{bmatrix} = \begin{bmatrix} f s_x & f s_\theta & o_x\\ 0 & f s_y & o_y\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\end{bmatrix}\begin{bmatrix} X\\ Y\\ Z\\ 1\end{bmatrix}, \qquad \text{i.e.}\quad \lambda\bar x_{im} = AP\bar X. \qquad (3.17)$$
The $3\times3$ matrix $A$ collects all the parameters that are intrinsic to the particular camera, which are therefore called intrinsic parameters; the matrix $P$ represents the perspective projection. The matrix $A$ is usually called the intrinsic parameter matrix, or simply the calibration matrix, of the camera. When $A$ is known, the normalized coordinates $x$ can be obtained from the pixel coordinates $x_{im}$ by simple inversion of $A$:
$$\lambda\bar x = \lambda A^{-1}\bar x_{im} = P\bar X = \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\end{bmatrix}\begin{bmatrix} X\\ Y\\ Z\\ 1\end{bmatrix}. \qquad (3.18)$$
The information about the matrix $A$ can be obtained through the process of camera calibration described in Chapter 6.
The normalized coordinate system corresponds to the ideal pinhole camera model with the image plane located in front of the center of projection
and the focal length f equal to 1. Given this geometric interpretation, the
individual entries of the matrix A correspond to:
- $1/s_x$: size of a horizontal pixel in meters [m/pixel];
- $1/s_y$: size of a vertical pixel in meters [m/pixel];
- $\alpha_x = s_x f$: focal length in horizontal pixels [pixel];
- $\alpha_y = s_y f$: focal length in vertical pixels [pixel];
- $\alpha_x/\alpha_y$: aspect ratio.
Combining the projection with the rigid body motion $g = (R, T)$ from the world frame, we finally obtain
$$\lambda\begin{bmatrix} x_{im}\\ y_{im}\\ 1\end{bmatrix} = \begin{bmatrix} f s_x & f s_\theta & o_x\\ 0 & f s_y & o_y\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\end{bmatrix}\begin{bmatrix} R & T\\ 0 & 1\end{bmatrix}\begin{bmatrix} X_0\\ Y_0\\ Z_0\\ 1\end{bmatrix},$$
or in matrix form
$$\lambda\bar x_{im} = AP\bar X = APg\bar X_0. \qquad (3.19)$$
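Equation (3.19) can be sketched numerically as follows (our code; the parameter values are made up for illustration, loosely echoing a 16 mm lens sampled at 500 pixels per 16 mm).

import numpy as np

def calibration_matrix(f, sx, sy, s_theta, ox, oy):
    # Intrinsic parameter matrix A of equation (3.17).
    return np.array([[f * sx, f * s_theta, ox],
                     [0.0,    f * sy,      oy],
                     [0.0,    0.0,         1.0]])

def project(A, g, X0):
    # Pixel coordinates of a world point X0 via lambda * x_im = A P g X0_bar, equation (3.19).
    P = np.hstack([np.eye(3), np.zeros((3, 1))])      # standard projection matrix
    X0_bar = np.append(X0, 1.0)                       # homogeneous coordinates
    x = A @ P @ g @ X0_bar
    return x[:2] / x[2]                               # divide out the unknown scale lambda

A = calibration_matrix(f=0.016, sx=31250.0, sy=31250.0, s_theta=0.0, ox=320.0, oy=240.0)
g = np.eye(4)                                         # camera frame coincides with the world frame
x_pix = project(A, g, np.array([0.1, -0.05, 2.0]))    # a point 2 m in front of the camera
x_norm = np.linalg.inv(A) @ np.append(x_pix, 1.0)     # normalized coordinates, equation (3.18)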
In the case of spherical projection, the perspective division is simply replaced by projection onto the unit sphere,
$$X \mapsto \tilde X = \frac{X}{\|X\|},$$
and the overall model remains
$$\lambda\bar x_{im} = AP\bar X, \qquad (3.20)$$
where the scale, $\lambda = Z$ in the case of perspective projection, becomes $\lambda = \|X\|$.
A further simplification is the orthographic projection, which simply drops the Z-coordinate:
$$\begin{bmatrix} x\\ y\end{bmatrix} = \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\end{bmatrix}\begin{bmatrix} X\\ Y\\ Z\end{bmatrix}, \qquad (3.21)$$
or simply, in matrix form,
$$x = P_2 X. \qquad (3.22)$$
When the depth variation of the scene is small compared to its average distance $\bar Z$ from the camera, the perspective projection can be approximated by
$$x = f\frac{X}{\bar Z}, \qquad y = f\frac{Y}{\bar Z},$$
where $\bar Z$ is the average distance of the points viewed by the camera. This model is appropriate when all points lie approximately in a frontoparallel plane; the scaling factor then corresponds to the distance of the plane from the origin. Denoting the scaling factor by $s = f/\bar Z$, we can express the weak-perspective (scaled orthographic) camera model as
$$\begin{bmatrix} x\\ y\end{bmatrix} = s\begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\end{bmatrix}\begin{bmatrix} X\\ Y\\ Z\end{bmatrix}, \qquad (3.23)$$
or, in matrix form,
$$x = sP_2 X. \qquad (3.24)$$
3.4 Summary
The perspective projection was introduced as a model of the image formation process. In the case of an idealized projection (when the intrinsic parameters of the camera system are known), the coordinates of the image points are related to their 3-D counterparts by an unknown scale $\lambda$:
$$\lambda\bar x = P\bar X.$$
In the absence of knowledge of the camera parameters, the perspective projection is augmented by an additional transformation $A$, which captures the intrinsic parameters of the camera system and yields the following relationship between image quantities and their 3-D counterparts:
$$\lambda\bar x_{im} = AP\bar X = APg\bar X_0.$$
3.5 Exercises
1. Show that any point on a line through p projects onto the same
coordinates (equation (3.4)).
2. Show how perspective projection approximates orthographic projection when the scene occupies a volume whose diameter is small compared to its distance from the camera. Characterize other conditions under which the two projection models produce similar results (equal in the limit).
3. Consider a thin round lens imaging a plane parallel to the lens at a
distance d from the focal plane. Determine the region of this plane
that contributes to the image I at the point x. (Hint: consider first
a one-dimensional imaging model, then extend to a two-dimensional
image).
4. Perspective projection vs. orthographic projection
It is common sense that, with a perspective camera, one can not
tell an object from another object which is exactly twice as big but
twice as far. This is a classic ambiguity introduced by perspective
projection. Please use the ideal camera model to explain why this is
true. Is the same also true for orthographic projection? Please explain.
5. Vanishing points
A straight line in the 3-D world is projected onto a straight line in
the image. The projections of two parallel lines intersect in the image
at the so-called vanishing point.
a) Show that projections of parallel lines in the image intersect
at a point.
b) Compute, for a given family of parallel lines, where in the
image the vanishing point will be.
c) When does the vanishing point of the lines in the image lie
at infinity (i.e. they do not intersect)?
6. Field of view
An important parameter of the imaging system is the field of view
(FOV). The field of view is twice the angle between the optical axis
(z-axis) and the edge of the retinal plane (CCD array). Imagine that you have a camera system with a focal length of 16 mm, that the retinal plane (CCD array) is 16 mm x 12 mm, and that your digitizer samples the imaging surface at 500 x 500 pixels in each direction.
7. Calibration matrix
Compute the calibration matrix $A$ that represents the transformation from image $I$ to $I'$ as shown in Figure 3.8. Note that, from the definition of the calibration matrix, you need to use homogeneous coordinates to represent image points.
(Figure 3.8: image I, labeled with the corner point (-1, 1), and image I', labeled with the corner point (640, 480).)
Chapter 4
Image primitives and correspondence
The previous chapter described how images are formed depending on the
position of the camera, the geometry of the environment and the light
distribution. In this chapter we begin our journey of understanding how to
undo this process and process measurements of light in order to infer the
geometry of the environment.
The geometry of the environment and the light distribution combine to
give an image in a way that is inextricable: from a single image, in the absence of prior information, we cannot tell light and geometry apart. One way to
see that is to think of an image as coming from a certain object with a
certain light distribution: one can always construct a different object and
a different light distribution that would give rise to the same image. One
example is the image itself! It is an object different from the true scene (it
is flat) that gives rise to the same image (itself). In this sense, images are
illusions: they are objects different than the true one, but that generate
on the retina of the viewer the same stimulus (or almost the same, as we
will see) that the original scene would.
However, if instead of looking at one single image we look at two or more
images of the same scene taken for instance from different vantage points,
then we can hope to be able to extract geometric information about the
scene. In fact, each image depends upon the (changing) camera position,
but they all depend upon the same geometry and light distribution, assuming that the scene has not changed between snapshots. Consider, for instance, the case of a scene and a photograph of that scene. When viewed from different viewpoints, the original scene will appear different depending upon the depth of the objects within it: distant objects will move little and near objects will move more. If, on the other hand, we move in front of a photograph of the scene, every point will have the same image motion.
This intuitive discussion suggests that, in order to be able to extract
geometric information about the environment, we first have to be able to
establish how points move from one image to the next. Or, more in
general, we have to establish which points corresponds to which in different images of the same scene. This is called the correspondence problem.
The correspondence problem is at the heart of the process of converting
measurements of light into estimates of geometry.
In its full generality, this problem cannot be solved. If the light distribution and the geometry of the scene can change arbitrarily from one image
to the next, there is no way to establish which point corresponds to which in different images. For instance, if we take a rotating white marble sphere,
we have no way to tell which point corresponds to which on the sphere
since each point looks the same. If instead we take a still mirror sphere
and move the light, even though points are still (and therefore points correspond to themselves), their appearance changes from one image to the
next. Even if light and geometry do not change, if objects have anisotropic
reflectance properties, so that they change their appearance depending on
the viewpoint, establishing correspondence can be very hard.
In this chapter we will study under what conditions the correspondence
problem can be solved, and can be solved easily. We will start with the most
naive approach, that consists in considering the image irradiance I(x) as a
label attached to each pixel x. Although sound in principle, this approach leads nowhere, due to a variety of factors. We will then see how to extend
the label idea to more and more complex labels, and what assumptions
on the shape of objects different labels impose.
An image is a map
$$I : \mathbb{R}^2 \to \mathbb{R}^+;\qquad x \mapsto I(x).$$
Under the pin-hole model, only one point along each projection ray contributes to the irradiance at a given pixel. This point has coordinates $X \in \mathbb{R}^3$, corresponding to a particular value of $\lambda$ determined by the first intersection of the ray with a visible surface: $\lambda x = X$. Therefore $I_1(x) = E(\lambda x) = E(X)$, where $\lambda \in \mathbb{R}^+$ represents the distance of the point along the projection ray and $E(X) : \mathbb{R}^3 \to \mathbb{R}^+$ is the radiant distribution of the object at $X$. This is the fundamental irradiance equation:
$$I_1(x) = E(\lambda x) = E(X). \qquad (4.1)$$
Now suppose that a different image of the same scene becomes available,
I2 , for instance one taken from a different vantage point. Naturally,
$$I_2 : \mathbb{R}^2 \to \mathbb{R}^+;\qquad x \mapsto I_2(x).$$
However, I2 will in general be different from I1 . The first step in establishing correspondence is to understand how such a difference occurs. We
distinguish between changes that occur in the image domain and changes
that occur in the value I2 .
(4.3)
where we have emphasized the fact that the scales i , i = 1, 2 depend upon
the coordinates of the point in space X. Therefore, under the admittedly
restrictive conditions of this section, a model of the deformation between
two images of the same scene is given by:
I1 (x1 ) = I2 (h(x1 )).
(4.4)
60
(4.5)
where the fact that h depends upon the shape of the scene is made explicit
in the argument of u. Note that the dependence of h(x) on the position
of the point X comes through the scales 1 , 2 . In general, therefore, h
is a function in an infinite-dimensional space (the space of surfaces), and
solving for image correspondence is as difficult as estimating the shape of
visible objects.
While the reader reflects on how inappropriate it is to use the brightness
value of the image at a point as a label to match pixels across different
images, we reflect on the form of the image motion h in some interesting
special cases. We leave it as an exercise to the reader to prove the statements
below.
Translational image model. Consider the case where each image point moves with the same motion, $u(X) = \mathrm{const}$, or $h(x) = x + u$ for all $x$, where $u \in \mathbb{R}^2$. This model is correct only if the scene is flat, parallel to the image plane, far enough from it, and the viewer is moving slowly in a direction parallel to the image. However, if instead of considering the whole image we consider a small window around a point $x$, $W(x) = \{\tilde x \mid \|\tilde x - x\| \leq k\}$, then a simple translational model could be a reasonable approximation, provided that the window is not the image of an occluding boundary (why?). Correspondence based on this model is at the core of most optical flow and feature tracking algorithms. We will adopt that of Lucas and Kanade [?] as the prototypical example of a model in this class.
Affine image deformation. Consider the case where $h(x) = Ax + d$, with $A \in \mathbb{R}^{2\times2}$ and $d \in \mathbb{R}^2$. This model is a good approximation for small planar patches parallel to the image plane, moving under an arbitrary translation and rotation about the optical axis, and a modest rotation about an axis parallel to the image plane. Correspondence based on this model has been addressed by Shi and Tomasi [?]. A toy sketch of window matching under the simpler translational model is given below.
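As a toy illustration (ours, not the book's algorithm), the translational model can be used to match a window between two images by minimizing a sum-of-squared-differences cost over a small set of candidate integer displacements.

import numpy as np

def ssd_match(I1, I2, x, y, half, search):
    # Find the integer displacement u = (dx, dy) that best aligns the window of I1
    # centered at (x, y) with I2, under the purely translational model h(x) = x + u.
    w1 = I1[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    best, best_u = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            w2 = I2[y + dy - half:y + dy + half + 1,
                    x + dx - half:x + dx + half + 1].astype(float)
            cost = np.sum((w1 - w2) ** 2)    # sum-of-squared-differences matching cost
            if cost < best:
                best, best_u = cost, (dx, dy)
    return best_u

# Synthetic test: I2 is I1 shifted by 3 pixels horizontally and 2 pixels vertically.
rng = np.random.default_rng(0)
I1 = rng.integers(0, 255, size=(64, 64))
I2 = np.roll(I1, shift=(2, 3), axis=(0, 1))
assert ssd_match(I1, I2, x=32, y=32, half=7, search=5) == (3, 2)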
Now, let us see why attaching scalar labels to pixels is not a good idea.
61
(4.6)
More fundamental departures from the model (4.4) occur when one considers that points that are visible from one vantage point may become occluded
from another. Occlusions could be represented by a factor multiplying I2
that depends upon the shape of the surface being imaged:
I1 (x1 ) = fo (X, x)I2 (h(x1 )) + n(h(x1 )).
(4.7)
For instance, for the case where only one point on the surface is emitting
light, fo (X, x) = 1 when the point is visible, and fo (X, x) = 0 when not.
The equation above should make very clear the fact that associating the
label I1 to the point x1 is not a good idea, since the value of I1 depends
upon the noise n and the shape of the surfaces in space X, which we cannot
control.
There is more: in most natural scenes, objects do not emit light of their own but, rather, reflect ambient light in a way that depends upon the properties of the material. Even in the absence of occlusions, different materials may scatter or reflect light by different amounts in different directions. A limit case consists of materials that scatter light uniformly in all directions, such as marble and opaque matte surfaces, for which $f_o(X, x)$ could simply be multiplied by a factor $\rho(X)$ independent of the viewpoint $x$. Such materials are called Lambertian, and the function $\rho(X)$ defined on the surface of visible objects is called the albedo. However, for many other surfaces, for instance the other extreme case of a perfectly specular material,
light rays hitting the surface are reflected by different amounts in different
directions. Images of specular objects depend upon the light distribution of
the ambient space, which includes light sources as well as every other object
through inter-reflections. In general, few materials exhibit perfect Lambertian or specular reflection, and even more complex reflection models, such
as translucent or anisotropic materials, are commonplace in natural and
man-made scenes.
The reader can therefore appreciate that the situation can get very
complicated even for relatively simple objects. Consider, for instance, the
example of the marble sphere and the mirror sphere in the header of this
chapter. Indeed, in the most general case an image depends upon the (unknown) light distribution, material properties, and shape and pose of objects in space, and images alone, no matter how many, do not provide sufficient information to recover all the unknowns.
So how do we proceed? How can we establish correspondence despite the
fact that labels are ambiguous and that point correspondence cannot be
established for general scenes?
The second question can be easily dismissed by virtue of modesty: in
order to recover a model of the scene, it is not necessary to establish the
correspondence for every single point on a collection of images. Rather, in
order to recover the pose of cameras and a simple model of an environment,
correspondence of a few fiducial points is sufficient. It would be particularly clever if we could choose such fiducial points, or features, in such a way as to guarantee that they are easily put in correspondence.
The first question can be addressed by proceeding as one often does to counteract the effects of noise: by integrating. Instead of considering Equation (4.2) in terms of points on an image, we can consider it as defining correspondence in terms of regions. This can be done by integrating each side over a window W(x) around each point x, and using the resulting equation to characterize the correspondence at x. Another way of thinking of the same procedure is to associate to each pixel x not just the scalar label I(x), but instead a vector label
l(x) ≐ {I(x̃) | x̃ ∈ W(x)}.
This way, rather than trying to match scalar values, in order to establish correspondence we will need to match vectors, or regions.
Of course, if we considered the transformation of each point in the region as independent, h(x) = x + u(X), we would be back where we started, having to figure out two unknowns for each pixel. However, if we make the assumption that the whole region W(x) undergoes the same motion, for instance h(x̃) = x̃ + u for all x̃ ∈ W(x), then the problem may become easier to solve.
At this stage the reader should go back to consider under what assumptions a window undergoes a pure translation, and when it may be more
appropriate to model the motion of the region as an affine deformation, or
when neither approximation is appropriate.
In the remainder of this chapter we will explore in greater detail how
to match regions based on simple motion models, such as pure translation
or affine deformations. Notice that since we are comparing the appearance
of regions, a notion of discrepancy or matching cost must be agreed
upon.
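To make the notion of region matching concrete, the following sketch compares a window around a point in one image against candidate windows in a second image using a sum-of-squared-differences score. It is an illustrative example only (the images, window size, and search range are hypothetical choices), not the algorithm developed later in this chapter.

import numpy as np

def match_window_ssd(I1, I2, x, y, half=7, search=10):
    """Find the integer displacement of the window W(x, y) from image I1
    into image I2 by exhaustive search over a small range, scoring each
    candidate with the sum of squared differences (SSD)."""
    patch = I1[y - half:y + half + 1, x - half:x + half + 1].astype(np.float64)
    best_d, best_cost = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = I2[y + dy - half:y + dy + half + 1,
                      x + dx - half:x + dx + half + 1].astype(np.float64)
            if cand.shape != patch.shape:      # window fell off the image
                continue
            cost = np.sum((cand - patch) ** 2)
            if cost < best_cost:
                best_cost, best_d = cost, (dx, dy)
    return best_d, best_cost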
A point x can be called a feature point if the system of equations
I1(x̃) = I2(h(x̃; α)),  ∀ x̃ ∈ W(x),  (4.8)
uniquely determines the parameters α. From the example of the white wall, it is intuitive that such conditions require that I1 and I2 have nonzero gradient. In the sections to follow we will derive an explicit form of this condition for the case of the translational model. Notice also that, due to noise, there may exist no parameters α for which the equation above is satisfied exactly. Therefore, we may have to solve the equation above as an optimization problem: after choosing a discrepancy measure φ(I1, I2) in α, we can seek
α̂ = arg min_α φ(I1(x̃), I2(h(x̃; α))),  x̃ ∈ W(x).  (4.9)
Similarly, one may define a feature line as a line segment with a support region and a collection of labels, such that the orientation and normal displacement of the transformed line can be uniquely determined from the equation above.
In the next two sections we will see how to efficiently solve the problem above for the cases where α are translation parameters or affine parameters.
Comment 4.1. The definition of feature points and lines allows us to abstract our discussion from pixels and images to abstract entities such as
points and lines. However, as we will discuss in later chapters, this separation is more conceptual than factual. Indeed, all the constraints among
geometric entities that we will derive in Chapters 8, 9, 10, and 11 can be
rephrased in terms of constraints on the irradiance values on collections of
images.
For two consecutive images, the translational model together with the brightness constancy assumption gives
I(x(t), t) = I(x(t) + u dt, t + dt),  (4.11)
where we have taken the liberty of calling t1 = t and t2 = t + dt to emphasize the fact that the motion is infinitesimal. In the limit dt → 0, the equation can be written as
∇I(x(t), t)^T u + It(x(t), t) = 0,  (4.12)
where
∇I(x, t) ≐ [ ∂I/∂x (x, t), ∂I/∂y (x, t) ]^T ∈ R²  and  It(x, t) ≐ ∂I/∂t (x, t).  (4.13)
Another way of writing the equation is in terms of the total derivative with respect to time,
dI(x(t), y(t), t)/dt = 0,  (4.14)
that is,
(∂I/∂x)(dx/dt) + (∂I/∂y)(dy/dt) + ∂I/∂t = 0,  (4.15)
which is identical to (4.12) once we let u = [ux, uy]^T ≐ [dx/dt, dy/dt]^T ∈ R².
This equation is called the image brightness constancy constraint. Depending on where the constraint is evaluated, this equation can be used to
compute what is called optical flow, or to track photometric features in a
sequence of moving images.
Before we delve into the study of optical flow and feature tracking, notice
that (4.12), if computed at each point, only provides one equation for two
unknowns (the two components of u). It is only when the equation is evaluated at each point x̃ ∈ W(x) of a region, and the motion is assumed to be constant over that region, that it provides enough constraints on u.
For convenience we omit the time dependence of (x(t), y(t)) in I(x(t), y(t), t) and write simply I(x, y, t). With this notation the brightness constancy constraint reads
∇I(x, y, t)^T u + It(x, y, t) = 0.  (4.16)
There are two ways to exploit this constraint, corresponding to an Eulerian and a Lagrangian approach to the problem of determining image motion. When we fix our attention at a particular image location x̄ and use (4.16) to compute the velocity of particles flowing through that pixel, u(x̄, t) is called the optical flow. When instead the attention is on a particular particle x(t), and (4.16) is computed at the location x(t) as it moves
through the image domain, we refer to the computation of u(x(t), t) as
feature tracking. Optical flow and feature tracking are obviously related by
x(t + dt) = x(t) + u(x(t), t)dt. The only difference, at the conceptual level,
is where the vector u(x, t) is computed: in optical flow it is computed at a
fixed location on the image, whereas in feature tracking it is computed at
the point x(t).
The brightness constancy constraint captures the relationship between the image velocity u of an image point (x, y) and the spatial and temporal derivatives ∇I, It, which are directly measurable. As we have already noticed, the equation provides a single constraint for the two unknowns u = [ux, uy]^T. From the linear-algebraic point of view there are infinitely many solutions u which satisfy this equation. All we can compute is the projection of the actual optical flow vector onto the direction of the image gradient ∇I. This component is also referred to as the normal flow, and can be thought of as the minimum-norm vector un ∈ R² which satisfies the brightness constancy constraint. It is given by the projection of the true motion field u onto the gradient direction:
un ≐ (∇I^T u / ‖∇I‖) (∇I / ‖∇I‖) = − (It / ‖∇I‖) (∇I / ‖∇I‖).  (4.17)
Figure 4.1. In spite of the fact that the dark square moved diagonally between the two consecutive frames, by observing only the cut-out patch we can measure motion in the horizontal direction alone; as far as the patch is concerned, the observed pattern may have moved arbitrarily along the direction of the edge.
To deal with this ambiguity, one assumes that over a small window centered at the image location x the optical flow is the same for all points in the window W(x). This is equivalent to assuming a purely translational deformation model:
h(x̃) = x̃ + u(x̃) = x̃ + u(x)  for all  x̃ ∈ W(x).  (4.18)
Hence, in order to compute the image velocity u, we seek the velocity which is most consistent with all the point constraints. Due to the effect of noise in the model (4.6), this now over-constrained system may not have an exact solution. The matching process is therefore formulated as the minimization of the following quadratic error function based on the brightness constancy constraint:
Eb(u) = Σ_{x̃ ∈ W(x,y)} ( ∇I(x̃, t)^T u + It(x̃, t) )²,  (4.19)
whose minimizer satisfies, in matrix form,
G u + b = 0,  (4.21)
where
G = Σ_{W(x,y)} ∇I ∇I^T  and  b = Σ_{W(x,y)} It ∇I.  (4.22)
Alternatively, the displacement d of the window between the two frames can be found by minimizing a sum-of-squared-differences (SSD) criterion,
E_SSD(d) = Σ_{x̃ ∈ W(x,y)} ( I(x̃ + d, t + dt) − I(x̃, t) )²,
where the summation ranges over the image window W(x, y) centered at the feature of interest. Compared to the error function (4.19), an advantage of the SSD criterion is that in principle we no longer need to compute derivatives of I(x, y, t): one alternative for computing the displacement is to enumerate the function at each candidate location and choose the one which gives the minimum error. This formulation is due to Lucas and Kanade [LK81]; it was originally proposed in the context of stereo algorithms and was later refined by Tomasi and Kanade [TK92] in the more general feature tracking context.
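As an illustration of the window-based solution of (4.21)-(4.22), the following sketch computes the flow of a single window by accumulating G and b from finite-difference image derivatives and solving the 2 × 2 linear system. Function and variable names are our own, and this is a minimal sketch of the idea rather than a full tracker.

import numpy as np

def lucas_kanade_flow(I0, I1, x, y, half=7):
    """Estimate the translational flow u of the window W(x, y) between
    frames I0 and I1 by solving G u = -b (cf. (4.21)-(4.22))."""
    I0 = I0.astype(np.float64)
    I1 = I1.astype(np.float64)
    # Spatial derivatives by central differences, temporal by frame difference.
    Ix = (np.roll(I0, -1, axis=1) - np.roll(I0, 1, axis=1)) / 2.0
    Iy = (np.roll(I0, -1, axis=0) - np.roll(I0, 1, axis=0)) / 2.0
    It = I1 - I0
    win = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))
    gx, gy, gt = Ix[win].ravel(), Iy[win].ravel(), It[win].ravel()
    G = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    b = np.array([np.sum(gx * gt), np.sum(gy * gt)])
    # The window is a good feature only if G is well conditioned.
    u = np.linalg.solve(G, -b)
    return u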
Under the affine deformation model, the brightness constancy assumption becomes
I(x, t) = I(Ax + d, t + dt).  (4.25)
Enforcing the above assumption over a region of the image, we can estimate the unknown affine parameters A and d by integrating the above constraint over all the points in the region W(x):
Ea(A, d) = Σ_{x̃ ∈ W(x)} ( I(x̃, t) − I(Ax̃ + d, t + dt) )²,  (4.26)
where the subscript a indicates the affine deformation model. By solving the above minimization problem (after linearization) one can obtain a linear least-squares estimate of the affine parameters A and d directly from image measurements of the spatial and temporal gradients. The solution was first derived by Shi and Tomasi in [?].
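To make the preceding paragraph concrete, here is a minimal sketch of the linearized least-squares estimate: writing A = I + D and linearizing I(Ax̃ + d, t + dt) about x̃ turns (4.26) into a linear system in the six unknowns (D, d). The helper below is illustrative only; the names and the choice of inputs (precomputed derivatives over the window pixels) are our own.

import numpy as np

def affine_flow(Ix, Iy, It, xs, ys):
    """Least-squares estimate of the affine motion parameters.
    Model: u(x, y) = D [x, y]^T + d, from the linearized constraint
    Ix*u + Iy*v + It = 0 at every window pixel (xs[i], ys[i])."""
    # Each row multiplies the unknown vector p = [D11, D12, D21, D22, d1, d2].
    M = np.stack([Ix * xs, Ix * ys, Iy * xs, Iy * ys, Ix, Iy], axis=1)
    p, *_ = np.linalg.lstsq(M, -It, rcond=None)
    D = p[:4].reshape(2, 2)
    d = p[4:]
    return np.eye(2) + D, d   # A = I + D, translation d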
Both schemes require the spatial derivatives of the image,
Ix(x, y) ≐ ∂I/∂x (x, y),   Iy(x, y) ≐ ∂I/∂y (x, y).  (4.27)
While the notion of derivative is well defined for smooth functions, additional steps need to be taken when computing the derivatives of digital
images.
The most straightforward and commonly used approximation, first-order differences of the raw image, usually results in a rough approximation of ∇I. Due to the presence of noise and digitization error, as well as the fact that only discrete samples of the function I(x, y) are available, a common practice is to smooth I(x, y) before computing first differences to approximate the gradient. Smoothing can be accomplished by convolving the image with a smoothing filter. A suitable low-pass filter, which attenuates high-frequency noise and at the same time serves as a sort of interpolating function, is for instance the 2-D Gaussian
gσ(x, y) = (1 / (2πσ²)) e^{−(x² + y²)/(2σ²)},  (4.28)
where the choice of σ is related to the effective size of the window over which the intensity function is averaged. Since the Gaussian filter is separable, the convolution can be accomplished in two steps, convolving the image with a 1-D Gaussian kernel first in the horizontal and then in the vertical direction. The smoothed image is then
Ĩ(x, y) = ∫∫ I(u, v) gσ(x − u) gσ(y − v) du dv.  (4.29)
Figures 4.2 and 4.3 demonstrate the effect of smoothing a noisy image by convolution with a Gaussian.
Figure 4.2. An image (Lena) corrupted by white random noise. Figure 4.3. The image smoothed by convolution with a Gaussian.
Since convolution is linear, the derivatives of the smoothed image can be computed by convolving the original image directly with the derivative of the Gaussian:
Ix(x, y) = ∫∫ I(u, v) g'σ(x − u) gσ(y − v) du dv,  (4.30)
Iy(x, y) = ∫∫ I(u, v) gσ(x − u) g'σ(y − v) du dv,  (4.31)
where g'σ(x) = −(x / (√(2π) σ³)) e^{−x²/(2σ²)} is the derivative of the Gaussian function. Of
course, in principle, linear filters other than the Gaussian can be designed to smooth the image or to compute the gradient. Over digital images we cannot perform continuous integration, so the above convolutions are commonly approximated by finite summations over a region of interest Ω:
Ix(x, y) = Σ_{(u,v)∈Ω} I(u, v) g'σ(x − u) gσ(y − v),  (4.32)
Iy(x, y) = Σ_{(u,v)∈Ω} I(u, v) gσ(x − u) g'σ(y − v).  (4.33)
Ideally Ω should be the entire image, but in practice one can usually choose its size just big enough that the values of the Gaussian kernel are negligibly small outside of the region.
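The following short sketch implements the discrete approximations (4.32)-(4.33) with separable 1-D kernels truncated to a finite support, in the spirit of the remark above; the truncation radius of three standard deviations is our own choice.

import numpy as np

def gaussian_kernels(sigma):
    """1-D Gaussian and its derivative, truncated at about 3*sigma."""
    r = int(np.ceil(3 * sigma))
    x = np.arange(-r, r + 1, dtype=np.float64)
    g = np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    dg = -x / (np.sqrt(2 * np.pi) * sigma**3) * np.exp(-x**2 / (2 * sigma**2))
    return g, dg

def image_gradient(I, sigma=1.0):
    """Ix, Iy by separable convolution with the derivative of a Gaussian."""
    g, dg = gaussian_kernels(sigma)
    I = I.astype(np.float64)
    # Convolve rows and columns separately ('same' keeps the image size).
    conv_rows = lambda A, k: np.apply_along_axis(np.convolve, 1, A, k, 'same')
    conv_cols = lambda A, k: np.apply_along_axis(np.convolve, 0, A, k, 'same')
    Ix = conv_cols(conv_rows(I, dg), g)   # derivative along x, smoothing along y
    Iy = conv_rows(conv_cols(I, dg), g)   # derivative along y, smoothing along x
    return Ix, Iy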
Figures 4.4 and 4.5 demonstrate edge pixels detected by the Canny edge
detector on a gray-level image. Edges are important features which give
away some local geometric information of the image. However, as we mentioned in previous sections, they are typically not good features to track
due to the aperture problem. Hence in order to select features which are
for all pixels in a window W around (x, y), compute the matrix
G = [ Σ Ix²    Σ Ix Iy
      Σ Ix Iy  Σ Iy²  ],  (4.38)
and if the smallest singular value σmin(G) is bigger than a prefixed threshold τ, then mark the pixel as a feature (or corner) point.
Although we have used the word corner, the reader should observe that
the test above only guarantees that the irradiance function I is changing
enough in two independent directions within the window of interest. In
other words, the window contains sufficient texture. Besides using the
norm (inner product) or outer product of the gradient I to detect edge or
corner, criteria for detecting both edge and corner together have also been
explored. One of them is the well-known Harris corner and edge detector
[HS88]. The main idea is to threshold the following quantity
C(G) = det(G) + k trace²(G)
(4.39)
Figure 4.7. An example of the response of the Harris feature detector using a 5 × 5 integration window and parameter k = 0.04.
Suppose the eigenvalues of G (which in this case coincide with its singular values) are λ1, λ2. Then
C(G) = λ1λ2 + k(λ1 + λ2)² = (1 + 2k)λ1λ2 + k(λ1² + λ2²).  (4.40)
Note that if k > 0 and either one of the eigenvalues is large, so will be C(G); that is, both distinctive edge and corner features are likely to pass the threshold. If k < 0, then both eigenvalues need to be big enough to make C(G) pass the threshold; in this case, corner features are favored.
A simple thresholding operation often does not yield satisfactory results and leads to the detection of too many corners which are not well localized. Partial improvements can be obtained by searching for local maxima of the response in the regions where the detector response is high. Alternatively, more sophisticated techniques can be used which rely on edge detection and search for high-curvature points of the detected contours.
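As an illustration of the two criteria above, the sketch below computes, for every pixel, the matrix G of (4.38) over a square window and derives both the smallest-eigenvalue score and the Harris-style score C(G). The window size, and the choice k = -0.04 (which favors corners under the sign convention of (4.39)), are our own illustrative choices.

import numpy as np

def corner_scores(Ix, Iy, half=2, k=-0.04):
    """Return (smallest eigenvalue of G, Harris response det(G) + k*trace(G)^2)
    at every pixel, with G accumulated over a (2*half+1)^2 window."""
    Ixx, Ixy, Iyy = Ix * Ix, Ix * Iy, Iy * Iy

    def box_sum(A):
        # Sum over the window by convolving rows and columns with a box filter.
        ones = np.ones(2 * half + 1)
        A = np.apply_along_axis(np.convolve, 1, A, ones, 'same')
        return np.apply_along_axis(np.convolve, 0, A, ones, 'same')

    Sxx, Sxy, Syy = box_sum(Ixx), box_sum(Ixy), box_sum(Iyy)
    trace = Sxx + Syy
    det = Sxx * Syy - Sxy * Sxy
    # Smallest eigenvalue of the symmetric 2x2 matrix G.
    lam_min = trace / 2 - np.sqrt(np.maximum((trace / 2) ** 2 - det, 0.0))
    harris = det + k * trace ** 2
    return lam_min, harris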
Σ_{i=1}^m u_i u_i^T ∈ R^{n×n}  (4.41)
e⁰ = e,   e^{i+1} = ∫_W ∇I (∇I^T u + It)(ũ + d1^i, ṽ + d2^i, t) dũ dṽ.
At each step (u + d1^i, v + d2^i) is in general not on the pixel grid, so it is necessary to interpolate the brightness values to obtain the image intensity at that location. The matrix G is the discrete version of what we defined to be the SSD in the first section of this chapter.
7. Multi-scale implementation
One problem common to all differential techniques is that they fail
as the displacement across frames is bigger than a few pixels. One
possible way to overcome this inconvenience is to use a coarse-to-fine
strategy:
- build a pyramid of images by smoothing and sub-sampling the original images (see for instance [BA83]);
- select features at the desired level of definition and then propagate the selection up the pyramid;
- track the features at the coarser level;
- propagate the displacement to finer resolutions and use that displacement as an initial step for the sub-pixel iteration described in the previous section.
The whole procedure can be very slow for a full-resolution image if
implemented on conventional hardware. However, it is highly parallel
computation, so that the image could be sub-divided into regions
which are then assigned to different processors. In many applications,
where a prediction of the displacement of each feature is available,
it is possible to process only restricted regions of interest within the
image.
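A minimal sketch of the coarse-to-fine strategy follows. It assumes a single-window flow estimator such as the lucas_kanade_flow sketch given earlier in this chapter (a hypothetical helper here), and the number of levels, the 2x2 block-average downsampling, and the crude warping by np.roll are arbitrary illustrative choices.

import numpy as np

def build_pyramid(I, levels=3):
    """Smooth-and-subsample pyramid; level 0 is the original image."""
    pyr = [I.astype(np.float64)]
    for _ in range(1, levels):
        J = pyr[-1]
        h, w = (J.shape[0] // 2) * 2, (J.shape[1] // 2) * 2
        # 2x2 block average: crude smoothing followed by sub-sampling.
        pyr.append(J[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return pyr

def track_coarse_to_fine(I0, I1, x, y, flow_fn, levels=3):
    """Refine the displacement of the feature at (x, y) from the coarsest
    pyramid level down to full resolution. flow_fn(J0, J1, x, y) returns
    the residual flow of one window (e.g. a Lucas-Kanade style estimator)."""
    p0, p1 = build_pyramid(I0, levels), build_pyramid(I1, levels)
    d = np.zeros(2)
    for lvl in range(levels - 1, -1, -1):
        d *= 2.0                                   # displacement doubles at the finer level
        xs, ys = int(round(x / 2 ** lvl)), int(round(y / 2 ** lvl))
        # Shift the second image by the current estimate so only a small residual remains.
        shifted = np.roll(p1[lvl], (-int(round(d[1])), -int(round(d[0]))), axis=(0, 1))
        d += flow_fn(p0[lvl], shifted, xs, ys)
    return d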
Part II
Geometry of pairwise views
Chapter 5
Reconstruction from two calibrated views
i = 1, . . . , N
(5.1)
j = 1, . . . , N.
(5.2)
x2^T T̂ R x1 = 0.  (5.3)
Figure 5.1. Two projections x1, x2 ∈ R³ of a 3-D point p from two vantage points. The relative Euclidean transformation between the two vantage points is given by (R, T) ∈ SE(3). The vectors x1, x2 and T are coplanar, and therefore their triple product (5.3) must be zero.
Proof. Since both u ↦ A^T û A and u ↦ (A^{−1}u)^ are linear maps from R³ to R³ˣ³, one may directly verify that these two linear maps agree on the basis vectors [1, 0, 0]^T, [0, 1, 0]^T and [0, 0, 1]^T (using the fact that A ∈ SL(3) implies that det(A) = 1).
The following theorem, due to Huang and Faugeras [HF89], captures the
algebraic structure of essential matrices:
Theorem 5.1 (Characterization of the essential matrix). A nonzero matrix E ∈ R³ˣ³ is an essential matrix if and only if E has a singular value decomposition (SVD) E = UΣV^T with
Σ = diag{σ, σ, 0}
for some σ ∈ R+ and U, V ∈ SO(3).
Proof. We first prove the necessity. By definition, for any essential matrix E there exists (at least one pair) (R, T), R ∈ SO(3), T ∈ R³, such that T̂R = E. For T there exists a rotation matrix R0 such that R0 T = [0, 0, ‖T‖]^T. Denote this vector by a ∈ R³. Since det(R0) = 1, we know T̂ = R0^T â R0 from Lemma 5.1. Then EE^T = T̂RR^T T̂^T = T̂T̂^T = R0^T â â^T R0. It is direct to verify that
â â^T = diag{‖T‖², ‖T‖², 0}.
So the singular values of the essential matrix E = T̂R are (‖T‖, ‖T‖, 0). In general, in the SVD of E = UΣV^T, U and V are only unitary matrices, that is, matrices whose columns are orthonormal but whose determinants can be ±1. We still need to prove that U, V ∈ SO(3) (i.e. have determinant +1) to establish the theorem. We already have E = T̂R = R0^T â R0 R. Let RZ(θ) be the matrix which represents a rotation around the Z-axis (the X3-axis) by an angle of θ radians, i.e. RZ(θ) = e^{ê3 θ} with e3 = [0, 0, 1]^T ∈ R³. Then
RZ(+π/2) = [ 0 −1 0 ; 1 0 0 ; 0 0 1 ].
Then â = RZ(+π/2) RZ^T(+π/2) â = RZ(+π/2) diag{‖T‖, ‖T‖, 0}. Therefore
E = T̂R = R0^T RZ(+π/2) diag{‖T‖, ‖T‖, 0} R0 R.  (5.5)
yields
e^{2θω̂} = I + ω̂ sin(2θ) + ω̂²(1 − cos(2θ)),  and hence  ω̂² sin(2θ) + ω̂³(1 − cos(2θ)) = 0.  (5.6)
Since ω̂² and ω̂³ are linearly independent (Lemma 2.3 in [MLS94]), we have sin(2θ) = 1 − cos(2θ) = 0; that is, θ is equal to 2kπ or 2kπ + π, k ∈ Z. Therefore, R is equal to I or e^{ω̂π}. Now if ω = u = T/‖T‖, then it is direct from the geometric meaning of the rotation e^{ω̂π} that e^{ω̂π}T = T. On the other hand, if ω = −u = −T/‖T‖, then it follows that e^{ω̂π}T̂ = −T̂. Thus, in either case the conclusion of the lemma follows.
The following theorem shows how to extract rotation and translation
from an essential matrix as given in closed-form in equation (5.7) at the
end of the theorem.
Theorem 5.2 (Pose recovery from the essential matrix). There exist exactly two relative poses (R, T) with R ∈ SO(3) and T ∈ R³ corresponding to a non-zero essential matrix E ∈ E.
Proof. Assume that (R1, T1) ∈ SE(3) and (R2, T2) ∈ SE(3) are both solutions of the equation T̂R = E. Then T̂1R1 = T̂2R2, which yields T̂1 = T̂2R2R1^T. Since T̂1, T̂2 are both skew-symmetric matrices and R2R1^T is a rotation matrix, from the preceding lemma we have that either (R2, T2) = (R1, T1) or (R2, T2) = (e^{û1π}R1, −T1) with u1 = T1/‖T1‖. Therefore, given an essential matrix E there are exactly two pairs (R, T) such that T̂R = E. Further, if E has the SVD E = UΣV^T with U, V ∈ SO(3), the following formulas give the two distinct solutions:
(T̂1, R1) = (U RZ(+π/2) Σ U^T, U RZ^T(+π/2) V^T),
(T̂2, R2) = (U RZ(−π/2) Σ U^T, U RZ^T(−π/2) V^T).  (5.7)
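A small numerical sketch of (5.7): given an essential matrix, the two candidate (R, T̂) pairs can be computed from its SVD. This is an illustrative implementation under the stated convention E = T̂R, with determinant signs fixed so that U, V ∈ SO(3); function names are our own.

import numpy as np

def poses_from_essential(E):
    """Return the two (R, T_hat) pairs of equation (5.7) for a given E."""
    U, S, Vt = np.linalg.svd(E)
    # Make sure U and V are proper rotations (det = +1).
    if np.linalg.det(U) < 0:
        U[:, -1] *= -1
    if np.linalg.det(Vt) < 0:
        Vt[-1, :] *= -1
    sigma = (S[0] + S[1]) / 2.0            # an ideal E has singular values (s, s, 0)
    Sigma = np.diag([sigma, sigma, 0.0])
    Rz = lambda th: np.array([[np.cos(th), -np.sin(th), 0.0],
                              [np.sin(th),  np.cos(th), 0.0],
                              [0.0,         0.0,        1.0]])
    solutions = []
    for th in (np.pi / 2, -np.pi / 2):
        T_hat = U @ Rz(th) @ Sigma @ U.T   # skew-symmetric by construction
        R = U @ Rz(th).T @ Vt
        solutions.append((R, T_hat))
    return solutions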
Write the essential matrix E as
E = [ e1 e2 e3 ; e4 e5 e6 ; e7 e8 e9 ]  (5.8)
and stack its entries into the essential vector
e = [e1, e2, e3, e4, e5, e6, e7, e8, e9]^T ∈ R⁹.  (5.9)
The epipolar constraint (5.3) can then be rewritten as the inner product of a and e,
a^T e = 0.
Now, given a set of corresponding image points (x1^j, x2^j), j = 1, . . . , n, define a matrix A ∈ R^{n×9} associated with these measurements to be
A = [a1, a2, . . . , an]^T,  (5.10)
where the j-th row aj is constructed from each pair (x1^j, x2^j) using (6.21). In
the absence of noise, the essential vector e has to satisfy
Ae = 0.
(5.11)
This linear equation may now be solved for the vector e. For the solution to be unique (up to scale, ruling out the trivial solution e = 0), the rank of the matrix A ∈ R^{n×9} needs to be exactly eight. This should be the case given n ≥ 8 ideal corresponding points. In general, however, since correspondences may be noisy, there may be no exact solution to (5.11). In such a case, one may choose e to minimize the function ‖Ae‖², i.e. choose e to be the eigenvector of A^T A corresponding to its smallest eigenvalue. Another condition to be cognizant of is when the rank of A is less than 8, allowing for multiple solutions of equation (5.11). This happens when the feature points are not in general position, for example when they all lie in a plane. Again in the presence of noise when the feature points are in general
(5.12)
Then ‖E − F‖²_f
P = Q = [ cos(θ)  −sin(θ)  0 ; sin(θ)  cos(θ)  0 ; 0  0  1 ].
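The linear estimation step just described is easy to implement. The sketch below builds the measurement matrix A from calibrated correspondences, takes the singular vector associated with the smallest singular value as the essential vector, and then projects the resulting matrix onto the essential space by replacing its singular values with their ideal pattern. It is a bare-bones illustration with our own function names, not the full algorithm with all of its refinements.

import numpy as np

def estimate_essential(x1, x2):
    """Eight-point estimate of E from calibrated correspondences.
    x1, x2: arrays of shape (n, 3) of homogeneous image points, n >= 8,
    ideally satisfying x2^T E x1 = 0 for every row pair."""
    n = x1.shape[0]
    A = np.zeros((n, 9))
    for j in range(n):
        A[j] = np.kron(x2[j], x1[j])           # a^j such that a^T e = x2^T E x1
    _, _, Vt = np.linalg.svd(A)
    e = Vt[-1]                                 # right singular vector of the least singular value
    E = e.reshape(3, 3)
    # Project onto the essential space: equal non-zero singular values, third one zero.
    U, S, Vt = np.linalg.svd(E)
    sigma = (S[0] + S[1]) / 2.0
    return U @ np.diag([sigma, sigma, 0.0]) @ Vt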
into account such coplanar information and modify the essential constraint
accordingly.
λ2^j x2^j = λ1^j R x1^j + γ T,  1 ≤ j ≤ n.  (5.13)
Notice that, since (R, T) are known, the equations given by (5.13) are linear in both the structural scales λ's and the motion scale γ, and can therefore be easily solved. In the presence of noise, these scales can be retrieved by solving a standard linear least-squares estimation problem. Arrange all the unknown scales in (5.13) into an extended scale vector
λ⃗ = [λ1¹, . . . , λ1ⁿ, λ2¹, . . . , λ2ⁿ, γ]^T ∈ R^{2n+1}.
Then all the constraints given by (5.13) can be expressed as a single linear
equation
M λ⃗ = 0
(5.14)
x̃1^j = x1^j + w1^j,   x̃2^j = x2^j + w2^j,   j = 1, . . . , n,  (5.15)
where x1^j and x2^j are the ideal image coordinates and w1^j = [w1^{j1}, w1^{j2}, 0]^T and w2^j = [w2^{j1}, w2^{j2}, 0]^T are localization errors in the correspondence. Notice that it is the (unknown) ideal image coordinates xi^j, i = 1, 2, that satisfy the epipolar constraint x2^{jT} T̂ R x1^j = 0, and not the (measured) noisy ones x̃i^j, i = 1, 2. One could think of the ideal coordinates as a model that depends upon the unknown parameters (R, T), and of wi^j as the discrepancy between the model and the measurements: x̃i^j = xi^j(R, T) + wi^j. Therefore, one seeks the parameters (R, T) that minimize the discrepancy from the data, i.e. wi^j. It is clear that, in order to proceed, we need to define a discrepancy criterion.
The simplest choice is to assume that the wi^j are unknown errors and to minimize their norm, for instance the standard Euclidean two-norm ‖w‖2 ≐ √(w^T w). Indeed, minimizing the squared norm is equivalent and often results in simpler algorithms: R̂, T̂ = arg min_{R,T} φ(R, T), where
φ(R, T) ≐ Σ_{i,j} ‖wi^j‖²₂ = Σ_{i,j} ‖x̃i^j − xi^j(R, T)‖²₂.
This corresponds to a least-squares (LS) criterion. Notice that the unknowns are constrained to live in a non-linear space: R ∈ SO(3) and T ∈ S², the unit sphere, after normalization due to the unrecoverable global scale factor, as discussed in the eight-point algorithm.
Alternatively, one may assume that the wi^j are samples of a stochastic process with a certain known distribution that depends upon the unknown parameters (R, T), wi^j ∼ p(w|R, T), and maximize the (log-)likelihood function with respect to the parameters: R̂, T̂ = arg max_{R,T} φ(R, T), where
φ(R, T) ≐ Σ_{i,j} log p( (x̃i^j − xi^j) | R, T ).
The estimate is then obtained by solving the constrained minimization problem
min Σ_{j=1}^n Σ_{i=1}^2 ‖wi^j‖²₂  subject to
x̃i^j = xi^j + wi^j,  i = 1, 2,  j = 1, . . . , n,
x2^{jT} T̂ R x1^j = 0,  j = 1, . . . , n,
x1^{jT} e3 = 1,  x2^{jT} e3 = 1,  j = 1, . . . , n,
R ∈ SO(3),  T ∈ S².  (5.16)
Using Lagrange multipliers λ^j, γ^j, η^j, we can convert the above minimization problem into an unconstrained minimization, over R, T, x1^j, x2^j and the multipliers, of
Σ_{j=1}^n ( ‖x̃1^j − x1^j‖² + ‖x̃2^j − x2^j‖² + λ^j x2^{jT} T̂ R x1^j + γ^j (x1^{jT} e3 − 1) + η^j (x2^{jT} e3 − 1) ).  (5.17)
j
j2 21 j ebT3 eb3 TbRxj1
x2
=
x
xjT TbRxj =
0
2
1
(5.18)
b xj + x
b j
jT
2(xjT
2 T R
1
2 T Rx1 )
xjT RT TbT ebT eb3 TbRxj + xjT TbRb
eT eb3 RT TbT xj
1
or
j =
(5.19)
b xj
b j
2xjT
2
xjT
2 T R
1
2 T Rx1
=
.
T bT bT e
b eT eb3 RT TbT xj
b j
xjT
xjT
1 R T e
2 T Rb
3
3 b3 T Rx1
2
(5.20)
Substituting back we obtain
Σ_{j=1}^n (x2^{jT} T̂ R x̃1^j + x̃2^{jT} T̂ R x1^j)² / ( ‖ê3 T̂ R x1^j‖² + ‖x2^{jT} T̂ R ê3^T‖² ),  (5.21)
or, expressed directly in terms of the measured points,
Σ_{j=1}^n [ (x̃2^{jT} T̂ R x̃1^j)² / ‖ê3 T̂ R x̃1^j‖² + (x̃2^{jT} T̂ R x̃1^j)² / ‖x̃2^{jT} T̂ R ê3^T‖² ].  (5.22)
Figure 5.2. Two noisy image points x̃1, x̃2 ∈ R³. L2 is the so-called epipolar line, which is the intersection of the second image plane with the plane formed by the first image point x1 and the line connecting the two camera centers o1, o2. The distance d2 is the geometric distance between the second image point x̃2 and the epipolar line in the second image plane. Symmetrically, one can define a similar geometric distance d1 in the first image plane.
each such iteration decreases the cost function, and therefore convergence
to a local extremum is guaranteed since the cost function is bounded below
by zero.
Algorithm 5.2 (Optimal triangulation).
1. Initialization
Initialize x1^j(R, T) and x2^j(R, T) as x̃1^j and x̃2^j, respectively.
2. Motion estimation
Update (R, T) by minimizing φ(R, T, x1^j(R, T), x2^j(R, T)) given by
(5.21) or (5.22).
3. Structure triangulation
Solve for xj1 (R, T ) and xj2 (R, T ) which minimize the objective function with respect to a fixed (R, T ) computed from the previous
step.
4. Return to Step 2 until the decrement in the value of φ is below a
threshold.
In step 3 above, for a fixed (R, T), x1^j(R, T) and x2^j(R, T) can be computed by minimizing the distance ‖x1^j − x̃1^j‖² + ‖x2^j − x̃2^j‖² for each pair of image points. Let t2^j ∈ R³ be the normal vector (of unit length) to the (epipolar) plane spanned by (x2^j, T). Given such a t2^j, x1^j and x2^j are determined by:
xj1 (tj1 )
bj T bj
T j
eb3 tj1 tjT
e
b
x
+
t
3 1
1
1 t1 e 3
bT b
eT3 tj1 tj1 e3
xj2 (tj2 )
bj T bj
T j
eb3 tj2 tjT
e
b
x
+
t
3 2
2
2 t2 e 3
bT b
eT3 tj2 tj2 e3
where tj1 = RT tj2 R3 . Then the distance can be explicitly expressed as:
j2 k2 + kxj1 x
j1 k2
kxj2 x
= k
xj2 k2 +
j j
tjT
2 A t2
j j
tjT
2 B t2
+ k
xj1 k2 +
j j
tjT
1 C t1
j j
tjT
1 D t1
c
c
jT
j2 x
bT3 + xj2 eb3 + eb3 xj2 ),
I (b
e3 x
2 e
c
c
j1 x
jT
= I (b
e3 x
bT3 + xj1 eb3 + eb3 xj1 ),
1 e
B j = ebT3 eb3
Dj = ebT3 eb3
(5.23)
Then the problem of finding x1^j(R, T) and x2^j(R, T) becomes one of finding the t2^j which minimizes a sum of two singular Rayleigh quotients:
min_{t2^{jT} T = 0, t2^{jT} t2^j = 1}  V(t2^j) = (t2^{jT} A^j t2^j)/(t2^{jT} B^j t2^j) + (t2^{jT} R C^j R^T t2^j)/(t2^{jT} R D^j R^T t2^j).  (5.24)
This is an optimization problem on the unit circle S¹ in the plane orthogonal to the vector T (therefore, geometrically, motion and structure recovery from n pairs of image correspondences is an optimization problem on the space SO(3) × S² × Tⁿ, where Tⁿ is an n-torus, i.e. an n-fold product of S¹). If N1, N2 ∈ R³ are vectors such that T, N1, N2 form an orthonormal basis of R³, then t2^j = cos(θ)N1 + sin(θ)N2 with θ ∈ R. We only need to find the θ which minimizes the function V(t2^j(θ)). From the geometric interpretation of the optimal solution, we also know that the global minimum should lie between two values θ1 and θ2 such that t2^j(θ1) and t2^j(θ2) correspond to the normal vectors of the two planes spanned by (x̃2^j, T) and (Rx̃1^j, T) respectively (if x̃1^j, x̃2^j are already triangulated, these two planes coincide). The problem now becomes a simple bounded minimization problem for a scalar function and can be efficiently solved using standard optimization routines (such as fmin in Matlab or Newton's algorithm).
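Triangulation itself can be illustrated with a much simpler linear scheme than the optimal one described above: given (R, T) and a pair of image points, the depths λ1, λ2 satisfy λ2 x2 = λ1 R x1 + T, which can be solved in the least-squares sense. The sketch below is this simple linear alternative (not the optimal bounded minimization), with our own function names.

import numpy as np

def triangulate_linear(R, T, x1, x2):
    """Recover the depths (lambda1, lambda2) of a correspondence from
    lambda2*x2 = lambda1*R*x1 + T, solved by linear least squares,
    and return the 3-D point in the first camera frame."""
    A = np.column_stack((R @ x1, -x2))      # unknowns [lambda1, lambda2]
    lam, *_ = np.linalg.lstsq(A, -T, rcond=None)
    lambda1, lambda2 = lam
    return lambda1 * x1, (lambda1, lambda2)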
of the geometry of points in space as seen from a moving camera, and deriving a conceptual algorithm for reconstructing camera motion and scene
structure. In light of the fact that the camera motion is slow relative to the
sampling frequency we will treat the motion of the camera as continuous.
While the derivations proceed in parallel it is an important point of caution
to the reader that there are some subtle differences.
Ẋ(t) = ω̂(t)X(t) + v(t).  (5.25)
From now on, for convenience, we will drop the time dependency from the notation. The image of the point p taken by the camera is the vector x which satisfies λx = X. Denote the velocity of the image point x by u = ẋ ∈ R³; u is also called the image motion field, which under the brightness constancy assumption discussed in Chapter 3 can be approximated by the optical flow. Consider now the inner product of the vectors in (5.25) with the vector (v × x):
Ẋ^T(v × x) = (ω̂X + v)^T(v × x) = X^T ω̂^T v̂ x.  (5.26)
Further note that
Ẋ = λ̇x + λẋ  and  x^T(v × x) = 0.  (5.27)
Defining the symmetric matrix s ≐ ½(ω̂v̂ + v̂ω̂) ∈ R³ˣ³, we then have
uT vbx + xT sx = 0.
(5.28)
This equation suggests some redundancy in the continuous epipolar constraint (5.27): indeed, the matrix ω̂v̂ can only be recovered up to its symmetric component s = ½(ω̂v̂ + v̂ω̂).⁴ This structure is substantially different from the discrete case, and it cannot be derived as a first-order approximation of the essential matrix T̂R. In fact a naive discretization would lead to a matrix of the form v̂ω̂, whereas in the true continuous case we have to deal with only its symmetric component s = ½(ω̂v̂ + v̂ω̂)! The
set of interest in this case is the space of 6 × 3 matrices of the form
E' ≐ { [ v̂ ; ½(ω̂v̂ + v̂ω̂) ]  |  ω ∈ R³, v ∈ R³ } ⊂ R⁶ˣ³,
which we call the continuous essential space. A matrix in this space is called
a continuous essential matrix. Note that the continuous epipolar constraint
(5.28) is homogeneous in the linear velocity v. Thus v may be recovered only
up to a constant scale. Consequently, in motion recovery, we will concern
ourselves with matrices belonging to the normalized continuous essential
space with v normalized to be 1:
E1' ≐ { [ v̂ ; ½(ω̂v̂ + v̂ω̂) ]  |  ω ∈ R³, v ∈ S² } ⊂ R⁶ˣ³.
The skew-symmetric part of a continuous essential matrix simply corresponds to the velocity v. The characterization of the (normalized) essential matrix focuses only on its symmetric part s = ½(ω̂v̂ + v̂ω̂). We call the space of all matrices of this form the symmetric epipolar space
S ≐ { ½(ω̂v̂ + v̂ω̂)  |  ω ∈ R³, v ∈ S² } ⊂ R³ˣ³.
⁴ This redundancy is the very reason why different forms of the continuous epipolar constraint exist in the literature [ZH84, PG98, VF95, May93, BCB97], and, accordingly, various approaches have been proposed to recover ω and v (see [TTH96]).
A matrix in this space is called a symmetric epipolar component. The motion estimation problem is now reduced to the one of recovering the velocity
(, v) with R3 and v S2 from a given symmetric epipolar component
s.
The characterization of symmetric epipolar components depends on a
characterization of matrices of the form
b vb R33 , which is given in the
following lemma. We define the matrix RY () to be the rotation around
the Y -axis by an angle R, i.e. RY () = eeb2 with e2 = [0, 1, 0]T R3 .
Lemma 5.4. A matrix Q R33 has the form Q =
b vb with R3 , v S2
if and only if
Q = V RY ()diag{, cos(), 0}V T
(5.29)
b vbq = (v q).
V = (eb 2 v, b, v),
then Q has the given form (5.29).
We now prove the sufficiency. Given a matrix Q which can be decomposed
into the form (5.29), define the orthogonal matrix U = V RY () O(3).5
Let the two skew symmetric matrices
b and vb given by the formulae
(5.30)
b = U RZ ( ) U T , vb = V RZ ( )1 V T
2
2
where = diag{, , 0} and 1 = diag{1, 1, 0}. Then
b vb = U RZ ( ) U T V RZ ( )1 V T
2
2
T
= U RZ ( ) (RY ())RZ ( )1 V T
2
2
= U diag{, cos(), 0}V T
= Q.
(5.31)
Since and v have to be, respectively, the left and the right zero
eigenvectors of Q, the reconstruction given in (5.30) is unique.
Since (b
vb)T = vb
b , it yields
s=
1
V RY ()diag{, cos(), 0} diag{, cos(), 0}RYT () V T .
2
2 cos()
0
sin()
.
0
2 cos()
0
=
sin()
0
0
(5.33)
(5.33)
= 1 3 ,
= arccos(2 /),
0
[0, ]
Define a matrix V SO(3) to be V = V1 RYT 2 2 . Then s =
1
T
2
2 V D(, )V . According to Lemma 5.4, there exist vectors v S and
3
R such that
Therefore, 12 (b
vb + vb
b ) = 21 V D(, )V T = s.
[Kan93] but it has never been really exploited in the literature for designing algorithms. The constructive proof given above is important in this
regard, since it gives an explicit decomposition of the symmetric epipolar
component s, which will be studied in more detail next.
Following the proof of Theorem 5.4, if we already know the eigenvector
decomposition of a symmetric epipolar component s, we certainly can find
at least one solution (, v) such that s = 12 (b
vb + vb
b ). We now discuss
uniqueness, i.e. how many solutions exist for s = 12 (b
vb + vb
b ).
Theorem 5.5 (Velocity recovery from the symmetric epipolar
component). There exist exactly four 3-D velocities (, v) with R 3
and v S2 corresponding to a non-zero s S.
vb + vb
b ).
Proof. Suppose (1 , v1 ) and (2 , v2 ) are both solutions for s = 21 (b
Then we have
vb1
b1 +
b1 vb1 = vb2
b2 +
b2 vb2 .
(5.34)
(5.35)
D(1 , 1 ) = W D(2 , 2 )W T .
(5.36)
Since both sides of (5.36) have the same eigenvalues, according to (5.32),
we have
1 = 2 ,
2 = 1 .
cos() 0 sin()
cos()
0 sin()
or
.
0
1
0
0
1
0
sin()
0 cos()
sin() 0 cos()
From the geometric meaning of V1 and V2 , all the cases give either
b 1 vb1 =
b2 vb2 or
b1 vb1 = vb2
b2 . Thus, according to the proof of Lemma 5.4, if (, v)
is one solution and
b vb = U diag{, cos(), 0}V T , then all the solutions
are given by
b = U RZ ( 2 ) U T , vb = V RZ ( 2 )1 V T ;
(5.37)
b = V RZ ( 2 ) V T , vb = U RZ ( 2 )1 U T
where = diag{, , 0} and 1 = diag{1, 1, 0}.
Suppose v̂ and s are of the form
v̂ = [ 0  −v3  v2 ; v3  0  −v1 ; −v2  v1  0 ],   s = [ s1  s2  s3 ; s2  s4  s5 ; s3  s5  s6 ],
and define the (continuous) essential vector
e = [v1 , v2 , v3 , s1 , s2 , s3 , s4 , s5 , s6 ]T .
(5.38)
(5.39)
(5.41)
where a are defined for each pair (x , u ) using (5.40). In the absence of
noise, the essential vector e has to satisfy
Ae = 0.
(5.42)
In order for this equation to have a unique solution for e, the rank of the
matrix A has to be eight. Thus, for this algorithm, the optical flow vectors
of at least eight points are needed to recover the 3-D velocity, i.e. n 8,
although the minimum number of optical flows needed is actually 5 (see
Maybank [May93]).
When the measurements are noisy, there might be no solution of e for
Ae = 0. As in the discrete case, one may approximate the solution by minimizing the error function ‖Ae‖².
Since the continuous essential vector e is recovered from noisy measurements, the symmetric part s of E directly recovered from e is not necessarily
a symmetric epipolar component. Thus one can not directly use the previously derived results for symmetric epipolar components to recover the
3-D velocity. Similarly to what we have done for the discrete case, we can
first estimate the symmetric matrix s, and then project it onto the space
of symmetric epipolar components.
Theorem 5.6 (Projection to the symmetric epipolar space). If a
real symmetric matrix F R33 is diagonalized as F = V diag{1 , 2 , 3 }V T
⁶ For perspective projection, z = 1 and u3 = 0; thus the expression for a can be simplified.
σ1 = (2λ1 + λ2 − λ3)/3,   σ2 = (λ1 + 2λ2 + λ3)/3,   σ3 = (2λ3 + λ2 − λ1)/3.  (5.43)
w1
W = w4
w7
Then
kE F k2f
w2 w3
w5 w6 .
w8 w9
= k W W T k2f
(5.44)
(5.45)
Substituting (5.44) into the second term, and using the fact that 2 =
1 + 3 and W is a rotation matrix, we get
tr(W W T )
j = 1, . . . , n.
σ1 = (2λ1 + λ2 − λ3)/3,   σ2 = (λ1 + 2λ2 + λ3)/3,   σ3 = (2λ3 + λ2 − λ1)/3;
b = U RZ ( 2 ) U T , vb = V RZ ( 2 )1 V T
b = V RZ ( 2 ) V T , vb = U RZ ( 2 )1 U T
where = diag{, , 0} and 1 = diag{1, 1, 0};
v = v0 .
Remark 5.2. Since both E, E E10 satisfy the same set of continuous
epipolar constraints, both (, v) are possible solutions for the given set
of optical flows. However, as in the discrete case, one can get rid of the
ambiguous solution by enforcing the positive depth constraint.
In situations where the motion of the camera is partially constrained,
the above linear algorithm can be further simplified. The following example
illustrates how it can be done.
Example 5.1 (Kinematic model of an aircraft). This example shows
how to utilize the so called nonholonomic constraints (see Murray, Li and
Sastry [MLS94]) to simplify the proposed linear motion estimation algorithm in the continuous case. Let g(t) SE(3) represent the position
and orientation of an aircraft relative to the spatial frame, the inputs
1 , 2 , 3 R stand for the rates of the rotation about the axes of the
aircraft and v1 R the velocity of the aircraft. Using the standard homogeneous representation for g (see Murray, Li and Sastry [MLS94]), the
kinematic equations of the aircraft motion are given by
0 3 2 v1
3
0 1 0
g
g =
2 1
0 0
0
where 1 stands for pitch rate, 2 for roll rate, 3 for yaw rate and v1
the velocity of the aircraft. Then the 3-D velocity (, v) in the continuous
epipolar constraint (5.28) has the form = [1 , 2 , 3 ]T , v = [v1 , 0, 0]T . For
the algorithm given above, we here have extra constraints on the symmetric
vb + vb
b ): s1 = s5 = 0 and s4 = s6 . Then there are only
matrix s = 12 (b
four different essential parameters left to determine and we can re-define
the essential parameter vector e R4 to be e = [v1 , s2 , s3 , s4 ]T . Then the
measurement vector a R4 is to be a = [u3 y u2 z, 2xy, 2xz, y 2 + z 2 ]T . The
continuous epipolar constraint can then be rewritten as
aT e = 0.
If we define the matrix A from a as in (5.41), the matrix AT A is a 4 4
matrix rather than a 9 9 one. For estimating the velocity (, v), the dimensions of the problem is then reduced from 9 to 4. In this special case,
the minimum number of optical flow measurements needed to guarantee a
unique solution of e is reduced to 3 instead of 8. Furthermore, the symmetric matrix s recovered from e is automatically in the space S and the
remaining steps of the algorithm can thus be dramatically simplified. From
this simplified algorithm, the angular velocity = [1 , 2 , 3 ]T can be fully
recovered from the images. The velocity information can then be used for
controlling the aircraft.
As in the discrete case, the linear algorithm proposed above is not optimal since it does not enforce the structure of the parameter space during
the minimization. Therefore, the recovered velocity does not necessarily
minimize the originally chosen error function kAe(, v)k2 on the space E10 .
Again like in the discrete case, we have to assume that translation is not
zero. If the motion is purely rotational, then one can prove that there are
infinitely many solutions to the epipolar constraint related equations. We
leave this as an exercise to the reader.
Ẋ(t) = ω̂ X(t) + v(t),
r(t) ≐ λ̇(t)/λ(t) is independent of the choice of λ(t). Notice that d/dt (ln λ) = λ̇/λ. Let the logarithm of the structural scale λ be y ≐ ln λ. Then a time-consistent
estimation (t) needs to satisfy the following ordinary differential equation,
which we call the dynamic scale ODE
ẏ(t) = r(t).
Given y(t0 ) = y0 = ln((t0 )), solve this ODE and obtain y(t) for t [t0 , tf ].
Then we can recover a consistent scale (t) given by
(t) = exp(y(t)).
Hence (structure and motion) scales estimated at different time instances
now are all relative to the same scale at time t0 . Therefore, in the continuous
case, we are also able to recover all the scales as functions of time up to a
universal scale. The reader must be aware that the above scheme is only
conceptual. In practice, the ratio function r(t) would never be available for
all time instances in [t0 , tf ].
Comment 5.1 (Universal scale ambiguity). In both the discrete and
continuous cases, in principle, the proposed schemes can reconstruct both
the Euclidean structure and motion up to a universal scale.
5.5 Summary
The seminal work of Longuet-Higgins [LH81] on the characterization of the
so called epipolar constraint, has enabled the decoupling of the structure
and motion problems and led to the development of numerous linear and
nonlinear algorithms for motion estimation from two views, see [Fau93,
Kan93, May93, WHA93] for overviews.
5.6 Exercises
1. Linear equation
Solve x Rn from the following equation
Ax = b
where A Rmn and b Rm . In terms of conditions on the matrix
A and vector b, describe: When does a solution exist or not exist, and
when is the solution unique or not unique? In case the solution is not
unique, describe the whole solution set.
2. Properties of skew symmetric matrix
(a) Prove Lemma 5.1.
(b) Prove Lemma 5.3.
3. A rank condition for epipolar constraint
Show that
x2^T T̂ R x1 = 0
if and only if
rank [ x̂2 R x1,  x̂2 T ] ≤ 1.
4. Rotational motion
Assume that camera undergoes pure rotational motion, i.e. it rotates
around its center. Let R SO(3) be the rotation of the camera and
so(3) be the angular velocity. Show that, in this case, we have:
(a) Discrete case: x2^T T̂ R x1 ≡ 0, ∀ T ∈ R³;
(b) Continuous case: x^T ω̂ v̂ x + u^T v̂ x ≡ 0, ∀ v ∈ R³.
5. Projection to O(3)
Given an arbitrary 3 3 matrix M R33 with positive singular
values, find the orthogonal matrix R O(3) such that the error
kR M k2f is minimized. Is the solution unique? Note: Here we allow
det(R) = 1.
6. Geometric distance to epipolar line
Given two image points x1, x̃2 with respect to camera frames with their relative motion (R, T), show that the geometric distance d2 defined in Figure 5.2 is given by the formula
d2² = (x̃2^T T̂ R x1)² / ‖ê3 T̂ R x1‖²,
where e3 = [0, 0, 1]^T ∈ R³.
Figure 5.4. Geometry of camera frames 1 and 2 relative to a plane P.
assume that the focal length of the camera is 1. That is, if x is the
image of a point p P with 3-D coordinates X = [X1 , X2 , X3 ]T , then
x = X/X3 = [X1 /X3 , X2 /X3 , 1]T R3 .
Follow the following steps to establish the so-called homography
between two images of the plane P:
(a) Verify the following simple identity
x̂ X = 0.
(5.48)
(b) Suppose that the (R, T ) SE(3) is the rigid body transformation from frame 1 to 2. Then the coordinates X1 , X2 of a fixed
point p P relative to the two camera frames are related by
X2 = ( R + (1/d1) T N1^T ) X1,  (5.49)
where d1 is the perpendicular distance of camera frame 1 to
the plane P and N1 S2 is the unit surface normal of P
relative to camera frame 1. In the equation, the matrix H =
(R + d11 T N1T ) is the so-called homography matrix in the computer vision literature. It represents the transformation from
X1 to X2 .
(c) Use the above identities to show that: Given the two images
x1 , x2 R3 of a point p P with respect to camera frames 1
and 2 respectively, they satisfy the constraint
x̂2 ( R + (1/d1) T N1^T ) x1 = 0.
(5.50)
(d) Prove that in order to solve uniquely (up to a scale) H from the
above equation, one needs the (two) images of at least 4 points
in P in general position.
8. Singular values of the homography matrix
Prove that any matrix of the form H = R + uv T with R SO(3)
and u, v R3 must have a singular value 1. (Hint: prove that the
matrix uv T + vuT + vuT uv T has a zero eigenvalue by constructing
the corresponding eigenvector.)
9. The symmetric component of the outer product of two vectors
Suppose u, v R3 , and kuk2 = kvk2 = . If u 6= v, the matrix
D = uv T + vuT R33 has eigenvalues {1 , 0, 3 }, where 1 > 0,
and 3 < 0. If u = v, the matrix D has eigenvalues {2, 0, 0}.
10. The continuous homography matrix
The continuous version of the homography matrix H introduced
above is H' = ω̂ + (1/d1) v N1^T, where ω, v ∈ R³ are the angular and linear
velocities of the camera respectively. Suppose that you are given a
matrix M which is also known to be of the form M = H 0 + I for
some R. But you are not told the actual value of . Prove that
you can uniquely recover H 0 from M and show how.
11. Implementation of the SVD-based pose estimation algorithm
Implement a version of the three step pose estimation algorithm for
two views. Your MATLAB codes are responsible for:
Initialization: Generate a set of N ( 8) 3-D points; generate
a rigid body motion (R, T ) between two camera frames and
project (the coordinates of) the points (relative to the camera
frame) onto the image plane correctly. Here you may assume the
focal length is 1. This step will give you corresponding images
as input to the algorithm.
Motion Recovery: use the corresponding images and the algorithm to compute the motion (R̂, T̂) and compare it to the
ground truth (R, T ).
After you get the correct answer from the above steps, here are a few
suggestions for you to play with the algorithm (or improve it):
A more realistic way to generate these 3-D points is to make sure
that they are all indeed in front of the image plane before and
after the camera moves. If a camera has a field of view, how
to make sure that all the feature points are shared in the view
before and after.
Chapter 6
Camera calibration and self-calibration
trinsic constraints only. For the cases when the conditions are not satisfied
we will characterize the associated ambiguities and suggest practical ways
to resolve them.
If we could measure metric coordinates of points on the image plane (as
opposed to pixel coordinates), we could model the projection as we have
done in previous chapters:
x = P gX
(6.1)
where P = [I, 0] and g SE(3) is the pose of the camera in the world
reference frame. In practice, however, calibration parameters are not known
ahead of time and, therefore, a more realistic model of the geometry of the
image formation process takes the form
x0 = AP gX
(6.2)
where A ∈ R³ˣ³ is the matrix of intrinsic parameters¹
A = [ f sx   f sθ   ox
      0      f sy   oy
      0      0      1  ].  (6.3)
be easily specified relative to, say, the top-left corner of the checkerboard.
Let us call these coordinates X. The change of coordinates between the
reference frame on the calibration rig and the camera frame is denoted by
g and the calibration matrix by A, so that when we image the calibration
rig we measure pixel coordinates x0 of points on calibration grid, which
satisfy
x0 = AP gX = P X.
This can be written for each point on the calibration rig as:
λ [x^i, y^i, 1]^T = [ p1^T ; p2^T ; p3^T ] [X^i, Y^i, Z^i, 1]^T.  (6.4)
For clarity we omit the primes from the pixel coordinates of image points in the above equation. As in the previous chapters, we can eliminate λ from the above equation by multiplying both sides by x̂'. This yields three equations, of which only two are independent. Hence for each point we have the following two constraints:
xi (pT3 X) = pT1 X
y i (pT3 X) = pT2 X
(6.5)
subject to kpk2 = 1
(6.6)
A ∈ R³ˣ³ (in its upper triangular form) and the rotation matrix R ∈ SO(3) using a routine QR decomposition, and
T = A^{−1} [p14, p24, p34]^T.
Several free software packages are available to perform calibration with a
rig. Most also include compensation for radial distortion and other lens
artifacts that we do not address here. In Chapter ?? we walk the reader
through use of one of these packages in a step-by-step procedure to calibrate
a camera.
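A compact sketch of the rig-based procedure just outlined: stack the two constraints (6.5) for every 3-D/2-D correspondence, solve for the projection matrix by SVD, and split it into intrinsics and pose with a QR-based decomposition. Function names and the use of NumPy are our own, and a real implementation would also handle radial distortion, as the packages mentioned above do.

import numpy as np

def calibrate_with_rig(X, x):
    """X: (n, 3) known rig points; x: (n, 2) measured pixel coordinates, n >= 6.
    Returns (A, R, T); the projection matrix is recovered only up to scale."""
    n = X.shape[0]
    M = np.zeros((2 * n, 12))
    for i in range(n):
        Xh = np.append(X[i], 1.0)
        M[2 * i, 0:4] = Xh
        M[2 * i, 8:12] = -x[i, 0] * Xh        # x^i (p3^T X) = p1^T X
        M[2 * i + 1, 4:8] = Xh
        M[2 * i + 1, 8:12] = -x[i, 1] * Xh    # y^i (p3^T X) = p2^T X
    _, _, Vt = np.linalg.svd(M)
    P = Vt[-1].reshape(3, 4)                   # least-squares solution, ||p|| = 1
    # Split P[:, :3] = A R, with A upper triangular and R a rotation:
    # a QR factorization of its inverse gives inv(AR) = R^T A^{-1}.
    Q, U = np.linalg.qr(np.linalg.inv(P[:, :3]))
    R, A = Q.T, np.linalg.inv(U)
    S = np.diag(np.sign(np.diag(A)))           # fix signs on the diagonal of A
    A, R = A @ S, S @ R
    T = np.linalg.inv(A) @ P[:, 3]
    return A, R, T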
(6.7)
(6.8)
Alternatively, the transformation from the calibrated space to the uncalibrated given by x = A1 x0 , can be directly applied to epipolar constraint
by substituting for x:
xT2 TbRx1 = 0
(6.9)
consider first just the rigid body motion equation and multiply both sides by A:
λ2 A x2 = λ1 A R x1 + A T,  (6.10)
where λx = X, x' = Ax and T' = AT. In order to obtain an intrinsic constraint in terms of image measurements only, similarly to the calibrated case, we can eliminate the unknown depths by multiplying both sides of (6.7) by (x2' × T'), yielding the following constraint:
x2'^T T̂' A R A^{−1} x1' = 0.
(6.11)
Alternatively the transformation from the calibrated space to the uncalibrated given by x = A1 x0 , can be directly applied to epipolar constraints
by substituting for x:
xT2 TbRx1 = 0
(6.12)
The above cases both correspond to the uncalibrated version of the epipolar
constraint:
x2'^T F x1' = 0
(6.13)
where the fundamental matrix F is given by F ≐ T̂' A R A^{−1}.  (6.14)
We will study these different forms more closely in Section 6.3 and show
that the second form turns out to be more convenient to use for camera selfcalibration. Before getting more insights into the theory of self-calibration,
let us have a look at some geometric intuition behind the fundamental
matrix, its algebraic properties and algorithm for its recovery.
Figure 6.1. Two projections x1', x2' ∈ R³ of a 3-D point p from two vantage points. The relative Euclidean transformation between the two vantage points is given by (R, T) ∈ SE(3).
T
b M ) = (U RZ ( )diag{1, 1, 0}U T , U RZ
( )V T ) (6.15)
(m,
2
2
1 0 0
= 0 2 0
(6.16)
then we have:
T
T
b M ) = (U RZ ( )diag{1, 1, 0}U T , U RZ
(m,
( )V
) (6.17)
2
2
give the rise to the same fundamental matrix, for arbitrary choice of
parameters , , .
So, as we can see in the uncalibrated case, while the fundamental matrix
is uniquely determined from point correspondences, the factorization of F
into a skew-symmetric part and a non-singular matrix is not unique. There
is a three-parameter family of transformations consistent with the point
correspondences, which gives rise to the same fundamental matrix. This
fact can be observed directly from the epipolar constraint by noticing that:
x2'^T T̂' A R A^{−1} x1' = x2'^T T̂' (A R A^{−1} + T' v^T) x1' = 0.
(6.18)
F = [ f1 f2 f3 ; f4 f5 f6 ; f7 f8 f9 ]  (6.19)
and a corresponding vector f ∈ R⁹,
f = [f1, f2, f3, f4, f5, f6, f7, f8, f9]^T ∈ R⁹.
(6.20)
where a is given by
a = [x2'x1', x2'y1', x2'z1', y2'x1', y2'y1', y2'z1', z2'x1', z2'y1', z2'z1']^T ∈ R⁹.  (6.21)
Given a set of corresponding points, and forming the matrix of measurements A = [a1, a2, . . . , an]^T, f can be obtained as the minimizer of the least-squares objective ‖Af‖². Such a solution corresponds to the eigenvector associated with the smallest eigenvalue of A^T A. In the context of the above linear algorithm, since the point correspondences are given in pixel coordinates, the individual entries of the matrix A may be unbalanced and affect the conditioning of A^T A. An improvement can be achieved by normalizing transformations T1 and T2 in the respective images, which make the measurements zero-mean and of unit variance [?].
By the same arguments as in the calibrated case, the linear least-squares
solution is not an optimal one, and it is typically followed by a nonlinear
refinement step. More details on the nonlinear refinement strategies are
described [HZ00].
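The normalization and the linear solve described above fit in a few lines. The sketch below implements a plain normalized eight-point estimate of F, including the usual enforcement of rank 2 on the result; function names and the exact normalization (zero mean, unit average norm) are our own choices.

import numpy as np

def normalize_points(x):
    """Similarity transform taking the points to zero mean and unit average norm."""
    mean = x[:, :2].mean(axis=0)
    scale = np.sqrt(2.0) / np.mean(np.linalg.norm(x[:, :2] - mean, axis=1))
    T = np.array([[scale, 0.0, -scale * mean[0]],
                  [0.0, scale, -scale * mean[1]],
                  [0.0, 0.0, 1.0]])
    return (T @ x.T).T, T

def estimate_fundamental(x1, x2):
    """x1, x2: (n, 3) homogeneous pixel coordinates, n >= 8, with x2^T F x1 = 0."""
    x1n, T1 = normalize_points(x1)
    x2n, T2 = normalize_points(x2)
    A = np.array([np.kron(p2, p1) for p1, p2 in zip(x1n, x2n)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt    # enforce rank 2
    return T2.T @ F @ T1                       # undo the normalization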
: R3
X
R3
7 X0 = AX
Q K, R SO(3).
(6.24)
(6.25)
From equation (6.25) it is clear that an uncalibrated camera with calibration matrix A undergoing the motion (R(t), T (t)) observing the point
p E3 can never be distinguished from an uncalibrated camera with calibration matrix B undergoing the motion (R0 R(t)R0T , R0 T (t)) observing
the point R0 (p) E3 . The effect of R0 is nothing but a rotation of the
overall configuration space.
Therefore, without knowing camera motion and scene structure, the matrix A associated with an uncalibrated camera can only be recovered up to
an equivalence class A in the space SL(3)/SO(3). The subgroup K of all
upper-triangular matrices in SL(3) is one representation of such a space, as
is the space of 3 3 symmetric positive definite matrices with determinant
1. Thus, SL(3)/SO(3) does provide an intrinsic geometric interpretation
for the unknown camera parameters. In general, the problem of camera selfcalibration is then equivalent to the problem of recovering the symmetric
matrix = AT A1 , from which the upper-triangular representation of
the intrinsic parameters can be easily obtained from the Cholesky factorization. The coordinate transformation in the uncalibrated space is given
by
AX(t) = AR(t)X(t0 ) + AT (t)
0
(6.26)
From the ideal camera model, the image of the point p0 with respect to
such a camera (i.e. with an identity calibration matrix) is given by
(t)x0 (t) = P X0 (t)
(6.29)
(6.30)
(6.31)
p 7 Rp + T.
(6.32)
v 7 Rv.
(6.33)
g : T R3 R 3 T R 3 R 3 ;
v T 7 v T R
(6.34)
since this will be useful later. We equip E3 with the usual inner product
.
h, iS : T R3 T R3 7 R ; (u, v) 7 hu, vi = uT v
(6.35)
and we verify that such an inner product is invariant under the action of
the group, as hRu, Rvi = uT RT Rv = uT v = hu, vi. The inner product
.
operates on covectors exactly in the same way: huT , v T i = uT v, so that
T
T
the group action on covectors is also invariant: hu R, v Ri = uT RRT v =
uT v = huT , v T i. If we are given three (column) vectors, u, v, w, it is easy
to verify that their volume, i.e. the determinant of the matrix having them
as columns, is also invariant under the group action:
det[Ru, Rv, Rw] = det(R) det[u, v, w] = det[u, v, w].
(6.36)
Now consider a transformation of the Euclidean space E3 via a (nonsingular) linear map into another copy of the Euclidean space, which we
call E3 (it will soon be clear why we want to distinguish the two copies)
A : E3 E3 ;
p 7 Ap = q.
(6.37)
q 7 ARA1 q + AT
(6.38)
) where
U = AT R3 .
(6.39)
u 7 ARA1 u
(6.40)
uT 7 uT ARA1 .
(6.41)
The bilinear, positive definite form that is invariant under the action of h,
and plays the role of the inner product, is given by
.
h, iS : T R3 T R3 R ; (u, v) 7 hu, viS = uT AT A1 v
(6.42)
so that we have hARA1 u, ARA1 viS = uT AT RT AT AT A1 ARA1 v
= uT AT A1 v = hu, viS . The above bilinear form is determined uniquely
by its values on the elements of the basis, i.e. the elements of the matrix
.
S = AT A1 , so that we have
.
hu, viS = uT Sv
(6.43)
The bilinear form above acts on covectors via
h, iS : T R3 T R3 R ;
.
(uT , v T ) 7 huT , v T iS = uT AAT v
(6.44)
(6.45)
.
huT , v T iS = uT S 1 v which is different from hu, viS ; this shows that we
cannot quite treat row and column vectors alike. The difference in the
value of the inner product on vectors and covectors of E3 was not visible
on h, i since in that case S = I. If we assume that the transformation A has
determinant 1 (we can do this simply by scaling A, since we have assumed
it is nonsingular, so det(A) 6= 0) then, given three vectors u, v, w, their
volume is invariant under the group action, as
det[ARA1 u, ARA1 v, ARA1 w] = det(ARA1 ) det[u, v, w] = det[u, v, w].
(6.46)
and on covectors via
h : T R 3 T R 3 ;
uT 7 uT ARA1 .
(6.47)
(6.51)
.
where we define the Essential matrix Q = TbR T SO(3). The same volume
constraint, in the uncalibrated space, is given by
hq(t), F q0 iS = 0
(6.52)
(6.53)
The importance of the above constraints is that, since they are homogeneous, we can substitute readily x = 1 p for p in the calibrated space,
and x = 1 q for q in the uncalibrated one. Since x can be measured from
images, the above equations give constraints either on the Essential matrix
Q, in the calibrated case, or on the Fundamental matrix F , in the uncalibrated case. A more common expression of the fundamental matrix is given
by AT QA1 , which can be derived using the fact that
so that
d = AT TbA1
AT
(6.54)
(6.55)
Note that the vector U = AT , also called the epipole, is the (right) nullspace of the F matrix, which can be easily (i.e. linearly) computed from
image measurements. As in the previous section, let A be a generic, nonsingular matrix A SL(3) and write its Q-R decomposition as A = BR0 .
We want to make sure that the fundamental matrix does not depend on R0 :
F = (BR0 )T TbR(BR0 )1 = B T R0 TbR0T R0 RR0T B 1 . Therefore, we have
bRB
1 , where T = R0 T and R
= R0 RR0T .
that F = AT TbRA1 = B T T
In other words, as we expected, we cannot distinguish a point moving
with
with (T, R) with calibration matrix A from one moving with (T, R)
calibration matrix B.
Enforcing the properties of the inner product: Kruppa equations
Since the Fundamental matrix can be recovered only up to a scale, let us
assume, for the moment, that kU k = kAT k = 1. The following derivation
is taken from [?]. Let R0 SO(3) be a rotation matrix such that
U = R 0 e3
(6.56)
where e3 = [0 0 1]T . Note that at least one such matrix always exists (in
geometric terms, one would say that this is a consequence of SO(3) acting
transitively on the sphere S 2 ). Consider now the matrix D = R0 F R0T . A
convenient expression for it is given by
T
d1
.
(6.57)
D = R0 F R0T = eb3 (R0 A)T R(R0 A)T = dT2 .
0
= R0 A
\
A1
R0 e3 RA1 R0T
Since e3 is in the (left) null space of D, we can arrange the rows so that
the last one is zero. The remaining two columns are given by
(
dT1 = eT2 R0 AR(R0 A)1
(6.58)
dT2 = eT1 R0 AR(R0 A)1 .
Now, the crucial observation comes from the fact that the covectors d T1
and dT2 are obtained through a transformation of the covectors eT1 and eT2
= (U
, AR
A1 ), as indicated in equathrough the action of the group h
T
T
T
T
hd1 , d2 iS = he1 , e2 iS
T
T
T
(6.59)
hd1 , d1 iS = he2 , eT2 iS
T T
T
T
hd2 , d2 iS = he1 , e1 iS .
T 1
2 T 1
d 1 S d 2 = e 1 S e 2
T 1
(6.60)
d1 S d1 = 2 eT2 S 1 e2
T 1
2 T 1
d2 S d2 = e1 S e1
.
where 1 = kAT k. In order to eliminate the dependency on , we can
consider ratios of inner products. We obtain therefore 2 independent constraints on the 6 independent components of the inner product matrix S
which are called Kruppa equations.
hdT1 , dT2 iS
hdT , dT iS
hdT , dT iS
= 1T T1
= 2T T2 .
T
T
he1 , e2 iS
he2 , e2 iS
he1 , e1 iS
(6.61)
(6.62)
1T 1 1
2T 1 2
1T 1 2
=
=
2T 1 2
1T 1 1
1T 1 2
(6.63)
F 1 F T = c
T 0 1 c
T0 .
(6.64)
scaled by R, i.e. F = c
T 0 ARA1 ,5 we then have the matrix Kruppa
equation
c0 1 c
F 1 F T = 2 T
T0
(6.65)
way
T
c
T 0 (1 ARA1 1 AT RT AT )c
T 0 = 0.
(6.66)
(6.67)
(6.68)
(6.69)
(6.70)
1 C1 C T = 0.
(6.71)
6 Note that from the equation we can only solve ARA 1 up to an arbitrary scale. But
we also know that det(ARA1 ) must be 1.
129
C33
7
X CXC T .
(6.72)
130
The proof of this lemma is simply algebraic. This simple lemma, however, states a very important fact: given a set of fundamental matrices
Fi = c
Ti0 ARi A1 with Ti0 = ATi , i = 1, . . . , n, there is a one-to-one
correspondence between the set of solutions of the equations
T
Fi XFiT = 2i c
Ti0 X c
Ti0 ,
i = 1, . . . , n
T
Ei Y EiT = 2i Tbi Y Tbi ,
i = 1, . . . , n
can always be written of the form R = eub for some [0, ] and u S2 .
131
c0 T
c0 is a projection matrix to the plane spanned
Proof. Note that, since T
T
by the column vectors of c
T 0 , we have the identity c
T0 c
T 0c
T0 = c
T 0 . First
T 0F =
we prove the parallel case. It can be verified that, in general, F T c
\
T T . Since the axis of R is parallel to T , we have R T T = T hence
2 AR
FTc
T 0 F = 2 c
T 0 . For the perpendicular case, let u R3 be the axis of R.
By assumption T = A1 T 0 is perpendicular to u. Then there exists v R3
such that u = TbA1 v. Then it is direct to check that c
T 0 v is the eigenvector
of F T c
T 0 corresponding to the eigenvalue .
132
Then for these two types of special motions, the associated fundamental matrix can be immediately normalized by being divided by the scale
. Once the fundamental matrices are normalized, the problem of finding
the calibration matrix 1 from the normalized matrix Kruppa equations
(6.64) becomes a simple linear one. A normalized matrix Kruppa equation
in general imposes three linearly independent constraints on the unknown
calibration matrix given by (??). However, this is no longer the case for
the special motions that we are considering here.
Theorem 6.4 (Degeneracy of the normalized Kruppa equations).
Consider a camera motion (R, T ) SE(3) where R = eub has the angle
(0, ). If the axis u R3 is parallel or perpendicular to T , then the
normalized matrix Kruppa equation: TbRY RT TbT = TbY TbT imposes only
two linearly independent constraints on the symmetric matrix Y .
Proof. For the parallel case, by restricting Y to the plane spanned by the
column vectors of Tb, it yields a symmetric matrix Y in R22 . The rotation
SO(2). The
matrix R SO(3) restricted to this plane is a rotation R
Y R
T = 0.
normalized matrix Kruppa equation is then equivalent to Y R
Since 0 < < , this equation imposes exactly two constraints on the threedimensional space of 22 symmetric real matrices. The identity I22 is the
only solution. Hence the normalized Kruppa equation imposes exactly two
linearly independent constraints on Y . For the perpendicular case, since u
is in the plane spanned by the column vectors of Tb, there exist v R3 such
that (u, v) form an orthonormal basis of the plane. Then the normalized
matrix Kruppa equation is equivalent to
TbRY RT TbT = TbY TbT
v T RY RT v = v T Y v.
(6.73)
These are the only two constraints on Y imposed by the normalized Kruppa
equation.
According to this theorem, although we can renormalize the fundamental matrix when rotation axis and translation are parallel or perpendicular,
we only get two independent constraints from the resulting (normalized)
Kruppa equation corresponding to a single fundamental matrix, i.e. one of
the three linear equations in (??) depends on the other two. So degeneracy
does occur to normalized Kruppa equations. Hence for these motions, in
general, we still need three such fundamental matrices to uniquely determine the unknown calibration. What happens to the unnormalized Kruppa
equations? If we do not renormalize the fundamental matrix and directly
use the unnormalized Kruppa equations (6.63) to solve for calibration, the
two nonlinear equations in (6.63) may become algebraically dependent. The
following corollary shows that this is at least true for the case when the
133
Proof. From the proof of Theorem 6.4, we know that any other linear constraint on Y imposed by the normalized Kruppa equations must be a linear
combination the two given in (6.73)
v T RY u + v T RY RT v = v T Y u + v T Y v,
, R.
Cases
( 6= 0, 2 )
( = 0)
( =
2)
Type of Constraints
Unnormalized
Normalized
Unnormalized
Normalized
Unnormalized
Normalized
# of Constraints on 1
2
3
2
2
1
2
134
Figure 6.2. Two consecutive orbital motions with independent rotations: The
camera optical axis always pointing to the center of the globe. Even if all pairwise
fundamental matrices among the three views are considered, one only gets at
most 1 + 1 + 2 = 4 effective constraints on the camera intrinsic matrix if one
uses the three unnormalized matrix Kruppa equations. At least one more view
is need if one wishes to uniquely determine the unknown calibration. However,
using renormalization instead, we may get back to 2 + 2 + 2 5 constraints from
only the given three views.
135
T = 4 T 0 T 0 .
(6.74)
2
2
1
.
(6.75)
2 = 1T 1 = 2T 1 = 1T 1 =
2 2
1 1
1 2
T 0 T T 0
Is the last equation algebraically independent of the two Kruppa equations?
Although it seems to be quite different from Kruppas equations, it is in
fact dependent on them. This can be shown either numerically or using
simple algebraic tools such as Maple. Thus, it appears that our effort to
look for extra independent constraints on A from the fundamental matrix
has failed.8 In the following, we will give an explanation to this by showing
that not all which satisfy the Kruppa equations may give valid Euclidean
reconstructions of both the camera motion and scene structure. The extra
8 Nevertheless, extra implicit constraints on A may still be obtained from other algebraic facts. For example, the so-called modulus constraints give three implicit constraints
on A by introducing three extra unknowns, for more details see [PG99].
136
(6.76)
(6.77)
137
L3
o3
(R2, T2)
p
L2
o1
o2
(R1, T1 )
L1
Figure 6.3. A camera undergoes two motions (R1 , T1 ) and (R2 , T2 ) observing a
rig consisting of three straight lines L1 , L2 , L3 . Then the camera calibration is
uniquely determined as long as R1 and R2 have independent rotation axes and
rotation angles in (0, ), regardless of T1 , T2 . This is because, for any invalid
solution A, the associated plane N (see the proof of Theorem 6.5) must intersect
the three lines at some point, say p. Then the reconstructed depth of point p with
respect to the solution A would be infinite (points beyond the plane N would
have negative recovered depth). This gives us a criteria to exclude all such invalid
solutions.
two constraints from one fundamental matrix even in the two special cases
when Kruppas equations can be renormalized extra ones are imposed by
the structure, not the motion. The theorem also resolves the discrepancy
between Kruppas equations and the necessary and sufficient condition for
a unique calibration: Kruppas equations, although convenient to use, do
not provide sufficient conditions for a valid calibration which allows a valid
138
So far, we have mainly considered camera self-calibration when the motion of the camera is discrete positions of the camera are specified as
discrete points in SE(3). In this section, we study its continuous version.
T
(6.78)
where X0 = AX.
Uncalibrated continuous epipolar geometry
By the general case we mean that both the angular and linear velocities
0 = x
+ x.
Then (6.78)
and v are non-zero. Note that X0 = x yields X
gives an uncalibrated version of the continuous epipolar constraint
x T AT vbA1 x + xT AT
b vbA1 x = 0
(6.79)
This will still be called the continuous epipolar constraint. As before, let
s R33 to be s = 12 (b
vb + vb
b ). Define the continuous fundamental matrix
F R63 to be:
T 1
A vbA
F =
.
(6.80)
AT sA1
9 Some possible ways of harnessing the constraints provided by chirality have been
discussed in [Har98]. Basically they give inequality constraints on the possible solutions
of the calibration.
10 This section may be skipped at a first reading
139
c repeatedly, we obtain
this property of the hat operator ()
1 T
vb + vb
b )A1
A (b
2
1 T
1
=
b AT vb0 + vb0 Ab
(A
A1 ) = (b0 1 vb0 + vb0 1 b0 ).
2
2
Then the continuous epipolar constraint (6.79) is equivalent to
AT sA1
1
x T vb0 x + xT (b0 1 vb0 + vb0 1 b0 )x = 0.
(6.81)
2
Suppose 1 = BB T for another B SL(3), then A = BR0 for some
R0 SO(3). We have
1
x T vb0 x + xT (b0 1 vb0 + vb0 1 b0 )x = 0
2
T b0
T 1 b0
x v x + x ( BB T vb0 + vb0 BB T b0 )x = 0
2
1
d
d
d 1 x = 0.
x T B T R
vB
x + xT B T R
0
0 R0 vB
(6.82)
Comparing to (6.79), one cannot tell the camera A with motion (, v) from
the camera B with motion (R0 , R0 v). Thus, like the discrete case, without
knowing the camera motion the calibration can only be recovered in the
space SL(3)/SO(3), i.e. only the symmetric matrix 1 hence can be
recovered.
However, unlike the discrete case, the matrix cannot be fully recovered
in the continuous case. Since 1 = AAT is a symmetric matrix, it can be
diagonalized as
1 = R1T R1 ,
R1 SO(3)
00
(6.83)
00
1 b0 1 b0 b0 1 b0
1 c00 c00 c00 c00
( v + v ) = R1T (
v + v )R1 .
(6.84)
2
2
Thus the continuous epipolar constraint (6.79) is also equivalent to:
1 c00 c00 c00 c00
T vc00 (R1 x) + (R1 x)T (
v + v )(R1 x) = 0.
(6.85)
(R1 x)
2
From this equation, one can see that there is no way to tell a camera
A with AAT = R1T R1 from a camera B = R1 A. Therefore, only the
diagonal matrix can be recovered as camera parameters if both the scene
structure and camera motion are unknown. Note that is in SL(3) hence
1 2 3 = 1. The singular values only have two degrees of freedom. Hence
we have:
Theorem 6.6 (Singular values of calibration matrix from continuous motion). Consider a camera with an unknown calibration matrix
140
A SL(3) undergoing an unknown motion (, v). Then only the eigenvalues of AAT , i.e. singular values of A, can be recovered from the bilinear
continuous epipolar constraint.
If we define that two matrices in SL(3) are equivalent if and only if
they have the same singular values. The intrinsic parameter space is then
reduced to the space SL(3)/ where represents this equivalence relation.
The fact that only two camera parameters can be recovered was known to
Brooks et al. [BCB97]. They have also shown how to do self-calibration for
matrices A with only two unknown parameters using the above continuous
method.
Comment 6.5. It is a little surprising to see that the discrete and continuous cases have some significant difference for the first time, especially
knowing that in the calibrated case these two cases have almost exactly parallel sets of theory and algorithms. We believe that this has to do with the
map:
A : R33
B
R33
7 ABAT
141
(6.87)
C33
7 CX + XC T
(6.88)
If 6= 0, the eigenvalues of
b have the form 0, i, i with R. Let
the corresponding eigenvectors are , u, u
C3 . According to Callier and
Desoer [CD91], the kernel of the map L has three dimensions and is given
by:
Ker(L) = span{X1 = A AT , X2 = Auu AT , X3 = A
uu
AT }. (6.89)
As in the discrete case, the symmetric real X is of the form X = X1 +
(X2 + X3 ) for some , R. That is the symmetric real kernel of L is
only two-dimensional. We denote this space as SRKer(L). We thus have:
Lemma 6.5. Given a matrix C = Ab
A1 with kk = 1, the symmetric real kernel associated with the Lyapunov map L : CX + XC T is
two-dimensional.
Similar to the discrete case we have:
Theorem 6.7 (Two continuous rotational motions determine calibration). Given matrices Ci := Ab
i A1 R33 , i = 1, . . . , n with
ki k = 1. The symmetric matrix = AT A1 SL(3) is uniquely determined if and only if at least two of the n vectors i , i = 1, . . . , n are linearly
independent.
11 If
142
(6.90)
since Tb0 (T 0 vT ) = 0.
Below we demonstrate that these possible decompositions of F yield projection matrices, which are related to each other by non-singular matrices
H GL(4). The following theorem is due to Hartley [HZ00]. Without loss
of generality we assume that the projection matrix associated with the first
camera is P1 = [I, 0].
Theorem 6.8 (Projective reconstruction). Let F be the fundamental matrix and (P1 , P2 ) and (P1 , P2 ) are two possible pairs of projection
matrices, yielding the same fundamental matrix F . Then there exists a
nonsingular transformation H GL(4) such that P1 = H P1 and P2 =
H P2 .
Suppose that there are two possible decompositions of F such that F =
b
b
b
aA and F = bB,
then b = a and B = 1 (A + avT ). Since b
aA = bB,
then b
a(B A) = 0, so B A = avT , hence B = 1 (A + avT ). What
remains to be verified that given the projections matrices P = [A, a] and
P = [1 (A + avT ), a] they are indeed related by the matrix H, such that
P H = P , with H:
H=
1 I
1 vT
(6.91)
143
Canonical decomposition of F
The previous argument demonstrated that there is an entire 4 parameter
family of transformations parameterized by (, v), consistent with F . In order to be able to proceed with the reconstruction we need a systematic way
to choose a representative from this family. Such a canonical decomposition
of F consistent with a set of measurements has the following form:
P1 = [I, 0] and P2 = [Tb0 F, T 0 ].
(6.94)
1T
y1i (p1T
3 Xp ) = p 2 Xp
2T
xi2 (p2T
3 Xp ) = p 1 Xp
2T
y2i (p2T
3 Xp ) = p 2 Xp
where P1 = [p11 , p12 , p13 ]T and P2 = [p21 , p22 , p23 ]T are the projection matrices
described in terms of their rows. Structure can be then recovered as a linear
least squares solution to the system of homogeneous equations
AXp = 0
144
Figure 6.4. The true Euclidean Structure X of the scene and the projective
structure Xp obtained by the algorithm described above.
(6.95)
This can be done in several steps as the remainder of this chapter outlines.
The projective transformation in equation (6.91) characterizes the ambiguity relating the choice of projection matrices P and P . Even when we can
exactly decompose F into parts related to rotation ARA1 and translation
AT and use it for reconstruction, the structure obtained in such way, is related to the original Euclidean structure by an unknown matrix A. Hence
the projective transformation H upgrading projective structure Xp to Euclidean structure X has to resolve all these unknowns. For this process it
is instrumental to consider following decomposition of the transformation
H:
1
sR T
A
0
I 0
H = H e Ha Hp =
(6.96)
0 1
0
1
vT 1
The first part He depends simply on the choice of coordinate system and
scale a factor, the second part Ha corresponds to so called affine upgrade,
and is directly related to the knowledge of intrinsic parameters of the camera and the third transformation is so called projective upgrade Hp , which
transforms points lying on a particular plane [vT 1]Xp = 0 to the points
at infinity. This basic observation sets the stage for the so called stratified
approach to Euclidean reconstruction and self-calibration. In such setting
the projective structure Xp is computed first, followed by computation of
Hp and Ha and gradual upgrade of Xp to to Xa and xX in the following
145
way:
X = H a Xa
and Xa = Hp Xp
146
X1p , X2p , X3p obtained in the projective reconstruction step, we can compute
the unknown vector v, which corresponds to the plane in projective space,
where all vanishing points lie.
[v, 1]T Xip = 0
Once v has been determined the projective upgrade Hp is defined and the
projective structure Xp can be upgraded to Xa .
I 0
Xa =
Xp
(6.98)
vT 1
Hence the projective upgrade can be computed when images of at least
three vanishing points are available.
The situation is not completely lost, in case the above structure constraints
are not available. Pollefeys in [?] proposed a technique, where additional so
called modulus constraints on v are imposed, which capture the fact that
v, needs to be chosen in a way that A + T 0 vT is a rotation matrix. This
approach requires a solution to a set of fourth order polynomial equations
and multiple views (four) are necessary, for the solution to be well conditioned. In Hartley [?] previously introduced chirality constraints are used
to determine possible range of values for v, where the optimum is achieved
using linear programming approach.
Direct affine reconstruction
There are several ways how to achieve affine reconstruction Xa , which will
bring is in the context of stratified approach one step closer to the Euclidean
reconstruction. One commonly used assumption in order to obtain affine
reconstruction is to assume scaled orthographic camera projection model [?,
TK92].
When general perspective projection model is used and the motion is
pure translation the affine structure can be recovered directly.
147
Pure translation
From two views related by pure translation only, we have:
2 Ax2
2 x02
= 1 Ax1 + AT
= 1 x01 + T 0
(6.99)
Once T 0 has been computed, note that the unknown scales are now linear functions of known quantities and can be computed directly from the
equation (6.100) as
1 =
b02 T
(b
x02 x01 )T x
0
0
kb
x2 x1 k 2
One can easily verified that the structure Xa obtained in this manner is
indeed related to the Euclidean structure X by some unknown general affine
transformation:
1
A
0
sR T
A T
=
(6.100)
Ha =
0
1
0 1
0 1
= M 1 Xa + m 1
= M 2 Xa + m 2
(6.101)
= M 1 x01 + m
(6.102)
= 1 ARA1 x01 + T 0
(6.103)
148
More details
0. More views are needed 1. Compute S 2. Cholesky 3. Get A.
Structure constraints
Natural additional constraints we can employ in order to solve for the unknown metric S are constraints on the structure; in particular distances
between points and angles between lines. Angles are distances are directly
related to notion of the inner product in the uncalibrated space and hence
when known, can be used to solve for S. Commonly encountered constraint
in man-made environments are orthogonality constraints between sets of
lines in 3D. These are reflected in the image plane in terms of angles between vanishing directions, for example v1 , v2 in the image plane, which
correspond to a set of orthogonal lines in 3D have to satisfy the following
constraint:
v1T S 1 v2 = 0
(6.104)
1
since the inner product for orthogonal lines is 0. Note since S is a symmetric matrix with six degrees of freedom, we need at least five pairs of
perpendicular lines to solve for S 1 .
More details
Point of difference between S, S 1 etc, how to get A.
Constraints on A
Computation of S 1 and consequently A can be simplified when some of
the intrinsic parameters are known. The most commonly used assumptions
are zero skew and known aspect ratios. In such case one can demonstrate
that S 1 can be parameterized by fewer then six parameters and hence
smaller number of constraints is necessary.
6.7 Summary
6.8 Exercises
1. Lyapunov maps
Let {}ni=1 be the (distinct) eigenvalues of A Cnn . Find all the n2
eigenvalues and eigenvectors of the linear maps:
(a) L1 : X Cnn 7 AT P P A Cnn .
(b) L2 : P Cnn 7 AT P A P Cnn .
6.8. Exercises
149
Note that the first identity showed in the proof of Theorem 3.3 (in
handout 7); the second identity in fact can be used to derive the socalled modulus constraints on calibration matrix A (See Hartley and
Zissermann, page 458). (Hint: (a) is easy; for (b), the key step is to
show that the matrix on the left hand side is of rank 1, for which you
T
c0 T
c0 is a projection matrix.)
may need to use the fact that T
R SO(3), S T = S R33 ,
prove Lemma 7.3.1 for the case that rotation angle of R is between
0 and . What are the two typical solutions for S? (Hint: You may
need eigenvectors of the rotation matrix R. Properties of Lyapunov
map from the previous homework should give you enough ideas how
to find all possible S from eigenvectors (constructed from those of
R) of the Lyapunov map L : S 7 S RSRT .) Since a symmetric
real matrix S is usually used to represent a conic xT Sx = 1, S which
satisfies the above equation obviously gives a conic that is invariant
under the rotation R.
Part III
Geometry of multiple
views
Chapter 7
Introduction to multiple view
reconstruction
154
155
.
Then the (perspective) image x(t) = [x(t), y(t), z(t)]T R3 of p, taken by
a moving camera at time t satisfies the relationship
(t)x(t) = A(t)P g(t)X,
(7.1)
(7.3)
PSfrag replacements
L
p
x1
lb1
V
po
lb2
l2
x2
o2
(R, T )
o1
l1
Figure 7.1. Images of a point p on a line L. Planes extended from the images
lb1 , lb2 should intersect at the line L in 3-D. Lines extended from the image points
x1 , x2 intersect at the point p. l1 , l2 are the two co-images of the line.
156
(7.5)
33
7.2. Preliminaries
157
b R33 ,
x
l R3 .
(7.6)
xi = x(ti ),
li = l(ti ),
i = A(ti )P g(ti ).
(7.7)
We will call the matrix i the projection matrix relative to the ith camera
frame. The matrix i is then a 3 4 matrix which relates the ith image of
the point p to its world coordinates X by:
i x i = i X
and the i
th
(7.8)
(7.9)
(7.10)
33
7.2 Preliminaries
We first observe that in above equations the unknowns, i s, X, Xo and V,
which encode the information about location of the point p or the line L are
not intrinsically available from the images. By eliminating these unknowns
from the equations we obtain the remaining intrinsic relationships between
x, l and only, i.e. between the image measurements and the camera configuration. Of course there are many different, but algebraically equivalent
ways for elimination of these unknowns. This has in fact resulted in different kinds (or forms) of multilinear (or multifocal) constraints that exist in
the computer vision literature. Our goal here should be a more systematic
158
way of eliminating all the above unknowns that results in a complete set of
conditions and a clear geometric characterization of all constraints.
To this end, we can re-write the above equations in matrix form as
1
1
x1 0
0
0 x2
0
2 2
(7.11)
=
..
. X
.
..
..
..
.
.
.
. .. ..
0
x1 0
0 x2
.
I = .
..
..
..
.
.
0
becomes
xm
0
0
..
.
xm
.
~ =
1
2
..
.
m
I ~ = X.
.
=
1
2
..
.
m
(7.12)
(7.13)
which is just a matrix version of (7.8). For obvious reasons, we call ~ R3m
the depth scale vector, and R3m4 the projection matrix associated
to the image matrix I R3mm . The reader will notice that, with the
exception of xi s, everything else in this equation is unknown! Solving for
the depth scales and the projection matrices directly from such equations is
by no means straightforward. Like we did in the two-view case, therefore,
we will decouple the recovery of the camera configuration matrices i s
from recovery of scene structure, i s and X.
In order for equation (7.13) to be satisfied, it is necessary for the columns
of I and to be linearly dependent. In other words, the matrix
1 x1 0
0
..
2 0 x 2 . . .
.
.
R3m(m+4)
Np = [, I] = .
(7.14)
.
.
.
..
..
..
..
0
m 0
0 xm
must have a non-trivial null-space. Hence
rank(Np ) m + 3.
(7.15)
.
In fact, from equation (7.13) it is immediate to see that the vector u =
[XT , ~T ]T Rm+4 is in the null space of the matrix Np , i.e. Np u = 0.
Similarly, for line features the following matrix:
T
l1 1
T
. l2 2
(7.16)
Nl = . Rm4
..
lTm m
159
(7.17)
since it is clear that the vectors Xo and V are both in the null space of
the matrix Nl due to (7.9). In fact, any X R4 in the null space of Nl
represents the homogeneous coordinates of some point lying on the line L,
and vice versa.
The above rank conditions on Np and Nl are merely the starting point.
There is some difficulty in using them directly since: 1. the lower bounds
on their rank are not yet clear; 2. their dimensions are high and hence the
above rank conditions contain a lot of redundancy. This apparently obvious
observation is the basis for the characterization of all constraints among
corresponding point and line in multiple views. To put our future development in historic context, in the rest of this chapter we will concentrated on
only the pairwise and triple-wise images.
(7.19)
for some matrix F R33 . In our case, the entries of F are the 44 minors
of the matrix
1
=
R64 .
2
For instance, for the case of two calibrated cameras, the matrix F is exactly
the essential matrix.2 To see this, without loss of generality, let us assume
2 In the tensorial jargon, the epipolar constraint is sometimes referred to as the bilinear
constraint, and the essential or fundamental matrix F as the bifocal tensor. Mathematically, F can be interpreted as a covariant 2-tensor whose contraction with x 1 and x2 is
zero.
160
that the first camera frame coincides with the world frame and the camera
motion between the two views is (R, T ). Then we have
1 = [I, 0],
2 = [R, T ]
R34 .
0
Rx1
(7.20)
0
x2
For the last step, we use the fact from linear algebra that det[v1 , v2 , v3 ] =
v1T (v2 v3 ). This yields
Therefore, we obtain
det(Np ) = 0
xT2 TbRx1 = 0
(7.21)
which says that, for the two-view case, the rank condition (7.15) is equivalent to the epipolar constraint. The advantage of the rank condition is
that it generalizes to multiple views in a straightforward manner, unlike
the epipolar constraint. A similar derivation can be followed for the uncalibrated case by just allowing R R3 to be general and not a rotation
matrix. The resulting matrix F simply becomes the fundamental matrix.
However, for two (co-)images l1 , l2 of a line L, the above condition (7.17)
for the matrix Nl becomes
T
l
rank(Nl ) = rank 1T 1 2.
(7.22)
l2 2
But here Nl is a 2 4 matrix which automatically has a rank less and
equal to 2. Hence there is essential no intrinsic constraints on two images
of a line. This is not so surprising since for arbitrarily given two vectors
l1 , l2 R3 , they represent two planes relative to the two camera frames
respectively. These two planes will always intersect at some line in 3-D,
and l1 , l2 can be interpreted as two images of this line. A natural question
hence arises: are there any non-trivial constraints for images of a line in
more than 2 views? The answer will be yes which to some extent legitimizes
the study of triple-wise images. Another advantage from having more than
2 views is to avoid some obvious degeneracy inherent in the pairwise view
geometry. As demonstrated later in Figure 8.2, there are cases when all
image points satisfy pairwise epipolar constraint but they do not correspond
to a unique 3-D point. But as we will establish rigorously in the next
chapter, constraints among triple-wise views can eliminate such degenerate
cases very effectively.
In the next section, for the purpose of example, we show how to derive
all multilinear constraints between corresponding images in m = 3 views.
161
We however do not pursue this direction any further for m 4 views, since
we will see in Chapters 8, 9, and 10 that there are no more multilinear
constraints among more than 4 views! However, a much more subtle picture
of the constraints among multiple images of points and lines cannot be
revealed untill we study those more general cases.
1 x1 0
0
(7.23)
Np = 2 0 x2 0 R97
3 0
0 x3
then has more rows than columns. Therefore, the rank condition (7.15)
implies that all 7 7 sub-matrices are singular. As we explore in the exercises, the only irreducible minor of Np must contain at least two rows from
each image. For example, if we choose all three rows of the first image and
the {1, 2}th rows from the second and the {1, 2}th rows from the third, we
obtain a 7 7 sub-matrix of Np as
1 x1
0
0
12 0 x21
0
2
R77
N033 =
0
x
0
(7.24)
22
12
3 0
0 x31
23 0
0 x32
where the subscript rst of Nrst indicates that the r th , sth and tth rows
of images 1, 2 and 3 respectively are omitted from the original matrix Np
when Nrst is formed. The rank condition (7.15) implies that
det(N033 ) = 0
(7.25)
2 = [R2 , T2 ],
I
0
Np = R 2 T 2
R 3 T3
3 = [R3 , T3 ]
R34 ,
33
(7.26)
x1 0
0
0 x2 0 R97 .
(7.27)
0
0 x3
162
This matrix should satisfy the rank condition. After a column manipulation
(which eliminates x1 from the first row), it is easy to see that Np has the
same rank as the matrix
I
0
0
0
0
(7.28)
Np0 = R2 T2 R2 x1 x2 0 R97 .
R 3 T3 R 3 x1 0 x 3
T
x2
0
x
c2
0
86
D=
0 xT3 R
c3
0 x
by Np00 yields
xT2 T2
x
c2 T2
DNp00 =
xT3 T3
c3 T3
x
xT2 R2 x1
c2 R2 x1
x
xT3 R3 x1
c3 R3 x1
x
xT2 x2
0
0
0
0
0
xT3 x3
0
(7.29)
(7.30)
R84 .
(7.31)
rank(Np00 ) = rank(DNp00 ).
Hence the original matrix Np is rank deficient if and only if the following
sub-matrix of DNp00
c2 R2 x1
c2 T2 x
x
R62
(7.32)
Mp =
c3 R3 x1
c3 T3 x
x
a 1 b1
..
..
.
.
an
bn
vectors ai , bi R3 , i = 1, . . . n, the
3n2
R
(7.33)
(7.34)
163
Here Nl is a 34 matrix and the above rank condition is no longer void and
it indeed imposes certain non-trivial constraints on the quantities involved.
In the case we choose the first view to be the reference, i.e. 1 = [I, 0], we
have
T T
l1
0
l1 1
lT2 2 = lT2 R2 lT2 T2 .
(7.36)
lT3 R3 lT2 T3
lT3 3
T
l1 l1
0
0
l1
0
b
lT2 R2 lT2 T2 l1 l1 0 = lT2 R2 l1 lT2 R2 lb1 lT2 T2 .
(7.37)
0 0 1
lT3 R3 lT3 T3
lT3 R3 l1 lT3 R3 lb1 lT3 T3
(7.38)
has a rank less and equal to 1. That is the two row vectors of Ml are linearly
dependent. In case lT2 T2 , lT3 T3 are non-zero, this linear dependency can be
written as
lT3 (T3 lT2 R2 R3 lT2 T2 )lb1 = 0.
(7.39)
164
(7.40)
where l2 , l3 are respectively co-images of two lines that pass through the
same point whose image in the first view is x1 . Here, l2 and l3 do not have
to correspond to the coimages of the same line in 3-D. This is illustrated
in Figure 7.2.
L1
L2
PSfrag replacements
l3
o3
x1
o1
l2
o2
The above trilinear relation (7.40) plays a similar role for triple-wise view
geometry as the fundamental matrix for pairwise view geometry. 3 Here
the fundamental matrix must be replaced by the notion of trifocal tensor,
traditionally denoted as T . Like the fundamental matrix, the trifocal tensor
depends (nonlinearly) on the motion paramters R and T s, but in a more
complicated way. The above trilinear relation can then be written formally
3 A more detailed study of the relationship between the trilinear constraints and the
bilinear epipolar constraints is given in the next Chapter.
7.5. Summary
165
3,3,3
X
i,j,k=1
(7.41)
Intuitively, the tensor T consists of 333 entries since the above trilinear
relation has as many coefficients to be determined. Like the fundamental
matrix which only has 7 free parameters (det(F ) = 0), there are in fact only
18 free parameters in T . If one is only interested in recovering T linearly
from triple-wise images as did we for the fundamental matrix from pairwise
ones, by ignoring their internal dependency, we need at least 26 equations
of the above type (7.41) to recover all the coefficients T (i, j, k)s up to a
scale. We hence needs at least 26 the triplets (x1 , l2 , l3 ) with correspondence
specified by Figure 7.2. We leave as exercise at least how many point or line
correspondences across three views are needed in order to have 26 linearly
independent equations for solving T .
7.5 Summary
7.6 Exercises
1. Image and co-image of points and lines
Suppose p1 , p2 are two points on the line L; and L1 , L2 are two lines
intersecting at the point p. Let x, x1 , x2 be the images of the points
p, p1 , p2 respectively and l, l1 , l2 be the co-images of the lines L, L1 , L2
respectively.
(a) Show that for some , R:
c1 x2 ,
l = x
x = lb1 l2 .
bv,
l2 = x
x1 = blr,
x2 = bls.
(c) Please draw a picture and convince yourself about the above
relationships.
2. Linear estimating of the trifocal tensor
(a) Show that the corresponding three images x1 , x2 , x3 of a point
p in 3-D give rise to 4 linearly independent constraints of the
type (7.41).
(b) Show that the corresponding three co-images l1 , l2 , l3 of a line L
in 3-D give rise to 2 linearly independent constraints of the type
(7.41).
166
(c) Conclude that in general one needs n points and m lines across
three views with 4n + 2m 26 to recover the trifocal tensor T
up to a scale.
3. Multi-linear function
A function f (, ) : Rn Rn R is called bilinear if f (x, y) is linear
in x Rn if y Rn is fixed, and vice versa. That is,
f (x1 + x2 , y) = f (x1 , y) + f (x2 , y),
f (x, y1 + y2 ) = f (x, y1 ) + f (x, y2 ),
for any x, x1 , x2 , y, y1 , y2 Rn and , R. Show that any bilinear
function f (x, y) can be written as
f (x, y) = xT M y
for some matrix M Rnn . Notice that in the epipolar constraint
equation
xT2 F x1 = 0,
the left hand side is a bilinear form in x1 , x2 R3 . Hence epipolar
constraint is also sometimes referred as bilinear constraint. Similarly
we can define trilinear or quadrilinear functions. In the most general
case, we can define a multi-linear function f (x1 , . . . , xi , . . . , xm ) which
is linear in each xi Rn if all other xj s with j 6= i are fixed. Such
a function is called m-linear. In mathematics, another name for such
multi-linear object is tensor. Try to describe, while we can use matrix to represent a bilinear function, how to represent a multi-linear
function then?
4. Minors of a matrix
Suppose M is a mn matrix with m n. Then M is a rectangular
matrix with m rows and n columns. We can pick arbitrary n (say
th
ith
1 , . . . , in ) distinctive rows of M and form a n n sub-matrix of
M . If we denote such n n sub-matrix as Mi1 ,...,in , its determinant
det(Mi1 ,...,in ) is called a minor of M . Prove that the n column vectors
are linearly dependent if and only if all the minors of M are identically
zero, i.e.
det(Mi1 ,...,in ) = 0,
i1 , i2 , . . . , in [1, m].
By the way, totally how many different minors can you possibly gotten
there? (Hint: For the if and only if problem, the only if part is
easy; for the if part, proof by induction makes it easier.) This fact
was used to derive the traditional multi-linear constraints in some
computer vision literature.
5. Linear dependency of two vectors
Given any four non-zero vectors a1 , . . . , an , b1 , . . . , bn R3 , the
7.6. Exercises
following matrix
a1
..
.
an
b1
.. R3n2
.
bn
167
(7.42)
Chapter 8
Geometry and reconstruction from
point features
i = 1, . . . , m
(8.1)
169
.
time ti . We call i = A(ti )P g(ti ). Collecting all the indices, we can write
(8.1) in matrix form as
x1
0
..
.
0
x2
..
.
..
.
0
0
..
.
xm
I ~
1
2
..
.
X
1
2
..
.
m
X.
(8.2)
1 x1 0
0
..
2 0 x 2 . . .
.
.
Np = [, I] = .
(8.3)
.
.
.
..
..
..
..
0
m 0
0 xm
(8.4)
Remark 8.1 (Null space of Np ). Even though equations (8.2) and (8.4)
are equivalent, if rank(Np ) m + 2 then equation Np v = 0 has more
than one solution. In the next section, we will show that rank(Np ) is either
m + 2 or m + 3 and that the first case happens if and only if the point being
observed and all the camera centers lie on the same line.
Remark 8.2 (Positive depth). Even if rank(Np ) = m + 3, there is no
guarantee that ~ in (8.2) will have positive entries.1 In practice, if the point
being observed is always in front of the camera, then I, and X in (8.2)
will be such that the entries of ~ are positive. Since the solution to Np v = 0
.
is unique and v = [XT , ~T ]T is a solution, then the last m entries of v
have to be of the same sign.
It is a basic fact of linear algebra that a matrix being rank-deficient can be
expressed in terms of the determinants of certain submatrices being zero.
1 In a projective setting, this is not a problem since all points on the same line through
the camera center are equivalent. But it does matter if we want to choose appropriate
reference frames such that the depth for the recovered 3-D point is physically meaningful,
i.e. positive.
170
(8.5)
R34 ,
(8.6)
I 0
0
0 0
0
I
R2
R2 T2 R2 x1 x2 . . . ..
=
.
.
0
. .
.
.
.
N
..
p
.
.. ..
0 .. 0
Rm
R m Tm R m x1 0 0 x m
2 Depending on the context, the reference frame could be either an Euclidean, affine or
projective reference frame. In any case, the projection matrix for the first image becomes
[I, 0] R34 .
171
T
x2 0
0
x
0
c2 0
.. . .
.. R4(m1)3(m1)
.
..
(8.7)
Dp = .
.
.
T
0
0 xm
0
0 xc
m
x2 T2 xT2 R2 x1 xT2 x2 0
0
0
x
c2 R2 x1
x
0
0
0
0
c2 T2
..
..
.
.
.
.
.
0
0
0
.
.
..
..
..
.
.
0
0
0
T
xT Tm xT R m x1
0
0
0
x
x
m
m
m m
c
xc
T
x
R
x
0
0
0
0
m m
m m 1
Since Dp has full rank 3(m 1), we have rank(Np0 ) = rank(Dp Np0 ). Hence
the original matrix Np is rank-deficient if and only if the following submatrix of Dp Np0 is rank-deficient
c2 T2
c2 R2 x1
x
x
x
c3 T3
x
c3 R3 x1
Mp =
(8.8)
R3(m1)2 .
..
..
.
.
xc
xc
m R m x1
m Tm
(8.9)
172
Due to the rank equality (8.9) for Mp and Np , we conclude that the rankdeficiency of Mp is equivalent to all multi-linear constraints that may arise
among the m images. To see this more explicitly, notice that for Mp to be
rank-deficient, it is necessary for the any pair of the vectors xbi Ti , xbi Ri x1 to
be linearly dependent. This gives us the well-known bilinear (or epipolar)
constraints
xTi Tbi Ri x1 = 0.
(8.10)
(8.11)
between the ith and 1st images. Hence the constraint rank(Mp ) 1 consistently generalizes the epipolar constraint (for 2 views) to arbitrary m
views. Using Lemma 7.1. we can conclude that
173
c2 R2 x1 = 0, xT3 T
c3 R3 x1 = 0,
xT2 T
T T
c2 (T2 x1 R3 R2 x1 T3T )c
x
x3 = 0.
Now the inverse problem: If the three vectors x1 , x2 , x3 satisfy either the
bilinear constraints or trilinear constraints, are they necessarily images of
some single point in 3-D, the so-called pre-image?
Let us first study whether bilinear constraints are sufficient to determine
a unique pre-image in 3-D. For the given three vectors x1 , x2 , x3 , suppose
that they satisfy three pairwise epipolar constraints
xT2 F21 x1 = 0,
xT3 F31 x1 = 0,
xT3 F32 x2 = 0,
(8.12)
th
with Fij = Tc
and j th images.
ij Rij the fundamental matrix between the i
Note that each image (as a point on the image plane) and the corresponding optical center uniquely determine a line in 3-D that passes through
174
PSfrag replacements
x3
x1
o3
x2
o1
o2
Figure 8.1. Three rays extended from the three images x1 , x2 , x3 intersect at one
point p in 3-D, the pre-image of x1 , x2 , x3 .
them. This gives us a total of three lines. Geometrically, the three epipolar constraints simply imply that each pair of the three lines are coplanar.
So when do three pairwise coplanar lines intersect at exactly one point
in 3-D? If these three lines are not coplanar, the intersection is uniquely
determined, so is the pre-image. If all of them do lie on the same plane,
such a unique intersection is not always guaranteed. As shown in Figure
8.2, this may occur when the lines determined by the images lie on the
plane spanned by the three optical centers o1 , o2 , o3 , the so-called trifocal
plane, or when the three optical centers lie on a straight line regardless of
the images, the so-called rectilinear motion. The first case is of less pracPSfrag replacements
o3
p?
x3
o1
x1
x2
o2
p?
x1
o1
x2
o2
x3
o3
Figure 8.2. Two cases when the three lines determined by the three images
x1 , x2 , x3 lie on the same plane, in which case they may not necessarily intersect
at a unique point p.
tical effect since 3-D points generically do not lie on the trifocal plane.
The second case is more important: regardless of what 3-D feature points
one chooses, pairwise epipolar constraints alone do not provide sufficient
constraints to determine a unique 3-D point from any given three image
vectors. In such a case, extra constraints need to be imposed on the three
images in order to obtain a unique pre-image.
175
That is
(8.13)
If T3 and R3 x1 are linearly independent, then (8.13) holds if and only if the
vectors R2 x1 , T2 , x2 are linearly dependent. This condition simply means
that the line associated to the first image x1 coincide with the line determined by the optical centers o1 , o2 .5 If T3 and R3 x1 are linearly dependent,
x3 is determinable since R3 x1 lies on the line determined by the optical centers o1 , o3 . Hence we have shown, that x3 cannot be uniquely determined
from x1 , x2 by the trilinear constraint if and only if
c2 R2 x1 = 0,
T
and
c3 R3 x1 = 0,
T
and
c2 x2 = 0.
T
(8.14)
c3 x3 = 0.
T
(8.15)
We still need to show that these three images indeed determine a unique
pre-image in 3-D if either one of the images can be determined from the
other two by the trilinear constraint. This is obvious. Without loss of generality, suppose it is x3 that can be uniquely determined from x1 and x2 .
Simply take the intersection p0 E3 of the two lines associated to the first
two images and project it back to the third image plane such intersection
exists since the two images satisfy epipolar constraint.6 Call this image
x03 . Then x03 automatically satisfies the trilinear constraint. Hence x03 = x3
due to its uniqueness. Therefore, p0 is the 3-D point p where all the three
lines intersect in the first place. As we have argued before, the trilinear
constraint (8.11) actually implies bilinear constraint (8.10). Therefore, the
3-D pre-image p is uniquely determined if either x3 can be determined from
5 In other words, the pre-image point p lies on the epipole between the first and second
camera frames.
6 If these two lines are parallel, we take the intersection in the plane at infinity.
176
i, j = 1, 2, 3,
i, j, k = 1, 2, 3,
then they determine a unique pre-image p E3 except when the three lines
associated to the three images are collinear.
The two cases (which are essentially one case) in which the bilinear conPSfrag replacements
straints may become degenerate are shown in Figure 8.2. Figure 8.3 shows
the only case in which trilinear constraint may become degenerate. In simp?
x1
o1
x2
o2
x3
o3
Figure 8.3. If the three images and the three optical centers lie on a straight line,
any point on this line is a valid pre-image that satisfies all the constraints.
ple terms, bilinear fails for coplanar and trilinear fails for collinear. For
more than 3 views, in order to check the uniqueness of the pre-image, one
needs to apply the above lemma to every pairwise or triple-wise views. The
possible number of combinations of degenerate cases make it very hard to
draw any consistent conclusion. However, in terms of the rank condition on
the multiple view matrix, Lemma 8.1 can be generalized to multiple views
in a much more concise and unified way:
Theorem 8.3 (Uniqueness of the pre-image). Given m vectors on the
image planes with respect to m camera frames, they correspond to the same
point in the 3-D space if the rank of the Mp matrix relative to any of the
camera frames is 1. If its rank is 0, the point is determined up to the line
where all the camera centers must lie on.
Hence both the largest and smallest singular values of Mp have meaningful geometric interpretation: the smallest being zero is necessary for
7 Although there seem to be total of nine possible (i, j, k), there are in fact only three
different trilinear constraints due to the symmetry in the trilinear constraint equation.
177
1
1
i = 2, . . . , m,
= 0.
(8.16)
(8.17)
So the coefficient that relates the two columns of the Mp matrix is simply
the depth of the point p in 3-D space with respect to the center of the
first camera frame (the reference).8 Hence, from the Mp matrix alone, we
may know the distance from the point p to the camera center o1 . One can
further prove that for any two points of the same distance from o1 , there
always exist a set of camera frames for each point such that their images
give exactly the same Mp matrix.9 Hence we can interpret Mp as a map
from a point in 3-D space to a scalar
Mp : p R3 7 d R+ ,
where d = kp o1 k. This map is certainly surjective but not injective.
Points that may give rise to the same Mp matrix hence lie on a sphere S2
centered around the camera center o1 . The above scalar d is exactly the
radius of the sphere. We may summarize our discussion into the following
theorem:
Theorem 8.4 (Geometry of the multiple view matrix Mp ). The
matrix Mp associated to a point p in 3-D maps this point to a unique
scalar d. This map is surjective but not injective and two points may give
rise to the same Mp matrix if and only if they are of the same distance to
the center o1 of the reference camera frame. That distance is given by the
coefficients which relate the two columns of Mp .
Therefore, knowing Mp (i.e. the distance d), we know that p must lie on
a sphere of radius d = kp o1 k. Given three camera frames, we can choose
either camera center as the reference, then we essentially have three Mp
matrix for each point. Each Mp matrix gives a sphere around each camera
center in which the point p should stay. The intersection of two such spheres
in 3-D space is generically a circle, as shown in Figure 8.4. Then one would
8 Here we implicitly assume that the image surface is a sphere with radius 1. If it is
a plane instead, statements below should be interpreted accordingly.
9 We omit the detail of the proof here for simplicity.
178
S2
PSfrag replacements
S1
o1
o2
that there are some profound relationships between the Mp matrix for a
point and that for a line.
8.4.1 Correspondence
Notice that Mp R3(m1)2 being rank-deficient is equivalent to the
determinant of MpT Mp R22 being zero
det(MpT Mp ) = 0.
(8.18)
179
R3 indeed satisfy all the constraints that m images of a single 3-D preimage should, we only need to test if the above determinant is zero. A
more numerically robust algorithm would be:
Algorithm 8.1 (Multiple view matching test). Suppose the projection matrix associated to m camera frames are given. Then for given m
vectors x1 , . . . , xm R3 ,
1. Compute the matrix Mp R3(m1)2 according to (8.8).
8.4.2 Reconstruction
Now suppose that m images xi1 , . . . , xim of n points pi , i = 1, . . . , n are
given and we want to use them to estimate the unknown projection matrix
. The rank condition of the Mp matrix can be written as
ci R xi
ci T
x
x
2
2
1
2
2
ci
ci
x T x
R3 xi1
3
i 3 3
= 0 R3(m1)1 ,
(8.19)
. +
..
.
.
i T
xc
m m
i R xi
xc
m m 1
for proper i R, i = 1, . . . n.
ci we obFrom (8.1) we have ij xij = i1 Rj xi1 + Tj . Multiplying by x
j
c
i
i
i
i
i
tain x (R x + T / ) = 0. Therefore, = 1/ can be interpreted
j
as the inverse of the depth of point pi with respect to the first frame.
The set of equations in (8.19) is equivalent to finding vectors R~j =
[r11 , r12 , r13 , r21 , r22 , r23 , r31 , r32 , r33 ]T R9 and T~j = Tj R3 , j =
2, . . . , m, such that
c1 x
c1 x1 T
1 x
1
j
j
2 c2 c2
xj xj x21 T T~j
T~j
3n
Pj ~ = .
(8.20)
R~ = 0 R ,
..
Rj
.
j
.
.
cnj xn1 T
cnj x
n x
where A B is the Kronecker product of A and B. It can be shown that if
i s are known, the matrix Pj is of rank 11 if more than n 6 points in
general position are given. In that case, the kernel of Pj is unique, and so
is (Rj , Tj ).
Euclidean reconstruction
For simplicity, we here assume that the camera is perfectly calibrated.
Therefore, is A(t) = I, Ri is a rotation matrix in SO(3) and Ti is a trans-
180
lation vector in R3 . Given the first two images of (at least) eight points in
general configuration, T2 R3 and R2 SO(3) can be estimated using the
classic eight point algorithm. The equation given by the first row in (8.19)
ci T = x
ci R xi , whose least squares solution up to scale (recall
implies i x
2 2
2 2 1
that T2 is recovered up to scale from the eight point algorithm) is given by
i =
ci T )T x
ci R xi
(x
2 2
2 2 1
,
c
i
||x T ||2
i = 1, . . . , n.
(8.21)
2 2
sign(det(Uj VjT ))
p
Tj R3 ,
3
det(Sj )
(8.22)
(8.23)
In the presence of noise, solving for i using only the first two frames may
not necessarily be the best thing to do. Nevertheless, this arbitrary choice
of i allows us to compute all the motions (Rj , Tj ), j = 2, . . . , m. Given all
the motions, the least squares solution for i from (8.19) is given by
Pm ci
Tc
i
i
j=2 (xj Tj ) xj Rj x1
i
, i = 1, . . . , n.
(8.24)
=
Pm ci
||x T ||2
j=2
j j
Note that these i s are the same as those in (8.21) if m = 2. One can
then recompute the motion given this new i s, until the error in is small
enough.
We then have the following linear algorithm for multiple view motion
and structure estimation:
Algorithm 8.2 (Multiple view six-eight point algorithm). Given
m images xi1 , . . . , xim of n points pi , i = 1, . . . , n, estimate the motions
(Rj , Tj ), j = 2, . . . , m as follows:
1. Initialization: k = 0
181
(a) Compute (R2 , T2 ) using the eight point algorithm for the first
two views.
(b) Compute ik = i /1 from (8.21).
j , Tj ) as the eigenvector associated to the smallest
2. Compute (R
singular value of Pj , j = 2, . . . , m.
3. Compute (Rj , Tj ) from (8.22) and (8.23) for j = 2, . . . , m.
4. Compute the new i = ik+1 from (8.24). Normalize so that 1k+1 =
1.
5. If ||k k+1 || > , for a pre-specified > 0, then k = k + 1 and goto
2. Else stop.
The camera motion is then (Rj , Tj ), j = 2, . . . , m and the structure of the
points (with respect to the first camera frame) is given by the converged
depth scalar i1 = 1/i , i = 1, . . . , n. There are a few notes for the proposed
algorithm:
1. It makes use of all multilinear constraints simultaneously for motion
and structure estimation. (Rj , Tj )s seem to be estimated using pairwise views only but that is not exactly true. The computation of the
matrix Pj depends on all i each of which is in turn estimated from
the Mpi matrix involving all views. The reason to set 1k+1 = 1 is to
fix the universal scale. It is equivalent to putting the first point at a
distance of 1 to the first camera center.
2. It can be used either in a batch fashion or a recursive one: initializes with two views, recursively estimates camera motion and
automatically updates scene structure when new data arrive.
3. It may effectively eliminate the effect of occluded feature points - if a
point is occluded in a particular image, simply drop the corresponding
group of three rows in the Mp matrix without affecting the condition
on its rank.
Remark 8.3 (Euclidean, affine or projective reconstruction). We
must point out that, although the algorithm seems to be proposed for the
calibrated camera (Euclidean) case only, it works just the same if the camera
is weakly calibrated (affine) or uncalibrated (projective). The only change
needed is at the initialization step. In order to initialize i s, either an
Euclidean, affine or projective choice for (R2 , T2 ) needs to be specified. This
is where our knowledge of the camera calibration enters the picture. In the
worst case, i.e. when we have no knowledge on the camera at all, any choice
of a projective frame will do. In that case, the relative transformation among
camera frames and the scene structure are only recovered up to a projective
transformation. Once that is done, the rest of the algorithm runs exactly
the same.
182
8.5 Experiments
In this section, we show by simulations the performance of the proposed
algorithm. We compare motion and structure estimates with those of the
eight point algorithm for two views and then we compare the estimates as
a function of the number of frames.
8.5.1 Setup
The simulation parameters are as follows: number of trials is 1000, number
of feature points is 20, number of frames is 3 or 4 (unless we vary it on
purpose), field of view is 90o , depth variation is from 100 to 400 units of
focal length and image size is 500 500. Camera motions are specified
by their translation and rotation axes. For example, between a pair of
frames, the symbol XY means that the translation is along the X-axis and
rotation is along the Y -axis. If n such symbols are connected by hyphens,
it specifies a sequence of consecutive motions. The ratio of the magnitude
of translation ||T || and rotation , or simply the T /R ratio, is compared at
the center of the random cloud scattered in the truncated pyramid specified
by the given field of view and depth variation (see Figure 8.5). For each
simulation, the amount of rotation between frames is given by a proper
angle and the amount of translation is then automatically given by the
T /R ratio. We always choose the amount of total motion such that all
feature points will stay in the field of view for all frames. In all simulations,
independent Gaussian noise with std given in pixels is added
to each image
T
point. Error measure for rotation is arccos tr(RR2 )1 in degrees where
is an estimate of the true R. Error measure for translation is the angle
R
between T and T in degrees where T is an estimate of the true T .
Z
Points
Depth
Variation
T/R = ||T||/r
r
Field of View
Camera Center
XY
8.5. Experiments
183
2
1.5
1
0.5
0
0
1
2
3
4
5
Noise level in pixels for image size 500x500
2
1.5
1
0.5
0
0
1
2
3
4
5
Noise level in pixels for image size 500x500
Two frame
Multiframe
Two frame
Multiframe
6
0
1
2
3
4
5
Noise level in pixels for image size 500x500
Motion estimates between frames 3 and 2
Two frame
Multiframe
0
1
2
3
4
5
Noise level in pixels for image size 500x500
Figure 8.6. Motion estimate error comparison between 8 point algorithm and
multiple view linear algorithm.
Structure error
300
350
250
200
150
100
50
0
0.7
0.8
0.9
1
1.1
1.2
Relative scale ||p || / ||p ||
13
12
1.3
Two frame
Multiframe
12
10
8
6
4
2
0
1
2
3
4
Noise level in pixels for image size 500x500
Figure 8.7. Relative scale and structure estimate error comparison. Motion is
XX-Y Y and relative scale is 1.
184
1.1
1.05
0.95
0.9
4
5
Number of frames
3.6
3.4
3.2
2.8
1.05
0.95
0.9
4
5
Number of frames
1.1
Structure estimate error [%]
4
5
Number of frames
4
5
Number of frames
Figure 8.8. Estimation error as a function of the number of frames. Noise level is
3 pixels and relative scale is 1.
8.5. Experiments
185
Frame 1
Frame 2
52
+30
100
39
38
+
200
50 42
23
+ 35
54
+
++
+ 48
+
300
25
15
400
16
+40
+
100
200
+ ++ +
24
+
10
+
300
+ 48
+
+
300
26 13
17+ +
33
+
+11
600
100
34
52
400
50 42
23
++ 35
54
++
300
400
3
49
+
9 37
7
12
+547
+
+ +++
+
36
32
27
+
+
++
+
34
22
4 45
+
26 13
43
2
+
+
+
41
17+ +
29 + 44
33
++
+
+ 14
6+ 20 2819
+11
+ + ++ +
+
24
16
+
+40
10
100
200
300
400
50 35
23
+ +42
+
+
54
52
500
39
38
+
200
31
51
+
300
15
18
26 13
400
17+ +
33
+
+11
+
25
48
3
49
+
9 37
7
+547
12
+
+ +++
+
36
32
27
++
+
34
22
+
4 45
43 2+
+
+
41
29 + 44
++
+ 14
6+ 20 2819
+ +
+
16
+40
600
53
+21
+
46
+
15
600
8
25
+ 48
+
18
500
+30
+
100
8
39
38
+
53
21
++
46
+
+
Frame 4
+30
+
200
++
++
31
51
+
300
Frame 3
100
+
+
41
+
+ 14
6+ 20 2819
+ + ++ +
24
16
+
+40
10
200
25
3
49
+
9 37
7
12
+547
+
+ 32+++
+
36
+ 27
++
22 43 4 45
2
44
29
15
18
500
39
38
+
200
400
400
3
53
49
+
9 37
+21
7
12
+
+547
++
46
+
+ 27
+
+
36
32 +
+
+
++
+
31
34
51
22 43 4 45
26 13
+
++
2+ +
+
+
17
+
44
41
33
29 +
+
+
+11
+ 14 +
6+ 20 2819
18
50 42
23
++ 35
54
++
52
+30
+
100
100
200
53
21
++
46
+
31
51
+
+ +
24
+
10
+
300
400
500
600
186
Translation Estimates
9
Nonlinear Multiview
Linear Two Views
Linear Multiview
1.8
1.6
7
6
Error [degrees]
Error [degrees]
1.4
1.2
1
0.8
5
4
3
0.6
0.4
0.2
0
12
23
Frame pairs
34
90
40
80
35
70
30
60
25
23
Frame pairs
34
50
20
40
15
30
10
12
45
Error [%]
Error [%]
20
5
0
Nonlinear Multiview
Linear Two Views
Linear Multiview
10
Nonlinear
Two View
Multiview
Nonlinear
Two View
Multiview
Figure 8.10. Motion and structure estimates for indoor rectilinear motion image
sequence
8.6 Summary
8.7 Exercises
1. Rectilinear motion
Use the rank condition of the Mp matrix to show that if the camera
is calibrated and only translating on a straight line, then the relative
translational scale between frames cannot be recovered from pairwise
bilinear constraints, but it can be recovered from trilinear constraints.
(Hint: use Ri = I to simplify the constraints.)
2. Pure rotational motion
In fact the rank condition on the matrix Mp is so fundamental that
it works for the pure rotational motion case too. Please show that
if there is no translation, the rank condition on the matrix Mp is
equivalent to the constraints that we may get in the case of pure
rotational motion
xbi Ri x1 = 0.
(8.25)
(Hint: you need to convince yourself that in this case the rank of Np
is no more than m + 2.)
3. Points in the plane at infinity
Show that the multiple view matrix Mp associated to m images
of a point p in the plane at infinity satisfies the rank condition:
rank(Mp ) 1. (Hint: the homogeneous coordinates for a point p
at infinity are of the form X = [X, Y, Z, 0]T R4 .)
Chapter 9
Geometry and reconstruction from line
features
Our discussion so far has been based on the assumption that point correspondence is available in a number m 2 of images. However, as we
have seen in Chapter 4, image correspondence is subject to the aperture
problem, that causes the correspondence to be undetermined along the direction of brightness edges. One may therefore consider matching edges,
rather than individual points. In particular, man-made scenes are abundant with straight lines features that can be matched in different views
using a variety of techniques. Therefore, assuming that we are given correspondence between lines in several images, how can we exploit them to
recover camera pose as well as the 3-D structure of the environment? In
order to answer this question, we need to study the geometry of line correspondences in multiple views. Fortunately, as we see in this chapter, most
of the concept described in the last chapter for the case of points carry
through with lines. Therefore, we follow the derivation in Chapter 8 to
derive all the independent constraints between line correspondences.
R3 ,
188
R3 ,
This is illustrated in Figure 9.1. We will call l the co-image of the line
L
v
X
PSfrag replacements
l
o
l
Figure 9.1. Image of a line L. The intersection of P and the image plane represents
the physical projection of L. But it is more conviniently represented by a normal
vector l of the plane P.
189
Chapter 11. Using this notation, a co-image l(t) = [a(t), b(t), c(t)]T R3
of a L E3 taken by a moving camera satisfies the following equation
l(t)T x(t) = l(t)T A(t)P g(t)X = 0,
(9.1)
where x(t) is the image of the point X L at time t, A(t) SL(3) is the
camera calibration matrix (at time t), P = [I, 0] R34 is the constant
projection matrix and g(t) SE(3) is the coordinate transformation from
the world frame to the camera frame at time t. Note that l(t) is the normal
vector of the plane formed by the optical center o and the line L (see Figure
9.1). Since the above equation holds for any point X on the line L, it yields
l(t)T A(t)P g(t)Xo = l(t)T A(t)P g(t)v = 0.
(9.2)
i = A(ti )P g(ti ).
(9.3)
The matrix i is then a 3 4 matrix which relates the ith image of the line
L to its world coordinates (Xo , v) by
lTi i Xo = lTi i v = 0
(9.4)
Nl v = 0,
(9.5)
lT1 1
lT2 2
Nl = . Rm4 .
.
.
(9.6)
rank(Nl ) 2.
(9.7)
lTm m
...,
m = [Rm , Tm ]
R34 ,
(9.8)
190
l1
0
lT2 R2 lT2 T2
m4
Nl = .
.
(9.9)
.. R
..
.
lTm Rm lTm Tm
0
lT2 R2 lb1
Nl0 =
..
.
lT Rm lb1
m
lT1 l1
T
l2 R 2 l1
..
.
lTm Rm l1
0
lT2 T2
..
.
Rm5 .
(9.11)
lTm Tm
rank(Nl0 ) = rank(Nl ) 2.
T b
l3 R3 l1 lT3 T3
(m1)4
(9.12)
Ml =
..
..
R
.
.
lT Rm lb1 lT Tm
m
has rank no more than one. We call the matrix Ml the multiple view matrix
associated to a line feature L. Hence we have proved the following:
Theorem 9.1 (Rank deficiency equivalence condition). For the two
matrices Nl and Ml , we have
rank(Ml ) = rank(Nl ) 1 1.
(9.13)
191
The rank condition certainly implies all trilinear constraints among the
given m images of the line. To see this more explicitly, notice that for
rank(Ml ) 1, it is necessary for any pair of row vectors of Ml to be
linearly dependent. This gives us the well-known trilinear constraints
lTj Tj lTi Ri lb1 lTi Ti lTj Rj lb1 = 0
(9.14)
among the first, ith and j th images. Hence the constraint rank(Ml ) 1 is a
natural generalization of the trilinear constraint (for 3 views) to arbitrary
m views since when m = 3 it is equivalent to the trilinear constraint for
lines, except for some rare degenerate cases, e.g., lTi Ti = 0 for some i.
It is easy to see from the rank of matrix Ml that there will be no more
independent relationship among either pairwise or quadruple-wise image
lines. Trilinear constraints are the only non-trivial ones for all the images
that correspond to a single line in 3-D. So far we have essentially proved the
following facts regarding multilinear constraints among multiple images of
a line:
Theorem 9.2 (Linear relationships among multiple views of a
line). For any given m images corresponding to a line L in E3 relative
to m camera frames, the rank-deficient matrix Ml implies that any algebraic constraints among the m images can be reduced to only those among 3
images at a time, characterized by the so-called trilinear constraints (9.14).
Although trilinear constraints are necessary for the rank of matrix Ml
(hence Nl ) to be 1, they are not sufficient. For equation (9.14) to be nontrivial, it is required that the entry lTi Ti in the involved rows of Ml need
to be non-zero. This is not always true for certain degenerate cases - such
as the when line is parallel to the translational direction. The rank condition on Ml , therefore, provides a more complete account of the constraints
among multiple images and it avoids artificial degeneracies that could be
introduced by using algebraic equations.
192
L
P3
PSfrag replacements
P1
P2
l3
l1
o3
l2
o1
l3
l1
l2
o2
Figure 9.2. Three planes extended from the three images l1 , l2 , l3 intersect at one
line L in 3-D, the pre-image of l1 , l2 , l3 .
Like we did for points, we now ask the inverse question: if the three vectors
l1 , l2 , l3 satisfy the trilinear constraints, are they necessarily images of some
single line in 3-D, the so-called pre-image? As shown in Figure 9.3, we
denote the planes formed by the optical center oi of the ith frame and
image line li to be Pi , i = 1, 2, 3. Denote the intersection line between P1
and P2 as L2 and the intersection line between P1 and P3 as L3 . As pointed
out at the beginning, li R3 is also the normal vector of Pi . Then without
loss of generality, we can assume that li is the unit normal vector of plane
Pi , i = 1, 2, 3 and the trilinear constraint still holds. Thus, lTi Ti = di
is the distance from o1 to the plane Pi and (lTi Ri )T = RiT li is the unit
normal vector of Pi expressed in the 1th frame. Furthermore, (lTi Ri lb1 )T is
a vector parallel to Li with length being sin(i ), where i [0, ] is the
angle between the planes P1 and Pi , i = 2, 3. Therefore, in the general
case, the trilinear constraint implies two things: first, as (lT2 R2 lb1 )T is linear
dependent of (lT3 R3 lb1 )T , L2 is parallel to L3 . Secondly, as d2 sin(3 ) =
193
L3
L2
PSfrag replacements
P3
P1
l1
o1
l3
P2
l3
o3
l2
l1
l2
o2
Figure 9.3. Three planes extended from the three images l1 , l2 , l3 intersect at lines
L2 and L3 , which actually coincides.
d3 sin(2 ), the distance from o1 to L2 is the same with the distance from o1
to L3 . They then imply that L2 coincides with L3 , or in other words, the line
L in 3-D space is uniquely determined. However, we have degeneracy when
P1 coincides with P2 and P3 . In that case, d2 = d3 = 0 and (lT2 R2 lb1 )T =
(lT3 R3 lb1 )T = 031 . There are infinite number of lines in P1 = P2 = P3 that
generate the same set of images l1 , l2 and l3 . The case when P1 coincides
with only P2 or only P3 is more tricky. For example, if P1 coincides with
P2 but not with P3 , then d2 = 0 and (lT2 R2 lb1 )T = 031 . However, if we
re-index the images (frames), then we still can obtain a non-trivial trilinear
constraint, from which L can be deduced as the intersection line between
P1 and P3 . So we have in fact proved the following fact:
Lemma 9.1 (Properties of trilinear constraints for lines). Given
three camera frames with distinct optical centers and any three vectors
l1 , l2 , l3 R3 that represents three lines on each image plane. If the three
image lines satisfy trilinear constraints
lTj Tji lTk Rki lbi lTk Tki lTj Rji lbi = 0,
i, j, k = 1, 2, 3
194
normals all coincide with each other. For this only degenerate case, the
matrix Ml becomes zero. 1
For more than 3 views, in order to check the uniqueness of the pre-image,
one needs to apply the above lemma to every triplet of views. The possible
combinations of degenerate cases make it very hard to draw any consistent
conclusion. However, in terms of the rank condition on the multiple view
matrix, the lemma can be generalized to multiple views in a much more
concise and unified form:
Theorem 9.3 (Uniqueness of the pre-image). Given m vectors in R3
representing lines on the image planes with respect to m camera frames,
they correspond to the same line in the 3-D space if the rank of the matrix
Ml relative to any of the camera frames is 1. If its rank is 0 (i.e. the
matrix Ml is zero), then the line is determined up to a plane on which all
the camera centers must lie.
The proof follows directly from Theorem 9.1. So the case that the line
in 3-D shares the same plane as all the centers of the camera frames is the
only degenerate case when one will not be able to determine the exact 3-D
location of the line from its multiple images. As long as the camera frames
have distinct centers, the set of lines that are coplanar with these centers is
of only zero measure. Hence, roughly speaking, trilinear constraints among
images of a line rarely fail in practice. Even the rectilinear motion will not
pose a problem as long as enough number of lines in 3-D are observed, the
same as in the point case. On the other hand, the theorem also suggests
a criteria to tell from the matrix Ml when a degenerate configuration is
present: exactly when the largest singular value of the Ml matrix (with
respect to any camera frame) is close to zero.
195
point in 3-D. This point is not necessarily on the image plane though. If
we call this point p, then the vector defined by p o1 is obviously parallel
to the original line L in 3-D. We also call it v. Therefore the so-defined Ml
matrix in fact gives us a map from lines in 3-D to vectors in 3-D:
Ml : L R3 7 v R3 .
This map is certainly not injective but surjective. That is, from Ml matrix
alone, one will not be able to recover the exact 3-D location of the line L,
but it will give us most of the information that we need to know about its
images. Moreover, the vector gives the direction of the line, and the norm
kvk is exactly the ratio:
kvk = sin(i )/di ,
i = 2, . . . , m.
Roughly speaking, the farther away is the line L from o1 , the smaller this
ratio is. In fact, one can show that the family of parallel lines in 3-D that
all map to the same vector v form a cylinder centered around o1 . The
above ratio is exactly the inverse of the radius of such a cylinder. We may
summarize our discussion into the following theorem:
Theorem 9.4 (Geometry of the matrix Ml ). The matrix Ml associated
to a line L in 3-D maps this line to a unique vector v in 3-D. This map is
surjective and two lines are mapped to the same vector if and only if they
are: 1. parallel to each other and 2. of the same distance to the center o of
the reference camera frame. That distance is exactly 1/kvk.
Therefore, knowing Ml (i.e. the vector v), we know that L must lie on a
circle of radius r = 1/kvk, as shown in Figure 9.4. So if we can obtain the Ml
r
PSfrag replacements
Figure 9.4. An equivalent family of parallel lines that give the same Ml matrix.
matrix for L with respect to another camera center, we get two families of
parallel lines lying on two cylinders. In general, these two cylinders intersect
at two lines, unless the camera centers all lie on a line parallel to the line
196
L. Hence, two Ml matrices (for the same line in 3-D) with respect to two
distinct camera centers determine the line L up to two solutions. One would
imagine that, in general, a third Ml matrix respect to a third camera center
will then uniquely determine the 3-D location of the line L.
c2 R2 x1
c2 T2
x
x
..
.. R3(m1)2 .
Mp =
.
.
xc
m R m x1
xc
m Tm
The apparent similarities between both the rank conditions and the forms
of Ml and Mp are expected due to the geometric duality between lines and
point. In the 3-D space, a point can be uniquely determined by two lines,
while a line can be uniquely determined by two points. So our question
now is that if the two set of rank conditions can be derived from each other
based on the geometric duality.
First we show that we can derive the rank condition for line from rank
condition for point. Let p1 and p2 be two distinct points on a line L in the
3-D space. Denote the m images of p1 , p2 under m views to be x11 , . . . , x1m
and x21 , . . . , x2m , respectively. Hence, the ith (i = 1, . . . , m) view of L can be
c2 x1 or lb = x2 x1T x1 x2T , i = 1, . . . , m. From the rank
expressed as li = x
i
i i
i i
i i
1
c1 R x1 = x
c1 T and x
c2 R x2 = x
c2 T
deficiency of Mp and Mp2 , we have x
i i 1
i i
i i 1
i i
for some , R and i = 1, . . . , m. This gives
2T
1T
c2
lTi Ri lb1 = x1T
i xi Ti (x1 x1 ),
c2
lTi Ti = x1T
i xi Ti .
(9.15)
which means that each row of Ml is spanned by the same vector [(x2T
1
T
4
x1T
1 ), 1] R . Therefore,
rank(Mp1 ) 1 & rank(Mp2 ) 1
rank(Ml ) 1.
197
already reveal some interesting duality between the camera center o and the
point p in 3-D. For example, if the camera center moves on a straight line
(the rectilinear motion), from the Mp matrix associated to a point p, the 3D location of the point p can only be determined up to a circle. On the other
hand, if the camera center is fixed but the point p can move on a straight
line L, from the Ml matrix associated to the line L, the 3-D location of this
line can only be determined up to a circle too. Mathematically speaking,
matrices Mp and Ml define an equivalence relationship for points and lines
in the 3-D space, respectively. They both group points and lines according
to their distance to the center of the reference camera frame. Numerically,
the sensitivity for Mp and Ml as rank-deficient matrices depends very much
on such distance. Roughly speaking, the farther away is the point or line,
the more sensitive the matrix is to noise or disturbance. Hence, in practice,
one may view Mp and Ml as a natural metric for the quality of the
measurements associated to the multiple images of a 3-D point or line.
9.4.1 Matching
In general we like to test if given m vectors l1 , . . . , lm R3 with known
indeed satisfy all the constraints that m images of a single 3-D line (preimage) should. There are two ways of performing this. One is based on the
fact that Ml matrix has rank no more than 1. Another is based on the
geometric interpretation of Ml matrix.
2 Such
198
Since rank(Ml ) 1, the 4 4 matrix MlT Ml also has rank no more than
1. Hence, ideally with given l1 , . . . , lm and , the eigenvalues of MlT Ml
should satisfy 1 0, 2 = 3 = 4 = 0. A more numerically robust
algorithm would be:
Algorithm 9.1 (Multiple view matching test). Suppose the projection matrix associated to m camera frames are given. Then for given m
vectors l1 , . . . , lm R3 ,
1. Compute the matrix Ml Rm4 according to (9.12).
2. Compute second largest eigenvalue 2 of the 4 4 matrix MlT Ml ;
3. If 2 1 for some pre-fixed threshold 1 , then we say the m image
vectors match. Otherwise discard it as outliers.
9.4.2 Reconstruction
Now suppose that m( 3) images lj1 , . . . , ljm of n lines Lj , j = 1, . . . , n
are given and we want to use them to estimate the unknown projection
matrix . From the rank condition of the Ml matrix, we know that the
kernel of Ml should have 3 dimensions. Denote Mlj to be the Ml matrix
T
b
for the j th line, j = 1, . . . , n. Since lji Ri lj1 is a vector parallel to Lj ,
j
j
3
then for any two linear
j vectors u , w R lying in the plane
independent
j
j
perpendicular to L , u0 and w0 are two base vectors in the kernel of
j
j
j
Mlj . Let 1 ker(Mlj ) be a vector orthogonal to both u0 and w0 .
Then
j = [j1 , j2 , j3 ]T = k j ubj wj ,
(9.16)
T
T
l1i
1 (lb11 l1i )
.
..
..
nT
T
li
n T (lbn1 lni )
T
T
0
u1 (lb11 l1i )
..
Pi = ...
(9.17)
R3n12 ,
.
T
nT
n
n
u (lb1 li )
T
T
0
w1 (lb11 l1i )
..
..
.
nT
nT
n
b
0 w (l l )
1
199
(9.18)
~ i = [r11 , r21 , r31 , r12 , r22 , r32 , r13 , r23 , r33 ]T , with rkl
where T~i = Ti and R
th
being the (kl) entry of Ri , k, l = 1, 2, 3. It can be shown that if j , uj , wj s
are known, the matrix Pi is of rank 11 if more than n 12 lines in general
position are given. In that case, the kernel of Pi is unique, and so is (Ri , Ti ).
Hence if we know j for each j = 1, . . . , n, then we can estimate (Ri , Ti )
by performing singular value decomposition (SVD) on Pi . In practice, the
initial estimations of j , uj , wj s may be noisy and only depend on local
data, so we can use iteration method. First, we get estimates for (Ri , Ti )s,
then we use the estimated motions to re-calculate j s, and iterate in this
way till the algorithm converges. However, currently there are several ways
to update j , uj , wj s in each iteration. Initialization of j , uj , wj and the
overall algorithm are described in the next section.
Euclidean reconstruction
Here we illustrate the overall algorithm for the case when the camera is
assumed to be calibrated. That is A(t) = I hence the projection matrix
i = [Ri , Ti ] represents actual Euclidean transformation between camera
frames. We will discuss later on what happens if this assumption is violated.
Given the first three views of (at least) 12 lines in the 3-D space. Let
j
Ml3
be the matrix formed by the first 3 columns of Mlj associated to the
j
j
first three views. From Ml3
, we can already estimate uj , wj ker(Ml3
),
j
j
where u , w are defined above. So after we obtain an initial estimation of
i , Ti ) for i = 2, 3, we can calculate M
j and compute the SVD of it such
(R
l3
j
= U2 S2 V T , then pick u
that M
j , w
j being the 2nd and 3rd columns of
2
l3
V2 respectively. Then the estimation for j is
j = k j (
uj w
j ) for some
j
k R to be determined. k can be estimated by least squares method such
that
kj =
Pm
jT
j T bj bj j
w
)
i=2 (li Ti )(li Ri l1 u
Pm j T bj bj j 2 .
w
)
i=2 (li Ri l1 u
(9.19)
200
(9.20)
(9.21)
The j , uj , wj can then be re-estimated from the full matrix Mlj computed
from the motion estimates (Ri , Ti ).
Based on above arguments, we can summarize our algorithm as the
following:
Algorithm 9.2 (Structure and motion from multiple views of
lines). Given a set of m images lj1 , . . . , ljm of n lines Lj , j = 1, . . . , n,
we can estimate the motions (Ri , Ti ), i = 2, . . . , m as the following:
1. Initialization
(a) Set step counter s = 0.
(b) Compute initial estimates for (Ri , Ti ), i = 2, 3 for the first three
views.
(c) Compute initial estimates of js , ujs , wsj for j = 1, . . . , n from the
first three views.
i , Ti ) from
2. Compute Pi from (9.17) for i = 2, . . . , m, and obtain (R
the eigenvector associated to its smallest singular value.
i , Ti ) from (9.20) and (9.21), i = 2, . . . , m.
3. Compute (R
j
based on the M j matrix calculated by using
4. Compute js+1 , ujs+1 , ws+1
j
1
i , Ti ). Normalize
(R
s+1 so that ks+1 k = 1.
9.5. Experiments
201
9.5 Experiments
In this section, we show in simulation the performance of the above algorithm. We test it on two different scenarios: sensitivity of motion and
structure estimates with respect to the level of noise on the line measurements; and the effect of number of frames on motion and structure
estimates.
9.5.1 Setup
The simulation parameters are as follows: number of trials is 500, number
of feature lines is typically 20, number of frames is 4 (unless we vary it
on purpose), field of view is 90o , depth variation is from 100 to 400 units
of focal length and image size is 500 500. Camera motions are specified
by their translation and rotation axes. For example, between a pair of
frames, the symbol XY means that the translation is along the X-axis and
rotation is along the Y -axis. If n such symbols are connected by hyphens,
it specifies a sequence of consecutive motions. The ratio of the magnitude
of translation ||T || and rotation , or simply the T /R ratio, is compared at
the center of the random cloud scattered in the truncated pyramid specified
by the given field of view and depth variation. For each simulation, the
amount of rotation between frames is given by a proper angle and the
amount of translation is then automatically given by the T /R ratio. We
202
always choose the amount of total motion such that all feature (lines) will
stay in the field of view for all frames. In all simulations, each image line is
perturbed by independent noise witha standarddeviation given in degrees.
T
is an
Error measure for rotation is arccos tr(RR2 )1 in degrees where R
estimate of the true R. Error measure for translation is the angle between
T and T in degrees where T is an estimate of the true T . Error measure for
structure is approximately measured by the angle between and
where
9.5. Experiments
motion between frame 3 and 1
2
Rotation error in degrees
3
2
1
0
0.5
1
1.5
2
noise level in degrees
1.5
1
0.5
0
2.5
0.5
1
1.5
2
noise level in degrees
2.5
203
0.5
1
1.5
2
noise level in degrees
2.5
1.5
1
0.5
0
0.5
1
1.5
2
noise level in degrees
2.5
Figure 9.5. Motion estimate error from four views. The number of trials is 500.
T /R ratio is 1. The amount of rotation between frames is 20o .
4
3.5
2.5
1.5
0.5
0.5
1
1.5
noise level in degrees
2.5
Figure 9.6. Structure estimate error from four views. The number of trials is 500.
T /R ratio is 1. The amount of rotation between frames is 20o .
204
expected. In fact, the error converges to the given level of noise on the line
measurements. However, according to Figure 9.7, motion estimates do not
necessarily improve with an increase number of frames since the number
of motion parameters increase linearly with the number of frames. In fact,
we see that after a few frames, additional frames will have little effect on
the previous motion estimates.
8
10
Number of frames
1.5
12
12
8
10
Number of frames
3.5
2.5
2
1.5
1
0.5
0
8
10
Number of frames
12
8
10
Number of frames
12
Figure 9.7. Motion (between frames 4-1 and 5-1) estimate error versus number
of frames for an orbital motion. The number of trials is 500.
4.5
3.5
2.5
1.5
7
8
Number of Frames
10
11
12
Figure 9.8. Structure estimate error versus number of frames for an orbital
motion. The number of trials is 500.
9.6. Summary
205
9.6 Summary
9.7 Exercises
Chapter 10
Geometry and reconstruction with
incidence relations
The previous two chapters described the constraints involving corresponding points or lines. In this chapter we address the case when correspondence
of points and lines is available. This results in a formal condition that has
all rank conditions discussed in previous chapters as special cases. We also
discuss how to enforce the constraint that all geometric features lie on a
plane, which is a special case of considerable practical importance.
(10.1)
207
direction of the line. Then the image of the line L at time t is simply
the collection of images x(t) of all points p L. It is clear that all such
x(t)s span a plane in R3 , as shown in Figure 10.1. The projection of the
line is simply the intersection of this plane with the image plane. Usually
it is more convenient to specify a plane by its normal vector, denoted as
l(t) = [a(t), b(t), c(t)]T R3 . We call this vector l the co-image of the line
L, which satisfies the following equation:
l(t)T x(t) = l(t)T A(t)P g(t)X = 0
(10.3)
for any image x(t) of any point p on the line L. Notice that the column (or
row) vectors of the matrix bl span the (physical) image of the line L, i.e. they
span the plane which is orthogonal to l.1 This is illustrated in Figure 10.1.
Similarly, if x is the image of a point p, its co-image (the plane orthogonal
b. So in this chapter, we will use the following
to x) is given by the matrix x
notation:
Image of a point: x R3 ,
Co-image of a point:
bl R33 , Co-image of a line:
Image of a line:
b R33 ,
x
(10.4)
l R3 .
xi = x(ti ),
li = l(ti ),
i = A(ti )P g(ti ).
(10.5)
th
(10.6)
and the ith co-image of the line L to its world coordinates (Xo , v) by:
lTi i Xo = lTi i v = 0,
(10.7)
(10.8)
for i = 1, . . . , m.
In the above systems of equations, the unknowns, i s, X, Xo and v,
which encode the information about 3-D location of the point p or the line
1 In fact, there is some redundancy using b
l to describe the plane: the three column (or
row) vectors in bl are not linearly independent. They only span a two-dimensional space.
However, we here use it anyway to simplify the notation.
208
PSfrag replacements
L
p
x1
lb1
v
po
lb2
l2
x2
o2
(R, T )
o1
l1
Figure 10.1. Images of a point p on a line L. Planes extended from the image
lines lb1 , lb2 should intersect at the line L in 3-D. Lines extended from the image
points x1 , x2 intersect at the point p.
209
T
l1
0
R(3m+1)3m .
Dl0 = lb1
(10.9)
0
0 I3(m1)3(m1)
We obtain:
lT1
lb1
0
D l Np =
R2
.
..
Rm
0
0
0
b
l1 x1
T2
..
.
Tm
0
0
0
..
.
x2
..
.
..
.
..
0
0
0
..
.
R(3m+1)(m+4) . (10.10)
0
xm
I44 0
0
0
0
c2
x
0
0
T
0
x2
0
0
..
..
..
R(4m+1)(3m+1) .
.
.
.
0
0
(10.11)
Dp0 =
.
.
.
.
..
..
..
..
0
0
0
0 xc
m
T
0
0
0 xm
It is direct to verify that the rank of the resulting matrix Dp0 Dl0 Np is related
to the rank of its sub-matrix:
T
l1
0
x
c2 T2
x
c2 R2
(3m2)4
(10.12)
Np00 = .
.. R
..
.
xc
m Rm
xc
m Tm
l1
0
lb1
0
0
1
R45
(10.13)
210
yields:
lT1 l1
c2 R2 l1
x
..
.
xc
m R m l1
0
c2 R2 lb1
x
..
.
b
xc
m R m l1
c2 R2 lb1
x
.
..
Mlp =
.
b
xc
R
m m l1
0
c2 T2
x
..
.
xc
m Tm
c2 T2
x
..
.
xc
m Tm
R(3m2)5 .
R[3(m1)]4
(10.14)
(10.15)
the multiple view matrix for a point included by a line. Its rank is related
to that of Np by the expression:
rank(Mlp ) rank(Np ) (m + 1) 2.
(10.16)
(10.17)
The rank condition on Mlp then captures the incidence condition in which
a line with co-image l1 includes a point p in E3 with images x2 , . . . , xm
with respect to m 1 camera frames. What kind of equations does this
rank condition give rise to? Without loss of generality, let us look at the
sub-matrix
#
"
c2 T2
c2 R2 lb1 x
x
R64 .
(10.18)
Mlp =
c3 T3
c3 R3 lb1 x
x
The rank condition implies that every 3 3 sub-matrix of Mlp has determinant zero. The first three rows are automatically of rank 2 and so are
the last three rows and the first three columns.2 Hence any non-trivial determinant consists of two and only two rows from either the first three or
last three rows, and consists of two and only two columns from the first
three columns. For example, if we choose two rows from the first three,
it is direct to see that such a determinant is a polynomial of degree 5 on
(the entries of) l1 , x2 and x3 which is quadratic on l1 and also quadratic in
2 This
211
c2 T2
c2 R2 x1
x
x
x
c3 T3
x
c3 R3 x1
(10.19)
Mp =
R[3(m1)]2 .
..
..
.
.
xc
xc
m Tm
m R m x1
This matrix should have a rank of no more than 1. Since the point belongs
to all the lines, we have
lTi xi = 0, i = 1, . . . , m.
Hence li range(xbi ). That is, there exist ui R3 such that li = xbi T ui , i =
1, . . . , m. Since rank(Mp ) 1, so should be the rank of following matrix
(whose rows are simply linear combinations of those of Mp ):
T
T
c2 R2 x1
c2 T2
u2 x
uT2 x
l2 R2 x1 lT2 T2
..
..
..
.. R(m1)2(10.20)
.
=
.
.
.
.
uTm xc
m R m x1
uTm xc
m Tm
lTm Rm x1
lTm Tm
3 Hence it is a constraint that is not in any of the multilinear (or multifocal) constraint
lists previously studied in the computer vision literature. For a list of these multilinear
constraints, see [HZ00].
212
lT2 R2 x1
.
..
=
.
lTm Rm x1
lT2 T2
..
.
lTm Tm
R(m1)2
(10.21)
the multiple view matrix for lines intersecting at a point. Then we have
proved:
Lemma 10.2 (Rank condition with intersection). Given the image
of a point p and multiple images of lines intersecting at p, the multiple view
matrix Mpl defined above satisfies:
0 rank(Mpl ) 1.
(10.22)
The above rank condition on the matrix Mpl captures the incidence condition between a point and lines which intersect at the same point. It is
worth noting that for the rank condition to be true, it is necessary that
all 2 2 minors of Mpl be zero, i.e. the following constraints hold among
arbitrary triplets of given images:
[lTi Ri x1 ][lTj Tj ] [lTi Ti ][lTj Rj x1 ] = 0
R,
i, j = 2, . . . , m.
(10.23)
l2 R2 lb1 lT2 T2
.
..
.. R(m1)4 .
l =
M
(10.24)
.
.
T
T
b
l R m l1 l Tm
m
However, here li in the ith view can be the co-image of any line out of a
family which intersect at a point p. It is then easy to derive from Lemma
10.1 and the proof of Lemma 10.2 that:
Corollary 10.2 (An intersecting family of lines). Given m imasge
l defined above
of a family lines intersecting at a 3D point p, the matrix M
satisfies:
l ) 2.
1 rank(M
(10.25)
213
We leave the proof as exercise to the reader. The reader also should
be aware of the difference between this corollary and the Theorem 9.1:
Here l1 , . . . , lm do not have to be the coimages of a single 3D line, but
instead each of them may randomly correspond to any line in an intersecting
family. It is easy to see that the above rank condition imposes non-trivial
constraints among up to four images, which conforms to what Corollary
10.1 says.
(10.26)
(10.27)
1 x1 0
0
l1 1
..
lT2 2
2 0 x 2 . . .
.
.
.
.
and Nl =
..
..
..
Np =
.. .
...
.
.
0
.
lTm m
m 0
0 xm
0
0
214
lT2 R2 lb1
c2 R2 x1
c2 T2
x
x
x
c3 T3
x
lT3 R3 lb1
c3 R3 x1
..
..
..
Mp =
, Ml =
.
.
.
xc
c
lTm Rm lb1
m R m x1 x
m Tm
1 x1
2
1 lb1
become:
lT2 T2
lT3 T3
..
. .
lTm Tm
2
(10.28)
Then the rank condition rank(Mp ) 1 implies not only the multilinear
constraints as before, but also the following equations (by considering the
sub-matrix consisting of the ith group of three rows of Mp and its last row)
xbi Ti 1 x1 xbi Ri x1 2 = 0,
i = 2, . . . , m.
(10.29)
When the plane P does not cross the camera center o1 , i.e. 2 =
6 0, these
equations give exactly the well-known homography constraints for planar
image feature points
1
xbi Ri 2 Ti 1 x1 = 0
(10.30)
between the 1st and the ith views. The matrix Hi = Ri 12 Ti 1 in
the equation is the well-known homography matrix between the two views.
Similarly from the rank condition on Ml , we can obtain homography in
terms of line features
1
T
1
li Ri 2 Ti lb1 = 0
(10.31)
215
PSfrag replacements
p1
L2
L3
p
p3
L1
Figure 10.2. Duality between a set of three points and three lines in a plane P:
the rank conditions associated to p1 , p2 , p3 are exactly equivalent those associated
to L1 , L2 , L3 .
PSfrag replacements
p3
p1
L2
p
p4
L1
p2
Figure 10.3. p1 , . . . , p4 are four points on the same plane P, if and only if the two
associated (virtual) lines L1 and L2 intersect at a (virtual) point p5 .
216
c3 x4 ,
l2 = x
x5 = lb1 l2 .
(10.32)
1T
l2 R2 x51 l1T
2 T2
5
2T
l2T
2 R 2 x1 l2 T2
.
.
pl =
..
..
M
(10.33)
R[2(m1)]2
1T
5
1T
lm R m x1 lm Tm
5
l2T
l2T
m R m x1
m Tm
33
Di = xbi R
or lTi R3 .
217
D2 R2 D1 D2 T2
. D3 R 3 D1 D3 T 3
(10.34)
M =
.. .
..
.
.
Dm
R m D1 Dm
Tm
Depending on the particular choice for each Di or D1 , the dimension of the
matrix M may vary. But no matter what the choice for each individual Di
or D1 is, M will always be a valid matrix of certain dimension. Then after
elimination of the unknowns i s, X, Xo and v in the system of equations
in (10.6) and (10.7), we obtain:
Theorem 10.1 (Multiple view rank conditions). Consider a point p
lying on a line L and its images x1 , . . . , xm R3 and co-images l1 , . . .,
lm R3 relative to m camera frames whose relative configuration is given
by (Ri , Ti ) for i = 2, . . . , m. Then for any choice of Di and D1 in the
definition of the multiple view matrix M , the rank of the resulting M belongs
to and only to one of the following two cases:
1. If D1 = lb1 and Di = xbi for some i 2, then:
1 rank(M ) 2.
(10.35)
0 rank(M ) 1.
(10.36)
2. Otherwise:
fact, there are many equivalent matrix representations for Di and Di . We choose
xbi and lbi here because they are the simplest forms representing the orthogonal subspaces
of xi and li and also linear in xi and li respectively.
218
the multilinear constraints in the literature. The instantiations corresponding to case 1, as we have seen before, give rise to constraints that are not
necessarily multilinear. The completeness of Theorem 10.1 also implies that
there would be no multilinear relationship among quadruple-wise views,
even in the mixed feature scenario.5 Therefore, quadrilinear constraints
and quadrilinear tensors are clearly redundant for multiple view analysis.
However, as we mentioned before (in Section 10.2.1), nonlinear constraints
may still exist up to four views.
As we have demonstrated in the previous sections, other incidence conditions such as all features belonging to a plane in E3 can also be expressed
in terms of the same set of rank conditions:
Corollary 10.5 (Planar features and homography). Suppose that all
features are in a plane and coordinates X of any point on it satisfy the
equation X = 0 for some vector T R4 . Denote = [ 1 , 2 ] with
T
1 R3 , 2 R. Then simply append the matrix
1
D1 2
(10.37)
to the matrix M in its formal definition (10.34). The rank condition on the
new M remains exactly the same as in Theorem 10.1.
The rank condition on the new matrix M then implies all constraints
among multiple images of these planar features, as well as the special
constraint previously known as homography. Of course, the above representation is not intrinsic it depends on parameters that describe the 3-D
location of the plane. Following the process in Section 10.2.3, the above
corollary reduces to rank conditions on matrices of the type in (10.33),
which in turn, give multi-quadratic constraints on the images involved.
Remark 10.1 (Features at infinity). In Theorem 10.1, if the point p
and the line L are in the plane at infinity P3 \ E3 , the rank condition on the
multiple view matrix M is just the same. Hence the rank condition extends
to multiple view geometry of the entire projective space P3 , and it does not
discriminate between Euclidean, affine or projective space. In fact, the rank
conditions are invariant under a much larger group of transformations: It
allows any transformation that preserves all the incidence relations among
a given set of features, these transformations do not even have to be linear.
Remark 10.2 (Occlusion). If any feature is occluded in a particular view,
the corresponding row (or a group of rows) is simply omitted from M ; or if
only the point is occluded but not the entire line(s) on which the point lies,
then simply replace the missing image of the point by the corresponding
5 In fact, this is quite expected: While the rank condition geometrically corresponds
to the incidence condition that lines intersect at a point and that planes intersect at a
line, incidence condition that three-dimensional subspaces intersect at a plane is a void
condition in E3 .
219
[c
x2 R2 x1 ][c
x3 T3 ]T [c
x3 R3 x1 ][c
x 2 T 2 ]T = 0
R33 .
(10.39)
R3 .
(10.41)
R.
(10.43)
220
Then rank(M ) 2 implies that all 3 3 sub-matrices of M have determinant zero. These equations give the line-point-point type of constraints on
three images.
Let us first consider the more general case, i.e. case 2 in Theorem 10.1,
when rank(M ) 1. We will discuss case 1 afterwards. For case 2, there are
only two interesting sub-cases, depending on the value of the rank of M ,
are:
a) rank(M ) = 1,
and b) rank(M ) = 0.
(10.45)
Case a), when rank of M is 1, corresponds to the generic case for which,
regardless of the particular choice of features in M , all these features satisfy
the incidence condition. More explicitly, all the point features (if at least
2 are present in M ) come from a unique 3-D point p, all the lines features
(if at least 3 are present in M ) come from a unique 3-D line L, and if both
point and line features are present, the point p then must lie on the line L
in 3-D. This is illustrated in Figure 10.4.
What happens if there are not enough point or line features present in
M ? If, for example, there is only one point feature x1 present in Mpl , then
the rank of Mpl being 1 implies that the line L is uniquely determined by
221
PSfrag replacements
L
p
P3
P1
l1
o3
x2
l1
l3
l3
l2
x1
o1
x3
P2
l2
o2
Figure 10.4. Generic configuration for the case rank(M ) = 1. Planes extended
from the (co-)images l1 , l2 , l3 intersect at one line L in 3-D. Lines extended from
the images x1 , x2 , x3 intersect at one point p. p must lie on L.
222
T
l2 R2 x1 lT2 T2
lT3 R3 x1 lT3 T3
T
92
T
(10.46)
M =
l4 R 4 x1 l4 T4 R .
x
c5 R5 x1 x
c5 T5
c6 R6 x1 x
c6 T6
x
PSfrag replacements
The geometric
configuration of the point and line features corresponding
L
to the condition rank(M ) = 0 is illustrated in Figure 10.5. But notice that,
among all the possible solutions for L and p, if they both happen to be at
infinity, the incidence condition then would hold for all the images involved.
o2
o3
lb2
lb4
o4
x1
o1
o5
lb3
x6
x5
o6
and b) rank(M ) = 1.
(10.48)
Case a), when the rank of M is 2, corresponds to the generic cases for
which the incidence condition among the features is effective. The essential
c2 R2 lb1
c2 T2
x
x
..
.. R[3(m1)]4 .
Mlp =
.
.
b c
xc
m R m l1 x
m Tm
223
(10.49)
L
PSfrag replacements
xm
om
P
p
l1
x1
x3
x2
o1
o3
l1
o2
in some M , the point p then must lie on every plane associated to every
line feature. Hence p must be on the intersection of these planes. Notice
that, even in this case, adding more rows of line features to M will not
be enough to uniquely determine L in 3-D. This is because the incidence
condition for multiple line features requires that the rank of the associated
matrix Ml be 1.6 If we only require rank 2 for the overall matrix M , the
line can be determined only up to a family of lines intersections of the
planes associated to all the line features which all should intersect at the
same point p.
6 Matrix
224
c2 R2 lb1 x
c2 T2
x
cR lb x
c3 T3
x
M = 3 3 b1
(10.50)
R104 .
x
c4 R4 l1 x
c4 T4
lT5 R5 lb1 lT5 T5
PSfrag replacements
x2
o2
L
x3
o3
o1
lb1
x4
o4
lb5
o5
1: a
Notice that in this case, we no longer have an incidence condition for the
point features. However, one can view them as if they intersected at a point
p at infinity. In general, we no longer have the incidence condition between
the point p and the line L, unless both the point p and line L are in the
plane at infinity in the first place. But since the rank condition is effective
for line features, the incidence condition for all the line features still holds.
225
To summarize the above discussion, we see that the rank conditions indeed allow us to carry out meaningful global geometric analysis on the
relationship among multiple point and line features for arbitrarily many
views. There is no doubt that this extends existing methods based on multifocal tensors that can only be used for analyzing up to three views at a
time. Since there is yet no systematic way to extend triple-wise analysis
to multiple views, the multiple view matrix seems to be a more natural
tool for multiple-view analysis. Notice that the rank conditions implies all
previously known multilinear constraints, but multilinear constraints do
not necessarily imply the rank conditions. This is because the use of algebraic equations may introduce certain artificial degeneracies that make a
global analysis much more complicated and sometimes even intractable. On
the other hand, the rank conditions have no problem in characterizing all
the geometrically meaningful degeneracies in a multiple-view mixed-feature
scenario. All the degenerate cases simply correspond to a further drop of
rank for the multiple view matrix.
226
L1
p
L3
L2
PSfrag replacements
om
o1
o2
Figure 10.8. A standard cube. The three edges L1 , L2 , L3 intersect at the corner p.
The coordinate frames indicate that m images are taken at these vantage points.
and L3j , j = 1, . . . , 8. From the m images of the cube, we have the multiple
view matrix M j associated to pj :
c
xj2 R2 xj1
1jT
l2 R2 xj1
2jT
l2 R2 xj1
3jT
l2 R2 xj1
..
Mj =
.
c
xjm Rm xj1
1jT
lm Rm xj1
l2jT Rm xj
m
1
j
l3jT
m R m x1
c
xj2 T2
l1jT
2 T2
2jT
l2 T2
l3jT
2 T2
.. R[6(m1)]2
.
c
j
xm Tm
l1jT
m Tm
2jT
lm Tm
l3jT
m Tm
(10.51)
where xji R3 means the image of the j th corner in the ith view and
3
th
lkj
edge associated to the j th corner in
i R means the image of the k
th
the i view. Theorem 10.1 says that rank(M j ) = 1. One can verify that
j = [j1 , 1]T R2 is in the kernel of M j . In addition to the multiple images
xj1 , xj2 , xj3 of the j th corner pj itself, the extra rows associated to the line
features lkj
i , k = 1, 2, 3, i = 1, . . . , m also help to determine the depth scale
j1 .
227
We can already see one advantage of the rank condition: It can simultaneously handle multiple incidence conditions associated to the same feature.7
In principle, by using (10.33) or Corollary 10.5, one can further take into account that the four vertices and edges on each face are coplanar.8 Since such
incidence conditions and relations among points and lines occur frequently
in practice, especially for man-made objects, such as buildings and houses,
the use of multiple view matrix for mixed features is going to improve the
quality of the overall reconstruction by explicitly taking into account all
incidence relations among features of various types.
In order to estimate j we need to know the matrix M j , i.e. we need
to know the motion (R2 , T2 ), . . . , (Rm , Tm ). From the geometric meaning
of j = [j1 , 1]T , j can be solved already if we know only the motion
(R2 , T2 ) between the first two views, which can be initially estimated using
the standard 8 point algorithm. Knowing j s, the equations:
M j j = 0,
j = 1, . . . , 8
(10.52)
become linear in (R2 , T2 ), . . . , (Rm , Tm ). We can use them to solve for the
motions (again). Define the vectors:
~ i = [r11 , r12 , r13 , r21 , r22 , r23 , r31 , r32 , r33 ]T R9
R
and T~i = Ti R3 , i = 2, . . . , m. Solving (10.52) is then equivalent to finding
the solution to the following equations for i = 2, . . . , m:
c1 x1 T x
c1
11 x
1
i
i
1 11T
1 li x11 T l11T
i
T
1 l21T x1 l21T
1
i
1i
1 31T
1 T 31T
1 li x1 li
~
~
R
..
..
Ri = 0 R48 ,
(10.53)
Pi ~ i =
.
.
T~i
Ti
8 c8
T
c8
1 x x81
x
i
i
8 l18T x8 T l18T
1i
1
i
8 28T
T
1 li x81 l28T
i
8 T 38T
81 l38T
x
l
1
i
i
Let Ti R and Ri R
be the (unique) solution of (10.53) in matrix
form. Such a solution can be obtained numerically as the eigenvector of Pi
7 In fact, any algorithm extracting point feature essentially relies on exploiting local
incidence condition on multiple edge features. The structure of the M matrix simply
reveals a similar fact within a larger scale.
8 We here omit doing it for simplicity.
228
i = Ui Si V T be the SVD of
associated to the smallest singular value. Let R
i
3
Ti
Ri
(10.54)
(10.55)
We then have the following linear algorithm for motion and structure
estimation from three views of a cube:
Algorithm 10.1 (Motion and structure from mixed features).
Given m (= 3) images xj1 , . . ., xjm of n(= 8) points pj , j = 1, . . . , n (as
the corners of a cube), and the images lkj
i , k = 1, 2, 3 of the three edges
intersecting at pj , estimate the motions (Ri , Ti ), i = 2, . . . , m as follows:
1. Initialization: s = 0
(a) Compute (R2 , T2 ) using the 8 point algorithm for the first two
views [LH81].
(b) Compute js = [j1 /11 , 1]T where j1 is the depth of the j th point
relative to the first camera frame.
i , Ti ) as the eigenvector associated to the smallest
2. Compute (R
singular value of Pi , i = 2, . . . , m.
3. Compute (Ri , Ti ) from (10.54) and (10.55) for i = 2, . . . , m.
4. Compute new js+1 = j from (10.52). Normalize so that 11,s+1 = 1.
5. If ||s s+1 || > , for a pre-specified > 0, then s = s + 1 and goto
2. Else stop.
The camera motion is then the converged (Ri , Ti ), i = 2, . . . , m and
the structure of the points (with respect to the first camera frame) is the
converged depth scalar j1 , j = 1, . . . , n.
We have a few comments on the proposed algorithm:
1. The reason to set 11,s+1 = 1 is to fix the universal scale. It is equivalent to putting the first point at a relative distance of 1 to the first
camera center.
2. Although the algorithm is based on the cube, considers only three
views, and utilizes only one type of multiple view matrix, it can be
easily generalized to any other objects and arbitrarily many views
whenever incidence conditions among a set of point features and line
features are present. One may also use the rank conditions on different
types of multiple view matrices provided by Theorem 10.1.
3. The above algorithm is a straightforward modification of the algorithm proposed for the pure point case given in Chapter 8. All the
measurements of line features directly contribute to the estimation of
229
the camera motion and the structure of the points. Throughout the
algorithm, there is no need to initialize or estimate the 3-D structure
of lines.
The reader must be aware that the above algorithm is only conceptual (and
naive in many ways). It by no means suggests that the resulting algorithm
would work better in practice than some existing algorithms in every situation. The reason is, there are many possible ways to impose the rank
conditions and each of them, although maybe algebraically equivalent, can
have dramatically different numerical stability and sensitivity. To make the
situation even worse, under different conditions (e.g., long baseline or short
baseline), correctly imposing these rank conditions does require different
numerical recipes.9 A systematic characterization of numerical stability of
the rank conditions remains largely open at this point. It is certainly the
next logical step for future research.
is true even for the standard 8 point algorithm in the two view case.
line features can be measured more reliably than point features, lower noise
level is added to them in simulations.
10 Since
230
0.5
0.5
View 2
View 1
0.5
1
1
0.5
0.5
0
x
0.5
1
1
0.5
0.5
0.5
0.5
0.5
View 4
View 3
1
0.5
1
1
0
x
0.5
0.5
0
x
0.5
1
1
0.5
0
x
Figure 10.9. Four views of four 3-D cubes in (normalized) image coordinates. The
circle and the dotted lines are the original images, the dots and the solid lines
are the noisy observations under 5 pixels noise on point features and 0.5 degrees
noise on line features.
We run the algorithm for 1000 trials with the noise level on the point
features from 0 pixel to 5 pixels and a corresponding noise level on the
line features from 0 degree to 1 degrees. Relative to the given amount of
translation, 5 pixels noise is rather high because we do want to compare
how all the algorithms perform over a large range of noise levels. The
results of the motion estimate error and structure estimate error are given
in Figure 10.10 and 10.11 respectively. The Point feature only algorithm
essentially uses the multiple view matrix M in (10.52) without all the
10.7. Summary
231
rows associated to the line features; and the Mixed features algorithm
uses essentially the same M as in (10.52). Both algorithms are initialized
by the standard 8 point algorithm. The Mixed features algorithm gives
a significant improvement in all the estimates as a result of the use of
both point and line features in the recovery. Also notice that, at a high
noise levels, even though the 8 point algorithm gives rather off initialization
values, the two iterative algorithms manage to converge back to reasonable
estimates.
Motion 12
Motion 14
7
8point algorithm
Point feature only
Mixed features
6
5
4
3
2
1
0
1
2
3
4
Point feature noises (pixels)
6
5
4
3
2
1
0
Motion 12
80
Translation Error in degrees
Motion 14
70
60
50
40
30
20
10
0
1
2
3
4
Point feature noises (pixels)
1
2
3
4
Point feature noises (pixels)
60
40
20
1
2
3
4
Point feature noises (pixels)
Figure 10.10. Motion estimates error versus level of noises. Motion x-y means
the estimate for the motion between image frames x and y. Since the results are
very much similar, we only plotted Motion 1-2 and Motion 1-4.
10.7 Summary
10.8 Exercises
232
150
|| ||/|| || (%)
200
100
50
0.5
1.5
2
2.5
3
Point feature noises (pixels)
3.5
4.5
Figure 10.11. Structure estimates error versus level of noises. Here 0 represents
the true structure and the estimated.
Chapter 11
Multiple view geometry in high
dimensional space
234
geometric constraints that govern multiple k-dimensional images of hyperplanes in an n-dimensional space. We show that these constraints can again
be captured through the rank condition techniques. Such techniques also
simultaneously capture all types of geometric relationships among hyperplanes, such as inclusion, intersection and restriction. The importance of
this study is at least two-fold: 1. In many applications, objects involved are
indeed multi-faceted (polygonal) and their shape can be well modeled (or
approximated) as a combination of hyperplanes; 2. In some cases, there is
not enough information or it is not necessary to locate the exact location
of points in a high-dimensional space and instead, we may still be interested in identifying them up to some hyperplane. As we will point out later,
for the special case n = 3, the results naturally reduce to what is known
so far for points, lines and planes in preceding chapters. Since reconstruction is not the main focus of this chapter, the reader is referred to other
chapters for how to use these constraints to develop algorithms for various
reconstruction purposes.
235
(11.1)
(11.2)
(11.3)
236
R3
p
PSfrag replacements
K
o
o0
I
(11.4)
R T
R(n+1)(n+1) belongs to GL(n + 1, R). The ho0 1
mogeneous
coordinate
of the image of p is x = [x1 , . . . , xk , 1]T . Let
P 0
P0 = T
R(k+1)(n+1) . Then we have
~
c
where g =
x = P 0 X.
(11.5)
i = 1, . . . , m,
(11.6)
237
i = 1, . . . , m.
(11.7)
2 The above equations are associated to only one feature point. We usually need
images of multiple points in order to recover all the unknowns.
3 We later will see this can also be relaxed when we allow to depend on time.
238
A0 X = 0,
where A0 =
(11.8)
A
R(n+1)(nd) and the equation of its image S q is
bT
T
A0I x = 0,
(11.9)
AI
R(k+1)(kq) .
bTI
Note the set of X Rn+1 satisfying (11.8) is a subspace H d+1 of dimension d + 1 and the original hyperplane H d is the intersection of this
set with the hyperplane in Rn+1 : Xn+1 = 1. Hence, instead of studying
the object in Rn , we can study the corresponding object of one dimension higher in Rn+1 . Consequently, the image S q becomes a subspace
T
S q+1 = {x Rk+1 : A0I x = 0} Rk+1 . The space Rk+1 can be interpreted as the space I k+1 formed by the camera center o and the image
plane I k , which we denote as image space. So S q is the intersection of
S q+1 with the image plane I k in the image space I k+1 . Define the set
T
F = {X Rn+1 : A0I P 0 X = 0} to be the preimage of S q+1 , then
H d+1 F and F is the largest set in Rn+1 that can give rise to the
same image of H d+1 . If we consider the I k+1 as a subspace of Rn+1 , then
S q+1 is the intersection of the image space with F or H d+1 . Since it is
apparent that the space summation F + I k+1 = Rn+1 , the dimension of F
is
where A0I =
239
image are not necessarily unique and may even have different dimensions.
More specifically, such hyperplanes may be of dimensions d = q, q+1, . . ., or
q +(nk 1).4 Nonetheless, if the camera and the object hyperplane are of
general configuration, typically the dimension of the image of a hyperplane
H d is
q = min{d, k}.
(11.11)
R3
L
F
PSfrag replacements
bl
l
Figure 11.2. The image s = bl and co-image s = l of a line L in R3 . The three
column vectors of the image bl span the two-dimensional subspace (in the image
space I 2+1 ) formed by o and L (in this case it coincides with the preimage F ); and
the column vector of the co-image l is the vector orthogonal to that subspace.
240
241
(Theorem 8.1). Here we first study how to generalize this result to points
in Rn .
Take multiple, say m, images of the same point p with respect to multiple
camera frames. We obtain a family of equations
xi i = P 0 gi X = i X,
i = 1, . . . , m.
(11.13)
(k+1)(n+1)
i Ti R
To simplify notation, let i = P gi = R
, where
i R(k+1)(k+1) and Ti R(k+1)(nk) .5 In the above equations, except
R
for the xi s, everything else is unknown and subject to recovery. However,
solving for the i s, i s and X simultaneously from such equations is
extremely difficult at least nonlinear. A traditional way to simplify the
task is to decouple the recovery of the matrices i s from that of i s and X.
Then the remaining relationship would be between the images xi s and the
camera configuration i s only. Since such a relationship does not involve
knowledge on the location of the hyperplane in Rn , it is referred to as
intrinsic. It constitutes as the basis for any image-only based techniques.
For that purpose, let us rewrite the system of equations (11.13) in a single
matrix form
0
x1
0
..
.
0
0
x2
..
.
..
.
0
0
..
.
xm
I ~ = X
1
1
2 2
.. = .. X.
. .
(11.14)
(11.15)
1 x1 0
0
..
2 0 x 2 . . .
.
.
R[m(k+1)](m+n+1) .
Np = [, I] = .
.
.
.
..
..
..
..
0
m 0
0 xm
(11.16)
Apparently, the rank of Np is no less than k + m (assuming the generic
configuration such that xi is not zero for all 1 i m). It is clear that
.
there exists v = [XT , ~T ]T Rm+n+1 in the null space of Np . Then the
equation (11.14) is equivalent to
m + k rank(Np ) m + n
5 Only
i = Ri and Ti = Ti .
in the special case that k = n 1, we have R
(11.17)
242
x1
0
0
T
(x
0
1 ) 0
..
.. R[m(k+1)][m(k+1)]
..
..
Dp = .
.
.
.
0
0
xTm
T
0
0 (x
m)
1
xT1 R
T
(x
)
1 R1
..
.
D p Np =
..
xT R
m m
T
(xm ) Rm
xT1 T1
T
(x
1 ) T1
..
.
..
.
xTm Tm
T
(x
m ) Tm
xT1 x1
0
0
0
0
..
.
0
0
0
0
0
0
0
0
0
..
.
0
0
0
0
0
.
xTm xm
0
T
1 (x )T T1
(x1 ) R
1
T
T
(x
2 ) R2 (x2 ) T2
(11.18)
Cp =
R(mk)(n+1)
..
..
.
.
T
T
(x
m ) Rm (xm ) Tm
by the following equation
k rank(Cp ) = rank(Np ) m n.
(11.19)
This rank condition can be further reduced if we choose the world coordinate frame to be the first camera frame, i.e. [R1 , T1 ] = [I, 0]. Then
1 = I(k+1)(k+1) and T1 = 0(k+1)(nk) . Note that, algebraically, we do
R
not lose any generality in doing so. Now the matrix Cp becomes
T
(x
0
1 )
T
T
(x
2 ) R2 (x2 ) T2
Cp =
(11.20)
R(mk)(n+1) .
..
..
.
.
T
T
(x
m ) Rm (xm ) Tm
Cp Dp1
0
T
(x
)
2 R 2 x1
=
..
.
T
(x
)
R m x1
m
T
(x
1 ) x1
T
(x2 ) R2 x
1
..
.
m x
(x )T R
m
0
T
(x
2 ) T2
..
.
T
(x ) Tm
243
R(mk)(n+1) .
(11.21)
T
Note that Dp1 is of full rank n + 1 and (x
)
x
is
of
rank
k.
Let
us
define
1
1
the so-called multiple view matrix Mp associated to a point p to be
T
2 x1 (x )T T2
(x2 ) R
2
.
..
..
[(m1)k](nk+1)
.
(11.22)
Mp =
R
.
.
m x1 (x )T Tm
(x )T R
m
(11.23)
244
(11.24)
.
s = [v1 , . . . , vkq ] R(k+1)(kq) .
(11.25)
or its co-image
d
Now given m images s1 , . . . , sm (hence co-images s
1 , . . . , sm ) of H relative
to m camera frames, let qi + 1 be the dimension of si , i = 1, . . . , m, note for
generic motions, qi = d, i = 1, . . . , m. We then obtain the following matrix
T
1 (s )T T1
(s1 ) R
1
T
T
(s
2 ) R2 (s2 ) T2
(11.26)
Ch =
R[m(kd)](n+1) .
..
..
.
.
m (s )T Tm
(s )T R
m
Let S d+1 be the subspace spanned by H d and the origin of the world
coordinate frame. Then it is immediately seen that S d+1 in its homogeneous
representation is the kernel of Ch . In other words, the rank of Ch is
(k d) rank(Ch ) (n d).
(11.27)
T
(s
0
1)
T
T
(s
2 ) R2 (s2 ) T2
Ch =
(11.28)
R[m(kd)](n+1) .
..
..
.
.
T
T
(s
m ) Rm (sm ) Tm
T
0
(s
1 ) s1
2 s
(s2 ) R2 s1 (s2 ) R
1
Ch Dh1 =
..
..
.
.
T
T
(s
m ) Rm s1 (sm ) Rm s1
0
T
(s
2 ) T2
..
.
T
(sm ) Tm
245
R[m(kd)](n+1) .
(11.29)
T
Note that Dp1 R(n+1)(n+1) is of full rank n + 1 and (s
1 ) s1
R(kd)(kd) is of rank k d. Let us define the so-called multiple view
matrix Mh associated to the hyperplane H d to be
T
T
2 s1 (s
(s2 ) R
2 ) T2
.
..
..
Mh =
R[(m1)(kd)](nk+d+1) . (11.30)
.
.
T
T
(s
m ) Rm s1 (sm ) Tm
(11.31)
(11.32)
It is easy to see that, in the case that the hyperplane is a point, we have
d = 0. Hence 0 rank(Mh ) (nk). So the result (11.23) given in Section
11.2 is trivially implied by this theorem. The significance of this theorem
is that the rank of the matrix Mh defined above does not depend on the
dimension of the hyperplane H d . Its rank only depends on the difference
between the dimension of the ambient space Rn and that of the image
plane Rk . Therefore, in practice, if many features of a scene are exploited
for reconstruction purposes, it is possible to design algorithms that do not
discriminate different features.
Example 11.2 (Trilinear constraints on three views of a line).
When n = 3, k = 2, m = 3 and d = 1, it gives the known result on three
views of a line:
"
#
lT2 R2 lb1 lT2 T2
rank T b T
1
(11.33)
l3 R 3 l1 l3 T3
where li R3 , i = 1, 2, 3 are the three coimages of the line, and (R2 , T2 )
and (R3 , T2 ) are the relative motions among the three views.
246
(11.34)
T
(s
0
1)
T
T
(s
2 ) R2 (s2 ) T2
Ch1 =
R[m(kd1 )](n+1) .
..
..
.
.
T
T
(sm ) Rm (sm ) Tm
(11.35)
Multiplying
k d1 rank(Ch1 ) n d1 .
Ch1
1
Ch1 Dh1
Let
(11.36)
0
T
(s
)
2 R 2 x1
=
..
.
T
R m x1
(s
)
m
T
(s
2 ) R 2 x1
.
..
Mh1 =
.
T
(s
m ) R m x1
T
(s
1 ) x1
T
R 2 x
(s
)
1
2
..
.
T
R m x
(s
)
1
m
T
(s
2 ) T2
..
.
T
(sm ) Tm
0
T
(s
2 ) T2
..
.
T
(sm ) Tm
R[m(kd1 )](n+1) .
R[(m1)(kd1 )](nk+d2 ) .
(11.37)
(11.38)
T
Since (s
1 ) x1 is of rank at least k d1 , we must have
0 rank(Mh1 ) rank(Ch1 ) (k d1 ) n k.
(11.39)
T
(x
0
1)
T
T
(x
2 ) R2 (x2 ) T2
Ch2 =
R[m(kd2 )](n+1) .
..
..
.
.
T
T
(x ) Rm (x ) Tm
(11.40)
247
(11.41)
k d1 rank(Ch2 ) rank(Ch2 ) n d2 .
(11.43)
T
(s
0
1)
T
T
(x
2 ) R2 (x2 ) T2
2
(11.42)
Ch =
R[m(kd2 )(d1 d2 )](n+1)
..
..
.
.
T
T
(x
m ) Rm (xm ) Tm
satisfies
T
(x
2 ) R2 s1
2
Ch2 Dh1
=
..
.
T
(x
m ) Rm s1
Let
T
(x
2 ) R2 s1
.
..
Mh2 =
.
m s1
(x )T R
m
Since
T
(s
1 ) s1
T
(s
1 ) s1
T
(x2 ) R2 s1
..
.
T
(x
m ) Rm s1
T
(x
2 ) T2
..
.
(x )T Tm
0
T
(x
2 ) T2
.
..
.
T
(xm ) Tm
(11.44)
(11.46)
T
2 D1 (D )T T2
(D2 ) R
2
T
T
. (D3 ) R3 D1 (D3 ) T3
(11.47)
M =
..
..
.
.
T
T
) Tm
(Dm
) Rm D1 (Dm
where the Di s and Di s stand for images and co-images of some hyperplanes respectively. The actual values of Di s and Di s are to be determined
in context.
Therefore, we have proved the following statement which further
generalizes Theorem 11.1:
Theorem 11.2 (Rank condition with inclusion). Consider a $d_2$-dimensional hyperplane $H^{d_2}$ belonging to a $d_1$-dimensional hyperplane $H^{d_1}$ in $\mathbb{R}^n$. The $m$ images $x_i \in \mathbb{R}^{(k+1)\times(d_2+1)}$ of $H^{d_2}$ and $m$ images $s_i \in \mathbb{R}^{(k+1)\times(d_1+1)}$ of $H^{d_1}$ relative to the $i$th ($i = 1,\dots,m$) camera frame are given, and the relative transformation from the 1st camera frame to the $i$th is $(R_i, T_i)$ with $R_i \in \mathbb{R}^{n\times n}$ and $T_i \in \mathbb{R}^n$, $i = 2,\dots,m$. Let the $D_i$'s and $D_i^\perp$'s in the multiple view matrix $M$ take the following values:
$$D_i^\perp \doteq x_i^\perp \in \mathbb{R}^{(k+1)\times(k-d_2)} \ \ \text{or} \ \ s_i^\perp \in \mathbb{R}^{(k+1)\times(k-d_1)}, \qquad D_i \doteq x_i \in \mathbb{R}^{(k+1)\times(d_2+1)} \ \ \text{or} \ \ s_i \in \mathbb{R}^{(k+1)\times(d_1+1)}. \qquad (11.48)$$
Then for all possible instances of the matrix $M$, we have two cases:
1. Case one: if $D_1 = s_1$ and $D_i^\perp = x_i^\perp$ for some $i \ge 2$, then
$$(d_1 - d_2) \le \mathrm{rank}(M) \le (n-k) + (d_1 - d_2);$$
2. Case two: otherwise,
$$0 \le \mathrm{rank}(M) \le n - k.$$
Since $\mathrm{rank}(AB) \ge \mathrm{rank}(A) + \mathrm{rank}(B) - n$ for all $A \in \mathbb{R}^{m\times n}$, $B \in \mathbb{R}^{n\times k}$, we have $\mathrm{rank}\big[(x_i^\perp)^T R_i s_1\big] \ge (d_1 - d_2)$, which gives the lower bound in Case one.
For instance, for three views of a point lying on a line in $\mathbb{R}^3$ ($n = 3$, $k = 2$), where $x_i \in \mathbb{R}^3$ and $l_i \in \mathbb{R}^3$ are the images and coimages of the point and the line respectively, the rank condition is equivalent to what is known as the line-point-point relation.
The above theorem can also be easily generalized to any set of cascading hyperplanes
$$H^{d_l} \subseteq H^{d_{l-1}} \subseteq \cdots \subseteq H^{d_1}$$
for some $l \in \mathbb{Z}^+$. We omit the details for simplicity.
For a hyperplane $H^{d_3}$ contained in the intersection of two hyperplanes $H^{d_1}$ and $H^{d_2}$, with images $x_i$, $r_i$, and $s_i$ respectively, let the entries of the multiple view matrix $M$ take the values
$$D_i^\perp \doteq x_i^\perp,\ r_i^\perp,\ \text{or}\ s_i^\perp, \qquad D_1 \doteq x_1. \qquad (11.50)$$
Then we have
$$0 \le \mathrm{rank}(M) \le (n - k).$$
It is straightforward to generalize this theorem to a hyperplane which lies in the intersection of more than two hyperplanes,
$$H^{d_l} \subseteq H^{d_{l-1}} \cap \cdots \cap H^{d_1}$$
for some $l \in \mathbb{Z}^+$: simply allow the matrix $D_i^\perp$ to take the co-image of any hyperplane involved, and the rank condition on $M$ always holds.
Suppose now that the hyperplanes of interest are restricted to lie on a fixed ambient hyperplane $H^d$ in $\mathbb{R}^n$, described (relative to the first camera frame) by a matrix
$$\Pi = [\,\Pi_1 \ \ \Pi_2\,] \in \mathbb{R}^{(n-d)\times(n+1)}, \quad \text{with} \quad \Pi_1 \in \mathbb{R}^{(n-d)\times(k+1)},\ \ \Pi_2 \in \mathbb{R}^{(n-d)\times(n-k)}. \qquad (11.52)$$
The corresponding extended multiple view matrix appends one more block row to $M$:
$$M = \begin{bmatrix} (D_2^\perp)^T R_2 D_1 & (D_2^\perp)^T T_2 \\ (D_3^\perp)^T R_3 D_1 & (D_3^\perp)^T T_3 \\ \vdots & \vdots \\ (D_m^\perp)^T R_m D_1 & (D_m^\perp)^T T_m \\ \Pi_1 D_1 & \Pi_2 \end{bmatrix}, \qquad (11.53)$$
where the $D_i$'s and $D_i^\perp$'s stand for images and co-images of some hyperplanes, respectively; their actual values are to be determined in context.
Then the rank conditions given in Theorems 11.2 and 11.3 become the following two corollaries, respectively:

Corollary 11.1 (Rank condition with restricted inclusion). Consider hyperplanes $H^{d_2} \subseteq H^{d_1} \subseteq H^d$ in $\mathbb{R}^n$, where $d_1, d_2 < k$ and $d < n$. The hyperplane $H^d$ is described by a matrix $\Pi \in \mathbb{R}^{(n-d)\times(n+1)}$ (relative to the first camera frame). Then the rank conditions in Theorem 11.2 do not change at all when the multiple view matrix there is replaced by the extended multiple view matrix.

Corollary 11.2 (Rank condition with restricted intersection). Consider hyperplanes $H^{d_1}$, $H^{d_2}$, and $H^{d_3}$ with $H^{d_3} \subseteq H^{d_1} \cap H^{d_2}$, which all belong to an ambient hyperplane $H^d$, where $d_1, d_2, d_3 < k$ and $d < n$. The hyperplane $H^d$ is described by a matrix $\Pi \in \mathbb{R}^{(n-d)\times(n+1)}$ (relative to the first camera frame). Then the rank conditions in Theorem 11.3 do not change at all when the multiple view matrix there is replaced by the extended multiple view matrix.

One can easily prove these corollaries by modifying the proofs of the theorems accordingly.
Example 11.4 (Planar homography). Consider the standard perspective projection from $\mathbb{R}^3$ to $\mathbb{R}^2$. Suppose, as illustrated in Figure 11.3, all the points of interest lie on a two-dimensional plane $\Pi \subset \mathbb{R}^3$.

Figure 11.3. Two views of points on a plane $\Pi \subset \mathbb{R}^3$. For a point $p \in \Pi$, its two (homogeneous) images are the two vectors $x_1$ and $x_2$ with respect to the two vantage points $o_1$ and $o_2$, related by the motion $(R, T)$.
then $m \ge \left\lceil \frac{n-k}{k} \right\rceil + 1$. When $n = 3$, $k = 2$, this number reduces to $2$, which is the well-known case of 3-D stereopsis.
For a perspective projection from $\mathbb{R}^n$ to $\mathbb{R}^{n-1}$, the rank condition leaves only two possibilities:
$$\mathrm{rank}(M_h) = 1 \quad \text{or} \quad \mathrm{rank}(M_h) = 0. \qquad (11.56)$$
Suppose its rank is 1 for an $M_h$ associated to some $d$-dimensional hyperplane $H^d$ in $\mathbb{R}^n$. Let $s_i$, $i = 1,\dots,m$, be the $m$ images of $H^d$ relative to $m$ camera frames specified by the relative transformations $(R_i, T_i)$, $i = 2,\dots,m$ (from the 1st frame to the $i$th). Consider the matrix $C_h$ defined in Section 11.3.1,
$$C_h = \begin{bmatrix} (s_1^\perp)^T & 0 \\ (s_2^\perp)^T R_2 & (s_2^\perp)^T T_2 \\ \vdots & \vdots \\ (s_m^\perp)^T R_m & (s_m^\perp)^T T_m \end{bmatrix} \in \mathbb{R}^{[m(n-d-1)]\times(n+1)}. \qquad (11.57)$$
Then, from the derivation in Section 11.3.1, we have the following rank inequality:
$$(n-d) \ge \mathrm{rank}(C_h) = \mathrm{rank}(M_h) + (n-d-1) = (n-d). \qquad (11.58)$$
Now consider two hyperplanes $H^{d_2} \subseteq H^{d_1}$ with images $x_i$ and $s_i$ respectively, and the two multiple view matrices
$$M_{h1} = \begin{bmatrix} (s_2^\perp)^T R_2 x_1 & (s_2^\perp)^T T_2 \\ \vdots & \vdots \\ (s_m^\perp)^T R_m x_1 & (s_m^\perp)^T T_m \end{bmatrix} \in \mathbb{R}^{[(m-1)(n-d_1-1)]\times(d_2+2)}, \qquad M_{h2} = \begin{bmatrix} (x_2^\perp)^T R_2 s_1 & (x_2^\perp)^T T_2 \\ \vdots & \vdots \\ (x_m^\perp)^T R_m s_1 & (x_m^\perp)^T T_m \end{bmatrix}. \qquad (11.60)$$
Suppose $\mathrm{rank}(M_{h1}) = 1$. Since each row of $(s_i^\perp)^T$ is a linear combination of the rows of $(x_i^\perp)^T$, we have
$$\mathrm{rank}\begin{bmatrix} (s_2^\perp)^T R_2 x_1 & (s_2^\perp)^T T_2 \\ \vdots & \vdots \\ (s_m^\perp)^T R_m x_1 & (s_m^\perp)^T T_m \end{bmatrix} \le \mathrm{rank}\begin{bmatrix} (x_2^\perp)^T R_2 x_1 & (x_2^\perp)^T T_2 \\ \vdots & \vdots \\ (x_m^\perp)^T R_m x_1 & (x_m^\perp)^T T_m \end{bmatrix}.$$
So the rank of the second matrix must be 1; hence the location of the hyperplane $H^{d_2}$ is uniquely determined according to the discussion in Section 11.4.2. However, if $\mathrm{rank}(M_{h1}) = 0$, we only know
$$(n - d_1 - 1) \le \mathrm{rank}(C_{h1}) \le (n - d_1). \qquad (11.61)$$
In this case, nothing definite can be said about the uniqueness of $H^{d_1}$ from the rank test on this particular multiple view matrix. On the other hand, the entire matrix $M_{h1}$ being zero indeed reveals some degeneracy in the configuration: all the camera centers and $H^{d_1}$ span a $(d_1+1)$-dimensional hyperplane in $\mathbb{R}^n$, since $(s_i^\perp)^T T_i = 0$ for all $i = 2,\dots,m$. The only case in which the exact location of $H^{d_1}$ is determinable (from this particular rank test) is when all the camera centers actually lie in $H^{d_1}$. Furthermore, we are not able to say anything about the location of $H^{d_2}$ in this case either, except that it is indeed contained in $H^{d_1}$.
Now for the matrix $M_{h2}$: if $\mathrm{rank}(M_{h2}) = (d_1 - d_2 + 1)$, then the matrix $C_{h2}$ defined in Section 11.3.2 has $\mathrm{rank}(C_{h2}) = n - d_2$. Hence the kernel of the matrix $C_{h2}$ uniquely determines the hyperplane $H^{d_2}$, i.e. its location relative to the first camera frame. Or, simply use the following rank inequality (obtained by comparing the numbers of columns):
$$\mathrm{rank}\begin{bmatrix} (x_2^\perp)^T R_2 s_1 & (x_2^\perp)^T T_2 \\ \vdots & \vdots \\ (x_m^\perp)^T R_m s_1 & (x_m^\perp)^T T_m \end{bmatrix} \le \mathrm{rank}\begin{bmatrix} (x_2^\perp)^T R_2 x_1 & (x_2^\perp)^T T_2 \\ \vdots & \vdots \\ (x_m^\perp)^T R_m x_1 & (x_m^\perp)^T T_m \end{bmatrix} + (d_1 - d_2).$$
So the second matrix will have rank 1; hence the location of $H^{d_2}$ is unique. However, in general we are not able to say anything about whether $H^{d_1}$ is uniquely determinable from such a rank test, except that it contains $H^{d_2}$.
Now, if $\mathrm{rank}(M_{h2}) = (d_1 - d_2)$, we only know
$$(n - d_2 - 1) \le \mathrm{rank}(C_{h2}) \le (n - d_2). \qquad (11.62)$$
Then this particular rank test does not imply the uniqueness of the hyperplane $H^{d_2}$. From the matrix $M_{h2}$ being of rank $(d_1 - d_2)$, we can still derive that all the camera centers and $H^{d_1}$ span a $(d_1+1)$-dimensional hyperplane $H^{d_1+1}$, since $(x_i^\perp)^T T_i$ is linearly dependent on $(x_i^\perp)^T R_i s_1$ for $i = 2,\dots,m$. So $H^{d_1}$ is in general not determinable unless all the camera centers lie in $H^{d_1}$. In the special case that all the camera centers actually lie in $H^{d_2}$, the location of $H^{d_2}$ is then determinable. If some camera centers are off the hyperplane $H^{d_2}$ but still in the hyperplane $H^{d_1+1}$, we simply cannot tell the uniqueness of $H^{d_2}$ from this particular rank test. Examples of either scenario exist. Generally speaking, the bigger the difference between the dimensions $d_1$ and $d_2$, the less information we get about $H^{d_2}$ from the rank test on $M_{h2}$ alone. For this reason, in practice, other types of multiple view matrices are preferable.
We summarize our discussion in this section in Table 11.1.

  Matrix        Rank value        H^{d_1}        H^{d_2}        Camera centers
  M_{h1}        1                 unique         unique         R^n
  M_{h1}        0                 in H^{d_1+1}   in H^{d_1}     in H^{d_1+1}
  M_{h2}        d_1 - d_2 + 1     contains H^{d_2}   unique     R^n
  M_{h2}        d_1 - d_2         in H^{d_1+1}   in H^{d_1}     in H^{d_1+1}

Table 11.1. Locations of hyperplanes or camera centers implied by each single rank test. Whenever it is well-defined, $H^{d_1+1}$ is the $(d_1+1)$-dimensional hyperplane spanned by $H^{d_1}$ and all the camera centers.
Intersection
Finally, we study the multiple view matrix in Theorem 11.3. Here we know that three hyperplanes are related by $H^{d_3} \subseteq H^{d_1} \cap H^{d_2}$. Let us first study the case in which such a matrix has rank 1. Since each row of $(D_i^\perp)^T$ is a linear combination of the rows of $(x_i^\perp)^T$, we have
$$\mathrm{rank}\begin{bmatrix} (D_2^\perp)^T R_2 x_1 & (D_2^\perp)^T T_2 \\ \vdots & \vdots \\ (D_m^\perp)^T R_m x_1 & (D_m^\perp)^T T_m \end{bmatrix} \le \mathrm{rank}\begin{bmatrix} (x_2^\perp)^T R_2 x_1 & (x_2^\perp)^T T_2 \\ \vdots & \vdots \\ (x_m^\perp)^T R_m x_1 & (x_m^\perp)^T T_m \end{bmatrix},$$
where $D_i^\perp = x_i^\perp$, $r_i^\perp$, or $s_i^\perp$. Therefore, the rank of the second matrix in the above rank inequality is 1. That is, the hyperplane $H^{d_3}$ can always be determined. As before, nothing really can be said about the other two hyperplanes except that they both contain $H^{d_3}$. Now if $\mathrm{rank}(M) = 0$, then nothing really can be said about the locations of any hyperplane, except that all the camera centers and the hyperplanes $H^{d_1}$, $H^{d_2}$ span a $(d_1 + d_2 + 1)$-dimensional hyperplane (a degenerate configuration). The analysis goes very much the same for any number of hyperplanes which all intersect at $H^{d_3}$.
Figure 11.4. A degenerate case: the image of a line $L$ is a point under the perspective projection from $\mathbb{R}^3$ to $\mathbb{R}^1$ ($= I^1$). Let $K$ be the subspace orthogonal to the plane spanned by $o$ and $I^1$. This happens when $L$ is parallel to $K$.

Here $q_1$ denotes the dimension of the image of $H^d$ with respect to the 1st camera frame. Notice that, since $(n-k) - (d - q_1) < (n-k)$, the rank condition says that we may only determine the hyperplane $H^d$ up to an $H^{2d-q_1}$ subspace (with respect to the first view). This is expected, since a degenerate view is chosen as the reference. If we choose a different (non-degenerate) frame as the reference, the rank may vary and give a better description of the overall configuration.
In view of this, we see that the rank of the corresponding matrix $C_h$ (defined during the derivation of $M$ in previous sections) is in general more informative about the overall configuration.
Consider now a point moving in $\mathbb{R}^3$, whose coordinates, as a function of time, admit a Taylor expansion
$$X(t) = X(0) + \sum_{j=1}^{\infty} \frac{X^{(j)}(0)}{j!}\, t^j. \qquad (11.64)$$
Truncating the expansion at order $N$ gives
$$X(t) \approx X(0) + \sum_{j=1}^{N} \frac{X^{(j)}(0)}{j!}\, t^j. \qquad (11.65)$$

7. The rank of $M$ could depend on the difference of the dimensions of the hyperplanes in the case of hyperplane inclusion. But that only occurs in Case one (see Theorem 11.2), which is usually of least practical importance.

The $i$th image of the moving point, taken at time $t_i$, then satisfies
$$\lambda_i x_i = P_0\, g_i\, X(t_i) = P_0\, g_i\Big(X(0) + \sum_{j=1}^{N} \frac{X^{(j)}(0)}{j!}\, t_i^j\Big), \qquad (11.66)$$
where $P_0 \in \mathbb{R}^{3\times4}$. If we define a new projection matrix to be
$$\Pi_i = [\,R_i t_i^N,\ \dots,\ R_i t_i,\ R_i,\ T_i\,] \in \mathbb{R}^{3\times(3N+4)}, \qquad (11.67)$$
and stack the initial position and its (scaled) derivatives into the homogeneous coordinate
$$\bar{X} \doteq \Big[\tfrac{1}{N!}X^{(N)}(0)^T,\ \dots,\ \dot{X}(0)^T,\ X(0)^T,\ 1\Big]^T \in \mathbb{R}^{3N+4}, \qquad (11.68)$$
then
$$\lambda_i x_i = \Pi_i \bar{X}, \qquad \bar{X} \in \mathbb{R}^{3N+4},\ x_i \in \mathbb{R}^3. \qquad (11.69)$$
This is indeed a multiple-view projection from $\mathbb{R}^{3(N+1)}$ to $\mathbb{R}^2$ (in homogeneous coordinates). The rank conditions then give basic constraints that can be further used to segment these features or reconstruct their coordinates. The choice of the time-base $(t^N, \dots, t, 1)$ in the construction of $\Pi$ above is not unique. If we change the time-base to $(\sin(t), \cos(t), 1)$, we can embed points with elliptic trajectories into a higher dimensional linear space.
As a simple example, let us study in more detail a scene consisting of point features, each moving at a constant velocity. According to the above discussion, this can be viewed as a projection from a six-dimensional space to a two-dimensional one: the six-dimensional coordinate consists of both a point's 3-D location and its velocity. We are interested in the set of algebraic (and geometric) constraints that govern its multiple images. Since we now have a projection from $\mathbb{R}^6$ to $\mathbb{R}^2$, for given $m$ images the multiple view matrix $M_p$ (as defined in (11.22)) associated to a point $p$ has dimension $2(m-1) \times 5$ and satisfies the rank condition $0 \le \mathrm{rank}(M_p) \le 6 - 2 = 4$. Then any $5\times5$ minor (involving images from 4 views) of this matrix has determinant zero. These minors exactly correspond to the only type of constraints discovered in [?] (using tensor terminology). However, the rank conditions provide many other types of constraints that are independent of the above. In particular, since each point moves on a straight line in 3-D under the constant-velocity assumption, it is natural to consider the incidence condition that a point lies on a line. According to Theorem 11.2, for $m$ images of a point on a line, if we choose $D_1$ to be the image of the point and $D_i^\perp$ to be the co-image of the line, the multiple view matrix $M$ has dimension $(m-1) \times 5$ and satisfies $\mathrm{rank}(M) \le 4$. It is direct to check that the constraints obtained from its minors being zero involve up to 6 (not 4) views. If we consider the rank conditions for other types of multiple view matrices $M$ defined in Theorem 11.2, different constraints among points and lines can be obtained. Therefore, to complete and correct the result in [?] (where only the types of projection $\mathbb{R}^n \to \mathbb{R}^2$, $n = 4, \dots, 6$ were studied), we have in general:

Corollary 11.4 (Multilinear constraints for $\mathbb{R}^n \to \mathbb{R}^k$). For projection from $\mathbb{R}^n$ to $\mathbb{R}^k$, non-trivial algebraic constraints involve up to $(n-k+2)$ views. The tensor associated to the $(n-k+2)$-view relationship in fact induces all the other types of tensors associated to smaller numbers of views.

The proof directly follows from the rank conditions. In the classic case $n = 3$, $k = 2$, this corollary reduces to the well-known fact in computer vision that irreducible constraints exist up to triple-wise views, and furthermore, the associated (trifocal) tensor induces all (bifocal) tensors (i.e. the essential matrix) associated to pairwise views. In the case $n = 6$, $k = 2$, the corollary is consistent with our discussion above. Of course, besides the embedding (11.69) of a dynamic scene into a higher dimensional space $\mathbb{R}^{3(N+1)}$, knowledge of the motion of features sometimes allows us to embed the problem into a lower dimensional space. These special motions have been studied in [?], and our results here also complete their analysis.
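The following short sketch (Python/NumPy, not from the book; all names and synthetic data are illustrative) shows the constant-velocity embedding at work. It builds the new projection matrices $\Pi_i = [R_i t_i, R_i, T_i]$ of (11.67), stacks the constraints $\widehat{x}_i \Pi_i \bar{X} = 0$, and recovers the point's position and velocity from the one-dimensional kernel, as the rank condition predicts.

```python
import numpy as np

def hat(w):
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

rng = np.random.default_rng(1)

# A point moving with constant velocity: X(t) = X0 + V * t  (embedding into R^6).
X0 = np.array([0.2, -0.1, 4.0])
V = np.array([0.05, 0.02, -0.03])
Xbar = np.hstack([V, X0, 1.0])           # homogeneous coordinates in R^7

m = 6
rows = []
for i in range(m):
    t = float(i)
    if i == 0:
        R_i, T_i = np.eye(3), np.zeros(3)          # reference view
    else:
        Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
        if np.linalg.det(Q) < 0:
            Q[:, 0] = -Q[:, 0]
        R_i, T_i = Q, rng.standard_normal(3)
    Pi = np.hstack([R_i * t, R_i, T_i.reshape(3, 1)])   # Pi_i in R^{3x7}, cf. (11.67)
    x_i = Pi @ Xbar                       # image of the point in view i (up to scale)
    rows.append(hat(x_i) @ Pi)            # hat(x_i) Pi_i annihilates Xbar

C = np.vstack(rows)
_, s, Vt = np.linalg.svd(C)
print("rank of C:", int(np.sum(s > 1e-8 * s[0])))   # expected: 6 (one-dimensional kernel)
Xbar_hat = Vt[-1] / Vt[-1, -1]
print("recovered [V, X0]:", Xbar_hat[:6])           # should match np.hstack([V, X0])
```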
A similar embedding applies to articulated objects. Consider a two-link rigid body, as illustrated in Figure 11.5, whose links $l_1$ and $l_2$ rotate with angular velocities $\omega_1$ and $\omega_2$ about their respective joints, the base being at $X_o$.

Figure 11.5. A two-link rigid body: link $l_1$ rotates with angular velocity $\omega_1$ about the base $X_o$, and link $l_2$ rotates with angular velocity $\omega_2$ about the far end of $l_1$.

A point at distance $r_2$ along the second link has world coordinates
$$X_2(r_2, t) = \big[X_1(l_1, t) + r_2(I + \widehat{e}_3^{\,2})e_1\big] + \big[r_2\,\widehat{e}_3 e_1\big]\sin(\omega_2 t) - \big[r_2\,\widehat{e}_3^{\,2} e_1\big]\cos(\omega_2 t).$$
If we choose the time-base $(\sin(\omega_1 t), \cos(\omega_1 t), \sin(\omega_2 t), \cos(\omega_2 t), 1)$ for the new projection matrix $\Pi$, we may embed the links into $\mathbb{R}^{15}$. Hence the projection of a point on the links is a projection from $\mathbb{R}^{15}$ to $\mathbb{R}^2$, and can be expressed in homogeneous coordinates as
$$\lambda(t)\,x(t) = \Pi(t)\bar{X}, \qquad \bar{X} \in \mathbb{R}^{16},\ x \in \mathbb{R}^3. \qquad (11.70)$$
It is clear that points from each link lie on a one-dimensional hyperplane (a line) in the ambient space $\mathbb{R}^{15}$. These two links together span a two-dimensional hyperplane, unless $\omega_1 = \omega_2$ (i.e. the two links essentially move together as one rigid body). The rank conditions that we have for such a projection are able to exploit the fact that the two links intersect at the second joint, and that points on the links belong to the two-dimensional hyperplane $Z = 0$ (w.r.t. the world coordinate frame).
In general, using similar techniques, we may embed a linked rigid body with multiple joints or links (possibly of other types) into a high dimensional space. If a linked rigid body consists of $N$ links, in general we can embed its configuration into $\mathbb{R}^{O(6N)}$ under the assumption of constant translational and rotational velocities. This makes perfect sense, since each link potentially introduces 6 degrees of freedom: three for rotation and three for translation. In principle, the rank conditions associated to the projection from $\mathbb{R}^{O(6N)}$ to $\mathbb{R}^2$ then apply in exactly the same way.
11.6 Summary
In this chapter, we have studied algebraic and geometric relationships among multiple low-dimensional projections (also called images) of hyperplanes in a high-dimensional linear space. To a large extent, the work is a systematic generalization of classic multiple view geometry in computer vision to higher dimensional spaces. The abstract formulation of the projection model in this chapter allows us to treat uniformly a large class of projections that are normally encountered in many practical applications, such as perspective projection and orthographic projection. We have demonstrated in detail how incidence conditions among multiple images of multiple hyperplanes can be uniformly and concisely described in terms of certain rank conditions on the so-called multiple view matrix. These incidence conditions include: inclusion, intersection, and restriction (to a fixed hyperplane).
The significance of such rank conditions is multi-fold. 1. The multiple view matrix is the result of a systematic elimination of the unknowns that depend on the $n$-dimensional location of the hyperplanes involved. The rank conditions essentially describe incidence conditions in $\mathbb{R}^n$ in terms of only low-dimensional images (of the hyperplanes) and the relative configuration of the vantage points. These are the only constraints that are available for any reconstruction whose input data are the images only. 2. As an alternative to the multi-focal tensor techniques that are widely used in the computer vision literature for studying multiple view geometry, the rank conditions provide a more complete, uniform and precise description of all the incidence conditions in multiple view geometry. They are therefore easier to use in reconstruction algorithms for multiple view analysis and synthesis. 3. A crucial observation is that the rank conditions are invariant under any transformations that preserve incidence relationships. Hence they do not discriminate among Euclidean, affine, projective, or any other transformation groups, as long as these preserve the incidence conditions of the hyperplanes involved. Since these transformations were traditionally studied separately in computer vision, the rank conditions provide a way of treating them in a unified fashion.
Part IV
Reconstruction algorithms
Chapter 12
Batch reconstruction from multiple
views
12.1 Algorithms
12.1.1 Optimal methods
12.1.2 Factorization methods
12.2 Implementation
Chapter 13
Recursive estimation from image sequences

Consider $N$ feature points $X^i \in \mathbb{R}^3$, $i = 1,\dots,N$, collected into a vector $X \in \mathbb{R}^{3N}$, and let them move under the action of a rigid motion between two adjacent time instants: $g(t+1) = \exp(\widehat{\xi}(t))\,g(t)$, $\widehat{\xi}(t) \in se(3)$. We assume that we can measure the (noisy) projection
$$y^i(t) = \pi\big(g(t)X^i\big) + n^i(t) \in \mathbb{R}^2, \qquad i = 1,\dots,N. \qquad (13.2)$$
The complete model is then
$$\begin{cases} X(t+1) = X(t), & X(0) = X_0 \in \mathbb{R}^{3N} \\ g(t+1) = \exp(\widehat{\xi}(t))\,g(t), & g(0) = g_0 \in SE(3) \\ \xi(t+1) = \xi(t) + \alpha(t), & \xi(0) = \xi_0 \in se(3) \\ y^i(t) = \pi\big(g(t)X^i(t)\big) + n^i(t), & n^i(t) \sim \mathcal{N}(0, \Sigma_n), \end{cases} \qquad (13.3)$$
where $\mathcal{N}(M, S)$ indicates a normal distribution with mean $M$ and covariance $S$. In the above model, $\alpha$ is the relative acceleration between the viewer and the scene. If some prior modeling information is available (for instance when the camera is mounted on a vehicle or on a robot arm), this is the place to use it. Otherwise, a statistical model can be employed. In particular, one way to formalize the fact that no information is available is to model $\alpha$ as a Brownian motion process. This is what we do in this chapter¹. In principle one would like, at least for this simplified formalization of SFM, to find the optimal solution, that is, the description of the state of the above system $\{X, g, \xi\}$ given a sequence of output measurements (correspondences) $y^i(t)$ over an interval of time. Since the measurements are noisy and described using a stochastic model, the description of the state consists in its probability density conditioned on the measurements. We call an algorithm that delivers the conditional density of the state at time $t$ causally (i.e. based upon measurements up to time $t$) the optimal filter.
13.2 Observability
To what extent can the 3-D shape and motion of a scene be reconstructed causally from measurements of the motion of its projection onto the sensor? This is the subject of this section, which we start by establishing some notation that will be used throughout the rest of the chapter.
Let a rigid motion $g \in SE(3)$ be represented by a translation vector $T \in \mathbb{R}^3$ and a rotation matrix $R \in SO(3)$, and let $\alpha \neq 0$ be a scalar. The similarity group, which we indicate by $g_\alpha \in SE(3) \times \mathbb{R}^+$, is the composition of a rigid motion and a scaling, which acts on points in $\mathbb{R}^3$ as follows:
$$g_\alpha(X) = \alpha R X + T. \qquad (13.4)$$

1. We wish to emphasize that this choice is not crucial to the results of this chapter. Any other model would do, as long as the overall system is observable.

The action extends componentwise to the collection $X \in \mathbb{R}^{3N}$, and the corresponding equivalence class of configurations is
$$[X] = \{\,Y \in \mathbb{R}^{3N} \mid \exists\, g_\alpha : Y = g_\alpha(X)\,\}. \qquad (13.7)$$
Preliminaries
Consider a discrete-time nonlinear dynamical system of the general form
$$\begin{cases} x(t+1) = f(x(t)), & x(t_0) = x_0 \\ y(t) = h(x(t)), \end{cases} \qquad (13.8)$$
and let $y(t; t_0, x_0)$ indicate the output of the system at time $t$, starting from the initial condition $x_0$ at time $t_0$. In this section we want to characterize the states $x_0$ that can be reconstructed from the measurements $y$. Such a characterization depends upon the structure of the system $(f, h)$ and not on the measurement noise, which is therefore assumed to be absent for the purpose of the analysis in this section.

Definition 13.1. Consider a system in the form (13.8) and a point in the state space $x_0$. We say that $x_0$ is indistinguishable from $x_0'$ if $y(t; t_0, x_0') = y(t; t_0, x_0)$ for all $t, t_0$. We indicate with $I(x_0)$ the set of initial conditions that are indistinguishable from $x_0$.
Given the linearization $\{F, H\}$ of the system, define the observability Grammian over a window of length $k$ as
$$M_k(t) \doteq \sum_{i=t}^{t+k} \Phi_i^T(t)\, H^T(i) H(i)\, \Phi_i(t), \qquad (13.11)$$
where $\Phi_t(t) = I$ and $\Phi_i(t) = F(i-1)\cdots F(t)$. The following definition will come in handy in section 13.4:

Definition 13.3. We say that the system (13.10) is uniformly observable if there exist real numbers $m_1 > 0$, $m_2 > 0$ and an integer $k > 0$ such that $m_1 I \le M_k(t) \le m_2 I$ for all $t$.

The following proposition revisits the well-known fact that, under constant velocity, structure and motion are observable up to a (global) similarity transformation.

Proposition 13.1. The model (13.3), where the points $X$ are in general position, is observable up to a similarity transformation of $X$ provided that $v_0 \neq 0$. In particular, the set of initial conditions that are indistinguishable from $\{X_0, g_0, \xi_0\}$, where $g_0 = \{T_0, R_0\}$ and $e^{\widehat{\xi}_0} = \{v_0, U_0\}$, is given by $\{\alpha\bar{R}X_0 + \bar{T},\ \bar{g}_0,\ \bar{\xi}_0\}$, where $\bar{g}_0 = \{\alpha T_0 - R_0\bar{R}^T\bar{T},\ R_0\bar{R}^T\}$ and $e^{\widehat{\bar{\xi}}_0} = \{\alpha v_0,\ U_0\}$, for some rotation $\bar{R}$, translation $\bar{T}$ and scale $\alpha$.
In the indistinguishability conditions, $g$ only appears at the initial time. Consequently, we drop the time index and write simply $g_1$ and $g_2$ as points in $SE(3)$. At the generic time instant $t$, the indistinguishability condition can therefore be written as $e^{\widehat{\xi}_1 t} g_1 X_1 = \big(e^{\widehat{\xi}_2 t} g_2 X_2\big)\,A(t+1)$. Therefore, given $X_2, g_2, \xi_2$, in order to find the initial conditions that are indistinguishable from it, we need to find $X_1, g_1, \xi_1$ and $A(k)$, $k \ge 1$, such that, after some substitutions, we have
$$\begin{cases} g_1 X_1 = (g_2 X_2)\,A(1) \\ e^{\widehat{\xi}_1} e^{(k-1)\widehat{\xi}_2} g_2 X_2 = e^{\widehat{\xi}_2} e^{(k-1)\widehat{\xi}_2} g_2 X_2\, A(k+1), \quad k \ge 1. \end{cases} \qquad (13.12)$$
The $3 \times N$ matrix on the right-hand side has rank at most 2, while the left-hand side has rank 3, following the general-position conditions, unless $A(k)A^{-1}(k+1) = I$ and $U_1^T U_2 = I$, in which case it is identically zero. Therefore, both terms in the above equations must be identically zero. From $U_1^T U_2 = I$ we conclude that $U_1 = U_2$, while from $A(k)A^{-1}(k+1) = I$ we conclude that $A(k)$ is constant. However, the right-hand side imposes that $v_2 A = v_1$, or in vector form $v_2\, a^T = v_1\, \mathbf{1}^T$ where $A = \mathrm{diag}\{a\}$, which implies that $A = \alpha I$, i.e. a multiple of the identity. Now, going back to the first equation in (13.12), we conclude that $R_1 = R_2 \bar{R}^T$ for some $\bar{R} \in SO(3)$.
For three of the points we then have
$$\lambda^i = \alpha R\, y^i \beta^i + T, \qquad i = 1, 2, 3. \qquad (13.16)$$
Solve the first equation for $T$,
$$T = \lambda^1 - \alpha R\, y^1 \beta^1, \qquad \alpha \neq 0, \qquad (13.17)$$
and substitute into the second and third equations to get
$$\begin{cases} \lambda^2 - \lambda^1 = \alpha R\,(y^2\beta^2 - y^1\beta^1) \\ \lambda^3 - \lambda^1 = \alpha R\,(y^3\beta^3 - y^1\beta^1). \end{cases} \qquad (13.18)$$
The scale $\alpha > 0$ can be solved for as a function of the unknown scales $\beta^2$ and $\beta^3$:
$$\alpha = \frac{\|\lambda^2 - \lambda^1\|}{\|y^2\beta^2 - y^1\beta^1\|} = \frac{\|\lambda^3 - \lambda^1\|}{\|y^3\beta^3 - y^1\beta^1\|}. \qquad (13.19)$$
On the right-hand side of each (normalized) equation there is a unit-norm vector parameterized by $\beta^2$ and $\beta^3$ respectively. In particular, the right-hand side of the first equation in (13.20) is a vector on the unit circle of the plane spanned by $y^1$ and $y^2$, while the right-hand side of the second equation is a vector on the unit circle of the plane spanned by $y^1$ and $y^3$. By the assumption of non-collinearity, these two planes do not coincide. We write the above equations in a more compact form as
$$\begin{cases} u^1 = R\, w^1 \\ u^2 = R\, w^2, \end{cases} \qquad (13.21)$$
where $u^1, u^2$ are the fixed unit vectors on the left and $w^1(\beta^2), w^2(\beta^3)$ the parameterized unit vectors on the right. Now $R$ must preserve the angle between $u^1$ and $u^2$, which we indicate as $\angle u^1 u^2$, and therefore $\beta^2$ and $\beta^3$ must be chosen accordingly. If no admissible choice matches this angle, the configurations are distinguishable. Otherwise, there exists a one-dimensional interval set of $\beta^2, \beta^3$ for which one can find a rotation $R$ that preserves the angle. However, $R$ must also preserve the cross product, so that we have
13.3 Realization
In order to design a finite-dimensional approximation to the optimal filter, we need an observable realization of the original model. How to obtain it is the subject of this section.

Local coordinates
Our first step consists in characterizing the local-coordinate representation of the model (13.3). To this end, we represent $SO(3)$ locally in canonical exponential coordinates: let $\Omega$ be a three-dimensional real vector ($\Omega \in \mathbb{R}^3$); $\Omega/\|\Omega\|$ specifies the direction of rotation and $\|\Omega\|$ specifies the angle of rotation in radians. Then a rotation matrix can be represented by its exponential coordinates $\widehat{\Omega} \in so(3)$, such that $R = \exp(\widehat{\Omega}) \in SO(3)$, as described in chapter 2. The three-dimensional coordinate $X^i$ is represented by its projection onto the image plane $y^i$ and its depth $\lambda^i$, so that
$$y^i = \pi(X^i) = \Big[\frac{X_1^i}{X_3^i},\ \frac{X_2^i}{X_3^i}\Big]^T, \qquad \lambda^i = X_3^i. \qquad (13.23)$$
The model (13.3), written in these local coordinates, becomes
$$\begin{cases} y_0^i(t+1) = y_0^i(t), & y_0^i(0) = y_0^i, \quad i = 1,\dots,N \\ \lambda^i(t+1) = \lambda^i(t), & \lambda^i(0) = \lambda_0^i, \quad i = 1,\dots,N \\ T(t+1) = \exp(\widehat{\omega}(t))\,T(t) + v(t), & T(0) = T_0 \\ \Omega(t+1) = \mathrm{Log}_{SO(3)}\big(\exp(\widehat{\omega}(t))\exp(\widehat{\Omega}(t))\big), & \Omega(0) = \Omega_0 \\ v(t+1) = v(t) + \alpha_v(t), & v(0) = v_0 \\ \omega(t+1) = \omega(t) + \alpha_\omega(t), & \omega(0) = \omega_0 \\ y^i(t) = \pi\big(\exp(\widehat{\Omega}(t))\,y_0^i\,\lambda^i(t) + T(t)\big) + n^i(t), & i = 1,\dots,N. \end{cases} \qquad (13.24)$$
The notation $\mathrm{Log}_{SO(3)}(R)$ stands for the $\Omega$ such that $R = e^{\widehat{\Omega}}$, and is computed by inverting Rodrigues' formula as described in chapter 2.
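As a numerical sketch (not part of the original text), the following Python functions implement the exponential coordinates and the $\mathrm{Log}_{SO(3)}$ map used in (13.24) via Rodrigues' formula and its inverse. The names are illustrative, and the logarithm as written is only valid away from rotations by exactly $\pi$.

```python
import numpy as np

def hat(w):
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def exp_so3(Omega):
    # Rodrigues' formula: R = I + sin(theta) K + (1 - cos(theta)) K^2, K = hat(Omega)/theta.
    theta = np.linalg.norm(Omega)
    if theta < 1e-12:
        return np.eye(3) + hat(Omega)              # first-order approximation near identity
    K = hat(Omega / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * K @ K

def log_so3(R):
    # Inverse of Rodrigues' formula (valid for 0 <= theta < pi).
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if theta < 1e-12:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta / (2.0 * np.sin(theta)) * w

# The composition used in the state equation for Omega in (13.24):
Omega = np.array([0.1, -0.2, 0.3])
omega = np.array([0.02, 0.01, -0.03])
R_next = exp_so3(omega) @ exp_so3(Omega)
Omega_next = log_so3(R_next)                       # LogSO(3)(exp(omega^) exp(Omega^))
print(np.allclose(exp_so3(Omega_next), R_next))    # True
```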
Minimal realization
In linear time-invariant systems one can decompose the state-space into an
observable subspace and its (unobservable) complement. In the case of our
system, which is nonlinear and observable up to a group transformation,
we can exploit the bundle structure of the state-space to realize a similar
decomposition: the base of the fiber bundle is observable, while individual
fibers are not. Therefore, in order to restrict our attention to the observable
component of the system, we only need to choose a base of the fiber bundle,
that is a particular (representative) point on each fiber.
Proposition 13.2 suggests a way to render the model (13.24) observable
by eliminating the states that fix the unobservable subspace.
Corollary 13.1. The model
$$\begin{cases} y_0^i(t+1) = y_0^i(t), & i = 4,\dots,N \\ \lambda^i(t+1) = \lambda^i(t), & \lambda^i(0) = \lambda_0^i, \quad i = 2,\dots,N \\ T(t+1) = \exp(\widehat{\omega}(t))\,T(t) + v(t) \\ \Omega(t+1) = \mathrm{Log}_{SO(3)}\big(\exp(\widehat{\omega}(t))\exp(\widehat{\Omega}(t))\big), & \Omega(0) = \Omega_0 \\ v(t+1) = v(t) + \alpha_v(t), & v(0) = v_0 \\ \omega(t+1) = \omega(t) + \alpha_\omega(t), & \omega(0) = \omega_0 \\ y^i(t) = \pi\big(\exp(\widehat{\Omega}(t))\,y_0^i\,\lambda^i(t) + T(t)\big) + n^i(t), & i = 1,\dots,N, \end{cases} \qquad (13.25)$$
which is obtained by eliminating $y_0^1$, $y_0^2$, $y_0^3$ and $\lambda^1$ from the state of the model (13.24), is observable.
Definition 13.4. We say that a motion is admissible if $v(t)$ is not identically zero, i.e. there is an open set $(a, b)$ such that $v(t) \neq 0$ for all $t \in (a, b)$, and the corresponding trajectory of the system (??) is such that $c \le Z^i(t) \le C$, $i = 1,\dots,N$, for all $t > 0$ and for some constants $c > 0$, $C < \infty$.

Proposition 13.3. Let $F(t) \doteq \frac{\partial f}{\partial x}(x)$ and $H(t) \doteq \frac{\partial h}{\partial x}(x)$ denote the linearizations of the state and measurement equations in (??), respectively; let $N > 5$ and assume the motion is admissible; then the linearized system is uniformly observable.

Proof. Let $k = 2$; that there exists an $m_2 < \infty$ such that the Grammian satisfies $M_2(t) \le m_2 I$ follows from the fact that $F(t)$ and $H(t)$ are bounded for all $t$, as can be easily verified provided that the motion is admissible. We now need to verify that $M_2(t)$ is strictly positive definite for all $t$. To this end, it is sufficient to show that the matrix
$$U_2(t) \doteq \begin{bmatrix} H(t) \\ H(t+1)F(t) \end{bmatrix} \qquad (13.26)$$
has full column rank equal to $3N + 5$ for all values of $t$, which can be easily verified whenever $N \ge 5$.
13.4 Stability
Following the derivation in previous sections, the problem of estimating the motion, velocity and point-wise structure of the scene can be converted into the problem of estimating the state of the model (??). We propose to solve the task using a nonlinear filter, properly designed to account for the observability properties of the model. The implementation, which we report in detail in section 13.5, results in a sub-optimal filter. However, it is important to guarantee that the estimation error, while different from the optimal one, remains bounded. The filter gain is obtained from the usual Riccati equation, which uses the linearization of the model $\{F, H\}$ computed at the current estimate of the state, as described in [?].
Note that we call $\Sigma_{n0}$, $\Sigma_{w0}$ the variances of the measurement and model noises, and $\Sigma_n$, $\Sigma_w$ the tuning parameters that appear in the Riccati equation. The latter are free for the designer to choose, as described in section 13.5.

Stability analysis
The aim of this section is to prove that the estimation error generated by the filter just described is bounded. In order to do so, we need a few definitions.

Definition 13.5. A stochastic process $\tilde{x}(t)$ is said to be exponentially bounded in mean-square (or MS-bounded) if there are real numbers $\eta, \nu > 0$ and $0 < \vartheta < 1$ such that $E\|\tilde{x}(t)\|^2 \le \eta\,\|\tilde{x}(0)\|^2\vartheta^t + \nu$ for all $t \ge 0$. $\tilde{x}(t)$ is said to be bounded with probability one (or bounded WP1) if $P[\sup_{t\ge0}\|\tilde{x}(t)\| < \infty] = 1$.

Definition 13.6. The filter (13.27) is said to be stable if there exist positive real numbers $\varepsilon$ and $\delta$ such that
$$\|\tilde{x}(0)\| \le \varepsilon, \quad \Sigma_n(t) \le \delta I, \quad \Sigma_w(t) \le \delta I \quad \Longrightarrow \quad \tilde{x}(t) \text{ is bounded.} \qquad (13.31)$$
Proof. The proof follows from corollary 5.2 of [?], using proposition 13.3 on the uniform observability of the linearization of (??).

Now for the proof of the proposition:

Proof. The proposition follows from theorem 3.1 in [?], making use in the assumptions of the boundedness of $F(t)$ and $H(t)$, of lemma 13.1, and of the differentiability of $f$ and $h$ when $0 < \lambda^i < \infty$ for all $i$.
Non-minimal models
Most recursive schemes for causally reconstructing structure and motion available in the literature represent structure using only one state per point (either its depth in an inertial frame, or its inverse, or other variations on the theme). This corresponds to reducing the state of the model (??), with the states $y_0^i$ substituted by the measurements $y^i(0)$, which causes the model noise $n(t)$ to be non-zero-mean². When the zero-mean assumption implicit in the use of the Kalman filter is violated, the filter settles at a severely biased estimate. In this case we say that the model is sub-minimal.
On the other hand, when the model is non-minimal (such is the case when we do not force it to evolve on a base of the state-space bundle), the filter is free to wander on the non-observable space (i.e. along the fibers of the state-space bundle), thereby causing the explosion of the variance of the estimation error along the unobservable components.
13.5 Implementation
In the previous section we have seen that the model (13.25) is observable. Notice that in that model the index for $y_0^i$ starts at 4, while the index for $\lambda^i$ starts at 2. This corresponds to choosing the first three points as reference for the similarity group, and is necessary (and sufficient) for guaranteeing that the representation is minimal. In the model (13.25), we are free to choose the initial conditions $\Omega_0, T_0$, which we will therefore set to $\Omega_0 = T_0 = 0$, thereby choosing the camera reference at the initial time instant as the world reference.
Partial autocalibration
As we have anticipated, the models proposed can be extended to account for changes in calibration. For instance, consider an imaging model with focal length $f$³:
$$\pi_f(X) = \frac{f}{X_3}\begin{bmatrix} X_1 \\ X_2 \end{bmatrix}, \qquad (13.33)$$
where the focal length can change in time, but no prior knowledge of how it does so is available. One can then model its evolution as a random walk
$$f(t+1) = f(t) + \alpha_f(t), \qquad \alpha_f(t) \sim \mathcal{N}(0, \sigma_f^2), \qquad (13.34)$$
and insert it into the state of the model (13.3). As long as the overall system is observable, the conclusions reached in the previous section will hold. The following claim shows that this is the case for the model (13.34) above. Another imaging model proposed in the literature is the following [?]:
$$\pi_\beta(X) = \frac{1}{1 + \beta X_3}\begin{bmatrix} X_1 \\ X_2 \end{bmatrix}, \qquad (13.35)$$
for which similar conclusions can be drawn.

Proposition 13.5. Let $g = \{T, R\}$ and $\xi = \{v, \omega\}$. The model
$$\begin{cases} X(t+1) = X(t), & X(0) = X_0 \\ g(t+1) = e^{\widehat{\xi}}\,g(t), & g(0) = g_0 \\ \xi(t+1) = \xi(t), & \xi(0) = \xi_0 \\ f(t+1) = f(t), & f(0) = f_0 \\ y(t) = \pi_f\big(g(t)X(t)\big) \end{cases} \qquad (13.36)$$
is observable up to the action of the group represented by $\bar{T}, \bar{R}, \alpha$ acting on the initial conditions.

3. This $f$ is not to be confused with the generic state equation of the filter in section 13.6.
Proof. Consider the diagonal matrix $F(t) = \mathrm{diag}\{f(t), f(t), 1\}$ and the matrix of scalings $A(t)$ as in the proof in section 13.2. Consider then two initial conditions $\{X_1, g_1, \xi_1, f_1\}$ and $\{X_2, g_2, \xi_2, f_2\}$. For them to be indistinguishable there must exist matrices of scalings $A(k)$ and of focus $F(k)$ such that
$$\begin{cases} g_1 X_1 = F(1)(g_2 X_2)\,A(1) \\ e^{\widehat{\xi}_1} e^{(k-1)\widehat{\xi}_1} g_1 X_1 = F(k+1)\, e^{\widehat{\xi}_2} e^{(k-1)\widehat{\xi}_2} g_2 X_2\, A(k+1), \quad k \ge 1. \end{cases} \qquad (13.37)$$
Making the representation explicit we obtain
$$\begin{cases} R_1 X_1 + T_1 = F(1)(R_2 X_2 + T_2)A(1) \\ U_1 F(k)\bar{X}_k A(k) + v_1 = F(k+1)\big(U_2 \bar{X}_k + v_2\big)A(k+1), \end{cases} \qquad (13.38)$$
which can be re-written as
$$\bar{X}_k A(k)A^{-1}(k+1) - F^{-1}(k)U_1^T F(k+1)U_2 \bar{X}_k = F(k)^{-1} U_1^T\big(F(k+1)\,v_2 A(k+1) - v_1\big)A^{-1}(k+1).$$
The two sides of the equation have equal rank only if both are equal to zero, which leads us to conclude that $A(k)A^{-1}(k+1) = I$, and hence $A$ is constant. From $F^{-1}(k)U_1^T F(k+1)U_2 = I$ we get that $F(k+1)U_2 = U_1 F(k)$, and, since $U_1, U_2 \in SO(3)$, taking the norm of both sides we have $2f^2(k+1) + 1 = 2f^2(k) + 1$, where $f$ must be positive and therefore constant: $FU_2 = U_1 F$. From the right-hand side we have that $F v_2 A = v_1$, from which we conclude that $A = \alpha I$, so that in vector form we have $v_1 = \alpha F v_2$. Therefore, from the second equation we have that, for any $f$ and any $\alpha$, we can have
$$v_1 = \alpha F v_2, \qquad U_1 = F U_2 F^{-1}. \qquad (13.39)\text{--}(13.40)$$
Saturation
Instead of eliminating states to render the model observable, it is possible to design a nonlinear filter directly on the (unobservable) model (13.3) by saturating the filter along the unobservable component of the state space, as we show in this section. In other words, it is possible to design the initial variance of the state of the estimator as well as its model error in such a way that it will never move along the unobservable component of the state space.
As suggested in the previous section, one can saturate the states corresponding to $y_0^1, y_0^2, y_0^3$ and $\lambda^1$. We have to guarantee that the filter initialized at $\hat{y}_0^i, \hat{\lambda}_0, \hat{g}_0, \hat{\xi}_0$ evolves in such a way that $\hat{y}_0^1(t) = \hat{y}_0^1$, $\hat{y}_0^2(t) = \hat{y}_0^2$, $\hat{y}_0^3(t) = \hat{y}_0^3$, $\hat{\lambda}^1(t) = \hat{\lambda}_0^1$. It is simple, albeit tedious, to prove the following proposition.

Proposition 13.6. Let $P_{y^i}(0)$, $P_{\lambda^i}(0)$ denote the variances of the initial conditions corresponding to the states $y_0^i$ and $\lambda^i$ respectively, and $\Sigma_{y^i}$, $\Sigma_{\lambda^i}$ the variances of the model error corresponding to the same states. Then
$$P_{y^i}(0) = 0,\ i = 1,\dots,3, \quad P_{\lambda^1}(0) = 0, \quad \Sigma_{y^i} = 0,\ i = 1,\dots,3, \quad \Sigma_{\lambda^1} = 0 \qquad (13.41)$$
implies that $\hat{y}_0^i(t|t) = \hat{y}_0^i(0)$, $i = 1,\dots,3$, and $\hat{\lambda}^1(t|t) = \hat{\lambda}^1(0)$ for all $t$.
Pseudo-measurements
Yet another alternative to render the model observable is to add pseudo-measurement equations with zero error variance.

Proposition 13.7. The model
$$\begin{cases} y_0^i(t+1) = y_0^i(t), & i = 1,\dots,N \\ \lambda^i(t+1) = \lambda^i(t), & \lambda^i(0) = \lambda_0^i, \quad i = 1,\dots,N \\ T(t+1) = \exp(\widehat{\omega}(t))\,T(t) + v(t) \\ \Omega(t+1) = \mathrm{Log}_{SO(3)}\big(\exp(\widehat{\omega}(t))\exp(\widehat{\Omega}(t))\big), & \Omega(0) = \Omega_0 \\ v(t+1) = v(t) + \alpha_v(t), & v(0) = v_0 \\ \omega(t+1) = \omega(t) + \alpha_\omega(t), & \omega(0) = \omega_0 \\ y^i(t) = \pi\big(\exp(\widehat{\Omega}(t))\,y_0^i\,\lambda^i(t) + T(t)\big) + n^i(t), & i = 1,\dots,N \\ \lambda^1(t) = \bar{\lambda}^1 \\ y_0^i(t) = \bar{y}^i, & i = 1,\dots,3, \end{cases} \qquad (13.42)$$
where $\bar{\lambda}^1$ is an arbitrary (positive) constant and the $\bar{y}^i$ are three non-collinear points on the plane, is observable.

The implementation of an extended Kalman filter based upon the model (??) is straightforward. However, for the sake of completeness we report it in section 13.6. The only issue that needs to be dealt with is the disappearing and appearing of feature points, a common trait of sequences of images of natural scenes. Visible feature points may become occluded (and therefore their measurements become unavailable), or previously occluded points may become visible again.
Occlusions
When a feature point, say $X^i$, becomes occluded, the corresponding measurement $y^i(t)$ becomes unavailable. It is possible to model this phenomenon by setting the corresponding variance to infinity or, in practice, $\Sigma_{n^i} = M\,I_2$ for a suitably large scalar $M > 0$. By doing so, we guarantee that the corresponding states $\hat{y}_0^i(t)$ and $\hat{\lambda}^i(t)$ are not updated:

Proposition 13.8. If $\Sigma_{n^i} = \infty$, then $\hat{y}_0^i(t+1) = \hat{y}_0^i(t)$ and $\hat{\lambda}^i(t+1) = \hat{\lambda}^i(t)$.

An alternative, which is actually preferable in order to avoid useless computation and ill-conditioned inverses, is to eliminate the states $\hat{y}_0^i$ and $\hat{\lambda}^i$ altogether, thereby reducing the dimension of the state space. This is simple due to the diagonal structure of the model (??): the states $\lambda^i$, $y_0^i$ are decoupled, and therefore it is sufficient to remove them and delete the corresponding rows from the gain matrix $K(t)$ and the variance $\Sigma_w(t)$ for all $t$ past the disappearance of the feature (see section 13.6).
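The following small sketch (Python/NumPy, not the book's implementation; the indexing convention is illustrative) shows the second option in practice: removing the state entries of an occluded feature together with the corresponding rows and columns of the covariance. The variance-inflation option amounts instead to setting the feature's measurement covariance block to a large multiple of the identity.

```python
import numpy as np

def remove_feature(x, P, idx):
    """Remove the state entries with indices `idx` (e.g. the y_0^i and lambda^i
    entries of an occluded feature) from the state estimate x and covariance P."""
    keep = np.setdiff1d(np.arange(x.size), idx)
    return x[keep], P[np.ix_(keep, keep)]

# Toy example: an 8-dimensional state where entries 2 and 3 belong to an occluded feature.
rng = np.random.default_rng(2)
x = rng.standard_normal(8)
A = rng.standard_normal((8, 8))
P = A @ A.T                                   # some positive-definite covariance
x_red, P_red = remove_feature(x, P, np.array([2, 3]))
print(x_red.shape, P_red.shape)               # (6,) (6, 6)

# Alternative: keep the state but inflate the measurement covariance of the feature,
# so that the update leaves its states (approximately) untouched.
M = 1e8
Sigma_ni = M * np.eye(2)
```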
When a new feature point appears, on the other hand, it is not possible to simply insert it into the state of the model, since its initial condition is unknown. Any initialization error will disturb the current estimate of the remaining states, since it is fed back into the update equation of the filter, and generates a spurious transient. We address this problem by running a separate filter in parallel for each point, using the current estimates of motion from the main filter, in order to reconstruct the initial condition. Such a subfilter is based upon the following model, where we assume that $N_{\mathrm{new}}$ features appear at time $\tau$:
$$\begin{cases} y_0^i(t+1) = y_0^i(t) + \alpha_{y^i}(t), & y_0^i(\tau) \sim \mathcal{N}\big(y^i(\tau), \Sigma_{n^i}\big) \\ \lambda^i(t+1) = \lambda^i(t) + \alpha_{\lambda^i}(t), & \lambda^i(\tau) \sim \mathcal{N}\big(1, P_\lambda(\tau)\big) \\ y^i(t) = \pi\Big(\exp\big(\widehat{\Omega}(t|t)\big)\big[\exp\big(\widehat{\Omega}(\tau|\tau)\big)\big]^{-1}\big(y_0^i(t)\lambda^i(t) - T(\tau|\tau)\big) + T(t|t)\Big) + n^i(t), & t > \tau, \end{cases} \qquad (13.43)$$
for $i = 1,\dots,N_{\mathrm{new}}$, where $\Omega(t|t)$ and $T(t|t)$ are the current best estimates of $\Omega$ and $T$, and $\Omega(\tau|\tau)$ and $T(\tau|\tau)$ are the best estimates of $\Omega$ and $T$ at $t = \tau$. In practice, rather than initializing $\lambda^i$ to 1, one can compute a first approximation by triangulating on two adjacent views, and compute the covariance of the initialization error from the covariance of the current estimates of motion.
Several heuristics can be employed in order to decide when the estimate of the initial condition is good enough to be inserted into the main filter. The most natural criterion is when the variance of the estimation error becomes smaller than a given threshold.
Drift
The only case in which losing a feature constitutes a problem is when it is used to fix the observable component of the state space (in our notation, $i = 1, 2, 3$). The most obvious choice consists in associating the reference to any other visible point. This can be done by saturating the corresponding state and assigning as reference value the current best estimate. In particular, if feature $i$ is lost at time $\tau$, and we want to switch the reference index to feature $j$, we eliminate $y_0^i, \lambda^i$ from the state, and set the diagonal blocks of $\Sigma_w$ and $P(\tau)$ with indices $3j-3$ to $3j$ to zero. Therefore, by proposition 13.6, we have that
$$\hat{y}_0^j(\tau + t) = \hat{y}_0^j(\tau) \qquad \forall\, t > 0. \qquad (13.45)$$
Initialization
The filter is initialized with
$$y_0^i = y^i(0), \quad \lambda_0^i = 1, \quad T_0 = 0, \quad \Omega_0 = 0, \quad v_0 = 0, \quad \omega_0 = 0, \qquad i = 1,\dots,N. \qquad (13.46)$$
For the initial variance $P_0$, choose it to be block diagonal with blocks $\Sigma_{n^i}(0)$ corresponding to $y_0^i$, a large positive number $M$ (typically 100-1000 units of focal length) corresponding to $\lambda^i$, and zeros corresponding to $T_0$ and $\Omega_0$ (fixing the inertial frame to coincide with the initial reference frame). We also choose a large positive number $W$ for the blocks corresponding to $v_0$ and $\omega_0$.
The variance $\Sigma_n(t)$ is usually available from the analysis of the feature tracking algorithm. We assume that the tracking error is independent for each point, and therefore $\Sigma_n$ is block diagonal. We choose each block to be the covariance of the measurement $y^i(t)$ (in the current implementation they are diagonal and equal to 1 pixel std.). The variance $\Sigma_w(t)$ is a design parameter that is available for tuning. We describe the procedure in section 13.6. Finally, set
$$\begin{cases} \hat{x}(0|0) = \big[\,y_0^{4\,T}, \dots, y_0^{N\,T},\ \lambda_0^2, \dots, \lambda_0^N,\ T_0^T,\ \Omega_0^T,\ v_0^T,\ \omega_0^T\,\big]^T \\ P(0|0) = P_0. \end{cases} \qquad (13.47)$$
Transient
During the first transient of the filter, we do not allow new features to be acquired. Whenever a feature is lost, its state is removed from the model and its best current estimate is placed in a storage vector. If the feature was associated with the scale factor, we proceed as in section 13.5. The end of the transient can be tested as either a threshold on the innovation, a threshold on the variance of the estimates, or a fixed time interval. We choose a combination, with the time set to 30 frames, corresponding to one second of video.
The recursion to update the state $\hat{x}$ and the variance $P$ proceeds as follows. Let $f$ and $h$ denote the state and measurement model, so that
$$x(t+1) = f(x(t)) + w(t), \qquad y(t) = h(x(t)) + n(t). \qquad (13.48)$$
We then have:

Prediction:
$$\begin{cases} \hat{x}(t+1|t) = f\big(\hat{x}(t|t)\big) \\ P(t+1|t) = F(t)P(t|t)F^T(t) + \Sigma_w \end{cases} \qquad (13.49)$$

Update:
$$\begin{cases} \hat{x}(t+1|t+1) = \hat{x}(t+1|t) + L(t+1)\big(y(t+1) - h(\hat{x}(t+1|t))\big) \\ P(t+1|t+1) = \Gamma(t+1)P(t+1|t)\Gamma^T(t+1) + L(t+1)\Sigma_n(t+1)L^T(t+1) \end{cases} \qquad (13.50)$$

Gain:
$$\begin{cases} \Gamma(t+1) = I - L(t+1)H(t+1) \\ L(t+1) = P(t+1|t)H^T(t+1)\Lambda^{-1}(t+1), \end{cases} \qquad (13.51)$$
where $\Lambda(t+1) = H(t+1)P(t+1|t)H^T(t+1) + \Sigma_n(t+1)$ is the innovation variance.

Linearization:
$$\begin{cases} F(t) \doteq \frac{\partial f}{\partial x}\big(\hat{x}(t|t)\big) \\ H(t+1) \doteq \frac{\partial h}{\partial x}\big(\hat{x}(t+1|t)\big). \end{cases} \qquad (13.52)$$
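As an illustration only (not the book's implementation), the following Python function carries out one generic prediction/update cycle of the extended Kalman filter exactly as in (13.49)-(13.52); the functions f, h and their Jacobians F, H are assumed to be supplied by the user, here with a trivial scalar example.

```python
import numpy as np

def ekf_step(x, P, y, f, h, F, H, Sigma_w, Sigma_n):
    """One EKF prediction/update cycle following (13.49)-(13.52).
    f, h: state and measurement maps; F(x), H(x): their Jacobians."""
    # Prediction (13.49)
    x_pred = f(x)
    Fx = F(x)
    P_pred = Fx @ P @ Fx.T + Sigma_w
    # Gain (13.51); Lambda is the innovation variance
    Hx = H(x_pred)
    Lam = Hx @ P_pred @ Hx.T + Sigma_n
    L = P_pred @ Hx.T @ np.linalg.inv(Lam)
    Gamma = np.eye(len(x)) - L @ Hx
    # Update (13.50)
    x_new = x_pred + L @ (y - h(x_pred))
    P_new = Gamma @ P_pred @ Gamma.T + L @ Sigma_n @ L.T
    return x_new, P_new

# Tiny example: a scalar random-walk state observed directly.
f = lambda x: x
h = lambda x: x
F = lambda x: np.eye(1)
H = lambda x: np.eye(1)
x, P = np.zeros(1), np.eye(1)
x, P = ekf_step(x, P, np.array([0.7]), f, h, F, H, 0.01 * np.eye(1), 0.1 * np.eye(1))
print(x, P)
```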
Let $e_i$ be the $i$-th canonical vector in $\mathbb{R}^3$ and define $Y^i(t) \doteq e^{\widehat{\Omega}(t)}\,y_0^i(t)\lambda^i(t) + T(t)$ and $Z^i(t) \doteq e_3^T Y^i(t)$. The $i$-th block row $H_i(t)$ of the matrix $H(t)$ ($i = 1,\dots,N$) can be written as
$$H_i = \frac{\partial y^i}{\partial Y^i}\,\frac{\partial Y^i}{\partial x} = \Pi^i\,\frac{\partial Y^i}{\partial x},$$
where the time argument $t$ has been omitted for simplicity of notation. It is easy to check that $\Pi^i = \frac{1}{Z^i}\,[\,I_2 \ \ -\pi(Y^i)\,]$, and that the only nonzero blocks of $\partial Y^i/\partial x$ (the blocks of the state corresponding to $y_0$, $\lambda$, $T$, $\Omega$, $v$ and $\omega$ have $2N-6$, $N-1$, $3$, $3$, $3$ and $3$ columns, respectively) are
$$\frac{\partial Y^i}{\partial y_0^i} = e^{\widehat{\Omega}}\,\lambda^i\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}, \qquad \frac{\partial Y^i}{\partial \lambda^i} = e^{\widehat{\Omega}}\,y_0^i, \qquad \frac{\partial Y^i}{\partial T} = I,$$
while the block $\partial Y^i/\partial\Omega$ collects the derivatives of $e^{\widehat{\Omega}}\,y_0^i\lambda^i$ with respect to the three components of $\Omega$.
The derivative of $\mathrm{Log}_{SO(3)}$ is computed by differentiating the inverse of Rodrigues' formula; the calculation is reported in [?] and will not be repeated here. We shall use the following notation:
$$\frac{\partial\,\mathrm{Log}_{SO(3)}(R)}{\partial R} \doteq \left[\frac{\partial\,\mathrm{Log}_{SO(3)}(R)}{\partial r_{11}},\ \frac{\partial\,\mathrm{Log}_{SO(3)}(R)}{\partial r_{21}},\ \dots,\ \frac{\partial\,\mathrm{Log}_{SO(3)}(R)}{\partial r_{33}}\right].$$
The linearization $F$ of the state equation is then block structured: identity blocks $I_{2N-6}$, $I_{N-1}$, $I$ and $I$ for the constant states $y_0^i$, $\lambda^i$, $v$ and $\omega$, the block $e^{\widehat{\omega}}$ (with the corresponding derivatives $\widehat{e}_1 e^{\widehat{\omega}}T$, $\widehat{e}_2 e^{\widehat{\omega}}T$, $\widehat{e}_3 e^{\widehat{\omega}}T$ with respect to $\omega$) for the translation equation, and blocks obtained via the chain rule through $\partial\,\mathrm{Log}_{SO(3)}(R)/\partial R$ for the rotation equation, where $R = e^{\widehat{\omega}}e^{\widehat{\Omega}}$ and
$$\frac{\partial R}{\partial \Omega} \doteq \big[\,e^{\widehat{\omega}}\,\widehat{e}_1 e^{\widehat{\Omega}},\ e^{\widehat{\omega}}\,\widehat{e}_2 e^{\widehat{\Omega}},\ e^{\widehat{\omega}}\,\widehat{e}_3 e^{\widehat{\Omega}}\,\big], \qquad \frac{\partial R}{\partial \omega} \doteq \big[\,\widehat{e}_1 e^{\widehat{\omega}} e^{\widehat{\Omega}},\ \widehat{e}_2 e^{\widehat{\omega}} e^{\widehat{\Omega}},\ \widehat{e}_3 e^{\widehat{\omega}} e^{\widehat{\Omega}}\,\big],$$
and the bracket $(\cdot)^\vee$ indicates that the content has been organized into a column vector.
Regime
Whenever a feature disappears, we simply remove it from the state as during the transient. However, after the transient a feature selection module works in parallel with the filter to select new features, so as to maintain a roughly constant number of them (equal to the maximum that the hardware can handle in real time) and to maintain a distribution as uniform as possible across the image plane. We implement this by randomly sampling points on the plane and then searching around each point for a feature with enough brightness gradient (we use an SSD-type test [?]).
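As an illustration of one common selection criterion of this type (not necessarily the exact test used in the implementation described here), the following Python sketch scores a candidate patch by the smallest eigenvalue of its gradient second-moment matrix, which is large only when there is contrast along two independent directions, and selects the best among randomly sampled candidates. All names and the synthetic image are illustrative.

```python
import numpy as np

def corner_score(patch):
    """Smallest eigenvalue of the gradient second-moment matrix of an image patch."""
    gy, gx = np.gradient(patch.astype(float))
    G = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    return np.linalg.eigvalsh(G)[0]

rng = np.random.default_rng(3)
image = rng.random((120, 160))            # stand-in for a real image
half = 7
best, best_score = None, -1.0
for _ in range(50):                       # randomly sampled candidate locations
    r = int(rng.integers(half, image.shape[0] - half))
    c = int(rng.integers(half, image.shape[1] - half))
    s = corner_score(image[r - half:r + half + 1, c - half:c + half + 1])
    if s > best_score:
        best, best_score = (r, c), s
print(best, best_score)
```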
Once a new point-feature is found (one with enough contrast along two independent directions), a new filter (which we call a subfilter) is initialized based on the model (13.43). Its evolution is given by

Initialization:
$$\hat{y}_0^i(\tau|\tau) = y^i(\tau), \qquad \hat{\lambda}^i(\tau|\tau) = 1, \qquad P^i(\tau|\tau) = \mathrm{diag}\big\{\Sigma_{n^i}(\tau),\ P_\lambda(\tau)\big\}. \qquad (13.53)$$

Prediction:
$$\hat{y}_0^i(t+1|t) = \hat{y}_0^i(t|t), \qquad \hat{\lambda}^i(t+1|t) = \hat{\lambda}^i(t|t), \qquad t > \tau.$$

Update:
$$\begin{bmatrix} \hat{y}_0^i(t+1|t+1) \\ \hat{\lambda}^i(t+1|t+1) \end{bmatrix} = \begin{bmatrix} \hat{y}_0^i(t+1|t) \\ \hat{\lambda}^i(t+1|t) \end{bmatrix} + L^i(t+1)\Big(y^i(t) - \pi\Big(\exp\big(\widehat{\Omega}(t)\big)\big[\exp\big(\widehat{\Omega}(\tau)\big)\big]^{-1}\big(\hat{y}_0^i(t)\hat{\lambda}^i(t) - T(\tau)\big) + T(t)\Big)\Big). \qquad (13.54)$$
Tuning
The variance $\Sigma_w(t)$ is a design parameter. We choose it to be block diagonal, with the blocks corresponding to $T(t)$ and $\Omega(t)$ equal to zero (a deterministic integrator). We choose the remaining parameters using standard statistical tests, such as the cumulative periodogram of Bartlett [?]. The idea is that the parameters in $\Sigma_w$ are changed until the innovation process $\epsilon(t) \doteq y(t) - h(\hat{x}(t))$ is as close as possible to being white. The periodogram is one of many ways to test the whiteness of a stochastic process. In practice, we choose the blocks corresponding to $y_0^i$ equal to the variance of the measurements, and the elements corresponding to $\lambda^i$ all equal to a common value. We then choose the blocks corresponding to $v$ and $\omega$ to be diagonal with a common element each, and change the one for $v$ relative to the one for $\omega$ depending on whether we want to allow for more or less regular motions. We then change both, relative to the variance of the measurement noise, depending on the level of desired smoothness in the estimates.
Tuning nonlinear filters is an art, and this is not the proper venue to discuss the issue. Suffice it to say that we have performed the procedure only once and for all. We then keep the same tuning parameters no matter what the motion, structure and noise in the measurements are.
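As a sketch of the whiteness check on which this tuning is based (not the book's code; the acceptance bands of Bartlett's test are omitted and all names are illustrative), the following Python snippet computes the normalized cumulative periodogram of an innovation sequence and its maximum deviation from the straight line expected for a white process.

```python
import numpy as np

def cumulative_periodogram(eps):
    """Normalized cumulative periodogram of a scalar innovation sequence.
    For a white sequence it stays close to a straight line from 0 to 1."""
    eps = np.asarray(eps, dtype=float) - np.mean(eps)
    spec = np.abs(np.fft.rfft(eps))[1:] ** 2        # drop the DC term
    return np.cumsum(spec) / np.sum(spec)

rng = np.random.default_rng(4)
white = rng.standard_normal(512)
colored = np.convolve(white, np.ones(8) / 8.0, mode="same")   # low-pass filtered, not white
for name, e in [("white", white), ("colored", colored)]:
    C = cumulative_periodogram(e)
    line = np.linspace(0.0, 1.0, C.size)
    print(name, "max deviation from the straight line:", np.max(np.abs(C - line)))
```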
Chapter 14
Step-by-step building of a 3-D model
from images
Chapter 15
Extensions, applications and further
research directions
Part V
Appendices
Appendix A
Basic facts from linear algebra

Since in computer vision we mostly deal with real vector spaces, we here introduce some important facts about linear algebra for real vector spaces.¹ $\mathbb{R}^n$ is a natural model for an $n$-dimensional real vector space. If $\mathbb{R}^n$ is considered as a linear space, its elements, the so-called vectors, are closed under two basic operations: scalar multiplication and vector summation. That is, given any two vectors $x, y \in \mathbb{R}^n$ and any two real scalars $\alpha, \beta \in \mathbb{R}$, we may obtain a new vector $z = \alpha x + \beta y$ in $\mathbb{R}^n$. Linear algebra mainly studies the properties of the so-called linear transformations among different real vector spaces. Since such transformations can typically be represented as matrices, linear algebra to a large extent studies the properties of matrices.
A map $L : \mathbb{R}^n \to \mathbb{R}^m$ is called linear if
$$L(x + y) = L(x) + L(y), \quad \forall\, x, y \in \mathbb{R}^n, \qquad L(\alpha x) = \alpha L(x), \quad \forall\, x \in \mathbb{R}^n,\ \alpha \in \mathbb{R}.$$
Clearly, with respect to the standard bases of $\mathbb{R}^n$ and $\mathbb{R}^m$, the map $L$ can be represented by a matrix $A \in \mathbb{R}^{m\times n}$ such that
$$L(x) = Ax, \qquad \forall\, x \in \mathbb{R}^n. \qquad (A.1)$$
The set of all (real) $m \times n$ matrices is denoted by $\mathcal{M}(m, n)$. If viewed as a linear space, $\mathcal{M}(m, n)$ can be identified with the space $\mathbb{R}^{mn}$. By abuse of language, we sometimes simply refer to a linear map $L$ by its representation (matrix) $A$.
If $n = m$, the set $\mathcal{M}(n, n) \doteq \mathcal{M}(n)$ forms an algebraic structure called a ring (over the field $\mathbb{R}$). That is, matrices in $\mathcal{M}(n)$ are closed under matrix multiplication and summation: if $A, B$ are two $n \times n$ matrices, so are $C = AB$ and $D = A + B$. If we consider the ring of all $n \times n$ matrices, its group of units $GL(n)$, which consists of all $n \times n$ invertible matrices and is called the general linear group, can be identified with the set of invertible linear maps from $\mathbb{R}^n$ to $\mathbb{R}^n$:
$$\{\,L : \mathbb{R}^n \to \mathbb{R}^n;\ x \mapsto L(x) = Ax \mid A \in GL(n)\,\}. \qquad (A.2)$$
Now let us further consider $\mathbb{R}^n$ with its standard inner product structure. That is, given two vectors $x, y \in \mathbb{R}^n$, we define their inner product to be $\langle x, y\rangle = x^T y$. We say that a linear transformation $A$ (from $\mathbb{R}^n$ to itself) is orthogonal if it preserves this inner product:
$$\langle Ax, Ay\rangle = \langle x, y\rangle, \qquad \forall\, x, y \in \mathbb{R}^n. \qquad (A.3)$$
Proof. Contrary to the convention used elsewhere, within this proof all vectors stand for row vectors. That is, if $v$ is an $n$-dimensional row vector, it is of the form $v = [v_1, v_2, \dots, v_n]$ and carries no transpose. Denote the $i$th row vector of the given matrix $A$ by $a_i$, $i = 1,\dots,n$. The proof consists in constructing $Q$ and $R$ iteratively from the row vectors $a_i$:
$$\begin{aligned} q_1 &\doteq a_1, & r_1 &\doteq q_1/\|q_1\| \\ q_2 &\doteq a_2 - \langle a_2, r_1\rangle r_1, & r_2 &\doteq q_2/\|q_2\| \\ &\ \ \vdots & &\ \ \vdots \\ q_n &\doteq a_n - \textstyle\sum_{i=1}^{n-1}\langle a_n, r_i\rangle r_i, & r_n &\doteq q_n/\|q_n\|. \end{aligned}$$
Then $R = [r_1^T\ \dots\ r_n^T]^T$ and the matrix $Q$ is obtained as
$$Q = \begin{bmatrix} \|q_1\| & 0 & \cdots & 0 \\ \langle a_2, r_1\rangle & \|q_2\| & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ \langle a_n, r_1\rangle & \cdots & \langle a_n, r_{n-1}\rangle & \|q_n\| \end{bmatrix}.$$
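The construction in this proof is directly implementable. The following short Python function (illustrative only, assuming a square nonsingular matrix and no pivoting) applies Gram-Schmidt to the rows of A and returns the lower-triangular Q and the matrix R with orthonormal rows, so that A = QR as above.

```python
import numpy as np

def row_gram_schmidt(A):
    """Gram-Schmidt on the rows of A: returns (Q, R) with R having orthonormal
    rows and Q lower triangular, such that A = Q @ R."""
    n = A.shape[0]
    R = np.zeros((n, A.shape[1]))
    Q = np.zeros((n, n))
    for i in range(n):
        q = A[i].astype(float).copy()
        for j in range(i):
            Q[i, j] = A[i] @ R[j]        # <a_i, r_j>
            q -= Q[i, j] * R[j]
        Q[i, i] = np.linalg.norm(q)      # ||q_i||
        R[i] = q / Q[i, i]
    return Q, R

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
Q, R = row_gram_schmidt(A)
print(np.allclose(A, Q @ R), np.allclose(R @ R.T, np.eye(3)))   # True True
```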
The induced 2-norm of a matrix $A$ is defined as $\|A\|_2 \doteq \max_{\|x\|_2 = 1}\|Ax\|_2$.

Remark A.3.
- Similarly, other induced operator norms on $A$ can be defined, starting from different norms on the domain and co-domain spaces on which $A$ operates.
- Let $A$ be as above; then $A^T A$ is clearly symmetric and positive semi-definite, so it can be diagonalized by an orthogonal matrix $V$. The eigenvalues, being non-negative, can be written as $\sigma_i^2$. By ordering the columns of $V$ so that the eigenvalue matrix has decreasing eigenvalues on the diagonal, we see, from point (e) of the previous theorem, that $A^T A = V\,\mathrm{diag}(\sigma_1^2,\dots,\sigma_n^2)\,V^T$ and $\|A\|_2 = \sigma_1$.

We also have the decomposition $\mathbb{R}^m = \mathrm{Ra}(A) \oplus \mathrm{Ra}(A)^\perp$, together with:
b) $\mathrm{Ra}(A)^\perp = \mathrm{Nu}(A^T)$;
c) $\mathrm{Nu}(A^T) = \mathrm{Nu}(AA^T)$;
d) $\mathrm{Ra}(A) = \mathrm{Ra}(AA^T)$.

Proof. To prove $\mathrm{Nu}(AA^T) = \mathrm{Nu}(A^T)$, we have:
- $AA^T x = 0 \Rightarrow \langle x, AA^T x\rangle = \|A^T x\|^2 = 0 \Rightarrow A^T x = 0$, hence $\mathrm{Nu}(AA^T) \subseteq \mathrm{Nu}(A^T)$;
- $A^T x = 0 \Rightarrow AA^T x = 0$, hence $\mathrm{Nu}(A^T) \subseteq \mathrm{Nu}(AA^T)$.
or equivalently,
$$A = \sum_{i=1}^{r} \sigma_i u_i v_i^T. \qquad (A.6)$$
Then $U_1^T U_1 = I_{r\times r}$. Since $A^T A$ and $AA^T$ both have exactly $r$ nonzero eigenvalues, it follows that the columns of $U_1$ form an orthonormal basis for $\mathrm{Ra}(AA^T)$ and $\mathrm{Ra}(A)$. Thus the properties of $U_1$ listed in 2 hold, and we obtain $A = U\Sigma V^T$.
After we have gone through all the trouble of proving this theorem, you should know that the SVD has become a numerical routine available in many computational software packages such as MATLAB. Within MATLAB, to compute the SVD of a given $m \times n$ matrix $A$, simply use the command $[U, S, V] = \mathrm{svd}(A)$, which returns matrices $U, S, V$ satisfying $A = USV^T$ (where $S$ represents $\Sigma$ as defined above).
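The same computation is available in Python/NumPy (this snippet is not in the original text, which uses MATLAB); note that NumPy returns $V^T$ rather than $V$, and the singular values as a vector.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=True)   # A = U @ S @ Vt
S = np.zeros(A.shape)
S[:len(s), :len(s)] = np.diag(s)
print(np.allclose(A, U @ S @ Vt))                 # True
print(np.isclose(s[0], np.linalg.norm(A, 2)))     # the induced 2-norm equals sigma_1
```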
Let $x \in S^{n-1}$ (i.e. $\|x\|_2 = 1$) and let $y = Ax$. Using the SVD,
$$y = Ax = \sum_{i=1}^{n} \sigma_i u_i\langle v_i, x\rangle = \sum_{i=1}^{n} y_i u_i, \qquad y_i \doteq \sigma_i\langle v_i, x\rangle,$$
hence $\langle v_i, x\rangle = y_i/\sigma_i$. Now $\|x\|_2^2 = \sum_{i=1}^{n}\langle v_i, x\rangle^2 = 1$, from which we conclude $\sum_{i=1}^{n} y_i^2/\sigma_i^2 = 1$, which represents the equation of an ellipsoid with half-axes of length $\sigma_i$.
The pseudo-inverse of $A$ can be written in terms of the SVD as $A^\dagger = V\Sigma^\dagger U^T$, where
$$\Sigma^\dagger = \mathrm{diag}\big(\sigma_1^{-1},\dots,\sigma_r^{-1},0,\dots,0\big).$$
Finally, for an invertible matrix $A$, the quantity $k(A) \doteq \|A\|\,\|A^{-1}\| = \sigma_1(A)/\sigma_n(A)$ is called the condition number of $A$; it measures how much a relative perturbation in the data of a linear system $Ax = b$ can be amplified in the solution.
Appendix B
Least-square estimation and filtering

Consider the problem of estimating an unknown random vector $X \in \mathbb{R}^n$ from measurements of a related random vector $Y \in \mathbb{R}^m$, by means of an estimator $\hat{X} = T(Y)$ chosen to minimize a cost:
$$\hat{T} = \arg\min_{T \in \mathcal{T}} C(T), \qquad (B.1)$$
where $\mathcal{T}$ is a suitably chosen class of functions and $C(\cdot)$ some cost in the $X$-space.
We concentrate on one of the simplest possible choices, which corresponds to minimum-variance affine estimators:
$$\mathcal{T} \doteq \{\,T \mid \exists\, A \in \mathbb{R}^{n\times m},\ b \in \mathbb{R}^n : T(Y) = AY + b\,\}, \qquad C(T) \doteq E\|X - T(Y)\|^2, \qquad (B.2)\text{--}(B.3)$$
where the latter operator takes the expectation of the squared Euclidean norm of the random vector $X - T(Y)$. Therefore, we seek
$$(\hat{A}, \hat{b}) \doteq \arg\min_{A, b}\ E\|X - (AY + b)\|^2. \qquad (B.4)$$
We call $\mu_X \doteq E[X]$ and $\Sigma_X \doteq E[XX^T]$, and similarly for $Y$. First notice that if $\mu_X = \mu_Y = 0$, then $b = 0$. Therefore, consider the centered vectors $\tilde{X} \doteq X - \mu_X$ and $\tilde{Y} \doteq Y - \mu_Y$ and the reduced problem
$$\hat{A} \doteq \arg\min_A\ E\|\tilde{X} - A\tilde{Y}\|^2. \qquad (B.5)$$
Indeed,
$$E\|X - (AY + b)\|^2 = E\|A\tilde{Y} - \tilde{X} + (A\mu_Y + b - \mu_X)\|^2 = E\|\tilde{X} - A\tilde{Y}\|^2 + \|A\mu_Y + b - \mu_X\|^2. \qquad (B.6)$$
Hence, if we assume for a moment that we have found $\hat{A}$ that solves the problem (B.5), then trivially
$$\hat{b} = \mu_X - \hat{A}\mu_Y. \qquad (B.7)$$
Let $H(Y) \doteq \mathrm{span}\{Y_1,\dots,Y_m\}$, where the span is intended over the reals. We say that the subspace $H(Y)$ is full rank if $\Sigma_Y = E[YY^T] > 0$.
The structure of a Hilbert space allows us to make use of the concept of orthogonal projection of a random variable onto the span of a random vector:
$$\hat{Z} = \mathrm{pr}_{H(Y)}(X) \iff \langle X - \hat{Z},\ Z\rangle_H = 0 \quad \forall\, Z \in H(Y) \iff \langle X - \hat{Z},\ Y_i\rangle_H = 0, \quad i = 1,\dots,m, \qquad (B.10)\text{--}(B.12)$$
and we write $\hat{Z} \doteq \hat{E}[X|Y] \doteq \hat{X}(Y)$.
Here the norm in $H$ is $\|\cdot\|_H^2 \doteq E\|\cdot\|^2$. Imposing that the estimation error $\tilde{X} - A\tilde{Y}$ be orthogonal to the components of $\tilde{Y}$ yields
$$E[\tilde{X}\tilde{Y}^T] = A\,E[\tilde{Y}\tilde{Y}^T] \iff \Sigma_{\tilde{X}\tilde{Y}} = A\,\Sigma_{\tilde{Y}}, \qquad (B.16)$$
which, provided that $H(Y)$ is full rank, gives $\hat{A} = \Sigma_{\tilde{X}\tilde{Y}}\Sigma_{\tilde{Y}}^{-1}$. (B.17)

2. The resemblance with a conditional expectation is due to the fact that, in the presence of Gaussian random vectors, such a projection is indeed the conditional expectation.

When the means are nonzero, the same computation applied to the centered vectors gives
$$\hat{Z} = \mu_X + \Sigma_{\tilde{X}\tilde{Y}}\Sigma_{\tilde{Y}}^{-1}(Y - \mu_Y), \qquad (B.19)$$
which is the least-variance affine estimator
$$\hat{Z} = \hat{E}[X|Y] = \hat{A}Y + \hat{b}, \qquad (B.20)$$
where
$$\hat{A} = \Sigma_{\tilde{X}\tilde{Y}}\Sigma_{\tilde{Y}}^{-1}, \qquad \hat{b} = \mu_X - \Sigma_{\tilde{X}\tilde{Y}}\Sigma_{\tilde{Y}}^{-1}\mu_Y. \qquad (B.21)$$
It is an easy exercise to compute the variance of the estimation error $\tilde{X} \doteq X - \hat{Z}$:
$$\Sigma_{\tilde{X}} = \Sigma_X - \Sigma_{\tilde{X}\tilde{Y}}\Sigma_{\tilde{Y}}^{-1}\Sigma_{\tilde{Y}\tilde{X}}. \qquad (B.23)$$
If we interpret the variance of $X$ as the prior uncertainty, and the variance of $\tilde{X}$ as the posterior uncertainty, we may interpret the second term (which is positive semi-definite) of the above equation as a decrease of the uncertainty.
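The following Python sketch (illustrative only; the joint distribution is synthetic) estimates (B.21) and (B.23) empirically from samples: it computes the least-variance affine estimator of X given Y and checks that the empirical error covariance matches the predicted posterior covariance up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 20000
# Jointly distributed X (2-dim) and Y (3-dim): Y depends linearly on X plus noise.
X = rng.standard_normal((N, 2)) @ np.array([[1.0, 0.5], [0.0, 2.0]]) + np.array([1.0, -2.0])
Cyx = np.array([[1.0, 0.0], [0.5, 1.0], [0.2, -0.3]])
Y = X @ Cyx.T + 0.3 * rng.standard_normal((N, 3))

mu_X, mu_Y = X.mean(0), Y.mean(0)
Xc, Yc = X - mu_X, Y - mu_Y
Sigma_Y = Yc.T @ Yc / N
Sigma_XY = Xc.T @ Yc / N

A_hat = Sigma_XY @ np.linalg.inv(Sigma_Y)          # (B.21)
b_hat = mu_X - A_hat @ mu_Y
Z = Y @ A_hat.T + b_hat                            # least-variance affine estimate of X

err = X - Z
Sigma_err = err.T @ err / N
Sigma_post = (Xc.T @ Xc / N) - Sigma_XY @ np.linalg.inv(Sigma_Y) @ Sigma_XY.T   # (B.23)
print(np.allclose(Sigma_err, Sigma_post, atol=0.05))    # True, up to sampling error
```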
Suppose now that the measurement vector is partitioned into two blocks, $Y = [Y_1^T, Y_2^T]^T$, and that we would like to write
$$\hat{E}[X|Y] = \hat{E}[X|Y_1] + \hat{E}[X|Y_2]. \qquad (B.27)$$
After an easy calculation one can see that the above is true iff
$$E[Y_1 Y_2^T] = 0, \qquad (B.28)$$
which is to say when
$$H(Y_1) \perp H(Y_2). \qquad (B.29)$$
Change of basis
Suppose that instead of measuring the instances of a random vector $Y$ we measure another random vector $Z$, which is related to $Y$ via a change of basis: $Z = TY$, $T \in GL(m)$. If we call $\hat{E}[X|Y] = \hat{A}Y$, then it is immediate to see that
$$\hat{E}[X|Z] = \Sigma_{XZ}\Sigma_Z^{-1}Z = \Sigma_{XY}T^T\big(T\Sigma_Y T^T\big)^{-1}Z = \Sigma_{XY}\Sigma_Y^{-1}T^{-1}Z. \qquad (B.30)$$
Innovations
The linear least-variance estimator involves the computation of the inverse of the output covariance matrix $\Sigma_Y$. It may be interesting to look for changes of basis $T$ that transform the output $Y$ into $Z = TY$ such that $\Sigma_Z = I$. In such a case the optimal estimator is simply
$$\hat{E}[X|Z] = \Sigma_{XZ}Z. \qquad (B.31)$$
Let us pretend for a moment that the components of the vector $Y$ are samples of a process taken over time, $Y_i = y(i)$, and call $y^t = [Y_1,\dots,Y_t]^T$ the history of the process up to time $t$. Each component (sample) is an element of the Hilbert space $H$, which has a well-defined notion of orthogonality, and where we can apply the Gram-Schmidt procedure in order to make the vectors $y(i)$ orthogonal (uncorrelated):
$$\begin{aligned} v_1 &\doteq y(1), & e_1 &\doteq v_1/\|v_1\| \\ v_2 &\doteq y(2) - \langle y(2), e_1\rangle e_1, & e_2 &\doteq v_2/\|v_2\| \\ &\ \ \vdots & &\ \ \vdots \\ v_t &\doteq y(t) - \textstyle\sum_{i=1}^{t-1}\langle y(t), e_i\rangle e_i, & e_t &\doteq v_t/\|v_t\|. \end{aligned}$$
The process $\{e\}$, whose instances up to time $t$ are collected into the vector $e^t = [e_1,\dots,e_t]^T$, has a number of important properties:
in particular, the innovation at time $t$ is the one-step prediction error,
$$v_t = y(t) - \hat{E}\big[y(t)\,\big|\,y^{t-1}\big]. \qquad (B.34)$$
Consider now a causal linear estimator of a process $\{x\}$ from the measurements $\{y\}$,
$$\hat{x}(t|t) = \sum_{k=-\infty}^{t} h(t, k)\,y(k). \qquad (B.38)$$
The design of the least-variance estimator involves finding the kernel $\hat{h}$ such that the estimation error $\tilde{x}(t) \doteq x(t) - \hat{x}(t|t)$ has minimum variance. This is found, as in the previous sections for the static case, by imposing that the estimation error be orthogonal to the history of the process $\{y\}$ up to time $t$:
$$\langle x(t) - \hat{x}(t|t),\ y(s)\rangle_H = 0, \quad \forall\, s \le t \qquad \Longleftrightarrow \qquad E[x(t)y^T(s)] - \sum_{k=-\infty}^{t} h(t, k)\,E[y(k)y^T(s)] = 0, \quad \forall\, s \le t, \qquad (B.39)$$
which is equivalent to
$$\Sigma_{xy}(t-s) = \sum_{k=-\infty}^{t} h(t, k)\,\Sigma_y(k-s). \qquad (B.40)$$
Assuming that the processes are (jointly) stationary and the kernel time-invariant, $h(t, k) = h(t-k)$, this becomes the convolution equation
$$\Sigma_{xy}(t) = \sum_{s=0}^{\infty} h(s)\,\Sigma_y(t-s), \qquad t \ge 0. \qquad (B.41)$$
In the domain of the $z$-transform, the above convolution equation becomes
$$S_{xy}(z) = H(z)S_y(z), \qquad (B.42)$$
so that, if causality is not enforced, the optimal kernel is simply
$$\hat{H}(z) = S_{xy}(z)S_y^{-1}(z). \qquad (B.43)$$
To obtain a causal estimator ($h(t) = 0$ for $t < 0$), one factorizes the output spectral density as $S_y(z) = L(z)L^T(z^{-1})$, with $L$ causal and causally invertible, and whitens the measurements through the innovation $e = L^{-1}(z)\,y$. Since $S_{xe}(z) = S_{xy}(z)L(z^{-1})^{-1}$, the final expression of our linear, least-variance estimator is (in the $z$-domain) $\hat{x} = \hat{H}(z)y(z)$, where the kernel $\hat{H}$ is given by
$$\hat{H}(z) = \big[S_{xy}(z)L^{-1}(z^{-1})\big]_+\, L^{-1}(z). \qquad (B.48)$$
The corresponding filter is known as the Wiener filter. Again, we can recover the meaning of the innovation as the one-step prediction error for the measurements: in fact, the best prediction of the process $\{y\}$, indicated with $\hat{y}(t|t-1)$, is defined as the projection of $y(t)$ onto the span of $\{y\}$ up to $t-1$, indicated with $H_{t-1}(y)$. Such a projection is therefore defined so that
$$y(t) = \hat{y}(t|t-1) + e(t). \qquad (B.49)$$
For a linear state-space model, the state evolves according to
$$x(t) = \Phi(t, t_0)\,x(t_0) + \sum_{k=t_0}^{t-1}\Phi(t, k+1)\,v(k), \qquad (B.51)$$
where $\Phi$ denotes a fundamental set of solutions, which is the flow of the difference equation
$$\Phi(t+1, s) = A(t)\Phi(t, s), \qquad \Phi(t, t) = I. \qquad (B.52)$$
In the case of a time-invariant system, $A(t) = A$ for all $t$, and $\Phi(t, s) = A^{t-s}$.
For such a model the state is a Markov process, and the best predictor of the state given its own history satisfies
$$\hat{E}\big[x(t)\,\big|\,H_s(x)\big] = \hat{E}\big[x(t)\,\big|\,x(s)\big] = \Phi(t, s)\,x(s), \qquad t \ge s. \qquad (B.54)$$
The conditions for stationarity impose that $\mu_x(t) = \text{const}$ and $\Sigma_x(t) = \text{const}$. It is easy to prove the following

Theorem B.4. Let $A$ be stable (have all eigenvalues inside the unit circle of the complex plane); then $\Sigma_x(t - t_0) \to \bar{\Sigma}$, where $\bar{\Sigma} = \sum_{k=0}^{\infty} A^k BB^T A^{T\,k}$ is the unique equilibrium solution of the above Lyapunov equation, and $\{x\}$ asymptotically describes a stationary process. If $x_0$ is such that $\Sigma_x(t_0) = \bar{\Sigma}$, then the process is stationary for all $t \ge t_0$.

Remark B.4. The condition of stability for $A$ is sufficient, but not necessary, for generating a stationary process. If, however, the pair $(A, B)$ is completely controllable, so that the noise input affects all of the components of the state, then such a stability condition becomes also necessary.
The filter we derive has the property of having the least error variance. In order to derive the expression for the filter, we write the LFDSP as follows:
$$\begin{cases} x(t+1) = Ax(t) + v(t), & x(t_0) = x_0 \\ y(t) = Cx(t) + w(t), \end{cases} \qquad (B.58)$$
where we have neglected the time argument in the matrices $A(t)$ and $C(t)$ (all considerations can be carried through for time-varying systems as well). $v(t) = Bn(t)$ is a white, zero-mean Gaussian noise with variance $Q$; $w(t) = Dn(t)$, also a white, zero-mean noise, has variance $R$, so that we could write
$$v(t) = \sqrt{Q}\,n(t), \qquad w(t) = \sqrt{R}\,n(t),$$
where $n$ is a unit-variance noise. In general $v$ and $w$ will be correlated, and in particular we will call
$$S(t) = E\big[v(t)w^T(t)\big]. \qquad (B.59)$$
The first step is to modify the above model so that the model error $v$ is uncorrelated from the measurement error $w$.

Uncorrelating the model from the measurements
In order to uncorrelate the model error from the measurement error we can just substitute $v$ with the complement of its projection onto the span of $w$. Let us call
$$\hat{v}(t) \doteq \hat{E}\big[v(t)\,\big|\,w(t)\big] = SR^{-1}w(t); \qquad (B.61)$$
the last equivalence is due to the fact that $w$ is a white noise. We can now use the results from section B.1 to conclude that
$$\tilde{v}(t) = v(t) - SR^{-1}w(t) \qquad (B.62)$$
is uncorrelated from $w$. Substituting $v(t) = \tilde{v}(t) + SR^{-1}\big(y(t) - Cx(t)\big)$ into the model, we obtain
$$x(t+1) = Fx(t) + SR^{-1}y(t) + \tilde{v}(t), \qquad (B.63)$$
where $F = A - SR^{-1}C$. The model error $\tilde{v}$ in the above model is uncorrelated from the measurement noise $w$, and the cost is that we had to add an output-injection term $SR^{-1}y(t)$.
Prediction step
Suppose at some point in time we are given a current estimate of the state $\hat{x}(t|t)$ and a corresponding estimate of the variance of the estimation error $P(t|t) = E[\tilde{x}(t)\tilde{x}(t)^T]$, where $\tilde{x} = x - \hat{x}$. At the initial time $t_0$ we can take $\hat{x}(t_0|t_0) = x_0$ with some bona fide variance matrix. Then it is immediate to compute
$$\hat{x}(t+1|t) = F\hat{x}(t|t) + SR^{-1}y(t) + \hat{E}\big[\tilde{v}(t)\,\big|\,H_t(y)\big], \qquad (B.65)$$
where the last term is zero, since $\tilde{v}(t) \perp x(s)$ for all $s \le t$ and $\tilde{v}(t) \perp w(s)$, and therefore $\tilde{v}(t) \perp y(s)$ for all $s \le t$. The estimation error is therefore
$$\tilde{x}(t+1|t) = F\tilde{x}(t|t) + \tilde{v}(t), \qquad (B.66)$$
$$P(t+1|t) = FP(t|t)F^T + Q. \qquad (B.67)$$
Update step
Once a new measurement is acquired, we can update our prediction so as to take the new measurement into account. The update is defined as $\hat{x}(t+1|t+1) \doteq \hat{E}[x(t+1)\,|\,H_{t+1}(y)]$. Now, as we have seen in section B.1.4, we can decompose the span of the measurements into the orthogonal sum
$$H_{t+1}(y) = H_t(y) \oplus \{e(t+1)\}, \qquad (B.68)$$
so that
$$\hat{x}(t+1|t+1) = \hat{E}\big[x(t+1)\,\big|\,H_t(y)\big] + \hat{E}\big[x(t+1)\,\big|\,e(t+1)\big], \qquad (B.69)$$
where the last term can be computed using the results from section B.1:
$$\hat{x}(t+1|t+1) = \hat{x}(t+1|t) + L(t+1)e(t+1), \qquad (B.70)$$
where $L(t+1) \doteq \Sigma_{xe}(t+1)\Sigma_e^{-1}(t+1)$ is called the Kalman gain. Substituting the expression for the innovation, we have
$$\hat{x}(t+1|t+1) = \hat{x}(t+1|t) + L(t+1)\big(y(t+1) - C\hat{x}(t+1|t)\big), \qquad (B.71)$$
from which we see that the update consists in a linear correction weighted by the Kalman gain.

Computation of the gain
In order to compute the gain $L(t+1) = \Sigma_{xe}(t+1)\Sigma_e^{-1}(t+1)$, we derive an alternative expression for the innovation:
$$e(t+1) = y(t+1) - Cx(t+1) + Cx(t+1) - C\hat{x}(t+1|t) = w(t+1) + C\tilde{x}(t+1|t). \qquad (B.72)$$
From this expression, the innovation variance and the cross-covariance are
$$\Lambda(t+1) \doteq \Sigma_e(t+1) = CP(t+1|t)C^T + R, \qquad \Sigma_{xe}(t+1) = P(t+1|t)C^T, \qquad (B.73)\text{--}(B.74)$$
and therefore
$$L(t+1) = P(t+1|t)C^T\big(CP(t+1|t)C^T + R\big)^{-1}. \qquad (B.75)$$

Variance update
From the update of the estimation error
$$\tilde{x}(t+1|t+1) = \tilde{x}(t+1|t) - L(t+1)e(t+1), \qquad (B.76)$$
we can easily compute the update for the variance. We first observe that $\tilde{x}(t+1|t+1)$ is by definition orthogonal to $H_{t+1}(y)$, while the correction term $L(t+1)e(t+1)$ is contained in the history of the innovation, which is by construction equal to the history of the process $y$, i.e. $H_{t+1}(y)$. Then it is immediate to see that
$$P(t+1|t+1) = P(t+1|t) - L(t+1)\Lambda(t+1)L^T(t+1). \qquad (B.77)$$
In the one-step-predictor form, the gain can equivalently be written as
$$FL(t) + SR^{-1} = \big(AP(t|t-1)C^T + S\big)\Lambda^{-1}(t). \qquad (B.81)\text{--}(B.82)$$
Appendix C
Basic facts from optimization
References
[Ast96]    K. Åström. Invariancy Methods for Points, Curves and Surfaces in Computational Vision. PhD thesis, Department of Mathematics, Lund University, 1996.
[BA83]
[BCB97]    M. J. Brooks, W. Chojnacki, and L. Baumela. Determining the ego-motion of an uncalibrated camera from instantaneous optical flow. In press, 1997.
[Bou98]
[CD91]
[Fau93]
[GH86]
[Har98]
[HF89]     T. Huang and O. Faugeras. Some properties of the E matrix in two-view motion estimation. IEEE PAMI, 11(12):1310-1312, 1989.
[HS88]
[HZ00]
[Jaz70]
[Kan93]
[LF97]     Q.-T. Luong and O. Faugeras. Self-calibration of a moving camera from point correspondences and fundamental matrices. IJCV, 22(3):261-289, 1997.
[LH81]
[LK81]
[May93]
[McL99]
[MF92]     S. Maybank and O. Faugeras. A theory of self-calibration of a moving camera. International Journal of Computer Vision, 8(2):123-151, 1992.
[MKS00]    Y. Ma, J. Košecká, and S. Sastry. Linear differential algorithm for motion recovery: A geometric approach. International Journal of Computer Vision, 36(1):71-89, 2000.
[MLS94]
[MS98]     Y. Ma and S. Sastry. Vision theory in spaces of constant curvature. Electronic Research Laboratory Memorandum, UC Berkeley, UCB/ERL(M98/36), June 1998.
[PG98]     J. Ponce and Y. Genc. Epipolar geometry and linear subspace methods: a new approach to weak calibration. International Journal of Computer Vision, 28(3):223-243, 1998.
[PG99]     M. Pollefeys and L. Van Gool. Stratified self-calibration with the modulus constraint. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(8):707-724, 1999.
[SA90]
[Stu97]
[TK92]
[Tri97]
[TTH96]
[VF95]
[VMHS01]   R. Vidal, Y. Ma, S. Hsu, and S. Sastry. Optimal motion estimation from multiview normalized epipolar constraint. In ICCV'01, Vancouver, Canada, 2001.
[WHA93]
[ZF96]     C. Zeller and O. Faugeras. Camera self-calibration from video sequences: the Kruppa equations revisited. Research Report 2793, INRIA, France, 1996.
[ZH84]
[Zha98]
Glossary of notations
(R, T)
(ω, v)
E ≐ T̂R
L ⊂ E³
X̄ ≐ [X, Y, Z, W]ᵀ ∈ R⁴
Xᵢ
Xᵢⱼ
l ≐ [a, b, c]ᵀ
x ≐ [x, y, z]ᵀ ∈ R³
xᵢ
xᵢⱼ
g ∈ SE(3)
p ∈ E³
Index
Cross product, 11
Euclidean metric, 10
Exercise
Adjoint transformation on twist, 35
Group structure of SO(3), 33
Properties of rotation matrices, 34
Range and null space, 34
Skew symmetric matrices, 34
Theorem
Rodrigues' formula for rotation matrix, 21
Surjectivity of the exponential map onto SE(3), 29
Surjectivity of the exponential map onto SO(3), 20
Group, 15
matrix representation, 15
rotation group, 17
special Euclidean group, 15
special orthogonal group, 17
Inner product, 10
Matrix
orthogonal matrix, 17
rotation matrix, 17
special orthogonal matrix, 17
Polarization identity, 14
Rigid body displacement, 13
Rigid body motion, 13