Stereo Calibration 2
Figure 1: Perspective imaging geometry showing relationship between 3D points and image plane points.
\[
\frac{X}{x} = \frac{Z}{f}; \qquad \frac{Y}{y} = \frac{Z}{f}; \qquad x = f\cdot\frac{X}{Z}; \qquad y = f\cdot\frac{Y}{Z}
\]
If f = 1, note that perspective projection is just a scaling of the world coordinates by their Z value. Also note that all 3D points along a line from the COP (center of projection) through a designated position (x, y) on the image plane will have the same image plane coordinates.
We can also describe perspective projection by the matrix equation:
\[
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
\equiv
\begin{bmatrix} s\cdot x \\ s\cdot y \\ s \end{bmatrix}
=
\begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\cdot
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
\]
where s is a scaling factor and [x, y, 1]T are the projected coordinates in the image plane.
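As a quick check of this equation, here is a minimal numpy sketch; the focal length and camera-frame point below are hypothetical values, not from the text:

```python
import numpy as np

# Hypothetical focal length and a camera-frame point (X, Y, Z) in meters.
f = 0.05                                      # e.g. a 50 mm lens
P_camera = np.array([0.2, 0.1, 2.0, 1.0])     # homogeneous 3-D point

# The 3x4 perspective projection matrix from the equation above.
T_persp = np.array([[f, 0, 0, 0],
                    [0, f, 0, 0],
                    [0, 0, 1, 0]], dtype=float)

sx, sy, s = T_persp @ P_camera                # [s*x, s*y, s], with s = Z
x, y = sx / s, sy / s                         # x = f*X/Z = 0.005, y = f*Y/Z = 0.0025
```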
We can generate image space coordinates from the projected camera space coordinates. These are the actual pixel values that you use in image processing. Pixel values (u, v) are derived by scaling the camera image plane coordinates in the x and y directions (for example, converting mm to pixels) and adding a translation to the origin of the image space plane. We call these scale factors Dx and Dy, and the translation to the origin of the image plane (u0, v0).
If the pixel coordinates of a projected point (x, y) are (u, v), then we can write:
\[
\frac{x}{D_x} = u - u_0; \qquad \frac{y}{D_y} = v - v_0
\]
\[
u = u_0 + \frac{x}{D_x}; \qquad v = v_0 + \frac{y}{D_y}
\]
where Dx, Dy are the physical dimensions of a pixel and (u0, v0) is the origin of the pixel coordinate system. The quantities x/Dx and y/Dy are simply numbers of pixels, and we center them at the pixel coordinate origin. We can also put this into matrix form as:
\[
\begin{bmatrix} s\cdot u \\ s\cdot v \\ s \end{bmatrix}
=
\begin{bmatrix} \frac{1}{D_x} & 0 & u_0 \\ 0 & \frac{1}{D_y} & v_0 \\ 0 & 0 & 1 \end{bmatrix}
\cdot
\begin{bmatrix} s\cdot x \\ s\cdot y \\ s \end{bmatrix}
\]
\[
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
\equiv
\begin{bmatrix} s\cdot u \\ s\cdot v \\ s \end{bmatrix}
=
\begin{bmatrix} \frac{1}{D_x} & 0 & u_0 \\ 0 & \frac{1}{D_y} & v_0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\cdot
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
\]
\[
P^{image} = T^{image}_{persp}\, T^{persp}_{camera}\, P^{camera}
\]
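Continuing the numpy sketch from above, the pixel-mapping matrix composes directly with the perspective matrix; the pixel size and image-center values below are hypothetical:

```python
# Hypothetical pixel dimensions (meters per pixel) and pixel-coordinate origin.
Dx, Dy = 1e-5, 1e-5
u0, v0 = 320.0, 240.0

T_pixel = np.array([[1 / Dx, 0,      u0],
                    [0,      1 / Dy, v0],
                    [0,      0,      1]])

# T_persp and P_camera are from the previous sketch.
su, sv, s = T_pixel @ T_persp @ P_camera
u, v = su / s, sv / s                         # pixel coordinates: u = u0 + x/Dx, v = v0 + y/Dy
```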
In the above, we assumed that the point to be imaged was in the camera coordinate system. If the
point is in a previously defined world coordinate system, then we also have to add in a standard 4x4
transform to express the world coordinate point in camera coordinates:
\[
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
\equiv
\begin{bmatrix} s\cdot u \\ s\cdot v \\ s \end{bmatrix}
=
\begin{bmatrix} \frac{1}{D_x} & 0 & u_0 \\ 0 & \frac{1}{D_y} & v_0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\cdot
\begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} {}^{w}X \\ {}^{w}Y \\ {}^{w}Z \\ 1 \end{bmatrix}
\]
\[
P^{image} = T^{image}_{persp}\, T^{persp}_{camera}\, T^{camera}_{world}\, P^{world}
\]
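A sketch of the full chain, adding a hypothetical world-to-camera transform (the rotation and translation below are arbitrary illustrative values):

```python
# Hypothetical extrinsics: rotate 90 degrees about the camera Z axis, then translate.
R = np.array([[0., -1., 0.],
              [1.,  0., 0.],
              [0.,  0., 1.]])
t = np.array([0.1, -0.2, 1.5])

T_extrinsic = np.eye(4)
T_extrinsic[:3, :3] = R
T_extrinsic[:3, 3] = t

P_world = np.array([1.0, 0.5, 3.0, 1.0])      # homogeneous world point
# T_pixel and T_persp are from the earlier sketches.
su, sv, s = T_pixel @ T_persp @ T_extrinsic @ P_world
u, v = su / s, sv / s                         # predicted pixel location of the world point
```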
Summing all this up, we can see that we need to find the following information to transform an
arbitrary 3D world point to a designated pixel in a computer image:
• 6 parameters that relate the 3D world point to the 3D camera coordinate system (standard 3
translation and 3 rotation): (R, T )
• Focal Length of the camera: f
• Scaling factors in the x and y directions on the image plane: (Dx , Dy )
• Translation to the origin of the image plane: (u0 , v0 ).
These are 11 parameters in all. We can break them down into Extrinsic parameters, which are the 6-DOF transform between the camera coordinate system and the world coordinate system, and Intrinsic parameters, which are unique to the actual camera being used and include the focal length, scaling factors, and location of the origin of the pixel coordinate system.
2 Camera Calibration
Camera calibration is used to find the mapping from 3D to 2D image space coordinates. There are 2
approaches:
• Method 1: Find both extrinsic and intrinsic parameters of the camera system. However, this can be difficult to do. The intrinsic parameters of the camera may be unknown (e.g., focal length, pixel dimension), and the 6-DOF transform also may be difficult to calculate directly.
• Method 2: An easier method is the “Lumped” transform. Rather than finding individual param-
eters, we find a composite matrix that relates 3D to 2D. Given the equation below:
\[
P^{image} = T^{image}_{persp}\, T^{persp}_{camera}\, T^{camera}_{world}\, P^{world}
\]
\[
P^{image} = C\, P^{world}
\]
\[
C = T^{image}_{persp}\, T^{persp}_{camera}\, T^{camera}_{world}
\]
• C is a single 3 × 4 transform that we can calculate empirically.
\[
\overbrace{C}^{3\times 4}\;
\underbrace{\overbrace{\begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}}^{4\times 1}}_{\text{3-D homo. vec}}
=
\underbrace{\overbrace{\begin{bmatrix} u' \\ v' \\ w \end{bmatrix}}^{3\times 1}}_{\text{2-D homo. vec}}
\equiv
\underbrace{\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}}_{\text{Pixels}}
\qquad \text{where } u' = w\,u,\; v' = w\,v
\]
1. If we know all the cij and x, y, z, we can find u′, v′. This means that if we know the calibration matrix C and a 3-D point, we can predict its image space coordinates.
2. If we know x, y, z, u′, v′, we can find the cij. Each 5-tuple gives 2 equations in the cij. This is the basis for empirically finding the calibration matrix C (more on this later).
3. If we know the cij and u′, v′, we have 2 equations in x, y, z. These are the equations of 2 planes in 3-D, and 2 planes intersect in a line. This is the line emanating from the center of projection of the camera, passing through the image pixel location (u′, v′), and containing the point (x, y, z).
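The third case can be made concrete: for a known C and a pixel location, each of the two equations defines a plane in (x, y, z), and the back-projected ray is their intersection. A minimal sketch, assuming c34 = 1 has been folded into C (the function and variable names are illustrative):

```python
import numpy as np

def back_project_ray(C, u, v):
    """Return a point on, and the direction of, the 3-D ray through pixel (u, v).

    Each row of M is one of the two plane equations from case 3:
    (c11 - u*c31)x + (c12 - u*c32)y + (c13 - u*c33)z = u*c34 - c14, and similarly for v.
    """
    M = np.array([C[0, :3] - u * C[2, :3],
                  C[1, :3] - v * C[2, :3]])           # 2x3: the two plane normals
    b = np.array([u * C[2, 3] - C[0, 3],
                  v * C[2, 3] - C[1, 3]])
    direction = np.cross(M[0], M[1])                  # the planes intersect in a line
    point, *_ = np.linalg.lstsq(M, b, rcond=None)     # one particular point on that line
    return point, direction
```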
• We can set up a linear system to solve for cij : AC = B
\[
\begin{bmatrix}
x_1 & y_1 & z_1 & 1 & 0 & 0 & 0 & 0 & -u_1' x_1 & -u_1' y_1 & -u_1' z_1 \\
0 & 0 & 0 & 0 & x_1 & y_1 & z_1 & 1 & -v_1' x_1 & -v_1' y_1 & -v_1' z_1 \\
x_2 & y_2 & z_2 & 1 & 0 & 0 & 0 & 0 & -u_2' x_2 & -u_2' y_2 & -u_2' z_2 \\
0 & 0 & 0 & 0 & x_2 & y_2 & z_2 & 1 & -v_2' x_2 & -v_2' y_2 & -v_2' z_2 \\
\vdots & & & & & & & & & & \vdots
\end{bmatrix}
\underbrace{\begin{bmatrix}
c_{11} \\ c_{12} \\ c_{13} \\ c_{14} \\ c_{21} \\ c_{22} \\ c_{23} \\ c_{24} \\ c_{31} \\ c_{32} \\ c_{33}
\end{bmatrix}}_{\text{We can assume } c_{34}=1}
=
\begin{bmatrix}
u_1' \\ v_1' \\ u_2' \\ v_2' \\ u_3' \\ v_3' \\ \vdots \\ u_N' \\ v_N'
\end{bmatrix}
\]
• Each set of points (x, y, z, u′, v′) yields 2 equations in 11 unknowns (the cij's).
• To solve for C, A needs to be invertible (square). We can overdetermine A and find a Least-
Squares fit for C by using a pseudo-inverse solution.
If A is N × 11, where N > 11,
\[
A C = B
\]
\[
A^T A C = A^T B
\]
\[
C = \underbrace{(A^T A)^{-1} A^T}_{\text{pseudo inverse}} B
\]
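A minimal sketch of this least-squares solution, assuming c34 = 1 and at least 6 correspondences between 3-D points and their measured pixel locations; numpy's lstsq plays the role of the pseudo-inverse (the function name is illustrative):

```python
import numpy as np

def calibrate(points_3d, pixels):
    """Estimate the 3x4 calibration matrix C from N point correspondences.

    points_3d: (N, 3) array of 3-D calibration points.
    pixels:    (N, 2) array of their measured (u, v) pixel coordinates.
    """
    A, B = [], []
    for (x, y, z), (u, v) in zip(points_3d, pixels):
        A.append([x, y, z, 1, 0, 0, 0, 0, -u * x, -u * y, -u * z])
        A.append([0, 0, 0, 0, x, y, z, 1, -v * x, -v * y, -v * z])
        B.extend([u, v])
    A, B = np.asarray(A, float), np.asarray(B, float)
    c, *_ = np.linalg.lstsq(A, B, rcond=None)         # least-squares fit of the 11 c_ij
    return np.append(c, 1.0).reshape(3, 4)            # append c34 = 1 and reshape to 3x4
```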
3 Computational Stereo
Stereopsis is a recognized human vision process. It is a passive, simple procedure that is robust to changes in lighting, scale, etc. Humans can fuse random dot stereograms that contain no high-level information about the objects in the fused images, yet they can still infer depth from these stereograms.
The procedure is:
• Camera-Modeling/Image-acquisition
• Feature extraction - identify edges, corners, regions etc.
• Matching/Correspondence - find same feature in both images
• Compute depth from matches - use calibration information to back project rays from each camera and intersect them (triangulation; see the sketch after this list)
• Interpolate surfaces - Matches are sparse, and constraints such as smoothness of surfaces are
needed to “fill in” the depth between match points.
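A minimal sketch of the triangulation step, assuming each camera has been calibrated to a 3x4 matrix as in the previous section (function and variable names are illustrative). It stacks the two back-projected plane equations from each camera and solves the resulting 4-equation, 3-unknown system in a least-squares sense, since the two rays rarely intersect exactly:

```python
import numpy as np

def triangulate(C1, C2, uv1, uv2):
    """Recover the 3-D point seen at pixel uv1 in camera 1 and pixel uv2 in camera 2."""
    rows, rhs = [], []
    for C, (u, v) in ((C1, uv1), (C2, uv2)):
        rows.append(C[0, :3] - u * C[2, :3])          # plane equation from the u measurement
        rows.append(C[1, :3] - v * C[2, :3])          # plane equation from the v measurement
        rhs.append(u * C[2, 3] - C[0, 3])
        rhs.append(v * C[2, 3] - C[1, 3])
    point, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return point                                      # estimated (x, y, z)
```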
Camera Modeling: An important consideration in computational stereo is the setup of the cameras. The baseline between the camera centers determines the accuracy of the triangulation. A large baseline means more accuracy; however, as the baseline gets larger, the same physical feature may not be found in both images.
The cameras also have to be calibrated and registered. Calibration is relatively straightforward,
and a variety of methods exist. Some methods extend the simple least squares model we discussed to
include non-linear effects of lens distortion (particularly true with a short focal length lens).
Registration is needed to make use of the epipolar constraint. This constraint consists of a plane that includes both cameras' optical centers and a point in 3-D space. This epipolar plane intersects both image planes in a straight line.
Feature Extraction: Identifying features in each image that can be matched is an important part
of the stereo process. It serves 2 purposes: 1) data reduction so we are not forced to deal with every
single pixel as a potential match, and 2) stability - features are seen to be more stable than a single
gray level pixel.
There are 2 approaches: feature-based methods which find primitives such as edges, corners, lines,
arcs in each image and match them; and area-based methods that identify regions or areas of pixels
that can be matched using correlation based methods. Sometimes both methods are used, with feature-
based methods proposing a match and area-based methods centered on the feature used to verify it.
Correspondence: The heart of the stereo problem is a search procedure. Given a pixel in image
1, it can potentially match any of the N² pixels in the other image. To cut down this search space, cameras are often registered along scan lines. This means that the epipolar plane intersects each image plane along the same scan line. A pixel in image 1 can now potentially match only a pixel along the corresponding scan line in image 2, reducing the search from O(N²) to O(N). The match criteria
can include not only the location of a feature like an edge, but also the edge direction and polarity.
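To illustrate the reduced search, here is a minimal area-based matching sketch that scores candidate positions along a single (rectified) scan line with a sum-of-squared-differences measure; the window size and the assumption of rectified grayscale image arrays are illustrative choices:

```python
import numpy as np

def match_along_scanline(left, right, row, col, half_win=5):
    """Find the column in `right` whose window best matches the window centered
    at (row, col) in `left`, searching only along the same scan line."""
    template = left[row - half_win:row + half_win + 1,
                    col - half_win:col + half_win + 1].astype(float)
    best_col, best_ssd = None, np.inf
    for c in range(half_win, right.shape[1] - half_win):
        window = right[row - half_win:row + half_win + 1,
                       c - half_win:c + half_win + 1].astype(float)
        ssd = np.sum((template - window) ** 2)        # sum of squared differences
        if ssd < best_ssd:
            best_ssd, best_col = ssd, c
    return best_col                                   # O(N) candidates instead of O(N^2)
```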
Problems in Matching: A number of problems occur during matching that can create false matches: occlusions, periodic features such as texture, homogeneous regions without features, baseline separation errors, and misregistered images. Stereo can usually provide only sparse 3-D data at easily identified feature points.