3D Model Assisted Image Segmentation
Srimal Jayawardena and Di Yang and Marcus Hutter
December 2011
Abstract
The problem of segmenting a given image into coherent regions is important in Computer Vision and many industrial applications require segmenting
a known object into its components. Examples include identifying individual
parts of a component for process control work in a manufacturing plant and
identifying parts of a car from a photo for automatic damage detection. Unfortunately,
most of an object's parts of interest in such applications share
the same pixel characteristics, having similar colour and texture. This makes
segmenting the object into its components a non-trivial task for conventional
image segmentation algorithms. In this paper, we propose a Model Assisted
Segmentation method to tackle this problem. A 3D model of the object is
registered over the given image by optimising a novel gradient based loss function. This registration obtains the full 3D pose from an image of the object.
The image can have an arbitrary view of the object and is not limited to a
particular set of views. The segmentation is subsequently performed using
a level-set based method, using the projected contours of the registered 3D
model as initialisation curves. The method is fully automatic and requires no
user interaction. Also, the system does not require any prior training. We
present our results on photographs of a real car.
Keywords. Image segmentation; 3D-2D Registration; 3D Model; Monocular;
Full 3D Pose; Contour Detection; Fully Automatic.
1 Introduction

Many industry applications require an image of a known object to be sub-segmented
and separated into its parts. Examples include the identification of individual
parts of a car given a photograph for automatic damage identification, or
the identification of sub-parts of a component in a manufacturing plant for process
control work. Sub-segmenting parts of an object which share the same colour and
texture is very hard, if not impossible, with conventional segmentation methods.
However, prior knowledge of the shape of the known object and its components can
be exploited to make this task easier. Based on this rationale we propose a novel
Model Assisted Segmentation method for image segmentation. The method performs
such sub-segmentation and does not require user interaction or prior training. A
result from our method is shown in Figure 1 with the car sub-segmented into a
collection of parts, including the hood of the car, windshield, fender, and the front
and back doors/windows.

Figure 1: The figure shows Model Assisted Segmentation results for a semi-profile
view of the car.
We propose to register a 3D model of the known object over a given photograph/image in order to initialise the segmentation process. The segmentation is
performed over each part of the object in order to obtain sub-segments from the
image. A major contribution of this work is a novel gradient based loss function,
which is used to estimate the full 3D pose of the object in the given image. The
projected parts of the 3D model may not perfectly match the corresponding parts
in the photo due to dents in a damaged vehicle or inaccuracies in the 3D model.
Therefore, a level-set based segmentation method [11] is initialised using contour information obtained by projecting parts of the 3D model at this 3D pose.
We focus our work on sub-segmentation of known car images. Cars pose a difficult
segmentation task due to highly reflective surfaces in the car body. The method can
be adapted to work for any object.
The remainder of this paper is organised as follows. Previous work related to our
paper is described in Section 2. We describe the method used to estimate the 3D
pose of the object in Section 3. The contour based image segmentation approach
is described next in Section 4. This is followed by results on real photos which are
benchmarked against state of the art methods in Section 5.
2 Related Work
Feature-based methods [6, 15] attempt to simultaneously solve the pose and point
correspondence problems. The success of these methods is affected by the quality
of the features extracted from the object, which is non-trivial with objects like cars.
Features depend on the object geometry and can cause problems when recovering
a full 3D pose. Different image modalities also cause problems with feature-based
methods: for example, reflections may appear as image features yet do not occur
in the 3D model projection. Our method, on the contrary, does not depend on feature
extraction.
Segmentation. The use of shape priors for segmentation and pose estimation has
been investigated in [22, 21, 23, 25]. These methods focus on segmenting foreground
from background using 3D free-form contours. Our method, on the contrary, does
intra-object segmentation (into sub-segments) by initialising the segmentation using
projections of 3D CAD model parts at an estimated pose. In addition, our method
works on more complex objects like real cars.
3 3D Model Registration
We describe the use of a featureless gradient based loss function which is used to
register the 3D model over the 2D photo. Our method works on triangulated 3D
CAD models with a large number of polygons (including 3D models obtained from
laser scans) and utilises image gradients of the 3D model surface normals rather
than considering simple edge segments.
Gradient based loss function. We define a gradient based loss function that has
a minimum at the correct 3D pose $\theta_0 \in \mathbb{R}^7$ where the projected 3D model matches
the object in the given photo/image. The image gradients of the 3D model surface
normal components and the image gradients of the 2D photo are used to define a
loss function at a given pose $\theta$.
We use $(u,v) \in \mathbb{Z}^2$ to denote 2D pixel coordinates in the photo/image and
$(x,y,z) \in \mathbb{R}^3$ to denote 3D coordinates of the 3D model. Let $W$ be a $d$ dimensional
matrix (for example $d = 3$ if $W$ is an RGB image) with elements $W(u,v) \in \mathbb{R}^d$. We
define the $k$ norm gradient magnitude matrix of $W$ as

$$\|W(u,v)\|_k^k := \sum_{i=1}^{d} \left( \left| \frac{\partial W_i(u,v)}{\partial u} \right|^k + \left| \frac{\partial W_i(u,v)}{\partial v} \right|^k \right) \qquad (1)$$

Based on this we have the gradient magnitude matrix $G_I$ for a 2D photo/image $I$ as

$$G_I(u, v) = \|I(u, v)\|_k^k \qquad (2)$$
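As a concrete illustration of Equations (1) and (2), the following sketch (our own, not the authors' implementation) computes the $k$ norm gradient magnitude matrix, with numpy finite differences standing in for the image gradients:

```python
import numpy as np

def grad_magnitude(W, k=1):
    """k-norm gradient magnitude matrix ||W(u,v)||_k^k (Equation 1).

    W is an (H, W) or (H, W, d) array (e.g. d = 3 for an RGB image);
    the result sums |dW_i/du|^k + |dW_i/dv|^k over the d channels.
    """
    W = np.atleast_3d(np.asarray(W, dtype=float))
    G = np.zeros(W.shape[:2])
    for i in range(W.shape[2]):
        du, dv = np.gradient(W[:, :, i])  # finite-difference partial derivatives
        G += np.abs(du) ** k + np.abs(dv) ** k
    return G  # for a photo I this is G_I of Equation 2
```

Applied to a photo $I$ this gives $G_I$; a constant image yields all zeros, and a linear ramp yields a constant matrix.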
We use a renderer to obtain the projected surface normal component matrix $N(\theta)$, such that
$N(u,v,\theta) \in \mathbb{R}^3$ holds the surface normal component values at the 2D point $(u,v)$ in the
projected image. Based on this we have the gradient normal matrix for the surface
normal components as

$$G_{N(\theta)}(u, v) = \|N(u, v, \theta)\|_k^k \qquad (3)$$

The gradient based loss function (Equation 4) is then defined by comparing $G_I$ with $G_{N(\theta)}$.
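The loss attains its minimum when the photo gradients $G_I$ and the model gradients $G_{N(\theta)}$ match. Its exact form is not reproduced here; the sketch below assumes a normalised-correlation comparison (the reference list includes Rodgers and Nicewander [20] on correlation coefficients) and should be read as an illustrative stand-in rather than the paper's definition:

```python
import numpy as np

def correlation_loss(G_I, G_N):
    """Hypothetical gradient-comparison loss: one minus the Pearson
    correlation between the two flattened gradient-magnitude matrices.
    Low values mean the projected model gradients match the photo's."""
    a = np.asarray(G_I, dtype=float).ravel()
    b = np.asarray(G_N, dtype=float).ravel()
    a = a - a.mean()  # centre both signals before correlating
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    if denom == 0.0:
        return 1.0  # a constant image carries no gradient information
    return 1.0 - (a * b).sum() / denom
```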
Figure 2: Visualisation of $G_{N(\theta)}$ for a 3D model. The $x$, $y$ and $z$
component matrices of the surface normal vector are shown in (a)-(c), their image
gradients with respect to $u$ and $v$ in (d)-(i), and the resulting $G_{N(\theta)}$ matrix
in (j). No Gaussian smoothing has been applied.
Colour representation: green = positive, black = zero and red = negative. We use a
horizontal $x$ axis pointing left to right, a vertical $y$ axis pointing top to bottom,
and a $z$ axis which points out of the page.
Figure 3: Real and synthetic versions of the image gradients $\partial I/\partial u$
(panels c, d) and $\partial I/\partial v$ (panels e, f), and of the gradient magnitude
matrix $G_I$ (panels g, h).
Figure 4: Overlaid images of $G_I$ and $G_{N(\theta)}$ for a real photo (row 1) and a synthetic
photo (row 2) obtained by rendering a 3D model. The first column shows
the photos $I$. Columns 2 and 3 show the overlays with no Gaussian smoothing
($n = 0$) and with two levels of Gaussian smoothing ($n = 2$) respectively. The photo
is in the green channel and the 3D model is in the red channel, with yellow showing
overlapping regions.
Initialisation. We use a rough pose estimate to seed the optimisation. An
object specific method can be used to obtain the rough pose. Possible methods for
obtaining a coarse initial pose include the work done by [17], [26] and [1]. We have
used the wheel match method developed by Hutter and Brewer [9] to obtain an
initial pose for vehicle photos where the wheels are visible. The wheels need not be
visible with the other methods mentioned above. We use the following to represent
the rough pose of cars as prescribed in [9] which neglects the effects of perspective
projection.
0 := (x , y , x , y , x , y )
(5)
(6)
4 Contour Detection
In this section, we discuss the procedure of contour detection used to segment the
known object in the image. We use a variation of the level set method which does
not require re-initialisation [11] to find boundaries of relevant object parts.
Most active contour models implement an edge function to find boundaries.
The edge function is a gradient-dependent positive decreasing function. A common
formulation is as follows:

$$g(|\nabla I|) = \frac{1}{1 + |\nabla G_\sigma * I|^p}, \qquad p \geq 1, \qquad (7)$$

where $G_\sigma * I$ denotes the image smoothed with a Gaussian kernel $G_\sigma$.
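A minimal numerical sketch of this edge function (Equation 7), using `scipy.ndimage.gaussian_filter` for the Gaussian smoothing; the function and parameter names are our own:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def edge_function(I, sigma=1.0, p=2):
    """Edge indicator g = 1 / (1 + |grad(G_sigma * I)|^p) (Equation 7).

    Values are close to 1 in flat regions and drop towards 0 at
    strong edges, which is what stops the evolving contour."""
    S = gaussian_filter(np.asarray(I, dtype=float), sigma)
    gu, gv = np.gradient(S)  # gradients of the smoothed image
    return 1.0 / (1.0 + np.hypot(gu, gv) ** p)
```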
Figure 6: Comparison of the (a) 2-norm and (b) 1-norm loss landscapes obtained by
shifting the 3D model along the x direction from a known 3D pose. The horizontal
axis shows the percentage deviation along the x axis. The legend gives the level of
Gaussian smoothing n (0, 1 or 2) applied on the gradient images before calculating
the loss in Equation 4. We note that the 1-norm loss is less noisy compared to the
2-norm loss. The actual loss function is seven dimensional and graphs of the other
dimensions are similar.
Following [11], the level set function is initialised as a binary step function which
takes a constant negative value for $(u, v)$ inside the contour $C$ and a constant
positive value for $(u, v)$ outside the contour $C$ (Equation 8).
As with other level set formulations like [4] and [13], the curve $C$ is evolved using
the mean curvature $\operatorname{div}(\nabla\phi/|\nabla\phi|)$ in the direction of the normal $|\nabla\phi|$; the
curve evolution is therefore represented by the evolution of $\phi$ through $\partial\phi/\partial t$.
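The mean curvature term can be sketched with finite differences as follows (an illustrative implementation, not the code of [11]); for a signed distance function to a circle of radius $r$, the computed values approach the analytic curvature $1/r$:

```python
import numpy as np

def mean_curvature(phi, h=1.0, eps=1e-8):
    """div(grad(phi) / |grad(phi)|): the curvature of the level sets
    of phi, computed with central finite differences on a grid of
    spacing h. eps guards against division by zero gradients."""
    pu, pv = np.gradient(phi, h)
    norm = np.sqrt(pu * pu + pv * pv) + eps
    nu, _ = np.gradient(pu / norm, h)  # d/du of the unit normal's u component
    _, nv = np.gradient(pv / norm, h)  # d/dv of the unit normal's v component
    return nu + nv
```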
5 Results
Figure 7: The images show pose estimation results for a real photograph of a Mazda
Astina car. The original photograph (a) and subsequent images have been cropped
for clarity. The fine 3D pose in (f) is obtained by optimising the novel gradient
based loss function (Equation 4) starting from the rough pose in (c), which is
obtained as prescribed in [9]. Much of the background is removed (b) from the
original photo (a) using an adaptation of Grabcut [24] when estimating the fine
3D pose. Intermediate steps of optimising the loss function with different levels of
Gaussian smoothing n applied on the gradient images are shown in (d), (e) and (f).
The close-ups in Figure 8 highlight the visual improvement between these steps.
(a) n=2
(b) n=1
(c) n=0
Figure 8: Close ups at each step of the optimisation (shown in Figure 7) for different
levels of Gaussian smoothing n highlight the visual improvement in the 3D pose.
(a) Initialisation
(b) Result
Figure 9: The figure shows the Model Assisted Segmentation results for a real
photo of a Mazda Astina car. The initialisation curves for a selection of car body
parts are shown in 9(a) based on the fine 3D pose shown in Figure 7(f). The 3D
model outlines are shown in green and the initialisation curves obtained by eroding
these outlines are shown in red. The resulting segmentation is shown in 9(b). Close
ups are shown along with benchmark results in Figure 10.
(a) Initialisation
(b) Result
(c) Benchmark - GC
(d) Benchmark - LS
(e) Initialisation
(f) Result
(g) Benchmark - GC
(h) Benchmark - LS
Figure 10: Different close-ups (row wise) for the results in Figure 9, showing the
initialisation curves (column 1), our results (column 2) and benchmark results
(columns 3 and 4). Note the bleeding and false positives in the benchmark results;
our method is more accurate in general and sub-segments the image into meaningful
parts.
Pose estimation. We register the 3D model over a real photograph of a Mazda
Astina car, using a triangulated 3D model of the car obtained by a 3D laser
scan. The rough 3D pose obtained using the wheel locations [9] is shown in Figure
7(c). The result of the approximate background removal is shown in Figure 7(b). We
optimise the gradient based loss function (Equation 4) for the image in Figure 7(b)
with respect to the seven pose parameters (Section 3) to obtain the fine 3D pose.
The optimisation is done sequentially moving from the highest level of Gaussian
smoothing to the lowest. We start from the rough pose with two levels of Gaussian
smoothing and obtain the pose in Figure 7(d). Next we use this pose to initialise an
optimisation of the loss function with one level of Gaussian smoothing and obtain
the pose in 7(e). Finally, we use this pose to perform one more optimisation with
no Gaussian smoothing and obtain the final fine 3D pose shown in Figure 7(f). We
note that the visual improvement in the image overlays gets smaller as we move to
the lower levels of Gaussian smoothing. However, the improvement in the 3D pose
becomes more apparent when we compare the close-ups in Figures 8(a), 8(b) and 8(c).
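The coarse-to-fine schedule above (optimise at the heaviest smoothing, then reuse each result to seed the next, finer level) can be sketched on a toy one-dimensional loss. The `noisy_loss` stand-in and the choice of the Nelder-Mead simplex optimiser [16] are our assumptions for illustration, not the paper's seven-parameter pose loss:

```python
import numpy as np
from scipy.optimize import minimize

def noisy_loss(theta, n):
    """Toy stand-in for the pose loss: a smooth bowl with minimum at 1,
    plus high-frequency ripple that n levels of smoothing attenuate."""
    theta = np.asarray(theta, dtype=float)
    ripple = 0.3 / 4 ** n * np.cos(20.0 * theta).sum()
    return float(((theta - 1.0) ** 2).sum() + ripple)

def coarse_to_fine(theta0, levels=(2, 1, 0)):
    """Optimise at the highest smoothing level first and reuse each
    result to initialise the next, finer level."""
    theta = np.asarray(theta0, dtype=float)
    for n in levels:
        res = minimize(lambda t: noisy_loss(t, n), theta, method='Nelder-Mead')
        theta = res.x
    return theta
```

Seeding each stage with the previous optimum keeps the search inside the broad basin found under heavy smoothing, so the final unsmoothed stage only has to refine locally.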
Segmentation. Segmentation results based on contour detection for the photograph in 7(a) using the fine 3D pose (Figure 7(f)) are shown in Figures 9 and 10.
The segmentation results for a selection of car parts (front and back doors, front and
back windows, fender, mud guard and front buffer) are shown in Figure 9(b) by the
yellow curves. The part boundaries obtained by projecting the 3D model are shown
(a) Initialisation
(b) Result
(c) Benchmark - LS
(d) Benchmark - GC
(e) Initialisation
(f) Result
(g) Benchmark - GC
(h) Benchmark - LS
(i) Initialisation
(j) Result
(k) Benchmark - GC
(l) Benchmark - LS
Figure 11: The figures show different close-ups (row wise) for the results in Figure
1, with the initialisation curves (column 1), our results (column 2) and benchmark
results (columns 3 and 4). We note that our results are more accurate and
sub-segment the car into meaningful components.
in green and the initialisation curves are shown in red in Figure 9(a). For the sake
of clarity we also include close ups of a few parts. The initialisation curves and the
segmentation results for the back door and window are shown in Figures 10(a) and
10(b), using the same color code. Close ups for the front parts are shown in Figures
10(e) and 10(f). We see the high amount of reflection in the car body deteriorating
the performance of the segmentation results in the latter case, especially around the
hood of the car and windshield. In contrast the mud guard, lower parts of the buffer
and fender are segmented out quite well in Figure 10(f) as there is less reflection
noise in that region. Results for a semi-profile view of the car are shown in Figures
1 and 11 using the same convention.
Accuracy. The accuracy of the results has been compared against a ground
truth obtained from the photos by hand annotation in Table 1.

Part           Side View   Semi Profile   Avg.
Fender           97.7%        97.6%       97.7%
Front door       98.1%        95.3%       96.7%
Back door        96.8%        93.6%       95.2%
Mud flap         97.3%        95.1%       96.2%
Front window     97.8%        97.5%       97.7%
Back window      99.5%        93.9%       96.7%

Table 1: Segmentation accuracy per part for the side and semi-profile views.

We calculate the accuracy as

$$\text{accuracy} := 1 - \frac{1}{N} \sum_{u,v} \left| U_R(u,v) - U_G(u,v) \right| \qquad (11)$$

with $N$ the total number of pixels,
where UR and UG are two binary images of the sub-segmentation result and ground
truth respectively. We note that the accuracy is considerably high. Also, the side
view has a higher accuracy in general because the pose estimation gave a better
result and hence the segmentation was better initialised.
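A minimal sketch of the accuracy computation, assuming Equation 11 measures the fraction of pixels on which the binary result and ground truth images agree; the function name is ours:

```python
import numpy as np

def pixel_accuracy(U_R, U_G):
    """Fraction of pixels on which the binary sub-segmentation result
    U_R and the hand-annotated ground truth U_G agree (Equation 11)."""
    U_R = np.asarray(U_R, dtype=float)
    U_G = np.asarray(U_G, dtype=float)
    return 1.0 - float(np.abs(U_R - U_G).mean())
```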
Benchmark tests. Our results from Model Assisted Segmentation were compared
with state of the art image segmentation methods Grabcut (GC) [24] and Level
set (LS) [11], which do not use any model assistance. A bounding box has been
used to initialise the benchmark methods. We compare our results (Figures 10(b)
and 10(f)) with the benchmark tests in Figure 10. The segmentations produced by our
method are more accurate in general. In addition to this, our method has the added
advantage of sub-segmenting parts of the same object. This is a non-trivial task
for conventional segmentation methods when the sub-segments of the object share
the same colour and texture. In terms of overall performance, we observe that in
our method the segmentation results bleed a lot less into adjacent areas, unlike
with the benchmark results. In terms of sub-segmenting parts of the same object,
we see in Figure 10(f) that our method is capable of successfully segmenting out
the fender, mud guard and the buffer from the front door unlike the benchmark
methods. In fact it would be extremely difficult (if not impossible) to sub-segment
parts of the front of the car which are painted the same color with conventional
methods. Similarly the back door, back window and the smaller glass panel have
been segmented out in Figure 10(b), whereas the benchmark methods group them
together. Results for a semi-profile view of the car are shown in Figure 1 with close
ups and benchmark comparisons in Figure 11. Our results are better and separate
the object into meaningful parts.
6 Discussion
The Model Assisted Segmentation method described in this paper can segment parts
of a known 3D object from a given image. It performs better than the state of the
art and can segment (and separate) parts that have similar pixel characteristics.
We present our results on images of cars. The highly reflective surfaces of cars
make the pose estimation as well as the segmentation tasks more difficult than with
non-reflective objects.
We note that a close initialisation curve obtained from the 3D pose estimation
significantly improves the performance of contour detection, and hence the image
segmentation. However, the presence of reflections can deteriorate the quality of
the results. We intend to explore avenues to make the process more robust in the
presence of reflections.
Acknowledgment. The authors wish to thank Stephen Gould and Hongdong Li
for their valuable feedback and advice. This work was supported by Control
€xpert.
References
[1] M. Arie-Nachimson and R. Basri. Constructing implicit 3D shape models for pose estimation.
In ICCV, 2009.
[2] G. Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000.
[3] R. A. Brooks and T. O. Binford. Geometric modelling in vision for manufacturing. In
Proceedings of the Society of Photo-Optical Instrumentation Engineers Conference on Robot
Vision, volume 281, pages 141-159, Washington, DC, USA, April 1981.
[4] V. Caselles, F. Catte, T. Coll, and F. Dibos. A geometric model for active contours in image
processing. Numerische Mathematik, 66(1):1-31, 1993.
[5] R. T. Chin and C. R. Dyer. Model-based recognition in robot vision. ACM Computing
Surveys, 18(1):67-108, 1986.
[6] P. David, D. DeMenthon, R. Duraiswami, and H. Samet. SoftPOSIT: Simultaneous pose and
correspondence determination. International Journal of Computer Vision, 59(3):259-284,
2004.
[7] D. A. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Prentice Hall, 2002.
[8] B. K. P. Horn. Obtaining shape from shading information. In PsychCV75, pages 115-155,
1975.
[9] M. Hutter and N. Brewer. Matching 2-D ellipses to 3-D circles with application to vehicle
pose identification. In Image and Vision Computing New Zealand (IVCNZ '09), pages
153-158, 2009.
[10] H. Kollnig and H.-H. Nagel. 3D pose estimation by directly matching polyhedral models to
gray value gradients. International Journal of Computer Vision, 23(3):283-302, 1997.
[11] C. Li, C. Xu, C. Gui, and M. D. Fox. Level set evolution without re-initialization: A new
variational formulation. In CVPR 2005, volume 1, pages 430-436, June 2005.
[12] R. Malladi, J. Sethian, and B. Vemuri. Evolutionary fronts for topology-independent shape
modeling and recovery. In Computer Vision - ECCV '94, pages 1-13, 1994.
[13] R. Malladi, J. A. Sethian, and B. C. Vemuri. Shape modeling with front propagation: A level
set approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(2):158-175,
1995.
[14] R. Malladi. A Topology-Independent Shape Modeling Scheme. PhD thesis, University of
Florida, Gainesville, FL, USA, 1993. AAI9505796.
[15] F. Moreno-Noguer, V. Lepetit, and P. Fua. Pose priors for simultaneously solving alignment
and correspondence. In Computer Vision - ECCV 2008, pages 405-418, 2008.
[16] J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer
Journal, 7(4):308-313, 1965.
[17] M. Ozuysal, V. Lepetit, and P. Fua. Pose estimation for category specific multiview object
localization. In Conference on Computer Vision and Pattern Recognition, Miami, FL, June
2009.
[18] W. A. Perkins. A model-based vision system for industrial parts. IEEE Transactions on
Computers, 27(2):126-143, 1978.
[19] J. F. Poje and E. J. Delp. A review of techniques for obtaining depth information with
applications to machine vision. Technical report, Center for Robotics and Integrated
Manufacturing, University of Michigan, Ann Arbor, 1982.
[20] J. L. Rodgers and W. A. Nicewander. Thirteen ways to look at the correlation coefficient.
The American Statistician, 42(1):59-66, 1988.
[21] B. Rosenhahn, T. Brox, D. Cremers, and H.-P. Seidel. A comparison of shape matching
methods for contour based pose estimation. In Combinatorial Image Analysis, pages 263-276,
2006.
[22] B. Rosenhahn, T. Brox, and J. Weickert. Three-dimensional shape knowledge for joint image
segmentation and pose tracking. International Journal of Computer Vision, 73(3):243-262,
2007.
[23] B. Rosenhahn, C. Perwass, and G. Sommer. Pose estimation of 3D free-form contours.
International Journal of Computer Vision, 62(3):267-289, 2005.
[24] C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using
iterated graph cuts. ACM Transactions on Graphics, 23(3):309-314, 2004.
[25] M. Rousson and N. Paragios. Shape priors for level set representations. In Computer Vision
- ECCV 2002, pages 416-418, 2002.
[26] M. Sun, B.-X. Xu, G. Bradski, and S. Savarese. Depth-encoded Hough voting for joint object
detection and shape recovery. In ECCV, Crete, Greece, September 2010.
[27] T. N. Tan and K. D. Baker. Efficient image gradient based vehicle localization. IEEE
Transactions on Image Processing, 9(8):1343-1356, 2000.
[28] M. Yachida and S. Tsuji. A versatile machine vision system for complex industrial parts.
IEEE Transactions on Computers, 26(9):882-894, 1977.