1.fan - A Point Set Generation Network For 3D Object Reconstruction From A Single Image - CVPR - 2017 - Paper
Abstract
… have to be updated. Our pipeline infers the point positions in a 3D frame determined by the input image and the inferred viewpoint position.

Given this unorthodox network output, one of our challenges is how to measure loss during training, as the same geometry may admit different point cloud representations at the same degree of approximation. Unlike the usual L2-type losses, we use the solution of a transportation problem based on the Earth Mover's distance (EMD), effectively solving an assignment problem. We exploit an approximation to the EMD to provide speed as well as to ensure differentiability for end-to-end training.

Our approach effectively attempts to solve the ill-posed problem of 3D structure recovery from a single projection using certain learned priors. The network has to estimate depth for the visible parts of the image and hallucinate the rest of the object geometry, assessing the plausibility of several different completions. From a statistical perspective, it would be ideal if we could fully characterize the landscape of the ground truth space, or be able to sample plausible candidates accordingly. If we view this as a regression problem, then it has a rather unique and interesting feature arising from inherent object ambiguities in certain views. These are situations where there are multiple, equally good 3D reconstructions of a 2D image, making our problem very different from classical regression/classification settings, where each training sample has a unique ground truth annotation. In such settings the proper loss definition can be crucial to obtaining the most meaningful result.

Our final algorithm is a conditional sampler, which samples plausible 3D point clouds from the estimated ground truth space given an input image. Experiments on both synthetic and real world data verify the effectiveness of our method. Our contributions can be summarized as follows:

• We use deep learning techniques to study the point set generation problem;

• On the task of 3D reconstruction from a single image, we apply our point set generation network and significantly outperform the state of the art;

• We systematically explore issues in the architecture and loss function design for point generation networks;

• We discuss and address the ground-truth ambiguity issue for the task of 3D reconstruction from a single image.

Source code demonstrating our system can be obtained from https://github.com/fanhqme/PointSetGeneration.

2. Related Work

3D reconstruction from single images While most research focuses on multi-view geometry such as SFM and SLAM [10, 9], ideally one expects that 3D can be reconstructed from the abundant single-view images. Under this setting, however, the problem is ill-posed and priors must be incorporated. Early work such as ShapeFromX [12, 1] made strong assumptions about the shape or the environment lighting conditions. [11, 18] pioneered the use of learning-based approaches for simple geometric structures. Coarse correspondences in an image collection can also be used for rough 3D shape estimation [14, 3]. As commodity 3D sensors became popular, RGBD databases have been built and used to train learning-based systems [6, 8]. Though great progress has been made, these methods still cannot robustly reconstruct complete, high-quality shapes from single images. Stronger shape priors are missing.

Recently, large-scale repositories of 3D CAD models, such as ShapeNet [4], have been introduced. They have great potential for 3D reconstruction tasks. For example, [19, 13] proposed to deform and reassemble existing shapes into a new model to fit the observed image. These systems rely on high-quality image-shape correspondence, which is a challenging and ill-posed problem itself.

More relevant to our work is [5]. Given a single image, they use a neural network to predict the underlying 3D object as a 3D volume. There are two key differences between our work and [5]: First, the predicted object in [5] is a 3D volume, whilst ours is a point cloud. As demonstrated and analyzed in Sec 5.2, a point set forms a nicer shape space for neural networks, so the predicted shapes tend to be more complete and natural. Second, we allow multiple reconstruction candidates for a single input image. This design reflects the fact that a single image cannot fully determine the reconstruction of a 3D shape.

Deep learning for geometric object synthesis In general, the field of predicting geometry in an end-to-end fashion is quite a virgin land. In particular, our output, a 3D point set, is still not a typical object in the deep learning community. A point set contains orderless samples from a metric-measure space. Therefore, equivalence classes are defined up to a permutation; in addition, the ground distance must be taken into consideration. To our knowledge, there are no prior deep learning systems with the ability to predict such objects.

3. Problem and Notations

Our goal is to reconstruct the complete 3D shape of an object from a single 2D image (RGB or RGB-D). We represent a 3D shape as an unordered point set $S = \{(x_i, y_i, z_i)\}_{i=1}^{N}$, where N is a predefined constant. We observed that for most objects, using N = 1024 is sufficient to preserve the major structures.
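To make the representation concrete, the following is a small illustrative sketch (ours, not from the paper's released code): a point set is just an N × 3 array, and any row permutation encodes the same shape, which is why element-wise comparisons are a poor fit (see Sec 4.3).

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(1024, 3))        # a point set: N = 1024 rows of (x, y, z)
S_perm = rng.permutation(S, axis=0)   # the same shape as a set, different row order

# An element-wise comparison treats the two as very different...
print(np.abs(S - S_perm).mean())      # large value
# ...although as unordered sets they are identical.
print(np.allclose(np.sort(S, axis=0), np.sort(S_perm, axis=0)))  # True
```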
Figure 2. Network architecture (legend: conv, deconv, fully connected, set union, concatenation; the encoder takes the input image and a random variable r.v. as inputs).
One advantage of the point set representation comes from its unorderedness. Unlike 2D-based representations such as depth maps, no topological constraint is put on the represented object. Compared to 3D grids, the point set enjoys higher efficiency by encoding only the points on the surface. Also, the coordinate values (x_i, y_i, z_i) undergo simple linear transformations when the object is rotated or scaled, in contrast to the case of volumetric representations.

To model the problem's uncertainty, we define the groundtruth as a probability distribution P(·|I) over the shapes conditioned on the input I. In training we have access to one sample from P(·|I) for each image I. We train a neural network G as a conditional sampler from P(·|I):

S = G(I, r; Θ)    (1)

where Θ denotes the network parameters and r ∼ N(0, I) is a random variable that perturbs the input¹. During test time, multiple samples of r can be used to generate different predictions.

¹ Similar to the Conditional Generative Adversarial Network [15].
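As a minimal sketch of this interface (illustrative only: the module G, its r dimensionality, and the sample count below are our assumptions, not the released implementation), multiple hypotheses for one image can be drawn as follows.

```python
import torch

def sample_shapes(G, image, k=5, r_dim=128):
    """Draw k plausible point clouds for one image by re-sampling r ~ N(0, I)."""
    G.eval()
    with torch.no_grad():
        return [G(image.unsqueeze(0), torch.randn(1, r_dim)).squeeze(0)  # (N, 3)
                for _ in range(k)]
```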
4. Approach

4.1. Overview

Our task of building a conditional generative network for point sets is challenging, due to the unordered form of the representation and the inherent ambiguity of the groundtruth. These challenges have pushed us to invent a new architecture, loss function, and learning paradigm. Specifically, we have to address three subproblems:

Point set generator architecture: Networks that predict point sets are barely studied in the literature, leaving a huge open space for us to explore design choices. Ideally, a network should make the best use of its data statistics and possess enough representation power. We propose a network with two prediction branches; one enjoys high flexibility in capturing complicated structures, while the other exploits geometric continuity. See Sec 4.2.

Loss function for point set comparison: For our novel type of prediction, point sets, it is unclear how to measure the distance between the prediction and the groundtruth. We propose two distance metrics for point sets – the Chamfer distance and the Earth Mover's distance. We show that both metrics are differentiable almost everywhere and can be used as loss functions, but they have different properties in capturing the shape space. See Sec 4.3.

Modeling the uncertainty of groundtruth: Our problem of 3D structure recovery from a single image is ill-posed, so ambiguity of the groundtruth arises during train and test time. It is fundamentally important to characterize the ambiguity of the groundtruth for a given input, and practically desirable to be able to generate multiple predictions. Surprisingly, this goal can be achieved tactfully by simply using the min function as a wrapper to the above proposed losses, or by a conditional variational autoencoder. See Sec 4.4.

4.2. Point Set Prediction Network

The task of building a network for point set prediction is new. We design a network with the goal of possessing strong representation power for complicated structures and making the best use of the statistics of geometric data. To introduce our network progressively, we start from a simple version and gradually add components.

As in Fig 2 (top), our network has an encoder stage and a predictor stage. The encoder maps the input pair of an image I and a random vector r into an embedding space. The predictor outputs a shape as an N × 3 matrix M, each row containing the coordinates of one point.

The encoder is a composition of convolution and ReLU layers; in addition, the random vector r is subsumed so that it perturbs the prediction from the image I. We postpone the explanation of how r is used to Sec 4.4. The predictor generates the coordinates of N points through a fully connected network. Though simple, this version works reasonably well in practice.
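A minimal sketch of this vanilla version follows (the channel counts, input handling, and the way r is concatenated into the embedding are our illustrative assumptions, not the paper's exact specification).

```python
import torch
import torch.nn as nn

class VanillaPointSetNet(nn.Module):
    """Encoder (conv + ReLU) followed by a fully connected predictor of N x 3 points."""
    def __init__(self, n_points=1024, r_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())        # -> (B, 128) embedding
        self.predictor = nn.Sequential(                   # fc predictor: embedding + r -> N*3 coords
            nn.Linear(128 + r_dim, 512), nn.ReLU(),
            nn.Linear(512, n_points * 3))
        self.n_points = n_points

    def forward(self, image, r):
        emb = torch.cat([self.encoder(image), r], dim=1)  # r perturbs the image embedding
        return self.predictor(emb).view(-1, self.n_points, 3)
```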
We further improve the design of the predictor branch to better accommodate large and smooth surfaces, which are common in natural objects. The fully connected predictor above cannot make full use of such natural geometric statistics, since each point is predicted independently. The improved predictor in Fig 2 (middle) exploits this geometric smoothness property.

This version has two parallel predictor branches – a fully-connected (fc) branch and a deconvolution (deconv) branch. The fc branch predicts N1 points as before. The deconv branch predicts a 3-channel image of size H × W, of which the three values at each pixel are the coordinates of a point, giving another H × W points. Their predictions are later merged to form the whole set of points in M. Multiple skip links are added to boost information flow across the encoder and predictor.

With the fc branch, our model enjoys high flexibility, showing good performance at describing intricate structures. With the deconvolution branch, our model becomes not only more parameter-parsimonious through weight sharing, but also more friendly to large smooth surfaces, thanks to the spatial continuity induced by deconv and conv. Refer to Sec 5.5 for experimental evidence.
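The following is a hedged sketch of such a two-branch predictor (the layer sizes, the 4 × 4 seed map, H = W = 32, and N1 = 256 are our illustrative assumptions; the skip links between encoder and predictor are omitted for brevity).

```python
import torch
import torch.nn as nn

class TwoBranchPredictor(nn.Module):
    """fc branch -> N1 x 3 points; deconv branch -> an H x W 3-channel image,
    i.e. one (x, y, z) point per pixel; the union of both is the final set."""
    def __init__(self, emb_dim=256, n1=256, h=32, w=32):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(emb_dim, 512), nn.ReLU(),
                                nn.Linear(512, n1 * 3))
        self.deconv = nn.Sequential(                      # upsample a 4x4 seed to h x w
            nn.ConvTranspose2d(emb_dim, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1))
        self.n1, self.h, self.w = n1, h, w

    def forward(self, emb):                               # emb: (B, emb_dim)
        p_fc = self.fc(emb).view(-1, self.n1, 3)          # intricate structures
        seed = emb[:, :, None, None].expand(-1, -1, 4, 4) # (B, emb_dim, 4, 4)
        img = self.deconv(seed)                           # (B, 3, 32, 32) after three 2x upsamplings
        p_dc = img.flatten(2).transpose(1, 2)             # (B, h*w, 3) smooth surfaces
        return torch.cat([p_fc, p_dc], dim=1)             # set union: (B, n1 + h*w, 3)
```

Since the output is an unordered set, the "set union" here is simply a concatenation along the point dimension.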
The above introduces the design of our network G in Eq 1. To train this network, however, we still need to design a proper loss function for point set prediction, and to enable the role of r for multiple-candidate prediction. We explain these in the next two sections.

4.3. Distance Metric between Point Sets

A critical challenge is to design a good loss function for comparing the predicted point cloud and the groundtruth. To be plugged into a neural network, a suitable distance must satisfy at least three conditions: 1) it is differentiable with respect to point locations; 2) it is efficient to compute, as data will be forwarded and back-propagated many times; 3) it is robust against a small number of outlier points in the sets (e.g. the Hausdorff distance would fail).

We seek a distance d between subsets of $\mathbb{R}^3$, so that the loss function takes the form

$L(\{S_i^{pred}\}, \{S_i^{gt}\}) = \sum_i d(S_i^{pred}, S_i^{gt})$,    (2)

where i indexes training samples, and $S_i^{pred}$ and $S_i^{gt}$ are the prediction and groundtruth of each sample, respectively. We propose two candidates: the Chamfer distance (CD) and the Earth Mover's distance (EMD) [17].

Chamfer distance We define the Chamfer distance between $S_1, S_2 \subseteq \mathbb{R}^3$ as:

$d_{CD}(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \|x - y\|_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \|x - y\|_2^2$    (3)

In the strict sense, $d_{CD}$ is not a distance function because the triangle inequality does not hold. We nevertheless use the term "distance" to refer to any non-negative function defined on point set pairs. For each point, the CD algorithm finds the nearest neighbor in the other set and sums the squared distances up. Viewed as a function of the point locations in S1 and S2, CD is continuous and piecewise smooth. The range search for each point is independent, and thus trivially parallelizable. Also, spatial data structures like the KD-tree can be used to accelerate the nearest neighbor search. Though simple, CD produces reasonably high quality results in practice.
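For reference, a brute-force implementation of Eq 3 fits in a few lines (a sketch; the paper's version would use the parallel or KD-tree accelerations described above).

```python
import torch

def chamfer_distance(s1, s2):
    """d_CD of Eq 3: s1 (N, 3), s2 (M, 3). Differentiable w.r.t. point locations."""
    d = torch.cdist(s1, s2, p=2) ** 2          # (N, M) pairwise squared distances
    return d.min(dim=1).values.sum() + d.min(dim=0).values.sum()
```

The O(|S1|·|S2|) pairwise matrix is affordable at N = 1024 and keeps the expression differentiable almost everywhere.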
Earth Mover's distance Consider $S_1, S_2 \subseteq \mathbb{R}^3$ of equal size $s = |S_1| = |S_2|$. The EMD between $S_1$ and $S_2$ is defined as:

$d_{EMD}(S_1, S_2) = \min_{\phi: S_1 \to S_2} \sum_{x \in S_1} \|x - \phi(x)\|_2$    (4)

where $\phi: S_1 \to S_2$ is a bijection.

The EMD solves an optimization problem, namely the assignment problem. For all but a zero-measure subset of point set pairs, the optimal bijection φ is unique and invariant under infinitesimal movement of the points. Thus EMD is differentiable almost everywhere. In practice, exact computation of EMD is too expensive for deep learning, even on graphics hardware. We therefore implement a (1 + ε) approximation scheme given by [2]. We allocate a fixed amount of time for each instance and incrementally adjust the allowable error ratio to ensure termination. For typical inputs, the algorithm gives highly accurate results (approximation error on the order of 1%), and is easily parallelizable on the GPU.
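On small sets, the assignment problem in Eq 4 can be solved exactly with the Hungarian algorithm, which makes a convenient reference implementation (a sketch; it is O(s³) and thus not the (1 + ε) auction-style approximation of [2] that the paper actually uses for training).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def emd_exact(s1, s2):
    """d_EMD of Eq 4 for equal-size point sets s1, s2 of shape (N, 3)."""
    cost = cdist(s1, s2)                        # (N, N) Euclidean ground distances
    rows, cols = linear_sum_assignment(cost)    # optimal bijection phi: s1 -> s2
    return cost[rows, cols].sum()
```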
Shape space Despite the remarkable expressive power embedded in the deep layers, neural networks inevitably encounter uncertainty in predicting the precise geometry of an object. Such uncertainty could arise from limited network capacity, insufficient use of input resolution, or the ambiguity of the groundtruth due to information loss in the 3D-2D projection. Facing the inherent inability to resolve the shape precisely, neural networks tend to predict a "mean" shape averaging out the space of uncertainty. The mean shape carries the characteristics of the distance itself.

In Figure 3, we illustrate the distinct mean-shape behavior of EMD and CD on synthetic shape distributions, by minimizing $E_{s \sim \mathbb{S}}[L(x, s)]$ through stochastic gradient descent, where $\mathbb{S}$ is a given shape distribution and L is one of the two distance functions.
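This experiment can be reproduced in miniature (a sketch under assumed settings: a toy distribution over circles of random radius stands in for the shape distribution, and the chamfer_distance sketch above serves as L).

```python
import torch

def mean_shape(sample_shape, loss_fn, n_points=256, steps=2000, lr=1e-2):
    """Minimize E_{s~S}[loss_fn(x, s)] over point locations x by SGD (cf. Figure 3)."""
    x = torch.randn(n_points, 3, requires_grad=True)
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(x, sample_shape())       # one sample s ~ S per step
        loss.backward()
        opt.step()
    return x.detach()

def random_circle(n=256):
    """Toy distribution: a circle in the z = 0 plane with random radius."""
    r = 1.0 + 0.3 * torch.randn(())             # hidden variable: the radius
    t = torch.rand(n) * 2 * torch.pi
    return torch.stack([r * torch.cos(t), r * torch.sin(t), torch.zeros(n)], dim=1)

# e.g. x_cd = mean_shape(random_circle, chamfer_distance)
```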
In the first and the second case, there is a single continuously changing hidden variable, namely the radius of the circle in (a) and the location of the arc in (b). EMD roughly captures the shape corresponding to the mean value of the hidden variable. In contrast, CD induces a splashy shape that blurs the shape's geometric structure. In the latter two cases, there are categorical hidden variables: which …
Figure 3. Mean-shape behavior of EMD and CD on synthetic shape distributions (rows: Input, EMD mean, CD mean; columns labeled CD / EMD).

Figure 5. Visual comparison of reconstructions (columns: input image, ours, ours post-processed, ground truth, 3D-R2N2).
For each model, we normalized the radius of its bounding hemisphere to unit 1 and aligned the ground plane. Then each model was rendered into 2D images according to the Blinn-Phong shading formula with randomly chosen environment maps. In our experiments we used a simple local lighting model for the sake of computation time. However, it is straightforward to extend our method to incorporate global illumination algorithms and more complex backgrounds.

category      Ours     3D-R2N2
              1 view   1 view   3 views   5 views
plane         0.601    0.513    0.549     0.561
bench         0.550    0.421    0.502     0.527
cabinet       0.771    0.716    0.763     0.772
car           0.831    0.798    0.829     0.836
chair         0.544    0.466    0.533     0.550
monitor       0.552    0.468    0.545     0.565
lamp          0.462    0.381    0.415     0.421
speaker       0.737    0.662    0.708     0.717
firearm       0.604    0.544    0.593     0.600
couch         0.708    0.628    0.690     0.706
table         0.606    0.513    0.564     0.580
cellphone     0.749    0.661    0.732     0.754
watercraft    0.611    0.513    0.596     0.610
mean          0.640    0.560    0.617     0.631

Table 1. 3D reconstruction comparison (per category), measured by IoU. Notice that in the single view reconstruction setting we achieved higher IoU in all categories. The mean is taken category-wise. For 8 out of 13 categories, our results are even better than 3D-R2N2 given 5 views.

We report the IoU value for each category as in [5]. From Table 1, we can see that for single view reconstruction the proposed method consistently achieves higher IoU in all categories. 3D-R2N2 is also able to predict 3D shapes from more than one view. In many categories our method even outperforms 3D-R2N2's prediction given 5 views.

Notice that both methods learn much more than predicting the object's class. In 3D-R2N2's dataset, for example, the average CD value from a shape to its categorical mean is 1.1, much larger than the result of any method.

We visually compare reconstruction examples in Fig 5. As stated in [5], their method often misses thin features of objects (e.g. legs of furniture). We surmise that this is due to their volumetric representation and voxel-wise loss function, which unduly punishes mispositioned thin structures. In contrast, our point-cloud-based objective function encourages the preservation of fine structures and makes our predictions more structurally plausible.

In our current implementation, processing one input image takes 0.13 seconds on a laptop CPU.
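The IoU metric of [5] is defined on voxel grids, so evaluating a point cloud prediction requires voxelizing it first. A minimal sketch of such an evaluation (the grid resolution and the occupancy rule are our assumptions, not the paper's exact protocol):

```python
import numpy as np

def voxelize(points, res=32):
    """Map points in [-1, 1]^3 to a boolean occupancy grid of shape (res, res, res)."""
    idx = np.clip(((points + 1.0) / 2.0 * res).astype(int), 0, res - 1)
    grid = np.zeros((res, res, res), dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

def iou(points_a, points_b, res=32):
    a, b = voxelize(points_a, res), voxelize(points_b, res)
    return (a & b).sum() / (a | b).sum()   # intersection over union
```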
5.3. Injecting Additional Information

… cascaded after our predictions to enrich higher frequency details.
Figure 8. Multiple predictions for a single input image. The point sets are visualized from different viewpoints for each object to better reveal the differences.
5.4. Predicting Multiple Plausible Shapes

The randomness in our network enables the prediction of different shapes given the same input image. To show this, we take the RGB image as the input. During training we handle the randomness by using either the Mo2 or the VAE method. At test time, when the ground truth is unknown, the random numbers are sampled from the predefined distribution.

Fig 8 plots examples of the sets of predictions of our method. The network is able to reveal its uncertainty about the shape or the ambiguity in the input. Points whose positions the network is certain about move little between different predictions. Along the direction of ambiguity (e.g. the thickness of the penguin's body) the variation is significantly larger. In this figure we trained our network with Mo2 and the Chamfer distance. Other combinations of settings and methods give qualitatively similar results.
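A sketch of the min-wrapped loss mentioned in Sec 4.1 that this training relies on (our reading, with n = 2 draws per groundtruth to match the "Mo2" setting; the exact formulation is in the paper's Sec 4.4, which this excerpt does not reproduce):

```python
import torch

def min_of_n_loss(G, image, gt_points, distance, n=2, r_dim=128):
    """Take the best of n stochastic predictions, so that only the candidate
    closest to this sample of the groundtruth distribution is penalized."""
    losses = [distance(G(image.unsqueeze(0), torch.randn(1, r_dim)).squeeze(0),
                       gt_points)
              for _ in range(n)]
    return torch.stack(losses).min()   # gradient flows through the argmin branch
```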
Figure 9. Visualization of the channels (for one input image: the x-, y- and z-channel outputs of the deconv branch and of the fully connected branch).

5.5. Network Design Analysis

Effect of combining deconv and fc branches for reconstruction We compared different designs of the neural network architectures. The performance values are reported based on our own rendered training set. As shown in Fig 12, the introduction of deconvolution significantly improves performance.

Figure 10. Visualization of points predicted by the deconvolution branch (blue) versus the fully connected branch (red).

We further visualize the outputs of the deconv branch and the fully connected branch separately to gain a better understanding of their functions. In Fig 9 the values in the x, y and z channels are plotted as 2D images for one of the models. In the deconv branch, the network learns to use the convolution structure to construct a 2D surface that warps around the object. In the fully connected branch, the output is less organized, as the channels are not ordered.

In Fig 10 we render the two sets of predictions in 3D space. The deconv branch is in general good at capturing the "main body" of the object, while the fully connected branch complements the shape with more detailed components (e.g. the tip of a gun, the tail of a plane, the arms of a sofa). This reveals the complementarity of the two branches: the predefined weight sharing and node connectivity endow the deconv branch with higher efficiency when they are congruent with the desired output's structure, while the fully connected branch is more flexible, but the independent control of each point consumes more network capacity.

Analysis of distance metrics Different choices of the loss function have a distinct effect on the network's prediction pattern. Fig 13 exemplifies the difference between two networks trained with CD and EMD respectively. The network trained with CD tends to scatter a few points in the uncertain area (e.g. behind the door) but is able to better preserve the detailed shape of the grip. In contrast, the network trained with EMD produces more compact results but sometimes overly shrinks local structures. This is in line with the experiments on synthetic data.
5.6. More results and application to real world data

Fig 11 lists more example predictions on both synthetic data and real world photos. These real world photos are acquired from a viewpoint and distance that resemble the setting we used for our synthetic data. A segmentation mask is also needed to indicate the scope of the object.
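One plausible way to apply such a mask before feeding the network (an illustrative sketch, not the paper's preprocessing pipeline) is to replace all pixels outside the object with a constant background:

```python
import numpy as np

def apply_mask(image, mask, background=1.0):
    """image: (H, W, 3) float array in [0, 1]; mask: (H, W) boolean array.
    Pixels outside the object are replaced by a constant background color."""
    out = np.full_like(image, background)
    out[mask] = image[mask]
    return out
```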
Figure 11. Example predictions (panels: Synthetic Data, Real World Data).
Figure 12. Comparison of different networks by Chamfer distance (CD) and Earth Mover's distance (EMD). More complex networks give better results.

Figure 13. Comparison of predictions of networks trained by CD (blue, on the left) and EMD (green, on the right).

6. Discussion

The major difficulties we faced in generating 3D point clouds, namely how to represent unordered data and how to handle ambiguity, are universal in machine learning. We hope our demonstration of single image based 3D …

Acknowledgements

The project receives support from NSF grant IIS-1528025, the Stanford AI Lab-Toyota Center for Artificial Intelligence Research, a Samsung GRO grant and a Google Focused Research Award.

References

[1] J. Aloimonos. Shape from texture. Biological Cybernetics, 58(5):345–360, 1988.
[2] D. P. Bertsekas. A distributed asynchronous relaxation algorithm for the assignment problem. In Decision and Control, 1985 24th IEEE Conference on, pages 1703–1704. IEEE, 1985.
[3] J. Carreira, S. Vicente, L. Agapito, and J. Batista. Lifting object detection datasets into 3D. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7):1342–1355, 2016.
[4] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An information-rich 3D model repository. Technical Report arXiv:1512.03012 [cs.GR], 2015.
[5] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. arXiv preprint arXiv:1604.00449, 2016.
[6] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.
[7] Y. Eldar, M. Lindenbaum, M. Porat, and Y. Y. Zeevi. The farthest point strategy for progressive image sampling. IEEE Transactions on Image Processing, 6(9):1305–1315, 1997.
[8] D. F. Fouhey, A. Gupta, and M. Hebert. Data-driven 3D primitives for single image understanding. In ICCV, 2013.
[9] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendón-Mancha. Visual simultaneous localization and mapping: a survey. Artificial Intelligence Review, 43(1):55–81, 2015.
[10] K. Häming and G. Peters. The structure-from-motion reconstruction pipeline – a survey with focus on short image sequences. Kybernetika, 46(5):926–937, 2010.
[11] D. Hoiem, A. A. Efros, and M. Hebert. Automatic photo pop-up. ACM Transactions on Graphics (TOG), 24(3):577–584, 2005.
[12] B. K. Horn. Obtaining shape from shading information. In Shape from Shading, pages 123–171. MIT Press, 1989.
[13] Q. Huang, H. Wang, and V. Koltun. Single-view reconstruction via joint analysis of image and shape collections. ACM Transactions on Graphics (TOG), 34(4):87, 2015.
[14] A. Kar, S. Tulsiani, J. Carreira, and J. Malik. Category-specific object reconstruction from a single image. In CVPR, 2015.
[15] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[16] D. J. Rezende, S. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3D structure from images. arXiv preprint arXiv:1607.00662, 2016.
[17] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.
[18] A. Saxena, M. Sun, and A. Y. Ng. Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):824–840, 2009.
[19] H. Su, Q. Huang, N. J. Mitra, Y. Li, and L. Guibas. Estimating image depth using shape collections. ACM Transactions on Graphics (TOG), 33(4):37, 2014.