
A Point Set Generation Network for
3D Object Reconstruction from a Single Image

Haoqiang Fan* (Institute for Interdisciplinary Information Sciences, Tsinghua University; [email protected])
Hao Su*, Leonidas Guibas (Computer Science Department, Stanford University; {haosu,guibas}@cs.stanford.edu)
* equal contribution

Abstract

Generation of 3D data by deep neural networks has been attracting increasing attention in the research community. The majority of extant works resort to regular representations such as volumetric grids or collections of images; however, these representations obscure the natural invariance of 3D shapes under geometric transformations, and also suffer from a number of other issues. In this paper we address the problem of 3D reconstruction from a single image, generating a straightforward form of output: point cloud coordinates. Along with this problem arises a unique and interesting issue: the ground-truth shape for an input image may be ambiguous. Driven by this unorthodox output form and the inherent ambiguity in the ground truth, we design an architecture, loss function, and learning paradigm that are novel and effective. Our final solution is a conditional shape sampler, capable of predicting multiple plausible 3D point clouds from an input image. In experiments, our system not only outperforms state-of-the-art methods on single-image 3D reconstruction benchmarks, but also shows strong performance for 3D shape completion and a promising ability to make multiple plausible predictions.

Figure 1. A 3D point cloud of the complete object can be reconstructed from a single image. Each point is visualized as a small sphere. The reconstruction is viewed from two viewpoints (0° and 90° along the azimuth). A segmentation mask is used to indicate the scope of the object in the image.

1. Introduction

As we try to duplicate the successes of current deep convolutional architectures in the 3D domain, we face a fundamental representational issue. Extant deep net architectures for both discriminative and generative learning in the signal domain are well suited to data that is regularly sampled, such as images, audio, or video. However, most common 3D geometry representations, such as 2D meshes or point clouds, are not regular structures and do not easily fit into architectures that exploit such regularity for weight sharing, etc. That is why the majority of extant works on using deep nets for 3D data resort to either volumetric grids or collections of images (2D views of the geometry). Such representations, however, lead to difficult trade-offs between sampling resolution and net efficiency. Furthermore, they enshrine quantization artifacts that obscure natural invariances of the data under rigid motions.

In this paper we address the problem of generating the 3D geometry of an object based on a single image of that object. We explore generative networks for 3D geometry based on a point cloud representation. A point cloud representation may not be as efficient at representing the underlying continuous 3D geometry as a CAD model using geometric primitives or even a simple mesh, but for our purposes it has many advantages. A point cloud is a simple, uniform structure that is easier to learn, as it does not have to encode multiple primitives or combinatorial connectivity patterns. In addition, a point cloud allows simple manipulation when it comes to geometric transformations and deformations, as connectivity does not have to be updated.
Our pipeline infers the point positions in a 3D frame determined by the input image and the inferred viewpoint position.

Given this unorthodox network output, one of our challenges is how to measure loss during training, as the same geometry may admit different point cloud representations at the same degree of approximation. Unlike the usual L2-type losses, we use the solution of a transportation problem based on the Earth Mover's distance (EMD), effectively solving an assignment problem. We exploit an approximation to the EMD to provide speed as well as ensure differentiability for end-to-end training.

Our approach effectively attempts to solve the ill-posed problem of 3D structure recovery from a single projection using certain learned priors. The network has to estimate depth for the visible parts of the image and hallucinate the rest of the object geometry, assessing the plausibility of several different completions. From a statistical perspective, it would be ideal if we could fully characterize the landscape of the ground truth space, or be able to sample plausible candidates accordingly. If we view this as a regression problem, then it has a rather unique and interesting feature arising from inherent object ambiguities in certain views. These are situations where there are multiple, equally good 3D reconstructions of a 2D image, making our problem very different from classical regression/classification settings, where each training sample has a unique ground truth annotation. In such settings the proper loss definition can be crucial to getting the most meaningful result.

Our final algorithm is a conditional sampler, which samples plausible 3D point clouds from the estimated ground truth space given an input image. Experiments on both synthetic and real world data verify the effectiveness of our method. Our contributions can be summarized as follows:

• We use deep learning techniques to study the point set generation problem;
• On the task of 3D reconstruction from a single image, we apply our point set generation network and significantly outperform the state of the art;
• We systematically explore issues in the architecture and loss function design for point generation networks;
• We discuss and address the ground-truth ambiguity issue for the single-image 3D reconstruction task.

Source code demonstrating our system can be obtained from https://github.com/fanhqme/PointSetGeneration.

2. Related Work

3D reconstruction from single images  While most research focuses on multi-view geometry such as SfM and SLAM [10, 9], ideally one expects that 3D can be reconstructed from the abundant single-view images.

Under this setting, however, the problem is ill-posed and priors must be incorporated. Early work such as ShapeFromX [12, 1] made strong assumptions about the shape or the environment lighting conditions. [11, 18] pioneered the use of learning-based approaches for simple geometric structures. Coarse correspondences in an image collection can also be used for rough 3D shape estimation [14, 3]. As commodity 3D sensors become popular, RGB-D databases have been built and used to train learning-based systems [6, 8]. Though great progress has been made, these methods still cannot robustly reconstruct complete and high-quality shapes from single images. Stronger shape priors are missing.

Recently, large-scale repositories of 3D CAD models, such as ShapeNet [4], have been introduced. They have great potential for 3D reconstruction tasks. For example, [19, 13] proposed to deform and reassemble existing shapes into a new model to fit the observed image. These systems rely on high-quality image-shape correspondence, which is a challenging and ill-posed problem itself.

More relevant to our work is [5]. Given a single image, they use a neural network to predict the underlying 3D object as a 3D volume. There are two key differences between our work and [5]: First, the predicted object in [5] is a 3D volume, whilst ours is a point cloud. As demonstrated and analyzed in Sec 5.2, point sets form a nicer shape space for neural networks, so the predicted shapes tend to be more complete and natural. Second, we allow multiple reconstruction candidates for a single input image. This design reflects the fact that a single image cannot fully determine the reconstruction of a 3D shape.

Deep learning for geometric object synthesis  In general, the field of predicting geometry in an end-to-end fashion is largely unexplored. In particular, our output, a 3D point set, is still not a typical object in the deep learning community. A point set contains orderless samples from a metric-measure space. Therefore, equivalence classes are defined up to a permutation; in addition, the ground distance must be taken into consideration. To our knowledge, no prior deep learning system has been demonstrated to predict such objects.

3. Problem and Notations

Our goal is to reconstruct the complete 3D shape of an object from a single 2D image (RGB or RGB-D). We represent the 3D shape in the form of an unordered point set S = {(x_i, y_i, z_i)}_{i=1}^N, where N is a predefined constant. We observed that for most objects, using N = 1024 is sufficient to preserve the major structures.
Figure 2. PointOutNet structure. Top: the vanilla version; bottom: the two-prediction-branch version. In both, an encoder maps the input image and a random vector r to an embedding, from which a predictor outputs the point set (layer types: conv, deconv, fully connected; merging: set union, concatenation).

One advantage of the point set comes from its unorderedness. Unlike 2D-based representations like the depth map, no topological constraint is put on the represented object. Compared to 3D grids, the point set enjoys higher efficiency by encoding only the points on the surface. Also, the coordinate values (x_i, y_i, z_i) undergo simple linear transformations when the object is rotated or scaled, in contrast to the case of volumetric representations.

To model the problem's uncertainty, we define the ground truth as a probability distribution P(·|I) over the shapes conditioned on the input I. In training we have access to one sample from P(·|I) for each image I.

We train a neural network G as a conditional sampler from P(·|I):

    S = G(I, r; Θ)     (1)

where Θ denotes the network parameters and r ~ N(0, I) is a random variable to perturb the input¹. During test time, multiple samples of r can be used to generate different predictions.

¹ Similar to the Conditional Generative Adversarial Network [15].
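To make the sampling interface of Eq (1) concrete, here is a minimal test-time sketch under stated assumptions: `g_net` is a hypothetical stand-in for a trained G, and `r_dim` for the dimensionality of r; neither name comes from the paper.

```python
import torch

def sample_predictions(g_net, image, num_samples=5, r_dim=128):
    """Draw several plausible point clouds for one input image by
    re-sampling the random vector r ~ N(0, I) in Eq (1).
    g_net and r_dim are assumed stand-ins, not the paper's API."""
    g_net.eval()
    predictions = []
    with torch.no_grad():
        for _ in range(num_samples):
            r = torch.randn(1, r_dim)          # fresh noise per sample
            s = g_net(image.unsqueeze(0), r)   # assumed to return (1, N, 3)
            predictions.append(s.squeeze(0))   # one (N, 3) point set
    return predictions
```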
4. Approach

4.1. Overview

Our task of building a conditional generative network for point sets is challenging, due to the unordered form of the representation and the inherent ambiguity of the ground truth. These challenges have pushed us to invent a new architecture, loss function, and learning paradigm. Specifically, we have to address three subproblems:

Point set generator architecture: Networks that predict point sets are barely studied in the literature, leaving a huge open space for us to explore design choices. Ideally, a network should make the best use of its data statistics and possess enough representation power. We propose a network with two prediction branches, one enjoying high flexibility in capturing complicated structures and the other exploiting geometric continuity. See Sec 4.2.

Loss function for point set comparison: For our novel type of prediction, the point set, it is unclear how to measure the distance between the prediction and the ground truth. We propose two distance metrics for point sets, the Chamfer distance and the Earth Mover's distance. We show that both metrics are differentiable almost everywhere and can be used as the loss function, but have different properties in capturing shape space. See Sec 4.3.

Modeling the uncertainty of the ground truth: Our problem of 3D structural recovery from a single image is ill-posed, so ambiguity of the ground truth arises during train and test time. It is fundamentally important to characterize the ambiguity of the ground truth for a given input, and practically desirable to be able to generate multiple predictions. Surprisingly, this goal can be achieved tactfully by simply using the min function as a wrapper to the above proposed loss, or by a conditional variational autoencoder. See Sec 4.4.

4.2. Point Set Prediction Network

The task of building a network for point set prediction is new. We design a network with the goal of possessing strong representation power for complicated structures, and making the best use of the statistics of geometric data. To introduce our network progressively, we start from a simple version and gradually add components.

As in Fig 2 (top), our network has an encoder stage and a predictor stage. The encoder maps the input pair of an image I and a random vector r into an embedding space. The predictor outputs a shape as an N × 3 matrix M, each row containing the coordinates of one point.

The encoder is a composition of convolution and ReLU layers; in addition, the random vector r is subsumed so that it perturbs the prediction from the image I. We postpone the explanation of how r is used to Sec 4.4. The predictor generates the coordinates of N points through a fully connected network. Though simple, this version works reasonably well in practice.

We further improve the design of the predictor branch to better accommodate large and smooth surfaces, which are common in natural objects. The fully connected predictor as above cannot make full use of such natural geometric statistics, since each point is predicted independently. The
improved predictor in Fig 2 (middle) exploits this geometric smoothness property.

This version has two parallel predictor branches: a fully connected (fc) branch and a deconvolution (deconv) branch. The fc branch predicts N1 points as before. The deconv branch predicts a 3-channel image of size H × W, of which the three values at each pixel are the coordinates of a point, giving another H × W points. Their predictions are later merged to form the whole set of points in M. Multiple skip links are added to boost information flow across the encoder and predictor.

With the fc branch, our model enjoys high flexibility, showing good performance at describing intricate structures. With the deconv branch, our model becomes not only more parameter-parsimonious through weight sharing, but also more friendly to large smooth surfaces, due to the spatial continuity induced by deconv and conv. Refer to Sec 5.5 for experimental evidence.
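A minimal PyTorch sketch of this two-branch idea follows; the embedding size, layer configuration, and point counts are illustrative assumptions rather than the paper's exact architecture, and the skip links are omitted for brevity.

```python
import torch
import torch.nn as nn

class TwoBranchPredictor(nn.Module):
    """Sketch of the two-branch predictor: an fc branch for intricate
    structures plus a deconv branch for large smooth surfaces. All layer
    sizes and point counts here are illustrative assumptions."""

    def __init__(self, embed_dim=512, n_fc_points=256, h=32, w=32):
        super().__init__()
        self.h, self.w = h, w
        # fc branch: predicts each of its points independently.
        self.fc_branch = nn.Sequential(
            nn.Linear(embed_dim, 1024), nn.ReLU(),
            nn.Linear(1024, n_fc_points * 3))
        # deconv branch: a 3-channel H x W image whose pixel values are
        # (x, y, z) coordinates, so neighboring pixels form smooth patches.
        self.deconv_branch = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, 128, 4, stride=4), nn.ReLU(),      # 1 -> 4
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # 4 -> 8
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1))               # 16 -> 32

    def forward(self, embedding):                 # embedding: (B, embed_dim)
        b = embedding.size(0)
        pts_fc = self.fc_branch(embedding).view(b, -1, 3)
        img = self.deconv_branch(embedding.view(b, -1, 1, 1))
        pts_deconv = img.view(b, 3, self.h * self.w).transpose(1, 2)
        # Set union: concatenate both predictions into one N x 3 matrix M.
        return torch.cat([pts_fc, pts_deconv], dim=1)
```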
The above introduces the design of our network G in Eq 1. To train this network, however, we still need to design a proper loss function for point set prediction, and enable the role of r for multiple-candidate prediction. We explain these in the next two sections.

4.3. Distance Metric between Point Sets

A critical challenge is to design a good loss function for comparing the predicted point cloud with the ground truth. To be plugged into a neural network, a suitable distance must satisfy at least three conditions: 1) it is differentiable with respect to point locations; 2) it is efficient to compute, as data will be forwarded and back-propagated many times; 3) it is robust against a small number of outlier points in the sets (e.g., the Hausdorff distance would fail).

We seek a distance d between subsets of R^3, so that the loss function L({S_i^pred}, {S_i^gt}) takes the form

    L({S_i^pred}, {S_i^gt}) = \sum_i d(S_i^pred, S_i^gt),     (2)

where i indexes training samples, and S_i^pred and S_i^gt are the prediction and ground truth of each sample, respectively.

We propose two candidates: the Chamfer distance (CD) and the Earth Mover's distance (EMD) [17].

Chamfer distance  We define the Chamfer distance between S_1, S_2 ⊆ R^3 as:

    d_CD(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \|x - y\|_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \|x - y\|_2^2     (3)

In the strict sense, d_CD is not a distance function because the triangle inequality does not hold. We nevertheless use the term "distance" to refer to any non-negative function defined on point set pairs. For each point, the CD algorithm finds the nearest neighbor in the other set and sums the squared distances up. Viewed as a function of point locations in S_1 and S_2, CD is continuous and piecewise smooth. The range search for each point is independent, and thus trivially parallelizable. Also, spatial data structures like KD-trees can be used to accelerate the nearest neighbor search. Though simple, CD produces reasonably high-quality results in practice.
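For concreteness, a minimal NumPy/SciPy sketch of Eq (3) using a KD-tree, as the text suggests (not the paper's GPU implementation):

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(s1, s2):
    """Chamfer distance of Eq (3): for each point, the squared distance to
    its nearest neighbor in the other set, summed over both directions.
    s1: (N, 3) array, s2: (M, 3) array of point coordinates."""
    d12, _ = cKDTree(s2).query(s1)  # nearest neighbor in s2 for each x in s1
    d21, _ = cKDTree(s1).query(s2)  # nearest neighbor in s1 for each y in s2
    return np.sum(d12 ** 2) + np.sum(d21 ** 2)
```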
Earth Mover's distance  Consider S_1, S_2 ⊆ R^3 of equal size s = |S_1| = |S_2|. The EMD between S_1 and S_2 is defined as:

    d_EMD(S_1, S_2) = \min_{\phi: S_1 \to S_2} \sum_{x \in S_1} \|x - \phi(x)\|_2     (4)

where \phi: S_1 → S_2 is a bijection.

The EMD solves an optimization problem, namely the assignment problem. For all but a zero-measure subset of point set pairs, the optimal bijection \phi is unique and invariant under infinitesimal movement of the points. Thus EMD is differentiable almost everywhere. In practice, exact computation of EMD is too expensive for deep learning, even on graphics hardware. We therefore implement a (1 + ε) approximation scheme given by [2]. We allocate a fixed amount of time for each instance and incrementally adjust the allowable error ratio to ensure termination. For typical inputs, the algorithm gives highly accurate results (approximation error on the order of 1%). The algorithm is easily parallelizable on the GPU.
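For illustration, the exact EMD of Eq (4) for small point sets can be computed with the Hungarian algorithm; note the paper itself uses the faster (1 + ε) approximation of [2], not this O(s^3) solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def emd(s1, s2):
    """Exact EMD of Eq (4) for equal-size point sets via an optimal
    assignment; suitable for illustration, too slow for training.
    s1, s2: (s, 3) arrays."""
    assert len(s1) == len(s2), "EMD here assumes |S1| == |S2|"
    cost = cdist(s1, s2)                      # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)  # optimal bijection phi
    return cost[rows, cols].sum()
```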
Shape space  Despite the remarkable expressive power embedded in the deep layers, neural networks inevitably encounter uncertainty in predicting the precise geometry of an object. Such uncertainty could arise from limited network capacity, insufficient use of input resolution, or the ambiguity of the ground truth due to information loss in 3D-2D projection. Facing the inherent inability to resolve the shape precisely, neural networks tend to predict a "mean" shape averaging out the space of uncertainty. The mean shape carries the characteristics of the distance itself.

In Figure 3, we illustrate the distinct mean-shape behavior of EMD and CD on synthetic shape distributions, by minimizing E_{s~S}[L(x, s)] through stochastic gradient descent, where S is a given shape distribution and L is one of the distance functions.
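A sketch of this experiment, assuming a hypothetical `sample_shape` callable that draws one point set (as a float tensor) from the toy distribution S, and using a differentiable Chamfer loss with autograd:

```python
import torch

def chamfer(x, s):
    # Pairwise squared distances between the free point set x and a sample s.
    d2 = torch.cdist(x, s) ** 2
    return d2.min(dim=1).values.sum() + d2.min(dim=0).values.sum()

def mean_shape(sample_shape, n_points=256, steps=2000, lr=1e-2):
    """Sketch of the Fig 3 experiment: minimize E_{s~S}[L(x, s)] by SGD
    over the free point coordinates x. sample_shape is an assumed callable
    returning one (m, 3) tensor drawn from the shape distribution S."""
    x = torch.randn(n_points, 3, requires_grad=True)
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        s = sample_shape()        # one draw from the shape distribution
        loss = chamfer(x, s)      # swap in an EMD loss for the EMD mean
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()
```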
In the first and the second case, there is a single continuously changing hidden variable, namely the radius of the circle in (a) and the location of the arc in (b). EMD roughly captures the shape corresponding to the mean value of the hidden variable. In contrast, CD induces a splashy shape that blurs the shape's geometric structure. In the latter two cases, there are categorical hidden variables: which
corner the square is located at (c), and whether there is a circle beside the bar (d). To address the uncertain presence of the varying part, the minimizer of CD distributes some points outside the main body at the correct locations, while the minimizer of EMD is considerably distorted.

Figure 3. Mean-shape behavior of EMD and CD. The shape distributions are (a) a circle with varying radius; (b) a spiky arc moving along the diagonal; (c) a rectangular bar, with a square-shaped attachment allocated randomly on one of the four corners; (d) a bar, with a circular disk appearing next to it with probability 0.5. The red dots plot the mean shape calculated according to EMD and CD respectively.

Figure 4. System structure. By plugging a distributional modeling module (Mo2 or VAE) into the point set prediction network, our system is capable of generating multiple predictions.

Figure 5. Visual comparison to 3D-R2N2. Our method better preserves thin structures of the objects.

Figure 6. Quantitative comparison to 3D-R2N2. (a) Point-set based metrics CD and EMD. (b) Volumetric representation based metric 1 − IoU. Lower bars indicate smaller errors. Our method gives better results on all three metrics.
4.4. Generation of Multiple Plausible Shapes
To better model the uncertainty or inherent ambiguity (e.g., unseen parts in the single view), we need to enable the system to generate distributional output. We expect that the random variable r passed to G (see Eq (1)) will help it explore the ground truth distribution. However, naively plugging G from Eq (1) into Loss (2) to predict S_i^pred won't work, as the loss minimization will nullify the randomness. We find a practically simple and effective method for uncertainty modeling: the MoN (min of N) loss:

    \min_\Theta \sum_k \min_{1 \le j \le n} d(G(I_k, r_j; \Theta), S_k^gt),  with r_j ~ N(0, I)     (5)

By giving n chances to minimize the distance, the network learns to spread its predictions upon receiving different random vectors. In practice, we find that setting n = 2 already enables our method to explore the ground truth space.
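A sketch of one MoN training step of Eq (5), under the same hypothetical `g_net` interface as above; `distance_fn` is any differentiable point set distance (e.g., CD or EMD):

```python
import torch

def mon_loss(g_net, images, gt_point_sets, distance_fn, n=2, r_dim=128):
    """Min-of-N loss of Eq (5): for each training sample, draw n random
    vectors r_j, run the generator, and back-propagate only through the
    best of the n predictions. g_net and r_dim are assumed stand-ins."""
    total = 0.0
    for image, gt in zip(images, gt_point_sets):
        dists = []
        for _ in range(n):
            r = torch.randn(1, r_dim)               # r_j ~ N(0, I)
            pred = g_net(image.unsqueeze(0), r)[0]  # (N, 3) prediction
            dists.append(distance_fn(pred, gt))
        total = total + torch.stack(dists).min()    # min over the n draws
    return total
```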
In principle, to model uncertainty we should use generative frameworks like the conditional GAN (CGAN) [15] or the variational auto-encoder (VAE). One key element in these methods is a complementary network (the discriminator in GAN, or the encoder in VAE) that consumes input in the target modality (the 3D point set in our case) to generate a prediction or distribution parameters. However, how to feed a 3D point set to a deep neural network was still an open problem at the time of writing. Our point set representation will greatly benefit from future advances in this direction.

5. Experiment

5.1. Training Data Generation by Synthesis

To start, we introduce our training data preparation. We take the approach of rendering 2D views from CAD object models. Our models are from the ShapeNet dataset [4], which contains a large volume of manually cleaned 3D object models with textures. Concretely, we used a subset of 220K models covering 2,000 object categories. The use of synthesized data has been adopted in a number of existing works [5, 16].
For each model, we normalized the radius of its bounding hemisphere to unit 1 and aligned the ground planes. Each model was then rendered into 2D images according to the Blinn-Phong shading formula with randomly chosen environment maps. In our experiments we used a simple local lighting model for the sake of computation time. However, it is straightforward to extend our method to incorporate global illumination algorithms and more complex backgrounds.

5.2. 3D Shape Reconstruction from RGB Images

Comparison to the state of the art  We compare our work to 3D-R2N2 [5], the state of the art in deep learning based 3D object generation. 3D-R2N2 reconstructs 3D shapes from single or multi-view images into a volumetric representation. To enable the comparison, we re-trained our networks on the dataset used by 3D-R2N2's authors. The results are compared under three different metrics: CD, EMD, and IoU (intersection over union). In 3D-R2N2 only IoU values are reported, so we used the trained network provided by the authors to compute their predictions. To compute CD and EMD, their predicted and ground truth volumes are sampled by iterative farthest point sampling [7] to a discrete set of points with the same cardinality as ours. We post-processed our point set into a volumetric one with the same resolution as in 3D-R2N2 when computing IoU. Refer to the supplementary material for details.
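A sketch of the two conversion steps in this protocol, farthest point sampling [7] and voxel-grid IoU, under assumed conventions (coordinates normalized to the unit cube):

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Iterative farthest point sampling [7]: greedily pick the point
    farthest from those already chosen. points: (M, 3), returns (k, 3)."""
    chosen = [np.random.randint(len(points))]
    d = np.linalg.norm(points - points[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(d))
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(points - points[nxt], axis=1))
    return points[chosen]

def voxel_iou(points_a, points_b, resolution=32):
    """Voxelize two point sets on the same grid (coordinates assumed to
    lie in the unit cube) and report the IoU of the occupancy grids."""
    def occupancy(pts):
        idx = np.clip((pts * resolution).astype(int), 0, resolution - 1)
        grid = np.zeros((resolution,) * 3, dtype=bool)
        grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
        return grid
    a, b = occupancy(points_a), occupancy(points_b)
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()
```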
In Fig 6 we report the results of our network compared with the single-view 3D-R2N2. To determine the absolute scale of CD and EMD, we define unit 1 as 1/10 of the length of the 3D grid used to encode the ground truth shape in 3D-R2N2's dataset. Though not directly trained on IoU, our network gives significantly better performance under all three measures.

Table 1. 3D reconstruction comparison (per-category IoU). Notice that in the single view reconstruction setting we achieve higher IoU in all categories. The mean is taken category-wise. For 8 out of 13 categories, our results are even better than 3D-R2N2's given 5 views.

category   | Ours, 1 view | 3D-R2N2, 1 view | 3D-R2N2, 3 views | 3D-R2N2, 5 views
plane      | 0.601 | 0.513 | 0.549 | 0.561
bench      | 0.550 | 0.421 | 0.502 | 0.527
cabinet    | 0.771 | 0.716 | 0.763 | 0.772
car        | 0.831 | 0.798 | 0.829 | 0.836
chair      | 0.544 | 0.466 | 0.533 | 0.550
monitor    | 0.552 | 0.468 | 0.545 | 0.565
lamp       | 0.462 | 0.381 | 0.415 | 0.421
speaker    | 0.737 | 0.662 | 0.708 | 0.717
firearm    | 0.604 | 0.544 | 0.593 | 0.600
couch      | 0.708 | 0.628 | 0.690 | 0.706
table      | 0.606 | 0.513 | 0.564 | 0.580
cellphone  | 0.749 | 0.661 | 0.732 | 0.754
watercraft | 0.611 | 0.513 | 0.596 | 0.610
mean       | 0.640 | 0.560 | 0.617 | 0.631

We report the IoU value for each category as in [5]. From Table 1, we can see that for single view reconstruction the proposed method consistently achieves higher IoU in all categories. 3D-R2N2 is also able to predict 3D shapes from more than one view. On many categories our method even outperforms 3D-R2N2's predictions given 5 views.

Notice that both methods learn much more than predicting the object's class. In 3D-R2N2's dataset, for example, the average CD value from a shape to its categorical mean is 1.1, much larger than the result of any method.

We visually compare reconstruction examples in Fig 5. As stated in [5], their method often misses thin features of objects (e.g., legs of furniture). We surmise that this is due to their volumetric representation and voxel-wise loss function, which unduly punishes mispositioned thin structures. In contrast, our point-cloud based objective function encourages the preservation of fine structures and makes our predictions more structurally plausible.

In our current implementation, processing one input image takes 0.13 seconds on a laptop CPU.

5.3. Injecting Additional Information

Figure 7. Shape completion from a single RGBD image.

One interesting feature of our approach is that we can easily inject additional input information into the system. When the neural network is given RGB-D input, our system can be viewed as a 3D shape completion method. Fig 7 visualizes examples of the predictions.

The neural network successfully guesses the missing parts of the model. By using the shape priors embedded in the object repository, the system can leverage cues of both symmetry (e.g., airplanes should have symmetric sides) and functionality (tractors should have wheels). The flexible representation of the point set facilitates the resolution of the object's general shape and topology. More fine-grained methods that directly exploit local geometric cues could be
cascaded after our predictions to enrich higher-frequency details.

Figure 8. Multiple predictions for a single input image. The point sets are visualized from different viewpoints for each object to better reveal the differences.

5.4. Predicting Multiple Plausible Shapes

The randomness in our network enables prediction of different shapes given the same input image. To show this, we take the RGB image as the input. During training we handle randomness by using either the Mo2 (MoN with n = 2) or the VAE method. At test time, when the ground truth is unknown, the random numbers are sampled from the predefined distribution.

Fig 8 plots examples of the sets of predictions of our method. The network is able to reveal its uncertainty about the shape or the ambiguity in the input. Points whose positions the network is certain about move little between different predictions. Along the direction of ambiguity (e.g., the thickness of the penguin's body) the variation is significantly larger. In this figure we trained our network with Mo2 and the Chamfer distance. Other combinations of settings and methods give qualitatively similar results.

Figure 9. Visualization of the channels: the x-, y-, and z-channel images for the deconv branch and the fully connected branch.

5.5. Network Design Analysis

Effect of combining deconv and fc branches for reconstruction  We compared different designs of the neural network architecture. The performance values are reported based on our own rendered training set. As shown in Fig 12, the introduction of deconvolution significantly improves performance.

Figure 10. Visualization of points predicted by the deconvolution branch (blue) versus the fully connected branch (red).

We further visualize the outputs of the deconv branch and the fully connected branch separately to gain a better understanding of their functions. In Fig 9 the values in the x, y and z channels are plotted as 2D images for one of the models. In the deconv branch, the network learns to use the convolution structure to construct a 2D surface that warps around the object. In the fully connected branch, the output is less organized, as the channels are not ordered.

In Fig 10 we render the two sets of predictions in 3D space. The deconv branch is in general good at capturing the "main body" of the object, while the fully connected branch complements the shape with more detailed components (e.g., the tip of a gun, the tail of a plane, the arms of a sofa). This reveals the complementarity of the two branches. The predefined weight sharing and node connectivity endow the deconv branch with higher efficiency when they are congruent with the desired output's structure. The fully connected branch is more flexible, but the independent control of each point consumes more network capacity.

Analysis of distance metrics  Different choices of the loss function have distinct effects on the network's prediction pattern. Fig 13 exemplifies the difference between two networks trained with CD and EMD respectively. The network trained with CD tends to scatter a few points in its uncertain area (e.g., behind the door) but is able to better preserve the detailed shape of the grip. In contrast, the network trained with EMD produces more compact results but sometimes overly shrinks local structures. This is in line with the experiments on synthetic data.

5.6. More results and application to real world data

Fig 11 lists more example predictions on both synthetic data and real world photos. These real world photos are acquired from a viewpoint and distance that resemble the settings we used for our synthetic data. A segmentation mask is also needed to indicate the scope of the object.
Figure 11. Visualization of predictions on synthetic data (top) and real world data (bottom).

Figure 12. Comparison of different networks by Chamfer distance (CD) and Earth Mover's distance (EMD). More complex networks give better results.

Figure 13. Comparison of predictions of networks trained with CD (blue, on the left) and EMD (green, on the right).

6. Discussion

The major difficulties we faced in generating 3D point clouds, namely how to represent unordered data and how to handle ambiguity, are universal in machine learning. We hope our demonstration of single image based 3D reconstruction can help motivate further advances in these two realms.

Acknowledgements

The project received support from NSF grant IIS-1528025, the Stanford AI Lab-Toyota Center for Artificial Intelligence Research, a Samsung GRO grant, and a Google Focused Research Award.

References

[1] J. Aloimonos. Shape from texture. Biological Cybernetics, 58(5):345–360, 1988.
[2] D. P. Bertsekas. A distributed asynchronous relaxation algorithm for the assignment problem. In 24th IEEE Conference on Decision and Control, pages 1703–1704. IEEE, 1985.
[3] J. Carreira, S. Vicente, L. Agapito, and J. Batista. Lifting object detection datasets into 3D. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7):1342–1355, 2016.
[4] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An information-rich 3D model repository. Technical Report arXiv:1512.03012 [cs.GR], 2015.
[5] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. arXiv preprint arXiv:1604.00449, 2016.
[6] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.
[7] Y. Eldar, M. Lindenbaum, M. Porat, and Y. Y. Zeevi. The farthest point strategy for progressive image sampling. IEEE Transactions on Image Processing, 6(9):1305–1315, 1997.
[8] D. F. Fouhey, A. Gupta, and M. Hebert. Data-driven 3D primitives for single image understanding. In ICCV, 2013.
[9] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendón-Mancha. Visual simultaneous localization and mapping: a survey. Artificial Intelligence Review, 43(1):55–81, 2015.
[10] K. Häming and G. Peters. The structure-from-motion reconstruction pipeline: a survey with focus on short image sequences. Kybernetika, 46(5):926–937, 2010.
[11] D. Hoiem, A. A. Efros, and M. Hebert. Automatic photo pop-up. ACM Transactions on Graphics (TOG), 24(3):577–584, 2005.
[12] B. K. Horn. Obtaining shape from shading information. In Shape from Shading, pages 123–171. MIT Press, 1989.
[13] Q. Huang, H. Wang, and V. Koltun. Single-view reconstruction via joint analysis of image and shape collections. ACM Transactions on Graphics (TOG), 34(4):87, 2015.
[14] A. Kar, S. Tulsiani, J. Carreira, and J. Malik. Category-specific object reconstruction from a single image. In CVPR, 2015.
[15] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[16] D. J. Rezende, S. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3D structure from images. arXiv preprint arXiv:1607.00662, 2016.
[17] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.
[18] A. Saxena, M. Sun, and A. Y. Ng. Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):824–840, 2009.
[19] H. Su, Q. Huang, N. J. Mitra, Y. Li, and L. Guibas. Estimating image depth using shape collections. ACM Transactions on Graphics (TOG), 33(4):37, 2014.
