Chen Floor-SP Inverse CAD For Floorplans by Sequential Room-Wise Shortest Path ICCV 2019 Paper
Figure 1. The proposed system, dubbed Floor-SP, takes aligned panorama RGBD scans as input, finds room segments, solves an optimization problem to reconstruct a floorplan graph as multiple polygonal loops (one for each room), and merges them into a 2D graph via simple post-processing heuristics. The optimization is the technical contribution of the paper, which employs the room-wise coordinate descent strategy and sequentially solves shortest path problems to optimize the room structure.
Abstract

This paper proposes a new approach for automated floorplan reconstruction from RGBD scans, a major milestone in indoor mapping research. The approach, dubbed Floor-SP, formulates a novel optimization problem, where room-wise coordinate descent sequentially solves shortest path problems to optimize the floorplan graph structure. The objective function consists of data terms guided by deep neural networks, consistency terms encouraging adjacent rooms to share corners and walls, and the model complexity term. The approach does not require corner/edge primitive extraction unlike most other methods. We have evaluated our system on production-quality RGBD scans of 527 apartments or houses, including many units with non-Manhattan structures. Qualitative and quantitative evaluations demonstrate a significant performance boost over the current state-of-the-art. Please refer to our project website http://jcchen.me/floor-sp/ for code and data.

1. Introduction

Architectural floorplans play a crucial role in designing, understanding, and remodeling indoor spaces. Automated floorplan reconstruction from raw sensor data is a major milestone in indoor mapping research. The core technical challenge lies in the inference of wall graph structure, whose topology is unknown and varies per example.

Computer Vision has made remarkable progress in the task of graph inference, for instance, human pose estimation [3] and hand tracking [30]. Unfortunately, the success has been limited to the cases of fixed known topology (e.g., a human has two arms). Inference of graph structure with unknown varying topology is still an open problem.

A popular approach to graph reconstruction is primitive detection and selection [11, 27, 22], for example, detecting corners, selecting subsets of corners to form edges, and selecting subsets of edges to form regions. The major problem of this bottom-up process is that it cannot recover from a single false-negative in an earlier stage (i.e., a missing primitive). The task becomes increasingly more difficult as the primitive space grows exponentially with their degrees of freedom, especially for non-Manhattan scenes which most existing methods do not handle [11, 2, 21, 20].

This paper seeks to make a breakthrough in the domain of floorplan reconstruction with three key ideas.

• First, we start from room segmentation via instance semantic segmentation technique (we use Mask-RCNN [12]). The room segmentation reduces the floorplan graph inference into the reconstruction of multiple polygonal loops, one for each room. This reduction allows us to formulate floorplan reconstruction as sound energy optimization over multiple loops guided by room proposals.

• Second, we employ room-wise coordinate descent strategy in optimizing the objective function. By exploiting the fact that the room topology is a simple loop, our formulation finds the (near-)optimal graph structure by solving a shortest path problem for each room one by one sequentially, while enforcing consistency with the other rooms.
• Third, we utilize deep neural networks in evaluating the data terms of the optimization problem, measuring the discrepancy against the input sensor data. The data term is combined with the ad-hoc 1) consistency term, encouraging adjacent rooms to share corners and walls at the room boundaries, and 2) model complexity term, penalizing the number of corners in the graph.

We have evaluated the proposed approach on production-quality RGBD scans of 527 apartments or houses, a few times larger than the current largest database [20]. Our approach makes significant improvements over the current state-of-the-art [20]. We refer to our project website http://jcchen.me/floor-sp/ for code and data.

2. Related Works

We discuss related work in two domains: graph reconstruction and indoor scan datasets.

Graph reconstruction: Graph structure inference has been a popular field of study in Computer Vision, for instance, inferring a human body pose [3] or the semantic relationships of categories [14, 28]. In these problems, the graph topology is defined over the label space, common to all the instances (e.g., a head is always connected to a body). We here focus on graph inference problems in the context of reconstruction, where the topology varies per instance.

Room layout estimation infers a graph of architectural feature lines from a single image, where nodes are room corners and edges are wall boundaries. Most approaches assume a 3D box-room to limit the topological variations in the room layouts visible in 2D images [13, 25, 18, 5]. For a room beyond a box shape, Dynamic Programming (DP) was applied to search for an optimal room structure [8, 9]. DP was similarly used to solve for floorplans by limiting their topology to be a loop [2].

Bottom-up processing is a popular approach for graph reconstruction, where low-level primitives such as corners are detected, which are then selected to form higher-level primitives such as edges or regions. DNN-based junction detector was proposed for floorplan image vectorization [21], where a junction indicates incident edge directions in the Manhattan frame. The junction information is utilized in inferring the edges by integer programming (IP). Similarly, Huang et al. [16] uses DNN to detect junctions represented by a set of incident edge directions, and infer edges by heuristics for single-image wireframe reconstruction of man-made scenes.

While many previous works utilize RGBD scans/point clouds for high-quality indoor reconstruction [17, 19, 23, 20], FloorNet [20] is the current state-of-the-art for floorplan reconstruction task tested on large-scale indoor benchmarks. FloorNet combines DNN and IP in a bottom-up process but it has three major failure modes. First, as in any bottom-up process, missing corners in the detection phase automatically lead to missing walls and rooms in the final model. Second, false candidate primitives could lead to the reconstruction of extraneous walls and rooms. Third, to enable the usage of powerful IP, FloorNet needs to restrict the solution space to Manhattan scenes.

Structured indoor modeling by Ikehata et al. [17] is the source of inspiration for our work, which starts by room segmentation then solves shortest path problems to reconstruct room shapes followed by room merging and room addition. While their system is a sequence of heuristics for indoor modeling, our approach formulates a sound energy minimization problem to recover the floorplan structure.

Indoor scan datasets: Affordable depth sensing hardware enables researchers to build many indoor scan datasets. The ETH3D dataset contains 16 indoor scans for multi-view stereo [24]. The ScanNet dataset [6] and the SceneNN dataset [15] capture a variety of indoor scenes. However, most of their scans contain only one or two rooms, not suitable for the floorplan reconstruction problem. Matterport3D [4] builds high-quality panorama RGBD image sets for 90 luxurious houses. 2D-3D-S dataset [1] provides 6 large-scale indoor scans of office spaces by using the same Matterport system. Lastly, a large-scale synthetic dataset, SUNCG [26], offers a variety of indoor scenes.

For the floorplan reconstruction task, FloorNet [20] provides the benchmark with full floorplan annotations and the corresponding RGBD videos from smartphones for 155 residential units. This paper utilizes production-quality panorama RGBD scans for 527 houses or apartments with floorplan annotations.

3. Floor-SP: System Overview

Floor-SP turns aligned panorama RGBD images into a floorplan graph in three phases: room segmentation, room-aware floorplan reconstruction, and loop merging (See Fig. 2). This section provides the system overview with minimal details. The aligned panorama RGBD scans are first converted into 2D point-density/normal map, which is the input to Floor-SP. Unlike FloorNet [20], we focus on the wall structures, where doors/windows, icons, and room semantics can be added given proper wall structures.

Room segmentation: The input panorama scans are converted into a 4-channel 256×256 point-density/normal map in a top-down view (See Sect. 6). We utilize instance semantic segmentation technique (Mask R-CNN [12]) to find room segments given the 4-channel image. The room segments set up a good foundation for floorplan reconstruction by providing room proposals with rough shape, but they are still far away from a good floorplan graph because 1) Mask R-CNN segment has a raster representation (i.e., unknown number and placement of corners); and 2) Walls are
Figure 2. System overview: (Left) Mask-RCNN finds room segments (raster) from a top-down projection image consisting of point density
and mean surface normal, allowing us to reconstruct a floorplan as multiple room loops. (Middle) Room-wise coordinate descent optimizes
vectorized room structures one by one by minimizing the sum of data, consistency, and model complexity terms. (Right) Simple graph
merging operations combine loops into a floorplan graph structure.
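The top-down projection in the left panel (detailed under "Input processing" in Section 6) can be illustrated with a short sketch. This is a minimal pure-Python illustration, not the authors' code: the function name `top_down_density_normal_map` is hypothetical, the 2.5% margin is assumed to be relative to the bounding-box extent along each axis, and the output is returned as a density grid plus a per-pixel mean normal rather than a packed 4-channel image.

```python
def top_down_density_normal_map(points, normals, size=256, margin=0.025):
    """points, normals: sequences of (x, y, z) with Z aligned to gravity.

    Returns (density, mean_normal) as size x size nested lists.
    Assumption: the margin is a fraction of the box extent on each axis.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    pad_x = (max(xs) - min(xs)) * margin
    pad_y = (max(ys) - min(ys)) * margin
    x_lo, y_lo = min(xs) - pad_x, min(ys) - pad_y
    # Non-uniform scaling: each axis is stretched independently onto the grid.
    width = (max(xs) + pad_x - x_lo) or 1.0
    height = (max(ys) + pad_y - y_lo) or 1.0

    counts = [[0] * size for _ in range(size)]
    normal_sum = [[[0.0, 0.0, 0.0] for _ in range(size)] for _ in range(size)]
    for (x, y, _z), n in zip(points, normals):
        col = min(size - 1, int((x - x_lo) / width * size))
        row = min(size - 1, int((y - y_lo) / height * size))
        counts[row][col] += 1
        for c in range(3):
            normal_sum[row][col][c] += n[c]

    # Linear re-scaling so that the densest pixel maps to 1.0.
    peak = max(max(r) for r in counts)
    density = [[c / peak for c in r] for r in counts]
    # Average normal of the points in each pixel (zero where the pixel is empty).
    mean_normal = [
        [[s / max(counts[r][c], 1) for s in normal_sum[r][c]] for c in range(size)]
        for r in range(size)
    ]
    return density, mean_normal
```

Stacking `density` with the three components of `mean_normal` would give one possible reading of the 4-channel input; the exact channel layout is not specified in the text.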
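The room-wise optimization in the middle panel is reduced in Section 5 to a shortest path problem with per-edge weights, solved with Dijkstra's algorithm (Section 6). The sketch below illustrates that reduction under stated assumptions: the cost callables `corner_cost`, `edge_cost`, and `interior_cost` stand in for the DNN-based likelihood maps and are hypothetical, `used_corners` and `used_edges` are assumed to be pixel sets already occupied by the other loops, and the end-point pixels are assumed to be part of the Bresenham pixel set. It is not the authors' implementation.

```python
import heapq

LAMBDA = (0.2, 0.2, 100.0, 0.2, 0.1, 1.0)  # lambda_1..lambda_6 as reported in the paper


def bresenham(p, q):
    """Integer pixels on the segment p -> q, both end-points included."""
    (x0, y0), (x1, y1) = p, q
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    pixels = []
    while True:
        pixels.append((x0, y0))
        if (x0, y0) == (x1, y1):
            return pixels
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy


def edge_weight(p, q, corner_cost, edge_cost, interior_cost,
                used_corners, used_edges, lam=LAMBDA):
    """Weight of candidate edge (p, q): data, consistency, and complexity terms."""
    l1, l2, l3, l4, l5, l6 = lam
    w = l6                                   # one corner of model complexity per edge
    for c in (p, q):                         # C(e): the two end-point pixels
        w += 0.5 * l1 * corner_cost(c)       # halved: each corner is shared by two edges
        w += l4 * (0.0 if c in used_corners else 1.0)
    for px in bresenham(p, q):               # E(e): pixels along the edge
        w += l2 * edge_cost(px) + l3 * interior_cost(px)
        w += l5 * (0.0 if px in used_edges else 1.0)
    return w


def shortest_path(graph, source, target):
    """Dijkstra over graph = {node: [(neighbor, weight), ...]}; returns (cost, path)."""
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            path = [u]
            while path[-1] != source:
                path.append(prev[path[-1]])
            return d, path[::-1]
        if d > dist.get(u, float("inf")):
            continue                         # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return float("inf"), []
```

In this reading, a room loop would be recovered by running `shortest_path` between the two end-points of the fixed start-edge after removing the edges that cross the start-line; that pruning step is omitted here.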
Figure 3. Illustration of data and consistency terms. $E^{C}_{data}$ and $E^{E}_{data}$ are defined based on corner and edge likelihood maps. Blue pixels indicate lower costs in these toy examples. $E_{consis}$ counts the number of pixels used by room corners and room edges. When neighboring rooms share corners and edges as shown in (c), $E_{consis}$ goes down.

Model complexity term: $E_{model}$ is the model complexity penalty, counting the number of corners in our loops, preferring compact shapes.

$$E_{model}(L_i) = \lambda_6 \, \{\text{\# of corners in } L_i\}.$$

$\lambda_*$ are scalars defining the relative weights of the penalty terms. We found our system robust to these parameters and use the following setting throughout our experiments: $\lambda_1 = 0.2$, $\lambda_2 = 0.2$, $\lambda_3 = 100.0$, $\lambda_4 = 0.2$, $\lambda_5 = 0.1$, $\lambda_6 = 1.0$.

5. Sequential room-wise shortest path

The inspiration of our optimization strategy comes from a prior work, which solves a shortest path problem and reconstructs a floorplan as a loop [2]. This formulation considers every pixel as a node of a graph, encodes objectives into edge weights, and finds the shortest path as a loop.

Our problem solves for multiple loops over multiple rooms. We devise room-wise coordinate descent strategy that optimizes room structures one by one sequentially by reducing a room-wise coordinate descent step into a shortest path problem. While the algorithm is robust to the processing order, we visit rooms in increasing order of their areas (i.e. smaller rooms are handled first) so that we get fixed results given the same input. The optimization runs for two rounds in our experiments.

This section explains 1) Shortest path problem reduction; 2) Containment constraint satisfaction; and 3) Two approximation methods for speed-boost.

Shortest path problem reduction: The reduction process is straightforward, as our cost function is the summation of pixel-wise penalties and the number of corners. Without loss of generality, suppose we are optimizing $L_1$ while fixing the other loops. Our optimization problem is equivalent to solving a shortest path problem for $R_1$ with the following weight definition for each edge ($e$) (See the supplementary document for the derivation):

$$
\sum_{p \in C(e)} \frac{\lambda_1}{2} E^{C}_{data}(p)
+ \sum_{p \in E(e)} \left[ \lambda_2 E^{E}_{data}(p) + \lambda_3 E^{I}_{data}(p) \right]
+ \sum_{p \in C(e)} \lambda_4 \left( 1 - \mathbb{1}_C(p, L \setminus \{L_1\}) \right)
+ \sum_{p \in E(e)} \lambda_5 \left( 1 - \mathbb{1}_E(p, L \setminus \{L_1\}) \right)
+ \lambda_6 .
$$

With abuse of notation, $C(e)$ denotes the two pixels at the end-points of $e$, $E(e)$ denotes the set of pixels along $e$ obtained by Bresenham's line algorithm, and $L \setminus \{L_1\}$ denotes the set of loops excluding $L_1$.

Containment constraint satisfaction: Shortest path is a powerful formulation that searches for the optimal number and placement of corners with one caveat: An additional constraint is necessary to avoid a trivial solution (i.e., an empty loop). We use a heuristic similar in spirit to the prior work [2] to implement this constraint: "$L_i$ contains (or goes around) $R_i$". We refer the details to the supplementary document and here summarize the process.

First, we find corner candidates from the same corner likelihood map used for the data term (see Fig. 4). Second, we look at the edge likelihood map to identify a good pair of corners forming the start-edge of the loop. Third, we draw a start-line that starts from the room mask ($R_i$) and passes through the start-edge perpendicularly at its middle point. Lastly, we remove all the edges that intersect with the start-line to ensure that the path must go around $R_i$.

Note that fixing the start-edge to be part of the loop breaks the local optimality of our coordinate descent step, but works well in practice as it is not difficult to identify one wall segment with high confidence.

Bounding box approximation: We make an approximation in pruning nodes and edges to reduce the computational expenses of the shortest path algorithm (SPA). We restrict the domain of SPA, as it is wasteful to run it over an entire image domain to reconstruct one room. Given a room mask $R_i$, we apply the binary dilation 10 times to expand
the mask and find its axis-aligned bounding box with a 5-pixel margin, in which we solve SPA.

Figure 4. We solve a shortest path problem for each room, where cost functions are encoded into edge weights. In order to avoid a trivial solution (i.e., an empty graph) and enforce the path to go around the rough room segment ($R_i$), we first identify a start-edge that is a part of a room shape with high-confidence. Next, we draw a (red) start-line perpendicularly to split the domain. We prohibit crossing the start-line, assign a very high penalty for going through $R_i$, then solve for a shortest path that starts and ends at the two end-points of the start-edge.

Dominant direction approximation: Floor-SP goes beyond the conventional Manhattan assumption by allowing multiple Manhattan frames per room. We train the same DRN architecture to estimate the wall direction likelihoods in an increment of 10 degrees at every pixel. We perform a simple statistical analysis to extract four Manhattan frames (i.e., eight directions) globally, then assign its subset to each room. We allow edges only along the selected dominant directions with some tolerance on discretization errors (See the supplementary document for details).

6. System Details

Input processing: Given a set of panorama RGBD scans where the Z axis is aligned with the gravity direction, we compute the tight axis-aligned bounding box of the points on the horizontal plane. We expand the rectangle by 2.5% in each of the four directions, apply non-uniform scaling into a 256 × 256 pixel grid, and compute the point density and normal in each pixel. The point density is the number of 3D points that fall inside the pixel, which we linearly re-scale to [0.0, 1.0] so that the highest density becomes 1.0. The point normal is the average surface normal vector of the 3D points associated with the pixel.

Room segmentation: We use the publicly available Mask R-CNN implementation [7] with the default hyper-parameters except that we lower the detection threshold from 0.7 to 0.2. Given a segment from Mask R-CNN, we apply the binary erosion operation for 2 iterations with 8-connected neighborhood to obtain room segments ($R_i$).

Room-aware floorplan reconstruction: To estimate pixel-wise likelihoods for corner, edge, and edge direction, we use the official implementation of Dilated Residual Networks [29], which produces 32 × 32 feature maps. In order to produce an output in the same resolution as the input, we add 3 extra layers of residual blocks [10] with transposed convolution of stride 2 to reach the resolution of 256 × 256. For the corner likelihood supervision, we render each ground truth corner as a 7 × 7 disk. For the edge likelihood and wall-direction supervision, we draw the edge mask and direction information with a width of 5 pixels. The loss is binary cross entropy and the learning rate is 1e-4. Dijkstra's algorithm solves the shortest path problem.

Loop merging: We use simple graph merging operations to convert room loops into the final floorplan graph structure. More concretely, we denote a contiguous set of colinear line segments as a segment group. We repeatedly identify a pair of parallel segment groups within 5 pixels and snap them into a new segment group at the middle point while merging corners. After applying the edge merging to all compatible pairs, we merge corners that are within 3 pixels.

7. Experiments

We have evaluated the proposed system on 527 sets of aligned panorama RGBD scans. The average numbers of 1) input 3D points for the point-density/normal image, 2) corners in the annotations, 3) wall segments in the annotations, and 4) rooms in the annotations are 432,552, 28.87, 35.88, and 7.73, respectively. Out of 4072 rooms, 489 rooms do not follow the primary Manhattan structure of the unit. Fig. 5 shows four examples from our dataset.

Figure 5. Our dataset offers production-level panorama RGBD scans for 527 houses/apartments. We convert each scan into a point density/normal map from a top-down view, which is the input to our system. We annotated floorplan structure as a 2D polygonal graph. Note that for visualizing point-density/normal maps (the middle column), the intensity encodes the point density, and the hue/saturation encodes the 2D horizontal component of the mean surface normal.

527 units are split into 433 and 94 for training and testing, respectively. We make the test set more challenging on purpose for evaluations: 48 out of 94 testing units contain challenging non-Manhattan structure, and 199 out of 667 testing rooms follow non-Manhattan geometry.

We have implemented the proposed system in Python while using PyTorch as the DNN library. We have used a workstation equipped with an NVIDIA 1080Ti with 12GB GPU memory. We trained the Mask-RCNN for 70 epochs with a batch size of 1, and the DRNs for 35 epochs with a batch size of 4. The training of each DNN model takes at most a day. At test time, it takes about 5 minutes to process one apartment/house. The bottleneck is the construction of the graph for the shortest path problem (a CPU-intensive step).

7.1. Qualitative evaluations

Fig. 6 compares Floor-SP against the current state-of-the-art FloorNet [20] and the variants of our system. FloorNet follows a bottom-up process, where it first detects corners then uses Integer Programming to find their valid connections. FloorNet suffers from three failure modes: 1) Missing rooms due to missing corners in the first corner detection step; 2) Extraneous rooms coming from extraneous corner detections; and 3) Broken non-Manhattan structures, which become challenging due to the excessive amount of search space in Integer Programming.

The right three columns show the variants of proposed Floor-SP. The left does not have the consistency term and replaces the DNN-based data term by the ad-hoc cost functions in the prior work [2]. Our overall formulation guarantees a room reconstruction at each detected room segment, producing reasonable results. On adding our DNN-based data term $E_{data}$ (middle), per-room structure improves significantly. However, inconsistencies at the room boundaries are often noticeable. Lastly, with the addition of the consistency term (right), we see clean floorplan structures with consistent shared room boundaries.

Fig. 7 illustrates the effect of room-wise coordinate descent over multiple rounds. Red ovals indicate challenging structure causing room overlaps or holes, which are resolved after the second round of optimization.

Figure 6. Qualitative comparisons against FloorNet [20] and the variants of our approach. We select hard non-Manhattan examples here to illustrate the reconstruction challenges in our dataset. For reconstructions by Floor-SP variants, room colors are determined by corresponding room segments from Mask R-CNN. For the ground-truth and the FloorNet, colors are based on the room types.

7.2. Quantitative evaluations

We follow FloorNet [20] and define the following four metrics for the quantitative evaluations:

Corner precision/recall: We declare that a corner is successfully reconstructed if there is a ground-truth room corner within 10 pixels. When multiple corners are detected around a single ground-truth corner, we only take the closest one as correct and treat the others as false-positives.

Edge precision/recall: We declare that an edge of a graph is successfully reconstructed if its two end-points pass the corner test described above and the corresponding edge belongs to the ground-truth.

Room precision/recall: We declare that a room is successfully reconstructed if 1) it does not overlap with any other room, and 2) there exists a room in the ground-truth with intersection-over-union (IOU) score more than 0.7. Note that this metric does not consider the positioning and sharing of corners and edges.

Room++ precision/recall: We declare that a room is successfully reconstructed in this metric, if the room is connected (i.e., sharing edges) to the correct set of successfully reconstructed rooms as in the ground-truth, besides passing the above two room conditions.

Table 1. The main quantitative evaluation results. The colors cyan, orange, magenta represent the top three entries.

Method                                  Corner          Edge            Room            Room++
                                        Prec.  Recall   Prec.  Recall   Prec.  Recall   Prec.  Recall
FloorNet [20]                           95.0   76.6     94.8   76.8     81.2   72.1     42.3   37.5
Ours (w/o E_data, E_consis)             84.4   80.4     82.3   79.8     75.1   61.3     23.3   22.0
Ours (w/o E_consis)                     93.9   82.3     89.2   81.2     83.8   81.7     49.4   48.5
Ours (1st-round coordinate descent)     94.6   82.8     89.4   81.7     83.9   81.8     49.5   48.7
Ours (2nd-round coordinate descent)     95.1   82.2     90.2   81.1     84.7   83.0     51.4   50.4

Table 1 shows the main quantitative evaluations. Precision metrics on low-level primitives (i.e., corners and edges) are high for FloorNet, because this task does not require high-level structural reasoning and the majority of the corners are easy ones (e.g., Manhattan corners). On the other hand, their recall metrics are low even for low-level primitives, because some room corners do not have enough 3D points due to occlusions where DNN based corner detection fails. Floor-SP recovers such challenging corners through the sequential room-wise optimization process.

On room-level metrics, Floor-SP is consistently better than FloorNet. Furthermore, the addition of the data and consistency terms improves the room-level metrics. Finally, room-wise coordinate descent adds a further boost to the performance. The quantitative results and the visualization of all 94 test examples are in the supplementary document.
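The corner precision/recall metric defined in Section 7.2 can be sketched as follows. This is one possible reading rather than the evaluation code behind Table 1: the function name `corner_precision_recall` is hypothetical, "within 10 pixels" is assumed to mean a Euclidean distance of at most 10, and the "closest one" rule is approximated by a greedy closest-first one-to-one matching.

```python
import math


def corner_precision_recall(pred, gt, radius=10.0):
    """pred, gt: lists of (x, y) corner positions in pixels.

    Each ground-truth corner is matched to at most one predicted corner,
    closest pairs first; unmatched predictions count as false positives.
    """
    candidates = sorted(
        (math.dist(p, g), i, j)
        for i, p in enumerate(pred)
        for j, g in enumerate(gt)
        if math.dist(p, g) <= radius
    )
    used_pred, used_gt = set(), set()
    for _, i, j in candidates:
        if i not in used_pred and j not in used_gt:
            used_pred.add(i)
            used_gt.add(j)
    true_positives = len(used_gt)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gt) if gt else 0.0
    return precision, recall
```

For example, with two detections near one ground-truth corner, only the closer detection is counted as correct and the other lowers precision without affecting recall.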
References

[1] Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[2] Ricardo Cabral and Yasutaka Furukawa. Piecewise planar and compact floorplan reconstruction from images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2014.
[3] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[4] Angel X. Chang, Angela Dai, Thomas A. Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. In 2017 International Conference on 3D Vision (3DV), 2017.
[5] Yu-Wei Chao, Wongun Choi, Caroline Pantofaru, and Silvio Savarese. Layout estimation of highly cluttered indoor scenes using geometric and semantic cues. In International Conference on Image Analysis and Processing (ICIAP), 2013.
[6] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[7] pytorch-mask-rcnn. https://github.com/multimodallearning/pytorch-mask-rcnn.
[8] Alex Flint, Christopher Mei, David Murray, and Ian Reid. A dynamic programming approach to reconstructing building interiors. In European Conference on Computer Vision (ECCV), 2010.
[9] Alex Flint, David Murray, and Ian Reid. Manhattan scene understanding using monocular, stereo, and 3d features. In IEEE International Conference on Computer Vision (ICCV), 2011.
[10] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[11] Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard Szeliski. Manhattan-world stereo. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[12] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask r-cnn. In IEEE International Conference on Computer Vision (ICCV), 2017.
[13] Varsha Hedau, Derek Hoiem, and David A. Forsyth. Recovering the spatial layout of cluttered rooms. In IEEE International Conference on Computer Vision (ICCV), 2009.
[14] Hexiang Hu, Guang-Tong Zhou, Zhiwei Deng, Zicheng Liao, and Greg Mori. Learning structured inference neural networks with label relations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[15] Binh-Son Hua, Quang-Hieu Pham, Duc Thanh Nguyen, Minh-Khoi Tran, Lap-Fai Yu, and Sai-Kit Yeung. Scenenn: A scene meshes dataset with annotations. In 2016 Fourth International Conference on 3D Vision (3DV), 2016.
[16] Kun Huang, Yifan Wang, Zihan Zhou, Tianjiao Ding, Shenghua Gao, and Yi Ma. Learning to parse wireframes in images of man-made environments. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[17] Satoshi Ikehata, Hang Yang, and Yasutaka Furukawa. Structured indoor modeling. In IEEE International Conference on Computer Vision (ICCV), 2015.
[18] Chen-Yu Lee, Vijay Badrinarayanan, Tomasz Malisiewicz, and Andrew Rabinovich. Roomnet: End-to-end room layout estimation. In IEEE International Conference on Computer Vision (ICCV), 2017.
[19] Minglei Li, Peter Wonka, and Liangliang Nan. Manhattan-world urban reconstruction from point clouds. In European Conference on Computer Vision (ECCV), 2016.
[20] Chen Liu, Jiaye Wu, and Yasutaka Furukawa. Floornet: A unified framework for floorplan reconstruction from 3d scans. In European Conference on Computer Vision (ECCV), 2018.
[21] Chen Liu, Jiajun Wu, Pushmeet Kohli, and Yasutaka Furukawa. Raster-to-vector: Revisiting floorplan transformation. In IEEE International Conference on Computer Vision (ICCV), 2017.
[22] Aron Monszpart, Nicolas Mellado, Gabriel J. Brostow, and Niloy Jyoti Mitra. Rapter: rebuilding man-made scenes with regular arrangements of planes. ACM Trans. Graph., 34:103:1–103:12, 2015.
[23] Liangliang Nan and Peter Wonka. Polyfit: Polygonal surface reconstruction from point clouds. In IEEE International Conference on Computer Vision (ICCV), 2017.
[24] Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[25] Alexander G Schwing, Tamir Hazan, Marc Pollefeys, and Raquel Urtasun. Efficient structured prediction for 3d indoor scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012.
[26] Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, and Thomas A. Funkhouser. Semantic scene completion from a single depth image. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[27] Jianxiong Xiao and Yasutaka Furukawa. Reconstructing the world's museums. International Journal of Computer Vision, 110(3):243–258, 2014.
[28] Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[29] Fisher Yu, Vladlen Koltun, and Thomas A. Funkhouser. Dilated residual networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[30] Shanxin Yuan, Qi Ye, Bjorn Stenger, Siddhant Jain, and Tae-Kyun Kim. Bighand2.2m benchmark: Hand pose dataset and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.