PointNet: A 3D Convolutional Neural Network for Real-Time Object Class Recognition
Abstract—During the last few years, Convolutional Neural Networks are slowly but surely becoming the default method to solve many computer vision related problems. This is mainly due to the continuous success that they have achieved when applied to certain tasks such as image, speech, or object recognition. Despite all the efforts, object class recognition methods based on deep learning techniques still have room for improvement. Most of the current approaches do not fully exploit 3D information, which has been proven to effectively improve the performance of other traditional object recognition methods. In this work, we propose PointNet, a new approach inspired by VoxNet and 3D ShapeNets, as an improvement over the existing methods by using density occupancy grid representations for the input data, and integrating them into a supervised Convolutional Neural Network architecture. An extensive experimentation was carried out, using ModelNet – a large-scale 3D CAD model dataset – to train and test the system, to prove that our approach is on par with state-of-the-art methods in terms of accuracy while being able to perform recognition under real-time constraints.

I. INTRODUCTION

Object class recognition is one of the main challenges for a computer to be able to understand a scene. This turns out to be a key capability for autonomous robots which operate in real-world environments. Due to the unstructured nature of those environments, mobile robots need to do reasoning grounded in the real world, i.e., they must understand the information provided by their sensors. In this regard, semantic object recognition is a crucial task, among other equally important ones, which robots have to perform in order to achieve a considerable level of scene understanding [1].

The advent of reliable and low-cost range sensors, e.g., RGB-D cameras such as the Microsoft Kinect and PrimeSense Carmine, revolutionized the field by providing useful 3D data to feed prediction systems with a new dimension of useful information. Because of that, traditional 2D approaches were superseded by 3D-based ones. In addition, 3D model and scene repositories are growing on a daily basis, thus providing researchers with enough reliable data sources for training and testing object recognition systems.

The vast majority of 3D object recognition methods [2] are typically based on hand-crafted local feature descriptors [3]. These kinds of approaches rely on specific pipelines [4] consisting of a keypoint detection phase, followed by the computation of descriptors at those characteristic regions; finally, those descriptors are classified to determine the possible object they represent. That classification is performed by using distance metrics or machine learning algorithms, e.g., Support Vector Machines (SVMs) [5], random forests [6], or neural networks, which are trained with object datasets. Despite the popularity and the continuous flow of successful implementations of this kind of pipeline – especially for recognition in cluttered and occluded environments – hand-crafting feature descriptors requires domain expertise and remarkable engineering and theoretical skills, and even when both requirements are fulfilled, the results are still far from perfect. In this regard, researchers have aimed to replace those hand-engineered descriptors with multilayer neural networks able to learn them automatically by using general-purpose training algorithms. This gave birth to the application of Convolutional Neural Networks (CNNs) to image analysis: a brand-new deep learning architecture designed to process data in the form of multiple arrays, which quickly surpassed the previous methods with many practical successes [7] – mainly due to the fact that they were easy to train and generalized far better – so that CNNs have become the community standard to tackle the object class recognition problem [8]. However, only a few CNN-based systems have been proposed to use 3D data for this purpose; therefore, we strongly believe that there is still room for improvement, since most of them do not fully exploit 3D data.

In this work, we propose a new deep learning architecture for object class recognition using pure 3D information, an approach inspired by the success of recently introduced CNN-based systems for 3D object recognition. Its contribution is twofold: a novel way of representing the input data, which is based on point density occupancy grids, and its integration into a CNN specifically tuned for the aforementioned purpose.

This paper is structured as follows. Section II reviews the state of the art of deep learning methods applied to 3D object class recognition. Next, our proposal – namely PointNet – is described in Section III. Section IV defines the methodology followed to test our approach, as well as the experiments that were carried out, their results, and the corresponding discussion. At last, Section V draws some conclusions about our work and sets future research lines to improve the proposal.
II. RELATED WORK

Deep learning architectures have recently revolutionized 2D object class recognition. The most significant example of such success is the CNN architecture, with AlexNet [7] being the milestone which started that revolution. Krizhevsky et al. developed a deep learning model based on the CNN architecture that outperformed the state-of-the-art methods on the ImageNet [9] LSVRC 2012 challenge by a large margin (a 15.315% error rate against the 26.172% scored by the runner-up, which was not based on deep learning).

Due to the successful applications of CNNs to 2D image analysis, several researchers decided to take the same approach for 2.5D data, treating the depth channel as an additional one together with the RGB ones [10]–[12]. These methods simply extend the architecture to take four channels – matrices – as input instead of the three featured by RGB images. This is still an image-based approach which does not fully exploit the geometric information of 3D shapes, despite its straightforward implementation. Apart from 2.5D approaches, specific architectures to learn from volumetric data, which make use of pure 3D convolutions, have recently been developed. Those architectures are commonly referred to as 3DCNNs, and their foundations are the same as the 2D or 2.5D ones, but the nature of the input data is radically different. Since volumetric data is usually quite dense and hard to process, most of the successful 3DCNNs resort to a more compact representation of the 3D space: occupancy grids. VoxNet [13] and 3D ShapeNets [14] make extensive use of this representation.

Those 3DCNNs are slowly overtaking other approaches when applying object recognition to complete 3D scenes [15]. This progress has been mainly enabled by two factors: the substantial growth in the number of 3D models available online through repositories, and the reduction of training times thanks to frameworks and libraries which exploit the power of massively parallel architectures for this kind of task. On the one hand, there exist many collections of 3D models, but they tend to be small and usually lack annotations and other useful information for training this kind of deep architecture. In contrast, 2D approaches have taken advantage of the numerous and high-quality datasets that already exist, such as ImageNet [9], LabelMe [16], and SUN [17]. During the last years, researchers have unified efforts to create large-scale annotated 3D datasets inspired by the success of their 2D counterparts. The most popular 3D datasets which have revamped data-driven solutions – for computer vision in general, and object recognition in particular – are the Princeton ModelNet [14] and ShapeNets [18] datasets. On the other hand, the creation of deep learning frameworks such as Caffe [19], Theano [20], Torch [21], or TensorFlow [22], which allow researchers to easily express and launch their architectures and accelerate the training calculations with Graphics Processing Units (GPUs) by using CUDA or OpenCL, has enabled quick prototyping and testing. Both facts have turned out to be crucial for the development of the field.

III. APPROACH

The proposed system takes a point cloud of an object as an input and predicts its class label. In this regard, the proposal is twofold: a volumetric grid based on point density to estimate spatial occupancy inside each voxel, and a pure 3DCNN which is trained to predict object classes. The occupancy grid – inspired by VoxNet [13] occupancy models based on probabilistic estimates – provides a compact representation of the object's 3D information from the point cloud. That grid is fed to the CNN architecture, which in turn computes a label for that sample, i.e., predicts the class of the object.

This architecture was implemented using the Point Cloud Library (PCL) [23] – which contains state-of-the-art algorithms for 3D point cloud processing – and Caffe [19], a deep learning framework developed and maintained by the Berkeley Vision and Learning Center (BVLC) and an active community of contributors on GitHub1. This BSD-licensed C++ library allows us to design, train, and deploy CNN architectures efficiently, mainly thanks to its drop-in integration of NVIDIA cuDNN [24] to take advantage of GPU acceleration.

1 http://github.com/BVLC/caffe

A. Occupancy Grid

Occupancy grids [25] are data structures which allow us to obtain a compact representation of the volumetric space. They stand between meshes or clouds, which offer rich but large amounts of information, and voxelized representations with packed but poor information. At that midpoint, occupancy grids provide considerable shape cues to perform learning, while enabling an efficient processing of that information thanks to their array-like implementation.

Fig. 1: A mesh (a) is transformed into a point cloud (b), and that cloud is processed to obtain a voxelized occupancy grid (c). The occupancy grid shown in this figure is a cube of 30 × 30 × 30 voxels. Each voxel of that cube holds the point density inside its volume. In this case, dark voxels indicate high density whilst bright ones are low density volumes. Empty voxels were removed for better visualization.
Recent 3D deep learning architectures make use of occupancy grids as a representation for the input data to be learned or classified. 3D ShapeNets [14] is a Convolutional Deep Belief Network (CDBN) which represents a 3D shape as a 30 × 30 × 30 binary tensor in which a one indicates that a voxel intersects the mesh surface, and a zero represents empty space. VoxNet [13] introduces three different occupancy grids (32 × 32 × 32 voxels) that employ 3D ray tracing to compute the number of beams hitting or passing each voxel, and then use that information to compute the value of each voxel depending on the chosen model: a binary occupancy grid using probabilistic estimates, a density grid in which each voxel holds a value corresponding to the probability that it will block a sensor beam, and a hit grid that only considers hits, thus ignoring empty or unknown space. The binary and density grids proposed by Maturana et al. differentiate unknown and empty space, whilst the hit grid and the binary tensor do not. Currently, VoxNet's occupancy grid holds the best accuracy in the ModelNet challenge for the 3D-centric approaches described above. However, ray tracing grids considerably harm performance in terms of execution time, so other approaches must be considered for a real-time implementation. In that very same work, the authors show that hit grids performed comparably to the other approaches while keeping a low complexity to achieve a reduced runtime.

In this regard, we propose an occupancy grid inspired by the aforementioned successes but aiming to maintain a reasonable accuracy while allowing a real-time implementation. In our volumetric representation, each point of a cloud is mapped to a voxel of a fixed-size occupancy grid. Before performing that mapping, the object cloud is scaled to fit the grid. Each voxel holds a value representing the number of points mapped to itself. At last, the values held by each cell are normalized. Figure 1 shows the proposed occupancy grid representation for a sample object.
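A minimal sketch of that mapping is given below, using PCL types since our implementation is built on the PCL. Note that the flat indexing scheme and the divide-by-maximum normalization are our assumptions for illustration; the paper does not spell out how the per-voxel counts are normalized.

```cpp
#include <pcl/common/common.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <algorithm>
#include <vector>

// Sketch of the proposed density occupancy grid (Section III-A):
// the cloud is scaled to fit a fixed res x res x res grid, each point
// increments the counter of the voxel it falls into, and the counters
// are finally normalized.
std::vector<float> densityGrid(const pcl::PointCloud<pcl::PointXYZ>& cloud,
                               int res = 60)
{
  pcl::PointXYZ minPt, maxPt;
  pcl::getMinMax3D(cloud, minPt, maxPt);  // bounding box of the object

  // Uniform scale so the largest side of the bounding box fits the grid.
  const float side = std::max({maxPt.x - minPt.x,
                               maxPt.y - minPt.y,
                               maxPt.z - minPt.z});
  const float scale = (res - 1) / side;

  std::vector<float> grid(res * res * res, 0.0f);
  for (const auto& p : cloud) {
    const int x = static_cast<int>((p.x - minPt.x) * scale);
    const int y = static_cast<int>((p.y - minPt.y) * scale);
    const int z = static_cast<int>((p.z - minPt.z) * scale);
    grid[(z * res + y) * res + x] += 1.0f;  // count points per voxel
  }

  // Normalization: dividing by the maximum count is one plausible choice.
  const float maxCount = *std::max_element(grid.begin(), grid.end());
  if (maxCount > 0.0f)
    for (float& v : grid) v /= maxCount;
  return grid;
}
```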
B. Network Architecture

As we have previously stated, CNNs have proven to be very useful for recognizing and classifying objects in 2D images. A convolutional layer can recognize basic patterns such as corners or planes, and if we stack several of them, they can learn a hierarchy of filters that highlight regions of the images. What is more, the composition of several of these regions can define a feature of a more complex object. In this regard, a combination of various filters is able to recognize a full object. We apply this approach, used on 2D images, to 3D recognition. In this section we describe the main layers that compose PointNet, as well as their parameters.

Point Cloud Data Layer PC(v, l): This layer takes a point cloud as input and outputs an occupancy grid structure. The tasks performed by this layer are defined in Section III-A. This layer takes the following parameters:
• Voxel Grid Size (v): It defines the width, height, and depth of the occupancy grid in spatial units. A value of 300 is used to carry out the experiments in this paper.
• Leaf Size (l): It specifies the width, height, and depth of a single voxel in spatial units. A value of 5 is used for our architecture.
In sight of these parameters, an occupancy grid data structure of 60 × 60 × 60 voxels is used to process the input point clouds.

Convolution Layer C(m, n, d): These layers convolve the data in batches of n × n × n cubes with m filters and a stride of d. As a result, they provide m grids that are the response of each filter. The values of the filters are initialized with random values, and updated as they are learned in the backpropagation stage. The hyperparameters are:
• Number of filters (m): It defines how many filters the layer will learn.
• Filter size (n): It specifies the width, height, and depth of the filter.
• Stride (d): This parameter sets the offset to apply during the convolution process.

ReLU Layer R(): This layer performs the operation max(x, 0), returning the same value if the input is greater than or equal to zero, and zero if the input is negative. This layer takes no parameters. The first Inner Product layer is followed by a ReLU one, together with a dropout layer with a 0.5 ratio to avoid overfitting.

Pooling Layer P(n, d): In this layer, a max-pooling process is performed. It takes the input data and summarizes it by taking the maximum value of a fixed local spatial region which slides across the grid data structure. Its parameters are:
• Filter size (n): It specifies the width, height, and depth of the local spatial regions.
• Stride (d): This parameter sets the offset to apply in the pooling process.

Inner Product Layer IP(n): This is a fully connected layer in which every input neuron is connected to each output one through a weight; these weights are learned during the backpropagation process. Its parameter is:
• Number of outputs (n): It determines the number of output neurons.

The deep architecture featured by PointNet is represented in Figure 2. This setup allows PointNet to be on par with state-of-the-art algorithms while keeping reduced execution times.
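As a rough guide to how these layers shrink the grid (our own back-of-the-envelope arithmetic, not stated explicitly in the paper), the input resolution follows from the two Point Cloud Data Layer parameters, v / l = 300 / 5 = 60 voxels per side, and each convolution or pooling layer maps an input of side i to an output of side

o = ⌊(i − n) / d⌋ + 1.

For instance, a C(64, 3) layer with stride 1 applied to the 60 × 60 × 60 grid would yield 64 response grids of 58 × 58 × 58 voxels, and a subsequent P(2, 2) layer would halve them to 29 × 29 × 29.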
IV. EXPERIMENTATION

A. Methodology

The CAD models are provided in Object File Format (OFF). Firstly, we converted all OFF models into Polygon File Format (PLY) to ease the usage of the dataset with the PCL. As we already mentioned, the input to PointNet is a point cloud, but the dataset provides CAD models specifying vertices and faces. In this regard, we converted the PLY models into Point Cloud Data (PCD) clouds by raytracing them. A 3D sphere is tessellated into a truncated icosahedron, and a virtual camera is placed at each vertex of that polyhedron – pointing to the origin of the model – then multiple snapshots are rendered using raytracing, and the z-buffer data, which contains the depth information, is used to generate a point cloud from each point of view. After all points of view have been processed, the point clouds are merged. A voxel grid filter is then applied to downsample the merged clouds. Figure 3 illustrates the aforementioned process. After that, the resulting point clouds are used to train – randomizing the order of the models – and test the system, taking into account the corresponding splits.

Fig. 3: Dataset model processing example to generate the point clouds for PointNet. Some rendered views of a toilet model are shown in (a). The original OFF mesh is shown in (b). The generated point cloud after merging all points of view is shown in (c), and (d) shows the downsampled cloud using a voxel grid filter with a leaf size of 0.7 × 0.7 × 0.7.

2 http://modelnet.cs.princeton.edu/
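The merging and downsampling steps can be sketched with the PCL as follows. This is an illustrative reconstruction rather than the authors' code: mergeAndDownsample is our own helper name, and the 0.7 leaf size is taken from the caption of Figure 3.

```cpp
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/filters/voxel_grid.h>
#include <vector>

// Merge the clouds rendered from each viewpoint and downsample the
// result with a voxel grid filter, as described in Section IV-A.
pcl::PointCloud<pcl::PointXYZ>::Ptr
mergeAndDownsample(const std::vector<pcl::PointCloud<pcl::PointXYZ>::Ptr>& views,
                   float leaf = 0.7f)
{
  pcl::PointCloud<pcl::PointXYZ>::Ptr merged(new pcl::PointCloud<pcl::PointXYZ>);
  for (const auto& view : views)
    *merged += *view;  // concatenate the points of every rendered view

  pcl::PointCloud<pcl::PointXYZ>::Ptr downsampled(new pcl::PointCloud<pcl::PointXYZ>);
  pcl::VoxelGrid<pcl::PointXYZ> filter;
  filter.setInputCloud(merged);
  filter.setLeafSize(leaf, leaf, leaf);  // 0.7 x 0.7 x 0.7 voxels
  filter.filter(*downsampled);
  return downsampled;
}
```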
Fig. 4: Confusion matrix of the classification results achieved by PointNet after 200 training iterations using the ModelNet-10 dataset (learning rate 0.0001, momentum 0.9). Note the confusion between the Desk and Table classes. The values shown in the table are not percentages but absolute numbers of examples.
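For reference, the overall and per-class success rates discussed below follow from a confusion matrix with absolute counts like the one in Fig. 4. The helper below is a hypothetical sketch of that bookkeeping; Accuracies, fromConfusionMatrix, and cm are our names, not the authors'.

```cpp
#include <cstddef>
#include <vector>

struct Accuracies { double overall; std::vector<double> perClass; };

// cm[i][j] holds the number of samples of true class i predicted as class j.
Accuracies fromConfusionMatrix(const std::vector<std::vector<int>>& cm)
{
  Accuracies acc{0.0, std::vector<double>(cm.size(), 0.0)};
  long correct = 0, total = 0;
  for (std::size_t i = 0; i < cm.size(); ++i) {
    long row = 0;
    for (int count : cm[i]) row += count;  // all samples of class i
    correct += cm[i][i];                   // diagonal: correct predictions
    total += row;
    acc.perClass[i] = row ? static_cast<double>(cm[i][i]) / row : 0.0;
  }
  acc.overall = total ? static_cast<double>(correct) / total : 0.0;
  return acc;
}
```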
All the timings and results were obtained by performing the experiments on the following test setup: an Intel Core i5-3570 with 8 GB of 1600 MHz DDR3 RAM on an ASUS P8H77-M PRO motherboard (Intel H77 chipset). Additionally, the system includes an NVIDIA Tesla K20 GPU and a Seagate Barracuda 7200.14 secondary storage drive. Caffe RC2 was run on ElementaryOS Freya 0.3.1, an Ubuntu-based Linux distribution. It was compiled using CMake 2.8.7, g++ 4.8.2, CUDA 7.0, and cuDNN v3.

B. Results

As a result of training PointNet with a learning rate of 0.0001 and a momentum of 0.9 during 200 iterations using the ModelNet-10 dataset, it obtained a success rate of 77.6%. As shown in Figure 4, the confusion matrix reveals the stability of the system, which mainly confuses items that look alike, such as desks and tables. Because of the nature of CNNs, which heavily rely on detecting combinations of features, these kinds of errors are common. As we can observe in Figure 5, the visual features that define a desk and a table are almost the same, making them hard to distinguish. Figure 6 shows the neuron activations for the output layer of the architecture, proving that Desk and Table are consistently confused during the tests.

Fig. 5: (a) Table. (b) Desk.

In light of these experiments, and taking into account the knowledge of CNN principles, it is conceivable to think that a deeper network would provide better results, so more experiments were carried out.

In the deeper network experiment we added several layers to the PointNet architecture. One more convolutional layer was added: since these layers are coupled to the detection of the features of the objects, the more layers there are, the better or more expressive the resulting model. An Inner Product layer was also added: since these layers make the classification possible, adding more of them would theoretically provide better classification results. The proposed architecture for this experiment is described as follows: PC(300,5) - C(64,3) - R() - P(3,2) - C(160,3,2) - R() - P(2,2) - C(160,3,2) - R() - P(2,2) - IP(1000) - IP(1000) - IP(10).

This architecture was trained during 1,000 iterations and tested every 200 iterations. The best result was provided by the 800-iteration test with an accuracy of 76.7%, while the 1,000-iteration test dropped the performance to 75.9% due to overfitting.

It is well known that training using an unbalanced dataset tends to harm those classes with the least number of examples and to benefit those with the most, as stated by [26]. Having this in mind, and knowing that ModelNet-10 is highly unbalanced as shown in Table I, the dataset was balanced by limiting the number of examples of each class to 400 using random undersampling. This does not fully solve the problem, but it reduces the gap between the classes with the least number of examples and those with the most.

The network was trained and tested with this more balanced dataset using the architecture defined in Section III-B, and it achieved an accuracy of 72.9%. The fact is that balancing the training set raises the accuracy of the classes with fewer examples, but it harms the success rate on classes with more, as seen in Figure 7.
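The balancing step is simple enough to sketch. The following is a hypothetical illustration of per-class random undersampling with a 400-model cap; Sample, undersample, and the fixed seed are our own inventions rather than the paper's code.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <random>
#include <string>
#include <vector>

// Placeholder for a training example: a model path plus its class label.
struct Sample { std::string path; std::string label; };

// Keep at most `cap` randomly chosen models per class (400 in the paper).
std::vector<Sample> undersample(const std::vector<Sample>& dataset,
                                std::size_t cap = 400,
                                unsigned seed = 42)
{
  std::map<std::string, std::vector<Sample>> byClass;
  for (const auto& s : dataset)
    byClass[s.label].push_back(s);

  std::mt19937 rng(seed);
  std::vector<Sample> balanced;
  for (auto& entry : byClass) {
    auto& samples = entry.second;
    std::shuffle(samples.begin(), samples.end(), rng);  // random subset
    const std::size_t keep = std::min(cap, samples.size());
    balanced.insert(balanced.end(), samples.begin(), samples.begin() + keep);
  }
  std::shuffle(balanced.begin(), balanced.end(), rng);  // randomize model order
  return balanced;
}
```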
Fig. 7: Comparison of accuracy per class using an unbalanced dataset and a balanced one with a maximum of 400 models per class. Accuracy is lost in the classes from which models are removed but gained otherwise.
Fig. 6: Neuron activations for the output layer of the architecture when classifying all the test samples for both the Desk (a) and Table (b) classes. Each row represents an activation vector for a specific sample, so each column is a position of the vector: the activation for that particular class. The first column corresponds to the Desk class, while the second one is the Table. The activations show the clear confusion between Desk and Table. Although the latter is much less confused with other classes, many Tables are misclassified as Desks, thus lowering the accuracy for this class.

V. CONCLUSION

In this paper we proposed PointNet, a brand new kind of CNN for object class recognition that handles three-dimensional data, inspired by VoxNet and 3D ShapeNets but using density occupancy grids as the inner representation for the input data. It was implemented in Caffe and provides a faster method than the state-of-the-art ones while still obtaining a high success rate, as the experiments over the ModelNet-10 dataset showed with a 77.6% success rate. In addition, PointNet featuring the architecture exposed in Figure 2 takes an average time of 24.6 milliseconds to classify an example (in comparison with VoxNet, which can take up to half a second for its raytracing-based implementation). These results prove the system to be a fast and accurate 3D object class recognition tool, and they point to a promising future in real-time 3D recognition tasks.

Following on this work, we plan to improve the inner representation by using adaptable occupancy grids instead of fixed-size ones. In addition, we will integrate the system into an object recognition pipeline for 3D scenes. Our network will receive a point cloud segment of the scene where the object lies, produced by a preprocessing method, and those segments will be used to generate the occupancy grids that will be learned by the system. This implies adapting the system for learning partial views of the objects and dealing with occlusions and scale changes. As an additional feature, we will include pose estimation in that pipeline, all of this with the goal of developing an end-to-end 3D object recognition system.

ACKNOWLEDGMENTS

This work has been supported by the Spanish Government DPI2013-40534-R grant for the SIRMAVED project3 [27], supported with FEDER funds. This work has also been funded by the grant "Ayudas para Estudios de Máster e Iniciación a la Investigación" from the University of Alicante. Experiments were made possible by a generous hardware donation from NVIDIA (Tesla K20).

3 http://www.iuii.ua.es/SIRMAVED/?idioma=en