Geometric Deep Learning
1 Introduction
1.1 Course goals
1.2 Motivation
1.3 A brief history
1.4 The challenges of geometric deep learning
Introduction
1.2 Motivation
The past decade of computer vision research has witnessed the re-emergence of "deep learning", and in particular of convolutional neural network (CNN) techniques, which make it possible to learn powerful image feature representations from large collections of examples. CNNs have achieved breakthrough performance in a wide range of applications such as image classification, segmentation, detection, and annotation. Nevertheless, when attempting to apply the CNN paradigm to 3D shapes (feature-based description, similarity, correspondence, retrieval, etc.) one has to face fundamental differences between images and geometric objects. Shape analysis and geometry processing pose new challenges that are non-existent in image analysis, and deep learning methods have only recently started penetrating into the 3D shape community. CNNs have been applied to 3D data in recent works using standard (Euclidean) CNN architectures operating on volumetric or view-based shape representations. Intrinsic versions of CNNs have also been proposed very recently, generalizing the CNN paradigm to non-Euclidean manifolds and thus allowing them to deal with shape deformations. These "generalized" CNNs can be used to learn invariant shape features and correspondences, achieving state-of-the-art performance in several shape analysis tasks while accommodating different shape representations, e.g. meshes, point clouds, or graphs. The purpose of this short course is to give an overview of the foundations and the current state of the art of learning techniques for 3D shape analysis and geometry processing. Special focus will be placed on deep learning techniques (CNNs) applied to Euclidean and non-Euclidean manifolds for tasks of shape classification, retrieval, reconstruction, and correspondence. The course will present the problems of shape analysis in a new light, emphasizing the analogies and differences with
the classical 2D setting, and showing how to adapt popular learning schemes in order to
deal with deformable shapes.
These course notes assume no particular background beyond basic working knowledge that is a common denominator for people in the field of computer graphics. All the necessary notions and mathematical foundations will be described. The course is targeted at graduate students, practitioners, and researchers interested in shape analysis, synthesis, matching, retrieval, and big data.
Figure 1.1: Left: extrinsic methods such as volumetric CNNs [100] treat 3D geometric data
in its Euclidean representation. Such a representation is not invariant to deformations (e.g.,
in the shown example, the filter that responds to features on a straight cylinder would not
respond to a bent one). Right: in an intrinsic representation, the filter is applied on the
surface itself, thus being invariant to deformations.
In recent years we have experienced a paradigm shift from axiomatic modeling to data-driven modeling, due to the unprecedented performance of deep neural networks in tasks such as image and speech recognition, machine translation, and image segmentation and detection.
The main reasons for this success are: (1) the growing computational power of modern GPU-based computers, (2) larger annotated datasets, and (3) more efficient stochastic optimization methods and architectural changes that make it possible to effectively train much more complex networks with many layers and degrees of freedom.
These factors made the qualitative breakthrough possible, and the widespread interest from various communities, ranging from pure machine learning to system design, rapidly brought deep learning into commercial applications, including Siri speech recognition, Google text translation, and the Mobileye autonomous driving technology.
A key aspect of deep learning (DL) systems is that they learn representations directly from the raw input data, without requiring any hand-crafted feature extraction stage. They perform representation learning at various levels of abstraction, in a hierarchical fashion, by composing simple non-linear building blocks usually referred to as layers. This composition results in very powerful models which can leverage the large quantities of annotated data available today.
This is why we have experienced such an outstanding improvement in tasks where all machine learning methods used to rely on a feature extraction stage, namely speech signals, images, and everything related to perceptual data.
This is in sharp contrast with the now old paradigm of machine learning, where features had to be provided by the user. During the neural network "dark age", when methods such as the SVM defined the gold standard for classification, very little effort was devoted to tweaking their parameters; most of the work was devoted instead to designing a good feature extraction strategy. As an example, the typical computer vision recognition pipeline before DL consisted of computing SIFT descriptors, performing sparse vector quantization and pooling, and finally feeding everything to an SVM classifier. Of course, all of these stages are completely arbitrary and sub-optimal, to say the least. With DL one needs very few assumptions about what is good for a given task; in most cases it is enough to have the input image and its annotation, leaving to the data-driven approach the task of finding the right features and the right feature aggregation strategy.
The ability to learn representations directly from the data with very little prior knowledge has allowed many machine learning techniques to shine, and to already surpass human performance in some specific domains.
According to how many processing steps separate the input from the output of a given model, we distinguish shallow and deep models. SVMs are shallow models, as they can be mimicked by a single-layer neural network (implementing the kernel function), whereas models with more than a single layer start to be called deep. However, over the years the term has had various meanings; currently it is used for architectures with at least ten layers.
Before introducing neural networks, we first review the popular random forest paradigm. Despite being composed of several decision steps, the random forest model is still considered shallow, as the decisions at each level are taken in the original input space. In deep networks, by contrast, each stage operates on the output of the previous one, thereby performing feature learning.
Randomness is injected into the training of each tree, for example by using a random sample of all the training data, or by selecting the optimal parameters at each node from a pool of randomly generated values. The training process is repeated independently for each tree in the forest, leading to component trees that are different from each other. This, in turn, leads to robustness to noisy data and de-correlation between the individual tree predictions, and thus to improved generalization [21].
At test time, a previously unseen data point is routed through each tree in the forest, undergoing a number of predefined tests. When the data point reaches a leaf, it is assigned a class label (or a distribution over the space of labels). The final forest prediction can be obtained, for example, by simply averaging the class posteriors of the individual trees. Note that, differently from the training process, the testing phase is completely deterministic once the trees are fixed.
A fully connected layer computes g(x) = σ(W f(x) + b), where σ is a point-wise non-linearity such as the ReLU, σ(x) = max(0, x). Other common choices are s-shaped functions such as the hyperbolic tangent and the logistic function. The learnable parameters are the projection matrix W and the bias vector b, which for simplicity are concatenated into the set Θ = {W, b}.
A deep feed-forward network is therefore a complex non-linear function obtained by composing several such fully connected layers:

F_Θ(x) = (f_{θ_n} ∘ f_{θ_{n−1}} ∘ · · · ∘ f_{θ_1})(x),

where Θ = {θ_i | i = 1, . . . , n} is the set of all parameters of the utilized layers.
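The following NumPy sketch (with illustrative layer sizes and ReLU non-linearities) makes the composition explicit:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def fc_layer(f, W, b):
    # one fully connected layer: point-wise non-linearity after an affine map
    return relu(W @ f + b)

rng = np.random.RandomState(0)
sizes = [16, 32, 32, 10]             # input, two hidden layers, output
Theta = [(0.1 * rng.randn(m, n), np.zeros(m))
         for n, m in zip(sizes[:-1], sizes[1:])]

def network(x):
    # F_Theta = f_{theta_n} o ... o f_{theta_1}
    for W, b in Theta:
        x = fc_layer(x, W, b)
    return x

print(network(rng.randn(16)).shape)  # (10,)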
In this document we will focus only on supervised feed-forward learning methods, the most widely used approach in computer vision tasks, where the desired targets are known in advance. When the input-to-output mapping involves a long sequence of intermediate transformations, the models are said to be deep; otherwise they are referred to as shallow. It is important to note that the meaning of the word deep has greatly changed over the past few years, given the recent advances in the field.
Unsupervised learning From a machine learning standpoint, a very interesting and relatively unexplored area is that of unsupervised learning. In this setting the model sees only the input signal and aims at modeling the distribution that generated the data: the unknown function generating the inputs x_i is itself the target of the learning process. Unsupervised models usually aim at reconstructing the input while limiting the representational power of the model, for example via sparsity penalties. Models in this family are called autoencoders. Another very popular paradigm for unsupervised learning, which does not rely on the ill-posed reconstruction cost, is that of Predictability Minimization (PM) and Generative Adversarial Networks (GANs), which are able to sample very realistic natural images and whose representations are often competitive with fully supervised ones. A GAN model is composed of two networks, the generator and the discriminator. The generator is trained to produce plausible input samples from some input distribution, for example Gaussian noise, so as to make the discriminator fail. The discriminator is trained to distinguish samples from the training set from samples produced by the generator. The result is a system able to learn, with no supervision, features that compete with the ones obtained by fully supervised techniques, and with quite impressive generative capabilities, in particular for scene and face generation when the generator is a deep convolutional network [31, 69].
For classification tasks, where each sample can be assigned to one of K mutually exclusive classes, the common choice is the multinomial logistic loss

L_Θ(x_i, t_i) = −∑_{k=1}^{K} (t_i)_k log(F_Θ(x_i))_k,    (2.5)

where t_i is one-hot encoded (a vector containing a 1 at the position of the ground-truth label and 0 otherwise) and the output of the network F_Θ(x_i) is K-dimensional with softmax activation; this allows interpreting the output as the posterior probability of the class given the input, p(F_Θ(x_i) = t_i | x_i, Θ).
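A minimal NumPy sketch of this loss, with placeholder logits and a one-hot target:

import numpy as np

def softmax(z):
    z = z - z.max()                  # for numerical stability
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])       # network outputs (logits), K = 3 classes
t = np.array([1.0, 0.0, 0.0])        # one-hot ground-truth label

F = softmax(z)                       # posterior p(class | input)
loss = -np.sum(t * np.log(F))        # reduces to -log F_k at the true class
print(loss)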
The minimization of the loss function is performed through some form of gradient descent, and the gradient of the parameters of the network is computed with the backpropagation algorithm [72, 99].
LeCun and colleagues [50] later introduced gradient-based learning and have advanced the state of the art for this class of models since then. What is now commonly known as a convolutional network is essentially this latter variant of LeCun et al.
The convolutional layer C acts on a p-dimensional input f(x) = (f_1(x), . . . , f_p(x)) and applies a filter bank W = (w_{l,l′} ∈ R^{r×c}), where r and c denote the number of rows and columns of the convolutional kernel, and l′ and l index the input and output maps, respectively:

g_l(x) = ∑_{l′=1}^{p} (f_{l′} ⋆ w_{l,l′})(x).    (2.6)
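The following sketch implements (2.6) with SciPy; the numbers of maps and the kernel size are illustrative assumptions:

import numpy as np
from scipy.signal import convolve2d

rng = np.random.RandomState(0)
p, q, r, c = 3, 8, 5, 5                 # input maps, output maps, kernel size
f = rng.randn(p, 32, 32)                # p-dimensional input image
W = 0.1 * rng.randn(q, p, r, c)         # filter bank w_{l,l'}

def conv_layer(f, W):
    q = W.shape[0]
    g = np.zeros((q,) + f.shape[1:])
    for l in range(q):                  # each output map sums over all input maps
        for lp in range(f.shape[0]):
            g[l] += convolve2d(f[lp], W[l, lp], mode="same")
    return g

print(conv_layer(f, W).shape)           # (8, 32, 32)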
Pre-training Several works have shown, in the early days of deep learning, that initializing the weights with those obtained via unsupervised learning is beneficial and can lead to improved and more robust results [27, 60, 51, 38, 93]. This process, however, involves a first training phase which can take a lot of time, has diminishing returns as the amount of labeled samples increases, and has become less popular in practical applications. What is common in computer vision, though, is to take a model trained to classify objects in ImageNet [23], such as VGG [81] or Inception [89], and to use its features or predictions for other higher-level tasks.
Initialization is a critical point for deep networks, and finding good initializations was one of the strong arguments for pre-training. While initializing the weights from a zero-mean normal distribution with small standard deviation is often sufficient to achieve good convergence properties, several authors have shown that ad-hoc initializations are often more powerful and can lead to better solutions [30, 35].
Weight decay is perhaps the most popular regularization technique for deep learning. It simply penalizes weights with large magnitude, thus favoring smooth solutions, and is defined by the regularizer ‖Θ‖₂².
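A minimal sketch of how the penalty enters the loss and its gradient (the weights and the trade-off coefficient are placeholders):

import numpy as np

lam = 1e-4                       # regularization strength (placeholder)
W = np.random.randn(10, 16)      # some weight matrix of the network

def regularized_loss(data_loss, W, lam):
    return data_loss + lam * np.sum(W ** 2)      # add lam * ||W||_2^2

def regularized_grad(data_grad, W, lam):
    # d/dW [lam * ||W||^2] = 2 * lam * W: each update shrinks the weights
    return data_grad + 2.0 * lam * W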
uncertainty, and the energy E of each convolutional layer is computed as

E(v, h) = −∑_f ∑_j ( h_j^f (W^f ⋆ v)_j + c^f h_j^f ) − ∑_l b_l v_l,    (3.1)

where v_l is a visible unit, h_j^f a hidden unit in the feature channel f, W^f a convolutional filter, and ⋆ the convolution operation. To help with the reconstruction, each visible unit v_l is associated with a unique bias term b_l, and all hidden units h_j^f in the same convolution channel share the same bias term c^f. The extracted features can then be used for classification, recognition, and reconstruction tasks; similar to classic 2D CNNs, they significantly outperform existing methods, as shown in [100].
Figure 3.2: A neural network can be trained to extract a feature descriptor and predict
the corresponding segmentation label on the human body surface for each point in the
input depth maps. Per-vertex descriptors are generated for 3D models by averaging the
feature descriptors in their rendered depth maps. The extracted features can then be used
for example to compute dense correspondences.
Given depth maps I_1, I_2 of two humans, one important application is to determine which regions R_i ⊂ I_i of the two depth maps come from corresponding parts of the body, and to find the correspondence map φ : R_1 → R_2 between them. The strategy is to formulate the correspondence problem as a classification problem: first, a feature descriptor f : I → R^d, which maps each pixel in a single depth image to a feature vector, is learned. These feature descriptors are then used to establish correspondences across depth maps (see Figure 3.2). The descriptor should satisfy two properties:
1. f depends only on the pixel location on the human body, so that if two pixels are
sampled from the same anatomical location on depth scans of two different humans,
their feature vector should be nearly identical, irrespective of pose, clothing, body
shape, and angle from which the depth image was captured;
2. kf (p) − f (q)k is small when p and q represent nearby points on the human body, and
large for distant points.
The literature takes two different approaches to enforcing these properties when learning descriptors with convolutional neural networks. Direct methods include in their loss function terms penalizing the failure of these properties (using e.g. Siamese or triplet-loss energies). However, it is not trivial to sample a dense set of training pairs or triplets that can all contribute to training [76]. Indirect methods instead optimize the network architecture to perform classification. The network consists of a descriptor extraction tower and a classification layer; peeling off the classification layer after training leaves the learned descriptor network (for example, many applications use descriptors extracted from the second-to-last layer of AlexNet). This approach works because classification networks tend to assign similar (dissimilar) descriptors to input points belonging to the same (different) class, and thus satisfy the above properties implicitly. Here an indirect approach is taken, as experiments suggest that an indirect method using an ensemble of classification tasks has better performance and computational efficiency.
Figure 3.3: To ensure smooth descriptors, [97] defines a classification problem for multiple
segmentations of the human body. Nearby points on the body are likely to be assigned
the same label in at least one segmentation.
Here w_i and w denote the parameters of the classification layer of the i-th problem and of the descriptor extraction tower, respectively. Descriptor learning is defined as minimizing a combination of the loss functions of all classification problems:

{w_i^⋆}, w^⋆ = arg min_{{w_i}, w} ∑_{i=1}^{M} L(w_i, w).    (3.2)
After training, the optimized descriptor extraction tower becomes the output. It is easy to
see that when wi , w are given by convolutional neural networks, Eq. (3.2) can be effectively
optimized using stochastic gradient descent through back-propagation.
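The following schematic NumPy sketch of objective (3.2) uses a toy linear "tower" and random data purely for illustration; it is not the actual architecture of [97]:

import numpy as np

rng = np.random.RandomState(0)
d, n_feat, M, K = 16, 32, 4, 10
w = 0.1 * rng.randn(d, n_feat)                    # shared descriptor "tower"
w_i = [0.1 * rng.randn(K, d) for _ in range(M)]   # one classifier per task

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def total_loss(x, labels):
    desc = np.tanh(w @ x)                # shared descriptor of the input pixel
    # sum of the per-task classification losses, as in (3.2)
    return sum(-np.log(softmax(w_i[i] @ desc)[y]) for i, y in enumerate(labels))

print(total_loss(rng.randn(n_feat), labels=[3, 1, 7, 0]))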
To address the challenge of heterogeneous training sets, two types of classification tasks are included in this ensemble: one for classifying key points, used for inter-subject training where only sparse ground-truth correspondences are available, and one for classifying dense pixel-wise labels, e.g., obtained by segmenting models into patches (see Figure 3.3), used for intra-subject training. Both contribute to the learning of the descriptor extraction tower.
To ensure descriptor smoothness, instead of introducing additional terms in the loss
function, a simple yet effective strategy is proposed that randomizes the dense-label
generation procedure. Specifically, as shown in Figure 3.3, multiple segmentations of the
same person are considered, and a classification problem for each is introduced. Clearly,
identical points will always be associated with the same label and far-apart points will be
associated with different labels. Yet for other points, the number of times that they are
associated with the same label is related to the distance between them. Consequently, the similarity of the feature descriptors is correlated with the distance between the points on the human body, resulting in a smooth embedding satisfying the desired properties discussed at the beginning of the section.
layer          0      1     2    3     4    5       6     7    8       9     10
type           image  conv  max  conv  max  2xconv  conv  max  2xconv  int   conv
filter-stride  -      11-4  3-2  5-1   3-2  3-1     3-1   3-2  1-1     -     3-1
channel        1      96    96   256   256  384     256   256  4096    4096  16
activation     -      relu  lrn  relu  lrn  relu    relu  idn  relu    idn   relu
size           512    128   64   64    32   32      32    16   16      128   512
num            1      1     4    4     16   16      16    64   64      1     1

Table 3.1: The end-to-end network architecture generates a per-pixel feature descriptor and a classification label for all pixels in a depth map simultaneously. Rows from top to bottom: the layer index and type, the filter size and stride, the number of filters, the type of the activation function, the size of the image after filtering, and the number of copies reserved for up-sampling.
extracted by fitting a 3D template model to an unstructured input scan. With the popu-
larization of commodity 3D scanners and recent advances in correspondence algorithms
for deformable shapes, human bodies can now be easily digitized [53, 63, 26] and their
performances captured using a single RGB-D sensor [52, 91].
Most techniques are based on robust non-rigid surface registration methods that can
handle complex skin and cloth deformations, as well as large regions of missing data due
to occlusions. Because geometric features can be ambiguous and difficult to identify and
match, the success of these techniques generally relies on the deformation between source
and target shapes being reasonably small, with sufficient overlap. While local shape
descriptors [73] can be used to determine correspondences between surfaces that are far
apart, they are typically sparse and prone to false matches, which require manual clean-
up. Dense correspondences between shapes with larger deformations can be obtained
reliably using statistical models of human shapes [5, 9], but the subject has to be naked [8].
For clothed bodies, the automatic computation of dense mappings [46, 54, 71, 16] has been demonstrated on full surfaces with significant shape variations, but is limited to compatible or zero-genus surface topologies.
The problem of estimating accurate dense correspondence between deformable partial
shapes has been investigated recently in the works of Rodolà and colleagues [70, 20, 55],
and has been the target of two recent SHREC’16 Correspondence benchmarks [19, 49].
Despite the recent interest, however, this problem has remained relatively unexplored in
the literature due to the several challenges it entails.
Wei and colleagues [97] recently introduced a deep neural network structure for com-
puting dense correspondences between shapes of clothed subjects in arbitrary complex
poses. The input surfaces can be a full model, a partial scan, or a depth map, maximiz-
ing the range of possible applications. Their system is trained with a large dataset of
depth maps generated from the human bodies of the SCAPE database [5], as well as
from clothed subjects of the Yobi3D [2] and MIT [94] dataset. While all meshes in the
SCAPE database are in full correspondence, they manually labeled the clothed 3D body
models. They combined both training datasets and learned a global feature descriptor using
a network structure that is well-suited for the unified treatment of different training data
(bodies, clothed subjects). Similar to the unified embedding approach of FaceNet [76], the
AlexNet [48] classification network can be used to learn distinctive feature vectors for dif-
ferent subregions of the human body. While the performance of this dense correspondence
computation is comparable to state of the art techniques between two full models, they
also demonstrate that learning shape priors of clothed subjects can yield highly accurate
matches between partial-to-full and partial-to-partial shapes. Their examples include fully clothed individuals in a variety of complex poses, and the effectiveness of this approach has been demonstrated in a template-based performance capture application that uses a single RGB-D camera as input.
Comparisons. Surface matching techniques which are not restricted to naked human
body shapes are currently the most suitable solutions for handling subjects with clothing.
Though robust to partial input scans such as single-view RGB-D data, cutting edge non-
rigid registration techniques [42, 52] often fail to converge for large scale deformations
without additional manual guidance as shown in Figure 3.4. When both source and
target shapes are full models, an automatic mapping between shapes with considerable
deformations becomes possible as shown in [46, 54, 71, 16]. This method is compared with
the recent work of Chen et al. [16] and computes correspondences between pairs of scans
sampled from the same (intra-subject) and different (inter-subject) subjects. Chen et al.
evaluate a rich set of methods on randomly sampled pairs from the FAUST database [9] and report state-of-the-art results for their method. For a fair comparison, the method described here is evaluated on the same set of pairs. As shown in Table 3.2, the learning-based method improves the average accuracy for both the intra- and the inter-subject pairs. Note that by using a simple AlexNet structure an average accuracy of 10 cm can be achieved; however, if multiple segmentations are not adopted to enforce smoothness, the worst average error can be up to 30 cm.

Table 3.2: Comparison between the method described in this section and the recent work of Chen et al. [16], computing correspondences for intra- and inter-subject pairs from the FAUST data set. We show the average error on all pairs (AE, in centimeters) and the average error on the worst pair for each technique (WE, in centimeters). While the learning-based technique may give a worse WE, overall accuracies are improved in both cases.

[Figure 3.4: source/target comparisons of [Wei et al. 16], [Li et al. 09], and [Huang et al. 08].]
Limitations. Like any supervised learning approach, this framework cannot handle arbitrary shapes, as the prior is entirely based on the class of training data. Despite superior performance compared to the state of the art, the current implementation is far from perfect. For poses and clothing that are significantly different from those in the training data set, this method still produces wrong correspondences. However, the outliers are often grouped together due to the enforced smoothness of the embedding, which could be advantageous for outlier detection. Due to the limited memory capacity of existing GPUs, this approach requires downsizing of the training input, and hence the correspondence resolution is limited to 512 × 512 depth map pixels.
Performance. The experiments shown here were performed on a 6-core Intel Core i7-5930K processor with 3.9 GHz and 16 GB RAM. Both offline training and online correspondence computation run on an NVIDIA GeForce TITAN X (12 GB GDDR5) GPU. While the complete training of the neural network takes about 250 hours of computation, the extraction of all the feature descriptors never exceeds 1 ms per depth map. The subsequent correspondence computation with these feature descriptors takes between 0.5 and 1 s, depending on the resolution of the input data.
Figure 3.5: Sparse key point annotations of 33 landmarks across clothed human models of
different datasets.
Extraction tower. The descriptor extraction tower takes a depth image as input and extracts a d-dimensional (d = 16 in [97]) descriptor vector for each pixel. A popular choice is to let the network extract each pixel descriptor using a neighboring patch (cf. [33, 104]). However, such a strategy is too expensive in this setting, as it would have to be repeated for tens of thousands of patches per scan.
The strategy is instead to design a network that takes the entire depth image as input and simultaneously outputs a descriptor for each pixel. Compared with the patch-based strategy, the computation of patch descriptors is largely shared among adjacent patches, making descriptor computation fairly efficient at test time.
Table 3.1 describes the network architecture. The first 7 layers are adapted from the AlexNet architecture. Specifically, the first layer downsamples the input image by a factor of 4. This downsampling not only makes the computations faster and more memory efficient, but also removes the salt-and-pepper noise which is typical in the output of depth cameras. Moreover, a strategy similar to the one described in [77] is adopted to modify the pooling and inner product layers so that the original image resolution can be recovered through upsampling. The final layer performs upsampling using neighborhood information in a 3-by-3 window. This upsampling implicitly performs linear smoothing between the descriptors of neighboring pixels. It would be possible to further smooth the descriptors of neighboring pixels in a post-processing step but, as shown in the results, this is not necessary, since the network is capable of extracting smooth and reliable descriptors.
Classification module. The classification module receives the per-pixel descriptors and predicts a class for each annotated pixel (i.e., either key points in the 33-class case or all pixels in the 500-class case). Note that one classification layer is introduced for each segmentation of each person in the SCAPE and MIT datasets, and one shared layer for all the key points. Similar to AlexNet, softmax is used when defining the loss function.
Training. The network is trained using a variant of stochastic gradient descent. Specif-
ically, a task (i.e., key points or dense labels) is randomly picked for a random partial
scan and fed into the network for training. If the task is dense labels, a segmentation is
randomly selected among all possible segmentations. The network can be tuned with
200,000 iterations using a batch size of 128 key points or dense labels which may come
from multiple datasets.
Since the method does not require the source and target shapes to be close, it can effectively handle large and instantaneous motions. For the real capture data, the reconstructed template model is visualized at every frame, and for the synthetic model the error to the ground truth is shown.
[Figure: qualitative dense correspondence results for full-to-full, full-to-partial, and partial-to-partial matching on real captures, SCAPE, FAUST, MIT, and synthetic Mixamo sequences, showing source, target, and error for each case.]
of the function value as the result of this displacement is given by applying the form to the tangent vector, df(x)F(x) = ⟨∇_X f(x), F(x)⟩_{T_xX}, and can be thought of as an extension of the notion of the classical directional derivative. The operator ∇_X : L²(X) → L²(TX) is called the intrinsic gradient, and is similar to the classical notion of the gradient defining the direction of the steepest change of the function at a point. Similarly, the intrinsic divergence is an operator div_X : L²(TX) → L²(X) acting on tangent vector fields, (formally) adjoint to the gradient operator,

⟨F, ∇_X f⟩_{L²(TX)} = ⟨−div_X F, f⟩_{L²(X)},    (4.3)

acting on scalar fields. From (4.3) it follows that the Laplace-Beltrami operator (LBO) ∆_X f = −div_X(∇_X f) is self-adjoint,

⟨∆_X f, g⟩_{L²(X)} = ⟨∇_X f, ∇_X g⟩_{L²(TX)} = ⟨f, ∆_X g⟩_{L²(X)}.

The LBO is intrinsic, i.e., expressible entirely in terms of the Riemannian metric. As a result, it is invariant to isometric (metric-preserving) deformations of the manifold.
A function f ∈ L²(X) can be expanded in the orthonormal basis of LBO eigenfunctions {φ_k}_{k≥1},

f(x) = ∑_{k≥1} ⟨f, φ_k⟩_{L²(X)} φ_k(x),

where the analysis f̂_k = ⟨f, φ_k⟩_{L²(X)} can be regarded as the forward Fourier transform and the synthesis ∑_{k≥1} f̂_k φ_k(x) is the inverse one; the eigenvalues {λ_k}_{k≥1} play the role of frequencies.
Note that in the Euclidean case the eigenfunctions of the 1D Laplacian are the complex exponentials, −d²/dx² e^{iωx} = ω² e^{iωx}, and the classical Fourier transform is the inner product between the signal f(x) and the Laplacian eigenfunctions e^{−iωx}, i.e.

f̂(ω) = ⟨f(x), e^{−iωx}⟩_{L²(ℝ)} = ∫_{−∞}^{∞} f(x) e^{−iωx} dx.    (4.7)
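As an illustration of this analogy, the following Python sketch computes forward and inverse "manifold Fourier transforms" from the first K eigenpairs of a sparse Laplacian; a path-graph Laplacian stands in for a mesh Laplace-Beltrami operator, and area weights are omitted for brevity:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

n, K = 200, 20
# Stand-in Laplacian of a path graph; any discrete Laplace-Beltrami
# operator (e.g. the cotangent Laplacian) could be used instead.
main = 2.0 * np.ones(n); main[0] = main[-1] = 1.0
L = sp.diags([main, -np.ones(n - 1), -np.ones(n - 1)], [0, -1, 1]).tocsc()

lam, Phi = eigsh(L, k=K, which="SM")   # smallest eigenvalues = low 'frequencies'

f = np.random.randn(n)                 # a signal on the domain
f_hat = Phi.T @ f                      # analysis:  f_hat_k = <f, phi_k>
f_rec = Phi @ f_hat                    # synthesis: sum_k f_hat_k phi_k (low-pass)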
Heat diffusion on the manifold is governed by the heat equation,

(∆_X + ∂/∂t) f(x, t) = 0;    (4.8)
f(x, 0) = f_0(x),    (4.9)

where f(x, t) denotes the amount of heat at point x at time t and f_0(x) is the initial heat distribution; if the manifold has a boundary, appropriate boundary conditions must be added. The solution of (4.8) is obtained by applying the heat operator H^t = e^{−t∆_X} to the initial condition,

f(x, t) = H^t f_0(x) = ∫_X f_0(x′) h_t(x, x′) dx′.    (4.10)
Since H^t has the same eigenfunctions as ∆_X, with eigenvalues {e^{−tλ_k}}_{k≥1}, we can express the solution of (4.8) in the Fourier domain as

f(x, t) = ∫_X f_0(x′) ∑_{k≥1} e^{−tλ_k} φ_k(x) φ_k(x′) dx′,    (4.11)

where h_t(x, x′) = ∑_{k≥1} e^{−tλ_k} φ_k(x) φ_k(x′) is the heat kernel. Interpreting the LBO eigenvalues as 'frequencies', the coefficients e^{−tλ} play the role of a transfer function corresponding to a low-pass filter sampled at {λ_k}_{k≥1}.
where α_ij, β_ij denote the angles ∠ikj, ∠jhi of the triangles sharing the edge ij, and A = diag(a_1, . . . , a_n), with a_i = (1/3) ∑_{jk : ijk ∈ F} A_{ijk} being the local area element at vertex i.

For small t, the autodiffusivity h_t(x, x) admits the asymptotic expansion 1/(4πt) + K(x)/(12π) + O(t), where K(x) denotes the Gaussian curvature at x. Sun et al. [87] defined the heat kernel signature (HKS) of dimension Q at point x by sampling the autodiffusivity function at some fixed times t_1, . . . , t_Q,

HKS(x) = (h_{t_1}(x, x), . . . , h_{t_Q}(x, x)).
The HKS has become a very popular approach in numerous applications due to several appealing properties. First, it is intrinsic and hence, by construction, invariant to isometric deformations of the manifold. Second, it is dense. Third, the spectral expression (4.13) of the heat kernel allows efficient computation of the HKS using the first few eigenvectors and eigenvalues of the Laplace-Beltrami operator.
At the same time, a notable drawback of the HKS, stemming from the use of low-pass filters, is poor spatial localization (by the uncertainty principle, good localization in the Fourier domain results in poor localization in the spatial domain).
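Given an eigendecomposition (λ_k, φ_k) as in the sketch above, the HKS can be computed in a few lines; the time values below are illustrative:

import numpy as np

def heat_kernel_signature(lam, Phi, ts):
    # HKS[x, i] = h_{t_i}(x, x) = sum_k exp(-t_i * lam_k) * phi_k(x)^2
    return (Phi ** 2) @ np.exp(-np.outer(lam, ts))

ts = np.geomspace(1e-2, 1.0, num=8)   # Q = 8 logarithmically sampled times
# hks = heat_kernel_signature(lam, Phi, ts)   # shape (n, Q)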
Wave Kernel Signature (WKS) Aubry et al. [6] considered a different physical model of a quantum particle on the manifold, whose behavior is governed by the Schrödinger equation,

(i∆_X + ∂/∂t) ψ(x, t) = 0,    (4.15)

where ψ(x, t) is the complex wave function capturing the particle behavior. Assuming that the particle oscillates at frequency λ drawn from a probability distribution π(λ), the solution of (4.15) can be expressed in the Fourier domain as

ψ(x, t) = ∑_{k≥1} e^{iλ_k t} π(λ_k) φ_k(x).    (4.16)
The WKS of dimension Q is defined as

WKS(x) = (p_{ν_1}(x), . . . , p_{ν_Q}(x)),

where p_ν(x) is the probability (4.17) corresponding to the initial log-normal frequency distribution with mean frequency ν, and ν_1, . . . , ν_Q are logarithmically-sampled frequencies.
While resembling the HKS in its construction and computation, WKS is based on
log-normal transfer functions that act as band-pass filters and thus exhibits better spatial
localization.
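A sketch of the WKS construction follows; the Gaussian-in-log-frequency transfer functions and the normalization follow the common construction and should be taken as assumptions (in practice the constant first eigenpair is skipped and σ is tied to the spacing of the log-frequencies):

import numpy as np

def wave_kernel_signature(lam, Phi, nus, sigma):
    # band[k, q] = exp(-(log nu_q - log lam_k)^2 / (2 sigma^2)): band-pass
    # transfer functions centered at log-sampled frequencies nu_q.
    # lam must be strictly positive: drop the first (constant) eigenpair.
    band = np.exp(-(np.log(nus)[None, :] - np.log(lam)[:, None]) ** 2
                  / (2.0 * sigma ** 2))
    wks = (Phi ** 2) @ band            # (n, Q)
    return wks / band.sum(axis=0)      # per-frequency normalization

# nus = np.geomspace(lam[1], lam[-1], num=8)   # Q log-sampled frequencies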
Optimal Spectral Descriptors (OSD) Litman and Bronstein [56] considered generic descriptors of the form

f(x) = ∑_{k≥1} τ(λ_k) φ_k²(x) ≈ ∑_{k=1}^{K} τ(λ_k) φ_k²(x),    (4.19)

where τ(λ) = (τ_1(λ), . . . , τ_Q(λ))^⊤ is a bank of transfer functions acting on the LBO eigenvalues, and used parametric transfer functions

τ_q(λ) = ∑_{m=1}^{M} a_{qm} β_m(λ),    (4.20)

where β_1(λ), . . . , β_M(λ) are fixed basis functions. Substituting (4.20) into (4.19) yields f(x) = Ag(x), where g(x) = (g_1(x), . . . , g_M(x))^⊤, with g_m(x) = ∑_k β_m(λ_k) φ_k²(x), is a vector-valued function referred to as the geometry vector, dependent only on the intrinsic geometry of the shape. Thus, (4.19) is parametrized by the Q × M matrix A = (a_{qm}) and can be written in matrix form as f(x) = Ag(x). The main idea of [56] is to learn the optimal parameters A by minimizing a task-specific loss, which reduces to a Mahalanobis-type metric learning.
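A sketch of the resulting computation, with a placeholder Gaussian bump basis standing in for the B-spline basis used in practice:

import numpy as np

def geometry_vector(lam, Phi, centers, width):
    # beta[k, m]: placeholder bump basis evaluated at eigenvalue lam_k
    beta = np.exp(-(lam[:, None] - centers[None, :]) ** 2 / (2.0 * width ** 2))
    return (Phi ** 2) @ beta     # g, shape (n, M): one geometry vector per point

# Descriptors for all points are then f = g @ A.T, with the (Q x M) matrix A
# learned by minimizing a task-specific (metric-learning) loss.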
4.4.1 Inference
We start by describing the inference (or testing) step. In the context of shape matching, a decision tree routes a point x ∈ M along the tree to a leaf node, where a probability distribution defined on a discrete label set L is assigned to the point. Each label ℓ ∈ L identifies a tuple of corresponding points from the collection of training shapes; in the inset figure, points having the same color are associated to the same label in L. This way, the collection of probability distributions over L, which are predicted for each point x ∈ M, can be interpreted as defining a soft map from shape M to some reference shape from the training set. Note that such a reference is not a specific shape from the collection, but rather an abstraction that allows us to think of the label space as a physical object. We will discuss this aspect in more detail in Section 4.4.3.
According to this inference procedure, each tree t ∈ F of a forest F provides a posterior probability P(ℓ|x, t) of label ℓ ∈ L, given a point x ∈ M. The prediction of the whole forest F can be obtained by averaging the predictions of the single trees as P(ℓ|x, F) = (1/|F|) ∑_{t∈F} P(ℓ|x, t).
4.4.2 Learning
During the learning phase, the structure of the trees, the split functions and the leaf
posteriors are determined from a training set. The latter consists of a collection of shapes
with point-wise ground truth matches among them; for the sake of simplicity, we assume
that all shapes have the same number of points, such that for each shape Ri in the training
set we have a bijective mapping (or canonical transformation) Ti : Ri → L. The training
set of labelled data is then given by {(x, Ti (x)) | x ∈ Ri }i . It remains to define the label
predictions associated to each leaf, and the test functions associated to the interior nodes
of the forest.
A straightforward way to assign a label distribution to each leaf node is to measure,
for each label ` ∈ L, the proportion of training samples (x, `) among all training samples S
that have reached the leaf:
P(ℓ|S) = |{(x, ℓ) ∈ S}| / |S|.    (4.22)
The probability distribution P (·|S) will thus become the posterior probability during
inference for every shape point reaching the leaf.
As splitting functions for the interior nodes, Rodolà et al. [71] proposed to consider
binary tests of the form fΘ (x) > τ , where function fΘ (x) is a local shape descriptor
computed at x ∈ M and parametrized by Θ, and τ is a randomly chosen threshold.
Since the parameters Θ are optimized during training, the idea is to consider an existing
descriptor but let the forest automatically determine its discriminative features based on
the training examples. The baseline descriptor can be chosen depending on the matching
problem at hand. The WKS was considered in [71] for classical shape matching, while the
HKS was used in [19, 49] due to its better resilience to missing shape parts.
The point-wise correspondence between two shapes M and N is then obtained by composing the soft maps through the label space:

P(y|x) = ∑_{ℓ∈L} P(ℓ|x) · P(y|ℓ).    (4.23)

Note that this calculation is very simple to carry out in practice. Let X_M, X_N denote the two matrices containing the label predictions for the two shapes, i.e., for each point x_i ∈ M and each label ℓ ∈ L the probability P(ℓ|x_i) is given by (X_M)_{ℓi}, and similarly for N. Since X_M and X_N are left-stochastic matrices, (4.23) can be written as

T = X̃_N^⊤ X_M,

where X̃_N^⊤ denotes column normalization after transposition. In other words, X̃_N^⊤ stores the probabilities of a point y_j ∈ N being the pre-image of a label ℓ ∈ L:

(X̃_N^⊤)_{jℓ} = P(y_j|ℓ) = P(ℓ|y_j) / ∑_{y∈N} P(ℓ|y).
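In code, the composition is one normalization and one matrix product; the random predictions below are placeholders:

import numpy as np

rng = np.random.RandomState(0)
num_labels, nM, nN = 50, 120, 140
X_M = rng.rand(num_labels, nM); X_M /= X_M.sum(axis=0)   # columns: P(l | x_i)
X_N = rng.rand(num_labels, nN); X_N /= X_N.sum(axis=0)

# P(y_j | l): transpose X_N and normalize each column l over the points of N
X_N_tilde = X_N.T / X_N.sum(axis=1)[None, :]             # shape (n_N, |L|)

T = X_N_tilde @ X_M        # T[j, i] = P(y_j | x_i): soft map from M to N
match = T.argmax(axis=0)   # hard correspondence by maximum probability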
We have seen in Chapter 3 that deep learning methods, in particular convolutional neural architectures, can be applied to geometric data treated as a Euclidean 3D object (volume or range image). An alternative way of treating geometric data, intrinsically as manifolds, was discussed in Chapter 4. The main goal of this chapter is to generalize convolutional neural networks to this latter setting. In particular, we will focus on the intrinsic definition of the convolution operation.
Such an operation can be interpreted as a non-linear filtering of the signal f. The key difference from the classical convolution is the lack of shift-invariance, which makes the filter kernel change depending on its position.
Figure 5.1: An illustration of the poor generalization of spectral filtering across non-
Euclidean domains. Left: a function defined on a manifold; middle: result of the applica-
tion of a filter in the frequency domain on the same manifold; right: the same filter applied
on the same function but on a different (nearly-isometric) domain produces a completely
different result.
Note that such a translation is not shift-invariant in general, i.e., the window would change
when moved around the manifold (see Figure 5.2). The modulation operator is defined as
(M_k f)(x) = φ_k(x) f(x), where φ_k is the kth eigenfunction of the Laplace-Beltrami operator.

Figure 5.2: Examples of different WFT atoms g_{x,k} using different windows (top and bottom rows; window Fourier coefficients are shown on the left), shown at different locations (second and third columns) and modulations (fourth and fifth columns).
Combining the two operators together, the WFT atom (see examples in Figure 5.2) becomes

g_{x₀,k}(x) = φ_k(x) ∑_{i≥1} ĝ_i φ_i(x₀) φ_i(x).

Note that the 'mother window' is defined here in the frequency domain by the coefficients ĝ_i. Finally, the WFT of a signal f ∈ L²(X) can be defined as

(Sf)(x₀, k) = ⟨f, g_{x₀,k}⟩_{L²(X)} = ∑_{i≥1} ĝ_i φ_i(x₀) ⟨f, φ_i φ_k⟩_{L²(X)}.    (5.3)
The WFT (Sf )(x, k) performs a filtering of the signal f at the point x at the frequency
k. By collecting its behavior over different frequencies, the content of the signal f in a local
support around x is extracted, reproducing in this way the window extraction on images.
The localized spectral convolution layer can thus be defined as

g_l(x) = ∑_{l′=1}^{p} ∑_{k=1}^{K} w_{l,k,l′} |(Sf_{l′})(x, k)|,
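A direct (dense) NumPy sketch of the WFT (5.3) and of this layer, with area weights omitted for brevity:

import numpy as np

def wft(f, Phi, ghat):
    # C[i, k] = <f, phi_i phi_k>  (plain dot products; area weights omitted)
    C = Phi.T @ (f[:, None] * Phi)
    # (S f)(x, k) = sum_i ghat_i phi_i(x) C[i, k]  -> shape (n, K)
    return (Phi * ghat[None, :]) @ C

def localized_spectral_conv(F, Phi, ghat, W):
    # F: (n, p) input functions; W: (q, K, p) learnable weights w_{l,k,l'}
    S = np.stack([np.abs(wft(F[:, lp], Phi, ghat))
                  for lp in range(F.shape[1])], axis=2)   # (n, K, p)
    return np.einsum("nkp,qkp->nq", S, W)                 # outputs g_l(x), (n, q)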
Figure 5.3: Construction of local geodesic polar coordinates on a manifold. Left: exam-
ples of local geodesic patches, center and right: example of angular and radial weights,
respectively (red denotes larger weights).
template, and moves the window to the next position. In the non-Euclidean setting, the
lack of shift-invariance makes the patch extraction operation position-dependent. The
patch operator Dj (x) acting on the point x ∈ X can be defined as a re-weighting of the
input signal f by means of some weighting kernels {wi (x, ·)}i=1,...,J spatially localized
around x, i.e.

D_j(x)f = ∫_X f(x′) w_j(x, x′) dx′,   j = 1, . . . , J.    (5.4)
The intrinsic convolution can be defined as

(f ⋆ g)(x) = ∑_j g_j D_j(x)f,    (5.5)

where g_j denotes the filter coefficients applied to the patch extracted at each point.
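Discretizing each kernel w_j as an n × n matrix turns (5.4)-(5.5) into matrix products, as in the following sketch (the kernels themselves are placeholders):

import numpy as np

def intrinsic_conv(f, kernels, g):
    # kernels: J matrices W_j of shape (n, n) with W_j[x, x'] ~ w_j(x, x')
    # g: (J,) filter coefficients applied to the extracted patches
    patches = np.stack([Wj @ f for Wj in kernels], axis=1)  # D_j(x) f, (n, J)
    return patches @ g                                      # (f * g)(x), (n,)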
Different spatial-domain intrinsic convolutional layers amount to different definitions of the patch operator D. In the following we will see two examples.
Figure 5.4: Visualization of different heat kernels (red represent high values). Leftmost:
example of an isotropic heat kernel. Remaining: examples of anisotropic heat kernels for
different rotation angles θ and anisotropy coefficient α.
where wθ , wρ are the angular and radial weights, respectively (see Figure 5.3 center and
right). Note that the choice of the origin of the angular coordinate is arbitrary, and therefore
it can vary from point to point. To overcome this problem, an angular max pooling was
used in [59], leading to the following definition of the geodesic convolution
(f ⋆ w)(x) = max_{∆θ∈[0,2π)} ∫ w(θ + ∆θ, ρ) (D(x)f)(θ, ρ) dθ dρ,    (5.9)
where the matrix R_θ(x) performs a rotation of θ w.r.t. some reference (e.g. the maximum curvature) direction, and α > 0 is a parameter controlling the degree of anisotropy (α = 1 corresponds to the classical isotropic case).
[Figure 5.5: A toy intrinsic CNN architecture: an m-dimensional input passes through a linear layer (LIN), ReLU, and a geodesic convolution layer (GC) with q filter banks of p filters each, applied at N_θ rotations and followed by angular max-pooling (AMP), producing a q-dimensional output.]
The anisotropic heat kernel is given by

h_{αθt}(x, x′) = ∑_{k≥0} e^{−tλ_{αθ,k}} φ_{αθ,k}(x) φ_{αθ,k}(x′),

where φ_{αθ,k}(x), λ_{αθ,k} are the eigenfunctions and eigenvalues of the anisotropic Laplacian ∆_{αθ} = −div(A_{αθ}(x)∇). The anisotropic heat kernel h_{αθt} depends on two additional parameters, the coefficient α and the rotation angle θ. Figure 5.4 shows some examples of anisotropic heat kernels computed at different rotations θ and anisotropies α. In [11], such kernels were used as the weighting functions for the construction of the patch operator (5.4),

(D(x)f)(θ, t) = ∫_X h_{αθt}(x, x′) f(x′) dx′,

mapping the values of f around point x to a local polar-like system of coordinates (θ, t).
5.3 Applications
Similarly to Euclidean CNNs, an intrinsic convolutional neural network consists of several layers applied subsequently, i.e. the output of the previous layer is used as the input to the subsequent one. The convolutional layer (2.6) is used, with the only difference that the convolution operation is replaced by its intrinsic analogue. Figure 5.5 shows a toy example of an intrinsic CNN architecture with one intrinsic convolutional layer.
An intrinsic CNN is a non-linear hierarchical parametric map of the form

Ψ_Θ = (ψ_{θ_n} ∘ · · · ∘ ψ_{θ_1}),

where ψ_i, i = 1, . . . , n, represents the ith layer with parameters θ_i, and Θ = {θ_i : i = 1, . . . , n} is the set of parameters of all layers. The function is applied to point-wise input data (e.g. some simple geometric descriptors) and produces a point-wise output. In the following, we will see how intrinsic CNNs can be applied to two basic problems in computer graphics: the computation of local descriptors and correspondences.
Descriptor learning can be performed by minimizing a siamese loss of the form

L(Θ) = (1 − γ) ∑_{i=1}^{|T₊|} ‖Ψ_Θ(f_i) − Ψ_Θ(f_i⁺)‖² + γ ∑_{i=1}^{|T₋|} (μ − ‖Ψ_Θ(f_i) − Ψ_Θ(f_i⁻)‖)₊².

Here, γ ∈ [0, 1] is a parameter trading off between the positive and negative losses, μ is a margin, (·)₊ = max{0, ·}, and T± = {(f_i, f_i^±)} denote the sets of positive and negative pairs, respectively.
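A NumPy sketch of this loss, with Ψ standing for an arbitrary placeholder descriptor network:

import numpy as np

def siamese_loss(Psi, pos_pairs, neg_pairs, gamma=0.5, mu=1.0):
    # positive pairs are pulled together, negatives pushed beyond the margin mu
    pos = sum(np.sum((Psi(f) - Psi(fp)) ** 2) for f, fp in pos_pairs)
    neg = sum(max(0.0, mu - np.linalg.norm(Psi(f) - Psi(fm))) ** 2
              for f, fm in neg_pairs)
    return (1.0 - gamma) * pos + gamma * neg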
In [59], the authors used the geodesic CNN architecture shown in Figure 5.5 to produce dense intrinsic pose- and subject-invariant descriptors of human shapes. A qualitative evaluation of the learned descriptors is reported in Figure 5.6, depicting the Euclidean distance in descriptor space between the descriptor at a selected point and the descriptors at the rest of the points on the same shape, as well as on its transformations. The descriptors produced by the geodesic CNN (GCNN) manifest both good localization (better than HKS) and high discriminativity (fewer spurious minima than WKS and OSD), as well as robustness to different kinds of noise, including isometric and non-isometric deformations, geometric and topological noise, different sampling, and missing parts.
5.3.2 Correspondence
As we discussed in Section 4.4.3 of these notes, finding the correspondence in a collection
of shapes can be posed as a labelling problem, where one tries to label each vertex of a
given query shape X with the index of a corresponding point on some common reference
shape Y [71]. Let n and m denote the number of vertices in X and Y , respectively. For
a point x on a query shape, the output of an intrinsic CNN ΨΘ (x) is m-dimensional and
is interpreted as a probability distribution (‘soft correspondence’) on Y . The output of
the network at all the points of the query shape can be arranged as an n × m matrix with
elements of the form ψΘ (x, y), representing the probability of x mapped to y.
Let us denote by y*(x) the ground-truth correspondence of x on the reference shape. We assume to be provided with examples of points from shapes across the collection and their ground-truth correspondence, T = {(x, y*(x))}. The optimal parameters of the network are found by minimizing the multinomial regression loss

ℓ(Θ) = − ∑_{(x, y*(x)) ∈ T} log ψ_Θ(x, y*(x)).
Figure 5.6: Normalized Euclidean distance between the descriptor at a reference point
on the shoulder (white sphere) and the descriptors computed at the rest of the points for
different transformations (shown left-to-right: near isometric deformations, non-isometric
deformations, topological noise, geometric noise, uniform/non-uniform subsampling,
missing parts). Cold and hot colors represent small and large distances, respectively;
distances are saturated at the median value. Ideal descriptors would produce a distance
map with a sharp minimum at the corresponding point and no spurious local minima at
other locations.
Figure 5.7: Examples of correspondence on the FAUST humans dataset obtained by the
anisotropic diffusion CNN. Shown is the texture transferred from the leftmost reference
shape to different subjects in different poses by means of our correspondence. The
correspondence is nearly perfect (only very few minor artifacts are noticeable).
[Figure 5.8: comparison of correspondence quality for Blended Intrinsic Maps, geodesic CNN, and Random Forest.]
Figure 5.9: Examples of partial correspondence on the dog shape from the SHREC’16
Partial (holes) dataset. First row: correspondence produced by anisotropic diffusion
CNN. Corresponding points are shown in similar color. Reference shape is shown on the
left. Second and third rows: pointwise geodesic error (in % of geodesic diameter) of the
anisotropic diffusion CNN and RF correspondence, respectively. For visualization clarity,
the error values are saturated at 10% of the geodesic diameter. Hot colors correspond to
large errors.
Bibliography
[6] M. Aubry, U. Schlickewei, and D. Cremers. The wave kernel signature: A quantum
mechanical approach to shape analysis. In Proc. 4DMOD, 2011.
[7] P. Baldi and P. Sadowski. The dropout learning algorithm. Artificial Intelligence,
210:78–122, 2014.
[9] F. Bogo, J. Romero, M. Loper, and M. J. Black. FAUST: Dataset and evaluation for
3D mesh registration. In Proc. CVPR, 2014.
[13] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[14] J. Bromley et al. Signature verification using a “Siamese” time delay neural network.
In Proc. NIPS. 1994.
[15] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally
connected networks on graphs. In Proc. ICLR, 2014.
[18] D. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network
learning by exponential linear units (elus). In Proc. ICLR, 2015.
[23] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale
hierarchical image database. In Proc. CVPR, 2009.
[27] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why
does unsupervised pre-training help deep learning? JMLR, 11, 2010.
[29] J. Gall, A. Yao, N. Razavi, L. Van Gool, and V. Lempitsky. Hough forests for object
detection, tracking, and action recognition. Trans. PAMI, 33(11):2188–2202, 2011.
[30] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward
neural networks. In Proc. Conf. Artificial Intelligence and Statistics, 2010.
[33] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. MatchNet: Unifying feature
and metric learning for patch-based matching. In Proc. CVPR, 2015.
[34] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition.
arXiv:1512.03385, 2015.
[35] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing
human-level performance on ImageNet classification. In Proc. ICCV, 2015.
[36] G. E. Hinton. Training products of experts by minimizing contrastive divergence.
Neural Computation, 14(8):1771–1800, 2002.
[37] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief
nets. Neural Computation, 18(7):1527–1554, July 2006.
[38] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with
neural networks. Science, 313(5786):504–507, 2006.
[39] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov.
Improving neural networks by preventing co-adaptation of feature detectors.
arXiv:1207.0580, 2012.
[40] S. Hochreiter and J. Schmidhuber. Simplifying neural nets by discovering flat
minima. In Proc. NIPS, 1995.
[41] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation,
9(8):1735–1780, 1997.
[42] Q. Huang, B. Adams, M. Wicke, and L. J. Guibas. Non-rigid registration under
isometric deformations. In Proc. SGP, 2008.
[43] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
[44] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer
networks. In Proc. NIPS, 2015.
[45] K. Kawaguchi. Deep learning without poor local minima. In Proc. NIPS, 2016.
[46] V. G. Kim, Y. Lipman, and T. Funkhouser. Blended Intrinsic Maps. Trans. Graphics,
30(4), 2011.
[47] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local
reparameterization trick. In Proc. NIPS. 2015.
[48] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, 2012.
[49] Z. Lähner, E. Rodolà, M. M. Bronstein, et al. SHREC’16: Matching of deformable
shapes with topological noise. In Proc. 3DOR, 2016.
[50] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to
document recognition. Proc. IEEE, 86(11):2278–2324, 1998.
[51] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks
for scalable unsupervised learning of hierarchical representations. In Proc. ICML,
2009.
[52] H. Li, B. Adams, L. J. Guibas, and M. Pauly. Robust single-view geometry and
motion reconstruction. Trans. Graphics, 28(5), 2009.
[54] Y. Lipman and T. Funkhouser. Möbius voting for surface correspondence. Trans.
Graphics, 28(3), 2009.
[56] R. Litman and A. M. Bronstein. Learning spectral descriptors for deformable shape
correspondence. Trans. PAMI, 36(1):170–180, 2014.
[57] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic
segmentation. Proc. CVPR, 2015.
[64] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic
segmentation. arXiv:1505.04366, 2015.
[65] K. Olszewski, J. J. Lim, S. Saito, and H. Li. High-fidelity facial and speech animation
for VR HMDs. Trans. Graphics, 35(6), 2016.
[68] U. Pinkall and K. Polthier. Computing discrete minimal surfaces and their conju-
gates. Experimental Mathematics, 2(1):15–36, 1993.
[71] E. Rodolà, S. Rota Bulò, T. Windheuser, M. Vestner, and D. Cremers. Dense non-rigid
shape correspondence using random forests. In Proc. CVPR, 2014.
[75] S. Saito, T. Li, and H. Li. Real-time facial segmentation and performance capture
from RGB input. In Proc. ECCV, 2016.
[76] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. arXiv:1503.03832, 2015.
[80] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support
inference from RGBD images. In Proc. ECCV, 2012.
[81] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv:1409.1556, 2014.
[82] S. Song and J. Xiao. Deep sliding shapes for amodal 3D object detection in RGB-D
images. arXiv:1511.02300, 2015.
[85] M. Stollenga, J. Masci, F. Gomez, and J. Schmidhuber. Deep networks with internal
selective attention through feedback connections. In Proc. NIPS, 2014.
[87] J. Sun, M. Ovsjanikov, and L. J. Guibas. A concise and provably informative multi-
scale signature based on heat diffusion. In Proc. SGP, 2009.
[90] J. Taylor, J. Shotton, T. Sharp, and A. Fitzgibbon. The vitruvian manifold: Inferring
dense correspondences for one-shot human pose estimation. In Proc. CVPR, 2012.
[91] A. Tevs, A. Berner, M. Wand, I. Ihrke, M. Bokeloh, J. Kerber, and H.-P. Seidel. Ani-
mation cartography – intrinsic reconstruction of shape and motion. Trans. Graphics,
31(2):12:1–12:15, 2012.
[92] J. Tompson, M. Stein, Y. LeCun, and K. Perlin. Real-time continuous pose recovery
of human hands using convolutional networks. Trans. Graphics, 33(5), 2014.
[93] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising
autoencoders: Learning useful representations in a deep network with a local
denoising criterion. JMLR, 11:3371–3408, 2010.
[94] D. Vlasic, I. Baran, W. Matusik, and J. Popović. Articulated mesh animation from
multi-view silhouettes. Trans. Graphics, 27(3), 2008.
[96] S. Wang and C. Manning. Fast dropout training. In Proc. ICML, 2013.
[97] L. Wei, Q. Huang, D. Ceylan, E. Vouga, and H. Li. Dense human body correspon-
dences using convolutional networks. In Proc. CVPR, 2016.
[98] X. Wei, P. Zhang, and J. Chai. Accurate realtime full-body motion capture using a
single depth camera. Trans. Graphics, 31(6), 2012.
[99] P. J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral
Sciences. PhD thesis, Harvard University, 1974.
[100] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D Shapenets: A
deep representation for volumetric shapes. In Proc. CVPR, 2015.
[101] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio.
Show, attend and tell: Neural image caption generation with visual attention. In
Proc. ICML, 2015.
[103] M. E. Yumer and N. J. Mitra. Learning semantic deformation flows with 3d convo-
lutional networks. In Proc. ECCV, 2016.
[104] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolu-
tional neural networks. In Proc. CVPR, 2015.
[105] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollár.
A multipath network for object detection. In Proc. BMVC, 2016.
[106] A. Zeng, S. Song, M. Nießner, M. Fisher, and J. Xiao. 3DMatch: Learning the
matching of local 3D geometry in range scans. arXiv:1603.08182, 2016.
[107] S. Zhou, J. Wu, Y. Wu, and X. Zhou. Exploiting local structures with the kronecker
layer in convolutional networks. arXiv:1512.09194, 2015.