for the lower, more detailed layers. Since higher layers have
a coarser spatial resolution, they also provide a convenient
shortcut for information that needs to travel long distances.
Backward connections are immediately followed by a spatial
upsampling operation.
Due to padded convolutions and the opposing pooling and
upsampling operations, all connections coinciding on a given
hidden layer have the same size, and are simply summed
elementwise.
All connections use temporal weight sharing, i.e., for all t and all k ∈ {−1, 0, +1}, the weights used in the convolution from H(t, l) to H(t + 1, l + k) are identical across time steps.
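To make the update concrete, the following is a minimal PyTorch-style sketch of one hidden layer's transition from t to t + 1. The module structure, map count, ReLU nonlinearity, and the use of max pooling and nearest-neighbor upsampling are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HiddenLayerUpdate(nn.Module):
    """Computes H(t+1, l) from H(t, l-1), H(t, l) and H(t, l+1)."""
    def __init__(self, maps=32):
        super().__init__()
        # One convolution per connection type; reusing these modules at every
        # time step realizes the temporal weight sharing described above.
        self.lateral   = nn.Conv2d(maps, maps, 3, padding=1)  # H(t, l)   -> H(t+1, l)
        self.bottom_up = nn.Conv2d(maps, maps, 3, padding=1)  # H(t, l-1) -> H(t+1, l)
        self.top_down  = nn.Conv2d(maps, maps, 3, padding=1)  # H(t, l+1) -> H(t+1, l)

    def forward(self, below, same, above):
        out = self.lateral(same)
        if below is not None:
            # forward connection from the finer layer, followed by pooling
            out = out + F.max_pool2d(self.bottom_up(below), 2)
        if above is not None:
            # backward connection from the coarser layer, followed by upsampling
            out = out + F.interpolate(self.top_down(above), scale_factor=2)
        # connections coinciding on the layer are summed elementwise;
        # the ReLU nonlinearity is an assumption of this sketch
        return torch.relu(out)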
B. Output
In contrast to common CNN models, the output of our network is always obtained in the lowest layer of the network, at the first time step at which the activations have been able to reach the highest layer and return back. This structural property allows us to produce detailed outputs at input resolution. The cross-entropy loss over the C object categories is measured on a subset of the maps in the lowest layer, where every map is responsible for one category. Apart from simplifying the implementation, this allows the network to produce and reuse intermediate outputs, and refine them over time.
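As an illustration of this read-out, a small sketch follows; the assumption that the first C maps of the lowest layer serve as class scores is a hypothetical choice, not stated in the text.

import torch.nn.functional as F

def segmentation_loss(lowest_layer, labels, num_classes):
    # lowest_layer: activations of the bottom layer, shape (batch, maps, H, W);
    # the first num_classes maps are read out as per-category score maps.
    logits = lowest_layer[:, :num_classes]
    # pixel-wise cross-entropy over the C object categories; labels: (batch, H, W)
    return F.cross_entropy(logits, labels)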
C. Multi-Scale Inputs
Inputs may be given on all scales of the network. When using multi-scale inputs, we additionally convolve a (downscaled) version of the input and add it to the summed connections at the corresponding hidden layer.
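A minimal sketch of such an injection, with hypothetical channel counts and average pooling as the downscaling operation (neither is specified in the text):

import torch.nn as nn
import torch.nn.functional as F

in_channels, maps = 4, 32          # e.g. RGB-D input and 32 hidden maps (assumed)
scale_conv = nn.Conv2d(in_channels, maps, 3, padding=1)

def inject_input(layer_sum, image, factor):
    # Downscale the input to the hidden layer's resolution, convolve it,
    # and add it to the contributions already summed at that layer.
    small = F.avg_pool2d(image, factor)
    return layer_sum + scale_conv(small)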
D. Unidirectional RNN
In the architecture described so far, we process the video
images sequentially, one image per time step. The state (i.e., the activations) at time t, which contains information about the past, is combined with the image at time t, producing an output and
a new state. Since the last output benefits from learning from
the whole sequence, it is natural to place the frame that we
want to evaluate at the end.
The first temporal copy is special, since it contains regular feed-forward connections. This allows us to produce activations
in each layer such that all connection types can be used in the
transition from t to t + 1.
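The resulting unrolling can be sketched as follows; feedforward_step, recurrent_step and read_output are hypothetical callables standing in for the operations described above.

def unroll(frames, feedforward_step, recurrent_step, read_output):
    # First temporal copy: plain feed-forward pass, so every layer obtains an
    # activation before the full set of connection types is used.
    state = feedforward_step(frames[0])
    outputs = [read_output(state)]
    # Subsequent copies: lateral, forward and backward connections combine the
    # previous state with the current image.
    for image in frames[1:]:
        state = recurrent_step(state, image)
        outputs.append(read_output(state))
    # The last output has seen the whole sequence, so the frame to be
    # evaluated is placed at the end.
    return outputs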
Fig. 4: Architecture of the bidirectional RNN. The final output at the center has access to both past and future context. Delay
D is the number of time frames between an input and the corresponding output frame.
IV. LEARNING
We initialize the weights and biases from a Gaussian distribution. It is important to ensure that the activations do not
explode or vanish early during training. Ideally, activations in
the first forward pass should have similar magnitudes. This is
difficult to control, however. Instead, we choose the variance
of the weights and the mean of the bias such that the average
of the activations in every point of our network is positive and
slightly decreasing over time.
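A sketch of such an initialization follows; the concrete standard deviations and the bias mean are illustrative values, since the text only states the criterion used to choose them.

import torch.nn as nn

def init_gaussian(module, weight_std=0.01, bias_mean=0.1, bias_std=0.01):
    # The constants here are illustrative; the criterion is that the mean
    # activation at every point of the network stays positive and decreases
    # slightly over time.
    if isinstance(module, nn.Conv2d):
        nn.init.normal_(module.weight, mean=0.0, std=weight_std)
        if module.bias is not None:
            nn.init.normal_(module.bias, mean=bias_mean, std=bias_std)

# usage: model.apply(init_gaussian)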
We learn the parameters of our network using Root Mean
Square Propagation (RMSProp), which is a variant of Resilient
Propagation (RProp) suitable for mini-batch learning. RProp
considers only the sign of the gradient, thus being robust
against vanishing and exploding gradients, phenomena that
occur when training recurrent neural networks.
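For reference, one RMSProp update on a single parameter can be sketched as follows (standard formulation; the hyperparameter values are illustrative, not taken from the paper).

import numpy as np

def rmsprop_step(param, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    # Running average of squared gradients; dividing by its square root makes
    # the update depend mostly on the sign and recent scale of the gradient,
    # which keeps it robust to vanishing and exploding gradients.
    cache = decay * cache + (1.0 - decay) * grad ** 2
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache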
We apply dropout [3] during learning, which we found
improves our results in almost all cases. Applying dropout
in RNNs is delicate, however: it should not affect the recurrent connections, which would otherwise lose their ability to learn long-range dependencies [15]. Thus, when using dropout, we apply it only to a final convolution that is applied to the bottom layer to extract the output.
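A minimal sketch of this placement, with hypothetical map and class counts: dropout acts only on the read-out path, never on the recurrent state.

import torch.nn as nn

maps, num_classes = 32, 4            # hypothetical sizes
readout = nn.Sequential(
    nn.Dropout2d(p=0.5),             # drops whole maps of the bottom layer only
    nn.Conv2d(maps, num_classes, 1), # final convolution extracting the output
)
# logits = readout(bottom_layer_state)   # recurrent connections remain dropout-free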
V. EXPERIMENTS
A. Toy Experiments
We present three toy experiments, showing that our network is able to learn 1) filtering noisy information over time,
2) tracking and interpreting motion, and 3) retaining an internal
state including uncertainty.
1) Denoising: In this experiment, we simply feed different
degraded versions of the same binary image to the network.
We use salt and pepper noise, uniformly distributed over the
whole image. We also draw random black or white lines, to
make the task more difficult. The task is to obtain the original
image without noise. One way the network could solve this
task would be to learn to average the image over time. In
addition, denoising filters learned by the neural network can
remove high-frequency noise.
To ensure that the network is able to generalize instead
of learning an object by heart, we use different objects for
training, validation and testing. Every split contains 100 independently generated sequences.
Since the task has a reduced complexity, we opt for a
simple convolutional model of only one hidden layer with 32
maps. A small filter size of 5×5 provides sufficient spatial
context. There is no specific order in such a sequence of noised
images, thus we only test the unidirectional architecture on this
task.
We use T = 6 temporal copies. During training, we optimize
a weighted sum of the losses at all time steps, with a ten times
larger weight placed on the final output. In all toy examples,
we train for 12,000 iterations with minibatches of size 16.
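A sketch of this objective (the weighting follows the description above; the helper name is hypothetical):

def sequence_loss(per_step_losses):
    # Equal weight on all intermediate outputs, ten times larger weight on the
    # final output.
    weights = [1.0] * (len(per_step_losses) - 1) + [10.0]
    return sum(w * l for w, l in zip(weights, per_step_losses))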
Figure 5 shows an example from the test set. Our model is
able to improve its prediction step by step, accumulating information over time even from areas that are more affected by noise. After only two steps, the network is able to remove most of the false positives and to assemble almost all features of the object.
TABLE II: Average class and pixel accuracies (in %) on the NYU Depth v2 test set for the evaluated network variants (MS: multi-scale inputs, SW: depth-normalized sliding window).

Method                         Class avg   Pixel avg
Simplified Network             59.2        60.0
Bidirectional Network          62.2        62.5
Unidirectional Network         62.3        62.7
Unidirectional Network + MS    62.4        63.1
Unidirectional + SW            69.9        67.5
Fig. 8: Prediction for one of the NYUD dataset frames. Images (a) and (b) show RGB and depth, respectively, after being preprocessed. (c) and (d) represent the prediction and ground truth, respectively, where color encodes the classes floor, prop, furniture, structure and unknown. The network detects most of the pixels correctly, even some wrongly labeled ones (e.g., the third object on the table and the center of the wall-mounted piece).
TABLE III: Comparison of our depth-normalized sliding window result with state-of-the-art object-class segmentation results on the NYU Depth v2 dataset (accuracies in %).

Method                     floor   prop   furniture   structure   Class avg   Pixel avg
Unidirectional + SW        90.0    61.2   52.1        76.3        69.9        67.5
Schulz et al. [19]         93.6    54.9   66.4        80.2        73.7        73.4
Müller and Behnke [21]     94.9    55.1   79.7        78.9        71.9        72.3
Stückler et al. [20]       90.8    19.9   67.9        81.6        65.0        68.3
Couprie et al. [22]        87.3    35.5   45.3        86.1        63.5        64.5
Höft et al. [18]           77.9    49.9   55.9        65.4        61.1        62.0
Silberman et al. [16]      68      42     70          59          59.6        58.6
the main difference is due to the introduction of the depth-normalized sliding window. The results in Table II show that
we were able to improve on both baseline results in both pixel
and class accuracy.
Table III shows our depth-normalized sliding window result
together with state-of-the-art results on the same dataset. Our
method is still behind the state-of-the-art, but shows promising
results. In particular, it performs similarly to Stückler et al. [20], who explicitly accumulated predictions in 3D. It is
likely that results will improve significantly when the neural
network has access to the height above ground [19] and
predictions are post-processed with conditional random fields,
which strongly improved object-class segmentation results of
random forests [21] and neural networks [19].
Note, however, that except for Stückler et al. [20], none
of the listed publications made use of temporal context to
determine class labels.
Fig. 9: Prediction for one sample of the NYUD test set. Rows represent, from top to bottom: the RGB input, the softmax layer output, the output of the network, and the evaluation (color-coded true positives, true negatives, false positives and false negatives) for the class structure.

VI. CONCLUSION
REFERENCES

[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 568-576.
[9] V. Michalski, R. Memisevic, and K. Konda, "Modeling deep temporal dependencies with recurrent grammar cells," in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 1925-1933.
[10] M. Jung, J. Hwang, and J. Tani, "Multiple spatio-temporal scales neural network for contextual visual recognition of human actions," in International Conference on Development and Learning and on Epigenetic Robotics (ICDL), 2014.
[11] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1725-1732.
[12] P. H. Pinheiro and R. Collobert, "Recurrent convolutional neural networks for scene labeling," 2014.
[13] M. Sundermeyer, R. Schlüter, and H. Ney, "LSTM neural networks for language modeling," in Interspeech, 2012.
[14] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," Journal of Machine Learning Research, vol. 28, pp. 1310-1318, 2013.
[15] V. Pham, T. Bluche, C. Kermorvant, and J. Louradour, "Dropout improves recurrent neural networks for handwriting recognition."
[16]
[17]
[18]
[19]
[20]
[21]
[22]