Deep Learning in Neural Networks: An Overview
Jürgen Schmidhuber
The Swiss AI Lab IDSIA
Istituto Dalle Molle di Studi sull’Intelligenza Artificiale
University of Lugano & SUPSI
Galleria 2, 6928 Manno-Lugano
Switzerland
8 October 2014
Abstract
In recent years, deep artificial neural networks (including recurrent ones) have won numerous
contests in pattern recognition and machine learning. This historical survey compactly summarises
relevant work, much of it from the previous millennium. Shallow and deep learners are distin-
guished by the depth of their credit assignment paths, which are chains of possibly learnable, causal
links between actions and effects. I review deep supervised learning (also recapitulating the history
of backpropagation), unsupervised learning, reinforcement learning & evolutionary computation,
and indirect search for short programs encoding deep and large networks.
Preface
This is the preprint of an invited Deep Learning (DL) overview. One of its goals is to assign credit
to those who contributed to the present state of the art. I acknowledge the limitations of attempting
to achieve this goal. The DL research community itself may be viewed as a continually evolving,
deep network of scientists who have influenced each other in complex ways. Starting from recent DL
results, I tried to trace back the origins of relevant ideas through the past half century and beyond,
sometimes using “local search” to follow citations of citations backwards in time. Since not all DL
publications properly acknowledge earlier relevant work, additional global search strategies were em-
ployed, aided by consulting numerous neural network experts. As a result, the present preprint mostly
consists of references. Nevertheless, through an expert selection bias I may have missed important
work. A related bias was surely introduced by my special familiarity with the work of my own DL
research group in the past quarter-century. For these reasons, this work should be viewed as merely a
snapshot of an ongoing credit assignment process. To help improve it, please do not hesitate to send
corrections and suggestions to [email protected].
Contents
1 Introduction to Deep Learning (DL) in Neural Networks (NNs)
6 DL in FNNs and RNNs for Reinforcement Learning (RL)
6.1 RL Through NN World Models Yields RNNs With Deep CAPs
6.2 Deep FNNs for Traditional RL and Markov Decision Processes (MDPs)
6.3 Deep RL RNNs for Partially Observable MDPs (POMDPs)
6.4 RL Facilitated by Deep UL in FNNs and RNNs
6.5 Deep Hierarchical RL (HRL) and Subgoal Learning with FNNs and RNNs
6.6 Deep RL by Direct NN Search / Policy Gradients / Evolution
6.7 Deep RL by Indirect Policy Search / Compressed NN Search
6.8 Universal RL
8 Acknowledgments
1 Introduction to Deep Learning (DL) in Neural Networks (NNs)
Which modifiable components of a learning system are responsible for its success or failure? What
changes to them improve performance? This has been called the fundamental credit assignment prob-
lem (?). There are general credit assignment methods for universal problem solvers that are time-
optimal in various theoretical senses (Sec. 6.8). The present survey, however, will focus on the nar-
rower, but now commercially important, subfield of Deep Learning (DL) in Artificial Neural Networks
(NNs).
A standard neural network (NN) consists of many simple, connected processors called neurons,
each producing a sequence of real-valued activations. Input neurons get activated through sensors per-
ceiving the environment, other neurons get activated through weighted connections from previously
active neurons (details in Sec. 2). Some neurons may influence the environment by triggering actions.
Learning or credit assignment is about finding weights that make the NN exhibit desired behavior,
such as driving a car. Depending on the problem and how the neurons are connected, such behavior
may require long causal chains of computational stages (Sec. 3), where each stage transforms (of-
ten in a non-linear way) the aggregate activation of the network. Deep Learning is about accurately
assigning credit across many such stages.
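For concreteness, here is a minimal Python/NumPy sketch of such a network of simple connected
processors; the layer sizes, the tanh nonlinearity, and the random weights are illustrative
assumptions, not part of the original text:

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative assumption: a network with three computational stages.
    layer_sizes = [4, 8, 8, 2]          # input units, two hidden stages, output units
    weights = [rng.normal(0, 0.5, (m, n))
               for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

    def forward(x):
        """Input neurons are activated by 'sensors' (here: the vector x); each
        later neuron is activated through weighted connections from previously
        active neurons. Learning means finding weights that make the output
        neurons exhibit desired behavior."""
        a = x
        for W in weights:
            a = np.tanh(a @ W)          # one (non-linear) computational stage
        return a

    print(forward(rng.normal(size=4)))  # activations of the output neurons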
Shallow NN-like models with few such stages have been around for many decades if not centuries
(Sec. 5.1). Models with several successive nonlinear layers of neurons date back at least to the 1960s
(Sec. 5.3) and 1970s (Sec. 5.5). An efficient gradient descent method for teacher-based Supervised
Learning (SL) in discrete, differentiable networks of arbitrary depth called backpropagation (BP)
was developed in the 1960s and 1970s, and applied to NNs in 1981 (Sec. 5.5). BP-based training
of deep NNs with many layers, however, had been found to be difficult in practice by the late 1980s
(Sec. 5.6), and had become an explicit research subject by the early 1990s (Sec. 5.9). DL became
practically feasible to some extent through the help of Unsupervised Learning (UL), e.g., Sec. 5.10
(1991), Sec. 5.15 (2006). The 1990s and 2000s also saw many improvements of purely supervised
DL (Sec. 5). In the new millennium, deep NNs have finally attracted widespread attention, mainly
by outperforming alternative machine learning methods such as kernel machines (??) in numerous
important applications. In fact, since 2009, supervised deep NNs have won many official international
pattern recognition competitions (e.g., Sec. 5.17, 5.19, 5.21, 5.22), achieving the first superhuman
visual pattern recognition results in limited domains (Sec. 5.19, 2011). Deep NNs also have become
relevant for the more general field of Reinforcement Learning (RL) where there is no supervising
teacher (Sec. 6).
Both feedforward (acyclic) NNs (FNNs) and recurrent (cyclic) NNs (RNNs) have won contests
(Sec. 5.12, 5.14, 5.17, 5.19, 5.21, 5.22). In a sense, RNNs are the deepest of all NNs (Sec. 3)—they
are general computers more powerful than FNNs, and can in principle create and process memories
of arbitrary sequences of input patterns (e.g., ??). Unlike traditional methods for automatic sequential
program synthesis (e.g., ????), RNNs can learn programs that mix sequential and parallel information
processing in a natural and efficient way, exploiting the massive parallelism viewed as crucial for
sustaining the rapid decline of computation cost observed over the past 75 years.
The rest of this paper is structured as follows. Sec. 2 introduces a compact, event-oriented notation
that is simple yet general enough to accommodate both FNNs and RNNs. Sec. 3 introduces the
concept of Credit Assignment Paths (CAPs) to measure whether learning in a given NN application is
of the deep or shallow type. Sec. 4 lists recurring themes of DL in SL, UL, and RL. Sec. 5 focuses
on SL and UL, and on how UL can facilitate SL, although pure SL has become dominant in recent
competitions (Sec. 5.17–5.23). Sec. 5 is arranged in a historical timeline format with subsections on
important inspirations and technical contributions. Sec. 6 on deep RL discusses traditional Dynamic
Programming (DP)-based RL combined with gradient-based search techniques for SL or UL in deep
NNs, as well as general methods for direct and indirect search in the weight space of deep FNNs and
RNNs, including successful policy gradient and evolutionary methods.
Sec. 5.5 will use the notation above to compactly describe a central algorithm of DL, namely,
backpropagation (BP) for supervised weight-sharing FNNs and RNNs. (FNNs may be viewed as
RNNs with certain fixed zero weights.) Sec. 6 will address the more general RL case.
direct search (Sec. 6.6) or indirect search (Sec. 6.7) in weight space, or through training an NN first
on shallow problems whose solutions may then generalize to deep problems, or through collapsing
sequences of (non)linear operations into a single (non)linear operation (but see an analysis of non-
trivial aspects of deep linear networks, ?, Section B). In general, however, finding an NN that precisely
models a given training set is an NP-complete problem (??), also in the case of deep NNs (???);
compare a survey of negative results (?, Section 1).
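The collapsing trick mentioned above is easy to illustrate for the purely linear case; the matrix
sizes in this sketch are arbitrary assumptions:

    import numpy as np

    rng = np.random.default_rng(1)
    W1 = rng.normal(size=(5, 7))
    W2 = rng.normal(size=(7, 3))
    x = rng.normal(size=5)

    # Two successive linear stages ...
    deep = (x @ W1) @ W2
    # ... equal a single collapsed linear stage with weight matrix W1 @ W2.
    # (This fails as soon as a nonlinearity is applied between the stages.)
    shallow = x @ (W1 @ W2)

    assert np.allclose(deep, shallow)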
Above we have focused on SL. In the more general case of RL in unknown environments, pcc(p, q)
is also true if xp is an output event and xq any later input event—any action may affect the environment
and thus any later perception. (In the real world, the environment may even influence non-input events
computed on physical hardware entangled with the entire universe, but this is ignored here.) It is
possible to model and replace such unmodifiable environmental PCCs through a part of the NN that
has already learned to predict (through some of its units) input events (including reward signals) from
former input events and actions (Sec. 6.1). Its weights are frozen, but can help to assign credit to
other, still modifiable weights used to compute actions (Sec. 6.1). This approach may lead to very
deep CAPs though.
Some DL research is about automatically rephrasing problems such that their depth is reduced
(Sec. 4). In particular, sometimes UL is used to make SL problems less deep, e.g., Sec. 5.10. Often
Dynamic Programming (Sec. 4.1) is used to facilitate certain traditional RL problems, e.g., Sec. 6.2.
Sec. 5 focuses on CAPs for SL, Sec. 6 on the more complex case of RL.
learnt concepts. Such hierarchical representation learning (???) is also a recurring theme of DL NNs
for SL (Sec. 5), UL-aided SL (Sec. 5.7, 5.10, 5.15), and hierarchical RL (Sec. 6.5). Often, abstract
hierarchical representations are natural by-products of data compression (Sec. 4.4), e.g., Sec. 5.10.
millennium (Sec. 5.15). Sec. 5.8 is about applying BP to CNNs (1989), which is important for today’s
DL applications. Sec. 5.9 explains BP’s Fundamental DL Problem (of vanishing/exploding gradients)
discovered in 1991. Sec. 5.10 explains how a deep RNN stack of 1991 (the History Compressor) pre-
trained by UL helped to solve previously unlearnable DL benchmarks requiring Credit Assignment
Paths (CAPs, Sec. 3) of depth 1000 and more. Sec. 5.11 discusses a particular winner-take-all (WTA)
method called Max-Pooling (MP, 1992) widely used in today’s deep FNNs. Sec. 5.12 mentions a
first important contest won by SL NNs in 1994. Sec. 5.13 describes a purely supervised DL RNN
(Long Short-Term Memory, LSTM, 1995) for problems of depth 1000 and more. Sec. 5.14 mentions
an early contest of 2003 won by an ensemble of shallow FNNs, as well as good pattern recognition
results with CNNs and deep FNNs and LSTM RNNs (2003). Sec. 5.15 is mostly about Deep Belief
Networks (DBNs, 2006) and related stacks of Autoencoders (AEs, Sec. 5.7), both pre-trained by UL to
facilitate subsequent BP-based SL (compare Sec. 5.6.1, 5.10). Sec. 5.16 mentions the first SL-based
GPU-CNNs (2006), BP-trained MPCNNs (2007), and LSTM stacks (2007). Sec. 5.17–5.22 focus on
official competitions with secret test sets won by (mostly purely supervised) deep NNs since 2009,
in sequence recognition, image classification, image segmentation, and object detection. Many RNN
results depended on LSTM (Sec. 5.13); many FNN results depended on GPU-based FNN code de-
veloped since 2004 (Sec. 5.16, 5.17, 5.18, 5.19), in particular, GPU-MPCNNs (Sec. 5.19). Sec. 5.24
mentions recent tricks for improving DL in NNs, many of them closely related to earlier tricks from
the previous millennium (e.g., Sec. 5.6.2, 5.6.3). Sec. 5.25 discusses how artificial NNs can help to
understand biological NNs; Sec. 5.26 addresses the possibility of DL in NNs with spiking neurons.
5.2 Around 1960: Visual Cortex Provides Inspiration for DL (Sec. 5.4, 5.11)
Simple cells and complex cells were found in the cat’s visual cortex (e.g., ??). These cells fire in
response to certain properties of visual sensory inputs, such as the orientation of edges. Complex
cells exhibit more spatial invariance than simple cells. This inspired later deep NN architectures
(Sec. 5.4, 5.11) used in certain modern award-winning Deep Learners (Sec. 5.19–5.22).
5.3 1965: Deep Networks Based on the Group Method of Data Handling
Networks trained by the Group Method of Data Handling (GMDH) (????) were perhaps the first DL
systems of the Feedforward Multilayer Perceptron type, although there was earlier work on NNs with
a single hidden layer (e.g., ??). The units of GMDH nets may have polynomial activation functions
implementing Kolmogorov-Gabor polynomials (more general than other widely used NN activation
functions, Sec. 2). Given a training set, layers are incrementally grown and trained by regression
analysis (e.g., ???) (Sec. 5.1), then pruned with the help of a separate validation set (using to-
day’s terminology), where Decision Regularisation is used to weed out superfluous units (compare
Sec. 5.6.3). The numbers of layers and units per layer can be learned in problem-dependent fashion.
To my knowledge, this was the first example of open-ended, hierarchical representation learning in
NNs (Sec. 4.3). A paper of 1971 already described a deep GMDH network with 8 layers (?). There
have been numerous applications of GMDH-style nets, e.g. (????????).
Efficient BP was soon explicitly used to minimize cost functions by adapting control parameters
(weights) (?). Compare some preliminary, NN-specific discussion (?, section 5.5.1), a method for
multilayer threshold NNs (?), and a computer program for automatically deriving and implementing
BP for given differentiable systems (?).
To my knowledge, the first NN-specific application of efficient BP as above was described in
1981 (??). Related work was published several years later (???). A paper of 1986 significantly
contributed to the popularisation of BP for NNs (?), experimentally demonstrating the emergence of
useful internal representations in hidden layers. See generalisations for sequence-processing recurrent
NNs (e.g., ???????????????), also for equilibrium RNNs (??) with stationary inputs.
5.5.1 BP for Weight-Sharing Feedforward NNs (FNNs) and Recurrent NNs (RNNs)
Using the notation of Sec. 2 for weight-sharing FNNs or RNNs, after an episode of activation
spreading through differentiable $f_t$, a single iteration of gradient descent through BP computes
changes of all $w_i$ in proportion to
$\frac{\partial E}{\partial w_i} = \sum_t \frac{\partial E}{\partial net_t} \frac{\partial net_t}{\partial w_i}$
as in Algorithm 5.5.1 (for the additive case), where each weight $w_i$ is associated with a
real-valued variable $\Delta_i$ initialized by 0.
The computational costs of the backward (BP) pass are essentially those of the forward pass
(Sec. 2). Forward and backward passes are re-iterated until sufficient performance is reached.
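A minimal sketch of this forward/backward scheme for a two-layer FNN with squared error follows;
the tanh units, layer sizes, and learning rate are illustrative assumptions, not Algorithm 5.5.1
itself:

    import numpy as np

    rng = np.random.default_rng(0)
    W1, W2 = rng.normal(0, 0.5, (3, 5)), rng.normal(0, 0.5, (5, 2))
    x, target, lr = rng.normal(size=3), np.array([1.0, -1.0]), 0.1

    for step in range(100):
        # Forward pass: activation spreading through differentiable f_t.
        net1 = x @ W1;  h = np.tanh(net1)
        net2 = h @ W2;  y = np.tanh(net2)
        E = 0.5 * np.sum((y - target) ** 2)

        # Backward pass computes dE/dnet_t at essentially the cost of the
        # forward pass.
        dnet2 = (y - target) * (1 - y ** 2)
        dnet1 = (dnet2 @ W2.T) * (1 - h ** 2)

        # dE/dw_i = sum_t dE/dnet_t * dnet_t/dw_i; change w_i in proportion.
        W2 -= lr * np.outer(h, dnet2)
        W1 -= lr * np.outer(x, dnet1)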
As of 2014, this simple BP method is still the central learning algorithm for FNNs and RNNs. No-
tably, most contest-winning NNs up to 2014 (Sec. 5.12, 5.14, 5.17, 5.19, 5.21, 5.22) did not augment
supervised BP by some sort of unsupervised learning as discussed in Sec. 5.7, 5.10, 5.15.
5.6.1 Ideas for Dealing with Long Time Lags and Deep CAPs
To deal with long time lags between relevant events, several sequence processing methods were pro-
posed, including Focused BP based on decay factors for activations of units in RNNs (??), Time-Delay
Neural Networks (TDNNs) (?) and their adaptive extension (?), Nonlinear AutoRegressive with eX-
ogenous inputs (NARX) RNNs (?), certain hierarchical RNNs (?) (compare Sec. 5.10, 1991), RL
economies in RNNs with WTA units and local learning rules (?), and other methods (e.g., ??????).
However, these algorithms either worked for shallow CAPs only, could not generalize to unseen CAP
depths, had problems with greatly varying time lags between relevant events, needed external fine
tuning of delay constants, or suffered from other problems. In fact, it turned out that certain simple
but deep benchmark problems used to evaluate such methods are more quickly solved by randomly
guessing RNN weights until a solution is found (?).
While the RNN methods above were designed for DL of temporal sequences, the Neural Heat Ex-
changer (?) consists of two parallel deep FNNs with opposite flow directions. Input patterns enter the
first FNN and are propagated “up”. Desired outputs (targets) enter the “opposite” FNN and are prop-
agated “down”. Using a local learning rule, each layer in each net tries to be similar (in information
content) to the preceding layer and to the adjacent layer of the other net. The input entering the first
net slowly “heats up” to become the target. The target entering the opposite net slowly “cools down”
to become the input. The Helmholtz Machine (??) may be viewed as an unsupervised (Sec. 5.6.4)
variant thereof (Peter Dayan, personal communication, 1994).
A hybrid approach (??) initializes a potentially deep FNN through a domain theory in propo-
sitional logic, which may be acquired through explanation-based learning (???). The NN is then
fine-tuned through BP (Sec. 5.5). The NN’s depth reflects the longest chain of reasoning in the origi-
nal set of logical rules. An extension of this approach (??) initializes an RNN by domain knowledge
expressed as a Finite State Automaton (FSA). BP-based fine-tuning has become important for later
DL systems pre-trained by UL, e.g., Sec. 5.10, 5.15.
5.6.3 Searching For Simple, Low-Complexity, Problem-Solving NNs (Sec. 5.24)
Many researchers used BP-like methods to search for “simple,” low-complexity NNs (Sec. 4.4) with
high generalization capability. Most approaches address the bias/variance dilemma (?) through strong
prior assumptions. For example, weight decay (???) encourages near-zero weights, by penalizing
large weights. In a Bayesian framework (?), weight decay can be derived (?) from Gaussian or
Laplacian weight priors (??); see also (?). An extension of this approach postulates that a distribution
of networks with many similar weights generated by Gaussian mixtures is “better” a priori (?).
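As an illustration, weight decay simply adds the gradient of a quadratic weight penalty to each
update; the decay constant below is an arbitrary assumption:

    import numpy as np

    def decayed_update(W, grad_E, lr=0.01, decay=1e-4):
        """One gradient step on E plus weight decay: the penalty term
        (decay/2) * sum(W**2) corresponds to a Gaussian prior on the weights
        and pushes them towards zero by penalizing large weights."""
        return W - lr * (grad_E + decay * W)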
Often weight priors are implicit in additional penalty terms (?) or in methods based on valida-
tion sets (??????), Akaike’s information criterion and final prediction error (???), or generalized
prediction error (??). See also (???????). Similar priors (or biases towards simplicity) are implicit
in constructive and pruning algorithms, e.g., layer-by-layer sequential network construction (e.g.,
??????????????) (see also Sec. 5.3, 5.11), input pruning (??), unit pruning (e.g., ?????), weight
pruning, e.g., optimal brain damage (?), and optimal brain surgeon (?).
A very general but not always practical approach for discovering low-complexity SL NNs or
RL NNs searches among weight matrix-computing programs written in a universal programming
language, with a bias towards fast and short programs (?) (Sec. 6.7).
Flat Minimum Search (FMS) (??) searches for a “flat” minimum of the error function: a large
connected region in weight space where error is low and remains approximately constant, that is,
few bits of information are required to describe low-precision weights with high variance. Compare
perturbation tolerance conditions (????????). An MDL-based, Bayesian argument suggests that flat
minima correspond to “simple” NNs and low expected overfitting. Compare Sec. 5.6.4 and more
recent developments mentioned in Sec. 5.24.
Algorithm (SOTA) (?), and nonlinear Autoencoders (AEs) with more than 3 (e.g., 5) layers (???).
Such AE NNs (?) can be trained to map input patterns to themselves, for example, by compactly
encoding them through activations of units of a narrow bottleneck hidden layer. Certain nonlinear
AEs suffer from certain limitations (?).
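A minimal sketch of such a bottleneck AE trained to map input patterns to themselves (the tanh
encoder, linear decoder, squared reconstruction error, and learning rate are all assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))                 # input patterns
    W_enc = rng.normal(0, 0.1, (10, 3))            # narrow bottleneck: 3 units
    W_dec = rng.normal(0, 0.1, (3, 10))

    for step in range(500):
        code = np.tanh(X @ W_enc)                  # compact code in the bottleneck
        recon = code @ W_dec                       # reconstruction of the input
        err = recon - X                            # AE target = its own input
        # Backpropagate the squared reconstruction error (Sec. 5.5).
        d_code = (err @ W_dec.T) * (1 - code ** 2)
        W_dec -= 1e-3 * code.T @ err
        W_enc -= 1e-3 * X.T @ d_code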
LOCOCODE (?) uses FMS (Sec. 5.6.3) to find low-complexity AEs with low-precision weights
describable by few bits of information, often producing sparse or factorial codes. Predictability Mini-
mization (PM) (?) searches for factorial codes through nonlinear feature detectors that fight nonlinear
predictors, trying to become both as informative and as unpredictable as possible. PM-based UL was
applied not only to FNNs but also to RNNs (e.g., ??). Compare Sec. 5.10 on UL-based RNN stacks
(1991), as well as later UL RNNs (e.g., ??).
standard activation functions (Sec. 1), cumulative backpropagated error signals (Sec. 5.5.1) either
shrink rapidly, or grow out of bounds. In fact, they decay exponentially in the number of layers
or CAP depth (Sec. 3), or they explode. This is also known as the long time lag problem. Much
subsequent DL research of the 1990s and 2000s was motivated by this insight. Later work (?) also
studied basins of attraction and their stability under noise from a dynamical systems point of view:
either the dynamics are not robust to noise, or the gradients vanish. See also (??). Over the years,
several ways of partially overcoming the Fundamental Deep Learning Problem were explored (a
numerical sketch of the exponential decay/explosion follows the list below):
I A Very Deep Learner of 1991 (the History Compressor, Sec. 5.10) alleviates the problem
through unsupervised pre-training for a hierarchy of RNNs. This greatly facilitates subsequent
supervised credit assignment through BP (Sec. 5.5). In the FNN case, similar effects can be
achieved through conceptually related AE stacks (Sec. 5.7, 5.15) and Deep Belief Networks
(DBNs, Sec. 5.15).
II LSTM-like networks (Sec. 5.13, 5.16, 5.17, 5.21–5.23) alleviate the problem through a special
architecture unaffected by it.
III Today’s GPU-based computers have a million times the computational power of desktop ma-
chines of the early 1990s. This allows for propagating errors a few layers further down within
reasonable time, even in traditional NNs (Sec. 5.18). That is basically what is winning many of
the image recognition competitions now (Sec. 5.19, 5.21, 5.22). (Although this does not really
overcome the problem in a fundamental way.)
IV Hessian-free optimization (Sec. 5.6.2) can alleviate the problem for FNNs (????) (Sec. 5.6.2)
and RNNs (?) (Sec. 5.20).
V The space of NN weight matrices can also be searched without relying on error gradients,
thus avoiding the Fundamental Deep Learning Problem altogether. Random weight guessing
sometimes works better than more sophisticated methods (?). Certain more complex problems
are better solved by using Universal Search (?) for weight matrix-computing programs written
in a universal programming language (?). Some are better solved by using linear methods
to obtain optimal weights for connections to output events (Sec. 2), and evolving weights of
connections to other events—this is called Evolino (?). Compare also related RNNs pre-trained
by certain UL rules (?), also in the case of spiking neurons (??) (Sec. 5.26). Direct search
methods are relevant not only for SL but also for more general RL, and are discussed in more
detail in Sec. 6.6.
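The exponential decay or explosion referred to above is easy to reproduce: backpropagating through
many similar stages multiplies the error signal by a Jacobian-like factor per stage. In this
sketch, the weight scales and depth are arbitrary assumptions:

    import numpy as np

    rng = np.random.default_rng(0)

    def backprop_norm(weight_scale, depth=50, n=20):
        """Norm of an error signal after backpropagation through `depth`
        tanh stages with random weights of the given scale."""
        delta = rng.normal(size=n)
        for _ in range(depth):
            W = rng.normal(0, weight_scale / np.sqrt(n), (n, n))
            h = np.tanh(rng.normal(size=n))
            delta = (delta @ W.T) * (1 - h ** 2)   # dE/dnet of the previous stage
        return np.linalg.norm(delta)

    print(backprop_norm(0.5))   # shrinks rapidly (vanishing gradient)
    print(backprop_norm(4.0))   # grows out of bounds (exploding gradient)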
The RNN stack is essentially a deep generative model of the data, which can be reconstructed from
its compressed form. Adding another RNN to the stack improves a bound on the data’s description
length—equivalent to the negative logarithm of its probability (??)—as long as there is remaining
local learnable predictability in the data representation on the corresponding level of the hierarchy.
Compare a similar observation for feedforward Deep Belief Networks (DBNs, 2006, Sec. 5.15).
The system was able to learn many previously unlearnable DL tasks. One ancient illustrative DL
experiment (?) required CAPs (Sec. 3) of depth 1200. The top level code of the initially unsupervised
RNN stack, however, got so compact that (previously infeasible) sequence classification through ad-
ditional BP-based SL became possible. Essentially the system used UL to greatly reduce problem
depth. Compare earlier BP-based fine-tuning of NNs initialized by rules of propositional logic (?)
(Sec. 5.6.1).
There is a way of compressing higher levels down into lower levels, thus fully or partially collaps-
ing the RNN stack. The trick is to retrain a lower-level RNN to continually imitate (predict) the hidden
units of an already trained, slower, higher-level RNN (the “conscious” chunker), through additional
predictive output neurons (?). This helps the lower RNN (the automatizer) to develop appropriate,
rarely changing memories that may bridge very long time lags. Again, this procedure can greatly
reduce the required depth of the BP process.
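A heavily simplified schematic of this imitation step (both nets are plain tanh RNNs here, and the
imitation loss is a squared error over the predictive output neurons; these are assumptions for
illustration, not the original 1992 setup):

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hi, n_lo = 4, 6, 6

    # An already trained, slower, higher-level RNN: the "chunker" (frozen here).
    W_hi = rng.normal(0, 0.5, (n_in + n_hi, n_hi))
    # The lower-level RNN (the "automatizer") with extra predictive outputs
    # that must continually imitate the chunker's hidden units.
    W_lo = rng.normal(0, 0.5, (n_in + n_lo, n_lo))
    W_pred = rng.normal(0, 0.5, (n_lo, n_hi))

    def step(W, x, h):
        return np.tanh(np.concatenate([x, h]) @ W)

    xs = rng.normal(size=(20, n_in))
    h_hi, h_lo = np.zeros(n_hi), np.zeros(n_lo)
    for x in xs:
        h_hi = step(W_hi, x, h_hi)             # chunker hidden state (target)
        h_lo = step(W_lo, x, h_lo)
        pred = h_lo @ W_pred                   # predictive output neurons
        err = pred - h_hi                      # imitate the chunker's hidden units
        W_pred -= 0.01 * np.outer(h_lo, err)   # (truncated) gradient step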
The 1991 system was a working Deep Learner in the modern post-2000 sense, and also a first
Neural Hierarchical Temporal Memory (HTM). It is conceptually similar to earlier AE hierarchies
(1987, Sec. 5.7) and later Deep Belief Networks (2006, Sec. 5.15), but more general in the sense that it
uses sequence-processing RNNs instead of FNNs with unchanging inputs. More recently, well-known
entrepreneurs (??) also got interested in HTMs; compare also hierarchical HMMs (e.g., ?), as well
as later UL-based recurrent systems (????). Clockwork RNNs (?) also consist of interacting RNN
modules with different clock rates, but do not use UL to set those rates. Stacks of RNNs were used in
later work on SL with great success, e.g., Sec. 5.13, 5.16, 5.17, 5.22.
5.11 1992: Max-Pooling (MP): Towards MPCNNs (Compare Sec. 5.16, 5.19)
The Neocognitron (Sec. 5.4) inspired the Cresceptron (?), which adapts its topology during training
(Sec. 5.6.3); compare the incrementally growing and shrinking GMDH networks (1965, Sec. 5.3).
Instead of using alternative local subsampling or WTA methods (e.g., ????), the Cresceptron
uses Max-Pooling (MP) layers. Here a 2-dimensional layer or array of unit activations is partitioned
into smaller rectangular arrays. Each is replaced in a downsampling layer by the activation of its
maximally active unit. A later, more complex version of the Cresceptron (?) also included “blurring”
layers to improve object location tolerance.
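A minimal max-pooling sketch follows; the 2-by-2 pooling regions are an assumption (the
Cresceptron partitions the layer into rectangular arrays in general):

    import numpy as np

    def max_pool(layer, p=2):
        """Partition a 2D array of unit activations into p-by-p rectangles and
        replace each, in the downsampling layer, by the activation of its
        maximally active unit."""
        h, w = layer.shape
        blocks = layer[:h - h % p, :w - w % p].reshape(h // p, p, w // p, p)
        return blocks.max(axis=(1, 3))

    a = np.arange(16.0).reshape(4, 4)
    print(max_pool(a))   # 2x2 downsampled layer: [[5, 7], [13, 15]]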
The neurophysiologically plausible topology of the feedforward HMAX model (?) is very similar
to the one of the 1992 Cresceptron (and thus to the 1979 Neocognitron). HMAX does not learn
though. Its units have hand-crafted weights; biologically plausible learning rules were later proposed
for similar models (e.g., ??).
When CNNs or convnets (Sec. 5.4, 5.8) are combined with MP, they become Cresceptron-like
or HMAX-like MPCNNs with alternating convolutional and max-pooling layers. Unlike Crescep-
tron and HMAX, however, MPCNNs are trained by BP (Sec. 5.5, 5.16) (?). Advantages of doing
this were pointed out subsequently (?). BP-trained MPCNNs have become central to many modern,
competition-winning, feedforward, visual Deep Learners (Sec. 5.17, 5.19–5.23).
intensity pulsations of an NH3 laser (??). No very deep CAPs (Sec. 3) were needed though.
protein structure prediction (?). DAG-RNNs (??) generalize BRNNs to multiple dimensions. They
learned to predict properties of small organic molecules (?) as well as protein contact maps (?), also
in conjunction with a growing deep FNN (?) (Sec. 5.21). BRNNs and DAG-RNNs unfold their full
potential when combined with the LSTM concept (???).
Particularly successful in recent competitions are stacks (Sec. 5.10) of LSTM RNNs (??) trained
by Connectionist Temporal Classification (CTC) (?), a gradient-based method for finding RNN
weights that maximize the probability of teacher-given label sequences, given (typically much longer
and more high-dimensional) streams of real-valued input vectors. CTC-LSTM performs simultaneous
segmentation (alignment) and recognition (Sec. 5.22).
In the early 2000s, speech recognition was dominated by HMMs combined with FNNs (e.g., ?).
Nevertheless, when trained from scratch on utterances from the TIDIGITS speech database, in 2003
LSTM already obtained results comparable to those of HMM-based systems (???). In 2007, LSTM
outperformed HMMs in keyword spotting tasks (?); compare recent improvements (??). By 2013,
LSTM also achieved best known results on the famous TIMIT phoneme recognition benchmark (?)
(Sec. 5.22). Recently, LSTM RNN / HMM hybrids obtained best known performance on medium-
vocabulary (?) and large-vocabulary speech recognition (?).
LSTM is also applicable to robot localization (?), robot control (?), online driver distraction de-
tection (?), and many other tasks. For example, it helped to improve the state of the art in diverse
applications such as protein analysis (?), handwriting recognition (????), voice activity detection (?),
optical character recognition (?), language identification (?), prosody contour prediction (?), audio on-
set detection (?), text-to-speech synthesis (?), social signal classification (?), machine translation (?),
and others.
RNNs can also be used for metalearning (???), because they can in principle learn to run their
own weight change algorithm (?). A successful metalearner (?) used an LSTM RNN to quickly learn
a learning algorithm for quadratic functions (compare Sec. 6.8).
Recently, LSTM RNNs won several international pattern recognition competitions and set nu-
merous benchmark records on large and complex data sets, e.g., Sec. 5.17, 5.21, 5.22. Gradient-
based LSTM is no panacea though—other methods sometimes outperformed it at least on certain
tasks (?????); compare Sec. 5.20.
of HMM-based systems (?); compare Sec. 5.13, 5.16, 5.21, 5.22.
Also in 2007, hierarchical stacks of LSTM RNNs were introduced (?). They can be trained by
hierarchical Connectionist Temporal Classification (CTC) (?). For tasks of sequence labelling, every
LSTM RNN level (Sec. 5.13) predicts a sequence of labels fed to the next level. Error signals at every
level are back-propagated through all the lower levels. On spoken digit recognition, LSTM stacks
outperformed HMMs, despite making fewer assumptions about the domain. LSTM stacks do not
necessarily require unsupervised pre-training like the earlier UL-based RNN stacks (?) of Sec. 5.10.
5.17 2009: First Official Competitions Won by RNNs, and with MPCNNs
Stacks of LSTM RNNs trained by CTC (Sec. 5.13, 5.16) became the first RNNs to win official interna-
tional pattern recognition contests (with secret test sets known only to the organisers). More precisely,
three connected handwriting competitions at ICDAR 2009 in three different languages (French, Arabic,
Farsi) were won by deep LSTM RNNs without any a priori linguistic knowledge, performing simul-
taneous segmentation and recognition. Compare (?????) (Sec. 5.22).
To detect human actions in surveillance videos, a 3-dimensional CNN (e.g., ??), combined with
SVMs, was part of a larger system (?) using a bag of features approach (?) to extract regions of
interest. The system won three 2009 TRECVID competitions. These were possibly the first official
international contests won with the help of (MP)CNNs (Sec. 5.16). An improved version of the
method was published later (?).
2009 also saw a GPU-DBN implementation (?) orders of magnitude faster than previous CPU-
DBNs (see Sec. 5.15); see also (?). The Convolutional DBN (?) (with a probabilistic variant of MP,
Sec. 5.11) combines ideas from CNNs and DBNs, and was successfully applied to audio classifica-
tion (?).
is chosen as the system’s classification of the present input. Compare earlier, more sophisticated
ensemble methods (?), the contest-winning ensemble Bayes-NN (?) of Sec. 5.14, and recent related
work (?).
An ensemble of GPU-MPCNNs was the first system to achieve superhuman visual pattern recog-
nition (??) in a controlled competition, namely, the IJCNN 2011 traffic sign recognition contest in
San Jose (CA) (??). This is of interest for fully autonomous, self-driving cars in traffic (e.g., ?). The
GPU-MPCNN ensemble obtained a 0.56% error rate—roughly half the error rate of human test subjects,
a third that of the closest artificial NN competitor (?), and a sixth that of the best non-neural
method.
A few months earlier, the qualifying round was won in a 1st stage online competition, albeit by
a much smaller margin: 1.02% (?) vs 1.03% for second place (?). After the deadline, the organisers
revealed that human performance on the test set was 1.19%. That is, the best methods already seemed
human-competitive. However, during the qualifying round it was possible to incrementally gain information
about the test set by probing it through repeated submissions. This is illustrated by better and better
results obtained by various teams over time (?) (the organisers eventually imposed a limit of 10
resubmissions). In the final competition this was not possible.
This illustrates a general problem with benchmarks whose test sets are public, or at least can be
probed to some extent: competing teams tend to overfit on the test set even when it cannot be directly
used for training, only for evaluation.
In 1997 many thought it a big deal that human chess world champion Kasparov was beaten by
an IBM computer. But back then computers could not at all compete with little kids in visual pat-
tern recognition, which seems much harder than chess from a computational perspective. Of course,
the traffic sign domain is highly restricted, and kids are still much better general pattern recognis-
ers. Nevertheless, by 2011, deep NNs could already learn to rival them in important limited visual
domains.
An ensemble of GPU-MPCNNs was also the first method to achieve human-competitive perfor-
mance (around 0.2%) on MNIST (?). This represented a dramatic improvement, since by then the
MNIST record had hovered around 0.4% for almost a decade (Sec. 5.14, 5.16, 5.18).
Given all the prior work on (MP)CNNs (Sec. 5.4, 5.8, 5.11, 5.16) and GPU-CNNs (Sec. 5.16),
GPU-MPCNNs are not a breakthrough in the scientific sense. But they are a commercially relevant
breakthrough in efficient coding that has made a difference in several contests since 2011. Today, most
feedforward competition-winning deep NNs are (ensembles of) GPU-MPCNNs (Sec. 5.21–5.23).
Instead of relying on efficient GPU programming, this was done by brute force on 1,000 standard
machines with 16,000 cores.
So by 2011/2012, excellent results had been achieved by Deep Learners in image recognition and
classification (Sec. 5.19, 5.21). The computer vision community, however, is especially interested in
object detection in large images, for applications such as image-based search engines, or for biomed-
ical diagnosis where the goal may be to automatically detect tumors, etc., in images of human tissue.
Object detection presents additional challenges. One natural approach is to train a deep NN classifier
on patches of big images, then use it as a feature detector to be shifted across unknown visual scenes,
using various rotations and zoom factors. Image parts that yield highly active output units are likely
to contain objects similar to those the NN was trained on.
2012 finally saw the first DL system (an ensemble of GPU-MPCNNs, Sec. 5.19) to win a contest
on visual object detection (?) in large images of several million pixels (??). Such biomedical appli-
cations may turn out to be among the most important applications of DL. The world spends over 10%
of GDP on healthcare (> 6 trillion USD per year), much of it on medical diagnosis through expensive
experts. Partial automation of this could not only save lots of money, but also make expert diagnostics
accessible to many who currently cannot afford it. It is gratifying to observe that today deep NNs may
actually help to improve healthcare and perhaps save human lives.
2012 also saw the first pure image segmentation contest won by DL (?), again through a GPU-
MPCNN ensemble (?). Electron microscopy (EM) stacks are relevant for the recently approved huge
brain projects in Europe and the US (e.g., ?). Given EM images of stacks of thin slices of animal brains,
the goal is to build a detailed 3D model of the brain’s neurons and dendrites. But human experts need
hours, days, even weeks to annotate the images: Which parts depict neuronal membranes?
Which parts are irrelevant background? This needs to be automated (e.g., ?). Deep Multi-Column
GPU-MPCNNs learned to solve this task through experience with many training images, and won the
contest on all three evaluation metrics by a large margin, with superhuman performance in terms of
pixel error.
Both object detection (?) and image segmentation (?) profit from fast MPCNN-based image scans
that avoid redundant computations. Recent MPCNN scanners speed up naive implementations by up
to three orders of magnitude (??); compare earlier efficient methods for CNNs without MP (?).
Also in 2012, a system consisting of growing deep FNNs and 2D-BRNNs (?) won the CASP
2012 contest on protein contact map prediction. On the IAM-OnDoDB benchmark, LSTM RNNs
(Sec. 5.13) outperformed all other methods (HMMs, SVMs) on online mode detection (??) and key-
word spotting (?). On the long time lag problem of language modelling, LSTM RNNs outperformed
all statistical approaches on the IAM-DB benchmark (?); improved results were later obtained through
a combination of NNs and HMMs (?). Compare earlier RNNs for object recognition through iterative
image interpretation (???); see also more recent publications (??) extending work on biologically
plausible learning rules for RNNs (?).
LSTM-based systems also set benchmark records in language identification (?), medium-vocabulary
speech recognition (?), prosody contour prediction (?), audio onset detection (?), text-to-speech syn-
thesis (?), and social signal classification (?).
An LSTM RNN was used to estimate the state posteriors of an HMM; this system beat the previous
state of the art in large vocabulary speech recognition (??). Another LSTM RNN with hundreds of
millions of connections was used to rerank hypotheses of a statistical machine translation system; this
system beat the previous state of the art in English to French translation (?).
A new record on the ICDAR Chinese handwriting recognition benchmark (over 3700 classes)
was set on a desktop machine by an ensemble of GPU-MPCNNs (Sec. 5.19) with almost human
performance (?); compare (?).
The MICCAI 2013 Grand Challenge on Mitosis Detection (?) also was won by an object-detecting
GPU-MPCNN ensemble (?). Its data set was even larger and more challenging than the one of ICPR
2012 (Sec. 5.21): a real-world dataset including many ambiguous cases and frequently encountered
problems such as imperfect slide staining.
Three 2D-CNNs (with mean-pooling instead of MP, Sec. 5.11) observing three orthogonal projec-
tions of 3D images outperformed traditional full 3D methods on the task of segmenting tibial cartilage
in low field knee MRI scans (?).
Deep GPU-MPCNNs (Sec. 5.19) also helped to achieve new best results on important benchmarks
of the computer vision community: ImageNet classification (??) and—in conjunction with traditional
approaches—PASCAL object detection (?). They also learned to predict bounding box coordinates
of objects in the Imagenet 2013 database, and obtained state-of-the-art results on tasks of localization
and detection (?). GPU-MPCNNs also helped to recognise multi-digit numbers in Google Street View
images (?), where part of the NN was trained to count visible digits; compare earlier work on detect-
ing “numerosity” through DBNs (?). This system also excelled at recognising distorted synthetic text
in reCAPTCHA puzzles. Other successful CNN applications include scene parsing (?), object detec-
tion (?), shadow detection (?), video classification (?), and Alzheimer's disease neuroimaging (?).
Additional contests are mentioned in the web pages of the Swiss AI Lab IDSIA, the University of
Toronto, NY University, and the University of Montreal.
5.24 Recent Tricks for Improving SL Deep NNs (Compare Sec. 5.6.2, 5.6.3)
DBN training (Sec. 5.15) can be improved through gradient enhancements and automatic learning
rate adjustments during stochastic gradient descent (??), and through Tikhonov-type (?) regulariza-
tion of RBMs (?). Contractive AEs (?) discourage hidden unit perturbations in response to input
perturbations, similar to how FMS (Sec. 5.6.3) for LOCOCODE AEs (Sec. 5.6.4) discourages output
perturbations in response to weight perturbations.
Hierarchical CNNs in a Neural Abstraction Pyramid (e.g., ??) were trained to reconstruct images
corrupted by structured noise (?), thus enforcing increasingly abstract image representations in deeper
and deeper layers. Denoising AEs later used a similar procedure (?).
Dropout (??) removes units from NNs during training to improve generalisation. Some view it
as an ensemble method that trains multiple data models simultaneously (?). Under certain circum-
stances, it could also be viewed as a form of training set augmentation: effectively, more and more
informative complex features are removed from the training data. Compare dropout for RNNs (???).
A deterministic approximation coined fast dropout (?) can lead to faster learning and evaluation and
was adapted for RNNs (?). Dropout is closely related to older, biologically plausible techniques for
adding noise to neurons or synapses during training (e.g., ??????), which in turn are closely related
to finding perturbation-resistant low-complexity NNs, e.g., through FMS (Sec. 5.6.3). MDL-based
stochastic variational methods (?) are also related to FMS. They are useful for RNNs, where classic
regularizers such as weight decay (Sec. 5.6.3) represent a bias towards limited memory capacity (e.g.,
?). Compare recent work on variational recurrent AEs (?).
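A minimal sketch of the dropout procedure described above, in its "inverted" variant where the
rescaling happens at training time; the keep probability is an arbitrary assumption:

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(activations, p_keep=0.5, train=True):
        """Randomly remove units during training; rescale the survivors so that
        expected activations match those of the full net used at test time."""
        if not train:
            return activations
        mask = rng.random(activations.shape) < p_keep
        return activations * mask / p_keep

    h = np.tanh(rng.normal(size=8))
    print(dropout(h))                # training: roughly half the units removed
    print(dropout(h, train=False))   # evaluation: the full net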
The activation function $f$ of Rectified Linear Units (ReLUs) is $f(x) = x$ for $x > 0$ and
$f(x) = 0$ otherwise—compare the old concept of half-wave rectified units (?). ReLU NNs are useful for
RBMs (??), outperformed sigmoidal activation functions in deep NNs (?), and helped to obtain best
results on several benchmark problems across multiple domains (e.g., ??).
NNs with competing linear units tend to outperform those with non-competing nonlinear units,
and avoid catastrophic forgetting through BP when training sets change over time (?). In this con-
text, choosing a learning algorithm may be more important than choosing activation functions (?).
Maxout NNs (?) combine competitive interactions and dropout (see above) to achieve excellent re-
sults on certain benchmarks. Compare early RNNs with competing units for SL and RL (?). To
address overfitting, instead of depending on pre-wired regularizers and hyper-parameters (??), self-
delimiting RNNs (SLIM NNs) with competing units (?) can in principle learn to select their own
runtime and their own numbers of effective free parameters, thus learning their own computable regu-
larisers (Sec. 4.4, 5.6.3), becoming fast and slim when necessary. One may penalize the task-specific
total length of connections (e.g., ????) and communication costs of SLIM NNs implemented on the
3-dimensional brain-like multi-processor hardware to be expected in the future.
RmsProp (??) can speed up first order gradient descent methods (Sec. 5.5, 5.6.2); compare
vario-η (?), Adagrad (?), and Adadelta (?). DL in NNs can also be improved by transforming hidden unit
activations such that they have zero output and slope on average (?). Many additional, older tricks
(Sec. 5.6.2, 5.6.3) should also be applicable to today’s deep NNs; compare (??).
be minimised may be quite similar to the one of visual ANNs. In fact, results obtained with relatively
deep artificial DBNs (?) and CNNs (?) seem compatible with insights about the visual pathway in the
primate cerebral cortex, which has been studied for many decades (e.g., ?????????????); compare a
computer vision-oriented survey (?).
6 DL in FNNs and RNNs for Reinforcement Learning (RL)
So far we have focused on Deep Learning (DL) in supervised or unsupervised NNs. Such NNs learn
to perceive / encode / predict / classify patterns or pattern sequences, but they do not learn to act in
the more general sense of Reinforcement Learning (RL) in unknown environments (see surveys, e.g.,
???). Here we add a discussion of DL FNNs and RNNs for RL. It will be shorter than the discussion
of FNNs and RNNs for SL and UL (Sec. 5), reflecting the current size of the various fields.
Without a teacher, solely from occasional real-valued pain and pleasure signals, RL agents must
discover how to interact with a dynamic, initially unknown environment to maximize their expected
cumulative reward signals (Sec. 2). There may be arbitrary, a priori unknown delays between actions
and perceivable consequences. The problem is as hard as any problem of computer science, since any
task with a computable description can be formulated in the RL framework (e.g., ?). For example, an
answer to the famous question of whether $P = NP$ (??) would also set limits for what is achievable
by general RL. Compare more specific limitations, e.g., (???). The following subsections mostly
focus on certain obvious intersections between DL and RL—they cannot serve as a general RL survey.
6.2 Deep FNNs for Traditional RL and Markov Decision Processes (MDPs)
The classical approach to RL (??) makes the simplifying assumption of Markov Decision Processes
(MDPs): the current input of the RL agent conveys all information necessary to compute an optimal
next output event or decision. This allows for greatly reducing CAP depth in RL NNs (Sec. 3, 6.1)
by using the Dynamic Programming (DP) trick (?). The latter is often explained in a probabilistic
framework (e.g., ?), but its basic idea can already be conveyed in a deterministic setting. For simplic-
ity, using the notation of Sec. 2, let input events xt encode the entire current state of the environment,
including a real-valued reward rt (no need to introduce additional vector-valued notation, since real
values can encode arbitrary vectors of real values). The original RL goal (find weights that maximize
the sum of all rewards of an episode) is replaced by an equivalent set of alternative goals set by a real-
valued value function $V$ defined on input events. Consider any two subsequent input events $x_t, x_k$.
Recursively define $V(x_t) = r_t + V(x_k)$, where $V(x_k) = r_k$ if $x_k$ is the last input event.
Now search for weights that maximize the $V$ of all input events, by causing appropriate output
events or actions.
Due to the Markov assumption, an FNN suffices to implement the policy that maps input to output
events. Relevant CAPs are not deeper than this FNN. $V$ itself is often modeled by a separate
FNN (also yielding typically short CAPs) learning to approximate $V(x_t)$ only from local
information $r_t, V(x_k)$.
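In this deterministic setting, the recursion can be evaluated by a single backward sweep over an
episode; a minimal sketch with scalar rewards (the reward sequence is an arbitrary assumption):

    def values(rewards):
        """Deterministic DP trick: V(x_t) = r_t + V(x_k) for successive input
        events x_t, x_k; V of the last input event is just its reward."""
        V = [0.0] * len(rewards)
        V[-1] = rewards[-1]
        for t in range(len(rewards) - 2, -1, -1):
            V[t] = rewards[t] + V[t + 1]
        return V

    print(values([0.0, 0.0, 1.0, -0.5]))  # [0.5, 0.5, 0.5, -0.5]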
Many variants of traditional RL exist (e.g., ???????????????????????????). Most are formu-
lated in a probabilistic framework, and evaluate pairs of input and output (action) events (instead of
input events only). To facilitate certain mathematical derivations, some discount delayed rewards, but
such distortions of the original RL problem are problematic.
Perhaps the most well-known RL NN is the world-class RL backgammon player (?), which
achieved the level of human world champions by playing against itself. Its nonlinear, rather shal-
low FNN maps a large but finite number of discrete board states to values. More recently, a rather
deep GPU-CNN was used in a traditional RL framework to play several Atari 2600 computer games
directly from 84×84 pixel 60 Hz video input (?), using experience replay (?), extending previous
work on Neural Fitted Q-Learning (NFQ) (?). Even better results are achieved by using (slow) Monte
Carlo tree planning to train comparatively fast deep NNs (?). Compare RBM-based RL (?) with
high-dimensional inputs (?), earlier RL Atari players (?), and an earlier, raw video-based RL NN for
computer games (?) trained by Indirect Policy Search (Sec. 6.7).
1. Use an RNN as a value function mapping arbitrary event histories to values (e.g., ????). For
example, deep LSTM RNNs were used in this way for RL robots (?).
2. Use an RNN controller in conjunction with a second RNN as predictive world model, to obtain
a combined RNN with deep CAPs—see Sec. 6.1.
3. Use an RNN for RL by Direct Search (Sec. 6.6) or Indirect Search (Sec. 6.7) in weight space.
6.5 Deep Hierarchical RL (HRL) and Subgoal Learning with FNNs and RNNs
Multiple learnable levels of abstraction (?????) seem as important for RL as for SL. Work on NN-
based Hierarchical RL (HRL) has been published since the early 1990s. In particular, gradient-based
subgoal discovery with FNNs or RNNs decomposes RL tasks into subtasks for RL submodules (??).
Numerous alternative HRL techniques have been proposed (e.g., ????????????????). While HRL
frameworks such as Feudal RL (?) and options (???) do not directly address the problem of automatic
subgoal discovery, HQ-Learning (?) automatically decomposes POMDPs (Sec. 6.3) into sequences of
simpler subtasks that can be solved by memoryless policies learnable by reactive sub-agents. Recent
HRL organizes potentially deep NN-based RL sub-modules into self-organizing, 2-dimensional motor
control maps (?) inspired by neurophysiological findings (?).
Evolution Strategies (CMA-ES) (????), and NeuroEvolution of Augmenting Topologies (NEAT) (?).
Hybrid methods combine traditional NN-based RL (Sec. 6.2) and EAs (e.g., ?).
Since RNNs are general computers, RNN evolution is like GP in the sense that it can evolve
general programs. Unlike sequential programs learned by traditional GP, however, RNNs can mix
sequential and parallel information processing in a natural and efficient way, as already mentioned
in Sec. 1. Many RNN evolvers have been proposed (e.g., ????????????). One particularly effec-
tive family of methods coevolves neurons, combining them into networks, and selecting those neu-
rons for reproduction that participated in the best-performing networks (???). This can help to solve
deep POMDPs (?). Cooperative Synapse NeuroEvolution (CoSyNE) does something similar on the level of
synapses or weights (?); benefits of this were shown on difficult nonlinear POMDP benchmarks.
Natural Evolution Strategies (NES) (????) link policy gradient methods and evolutionary ap-
proaches through the concept of Natural Gradients (?). RNN evolution may also help to improve SL
for deep RNNs through Evolino (?) (Sec. 5.9).
6.8 Universal RL
General purpose learning algorithms may improve themselves in open-ended fashion and
environment-specific ways in a lifelong learning context (????). The most general type of RL is
constrained only by the fundamental limitations of computability identified by the founders of the-
oretical computer science (????). Remarkably, there exist blueprints of universal problem solvers
or universal RL machines for unlimited problem depth that are time-optimal in various theoretical
senses (????). In particular, the Gödel Machine can be implemented on general computers such as
RNNs and may improve any part of its software (including the learning algorithm itself) in a way
that is provably time-optimal in a certain sense (?). It can be initialized by an asymptotically optimal
meta-method (?) (also applicable to RNNs) which will solve any well-defined problem as quickly as
the unknown fastest way of solving it, save for an additive constant overhead that becomes negligible
as problem size grows. Note that most problems are large; only a few are small. AI and DL researchers
are still in business because many are interested in problems so small that it is worth trying to re-
duce the overhead through less general methods, including heuristics. Here I won’t further discuss
universal RL methods, which go beyond what is usually called DL.
(Sec. 5.6.4). They will also implement Occam’s razor (Sec. 4.4, 5.6.3) as a by-product of energy min-
imization, by finding simple (highly generalizing) problem solutions that require few active neurons
and few, mostly short connections.
The more distant future may belong to general purpose learning algorithms that improve them-
selves in provably optimal ways (Sec. 6.8), but these are not yet practical or commercially relevant.
8 Acknowledgments
Since 16 April 2014, drafts of this paper have undergone massive open online peer review through
public mailing lists including [email protected], [email protected],
[email protected], genetic_[email protected], [email protected],
[email protected], and the Google+ machine learning forum. Thanks to numerous NN / DL experts for valuable
comments. Thanks to SNF, DFG, and the European Commission for partially funding my DL re-
search group in the past quarter-century. The contents of this paper may be used for educational and
non-commercial purposes, including articles for Wikipedia and similar sites.
References