
Deep Learning in Neural Networks: An Overview

Technical Report IDSIA-03-14 / arXiv:1404.7828 v4 [cs.NE] (88 pages, 888 references)

Jürgen Schmidhuber
The Swiss AI Lab IDSIA
Istituto Dalle Molle di Studi sull’Intelligenza Artificiale
University of Lugano & SUPSI
Galleria 2, 6928 Manno-Lugano
Switzerland
8 October 2014

Abstract
In recent years, deep artificial neural networks (including recurrent ones) have won numerous
contests in pattern recognition and machine learning. This historical survey compactly summarises
relevant work, much of it from the previous millennium. Shallow and deep learners are distin-
guished by the depth of their credit assignment paths, which are chains of possibly learnable, causal
links between actions and effects. I review deep supervised learning (also recapitulating the history
of backpropagation), unsupervised learning, reinforcement learning & evolutionary computation,
and indirect search for short programs encoding deep and large networks.

LaTeX source: http://www.idsia.ch/~juergen/DeepLearning8Oct2014.tex


Complete BibTeX file (888 kB): http://www.idsia.ch/~juergen/deep.bib

Preface
This is the preprint of an invited Deep Learning (DL) overview. One of its goals is to assign credit
to those who contributed to the present state of the art. I acknowledge the limitations of attempting
to achieve this goal. The DL research community itself may be viewed as a continually evolving,
deep network of scientists who have influenced each other in complex ways. Starting from recent DL
results, I tried to trace back the origins of relevant ideas through the past half century and beyond,
sometimes using “local search” to follow citations of citations backwards in time. Since not all DL
publications properly acknowledge earlier relevant work, additional global search strategies were em-
ployed, aided by consulting numerous neural network experts. As a result, the present preprint mostly
consists of references. Nevertheless, through an expert selection bias I may have missed important
work. A related bias was surely introduced by my special familiarity with the work of my own DL
research group in the past quarter-century. For these reasons, this work should be viewed as merely a
snapshot of an ongoing credit assignment process. To help improve it, please do not hesitate to send
corrections and suggestions to juergen@idsia.ch.

Contents

1 Introduction to Deep Learning (DL) in Neural Networks (NNs)

2 Event-Oriented Notation for Activation Spreading in NNs

3 Depth of Credit Assignment Paths (CAPs) and of Problems

4 Recurring Themes of Deep Learning
   4.1 Dynamic Programming for Supervised/Reinforcement Learning (SL/RL)
   4.2 Unsupervised Learning (UL) Facilitating SL and RL
   4.3 Learning Hierarchical Representations Through Deep SL, UL, RL
   4.4 Occam's Razor: Compression and Minimum Description Length (MDL)
   4.5 Fast Graphics Processing Units (GPUs) for DL in NNs

5 Supervised NNs, Some Helped by Unsupervised NNs
   5.1 Early NNs Since the 1940s (and the 1800s)
   5.2 Around 1960: Visual Cortex Provides Inspiration for DL (Sec. 5.4, 5.11)
   5.3 1965: Deep Networks Based on the Group Method of Data Handling
   5.4 1979: Convolution + Weight Replication + Subsampling (Neocognitron)
   5.5 1960-1981 and Beyond: Development of Backpropagation (BP) for NNs
       5.5.1 BP for Weight-Sharing Feedforward NNs (FNNs) and Recurrent NNs (RNNs)
   5.6 Late 1980s-2000 and Beyond: Numerous Improvements of NNs
       5.6.1 Ideas for Dealing with Long Time Lags and Deep CAPs
       5.6.2 Better BP Through Advanced Gradient Descent (Compare Sec. 5.24)
       5.6.3 Searching For Simple, Low-Complexity, Problem-Solving NNs (Sec. 5.24)
       5.6.4 Potential Benefits of UL for SL (Compare Sec. 5.7, 5.10, 5.15)
   5.7 1987: UL Through Autoencoder (AE) Hierarchies (Compare Sec. 5.15)
   5.8 1989: BP for Convolutional NNs (CNNs, Sec. 5.4)
   5.9 1991: Fundamental Deep Learning Problem of Gradient Descent
   5.10 1991: UL-Based History Compression Through a Deep Stack of RNNs
   5.11 1992: Max-Pooling (MP): Towards MPCNNs (Compare Sec. 5.16, 5.19)
   5.12 1994: Early Contest-Winning NNs
   5.13 1995: Supervised Recurrent Very Deep Learner (LSTM RNN)
   5.14 2003: More Contest-Winning/Record-Setting NNs; Successful Deep NNs
   5.15 2006/7: UL For Deep Belief Networks / AE Stacks Fine-Tuned by BP
   5.16 2006/7: Improved CNNs / GPU-CNNs / BP for MPCNNs / LSTM Stacks
   5.17 2009: First Official Competitions Won by RNNs, and with MPCNNs
   5.18 2010: Plain Backprop (+ Distortions) on GPU Breaks MNIST Record
   5.19 2011: MPCNNs on GPU Achieve Superhuman Vision Performance
   5.20 2011: Hessian-Free Optimization for RNNs
   5.21 2012: First Contests Won on ImageNet, Object Detection, Segmentation
   5.22 2013-: More Contests and Benchmark Records
   5.23 Currently Successful Techniques: LSTM RNNs and GPU-MPCNNs
   5.24 Recent Tricks for Improving SL Deep NNs (Compare Sec. 5.6.2, 5.6.3)
   5.25 Consequences for Neuroscience
   5.26 DL with Spiking Neurons?

6 DL in FNNs and RNNs for Reinforcement Learning (RL)
   6.1 RL Through NN World Models Yields RNNs With Deep CAPs
   6.2 Deep FNNs for Traditional RL and Markov Decision Processes (MDPs)
   6.3 Deep RL RNNs for Partially Observable MDPs (POMDPs)
   6.4 RL Facilitated by Deep UL in FNNs and RNNs
   6.5 Deep Hierarchical RL (HRL) and Subgoal Learning with FNNs and RNNs
   6.6 Deep RL by Direct NN Search / Policy Gradients / Evolution
   6.7 Deep RL by Indirect Policy Search / Compressed NN Search
   6.8 Universal RL

7 Conclusion and Outlook

8 Acknowledgments

Abbreviations in Alphabetical Order


AE: Autoencoder
AI: Artificial Intelligence
ANN: Artificial Neural Network
BFGS: Broyden-Fletcher-Goldfarb-Shanno
BM: Boltzmann Machine
BNN: Biological Neural Network
BP: Backpropagation
BRNN: Bi-directional Recurrent Neural Network
CAP: Credit Assignment Path
CEC: Constant Error Carousel
CFL: Context Free Language
CMA-ES: Covariance Matrix Estimation ES
CNN: Convolutional Neural Network
CoSyNE: Co-Synaptic Neuro-Evolution
CSL: Context Sensitive Language
CTC: Connectionist Temporal Classification
DBN: Deep Belief Network
DCT: Discrete Cosine Transform
DL: Deep Learning
DP: Dynamic Programming
DS: Direct Policy Search
EA: Evolutionary Algorithm
EM: Expectation Maximization
ES: Evolution Strategy
FMS: Flat Minimum Search
FNN: Feedforward Neural Network
FSA: Finite State Automaton
GMDH: Group Method of Data Handling
GOFAI: Good Old-Fashioned AI
GP: Genetic Programming
GPU: Graphics Processing Unit
GPU-MPCNN: GPU-Based MPCNN
HMAX: Hierarchical Model "and X"
HMM: Hidden Markov Model
HRL: Hierarchical Reinforcement Learning
HTM: Hierarchical Temporal Memory
LSTM: Long Short-Term Memory (RNN)
MDL: Minimum Description Length
MDP: Markov Decision Process
MNIST: Mixed National Institute of Standards and Technology Database
MP: Max-Pooling
MPCNN: Max-Pooling CNN
NE: NeuroEvolution
NEAT: NE of Augmenting Topologies
NES: Natural Evolution Strategies
NFQ: Neural Fitted Q-Learning
NN: Neural Network
OCR: Optical Character Recognition
PCC: Potential Causal Connection
PDCC: Potential Direct Causal Connection
PM: Predictability Minimization
POMDP: Partially Observable MDP
RAAM: Recursive Auto-Associative Memory
RBM: Restricted Boltzmann Machine
ReLU: Rectified Linear Unit
RL: Reinforcement Learning
RNN: Recurrent Neural Network
R-prop: Resilient Backpropagation
SL: Supervised Learning
SLIM NN: Self-Delimiting Neural Network
SOTA: Self-Organising Tree Algorithm
SVM: Support Vector Machine
TDNN: Time-Delay Neural Network
TIMIT: TI/SRI/MIT Acoustic-Phonetic Continuous Speech Corpus
UL: Unsupervised Learning
WTA: Winner-Take-All

1 Introduction to Deep Learning (DL) in Neural Networks (NNs)
Which modifiable components of a learning system are responsible for its success or failure? What
changes to them improve performance? This has been called the fundamental credit assignment prob-
lem (?). There are general credit assignment methods for universal problem solvers that are time-
optimal in various theoretical senses (Sec. 6.8). The present survey, however, will focus on the nar-
rower, but now commercially important, subfield of Deep Learning (DL) in Artificial Neural Networks
(NNs).
A standard neural network (NN) consists of many simple, connected processors called neurons,
each producing a sequence of real-valued activations. Input neurons get activated through sensors per-
ceiving the environment, other neurons get activated through weighted connections from previously
active neurons (details in Sec. 2). Some neurons may influence the environment by triggering actions.
Learning or credit assignment is about finding weights that make the NN exhibit desired behavior,
such as driving a car. Depending on the problem and how the neurons are connected, such behavior
may require long causal chains of computational stages (Sec. 3), where each stage transforms (of-
ten in a non-linear way) the aggregate activation of the network. Deep Learning is about accurately
assigning credit across many such stages.
Shallow NN-like models with few such stages have been around for many decades if not centuries
(Sec. 5.1). Models with several successive nonlinear layers of neurons date back at least to the 1960s
(Sec. 5.3) and 1970s (Sec. 5.5). An efficient gradient descent method for teacher-based Supervised
Learning (SL) in discrete, differentiable networks of arbitrary depth called backpropagation (BP)
was developed in the 1960s and 1970s, and applied to NNs in 1981 (Sec. 5.5). BP-based training
of deep NNs with many layers, however, had been found to be difficult in practice by the late 1980s
(Sec. 5.6), and had become an explicit research subject by the early 1990s (Sec. 5.9). DL became
practically feasible to some extent through the help of Unsupervised Learning (UL), e.g., Sec. 5.10
(1991), Sec. 5.15 (2006). The 1990s and 2000s also saw many improvements of purely supervised
DL (Sec. 5). In the new millennium, deep NNs have finally attracted widespread attention, mainly
by outperforming alternative machine learning methods such as kernel machines (??) in numerous
important applications. In fact, since 2009, supervised deep NNs have won many official international
pattern recognition competitions (e.g., Sec. 5.17, 5.19, 5.21, 5.22), achieving the first superhuman
visual pattern recognition results in limited domains (Sec. 5.19, 2011). Deep NNs also have become
relevant for the more general field of Reinforcement Learning (RL) where there is no supervising
teacher (Sec. 6).
Both feedforward (acyclic) NNs (FNNs) and recurrent (cyclic) NNs (RNNs) have won contests
(Sec. 5.12, 5.14, 5.17, 5.19, 5.21, 5.22). In a sense, RNNs are the deepest of all NNs (Sec. 3)—they
are general computers more powerful than FNNs, and can in principle create and process memories
of arbitrary sequences of input patterns (e.g., ??). Unlike traditional methods for automatic sequential
program synthesis (e.g., ????), RNNs can learn programs that mix sequential and parallel information
processing in a natural and efficient way, exploiting the massive parallelism viewed as crucial for
sustaining the rapid decline of computation cost observed over the past 75 years.
The rest of this paper is structured as follows. Sec. 2 introduces a compact, event-oriented notation
that is simple yet general enough to accommodate both FNNs and RNNs. Sec. 3 introduces the
concept of Credit Assignment Paths (CAPs) to measure whether learning in a given NN application is
of the deep or shallow type. Sec. 4 lists recurring themes of DL in SL, UL, and RL. Sec. 5 focuses
on SL and UL, and on how UL can facilitate SL, although pure SL has become dominant in recent
competitions (Sec. 5.17–5.23). Sec. 5 is arranged in a historical timeline format with subsections on
important inspirations and technical contributions. Sec. 6 on deep RL discusses traditional Dynamic
Programming (DP)-based RL combined with gradient-based search techniques for SL or UL in deep
NNs, as well as general methods for direct and indirect search in the weight space of deep FNNs and

4
RNNs, including successful policy gradient and evolutionary methods.

2 Event-Oriented Notation for Activation Spreading in NNs


Throughout this paper, let i, j, k, t, p, q, r denote positive integer variables assuming ranges implicit
in the given contexts. Let n, m, T denote positive integer constants.
An NN’s topology may change over time (e.g., Sec. 5.3, 5.6.3). At any given moment, it can
be described as a finite subset of units (or nodes or neurons) N = {u_1, u_2, . . .} and a finite set
H ⊆ N × N of directed edges or connections between nodes. FNNs are acyclic graphs, RNNs cyclic.
The first (input) layer is the set of input units, a subset of N . In FNNs, the k-th layer (k > 1) is the set
of all nodes u ∈ N such that there is an edge path of length k − 1 (but no longer path) between some
input unit and u. There may be shortcut connections between distant layers. In sequence-processing,
fully connected RNNs, all units have connections to all non-input units.
The NN’s behavior or program is determined by a set of real-valued, possibly modifiable, param-
eters or weights wi (i = 1, . . . , n). We now focus on a single finite episode or epoch of information
processing and activation spreading, without learning through weight changes. The following slightly
unconventional notation is designed to compactly describe what is happening during the runtime of
the system.
During an episode, there is a partially causal sequence xt (t = 1, . . . , T ) of real values that I call
events. Each xt is either an input set by the environment, or the activation of a unit that may directly
depend on other xk (k < t) through a current NN topology-dependent set int of indices k representing
incoming causal connections or links. Let the function v encode topology information and map such
event index pairs (k, t) to weight indices.
For example, in the non-input case we may have x_t = f_t(net_t) with real-valued
net_t = ∑_{k∈in_t} x_k w_{v(k,t)} (additive case) or net_t = ∏_{k∈in_t} x_k w_{v(k,t)}
(multiplicative case), where f_t is a typically nonlinear real-valued activation function such as tanh.
In many recent competition-winning NNs (Sec. 5.19, 5.21, 5.22) there also are events of the type
x_t = max_{k∈in_t}(x_k); some network types may also use complex polynomial activation functions
(Sec. 5.3). x_t may directly affect certain x_k (k > t) through outgoing connections or links represented
through a current set out_t of indices k with t ∈ in_k. Some of the non-input events are called output
events.
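
To make this concrete, here is a minimal Python sketch (mine, not the paper's; names such as forward, in_t, v are illustrative) of activation spreading through one episode in the additive case:

import math

def forward(T, in_t, v, w, inputs, f=math.tanh):
    # in_t[t]: indices k < t with incoming links to event t
    # v[(k, t)]: weight index of link (k, t); w: shared weights
    # inputs: environment-given values of the input events
    x = [0.0] * (T + 1)              # events x_1 .. x_T (index 0 unused)
    for t in range(1, T + 1):
        if t in inputs:              # input event set by the environment
            x[t] = inputs[t]
        else:                        # non-input event: x_t = f_t(net_t)
            net = sum(x[k] * w[v[(k, t)]] for k in in_t[t])
            x[t] = f(net)
    return x

# tiny episode: two input events feeding one non-input event
x = forward(T=3, in_t={3: [1, 2]}, v={(1, 3): 0, (2, 3): 1},
            w=[0.5, -0.3], inputs={1: 1.0, 2: 2.0})
print(x[3])                          # tanh(1.0*0.5 + 2.0*(-0.3))

Weight sharing across space and/or time (discussed next) corresponds to v mapping many event index pairs to the same weight index.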
Note that many of the xt may refer to different, time-varying activations of the same unit in
sequence-processing RNNs (e.g., ?, “unfolding in time”), or also in FNNs sequentially exposed to
time-varying input patterns of a large training set encoded as input events. During an episode, the
same weight may get reused over and over again in topology-dependent ways, e.g., in RNNs, or in
convolutional NNs (Sec. 5.4, 5.8). I call this weight sharing across space and/or time. Weight sharing
may greatly reduce the NN’s descriptive complexity, which is the number of bits of information
required to describe the NN (Sec. 4.4).
In Supervised Learning (SL), certain NN output events xt may be associated with teacher-given,
real-valued labels or targets d_t yielding errors e_t, e.g., e_t = 1/2 (x_t − d_t)^2. A typical goal of supervised
NN training is to find weights that yield episodes with small total error E, the sum of all such et . The
hope is that the NN will generalize well in later episodes, causing only small errors on previously
unseen sequences of input events. Many alternative error functions for SL and UL are possible.
SL assumes that input events are independent of earlier output events (which may affect the en-
vironment through actions causing subsequent perceptions). This assumption does not hold in the
broader fields of Sequential Decision Making and Reinforcement Learning (RL) (????) (Sec. 6). In
RL, some of the input events may encode real-valued reward signals given by the environment, and a
typical goal is to find weights that yield episodes with a high sum of reward signals, through sequences
of appropriate output actions.

Sec. 5.5 will use the notation above to compactly describe a central algorithm of DL, namely,
backpropagation (BP) for supervised weight-sharing FNNs and RNNs. (FNNs may be viewed as
RNNs with certain fixed zero weights.) Sec. 6 will address the more general RL case.

3 Depth of Credit Assignment Paths (CAPs) and of Problems


To measure whether credit assignment in a given NN application is of the deep or shallow type, I
introduce the concept of Credit Assignment Paths or CAPs, which are chains of possibly causal links
between the events of Sec. 2, e.g., from input through hidden to output layers in FNNs, or through
transformations over time in RNNs.
Let us first focus on SL. Consider two events xp and xq (1 ≤ p < q ≤ T ). Depending on the
application, they may have a Potential Direct Causal Connection (PDCC) expressed by the Boolean
predicate pdcc(p, q), which is true if and only if p ∈ inq . Then the 2-element list (p, q) is defined to
be a CAP (a minimal one) from p to q. A learning algorithm may be allowed to change wv(p,q) to
improve performance in future episodes.
More general, possibly indirect, Potential Causal Connections (PCC) are expressed by the re-
cursively defined Boolean predicate pcc(p, q), which in the SL case is true only if pdcc(p, q), or if
pcc(p, k) for some k and pdcc(k, q). In the latter case, appending q to any CAP from p to k yields a
CAP from p to q (this is a recursive definition, too). The set of such CAPs may be large but is finite.
Note that the same weight may affect many different PDCCs between successive events listed by a
given CAP, e.g., in the case of RNNs, or weight-sharing FNNs.
Suppose a CAP has the form (. . . , k, t, . . . , q), where k and t (possibly t = q) are the first successive
elements with modifiable w_{v(k,t)}. Then the length of the suffix list (t, . . . , q) is called the CAP's
depth (which is 0 if there are no modifiable links at all). This depth limits how far backwards credit
assignment can move down the causal chain to find a modifiable weight. (Footnote: an alternative
would be to count only modifiable links when measuring depth; in many typical NN applications this
would not make a difference, but in some it would, e.g., Sec. 6.1.)
Suppose an episode and its event sequence x1 , . . . , xT satisfy a computable criterion used to
decide whether a given problem has been solved (e.g., total error E below some threshold). Then
the set of used weights is called a solution to the problem, and the depth of the deepest CAP within
the sequence is called the solution depth. There may be other solutions (yielding different event
sequences) with different depths. Given some fixed NN topology, the smallest depth of any solution
is called the problem depth.
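
The following Python sketch (mine; names are illustrative) enumerates the CAPs between two events and computes their depth exactly as defined above:

def caps(p, q, in_t):
    # all CAPs from p to q: chains of events linked by PDCCs (k in in_t[t])
    if p == q:
        return [[q]]
    return [cap + [q] for k in in_t.get(q, []) for cap in caps(p, k, in_t)]

def cap_depth(cap, v, modifiable):
    # depth = length of the suffix (t, ..., q) starting at the first
    # successive pair (k, t) whose weight w_v(k,t) is modifiable
    for i in range(len(cap) - 1):
        if v[(cap[i], cap[i + 1])] in modifiable:
            return len(cap) - (i + 1)
    return 0                        # no modifiable links at all

in_t = {2: [1], 3: [2]}             # event chain 1 -> 2 -> 3
v = {(1, 2): 0, (2, 3): 1}
for cap in caps(1, 3, in_t):
    print(cap, cap_depth(cap, v, modifiable={0, 1}))   # [1, 2, 3] 2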
Sometimes we also speak of the depth of an architecture: SL FNNs with fixed topology imply a
problem-independent maximal problem depth bounded by the number of non-input layers. Certain
SL RNNs with fixed weights for all connections except those to output units (????) have a maximal
problem depth of 1, because only the final links in the corresponding CAPs are modifiable. In general,
however, RNNs may learn to solve problems of potentially unlimited depth.
Note that the definitions above are solely based on the depths of causal chains, and agnostic to the
temporal distance between events. For example, shallow FNNs perceiving large “time windows” of
input events may correctly classify long input sequences through appropriate output events, and thus
solve shallow problems involving long time lags between relevant events.
At which problem depth does Shallow Learning end, and Deep Learning begin? Discussions with
DL experts have not yet yielded a conclusive response to this question. Instead of committing myself
to a precise answer, let me just define for the purposes of this overview: problems of depth > 10
require Very Deep Learning.
The difficulty of a problem may have little to do with its depth. Some NNs can quickly learn
to solve certain deep problems, e.g., through random weight guessing (Sec. 5.9) or other types of
direct search (Sec. 6.6) or indirect search (Sec. 6.7) in weight space, or through training an NN first
on shallow problems whose solutions may then generalize to deep problems, or through collapsing
sequences of (non)linear operations into a single (non)linear operation (but see an analysis of non-
trivial aspects of deep linear networks, ?, Section B). In general, however, finding an NN that precisely
models a given training set is an NP-complete problem (??), also in the case of deep NNs (???);
compare a survey of negative results (?, Section 1).
Above we have focused on SL. In the more general case of RL in unknown environments, pcc(p, q)
is also true if xp is an output event and xq any later input event—any action may affect the environment
and thus any later perception. (In the real world, the environment may even influence non-input events
computed on a physical hardware entangled with the entire universe, but this is ignored here.) It is
possible to model and replace such unmodifiable environmental PCCs through a part of the NN that
has already learned to predict (through some of its units) input events (including reward signals) from
former input events and actions (Sec. 6.1). Its weights are frozen, but can help to assign credit to
other, still modifiable weights used to compute actions (Sec. 6.1). This approach may lead to very
deep CAPs though.
Some DL research is about automatically rephrasing problems such that their depth is reduced
(Sec. 4). In particular, sometimes UL is used to make SL problems less deep, e.g., Sec. 5.10. Often
Dynamic Programming (Sec. 4.1) is used to facilitate certain traditional RL problems, e.g., Sec. 6.2.
Sec. 5 focuses on CAPs for SL, Sec. 6 on the more complex case of RL.

4 Recurring Themes of Deep Learning


4.1 Dynamic Programming for Supervised/Reinforcement Learning (SL/RL)
One recurring theme of DL is Dynamic Programming (DP) (?), which can help to facilitate credit
assignment under certain assumptions. For example, in SL NNs, backpropagation itself can be viewed
as a DP-derived method (Sec. 5.5). In traditional RL based on strong Markovian assumptions, DP-
derived methods can help to greatly reduce problem depth (Sec. 6.2). DP algorithms are also essential
for systems that combine concepts of NNs and graphical models, such as Hidden Markov Models
(HMMs) (??) and Expectation Maximization (EM) (??), e.g., (???????????).

4.2 Unsupervised Learning (UL) Facilitating SL and RL


Another recurring theme is how UL can facilitate both SL (Sec. 5) and RL (Sec. 6). UL (Sec. 5.6.4)
is normally used to encode raw incoming data such as video or speech streams in a form that is more
convenient for subsequent goal-directed learning. In particular, codes that describe the original data in
a less redundant or more compact way can be fed into SL (Sec. 5.10, 5.15) or RL machines (Sec. 6.4),
whose search spaces may thus become smaller (and whose CAPs shallower) than those necessary for
dealing with the raw data. UL is closely connected to the topics of regularization and compression
(Sec. 4.4, 5.6.3).

4.3 Learning Hierarchical Representations Through Deep SL, UL, RL


Many methods of Good Old-Fashioned Artificial Intelligence (GOFAI) (?) as well as more recent
approaches to AI (?) and Machine Learning (?) learn hierarchies of more and more abstract data
representations. For example, certain methods of syntactic pattern recognition (?) such as grammar
induction discover hierarchies of formal rules to model observations. The partially (un)supervised
Automated Mathematician / EURISKO (??) continually learns concepts by combining previously
learnt concepts. Such hierarchical representation learning (???) is also a recurring theme of DL NNs
for SL (Sec. 5), UL-aided SL (Sec. 5.7, 5.10, 5.15), and hierarchical RL (Sec. 6.5). Often, abstract
hierarchical representations are natural by-products of data compression (Sec. 4.4), e.g., Sec. 5.10.

4.4 Occam’s Razor: Compression and Minimum Description Length (MDL)


Occam’s razor favors simple solutions over complex ones. Given some programming language, the
principle of Minimum Description Length (MDL) can be used to measure the complexity of a so-
lution candidate by the length of the shortest program that computes it (e.g., ??????????). Some
methods explicitly take into account program runtime (????); many consider only programs with
constant runtime, written in non-universal programming languages (e.g., ??). In the NN case, the
MDL principle suggests that low NN weight complexity corresponds to high NN probability in the
Bayesian view (e.g., ????), and to high generalization performance (e.g., ?), without overfitting the
training data. Many methods have been proposed for regularizing NNs, that is, searching for solution-
computing but simple, low-complexity SL NNs (Sec. 5.6.3) and RL NNs (Sec. 6.7). This is closely
related to certain UL methods (Sec. 4.2, 5.6.4).

4.5 Fast Graphics Processing Units (GPUs) for DL in NNs


While the previous millennium saw several attempts at creating fast NN-specific hardware (e.g.,
???????), and at exploiting standard hardware (e.g., ???), the new millennium brought a DL break-
through in the form of cheap, multi-processor graphics cards or GPUs. GPUs are widely used for video
games, a huge and competitive market that has driven down hardware prices. GPUs excel at the fast
matrix and vector multiplications required not only for convincing virtual realities but also for NN
training, where they can speed up learning by a factor of 50 and more. Some of the GPU-based FNN
implementations (Sec. 5.16–5.19) have greatly contributed to recent successes in contests for pattern
recognition (Sec. 5.19–5.22), image segmentation (Sec. 5.21), and object detection (Sec. 5.21–5.22).

5 Supervised NNs, Some Helped by Unsupervised NNs


The main focus of current practical applications is on Supervised Learning (SL), which has domi-
nated recent pattern recognition contests (Sec. 5.17–5.23). Several methods, however, use additional
Unsupervised Learning (UL) to facilitate SL (Sec. 5.7, 5.10, 5.15). It does make sense to treat SL and
UL in the same section: often gradient-based methods, such as BP (Sec. 5.5.1), are used to optimize
objective functions of both UL and SL, and the boundary between SL and UL may blur, for example,
when it comes to time series prediction and sequence classification, e.g., Sec. 5.10, 5.12.
A historical timeline format will help to arrange subsections on important inspirations and techni-
cal contributions (although such a subsection may span a time interval of many years). Sec. 5.1 briefly
mentions early, shallow NN models since the 1940s (and 1800s), Sec. 5.2 additional early neurobio-
logical inspiration relevant for modern Deep Learning (DL). Sec. 5.3 is about GMDH networks (since
1965), to my knowledge the first (feedforward) DL systems. Sec. 5.4 is about the relatively deep
Neocognitron NN (1979) which is very similar to certain modern deep FNN architectures, as it com-
bines convolutional NNs (CNNs), weight pattern replication, and subsampling mechanisms. Sec. 5.5
uses the notation of Sec. 2 to compactly describe a central algorithm of DL, namely, backpropagation
(BP) for supervised weight-sharing FNNs and RNNs. It also summarizes the history of BP 1960-1981
and beyond. Sec. 5.6 describes problems encountered in the late 1980s with BP for deep NNs, and
mentions several ideas from the previous millennium to overcome them. Sec. 5.7 discusses a first hier-
archical stack (1987) of coupled UL-based Autoencoders (AEs)—this concept resurfaced in the new

8
millennium (Sec. 5.15). Sec. 5.8 is about applying BP to CNNs (1989), which is important for today’s
DL applications. Sec. 5.9 explains BP’s Fundamental DL Problem (of vanishing/exploding gradients)
discovered in 1991. Sec. 5.10 explains how a deep RNN stack of 1991 (the History Compressor) pre-
trained by UL helped to solve previously unlearnable DL benchmarks requiring Credit Assignment
Paths (CAPs, Sec. 3) of depth 1000 and more. Sec. 5.11 discusses a particular winner-take-all (WTA)
method called Max-Pooling (MP, 1992) widely used in today’s deep FNNs. Sec. 5.12 mentions a
first important contest won by SL NNs in 1994. Sec. 5.13 describes a purely supervised DL RNN
(Long Short-Term Memory, LSTM, 1995) for problems of depth 1000 and more. Sec. 5.14 mentions
an early contest of 2003 won by an ensemble of shallow FNNs, as well as good pattern recognition
results with CNNs and deep FNNs and LSTM RNNs (2003). Sec. 5.15 is mostly about Deep Belief
Networks (DBNs, 2006) and related stacks of Autoencoders (AEs, Sec. 5.7), both pre-trained by UL to
facilitate subsequent BP-based SL (compare Sec. 5.6.1, 5.10). Sec. 5.16 mentions the first SL-based
GPU-CNNs (2006), BP-trained MPCNNs (2007), and LSTM stacks (2007). Sec. 5.17–5.22 focus on
official competitions with secret test sets won by (mostly purely supervised) deep NNs since 2009,
in sequence recognition, image classification, image segmentation, and object detection. Many RNN
results depended on LSTM (Sec. 5.13); many FNN results depended on GPU-based FNN code de-
veloped since 2004 (Sec. 5.16, 5.17, 5.18, 5.19), in particular, GPU-MPCNNs (Sec. 5.19). Sec. 5.24
mentions recent tricks for improving DL in NNs, many of them closely related to earlier tricks from
the previous millennium (e.g., Sec. 5.6.2, 5.6.3). Sec. 5.25 discusses how artificial NNs can help to
understand biological NNs; Sec. 5.26 addresses the possibility of DL in NNs with spiking neurons.

5.1 Early NNs Since the 1940s (and the 1800s)


Early NN architectures (?) did not learn. The first ideas about UL were published a few years later (?).
The following decades brought simple NNs trained by SL (e.g., ????) and UL (e.g., ????), as well as
closely related associative memories (e.g., ??).
In a sense NNs have been around even longer, since early supervised NNs were essentially variants
of linear regression methods going back at least to the early 1800s (e.g., ???); Gauss also refers to his
work of 1795. Early NNs had a maximal CAP depth of 1 (Sec. 3).

5.2 Around 1960: Visual Cortex Provides Inspiration for DL (Sec. 5.4, 5.11)
Simple cells and complex cells were found in the cat’s visual cortex (e.g., ??). These cells fire in
response to certain properties of visual sensory inputs, such as the orientation of edges. Complex
cells exhibit more spatial invariance than simple cells. This inspired later deep NN architectures
(Sec. 5.4, 5.11) used in certain modern award-winning Deep Learners (Sec. 5.19–5.22).

5.3 1965: Deep Networks Based on the Group Method of Data Handling
Networks trained by the Group Method of Data Handling (GMDH) (????) were perhaps the first DL
systems of the Feedforward Multilayer Perceptron type, although there was earlier work on NNs with
a single hidden layer (e.g., ??). The units of GMDH nets may have polynomial activation functions
implementing Kolmogorov-Gabor polynomials (more general than other widely used NN activation
functions, Sec. 2). Given a training set, layers are incrementally grown and trained by regression
analysis (e.g., ???) (Sec. 5.1), then pruned with the help of a separate validation set (using to-
day’s terminology), where Decision Regularisation is used to weed out superfluous units (compare
Sec. 5.6.3). The numbers of layers and units per layer can be learned in problem-dependent fashion.
To my knowledge, this was the first example of open-ended, hierarchical representation learning in
NNs (Sec. 4.3). A paper of 1971 already described a deep GMDH network with 8 layers (?). There
have been numerous applications of GMDH-style nets, e.g. (????????).
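
A minimal sketch of a single GMDH-style unit (mine; the layer-growing and validation-based pruning loop is omitted): a quadratic Kolmogorov-Gabor polynomial of two inputs is fitted to the target by ordinary least squares.

import numpy as np

def fit_gmdh_unit(x1, x2, target):
    # features of a degree-2 Kolmogorov-Gabor polynomial in two variables
    A = np.stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coeffs

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=200), rng.normal(size=200)
target = 1.0 + 2.0 * x1 * x2 + 0.5 * x2**2          # synthetic training data
print(np.round(fit_gmdh_unit(x1, x2, target), 2))   # ~ [1. 0. 0. 2. 0. 0.5]

In the full method, many such units are fitted per layer, the best ones (judged on a separate validation set) are kept, and their outputs become candidate inputs of the next layer.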

5.4 1979: Convolution + Weight Replication + Subsampling (Neocognitron)


Apart from deep GMDH networks (Sec. 5.3), the Neocognitron (???) was perhaps the first artificial
NN that deserved the attribute deep, and the first to incorporate the neurophysiological insights of
Sec. 5.2. It introduced convolutional NNs (today often called CNNs or convnets), where the (typically
rectangular) receptive field of a convolutional unit with given weight vector (a filter) is shifted step
by step across a 2-dimensional array of input values, such as the pixels of an image (usually there
are several such filters). The resulting 2D array of subsequent activation events of this unit can then
provide inputs to higher-level units, and so on. Due to massive weight replication (Sec. 2), relatively
few parameters (Sec. 4.4) may be necessary to describe the behavior of such a convolutional layer.
Subsampling or downsampling layers consist of units whose fixed-weight connections originate
from physical neighbours in the convolutional layers below. Subsampling units become active if at
least one of their inputs is active; their responses are insensitive to certain small image shifts (compare
Sec. 5.2).
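
A minimal sketch of these two operations (mine; filter values and sizes are illustrative, and spatial averaging is used as one simple fixed-weight downsampling rule):

import numpy as np

def convolve2d(image, filt):
    # shift the filter's receptive field step by step across the image
    H, W = image.shape
    h, w = filt.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+h, j:j+w] * filt)
    return out

def subsample(fmap, size=2):
    # average over non-overlapping size x size neighbourhoods
    H, W = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:H*size, :W*size].reshape(H, size, W, size).mean(axis=(1, 3))

image = np.random.default_rng(1).random((8, 8))
filt = np.array([[1.0, 0.0], [0.0, -1.0]])    # one shared weight vector (filter)
print(subsample(convolve2d(image, filt)).shape)   # (3, 3)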
The Neocognitron is very similar to the architecture of modern, contest-winning, purely super-
vised, feedforward, gradient-based Deep Learners with alternating convolutional and downsampling
layers (e.g., Sec. 5.19–5.22). Fukushima, however, did not set the weights by supervised backpropaga-
tion (Sec. 5.5, 5.8), but by local, WTA-based unsupervised learning rules (e.g., ?), or by pre-wiring. In
that sense he did not care for the DL problem (Sec. 5.9), although his architecture was comparatively
deep indeed. For downsampling purposes he used Spatial Averaging (??) instead of Max-Pooling
(MP, Sec. 5.11), currently a particularly convenient and popular WTA mechanism. Today’s DL com-
binations of CNNs and MP and BP also profit a lot from later work (e.g., Sec. 5.8, 5.16, 5.19).

5.5 1960-1981 and Beyond: Development of Backpropagation (BP) for NNs


The minimisation of errors through gradient descent (?) in the parameter space of complex, non-
linear, differentiable (?), multi-stage, NN-related systems has been discussed at least since the early
1960s (e.g., ?????????), initially within the framework of Euler-Lagrange equations in the Calculus
of Variations (e.g., ?).
Steepest descent in the weight space of such systems can be performed (???) by iterating the
chain rule (??) à la Dynamic Programming (DP) (?). A simplified derivation of this backpropagation
method uses the chain rule only (?).
The systems of the 1960s were already efficient in the DP sense. However, they backpropagated
derivative information through standard Jacobian matrix calculations from one “layer” to the previous
one, without explicitly addressing either direct links across several layers or potential additional effi-
ciency gains due to network sparsity (but perhaps such enhancements seemed obvious to the authors).
Given all the prior work on learning in multilayer NN-like systems (see also Sec. 5.3 on deep non-
linear nets since 1965), it seems surprising in hindsight that a book (?) on the limitations of simple
linear perceptrons with a single layer (Sec. 5.1) discouraged some researchers from further studying
NNs.
Explicit, efficient error backpropagation (BP) in arbitrary, discrete, possibly sparsely connected,
NN-like networks apparently was first described in a 1970 master’s thesis (??), albeit without refer-
ence to NNs. BP is also known as the reverse mode of automatic differentiation (?), where the costs of
forward activation spreading essentially equal the costs of backward derivative calculation. See early
FORTRAN code (?) and closely related work (?).

Efficient BP was soon explicitly used to minimize cost functions by adapting control parameters
(weights) (?). Compare some preliminary, NN-specific discussion (?, section 5.5.1), a method for
multilayer threshold NNs (?), and a computer program for automatically deriving and implementing
BP for given differentiable systems (?).
To my knowledge, the first NN-specific application of efficient BP as above was described in
1981 (??). Related work was published several years later (???). A paper of 1986 significantly
contributed to the popularisation of BP for NNs (?), experimentally demonstrating the emergence of
useful internal representations in hidden layers. See generalisations for sequence-processing recurrent
NNs (e.g., ???????????????), also for equilibrium RNNs (??) with stationary inputs.

5.5.1 BP for Weight-Sharing Feedforward NNs (FNNs) and Recurrent NNs (RNNs)
Using the notation of Sec. 2 for weight-sharing FNNs or RNNs, after an episode of activation spreading
through differentiable f_t, a single iteration of gradient descent through BP computes changes of
all w_i in proportion to ∂E/∂w_i = ∑_t (∂E/∂net_t)(∂net_t/∂w_i) as in Algorithm 5.5.1 (for the
additive case), where each weight w_i is associated with a real-valued variable Δ_i initialized by 0.

Alg. 5.5.1: One iteration of BP for weight-sharing FNNs or RNNs

for t = T, . . . , 1 do
    to compute ∂E/∂net_t, initialize real-valued error signal variable δ_t by 0;
    if x_t is an input event then continue with next iteration;
    if there is an error e_t then δ_t := x_t − d_t;
    add to δ_t the value ∑_{k∈out_t} w_{v(t,k)} δ_k; (this is the elegant and efficient recursive
        chain rule application collecting impacts of net_t on future events)
    multiply δ_t by f_t'(net_t);
    for all k ∈ in_t add to Δ_{v(k,t)} the value x_k δ_t
end for
change each w_i in proportion to Δ_i and a small real-valued learning rate

The computational costs of the backward (BP) pass are essentially those of the forward pass
(Sec. 2). Forward and backward passes are re-iterated until sufficient performance is reached.
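
For readers who prefer running code, here is an illustrative Python rendering of Alg. 5.5.1 (additive case, f_t = tanh); the data structures extend the forward-pass sketch of Sec. 2, and all names are mine:

import math

def dtanh(net):
    return 1.0 - math.tanh(net) ** 2          # f_t'(net_t) for f_t = tanh

def bp_iteration(T, in_t, out_t, v, w, x, net, targets):
    delta = [0.0] * (T + 1)     # error signal variables delta_t, initialized by 0
    grad = [0.0] * len(w)       # one accumulator Delta_i per weight w_i
    for t in range(T, 0, -1):
        if t not in in_t:       # x_t is an input event: continue
            continue
        if t in targets:        # there is an error e_t
            delta[t] = x[t] - targets[t]
        # recursive chain rule: collect impacts of net_t on future events
        delta[t] += sum(w[v[(t, k)]] * delta[k] for k in out_t.get(t, []))
        delta[t] *= dtanh(net[t])
        for k in in_t[t]:
            grad[v[(k, t)]] += x[k] * delta[t]
    return grad                 # then change each w_i by -learning_rate * grad_i

Here x and net hold the event values and net_t values recorded during the forward pass, and out_t lists outgoing links (k ∈ out_t[t] iff t ∈ in_t[k]).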
As of 2014, this simple BP method is still the central learning algorithm for FNNs and RNNs. No-
tably, most contest-winning NNs up to 2014 (Sec. 5.12, 5.14, 5.17, 5.19, 5.21, 5.22) did not augment
supervised BP by some sort of unsupervised learning as discussed in Sec. 5.7, 5.10, 5.15.

5.6 Late 1980s-2000 and Beyond: Numerous Improvements of NNs


By the late 1980s it seemed clear that BP by itself (Sec. 5.5) was no panacea. Most FNN applications
focused on FNNs with few hidden layers. Additional hidden layers often did not seem to offer empiri-
cal benefits. Many practitioners found solace in a theorem (???) stating that an NN with a single layer
of enough hidden units can approximate any multivariate continuous function with arbitrary accuracy.
Likewise, most RNN applications did not require backpropagating errors far. Many researchers
helped their RNNs by first training them on shallow problems (Sec. 3) whose solutions then gener-
alized to deeper problems. In fact, some popular RNN algorithms restricted credit assignment to a
single step backwards (???), also in more recent studies (???).
Generally speaking, although BP allows for deep problems in principle, it seemed to work only
for shallow problems. The late 1980s and early 1990s saw a few ideas with a potential to overcome
this problem, which was fully understood only in 1991 (Sec. 5.9).

5.6.1 Ideas for Dealing with Long Time Lags and Deep CAPs
To deal with long time lags between relevant events, several sequence processing methods were pro-
posed, including Focused BP based on decay factors for activations of units in RNNs (??), Time-Delay
Neural Networks (TDNNs) (?) and their adaptive extension (?), Nonlinear AutoRegressive with eX-
ogenous inputs (NARX) RNNs (?), certain hierarchical RNNs (?) (compare Sec. 5.10, 1991), RL
economies in RNNs with WTA units and local learning rules (?), and other methods (e.g., ??????).
However, these algorithms either worked for shallow CAPs only, could not generalize to unseen CAP
depths, had problems with greatly varying time lags between relevant events, needed external fine
tuning of delay constants, or suffered from other problems. In fact, it turned out that certain simple
but deep benchmark problems used to evaluate such methods are more quickly solved by randomly
guessing RNN weights until a solution is found (?).
While the RNN methods above were designed for DL of temporal sequences, the Neural Heat Ex-
changer (?) consists of two parallel deep FNNs with opposite flow directions. Input patterns enter the
first FNN and are propagated “up”. Desired outputs (targets) enter the “opposite” FNN and are prop-
agated “down”. Using a local learning rule, each layer in each net tries to be similar (in information
content) to the preceding layer and to the adjacent layer of the other net. The input entering the first
net slowly “heats up” to become the target. The target entering the opposite net slowly “cools down”
to become the input. The Helmholtz Machine (??) may be viewed as an unsupervised (Sec. 5.6.4)
variant thereof (Peter Dayan, personal communication, 1994).
A hybrid approach (??) initializes a potentially deep FNN through a domain theory in propo-
sitional logic, which may be acquired through explanation-based learning (???). The NN is then
fine-tuned through BP (Sec. 5.5). The NN’s depth reflects the longest chain of reasoning in the origi-
nal set of logical rules. An extension of this approach (??) initializes an RNN by domain knowledge
expressed as a Finite State Automaton (FSA). BP-based fine-tuning has become important for later
DL systems pre-trained by UL, e.g., Sec. 5.10, 5.15.

5.6.2 Better BP Through Advanced Gradient Descent (Compare Sec. 5.24)


Numerous improvements of steepest descent through BP (Sec. 5.5) have been proposed. Least-squares
methods (Gauss-Newton, Levenberg-Marquardt) (?????) and quasi-Newton methods (Broyden-
Fletcher-Goldfarb-Shanno, BFGS) (????) are computationally too expensive for large NNs. Partial
BFGS (??) and conjugate gradient (??) as well as other methods (???) provide sometimes useful
fast alternatives. BP can be treated as a linear least-squares problem (?), where second-order gradient
information is passed back to preceding layers.
To speed up BP, momentum was introduced (?), ad-hoc constants were added to the slope of the
linearized activation function (?), or the nonlinearity of the slope was exaggerated (?).
Only the signs of the error derivatives are taken into account by the successful and widely used
BP variant R-prop (?) and the robust variation iRprop+ (?), which was also successfully applied to
RNNs.
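
As an illustration of the sign-based idea, here is a simplified sketch (mine; it omits details of the published R-prop and iRprop+ variants): per-weight step sizes grow while the gradient sign repeats and shrink when it flips, and only the sign of the derivative enters the update.

import numpy as np

def rprop_step(w, grad, prev_grad, step, up=1.2, down=0.5,
               step_min=1e-6, step_max=50.0):
    same = grad * prev_grad                   # >0: sign repeated, <0: sign flipped
    step = np.where(same > 0, np.minimum(step * up, step_max), step)
    step = np.where(same < 0, np.maximum(step * down, step_min), step)
    return w - np.sign(grad) * step, step     # gradient magnitude is ignored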
The local gradient can be normalized based on the NN architecture (?), through a diagonalized
Hessian approach (?), or related efficient methods (?).
Some algorithms for controlling BP step size adapt a global learning rate (?????), while others
compute individual learning rates for each weight (??). In online learning, where BP is applied
after each pattern presentation, the vario-η algorithm (?) sets each weight’s learning rate inversely
proportional to the empirical standard deviation of its local gradient, thus normalizing the stochastic
weight fluctuations. Compare a local online step size adaptation method for nonlinear NNs (?).
Many additional tricks for improving NNs have been described (e.g., ??). Compare Sec. 5.6.3 and
recent developments mentioned in Sec. 5.24.

5.6.3 Searching For Simple, Low-Complexity, Problem-Solving NNs (Sec. 5.24)
Many researchers used BP-like methods to search for “simple,” low-complexity NNs (Sec. 4.4) with
high generalization capability. Most approaches address the bias/variance dilemma (?) through strong
prior assumptions. For example, weight decay (???) encourages near-zero weights, by penalizing
large weights. In a Bayesian framework (?), weight decay can be derived (?) from Gaussian or
Laplacian weight priors (??); see also (?). An extension of this approach postulates that a distribution
of networks with many similar weights generated by Gaussian mixtures is “better” a priori (?).
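
For instance, penalizing large weights through an extra term (λ/2) ∑_i w_i^2 added to E simply adds λ w_i to each gradient component, shrinking weights towards zero; a minimal sketch (mine; constants are illustrative):

def weight_decay_update(w, grad, lr=0.01, lam=1e-4):
    # gradient descent on E + (lam/2) * sum_i w_i^2
    return [wi - lr * (gi + lam * wi) for wi, gi in zip(w, grad)]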
Often weight priors are implicit in additional penalty terms (?) or in methods based on valida-
tion sets (??????), Akaike’s information criterion and final prediction error (???), or generalized
prediction error (??). See also (???????). Similar priors (or biases towards simplicity) are implicit
in constructive and pruning algorithms, e.g., layer-by-layer sequential network construction (e.g.,
??????????????) (see also Sec. 5.3, 5.11), input pruning (??), unit pruning (e.g., ?????), weight
pruning, e.g., optimal brain damage (?), and optimal brain surgeon (?).
A very general but not always practical approach for discovering low-complexity SL NNs or
RL NNs searches among weight matrix-computing programs written in a universal programming
language, with a bias towards fast and short programs (?) (Sec. 6.7).
Flat Minimum Search (FMS) (??) searches for a “flat” minimum of the error function: a large
connected region in weight space where error is low and remains approximately constant, that is,
few bits of information are required to describe low-precision weights with high variance. Compare
perturbation tolerance conditions (????????). An MDL-based, Bayesian argument suggests that flat
minima correspond to “simple” NNs and low expected overfitting. Compare Sec. 5.6.4 and more
recent developments mentioned in Sec. 5.24.

5.6.4 Potential Benefits of UL for SL (Compare Sec. 5.7, 5.10, 5.15)


The notation of Sec. 2 introduced teacher-given labels dt . Many papers of the previ-
ous millennium, however, were about unsupervised learning (UL) without a teacher (e.g.,
????????????????????????????); see also post-2000 work (e.g., ????).
Many UL methods are designed to maximize entropy-related, information-theoretic (???) objec-
tives (e.g., ???????????????).
Many do this to uncover and disentangle hidden underlying sources of signals (e.g.,
?????????????).
Many UL methods automatically and robustly generate distributed, sparse representations of input
patterns (??????) through well-known feature detectors (e.g., ??), such as off-center-on-surround-like
structures, as well as orientation sensitive edge detectors and Gabor filters (?). They extract simple
features related to those observed in early visual pre-processing stages of biological systems (e.g.,
??).
UL can also serve to extract invariant features from different data items (e.g., ?) through coupled
NNs observing two different inputs (?), also called Siamese NNs (e.g., ????).
UL can help to encode input data in a form advantageous for further processing. In the context of
DL, one important goal of UL is redundancy reduction. Ideally, given an ensemble of input patterns,
redundancy reduction through a deep NN will create a factorial code (a code with statistically inde-
pendent components) of the ensemble (??), to disentangle the unknown factors of variation (compare
?). Such codes may be sparse and can be advantageous for (1) data compression, (2) speeding up
subsequent BP (?), (3) trivialising the task of subsequent naive yet optimal Bayes classifiers (?).
Most early UL FNNs had a single layer. Methods for deeper UL FNNs include hierarchical
(Sec. 4.3) self-organizing Kohonen maps (e.g., ?????), hierarchical Gaussian potential function net-
works (?), layer-wise UL of feature hierarchies fed into SL classifiers (??), the Self-Organising Tree
Algorithm (SOTA) (?), and nonlinear Autoencoders (AEs) with more than 3 (e.g., 5) layers (???).
Such AE NNs (?) can be trained to map input patterns to themselves, for example, by compactly
encoding them through activations of units of a narrow bottleneck hidden layer. Certain nonlinear
AEs suffer from certain limitations (?).
LOCOCODE (?) uses FMS (Sec. 5.6.3) to find low-complexity AEs with low-precision weights
describable by few bits of information, often producing sparse or factorial codes. Predictability Mini-
mization (PM) (?) searches for factorial codes through nonlinear feature detectors that fight nonlinear
predictors, trying to become both as informative and as unpredictable as possible. PM-based UL was
applied not only to FNNs but also to RNNs (e.g., ??). Compare Sec. 5.10 on UL-based RNN stacks
(1991), as well as later UL RNNs (e.g., ??).

5.7 1987: UL Through Autoencoder (AE) Hierarchies (Compare Sec. 5.15)


Perhaps the first work to study potential benefits of UL-based pre-training was published in 1987.
It proposed unsupervised AE hierarchies (?), closely related to certain post-2000 feedforward Deep
Learners based on UL (Sec. 5.15). The lowest-level AE NN with a single hidden layer is trained to
map input patterns to themselves. Its hidden layer codes are then fed into a higher-level AE of the
same type, and so on. The hope is that the codes in the hidden AE layers have properties that facilitate
subsequent learning. In one experiment, a particular AE-specific learning algorithm (different from
traditional BP of Sec. 5.5.1) was used to learn a mapping in an AE stack pre-trained by this type of
UL (?). This was faster than learning an equivalent mapping by BP through a single deeper AE with-
out pre-training. On the other hand, the task did not really require a deep AE, that is, the benefits of UL
were not that obvious from this experiment. Compare an early survey (?) and the somewhat related
Recursive Auto-Associative Memory (RAAM) (???), originally used to encode sequential linguistic
structures of arbitrary size through a fixed number of hidden units. More recently, RAAMs were also
used as unsupervised pre-processors to facilitate deep credit assignment for RL (?) (Sec. 6.4).
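
A minimal sketch of the greedy stacking idea (mine; linear units and plain gradient descent stand in for the original AE-specific learning rule):

import numpy as np

def train_ae(X, n_hidden, lr=0.01, steps=2000, seed=0):
    # one-hidden-layer AE trained to map input patterns to themselves
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(X.shape[1], n_hidden))   # encoder
    V = rng.normal(scale=0.1, size=(n_hidden, X.shape[1]))   # decoder
    for _ in range(steps):
        H = X @ W                        # bottleneck hidden codes
        err = H @ V - X                  # reconstruction error
        V -= lr * H.T @ err / len(X)
        W -= lr * X.T @ (err @ V.T) / len(X)
    return W

X = np.random.default_rng(1).normal(size=(100, 8))
codes = X
for n_hidden in (6, 4):                  # train the lower AE first, then stack
    codes = codes @ train_ae(codes, n_hidden)   # hidden codes feed the next level
print(codes.shape)                       # (100, 4)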
In principle, many UL methods (Sec. 5.6.4) could be stacked like the AEs above, the history-
compressing RNNs of Sec. 5.10, the Restricted Boltzmann Machines (RBMs) of Sec. 5.15, or hierar-
chical Kohonen nets (Sec. 5.6.4), to facilitate subsequent SL. Compare Stacked Generalization (??),
and FNNs that profit from pre-training by competitive UL (e.g., ?) prior to BP-based fine-tuning (?).
See also more recent methods using UL to improve subsequent SL (e.g., ???).

5.8 1989: BP for Convolutional NNs (CNNs, Sec. 5.4)


In 1989, backpropagation (Sec. 5.5) was applied (???) to Neocognitron-like, weight-sharing, convo-
lutional neural layers (Sec. 5.4) with adaptive connections. This combination, augmented by Max-
Pooling (MP, Sec. 5.11, 5.16), and sped up on graphics cards (Sec. 5.19), has become an essential
ingredient of many modern, competition-winning, feedforward, visual Deep Learners (Sec. 5.19–
5.23). This work also introduced the MNIST data set of handwritten digits (?), which over time has
become perhaps the most famous benchmark of Machine Learning. CNNs helped to achieve good
performance on MNIST (?) (CAP depth 5) and on fingerprint recognition (?); similar CNNs were
used commercially in the 1990s.

5.9 1991: Fundamental Deep Learning Problem of Gradient Descent


A diploma thesis (?) represented a milestone of explicit DL research. As mentioned in Sec. 5.6, by the
late 1980s, experiments had indicated that traditional deep feedforward or recurrent networks are hard
to train by backpropagation (BP) (Sec. 5.5). Hochreiter’s work formally identified a major reason:
Typical deep NNs suffer from the now famous problem of vanishing or exploding gradients. With
standard activation functions (Sec. 1), cumulative backpropagated error signals (Sec. 5.5.1) either
shrink rapidly, or grow out of bounds. In fact, they decay exponentially in the number of layers
or CAP depth (Sec. 3), or they explode. This is also known as the long time lag problem. Much
subsequent DL research of the 1990s and 2000s was motivated by this insight. Later work (?) also
studied basins of attraction and their stability under noise from a dynamical systems point of view:
either the dynamics are not robust to noise, or the gradients vanish. See also (??). Over the years,
several ways of partially overcoming the Fundamental Deep Learning Problem were explored (a small
numerical illustration of the problem follows the list below):

I A Very Deep Learner of 1991 (the History Compressor, Sec. 5.10) alleviates the problem
through unsupervised pre-training for a hierarchy of RNNs. This greatly facilitates subsequent
supervised credit assignment through BP (Sec. 5.5). In the FNN case, similar effects can be
achieved through conceptually related AE stacks (Sec. 5.7, 5.15) and Deep Belief Networks
(DBNs, Sec. 5.15).
II LSTM-like networks (Sec. 5.13, 5.16, 5.17, 5.21–5.23) alleviate the problem through a special
architecture unaffected by it.

III Today’s GPU-based computers have a million times the computational power of desktop ma-
chines of the early 1990s. This allows for propagating errors a few layers further down within
reasonable time, even in traditional NNs (Sec. 5.18). That is basically what is winning many of
the image recognition competitions now (Sec. 5.19, 5.21, 5.22). (Although this does not really
overcome the problem in a fundamental way.)

IV Hessian-free optimization (Sec. 5.6.2) can alleviate the problem for FNNs (????) (Sec. 5.6.2)
and RNNs (?) (Sec. 5.20).
V The space of NN weight matrices can also be searched without relying on error gradients,
thus avoiding the Fundamental Deep Learning Problem altogether. Random weight guessing
sometimes works better than more sophisticated methods (?). Certain more complex problems
are better solved by using Universal Search (?) for weight matrix-computing programs written
in a universal programming language (?). Some are better solved by using linear methods
to obtain optimal weights for connections to output events (Sec. 2), and evolving weights of
connections to other events—this is called Evolino (?). Compare also related RNNs pre-trained
by certain UL rules (?), also in the case of spiking neurons (??) (Sec. 5.26). Direct search
methods are relevant not only for SL but also for more general RL, and are discussed in more
detail in Sec. 6.6.
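
As the promised numerical illustration of the problem itself (my example, not the paper's): a backpropagated error signal passing through n tanh stages is multiplied at each stage by f'(net) · w, so it shrinks or grows exponentially in n.

import math

def backprop_factor(w, n, net=0.5):
    signal = 1.0
    for _ in range(n):                    # one chain rule factor per stage
        signal *= (1.0 - math.tanh(net) ** 2) * w
    return signal

for w in (0.5, 1.0, 3.0):
    print(w, [backprop_factor(w, n) for n in (1, 10, 50)])
# |f'(net) * w| < 1: the signal vanishes exponentially with CAP depth;
# |f'(net) * w| > 1: it explodes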

5.10 1991: UL-Based History Compression Through a Deep Stack of RNNs


A working Very Deep Learner (Sec. 3) of 1991 (??) could perform credit assignment across hundreds
of nonlinear operators or neural layers, by using unsupervised pre-training for a hierarchy of RNNs.
The basic idea is still relevant today. Each RNN is trained for a while in unsupervised fashion to
predict its next input (e.g., ??). From then on, only unexpected inputs (errors) convey new information
and get fed to the next higher RNN which thus ticks on a slower, self-organising time scale. It can
easily be shown that no information gets lost. It just gets compressed (much of machine learning is
essentially about compression, e.g., Sec. 4.4, 5.6.3, 6.7). For each individual input sequence, we get
a series of less and less redundant encodings in deeper and deeper levels of this History Compressor
or Neural Sequence Chunker, which can compress data in both space (like feedforward NNs) and
time. This is another good example of hierarchical representation learning (Sec. 4.3). There also is a
continuous variant of the history compressor (?).
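
The chunking principle can be sketched without any NN machinery. In the toy Python sketch below, a
trivial stand-in predictor replaces the trained next-input-predicting RNN of the original system; only
unexpected inputs are passed up, and since their positions are recorded, the original sequence remains
reconstructible, i.e., no information is lost:

    def compress(sequence, predict):
        # Pass only unexpected inputs (prediction errors) up to the next level;
        # predict(history) stands in for a trained next-input-predicting RNN.
        unexpected, history = [], []
        for x in sequence:
            if predict(history) != x:              # only new information ascends
                unexpected.append((len(history), x))
            history.append(x)
        return unexpected  # the higher level ticks on this slower time scale

    # Trivial stand-in predictor: always expects a repetition of the last symbol.
    repeat = lambda h: h[-1] if h else None
    print(compress("aaaabaaac", repeat))  # [(0, 'a'), (4, 'b'), (5, 'a'), (8, 'c')]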

The RNN stack is essentially a deep generative model of the data, which can be reconstructed from
its compressed form. Adding another RNN to the stack improves a bound on the data’s description
length—equivalent to the negative logarithm of its probability (??)—as long as there is remaining
local learnable predictability in the data representation on the corresponding level of the hierarchy.
Compare a similar observation for feedforward Deep Belief Networks (DBNs, 2006, Sec. 5.15).
The system was able to learn many previously unlearnable DL tasks. One ancient illustrative DL
experiment (?) required CAPs (Sec. 3) of depth 1200. The top level code of the initially unsupervised
RNN stack, however, got so compact that (previously infeasible) sequence classification through ad-
ditional BP-based SL became possible. Essentially the system used UL to greatly reduce problem
depth. Compare earlier BP-based fine-tuning of NNs initialized by rules of propositional logic (?)
(Sec. 5.6.1).
There is a way of compressing higher levels down into lower levels, thus fully or partially collaps-
ing the RNN stack. The trick is to retrain a lower-level RNN to continually imitate (predict) the hidden
units of an already trained, slower, higher-level RNN (the “conscious” chunker), through additional
predictive output neurons (?). This helps the lower RNN (the automatizer) to develop appropriate,
rarely changing memories that may bridge very long time lags. Again, this procedure can greatly
reduce the required depth of the BP process.
The 1991 system was a working Deep Learner in the modern post-2000 sense, and also a first
Neural Hierarchical Temporal Memory (HTM). It is conceptually similar to earlier AE hierarchies
(1987, Sec. 5.7) and later Deep Belief Networks (2006, Sec. 5.15), but more general in the sense that it
uses sequence-processing RNNs instead of FNNs with unchanging inputs. More recently, well-known
entrepreneurs (??) also got interested in HTMs; compare also hierarchical HMMs (e.g., ?), as well
as later UL-based recurrent systems (????). Clockwork RNNs (?) also consist of interacting RNN
modules with different clock rates, but do not use UL to set those rates. Stacks of RNNs were used in
later work on SL with great success, e.g., Sec. 5.13, 5.16, 5.17, 5.22.

5.11 1992: Max-Pooling (MP): Towards MPCNNs (Compare Sec. 5.16, 5.19)
The Neocognitron (Sec. 5.4) inspired the Cresceptron (?), which adapts its topology during training
(Sec. 5.6.3); compare the incrementally growing and shrinking GMDH networks (1965, Sec. 5.3).
Instead of using alternative local subsampling or WTA methods (e.g., ????), the Cresceptron
uses Max-Pooling (MP) layers. Here a 2-dimensional layer or array of unit activations is partitioned
into smaller rectangular arrays. Each is replaced in a downsampling layer by the activation of its
maximally active unit. A later, more complex version of the Cresceptron (?) also included “blurring”
layers to improve object location tolerance.
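
A minimal NumPy sketch of such an MP layer (assuming, for simplicity, non-overlapping pools and
side lengths divisible by the pool size):

    import numpy as np

    def max_pool(layer, pool=2):
        # Partition a 2D array of unit activations into pool x pool rectangles;
        # each is replaced in the downsampled layer by its maximally active unit.
        h, w = layer.shape
        blocks = layer.reshape(h // pool, pool, w // pool, pool)
        return blocks.max(axis=(1, 3))

    acts = np.arange(16.0).reshape(4, 4)
    print(max_pool(acts))  # [[ 5.  7.]  [13. 15.]]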
The neurophysiologically plausible topology of the feedforward HMAX model (?) is very similar
to the one of the 1992 Cresceptron (and thus to the 1979 Neocognitron). HMAX does not learn
though. Its units have hand-crafted weights; biologically plausible learning rules were later proposed
for similar models (e.g., ??).
When CNNs or convnets (Sec. 5.4, 5.8) are combined with MP, they become Cresceptron-like
or HMAX-like MPCNNs with alternating convolutional and max-pooling layers. Unlike Crescep-
tron and HMAX, however, MPCNNs are trained by BP (Sec. 5.5, 5.16) (?). Advantages of doing
this were pointed out subsequently (?). BP-trained MPCNNs have become central to many modern,
competition-winning, feedforward, visual Deep Learners (Sec. 5.17, 5.19–5.23).

5.12 1994: Early Contest-Winning NNs


Back in the 1990s, NNs already won controlled pattern recognition contests with secret test sets.
Notably, an NN with internal delay lines won the Santa Fe time-series competition on chaotic
intensity pulsations of an NH3 laser (??). No very deep CAPs (Sec. 3) were needed though.

5.13 1995: Supervised Recurrent Very Deep Learner (LSTM RNN)


Supervised Long Short-Term Memory (LSTM) RNN (???) could eventually perform similar feats as
the deep RNN hierarchy of 1991 (Sec. 5.10), overcoming the Fundamental Deep Learning Problem
(Sec. 5.9) without any unsupervised pre-training. LSTM could also learn DL tasks without local
sequence predictability (and thus unlearnable by the partially unsupervised 1991 History Compressor,
Sec. 5.10), dealing with very deep problems (Sec. 3) (e.g., ?).
The basic LSTM idea is very simple. Some of the units are called Constant Error Carousels
(CECs). Each CEC uses the identity function as its activation function f, and has a connection to itself
with a fixed weight of 1.0. Due to f's constant derivative of 1.0, errors backpropagated through a CEC
cannot vanish or explode (Sec. 5.9) but stay as they are (unless they "flow out" of the CEC to other,
typically adaptive parts of the NN). CECs are connected to several nonlinear adaptive units (some
with multiplicative activation functions) needed for learning nonlinear behavior. Weight changes of
these units often profit from error signals propagated far back in time through CECs. CECs are
the main reason why LSTM nets can learn to discover the importance of (and memorize) events that
happened thousands of discrete time steps ago, while previous RNNs already failed with minimal
time lags of only 10 steps.
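
The CEC idea in isolation can be sketched in a few lines of Python (a conceptual sketch, not a full
LSTM cell with its gates and nonlinear adaptive units):

    def cec_forward(cell, increment):
        # identity activation f(x) = x, self-connection with fixed weight 1.0:
        return 1.0 * cell + increment        # the stored value is preserved

    def cec_backward(error, steps):
        for _ in range(steps):
            error *= 1.0 * 1.0               # f'(x) = 1 times weight 1.0
        return error                         # neither vanishes nor explodes

    print(cec_backward(0.5, steps=10000))    # still 0.5 after 10,000 steps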
Many different LSTM variants and topologies are allowed. It is possible to evolve good problem-
specific topologies (?). Some LSTM variants also use modifiable self-connections of CECs (?).
To a certain extent, LSTM is biologically plausible (?). LSTM learned to solve many previously
unlearnable DL tasks involving: Recognition of the temporal order of widely separated events in noisy
input streams; Robust storage of high-precision real numbers across extended time intervals; Arith-
metic operations on continuous input streams; Extraction of information conveyed by the temporal
distance between events; Recognition of temporally extended patterns in noisy input sequences (??);
Stable generation of precisely timed rhythms, as well as smooth and non-smooth periodic trajecto-
ries (?). LSTM clearly outperformed previous RNNs on tasks that require learning the rules of regular
languages describable by deterministic Finite State Automata (FSAs) (?????????), both in terms of
reliability and speed.
LSTM also worked on tasks involving context free languages (CFLs) that cannot be represented
by HMMs or similar FSAs discussed in the RNN literature (???????). CFL recognition (?) requires
the functional equivalent of a runtime stack. Some previous RNNs failed to learn small CFL training
sets (?). Those that did not (??) failed to extract the general rules, and did not generalize well
on substantially larger test sets. The same held for context-sensitive languages (CSLs) (e.g., ?). LSTM
generalized well though, requiring only the 30 shortest exemplars (n ≤ 10) of the CSL a^n b^n c^n
to correctly predict the possible continuations of sequence prefixes for n up to 1000 and more. A
combination of a decoupled extended Kalman filter (??????) and an LSTM RNN (?) learned to deal
correctly with values of n up to 10 million and more. That is, after training the network was able to
read sequences of 30,000,000 symbols and more, one symbol at a time, and finally detect the subtle
differences between legal strings such as a^{10,000,000} b^{10,000,000} c^{10,000,000} and very similar but illegal
strings such as a^{10,000,000} b^{9,999,999} c^{10,000,000}. Compare also more recent RNN algorithms able to
deal with long time lags (????).
Bi-directional RNNs (BRNNs) (??) are designed for input sequences whose starts and ends are
known in advance, such as spoken sentences to be labeled by their phonemes; compare (?). To take
both past and future context of each sequence element into account, one RNN processes the sequence
from start to end, the other backwards from end to start. At each time step their combined outputs
predict the corresponding label (if there is any). BRNNs were successfully applied to secondary
protein structure prediction (?). DAG-RNNs (??) generalize BRNNs to multiple dimensions. They
learned to predict properties of small organic molecules (?) as well as protein contact maps (?), also
in conjunction with a growing deep FNN (?) (Sec. 5.21). BRNNs and DAG-RNNs unfold their full
potential when combined with the LSTM concept (???).
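
A sketch of the basic BRNN combination step, with simple step functions standing in for the two
trained RNNs (hypothetical placeholders):

    def brnn_states(sequence, step_fwd, step_bwd, init=0.0):
        # One RNN processes the sequence start-to-end, the other end-to-start;
        # at each position, both past and future context then inform the label.
        fwd, h = [], init
        for x in sequence:
            h = step_fwd(h, x)
            fwd.append(h)
        bwd, h = [], init
        for x in reversed(sequence):
            h = step_bwd(h, x)
            bwd.append(h)
        bwd.reverse()
        return list(zip(fwd, bwd))   # combined outputs predict the labels

    # Toy step functions: running sums over past and over future context.
    print(brnn_states([1, 2, 3], lambda h, x: h + x, lambda h, x: h + x))
    # [(1, 6), (3, 5), (6, 3)]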
Particularly successful in recent competitions are stacks (Sec. 5.10) of LSTM RNNs (??) trained
by Connectionist Temporal Classification (CTC) (?), a gradient-based method for finding RNN
weights that maximize the probability of teacher-given label sequences, given (typically much longer
and more high-dimensional) streams of real-valued input vectors. CTC-LSTM performs simultaneous
segmentation (alignment) and recognition (Sec. 5.22).
In the early 2000s, speech recognition was dominated by HMMs combined with FNNs (e.g., ?).
Nevertheless, when trained from scratch on utterances from the TIDIGITS speech database, in 2003
LSTM already obtained results comparable to those of HMM-based systems (???). In 2007, LSTM
outperformed HMMs in keyword spotting tasks (?); compare recent improvements (??). By 2013,
LSTM also achieved best known results on the famous TIMIT phoneme recognition benchmark (?)
(Sec. 5.22). Recently, LSTM RNN / HMM hybrids obtained best known performance on medium-
vocabulary (?) and large-vocabulary speech recognition (?).
LSTM is also applicable to robot localization (?), robot control (?), online driver distraction de-
tection (?), and many other tasks. For example, it helped to improve the state of the art in diverse
applications such as protein analysis (?), handwriting recognition (????), voice activity detection (?),
optical character recognition (?), language identification (?), prosody contour prediction (?), audio on-
set detection (?), text-to-speech synthesis (?), social signal classification (?), machine translation (?),
and others.
RNNs can also be used for metalearning (???), because they can in principle learn to run their
own weight change algorithm (?). A successful metalearner (?) used an LSTM RNN to quickly learn
a learning algorithm for quadratic functions (compare Sec. 6.8).
Recently, LSTM RNNs won several international pattern recognition competitions and set nu-
merous benchmark records on large and complex data sets, e.g., Sec. 5.17, 5.21, 5.22. Gradient-
based LSTM is no panacea though—other methods sometimes outperformed it at least on certain
tasks (?????); compare Sec. 5.20.

5.14 2003: More Contest-Winning/Record-Setting NNs; Successful Deep NNs


In the decade around 2000, many practical and commercial pattern recognition applications were
dominated by non-neural machine learning methods such as Support Vector Machines (SVMs) (??).
Nevertheless, at least in certain domains, NNs outperformed other techniques.
A Bayes NN (?) based on an ensemble (??????) of NNs won the NIPS 2003 Feature Selection
Challenge with secret test set (?). The NN was not very deep though—it had two hidden layers and
thus rather shallow CAPs (Sec. 3) of depth 3.
Important for many present competition-winning pattern recognisers (Sec. 5.19, 5.21, 5.22) were
developments in the CNN department. A BP-trained (?) CNN (Sec. 5.4, Sec. 5.8) set a new
MNIST record of 0.4% (?), using training pattern deformations (?) but no unsupervised pre-training
(Sec. 5.7, 5.10, 5.15). A standard BP net achieved 0.7% (?). Again, the corresponding CAP depth
was low. Compare further improvements in Sec. 5.16, 5.18, 5.19.
Good image interpretation results (?) were achieved with rather deep NNs trained by the BP
variant R-prop (?) (Sec. 5.6.2); here feedback through recurrent connections helped to improve image
interpretation. FNNs with CAP depth up to 6 were used to successfully classify high-dimensional
data (?).
Deep LSTM RNNs started to obtain first speech recognition results comparable to those
of HMM-based systems (?); compare Sec. 5.13, 5.16, 5.21, 5.22.

5.15 2006/7: UL For Deep Belief Networks / AE Stacks Fine-Tuned by BP


While learning networks with numerous non-linear layers date back at least to 1965 (Sec. 5.3),
and explicit DL research results have been published at least since 1991 (Sec. 5.9, 5.10), the ex-
pression Deep Learning was actually coined around 2006, when unsupervised pre-training of deep
FNNs helped to accelerate subsequent SL through BP (??). Compare earlier terminology on loading
deep networks (??) and learning deep memories (?). Compare also BP-based (Sec. 5.5) fine-tuning
(Sec. 5.6.1) of (not so deep) FNNs pre-trained by competitive UL (?).
The Deep Belief Network (DBN) is a stack of Restricted Boltzmann Machines (RBMs) (?), which
in turn are Boltzmann Machines (BMs) (?) with a single layer of feature-detecting units; compare also
Higher-Order BMs (?). Each RBM perceives pattern representations from the level below and learns
to encode them in unsupervised fashion. At least in theory under certain assumptions, adding more
layers improves a bound on the data’s negative log probability (?) (equivalent to the data’s description
length—compare the corresponding observation for RNN stacks, Sec. 5.10). There are extensions for
Temporal RBMs (?).
Without any training pattern deformations (Sec. 5.14), a DBN fine-tuned by BP achieved 1.2%
error rate (?) on the MNIST handwritten digits (Sec. 5.8, 5.14). This result helped to arouse interest
in DBNs. DBNs also achieved good results on phoneme recognition, with an error rate of 26.7% on
the TIMIT core test set (?); compare further improvements through FNNs (??) and LSTM RNNs
(Sec. 5.22).
A DBN-based technique called Semantic Hashing (?) maps semantically similar documents (of
variable size) to nearby addresses in a space of document representations. It outperformed previ-
ous searchers for similar documents, such as Locality Sensitive Hashing (??). See the RBM/DBN
tutorial (?).
Autoencoder (AE) stacks (?) (Sec. 5.7) became a popular alternative way of pre-training deep
FNNs in unsupervised fashion, before fine-tuning (Sec. 5.6.1) them through BP (Sec. 5.5) (???).
Sparse coding (Sec. 5.6.4) was formulated as a combination of convex optimization problems (?).
Recent surveys of stacked RBM and AE methods focus on post-2006 developments (??). Unsu-
pervised DBNs and AE stacks are conceptually similar to, but in a certain sense less general than,
the unsupervised RNN stack-based History Compressor of 1991 (Sec. 5.10), which can process and
re-encode not only stationary input patterns, but entire pattern sequences.

5.16 2006/7: Improved CNNs / GPU-CNNs / BP for MPCNNs / LSTM Stacks


Also in 2006, a BP-trained (?) CNN (Sec. 5.4, Sec. 5.8) set a new MNIST record of 0.39% (?),
using training pattern deformations (Sec. 5.14) but no unsupervised pre-training. Compare further
improvements in Sec. 5.18, 5.19. Similar CNNs were used for off-road obstacle avoidance (?). A
combination of CNNs and TDNNs later learned to map fixed-size representations of variable-size
sentences to features relevant for language processing, using a combination of SL and UL (?).
2006 also saw an early GPU-based CNN implementation (?) up to 4 times faster than CPU-
CNNs; compare also earlier GPU implementations of standard FNNs with a reported speed-up factor
of 20 (?). GPUs or graphics cards have become more and more important for DL in subsequent years
(Sec. 5.18–5.22).
In 2007, BP (Sec. 5.5) was applied for the first time (?) to Neocognitron-inspired (Sec. 5.4),
Cresceptron-like (or HMAX-like) MPCNNs (Sec. 5.11) with alternating convolutional and max-
pooling layers. BP-trained MPCNNs have become an essential ingredient of many modern,
competition-winning, feedforward, visual Deep Learners (Sec. 5.17, 5.19–5.23).

Also in 2007, hierarchical stacks of LSTM RNNs were introduced (?). They can be trained by
hierarchical Connectionist Temporal Classification (CTC) (?). For tasks of sequence labelling, every
LSTM RNN level (Sec. 5.13) predicts a sequence of labels fed to the next level. Error signals at every
level are back-propagated through all the lower levels. On spoken digit recognition, LSTM stacks
outperformed HMMs, despite making fewer assumptions about the domain. LSTM stacks do not
necessarily require unsupervised pre-training like the earlier UL-based RNN stacks (?) of Sec. 5.10.

5.17 2009: First Official Competitions Won by RNNs, and with MPCNNs
Stacks of LSTM RNNs trained by CTC (Sec. 5.13, 5.16) became the first RNNs to win official interna-
tional pattern recognition contests (with secret test sets known only to the organisers). More precisely,
three connected handwriting competitions at ICDAR 2009 in three different languages (French, Arabic,
Farsi) were won by deep LSTM RNNs without any a priori linguistic knowledge, performing simul-
taneous segmentation and recognition. Compare (?????) (Sec. 5.22).
To detect human actions in surveillance videos, a 3-dimensional CNN (e.g., ??), combined with
SVMs, was part of a larger system (?) using a bag of features approach (?) to extract regions of
interest. The system won three 2009 TRECVID competitions. These were possibly the first official
international contests won with the help of (MP)CNNs (Sec. 5.16). An improved version of the
method was published later (?).
2009 also saw a GPU-DBN implementation (?) orders of magnitude faster than previous CPU-
DBNs (see Sec. 5.15); see also (?). The Convolutional DBN (?) (with a probabilistic variant of MP,
Sec. 5.11) combines ideas from CNNs and DBNs, and was successfully applied to audio classifica-
tion (?).

5.18 2010: Plain Backprop (+ Distortions) on GPU Breaks MNIST Record


In 2010, a new MNIST (Sec. 5.8) record of 0.35% error rate was set by good old BP (Sec. 5.5) in deep
but otherwise standard NNs (?), using neither unsupervised pre-training (e.g., Sec. 5.7, 5.10, 5.15) nor
convolution (e.g., Sec. 5.4, 5.8, 5.14, 5.16). However, training pattern deformations (e.g., Sec. 5.14)
were important to generate a big training set and avoid overfitting. This success was made possi-
ble mainly through a GPU implementation of BP that was up to 50 times faster than standard CPU
versions. A good value of 0.95% was obtained without distortions except for small saccadic eye
movement-like translations—compare Sec. 5.15.
Since BP was 3-5 decades old by then (Sec. 5.5), and pattern deformations 2 decades (?)
(Sec. 5.14), these results seemed to suggest that advances in exploiting modern computing hardware
were more important than advances in algorithms.

5.19 2011: MPCNNs on GPU Achieve Superhuman Vision Performance


In 2011, a flexible GPU-implementation (?) of Max-Pooling (MP) CNNs or Convnets was described
(a GPU-MPCNN), building on earlier MP work (?) (Sec. 5.11), on CNNs (??) (Sec. 5.4, 5.8, 5.16),
and on early GPU-based CNNs without MP (?) (Sec. 5.16); compare early GPU-NNs (?) and GPU-
DBNs (?) (Sec. 5.17). MPCNNs have alternating convolutional layers (Sec. 5.4) and max-pooling
layers (MP, Sec. 5.11) topped by standard fully connected layers. All weights are trained by BP
(Sec. 5.5, 5.8, 5.16) (??). GPU-MPCNNs have become essential for many contest-winning FNNs
(Sec. 5.21, Sec. 5.22).
Multi-Column GPU-MPCNNs (?) are committees (??????) of GPU-MPCNNs with simple
democratic output averaging. Several MPCNNs see the same input; their output vectors are used to
assign probabilities to the various possible classes. The class with the highest average probability
is chosen as the system's classification of the present input. Compare earlier, more sophisticated
ensemble methods (?), the contest-winning ensemble Bayes-NN (?) of Sec. 5.14, and recent related
work (?).
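
The democratic output averaging itself takes only a few lines (NumPy sketch; the per-column
class-probability vectors are hypothetical):

    import numpy as np

    def committee_classify(column_outputs):
        # column_outputs: one class-probability vector per GPU-MPCNN, same input.
        avg = np.mean(column_outputs, axis=0)  # simple democratic averaging
        return int(np.argmax(avg))             # class with highest average probability

    cols = [np.array([0.7, 0.2, 0.1]),
            np.array([0.4, 0.5, 0.1]),
            np.array([0.6, 0.3, 0.1])]
    print(committee_classify(cols))  # 0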
An ensemble of GPU-MPCNNs was the first system to achieve superhuman visual pattern recog-
nition (??) in a controlled competition, namely, the IJCNN 2011 traffic sign recognition contest in
San Jose (CA) (??). This is of interest for fully autonomous, self-driving cars in traffic (e.g., ?). The
GPU-MPCNN ensemble obtained 0.56% error rate and was twice better than human test subjects,
three times better than the closest artificial NN competitor (?), and six times better than the best
non-neural method.
A few months earlier, the qualifying round was won in a 1st stage online competition, albeit by
a much smaller margin: 1.02% (?) vs 1.03% for second place (?). After the deadline, the organisers
revealed that human performance on the test set was 1.19%. That is, the best methods already seemed
human-competitive. However, during the qualifying round it was possible to incrementally gain information
about the test set by probing it through repeated submissions. This is illustrated by better and better
results obtained by various teams over time (?) (the organisers eventually imposed a limit of 10
resubmissions). In the final competition this was not possible.
This illustrates a general problem with benchmarks whose test sets are public, or at least can be
probed to some extent: competing teams tend to overfit on the test set even when it cannot be directly
used for training, only for evaluation.
In 1997 many thought it a big deal that human chess world champion Kasparov was beaten by
an IBM computer. But back then computers could not at all compete with little kids in visual pat-
tern recognition, which seems much harder than chess from a computational perspective. Of course,
the traffic sign domain is highly restricted, and kids are still much better general pattern recognis-
ers. Nevertheless, by 2011, deep NNs could already learn to rival them in important limited visual
domains.
An ensemble of GPU-MPCNNs was also the first method to achieve human-competitive perfor-
mance (around 0.2%) on MNIST (?). This represented a dramatic improvement, since by then the
MNIST record had hovered around 0.4% for almost a decade (Sec. 5.14, 5.16, 5.18).
Given all the prior work on (MP)CNNs (Sec. 5.4, 5.8, 5.11, 5.16) and GPU-CNNs (Sec. 5.16),
GPU-MPCNNs are not a breakthrough in the scientific sense. But they are a commercially relevant
breakthrough in efficient coding that has made a difference in several contests since 2011. Today, most
feedforward competition-winning deep NNs are (ensembles of) GPU-MPCNNs (Sec. 5.21–5.23).

5.20 2011: Hessian-Free Optimization for RNNs


Also in 2011 it was shown (?) that Hessian-free optimization (e.g., ???) (Sec. 5.6.2) can alleviate
the Fundamental Deep Learning Problem (Sec. 5.9) in RNNs, outperforming standard gradient-based
LSTM RNNs (Sec. 5.13) on several tasks. Compare other RNN algorithms (????) that also at least
sometimes yield better results than steepest descent for LSTM RNNs.

5.21 2012: First Contests Won on ImageNet, Object Detection, Segmentation


In 2012, an ensemble of GPU-MPCNNs (Sec. 5.19) achieved best results on the ImageNet classifica-
tion benchmark (?), which is popular in the computer vision community. Here relatively large image
sizes of 256x256 pixels were necessary, as opposed to only 48x48 pixels for the 2011 traffic sign
competition (Sec. 5.19). See further improvements in Sec. 5.22.
Also in 2012, the biggest NN so far (10^9 free parameters) was trained in unsupervised mode
(Sec. 5.7, 5.15) on unlabeled data (?), then applied to ImageNet. The codes across its top layer were
used to train a simple supervised classifier, which achieved best results so far on 20,000 classes.
Instead of relying on efficient GPU programming, this was done by brute force on 1,000 standard
machines with 16,000 cores.
So by 2011/2012, excellent results had been achieved by Deep Learners in image recognition and
classification (Sec. 5.19, 5.21). The computer vision community, however, is especially interested in
object detection in large images, for applications such as image-based search engines, or for biomed-
ical diagnosis where the goal may be to automatically detect tumors etc. in images of human tissue.
Object detection presents additional challenges. One natural approach is to train a deep NN classifier
on patches of big images, then use it as a feature detector to be shifted across unknown visual scenes,
using various rotations and zoom factors. Image parts that yield highly active output units are likely
to contain objects similar to those the NN was trained on.
2012 finally saw the first DL system (an ensemble of GPU-MPCNNs, Sec. 5.19) to win a contest
on visual object detection (?) in large images of several million pixels (??). Such biomedical appli-
cations may turn out to be among the most important applications of DL. The world spends over 10%
of GDP on healthcare (> 6 trillion USD per year), much of it on medical diagnosis through expensive
experts. Partial automation of this could not only save lots of money, but also make expert diagnostics
accessible to many who currently cannot afford it. It is gratifying to observe that today deep NNs may
actually help to improve healthcare and perhaps save human lives.
2012 also saw the first pure image segmentation contest won by DL (?), again through a GPU-
MPCNN ensemble (?).² EM stacks are relevant for the recently approved huge brain projects in Eu-
rope and the US (e.g., ?). Given electron microscopy images of stacks of thin slices of animal brains,
the goal is to build a detailed 3D model of the brain’s neurons and dendrites. But human experts need
many hours and days and weeks to annotate the images: Which parts depict neuronal membranes?
Which parts are irrelevant background? This needs to be automated (e.g., ?). Deep Multi-Column
GPU-MPCNNs learned to solve this task through experience with many training images, and won the
contest on all three evaluation metrics by a large margin, with superhuman performance in terms of
pixel error.

² It should be mentioned, however, that LSTM RNNs already performed simultaneous segmentation and recognition when they became the first recurrent Deep Learners to win official international pattern recognition contests—see Sec. 5.17.

Both object detection (?) and image segmentation (?) profit from fast MPCNN-based image scans
that avoid redundant computations. Recent MPCNN scanners speed up naive implementations by up
to three orders of magnitude (??); compare earlier efficient methods for CNNs without MP (?).
Also in 2012, a system consisting of growing deep FNNs and 2D-BRNNs (?) won the CASP
2012 contest on protein contact map prediction. On the IAM-OnDoDB benchmark, LSTM RNNs
(Sec. 5.13) outperformed all other methods (HMMs, SVMs) on online mode detection (??) and key-
word spotting (?). On the long time lag problem of language modelling, LSTM RNNs outperformed
all statistical approaches on the IAM-DB benchmark (?); improved results were later obtained through
a combination of NNs and HMMs (?). Compare earlier RNNs for object recognition through iterative
image interpretation (???); see also more recent publications (??) extending work on biologically
plausible learning rules for RNNs (?).

5.22 2013-: More Contests and Benchmark Records


A stack (??) (Sec. 5.10) of bi-directional LSTM RNNs (?) trained by CTC (Sec. 5.13, 5.17) broke a
famous TIMIT speech (phoneme) recognition record, achieving 17.7% test set error rate (?), despite
thousands of man-years previously spent on Hidden Markov Model (HMM)-based speech recognition
research. Compare earlier DBN results (Sec. 5.15).
CTC-LSTM also helped to score first at NIST’s OpenHaRT2013 evaluation (?). For optical char-
acter recognition (OCR), LSTM RNNs outperformed commercial recognizers of historical data (?).
LSTM-based systems also set benchmark records in language identification (?), medium-vocabulary
speech recognition (?), prosody contour prediction (?), audio onset detection (?), text-to-speech syn-
thesis (?), and social signal classification (?).
An LSTM RNN was used to estimate the state posteriors of an HMM; this system beat the previous
state of the art in large vocabulary speech recognition (??). Another LSTM RNN with hundreds of
millions of connections was used to rerank hypotheses of a statistical machine translation system; this
system beat the previous state of the art in English to French translation (?).
A new record on the ICDAR Chinese handwriting recognition benchmark (over 3700 classes)
was set on a desktop machine by an ensemble of GPU-MPCNNs (Sec. 5.19) with almost human
performance (?); compare (?).
The MICCAI 2013 Grand Challenge on Mitosis Detection (?) also was won by an object-detecting
GPU-MPCNN ensemble (?). Its data set was even larger and more challenging than the one of ICPR
2012 (Sec. 5.21): a real-world dataset including many ambiguous cases and frequently encountered
problems such as imperfect slide staining.
Three 2D-CNNs (with mean-pooling instead of MP, Sec. 5.11) observing three orthogonal projec-
tions of 3D images outperformed traditional full 3D methods on the task of segmenting tibial cartilage
in low field knee MRI scans (?).
Deep GPU-MPCNNs (Sec. 5.19) also helped to achieve new best results on important benchmarks
of the computer vision community: ImageNet classification (??) and—in conjunction with traditional
approaches—PASCAL object detection (?). They also learned to predict bounding box coordinates
of objects in the Imagenet 2013 database, and obtained state-of-the-art results on tasks of localization
and detection (?). GPU-MPCNNs also helped to recognise multi-digit numbers in Google Street View
images (?), where part of the NN was trained to count visible digits; compare earlier work on detect-
ing “numerosity” through DBNs (?). This system also excelled at recognising distorted synthetic text
in reCAPTCHA puzzles. Other successful CNN applications include scene parsing (?), object detec-
tion (?), shadow detection (?), video classification (?), and Alzheimer's disease neuroimaging (?).
Additional contests are mentioned in the web pages of the Swiss AI Lab IDSIA, the University of
Toronto, New York University, and the University of Montreal.

5.23 Currently Successful Techniques: LSTM RNNs and GPU-MPCNNs


Most competition-winning or benchmark record-setting Deep Learners actually use one of two super-
vised techniques: (a) recurrent LSTM (1997) trained by CTC (2006) (Sec. 5.13, 5.17, 5.21, 5.22), or
(b) feedforward GPU-MPCNNs (2011, Sec. 5.19, 5.21, 5.22) based on CNNs (1979, Sec. 5.4) with
MP (1992, Sec. 5.11) trained through BP (1989–2007, Sec. 5.8, 5.16).
Exceptions include two 2011 contests (???) specialising in Transfer Learning from one dataset
to another (e.g., ???). However, deep GPU-MPCNNs do allow for pure SL-based transfer (?), where
pre-training on one training set greatly improves performance on quite different sets, also in more
recent studies (??). In fact, deep MPCNNs pre-trained by SL can extract useful features from quite
diverse off-training-set images, yielding better results than traditional, widely used features such as
SIFT (??) on many vision tasks (?). To deal with changing datasets, slowly learning deep NNs were
also combined with rapidly adapting “surface” NNs (?).
Remarkably, in the 1990s a trend went from partially unsupervised RNN stacks (Sec. 5.10) to
purely supervised LSTM RNNs (Sec. 5.13), just like in the 2000s a trend went from partially unsuper-
vised FNN stacks (Sec. 5.15) to purely supervised MPCNNs (Sec. 5.16–5.22). Nevertheless, in many
applications it can still be advantageous to combine the best of both worlds—supervised learning and
unsupervised pre-training (Sec. 5.10, 5.15).

5.24 Recent Tricks for Improving SL Deep NNs (Compare Sec. 5.6.2, 5.6.3)
DBN training (Sec. 5.15) can be improved through gradient enhancements and automatic learning
rate adjustments during stochastic gradient descent (??), and through Tikhonov-type (?) regulariza-
tion of RBMs (?). Contractive AEs (?) discourage hidden unit perturbations in response to input
perturbations, similar to how FMS (Sec. 5.6.3) for LOCOCODE AEs (Sec. 5.6.4) discourages output
perturbations in response to weight perturbations.
Hierarchical CNNs in a Neural Abstraction Pyramid (e.g., ??) were trained to reconstruct images
corrupted by structured noise (?), thus enforcing increasingly abstract image representations in deeper
and deeper layers. Denoising AEs later used a similar procedure (?).
Dropout (??) removes units from NNs during training to improve generalisation. Some view it
as an ensemble method that trains multiple data models simultaneously (?). Under certain circum-
stances, it could also be viewed as a form of training set augmentation: effectively, more and more
informative complex features are removed from the training data. Compare dropout for RNNs (???).
A deterministic approximation coined fast dropout (?) can lead to faster learning and evaluation and
was adapted for RNNs (?). Dropout is closely related to older, biologically plausible techniques for
adding noise to neurons or synapses during training (e.g., ??????), which in turn are closely related
to finding perturbation-resistant low-complexity NNs, e.g., through FMS (Sec. 5.6.3). MDL-based
stochastic variational methods (?) are also related to FMS. They are useful for RNNs, where classic
regularizers such as weight decay (Sec. 5.6.3) represent a bias towards limited memory capacity (e.g.,
?). Compare recent work on variational recurrent AEs (?).
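
A minimal sketch of dropout during training, in its common "inverted" form that rescales surviving
activations at training time so that expected activations match those at test time (the original
formulation instead rescaled at test time):

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(activations, p_drop=0.5):
        # Randomly silence units during training; rescale the survivors so the
        # expected activation is unchanged when dropout is off at test time.
        mask = rng.random(activations.shape) >= p_drop
        return activations * mask / (1.0 - p_drop)

    print(dropout(np.ones(8)))  # roughly half the units zeroed, the rest scaled to 2.0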
The activation function f of Rectified Linear Units (ReLUs) is f(x) = x for x > 0, f(x) =
0 otherwise—compare the old concept of half-wave rectified units (?). ReLU NNs are useful for
RBMs (??), outperformed sigmoidal activation functions in deep NNs (?), and helped to obtain best
results on several benchmark problems across multiple domains (e.g., ??).
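
In code, the ReLU and its derivative are one-liners; the constant unit slope for active units is what
keeps backpropagated error signals from shrinking (compare Sec. 5.9):

    def relu(x):
        return x if x > 0 else 0.0     # f(x) = max(0, x)

    def relu_slope(x):
        return 1.0 if x > 0 else 0.0   # derivative: 1 for active units, 0 otherwise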
NNs with competing linear units tend to outperform those with non-competing nonlinear units,
and avoid catastrophic forgetting through BP when training sets change over time (?). In this con-
text, choosing a learning algorithm may be more important than choosing activation functions (?).
Maxout NNs (?) combine competitive interactions and dropout (see above) to achieve excellent re-
sults on certain benchmarks. Compare early RNNs with competing units for SL and RL (?). To
address overfitting, instead of depending on pre-wired regularizers and hyper-parameters (??), self-
delimiting RNNs (SLIM NNs) with competing units (?) can in principle learn to select their own
runtime and their own numbers of effective free parameters, thus learning their own computable regu-
larisers (Sec. 4.4, 5.6.3), becoming fast and slim when necessary. One may penalize the task-specific
total length of connections (e.g., ????) and communication costs of SLIM NNs implemented on the
3-dimensional brain-like multi-processor hardware to be expected in the future.
RmsProp (??) can speed up first order gradient descent methods (Sec. 5.5, 5.6.2); compare vario-
η (?), Adagrad (?) and Adadelta (?). DL in NNs can also be improved by transforming hidden unit
activations such that they have zero output and slope on average (?). Many additional, older tricks
(Sec. 5.6.2, 5.6.3) should also be applicable to today’s deep NNs; compare (??).
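
For concreteness, a generic RmsProp-style update step might look as follows (NumPy sketch;
hyper-parameter values are illustrative, not prescribed by the cited papers):

    import numpy as np

    def rmsprop_step(w, grad, ms, lr=1e-3, decay=0.9, eps=1e-8):
        # Divide each weight's step by a running root-mean-square of its recent
        # gradients, giving rarely or strongly varying weights adapted step sizes.
        ms = decay * ms + (1.0 - decay) * grad ** 2
        w = w - lr * grad / (np.sqrt(ms) + eps)
        return w, ms

    w, ms = np.zeros(3), np.zeros(3)
    w, ms = rmsprop_step(w, np.array([0.1, -2.0, 0.0]), ms)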

5.25 Consequences for Neuroscience


It is ironic that artificial NNs (ANNs) can help to better understand biological NNs (BNNs)—see the
ISBI 2012 results mentioned in Sec. 5.21 (??).
The feature detectors learned by single-layer visual ANNs are similar to those found in early visual
processing stages of BNNs (e.g., Sec. 5.6.4). Likewise, the feature detectors learned in deep layers
of visual ANNs should be highly predictive of what neuroscientists will find in deep layers of BNNs.
While the visual cortex of BNNs may use quite different learning algorithms, its objective function to
be minimised may be quite similar to the one of visual ANNs. In fact, results obtained with relatively
deep artificial DBNs (?) and CNNs (?) seem compatible with insights about the visual pathway in the
primate cerebral cortex, which has been studied for many decades (e.g., ?????????????); compare a
computer vision-oriented survey (?).

5.26 DL with Spiking Neurons?


Many recent DL results profit from GPU-based traditional deep NNs, e.g., Sec. 5.16–5.19. Current
GPUs, however, are little ovens, much hungrier for energy than biological brains, whose neurons
efficiently communicate by brief spikes (???), and often remain quiet. Many computational models
of such spiking neurons have been proposed and analyzed (e.g., ?????????????????????????).
Future energy-efficient hardware for DL in NNs may implement aspects of such models (e.g.,
???????????). A simulated, event-driven, spiking variant (?) of an RBM (Sec. 5.15) was trained by a
variant of the Contrastive Divergence algorithm (?). Spiking nets were evolved to achieve reasonable
performance on small face recognition data sets (?) and to control simple robots (??). A spiking DBN
with about 250,000 neurons (as part of a larger NN; ??) achieved 6% error rate on MNIST; compare
similar results with a spiking DBN variant of depth 3 using a neuromorphic event-based sensor (?).
In practical applications, however, current artificial networks of spiking neurons cannot yet compete
with the best traditional deep NNs (e.g., compare MNIST results of Sec. 5.19).

6 DL in FNNs and RNNs for Reinforcement Learning (RL)
So far we have focused on Deep Learning (DL) in supervised or unsupervised NNs. Such NNs learn
to perceive / encode / predict / classify patterns or pattern sequences, but they do not learn to act in
the more general sense of Reinforcement Learning (RL) in unknown environments (see surveys, e.g.,
???). Here we add a discussion of DL FNNs and RNNs for RL. It will be shorter than the discussion
of FNNs and RNNs for SL and UL (Sec. 5), reflecting the current size of the various fields.
Without a teacher, solely from occasional real-valued pain and pleasure signals, RL agents must
discover how to interact with a dynamic, initially unknown environment to maximize their expected
cumulative reward signals (Sec. 2). There may be arbitrary, a priori unknown delays between actions
and perceivable consequences. The problem is as hard as any problem of computer science, since any
task with a computable description can be formulated in the RL framework (e.g., ?). For example, an
answer to the famous question of whether P = NP (??) would also set limits for what is achievable
by general RL. Compare more specific limitations, e.g., (???). The following subsections mostly
focus on certain obvious intersections between DL and RL—they cannot serve as a general RL survey.

6.1 RL Through NN World Models Yields RNNs With Deep CAPs


In the special case of an RL FNN controller C interacting with a deterministic, predictable environ-
ment, a separate FNN called M can learn to become C’s world model through system identification,
predicting C’s inputs from previous actions and inputs (e.g., ??????????????????). Assume M has
learned to produce accurate predictions. We can use M to substitute the environment. Then M and
C form an RNN where M ’s outputs become inputs of C, whose outputs (actions) in turn become
inputs of M . Now BP for RNNs (Sec. 5.5.1) can be used to achieve desired input events such as high
real-valued reward signals: While M ’s weights remain fixed, gradient information for C’s weights is
propagated back through M down into C and back through M etc. To a certain extent, the approach
is also applicable in probabilistic or uncertain environments, as long as the inner products of M ’s
C-based gradient estimates and M ’s “true” gradients tend to be positive.
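
A deliberately tiny scalar sketch of this chain, with a linear model M and a linear controller C (all
numbers and functions hypothetical): M's weight stays frozen while the gradient of the model-predicted
reward flows back through M into C.

    # Controller C: action = c * observation. World model M: reward = m * action.
    def train_controller(c, m, observation, lr=0.1, steps=3):
        for _ in range(steps):
            action = c * observation        # C's output
            predicted_reward = m * action   # frozen M substitutes the environment
            grad_c = m * observation        # d(predicted_reward)/dc via fixed M
            c += lr * grad_c                # gradient ascent on predicted reward
        return c

    print(train_controller(c=0.0, m=2.0, observation=1.0))  # 0.6: C drifts towards
    # actions that M predicts to be rewarding (unboundedly so, in this linear toy)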
In general, this approach implies deep CAPs for C, unlike in DP-based traditional RL (Sec. 6.2).
Decades ago, the method was used to learn to back up a model truck (?). An RL active vision system
used it to learn sequential shifts (saccades) of a fovea, to detect targets in visual scenes (?), thus
learning to control selective attention. Compare RL-based attention learning without NNs (?).
To allow for memories of previous events in partially observable worlds (Sec. 6.3), the most
general variant of this technique uses RNNs instead of FNNs to implement both M and C (???).
This may cause deep CAPs not only for C but also for M .
M can also be used to optimize expected reward by planning future action sequences (?). In fact,
the winners of the 2004 RoboCup World Championship in the fast league (?) trained NNs to predict
the effects of steering signals on fast robots with 4 motors for 4 different wheels. During play, such
NN models were used to achieve desirable subgoals, by optimizing action sequences through quickly
planning ahead. The approach also was used to create self-healing robots able to compensate for
faulty motors whose effects no longer match the predictions of the NN models (??).
Typically M is not given in advance. Then an essential question is: which experiments should
C conduct to quickly improve M ? The Formal Theory of Fun and Creativity (e.g., ??) formalizes
driving forces and value functions behind such curious and exploratory behavior: A measure of the
learning progress of M becomes the intrinsic reward of C (?); compare (??). This motivates C to
create action sequences (experiments) such that M makes quick progress.

6.2 Deep FNNs for Traditional RL and Markov Decision Processes (MDPs)
The classical approach to RL (??) makes the simplifying assumption of Markov Decision Processes
(MDPs): the current input of the RL agent conveys all information necessary to compute an optimal
next output event or decision. This allows for greatly reducing CAP depth in RL NNs (Sec. 3, 6.1)
by using the Dynamic Programming (DP) trick (?). The latter is often explained in a probabilistic
framework (e.g., ?), but its basic idea can already be conveyed in a deterministic setting. For simplic-
ity, using the notation of Sec. 2, let input events x_t encode the entire current state of the environment,
including a real-valued reward r_t (no need to introduce additional vector-valued notation, since real
values can encode arbitrary vectors of real values). The original RL goal (find weights that maximize
the sum of all rewards of an episode) is replaced by an equivalent set of alternative goals set by a real-
valued value function V defined on input events. Consider any two subsequent input events x_t, x_k.
Recursively define V(x_t) = r_t + V(x_k), where V(x_k) = r_k if x_k is the last input event. Now search
for weights that maximize the V of all input events, by causing appropriate output events or actions.
Due to the Markov assumption, an FNN suffices to implement the policy that maps input to out-
put events. Relevant CAPs are not deeper than this FNN. V itself is often modeled by a separate
FNN (also yielding typically short CAPs) learning to approximate V(x_t) only from local information
r_t, V(x_k).
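
In code, the deterministic recursion is a single backward sweep over an episode's rewards
(illustrative numbers):

    def values(rewards):
        # V(x_t) = r_t + V(x_k), with V(x_k) = r_k at the episode's end:
        # computed in one backward pass (the Dynamic Programming trick).
        V = [0.0] * len(rewards)
        V[-1] = rewards[-1]
        for t in range(len(rewards) - 2, -1, -1):
            V[t] = rewards[t] + V[t + 1]
        return V

    print(values([0.0, 0.0, 1.0, -0.5]))  # [0.5, 0.5, 0.5, -0.5]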
Many variants of traditional RL exist (e.g., ???????????????????????????). Most are formu-
lated in a probabilistic framework, and evaluate pairs of input and output (action) events (instead of
input events only). To facilitate certain mathematical derivations, some discount delayed rewards, but
such distortions of the original RL problem are problematic.
Perhaps the most well-known RL NN is the world-class RL backgammon player (?), which
achieved the level of human world champions by playing against itself. Its nonlinear, rather shal-
low FNN maps a large but finite number of discrete board states to values. More recently, a rather
deep GPU-CNN was used in a traditional RL framework to play several Atari 2600 computer games
directly from 84x84 pixel 60 Hz video input (?), using experience replay (?), extending previous
work on Neural Fitted Q-Learning (NFQ) (?). Even better results are achieved by using (slow) Monte
Carlo tree planning to train comparatively fast deep NNs (?). Compare RBM-based RL (?) with
high-dimensional inputs (?), earlier RL Atari players (?), and an earlier, raw video-based RL NN for
computer games (?) trained by Indirect Policy Search (Sec. 6.7).

6.3 Deep RL RNNs for Partially Observable MDPs (POMDPs)


The Markov assumption (Sec. 6.2) is often unrealistic. We cannot directly perceive what is behind our
back, let alone the current state of the entire universe. However, memories of previous events can help
to deal with partially observable Markov decision problems (POMDPs) (e.g., ?????????????????).
A naive way of implementing memories without leaving the MDP framework (Sec. 6.2) would be
to simply consider a possibly huge state space, namely, the set of all possible observation histories
and their prefixes. A more realistic way is to use function approximators such as RNNs that produce
compact state features as a function of the entire history seen so far. Generally speaking, POMDP RL
often uses DL RNNs to learn which events to memorize and which to ignore. Three basic alternatives
are:

1. Use an RNN as a value function mapping arbitrary event histories to values (e.g., ????). For
example, deep LSTM RNNs were used in this way for RL robots (?).
2. Use an RNN controller in conjunction with a second RNN as predictive world model, to obtain
a combined RNN with deep CAPs—see Sec. 6.1.

3. Use an RNN for RL by Direct Search (Sec. 6.6) or Indirect Search (Sec. 6.7) in weight space.

In general, however, POMDPs may imply greatly increased CAP depth.

6.4 RL Facilitated by Deep UL in FNNs and RNNs


RL machines may profit from UL for input preprocessing (e.g., ?). In particular, an UL NN can learn
to compactly encode environmental inputs such as images or videos, e.g., Sec. 5.7, 5.10, 5.15. The
compact codes (instead of the high-dimensional raw data) can be fed into an RL machine, whose
job thus may become much easier (??), just like SL may profit from UL, e.g., Sec. 5.7, 5.10, 5.15.
For example, NFQ (?) was applied to real-world control tasks (??) where purely visual inputs were
compactly encoded by deep autoencoders (Sec. 5.7, 5.15). RL combined with UL based on Slow
Feature Analysis (??) enabled a real humanoid robot to learn skills from raw high-dimensional video
streams (?). To deal with POMDPs (Sec. 6.3) involving high-dimensional inputs, RBM-based RL
was used (?), and a RAAM (?) (Sec. 5.7) was employed as a deep unsupervised sequence encoder for
RL (?). Certain types of RL and UL also were combined in biologically plausible RNNs with spiking
neurons (Sec. 5.26) (e.g., ???).

6.5 Deep Hierarchical RL (HRL) and Subgoal Learning with FNNs and RNNs
Multiple learnable levels of abstraction (?????) seem as important for RL as for SL. Work on NN-
based Hierarchical RL (HRL) has been published since the early 1990s. In particular, gradient-based
subgoal discovery with FNNs or RNNs decomposes RL tasks into subtasks for RL submodules (??).
Numerous alternative HRL techniques have been proposed (e.g., ????????????????). While HRL
frameworks such as Feudal RL (?) and options (???) do not directly address the problem of automatic
subgoal discovery, HQ-Learning (?) automatically decomposes POMDPs (Sec. 6.3) into sequences of
simpler subtasks that can be solved by memoryless policies learnable by reactive sub-agents. Recent
HRL organizes potentially deep NN-based RL sub-modules into self-organizing, 2-dimensional motor
control maps (?) inspired by neurophysiological findings (?).

6.6 Deep RL by Direct NN Search / Policy Gradients / Evolution


Not quite as universal as the methods of Sec. 6.8, yet both practical and more general than most
traditional RL algorithms (Sec. 6.2), are methods for Direct Policy Search (DS). Without a need for
value functions or Markovian assumptions (Sec. 6.2, 6.3), the weights of an FNN or RNN are directly
evaluated on the given RL problem. The results of successive trials inform further search for better
weights. Unlike with RL supported by BP (Sec. 5.5, 6.3, 6.1), CAP depth (Sec. 3, 5.9) is not a crucial
issue. DS may solve the credit assignment problem without backtracking through deep causal chains
of modifiable parameters—it neither cares for their existence, nor tries to exploit them.
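
A minimal DS sketch: stochastic hill-climbing directly in weight space, judging each candidate weight
vector only by its episode return (the rollout function below is a hypothetical placeholder for an
actual RL environment):

    import numpy as np

    def direct_search(episode_return, n_weights, trials=100, sigma=0.1, seed=0):
        # Perturb the current weight vector; keep a mutation iff the full
        # episode return improves. No gradients, no deep CAPs involved.
        rng = np.random.default_rng(seed)
        w = rng.normal(size=n_weights)
        best = episode_return(w)
        for _ in range(trials):
            candidate = w + sigma * rng.normal(size=n_weights)
            r = episode_return(candidate)
            if r > best:
                w, best = candidate, r
        return w, best

    # Toy stand-in for an RL episode: return is higher near the zero vector.
    w, best = direct_search(lambda w: -float(np.sum(w ** 2)), n_weights=5)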
An important class of DS methods for NNs are Policy Gradient methods (??????????????????).
Gradients of the total reward with respect to policies (NN weights) are estimated (and then exploited)
through repeated NN evaluations.
RL NNs can also be evolved through Evolutionary Algorithms (EAs) (?????) in a series of tri-
als. Here several policies are represented by a population of NNs improved through mutations and/or
repeated recombinations of the population’s fittest individuals (e.g., ?????). Compare Genetic Pro-
gramming (GP) (?) (see also ?) which can be used to evolve computer programs of variable size (??),
and Cartesian GP (??) for evolving graph-like programs, including NNs (?) and their topology (?).
Related methods include probability distribution-based EAs (????), Covariance Matrix Adaptation
Evolution Strategies (CMA-ES) (????), and NeuroEvolution of Augmenting Topologies (NEAT) (?).
Hybrid methods combine traditional NN-based RL (Sec. 6.2) and EAs (e.g., ?).
Since RNNs are general computers, RNN evolution is like GP in the sense that it can evolve
general programs. Unlike sequential programs learned by traditional GP, however, RNNs can mix
sequential and parallel information processing in a natural and efficient way, as already mentioned
in Sec. 1. Many RNN evolvers have been proposed (e.g., ????????????). One particularly effec-
tive family of methods coevolves neurons, combining them into networks, and selecting those neu-
rons for reproduction that participated in the best-performing networks (???). This can help to solve
deep POMDPs (?). Co-Synaptic Neuro-Evolution (CoSyNE) does something similar on the level of
synapses or weights (?); benefits of this were shown on difficult nonlinear POMDP benchmarks.
Natural Evolution Strategies (NES) (????) link policy gradient methods and evolutionary ap-
proaches through the concept of Natural Gradients (?). RNN evolution may also help to improve SL
for deep RNNs through Evolino (?) (Sec. 5.9).

6.7 Deep RL by Indirect Policy Search / Compressed NN Search


Some DS methods (Sec. 6.6) can evolve NNs with hundreds or thousands of weights, but not mil-
lions. How to search for large and deep NNs? Most SL and RL methods mentioned so far somehow
search the space of weights wi . Some profit from a reduction of the search space through shared
wi that get reused over and over again, e.g., in CNNs (Sec. 5.4, 5.8, 5.16, 5.21), or in RNNs for SL
(Sec. 5.5, 5.13, 5.17) and RL (Sec. 6.1, 6.3, 6.6).
It may be possible, however, to exploit additional regularities/compressibilities in the space of
solutions, through indirect search in weight space. Instead of evolving large NNs directly (Sec. 6.6),
one can sometimes greatly reduce the search space by evolving compact encodings of NNs, e.g.,
through Lindenmayer Systems (??), graph rewriting (?), Cellular Encoding (?), HyperNEAT (????)
(extending NEAT; Sec. 6.6), and extensions thereof (e.g., ?). This helps to avoid overfitting (compare
Sec. 5.6.3, 5.24) and is closely related to the topics of regularisation and MDL (Sec. 4.4).
A general approach (?) for both SL and RL seeks to compactly encode weights of large NNs (?)
through programs written in a universal programming language (????). Often it is much more ef-
ficient to systematically search the space of such programs with a bias towards short and fast pro-
grams (???), instead of directly searching the huge space of possible NN weight matrices. A previous
universal language for encoding NNs was assembler-like (?). More recent work uses more practi-
cal languages based on coefficients of popular transforms (Fourier, wavelet, etc). In particular, RNN
weight matrices may be compressed like images, by encoding them through the coefficients of a dis-
crete cosine transform (DCT) (??). Compact DCT-based descriptions can be evolved through NES
or CoSyNE (Sec. 6.6). An RNN with over a million weights learned (without a teacher) to drive a
simulated car in the TORCS driving game (??), based on a high-dimensional video-like visual input
stream (?). The RNN learned both control and visual processing from scratch, without being aided by
UL. (Of course, UL might help to generate more compact image codes (Sec. 6.4, 4.2) to be fed into a
smaller RNN, to reduce the overall computational effort.)
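
A sketch of the DCT-based indirect encoding (SciPy; the sizes and the low-frequency layout are
illustrative, not the exact scheme of the cited work): a short genome of coefficients is expanded into
a full weight matrix by an inverse DCT, so evolution searches a space of, say, 16 numbers instead of
a million weights.

    import numpy as np
    from scipy.fft import idctn

    def decode_weights(genome, shape, k):
        # Place k*k low-frequency DCT coefficients (the compact genome that
        # evolution searches) in the corner of an otherwise zero spectrum,
        # then inverse-transform to obtain the full weight matrix.
        spectrum = np.zeros(shape)
        spectrum[:k, :k] = np.asarray(genome).reshape(k, k)
        return idctn(spectrum, norm='ortho')

    genome = np.random.default_rng(0).normal(size=16)    # 16 coefficients ...
    W = decode_weights(genome, shape=(1000, 1000), k=4)  # ... yield 10^6 weights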

6.8 Universal RL
General purpose learning algorithms may improve themselves in open-ended fashion and
environment-specific ways in a lifelong learning context (????). The most general type of RL is
constrained only by the fundamental limitations of computability identified by the founders of the-
oretical computer science (????). Remarkably, there exist blueprints of universal problem solvers
or universal RL machines for unlimited problem depth that are time-optimal in various theoretical
senses (????). In particular, the Gödel Machine can be implemented on general computers such as
RNNs and may improve any part of its software (including the learning algorithm itself) in a way
that is provably time-optimal in a certain sense (?). It can be initialized by an asymptotically optimal
meta-method (?) (also applicable to RNNs) which will solve any well-defined problem as quickly as
the unknown fastest way of solving it, save for an additive constant overhead that becomes negligible
as problem size grows. Note that most problems are large; only a few are small. AI and DL researchers
are still in business because many are interested in problems so small that it is worth trying to re-
duce the overhead through less general methods, including heuristics. Here I won’t further discuss
universal RL methods, which go beyond what is usually called DL.

7 Conclusion and Outlook


Deep Learning (DL) in Neural Networks (NNs) is relevant for Supervised Learning (SL) (Sec. 5),
Unsupervised Learning (UL) (Sec. 5), and Reinforcement Learning (RL) (Sec. 6). By alleviating
problems with deep Credit Assignment Paths (CAPs, Sec. 3, 5.9), UL (Sec. 5.6.4) can not only facil-
itate SL of sequences (Sec. 5.10) and stationary patterns (Sec. 5.7, 5.15), but also RL (Sec. 6.4, 4.2).
Dynamic Programming (DP, Sec. 4.1) is important for both deep SL (Sec. 5.5) and traditional RL with
deep NNs (Sec. 6.2). A search for solution-computing, perturbation-resistant (Sec. 5.6.3, 5.15, 5.24),
low-complexity NNs describable by few bits of information (Sec. 4.4) can reduce overfitting and im-
prove deep SL & UL (Sec. 5.6.3, 5.6.4) as well as RL (Sec. 6.7), also in the case of partially observable
environments (Sec. 6.3). Deep SL, UL, RL often create hierarchies of more and more abstract repre-
sentations of stationary data (Sec. 5.3, 5.7, 5.15), sequential data (Sec. 5.10), or RL policies (Sec. 6.5).
While UL can facilitate SL, pure SL for feedforward NNs (FNNs) (Sec. 5.5, 5.8, 5.16, 5.18) and re-
current NNs (RNNs) (Sec. 5.5, 5.13) won not only early contests (Sec. 5.12, 5.14) but also most
of the recent ones (Sec. 5.17–5.22). Especially DL in FNNs profited from GPU implementations
(Sec. 5.16–5.19). In particular, GPU-based (Sec. 5.19) Max-Pooling (Sec. 5.11) Convolutional NNs
(Sec. 5.4, 5.8, 5.16) won competitions not only in pattern recognition (Sec. 5.19–5.22) but also image
segmentation (Sec. 5.21) and object detection (Sec. 5.21, 5.22).
Unlike these systems, humans learn to actively perceive patterns by sequentially directing atten-
tion to relevant parts of the available data. Near future deep NNs will do so, too, extending previous
work since 1990 on NNs that learn selective attention through RL of (a) motor actions such as saccade
control (Sec. 6.1) and (b) internal actions controlling spotlights of attention within RNNs, thus closing
the general sensorimotor loop through both external and internal feedback (e.g., Sec. 2, 5.21, 6.6, 6.7).
Many future deep NNs will also take into account that it costs energy to activate neurons, and to
send signals between them. Brains seem to minimize such computational costs during problem solv-
ing in at least two ways: (1) At a given time, only a small fraction of all neurons is active because local
competition through winner-take-all mechanisms shuts down many neighbouring neurons, and only
winners can activate other neurons through outgoing connections (compare SLIM NNs; Sec. 5.24).
(2) Numerous neurons are sparsely connected in a compact 3D volume by many short-range and
few long-range connections (much like microchips in traditional supercomputers). Often neighbour-
ing neurons are allocated to solve a single task, thus reducing communication costs. Physics seems
to dictate that any efficient computational hardware will in the future also have to be brain-like in
keeping with these two constraints. The most successful current deep RNNs, however, are not. Un-
like certain spiking NNs (Sec. 5.26), they usually activate all units at least slightly, and tend to be
strongly connected, ignoring natural constraints of 3D hardware. It should be possible to improve
them by adopting (1) and (2), and by minimizing non-differentiable energy and communication costs
through direct search in program (weight) space (e.g., Sec. 6.6, 6.7). These more brain-like RNNs
will allocate neighboring RNN parts to related behaviors, and distant RNN parts to less related ones,
thus self-modularizing in a way more general than that of traditional self-organizing maps in FNNs
(Sec. 5.6.4). They will also implement Occam's razor (Sec. 4.4, 5.6.3) as a by-product of energy min-
imization, by finding simple (highly generalizing) problem solutions that require few active neurons
and few, mostly short connections.
The more distant future may belong to general purpose learning algorithms that improve them-
selves in provably optimal ways (Sec. 6.8), but these are not yet practical or commercially relevant.

8 Acknowledgments
Since 16 April 2014, drafts of this paper have undergone massive open online peer review through
public mailing lists including connectionists@cs.cmu.edu, ml-news@googlegroups.com, comp-neuro-
@neuroinf.org, genetic_programming@yahoogroups.com, rl-list@googlegroups.com, imageworld-
@diku.dk, and the Google+ machine learning forum. Thanks to numerous NN / DL experts for valuable
comments. Thanks to SNF, DFG, and the European Commission for partially funding my DL re-
search group in the past quarter-century. The contents of this paper may be used for educational and
non-commercial purposes, including articles for Wikipedia and similar sites.

References
