A Survey On Deep Network PDF
Content
1. Overview
4. Conclusion
1. Overview
Nowadays, deep networks have become a hot topic in a variety of areas such as image
natural language understanding [3. Collobert et al., 2011], etc. Being one of the
fastest-growing and most exciting fields in research as well as in industry, it is worth that we
This paper does not present the complete history of deep networks, as many
other papers have done. Instead, it emphasizes the application of deep
networks in industry; before the application part, a brief introduction to four
The rest of this paper is organized as follows: first we give a brief
introduction to DBN, CNN and RNN, then region-based CNN (R-CNN) and Fast R-CNN
considerable domain expertise to design a feature extractor that transformed the raw
data (such as the pixel values of an image) into a suitable internal representation or
feature vector from which the learning subsystem, often a classifier, could detect or
classify patterns in the input. However, manually extracted features from raw data
sometimes turn out to be limited and less efficient in representing the characteristics
of the input data, and thus compromise the subsequent detection or classification process.
with raw data and to automatically discover the feature representations needed for
machine learning and deep learning. The key aspect of deep learning is that these
layers of features are not designed by human engineers: they are learned from data
A deep learning network normally has more than three feature layers, while some
deep networks build up 20 feature layers or even more. Higher layers of
representation amplify aspects of the input that are important for discrimination and
suppress irrelevant variations. An image, for example, comes in the form of an array of
pixel values, and the learned features in the first layer of representation typically
the image. The second layer typically detects motifs by spotting particular
arrangements of edges, regardless of small variations in the edge positions. The third
layer may assemble motifs into larger combinations that correspond to parts of
familiar objects, and subsequent layers would detect objects as combinations of these
parts.
Three typical deep networks, namely DBN, CNN and RNN, will be introduced in the
used in the training process of most deep networks to adjust internal parameters. Details
about this technique will not be covered since it is so fundamental that you can find
model consisting of a layer of visible units and multiple layers of hidden units, where
each layer encodes correlations in the units in the layer below. DBNs and related
unsupervised learning algorithms such as autoencoders and sparse coding have been
used to learn higher-level feature representations from unlabeled data, suitable for
Restricted Boltzmann Machines (RBMs). Each sub-network's hidden layer serves as the
visible layer for the next. This also leads to a fast, layer-by-layer unsupervised training
algorithm is applied to each sub-network in turn, starting from the "lowest" pair of
As described above, an RBM is a component of a DBN. An RBM has two layers of neurons.
One is the visible layer, which consists of visible units; training data are fed to this
layer as input. The other is the hidden layer, which consists of hidden units. Hidden
Note that all connections are between the visible and the hidden layer. There is no
connection inside each layer. The advantage is that, given the visible units, the
units in the hidden layer are mutually independent, and vice versa. Then we have the
following equations:
$$ P(h \mid v) = \prod_{j=1}^{N} P(h_j \mid v) $$

$$ P(v \mid h) = \prod_{i=1}^{M} P(v_i \mid h) $$
Given a matrix of weights W = (w_{i,j}) (size m×n) associated with the connection
between hidden unit h_j and visible unit v_i, as well as bias weights (offsets) for the
visible units and for the hidden units, the energy is defined as:
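For reference, the standard RBM energy in this notation can be written as below. The bias symbols a_i (visible units) and b_j (hidden units) are names assumed here for illustration, since the text above does not fix them.

```latex
E(v, h) = -\sum_{i} a_i v_i \;-\; \sum_{j} b_j h_j \;-\; \sum_{i} \sum_{j} v_i \, w_{i,j} \, h_j
```

Because the last term contains no product of two visible units or two hidden units, the conditional distributions factorize over units, which is consistent with the independence property stated earlier.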
And probability distributions over hidden and/or visible vectors are defined in terms of
to 1.
RBMs are trained to maximize the product of probabilities assigned to some training
set (a matrix, each row of which is treated as a visible vector v),
Hinton proposed a fast training algorithm called Contrastive Divergence (CD), which
has proved very effective in practice. The CD procedure can be summarized as
follows:
1. Take a training sample v, compute the probabilities of the hidden units and
sample a hidden activation vector h from this probability distribution
2. Compute the outer product of v and h and call this the positive gradient
3. From h, sample a reconstruction v' of the visible units, then resample the
hidden activations h' from this (Gibbs sampling step)
4. Compute the outer product of v' and h' and call this the negative gradient
5. Let the weight update be the positive gradient minus the negative
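The five steps above can be sketched in plain Python as a single CD-1 update. This is a minimal illustration under stated assumptions, not a reference implementation: units are assumed binary, the conditionals are assumed to be logistic (sigmoid) functions, the update is scaled by a learning rate `lr`, and the bias vectors `a` and `b` are kept fixed. All function and parameter names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_bernoulli(probs, rng):
    # Draw a binary vector, one independent coin flip per unit.
    return [1 if rng.random() < p else 0 for p in probs]


def hidden_probs(v, W, b):
    # Assumed logistic conditional: P(h_j = 1 | v) = sigmoid(b_j + sum_i v_i * W[i][j])
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]


def visible_probs(h, W, a):
    # Assumed logistic conditional: P(v_i = 1 | h) = sigmoid(a_i + sum_j W[i][j] * h_j)
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]


def outer(v, h):
    return [[vi * hj for hj in h] for vi in v]


def cd1_update(v, W, a, b, lr=0.1, rng=random):
    """Apply one CD-1 weight update to W in place; biases are left unchanged."""
    # Step 1: sample hidden activations from P(h | v).
    h = sample_bernoulli(hidden_probs(v, W, b), rng)
    # Step 2: positive gradient.
    positive = outer(v, h)
    # Step 3: one Gibbs step, reconstruct v' and resample h'.
    v_recon = sample_bernoulli(visible_probs(h, W, a), rng)
    h_recon = sample_bernoulli(hidden_probs(v_recon, W, b), rng)
    # Step 4: negative gradient.
    negative = outer(v_recon, h_recon)
    # Step 5: positive minus negative, scaled by the (assumed) learning rate.
    for i in range(len(v)):
        for j in range(len(h)):
            W[i][j] += lr * (positive[i][j] - negative[i][j])
    return v_recon, h_recon
```

In practice the update would be averaged over a mini-batch and the biases would be updated analogously; both are omitted here to keep the correspondence with the numbered steps visible.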
more popular in the 1990s partly because deep supervised neural networks were generally
found too difficult to train. However, there is one notable exception: the convolutional
neural network (CNN). It achieved many practical successes during the period when
neural networks were out of favour, and it has recently been widely adopted or even
designed and trained convolutional networks using the error gradient, obtaining state-of-the-art
performance [8. Yann LeCun et al., 1989][9. Yann LeCun et al., 1990][10. Yann
LeCun et al., 1998] on several pattern recognition tasks. The architecture for LeCun’s
convolutional layers and subsampling layers. At each location of each layer, there are
a number of different neurons, each with its set of input weights, associated with
neurons in a rectangular patch in the previous layer. The same set of weights, but a
different input rectangular patch, are associated with neurons at different locations.
The convolutional layer is the core component of a CNN. The layer's parameters consist
of a set of learnable filters (or kernels), which have a small receptive field, but extend
through the full depth of the input volume. During the forward pass, each filter is convolved
across the width and height of the input volume, computing the dot product between the entries
of the filter and the input and producing a 2-dimensional activation map of that filter. As a
result, the network learns filters that activate when they see some specific type of feature
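The forward pass of a single filter, as described above, can be sketched as follows. This is a minimal pure-Python illustration that assumes stride 1 and no padding; the function name and the nested-list layout `volume[d][y][x]` are choices made here for the example.

```python
def conv2d_single_filter(volume, kernel):
    """Slide one filter over an input volume and return its 2-D activation map.

    volume[d][y][x] is the input (depth x height x width);
    kernel[d][ky][kx] spans the full depth of the input.
    """
    depth = len(volume)
    height, width = len(volume[0]), len(volume[0][0])
    kh, kw = len(kernel[0]), len(kernel[0][0])
    assert len(kernel) == depth, "the filter must extend through the full input depth"
    activation_map = []
    for y in range(height - kh + 1):
        row = []
        for x in range(width - kw + 1):
            # Dot product between the filter entries and the input patch at (y, x).
            acc = 0.0
            for d in range(depth):
                for ky in range(kh):
                    for kx in range(kw):
                        acc += kernel[d][ky][kx] * volume[d][y + ky][x + kx]
            row.append(acc)
        activation_map.append(row)
    return activation_map


# A depth-1 image with a vertical edge, and a filter that responds to that edge.
image = [[[0, 0, 1],
          [0, 0, 1],
          [0, 0, 1]]]
edge_filter = [[[-1, 1]]]
response = conv2d_single_filter(image, edge_filter)
```

The same filter weights are reused at every location, so the activation map is large only where the input patch matches the pattern the filter encodes.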
sampling. Pooling divides the input layer into a set of non-overlapping rectangles and
progressively reduce the spatial size of the representation to reduce the amount of
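A pooling layer over non-overlapping rectangles can be sketched as below. The sketch assumes the maximum is taken in each rectangle (max pooling) and that the rectangles are square with side `size`; both are assumptions made for this example.

```python
def max_pool(feature_map, size=2):
    """Split a 2-D map into non-overlapping size x size blocks and keep each block's maximum."""
    height, width = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[y + dy][x + dx] for dy in range(size) for dx in range(size))
            for x in range(0, width - size + 1, size)
        ]
        for y in range(0, height - size + 1, size)
    ]


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = max_pool(feature_map, size=2)
```

With `size=2` a 4x4 map shrinks to 2x2, which is the reduction in spatial size that the text refers to.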
Although CNNs had seen many successes in a variety of applications, they were largely
the ImageNet competition in 2012, which was a great breakthrough for the rehabilitation of
CNN. When deep convolutional networks were applied to a data set of about a million
images from the web that contained 1,000 different classes, they achieved spectacular
results, almost halving the error rates of the best competing approaches[1. Krizhevsky et
al.,2012]. This success came from the efficient use of GPUs, ReLUs, a new regularization
technique called dropout, and techniques to generate more training examples by deforming
the existing ones. This success has brought about a revolution in computer vision: CNNs
are now the dominant approach for almost all recognition tasks.
A recent stunning demonstration combines CNN and RNN (Recurrent Neural Network,
which will be illustrated in the next chapter) for the generation of image captions. Moreover,
the convolutional structure has been imported into RBMs [11. G. Desjardins et al.,2008]
and DBNs [12. H. Lee et al.,2009]. An important innovation in [12] is the design of a
generative version of the pooling / subsampling units, which worked beautifully in the
experiments reported, yielding state-of-the-art results not only on MNIST digits but also on
often better to use a Recurrent Neural Network (RNN). RNNs process an input sequence
one element at a time, maintaining in their hidden units a ‘state vector’ that implicitly
contains information about the history of all the past elements of the sequence. We
consider the outputs of the hidden units at different discrete time steps as if they were
If we unfold an RNN as Figure 4 shows, it is easy to see that it acts as a very deep
feedforward network in which all the layers share the same weights. As we regard the
outputs of the hidden units at earlier time steps as the input of an identical neural
network at the present time, it becomes clear how we can apply back propagation
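The unfolded view can be made concrete with a short forward pass in which every time step applies the same weights to the current input and the previous hidden state. This is a minimal sketch assuming a tanh hidden layer; the names `W_xh`, `W_hh`, `b_h` and `h0` are hypothetical.

```python
import math


def rnn_forward(inputs, W_xh, W_hh, b_h, h0):
    """Run a simple tanh RNN over a sequence and return the hidden state at every step."""
    h = list(h0)
    states = []
    for x in inputs:
        # One iteration is one 'layer' of the unfolded network; the weights never change.
        h = [
            math.tanh(
                b_h[j]
                + sum(W_xh[j][i] * x[i] for i in range(len(x)))
                + sum(W_hh[j][k] * h[k] for k in range(len(h)))
            )
            for j in range(len(b_h))
        ]
        states.append(h)
    return states


# One input unit, one hidden unit: the second state still reflects the first input.
states = rnn_forward([[1.0], [0.0]], W_xh=[[1.0]], W_hh=[[1.0]], b_h=[0.0], h0=[0.0])
```

Here the second input is zero, yet the second hidden state is non-zero because it is computed from the first one. That carried-over value is the state vector holding information about the history of the sequence.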
RNNs have been found to be very good at predicting the next character in a text or
the next word in a sequence, but they can also be used for more complex tasks. For
example, after reading an English sentence one word at a time, an English ‘encoder’
network can be trained so that the final ‘state vector’ of its hidden units is a good
the theme of the sentence). This thought vector can then be used as the initial hidden
state of (or as extra input to) a jointly trained French ‘decoder’ network, which outputs
a probability distribution for the first word of the French translation. If a particular first
word is chosen from this distribution and provided as input to the decoder network it
will then output a probability distribution for the second word of the translation and
one can learn to ‘translate’ the meaning of an image into an English sentence which
seems more amazing. The encoder here is a CNN that converts the pixels into an
activity vector in its last hidden layer. The decoder is an RNN similar to the ones used
for machine translation and neural language modelling. There has been a surge of
empirical evidence shows that it is difficult to learn to store information for very
long[16. Bengio et al.,1994]. To solve this problem, one idea is to augment the network
with an explicit memory. The first proposal of this kind is the long short-term memory
(LSTM) networks that use special hidden units, the natural behaviour of which is to
remember inputs for a long time [17. Hochreiter et al., 1997]. The basic LSTM idea is
very simple. Some of the units are called Constant Error Carousels (CECs). Each CEC
uses an activation function f, the identity function, and has a connection to itself with
fixed weight of 1.0. Due to f’s constant derivative of 1.0, errors backpropagated
through a CEC cannot vanish or explode. CECs are the main reason why LSTM nets can
learn to discover the importance of (and memorize) events that happened thousands
of discrete time steps ago, while previous RNNs already failed in case of minimal time
lags of 10 steps.
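The CEC argument can be checked with a little arithmetic. The sketch below takes a simplified single-unit view, assuming the same derivative at every step: an error flowing back through one self-connected unit is multiplied at each step by the self-connection weight times the activation derivative. The function name and the comparison values are illustrative assumptions.

```python
def backprop_scale(steps, weight, derivative):
    """Scaling of an error signal after `steps` time steps through one self-connected unit."""
    return (weight * derivative) ** steps


# CEC: identity activation (derivative 1.0) and self-connection weight 1.0.
cec_scale = backprop_scale(1000, weight=1.0, derivative=1.0)

# A logistic unit at its steepest point has derivative 0.25, so the error shrinks fast.
vanishing_scale = backprop_scale(10, weight=1.0, derivative=0.25)

# A large self-weight pushes the per-step factor above 1, so the error blows up instead.
exploding_scale = backprop_scale(1000, weight=4.5, derivative=0.25)
```

Only a per-step factor of exactly 1 keeps the error unchanged over arbitrarily many steps, which is what the identity activation and the fixed weight of 1.0 guarantee.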
2.5 Region-based Deep Network and F-CNN
Classic segmentation has long been exceedingly difficult: mistakes are
in how we think about segmentation occurred around 2005. The idea is simple:
generate multiple candidate segmentations, and while your kitty may be disfigured in
many, hopefully at least one of the segmentations will contain her whole and
unharmed body. This was a leap in thinking because it shifts the focus to generating a
detection more feasible as well as more exciting. Region-based CNN (R-CNN), or Regions
with CNN features, was proposed by Ross Girshick et al. in 2014 [19. Ross Girshick et al., 2014]. This paper
tries to understand whether the SuperVision CNN can be made to work as an object detector,
in other words, to bridge the gap between image classification and object detection.
It turns out to be the first paper to show that a CNN can lead to dramatically higher
The system first takes an input image and extracts around 2000 bottom-up region
proposals. It then computes features for each proposal using a large convolutional
neural network (CNN) and classifies each region using class-specific linear
SVMs.
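The three stages can be laid out as a short pipeline skeleton. This is only a structural sketch: `propose_regions` and `extract_features` are hypothetical callables standing in for the region proposal method and the CNN, and `svms` is a hypothetical mapping from class label to a (weights, bias) pair. Any post-processing after the per-class scoring is left out.

```python
def rcnn_detect(image, propose_regions, extract_features, svms, max_proposals=2000):
    """Score every region proposal with one linear SVM per class.

    propose_regions(image)      -> list of boxes (hypothetical stand-in)
    extract_features(image, b)  -> feature vector for box b (hypothetical stand-in for the CNN)
    svms                        -> {label: (weights, bias)}
    """
    detections = []
    for box in propose_regions(image)[:max_proposals]:
        features = extract_features(image, box)
        for label, (weights, bias) in svms.items():
            # Linear SVM decision value: w . x + b; positive means "this class".
            score = sum(w * f for w, f in zip(weights, features)) + bias
            if score > 0:
                detections.append((box, label, score))
    return detections


# Toy stand-ins so the skeleton can be exercised end to end.
toy_proposals = lambda image: [(0, 0, 1, 1), (1, 1, 2, 2)]
toy_features = lambda image, box: [float(box[0]), 1.0]
toy_svms = {"cat": ([1.0, -0.5], 0.0)}
detections = rcnn_detect(None, toy_proposals, toy_features, toy_svms)
```

Note that the feature extractor runs once per proposal, so with around 2000 proposals the CNN is evaluated around 2000 times per image.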
R-CNN proved to be highly effective in object detection; however, it is still far from
practical application. The main reason is its computational cost: processing one image
generally takes more than 10 seconds. That is why the author of R-CNN came up with
F-CNN, i.e. Fast R-CNN [21. Ross Girshick, 2015], a year
later. Fast R-CNN builds on previous work to efficiently classify object proposals using
several innovations to improve training and testing speed while also increasing
detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-
CNN, 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Fast
R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-
entire image and a set of object proposals. The network first processes the whole
image with several convolutional (conv) and max pooling layers to produce a conv
feature map. Then, for each object proposal a region of interest (RoI) pooling layer
extracts a fixed-length feature vector from the feature map. Each feature vector is fed
into a sequence of fully connected (fc) layers that finally branch into two sibling output
layers: one that produces softmax probability estimates over K object classes plus a
catch-all “background” class and another layer that outputs four real-valued numbers
for each of the K object classes. Each set of 4 values encodes refined bounding-box
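The key property of the RoI pooling layer, a fixed-length output for proposals of any size, can be illustrated with a single-channel sketch. The bin boundaries below (floor for the start, ceiling for the end) are one reasonable choice assumed for this example, and the function name and the `(x0, y0, x1, y1)` box convention in feature-map cells are hypothetical.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool the region roi = (x0, y0, x1, y1) of a 2-D map into out_h * out_w values.

    x1 and y1 are exclusive; the region must be at least one cell wide and high.
    """
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        y_start = y0 + (i * roi_h) // out_h
        y_end = max(y_start + 1, y0 + ((i + 1) * roi_h + out_h - 1) // out_h)
        for j in range(out_w):
            x_start = x0 + (j * roi_w) // out_w
            x_end = max(x_start + 1, x0 + ((j + 1) * roi_w + out_w - 1) // out_w)
            pooled.append(max(
                feature_map[y][x]
                for y in range(y_start, y_end)
                for x in range(x_start, x_end)
            ))
    return pooled


conv_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
whole = roi_max_pool(conv_map, (0, 0, 4, 4))   # a 4x4 region
small = roi_max_pool(conv_map, (1, 1, 4, 3))   # a 3x2 region
```

Both regions produce four values even though their sizes differ, which is what allows every proposal to be fed into the same fully connected layers.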
Early this year (Jan 2016) one of my colleagues visited Silicon Valley and found that
F-CNN had become a hot topic in deep learning. According to the Fast R-CNN paper, at runtime
the detection network processes each image in only 0.3s while still achieving very high
accuracy. It is reasonable to believe that F-CNN will find its broad
others. Face recognition appears to offer several advantages over other biometric
methods, a few of which are outlined here: face recognition can be done passively
without any explicit action or participation on the part of the user since face images
can be acquired from a distance by a camera. Furthermore, facial images can be easily
obtained with a couple of inexpensive fixed cameras, whereas iris and retina
identification require expensive equipment and are much too sensitive to any body
motion. Finally, technologies that require multiple individuals to use the same
equipment to capture their biological characteristics potentially expose the user to the
transmission of germs and impurities from other users. However, face recognition is
totally non-intrusive and does not carry any such health risks.
The conventional method for face recognition, before the implementation of neural
local image descriptors such as SIFT, LBP, HOG [24. A. Albiol et al.,2008][25. D.G.
Lowe,2004]; then they aggregate such local descriptors into an overall face descriptor
by using a pooling mechanism, for example the Fisher Vector [26. K. Simonyan et
al.,2013]. Another famous method for face recognition which was proposed in the
In this paper, unsupervised learning on deep networks was performed to learn some
very high-level feature detectors for face recognition. The authors sampled a large number of
frames from YouTube, of which only a small portion include faces. These unlabelled
neuron can achieve 81.7% accuracy! After that, they applied a logistic classifier on top
of the deep network for supervised learning and found their results better than the
state-of-the-art, which shows that the unsupervised learning of deep network can
This paper is concerned mainly with deep architectures for face recognition. The
core characteristic of such methods is the use of a CNN feature extractor, a learnable
function obtained by composing several linear and non-linear operators. Several
let us first take a look at a typical face recognition system which is indicated as below:
The first thing a face recognition system does is detect a face, that is, distinguish
a human face from other objects such as a cat, a wall or a curtain. The face image is then aligned
so that subsequent stages receive a well-normalized input. The core component
is the feature extraction subsystem. In deep network methods, high-level features are
learned through supervised training. The extracted features are then used for the final
recognition task.
There are two primary tasks in face recognition: face verification (one-to-one
matching), which, when presented with a face image of an unknown individual along with a
claim of identity, ascertains whether the individual is who he/she claims to be; and
that image with a database of (possibly encoded) images of known individuals. Face
verification turns out to be more widely used, since there are always cases where a
new face image shows up that does not exist in the pre-encoded database.
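The difference between the two tasks can be sketched on top of already-extracted feature vectors. This is a minimal illustration that assumes faces are compared by Euclidean distance between descriptors and accepted below a threshold; the function names, the `gallery` dictionary and the threshold value are all hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def verify(probe, claimed_template, threshold):
    """One-to-one: is the probe close enough to the template of the claimed identity?"""
    return euclidean(probe, claimed_template) <= threshold


def identify(probe, gallery, threshold):
    """One-to-many: return the closest known identity, or None if nobody is close enough."""
    best_name, best_dist = None, float("inf")
    for name, template in gallery.items():
        dist = euclidean(probe, template)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= threshold else None


gallery = {"alice": [0.0, 0.0], "bob": [3.0, 4.0]}
accepted = verify([0.1, 0.0], gallery["alice"], threshold=0.5)
rejected = verify([3.0, 4.0], gallery["alice"], threshold=0.5)
match = identify([2.9, 4.0], gallery, threshold=0.5)
unknown = identify([10.0, 10.0], gallery, threshold=0.5)
```

Verification needs only one comparison per query, whereas identification scans the whole database and still has to handle faces that match none of the stored identities.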
As Figure 7 indicates, deep networks have been applied mainly to two
components: one is face alignment, and the other, more commonly, is feature
extraction.
al.,2011]. This problem is challenging when face images are taken with extreme poses,
approach for facial point detection with three levels of convolutional networks was
There are five facial points to be detected: left eye center (LE), right eye center (RE),
nose tip (N), left mouth corner (LM), and right mouth corner (RM). They cascade three
they employ three deep convolutional networks, F1, EN1, and NM1, whose input
regions cover the whole face (F1), eyes and nose (EN1), nose and mouth (NM1). Each
network simultaneously predicts multiple facial points. For each facial point, the
predictions of multiple networks are averaged to reduce the variance. Networks at the
second and third levels take local patches centered at the predicted positions of facial
points from previous levels as input and are only allowed to make small changes to
previous predictions. The sizes of patches and search ranges keep reducing along the
cascade. Predictions at the last two levels are strictly restricted because local
point at the last two levels is given by the average of the two networks with different
patch sizes. While networks at the first level aim to estimate keypoint positions
robustly with few large errors, networks at the last two levels are designed to achieve
high accuracy.
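The two mechanisms described above, averaging the predictions of several networks and allowing later levels only small corrections, can be sketched as follows. Which points each first-level network outputs and the size of the allowed shift are assumptions made for this example, as are the function names.

```python
def average_predictions(predictions):
    """Average (x, y) predictions per facial point across several networks.

    predictions: list of dicts mapping a point name (e.g. 'LE', 'N') to (x, y).
    A network contributes only to the points it actually predicts.
    """
    sums = {}
    for prediction in predictions:
        for name, (x, y) in prediction.items():
            sx, sy, count = sums.get(name, (0.0, 0.0, 0))
            sums[name] = (sx + x, sy + y, count + 1)
    return {name: (sx / count, sy / count) for name, (sx, sy, count) in sums.items()}


def refine(point, proposed_shift, max_shift):
    """Apply a later-level correction, clamped so that it stays a small change."""
    def clamp(delta):
        return max(-max_shift, min(max_shift, delta))
    return (point[0] + clamp(proposed_shift[0]), point[1] + clamp(proposed_shift[1]))


# Hypothetical first-level outputs: F1 sees the whole face, EN1 eyes and nose, NM1 nose and mouth.
f1 = {"LE": (10.0, 10.0), "N": (20.0, 20.0)}
en1 = {"LE": (12.0, 10.0), "N": (22.0, 18.0)}
nm1 = {"N": (21.0, 22.0)}
level1 = average_predictions([f1, en1, nm1])
nose_refined = refine(level1["N"], proposed_shift=(5.0, -0.5), max_shift=1.0)
```

Clamping the shift keeps a later level from undoing a robust first-level estimate when its small local patch is ambiguous.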
Figure 9 illustrates the deep structure of F1, which contains four convolutional layers
followed by max pooling, and two fully connected layers. EN1 and NM1 take the same
deep structure, but with different sizes at each layer since the sizes of their input
regions are different. Note that the network structures for the last two levels are
shallower, since their tasks are low-level and their input is limited to small local regions
come up since 2012, and has seen significant improvement on recognition accuracy
compared to conventional methods. In this paper, a few of them which are considered
method uses a deep CNN trained to classify faces using a dataset of 4 million examples
spanning 4000 unique identities. It also uses a siamese network architecture, where
the same CNN is applied to pairs of faces to obtain descriptors that are then compared
using the Euclidean distance. The authors later extended this work in [49. Y. Taigman
et al.,2015], by increasing the size of the dataset by two orders of magnitude, including
There are two major innovations in DeepFace: (1) Careful effort was put into the face
alignment stage. A generic 3D shape model was used in the pre-processing phase to warp
the 2D-aligned crop to the image plane of the 3D shape. (2) Unlike a convolutional layer
where a constant set of filters is applied on a whole input feature map, the last three
layers are locally connected, which means every location in the feature map learns a
The DeepFace work was extended by the DeepID series of papers by Y. Sun et al. [50,
51, 52], each of which steadily increased the performance on LFW and YTF. In the first
paper[50. Y. Sun et al.,2014], a huge number of identities (nearly 10,000) on the output
11:
The recognition performance was further improved with DeepID2 [52. Y. Sun et
phase was added which focused on intra-personal cases. The reason for adding the
second training phase is that while it is important for effective feature representations
accuracy was achieved on LFW dataset, which was the highest accuracy before
In [53. Y. Sun et al., 2015], very deep networks were introduced and the face
verification accuracy was further improved to 99.53% on the LFW dataset. Compared to
DeepFace, DeepID does not use 3D face alignment, but a simpler 2D affine alignment.
Even higher accuracy might be achieved if a more powerful alignment method were used
in DeepID.
4. Conclusion
In this survey, several typical deep networks are introduced. Face recognition, a
deep network.
Deep networks have seen significant successes in a variety of areas, both in
and convolutional networks is still a big challenge that remains unsolved. A more serious
problem concerns the way deep networks perform recognition: the human brain
seems to act in a totally different way when it comes to learning and recognizing a new
object. Human and animal learning is largely unsupervised: we discover the structure
of the world by observing it, not by being told the name of every object.
References