
A Survey on Deep Network

Huayi Zhan 3/16/2016

Contents

1. Overview
2. Various Deep Networks
   2.1 Conventional Machine Learning and Deep Learning
   2.2 Deep Belief Network
   2.3 Convolutional Neural Network (CNN)
   2.4 Recurrent Neural Network
   2.5 Region-based Deep Network and F-CNN
3. Application of Deep Network in Face Recognition
   3.1 Face Recognition Overview
   3.2 Deep Network for Face Alignment
   3.3 Deep Network for Face Verification and Identification
4. Conclusion
1. Overview
Nowadays deep networks have become a hot topic in a variety of areas such as image recognition [1. Krizhevsky et al., 2012], speech recognition [2. Mikolov et al., 2011] and natural language understanding [3. Collobert et al., 2011]. As one of the fastest-growing and most exciting fields in both research and industry, it is worth taking a close look at the state of the art of this promising technology.

In this paper, the complete history of deep networks will not be presented, as many other papers have already done so; instead, emphasis is put on the application of deep networks in industry, preceded by a brief introduction to four commonly used deep networks.

The rest of this paper is organized as follows. First we give a brief introduction to the DBN, CNN and RNN; then region-based CNN (R-CNN) and Fast R-CNN are explored at an introductory level. In chapter 3 we go into some detail on the application of deep networks in face recognition.

2. Various Deep Networks

2.1 Conventional Machine Learning and Deep Learning


Conventional machine learning techniques required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector from which the learning subsystem, often a classifier, could detect or classify patterns in the input. However, features extracted manually from raw data sometimes turn out to be limited and inefficient at representing the characteristics of the input data, and thus compromise the subsequent detection or classification process.

Representation (feature) learning is a set of methods that allows a machine to be fed

with raw data and to automatically discover the feature representations needed for

detection or classification. This is exactly the main difference between conventional

machine learning and deep learning. The key aspect of deep learning is that these

layers of features are not designed by human engineers: they are learned from data

using a general-purpose learning procedure.

A deep learning network normally has more than three feature layers, and some deep networks stack 20 feature layers or even more. Higher layers of

representation amplify aspects of the input that are important for discrimination and

suppress irrelevant variations. An image, for example, comes in the form of an array of

pixel values, and the learned features in the first layer of representation typically

represent the presence or absence of edges at particular orientations and locations in

the image. The second layer typically detects motifs by spotting particular

arrangements of edges, regardless of small variations in the edge positions. The third

layer may assemble motifs into larger combinations that correspond to parts of

familiar objects, and subsequent layers would detect objects as combinations of these

parts.

Three typical deep networks, namely the DBN, CNN and RNN, will be introduced in the following chapter. There is a basic technique called backpropagation which is used in the training process of most deep networks to adjust their internal parameters. Details of this technique will not be covered here, since it is so fundamental that tutorials can be found in virtually any coursebook or on the internet.


2.2 Deep Belief Network
Deep Belief Network (DBN) [4. Geoffrey E. Hinton,2009] is a generative graphical

model consisting of a layer of visible units and multiple layers of hidden units, where

each layer encodes correlations in the units in the layer below. DBNs and related

unsupervised learning algorithms such as autoencoders and sparse coding have been

used to learn higher-level feature representations from unlabeled data, suitable for use in tasks such as classification.

A typical DBN is composed of several simple, unsupervised networks such as

Restricted Boltzmann Machines (RBMs). Each sub-network's hidden layer serves as the

visible layer for the next. This also leads to a fast, layer-by-layer unsupervised training procedure, in which the contrastive divergence algorithm [5. Geoffrey E. Hinton et al., 2006] is applied to each sub-network in turn, starting from the "lowest" pair of layers (the lowest visible layer being a training set).

Figure 1 DBN structure

As described above, the RBM is the building block of a DBN. An RBM has two layers of neurons. One is the visible layer, which consists of visible units; training data are fed to this layer as input. The other is the hidden layer, which consists of hidden units; hidden units act as feature detectors.

Figure 2 The structure of RBM

Note that all connections are between the visible and the hidden layer; there is no connection inside either layer. The advantage is that, given the visible units as input, the units in the hidden layer are mutually independent, and vice versa. We therefore have the following equations:
$$P(h \mid v) = \prod_{j=1}^{n} P(h_j \mid v), \qquad P(v \mid h) = \prod_{i=1}^{m} P(v_i \mid h)$$

Given a matrix of weights $W = (w_{i,j})$ (size $m \times n$) associated with the connections between hidden units $h_j$ and visible units $v_i$, as well as bias weights (offsets) $a_i$ for the visible units and $b_j$ for the hidden units, the energy is defined as:

$$E(v, h) = -\sum_{i=1}^{m} a_i v_i - \sum_{j=1}^{n} b_j h_j - \sum_{i=1}^{m} \sum_{j=1}^{n} v_i w_{i,j} h_j$$

And probability distributions over hidden and/or visible vectors are defined in terms of the energy function:

$$P(v, h) = \frac{1}{Z} e^{-E(v, h)}$$

where the partition function $Z$ is defined as the sum of $e^{-E(v,h)}$ over all possible configurations, to ensure the probability distribution sums to 1.

The individual activation probabilities are given by

$$P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_{i=1}^{m} w_{i,j} v_i\Big), \qquad P(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_{j=1}^{n} w_{i,j} h_j\Big)$$

where $\sigma$ denotes the logistic sigmoid.

RBMs are trained to maximize the product of probabilities assigned to some training set $V$ (a matrix, each row of which is treated as a visible vector $v$):

$$\arg\max_{W} \prod_{v \in V} P(v)$$

Hinton proposed a fast training algorithm called Contrastive Divergence (CD), which has proved very effective in practice. The CD procedure can be summarized as follows:

1. Take a training sample v, compute the probabilities of the hidden units and sample a hidden activation vector h from this probability distribution.
2. Compute the outer product of v and h and call this the positive gradient.
3. From h, sample a reconstruction v' of the visible units, then resample the hidden activations h' from this (Gibbs sampling step).
4. Compute the outer product of v' and h' and call this the negative gradient.
5. Let the weight update be the positive gradient minus the negative gradient, times some learning rate: $\Delta W = \epsilon\,(v h^{T} - v' h'^{T})$. A code sketch of one full update follows.
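To make the procedure concrete, below is a minimal NumPy sketch of one CD-1 update for a binary RBM; the function name, array shapes and learning-rate default are our own illustrative choices, not taken from the original papers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v, W, a, b, lr=0.1, rng=np.random.default_rng(0)):
    """One CD-1 step for a binary RBM.

    v : (m,) binary visible training vector
    W : (m, n) weights; a : (m,) visible biases; b : (n,) hidden biases
    """
    # Step 1: sample hidden activations h from P(h|v)
    p_h = sigmoid(b + v @ W)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    pos = np.outer(v, h)                     # Step 2: positive gradient
    # Step 3: reconstruct v', then recompute hidden probabilities (Gibbs step)
    p_v = sigmoid(a + W @ h)
    v_r = (rng.random(p_v.shape) < p_v).astype(float)
    p_h_r = sigmoid(b + v_r @ W)
    neg = np.outer(v_r, p_h_r)               # Step 4: negative gradient
    # Step 5: update by (positive - negative) gradient times learning rate
    W += lr * (pos - neg)
    a += lr * (v - v_r)
    b += lr * (h - p_h_r)
    return W, a, b
```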

2.3 Convolutional Neural Network (CNN)


In the 1990s, deep supervised neural networks were generally found too difficult to train, which is part of why unsupervised pre-training techniques such as the Deep Belief Network later attracted so much attention. There was, however, one notable exception: the convolutional neural network (CNN). It achieved many practical successes during the period when neural networks were out of favour, and it has recently been widely adopted, even becoming dominant, in the computer-vision community.

The idea of convolution was first proposed in Fukushima’s Neocognitron

[7.Fukushima,1980]. As he recognized, when neurons with the same parameters are

applied on patches of the previous layer at different locations, a form of translational

invariance is obtained. Later, LeCun and collaborators, following up on this idea,

designed and trained convolutional networks using the error gradient, obtaining state-

of-the-art performance [8.Yann Lecun et al.,1989][9. Yann Lecun et al.,1990] [10. Yann

Lecun et al.,1998] on several pattern recognition tasks. The architecture of LeCun's convolutional network is as follows:

Figure 3 Typical Convolutional Neural Network Architecture

LeCun's convolutional neural networks are organized in layers of two types: convolutional layers and subsampling layers. At each location of each layer, there are a number of different neurons, each with its own set of input weights, associated with neurons in a rectangular patch of the previous layer. The same set of weights, but a different input rectangular patch, is associated with neurons at different locations.

The convolutional layer is the core component of a CNN. The layer's parameters consist of a set of learnable filters (or kernels), which have a small receptive field but extend through the full depth of the input volume. During the forward pass, each filter is convolved across the width and height of the input volume, computing the dot product between the entries of the filter and the input and producing a 2-dimensional activation map of that filter. As a result, the network learns filters that activate when they see some specific type of feature at some spatial position in the input.

Another important concept in CNNs is pooling, which is a form of non-linear down-sampling. Pooling divides the input layer into a set of non-overlapping rectangles and performs down-sampling on each sub-region. The function of the pooling layer is to progressively reduce the spatial size of the representation, cutting the number of parameters and the amount of computation in the network, as well as providing a form of translational invariance. It is common to periodically insert a pooling layer between successive convolutional layers in a CNN architecture. The most common operation performed on each sub-region is the max function (max pooling).
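As a rough illustration of these two operations, here is a minimal NumPy sketch of a single-channel "valid" convolution and non-overlapping max pooling; the function names, filter and sizes are our own illustrative choices.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2-D convolution (really cross-correlation, as in most
    CNN libraries): slide the kernel over the image, taking dot products."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: reduce each size x size block to its max."""
    H2, W2 = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:H2 * size, :W2 * size].reshape(H2, size, W2, size).max(axis=(1, 3))

# A 3x3 vertical-edge filter applied to a random "image", then 2x2 max pooling.
image = np.random.rand(8, 8)
edge = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)
pooled = max_pool(conv2d_valid(image, edge))
```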

Although CNNs had seen many successes in a variety of applications, they were largely overlooked by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012, which was a great breakthrough in the rehabilitation of CNNs. When deep convolutional networks were applied to a data set of about a million images from the web that contained 1,000 different classes, they achieved spectacular results, almost halving the error rates of the best competing approaches [1. Krizhevsky et al., 2012]. This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout, and techniques to generate more training examples by deforming the existing ones. It has brought about a revolution in computer vision: CNNs are now the dominant approach for almost all recognition tasks.

A recent stunning demonstration combines a CNN and an RNN (Recurrent Neural Network, illustrated in the next chapter) for the generation of image captions. Moreover, the convolutional structure has been imported into RBMs [11. G. Desjardins et al., 2008] and DBNs [12. H. Lee et al., 2009]. An important innovation in [12] is the design of a generative version of the pooling/subsampling units, which worked beautifully in the reported experiments, yielding state-of-the-art results not only on MNIST digits but also on the Caltech-101 object classification benchmark.


2.4 Recurrent Neural Network
For tasks that involve sequential inputs, such as speech, language and video, it is often better to use a Recurrent Neural Network (RNN). RNNs process an input sequence one element at a time, maintaining in their hidden units a 'state vector' that implicitly contains information about the history of all the past elements of the sequence. We can consider the outputs of the hidden units at different discrete time steps as if they were the outputs of different neurons in a deep multilayer network.

Figure 4 Unfolding RNN

If we unfold an RNN as Figure 4 shows, it is easy to see that it acts as a very deep feedforward network in which all the layers share the same weights. Once we regard the outputs of the hidden units at earlier time steps as inputs to an identical neural network at the present time, it becomes clear how the backpropagation algorithm can be applied to train an RNN (backpropagation through time).
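The following minimal sketch of the forward pass makes that weight sharing explicit; the shapes and names are our own illustrative choices.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Forward pass of a simple (Elman-style) RNN over a sequence.

    xs : list of input vectors, one per time step.
    The same weight matrices are reused at every step; unfolded in time,
    this is a deep feedforward network whose layers share weights.
    """
    h = np.zeros(W_hh.shape[0])   # initial 'state vector'
    ys = []
    for x in xs:                  # one element of the sequence at a time
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # state summarizes the history
        ys.append(W_hy @ h + b_y)               # output at this time step
    return ys, h
```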

RNNs have been found to be very good at predicting the next character in a text or the next word in a sequence, but they can also be used for more complex tasks. For example, after reading an English sentence one word at a time, an English 'encoder' network can be trained so that the final 'state vector' of its hidden units is a good representation of the thought expressed by the sentence (a high-level representation of the theme of the sentence). This thought vector can then be used as the initial hidden state of (or as extra input to) a jointly trained French 'decoder' network, which outputs a probability distribution for the first word of the French translation. If a particular first word is chosen from this distribution and provided as input to the decoder network, it will then output a probability distribution for the second word of the translation, and so on until a full stop is chosen [13. Cho, K. et al., 2014][14. Sutskever et al., 2014]. That decoding loop is sketched below.
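Here `decoder_step` is a hypothetical stand-in for one step of a trained decoder RNN, and the greedy word choice is a simplification (the cited papers sample from the distribution or use beam search):

```python
def greedy_decode(decoder_step, thought_vector, bos_id, eos_id, max_len=50):
    """Emit a translation word by word, feeding each chosen word back in.

    decoder_step : hypothetical callable (state, word_id) -> (state, probs),
                   where probs is a distribution over the target vocabulary.
    """
    state, word, output = thought_vector, bos_id, []
    while len(output) < max_len:
        state, probs = decoder_step(state, word)
        word = int(probs.argmax())       # choose a word from the distribution
        if word == eos_id:               # stop once the full stop is chosen
            break
        output.append(word)
    return output
```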

Instead of translating the meaning of a French sentence into an English sentence, one can learn to 'translate' the meaning of an image into an English sentence, which is perhaps even more striking. The encoder here is a CNN that converts the pixels into an activity vector in its last hidden layer. The decoder is an RNN similar to the ones used for machine translation and neural language modelling. There has been a surge of interest in such systems recently [15. Xu, K. et al., 2015].

Although the main purpose of RNNs is to learn long-term dependencies, theoretical and empirical evidence shows that it is difficult for them to learn to store information for very long [16. Bengio et al., 1994]. To solve this problem, one idea is to augment the network with an explicit memory. The first proposal of this kind is the long short-term memory (LSTM) network, which uses special hidden units whose natural behaviour is to remember inputs for a long time [17. Hochreiter et al., 1997]. The basic LSTM idea is very simple. Some of the units are called Constant Error Carousels (CECs). Each CEC uses an activation function f, the identity function, and has a connection to itself with a fixed weight of 1.0. Because f has a constant derivative of 1.0, errors backpropagated through a CEC can neither vanish nor explode. CECs are the main reason why LSTM nets can learn to discover the importance of (and memorize) events that happened thousands of discrete time steps earlier, whereas previous RNNs already failed with minimal time lags of 10 steps.
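Below is a minimal sketch of one step of an LSTM cell, in which the cell state c plays the role of the CEC and is updated purely additively. Note that the original 1997 formulation had no forget gate (the self-connection was fixed at 1.0); the commonly used variant shown here reduces to the original when f = 1. All names and the parameter packing are our own choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W: (4n, d), U: (4n, n), b: (4n,) hold the stacked
    input/forget/output/candidate parameters; h and c have size n."""
    n = h.size
    z = W @ x + U @ h + b
    i = sigmoid(z[:n])            # input gate: what to write into the cell
    f = sigmoid(z[n:2 * n])       # forget gate: ~1.0 keeps the memory (CEC)
    o = sigmoid(z[2 * n:3 * n])   # output gate: what to expose
    g = np.tanh(z[3 * n:])        # candidate values
    c = f * c + i * g             # additive carousel update: gradients survive
    h = o * np.tanh(c)            # new hidden state
    return h, c
```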
2.5 Region-based Deep Network and F-CNN
Classic segmentation has long been exceedingly difficult: mistakes are irreversible, and incorrect segments harm subsequent processing. Thus classic segmentation rarely serves as a pre-processing step for detection. A major innovation in how we think about segmentation occurred around 2005. The idea is simple: generate multiple candidate segmentations, and while your kitty may be disfigured in many of them, hopefully at least one segmentation will contain her whole and unharmed body. This was a leap in thinking because it shifts the focus to generating a diversity of segmentations, as opposed to a single and perfect (but unachievable) segmentation. This kind of method is called object proposals or region proposals.

Applying the region-proposal method within a CNN was a breakthrough. It makes real-world detection more feasible as well as more exciting. Region-based CNN, or Regions with CNN features (R-CNN), was proposed by Ross Girshick in 2014 [19. Ross Girshick et al., 2014]. The paper tries to determine whether the SuperVision CNN can be made to work as an object detector, in other words, to bridge the gap between image classification and object detection. It turned out to be the first paper to show that a CNN can lead to dramatically higher object detection performance on PASCAL VOC compared to systems based on simpler HOG-like features. The architecture of R-CNN is as follows:


Figure 5 Overview of R-CNN

The system first takes an input image and extracts around 2,000 bottom-up region proposals; it then computes features for each proposal using a large convolutional neural network (CNN), and finally classifies each region using class-specific linear SVMs. The pipeline is sketched below.
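In the sketch, every callable (`propose_regions`, `crop_and_warp`, `cnn_features`, and the per-class SVMs with an sklearn-style `decision_function`) is a hypothetical stand-in for a stage of the pipeline, not code from the paper:

```python
def rcnn_detect(image, propose_regions, crop_and_warp, cnn_features, svms):
    """R-CNN pipeline sketch.

    propose_regions : image -> ~2000 candidate boxes (e.g. selective search)
    crop_and_warp   : (image, box) -> fixed-size image crop
    cnn_features    : crop -> fixed-length feature vector (one CNN pass)
    svms            : dict of class name -> trained linear SVM
    """
    detections = []
    for box in propose_regions(image):         # ~2000 bottom-up proposals
        feat = cnn_features(crop_and_warp(image, box))
        for cls, svm in svms.items():          # class-specific linear SVMs
            score = svm.decision_function([feat])[0]
            if score > 0:
                detections.append((box, cls, score))
    return detections                          # (non-max suppression omitted)
```

One forward CNN pass per proposal is exactly what makes this pipeline so slow, which motivates Fast R-CNN below.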

R-CNN proved to be highly effective for object detection; however, it was still far from practical deployment, mainly because it is so compute-intensive: the processing time for one image is generally more than 10 seconds. That is why the author of R-CNN came up with Fast R-CNN (F-CNN) [21. Ross Girshick, 2015] a year later. Fast R-CNN builds on the previous work to efficiently classify object proposals using deep convolutional networks. Compared to the previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test time, and achieves a higher mAP on PASCAL VOC 2012. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.

Figure 6 Architecture of F-CNN


Fig 6 illustrates the Fast R-CNN architecture. A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal, a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all "background" class, and another that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.
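RoI pooling is the key trick that lets one shared feature map serve proposals of arbitrary sizes. Here is a minimal single-image NumPy sketch, with the grid handling simplified (the real layer also routes gradients back through the argmax during training):

```python
import numpy as np

def roi_pool(feature_map, roi, out_h=7, out_w=7):
    """Max-pool an arbitrary region of a conv feature map into a fixed
    out_h x out_w grid, yielding a fixed-length feature vector per RoI.

    feature_map : (C, H, W) conv feature map
    roi         : (x0, y0, x1, y1) in feature-map coordinates
    """
    x0, y0, x1, y1 = roi
    C = feature_map.shape[0]
    out = np.empty((C, out_h, out_w))
    ys = np.linspace(y0, y1, out_h + 1).astype(int)   # grid row boundaries
    xs = np.linspace(x0, x1, out_w + 1).astype(int)   # grid column boundaries
    for i in range(out_h):
        for j in range(out_w):
            cell = feature_map[:, ys[i]:max(ys[i] + 1, ys[i + 1]),
                                  xs[j]:max(xs[j] + 1, xs[j + 1])]
            out[:, i, j] = cell.max(axis=(1, 2))      # max over each sub-window
    return out   # same shape regardless of the RoI's size
```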

Early this year (January 2016) one of my colleagues visited Silicon Valley and found that F-CNN has become a hot topic in deep learning. According to the paper, at runtime the detection network processes each image in only 0.3 s while also achieving very high accuracy. It is reasonable to believe that F-CNN will find broad adoption in industry in the near future.

3. Application of Deep Network in Face Recognition

3.1 Face Recognition Overview


Face recognition has been widely used in many areas such as general identity verification, surveillance, attendance systems and human-machine interfaces, among many others. Face recognition appears to offer several advantages over other biometric methods, a few of which are outlined here. Face recognition can be done passively, without any explicit action or participation on the part of the user, since face images can be acquired from a distance by a camera. Furthermore, facial images can easily be obtained with a couple of inexpensive fixed cameras, whereas iris and retina identification require expensive equipment and are much too sensitive to any body motion. Finally, technologies that require multiple individuals to use the same equipment to capture their biological characteristics potentially expose the users to the transmission of germs and impurities from other users; face recognition, by contrast, is totally non-intrusive and carries no such health risks.

The conventional approach to face recognition, before the advent of neural networks, started by extracting a representation of the face image using hand-crafted local image descriptors such as SIFT, LBP or HOG [24. A. Albiol et al., 2008][25. D.G. Lowe, 2004]; such local descriptors were then aggregated into an overall face descriptor using a pooling mechanism, for example the Fisher Vector [26. K. Simonyan et al., 2013]. Another famous method for face recognition, proposed in the early 1990s, is EigenFace [35. M. Turk, 1991].

There is an interesting study related to face recognition [36. Quoc V Le et al., 2012]. In this work, unsupervised learning on deep networks was performed to learn very high-level feature detectors for face recognition. The authors sampled an enormous number of pictures (frames) from YouTube, of which only a small portion contained faces. A specially designed deep architecture was trained on these unlabelled images. Surprisingly, the best neuron achieved 81.7% accuracy! After that, they applied a logistic classifier on top of the deep network for supervised learning and found their results to be better than the state of the art, which shows that unsupervised learning of deep networks can provide good high-level features that improve classification accuracy.

This paper is concerned mainly with deep architectures for face recognition. The core characteristic of such methods is the use of a CNN feature extractor, a learnable function obtained by composing several linear and non-linear operators. Several distinctive deep architectures will be introduced in the following sections.

Before we go into any detail on deep network implementations for face recognition, let us first take a look at a typical face recognition system, shown below:

Figure 7 A typical face recognition system

The first thing a face recognition system does is detect a face, that is, distinguish a human face from other objects such as a cat, a wall or a curtain. The face image is then aligned to make sure subsequent stages receive a well-regulated input. The core component is the feature extraction subsystem; in the deep network approach, high-level features are learned through supervised training. The extracted features are then used for the final recognition task.

There are two primary tasks in face recognition. Face verification (one-to-one matching): when presented with a face image of an unknown individual along with a claim of identity, ascertain whether the individual is who he or she claims to be. Face identification (one-to-many matching): given an image of an unknown individual, determine that person's identity by comparing (possibly after encoding) that image with a database of (possibly encoded) images of known individuals. Face verification turns out to see heavier use, since there are always cases in which a new face image shows up that does not exist in the pre-encoded database. The sketch below contrasts the two tasks.
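Here is a minimal sketch of both tasks on top of some feature extractor; `extract` (e.g. a trained deep network mapping an image to a vector) and the threshold value are hypothetical stand-ins:

```python
import numpy as np

def verify(extract, probe, enrolled, threshold=1.0):
    """One-to-one: accept the identity claim if the probe's features are
    close enough to those of the enrolled image of the claimed identity."""
    return np.linalg.norm(extract(probe) - extract(enrolled)) < threshold

def identify(extract, probe, gallery):
    """One-to-many: return the name whose pre-encoded gallery vector is
    nearest to the probe's features. gallery: dict name -> feature vector."""
    e = extract(probe)
    return min(gallery, key=lambda name: np.linalg.norm(gallery[name] - e))
```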
As Figure 7 indicates, deep networks have been applied mainly to two components: one is face alignment, and the other, more commonly, is feature extraction.

3.2 Deep Network for Face Alignment


Facial keypoint detection is critical for face recognition and analysis, and has been

studied extensively in recent years[44. B. Amberg et al.,2011][45. P. N. Belhumeur et

al.,2011]. The problem is challenging when face images are taken under extreme poses, lighting, expressions and occlusions. To address it, a cascaded regression

approach for facial point detection with three levels of convolutional networks was

proposed[47.Yi Sun et al.,2013].

Figure 8 Three-level cascaded convolutional network for face point detection

There are five facial points to be detected: left eye center (LE), right eye center (RE), nose tip (N), left mouth corner (LM), and right mouth corner (RM). Three levels of convolutional networks are cascaded to make coarse-to-fine predictions. At the first level, three deep convolutional networks are employed, F1, EN1, and NM1, whose input regions cover the whole face (F1), eyes and nose (EN1), and nose and mouth (NM1). Each network simultaneously predicts multiple facial points; for each facial point, the predictions of the multiple networks are averaged to reduce the variance. Networks at the second and third levels take as input local patches centered at the positions predicted by previous levels, and are only allowed to make small changes to the previous predictions. The sizes of the patches and search ranges keep shrinking along the cascade. Predictions at the last two levels are strictly restricted because local appearance is sometimes ambiguous and unreliable. The predicted position of each point at the last two levels is given by the average of two networks with different patch sizes. While networks at the first level aim to estimate keypoint positions robustly, with few large errors, networks at the last two levels are designed to achieve high accuracy. The overall flow is sketched below.
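The following sketch captures the control flow of the cascade; every network callable, the patch sizes and the shrink factor are hypothetical placeholders, not values from the paper:

```python
import numpy as np

def cascade_predict(image, level1_nets, refiner_levels, crop):
    """Coarse-to-fine facial point prediction.

    level1_nets    : networks (F1/EN1/NM1-style), each returning a (5, 2)
                     array of point predictions from a large input region.
    refiner_levels : for levels 2 and 3, refiner_levels[l][i] is a pair of
                     networks predicting a small (dx, dy) correction for
                     point i from a local patch around its current estimate.
    crop           : (image, center, size) -> local patch.
    """
    # Level 1: average several networks' predictions to reduce variance.
    points = np.mean([net(image) for net in level1_nets], axis=0)
    patch_size = 0.16                        # placeholder; shrinks per level
    for level in refiner_levels:             # levels 2 and 3
        for i in range(len(points)):
            # Two networks with different patch sizes; average their outputs.
            deltas = [net(crop(image, points[i], patch_size * s))
                      for net, s in zip(level[i], (1.0, 1.4))]
            points[i] += np.mean(deltas, axis=0)   # small local correction
        patch_size *= 0.5
    return points
```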

Figure 9 Deep convolutional network F1

Figure 9 illustrates the deep structure of F1, which contains four convolutional layers

followed by max pooling, and two fully connected layers. EN1 and NM1 take the same

deep structure, but with different sizes at each layer since the sizes of their input

regions are different. Note that the network structures for the last two levels are

shallower, since their tasks are low-level and their input is limited to small local regions

around the initial positions.

3.3 Deep Network for Face Verification and Identification


A great deal of research on applying deep networks to face recognition has appeared since 2012, bringing significant improvements in recognition accuracy over conventional methods. In this paper, a few of the works considered most representative are presented.

One of the representative works is called DeepFace[48. Y. Taigman et al.,2014]. This

method uses a deep CNN trained to classify faces using a dataset of 4 million examples

spanning 4000 unique identities. It also uses a siamese network architecture, where

the same CNN is applied to pairs of faces to obtain descriptors that are then compared

using the Euclidean distance. The authors later extended this work in [49. Y. Taigman

et al.,2015], by increasing the size of the dataset by two orders of magnitude, including

10 million identities and 50 images per identity.

Figure 10 DeepFace Architecture

There are two major innovations in DeepFace: (1) Careful effort was put into the face alignment stage. A generic 3D shape model is used in the pre-processing phase to warp the 2D-aligned crop to the image plane of the 3D shape. (2) Unlike a convolutional layer, where a constant set of filters is applied across the whole input feature map, the last three layers are locally connected, which means every location in the feature map learns a different set of filters. A one-dimensional sketch of this difference follows.
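The 1-D setting and names below are our simplification; the point is only the contrast between shared and per-location filters. DeepFace can afford unshared filters because aligned faces make different spatial regions statistically different (an eye region is unlike a mouth region).

```python
import numpy as np

def conv1d_shared(x, w):
    """Convolutional layer (1-D): ONE filter w of size k, shared everywhere."""
    k = w.size
    return np.array([x[i:i + k] @ w for i in range(x.size - k + 1)])

def locally_connected_1d(x, filters):
    """Locally connected layer (1-D): output position i has its OWN filter
    filters[i] of size k, so nothing is shared across locations."""
    k = filters.shape[1]
    return np.array([x[i:i + k] @ filters[i]
                     for i in range(filters.shape[0])])
```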

The DeepFace work was extended by the DeepID series of papers by Y. Sun et al. [50, 51, 52], each of which steadily increased the performance on LFW and YTF. In the first paper [50. Y. Sun et al., 2014], the network was trained to classify a huge number of identities (nearly 10,000) at the output layer, maximizing the ability of the extracted features to discriminate inter-personal variations. The convolutional network it used is shown in Figure 11:

Figure 11 Convolutional network for predicting 10,000 classes

The recognition performance was further improved with DeepID2 [51. Y. Sun et al., 2014]. A significant innovation was proposed in DeepID2, where a second supervisory signal was added that focuses on intra-personal cases. The reason for adding it is that while it is important for effective feature representations to enlarge inter-personal variations, it is equally important for them to reduce intra-personal variations. Figure 12 gives a rough description of this idea:

Figure 12 Identification-Verification jointly training
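A minimal sketch of what such joint supervision computes for one training pair is given below. The contrastive form of the verification term follows the common formulation, and the weighting lam and margin values are our own placeholders:

```python
import numpy as np

def joint_loss(f_i, f_j, logits_i, y_i, same_identity, margin=1.0, lam=0.05):
    """DeepID2-style joint supervision for one training pair.

    Identification signal: softmax cross-entropy over identities, which
    enlarges inter-personal variations. Verification signal: pull features
    of the same person together and push different people at least
    `margin` apart, which reduces intra-personal variations.
    """
    # Identification: -log softmax probability of the true identity
    p = np.exp(logits_i - logits_i.max())
    p /= p.sum()
    ident = -np.log(p[y_i])
    # Verification: contrastive term on the pair of feature vectors
    d = np.linalg.norm(f_i - f_j)
    verif = 0.5 * d ** 2 if same_identity else 0.5 * max(margin - d, 0.0) ** 2
    return ident + lam * verif
```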

By implementing this joint identification-verification training, 99.15% face verification accuracy was achieved on the LFW dataset, which was the highest accuracy before DeepID3 was proposed.

In [52. Y. Sun et al., 2015], very deep networks were introduced and the face verification accuracy was further improved to 99.53% on the LFW dataset. Compared to DeepFace, DeepID does not use 3D face alignment, but a simpler 2D affine alignment. Even higher accuracy might be achieved if a more powerful alignment method were used in DeepID.

4. Conclusion
In this survey, several typical deep networks have been introduced. Face recognition, a promising biometric technology, was also presented as a popular application of deep networks.

Deep networks have seen significant successes in a variety of areas, both in research and in industry. However, a better theoretical understanding of deep learning and convolutional networks remains a major unsolved challenge. A more fundamental issue concerns the way deep networks perform recognition: the human brain seems to act in a totally different way when it comes to learning and recognizing a new object. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object.

References

1. Krizhevsky, A., Sutskever, I. & Hinton, G. ImageNet classification with deep


convolutional neural networks. In Proc. Advances in Neural Information Processing
Systems 25 1090–1098 (2012).
2. Mikolov, T., Deoras, A., Povey, D., Burget, L. & Cernocky, J. Strategies for training
large scale neural network language models. In Proc. Automatic Speech
Recognition and Understanding 196–201 (2011).
3. Collobert, R., et al. Natural language processing (almost) from scratch. J. Mach.
Learn. Res. 12, 2493–2537 (2011).
4. Geoffrey E. Hinton. Deep belief networks. Scholarpedia, 4(5):5947, 2009.
5. Geoffrey E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep
belief nets. Neural Computation, 18(7):1527–1554, 2006.
6. N. Le Roux and Y. Bengio, “Representational power of restricted boltzmann
machines and deep belief networks,” Neural Computation, vol. 20, no. 6,pp.
1631–1649, 2008.
7. K. Fukushima, “Neocognitron: A self-organizing neural network model for a
mechanism of pattern recognition unaffected by shift in position,” Biological
Cybernetics, vol. 36, pp. 193–202, 1980.
8. Yann LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and
L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,”
Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.
9. Yann LeCun et al. Handwritten digit recognition with a back-propagation network.
In Proc. Advances in Neural Information Processing Systems 396–404, 1990.
10. Yann LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning
applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp.
2278–2324, 1998.
11. G. Desjardins and Y. Bengio, “Empirical evaluation of convolutional RBMs for
vision,” Technical Report 1327, 2008.
12. H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks
for scalable unsupervised learning of hierarchical representations,” in Proceedings
of the Twenty-sixth International Conference on Machine Learning (ICML’09), (L.
Bottou and M. Littman, eds.), Montreal (Qc), Canada: ACM, 2009.
13. Cho, K. et al. Learning phrase representations using RNN encoder-decoder for
statistical machine translation. In Proc. Conference on Empirical Methods in
Natural Language Processing 1724–1734 (2014).
14. Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence learning with neural
networks. In Proc. Advances in Neural Information Processing Systems 27 3104–
3112 (2014).
15. Xu, K. et al. Show, attend and tell: Neural image caption generation with visual
attention. In Proc. International Conference on Learning Representations,
http://arxiv.org/abs/1502.03044 (2015).
16. Bengio, Y., Simard, P. & Frasconi, P. Learning long-term dependencies with gradient
descent is difficult. IEEE Trans. Neural Networks 5, 157–166 (1994).
17. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9,
1735–1780 (1997).
18. C. Gu, J.J. Lim, P. Arbelaez and J. Malik. Recognition Using Regions. CVPR 2009.
19. R. Girshick, J. Donahue, T. Darrell, J. Malik. Rich feature hierarchies for accurate
object detection and semantic segmentation. CVPR 2014.
20. R. Girshick, J. Donahue, T. Darrell, and J. Malik. Region-based convolutional
networks for accurate object detection and segmentation. TPAMI, 2015.
21. R. Girshick. Fast R-CNN. ICCV 2015.
22. K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional
networks for visual recognition. In ECCV, 2014.
23. E. Acosta, L. Torres, A. Albiol, and E. J. Delp, "An automatic face detection and
recognition system for video indexing applications," in Proceedings of the IEEE
International Conference on Acoustics, Speech and Signal Processing, Vol. 4,
Orlando, Florida, 2002.
24. A. Albiol, D. Monzo, A. Martin, J. Sastre, and A. Albiol, "Face recognition using
HOG–EBGM" Pattern Recognition Letters, Vol.29, pp.1537-1543, 2008.
25. D.G. Lowe, “Distinctive image features from scale-invariant keypoints,”
International Journal of Computer Vision, vol. 60, pp. 91–110, 2004.
26. K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman. Fisher Vector Faces in the
Wild. In Proc. BMVC., 2013.
27. Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In
Large-Scale Kernel Machines, 2007.
28. Yoshua Bengio. Learning Deep Architectures for AI. 2009.
29. Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann machines. In Proc.
AISTATS, 2009.
30. David H. Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski. A learning
algorithm for Boltzmann machines. Cognitive Science, 9:147–169, 1985.
31. Geoffrey Hinton. A practical guide to training restricted Boltzmann machines.
Technical report, 2010.
32. Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy
layer-wise training of deep networks. In Proc. Advances in Neural Information
Processing Systems 19, 153–160, 2007.
33. Geoffrey E. Hinton et al. The wake-sleep algorithm for unsupervised neural
networks. Science, 268:1158–1161, 1995.
34. Thomas P. Karnowski, Itamar Arel, and Derek Rose. Deep spatiotemporal feature
learning with application to image classification. In 2010 Ninth International
Conference on Machine Learning and Applications, December 2010.
35. M. Turk and A. Pentland, "Eigenfaces for Recognition", Journal of Cognitive
Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.
36. Quoc V Le, Marc Aurelio Ranzato, Matthieu Devin, Greg S Corrado, and Andrew Y
Ng. Building High-level Features Using Large Scale Unsupervised Learning. 2012.
37. J. Zhang, S. Z. Li, and J. Wang, "Nearest Manifold Approach for Face Recognition,"
in Proc. IEEE International Conference on Automatic Face and Gesture Recognition,
2004.
38. Y. Wu, K. L. Chan, and L. Wang, "Face Recognition based on Discriminative Manifold
Learning," in Proc. IEEE Int’l Conf. on Pattern Recognition, Vol.4,2004.
39. M. S. Bartlett, J. R. Movellan, and T. J. Sejnowski,"Face recognition by independent
component analysis,"IEEE Transactions on Neural Networks, Vol.13,pp.1450-
1464,2002.
40. K.-C. Kwak and W. Pedrycz, "Face Recognition Using an Enhanced Independent
Component Analysis Approach," IEEE Transactions on Neural Networks,Vol.18,
pp.530-541, 2007.
41. B. Li and H. Yin, "Face Recognition Using RBF Neural Networks and Wavelet
Transform," in Advances in Neural Networks – ISNN 2005, vol.3497,Lecture Notes
in Computer Science: Springer Berlin /Heidelberg, 2005.
42. F. S. Samaria, "Face recognition using Hidden Markov Models," Trinity College,
University of Cambridge, Cambridge, UK, Ph. D. Thesis 1994.
43. A. V. Nefian and M. H. Hayes III, "Face Recognition using an embedded HMM," in
IEEE International Conference Audio Video Biometric based Person Authentication,
1999.
44. B. Amberg and T. Vetter. Optimal landmark detection using shape models and
branch and bound. In Proc. ICCV, 2011.
45. P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar. Localizing parts of
faces using a consensus of exemplars. In Proc. CVPR, 2011.
46. C. Lu and X. Tang. Surpassing human-level face verification performance on LFW
with GaussianFace. AAAI, 2015.
47. Yi Sun, Xiaogang Wang, Xiaoou Tang. Deep Convolutional Network Cascade for
Facial Point Detection. CVPR, 2013.
48. Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deep-Face: Closing the gap to human-
level performance in face verification. In Proc. CVPR, 2014.
49. Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Web-scale training for face
identification. In Proc. CVPR, 2015.
50. Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting
10,000 classes. In Proc. CVPR, 2014.
51. Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint
identification-verification. NIPS, 2014.
52. Y. Sun, L. Ding, X. Wang, and X. Tang. Deepid3: Face recognition with very deep
neural networks. CoRR, abs/1502.00873, 2015.
53. Yoshua Bengio, Aaron Courville, Pascal Vincent. Representation learning: a review
and new perspectives. arXiv, 2012.
54. Yoshua Bengio. Learning Deep Architectures for AI. Foundations and Trends in
Machine Learning Vol. 2, No. 1, 2009.
