
CST414 DEEP LEARNING
Module-5 PART-II
SYLLABUS

Module-5 (Application Areas)

 Applications – computer vision, speech recognition, natural language processing; common word embeddings: Continuous Bag-of-Words, Word2Vec, Global Vectors for Word Representation (GloVe).
 Research Areas – autoencoders, representation learning, Boltzmann machines, deep belief networks.
Common Word Embedding

 Word2Vec, a framework for generating word embeddings, was pioneered by Mikolov et al.
 The two variants of word2vec are as follows:
1. Predicting target words from contexts: This model tries to predict the ith word, w_i, in a sentence using a window of width t around the word. Therefore, the words w_{i-t}, w_{i-t+1}, ..., w_{i-1}, w_{i+1}, ..., w_{i+t-1}, w_{i+t} are used to predict the target word w_i. This model is also referred to as the continuous bag-of-words (CBOW) model.
The CBOW model uses the encoder to create an embedding from the full context (treated as one input) and predicts the target word. It turns out this strategy works best for smaller datasets.
2. Predicting contexts from target words: This model tries to predict the context w_{i-t}, w_{i-t+1}, ..., w_{i-1}, w_{i+1}, ..., w_{i+t-1}, w_{i+t} around the ith word in the sentence, denoted by w_i. This model is referred to as the skip-gram model.
 The Skip-Gram model does the inverse of CBOW, taking the target word as an input, and then attempting to predict one of the words in the context.
 The first technique is a multinomial model which predicts one word out of d outcomes.
 The second model is a Bernoulli model, which predicts whether or not each context is present for a particular word. The second model uses negative sampling of contexts for better efficiency and accuracy.
Neural Embedding with Continuous Bag of Words

 In the continuous bag-of-words (CBOW) model, the training pairs are all context-word pairs in which a window of context words is input, and a single target word is predicted.
 The context contains 2·t words, corresponding to t words both before and after the target word.
 We will use the length m = 2·t to define the length of the context. Therefore, the input to the system is a set of m words.
 Without loss of generality, let the subscripts of these words be numbered so that they are denoted by w_1 ... w_m, and let the target (output) word in the middle of the context window be denoted by w.
 Note that w can be viewed as a categorical variable with d possible values, where d is the size of the lexicon.
 The goal of the neural embedding is to compute the probability P(w | w_1 w_2 ... w_m) and maximize the product of these probabilities over all training samples.
[Figure 2.15: Word2vec: the CBOW model]
 In the architecture, we have a single input layer with m × d nodes, a hidden layer with p nodes, and an output layer with d nodes.
 The nodes in the input layer are clustered into m different groups, each of which has d units. Each group with d input units is the one-hot encoded input vector of one of the m context words being modeled by CBOW.
 Only one of these d inputs will be 1 and the remaining inputs will be 0.
 Therefore, one can represent an input x_ij with two indices corresponding to contextual position and word identifier.
 The input x_ij ∈ {0, 1} contains two indices i and j in the subscript, where i ∈ {1 ... m} is the position of the context, and j ∈ {1 ... d} is the identifier of the word.
 The hidden layer contains p units, where p is the dimensionality of the hidden layer in word2vec.
 Let h_1, h_2, ..., h_p be the outputs of the hidden layer nodes.
 Note that each of the d words in the lexicon has m different representatives in the input layer corresponding to the m different context words, but the weight of each of these m connections is the same.
 Such weights are referred to as shared. Sharing weights is a common trick used for regularization in neural networks.
 Let the shared weight of each connection from the jth word in the lexicon to the qth hidden layer node be denoted by u_jq.
 Note that each of the m groups in the input layer has connections to the hidden layer that are defined by the same d × p weight matrix U.
 u_j = (u_j1, u_j2, ..., u_jp) can be viewed as the p-dimensional embedding of the jth input word over the entire corpus, and h = (h_1 ... h_p) provides the embedding of a specific instantiation of an input context.
 Then, the output of the hidden layer is obtained by averaging the embeddings of the words present in the context.
 The embedding (h_1 ... h_p) is used to predict the probability that the target word is each of the d outputs with the use of the softmax function. The weights in the output layer are parameterized with a p × d matrix V = [v_qj].
 The jth column of V is denoted by v_j. The output after applying softmax creates d output values ŷ_1 ... ŷ_d, which are real values in (0, 1).
 These real values sum to 1 because they can be interpreted as probabilities.
 The ground-truth value of only one of the outputs y_1 ... y_d is 1 and the remaining values are 0 for a given training instance. One can write this condition as follows: y_j = 1 if the target word w is the jth word of the lexicon, and y_j = 0 otherwise.
 The softmax function computes the probability P(w | w_1 ... w_m) of the one-hot encoded ground-truth outputs y_j as follows:

ŷ_j = P(y_j = 1 | w_1 ... w_m) = exp(h · v_j) / Σ_{k=1}^{d} exp(h · v_k)

 For a particular target word w = r ∈ {1 ... d}, the loss function is given by L = −log[P(y_r = 1 | w_1 ... w_m)] = −log(ŷ_r). The use of the negative logarithm turns the multiplicative likelihoods over different training instances into an additive loss function using log-likelihoods.
 The updates are defined by using the backpropagation algorithm, as training instances are passed through the neural network one by one.
 The loss function can be used to update the gradients of the weight matrix V in the output layer.
 Backpropagation can then be used to update the weight matrix U between the input and hidden layer. The update equations with learning rate α are given below.
 The probability of making a mistake in prediction on the jth word in the lexicon is defined by |y_j − ŷ_j|. However, we use signed mistakes ε_j, in which only the correct word with y_j = 1 is given a positive mistake value, while all the other words in the lexicon receive negative mistake values. This is achieved by dropping the modulus:
 ε_j = y_j − ŷ_j
 Note that ε_j can also be shown to be equal to the negative of the derivative of the cross-entropy loss with respect to the jth input into the softmax layer (which is h · v_j).
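 This relationship can be checked numerically. The following NumPy sketch (not from the slides; the toy lexicon size d = 5 and the random scores are assumptions) compares a finite-difference gradient of the cross-entropy loss with ŷ − y, the negative of the signed mistakes.

```python
# Minimal check that the gradient of the cross-entropy loss with respect to
# the softmax inputs h·v_j equals yhat - y, i.e. the negative of eps_j.
import numpy as np

d = 5                                   # toy lexicon size (assumed)
scores = np.random.randn(d)             # stands in for the values h·v_j
y = np.zeros(d); y[2] = 1.0             # one-hot ground truth, target word r = 2

yhat = np.exp(scores) / np.exp(scores).sum()   # softmax probabilities
loss = -np.log(yhat[2])                        # L = -log(yhat_r)

# Numerical gradient of L w.r.t. each score, compared with yhat - y.
eps = 1e-6
num_grad = np.zeros(d)
for j in range(d):
    s = scores.copy(); s[j] += eps
    p = np.exp(s) / np.exp(s).sum()
    num_grad[j] = (-np.log(p[2]) - loss) / eps

print(np.allclose(num_grad, yhat - y, atol=1e-4))   # True: dL/dscore_j = yhat_j - y_j = -eps_j
```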
 The updates for a particular input context and output word are as follows (with ε_j = y_j − ŷ_j as above):

u_i ⇐ u_i + α Σ_{j=1}^{d} ε_j v_j   [for each word i present in the context window]
v_j ⇐ v_j + α ε_j h   [for each word j in the lexicon]

 Here, α > 0 is the learning rate. Repetitions of the same word i in the context window trigger multiple updates of u_i.
 There are two different embeddings, corresponding to the p-dimensional rows of the matrix U and the p-dimensional columns of the matrix V. The former type of embedding of words is referred to as the input embedding, whereas the latter is referred to as the output embedding.
 In the CBOW model, the input embedding represents context, and therefore it makes sense to use the output embedding.
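 As an illustration of the CBOW forward pass and the updates above, the following NumPy sketch performs a single training step. The lexicon size, embedding dimension, learning rate, and toy context are assumptions, and the context embeddings are averaged as described earlier; this is a minimal sketch, not a full word2vec implementation.

```python
# Illustrative NumPy sketch of one CBOW training step, following the slides'
# notation (U: d x p input embeddings in rows, V: p x d output embeddings in columns).
import numpy as np

d, p, alpha = 10, 4, 0.05                  # lexicon size, embedding size, learning rate (assumed)
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(d, p))     # input embeddings u_j (rows)
V = rng.normal(scale=0.1, size=(p, d))     # output embeddings v_j (columns)

context = [1, 3, 3, 7]                     # indices of the m context words (repeats allowed)
target = 5                                 # index r of the target word

# Forward pass: average the input embeddings of the context words.
h = U[context].mean(axis=0)                       # (p,)
scores = h @ V                                    # h · v_j for every j
yhat = np.exp(scores) / np.exp(scores).sum()      # softmax probabilities
y = np.zeros(d); y[target] = 1.0
eps_vec = y - yhat                                # signed mistakes eps_j

# Backward pass: the updates from the slides.
grad_u = V @ eps_vec                              # sum_j eps_j v_j
V += alpha * np.outer(h, eps_vec)                 # v_j <- v_j + alpha * eps_j * h
for i in context:                                 # repeated words get repeated updates
    U[i] += alpha * grad_u                        # u_i <- u_i + alpha * sum_j eps_j v_j

print("loss:", -np.log(yhat[target]))
```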
Neural Embedding with Skip-Gram Model

 In the skip-gram model, the target words are used to predict the m context words. Therefore, we have one input word and m outputs.
 The skip-gram model is the technique of choice when a large amount of data is available.
 The skip-gram model uses a single target word w as the input and outputs the m context words denoted by w_1 ... w_m. Therefore, the goal is to estimate P(w_1, w_2, ..., w_m | w), which is different from the quantity P(w | w_1 ... w_m) estimated in the CBOW model.
 After such an encoding, the skip-gram model will have d binary inputs denoted by x_1 ... x_d corresponding to the d possible values of the single input word.
 Similarly, the output of each training instance is encoded as m × d values y_ij ∈ {0, 1}, where i ranges from 1 to m (size of context window), and j ranges from 1 to d (lexicon size).
 Each y_ij ∈ {0, 1} indicates whether the ith contextual word takes on the jth possible value for that training instance.
 However, the (i, j)th output node only computes a soft probability value ŷ_ij = P(y_ij = 1 | w).
 Therefore, the probabilities ŷ_ij in the output layer for fixed i and varying j sum to 1, since the ith contextual position takes on exactly one of the d words. The hidden layer contains p units, whose outputs are denoted by h_1 ... h_p. Each input x_j is connected to all the hidden nodes with a d × p matrix U.
[Figures (a) and (b): the word2vec skip-gram model]
 The p hidden nodes are connected to each of the m groups of d output nodes with the same set of shared weights.
 This set of shared weights between the p hidden nodes and the d output nodes of each of the context words is defined by the p × d matrix V.
 In the case of the skip-gram model, one can collapse the m identical outputs of fig. (a) into a single output and achieve the same results simply by using a particular type of mini-batching during stochastic gradient descent, in which all elements of a single context window are always forced to belong to the same mini-batch, as in fig. (b).
 The output of the hidden layer can be computed from the input layer using the d × p matrix of weights U = [u_jq] between the input and hidden layer as follows:

h_q = Σ_{j=1}^{d} u_jq x_j   for each q ∈ {1 ... p}

 In the above, x_1 ... x_d is the one-hot encoding of the input word w.
 If the input word w is the rth word, then one simply copies u_rq to the qth node of the hidden layer for each q ∈ {1 ... p}.
 In other words, the rth row u_r of U is copied to the hidden layer.
 The hidden layer is connected to m groups of d output nodes, each of which is connected to the hidden layer with a p × d matrix V = [v_qj].
 Each of these m groups of d output nodes computes the probabilities of the various words for a particular context word.
 The jth column of V is denoted by v_j and represents the output embedding of the jth word. The output ŷ_ij is the probability that the word in the ith context position takes on the jth word of the lexicon.
 The neural network predicts the same multinomial distribution for each of the context words. Therefore, we have the following:

ŷ_ij = P(y_ij = 1 | w) = exp(h · v_j) / Σ_{k=1}^{d} exp(h · v_k)   for each i ∈ {1 ... m}

 Note that the probability ŷ_ij is the same for varying i and fixed j, since the right-hand side of the above equation does not depend on the exact location i in the context window.
 The loss function for the backpropagation algorithm is the negative of the log-likelihood of the ground truth y_ij ∈ {0, 1} of a training instance. This loss function L is given by the following:

L = − Σ_{i=1}^{m} Σ_{j=1}^{d} y_ij log(ŷ_ij)

 The update equations with learning rate α are as follows (using the signed mistakes ε_ij = y_ij − ŷ_ij):

u_r ⇐ u_r + α Σ_{i=1}^{m} Σ_{j=1}^{d} ε_ij v_j   [for the input word r]
v_j ⇐ v_j + α Σ_{i=1}^{m} ε_ij h   [for each word j in the lexicon]

Here, α > 0 is the learning rate. The p-dimensional rows of the matrix U are used as the embeddings of the words. In other words, the convention is to use the input embeddings in the rows of U rather than the output embeddings in the columns of V.
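 The following NumPy sketch performs one skip-gram training step with the shared softmax described above. The sizes and the toy target/context pair are assumptions; practical word2vec implementations usually replace the full softmax with negative sampling or a hierarchical softmax.

```python
# Illustrative NumPy sketch of one skip-gram training step (shared softmax
# over the m context positions). Sizes and toy data are assumptions.
import numpy as np

d, p, alpha = 10, 4, 0.05                 # lexicon size, embedding size, learning rate (assumed)
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(d, p))    # input embeddings (rows u_j)
V = rng.normal(scale=0.1, size=(p, d))    # output embeddings (columns v_j)

target = 5                                 # input word r
context = [1, 3, 7, 2]                     # the m context words to predict

h = U[target]                              # copy the r-th row of U to the hidden layer
scores = h @ V
yhat = np.exp(scores) / np.exp(scores).sum()   # same multinomial for every context position

# Accumulate signed mistakes over the m context positions: eps_ij = y_ij - yhat_ij.
eps_sum = -len(context) * yhat
for w in context:
    eps_sum[w] += 1.0

grad_u = V @ eps_sum                       # sum_i sum_j eps_ij v_j
V += alpha * np.outer(h, eps_sum)          # v_j <- v_j + alpha * sum_i eps_ij * h
U[target] += alpha * grad_u                # u_r <- u_r + alpha * sum_i sum_j eps_ij v_j

loss = -sum(np.log(yhat[w]) for w in context)
print("loss:", loss)
```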
Global Vectors for Word Representation (GloVe)

 GloVe is an unsupervised learning algorithm for obtaining vector representations for words, introduced by Jeffrey Pennington, Richard Socher, and Christopher D. Manning (2014).
 Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
 GloVe is a specific weighted least squares model that trains on global word-word co-occurrence counts and thus makes efficient use of statistics.
 The model is called GloVe, for Global Vectors, because the global corpus statistics are captured directly by the model.
 The Euclidean distance (or cosine similarity) between two word vectors provides an effective method for measuring the linguistic or semantic similarity of the corresponding words.
 The similarity metrics used for nearest neighbor evaluations produce a single scalar that quantifies the relatedness of two words. This simplicity can be problematic, since two given words almost always exhibit more intricate relationships than can be captured by a single number.
 For example, man may be regarded as similar to woman in that both words describe human beings.
 The GloVe model is trained on the non-zero entries of a global word-word co-occurrence matrix, which tabulates how frequently words co-occur with one another in a given corpus. Populating this matrix requires a single pass through the entire corpus to collect the statistics.
 For large corpora, this pass can be computationally expensive, but it is a one-time up-front cost. Subsequent training iterations are much faster because the number of non-zero matrix entries is typically much smaller than the total number of words in the corpus.
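 A minimal sketch of how such a co-occurrence matrix might be populated in a single pass is shown below. The tiny corpus, the whitespace tokenization, and the window size are assumptions; the 1/offset weighting of distant pairs follows the convention of the original GloVe implementation.

```python
# Sketch of building a sparse word-word co-occurrence matrix X in one pass.
from collections import defaultdict

corpus = ["ice is a solid form of water", "steam is a gas form of water"]
window = 2                                           # symmetric context window (assumed)

vocab = {w: i for i, w in enumerate(sorted({t for s in corpus for t in s.split()}))}
X = defaultdict(float)                               # sparse co-occurrence counts X[i, j]

for sentence in corpus:
    tokens = sentence.split()
    for pos, word in enumerate(tokens):
        for off in range(1, window + 1):
            if pos + off < len(tokens):
                i, j = vocab[word], vocab[tokens[pos + off]]
                X[i, j] += 1.0 / off                 # more distant pairs contribute less
                X[j, i] += 1.0 / off

print(len(vocab), "words,", len(X), "non-zero co-occurrence entries")
```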
 GloVe is essentially a log-bilinear model with a weighted least-squares objective.
 The main intuition underlying the model is the simple observation that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning.
 For example, ice co-occurs more frequently with solid than it does with gas, whereas steam co-occurs more frequently with gas than it does with solid.
 Both words co-occur with their shared property water frequently, and both co-occur with the unrelated word fashion infrequently.
 Only in the ratio of probabilities does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific to steam.
 Compared to the raw probabilities, the ratio is better able to distinguish relevant words (solid and gas) from irrelevant words (water and fashion), and it is also better able to discriminate between the two relevant words. Since the ratio P_ik/P_jk depends on three words i, j, and k, the most general model takes the form

F(w_i, w_j, w̃_k) = P_ik / P_jk,   (1)

 where w ∈ R^d are word vectors and w̃ ∈ R^d are separate context word vectors.
 The right-hand side is extracted from the corpus, and F may depend on some as-of-yet unspecified parameters.
 Restricting attention to functions F that depend only on the difference of the two target words gives

F(w_i − w_j, w̃_k) = P_ik / P_jk.   (2)

 We note that the arguments of F in Eqn. (2) are vectors while the right-hand side is a scalar.
 While F could be taken to be a complicated function parameterized by, e.g., a neural network, doing so would obfuscate the linear structure we are trying to capture. To avoid this issue, we can first take the dot product of the arguments:

F((w_i − w_j)^T w̃_k) = P_ik / P_jk.   (3)
 Note that for word-word co-occurrence matrices, the distinction between a word and a context word is arbitrary.
 First, we require that F be a homomorphism between the groups (R, +) and (R_{>0}, ×), i.e.,

F((w_i − w_j)^T w̃_k) = F(w_i^T w̃_k) / F(w_j^T w̃_k),   (4)

 which is solved by F = exp, giving

w_i^T w̃_k = log(P_ik) = log(X_ik) − log(X_i).   (6)

 Next, we note that Eqn. (6) would exhibit the exchange symmetry if not for the log(X_i) on the right-hand side. However, this term is independent of k, so it can be absorbed into a bias b_i for w_i. Finally, adding an additional bias b̃_k for w̃_k restores the symmetry:

w_i^T w̃_k + b_i + b̃_k = log(X_ik).   (7)

 Introducing a weighting function f(X_ij) into the cost function gives us the model

J = Σ_{i,j=1}^{V} f(X_ij) (w_i^T w̃_j + b_i + b̃_j − log X_ij)²,   (8)

 where V is the size of the vocabulary. The weighting function should obey the following properties: f(0) = 0; f(x) should be non-decreasing, so that rare co-occurrences are not overweighted; and f(x) should be relatively small for large values of x, so that frequent co-occurrences are not overweighted.
 The basic idea behind the GloVe word embedding is to derive the relationship between the words from statistics. Unlike the occurrence matrix, the co-occurrence matrix tells you how often a particular word pair occurs together: each value in the co-occurrence matrix represents a pair of words occurring together. A toy training sketch is given below.
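 The sketch below trains the weighted least-squares objective J of Eqn. (8) on a toy co-occurrence matrix using plain gradient descent. The matrix, dimensions, and hyperparameters are assumptions for illustration, not the reference GloVe implementation.

```python
# Minimal sketch of the GloVe objective
#   J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
# trained by plain gradient descent on a toy co-occurrence matrix X.
import numpy as np

rng = np.random.default_rng(0)
V, p, x_max, alpha_f, lr = 6, 4, 100.0, 0.75, 0.05   # vocab size, dim, weighting params, step
X = rng.integers(0, 20, size=(V, V)).astype(float)   # toy co-occurrence counts

w  = rng.normal(scale=0.1, size=(V, p));  wt = rng.normal(scale=0.1, size=(V, p))
b  = np.zeros(V);                         bt = np.zeros(V)

def f(x):                                             # weighting function from the paper
    return np.where(x < x_max, (x / x_max) ** alpha_f, 1.0)

for epoch in range(50):
    J = 0.0
    for i, j in zip(*np.nonzero(X)):                  # train only on non-zero entries
        diff = w[i] @ wt[j] + b[i] + bt[j] - np.log(X[i, j])
        weight = f(X[i, j])
        J += weight * diff ** 2
        grad = 2.0 * weight * diff                    # d J_ij / d(diff)
        gi, gj = grad * wt[j], grad * w[i]            # gradients w.r.t. w_i and w~_j
        w[i]  -= lr * gi;   wt[j] -= lr * gj
        b[i]  -= lr * grad; bt[j] -= lr * grad

print("final cost:", J)
```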
 The advantage of GloVe is that, unlike Word2vec, GloVe does not rely just on local statistics (local context information of words), but incorporates global statistics (word co-occurrence) to obtain word vectors.
 Disadvantages
• Because it uses a co-occurrence matrix and global information, the memory cost of GloVe is higher than that of word2vec.
• Similar to word2vec, it does not solve the problem of polysemous words, since words and vectors have a one-to-one relationship.
RESEARCH AREAS -- AUTOENCODERS

 Autoencoders are very useful in the field of unsupervised machine learning.
 We can use them to compress the data and reduce its dimensionality.
 An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning).
 An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation.
 The main difference between autoencoders and Principal Component Analysis (PCA) is that while PCA finds the directions along which you can project the data with maximum variance, autoencoders reconstruct the original input given just a compressed version of it.

[Figure 6-4: The autoencoder architecture compresses a high-dimensional input into a low-dimensional embedding and then uses that low-dimensional embedding to reconstruct the input.]
 An autoencoder is a neural network that is trained to attempt to copy its input to its output; it has a hidden layer h that describes a code used to represent the input.
 The network may be viewed as consisting of two parts: an encoder function h = f(x) and a decoder that produces a reconstruction r = g(h).

[Figure: The general structure of an autoencoder, mapping an input x to an output (called the reconstruction) r through an internal representation or code h. The autoencoder has two components: the encoder f (mapping x to h) and the decoder g (mapping h to r).]
 Modern autoencoders have generalized the idea of an encoder and a decoder beyond deterministic functions to stochastic mappings p_encoder(h | x) and p_decoder(x | h).
 Traditionally, autoencoders were used for dimensionality reduction or feature learning.
 A classical example of the dimensionality reduction setting is the autoencoder, which recreates the outputs from the inputs. Therefore, the number of outputs and inputs is equal.
 The constricted hidden layer in the middle outputs the reduced representation of each instance.
 As a result of this constriction, there is some loss in the representation, which typically corresponds to the noise in the data.
 The outputs of the hidden layer correspond to the reduced representation of the data.
 An autoencoder is a type of neural network that can learn to reconstruct images, text, and other data from compressed versions of themselves.
 An autoencoder consists of three layers:
1. Encoder
2. Code
3. Decoder
 The Encoder layer compresses the input image into a latent space representation. It encodes the input image as a compressed representation in a reduced dimension.
 The compressed image is a distorted version of the original image.
 The Code layer represents the compressed input fed to the decoder layer.
 The Decoder layer decodes the encoded image back to the original dimension. The decoded image is reconstructed from the latent space representation and is a lossy reconstruction of the original image.
 Training Autoencoders:
 Training an autoencoder is unsupervised in the sense that no labeled data is needed.
 The training process is still based on the optimization of a cost function.
 The cost function measures the error between the input x and its reconstruction at the output x̂.
 An autoencoder is composed of an encoder and a decoder.
 When you're building an autoencoder, there are a few things to keep in mind.
 First, the code or bottleneck size is the most critical hyperparameter to tune in the autoencoder. It decides how much the data has to be compressed. It can also act as a regularization term.
 Secondly, it's important to remember that the number of layers is critical when tuning autoencoders. A higher depth increases model complexity, but a lower depth is faster to process.
 Thirdly, you should pay attention to how many nodes you use per layer. The number of nodes decreases with each subsequent layer in the encoder, as the input to each layer becomes smaller across the layers.
 An autoencoder whose code dimension is less than the input dimension is called undercomplete.
 An autoencoder whose code dimension is more than the input dimension is called overcomplete.
 Learning an undercomplete representation forces the autoencoder to capture the most salient features of the training data.
 The learning process is described simply as minimizing a loss function
 L(x, g(f(x)))
 where L is a loss function penalizing g(f(x)) for being dissimilar from x, such as the mean squared error.
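 A minimal undercomplete autoencoder, trained to minimize L(x, g(f(x))) with the mean squared error, can be sketched in a few lines of NumPy. The layer sizes, toy data, and training schedule below are illustrative assumptions.

```python
# Minimal undercomplete autoencoder sketch: encoder h = f(x), decoder r = g(h),
# trained by gradient descent on the mean squared reconstruction error.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_code, lr = 8, 3, 0.1                 # input dim, bottleneck (code) dim, learning rate
X = rng.normal(size=(200, n_in))             # toy unlabeled data

W_enc = rng.normal(scale=0.1, size=(n_in, n_code)); b_enc = np.zeros(n_code)
W_dec = rng.normal(scale=0.1, size=(n_code, n_in)); b_dec = np.zeros(n_in)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(200):
    h = sigmoid(X @ W_enc + b_enc)           # encoder f(x): compressed code
    r = h @ W_dec + b_dec                    # decoder g(h): reconstruction
    err = r - X                              # drives the gradient (proportional to dL/dr)
    loss = (err ** 2).mean()

    # Backpropagation through decoder and encoder.
    grad_W_dec = h.T @ err / len(X);         grad_b_dec = err.mean(axis=0)
    dh = err @ W_dec.T * h * (1 - h)
    grad_W_enc = X.T @ dh / len(X);          grad_b_enc = dh.mean(axis=0)

    W_dec -= lr * grad_W_dec; b_dec -= lr * grad_b_dec
    W_enc -= lr * grad_W_enc; b_enc -= lr * grad_b_enc

print("reconstruction MSE:", loss)
```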
 Autoencoders have various use-cases, such as:
• Anomaly detection: autoencoders can identify data anomalies by flagging inputs with a high reconstruction error. This can be helpful in financial markets, where it can be used to identify unusual activity and predict market trends.
• Data denoising (image and audio): autoencoders can help clean up noisy pictures or audio files by learning to remove the noise from the corrupted input.
• Image inpainting: autoencoders have been used to fill in gaps in images by learning how to reconstruct missing pixels based on surrounding pixels. For example, if you're trying to restore an old photograph that's missing part of its right side, the autoencoder could learn how to fill in the missing details based on what it knows about the rest of the photo.
• Information retrieval: autoencoders can be used as content-based image retrieval systems that allow users to search for images based on their content.
RESEARCH AREAS -- REPRESENTATION LEARNING

 Representation learning is concerned with training machine learning algorithms to learn useful representations, e.g. those that are interpretable, contain latent features, or can be used for transfer learning.
 It means learning representations of the data that make it easier to extract useful information when building classifiers or other predictors.
 Feedforward networks trained by supervised learning can be viewed as performing a kind of representation learning: the last layer of the network is typically a linear classifier, such as a softmax regression classifier.
 The rest of the network learns to provide a representation to this classifier. Training with a supervised criterion naturally leads to the representation at every hidden layer (but more so near the top hidden layer) taking on properties that make the classification task easier.
 Representation learning is particularly interesting because it provides one way to perform unsupervised and semi-supervised learning.
 Many tasks have large amounts of unlabeled training data and relatively little labeled training data.
 Training with supervised learning techniques on the labeled subset alone often results in severe overfitting.
 Semi-supervised learning offers the chance to resolve this overfitting problem by also learning from the unlabeled data, for which we can learn good representations.
 Greedy layer-wise unsupervised pretraining relies on a single-layer representation learning algorithm such as an RBM, a single-layer autoencoder, etc.
 Each layer is pretrained using unsupervised learning, taking the output of the previous layer and producing as output a new representation of the data.
 Greedy layer-wise pretraining is called greedy because it is a greedy algorithm, meaning that it optimizes each piece of the solution independently, one piece at a time, rather than jointly optimizing all pieces. It is called layer-wise because these independent pieces are the layers of the network.
 Greedy layer-wise pretraining proceeds one layer at a time, training the k-th layer while keeping the previous ones fixed. In particular, the lower layers (which are trained first) are not adapted after the upper layers are introduced.
 It is called unsupervised because each layer is trained with an unsupervised representation learning algorithm.
 However, it is also called pretraining, because it is supposed to be only a first step before a joint training algorithm is applied to fine-tune all the layers together.
 In the context of a supervised learning task, it can be viewed as a regularizer (in some experiments, pretraining decreases test error without decreasing training error) and a form of parameter initialization.
 The term pretraining also refers to the entire two-phase protocol that combines the pretraining phase and a supervised learning phase.
 The supervised learning phase may involve training a simple classifier on top of the features learned in the pretraining phase, or it may involve supervised fine-tuning of the entire network learned in the pretraining phase. A sketch of the layer-wise procedure follows.
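 The sketch below illustrates greedy layer-wise unsupervised pretraining using single-layer autoencoders (an RBM could equally be used, as noted above). The layer sizes, data, and training loop are assumptions; in a full pipeline the pretrained weights would initialize a network that is then fine-tuned with supervised learning.

```python
# Sketch of greedy layer-wise unsupervised pretraining with single-layer autoencoders.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_autoencoder_layer(data, n_hidden, lr=0.1, epochs=100):
    """Train one autoencoder layer and return (weights, biases, hidden codes)."""
    n_in = data.shape[1]
    W = rng.normal(scale=0.1, size=(n_in, n_hidden));     b = np.zeros(n_hidden)
    W_out = rng.normal(scale=0.1, size=(n_hidden, n_in)); b_out = np.zeros(n_in)
    for _ in range(epochs):
        h = sigmoid(data @ W + b)
        r = h @ W_out + b_out
        err = r - data
        dh = err @ W_out.T * h * (1 - h)
        W_out -= lr * h.T @ err / len(data);    b_out -= lr * err.mean(axis=0)
        W     -= lr * data.T @ dh / len(data);  b     -= lr * dh.mean(axis=0)
    return W, b, sigmoid(data @ W + b)

X = rng.normal(size=(300, 20))            # unlabeled data (assumed)
layer_sizes = [12, 6, 3]                  # progressively smaller representations

pretrained, representation = [], X
for n_hidden in layer_sizes:              # greedy: one layer at a time, previous layers fixed
    W, b, representation = train_autoencoder_layer(representation, n_hidden)
    pretrained.append((W, b))

print([w.shape for w, _ in pretrained])   # weights ready for supervised fine-tuning
```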

 Distributed Representation: Distributed representations of concepts (representations composed of many elements that can be set separately from each other) are one of the most important tools for representation learning.
 An important related concept that distinguishes a distributed representation from a symbolic one is that generalization arises due to shared attributes between different concepts. As pure symbols, "cat" and "dog" are as far from each other as any other two symbols.
 For example, our distributed representation may contain entries such as "has_fur" or "number_of_legs" that have the same value for the embedding of both "cat" and "dog."
 Neural language models that operate on distributed representations of words generalize much better than other models that operate directly on one-hot representations of words.
 Distributed representations induce a rich similarity space, in which semantically close concepts (or inputs) are close in distance, a property that is absent from purely symbolic representations.
RESEARCH AREAS -- BOLTZMANN MACHINES

 Boltzmann machines were originally introduced as a general "connectionist" approach to learning arbitrary probability distributions over binary vectors.
 We define the Boltzmann machine over a d-dimensional binary random vector x ∈ {0, 1}^d.
 The Boltzmann machine is an energy-based model, meaning we define the joint probability distribution using an energy function:

P(x) = exp(−E(x)) / Z,   with energy   E(x) = −x^T U x − b^T x,

 where U is the "weight" matrix of model parameters, b is the vector of bias parameters, and Z is the partition function that normalizes the distribution.
 In the general setting of the Boltzmann machine, we are given a set of training examples, each of which is n-dimensional.
 The Boltzmann machine becomes more powerful when not all the variables are observed.
 With latent (hidden) variables, the Boltzmann machine becomes a universal approximator of probability mass functions over discrete variables.
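 The energy-based definition can be made concrete with a tiny example. The sketch below evaluates E(x) = −x^T U x − b^T x and P(x) = exp(−E(x))/Z for a Boltzmann machine over only d = 4 units, so that the partition function Z can be computed by brute-force enumeration; this is an illustrative assumption and is infeasible at realistic sizes.

```python
# Sketch of the Boltzmann machine as an energy-based model over binary vectors.
import itertools
import numpy as np

d = 4                                                  # toy number of binary units (assumed)
rng = np.random.default_rng(0)
U = np.triu(rng.normal(scale=0.3, size=(d, d)), k=1)   # pairwise weights, each pair counted once
b = rng.normal(scale=0.3, size=d)                      # bias parameters

def energy(x):
    return -(x @ U @ x) - b @ x                        # E(x) = -x^T U x - b^T x

states = [np.array(s) for s in itertools.product([0, 1], repeat=d)]
Z = sum(np.exp(-energy(x)) for x in states)            # partition function by enumeration

x = np.array([1, 0, 1, 1])
print("P(x) =", np.exp(-energy(x)) / Z)
print("probabilities sum to", sum(np.exp(-energy(s)) / Z for s in states))
```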
Boltzmann Machine Learning

 Learning algorithms for Boltzmann machines are usually based on maximum likelihood.
 One interesting property of Boltzmann machines when trained with learning rules based on maximum likelihood is that the update for a particular weight connecting two units depends only on the statistics of those two units, collected under different distributions.
 The weight can be updated without knowing anything about the rest of the network or how those statistics were produced.
 This means that the learning rule is "local," which makes Boltzmann machine learning somewhat biologically plausible.
 If each neuron were a random variable in a Boltzmann machine, then the axons and dendrites connecting two random variables could learn only by observing the firing pattern of the cells that they actually physically touch.
 In the positive phase, two units that frequently activate together have their connection strengthened. This is an example of a Hebbian learning rule, captured by the mnemonic "fire together, wire together."
RESEARCH AREAS -- DEEP BELIEF NETWORKS

 Deep belief networks (DBNs) were one of the first non-convolutional models to successfully admit training of deep architectures (Hinton et al., 2006; Hinton, 2007b).
• A DBN is an algorithm for unsupervised probabilistic deep learning.
 Deep belief networks are generative models with several layers of latent variables.
 The latent variables are typically binary, while the visible units may be binary or real. There are no intralayer connections.
 Every unit in each layer is connected to every unit in each neighboring layer, though it is possible to construct more sparsely connected DBNs. The connections between the top two layers are undirected.
 The connections between all other layers are directed, with the arrows pointed toward the layer that is closest to the data.
[Figure: the graphical structure of a deep belief network with visible units v_1, v_2, v_3 and hidden layers]

 A DBN with l hidden layers contains l weight matrices W^(1), ..., W^(l). It also contains l + 1 bias vectors b^(0), ..., b^(l), with b^(0) providing the biases for the visible layer. The probability distribution represented by the DBN is given by an undirected model over the top two layers,

p(h^(l), h^(l−1)) ∝ exp(b^(l)·h^(l) + b^(l−1)·h^(l−1) + h^(l−1)^T W^(l) h^(l)),

 directed layers below it,

p(h_i^(k) = 1 | h^(k+1)) = σ(b_i^(k) + W_{:,i}^(k+1)·h^(k+1)),

 and a visible layer p(v_i = 1 | h^(1)) = σ(b_i^(0) + W_{:,i}^(1)·h^(1)). In the case of real-valued visible units, v ~ N(v; b^(0) + W^(1)^T h^(1), β^(−1)), with β diagonal for tractability.
 Deep belief networks incur many of the problems associated with both directed models and undirected models.
 To train a deep belief network, one begins by training an RBM to maximize E_{v∼p_data} log p(v) using contrastive divergence or stochastic maximum likelihood.
 The parameters of the RBM then define the parameters of the first layer of the DBN. Next, a second RBM is trained to approximately maximize

E_{v∼p_data} E_{h^(1)∼p^(1)(h^(1)|v)} log p^(2)(h^(1)),

 where p^(1) is the probability distribution represented by the first RBM and p^(2) is the probability distribution represented by the second RBM.
 In other words, the second RBM is trained to model the distribution defined by sampling the hidden units of the first RBM, when the first RBM is driven by the data.
 This procedure can be repeated indefinitely, to add as many layers to the DBN as desired, with each new RBM modeling the samples of the previous one.
 Each RBM defines another layer of the DBN. This procedure can be justified as increasing a variational lower bound on the log-likelihood of the data under the DBN.
 The trained DBN may be used directly as a generative model, but most of the interest in DBNs arose from their ability to improve classification models.
 We can take the weights from the DBN and use them to define an MLP:

h^(1) = σ(b^(1) + v^T W^(1)),   h^(k) = σ(b^(k) + h^(k−1)^T W^(k))   for k = 2, ..., l.

 The term "deep belief network" may also cause some confusion because the term "belief network" is sometimes used to refer to purely directed models, while deep belief networks contain an undirected layer. Deep belief networks also share the acronym DBN with dynamic Bayesian networks (Dean and Kanazawa, 1989), which are Bayesian networks for representing Markov chains.
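 A sketch of how pretrained DBN weights can define such an MLP is given below. The layer sizes are assumptions and the weights are random stand-ins; in practice they would come from the layer-wise RBM training, with a classifier trained on top of the resulting features.

```python
# Sketch of reusing stacked DBN/RBM weights as a deterministic MLP forward pass:
#   h(1) = sigma(b(1) + v^T W(1)),  h(k) = sigma(b(k) + h(k-1)^T W(k)).
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

layer_dims = [20, 12, 6]                                 # visible dim followed by hidden dims (assumed)
weights = [rng.normal(scale=0.1, size=(layer_dims[k], layer_dims[k + 1]))
           for k in range(len(layer_dims) - 1)]          # stand-ins for W(1), ..., W(l)
biases = [np.zeros(n) for n in layer_dims[1:]]           # stand-ins for b(1), ..., b(l)

def dbn_mlp_features(v):
    """Deterministic bottom-up pass through the stacked layers."""
    h = v
    for W, b in zip(weights, biases):
        h = sigmoid(h @ W + b)
    return h

v = rng.normal(size=layer_dims[0])
print("top-level features:", dbn_mlp_features(v))
```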
 The Deep Belief Network's operational pipeline is as follows:
• We use the greedy learning algorithm to pre-train the DBN. For learning the top-down generative weights, the greedy learning method employs a layer-by-layer approach. These generative weights determine the relationship between variables in one layer and variables in the layer above.
• On the top two hidden layers, we run numerous steps of Gibbs sampling. The top two hidden layers define the RBM, so this stage effectively extracts a sample from it.
• Then we generate a sample from the visible units using a single pass of ancestral sampling through the rest of the model.
• We use a single bottom-up pass to infer the values of the latent variables in each layer. Greedy pretraining begins with an observed data vector in the bottom layer and then fine-tunes the generative weights in the opposite (top-down) direction.
 Contrastive Divergence:
 The RBM adjusts its weights by this method. Using some randomly assigned initial weights, the RBM computes the hidden nodes, which in turn use the same weights to reconstruct the input nodes.
 Each hidden node is constructed from all the visible nodes and each visible node is reconstructed from all the hidden nodes; hence, the input is different from the reconstructed input, even though the weights are the same.
 The process continues until the reconstructed input matches the previous input. The process is said to have converged at this stage. This entire procedure is known as Gibbs sampling.
[Fig: Gibbs sampling]
In the positive phase, the binary states of the hidden layer are calculated by computing probabilities from the weights and the visible units.
It is known as the positive phase since it enhances the likelihood of the training data set.
The negative phase reduces the likelihood of samples produced by the model.
To train a complete Deep Belief Network, several RBMs are stacked together; a contrastive-divergence sketch for a single RBM is given below.
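 The sketch below implements one contrastive-divergence (CD-1) update for a single binary RBM, combining the positive and negative phases described above. The sizes, random data, and single Gibbs step are illustrative assumptions.

```python
# Sketch of CD-1 training for a binary RBM, the building block stacked to form a DBN.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_vis, n_hid, lr = 6, 3, 0.1
W = rng.normal(scale=0.1, size=(n_vis, n_hid))
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)

v0 = rng.integers(0, 2, size=(10, n_vis)).astype(float)   # a mini-batch of binary data

for step in range(100):
    # Positive phase: hidden probabilities and samples given the data.
    ph0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)

    # One Gibbs step: reconstruct the visible units, then recompute hidden probabilities.
    pv1 = sigmoid(h0 @ W.T + b_vis)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b_hid)

    # Negative-phase statistics are subtracted from positive-phase statistics.
    W     += lr * (v0.T @ ph0 - v1.T @ ph1) / len(v0)
    b_vis += lr * (v0 - v1).mean(axis=0)
    b_hid += lr * (ph0 - ph1).mean(axis=0)

recon = sigmoid(sigmoid(v0 @ W + b_hid) @ W.T + b_vis)
print("reconstruction error:", ((v0 - recon) ** 2).mean())
```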
Applications

 We can employ deep belief networks in place of deep feedforward networks or even convolutional neural networks.
 They have the benefit of being less computationally costly (computational complexity grows linearly with the number of layers, rather than exponentially as with feedforward neural networks) and are less susceptible to the vanishing gradients problem.
 Applications of DBNs are as follows:
• Image recognition.
• Video sequences.
• Motion-capture (mocap) data.
• Speech recognition.
