DEEP LEARNING
Module-5 PART -II
In the skip-gram variant of word2vec, the target word is used as an input, and the model then attempts to predict one of the words in the context.
The first technique is a multinomial model which predicts one word out of d outcomes.
The second model is a Bernoulli model, which predicts whether or not each context word is present for a particular target word. The second model uses negative sampling of contexts for better efficiency and accuracy.
Neural Embedding with Continuous Bag of Words
In the continuous bag-of-words (CBOW) model, the training
pairs are all context-word pairs in which a window of context
words is input, and a single target word is predicted.
The context contains 2·t words, corresponding to t words both before and after the target word.
We will use the length m = 2·t to define the length of the context. Therefore, the input to the system is a set of m words.
Without loss of generality, let the subscripts of these words be numbered so that they are denoted by w1 ... wm, and let the target (output) word in the middle of the context window be denoted by w.
Note that w can be viewed as a categorical variable with d possible values, where d is the size of the lexicon.
The goal of the neural embedding is to compute the probability
P(w|w1w2 ...wm) and maximize the product of these
probabilities over all training samples
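As a small illustration of how such training samples can be generated, the sketch below builds (context, target) pairs from a toy tokenized sentence; the helper name and the toy corpus are illustrative assumptions, not part of any reference word2vec implementation.

```python
# Minimal sketch: building (context, target) training pairs for CBOW.
# The helper name and the toy corpus are illustrative only.

def cbow_pairs(tokens, t=2):
    """Yield (context_words, target_word) pairs with up to t words on each side."""
    for pos, target in enumerate(tokens):
        left = tokens[max(0, pos - t):pos]        # up to t words before the target
        right = tokens[pos + 1:pos + 1 + t]       # up to t words after the target
        context = left + right                    # at most m = 2*t context words
        if context:                               # skip degenerate windows
            yield context, target

tokens = "the quick brown fox jumps over the lazy dog".split()
for context, target in cbow_pairs(tokens, t=2):
    print(context, "->", target)
```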
Figure 2.15: Word2vec: The CBOW model
In the architecture, we have a single input layer with m × d nodes, a hidden layer with p nodes, and an output layer with d nodes.
The nodes in the input layer are clustered into m different
groups, each of which has d units. Each group with d input
units is the one-hot encoded input vector of one of the m
context words being modeled by CBOW.
Only one of these d inputs will be 1 and the remaining inputs will be 0.
Therefore, one can represent an input xij with two indices
corresponding to contextual position and word identifier.
The input xij ∈ {0, 1} contains two indices i and j in the subscript, where i ∈ {1 ... m} is the position of the context, and j ∈ {1 ... d} is the identifier of the word.
The hidden layer contains p units, where p is the dimensionality of the hidden layer in word2vec.
Let h1, h2, ..., hp be the outputs of the hidden layer nodes.
Note that each of the d words in the lexicon has m different
representatives in the input layer corresponding to the m
different context words, but the weight of each of these m
connections is the same.
Such weights are referred to as shared. Sharing weights is a common trick used for regularization in neural networks.
Let the shared weight of each connection from the jth word in
the lexicon to the qth hidden layer node be denoted by ujq.
Note that each of the m groups in the input layer has
connections to the hidden layer that are defined by the same d
× p weight matrix U. The vector uj = (uj1, uj2, ..., ujp) can be viewed as the p-dimensional embedding of the jth input word over the entire corpus, and h = (h1 ... hp) provides the embedding of a specific instantiation of an input context.
Then, the output of the hidden layer is obtained by averaging the embeddings of the words present in the context.
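Written out with the notation above, this averaging is:

$$h_q = \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{d} u_{jq}\, x_{ij}, \qquad q \in \{1, \dots, p\}$$

(equivalently, h is the average of the uj vectors of the m context words).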
The embedding (h1 ... hp) is used to predict the probability that the target word is each of the d outputs, with the use of the softmax function. The weights in the output layer are parameterized by a p × d matrix V = [vqj].
The jth column of V is denoted by vj. The output after applying softmax creates d output values ŷ1 ... ŷd, which are real values in (0, 1).
These real values sum to 1 because they can be interpreted as
probabilities.
The ground-truth value of only one of the outputs y1 ...yd is 1
and the remaining values are 0 for a given training instance.
One can write this condition as Σj yj = 1, with each yj ∈ {0, 1}.
The softmax function computes the probability P(w|w1 ... wm) of the one-hot encoded ground-truth outputs yj as follows:
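$$\hat{y}_j = P(y_j = 1 \mid w_1 \dots w_m) = \frac{\exp(h \cdot v_j)}{\sum_{k=1}^{d} \exp(h \cdot v_k)}, \qquad j \in \{1, \dots, d\}$$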
For a particular target word w = r ∈ {1 ... d}, the loss function is given by L = −log[P(yr = 1|w1 ... wm)] = −log(ŷr). The use of the negative logarithm turns the multiplicative likelihoods over different training instances into an additive loss function using log-likelihoods.
The updates are defined by using the backpropagation algorithm, as training instances are passed through the neural network one by one.
The loss function can be used to compute the gradients of the weight matrix V in the output layer, and backpropagation can be used to update the weight matrix U between the input and hidden layer. The update equations with learning rate α are given below, after the prediction errors εj are defined.
The probability of making a mistake in prediction on the jth word in the lexicon is defined by |yj − ŷj|. However, we use signed mistakes εj, in which only the correct word with yj = 1 is given a positive mistake value, while all the other words in the lexicon receive negative mistake values. This is achieved by dropping the modulus:
εj = yj − ŷj
Note that εj can also be shown to be equal to the negative of the derivative of the cross-entropy loss with respect to the jth input into the softmax layer (which is h · vj).
The updates for a particular input context and output word are as follows:
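With h denoting the averaged context embedding, one standard form of these updates (a sketch consistent with the definitions above) is:

$$v_j \Leftarrow v_j + \alpha\, \varepsilon_j\, h \quad \forall j \in \{1, \dots, d\}, \qquad u_i \Leftarrow u_i + \frac{\alpha}{m} \sum_{j=1}^{d} \varepsilon_j\, v_j \quad \text{for each word } i \text{ in the context window}$$

The numpy sketch below performs one such training step; the array names and sizes (U as a d × p input-embedding matrix, V as a p × d output matrix) are illustrative assumptions, not a reference implementation.

```python
# Minimal numpy sketch of one CBOW training step (illustrative, not optimized).
import numpy as np

def cbow_step(U, V, context_ids, target_id, lr=0.025):
    d, p = U.shape
    h = U[context_ids].mean(axis=0)                 # averaged context embedding, shape (p,)
    scores = h @ V                                  # h . v_j for every word j, shape (d,)
    scores -= scores.max()                          # numerical stability
    y_hat = np.exp(scores) / np.exp(scores).sum()   # softmax probabilities
    y = np.zeros(d)
    y[target_id] = 1.0
    eps = y - y_hat                                 # signed mistakes eps_j = y_j - y_hat_j
    grad_h = V @ eps                                # sum_j eps_j * v_j, shape (p,)
    V += lr * np.outer(h, eps)                      # v_j <- v_j + lr * eps_j * h
    np.add.at(U, context_ids, lr * grad_h / len(context_ids))  # u_i <- u_i + (lr/m) sum_j eps_j v_j
    return -np.log(y_hat[target_id])                # loss for this training pair

# Toy usage with random parameters (vocabulary d = 50, embedding size p = 8)
rng = np.random.default_rng(0)
U = rng.normal(0.0, 0.1, (50, 8))
V = rng.normal(0.0, 0.1, (8, 50))
print(cbow_step(U, V, context_ids=[3, 17, 29, 41], target_id=5))
```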
Neural Embedding with the Skip-Gram Model
In the case of the skip-gram model, one can collapse the m identical outputs into a single output, and achieve the same results simply by using a particular type of mini-batching during stochastic gradient descent, as shown in fig. (a).
All elements of a single context window are always forced to belong to the same mini-batch, as shown in fig. (b).
The output of the hidden layer can be computed from the input layer using the d × p matrix of weights U = [ujq] between the input and hidden layer as follows:
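$$h_q = \sum_{j=1}^{d} u_{jq}\, x_j, \qquad q \in \{1, \dots, p\}$$

Since the input x is the one-hot encoding of the single target word, h is simply the row of U corresponding to that word.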
the hidden layer is connected to m groups of d output nodes, each
of which is connected to the hidden layer with a p × d matrix V =
[vqj ].
Each of these m groups of d output nodes computes the
probabilities of the various words for a particular context word.
The jth column of V is denoted by vj and represents the output
embedding of the jth word. The output ŷij is the probability that the word in the ith context position takes on the jth word of the lexicon.
The neural network predicts the same multinomial distribution for each of the context words. Therefore, we have the following:
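$$\hat{y}_{ij} = P(y_{ij} = 1 \mid w) = \frac{\exp(h \cdot v_j)}{\sum_{k=1}^{d} \exp(h \cdot v_k)}, \qquad \forall i \in \{1, \dots, m\}$$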
Note that the probability ŷij is the same for varying i and fixed j, since the right-hand side of the above equation does not depend on the exact location i in the context window.
The loss function for the backpropagation algorithm is the negative of the log-likelihood values of the ground truth yij ∈ {0, 1} of a training instance. This loss function L is given by the following:
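$$L = -\sum_{i=1}^{m} \sum_{j=1}^{d} y_{ij} \log(\hat{y}_{ij})$$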
The update equations with learning rate α are as follows:
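With εij = yij − ŷij denoting the signed prediction error and r the index of the input target word, a standard form of these updates (a sketch consistent with the loss above) is:

$$u_r \Leftarrow u_r + \alpha \sum_{i=1}^{m} \sum_{j=1}^{d} \varepsilon_{ij}\, v_j, \qquad v_j \Leftarrow v_j + \alpha \sum_{i=1}^{m} \varepsilon_{ij}\, h \quad \forall j \in \{1, \dots, d\}$$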
Here, α > 0 is the learning rate. The p-dimensional rows of the
matrix U are used as the embeddings of the words. In other words,
the convention is to use the input embeddings in the rows of U
rather than the output embeddings in the columns of V .
Global Vectors for Word Representation (GloVe)
Training is performed on aggregated global word-word co-
occurrence statistics from a corpus, and the resulting
representations showcase interesting linear substructures of the
word vector space.
GloVe is a specific weighted least squares model that trains on global word-word co-occurrence counts and thus makes efficient use of statistics.
The model is called GloVe, for Global Vectors, because the global corpus statistics are captured directly by the model.
The Euclidean distance (or cosine similarity) between two word vectors provides an effective method for measuring the linguistic or semantic similarity of the corresponding words.
The similarity metrics used for nearest neighbor evaluations produce a single scalar that quantifies the relatedness of two words. This simplicity can be problematic since two given words almost always exhibit more intricate relationships than can be captured by a single number.
For example, man may be regarded as similar to woman in that both words describe human beings.
The GloVe model is trained on the non-zero entries of a global word-word co-occurrence matrix, which tabulates how frequently words co-occur with one another in a given corpus. Populating this matrix requires a single pass through the entire corpus to collect the statistics.
For large corpora, this pass can be computationally expensive, but it is a one-time up-front cost. Subsequent training iterations are much faster because the number of non-zero matrix entries is typically much smaller than the total number of words in the corpus.
The main intuition underlying the model is the simple
observation that ratios of word-word co-occurrence
probabilities have the potential for encoding some form of
meaning.
For example, ice co-occurs more frequently with solid than it does with gas, whereas steam co-occurs more frequently with gas than it does with solid.
Both words co-occur with their shared property water
frequently, and both co-occur with the unrelated word
fashion infrequently.
Only in the ratio of probabilities does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific to steam.
Compared to the raw probabilities, the ratio is better able to distinguish relevant words (solid and gas) from irrelevant words (water and fashion), and it is also better able to discriminate between the two relevant words. Since the ratio Pik/Pjk depends on three words i, j, and k, the most general model takes the form:
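$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}} \qquad \text{(Eqn. (1) of the GloVe paper)}$$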
where w ∈ R^d are word vectors and w̃ ∈ R^d are separate context word vectors.
The right-hand side is extracted from the corpus, and F may
depend on some as-of-yet unspecified parameters.
We can restrict our consideration to those functions F that depend only on the difference of the two target words:
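$$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}} \qquad \text{(Eqn. (2))}$$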
Next, note that the arguments of F in this equation are vectors, while the right-hand side is a scalar. While F could be taken to be a complicated function parameterized by, e.g., a neural network, doing so would obfuscate the linear structure we are trying to capture. To avoid this issue, we can first take the dot product of the arguments:
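$$F\big((w_i - w_j)^{\top} \tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}} \qquad \text{(Eqn. (3))}$$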
Note that for word-word co-occurrence matrices, the distinction between a word and a context word is arbitrary. First, we require that F be a homomorphism between the groups (R, +) and (R>0, ×), i.e.,
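$$F\big((w_i - w_j)^{\top}\tilde{w}_k\big) = \frac{F(w_i^{\top}\tilde{w}_k)}{F(w_j^{\top}\tilde{w}_k)} \qquad \text{(Eqn. (4))}$$

which is solved by F = exp, so that

$$F(w_i^{\top}\tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i} \quad \text{(Eqn. (5))}, \qquad w_i^{\top}\tilde{w}_k = \log(P_{ik}) = \log(X_{ik}) - \log(X_i) \quad \text{(Eqn. (6))}$$

Here X_ik denotes the number of times word k occurs in the context of word i, and X_i = Σ_k X_ik.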
Next, we note that Eqn. (6) would exhibit the exchange symmetry if not for the log(Xi) on the right-hand side. However, this term is independent of k, so it can be absorbed into a bias bi for wi. Finally, adding an additional bias b̃k for w̃k restores the symmetry:
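$$w_i^{\top}\tilde{w}_k + b_i + \tilde{b}_k = \log(X_{ik}) \qquad \text{(Eqn. (7))}$$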
Introducing a weighting function f(Xij) into the cost function gives us the model:
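$$J = \sum_{i,j=1}^{V} f(X_{ij})\,\big(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\big)^2 \qquad \text{(Eqn. (8))}$$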
where V is the size of the vocabulary. The weighting function should obey the following properties:
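As listed in the GloVe paper:
• f(0) = 0; if f is viewed as a continuous function, it should vanish as x → 0 fast enough that the limit of f(x) log²x is finite.
• f(x) should be non-decreasing, so that rare co-occurrences are not overweighted.
• f(x) should be relatively small for large values of x, so that frequent co-occurrences are not overweighted.
The paper uses the weighting function f(x) = (x/x_max)^α for x < x_max and f(x) = 1 otherwise, with α = 3/4 and x_max = 100 found to work well.
A minimal numpy sketch of the resulting objective, evaluated over the non-zero entries of a toy co-occurrence matrix, is given below; the array names (W, W_tilde, b, b_tilde, X) and sizes are illustrative assumptions, not part of any official implementation.

```python
# Minimal numpy sketch of the GloVe objective J over non-zero co-occurrence counts.
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x) proposed in the GloVe paper."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """Weighted least-squares cost J, computed only on non-zero entries of X."""
    i_idx, j_idx = np.nonzero(X)                  # train only on non-zero counts
    x = X[i_idx, j_idx]
    pred = (W[i_idx] * W_tilde[j_idx]).sum(axis=1) + b[i_idx] + b_tilde[j_idx]
    return np.sum(glove_weight(x) * (pred - np.log(x)) ** 2)

# Toy usage with a random 6-word vocabulary and 10-dimensional vectors
rng = np.random.default_rng(0)
V_size, p = 6, 10
X = rng.integers(0, 5, size=(V_size, V_size)).astype(float)
W, W_tilde = rng.normal(size=(V_size, p)), rng.normal(size=(V_size, p))
b, b_tilde = np.zeros(V_size), np.zeros(V_size)
print(glove_loss(W, W_tilde, b, b_tilde, X))
```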
The basic idea behind the GloVe word embedding is to derive the relationship between the words from statistics. Unlike the occurrence matrix, the co-occurrence matrix tells you how often a particular word pair occurs together. Each value in the co-occurrence matrix represents a pair of words occurring together.
The advantage of GloVe is that, unlike Word2vec, GloVe does not rely just on local statistics (local context information of words), but incorporates global statistics (word co-occurrence) to obtain word vectors.
Disadvantages
• Because it uses a co-occurrence matrix & global information, the memory cost of GloVe is higher than that of word2vec.
• Similar to word2vec, it does not solve the problem of
polysemous words since words & vectors have a one-to-one
relationship.
RESEARCH AREAS--AUTOENCODERS
Autoencoders are very useful in the field of unsupervised
machine learning.
We can use them to compress the data and reduce its
dimensionality.
An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning).
An autoencoder learns two functions: an encoding function
that transforms the input data, and a decoding function that
recreates the input data from the encoded representation.
The main difference between Autoencoders and Principal Component Analysis (PCA) is that while PCA finds the directions along which you can project the data with maximum variance, Autoencoders reconstruct the original input given just a compressed version of it.
Figure 6-4. The autoencoder architecture attempts to compress a high-dimensional input into a low-dimensional embedding and then uses that low-dimensional embedding to reconstruct the input.
An autoencoder is a neural network that is trained to attempt to copy its input to its output. It has a hidden layer h that describes a code used to represent the input.
The network may be viewed as consisting of two parts: an encoder function
h = f (x) and a decoder that produces a reconstruction r = g(h).
A classical example of the dimensionality reduction setting is the autoencoder, which recreates the outputs from the inputs. Therefore, the number of outputs and inputs is equal.
The constricted hidden layer in the middle outputs the
reduced representation of each instance.
As a result of this constriction, there is some loss in the
representation, which typically corresponds to the noise in the
data.
The outputs of the hidden layer correspond to the reduced representation of the data.
An Autoencoder is a type of neural network that can learn to reconstruct images, text, and other data from compressed versions of themselves.
An Autoencoder consists of three layers:
1. Encoder
2. Code
3. Decoder
The Encoder layer compresses the input image into a latent
space representation. It encodes the input image as a
compressed representation in a reduced dimension.
The compressed image is a distorted version of the original
image.
The Code layer represents the compressed input fed to the
decoder layer.
The decoder layer decodes the encoded image back to the original dimension. The decoded image is reconstructed from the latent space representation and is a lossy reconstruction of the original image.
Training Autoencoders:
Training an autoencoder is unsupervised in the sense that no
labeled data is needed.
The training process is still based on the optimization of a
cost function.
The cost function measures the error between the input x and its reconstruction at the output x̂.
An autoencoder is composed of an encoder and a decoder.
When you're building an autoencoder, there are a few things to keep in mind.
First, the code or bottleneck size is the most critical hyperparameter when tuning the autoencoder. It decides how much the data has to be compressed. It can also act as a regularization term.
Secondly, it's important to remember that the number of
layers is critical when tuning autoencoders. A higher
depth increases model complexity, but a lower depth is
faster to process.
Thirdly, you should pay attention to how many nodes you use
per layer. The number of nodes decreases with each
subsequent layer in the autoencoder as the input to each
layer becomes smaller across the layers.
An autoencoder whose code dimension is less than the input dimension is called undercomplete.
Learning an undercomplete representation forces the
autoencoder to capture the most salient features of the
training data.
The learning process is described simply as minimizing a
loss function
L(x, g(f(x)))
where L is a loss function penalizing g(f (x)) for being
dissimilar from x, such as the mean squared error.
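As a concrete illustration, here is a minimal sketch of an undercomplete autoencoder trained to minimize this loss; it assumes PyTorch is available, and the layer sizes, names, and toy data are illustrative only.

```python
# Minimal PyTorch sketch of an undercomplete autoencoder (illustrative sizes).
import torch
from torch import nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoder f: input -> code (the bottleneck h)
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, code_dim),
        )
        # Decoder g: code -> reconstruction r = g(f(x))
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        h = self.encoder(x)
        return self.decoder(h)

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                      # L(x, g(f(x))) as mean squared error

x = torch.rand(64, 784)                     # a toy unlabeled batch
for _ in range(5):                          # a few unsupervised training steps
    opt.zero_grad()
    loss = loss_fn(model(x), x)             # penalize g(f(x)) for differing from x
    loss.backward()
    opt.step()
print(loss.item())
```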
Autoencoders have various use-cases like:
• Anomaly detection: autoencoders can identify data anomalies using a loss function that penalizes model complexity. This can be helpful for anomaly detection in financial markets, where it can be used to identify unusual activity and predict market trends.
• Data denoising (image and audio): autoencoders can help clean up noisy pictures or audio files by learning to remove the noise from images or audio recordings.
• Image inpainting: autoencoders have been used to fill in gaps in images
by learning how to reconstruct missing pixels based on surrounding
pixels. For example, if you're trying to restore an old photograph that's
missing part of its right side, the autoencoder could learn how to fill in the
missing details based on what it knows about the rest of the photo.
• Information retrieval: autoencoders can be used as content-based image
retrieval systems that allow users to search for images based on their
content.
RESEARCH AREAS--REPRESENTATION LEARNING
Representation Learning is concerned with training machine
learning algorithms to learn useful representations, e.g. those
that are interpretable, have latent features, or can be used for
transfer learning.
The goal is to learn representations of the data that make it easier to extract useful information when building classifiers or other predictors.
Feedforward networks trained by supervised learning can be viewed as performing a kind of representation learning: the last layer of the network is typically a linear classifier, such as a softmax regression classifier.
The rest of the network learns to provide a representation to this classifier. Training with a supervised criterion naturally leads to the representation at every hidden layer (but more so near the top hidden layer) taking on properties that make the classification task easier.
Representation learning is particularly interesting because it provides one way to perform unsupervised and semi-supervised learning.
Often we have large amounts of unlabeled training data and relatively little labeled training data.
Training with supervised learning techniques on the labeled
subset often results in severe overfitting.
Semi-supervised learning offers the chance to resolve this overfitting problem by also learning from the unlabeled data. Specifically, we can learn good representations from the unlabeled data, and then use these representations to solve the supervised learning task.
Greedy layer-wise unsupervised pretraining relies on a
single-layer representation learning algorithm such as an
RBM, a single-layer autoencoder, etc.
Each layer is pretrained using unsupervised learning, taking
the output of the previous layer and producing as output a
new representation of the data
Greedy layer-wise pretraining is called greedy because it is a
greedy algorithm, meaning that it optimizes each piece of
the solution independently, one piece at a time, rather than
jointly optimizing all pieces. It is called layer-wise because
these independent pieces are the layers of the network.
Greedy layer-wise pretraining proceeds one layer at a time, training the k-th layer while keeping the previous ones fixed. In particular, the lower layers (which are trained first) are not adapted after the upper layers are introduced.
It is called unsupervised because each layer is trained with an
unsupervised representation learning algorithm.
However, it is also called pretraining, because it is supposed to be only a first step before a joint training algorithm is applied to fine-tune all the layers together.
In the context of a supervised learning task, it can be viewed as a
regularizer (in some experiments, pretraining decreases test error
without decreasing training error) and a form of parameter
initialization
The term pretraining may also refer to the entire two-phase protocol that combines the pretraining phase and a supervised learning phase.
The supervised learning phase may involve training a simple
classifier on top of the features learned in the pretraining phase,
or it may involve supervised fine-tuning of the entire network
learned in the pretraining phase.
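A minimal sketch of this two-phase protocol, using single-layer autoencoders as the unsupervised learners and assuming PyTorch is available; all sizes, names, and the toy data are illustrative assumptions.

```python
# Sketch: greedy layer-wise pretraining with one-hidden-layer autoencoders,
# then reusing the pretrained encoders to initialize a classifier.
import torch
from torch import nn

def pretrain_layer(data, in_dim, out_dim, steps=100, lr=1e-3):
    """Train a single-layer autoencoder on `data`; return the trained encoder."""
    enc, dec = nn.Linear(in_dim, out_dim), nn.Linear(out_dim, in_dim)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = dec(torch.relu(enc(data)))
        loss = nn.functional.mse_loss(recon, data)
        loss.backward()
        opt.step()
    return enc

# Unlabeled data and the layer sizes of the network we want to initialize
unlabeled = torch.rand(256, 100)
sizes = [100, 64, 32]

encoders, h = [], unlabeled
for in_dim, out_dim in zip(sizes[:-1], sizes[1:]):
    enc = pretrain_layer(h, in_dim, out_dim)   # train the k-th layer, lower layers fixed
    encoders.append(enc)
    h = torch.relu(enc(h)).detach()            # representation fed to the next layer

# Supervised phase: stack the pretrained encoders, add a classifier on top,
# then fine-tune the whole network with labeled data (fine-tuning not shown).
layers = []
for enc in encoders:
    layers += [enc, nn.ReLU()]
classifier = nn.Sequential(*layers, nn.Linear(sizes[-1], 10))
print(classifier)
```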
Distributed Representation--Distributed representations of concepts (representations composed of many elements that can be set separately from each other) are one of the most important tools for representation learning.
An important related concept that distinguishes a distributed representation from a symbolic one is that generalization arises due to shared attributes between different concepts. As pure symbols, "cat" and "dog" are as far from each other as any other two symbols.
For example, our distributed representation may contain entries such as "has_fur" or "number_of_legs" that have the same value for the embedding of both "cat" and "dog."
Neural language models that operate on distributed representations of words generalize much better than other models that operate directly on one-hot representations of words.
Distributed representations induce a rich similarity space, in which semantically close concepts (or inputs) are close in distance, a property that is absent from purely symbolic representations.
RESEARCH AREAS--BOLTZMANN MACHINES
Boltzmann machines were originally introduced as a general "connectionist" approach to learning arbitrary probability distributions over binary vectors.
We define the Boltzmann machine over a d-dimensional binary random vector x ∈ {0, 1}^d.
The Boltzmann machine is an energy-based model, meaning we define the joint probability distribution using an energy function:
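$$P(x) = \frac{\exp(-E(x))}{Z}, \qquad E(x) = -x^{\top} U x - b^{\top} x$$

where U is the matrix of weights, b is the vector of biases, and Z is the partition function that normalizes the distribution.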
The update for the weight connecting two units depends only on the statistics of those two units, collected under different distributions. The weight can be updated without knowing anything about the rest of the network or how those statistics were produced.
This means that the learning rule is "local," which makes Boltzmann machine learning somewhat biologically plausible.
If each neuron were a random variable in a Boltzmann machine, then the axons and dendrites connecting two random variables could learn only by observing the firing pattern of the cells that they actually physically touch.
In the positive phase, two units that frequently activate together
have their connection strengthened. This is an example of a
Hebbian learning rule--the mnemonic “fire together, wire
together.”
RESEARCH AREAS--DEEP BELIEF NETWORKS
Deep belief networks (DBNs) were one of the first non-
convolutional models to successfully admit training of deep
architectures (Hinton et al., 2006; Hinton, 2007b).
• DBN is an algorithm for unsupervised probabilistic deep learning.
Deep belief networks are generative models with several layers
of latent variables.
The latent variables are typically binary, while the visible units
may be binary or real. There are no intralayer connections; every unit in each layer is connected to every unit in each neighboring layer, though it is possible to construct more sparsely connected DBNs. The connections between the top two layers are undirected.
The connections between all other layers are directed, with the arrows pointed toward the layer that is closest to the data.
A DBN with l hidden layers contains l weight matrices: W(1), ..., W(l). It also contains l + 1 bias vectors: b(0), ..., b(l), with b(0) providing the biases for the visible layer. The probability distribution represented by the DBN is given by:
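In the standard formulation for binary units, with W(k) denoting the weights between layers k−1 and k, the distribution is:

$$P(h^{(l)}, h^{(l-1)}) \propto \exp\!\big(b^{(l)\top} h^{(l)} + b^{(l-1)\top} h^{(l-1)} + h^{(l-1)\top} W^{(l)} h^{(l)}\big)$$

$$P(h^{(k)}_i = 1 \mid h^{(k+1)}) = \sigma\!\big(b^{(k)}_i + W^{(k+1)}_{i,:}\, h^{(k+1)}\big), \qquad P(v_i = 1 \mid h^{(1)}) = \sigma\!\big(b^{(0)}_i + W^{(1)}_{i,:}\, h^{(1)}\big)$$

In the case of real-valued visible units, one substitutes

$$v \sim \mathcal{N}\!\big(v;\; b^{(0)} + W^{(1)} h^{(1)},\; \beta^{-1}\big)$$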
with β diagonal for tractability.
Deep belief networks incur many of the problems associated with both directed models and undirected models.
To train a deep belief network, one begins by training an RBM to maximize E_{v∼p_data}[log p(v)] using contrastive divergence or stochastic maximum likelihood.
The parameters of the RBM then define the parameters of the first layer of the DBN. Next, a second RBM is trained to approximately maximize
E_{v∼p_data} E_{h(1)∼p(1)(h(1)|v)}[log p(2)(h(1))],
where p(1) is the probability distribution represented by the first RBM and p(2) is the probability distribution represented by the second RBM.
In other words, the second RBM is trained to model the
distribution defined by sampling the hidden units of the first
RBM, when the first RBM is driven by the data.
This procedure can be repeated indefinitely, to add as many layers to the DBN as desired, with each new RBM modeling the samples of the previous one.
Each RBM defines another layer of the DBN. This
procedure can be justified as increasing a variational lower
bound on the log-likelihood of the data under the DBN
The trained DBN may be used directly as a generative model, but most of the interest in DBNs arose from their ability to improve classification models. We can take the weights from the DBN and use them to define an MLP:
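In the standard formulation, the MLP defined from the DBN weights and biases is:

$$h^{(1)} = \sigma\big(b^{(1)} + v^{\top} W^{(1)}\big), \qquad h^{(\ell)} = \sigma\big(b^{(\ell)} + h^{(\ell-1)\top} W^{(\ell)}\big) \quad \forall \ell \in \{2, \dots, l\}$$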
The term “deep belief network” may also cause some
confusion because the term “belief network” is sometimes
used to refer to purely directed models, while deep belief
networks contain an undirected layer. Deep belief
networks also share the acronym DBN with dynamic
Bayesian networks (Dean and Kanazawa, 1989), which are
Bayesian networks for representing Markov chains.
The Deep Belief Network's operational pipeline is as follows:
• We use the greedy learning algorithm to pre-train the DBN. The greedy learning method employs a layer-by-layer approach for learning the top-down generative weights. These generative weights determine the relationship between variables in one layer and variables in the layer above.
• On the top two hidden layers, we run numerous steps of Gibbs sampling. The top two hidden layers define an RBM; thus, this stage effectively extracts a sample from it.
• Then we generate a sample from the visible units using a single pass of ancestral sampling through the rest of the model.
• We use a single bottom-up pass to infer the values of the latent variables in each layer. Greedy pretraining begins with an observed data vector in the bottom layer, and it then fine-tunes the generative weights in the opposite (top-down) direction.
Contrastive Divergence:
The RBM adjusts its weights by this method. Using some randomly assigned initial weights, the RBM calculates the hidden nodes, which in turn use the same weights to reconstruct the visible (input) nodes.
Each hidden node is constructed from all the visible nodes, and each visible node is reconstructed from all the hidden nodes; hence, the reconstructed input differs from the original input, even though the weights are the same.
The process continues until the reconstructed input matches the previous input, at which point the process is said to have converged. This entire procedure is known as Gibbs sampling.
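A minimal numpy sketch of one step of contrastive divergence (CD-1) for a binary RBM, matching the description above; the variable names and toy sizes are illustrative assumptions.

```python
# Minimal numpy sketch of one CD-1 update for a binary RBM.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, b_v, b_h, lr=0.1):
    """One CD-1 weight update from a batch of visible vectors v0 (batch x n_v)."""
    # Positive phase: hidden probabilities and samples given the data
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: reconstruct the visible units, then the hidden probabilities
    pv1 = sigmoid(h0 @ W.T + b_v)          # reconstruction of the input (one Gibbs step)
    ph1 = sigmoid(pv1 @ W + b_h)
    # Update: difference between data-driven and reconstruction-driven statistics
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b_v += lr * (v0 - pv1).mean(axis=0)
    b_h += lr * (ph0 - ph1).mean(axis=0)
    return np.mean((v0 - pv1) ** 2)         # reconstruction error

# Toy usage: 6 visible units, 3 hidden units, a small random binary batch
n_v, n_h = 6, 3
W = rng.normal(0, 0.1, (n_v, n_h))
b_v, b_h = np.zeros(n_v), np.zeros(n_h)
batch = (rng.random((8, n_v)) < 0.5).astype(float)
for _ in range(10):
    err = cd1_step(batch, W, b_v, b_h)
print(err)
```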
Fig: Gibbs sampling
In the positive phase, we calculate the binary states of the hidden layer by computing probabilities from the weights and the visible units. It is known as the positive phase since it increases the likelihood of the training data set.
The negative phase decreases the likelihood of samples produced by the model.
To train a complete Deep Belief Network, several RBMs are stacked and trained one after another; together they make up a Deep Belief Network.
Applications
We can employ deep belief networks in place of deep feedforward networks or even convolutional neural networks.
They have the benefit of being less computationally costly (computational complexity grows linearly with the number of layers, rather than exponentially as with feedforward neural networks) and of being less susceptible to the vanishing gradients problem.
Applications of DBN are as follows:
• Image recognition.
• Video sequences.
• Motion-capture (mocap) data.
• Speech recognition.