
Introduction to Encoder-Decoder
Models and Attention Mechanism
Dr. Dileep A. D.

Associate Professor,
Multimedia Analytics Networks And Systems (MANAS) Lab,
School of Computing and Electrical Engineering (SCEE),
Indian Institute of Technology Mandi, Kamand, H.P.
Email: [email protected]

Introduction to Encoder-Decoder Models


• We have seen 3 types of neural networks:
– Fully connected neural networks (FCNN)
– Convolutional neural networks (CNN)
– Recurrent neural networks (RNN)
• These can be combined for a wide range of
applications
– Combining them leads to encoder-decoder models
• Start by revisiting the problem of language modeling


Sequence Learning: Language Modeling


• RNNs are used for sequence learning problems
– Each input depends on the previous or future inputs
– In many applications the input is not of a fixed size
• Consider the problem of language modelling: natural sentence generation
– Given the first t − 1 words, predict the t-th word

• Example: generate the sentence "Group of people shopping vegetables"
• The word "shopping" is predicted given the words "Group", "of", "people"
• s_t is the state of the network at time step t
• s_0 is initialised randomly

[Figure: RNN unrolled over time; inputs <start>, Group, of, people, shopping, vegetables; states s0, s1, ..., s6; weight matrices U (input), W (recurrent), V (output); predicted outputs Group, of, people, shopping, vegetables, <stop>]

Sequence Learning Problem: RNN


• Sequence learning: more formally, given y_1, y_2, ..., y_{t-1}, we want to find
      $\hat{y} = \arg\max_{j \in V} P(y_t = j \mid y_1, y_2, \ldots, y_{t-1})$
  – where V is the set of all the words in the vocabulary
• Let us denote $P(y_t = j \mid y_1, y_2, \ldots, y_{t-1})$ as $P(y_t = j \mid (y)_1^{t-1})$


Sequence Learning Problem: RNN


• Sequence learning: more formally, given y_1, y_2, ..., y_{t-1}, we want to find
      $\hat{y} = \arg\max_{j \in V} P(y_t = j \mid y_1, y_2, \ldots, y_{t-1})$
  – where V is the set of all the words in the vocabulary
• Predicted distribution over the vocabulary at each time step:

              P(y2)   P(y3)   P(y4)   P(y5)   P(y6)   P(y7)
  <start>     0.00    0.00    0.00    0.00    0.00    0.00
  Group       0.60    0.10    0.10    0.10    0.10    0.10
  of          0.10    0.60    0.05    0.05    0.05    0.20
  people      0.05    0.05    0.60    0.25    0.20    0.05
  shopping    0.20    0.20    0.20    0.55    0.02    0.20
  vegetables  0.02    0.02    0.02    0.02    0.60    0.03
  <stop>      0.03    0.03    0.03    0.03    0.03    0.60

[Figure: the unrolled RNN with inputs y1 = <start>, y2 = Group, ..., y6 = vegetables, states s1, ..., s6 and weights U, W, V producing the distributions above]

Sequence Learning Problem: RNN


• Sequence learning: more formally, given y_1, y_2, ..., y_{t-1}, we want to find
      $\hat{y} = \arg\max_{j \in V} P(y_t = j \mid y_1, y_2, \ldots, y_{t-1})$
  – where V is the set of all the words in the vocabulary
• Let us denote $P(y_t = j \mid y_1, y_2, \ldots, y_{t-1})$ as $P(y_t = j \mid (y)_1^{t-1})$
• Using an RNN:
      $P(y_t = j \mid (y)_1^{t-1}) = \mathrm{softmax}(V s_t + c)_j$
• $s_t$ is the state vector (hidden representation) at time step t
• Recurrent connections ensure that information about the sequence y_1, y_2, ..., y_{t-1} is embedded in $s_t$
• Hence, $P(y_t = j \mid (y)_1^{t-1}) = P(y_t = j \mid s_t)$

[Figure: the unrolled RNN with inputs <start>, Group, of, ..., vegetables, states s0, s1, ..., sT and weights U, W, V]


Sequence Learning Problem: RNN


• Sequence learning: more formally, given y_1, y_2, ..., y_{t-1}, we want to find
      $\hat{y} = \arg\max_{j \in V} P(y_t = j \mid y_1, y_2, \ldots, y_{t-1})$
  – where V is the set of all the words in the vocabulary
• Let us denote $P(y_t = j \mid y_1, y_2, \ldots, y_{t-1})$ as $P(y_t = j \mid (y)_1^{t-1})$
• Using an RNN:
      $P(y_t = j \mid s_t) = \mathrm{softmax}(V s_t + c)_j$
      $s_t = \tanh(U y_t + W s_{t-1} + b)$
      or compactly, $s_t = \mathrm{RNN}(s_{t-1}, y_t)$
• Recurrent connections ensure that information about the sequence y_1, y_2, ..., y_{t-1} is embedded in $s_t$

[Figure: the unrolled RNN with inputs <start>, Group, of, ..., vegetables, states s0, s1, ..., sT and weights U, W, V]
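The update and output equations above translate directly into code. Below is a minimal NumPy sketch of one step of this RNN language model; the helper name `rnn_lm_step`, the toy dimensions and the random initialisation are illustrative assumptions, not part of the slides.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_lm_step(y_t, s_prev, U, W, V, b, c):
    """One step of the RNN language model:
    s_t = tanh(U y_t + W s_{t-1} + b),  P(next word | s_t) = softmax(V s_t + c)."""
    s_t = np.tanh(U @ y_t + W @ s_prev + b)
    p_next = softmax(V @ s_t + c)       # distribution over the vocabulary
    return s_t, p_next

# toy sizes: d-dimensional word vectors, k-dimensional state, a vocabulary of `vocab` words
d, k, vocab = 8, 16, 7
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(k, d)), rng.normal(size=(k, k)), rng.normal(size=(vocab, k))
b, c = np.zeros(k), np.zeros(vocab)

s = rng.normal(size=k)                  # s_0, initialised randomly as on the slides
y = rng.normal(size=d)                  # embedding of <start>
s, p = rnn_lm_step(y, s, U, W, V, b, c)
print(p.round(3), p.sum())              # a valid probability distribution over the 7 words
```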

Language Modelling Problem:
Natural Sentence Generation: RNN

• Data: all sentences from any large corpus (say, Wikipedia)
  – Each word in the vocabulary is represented as a d-dimensional word vector (example: word2vec [1, 2, 3, 4])

[Figure: the unrolled RNN language model from the previous slides]

[1] Xin Rong, "word2vec Parameter Learning Explained", arXiv:1411.2738v4, 2016.
[2] Yoav Goldberg and Omer Levy, "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method", arXiv:1402.3722v1, 2014.
[3] Sebastian Ruder's blogs on word embeddings (https://www.ruder.io/word-embeddings-1/)
[4] Al Ghodsi's video lecture on Word2Vec.


Language Modelling Problem:
Natural Sentence Generation: RNN

• Data: all sentences from any large corpus (say, Wikipedia)
  – Each word in the vocabulary is represented as a d-dimensional word vector (example: word2vec)
• Model:
      $P(y_t = j \mid s_t) = \mathrm{softmax}(V s_t + c)_j$
      $s_t = \tanh(U y_t + W s_{t-1} + b)$
      Compact representation: $s_t = \mathrm{RNN}(s_{t-1}, y_t)$
• The RNN is trained using backpropagation through time (BPTT)
• Parameters: U, V, W, b, c
• Loss: $\mathcal{L}(\theta) = \sum_{t=1}^{T} L_t(\theta)$, with $L_t(\theta) = -\log P(y_t = l_t \mid (y)_1^{t-1})$
  – where $l_t$ is the true word at time step t
• One can also use an LSTM or a GRU in place of the vanilla RNN

[Figure: the unrolled RNN language model]
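As a companion to the loss definition, here is a hedged sketch that accumulates L(θ) over one training sentence; it reuses the `rnn_lm_step` helper and the `numpy` import from the previous sketch and assumes the true words are given as integer indices into the vocabulary.

```python
def sequence_nll(word_vecs, true_ids, s0, U, W, V, b, c):
    """Accumulate L(theta) = sum_t L_t(theta), with
    L_t(theta) = -log P(y_t = l_t | (y)_1^{t-1}).
    word_vecs[t] is the embedding of the word fed at step t (the previous word);
    true_ids[t] is the index l_t of the true word at step t."""
    s, loss = s0, 0.0
    for y_t, l_t in zip(word_vecs, true_ids):
        s, p = rnn_lm_step(y_t, s, U, W, V, b, c)
        loss += -np.log(p[l_t] + 1e-12)     # small epsilon for numerical safety
    return loss
```

In practice the gradients of this loss with respect to U, V, W, b, c are obtained by backpropagation through time, as stated above.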

Language Modelling Problem:
Natural Sentence Generation: LSTM

• Data: all sentences from any large corpus (say, Wikipedia)
  – Each word in the vocabulary is represented as a d-dimensional word vector (example: word2vec)
• Model:
      $P(y_t = j \mid s_t) = \mathrm{softmax}(V s_t + c)_j$
      $s_t = f_t \odot s_{t-1} + i_t \odot \tilde{s}_t$
      $\tilde{s}_t = \tanh(U y_t + W h_{t-1} + b)$
      $h_t = o_t \odot \tanh(s_t)$
      Compact representation: $h_t, s_t = \mathrm{LSTM}(h_{t-1}, s_{t-1}, y_t)$
• The LSTM is trained using backpropagation through time (BPTT)
• Parameters: U, V, W, b, c, U(f), V(f), W(f), b(f), U(i), V(i), W(i), b(i), U(o), V(o), W(o), b(o)
• Loss: $\mathcal{L}(\theta) = \sum_{t=1}^{T} L_t(\theta)$, with $L_t(\theta) = -\log P(y_t = l_t \mid (y)_1^{t-1})$
  – where $l_t$ is the true word at time step t

[Figure: the unrolled LSTM language model]
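The gate equations are not written out on this slide, so the sketch below assumes the usual sigmoid form for the forget, input and output gates; everything else follows the slide's update equations, and the `softmax`/`numpy` helpers come from the earlier sketch. The parameter names in the dictionary `P` are hypothetical.

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_lm_step(y_t, h_prev, s_prev, P):
    """One step of the LSTM language model: h_t, s_t = LSTM(h_{t-1}, s_{t-1}, y_t)."""
    f = sigmoid(P["Uf"] @ y_t + P["Wf"] @ h_prev + P["bf"])    # forget gate (assumed form)
    i = sigmoid(P["Ui"] @ y_t + P["Wi"] @ h_prev + P["bi"])    # input gate (assumed form)
    o = sigmoid(P["Uo"] @ y_t + P["Wo"] @ h_prev + P["bo"])    # output gate (assumed form)
    s_tilde = np.tanh(P["U"] @ y_t + P["W"] @ h_prev + P["b"]) # candidate state
    s_t = f * s_prev + i * s_tilde                             # cell state update
    h_t = o * np.tanh(s_t)                                     # hidden representation
    p_next = softmax(P["V"] @ s_t + P["c"])                    # softmax(V s_t + c), as written on the slide
    return h_t, s_t, p_next
```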


Language Modelling Problem:
Natural Sentence Generation: GRU

• Data: all sentences from any large corpus (say, Wikipedia)
  – Each word in the vocabulary is represented as a d-dimensional word vector (example: word2vec)
• Model:
      $P(y_t = j \mid s_t) = \mathrm{softmax}(V s_t + c)_j$
      $s_t = (1 - i_t) \odot s_{t-1} + i_t \odot \tilde{s}_t$
      $\tilde{s}_t = \tanh(U y_t + W h_{t-1} + b)$, where $h_{t-1} = o_{t-1} \odot s_{t-1}$
      Compact representation: $s_t = \mathrm{GRU}(s_{t-1}, y_t)$
• The GRU is trained using backpropagation through time (BPTT)
• Parameters: U, V, W, b, c, U(i), V(i), W(i), b(i), U(o), V(o), W(o), b(o)
• Loss: $\mathcal{L}(\theta) = \sum_{t=1}^{T} L_t(\theta)$, with $L_t(\theta) = -\log P(y_t = l_t \mid (y)_1^{t-1})$
  – where $l_t$ is the true word at time step t

[Figure: the unrolled GRU language model]
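For completeness, a similar sketch of the GRU variant written on this slide. The slide does not give the gate equations, so the sigmoid forms below, computed from y_t and s_{t-1}, are assumptions; the state update itself follows the slide as written, including the use of o_{t-1} carried over from the previous step.

```python
def gru_lm_step(y_t, s_prev, o_prev, P):
    """One step of the slide's GRU language model: s_t = GRU(s_{t-1}, y_t)."""
    h_prev = o_prev * s_prev                                   # h_{t-1} = o_{t-1} * s_{t-1}
    i = sigmoid(P["Ui"] @ y_t + P["Wi"] @ s_prev + P["bi"])    # update gate (assumed form)
    o = sigmoid(P["Uo"] @ y_t + P["Wo"] @ s_prev + P["bo"])    # gate carried to the next step
    s_tilde = np.tanh(P["U"] @ y_t + P["W"] @ h_prev + P["b"]) # candidate state
    s_t = (1.0 - i) * s_prev + i * s_tilde                     # interpolate old and new state
    p_next = softmax(P["V"] @ s_t + P["c"])                    # softmax(V s_t + c)
    return s_t, o, p_next
```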


Neural Image Caption Generation


• So far, we have seen how to generate a sentence given the previous words
• Now, we want to generate a sentence given an image: image caption generation
• We are now interested in $P(y_t = j \mid y_1, y_2, \ldots, y_{t-1}, I)$ instead of $P(y_t = j \mid y_1, y_2, \ldots, y_{t-1})$
  – where I is an image
• Usually, the information in the image is encoded in a feature vector

[Figure: the RNN decoder generating the caption "A bird flying over a body of water" word by word]


Neural Image Caption Generation


• We now model $P(y_t = j \mid y_1, y_2, \ldots, y_{t-1}, I)$ as $P(y_t = j \mid s_t, h)$
  – where I is an image
  – h is the abstract representation of the image, obtained from the last convolution layer of a CNN
• Feed the abstract representation of the image (h) at every time step, along with the word representation, to compute $s_t$

[Figure: a CNN encodes the image into h, which is fed to the RNN decoder at every time step while it generates "A bird flying over ... <stop>"]


Neural Image Caption Generation


• We now model $P(y_t = j \mid y_1, y_2, \ldots, y_{t-1}, I)$ as $P(y_t = j \mid s_t, h)$
  – where I is an image
  – h is the abstract representation of the image, obtained from the last convolution layer of a CNN
• Feed the abstract representation of the image (h) at every time step, along with the word representation, to compute $s_t$:
      $s_t = \mathrm{RNN}(s_{t-1}, [y_t^T, h^T]^T)$

[Figure: the CNN encoding h fed to the RNN decoder at every time step]
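A small sketch of this step, reusing the earlier NumPy helpers: the image encoding h (here any fixed vector, e.g. taken from the last convolution layer of a CNN) is concatenated with the word embedding at every time step. The function name and shapes are illustrative.

```python
def caption_step(y_t, h_img, s_prev, U, W, V, b, c):
    """s_t = RNN(s_{t-1}, [y_t ; h]): the image encoding is appended to the
    word representation before the usual tanh recurrence."""
    z_t = np.concatenate([y_t, h_img])        # [y_t^T, h^T]^T
    s_t = np.tanh(U @ z_t + W @ s_prev + b)   # U now has shape (k, d + dim(h))
    p_next = softmax(V @ s_t + c)             # P(y_t = j | s_t, I)
    return s_t, p_next
```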


Encoder-Decoder Model: Image Captioning


• A CNN is first used to encode the image

[Figure: CNN encoder mapping the input image to a representation h]


Encoder-Decoder Model: Image Captioning


• A CNN is first used to encode the image
• An RNN is then used to decode (generate) a sentence from the encoding
• Both the encoder and the decoder use a neural network
• The encoder's output is fed to every step of the decoder
• The encoder-decoder model is implemented as an end-to-end model
• Learning algorithm: backpropagation through time and backpropagation through the CNN

[Figure: CNN encoder producing h; RNN decoder generating "A bird flying over ... <stop>" with h fed at every step]


Encoder-Decoder Model: Image Captioning


• Data: N images and their captions (sentences)
• Model:
  – Encoder: $h = \mathrm{CNN}(I)$
  – Decoder: $s_t = \mathrm{RNN}(s_{t-1}, [y_t^T, h^T]^T)$
      $P(y_t = j \mid s_t, I) = \mathrm{softmax}(V s_t + c)_j$
• The model is trained using backpropagation through time and backpropagation through the CNN
• Parameters: RNN parameters U, V, W, b, c and CNN parameters W(CNN)
• Loss: $\mathcal{L}(\theta) = \sum_{t=1}^{T} L_t(\theta)$, with $L_t(\theta) = -\log P(y_t = l_t \mid (y)_1^{t-1}, I)$
  – where $l_t$ is the true word at time step t

[Figure: CNN encoder and RNN decoder generating the caption]


More Applications of
Encoder-Decoder Models
• Machine Translation:
– Translating sentence in one language to another
– Encoder: RNN
– Decoder: RNN
• Transliteration:
– Translating the script of one language to script of
another language
– Encoder: RNN
– Decoder: RNN
• Image Question Answering:
– Given the image and a question (sentence), generate
answer (word)
– Encoder: CNN + RNN
– Decoder: FCNN


More Applications of
Encoder-Decoder Models
• Document Summarization:
– Generating a summary of a document
– Encoder: RNN
– Decoder: RNN
• Video Captioning:
– Generate sentence given video
– Encoder: RNN(CNN)
– Decoder: RNN
• And many more …


Attention Mechanism in Encoder-Decoder Models

• Encoder-decoder models can be made even more expressive by adding an "attention" mechanism
• Let us motivate the task of attention with the help of image captioning
• The encoder reads the image only once and encodes it
  – Embedding from the last convolution layer of the CNN
• At each time step the decoder uses this embedding to produce a new word

[Figure: CNN encoder producing h, fed at every step to the RNN decoder that generates "A bird flying over ... <stop>"]


Attention Mechanism: Image Captioning


• Humans try to produce each word in the output by focusing only on certain objects (concepts) in the input image
• Example:

[Figure: image captioned "A bird flying over a body of water"]


Attention Mechanism: Image Captioning


• Humans try to produce each word in the output by focusing only on certain objects (concepts) in the input image
• Essentially, at each time step we come up with a distribution (weights) over the input concepts (objects)
• This distribution tells us how much attention to pay to each object location in the input at each time step
• Ideally, at each time step we should feed only this relevant information (i.e. the encodings of the relevant objects) to the decoder
• Example: for the caption "A bird flying over a body of water" and three object locations, the weights at each time step:
  – t1: A           [1 0 0]
  – t2: bird        [0.5 0.5 0]
  – t3: flying      [0 1 0]
  – t4: over        [0 1 0]
  – t5: a           [0 0 1]
  – t6: body        [0 0 1]
  – t7: of          [0 0 1]
  – t8: water       [0 0 1]


Attention Mechanism: Image Captioning


• Humans try to produce each word in the output by focusing only on certain objects (concepts) in the input image
• Examples:

[Figure: two images captioned "A bird flying over a body of water" and "A group of people sitting on a boat in the water"]



Attention Mechanism: Image Captioning

• Let us revisit the decoder that we have seen so far
• The entire image is encoded into a single vector representation
• We feed this encoded representation to the decoder at each time step
• Suppose there are J concept locations (objects) in an image
• Now suppose an oracle told you which location in the image to focus on at a given time step t
• We could then just take a weighted average of the corresponding location representations $h_j$ and feed it to the decoder:
      $c_t = \sum_{j=1}^{J} a_{jt} h_j$
  – The context vector $c_t$ is the weighted sum of the location representations $h_j$ of the encoder
  – $a_{jt}$ denotes the amount of attention (attention weight) on the j-th location vector in the image when producing the t-th word

[Figure: the location vectors h_1, h_2, h_3 combined with weights a_1t, a_2t, a_3t into the context vector c_t, which is fed to the decoder]


Attention Mechanism: Image Captioning

• We could just take a weighted average of the corresponding location representations $h_j$ and feed it to the decoder
• Intuitively this should work better, because we are not overloading the decoder with irrelevant information
• How do we convert this intuition into a model?

[Figure: attention-weighted combination of the location vectors h_1, h_2, h_3 fed to the decoder]


Attention Mechanism: Image Captioning

• In practice we will not have the information about the importance of each location
  – The machine will have to learn this from the data
• The importance of the concept location representation $h_j$ in decoding and generating a word at time t is captured by an attention score:
      $\alpha_{jt} = f_{ATT}(s_{t-1}, h_j)$
• The attention score $\alpha_{jt}$ captures the importance of the j-th concept location in the image for decoding the t-th output word
• The attention score is normalised using the softmax function to obtain the attention weight:
      $a_{jt} = \dfrac{\exp(\alpha_{jt})}{\sum_{j'=1}^{J} \exp(\alpha_{j't})}$
  – where J is the number of concept locations in the image

[Figure: attention weights over the location vectors h_1, h_2, h_3 feeding the context vector c_T]
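The score-and-normalise computation above is easy to state in code. A hedged sketch follows, reusing the `softmax` helper from the earlier sketches; `score_fn` stands for whichever f_ATT is chosen below, and `H` stacks the J location representations h_j row-wise.

```python
def attention_weights_and_context(s_prev, H, score_fn):
    """Compute alpha_jt = f_ATT(s_{t-1}, h_j) for every location, normalise with
    softmax to get a_jt, and form the context vector c_t = sum_j a_jt h_j."""
    scores = np.array([score_fn(s_prev, h_j) for h_j in H])   # alpha_jt, j = 1..J
    weights = softmax(scores)                                 # a_jt, sums to 1 over j
    c_t = weights @ H                                         # weighted sum of the rows h_j
    return weights, c_t
```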


Attention Mechanism: Image Captioning

• Attention score: $\alpha_{jt} = f_{ATT}(s_{t-1}, h_j)$
• Attention weight: $a_{jt} = \dfrac{\exp(\alpha_{jt})}{\sum_{j'=1}^{J} \exp(\alpha_{j't})}$, where J is the number of concept locations in the image
• Every location representation $h_j$ at every time t is associated with one attention weight $a_{jt}$
• The attention weights $a_{jt}$, together with the location representations $h_j$, are used to generate the context vector:
      $c_t = \sum_{j=1}^{J} a_{jt} h_j$
• The context vector $c_t$ is the weighted sum of the location representations $h_j$ of the encoder: it combines all of them and gives how much each is important in the combination
• This context vector is used to perform the decoding operation at time t

[Figure: attention weights over h_1, h_2, h_3 combined into c_T and fed to the decoder]



Attention Mechanism: Image Captioning

• Attention score: $\alpha_{jt} = f_{ATT}(s_{t-1}, h_j)$
• Attention weight: $a_{jt} = \dfrac{\exp(\alpha_{jt})}{\sum_{j'=1}^{J} \exp(\alpha_{j't})}$, where J is the number of concept locations in the image
• Context vector: $c_t = \sum_{j=1}^{J} a_{jt} h_j$
  – The context vector $c_t$ is used in generating $s_t$
• Input to the decoder at time t: $z_t = [c_t^T, y_t^T]^T$
• Decoder at time t: $s_t = \tanh(U z_t + W s_{t-1} + b)$
• Output of the model at time t (i.e. the word generated at time t): $P(y_{t+1}) = \mathrm{softmax}(V s_t + c)$
• Note: the attention weights $a_{jt}$ are different for every input image

[Figure: decoder step using the context vector c_T]


Attention Mechanism: Image Captioning

• Attention score: $\alpha_{jt} = f_{ATT}(s_{t-1}, h_j)$
• How to define $f_{ATT}(\cdot)$?
• Dot-product attention:
  – The attention score $\alpha_{jt}$ is the dot product between the state of the decoder $s_{t-1}$ and the location representation $h_j$:
      $\alpha_{jt} = f_{ATT}(s_{t-1}, h_j) = \langle s_{t-1}, h_j \rangle$
  – Limitation: applicable only when the dimensions of $s_{t-1}$ and $h_j$ are the same

[Figure: decoder with attention over the location vectors]



Attention Mechanism: Image Captioning

• Attention score: $\alpha_{jt} = f_{ATT}(s_{t-1}, h_j)$
• How to define $f_{ATT}(\cdot)$?
• Multilayer perceptron attention:
  – It is similar to the gates used in an LSTM:
      $\alpha_{jt} = f_{ATT}(s_{t-1}, h_j) = \langle V_{ATT}, \Omega(U_{ATT} h_j + W_{ATT} s_{t-1} + b_{ATT}) \rangle$
  – $U_{ATT}$, $V_{ATT}$ and $W_{ATT}$ are the parameters of the multilayer perceptron attention
  – $\Omega(\cdot)$ can be either the logistic or the tan hyperbolic function
  – $s_{t-1}$ and $h_j$ need not be of the same dimension
• Then, the softmax operation is applied to obtain the attention weights
• The attention weights are then used to generate the context vector associated with time t

[Figure: decoder with MLP attention over the location vectors]
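A hedged sketch of this scoring function; the parameter names and the default choice of tanh for Ω are illustrative. Its output can be passed to the `attention_weights_and_context` helper above in place of the dot product.

```python
def mlp_attention_score(s_prev, h_j, U_att, W_att, b_att, v_att, omega=np.tanh):
    """alpha_jt = < v_ATT, omega(U_ATT h_j + W_ATT s_{t-1} + b_ATT) >.
    omega can be the logistic sigmoid or tanh; U_ATT and W_ATT may project
    h_j and s_{t-1} of different dimensions into a common space."""
    return v_att @ omega(U_att @ h_j + W_att @ s_prev + b_att)
```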



Attention Mechanism: Image Captioning

• How do we get the location information?
• The image is encoded into the representation produced by the last convolution layer
• This convolutional representation retains location information

[Figure: decoder attending over locations of the convolutional feature map]


Learning Attention Over Image Location


• Consider a VGG16 network to encode the image
• The output of the last convolution layer is a 14×14×512 feature map
• We could think of this as 196 (i.e., 14×14) locations, each having a 512-dimensional representation

[Figure: VGG16 architecture with the 14×14×512 output of the last convolution layer]



• Viewing the feature map as 196 location vectors h_1, h_2, ..., h_196:
  – $a_{jt}$ denotes the amount of attention (attention weight) on the j-th location vector to produce the t-th output word
  – The multilayer perceptron attention will then learn an attention over these locations (which in turn correspond to actual locations in the image)

[Figure: the 196 location vectors h_1 ... h_196 with attention weights a_1t, a_2t, ..., a_196t]
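A two-line sketch of this view, using a random array as a stand-in for the actual VGG16 output:

```python
import numpy as np

feature_map = np.random.default_rng(0).normal(size=(14, 14, 512))  # last-conv-layer output
H = feature_map.reshape(196, 512)   # row j is the location representation h_j, j = 1..196
print(H.shape)                      # (196, 512); attention weights a_jt are learned over these rows
```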


Illustrations: Attention Over Images

Examples of the attention-based model attending to the correct object (white indicates the attended regions; underlines indicate the corresponding word) [3]

[3] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, Yoshua Bengio, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", in Proceedings of the 32nd International Conference on Machine Learning, PMLR vol. 37, pp. 2048-2057, 2015.


Illustrations: Attention Over Images

Attention over time: as the model generates each word, its attention changes to reflect the relevant parts of the image (white indicates the attended regions) [3]

[3] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, Yoshua Bengio, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", in Proceedings of the 32nd International Conference on Machine Learning, PMLR vol. 37, pp. 2048-2057, 2015.


[Figure: per-word attention maps for the caption "A group of people sitting on a boat in the water."] [3]

[3] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, Yoshua Bengio, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", in Proceedings of the 32nd International Conference on Machine Learning, PMLR vol. 37, pp. 2048-2057, 2015.


Attention Mechanism in Machine Translation (MT) Task

• Machine translation (MT) is a sequence-to-sequence (seq2seq) mapping task
• Here the input is a sequence and the output is also a sequence
• Machine translation translates a sentence in one language (the source) to another language (the destination)
• Input sequence: X = (x_1, x_2, ..., x_j, ..., x_{Ts})
  – Ts is the length of the input sequence
  – Example: "My name is Dileep"
  – Each word is represented as a d-dimensional word vector x_j
• Output sequence: Y = (y_1, y_2, ..., y_t, ..., y_{Td})
  – Td is the length of the output sequence
  – Example: "Mera naam Dileep hai"
  – Each word is represented as a d-dimensional word vector y_t
• It uses an encoder-decoder framework


Attention Mechanism in MT Task

[Figure: X → Encoder → s → Decoder → Y]

• The encoder is an RNN
  – The encoder reads the sentence only once and encodes it in its final state s_{E,Ts}
• The decoder is also an RNN
  – At each time step, the decoder uses the embedding s_{E,Ts} from the encoder to produce a new word



Attention Mechanism in MT Task

• Ideally, at each time step we should feed only the relevant information (i.e. the encodings of the relevant words) to the decoder
• We are interested in capturing how much attention we need to pay to the state of the encoder at different instances of time when we are decoding at time t
• Now suppose an oracle told you which input words to focus on to generate the word at time step t
• We could take a weighted average of the states of the encoder at different instances of time j and feed it to the decoder at time step t

[Figure: encoder RNN over "<Go> My name is Dileep" and decoder RNN over "<Go> Mera naam Dileep hai"; the decoder combines the encoder states, weighted by a_jt, into a context vector c_t]



Attention Mechanism in MT Task

• In practice we will not have the information about the importance of each input word
  – The machine will have to learn this from the data
• The importance of the encoder state $s_{E,j}$ in decoding and generating a word at time t is captured by an attention score:
      $\alpha_{jt} = f_{ATT}(s_{D,t-1}, s_{E,j})$
• The attention score $\alpha_{jt}$ captures the importance of the encoder state at time j for decoding the t-th output word
• The attention score is normalised using the softmax function to obtain the attention weight:
      $a_{jt} = \dfrac{\exp(\alpha_{jt})}{\sum_{j'=1}^{T_s} \exp(\alpha_{j't})}$

[Figure: the decoder attending over the encoder states]


Attention Mechanism in MT Task

• Attention score: $\alpha_{jt} = f_{ATT}(s_{D,t-1}, s_{E,j})$
• Attention weight: $a_{jt} = \dfrac{\exp(\alpha_{jt})}{\sum_{j'=1}^{T_s} \exp(\alpha_{j't})}$
• Every encoder state $s_{E,j}$ is associated with one attention weight $a_{jt}$
• The attention weight $a_{jt}$, along with the encoder state $s_{E,j}$, is used to generate the context vector:
      $c_t = \sum_{j=1}^{T_s} a_{jt} s_{E,j}$
• The context vector $c_t$ is the weighted sum of the hidden states of the encoder: it combines all the state information of the encoder and gives how much each is important in the combination
• This context vector is used to perform the decoding operation at time t

[Figure: the decoder attending over the encoder states]


Attention Mechanism in MT Task

• Attention score: $\alpha_{jt} = f_{ATT}(s_{D,t-1}, s_{E,j})$
• Attention weight: $a_{jt} = \dfrac{\exp(\alpha_{jt})}{\sum_{j'=1}^{T_s} \exp(\alpha_{j't})}$
• Context vector: $c_t = \sum_{j=1}^{T_s} a_{jt} s_{E,j}$
  – This context vector is used to perform the decoding operation at time t
• Input to the decoder at time t: $z_t = [c_t^T, y_t^T]^T$
• The decoder state at time t: $s_{D,t} = \tanh(U_D z_t + W_D s_{D,t-1} + b_D)$
• Output of the model at time t (i.e. the word generated at time t): $P(y_{t+1}) = f(V_D s_{D,t} + c_D)$

[Figure: the decoder attending over the encoder states]
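The same machinery in the MT setting, as a hedged end-to-end sketch: an encoder RNN produces one state per source word, and each decoder step attends over all of them. It reuses the `softmax` helper from the earlier sketches; the names, shapes and the generic dot-product scoring are assumptions.

```python
def encode(X, U_E, W_E, b_E):
    """Run the encoder RNN over the source word vectors x_1 ... x_Ts and
    return all encoder states s_{E,1} ... s_{E,Ts}, one per row."""
    s, states = np.zeros(W_E.shape[0]), []
    for x_j in X:
        s = np.tanh(U_E @ x_j + W_E @ s + b_E)
        states.append(s)
    return np.stack(states)

def mt_decode_step(y_t, sD_prev, S_E, U_D, W_D, V_D, b_D, c_D, score_fn=lambda s, h: s @ h):
    """One decoder step with attention over the encoder states:
    c_t = sum_j a_jt s_{E,j},  z_t = [c_t ; y_t],
    s_{D,t} = tanh(U_D z_t + W_D s_{D,t-1} + b_D),  P(y_{t+1}) = softmax(V_D s_{D,t} + c_D)."""
    scores = np.array([score_fn(sD_prev, s_Ej) for s_Ej in S_E])   # alpha_jt, j = 1..Ts
    a_t = softmax(scores)                                          # attention weights
    c_t = a_t @ S_E                                                # context vector
    z_t = np.concatenate([c_t, y_t])
    sD_t = np.tanh(U_D @ z_t + W_D @ sD_prev + b_D)
    return sD_t, softmax(V_D @ sD_t + c_D), a_t
```

The slide leaves the output nonlinearity f generic; softmax is used here since the output is a distribution over the target vocabulary.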


Attention Mechanism in MT Task

• Attention score: $\alpha_{jt} = f_{ATT}(s_{D,t-1}, s_{E,j})$
• How to define $f_{ATT}(\cdot)$?
• Dot-product attention:
  – The attention score $\alpha_{jt}$ is the dot product between the state of the decoder $s_{D,t-1}$ and the state of the encoder at time j, $s_{E,j}$:
      $\alpha_{jt} = f_{ATT}(s_{D,t-1}, s_{E,j}) = \langle s_{D,t-1}, s_{E,j} \rangle$
  – Limitation: applicable only when the dimensions of $s_{D,t-1}$ and $s_{E,j}$ are the same
  – That is, the number of nodes in the hidden layers of the RNN encoder and the RNN decoder must be the same

[Figure: the decoder attending over the encoder states]



Attention Mechanism in MT Task

• Attention score: $\alpha_{jt} = f_{ATT}(s_{D,t-1}, s_{E,j})$
• How to define $f_{ATT}(\cdot)$?
• Multilayer perceptron attention:
  – It is similar to the gates used in an LSTM:
      $\alpha_{jt} = f_{ATT}(s_{D,t-1}, s_{E,j}) = \langle v_{ATT}, \Omega(U_{ATT} s_{E,j} + W_{ATT} s_{D,t-1} + b_{ATT}) \rangle$
  – $U_{ATT}$, $v_{ATT}$ and $W_{ATT}$ are the parameters of the multilayer perceptron attention
  – $s_{D,t-1}$ and $s_{E,j}$ need not be of the same dimension
• Then, the softmax operation is applied to obtain the attention weights
• The attention weights are then used to generate the context vector associated with time t

[Figure: the decoder attending over the encoder states]


Summary: Encoder-Decoder Models with Attention

• Encoder-decoder model
  – An encoder is first used to encode the input
  – A decoder is then used to decode (generate) an output from the encoding
• Attention mechanism in encoder-decoder models
  – Encoder-decoder models can be made even more expressive by adding an "attention" mechanism
  – A model will then learn an attention over the input to generate the output
  – Attention can be seen as the probability of a portion of the input being responsible for generating the output


Attention-based Models:
Transformers


Sequence-to-Sequence Mapping Tasks


• Neural Machine Translation: Translation of a sentence in the
source language to a sentence in the target language
– Input: A sequence of words
– Output: A sequence of words

• Speech Recognition (Speech-to-Text Conversion):


Conversion of the speech signal of a sentence utterance to
the text of a sentence
– Input: A sequence of feature vectors extracted from the
speech signal of a sentence
– Output: A sequence of words
• Video Captioning: Generation of a sentence as the caption
for a video represented as a sequence of frames
– Input: A sequence of feature vectors extracted from the
frames of a video
– Output: A sequence of words

Sequence-to-Sequence Mapping Tasks


• Each of the above tasks involves mapping an input
sequence to an output sequence
• So far we have seen encoder-decoder paradigm using RNN
models for sequence-to-sequence mapping
• Training RNN-based sequence-to-sequence mapping systems is
  – computationally intensive, and
  – there is not much scope for parallelization of operations in the training process
• Goal: come up with a totally different approach to sequence-to-sequence mapping tasks that
  – avoids the recurrent structure of the encoder-decoder paradigm, and
  – avoids the huge training time required for training RNNs


Attention-based Deep Learning Models for


Sequence-to-Sequence Mapping
• Attention-based models implement sequence-to-sequence
mapping using only the attention-based techniques
– They don’t use any RNNs for that matter
• Attention based models [4] try to capture and use
– Relations among elements in the input sequence (Self-
Attention)
– Relations among elements in the output sequence (Self-
Attention)
– Relations between elements in the input sequence and
elements in the output sequence (Cross-Attention)
• In literature, these attention-based models are called
transformers
– Perform several transformations to capture better representation that
will avoid recurrences

[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, pp. 1-11, 2017.


Attention-based Deep Learning Models for


Sequence-to-Sequence Mapping
• Given the sequence X = (x_1, x_2, ..., x_j, ..., x_{Ts}),
  – get a representation for X that captures the relationships among the elements in the sequence X
  – Basically, apply some kind of transformations to get a representation that preserves the sequence while avoiding recurrences
• Major advantages:
  – Training times are smaller compared to RNNs, as there are no recurrences and hence no need to perform backpropagation through time (BPTT)
  – There are no vanishing and exploding gradient problems
  – Most importantly, this gives a lot of scope for parallelization when GPUs are used for training


Text Books
1. Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep learning,
MIT Press, Available online: http://www.deeplearningbook.org,
2016
2. Charu C. Aggarwal, Neural Networks and Deep Learning, Springer,
2018
3. B. Yegnanarayana, Artificial Neural Networks, Prentice-Hall of India,
1999.
4. Satish Kumar, Neural Networks - A Class Room Approach, Second
Edition, Tata McGraw-Hill, 2013.
5. S. Haykin, Neural Networks and Learning Machines, Prentice Hall of
India, 2010.
6. C. M. Bishop, Pattern Recognition and Machine Learning, Springer,
2006.
7. J. Han and M. Kamber, Data Mining: Concepts and Techniques,
Third Edition, Morgan Kaufmann Publishers, 2011.
8. S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 2009.
