
Lecture 8:

Attention and Transformers

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 1 April 25, 2024
Administrative
● Assignment 2 due 05/06

● Discussion section tomorrow

○ Covering PyTorch, the main deep learning framework used by AI researchers + what
we recommend for your projects!

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 2 April 25, 2024
Last Time: Recurrent Neural Networks

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 3 April 25, 2024
Last Time: Variable length computation
graph with shared weights
y1 L1

h0 fW h1

x1
W

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 4 April 25, 2024
Last Time: Variable length computation
graph with shared weights
y1 L1 y2 L2

h0 fW h1 fW h2

x1 x2
W W is reused (recurrently)!

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 5 April 25, 2024
Last Time: Variable length computation L
graph with shared weights
y1 L1 y2 L2 y3 L3 yT LT

h0 fW h1 fW h2 fW h3
… hT

x1 x2 x3
W Calculate total loss across all
timesteps to find dL/dW
(backpropagation through
time)!

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 6 April 25, 2024
Sequence to Sequence with RNNs: Encoder - Decoder
Input: Sequence x1, … xT A motivating example for today’s discussion –
machine translation! English → Italian
Output: Sequence y1, …, yT’

Encoder: ht = fW(xt, ht-1)

h1 h2 h3 h4

x1 x2 x3 x4

we see the sky

Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 7 April 25, 2024
Sequence to Sequence with RNNs
Input: Sequence x1, … xT
Output: Sequence y1, …, yT’

From final hidden state predict:


Encoder: ht = fW(xt, ht-1) Initial decoder state s0
Context vector c (often c=hT)

h1 h2 h3 h4 s0

x1 x2 x3 x4 c

we see the sky

Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 8 April 25, 2024
Sequence to Sequence with RNNs
Input: Sequence x1, … xT Decoder: st = gU(yt-1, st-1, c)
Output: Sequence y1, …, yT’ vediamo

From final hidden state predict:


y1
Encoder: ht = fW(xt, ht-1) Initial decoder state s0
Context vector c (often c=hT)

h1 h2 h3 h4 s0 s1

x1 x2 x3 x4 c y0

we see the sky [START]

Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 9 April 25, 2024
Sequence to Sequence with RNNs
Input: Sequence x1, … xT Decoder: st = gU(yt-1, st-1, c)
Output: Sequence y1, …, yT’ vediamo il

From final hidden state predict:


y1 y2
Encoder: ht = fW(xt, ht-1) Initial decoder state s0
Context vector c (often c=hT)

h1 h2 h3 h4 s0 s1 s2

x1 x2 x3 x4 c y0 y1

we see the sky [START] vediamo

Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 10 April 25, 2024
Sequence to Sequence with RNNs
Input: Sequence x1, … xT Decoder: st = gU(yt-1, st-1, c)
Output: Sequence y1, …, yT’ vediamo il cielo [STOP]

From final hidden state predict:


y1 y2 y3 y4
Encoder: ht = fW(xt, ht-1) Initial decoder state s0
Context vector c (often c=hT)

h1 h2 h3 h4 s0 s1 s2 s3 s4

x1 x2 x3 x4 c y0 y1 y2 y3

we see the sky [START] vediamo il cielo

Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 11 April 25, 2024
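To make the encoder-decoder recurrence on these slides concrete, here is a minimal PyTorch sketch of the same idea. It is an illustration, not the exact Sutskever et al. model: the GRU cells, the embedding layers, and the hidden_dim / vocab-size names are assumptions for the example.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal RNN encoder-decoder: the context vector c = h_T feeds every decoder step."""
    def __init__(self, src_vocab, tgt_vocab, hidden_dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hidden_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, hidden_dim)
        self.encoder = nn.GRUCell(hidden_dim, hidden_dim)        # f_W
        self.decoder = nn.GRUCell(2 * hidden_dim, hidden_dim)    # g_U(y_{t-1}, s_{t-1}, c)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, x, y):                       # x: (B, T) source tokens, y: (B, T') teacher-forced inputs
        B = x.size(0)
        h = x.new_zeros(B, self.encoder.hidden_size, dtype=torch.float)   # h_0
        for t in range(x.size(1)):                 # encoder: h_t = f_W(x_t, h_{t-1})
            h = self.encoder(self.src_emb(x[:, t]), h)
        c, s = h, h                                # context vector c = h_T, initial decoder state s_0
        logits = []
        for t in range(y.size(1)):                 # decoder: s_t = g_U(y_{t-1}, s_{t-1}, c)
            inp = torch.cat([self.tgt_emb(y[:, t]), c], dim=-1)
            s = self.decoder(inp, s)
            logits.append(self.out(s))             # predict y_t from s_t
        return torch.stack(logits, dim=1)          # (B, T', tgt_vocab)
```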
Sequence to Sequence with RNNs
Input: Sequence x1, … xT Decoder: st = gU(yt-1, st-1, c)
Output: Sequence y1, …, yT’
Remember: vediamo il cielo [STOP]

During Training: Often, we use the “correct”
token even if the model is wrong.
This is called teacher forcing. y1 y2 y3 y4

During Test-time: We sample from the
model’s outputs until we sample [STOP].

Encoder: ht = fW(xt, ht-1) Initial decoder state s0
From final hidden state predict: Context vector c (often c=hT)

h1 h2 h3 h4 s0 s1 s2 s3 s4

x1 x2 x3 x4 c y0 y1 y2 y3

we see the sky [START] vediamo il cielo

Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 12 April 25, 2024
Sequence to Sequence with RNNs
Input: Sequence x1, … xT Decoder: st = gU(yt-1, st-1, c)
Output: Sequence y1, …, yT’ vediamo il cielo [STOP]

From final hidden state predict:


y1 y2 y3 y4
Encoder: ht = fW(xt, ht-1) Initial decoder state s0
Context vector c (often c=hT)

h1 h2 h3 h4 s0 s1 s2 s3 s4

x1 x2 x3 x4 c y0 y1 y2 y3

we see the sky [START] vediamo il cielo

Q: Are there any problems with using c like this?
Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 13 April 25, 2024
Sequence to Sequence with RNNs
Input: Sequence x1, … xT Decoder: st = gU(yt-1, st-1, c)
Output: Sequence y1, …, yT’ vediamo il cielo [STOP]

From final hidden state predict:


y1 y2 y3 y4
Encoder: ht = fW(xt, ht-1) Initial decoder state s0
Context vector c (often c=hT)

h1 h2 h3 h4 s0 s1 s2 s3 s4

x1 x2 x3 x4 c y0 y1 y2 y3

we see the sky [START] vediamo il cielo

Answer: The input sequence is bottlenecked through a
fixed-size vector. What if T=1000?
Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 14 April 25, 2024
Sequence to Sequence with RNNs
Input: Sequence x1, … xT Decoder: st = gU(yt-1, st-1, c)
Output: Sequence y1, …, yT’ vediamo il cielo [STOP]

From final hidden state predict:


y1 y2 y3 y4
Encoder: ht = fW(xt, ht-1) Initial decoder state s0
Context vector c (often c=hT)

h1 h2 h3 h4 s0 s1 s2 s3 s4

x1 x2 x3 x4 c y0 y1 y2 y3

we see the sky [START] vediamo il cielo

Ideally we can reference the inputs as we decode…
Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 15 April 25, 2024
Sequence to Sequence with RNNs and Attention
Input: Sequence x1, … xT
Output: Sequence y1, …, yT’

From final hidden state:


Encoder: ht = fW(xt, ht-1) Initial decoder state s
0

h1 h2 h3 h4 s0

x1 x2 x3 x4

we see the sky

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 16 April 25, 2024
Sequence to Sequence with RNNs and Attention
Compute (scalar) alignment scores
et,i = fatt(st-1, hi) (fatt is a Linear Layer)

From final hidden state:


e11 e12 e13 e14
Initial decoder state s0

h1 h2 h3 h4 s0

x1 x2 x3 x4

we see the sky

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 17 April 25, 2024
Sequence to Sequence with RNNs and Attention
Compute (scalar) alignment scores
a11 a12 a13 a14
et,i = fatt(st-1, hi) (fatt is a Linear Layer)

Normalize alignment scores


softmax
to get attention weights
From final hidden state:
e11 e12 e13 e14    0 < at,i < 1, ∑i at,i = 1
Initial decoder state s0

h1 h2 h3 h4 s0

x1 x2 x3 x4

we see the sky

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 18 April 25, 2024
Sequence to Sequence with RNNs and Attention
Compute (scalar) alignment scores
a11 a12 a13 a14
et,i = fatt(st-1, hi) (fatt is a Linear Layer)
vediamo
Normalize alignment scores
softmax
to get attention weights
From final hidden state: y1
e11 e12 e13 e14    0 < at,i < 1, ∑i at,i = 1
Initial decoder state s0
Compute context vector as weighted sum of hidden states:
ct = ∑i at,i hi
h1 h2 h3 h4 s0 + s1

x1 x2 x3 x4 c1 y0

we see the sky


[START]

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 19 April 25, 2024
Sequence to Sequence with RNNs and Attention
Compute (scalar) alignment scores
a11 a12 a13 a14
et,i = fatt(st-1, hi) (fatt is a Linear Layer)
vediamo
Normalize alignment scores
softmax
to get attention weights
From final hidden state: y1
e11 e12 e13 e14 0 < at,i < 1 ∑iat,i = 1
Initial decoder state s0
Compute context vector as
weighted sum of hidden
h1 h2 h3 h4 s0 + s1 states
ct = ∑iat,ihi
Use context vector in decoder: st = gU(yt-1, st-1, ct)

x1 x2 x3 x4 c1 y0

we see the sky [START]

Intuition: Context vector attends to the relevant part of the
input sequence. “vediamo” = “we see”,
so maybe a11=a12=0.45, a13=a14=0.05
Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 20 April 25, 2024
Sequence to Sequence with RNNs and Attention
Compute (scalar) alignment scores
a11 a12 a13 a14
et,i = fatt(st-1, hi) (fatt is a Linear Layer)
vediamo
Normalize alignment scores
softmax
to get attention weights
From final hidden state: y1
e11 e12 e13 e14 0 < at,i < 1 ∑iat,i = 1
Initial decoder state s0
Compute context vector as
weighted sum of hidden
h1 h2 h3 h4 s0 + s1 states
ct = ∑iat,ihi
Use context vector in decoder: st = gU(yt-1, st-1, ct)

x1 x2 x3 x4 c1 y0

we see the sky [START]

Intuition: Context vector attends to the relevant part of the
input sequence. “vediamo” = “we see”,
so maybe a11=a12=0.45, a13=a14=0.05

This is all differentiable! No supervision on attention
weights – backprop through everything.
Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 21 April 25, 2024
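The three steps on this slide (alignment scores, softmax, weighted sum) fit in a few lines. A minimal sketch of one decoder timestep, assuming fatt is a single linear layer applied to the concatenated pair [s_{t-1}; h_i] (the "Linear Layer" named above); the batch, length, and hidden sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One decoder timestep of RNN attention (Bahdanau-style).
B, T, H = 2, 4, 256                        # batch, source length, hidden dim
h = torch.randn(B, T, H)                   # encoder hidden states h_1..h_T
s_prev = torch.randn(B, H)                 # previous decoder state s_{t-1}
f_att = nn.Linear(2 * H, 1)                # alignment function (a linear layer here)

# e_{t,i} = f_att(s_{t-1}, h_i): score every encoder state against the decoder state
e = f_att(torch.cat([s_prev.unsqueeze(1).expand(-1, T, -1), h], dim=-1)).squeeze(-1)  # (B, T)
a = F.softmax(e, dim=-1)                   # attention weights: 0 < a_{t,i} < 1, sum_i a_{t,i} = 1
c = (a.unsqueeze(-1) * h).sum(dim=1)       # context vector c_t = sum_i a_{t,i} h_i, shape (B, H)
```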
Sequence to Sequence with RNNs and Attention
Repeat: Use s1 to compute
new context vector c2
a21 a22 a23 a24 Compute (scalar)
vediamo
alignment scores
softmax et,i = fatt(st-1, hi)
y1
(fatt is a Linear Layer)
e21 e22 e23 e24

h1 h2 h3 h4 s0 s1

x1 x2 x3 x4 c1 y0 c2

we see the sky


[START]

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 22 April 25, 2024
Sequence to Sequence with RNNs and Attention
Repeat: Use s1 to compute
new context vector c2
a21 a22 a23 a24
vediamo il

softmax
y1 y2
e21 e22 e23 e24

+ Use context vector in decoder: st = gU(yt-1, st-1, ct)

h1 h2 h3 h4 s0 s1 s2

x1 x2 x3 x4 c1 y0 c2 y1

we see the sky


[START] vediamo

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 23 April 25, 2024
Sequence to Sequence with RNNs and Attention
Repeat: Use s1 to compute
new context vector c2
a21 a22 a23 a24
vediamo il

softmax
y1 y2
e21 e22 e23 e24

+ Use context vector in decoder: st = gU(yt-1, st-1, ct)

h1 h2 h3 h4 s0 s1 s2

x1 x2 x3 x4 c1 y0 c2 y1

we see the sky [START] vediamo

Intuition: Context vector attends to the relevant part of the
input sequence. “il” = “the”,
so maybe a21=a22=0.05, a23=0.8, a24=0.1
Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 24 April 25, 2024
Sequence to Sequence with RNNs and Attention
Use a different context vector in each timestep of decoder

- Input sequence not bottlenecked through single vector


vediamo il cielo [STOP]
- At each timestep of decoder, context vector “looks at”
different parts of the input sequence
y1 y2 y3 y4

h1 h2 h3 h4 s0 s1 s2 s3 s4

x1 x2 x3 x4 c1 y0 c2 y1 c3 y2 c4 y3

we see the sky


[START] vediamo il cielo

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 25 April 25, 2024
Sequence to Sequence with RNNs and Attention
Example: English to French Visualize attention weights at,i
translation
at1 at2 at3 at4

softmax

et1 et2 et3 et4

h1 h2 h3 h4 st

x1 x2 x3 x4

we see the sky

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 26 April 25, 2024
Sequence to Sequence with RNNs and Attention
Visualize attention weights at,i
Example: English to French
translation

Input: “The agreement on the


European Economic Area
was signed in August 1992.”

Output: “L’accord sur la zone


économique européenne a
été signé en août 1992.”

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 27 April 25, 2024
Sequence to Sequence with RNNs and Attention
Visualize attention weights at,i
Example: English to French translation

Input: “The agreement on the European Economic Area
was signed in August 1992.”

Output: “L’accord sur la zone économique européenne
a été signé en août 1992.”

Diagonal attention means words correspond in order.

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 28 April 25, 2024
Sequence to Sequence with RNNs and Attention
Visualize attention weights at,i
Example: English to French translation

Input: “The agreement on the European Economic
Area was signed in August 1992.”

Output: “L’accord sur la zone économique
européenne a été signé en août 1992.”

Diagonal attention means words correspond in order.
Attention figures out different word orders.
Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 29 April 25, 2024
Sequence to Sequence with RNNs and Attention
Context vectors don’t use the fact that the hi form an ordered
sequence – attention just treats them as an unordered set {hi}
vediamo il cielo [STOP]
General architecture + strategy given any set of input
hidden vectors {hi}! (calculate attention weights + sum) y1 y2 y3 y4

h1 h2 h3 h4 s0 s1 s2 s3 s4

x1 x2 x3 x4 c1 y0 c2 y1 c3 y2 c4 y3

we see the sky


[START] vediamo il cielo

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 30 April 25, 2024
Image Captioning using spatial features
Input: Image I
Output: Sequence y = y1, y2,..., yT An example network for image captioning
without attention

z0,0 z0,1 z0,2

z1,0 z1,1 z1,2


CNN
z2,0 z2,1 z2,2

Extract spatial Features:


features from a HxWxD
pretrained CNN
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 31 April 25, 2024
Image Captioning using spatial features
Input: Image I
Output: Sequence y = y1, y2,..., yT

Encoder: h0 = fW(z)
where z is spatial CNN features
fW(.) is an MLP

z0,0 z0,1 z0,2


h0
z1,0 z1,1 z1,2
CNN MLP
z2,0 z2,1 z2,2

Extract spatial Features:


features from a HxWxD
pretrained CNN
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 32 April 25, 2024
Image Captioning using spatial features
Input: Image I Decoder: ht = gV(yt-1, ht-1, c)
Output: Sequence y = y1, y2,..., yT where context vector c is often c = h0
and output yt = T(ht)

Encoder: h0 = fW(z) person


where z is spatial CNN features
fW(.) is an MLP y1

z0,0 z0,1 z0,2


h0 h1
z1,0 z1,1 z1,2
CNN MLP
z2,0 z2,1 z2,2

Extract spatial Features:


c y0
features from a HxWxD
pretrained CNN
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 [START]

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 33 April 25, 2024
Image Captioning using spatial features
Input: Image I Decoder: ht = gV(yt-1, ht-1, c)
Output: Sequence y = y1, y2,..., yT where context vector c is often c = h0
and output yt = T(ht)

Encoder: h0 = fW(z) person wearing


where z is spatial CNN features
fW(.) is an MLP y1 y2

z0,0 z0,1 z0,2


hh0 h1 h2
z1,0 z1,1 z1,2 0
CNN MLP
z2,0 z2,1 z2,2

Extract spatial Features:


c y0 y1
features from a HxWxD
pretrained CNN
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 [START] person

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 34 April 25, 2024
Image Captioning using spatial features
Input: Image I Decoder: ht = gV(yt-1, ht-1, c)
Output: Sequence y = y1, y2,..., yT where context vector c is often c = h0
and output yt = T(ht)

Encoder: h0 = fW(z) person wearing hat


where z is spatial CNN features
fW(.) is an MLP y1 y2 y3

z0,0 z0,1 z0,2


hh0 h1 h2 h3
z1,0 z1,1 z1,2 0
CNN MLP
z2,0 z2,1 z2,2

Extract spatial Features:


c y0 y1 y2
features from a HxWxD
pretrained CNN
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 [START] person wearing

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 35 April 25, 2024
Image Captioning using spatial features
Input: Image I Decoder: ht = gV(yt-1, ht-1, c)
Output: Sequence y = y1, y2,..., yT where context vector c is often c = h0
and output yt = T(ht)

Encoder: h0 = fW(z) person wearing hat [END]


where z is spatial CNN features
fW(.) is an MLP y1 y2 y3 y4

z0,0 z0,1 z0,2


h0 h1 h2 h3 h4
z1,0 z1,1 z1,2
CNN MLP
z2,0 z2,1 z2,2

Extract spatial Features:


c y0 y1 y2 y3
features from a HxWxD
pretrained CNN
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 [START] person wearing hat

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 36 April 25, 2024
Image Captioning using spatial features
Input: Image I Decoder: ht = gV(yt-1, ht-1, c)
Output: Sequence y = y1, y2,..., yT where context vector c is often c = h0
and output yt = T(ht)
Q: What is the problem
Encoder: h0 = fW(z) with this setup? Think person wearing hat [END]
where z is spatial CNN features back to last time…
fW(.) is an MLP y1 y2 y3 y4

z0,0 z0,1 z0,2


h0 h1 h2 h3 h4
z1,0 z1,1 z1,2
CNN MLP
z2,0 z2,1 z2,2

Extract spatial Features:


c y0 y1 y2 y3
features from a HxWxD
pretrained CNN
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 [START] person wearing hat

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 37 April 25, 2024
Image Captioning using spatial features
Answer: Input is "bottlenecked" through c
- Model needs to encode everything it
wants to say within c
person wearing hat [END]
This is a problem if we want to generate
really long descriptions, e.g. 100s of words long
y1 y2 y3 y4

z0,0 z0,1 z0,2


h0 h1 h2 h3 h4
z1,0 z1,1 z1,2
CNN MLP
z2,0 z2,1 z2,2

Extract spatial Features:


c y0 y1 y2 y3
features from a HxWxD
pretrained CNN
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 [START] person wearing hat

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 38 April 25, 2024
Image Captioning with RNNs and Attention

Attention idea: New context vector at every time step.

Each context vector will attend to different image regions

(Figure: “Attention” vs. “Saccades in humans”)


z0,0 z0,1 z0,2
h0
z1,0 z1,1 z1,2
CNN
z2,0 z2,1 z2,2

Extract spatial Features:


features from a HxWxD
pretrained CNN
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 39 April 25, 2024
Image Captioning with RNNs and Attention
Alignment scores: H x W
Compute alignment scores (scalars):
et,i,j = fatt(ht-1, zi,j), where fatt(.) is an MLP

e1,0,0 e1,0,1 e1,0,2
e1,1,0 e1,1,1 e1,1,2
e1,2,0 e1,2,1 e1,2,2

z0,0 z0,1 z0,2


h0
z1,0 z1,1 z1,2
CNN
z2,0 z2,1 z2,2

Extract spatial Features:


features from a HxWxD
pretrained CNN
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 40 April 25, 2024
Image Captioning with RNNs and Attention
Alignment scores (H x W): Attention (H x W):
Compute alignment scores (scalars). Normalize to get attention weights.
e1,0,0 e1,0,1 e1,0,2 a1,0,0 a1,0,1 a1,0,2

e1,1,0 e1,1,1 e1,1,2 a1,1,0 a1,1,1 a1,1,2


fatt(.) is an MLP 0 < at, i, j < 1,
e1,2,0 e1,2,1 e1,2,2 a1,2,0 a1,2,1 a1,2,2 attention values sum to 1

z0,0 z0,1 z0,2


h0
z1,0 z1,1 z1,2
CNN
z2,0 z2,1 z2,2

Extract spatial Features:


features from a HxWxD
pretrained CNN
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 41 April 25, 2024
Image Captioning with RNNs and Attention
Alignment scores (H x W): Attention (H x W):
Compute alignment scores (scalars). Normalize to get attention weights. Compute context vector:
e1,0,0 e1,0,1 e1,0,2 a1,0,0 a1,0,1 a1,0,2

e1,1,0 e1,1,1 e1,1,2 a1,1,0 a1,1,1 a1,1,2


fatt(.) is an MLP 0 < at, i, j < 1,
e1,2,0 e1,2,1 e1,2,2 a1,2,0 a1,2,1 a1,2,2 attention values sum to 1

z0,0 z0,1 z0,2


h0 Q: How many context vectors
z1,0 z1,1 z1,2
CNN are computed?
z2,0 z2,1 z2,2

Extract spatial Features:


c1
features from a HxWxD
pretrained CNN
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 42 April 25, 2024
Image Captioning with RNNs and Attention
Decoder: yt = gV(yt-1, ht-1, ct)
Each timestep of decoder uses a
New context vector at every time step
different context vector that looks at
different parts of the input image

person

y1

z0,0 z0,1 z0,2


h0 h1
z1,0 z1,1 z1,2
CNN
z2,0 z2,1 z2,2

Extract spatial Features:


c1 y0
features from a HxWxD
pretrained CNN
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 [START]

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 43 April 25, 2024
Image Captioning with RNNs and Attention
Alignment scores: Attention: Decoder: yt = gV(yt-1, ht-1, ct)
HxW HxW New context vector at every time step
e1,0,0 e1,0,1 e1,0,2 a1,0,0 a1,0,1 a1,0,2

e1,1,0 e1,1,1 e1,1,2 a1,1,0 a1,1,1 a1,1,2


person
e1,2,0 e1,2,1 e1,2,2 a1,2,0 a1,2,1 a1,2,2
y1

z0,0 z0,1 z0,2


h0 h1
z1,0 z1,1 z1,2
CNN
z2,0 z2,1 z2,2

Extract spatial Features:


c1 y0 c2
features from a HxWxD
pretrained CNN
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 [START]

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 44 April 25, 2024
Image Captioning with RNNs and Attention
Decoder: yt = gV(yt-1, ht-1, ct)
Each timestep of decoder uses a
New context vector at every time step
different context vector that looks at
different parts of the input image

person wearing

y1 y2

z0,0 z0,1 z0,2


h0 h1 h2
z1,0 z1,1 z1,2
CNN
z2,0 z2,1 z2,2

Extract spatial Features:


c1 y0 c2 y1
features from a HxWxD
pretrained CNN
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 [START] person

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 45 April 25, 2024
Image Captioning with RNNs and Attention
Decoder: yt = gV(yt-1, ht-1, ct)
Each timestep of decoder uses a
New context vector at every time step
different context vector that looks at
different parts of the input image

person wearing hat

y1 y2 y3

z0,0 z0,1 z0,2


h0 h1 h2 h3
z1,0 z1,1 z1,2
CNN
z2,0 z2,1 z2,2

Extract spatial Features:


c1 y0 c2 y1 c3 y2
features from a HxWxD
pretrained CNN
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 [START] person wearing

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 46 April 25, 2024
Image Captioning with RNNs and Attention
Decoder: yt = gV(yt-1, ht-1, ct)
Each timestep of decoder uses a
New context vector at every time step
different context vector that looks at
different parts of the input image

person wearing hat [END]

y1 y2 y3 y4

z0,0 z0,1 z0,2


h0 h1 h2 h3 h4
z1,0 z1,1 z1,2
CNN
z2,0 z2,1 z2,2

Extract spatial Features:


c1 y0 c2 y1 c3 y2 c4 y3
features from a HxWxD
pretrained CNN
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 [START] person wearing hat

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 47 April 25, 2024
Image Captioning with RNNs and Attention
Alignment scores: Attention: This entire process is differentiable.
HxW HxW - model chooses its own
e1,0,0 e1,0,1 e1,0,2
attention weights. No attention
a1,0,0 a1,0,1 a1,0,2
supervision is required
e1,1,0 e1,1,1 e1,1,2 a1,1,0 a1,1,1 a1,1,2
person wearing hat [END]
e1,2,0 e1,2,1 e1,2,2 a1,2,0 a1,2,1 a1,2,2
y1 y2 y3 y4

z0,0 z0,1 z0,2


h0 h1 h2 h3 h4
z1,0 z1,1 z1,2
CNN
z2,0 z2,1 z2,2

Extract spatial Features:


c1 y0 c2 y1 c3 y2 c4 y3
features from a HxWxD
pretrained CNN
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 [START] person wearing hat

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 48 April 25, 2024
Image Captioning with Attention

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 49 April 25, 2024
Image Captioning with RNNs and Attention
Alignment scores: Attention: A general and useful tool!
HxW HxW Calculating vectors that are learned,
weighted averages over inputs and features.
e1,0,0 e1,0,1 e1,0,2 a1,0,0 a1,0,1 a1,0,2
e1,1,0 e1,1,1 e1,1,2 a1,1,0 a1,1,1 a1,1,2
person wearing hat [END]
e1,2,0 e1,2,1 e1,2,2 a1,2,0 a1,2,1 a1,2,2
y1 y2 y3 y4

z0,0 z0,1 z0,2


h0 h1 h2 h3 h4
z1,0 z1,1 z1,2
CNN
z2,0 z2,1 z2,2

Extract spatial Features:


c1 y0 c2 y1 c3 y2 c4 y3
features from a HxWxD
pretrained CNN
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015 [START] person wearing hat

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 50 April 25, 2024
Attention we just saw in image captioning

z0,0 z0,1 z0,2


Features

z1,0 z1,1 z1,2


z2,0 z2,1 z2,2
Inputs:
Features: z (shape: H x W x D)
h Query: h (shape: D)  “query” refers to a vector used to calculate a corresponding context
vector.

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 51 April 25, 2024
Attention we just saw in image captioning

Operations:
Alignment: ei,j = fatt(h, zi,j)

z0,0 z0,1 z0,2 e0,0 e0,1 e0,2


Features

Alignment

z1,0 z1,1 z1,2 e1,0 e1,1 e1,2

z2,0 z2,1 z2,2 e2,0 e2,1 e2,2

Inputs:
h Features: z (shape: H x W x D)
Query: h (shape: D)

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 52 April 25, 2024
Attention we just saw in image captioning

a0,0 a0,1 a0,2

Attention
a1,0 a1,1 a1,2
Operations:
a2,0 a2,1 a2,2 Alignment: ei,j = fatt(h, zi,j)
Attention: a = softmax(e)

softmax

z0,0 z0,1 z0,2 e0,0 e0,1 e0,2


Features

Alignment

z1,0 z1,1 z1,2 e1,0 e1,1 e1,2

z2,0 z2,1 z2,2 e2,0 e2,1 e2,2

Inputs:
h Features: z (shape: H x W x D)
Query: h (shape: D)

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 53 April 25, 2024
Attention we just saw in image captioning
c
Outputs:
context vector: c (shape: D)
mul + add

a0,0 a0,1 a0,2

Attention
a1,0 a1,1 a1,2
Operations:
a2,0 a2,1 a2,2 Alignment: ei,j = fatt(h, zi,j)
Attention: a = softmax(e)
Output: c = ∑i,j ai,jzi,j
softmax

z0,0 z0,1 z0,2 e0,0 e0,1 e0,2


Features

Alignment

z1,0 z1,1 z1,2 e1,0 e1,1 e1,2

z2,0 z2,1 z2,2 e2,0 e2,1 e2,2

Inputs:
h Features: z (shape: H x W x D)
Query: h (shape: D)

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 54 April 25, 2024
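The Operations box above (alignment, softmax, weighted sum over an H x W grid of features) can be sketched directly. This is a minimal illustration, assuming fatt is a small MLP as the slide states; the sizes and the exact f_att architecture are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

H, W, D = 3, 3, 512
z = torch.randn(H, W, D)                         # spatial CNN features (H x W x D)
h_query = torch.randn(D)                         # query vector h (decoder hidden state)
f_att = nn.Sequential(nn.Linear(2 * D, D), nn.Tanh(), nn.Linear(D, 1))  # small MLP for alignment

z_flat = z.view(H * W, D)                        # treat the grid as a set of H*W feature vectors
# Alignment: e_{i,j} = f_att(h, z_{i,j}) for every grid position
e = f_att(torch.cat([h_query.expand(H * W, D), z_flat], dim=-1)).squeeze(-1)
a = F.softmax(e, dim=0)                          # Attention: weights over all positions, sum to 1
c = (a.unsqueeze(-1) * z_flat).sum(dim=0)        # Output: context vector c = sum_{i,j} a_{i,j} z_{i,j}, shape (D,)
```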
Attention we just saw in image captioning
c
Outputs:
context vector: c (shape: D)
mul + add

a0,0 a0,1 a0,2


How is this different

Attention
a1,0 a1,1 a1,2
Operations: from the attention
a2,0 a2,1 a2,2 Alignment: ei,j = fatt(h, zi,j) mechanism in
Attention: a = softmax(e)
Output: c = ∑i,j ai,jzi,j transformers?
softmax

z0,0 z0,1 z0,2 e0,0 e0,1 e0,2 We’ll go into that next,
Features

Alignment

z1,0 z1,1 z1,2 e1,0 e1,1 e1,2


any questions?
z2,0 z2,1 z2,2 e2,0 e2,1 e2,2

Inputs:
h Features: z (shape: H x W x D)
Query: h (shape: D)

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 55 April 25, 2024
General attention layer – used in LLMs + beyond
c
Outputs:
context vector: c (shape: D)
mul + add

a0

Attention
a1 Operations:
a2
Alignment: ei = fatt(h, xi)
Attention: a = softmax(e)
Output: c = ∑i ai xi
softmax
Input vectors

x0 e0
Alignment

x1 e1
Attention operation is permutation invariant.
x2 e2 - Doesn't care about ordering of the features
- Stretch into N = H x W vectors
Inputs:
h Input vectors: x (shape: N x D)
Query: h (shape: D)

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 56 April 25, 2024
General attention layer
c
Outputs:
context vector: c (shape: D)
mul + add

a0

Attention
Change fatt(.) to a dot product. This actually can work well
in practice, but a simple dot product can have some issues…

a1 Operations:
a2 Alignment: ei = h ᐧ xi
Attention: a = softmax(e)
Output: c = ∑i ai xi
softmax
Input vectors

x0 e0
Alignment

x1 e1

x2 e2

Inputs:
h Input vectors: x (shape: N x D)
Query: h (shape: D)

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 57 April 25, 2024
General attention layer
c
Outputs:
context vector: c (shape: D)
mul + add
Change fatt(.) to a scaled simple dot product
- Larger dimensions mean more terms in the dot product sum.
- So, the variance of the logits is higher. Large magnitude
vectors will produce much higher logits.
- So, the post-softmax distribution has lower entropy,
assuming logits are IID.
- Ultimately, these large magnitude vectors will cause softmax
to peak and assign very little weight to all others.
- Divide by √D to reduce the effect of large magnitude vectors.
- Similar to Xavier and Kaiming initialization!

a0
Attention
a1 Operations:
a2 Alignment: ei = h ᐧ xi / √D
Attention: a = softmax(e)
Output: c = ∑i ai xi
softmax
Input vectors

x0 e0
Alignment
x1 e1
x2 e2
Inputs:
h Input vectors: x (shape: N x D)
Query: h (shape: D)

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 58 April 25, 2024
General attention layer
y0 y1 y2
Outputs:
mul(→) + add (↑)
context vectors: y (shape: D)
Multiple query vectors
- Each query creates a new, corresponding output context vector
- Allows us to compute multiple attention context vectors at once
- Will go into more details in future slides, but this allows us to
compute context vectors for multiple timesteps in parallel

a0,0 a0,1 a0,2
Attention
a1,0 a1,1 a1,2
Operations:
a2,0 a2,1 a2,2
Alignment: ei,j = qj ᐧ xi / √D
Attention: a = softmax(e)
Output: yj = ∑i ai,j xi
softmax (↑)
Input vectors

x0 e0,0 e0,1 e0,2
Alignment
x1 e1,0 e1,1 e1,2

x2 e2,0 e2,1 e2,2

Multiple query vectors


Inputs:
q0 q1 q2 Input vectors: x (shape: N x D)
Queries: q (shape: M x D)

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 59 April 25, 2024
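With scaled dot-product alignment and multiple queries, the whole layer reduces to two matrix multiplications and a softmax. A minimal sketch of the operations above; the sizes are illustrative.

```python
import torch
import torch.nn.functional as F

N, M, D = 3, 3, 64
x = torch.randn(N, D)                    # input vectors
q = torch.randn(M, D)                    # query vectors

e = q @ x.T / D ** 0.5                   # alignment: e_{i,j} = q_j . x_i / sqrt(D), shape (M, N)
a = F.softmax(e, dim=-1)                 # attention weights: each query's weights sum to 1
y = a @ x                                # outputs: y_j = sum_i a_{i,j} x_i, shape (M, D)
```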
General attention layer
y0 y1 y2
Outputs:
mul(→) + add (↑)
context vectors: y (shape: D)

a0,0 a0,1 a0,2


Notice that the input vectors are used for
both the alignment as well as the attention calculations.
- We can add more expressivity to the layer by adding a
different FC layer before each of the two steps.

Attention
a1,0 a1,1 a1,2
Operations:
a2,0 a2,1 a2,2
Alignment: ei,j = qj ᐧ xi / √D
Attention: a = softmax(e)
Output: yj = ∑i ai,j xi
softmax (↑)
Input vectors

x0 e0,0 e0,1 e0,2


Alignment

x1 e1,0 e1,1 e1,2

x2 e2,0 e2,1 e2,2

Inputs:
q0 q1 q2 Input vectors: x (shape: N x D)
Queries: q (shape: M x D)

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 60 April 25, 2024
General attention layer

v0
Notice that the input vectors are used for
v1 Operations: both the alignment as well as the
Key vectors: k = xWk attention calculations.
v2 - We can add more expressivity to
Value vectors: v = xWv
the layer by adding a different FC
layer before each of the two steps.
Input vectors

x0 k0

x1 k1

x2 k2
Inputs:
q0 q1 q2 Input vectors: x (shape: N x D)
Queries: q (shape: M x Dk)

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 61 April 25, 2024
General attention layer
y0 y1 y2
Outputs:
The input and output dimensions can
context vectors: y (shape: Dv)
mul(→) + add (↑) now change depending on the key and
value FC layers
v0 a0,0 a0,1 a0,2

Attention
v1 a1,0 a1,1 a1,2
Operations: Since the alignment scores are just
Key vectors: k = xWk scalars, the value vectors can be any
v2 a2,0 a2,1 a2,2
dimension we want
Value vectors: v = xWv
Alignment: ei,j = qj ᐧ ki / √D
softmax (↑) Attention: a = softmax(e)
Input vectors

Output: yj = ∑i ai,j vi
x0 k0 e0,0 e0,1 e0,2
Alignment

x1 k1 e1,0 e1,1 e1,2

x2 k2 e2,0 e2,1 e2,2

Inputs:
q0 q1 q2 Input vectors: x (shape: N x D)
Queries: q (shape: M x Dk)

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 62 April 25, 2024
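Adding the key and value FC layers changes only the alignment and output steps. A sketch under the notation above; Dk and Dv are illustrative, and the query vectors are assumed to already live in the key dimension Dk.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N, M, D, Dk, Dv = 3, 3, 64, 32, 128
x = torch.randn(N, D)                    # input vectors
q = torch.randn(M, Dk)                   # query vectors (in key dimension Dk)
Wk = nn.Linear(D, Dk, bias=False)        # key vectors: k = x W_k
Wv = nn.Linear(D, Dv, bias=False)        # value vectors: v = x W_v

k, v = Wk(x), Wv(x)
e = q @ k.T / Dk ** 0.5                  # alignment: e_{i,j} = q_j . k_i / sqrt(D_k)
a = F.softmax(e, dim=-1)                 # attention weights
y = a @ v                                # y_j = sum_i a_{i,j} v_i -> output dimension D_v can differ from D
```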
General attention layer This is a working example of how we could use an
attention layer + CNN encoder for image captioning
y0 y1 y2
Outputs:
mul(→) + add (↑)
context vectors: y (shape: Dv)

Recall that the query vector was a


v0 a0,0 a0,1 a0,2
function of the input vectors

Attention
v1 a1,0 a1,1 a1,2
Operations: Encoder: h0 = fW(z)
Key vectors: k = xWk
v2 a2,0 a2,1 a2,2
Value vectors: v = xWv
where z is spatial CNN features
Alignment: ei,j = qj ᐧ ki / √D fW(.) is an MLP
softmax (↑) Attention: a = softmax(e)
z0,0 z0,1 z0,2
Input vectors

Output: yj = ∑i ai,j vi h0
x0 k0 e0,0 e0,1 e0,2
CNN z1,0 z1,1 z1,2 MLP
Alignment

x1 k1 e1,0 e1,1 e1,2 z2,0 z2,1 z2,2

x2 k2 e2,0 e2,1 e2,2

Inputs:
q0 q1 q2 Input vectors: x (shape: N x D) We used h0 as q0 previously
Queries: q (shape: M x Dk)

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 63 April 25, 2024
Lecture 8:
Video Lecture Supplement
Attention and Transformers

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 64 April 25, 2024
Next: The Self-attention Layer
y0 y1 y2
Outputs:
mul(→) + add (↑)
context vectors: y (shape: Dv)

v0 a0,0 a0,1 a0,2

Attention
v1 a1,0 a1,1 a1,2
Operations:
Idea: leverages the
v2 a2,0 a2,1 a2,2
Key vectors: k = xWk strengths of attention
Value vectors: v = xWv
Alignment: ei,j = qj ᐧ ki / √D
layers without the need
softmax (↑) Attention: a = softmax(e) for separate query
Input vectors

Output: yj = ∑i ai,j vi
vectors.
x0 k0 e0,0 e0,1 e0,2
Alignment

x1 k1 e1,0 e1,1 e1,2

x2 k2 e2,0 e2,1 e2,2

Inputs:
q0 q1 q2 Input vectors: x (shape: N x D)
Queries: q (shape: M x Dk)

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 65 April 25, 2024
Self attention layer

We can calculate the query vectors


from the input vectors, therefore,
defining a "self-attention" layer.
Operations:
Key vectors: k = xWk
Value vectors: v = xWv
Query vectors: q = xWq Instead, query vectors are
Alignment: ei,j = qj ᐧ ki / √D calculated using a FC layer.
Input vectors

Attention: a = softmax(e)
x0 Output: yj = ∑i ai,j vi

x1

x2
No input query vectors anymore
Inputs:
q0 q1 q2 Input vectors: x (shape: N x D)
Queries: q (shape: M x Dk)

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 66 April 25, 2024
Self attention layer
y0 y1 y2
Outputs:
mul(→) + add (↑)
context vectors: y (shape: Dv)

v0 a0,0 a0,1 a0,2

Attention
v1 a1,0 a1,1 a1,2
Operations:
Key vectors: k = xWk
v2 a2,0 a2,1 a2,2
Value vectors: v = xWv
Query vectors: q = xWq
softmax (↑) Alignment: ei,j = qj ᐧ ki / √D
Input vectors

Attention: a = softmax(e)
x0 k0 e0,0 e0,1 e0,2 Output: yj = ∑i ai,j vi
Alignment

x1 k1 e1,0 e1,1 e1,2

x2 k2 e2,0 e2,1 e2,2

Inputs:
q0 q1 q2 Input vectors: x (shape: N x D)

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 67 April 25, 2024
Self attention layer - attends over sets of inputs
y0 y1 y2
Outputs:
mul(→) + add (↑)
context vectors: y (shape: Dv)

v0 a0,0 a0,1 a0,2

Attention
y0 y1 y2
v1 a1,0 a1,1 a1,2
Operations:
Key vectors: k = xWk
v2 a2,0 a2,1 a2,2
Value vectors: v = xWv
self-attention
Query vectors: q = xWq
softmax (↑) Alignment: ei,j = qj ᐧ ki / √D x0 x1 x2
Input vectors

Attention: a = softmax(e)
x0 k0 e0,0 e0,1 e0,2 Output: yj = ∑i ai,j vi
Alignment

x1 k1 e1,0 e1,1 e1,2

x2 k2 e2,0 e2,1 e2,2

Inputs:
q0 q1 q2 Input vectors: x (shape: N x D)

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 68 April 25, 2024
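Putting the three projections together gives a single-head self-attention layer: queries, keys, and values all come from the same input vectors. A minimal sketch of the operations listed above; the module name and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head self-attention: q, k, v are all computed from x."""
    def __init__(self, dim, dk, dv):
        super().__init__()
        self.Wq = nn.Linear(dim, dk, bias=False)   # q = x W_q
        self.Wk = nn.Linear(dim, dk, bias=False)   # k = x W_k
        self.Wv = nn.Linear(dim, dv, bias=False)   # v = x W_v

    def forward(self, x):                          # x: (N, dim)
        q, k, v = self.Wq(x), self.Wk(x), self.Wv(x)
        e = q @ k.T / k.size(-1) ** 0.5            # e_{i,j} = q_j . k_i / sqrt(D_k)
        a = F.softmax(e, dim=-1)                   # each output attends over all inputs
        return a @ v                               # (N, dv): one context vector per input

y = SelfAttention(64, 32, 64)(torch.randn(3, 64))  # three inputs -> three context vectors
```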
Self attention layer - attends over sets of inputs

y1 y0 y2 y2 y1 y0 y0 y1 y2

self-attention self-attention self-attention

x1 x0 x2 x2 x1 x0 x0 x1 x2

Permutation equivariant

Self-attention layer doesn’t care about the order of the inputs!

Problem: How can we encode ordered sequences like language or spatially ordered image features?

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 69 April 25, 2024
Positional encoding
y0 y1 y2

self-attention

x0 x1 x2
p0 p1 p2

position encoding
Possible desirable properties of pos(.) :
x0 x1 x2 1. It should output a unique encoding for each time-
step (word’s position in a sentence)
Concatenate or add special positional 2. Distance between any two time-steps should be
encoding pj to each input vector xj consistent across sentences with different lengths.
3. Our model should generalize to longer sentences
We use a function pos: N →Rd without any efforts. Its values should be bounded.
to process the position j of the vector 4. It must be deterministic.
into a d-dimensional vector

So, pj = pos(j)

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 70 April 25, 2024
Positional encoding
Options for pos(.)
y0 y1 y2
1. Learn a lookup table:
○ Learn parameters to use for pos(t) for t ∈ [0, T)
self-attention ○ Lookup table contains T x d parameters.

x0 x1 x2
p0 p1 p2

position encoding
Possible desirable properties of pos(.) :
x0 x1 x2 1. It should output a unique encoding for each time-
step (word’s position in a sentence)
Concatenate special positional 2. Distance between any two time-steps should be
encoding pj to each input vector xj consistent across sentences with different lengths.
3. Our model should generalize to longer sentences
We use a function pos: N →Rd without any efforts. Its values should be bounded.
to process the position j of the 4. It must be deterministic.
vector into a d-dimensional vector

So, pj = pos(j) Vaswani et al, “Attention is all you need”, NeurIPS 2017

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 71 April 25, 2024
Positional encoding
Options for pos(.)
y0 y1 y2
1. Learn a lookup table:
○ Learn parameters to use for pos(t) for t ∈ [0, T)
self-attention ○ Lookup table contains T x d parameters.

x0 x1 x2 2. Design a fixed function with the desired properties


p0 p1 p2

position encoding

x1 x0 x2

Concatenate special positional


encoding pj to each input vector xj
p(t) = [sin(ω1 t), cos(ω1 t), sin(ω2 t), cos(ω2 t), …, sin(ωd/2 t), cos(ωd/2 t)]
where ωk = 1 / 10000^(2k/d)

We use a function pos: N → Rd
to process the position j of the vector into a d-dimensional vector
So, pj = pos(j) Vaswani et al, “Attention is all you need”, NeurIPS 2017

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 72 April 25, 2024
Positional encoding
Options for pos(.)
y0 y1 y2
1. Learn a lookup table:
○ Learn parameters to use for pos(t) for t ∈ [0, T)
self-attention ○ Lookup table contains T x d parameters.

x0 x1 x2 2. Design a fixed function with the desired properties


p0 p1 p2

position encoding Intuition:

x0 x1 x2

Concatenate special positional


encoding pj to each input vector xj
p(t) = [sin(ω1 t), cos(ω1 t), sin(ω2 t), cos(ω2 t), …, sin(ωd/2 t), cos(ωd/2 t)]
where ωk = 1 / 10000^(2k/d)

We use a function pos: N → Rd
to process the position j of the vector into a d-dimensional vector
So, pj = pos(j) Vaswani et al, “Attention is all you need”, NeurIPS 2017

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 73 April 25, 2024
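The fixed sin/cos option can be sketched directly from the formula above, assuming the frequencies ωk = 1/10000^(2k/d) from Vaswani et al.; here the encoding is added to the inputs, though concatenation also works as the slide notes. d is assumed even.

```python
import torch

def sinusoidal_pos_encoding(T, d):
    """pos(t): fixed sin/cos positional encoding (unique, bounded, deterministic)."""
    t = torch.arange(T, dtype=torch.float).unsqueeze(1)      # positions 0..T-1, shape (T, 1)
    k = torch.arange(0, d, 2, dtype=torch.float)             # even dimension indices
    omega = 1.0 / (10000 ** (k / d))                         # frequencies omega_k
    p = torch.zeros(T, d)
    p[:, 0::2] = torch.sin(t * omega)                        # even dims: sin(omega_k * t)
    p[:, 1::2] = torch.cos(t * omega)                        # odd dims:  cos(omega_k * t)
    return p                                                 # (T, d)

x = torch.randn(10, 64)                   # 10 input vectors of dimension 64
x = x + sinusoidal_pos_encoding(10, 64)   # add positional information to the inputs
```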
Masked self-attention layer
y0 y1 y2
Outputs:
mul(→) + add (↑)
context vectors: y (shape: Dv)

- Allows us to parallelize
v0 a0,0 a0,1 a0,2
attention across time

Attention
v1 0 a1,1 a1,2
Operations: - Don’t need to calculate the
v2 0 0 a2,2
Key vectors: k = xWk context vectors from the
Value vectors: v = xWv previous timestep first!
Query vectors: q = xWq
softmax (↑) Alignment: ei,j = qj ᐧ ki / √D
- Prevent vectors from
Input vectors

Attention: a = softmax(e)
x0 k0 e0,0 e0,1 e0,2 Output: yj = ∑i ai,j vi looking at future vectors.
Alignment

x1 k1 -∞ e1,1 e1,2 - Manually set alignment
scores to –infinity
x2 k2 -∞ -∞ e2,2

Inputs:
q0 q1 q2 Input vectors: x (shape: N x D)

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 74 April 25, 2024
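A sketch of the masking step: build an upper-triangular mask and set those alignment scores to -∞ before the softmax, so position t never attends to later positions. Projections are omitted for brevity; the sizes are illustrative.

```python
import torch
import torch.nn.functional as F

N, D = 4, 64
x = torch.randn(N, D)
q = k = v = x                                    # q, k, v projections omitted for brevity

e = q @ k.T / D ** 0.5                           # (N, N) alignment scores; row = query position
mask = torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)
e = e.masked_fill(mask, float('-inf'))           # position t cannot see positions > t
a = F.softmax(e, dim=-1)                         # masked entries receive exactly zero weight
y = a @ v                                        # all timesteps computed in parallel
```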
Multi-head self-attention layer
- Multiple self-attention “heads” in parallel
y0 y1 y2

Q: Why do this?

Concatenate + FC layer to reduce dim

head0 head1 headH-1


y0 y1 y2 y0 y1 y2 y0 y1 y2

Self-attention Self-attention ... Self-attention

x0 x1 x2 x0 x1 x2 x0 x1 x2

Split
x0 x1 x2

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 75 April 25, 2024
Multi-head self-attention layer
- Multiple self-attention “heads” in parallel
y0 y1 y2

A: We may want to have


multiple sets of
Concatenate + FC layer to reduce dim queries/keys/values
calculated in the layer.
This is a similar idea to
head0 head1 headH-1 having multiple conv
y0 y1 y2 y0 y1 y2 y0 y1 y2 filters learned in a layer

Self-attention Self-attention ... Self-attention

x0 x1 x2 x0 x1 x2 x0 x1 x2

Split
x0 x1 x2

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 76 April 25, 2024
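A compact sketch of multi-head self-attention: one fused FC produces q, k, v for all heads, each head attends independently, and a final FC mixes the concatenated heads. The fused qkv projection is an implementation convenience assumed here, not something the slide prescribes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """H self-attention heads in parallel; outputs are concatenated and mixed by a final FC layer."""
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.dh = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)   # all heads' q, k, v in one matmul
        self.proj = nn.Linear(dim, dim)                  # the "concatenate + FC" step

    def forward(self, x):                                # x: (N, dim)
        N, dim = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split q, k, v into (num_heads, N, head_dim) so each head attends independently
        q, k, v = (t.view(N, self.h, self.dh).transpose(0, 1) for t in (q, k, v))
        a = F.softmax(q @ k.transpose(-2, -1) / self.dh ** 0.5, dim=-1)   # (h, N, N)
        y = (a @ v).transpose(0, 1).reshape(N, dim)      # concatenate heads back to (N, dim)
        return self.proj(y)

y = MultiHeadSelfAttention(64, 8)(torch.randn(5, 64))
```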
General attention versus self-attention

Transformer models rely on many, stacked self-attention layers

y0 y1 y2 y0 y1 y2

attention self-attention

k0 k1 k2 v0 v1 v2 q0 q1 q2 x0 x1 x2

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 77 April 25, 2024
Comparing RNNs to Transformer

RNNs
(+) LSTMs work reasonably well for long sequences.
(-) Expects an ordered sequence of inputs
(-) Sequential computation: subsequent hidden states can only be computed after the previous
ones are done.

Transformer:
(+) Good at long sequences. Each attention calculation looks at all inputs.
(+) Can operate over unordered sets or ordered sequences with positional encodings.
(+) Parallel computation: All alignment and attention scores for all inputs can be done in parallel.
(-) Requires a lot of memory: N x M alignment and attention scalars need to be calculated and
stored for a single self-attention head. (but GPUs are getting bigger and better)

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 78 April 25, 2024
“ImageNet Moment for Natural
Language Processing”

Pretraining:
Download a lot of text from the
internet

Train a giant Transformer model for


language modeling

Finetuning:
Fine-tune the Transformer on your
own NLP task

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 79 April 25, 2024
Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 80 April 25, 2024
Image Captioning using Transformers
Input: Image I
Output: Sequence y = y1, y2,..., yT

z0,0 z0,1 z0,2

z1,0 z1,1 z1,2


CNN
z2,0 z2,1 z2,2

Extract spatial Features:


features from a HxWxD
pretrained CNN

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 81 April 25, 2024
Image Captioning using Transformers
Input: Image I
Output: Sequence y = y1, y2,..., yT

Encoder: c = TW(z)
where z is spatial CNN features
TW(.) is the transformer encoder

z0,0 z0,1 z0,2 c0,0 c0,1 c0,2 ... c2,2


z1,0 z1,1 z1,2
CNN
z2,0 z2,1 z2,2 Transformer encoder

Extract spatial Features:


features from a HxWxD z0,0 z0,1 z0,2 ... z2,2
pretrained CNN

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 82 April 25, 2024
Image Captioning using Transformers
Input: Image I Decoder: yt = TD(y0:t-1, c)
Output: Sequence y = y1, y2,..., yT where TD(.) is the transformer decoder

Encoder: c = TW(z) person wearing hat [END]


where z is spatial CNN features
TW(.) is the transformer encoder y1 y2 y3 y4

z0,0 z0,1 z0,2 c0,0 c0,1 c0,2 ... c2,2


Transformer decoder
z1,0 z1,1 z1,2
CNN
z2,0 z2,1 z2,2 Transformer encoder

Extract spatial Features:


y0 y1 y2 y3
features from a HxWxD z0,0 z0,1 z0,2 ... z2,2
pretrained CNN
[START] person wearing hat

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 83 April 25, 2024
The Transformer encoder block
c0,0 c0,1 c0,2 ... c2,2
Transformer encoder

xN

Made up of N encoder blocks.

In Vaswani et al., N = 6, Dq = 512

Positional encoding

z0,0 z0,1 z0,2 ... z2,2

Vaswani et al, “Attention is all you need”, NeurIPS 2017

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 84 April 25, 2024
The Transformer encoder block
c0,0 c0,1 c0,2 ... c2,2
Transformer encoder

xN

Let's dive into one encoder block

Positional encoding

z0,0 z0,1 z0,2 ... z2,2 x0 x1 x2 x2


Vaswani et al, “Attention is all you need”, NeurIPS 2017

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 85 April 25, 2024
The Transformer encoder block
c0,0 c0,1 c0,2 ... c2,2
Transformer encoder

xN

Multi-head self-attention Attention attends over all the vectors


Positional encoding

z0,0 z0,1 z0,2 ... z2,2 x0 x1 x2 x2


Vaswani et al, “Attention is all you need”, NeurIPS 2017

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 86 April 25, 2024
The Transformer encoder block
c0,0 c0,1 c0,2 ... c2,2
Transformer encoder

xN

+ Residual connection

Multi-head self-attention Attention attends over all the vectors


Positional encoding

z0,0 z0,1 z0,2 ... z2,2 x0 x1 x2 x2


Vaswani et al, “Attention is all you need”, NeurIPS 2017

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 87 April 25, 2024
The Transformer encoder block
c0,0 c0,1 c0,2 ... c2,2
Transformer encoder

xN

Layer norm LayerNorm over each vector individually

+ Residual connection

Multi-head self-attention Attention attends over all the vectors


Positional encoding

z0,0 z0,1 z0,2 ... z2,2 x0 x1 x2 x2


Vaswani et al, “Attention is all you need”, NeurIPS 2017

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 88 April 25, 2024
The Transformer encoder block
c0,0 c0,1 c0,2 ... c2,2
Transformer encoder

xN
MLP MLP over each vector individually

Layer norm LayerNorm over each vector individually

+ Residual connection

Multi-head self-attention Attention attends over all the vectors


Positional encoding

z0,0 z0,1 z0,2 ... z2,2 x0 x1 x2 x2


Vaswani et al, “Attention is all you need”, NeurIPS 2017

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 89 April 25, 2024
The Transformer encoder block
c0,0 c0,1 c0,2 ... c2,2
Transformer encoder

+ Residual connection
xN
MLP MLP over each vector individually

Layer norm LayerNorm over each vector individually

+ Residual connection

Multi-head self-attention Attention attends over all the vectors


Positional encoding

z0,0 z0,1 z0,2 ... z2,2 x0 x1 x2 x2


Vaswani et al, “Attention is all you need”, NeurIPS 2017

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 90 April 25, 2024
The Transformer encoder block
c0,0 c0,1 c0,2 ... c2,2

Layer norm LayerNorm over each vector individually


Transformer encoder

+ Residual connection
xN
MLP MLP over each vector individually

Layer norm LayerNorm over each vector individually

+ Residual connection

Multi-head self-attention Attention attends over all the vectors


Positional encoding

z0,0 z0,1 z0,2 ... z2,2 x0 x1 x2 x2


Vaswani et al, “Attention is all you need”, NeurIPS 2017

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 91 April 25, 2024
The Transformer encoder block
c0,0 c0,1 c0,2 ... c2,2
y0 y1 y2 y3
Transformer Encoder Block:
Layer norm
Inputs: Set of vectors x
Transformer encoder

+ Outputs: Set of vectors y


xN
MLP Self-attention is the only
interaction between vectors.
Layer norm

+
Layer norm and MLP operate
independently per vector.
Multi-head self-attention
Highly scalable, highly
Positional encoding parallelizable, but high memory usage.

z0,0 z0,1 z0,2 ... z2,2 x0 x1 x2 x2


Vaswani et al, “Attention is all you need”, NeurIPS 2017

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 92 April 25, 2024
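The encoder block above maps directly to code. A minimal sketch using PyTorch's built-in nn.MultiheadAttention and the post-norm ordering shown on the slide (attention, residual, LayerNorm, MLP, residual, LayerNorm); the MLP width and activation are assumptions for the example.

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """One encoder block: self-attention mixes vectors; LayerNorm and the MLP act per vector."""
    def __init__(self, dim, num_heads, mlp_dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.ReLU(), nn.Linear(mlp_dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                              # x: (B, N, dim) -- a set of vectors per example
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])  # self-attention + residual + LayerNorm
        x = self.norm2(x + self.mlp(x))                # per-vector MLP + residual + LayerNorm
        return x

y = TransformerEncoderBlock(64, 8, 256)(torch.randn(2, 9, 64))   # 9 input vectors -> 9 output vectors
```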
The Transformer decoder
person wearing hat [END]

y0 y1 y2 y3

Transformer decoder
Made up of N decoder blocks.
xN

c0,0 In Vaswani et al., N = 6, Dq = 512

c0,1

c0,2
...

c2,2 Positional encoding

y0 y1 y2 y3

[START] person wearing hat Vaswani et al, “Attention is all you need”, NeurIPS 2017

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 93 April 25, 2024
The Transformer y0 y1 y2 y3

decoder block FC

person wearing hat [END]

y0 y1 y2 y3

Transformer decoder
Let's dive into the
c0,0
transformer decoder block
xN
c0,1
c0,0
c0,2
c0,1

...
c0,2 c2,2
...

c2,2
y0 y1 y2 y3

x0 x1 x2 x3
[START] person wearing hat Vaswani et al, “Attention is all you need”, NeurIPS 2017

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 94 April 25, 2024
The Transformer y0 y1 y2 y3

decoder block FC
Layer norm
person wearing hat [END]
+

y0 y1 y2 y3
MLP
Layer norm

Transformer decoder
+ Most of the network is the
c0,0
same as the transformer
xN encoder.
c0,1
c0,0
c0,2
Layer norm
c0,1

...
+

c0,2 c2,2
Masked Multi-head Ensures we only look at
self-attention
...

the previous tokens


c2,2 (teacher forcing during
y0 y1 y2 y3
training)
x0 x1 x2 x3
[START] person wearing hat Vaswani et al, “Attention is all you need”, NeurIPS 2017

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 95 April 25, 2024
The Transformer y0 y1 y2 y3

decoder block FC
Layer norm
person wearing hat [END]
+

y0 y1 y2 y3
MLP
Layer norm

Transformer decoder
+ Multi-head attention block
c0,0
attends over the transformer
xN Multi-head attention encoder outputs.
c0,1 k v q
c0,0
c0,2 For image captioning, this is
Layer norm
c0,1 how we inject image

...
+
features into the decoder.
c0,2 c2,2
Masked Multi-head
self-attention
...

c2,2
y0 y1 y2 y3

x0 x1 x2 x3
[START] person wearing hat Vaswani et al, “Attention is all you need”, NeurIPS 2017

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 96 April 25, 2024
The Transformer y0 y1 y2 y3

decoder block FC
Layer norm
Transformer Decoder Block:

person wearing hat [END] Inputs: Set of vectors x and


+
Set of context vectors c.
y0 y1 y2 y3 Outputs: Set of vectors y.
MLP
Layer norm
Masked Self-attention only

Transformer decoder
c0,0
+ interacts with past inputs.
xN Multi-head attention
c0,1 k v q Multi-head attention block is
c0,0
NOT self-attention. It attends
c0,2 over encoder outputs.
Layer norm
c0,1

...
+
Highly scalable, highly
c0,2 c2,2
Masked Multi-head parallelizable, but high memory
self-attention usage.
...

c2,2
y0 y1 y2 y3

x0 x1 x2 x3
[START] person wearing hat Vaswani et al, “Attention is all you need”, NeurIPS 2017

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 97 April 25, 2024
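The decoder block adds a causal mask on the self-attention and a cross-attention step whose keys and values come from the encoder outputs c. A minimal sketch; the layer ordering follows the slide and the hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class TransformerDecoderBlock(nn.Module):
    """Masked self-attention over y, cross-attention over encoder outputs c, then a per-vector MLP."""
    def __init__(self, dim, num_heads, mlp_dim):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.ReLU(), nn.Linear(mlp_dim, dim))

    def forward(self, x, c):                           # x: (B, T, dim) targets, c: (B, N, dim) encoder outputs
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        x = self.norm1(x + self.self_attn(x, x, x, attn_mask=causal, need_weights=False)[0])
        x = self.norm2(x + self.cross_attn(x, c, c, need_weights=False)[0])   # q from decoder, k/v from encoder
        x = self.norm3(x + self.mlp(x))
        return x

y = TransformerDecoderBlock(64, 8, 256)(torch.randn(2, 4, 64), torch.randn(2, 9, 64))
```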
Image Captioning using transformers
- No recurrence at all

person wearing hat [END]

y1 y2 y3 y4

z0,0 z0,1 z0,2 c0,0 c0,1 c0,2 ... c2,2


Transformer decoder
z1,0 z1,1 z1,2
CNN
z2,0 z2,1 z2,2 Transformer encoder

Extract spatial Features:


y0 y1 y2 y3
features from a HxWxD z0,0 z0,1 z0,2 ... z2,2
pretrained CNN
[START] person wearing hat

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 98 April 25, 2024
Image Captioning using transformers
- Perhaps we don't need
convolutions at all?

person wearing hat [END]

y1 y2 y3 y4

z0,0 z0,1 z0,2 c0,0 c0,1 c0,2 ... c2,2


Transformer decoder
z1,0 z1,1 z1,2
CNN
z2,0 z2,1 z2,2 Transformer encoder

Extract spatial Features:


y0 y1 y2 y3
features from a HxWxD z0,0 z0,1 z0,2 ... z2,2
pretrained CNN
[START] person wearing hat

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 99 April 25, 2024
Image Captioning using ONLY transformers
- Transformers from pixels to language

person wearing hat [END]

y1 y2 y3 y4

c0,0 c0,1 c0,2 ... c2,2


Transformer decoder

Transformer encoder

y0 y1 y2 y3
...

Dosovitskiy et al, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, ArXiv 2020
[START] person wearing hat
Colab link to an implementation of vision transformers

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 100 April 25, 2024
ViTs – Vision Transformers

Figure from:
Dosovitskiy et al, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, ArXiv 2020

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 101 April 25, 2024
Vision Transformers vs. ResNets

Dosovitskiy et al, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, ArXiv 2020
Colab link to an implementation of vision transformers

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 102 April 25, 2024
Vision Transformers

Fan et al, “Multiscale Vision Transformers”, ICCV 2021

Liu et al, “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”, CVPR 2021

Carion et al, “End-to-End Object Detection with Transformers”, ECCV 2020

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 103 April 25, 2024
ConvNets strike back!

A ConvNet for the 2020s. Liu et al. CVPR 2022

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 104 April 25, 2024
Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 105 April 25, 2024
Summary
- Adding attention to RNNs allows them to "attend" to different
parts of the input at every time step
- The general attention layer is a new type of layer that can be
used to design new neural network architectures
- Transformers are built from blocks that use self-attention and
layer norm.
○ They are highly scalable and highly parallelizable
○ Faster training, larger models, better performance across
vision and language tasks
○ They are quickly replacing RNNs, LSTMs, and may(?) even
replace convolutions.

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 106 April 25, 2024
Next time: Object Detection + Segmentation

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 107 April 25, 2024
Appendix Slides from Previous Years

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 108 April 25, 2024
Image Captioning with Attention

Soft attention

Hard attention
(requires
reinforcement
learning)

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 109 April 25, 2024
Example: CNN with Self-Attention

Input Image

CNN
Features:
CxHxW
Cat image is free to use under the Pixabay License

Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018 Slide credit: Justin Johnson

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 110 April 25, 2024
Example: CNN with Self-Attention

Queries:
C’ x H x W

Input Image 1x1 Conv

Keys:
CNN C’ x H x W

1x1 Conv
Features:
CxHxW
Cat image is free to use under the Pixabay License

Values:
C’ x H x W

1x1 Conv

Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018 Slide credit: Justin Johnson

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 111 April 25, 2024
Example: CNN with Self-Attention

Attention Weights
Queries:
Transpose (H x W) x (H x W)
C’ x H x W

Input Image 1x1 Conv


softmax
x
Keys:
CNN C’ x H x W

1x1 Conv
Features:
CxHxW
Cat image is free to use under the Pixabay License

Values:
C’ x H x W

1x1 Conv

Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018 Slide credit: Justin Johnson

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 112 April 25, 2024
Example: CNN with Self-Attention

Attention Weights
Queries:
Transpose (H x W) x (H x W)
C’ x H x W

Input Image 1x1 Conv


softmax
x
Keys:
CNN C’ x H x W

1x1 Conv
Features:
CxHxW C’ x H x W
Cat image is free to use under the Pixabay License

Values:
C’ x H x W
x
1x1 Conv

Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018 Slide credit: Justin Johnson

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 113 April 25, 2024
Example: CNN with Self-Attention

Attention Weights
Queries:
Transpose (H x W) x (H x W)
C’ x H x W

Input Image 1x1 Conv


softmax
x
CxHxH
Keys:
CNN C’ x H x W

1x1 Conv
Features:
CxHxW C’ x H x W
Cat image is free to use under the Pixabay License

Values:
C’ x H x W
x 1x1 Conv
1x1 Conv

Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018 Slide credit: Justin Johnson

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 114 April 25, 2024
Example: CNN with Self-Attention
Residual Connection

Attention Weights
Queries:
Transpose (H x W) x (H x W)
C’ x H x W

Input Image 1x1 Conv


softmax
x
CxHxW
Keys:
CNN C’ x H x W
+
1x1 Conv
Features:
CxHxW C’ x H x W
Cat image is free to use under the Pixabay License

Values:
C’ x H x W
x 1x1 Conv
1x1 Conv

Self-Attention Module
Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018 Slide credit: Justin Johnson

Fei-Fei Li, Ehsan Adeli, Zane Durante Lecture 9 - 115 April 25, 2024
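The 1x1-conv self-attention module from these slides (Zhang et al., SAGAN) can be sketched as below. This is a simplified version: the √ scaling is included for consistency with the earlier attention slides, and SAGAN's learned residual gate is replaced by a plain residual connection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNSelfAttention(nn.Module):
    """Self-attention over a C x H x W feature map, with 1x1 convs producing queries, keys, values."""
    def __init__(self, C, Cp):
        super().__init__()
        self.q = nn.Conv2d(C, Cp, kernel_size=1)
        self.k = nn.Conv2d(C, Cp, kernel_size=1)
        self.v = nn.Conv2d(C, Cp, kernel_size=1)
        self.out = nn.Conv2d(Cp, C, kernel_size=1)     # back to C channels for the residual connection

    def forward(self, x):                              # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)       # (B, HW, C')
        k = self.k(x).flatten(2)                       # (B, C', HW)
        v = self.v(x).flatten(2).transpose(1, 2)       # (B, HW, C')
        a = F.softmax(q @ k / k.size(1) ** 0.5, dim=-1)    # (B, HW, HW) attention weights
        y = (a @ v).transpose(1, 2).reshape(B, -1, H, W)   # weighted sum of values, back to a feature map
        return x + self.out(y)                         # residual connection around the attention module

y = CNNSelfAttention(64, 32)(torch.randn(1, 64, 7, 7))
```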
