Transformers
Anna Goldie
Lecture 8: Transformers
Adapted from slides by Anna Goldie, John Hewitt
Lecture Plan
1. Impact of Transformers on NLP (and ML more broadly)
2. From Recurrence (RNNs) to Attention-Based NLP Models
3. Understanding the Transformer Model
4. Drawbacks and Variants of Transformers
2
Transformers: Is Attention All We Need?
• Last lecture, we learned that attention dramatically improves the performance of
recurrent neural networks.
• Today, we will take this one step further and ask Is Attention All We Need?
• Spoiler: Not Quite!
5
Transformers Have Revolutionized the Field of NLP
• By the end of this lecture, you will deeply understand the neural architecture that
underpins virtually every state-of-the-art NLP model today!
[Figure: the Transformer encoder-decoder architecture — Input/Output Embeddings with Positional Encodings, a stack of 6 encoder layers (Multi-Head Attention, Feed Forward, Add & Norm) and 6 decoder layers (Masked Multi-Head Attention, Multi-Head Attention over the encoder, Feed Forward, Add & Norm), followed by a final Linear layer and Softmax producing Output Probabilities; the decoder inputs are the outputs shifted right]
6
[Vaswani et al., 2017]
Great Results with Transformers: Machine Translation
First, Machine Translation results from the original Transformers paper!
10
Transformers Even Show Promise Outside of NLP
Protein Folding
Image Classification
[Dosovitskiy et al. 2020]: Vision Transformer (ViT) outperforms
ResNet-based baselines with substantially less compute.
ML for Systems
[Zhou et al. 2020]: A Transformer-based
compiler model (GO-one) speeds up a
Transformer model!
13
Scaling Laws: Are Transformers All We Need?
• With Transformers, language modeling performance improves smoothly as we increase
model size, training data, and compute resources in tandem.
• This power-law relationship has been observed over multiple orders of magnitude with
no sign of slowing!
• If we keep scaling up these models (with no change to the architecture), could they
eventually match or exceed human-level performance?
15
As of last lecture: recurrent models for (most) NLP!
16
Why Move Beyond Recurrence?
Motivation for Transformer Architecture
The Transformer authors had 3 desiderata when designing this architecture:
1. Minimize (or at least not increase) computational complexity per layer.
2. Minimize path length between any pair of words to facilitate learning of long-range
dependencies.
3. Maximize the amount of computation that can be parallelized.
17
[Vaswani et al., 2017]
1. Transformer Motivation: Computational Complexity Per Layer
When sequence length (n) << representation dimension (d), complexity per layer is lower for a Transformer than for the recurrent models we’ve learned about so far: self-attention costs O(n²·d) per layer, while a recurrent layer costs O(n·d²).
18
[Vaswani et al., 2017]
2. Transformer Motivation: Minimize Linear Interaction Distance
[Figure: left, a recurrent encoder over h1 … hT, where information between distant words must pass through O(sequence length) intermediate steps; right, stacked self-attention layers over the embeddings h1 … hT, where all words attend to all words in the previous layer (most arrows omitted), so any two words interact within a single layer]
22
Computational Dependencies for Recurrence vs. Attention
[Figure: computational dependency graphs for an RNN-based encoder-decoder model with attention vs. a Transformer-based encoder-decoder model]
Transformer Advantages:
• The number of unparallelizable operations does not increase with sequence length.
• Each "word" interacts with every other word, so the maximum interaction distance is O(1).
24
Outline
1. Impact of Transformers on NLP (and ML more broadly)
2. From Recurrence (RNNs) to Attention-Based NLP Models
3. Understanding the Transformer Model
4. Drawbacks and Variants of Transformers
25
The Transformer Encoder-Decoder [Vaswani et al., 2017]
In this section, you will learn exactly how the Transformer encoder-decoder architecture works!
[Figure: the full Transformer encoder-decoder diagram, as on the earlier slide]
26
Encoder: Self-Attention
Self-Attention is the core building block of
the Transformer, so let's first focus on that!
[Figure: simplified diagram highlighting the Self-Attention blocks that sit above the Input and Output Embeddings in the encoder and decoder]
27
Intuition for Attention Mechanism
▪ Let's think of attention as a "fuzzy" or approximate hashtable:
▪ To look up a value, we compare a query against keys in a table.
▪ In a hashtable (shown on the bottom left):
▪ Each query (hash) maps to exactly one key-value pair.
▪ In (self-)attention (shown on the bottom right):
▪ Each query matches each key to varying degrees.
▪ We return a sum of values weighted by the query-key match.
[Figure: left, a hashtable in which query q maps to exactly one of the key-value pairs (k0, v0) … (k7, v7); right, attention in which query q matches every key to a varying degree and the values are combined by a weighted sum]
28
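To make the analogy concrete, here is a minimal NumPy sketch (not from the lecture) of a single "fuzzy" lookup: one query is scored against every key, the scores are softmaxed, and the output is a weighted sum of the values. The dimensions and random vectors are illustrative assumptions.

```python
import numpy as np

# A "fuzzy" lookup: the query is compared against every key, and the result is
# a weighted average of all the values rather than a single exact match.
d = 4                                    # toy dimensionality (assumption)
rng = np.random.default_rng(0)

q = rng.normal(size=d)                   # one query
K = rng.normal(size=(8, d))              # keys k0..k7
V = rng.normal(size=(8, d))              # values v0..v7

scores = K @ q                           # how well q matches each key
weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the 8 keys
output = weights @ V                     # weighted sum of the values

print(weights.round(3))                  # every key contributes to some degree
print(output)
```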
Recipe for Self-Attention in the Transformer Encoder
▪ Step 1: For each word 𝑥𝑖, calculate its query, key, and value: 𝑞𝑖 = 𝑄𝑥𝑖, 𝑘𝑖 = 𝐾𝑥𝑖, 𝑣𝑖 = 𝑉𝑥𝑖.
• Step 2: Calculate the attention score between each query and each key: 𝑒𝑖𝑗 = 𝑞𝑖⊤𝑘𝑗.
• Step 3: Take the softmax to normalize the attention scores.
• Step 4: Return the sum of the values, weighted by the normalized attention scores.
29
Recipe for (Vectorized) Self-Attention in the Transformer Encoder
▪ Step 1: With embeddings stacked in X, calculate queries, keys, and values.
30
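A minimal NumPy sketch of the recipe from the previous two slides, assuming the word embeddings are stacked as the rows of X and using illustrative projection matrices Wq, Wk, Wv; the √d scaling (introduced later as a training trick) is omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Steps 1-4 for all words at once, with word embeddings stacked as rows of X."""
    Q = X @ Wq                         # queries        (T x d)
    K = X @ Wk                         # keys           (T x d)
    V = X @ Wv                         # values         (T x d)
    scores = Q @ K.T                   # pairwise attention scores (T x T)
    alpha = softmax(scores, axis=-1)   # normalize over the keys
    return alpha @ V                   # weighted sums of values, one output per word

T, d = 5, 16                                   # toy sizes (assumptions)
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (5, 16)
```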
What We Have So Far: (Encoder) Self-Attention!
[Figure: the architecture so far — Input and Output Embeddings feeding Self-Attention blocks in the encoder and decoder]
31
But attention isn't quite all you need!
• Problem: Since there are no element-wise non-linearities, self-
attention is simply performing a re-averaging of the value vectors.
• Easy fix: Apply a feedforward layer to the output of attention, adding the missing element-wise non-linearity.
[Figure: encoder and decoder blocks with a Feed Forward layer added after Self-Attention]
32
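A minimal sketch of such a position-wise feedforward layer, assuming a ReLU non-linearity between two linear maps applied independently to each position's vector; the hidden sizes are illustrative assumptions.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feedforward network: a ReLU non-linearity between two
    # linear layers, applied independently to each position's vector.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d, d_ff = 16, 64                       # hidden sizes are illustrative assumptions
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)

attn_out = rng.normal(size=(5, d))     # stand-in for the self-attention output
print(feed_forward(attn_out, W1, b1, W2, b2).shape)   # (5, 16)
```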
But how do we make this work for deep networks?
33
Training Trick #1: Residual Connections [He et al., 2016]
35
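A one-line sketch of the idea, where `sublayer` is a stand-in for self-attention or the feedforward block: the input is added back to the sublayer's output, giving gradients a direct path through a deep stack of layers.

```python
import numpy as np

def with_residual(sublayer, x):
    # Residual (skip) connection: add the input back to the sublayer's output.
    return x + sublayer(x)

x = np.ones((5, 16))
print(with_residual(lambda h: 0.1 * h, x)[0, 0])   # 1.1
```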
Training Trick #2: Layer Normalization [Ba et al., 2016]
36
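A minimal NumPy sketch of layer normalization in the spirit of Ba et al., 2016: each vector is normalized to zero mean and unit variance across its features, then rescaled and shifted by learned parameters (initialized here to 1 and 0).

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each vector across its features, then rescale/shift with
    # the learned parameters gamma and beta.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

d = 16
x = np.random.default_rng(0).normal(loc=3.0, scale=5.0, size=(5, d))
out = layer_norm(x, gamma=np.ones(d), beta=np.zeros(d))
print(out.mean(axis=-1).round(6), out.std(axis=-1).round(3))   # ~0 means, ~1 stds
```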
Training Trick #3: Scaled Dot Product Attention
• After LayerNorm, the mean and variance of vector elements are 0 and 1, respectively. (Yay!)
• However, the dot product still tends to take on extreme values, as its variance scales with the dimensionality 𝑑𝑘.
• Fix: scale the dot products by 1/√𝑑𝑘 before taking the softmax.
37
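For reference, scaled dot-product attention as defined in Vaswani et al., 2017, which divides the scores by √𝑑𝑘 before the softmax to keep their variance from growing with the key dimensionality:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$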
Major issue!
• We're almost done with the
Encoder, but we have a
major problem! Has anyone
spotted it?
• Consider this sentence:
• "Man eats small dinosaur."
• Wait a minute, order doesn't
impact the network at all!
• This seems wrong given that
word order does have meaning in many languages, including English!
[Figure: Transformer-based encoder-decoder model with the unordered input tokens "Man eats small dinosaur"]
39
Solution: Inject Order Information through Positional Encodings!
[Figure: the Transformer architecture with Positional Encodings added to the Input and Output Embeddings before the encoder and decoder stacks]
40
Fixing the first self-attention problem: sequence order
• Since self-attention doesn’t build in order information, we need to encode the order of the
sentence in our keys, queries, and values.
• Consider representing each sequence index as a vector
𝑝𝑖 ∈ ℝ𝑑 , for 𝑖 ∈ {1,2, … , 𝑇} are position vectors
$$p_i = \begin{bmatrix} \sin(i / 10000^{2 \cdot 1 / d}) \\ \cos(i / 10000^{2 \cdot 1 / d}) \\ \vdots \\ \sin(i / 10000^{2 \cdot \frac{d}{2} / d}) \\ \cos(i / 10000^{2 \cdot \frac{d}{2} / d}) \end{bmatrix}$$

[Figure: heatmap of the sinusoidal position vectors, with the index in the sequence on one axis and the dimension on the other]
• Pros:
• Periodicity indicates that maybe “absolute position” isn’t as important
• Maybe can extrapolate to longer sequences as periods restart
• Cons:
• Not learnable; also the extrapolation doesn’t really work
42 Image: https://timodenk.com/blog/linear-relationships-in-the-transformers-positional-encoding/
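A minimal NumPy sketch of these sinusoidal position vectors (indexing positions and dimension pairs from 0 rather than 1; the sizes are illustrative). The resulting matrix P would simply be added to the input embeddings.

```python
import numpy as np

def sinusoidal_positions(T, d):
    # Position vectors p_i: even dimensions use sin, odd dimensions use cos,
    # with wavelengths growing geometrically from 2*pi up to about 10000*2*pi.
    positions = np.arange(T)[:, None]                 # i = 0..T-1 (column)
    dims = np.arange(0, d, 2)[None, :]                # paired dimension indices 2k
    angles = positions / np.power(10000.0, dims / d)  # i / 10000^(2k/d)
    P = np.zeros((T, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

P = sinusoidal_positions(T=50, d=16)
print(P.shape)          # (50, 16); P would be added to the input embeddings
```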
Extension: Self-Attention w/ Relative Position Encodings
Key Insight: The most salient position information is the relationship (e.g. “cat” is the word before “eat”)
between words, rather than their absolute position (e.g. “cat” is word 2).
44
The Transformer Encoder: Multi-headed Self-Attention
• What if we want to look in multiple places in the sentence at
once?
• For word 𝑖, self-attention “looks” where 𝑥𝑖⊤ 𝑄⊤ 𝐾𝑥𝑗 is high, but
maybe we want to focus on different 𝑗 for different reasons?
• We’ll define multiple attention “heads” through multiple Q,K,V
matrices
• Let 𝑄ℓ, 𝐾ℓ, 𝑉ℓ ∈ ℝ^(𝑑×𝑑/ℎ), where ℎ is the number of attention heads, and ℓ ranges from 1 to ℎ.
46
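A minimal NumPy sketch of multi-headed self-attention under the definition above: the d×d projections are treated as h stacked d×(d/h) head projections, each head attends independently, and the head outputs are concatenated. The output projection used in the full Transformer is omitted, and all sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, n_heads):
    # Wq, Wk, Wv are (d x d); each head uses a (d x d/h) column slice, attends
    # independently over the sequence, and the head outputs are concatenated.
    T, d = X.shape
    d_head = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    outputs = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)   # scaled dot products
        outputs.append(softmax(scores) @ V[:, s])
    return np.concatenate(outputs, axis=-1)              # back to (T x d)

rng = np.random.default_rng(0)
T, d = 5, 16
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(multi_head_self_attention(X, Wq, Wk, Wv, n_heads=4).shape)   # (5, 16)
```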
Decoder: Masked Multi-Head Self-Attention
• Problem: How do we keep the decoder
from “cheating”? If we have a language
modeling objective, can't the network
just look ahead and "see" the answer?
• Solution: Masked Multi-Head
Attention. At a high-level, we hide
(mask) information about future
tokens from the model.
[Figure: Transformer-based encoder-decoder model]
48
Masking the future in self-attention
• To use self-attention in decoders, we need to ensure we can’t peek at the future.
• At every timestep, we could change the set of keys and queries to include only past words. (Inefficient!)
• To enable parallelization, we mask out attention to future words by setting attention scores to −∞:

$$e_{ij} = \begin{cases} q_i^{\top} k_j, & j < i \\ -\infty, & j \ge i \end{cases}$$

[Figure: attention-score matrix for "[START] The chef who" — for encoding each word we can look only at the earlier (not greyed-out) words; scores for the current and future words are set to −∞]
49
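A minimal NumPy sketch of the parallel masking step, following the slide's definition (entries with j ≥ i set to −∞). Note that many practical implementations leave the diagonal (j = i) unmasked so every position can attend to at least itself.

```python
import numpy as np

def masked_self_attention_scores(Q, K):
    # Compute all pairwise scores in parallel, then set entries where j >= i
    # to -inf so that, after the softmax, no probability mass lands on the
    # current or future positions (matching the e_ij definition on the slide).
    T = Q.shape[0]
    scores = Q @ K.T
    future = np.triu(np.ones((T, T), dtype=bool))   # j >= i (upper triangle incl. diagonal)
    scores[future] = -np.inf
    return scores

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
print(masked_self_attention_scores(Q, K))
```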
Decoder: Masked Multi-Headed Self-Attention
[Figure: the decoder's Masked Multi-Head Attention block within the full architecture]
50
Encoder-Decoder Attention
• We saw that self-attention is when the keys, queries, and values all come from the same place.
• In encoder-decoder attention, the keys and values are drawn from the encoder outputs: 𝑘𝑖 = 𝐾ℎ𝑖 , 𝑣𝑖 = 𝑉ℎ𝑖 .
• And the queries are drawn from the decoder: 𝑞𝑖 = 𝑄𝑧𝑖 .
51
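A minimal NumPy sketch of encoder-decoder attention under these definitions, assuming H_enc holds the encoder outputs ℎ𝑖 and Z_dec the decoder states 𝑧𝑖 (shapes and matrices are illustrative assumptions).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(H_enc, Z_dec, Wq, Wk, Wv):
    # Encoder-decoder attention: queries come from the decoder states z_i,
    # while keys and values come from the encoder outputs h_i.
    Q = Z_dec @ Wq
    K = H_enc @ Wk
    V = H_enc @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
d = 16
H_enc = rng.normal(size=(7, d))    # 7 source positions (assumption)
Z_dec = rng.normal(size=(5, d))    # 5 target positions (assumption)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(cross_attention(H_enc, Z_dec, Wq, Wk, Wv).shape)   # (5, 16): one output per decoder position
```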
Decoder: Finishing touches!
• Add a feed forward layer (with residual connections and layer norm)
• Add a final linear layer to project the embeddings into a much longer vector of length vocab size (logits)
• Add a final softmax to turn the logits into a probability distribution over the vocabulary
55
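A minimal NumPy sketch of those last two steps, assuming an illustrative vocabulary size and a hypothetical output matrix W_out: the final decoder states are projected to vocab-size logits and then softmaxed into next-word distributions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d, vocab_size = 16, 50000            # vocab size is an illustrative assumption
rng = np.random.default_rng(0)
W_out = rng.normal(size=(d, vocab_size))        # hypothetical output projection

decoder_states = rng.normal(size=(5, d))        # final decoder outputs, one per position
logits = decoder_states @ W_out                 # project to vocabulary size
probs = softmax(logits)                         # probability distribution over next words
print(probs.shape, probs.sum(axis=-1))          # (5, 50000), each row sums to ~1
```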
Recap of Transformer Architecture
[Figure: the complete Transformer encoder-decoder — Input/Output Embeddings with Positional Encodings, 6 encoder layers (Multi-Head Attention, Feed Forward, Add & Norm), 6 decoder layers (Masked Multi-Head Attention, Multi-Head Attention over the encoder, Feed Forward, Add & Norm), then a final Linear layer and Softmax producing Output Probabilities]
56
Outline
1. Impact of Transformers on NLP (and ML more broadly)
2. From Recurrence (RNNs) to Attention-Based NLP Models
3. Understanding the Transformer Model
4. Drawbacks and Variants of Transformers
57
What would we like to fix about the Transformer?
58
Recent work on improving on quadratic self-attention cost
• Considerable recent work has gone into the question: Can we build models like Transformers without paying the 𝑂(𝑇²) all-pairs self-attention cost?
• For example, Linformer [Wang et al., 2020]
59
Recent work on improving on quadratic self-attention cost
• Considerable recent work has gone into the question: Can we build models like Transformers without paying the 𝑂(𝑇²) all-pairs self-attention cost?
• For example, BigBird [Zaheer et al., 2021]
Key idea: replace all-pairs interactions with a family of other interactions, like local
windows, looking at everything, and random interactions.
60
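A minimal NumPy sketch (not BigBird's actual block-sparse implementation) of building such a sparsity pattern: a local sliding window, a few global tokens that interact with everything, and a handful of random links per row. All parameters are illustrative assumptions.

```python
import numpy as np

def sparse_attention_mask(T, window=2, n_global=1, n_random=2, seed=0):
    # Build a boolean (T x T) mask of which pairs may interact, combining the
    # three interaction families described above: local windows, a few global
    # tokens, and random connections.
    rng = np.random.default_rng(seed)
    mask = np.zeros((T, T), dtype=bool)
    idx = np.arange(T)
    mask[np.abs(idx[:, None] - idx[None, :]) <= window] = True   # local windows
    mask[:n_global, :] = True                                    # global tokens see everything
    mask[:, :n_global] = True                                    # and are seen by everything
    for i in range(T):                                           # random links per row
        mask[i, rng.choice(T, size=n_random, replace=False)] = True
    return mask

mask = sparse_attention_mask(T=16)
print(mask.sum(), "of", 16 * 16, "pairs kept")   # fewer than all T^2 pairs; savings grow with T
```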
Do Transformer Modifications Transfer?
• "Surprisingly, we find that most modifications do not meaningfully improve
performance."
61
Parting remarks
62