13 Pretraining
Pawan Goyal
CS60010
Pretraining for three types of architectures
Decoders (e.g., GPT): pretrained as standard left-to-right language models.
Encoders (e.g., BERT): pretrained with masked language modeling, so each position can use bidirectional context.
Encoder-Decoders (e.g., T5): the encoder sees bidirectional context while the decoder is trained with language modeling over the output.
Next Sentence Prediction (NSP)
Besides masking, BERT is also pretrained on pairs of sentences: using the final-layer [CLS] vector C, a small classifier predicts whether the second sentence actually follows the first:
$y = \mathrm{softmax}(W_{NSP}\, C)$, where $W_{NSP} \in \mathbb{R}^{2 \times d_h}$ and $C \in \mathbb{R}^{d_h}$.
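A minimal PyTorch sketch of this 2-way classifier over the [CLS] vector C (the class name and the hidden size d_h = 768 are illustrative assumptions, not the lecture's code):

```python
import torch
import torch.nn as nn

class NSPHead(nn.Module):
    """Sketch of a next-sentence-prediction head: maps the final-layer
    [CLS] vector C in R^{d_h} to a 2-way distribution (IsNext / NotNext)."""
    def __init__(self, d_h: int = 768):
        super().__init__()
        self.W_nsp = nn.Linear(d_h, 2, bias=False)   # W_NSP in R^{2 x d_h}

    def forward(self, cls_vector: torch.Tensor) -> torch.Tensor:
        # cls_vector: (batch, d_h) -> (batch, 2) probabilities
        return torch.softmax(self.W_nsp(cls_vector), dim=-1)

# Usage with random [CLS] vectors
head = NSPHead()
probs = head(torch.randn(4, 768))    # shape (4, 2)
```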
Why NSP?
Masking focuses on predicting words from surrounding contexts so as to
produce effective word-level representations.
Many applications require modeling the relationship between two sentences, e.g.,
  paraphrase detection (detecting if two sentences have similar meanings),
  entailment (detecting if the meanings of two sentences entail or contradict each other),
  discourse coherence (deciding if two neighboring sentences form a coherent discourse).
Fine-Tuning
Fine-tuning builds applications on top of a pretrained model by adding a small set of application-specific parameters and training them on labeled data from the downstream task.
Typically, this training will either freeze or make only minimal adjustments to the pretrained language model parameters (the learning rate is kept small; freezing is optional).
Fine Tuning for Sequence Classification
With RNNs, we used the hidden layer associated with the final input
element to stand for the entire sequence. In BERT, the [CLS] token plays
the role of sentence embedding.
This unique token is added to the vocabulary and is prepended to the
start of all input sequences.
The output vector $C \in \mathbb{R}^{d_h}$ in the final layer of the model for the [CLS] input serves as the input to a classifier head.
The only new parameters introduced during fine-tuning are the classification layer weights $W_C \in \mathbb{R}^{K \times d_h}$, where K is the number of labels.
$y = \mathrm{softmax}(W_C\, C)$
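A sketch of this setup (the stub encoder, shapes, and hyperparameters are illustrative assumptions; a real run would load a pretrained BERT): only the new weights $W_C$ are trained, while the pretrained parameters stay frozen.

```python
import torch
import torch.nn as nn

d_h, K = 768, 3                       # hidden size, number of labels (e.g. 3 for MultiNLI)

class StubEncoder(nn.Module):
    """Stands in for a pretrained BERT; returns random hidden states so the
    sketch is self-contained (it has no real parameters to freeze)."""
    def forward(self, input_ids):
        return torch.randn(input_ids.shape[0], input_ids.shape[1], d_h)

encoder = StubEncoder()
for p in encoder.parameters():        # with a real model, freeze the pretrained weights
    p.requires_grad = False

W_C = nn.Linear(d_h, K, bias=False)   # the only new parameters: W_C in R^{K x d_h}
optimizer = torch.optim.AdamW(W_C.parameters(), lr=2e-5)   # small learning rate
loss_fn = nn.CrossEntropyLoss()       # applies the softmax in y = softmax(W_C C)

input_ids = torch.randint(0, 30000, (8, 128))   # toy batch of token ids
labels = torch.randint(0, K, (8,))
C = encoder(input_ids)[:, 0, :]       # final-layer [CLS] vector of each example
loss = loss_fn(W_C(C), labels)
loss.backward()
optimizer.step()
```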
Example: MultiNLI
The premise/hypothesis pair is packed into a single sequence, [CLS] a [SEP] b [SEP], and the [CLS] output is classified into entailment, neutral, or contradiction.
Fine Tuning for Sequence Labeling (NER)
[Figure: each input token's output vector is classified into a BIO tag such as B-PER, I-PER, B-ORG, I-ORG, or O.]
Here, the final output vector corresponding to each input token is passed
to a classifier that produces a softmax distribution over the possible set of
tags.
The set of weights to be learned for this additional layer is $W_K \in \mathbb{R}^{k \times d_h}$, where k is the number of tags.
Supervised training data for tasks like named entity recognition (NER) is typically in the form of BIO tags associated with text segmented at the word level.
[Example: a sentence annotated word by word with tags such as B-LOC, I-LOC, and O.]
However, BERT's subword (WordPiece) tokenization splits many words into multiple tokens, so the resulting token sequence does not align with the original word-level tags.
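One common way to handle this misalignment (a sketch with a made-up sentence and a hypothetical `wordpieces` dict, not the lecture's code) is to give each word's tag to its first subword only and mark the remaining subwords so the loss ignores them:

```python
# Toy subword split; a real pipeline would use BERT's WordPiece tokenizer.
wordpieces = {"Sanitas": ["San", "##itas"], "Sunshine": ["Sun", "##shine"]}

words = ["Mt.", "Sanitas", "is", "in", "Sunshine", "Canyon"]
tags  = ["B-LOC", "I-LOC", "O", "O", "B-LOC", "I-LOC"]

tokens, labels = [], []
for word, tag in zip(words, tags):
    pieces = wordpieces.get(word, [word])
    tokens.extend(pieces)
    # tag only the first subword; "IGN" would be mapped to an ignored index
    # (e.g. -100 for PyTorch's CrossEntropyLoss) during training
    labels.extend([tag] + ["IGN"] * (len(pieces) - 1))

print(list(zip(tokens, labels)))
# [('Mt.', 'B-LOC'), ('San', 'I-LOC'), ('##itas', 'IGN'), ('is', 'O'), ...]
```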
Fine-tuning for SQuAD
During fine-tuning, a start vector $S \in \mathbb{R}^{d_h}$ and an end vector $E \in \mathbb{R}^{d_h}$ are introduced; $T_i \in \mathbb{R}^{d_h}$ denotes the final hidden vector for the i-th token of the paragraph.
The probability of word i being the start of the answer span is computed as a dot product between $T_i$ and $S$ followed by a softmax over all of the words in the paragraph:
$P_i = \dfrac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}$
The analogous formula (with E) is used for the end of the answer span.
The score of a candidate span from position i to position j is defined as $S \cdot T_i + E \cdot T_j$, and the maximum-scoring span with $j \ge i$ is used as the prediction.
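A small PyTorch sketch of this span selection (the shapes and random vectors are placeholders; a real setup would take $T_i$, S, and E from the fine-tuned model):

```python
import torch

d_h, M = 768, 50                     # hidden size and paragraph length (assumed)
T = torch.randn(M, d_h)              # T_i: final hidden vectors of paragraph tokens
S = torch.randn(d_h)                 # start vector
E = torch.randn(d_h)                 # end vector

p_start = torch.softmax(T @ S, dim=0)   # P_i = exp(S·T_i) / sum_j exp(S·T_j)
p_end   = torch.softmax(T @ E, dim=0)   # analogous distribution for the end

# Score every candidate span (i, j) as S·T_i + E·T_j, keeping only j >= i.
scores = (T @ S).unsqueeze(1) + (T @ E).unsqueeze(0)        # scores[i, j]
invalid = torch.tril(torch.ones(M, M), diagonal=-1).bool()  # positions with j < i
scores = scores.masked_fill(invalid, float("-inf"))
i, j = divmod(torch.argmax(scores).item(), M)
print(f"predicted answer span: tokens {i}..{j}")
```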
Fine-tuning for span-based applications
[Figure: span-based fine-tuning with start and end vectors over the token output vectors.]
Evaluation: GLUE Benchmark
GLUE?
General Language Understanding Evaluation
Results on GLUE
[Table: results on the GLUE benchmark.]
Pretraining Encoder-Decoders
The encoder portion benefits from bidirectional context; the decoder portion is
used to train the whole model through language modeling.
Span Corruption
Replace different-length spans from the input with unique placeholder (sentinel) tokens; the decoder is then trained to decode out the spans that were corrupted. This is the pretraining objective used by T5.
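A toy sketch of constructing one span-corruption training pair (the sentence, sentinel names, and the simplified non-overlapping span choice are all illustrative assumptions, not T5's exact preprocessing):

```python
import random

def span_corrupt(tokens, span_len=3, n_spans=2, seed=0):
    """Replace spans with sentinels <X>, <Y>, ... in the encoder input and
    emit the removed spans, prefixed by their sentinels, as the decoder target."""
    random.seed(seed)
    tokens = list(tokens)
    sentinels = ["<X>", "<Y>", "<Z>"]
    # pick non-overlapping span starts (simplified: starts on a fixed grid)
    starts = sorted(random.sample(range(0, len(tokens) - span_len, span_len), n_spans))
    corrupted, target, prev = [], [], 0
    for s_idx, start in enumerate(starts):
        corrupted += tokens[prev:start] + [sentinels[s_idx]]
        target += [sentinels[s_idx]] + tokens[start:start + span_len]
        prev = start + span_len
    corrupted += tokens[prev:]
    return corrupted, target

inp, tgt = span_corrupt("thank you for inviting me to your party last week".split())
print("encoder input :", " ".join(inp))
print("decoder target:", " ".join(tgt))
```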
Generative Pretrained Transformer (GPT)
Transformer decoder with 12 layers, 117M parameters.
768-dimensional hidden states, 3072-dimensional feed-forward hidden
layers.
Byte-pair encoding with 40,000 merges
Trained on BooksCorpus: over 7000 unique books.
Contains long spans of contiguous text, for learning long-distance
dependencies.
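A back-of-the-envelope check that these numbers are consistent with the quoted 117M parameters (the vocabulary size of roughly 40,000 and a 512-token context are assumptions; biases, layer norms, and embedding details are ignored):

```python
# Rough parameter count for the GPT configuration above.
d_model, d_ff, n_layers = 768, 3072, 12
vocab, n_positions = 40_000, 512

embeddings = vocab * d_model + n_positions * d_model
per_layer = 4 * d_model * d_model       # Q, K, V and output projections
per_layer += 2 * d_model * d_ff         # two feed-forward matrices
total = embeddings + n_layers * per_layer
print(f"{total / 1e6:.0f}M parameters")  # prints ~116M, close to the quoted 117M
```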