
Pretrained Transformers

Pawan Goyal

CSE, IIT Kharagpur

CS60010



Initial days: pretrained word embeddings

Early approaches pretrained only the word embeddings (e.g., word2vec, GloVe); the rest of each task-specific model was trained from scratch on labeled data for that task.



Now: pretraining whole models

In modern NLP, essentially all parameters are initialized by pretraining the whole model on large amounts of raw text with a self-supervised objective: the training signal comes from the text itself, not from human labels.



What can we learn from reconstructing the input?



Pretraining through Language Modeling: General Paradigm

Train a neural network to model $p_\theta(w_t \mid w_{1:t-1})$ on large amounts of unlabeled text, then save the network parameters as the starting point for downstream tasks.

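To make the objective concrete, here is a minimal sketch of a next-token prediction loss in PyTorch. The toy model (an embedding followed by a vocabulary projection) and all sizes are illustrative assumptions; in practice the model would be a large Transformer, but the loss is the same.

```python
import torch
import torch.nn as nn

# Toy setup (illustrative sizes): predict token t+1 from token t.
vocab_size, d_model = 1000, 64
embed = nn.Embedding(vocab_size, d_model)
to_vocab = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (8, 33))   # a batch of token-id sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift by one position

logits = to_vocab(embed(inputs))                 # (batch, seq, |V|)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()  # pretraining = many such gradient steps over a huge corpus
```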


The Pretraining / Finetuning paradigm

Step 1: Pretrain (on language modeling). Lots of unlabeled text; learn general-purpose parameters.
Step 2: Finetune (on your task). Not many labels; adapt the pretrained parameters to the task.


Stochastic Gradient Descent and Pretrain/Finetune

Why does the pretrain-then-finetune recipe work? One intuition: finetuning starts stochastic gradient descent from the pretrained parameters and uses a small learning rate, so the finetuned solution stays close to that initialization and retains the generalizations learned during pretraining.


Using Transformers for Pretraining

Decoders: language models, e.g., GPT-1/2
Encoders: bidirectional models, e.g., BERT
Encoder-decoders: e.g., T5, BART
Pretraining for three types of architectures





Pretraining Encoders

What would be the objective function?


So far, we’ve looked at language modeling for pretraining.
But encoders get bidirectional context, so we can’t do language modeling!



Solution: Use Masks

Idea: replace a fraction of the input tokens with a special [MASK] token and train the encoder to predict the original tokens at those positions; only the masked positions contribute to the loss. This objective is called masked language modeling (masked LM).
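A minimal sketch of how masked inputs can be constructed (the 15% rate, the [MASK] id, and the ignore index are illustrative, BERT-style assumptions):

```python
import torch

MASK_ID, IGNORE = 103, -100                      # assumed special ids (illustrative)
tokens = torch.randint(1000, 30000, (4, 16))     # a batch of token ids

# Choose ~15% of positions to predict and hide them in the input.
mask = torch.rand(tokens.shape) < 0.15
inputs = tokens.clone()
inputs[mask] = MASK_ID

# Labels keep the original token only at masked positions, so the loss
# (e.g. cross_entropy with ignore_index=IGNORE) is computed only there.
labels = torch.full_like(tokens, IGNORE)
labels[mask] = tokens[mask]
```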


BERT: Bidirectional Encoder Representations from
Transformers

BERT is a Transformer encoder pretrained with masked language modeling: 15% of the input subword tokens are selected for prediction. Of these, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged, so the model cannot rely on always seeing [MASK] where a prediction is required.
BERT uses a WordPiece subword vocabulary (about 30,000 tokens) and was released in two sizes, BERT-base and BERT-large, pretrained on BooksCorpus and English Wikipedia.


BERT: another view

$y_i = \mathrm{softmax}(W_V h_i), \quad W_V \in \mathbb{R}^{|V| \times d_h}, \; h_i \in \mathbb{R}^{d_h}$

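The formula above is a single linear projection of each final hidden state onto the vocabulary, followed by a softmax. A minimal sketch (dimensions are illustrative):

```python
import torch
import torch.nn as nn

d_h, vocab_size = 768, 30000
W_V = nn.Linear(d_h, vocab_size, bias=False)   # W_V in R^{|V| x d_h}

h = torch.randn(4, 16, d_h)                    # final-layer hidden states h_i
y = torch.softmax(W_V(h), dim=-1)              # y_i = softmax(W_V h_i): a distribution over V
# Training uses cross-entropy between y_i and the original token at masked positions.
```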


BERT: Next Sentence Prediction

Given two sentences $S_1$ and $S_2$, the model predicts whether $S_2$ actually follows $S_1$ in the corpus or is a randomly chosen sentence; the prediction is made from the output vector $C$ for the [CLS] token.


$y = \mathrm{softmax}(W_{NSP}\, C), \quad W_{NSP} \in \mathbb{R}^{2 \times d_h}, \; C \in \mathbb{R}^{d_h}$



BERT: Next Sentence Prediction

Why NSP?
Masking focuses on predicting words from surrounding contexts so as to produce effective word-level representations.
Many applications require modeling the relationship between two sentences, e.g.:
paraphrase detection (detecting if two sentences have similar meanings),
entailment (detecting if the meanings of two sentences entail or contradict each other),
discourse coherence (deciding if two neighboring sentences form a coherent discourse).







Using BERT for different tasks
BERT can be fine-tuned for a wide range of tasks: single-sentence and sentence-pair classification, sequence tagging (e.g., named entity recognition with tags such as B-PER, I-PER), and span selection. Because the encoder reads the whole input bidirectionally, it is not well suited to open-ended text generation.



Transfer Learning through Fine-tuning

The power of pretrained language models lies in their ability to extract generalizations from large amounts of text.
To make practical use of these generalizations, we need to create interfaces from these models to downstream applications through a process called fine-tuning.
Fine-tuning facilitates the creation of applications on top of pretrained models through the (possible) addition of a small set of application-specific parameters.
The fine-tuning process consists of using labeled data from the application to train these additional application-specific parameters.
Typically, this training will either freeze or make only minimal adjustments to the pretrained language model parameters (e.g., by keeping the learning rate small); updating the pretrained parameters is optional.
Fine Tuning for Sequence Classification

With RNNs, we used the hidden layer associated with the final input element to stand for the entire sequence. In BERT, the [CLS] token plays the role of a sentence embedding.
This unique token is added to the vocabulary and is prepended to the start of all input sequences, both during pretraining and encoding.
The output vector $C \in \mathbb{R}^{d_h}$ in the final layer of the model for the [CLS] input serves as the input to a classifier head.
The only new parameters introduced during fine-tuning are the classification layer weights $W_C \in \mathbb{R}^{K \times d_h}$, where $K$ is the number of labels.
The pretrained encoder parameters can either be frozen or updated along with the new classifier head.
Fine Tuning for Sequence Classification

$y = \mathrm{softmax}(W_C\, C), \quad W_C \in \mathbb{R}^{K \times d_h}, \; C \in \mathbb{R}^{d_h}$
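A minimal sketch of the classification head over the [CLS] output $C$ (the hidden size, label count, and the random features standing in for the encoder are illustrative assumptions):

```python
import torch
import torch.nn as nn

d_h, K = 768, 3                      # hidden size and number of labels (illustrative)
W_C = nn.Linear(d_h, K)              # the only new parameters added for fine-tuning

C = torch.randn(8, d_h)              # final-layer [CLS] vectors, one per input sequence
y = torch.softmax(W_C(C), dim=-1)    # y = softmax(W_C C)

# Fine-tuning: cross-entropy against gold labels; the encoder is either frozen
# or updated with a small learning rate alongside W_C.
loss = nn.functional.cross_entropy(W_C(C), torch.randint(0, K, (8,)))
```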


Pair-wise Sequence Classification

Example: MultiNLI
Pairs of sentences are given one of three labels: entails, contradicts, or neutral.
These labels describe a relationship between the meaning of the first sentence (the premise) and the second sentence (the hypothesis).
The two sentences are packed into a single input of the form [CLS] premise [SEP] hypothesis [SEP].


Pair-wise Sequence Classification
As with NSP training, the two inputs are separated by a [SEP] token.
As with sequence classification, the output vector associated with the
prepended [CLS] token represents the model’s view of the input pair.
This vector C provides the input to a three-way classifier that can be
trained on the MultiNLI training corpus.

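A small sketch of how a sentence pair can be packed into one BERT-style input; the special-token ids and the example token ids are illustrative assumptions, not values from the slides.

```python
import torch

CLS, SEP = 101, 102                        # assumed [CLS]/[SEP] ids (BERT-style)
premise    = [2023, 3185, 2001, 2307]      # token ids of the premise (illustrative)
hypothesis = [2009, 2001, 2204]            # token ids of the hypothesis

# [CLS] premise [SEP] hypothesis [SEP]; segment ids tell the model which sentence
# each token belongs to, exactly as during NSP pretraining.
input_ids   = torch.tensor([CLS] + premise + [SEP] + hypothesis + [SEP])
segment_ids = torch.tensor([0] * (len(premise) + 2) + [1] * (len(hypothesis) + 1))
# The [CLS] output C then feeds a 3-way classifier (entails / contradicts / neutral).
```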


Sequence Labeling
Each input token is assigned a tag such as B-PER, I-PER, B-ORG, I-ORG, or O.
Here, the final output vector corresponding to each input token is passed
to a classifier that produces a softmax distribution over the possible set of
tags.
The set of weights to be learned for this additional layer is $W_K \in \mathbb{R}^{k \times d_h}$, where $k$ is the number of possible tags for the task.


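A minimal sketch of the per-token tagging head (the tag-set size and hidden size are illustrative; random features stand in for the encoder outputs):

```python
import torch
import torch.nn as nn

d_h, k = 768, 9                             # e.g. a 9-tag BIO scheme (illustrative)
W_K = nn.Linear(d_h, k)                     # W_K in R^{k x d_h}

H = torch.randn(2, 16, d_h)                 # final output vector for every input token
tag_probs = torch.softmax(W_K(H), dim=-1)   # a tag distribution per token
pred_tags = tag_probs.argmax(dim=-1)        # (2, 16) predicted tag ids
```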
POS Tagging



Named Entity Recognition and BIO Scheme

Supervised training data for tasks like named entity recognition (NER) is typically in the form of BIO tags associated with text segmented at the word level. For example (illustrative): Mt/B-LOC Everest/I-LOC is/O in/O Nepal/B-LOC ./O



BIO Scheme with subwords

Word-level BIO tags (e.g., B-LOC, I-LOC) are assigned to whole words before tokenization.
After WordPiece tokenization, a single word may be split into several subword tokens, so the token sequence no longer aligns with the original word-level tags.

Solution: Training and Decoding

Training: we can just assign the gold-standard tag associated with each word to all of the subword tokens derived from it.
Decoding: the simplest approach is to use the argmax BIO tag associated with the first subword token of each word.

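A small sketch of both strategies on a hypothetical word/subword split (the sentence, tags, and split are made up for illustration):

```python
# Hypothetical word-level annotation and WordPiece-style split.
word_tags = [("Jersey", "B-LOC"), ("City", "I-LOC"), ("is", "O"), ("nice", "O")]
subwords  = {"Jersey": ["Jer", "##sey"], "City": ["City"], "is": ["is"], "nice": ["ni", "##ce"]}

# Training: copy each word's gold tag to all of its subword tokens.
train_tokens, train_tags, is_first = [], [], []
for word, tag in word_tags:
    for i, piece in enumerate(subwords[word]):
        train_tokens.append(piece)
        train_tags.append(tag)
        is_first.append(i == 0)

# Decoding: keep only the (argmax) tag predicted for the first subword of each word.
predicted = train_tags                                   # stand-in for model predictions
word_level = [t for t, first in zip(predicted, is_first) if first]
print(word_level)                                        # ['B-LOC', 'I-LOC', 'O', 'O']
```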


Fine-tuning for span-based applications



Fine-tuning for SQuAD

We represent the input question and passage as a single packed sequence (separated by [SEP]).
We only introduce a start vector $S \in \mathbb{R}^{d_h}$ and an end vector $E \in \mathbb{R}^{d_h}$ during fine-tuning.
The probability of word $i$ being the start of the answer span is computed as a dot product between $T_i$ and $S$ followed by a softmax over all of the words in the paragraph:

$P_i = \dfrac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}$

The analogous formula is used for the end of the answer span.
The score of a candidate span from position $i$ to position $j$ is defined as $S \cdot T_i + E \cdot T_j$, and the maximum-scoring span with $j \ge i$ is used as the prediction.
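A minimal sketch of the start/end scoring described above, with random vectors standing in for the fine-tuned model (the sizes and passage length are illustrative):

```python
import torch

d_h, M = 768, 50                         # hidden size and passage length (illustrative)
T = torch.randn(M, d_h)                  # final-layer vectors T_i for the passage tokens
S = torch.randn(d_h)                     # start vector, learned during fine-tuning
E = torch.randn(d_h)                     # end vector, learned during fine-tuning

p_start = torch.softmax(T @ S, dim=0)    # P_i = exp(S.T_i) / sum_j exp(S.T_j)
p_end   = torch.softmax(T @ E, dim=0)

# Span score(i, j) = S.T_i + E.T_j; keep only spans with j >= i and take the best.
scores = (T @ S).unsqueeze(1) + (T @ E).unsqueeze(0)   # scores[i, j]
valid  = torch.triu(torch.ones(M, M)).bool()           # upper triangle: j >= i
best   = scores.masked_fill(~valid, float("-inf")).argmax()
i, j   = divmod(best.item(), M)                        # predicted answer span
```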
Fine-tuning for span-based applications

Formally, given an input sequence $x$ consisting of $T$ tokens, $(x_1, x_2, \ldots, x_T)$, a span is a contiguous sequence of tokens with start $i$ and end $j$ such that $1 \le i \le j \le T$.
This formulation results in a total set of spans equal to $\frac{T(T+1)}{2}$.
For practical purposes, span-based models often impose an application-specific length limit $L$, so the legal spans are limited to those where $j - i < L$.
We'll refer to the enumerated set of legal spans in input $x$ as $S(x)$.
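A tiny sketch of enumerating the legal spans $S(x)$, showing both the $T(T+1)/2$ count and the effect of the length limit $L$ (pure Python, illustrative):

```python
def legal_spans(T, L):
    """All spans (i, j) with 1 <= i <= j <= T and j - i < L (1-based, inclusive)."""
    return [(i, j) for i in range(1, T + 1) for j in range(i, min(i + L, T + 1))]

print(len(legal_spans(10, L=10)))   # 55 == 10 * 11 / 2 (no effective limit)
print(len(legal_spans(10, L=3)))    # 27: only spans of length at most 3
```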
Evaluation: GLUE Benchmark

GLUE: General Language Understanding Evaluation, a suite of sentence- and sentence-pair classification tasks (e.g., natural language inference, sentiment, paraphrase detection) commonly used to evaluate pretrained models.

Results on GLUE

BERT-base and BERT-large substantially outperformed earlier pretrained systems such as OpenAI GPT and ELMo-based models across the GLUE tasks.





Extensions of BERT

A takeaway from the RoBERTa paper


More compute, more data can improve pretraining even when not changing the
underlying Transformer encoder.


RoBERTa, for example, trains the same encoder architecture for longer, on more data, with larger batches, and drops the next-sentence-prediction objective, yielding consistent improvements over BERT.



Pretraining encoder-decoders

What pretraining objective to use?


For encoder-decoders, we could do something like language modeling, but
where a prefix of every input is provided to the encoder and is not predicted.

The encoder portion benefits from bidirectional context; the decoder portion is
used to train the whole model through language modeling.

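A minimal sketch of preparing one such training example: the prefix goes to the encoder (no loss there), and the decoder is trained to predict the remaining tokens. The split point and the start-of-sequence id are illustrative assumptions.

```python
import torch

BOS = 0                                         # assumed start-of-sequence id
tokens = torch.randint(1, 30000, (4, 32))       # a batch of token-id sequences
split = 16                                      # prefix length (illustrative)

encoder_input   = tokens[:, :split]             # seen with bidirectional attention, not predicted
decoder_targets = tokens[:, split:]             # the suffix, predicted token by token
decoder_input   = torch.cat(                    # targets shifted right, starting with BOS
    [torch.full((tokens.size(0), 1), BOS), tokens[:, split:-1]], dim=1
)
```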




T5: A New Training Objective

Span Corruption
Replace different-length spans of the input with unique placeholder (sentinel) tokens; the decoder is trained to decode out the spans that were corrupted.
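A small sketch of span corruption on a toy sentence (the sentence and chosen spans mirror the well-known example from the T5 paper; the sentinel names <X>, <Y>, <Z> follow its convention):

```python
words = "Thank you for inviting me to your party last week".split()

spans = [(2, 4), (8, 9)]           # corrupt "for inviting" and "last" (indices are illustrative)
sentinels = ["<X>", "<Y>"]

inp, tgt, prev = [], [], 0
for (s, e), sent in zip(spans, sentinels):
    inp += words[prev:s] + [sent]  # encoder input: keep the text, drop the span, add a sentinel
    tgt += [sent] + words[s:e]     # decoder target: the sentinel followed by the dropped span
    prev = e
inp += words[prev:]
tgt += ["<Z>"]                     # a final sentinel marks the end of the targets

print(" ".join(inp))   # Thank you <X> me to your party <Y> week
print(" ".join(tgt))   # <X> for inviting <Y> last <Z>
```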


T5 can be used for various tasks

Every task is cast as text-to-text: a short task prefix (e.g., "translate English to German:" or "summarize:") is prepended to the input, and the model generates the answer as a string.



Bi-directional and Auto-Regressive Transformers (BART)

BART is a full encoder-decoder Transformer: a bidirectional encoder (as in BERT) reads a corrupted version of the input, and an autoregressive decoder (as in GPT) is trained to reconstruct the original text.


BART: Transformations for noising the input

Token masking: random tokens are replaced with [MASK] (as in BERT).
Token deletion: random tokens are deleted, and the model must decide which positions are missing.
Text infilling: a span of tokens (possibly of length zero) is replaced by a single [MASK] token, so the model must also predict exactly how many tokens are missing.
Sentence permutation: the sentences of a document are shuffled.
Document rotation: the document is rotated to start at a random token.
BART is especially effective when fine-tuned for generation tasks such as summarization, dialogue, and abstractive question answering, and it can also be adapted for machine translation.
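A small sketch of the text-infilling transformation, where a sampled span (possibly of length zero) is replaced by a single mask token and the decoder must reconstruct the original sentence (the span sampling here is simplified and illustrative):

```python
import random

def text_infill(words, mask="[MASK]"):
    """Replace one random span of 0-3 words with a single mask token."""
    span_len = random.randint(0, 3)
    start = random.randrange(0, len(words) - span_len + 1)
    return words[:start] + [mask] + words[start + span_len:]

random.seed(0)
original = "the quick brown fox jumps over the lazy dog".split()
print(text_infill(original))   # encoder input; the decoder is trained to emit `original`
```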


Pretraining Decoders

It’s natural to pretrain decoders as language models



Generative Pretrained Transformer (GPT)

Transformer decoder with 12 layers, 117M parameters.
768-dimensional hidden states, 3072-dimensional feed-forward hidden
layers.
Byte-pair encoding with 40,000 merges
Trained on BooksCorpus: over 7000 unique books.
Contains long spans of contiguous text, for learning long-distance
dependencies.



Pretrained Decoders

Pretrained decoders such as GPT can then be fine-tuned on downstream tasks or used directly as text generators.
