13 Pretraining
Pawan Goyal
CS60010
Pretraining for three types of architectures
Decoders (e.g., GPT): pretrained as standard left-to-right language models.
Encoders (e.g., BERT): pretrained with masked language modeling, so each position can use bidirectional context.
Encoder-Decoders (e.g., T5): the encoder sees bidirectional context while the decoder is trained with language modeling over the output.
Next Sentence Prediction (NSP)
Besides masking, BERT is also pretrained on pairs of sentences: using the final-layer [CLS] vector C, a small classifier predicts whether the second sentence actually follows the first:
$y = \mathrm{softmax}(W_{NSP}\, C)$, where $W_{NSP} \in \mathbb{R}^{2 \times d_h}$ and $C \in \mathbb{R}^{d_h}$.
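A minimal PyTorch sketch of this 2-way classifier over the [CLS] vector C (the class name and the hidden size d_h = 768 are illustrative assumptions, not the lecture's code):

```python
import torch
import torch.nn as nn

class NSPHead(nn.Module):
    """Sketch of a next-sentence-prediction head: maps the final-layer
    [CLS] vector C in R^{d_h} to a 2-way distribution (IsNext / NotNext)."""
    def __init__(self, d_h: int = 768):
        super().__init__()
        self.W_nsp = nn.Linear(d_h, 2, bias=False)   # W_NSP in R^{2 x d_h}

    def forward(self, cls_vector: torch.Tensor) -> torch.Tensor:
        # cls_vector: (batch, d_h) -> (batch, 2) probabilities
        return torch.softmax(self.W_nsp(cls_vector), dim=-1)

# Usage with random [CLS] vectors
head = NSPHead()
probs = head(torch.randn(4, 768))    # shape (4, 2)
```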
Why NSP?
Masking focuses on predicting words from surrounding contexts so as to
produce effective word-level representations.
Many applications require modeling the relationship between two sentences, e.g.,
  paraphrase detection (detecting if two sentences have similar meanings),
  entailment (detecting if the meanings of two sentences entail or contradict each other),
  discourse coherence (deciding if two neighboring sentences form a coherent discourse).
Fine-Tuning
Fine-tuning builds applications on top of a pretrained model by adding a small set of application-specific parameters and training them on labeled data from the downstream task.
Typically, this training will either freeze or make only minimal adjustments to the pretrained language model parameters (the learning rate is kept small; freezing is optional).
Fine Tuning for Sequence Classification
With RNNs, we used the hidden layer associated with the final input
element to stand for the entire sequence. In BERT, the [CLS] token plays
the role of sentence embedding.
This unique token is added to the vocabulary and is prepended to the
start of all input sequences.
The output vector $C \in \mathbb{R}^{d_h}$ in the final layer of the model for the [CLS] input serves as the input to a classifier head.
The only new parameters introduced during fine-tuning are the classification layer weights $W_C \in \mathbb{R}^{K \times d_h}$, where K is the number of labels.
$y = \mathrm{softmax}(W_C\, C)$
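A sketch of this setup (the stub encoder, shapes, and hyperparameters are illustrative assumptions; a real run would load a pretrained BERT): only the new weights $W_C$ are trained, while the pretrained parameters stay frozen.

```python
import torch
import torch.nn as nn

d_h, K = 768, 3                       # hidden size, number of labels (e.g. 3 for MultiNLI)

class StubEncoder(nn.Module):
    """Stands in for a pretrained BERT; returns random hidden states so the
    sketch is self-contained (it has no real parameters to freeze)."""
    def forward(self, input_ids):
        return torch.randn(input_ids.shape[0], input_ids.shape[1], d_h)

encoder = StubEncoder()
for p in encoder.parameters():        # with a real model, freeze the pretrained weights
    p.requires_grad = False

W_C = nn.Linear(d_h, K, bias=False)   # the only new parameters: W_C in R^{K x d_h}
optimizer = torch.optim.AdamW(W_C.parameters(), lr=2e-5)   # small learning rate
loss_fn = nn.CrossEntropyLoss()       # applies the softmax in y = softmax(W_C C)

input_ids = torch.randint(0, 30000, (8, 128))   # toy batch of token ids
labels = torch.randint(0, K, (8,))
C = encoder(input_ids)[:, 0, :]       # final-layer [CLS] vector of each example
loss = loss_fn(W_C(C), labels)
loss.backward()
optimizer.step()
```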
Example: MultiNLI
The premise/hypothesis pair is packed into a single sequence, [CLS] a [SEP] b [SEP], and the [CLS] output is classified into entailment, neutral, or contradiction.
Fine Tuning for Sequence Labeling (NER)
[Figure: each input token's output vector is classified into a BIO tag such as B-PER, I-PER, B-ORG, I-ORG, or O.]
Here, the final output vector corresponding to each input token is passed
to a classifier that produces a softmax distribution over the possible set of
tags.
The set of weights to be learned for this additional layer is $W_K \in \mathbb{R}^{k \times d_h}$, where k is the number of tags.
Supervised training data for tasks like named entity recognition (NER) is typically in the form of BIO tags associated with text segmented at the word level.
[Example: a sentence annotated word by word with tags such as B-LOC, I-LOC, and O.]
However, BERT's subword (WordPiece) tokenization splits many words into multiple tokens, so the resulting token sequence does not align with the original word-level tags.
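One common way to handle this misalignment (a sketch with a made-up sentence and a hypothetical `wordpieces` dict, not the lecture's code) is to give each word's tag to its first subword only and mark the remaining subwords so the loss ignores them:

```python
# Toy subword split; a real pipeline would use BERT's WordPiece tokenizer.
wordpieces = {"Sanitas": ["San", "##itas"], "Sunshine": ["Sun", "##shine"]}

words = ["Mt.", "Sanitas", "is", "in", "Sunshine", "Canyon"]
tags  = ["B-LOC", "I-LOC", "O", "O", "B-LOC", "I-LOC"]

tokens, labels = [], []
for word, tag in zip(words, tags):
    pieces = wordpieces.get(word, [word])
    tokens.extend(pieces)
    # tag only the first subword; "IGN" would be mapped to an ignored index
    # (e.g. -100 for PyTorch's CrossEntropyLoss) during training
    labels.extend([tag] + ["IGN"] * (len(pieces) - 1))

print(list(zip(tokens, labels)))
# [('Mt.', 'B-LOC'), ('San', 'I-LOC'), ('##itas', 'IGN'), ('is', 'O'), ...]
```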
Fine-tuning for SQuAD
During fine-tuning, a start vector $S \in \mathbb{R}^{d_h}$ and an end vector $E \in \mathbb{R}^{d_h}$ are introduced; $T_i \in \mathbb{R}^{d_h}$ denotes the final hidden vector for the i-th token of the paragraph.
The probability of word i being the start of the answer span is computed as a dot product between $T_i$ and $S$ followed by a softmax over all of the words in the paragraph:
$P_i = \dfrac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}$
The analogous formula (with E) is used for the end of the answer span.
The score of a candidate span from position i to position j is defined as $S \cdot T_i + E \cdot T_j$, and the maximum-scoring span with $j \ge i$ is used as the prediction.
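A small PyTorch sketch of this span selection (the shapes and random vectors are placeholders; a real setup would take $T_i$, S, and E from the fine-tuned model):

```python
import torch

d_h, M = 768, 50                     # hidden size and paragraph length (assumed)
T = torch.randn(M, d_h)              # T_i: final hidden vectors of paragraph tokens
S = torch.randn(d_h)                 # start vector
E = torch.randn(d_h)                 # end vector

p_start = torch.softmax(T @ S, dim=0)   # P_i = exp(S·T_i) / sum_j exp(S·T_j)
p_end   = torch.softmax(T @ E, dim=0)   # analogous distribution for the end

# Score every candidate span (i, j) as S·T_i + E·T_j, keeping only j >= i.
scores = (T @ S).unsqueeze(1) + (T @ E).unsqueeze(0)        # scores[i, j]
invalid = torch.tril(torch.ones(M, M), diagonal=-1).bool()  # positions with j < i
scores = scores.masked_fill(invalid, float("-inf"))
i, j = divmod(torch.argmax(scores).item(), M)
print(f"predicted answer span: tokens {i}..{j}")
```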
Fine-tuning for span-based applications
[Figure: span-based fine-tuning with start and end vectors over the token output vectors.]
Evaluation: GLUE Benchmark
GLUE?
General Language Understanding Evaluation
Results on GLUE
[Table: results on the GLUE benchmark.]
Pretraining Encoder-Decoders
The encoder portion benefits from bidirectional context; the decoder portion is
used to train the whole model through language modeling.
Span Corruption
Replace different-length spans from the input with unique placeholder (sentinel) tokens; the decoder is then trained to decode out the spans that were corrupted. This is the pretraining objective used by T5.
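A toy sketch of constructing one span-corruption training pair (the sentence, sentinel names, and the simplified non-overlapping span choice are all illustrative assumptions, not T5's exact preprocessing):

```python
import random

def span_corrupt(tokens, span_len=3, n_spans=2, seed=0):
    """Replace spans with sentinels <X>, <Y>, ... in the encoder input and
    emit the removed spans, prefixed by their sentinels, as the decoder target."""
    random.seed(seed)
    tokens = list(tokens)
    sentinels = ["<X>", "<Y>", "<Z>"]
    # pick non-overlapping span starts (simplified: starts on a fixed grid)
    starts = sorted(random.sample(range(0, len(tokens) - span_len, span_len), n_spans))
    corrupted, target, prev = [], [], 0
    for s_idx, start in enumerate(starts):
        corrupted += tokens[prev:start] + [sentinels[s_idx]]
        target += [sentinels[s_idx]] + tokens[start:start + span_len]
        prev = start + span_len
    corrupted += tokens[prev:]
    return corrupted, target

inp, tgt = span_corrupt("thank you for inviting me to your party last week".split())
print("encoder input :", " ".join(inp))
print("decoder target:", " ".join(tgt))
```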
Generative Pretrained Transformer (GPT)
Transformer decoder with 12 layers, 117M parameters.
768-dimensional hidden states, 3072-dimensional feed-forward hidden
layers.
Byte-pair encoding with 40,000 merges
Trained on BooksCorpus: over 7000 unique books.
Contains long spans of contiguous text, for learning long-distance
dependencies.
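A back-of-the-envelope check that these numbers are consistent with the quoted 117M parameters (the vocabulary size of roughly 40,000 and a 512-token context are assumptions; biases, layer norms, and embedding details are ignored):

```python
# Rough parameter count for the GPT configuration above.
d_model, d_ff, n_layers = 768, 3072, 12
vocab, n_positions = 40_000, 512

embeddings = vocab * d_model + n_positions * d_model
per_layer = 4 * d_model * d_model       # Q, K, V and output projections
per_layer += 2 * d_model * d_ff         # two feed-forward matrices
total = embeddings + n_layers * per_layer
print(f"{total / 1e6:.0f}M parameters")  # prints ~116M, close to the quoted 117M
```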