11 BERT
Harvard
AC295/CS287r/CSCI E-115B
Chris Tanner
ANNOUNCEMENTS
• HW3 has been released! Due Oct 19 (Tues) @ 11:59pm.
Outline
Transformer Decoder
Learning/Data/Tasks
BERT
BERT Fine-Tuning
Extensions
RECAP: L10
[Figure: the Transformer's stacked Encoders (Encoder Self-Attention + FFNN) feeding the Decoder's Encoder-Decoder Attention; see https://arxiv.org/pdf/1706.03762.pdf]
Machine Translation results: state-of-the-art (at the time)
Outline
Transformer Decoder
Learning/Data/Tasks
BERT
BERT Fine-Tuning
Extensions
Everything we’ve discussed so far
GOALS/TASKS:
• Learn distributed representations
  • word2vec (type-based word embeddings)
  • RNNs/LSTMs (token-based, contextualized)
• Machine Translation
• Text classification
MODELS:
• n-gram (not neural)
• RNNs/LSTMs
• seq2seq
• Transformer Encoder/Decoder
Aside from text classification (e.g., IMDb sentiments), we’ve only worked with unlabelled, naturally occurring data so far. There’s a vast ocean of interesting tasks that require labelled data, and we often perform different types of learning in order to better leverage our limited labelled data.
Types of Data
UNLABELLED (we most often care about this type of data):
• Raw text (e.g., web pages)
• Parallel corpora (e.g., for translations)
LABELLED:
• Linear/unstructured
  • N-to-1 (e.g., sentiment analysis)
  • N-to-N (e.g., POS tagging)
  • N-to-M (e.g., summarization)
• Structured
  • Dependency parse trees
  • Constituency parse trees
  • Semantic Role Labelling
Types of Learning
One axis refers to our style of using/learning our data:
• Multi-task Learning
One axis hinges upon the type of data we have:
• Supervised Learning
• Unsupervised Learning
Types of Learning
• Ideally, your tasks should be closely related (e.g., constituency parsing and
dependency parsing)
• Multi-task learning may help improve the task that has limited data
• General domain → specific domain (e.g., all of the web’s text → law text)
• Unlabelled text → labelled text (e.g., language model → named entity recognition)
Outline
Transformer Decoder
Learning/Data/Tasks
BERT
BERT Fine-Tuning
Extensions
BERT
Naming convention
Many deep learning models, including pre-trained ones with cute names (e.g., ELMo, BERT, ALBERT, GPT-3), refer to an exact combination of: a particular model architecture, the data/inputs it was trained on, and its training objective(s).
BERT
• Model: several Transformer Encoders. Input: a sentence or a sentence pair, with a [CLS] token and subword embeddings.
• BERT has 2 training objectives:
  1. Predict the masked word (a la CBOW): 15% of all input words are randomly masked.
  2. Next Sentence Prediction: predict whether the second sentence actually follows the first (isNext vs. notNext).
[Figure: a stack of Encoders over the input “<CLS> The brown dog ran […] [SEP] Fido came home [SEP]”, predicting the masked word (brown 0.92, lazy 0.05, playful 0.03) and isNext 0.98 / notNext 0.02.]
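The 15% masking can be made concrete with a small sketch. This is a minimal illustration in plain Python, not BERT's exact recipe (the paper additionally replaces some selected tokens with random words or leaves them unchanged rather than always inserting [MASK]); the function name and toy token list are made up for the example.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly corrupt ~15% of the non-special tokens and record the
    positions/original words the model must learn to predict."""
    rng = random.Random(seed)
    corrupted, labels = list(tokens), {}
    for i, tok in enumerate(tokens):
        if tok in SPECIAL_TOKENS:
            continue  # never mask [CLS] or [SEP]
        if rng.random() < mask_prob:
            corrupted[i] = MASK_TOKEN
            labels[i] = tok  # training target at this position
    return corrupted, labels

tokens = "[CLS] the brown dog ran [SEP] fido came home [SEP]".split()
corrupted, labels = mask_tokens(tokens)
print(corrupted)  # e.g., ['[CLS]', 'the', '[MASK]', 'dog', ...]
print(labels)     # e.g., {2: 'brown'}
```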
BERT (alternate view) https://jalammar.github.io/illustrated-bert/
BERT
Source: original BERT paper: https://arxiv.org/pdf/1810.04805.pdf
BERT’s inputs https://arxiv.org/pdf/1810.04805.pdf
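For concreteness, here is a minimal sketch of how such inputs are typically constructed, assuming the HuggingFace transformers library (an assumption for illustration; the slides do not prescribe a particular toolkit), with toy example sentences.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A sentence pair is packed as: [CLS] sentence A [SEP] sentence B [SEP]
enc = tokenizer("The brown dog ran", "Fido came home")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))  # WordPiece tokens incl. [CLS]/[SEP]
print(enc["token_type_ids"])   # segment ids: 0 for sentence A, 1 for sentence B
print(enc["attention_mask"])   # 1 for every real (non-padding) token
```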
Outline
Transformer Decoder
Learning/Data/Tasks
BERT
BERT Fine-Tuning
Extensions
46
Outline
Transformer Decoder
Learning/Data/Tasks
BERT
BERT Fine-Tuning
Extensions
47
https://aclanthology.org/N19-1423.pdf
BERT fine-tuning
BERT
Or, one could extract the contextualized embeddings. Later layers have the best contextualized embeddings.
Picture: https://jalammar.github.io/illustrated-bert/
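A rough sketch of what extracting the contextualized embeddings can look like in code, again assuming the HuggingFace transformers library (an assumption, not something the slides prescribe):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

enc = tokenizer("The brown dog ran", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

# For BERT-base, out.hidden_states is a tuple of 13 tensors (the embedding layer
# plus 12 encoder layers), each of shape (batch, seq_len, 768). Later layers tend
# to give the most useful token representations; summing the last four is one
# common recipe.
token_embeddings = torch.stack(out.hidden_states[-4:]).sum(dim=0)
print(token_embeddings.shape)  # (1, seq_len, 768)
```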
BERT
BERT yields state-of-the-art (SOTA) results on many tasks
Source: original BERT paper: https://arxiv.org/pdf/1810.04805.pdf
BERT (a Transformer variant)
Takeaway: BERT is incredible for learning contextualized embeddings of words and using transfer learning for other tasks (e.g., classification). Typically, one uses BERT’s awesome embeddings to fine-tune toward a different NLP task (this is called Sequential Transfer Learning). It can’t generate new sentences, though, due to having no decoders.
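A minimal sketch of this kind of sequential transfer learning, assuming the HuggingFace transformers library and a toy two-example sentiment batch (illustrative only; the data, labels, and hyperparameters are placeholders, not the setup from the paper):

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# A randomly initialized classification head sits on top of the pre-trained
# encoder; all weights are then updated during fine-tuning.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["great movie!", "terrible movie!"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
out = model(**batch, labels=labels)  # passing labels also returns the cross-entropy loss
out.loss.backward()
optimizer.step()
```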
Outline
Transformer Decoder
Learning/Data/Tasks
BERT
BERT Fine-Tuning
Extensions
Extensions
Transformer-Encoders
• BERT
• ALBERT (A Lite BERT …)
• RoBERTa (A Robustly Optimized BERT …)
• DistilBERT (small BERT)
• ELECTRA (Pre-training Text Encoders as Discriminators not Generators)
• Longformer (Long-Document Transformer)
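In practice these encoder variants expose nearly the same interface, so they can be swapped behind a common loader. A sketch assuming the HuggingFace transformers Auto classes and the commonly published checkpoint names on the Hugging Face Hub (each iteration downloads a model, so this is illustrative rather than something to run casually):

```python
from transformers import AutoModel, AutoTokenizer

# Same calling convention, different pre-trained encoders.
for checkpoint in ["bert-base-uncased", "roberta-base", "distilbert-base-uncased",
                   "albert-base-v2", "google/electra-base-discriminator",
                   "allenai/longformer-base-4096"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    enc = tokenizer("BERT has many descendants.", return_tensors="pt")
    out = model(**enc)
    print(checkpoint, out.last_hidden_state.shape)
```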
Extensions
Autoregressive
• GPT (Generative Pre-training)
• CTRL (Conditional Transformer LM for Controllable Generation)
• Reformer
• XLNet
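Unlike BERT, these autoregressive models can generate text left-to-right. A quick sketch assuming the HuggingFace transformers pipeline API and the small public "gpt2" checkpoint (chosen here only as a convenient stand-in for the family above):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
sample = generator("BERT cannot generate text, but an autoregressive LM",
                   max_length=30, num_return_sequences=1)
print(sample[0]["generated_text"])
```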
ICPC