
Lecture 11: BERT

The Power of Transformer Encoders

Harvard
AC295/CS287r/CSCI E-115B
Chris Tanner
ANNOUNCEMENTS
• HW3 has been released! Due Oct 19 (Tues) @ 11:59pm.

• Research Project Selection (Google Form) is now closed.

• We’re making selections now!

• Read “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings” before Oct 14 (Thurs)

3
Outline

Transformer Decoder

Learning/Data/Tasks

BERT

BERT Fine-Tuning

Extensions

4
RECAP: L10

[Figure: one Transformer Encoder. The inputs x1…x4 (“The brown dog ran”) go through a
self-attention head (producing z vectors), then residual connections + LayerNorm, then a
FFNN, yielding outputs r1…r4.]
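As a quick refresher, a minimal NumPy sketch of one self-attention head (illustrative only; the random inputs and projection sizes are made up, and the residual connections, LayerNorm, and FFNN are omitted):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention_head(X, Wq, Wk, Wv):
    """One self-attention head: every position attends to every position."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n, n) pairwise compatibility scores
    return softmax(scores) @ V                # weighted sum of the value vectors

n, d = 4, 8                                   # e.g., "The brown dog ran", d-dim embeddings x1..x4
X = np.random.randn(n, d)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
Z = self_attention_head(X, Wq, Wk, Wv)        # one contextualized vector (z) per input word
```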
RECAP: L10

[Figure: stacked Transformer Encoders #1–#3. The inputs x1…x4 (“The brown dog ran”) feed
Encoder #1; each encoder feeds the next, producing r1…r4 at the top.]
RECAP: L10

The original Transformer model was intended for Machine Translation, so it had Decoders, too.
Transformer Decoder

[Figure: one Transformer Decoder. The inputs x1…x4 (“<s> El perro marrón”) go through a
masked self-attention head, then residual connections + LayerNorm, then a FFNN, yielding
outputs r1…r4.]
https://jalammar.github.io/illustrated-transformer/
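A minimal sketch of the decoder-side difference (illustrative NumPy, same setup as the head above): the attention scores for “future” positions are masked out before the softmax, so position i can only attend to positions ≤ i.

```python
import numpy as np

def masked_self_attention_head(X, Wq, Wk, Wv):
    """Decoder-style head: position i may only attend to positions <= i."""
    n = X.shape[0]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -1e9   # hide "future" tokens
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ V             # softmax-weighted values
```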
Three ways to Attend

Encoder-Decoder Attention

Encoder Self-Attention

Decoder Masked Self-Attention


https://jalammar.github.io/illustrated-transformer/
Attention is All you Need (2017) https://arxiv.org/pdf/1706.03762.pdf
Loss Function: cross-entropy (predicting translated word)

Training Time: ~4 days on (8) GPUs


n = sequence length
d = length of representation (vector)

Q: Is the complexity of self-attention good?

Important: when learning dependencies between words, you don’t want long paths.
Shorter is better.

Self-attention connects all positions with a constant number of sequentially executed
operations, whereas RNNs require 𝑂(𝑛) sequential operations.

https://arxiv.org/pdf/1706.03762.pdf
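For reference, the comparison behind that claim (Table 1 of the cited paper), with n = sequence length and d = representation size:

• Self-attention: O(n²·d) complexity per layer, O(1) sequential operations, O(1) maximum path length between any two positions

• Recurrent layer: O(n·d²) complexity per layer, O(n) sequential operations, O(n) maximum path length

So self-attention is quadratic in n per layer, but its constant number of sequential steps and constant-length paths between words are what make it attractive.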
Machine Translation results: state-of-the-art (at the time)

Machine Translation results: state-of-the-art (at the time)

You can train it to translate from Language A to Language B.

Then train it to translate from Language B to Language C.

Then, without training, it can translate from Language A to Language C.
• What if we don’t want to decode/translate?

• Just want to perform a particular task (e.g., classification)

• Want even more robust, flexible, rich representation!

• Want to explicitly capture fluency, somehow.


Outline

Transformer Decoder

Learning/Data/Tasks

BERT

BERT Fine-Tuning

Extensions

21
Everything we’ve discussed so far

GOALS/TASKS:
• Learn distributed representations
  • word2vec (type-based word embeddings)
  • RNNs/LSTMs (token-based contextualized)
• Machine Translation
• Text classification

MODELS:
• n-gram (not neural)
• RNNs/LSTMs
• seq2seq
• Transformer Encoder/Decoder
23
Everything we’ve discussed so far

Aside from text classification (e.g., IMDb sentiments), we’ve only worked with
unlabelled, naturally occurring data so far.

There’s a vast ocean of interesting tasks that require labelled data, and we often
perform different types of learning in order to better leverage our limited labelled data.
24
Types of Data

UNLABELLED
• Raw text (e.g., web pages)
• Parallel corpora (e.g., for translations)

LABELLED
• Linear/unstructured
  • N-to-1 (e.g., sentiment analysis)
  • N-to-N (e.g., POS tagging)
  • N-to-M (e.g., summarization)
• Structured
  • Dependency parse trees
  • Constituency parse trees
  • Semantic Role Labelling
25
Types of Data

(Same breakdown as the previous slide, with a callout on the unlabelled column: this is
the type of data we most often work with.)

26
Types of Data

Labelled data is a scarce commodity.

How can we get more of it?

How can we leverage other, more plentiful data (either labelled or unlabelled) so as to
make better use of our limited labelled data?

27
Types of Learning

One axis that refers to our style of using/learning our data:
• Multi-task Learning
• Transfer Learning
• Pre-training

One axis that hinges upon the type of data we have:
• Supervised Learning
• Unsupervised Learning
• Self-supervised Learning
• Semi-supervised Learning
28
Types of Learning

One axis that refers to our style of using/learning our data:

Multi-task Learning = general term for training on multiple tasks

Transfer Learning = type of multi-task learning where we only care about one of the tasks

Pre-training = type of transfer learning where we first focus on one objective

See chalkboard for example


29
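Not from the lecture, but a toy PyTorch sketch of the pre-training → transfer idea: train a shared encoder on a plentiful objective, then reuse (and fine-tune) it for the task we actually care about. All shapes, data, and objectives below are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A small shared "encoder" standing in for a big pre-trained model.
encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU())

# Pre-training: plentiful (here: synthetic) data, an objective we don't ultimately care about.
pretrain_head = nn.Linear(32, 16)
opt = torch.optim.Adam(list(encoder.parameters()) + list(pretrain_head.parameters()), lr=1e-3)
for _ in range(200):
    x = torch.randn(64, 16)
    loss = F.mse_loss(pretrain_head(encoder(x)), x)        # e.g., a reconstruction objective
    opt.zero_grad(); loss.backward(); opt.step()

# Transfer / fine-tuning: a small labelled dataset for the task we actually care about.
clf_head = nn.Linear(32, 2)                                 # e.g., binary classification
opt = torch.optim.Adam(list(encoder.parameters()) + list(clf_head.parameters()), lr=1e-4)
x_small, y_small = torch.randn(32, 16), torch.randint(0, 2, (32,))
for _ in range(50):
    loss = F.cross_entropy(clf_head(encoder(x_small)), y_small)
    opt.zero_grad(); loss.backward(); opt.step()
```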
Multi-task heuristics

• Ideally, your tasks should be closely related (e.g., constituency parsing and
dependency parsing)

• Multi-task learning may help improve the task that has limited data

• General domain → specific domain (e.g., all of the web’s text → law text)

• High-resourced language → low-resourced language (e.g., English → Igbo)

• Unlabelled text → labelled text (e.g., language model → named entity recognition)

Inspired by or based on http://www.phontron.com/class/anlp2021/assets/slides/anlp-08-pretraining.pdf 30


Outline

Transformer Decoder

Learning/Data/Tasks

BERT

BERT Fine-Tuning

Extensions

31
BERT

Bidirectional Encoder Representations from Transformers

33
BERT

Bidirectional Encoder Representations from Transformers

Like Bidirectional LSTMs, let’s look in both directions

34
BERT

Bidirectional Encoder Representations from Transformers

Let’s only use Transformer Encoders, no Decoders

35
BERT

Bidirectional Encoder Representations from Transformers

It’s a language model that builds rich representations

36
Naming convention

Many deep learning models, including pre-trained ones with cute names
(e.g., ELMo, BERT, ALBERT, GPT-3), refer to an exact combination of:

• The model’s architecture

• The training objective to pre-train (e.g., MLM prediction)

• The data (e.g., Google BooksCorpus, Wikipedia)

Many people abuse the terms and swap out components.

37
BERT

[Figure: a stack of Transformer Encoders (#1 … #6) over the input “<CLS> The brown dog”
(x1…x4), predicting the masked word: brown 0.92, lazy 0.05, playful 0.03.]

BERT:

• Model: several Transformer Encoders. Input is a sentence or a sentence pair, with a
  [CLS] token and subword embeddings.

• Objective: MLM and next-sentence prediction

• Data: BooksCorpus and Wikipedia

38
BERT

[Figure: the same encoder stack over “<CLS> The brown dog” (x1…x4), predicting the masked
word: brown 0.92, lazy 0.05, playful 0.03.]

BERT has 2 training objectives:

1. Predict the Masked word (à la CBOW)

15% of all input words are randomly selected as prediction targets. Of those:

• 80% become [MASK]

• 10% are left unchanged (revert back to the original word)

• 10% are deliberately corrupted (replaced with wrong words)

39
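A minimal sketch of that 80/10/10 masking recipe (the vocabulary and tokens are made up; a real implementation works over WordPiece ids):

```python
import random

VOCAB = ["the", "brown", "dog", "ran", "lazy", "playful", "fido", "home"]

def mask_tokens(tokens, mask_prob=0.15):
    """Select ~15% of tokens as prediction targets; of those:
    80% -> [MASK], 10% -> a deliberately wrong word, 10% -> left unchanged."""
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok                      # the model must predict the original word here
            roll = random.random()
            if roll < 0.8:
                inputs[i] = "[MASK]"
            elif roll < 0.9:
                inputs[i] = random.choice(VOCAB)  # corrupted with a random word
            # else: keep the original token as-is
    return inputs, targets

print(mask_tokens(["the", "brown", "dog", "ran"]))
```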
BERT

[Figure: the encoder stack over the paired input “<CLS> The brown dog ran […] [SEP] Fido
came home [SEP]” (x1…x4, …), predicting: isNext 0.98, notNext 0.02.]

BERT has 2 training objectives:

2. Two sentences are fed in at a time. Predict whether the second sentence of the input
truly follows the first one or not.

40
BERT (alternate view) https://jalammar.github.io/illustrated-bert/

41
BERT (alternate view) https://jalammar.github.io/illustrated-bert/

42
BERT

The two sentences are separated by a [SEP] token.

50% of the time, the 2nd sentence is a randomly selected sentence from the corpus.

50% of the time, it truly follows the first sentence in the corpus.

43
Source: original BERT paper: https://arxiv.org/pdf/1810.04805.pdf
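A minimal sketch of how such 50/50 sentence pairs could be constructed (hypothetical helper, operating on a list of sentences in corpus order):

```python
import random

def make_nsp_example(sentences, i):
    """Pair sentence i with its true successor (label isNext) half the time,
    and with a randomly chosen sentence (label notNext) the other half."""
    first = sentences[i]
    if random.random() < 0.5 and i + 1 < len(sentences):
        second, label = sentences[i + 1], "isNext"
    else:
        second, label = random.choice(sentences), "notNext"
    tokens = ["[CLS]"] + first.split() + ["[SEP]"] + second.split() + ["[SEP]"]
    return tokens, label

corpus = ["the brown dog ran", "fido came home", "he was very tired"]
print(make_nsp_example(corpus, 0))
```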
BERT

NOTE: BERT also embeds the inputs by their WordPiece embeddings.

WordPiece is a sub-word tokenization algorithm that learns to merge and use characters
based on which pairs would maximize the likelihood of the training data if added to
the vocab.

44
Source: original BERT paper: https://arxiv.org/pdf/1810.04805.pdf
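To see WordPiece in action, one can load BERT’s released vocabulary; a sketch assuming the Hugging Face transformers package is installed (not part of the lecture):

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")   # ships BERT's ~30k WordPiece vocab
print(tok.tokenize("The brown dog ran home"))
# Words outside the vocabulary get split into sub-word pieces prefixed with "##",
# so a rare word may come out as several "##"-continuation tokens.
```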
BERT’s inputs https://arxiv.org/pdf/1810.04805.pdf

45
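The inputs figure referenced above (from the BERT paper) shows each token’s input representation as the sum of three learned embeddings: token (WordPiece) + segment (sentence A/B) + position. A minimal PyTorch sketch with made-up (hypothetical) token ids:

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768         # bert-base sizes
token_emb    = nn.Embedding(vocab_size, hidden)
segment_emb  = nn.Embedding(2, hidden)                 # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, hidden)           # learned positions (not sinusoidal)

token_ids   = torch.tensor([[101, 1996, 2829, 3899, 102]])   # hypothetical ids for "[CLS] the brown dog [SEP]"
segment_ids = torch.zeros_like(token_ids)                     # all sentence A
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)

inputs = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)  # (1, 5, 768)
```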
Outline

Transformer Decoder

Learning/Data/Tasks

BERT

BERT Fine-Tuning

Extensions

46
https://aclanthology.org/N19-1423.pdf 48
BERT fine-tuning

[Figures on slides 49–52: examples of fine-tuning BERT on downstream tasks.]
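A minimal fine-tuning sketch, assuming the Hugging Face transformers and torch packages (not the lecture’s own code): a randomly initialized classification head goes on top of [CLS], and the whole model is trained end-to-end with a small learning rate.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tok   = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch  = tok(["the movie was great", "the movie was awful"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
out = model(**batch, labels=labels)      # out.loss is the cross-entropy over the 2 labels
out.loss.backward()
optimizer.step()
optimizer.zero_grad()
```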
BERT: Or, one could extract the contextualized embeddings.

53
Picture: https://jalammar.github.io/illustrated-bert/

BERT: Later layers have the best contextualized embeddings.

54
Picture: https://jalammar.github.io/illustrated-bert/
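Alternatively, a feature-extraction sketch (again assuming Hugging Face transformers): run text through a frozen BERT and take hidden states from a later encoder layer as contextualized word features.

```python
import torch
from transformers import BertTokenizer, BertModel

tok   = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

with torch.no_grad():
    out = model(**tok("The brown dog ran", return_tensors="pt"))

# out.hidden_states is a tuple: the embedding layer plus one tensor per encoder layer.
features = out.hidden_states[-2]        # a common heuristic: use a late (but not final) layer
print(features.shape)                    # (1, number_of_wordpieces, 768)
```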
BERT
BERT yields state-of-the-art (SOTA) results on many tasks

55
Source: original BERT paper: https://arxiv.org/pdf/1810.04805.pdf
BERT (a Transformer variant)

Takeaway: BERT is incredible for learning contextualized embeddings of words and using
transfer learning for other tasks (e.g., classification).

Typically, one uses BERT’s awesome embeddings to fine-tune toward a different NLP task
(this is called Sequential Transfer Learning).

It can’t generate new sentences, though, due to having no decoders.

[Figure: stacked Encoders #1–#8 mapping “The brown dog ran” (x1…x4) to contextualized
representations r1…r4.]

56
Outline

Transformer Decoder

Learning/Data/Tasks

BERT

BERT Fine-Tuning

Extensions

57
Extensions

Transformer-Encoders
• BERT
• ALBERT (A Lite BERT …)
• RoBERTa (A Robustly Optimized BERT …)
• DistilBERT (small BERT)
• ELECTRA (Pre-training Text Encoders as Discriminators not Generators)
• Longformer (Long-Document Transformer)

59
Extensions

Autoregressive
• GPT (Generative Pre-training)
• CTRL (Conditional Transformer LM for Controllable Generation)
• Reformer
• XLNet

60
ICPC

61
ICPC

The last International Collegiate Programming Contest has hosted over 60,000 students
from 3,514 universities in 115 countries that span the globe. On October 5, more than
100 teams will compete in logic, mental speed, and strategic thinking at Russia’s main
Manege Central Conference Hall.

62
ICPC

63