NLP_basics
Generative AI and Foundation Models
Spring 2024
Department of Mathematical Sciences
Ernest K. Ryu
Seoul National University
1
Natural language processing (NLP)
Natural language processing (NLP) is concerned with computationally processing natural
(human) languages. The goal is to design and/or train a system that can understand and
process information written in documents.
A natural language or ordinary language is any language that has evolved naturally in humans through use and repetition, without conscious planning or premeditation, such as English or Korean. Natural languages are distinguished from formal and constructed languages such as C, Python, Lojban, and Esperanto.
NLP was once a field that relied on insight into linguistics, but modern NLP is dominated by data-driven, deep-learning-based approaches.
2
Task: Review sentiment analysis
Given a review X ∈ 𝒳 on a reviewing website, decide whether its label Y ∈ 𝒴 = {−1, 0, +1} is negative (−1), neutral (0), or positive (+1).
E.g.,
Review: I hate this movie
Sentiment: Negative
3
Sentiment analysis with BOW
A bag-of-words (BOW) model makes the prediction with a linear combination of the tokenized words. This is a simple baseline.
More generally “bag of words” refers to models that view a sentence as an unordered
collection (bag) of words. Completely disregarding word order is a significant drawback of
the method.
4
Sequence (seq) notation
Let 𝒰 be any set. Define the set of k-tuples of 𝒰 as
𝒰^k = {(u_1, …, u_k) : u_1, …, u_k ∈ 𝒰}.
The Kleene star notation is
𝒰∗ = ⋃_{k≥0} 𝒰^k = {(u_1, …, u_k) : u_1, …, u_k ∈ 𝒰, k ≥ 0}.
Although unimportant in most practical setups, we define the empty sequence ( ) to be a valid sequence of length 0 and write ( ) ∈ 𝒰∗. 5
Characters
Let 𝒞 be a set of “characters”.
• 𝒞 can be the set of English characters, space, and some punctuation.
• 𝒞 can be the set of all unicode characters.
Let 𝒳 = 𝒞∗ be the set of finite-length sequences of characters, i.e., X ∈ 𝒳 is raw text.
6
Tokenization
Given X = (c_1, …, c_T) ∈ 𝒞∗, a tokenizer is a function τ : 𝒞∗ → (ℝ^n)∗ such that
τ(c_1, c_2, …, c_T) = (u_1, u_2, …, u_L)
where u_1, u_2, …, u_L ∈ ℝ^n. T and L are often not the same. Sometimes τ is fixed, and sometimes it is trainable (e.g., word2vec).
7
Character-level tokenizer v.0
Example: 𝒞 = {a, b, …, z, _, ., ?, !} and
τ(X) = τ(c_1, …, c_T) = (τ(c_1), …, τ(c_T))
τ(a) = 1, τ(b) = 2, …, τ(z) = 26, …
So n = 1 and L = T.
8
Character-level tokenizer v.1
Example: 𝒞 = {a, b, …, z, _, ., ?, !} and
τ(X) = τ(c_1, …, c_T) = (τ(c_1), …, τ(c_T))
τ(a) = (1, 0, 0, …, 0)ᵀ, τ(b) = (0, 1, 0, …, 0)ᵀ, …, τ(!) = (0, 0, 0, …, 1)ᵀ
So n = 30 and L = T. The output vectors are called one-hot encodings, as only one element of the encoded vector is nonzero (hot).
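For concreteness, a minimal numpy sketch of this one-hot character-level tokenizer (the exact character set and ordering are illustrative assumptions):
import numpy as np

chars = list("abcdefghijklmnopqrstuvwxyz") + ["_", ".", "?", "!"]   # |C| = 30
index = {c: i for i, c in enumerate(chars)}

def tokenize(text):
    # Returns an (L, 30) array whose rows are the one-hot vectors; here L = T.
    out = np.zeros((len(text), len(chars)))
    for t, c in enumerate(text):
        out[t, index[c]] = 1.0
    return out

tokenize("i_hate_this_movie!")   # shape (18, 30)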
9
Word-level tokenizer
Examples: 𝒞 = {a, b, …, z, _} (so English letters and space) and 𝒲 = {English words}
τ(X) = τ(c_1, …, c_T) = τ(w_1, …, w_L) = (τ(w_1), …, τ(w_L))
τ(‘aardvark’) = (1, 0, 0, …, 0)ᵀ, τ(‘ability’) = (0, 1, 0, …, 0)ᵀ, …, τ(‘Zyzzyva’) = (0, 0, 0, …, 1)ᵀ, …
where w_1, …, w_L ∈ 𝒲. So n = |𝒲| = size of dictionary and L ≤ T.
10
End-of-string (EOS) token
Given X ∈ 𝒳 and its length 0 ≤ T < ∞, we equivalently consider a special “end-of-string” token <EOS> to be the final, (T+1)-th, element. In other words,
X = (c_1, c_2, …, c_T) = (c_1, c_2, …, c_T, <EOS>)
for any X ∈ 𝒳, where c_1, …, c_T ∈ 𝒞.
11
Discussion on tokenizers
Q) Why tokenizers?
A) Neural networks perform arithmetic on vectors and numbers, so tokenizers convert text
into a sequence of vectors.
12
Discussion on tokenizers
Q) Advantage of a word-level tokenizer over a character-level tokenizer?
A) A shorter tokenized sequence, and the use of a dictionary. (The model need not learn words from scratch.)
13
Basic BOW implementation
Let 𝜏 be a word-level tokenizer with dictionary 𝒲.
f_θ(X) = b + a ⋅ Σ_{ℓ=1}^L τ(w_ℓ) = b + a ⋅ Σ_{ℓ=1}^L [τ(X)]_ℓ,
where the trainable parameters are θ = (a, b) with a ∈ ℝ^{|𝒲|} and b ∈ ℝ.
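A minimal sketch of this baseline (the vocabulary and weights below are illustrative assumptions, not trained values):
import numpy as np

vocab = {"i": 0, "hate": 1, "love": 2, "this": 3, "movie": 4}   # toy dictionary W
a = np.zeros(len(vocab)); a[vocab["hate"]] = -2.0; a[vocab["love"]] = 2.0
b = 0.0

def bow_score(text):
    counts = np.zeros(len(vocab))            # sum of one-hot word vectors tau(w_l)
    for w in text.lower().split():
        if w in vocab:
            counts[vocab[w]] += 1.0
    return b + a @ counts                    # f_theta(X) = b + a . sum_l tau(w_l)

bow_score("I hate this movie")               # negative score -> negative sentiment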
14
Sentiment analysis with DNN
Modern state-of-the-art NLP methods are based on deep neural networks (DNN).
15
Task: Language model (LM)
A language model (LM) accomplishes one or both of the following goals:
• assign a probability (likelihood) to a given sequence of text, and
• generate new text.
(This definition excludes encoder-only transformer models such as BERT from language models, but we will not be overly concerned with these definitions.) 16
Applications of LM: Voice-to-text
In a voice-to-text system, two interpretations can be auditorily ambiguous but semantically
not ambiguous. An LM can determine which interpretation is more likely.
17
Applications of LM: Autocomplete
An autocomplete system can assist writing by suggesting likely completions of a sentence.
(Figure: an email client suggesting completions for a draft titled “Meeting Arrangement” that begins “Dear professor,”.)
18
Applications of LM: SSL pre-training and
universal interface
Training an NN to be a language model is a useful pretext task in the sense of self-supervised learning (SSL) and transfer learning.
Pre-trained language models serve as foundation models that can be fine-tuned for other downstream tasks.
• More on this when we talk about ELMo, BERT, and GPT
19
Probabilities with sequences
Assume a sequence
(u_1, u_2, …, u_L) = (u_1, u_2, …, u_L, <EOS>) ∈ 𝒰∗
is generated randomly, i.e., we can assign a probability
ℙ(u_1, u_2, …, u_L, <EOS>) ∈ [0, 1].
20
Probability notation with <EOS>
Clarification) Given 0 ≤ L < ∞ and u_1, u_2, …, u_L ∈ 𝒰,
ℙ(u_1, u_2, …, u_L, <EOS>) is the probability that a random sequence in 𝒰∗ has values u_1, u_2, …, u_L for the first L elements and then terminates, i.e., u_{L+1} = <EOS>.
ℙ(u_1, u_2, …, u_L) is the probability that a random sequence in 𝒰∗ has values u_1, u_2, …, u_L for the first L elements (and none of them are <EOS>), while u_{L+1} may or may not be <EOS>. In particular,
ℙ(u_1, …, u_L, <EOS>) ≤ ℙ(u_1, …, u_L).
21
Conditional probabilities with sequences
With the chain rule (conditional probability), we have
ℙ(u_1, …, u_L, <EOS>) = ℙ(u_1) ℙ(u_2 | u_1) ⋯ ℙ(u_L | u_1, …, u_{L−1}) ℙ(<EOS> | u_1, …, u_L).
To clarify, we have made no assumptions on the sequence probabilities. (We have not assumed that anything is Markov or that anything is independent.)
22
Cond. prob. with continuous sequences
If the sequence elements u_t are continuous random variables, then we need density functions instead of discrete probability mass functions. However, the calculations are essentially the same, so we do not repeat them. (Measure-theoretic probability theory unifies the two analyses.)
For image patches (vision transformers), seq elements are (essentially) continuous.
23
Autoregressive (AR) modelling
An autoregressive model of a sequence learns to predict u_ℓ given the past observations u_1, …, u_{ℓ−1}. The goal is to learn a model f_θ that approximates the full conditional distribution
f_θ(u_ℓ; u_1, …, u_{ℓ−1}) ≈ ℙ(u_ℓ | u_1, …, u_{ℓ−1}).
24
Sequence likelihood with AR model
Given a trained autoregressive model f_θ(u_ℓ; u_1, …, u_{ℓ−1}) ≈ ℙ(u_ℓ | u_1, …, u_{ℓ−1}), we can (approximately) compute the likelihood of a sequence (u_1, …, u_L) with
ℙ(u_1, …, u_L, <EOS>) ≈ ∏_{ℓ=1}^{L+1} f_θ(u_ℓ; u_1, …, u_{ℓ−1}),   where u_{L+1} = <EOS>.
25
Sequence generation with AR model
Given a trained autoregressive model f_θ(u_ℓ; u_1, …, u_{ℓ−1}) ≈ ℙ(u_ℓ | u_1, …, u_{ℓ−1}) and an un-terminated sequence (u_1, …, u_{ℓ−1}) (if ℓ = 1, then start generation from nothing), we can generate
(u_1, …, u_{ℓ−1}, u_ℓ, …, u_L) ∼ ℙ(u_ℓ, …, u_L, u_{L+1} = <EOS> | u_1, …, u_{ℓ−1})
by sampling u_t ∼ f_θ( · ; u_1, …, u_{t−1}) for t = ℓ, ℓ+1, … until <EOS> is sampled, which is justified by the chain rule
ℙ(u_ℓ, …, u_L, <EOS> | u_1, …, u_{ℓ−1}) = ∏_{t=ℓ}^{L+1} ℙ(u_t | u_1, …, u_{t−1})   with u_{L+1} = <EOS>.
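A minimal sketch of this sampling loop; `model` is an assumed callable returning the (approximate) next-token distribution f_θ( · ; u_1, …, u_{t−1}) as a probability vector, with one index reserved for <EOS>:
import numpy as np

def generate(model, prefix, eos_id, max_len=1000, rng=np.random.default_rng()):
    seq = list(prefix)                       # the un-terminated sequence u_1, ..., u_{l-1}
    while len(seq) < max_len:
        p = model(seq)                       # approximates P(u_t | u_1, ..., u_{t-1})
        u = int(rng.choice(len(p), p=p))     # sample the next token
        seq.append(u)
        if u == eos_id:                      # stop once <EOS> is sampled
            break
    return seq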
26
Modern NLP and sequence processing
Modern NLP solves various tasks, especially language modelling, with deep neural networks.
Why still learn RNNs? Although transformers have been replacing RNNs and CNNs in recent years, RNNs and CNNs are not yet obsolete. Also, much of the architecture design of transformers is inspired by practices inherited from the RNN era. One still needs to know RNNs to fully understand modern NLP.
27
Learning with variable-size inputs
In image classification, the input 𝑋 ∈ ℝ3×𝑛×𝑚 is of fixed size and processed by a deep CNN.
We now want to process variable-size input 𝑋 ∈ 𝒞 ∗ with a neural network.
28
Process one input per layer
The RNN applies the same recurrent function at every step, h_ℓ = q_θ̃(h_{ℓ−1}, u_ℓ), and produces the output with the same A and b: f_θ(X) = A h_L + b.
Consider backpropagation on ℒ(f_θ(X), Y):
32
Backprop for RNN
Next, compute ∂h_L/∂θ̃:
33
Backprop for RNN
Translate the calculation to ∂ℒ/∂θ̃:
P. J. Werbos, Generalization of backpropagation with application to a recurrent gas market model, Neural Networks, 1988. 34
Backprop code for RNN
# Forward pass given τ(X)=(u[1],...,u[L])
h[0] = 0
for l = 1,2,...,L:
    h[l] = q(th,h[l-1],u[l])   # h_l = q_θ̃(h_{l-1}, u_l)
fX = A @ h[L] + b              # f_θ(X) = A h_L + b
ell = loss(fX,Y)               # ℒ(f_θ(X), Y)
# Backward pass
dldf = loss.df(fX,Y)           # ∂ℒ/∂f
dldA = dldf @ h[L]             # ∂ℒ/∂A ✓
dldb = dldf                    # ∂ℒ/∂b ✓
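The slide leaves q_θ̃ generic. For concreteness, here is a minimal numpy sketch of the full backward pass through time for the specific (assumed) vanilla cell h_ℓ = tanh(W h_{ℓ−1} + U u_ℓ), so θ̃ = (W, U):
import numpy as np

def rnn_backprop(W, U, A, b, us, Y, dloss):
    # us: list of input vectors u[1..L]; dloss(fX, Y) returns dL/df.
    L, d = len(us), W.shape[0]
    h = [np.zeros(d)]
    for l in range(1, L + 1):                      # forward pass
        h.append(np.tanh(W @ h[l-1] + U @ us[l-1]))
    fX = A @ h[L] + b
    dldf = dloss(fX, Y)                            # dL/df
    dldA, dldb = np.outer(dldf, h[L]), dldf        # dL/dA, dL/db
    dldh = A.T @ dldf                              # dL/dh_L
    dldW, dldU = np.zeros_like(W), np.zeros_like(U)
    for l in range(L, 0, -1):                      # backward pass through time
        dpre = (1.0 - h[l] ** 2) * dldh            # tanh'(pre-activation) * dL/dh_l
        dldW += np.outer(dpre, h[l-1])
        dldU += np.outer(dpre, us[l-1])
        dldh = W.T @ dpre                          # propagate the gradient to h_{l-1}
    return dldA, dldb, dldW, dldU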
Y. Bengio, P. Simard, and P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks, 1994. 36
Exploding gradients and gradient clipping
The exploding gradient problem occurs when the gradient magnitude is very large. Exploding gradients imply the output is very sensitive to small changes of the parameters in a certain direction. A common remedy is gradient clipping: if the gradient norm exceeds a threshold, rescale the gradient down to that threshold. Sometimes, such gradients are unworkable and the neural network architecture must be changed.
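A minimal sketch of gradient clipping by norm (the threshold value is an illustrative assumption; deep-learning frameworks provide equivalent utilities, e.g., torch.nn.utils.clip_grad_norm_ in PyTorch):
import numpy as np

def clip_gradient(grad, max_norm=1.0):
    # Rescale the gradient if its norm exceeds max_norm; otherwise leave it unchanged.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad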
37
Vanishing gradients
The vanishing gradient problem occurs when the magnitude of a gradient is very small.
Intuitively, vanishing gradients means the gradient signal does not reach the earlier layers.
In an RNN, for example, ∂ℒ/∂h_L may not be small, but
∂ℒ/∂h_ℓ = (∂h_L/∂h_ℓ)ᵀ ∂ℒ/∂h_L
can be small.
This means (small) changes in h_ℓ do not affect the output ℒ. Since h_ℓ is computed from u_ℓ, this further implies that (small) changes in u_ℓ do not affect ℒ. We can intuitively understand this as the RNN not utilizing the information of u_ℓ, i.e., the RNN does not remember u_ℓ at step L. (Although this argument is not precisely correct, since large changes in u_ℓ may affect ℒ.) In any case, the gradient signal from far away at time L is lost, and the model can't learn what information to preserve at time ℓ. 38
Promoting better gradient flow
As an example, consider the Jacobian ∂h_{ℓ+1}/∂h_ℓ.
If ∂h_{ℓ+1}/∂h_ℓ ≈ 0, then ∂ℒ/∂h_ℓ ≈ 0, and we say the gradient does not flow well through the layer h_{ℓ+1}; any information contained in h_ℓ is lost.
39
Promoting better gradient flow
So then, do we always want good gradient flow? Do we always want ∂h_{ℓ+1}/∂h_ℓ ≈ I?
Solution) Design a “neural circuit” that explicitly controls when to remember information and when to forget information.
(Figure: cell with a “cell state” and a “forget gate”.)
40
LSTM cells
c_ℓ, f̄_ℓ, ī_ℓ, ḡ_ℓ, ō_ℓ have the same dimension as h_ℓ.
The long short-term memory (LSTM) cell has an intricate and somewhat arbitrary structure.
(Figure: LSTM cell with a “cell state” and a “forget gate”.)
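For reference, a minimal numpy sketch of one step of the standard LSTM cell (the exact parameterization in the slide's figure may differ in minor details):
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(h_prev, c_prev, u, Wf, Wi, Wg, Wo, bf, bi, bg, bo):
    z = np.concatenate([h_prev, u])    # recurrent input: previous hidden state and u_l
    f = sigmoid(Wf @ z + bf)           # forget gate: what to erase from the cell state
    i = sigmoid(Wi @ z + bi)           # input gate: what to write to the cell state
    g = np.tanh(Wg @ z + bg)           # candidate cell update
    o = sigmoid(Wo @ z + bo)           # output gate
    c = f * c_prev + i * g             # cell state: explicit remember/forget control
    h = o * np.tanh(c)                 # hidden state
    return h, c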
43
LSTM name meaning
To clarify, “long short-term memory” does not mean long-term & short-term memory.
Rather, it means that the cell state serves as a longer short-term memory. In contrast, a naïve RNN (one that uses an MLP rather than an LSTM cell as the recurrent function) would have a much shorter short-term memory.
A true long-term memory would correspond to some external storage, which an LSTM RNN doesn't have. (In fact, no mainstream NLP system currently uses a long-term memory.)
44
Aside: Exploding/vanishing gradient
problem
The exploding/vanishing gradient problem is a problem not just for RNNs. It can be a problem for all deep neural networks.
The ResNet architecture, and more generally the use of residual connections, is one approach to mitigate the exploding/vanishing gradient problem.
T. Cooijmans, N. Ballas, C. Laurent, C. Gülçehre, and A. Courville, Recurrent batch normalization, ICLR, 2017. 45
Sentiment analysis
with LSTM
The output hidden state
can be used for the single
(non-sequence) output.
46
Sentiment analysis
with LSTM
Pooling all of the hidden states often performs better than using only the last one for learning a single (non-sequence) output.
47
Stacked RNN
48
Example task:
Parts of speech
tagging
For some RNN tasks, the output
is a sequence, and the total loss
is the sum of the losses incurred
at each sequence term.
49
Bidirectional RNN
(Unidirectional) RNNs process information
forward in time. In language, however, it is
common for later words to provide necessary
context for understanding a previous word.
51
RNN-LM
The RNN language model (RNN-LM) is trained as an autoregressive model with the
following structure.
T. Mikolov, S. Kombrink, L. Burget, J. H. Černocký, and S. Khudanpur, Extensions of recurrent neural network language model, ICASSP, 2011. 52
LM loss
Let us interpret the loss
53
LSTM with output projection
Sometimes, you want the LSTM to output a large hidden state while maintaining a reasonably-sized internal computation. In this variant, c_ℓ, f̄_ℓ, ī_ℓ, ḡ_ℓ, ō_ℓ have the same dimension, while h_ℓ has a different (usually larger) dimension.
(Figure: LSTM cell with an output projection, “cell state”, and “forget gate”.)
(In LMs, the output size can be the vocabulary size or the number of possible tokens with byte-pair encoding; both are large.)
H. Sak, A. W. Senior, and F. Beaufays, Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition, arXiv, 2014. 54
Backprop with RNN?
In this RNN-LM, the output of LSTM goes into two blocks, so the backprop computation
should be the sum of the two contributions.
55
Trainable tokenizer
The tokenizer is the first contact between language and our algorithm.
Instead of using one-hot encodings, which are fixed (given a dictionary), it is better to have some trainable component in the tokenizer.
Currently, byte-pair encoding has become the standard choice, but we shall consider the
historical context.
56
What does a word mean?
Denotational semantics: A word is the collection of the objects it describes.
This is the intuitive and straightforward view of the meaning of words, but it is not a very
actionable definition.
57
Word2vec
Train a model such that a word predicts a neighboring word.
T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient estimation of word representations in vector space, ICLR Workshop, 2013.
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, Distributed representations of words and phrases and their compositionality, NeurIPS, 2013. 58
NeurIPS 2023 Test of Time Award
Word2vec
Specifically, we minimize the loss function
ℒ(θ) = − Σ_{ℓ=1}^L Σ_{−m≤k≤m, k≠0} log p_θ(w_{ℓ+k} | w_ℓ),
where the conditional probability is parameterized with the softmax
p_θ(o | c) = exp(u_o ⋅ v_c) / Σ_{w∈𝒲} exp(u_w ⋅ v_c).
Two vectors per word, v_w for w as the center word and u_w for w as a context word, are used because a word is not likely to appear in its own context, so you would want to minimize the probability p(w | w). If you use the same vector for w as context as for w as center word, it is more awkward to directly minimize p(w | w). However, using the same vector also makes sense and should work.
59
Word2vec with negative sampling
However, the full softmax model is intractable. So the practical implementation of word2vec considers an alternate loss ℒ̂(θ):
60
Word2vec with negative sampling
ℒ̂(θ) = − log σ(u_{w_{ℓ+k}} ⋅ v_{w_ℓ}) − Σ_{i=1}^τ log σ(−u_{j_i} ⋅ v_{w_ℓ}),   where σ(x) = 1/(1 + e^{−x})
and j_1, …, j_τ are randomly sampled “negative” words.
Interpretation of minimizing ℒ̂:
• u_{w_{ℓ+k}} ⋅ v_{w_ℓ} should be large; neighboring words should have aligned vectors.
• u_{j_i} ⋅ v_{w_ℓ} should be small; with high probability, j_i and w_ℓ will be non-neighboring words, and they should have un-aligned (u_{j_i} ⋅ v_{w_ℓ} ≈ 0) or negatively aligned (u_{j_i} ⋅ v_{w_ℓ} < 0) vectors.
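A minimal numpy sketch of this loss for one (center, context) pair; negative words are drawn uniformly here as a simplifying assumption (word2vec uses a unigram^(3/4) noise distribution):
import numpy as np

def neg_sampling_loss(V, U, center, context, tau=5, rng=np.random.default_rng()):
    # V, U: (vocab_size, dim) arrays of center-word and context-word vectors.
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    v_c = V[center]                                   # v_{w_l}
    u_o = U[context]                                  # u_{w_{l+k}}
    neg = rng.integers(0, U.shape[0], size=tau)       # tau sampled "negative" words j_i
    loss = -np.log(sigmoid(u_o @ v_c))                # push neighboring vectors together
    loss -= np.sum(np.log(sigmoid(-U[neg] @ v_c)))    # push non-neighboring vectors apart
    return loss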
61
Word2vec: Trained word-level tokenizer
The result of word2vec is a trained word-level tokenizer:
Given input 𝑋 = 𝑤1 , … , 𝑤𝐿 chunked into words 𝑤1 , … , 𝑤𝐿 ∈ 𝒲, the trained tokenizer 𝜏𝜃
outputs the corresponding 𝑣-vectors (or the 𝑢-vectors).
Downside: The word-level tokenizer τ_θ(w_ℓ) does not take into account the context in which the word w_ℓ is used (cf. polysemy).
T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient estimation of word representations in vector space, ICLR Workshop, 2013.
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, Distributed representations of words and phrases and their compositionality, NeurIPS, 2013. 62
ELMo
Embeddings from Language Models (ELMo) is an in-context tokenizer: it produces word representations in the context of the entire sentence.
It uses a bidirectional LSTM structure. The states of the RNNs are hidden states, but they can also be considered tokenized values of the given words.
M. E. Peters, W. Ammar, C. Bhagavatula, and R. Power, Semi-supervised sequence tagging with bidirectional language models, ACL, 2017.
M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, Deep contextualized word representations, NAACL, 2018. 63
Bidirectional LM loss for pre-training
Pre-training uses the bidirectional language-model loss
ℒ(θ) = − Σ_{ℓ=1}^L [ log p_θ(u_ℓ | u_1, …, u_{ℓ−1}) + log p_θ(u_ℓ | u_{ℓ+1}, …, u_L) ],
i.e., the sum of a forward (left-to-right) and a backward (right-to-left) causal LM loss.
64
Non-causal language model
Causal language models learn
𝑓𝜃 𝑢ℓ ; 𝑢1 , … , 𝑢ℓ−1 ≈ ℙ 𝑢ℓ 𝑢1 , … , 𝑢ℓ−1
i.e., the LM learns to predict the next token left-to-right.
ELMo and BERT are not causal language models. (Half of ELMo is a causal language model, but that is not the point.) ELMo and BERT can understand language and solve many NLP tasks, but they cannot generate text.
65
ELMo fine-tuning
Given a prior NLP method (which can be very specialized and tailored to the specific task) that takes in (x_ℓ)_{ℓ=1}^L, replace the input (x_ℓ)_{ℓ=1}^L with (x̃_ℓ)_{ℓ=1}^L, where
x̃_ℓ = Σ_{k=0}^K s_k^task h_{ℓ,k},
where K is the depth of the LSTM RNN, k = 0 corresponds to the tokenization layer, h_{ℓ,k} is the layer-k hidden state at position ℓ, and the (s_k^task)_{k=0}^K are the task-specific trainable parameters. (The sum is over the LSTM depth.)
Then, train the entire pipeline, including the ELMo weights, (s_k^task)_{k=0}^K, and the weights of the NLP method, on labeled fine-tuning data.
66
Results
ELMo achieves state-of-the-art performance on a wide range of NLP tasks.
• Question answering
• Textual entailment (determining whether a “hypothesis” is true, given a “premise”)
• Semantic role labeling (Answers “Who did what to whom”)
• Coreference resolution (clustering mentions in text that refer to the same underlying real
world entities)
• Named entity extraction (finding four types of named entities (PER, LOC, ORG, MISC) in
news articles)
• Sentiment analysis (whether paragraph is positive or negative)
67
Discussion of ELMo
Although the idea of semi-supervised learning through large-scale pre-training and fine-tuning was not new (Dai and Le 2015), ELMo executed it very well and advanced the state of the art substantially.
However, the LSTM RNN is not the best architecture. The left- and right-directional RNNs each process information only unidirectionally. What if the model needs to examine the entire sentence to make an inference? Also, RNNs are fundamentally computationally inefficient.
The overall approach is still not universal; each task needs a tailored method, and ELMo only serves to provide better tokenization.
J. Devlin, M. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, NAACL, 2019. 69
Transformers
Transformer architectures are sequence-to-sequence models. They “transform” a sequence
to another sequence in each layer.
70
Encoder-only transformer
The transformer architecture relies on the following components
• Token embeddings
• Multi-head self-attention
• Residual connections
• Layer Normalization
• Position-wise FFN
• Positional encodings
71
Token embeddings
Let 𝜏 be a tokenizer such that
𝜏 𝑋 = 𝑢1 , 𝑢2 , … , 𝑢𝐿
where we can think of u_i = k ∈ {1, …, n} or, equivalently, u_i = e_k ∈ ℝ^n. Here, e_k denotes the k-th unit (one-hot) vector, and n denotes the number of distinct tokens. A word-level tokenizer is one of the simple instances of this. The (trainable) token embedding then maps each u_ℓ = e_k to an embedding vector x_ℓ = E u_ℓ ∈ ℝ^d, i.e., the k-th column of a trainable embedding matrix E ∈ ℝ^{d×n}.
73
Attention is a pseudo-linear operation
Functions f : ℝ^n → ℝ^m of the form
f(x) = A(x) x
are said to be “pseudo-linear”. (They are not linear because the matrix A(x) ∈ ℝ^{m×n} itself depends on x.)
74
Multi-head self attention (MHA)
Just as one uses multiple CNN channels, we use multiple attention heads.
Seq-to-seq transformation (x_ℓ)_{ℓ=1}^L ↦ (z_ℓ)_{ℓ=1}^L. Often d_X = d_Z is required by the residual connection.
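A minimal numpy sketch of one attention head (a multi-head layer runs several such heads in parallel with different weight matrices and concatenates the outputs); the weight matrices here are assumed to be given:
import numpy as np

def self_attention_head(X, Wq, Wk, Wv):
    # X: (L, d_X) sequence; Wq, Wk: (d_X, d_k); Wv: (d_X, d_v).
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])            # (L, L) input-dependent weights A(x)
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                 # row-wise softmax
    return A @ V                                      # each output is a mixture of the values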
75
Encoder-only transformer
Each layer of the encoder-only transformer is a seq-to-seq map (x_ℓ^(k))_{ℓ=1}^L ↦ (x_ℓ^(k+1))_{ℓ=1}^L. 76
Layer normalization
Layer normalization (LN) also stabilizes training by normalizing the features and thereby
avoiding exploding and vanishing gradients.
Normalization is across the features: it does not normalize over sequence positions or batch elements. Assume X has dimensions (batch × sequence length × channel/feature). Then
μ̂[:, :] = (1/C) Σ_{c=1}^C X[:, :, c]
σ̂²[:, :] = (1/C) Σ_{c=1}^C (X[:, :, c] − μ̂[:, :])²
LN_{γ,β}(X)[:, :, c] = γ[c] (X[:, :, c] − μ̂[:, :]) / √(σ̂²[:, :] + ε) + β[c],   c = 1, …, C.
For CNNs, LN normalizes over channels and spatial dimensions. For transformers, LN normalizes over channels and not over the sequence dimension.
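A minimal numpy sketch of the formula above:
import numpy as np

def layer_norm(X, gamma, beta, eps=1e-5):
    # X: (batch, seq_len, C); gamma, beta: (C,). Normalizes over the feature axis only.
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return gamma * (X - mu) / np.sqrt(var + eps) + beta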
(Figure: the axes normalized over by BN, LN in CNNs, and LN in transformers, in terms of the batch (N), channel (C), spatial (H, W), and sequence-length dimensions.)
78
Position-wise FFN
The position-wise FFN is a 2-layer MLP with a ReLU, GELU, or SiLU activation function ρ, applied identically and independently at each sequence position:
z_ℓ = W_2 ρ(W_1 x_ℓ + b_1) + b_2,   ℓ = 1, …, L.
79
GELU, SiLU, Swish activations
Gaussian error linear unit (GELU), Sigmoid-weighted linear unit (SiLU), and Swish are
smooth non-monotone activation functions. The three are qualitatively similar: they
decrease near 0 and then increase nearly linearly.
D. Hendrycks and K. Gimpel, Gaussian error linear units (GELUs), arXiv, 2016.
P. Ramachandran, B. Zoph, and Q. V. Le, Searching for Activation Functions, arXiv, 2017. 80
S. Elfwing, E. Uchibe, and K. Doya, Sigmoid-weighted linear units for neural network function approximation in reinforcement learning, Neural Networks, 2018.
Positional encoding/embedding
Problem: The transformer architecture is permutation equivariant, and it does not know the positional information of tokens. The relative positions of tokens (word order or patch location) obviously carry important meaning.
Solution: After tokenization (u_ℓ)_{ℓ=1}^L = τ(X), add positional embedding vectors and then pass
(u_ℓ + p_ℓ)_{ℓ=1}^L
as input to the transformer layers.
S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, End-to-end memory networks, NeurIPS, 2015.
J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, Convolutional sequence to sequence learning, ICML, 2017. 81
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention is all you need, NeurIPS, 2017.
Positional encoding/embedding
NLP transformers often use the sinusoidal positional encoding p_1, …, p_L ∈ ℝ^d.
(It feels like a very arbitrary design, but it works well and is hard to beat.) Since NLP transformers must accommodate arbitrary sequence lengths L, using a positional encoding with an analytical formula makes sense.
On the other hand, vision transformers let (p_ℓ)_{ℓ=1}^L be trainable. This is possible since the image resolution, and hence the sequence length L, is fixed.
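A minimal numpy sketch of the sinusoidal encoding of Vaswani et al. (even coordinates are sines, odd coordinates are cosines; d is assumed even):
import numpy as np

def sinusoidal_encoding(L, d):
    pos = np.arange(L)[:, None]                    # positions 0, ..., L-1
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000.0 ** (2 * i / d))        # (L, d/2) table of angles
    P = np.zeros((L, d))
    P[:, 0::2] = np.sin(angles)                    # even dimensions
    P[:, 1::2] = np.cos(angles)                    # odd dimensions
    return P                                       # rows are p_1, ..., p_L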
82
Positional encoding/embedding
The idea is often attributed to Vaswani et al. 2017. However, Sukhbaatar et al. 2015 and Gehring et al. 2017 published the positional encoding technique earlier. The sinusoidal encoding is due to Vaswani et al. 2017.
S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, End-to-end memory networks, NeurIPS, 2015.
J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, Convolutional sequence to sequence learning, ICML, 2017. 83
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention is all you need, NeurIPS, 2017.
Post-LN vs pre-LN TF architectures
There are two variants of the transformer architecture based on the position of LN.
The original paper (Vaswani et al. 2017) illustrates post-LN in its figure. However, their updated official codebase uses pre-LN. It was later reported that pre-LN is more stable.
R. Xiong, Y. Yang, D. He, K. Zheng, X. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T.-Y. Liu, On layer normalization in the transformer architecture, 84
ICML, 2020.
Transformer depth
Thanks to the residual connections and layer norm, transformers can often be much deeper
than stacked RNNs. (ELMo has 2 layers, while BERT has 24 layers.)
To clarify, the layer norm and the residual connection are used to mitigate the
exploding/vanishing gradient problem across the transformer depth.
The transformer does not have the exploding/vanishing gradient problem along the sequence length L, due to its use of the attention mechanism.
85
Why transformers over RNNs?
Handling long sequence length:
RNNs can't handle long input sequences due to a fixed memory size and vanishing or exploding gradients. LSTMs are designed to mitigate this problem, but transformers really solve it. Transformers allow the full input sequence to be considered when computing the representation of each token.
86
BERT pre-training
BERT pre-training uses two losses.
1. Masked LM (MLM): Randomly mask out 15% of the words and let BERT predict them. The output tokens corresponding to masked words are fed into a softmax and the CE loss.
2. Next sentence prediction (NSP): Given a pair of sentences, predict whether the second sentence actually follows the first.
Fine-tuning is computationally very cheap (<1 hour on a single Google TPU).
88
Vision transformer
Vision transformer is an encoder-only
transformer architecture.
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N.
Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, ICLR, 2021. 89
GPT-1
GPT (generative pre-training) uses a causal language model loss.
Initially, GPT was trained to be an unsupervised pre-trained model in the vein of BERT, and its text generation ability was not that strong.
A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, Improving language understanding by generative pre-training, 2018. 90
Masked attention
In RNNs, information is naturally processed sequentially, so causality is automatic. Self-attention, in contrast, lets every position attend to every other position. Therefore, GPT uses a masked attention that allows the current sequence element to query only earlier sequence elements.
91
Masked single-head self attention
Only the lower-triangular components of Ã are finite.
Only the lower-triangular components of A are nonzero.
y_ℓ is a linear combination of v_1, …, v_ℓ.
y_ℓ only depends on x_1, …, x_ℓ.
((x_ℓ)_{ℓ=1}^L ↦ (y_ℓ)_{ℓ=1}^L has causal dependency.)
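A minimal numpy sketch of the masked attention computation (Q, K, V are assumed given, of shapes (L, d_k), (L, d_k), (L, d_v)):
import numpy as np

def causal_attention(Q, K, V):
    L = Q.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[1])
    mask = np.tril(np.ones((L, L), dtype=bool))       # lower-triangular causal mask
    scores = np.where(mask, scores, -np.inf)          # upper-triangular logits -> -inf
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                 # only lower-triangular A is nonzero
    return A @ V                                      # y_l depends only on x_1, ..., x_l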
92
Masked multi-head self attention
Seq-to-seq transformation (x_ℓ)_{ℓ=1}^L ↦ (z_ℓ)_{ℓ=1}^L with causal dependence:
z_ℓ only depends on x_1, …, x_ℓ.
Since the other components of the transformer all act position-wise, the transformer with causal MHA is a seq-to-seq transformation with causal dependence.
93
Self-supervised pre-training
Let X be the input text, tokenized as τ(X) = (u_1, u_2, …, u_L).
Let f_θ be the transformer mapping (u_ℓ)_{ℓ=1}^L ↦ (w_ℓ)_{ℓ=1}^L, where w_ℓ ∈ ℝ^n. Then, the pre-training loss is
ℒ(θ) = Σ_{ℓ=1}^L ℓ_CE(w_ℓ, u_{ℓ+1}),
where ℓ_CE is the cross-entropy loss, with u_{ℓ+1} ∈ {1, …, n} above viewed as an integer (and u_{L+1} = <EOS>).
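A minimal numpy sketch of this loss; `logits` stacks w_1, …, w_L as rows and `tokens` holds the integer ids u_1, …, u_L, <EOS>:
import numpy as np

def causal_lm_loss(logits, tokens):
    # logits: (L, n); tokens: length L+1 (ends with the <EOS> id).
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))   # row-wise log-softmax
    targets = np.asarray(tokens[1:])                          # w_l is scored against u_{l+1}
    return -np.sum(logp[np.arange(len(targets)), targets])    # (often averaged in practice)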
94
Supervised fine-tuning
First, transform the relevant text into a sequence with appropriate delimiter tokens.
95
Supervised fine-tuning
For classification, given an input text X and a tokenizer τ, the transformer maps
(<Start>, τ(X), <Extract>) = (u_ℓ)_{ℓ=1}^L ↦ (w_ℓ)_{ℓ=1}^L.
The final token w_L, corresponding to the <Extract> token, is extracted. The loss
loss(A w_L + b, Y),
where A and b are the parameters of the linear layer and Y is the label corresponding to X, is used.
(Note that BERT had a <Cls> token at the start of the input, and it basically served the same role as the <Extract> token for GPT. Different from BERT, GPT is a causal language model, so the <Extract> token must be at the end if we want w_L to encode information about the full sentence.) 96
Supervised fine-tuning
The full GPT-1 model (the pre-trained TF), the final linear layer, and the vector embeddings
corresponding to <Start>, <Extract>, and <Delim> are trained.
For similarity tasks, there is no inherent ordering of the two sentences being compared. So
the transformer is given both orderings.
97
Example task: Machine translation
In machine translation, training data contains translation pairs between different languages.
Classically with an RNN, the encoding stage encodes (summarizes) the entire sentence into a latent vector, and the decoder generates the translated text autoregressively.
98
Bahdanau attention and cross attention
The problem with the previous approach is that the hidden state passed from the encoder
RNN to the decoder RNN acts as a bottleneck, and the hidden state may not be able to
retain all the necessary information.
Solution: Allow the decoder RNN to query (attend to) the encoder's hidden states, which serve as keys and values, at every step of decoding.
D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, ICLR, 2015. 99
Transformer and cross attention
Vaswani et al. questioned whether the RNN mechanism
was necessary. They concluded “Attention is all you need”.
The cross-attention layer derives (q_ℓ)_{ℓ=1}^L from the previous layer's (x_ℓ^dec)_{ℓ=1}^L, but (k_ℓ')_{ℓ'=1}^{L'} and (v_ℓ')_{ℓ'=1}^{L'} are derived from the encoder layer's (x_ℓ'^enc)_{ℓ'=1}^{L'}. (In cross attention, the number of queries need not match the number of keys and values.)
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention is all you need, NeurIPS, 2017. 100
Understanding TF from historical context
The transformer architecture feels somewhat arbitrary, but we can understand the designers' intent through the historical context.
There is no mathematical reason that things must be the way they are, and the standard architecture will likely change in the future.
The historical context does inform us of the intended purpose of the components, and it gives us a rough guideline of what will certainly not work and what new components may work.
101
How to decode with a causal LM
Assume a causal language model p_θ(u_ℓ | u_1, …, u_{ℓ−1}) has been trained. If p_θ were perfect, then naïve sampling (as defined soon) would be sufficient for text generation (decoding).
However, p_θ is imperfect, so effective sampling requires the following techniques:
• Naïve sampling (with temperature)
• Greedy sampling
• Beam search
• Top-k sampling
• Top-p (nucleus) sampling
102
Naïve sampling
Let f_θ(u_1, …, u_ℓ) ∈ ℝ^n be the final, ℓ-th, output token of the transformer architecture (n is the number of distinct tokens) such that
p_θ(u_{ℓ+1} = i | u_1, …, u_ℓ) = softmax(f_θ(u_1, …, u_ℓ))_i
for i = 1, …, n. Let u_1, …, u_ℓ be given with u_ℓ ≠ <EOS>. Naïve sampling draws u_{ℓ+1} directly from this distribution.
Problem: Low-probability words are sampled with low but non-zero probabilities, and they tend to be bad.
103
Naïve sampling with temperature
Naïve sampling with temperature adjusts the “temperature” parameter β > 0 of the softmax:
p_θ(u_{ℓ+1} = i | u_1, …, u_ℓ) = softmax(f_θ(u_1, …, u_ℓ)/β)_i.
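A minimal sketch of drawing one token with temperature β from the logits f_θ(u_1, …, u_ℓ):
import numpy as np

def sample_with_temperature(logits, beta=1.0, rng=np.random.default_rng()):
    z = logits / beta                       # beta < 1 sharpens, beta > 1 flattens
    p = np.exp(z - z.max())
    p /= p.sum()                            # softmax(f_theta / beta)
    return int(rng.choice(len(p), p=p))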
104
Greedy sampling
If the low-probability words are problematic, then why not just sample the most likely word?
Problem 1) Greedy sampling does not generate the most likely sequence of tokens.
Problem 2) Likely text is not always good. More on this later.
105
Exact MAP decoding
Exact maximum a posteriori (MAP) decoding selects the most likely completion,
argmax_{u_{ℓ+1}, …, u_L} ℙ(u_{ℓ+1}, …, u_L, <EOS> | u_1, …, u_ℓ),
but the maximization over all possible sequences is computationally intractable.
K. Knight, Decoding complexity in word-replacement translation models, Computational Linguistics, 1999. 106
Beam search
Beam search is a tractable, heuristic approximation to exact MAP decoding.
Beam search produces sequences with higher likelihood compared to greedy search.
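A minimal sketch of beam search with beam width B; `log_probs` is an assumed callable returning next-token log-probabilities for a partial sequence, and practical refinements (length normalization, batching) are omitted:
import numpy as np

def beam_search(log_probs, eos_id, B=4, max_len=50):
    beams = [([], 0.0)]                              # (partial sequence, log-likelihood)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            lp = log_probs(seq)
            for tok in np.argsort(lp)[-B:]:          # expand only the B best next tokens
                candidates.append((seq + [int(tok)], score + lp[tok]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:B]:            # keep the B best partial sequences
            (finished if seq[-1] == eos_id else beams).append((seq, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])[0]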
107
Human text is not too predictable
High quality human text does
not follow a distribution of high
probability next words.
A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, The curious case of neural text degeneration, ICLR, 2020. 108
Top-K sampling
Top-K sampling samples among the K most likely words, with probabilities determined by a softmax (with temperature β > 0) applied to the logits of the top K words.
A. Fan, M. Lewis, and Y. Dauphin, Hierarchical neural story generation, ACL, 2018. 109
Top-p (nucleus) sampling
Top-p sampling samples among the smallest set of most likely words whose total probability (with a temperature β > 0) exceeds p. Once the set of top words V(p) is defined,
p̃_θ(u_{ℓ+1} = i | u_1, …, u_ℓ) = p_θ(u_{ℓ+1} = i | u_1, …, u_ℓ) / Σ_{j∈V(p)} p_θ(u_{ℓ+1} = j | u_1, …, u_ℓ)   for i ∈ V(p),
and sampling is done from p̃_θ. (By construction, Σ_{i∈V(p)} p_θ(u_{ℓ+1} = i | u_1, …, u_ℓ) ≥ p.)
A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, The curious case of neural text degeneration, ICLR, 2020. 110
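A minimal numpy sketch of the two truncation rules applied to a next-token probability vector p (temperature, if any, is assumed to have been applied already):
import numpy as np

def top_k_filter(p, K):
    q = np.where(p >= np.sort(p)[-K], p, 0.0)        # keep only the K most likely tokens
    return q / q.sum()

def top_p_filter(p, p_threshold=0.9):
    order = np.argsort(p)[::-1]                      # most likely tokens first
    cum = np.cumsum(p[order])
    keep = order[: np.searchsorted(cum, p_threshold) + 1]   # smallest set V(p) with mass >= p
    q = np.zeros_like(p)
    q[keep] = p[keep]
    return q / q.sum()                               # renormalized distribution p-tilde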
GPT-2 and GPT-3
The GPT-2 and GPT-3 papers had very similar titles and messages. GPT-3 simply scales up
GPT-2 and achieves stronger results. (Architecture didn’t change much from GPT-1.)
• GPT-(1,2,3) Model size: 117M→1.5B→175B
• GPT-(1,2,3) Data size: 4GB→40GB→600GB
Main message: GPT can solve many tasks with a unified task-agnostic architecture and without supervised fine-tuning. Task-specific training data is either not used (zero-shot) or only a few examples are provided at inference time (few-shot in-context learning).
No task-specific training data is used for training or fine-tuning. (However, having diverse task-specific training data is helpful, as T5 and Flan-T5 show.)
(ELMo was not at all task agnostic. BERT and GPT-1 were a little more task agnostic, but had task-specific heads.)
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, Language models are unsupervised multitask learners, Tech. Report, Feb. 2019.
T. B. Brown, B. Mann, N. Ryder, …, A. Radford, I. Sutskever, and D. Amodei, Language models are few-shot learners, NeurIPS, 2020. (arXiv May 2020) 111
In-context learning
The notion of in-context learning (ICL) was first
formalized in the GPT-3 paper.
T. B. Brown, B. Mann, N. Ryder, …, A. Radford, I. Sutskever, and D. Amodei, Language models are few-shot learners, NeurIPS, 2020. (arXiv May 2020) 112
GPT and closed-source research
GPT-2 is open source
https://github.com/openai/gpt-2
GPT-3 is famously not open source, and GPT-3 kickstarted the (somewhat unfortunate)
closed-source proprietary LLM research.
113
Emergence and
LLMs as a universal interface
Emergence refers to the phenomenon where complex and often unexpected behaviors or
capabilities arise from a system as it scales, particularly when these behaviors were not
explicitly programmed or evident in smaller or less complex versions of the system.
LLMs exhibit many emergent properties. Among them, noteworthy is the ability of LLMs to
just follow textual instructions. This allows LLMs to operate as a tool with natural language
instructions serving as a universal interface.
114
T5 model
A unified task-agnostic architecture: the many tasks, which are not semantically related, are formatted into a text-to-text format. The same model, objective, training procedure, and decoding process are applied to every task that we consider.
C. Raffel, N. Shazeer, A. Roberts, K .Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, Exploring the limits of transfer learning with a unified text-
to-text transformer, JMLR, 2020. (arXiv Oct. 2019) 115
T5 pre-training
Pre-training on large unlabeled text with diverse objectives inspired by prior work.
The “inputs” are fed into the encoder block while the “target” text is generated by the
decoder one token at a time.
116
T5 attention mask
The encoder uses un-masked attention, while the decoder can access the encoder tokens via cross attention and the earlier decoder tokens via masked self-attention.
Alternatively, interpret the T5 transformer as using the “causal with prefix” attention mask.
117
T5 fine-tuning
Simultaneously fine-tune on a wide range of tasks. Simply prompt the model differently for
each task to inform T5 of the specific task to solve.
118
T5 contribution
Advanced the state of the art with the pre-train-then-fine-tune approach.
Gave the idea that language models can understand and respond to natural language
instructions. We can simply tell a language model what we want (in natural language) and it
will follow our instructions.
C. Raffel, N. Shazeer, A. Roberts, K .Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, Exploring the limits of transfer learning with a unified text-
to-text transformer, JMLR, 2020. (arXiv Oct. 2019) 119