Modern Language Models
John Hawkins
getting-data-science-done.com
Getting to language models
● Intro to modelling with text
– The problem with sequences
● Semantic Spaces & Word Embeddings
● Recurrent Networks
● Attention
● Self-Attention (Critical to Transformers)
Intro to modelling with text
● Text data has multiple unusual (difficult) properties
– Variable length
– Multiple potential languages
– Multiple potential encodings
– Varying sources of error or confusion
● Typos, encoding, transcription, slang, dialects, deception
Dealing with variable length
● Machine learning models need fixed length inputs*
– This is a set of numbers in a vector, upon which the
model will make a prediction.
– Machine learning with text always involves a strategy
or trick to overcome this problem.
* Models that do handle variable length inputs use fixed length components + some kind of trick
A simple strategy: N-Grams
● The most common simple trick is to process a block of
text into a vector of n-grams.
– Each position in the vector corresponds to a word
or phrase up to N words long.
– The processing determines the vocabulary of words
and phrases.
Visualising: N-Gram Models
[Figure: the input text “How now brown cow” is mapped to a fixed-length output vector, e.g. 0 1 1 1 0 1 0 0]
Contents of N-Gram Vector
● There are multiple ways you can represent the words inside
an n-gram vector:
– Binary indicator
– Count of the number of appearances
– TF-IDF
● All of these can also (optionally) be normalised to reduce the
impact of varying sizes of text blocks (see the sketch below).
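A minimal sketch of these three representations, assuming the scikit-learn library (CountVectorizer and TfidfVectorizer) and two toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["How now brown cow", "How now brown bear"]

# Binary indicator: 1 if the n-gram appears in the document, 0 otherwise.
binary_vec = CountVectorizer(ngram_range=(1, 2), binary=True)
X_binary = binary_vec.fit_transform(docs)

# Count: how many times each n-gram appears.
count_vec = CountVectorizer(ngram_range=(1, 2))
X_count = count_vec.fit_transform(docs)

# TF-IDF: counts re-weighted by how rare each n-gram is across documents.
tfidf_vec = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf_vec.fit_transform(docs)

print(binary_vec.get_feature_names_out())  # the learned vocabulary of 1- and 2-grams
print(X_binary.toarray())                  # one fixed-length vector per document
```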
Why move beyond N-Grams?
● The N-Gram representation of text only captures certain kinds of text properties.
– Are there certain terms or phrases that specifically indicate what we want the
model to learn?
– Is there a pattern to the frequency or scarcity of words that relates to the
learning task?
● N-Gram models are very poor at representing any kind of meaning in the text.
They are also poor at capturing relationships in the text that depend on long
range linguistic structures. For example, the impact of the number of embedded
clauses on the readability of a block of text.
Semantic Spaces
● All modern approaches try to represent text as a
vector that captures the meaning.
– This would mean that the vector for these two
sentences should be similar*:
– “The dog jumped over the fence”
– “The hound leapt across the barrier”
* Note that the N-Gram encoding of these sentences would NOT be similar.
Neural Network Primer
Word Embedding
● Word Embeddings (e.g. Word2Vec*) are trained to predict the
relationship between a target word (we want to encode) and the words
around it (context)
– Two training approaches: CBOW and Skip-Gram
– The hidden (or projection) layer is extracted and used as the word
embedding.
● The word embedding is a semantic representation of a word. The
embeddings of multiple words can be added to create a semantic vector
for the block of text.
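A minimal training sketch of the approach just described, assuming the gensim library and a toy corpus (real embeddings need far more text); the sg flag switches between CBOW and Skip-Gram:

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "dog", "jumped", "over", "the", "fence"],
    ["the", "hound", "leapt", "across", "the", "barrier"],
]

# sg=0 trains with CBOW (predict the target word from its context);
# sg=1 trains with Skip-Gram (predict the context from the target word).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

# The learned projection weights are the word embeddings.
dog_vector = model.wv["dog"]                 # a 50-dimensional semantic vector
print(model.wv.similarity("dog", "hound"))   # meaningful only with a large corpus
```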
Word Embedding pros & cons
● Word Embeddings can be trained from raw text, and have been
demonstrated to give similar vectors to words of similar meaning.
● However, if we simply combine them into a meaning vector we can
blur meaning. For example, the following two sentences will end up with
very similar vectors (see the sketch below):
– “His writing is not good”
– “It is good he is not writing”
● Why? Because ultimately the order of words also matters.
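A sketch of the problem, using random vectors as stand-ins for real pre-trained embeddings: because averaging ignores word order, the two sentences above end up with very similar vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["his", "writing", "is", "not", "good", "it", "he"]
embed = {w: rng.normal(size=50) for w in vocab}   # stand-in for real embeddings

def sentence_vector(words):
    """Average the word embeddings into one semantic vector for the text."""
    return np.mean([embed[w] for w in words], axis=0)

a = sentence_vector("his writing is not good".split())
b = sentence_vector("it is good he is not writing".split())

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)  # clearly high: the averages are dominated by the shared words,
               # even though the two sentences mean very different things
```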
Recurrent Neural Network (RNN)
● A recurrent neural network will process a sequence in
chunks*, and maintain a memory state of what has been
processed.
● When the network processes the final chunk, the model
will contain an internal representation of the sequence.
● This internal state can be engineered such that it is a
semantic representation of the sequence.
[Diagram: the recurrent network reads a fixed-length input chunk through its input nodes at each step, updating its internal memory state]
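A minimal sketch in PyTorch (assumed sizes), showing how the final hidden state can serve as the fixed-length representation of a variable-length sequence:

```python
import torch
import torch.nn as nn

embedding_dim, hidden_dim = 50, 64
rnn = nn.GRU(input_size=embedding_dim, hidden_size=hidden_dim, batch_first=True)

# A batch of 1 sequence of 5 word embeddings (placeholders for real embeddings).
sequence = torch.randn(1, 5, embedding_dim)

outputs, final_hidden = rnn(sequence)
# outputs: the hidden state after every word/chunk     -> shape (1, 5, 64)
# final_hidden: the internal state after the last one  -> shape (1, 1, 64)
sequence_vector = final_hidden.squeeze(0)  # the semantic representation of the sequence
```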
Beyond RNNs
● RNNs and variations such as LSTM enabled a wide range of
language processing tasks in machine learning…
● ...but they have lingering issues
– They tend to emphasise recent input
– It is still hard to learn long range dependencies in the
sequences.
– Sequential processing makes parallelization hard.
Encoder Decoder Architecture
● In order to perform tasks like language translation we need a
machine learning architecture that turns one arbitrary length
sequence into another:
– A sequence to sequence architecture*
● The Encoder Decoder Architecture was designed to improve
translation by processing the input into a semantic vector
(Encoder) and then processing that vector into the target
language (Decoder)
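A minimal encoder-decoder sketch in PyTorch, with assumed vocabulary and layer sizes: the encoder compresses the source sequence into a single context vector, which initialises the decoder.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens):
        # Encode the whole source sentence into one semantic vector.
        _, context = self.encoder(self.src_emb(src_tokens))
        # Decode the target sentence, starting from that vector.
        decoded, _ = self.decoder(self.tgt_emb(tgt_tokens), context)
        return self.out(decoded)   # scores over the target vocabulary at each step

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
logits = model(torch.randint(0, 1000, (1, 7)), torch.randint(0, 1200, (1, 9)))
```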
Encoder Decoder + Attention
● Using the final hidden node activation as the semantic
vector had a problem:
– Context from early in the sentence was washed out
● An attention mechanism* was added to allow the
architecture to learn what it needed to attend to at
each stage of the output process.
[Diagram: the decoder state S attends over the encoder outputs W1 … WN at each output step]
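A sketch of the idea, using a simple dot-product score rather than the learned scoring network of the published attention mechanisms: at each output step the decoder state S is scored against every encoder output, and the context becomes a weighted sum rather than just the final state.

```python
import torch
import torch.nn.functional as F

encoder_states = torch.randn(5, 128)   # W1..WN: one state per input word
decoder_state = torch.randn(128)       # S: the decoder's current state

scores = encoder_states @ decoder_state        # how relevant is each input word right now?
weights = F.softmax(scores, dim=0)             # attention weights, summing to 1
context = (weights.unsqueeze(1) * encoder_states).sum(dim=0)  # what to attend to at this step
```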
Self Attention in Transformers
● The Sequence to Sequence Model with Attention still
requires a recurrent neural network to process the
entire input sequence serially.
● In Attention Is All You Need* the authors designed a
network architecture called the Transformer, in
which the input can be processed in parallel.
– Sequence to Sequence without recurrent units
*Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin (2017) NIPS
Transformer Architecture
[Diagram: the Transformer architecture, built around Multi-Headed Attention blocks]
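A sketch of multi-headed self-attention using PyTorch's built-in nn.MultiheadAttention layer (illustrative sizes, not the full Transformer block):

```python
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 64, 8, 5
attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)   # one sentence of 5 word encodings

# Self-attention: the sequence provides its own queries, keys and values.
# Each of the 8 heads learns to attend to the sentence in a different way.
contextualised, attn_weights = attention(x, x, x)
print(contextualised.shape)   # (1, 5, 64): same shape, now context-aware
print(attn_weights.shape)     # (1, 5, 5): attention weights (averaged over heads)
```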
Pt1: Self Attention
● We start with a sequence of word representations
● Transform it so the encoded words are influenced by
the other words in the sentence
[Diagram: for the sentence “The bank of the river”, the new encoding of “bank” is a weighted sum of all the word vectors, $\sum_{i=1}^{n} w_i V_i$, with weights W1 … Wn]
So far…
● We have transformed the word encodings such that they are now
contextualised by the other words.
● The weighted sum $\sum_{i=1}^{n} w_i V_i$ involves no learning.
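A sketch of this no-learning step: every word's new encoding is the weighted sum $\sum_{i=1}^{n} w_i V_i$, with the weights taken from the raw similarity between the existing word vectors (random placeholders stand in for real encodings).

```python
import torch
import torch.nn.functional as F

# "The bank of the river": 5 placeholder word vectors of dimension 8.
words = torch.randn(5, 8)

scores = words @ words.T              # similarity of every word with every other word
weights = F.softmax(scores, dim=-1)   # W1..Wn for each word; each row sums to 1
contextualised = weights @ words      # sum_i w_i * V_i, for every word in parallel
```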
Pt3: Query, Key and Value weights
[Diagram: for the sentence “The bank of the river”, each word is multiplied by learned Query, Key and Value weight matrices (W). The query for “bank” is scored against every word’s key to produce the weights W1 … Wn, which then combine the value vectors of “The”, “bank”, … “river” into the new encoding of “bank”.]
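A sketch of the learned version in the scaled dot-product form of the Transformer paper (illustrative sizes): Query, Key and Value weight matrices project each word, every query is scored against every key, and the softmaxed scores weight the value vectors.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 8
words = torch.randn(5, d_model)      # "The bank of the river" as 5 word encodings

W_q = nn.Linear(d_model, d_model, bias=False)   # learned Query weights
W_k = nn.Linear(d_model, d_model, bias=False)   # learned Key weights
W_v = nn.Linear(d_model, d_model, bias=False)   # learned Value weights

Q, K, V = W_q(words), W_k(words), W_v(words)

scores = Q @ K.T / math.sqrt(d_model)   # each query scored against every key
weights = F.softmax(scores, dim=-1)     # the attention weights W1..Wn per word
contextualised = weights @ V            # weighted sum of the value vectors
```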
Self Attention Summary
● Self attention allows us to transform text into a
contextually sensitive representation
– All words transformed in parallel
– Influenced by all other words in the text
● Can be used for more than just sequence to
sequence tasks.
Summary
Credit where credit is due
● Aside from the referenced papers, the following
sources were invaluable:
– Jay Alammar’s Visual Transformer Posts
● http://jalammar.github.io
– RASA Whiteboard Series on Youtube