Modern Language Models

Modern language models use various techniques to represent variable-length sequences of text as fixed-length vectors that can be used as inputs for machine learning algorithms. Early methods like n-grams represent text as a vector counting the occurrences of words or phrases up to a fixed length. More advanced models use word embeddings that represent words as dense vectors trained to capture semantic relationships. Recurrent neural networks process input sequences sequentially and maintain internal states to represent sequences. Transformer models use self-attention to allow sequences to be processed in parallel without recurrence.


Modern language models

for machine learning

John Hawkins
getting-data-science-done.com
Getting to language models
● Intro to modelling with text
– The problem with sequences
● Semantic Spaces & Word Embeddings
● Recurrent Networks
● Attention
● Self-Attention (Critical to Transformers)

2
Intro to modelling with text
● Text data has multiple unusual (difficult) properties
– Variable length
– Multiple potential languages
– Multiple potential encodings
– Varying sources of error or confusion
● Typos, encoding, transcription, slang, dialects, deception

3
Dealing with variable length
● Machine learning models need fixed length inputs*
– This is a set of numbers in a vector, upon which the
model will make a prediction.
– Machine learning with text always involves a strategy
or trick to overcome this problem.

* Models that do handle variable length inputs use fixed length components + some kind of trick
4
A simple strategy : N-Gram
● The most common simple trick is to process a block of
text into a vector of n-grams.
– Each position in the vector corresponds to a word
or phrase up to N words long.
– A pass over the training corpus determines the vocabulary of
words and phrases that define the vector positions.

5
Visualising : N-Gram Models
Input Text                 Output Vector (the, how, now, cow, fox, brown, quick, and)

“The quick brown fox”      1 0 0 0 1 1 1 0
“How now brown cow”        0 1 1 1 0 1 0 0

6
Contents of N-Gram Vector
● There are multiple ways you can represent the words inside
an n-gram vector:
– Binary Indicator
– Count of number of appearances
– TfIdf
● Any of these can optionally be normalised to reduce the impact of
varying text-block sizes.

7
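A minimal sketch of these three options, assuming scikit-learn is available (the two example sentences come from the earlier slide; the ngram_range setting and other details are illustrative):

```python
# Sketch: binary, count, and TfIdf n-gram vectors with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["The quick brown fox", "How now brown cow"]

# Binary indicator: 1 if the term appears in the document, 0 otherwise.
binary = CountVectorizer(ngram_range=(1, 2), binary=True).fit(docs)
print(binary.get_feature_names_out())     # the learned vocabulary of words and phrases
print(binary.transform(docs).toarray())

# Count of the number of appearances of each term.
counts = CountVectorizer(ngram_range=(1, 2)).fit(docs)
print(counts.transform(docs).toarray())

# TfIdf: counts reweighted by how rare each term is across the corpus
# (scikit-learn applies L2 normalisation to each row by default).
tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)
print(tfidf.transform(docs).toarray())
```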
Why move beyond N-Grams?
● The N-Gram representation of text only captures certain kinds of text properties.
– Are there certain terms or phrases that specifically indicate what we want the
model to learn?
– Is there a pattern to the frequency or scarcity of words that relates to the
learning task?
● N-Gram models are very poor at representing any kind of meaning in the text.
They are also poor at capturing relationships in the text that depend on long
range linguistic structures. For example, the impact of the number of embedded
clauses on the readability of a block of text.

8
Semantic Spaces
● All modern approaches try to represent text as a
vector that captures the meaning.
– This would mean that the vector for these two
sentences should be similar*:
– “The dog jumped over the fence”
– “The hound leapt across the barrier”

* Note that the N-Gram encoding of these sentences would NOT be similar.
9
Neural Network Primer

[Diagram: a feed-forward neural network. Input nodes take fixed-length input data, hidden nodes learn data features, and the output node(s) produce the predicted value or output.]

10
Word Embedding
● Word Embeddings (e.g. Word2Vec*) are trained to predict the
relationship between a target word (we want to encode) and the words
around it (context)
– Two training approaches: CBOW and Skip-Gram
– The hidden (or projection) layer is extracted and used as the word
embedding.
● The word embedding is a semantic representation of a word. The
embeddings of multiple words can be added to create a semantic vector
for the block of text.

*Mikolov, Sutskever, Chen, Corrado, Dean (2013) https://arxiv.org/pdf/1310.4546.pdf
11


Word2Vec Architecture

12
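A minimal sketch of training Word2Vec embeddings, assuming gensim 4.x (the tiny tokenised corpus and all hyperparameters are purely illustrative; a real model needs a large corpus):

```python
# Sketch: training Word2Vec with both approaches and reading out an embedding.
from gensim.models import Word2Vec

# Toy corpus: one tokenised sentence per list entry.
sentences = [
    ["the", "dog", "jumped", "over", "the", "fence"],
    ["the", "hound", "leapt", "across", "the", "barrier"],
]

# sg=1 selects Skip-Gram (predict the context words from the target word);
# sg=0 selects CBOW (predict the target word from its context).
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

# The learned projection weights are the word embeddings.
print(skipgram.wv["dog"].shape)                 # (100,)
print(skipgram.wv.similarity("dog", "hound"))   # meaningful only with a real corpus
```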
Word Embedding pros & cons
● Word Embeddings can be trained from raw text, and have been
demonstrated to place words of similar meaning close together.
● However, if we simply combine them into a meaning vector we can
blur meaning. For example, the following two sentences will be very
similar:
– “His writing is not good”
– “It is good he is not writing”
● Why? Because ultimately the order of words also matters.

13
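A small NumPy illustration of why adding or averaging embeddings blurs meaning: the two sentences above share almost all of their words, so their averaged vectors end up very close, and word order has no effect at all. The tiny embedding table is hypothetical, purely for illustration:

```python
# Sketch: averaging word vectors is blind to word order.
import numpy as np

# Hypothetical 4-dimensional embeddings (real models use hundreds of dimensions).
embedding = {
    "his":     np.array([0.1, 0.3, 0.0, 0.2]),
    "writing": np.array([0.7, 0.1, 0.4, 0.0]),
    "is":      np.array([0.0, 0.2, 0.1, 0.1]),
    "not":     np.array([0.3, 0.0, 0.6, 0.2]),
    "good":    np.array([0.5, 0.8, 0.1, 0.3]),
    "it":      np.array([0.1, 0.1, 0.1, 0.0]),
    "he":      np.array([0.2, 0.3, 0.1, 0.1]),
}

def sentence_vector(tokens):
    """Average the word embeddings into one fixed-length semantic vector."""
    return np.mean([embedding[t] for t in tokens], axis=0)

a = sentence_vector("his writing is not good".split())
b = sentence_vector("it is good he is not writing".split())

# Different meanings, yet the vectors are built from almost the same bag of
# words, so they come out very close; reordering the words changes nothing.
print(np.linalg.norm(a - b))
```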
Recurrent Neural Network (RNN)
● A recurrent neural network will process a sequence in
chunks*, and maintain a memory state of what has been
processed.
● When the network processes the final chunk, the model
will contain an internal representation of the sequence.
● This internal state can be engineered such that it is a
semantic representation of the sequence.

*Chunks can be letters, phonemes, words or even N-Grams
14
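A minimal sketch of the recurrence itself in NumPy (the dimensions, random weights and tanh activation are illustrative; library RNN layers work the same way internally but learn the weights):

```python
# Sketch: an RNN cell applied chunk by chunk, carrying a hidden "memory" state.
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16

# Randomly initialised weights; in practice these are learned by backpropagation.
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input -> hidden
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden
b = np.zeros(hidden_dim)

def encode_sequence(chunks):
    """Process a variable-length sequence of chunk vectors into one hidden state."""
    h = np.zeros(hidden_dim)
    for x in chunks:
        # The new state depends on the current chunk AND everything seen so far.
        h = np.tanh(W_x @ x + W_h @ h + b)
    return h  # a fixed-length internal representation of the whole sequence

sequence = [rng.normal(size=input_dim) for _ in range(5)]  # e.g. 5 word vectors
print(encode_sequence(sequence).shape)  # (16,) regardless of sequence length
```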


Visualise RNNs
[Diagram: an RNN unrolled over time. The hidden nodes receive the current chunk of the arbitrary-length input sequence together with the hidden state from time t-1; the learned hidden features feed the output node(s), which produce the predicted value or output.]

15
Beyond RNNs
● RNNs and variations such as LSTM enabled a wide range of
language processing tasks in machine learning…
● ...but they have lingering issues
– They tend to emphasise recent input
– It is still hard to learn long range dependencies in the
sequences.
– Sequential processing makes parallelization hard.

16
Encoder Decoder Architecture
● In order to perform tasks like language translation we need a
machine learning architecture that turns one arbitrary length
sequence into another:
– A sequence to sequence architecture*
● The Encoder Decoder Architecture was designed to improve
translation by processing the input into a semantic vector
(Encoder) and then processing that vector into the target
language (Decoder)

*Sutskever, Vinyals, V. Le (2014) NIPS
17


Visualise Encoder Decoder
[Diagram: the Encoder processes the input sequence into a semantic vector (a learned representation of the sequence), and the Decoder turns that vector into the output sequence.]

18
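A compact sketch of the encoder-decoder idea using PyTorch GRUs (the vocabulary sizes, dimensions and class names are illustrative, not the exact setup of the Sutskever et al. paper):

```python
# Sketch: the encoder compresses the source sequence into one vector,
# and the decoder unrolls that vector into the target sequence.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):
        _, h = self.rnn(self.embed(src_tokens))
        return h  # (1, batch, hidden_dim): the "semantic vector"

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt_tokens, h):
        # The decoder starts from the encoder's final state and predicts
        # a distribution over the next target word at every step.
        output, h = self.rnn(self.embed(tgt_tokens), h)
        return self.out(output), h

# Toy usage: a batch of 2 source sequences of length 7, target length 5.
src = torch.randint(0, 1000, (2, 7))
tgt = torch.randint(0, 1200, (2, 5))
encoder, decoder = Encoder(vocab_size=1000), Decoder(vocab_size=1200)
semantic_vector = encoder(src)
logits, _ = decoder(tgt, semantic_vector)
print(logits.shape)  # (2, 5, 1200): one target-vocabulary distribution per step
```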
Encoder Decoder + Attention
● Using the final hidden node activation as the semantic
vector had a problem:
– Context from early in the sentence was washed out
● An attention mechanism* was added to allow the
architecture to learn what it needed to attend to at
each stage of the output process.

*Bahdanau, Cho, Bengio (2014)
19


Adding Attention
[Diagram: the Encoder processes the input sequence into states S; the attention mechanism weights these states (W1 … WN) into an attention-weighted semantic vector; the Decoder uses that vector to produce the output sequence.]

20
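A small NumPy sketch of the attention step, loosely in the spirit of Bahdanau et al.'s additive scoring (the dimensions, random values and score function details are illustrative):

```python
# Sketch: attention turns all encoder states into one weighted context vector,
# recomputed afresh for every decoder step.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
hidden_dim, seq_len = 16, 6

encoder_states = rng.normal(size=(seq_len, hidden_dim))  # one state per input position
decoder_state = rng.normal(size=hidden_dim)              # current decoder state

# Parameters of the additive score function (learned in practice, random here).
W_enc = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_dec = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
v = rng.normal(scale=0.1, size=hidden_dim)

# score_i = v . tanh(W_enc h_i + W_dec s): how relevant is input position i right now?
scores = np.array([v @ np.tanh(W_enc @ h + W_dec @ decoder_state)
                   for h in encoder_states])
weights = softmax(scores)            # attention weights over the input, sum to 1
context = weights @ encoder_states   # attention-weighted semantic vector for this step
print(weights.round(2), context.shape)
```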
Self Attention in Transformers
● The Sequence to Sequence model with attention still
requires a recurrent neural network to process the
entire input sequence serially.
● In Attention Is All You Need* the authors design a
network architecture called the Transformer, in
which the input can be processed in parallel.
– Sequence to Sequence without recurrent units

*Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin (2017) NIPS
21
Transformer Architecture

Multi-Headed Attention: processes an arbitrary sequence of encoded words
in parallel and generates new, context-sensitive encodings of the sequence.

22
Pt1 Self Attention
● We start with a sequence of word representations
● Transform it so the encoded words are influenced by
the other words in the sentence
[Diagram: the sentence “The bank of the river” is fed into a Self Attention Module.]

*Example taken from the RASA Whiteboard series
23


Pt2
So far… we have transformed the word encodings such that they are now
contextualised by the other words.

[Diagram: for “The bank of the river”, the new encoding of “bank” is a weighted
sum over the word vectors, Σ_{i=1..n} W_i · V_i, with weights W1, W2, …, Wn.
No learning is involved at this stage.]

24
Pt3
[Diagram: for “The bank of the river”, each word is projected with a Key weight
matrix, the word being re-encoded (“bank”) is projected with a Query weight
matrix, and comparing the query against each key produces the attention weights
W1, W2, …, Wn. Each word is also projected with a Value weight matrix, and the
attention-weighted sum of these values becomes the new, contextualised encoding
of “bank”.]

25
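A compact NumPy sketch of the query/key/value computation described above (a single attention head with scaled dot-product scoring as in the Transformer paper; the dimensions and random word encodings are illustrative):

```python
# Sketch: single-head self-attention over a 5-word sentence such as
# "The bank of the river"; every row is processed in parallel.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
seq_len, d_model = 5, 32                     # 5 words, 32-dimensional encodings

X = rng.normal(size=(seq_len, d_model))      # input word encodings, one row per word

# Learned Query, Key and Value projection matrices (randomly initialised here).
W_q = rng.normal(scale=0.1, size=(d_model, d_model))
W_k = rng.normal(scale=0.1, size=(d_model, d_model))
W_v = rng.normal(scale=0.1, size=(d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Each word's query is compared with every word's key, giving a 5x5 weight matrix.
weights = softmax(Q @ K.T / np.sqrt(d_model), axis=-1)

# Each word's new encoding is the attention-weighted sum of the value vectors,
# so it is now contextualised by all other words in the sentence.
contextual = weights @ V
print(weights.round(2))
print(contextual.shape)                      # (5, 32)
```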
Self Attention Summary
● Self attention allows us to transform text into a
contextually sensitive representation
– All words transformed in parallel
– Influenced by all other words in the text
● Can be used for more than just sequence to
sequence tasks.

26
Summary

Start: process text by generating fixed-length feature vectors (N-Grams, TfIdf).
Then: process sequences into semantic spaces (RNNs, LSTMs, Seq2Seq models).
Finally: use attention to add context to semantic vectors (self-attention in Transformers).

27
Credit where credit is due
● Aside from the referenced papers, the following
sources were invaluable:
– Jay Alammar’s Visual Transformer Posts
● http://jalammar.github.io
– RASA Whiteboard Series on YouTube

28
