Modern Language Models
John Hawkins
getting-data-science-done.com
Getting to language models
● Intro to modelling with text
– The problem with sequences
● Semantic Spaces & Word Embeddings
● Recurrent Networks
● Attention
● Self-Attention (Critical to Transformers)
Intro to modelling with text
● Text data has multiple unusual (difficult) properties
– Variable length
– Multiple potential languages
– Multiple potential encodings
– Varying sources of error or confusion
● Typos, encoding, transcription, slang, dialects, deception
Dealing with variable length
● Machine learning models need fixed length inputs*
– This is a set of numbers in a vector, upon which the
model will make a prediction.
– Machine learning with text always involves a strategy
or trick to overcome this problem.
* Models that do handle variable length inputs use fixed length components + some kind of trick
A simple strategy: N-Grams
● The most common simple trick is to process a block of
text into a vector of n-grams.
– Each position in the vector corresponds to a word
or phrase up to N words long.
– The processing determines the vocabulary of words
and phrases.
Visualising: N-Gram Models
[Figure: the input text “How now brown cow” is mapped to a fixed-length output vector, e.g. 0 1 1 1 0 1 0 0]
Contents of N-Gram Vector
● There are multiple ways you can represent the words inside
an n-gram vector:
– Binary indicator
– Count of the number of appearances
– TF-IDF
● All of these can also (optionally) be normalised to reduce the
impact of varying sizes of text blocks (see the sketch below).
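A minimal sketch of these three representations, assuming the scikit-learn library (CountVectorizer and TfidfVectorizer) and two toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["How now brown cow", "How now brown bear"]

# Binary indicator: 1 if the n-gram appears in the document, 0 otherwise.
binary_vec = CountVectorizer(ngram_range=(1, 2), binary=True)
X_binary = binary_vec.fit_transform(docs)

# Count: how many times each n-gram appears.
count_vec = CountVectorizer(ngram_range=(1, 2))
X_count = count_vec.fit_transform(docs)

# TF-IDF: counts re-weighted by how rare each n-gram is across documents.
tfidf_vec = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf_vec.fit_transform(docs)

print(binary_vec.get_feature_names_out())  # the learned vocabulary of 1- and 2-grams
print(X_binary.toarray())                  # one fixed-length vector per document
```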
Why move beyond N-Grams?
● The N-Gram representation of text only captures certain kinds of text properties.
– Are there certain terms or phrases that specifically indicate what we want the
model to learn?
– Is there a pattern to the frequency or scarcity of words that relates to the
learning task?
● N-Gram models are very poor at representing any kind of meaning in the text.
They are also poor at capturing relationships in the text that depend on long
range linguistic structures. For example, the impact of the number of embedded
clauses on the readability of a block of text.
Semantic Spaces
● All modern approaches try to represent text as a
vector that captures the meaning.
– This would mean that the vector for these two
sentences should be similar*:
– “The dog jumped over the fence”
– “The hound leapt across the barrier”
* Note that the N-Gram encoding of these sentences would NOT be similar.
Neural Network Primer
Word Embedding
● Word Embeddings (e.g. Word2Vec*) are trained to predict the
relationship between a target word (we want to encode) and the words
around it (context)
– Two training approaches: CBOW and Skip-Gram
– The hidden (or projection) layer is extracted and used as the word
embedding.
● The word embedding is a semantic representation of a word. The
embeddings of multiple words can be added to create a semantic vector
for the block of text.
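A minimal training sketch of the approach just described, assuming the gensim library and a toy corpus (real embeddings need far more text); the sg flag switches between CBOW and Skip-Gram:

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "dog", "jumped", "over", "the", "fence"],
    ["the", "hound", "leapt", "across", "the", "barrier"],
]

# sg=0 trains with CBOW (predict the target word from its context);
# sg=1 trains with Skip-Gram (predict the context from the target word).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

# The learned projection weights are the word embeddings.
dog_vector = model.wv["dog"]                 # a 50-dimensional semantic vector
print(model.wv.similarity("dog", "hound"))   # meaningful only with a large corpus
```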
Word Embedding pros & cons
● Word Embeddings can be trained from raw text, and have been
demonstrated to give similar vectors to words of similar meaning.
● However, if we simply combine them into a meaning vector we can
blur meaning. For example, the following two sentences will end up with
very similar vectors (see the sketch below):
– “His writing is not good”
– “It is good he is not writing”
● Why? Because ultimately the order of words also matters.
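A sketch of the problem, using random vectors as stand-ins for real pre-trained embeddings: because averaging ignores word order, the two sentences above end up with very similar vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["his", "writing", "is", "not", "good", "it", "he"]
embed = {w: rng.normal(size=50) for w in vocab}   # stand-in for real embeddings

def sentence_vector(words):
    """Average the word embeddings into one semantic vector for the text."""
    return np.mean([embed[w] for w in words], axis=0)

a = sentence_vector("his writing is not good".split())
b = sentence_vector("it is good he is not writing".split())

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)  # clearly high: the averages are dominated by the shared words,
               # even though the two sentences mean very different things
```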
Recurrent Neural Network (RNN)
● A recurrent neural network will process a sequence in
chunks*, and maintain a memory state of what has been
processed.
● When the network processes the final chunk, the model
will contain an internal representation of the sequence.
● This internal state can be engineered such that it is a
semantic representation of the sequence.
[Diagram: the recurrent network reads a fixed-length input chunk through its input nodes at each step, updating its internal memory state]
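A minimal sketch in PyTorch (assumed sizes), showing how the final hidden state can serve as the fixed-length representation of a variable-length sequence:

```python
import torch
import torch.nn as nn

embedding_dim, hidden_dim = 50, 64
rnn = nn.GRU(input_size=embedding_dim, hidden_size=hidden_dim, batch_first=True)

# A batch of 1 sequence of 5 word embeddings (placeholders for real embeddings).
sequence = torch.randn(1, 5, embedding_dim)

outputs, final_hidden = rnn(sequence)
# outputs: the hidden state after every word/chunk     -> shape (1, 5, 64)
# final_hidden: the internal state after the last one  -> shape (1, 1, 64)
sequence_vector = final_hidden.squeeze(0)  # the semantic representation of the sequence
```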
Beyond RNNs
● RNNs and variations such as LSTM enabled a wide range of
language processing tasks in machine learning…
● ...but they have lingering issues
– They tend to emphasise recent input
– It is still hard to learn long range dependencies in the
sequences.
– Sequential processing makes parallelization hard.
Encoder Decoder Architecture
● In order to perform tasks like language translation we need a
machine learning architecture that turns one arbitrary length
sequence into another:
– A sequence to sequence architecture*
● The Encoder Decoder Architecture was designed to improve
translation by processing the input into a semantic vector
(Encoder) and then processing that vector into the target
language (Decoder)
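A minimal encoder-decoder sketch in PyTorch, with assumed vocabulary and layer sizes: the encoder compresses the source sequence into a single context vector, which initialises the decoder.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens):
        # Encode the whole source sentence into one semantic vector.
        _, context = self.encoder(self.src_emb(src_tokens))
        # Decode the target sentence, starting from that vector.
        decoded, _ = self.decoder(self.tgt_emb(tgt_tokens), context)
        return self.out(decoded)   # scores over the target vocabulary at each step

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
logits = model(torch.randint(0, 1000, (1, 7)), torch.randint(0, 1200, (1, 9)))
```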
Encoder Decoder + Attention
● Using the final hidden node activation as the semantic
vector had a problem:
– Context from early in the sentence was washed out
● An attention mechanism* was added to allow the
architecture to learn what it needed to attend to at
each stage of the output process.
[Diagram: the decoder state S attends over the encoder outputs W1 … WN at each output step]
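A sketch of the idea, using a simple dot-product score rather than the learned scoring network of the published attention mechanisms: at each output step the decoder state S is scored against every encoder output, and the context becomes a weighted sum rather than just the final state.

```python
import torch
import torch.nn.functional as F

encoder_states = torch.randn(5, 128)   # W1..WN: one state per input word
decoder_state = torch.randn(128)       # S: the decoder's current state

scores = encoder_states @ decoder_state        # how relevant is each input word right now?
weights = F.softmax(scores, dim=0)             # attention weights, summing to 1
context = (weights.unsqueeze(1) * encoder_states).sum(dim=0)  # what to attend to at this step
```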
Self Attention in Transformers
● The Sequence to Sequence Model with Attention still
requires a recurrent neural network to process the
entire input sequence serially.
● In Attention Is All You Need* the authors designed a
network architecture called the Transformer, in
which the input can be processed in parallel.
– Sequence to Sequence without recurrent units
*Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin (2017) NIPS
Transformer Architecture
[Diagram: the Transformer architecture, built around Multi-Headed Attention blocks]
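A sketch of multi-headed self-attention using PyTorch's built-in nn.MultiheadAttention layer (illustrative sizes, not the full Transformer block):

```python
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 64, 8, 5
attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)   # one sentence of 5 word encodings

# Self-attention: the sequence provides its own queries, keys and values.
# Each of the 8 heads learns to attend to the sentence in a different way.
contextualised, attn_weights = attention(x, x, x)
print(contextualised.shape)   # (1, 5, 64): same shape, now context-aware
print(attn_weights.shape)     # (1, 5, 5): attention weights (averaged over heads)
```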
Pt1: Self Attention
● We start with a sequence of word representations
● Transform it so the encoded words are influenced by
the other words in the sentence
[Diagram: for the sentence “The bank of the river”, the new encoding of “bank” is a weighted sum of all the word vectors, $\sum_{i=1}^{n} w_i V_i$, with weights W1 … Wn]
So far…
● We have transformed the word encodings such that they are now
contextualised by the other words.
● The weighted sum $\sum_{i=1}^{n} w_i V_i$ involves no learning.
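A sketch of this no-learning step: every word's new encoding is the weighted sum $\sum_{i=1}^{n} w_i V_i$, with the weights taken from the raw similarity between the existing word vectors (random placeholders stand in for real encodings).

```python
import torch
import torch.nn.functional as F

# "The bank of the river": 5 placeholder word vectors of dimension 8.
words = torch.randn(5, 8)

scores = words @ words.T              # similarity of every word with every other word
weights = F.softmax(scores, dim=-1)   # W1..Wn for each word; each row sums to 1
contextualised = weights @ words      # sum_i w_i * V_i, for every word in parallel
```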
Pt3: Query, Key and Value weights
[Diagram: for the sentence “The bank of the river”, each word is multiplied by learned Query, Key and Value weight matrices (W). The query for “bank” is scored against every word’s key to produce the weights W1 … Wn, which then combine the value vectors of “The”, “bank”, … “river” into the new encoding of “bank”.]
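A sketch of the learned version in the scaled dot-product form of the Transformer paper (illustrative sizes): Query, Key and Value weight matrices project each word, every query is scored against every key, and the softmaxed scores weight the value vectors.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 8
words = torch.randn(5, d_model)      # "The bank of the river" as 5 word encodings

W_q = nn.Linear(d_model, d_model, bias=False)   # learned Query weights
W_k = nn.Linear(d_model, d_model, bias=False)   # learned Key weights
W_v = nn.Linear(d_model, d_model, bias=False)   # learned Value weights

Q, K, V = W_q(words), W_k(words), W_v(words)

scores = Q @ K.T / math.sqrt(d_model)   # each query scored against every key
weights = F.softmax(scores, dim=-1)     # the attention weights W1..Wn per word
contextualised = weights @ V            # weighted sum of the value vectors
```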
Self Attention Summary
● Self attention allows us to transform text into a
contextually sensitive representation
– All words transformed in parallel
– Influenced by all other words in the text
● Can be used for more than just sequence to
sequence tasks.
Summary
Credit where credit is due
● Aside from the referenced papers, the following
sources were invaluable:
– Jay Alammar’s Visual Transformer Posts
● http://jalammar.github.io
– RASA Whiteboard Series on Youtube