
The Journey from LSTM to BERT

All slides are my own. Citations provided for borrowed images.

Kolluru Sai Keshav
PhD Scholar
Concepts
● Self-Attention
○ Pooling
○ Attention (Seq2Seq, Image Captioning)
○ Structured Self-Attention in LSTMs
○ Transformers
● LM-based pretraining
○ ELMo
○ ULMFiT
○ GPT
● GLUE Benchmark
● BERT
● Extensions: RoBERTa, ERNIE
(Vaibhav: similar)
Word2Vec to MLM

● Converts words to vectors such that similar words are located near each other in the vector space
● Made possible using the CBOW (Continuous Bag of Words) objective
● Words in the context are used to predict the middle word
● Words with similar contexts are embedded close to each other

“A word is known by the company it keeps”

Reference: https://www.kdnuggets.com/2018/04/implementing-deep-learning-methods-feature-engineering-text-data-cbow.html
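To make the CBOW objective concrete, here is a minimal PyTorch sketch (not the original word2vec implementation; the vocabulary size, embedding dimension and window size are illustrative assumptions, and tricks such as negative sampling are omitted):

```python
# Minimal CBOW sketch: average the context-word vectors, predict the middle word.
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim=100):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)  # the word vectors we care about
        self.out = nn.Linear(embed_dim, vocab_size)            # predicts the middle word

    def forward(self, context_ids):
        # context_ids: (batch, context_size) indices of the surrounding words
        ctx = self.embeddings(context_ids).mean(dim=1)  # average the context vectors
        return self.out(ctx)                            # logits over the vocabulary

# Toy usage: predict the middle word from a window of 4 context words.
model = CBOW(vocab_size=5000)
context = torch.randint(0, 5000, (8, 4))   # batch of 8 context windows
target = torch.randint(0, 5000, (8,))      # the middle words
loss = nn.CrossEntropyLoss()(model(context), target)
loss.backward()
```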
Contextualized Word Representations (ELMo)
● Bidirectional language modelling using separate forward and backward LSTMs
● Issue: the forward and backward LSTMs are trained independently and are not coupled with one another

Reference: https://nlp.stanford.edu/seminar/details/jdevlin.pdf
Universal Language Model Fine-tuning for Text Classification (ULMFiT)

Diagram: LSTM Model → PRE-TRAIN on LM task → Trained Model → FINE-TUNE on End-Task → End Model

● Introduced the Pretrain-Finetune paradigm for NLP
● Similar to pretraining ResNet on ImageNet and finetuning on specific tasks
● Uses the same architecture for both pretraining and finetuning
● ELMo, in contrast, is added as an additional component to existing task-specific architectures
● Pretrained using the Language Modelling task
● Finetuned on the End-Task (such as Sentiment Analysis)
Generative Pre-training
● GPT: uses a Transformer decoder instead of an LSTM for language modelling
● GPT-2: trained on a larger corpus of text (40 GB); model size: 1.5B parameters
● Can generate text given an initial prompt, e.g. the “unicorn” story, the economist interview

Unicorn Story (generated text sample)
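As a rough illustration of prompt-based generation, the sketch below uses the HuggingFace transformers library with the public "gpt2" checkpoint; the prompt and sampling hyper-parameters are assumptions chosen only for demonstration:

```python
# Prompt-based text generation with GPT-2 via HuggingFace transformers.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "In a shocking finding, scientists discovered a herd of unicorns"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_length=80,     # total length including the prompt
        do_sample=True,    # sample instead of greedy decoding
        top_k=40,          # illustrative sampling hyper-parameters
        temperature=0.9,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```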
Concepts
● Self-Attention
○ Pooling
○ Attention (Seq2Seq, Image Captioning)
○ Structured Self-Attention in LSTMs
○ Transformers
● LM-based pretraining
○ ELMo
○ ULMFiT
○ GPT
● GLUE Benchmark
● BERT
● Extensions: RoBERTa, ERNIE
BERT: Masked Language Modelling
● GPT-2 is unidirectional. For tasks like classification, we already know all the words, so using a unidirectional model is sub-optimal
● But the standard language modelling objective is inherently unidirectional
● BERT instead masks a fraction of the input tokens and predicts them using both left and right context
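A quick way to see the masked-LM objective in action is the HuggingFace fill-mask pipeline; the checkpoint name and the example sentence below are assumptions for illustration:

```python
# Masked-token prediction with a pretrained BERT via the fill-mask pipeline.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT sees the full sentence on both sides of [MASK] and predicts the gap.
for pred in unmasker("The doctor told me to take the [MASK] twice a day."):
    print(f"{pred['token_str']:>12}  score={pred['score']:.3f}")
```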
BERT vs. OpenAI-GPT vs. ELMo

● BERT: Bidirectional
● OpenAI-GPT: Unidirectional
● ELMo: De-coupled bidirectionality (independently trained forward and backward LMs)
Input Representation
WordPiece tokenizer (Atishya, Siddhant: UNK tokens)

● Middle ground between character-level and word-level representations
● tweeting → tweet + ##ing
● xanax → xa + ##nax
● Technique originally taken from a paper on Japanese and Korean voice search from a speech conference
● Given a training corpus and a number of desired tokens D, the optimization problem is to select D wordpieces such that the resulting corpus is minimal in the number of wordpieces when segmented according to the chosen wordpiece model

Schuster, Mike, and Kaisuke Nakajima. "Japanese and Korean Voice Search." 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012.
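The splits above can be reproduced with the HuggingFace BERT tokenizer; the exact sub-word splits depend on the checkpoint's learned vocabulary, so the outputs shown in the comments are indicative rather than guaranteed:

```python
# WordPiece in practice: rare words are split into sub-word units,
# and word-internal pieces are marked with "##".
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("tweeting"))   # e.g. ['tweet', '##ing']
print(tokenizer.tokenize("xanax"))      # e.g. ['xa', '##nax']

# [UNK] appears only when no wordpiece covers a character sequence,
# which is far rarer than with a fixed word-level vocabulary.
print(tokenizer.tokenize("a completely unbreakable vocabulary"))
```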
Misc Details
● Uses an activation function called GeLU: a smooth, continuous version of ReLU
○ Multiplies the input with a stochastic zero-one map (in expectation)
● Optimizer: a variant of the Adam optimizer where the learning rate first increases (warm-up phase) and is then decayed

*Image Credits: [3]
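A small sketch of both details, assuming the exact (erf-based) form of GeLU and a simple linear warm-up / linear decay schedule with illustrative hyper-parameters (BERT's released code uses its own schedule and constants):

```python
# GeLU activation and a linear warm-up / linear decay learning-rate schedule.
import math
import torch

def gelu(x):
    # Exact GeLU: x * Phi(x), i.e. the input scaled by the probability that a
    # standard normal falls below it -- a smooth, "stochastic in expectation" gate.
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))

def lr_schedule(step, warmup_steps=10_000, total_steps=1_000_000, peak_lr=1e-4):
    # Learning rate rises linearly during warm-up, then decays linearly to zero.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(gelu(torch.tensor([-1.0, 0.0, 1.0])))
print(lr_schedule(5_000), lr_schedule(500_000))
```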


Practical Tips
● Proper modelling of the input for BERT is extremely important (see the sketch after this list)
○ Question Answering: [CLS] Query [SEP] Passage [SEP]
○ Natural Language Inference: [CLS] Sent1 [SEP] Sent2 [SEP]
○ BERT cannot be used as a general-purpose sentence embedder
● Maximum input length is limited to 512 tokens; truncation strategies have to be adopted
● The BERT-Large model requires random restarts to work
● Always pre-train on a related task; it will improve accuracy
● Highly optimized for TPUs, not so much for GPUs (Atishya: TPUs vs. GPUs)
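A minimal sketch of how the sentence-pair format is produced with the HuggingFace tokenizer (checkpoint name and example sentences are assumptions); the special tokens and segment ids come out of the tokenizer itself:

```python
# Producing the [CLS] ... [SEP] ... [SEP] input format for a sentence pair.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Natural Language Inference style: [CLS] Sent1 [SEP] Sent2 [SEP]
enc = tokenizer(
    "A man is playing a guitar.",   # Sent1 (or the Query for QA)
    "Someone is making music.",     # Sent2 (or the Passage for QA)
    truncation=True,                # respect the 512-token limit
    max_length=512,
)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(enc["token_type_ids"])  # 0s for the first segment, 1s for the second
```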
Small Hyperparameter Search
● Because we are using a pre-trained model, we can’t really change the model architecture any more
● The number of hyper-parameters is actually small (a simple grid-search sketch follows):
○ Batch Size: 16, 32
○ Learning Rate: 3e-6, 1e-5, 3e-5, 5e-5
○ Number of epochs to run
● Compare to LSTMs, where we need to decide the number of layers, the optimizer, the hidden size, the embedding size, etc.
● This greatly simplifies using the model
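A tiny grid-search sketch over exactly these hyper-parameters; `train_and_evaluate` is a hypothetical stand-in for your own fine-tuning and validation loop, not a library function:

```python
# Grid search over the few hyper-parameters that matter for BERT fine-tuning.
import itertools

batch_sizes = [16, 32]
learning_rates = [3e-6, 1e-5, 3e-5, 5e-5]
epoch_counts = [2, 3, 4]

best = None
for bs, lr, epochs in itertools.product(batch_sizes, learning_rates, epoch_counts):
    # Hypothetical helper: fine-tune with these settings and return dev accuracy.
    dev_accuracy = train_and_evaluate(batch_size=bs, lr=lr, epochs=epochs)
    if best is None or dev_accuracy > best[0]:
        best = (dev_accuracy, {"batch_size": bs, "lr": lr, "epochs": epochs})

print("Best config:", best)
```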
Implementation for Fine-tuning
● Using BERT requires 3 modules (a minimal sketch follows)
○ Tokenization, Model and Optimizer
● Originally developed in TensorFlow
● HuggingFace ported it to PyTorch, which to date remains the most popular way of using BERT (18K stars)
● TensorFlow 2.0 also has a very compact way of using it, via TensorFlow Hub
○ But fewer people use it, so support is limited
● My choice: use the HuggingFace BERT API with PyTorch Lightning
○ Lightning provides a Keras-like API for PyTorch
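A minimal sketch of the three modules with the HuggingFace transformers API; the checkpoint, the sequence-classification head, the toy example and the learning rate are assumptions for illustration:

```python
# Tokenizer, model and optimizer: one illustrative fine-tuning step.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# Toy sentiment example: encode, forward, backprop, update.
batch = tokenizer(["the movie was great"], padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)  # returns loss and logits
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```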
Concepts
● Self-Attention
○ Pooling
○ Attention (Seq2Seq, Image Captioning)
○ Structured Self-Attention in LSTMs
○ Transformers
● LM-based pretraining
○ ELMo
○ ULMFiT
○ GPT
● GLUE Benchmark
● BERT
● Extensions: RoBERTa, ERNIE
Evaluating Progress: GLUE Benchmark

DecaNLP: a forgotten benchmark
● Spans 10 tasks
○ Question Answering (SQuAD)
○ Summarization (CNN/DM)
○ Natural Language Inference (MNLI)
○ Semantic Parsing (WikiSQL)
○ …
● Interesting choice of tasks, but did not pick up steam
● Model designers had to manually communicate their results
● GLUE had an automatic evaluation system
Surprising effectiveness of BERT
BERT as Feature Extractor
Ablation Study
Self-Supervised Learning
Concepts
● Self-Attention
○ Pooling
○ Attention (Seq2Seq, Image Captioning)
○ Structured Self-Attention in LSTMs
○ Transformers
● LM-based pretraining
○ ELMo
○ ULMFiT
○ GPT
● GLUE Benchmark
● BERT
● Extensions: RoBERTa, ERNIE
RoBERTa: A Robustly Optimized BERT Pretraining Approach

ERNIE: A Continual Pre-Training Framework for Language Understanding

Pre-Training Tasks in ERNIE
(Snapshot taken on 24th December, 2019)
Review of Reviews
● (Sankalan, Vaibhav) Using image as input: VL-BERT
● (Sankalan) Using KB facts as input (KB-QA): Retrieval+Concatenation
● Using BERT as a KB: E-BERT
● (Atishya) Inter-dependencies between masked tokens: XLNet
● (Rajas) Freeze layers while fine-tuning: Adapter-BERT
○ 0.4% accuracy drop adding only 3.6% parameters
● (Rajas) Pre-training over multiple tasks: ERNIE (with a curriculum)
● (Shubham) Fine-tuning over multiple tasks: MT-DNN, SMART
Review of Reviews
● (Pratyush) Masking using NER: ERNIE
● (Jigyasa) Model Compression: DistilBERT, MobileBERT
○ Reduces the size of BERT by 40% and speeds up inference by 60% while achieving 99% of the results
● (Saransh) Using BERT for VQA: LXMERT
● (Siddhant) Analyzing BERT: BERTology
○ Though post-facto and not axiomatic
● (Soumya) Issue with breaking negative affixes: Whole-word masking
● (Vipul) Pre-training on supervised tasks: Universal Sentence Repr.
● (Lovish) Introducing language embeddings: mBART, T5 (task-embedding)
● (Pavan) Text-Generation tasks: GPT-2, T5, BART
