
The Journey from LSTM to BERT

All slides are my own. Citations provided for borrowed images.

Kolluru Sai Keshav
PhD Scholar
Concepts
● Self-Attention
○ Pooling
○ Attention (Seq2Seq, Image Captioning)
○ Structured Self-Attention in LSTMs
○ Transformers
● LM-based pretraining
○ ELMo
○ ULMFiT
○ GPT
● GLUE Benchmark
● BERT
● Extensions: RoBERTa, ERNIE
(Vaibhav: similar)
Word2Vec to MLM

● Converts words to vectors such that similar words are located near each other in the vector space
● Made possible using the CBOW (Continuous Bag of Words) objective
● Words in the context are used to predict the middle word
● Words with similar contexts are embedded close to each other

“A word is known by the company it keeps”

Reference: https://www.kdnuggets.com/2018/04/implementing-deep-learning-methods-feature-engineering-text-data-cbow.html
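To make the CBOW objective concrete, here is a minimal PyTorch sketch (not the original word2vec implementation; the vocabulary size, embedding dimension and window size are illustrative assumptions, and tricks such as negative sampling are omitted):

```python
# Minimal CBOW sketch: average the context-word vectors, predict the middle word.
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim=100):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)  # the word vectors we care about
        self.out = nn.Linear(embed_dim, vocab_size)            # predicts the middle word

    def forward(self, context_ids):
        # context_ids: (batch, context_size) indices of the surrounding words
        ctx = self.embeddings(context_ids).mean(dim=1)  # average the context vectors
        return self.out(ctx)                            # logits over the vocabulary

# Toy usage: predict the middle word from a window of 4 context words.
model = CBOW(vocab_size=5000)
context = torch.randint(0, 5000, (8, 4))   # batch of 8 context windows
target = torch.randint(0, 5000, (8,))      # the middle words
loss = nn.CrossEntropyLoss()(model(context), target)
loss.backward()
```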
Contextualized Word Representations (ELMo)
● Bidirectional language modelling using separate forward and backward LSTMs
● Issue: the forward and backward LSTMs are trained independently and are not coupled with one another

Reference: https://nlp.stanford.edu/seminar/details/jdevlin.pdf
Universal Language Model Fine-tuning for Text Classification (ULMFiT)

Diagram: LSTM Model → PRE-TRAIN on LM task → Trained Model → FINE-TUNE on End-Task → End Model

● Introduced the Pretrain-Finetune paradigm for NLP
● Similar to pretraining ResNet on ImageNet and finetuning on specific tasks
● Uses the same architecture for both pretraining and finetuning
● ELMo, in contrast, is added as an additional component to existing task-specific architectures
● Pretrained using the Language Modelling task
● Finetuned on the End-Task (such as Sentiment Analysis)
Generative Pre-training
● GPT: uses a Transformer decoder instead of an LSTM for language modelling
● GPT-2: trained on a larger corpus of text (40 GB); model size: 1.5B parameters
● Can generate text given an initial prompt, e.g. the “unicorn” story, the economist interview

Unicorn Story (generated text sample)
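As a rough illustration of prompt-based generation, the sketch below uses the HuggingFace transformers library with the public "gpt2" checkpoint; the prompt and sampling hyper-parameters are assumptions chosen only for demonstration:

```python
# Prompt-based text generation with GPT-2 via HuggingFace transformers.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "In a shocking finding, scientists discovered a herd of unicorns"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_length=80,     # total length including the prompt
        do_sample=True,    # sample instead of greedy decoding
        top_k=40,          # illustrative sampling hyper-parameters
        temperature=0.9,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```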
Concepts
● Self-Attention
○ Pooling
○ Attention (Seq2Seq, Image Captioning)
○ Structured Self-Attention in LSTMs
○ Transformers
● LM-based pretraining
○ ELMo
○ ULMFiT
○ GPT
● GLUE Benchmark
● BERT
● Extensions: RoBERTa, ERNIE
BERT: Masked Language Modelling
● GPT-2 is unidirectional. For tasks like classification, we already know all the words, so using a unidirectional model is sub-optimal
● But the standard language modelling objective is inherently unidirectional
● BERT instead masks a fraction of the input tokens and predicts them using both left and right context
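A quick way to see the masked-LM objective in action is the HuggingFace fill-mask pipeline; the checkpoint name and the example sentence below are assumptions for illustration:

```python
# Masked-token prediction with a pretrained BERT via the fill-mask pipeline.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT sees the full sentence on both sides of [MASK] and predicts the gap.
for pred in unmasker("The doctor told me to take the [MASK] twice a day."):
    print(f"{pred['token_str']:>12}  score={pred['score']:.3f}")
```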
BERT vs. OpenAI-GPT vs. ELMo

● BERT: Bidirectional
● OpenAI-GPT: Unidirectional
● ELMo: De-coupled bidirectionality (independently trained forward and backward LMs)
Input Representation
WordPiece tokenizer (Atishya, Siddhant: UNK tokens)

● Middle ground between character-level and word-level representations
● tweeting → tweet + ##ing
● xanax → xa + ##nax
● Technique originally taken from a paper on Japanese and Korean voice search from a speech conference
● Given a training corpus and a number of desired tokens D, the optimization problem is to select D wordpieces such that the resulting corpus is minimal in the number of wordpieces when segmented according to the chosen wordpiece model

Schuster, Mike, and Kaisuke Nakajima. "Japanese and Korean Voice Search." 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012.
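The splits above can be reproduced with the HuggingFace BERT tokenizer; the exact sub-word splits depend on the checkpoint's learned vocabulary, so the outputs shown in the comments are indicative rather than guaranteed:

```python
# WordPiece in practice: rare words are split into sub-word units,
# and word-internal pieces are marked with "##".
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("tweeting"))   # e.g. ['tweet', '##ing']
print(tokenizer.tokenize("xanax"))      # e.g. ['xa', '##nax']

# [UNK] appears only when no wordpiece covers a character sequence,
# which is far rarer than with a fixed word-level vocabulary.
print(tokenizer.tokenize("a completely unbreakable vocabulary"))
```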
Misc Details
● Uses an activation function called GeLU: a smooth, continuous version of ReLU
○ Multiplies the input with a stochastic zero-one map (in expectation)
● Optimizer: a variant of the Adam optimizer where the learning rate first increases (warm-up phase) and is then decayed

*Image Credits: [3]
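A small sketch of both details, assuming the exact (erf-based) form of GeLU and a simple linear warm-up / linear decay schedule with illustrative hyper-parameters (BERT's released code uses its own schedule and constants):

```python
# GeLU activation and a linear warm-up / linear decay learning-rate schedule.
import math
import torch

def gelu(x):
    # Exact GeLU: x * Phi(x), i.e. the input scaled by the probability that a
    # standard normal falls below it -- a smooth, "stochastic in expectation" gate.
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))

def lr_schedule(step, warmup_steps=10_000, total_steps=1_000_000, peak_lr=1e-4):
    # Learning rate rises linearly during warm-up, then decays linearly to zero.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(gelu(torch.tensor([-1.0, 0.0, 1.0])))
print(lr_schedule(5_000), lr_schedule(500_000))
```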


Practical Tips
● Proper modelling of the input for BERT is extremely important (see the sketch after this list)
○ Question Answering: [CLS] Query [SEP] Passage [SEP]
○ Natural Language Inference: [CLS] Sent1 [SEP] Sent2 [SEP]
○ BERT cannot be used as a general-purpose sentence embedder
● Maximum input length is limited to 512 tokens; truncation strategies have to be adopted
● The BERT-Large model requires random restarts to work
● Always pre-train on a related task; it will improve accuracy
● Highly optimized for TPUs, not so much for GPUs (Atishya: TPUs vs. GPUs)
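A minimal sketch of how the sentence-pair format is produced with the HuggingFace tokenizer (checkpoint name and example sentences are assumptions); the special tokens and segment ids come out of the tokenizer itself:

```python
# Producing the [CLS] ... [SEP] ... [SEP] input format for a sentence pair.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Natural Language Inference style: [CLS] Sent1 [SEP] Sent2 [SEP]
enc = tokenizer(
    "A man is playing a guitar.",   # Sent1 (or the Query for QA)
    "Someone is making music.",     # Sent2 (or the Passage for QA)
    truncation=True,                # respect the 512-token limit
    max_length=512,
)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(enc["token_type_ids"])  # 0s for the first segment, 1s for the second
```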
Small Hyperparameter Search
● Because we are using a pre-trained model, we can’t really change the model architecture any more
● The number of hyper-parameters is actually small (a simple grid-search sketch follows):
○ Batch Size: 16, 32
○ Learning Rate: 3e-6, 1e-5, 3e-5, 5e-5
○ Number of epochs to run
● Compare to LSTMs, where we need to decide the number of layers, the optimizer, the hidden size, the embedding size, etc.
● This greatly simplifies using the model
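A tiny grid-search sketch over exactly these hyper-parameters; `train_and_evaluate` is a hypothetical stand-in for your own fine-tuning and validation loop, not a library function:

```python
# Grid search over the few hyper-parameters that matter for BERT fine-tuning.
import itertools

batch_sizes = [16, 32]
learning_rates = [3e-6, 1e-5, 3e-5, 5e-5]
epoch_counts = [2, 3, 4]

best = None
for bs, lr, epochs in itertools.product(batch_sizes, learning_rates, epoch_counts):
    # Hypothetical helper: fine-tune with these settings and return dev accuracy.
    dev_accuracy = train_and_evaluate(batch_size=bs, lr=lr, epochs=epochs)
    if best is None or dev_accuracy > best[0]:
        best = (dev_accuracy, {"batch_size": bs, "lr": lr, "epochs": epochs})

print("Best config:", best)
```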
Implementation for Fine-tuning
● Using BERT requires 3 modules (a minimal sketch follows)
○ Tokenization, Model and Optimizer
● Originally developed in TensorFlow
● HuggingFace ported it to PyTorch, which to date remains the most popular way of using BERT (18K stars)
● TensorFlow 2.0 also has a very compact way of using it, via TensorFlow Hub
○ But fewer people use it, so support is limited
● My choice: use the HuggingFace BERT API with PyTorch Lightning
○ Lightning provides a Keras-like API for PyTorch
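A minimal sketch of the three modules with the HuggingFace transformers API; the checkpoint, the sequence-classification head, the toy example and the learning rate are assumptions for illustration:

```python
# Tokenizer, model and optimizer: one illustrative fine-tuning step.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# Toy sentiment example: encode, forward, backprop, update.
batch = tokenizer(["the movie was great"], padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)  # returns loss and logits
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```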
Concepts
● Self-Attention
○ Pooling
○ Attention (Seq2Seq, Image Captioning)
○ Structured Self-Attention in LSTMs
○ Transformers
● LM-based pretraining
○ ELMo
○ ULMFiT
○ GPT
● GLUE Benchmark
● BERT
● Extensions: RoBERTa, ERNIE
Evaluating Progress: GLUE Benchmark

DecaNLP: a forgotten benchmark
● Spans 10 tasks
○ Question Answering (SQuAD)
○ Summarization (CNN/DM)
○ Natural Language Inference (MNLI)
○ Semantic Parsing (WikiSQL)
○ …
● Interesting choice of tasks, but did not pick up steam
● Model designers had to manually communicate their results
● GLUE had an automatic evaluation system
Surprising effectiveness of BERT
BERT as Feature Extractor
Ablation Study
Self-Supervised Learning
Concepts
● Self-Attention
○ Pooling
○ Attention (Seq2Seq, Image Captioning)
○ Structured Self-Attention in LSTMs
○ Transformers
● LM-based pretraining
○ ELMo
○ ULMFiT
○ GPT
● GLUE Benchmark
● BERT
● Extensions: RoBERTa, ERNIE
RoBERTa: A Robustly Optimized BERT Pretraining Approach

ERNIE: A Continual Pre-Training Framework for Language Understanding

Pre-Training Tasks in ERNIE
(Snapshot taken on 24th December, 2019)
Review of Reviews
● (Sankalan, Vaibhav) Using image as input: VL-BERT
● (Sankalan) Using KB facts as input (KB-QA): Retrieval+Concatenation
● Using BERT as a KB: E-BERT
● (Atishya) Inter-dependencies between masked tokens: XLNet
● (Rajas) Freeze layers while fine-tuning: Adapter-BERT
○ 0.4% accuracy drop adding only 3.6% parameters
● (Rajas) Pre-training over multiple tasks: ERNIE (with a curriculum)
● (Shubham) Fine-tuning over multiple tasks: MT-DNN, SMART
Review of Reviews
● (Pratyush) Masking using NER: ERNIE
● (Jigyasa) Model Compression: DistilBERT, MobileBERT
○ Reduces the size of BERT by 40% and speeds up inference by 60% while achieving 99% of the results
● (Saransh) Using BERT for VQA: LXMERT
● (Siddhant) Analyzing BERT: BERTology
○ Though post-facto and not axiomatic
● (Soumya) Issue with breaking negative affixes: Whole-word masking
● (Vipul) Pre-training on supervised tasks: Universal Sentence Repr.
● (Lovish) Introducing language embeddings: mBART, T5 (task-embedding)
● (Pavan) Text-Generation tasks: GPT-2, T5, BART
