
Hi,

I am BERT!

Deep Learning
Transformer Networks
Dr. Teena Sharma
Outline
1. Introduction

2. Transformer architecture

3. Encoder/Decoder

4. Attention mechanism

5. Conclusion

6. References
Introduction - Background
 The RNN and LSTM neural models were designed to process language and perform tasks such as classification, summarization, translation, and sentiment detection.

 In both models, each layer receives the next input word and has access to some previous words, allowing it to use a word's left context.

 They used word embeddings, where each word was encoded as a vector of 100-300 real numbers representing its meaning (see the sketch below).

RNN: Recurrent Neural Network
• Cannot remember long-range context
• Cannot be parallelized

LSTM: Long Short-Term Memory
• Takes long to train
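As a tiny illustration of such embeddings, assuming PyTorch (the five-word vocabulary and the 300-dimensional size are example choices, not values from the slides):

```python
import torch
import torch.nn as nn

# A toy lookup table: every word in a 5-word vocabulary gets a 300-dimensional vector.
# In a trained model these vectors encode meaning; here they are randomly initialised.
vocab = {"I": 0, "swam": 1, "across": 2, "the": 3, "river": 4}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=300)

word_ids = torch.tensor([vocab[w] for w in ["I", "swam", "across", "the", "river"]])
vectors = embedding(word_ids)   # one 300-dimensional vector of real numbers per word
print(vectors.shape)            # torch.Size([5, 300])
```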
Introduction - Background
 Transformers extend this so that the network can process a word knowing the words in both its left and right context.

 This provides a more powerful context model.

 Transformers add additional features, like attention, which identifies the important words in this context.

 They break the problem into two parts:
◦ An encoder (e.g., BERT)
◦ A decoder (e.g., GPT)

Transformer:
• Uses attention
• No recurrence
• Faster to train
• Can be parallelized
Introduction - Background
Transformer architecture
 The Transformer architecture was originally designed for translation.
Transformer architecture

The task of the encoder is to map an input sequence to a sequence of continuous representations.

The decoder generates an output sequence based on the output of the encoder together with the decoder output at the previous time step.
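A minimal sketch of this encoder/decoder division of labor, assuming PyTorch's built-in Transformer layers (the dimensions and the random "embedded" inputs below are illustrative, not values from the slides):

```python
import torch
import torch.nn as nn

d_model = 512

# Encoder: maps an input sequence to a sequence of continuous representations.
enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)
encoder = nn.TransformerEncoder(enc_layer, num_layers=6)

# Decoder: generates an output sequence from the encoder output ("memory")
# together with its own output from the previous time step.
dec_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8)
decoder = nn.TransformerDecoder(dec_layer, num_layers=6)

src = torch.rand(10, 1, d_model)   # embedded source sentence, shape (seq_len, batch, d_model)
tgt = torch.rand(7, 1, d_model)    # embedded target words generated so far

memory = encoder(src)              # continuous representations of the input
out = decoder(tgt, memory)         # conditioned on memory and on previous outputs
print(memory.shape, out.shape)     # torch.Size([10, 1, 512]) torch.Size([7, 1, 512])
```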
Transformer architecture
Each of these parts can be used independently, depending on the task:

 Encoder-only models: Good for tasks that require understanding of the input, such as sentence
classification and named entity recognition. (ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa)

 Decoder-only models: Good for generative tasks such as text generation. (CTRL, GPT, GPT-2,
GPT-3, GPT-4, Transformer XL)

 Encoder-decoder models or sequence-to-sequence models: Good for generative tasks that require an input, such as translation or summarization. (BART, T5, Marian, mBART)
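As a hedged sketch of how these three families are typically exercised, assuming the Hugging Face transformers library is available (the checkpoint names are common public ones chosen for illustration, not prescribed by the slides):

```python
from transformers import pipeline

# Encoder-only: understanding tasks (sentence classification)
classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("Transformers make context modeling much easier."))

# Decoder-only: open-ended text generation
generator = pipeline("text-generation", model="gpt2")
print(generator("Attention is", max_new_tokens=20))

# Encoder-decoder: generation conditioned on an input (translation)
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("The encoder maps the input to continuous representations."))
```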
Transformer architecture
 The Transformer consists of six encoders and six decoders.

 Each encoder is very similar to the others; all encoders have the same architecture.

 The decoders share the same property.

Transformer architecture
The Transformer model runs as follows:
1. Each word forming the input sequence is transformed into an embedding vector.
2. Each embedding vector is augmented with a positional encoding vector.
3. The augmented embedding vectors are fed into the encoder block.
4. The decoder receives as input its own predicted output word at time step t–1.

Transformer architecture
5. The input to the decoder is also augmented with positional encoding, in the same manner as on the encoder side.
6. The augmented decoder input is fed into the three sublayers comprising the decoder. Masking is applied in the first sublayer to stop the decoder from attending to succeeding words.
7. The output of the decoder finally passes through a fully connected layer, followed by a softmax layer, to generate a prediction for the next word of the output sequence.
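Putting steps 1-7 together, here is a hedged end-to-end sketch in PyTorch (the vocabulary size, sequence lengths, and random token IDs are made-up assumptions for illustration):

```python
import math
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings as in 'Attention Is All You Need'."""
    pos = torch.arange(seq_len).unsqueeze(1)                 # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe.unsqueeze(1)                                   # (seq_len, 1, d_model)

embed = nn.Embedding(vocab_size, d_model)
model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)
to_vocab = nn.Linear(d_model, vocab_size)                    # step 7: fully connected layer

src_ids = torch.randint(0, vocab_size, (10, 1))              # input sentence (10 tokens)
tgt_ids = torch.randint(0, vocab_size, (7, 1))               # words predicted so far
src = embed(src_ids) + positional_encoding(10, d_model)      # steps 1-2
tgt = embed(tgt_ids) + positional_encoding(7, d_model)       # steps 4-5

# Step 6: causal mask keeps the decoder from attending to succeeding words.
tgt_mask = model.generate_square_subsequent_mask(7)

dec_out = model(src, tgt, tgt_mask=tgt_mask)                 # steps 3 and 6
probs = torch.softmax(to_vocab(dec_out), dim=-1)             # step 7: softmax over the vocabulary
print(probs.shape)                                           # torch.Size([7, 1, 10000])
```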
Transformer architecture
Attention mechanism

I swam across the river to get to the other bank

Bank == financial institution?


Bank == sloping raised land?
Attention mechanism

I drove across the river to get to the other bank

Bank == financial institution?


Bank == sloping raised land?
Attention mechanism

I swam across the river to get to the other bank

Context matters!
Attention mechanism

Can we build a mechanism that weights neighboring words to enhance the meaning of the word of interest?
Attention mechanism

I swam across the river to get to the other bank

The main purpose of the self-attention mechanism is to add contextual information to the words in a sentence.
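A minimal sketch of such a weighting mechanism as scaled dot-product self-attention, assuming PyTorch; the embeddings and projection matrices are random stand-ins, so the printed weights only illustrate the mechanics, not a trained model:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over one sentence.
    x: (seq_len, d_model) embeddings; returns contextualised vectors and the weights."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # project into queries, keys, values
    scores = q @ k.T / k.shape[-1] ** 0.5        # how much each word attends to every other word
    weights = F.softmax(scores, dim=-1)          # each row sums to 1
    return weights @ v, weights                  # weighted mix of the other words' values

torch.manual_seed(0)
d_model = 8                                      # tiny for illustration
sentence = ["I", "swam", "across", "the", "river", "to", "get", "to", "the", "other", "bank"]
x = torch.randn(len(sentence), d_model)          # stand-in embeddings (random here)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))

contextual, weights = self_attention(x, w_q, w_k, w_v)
# The last row shows how much "bank" attends to "swam", "river", etc.
for word, w in zip(sentence, weights[-1].tolist()):
    print(f"{word:>6}: {w:.2f}")
```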
Attention mechanism

https://projector.tensorflow.org/
Conclusion
Some limitations of Transformer networks:

Attention can only deal with fixed-length text strings. The text has to be split into a certain number of
segments or chunks before being fed into the system as input.

This chunking of text causes context fragmentation. For example, if a sentence is split in the middle, a significant amount of context is lost.

Transformers can require significant computational resources, particularly for large models and long
text sequences.

Ethical concerns associated with Transformers (misinformation, etc.).


References
[1] Vaswani, Ashish, et al. "Attention Is All You Need." arXiv:1706.03762, 2017.

[2] Intuition Behind Self-Attention Mechanism in Transformer Networks,
https://www.youtube.com/watch?v=g2BRIuln4uc&t=340s

[3] The Illustrated Transformer,
https://jalammar.github.io/illustrated-transformer/

[4] Neural machine translation with a Transformer and Keras,
https://www.tensorflow.org/text/tutorials/transformer
