Transformer networks
Deep Learning
Transformer Networks
Dr. Teena Sharma
Outline
1. Introduction
2. Transformer architecture
3. Encoder/Decoder
4. Attention mechanism
5. Conclusion
6. References
Introduction - Background
RNN and LSTM (Long Short-Term Memory) neural models were designed to process
language and perform tasks like classification, summarization,
translation, and sentiment detection.
• They take a long time to train.
In both models, each layer receives the next input word and has access
to some previous words, allowing it to use the word's left context.
Encoder-only models: Good for tasks that require understanding of the input, such as sentence
classification and named entity recognition. (ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa)
Decoder-only models: Good for generative tasks such as text generation. (CTRL, GPT, GPT-2,
GPT-3, GPT-4, Transformer XL)
Transformer architecture
The Transformer model runs as follows:
1. Each word forming an input sequence is
transformed into an embedding vector.
2. Each embedding vector is augmented to a
positional encoding vector.
3. The augmented embedding vectors are fed into the
encoder block.
4. The decoder receives as input its own predicted
output word from time-step t–1.
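Step 2 above (augmenting each embedding with a positional encoding) can be sketched in plain Python. This is a minimal illustration of the sinusoidal encoding from the original Transformer paper; the function names are chosen here for clarity and are not from the slides.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def add_positional_encoding(embeddings):
    """Step 2: element-wise addition of the encoding to each embedding vector."""
    seq_len, d_model = len(embeddings), len(embeddings[0])
    pe = positional_encoding(seq_len, d_model)
    return [[e + p for e, p in zip(emb, row)] for emb, row in zip(embeddings, pe)]
```

Because each position gets a distinct pattern of sines and cosines, the encoder can distinguish word order even though attention itself is order-agnostic.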
5. The input to the decoder is also augmented by
positional encoding, in the same manner as on
the encoder side.
6. The augmented decoder input is fed into the three
sublayers comprising the decoder. Masking is
applied in the first sublayer to prevent the
decoder from attending to succeeding words.
7. The output of the decoder finally passes through a
fully connected layer, followed by a softmax layer,
to generate a prediction for the next word of the
output sequence.
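The masking in step 6 can be sketched as follows: a lower-triangular (causal) mask lets position t attend only to positions up to t, and masked scores are set to negative infinity so softmax assigns them zero weight. This is a plain-Python illustration; the helper names are assumptions, not from the slides.

```python
def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def mask_scores(scores, mask):
    """Set masked (future) attention scores to -inf before softmax."""
    neg_inf = float("-inf")
    return [[s if keep else neg_inf for s, keep in zip(row, mask_row)]
            for row, mask_row in zip(scores, mask)]
```

After softmax, the -inf entries contribute exactly zero weight, which is what stops the decoder from "seeing" succeeding words during training.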
Attention mechanism
Context matters!
Word embeddings can be explored interactively at https://projector.tensorflow.org/
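The scaled dot-product attention at the heart of the Transformer, softmax(QKᵀ/√d_k)V, can be sketched in plain Python for lists of row vectors. This is a minimal didactic version (single head, no masking); the function names are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

When a query aligns strongly with one key, the output is dominated by that key's value vector, which is how context ("it" vs. the noun it refers to) gets pulled into each word's representation.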
Conclusion
Some limitations of Transformer networks:
• Attention can only deal with fixed-length text strings: the text has to be split into a certain number of
segments or chunks before being fed into the system as input.
• This chunking of text causes context fragmentation. For example, if a sentence is split in the
middle, a significant amount of context is lost.
• Transformers can require significant computational resources, particularly for large models and long
text sequences.
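The context-fragmentation problem above can be illustrated with a naive fixed-length chunker: each boundary cuts the token stream with no regard for sentence structure, so the final chunk of a sentence loses its left context. This is a sketch for illustration only.

```python
def chunk_tokens(tokens, max_len):
    """Naive fixed-length chunking; context is lost at every boundary."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
```

For example, chunking the tokens of "the cat sat on the mat" with max_len=4 puts "the mat" in a second chunk that no longer knows what sat on it.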