• Vanishing and Exploding Gradients: RNNs and LSTMs are susceptible to the vanishing and
exploding gradient problems. When gradients become too small or too large during
backpropagation, they hinder the training process, making it challenging to capture long-range
dependencies in sequences.
• Limited Parallelism: Due to their sequential nature, RNNs and LSTMs have limited parallelism. This
restricts their ability to take full advantage of modern hardware accelerators for deep learning,
which excel in parallel processing.
How Transformers fix the problems of RNNs
• Encoder: Responsible for processing input sequences, the Encoder utilizes multiple layers with
a self-attention mechanism and a feedforward neural network. This enables the model to
capture dependencies and relationships within the input data, transforming it into meaningful
representations.
• Decoder: Focused on generating output sequences, the Decoder employs layers similar to the
Encoder's. It adds masked self-attention and encoder-decoder attention, considering both the
input sequence and the part of the output generated so far. This enables the Decoder to attend
to relevant information while generating each element of the output sequence step by step.
High-level overview of the Transformers architecture
• Pre-processing Steps: Involving tasks such as tokenization and positional encoding,
pre-processing prepares input data for the Transformer. Tokenization divides the input
sequence into smaller units or tokens, which are then embedded into high-
dimensional vectors. Positional encoding injects information about the position of
each token, addressing the model’s lack of inherent sequential understanding (a sketch of
one common positional-encoding scheme follows this list).
• Post-processing Steps: Following the processing of input and generation of output
sequences by the Encoder and Decoder, post-processing steps are applied. This
includes converting internal representations back into readable formats, such as words
or numerical values. Additional post-processing steps may be task-specific, such as
applying Softmax activation for probability distribution in classification tasks or
employing decoding strategies in natural language generation.
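• A minimal NumPy sketch of one common positional-encoding scheme, the sinusoidal encoding from the original Transformer paper; treat it as an illustrative choice, since other encodings are also used in practice:

    import numpy as np

    def positional_encoding(seq_size, d_model):
        # One row per position, one column per embedding dimension.
        positions = np.arange(seq_size)[:, np.newaxis]    # (seq_size, 1)
        dims = np.arange(d_model)[np.newaxis, :]          # (1, d_model)
        angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
        angles = positions * angle_rates                  # (seq_size, d_model)
        pe = np.zeros((seq_size, d_model))
        pe[:, 0::2] = np.sin(angles[:, 0::2])             # even dimensions use sine
        pe[:, 1::2] = np.cos(angles[:, 1::2])             # odd dimensions use cosine
        return pe

    # The encoding is simply added to the token embeddings:
    # embeddings = embeddings + positional_encoding(seq_size, d_model)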
Transformers architecture variations
Encoder-only architecture
• The encoder-only architecture is primarily used for tasks where the model takes an input
sequence and produces a fixed-length representation (contextual embedding) of that
sequence.
• Applications :
• Text classification: Assigning a category label to a text.
• Named entity recognition: Identifying entities like names, dates, and locations in text.
• Sentiment analysis: Determining the sentiment (positive, negative, neutral) in a piece of
text.
• Example : Sentiment Analysis
• Input : “I loved the movie” → Positive
• Input : “The movie was terrible” → Negative
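• A minimal code sketch of this sentiment-analysis example, assuming the Hugging Face transformers package is installed; the library choice is an assumption, not part of the example above:

    from transformers import pipeline

    classifier = pipeline("sentiment-analysis")   # uses an encoder-only (BERT-family) model by default
    print(classifier("I loved the movie"))        # e.g. [{'label': 'POSITIVE', 'score': ...}]
    print(classifier("The movie was terrible"))   # e.g. [{'label': 'NEGATIVE', 'score': ...}]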
Decoder-only architecture
• The decoder-only architecture is used for tasks where the model generates an output
sequence from a fixed-length input representation.
• Applications :
• Text generation: Creating coherent and contextually relevant sentences or paragraphs.
• Language modeling: Predicting the next word in a sequence.
• Example : Text generation
• Input : “During” → “summer”
• Input : “During summer” → “vacation”
• Input : “During summer vacation” → “we”
• Input : “During summer vacation, we” → “enjoyed”
• Input : “During summer vacation, we enjoyed” → “ice”
• Input : “During summer vacation, we enjoyed ice” → “cream”
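• A minimal sketch of this step-by-step (autoregressive) generation, assuming the Hugging Face transformers package and GPT-2 weights are available; the model repeatedly predicts the next token and appends it to its own input:

    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")   # decoder-only model

    inputs = tokenizer("During summer vacation, we", return_tensors="pt")
    # Greedy decoding: generate up to 10 new tokens, one at a time.
    output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))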
Encoder-Decoder architecture
• The encoder-decoder architecture is designed for sequence-to-sequence tasks where the
model takes an input sequence, encodes it into a contextual representation, and then
generates an output sequence based on that representation.
• Applications :
• Machine translation: Translating text from one language to another.
• Text summarization: Generating concise summaries of longer texts.
• Question-answering: Generating answers to natural language questions.
• Example : English to French Translation
• Encoder Input: “The movie was terrible”
• Decoder Input: “Le” → “film”
• Decoder Input: “Le film” → “était”
• Decoder Input: “Le film était” → “horrible”
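• A minimal sketch of this translation example, assuming the Hugging Face transformers package and a publicly available encoder-decoder checkpoint such as "Helsinki-NLP/opus-mt-en-fr" (an illustrative choice):

    from transformers import pipeline

    translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
    print(translator("The movie was terrible"))
    # e.g. [{'translation_text': 'Le film était horrible'}]  (exact wording may differ)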
Transformers sub-layers
• Input Self-Attention:
• This sub-layer is a crucial component of the Transformer’s encoder. It
allows the model to weigh the importance of different words in the input
sequence relative to each other. It enables the Transformer to capture
relationships and dependencies within the input data efficiently.
• Output Masked Self-Attention:
• The output masked self-attention sub-layer is primarily associated with the
decoder in a Transformer architecture. During the generation of each
element in the output sequence, this sub-layer ensures that the model
attends only to positions preceding the current position. This masking
prevents the model from “cheating” by looking ahead in the sequence,
ensuring that predictions are made based on previously generated
elements.
• Encoder-Decoder Attention:
• The encoder-decoder attention sub-layer allows the model to pay
attention to different parts of the input sequence (encoder’s output)
while generating the output sequence. It allows the decoder to
consider the entire input sequence’s context, enabling the model to
generate accurate and contextually relevant predictions.
• Feed Forward Neural Networks:
• The feed forward neural network is an essential component,
contributing to the model’s ability to capture and process intricate
patterns within the input sequence. This network processes the
information gathered by the attention mechanisms, injecting non-
linearity and enabling the model to capture complex relationships
within the data.
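• A minimal sketch of this feed forward sub-layer, assuming PyTorch; d_model and the hidden size d_ff are illustrative values:

    import torch.nn as nn

    d_model, d_ff = 512, 2048
    feed_forward = nn.Sequential(
        nn.Linear(d_model, d_ff),   # expand each position independently
        nn.ReLU(),                  # non-linearity
        nn.Linear(d_ff, d_model),   # project back to the model dimension
    )
    # Applied position-wise to the attention output of shape (seq_size, d_model),
    # it returns a tensor of the same shape.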
Transformer components in details (Steps)
1- Tokenization
• Tokenization is the initial step where input
sequences are broken down into smaller units or
tokens. Tokens serve as the fundamental building
blocks for the subsequent processing steps in the
Transformer architecture.
• Tokenization splits text into words (or sub-words)
called tokens. Some other special tokens are also
added to the input sequence:
• <BOS> : Added at the start of the sequence and indicates the Beginning Of the Sequence.
• <EOS> : Added at the end of the sequence and indicates the End Of the Sequence.
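• A toy sketch of word-level tokenization with these special tokens (real Transformers usually rely on sub-word schemes such as BPE or WordPiece):

    def tokenize(text):
        return ["<BOS>"] + text.split() + ["<EOS>"]

    print(tokenize("I loved the movie"))
    # ['<BOS>', 'I', 'loved', 'the', 'movie', '<EOS>']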
2-Embedding
• Word embeddings are a technique in NLP that represents words as numerical vectors in a
multi-dimensional space. These vectors capture the semantic meaning of words and their
relationships, making it easier for computers to work with and understand language.
• The key idea behind word embeddings is to represent words in a way that
preserves their semantic meaning. This means that similar words are
represented by vectors that are close to each other in the vector space.
• Word embeddings are learned from large text corpora. During training,
the model looks at the context in which words appear. Words that often
appear in similar contexts are assigned vectors that are closer in the
vector space.
• Word embeddings transform a word into a vector (of size d_model), so the Input
Embedding layer transforms each input sequence (of size seq_size) into a matrix of
shape (seq_size, d_model).
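• A minimal sketch of this lookup, assuming PyTorch; the vocabulary size, d_model and seq_size are illustrative:

    import torch
    import torch.nn as nn

    vocab_size, d_model, seq_size = 10_000, 512, 6
    embedding = nn.Embedding(vocab_size, d_model)           # one learned vector per token id

    token_ids = torch.randint(0, vocab_size, (seq_size,))   # one tokenized input sequence
    matrix = embedding(token_ids)                           # shape: (seq_size, d_model)
    print(matrix.shape)                                     # torch.Size([6, 512])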
• In the attention computation, QKᵀ represents the dot product of the Query matrix (Q) and
the transpose of the Key matrix (K). It provides a raw measure of similarity between queries
and keys, and this similarity reflects how much attention should be given to each element in
the input sequence. A higher dot product implies a higher level of similarity, while a lower
dot product suggests lower relevance.
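• A minimal NumPy sketch of scaled dot-product attention, softmax(QKᵀ / √d_k)·V, where the 1/√d_k scaling keeps the raw similarities in a numerically comfortable range:

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                           # raw similarities QKᵀ, scaled
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
        return weights @ V                                        # weighted sum of the values

    Q = K = V = np.random.randn(4, 8)                    # 4 tokens, d_k = 8 (self-attention)
    print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)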
Multi-Head Attention — Scaled Dot-Product
• Each attention head has its own set of learned weights, which means it can specialize in
capturing specific patterns or dependencies in the data. Separate sections of the
Embedding can learn different aspects of the meanings of each word, as it relates to other
words in the sequence. This allows the Transformer to capture richer interpretations of the
sequence.
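• A self-contained NumPy sketch of this idea: d_model is split into h heads, attention runs independently in each head, and the results are concatenated. In a real model each head has its own learned Q/K/V projection matrices; slicing the embedding here only keeps the sketch short:

    import numpy as np

    def attention(Q, K, V):
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return (w / w.sum(axis=-1, keepdims=True)) @ V

    def multi_head_self_attention(X, h):
        d_k = X.shape[-1] // h
        heads = [attention(X[:, i*d_k:(i+1)*d_k],   # per-head slice used as Q, K and V
                           X[:, i*d_k:(i+1)*d_k],
                           X[:, i*d_k:(i+1)*d_k]) for i in range(h)]
        return np.concatenate(heads, axis=-1)       # back to shape (seq_size, d_model)

    X = np.random.randn(4, 8)                       # 4 tokens, d_model = 8
    print(multi_head_self_attention(X, h=2).shape)  # (4, 8)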
Masked self-attention
• In standard self-attention, each token in a sequence attends to all other tokens in the
sequence, including itself.
• Masked self-attention is a variant of the self-attention mechanism used in the Transformer
architecture, specifically employed on the decoder side of the model during language
generation tasks.
• The purpose of masked self-attention is to prevent attending to future
positions in the sequence during training, ensuring that each position can
only attend to its past positions. This is because the model should not have
access to information that hasn’t been generated yet.
• In practice, this is done by masking the future positions: their scores are set to −∞ before
the softmax, which forces their weights in the attention matrix to be equal to 0, as in the
sketch below.
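• A minimal NumPy sketch of this causal mask: future positions receive a score of −∞ before the softmax, so their attention weights come out as 0 after it:

    import numpy as np

    def masked_attention_weights(scores):
        seq_size = scores.shape[0]
        future = np.triu(np.ones((seq_size, seq_size)), k=1).astype(bool)  # strictly upper triangle
        scores = np.where(future, -np.inf, scores)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return weights / weights.sum(axis=-1, keepdims=True)

    scores = np.random.randn(4, 4)                 # e.g. QKᵀ / √d_k for 4 tokens
    print(np.round(masked_attention_weights(scores), 2))
    # Row i has non-zero weights only for positions 0..i (a lower-triangular matrix).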
Encoder-decoder attention
• The encoder-decoder attention is used for tasks like machine
translation, text summarization, and question-answering. It
facilitates the alignment of information between the input
(encoder) and output (decoder) sequences.
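• A minimal NumPy sketch of this cross-attention: the queries come from the decoder, while the keys and values come from the encoder's output, so every decoder position can look at the whole input sequence:

    import numpy as np

    def attention(Q, K, V):
        scores = Q @ K.T / np.sqrt(K.shape[-1])          # (target_len, source_len)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return (w / w.sum(axis=-1, keepdims=True)) @ V   # (target_len, d_model)

    encoder_output = np.random.randn(5, 8)   # 5 source tokens, d_model = 8
    decoder_states = np.random.randn(3, 8)   # 3 target positions generated so far

    context = attention(decoder_states, encoder_output, encoder_output)
    print(context.shape)                     # (3, 8): one context vector per decoder position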