
Transformers

Attention is all you need


Dr Nora EL Rahsidy
Why Transformers? Addressing the Limitations
of RNNs and LSTMs
• Before delving into the transformative aspects of Transformers, it’s
essential to understand the limitations of their predecessors —
Recurrent Neural Networks (RNNs) and Long Short-Term Memory
networks (LSTMs).
Why Transformers? Addressing the
Limitations of RNNs and LSTMs
• Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks
(LSTMs) were once the torchbearers in sequential data processing.

• These architectures, characterized by their ability to maintain a hidden state that captures information from previous time steps, served well in tasks such as time series prediction and language modeling.
Why Transformers? Addressing the
Limitations of RNNs and LSTMs
• In an RNN, the hidden state is updated at each time step, allowing the
network to maintain a form of memory.

• LSTMs, an improvement over traditional RNNs, introduced a more sophisticated gating mechanism to control the flow of information through the network, addressing the vanishing gradient problem and improving the capture of long-range dependencies.
Why Transformers? Addressing the Limitations
of RNNs and LSTMs
Despite their successes, RNNs and LSTMs come with inherent challenges that limit their scalability and efficiency:

• Vanishing and Exploding Gradients: RNNs and LSTMs are susceptible to the vanishing and
exploding gradient problems. When gradients become too small or too large during
backpropagation, they hinder the training process, making it challenging to capture long-range
dependencies in sequences.

• Limited Parallelism: Due to their sequential nature, RNNs and LSTMs have limited parallelism. This
restricts their ability to take full advantage of modern hardware accelerators for deep learning,
which excel in parallel processing.
How Transformers fix RNN and LSTM problems

• Transformers address these limitations by introducing the attention mechanism, which allows the model to focus on different parts of the input sequence simultaneously.

• This parallelization capability, coupled with the ability to capture long-range dependencies effectively, makes Transformers a significant leap forward in sequential data processing.
Why Transformers? Addressing the
Limitations of RNNs and LSTMs
• Long-Range Dependencies: Transformers use a self-attention mechanism that allows them to capture long-range dependencies in the data efficiently. This enables them to consider all positions in the input sequence when making predictions, avoiding the vanishing gradient problem of recurrence and making them more effective at understanding context in long sequences.
• Parallelism: Transformers process input data in parallel rather than sequentially. This allows them to perform computations on all elements of a sequence simultaneously, making them highly efficient, especially on modern hardware accelerators that excel at parallel processing.
Advantages of Transformers
• Transformers have revolutionized the field of Machine Learning, particularly in Natural
Language Processing and sequential data tasks. Their architecture brings several advantages
that contribute to their widespread adoption and success. Here are some key advantages of
Transformers:
• Scalability: Transformers are highly scalable. By stacking multiple Transformer layers, you can create
deep models that capture complex patterns and dependencies in the data without encountering
convergence problems.
• State-of-the-Art Performance: Transformers have achieved state-of-the-art results in numerous NLP
tasks, setting new standards for accuracy and performance in the field.
• Transfer Learning: Pre-trained Transformer models, such as BERT, GPT and others, have shown
exceptional performance in various downstream tasks. Transfer learning with Transformers allows fine-
tuning on specific tasks, reducing the need for extensive data and compute resources.
High level overview of the Transformers architecture
In its fundamental form, a Transformer comprises four main elements: an encoder, a decoder, and
preprocessing and post-processing steps. Each component plays a role in the overall functioning
of the model:

• Encoder: Responsible for processing input sequences, the Encoder utilizes multiple layers with
a self-attention mechanism and a feedforward neural network. This enables the model to
capture dependencies and relationships within the input data, transforming it into meaningful
representations.

• Decoder: The Decoder focuses on generating output sequences and employs layers similar to the Encoder’s. It attends both to the input sequence (via encoder-decoder attention) and to the part of the output generated so far (via masked self-attention). This enables the Decoder to attend to relevant information while generating each element of the output sequence step by step.
High level overview of the Transformers architecture
• Pre-processing Steps: Involving tasks such as tokenization and positional encoding,
pre-processing prepares input data for the Transformer. Tokenization divides the input
sequence into smaller units or tokens, which are then embedded into high-
dimensional vectors. Positional encoding injects information about the position of
each token, addressing the model’s lack of inherent sequential understanding.
• Post-processing Steps: Following the processing of input and generation of output
sequences by the Encoder and Decoder, post-processing steps are applied. This
includes converting internal representations back into readable formats, such as words
or numerical values. Additional post-processing steps may be task-specific, such as
applying Softmax activation for probability distribution in classification tasks or
employing decoding strategies in natural language generation.
Transformers architecture variations
Encoder-only architecture
• The encoder-only architecture is primarily used for tasks where the model takes an input
sequence and produces a fixed-length representation (contextual embedding) of that
sequence.
• Applications :
• Text classification: Assigning a category label to a text.
• Named entity recognition: Identifying entities like names, dates, and locations in text.
• Sentiment analysis: Determining the sentiment (positive, negative, neutral) in a piece of
text.
• Example : Sentiment Analysis
• Input : “I loved the movie” → Positive
• Input : “The movie was terrible” → Negative
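For illustration, a minimal Python sketch of this sentiment-analysis example, assuming the Hugging Face transformers library is installed; the DistilBERT checkpoint named below is one common encoder-only choice, not one prescribed by these slides:

from transformers import pipeline

# Encoder-only model fine-tuned for sentiment analysis (illustrative checkpoint).
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("I loved the movie"))       # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
print(classifier("The movie was terrible"))  # e.g. [{'label': 'NEGATIVE', 'score': 0.99...}]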
Decoder-only architecture
• The decoder-only architecture is used for tasks where the model generates an output
sequence from a fixed-length input representation.
• Applications :
• Text generation: Creating coherent and contextually relevant sentences or paragraphs.
• Language modeling: Predicting the next word in a sequence.
• Example : Text generation
• Input : “During” → “summer”
• Input : “During summer” → “vacation”
• Input : “During summer vacation” → “we”
• Input : “During summer vacation, we” → “enjoyed”
• Input : “During summer vacation, we enjoyed” → “ice”
• Input : “During summer vacation, we enjoyed ice” → “cream”
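For illustration, a minimal sketch of this next-word generation loop with a decoder-only model, assuming the Hugging Face transformers library is installed; GPT-2 is used purely as an example checkpoint:

from transformers import pipeline

# Decoder-only model: generates the continuation token by token.
generator = pipeline("text-generation", model="gpt2")

result = generator("During summer vacation, we enjoyed", max_new_tokens=5)
print(result[0]["generated_text"])  # the prompt extended by a few generated tokens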
Encoder-Decoder architecture
• The encoder-decoder architecture is designed for sequence-to-sequence tasks where the
model takes an input sequence, encodes it into a contextual representation, and then
generates an output sequence based on that representation.
• Applications :
• Machine translation: Translating text from one language to another.
• Text summarization: Generating concise summaries of longer texts.
• Question-answering: Generating answers to natural language questions.
• Example : English to French Translation
• Encoder Input: “The movie was terrible”
• Decoder Input: “Le” → “film”
• Decoder Input: “Le film” → “était”
• Decoder Input: “Le film était” → “horrible”
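For illustration, a minimal sketch of this translation example with an encoder-decoder model, assuming the Hugging Face transformers library is installed; t5-small is one example checkpoint, and its exact output wording may differ from the slide:

from transformers import pipeline

# Encoder-decoder model: encodes the English sentence, then decodes French token by token.
translator = pipeline("translation_en_to_fr", model="t5-small")

print(translator("The movie was terrible"))
# e.g. [{'translation_text': 'Le film était ...'}]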
Transformers sub-layers

• Input Self-Attention:
• This sub-layer is a crucial component of the Transformer’s encoder. It
allows the model to weigh the importance of different words in the input
sequence concerning each other. It enables the Transformer to capture
relationships and dependencies within the input data efficiently.
• Output Masked Self-Attention:
• The output masked self-attention sub-layer is primarily associated with the
decoder in a Transformer architecture. During the generation of each
element in the output sequence, this sub-layer ensures that the model
attends only to positions preceding the current position. This masking
prevents the model from “cheating” by looking ahead in the sequence,
ensuring that predictions are made based on previously generated
elements.
Transformers sub-layers

• Encoder-Decoder Attention:
• The encoder-decoder attention sub-layer allows the model to pay
attention to different parts of the input sequence (encoder’s output)
while generating the output sequence. It allows the decoder to
consider the entire input sequence’s context, enabling the model to
generate accurate and contextually relevant predictions.
• Feed Forward Neural Networks:
• The feed forward neural network is an essential component,
contributing to the model’s ability to capture and process intricate
patterns within the input sequence. This network processes the
information gathered by the attention mechanisms, injecting non-
linearity and enabling the model to capture complex relationships
within the data.
Transformer components in
details (Steps)
Transformer components in details
1- Tokenization
• Tokenization is the initial step, in which input sequences are broken down into smaller units, or tokens. Tokens serve as the fundamental building blocks for the subsequent processing steps in the Transformer architecture.
• Tokenization splits text into words (or sub-words)
called tokens. Some other special tokens are also
added to the input sequence:
• <BOS> : Added at the start of the sequence; indicates the Beginning Of the Sequence.
• <EOS> : Added at the end of the sequence; indicates the End Of the Sequence.
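For illustration, a minimal word-level tokenization sketch in Python; real Transformer models usually rely on sub-word tokenizers such as BPE or WordPiece, so this only mirrors the <BOS>/<EOS> idea described above:

def tokenize(text: str) -> list[str]:
    # Split on whitespace and add the special sequence markers.
    return ["<BOS>"] + text.lower().split() + ["<EOS>"]

print(tokenize("The cat is black"))
# ['<BOS>', 'the', 'cat', 'is', 'black', '<EOS>']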
Transformer components in details
2-Embedding
• Embedding
• Word embeddings are a technique in NLP that represents words as numerical vectors in a multi-dimensional space. These vectors capture the semantic meaning of words and their relationships, making it easier for computers to work with and understand language.
Transformer components in details
2-Embedding
• The key idea behind word embeddings is to represent words in a way that
preserves their semantic meaning. This means that similar words are
represented by vectors that are close to each other in the vector space.

• Word embeddings are learned from large text corpora. During training,
the model looks at the context in which words appear. Words that often
appear in similar contexts are assigned vectors that are closer in the
vector space.
Transformer components in details
2-Embedding
• Word embeddings transform a word into a vector (of size d_model); the Input Embedding layer therefore transforms each input sequence (of size seq_size) into a matrix.

• Example: [“<BOS>”, “the”, “cat”, “is”, “black”, “<EOS>”]. For this example, we have seq_size = 6 and, for simplicity, let’s consider d_model = 8.
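For illustration, a minimal sketch of the Input Embedding layer for this example using PyTorch (assumed available); the toy vocabulary and random initialisation are placeholders, not learned values:

import torch
import torch.nn as nn

vocab = {"<BOS>": 0, "the": 1, "cat": 2, "is": 3, "black": 4, "<EOS>": 5}
d_model = 8

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)

tokens = ["<BOS>", "the", "cat", "is", "black", "<EOS>"]   # seq_size = 6
token_ids = torch.tensor([vocab[t] for t in tokens])
x = embedding(token_ids)
print(x.shape)                                              # torch.Size([6, 8])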
Transformer components in details
3-Positional Encoding
• Positional Encoding
• In an RNN, words are processed one at a time in a sequential manner.
However, this sequential processing limits parallelism and makes it
challenging to capture long-range dependencies.
• In contrast, Transformers process all words in a sequence in parallel,
allowing for more efficient computation. This is its major advantage over
the RNN architecture, but it means that the position information is lost,
and has to be added back in separately.
• Positional Encoding allows transformer models to take full advantage of
parallelism while maintaining an understanding of the order and position
of words in the sequence.
Transformer components in details
3-Positional Encoding
• Positional Encoding creates a matrix with
the same dimensions as the embeddings
matrix (seq_size, d_model).
• To populate this matrix, we use two formulas:
\( PE(pos, 2i) = \sin\left(\frac{pos}{10000^{\,2i/d_{model}}}\right) \)
\( PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{\,2i/d_{model}}}\right) \)
• Where “pos” is the position of the word in the sequence (0 .. seq_size-1) and “i” indexes the embedding-dimension pairs (0 .. d_model/2 - 1). The first (sine) formula is used to populate the even indices (2i), while the second (cosine) is used for the odd indices (2i+1).
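For illustration, a minimal NumPy sketch of these two formulas, producing a (seq_size, d_model) matrix (assumes d_model is even):

import numpy as np

def positional_encoding(seq_size: int, d_model: int) -> np.ndarray:
    pe = np.zeros((seq_size, d_model))
    pos = np.arange(seq_size)[:, None]                  # positions 0 .. seq_size-1
    two_i = np.arange(0, d_model, 2)[None, :]           # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)   # pos / 10000^(2i / d_model)
    pe[:, 0::2] = np.sin(angles)                        # even indices -> sine
    pe[:, 1::2] = np.cos(angles)                        # odd indices  -> cosine
    return pe

print(positional_encoding(seq_size=6, d_model=8).shape)  # (6, 8)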
Why use Sine and Cosine for positional
encoding
• Uniqueness and Smoothness: Sine and cosine functions are
continuous and periodic, ensuring that each position in the
sequence is represented by a unique set of values. This
uniqueness helps the model distinguish between different
positions effectively.
• Generalization: Sine and cosine functions allow the model to
generalize well to sequence lengths that were not seen during
training. The periodic nature of these functions helps capture the
relative distances between positions in a way that is consistent
and generalizable.
Why use Sine and Cosine for positional
encoding
• Multiple Frequencies: The sine and cosine terms use a range of frequencies, and sinusoids of different frequencies are orthogonal to each other. This property helps capture positional information at different scales within the sequence.
• Ease of Computation: Sine and cosine functions are computationally
efficient and easy to implement in neural networks. They can be
calculated once and reused across different positions and batches.
• Additivity: The positional encodings generated using sine and cosine
functions can be added to the input embeddings directly, allowing the
model to learn to attend to both the content of the input tokens and
their positions.
Transformer components in details
4-Embedding + Positional Encoding
• Embedding + Positional Encoding
• Word embeddings and positional encoding matrices are added
together as an input preprocessing stage, combining these two
matrices creates representations that are both semantically
meaningful and positionally aware. This means that each vector in the
input matrix encodes both the meaning of a token and its position.
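For illustration, a minimal sketch of this preprocessing step with PyTorch; the two matrices here are random stand-ins with the shapes of the running example (seq_size = 6, d_model = 8):

import torch

word_embeddings = torch.randn(6, 8)              # output of the Input Embedding layer
pos_encoding = torch.randn(6, 8)                 # output of the Positional Encoding step
encoder_input = word_embeddings + pos_encoding   # element-wise sum: semantic + positional information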
Attention mechanism
• The attention mechanism is a pivotal concept in neural network
architectures, enhancing the model’s ability to focus on specific
portions of input data. It’s widely used in various deep learning tasks,
particularly in sequence-to-sequence models like Transformers.
Self-Attention mechanism

• Self-attention is a variant of the attention mechanism that allows elements within the same sequence to attend to each other.

• It is a key component of the Transformer architecture and enables the model to capture complex relationships and dependencies within sequential data.

• It is particularly effective at modeling long-range dependencies in sequential data such as sentences.
Transformer components in details
5- Multi-Head Attention
• The Attention layer in the Transformer architecture takes its inputs in the form of three parameters (Query, Key, and Value). Each of these parameters plays a distinct role in determining how the model attends to different parts of the input sequence.
• Query (Q): The Query matrix is responsible for generating queries or
questions about the input sequence.
• Key (K): The Key matrix provides information about the positions in
the input sequence.
• Value (V): The Value matrix contains information associated with each
position in the input sequence.
The encoder input
• The encoder input matrix (embeddings + positional encoding) is used for all three parameters (Q, K, and V) of the self-attention layer in the encoder.
• Multi-Head means that the input matrix is duplicated into multiple matrices; each copy is processed independently through a separate “head”.
Example
• Input sequence : [“<BOS>”, “the”, “cat”, “is”, “black”, “<EOS>”]
d_model = 8 and nb_heads = 4 and seq_size = 6
Example
• Multi-Head Attention — Linear layers
• Linear layers are applied to Q, K, and V separately, each with its own weights.
• These weights are of size (d_model x q_size) and are learned during the training phase. The query size is a new parameter which we introduce; it is equal to: q_size = d_model / nb_heads
Steps to calculate the self attention
1. Input Representation
- Start with an input sequence represented as a matrix of embeddings. Each word in the sequence is
transformed into a vector representation (embedding).
2. Create Query, Key, and Value Vectors:
- For each input embedding, create three vectors: Query (Q), Key (K), and Value (V).
- This is done by multiplying the input embeddings by learned weight matrices:
- \( Q = XW_Q \)
- \( K = XW_K \)
- \( V = XW_V \)
- Here, \( X \) is the input matrix, and \( W_Q, W_K, W_V \) are the weight matrices for queries, keys,
and values, respectively.
3. Calculate Attention Scores:
- Compute the attention scores by taking the dot product of the query vectors with the key vectors:
- \( \text{Attention Score} = QK^T \)
- This produces a score matrix that indicates how much focus each word should have on every other
word.
Steps to calculate the self attention
4. Scale the Scores:
- Scale the scores to prevent them from being too large, which can affect the softmax
function:
- \( \text{Scaled Scores} = \frac{QK^T}{\sqrt{d_k}} \)
- Here, \( d_k \) is the dimension of the key vectors.
5. Apply Softmax:
- Apply the softmax function to the scaled scores to obtain attention weights. This normalizes
the scores to be between 0 and 1:
- \( \text{Attention Weights} = \text{softmax}(\text{Scaled Scores}) \)
6. Compute the Weighted Sum of Values:
- Multiply the attention weights by the value vectors to get the output of the self-attention
mechanism:
- \( \text{Output} = \text{Attention Weights} \times V \)
7. Concatenate and Final Linear Transformation (for Multi-Head Attention):
- If using multi-head attention, repeat the above steps for each attention head, then
concatenate the results:
- \( \text{Concat} = \text{head}_1 \oplus \text{head}_2 \oplus ... \oplus \text{head}_h \)
- Finally, pass the concatenated output through a linear layer to transform it into the desired
output dimension.
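For illustration, a minimal PyTorch sketch of the steps above, using the running example (seq_size = 6, d_model = 8, nb_heads = 4, so q_size = 2); all weight matrices are random stand-ins for the learned parameters:

import math
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # steps 3-4: scaled scores
    weights = torch.softmax(scores, dim=-1)            # step 5: attention weights
    return weights @ V                                 # step 6: weighted sum of values

seq_size, d_model, nb_heads = 6, 8, 4
q_size = d_model // nb_heads                           # = 2

X = torch.randn(seq_size, d_model)                     # embeddings + positional encoding

# Step 7: one set of projections per head, then concatenate and apply a final linear layer.
heads = []
for _ in range(nb_heads):
    W_Q, W_K, W_V = (torch.randn(d_model, q_size) for _ in range(3))
    heads.append(scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V))

W_O = torch.randn(d_model, d_model)                    # final linear transformation
output = torch.cat(heads, dim=-1) @ W_O
print(output.shape)                                    # torch.Size([6, 8])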
Multi-Head Attention — Scaled Dot-Product

• The Scaled dot-product attention calculates the similarity between each pair of query and key vectors. This similarity is often computed as the dot product of the query and key vectors.

• QKᵀ represents the dot product of the Query matrix (Q) and the
transpose of the Key matrix (K). This provides a raw measure of
similarity between the query and key and this similarity reflects how
much attention should be given to each element in the input sequence.
A higher dot product implies a higher level of similarity, while a lower
dot product suggests lower relevance.
Multi-Head Attention — Scaled Dot-Product

• dₖ is the query size (q_size) parameter that we introduced before. The division by √dₖ is called “scaling” and is introduced to prevent the dot product from becoming too large as the dimensionality of the vectors increases. It helps stabilize the gradients during the training process, making the model more robust.
• The Softmax function is used to normalize the raw scores and ensures
that the resulting scores are in the range [0, 1] and sum to 1, effectively
representing a probability distribution. This distribution reflects the
model’s certainty or confidence in assigning attention weights to
different positions in the input sequence.
Multi-Head Attention — Concat and linear
• The Scaled Dot-Product outputs a matrix of dimension (seq_size x
q_size) for each head.
• The Concat layer concatenates the matrices from each head so that the output comes back to the original dimension (seq_size x d_model).
• The linear layer applies a weight matrix of dimension (d_model x
d_model) to the output.
Why the use of multiple heads (Multi-head) ?
• The attention mechanism allows the model to focus on different parts of the input
sequence when making predictions. When multiple heads are used, it means that the
attention mechanism is applied multiple times in parallel, each with its own set of weights.

• Each attention head has its own set of learned weights, which means it can specialize in
capturing specific patterns or dependencies in the data. Separate sections of the
Embedding can learn different aspects of the meanings of each word, as it relates to other
words in the sequence. This allows the Transformer to capture richer interpretations of the
sequence.
Masked self-attention
• In standard self-attention, each token in a sequence attends to all other tokens in the sequence, including itself.
• Masked self-attention is a variant of the self-attention mechanism used in the Transformer architecture, specifically employed on the decoder side of the model during language generation tasks:
Masked self-attention
• The purpose of masked self-attention is to prevent attending to future
positions in the sequence during training, ensuring that each position can
only attend to its past positions. This is because the model should not have
access to information that hasn’t been generated yet.

• During training, positions in the sequence that correspond to future tokens are masked. The masking is typically implemented by applying a triangular mask to the attention score matrix (before the softmax), setting the upper-triangular portion to -∞.
Masked self-attention

• And since Softmax(−∞) = 0, the probability assigned to each element that had a value of −∞ in the masked score matrix is equal to 0.
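For illustration, a minimal PyTorch sketch of masked (causal) self-attention scores; the shapes and random values are stand-ins:

import math
import torch

seq_size, d_k = 5, 8
Q, K, V = (torch.randn(seq_size, d_k) for _ in range(3))

scores = Q @ K.T / math.sqrt(d_k)                               # (seq_size, seq_size) scaled scores
mask = torch.triu(torch.ones(seq_size, seq_size), diagonal=1).bool()
scores = scores.masked_fill(mask, float("-inf"))                # hide future positions

weights = torch.softmax(scores, dim=-1)                         # masked entries get weight 0
output = weights @ V
print(weights[0])                                               # first position attends only to itself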
Encoder-decoder attention
• The encoder-decoder attention is used for tasks like machine
translation, text summarization, and question-answering. It
facilitates the alignment of information between the input
(encoder) and output (decoder) sequences.

• The encoder-decoder attention enables the decoder to attend to different positions in the encoder’s output sequence.

• In encoder-decoder attention, the query Q is derived from the decoder, while the key K and value V are derived from the encoder.
Encoder-decoder attention
• The attention scores are calculated by taking the dot product
of the query from the decoder with the key from the
encoder. These scores determine how much attention the
decoder should give to each position in the encoder’s output
sequence.

• The encoder-decoder attention enables the model to selectively attend to different positions in the encoder’s output, allowing the decoder to gather pertinent information and context for accurate sequence generation. This mechanism is fundamental in tasks like machine translation, where understanding the entire input sequence is essential for generating a meaningful translation.
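For illustration, a minimal PyTorch sketch of encoder-decoder (cross) attention: Q comes from the decoder, K and V from the encoder output; inputs and weights are random stand-ins:

import math
import torch

d_model, enc_len, dec_len = 8, 6, 4
encoder_output = torch.randn(enc_len, d_model)   # output of the encoder stack
decoder_state = torch.randn(dec_len, d_model)    # output of the decoder's masked self-attention

W_Q, W_K, W_V = (torch.randn(d_model, d_model) for _ in range(3))

Q = decoder_state @ W_Q                          # queries from the decoder
K = encoder_output @ W_K                         # keys from the encoder
V = encoder_output @ W_V                         # values from the encoder

weights = torch.softmax(Q @ K.T / math.sqrt(d_model), dim=-1)   # (dec_len, enc_len)
output = weights @ V                                            # (dec_len, d_model)
print(output.shape)                                             # torch.Size([4, 8])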
Feed Forward Network
• The Feed Forward Network (FFN) is a standard neural network consisting of two fully connected layers:

• The dimensions of the input and output layers are equal to d_model.

• The dimension of the hidden layer is generally equal to (4 * d_model), followed by a ReLU activation function: ReLU(x) = max(0, x).
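For illustration, a minimal PyTorch sketch of this position-wise Feed Forward Network (hidden size 4 * d_model, ReLU in between):

import torch
import torch.nn as nn

d_model = 8

ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),   # expand to the hidden dimension
    nn.ReLU(),                         # ReLU(x) = max(0, x)
    nn.Linear(4 * d_model, d_model),   # project back to d_model
)

x = torch.randn(6, d_model)            # (seq_size, d_model) input
print(ffn(x).shape)                    # torch.Size([6, 8])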
Feed Forward Network
• In the Transformer architecture, the Feed Forward Network (FFN) is typically positioned after the multi-head self-attention layer in each encoder and decoder block.

• This placement is a crucial element of the design and contributes to the main role of neural networks, which is to introduce non-linearity into the data. The FFN processes the information obtained from the Multi-Head Attention layer and learns complex relationships within the data.
Add & Norm layers

• Add & Norm layers are placed after the attention and feed forward layers; the objective is to preserve essential information and maintain stability throughout the network. The “Add” step is a residual (skip) connection and the “Norm” step is layer normalization.
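For illustration, a minimal PyTorch sketch of one Add & Norm step (residual connection followed by layer normalisation); the tensors are random stand-ins for a sub-layer’s input and output:

import torch
import torch.nn as nn

d_model = 8
layer_norm = nn.LayerNorm(d_model)

x = torch.randn(6, d_model)              # input to the sub-layer (attention or FFN)
sublayer_out = torch.randn(6, d_model)   # output of that sub-layer

out = layer_norm(x + sublayer_out)       # "Add" (residual) then "Norm" (LayerNorm)
print(out.shape)                         # torch.Size([6, 8])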


Linear and Softmax layers
• The final linear and softmax layers are crucial for generating the output sequence: these layers take the decoder’s contextual embeddings and produce a probability distribution over the target vocabulary for each position in the output sequence.
• Linear Layer: The linear layer is applied to the decoder output matrix (seq_size x d_model) to transform each position’s embedding into a vector whose size matches the target vocabulary.
• Softmax Layer: Following the linear layer, the
softmax layer is applied to the transformed
values. The softmax function is used to convert
these values into a probability distribution over
the target vocabulary.
• Example:
• Decoder Input: “<BOS>, the, cat, is, black”
• Vocabulary: (“<BOS>”, “the”, “cat”, “is”, “black”, “<EOS>”)
• d_model = 8 and decoder_seq_size = 5 and vocab_size = 6
• “vocab_size” represents the size of the vocabulary, which is the total
number of unique words or tokens that the model can generate in
the output sequence.
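For illustration, a minimal PyTorch sketch of the final Linear + Softmax step for this example (d_model = 8, decoder_seq_size = 5, vocab_size = 6); the weights are random stand-ins for trained parameters:

import torch
import torch.nn as nn

d_model, decoder_seq_size, vocab_size = 8, 5, 6

decoder_output = torch.randn(decoder_seq_size, d_model)  # contextual embeddings from the decoder
to_vocab = nn.Linear(d_model, vocab_size)                # linear layer onto the vocabulary

logits = to_vocab(decoder_output)                        # (5, 6) scores per position
probs = torch.softmax(logits, dim=-1)                    # probability distribution over the vocabulary
print(probs.sum(dim=-1))                                 # each position's probabilities sum to 1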