The Transformer is a revolutionary neural network architecture introduced in the 2017 paper
"Attention Is All You Need." It completely abandoned the sequential recurrence (RNNs/LSTMs)
and convolution typically used for sequence-to-sequence tasks like machine translation, relying
entirely on a mechanism called Self-Attention. This parallel processing capability is what made
it vastly more efficient and scalable, leading to the development of modern large language
models (LLMs) like GPT and BERT.
🏗️ Core Architecture
The original Transformer model follows the standard Encoder-Decoder structure:
1. Encoder Stack: Processes the input sequence (e.g., an English sentence) and creates an abstract,
continuous representation of it. The original model uses a stack of six identical encoder layers.
2. Decoder Stack: Uses the encoded representation, along with the partially generated output
sequence, to predict the next token (e.g., a word in the translated French sentence). It also
consists of six identical decoder layers.
Input Processing
Before entering the stacks, the input sequence undergoes two key steps:
Embeddings: Each word or sub-word (token) is converted into a numerical vector (an
embedding) that captures its semantic meaning.
Positional Encoding: Since the Transformer processes all tokens in parallel, it loses the natural
sequential order. Positional Encoding adds a vector to each input embedding, providing the
model with information about the token's absolute or relative position in the sequence. The
original paper used sine and cosine functions for this, but later models often use learned
embeddings.
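As a concrete illustration, here is a minimal NumPy sketch of the sinusoidal positional encoding from the original paper. The function name and array shapes are my own choices for this example, not a reference implementation.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(same angle).

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2): even dimensions only
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions use cosine
    return pe

# The encoding is simply added to the token embeddings before the first layer:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```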
🧠 The Self-Attention Mechanism
Self-attention is the heart of the Transformer. For every token in the input sequence, it calculates
an attention score with every other token (including itself) to determine how much to focus on
them when computing the current token's new representation. To do this, each token's input
vector is projected into three separate (and typically lower-dimensional) vectors:
| Vector | Function | Analogy |
| --- | --- | --- |
| Query ($\mathbf{Q}$) | The current token being processed (what I'm looking for). | A question in a search engine. |
| Key ($\mathbf{K}$) | The tokens in the sequence that are being compared against the Query (what I have available). | The index/label of the documents. |
| Value ($\mathbf{V}$) | The actual information from the tokens that will be aggregated. | The content of the documents. |
The calculation for the output of the Scaled Dot-Product Attention is given by the formula:
$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) =
\text{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\text{T}}}{\sqrt{d_k}}\right)\mathbf{V}$$
1. Attention Score: A dot product is computed between the $\mathbf{Q}$ vector of the current
token and the $\mathbf{K}$ vectors of all tokens in the input. This measures the relevance or
compatibility.
2. Scaling: The scores are divided by $\sqrt{d_k}$ (the square root of the dimension of the key
vectors) to prevent the dot products from becoming too large and destabilizing the training
process.
3. Normalization: The Softmax function is applied to the scaled scores, turning them into
attention weights that sum to 1.
4. Weighted Sum: The attention weights are multiplied by the $\mathbf{V}$ vectors. Summing
these weighted values produces the final, context-aware output for the current token.
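The four steps above map almost line-for-line onto code. Below is a minimal NumPy sketch of scaled dot-product attention; the function and argument names are my own, and it is meant only to illustrate the formula, not to serve as a production implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (len_q, d_k), K: (len_k, d_k), V: (len_k, d_v);
    mask (optional): boolean (len_q, len_k), True where attention is allowed.
    """
    d_k = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)                 # steps 1-2: dot products, then scaling
    if mask is not None:
        scores = np.where(mask, scores, -1e9)         # masked positions get ~zero weight
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # step 3: weights sum to 1 per query
    return weights @ V                                # step 4: weighted sum of values
```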
Multi-Head Attention
Instead of performing the attention calculation once, Multi-Head Attention runs scaled
dot-product attention several times in parallel (8 heads in the original paper). Each "head"
uses its own independently learned linear projections ($\mathbf{W}^Q, \mathbf{W}^K,
\mathbf{W}^V$) to transform the input vectors into $\mathbf{Q}, \mathbf{K}, \mathbf{V}$.
This allows the model to:
Jointly attend to information from different representation subspaces: One head might learn
to focus on syntactic relationships, while another might focus on semantic meaning.
Better capture long-range dependencies: By having multiple perspectives on the sequence
simultaneously.
The outputs from all attention heads are concatenated and then linearly projected to produce the
final output of the Multi-Head Attention layer.
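As a rough sketch of how the heads are wired together, reusing the scaled_dot_product_attention function from above: the weight names (Wq, Wk, Wv, Wo) are placeholders, and the per-head projections are folded into single $d_{model} \times d_{model}$ matrices whose outputs are split into heads, a common (and equivalent) implementation choice rather than the paper's exact notation.

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """x: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads                     # assumes d_model divides evenly

    def split_heads(t):                               # (seq_len, d_model) -> (heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)

    # Each head attends independently over its own lower-dimensional projections.
    heads = [scaled_dot_product_attention(Q[h], K[h], V[h]) for h in range(num_heads)]

    concat = np.concatenate(heads, axis=-1)           # (seq_len, d_model): concatenate heads
    return concat @ Wo                                # final linear projection
```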
🧠 Encoder and Decoder Layers
Encoder Layer Components
Each encoder layer has two main sub-layers, each with a residual connection and Layer
Normalization (which helps stabilize training):
1. Multi-Head Self-Attention: Allows the encoder to look at all other tokens in the input sequence
to compute a better representation for the current token.
2. Feed-Forward Network (FFN): A simple, position-wise fully-connected network applied
independently and identically to each position (token) in the sequence. It consists of two linear
transformations with a ReLU activation in between: $\text{FFN}(x) = \max(0, xW_1 + b_1)W_2
+ b_2$.
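Putting the two sub-layers together, one (post-norm) encoder layer can be sketched as follows. The layer_norm here is simplified (no learned gain/bias), the FFN weights (W1, b1, W2, b2) are placeholders, and the attention sub-layer is passed in as a function for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)                   # simplified: no learned parameters

def feed_forward(x, W1, b1, W2, b2):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2     # FFN(x) = max(0, xW1 + b1)W2 + b2

def encoder_layer(x, self_attention, ffn_weights):
    """Each sub-layer is wrapped as LayerNorm(x + Sublayer(x)), as in the original paper."""
    x = layer_norm(x + self_attention(x))                  # sub-layer 1: multi-head self-attention
    x = layer_norm(x + feed_forward(x, *ffn_weights))      # sub-layer 2: position-wise FFN
    return x
```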
Decoder Layer Components
Each decoder layer has three main sub-layers, also followed by residual connections and layer
normalization:
1. Masked Multi-Head Self-Attention: This is the same as the encoder's self-attention, but with a
mask applied. The mask ensures that when predicting the next token, the decoder can only
attend to the previously generated tokens, preventing it from "cheating" by looking ahead (a
minimal mask sketch follows this list).
2. Encoder-Decoder Multi-Head Attention: The Queries ($\mathbf{Q}$) come from the
previous masked attention layer in the decoder, but the Keys ($\mathbf{K}$) and Values
($\mathbf{V}$) come from the output of the entire encoder stack. This is the mechanism that
allows the decoder to focus on relevant parts of the input sequence.
3. Feed-Forward Network (FFN): Identical to the one in the encoder.
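The mask in sub-layer 1 is just a lower-triangular boolean matrix that can be passed to the attention sketch shown earlier; a minimal version:

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Used with the earlier sketch: positions where the mask is False receive a large
# negative score before the softmax, so their attention weight is effectively zero.
# scaled_dot_product_attention(Q, K, V, mask=causal_mask(seq_len))
```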
The final output of the decoder stack is passed through a Linear Layer and a Softmax function
to generate a probability distribution over the vocabulary, which determines the predicted next
token.
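That final projection can be sketched in a few lines; W_vocab and b_vocab are hypothetical names for the output projection parameters, and only the last position is projected, as is typical when generating one token at a time.

```python
import numpy as np

def next_token_distribution(decoder_output, W_vocab, b_vocab):
    """decoder_output: (seq_len, d_model); W_vocab: (d_model, vocab_size)."""
    logits = decoder_output[-1] @ W_vocab + b_vocab   # project last position onto the vocabulary
    logits -= logits.max()                            # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()                        # probability of each token coming next
```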
🚀 Impact and Variations
The Transformer architecture, especially its self-attention mechanism, is a massive leap forward
because it:
Parallelizes Computation: Unlike Recurrent Neural Networks (RNNs), which must process
tokens one by one, the Transformer processes the entire sequence simultaneously, drastically
reducing training time.
Handles Long-Range Dependencies: The self-attention mechanism can directly link any two
tokens in a sequence, regardless of how far apart they are, avoiding the vanishing-gradient
issues that made long-range dependencies hard for RNNs to learn.
This architecture forms the basis for numerous groundbreaking models, often simplified to use
only the Encoder or only the Decoder:
Encoder-only Models (e.g., BERT): Excellent for understanding and encoding context from an
input (tasks like classification, question answering).
Decoder-only Models (e.g., GPT): Used for generative tasks (text generation, language
modeling) as they are trained to predict the next word autoregressively.
Encoder-Decoder Models (e.g., T5, BART): Used for sequence-to-sequence tasks like
translation and summarization.
Would you like a more detailed explanation of the Multi-Head Attention formula or how the
Positional Encoding works?