RoPE

The document discusses various aspects of positional encoding in Transformers, emphasizing its importance for maintaining token order and understanding relationships between words. It covers methods such as sinusoidal positional encodings, learned embeddings, and rotary positional embeddings (RoPE), along with their advantages and disadvantages. Additionally, it addresses challenges related to handling long sequences and extrapolating beyond training data.

1. Position Encoding

Intuition

Transformers process tokens in parallel, meaning they lack inherent word order awareness. Position Encoding assigns
a unique representation to each token’s position in a sequence.

Visualization

Imagine words on a timeline. Without position encoding, "I love AI" and "AI love I" look the same to a Transformer. Position
encoding helps distinguish their order.

Python Code Example


import numpy as np
import matplotlib.pyplot as plt

def positional_encoding(pos, d_model):
    pe = np.zeros((pos, d_model))
    for i in range(d_model // 2):
        pe[:, 2 * i] = np.sin(np.arange(pos) / (10000 ** (2 * i / d_model)))
        pe[:, 2 * i + 1] = np.cos(np.arange(pos) / (10000 ** (2 * i / d_model)))
    return pe

pe = positional_encoding(100, 16)  # 100 positions, 16 dimensions
plt.imshow(pe, cmap="coolwarm", aspect="auto")
plt.colorbar()
plt.title("Positional Encoding Heatmap")
plt.show()

2. Self-Attention

Intuition

Self-attention allows each token in a sequence to attend to every other token, learning relationships between words
dynamically. Instead of treating words independently, it assigns importance (weights) to different words when making
predictions.

Visualization

Think of reading a sentence:

●​ "The cat sat on the mat."


●​ "The cat ate the fish."​
In "ate the fish," "ate" should have high attention on "fish" rather than "mat." Self-attention captures these
dependencies.
Python Code Example
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]  # Embedding dimension
    scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(torch.tensor(d_k))
    attention_weights = F.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, V)
    return output, attention_weights

# Sample tensors
Q = torch.rand(3, 4) # (tokens, embedding)
K = torch.rand(3, 4)
V = torch.rand(3, 4)

output, attn_weights = scaled_dot_product_attention(Q, K, V)


print("Attention Weights:\n", attn_weights)
print("Output:\n", output)

3. Permutation Invariance

Intuition

A model is permutation-invariant if its output does not depend on the order of the input tokens.

●​ Example: If a model treats "dog bites man" and "man bites dog" the same, it lacks order awareness.
●​ Transformers use self-attention, which does not inherently respect order: permuting the input tokens simply permutes the outputs in the same way, so no token's representation depends on where it sits in the sequence until positional encoding is added.

Since the dot product and softmax do not depend on input order, transformers need positional encoding to restore order information.

Python Example
import torch

Q = torch.tensor([[1.0, 2.0], [3.0, 4.0]])


K = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
V = torch.tensor([[5.0, 6.0], [7.0, 8.0]])

def attention(Q, K, V):
    d_k = torch.tensor(Q.shape[-1], dtype=torch.float32)
    scores = torch.matmul(Q, K.T) / torch.sqrt(d_k)
    return torch.matmul(torch.softmax(scores, dim=-1), V)

print("Original Order:\n", attention(Q, K, V))


print("Swapped Order:\n", attention(Q.flip(0), K.flip(0), V.flip(0))) # Reversing order

# The outputs remain the same → Shows permutation invariance

Fixing Permutation Invariance

Adding positional encoding fixes this by making token embeddings position-aware.
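
A short check of this claim, reusing the attention function and the Q, K, V tensors above together with a small hand-made positional encoding (the PE values below are illustrative, roughly sin/cos of the position index):

import torch

# Illustrative 2-d positional encodings for positions 0 and 1
PE = torch.tensor([[0.00, 1.00],
                   [0.84, 0.54]])

print("With PE, original order:\n", attention(Q + PE, K + PE, V + PE))

# With order-dependent PE added, reversing the tokens now changes the result
print("With PE, swapped order:\n", attention(Q.flip(0) + PE, K.flip(0) + PE, V.flip(0) + PE))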

4. Token Order

Intuition

Token order refers to the sequence in which words (tokens) appear in an input text. Since Transformers process
tokens in parallel (not sequentially like RNNs), they need positional encodings to differentiate order.

Example

●​ Correct Order: "The cat sat on the mat."


●​ Jumbled Order: "Mat the sat on cat the."

A Transformer without positional encoding treats both sentences identically, failing to capture the intended meaning. Positional encoding is what lets the model learn the correct token order.

Python Code Example


import torch
import numpy as np

tokens = ["The", "cat", "sat", "on", "the", "mat"]


positions = np.arange(len(tokens))  # [0, 1, 2, 3, 4, 5]
embedding_dim = 4

# Simulate embeddings
embeddings = torch.rand(len(tokens), embedding_dim)
position_encodings = torch.sin(
    torch.tensor(positions).unsqueeze(1)
    / (10000 ** (2 * torch.arange(embedding_dim) / embedding_dim))
)

# Add position information
final_embeddings = embeddings + position_encodings

print("Final Token Embeddings:\n", final_embeddings)

This way, each token's embedding carries its position information through the rest of the network.

5. Sinusoidal Positional Encodings

Intuition
Sinusoidal positional encodings assign unique, continuous values to each token’s position using sine and cosine
functions. This allows the model to generalize to unseen sequence lengths, as these functions are periodic and smooth.

Why Use Sinusoids?

1.​ Captures Relative Positioning: The difference between two positions remains consistent, which helps in
generalization.
2.​ No Extra Parameters: Unlike learned embeddings, these are fixed mathematical functions, reducing memory
usage.
3.​ Smooth Variations: Small position shifts lead to small encoding changes, aiding generalization.

●​ Even indices use sine, odd indices use cosine.


●​ The denominator scales the position for different embedding dimensions.
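
In symbols, this is the standard sinusoidal formula (the same one implemented in the code below):

\text{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad \text{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)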

Visualization

Each position gets a unique wave-like encoding:

●​ Closer positions have similar embeddings.


●​ Distant positions have distinguishable embeddings.

Python Code Example


import numpy as np
import matplotlib.pyplot as plt

def sinusoidal_positional_encoding(seq_len, d_model):
    pe = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            pe[pos, i] = np.sin(pos / (10000 ** (i / d_model)))
            pe[pos, i + 1] = np.cos(pos / (10000 ** (i / d_model)))
    return pe

seq_len = 100
d_model = 16  # Embedding dimension
pe = sinusoidal_positional_encoding(seq_len, d_model)

plt.figure(figsize=(10, 6))
plt.imshow(pe, aspect='auto', cmap='coolwarm')
plt.colorbar()
plt.title("Sinusoidal Positional Encoding Heatmap")
plt.xlabel("Embedding Dimension")
plt.ylabel("Token Position")
plt.show()

This generates a heatmap where color variations represent encoding values.
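
A quick numeric check of the two visualization claims above, reusing the pe matrix just computed (the probe positions are arbitrary):

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for p in [11, 20, 90]:
    print(f"similarity(position 10, position {p}) = {cosine_similarity(pe[10], pe[p]):.3f}")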

6. Learned Positional Embeddings

Intuition

Unlike sinusoidal encodings, learned positional embeddings allow the model to learn position information from data
instead of using fixed functions.

How It Works

●​ Each position i in a sequence has a trainable embedding vector PE_i.
●​ These embeddings are optimized during training.
●​ The model learns the best way to represent positions for a specific task.

Advantages & Disadvantages

✅ More flexible: Learns optimal encoding for a dataset.​


✅ Better for shorter sequences: No fixed function constraints.​
❌ No extrapolation: Cannot generalize to longer sequences than seen during training.
Python Code Example (Using PyTorch)
import torch
import torch.nn as nn

seq_len = 10 # Example sequence length


d_model = 16 # Embedding dimension

# Trainable positional embeddings


pos_embeddings = nn.Embedding(seq_len, d_model)
positions = torch.arange(seq_len) # [0,1,2,...,seq_len-1]
pos_encoding = pos_embeddings(positions)

print("Learned Positional Embeddings:\n", pos_encoding)

💡 Unlike sinusoidal encodings, these embeddings are randomly initialized and updated during training.

7. Handling Long Sequences

Intuition

Transformers have a fixed input length due to memory constraints in self-attention. Handling long sequences requires
techniques to either:

1.​ Extend sequence length (e.g., RoPE, ALiBi, FlashAttention).


2.​ Reduce memory usage (e.g., Sparse Attention, Linformer).

Methods to Handle Long Sequences

●​ RoPE (Rotary Positional Embeddings): rotates token embeddings using rotation matrices. Strength: scales well, handles longer sequences. Weakness: requires modification in the attention layers.
●​ ALiBi (Attention Linear Biases): adds a linear position bias to attention scores. Strength: extrapolates well, simple. Weakness: no explicit position embeddings.
●​ Sparse Attention: limits the attention scope to nearby tokens. Strength: reduces memory usage. Weakness: may miss long-range dependencies.
●​ Linformer: projects sequences into a smaller dimension. Strength: linear complexity. Weakness: may lose information.
●​ Memory-efficient Attention: stores activations in blocks instead of all pairs. Strength: saves memory. Weakness: requires custom kernels.

Python Code Example (RoPE)


import torch

def rotary_embedding(x, pos):
    d = x.shape[-1]
    theta = torch.exp(torch.arange(0, d, 2) * (-torch.log(torch.tensor(10000.0)) / d))
    theta = pos[:, None] * theta  # Broadcast position
    cos_theta, sin_theta = torch.cos(theta), torch.sin(theta)
    x1, x2 = x[..., ::2], x[..., 1::2]  # Split channels
    return torch.cat([x1 * cos_theta - x2 * sin_theta,
                      x1 * sin_theta + x2 * cos_theta], dim=-1)

tokens = torch.rand(10, 16) # 10 tokens, 16-d embeddings


positions = torch.arange(10).float()
rotated_embeddings = rotary_embedding(tokens, positions)

print("RoPE Applied Embeddings:\n", rotated_embeddings)

8. Extrapolating Beyond Training Data

Intuition

Most models are trained on fixed-length sequences. But in real-world tasks, we may encounter longer sequences than
seen during training. Extrapolation refers to a model’s ability to handle unseen sequence lengths correctly.

Challenges

●​ Learned positional embeddings fail: They are fixed for known positions.
●​ Vanilla transformers degrade: Self-attention scales quadratically, causing memory issues for long inputs.
●​ Need for position-aware mechanisms: To extend beyond training data, we require positional encodings that
generalize.

Solutions

●​ Sinusoidal Positional Encoding: uses periodic functions to represent positions. Strength: works well for extrapolation. Weakness: hardcoded, not adaptive.
●​ RoPE (Rotary Positional Embeddings): rotates token embeddings to encode relative positions. Strength: extrapolates smoothly. Weakness: needs attention modification.
●​ ALiBi (Attention Linear Biases): applies a linear bias to attention scores. Strength: very efficient for long sequences. Weakness: no explicit position encoding.
●​ NTK Scaling (Neural Tangent Kernel Scaling): dynamically adjusts positional embeddings at inference time. Strength: adapts well to longer sequences. Weakness: computationally expensive.

Python Code Example (NTK Scaling)
import numpy as np
import torch

def ntk_scaled_positional_encoding(pos, d_model, scale_factor=2.0):
    pe = np.zeros((pos, d_model))
    for i in range(d_model // 2):
        pe[:, 2 * i] = np.sin((np.arange(pos) * scale_factor) / (10000 ** (2 * i / d_model)))
        pe[:, 2 * i + 1] = np.cos((np.arange(pos) * scale_factor) / (10000 ** (2 * i / d_model)))
    return torch.tensor(pe, dtype=torch.float32)

scaled_pe = ntk_scaled_positional_encoding(200, 16, scale_factor=1.5)


print("NTK Scaled Positional Encoding Shape:", scaled_pe.shape)

By scaling positional encodings dynamically, models can generalize to longer sequences than seen during training.

9. Encoding Positions

Intuition

Transformers process tokens in parallel, meaning they do not inherently know the order of words in a sentence. Position
encoding solves this by injecting order information into the model.

Why is Position Encoding Important?

●​ Without it, "The cat sat on the mat." and "Mat the sat on cat the." would be treated the same.
●​ Encodes where each token is located in the sequence.
●​ Helps the model understand syntax and meaning.

Types of Positional Encoding


1.​ Sinusoidal Encoding (Fixed, works well for extrapolation).
2.​ Learned Embeddings (Trainable, better for in-domain data).
3.​ RoPE (Rotary Positional Embeddings) (Captures relative positions).
4.​ ALiBi (Attention Linear Biases) (Adds bias to attention weights).

Python Code Example (Sinusoidal Encoding)


import numpy as np
import torch

def positional_encoding(seq_len, d_model):
    pe = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            pe[pos, i] = np.sin(pos / (10000 ** (i / d_model)))
            pe[pos, i + 1] = np.cos(pos / (10000 ** (i / d_model)))
    return torch.tensor(pe, dtype=torch.float32)

pe_matrix = positional_encoding(10, 16)


print("Positional Encoding Matrix:\n", pe_matrix)

This ensures tokens are order-aware while keeping Transformer flexibility.

10. Rotating Token Embeddings in Continuous Space

Intuition
Instead of adding position encodings explicitly, some methods (like Rotary Positional Embeddings (RoPE)) encode
positional information by rotating embeddings in continuous space.

●​ Key Idea: Each token embedding is rotated in a high-dimensional space based on its position.
●​ Why? This allows models to capture relative positions naturally, making them better at generalizing to long
sequences.

This results in position-aware embeddings without explicit position vectors.

Python Code Example (RoPE Embeddings)


import torch

def rotary_embedding(x, pos):
    d = x.shape[-1]  # Embedding dimension
    theta = torch.exp(torch.arange(0, d, 2) * (-torch.log(torch.tensor(10000.0)) / d))
    theta = pos[:, None] * theta  # Broadcast position values
    cos_theta, sin_theta = torch.cos(theta), torch.sin(theta)

    x1, x2 = x[..., ::2], x[..., 1::2]  # Split embedding into even and odd indices
    rotated_x = torch.cat([x1 * cos_theta - x2 * sin_theta,
                           x1 * sin_theta + x2 * cos_theta], dim=-1)
    return rotated_x

tokens = torch.rand(10, 16) # 10 tokens, 16-dimensional embeddings


positions = torch.arange(10).float()
rotated_embeddings = rotary_embedding(tokens, positions)

print("RoPE Applied Embeddings:\n", rotated_embeddings)

●​ Effect: Instead of adding positional vectors, token embeddings are rotated based on their position.
●​ Advantage: Encodes relative positions naturally, leading to better generalization.
11. Rotation Matrices

Intuition

A rotation matrix is a special transformation that preserves vector magnitudes while rotating them in a
higher-dimensional space. In positional encoding, rotation matrices help encode position information without explicitly
adding embeddings.

Why Use Rotation?

●​ Keeps token embeddings continuous and differentiable.


●​ Allows relative position encoding instead of absolute.
●​ Enables extrapolation to longer sequences beyond training data.

Mathematical Definition

A 2D rotation matrix for an angle \theta is:

R(\theta) = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix}

Applying R(\theta) to a vector [x_1, x_2]:

\begin{bmatrix} x_1' \\ x_2' \end{bmatrix} = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}

Expanding:

x_1' = x_1 \cos(\theta) - x_2 \sin(\theta)
x_2' = x_1 \sin(\theta) + x_2 \cos(\theta)

This transformation rotates the embedding by a position-dependent angle \theta, as used in RoPE (Rotary Positional Embeddings).

Python Code Example (Rotation Matrix)


import torch

def rotation_matrix(theta):
    return torch.tensor([[torch.cos(theta), -torch.sin(theta)],
                         [torch.sin(theta), torch.cos(theta)]])

# Example: Rotate a 2D vector by 30 degrees


theta = torch.tensor(30.0 * 3.1415926535 / 180) # Convert degrees to radians
R = rotation_matrix(theta)

vector = torch.tensor([1.0, 0.0]) # Example vector


rotated_vector = torch.matmul(R, vector)

print("Rotation Matrix:\n", R)
print("Original Vector:", vector.numpy())
print("Rotated Vector:", rotated_vector.numpy())

Key Takeaways:

●​ Rotation matrices preserve vector length (important for stable embeddings).


●​ They allow smooth transformations of positional encodings.
●​ RoPE uses these rotations to encode relative position information.


12. Encoding Relative Position

Intuition

Transformers naturally lack position awareness because they process tokens in parallel. Relative positional
encoding helps models capture how far apart tokens are, rather than their absolute positions.

Why Use Relative Positioning?

✅ Captures relationships between words (e.g., "cat on the mat" vs. "mat under the cat").​
✅ Better for generalization (since meaning often depends on relative order, not absolute position).​
✅ Handles long sequences more effectively (especially in RoPE, ALiBi, and T5-style encodings).
Mathematical Formulation

Instead of assigning a position vector PE_i to each token i, we define a relative position function:

\Delta(i, j) = f(i - j)

where:

●​ i and j are token positions.
●​ \Delta(i, j) represents the relative distance between the tokens.

Example implementations:

1.​ Learned Relative Embeddings (T5, Transformer-XL): Each token-pair has a learned weight based on distance.
2.​ Rotary Positional Embeddings (RoPE): Uses rotation matrices to encode relative positions implicitly.
3.​ ALiBi (Attention Linear Biases): Adds a bias term in self-attention proportional to distance.

Python Code Example (Relative Attention in RoPE)


import torch

def relative_rotary_embedding(x, pos):
    d = x.shape[-1]
    theta = torch.exp(torch.arange(0, d, 2) * (-torch.log(torch.tensor(10000.0)) / d))
    theta = pos[:, None] * theta  # Scale by position
    cos_theta, sin_theta = torch.cos(theta), torch.sin(theta)

    x1, x2 = x[..., ::2], x[..., 1::2]  # Split dimensions
    rotated_x = torch.cat([x1 * cos_theta - x2 * sin_theta,
                           x1 * sin_theta + x2 * cos_theta], dim=-1)
    return rotated_x

tokens = torch.rand(10, 16) # 10 tokens, 16-dimensional embeddings


positions = torch.arange(10).float()
relative_embeddings = relative_rotary_embedding(tokens, positions)

print("Relative Positional Embeddings (RoPE):\n", relative_embeddings)

Key Takeaways:

●​ Absolute position encoding assigns a fixed vector to each token.


●​ Relative position encoding models the relationship between tokens, which is more useful for tasks like parsing,
translation, and reasoning.
●​ RoPE naturally encodes relative positions via rotation.

13. Token Position

Intuition

A token's position refers to its place within a sequence, which is critical for understanding context. Transformers process
tokens in parallel, so they must explicitly encode this position.

Why is Token Position Important?

●​ In "The cat sat on the mat," "cat" at position 2 means something different than "cat" at position 5.
●​ Without position encoding, transformers would treat the sequence as a bag of words, losing meaning.

How is Token Position Encoded?

There are two main approaches:

1.​ Absolute Positional Encoding – Assigns a fixed embedding for each position.
○​ Example: Sinusoidal Encoding (used in vanilla Transformers).
2.​ Relative Positional Encoding – Captures distance between tokens.
○​ Example: RoPE (Rotary Positional Embeddings).

Mathematical Formulation (Absolute vs. Relative)

For absolute encoding, each token receives a position vector PE_i:

Z_i = x_i + PE_i

where:

●​ x_i is the token embedding.
●​ PE_i is the positional encoding.

For relative encoding, the attention score is adjusted using the relative position \Delta(i, j):

\text{Attention}(i, j) \propto \frac{x_i W x_j^T + b_{\Delta(i, j)}}{\sqrt{d}}

where b_{\Delta(i, j)} is a learned bias based on the relative position.


Python Code Example (Token Position Encoding in Transformers)
import torch
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    pos_enc = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            pos_enc[pos, i] = np.sin(pos / (10000 ** (i / d_model)))
            pos_enc[pos, i + 1] = np.cos(pos / (10000 ** (i / d_model)))
    return torch.tensor(pos_enc, dtype=torch.float32)

seq_len, d_model = 10, 16 # Example sequence length and embedding size


pos_enc = sinusoidal_positional_encoding(seq_len, d_model)
print("Token Positional Encodings:\n", pos_enc)


Key Takeaways:

●​ Token position is crucial for understanding sequence order.
●​ Absolute encodings assign fixed position vectors.
●​ Relative encodings capture relationships between tokens.


14. Encode Relative Position

Intuition

Instead of assigning absolute position embeddings to each token, we can encode relative positions between tokens. This is especially useful when:

●​ The model needs to generalize to longer sequences.
●​ Meaning depends on token relationships rather than exact positions (e.g., "the cat on the mat" vs. "the mat under the cat").

Why Use Relative Position Encoding?

●​ Absolute positions are fixed, so they struggle with extrapolation.


●​ Relative encoding captures the distance between words, which is more important for meaning.
●​ Used in Transformer-XL, T5, ALiBi, RoPE for better sequence modeling.

Mathematical Formulation

For an input sequence X = \{x_1, x_2, \dots, x_n\}, the attention weight between tokens x_i and x_j is modified by a relative position function r(i, j):

\text{Attention}(i, j) \propto \frac{x_i W x_j^T + r(i, j)}{\sqrt{d}}

where r(i, j) is a function of the relative position i - j, such as:

1. Learned Bias: r(i, j) = b_{i-j} (Transformer-XL).
2. Rotary Positional Embeddings (RoPE): uses rotation matrices.
3. ALiBi (Attention Linear Biases): adds a linear decay bias.

Python Code Example (Relative Position Encoding in Self-Attention)


import torch

def relative_position_encoding(seq_len):
    positions = torch.arange(seq_len).unsqueeze(1) - torch.arange(seq_len).unsqueeze(0)
    rpe_matrix = torch.exp(-torch.abs(positions) / 10.0)  # Exponential decay with position difference
    return rpe_matrix

seq_len = 10
rpe = relative_position_encoding(seq_len)
print("Relative Position Encoding Matrix:\n", rpe)

What This Code Does

●​ Computes the relative distance between each token.


●​ Applies an exponential decay so distant tokens contribute less.
●​ Can be used in self-attention layers instead of absolute positions.


Key Takeaways:

✅ Encodes relative distances, making it better for longer sequences.​
✅ Works well for reasoning tasks, translation, and long-context understanding.​
✅ Used in Transformer-XL, RoPE, ALiBi, and T5 for improved attention.

15. Generalization to Longer Sequences

Intuition

When transformers are trained on sequences of a certain length (e.g., 512 tokens), they often struggle to handle longer sequences (e.g., 2048 tokens). This is because:

●​ Absolute positional encodings are limited to a fixed range.
●​ Attention computation grows quadratically with sequence length.
●​ Learned embeddings do not naturally extend beyond training data.

How Can Models Generalize to Longer Sequences?

1.​ Relative Position Encodings (RoPE, ALiBi) – Capture token relationships rather than absolute positions.
2.​ Exponential Position Scaling (NTK Scaling) – Rescales existing embeddings to larger sequence lengths.
3.​ Sparse Attention Mechanisms (Longformer, BigBird) – Reduce computation while maintaining global context.
4.​ Efficient Memory Layers (Transformer-XL, RWKV) – Allow models to remember previous tokens efficiently.

Mathematical Insight: NTK Scaling for Longer Sequences

Neural Tangent Kernel (NTK) theory suggests that scaling positional frequencies can allow transformers to extrapolate
beyond their training length.

\theta' = \theta \cdot \frac{L_{\text{new}}}{L_{\text{train}}}

where:

●​ \theta is the original positional frequency,
●​ L_{\text{new}} is the desired sequence length,
●​ L_{\text{train}} is the training sequence length.

This helps position encodings generalize to longer contexts by stretching or compressing the positional space.

Python Code Example (NTK Scaling for Longer Sequences)


import torch

def ntk_scaled_positions(original_positions, train_len, new_len):
    scaling_factor = new_len / train_len
    return original_positions * scaling_factor

train_seq_len = 512
new_seq_len = 2048
original_positions = torch.arange(train_seq_len, dtype=torch.float32)

scaled_positions = ntk_scaled_positions(original_positions, train_seq_len, new_seq_len)


print("Scaled Position Encodings for Longer Sequences:\n", scaled_positions)

Why NTK Scaling Works

●​ It adjusts positional frequencies instead of creating new embeddings.


●​ Enables long-sequence generalization without retraining the model.
●​ Used in ChatGLM, QWen, and Llama-2 70B for long-context adaptation.


Key Takeaways:

●​ Generalization to long sequences is crucial for real-world NLP tasks.
●​ NTK scaling, relative encodings, and sparse attention help achieve this.
●​ Models like Llama, GPT-4, and Claude-Opus use these techniques for extended context windows.


16. Extrapolation to Longer Sequences

Intuition

Extrapolation means enabling a model to handle sequences longer than what it was trained on without retraining. This is challenging because:

●​ Transformers learn positional dependencies only within their training sequence length.
●​ Absolute embeddings do not naturally extend beyond their training range.
●​ Attention scores weaken for distant tokens, reducing long-range coherence.

Why Do Transformers Struggle with Long Sequences?

●​ Fixed Position Encodings: If a model is trained on sequences of length 512, it doesn’t know what position 2048
should look like.
●​ Softmax Attention Decay: Attention between distant tokens shrinks due to softmax normalization.
●​ Computation Complexity: Full attention scales as O(n²), making it expensive for long sequences.

How Can We Enable Extrapolation?

1.​ RoPE (Rotary Positional Embeddings) – Uses rotation matrices to encode relative positions continuously.
2.​ NTK Scaling (Neural Tangent Kernel Scaling) – Stretches learned embeddings for longer contexts.
3.​ Dynamic Scaling (ChatGLM, QWen) – Adapts position encodings at runtime.
4.​ Sparse Attention (Longformer, BigBird) – Reduces quadratic complexity for efficient long-sequence
processing.

Mathematical Insight: How NTK Scaling Enables Extrapolation

\theta' = \theta \cdot \frac{L_{\text{new}}}{L_{\text{train}}}

where:

●​ \theta is the learned positional encoding frequency,
●​ L_{\text{new}} is the longer sequence length we want to handle,
●​ L_{\text{train}} is the training sequence length.

This scales the frequency of positional encodings, effectively stretching the learned embeddings over a longer sequence.

Python Code Example (NTK Scaling for Long Sequences)


import torch

def extrapolate_positions(train_len, new_len, d_model):
    positions = torch.arange(train_len).float()
    scale_factor = new_len / train_len
    new_positions = positions * scale_factor

    theta = torch.exp(-torch.arange(0, d_model, 2).float() / d_model)
    scaled_theta = new_positions[:, None] * theta  # Apply NTK scaling
    return torch.cat([torch.sin(scaled_theta), torch.cos(scaled_theta)], dim=-1)

train_seq_len = 512
new_seq_len = 2048
d_model = 16 # Example embedding size

long_seq_embeddings = extrapolate_positions(train_seq_len, new_seq_len, d_model)


print("Extrapolated Positional Embeddings for Longer Sequences:\n", long_seq_embeddings)

Why This Works

✅ Expands the learned embedding space without new training data.​


✅ Encodes positions beyond the training range while preserving structure.​
✅ Used in modern LLMs (ChatGLM, Llama, GPT-4) to process long texts efficiently.

17. Learned Embeddings

Intuition

Instead of using fixed position encodings (like sinusoidal functions), transformers can learn position embeddings during
training. These embeddings are stored as trainable vectors that the model optimizes along with word embeddings.

Why Use Learned Embeddings?

✅ More flexible than sinusoidal encodings.​


✅ Model decides how to encode positions based on training data.​
✅ Used in GPT models, BERT, T5, and many transformer architectures.
Mathematical Formulation

For a sequence of tokens X = \{x_1, x_2, ..., x_n\}, we define a trainable position embedding matrix P with shape (L, d), where:

●​ L is the maximum sequence length.
●​ d is the embedding dimension.

The input embedding Z_i for a token x_i is:

Z_i = x_i + P_i

where P_i is the learned positional embedding for position i.

These embeddings are updated through gradient descent during training.

Python Code Example (Learned Positional Embeddings in PyTorch)


import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, max_seq_len, d_model):
        super().__init__()
        self.positional_embedding = nn.Embedding(max_seq_len, d_model)

    def forward(self, positions):
        return self.positional_embedding(positions)

# Example usage
max_seq_len = 512
d_model = 16
positions = torch.arange(10) # Example sequence positions (length 10)

pos_encoder = LearnedPositionalEncoding(max_seq_len, d_model)


learned_pos_embeddings = pos_encoder(positions)

print("Learned Positional Embeddings:\n", learned_pos_embeddings)

Advantages & Disadvantages

✅ More expressive – Learns directly from data.​


✅ Better for tasks like summarization & reasoning.​
❌ Does not generalize to longer sequences unless adapted (e.g., NTK scaling).
Key Takeaways:
●​ Learned embeddings adapt to training data but need scaling tricks for longer sequences.
●​ Used in BERT, GPT, T5, but models like RoPE & ALiBi use relative encodings instead.


18. Multimodal Transformers

Intuition

Traditional transformers process only text, but multimodal transformers handle multiple data types (e.g., images,
audio, video, text) in a unified architecture.

✅ Examples:
●​ CLIP: Jointly learns text and images for vision-language tasks.
●​ Flamingo: Processes images and text together.
●​ GPT-4V: Handles vision and text in a single model.
●​ Gemini, Kosmos, and Emu: Handle text, images, and sometimes audio/video.

How Do Multimodal Transformers Work?

Each modality (text, image, etc.) is converted into token-like representations, then fed into a shared transformer model.

1.​ Text is tokenized into subword embeddings (like GPT/BERT).


2.​ Images are split into patches (like Vision Transformers).
3.​ Audio is processed as spectrograms or waveform embeddings.
4.​ All modalities share attention layers for cross-modal learning.

Mathematical Formulation: Cross-Modality Attention

For a text token x_i and an image patch v_j, the attention score is computed as:

\text{Attention}(i, j) = \frac{(x_i W_Q)(v_j W_K)^T}{\sqrt{d}}

where:

●​ W_Q and W_K are projection matrices.
●​ d is the hidden dimension.

This allows the model to align and compare different modalities.

Python Code Example (Simple Multimodal Transformer)


import torch
import torch.nn as nn

class MultimodalTransformer(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.text_proj = nn.Linear(d_model, d_model)
        self.image_proj = nn.Linear(d_model, d_model)
        self.attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads)

    def forward(self, text_tokens, image_patches):
        text_embeds = self.text_proj(text_tokens)
        image_embeds = self.image_proj(image_patches)
        # Combine text & image tokens into one sequence
        multimodal_input = torch.cat([text_embeds, image_embeds], dim=0)
        output, _ = self.attention(multimodal_input, multimodal_input, multimodal_input)
        return output

# Example usage
d_model = 16
num_heads = 2
text_tokens = torch.randn(5, 1, d_model) # 5 text tokens
image_patches = torch.randn(3, 1, d_model) # 3 image patches

model = MultimodalTransformer(d_model, num_heads)


output = model(text_tokens, image_patches)

print("Multimodal Transformer Output:\n", output)

Key Takeaways:

✅ Multimodal transformers unify text, images, audio, and video.​


✅ Cross-modal attention enables interaction between modalities.​
✅ Used in models like CLIP, Flamingo, GPT-4V, Gemini, Kosmos, and Emu.

19. Scaled RoPE (Rotary Position Embedding Scaling)

Intuition

RoPE (Rotary Position Embedding) is a relative positional encoding method that encodes token positions using
rotation matrices in continuous space. Scaled RoPE extends this by scaling the rotational frequencies to handle
longer sequences than the model was originally trained on.

✅ Why Scaling Matters?


●​ Original RoPE models are trained on fixed sequence lengths (e.g., 4K tokens).
●​ To handle longer sequences (e.g., 32K tokens), we scale the RoPE frequency to maintain positional
consistency.
●​ Used in Llama-2 70B, Qwen, ChatGLM, and Falcon for long-context generalization.

Mathematical Formulation

RoPE represents each token as a vector rotated in 2D subspaces:

\mathbf{x}' = R(\theta) \cdot \mathbf{x}

where R(\theta) is a rotation matrix that encodes the token's position.

For Scaled RoPE, we modify the frequency \theta_k of the rotation:

\theta_k' = \theta_k \cdot \frac{L_{\text{new}}}{L_{\text{train}}}

where:

●​ \theta_k is the original RoPE frequency.
●​ L_{\text{train}} is the training sequence length (e.g., 4K).
●​ L_{\text{new}} is the desired sequence length (e.g., 32K).

This scales the rotation frequencies, allowing RoPE to work beyond its original training limit.

Python Code Example (Scaled RoPE Implementation)


import torch

def scaled_rope_embeddings(seq_len, d_model, train_len, new_len):
    theta = 10000 ** (-torch.arange(0, d_model, 2).float() / d_model)  # Base RoPE frequencies
    scale_factor = new_len / train_len  # Scaling factor
    theta_scaled = theta * scale_factor  # Adjust frequencies

    positions = torch.arange(seq_len).float().unsqueeze(1)
    scaled_embeddings = torch.cat([torch.sin(positions * theta_scaled),
                                   torch.cos(positions * theta_scaled)], dim=-1)
    return scaled_embeddings

train_seq_len = 4096
new_seq_len = 32768
d_model = 16

rope_embeddings = scaled_rope_embeddings(new_seq_len, d_model, train_seq_len, new_seq_len)


print("Scaled RoPE Embeddings:\n", rope_embeddings)

Why Scaled RoPE Works

✅ Maintains token relationships at longer sequence lengths.​


✅ No need to retrain the model from scratch.​
✅ Used in Llama-2, ChatGLM, Qwen, and Falcon for extending context windows.

20. Dynamic NTK Scaling


Intuition

Dynamic NTK Scaling is a method that adapts positional embeddings to longer sequences by dynamically adjusting
the Neural Tangent Kernel (NTK) frequency scaling factor.

✅ Why is NTK Scaling Needed?


●​ Transformers are trained on fixed sequence lengths (e.g., 4K tokens).
●​ When processing longer sequences (e.g., 32K tokens), absolute embeddings break down.
●​ Scaling the positional frequency allows embeddings to extrapolate beyond their training range.

How Dynamic NTK Scaling Works

Instead of using a fixed scale factor, Dynamic NTK Scaling learns the optimal scaling dynamically based on
sequence length.

For a given token position x_i, the scaled position embedding is computed as:

\theta' = \theta \cdot f(L_{\text{new}}, L_{\text{train}})

where f(L_{\text{new}}, L_{\text{train}}) is a learned function that determines how much to stretch the embedding space for longer sequences.

This allows for adaptive extrapolation instead of static frequency scaling.

Mathematical Formulation: Dynamic Frequency Adjustment

\theta_k' = \theta_k \cdot \left( \frac{L_{\text{new}}}{L_{\text{train}}} \right)^{\alpha}

where:

●​ \alpha is a learned scaling factor (not fixed).
●​ L_{\text{train}} is the training sequence length.
●​ L_{\text{new}} is the desired sequence length.

Python Code Example (Dynamic NTK Scaling in PyTorch)


import torch

class DynamicNTKScaling:
    def __init__(self, train_len, d_model):
        self.train_len = train_len
        self.alpha = torch.nn.Parameter(torch.tensor(0.5))  # Learnable scaling factor
        self.theta = 10000 ** (-torch.arange(0, d_model, 2).float() / d_model)  # Base frequencies

    def forward(self, seq_len):
        scale_factor = (seq_len / self.train_len) ** self.alpha  # Dynamic scaling
        theta_scaled = self.theta * scale_factor
        positions = torch.arange(seq_len).float().unsqueeze(1)
        scaled_embeddings = torch.cat([torch.sin(positions * theta_scaled),
                                       torch.cos(positions * theta_scaled)], dim=-1)
        return scaled_embeddings

# Example usage
train_seq_len = 4096
new_seq_len = 32768
d_model = 16

ntk_scaler = DynamicNTKScaling(train_seq_len, d_model)


scaled_embeddings = ntk_scaler.forward(new_seq_len)

print("Dynamically Scaled NTK Embeddings:\n", scaled_embeddings)

Why Dynamic NTK Scaling Works

✅ Learns the best scaling factor instead of using a fixed ratio.​


✅ Better extrapolation to long sequences (used in ChatGLM, Qwen, Llama-2).​
✅ Handles extreme long-context modeling (e.g., 128K tokens).

21. Multimodal RoPE (Rotary Position Embedding for Multimodal Models)

Intuition
Multimodal RoPE (Rotary Position Embedding) extends RoPE to work with multiple data types (e.g., text, images,
audio, video). Instead of treating each modality separately, it rotates embeddings in a shared continuous space to
maintain positional consistency across modalities.

✅ Why is Multimodal RoPE Important?


●​ Text and images have different positional structures.
●​ Text: Sequential (1D), Image: Grid-based (2D), Audio: Spectrogram (2D).
●​ RoPE’s rotation mechanism ensures that all modalities share a common positional encoding framework,
improving cross-modal attention.

How Does It Work?

1.​ Applies RoPE separately to each modality (text, image, etc.).


2.​ Rotates embeddings in a shared continuous space so that different modalities remain aligned.
3.​ Maintains relative position consistency across text, image, and audio features.

Mathematical Formulation: Generalizing RoPE for Multiple Modalities

For a text token x_i and an image patch v_j, we encode them using separate RoPE transformations:

x_i' = R_{\text{text}}(\theta) \cdot x_i, \qquad v_j' = R_{\text{image}}(\theta) \cdot v_j

where R(\theta) is the rotation matrix specific to each modality.

To align them in a shared space, we introduce a modality alignment function g:

z = g(x_i', v_j')

which projects both embeddings into a common representation space.

Python Code Example (Multimodal RoPE for Text and Image Patches)
import torch

class MultimodalRoPE:
    def __init__(self, d_model):
        self.theta_text = 10000 ** (-torch.arange(0, d_model, 2).float() / d_model)
        self.theta_image = 10000 ** (-torch.arange(0, d_model, 2).float() / d_model)

    def forward(self, text_positions, image_positions):
        text_rope = torch.cat([torch.sin(text_positions * self.theta_text),
                               torch.cos(text_positions * self.theta_text)], dim=-1)
        image_rope = torch.cat([torch.sin(image_positions * self.theta_image),
                                torch.cos(image_positions * self.theta_image)], dim=-1)
        return text_rope, image_rope

# Example usage
d_model = 16
text_positions = torch.arange(10).float().unsqueeze(1) # 10 text tokens
image_positions = torch.arange(5).float().unsqueeze(1) # 5 image patches

rope_encoder = MultimodalRoPE(d_model)
text_rope, image_rope = rope_encoder.forward(text_positions, image_positions)

print("Text RoPE Embeddings:\n", text_rope)


print("Image RoPE Embeddings:\n", image_rope)

Why Multimodal RoPE Works?

✅ Ensures that text, images, and audio maintain relative positions correctly.​
✅ Improves cross-modal learning in models like GPT-4V, Gemini, and Kosmos.​
✅ Maintains scalability for long sequences and high-resolution images.
