# Encoder-Decoder and Transformers Notes
## 1. Encoder-Decoder Architecture
### Overview
1. **Encoder**:
- The encoder processes the input sequence and converts it into a fixed-length context vector
(latent representation).
2. **Decoder**:
- The decoder generates the output sequence step by step, using the context vector from the encoder.
- It predicts the next token based on the current state and the context vector.
### Workflow
1. The encoder processes the input sequence \( x = \{x_1, x_2, \dots, x_n\} \) and produces a fixed-length context vector \( C \):
\[
C = f(x_1, x_2, \dots, x_n)
\]
2. The decoder takes \( C \) and generates the output sequence \( y = \{y_1, y_2, \dots, y_m\} \) one token at a time (a minimal code sketch of this loop follows the list):
\[
P(y \mid x) = \prod_{t=1}^{m} P(y_t \mid y_{<t}, C)
\]
\[
s_t = g(s_{t-1}, y_{t-1}, C)
\]
where \( s_t \) is the decoder hidden state used to predict \( y_t \).
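To make the workflow concrete, here is a minimal NumPy sketch of the loop above: a toy recurrent encoder compresses the input tokens into the context vector `C`, and a greedy decoder conditions each step on the previous token, its own state, and `C`. The tanh cell, the parameter names (`W_enc`, `W_dec`, `W_out`, `E`), and all sizes are illustrative assumptions rather than a specific published model, and the weights are untrained.
```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 8, 20   # toy hidden size and vocabulary size (assumed values)

# Randomly initialized toy parameters for an RNN-style encoder and decoder.
E = rng.normal(scale=0.1, size=(vocab, d))           # token embeddings
W_enc = rng.normal(scale=0.1, size=(2 * d, d))       # encoder recurrence
W_dec = rng.normal(scale=0.1, size=(3 * d, d))       # decoder recurrence (also sees C)
W_out = rng.normal(scale=0.1, size=(d, vocab))       # projection to vocabulary logits

def encode(x_tokens):
    """Run the encoder over the input; the final hidden state is the context vector C."""
    h = np.zeros(d)
    for tok in x_tokens:
        h = np.tanh(np.concatenate([E[tok], h]) @ W_enc)
    return h

def decode(C, max_len=5, bos=0):
    """Greedy decoding: each step uses the previous token, the state s_{t-1}, and C."""
    s, y_prev, out = np.zeros(d), bos, []
    for _ in range(max_len):
        s = np.tanh(np.concatenate([E[y_prev], s, C]) @ W_dec)
        logits = s @ W_out
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                         # softmax over the vocabulary
        y_prev = int(np.argmax(probs))
        out.append(y_prev)
    return out

C = encode([3, 7, 1, 12])
print(decode(C))   # untrained weights, so the output tokens are arbitrary
```
In a trained model, \( f \) and \( g \) would typically be LSTM or GRU cells and the parameters would be learned from paired input/output sequences; the structure of the loop stays the same.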
### Limitations
- Fixed-length context vectors can struggle to capture all the essential details of long input
sequences.
---
## 2. Transformers
### Overview
Transformers are a non-sequential architecture based entirely on attention mechanisms. They overcome the limitations of fixed-length context vectors by letting every token attend directly to every other token in the sequence. Key components include:
1. **Self-Attention Mechanism**:
- Allows the model to weigh the importance of different parts of the sequence when encoding each
token.
\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
\]
where \( Q \), \( K \), and \( V \) are query, key, and value matrices, respectively.
2. **Multi-Head Attention**:
- Splits attention into multiple heads to capture different types of relationships in the data.
- Formula:
\[
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O,
\quad \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
\]
3. **Positional Encoding**:
- Since Transformers process tokens in parallel, positional encodings are added to the input embeddings to inject information about token order (a combined code sketch of self-attention, multi-head attention, and positional encoding follows this list).
- Formula:
\[
PE_{(pos,\, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
\]
\[
PE_{(pos,\, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
\]
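The following NumPy sketch ties the three formulas above together: scaled dot-product attention, a multi-head wrapper that splits the projections into heads, and sinusoidal positional encodings. The head-splitting convention (slicing one `d_model`-wide projection into equal chunks) and the toy sizes at the bottom are assumptions made for brevity; masking and dropout are omitted.
```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # (n_q, d_v)

def multi_head_attention(X, num_heads, W_q, W_k, W_v, W_o):
    """Project X, split into heads, attend per head, concatenate, project with W_o."""
    d_model = X.shape[-1]
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for i in range(num_heads):
        cols = slice(i * d_head, (i + 1) * d_head)
        heads.append(attention(Q[:, cols], K[:, cols], V[:, cols]))
    return np.concatenate(heads, axis=-1) @ W_o           # (n, d_model)

def positional_encoding(n_positions, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd dimensions."""
    pos = np.arange(n_positions)[:, None]                 # (n, 1)
    i = np.arange(d_model // 2)[None, :]                  # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Tiny end-to-end check with random, untrained weights (toy sizes, even d_model).
rng = np.random.default_rng(0)
n, d_model, num_heads = 6, 16, 4
X = rng.normal(size=(n, d_model)) + positional_encoding(n, d_model)
W_q, W_k, W_v, W_o = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, num_heads, W_q, W_k, W_v, W_o).shape)   # (6, 16)
```
Because each head attends over a lower-dimensional slice, the overall cost stays close to single-head attention while letting different heads specialize in different relationships.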
### Architecture
1. **Encoder**:
- Multi-head self-attention
- Position-wise feed-forward network
- Residual connections and layer normalization around each sub-layer
2. **Decoder**:
- Similar to the encoder but includes an additional cross-attention mechanism to attend to the
encoder's output, plus causal masking in its self-attention so each position only sees earlier outputs (a simplified decoder-layer sketch follows this list).
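To show how the decoder layer differs structurally from an encoder layer, the sketch below stacks its three sub-layers: self-attention over the decoder inputs, cross-attention whose keys and values come from the encoder output, and a position-wise feed-forward network. Residual connections are included, but causal masking, layer normalization, and multiple heads are left out, and all parameter shapes are toy assumptions.
```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, same form as in the previous sketch."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def decoder_layer(Y, enc_out, params):
    """One simplified decoder layer: self-attention, cross-attention, feed-forward."""
    Wq1, Wk1, Wv1, Wq2, Wk2, Wv2, W1, W2 = params
    # 1) Self-attention over the decoder's own (partial) output sequence.
    Y = Y + attention(Y @ Wq1, Y @ Wk1, Y @ Wv1)
    # 2) Cross-attention: queries from the decoder, keys/values from the encoder output.
    Y = Y + attention(Y @ Wq2, enc_out @ Wk2, enc_out @ Wv2)
    # 3) Position-wise feed-forward network (ReLU between two linear maps).
    return Y + np.maximum(Y @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
d = 16
params = [rng.normal(scale=0.1, size=(d, d)) for _ in range(8)]
dec_in = rng.normal(size=(5, d))     # 5 decoder positions
enc_out = rng.normal(size=(7, d))    # 7 encoder positions
print(decoder_layer(dec_in, enc_out, params).shape)   # (5, 16)
```
An encoder layer looks the same minus step 2, which is exactly the extra cross-attention sub-layer described above.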
### Advantages
- Processes all tokens in parallel, which makes training much faster than recurrent models.
- Self-attention captures long-range dependencies without forcing them through a fixed-length context vector.
- Scales effectively to larger models and datasets.
---
## 3. Comparison
| Aspect                      | Encoder-Decoder (RNN-based) | Transformer                   |
|-----------------------------|-----------------------------|-------------------------------|
| Sequence processing         | Sequential, step by step    | Parallel across all tokens    |
| Long-range dependencies     | Limited by fixed context    | Captured via self-attention   |
| Training speed              | Slower (hard to parallelize)| Faster (highly parallelizable)|
---
## 4. Illustration
Encoder-Decoder (seq2seq):
```
Input -> [Encoder] -> Context Vector -> [Decoder] -> Output
```
Transformer:
```
Input -> [Embedding + Positional Encoding] -> [Encoder Stack] -> (cross-attention) -> [Decoder Stack] -> Output
```
---
## Key Takeaways
- The Encoder-Decoder framework is foundational for seq2seq tasks but struggles with long input sequences, because everything must pass through a fixed-length context vector.
- Transformers remove that bottleneck with self-attention, multi-head attention, and positional encodings, processing whole sequences in parallel.
- The choice between these architectures depends on the task, with Transformers being the go-to option for most modern sequence-to-sequence and NLP workloads.