
Notes on RNN, LSTM, Bidirectional LSTM, Encoder-Decoder, and Transformers

1. Recurrent Neural Networks (RNN)

---------------------------------

RNNs are neural networks designed for sequential data. Unlike feedforward networks, RNNs have connections that form directed cycles, allowing them to maintain hidden states that capture temporal dependencies.

Architecture:

- Input Layer: Takes one time step of the input sequence at a time.

- Hidden Layer: Processes the input and the previous hidden state.

- Output Layer: Produces the output for each time step.

Equation:

h_t = f(W_xh*x_t + W_hh*h_(t-1) + b_h)

y_t = g(W_hy*h_t + b_y)

Where:

- x_t: Input at time t

- h_t: Hidden state at time t

- y_t: Output at time t

- W_xh, W_hh, W_hy: Weight matrices; b_h, b_y: Bias vectors

Limitations:

- Struggles with long-term dependencies due to vanishing gradients.
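
A minimal NumPy sketch of the forward recursion above, assuming f = tanh and taking g as the identity (the shapes and activation choices are illustrative assumptions, not taken from these notes):

    import numpy as np

    def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
        """Run a vanilla RNN over a sequence xs (a list of input vectors x_t)."""
        h = np.zeros(W_hh.shape[0])               # initial hidden state h_0
        outputs = []
        for x_t in xs:
            # h_t = f(W_xh*x_t + W_hh*h_(t-1) + b_h), with f = tanh
            h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
            # y_t = g(W_hy*h_t + b_y), with g taken as the identity here
            outputs.append(W_hy @ h + b_y)
        return outputs, h

For input dimension d, hidden size n, and output dimension m, W_xh has shape (n, d), W_hh has shape (n, n), and W_hy has shape (m, n).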

2. Long Short-Term Memory (LSTM)


---------------------------------

LSTMs are an advanced version of RNNs designed to handle long-term dependencies. They introduce gates to regulate the flow of information.

Architecture:

- Cell State: Stores long-term information.

- Forget Gate: Decides what information to discard.

- Input Gate: Determines which new information to store.

- Output Gate: Controls the output based on the cell state and hidden state.

Equations:

f_t = sigmoid(W_f[x_t, h_(t-1)] + b_f)

i_t = sigmoid(W_i[x_t, h_(t-1)] + b_i)

C~_t = tanh(W_C[x_t, h_(t-1)] + b_C)

C_t = f_t * C_(t-1) + i_t * C~_t

o_t = sigmoid(W_o[x_t, h_(t-1)] + b_o)

h_t = o_t * tanh(C_t)

Advantages:

- Effectively captures long-term dependencies.
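
A minimal NumPy sketch of one LSTM time step, following the gate equations above; the convention that each weight matrix acts on the concatenation [x_t, h_(t-1)] matches the notation, while the shapes are illustrative assumptions:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
        """One LSTM step; each weight matrix acts on the concatenated [x_t, h_(t-1)]."""
        z = np.concatenate([x_t, h_prev])
        f_t = sigmoid(W_f @ z + b_f)            # forget gate: what to discard from C_(t-1)
        i_t = sigmoid(W_i @ z + b_i)            # input gate: which new information to store
        C_tilde = np.tanh(W_C @ z + b_C)        # candidate cell state C~_t
        C_t = f_t * C_prev + i_t * C_tilde      # updated cell state
        o_t = sigmoid(W_o @ z + b_o)            # output gate
        h_t = o_t * np.tanh(C_t)                # updated hidden state
        return h_t, C_t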

3. Bidirectional LSTM

----------------------

A Bidirectional LSTM processes the sequence in both forward and backward directions, capturing context from both past and future.

Architecture:

- Two LSTM layers: One processes the input forward, and the other processes it backward.

Equation:

h_t = concat(h_t_forward, h_t_backward)

Applications:

- Speech recognition, language modeling, etc.
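
A minimal sketch of the bidirectional wiring, reusing the lstm_step function sketched in the previous section (the helper names and the per-direction parameter dictionaries are assumptions for illustration):

    import numpy as np

    def run_lstm(xs, params, hidden_size):
        """Run an LSTM over a sequence and return the hidden state at every time step."""
        h = np.zeros(hidden_size)
        C = np.zeros(hidden_size)
        states = []
        for x_t in xs:
            h, C = lstm_step(x_t, h, C, **params)   # lstm_step from the LSTM section
            states.append(h)
        return states

    def bidirectional_lstm(xs, fwd_params, bwd_params, hidden_size):
        h_fwd = run_lstm(xs, fwd_params, hidden_size)              # processes x_1 .. x_T
        h_bwd = run_lstm(xs[::-1], bwd_params, hidden_size)[::-1]  # x_T .. x_1, re-aligned
        # h_t = concat(h_t_forward, h_t_backward) for every time step t
        return [np.concatenate([hf, hb]) for hf, hb in zip(h_fwd, h_bwd)]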

4. Encoder-Decoder Architecture

--------------------------------

Used for tasks like machine translation, this architecture includes:

- Encoder: Processes the input sequence and encodes it into a context vector.

- Decoder: Decodes the context vector into the target sequence.

Workflow:

1. The encoder processes the input sequence and generates a fixed-size context vector.

2. The decoder takes this context vector and generates the output sequence step-by-step.
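
A minimal sketch of this workflow with RNN-style components; the greedy argmax decoding, the embedding table, and the start/end token ids are illustrative assumptions:

    import numpy as np

    def encode(xs, W_xh, W_hh, b_h):
        """Encoder: fold the whole input sequence into one fixed-size context vector."""
        h = np.zeros(W_hh.shape[0])
        for x_t in xs:
            h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        return h                                     # the context vector

    def decode(context, embed, W_eh, W_hh, W_hy, b_h, b_y, start_id, end_id, max_len=50):
        """Decoder: generate the target sequence step by step from the context vector."""
        h = context
        token = start_id
        output = []
        for _ in range(max_len):
            # Feed the embedding of the previously generated token back in.
            h = np.tanh(W_eh @ embed[token] + W_hh @ h + b_h)
            token = int(np.argmax(W_hy @ h + b_y))   # greedy choice of the next token
            if token == end_id:
                break
            output.append(token)
        return output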

5. Transformers

----------------

Transformers are models that replace recurrence with self-attention mechanisms. They are the foundation for models like BERT and GPT.

Components:

1. Encoder-Decoder Structure:

- Encoder: Consists of multiple layers with self-attention and feedforward sublayers.

- Decoder: Similar to the encoder, but its self-attention is masked (each position attends only to earlier positions) and it adds cross-attention layers over the encoder output.


2. Self-Attention: Computes attention weights for each word in relation to other words.

Attention Equation:

Attention(Q, K, V) = softmax((QK^T)/sqrt(d_k))V
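
A minimal NumPy sketch of this equation, assuming Q, K, and V are matrices with one row per token:

    import numpy as np

    def attention(Q, K, V):
        """Attention(Q, K, V) = softmax((QK^T)/sqrt(d_k))V."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)              # shape (num_queries, num_keys)
        # Row-wise softmax over the keys (max subtracted for numerical stability).
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V                           # weighted sum of the value rows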

Advantages:

- Captures global dependencies.

- Highly parallelizable, since all time steps are processed at once rather than sequentially.

Summary Table:

| Model               | Key Feature                | Use Case                    |
|---------------------|----------------------------|-----------------------------|
| RNN                 | Cyclic connections         | Sequential data             |
| LSTM                | Gates for long-term deps.  | Long sequence modeling      |
| Bidirectional LSTM  | Processes two directions   | Context-aware tasks         |
| Encoder-Decoder     | Separate encode/decode     | Translation, summarization  |
| Transformers        | Self-attention             | NLP, image processing       |
