Temporal-Difference (TD) Learning: Basics
Definition:
- TD learning is a model-free approach in reinforcement learning (RL) where
an agent learns by bootstrapping: it uses its current value estimates of
successor states to update the estimate of the current state.
- The agent updates its value estimates after each step, based on the
difference (the TD error) between its current prediction and the observed
reward plus the estimated value of the next state.
Example:
Imagine a robot navigating a maze to find an exit. It receives a reward of +10
for reaching the exit, -1 for hitting a wall, and 0 for empty cells.
- If the robot starts in state (S1), moves to (S2), and eventually reaches the
exit, it updates its value estimates for each state based on the observed
rewards.
- Using TD learning, the robot updates its estimate of (S1) after observing the
reward from transitioning to (S2).
How It Works:
1. The robot estimates that the value of (S1) is 2 (based on prior experience).
2. After moving to (S2), it receives a reward of 0 and estimates the value of
(S2) as 3.
3. With step size α and discount γ, TD(0) moves the value of (S1) toward the
one-step target 0 + γ·V(S2): V(S1) ← V(S1) + α[0 + γ·V(S2) - V(S1)], as in
the sketch below.
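A minimal sketch of this single update in Python, assuming a step size of
α = 0.1 and a discount of γ = 1 (neither value is given in the example above):

```python
# One TD(0) update for the maze example above.
alpha = 0.1   # step size (assumed; not given in the example)
gamma = 1.0   # discount factor (assumed)

V = {"S1": 2.0, "S2": 3.0}   # current value estimates from prior experience
reward = 0.0                 # reward observed on the move from S1 to S2

# TD error: one-step target minus the current estimate of S1.
td_error = reward + gamma * V["S2"] - V["S1"]   # 0 + 1.0*3 - 2 = 1
V["S1"] += alpha * td_error                     # 2 + 0.1*1 = 2.1

print(V["S1"])   # 2.1
```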
---
TD(0) Algorithm:
- TD(0) is the simplest form of TD learning: after every transition from (S)
to (S') with reward R, it makes a one-step bootstrapped update,
V(S) ← V(S) + α[R + γ·V(S') - V(S)].
Example:
Consider a game where an agent navigates a grid. Each step costs -1, and
reaching the goal state gives +10.
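A small sketch of TD(0) prediction on such a grid; the corridor layout, the
equiprobable random policy, and the parameter values are illustrative
assumptions, and only the reward scheme comes from the example:

```python
import random

# TD(0) prediction on a small corridor grid. States 0..4 lie on a line and
# state 4 is the goal; each step costs -1 and the step into the goal gives +10,
# matching the reward scheme in the example above.
N_STATES, GOAL = 5, 4
alpha, gamma = 0.1, 1.0
V = [0.0] * N_STATES                     # value estimates, initialised to zero

def step(state, action):
    """Move left (-1) or right (+1); return (next_state, reward, done)."""
    nxt = max(0, min(N_STATES - 1, state + action))
    if nxt == GOAL:
        return nxt, 10.0, True
    return nxt, -1.0, False

for episode in range(1000):
    s = 0
    done = False
    while not done:
        a = random.choice([-1, 1])       # equiprobable random behaviour policy
        s_next, r, done = step(s, a)
        target = r if done else r + gamma * V[s_next]
        V[s] += alpha * (target - V[s])  # one-step TD(0) update
        s = s_next

print([round(v, 2) for v in V])          # estimated values under the random policy
```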
---
Example Scenario:
Imagine a self-driving car learning to navigate through a city. It uses TD
prediction methods to update its knowledge of different routes.
Advantages:
1. Efficient Updates: The car updates its route knowledge after every turn
rather than waiting for a complete journey.
2. Handles Incomplete Episodes: Even if the car stops before reaching the
destination, it updates its route knowledge based on observed turns.
3. Online Learning: The car learns while driving, adapting to traffic and road
changes in real-time.
4. No Model Required: The car does not need a complete map of the city; it
learns from experience.
5. Works with Large State Spaces: Suitable for complex environments like
city navigation.
6. Adaptable: The car quickly adjusts its policy based on changes in traffic
patterns.
7. Combines Past Experience: Uses previously learned knowledge and
current observations.
---
4. Optimality of TD(0)
Example:
A gaming AI is trying to find the best path in a maze. TD(0) helps it learn
optimal value estimates over time.
- Initially, the AI may have random estimates for each state in the maze.
- By repeatedly navigating and updating values using TD(0), it learns to assign
accurate values to each state.
- The AI updates its value estimates incrementally based on new
experiences.
Optimality:
- With sufficient exploration (every state visited often enough) and an
appropriately decreasing step size, the AI's value estimates converge to the
true value function; accurate values in turn let it choose the best strategy
for navigating the maze.
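A toy sketch of this convergence: a single state whose only transition
terminates with a random reward, so the true value is known exactly. The task
and the 1/n step-size schedule are illustrative assumptions:

```python
import random

# Convergence sketch: a single non-terminal state A whose only transition
# terminates immediately, paying +10 half the time and 0 otherwise, so the
# true value of A is exactly 5.0.
random.seed(0)
V_A = 0.0
for n in range(1, 20001):
    r = 10.0 if random.random() < 0.5 else 0.0
    alpha = 1.0 / n                  # decreasing step size
    V_A += alpha * (r - V_A)         # TD(0) update; the successor is terminal

print(round(V_A, 2))                 # close to the true value 5.0
```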
---
5. SARSA
Definition:
- SARSA is an on-policy algorithm, meaning it updates its Q-values based on
the actions chosen by its current policy.
Example:
Imagine an agent playing a simple game where it needs to move left or right
to collect coins.
1. It starts at state (S), chooses action (A) (move right), and moves to (S'),
where it receives a coin (+1).
2. It then selects action (A') (move left) according to the same policy.
3. It updates Q(S, A) using the full tuple (S, A, R, S', A'):
Q(S, A) ← Q(S, A) + α[R + γ·Q(S', A') - Q(S, A)] (see the sketch below).
Why Use SARSA:
- Because SARSA evaluates the policy it actually follows, including its
exploratory moves, it tends to learn safer behaviour, which helps in risky or
changing environments (e.g., games with changing rules).
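A minimal SARSA sketch for a coin-collecting task of this kind; the corridor
layout, the epsilon-greedy policy, and the parameter values are illustrative
assumptions:

```python
import random

# SARSA on a small corridor: the coin sits in the rightmost cell.
N_STATES, COIN = 5, 4
ACTIONS = [-1, +1]                     # move left / move right
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def policy(s):
    """Epsilon-greedy with respect to the current Q-values."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def step(s, a):
    s_next = max(0, min(N_STATES - 1, s + a))
    if s_next == COIN:
        return s_next, 1.0, True       # collected the coin
    return s_next, 0.0, False

for episode in range(500):
    s, a = 0, policy(0)
    done = False
    while not done:
        s_next, r, done = step(s, a)
        a_next = policy(s_next)        # A' chosen by the same policy
        target = r if done else r + gamma * Q[(s_next, a_next)]
        # On-policy update: bootstraps from the action actually selected next.
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s_next, a_next

print(max(ACTIONS, key=lambda a: Q[(0, a)]))   # greedy action at the start: +1
```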
---
6. Q-learning
Definition:
- Q-learning is an off-policy algorithm, meaning it learns the optimal policy
independently of the actions taken by the current policy.
Example:
In a robot soccer game, the robot can explore with its behaviour policy (for
example, epsilon-greedy) while each update bootstraps from the greedy action
in the next state:
Q(S, A) ← Q(S, A) + α[R + γ·max_a Q(S', a) - Q(S, A)] (see the sketch below).
Benefit:
- Because the update target ignores exploratory actions, Q-learning can learn
the optimal policy directly from exploratory experience, making it effective
in games and robotic tasks.
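A single Q-learning update in isolation, to contrast with SARSA above; the toy
Q-table and the transition values are made up for illustration:

```python
# Q-learning update sketch (toy Q-table and transition chosen for illustration).
ACTIONS = ["left", "right"]
alpha, gamma = 0.1, 0.9

Q = {("S", "left"): 0.0, ("S", "right"): 0.5,
     ("S'", "left"): 1.0, ("S'", "right"): 2.0}

s, a, r, s_next = "S", "right", 1.0, "S'"

# Off-policy target: bootstrap from the greedy (maximum) action value in S',
# regardless of which action the behaviour policy will actually take there.
best_next = max(Q[(s_next, b)] for b in ACTIONS)
Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

print(round(Q[(s, a)], 3))   # 0.5 + 0.1 * (1 + 0.9*2 - 0.5) = 0.73
```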
---
7. Expected SARSA
Definition:
- Instead of using a single sampled next action, Expected SARSA updates
toward the expected value of the next state's actions under the current
policy: Q(S, A) ← Q(S, A) + α[R + γ·Σ_a π(a | S')·Q(S', a) - Q(S, A)].
Example:
An agent playing a card game updates its strategy using the policy-weighted
value of all possible next actions rather than only the one it happens to
sample, which lowers the variance of each update (see the sketch below).
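A single Expected SARSA update, again with a made-up Q-table; the epsilon-greedy
action probabilities stand in for "the current policy", and the action names
only echo the card-game example:

```python
# Expected SARSA update sketch (toy Q-table and policy chosen for illustration).
ACTIONS = ["fold", "call"]
alpha, gamma, epsilon = 0.1, 0.9, 0.2

Q = {("S", "fold"): 0.0, ("S", "call"): 0.5,
     ("S'", "fold"): 1.0, ("S'", "call"): 3.0}

s, a, r, s_next = "S", "call", 1.0, "S'"

# Probabilities the epsilon-greedy policy assigns to each action in S'.
greedy = max(ACTIONS, key=lambda b: Q[(s_next, b)])
pi = {b: epsilon / len(ACTIONS) + (1 - epsilon) * (b == greedy) for b in ACTIONS}

# The target uses the expectation over next actions instead of a single sample.
expected_q = sum(pi[b] * Q[(s_next, b)] for b in ACTIONS)
Q[(s, a)] += alpha * (r + gamma * expected_q - Q[(s, a)])

print(round(Q[(s, a)], 3))
```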
---
Maximization Bias:
- In Q-learning, the update target takes a maximum over noisy Q-value
estimates, which biases the target upward (overestimation).
- Example: In a stock trading environment, the agent might overestimate the
value of certain actions, leading to poor decisions.
Double Q-learning:
- Uses two separate Q-value estimates, (Q1) and (Q2), to remove this bias.
- Example:
- The agent flips a coin to decide which estimate to update ((Q1) or (Q2));
the chosen estimate selects the maximizing next action, while the other
estimate evaluates it, decoupling action selection from action evaluation
(see the sketch below).
Benefit:
- Reduces overestimation, leading to better convergence and more accurate
value estimates.
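A sketch of one Double Q-learning update; the action names follow the
stock-trading example, and the table values are made up for illustration:

```python
import random

# Double Q-learning update sketch (toy tables and transition are illustrative).
ACTIONS = ["buy", "sell"]
alpha, gamma = 0.1, 0.9

Q1 = {(s, a): 0.0 for s in ["S", "S'"] for a in ACTIONS}
Q2 = {(s, a): 0.0 for s in ["S", "S'"] for a in ACTIONS}
Q1[("S'", "buy")], Q2[("S'", "buy")] = 2.0, 1.0   # deliberately disagreeing estimates

s, a, r, s_next = "S", "buy", 0.5, "S'"

if random.random() < 0.5:
    # Update Q1: Q1 selects the maximizing action, Q2 evaluates it.
    best = max(ACTIONS, key=lambda b: Q1[(s_next, b)])
    Q1[(s, a)] += alpha * (r + gamma * Q2[(s_next, best)] - Q1[(s, a)])
else:
    # Update Q2: Q2 selects the maximizing action, Q1 evaluates it.
    best = max(ACTIONS, key=lambda b: Q2[(s_next, b)])
    Q2[(s, a)] += alpha * (r + gamma * Q1[(s_next, best)] - Q2[(s, a)])

# Acting uses the sum (or average) of both estimates.
print(max(ACTIONS, key=lambda b: Q1[("S", b)] + Q2[("S", b)]))
```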
---
Afterstates:
- An afterstate represents the state of the game after an action has been
taken but before the opponent moves.
- Example: In chess, an afterstate is the board configuration after your move
but before the opponent responds.
- Simplifies value estimation: many different (state, action) pairs lead to
the same afterstate, so the agent evaluates resulting positions rather than
every state-action pair separately.
Special Cases:
- In some games, like tic-tac-toe, afterstates simplify learning: the agent
evaluates the board directly after each move, and positions reached by
different move orders share a single value estimate (see the sketch below).
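A small sketch of afterstate values for tic-tac-toe; the board representation,
the successor value, and the step size are illustrative assumptions:

```python
from collections import defaultdict

# Afterstate value sketch for tic-tac-toe. The afterstate is the board right
# after our move, before the opponent responds; we learn one value per
# afterstate rather than one per (state, action) pair.
V = defaultdict(float)            # value estimate per afterstate (board tuple)
alpha = 0.1

def place(board, cell, mark="X"):
    """Return the afterstate: the board after placing `mark` in `cell`."""
    b = list(board)
    b[cell] = mark
    return tuple(b)

# Two different (state, action) pairs that produce the same afterstate:
state1 = place(place(tuple(" " * 9), 0, "X"), 8, "O")   # X top-left, O bottom-right
state2 = place(place(tuple(" " * 9), 4, "X"), 8, "O")   # X centre,   O bottom-right
after1 = place(state1, 4)          # play the centre from state1
after2 = place(state2, 0)          # play the top-left corner from state2
assert after1 == after2            # same board, so they share one value entry

# TD-style update toward the value of the next afterstate we reach (0.6 is a
# made-up stand-in for the estimated value later in the game).
V[after1] += alpha * (0.6 - V[after1])
print(round(V[after2], 3))         # 0.06: the update is shared by both routes
```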
---