
1. Temporal-Difference (TD) Learning: Basics

Definition:
- TD learning is a model-free approach in reinforcement learning (RL) in which an agent learns by bootstrapping: it uses its current value estimates to update earlier predictions.
- After each step, the agent adjusts its value estimate by the TD error: the difference between its current prediction and the observed reward plus the estimated value of the next state.

Example:
Imagine a robot navigating a maze to find an exit. It receives a reward of +10
for reaching the exit, -1 for hitting a wall, and 0 for empty cells.

- If the robot starts in state (S1), moves to (S2), and eventually reaches the
exit, it updates its value estimates for each state based on the observed
rewards.
- Using TD learning, the robot updates its estimate of (S1) immediately after the transition to (S2), using the observed reward and its current estimate of (S2).

How It Works:
1. The robot estimates the value of (S1) as 2 (based on prior experience).
2. After moving to (S2), it receives a reward of 0, and its current estimate of (S2) is 3.
3. It nudges the estimate of (S1) toward the bootstrapped target: V(S1) ← V(S1) + α[R + γV(S2) − V(S1)]. With a learning rate α = 0.1 and discount γ = 1, this gives V(S1) ← 2 + 0.1 × (0 + 3 − 2) = 2.1.
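
The same update can be written as a small helper function. The sketch below is illustrative only; the learning rate, discount factor, and state names are assumed example values, not part of the notes:

```python
# One-step TD value update, illustrating the maze example above.
# alpha (learning rate) and gamma (discount factor) are assumed example values.

def td_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """Move V[s] toward the bootstrapped target r + gamma * V[s_next]."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return V

# Worked numbers from the example: V(S1) = 2, V(S2) = 3, reward 0 on the move.
V = {"S1": 2.0, "S2": 3.0}
td_update(V, "S1", r=0.0, s_next="S2")
print(V["S1"])  # 2.1
```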

---

2. TD Prediction with TD(0)

TD(0) Algorithm:
- TD(0) is the simplest form of TD learning: after every step it updates the current state's value using one-step bootstrapping, V(S) ← V(S) + α[R + γV(S') − V(S)].

Example:
Consider a game where an agent navigates a grid. Each step costs -1, and reaching the goal state gives +10.

1. The agent starts at (S0) and estimates its value as 5.
2. It moves to (S1) and receives a reward of -1.
3. Its current estimate of (S1) is 6, so with γ = 1 the TD error is -1 + 6 - 5 = 0, and the estimate of (S0) is left unchanged at 5.
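
For reference, a minimal TD(0) prediction loop might look like the sketch below; the Gym-style environment interface (env.reset(), env.step()) and the parameter values are assumptions made for illustration:

```python
from collections import defaultdict

def td0_prediction(env, policy, episodes=1000, alpha=0.1, gamma=1.0):
    """Estimate V(s) for a fixed policy with one-step TD updates.

    Assumes a Gym-style interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done, info).
    """
    V = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done, _ = env.step(a)
            # Bootstrapped target: observed reward plus discounted estimate of next state
            # (the next-state term is dropped at terminal states).
            target = r + gamma * V[s_next] * (not done)
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```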

---

3. Advantages of TD Prediction Methods

Example Scenario:
Imagine a self-driving car learning to navigate through a city. It uses TD
prediction methods to update its knowledge of different routes.

Advantages:
1. Efficient Updates: The car updates its route knowledge after every turn
rather than waiting for a complete journey.
2. Handles Incomplete Episodes: Even if the car stops before reaching the
destination, it updates its route knowledge based on observed turns.
3. Online Learning: The car learns while driving, adapting to traffic and road
changes in real-time.
4. No Model Required: The car does not need a complete map of the city; it
learns from experience.
5. Works with Large State Spaces: Suitable for complex environments like
city navigation.
6. Adaptable: The car quickly adjusts its policy based on changes in traffic
patterns.
7. Combines Past Experience: Uses previously learned knowledge and
current observations.

---

4. Optimality of TD(0)

Example:
A gaming AI is trying to find the best path in a maze. TD(0) helps it learn
optimal value estimates over time.

- Initially, the AI may have random estimates for each state in the maze.
- By repeatedly navigating and updating values using TD(0), it learns to assign
accurate values to each state.
- The AI updates its value estimates incrementally based on new
experiences.

Optimality:
- For a fixed policy, with sufficient exploration and an appropriately decreasing step size, TD(0)'s estimates converge to the true value function of that policy; combined with policy improvement (as in the control methods below), this lets the AI learn a good strategy for navigating the maze.

---

5. SARSA: On-policy TD Control

Definition:
- SARSA is an on-policy algorithm: it updates its Q-values using the action actually chosen by its current policy in the next state. The name comes from the (S, A, R, S', A') tuple used in each update.

Example:
Imagine an agent playing a simple game where it needs to move left or right to collect coins.

1. It starts at state (S), chooses action (A) (move right), and moves to (S'), where it receives a coin (+1).
2. It then selects action (A') (move left) with its current policy.
3. It updates Q(S, A) toward the target R + γQ(S', A'): Q(S, A) ← Q(S, A) + α[R + γQ(S', A') − Q(S, A)] (see the sketch below).

Why Use SARSA:
- Because it evaluates the policy it actually follows (including its exploratory actions), it tends to learn safer behavior when exploration mistakes are costly and adapts well in non-stationary environments (e.g., games with changing rules).
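
A minimal sketch of the SARSA update described above, assuming a tabular Q stored in a dictionary and illustrative values for α and γ:

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: the target uses the action a_next actually chosen
    by the current policy in s_next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Coin example: reward +1 for reaching S', next action A' chosen by the policy.
Q = defaultdict(float)
sarsa_update(Q, "S", "right", r=1.0, s_next="S'", a_next="left")
print(Q[("S", "right")])  # 0.1
```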

---

6. Q-learning: Off-policy TD Control

Definition:
- Q-learning is an off-policy algorithm: it learns the optimal action values independently of the actions actually taken by the current (behavior) policy, because it always bootstraps from the best action in the next state.

Example:
In a robot soccer game, the robot learns to score goals.

1. The robot is at state (S) and decides to kick (action (A)), receiving a reward (R).
2. It moves to (S') and looks up the best possible action value from (S'), i.e. the maximum of Q(S', A') over all actions (A').
3. It updates Q(S, A) toward that target: Q(S, A) ← Q(S, A) + α[R + γ max Q(S', A') − Q(S, A)] (see the sketch below).

Benefit:
- Q-learning can learn about the optimal greedy policy while following a different, exploratory policy, which makes it effective in games and robotic tasks.
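
A minimal sketch of the Q-learning update, with an assumed action set and illustrative parameter values:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy update: the target uses the best estimated action in s_next,
    regardless of which action the behavior policy will actually take."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Robot soccer example with an assumed reward of +1 for a promising kick.
Q = defaultdict(float)
actions = ["kick", "pass", "dribble"]
q_learning_update(Q, "S", "kick", r=1.0, s_next="S'", actions=actions)
print(Q[("S", "kick")])  # 0.1
```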

---

7. Expected SARSA

Definition:
- Instead of bootstrapping from a single sampled next action, Expected SARSA uses the expected value of the next state's actions under the current policy: the target is R + γ Σ π(A'|S') Q(S', A'), summed over all actions (A').

Example:
An agent playing a card game updates its strategy by considering the probability-weighted average value of all actions it might take next.

1. It reaches a state (S), selects action (A), and receives reward (R).
2. It moves to state (S'), where it could choose any action (A') with certain probabilities under its policy.
3. It updates Q(S, A) toward the expected target rather than waiting to see which (A') is actually sampled (see the sketch below).

Why Use Expected SARSA:
- It removes the variance caused by randomly sampling the next action, leading to more stable learning.
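
A minimal sketch of the Expected SARSA update, assuming the current policy is given as a dictionary of action probabilities (an illustrative assumption):

```python
from collections import defaultdict

def expected_sarsa_update(Q, s, a, r, s_next, policy_probs, alpha=0.1, gamma=0.9):
    """Target uses the expectation over next actions under the current policy:
    R + gamma * sum over a' of pi(a'|s') * Q(s', a')."""
    expected_q = sum(p * Q[(s_next, a_next)] for a_next, p in policy_probs.items())
    target = r + gamma * expected_q
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Card-game example with an assumed 70/30 policy over two next actions.
Q = defaultdict(float)
Q[("S'", "hit")] = 2.0
Q[("S'", "stand")] = 1.0
expected_sarsa_update(Q, "S", "hit", r=0.0, s_next="S'",
                      policy_probs={"hit": 0.7, "stand": 0.3})
print(round(Q[("S", "hit")], 3))  # 0.153
```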

---

8. Maximization Bias and Double Q-learning

Maximization Bias:
- In Q-learning, using the same noisy estimates both to select the maximum next action and to evaluate it causes systematic overestimation of action values.
- Example: In a stock trading environment, the agent might overestimate the value of certain actions, leading to poor decisions.

Double Q-learning:
- Maintains two separate Q-value estimates, (Q1) and (Q2), to decouple action selection from action evaluation.
- Example:
  - On each step, the agent flips a coin to decide which estimate to update. When updating (Q1), it selects the best next action according to (Q1) but evaluates it using (Q2), and vice versa (see the sketch below).

Benefit:
- Reduces overestimation, leading to better convergence and more accurate value estimates.
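
A minimal sketch of a single Double Q-learning update step, with assumed actions and parameter values:

```python
import random
from collections import defaultdict

def double_q_update(Q1, Q2, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Flip a coin; update one table while evaluating the argmax action with the other."""
    if random.random() < 0.5:
        update, evaluate = Q1, Q2   # update Q1, evaluate with Q2
    else:
        update, evaluate = Q2, Q1   # update Q2, evaluate with Q1
    # Select the best next action with the table being updated...
    best_a = max(actions, key=lambda a_next: update[(s_next, a_next)])
    # ...but evaluate it with the other table to reduce overestimation.
    target = r + gamma * evaluate[(s_next, best_a)]
    update[(s, a)] += alpha * (target - update[(s, a)])

# Stock trading example with assumed actions and a reward of +1.
Q1, Q2 = defaultdict(float), defaultdict(float)
double_q_update(Q1, Q2, "S", "buy", r=1.0, s_next="S'", actions=["buy", "sell", "hold"])
```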

---

9. Games, Afterstates, and Special Cases

Afterstates:
- An afterstate is the state of the game immediately after the agent's action has been taken but before the opponent (or environment) responds.
- Example: In chess, an afterstate is the board configuration after your move but before the opponent responds.
- Valuing afterstates simplifies learning because different state-action pairs that lead to the same resulting position share a single value estimate.

Special Cases:
- In some games, like tic-tac-toe, evaluating the board directly after each move (the afterstate) reduces the number of distinct values to learn compared with evaluating every state-action pair (a small move-selection sketch follows below).
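
A minimal sketch of afterstate-based move selection for tic-tac-toe; the board encoding and helper names are assumptions made for illustration:

```python
from collections import defaultdict

# V maps an afterstate (the board as a tuple, right after our move) to a learned value.
V = defaultdict(float)

def legal_moves(board):
    """Indices of empty cells on a 9-cell board (assumed encoding: ' ', 'X', 'O')."""
    return [i for i, cell in enumerate(board) if cell == " "]

def best_move(board, player="X"):
    """Pick the move whose resulting afterstate has the highest learned value."""
    def afterstate(move):
        b = list(board)
        b[move] = player
        return tuple(b)
    return max(legal_moves(board), key=lambda m: V[afterstate(m)])

board = tuple(" " * 9)
print(best_move(board))  # with an untrained V, ties resolve to the first empty cell (0)
```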

---

These examples illustrate each concept with a practical scenario.
