RL UNIT - II
1. State Space (S): A set of all possible states the agent can be in. Each state provides
information about the environment and is used to make decisions.
2. Action Space (A): A set of all possible actions the agent can take. The action selected
by the agent influences the next state and the reward received.
3. Transition Probability Function (P): A function P(s′|s, a) that gives the probability of
moving to state s′ when action a is taken in state s.
4. Reward Function (R): A function that defines the immediate reward received after
transitioning from one state to another due to an action. It is typically written as
R(s, a, s′), the reward for taking action a in state s and landing in state s′.
5. Discount Factor (γ): A scalar γ ∈ [0,1] that determines how much the agent values
future rewards relative to immediate ones. A higher discount factor means the agent
values future rewards more.
Objective in an MDP
The goal in an MDP is to find an optimal policy π that maximizes the expected
cumulative reward. A policy π(a|s) is a mapping from states to probabilities of taking
each action in that state. The objective is to find a policy π∗ that maximizes the expected
return (discounted cumulative reward), defined as:
G_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … = Σ_{k=0}^{∞} γ^k r_{t+k+1}
Solving an MDP
To solve an MDP, we often calculate the state value function V^π(s) and the action value
function Q^π(s, a), which are core concepts in RL and are defined in detail below.
Approaches to Solve MDPs in Reinforcement Learning
1. Dynamic Programming:
o Value Iteration: Directly calculates the optimal value function and derives
the optimal policy.
o Policy Iteration: Alternates between evaluating the current policy and improving
it until the policy no longer changes.
2. Model-Free Methods:
o Q-Learning: Learns the action value function directly from experience, without
a model of the environment.
o Policy Gradient Methods: Optimize the policy directly rather than learning
value functions, making them suitable for continuous action spaces.
Example of an MDP Problem: Grid World
Consider a simple grid world where an agent navigates in a grid to reach a goal. The states
are the grid cells, actions are the possible moves (up, down, left, right), and the reward is
given for reaching the goal. The agent’s task is to learn a policy that maximizes its
cumulative reward by reaching the goal in the fewest steps.
This example shows how MDPs can be used to model sequential decision-making
problems and how RL techniques help find optimal solutions.
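As a concrete illustration, the sketch below shows one way the grid world above might be encoded in Python. The grid size, goal cell, step penalty, and the step helper are illustrative assumptions rather than a fixed specification.

# Minimal grid-world MDP sketch (grid size, goal, and rewards are assumptions).
# States are (row, col) cells; actions are the four moves; transitions are
# deterministic, and moving off the grid leaves the agent in place.

GRID_SIZE = 4                      # assumed 4x4 grid
GOAL = (3, 3)                      # assumed goal cell
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

STATES = [(r, c) for r in range(GRID_SIZE) for c in range(GRID_SIZE)]

def step(state, action):
    """Return (next_state, reward) for taking `action` in `state`."""
    if state == GOAL:              # terminal state: the episode has ended
        return state, 0.0
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < GRID_SIZE and 0 <= c < GRID_SIZE):
        r, c = state               # bumping into a wall keeps the agent in place
    reward = 1.0 if (r, c) == GOAL else -0.04   # assumed step penalty to favour short paths
    return (r, c), reward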
1. Policy (π)
A policy defines the behavior of an agent at any given state. It’s a mapping from each state
to a probability distribution over possible actions the agent can take in that state. In
simple terms, a policy determines the action the agent should take when it’s in a particular
state.
• Deterministic Policy: A policy that maps a state to a specific action with certainty.
π(s)=a
This means in state s, the agent always takes action a.
• Stochastic Policy: A policy that gives a probability for each action in each state:
π(a|s) = P(A_t = a | S_t = s)
This means in state s, the agent has probability π(a|s) of choosing each possible action a. This
is useful in environments with uncertain outcomes or when exploration is beneficial (see the
sketch below).
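The following sketch illustrates the difference in Python: a deterministic policy is a plain state-to-action mapping, while a stochastic policy is a distribution that is sampled. The state names, actions, and probabilities are made-up examples.

import random

# Deterministic policy: each state maps to a single action.
deterministic_policy = {"s1": "right", "s2": "up"}            # illustrative states/actions

# Stochastic policy: each state maps to a distribution over actions, pi(a|s).
stochastic_policy = {
    "s1": {"right": 0.8, "up": 0.2},
    "s2": {"right": 0.5, "up": 0.5},
}

def select_action(policy, state):
    """Pick an action: directly for a deterministic policy, by sampling otherwise."""
    choice = policy[state]
    if isinstance(choice, str):                               # deterministic: pi(s) = a
        return choice
    actions, probs = zip(*choice.items())                     # stochastic: sample from pi(.|s)
    return random.choices(actions, weights=probs, k=1)[0]

print(select_action(deterministic_policy, "s1"))  # always "right"
print(select_action(stochastic_policy, "s1"))     # "right" about 80% of the time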
2. Value Functions
Value functions are crucial in RL as they estimate how good it is for the agent to be in a
given state (or state-action pair) while following a policy. They quantify the expected future
rewards that can be obtained from a state or an action.
a. State Value Function
The state value function V^π(s) represents the expected cumulative reward the agent
can achieve, starting from state s and following policy π thereafter. Formally, it is defined
as:
V^π(s) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s ]
b. Action Value Function
The action value function Q^π(s, a) represents the expected cumulative reward starting
from state s, taking action a, and following policy π thereafter. Formally:
Q^π(s, a) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a ]
This function is useful when evaluating the quality of individual actions rather than states,
making it foundational in action-based RL algorithms like Q-learning.
3. Bellman Equations
The Bellman equations provide a recursive relationship for both the state value function
V^π(s) and the action value function Q^π(s, a), allowing these values to be computed
iteratively.
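In standard MDP notation (transition probabilities P(s'|s, a), rewards R(s, a, s'), and discount factor γ), the Bellman expectation equations take the form:
V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ V^π(s') ]
Q^π(s, a) = Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ Σ_{a'} π(a'|s') Q^π(s', a') ]
Each equation expresses a value as the expected immediate reward plus the discounted value of what follows, which is exactly what makes iterative computation possible.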
REWARD MODELS (INFINITE DISCOUNTED)
In reinforcement learning, reward models describe how rewards are accumulated and
how much weight is given to future rewards. One commonly used reward model in RL is
the infinite discounted reward model. This model applies a discount factor to future
rewards to prioritize short-term rewards over long-term ones while still considering the
future to some degree.
The infinite discounted reward model involves an agent interacting with an environment
over an infinite time horizon, with future rewards being discounted by a factor γ at each
step. The purpose of this model is to capture the idea that immediate rewards are generally
more valuable than distant rewards, allowing the agent to balance short-term gains with
long-term goals.
Key Components
1. Discount Factor γ:
o A scalar γ ∈ [0,1) that controls how strongly future rewards are discounted; values
close to 1 make the agent far-sighted, while values close to 0 make it focus on
immediate rewards.
2. Return G_t:
o The return G_t is the cumulative discounted reward starting from a given time step t
(a small computational sketch follows this list):
G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}
o Here, r_{t+k+1} is the reward received k steps into the future, and γ^k is the discount
applied to it.
3. Objective:
o The agent's goal is to maximize the expected return. This means choosing actions such
that the cumulative reward over time, considering the discount, is as high as possible.
4. Value Function:
o For a given policy π, the state value function in the infinite discounted
reward model is the expected cumulative discounted reward starting from
state s:
V^π(s) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s ]
o The value function captures the long-term benefit of being in state s and
following policy π.
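As a quick illustration, the snippet below computes the discounted return for a short, made-up reward sequence; the reward values and discount factors are assumptions chosen only to show the effect of γ.

# Compute the discounted return G_t = sum_k gamma^k * r_{t+k+1}
# for an illustrative, finite sequence of future rewards.

def discounted_return(rewards, gamma):
    """Sum rewards with exponentially decaying weights gamma^k."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 0.0, 5.0]                 # r_{t+1}, r_{t+2}, r_{t+3}, r_{t+4}
print(discounted_return(rewards, gamma=0.9))   # 1 + 0.9**3 * 5 = 4.645
print(discounted_return(rewards, gamma=0.5))   # 1 + 0.5**3 * 5 = 1.625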
The Bellman equations in the infinite discounted reward model describe how the value
of a state or state-action pair can be computed recursively. They serve as the foundation
for many RL algorithms. For the state value function:
V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ V^π(s') ]
This equation states that the value of state s under policy π is the expected immediate reward
plus the discounted value of the next state. An analogous equation holds for the action value
function Q^π(s, a).
These equations help to iteratively compute the value functions, which are used to
determine the best actions to take for maximizing the expected cumulative reward.
To achieve the maximum expected return over an infinite horizon, the agent aims to find
an optimal policy π∗, leading to the optimal value functions:
V∗(s) = max_π V^π(s) and Q∗(s, a) = max_π Q^π(s, a)
Summary
The infinite discounted reward model helps balance short- and long-term rewards by
discounting future rewards. It forms the basis for much of RL, allowing agents to achieve
goals over time in environments that are not strictly time-limited. The discount factor
plays a key role in shaping agent behavior, determining how much importance it places
on immediate versus future rewards.
REWARD MODELS (TOTAL REWARD)
1. Total Reward:
o The total reward is simply the sum of rewards the agent collects over a fixed
number of steps or until it reaches a terminal state:
G_t = Σ_{k=0}^{T−1} r_{t+k+1}
o Here, T is the finite time horizon, or the maximum number of steps the agent
will take before the episode ends, and r_{t+k+1} is the reward received at each step.
2. Goal:
o The agent's objective is to maximize the sum of rewards over this finite time
horizon, as there’s no discounting involved. All rewards are valued equally,
whether they’re received early or late in the process.
3. Applications:
o This model is ideal for tasks with a clear ending, like completing a puzzle or
reaching a destination within a limited time.
o It's also useful in situations where every action and its reward have the same
importance, so there’s no need to prioritize one reward over another.
4. Value Function:
o V^π(s) = E_π[ Σ_{k=0}^{T−1} r_{t+k+1} | s_t = s ] represents the expected total reward
starting from state s and following policy π for a finite horizon.
o It helps the agent evaluate which states are likely to yield a high cumulative
reward by the end of the episode.
5. Action Value Function:
o Q^π(s, a) = E_π[ Σ_{k=0}^{T−1} r_{t+k+1} | s_t = s, a_t = a ] represents the expected total
reward starting from state s, taking action a, and then following policy π.
o This guides the agent in choosing actions that maximize the total reward over
the finite time horizon.
Why No Discounting?
• In the total reward model, all rewards are equally valuable, meaning there’s no need
to prioritize one over the other.
• This model is more straightforward than discounted models, as it doesn’t reduce the
importance of future rewards.
Summary
In the total reward model, the agent aims to maximize the sum of all rewards over a finite
period without discounting. This model is suited for tasks with fixed endpoints, allowing
the agent to focus on collecting the highest possible reward within that defined timeframe.
REWARD MODELS (FINITE HORIZON)
1. Reward Calculation:
o The total reward in a finite horizon model is the sum of rewards collected
from the start until the end of the horizon T:
G = r_1 + r_2 + … + r_T = Σ_{t=1}^{T} r_t
2. Goal:
o The agent aims to maximize the cumulative reward over this limited
horizon, which means making decisions that optimize performance within a
fixed number of steps.
3. Applications:
o This model is ideal for tasks with a fixed length, like short-term objectives
(e.g., games with a limited number of moves or episodes that have a time
constraint).
4. Time-Dependent Policies:
o In finite horizon problems, the policy and value functions are sometimes
time-dependent. The policy might vary as the agent approaches the end of the
horizon, since it may prioritize actions differently with limited steps remaining
(see the sketch below).
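The sketch below illustrates this time dependence with backward induction on a small, made-up two-state MDP (the transition structure, rewards, and horizon are assumptions): values are computed from the final step backwards, so the greedy action can in principle differ at each remaining time step.

# Backward induction for a finite-horizon MDP (no discounting).
# P[s][a] is a list of (probability, next_state, reward) triples; all values are made up.

P = {
    "s0": {"stay": [(1.0, "s0", 1.0)], "go": [(0.5, "s0", 0.0), (0.5, "s1", 5.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)], "go": [(1.0, "s0", 0.0)]},
}
T = 3                                             # finite horizon of 3 steps

# V[t][s] = best expected total reward collectible from state s with T - t steps left.
V = [{s: 0.0 for s in P} for _ in range(T + 1)]   # V[T] = 0: no steps remain
policy = [dict() for _ in range(T)]

for t in range(T - 1, -1, -1):                    # sweep backwards in time
    for s in P:
        action_values = {
            a: sum(p * (r + V[t + 1][s2]) for p, s2, r in outcomes)
            for a, outcomes in P[s].items()
        }
        policy[t][s] = max(action_values, key=action_values.get)
        V[t][s] = action_values[policy[t][s]]

print(V[0])        # expected total reward from each state over the full horizon
print(policy[0])   # greedy first action; in general it may change as fewer steps remain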
REWARD MODELS (AVERAGE REWARD)
The average reward model is used when the agent operates in an ongoing or cyclic
environment without a defined endpoint. Instead of focusing on maximizing the total
reward over a finite period, the agent aims to maximize the average reward per time step
over the long run.
1. Reward Calculation:
o The average reward ρ is defined as the long-term average of the rewards per
step (estimated numerically in the sketch after this list):
ρ = lim_{T→∞} (1/T) E[ Σ_{t=1}^{T} r_t ]
o This measures the steady rate of rewards over time, aiming to make the
reward per step as high as possible in the long term.
2. Goal:
o The agent aims to maximize the average reward per step indefinitely,
focusing on consistent, ongoing performance rather than short-term gains.
3. Applications:
o This model is suitable for continuous, ongoing tasks like industrial processes,
inventory management, or any situation where the task has no natural ending
and performance needs to be sustained.
4. Value Functions:
o In the average reward model, value functions focus on deviations from the
average reward. The value function in this context can be defined in terms of
the differential value function, which measures how much better or worse
each state is compared to the average reward.
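The sketch below estimates the average reward per step by simulating a made-up two-state cyclic environment for many steps; the states and reward distributions are illustrative assumptions.

import random

# Estimate the average reward rho = lim (1/T) * sum of rewards by running a
# long simulation of a toy cyclic environment.

def step(state):
    """Toy environment: alternate between two states with noisy rewards."""
    if state == "busy":
        return "idle", random.gauss(2.0, 0.5)     # assumed reward distribution
    return "busy", random.gauss(0.5, 0.1)

state, total, T = "idle", 0.0, 100_000
for _ in range(T):
    state, reward = step(state)
    total += reward

print("estimated average reward per step:", total / T)   # about (2.0 + 0.5) / 2 = 1.25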
Summary Comparison
Model | Objective | Horizon/End Condition | Suitable For
Finite Horizon | Maximize cumulative reward over limited steps | Fixed horizon T | Tasks with clear time/step limits
Average Reward | Maximize average reward per step over the long term | Infinite horizon | Ongoing, repetitive, or cyclic tasks
In summary, the finite horizon model focuses on achieving the best performance within
a set time frame, while the average reward model emphasizes sustaining high rewards
per step in an ongoing process. Both reward models are helpful in different types of RL
tasks depending on the structure and goals of the environment.
Episodic Tasks
Episodic tasks are tasks that consist of distinct episodes, each with a clear beginning and
end. An episode terminates when the agent reaches a terminal state, which could
represent completing a task or failing it. Once an episode ends, the environment resets,
and the agent begins a new episode from the start state.
Key Characteristics
1. Clear Endpoints:
o Each episode has a defined endpoint where the task terminates, and the agent
starts over in a new episode.
2. Finite Length:
o Each episode lasts a finite number of steps before it reaches a terminal state.
3. Objective:
o The goal in episodic tasks is to maximize the cumulative reward over each
episode.
o The agent learns a policy that optimizes performance for each episode as a
whole.
4. Examples:
o Games like chess or tic-tac-toe, where the game ends when someone wins or
there’s a draw.
o Robotics tasks, like picking up an object, where the episode ends once the
task is completed or fails.
5. Discount Factor:
o Episodic tasks often use discount factors close to 1, especially if they are
relatively short.
Continuing Tasks
Continuing tasks (or infinite-horizon tasks) are tasks that have no natural endpoint and
continue indefinitely. In these tasks, the agent keeps interacting with the environment
without resetting, and there is no terminal state.
Key Characteristics
1. No Terminal State:
o The agent never "finishes" the task; it keeps acting in the environment without
an end.
2. Infinite Horizon:
o Since the task doesn’t end, it’s modeled over an infinite time horizon.
3. Objective:
o The goal in continuing tasks is to maximize the long-term reward. This can
be achieved in two main ways:
▪ Average reward: Maximizing the average reward per step over time.
▪ Discounted reward: Maximizing the expected discounted return with a
discount factor γ < 1.
4. Examples:
o Ongoing control problems such as industrial process control or inventory
management, which run indefinitely with no terminal state.
5. Discount Factor:
o A discount factor γ<1 is essential for continuing tasks to ensure the agent
prioritizes more immediate rewards over far-future rewards, keeping the total
return finite.
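One way to see why this keeps the return finite: if every reward is bounded by some r_max, the discounted return is bounded by a geometric series,
Σ_{k=0}^{∞} γ^k r_{t+k+1} ≤ r_max Σ_{k=0}^{∞} γ^k = r_max / (1 − γ)
so, for example, with r_max = 1 and γ = 0.9 the return can never exceed 10.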
Summary of Differences
• Episodic models are preferred when tasks have clear success or failure conditions,
allowing the agent to reset and learn from each episode.
• Continuing models are better suited for ongoing tasks where there’s no clear
endpoint, and performance needs to be sustained over time.
In RL, understanding whether a task is episodic or continuing shapes the way agents are
trained and how they evaluate and balance immediate versus future rewards.
BELLMAN OPTIMALITY OPERATOR
Key Concepts
• Optimal state value function V∗(s), which represents the best expected return achievable
from state s under any policy.
• Optimal action value function Q∗(s, a), which represents the best expected return for each
state-action pair under any policy.
The Bellman optimality equation defines the optimal value function for a state s as
follows:
V∗(s) = max_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ V∗(s') ]
where:
o P(s'|s, a) is the probability of moving to state s' after taking action a in state s,
o R(s, a, s') is the immediate reward for that transition, and
o γ is the discount factor.
Bellman's optimality operator T applies this max-over-actions backup to an arbitrary value
function V:
(TV)(s) = max_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ V(s') ]
This operator, when applied to any value function V, produces a new value function that
is "closer" to V∗.
1. Contraction Mapping:
o T is a contraction in the max norm: ||TV − TU||∞ ≤ γ ||V − U||∞ for any value
functions V and U, so repeated application converges to a unique fixed point.
2. Fixed Point:
o The optimal value function V∗ is the fixed point of the Bellman optimality
operator, i.e., TV∗ = V∗.
Bellman’s optimality operator is the foundation of value iteration, a key algorithm for
finding optimal policies in RL. In value iteration, we repeatedly apply the Bellman
optimality operator to an initial value function until it converges to V∗.
1. Initialize: Start with an initial guess for V(s) (often zero for all states).
2. Update: Apply the Bellman optimality operator to the current value function, V ← TV,
updating every state's value with the max-over-actions backup.
3. Convergence: Stop when V converges to a fixed point within a small error margin
(i.e., it approximates V∗).
After finding V∗, we can derive the optimal policy π∗ by choosing actions that maximize
the expected return in each state.
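The sketch below implements the operator T on a tiny, made-up MDP and applies it to two different value functions; the gap between them shrinks by a factor of at most γ on each application, which is the contraction property in action. The transition table and rewards are assumptions.

# Bellman optimality operator T on a tiny MDP.
# P[s][a] is a list of (probability, next_state, reward) triples; values are made up.

GAMMA = 0.9
P = {
    "s0": {"a0": [(1.0, "s0", 0.0)], "a1": [(1.0, "s1", 1.0)]},
    "s1": {"a0": [(1.0, "s0", 2.0)], "a1": [(1.0, "s1", 0.0)]},
}

def bellman_operator(V):
    """(TV)(s) = max_a sum_{s'} P(s'|s,a) [ R(s,a,s') + gamma * V(s') ]."""
    return {
        s: max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in outcomes)
               for outcomes in P[s].values())
        for s in P
    }

def max_norm_distance(V, U):
    return max(abs(V[s] - U[s]) for s in V)

V = {s: 0.0 for s in P}            # one arbitrary starting guess
U = {s: 10.0 for s in P}           # a different starting guess
for _ in range(3):
    # each application shrinks the gap by at least a factor of gamma (contraction)
    print(max_norm_distance(V, U))
    V, U = bellman_operator(V), bellman_operator(U)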
Summary
• Bellman’s optimality operator T is a tool for computing the optimal value function
in reinforcement learning.
This operator underlies many RL algorithms and provides a way to approach optimal
policies and decision-making in sequential environments.
VALUE ITERATION
Value iteration is a fundamental algorithm in reinforcement learning (RL) used to find
the optimal policy for a Markov Decision Process (MDP). It works by iteratively improving
the estimated values of each state until they converge to the optimal value function V∗,
from which the optimal policy is derived.
1. Goal:
o The goal of value iteration is to compute the optimal value function V∗ for
each state, from which we can derive the optimal policy.
2. Bellman Optimality Equation:
o For any state s, the Bellman optimality equation for V∗(s) is:
V∗(s) = max_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ V∗(s') ]
o Here, P(s'|s, a) is the transition probability, R(s, a, s') is the immediate reward,
and γ is the discount factor.
1. Initialize:
o Start with an initial guess for the value function V(s) for all states s. This could
be zero or random values.
2. Iterative Update:
o For each state s, update V(s) using the Bellman optimality equation:
V(s) ← max_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ V(s') ]
o This step is repeated for all states in the MDP. The update brings V(s) closer
to the optimal value V∗(s) after each iteration.
3. Convergence:
o Continue updating V(s) for each state until the values converge, meaning the
changes in V(s) become smaller than a defined threshold (e.g., an error
tolerance θ).
o After convergence, we can derive the optimal policy π∗ by choosing, in each state,
the action that maximizes the expected return:
π∗(s) = argmax_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ V∗(s') ]
Suppose we have a simple MDP with three states and two actions. We initialize the values
V(s) for each state to zero. We repeatedly apply the Bellman update for each state,
calculating the maximum expected return for each action, until V(s) converges.
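A minimal value-iteration sketch in this spirit is shown below, using a made-up three-state, two-action MDP (the transition probabilities, rewards, discount factor, and tolerance are assumptions):

# Value iteration on a small, made-up MDP with three states and two actions.
# P[s][a] is a list of (probability, next_state, reward) triples.

GAMMA = 0.9
THETA = 1e-6                                   # convergence tolerance
P = {
    "s0": {"a0": [(0.8, "s1", 0.0), (0.2, "s0", 0.0)], "a1": [(1.0, "s0", 0.1)]},
    "s1": {"a0": [(1.0, "s2", 1.0)],                   "a1": [(1.0, "s0", 0.0)]},
    "s2": {"a0": [(1.0, "s2", 0.0)],                   "a1": [(1.0, "s0", 2.0)]},
}

V = {s: 0.0 for s in P}                        # step 1: initialise V(s) = 0
while True:
    delta = 0.0
    for s in P:                                # step 2: Bellman optimality update
        best = max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in outs)
                   for outs in P[s].values())
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < THETA:                          # step 3: stop when changes are tiny
        break

# Derive the greedy (optimal) policy from the converged values.
policy = {
    s: max(P[s], key=lambda a: sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a]))
    for s in P
}
print(V)
print(policy)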
Value iteration uses the Bellman optimality operator, which is a contraction mapping,
meaning it brings the value function closer to the optimal solution with each application.
This guarantees that the algorithm converges to V∗.
Advantages:
• Guaranteed Convergence: Because the Bellman optimality update is a contraction,
value iteration converges to the optimal value function for finite MDPs.
• Simplicity: The algorithm is conceptually simple and does not require maintaining an
explicit policy while the values are being computed.
Disadvantages:
• Computationally Intensive for Large State Spaces: Calculating the update for
each state-action pair can be very slow for large or continuous MDPs.
• Not Suitable for Continuous Action Spaces: Value iteration is challenging to apply
directly in continuous spaces without approximations.
Summary
Value iteration computes V∗ by repeatedly applying the Bellman optimality update to every
state and then reads off the optimal policy greedily. It is reliable for small, discrete MDPs but
becomes expensive for very large or continuous problems.
POLICY ITERATION
Policy iteration is another dynamic programming algorithm for solving MDPs. It alternates
between two steps, policy evaluation and policy improvement, until the policy no longer
changes.
1. Policy Evaluation:
o Given a policy π, policy evaluation calculates the value function Vπ for that
policy, which represents the expected cumulative reward if the agent follows
π.
o This step involves solving a set of equations for all states to find the values of
Vπ.
2. Policy Improvement:
o Using Vπ, the policy is improved by acting greedily: in each state, the agent
switches to the action with the highest expected return under the current value
estimates.
Steps of the Algorithm
1. Initialize:
o Start with an arbitrary policy π (for example, one that picks actions at random).
2. Policy Evaluation:
o Compute the value function Vπ for the current policy π by solving the
Bellman expectation equation:
Vπ(s) = Σ_a π(a|s) Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ Vπ(s') ]
3. Policy Improvement:
o For each state s, update the policy by choosing the action a that maximizes
the expected return, based on Vπ:
π′(s) = argmax_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ Vπ(s') ]
o If the updated policy π′ is the same as the old policy π, then π is optimal, and
the algorithm stops.
4. Repeat:
o Otherwise, set π ← π′ and repeat policy evaluation and policy improvement
with the new policy.
Example Walkthrough
1. Initialization: Start with an initial policy, say π0, where the agent picks random
actions.
2. Policy Evaluation: Compute Vπ0, the expected return of following π0 from every state.
3. Policy Improvement: In each state, switch to the action that looks best under Vπ0,
producing an improved policy π1.
4. Repeat: Continue with policy evaluation and policy improvement until no further
improvements are possible.
Policy iteration guarantees convergence to the optimal policy. After each policy
improvement step, the updated policy is either the same as the previous one or strictly
better. Since there are only a finite number of possible policies, the algorithm must
converge to the optimal policy after a finite number of steps.
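A minimal policy-iteration sketch is shown below; it uses iterative policy evaluation rather than an exact linear solve, and the two-state MDP, rewards, and tolerance are made-up assumptions.

# Policy iteration on a small, made-up MDP: alternate policy evaluation
# (iterative, to a tolerance) with greedy policy improvement.

GAMMA, THETA = 0.9, 1e-6
P = {
    "s0": {"a0": [(1.0, "s1", 0.0)], "a1": [(1.0, "s0", 0.1)]},
    "s1": {"a0": [(1.0, "s1", 0.0)], "a1": [(1.0, "s0", 1.0)]},
}

def q_value(V, s, a):
    """Expected return of taking action a in state s, then following V's policy."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def evaluate(policy):
    """Iteratively solve V^pi for the current (deterministic) policy."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = q_value(V, s, policy[s])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < THETA:
            return V

policy = {s: "a0" for s in P}                                  # 1. arbitrary initial policy
while True:
    V = evaluate(policy)                                        # 2. policy evaluation
    improved = {s: max(P[s], key=lambda a: q_value(V, s, a))    # 3. greedy improvement
                for s in P}
    if improved == policy:                                      # 4. stop when the policy is stable
        break
    policy = improved

print(policy, V)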
Advantages:
• More Efficient than Value Iteration in Some Cases: Policy iteration often converges
after only a few policy improvement steps, since each improvement produces a strictly
better policy.
Disadvantages:
• Policy Evaluation Can Be Expensive: For large state spaces, calculating Vπ exactly
can be computationally expensive, making policy iteration less efficient.
• Less Suitable for Large or Continuous State Spaces: Policy iteration relies on
explicit calculations for each state, which can be impractical for very large or
continuous state spaces.
Comparison to Value Iteration
Aspect | Policy Iteration | Value Iteration
Efficiency | May converge faster for some problems but can be costly in large spaces | Often more efficient for large state spaces
Summary
Policy iteration is an effective algorithm for solving MDPs by iteratively evaluating and
improving policies. Its alternating structure makes it highly efficient for small to medium-
sized problems and guarantees convergence to the optimal policy, making it a cornerstone
technique in RL.