RL UNIT - II

A Markov Decision Process (MDP) is a mathematical framework in Reinforcement Learning that models decision-making problems where outcomes are influenced by both randomness and the agent's actions. Key components of an MDP include the state space, action space, transition probabilities, reward function, and discount factor, with the goal of finding an optimal policy that maximizes expected cumulative rewards. Various methods, such as dynamic programming and model-free methods like Q-learning, are used to solve MDPs and determine optimal policies.


MARKOV DECISION PROBLEM

In Reinforcement Learning (RL), a Markov Decision Process (MDP) provides a mathematical framework for modeling decision-making problems where outcomes are partly random and partly under the control of a decision-maker (agent). MDPs are foundational in RL because they formalize how an agent interacts with an environment to maximize some notion of cumulative reward over time.

Key Components of an MDP

An MDP is defined by the tuple (S, A, P, R, γ):

1. State Space (S): A set of all possible states the agent can be in. Each state provides
information about the environment and is used to make decisions.

2. Action Space (A): A set of all possible actions the agent can take. The action selected
by the agent influences the next state and the reward received.

3. Transition Probability (P): A probability distribution that defines the likelihood of transitioning from one state to another given a specific action. Mathematically, this is represented as:

P(s′ ∣ s, a) = Pr(S_{t+1} = s′ ∣ S_t = s, A_t = a)

where s and s′ are states, and a is an action.

4. Reward Function (R): A function that defines the immediate reward received after transitioning from one state to another due to an action. It’s represented as:

R(s, a) = E[ r_{t+1} ∣ S_t = s, A_t = a ]

where r_{t+1} is the reward received after taking action a in state s.

5. Discount Factor (γ): A scalar γ ∈ [0,1] that controls how strongly the agent weighs future rewards relative to immediate ones. A higher discount factor means the agent
values future rewards more.
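
As a small illustrative sketch (the state names, actions, probabilities, and rewards below are invented, not taken from the text), the tuple (S, A, P, R, γ) can be written down directly in Python:

# A tiny MDP (S, A, P, R, gamma) with made-up numbers, just to show the structure.
S = ["s0", "s1"]                          # state space
A = ["stay", "go"]                        # action space
P = {                                     # P[(s, a)] = {s': probability}
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s1": 0.9, "s0": 0.1},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 1.0},
}
R = {                                     # R[(s, a)] = expected immediate reward
    ("s0", "stay"): 0.0, ("s0", "go"): 1.0,
    ("s1", "stay"): 2.0, ("s1", "go"): 0.0,
}
gamma = 0.9                               # discount factor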

Objective in an MDP

The goal in an MDP is to find an optimal policy π that maximizes the expected cumulative reward. A policy π(a∣s) is a mapping from states to probabilities of taking each action in that state. The objective is to find a policy π∗ that maximizes the expected return (discounted cumulative reward), defined as:

G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1} = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ⋯

Solving an MDP

To solve an MDP, we often calculate the state value function and the action-value function, which are core concepts in RL.

Approaches to Solve MDPs in Reinforcement Learning

1. Dynamic Programming:

o Policy Iteration: Iteratively evaluates and improves a policy until it converges to the optimal policy.

o Value Iteration: Directly calculates the optimal value function and derives the optimal policy.

2. Model-Free Methods:

o Q-Learning: Learns the optimal action-value function Q(s,a) without needing a model of the environment (transition probabilities).

o SARSA (State-Action-Reward-State-Action): On-policy learning where the agent updates its action-value function based on the policy it is following.

3. Approximate Solutions for Large State Spaces:

o Deep Q-Networks (DQN): Uses neural networks to approximate the action-value function in large or continuous state spaces.

o Policy Gradient Methods: Optimize the policy directly rather than learning
value functions, making them suitable for continuous action spaces.
Example of an MDP Problem: Grid World

Consider a simple grid world where an agent navigates in a grid to reach a goal. The states
are the grid cells, actions are the possible moves (up, down, left, right), and the reward is
given for reaching the goal. The agent’s task is to learn a policy that maximizes its
cumulative reward by reaching the goal in the fewest steps.

This example shows how MDPs can be used to model sequential decision-making
problems and how RL techniques help find optimal solutions.
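
A minimal Python sketch of such a grid world, together with a tabular Q-learning loop (one of the model-free methods listed earlier). The grid size, goal location, reward values (+10 at the goal, -1 per step), and hyperparameters are assumptions chosen only for illustration:

import random

# Hypothetical 4x4 grid world: states are (row, col) cells, the goal is the bottom-right cell.
ROWS, COLS = 4, 4
GOAL = (3, 3)
ACTIONS = ["up", "down", "left", "right"]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Deterministic transition: move if possible, stay put at walls; +10 for reaching the goal, -1 otherwise."""
    dr, dc = MOVES[action]
    r = min(max(state[0] + dr, 0), ROWS - 1)
    c = min(max(state[1] + dc, 0), COLS - 1)
    next_state = (r, c)
    reward = 10.0 if next_state == GOAL else -1.0
    return next_state, reward, next_state == GOAL

# Tabular Q-learning (assumed hyperparameters: alpha=0.1, gamma=0.9, epsilon=0.1).
Q = {((r, c), a): 0.0 for r in range(ROWS) for c in range(COLS) for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(500):
    state = (0, 0)
    done = False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# Greedy action extracted from the learned Q-values for the start state.
print(max(ACTIONS, key=lambda a: Q[((0, 0), a)]))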

MARKOV DECISION POLICY AND VALUE FUNCTION


In Reinforcement Learning (RL), policy and value functions are two essential concepts
within the Markov Decision Process (MDP) framework. Together, they provide a structured
way for an agent to make decisions that maximize long-term rewards.

1. Policy (π)

A policy defines the behavior of an agent at any given state. It’s a mapping from each state
to a probability distribution over possible actions the agent can take in that state. In
simple terms, a policy determines the action the agent should take when it’s in a particular
state.

• Deterministic Policy: A policy that maps a state to a specific action with certainty.

π(s)=a
This means in state s, the agent always takes action a.

• Stochastic Policy: A policy that gives probabilities for each action in each state.

π(a∣s) = Pr(A_t = a ∣ S_t = s)

This means in state s, the agent has a probability of choosing each possible action a. This
is useful in environments with uncertain outcomes or when exploration is beneficial.
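
A minimal Python sketch of the two policy types, using hypothetical state and action names chosen only for illustration:

# Deterministic policy: each state maps to exactly one action, pi(s) = a.
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: each state maps to a probability distribution over actions, pi(a|s).
stochastic_policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.5, "right": 0.5},
}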

2. Value Functions

Value functions are crucial in RL as they estimate how good it is for the agent to be in a
given state (or state-action pair) while following a policy. They quantify the expected future
rewards that can be obtained from a state or an action.

There are two primary types of value functions:

a. State Value Function Vπ(s)

The state value function Vπ(s) represents the expected cumulative reward the agent can achieve, starting from state s and following policy π thereafter. Formally, it’s defined as:

Vπ(s) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} ∣ S_t = s ]

b. Action Value Function Qπ(s, a)

The action value function Qπ(s, a) represents the expected cumulative reward starting from state s, taking action a, and following policy π thereafter. Formally:

Qπ(s, a) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} ∣ S_t = s, A_t = a ]

This function is useful when evaluating the quality of individual actions rather than states, making it foundational in action-based RL algorithms like Q-learning.

3. Bellman Equations

The Bellman equations provide a recursive relationship for both the state value function Vπ(s) and the action value function Qπ(s, a), allowing these values to be computed iteratively.
REWARD MODELS (INFINITE DISCOUNTED)

In reinforcement learning, reward models describe how rewards are accumulated and
how much weight is given to future rewards. One commonly used reward model in RL is
the infinite discounted reward model. This model applies a discount factor to future
rewards to prioritize short-term rewards over long-term ones while still considering the
future to some degree.

Infinite Discounted Reward Model

The infinite discounted reward model involves an agent interacting with an environment
over an infinite time horizon, with future rewards being discounted by a factor γ at each
step. The purpose of this model is to capture the idea that immediate rewards are generally
more valuable than distant rewards, allowing the agent to balance short-term gains with
long-term goals.

Key Components

1. Discount Factor γ:

o The discount factor γ is a constant value between 0 and 1: 0 ≤ γ ≤ 1.

o γ determines the weight of future rewards.

▪ When γ is close to 1, the agent values future rewards almost as much as immediate ones.

▪ When γ is close to 0, the agent heavily favors immediate rewards and largely ignores future rewards.

2. Return G_t:

o The return G_t is the cumulative reward starting from a given time step t.

o In an infinite discounted reward model, the return at time t is calculated as:

G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1} = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ⋯

o Here, r_{t+k+1} is the reward received k steps into the future, and γ^k is the discount applied to it.

o As k increases, γ^k diminishes the contribution of r_{t+k+1}, meaning rewards far in the future contribute less to the total return.

3. Objective in Infinite Discounted Reward Models:

o The agent’s goal in an infinite discounted reward setting is to find an optimal policy π∗ that maximizes the expected discounted cumulative reward (or expected return).

o This means choosing actions such that the cumulative reward over time,
considering the discount, is as high as possible.
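
As a small illustration (the reward sequence and γ value below are made-up numbers, not from the text), the discounted return can be computed directly from its definition:

# Discounted return G_t = sum_k gamma^k * r_{t+k+1} for an assumed reward sequence.
rewards = [1.0, 0.0, 0.0, 5.0]   # r_{t+1}, r_{t+2}, r_{t+3}, r_{t+4} (example values)
gamma = 0.9

G = sum((gamma ** k) * r for k, r in enumerate(rewards))
print(G)  # 1.0 + 0.9*0.0 + 0.81*0.0 + 0.729*5.0 = 4.645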

Value Functions in Infinite Discounted Reward Model


In an infinite discounted setting, value functions measure the expected return, helping
the agent assess the desirability of states and state-action pairs under a specific policy.

1. State Value Function Vπ(s):

o For a given policy π, the state value function in the infinite discounted reward model is the expected cumulative discounted reward starting from state s:

Vπ(s) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} ∣ S_t = s ]

o The value function captures the long-term benefit of being in state s and following policy π.

2. Action Value Function Qπ(s, a):

o The action value function Qπ(s, a) measures the expected cumulative discounted reward starting from state s, taking action a, and following policy π afterward:

Qπ(s, a) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} ∣ S_t = s, A_t = a ]

Bellman Equations in the Infinite Discounted Reward Model

The Bellman equations in the infinite discounted reward model describe how the value
of a state or state-action pair can be computed recursively. They serve as the foundation
for many RL algorithms.

1. Bellman Equation for Vπ(s):

Vπ(s) = Σ_a π(a∣s) Σ_{s′} P(s′∣s,a) [ R(s,a) + γ Vπ(s′) ]

This equation states that the value of state s under policy π is the expected immediate reward plus the discounted value of the next state.

2. Bellman Equation for Qπ(s, a):

Qπ(s, a) = Σ_{s′} P(s′∣s,a) [ R(s,a) + γ Σ_{a′} π(a′∣s′) Qπ(s′, a′) ]

These equations help to iteratively compute the value functions, which are used to
determine the best actions to take for maximizing the expected cumulative reward.

Solving for Optimal Policy in Infinite Discounted Models

To achieve the maximum expected return over an infinite horizon, the agent aims to find an optimal policy π∗, leading to the optimal value functions:

V∗(s) = max_π Vπ(s)        Q∗(s, a) = max_π Qπ(s, a)

Summary

The infinite discounted reward model helps balance short- and long-term rewards by
discounting future rewards. It forms the basis for much of RL, allowing agents to achieve
goals over time in environments that are not strictly time-limited. The discount factor
plays a key role in shaping agent behavior, determining how much importance it places
on immediate versus future rewards.

REWARD MODELS (TOTAL)


In reinforcement learning (RL), a total reward model (or "finite-horizon undiscounted
model") is used to measure the agent's performance when it interacts with an environment
over a finite number of steps. This model is especially helpful when there's a clear endpoint
in the agent's task, such as reaching a goal state, completing a game, or finishing a task
with a fixed duration.

Key Ideas in the Total Reward Model

1. Total Reward:

o The total reward is simply the sum of rewards the agent collects over a fixed number of steps or until it reaches a terminal state.

o It’s calculated as:

G_t = Σ_{k=0}^{T−1} r_{t+k+1} = r_{t+1} + r_{t+2} + ⋯ + r_{t+T}

o Here, T is the finite time horizon, or the maximum number of steps the agent will take before the episode ends. r_{t+k+1} is the reward received at each step.
2. Goal:

o The agent's objective is to maximize the sum of rewards over this finite time
horizon, as there’s no discounting involved. All rewards are valued equally,
whether they’re received early or late in the process.

3. Applications:

o This model is ideal for tasks with a clear ending, like completing a puzzle or
reaching a destination within a limited time.

o It's also useful in situations where every action and its reward have the same
importance, so there’s no need to prioritize one reward over another.

Value Functions in the Total Reward Model

1. State Value Function Vπ(s):

o This function represents the expected total reward starting from state s and following policy π for a finite horizon.

o It helps the agent evaluate which states are likely to yield a high cumulative reward by the end of the episode.

2. Action Value Function Qπ(s, a):

o This function represents the expected total reward starting from state s, taking action a, and then following policy π.

o This guides the agent in choosing actions that maximize the total reward over
the finite time horizon.

Why No Discounting?

• In the total reward model, all rewards are equally valuable, meaning there’s no need
to prioritize one over the other.

• This model is more straightforward than discounted models, as it doesn’t reduce the
importance of future rewards.

Summary

In the total reward model, the agent aims to maximize the sum of all rewards over a finite
period without discounting. This model is suited for tasks with fixed endpoints, allowing
the agent to focus on collecting the highest possible reward within that defined timeframe.

REWARD MODELS (FINITE HORIZON AND AVERAGE)


In reinforcement learning (RL), finite horizon and average reward models are two
different approaches for evaluating an agent's performance over time. They are often used
when there is a specific goal or endpoint in the environment (finite horizon) or when the
goal is to maintain performance over the long term (average reward).

1. Finite Horizon Reward Model


The finite horizon reward model is used when the agent interacts with the environment
for a limited number of steps, known as the horizon length T. In this model, the agent
tries to maximize the cumulative reward over the fixed number of steps.

Key Points of the Finite Horizon Model

1. Reward Calculation:

o The total reward in a finite horizon model is the sum of rewards collected from the start until the end of the horizon:

G_t = Σ_{k=0}^{T−1} r_{t+k+1} = r_{t+1} + r_{t+2} + ⋯ + r_{t+T}

o Here, T is the horizon length, so the agent considers rewards up to T steps ahead.

2. Goal:

o The agent aims to maximize the cumulative reward over this limited
horizon, which means making decisions that optimize performance within a
fixed number of steps.

3. Applications:

o This model is ideal for tasks with a fixed length, like short-term objectives
(e.g., games with a limited number of moves or episodes that have a time
constraint).

4. Policy and Value Functions:

o In finite horizon problems, the policy and value functions are sometimes
time-dependent. The policy might vary as the agent approaches the end of the
horizon since it may prioritize actions differently with limited steps remaining.

2. Average Reward Model

The average reward model is used when the agent operates in an ongoing or cyclic
environment without a defined endpoint. Instead of focusing on maximizing the total
reward over a finite period, the agent aims to maximize the average reward per time step
over the long run.

Key Points of the Average Reward Model

1. Reward Calculation:

o The average reward ρ is defined as the long-term average of the rewards per step:

ρ = lim_{T→∞} (1/T) E[ Σ_{t=1}^{T} r_t ]

o This measures the steady rate of rewards over time, aiming to make the reward per step as high as possible in the long term.
2. Goal:

o The agent aims to maximize the average reward per step indefinitely,
focusing on consistent, ongoing performance rather than short-term gains.

3. Applications:

o This model is suitable for continuous, ongoing tasks like industrial processes,
inventory management, or any situation where the task has no natural ending
and performance needs to be sustained.

4. Policy and Value Functions:

o In the average reward model, value functions focus on deviations from the
average reward. The value function in this context can be defined in terms of
the differential value function, which measures how much better or worse
each state is compared to the average reward.
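
For reference, one standard way to write the differential value function (this definition is added here for clarity and is not stated explicitly in the text above) is:

hπ(s) = E_π[ Σ_{t=1}^{∞} ( r_t − ρ ) ∣ S_0 = s ]

so states whose rewards tend to exceed the average have positive differential value, and states that fall below the average have negative differential value.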

Summary Comparison

Model          | Horizon/End Condition | Objective                                        | Suitable For
Finite Horizon | Fixed horizon T       | Maximize cumulative reward over limited steps    | Tasks with clear time/step limits
Average Reward | Infinite horizon      | Maximize average reward per step over long term  | Ongoing, repetitive, or cyclic tasks

In summary, the finite horizon model focuses on achieving the best performance within
a set time frame, while the average reward model emphasizes sustaining high rewards
per step in an ongoing process. Both reward models are helpful in different types of RL
tasks depending on the structure and goals of the environment.

EPISODIC & CONTINUING TASKS


In reinforcement learning (RL), tasks are often categorized as either episodic or
continuing based on whether they have a defined endpoint or continue indefinitely. This
distinction is essential because it influences how agents approach the goal, measure
rewards, and learn optimal policies.

Episodic Tasks

Episodic tasks are tasks that consist of distinct episodes, each with a clear beginning and
end. An episode terminates when the agent reaches a terminal state, which could
represent completing a task or failing it. Once an episode ends, the environment resets,
and the agent begins a new episode from the start state.

Key Characteristics

1. Clear Endpoints:

o Each episode has a defined endpoint where the task terminates, and the agent
starts over in a new episode.
2. Finite Length:

o Episodes usually have a fixed or variable but finite number of steps.

3. Objective:

o The goal in episodic tasks is to maximize the cumulative reward over each
episode.

o The agent learns a policy that optimizes performance for each episode as a
whole.

4. Examples:

o Games like chess or tic-tac-toe, where the game ends when someone wins or
there’s a draw.

o Robotics tasks, like picking up an object, where the episode ends once the
task is completed or fails.

5. Discount Factor:

o Episodic tasks often use discount factors close to 1, especially if they are relatively short.

o If episodes are long or uncertain, a smaller discount factor might be applied to prioritize short-term rewards.

Continuing Tasks

Continuing tasks (or infinite-horizon tasks) are tasks that have no natural endpoint and
continue indefinitely. In these tasks, the agent keeps interacting with the environment
without resetting, and there is no terminal state.

Key Characteristics

1. No Terminal State:

o The agent never "finishes" the task; it keeps acting in the environment without
an end.

2. Infinite Horizon:

o Since the task doesn’t end, it’s modeled over an infinite time horizon.

3. Objective:

o The goal in continuing tasks is to maximize the long-term reward. This can
be achieved in two main ways:

▪ Discounted reward: Using a discount factor γ<1 to prioritize immediate rewards while still considering future ones.

▪ Average reward: Maximizing the average reward per step over time.

4. Examples:

o Industrial processes that run continuously, like managing an assembly line.


o Stock market trading, where the agent continuously makes buy, hold, or sell
decisions.

o Autonomous vehicle navigation in ongoing traffic without a fixed endpoint.

5. Discount Factor:

o A discount factor γ<1 is essential for continuing tasks to ensure the agent
prioritizes more immediate rewards over far-future rewards, keeping the total
return finite.
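
The reason γ<1 keeps the return finite is the geometric series bound (a standard fact, stated here for clarity): if every reward satisfies |r_t| ≤ R_max, then

| Σ_{k=0}^{∞} γ^k r_{t+k+1} | ≤ R_max Σ_{k=0}^{∞} γ^k = R_max / (1 − γ)

so the infinite sum of discounted rewards stays bounded even though the task never ends.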

Summary of Differences

Feature         | Episodic Tasks                         | Continuing Tasks
Task Structure  | Consists of distinct episodes          | Continues indefinitely
End Condition   | Ends upon reaching a terminal state    | No terminal state; infinite steps
Objective       | Maximize cumulative reward per episode | Maximize long-term or average reward
Discount Factor | Often close to 1, sometimes variable   | Typically γ<1
Examples        | Games, specific goal-oriented tasks    | Stock trading, industrial processes

Choosing Between Episodic and Continuing Models

• Episodic models are preferred when tasks have clear success or failure conditions,
allowing the agent to reset and learn from each episode.

• Continuing models are better suited for ongoing tasks where there’s no clear
endpoint, and performance needs to be sustained over time.

In RL, understanding whether a task is episodic or continuing shapes the way agents are
trained and how they evaluate and balance immediate versus future rewards.

BELLMAN'S OPTIMALITY OPERATOR


In reinforcement learning (RL), Bellman’s optimality operator is a key concept used to
find optimal policies and value functions. It is a mathematical operator that helps compute
the optimal value function by capturing the essence of dynamic programming for
decision-making over time.

Key Concepts

1. Optimal Value Function:

o The optimal value function, V∗(s), is the maximum possible value (expected return) for each state under any policy.

o It satisfies the Bellman optimality equation, which recursively defines V∗(s) as the best possible expected return from each state.

2. Bellman Optimality Operator:

o Bellman’s optimality operator, typically denoted T, is a mapping that takes a value function V and returns a new value function closer to the optimal one, V∗.

o When this operator is applied iteratively to any value function V, it gradually converges to V∗, solving the optimality equation.

3. Optimal Action Value Function:

o Similarly, Bellman’s operator can be applied to find the optimal action-value function Q∗(s, a), which represents the best expected return for each state-action pair under any policy.

Bellman Optimality Equation

The Bellman optimality equation defines the optimal value function for a state s as follows:

V∗(s) = max_a Σ_{s′} P(s′∣s,a) [ R(s,a) + γ V∗(s′) ]

where:

• P(s′∣s,a) is the probability of moving to state s′ given state s and action a.

• R(s,a) is the expected reward for taking action a in state s.

• γ is the discount factor.

Bellman Optimality Operator Definition

The Bellman optimality operator T is defined as:

(TV)(s) = max_a Σ_{s′} P(s′∣s,a) [ R(s,a) + γ V(s′) ]

This operator, when applied to any value function V, produces a new value function that is "closer" to V∗.

Properties of the Bellman Optimality Operator

1. Contraction Mapping:

o T is a contraction mapping under a norm called the sup norm, meaning it brings value functions progressively closer to V∗ with each application (see the inequality right after this list).

o This property guarantees convergence to V∗ if we apply T repeatedly, a process known as value iteration.

2. Fixed Point:

o The optimal value function V∗ is the fixed point of the Bellman optimality operator T, meaning V∗ = TV∗.

o Once V equals V∗, further applications of T will keep returning V∗.
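
Concretely (a standard property, stated here for reference): for any two value functions V and U,

‖TV − TU‖_∞ ≤ γ ‖V − U‖_∞

and the unique fixed point of T is V∗, which is why repeated application of T converges to V∗ whenever γ < 1.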

Using Bellman’s Optimality Operator

Bellman’s optimality operator is the foundation of value iteration, a key algorithm for
finding optimal policies in RL. In value iteration, we repeatedly apply the Bellman
optimality operator to an initial value function until it converges to V∗.

1. Initialize: Start with an initial guess for V(s) (often zero for all states).

2. Iterate: Apply T repeatedly:

V_{k+1}(s) = (T V_k)(s) = max_a Σ_{s′} P(s′∣s,a) [ R(s,a) + γ V_k(s′) ]

3. Convergence: Stop when V converges to a fixed point within a small error margin
(i.e., it approximates V∗).

After finding V∗, we can derive the optimal policy π∗ by choosing actions that maximize
the expected return in each state.

Summary

• Bellman’s optimality operator T is a tool for computing the optimal value function in reinforcement learning.

• It iteratively "improves" a value function by considering the best possible actions at each state.

• By repeatedly applying T, we converge to the optimal value function V∗, which leads to the optimal policy π∗.

This operator underlies many RL algorithms and provides a way to approach optimal
policies and decision-making in sequential environments.

VALUE ITERATION

Value iteration is a fundamental algorithm in reinforcement learning (RL) used to find the optimal policy for a Markov Decision Process (MDP). It works by iteratively improving the estimated values of each state until they converge to the optimal value function, V∗, which leads to the optimal policy, π∗.

Key Concepts of Value Iteration

1. Goal:

o The goal of value iteration is to compute the optimal value function V∗ for
each state, from which we can derive the optimal policy.

2. Bellman Optimality Equation:

o Value iteration is based on the Bellman optimality equation, which defines the optimal value of each state as the maximum expected cumulative reward starting from that state.

o For any state s, the Bellman optimality equation for V∗(s) is:

V∗(s) = max_a Σ_{s′} P(s′∣s,a) [ R(s,a) + γ V∗(s′) ]

o Here:

▪ a is an action available in state s,

▪ s′ is the next state,

▪ P(s′∣s,a) is the probability of transitioning to s′ after taking action a in s,

▪ R(s,a) is the reward received for taking action a in s,

▪ γ is the discount factor.

Steps of the Value Iteration Algorithm

1. Initialize:

o Start with an initial guess for the value function V(s) for all states s. This could
be zero or random values.

2. Iterative Update:

o For each state s, update V(s) using the Bellman optimality equation:

V(s) ← max_a Σ_{s′} P(s′∣s,a) [ R(s,a) + γ V(s′) ]

o This step is repeated for all states in the MDP. This update makes V(s) closer to the optimal value V∗(s) after each iteration.

3. Convergence:

o Continue updating V(s) for each state until the values converge, meaning the
changes in V(s) become smaller than a defined threshold (e.g., an error
tolerance θ).

o At convergence, V(s) ≈ V∗(s), meaning the value function is optimal.


4. Extract the Optimal Policy:

o After convergence, we can derive the optimal policy π∗ by choosing the action in each state that maximizes the expected reward, based on V∗:

π∗(s) = argmax_a Σ_{s′} P(s′∣s,a) [ R(s,a) + γ V∗(s′) ]

Example of Value Iteration

Suppose we have a simple MDP with three states and two actions. We initialize the values V(s) for each state to zero. We repeatedly apply the Bellman update for each state,
calculating the maximum expected return for each action until V(s) converges.
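
A minimal Python sketch of this procedure on a made-up three-state, two-action MDP (the transition probabilities and rewards below are invented purely for illustration):

# States 0, 1, 2; actions 0, 1. P[s][a] is a list of (prob, next_state, reward) triples.
P = {
    0: {0: [(1.0, 1, 0.0)],                1: [(1.0, 2, 0.0)]},
    1: {0: [(0.8, 2, 1.0), (0.2, 0, 0.0)], 1: [(1.0, 0, 0.0)]},
    2: {0: [(1.0, 2, 0.0)],                1: [(1.0, 2, 5.0)]},
}
gamma, theta = 0.9, 1e-6

V = {s: 0.0 for s in P}                      # initialize V(s) = 0 for all states
while True:
    delta = 0.0
    for s in P:
        # Bellman optimality backup: V(s) <- max_a sum_{s'} P(s'|s,a) [r + gamma V(s')]
        new_v = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < theta:                        # stop when the largest update is below theta
        break

# Extract the greedy (optimal) policy from the converged values.
policy = {
    s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
    for s in P
}
print(V, policy)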

Why Value Iteration Works

Value iteration uses the Bellman optimality operator, which is a contraction mapping,
meaning it brings the value function closer to the optimal solution with each application.
This guarantees that the algorithm converges to V∗.

Advantages and Disadvantages

Advantages:

• Efficient: Value iteration combines policy evaluation and policy improvement in a single step, speeding up convergence.

• Guarantees Optimality: Given enough iterations and a proper discount factor, it will find the optimal policy.

Disadvantages:

• Computationally Intensive for Large State Spaces: Calculating the update for
each state-action pair can be very slow for large or continuous MDPs.

• Not Suitable for Continuous Action Spaces: Value iteration is challenging to apply
directly in continuous spaces without approximations.

Summary

Value iteration is a fundamental RL algorithm that iteratively updates value estimates to find the optimal value function V∗ and, subsequently, the optimal policy π∗. By repeatedly
applying the Bellman optimality equation, value iteration converges to an optimal solution,
making it highly effective for solving discrete MDPs.
POLICY ITERATION

Policy iteration is a classic algorithm in reinforcement learning (RL) for finding the
optimal policy for a Markov Decision Process (MDP). Unlike value iteration, which directly
tries to compute the optimal value function, policy iteration alternates between evaluating
a policy and improving it. This algorithm has two main steps: policy evaluation and
policy improvement.

Key Concepts of Policy Iteration

1. Policy Evaluation:

o Given a policy π, policy evaluation calculates the value function Vπ for that
policy, which represents the expected cumulative reward if the agent follows
π.

o This step involves solving a set of equations for all states to find the values of
Vπ.

2. Policy Improvement:

o After obtaining Vπ, policy improvement updates the policy by choosing actions that maximize the expected reward based on Vπ.

o The updated policy is guaranteed to be at least as good as the previous one, and if no improvement can be made, the policy is optimal.

Steps of the Policy Iteration Algorithm

1. Initialize:

o Start with an arbitrary initial policy π0.


2. Policy Evaluation:

o Compute the value function Vπ for the current policy π by solving the Bellman expectation equation:

Vπ(s) = Σ_a π(a∣s) Σ_{s′} P(s′∣s,a) [ R(s,a) + γ Vπ(s′) ]

o This can be done either by iterative approximation (sweeping over the states until the values converge) or by solving a system of linear equations if the MDP is small enough.

3. Policy Improvement:

o For each state s, update the policy by choosing the action a that maximizes the expected return, based on Vπ:

π′(s) = argmax_a Σ_{s′} P(s′∣s,a) [ R(s,a) + γ Vπ(s′) ]

o If the updated policy π′ is the same as the old policy π, then π is optimal, and the algorithm stops.

4. Repeat:

o Alternate between policy evaluation and policy improvement until the policy no longer changes.

Example of Policy Iteration

1. Initialization: Start with an initial policy, say π0, where the agent picks random
actions.

2. Policy Evaluation: Calculate Vπ0 by solving the Bellman equations.

3. Policy Improvement: Update π0 by selecting actions that maximize the expected reward according to Vπ0.

4. Repeat: Continue with policy evaluation and policy improvement until no further
improvements are possible.
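
A minimal Python sketch of these four steps, reusing the same kind of made-up MDP as in the value iteration example (transition probabilities and rewards are invented for illustration):

# P[s][a] = list of (prob, next_state, reward); illustrative numbers only.
P = {
    0: {0: [(1.0, 1, 0.0)],                1: [(1.0, 2, 0.0)]},
    1: {0: [(0.8, 2, 1.0), (0.2, 0, 0.0)], 1: [(1.0, 0, 0.0)]},
    2: {0: [(1.0, 2, 0.0)],                1: [(1.0, 2, 5.0)]},
}
gamma, theta = 0.9, 1e-6

def q_value(V, s, a):
    """Expected return of taking action a in state s and then continuing with values V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

policy = {s: 0 for s in P}                   # 1. Initialize with an arbitrary policy
while True:
    # 2. Policy evaluation: iterate the Bellman expectation equation for the current policy.
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = q_value(V, s, policy[s])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            break
    # 3. Policy improvement: act greedily with respect to the evaluated values.
    new_policy = {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}
    # 4. Repeat until the policy stops changing.
    if new_policy == policy:
        break
    policy = new_policy

print(policy, V)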

Why Policy Iteration Works

Policy iteration guarantees convergence to the optimal policy. After each policy
improvement step, the updated policy is either the same as the previous one or strictly
better. Since there are only a finite number of possible policies, the algorithm must
converge to the optimal policy after a finite number of steps.

Advantages and Disadvantages

Advantages:

• Guaranteed Convergence to Optimal Policy: Policy iteration is guaranteed to converge to the optimal policy in a finite number of iterations for finite MDPs.

• More Efficient than Value Iteration in Some Cases: Policy iteration can converge faster for some problems because it often needs only a few policy improvement steps before the policy stops changing.

Disadvantages:

• Policy Evaluation Can Be Expensive: For large state spaces, calculating Vπ exactly
can be computationally expensive, making policy iteration less efficient.

• Less Suitable for Large or Continuous State Spaces: Policy iteration relies on
explicit calculations for each state, which can be impractical for very large or
continuous state spaces.
Comparison to Value Iteration

Feature     | Policy Iteration                                                         | Value Iteration
Approach    | Alternates between policy evaluation and improvement                     | Iteratively applies the Bellman optimality equation
Convergence | Converges to the optimal policy when the policy no longer changes        | Converges to the optimal value function V∗
Efficiency  | May converge faster for some problems but can be costly in large spaces  | Often more efficient for large state spaces

Summary

Policy iteration is an effective algorithm for solving MDPs by iteratively evaluating and
improving policies. Its alternating structure makes it highly efficient for small to medium-
sized problems and guarantees convergence to the optimal policy, making it a cornerstone
technique in RL.
