Experiment 3
EXPERIMENT ASSESSMENT
Experiment No.: 3
Roll Number:
Date of Performance:
Date of Submission:
Evaluation
Performance Indicator Max. Marks Marks Obtained
Performance 5
Understanding 5
Journal work and timely submission. 10
Total 20
Checked by
Signature :
Date :
EXPERIMENT 3
Aim: Implementing a basic grid-world environment as an MDP and applying policy iteration and
value iteration algorithms to find optimal policies.
Theory:
Markov Decision Process (MDP):
● An MDP is a mathematical framework used to model decision-making in
environments where outcomes are partially random and partially under the control of
an agent.
● It consists of a set of states, a set of actions, transition probabilities, rewards, and a
discount factor.
● In a grid-world environment, each cell represents a state, and the agent can take
actions (move up, down, left, right) to transition between states.
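As a concrete, hypothetical illustration of these components, a tiny grid-world MDP can be written out explicitly in Python; the grid size, noise model, goal cell, and rewards below are illustrative assumptions, not the exact specification used later in this experiment.
# A minimal, illustrative grid-world MDP (all numbers here are assumptions)
states = [(r, c) for r in range(3) for c in range(4)]   # each cell is a state
actions = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
gamma = 0.9                                             # discount factor

def transitions(state, action):
    """Return a list of (probability, next_state) pairs.
    The move succeeds with probability 0.8 and the agent stays put
    otherwise (a deliberately simple noise model)."""
    dr, dc = actions[action]
    r, c = state
    nxt = (r + dr, c + dc)
    if nxt not in states:   # bumping into the boundary keeps the agent in place
        nxt = state
    return [(0.8, nxt), (0.2, state)]

def reward(state, action, next_state):
    """+1 for reaching the assumed goal cell (0, 3), a small step cost otherwise."""
    return 1.0 if next_state == (0, 3) else -0.04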
Optimal Policies:
● An optimal policy specifies the best action to take in each state to maximize
cumulative rewards over time.
● Finding optimal policies involves determining the best strategy for the agent to
navigate the environment and achieve its goals.
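Formally, an optimal policy π* acts greedily with respect to the optimal value function V*, which satisfies the Bellman optimality equation in every state (the symbols are defined in the legend further below):
V*(s) = max_a Σ_s' P(s' | s, a) [R(s, a, s') + γ V*(s')]
π*(s) = argmax_a Σ_s' P(s' | s, a) [R(s, a, s') + γ V*(s')]
Policy iteration and value iteration are two ways of solving these equations.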
Policy Iteration:
● Policy iteration is an iterative algorithm for finding the optimal policy in an MDP.
● It alternates between two steps: policy evaluation and policy improvement.
● Policy evaluation involves estimating the value function for a given policy, while
policy improvement updates the policy based on the current value function.
● This process continues until the policy converges to an optimal policy.
Algorithm for Policy Iteration:
1. Policy Evaluation:
- Initialize V(s) arbitrarily for all states
- Repeat until Δ < ε (ε is a small positive number):
  - Δ ← 0
  - For each state s:
    - v ← V(s)
    - V(s) ← Σ_s' P(s' | s, π(s)) [R(s, π(s), s') + γ V(s')]
    - Δ ← max(Δ, |v − V(s)|)
2. Policy Improvement:
- For each state s:
  - π(s) ← argmax_a Σ_s' P(s' | s, a) [R(s, a, s') + γ V(s')]
- If π(s) did not change for any state, stop; otherwise return to Policy Evaluation.
Algorithm for Value Iteration:
1. Initialize V(s) arbitrarily for all states
2. Repeat until Δ < ε:
- Δ ← 0
- For each state s:
  - v ← V(s)
  - V(s) ← max_a Σ_s' P(s' | s, a) [R(s, a, s') + γ V(s')]
  - Δ ← max(Δ, |v − V(s)|)
3. Extract the optimal policy: π(s) ← argmax_a Σ_s' P(s' | s, a) [R(s, a, s') + γ V(s')] for every state s
In these algorithms:
● π represents the policy, which specifies the action to be taken in each state.
● V(s) represents the value function, which estimates the expected cumulative reward starting
from state s and following the current policy π.
● P(s' | s, a) represents the transition probability from state s to state s' under action a.
● R(s, a, s') represents the reward obtained when transitioning from state s to state s' under
action a.
● γ is the discount factor, representing the importance of future rewards compared to immediate
rewards.
● Δ is the largest change in the value function during a sweep, and ε is the small positive threshold used to decide convergence.
These algorithms iteratively update the value function and policy until convergence: in policy iteration the policy stabilizes, while in value iteration the value function changes by less than ε over a full sweep. The resulting policy is then optimal for the given grid-world environment.
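To make the interaction of the two steps concrete, the following Python sketch implements policy iteration for a generic MDP; the interface (P[(s, a)] as a list of (probability, next state) pairs and R(s, a, s') as a reward function) and all names are assumptions made for this illustration only.
import random

def policy_iteration(states, actions, P, R, gamma, eps=1e-6):
    """Sketch of policy iteration (assumed interface, see note above)."""
    policy = {s: random.choice(actions) for s in states}   # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: Bellman expectation backups for the fixed policy
        while True:
            delta = 0.0
            for s in states:
                v = sum(p * (R(s, policy[s], s2) + gamma * V[s2])
                        for p, s2 in P[(s, policy[s])])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < eps:
                break
        # Policy improvement: act greedily with respect to the current V
        stable = True
        for s in states:
            best = max(actions, key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                                  for p, s2 in P[(s, a)]))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:   # the policy no longer changes, so it is optimal
            return policy, V
Each outer pass evaluates the current policy to convergence and then improves it greedily; the loop terminates as soon as an improvement pass leaves the policy unchanged.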
Code:
# Initialize the grid-world dimensions
row = 3
col = 4
# Convergence threshold
Max_Err = 10**(-3)
# Possible moves as (row, column) offsets: Down, Left, Up, Right
# (this list was not shown in the original listing; the order is assumed)
A = [(1, 0), (0, -1), (-1, 0), (0, 1)]
# Get the utility of the state reached by performing the given action from
# the given state
def utility(U, r, c, action):
    dr, dc = A[action]
    newR, newC = r + dr, c + dc
    # Colliding with the boundary or the wall at (1, 1) leaves the agent in place
    if newR < 0 or newC < 0 or newR >= row or newC >= col or (newR == newC == 1):
        return U[r][c]
    else:
        return U[newR][newC]
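The helper above only looks up the utility of the cell an action would lead to. A value iteration loop built on top of it might look like the sketch below; the 0.8/0.1/0.1 transition model, the discount factor, the living reward, and the terminal cells at (0, 3) and (1, 3) are assumptions based on the classic 3×4 grid-world and are not given in the listing above.
# Sketch of value iteration using utility(); model parameters are assumed
discount = 0.99        # assumed discount factor γ
living_reward = -0.04  # assumed reward for every non-terminal step

def value_iteration():
    # Utility grid; terminal cells (0, 3) = +1 and (1, 3) = -1 are assumed
    U = [[0, 0, 0, 1], [0, 0, 0, -1], [0, 0, 0, 0]]
    while True:
        nextU = [[0, 0, 0, 1], [0, 0, 0, -1], [0, 0, 0, 0]]
        delta = 0
        for r in range(row):
            for c in range(col):
                if (r, c) in [(0, 3), (1, 3), (1, 1)]:   # skip terminals and the wall
                    continue
                # Expected utility of each action: 0.8 intended move,
                # 0.1 for each perpendicular slip (assumed noise model)
                best = max(0.8 * utility(U, r, c, a)
                           + 0.1 * utility(U, r, c, (a + 1) % 4)
                           + 0.1 * utility(U, r, c, (a - 1) % 4)
                           for a in range(4))
                nextU[r][c] = living_reward + discount * best
                delta = max(delta, abs(nextU[r][c] - U[r][c]))
        U = nextU
        if delta < Max_Err:   # stop once the largest change is below the threshold
            return U
Calling value_iteration() returns the converged utilities, from which a greedy policy can be read off state by state.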
Output:
Conclusion:
1. How does the value iteration algorithm differ from policy iteration, and what are its
main steps in the context of finding optimal policies?
The value iteration and policy iteration algorithms are both fundamental methods for finding optimal policies in Markov Decision Processes (MDPs). Policy iteration alternates between two distinct phases: it evaluates the value function of the current policy and then improves the policy by acting greedily with respect to that value function, repeating until the policy stops changing. Value iteration merges the two phases into a single update: starting from an arbitrary V(s), it repeatedly applies the Bellman optimality backup V(s) ← max_a Σ_s' P(s' | s, a) [R(s, a, s') + γ V(s')] to every state until the largest change in a sweep falls below a small threshold, and then extracts the optimal policy by choosing, in each state, the action that attains the maximum. Each sweep of value iteration is cheaper because no separate policy evaluation is required, although it may take more sweeps to converge. Both algorithms converge to the same optimal policy, either when the policy stabilizes (policy iteration) or when the value function stops changing significantly (value iteration).
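As a concrete link back to the code section, once either algorithm has produced converged utilities U, the optimal policy can be extracted greedily with a helper like the following sketch (it reuses the same assumed 0.8/0.1/0.1 noise model, terminal cells, and action order as the earlier sketches):
# Sketch: read the greedy policy off converged utilities (assumed model as above)
def best_policy(U):
    names = ["Down", "Left", "Up", "Right"]   # matches the assumed order of A
    policy = [["." for _ in range(col)] for _ in range(row)]
    for r in range(row):
        for c in range(col):
            if (r, c) in [(0, 3), (1, 3), (1, 1)]:   # terminals and the wall
                continue
            best_a = max(range(4), key=lambda a:
                         0.8 * utility(U, r, c, a)
                         + 0.1 * utility(U, r, c, (a + 1) % 4)
                         + 0.1 * utility(U, r, c, (a - 1) % 4))
            policy[r][c] = names[best_a]
    return policy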