
Vidyavardhini’s College of Engineering & Technology

Department of Computer Science & Engineering (Data Science)

EXPERIMENT ASSESSMENT

ACADEMIC YEAR 2023-24


Course: Reinforcement Learning Lab
Course code: CSDOL8013
Year: BE Sem: VIII

Experiment No.: 3

Aim: Implementing a basic grid-world environment as an MDP and applying policy iteration and value iteration algorithms to find optimal policies.
Name:

Roll Number:
Date of Performance:

Date of Submission:

Evaluation
Performance Indicator                      Max. Marks    Marks Obtained
Performance                                5
Understanding                              5
Journal work and timely submission         10
Total                                      20

Performance Indicator                      Exceed Expectations (EE)    Meet Expectations (ME)    Below Expectations (BE)
Performance                                5                           3                         2
Understanding                              5                           3                         2
Journal work and timely submission         10                          8                         4

Checked by

Name of Faculty : Rujuta Vartak

Signature :

Date :



Vidyavardhini’s College of Engineering & Technology
Department of Computer Science & Engineering (Data Science)

EXPERIMENT 3
Aim: Implementing a basic grid-world environment as an MDP and applying policy iteration and
value iteration algorithms to find optimal policies.

Objective: The objective is to create a simplified grid-world environment, represent it as a Markov Decision Process (MDP), and then employ policy iteration and value iteration algorithms to compute optimal policies within this environment, facilitating a fundamental understanding of dynamic programming techniques in reinforcement learning.

Theory:
Markov Decision Process (MDP):
● An MDP is a mathematical framework used to model decision-making in
environments where outcomes are partially random and partially under the control of
an agent.
● It consists of a set of states, a set of actions, transition probabilities, rewards, and a
discount factor.
● In a grid-world environment, each cell represents a state, and the agent can take
actions (move up, down, left, right) to transition between states.
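As an illustration (separate from the experiment's own listing later in this write-up), such a grid-world MDP can be sketched in Python roughly as follows; the grid size, goal cell, living cost, and discount factor used here are arbitrary placeholder values, and the transitions are kept deterministic for simplicity:

# A minimal, hypothetical grid-world MDP: states are (row, col) cells, actions
# are compass moves, transitions are deterministic, and each step carries a
# small living cost.
GRID_ROWS, GRID_COLS = 3, 4
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GAMMA = 0.9          # discount factor
GOAL = (0, 3)        # terminal state with reward +1

STATES = [(r, c) for r in range(GRID_ROWS) for c in range(GRID_COLS)]

def transition(state, action):
    """Deterministic next state: move if the target cell is inside the grid, else stay."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    return (r, c) if 0 <= r < GRID_ROWS and 0 <= c < GRID_COLS else state

def reward(state, action, next_state):
    """+1 for entering the goal cell, a -0.04 living cost otherwise."""
    return 1.0 if next_state == GOAL else -0.04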
Optimal Policies:
● An optimal policy specifies the best action to take in each state to maximize
cumulative rewards over time.
● Finding optimal policies involves determining the best strategy for the agent to
navigate the environment and achieve its goals.
Policy Iteration:
● Policy iteration is an iterative algorithm for finding the optimal policy in an MDP.
● It alternates between two steps: policy evaluation and policy improvement.
● Policy evaluation involves estimating the value function for a given policy, while
policy improvement updates the policy based on the current value function.
● This process continues until the policy converges to an optimal policy.
Algorithm for Policy Iteration:
1. Policy Evaluation:
   - Initialize V(s) arbitrarily for all states
   - Repeat until Δ < ε (a small positive number):
     - Δ ← 0
     - For each state s:
       - v ← V(s)
       - V(s) ← Σ(P(s' | s, π(s)) * [R(s, π(s), s') + γ * V(s')])
       - Δ ← max(Δ, |v - V(s)|)
2. Policy Improvement:
   - Policy_stable ← true
   - For each state s:
     - old_action ← π(s)
     - π(s) ← argmax[a]{Σ(P(s' | s, a) * [R(s, a, s') + γ * V(s')])}
     - If old_action ≠ π(s), then Policy_stable ← false
   - If Policy_stable, stop and return the optimal policy π*; otherwise return to step 1.
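A compact, generic sketch of this loop (independent of the grid-world code later in this write-up) could look as follows, assuming the MDP is supplied as dictionaries where P[s][a] is a list of (probability, next_state) pairs and R[s][a][s'] is the reward; the two-state example MDP at the bottom is purely illustrative:

def policy_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    V = {s: 0.0 for s in states}
    pi = {s: actions[0] for s in states}          # arbitrary initial policy
    while True:
        # Policy evaluation: iterate the Bellman expectation backup for pi
        while True:
            delta = 0.0
            for s in states:
                v = V[s]
                V[s] = sum(p * (R[s][pi[s]][s2] + gamma * V[s2])
                           for p, s2 in P[s][pi[s]])
                delta = max(delta, abs(v - V[s]))
            if delta < eps:
                break
        # Policy improvement: act greedily with respect to the current V
        stable = True
        for s in states:
            old = pi[s]
            pi[s] = max(actions, key=lambda a: sum(
                p * (R[s][a][s2] + gamma * V[s2]) for p, s2 in P[s][a]))
            if pi[s] != old:
                stable = False
        if stable:
            return pi, V

# Toy two-state MDP: from "A" you can "stay" (no reward) or "go" to "B" (+1);
# "B" is absorbing with zero reward.
S, A_ = ["A", "B"], ["stay", "go"]
P = {"A": {"stay": [(1.0, "A")], "go": [(1.0, "B")]},
     "B": {"stay": [(1.0, "B")], "go": [(1.0, "B")]}}
R = {"A": {"stay": {"A": 0.0}, "go": {"B": 1.0}},
     "B": {"stay": {"B": 0.0}, "go": {"B": 0.0}}}
print(policy_iteration(S, A_, P, R))   # -> optimal action in "A" is "go"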
Value Iteration:
● Value iteration is another iterative algorithm used to find optimal policies in MDPs.
● It iteratively updates the value function for each state until it converges to the optimal
value function.
● The optimal policy can then be derived from the optimal value function by selecting
actions that maximize expected returns.
Algorithm for Value Iteration:
- Initialize V(s) arbitrarily for all states
- Repeat until Δ < ε (a small positive number):
  - Δ ← 0
  - For each state s:
    - v ← V(s)
    - V(s) ← max[a]{Σ(P(s' | s, a) * [R(s, a, s') + γ * V(s')])}
    - Δ ← max(Δ, |v - V(s)|)
- Return the optimal policy π* such that π*(s) = argmax[a]{Σ(P(s' | s, a) * [R(s, a, s') + γ * V(s')])}
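Under the same hypothetical dictionary encoding as the policy iteration sketch above (P[s][a] as a list of (probability, next_state) pairs, R[s][a][s'] as the reward), value iteration might be sketched like this:

def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    # Repeatedly apply the Bellman optimality backup until the largest change is tiny
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = V[s]
            V[s] = max(sum(p * (R[s][a][s2] + gamma * V[s2])
                           for p, s2 in P[s][a]) for a in actions)
            delta = max(delta, abs(v - V[s]))
        if delta < eps:
            break
    # Derive the greedy (optimal) policy from the converged value function
    pi = {s: max(actions, key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                                            for p, s2 in P[s][a]))
          for s in states}
    return pi, V

Compared with the policy iteration sketch, there is no separate evaluation loop: the max over actions is folded directly into each backup.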

In these algorithms:
● π represents the policy, which specifies the action to be taken in each state.
● V(s) represents the value function, which estimates the expected cumulative reward starting
from state s and following the current policy π.
● P(s' | s, a) represents the transition probability from state s to state s' under action a.
● R(s, a, s') represents the reward obtained when transitioning from state s to state s' under
action a.
● γ is the discount factor, representing the importance of future rewards compared to immediate
rewards.
● ε is a small positive threshold used to determine convergence, and Δ is the largest change in V(s) observed during a sweep.
These algorithms iteratively update the value function and policy until convergence, where the
policy either stabilizes (in policy iteration) or the value function converges (in value iteration).
The resulting policy is then considered optimal for the given grid-world environment.
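For a concrete, purely illustrative backup: suppose that in state s action a leads to s'₁ with probability 0.8 (reward 0, V(s'₁) = 0.5) and to s'₂ with probability 0.2 (reward 1, V(s'₂) = 1.0), with γ = 0.9. The update then gives V(s) ← 0.8 * (0 + 0.9 * 0.5) + 0.2 * (1 + 0.9 * 1.0) = 0.36 + 0.38 = 0.74.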

Code:
import random
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Initialize environment: a 3 x 4 grid world
row = 3
col = 4

# Non-terminal state reward (living cost) and discount factor
α = -0.01
γ = 0.99

# Action sequence: Down, Left, Up, Right (row/column offsets)
A = [(1, 0), (0, -1), (-1, 0), (0, 1)]

# Number of available actions
Steps = 4

# Convergence threshold
Max_Err = 10**(-3)

# Initial utility grid: +1 and -1 mark the terminal states, (1, 1) is the wall
U = [[0, 0, 0, 1], [0, 0, 0, -1], [0, 0, 0, 0]]

# Construct a random initial policy
rand_policy = [[random.randint(0, 3) for j in range(col)] for i in range(row)]

# Get the utility of the state reached by performing the given action from the given state
def utility(U, r, c, action):
    dr, dc = A[action]
    newR, newC = r + dr, c + dc
    if newR < 0 or newC < 0 or newR >= row or newC >= col or (newR == newC == 1):
        # Collide with the boundary or the wall: stay in place
        return U[r][c]
    else:
        return U[newR][newC]

# Calculate the expected utility of a state for a given action
# (80% intended direction, 10% for each perpendicular slip)
def calc_util(U, r, c, action):
    u = α
    u += 0.1 * γ * utility(U, r, c, (action - 1) % 4)
    u += 0.8 * γ * utility(U, r, c, action)
    u += 0.1 * γ * utility(U, r, c, (action + 1) % 4)
    return u

# Iterative policy evaluation: sweep until the utilities of the current (fixed) policy converge
def evaluate_Policy(rand_policy, U):
    while True:
        nextU = [[0, 0, 0, 1], [0, 0, 0, -1], [0, 0, 0, 0]]
        error = 0
        for r in range(row):
            for c in range(col):
                if (r <= 1 and c == 3) or (r == c == 1):
                    continue  # skip the terminal states and the wall
                # Simplified Bellman update under the fixed policy
                nextU[r][c] = calc_util(U, r, c, rand_policy[r][c])
                error = max(error, abs(nextU[r][c] - U[r][c]))
        U = nextU
        if error < Max_Err * (1 - γ) / γ:
            break
    return U

# Policy iteration: alternate policy evaluation and greedy policy improvement
def policy_itter(rand_policy, U):
    while True:
        U = evaluate_Policy(rand_policy, U)
        unchanged = True
        for r in range(row):
            for c in range(col):
                if (r <= 1 and c == 3) or (r == c == 1):
                    continue
                maxAction, maxU = None, -float("inf")
                for action in range(Steps):
                    u = calc_util(U, r, c, action)
                    if u > maxU:
                        maxAction, maxU = action, u
                if maxU > calc_util(U, r, c, rand_policy[r][c]):
                    # Switch to the action that maximizes the utility
                    rand_policy[r][c] = maxAction
                    unchanged = False
        if unchanged:
            break
    return rand_policy

rand_policy = policy_itter(rand_policy, U)
U = evaluate_Policy(rand_policy, U)
print('Policy after policy iteration:', rand_policy, '\n')
print('Utilities U:', U)
# Visualize the converged utility grid as a heatmap
policy_table_3 = np.array(U)
plt.title('Dynamic Programming Policy Iteration | Algorithm #3')
grd = sns.heatmap(policy_table_3, cmap="YlGnBu", annot=True)
plt.show()
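The listing above applies policy iteration; to also cover the value iteration part of the aim, a companion sketch that reuses calc_util() and the definitions from the listing might look like the following (the names value_iteration_grid, U_vi, and policy_vi are illustrative additions, not part of the original listing):

# Sketch: value iteration on the same 3 x 4 grid, assuming row, col, γ, Steps,
# Max_Err and calc_util() from the listing above are already in scope.
def value_iteration_grid(U):
    while True:
        nextU = [[0, 0, 0, 1], [0, 0, 0, -1], [0, 0, 0, 0]]
        error = 0
        for r in range(row):
            for c in range(col):
                if (r <= 1 and c == 3) or (r == c == 1):
                    continue  # skip the terminal states and the wall
                # Bellman optimality update: back up the best action's expected utility
                nextU[r][c] = max(calc_util(U, r, c, a) for a in range(Steps))
                error = max(error, abs(nextU[r][c] - U[r][c]))
        U = nextU
        if error < Max_Err * (1 - γ) / γ:
            return U

U_vi = value_iteration_grid([[0, 0, 0, 1], [0, 0, 0, -1], [0, 0, 0, 0]])
# Extract the greedy policy (entries at terminal/wall cells are meaningless placeholders)
policy_vi = [[max(range(Steps), key=lambda a: calc_util(U_vi, r, c, a))
              for c in range(col)] for r in range(row)]
print('Utilities from value iteration:', U_vi)
print('Greedy policy from value iteration:', policy_vi)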

Output:

Conclusion:
1. How does the value iteration algorithm differ from policy iteration, and what are its
main steps in the context of finding optimal policies?
Value iteration and policy iteration are both fundamental dynamic programming methods for finding optimal policies in Markov Decision Processes (MDPs). Policy iteration alternates between two distinct steps: policy evaluation, which computes the value function of the current policy, and policy improvement, which makes the policy greedy with respect to that value function; it stops once the policy no longer changes. Value iteration instead folds both steps into a single update, repeatedly applying the Bellman optimality backup V(s) ← max[a]{Σ(P(s' | s, a) * [R(s, a, s') + γ * V(s')])} until the value function converges, and only then extracting the greedy policy. Each value iteration sweep is cheaper because it avoids an inner evaluation loop, although policy iteration typically needs fewer outer iterations to converge. Both algorithms iteratively refine their estimates until they no longer change significantly and, for a finite MDP with γ < 1, converge to an optimal policy.
