
Vidyavardhini’s College of Engineering & Technology

Department of Computer Science & Engineering (Data Science)

EXPERIMENT ASSESSMENT

ACADEMIC YEAR 2023-24


Course: Reinforcement Learning Lab
Course code: CSDOL8013
Year: BE Sem: VIII

Experiment No.: 4
Aim: To apply dynamic programming algorithms, such as policy evaluation and policy
improvement, to solve a small-scale MDP problem.
Name:

Roll Number:
Date of Performance:

Date of Submission:

Evaluation

Performance Indicator                   Max. Marks   Marks Obtained
Performance                             5
Understanding                           5
Journal work and timely submission      10
Total                                   20

Performance Indicator                   Exceed Expectations (EE)   Meet Expectations (ME)   Below Expectations (BE)
Performance                             5                          3                        2
Understanding                           5                          3                        2
Journal work and timely submission      10                         8                        4

Checked by

Name of Faculty: Rujuta Vartak

Signature:

Date:



EXPERIMENT 4

Aim: To apply dynamic programming algorithms, such as policy evaluation and policy
improvement, to solve a small-scale MDP problem.

Objective: To apply dynamic programming algorithms, specifically policy evaluation and policy
improvement, to a small-scale Markov Decision Process (MDP), iteratively refining the policy until
it is optimal, and thereby to gain insight into the foundational concepts of reinforcement learning
and optimization.

Theory:
Markov Decision Processes (MDPs): MDPs are mathematical frameworks used to model
decision-making problems where outcomes are partly random and partly under the control of a
decision-maker. They consist of states, actions, transition probabilities, rewards, and a discount factor.
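
As an illustration, these five components can be written down directly in Python. The states, actions, and numbers below are arbitrary placeholders for a two-state, two-action MDP, not the MDP used later in this experiment:

import numpy as np

# Illustrative two-state, two-action MDP; all numbers are placeholders.
states = [0, 1]          # S: set of states
actions = [0, 1]         # A: set of actions
gamma = 0.9              # discount factor

# P[s, a, s'] : probability of reaching s' from s when action a is taken
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # from state 0
    [[0.5, 0.5], [0.0, 1.0]],   # from state 1
])

# R[s, a, s'] : immediate reward for the transition (s, a, s')
R = np.array([
    [[1.0, 0.0], [0.0, 2.0]],    # from state 0
    [[-1.0, 0.0], [0.0, 5.0]],   # from state 1
])
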
Policy Evaluation: Policy evaluation is the process of determining the value function for a given
policy. The value function represents the expected cumulative reward starting from a particular state
and following a specific policy. The basic idea is to iteratively update the value of each state until
convergence.
Algorithm for policy evaluation:
Input:

MDP: (S, A, P, R, γ), where


S is the set of states.
A is the set of actions.
P is the transition probability matrix, P(s' | s, a), representing the probability of transitioning to state s'
from state s by taking action a.
R is the reward function, R(s, a, s'), representing the immediate reward received after transitioning
from state s to state s' by taking action a.
γ is the discount factor.
π is the policy to be evaluated; here π is deterministic, so π(s) is the action taken in state s.
Output:

Value function V(s) for each state s.


Algorithm:

Initialize the value function arbitrarily: V(s) for all s in S.

Repeat until convergence:


Set Δ to 0 at the start of the sweep.
For each state s in S:
Let v be the current value of V(s).
Update V(s) using the Bellman expectation equation:
V(s) = Σ [P(s' | s, π(s)) * (R(s, π(s), s') + γ * V(s'))], summing over all possible next states s'.

Update Δ to be the maximum between Δ and |v - V(s)|.


If Δ is smaller than a predefined threshold ε, break.
Return the converged value function V(s).

Explanation:

The algorithm iteratively updates the value function V(s) for each state s using the Bellman
expectation equation until convergence.
At each iteration, it computes the expected one-step return from state s under the policy's action π(s): the
immediate reward R(s, π(s), s') plus the discounted value of the next state, γ * V(s'), averaged over the possible next states s' using P(s' | s, π(s)).
The process continues until the maximum change in the value function between iterations (Δ) falls
below a predefined threshold ε, indicating convergence.
The output is the converged value function V(s), which represents the expected cumulative reward
from each state under the given policy.
This algorithm is known as "Iterative Policy Evaluation" and is a fundamental component of dynamic
programming approaches for solving MDPs. It provides a way to estimate the value of each state
under a given policy.
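
Below is a minimal sketch of iterative policy evaluation for a deterministic policy. The arrays P, R and the policy are placeholder assumptions chosen only for illustration; they are not the MDP used in this experiment:

import numpy as np

# Placeholder MDP and policy, for illustration only.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[1.0, 0.0], [0.0, 1.0]]])   # P[s, a, s']
R = np.array([[[0.0, 1.0], [0.0, 2.0]],
              [[0.0, 0.0], [0.0, 5.0]]])   # R[s, a, s']
gamma, eps = 0.9, 1e-6
policy = np.array([0, 1])                  # deterministic policy: action pi(s) per state

V = np.zeros(len(policy))                  # arbitrary initialisation (here zeros)
while True:
    delta = 0.0                            # largest change seen in this sweep
    for s in range(len(policy)):
        v = V[s]
        a = policy[s]                      # action prescribed by the policy in state s
        # Bellman expectation backup for V(s)
        V[s] = np.sum(P[s, a] * (R[s, a] + gamma * V))
        delta = max(delta, abs(v - V[s]))
    if delta < eps:                        # stop once the value function has converged
        break
print(V)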

Policy Improvement: Policy improvement involves selecting better actions in each state to improve
the current policy. It's based on the idea of greedily selecting actions that maximize the expected
cumulative reward given the current value function.
Algorithm for policy improvement:
Input:

MDP: (S, A, P, R, γ), where


S is the set of states.
A is the set of actions.
P is the transition probability matrix, P(s' | s, a), representing the probability of transitioning to state s'
from state s by taking action a.
R is the reward function, R(s, a, s'), representing the immediate reward received after transitioning
from state s to state s' by taking action a.
γ is the discount factor.
Value function V(s) for each state s (obtained from policy evaluation).
The current policy π, with π(s) the action it selects in state s.
Output:
Improved policy π'(s) for each state s.
Algorithm:

Initialize a boolean variable policy_stable to true.

For each state s in S, do:

Let old_action be the current action selected by the policy π for state s.
Compute Q-value for each action a in A:
Q(s, a) = Σ [P(s' | s, a) * (R(s, a, s') + γ * V(s'))] over all possible next states s'.
Select the action a_max that maximizes the Q-value: a_max = argmax(Q(s, a)).
Update the policy π'(s) to select the action a_max.
Check for policy stability:

If the new policy π' differs from the old policy π in any state s, set policy_stable to false.
If policy_stable is true, return the improved policy π'; it is greedy with respect to V and cannot be improved further.
Otherwise, re-evaluate the new policy π' with policy evaluation and repeat the improvement step (this alternation of evaluation and improvement is policy iteration).

Explanation:

The algorithm iterates over each state in the state space and computes the Q-value for each action
based on the current value function V(s).
It selects the action that maximizes the Q-value as the new action for the state.
The process continues until the policy stabilizes, i.e., the new policy is the same as the old policy for
all states.
The output is the improved policy π' that greedily selects actions to maximize the expected
cumulative reward according to the current value function V(s).
This algorithm is known as "Policy Improvement" and is used in combination with policy evaluation
to iteratively improve the policy until convergence to an optimal policy in dynamic programming
approaches for solving MDPs.
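
A minimal sketch of one greedy improvement sweep, assuming V has already been computed by policy evaluation; the arrays P, R, V and the old policy below are placeholders for illustration, not the experiment's MDP:

import numpy as np

# Placeholder MDP arrays, value function, and current policy, for illustration only.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[1.0, 0.0], [0.0, 1.0]]])   # P[s, a, s']
R = np.array([[[0.0, 1.0], [0.0, 2.0]],
              [[0.0, 0.0], [0.0, 5.0]]])   # R[s, a, s']
gamma = 0.9
V = np.array([3.0, 4.5])                   # value function of the current policy
old_policy = np.array([0, 0])              # current deterministic policy

new_policy = old_policy.copy()
policy_stable = True
for s in range(len(V)):
    # Q(s, a) = sum over s' of P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
    q = np.array([np.sum(P[s, a] * (R[s, a] + gamma * V)) for a in range(P.shape[1])])
    new_policy[s] = np.argmax(q)           # greedy action a_max for state s
    if new_policy[s] != old_policy[s]:
        policy_stable = False              # the policy changed, so it is not yet stable
print(new_policy, policy_stable)
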
1. Iterative Policy Evaluation and Policy Improvement: The process of policy evaluation and
policy improvement is often interleaved. After evaluating a policy, we improve it by selecting
better actions based on the updated value function. Then, we re-evaluate the policy to refine
the value estimates further.
2. Convergence: Both policy evaluation and policy improvement converge to the optimal value
function and policy if executed iteratively until convergence.
3. Implementation: Dynamic programming algorithms can be implemented efficiently using
programming languages like Python, where you can define MDPs, transition probabilities,
rewards, value functions, and policies, and then iteratively update them until convergence.
This general approach can be applied to small-scale MDP problems to find the optimal policy
efficiently. However, for larger problems, approximate methods like reinforcement learning
techniques may be more practical due to the computational complexity of dynamic programming
algorithms.

CODE
import numpy as np

# Define MDP parameters
NUM_STATES = 3
NUM_ACTIONS = 2
GAMMA = 0.9  # Discount factor

# Define transition probabilities
# transition_probs[state][action][next_state]
transition_probs = np.array([
    [[0.2, 0.4, 0.4], [1.0, 0.0, 0.0]],  # from S1
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],  # from S2
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]   # from S3
])

# Define rewards
# rewards[state][action][next_state]
rewards = np.array([
    [[-1, -1, -1], [0, 0, 0]],     # from S1
    [[-1, -1, -1], [-1, -1, 10]],  # from S2
    [[-1, -1, -1], [-1, -1, -1]]   # from S3
])

# Initialize a random deterministic policy: one action per state
policy = np.random.randint(0, NUM_ACTIONS, size=NUM_STATES)


def policy_evaluation(policy, transition_probs, rewards, gamma=0.9, tol=1e-6):
    """Iteratively evaluate the given deterministic policy with Bellman expectation backups."""
    V = np.zeros(NUM_STATES)  # Initialize value function to zeros
    while True:
        delta = 0
        for s in range(NUM_STATES):
            v = V[s]
            # Expected one-step return when following the policy's action in state s
            bellman_expectation = sum(
                transition_probs[s][policy[s]][s1] *
                (rewards[s][policy[s]][s1] + gamma * V[s1])
                for s1 in range(NUM_STATES)
            )
            V[s] = bellman_expectation
            delta = max(delta, abs(v - V[s]))
        if delta < tol:  # stop once the largest update is below the tolerance
            break
    return V


def policy_improvement(policy, V, transition_probs, rewards, gamma=0.9):
    """Greedily improve the policy with respect to the value function V."""
    policy_stable = True
    for s in range(NUM_STATES):
        old_action = policy[s]
        action_values = np.zeros(NUM_ACTIONS)
        for a in range(NUM_ACTIONS):
            # Q(s, a): expected immediate reward plus discounted value of the next state
            action_values[a] = sum(
                transition_probs[s][a][s1] *
                (rewards[s][a][s1] + gamma * V[s1])
                for s1 in range(NUM_STATES)
            )
        # Greedily select the best action
        policy[s] = np.argmax(action_values)
        if old_action != policy[s]:
            policy_stable = False
    return policy, policy_stable


# Perform policy iteration: alternate evaluation and improvement until the policy is stable
policy_stable = False
iteration = 0
while not policy_stable:
    print(f"Iteration {iteration}: Policy {policy}")
    V = policy_evaluation(policy, transition_probs, rewards, gamma=GAMMA)
    policy, policy_stable = policy_improvement(policy, V, transition_probs, rewards,
                                               gamma=GAMMA)
    iteration += 1

print(f"Optimal Policy: {policy}")
print(f"Optimal Value Function: {V}")

Output:
Conclusion:
1. Give one example of dynamic programming.
One classic example of dynamic programming is computing the Fibonacci sequence. The Fibonacci
sequence is a series of numbers in which each number (except for the first two) is the sum of the two
preceding ones. It starts with 0 and 1: 0, 1, 1, 2, 3, 5, 8, 13, 21, ... A naive recursive implementation
is very inefficient because it recomputes the same subproblems many times. Dynamic programming
avoids this by storing the results of subproblems and reusing them: instead of recalculating Fibonacci
numbers from scratch for each value of n, previously computed Fibonacci numbers are kept in a fib
list and reused as needed, which greatly improves efficiency for large n. This is an example of
bottom-up dynamic programming, where smaller subproblems are solved first and their solutions are
combined to solve the larger problem, as the sketch below shows.
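
A minimal bottom-up sketch of the fib-list idea described above (the function name and test values are only illustrative):

def fibonacci(n):
    """Bottom-up dynamic programming: build the fib list from the base cases."""
    if n < 2:
        return n
    fib = [0, 1]                              # base cases F(0) and F(1)
    for i in range(2, n + 1):
        fib.append(fib[i - 1] + fib[i - 2])   # reuse stored subproblem results
    return fib[n]

print([fibonacci(n) for n in range(10)])      # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]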

2. Explain how dynamic programming is utilized in reinforcement learning.

Here is how dynamic programming is utilized in reinforcement learning:

a. Policy Evaluation: In RL, the value function represents the expected return (or cumulative
reward) that an agent can achieve from a given state under a certain policy. Dynamic
programming is used to iteratively estimate the value function for a given policy until
convergence. This process is called policy evaluation.
b. Policy Improvement: Once the value function has been evaluated, dynamic programming is
used to improve the policy based on the estimated value function. The policy improvement
step involves selecting actions that lead to states with higher value estimates.
c. Value Iteration: Value iteration is another DP algorithm commonly used in reinforcement
learning. It combines policy evaluation and policy improvement into a single step: the value
function is updated iteratively by taking the maximum expected return over all possible
actions from each state (see the short sketch after this list).
d. Model-Based RL: Dynamic programming methods are particularly useful in model-based
RL, where the agent has access to a model of the environment dynamics (transition
probabilities and rewards).
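
A minimal value-iteration sketch for item (c); the MDP arrays below are placeholders for illustration, not the experiment's MDP:

import numpy as np

# Placeholder MDP arrays, for illustration only.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[1.0, 0.0], [0.0, 1.0]]])   # P[s, a, s']
R = np.array([[[0.0, 1.0], [0.0, 2.0]],
              [[0.0, 0.0], [0.0, 5.0]]])   # R[s, a, s']
gamma, eps = 0.9, 1e-6

V = np.zeros(P.shape[0])
while True:
    delta = 0.0
    for s in range(P.shape[0]):
        v = V[s]
        # Bellman optimality backup: maximum expected return over all actions
        V[s] = max(np.sum(P[s, a] * (R[s, a] + gamma * V)) for a in range(P.shape[1]))
        delta = max(delta, abs(v - V[s]))
    if delta < eps:
        break

# Greedy policy extracted from the converged value function
policy = [int(np.argmax([np.sum(P[s, a] * (R[s, a] + gamma * V))
                         for a in range(P.shape[1])]))
          for s in range(P.shape[0])]
print(V, policy)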
