Ass1 Merged Merged

The document shows code for implementing Q-learning on the FrozenLake environment in Gym. It initializes the environment and prints information about the observation and action spaces. It then initializes a Q-table of zeros and trains the Q-table over multiple episodes using the Q-learning update rule. It prints the Q-table before and after training and plots the outcomes of each episode. Finally, it implements value iteration to find the optimal policy.


In [1]:

import gym

In [2]:

env=gym.make("FrozenLake-v1",render_mode="human")

E:\anaconda\lib\site-packages\gym\core.py:317: DeprecationWarning: WARN: Initializing wrapper in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future.
  deprecation(
E:\anaconda\lib\site-packages\gym\wrappers\step_api_compatibility.py:39: DeprecationWarning: WARN: Initializing environment in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future.
  deprecation(

In [3]:

env.reset()
Out[3]:
0

In [4]:

# observation space - states


print(env.observation_space)

# actions: left -0, down - 1, right - 2, up- 3


print(env.action_space)

Discrete(16)
Discrete(4)

In [5]:
import numpy as np
import matplotlib.pyplot as plt

In [6]:
plt.rcParams['figure.dpi'] = 300
plt.rcParams.update({'font.size': 17})

# We initialize the Q-table


qtable = np.zeros((env.observation_space.n, env.action_space.n))

# Hyperparameters
episodes =50 # Total number of episodes
alpha = 0.5 # Learning rate
gamma = 0.9 # Discount factor
# List of outcomes to plot
outcomes = []

print('Q-table before training:')


print(qtable)

# Training
for _ in range(episodes):
    state = env.reset()
    done = False

    # By default, we consider our outcome to be a failure
    outcomes.append("Failure")

    # Until the agent gets stuck in a hole or reaches the goal, keep training it
    while not done:
        # Choose the action with the highest value in the current state
        if np.max(qtable[state]) > 0:
            action = np.argmax(qtable[state])
        # If there's no best action (only zeros), take a random one
        else:
            action = env.action_space.sample()

        # Implement this action and move the agent in the desired direction
        # (old Gym step API, as the warnings above indicate: returns observation, reward, done, info)
        new_state, reward, done, info = env.step(action)

        # Update Q(s,a)
        qtable[state, action] = qtable[state, action] + \
            alpha * (reward + gamma * np.max(qtable[new_state]) - qtable[state, action])

        # Update our current state
        state = new_state

        # If we have a reward, it means that our outcome is a success
        if reward:
            outcomes[-1] = "Success"

print()
print('===========================================')
print('Q-table after training:')
print(qtable)

# Plot outcomes
plt.figure(figsize=(12, 5))
plt.xlabel("Run number")
plt.ylabel("Outcome")
ax = plt.gca()
ax.set_facecolor('#efeeea')
plt.bar(range(len(outcomes)), outcomes, color="#0A047A", width=1.0)
plt.show()

Q-table before training:
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

===========================================
Q-table after training:
[[0.     0.     0.     0.    ]
 [0.     0.     0.     0.    ]
 [0.     0.     0.     0.    ]
 [0.     0.     0.     0.    ]
 [0.     0.     0.     0.    ]
 [0.     0.     0.     0.    ]
 [0.     0.     0.     0.    ]
 [0.     0.     0.     0.    ]
 [0.     0.     0.     0.    ]
 [0.     0.     0.     0.    ]
 [0.3375 0.     0.     0.    ]
 [0.     0.     0.     0.    ]
 [0.     0.     0.     0.    ]
 [0.     0.     0.225  0.    ]
 [0.     0.875  0.     0.    ]
 [0.     0.     0.     0.    ]]
In [7]:
# close the environment
#env.close()

In [8]:
print(env.observation_space)

Discrete(16)

In [9]:
env.action_space
Out[9]:
Discrete(4)

In [10]:
print(env.P[9][2])

[(0.3333333333333333, 13, 0.0, False), (0.3333333333333333, 10, 0.0, False), (0.3333333333333333, 5, 0.0, True)]
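Each entry of env.P[state][action] is a list of (transition probability, next state, reward, done) tuples, one per possible outcome of the slippery move. A small sketch (an addition, not from the original notebook) that prints the full transition model of state 9:

for a in range(env.action_space.n):
    print(f"state 9, action {a}:")
    for prob, next_state, reward, done in env.P[9][a]:
        print(f"  prob {prob:.2f} -> state {next_state}, reward {reward}, done={done}")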

In [11]:
random_action = env.action_space.sample()

In [12]:
env.reset()
Out[12]:
0
In [13]:
new_state, reward, done, info = env.step(random_action)  # old step API: (obs, reward, done, info)

In [14]:
state = env.reset()
print('Time Step 0')
env.render()
num_timesteps = 100
for t in range(num_timesteps):
    new_state, reward, done, info = env.step(random_action)
    print("Time Step {}".format(t + 1))
    env.render()
    if done:
        break

Time Step 0
Time Step 1
Time Step 2
Time Step 3
Time Step 4
Time Step 5
Time Step 6
Time Step 7
Time Step 8
Time Step 9
Time Step 10

Value Iteration
In [15]:
def value_iteration(env):
    num_iterations = 1000
    threshold = 1e-20
    gamma = 1
    value_table = np.zeros(env.observation_space.n)  # all state values start at 0
    for i in range(num_iterations):
        updated_value_table = np.copy(value_table)
        for s in range(env.observation_space.n):
            Q_values = [sum([prob * (r + gamma * updated_value_table[s_])
                             for prob, s_, r, _ in env.P[s][a]])
                        for a in range(env.action_space.n)]
            value_table[s] = max(Q_values)
        if np.sum(np.fabs(updated_value_table - value_table)) <= threshold:
            break
    return value_table

In [16]:
def extract_policy(value_table):
    gamma = 1
    policy = np.zeros(env.observation_space.n)
    for s in range(env.observation_space.n):
        Q_values = [sum([prob * (r + gamma * value_table[s_])
                         for prob, s_, r, _ in env.P[s][a]])
                    for a in range(env.action_space.n)]
        policy[s] = np.argmax(np.array(Q_values))
    return policy

In [17]:
optimal_value_function = value_iteration(env)
optimal_policy = extract_policy(optimal_value_function)
print(optimal_policy)

[0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 0.]
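The printed policy is an array of action indices. To read it more easily, the indices can be mapped to direction names and reshaped to the 4x4 lake layout; a minimal sketch (the action_names mapping follows the comment in cell In [4]: left=0, down=1, right=2, up=3):

action_names = {0: 'LEFT', 1: 'DOWN', 2: 'RIGHT', 3: 'UP'}
policy_grid = np.array([action_names[int(a)] for a in optimal_policy]).reshape(4, 4)
print(policy_grid)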

In [18]:
env.close()

In [ ]:
ASSIGNMENT - 1
In [1]:

import gym

In [2]:
env = gym.make('MountainCar-v0',render_mode="rgb_array")

In [3]:
# Observation and action space
obs_space = env.observation_space
action_space = env.action_space
print("The observation space: {} ".format(obs_space))
print("The action space: {} ".format(action_space))

The observation space: Box([-1.2 -0.07], [0.6 0.07], (2,), float32)


The action space: Discrete(3)

In [4]:

# pip install gym[classic_control]

In [5]:
import matplotlib.pyplot as plt

# reset the environment and see the initial observation


obs = env.reset()
print("The initial observation is {} ".format(obs))

# Sample a random action from the entire action space


random_action = env.action_space.sample()

# Take the action and get the new observation
# (newer Gym step API: returns observation, reward, terminated, truncated, info)
new_obs, reward, terminated, truncated, info = env.step(random_action)
print("The new observation is {} ".format(new_obs))

The initial observation is (array([-0.5884122, 0. ], dtype=float32), {})


The new observation is [-5.8892918e-01 -5.1695626e-04]

In [6]:
env_screen = env.render()
env.close()

import matplotlib.pyplot as plt


plt.imshow(env_screen)
Out[6]:

<matplotlib.image.AxesImage at 0x266f66029e0>
In [7]:

import time

# Number of steps you run the agent for


num_steps = 150

obs = env.reset()

for step in range(num_steps):
    # take a random action, but you can also do something more intelligent
    # action = my_intelligent_agent_fn(obs)   (a heuristic sketch follows this cell)
    action = env.action_space.sample()

    # apply the action
    # (newer Gym step API: returns observation, reward, terminated, truncated, info)
    obs, reward, terminated, truncated, info = env.step(action)

    # Render the env
    env.render()

    # Wait a bit before the next frame unless you want to see a crazy fast video
    time.sleep(0.001)
    plt.imshow(env.render())

    # If the episode is up, then start another one
    if terminated or truncated:
        env.reset()

# Close the env
env.close()
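As an aside on the commented line action = my_intelligent_agent_fn(obs): even a simple hand-crafted heuristic does much better here than random actions. A minimal sketch, assuming the newer five-value step API used above (the function name and the push-with-velocity rule are illustrative, not part of the assignment):

def simple_mountaincar_policy(obs):
    # Push in the direction of the current velocity so the car builds momentum.
    position, velocity = obs
    return 2 if velocity >= 0 else 0  # 0 = push left, 2 = push right

env = gym.make('MountainCar-v0')
obs, info = env.reset()
for step in range(200):
    obs, reward, terminated, truncated, info = env.step(simple_mountaincar_policy(obs))
    if terminated or truncated:
        break
env.close()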
ASSIGNMENT - 2
In [1]:

import gym
import random
import matplotlib.pyplot as plt
import numpy as np

In [3]:

# MAKING A GYM ENVIRONMENT FOR THE GAME


environment = gym.make("FrozenLake-v1", is_slippery=False)
environment.reset()
environment.render()

In [4]:
# Q TABLE FOR THE GAME
#1
# qtable = np.zeros((16, 4))
#2
nb_states = environment.observation_space.n # = 16
nb_actions = environment.action_space.n # = 4
qtable = np.zeros((nb_states, nb_actions))
qtable

Out[4]:
array([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]])

In [5]:
# random.choice(["LEFT", "DOWN", "RIGHT", "UP"])
environment.action_space.sample()
Out[5]:

In [6]:
environment.step(2)
environment.render()

In [7]:

action = environment.action_space.sample()

# 2. Implement this action and move the agent in the desired direction
# (newer Gym step API: returns observation, reward, terminated, truncated, info)
new_state, reward, terminated, truncated, info = environment.step(action)
# Display the results (reward and map)
environment.render()
print(f'Reward = {reward}')

Reward = 0.0

In [8]:

import matplotlib.pyplot as plt


plt.rcParams['figure.dpi'] = 300
plt.rcParams.update({'font.size': 17})

# We re-initialize the Q-table


qtable = np.zeros((environment.observation_space.n, environment.action_space.n))

# Hyperparameters
episodes = 1000 # Total number of episodes
alpha = 0.5 # Learning rate
gamma = 0.9 # Discount factor

# List of outcomes to plot


outcomes = []

print('Q-table before training:')


print(qtable)

# Training
for _ in range(episodes):
    state = environment.reset()[0]
    done = False

    # By default, we consider our outcome to be a failure
    outcomes.append("Failure")

    # Until the agent gets stuck in a hole or reaches the goal, keep training it
    while not done:
        # Choose the action with the highest value in the current state
        if np.max(qtable[state]) > 0:
            action = np.argmax(qtable[state])
        # If there's no best action (only zeros), take a random one
        else:
            action = environment.action_space.sample()

        # Implement this action and move the agent in the desired direction
        # (newer Gym step API: returns observation, reward, terminated, truncated, info)
        new_state, reward, terminated, truncated, info = environment.step(action)
        done = terminated or truncated

        # Update Q(s,a)
        qtable[state, action] = qtable[state, action] + \
            alpha * (reward + gamma * np.max(qtable[new_state]) - qtable[state, action])

        # Update our current state
        state = new_state

        # If we have a reward, it means that our outcome is a success
        if reward:
            outcomes[-1] = "Success"

print()
print('===========================================')
print('Q-table after training:')
print(qtable)

# Plot outcomes
plt.figure(figsize=(12, 5))
plt.xlabel("Run number")
plt.ylabel("Outcome")
ax = plt.gca()
ax.set_facecolor('#efeeea')
plt.bar(range(len(outcomes)), outcomes, color="#0A047A", width=1.0)
plt.show()
Q-table before training:
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]

===========================================
Q-table after training:
[[0. 0. 0.59049 0. ]
[0. 0. 0.6561 0. ]
[0. 0.729 0. 0. ]
[0. 0. 0. 0. ]
[0. 0. 0. 0. ]
[0. 0. 0. 0. ]
[0. 0.81 0. 0. ]
[0. 0. 0. 0. ]
[0. 0. 0. 0. ]
[0. 0. 0.2784375 0. ]
[0. 0.9 0. 0. ]
[0. 0. 0. 0. ]
[0. 0. 0. 0. ]
[0. 0. 0. 0. ]
[0. 0. 1. 0. ]
[0. 0. 0. 0. ]]
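One caveat worth noting (not stated in the notebook): the training loop above only explores while all Q-values of a state are still zero, which is enough on this deterministic (is_slippery=False) map but can get stuck on harder problems. A common alternative is an epsilon-greedy choice for the action-selection step; a minimal sketch, assuming an exploration rate epsilon (Assignment 6 below uses the same idea):

epsilon = 0.1  # assumed exploration rate, not a value used in the notebook
if random.uniform(0, 1) < epsilon:
    action = environment.action_space.sample()  # explore: random action
else:
    action = np.argmax(qtable[state])           # exploit: best known action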

In [9]:
from IPython.display import clear_output
import time

state = environment.reset()[0]
done = False
sequence = []

while not done:
    # Choose the action with the highest value in the current state
    if np.max(qtable[state]) > 0:
        action = np.argmax(qtable[state])
    # If there's no best action (only zeros), take a random one
    else:
        action = environment.action_space.sample()

    # Add the action to the sequence
    sequence.append(action)

    # Implement this action and move the agent in the desired direction
    new_state, reward, terminated, truncated, info = environment.step(action)
    done = terminated or truncated

    # Update our current state
    state = new_state

    # Update the render
    clear_output(wait=True)
    environment.render()
    time.sleep(1)

print(f"Sequence = {sequence}")

Sequence = [2, 2, 1, 1, 1, 2]

In [10]:
episodes = 100
nb_success = 0

# Evaluation
for _ in range(episodes):
    state = environment.reset()[0]
    done = False

    # Until the agent gets stuck or reaches the goal, keep playing
    while not done:
        # Choose the action with the highest value in the current state
        if np.max(qtable[state]) > 0:
            action = np.argmax(qtable[state])
        # If there's no best action (only zeros), take a random one
        else:
            action = environment.action_space.sample()

        # Implement this action and move the agent in the desired direction
        new_state, reward, terminated, truncated, info = environment.step(action)
        done = terminated or truncated

        # Update our current state
        state = new_state

    # When we get a reward, it means we solved the game
    nb_success += reward

# Let's check our success rate!


print (f"Success rate = {nb_success/episodes*100}%")

Success rate = 100.0%


ASSIGNMENT - 3
In [1]:

import gym
import numpy as np
import matplotlib.pyplot as plt

In [2]:

env = gym.make("FrozenLake-v1", render_mode="human" ,is_slippery=False)

In [3]:

env.reset()
# env.render()
# env.step(2)
Out[3]:

(0, {'prob': 1})

In [4]:
# env.step(2)

In [5]:

# env.step(2)

In [6]:
print(env.observation_space)
env.action_space

Discrete(16)
Out[6]:

Discrete(4)

In [7]:
env.P[2][2]
Out[7]:
[(1.0, 3, 0.0, False)]

In [8]:

def value_iteration(env, gamma=0.9):

    # initialize value table with zeros
    value_table = np.zeros(env.observation_space.n)

    # set number of iterations and threshold
    no_of_iterations = 100000
    threshold = 1e-20

    for i in range(no_of_iterations):

        # On each iteration, copy the value table to the updated_value_table
        updated_value_table = np.copy(value_table)

        # Now we calculate the Q value for each action in the state
        # and update the value of a state with the maximum Q value
        for state in range(env.observation_space.n):
            Q_value = []
            for action in range(env.action_space.n):
                next_states_rewards = []
                for next_sr in env.P[state][action]:
                    trans_prob, next_state, reward_prob, _ = next_sr
                    next_states_rewards.append(
                        trans_prob * (reward_prob + gamma * updated_value_table[next_state]))

                Q_value.append(np.sum(next_states_rewards))

            value_table[state] = max(Q_value)

        # We check whether we have reached convergence, i.e. whether the difference
        # between our value table and the updated value table is very small. We set a
        # threshold and, if the difference is less than that threshold, we break the
        # loop and return the value function as the optimal value function.
        if np.sum(np.fabs(updated_value_table - value_table)) <= threshold:
            print('Value-iteration converged at iteration# %d.' % (i + 1))
            break

    return value_table

In [9]:
def extract_policy(value_table, gamma=0.9):

    # initialize the policy with zeros
    policy = np.zeros(env.observation_space.n)

    for state in range(env.observation_space.n):

        # initialize the Q table for a state
        Q_table = np.zeros(env.action_space.n)

        # compute the Q value for all actions in the state
        for action in range(env.action_space.n):
            for next_sr in env.P[state][action]:
                trans_prob, next_state, reward_prob, _ = next_sr
                Q_table[action] += trans_prob * (reward_prob + gamma * value_table[next_state])

        # select the action which has the maximum Q value as the optimal action of the state
        policy[state] = np.argmax(Q_table)

    return policy

In [10]:
optimal_value_function = value_iteration(env=env,gamma=0.9)
optimal_value_function

Value-iteration converged at iteration# 7.


Out[10]:
array([0.59049, 0.6561 , 0.729 , 0.6561 , 0.6561 , 0. , 0.81 ,
0. , 0.729 , 0.81 , 0.9 , 0. , 0. , 0.9 ,
1. , 0. ])

In [11]:
optimal_policy = extract_policy(optimal_value_function, gamma=0.9)
optimal_policy
Out[11]:

array([1., 2., 1., 0., 1., 0., 1., 0., 2., 1., 1., 0., 0., 2., 2., 0.])

In [12]:
opt_pol = optimal_policy.reshape(4,4)
print('THE OPTIMAL POLICY IS \n ',opt_pol)

THE OPTIMAL POLICY IS


[[1. 2. 1. 0.]
[1. 0. 1. 0.]
[2. 1. 1. 0.]
[0. 2. 2. 0.]]
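As an optional check (not part of the original assignment), the extracted policy can be rolled out on the non-slippery lake; a minimal sketch, assuming the newer Gym API used above, where reset() returns (state, info) and step() returns five values:

state, info = env.reset()
done = False
while not done:
    action = int(optimal_policy[state])
    state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
print('Reached the goal' if reward == 1.0 else 'Episode ended without reaching the goal')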
ASSIGNMENT - 4
In [1]:

import gym
import pandas as pd
from collections import defaultdict

In [ ]:

env = gym.make('Blackjack-v1')
print(env.reset())

(18, 7, False)

In [ ]:

print(env.action_space)

Discrete(2)

In [ ]:
print(env.observation_space)

Tuple(Discrete(32), Discrete(11), Discrete(2))

In [ ]:
def policy(state):
    return 0 if state[0] > 19 else 1
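A quick sanity check of this threshold policy (the example states are illustrative, not from the notebook): the agent sticks (action 0) only when its sum exceeds 19 and hits (action 1) otherwise.

print(policy((21, 10, False)))  # 0 -> stick, player sum 21 > 19
print(policy((15, 6, False)))   # 1 -> hit, player sum 15 <= 19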

In [ ]:
state = env.reset()
print(state)

(20, 4, False)

In [ ]:

print(policy(state))

In [ ]:

num_timesteps = 500

def play_episode(env, policy):
    state = env.reset()
    episode = []
    for t in range(num_timesteps):
        action = policy(state)
        next_state, reward, done, _ = env.step(action)
        episode.append((state, action, reward))
        if done:
            break
        state = next_state
    return episode

In [ ]:
print(play_episode(env,policy))

[((12, 10, False), 1, 0.0), ((19, 10, False), 1, -1.0)]


In [ ]:
total_return = defaultdict(float)
N = defaultdict(int)

num_iterations = 100000

for i in range(num_iterations):
    episode = play_episode(env, policy)
    states, actions, rewards = zip(*episode)
    # every-visit Monte Carlo: accumulate the return observed from each visited state onward
    for t, state in enumerate(states):
        R = sum(rewards[t:])  # return from time step t to the end of the episode
        total_return[state] += R
        N[state] += 1

In [ ]:
total_return = pd.DataFrame(total_return.items(), columns=['state', 'total_return'])
N = pd.DataFrame(N.items(), columns=['state', 'N'])
df = pd.merge(total_return, N, on='state')
df.head()
Out[ ]:
             state  total_return     N
0  (13, 10, False)       -2317.0  3807
1  (20, 10, False)        2628.0  6001
2    (14, 1, True)         -26.0    85
3   (13, 1, False)        -614.0   898
4  (12, 10, False)       -1996.0  3597

In [ ]:
df['value'] = df['total_return'] / df['N']
df.head()
Out[ ]:
             state  total_return     N     value
0  (13, 10, False)       -2317.0  3807 -0.608616
1  (20, 10, False)        2628.0  6001  0.437927
2    (14, 1, True)         -26.0    85 -0.305882
3   (13, 1, False)        -614.0   898 -0.683742
4  (12, 10, False)       -1996.0  3597 -0.554907

In [ ]:
df[df['state']==(21,9,False)]['value'].values

Out[ ]:
array([0.94723618])

In [ ]:
df[df['state']==(5,8,False)]['value'].values
Out[ ]:
array([-0.63513514])
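To get an overview of these estimates (this visualization is an addition, not part of the original notebook), the values for hands without a usable ace can be arranged on a (player sum, dealer card) grid and shown as a heatmap; a minimal sketch using the df built above:

import numpy as np
import matplotlib.pyplot as plt

no_ace = df[df['state'].apply(lambda s: s[2] == False)]
grid = np.full((22, 11), np.nan)  # rows: player sum, columns: dealer's showing card
for _, row in no_ace.iterrows():
    player, dealer, _ = row['state']
    if player < 22 and dealer < 11:
        grid[player, dealer] = row['value']

plt.imshow(grid[4:22, 1:11], origin='lower', extent=[1, 10, 4, 21], aspect='auto')
plt.xlabel('Dealer showing')
plt.ylabel('Player sum')
plt.colorbar(label='Estimated value')
plt.show()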
ASSIGNMENT - 6
In [11]:

import gym
import numpy as np
import random

In [12]:

env= gym.make('FrozenLake-v1') #, render_mode='human')

In [13]:

Q = {}
for s in range(env.observation_space.n):
    for a in range(env.action_space.n):
        Q[(s, a)] = 0.0

In [14]:
def epsilon_greedy(state, epsilon):
    if random.uniform(0, 1) < epsilon:
        return env.action_space.sample()
    else:
        return max(list(range(env.action_space.n)), key=lambda x: Q[(state, x)])

In [15]:
alpha=0.85
gamma= 0.90
epsilon = 0.8

In [16]:
num_episodes = 50000
num_timesteps= 1000

In [17]:
for i in range(num_episodes):
    s = env.reset()[0]
    for t in range(num_timesteps):
        a = epsilon_greedy(s, epsilon)
        # newer Gym step API: (obs, reward, terminated, truncated, info)
        s_, r, terminated, truncated, info = env.step(a)
        # greedy action in the next state (Q-learning target, not the action actually taken next)
        a_ = np.argmax([Q[(s_, a2)] for a2 in range(env.action_space.n)])
        Q[(s, a)] += alpha * (r + gamma * Q[(s_, a_)] - Q[(s, a)])
        s = s_
        if terminated or truncated:
            break

In [18]:
Q

Out[18]:
{(0, 0): 0.23477961696373423,
(0, 1): 0.22480181183703787,
(0, 2): 0.23961716957752016,
(0, 3): 0.24066398243905854,
(1, 0): 0.2204815896999076,
(1, 1): 0.04017125915710931,
(1, 2): 0.2822227428738474,
(1, 3): 0.22490808477046206,
(2, 0): 0.29961284509447655,
(2, 1): 0.32990657866523887,
(2, 2): 0.37292229711147334,
(2, 3): 0.2790710024900863,
(3, 0): 0.25000597793284357,
(3, 1): 0.2575759230383145,
(3, 2): 0.037377204692152305,
(3, 3): 0.32551898596551954,
(4, 0): 0.3804023551933965,
(4, 1): 0.00856676265665978,
(4, 2): 0.5076563484150082,
(4, 3): 0.050394379122136346,
(5, 0): 0.0,
(5, 1): 0.0,
(5, 2): 0.0,
(5, 3): 0.0,
(6, 0): 0.4285911909119954,
(6, 1): 0.0002831967627810342,
(6, 2): 0.692932809233417,
(6, 3): 0.006210297861473632,
(7, 0): 0.0,
(7, 1): 0.0,
(7, 2): 0.0,
(7, 3): 0.0,
(8, 0): 0.4724947380728043,
(8, 1): 0.5292926568861616,
(8, 2): 0.0854144618498413,
(8, 3): 0.4739574281045383,
(9, 0): 0.07482359865928406,
(9, 1): 0.7499983936496128,
(9, 2): 0.5420577123103719,
(9, 3): 0.07627448541464799,
(10, 0): 0.5934266484891275,
(10, 1): 0.8240260740178592,
(10, 2): 0.774672222464751,
(10, 3): 0.09258352148159432,
(11, 0): 0.0,
(11, 1): 0.0,
(11, 2): 0.0,
(11, 3): 0.0,
(12, 0): 0.0,
(12, 1): 0.0,
(12, 2): 0.0,
(12, 3): 0.0,
(13, 0): 0.5646633544813056,
(13, 1): 0.11021736826449313,
(13, 2): 0.6234923802366498,
(13, 3): 0.7062633423537164,
(14, 0): 0.807111703090516,
(14, 1): 0.6835467756183892,
(14, 2): 0.9109203529502328,
(14, 3): 0.7967422343382821,
(15, 0): 0.0,
(15, 1): 0.0,
(15, 2): 0.0,
(15, 3): 0.0}
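Finally (an addition, not in the original notebook), the greedy policy implied by this Q dictionary can be read off and arranged on the 4x4 grid, mirroring the earlier assignments:

greedy_policy = np.array([
    np.argmax([Q[(s, a)] for a in range(env.action_space.n)])
    for s in range(env.observation_space.n)
])
print(greedy_policy.reshape(4, 4))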
