Reinforcement Learning - Project 3
Section: 002
Pair: 14
Team members
Task T1
# Creating functions to build the policy with value iteration (Bellman's equation)
import numpy as np
def value_iteration(environment, gamma=0.9, theta=1e-4):
    # gamma/theta are assumed defaults; the surviving fragments do not state them
    # initialize value_function U from the reward grid
    value_function = np.copy(environment)
    max_iterations = 500
    k = 0
    while k < max_iterations:
        k += 1
        delta = 0
        temp_value_function = np.copy(value_function)
        for row in range(GRID_SIZE[0]):
            for col in range(GRID_SIZE[1]):
                state = (row, col)
                if state in (DESTINATION, HAZARD) or state in OBSTACLE:
                    continue  # terminal and blocked cells keep their reward value
                action_values = []
                for intended_action in ACTIONS:
                    next_value = 0
                    for action in ACTIONS:
                        next_state = (row + action[0], col + action[1])
                        if is_valid_state(next_state, GRID_SIZE):
                            next_value += transProb(next_state, action, intended_action,
                                OBSTACLE) * value_function[next_state[0], next_state[1]]
                    action_values.append(next_value)
                temp_value_function[row, col] = environment[row, col] + gamma * max(action_values)
                delta = max(delta, abs(temp_value_function[row, col] - value_function[row, col]))
        value_function = temp_value_function
        if delta < theta:
            break
    return value_function
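The loop above calls two helpers whose definitions did not survive extraction. The sketch below is a minimal reconstruction: the obstacle check and its return 0 come from the surviving fragment, while the 0.8/0.1 slip probabilities are an assumption (the project's actual transition model may differ).
def is_valid_state(state, grid_size):
    # a state is valid if it lies inside the grid bounds
    return 0 <= state[0] < grid_size[0] and 0 <= state[1] < grid_size[1]
def transProb(next_state, action, intended_action, obstacles):
    # obstacle no movement: probability 0 of ending inside an obstacle
    if next_state in obstacles:
        return 0
    if action == intended_action:
        return 0.8  # assumed: intended move succeeds with probability 0.8
    if action[0] * intended_action[0] + action[1] * intended_action[1] == 0:
        return 0.1  # assumed: slip to each perpendicular direction
    return 0.0      # the opposite direction is never taken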
# test: create the environment
GRID_SIZE = (5, 5)
NUM_ACTIONS = 4
ACTIONS = [(0, -1), (0, 1), (-1, 0), (1, 0)]  # Left, Right, Up, Down as (row, col) offsets
HAZARD = (2, 2)
DESTINATION = (0, 3)
START_POSITION = (3, 1)
OBSTACLE = [(0, 0), (0, 4), (4, 0), (4, 4)]
# Create environment
environment = initialize_environment(R1, R2, DESTINATION, HAZARD)
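The initialize_environment helper itself is not shown anywhere in the report. The following minimal sketch is consistent with the call signature above, assuming the step reward r and GRID_SIZE are read from the enclosing scope (the call does not pass them) and that numpy is imported as np:
def initialize_environment(R1, R2, destination, hazard):
    # assumed layout: step reward r in every cell, R1 at the destination,
    # R2 at the hazard (r and GRID_SIZE are globals, as in the report's style)
    env = np.full(GRID_SIZE, float(r))
    env[destination] = R1
    env[hazard] = R2
    return env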
Policy:
X → → D X
→ → → ↑ ←
→ ↑ H ↑ ←
→ ↑ → ↑ ←
X ↑ ↑ ↑ X
Value function:
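The arrow grids in this report read a greedy policy off the converged value function; the rendering helper is not included in the surviving text. A plausible sketch, reusing the symbols X (obstacle), D (destination), and H (hazard) from the printouts above:
ARROWS = {(0, -1): '←', (0, 1): '→', (-1, 0): '↑', (1, 0): '↓'}
def print_policy(value_function):
    for row in range(GRID_SIZE[0]):
        cells = []
        for col in range(GRID_SIZE[1]):
            state = (row, col)
            if state in OBSTACLE:
                cells.append('X')
            elif state == DESTINATION:
                cells.append('D')
            elif state == HAZARD:
                cells.append('H')
            else:
                # greedy action: the intended move with the highest expected value
                def expected(intended):
                    total = 0
                    for move in ACTIONS:
                        nxt = (row + move[0], col + move[1])
                        if is_valid_state(nxt, GRID_SIZE):
                            total += transProb(nxt, move, intended, OBSTACLE) \
                                * value_function[nxt[0], nxt[1]]
                    return total
                cells.append(ARROWS[max(ACTIONS, key=expected)])
        print(' '.join(cells))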
Implementing Tasks
# Environment settings
# The same environment is used for all tasks
GRID_SIZE = (5, 5)
NUM_ACTIONS = 4
HAZARD = (2, 2)
DESTINATION = (0, 3)
START_POSITION = (3, 1)
OBSTACLE = [(0, 0), (0, 4), (4, 0), (4, 4)]
Task T1
R1 = 10  # reward at DESTINATION
R2 = -5  # reward at HAZARD
r = -5   # step reward for the remaining cells (assumed; see the initialize_environment sketch above)
# Create environment
environment = initialize_environment(R1, R2, DESTINATION, HAZARD)
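A hypothetical two-line driver for this run, assuming the value_iteration and print_policy sketches above; its output would be the grid labeled P1 below:
U = value_iteration(environment)  # iterate Bellman updates to convergence
print_policy(U)                   # renders the arrow grid (P1)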
Policy (P1):
X → → D X
→ → → ↑ ←
→ ↑ H ↑ ←
→ ↑ → ↑ ←
X ↑ ↑ ↑ X
Value function:
# Create environment
environment = initialize_environment(R1, R2, DESTINATION, HAZARD)
Policy (P2):
X → → D X
→ → → ↑ ←
↑ ↑ H ↑ ↑
↑ ↑ → ↑ ↑
X ↑ → ↑ X
Value function:
# Create environment
environment = initialize_environment(R1, R2, DESTINATION, HAZARD)
Policy (P3):
X → → D X
→ ↑ → ↑ ←
↑ ↑ H ↑ ↑
↑ ← → → ↑
X → → ↑ X
Value function:
# Create environment
environment = initialize_environment(R1, R2, DESTINATION, HAZARD)
Policy:
X → → D X
→ → → ↑ ←
→ ↑ H ↑ ←
→ ↑ → ↑ ←
X ↑ ↑ ↑ X
Value function:
R1 = 50
R2 = -50
r = -1
# Create environment
environment = initialize_environment(R1, R2, DESTINATION, HAZARD)
Policy:
X → → D X
→ → → ↑ ←
↑ ↑ H ↑ ↑
↑ ↑ → ↑ ↑
X ↑ → ↑ X
Value function:
# Create environment
environment = initialize_environment(R1, R2, DESTINATION, HAZARD)
Policy:
X → → D X
→ ↑ → ↑ ←
↑ ↑ H ↑ ↑
↑ ← → → ↑
X → → ↑ X
Value function: