Assignment 1
February 8, 2024
Welcome to Assignment 1. This notebook will:
- Help you create your first bandit algorithm
- Help you understand the effect of epsilon on exploration and learn about the exploration/exploitation tradeoff
- Introduce you to some of the reinforcement learning software we are going to use for this specialization
This class uses RL-Glue to implement most of our experiments. It was originally designed by Adam
White, Brian Tanner, and Rich Sutton. This library will give you a solid framework to understand
how reinforcement learning experiments work and how to run your own. If it feels a little confusing
at first, don’t worry - we are going to walk you through it slowly and introduce you to more and
more parts as you progress through the specialization.
We are assuming that you have used a Jupyter notebook before. But if not, it is quite simple.
Simply press the run button, or shift+enter to run each of the cells. The places in the code that
you need to fill in will be clearly marked for you.
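The import cell itself looks roughly like the sketch below; the exact import statements for the course-provided modules (rl_glue, main_agent, and ten_arm_env) are assumptions based on how they are used later in this notebook.

# rough sketch of the notebook's import cell (assumed, based on later usage)
import numpy as np
import matplotlib.pyplot as plt

from rl_glue import RLGlue   # experiment harness connecting agents and environments
import main_agent            # base Agent class that our agents inherit from
import ten_arm_env           # the 10-armed Testbed environment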
In the above cell, we import the libraries we need for this assignment. We use numpy throughout
the course and occasionally provide hints for which methods to use in numpy. Other than that we
mostly use vanilla python and the occasional other library, such as matplotlib for making plots.
You might have noticed that we import ten_arm_env. This is the 10-armed Testbed introduced
in section 2.3 of the textbook. We use this throughout this notebook to test our bandit agents.
It has 10 arms, which are the actions the agent can take. Pulling an arm generates a stochastic
reward from a Gaussian distribution with unit variance. The expected value of each action is sampled from a normal distribution at the start of each run. If you are unfamiliar with the 10-armed Testbed, please review it in the textbook before continuing.
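To make the reward structure concrete, here is a minimal sketch of such a testbed. This is not the provided ten_arm_env code, just an illustration of the description above; the class name and pull method are hypothetical, and it relies on the numpy import from the top of the notebook.

class TenArmTestbed:
    """Illustrative sketch of the 10-armed Testbed's reward structure."""
    def __init__(self, num_arms=10):
        # the expected value of each arm is drawn from a standard normal
        # distribution at the start of each run
        self.arms = np.random.randn(num_arms)

    def pull(self, action):
        # pulling an arm returns a reward drawn from a unit-variance
        # Gaussian centered on that arm's expected value
        return self.arms[action] + np.random.randn()

# example: pull arm 3 once and observe a noisy reward
testbed = TenArmTestbed()
reward = testbed.pull(3)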
DO NOT IMPORT OTHER LIBRARIES as this will break the autograder.
DO NOT SET A RANDOM SEED as this will break the autograder.
Please do not duplicate cells. This will put your notebook into a bad state and break Coursera's
autograder.
Before you submit, please click “Kernel” -> “Restart and Run All” and make sure all cells pass.
We want to create an agent that will find the action with the highest expected reward. One way
an agent could operate is to always choose the action with the highest value based on the agent’s
current estimates. This is called a greedy agent as it greedily chooses the action that it thinks has
the highest value. Let’s look at what happens in this case.
First we are going to implement the argmax function, which takes in a list of action values and
returns an action with the highest value. Why are we implementing our own instead of using numpy's built-in argmax? Numpy's argmax returns the first instance of the highest value, which biases the agent toward a specific action in the case of ties. Instead, we want to break ties between the highest values randomly. So we
are going to implement our own argmax function. You may want to look at np.random.choice to
randomly select from a list of values.
[2]: # -----------
# Graded Cell
# -----------
def argmax(q_values):
    """
    Takes in a list of q_values and returns the index of the item
    with the highest value. Breaks ties randomly.
    returns: int - the index of the highest value in q_values
    """
    top_value = float("-inf")
    ties = []

    for i in range(len(q_values)):
        # if a value in q_values is greater than the highest value,
        # update top_value and reset ties to this index
        if q_values[i] > top_value:
            top_value = q_values[i]
            ties = [i]
        # if a value is equal to the current highest value, add its index to ties
        elif q_values[i] == top_value:
            ties.append(i)

    # randomly select an index from the tied indices
    return np.random.choice(ties)
[3]: # --------------
# Debugging Cell
# --------------
# Feel free to make any changes to this cell to debug your code
test_array = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
assert argmax(test_array) == 8, "Check your argmax implementation returns the index of the largest value"

# the largest value is now at index 0
test_array = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
assert argmax(test_array) == 0
[4]: # -----------
# Tested Cell
# -----------
# The contents of the cell will be tested by the autograder.
# If they do not pass here, they will not pass there.
test_array = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
assert argmax(test_array) == 8, "Check your argmax implementation returns the index of the largest value"

# an array with ties for the largest value at index 0 and index 3
test_array = [1, 0, 0, 1]

np.random.seed(0)  # fix the seed so the expected counts below are reproducible
counts = [0, 0, 0, 0]
for _ in range(100):
    a = argmax(test_array)
    counts[a] += 1

# make sure the random number generator is called exactly once whenever `argmax` is called
expected = [44, 0, 0, 56]  # <-- notice not perfectly uniform due to randomness
assert counts == expected
Now we introduce the first part of an RL-Glue agent that you will implement. Here we are going to
create a GreedyAgent and implement the agent_step method. This method gets called each time
the agent takes a step. The method has to return the action selected by the agent. This method
also ensures the agent’s estimates are updated based on the signals it gets from the environment.
Fill in the code below to implement a greedy agent.
[5]: # -----------
# Graded Cell
# -----------
class GreedyAgent(main_agent.Agent):
    def agent_step(self, reward, observation=None):
        """
        Takes one step for the agent. It takes in a reward and observation and
        returns the action the agent chooses at that time step.

        Arguments:
        reward -- float, the reward the agent received from the environment
                  after taking the last action.
        observation -- float, the observed state the agent is in. Do not worry
                       about this as you will not use it until future lessons.
        """
        ### Useful Class Variables ###
        # self.q_values : An array with the agent's current estimate of each arm's value.
        # self.arm_count : An array with a count of the number of times each arm has been pulled.
        # self.last_action : The action the agent took on the previous time step.
        #######################

        # update the estimate of the arm pulled on the previous step,
        # using the sample-average step size 1/N(A)
        self.arm_count[self.last_action] += 1
        step_size = 1 / self.arm_count[self.last_action]
        self.q_values[self.last_action] = self.q_values[self.last_action] + step_size * (reward - self.q_values[self.last_action])

        # greedily select the next action using the argmax function defined above
        current_action = argmax(self.q_values)

        self.last_action = current_action
        return current_action
[6]: # --------------
# Debugging Cell
# --------------
# Feel free to make any changes to this cell to debug your code
# build a fake agent for testing and set some initial conditions
np.random.seed(1)
greedy_agent = GreedyAgent()
greedy_agent.q_values = [0, 0, 0.5, 0, 0]
greedy_agent.arm_count = [0, 1, 0, 0, 0]
greedy_agent.last_action = 1
action = greedy_agent.agent_step(reward=1)
# make sure the agent is using the argmax that breaks ties randomly
assert action == 2
[7]: # -----------
# Tested Cell
# -----------
# The contents of the cell will be tested by the autograder.
# If they do not pass here, they will not pass there.
# build a fake agent for testing and set some initial conditions
np.random.seed(1)
greedy_agent = GreedyAgent()
greedy_agent.q_values = [0, 0, 1.0, 0, 0]
greedy_agent.arm_count = [0, 1, 0, 0, 0]
greedy_agent.last_action = 1
action = greedy_agent.agent_step(reward=1)
Let’s visualize the result. Here we run an experiment using RL-Glue to test our agent. For now, we
will set up the experiment code; in future lessons, we will walk you through running experiments
so that you can create your own.
[8]: # ---------------
# Discussion Cell
# ---------------
np.random.seed(run)
average_best += np.max(rl_glue.environment.arms)

for i in range(num_steps):
    # the environment and agent take a step and return the reward and action taken
    reward, _, action, _ = rl_glue.rl_step()
    rewards[run, i] = reward
We learned about another way for an agent to operate, where it does not always take the greedy action. Instead, it sometimes takes an exploratory action, so that it can find out what the best action really is. If we always choose what we currently think is the best action, we may miss out on the true best action, because we haven't explored enough to find it.
Implement an epsilon-greedy agent below. Hint: we are implementing the algorithm from section 2.4
of the textbook. You may want to use your greedy code from above and look at np.random.random,
as well as np.random.randint, to help you select random actions.
[9]: # -----------
# Graded Cell
# -----------
class EpsilonGreedyAgent(main_agent.Agent):
    def agent_step(self, reward, observation):
        """
        Takes one step for the agent. It takes in a reward and observation and
        returns the action the agent chooses at that time step.

        Arguments:
        reward -- float, the reward the agent received from the environment
                  after taking the last action.
        observation -- float, the observed state the agent is in. Do not worry
                       about this as you will not use it until future lessons.
        """
        ### Useful Class Variables ###
        # self.q_values : An array with the agent's current estimate of each arm's value.
        # self.arm_count : An array with a count of the number of times each arm has been pulled.
        # self.last_action : The action the agent took on the previous time step.
        # self.epsilon : The probability an epsilon greedy agent will explore (ranges between 0 and 1).
        #######################

        # Update Q values - this should be the same update as your greedy agent above
        self.arm_count[self.last_action] += 1
        step_size = 1 / self.arm_count[self.last_action]
        self.q_values[self.last_action] = self.q_values[self.last_action] + step_size * (reward - self.q_values[self.last_action])

        # Choose the action using epsilon greedy: explore with probability epsilon,
        # otherwise act greedily with respect to the current estimates
        if np.random.random() < self.epsilon:
            current_action = np.random.choice(range(len(self.arm_count)))
        else:
            current_action = argmax(self.q_values)

        self.last_action = current_action
        return current_action
[10]: # --------------
# Debugging Cell
# --------------
# Feel free to make any changes to this cell to debug your code
# build a fake agent for testing and set some initial conditions
np.random.seed(0)
e_greedy_agent = EpsilonGreedyAgent()
e_greedy_agent.q_values = [0, 0.0, 0.5, 0, 0]
e_greedy_agent.arm_count = [0, 1, 0, 0, 0]
e_greedy_agent.num_actions = 5
e_greedy_agent.last_action = 1
e_greedy_agent.epsilon = 0.5
# given this random seed, we should see a greedy action (action 2) here
action = e_greedy_agent.agent_step(reward=1, observation=0)
# -----------------------------------------------
# we'll try to guess a few of the trickier places
# -----------------------------------------------
# make sure to update for the *last_action* not the current action
assert e_greedy_agent.q_values != [0, 0.5, 1.0, 0, 0], "A"
# make sure the stepsize is based on the *last_action* not the current action
assert e_greedy_agent.q_values != [0, 1, 0.5, 0, 0], "B"
# make sure the agent is using the argmax that breaks ties randomly
assert action == 2, "C"
# -----------------------------------------------
# given this random seed, we should see a random action (action 4) here
action = e_greedy_agent.agent_step(reward=1, observation=0)
# The agent saw a reward of 1, so should increase the value for *last_action*
assert e_greedy_agent.q_values == [0, 0.75, 0.5, 0, 0], "D"
# the agent should have picked a random action for this particular random seed
assert action == 4, "E"
[11]: # -----------
# Tested Cell
# -----------
# The contents of the cell will be tested by the autograder.
# If they do not pass here, they will not pass there.
np.random.seed(0)
e_greedy_agent = EpsilonGreedyAgent()
e_greedy_agent.q_values = [0, 0, 1.0, 0, 0]
e_greedy_agent.arm_count = [0, 1, 0, 0, 0]
e_greedy_agent.num_actions = 5
e_greedy_agent.last_action = 1
e_greedy_agent.epsilon = 0.5
action = e_greedy_agent.agent_step(reward=1, observation=0)
assert action == 4
Now that we have created our epsilon-greedy agent, let's compare it against the greedy agent, using an epsilon of 0.1.
[12]: # ---------------
# Discussion Cell
# ---------------
epsilon = 0.1
agent = EpsilonGreedyAgent
env = ten_arm_env.Environment
agent_info = {"num_actions": 10, "epsilon": epsilon}
env_info = {}
all_rewards = np.zeros((num_runs, num_steps))
for i in range(num_steps):
    # the environment and agent take a step and return the reward and action taken
    reward, _, action, _ = rl_glue.rl_step()
Notice how much better the epsilon-greedy agent did. Because we occasionally choose a random
action we were able to find a better long term policy. By acting greedily before our value estimates
are accurate, we risk settling on a suboptimal action.
Did you notice that we averaged over 200 runs? Why did we do that?
To get some insight, let’s look at the results of two individual runs by the same agent.
[13]: # ---------------
# Discussion Cell
# ---------------
averages = []

rl_glue = RLGlue(env, agent)
rl_glue.rl_init(agent_info, env_info)
rl_glue.rl_start()

scores = [0]
for i in range(num_steps):
    reward, state, action, is_terminal = rl_glue.rl_step()
    scores.append(scores[-1] + reward)
    averages.append(scores[-1] / (i + 1))

plt.plot(averages)
Notice how different the two runs were. If this is the exact same algorithm, why does it behave differently in these two runs?
The answer is randomness, both in the environment and in the agent. Which action the agent randomly starts with, and when it randomly chooses to explore, can change the results of a run. And even if the agent chooses the same action, the reward from the environment is randomly sampled from a Gaussian. The agent could get lucky and see larger rewards for the best action early on, and so settle on the best action faster. Or it could get unlucky, see smaller rewards for the best action early on, and so take longer to recognize that it is in fact the best action.
To be more concrete, let’s look at how many times an exploratory action is taken, for different
seeds.
[14]: # ---------------
# Discussion Cell
# ---------------
print("Random Seed 1")
np.random.seed(1)
for _ in range(15):
if np.random.random() < 0.1:
print("Exploratory Action")
print()
print()
Random Seed 1
Exploratory Action
Exploratory Action
Exploratory Action
Random Seed 2
Exploratory Action
With the first seed, we take an exploratory action three times out of 15, but with the second, we only take an exploratory action once. This can have a large effect on the performance of our agent, because the amount of exploration differs so much between the two runs.
To compare algorithms, we therefore report performance averaged across many runs. We do this
to ensure that we are not simply reporting a result that is due to stochasticity, as explained in the
lectures. Rather, we want statistically significant outcomes. We will not use statistical significance
tests in this course. Instead, because we have access to simulators for our experiments, we use the
simpler strategy of running for a large number of runs and ensuring that the confidence intervals
do not overlap.
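As a rough sketch of what that looks like in code (assuming a rewards array of shape (num_runs, num_steps) like the ones used in the experiment cells above, and the numpy and matplotlib imports from the top of the notebook), the mean learning curve and an approximate 95% confidence interval can be computed and plotted as follows:

# sketch: mean learning curve with an approximate 95% confidence interval,
# assuming `rewards` has shape (num_runs, num_steps) as in the cells above
mean_rewards = np.mean(rewards, axis=0)
std_err = np.std(rewards, axis=0, ddof=1) / np.sqrt(rewards.shape[0])

plt.plot(mean_rewards)
plt.fill_between(np.arange(rewards.shape[1]),
                 mean_rewards - 1.96 * std_err,
                 mean_rewards + 1.96 * std_err,
                 alpha=0.3)
plt.xlabel("Steps")
plt.ylabel("Average reward")
plt.show()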
Can we do better than an epsilon of 0.1? Let's try several different values for epsilon and see how they perform. Trying different settings of an algorithm's key parameters is a good way to understand how the agent might perform under different conditions.
Below we run an experiment where we sweep over different values for epsilon:
[15]: # ---------------
# Discussion Cell
# ---------------
n_q_values = []
n_averages = []
n_best_actions = []

num_runs = 200

rl_glue = RLGlue(env, agent)
rl_glue.rl_init(agent_info, env_info)
rl_glue.rl_start()

best_arm = np.argmax(rl_glue.environment.arms)

scores = [0]
averages = []
best_action_chosen = []

for i in range(num_steps):
    reward, state, action, is_terminal = rl_glue.rl_step()
    scores.append(scores[-1] + reward)
    averages.append(scores[-1] / (i + 1))
    if action == best_arm:
        best_action_chosen.append(1)
    else:
        best_action_chosen.append(0)
    if epsilon == 0.1 and run == 0:
        n_q_values.append(np.copy(rl_glue.agent.q_values))

if epsilon == 0.1:
    n_averages.append(averages)
    n_best_actions.append(best_action_chosen)
all_averages.append(averages)

plt.plot(np.mean(all_averages, axis=0))
Why did 0.1 perform better than 0.01?
If exploration helps, why did 0.4 perform worse than 0.0 (the greedy agent)?
Think about how you would answer these questions. They are questions in the practice quiz. If you still have questions about them, retake the practice quiz.
In Section 1 of this assignment, we decayed the step size over time based on action-selection counts.
The step size was 1/N(A), where N(A) is the number of times action A was selected. This is the
same as computing a sample average. We could also set the step size to be a constant value, such
as 0.1. What would be the effect of doing that? And is it better to use a constant or the sample
average method?
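As a brief reminder of why the two choices behave differently (a sketch of the algebra, following the textbook's incremental-update notation), the sample-average method updates the estimate as

$$Q_{n+1} = Q_n + \frac{1}{n}\,(R_n - Q_n),$$

which weights all past rewards equally, whereas a constant step size $\alpha$ gives

$$Q_{n+1} = Q_n + \alpha\,(R_n - Q_n) = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha\,(1-\alpha)^{n-i} R_i,$$

an exponentially recency-weighted average that never stops responding to new rewards.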
To investigate this question, let’s start by creating a new agent that has a constant step size. This
will be nearly identical to the agent created above. You will use the same code to select the epsilon-
greedy action. You will change the update to have a constant step size instead of using the 1/N(A)
update.
[16]: # -----------
# Graded Cell
# -----------
class EpsilonGreedyAgentConstantStepsize(main_agent.Agent):
    def agent_step(self, reward, observation):
        """
        Takes one step for the agent. It takes in a reward and observation and
        returns the action the agent chooses at that time step.

        Arguments:
        reward -- float, the reward the agent received from the environment
                  after taking the last action.
        observation -- float, the observed state the agent is in. Do not worry
                       about this as you will not use it until future lessons.

        Returns:
        current_action -- int, the action chosen by the agent at the current time step.
        """
        ### Useful Class Variables ###
        # self.q_values : An array with the agent's current estimate of each arm's value.
        # self.last_action : The action the agent took on the previous time step.
        # self.num_actions : The number of actions the agent can take.
        # self.step_size : A float which is the current step size for the agent.
        # self.epsilon : The probability an epsilon greedy agent will explore (ranges between 0 and 1).
        #######################

        # Update Q values using the constant step size instead of 1/N(A)
        self.q_values[self.last_action] = self.q_values[self.last_action] + self.step_size * (reward - self.q_values[self.last_action])

        # Choose the action with epsilon greedy, exactly as in the agent above
        if np.random.random() < self.epsilon:
            current_action = np.random.choice(self.num_actions)
        else:
            current_action = argmax(self.q_values)

        self.last_action = current_action
        return current_action
[17]: # --------------
# Debugging Cell
# --------------
# Feel free to make any changes to this cell to debug your code

# build a fake agent for testing and set some initial conditions
for step_size in [0.01, 0.1, 0.5, 1.0]:
    e_greedy_agent = EpsilonGreedyAgentConstantStepsize()
    e_greedy_agent.q_values = [0, 0, 1.0, 0, 0]
    e_greedy_agent.num_actions = 5
    e_greedy_agent.last_action = 1
    e_greedy_agent.epsilon = 0.0
    e_greedy_agent.step_size = step_size
    action = e_greedy_agent.agent_step(1, 0)
    assert e_greedy_agent.q_values == [0, step_size, 1.0, 0, 0], "Check that you are updating q_values correctly using the stepsize."
[18]: # -----------
# Tested Cell
# -----------
# The contents of the cell will be tested by the autograder.
# If they do not pass here, they will not pass there.
np.random.seed(0)
# Check Epsilon Greedy with Different Constant Stepsizes
for step_size in [0.01, 0.1, 0.5, 1.0]:
    e_greedy_agent = EpsilonGreedyAgentConstantStepsize()
    e_greedy_agent.q_values = [0, 0, 1.0, 0, 0]
    e_greedy_agent.num_actions = 5
    e_greedy_agent.last_action = 1
    e_greedy_agent.epsilon = 0.0
    e_greedy_agent.step_size = step_size
    action = e_greedy_agent.agent_step(1, 0)
[19]: # ---------------
# Discussion Cell
# ---------------
epsilon = 0.1
num_steps = 1000
num_runs = 200
agent = EpsilonGreedyAgentConstantStepsize if step_size != '1/N(A)' else EpsilonGreedyAgent
agent_info = {"num_actions": 10, "epsilon": epsilon, "step_size": step_size, "initial_value": 0.0}
env_info = {}

best_arm = np.argmax(rl_glue.environment.arms)
if run == 0:
    true_values[step_size] = np.copy(rl_glue.environment.arms)

best_action_chosen = []
for i in range(num_steps):
    reward, state, action, is_terminal = rl_glue.rl_step()
    if action == best_arm:
        best_action_chosen.append(1)
    else:
        best_action_chosen.append(0)
    if run == 0:
        q_values[step_size].append(np.copy(rl_glue.agent.q_values))
best_actions[step_size].append(best_action_chosen)

ax.plot(np.mean(best_actions[step_size], axis=0))

plt.legend(step_sizes)
plt.title("% Best Arm Pulled")
plt.xlabel("Steps")
plt.ylabel("% Best Arm Pulled")
vals = ax.get_yticks()
ax.set_yticklabels(['{:,.2%}'.format(x) for x in vals])
plt.show()
Notice first that we are now plotting the percentage of time steps on which the best action is taken, rather than the average reward. To better understand the performance of an agent, it can be useful to measure specific behaviors beyond just how much reward is accumulated. This measure indicates how close the agent's behaviour is to optimal.
It seems as though 1/N(A) performed better than the others, in that it reaches a solution where
it takes the best action most frequently. Now why might this be? Why did a step size of 0.5 start
out better but end up performing worse? Why did a step size of 0.01 perform so poorly?
Let’s dig into this further below. Let’s plot how well each agent tracks the true value, where each
agent has a different step size method. You do not have to enter any code here, just follow along.
[20]: # ---------------
# Discussion Cell
# ---------------
largest = 0
num_steps = 1000

for step_size in step_sizes:
    plt.figure(figsize=(15, 5), dpi=80, facecolor='w', edgecolor='k')
    largest = np.argmax(true_values[step_size])
    plt.plot([true_values[step_size][largest] for _ in range(num_steps)], linestyle="--")
These plots help clarify the performance differences between the different step sizes. A step size of 0.01 makes such small updates that the agent's estimate of the best action's value never gets close to the actual value. Step sizes of 0.5 and 1.0 both get close to the true value quickly, but are very susceptible to the stochasticity in the rewards. The updates overcorrect toward recent rewards and so oscillate around the true value, which means that on many steps the best arm can appear worse than it actually is. A step size of 0.1 moves toward the true value fairly quickly and does not oscillate as widely around it as 0.5 and 1.0 do. This is one of the reasons that 0.1 performs quite well. Finally, we can see why 1/N(A) performed well. Early on, while the step size is still reasonably large, the estimate moves quickly toward the true expected value; as the arm is pulled more, the step size shrinks, which makes the estimate less susceptible to the stochasticity of the rewards.
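To see this effect in isolation, here is a small illustrative sketch (not part of the assignment, and relying on the numpy import from the top of the notebook) that estimates the value of a single arm whose true expected reward is 1.0, comparing the 1/N(A) sample-average step size with several constant step sizes:

# illustrative sketch: track a single arm with true value 1.0 under
# different step sizes (not part of the graded assignment)
true_value = 1.0
num_pulls = 500
arm_rewards = true_value + np.random.randn(num_pulls)  # unit-variance Gaussian rewards

step_size_names = ["1/N(A)", "0.01", "0.1", "0.5"]
q = {name: 0.0 for name in step_size_names}

for n, r in enumerate(arm_rewards, start=1):
    for name in step_size_names:
        step = 1.0 / n if name == "1/N(A)" else float(name)
        q[name] += step * (r - q[name])

# typical outcome: 0.01 barely moves, 0.5 ends up wherever the last few rewards
# pushed it, while 0.1 and 1/N(A) end up close to the true value of 1.0
for name in step_size_names:
    print(name, round(q[name], 3))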
Does this mean that 1/N(A) is always the best? When might it not be? One possible setting where
it might not be as effective is in non-stationary problems. You learned about non-stationarity in the
lessons. Non-stationarity means that the environment may change over time, either gradually or as a sudden shift.
Let's look at how a sudden change in the reward distributions affects a step size like 1/N(A). This time we will run the environment for 2000 steps, and after 1000 steps we will randomly change the expected values of all of the arms. We compare two agents, both using epsilon-greedy with epsilon = 0.1. One uses a constant step size of 0.1, the other a step size of 1/N(A) that reduces over time.
[21]: # ---------------
# Discussion Cell
# ---------------
epsilon = 0.1
num_steps = 2000
num_runs = 500
step_size = 0.1

np.random.seed(run)

for i in range(num_steps):
    reward, state, action, is_terminal = rl_glue.rl_step()
    rewards[run, i] = reward
    if i == 1000:
        rl_glue.environment.arms = np.random.randn(10)

plt.plot(np.mean(rewards, axis=0))
plt.legend(["Best Possible", "1/N(A)", "0.1"])
plt.xlabel("Steps")
plt.ylabel("Average reward")
plt.show()
Now the agent with a step size of 1/N(A) performed better at the start, but then performed worse when the environment changed! What happened?
Think about what the step size would be after 1000 steps. Let's say the best action gets chosen 500 times; the step size for that action is then 1/500, or 0.002. At each update the estimate moves only 0.002 times the error, which is a very tiny adjustment, so it takes a long time for the estimate to reach the new true value.
The agent with step size 0.1, however, always moves its estimate one tenth of the way toward each new reward. Old rewards are forgotten at a geometric rate, so within a few tens of steps its estimate largely reflects the new reward distribution.
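To make that concrete (a small side calculation, not from the notebook): with a constant step size $\alpha$, the weight remaining on the old estimate after $k$ further updates is $(1-\alpha)^k$. With $\alpha = 0.1$, $0.9^{22} \approx 0.1$, so after roughly 22 steps only about 10% of the stale estimate remains, whereas with $\alpha = 0.002$ the same decay takes on the order of a thousand steps ($0.998^{1150} \approx 0.1$).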
These are the types of tradeoffs we have to think about in reinforcement learning. A larger step size moves us more quickly toward the true value, but can make our estimates oscillate around the expected value. A step size that reduces over time can converge close to the expected value without oscillating, but such a decaying step size is not able to adapt to changes in the environment. Nonstationarity, and the related issue of partial observability, is a common feature of reinforcement learning problems, especially when learning online.
Great work! You have:
- Implemented your first agent
- Learned about the effect of epsilon, an exploration parameter, on the performance of an agent
- Learned about the effect of step size on the performance of the agent
- Learned about the good experimental practice of averaging across multiple runs