UNIT IV-1
Introduction to Deep Reinforcement Learning – The multi-armed bandit – Contextual bandits – RL with
the OpenAI Gym – A Q-Learning model – Markov decision process and the Bellman equation – Q-
learning – Q-learning and exploration – First DRL with Deep Q-learning – RL experiments – Keras RL
The building blocks of Deep Reinforcement Learning (DRL) are the components that power learning and
enable agents to make sound decisions in their environments. An effective learning framework is produced
by the interaction of these elements. The following are the essential elements:
Agent: The decision-maker or learner who engages with the environment. The agent acts in accordance
with its policy and gains experience over time to improve its ability to make decisions.
Environment: The system outside of the agent that it communicates with. Based on the actions the agent
does, it gives the agent feedback in the form of incentives or punishments.
State: A depiction of the current circumstance or environmental state at a certain moment. The agent
chooses its activities and makes decisions based on the state.
Action: A choice the agent makes that causes a change in the state of the system. The policy of the agent
guides the selection of actions.
Reward: A scalar feedback signal from the environment that shows whether an agent’s behaviour in a
specific state is desirable. The agent is guided by rewards to learn positive behaviour.
Policy: A plan that directs the agent’s decision-making by mapping states to actions. Finding an ideal
policy that maximises cumulative rewards is the objective.
Value Function: This function calculates the anticipated cumulative reward an agent can obtain from a
specific state while adhering to a specific policy. It is beneficial in assessing and contrasting states and
policies.
Model: A depiction of the dynamics of the environment that enables the agent to simulate potential
results of actions and states. Models are useful for planning and forecasting.
Learning Algorithm: The process by which the agent modifies its value function or policy in response to
experiences gained from interacting with the environment. Learning in DRL is fueled by a variety of
algorithms, including Q-learning, policy gradient, and actor-critic.
Deep Neural Networks: Deep neural networks act as function approximators, allowing DRL to handle
high-dimensional state and action spaces. They learn intricate input-to-output mappings.
Experience Replay: A method that randomly samples stored prior experiences (state, action, reward,
and next state) during training. As a result, learning stability is improved and the correlation between
consecutive experiences is reduced.
These core components collectively form the foundation of Deep Reinforcement Learning, empowering
agents to learn strategies, make intelligent decisions, and adapt to dynamic environments.
In Deep Reinforcement Learning (DRL), an agent interacts with an environment to learn how to make
optimal decisions. The steps, summarized in the interaction-loop sketch after this list, are:
Interaction: The agent interacts with its surroundings through acting, which results in states and rewards.
Learning: The agent keeps track of its experiences and updates its method for making decisions.
Exploration-Exploitation: The agent strikes a balance between using well-known actions and trying out
new ones.
Reward Maximization: The agent learns to select actions that yield the greatest possible total reward.
Convergence: The agent’s policy improves and eventually stabilizes over time.
Extrapolation: A trained agent can apply what it has learned to new situations.
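These steps can be summarized as a single interaction loop. The sketch below is a generic outline in Python; the env and agent objects and their method names are placeholders rather than any specific library’s API:
def run_episode(env, agent):
    state = env.reset()                                    # initial state
    done, total_reward = False, 0.0
    while not done:
        action = agent.act(state)                          # explore or exploit via the policy
        next_state, reward, done = env.step(action)        # environment feedback
        agent.learn(state, action, reward, next_state, done)  # update values/policy from experience
        state = next_state
        total_reward += reward                             # reward-maximization objective
    return total_reward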
The multi-armed bandit problem is a classic reinforcement learning example where we are given a slot
machine with n arms (bandits) with each arm having its own rigged probability distribution of success.
Pulling any one of the arms gives you a stochastic reward of either R=+1 for success, or R=0 for failure.
Our objective is to pull the arms one-by-one in sequence such that we maximize our total reward
collected in the long run.
The non-triviality of the multi-armed bandit problem lies in the fact that we (the agent) cannot access the
true bandit probability distributions — all learning is carried out via trial and error and value estimation.
Devising a strategy that maximizes the long-run reward is our goal for the multi-armed bandit problem, and
having such a strategy would prove very useful in many real-world situations where one would like to select
the “best” bandit out of a group of bandits.
For an agent to decide which action yields the maximum reward, we must define the value of taking each
action. We use the concept of probability to define these values using the action-value function.
The value of selecting an action is defined as the expected reward received when taking that action from a
set of all possible actions. Since the value of selecting an action is not known to the agent, we use the
‘sample-average’ method to estimate the value of taking an action.
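Written out, the sample-average estimate of an action's value after it has been selected n−1 times is the mean of the rewards received so far, and it can be maintained incrementally (standard notation, where R_i is the reward received after the i-th selection of action a):
$$
Q_n(a) = \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1},
\qquad
Q_{n+1}(a) = Q_n(a) + \frac{1}{n}\bigl(R_n - Q_n(a)\bigr)
$$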
Exploration vs Exploitation
Exploration lets the agent try actions whose values are uncertain in order to improve its action-value
estimates. Exploitation, on the other hand, chooses the greedy action to get the most reward by exploiting
the agent’s current action-value estimates. But being greedy with respect to action-value estimates may not
actually yield the most reward and can lead to sub-optimal behaviour.
When an agent explores, it gets more accurate estimates of action-values. And when it exploits, it might
get more reward. It cannot, however, choose to do both simultaneously, which is also called the
exploration-exploitation dilemma.
Epsilon-Greedy Action Selection
Epsilon-Greedy is a simple method to balance exploration and exploitation by choosing between
exploration and exploitation randomly.
In epsilon-greedy, epsilon refers to the probability of choosing to explore; the agent exploits most of the
time, with a small chance of exploring.
Code: a minimal runnable Python version of epsilon-greedy for a three-armed bandit (the Bandit class with choose(), update(), and mean is reconstructed here from context):
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt

class Bandit:
    def __init__(self, m):
        self.m, self.mean, self.N = m, 0.0, 0     # true mean, running estimate, pull count
    def choose(self):
        return np.random.randn() + self.m         # stochastic reward for this arm
    def update(self, x):
        self.N += 1
        self.mean += (x - self.mean) / self.N     # incremental sample-average update

def run_experiment(means, eps, N):
    actions = [Bandit(m) for m in means]
    data = np.empty(N)
    for i in range(N):
        # epsilon greedy
        p = np.random.random()
        if p < eps:
            j = np.random.choice(len(actions))
        else:
            j = np.argmax([a.mean for a in actions])
        x = actions[j].choose()
        actions[j].update(x)
        data[i] = x
    for a in actions:
        print(a.mean)
    return np.cumsum(data) / (np.arange(N) + 1)   # cumulative average reward

if __name__ == '__main__':
    cumulative_average = run_experiment([1.0, 2.0, 3.0], eps=0.1, N=10000)
    plt.plot(cumulative_average); plt.xscale('log'); plt.show()
Contextual Bandits
If you are just getting started with contextual bandits, it can be confusing to understand how contextual
bandits are related to other more widely known methods such as A/B testing, and why you might want to
use contextual bandits instead of those other methods. Therefore, we start our journey by discussing the
similarities and differences between contextual bandits and related methods.
While a multi-armed bandit (MAB) simply looks at whether treatment or control is doing better overall, a
contextual bandit (CB) focuses on whether treatment or control is doing better for a user with a given set of
characteristics. The “context” in contextual bandits refers precisely to these user characteristics and is what
differentiates CB from MAB.
For example, CB might decide to increase treatment allocation to 60% for core users but decrease
treatment allocation to 40% for casual users after observing first day’s data. In other words, CB will
dynamically update traffic allocation taking user characteristics (core vs casual in this example) into
account.
At this point, you might be tempted to think that CB is nothing more than a set of multiple MABs running
together. In fact, when the context we are interested in is a small one (e.g., we are only interested in
whether a user is a core user or a casual user), we can simply run one MAB for core users and another
MAB for casual users. However, as the context gets large (core vs casual, age, country, time since last
active, etc.) it becomes impractical to run a separate MAB for each unique context value.
The real value of CB emerges in this case through the use of models to describe the relationship of the
experimental conditions in different contexts to our outcome of interest (e.g., conversion). As opposed to
enumerating through each context value and treating them independently, the use of models allows us to
share information across different contexts and makes it possible to handle large context spaces. This idea
of a model comes up again at several points below.
At a high level, a contextual bandit algorithm repeats the following serving loop:
1. A new data point arrives with context X (e.g., a core user with an iOS device in the US).
2. Given this data point and the exploration strategy chosen (e.g., ε-greedy), the algorithm decides on a
condition to serve this user (e.g., treatment or control).
3. After the condition is served, we observe the outcome y (e.g., whether the user made a purchase or not).
4. Update (or fully retrain) the model used in Step 2 after seeing the new data. (As mentioned previously, we
usually make an update not after every sample but after seeing a batch of samples to ensure that updates
are less noisy.)
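A minimal sketch of this serving loop is given below, using an ε-greedy strategy with one logistic-regression model per condition. The arm names, the feature encoding of the context, and the batch size of 50 are illustrative assumptions, not the API of any particular contextual-bandit library:
# Sketch of an epsilon-greedy contextual bandit
import numpy as np
from sklearn.linear_model import LogisticRegression

arms = ["control", "treatment"]
eps = 0.1
models = {a: LogisticRegression() for a in arms}
batch = {a: ([], []) for a in arms}            # (contexts X, outcomes y) stored per arm
fitted = {a: False for a in arms}

def choose_arm(x):
    # Step 2: pick a condition for context x (explore with probability eps)
    if np.random.random() < eps or not all(fitted.values()):
        return np.random.choice(arms)
    scores = {a: models[a].predict_proba([x])[0, 1] for a in arms}
    return max(scores, key=scores.get)

def observe(x, arm, y):
    # Steps 3-4: record outcome y and retrain the arm's model on a batch of samples
    X_a, y_a = batch[arm]
    X_a.append(x); y_a.append(y)
    if len(y_a) >= 50 and len(set(y_a)) > 1:   # update after a batch, not after every sample
        models[arm].fit(X_a, y_a)
        fitted[arm] = True

# Example: a hypothetical encoding of "core user, iOS, US" as a feature vector
x = [1, 1, 0]
arm = choose_arm(x)                             # serve a condition (Step 2)
observe(x, arm, np.random.binomial(1, 0.1))     # record the conversion outcome (Steps 3-4)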
OpenAI Gym was created to address the lack of standardization across papers, and to provide better
benchmarks by offering a large number of environments that are easy to set up. The aim of this tool is to
increase reproducibility in the field of AI and to provide tools with which everyone can learn the basics
of AI.
Now, viewing the environment in RL as a functional component: it takes the action at a given state as
input and returns a new state and the reward value associated with that state-action pair.
import gym
env = gym.make('MountainCarContinuous-v0') # try for different environments
observation = env.reset() # This command will reset the environment
for t in range(100):
    env.render()
    print(observation)
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action) # step() returns observation, reward, done, info
    print(observation, reward, done, info)
    if done:
        print("Finished after {} timesteps".format(t+1))
        break
[Output For Mountain Car Cont Env:]
[-0.56252328 0.00184034]
[-0.56081509 0.00170819] -0.00796802138459 False {}
[Output For CartPole Env:]
[ 0.1895078 0.55386028 -0.19064739 -1.03988221]
[ 0.20058501 0.36171167 -0.21144503 -0.81259279] 1.0 True {}
Finished after 52 timesteps
What is action_space in the above code? action_space and observation_space describe the valid formats
of the action and state parameters for the particular environment being worked with. Just take a look at the
values returned.
import gym
env = gym.make('CartPole-v0')
print(env.action_space) #[Output: ] Discrete(2)
print(env.observation_space) # [Output: ] Box(4,)
env = gym.make('MountainCarContinuous-v0')
print(env.action_space) #[Output: ] Box(1,)
print(env.observation_space) #[Output: ] Box(2,)
Discrete means a fixed set of non-negative values; here 0 and 1 correspond to left and right movement for
CartPole balancing. Box represents an n-dimensional array. These standard interfaces help in writing general
code that works across different environments, since we can simply check the bounds with
env.observation_space.high/[low] and use them in a general algorithm.
Q-learning is a machine learning algorithm that helps an agent learn to act in an environment by trial and
error and to improve its behaviour over time. It is a type of reinforcement learning, which trains a model in a
way that mimics how animals and children learn: by rewarding good actions and penalizing bad ones.
Q(S,A) is an estimate of how good it is to take the action A at the state S. This estimate of Q(S,A) will be
iteratively computed using the TD-update rule, which we will see in the upcoming sections.
Rewards: An agent throughout its lifetime starts from a start state, and makes several transitions from its
current state to a next state based on its choice of action and also the environment the agent is interacting
in. At every step of transition, the agent from a state takes an action, observes a reward from the
environment, and then transits to another state.
Episodes: Instances where an agent concludes its actions, marking the end of an episode.
Temporal Difference: Calculated by comparing the current state and action values with the previous
ones.
Bellman’s Equation: A recursive formula invented by Richard Bellman in 1957, used to calculate the
value of a given state and determine its optimal position. It provides a recursive formula for calculating
the value of a given state in a Markov Decision Process (MDP) and is particularly influential in the
context of Q-learning and optimal decision-making.
The equation is expressed as:
Q(s,a) = R(s,a) + γ max_a′ Q(s′,a′)
Where,
● Q(s,a) is the Q-value for a given state-action pair
● max_a′ Q(s′,a′) is the maximum Q-value over all possible actions a′ in the next state s′.
Bellman’s equation is crucial in reinforcement learning as it helps in evaluating the long-term expected
rewards associated with different actions in a given state. It forms the basis for Q-learning algorithms,
guiding agents to learn optimal policies through iterative updates based on observed experiences.
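For reference, the TD-update rule mentioned earlier moves the current estimate toward this Bellman target using a learning rate α; it is the same rule applied in the Q-learning code later in this unit:
$$
Q(s,a) \leftarrow Q(s,a) + \alpha \bigl[ R(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a) \bigr]
$$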
What is a Q-table?
The Q-table functions as a repository of rewards associated with optimal actions for each state in a given
environment. It serves as a guide for the agent, indicating which actions are likely to yield positive
outcomes in various scenarios.
Each row in the Q-table corresponds to a distinct situation the agent might face, while the columns
represent the available actions. Through interactions with the environment and the receipt of rewards or
penalties, the Q-table is dynamically updated to capture the model’s evolving understanding.
Reinforcement learning aims to enhance performance by refining the Q-table, enabling the agent to make
informed decisions. As the Q-table undergoes continuous updates with more feedback, it becomes a more
accurate resource, empowering the agent to make optimal choices and achieve superior results.
Crucially, the Q-table is closely tied to the Q-function, a mathematical expression that considers the
current state and action, generating outputs that include anticipated future rewards for that specific state-
action pair. By consulting the Q-table, the agent can retrieve expected future rewards, guiding it toward
optimized decision-making and states.
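As a minimal sketch, a Q-table can be stored as a 2-D array with one row per state and one column per action; the grid size here is an illustrative assumption:
import numpy as np

n_states, n_actions = 25, 4              # e.g. a 5x5 maze with four possible moves
Q = np.zeros((n_states, n_actions))      # all expected future rewards start at zero

state = 7
best_action = np.argmax(Q[state])        # action with the highest expected future reward in this state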
Let’s say that a robot has to cross a maze and reach the end point. There are mines, and the robot can
only move one tile at a time. If the robot steps onto a mine, the robot is dead. The robot has to reach the
end point in the shortest time possible.
The scoring/reward system is as below:
1. The robot loses 1 point at each step. This is done so that the robot takes the shortest path and reaches the
goal as fast as possible.
2. If the robot steps on a mine, the point loss is 100 and the game ends.
3. If the robot gets power ⚡️, it gains 1 point.
4. If the robot reaches the end goal, the robot gets 100 points.
Now, the obvious question is: How do we train a robot to reach the end goal with the shortest path
without stepping on a mine?
Introducing the Q-Table
Q-Table is just a fancy name for a simple lookup table where we calculate the maximum expected future
rewards for action at each state. Basically, this table will guide us to the best action at each state.
In the Q-Table, the columns are the actions and the rows are the states.
Each Q-table score will be the maximum expected future reward that the robot will get if it takes that
action at that state. This is an iterative process, as we need to improve the Q-Table at each iteration.
Using the Bellman equation above, we get the values of Q for the cells in the table.
When we start, all the values in the Q-table are zeros.
There is an iterative process of updating the values. As we start to explore the environment, the Q-
function gives us better and better approximations by continuously updating the Q-values in the table.
Now, let’s understand how the updating takes place.
In the case of the robot game, to reiterate the scoring/reward structure is:
● power = +1
● mine = -100
● end = +100
Implementation of Q-Learning
Step 1: Define the Environment
Set up the environment parameters, including the number of states and actions, and initialize the Q-table.
In this grid world, each state represents a position, and actions move the agent within this environment.
import numpy as np
# Define the environment and initialize the Q-table
n_states, n_actions = 16, 4              # grid-world positions and moves (assumed values for illustration)
Q = np.zeros((n_states, n_actions))
# Define parameters
learning_rate = 0.8
discount_factor = 0.95
exploration_prob = 0.2
epochs = 1000
# Q-learning algorithm
for epoch in range(epochs):
    current_state = np.random.randint(0, n_states)  # Start from a random state
    # Epsilon-greedy action selection
    if np.random.rand() < exploration_prob:
        action = np.random.randint(n_actions)
    else:
        action = np.argmax(Q[current_state])
    # ... the transition, reward, and Q-update follow (see the FrozenLake example below)
A complete Q-learning example on the OpenAI Gym FrozenLake environment:
import gym
import numpy as np
# 1. Load Environment and Q-table structure
env = gym.make('FrozenLake8x8-v0')
Q = np.zeros([env.observation_space.n, env.action_space.n])
# env.observation_space.n and env.action_space.n give the number of states and actions in the loaded env
# 2. Parameters of Q-learning
eta = .628     # learning rate
gma = .9       # discount factor
epis = 5000    # number of episodes
rev_list = []  # rewards per episode
# 3. Q-learning Algorithm
for i in range(epis):
    # Reset environment
    s = env.reset()
    rAll = 0
    d = False
    j = 0
    # The Q-Table learning algorithm
    while j < 99:
        env.render()
        j += 1
        # Choose action from Q table (with noise for exploration)
        a = np.argmax(Q[s, :] + np.random.randn(1, env.action_space.n) * (1. / (i + 1)))
        # Get new state & reward from environment
        s1, r, d, _ = env.step(a)
        # Update Q-Table with new knowledge
        Q[s, a] = Q[s, a] + eta * (r + gma * np.max(Q[s1, :]) - Q[s, a])
        rAll += r
        s = s1
        if d == True:
            break
    rev_list.append(rAll)
    env.render()
print("Reward Sum on all episodes " + str(sum(rev_list) / epis))
print("Final Values Q-Table")
print(Q)
The Bellman Equation is central to Markov Decision Processes. It outlines a framework for determining
the optimal expected reward at a state s by answering the question: “What is the maximum reward an
agent can receive if they make the optimal action now and for all future decisions?”
The Bellman equation breaks down a dynamic optimization problem into a sequence of simpler
subproblems. It does this by defining the value of the current state as the maximum possible value of the
current state reward plus the value of the next state.
Intuitively, it's a way of framing RL tasks so that we can solve them in a "principled" manner. We will go
into the specifics throughout this section.
5 Components of MDPs
1. S: set of states
2. A: set of actions
3. R: reward function
4. P: transition probability function
5. γ: discount for future rewards
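Written in terms of these five components, the Bellman optimality equation for the value of a state is:
$$
V(s) = \max_{a \in A} \Bigl[ R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V(s') \Bigr]
$$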
Imagine that a robot sitting on a chair stood up and put its right foot forward. So, at the moment, it’s
standing with its right foot forward. This is its current state.
Now, according to the Markov property, the current state of the robot depends only on its immediate
previous state (the previous timestep), i.e. the state it was in when it stood up. Evidently, it does not depend
on its earlier state of sitting on the chair. Similarly, its next state depends only on its current state.
A Markov process is defined by (S, P) where S are the states, and P is the state-transition probability. It
consists of a sequence of random states S₁, S₂, … where all the states obey the Markov property.
The state transition probability or P_ss’ is the probability of jumping to a state s’ from the current state s.
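In symbols:
$$
P_{ss'} = \mathbb{P}\bigl[\, S_{t+1} = s' \mid S_t = s \,\bigr]
$$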
Deep Q-Learning Intuition
In terms of the neural network, we feed in the state and pass it through several hidden layers, which then
output a Q-value for each action we can take. As the state and action spaces get more complex, we use deep
learning as a function approximator.
Let's look at how the equation changes with deep Q-learning. Recall the equation for temporal difference.
In the maze example, the neural network will predict 4 values: up, right, left, or down. We then take these
4 values and compare them to the values that were previously predicted, which are stored in memory. So
we are comparing Q1 vs Q-Target1, Q2 vs Q-Target2, and so on. Recall that neural networks work by
updating their weights, so we need to adapt our temporal difference equation to leverage this. What we do
is calculate a loss by taking the sum of the squared differences of the Q-values and their targets:
L = Σ (Q_target − Q)²
We then take this loss and use backpropagation with stochastic gradient descent to pass it back through the
network and update the weights. This is the learning part; now let's move on to how the agent selects the
best action to take. To choose which action is the best, we take the Q-values that we have and pass them
through a softmax function. This process happens every time the agent is in a new state.
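A small sketch of softmax action selection over the predicted Q-values; the temperature parameter tau is an assumption added for illustration:
import numpy as np

def softmax_action(q_values, tau=1.0):
    # Turn Q-values into selection probabilities and sample an action
    z = np.array(q_values) / tau
    z = z - np.max(z)                          # subtract the max for numerical stability
    probs = np.exp(z) / np.sum(np.exp(z))
    return np.random.choice(len(q_values), p=probs)

action = softmax_action([0.2, 1.5, 0.3, 0.1])  # e.g. Q-values for up, right, left, down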
Another important aspect of Deep Q-Learning that is very effective, and broadly (but not always!) used,
is experience replay memory. Essentially, the idea behind this concept is that instead of training on each
state-action pair right after we go through it, we instead store the ‘experiences’ (which contain the state, the
action taken, the reward received, and the next state) in one long list. This list is called the replay memory,
because it has a ‘memory’ of all the experiences the agent had. We then ‘replay’ these experiences during
training by sampling random batches from the replay memory, rather than learning only from the most
recent transition.
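A minimal sketch of a replay memory with random minibatch sampling is shown below; the buffer size and batch size are illustrative, and the DQN agent later in this unit uses the same idea:
import random
from collections import deque

replay_memory = deque(maxlen=2000)    # keeps only the most recent experiences

def remember(state, action, reward, next_state, done):
    replay_memory.append((state, action, reward, next_state, done))

def sample_minibatch(batch_size=32):
    # Random sampling breaks the correlation between consecutive timesteps
    return random.sample(replay_memory, batch_size)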
Our code for defining a DQN agent that learns how to act in an environment—in this particular case, it
happens to be the Cart-Pole game from the OpenAI Gym library of environments—is provided within our
Cartpole DQN Jupyter notebook. [Note: Our DQN agent is based directly on Keon Kim’s, which is
available at his GitHub repository at bit.ly/keonDQN.] Its dependencies are as follows:
import random
import gym
import numpy as np
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
import os
Example 1 – Cart-Pole DQN hyperparameters
env = gym.make("CartPole-v0")
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
batch_size = 32
n_episodes = 1000
output_dir = "model_output/cartpole/"
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
We use the OpenAI Gym make() method to specify the particular environment that we’d like our agent
to interact with. The environment we choose is version zero (v0) of the Cart-Pole game, and we assign it
to the variable env. On your own time, you’re welcome to select an alternative OpenAI Gym
environment.
state_size: the number of types of state information, which for the Cart-Pole game is 4 (recall that these
are cart position, cart velocity, pole angle, and pole angular velocity).
action_size: the number of possible actions, which for Cart-Pole is 2 (left and right).
We set our mini-batch size for training our neural net to 32.
We set the number of episodes (rounds of the game) to 1000. As you’ll soon see, this is about the right
number of episodes it will take for our agent to excel regularly at the Cart-Pole game. For more-complex
environments, you’d likely need to increase this hyperparameter so that the agent has more rounds of
gameplay to learn in.
We define a unique directory name ('model_output/cartpole/') into which we’ll output our neural
network’s parameters at regular intervals. If the directory doesn’t yet exist, we use os.makedirs() to make
it.
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95
        self.epsilon = 1.0
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        self.learning_rate = 0.001
        self.model = self._build_model()

    def _build_model(self):
        model = Sequential()
        model.add(Dense(32, activation="relu",
                        input_dim=self.state_size))
        model.add(Dense(32, activation="relu"))
        model.add(Dense(self.action_size, activation="linear"))
        model.compile(loss="mse",
                      optimizer=Adam(lr=self.learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def train(self, batch_size):
        # Sample a random minibatch of memories and replay them
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward  # for a final timestep, the target is just the reward
            if not done:
                target = (reward +
                          self.gamma *
                          np.amax(self.model.predict(next_state)[0]))
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def act(self, state):
        # Explore with probability epsilon, otherwise exploit the model
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])

    def save(self, name):
        self.model.save_weights(name)

    def load(self, name):
        self.model.load_weights(name)
Initialization parameters
state_size and action_size are environment-specific, but in the case of the Cart-Pole game are 4 and 2,
respectively, as mentioned earlier.
memory is for storing memories that can subsequently be replayed in order to train our DQN’s neural net.
The memories are stored as elements of a data structure called a deque (pronounced “deck”), which is the
same as a list except that—because we specified maxlen=2000—it only retains the 2,000 most recent
memories. That is, whenever we attempt to append a 2,001st element onto the deque, its first element is
removed, always leaving us with a list that contains no more than 2,000 elements.
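A quick illustration of this eviction behaviour, using a small maxlen so the effect is visible:
from collections import deque

d = deque(maxlen=3)
for i in range(5):
    d.append(i)
print(d)   # deque([2, 3, 4], maxlen=3) -- the oldest elements were dropped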
gamma is the discount factor (a.k.a. decay rate) γ that we introduced earlier in this chapter (see Figure
13.4). This agent hyperparameter discounts prospective rewards in future timesteps. Effective γ values
typically approach 1 (for example, 0.9, 0.95, 0.98, and 0.99). The closer to 1, the less we’re discounting
future reward. [Note: Indeed, if you were to set γ = 1 (which we don’t recommend) you wouldn’t be
discounting future reward at all.] Tuning the hyperparameters of reinforcement learning models such as γ
can be a fiddly process; near the end of this chapter, we discuss a tool called SLM Lab for carrying it out
effectively.
epsilon—symbolized by the Greek letter ε—is another reinforcement learning hyperparameter called
exploration rate. It represents the proportion of our agent’s actions that are random (enabling it to explore
the impact of such actions on the next state st+1 and the reward r returned by the environment) relative to
how often we allow its actions to exploit the existing “knowledge” its neural net has accumulated through
gameplay. Prior to having played any episodes, agents have no gameplay experience to exploit, so it is the
most common practice to start it off exploring 100 percent of the time; this is why we set epsilon = 1.0.
As the agent gains gameplay experience, we very slowly decay its exploration rate so that it can gradually
exploit the information it has learned (hopefully enabling it to attain more reward, as illustrated in Figure
13.7). That is, at the end of each episode the agent plays, we multiply its ε by epsilon_decay. Common
options for this hyperparameter are 0.990, 0.995, and 0.999. [Note: Analogous to setting γ = 1, setting
epsilon_decay = 1 would mean ε would not be decayed at all—that is, exploring at a continuous rate. This
would be an unusual choice for this hyperparameter.]
epsilon_min is a floor (a minimum) on how low the exploration rate ε can decay to. This hyperparameter
is typically set to a near-zero value such as 0.001, 0.01, or 0.02. We set it equal to 0.01, meaning that after
ε has decayed to 0.01 (as it will in our case by the 911th episode), our agent will explore on only 1
percent of the actions it takes—exploiting its gameplay experience the other 99 percent of the time. [Note:
If at this stage this exploration rate concept is somewhat unclear, it should become clearer as we examine
our agent’s episode-by-episode results later on.]
learning_rate is the same stochastic gradient descent hyperparameter that we covered in Chapter 8.
Finally, _build_model()—by the inclusion of its leading underscore—is being suggested as a private
method. This means that this method is recommended for use “internally” only—that is, solely by
instances of the class DQNAgent.
The first hidden layer is a dense layer of 32 ReLU neurons that takes the state as its input
(input_dim=self.state_size). The second hidden layer is also dense, with 32 ReLU neurons. As mentioned
earlier, we’ll explore hyperparameter selection—including how we home in on a particular model
architecture—by discussing the SLM Lab tool later on in this chapter.
The output layer has dimensionality corresponding to the number of possible actions. [Note: Any
previous models in this book with only two outcomes (as in Chapters 11 and 12) used a single sigmoid
neuron. Here, we specify separate neurons for each of the outcomes, because we would like our code to
generalize beyond the Cart-Pole game. While Cart-Pole has only two actions, many environments have
more than two.] In the case of the Cart-Pole game, this is an array of length 2, with one element for left
and the other for right. As with a regression model (see Example 9.8), with DQNs the z values are output
directly from the neural net instead of being converted into a probability between 0 and 1. To do this, we
specify the linear activation function instead of the sigmoid or softmax functions that have otherwise
dominated this book.
As indicated when we compiled our regression model (Example 9.9), mean squared error is an
appropriate choice of cost function when we use linear activation in the output layer, so we set the
compile() method’s loss argument to mse. We return to our routine optimizer choice, Adam.
Remembering gameplay
At any given timestep t—that is, during any given iteration of the reinforcement learning loop (refer back
to Figure 13.3)—the DQN agent’s remember() method is run in order to append a memory to the end of
its memory deque. Each memory in this deque consists of five pieces of information about timestep t:
The state st that the agent was in
The action at that the agent took
The reward rt that the environment returned to the agent
The next_state st+1 that the environment also returned to the agent
A Boolean flag done that is true if timestep t was the final iteration of the episode, and false otherwise
The DQN agent’s neural net model is trained by replaying memories of gameplay, as shown within the train() method of the DQNAgent class above.
For each of the 32 sampled memories, we carry out a round of model training as follows: If done is True
—that is, if the memory was of the final timestep of an episode— then we know definitively that the
highest possible reward that could be attained from this timestep is equal to the reward rt. Thus, we can
just set our target reward equal to reward.
Otherwise (i.e., if done is False) then we try to estimate what the target reward— the maximum
discounted future reward—might be. We perform this estimation by starting with the known reward rt
and adding to it the discounted [Note: That is, multiplied by gamma, the discount factor γ.] maximum
future Q-value. Possible future Q-values are estimated by passing the next (i.e., future) state st+1 into the
model’s predict() method. Doing this in the context of the Cart-Pole game returns two outputs: one output
for the action left and the other for the action right. Whichever of these two outputs is higher (as
determined by the NumPy amax function) is the maximum predicted future Q-value.
Whether target is known definitively (because the timestep was the final one in an episode) or it’s
estimated using the maximum future Q-value calculation, we continue onward within the train() method’s
for loop: We run the predict() method again, passing in the current state st. As before, in the context of
the Cart-Pole game this returns two outputs: one for the left action and one for the right. We store these
two outputs in the variable target_f.
Whichever action at the agent actually took in this memory, we use target_f[0][action] = target to replace
that target_f output with the target reward. [Note: We do this because we can only train the Q-value
estimate based on actions that were actually taken by the agent: We estimated target based on next_state
st+1 and we only know what st+1 was for the action at that was actually taken by the agent at timestep t.
We don’t know what next state st+1 the environment might have returned had the agent taken a different
action than it actually took.]
The model input is the current state st and its output is target_f, which incorporates our approximation of
the maximum future discounted reward. By tuning the model’s parameters (represented by θ in Equation
13.2), we thus improve its capacity to accurately predict the action that is more likely to be associated
with maximizing future reward in any given state.
In many reinforcement learning problems, epochs can be set to 1. Instead of recycling an existing training
dataset multiple times, we can cheaply engage in more episodes of the Cart-Pole game (for example) to
generate as many fresh training data as we fancy.
We set verbose=0 because we don’t need any model-fitting outputs at this stage to monitor the progress of
model training. As we demonstrate shortly, we’ll instead monitor agent performance on an episode-by-
episode basis.
To select a particular action at to take at a given timestep t, we use the agent’s act() method. Within this
method, the NumPy rand function is used to sample a random value between 0 and 1 that we’ll call v. In
conjunction with our agent’s epsilon, epsilon_decay, and epsilon_min hyperparameters, this v value will
determine for us whether the agent takes an exploratory action or an exploitative one:
If the random value v is less than or equal to the exploration rate ε, then a random exploratory action is
selected using the randrange function. In early episodes, when ε is high, most of the actions will be
exploratory. In later episodes, as ε decays further and further (according to the epsilon_decay
hyperparameter), the agent will take fewer and fewer exploratory actions.
Otherwise—that is, if the random value v is greater than ε—the agent selects an action that exploits the
“knowledge” the model has learned via memory replay. To exploit this knowledge, the state st is passed
in to the model’s predict() method, which returns an activation output for each of the possible actions the
agent could theoretically take. We use the NumPy argmax function to select the action at associated with
the largest activation output. [Note: Recall that the activation is linear, and thus the output is not a
probability; instead, it is the discounted future reward for that action.]
[Note: We introduced the exploratory and exploitative modes of action when discussing the initialization
parameters for our DQNAgent class earlier, and they’re illustrated playfully in Figure 13.7.]
Finally, the save() and load() methods are one-liners that enable us to save and load the parameters of the
model. Particularly with respect to complex environments, agent performance can be flaky: For long
stretches, the agent may perform very well in a given environment, and then later appear to lose its
capabilities entirely. Because of this flakiness, it’s wise to save our model parameters at regular intervals.
Then, if the agent’s performance drops off in later episodes, the higher-performing parameters from some
earlier episode can be loaded back up.
Having created our DQN agent class, we can initialize an instance of the class—which we name agent—
with this line of code:
agent = DQNAgent(state_size, action_size)
The code in Example 3 enables our agent to interact with an OpenAI Gym environment, which in our
particular case is the Cart-Pole game.
for e in range(n_episodes):
    state = env.reset()
    state = np.reshape(state, [1, state_size])
    done = False
    time = 0
    while not done:
        # env.render()
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        reward = reward if not done else -10
        next_state = np.reshape(next_state, [1, state_size])
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        if done:
            print("episode: {}/{}, score: {}, e: {:.2}"
                  .format(e, n_episodes, time, agent.epsilon))
        time += 1
    if len(agent.memory) > batch_size:
        agent.train(batch_size)
    if e % 50 == 0:
        agent.save(output_dir + "weights_"
                   + "{:04d}".format(e) + ".hdf5")
Recalling that we had set the hyperparameter n_episodes to 1000, Example 3 consists of a big for loop
that allows our agent to engage in these 1,000 rounds of gameplay. Each episode of gameplay is counted
by the variable e and involves:
We use env.reset() to begin the episode with a random state st. For the purposes of passing state into our
Keras neural network in the orientation the model is expecting, we use reshape to convert it from a
column into a row. [Note: The env.render() line is commented out because if you are running this code
via a Jupyter notebook within a Docker container, this line will cause an error. If, however, you happen to
be running the code via some other means (e.g., in a Jupyter notebook without using Docker) then you
can try uncommenting this line. If an error isn’t thrown, then a pop-up window should appear that renders
the environment graphically. This enables you to watch your DQN agent as it plays the Cart-Pole game in
real time, episode by episode. It’s fun to watch, but it’s by no means essential: It certainly has no impact
on how the agent learns!]
We pass the state st into the agent’s act() method, and this returns the agent’s action at, which is either 0
(representing left) or 1 (right).
The action at is provided to the environment’s step() method, which returns the next_state st+1, the
current reward rt, and an update to the Boolean flag done.
If the episode is done (i.e., done equals true), then we set reward to a negative value (-10). This provides a
strong disincentive to the agent to end an episode early by losing control of balancing the pole or
navigating off the screen. If the episode is not done (i.e., done is False), then reward is +1 for each
additional timestep of gameplay. Nested within our thousand-episode loop is a while loop that iterates
over the timesteps of a given episode. Until the episode ends (i.e., until done equals True), in each
timestep t (represented by the variable time), we do the following.
In the same way that we needed to reorient state to be a row at the start of the episode, we use reshape to
reorient next_state to a row here.
We use our agent’s remember() method to save all the aspects of this timestep (the state st, the action at
that was taken, the reward rt, the next state st+1, and the flag done) to memory.
We set state equal to next_state in preparation for the next iteration of the loop, which will be timestep t +
1.
If the episode ends, then we print summary metrics on the episode (see Figures 13.8 and 13.9 for example
outputs).
If enough memories have been stored (more than batch_size of them), we use the agent’s train() method
to train its neural net parameters by replaying its memories of gameplay. [Note: You can optionally move
this training step up so that it’s inside the while loop. Each episode will take a lot longer because you’ll be
training the agent much more often, but your agent will tend to solve the Cart-Pole game in far fewer
episodes.]
Every 50 episodes, we use the agent’s save() method to store the neural net model’s parameters.
As shown in Figure 13.8, during our agent’s first 10 episodes of the Cart-Pole game, the scores were low.
It didn’t manage to keep the game going for more than 42 timesteps (i.e., a score of 41).
During these initial episodes, the exploration rate ε began at 100 percent. By the 10th episode, ε had
decayed to 96 percent, meaning that the agent was in exploitative mode (refer back to Figure 13.7) on
about 4 percent of timesteps. At this early stage of training, however, most of these exploitative actions
would probably have been effectively random anyway.
As shown in Figure 13.9, by the 991st episode our agent had mastered the Cart-Pole game.