UNIT IV-1
Introduction to Deep Reinforcement Learning – The multi-armed bandit – Contextual bandits – RL with
the OpenAI Gym – A Q-Learning model – Markov decision process and the Bellman equation – Q-
learning – Q-learning and exploration – First DRL with Deep Q-learning – RL experiments – Keras RL
The building blocks of Deep Reinforcement Learning (DRL) are the components that power learning and
enable agents to make sound decisions in their environments. An effective learning framework is produced
by the interaction of these elements. The following are the essential elements:
Agent: The decision-maker or learner who engages with the environment. The agent acts in accordance
with its policy and gains experience over time to improve its ability to make decisions.
Environment: The system outside of the agent that it communicates with. Based on the actions the agent
does, it gives the agent feedback in the form of incentives or punishments.
State: A depiction of the current circumstance or environmental state at a certain moment. The agent
chooses its activities and makes decisions based on the state.
Action: A choice the agent makes that causes a change in the state of the system. The policy of the agent
guides the selection of actions.
Reward: A scalar feedback signal from the environment that shows whether an agent’s behaviour in a
specific state is desirable. The agent is guided by rewards to learn positive behaviour.
Policy: A plan that directs the agent’s decision-making by mapping states to actions. Finding an ideal
policy that maximises cumulative rewards is the objective.
Value Function: This function calculates the anticipated cumulative reward an agent can obtain from a
specific state while adhering to a specific policy. It is beneficial in assessing and contrasting states and
policies.
Model: A depiction of the dynamics of the environment that enables the agent to simulate potential
results of actions and states. Models are useful for planning and forecasting.
Learning Algorithm: The process by which the agent modifies its value function or policy in response to
experiences gained from interacting with the environment. Learning in DRL is fueled by a variety of
algorithms, including Q-learning, policy gradient, and actor-critic.
Deep Neural Networks: Deep neural networks act as function approximators, allowing DRL to handle
high-dimensional state and action spaces. They learn intricate input-to-output mappings.
Experience Replay: A method that randomly samples stored prior experiences (state, action, reward,
and next state) during training. As a result, learning stability is improved and the correlation between
consecutive experiences is reduced.
These core components collectively form the foundation of Deep Reinforcement Learning, empowering
agents to learn strategies, make intelligent decisions, and adapt to dynamic environments.
In Deep Reinforcement Learning (DRL), an agent interacts with an environment to learn how to make
optimal decisions. The steps, summarized in the interaction-loop sketch after this list, are:
Interaction: The agent interacts with its surroundings through acting, which results in states and rewards.
Learning: The agent keeps track of its experiences and updates its method for making decisions.
Exploration-Exploitation: The agent strikes a balance between using well-known actions and trying out
new ones.
Reward Maximization: The agent learns to select actions that yield the greatest possible total reward.
Convergence: The agent’s policy improves and eventually stabilizes over time.
Extrapolation: A trained agent can apply what it has learned to new situations.
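These steps can be summarized as a single interaction loop. The sketch below is a generic outline in Python; the env and agent objects and their method names are placeholders rather than any specific library’s API:
def run_episode(env, agent):
    state = env.reset()                                    # initial state
    done, total_reward = False, 0.0
    while not done:
        action = agent.act(state)                          # explore or exploit via the policy
        next_state, reward, done = env.step(action)        # environment feedback
        agent.learn(state, action, reward, next_state, done)  # update values/policy from experience
        state = next_state
        total_reward += reward                             # reward-maximization objective
    return total_reward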
The multi-armed bandit problem is a classic reinforcement learning example where we are given a slot
machine with n arms (bandits) with each arm having its own rigged probability distribution of success.
Pulling any one of the arms gives you a stochastic reward of either R=+1 for success, or R=0 for failure.
Our objective is to pull the arms one-by-one in sequence such that we maximize our total reward
collected in the long run.
The non-triviality of the multi-armed bandit problem lies in the fact that we (the agent) cannot access the
true bandit probability distributions — all learning is carried out via trial and error and value estimation.
Devising a strategy that maximizes the long-run reward is our goal for the multi-armed bandit problem, and
having such a strategy would prove very useful in many real-world situations where one would like to select
the “best” bandit out of a group of bandits.
For an agent to decide which action yields the maximum reward, we must define the value of taking each
action. We use the concept of probability to define these values using the action-value function.
The value of selecting an action is defined as the expected reward received when taking that action from a
set of all possible actions. Since the value of selecting an action is not known to the agent, we use the
‘sample-average’ method to estimate the value of taking an action.
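Written out, the sample-average estimate of an action's value after it has been selected n−1 times is the mean of the rewards received so far, and it can be maintained incrementally (standard notation, where R_i is the reward received after the i-th selection of action a):
$$
Q_n(a) = \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1},
\qquad
Q_{n+1}(a) = Q_n(a) + \frac{1}{n}\bigl(R_n - Q_n(a)\bigr)
$$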
Exploration vs Exploitation
Exploration lets the agent try actions whose values are uncertain in order to improve its action-value
estimates. Exploitation, on the other hand, chooses the greedy action to get the most reward by exploiting
the agent’s current action-value estimates. But being greedy with respect to action-value estimates may not
actually yield the most reward and can lead to sub-optimal behaviour.
When an agent explores, it gets more accurate estimates of action-values. And when it exploits, it might
get more reward. It cannot, however, choose to do both simultaneously, which is also called the
exploration-exploitation dilemma.
Epsilon-Greedy Action Selection
Epsilon-Greedy is a simple method to balance exploration and exploitation by choosing between
exploration and exploitation randomly.
In epsilon-greedy, epsilon refers to the probability of choosing to explore; the agent exploits most of the
time, with a small chance of exploring.
Code: a minimal runnable Python version of epsilon-greedy for a three-armed bandit (the Bandit class with choose(), update(), and mean is reconstructed here from context):
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt

class Bandit:
    def __init__(self, m):
        self.m, self.mean, self.N = m, 0.0, 0     # true mean, running estimate, pull count
    def choose(self):
        return np.random.randn() + self.m         # stochastic reward for this arm
    def update(self, x):
        self.N += 1
        self.mean += (x - self.mean) / self.N     # incremental sample-average update

def run_experiment(means, eps, N):
    actions = [Bandit(m) for m in means]
    data = np.empty(N)
    for i in range(N):
        # epsilon greedy
        p = np.random.random()
        if p < eps:
            j = np.random.choice(len(actions))
        else:
            j = np.argmax([a.mean for a in actions])
        x = actions[j].choose()
        actions[j].update(x)
        data[i] = x
    for a in actions:
        print(a.mean)
    return np.cumsum(data) / (np.arange(N) + 1)   # cumulative average reward

if __name__ == '__main__':
    cumulative_average = run_experiment([1.0, 2.0, 3.0], eps=0.1, N=10000)
    plt.plot(cumulative_average); plt.xscale('log'); plt.show()
Contextual Bandits
If you are just getting started with contextual bandits, it can be confusing to understand how contextual
bandits are related to other more widely known methods such as A/B testing, and why you might want to
use contextual bandits instead of those other methods. Therefore, we start our journey by discussing the
similarities and differences between contextual bandits and related methods.
While a multi-armed bandit (MAB) simply looks at whether treatment or control is doing better overall, a
contextual bandit (CB) focuses on whether treatment or control is doing better for a user with a given set of
characteristics. The “context” in contextual bandits refers precisely to these user characteristics and is what
differentiates CB from MAB.
For example, CB might decide to increase treatment allocation to 60% for core users but decrease
treatment allocation to 40% for casual users after observing first day’s data. In other words, CB will
dynamically update traffic allocation taking user characteristics (core vs casual in this example) into
account.
At this point, you might be tempted to think that CB is nothing more than a set of multiple MABs running
together. In fact, when the context we are interested in is a small one (e.g., we are only interested in
whether a user is a core user or a casual user), we can simply run one MAB for core users and another
MAB for casual users. However, as the context gets large (core vs casual, age, country, time since last
active, etc.) it becomes impractical to run a separate MAB for each unique context value.
The real value of CB emerges in this case through the use of models to describe the relationship of the
experimental conditions in different contexts to our outcome of interest (e.g., conversion). As opposed to
enumerating through each context value and treating them independently, the use of models allows us to
share information across different contexts and makes it possible to handle large context spaces. This idea
of a model comes up again at several points below.
At a high level, a contextual bandit algorithm repeats the following serving loop:
1. A new data point arrives with context X (e.g., a core user with an iOS device in the US).
2. Given this data point and the exploration strategy chosen (e.g., ε-greedy), the algorithm decides on a
condition to serve this user (e.g., treatment or control).
3. After the condition is served, we observe the outcome y (e.g., whether the user made a purchase or not).
4. Update (or fully retrain) the model used in Step 2 after seeing the new data. (As mentioned previously, we
usually make an update not after every sample but after seeing a batch of samples to ensure that updates
are less noisy.)
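A minimal sketch of this serving loop is given below, using an ε-greedy strategy with one logistic-regression model per condition. The arm names, the feature encoding of the context, and the batch size of 50 are illustrative assumptions, not the API of any particular contextual-bandit library:
# Sketch of an epsilon-greedy contextual bandit
import numpy as np
from sklearn.linear_model import LogisticRegression

arms = ["control", "treatment"]
eps = 0.1
models = {a: LogisticRegression() for a in arms}
batch = {a: ([], []) for a in arms}            # (contexts X, outcomes y) stored per arm
fitted = {a: False for a in arms}

def choose_arm(x):
    # Step 2: pick a condition for context x (explore with probability eps)
    if np.random.random() < eps or not all(fitted.values()):
        return np.random.choice(arms)
    scores = {a: models[a].predict_proba([x])[0, 1] for a in arms}
    return max(scores, key=scores.get)

def observe(x, arm, y):
    # Steps 3-4: record outcome y and retrain the arm's model on a batch of samples
    X_a, y_a = batch[arm]
    X_a.append(x); y_a.append(y)
    if len(y_a) >= 50 and len(set(y_a)) > 1:   # update after a batch, not after every sample
        models[arm].fit(X_a, y_a)
        fitted[arm] = True

# Example: a hypothetical encoding of "core user, iOS, US" as a feature vector
x = [1, 1, 0]
arm = choose_arm(x)                             # serve a condition (Step 2)
observe(x, arm, np.random.binomial(1, 0.1))     # record the conversion outcome (Steps 3-4)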
OpenAI Gym was created to address the lack of standardization across papers, and to provide better
benchmarks by offering a large number of environments that are easy to set up. The aim of this tool is to
increase reproducibility in the field of AI and to provide tools with which everyone can learn the basics
of AI.
Now, viewing the environment in RL as a functional component: it takes the action at a given state as
input and returns a new state and the reward value associated with that state-action pair.
import gym
env = gym.make('MountainCarContinuous-v0') # try for different environments
observation = env.reset() # This command will reset the environment
for t in range(100):
    env.render()
    print(observation)
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action) # step() returns observation, reward, done, info
    print(observation, reward, done, info)
    if done:
        print("Finished after {} timesteps".format(t+1))
        break
[Output For Mountain Car Cont Env:]
[-0.56252328 0.00184034]
[-0.56081509 0.00170819] -0.00796802138459 False {}
[Output For CartPole Env:]
[ 0.1895078 0.55386028 -0.19064739 -1.03988221]
[ 0.20058501 0.36171167 -0.21144503 -0.81259279] 1.0 True {}
Finished after 52 timesteps
What is action_space in the above code? action_space and observation_space describe the valid formats
of the action and state parameters for the particular environment being worked with. Just take a look at the
values returned.
import gym
env = gym.make('CartPole-v0')
print(env.action_space) #[Output: ] Discrete(2)
print(env.observation_space) # [Output: ] Box(4,)
env = gym.make('MountainCarContinuous-v0')
print(env.action_space) #[Output: ] Box(1,)
print(env.observation_space) #[Output: ] Box(2,)
Discrete means a fixed set of non-negative values; here 0 and 1 correspond to left and right movement for
CartPole balancing. Box represents an n-dimensional array. These standard interfaces help in writing general
code that works across different environments, since we can simply check the bounds with
env.observation_space.high/[low] and use them in a general algorithm.
Q-learning is a machine learning algorithm that helps an agent learn to act in an environment by trial and
error and to improve its behaviour over time. It is a type of reinforcement learning, which trains a model in a
way that mimics how animals and children learn: by rewarding good actions and penalizing bad ones.
Q(S,A) is an estimate of how good it is to take the action A at the state S. This estimate of Q(S,A) will be
iteratively computed using the TD-update rule, which we will see in the upcoming sections.
Rewards: An agent throughout its lifetime starts from a start state, and makes several transitions from its
current state to a next state based on its choice of action and also the environment the agent is interacting
in. At every step of transition, the agent from a state takes an action, observes a reward from the
environment, and then transits to another state.
Episodes: Instances where an agent concludes its actions, marking the end of an episode.
Temporal Difference: Calculated by comparing the current state and action values with the previous
ones.
Bellman’s Equation: A recursive formula invented by Richard Bellman in 1957, used to calculate the
value of a given state and determine its optimal position. It provides a recursive formula for calculating
the value of a given state in a Markov Decision Process (MDP) and is particularly influential in the
context of Q-learning and optimal decision-making.
The equation is expressed as:
Q(s,a) = R(s,a) + γ max_a′ Q(s′,a′)
Where,
● Q(s,a) is the Q-value for a given state-action pair
● max_a′ Q(s′,a′) is the maximum Q-value over all possible actions a′ in the next state s′.
Bellman’s equation is crucial in reinforcement learning as it helps in evaluating the long-term expected
rewards associated with different actions in a given state. It forms the basis for Q-learning algorithms,
guiding agents to learn optimal policies through iterative updates based on observed experiences.
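For reference, the TD-update rule mentioned earlier moves the current estimate toward this Bellman target using a learning rate α; it is the same rule applied in the Q-learning code later in this unit:
$$
Q(s,a) \leftarrow Q(s,a) + \alpha \bigl[ R(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a) \bigr]
$$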
What is a Q-table?
The Q-table functions as a repository of rewards associated with optimal actions for each state in a given
environment. It serves as a guide for the agent, indicating which actions are likely to yield positive
outcomes in various scenarios.
Each row in the Q-table corresponds to a distinct situation the agent might face, while the columns
represent the available actions. Through interactions with the environment and the receipt of rewards or
penalties, the Q-table is dynamically updated to capture the model’s evolving understanding.
Reinforcement learning aims to enhance performance by refining the Q-table, enabling the agent to make
informed decisions. As the Q-table undergoes continuous updates with more feedback, it becomes a more
accurate resource, empowering the agent to make optimal choices and achieve superior results.
Crucially, the Q-table is closely tied to the Q-function, a mathematical expression that considers the
current state and action, generating outputs that include anticipated future rewards for that specific state-
action pair. By consulting the Q-table, the agent can retrieve expected future rewards, guiding it toward
optimized decision-making and states.
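As a minimal sketch, a Q-table can be stored as a 2-D array with one row per state and one column per action; the grid size here is an illustrative assumption:
import numpy as np

n_states, n_actions = 25, 4              # e.g. a 5x5 maze with four possible moves
Q = np.zeros((n_states, n_actions))      # all expected future rewards start at zero

state = 7
best_action = np.argmax(Q[state])        # action with the highest expected future reward in this state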
Let’s say that a robot has to cross a maze and reach the end point. There are mines, and the robot can
only move one tile at a time. If the robot steps onto a mine, the robot is dead. The robot has to reach the
end point in the shortest time possible.
The scoring/reward system is as below:
1. The robot loses 1 point at each step. This is done so that the robot takes the shortest path and reaches the
goal as fast as possible.
2. If the robot steps on a mine, the point loss is 100 and the game ends.
3. If the robot gets power ⚡️, it gains 1 point.
4. If the robot reaches the end goal, the robot gets 100 points.
Now, the obvious question is: How do we train a robot to reach the end goal with the shortest path
without stepping on a mine?
Introducing the Q-Table
Q-Table is just a fancy name for a simple lookup table where we calculate the maximum expected future
rewards for action at each state. Basically, this table will guide us to the best action at each state.
In the Q-Table, the columns are the actions and the rows are the states.
Each Q-table score will be the maximum expected future reward that the robot will get if it takes that
action at that state. This is an iterative process, as we need to improve the Q-Table at each iteration.
Using the Bellman equation above, we get the values of Q for the cells in the table.
When we start, all the values in the Q-table are zeros.
There is an iterative process of updating the values. As we start to explore the environment, the Q-
function gives us better and better approximations by continuously updating the Q-values in the table.
Now, let’s understand how the updating takes place.
In the case of the robot game, to reiterate the scoring/reward structure is:
● power = +1
● mine = -100
● end = +100
Implementation of Q-Learning
Step 1: Define the Environment
Set up the environment parameters, including the number of states and actions, and initialize the Q-table.
In this grid world, each state represents a position, and actions move the agent within this environment.
import numpy as np
# Define the environment and initialize the Q-table
n_states, n_actions = 16, 4              # grid-world positions and moves (assumed values for illustration)
Q = np.zeros((n_states, n_actions))
# Define parameters
learning_rate = 0.8
discount_factor = 0.95
exploration_prob = 0.2
epochs = 1000
# Q-learning algorithm
for epoch in range(epochs):
    current_state = np.random.randint(0, n_states)  # Start from a random state
    # Epsilon-greedy action selection
    if np.random.rand() < exploration_prob:
        action = np.random.randint(n_actions)
    else:
        action = np.argmax(Q[current_state])
    # ... the transition, reward, and Q-update follow (see the FrozenLake example below)
A complete Q-learning example on the OpenAI Gym FrozenLake environment:
import gym
import numpy as np
# 1. Load Environment and Q-table structure
env = gym.make('FrozenLake8x8-v0')
Q = np.zeros([env.observation_space.n, env.action_space.n])
# env.observation_space.n and env.action_space.n give the number of states and actions in the loaded env
# 2. Parameters of Q-learning
eta = .628     # learning rate
gma = .9       # discount factor
epis = 5000    # number of episodes
rev_list = []  # rewards per episode
# 3. Q-learning Algorithm
for i in range(epis):
    # Reset environment
    s = env.reset()
    rAll = 0
    d = False
    j = 0
    # The Q-Table learning algorithm
    while j < 99:
        env.render()
        j += 1
        # Choose action from Q table (with noise for exploration)
        a = np.argmax(Q[s, :] + np.random.randn(1, env.action_space.n) * (1. / (i + 1)))
        # Get new state & reward from environment
        s1, r, d, _ = env.step(a)
        # Update Q-Table with new knowledge
        Q[s, a] = Q[s, a] + eta * (r + gma * np.max(Q[s1, :]) - Q[s, a])
        rAll += r
        s = s1
        if d == True:
            break
    rev_list.append(rAll)
    env.render()
print("Reward Sum on all episodes " + str(sum(rev_list) / epis))
print("Final Values Q-Table")
print(Q)
The Bellman Equation is central to Markov Decision Processes. It outlines a framework for determining
the optimal expected reward at a state s by answering the question: “What is the maximum reward an
agent can receive if they make the optimal action now and for all future decisions?”
The Bellman equation breaks down a dynamic optimization problem into a sequence of simpler
subproblems. It does this by defining the value of the current state as the maximum possible value of the
current state reward plus the value of the next state.
Intuitively, it's a way of framing RL tasks so that we can solve them in a "principled" manner. We will go
into the specifics throughout this section.
5 Components of MDPs
1. S: set of states
2. A: set of actions
3. R: reward function
4. P: transition probability function
5. γ: discount for future rewards
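Written in terms of these five components, the Bellman optimality equation for the value of a state is:
$$
V(s) = \max_{a \in A} \Bigl[ R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V(s') \Bigr]
$$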
Imagine that a robot sitting on a chair stood up and put its right foot forward. So, at the moment, it’s
standing with its right foot forward. This is its current state.
Now, according to the Markov property, the current state of the robot depends only on its immediate
previous state (the previous timestep), i.e. the state it was in when it stood up. Evidently, it does not depend
on its earlier state of sitting on the chair. Similarly, its next state depends only on its current state.
A Markov process is defined by (S, P) where S are the states, and P is the state-transition probability. It
consists of a sequence of random states S₁, S₂, … where all the states obey the Markov property.
The state transition probability or P_ss’ is the probability of jumping to a state s’ from the current state s.
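In symbols:
$$
P_{ss'} = \mathbb{P}\bigl[\, S_{t+1} = s' \mid S_t = s \,\bigr]
$$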
Deep Q-Learning Intuition
In terms of the neural network, we feed in the state and pass it through several hidden layers, which then
output a Q-value for each action we can take. As the state and action spaces get more complex, we use deep
learning as a function approximator.
Let's look at how the equation changes with deep Q-learning. Recall the equation for temporal difference.
In the maze example, the neural network will predict 4 values: up, right, left, or down. We then take these
4 values and compare them to the values that were previously predicted, which are stored in memory. So
we are comparing Q1 vs Q-Target1, Q2 vs Q-Target2, and so on. Recall that neural networks work by
updating their weights, so we need to adapt our temporal difference equation to leverage this. What we do
is calculate a loss by taking the sum of the squared differences of the Q-values and their targets:
L = Σ (Q_target − Q)²
We then take this loss and use backpropagation with stochastic gradient descent to pass it back through the
network and update the weights. This is the learning part; now let's move on to how the agent selects the
best action to take. To choose which action is the best, we take the Q-values that we have and pass them
through a softmax function. This process happens every time the agent is in a new state.
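A small sketch of softmax action selection over the predicted Q-values; the temperature parameter tau is an assumption added for illustration:
import numpy as np

def softmax_action(q_values, tau=1.0):
    # Turn Q-values into selection probabilities and sample an action
    z = np.array(q_values) / tau
    z = z - np.max(z)                          # subtract the max for numerical stability
    probs = np.exp(z) / np.sum(np.exp(z))
    return np.random.choice(len(q_values), p=probs)

action = softmax_action([0.2, 1.5, 0.3, 0.1])  # e.g. Q-values for up, right, left, down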
Another important aspect of Deep Q-Learning that is very effective, and broadly (but not always!) used,
is experience replay memory. Essentially, the idea behind this concept is that instead of training on each
state-action pair right after we go through it, we instead store the ‘experiences’ (which contain the state, the
action taken, the reward received, and the next state) in one long list. This list is called the replay memory,
because it has a ‘memory’ of all the experiences the agent had. We then ‘replay’ these experiences during
training by sampling random batches from the replay memory, rather than learning only from the most
recent transition.
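A minimal sketch of a replay memory with random minibatch sampling is shown below; the buffer size and batch size are illustrative, and the DQN agent later in this unit uses the same idea:
import random
from collections import deque

replay_memory = deque(maxlen=2000)    # keeps only the most recent experiences

def remember(state, action, reward, next_state, done):
    replay_memory.append((state, action, reward, next_state, done))

def sample_minibatch(batch_size=32):
    # Random sampling breaks the correlation between consecutive timesteps
    return random.sample(replay_memory, batch_size)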
Our code for defining a DQN agent that learns how to act in an environment—in this particular case, it
happens to be the Cart-Pole game from the OpenAI Gym library of environments—is provided within our
Cartpole DQN Jupyter notebook. [Note: Our DQN agent is based directly on Keon Kim’s, which is
available at his GitHub repository at bit.ly/keonDQN.] Its dependencies are as follows:
import random
import gym
import numpy as np
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
import os
Example 1 – Cart-Pole DQN hyperparameters
env = gym.make("CartPole-v0")
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
batch_size = 32
n_episodes = 1000
output_dir = "model_output/cartpole/"
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
We use the OpenAI Gym make() method to specify the particular environment that we’d like our agent
to interact with. The environment we choose is version zero (v0) of the Cart-Pole game, and we assign it
to the variable env. On your own time, you’re welcome to select an alternative OpenAI Gym
environment.
state_size: the number of types of state information, which for the Cart-Pole game is 4 (recall that these
are cart position, cart velocity, pole angle, and pole angular velocity).
action_size: the number of possible actions, which for Cart-Pole is 2 (left and right).
We set our mini-batch size for training our neural net to 32.
We set the number of episodes (rounds of the game) to 1000. As you’ll soon see, this is about the right
number of episodes it will take for our agent to excel regularly at the Cart-Pole game. For more-complex
environments, you’d likely need to increase this hyperparameter so that the agent has more rounds of
gameplay to learn in.
We define a unique directory name ('model_output/cartpole/') into which we’ll output our neural
network’s parameters at regular intervals. If the directory doesn’t yet exist, we use os.makedirs() to make
it.
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95
        self.epsilon = 1.0
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        self.learning_rate = 0.001
        self.model = self._build_model()

    def _build_model(self):
        model = Sequential()
        model.add(Dense(32, activation="relu",
                        input_dim=self.state_size))
        model.add(Dense(32, activation="relu"))
        model.add(Dense(self.action_size, activation="linear"))
        model.compile(loss="mse",
                      optimizer=Adam(lr=self.learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def train(self, batch_size):
        # Sample a random minibatch of memories and replay them
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward  # for a final timestep, the target is just the reward
            if not done:
                target = (reward +
                          self.gamma *
                          np.amax(self.model.predict(next_state)[0]))
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def act(self, state):
        # Explore with probability epsilon, otherwise exploit the model
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])

    def save(self, name):
        self.model.save_weights(name)

    def load(self, name):
        self.model.load_weights(name)
Initialization parameters
state_size and action_size are environment-specific, but in the case of the Cart-Pole game are 4 and 2,
respectively, as mentioned earlier.
memory is for storing memories that can subsequently be replayed in order to train our DQN’s neural net.
The memories are stored as elements of a data structure called a deque (pronounced “deck”), which is the
same as a list except that—because we specified maxlen=2000—it only retains the 2,000 most recent
memories. That is, whenever we attempt to append a 2,001st element onto the deque, its first element is
removed, always leaving us with a list that contains no more than 2,000 elements.
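A quick illustration of this eviction behaviour, using a small maxlen so the effect is visible:
from collections import deque

d = deque(maxlen=3)
for i in range(5):
    d.append(i)
print(d)   # deque([2, 3, 4], maxlen=3) -- the oldest elements were dropped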
gamma is the discount factor (a.k.a. decay rate) γ that we introduced earlier in this chapter (see Figure
13.4). This agent hyperparameter discounts prospective rewards in future timesteps. Effective γ values
typically approach 1 (for example, 0.9, 0.95, 0.98, and 0.99). The closer to 1, the less we’re discounting
future reward. [Note: Indeed, if you were to set γ = 1 (which we don’t recommend) you wouldn’t be
discounting future reward at all.] Tuning the hyperparameters of reinforcement learning models such as γ
can be a fiddly process; near the end of this chapter, we discuss a tool called SLM Lab for carrying it out
effectively.
epsilon—symbolized by the Greek letter ε—is another reinforcement learning hyperparameter called
exploration rate. It represents the proportion of our agent’s actions that are random (enabling it to explore
the impact of such actions on the next state st+1 and the reward r returned by the environment) relative to
how often we allow its actions to exploit the existing “knowledge” its neural net has accumulated through
gameplay. Prior to having played any episodes, agents have no gameplay experience to exploit, so it is the
most common practice to start it off exploring 100 percent of the time; this is why we set epsilon = 1.0.
As the agent gains gameplay experience, we very slowly decay its exploration rate so that it can gradually
exploit the information it has learned (hopefully enabling it to attain more reward, as illustrated in Figure
13.7). That is, at the end of each episode the agent plays, we multiply its ε by epsilon_decay. Common
options for this hyperparameter are 0.990, 0.995, and 0.999. [Note: Analogous to setting γ = 1, setting
epsilon_decay = 1 would mean ε would not be decayed at all—that is, exploring at a continuous rate. This
would be an unusual choice for this hyperparameter.]
epsilon_min is a floor (a minimum) on how low the exploration rate ε can decay to. This hyperparameter
is typically set to a near-zero value such as 0.001, 0.01, or 0.02. We set it equal to 0.01, meaning that after
ε has decayed to 0.01 (as it will in our case by the 911th episode), our agent will explore on only 1
percent of the actions it takes—exploiting its gameplay experience the other 99 percent of the time. [Note:
If at this stage this exploration rate concept is somewhat unclear, it should become clearer as we examine
our agent’s episode-by-episode results later on.]
learning_rate is the same stochastic gradient descent hyperparameter that we covered in Chapter 8.
Finally, _build_model()—by the inclusion of its leading underscore—is being suggested as a private
method. This means that this method is recommended for use “internally” only—that is, solely by
instances of the class DQNAgent.
The first hidden layer is a dense layer of 32 ReLU neurons that takes the state as its input
(input_dim=self.state_size). The second hidden layer is also dense, with 32 ReLU neurons. As mentioned
earlier, we’ll explore hyperparameter selection—including how we home in on a particular model
architecture—by discussing the SLM Lab tool later on in this chapter.
The output layer has dimensionality corresponding to the number of possible actions. [Note: Any
previous models in this book with only two outcomes (as in Chapters 11 and 12) used a single sigmoid
neuron. Here, we specify separate neurons for each of the outcomes, because we would like our code to
generalize beyond the Cart-Pole game. While Cart-Pole has only two actions, many environments have
more than two.] In the case of the Cart-Pole game, this is an array of length 2, with one element for left
and the other for right. As with a regression model (see Example 9.8), with DQNs the z values are output
directly from the neural net instead of being converted into a probability between 0 and 1. To do this, we
specify the linear activation function instead of the sigmoid or softmax functions that have otherwise
dominated this book.
As indicated when we compiled our regression model (Example 9.9), mean squared error is an
appropriate choice of cost function when we use linear activation in the output layer, so we set the
compile() method’s loss argument to mse. We return to our routine optimizer choice, Adam.
Remembering gameplay
At any given timestep t—that is, during any given iteration of the reinforcement learning loop (refer back
to Figure 13.3)—the DQN agent’s remember() method is run in order to append a memory to the end of
its memory deque. Each memory in this deque consists of five pieces of information about timestep t:
The state st that the agent was in
The action at that the agent took
The reward rt that the environment returned to the agent
The next_state st+1 that the environment also returned to the agent
A Boolean flag done that is true if timestep t was the final iteration of the episode, and false otherwise
The DQN agent’s neural net model is trained by replaying memories of gameplay, as shown within the train() method of the DQNAgent class above.
For each of the 32 sampled memories, we carry out a round of model training as follows: If done is True
—that is, if the memory was of the final timestep of an episode— then we know definitively that the
highest possible reward that could be attained from this timestep is equal to the reward rt. Thus, we can
just set our target reward equal to reward.
Otherwise (i.e., if done is False) then we try to estimate what the target reward— the maximum
discounted future reward—might be. We perform this estimation by starting with the known reward rt
and adding to it the discounted [Note: That is, multiplied by gamma, the discount factor γ.] maximum
future Q-value. Possible future Q-values are estimated by passing the next (i.e., future) state st+1 into the
model’s predict() method. Doing this in the context of the Cart-Pole game returns two outputs: one output
for the action left and the other for the action right. Whichever of these two outputs is higher (as
determined by the NumPy amax function) is the maximum predicted future Q-value.
Whether target is known definitively (because the timestep was the final one in an episode) or it’s
estimated using the maximum future Q-value calculation, we continue onward within the train() method’s
for loop: We run the predict() method again, passing in the current state st. As before, in the context of
the Cart-Pole game this returns two outputs: one for the left action and one for the right. We store these
two outputs in the variable target_f.
Whichever action at the agent actually took in this memory, we use target_f[0][action] = target to replace
that target_f output with the target reward. [Note: We do this because we can only train the Q-value
estimate based on actions that were actually taken by the agent: We estimated target based on next_state
st+1 and we only know what st+1 was for the action at that was actually taken by the agent at timestep t.
We don’t know what next state st+1 the environment might have returned had the agent taken a different
action than it actually took.]
The model input is the current state st and its output is target_f, which incorporates our approximation of
the maximum future discounted reward. By tuning the model’s parameters (represented by θ in Equation
13.2), we thus improve its capacity to accurately predict the action that is more likely to be associated
with maximizing future reward in any given state.
In many reinforcement learning problems, epochs can be set to 1. Instead of recycling an existing training
dataset multiple times, we can cheaply engage in more episodes of the Cart-Pole game (for example) to
generate as many fresh training data as we fancy.
We set verbose=0 because we don’t need any model-fitting outputs at this stage to monitor the progress of
model training. As we demonstrate shortly, we’ll instead monitor agent performance on an episode-by-
episode basis.
To select a particular action at to take at a given timestep t, we use the agent’s act() method. Within this
method, the NumPy rand function is used to sample a random value between 0 and 1 that we’ll call v. In
conjunction with our agent’s epsilon, epsilon_decay, and epsilon_min hyperparameters, this v value will
determine for us whether the agent takes an exploratory action or an exploitative one:
If the random value v is less than or equal to the exploration rate ε, then a random exploratory action is
selected using the randrange function. In early episodes, when ε is high, most of the actions will be
exploratory. In later episodes, as ε decays further and further (according to the epsilon_decay
hyperparameter), the agent will take fewer and fewer exploratory actions.
Otherwise—that is, if the random value v is greater than ε—the agent selects an action that exploits the
“knowledge” the model has learned via memory replay. To exploit this knowledge, the state st is passed
in to the model’s predict() method, which returns an activation output for each of the possible actions the
agent could theoretically take. We use the NumPy argmax function to select the action at associated with
the largest activation output. [Note: Recall that the activation is linear, and thus the output is not a
probability; instead, it is the discounted future reward for that action.]
[Note: We introduced the exploratory and exploitative modes of action when discussing the initialization
parameters for our DQNAgent class earlier, and they’re illustrated playfully in Figure 13.7.]
Finally, the save() and load() methods are one-liners that enable us to save and load the parameters of the
model. Particularly with respect to complex environments, agent performance can be flaky: For long
stretches, the agent may perform very well in a given environment, and then later appear to lose its
capabilities entirely. Because of this flakiness, it’s wise to save our model parameters at regular intervals.
Then, if the agent’s performance drops off in later episodes, the higher-performing parameters from some
earlier episode can be loaded back up.
Having created our DQN agent class, we can initialize an instance of the class—which we name agent—
with this line of code:
agent = DQNAgent(state_size, action_size)
The code in Example 3 enables our agent to interact with an OpenAI Gym environment, which in our
particular case is the Cart-Pole game.
for e in range(n_episodes):
    state = env.reset()
    state = np.reshape(state, [1, state_size])
    done = False
    time = 0
    while not done:
        # env.render()
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        reward = reward if not done else -10
        next_state = np.reshape(next_state, [1, state_size])
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        if done:
            print("episode: {}/{}, score: {}, e: {:.2}"
                  .format(e, n_episodes, time, agent.epsilon))
        time += 1
    if len(agent.memory) > batch_size:
        agent.train(batch_size)
    if e % 50 == 0:
        agent.save(output_dir + "weights_"
                   + "{:04d}".format(e) + ".hdf5")
Recalling that we had set the hyperparameter n_episodes to 1000, Example 3 consists of a big for loop
that allows our agent to engage in these 1,000 rounds of gameplay. Each episode of gameplay is counted
by the variable e and involves:
We use env.reset() to begin the episode with a random state st. For the purposes of passing state into our
Keras neural network in the orientation the model is expecting, we use reshape to convert it from a
column into a row. [Note: The env.render() line is commented out because if you are running this code
via a Jupyter notebook within a Docker container, this line will cause an error. If, however, you happen to
be running the code via some other means (e.g., in a Jupyter notebook without using Docker) then you
can try uncommenting this line. If an error isn’t thrown, then a pop-up window should appear that renders
the environment graphically. This enables you to watch your DQN agent as it plays the Cart-Pole game in
real time, episode by episode. It’s fun to watch, but it’s by no means essential: It certainly has no impact
on how the agent learns!]
We pass the state st into the agent’s act() method, and this returns the agent’s action at, which is either 0
(representing left) or 1 (right).
The action at is provided to the environment’s step() method, which returns the next_state st+1, the
current reward rt, and an update to the Boolean flag done.
If the episode is done (i.e., done equals true), then we set reward to a negative value (-10). This provides a
strong disincentive to the agent to end an episode early by losing control of balancing the pole or
navigating off the screen. If the episode is not done (i.e., done is False), then reward is +1 for each
additional timestep of gameplay. Nested within our thousand-episode loop is a while loop that iterates
over the timesteps of a given episode. Until the episode ends (i.e., until done equals True), in each
timestep t (represented by the variable time), we do the following.
In the same way that we needed to reorient state to be a row at the start of the episode, we use reshape to
reorient next_state to a row here.
We use our agent’s remember() method to save all the aspects of this timestep (the state st, the action at
that was taken, the reward rt, the next state st+1, and the flag done) to memory.
We set state equal to next_state in preparation for the next iteration of the loop, which will be timestep t +
1.
If the episode ends, then we print summary metrics on the episode (see Figures 13.8 and 13.9 for example
outputs).
If enough memories have been stored (more than batch_size of them), we use the agent’s train() method
to train its neural net parameters by replaying its memories of gameplay. [Note: You can optionally move
this training step up so that it’s inside the while loop. Each episode will take a lot longer because you’ll be
training the agent much more often, but your agent will tend to solve the Cart-Pole game in far fewer
episodes.]
Every 50 episodes, we use the agent’s save() method to store the neural net model’s parameters.
As shown in Figure 13.8, during our agent’s first 10 episodes of the Cart-Pole game, the scores were low.
It didn’t manage to keep the game going for more than 42 timesteps (i.e., a score of 41).
During these initial episodes, the exploration rate ε began at 100 percent. By the 10th episode, ε had
decayed to 96 percent, meaning that the agent was in exploitative mode (refer back to Figure 13.7) on
about 4 percent of timesteps. At this early stage of training, however, most of these exploitative actions
would probably have been effectively random anyway.
As shown in Figure 13.9, by the 991st episode our agent had mastered the Cart-Pole game.