
Preprint · April 2020


DOI: 10.36227/techrxiv.12061728



The Use of Reinforcement Learning in Gaming:
The Breakout Game Case Study
Taresh Dewan, COMP5112WC Student, Lakehead University, [email protected]
Manva Trivedi, COMP5112WC Student, Lakehead University, [email protected]
Sabah Mohammed, COMP5112WC Supervisor, Lakehead University, [email protected]
Aloukik Aditya, COMP5112WC Student, Lakehead University, [email protected]
Ao Chen, COMP5112WC Student, Lakehead University, [email protected]
Danning Jiang, COMP5112WC Student, Lakehead University, [email protected]

Abstract—Traditionally, reinforcement learning (RL) algorithms are described as trial-and-error learning methods that use real task experience to develop a control policy incrementally. Reinforcement learning theory offers a psychological viewpoint on how agents can maximize their control of an environment. The major difference between reinforcement learning and supervised learning is that the learner receives only partial feedback about the experience it has gathered. An RL agent learns how to map states to optimal actions through trial and error and, over time, develops a strategy for long-term rewards. In this paper, we use an approach that unifies artificial neural networks with the reinforcement learning architecture, allowing the agent to learn the best possible actions in a virtual environment to achieve its objectives; for this we have chosen Breakout, a classic arcade game. We chose Breakout because RL reaches superhuman play on it compared to other games such as Enduro and Time Pilot. This paper provides a comparative analysis of the Deep Q Network (DQN) and Double Deep Q Network (DDQN) algorithms based on their hit rate, in which DDQN proved to be better for the Breakout game. DQN is chosen over basic Q-learning because it learns a policy with a neural network, which is suitable for complex environments, and DDQN is chosen because it solves the overestimation problem (the agent always chooses a non-optimal action for a state just because it has the maximum Q-value) that occurs in basic Q-learning.

Index Terms—Reinforcement learning (RL), Deep Q Network (DQN), Double Deep Q Network (DDQN), arcade games, Breakout, Atari, agent, action, state, environment, Q-value, rewards

I. INTRODUCTION

A. Overview

Reinforcement learning (RL) refers to goal-oriented algorithms that learn how to accomplish a specific goal or how to optimise over many steps along a dimension; for example, over many moves they can maximise the points earned in a game. RL algorithms can start from a blank state and achieve superhuman performance under the right conditions. Such algorithms are penalised, like a pet trained through scolding and punishment, when they make the wrong decisions, and are praised when they make the right ones; this is reinforcement. The performance of RL degrades when the state-action space is too large to be known completely. For this reason, we use Deep Reinforcement Learning to achieve better performance.

Deep reinforcement learning integrates artificial neural networks with the RL architecture, allowing software-defined agents to learn the best possible behaviours in a virtual environment to achieve their objectives. Instead of using a lookup table to store, index and update all possible states and their values, which is infeasible for very big problems, we train a neural network on samples from the state-action space and let it learn how valuable those samples are for our aim of enhancing learning.

B. Basic Definitions

Reinforcement learning can be understood through the concepts of agent, state, action, environment and reward, which are explained below.

Fig. 1. Reinforcement Learning Process

As shown in figure 1, the agent and the environment are the key components of the reinforcement learning process. An agent is an entity that takes an action (one of the set of moves the agent can make) in an environment (the surroundings the agent moves through and which respond to the agent). A state is the real and immediate condition in which the agent finds itself; that is, a position or moment on the basis of which its Q-value is updated and a reward (the input by which we measure the success or failure of an agent's action in each state) is assigned to the agent. The reward can be negative or positive, and it in turn affects the Q-value. The Q-value takes two parameters, a state s and the current action a: Qπ(s, a) refers to the long-term return of taking action a from the current state s under policy π, so Q maps state-action pairs to expected rewards.
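As an illustration of these concepts (and not of the paper's actual implementation), the following minimal sketch shows how an agent, an environment, rewards and Q-values interact in plain tabular Q-learning: the Q-value of the visited (state, action) pair is nudged toward the received reward plus the discounted value of the next state. The toy environment, its interface and the hyperparameter values are assumptions chosen only for the example.

import random
from collections import defaultdict

# Hypothetical one-dimensional "catch the ball" environment used only to
# illustrate the agent / state / action / reward / Q-value vocabulary.
ACTIONS = [-1, 0, +1]              # move left, stay idle, move right
GAMMA, ALPHA, EPISODES = 0.95, 0.1, 500

def step(state, action):
    """Return (next_state, reward): +3 when the paddle is under the ball, else -3."""
    paddle, ball = state
    paddle = max(0, min(9, paddle + action))
    reward = 3 if paddle == ball else -3
    next_ball = random.randint(0, 9)            # ball re-appears somewhere else
    return (paddle, next_ball), reward

q_table = defaultdict(float)                    # Q maps (state, action) -> value

for _ in range(EPISODES):
    state = (random.randint(0, 9), random.randint(0, 9))
    for _ in range(20):                         # one short episode
        action = random.choice(ACTIONS) if random.random() < 0.1 else \
                 max(ACTIONS, key=lambda a: q_table[(state, a)])
        next_state, reward = step(state, action)
        # Tabular Q-value update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        best_next = max(q_table[(next_state, a)] for a in ACTIONS)
        q_table[(state, action)] += ALPHA * (reward + GAMMA * best_next
                                             - q_table[(state, action)])
        state = next_state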
An agent sends feedback in the form of actions to the environment from any given state, and the environment returns the agent's new state (which results from acting on the prior state) as well as any rewards. Rewards can be delayed or immediate, and effectively they determine the behaviour of the agent. Another term associated with rewards is the discount factor, which is applied to the future rewards discovered by the agent in order to dampen the effect of those rewards on the agent's choice of action.

In our customised Breakout game environment, the paddle plays the role of the agent, and the environment includes the wall, the bricks and the ball. The actions that can be taken by the agent (paddle) are moving the paddle to the left, moving the paddle to the right, or leaving it in an idle position. The state of the game consists of: whether the game is ongoing / lost / won, the x and y coordinates of the ball, the ball velocity, the x position of the paddle, the array of coordinates of the remaining bricks, the number of frames since the game started, and the current score of the game.

Our customised Breakout environment consists of a paddle, a ball, a wall and a block of bricks with different colors. When the ball hits the paddle, 3 points are rewarded to the agent, and if it misses the paddle, 3 points are deducted. When the ball hits a brick, a reward is awarded to the agent according to the color of the brick (blue=8, green=7, olive=6, yellow=5, orange=4, red=3). The Q-value of each (state, action) pair is fed to the network, which helps the learning. Once the Q-values of the distinct (state, action) pairs are represented by the network, the action with the best Q-value is chosen for the progression of the game.
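The reward scheme described above can be captured in a few lines. The sketch below is an illustrative encoding of those rules, not the authors' code; the function name and the way a collision event is reported are assumptions made for the example.

# Reward scheme of the customised Breakout environment as described in the text.
BRICK_REWARDS = {"blue": 8, "green": 7, "olive": 6, "yellow": 5, "orange": 4, "red": 3}
PADDLE_HIT_REWARD = 3        # ball bounces off the paddle
PADDLE_MISS_PENALTY = -3     # ball falls past the paddle

def reward_for_event(event, brick_color=None):
    """Map a game event ('paddle_hit', 'paddle_miss' or 'brick_hit') to a reward."""
    if event == "paddle_hit":
        return PADDLE_HIT_REWARD
    if event == "paddle_miss":
        return PADDLE_MISS_PENALTY
    if event == "brick_hit":
        return BRICK_REWARDS[brick_color]
    return 0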
One of the reasons for choosing Breakout is that, as a starting point, it is better to choose a game that can be altered by the user and for which different parts of the algorithm, such as the state representation and the reward system, can be modified and improved. In this paper, we give a comparative analysis of two different algorithms on a classic Atari arcade game, Breakout. We have compared and evaluated the performance of two models, namely the conventional Deep Q-Network and the Double Deep Q-Network.

C. Asserting Thesis

The "memory" is a key component of DQNs: the trials are used to train the model continuously, as stated earlier. Instead of training on the trials as they come in, however, we add them to a memory and train on a random sample of that memory. The gamma factor reflects the depreciated value of the expected future returns from a state; it is defined to lie between 0 and 1. We have followed an epsilon-greedy search in which epsilon is 1 in the initial stages, so that the behaviour of the paddle is completely random, and it then decreases over the ongoing iterations, while the discount factor gamma is set to 0.95. For each distinct (state, action) pair, the Q-value (which is calculated by the DQN and DDQN algorithms) is updated, and that Q-value is used to take the best possible action of the paddle for the next state (velocity and ball coordinates).

We chose to implement Deep Reinforcement Learning on Atari games because the environment of Atari games is quite uncertain with respect to its states and actions, which makes it relatable to real-life situations.
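A minimal sketch of the two mechanisms named above, the replay memory sampled at random and the decaying exploration rate, is given below. It only illustrates the idea; the container size, batch size and decay schedule mirror the settings reported later in the paper, but the functions themselves are not the authors' code.

import random
from collections import deque

memory = deque(maxlen=100000)    # replay memory; trials are appended as they occur

def store_and_sample(transition, batch_size=64):
    """Store one (state, action, reward, next_state, done) trial and, once enough
    experience is available, return a random mini-batch to train on."""
    memory.append(transition)
    if len(memory) < batch_size:
        return None
    return random.sample(memory, batch_size)

# Epsilon-greedy schedule: start fully random, then decay towards a floor.
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995

def decay_epsilon(epsilon):
    return max(epsilon_min, epsilon * epsilon_decay)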
II. PROBLEM DEFINITION

The purpose of our project is to create a virtual environment for a game named "Breakout" that is a close replication of a real-life environment, to learn that environment over time as happens in real life, and to act accordingly. To duplicate a real-life environment and act according to its changes, the environment has to be learnt properly, which we do using the Reinforcement Learning algorithms mentioned above, and we show through comparisons which algorithm is better for such situations.

First, reinforcement learning itself is the problem that we have studied. Reinforcement learning is a branch of machine learning which is used to describe and solve the problem of an agent's interaction with its environment through learning strategies to achieve maximum reward [1]. A classic and standard model of reinforcement learning is the Markov Decision Process (MDP), which is simply a process where an agent takes an action to update its state, obtain a reward and interact with the environment [2].

The second problem is the game itself. We have imitated Breakout as our game. The building blocks of this game are a moving ball, a paddle and 6 rows of bricks. The agent, i.e. the paddle, can move left and right to hit the ball. After being hit, the ball rebounds and then destroys bricks to obtain reward. The game needs to be able to receive incoming actions and switch to the next frame. If the ball is missed, the game ends and then resets the positions of the ball and the paddle. The bricks disappear after being touched by the ball. At the beginning of the game, the ball and paddle are placed in their initial positions and all 6 rows of bricks are loaded. The game reports the total number of bricks destroyed, the number of times the ball was missed, and the ball's hit rate.

The third problem is to create a network structure for both DQN and DDQN in our simulated Breakout environment and to compare them in terms of reward acquisition, loss at each epoch and hit rate, in order to conclude which algorithm performs better.

III. RELATED RESEARCH WORK

Reinforcement learning can be traced back to the implementation of TD-Gammon. In 1992, IBM researcher Gerald Tesauro developed an algorithm that combines temporal-difference learning with neural networks, named TD-Gammon, specializing in playing backgammon. TD-Gammon uses a three-layer neural network. The backgammon position is represented by 198 input units, and there are 40-80 neurons in the hidden layer. The final output is an estimate of the value function [3]. TD-Gammon used a model-free reinforcement learning algorithm similar to Q-learning and approximated the value function using a multi-layer perceptron with one hidden layer. However, applications of TD-Gammon to other board games were less successful, which led to a widespread belief that the TD-Gammon method only worked for backgammon, perhaps because the randomness of the dice rolls helps explore the state space and also makes the value function particularly smooth [4].

Q-learning was proposed by Watkins in 1989 and has become a popular option for reinforcement-learning-based agents; however, it is impractical for complicated, high-dimensional state-space problems.

The combination of deep learning and reinforcement learning methods, mainly involving Q-learning, was brought forward in a sequence of papers [5]. From what we know, previous reinforcement learning methods had trouble selecting features, while the deep reinforcement learning approach was found to handle complex tasks successfully, as it can learn from data at different levels of features. Mnih successfully trained a deep RL agent from visual inputs consisting of thousands of pixels. This approach enabled it to reach beyond-human capabilities in playing Atari games, AlphaGo and so on [6]. The Deep Q network agent synthesized by Mnih achieves human-like performance when playing Atari games by using artificial neural networks to process sensory data. In subsequent work, Van Hasselt [7] improved the algorithm by implementing double deep Q-learning, which helps generate more accurate estimates by eliminating overestimation.
In the paper "Human-level control through deep reinforcement learning" [5], researchers have shown that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to exceed the performance of all previous algorithms and reach a level comparable to that of a skilled human games tester, using the same algorithm, network architecture and hyperparameters across a range of 49 games [5].

Paper [8] compares a ReLU neural network with a spiking neural network (SNN) in terms of rewards achieved for a given number of epochs in Breakout, using an epsilon-greedy approach and a conventional greedy approach. Furthermore, the additional benefits of SNNs can supplement the working of DQN when data is noisy and incomplete [8]. Paper [9] compares performance in terms of training time, stability and highest score achieved using the DQN and Asynchronous Advantage Actor-Critic (A3C) algorithms, again on the Breakout game. Rewards achieved using the A3C algorithm are higher compared to DQN, as A3C uses a multi-core CPU to work efficiently, whereas DQN needs a powerful GPU to train faster and runs slowly on a CPU [9]. Paper [10] presents a visual DQN approach which helps to control the random actions of DQN and helps domain experts understand, diagnose and improve DQN models with four levels of detail: overall training level, epoch level, episode level and segment level. A basic overview of the performance results of the different algorithms, along with the criteria used for comparison in the Breakout game, is given in Table I.

TABLE I
OVERVIEW OF RELATED RESEARCH WORK IN BREAKOUT

SNo | Criteria for Comparison | Algorithms Compared | Results
1 | Rewards for a given number of epochs | SNN vs. DQN | SNN is better, especially for noisy or incomplete data.
2 | High score, stability and rewards | DQN vs. A3C | A3C outperformed DQN, with a high score of 79 and rewards increasing gradually.
3 | Epoch level, training level, episode level and segment level | Visual DQN vs. DQN | Visual DQN is able to control the random actions taken by the DQN algorithm.

We show performance comparisons of the DQN and DDQN algorithms for the training phase in terms of hit rate, which is calculated by taking the ratio of the number of times the ball hits the paddle to the sum of the number of times the ball hits the paddle and the number of times the ball misses the paddle, for the given number of episodes, which in our case is 100.
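Expressed in code, the hit-rate metric used throughout the paper is simply the following (a short sketch with hypothetical variable names):

def hit_rate(paddle_hits, paddle_misses):
    """Fraction of ball-paddle encounters in which the paddle actually hit the ball."""
    return paddle_hits / (paddle_hits + paddle_misses)

# Example: the percentages reported later (83.66 % for DDQN, 78.47 % for DQN) are of this form.
print(round(hit_rate(84, 16) * 100, 2))   # prints 84.0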
IV. METHODOLOGY

In order to compare the performance of the AI agent in playing the Breakout game with different algorithms, we decided to create our own environment to train the agent, which imitates the environment of the OpenAI Atari game Breakout-ram-v0. After setting up the environment, we built the networks for the DQN and Double DQN algorithms respectively. The specific steps are as follows:

1) Setting up the Breakout environment, including the background, the paddle and the ball, defining the bricks, controlling the paddle movement, handling collisions, and updating the state and environment. We use the turtle library, a graphics library in Python that can be used to create various objects and shapes and animate them (for example with the penup() function), adjusting the drawing speed during object creation.

2) For defining bricks, controlling paddle movement, handling collisions, and updating the state and environment, we have defined separate functions, namely reset(), which resets the environment if the paddle misses the ball; next_iteration(), which computes the parameters of the next state; move_positive_x(), which moves the paddle to the right; and move_negative_x(), which moves the paddle to the left.

3) Creating the DQN and Double DQN algorithms on top of the Breakout environment and performing hyper-parameter tuning (discount factor gamma, learning rate, epsilon), using libraries such as random, numpy, keras, collections, matplotlib and so on.

4) Training the agent using the environment within the 1000 steps, calculating the reward and loss for every 100 episodes, then saving the reward and loss data and plotting them.

A. Deep Q Network

In traditional reinforcement learning such as Q-learning, we use a table to store Q-values, but this has a limitation: today's problems are too complicated to store the Q-values of every state and action in a table. No matter how much memory the computer has, it becomes time-consuming to search for the corresponding state in such a large table. When reinforcement learning is combined with deep learning, a neural network can solve this problem, because we can simply feed in the state value, output the Q-values of all actions, and then directly select the action with the maximum value as the next action, according to the principle of Q-learning.

Table II describes the structure of the Deep Q Network we use in our project.

TABLE II
MODEL STRUCTURE

Layer (type) | Output Shape | Activation function | Param #
dense_1 (Dense) | (None, 64) | ReLU | 384
dense_2 (Dense) | (None, 64) | ReLU | 4160
dense_3 (Dense) | (None, 3) | Linear | 195
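A Keras model with the structure of Table II can be sketched as follows. Note that the 384 parameters of the first layer imply a 5-dimensional input vector (5 x 64 weights + 64 biases), while the prose elsewhere describes a 3-value state, so the state_space default below is an assumption; this is a reconstruction consistent with the table and with the Keras 2.2 API mentioned in the experiment setting, not the authors' exact listing.

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

def build_model(state_space=5, action_space=3, learning_rate=0.001):
    # Two hidden layers of 64 ReLU units and a linear output with one Q-value per
    # action, matching the layer shapes and parameter counts of Table II.
    model = Sequential()
    model.add(Dense(64, input_dim=state_space, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(action_space, activation='linear'))
    model.compile(loss='mse', optimizer=Adam(lr=learning_rate))
    return model

model = build_model()
model.summary()   # prints a layer/parameter table comparable to Table II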

The input of this network is a state and the output is the Q-values of the 3 actions. The neural network is what we use to process the state and predict the Q-values. We also need to update the Q-value and train the network. Equation (1) below describes how the Deep Q Network updates the Q-value.

Qnew = R + γ · max_a Q(S, a)    (1)

Here R is the current reward, γ is the discount parameter gamma, max_a Q(S, a) is the largest Q-value over the actions a available in state S, and Qnew is the updated Q-value. We can then use the state S as the input and the updated Q-value as the target output to train the network.
B. Double DQN

The network structure of Double DQN is the same as that of the Deep Q Network, but Double DQN requires 2 networks: one is the main network, whose parameters are updated continuously, and the other is the target network, which keeps older parameters. In Double DQN, we use the main network to get the Q-value with the state as input. Equation (1) for updating the Q-value changes to equation (2):

Qnew = R + γ · Q(S, argmax_a Q(S, a; θmain); θtarget)    (2)

Here R is the current reward, γ is the parameter gamma, Qnew is the updated Q-value, S is the input state, θmain denotes using the main network and θtarget denotes using the target network. In equation (2), we feed state S to the main network to select the action a with the largest Q-value, and then use the target network to obtain the Q-value of that selected action. We can then update the Q-value and use the updated Q-value together with the input state S to train the network.
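Worked as code, the two update targets compare as in the sketch below. It assumes compiled Keras models like the one sketched after Table II (model for DQN; main_model and target_model for Double DQN) and NumPy state vectors; the variable names are illustrative, and, as in most DQN implementations, the state passed in would be the successor state of a stored transition. This illustrates equations (1) and (2) rather than reproducing the authors' training code.

import numpy as np

gamma = 0.95

def dqn_target(model, reward, state):
    # Equation (1): R + gamma * max_a Q(S, a), using a single network for both
    # selecting and evaluating the best action.
    q_values = model.predict(state.reshape(1, -1))[0]
    return reward + gamma * np.max(q_values)

def double_dqn_target(main_model, target_model, reward, state):
    # Equation (2): the main network selects the best action, the target network
    # evaluates it, decoupling selection from evaluation to reduce overestimation.
    best_action = np.argmax(main_model.predict(state.reshape(1, -1))[0])
    return reward + gamma * target_model.predict(state.reshape(1, -1))[0][best_action]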
V. PROTOTYPING

A. Detailed Design:

In our project, we have used three classes to implement the code (Paddle, DQN and DDQN), as shown in figure 2.

Fig. 2. Class Diagram

B. Building the major classes:

As shown in figure 3, with import turtle we add the turtle library to our Python environment. The turtle module provides turtle graphics primitives, in both object-oriented and procedure-oriented ways.

Fig. 3. Importing turtle

As shown in figure 4, we add the bricks using a for loop and assign a variety of colors to them. Using x_cor and y_cor we assign the coordinates of the bricks, and each brick is placed 110 pixels apart on the x-axis. We have used the Brick() function to provide their dimensions, shape, colors and coordinates.

Fig. 4. Building Bricks
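A sketch of how such rows of bricks can be laid out with the turtle module is given below. The 110-pixel spacing and the color list follow the description above; the starting coordinates and the brick helper are assumptions made for illustration, since the original listing is only shown as a figure.

import turtle

BRICK_COLORS = ["blue", "green", "olive", "yellow", "orange", "red"]

def make_brick(color, x_cor, y_cor):
    """Create one brick as a stretched square turtle at the given position."""
    brick = turtle.Turtle()
    brick.shape("square")
    brick.color(color)
    brick.shapesize(stretch_wid=1, stretch_len=5)   # rectangle-shaped brick
    brick.penup()                                   # move without drawing a line
    brick.goto(x_cor, y_cor)
    return brick

bricks = []
for row, color in enumerate(BRICK_COLORS):          # one row per color
    y_cor = 250 - row * 30                          # assumed vertical spacing
    for col in range(7):
        x_cor = -330 + col * 110                    # each brick 110 px apart on the x-axis
        bricks.append(make_brick(color, x_cor, y_cor))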

As shown in figure 5, we then initialize the configuration of the ball and the paddle. For the ball, we decided to make its shape a circle and its color red. dx and dy are the per-frame changes in the ball's coordinates; since the ball keeps moving, its coordinates are updated by these values accordingly. Using the setposition() function, we set the initial position of the ball.

Fig. 5. Building Paddle and Ball

The paddle has a configuration similar to the ball's, but its movement is along the x-axis and its shape is a square. Using shapesize() we stretched the square lengthwise to create a rectangle. We selected white as its color, since it is more visible on a black background.

Figure 6 shows how the final display screen is created. The bgcolor() function selects the background color of the entire game, which is black in our case. The screen resolution is set using the win.setup() function, and we chose a screen size of 800 by 600.

Fig. 6. To create main screen
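A corresponding sketch of the paddle, ball and screen setup with turtle is shown below; it follows the shapes, colors and 800 x 600 window described above, while the exact coordinates and per-frame increments are assumptions.

import turtle

# Main window: black background, 800 x 600 pixels.
win = turtle.Screen()
win.title("Breakout")
win.bgcolor("black")
win.setup(width=800, height=600)
win.tracer(0)                      # disable automatic refresh; the game redraws each frame

# Paddle: a white square stretched into a rectangle, moving only on the x-axis.
paddle = turtle.Turtle()
paddle.shape("square")
paddle.color("white")
paddle.shapesize(stretch_wid=1, stretch_len=5)
paddle.penup()
paddle.setposition(0, -250)

# Ball: a red circle with per-frame coordinate increments dx and dy.
ball = turtle.Turtle()
ball.shape("circle")
ball.color("red")
ball.penup()
ball.setposition(0, -100)
ball.dx = 2                        # horizontal change per frame (assumed)
ball.dy = -2                       # vertical change per frame (assumed)

win.update()                       # draw the initial frame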
The Paddle class has 3 functions:

1) Reset: The return value of this function is the X value of the paddle, the X value of the ball and the Y value of the ball. The function resets the positions of the paddle and the ball in the game. After it is called, the paddle returns to the middle of the screen and the ball goes back to its initial position and begins to fall.

2) Run frame: This function makes the ball move and then checks whether any brick is touched by the ball and whether the ball is missed. If a brick is touched by the ball, the ball bounces and the brick disappears, and the game records this hit and recalculates the hit rate. If the ball is missed, the game calls the reset function and records the number of missed balls.

3) Next iteration: This function takes an action as input; the action parameter is an integer that represents which action to take. In our game we have 3 actions in total, so the range of the action is from 0 to 2: 0 means the paddle should move left, 1 means the paddle should do nothing, and 2 means the paddle should move right. After the game receives the input action, it calls the run frame function to advance to the next frame. A skeleton of this interface is sketched below.
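The following skeleton shows the shape of such a Paddle environment class, tying together the reset / run_frame / next_iteration interface described above with the movement functions named in the methodology. The collision logic is deliberately reduced to a stub, since the full listing only appears as figures in the paper; treat the details as assumptions.

class Paddle:
    """Minimal skeleton of the customised Breakout environment (illustrative only)."""

    def __init__(self):
        self.hits, self.misses, self.reward, self.done = 0, 0, 0, False
        self.paddle_x, self.ball_x, self.ball_y = 0.0, 0.0, -100.0

    def reset(self):
        # Put paddle and ball back to their initial positions and return the state.
        self.paddle_x, self.ball_x, self.ball_y = 0.0, 0.0, -100.0
        self.done = False
        return [self.paddle_x, self.ball_x, self.ball_y]

    def move_negative_x(self):
        self.paddle_x -= 20          # move left (assumed step size)

    def move_positive_x(self):
        self.paddle_x += 20          # move right (assumed step size)

    def run_frame(self):
        # Move the ball, handle brick/paddle/wall collisions, update reward,
        # hits, misses and the hit rate; call reset when the ball is missed.
        pass

    def next_iteration(self, action):
        # 0 = move left, 1 = do nothing, 2 = move right, as described in the text.
        if action == 0:
            self.move_negative_x()
        elif action == 2:
            self.move_positive_x()
        self.run_frame()
        state = [self.paddle_x, self.ball_x, self.ball_y]
        return state, self.reward, self.done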
In the DQN class, the parameters that have been used are:

• action_space: an integer; it represents how many actions we have in the game. In our game we have 3 actions in total: the paddle can move left, do nothing, or move right.
• state_space: an integer; it represents the dimension of the state.
• epsilon: a float. When we choose an action, we randomly generate a number to compare with it.
• gamma: a float. This number is one of the parameters of the equation used to update the Q-value.
• batch_size: an integer. This is the maximum size of a training batch.
• epsilon_min: a float. This is the minimum value of epsilon.
• epsilon_decay: a float. We use epsilon *= epsilon_decay to decrease epsilon.
• learning_rate: a float. It is the learning rate used to update the network.
• memory: a deque with a maximum length of 100000. We use it as the batch store.
• model: a keras Model. This is the model of the Deep Q network.

The functions of this class used in our prototype are:

1) build_model: The output is a Model class imported from keras; the function builds the structure of the Deep Q network. It has 3 layers. The input of the network is the state, which includes the X value of the paddle, the X value of the ball and the Y value of the ball. The first and second hidden layers both have 64 neurons. The output layer has 3 neurons, which represent the 3 Q-values of the 3 actions in the game, because the agent will choose the action with the largest Q-value.
2) remember: We store the state, action, reward, next state and done flag in the memory, because we train the network on batches drawn from it.
3) act: First, we randomly generate a number between 0 and 1 and check whether it is larger than epsilon. If it is not, we output a random action. If it is larger than epsilon, we pass the input state through the Deep Q network, predict an action and output it.
4) replay: We train the network on a batch and update the weights of the Deep Q network. A sketch of act and replay is given after this list.
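Based on the description above (and on the build_model sketch shown earlier), the act and replay methods of such a DQN agent could look roughly as follows. This is a hedged reconstruction, not the authors' listing: the vectorised target computation and the variable names are assumptions, and equation (1) is applied with the successor state of each stored transition.

import random
import numpy as np
from collections import deque

class DQNAgent:
    def __init__(self, state_space, action_space, model):
        self.state_space, self.action_space = state_space, action_space
        self.model = model                       # e.g. the Keras model from build_model()
        self.memory = deque(maxlen=100000)
        self.gamma, self.epsilon = 0.95, 1.0
        self.epsilon_min, self.epsilon_decay, self.batch_size = 0.01, 0.995, 64

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit the network.
        if random.random() <= self.epsilon:
            return random.randrange(self.action_space)
        q_values = self.model.predict(np.array(state).reshape(1, self.state_space))
        return int(np.argmax(q_values[0]))

    def replay(self):
        if len(self.memory) < self.batch_size:
            return
        batch = random.sample(self.memory, self.batch_size)
        states = np.array([b[0] for b in batch])
        next_states = np.array([b[3] for b in batch])
        targets = self.model.predict(states)
        q_next = self.model.predict(next_states)
        for i, (state, action, reward, next_state, done) in enumerate(batch):
            # Equation (1): R + gamma * max_a Q(next state, a); no bootstrap when done.
            targets[i][action] = reward if done else reward + self.gamma * np.max(q_next[i])
        self.model.fit(states, targets, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay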
In the DoubleDQN class, on the other hand, the parameters that have been used are:

• action_space: an integer; it represents how many actions we have in the game. In our game we have 3 actions in total: the paddle can move left, do nothing, or move right.
• state_space: an integer; it represents the dimension of the state.
• epsilon: a float. When we choose an action, we randomly generate a number to compare with it.
• gamma: a float. This number is one of the parameters of the equation used to update the Q-value.
• batch_size: an integer. This is the maximum size of a training batch.
• epsilon_min: a float. This is the minimum value of epsilon.
• epsilon_decay: a float. We use epsilon *= epsilon_decay to decrease epsilon.
• learning_rate: a float. It is the learning rate used to update the network.
• memory: a deque with a maximum length of 100000. We use it as the batch store.
• model: a keras Model. This is the main network of the Double Deep Q network.
• target_network: a keras Model. This is the target network of the Double Deep Q network. Both the main network and the target network have the same structure.

The functions of this class used in our prototype are:

1) build_model: The output is a Model class imported from keras. The function builds the structure of the main network and the target network. Each has 3 layers. The input of the network is the state, which includes the X value of the paddle, the X value of the ball and the Y value of the ball. The first and second hidden layers both have 64 neurons. The output layer has 3 neurons, which represent the 3 Q-values of the 3 actions in the game, because the agent will choose the action with the largest Q-value.
2) remember: We store the state, action, reward, next state and done flag in the memory, because we train the network on batches drawn from it.
3) act: First, we randomly generate a number between 0 and 1 and check whether it is larger than epsilon. If it is not, we output a random action. If it is larger than epsilon, we pass the input state through the main network of the Double Deep Q network, predict an action and output it.
4) replay: We train the network on a batch and update the weights of the main network only.
5) update_target_neural_network: Updates the weights of the target network, because the replay function only updates the weights of the main network. Double DQN thus uses two different ways to update the main and target networks; that is the difference from the Deep Q network. A sketch of this update is given after this list.
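The Double DQN additions can be sketched as a subclass of the illustrative agent above: the target network is a structural copy of the main network whose weights are synchronised periodically, and the replay target follows equation (2). Method and attribute names are assumptions mirroring the description, not the authors' identifiers.

import random
import numpy as np

class DoubleDQNAgent(DQNAgent):
    def __init__(self, state_space, action_space, model, target_network):
        super().__init__(state_space, action_space, model)
        self.target_network = target_network                     # same structure as the main model

    def update_target_neural_network(self):
        # Copy the main network's weights into the target network (e.g. every 20 epochs).
        self.target_network.set_weights(self.model.get_weights())

    def replay(self):
        if len(self.memory) < self.batch_size:
            return
        batch = random.sample(self.memory, self.batch_size)
        states = np.array([b[0] for b in batch])
        next_states = np.array([b[3] for b in batch])
        targets = self.model.predict(states)
        q_main_next = self.model.predict(next_states)             # used to SELECT the action
        q_target_next = self.target_network.predict(next_states)  # used to EVALUATE it
        for i, (state, action, reward, next_state, done) in enumerate(batch):
            best_action = int(np.argmax(q_main_next[i]))
            # Equation (2): evaluate the main network's chosen action with the target network.
            targets[i][action] = reward if done else reward + self.gamma * q_target_next[i][best_action]
        self.model.fit(states, targets, epochs=1, verbose=0)      # only the main network is trained
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay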
C. Design to train the network:

First, we create the Paddle class. We train the network for 1000 epochs. In each epoch, we use the reset function to initialize the game and take the first state as the first observation. We then use the act function of the DQN class or the DDQN class to predict which action to take, pass this action as input to the next_iteration function of the Paddle class, and make the game move to the next frame. We obtain a new state as the new observation, together with the reward for this action. We then use the remember function of the DQN class or the DDQN class to store the state, action, new state, reward and done flag in the batch memory. When the batch is full, we have enough experience to let the network learn from it, and we use the replay function to train the network. If we use the Double network, the replay function only trains the main network, and we use the update_target_neural_network function of the DDQN class to update the target network every 20 epochs. If the game is over, we move to the next epoch. When all the epochs are finished, we save the weights of the network, as shown in figure 7.

Fig. 7. Flow diagram to train the network
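Put together, the training design of this subsection and the flow of figure 7 correspond roughly to the loop below, written against the illustrative classes sketched earlier in this document (Paddle, build_model, DQNAgent/DoubleDQNAgent). Those names, the per-epoch step limit of 1000 echoing the methodology, and the output file name are assumptions rather than the authors' code.

EPOCHS, TARGET_SYNC_EVERY, MAX_STEPS = 1000, 20, 1000
env = Paddle()
agent = DoubleDQNAgent(state_space=3, action_space=3,
                       model=build_model(state_space=3),
                       target_network=build_model(state_space=3))

rewards_per_epoch = []
for epoch in range(EPOCHS):
    state = env.reset()                       # first observation of the epoch
    total_reward, done, steps = 0, False, 0
    while not done and steps < MAX_STEPS:
        action = agent.act(state)             # epsilon-greedy action from the network
        next_state, reward, done = env.next_iteration(action)
        agent.remember(state, action, reward, next_state, done)
        agent.replay()                        # train once enough experience is stored
        state, total_reward, steps = next_state, total_reward + reward, steps + 1
    if epoch % TARGET_SYNC_EVERY == 0:
        agent.update_target_neural_network()  # refresh the target network every 20 epochs
    rewards_per_epoch.append(total_reward)

agent.model.save_weights("breakout_ddqn.h5")  # save the learned weights at the end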

D. Experiment Setting:

The code runs on Windows 10. The environment is Python 3.6 and Keras 2.2.4. The parameter settings are: epsilon = 1, gamma = 0.95, batch_size = 64, epsilon_min = 0.01, epsilon_decay = 0.995, learning_rate = 0.001.

E. Graphical User Interface (GUI - Breakout):

The graphical user interface of our customized Breakout environment is shown in figure 8.

Fig. 8. GUI – customized environment of Breakout

F. Results

After evaluating the performance, we found that the performance of Double Deep Q-learning was better because it uses two similar neural network models. During experience replay, the second model is a replica of the first model's last state, and it is this second model that calculates the Q-value. In DQN, the Q-value is determined by the reward added to the cumulative Q-value of the next state. If the Q-value estimate for a given state comes out high each time, the value derived for that particular state from the output of the neural network becomes higher and higher, until the difference between the output values is large. DDQN is better because it reduces overestimation by decoupling the action-selection function from the target function.

The conventional Deep Q-learning performance was weaker because it is unable to prevent overfitting problems and because the targets used to train the neural network would be the Q-values of each of the actions. As shown in Figure 9, the score and hit percentage of Double DQN are higher (83.66 %) than those of DQN (78.47 %).

Fig. 9. Performance Comparison (Hit Rate)

In the graphs of Figure 10, the rewards are much higher in the case of Double DQN compared to DQN. The left-hand side shows a highest reward of 10, whereas the right-hand side shows 5 as the maximum. The X-axis in each graph represents episodes, which is 1000 in both cases. In each episode, we used the reset function to initialize the game and took the first state as the first observation, and then used the act function of the DQN class or the DDQN class to predict which action to take.

Table III shows the hyperparameters of both Double DQN and DQN. Hyperparameters are vital because they directly control the behaviour of the training algorithm and have a significant impact on the performance of the model being trained.
Fig. 10. Performance Comparison (Graph)

TABLE III
HYPER-PARAMETER TUNING

Hyper-parameter | Value
epsilon | 1
gamma | 0.95
batch_size | 64
epsilon_min | 0.01
epsilon_decay | 0.995
learning_rate | 0.001

VI. CONCLUSION

In a nutshell, after running the game with both models (Deep Q-learning and Double Deep Q-learning) in the training phase using the hyper-parameters listed in Table III, DDQN performed better in terms of hit rate, as is evident from figure 9. The hit rate is calculated by taking the ratio of the number of times the ball hits the paddle to the sum of the number of times the ball hits the paddle and the number of times the ball misses the paddle, for the given number of episodes, which in our case is 100. In DDQN, the highest reward assigned to a particular action was better than in DQN, varying in (-10, 10) for DDQN versus (-5, 5) for DQN, which indicates that DDQN explores the environment better than DQN because of its target network, as is evident from figure 10.
REFERENCES

[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[2] Gagniuc, Paul A. Markov Chains: From Theory to Implementation and Experimentation. John Wiley & Sons, 2017.
[3] Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58-68, 1995.
[4] Jordan B. Pollack and Alan D. Blair. Why did TD-Gammon work? In Advances in Neural Information Processing Systems 9, pages 10-16, 1996.
[5] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015. doi:10.1038/nature14236.
[6] Silver, David, Aja Huang, Christopher Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484-489, 2016. doi:10.1038/nature16961.
[7] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI'16), pages 2094-2100, 2016.
[8] Patel, Devdhar, Hananel Hazan, Daniel Saunders, Hava Siegelmann, and Robert Kozma. Improved robustness of reinforcement learning policies upon conversion to spiking neuronal network platforms applied to Atari Breakout game. Neural Networks, 120, 2019. doi:10.1016/j.neunet.2019.08.009.
[9] Jeerige, Anoop, Doina Bein, and Abhishek Verma. Comparison of deep reinforcement learning approaches for intelligent game playing. In 2019 IEEE 9th Annual Computing and Communication Workshop and Conference (CCWC), 2019.
[10] Junpeng Wang, Liang Gou, Han-Wei Shen, and Hao Yang. DQNViz: A visual analytics approach to understand Deep Q-Networks. IEEE Transactions on Visualization and Computer Graphics, 25(1):288-298, Jan. 2019.

