Abstract—Traditionally, reinforcement learning (RL) algorithms are called trial-and-error learning methods that use real task experience to develop an incremental management policy. Reinforcement learning theory offers a viewpoint, rooted in psychology, on how agents can maximize their control of an environment. The major difference between reinforcement learning and supervised learning is that only partial feedback about the learned experiences is provided to the learner. An RL agent learns how to map states to optimal actions through trial and error and, with practice over time, develops a strategy for long-term rewards. In this paper, we use an approach that unifies artificial neural networks and a reinforcement learning architecture, allowing the agent to learn the best possible actions in a virtual environment to achieve its objectives; for this we have chosen Breakout, a classic arcade game. We have chosen Breakout because it achieves superhuman play compared with other games such as Enduro, Time Pilot, etc. This paper provides a comparative analysis between the Deep Q Network (DQN) and Double Deep Q Network (DDQN) algorithms based on their hit rate, out of which DDQN proved to be better for the Breakout game. DQN is chosen over basic Q-learning because it learns the policy with its neural network, which is well suited to complex environments, and DDQN is chosen because it solves the overestimation problem (the agent always chooses a non-optimal action for a state just because it has the maximum Q-value) occurring in basic Q-learning.
I. INTRODUCTION

A. Overview

Reinforcement learning (RL) refers to goal-oriented algorithms that learn how to accomplish a specific goal or how to optimise over many steps along a dimension; for example, over many moves they can maximise the points earned in a game. RL algorithms can start from a blank state and achieve superhuman performance under the right conditions. Such algorithms are penalised when they make the wrong decisions, like a pet incentivized through scolding and punishment, and are praised when they make the right ones – this is reinforcement. The performance of RL degrades when the state-action space is too large to be completely known. For this reason, we use deep reinforcement learning to achieve better performance.

Deep reinforcement learning integrates artificial neural networks with an RL architecture that allows software-defined agents to learn the best possible behaviours in a virtual environment to achieve their objectives. Instead of using a lookup table to store, index and update all possible states and their values, which is infeasible for very big problems, we train a neural network on samples from the state-action space to learn how important they are relative to our aim of enhancing learning.
B. Basic Definitions

Reinforcement learning can be understood through the concepts of agent, state, action, environment and reward, which are explained below.

Fig. 1. Reinforcement Learning Process

As shown in figure 1, the agent and the environment are the key components of the reinforcement learning process. The agent is an entity that takes an action (one of a set of moves the agent can make) in an environment (the surroundings the agent is going through, which respond to the agent). A state is the real and immediate condition in which the agent finds itself, that is, a position or a moment on the basis of which its Q-value is updated, and a reward (the input by which we measure the success or failure of an agent's action in each state) is assigned to the agent. The reward can be negative or positive, which in turn impacts the Q-value. The Q-value takes two parameters, a state s and a current action a: Qπ(s, a) refers to the long-term return of taking action a from the current state s under policy π, so Q maps state-action pairs to rewards.

An agent sends feedback in the form of actions to the environment from any given state, and the environment returns the agent's new state (which resulted from acting on the prior state) as well as any rewards. Rewards can be staggered or immediate, and effectively they determine the behaviour of the agent. Another term associated with rewards is the discount factor, which is applied to future rewards as discovered by the agent to dampen the effect of those rewards on the agent's choice of action.
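As a small illustration of how the discount factor dampens future rewards (our own example, not taken from the paper), the discounted return of a short reward sequence with gamma = 0.9 can be computed as:

# Illustration only: discounted return G = r0 + gamma*r1 + gamma^2*r2 + ...
gamma = 0.9                      # discount factor, between 0 and 1
rewards = [3, 0, -3, 8]          # a hypothetical sequence of rewards
discounted_return = sum(gamma ** t * r for t, r in enumerate(rewards))
print(discounted_return)         # 3 + 0.0 - 2.43 + 5.832 ≈ 6.402

Rewards that arrive later are multiplied by higher powers of gamma, so they contribute less to the agent's choice of action.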
In our customised Breakout game environment, the paddle plays the role of the agent, and the environment includes the wall, the bricks and the ball. The actions which can be taken by the agent (paddle) are moving the paddle to the left, moving the paddle to the right, or letting it stay in an idle position. The state of the game consists of whether the game is ongoing / lost / won, the x and y coordinates of the ball, the ball velocity, the x position of the paddle, an array of the coordinates of the remaining bricks, the number of frames since the game started, and the current score of the game.

Our customised Breakout environment consists of a paddle, a ball, a wall and a block of bricks with different colors. When the ball hits the paddle, 3 points are rewarded to the agent, and if it misses the paddle, 3 points are deducted from the agent. When the ball hits a brick, the agent is rewarded according to the color of the brick (blue=8, green=7, olive=6, yellow=5, orange=4, red=3). The Q-value of each (state, action) pair is fed to the network, which helps learning. After the Q-values for each distinct (state, action) pair have been obtained and stored in the network, the action with the best Q-value is chosen for the progression of the game.
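The state representation and reward scheme described above can be summarised in a short Python sketch; the names and structure below are illustrative only, not the project's actual code:

# Illustrative sketch of the state and reward scheme described above
# (names are ours; the project's actual code may differ).
BRICK_REWARDS = {"blue": 8, "green": 7, "olive": 6,
                 "yellow": 5, "orange": 4, "red": 3}
PADDLE_HIT_REWARD = 3      # ball hits the paddle
PADDLE_MISS_PENALTY = -3   # ball misses the paddle

def make_state(status, ball_x, ball_y, ball_vx, ball_vy,
               paddle_x, bricks, frame, score):
    """Pack the game information listed above into one state tuple."""
    return (status,            # "ongoing", "lost" or "won"
            ball_x, ball_y,    # ball coordinates
            ball_vx, ball_vy,  # ball velocity
            paddle_x,          # x position of the paddle
            tuple(bricks),     # coordinates of the remaining bricks
            frame,             # frames since the game started
            score)             # current score of the game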
One of the reasons to choose Breakout is that, for a start, it is better to choose a game that can be altered by the user and for which different parts of the algorithm, such as the states and the reward system, can be modified and improved. In this paper, we give a comparative analysis of two different algorithms on a classic Atari arcade game, Breakout. We have compared and evaluated the performance of two models, namely the conventional Deep Q-Network and the Double Deep Q-Network.
C. Asserting Thesis

The "memory" is a key component of DQNs: the trials are used to train the model continuously, as stated earlier. Instead of training on the trials as they come in, however, we add them to memory and train on a random sample of that memory. The gamma factor reflects the depreciated value of the expected future returns on the state; the value defined for gamma lies between 0 and 1. We have followed epsilon-greedy search, in which the value of epsilon is 1 in the initial stages, so the behavior of the paddle is completely random, and it decreases over the iterations with a decay factor of 0.95. For each distinct (state, action) pair, its Q-value (which is calculated by the DQN and DDQN algorithms) is updated, and that Q-value is used to take the best possible action of the paddle for the next state (velocity, ball coordinates).
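A minimal sketch of the replay memory and epsilon-greedy selection described above is shown next; all names are illustrative, and q_values_fn stands for whatever function returns one Q-value per action:

# Minimal sketch of replay memory and epsilon-greedy action selection
# as described above (names are ours, not the project's code).
import random
from collections import deque

memory = deque(maxlen=2000)          # replay memory of past trials
epsilon, epsilon_decay = 1.0, 0.95   # fully random at first, decays each iteration
ACTIONS = ["left", "right", "idle"]  # the three paddle actions

def remember(state, action, reward, next_state, done):
    memory.append((state, action, reward, next_state, done))

def sample_batch(batch_size=32):
    """Train on a random sample of memory rather than on the latest trial."""
    return random.sample(memory, min(batch_size, len(memory)))

def choose_action(state, q_values_fn):
    """Explore with probability epsilon, otherwise pick the best Q-value."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    q_values = q_values_fn(state)    # list with one Q-value per action
    return ACTIONS[q_values.index(max(q_values))]

def decay_epsilon():
    """Called after each iteration so exploration gradually decreases."""
    global epsilon
    epsilon = max(0.01, epsilon * epsilon_decay)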
We chose to implement deep reinforcement learning on Atari games because the environment of Atari games is quite uncertain with respect to its states and actions, which makes it relatable to real-life situations.

II. PROBLEM DEFINITION

The purpose of our project is to create a virtual environment for a game named "Breakout" that is a close replication of a real-life environment, to learn the environment over time as happens in real life, and to act accordingly. To duplicate a real-life environment and to act according to its changes, the environment has to be learnt properly, which we do using the reinforcement learning algorithms mentioned above, showing through comparisons which is better for such situations.

First, reinforcement learning is the problem that we have studied. Reinforcement learning is a branch of machine learning which is used to describe and solve the problem of an agent's interaction with the environment through learning strategies to achieve maximum reward [1]. A classic and standard model of reinforcement learning is the Markov Decision Process (MDP), which is simply a process where an agent takes actions to update its state, obtain rewards and interact with the environment [2].

Second is the game itself. We have imitated Breakout as our game. The building blocks of this game are a moving ball, a paddle and 6 rows of bricks. The agent, i.e. the paddle, can move left and right to hit the ball. After hitting the paddle, the ball rebounds and then destroys bricks to earn reward. The game needs to be able to receive incoming actions and switch to the next frame. If the ball is missed, the game ends and then resets the position of the ball and the paddle. The bricks disappear after being touched by the ball. At the beginning of the game, the ball and paddle are placed in their initial positions and all 6 rows of bricks are loaded. The game reports the total number of bricks destroyed, the number of times the ball was missed, and the ball's hit rate.

Third is the problem of creating a network structure for both DQN and DDQN in our simulated Breakout environment and evaluating them in terms of reward acquisition, loss at each epoch and hit rate, to conclude which algorithm performs better.

III. RELATED RESEARCH WORK

Reinforcement learning can be traced back to the implementation of TD-Gammon: in 1992, IBM researcher Gerald Tesauro developed an algorithm that combines temporal-difference learning and neural networks, named it TD-Gammon, and specialized it in playing backgammon. TD-Gammon uses a three-layer neural network. The backgammon position is represented by 198 units as the input, and there are 40-80 neurons in the middle hidden layer. The final output is an estimate of the value function [3]. TD-Gammon used a model-free reinforcement learning algorithm like Q-learning and approximated the value function using a multi-layer perceptron with one hidden layer. However, applications of TD-Gammon to other board games were less successful, which led to a widespread belief that the TD-Gammon method only worked for backgammon, perhaps because the randomness in the dice rolls helps explore the state space and also makes the value function particularly smooth [4].

Q-learning was proposed by Watkins in 1989 and has become the popular option for reinforcement learning-based agents; however, it is unsuitable for complicated problems with a high-dimensional state space.
The combination of deep learning and reinforcement learning methods, mainly involving Q-learning, was brought forward in a sequence of papers [5]. From what we know, previous reinforcement learning methods had trouble selecting features, while the deep reinforcement learning approach was found to handle complex tasks successfully, as it can learn from data at different levels of features. Mnih successfully trained a deep RL agent from visual inputs consisting of thousands of pixels. This approach enabled it to reach beyond-human capabilities in playing Atari games, AlphaGo and so on [6]. The Deep Q Network agent synthesized by Mnih achieves human-like performance when playing Atari games by using artificial neural networks to process sensory data. In subsequent work, Van Hasselt [7] improved the algorithm by implementing double deep Q-learning, which helps generate more accurate estimates by eliminating overestimation.

In the paper "Human-level control through deep reinforcement learning" [5], researchers have shown that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to exceed the performance of all previous algorithms and reach a level comparable to that of a skilled human games tester, using the same algorithm, network architecture and hyperparameters across a range of 49 games [5].

Paper [8] shows comparisons between a ReLU neural network and a spiking neural network (SNN) in terms of rewards achieved for a given number of epochs in Breakout, using an epsilon-greedy approach and a conventional greedy approach. Furthermore, as an additional benefit, SNNs can supplement the working of DQN when data is noisy and incomplete [8].

Paper [9] compares performance in terms of training time, stability and highest score achieved using the DQN and Asynchronous Advantage Actor-Critic (A3C) algorithms, again in the Breakout game. Rewards achieved using the A3C algorithm are higher than with DQN, as A3C uses a multi-core CPU to work efficiently, whereas DQN needs a powerful GPU to train faster and runs slowly on a CPU [9]. Paper [10] shows a visual DQN approach which helps to control the random actions of DQN and helps domain experts to understand, diagnose, and improve DQN models with four levels of detail: overall training level, epoch level, episode level, and segment level.

A basic overview of the performance results of the different algorithms, along with their comparison criteria in the Breakout game, is given in Table I.

TABLE I
OVERVIEW OF RELATED RESEARCH WORK IN BREAKOUT

SNo | Criteria for Comparison                                       | Algorithms Compared | Results
1   | Rewards for a given number of epochs                          | SNN vs. DQN         | SNN is better, especially for noisy or incomplete data.
2   | High score, stability and rewards                             | DQN vs. A3C         | A3C outperformed DQN, with a high score of 79 and rewards increasing gradually.
3   | Epoch level, training level, episode level and segment level  | Visual DQN vs. DQN  | Visual DQN is able to control the random actions taken by the DQN algorithm.

We have shown performance comparisons for the DQN and DDQN algorithms for the training phase in terms of hit rate, which is calculated by taking the ratio of the number of times the ball hits the paddle to the sum of the number of times the ball hits the paddle and the number of times the ball misses the paddle, for the given number of episodes, which in our case is 100.
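As a quick reference, the hit-rate metric used throughout the paper can be written as a small helper (a sketch with our own naming, not code from the project):

def hit_rate(hits, misses):
    """Ratio of paddle hits to all paddle encounters (hits + misses)."""
    return hits / (hits + misses)

# Example: 60 hits and 40 misses over 100 episodes give a hit rate of 0.6.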
IV. METHODOLOGY

In order to compare the performance of the AI agent in playing the Breakout game under different algorithms, we decided to create our own environment to train the agent, which imitates the environment of the OpenAI Atari game Breakout-ram-v0. After setting up the environment, we built the networks for the DQN and Double DQN algorithms respectively. The specific steps are as follows:

1) Setting up the Breakout environment, including the background, the paddle and the ball, defining the bricks, controlling paddle movement, handling collisions, and updating the state and environment, using the turtle library, a graphics library in Python that can be used to create various objects and shapes and animate them (for example with the penup() function), adjusting the speed during object creation.

2) For defining the bricks, controlling paddle movement, handling collisions, and updating the state and environment, we have defined separate functions, namely reset(), which resets the environment if the paddle misses the ball; next_iteration(), which computes the parameters for the next state; move_positive_x(), which moves the paddle to the right; and move_negative_x(), which moves the paddle to the left (a skeleton of these helpers is sketched after this list).

3) Creating the DQN and Double DQN algorithms using the Breakout environment and performing hyper-parameter tuning (e.g. the discount factor gamma, the learning rate and epsilon), using libraries such as random, numpy, keras, collections and matplotlib.

4) Training the agent in the environment within 1000 steps, calculating the reward and loss for every 100 episodes, then saving the reward and loss data and plotting them.
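As referenced in step 2 above, a skeleton of the environment-side helpers might look like the following. This is only a sketch: the function names come from the paper, while the turtle setup, the movement step sizes and the return values are our assumptions.

# Hedged skeleton of the environment helpers named in step 2.
# Only the function names come from the paper; the bodies are illustrative.
import turtle

win = turtle.Screen()
paddle = turtle.Turtle()
paddle.penup()                      # move without drawing, as in step 1
paddle.shape("square")
paddle.shapesize(stretch_wid=1, stretch_len=5)
paddle.goto(0, -250)

ball = turtle.Turtle()
ball.penup()
ball.goto(0, 0)

def reset():
    """Reset the environment when the paddle misses the ball."""
    ball.goto(0, 0)
    paddle.goto(0, -250)

def move_positive_x():
    """Move the paddle to the right."""
    paddle.setx(paddle.xcor() + 20)

def move_negative_x():
    """Move the paddle to the left."""
    paddle.setx(paddle.xcor() - 20)

def next_iteration():
    """Advance the ball by its velocity and return the next state (sketch)."""
    ball.setx(ball.xcor() + 5)
    ball.sety(ball.ycor() + 5)
    return ball.xcor(), ball.ycor(), paddle.xcor()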
A. Deep Q Network

In traditional reinforcement learning such as Q-learning, we use a table to store the Q-values. But this has a limitation: today's problems are too complicated to store the Q-values of every state and action in a table. No matter how much memory the computer has, it becomes time-consuming to search for the corresponding state in such a large table. When reinforcement learning is combined with deep learning, a neural network can solve this problem, because we can simply input the state value, output the Q-values of all actions, and then directly select the action with the maximum value as the next action according to the principle of Q-learning.

Table II describes the structure of the Deep Q Network we use in our project.

TABLE II
MODEL STRUCTURE

The Q-value is updated as

Qnew = R + γ max_a Q(S, a)    (1)

where R means the current reward, γ means the parameter gamma, max_a Q(S, a) means the maximal Q-value over actions a in the state S, and Qnew means the updated Q-value. Then we can use the state S as input and the updated Q-value as output to train the network.
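A minimal Keras sketch of a network of this kind, together with the update in equation (1), is shown below. The layer sizes of Table II are not reproduced in the text, so the 24-unit hidden layers, the state size of 8 and the three actions used here are assumptions, not the project's actual configuration.

# Sketch of a DQN built with Keras, in the spirit of Table II.
# Layer sizes, state size and action count are assumptions.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

STATE_SIZE, NUM_ACTIONS, GAMMA = 8, 3, 0.95

model = Sequential([
    Dense(24, activation="relu", input_shape=(STATE_SIZE,)),
    Dense(24, activation="relu"),
    Dense(NUM_ACTIONS, activation="linear"),   # one Q-value per action
])
model.compile(optimizer="adam", loss="mse")

def train_step(state, action, reward, next_state, done):
    """One update towards Qnew = R + gamma * max_a Q(S, a), as in equation (1).
    `state` and `next_state` are 1-D numpy arrays of length STATE_SIZE."""
    target = model.predict(state[np.newaxis], verbose=0)[0]
    if done:
        target[action] = reward
    else:
        target[action] = reward + GAMMA * np.max(
            model.predict(next_state[np.newaxis], verbose=0)[0])
    model.fit(state[np.newaxis], target[np.newaxis], verbose=0)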
B. Double DQN

The network structure of Double DQN is the same as that of the Deep Q Network, but Double DQN requires two networks: one is the main network, whose parameters are updated, and the other is the target network, which keeps older parameters.

In Double DQN, we use the main network to get the Q-value with the state as input. Equation (1) for updating the Q-value changes to equation (2):

Qnew = R + γ Q(S, argmax_a Q(S, a; θmain); θtarget)    (2)

where R means the current reward, γ means the parameter gamma, Qnew means the updated Q-value, S means the input state, θmain means using the main network, and θtarget means using the target network. In equation (2), we use the state S as input and the main network to select the action a which has the largest Q-value. Then we use the target network to get the Q-value of the selected action a. We can then update the Q-value and use the updated Q-value and the input state S to train the network.
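The target in equation (2) can be sketched as follows; main_model and target_model are assumed to be two Keras networks with the same structure as the one sketched above (this is our illustration, not the project's code):

# Sketch of the Double DQN target in equation (2): the main network selects
# the action, the target network evaluates it.
import numpy as np

def double_dqn_target(reward, next_state, done,
                      main_model, target_model, gamma=0.95):
    if done:
        return reward
    q_main = main_model.predict(next_state[np.newaxis], verbose=0)[0]
    best_action = int(np.argmax(q_main))           # argmax_a Q(S, a; theta_main)
    q_target = target_model.predict(next_state[np.newaxis], verbose=0)[0]
    return reward + gamma * q_target[best_action]  # Q(S, best_action; theta_target)

# The target network is refreshed periodically from the main network, e.g.:
# target_model.set_weights(main_model.get_weights())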
V. PROTOTYPING

A. Detailed Design

In our project, we have used three classes to implement the code (Paddle, DQN and DDQN), as shown in figure 2.

B. Building the Major Classes

As shown in figure 3, with import turtle we add the turtle library to our Python environment. The turtle module provides turtle graphics primitives, in both object-oriented and procedure-oriented ways.

Fig. 3. Importing turtle

As shown in figure 4, we add the bricks using a for loop. We have assigned a variety of colors to the bricks. Using x_cor and y_cor, we have assigned the coordinates of the bricks, and each brick is added 110 pixels apart on the x-axis.

Fig. 4. Building Bricks
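Since the code behind Fig. 4 appears only as an image, the loop below merely mirrors the description in the text (assorted brick colors, x_cor/y_cor coordinates, 110-pixel spacing on the x-axis); the row layout and sizes are our assumptions:

# Sketch mirroring the description of Fig. 4: bricks built in a for loop,
# with assorted colors and 110-pixel spacing on the x-axis.
import turtle

COLORS = ["blue", "green", "olive", "yellow", "orange", "red"]
bricks = []

for row, color in enumerate(COLORS):            # one row per color
    y_cor = 250 - row * 30
    for col in range(6):
        x_cor = -300 + col * 110                # each brick 110 pixels apart on x
        brick = turtle.Turtle()
        brick.penup()
        brick.shape("square")
        brick.shapesize(stretch_wid=1, stretch_len=5)
        brick.color(color)
        brick.goto(x_cor, y_cor)
        bricks.append(brick)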
VI. CONCLUSION

In a nutshell, after running the game for both models (Deep Q-learning and Double Deep Q-learning) during the training phase with the hyper-parameters mentioned in Table III, DDQN performed better in terms of hit rate, as is evident from figure 9. The hit rate is calculated by taking the ratio of the number of times the ball hits the paddle to the sum of the number of times the ball hits the paddle and the number of times the ball misses the paddle, for the given number of episodes, which in our case is 100. In DDQN, the highest reward assigned to a particular action was better than in DQN, ranging over (-10, 10) for DDQN versus (-5, 5) for DQN, which indicates that DDQN explores the environment better than DQN because of its target network, as is evident from figure 10.
REFERENCES

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[2] P. A. Gagniuc, Markov Chains: From Theory to Implementation and Experimentation. John Wiley & Sons, 2017.
[3] G. Tesauro, "Temporal difference learning and TD-Gammon," Communications of the ACM, 38(3):58–68, 1995.
[4] J. B. Pollack and A. D. Blair, "Why did TD-Gammon work?" in Advances in Neural Information Processing Systems 9, pages 10–16, 1996.
[5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, 518(7540):529–533, Feb. 2015. doi: 10.1038/nature14236.
[6] D. Silver, A. Huang, C. Maddison, A. Guez, L. Sifre, G. Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," 2016.