
Preprint · November 2020 · DOI: 10.13140/RG.2.2.14522.62400



Exploring Game Playing AI using Reinforcement Learning Techniques
Prabesh Paudel
[email protected]​, MSCS Department, St. Olaf College

Abstract-- Reinforcement learning has proven to be a state-of-the-art method for many machine learning and artificial intelligence tasks, from resource management, traffic control, and self-driving cars to robotics and even game playing. This method allows an agent to estimate the expected utility of its state in order to take optimal actions in an unknown environment. This paper explores different reinforcement learning algorithms and their efficiency in playing popular games. We use Q-Learning to train agents to play trivial games like Flappy Bird, and we dive into Deep Reinforcement Learning to train agents to play more complicated games where the tasks are non-trivial.

Keywords-- Reinforcement Learning, Q-Learning, Deep Q-Learning

1. INTRODUCTION

Reinforcement Learning (RL) is an area of machine learning that operates on the idea of a reward or punishment for every action an agent takes, with the end goal of maximizing the agent's cumulative reward. We discuss Q-Learning and Deep Q-Networks in this paper; however, these are not the only RL algorithms that exist. Markov Decision Processes, SARSA, and DDPG are some of the RL methods that are popular today.

Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]

Fig 1: Bellman Equation [Richard E. Bellman, 1951] [Source]

The equation above is used to calculate a new Q value each time the agent loops through a state. Q-Learning is an algorithm that improves the agent based on the reward the agent gets for performing an action. A major part of Q-Learning is the Q table, which consists of Q values for each set of actions or movements an agent can take. "Learning" in this case means updating the Q table so that we have the optimal Q value for every action in each state. Once learning is done, the agent can refer back to the Q table to figure out the best possible action for any given state. In this paper, we use Q-Learning to train an agent to play Mountain Car and Flappy Bird. To put this into context, we previously discussed scenarios where we created agents that could play games like Hunt the Wumpus. In those cases, however, the state space and the action space were finite, and we could hardcode search algorithms like A* or BFS/DFS to find the optimal path to victory. In many cases we do not have perfect knowledge of the game, and that is where search algorithms fail and reinforcement learning comes in.

One of the major issues with Q-Learning is memory and time. The amount of memory required to save Q values for every possible action in a fairly big game is huge, and the time required to explore each state to build the required Q table is simply unrealistic. This is where Deep Q-Networks come in. Deep Q-Learning is simply a way of avoiding the Q table altogether and instead using a neural network to approximate the Q value for a given state. In this paper, we use Deep Q-Learning to train our agents to play more complicated games (Mnih et al., 2015).

1.1. Significance

Why should we care about Reinforcement Learning? How is it different from a run-of-the-mill machine learning algorithm? Let's take a step back and think about supervised learning and unsupervised learning. In those settings you have a pre-existing set of inputs and finite outputs corresponding to those inputs, or a cluster that you want each input to fall into, so you simply train on the data you collect with whichever supervised or unsupervised algorithm you see fit.

However, reinforcement learning works in a completely different way. It can deal with large and complex problem spaces, meaning it can deal with every possible combination of circumstances. Reinforcement learning learns by receiving rewards or punishments for every action it takes, and it is able to adapt to unforeseen circumstances accordingly. This applies not only to tasks like game playing, where the set of actions and outcomes is virtually infinite, but also to tasks like Natural Language Processing and Supply Chain Management, which can also take advantage of RL.
2. RELATED WORK

Q-Learning was first theorized by Chris Watkins in 1989, and Watkins and Dayan (1992) provide the first convergence proof of the Q-Learning algorithm. Since then a lot of innovation has happened around Q-Learning, one example being the use of the Bellman equation, a mathematical optimization method, to update the Q values in the Q table.

Q-Learning has been applied to a variety of tasks since this innovation, but it wasn't until DeepMind published the paper Playing Atari with Deep Reinforcement Learning (Mnih et al.) that Deep Q-Learning became popular. It was one of the first papers to approach game playing using a Deep Q-Network. The Atari 2600 gaming system was quite popular in the late 1970s and early 1980s; its games ran with only 4 kilobytes of RAM on a 210x160 pixel display with 128 colors. Despite many attempts at creating agents for these games using other machine learning techniques, none were quite successful. Mnih et al. provide a proof of concept that training agents to play highly complicated games is possible using Deep Q-Learning. The 2019 paper Grandmaster level in StarCraft II using multi-agent reinforcement learning is the state-of-the-art work on using multi-agent reinforcement learning to not only play StarCraft II but reach the Grandmaster level. We do not expect our model to reach that level of complexity.

However cool it may seem, reinforcement learning is not yet able to perform as well in a physical environment as it can in virtual simulated environments. It is not feasible to simply take the best action in the real world, where all actions come with repercussions and the environment is very dynamic, as opposed to a simulated virtual environment like a videogame. So, even though a lot of papers have discussed the applicability of reinforcement learning in physical environments, very little has actually been accomplished.

3. METHOD

In this section, we discuss how we create an environment and how different techniques were used to train our agents.

3.1. Q-Learning

Q-Learning is a model-free form of machine learning, in the sense that the AI "agent" does not need to know or have a model of the environment it will be in; the same algorithm can be used across a variety of environments. First, we explore Q-Learning to create an agent that can play Mountain Car and Flappy Bird. Q-Learning is also the foundation for Deep Q-Learning.

3.1.1. State Representation

Mountain Car: Mountain Car is a fairly simple game provided by OpenAI Gym. A car is on a one-dimensional track, positioned between two "mountains". The goal is to drive up the mountain on the right; however, the car's engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum [OpenAI]. The state of the car is represented as a tuple of position and velocity, and three actions can be taken at any given time; that is our "action space."

Flappy Bird: Flappy Bird is a game where you have a bird and your goal is to keep it alive as it goes through the pipes. The action is moving the bird up and down to get it through the pipes: you tap the screen to move the bird upwards and press nothing to let it fall. Flappy Bird seems more complicated, but it is very similar in complexity to Mountain Car. We have a discrete state space which consists of three variables: the horizontal distance to the pipes, the vertical distance to the lower pipe, and a boolean value that indicates whether the bird is dead or alive.

3.1.2. Q-Learning Algorithm

The main idea of the Q-Learning algorithm is creating a Q-Table, which is essentially a policy table that holds Q values for all possible combinations of states and actions. The following is the Q-Learning algorithm (Watkins & Dayan, 1992):

Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using a policy derived from Q
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
        s ← s'
    until s is terminal
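As an illustration, the following is a minimal Python sketch of the tabular update loop above, using a dictionary as the Q-table and an ε-greedy policy. The environment object is assumed to follow the classic OpenAI Gym reset/step interface, states are assumed to be hashable (i.e., already discretized), and the hyperparameter values are placeholders rather than the settings used in our experiments.

import random
from collections import defaultdict

def q_learning(env, episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-Learning over a discrete (hashable) state space."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: [0.0] * n_actions)      # Q[s][a], initialized arbitrarily (zeros)

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:                             # repeat for each step of the episode
            # Choose a from s using an epsilon-greedy policy derived from Q
            if random.random() < epsilon:
                a = env.action_space.sample()
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])

            s_next, r, done, _ = env.step(a)        # take action a, observe r and s'

            # Bellman update: Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next                              # s <- s'
    return Q

Once training finishes, acting greedily with respect to the returned table gives the learned policy.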
There is a Q-value for each possible action in each state, and we create a table for these values. To fill in this table, we either query the environment or engage with it over and over again until we figure it out. For Mountain Car, we get a continuous x coordinate, the velocity, and a reward as the return value. Since making a table entry for every single continuous value is impossible, we split the values into discrete chunks and get a finite Q table. We use the algorithm above to update the Q-table, and we query it whenever we have to run the mountain car independently. The code for the mountain car can be found here.
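For concreteness, the discretization step might look roughly like the sketch below; the choice of 20 buckets per dimension is an arbitrary illustrative value, not necessarily what the linked Mountain Car code uses.

import gym
import numpy as np

env = gym.make("MountainCar-v0")
n_buckets = 20                                       # resolution per dimension (illustrative)

low, high = env.observation_space.low, env.observation_space.high
bucket_size = (high - low) / n_buckets

def discretize(obs):
    """Map a continuous (position, velocity) observation to integer bucket indices."""
    idx = (obs - low) / bucket_size
    return tuple(idx.astype(int).clip(0, n_buckets - 1))

# A finite Q-table: one row of Q-values (one per action) for each discrete state.
q_table = np.zeros((n_buckets, n_buckets, env.action_space.n))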
We take a similar approach to Flappy Bird. We initialize a Q table and see which action maximizes the reward given the state. We reward the bird with 0 points if it survives and give it -1000 points if it dies, and we update our Q table accordingly, as described before. The code for the Flappy Bird game was found on the internet; it is a replica of the original game made using PyGame. The code for the Q-learning bot can be found here.
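In code, the state tuple and reward scheme described above could be expressed along the following lines; the helper names and the coarse rounding grid are hypothetical stand-ins for whatever the PyGame replica actually exposes.

def flappy_state(horizontal_distance, vertical_distance, bird_dead, grid=10):
    """Discrete Flappy Bird state: distances to the next pipe, rounded to a coarse grid."""
    return (int(horizontal_distance) // grid,
            int(vertical_distance) // grid,
            bird_dead)

def flappy_reward(bird_dead):
    """Reward scheme from the text: 0 for surviving a step, -1000 for dying."""
    return -1000 if bird_dead else 0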
3.2. Deep Q-Learning

Building on top of the foundation we created with Q-Learning, we now move on to Deep Q-Learning.

*Note: Deep Q-Learning is absolutely not necessary to achieve the goals that we are about to explore. We just use Deep Q-Learning to understand how Deep Reinforcement Learning works!

In Q-Learning, we build a Q table for all possible sets of states and actions, but we can sometimes encounter states which are similar, yet not exactly the same as one of the combinations in the Q-Table. This is where Deep Q-Learning comes in. Flappy Bird has a relatively small state space, so it is easy to discretize the state space and create a Q-table for it. But once we move to more complex games, the state space and action space are far bigger than those of Flappy Bird, and hence the size of the Q-table grows exponentially. Storing such a Q table would be next to impossible (Chen, 2015).

Initialize replay memory
Initialize DQN to random weights
Repeat
    New episode (new game)
    Initialize state s_0
    Repeat
        Extract x_t from raw pixel data; update state s_t with x_t
        Add experience e_t = (φ(s_{t-1}), a_{t-1}, r_{t-1}, φ(s_t)) to replay memory
        Take best action a_t = arg max_{a ∈ actions} Q(s_t, a), with exploration if training
        Uniformly sample a batch of experiences from the replay memory
        Backpropagate and update the DQN with the minibatch
        Update exploration probability ε
        if C updates to the DQN since the last update to the target network then
            Update the target Q-network: Q̂(s, a) ← Q(s, a)
        end
        Update state s_t with a_t
        Update current reward r_t and total reward totalReward
        Update game parameters (bird position, etc.)
        Refresh screen
    Until Flappy Bird crashes
    Restart Flappy Bird
Until convergence or number of iterations reached

Deep Q-Learning algorithm for Flappy Bird, from Deep Reinforcement Learning for Flappy Bird (2015) by Kevin Chen [Source]

So, we use a neural network to predict a Q-value for each possible action given a state. The Deep Q-Network model is a regression model, which outputs a value for each of our possible actions. These values are continuous float values, and they are directly our Q-values. It is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards.

With a neural network in place of a Q table, the complexity of our environment can go up significantly without necessarily requiring more memory. The algorithm above is the Deep Q-Learning algorithm used in the Flappy Bird program that can be found here.
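To make the replay-memory step of the algorithm concrete, here is a minimal Python sketch of a uniform replay buffer supporting the sampling used above; the capacity and batch size are illustrative placeholders rather than values taken from the linked Flappy Bird program.

import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of (state, action, reward, next_state, done) experiences."""
    def __init__(self, capacity=50000):
        self.buffer = deque(maxlen=capacity)         # oldest experiences are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniformly sample a minibatch of experiences, as in the pseudocode above
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)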

Our Deep Q-Network is trained on raw pixel values observed from the game screen at each time step. We feed the raw images to a convolutional neural network with max-pooling layers, and the output layer has the same dimension as the action space, which in this case is two: one output corresponds to moving the bird upwards and the other corresponds to doing nothing. At each time step, the network performs whichever action corresponds to the highest Q-value, following a greedy policy. The network is trained using the algorithm shown above.
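As a rough illustration of such an architecture, the following PyTorch-style sketch builds a small convolutional network with max-pooling layers and a two-unit output head. The input resolution (80x80), frame-stack depth, and layer sizes are assumptions made for this example, not the exact configuration of the program linked above.

import torch.nn as nn

class FlappyDQN(nn.Module):
    """CNN mapping a stack of preprocessed game frames to one Q-value per action."""
    def __init__(self, n_actions=2, in_frames=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # With an 80x80 input: conv(8,4) -> 19x19, pool -> 9x9, conv(3,1) -> 7x7, pool -> 3x3
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 3 * 3, 256), nn.ReLU(),
            nn.Linear(256, n_actions),               # Q-values: [flap, do nothing]
        )

    def forward(self, x):
        return self.head(self.features(x))

At play time, the greedy action is simply the index of the larger of the two outputs.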
4. RESULTS

The results of the experiments were interesting. Starting off with Mountain Car, we assumed it would not take many iterations for the Q-Learning algorithm to figure out how to reach the top of the mountain, but to our surprise it took the algorithm over 2000 episodes to reach an optimal Q-table. That was not the case for Flappy Bird: the algorithm was able to figure out the game in under 250 episodes and reached a superhuman level. The bird basically never died and was able to reach the max score every single time. At this point we figured that Deep Q-Learning for this task would be absolute overkill, but we wanted to explore it anyway.

As expected, the Deep Q-Network was able to reach the maximum score on every single trial as well. However, it took a while to train the network because the images had to be preprocessed, and it took a while for the neural network to converge to an optimal model.

5. CONCLUSION

Reinforcement Learning has a lot of potential. Game playing is one small thing that RL is good at, and it has been explored extensively in recent years. Recently, a new toolkit called NLPGym has been released which gives the user an environment where they can test Reinforcement Learning on Natural Language Processing tasks. I would love to explore this myself, and I am actually doing an Independent Research with Matthew Richey on this next semester, which I really look forward to.

Future work building on this project might include accomplishing tasks that have bigger action and state spaces, and possibly replicating some of the more complicated papers like Grandmaster level in StarCraft II using multi-agent reinforcement learning by Vinyals et al. The future of reinforcement learning is very bright, and I can see these techniques being applied in a lot of fields.

6. REFERENCES

[1] Kaundinya, V., Jain, S., Saligram, S., Vanamala, C. K., & B, A. (2018). Game Playing Agent for 2048 using Deep Reinforcement Learning. Nciccnda. doi:10.21467/proceedings.1.57

[2] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2015). Playing Atari with Deep Reinforcement Learning. DeepMind Research. https://arxiv.org/pdf/1312.5602.pdf

[3] Vinyals, O., & Babuschkin, I. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature. https://doi.org/10.1038/s41586-019-1724-z

[4] Watkins, C. J. C. H., & Dayan, P. (1992). Q-Learning. Machine Learning, 8, 279-292. Kluwer Academic Publishers.
