
CS 461 - Project Intermediate Report

Department of Computer Engineering, Bilkent University, 06800 Ankara, Turkey


Emre Can Şen - 21902516, Ata Özlük - 21702335, Uygar Onat Erol - 21901908, Hande Eryılmaz - 21902678, Mete Ertan - 21903215
Abstract— The objective of this project is to perform experiments applying various Deep Reinforcement Learning methods to the "Snake" game. Unlike the traditional game, our version incorporates a poisonous-apple system. In this environment, our research compares the performance of Deep Q-Networks, Double Deep Q-Networks, and Dueling Deep Q-Networks over 120,000 epochs. The outcomes of these methods are presented and analyzed.

I. INTRODUCTION

As described in the proposal, we used an existing snake game environment to implement a Convolutional Neural Network, a Deep Q-Network, a Dueling Deep Q-Network, and a Double Deep Q-Network. As also mentioned in the proposal, we changed the environment for our project by adding poisonous food.

Other than implementing the Deep Q-Network method, we did not make any further changes to the proposal. The game is still played on a 20x20 grid, the snake starts with a length of 3 segments, normal food gives a positive reward, and poisonous food gives a negative reward. Our agent tries to reach the highest score possible by learning with Deep Q-Network, Dueling Deep Q-Network, and Double Deep Q-Network.

The Convolutional Neural Network, Deep Q-Network, Double Deep Q-Network, and Dueling Deep Q-Network algorithms were implemented. The detailed implementation steps of each of these methods are given in the methods section with their respective pseudo-codes. The results obtained with Deep Q-Network and Double Deep Q-Network were plotted and analyzed. The final results for Dueling Deep Q-Network have not yet been obtained, but they will be completed by the final presentation.

Done:

● Convolutional Neural Network Implementation & Analysis

● Deep-Q Network Implementation & Analysis

● Double Deep-Q Network Implementation & Analysis

To be Finalized:

● Dueling Deep-Q Network Training Implementation & Analysis
any data. For example [[a, b],[c, d]] 2D
convolutional output becomes [a, b, c, d] a single
II. CHANGES IN GAME ENVIRONMENT & GAME dimensioned vector.
PARAMETERS ● After turning the output of the convolutional
network to 1D vector we then can use the fully
A. Reward System connected layer. The first fully connected layer with
As mentioned above we used an existing game 256 units is added, which processes the flattened
environment but made some changes there to suit our features of the flattened layer. Fully connected
interest. The environment we used implemented the snake layers here allow the network and therefore the
game with the help of the Pygame library. We added rewards agent to learn the global patterns and relationships
to getting the poisonous food, colliding with itself or the across the entire input. Each connection between
wall, getting the normal food. We might change the rewards nodes has a weight and these weights are adjusted
to get better results in the future. We wanted the snake to with training.
gain the most food reward possible as fast as possible to ● Lastly the output layer is added, which gets the
reach the highest reward possible while avoiding colliding values from the fully connected layer and outputs
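To make the reward and spawning rules above concrete, a minimal sketch is shown below. The reward magnitudes, constant names, and helper functions are illustrative assumptions on our part; the report only fixes the signs of the rewards and the non-overlap rule, not the exact values or code.

import random

GRID_SIZE = 20  # the game is played on a 20x20 grid

# Hypothetical reward table: only the signs are stated in the report
REWARDS = {
    "normal_food": +1.0,     # eating normal food
    "poisonous_food": -1.0,  # eating poisonous food
    "collision": -1.0,       # hitting a wall or the snake's own body
}

def random_free_cell(snake_body, occupied):
    # Sample a cell that is neither on the snake nor already occupied by other food
    while True:
        cell = (random.randrange(GRID_SIZE), random.randrange(GRID_SIZE))
        if cell not in snake_body and cell not in occupied:
            return cell

def spawn_poisonous_food(snake_body, normal_food_pos):
    # Poisonous food never spawns on the snake or on the normal food
    return random_free_cell(snake_body, occupied={normal_food_pos})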
III. CONVOLUTIONAL NEURAL NETWORK

A. Convolutional Neural Network Layers

We defined our convolutional neural network as a class. The class takes an input shape, which is row x column x state of the game, and a learning rate. We first create a sequential model, which is a linear stack of layers, and then add layers to it one by one.

● We first added the first convolutional layer. It has 32 filters, each with the 3x3 kernel size we defined, and it receives our input shape. Each filter slides over the input and computes the dot product between its weights and the input values. The input to a convolutional layer is a 3D volume (width, height, depth); we need to reduce this to a one-dimensional output once we are done with the convolutional layers.

● We added a 2x2 max-pooling layer to the model. This layer downsamples the spatial dimensions of the input. It is a form of spatial subsampling that helps retain important features and suppress less important ones.

● We added the second convolutional layer, this time with a 2x2 kernel size and 64 filters, so that the network performs a second stage of convolution on the feature maps produced by the first layer.

● We then added a flatten layer to turn the output of the previous layer into a one-dimensional vector. This is necessary because the following layers need a one-dimensional input, while the convolutional layers produce a 3D volume. The flatten layer reshapes this output without losing any data; for example, a 2D convolutional output [[a, b], [c, d]] becomes the one-dimensional vector [a, b, c, d].

● After turning the output of the convolutional network into a 1D vector, we can use fully connected layers. The first fully connected layer, with 256 units, processes the flattened features. Fully connected layers allow the network, and therefore the agent, to learn global patterns and relationships across the entire input. Each connection between nodes has a weight, and these weights are adjusted during training.

● Lastly, the output layer is added. It takes the values from the fully connected layer and outputs the Q-values for each possible action, i.e., it maps the output of the fully connected layer to the desired output, which in this case is the Q-values.

Fig. 1: Convolutional Neural Network Layers

B. Definition of Neural Network

We defined this convolutional neural network with our input shape, which is the rows x columns of the grid defined in the environment (20 x 20) and the number of possible states of the agent. We gave the network a learning rate of 0.0001, so that it does not change drastically with every input and old experiences remain important. After defining the convolutional neural network, we used Q-Network methods to obtain better rewards and update our agent. The convolutional neural network we defined works under partial observability: the agent does not have complete information about its surroundings, so it uses Q-Network methods together with the convolutional neural network to gather more information about the Q-values of the states on the map.
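For illustration, the layer stack described in Sections III.A and III.B could be written in Keras roughly as follows. This is only a sketch: the kernel sizes, filter counts, 256-unit dense layer, 20x20 input, and 0.0001 learning rate come from the text, while the activation functions, the single input channel, the number of actions, the optimizer, and the function name build_cnn are our own assumptions.

from tensorflow.keras import layers, models, optimizers

def build_cnn(input_shape=(20, 20, 1), n_actions=4, learning_rate=1e-4):
    # Sequential model: a linear stack of layers, as described in Section III.A
    model = models.Sequential([
        # First convolutional layer: 32 filters with a 3x3 kernel
        layers.Conv2D(32, (3, 3), activation="relu", padding="same",
                      input_shape=input_shape),
        # 2x2 max-pooling for spatial downsampling
        layers.MaxPooling2D((2, 2)),
        # Second convolutional layer: 64 filters with a 2x2 kernel
        layers.Conv2D(64, (2, 2), activation="relu", padding="same"),
        # Flatten the 3D feature maps into a 1D vector
        layers.Flatten(),
        # Fully connected layer with 256 units
        layers.Dense(256, activation="relu"),
        # Output layer: one Q-value per possible action
        layers.Dense(n_actions, activation="linear"),
    ])
    # Mean squared TD error is the usual DQN loss; the learning rate is from Section III.B
    model.compile(optimizer=optimizers.Adam(learning_rate=learning_rate), loss="mse")
    return model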
IV. METHODS AND IMPLEMENTATIONS

A. Deep Q-Network

Deep Q-learning is a type of reinforcement learning algorithm that combines Q-learning with deep neural networks. Deep Q-learning introduces a deep neural network to approximate the Q-function, which is a mapping from states and actions to Q-values. The neural network takes the state of the environment as input and outputs Q-values for each possible action. This allows the algorithm to handle high-dimensional input spaces, making it suitable for tasks like image-based decision-making in video games.

The key idea behind Deep Q-learning is to use the neural network to generalize across similar states and actions, enabling the agent to learn a more efficient and effective policy. The algorithm uses a technique called experience replay, where past experiences are stored in a replay buffer. During training, batches of experiences are sampled randomly from the buffer to train the neural network, reducing correlations in the data and improving stability.
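Such a replay buffer takes only a few lines to implement. The sketch below is a generic illustration whose method names match the pseudo-code in Fig. 2; the capacity and data layout are our own assumptions, not the project's actual code.

import random
from collections import deque, namedtuple

Experience = namedtuple("Experience",
                        ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, capacity=100000):
        # Oldest experiences are discarded automatically once the buffer is full
        self.buffer = deque(maxlen=capacity)

    def add_experience(self, state, action, reward, next_state, done):
        self.buffer.append(Experience(state, action, reward, next_state, done))

    def sample_batch(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive steps
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)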
The pseudo-code for Deep Q-Network is:

total_steps = 0
for episode in range(num_episodes):
    state = env.reset()
    total_reward = 0
    done = False

    while not done:
        # Exploration-exploitation trade-off
        epsilon = max(epsilon_min, epsilon_max * epsilon_decay)

        # Choose action using epsilon-greedy strategy
        action = epsilon_greedy(Q_network, state, epsilon)

        # Take the chosen action and observe the next state and reward
        next_state, reward, done, _ = env.step(action)

        # Store experience in replay buffer
        replay_buffer.add_experience(state, action, reward, next_state, done)

        # Sample random mini-batch from replay buffer
        batch = replay_buffer.sample_batch(batch_size)

        # Compute target Q-values using the target Q-network
        target_Q_values = compute_target_Q_values(target_Q_network, batch, gamma)

        # Compute Q-values using the Q-network
        Q_values = Q_network(batch.states)

        # Update Q-network parameters using the temporal difference loss
        loss = compute_temporal_difference_loss(Q_values, target_Q_values)
        Q_network.optimize(loss)

        # Update target Q-network every 'update_target_freq' steps
        total_steps += 1
        if total_steps % update_target_freq == 0:
            target_Q_network.load_state_dict(Q_network.state_dict())

        total_reward += reward
        state = next_state

    # Print or log the total reward for the episode
    print(f"Episode {episode + 1}, Total Reward: {total_reward}")

Fig. 2: Pseudo-code for DQN

The target Q-network used here is the CNN described in the previous section. The temporal difference loss is computed and then used for backpropagation.
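The helpers compute_target_Q_values and compute_temporal_difference_loss in Fig. 2 hide the core of the update. A minimal NumPy sketch of what they compute is given below; the simplified signatures and array shapes are our own illustrative choices rather than the project's actual interfaces.

import numpy as np

def compute_target_Q_values(next_q_from_target, rewards, dones, gamma=0.99):
    # Standard DQN target: y = r + gamma * max_a' Q_target(s', a'),
    # with no bootstrapping on terminal transitions (dones is a 0/1 array)
    max_next_q = next_q_from_target.max(axis=1)
    return rewards + gamma * (1.0 - dones) * max_next_q

def compute_temporal_difference_loss(q_values, actions, targets):
    # Mean squared TD error between Q(s, a) of the taken actions and the targets y
    q_taken = q_values[np.arange(len(actions)), actions]
    return np.mean((targets - q_taken) ** 2)

In the actual training loop this computation is done on framework tensors, so the loss can be backpropagated through the Q-network while the target network is held fixed between its periodic updates.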
B. Double Deep Q-Network

Double Deep Q-learning is an extension of the original Deep Q-learning algorithm that aims to address a common issue known as overestimation bias [1]. In the standard Deep Q-learning algorithm, the Q-value is updated using the maximum Q-value of the next state. However, during the learning process the Q-values themselves are also being updated, and this can result in an overestimation of their true values. The basic idea behind Double DQN is to decouple the selection of actions from the evaluation of those actions [1]. Rather than utilizing the same Bellman target as the DQN algorithm, the algorithm modifies it as follows:

Y = r + γ · Q(s', argmax_a' Q(s', a'; θ); θ⁻)

The implementation of Double Q-learning with deep neural networks is called Double Deep Q-Network. The main neural network θ selects the optimal next action a' from among all of the options, and the target neural network θ⁻ then assesses this action to determine its Q-value. This method has been demonstrated to lower overestimations, leading to improved final policies [2].

The pseudo-code for Double Deep Q-Network is similar to that of Deep Q-Network, as the algorithm is a variation of DQN. The key difference can be highlighted with:

# Compute Q-values using the Q-network
Q_values = Q_network(batch.states)

# Use the Q-network to select the best actions for the next states
next_actions = Q_network.get_best_actions(batch.next_states)

# Compute target Q-values using the target Q-network and the selected actions
target_Q_values = target_Q_network(batch.next_states)
target_Q_values_selected = target_Q_values[range(batch_size), next_actions]

Fig. 3: Pseudo-code for Double DQN

In the Double DQN pseudo-code, the Q-network is used to select the best actions for the next states (next_actions). These actions are then used to index into the target Q-values obtained from the target Q-network (target_Q_values_selected). This separation of action selection and target evaluation is the key idea behind Double DQN and is intended to mitigate the overestimation bias that can occur in standard DQN. The primary difference is that Double DQN uses the Q-network for action selection and the target Q-network for evaluation of the selected actions during the computation of target Q-values.
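Concretely, the only change relative to the DQN target computation is which network supplies the argmax. A short NumPy illustration consistent with Fig. 3 follows; the function name and array shapes are our own assumptions.

import numpy as np

def compute_double_dqn_targets(next_q_online, next_q_target, rewards, dones, gamma=0.99):
    # Action selection uses the online Q-network ...
    next_actions = np.argmax(next_q_online, axis=1)
    # ... while those actions are evaluated with the target Q-network
    evaluated_q = next_q_target[np.arange(len(next_actions)), next_actions]
    return rewards + gamma * (1.0 - dones) * evaluated_q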

C. Dueling Deep Q-Network

This algorithm splits the Q-value into two different parts: the value function V(s) and the advantage function A(s, a). The value function V(s) indicates the amount of reward we expect to receive from state s, while the advantage function A(s, a) indicates the relative superiority of each action over the others. The Q-values can be obtained by combining the value V and the advantage A for each action:

Q(s, a) = V(s) + ( A(s, a) - mean_a' A(s, a') )

By splitting the neural network's final layer into two parts, one estimating the advantage function A(s, a) for each action a and one estimating the state value function V(s) for state s, the Dueling DQN algorithm has the neural network ultimately combine both parts into a single output that estimates the Q-values [3]. This modification is beneficial because, in many situations, understanding the state-value function alone may be enough, rather than knowing the precise value of each action [3]. In the standard DQN, by contrast, both the state's value V(s) and the advantage of taking an action A(s, a) are considered together when computing the Q-value. If the state is bad and all actions lead to failure, estimating the impact of each action is pointless once the state's value V(s) has been calculated.

This design speeds up training because the network can estimate the value of a state without having to calculate Q(s, a) for every action in that state. By separating the estimation into two streams, we can obtain more dependable Q-values for each action. We can measure the model's loss using the following mean squared error:

L(θ) = (1/N) · Σ_i ( y_i - Q(s_i, a_i; θ) )²

where

y_i = r_i + γ · max_a' Q(s'_i, a'; θ⁻)

and then take a gradient descent step to update the model parameters.

Both DQN and Double DQN use a Deep Q-Network structure, whereas Dueling DQN uses a specialized Dueling Deep Q-Network structure to implement the dueling architecture. The pseudo-code for the Dueling Deep Q-Network structure is as follows:

class DuelingDeepQNetwork:
    def __init__(self, input_size, output_size):
        # Shared layers for state representation
        self.shared_layers = create_shared_layers(input_size)

        # Stream for state value estimation
        self.value_stream = create_value_stream(self.shared_layers)

        # Stream for advantage estimation
        self.advantage_stream = create_advantage_stream(self.shared_layers, output_size)

    def forward(self, state):
        # Forward pass through shared layers
        shared_output = self.shared_layers(state)

        # Forward pass through the value stream
        value_output = self.value_stream(shared_output)

        # Forward pass through the advantage stream
        advantage_output = self.advantage_stream(shared_output)

        # Combine value and advantage streams to get Q-values
        Q_values = value_output + (advantage_output - advantage_output.mean(dim=1, keepdim=True))

        return Q_values

Fig. 4: Pseudo-code for Dueling DQN

In the Dueling Deep Q-Network structure, the architecture explicitly separates the value and advantage streams, allowing independent estimation of the state value and of the advantage of each action. This separation is intended to enhance the stability of the learning process. Apart from the Dueling Deep Q-Network structure itself, the remaining code is similar to the DQN implementation.
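As a concrete counterpart to the pseudo-code in Fig. 4, the dueling head can be expressed with the Keras functional API roughly as shown below. The shared convolutional trunk mirrors the CNN from Section III; the sizes of the two streams and the layer choices are our own illustrative assumptions.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_dueling_network(input_shape=(20, 20, 1), n_actions=4):
    inputs = layers.Input(shape=input_shape)

    # Shared layers for the state representation (same style as Section III)
    x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(inputs)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(64, (2, 2), activation="relu", padding="same")(x)
    x = layers.Flatten()(x)

    # Value stream: a single scalar V(s)
    value = layers.Dense(128, activation="relu")(x)
    value = layers.Dense(1)(value)

    # Advantage stream: one A(s, a) per action
    advantage = layers.Dense(128, activation="relu")(x)
    advantage = layers.Dense(n_actions)(advantage)

    # Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a')), as in Fig. 4
    q_values = layers.Lambda(
        lambda va: va[0] + (va[1] - tf.reduce_mean(va[1], axis=1, keepdims=True))
    )([value, advantage])

    return models.Model(inputs=inputs, outputs=q_values)

Subtracting the mean advantage keeps V and A identifiable: adding a constant to A and subtracting it from V would otherwise leave the Q-values unchanged.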
V. RESULTS

Below are the results for each agent, including their respective reward value, (eaten) apple count, and timesteps:

Fig. 5: Rewards per episode for DQN algorithm

Fig. 6: Apples per episode for DQN algorithm

Fig. 7: Timesteps alive per episode for DQN algorithm

Fig. 8: Rewards per episode for Double DQN algorithm

Fig. 9: Apples per episode for Double DQN algorithm

Fig. 10: Timesteps alive per episode for Double DQN algorithm

The results in the figures indicate that DQN consistently achieves higher rewards than Double DQN. However, Double DQN scores higher in the other areas, including more apples eaten and longer survival times. Unfortunately, attempts to train Dueling DQN were not successful in this phase and will be examined further in the next project checkpoint.

VI. DISCUSSION

Our study suggests that both DQN and Double DQN are suitable for our "Snake" game. While Double DQN performs better in many respects, DQN stands out with significantly higher reward rates. Even though Double DQN reached a maximum of around 10-12 apples eaten, DQN nearly doubled the average reward per episode.

In conclusion, our data reveals that our algorithms still fall short of average human performance. To surpass it, we would need to enhance our algorithms or use even higher epoch counts. However, it is important to note that the Kaggle notebook limits training to 12 hours, and our DQN run already took 6 hours. Therefore, surpassing human-level play with this method appears unlikely. Nevertheless, the positive outcomes affirm the promise of our chosen DQN approach.

Even though it is not part of the progress report, we trained DQN and Double DQN. We did not have time for Dueling DQN, but we completed the parts we needed to finish by the progress report.

REFERENCES

[1] Z. Ren, "On the Estimation Bias in Double Q-Learning," Department of Computer Science, University of Illinois at Urbana-Champaign, Jan. 14, 2022.

[2] H. van Hasselt, "Deep Reinforcement Learning with Double Q-Learning," arXiv, Cornell University, Sept. 2015.

[3] M. S. Ausin, "Introduction to Reinforcement Learning. Part 4. Double DQN and Dueling DQN," Medium, Nov. 25, 2020. [Online]. Available: https://markelsanz14.medium.com/introduction-to-reinforcement-learning-part-4-double-dqn-and-dueling-dqn-b349c9a61ea1
