CS461 Intermediate Report Team7
The key idea behind Deep Q-learning is to use the neural network to generalize across similar states and actions, enabling the agent to learn a more efficient and effective policy. The algorithm uses a technique called experience replay, where past experiences are stored in a replay buffer. During training, batches of experiences are sampled randomly from the buffer to train the neural network, reducing correlations in the data and improving stability.
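As a sketch of the replay mechanism just described (the class and method names below are illustrative, not taken from our implementation), a minimal uniform replay buffer could look like:

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of past transitions, sampled uniformly at random."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are discarded first

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random sampling breaks the correlation between consecutive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)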
The pseudo-code for the Deep Q-Network is:

for episode in range(num_episodes):
    state = env.reset()
    total_reward = 0
    done = False

    while not done:
        # Exploration-exploitation trade-off
        epsilon = max(epsilon_min, epsilon_max * epsilon_decay ** episode)
        action = select_action(state, epsilon)   # epsilon-greedy action choice
        next_state, reward, done = env.step(action)
        replay_buffer.store(state, action, reward, next_state, done)

        # Sample a random batch from the replay buffer and update the Q-network
        train_step(Q_network, target_Q_network, replay_buffer)

        state = next_state
        total_reward += reward

    # Print or log the total reward for the episode
    print(f"Episode {episode + 1}, Total Reward: {total_reward}")

Fig. 2: Pseudo-code for DQN

The target Q-network used will be a CNN described in the previous section. The temporal-difference loss is then calculated and used for backpropagation.
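The select_action call above is the usual epsilon-greedy rule. A minimal sketch, assuming Q_network and num_actions are available in the surrounding scope (just as the rest of the pseudo-code assumes for env and the networks):

import random

def select_action(state, epsilon):
    # Explore with probability epsilon, otherwise act greedily w.r.t. Q_network
    if random.random() < epsilon:
        return random.randrange(num_actions)   # random (exploratory) action
    return int(Q_network(state).argmax())      # action with the highest predicted Q-value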
B. Double Deep Q-Network

Double Deep Q-learning is an extension of the original Deep Q-learning algorithm that aims to address a common issue known as overestimation bias [1]. In the standard Deep Q-learning algorithm, the Q-value is updated using the maximum Q-value of the next state. However, during the learning process, the Q-values themselves are also being updated, and this can result in an overestimation of their true values. The basic idea behind Double DQN is to decouple the selection of actions from the evaluation of those actions [1]. The algorithm modifies the Bellman equation as follows, rather than utilizing the same one as in the DQN algorithm:
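Following the formulation in [2], with θ denoting the online-network weights and θ⁻ the target-network weights, the standard DQN target and the Double DQN target can be written as:

    y_DQN       = r + \gamma \max_{a'} Q(s', a'; \theta^-)

    y_DoubleDQN = r + \gamma \, Q(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-)

The only change is that the action inside the maximization is selected by the online network, while its value is still evaluated by the target network.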
The Double Q-learning implementation with deep neural networks is called Double Deep Q-Network. The main neural network θ first selects the optimal next action a' from among all of the options, and the target neural network then assesses this action to determine its Q-value. This method has been demonstrated to lower overestimations, leading to an improved final policy [2].

The pseudo-code for the Double Deep Q-Network will be similar to that of the Deep Q-Network, as the algorithm is a variation of DQN. The key difference can be highlighted with:

# Compute Q-values using the Q-network
Q_values = Q_network(batch.states)

# Use the Q-network to select the best actions for the next states
next_actions = Q_network.get_best_actions(batch.next_states)

# Compute target Q-values using the target Q-network and the selected actions
target_Q_values = target_Q_network(batch.next_states)
target_Q_values_selected = target_Q_values[range(batch_size), next_actions]

Fig. 3: Pseudo-code for Double DQN
In the Double DQN pseudo-code, the Q-network is used to select the best actions for the next states (next_actions). These actions are then used to index into the target Q-values obtained from the target Q-network (target_Q_values_selected). This separation of action selection and target evaluation is the key idea behind Double DQN and is intended to mitigate the overestimation bias that can occur in standard DQN. The primary difference is that Double DQN uses the Q-network for action selection and the target Q-network for evaluation of the selected actions during the computation of target Q-values.
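To connect Fig. 3 back to the loss, the selected values would then enter the temporal-difference target. A minimal continuation of the snippet, assuming gamma and the batch fields rewards, actions, and dones exist alongside the fields already used above:

# Bellman backup using the Double DQN target (gamma, batch.rewards, batch.actions
# and batch.dones are assumed fields, not shown in Fig. 3)
td_targets = batch.rewards + gamma * (1 - batch.dones) * target_Q_values_selected

# Temporal-difference (mean squared error) loss used for backpropagation
Q_values_taken = Q_values[range(batch_size), batch.actions]
loss = ((td_targets - Q_values_taken) ** 2).mean()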
C. Dueling Deep Q-Network

In the Dueling DQN architecture, the Q-value is split into two components. The value of being in a given state is indicated by the value function V(s). Furthermore, the advantage function A(s, a) indicates the relative superiority of each action over the others. The Q-values can be obtained by combining the value V and the advantage A for each action:
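A common way to combine the two streams, consistent with the dueling architecture discussed in [3], subtracts the mean advantage so that V and A remain identifiable:

    Q(s, a) = V(s) + ( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') )

where |\mathcal{A}| is the number of available actions.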
By splitting the neural network's final layer into two parts, one estimating the advantage function for each action a (A(s, a)) and the other estimating the state value function for state s (V(s)), the Dueling DQN algorithm suggests that the neural network ultimately combines both parts into a single output that estimates the Q-values [3]. This modification is beneficial because, in many situations, understanding the state-value function alone may be enough, rather than knowing the precise value of each action [3]. In the standard DQN, however, both the state's value (V(s)) and the advantage of taking an action (A(s, a)) are considered when computing the Q-value. If the state is bad and all actions lead to failure, estimating the impact of each action is pointless, since the state's value (V(s)) has already been calculated.

This design speeds up training because it only calculates the value of a state without having to calculate Q(s, a) for every action in that state. By separating the estimation between two streams, we can find more dependable Q-values for each action. We can measure the model's loss using the following mean squared error and then take a gradient descent step to update the model parameters:
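In standard notation, with θ the network parameters, θ⁻ the target-network parameters, and N the mini-batch size, this loss and its target y_i can be written as:

    L(\theta) = \frac{1}{N} \sum_{i=1}^{N} ( y_i - Q(s_i, a_i; \theta) )^2,  where  y_i = r_i + \gamma \max_{a'} Q(s'_i, a'; \theta^-)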
Both DQN and Double DQN use a Deep Q-Network structure, whereas Dueling DQN uses a specialized Dueling Deep Q-Network structure to implement the dueling architecture. The pseudo-code for the Dueling Deep Q-Network structure is as follows:

class DuelingDeepQNetwork:
    def __init__(self, input_size, output_size):
        # Shared layers for state representation
        self.shared_layers = create_shared_layers(input_size)

        # Stream for state value estimation
        self.value_stream = create_value_stream(self.shared_layers)

        # Stream for advantage estimation
        self.advantage_stream = create_advantage_stream(self.shared_layers, output_size)

    def forward(self, state):
        # Forward pass through shared layers
        shared_output = self.shared_layers(state)

        # Forward pass through the value stream
        value_output = self.value_stream(shared_output)

        # Forward pass through the advantage stream
        advantage_output = self.advantage_stream(shared_output)

        # Combine both streams into Q-value estimates (mean-subtracted advantages)
        Q_values = value_output + (advantage_output - advantage_output.mean())

        return Q_values
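For concreteness, a minimal PyTorch sketch of the same structure (not our actual training code; a small fully connected trunk stands in for the CNN described earlier, and the layer sizes are arbitrary):

import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Dueling architecture: shared trunk, then separate V(s) and A(s, a) streams."""

    def __init__(self, input_size, num_actions, hidden_size=128):
        super().__init__()
        self.shared_layers = nn.Sequential(nn.Linear(input_size, hidden_size), nn.ReLU())
        self.value_stream = nn.Linear(hidden_size, 1)                # estimates V(s)
        self.advantage_stream = nn.Linear(hidden_size, num_actions)  # estimates A(s, a)

    def forward(self, state):
        shared = self.shared_layers(state)
        value = self.value_stream(shared)          # shape (batch, 1)
        advantage = self.advantage_stream(shared)  # shape (batch, num_actions)
        # Subtract the mean advantage so V and A are identifiable, then combine
        return value + advantage - advantage.mean(dim=1, keepdim=True)

# Example: Q-values for a batch of 32 states with 4 actions (sizes are illustrative)
q_values = DuelingQNetwork(input_size=64, num_actions=4)(torch.randn(32, 64))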
Fig. 6: Apples per episode for DQN algorithm

Fig. 10: Timesteps alive per episode for Double DQN algorithm
VI. DISCUSSION
Our study suggests that both DQN and Double DQN are
suitable for our "Snake" game. While Double DQN performs
better in many aspects, DQN stands out with significantly
higher reward rates. Even though Double DQN reached a
maximum of around 10-12 apples eaten, DQN nearly
doubled the average reward per episode.
In conclusion, our data reveals that our algorithm still falls
short of achieving average human performance. To surpass it,
we need to enhance our algorithms or employ even higher
epoch values. However, it is important to note that Kaggle notebooks support only 12 hours of training, and our DQN run already took 6 hours of runtime.
Therefore, surpassing human levels through this method
appears unlikely. Nevertheless, the positive outcomes affirm
the promising nature of our chosen DQN approach.
Even though it is not part of the progress report, we trained both DQN and Double DQN. We did not have time for Dueling DQN, but we completed the work we needed to finish by the progress report.
REFERENCES
[1] Z. Ren, "On the Estimation Bias in Double Q-Learning," Department of Computer Science, University of Illinois at Urbana-Champaign, Jan. 14, 2022.
[2] H. van Hasselt, "Deep Reinforcement Learning with Double Q-Learning," Cornell University, Sept. 2015.
[3] M. S. Ausin, "Introduction to Reinforcement Learning. Part 4. Double DQN and Dueling DQN," Medium, Nov. 25, 2020.