
Implementing Deep Q-Learning using PyTorch

Last Updated : 28 May, 2025

Deep Q-Learning is a reinforcement learning method that uses a neural network to help an agent learn how to make decisions by estimating Q-values, which represent how good an action is in a given situation. In this article we’ll implement Deep Q-Learning from scratch using PyTorch.

How Deep Q-Learning Works

Deep Q-Learning works in 5 simple steps that help an agent learn from its surroundings and improve how it makes decisions:

Working of Deep Q-Learning
  • Define the Q-network: The Q-network is a deep neural network. It takes the current state of the agent as input and outputs Q-values for all possible actions. These Q-values represent how good each action is in that state.
  • Initialize the Q-network’s parameters: These are the weights of the neural network. PyTorch can automatically initialize them.
  • Define the loss function: This helps the network learn. The most commonly used loss here is Mean Squared Error (MSE) which compares the predicted Q-values and the target Q-values.
  • Define the optimizer: An optimizer adjusts the network’s weights to reduce the loss. Common choices include Adam and RMSprop.
  • Collect experiences: The agent plays in the environment and collects data in the form of (state, action, reward, next_state). This experience helps the model learn which actions are better over time.

Let’s implement Deep Q-Learning using PyTorch.

 Step 1: Importing the required libraries 

First we will import all the necessary libraries: gym for the environment, random and numpy for sampling and array handling, deque for the replay buffer and PyTorch (torch, torch.nn, torch.optim) for the network and training.

Python
import gym
import random
import numpy as np
from collections import deque
import torch
import torch.nn as nn
import torch.optim as optim

Step 2: Defining the Q-Network (Neural Network)

This is a simple neural network with three layers. It takes the state as input and outputs Q-values for all possible actions.

Python
class DQN(nn.Module):
    def __init__(self, state_size, action_size):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_size, 24)   # state -> first hidden layer
        self.fc2 = nn.Linear(24, 24)           # second hidden layer
        self.fc3 = nn.Linear(24, action_size)  # one Q-value per action

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)
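
As a quick sanity check, a batch containing one state should produce one Q-value per action. The snippet below is illustrative only; the sizes 4 and 2 are CartPole-v1's state and action dimensions, which we read from the environment in the next step.

Python
net = DQN(state_size=4, action_size=2)   # CartPole-v1: 4 state variables, 2 actions
dummy_state = torch.rand(1, 4)           # a batch with one random state
print(net(dummy_state).shape)            # torch.Size([1, 2]) -> 2 Q-values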

Step 3: Defining Hyperparameters

These parameters control the learning process: exploration vs exploitation, learning rate, memory, etc.

  • gamma: Discount factor for future rewards; values closer to 1 put more weight on long-term rewards.
  • epsilon: Controls the exploration vs exploitation trade-off.
  • batch_size: How many experiences we train on at once.
  • memory_size: Maximum size of the experience replay buffer.
Python
env = gym.make("CartPole-v1")
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

# Hyperparameters
gamma = 0.99             # discount factor for future rewards
epsilon = 1.0            # initial exploration rate
epsilon_min = 0.01       # lowest exploration rate allowed
epsilon_decay = 0.995    # multiplicative decay applied after each episode
learning_rate = 0.001
batch_size = 64
memory_size = 10000
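
Since epsilon is multiplied by epsilon_decay once per episode, exploration shrinks slowly. A small, purely illustrative calculation shows how many episodes it would take to reach epsilon_min; after the 500 episodes trained below, epsilon is only about 0.08.

Python
import math

# epsilon after n episodes is 1.0 * 0.995**n, so it first drops below 0.01
# when n >= log(0.01) / log(0.995), i.e. after roughly 919 episodes.
episodes_to_min = math.ceil(math.log(epsilon_min) / math.log(epsilon_decay))
print(episodes_to_min)        # 919
print(epsilon_decay ** 500)   # ~0.082 after the 500 episodes used below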

Step 4: Creating Replay Memory 

The agent stores past experiences in memory and samples from it during training to break correlation in data.

  • Stores past experiences: (state, action, reward, next_state, done).
  • deque automatically removes old experiences when it exceeds the limit.
Python
memory = deque(maxlen=memory_size)
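
A tiny, throwaway example of the deque behaviour described above (a toy buffer, not the real memory):

Python
toy_buffer = deque(maxlen=3)
for i in range(5):
    toy_buffer.append(i)      # oldest entries are dropped once the limit is hit
print(toy_buffer)             # deque([2, 3, 4], maxlen=3)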

Step 5: Initializing Network and Optimizer

We use two networks: policy network (for selecting actions) and target network (for stable learning).

  • policy_net: Trained actively and used to choose actions.
  • target_net: Provides stable target Q-values (updated less frequently).
  • Adam: Adaptive optimizer to update weights.
  • MSELoss: Compares predicted Q-values to target Q-values.
Python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

policy_net = DQN(state_size, action_size).to(device)
target_net = DQN(state_size, action_size).to(device)
target_net.load_state_dict(policy_net.state_dict())  
target_net.eval()

optimizer = optim.Adam(policy_net.parameters(), lr=learning_rate)
loss_fn = nn.MSELoss()
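
Right after load_state_dict the two networks are identical, so they must return the same Q-values for any input. A quick illustrative check:

Python
# Both networks should agree exactly immediately after the weight copy.
probe = torch.rand(1, state_size).to(device)
with torch.no_grad():
    print(torch.allclose(policy_net(probe), target_net(probe)))  # True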

Step 6: Defining Function to Choose Action

With probability epsilon the agent explores by picking a random action; otherwise it exploits its learned policy by selecting the action with the maximum predicted Q-value.

Python
def get_action(state, epsilon):
    if random.random() < epsilon:
        return random.choice(range(action_size))  
    else:
        state = torch.FloatTensor(state).unsqueeze(0).to(device)
        with torch.no_grad():
            q_values = policy_net(state)
        return q_values.argmax().item()  
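
For example, with epsilon = 1.0 the function always explores (random.random() is always below 1.0), while epsilon = 0.0 always takes the greedy action. A toy call with a made-up all-zero state:

Python
toy_state = np.zeros(state_size, dtype=np.float32)  # hypothetical state, for illustration only
print(get_action(toy_state, 1.0))   # a random action: 0 or 1
print(get_action(toy_state, 0.0))   # argmax of the (still untrained) network's Q-values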

Step 7: Training on a Mini-Batch

We randomly sample experiences and update the network using the Bellman equation.

  • Use the Bellman equation to compute target Q-values.
  • Minimize the MSE loss between predicted and target Q-values.
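
Concretely, the target used in the code below is target_q = reward + gamma * max over actions of Q_target(next_state, action), with the bootstrap term multiplied by (1 - done) so it vanishes when the episode has ended.
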
Python
def replay():
    if len(memory) < batch_size:
        return

    minibatch = random.sample(memory, batch_size)

    states, actions, rewards, next_states, dones = zip(*minibatch)

    # Convert the batch to tensors (np.array first avoids the slow
    # list-of-arrays conversion and the associated PyTorch warning)
    states = torch.FloatTensor(np.array(states)).to(device)
    actions = torch.LongTensor(actions).unsqueeze(1).to(device)
    rewards = torch.FloatTensor(rewards).unsqueeze(1).to(device)
    next_states = torch.FloatTensor(np.array(next_states)).to(device)
    dones = torch.FloatTensor(dones).unsqueeze(1).to(device)

    # Current Q values
    current_q = policy_net(states).gather(1, actions)

    # Target Q values
    next_q = target_net(next_states).max(1)[0].detach().unsqueeze(1)
    target_q = rewards + (gamma * next_q * (1 - dones))

    loss = loss_fn(current_q, target_q)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Step 8: Training the Agent

This loop trains the agent over multiple episodes. After each episode we decay the exploration rate, and every target_update_freq episodes we copy the policy network’s weights into the target network.

  • Calls get_action() to pick actions and replay() to train.
  • Updates target network periodically for stability.
  • Decays epsilon to reduce exploration over time.
Python
episodes = 500
target_update_freq = 10

for episode in range(episodes):
    reset_result = env.reset()
    state = reset_result[0] if isinstance(reset_result, tuple) else reset_result
    total_reward = 0

    for t in range(500):
        action = get_action(state, epsilon)
        step_result = env.step(action)

        if len(step_result) == 5:
            next_state, reward, terminated, truncated, _ = step_result
            done = terminated or truncated
        else:
            next_state, reward, done, _ = step_result

        memory.append((state, action, reward, next_state, done))
        state = next_state
        total_reward += reward

        replay()
        if done:
            break

    if epsilon > epsilon_min:
        epsilon *= epsilon_decay

    if episode % target_update_freq == 0:
        target_net.load_state_dict(policy_net.state_dict())

    print(f"Episode {episode}, Total Reward: {total_reward}, Epsilon: {epsilon:.3f}")

Output:

Result: a per-episode training log printing the episode number, Total Reward and Epsilon.

As shown in the output above:

  • Total Reward shows how long the agent balanced the pole (higher is better).
  • Epsilon shows how much the agent is exploring; lower values mean it is exploiting more.

As training continues the agent learns better actions and the Total Reward improves. When the total reward approaches the maximum value (500 for CartPole-v1) it means the agent has learned to perform well consistently.
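
To watch the learned behaviour you can roll out the greedy policy (epsilon = 0) after training. The sketch below is illustrative; it reuses env, policy_net and get_action exactly as defined above and handles both gym step APIs the same way the training loop does.

Python
# Greedy evaluation rollout: no exploration, just the learned policy.
reset_result = env.reset()
state = reset_result[0] if isinstance(reset_result, tuple) else reset_result
eval_reward, done = 0, False

while not done:
    action = get_action(state, 0.0)          # always exploit
    step_result = env.step(action)
    if len(step_result) == 5:
        state, reward, terminated, truncated, _ = step_result
        done = terminated or truncated
    else:
        state, reward, done, _ = step_result
    eval_reward += reward

print(f"Evaluation reward: {eval_reward}")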


