
Sequential Decision Problems in AI

Last Updated : 16 Aug, 2024

Sequential decision problems are at the heart of artificial intelligence (AI) and have become a critical area of study due to their vast applications in various domains, such as robotics, finance, healthcare, and autonomous systems. These problems involve making a sequence of decisions over time, where each decision can affect future outcomes, leading to a complex decision-making process that requires balancing immediate rewards with long-term benefits.

Introduction to Sequential Decision Problems

Sequential decision problems occur when an agent must make a series of decisions in an environment, with each decision affecting not only the immediate outcome but also the future states of the environment. These decisions are interdependent, meaning that the optimal decision at any point depends on the decisions made previously and the potential decisions that will be made in the future.

A classic example is the process of playing chess, where each move influences the subsequent moves and the overall outcome of the game. The challenge in such problems is to devise a strategy that optimizes a certain objective, such as maximizing the total reward or minimizing the total cost, over the entire sequence of decisions.

Five Key Components of Sequential Decision Problems

  1. States: The state represents the current situation of the environment. It encapsulates all the necessary information to make a decision. For example, in a game of chess, the state would include the positions of all the pieces on the board.
  2. Actions: Actions are the choices available to the agent at any given state. Each action leads to a transition from one state to another. In the chess example, an action would be moving a piece from one square to another.
  3. Transitions: The transition model describes how the state changes in response to an action. This is often probabilistic in nature, especially in environments where uncertainty plays a role.
  4. Rewards: The reward function assigns a numerical value to each state or state-action pair, representing the immediate benefit of being in that state or taking that action. The objective is typically to maximize the cumulative reward over time.
  5. Policies: A policy is a strategy that defines the action the agent will take in each state. An optimal policy maximizes the expected cumulative reward over time.
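
To make these components concrete, the short Python sketch below writes them out for a toy two-state problem. The state names, action names, probabilities, and rewards are illustrative assumptions chosen only for this example.

# A toy sequential decision problem with two states and two actions.
# All names and numbers below are illustrative assumptions.
states = ["low_battery", "charged"]
actions = ["work", "recharge"]

# Transition model: P(next_state | state, action)
transitions = {
    ("low_battery", "work"):     {"low_battery": 0.9, "charged": 0.1},
    ("low_battery", "recharge"): {"charged": 1.0},
    ("charged", "work"):         {"charged": 0.7, "low_battery": 0.3},
    ("charged", "recharge"):     {"charged": 1.0},
}

# Reward for each state-action pair (the immediate benefit)
rewards = {
    ("low_battery", "work"): 1.0,
    ("low_battery", "recharge"): 0.0,
    ("charged", "work"): 2.0,
    ("charged", "recharge"): 0.0,
}

# A policy maps each state to an action
policy = {"low_battery": "recharge", "charged": "work"}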

Types of Sequential Decision Problems

Sequential decision problems can be categorized based on the environment's characteristics and the information available to the agent:

1. Markov Decision Processes (MDPs)

MDPs are a fundamental framework for modeling sequential decision problems where the environment is fully observable, and the transitions between states are probabilistic. The decision-making process relies on the Markov property, where the future state depends only on the current state and action, not on the history of past states.
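
Formally, the Markov property can be stated as P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0) = P(s_{t+1} | s_t, a_t): the probability of the next state given the entire history is the same as the probability given only the current state and action.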

2. Partially Observable Markov Decision Processes (POMDPs)

In many real-world scenarios, the agent does not have complete information about the current state of the environment. POMDPs extend MDPs by introducing hidden states and an observation model, making the decision-making process more complex.

3. Multi-armed Bandits

This is a simpler form of sequential decision problem where the agent must choose between multiple actions (or arms), each with an unknown probability distribution of rewards. The challenge is to balance exploration (trying out different actions) and exploitation (choosing the action with the highest known reward).
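
To illustrate the exploration-exploitation trade-off, here is a minimal epsilon-greedy sketch for a three-armed bandit. The reward probabilities, the value of epsilon, and the number of rounds are assumptions made for this example, not values from any particular application.

import numpy as np

# Hypothetical 3-armed bandit: the true reward probabilities are unknown to the agent.
true_probs = [0.2, 0.5, 0.7]   # assumed for this illustration
n_arms = len(true_probs)
epsilon = 0.1                  # fraction of exploratory pulls
counts = np.zeros(n_arms)      # number of pulls per arm
values = np.zeros(n_arms)      # running average reward per arm

rng = np.random.default_rng(0)
for t in range(1000):
    # Explore with probability epsilon, otherwise exploit the best current estimate
    if rng.random() < epsilon:
        arm = rng.integers(n_arms)
    else:
        arm = int(np.argmax(values))
    reward = float(rng.random() < true_probs[arm])        # Bernoulli reward
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean

print("Estimated arm values:", values)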

4. Reinforcement Learning

Reinforcement learning (RL) is a popular approach for solving sequential decision problems where the agent learns an optimal policy through trial and error, receiving rewards or penalties for its actions. RL is widely used in AI for tasks such as game playing, robotic control, and resource management.
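
As a sketch of how such learning proceeds, the function below performs one tabular Q-learning update. The names q_table, state, action, reward, and next_state, along with the learning rate and discount factor shown, are hypothetical and chosen only for illustration.

import numpy as np

alpha = 0.1   # learning rate (assumed)
gamma = 0.9   # discount factor (assumed)

def q_learning_update(q_table, state, action, reward, next_state):
    """One temporal-difference update of a tabular Q-function.
    q_table: array of shape (n_states, n_actions); state and next_state are integer indices."""
    best_next = np.max(q_table[next_state])      # value of the greedy action in the next state
    td_target = reward + gamma * best_next       # bootstrapped return estimate
    q_table[state, action] += alpha * (td_target - q_table[state, action])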

Sequential Decision Problem Solving with Value Iteration in Grid Environments

In this section, we solve a sequential decision-making problem using value iteration, a form of dynamic programming. The problem is modeled as a Markov Decision Process (MDP) in which the system's dynamics are described by states, actions, and rewards.

Here’s a breakdown of the key components and the technique used:

1. Sequential Decision Making Problem

The task involves navigating through a grid world, where the goal is to find an optimal policy that dictates the best action to take in each state to maximize the cumulative reward. The decision-making is sequential because each decision (or action) leads to a new state, and the choice of action at each step depends on the current state of the environment.

2. Markov Decision Process

  1. States: The grid positions, represented as tuples (i, j).
  2. Actions: Possible moves (Up, Down, Left, Right) which can alter the state.
  3. Rewards: Specific outcomes defined for reaching the goal, hitting obstacles, or moving to regular positions.
  4. Transitions: The result of taking an action in a state, leading to a new state.

Implementation

Step 1: Define the Environment and Initialize Parameters

In this step, we define the grid world's size and characteristics, including the goal state and obstacles. We also set key parameters like the discount factor and the convergence threshold.

import numpy as np

# Define the grid world parameters
grid_size = 3
goal_state = (2, 2)
obstacles = [(1, 1)]
gamma = 0.9 # Discount factor
epsilon = 0.01 # Convergence threshold

Step 2: Define Reward Function and Actions

Set up the reward function and the possible actions an agent can take within the grid.

# Define the reward function
def reward_function(state):
    if state == goal_state:
        return 1
    elif state in obstacles:
        return -1
    else:
        return 0

# Define possible actions and their effects
actions = {
    "Up": (-1, 0),
    "Down": (1, 0),
    "Left": (0, -1),
    "Right": (0, 1)
}

Step 3: Initialize Value Function and Policy

Set the initial value function and a random initial policy.

# Initialize value function and policy
V = np.zeros((grid_size, grid_size))
policy = np.random.choice(list(actions.keys()), (grid_size, grid_size))

Step 4: Implement the Value Iteration Algorithm
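
Value iteration repeatedly applies the Bellman optimality update V(s) ← max_a [R(s') + γ·V(s')], where s' is the state reached by taking action a in state s (this deterministic form matches the grid world used here). The loop below also calls two small helpers, is_valid_state and get_next_state, which appear again in the complete implementation further down; they are shown first so this step runs on its own.

def is_valid_state(state):
    x, y = state
    return 0 <= x < grid_size and 0 <= y < grid_size

def get_next_state(state, action):
    action_move = actions[action]
    next_state = (state[0] + action_move[0], state[1] + action_move[1])
    if not is_valid_state(next_state):
        return state  # invalid moves keep the agent in place
    return next_state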

def value_iteration(V, policy):
    while True:
        delta = 0
        new_V = np.copy(V)

        for i in range(grid_size):
            for j in range(grid_size):
                state = (i, j)
                if state == goal_state or state in obstacles:
                    continue

                action_values = []
                for action in actions:
                    next_state = get_next_state(state, action)
                    reward = reward_function(next_state)
                    action_value = reward + gamma * V[next_state]
                    action_values.append(action_value)

                best_action_value = max(action_values)
                new_V[state] = best_action_value
                best_action = list(actions.keys())[np.argmax(action_values)]
                policy[state] = best_action
                delta = max(delta, abs(V[state] - best_action_value))

        V = new_V
        if delta < epsilon:
            break

    return V, policy

V, optimal_policy = value_iteration(V, policy)

Step 5: Visualize the Results

Create a visual representation of the grid world, including the optimal policy with directional arrows.

import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Create a grid
ax.set_xticks(np.arange(-0.5, grid_size, 1), minor=True)
ax.set_yticks(np.arange(-0.5, grid_size, 1), minor=True)
ax.grid(which="minor", color="black", linestyle='-', linewidth=2)

# Draw obstacles and goal state
for obs in obstacles:
    ax.add_patch(plt.Rectangle((obs[1] - 0.5, obs[0] - 0.5), 1, 1, fill=True, color="red"))
ax.add_patch(plt.Rectangle((goal_state[1] - 0.5, goal_state[0] - 0.5), 1, 1, fill=True, color="green"))

# Draw policy arrows
for i in range(grid_size):
    for j in range(grid_size):
        state = (i, j)
        if state == goal_state or state in obstacles:
            continue
        action = optimal_policy[state]
        # Arrow drawing code based on the action (see the complete implementation below)

ax.set_aspect('equal')
plt.show()

Step 6: Output the Value Function and Policy

print("Optimal Value Function:")
print(V)
print("\nOptimal Policy:")
for i in range(grid_size):
    print([optimal_policy[(i, j)] for j in range(grid_size)])

Complete Implementation

Python
import numpy as np
import matplotlib.pyplot as plt

# Define the grid world parameters
grid_size = 3
goal_state = (2, 2)
obstacles = [(1, 1)]
gamma = 0.9  # Discount factor
epsilon = 0.01  # Convergence threshold

# Define the reward function
def reward_function(state):
    if state == goal_state:
        return 1
    elif state in obstacles:
        return -1
    else:
        return 0

# Define possible actions and their effects
actions = {
    "Up": (-1, 0),
    "Down": (1, 0),
    "Left": (0, -1),
    "Right": (0, 1)
}

# Initialize value function and policy
V = np.zeros((grid_size, grid_size))
policy = np.random.choice(list(actions.keys()), (grid_size, grid_size))

def is_valid_state(state):
    x, y = state
    return 0 <= x < grid_size and 0 <= y < grid_size

def get_next_state(state, action):
    action_move = actions[action]
    next_state = (state[0] + action_move[0], state[1] + action_move[1])
    if not is_valid_state(next_state):
        return state  # If action leads to invalid state, stay in current state
    return next_state

# Value Iteration Algorithm
def value_iteration(V, policy):
    while True:
        delta = 0
        new_V = np.copy(V)
        
        for i in range(grid_size):
            for j in range(grid_size):
                state = (i, j)
                
                if state == goal_state or state in obstacles:
                    continue
                
                action_values = []
                for action in actions:
                    next_state = get_next_state(state, action)
                    reward = reward_function(next_state)
                    action_value = reward + gamma * V[next_state]
                    action_values.append(action_value)
                
                best_action_value = max(action_values)
                new_V[state] = best_action_value
                best_action = list(actions.keys())[np.argmax(action_values)]
                policy[state] = best_action
                
                delta = max(delta, abs(V[state] - best_action_value))
        
        V = new_V
        if delta < epsilon:
            break
    
    return V, policy

# Run value iteration to solve the MDP
V, optimal_policy = value_iteration(V, policy)

# Visualization
fig, ax = plt.subplots()

# Create a grid
ax.set_xticks(np.arange(-0.5, grid_size, 1), minor=True)
ax.set_yticks(np.arange(-0.5, grid_size, 1), minor=True)
ax.grid(which="minor", color="black", linestyle='-', linewidth=2)

# Draw obstacles
for obs in obstacles:
    ax.add_patch(plt.Rectangle((obs[1] - 0.5, obs[0] - 0.5), 1, 1, fill=True, color="red"))

# Draw goal state
ax.add_patch(plt.Rectangle((goal_state[1] - 0.5, goal_state[0] - 0.5), 1, 1, fill=True, color="green"))

# Draw policy arrows
for i in range(grid_size):
    for j in range(grid_size):
        state = (i, j)
        if state == goal_state or state in obstacles:
            continue
        action = optimal_policy[state]
        if action == "Up":
            ax.arrow(j, i, 0, -0.4, head_width=0.2, head_length=0.2, fc='blue', ec='blue')
        elif action == "Down":
            ax.arrow(j, i, 0, 0.4, head_width=0.2, head_length=0.2, fc='blue', ec='blue')
        elif action == "Left":
            ax.arrow(j, i, -0.4, 0, head_width=0.2, head_length=0.2, fc='blue', ec='blue')
        elif action == "Right":
            ax.arrow(j, i, 0.4, 0, head_width=0.2, head_length=0.2, fc='blue', ec='blue')

# Set plot limits and labels
ax.set_xlim([-0.5, grid_size - 0.5])
ax.set_ylim([-0.5, grid_size - 0.5])
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_aspect('equal')
ax.set_title("Optimal Policy in Grid World")

plt.show()

# Print the results
print("Optimal Value Function:")
print(V)
print("\nOptimal Policy:")
for i in range(grid_size):
    print([optimal_policy[(i, j)] for j in range(grid_size)])

Output:

Optimal Value Function:
[[0.729 0.81  0.9  ]
 [0.81  0.    1.   ]
 [0.9   1.    0.   ]]

Optimal Policy:
['Down', 'Right', 'Down']
['Down', 'Down', 'Down']
['Right', 'Right', 'Right']

The graph represents the grid world, obstacles, goal state, and the optimal policy with arrows indicating the best action to take from each state.

Applications of Sequential Decision Problems

  1. Robotics: In robotics, sequential decision problems arise in navigation, path planning, and manipulation tasks. Robots must make a series of decisions to achieve a goal, such as reaching a destination or assembling a product, while accounting for dynamic changes in the environment.
  2. Finance: Financial decision-making often involves sequential decisions, such as portfolio management, where investors must decide how to allocate assets over time to maximize returns while managing risks.
  3. Healthcare: In healthcare, treatment planning for chronic diseases can be modeled as a sequential decision problem, where doctors must choose a series of treatments that optimize patient outcomes over time.
  4. Autonomous Systems: Autonomous vehicles, drones, and other autonomous systems rely on sequential decision-making to navigate complex environments, avoid obstacles, and achieve their objectives.

Challenges in Solving Sequential Decision Problems

  • Computational Complexity: As the number of states and actions increases, the computational complexity of finding an optimal policy grows exponentially. This is known as the "curse of dimensionality."
  • Uncertainty and Exploration: In many sequential decision problems, the agent must deal with uncertainty about the environment and the outcomes of its actions. Balancing exploration (gathering information) and exploitation (using known information to make decisions) is a key challenge.
  • Scalability: For large-scale problems, traditional methods may not be feasible. Approximation techniques, such as deep reinforcement learning, are often used to handle high-dimensional state and action spaces.

Conclusion

Sequential decision problems are a fundamental aspect of AI, playing a crucial role in various applications where decision-making over time is essential. Understanding the structure of these problems and the methods used to solve them is key to advancing AI research and developing intelligent systems capable of making complex, long-term decisions. As AI continues to evolve, the ability to tackle more sophisticated sequential decision problems will become increasingly important, driving innovation in fields ranging from robotics to finance and beyond.

