Unit 3
• There are situations, however, where more than one agent is
searching for a solution in the same search space; this
typically occurs in game playing.
• This definition of optimal play for MAX assumes that MIN also plays
optimally: it maximizes MAX's worst-case outcome.
• What happens if MIN isn’t performing at its best?
• Then it is a simple matter of showing that MAX can do even
better. Other strategies may outperform the minimax method against
suboptimal opponents, but they will necessarily do worse against optimal opponents.
Heuristic Alpha-Beta Tree Search (Alpha-Beta Pruning)
• Alpha-Beta pruning is not actually a new algorithm, but
rather an optimization technique for the minimax
algorithm.
• It reduces the computation time by a huge factor. This
allows us to search much faster and even go into
deeper levels in the game tree.
• It cuts off branches in the game tree that need not be
searched because a better move is already available.
• It is called Alpha-Beta pruning because it passes 2 extra
parameters in the minimax function, namely alpha and
beta.
• Let’s define the parameters alpha and beta.
• Alpha is the best value that the maximizer can currently
guarantee at that level or above.
• Beta is the best value that the minimizer can currently
guarantee at that level or below.
• While backtracking, node values (not alpha and beta values) are
passed up to parent nodes.
• Alpha and beta values are passed only to child nodes.
[Figure: the game tree so far, with the leaf 9 crossed out because it was never computed.]
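Below is a minimal Python sketch of minimax with alpha-beta pruning, to make the pruning rule concrete. The tree representation (nested lists of leaf values) and the function name are illustrative assumptions, not taken from these slides; in the small example tree, the leaf 9 is never evaluated, mirroring the crossed-out node above.

```python
import math

def alphabeta(node, alpha, beta, maximizing):
    # Leaf node: return its static value.
    if not isinstance(node, list):
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)   # best value MAX can guarantee so far
            if alpha >= beta:           # MIN above already has a better option
                break                   # prune the remaining children
        return value
    else:
        value = math.inf
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True))
            beta = min(beta, value)     # best value MIN can guarantee so far
            if beta <= alpha:           # MAX above already has a better option
                break                   # prune the remaining children
        return value

# Example tree: MAX root, two MIN children with leaves [3, 5] and [2, 9].
# After the left branch, alpha = 3; in the right branch the leaf 2 makes
# beta = 2 <= alpha, so the leaf 9 is pruned and never computed.
tree = [[3, 5], [2, 9]]
print(alphabeta(tree, -math.inf, math.inf, True))   # -> 3
```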
Monte Carlo Tree Search (MCTS)
• It is a probabilistic, heuristic-driven search algorithm
that combines classic tree search with principles
from reinforcement learning.
In tree search, there is always the possibility that the current
best action is actually not the optimal action.
In such cases, the MCTS algorithm becomes useful, as it periodically
continues to evaluate other alternatives during the learning
phase by executing them instead of the currently perceived
optimal strategy. This is known as the "exploration-exploitation
trade-off" (the dilemma that arises when choosing between exploring
new options and exploiting existing knowledge).
Here are some reasons why MCTS is commonly used:
• Handling Complex and Strategic Games
• Unknown or Imperfect Information
• Optimizing Exploration and Exploitation
• Scalability and Parallelization
• Monte Carlo Tree Search (MCTS) algorithm:
In MCTS, nodes are the building blocks of the
search tree.
• These nodes are formed based on the outcome of
a number of simulations.
• The process of Monte Carlo Tree Search can be
broken down into four distinct steps, viz.,
selection, expansion, simulation and
backpropagation.
• Each of these steps is explained in detail below:
Selection
• In this process, the MCTS algorithm traverses the
current tree from the root node using a specific strategy.
• The strategy uses an evaluation function to optimally
select nodes with the highest estimated value.
• MCTS uses the Upper Confidence Bound (UCB)
formula applied to trees as the strategy in the selection
process to traverse the tree.
• It balances the exploration-exploitation trade-off.
During tree traversal, the child node that maximizes this
value is selected. The formula typically used for this
purpose is given below.
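The selection rule usually used is the UCB1 formula applied to trees (often written UCT); the exact constant and notation vary between presentations, so the version below is one common statement:

UCT(i) = \frac{w_i}{n_i} + c \sqrt{\frac{\ln N}{n_i}}

where w_i is the total reward (e.g. the number of wins) recorded for child i, n_i is the number of times child i has been visited, N is the number of times its parent has been visited, and c is an exploration constant (commonly \sqrt{2}). The first term favours exploitation of children that have performed well, while the second term favours exploration of rarely visited children.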
• Expansion: In this process, a new child node is added to the
tree to that node which was optimally reached during the
selection process.
• Simulation: In this process, a simulation is performed by
choosing moves or strategies until a result or predefined
state is achieved.
• Backpropagation: After determining the value of the
newly added node, the remaining tree must be updated. So,
the backpropagation process is performed, where it
backpropagates from the new node to the root node. During
the process, the number of simulations stored in each node is
incremented. Also, if the new node's simulation results in a
win, then the number of wins is also incremented.
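As a rough illustration of how the four steps fit together, here is a minimal MCTS sketch in Python. The game-state interface (legal_moves, play, is_terminal, result) is an assumed placeholder rather than an API from these slides, and for brevity the reward is not flipped between the two players' perspectives.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.untried_moves = state.legal_moves()  # moves not yet expanded
        self.wins = 0     # total reward backed up through this node
        self.visits = 0   # number of simulations through this node

    def uct_select(self, c=math.sqrt(2)):
        # Selection: choose the child with the highest UCT value.
        return max(self.children,
                   key=lambda ch: ch.wins / ch.visits
                   + c * math.sqrt(math.log(self.visits) / ch.visits))

def mcts(root_state, iterations=1000):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: walk down while the node is fully expanded.
        while not node.untried_moves and node.children:
            node = node.uct_select()
        # 2. Expansion: add one child for an untried move.
        if node.untried_moves:
            move = node.untried_moves.pop()
            child = Node(node.state.play(move), parent=node)
            node.children.append(child)
            node = child
        # 3. Simulation: play random moves until a terminal state is reached.
        state = node.state
        while not state.is_terminal():
            state = state.play(random.choice(state.legal_moves()))
        reward = state.result()   # e.g. 1 for a win, 0 otherwise
        # 4. Backpropagation: update visit and win counts up to the root.
        while node is not None:
            node.visits += 1
            node.wins += reward
            node = node.parent
    # Recommend the move leading to the most-visited child of the root.
    return max(root.children, key=lambda ch: ch.visits).state
```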
Stochastic Games
• Stochastic games represent dynamic
interactions in which the environment changes
in response to players' behavior.
• As Shapley puts it, "In a stochastic game
the play proceeds by steps from position to
position, according to transition probabilities
controlled jointly by the two players"
• A stochastic game is played by a set of players. In
each stage of the game, the play is in a given state
(or position, in Shapley's language), taken from a
set of states, and every player chooses an action
from a set of available actions.
• The collection of actions that the players choose,
together with the current state, determines the
stage payoff that each player receives, as well as
a probability distribution according to which the
new state is selected.
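As a small concrete illustration (the state names, actions, payoffs, and probabilities below are invented for this sketch, not taken from Shapley), one stage of a two-player stochastic game can be represented as a table mapping the current state and the joint action to a stage payoff and a transition distribution over next states:

```python
import random

# Hypothetical two-player stochastic game: each (state, action1, action2)
# entry gives the stage payoffs and the distribution over the next state.
game = {
    ("s1", "a", "x"): {"payoffs": (3, -3), "next": {"s1": 0.7, "s2": 0.3}},
    ("s1", "a", "y"): {"payoffs": (0, 0),  "next": {"s2": 1.0}},
    ("s1", "b", "x"): {"payoffs": (-1, 1), "next": {"s1": 0.5, "s2": 0.5}},
    ("s1", "b", "y"): {"payoffs": (2, -2), "next": {"s1": 1.0}},
}

def play_stage(state, a1, a2):
    # The current state and the joint action determine both the payoffs
    # and the probabilities of the next state.
    entry = game[(state, a1, a2)]
    next_state = random.choices(list(entry["next"]),
                                weights=list(entry["next"].values()))[0]
    return entry["payoffs"], next_state

payoffs, new_state = play_stage("s1", "a", "x")
print(payoffs, new_state)   # e.g. (3, -3) s1
```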
https://youtu.be/xXE5AwzNQ2s?si=YoO2HN-PnCvay8ea
• Partially Observable Games, often referred to as
Partially Observable Markov Decision Processes
(POMDPs), are a class of problems and models in
artificial intelligence that involve decision-making in
situations where an agent's observations do not provide
complete information about the underlying state of the
environment.
• POMDPs are an extension of Markov Decision
Processes (MDPs) to scenarios where uncertainty and
partial observability are significant factors. They are
commonly used to model and solve problems in various
domains, including robotics, healthcare, finance, and
game playing.
Key Characteristics of Partially
Observable Games (POMDPs):
• Partial Observability: In POMDPs, the agent's observations are
incomplete and do not directly reveal the true state of the
environment. This introduces uncertainty, as the agent must reason
about the possible states given its observations.
• Hidden States: The environment's true state, also known as the
hidden state, evolves according to a probabilistic process. The
agent's observations provide noisy or incomplete information about
this hidden state.
• Belief State: To handle partial observability, the agent maintains a
belief state, which is a probability distribution over possible hidden
states. The belief state captures the agent's uncertainty about the true
state of the environment.
• Action and Observation: The agent takes actions based on its
belief state, and it receives observations that depend on the hidden
state. These observations help the agent update its belief state and
make decisions.
• Objective and Policy: The agent's goal is to find a policy, a
mapping from belief states to actions, that maximizes a specific
objective, such as cumulative rewards or long-term expected utility.
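To make the belief-state idea concrete, here is a minimal discrete Bayes-filter update in Python. The transition model T[s][a][s'] = P(s' | s, a), the observation model O[s'][a][o] = P(o | s', a), and the example states and observations are illustrative assumptions, not part of any standard library.

```python
def update_belief(belief, action, observation, T, O):
    """Return the new belief over hidden states after acting and observing."""
    new_belief = {}
    for s_next in belief:
        # Predict: probability of reaching s_next given the old belief and action.
        predicted = sum(belief[s] * T[s][action][s_next] for s in belief)
        # Correct: weight by the likelihood of the received observation in s_next.
        new_belief[s_next] = O[s_next][action][observation] * predicted
    # Normalize so the belief remains a probability distribution.
    total = sum(new_belief.values())
    return {s: p / total for s, p in new_belief.items()} if total > 0 else belief

# Toy two-state example (numbers invented for illustration).
T = {"hot":  {"wait": {"hot": 0.8, "cold": 0.2}},
     "cold": {"wait": {"hot": 0.3, "cold": 0.7}}}
O = {"hot":  {"wait": {"warm": 0.9, "cool": 0.1}},
     "cold": {"wait": {"warm": 0.2, "cool": 0.8}}}
belief = {"hot": 0.5, "cold": 0.5}
print(update_belief(belief, "wait", "warm", T, O))  # belief shifts towards "hot"
```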
Solving Partially Observable Games (POMDPs):
Solving POMDPs is challenging due to the added complexity of
partial observability. Traditional techniques used for MDPs, such as
dynamic programming and value iteration, are not directly
applicable to POMDPs. Instead, specialized algorithms and
techniques are developed to address the partial observability:
• Belief Space Methods: These methods work directly in the space of belief
states and involve updating beliefs based on observations and actions.
Techniques like the POMDP forward algorithm and backward induction are
used to compute optimal policies.