cs188-su24-lec06
[These slides adapted from Dan Klein, Pieter Abbeel, Anca Dragan, Stuart Russell, and many others]
Behavior from Computation
Types of Games
o Axes:
o Deterministic or stochastic?
o One, two, or more players?
o Zero sum?
o Perfect information (can you see the state)?
Game Playing State-of-the-Art
o Chess
o (1997): Deep Blue defeats human champion Garry Kasparov in a six-game match. Current programs are even better, if less historic.
o Go
o (2016): AlphaGo defeats human champion Lee Sedol. Uses Monte Carlo Tree Search and a learned evaluation function.
Deterministic Games with Terminal Utilities
o Many possible formalizations, one is (sketched in code below):
o States: S (start at s0)
o Players: P = {1...N} (usually take turns)
o Actions: A (may depend on player / state)
o Transition Function: S x A → S
o Terminal Test: S → {t, f}
o Terminal Utilities: S x P → R
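A minimal sketch of this formalization as a Python interface. All names here (Game, result, utility, …) are illustrative assumptions, not the course's starter code.

```python
from abc import ABC, abstractmethod
from typing import List

class Game(ABC):
    """Deterministic N-player game with terminal utilities."""

    @abstractmethod
    def initial_state(self):                 # s0 in S
        ...

    @abstractmethod
    def player(self, state) -> int:          # whose turn it is, in 1..N
        ...

    @abstractmethod
    def actions(self, state) -> List:        # A, may depend on player/state
        ...

    @abstractmethod
    def result(self, state, action):         # transition function S x A -> S
        ...

    @abstractmethod
    def is_terminal(self, state) -> bool:    # terminal test S -> {t, f}
        ...

    @abstractmethod
    def utility(self, state, player: int) -> float:  # terminal utility S x P -> R
        ...
```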
Value of a State
o Value of a state: the best achievable outcome (utility) from that state
o Non-terminal states: V(s) = max over successors s' of V(s')
o Terminal states: V(s) is known (part of the game definition)
[Game tree figure: terminal values 2, 0, …, 2, 6, …, 4, 6]
Adversarial Game Trees
[Adversarial game tree figure; terminal states with values -8, -5, -10, +8]
Tic-Tac-Toe Game Tree
Adversarial Search (Minimax)
o Deterministic, zero-sum games:
o Tic-tac-toe, chess, checkers
o One player maximizes result
o The other minimizes result
o Minimax search:
o A state-space search tree
o Players alternate turns
o Compute each node's minimax value: the best achievable utility against a rational (optimal) adversary
Minimax values: computed recursively
Terminal values: part of the game
[Minimax tree figure: max node with value 5 over min nodes with values 2 and 5; terminal values 8, 2, 5, 6]
Minimax Implementation (Dispatch)
def value(state):
    if the state is terminal: return the state's utility
    if the next agent is MAX: return max-value(state)
    if the next agent is MIN: return min-value(state)
[Minimax tree figure: root value 3; min-node values 3, 2, 2; terminal values 3, 12, 8, 2, 4, 6, 14, 5, 2]
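A runnable version of this dispatch for a two-player zero-sum game, with max-value and min-value filled in the obvious way. It assumes the hypothetical Game interface sketched earlier, with player 1 as MAX.

```python
def minimax_value(game, state):
    """Minimax value of a state: the best achievable utility for MAX
    against an optimal adversary."""
    if game.is_terminal(state):
        return game.utility(state, player=1)   # utility from MAX's perspective
    if game.player(state) == 1:                # MAX to move
        return max_value(game, state)
    return min_value(game, state)              # MIN to move

def max_value(game, state):
    v = float("-inf")
    for a in game.actions(state):
        v = max(v, minimax_value(game, game.result(state, a)))
    return v

def min_value(game, state):
    v = float("inf")
    for a in game.actions(state):
        v = min(v, minimax_value(game, game.result(state, a)))
    return v
```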
Minimax Properties
[Game tree figure: max node over min nodes; terminal values 10, 10, 9, 100]
Alpha-Beta Pruning
[Pruning example: root value 3; min children valued 3, ≤2 (remaining branch pruned), and 2; terminal values 3, 12, 8, 2, 14, 5, 2]
Alpha-Beta Implementation
[Figure: alternating MAX/MIN layers; MIN node a is an ancestor of MIN node n, with α/β bounds passed down the path]
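The slide's pseudocode did not survive extraction; below is a hedged reconstruction of standard alpha-beta search in the same style as the minimax sketch above, where α tracks MAX's best option on the path to the root and β tracks MIN's.

```python
def alphabeta_max(game, state, alpha, beta):
    if game.is_terminal(state):
        return game.utility(state, player=1)
    v = float("-inf")
    for a in game.actions(state):
        v = max(v, alphabeta_min(game, game.result(state, a), alpha, beta))
        if v >= beta:            # a MIN ancestor would never allow this node
            return v             # prune the remaining children
        alpha = max(alpha, v)
    return v

def alphabeta_min(game, state, alpha, beta):
    if game.is_terminal(state):
        return game.utility(state, player=1)
    v = float("inf")
    for a in game.actions(state):
        v = min(v, alphabeta_max(game, game.result(state, a), alpha, beta))
        if v <= alpha:           # a MAX ancestor would never allow this node
            return v
        beta = min(beta, v)
    return v

# Initial call: alphabeta_max(game, s0, float("-inf"), float("inf"))
```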
Alpha-Beta Pruning Properties
o This pruning has no effect on minimax value computed for the root!
Alpha-Beta Quiz 2
[Quiz tree figure: node bounds include ≤2 and ≥100; leaf values 10, 2, 10]
Resource Limits
o Problem: In realistic games, cannot search to leaves!
o Example:
o Suppose we have 100 seconds, can explore 10K nodes / sec
o So can check 1M nodes per move
o Alpha-beta reaches about depth 8 – a decent chess program
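A sketch of the standard remedy hinted at here and in the Summary (bounded-depth search with an approximate evaluation function in place of terminal utilities), reusing the hypothetical Game interface from above; evaluate is an assumed heuristic, not part of the slides.

```python
def depth_limited_value(game, state, depth, evaluate):
    """Minimax to a fixed depth; evaluate() estimates non-terminal states."""
    if game.is_terminal(state):
        return game.utility(state, player=1)
    if depth == 0:
        return evaluate(state)      # heuristic guess, not a true utility
    values = (depth_limited_value(game, game.result(state, a),
                                  depth - 1, evaluate)
              for a in game.actions(state))
    return max(values) if game.player(state) == 1 else min(values)
```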
[Demo: thrashing d=2, thrashing d=2 (fixed evaluation function), smart ghosts coordinate (L6D6,7,8,10)]
Video of Demo Thrashing (d=2)
Why Pacman Starves
Multi-Agent Utilities
o Generalization of minimax:
o Terminals have utility tuples
o Node values are also utility tuples
o Each player maximizes its own component
o Can give rise to cooperation and competition dynamically…
[Tree figure with utility tuples, e.g., root value (1,6,6)]
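A sketch of this tuple-valued recursion. It assumes a hypothetical game.num_players attribute and players numbered 1..N as in the formalization above.

```python
def multi_value(game, state):
    """Returns a utility tuple, one component per player (players 1..N);
    the player to move picks the child best in its own component."""
    if game.is_terminal(state):
        return tuple(game.utility(state, p)
                     for p in range(1, game.num_players + 1))
    p = game.player(state)
    return max((multi_value(game, game.result(state, a))
                for a in game.actions(state)),
               key=lambda u: u[p - 1])   # maximize own component only
```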
Expectimax Search
[Tree figure: max node over chance nodes; leaf values 10, 10, 9, 100]
Expectimax Pseudocode
def value(state):
    if the state is a terminal state: return the state's utility
    if the next agent is MAX: return max-value(state)
    if the next agent is EXP: return exp-value(state)

def exp-value(state):
    initialize v = 0
    for each successor of state:
        p = probability(successor)
        v += p * value(successor)
    return v

[Example chance node: successor probabilities 1/2, 1/3, 1/6 with values 8, 24, -12; v = (1/2)(8) + (1/3)(24) + (1/6)(-12) = 10]
[Expectimax tree figure: terminal values 3, 12, 9, 2, 4, 6, 15, 6, 0]
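A runnable counterpart to the pseudocode above, assuming chance-node probabilities come from a hypothetical game.probability(state, action) model of the opponent.

```python
def expectimax_value(game, state):
    if game.is_terminal(state):
        return game.utility(state, player=1)
    if game.player(state) == 1:                 # MAX node
        return max(expectimax_value(game, game.result(state, a))
                   for a in game.actions(state))
    # EXP (chance) node: probability-weighted average of children
    v = 0.0
    for a in game.actions(state):
        p = game.probability(state, a)          # assumed opponent model
        v += p * expectimax_value(game, game.result(state, a))
    return v
```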
Expectimax Pruning?
Depth-Limited Expectimax
[Figure: depth-limited estimates (e.g., 400, 300 at one depth; 492, 362 deeper) approximate the true expectimax value, which would require a lot of work to compute]
What Probabilities to Use?
Assumptions vs. Reality
Pacman used depth 4 search with an eval function that avoids trouble
Ghost used depth 2 search with an eval function that seeks Pacman
[Demos: world assumptions (L7D3,4,5,6)]
Video of Demo World Assumptions
Random Ghost – Expectimax Pacman
Video of Demo World Assumptions
Adversarial Ghost – Minimax Pacman
Video of Demo World Assumptions
Adversarial Ghost – Expectimax Pacman
Video of Demo World Assumptions
Random Ghost – Minimax Pacman
Mixed Layer Types
o E.g. Backgammon
o Expectiminimax
o Environment is an extra "random agent" player that moves after each min/max agent
o Each node computes the appropriate combination of its children
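A sketch of this combination, assuming a hypothetical game.node_type(state) tag for each layer and the same probability model as the expectimax sketch above.

```python
def expectiminimax(game, state):
    if game.is_terminal(state):
        return game.utility(state, player=1)
    kind = game.node_type(state)                # "MAX", "MIN", or "CHANCE"
    if kind == "MAX":
        return max(expectiminimax(game, game.result(state, a))
                   for a in game.actions(state))
    if kind == "MIN":
        return min(expectiminimax(game, game.result(state, a))
                   for a in game.actions(state))
    # CHANCE: the environment as an extra "random agent"
    return sum(game.probability(state, a) *
               expectiminimax(game, game.result(state, a))
               for a in game.actions(state))
```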
Example: Backgammon
o Dice rolls increase b: 21 possible rolls with 2 dice
o Backgammon has about 20 legal moves
o Depth 2 = 20 x (21 x 20)^3 ≈ 1.5 x 10^9
MCTS Version 2: UCT
o Repeat until out of time:
o Given the current search tree, recursively apply UCB to choose a
path down to a leaf (not fully expanded) node n
o Add a new child c to n and run a rollout from c
o Update the win counts from c back up to the root
o Choose the action leading to the child with highest N
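A compact sketch of this loop, assuming two-player zero-sum play scored from MAX's perspective and UCB1 with exploration constant C = √2; all names (Node, ucb1, uct_search) are illustrative.

```python
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}        # action -> Node
        self.U = 0                # total wins seen in rollouts through here
        self.N = 0                # visit count

def ucb1(child, parent, c=math.sqrt(2)):
    if child.N == 0:
        return float("inf")       # always try unvisited children first
    return child.U / child.N + c * math.sqrt(math.log(parent.N) / child.N)

def uct_search(game, root_state, n_iters=1000):
    root = Node(root_state)
    for _ in range(n_iters):
        # 1. Selection: apply UCB down to a not-fully-expanded node n
        node = root
        while (not game.is_terminal(node.state)
               and len(node.children) == len(game.actions(node.state))):
            node = max(node.children.values(), key=lambda ch: ucb1(ch, node))
        # 2. Expansion: add a new child c to n
        if not game.is_terminal(node.state):
            a = random.choice([a for a in game.actions(node.state)
                               if a not in node.children])
            node.children[a] = Node(game.result(node.state, a), parent=node)
            node = node.children[a]
        # 3. Rollout from c: simulate to a terminal state with a random policy
        state = node.state
        while not game.is_terminal(state):
            state = game.result(state, random.choice(game.actions(state)))
        win = game.utility(state, player=1) > 0   # 1 if MAX wins, else 0
        # 4. Backup: update win counts from c back up to the root
        while node is not None:
            node.N += 1
            node.U += int(win)
            node = node.parent
    # Choose the action leading to the child with highest N
    return max(root.children, key=lambda a: root.children[a].N)
```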
UCT Example
[UCT tree figure: nodes labeled with win counts / visit counts, e.g., 5/10 and 4/9]
Why is there no min or max?????
o “Value” of a node, U(n)/N(n), is a weighted sum of child values!
o Idea: as N → ∞, the vast majority of rollouts are concentrated in the best children, so the weighted average → max/min
o Theorem: as N → ∞, UCT selects the minimax move
o (but N never approaches infinity!)
Summary
o Games require decisions when optimality is impossible
o Bounded-depth search and approximate evaluation functions
o Games force efficient use of computation
o Alpha-beta pruning, MCTS
o Game playing has produced important research ideas
o Reinforcement learning (checkers)
o Iterative deepening (chess)
o Rational metareasoning (Othello)
o Monte Carlo tree search (chess, Go)
o Solution methods for partial-information games in economics (poker)
o Video games present much greater challenges – lots to do!
o b = 10^500, |S| = 10^4000, m = 10,000, partially observable, often > 2 players