Computing Science (CMPUT) 455

Search, Knowledge, and Simulations

Martin Müller

Department of Computing Science


University of Alberta
[email protected]

Fall 2024

1
455 Today - Lecture 21

• Introduction to Reinforcement Learning (RL)


• TD-Gammon, an early example of reinforcement learning
with neural nets in games

Coursework
• Work on Assignment 4
• Reading and activities: Sutton RL tutorial + slides
• Quiz 11: Neural Networks and Deep Learning
(double length)

2
Part V

RL, AlphaGo and Beyond

3
Reinforcement Learning

• Reinforcement Learning (RL) introduction


• Credit assignment problem
• Learning from rewards and temporal differences
• TD-Gammon as early example
• Training by RL
• Deep RL

4
Reinforcement Learning (RL)

• Activity - watch the tutorial and slides by Rich Sutton


• Brief review in class only
• Focus on what we need for AlphaGo
• Discuss Gerry Tesauro’s TD-Gammon program
• Early big success story for RL in heuristic search
• Early example of neural nets in games

5
Basic Concepts of RL

• Observe input St (state of game at time t)


• Produce move, action At
• Observe reward (quality of action) Rt+1
• Note that reward occurs at next timestep
• Often, the reward is delayed
• Most games: reward (win/loss/score) only at end of game
• Reward 0 at all earlier time steps
• Interaction produces a trajectory: S0 , A0 , R1 , S1 , A1 , . . .
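As a rough illustration (not from the lecture), the interaction loop that produces such a trajectory can be sketched in a few lines of Python; the env and policy objects are hypothetical placeholders with a Gym-like interface:

```python
# Minimal agent-environment loop producing a trajectory S0, A0, R1, S1, A1, ...
# 'env' and 'policy' are hypothetical placeholders (Gym-like interface assumed).
def generate_trajectory(env, policy):
    trajectory = []
    state = env.reset()                      # observe S0
    done = False
    while not done:
        action = policy(state)               # produce action At
        state_next, reward, done = env.step(action)   # observe Rt+1 and St+1
        trajectory.append((state, action, reward))
        state = state_next
    return trajectory                        # [(S0, A0, R1), (S1, A1, R2), ...]
```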

6
RL vs Supervised Learning in Games

Supervised Learning
• Label for each move, or each position in training set
• Move label: Good/bad, expert move/not expert move
• Position label: evaluation, e.g. win/loss/draw in Tic Tac Toe
• Learn - minimize prediction error on given data set
• Can use mathematical optimization techniques, e.g.
gradient descent

Reinforcement Learning
• Reward for whole game sequence only
• Learn - try to improve gameplay by trial and error
• Which of our actions were good, and which were bad?
• Need to solve the credit assignment problem

7
Credit Assignment Problem

• Reward for (possibly long) sequence of decisions


• No direct reward for each single move decision
• How can we tell which moves are good or bad?
• Distribute reward from end of game over all actions
• Difficult problem
• RL provides the most popular answers
• Main idea: if same action happens in many different
sequences, we can learn if it leads to more wins or losses

8
Basic Concepts - Policies and Value Functions

• RL: often learn a policy, or a value function, or both


• Policy π : S → ∆(A)
• Mapping from states in S to Actions in A
• ∆(A): probability distribution over actions
• Special case: deterministic policy
• For each state s, a single action is taken with probability 1
• Value functions
• Have the role of evaluation functions in RL
• State-value function vπ (s):
value of state s when we follow policy π
• Action-value function qπ (s, a): value of action a in state s
when we follow policy π afterwards
• Discount factor 0 < γ ≤ 1 - discounting future rewards
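As a small sketch (names and example probabilities are ours, not from the slides), a stochastic policy maps a state to a probability distribution over actions, and a deterministic policy is the special case that puts probability 1 on one action:

```python
import random

# Stochastic policy: state -> {action: probability} (example distribution is made up)
def stochastic_policy(state):
    return {"a1": 0.7, "a2": 0.3}

# Deterministic policy as a special case: probability 1 on a single action
def deterministic_policy(state):
    return {"a1": 1.0}

def sample_action(policy, state):
    dist = policy(state)                     # a ~ pi(s)
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs, k=1)[0]
```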

9
Value Functions

• Popular approach to solving the credit assignment


problem: value-based reinforcement learning
• Estimate one or both of:
• State-value function:

  vπ(s) = E[ Σ_{t=0}^{T} γ^t Rt | S0 = s, At ∼ π(St) ]

• Action-value function:

  qπ(s, a) = E[ Σ_{t=0}^{T} γ^t Rt | S0 = s, A0 = a, At>0 ∼ π(St) ]

where π : S → ∆(A) is a stochastic policy
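The quantity inside the expectation is the discounted return of a single trajectory. A minimal sketch of computing it from a list of sampled rewards (the reward list is assumed to be in time order, as produced by the interaction loop above):

```python
def discounted_return(rewards, gamma=1.0):
    # rewards: rewards along one trajectory, in time order; gamma in (0, 1]
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: reward 0 for early moves, 1 for a win at the end, gamma = 0.9:
# discounted_return([0, 0, 1], gamma=0.9) == 0.81
```

Monte Carlo estimates of vπ and qπ (later slides) average this quantity over many sampled trajectories.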

10
Policy Iteration

You can construct a new policy by solving the credit assignment


problem for an old policy:
1. Initialization: Set π(s) arbitrarily for all s ∈ S
2. Policy Evaluation: Compute estimates V (s) for state values
3. Policy Improvement: New policy chooses action that leads
to highest value of V
4. If policy is stable, stop; else goto 2 using new policy
The new policy is “stable” if it chooses the same actions as the
old one at every state.
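A tabular sketch of this loop for a small, deterministic MDP (the next_state and reward dictionaries describing the MDP are hypothetical inputs, and policy evaluation uses a fixed number of sweeps for simplicity):

```python
def policy_iteration(states, actions, next_state, reward, gamma=0.9, eval_sweeps=100):
    # next_state[(s, a)] and reward[(s, a)] describe an assumed deterministic MDP.
    policy = {s: actions[0] for s in states}          # 1. arbitrary initial policy
    while True:
        # 2. Policy evaluation: repeated sweeps of V(s) <- R(s, pi(s)) + gamma * V(s')
        V = {s: 0.0 for s in states}
        for _ in range(eval_sweeps):
            for s in states:
                a = policy[s]
                V[s] = reward[(s, a)] + gamma * V[next_state[(s, a)]]
        # 3. Policy improvement: choose the action leading to the highest value
        new_policy = {
            s: max(actions, key=lambda a: reward[(s, a)] + gamma * V[next_state[(s, a)]])
            for s in states
        }
        # 4. Stop if the policy is stable, else repeat with the new policy
        if new_policy == policy:
            return policy, V
        policy = new_policy
```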

11
Self-Play In Games

• Problem: standard RL solves a single-agent problem


• “Expected reward from following a policy” is ill-defined for
games with two or more players
• Our reward depends on the other player’s actions,
chosen by their policy
• Solution: self-play
• Each policy is part of “the environment” for the other
• Train player policies simultaneously
• Simplest approach: train one policy, use for all players
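A minimal sketch of this simplest variant (one shared policy acting for both players), assuming a hypothetical two-player game interface:

```python
# Self-play with a single shared policy; 'game' is a hypothetical interface.
def self_play_game(game, policy):
    state = game.initial_state()
    trajectory = []
    while not game.is_terminal(state):
        action = policy(state)               # the same policy moves for both players
        trajectory.append((state, action))
        state = game.apply(state, action)
    # The final result, viewed from each player's side, provides the rewards
    return trajectory, game.result(state)
```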

12
Monte Carlo Reinforcement Learning

  qπ(s, a) = E[ Σ_{t=0}^{T} Rt | S0 = s, A0 = a, At>0 ∼ π(St) ]

• Notation: a ∼ π(s) means: sample action a with


probabilities given by distribution π(s)
• In lecture 12 we already saw how to estimate an expected
winrate using simulations
• Can do the same with any kind of rewards
• Play out (many) games using policy π
• Find the average total return from every trajectory that
starts from (or goes through) s, a
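A sketch of this idea as every-visit Monte Carlo estimation of qπ(s, a), averaging the return that follows each occurrence of (s, a) over many episodes (episode format as in the earlier interaction-loop sketch; states and actions assumed hashable):

```python
from collections import defaultdict

def mc_action_values(episodes, gamma=1.0):
    # episodes: list of [(S0, A0, R1), (S1, A1, R2), ...] trajectories
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Walk backwards so G is the return following each (state, action)
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[(state, action)] += G
            returns_count[(state, action)] += 1
    return {sa: returns_sum[sa] / returns_count[sa] for sa in returns_sum}
```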

13
Monte Carlo Advantages and Disadvantages

Advantages:
• Conceptually straightforward
• Very parallelizable
• No dependence at all between state estimates

Disadvantages:
• Estimates of one state’s value are not used to improve
estimates of another
• Can only estimate the value of states and actions that are
visited sufficiently often in some trajectory
⇒ Slow, data-inefficient

14
Temporal Difference (TD) Learning and TD(λ)

• Sutton (1988)
• Learn a model - a function from inputs to outputs
• Given only action sequences and rewards
• Learns a prediction (what is the best move?)
• Samples the environment (plays games)
• Compares learned estimate in each state with reward
• Learns from the difference
• TD(λ): the trace-decay parameter λ weights how strongly later outcomes affect earlier predictions
• The sooner after the current state the reward happens, the
higher the effect
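The core idea can be sketched as the tabular TD(0) update below; full TD(λ) additionally keeps eligibility traces that spread each update over earlier states, which is omitted here:

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=1.0, terminal=False):
    # Move V(state) toward the bootstrapped target: reward + gamma * V(next_state)
    target = reward + (0.0 if terminal else gamma * V.get(next_state, 0.0))
    td_error = target - V.get(state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * td_error
    return td_error
```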

15
Monte Carlo vs. Temporal Difference

[Diagrams: MC learning vs. TD learning]
• Monte Carlo: learn from whole simulations
• TD: learn from differences of current and next values

16
Temporal Difference High-level Ideas

• Usually, predictions from states closer to the end are more


reliable
• We can adjust earlier predictions, “trickle down”
• Bootstrapping - learn predictions from other predictions
• Whole process is grounded in the true final rewards
• This is one successful approach to solving the credit
assignment problem in practice
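One way to picture the "trickle down", building on the TD update sketched above (the episode format and reward-only-at-the-end setting are assumptions for illustration):

```python
def learn_from_episode(V, visited_states, final_reward, alpha=0.1, gamma=1.0):
    # visited_states: [S0, S1, ..., ST]; only the final outcome is a true reward.
    # Earlier states bootstrap from the next state's current prediction, so value
    # information grounded in the final reward gradually reaches earlier states.
    for t in range(len(visited_states)):
        if t + 1 < len(visited_states):
            target = gamma * V.get(visited_states[t + 1], 0.0)   # reward 0 before the end
        else:
            target = final_reward                                # grounded in true outcome
        s = visited_states[t]
        V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
```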

17
Function Approximation

• Tabular learning:
Value of each state / state-action is tracked separately
• Function approximation:
Learn a model of values instead
• Based on features of the state / state-action
• Can use either Monte Carlo or TD updates
• Advantage: Generalization. The model can guess values
for similar states that it has never visited before.
• Disadvantage: Over-generalization. Different states can be
conflated if the features are insufficiently detailed.
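A sketch of the simplest function approximator: a linear value function over hand-chosen features, trained with the same TD(0)-style update (the features function is a hypothetical placeholder returning a list of floats):

```python
# Linear value function approximation with a TD(0)-style update.
def predict(weights, feature_vec):
    return sum(w * x for w, x in zip(weights, feature_vec))

def td0_linear_update(weights, features, state, reward, next_state,
                      alpha=0.01, gamma=1.0, terminal=False):
    x = features(state)                                  # hypothetical feature extractor
    v = predict(weights, x)
    v_next = 0.0 if terminal else predict(weights, features(next_state))
    td_error = reward + gamma * v_next - v
    # For a linear model, the gradient w.r.t. the weights is the feature vector itself
    new_weights = [w + alpha * td_error * xi for w, xi in zip(weights, x)]
    return new_weights, td_error
```

Because states with similar features receive similar values, the model generalizes; the same mechanism causes over-generalization when the features cannot distinguish states that should be valued differently.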

18
Review - Backgammon

• Racing game played with dice


• Players race in opposite
directions on the 24 points
• Single pieces can be captured
and have to start from the
beginning
• Doubling cube - play for double stakes, or resign
• Gammon and backgammon - win counts more if opponent is far behind

Image source: https://en.wikipedia.org/wiki/Backgammon

19
Tesauro’s Neurogammon and TD-Gammon

• Neurogammon (Tesauro 1989)


• Plays backgammon using neural networks
• First program to reach “strong intermediate” human level,
close to expert
• Beat all (non-learning) opponents at 1989 Computer
Olympiad
• Beat many intermediate level humans,
lost to an expert player

20
Neurogammon Architecture

• Six separate networks, for different phases of the game


• Fully connected feed-forward nets
• One hidden layer
• Tiny nets by modern standards
• Trained with backprop
• Supervised learning from 400 expert games
• One more network to make doubling cube decisions
• Trained on 3000 hand-labeled positions

21
Limitations of Neurogammon

• Hand-engineered features are difficult to create


• Human experts are not very good at explaining what they do in a form that can be programmed
• Human expert games are difficult to collect, and contain
errors

22
TD-Gammon

• TD-Gammon (Tesauro 1992, 1994, 1995)


• Training by self-play
• Learns from the outcome of games
• Uses Temporal Difference (TD) Learning

23
TD-Gammon Architecture

• 198 inputs - 8 per point, 6 extra with global information


(pieces off the board, toPlay)
• Single hidden layer, tried 10..80 hidden units
• Sigmoid activation function
• Output: one number, winning probability of input position
• Trained by TD(λ) with λ = 0.7, learning rate α = 0.1
• 200,000 training games, 2 weeks on high-end workstation
• Small (1-3 ply) alphabeta-like search ("expectiminimax")
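A rough sketch of a network with this shape (198 inputs, one hidden sigmoid layer, a single sigmoid output estimating the winning probability); this is an illustration in plain Python, not Tesauro's implementation, and the TD(λ) weight updates with eligibility traces are omitted:

```python
import math
import random

class TinyValueNet:
    """Feed-forward net in the shape of TD-Gammon: 198 inputs -> hidden -> 1 output (biases omitted)."""
    def __init__(self, n_inputs=198, n_hidden=40):
        rnd = lambda: random.uniform(-0.5, 0.5)
        self.w1 = [[rnd() for _ in range(n_inputs)] for _ in range(n_hidden)]
        self.w2 = [rnd() for _ in range(n_hidden)]

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def forward(self, x):
        # x: 198 input features encoding the position; returns estimated P(win)
        hidden = [self._sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in self.w1]
        return self._sigmoid(sum(w * h for w, h in zip(self.w2, hidden)))
```

Training then nudges the weights so the prediction for the current position moves toward the prediction after the next move (or the final game result), as in the TD updates sketched earlier.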

24
TD-Gammon - Examples of Weights Learned

Image source: Tesauro, Practical Issues in Temporal Difference Learning, Machine Learning, 1992

25
TD-Gammon - Examples of Weights Learned

• Weights from input to two


of the 40 hidden units
• Both make sense to
human expert players
• Left: corresponds to who
is ahead in the race
• Right: probability that attack will be successful

Image source: Tesauro, Practical Issues in Temporal Difference Learning, Machine Learning, 1992

26
TD-Gammon Impact

• Much stronger than Neurogammon


• Close to top human players
• Changed opening theory
• Changed the way the game is played by human experts
• For many years, the most impressive application of RL

27
Computer Backgammon Now

• Programs generally follow the TD-Gammon architecture


• Bigger, faster, longer training
• Endgame databases with exact winning probabilities
• Much stronger than humans

28
Summary of RL Introduction

• Reinforcement learning for learning from self-play


• TD-Gammon as early success story
• Very small (by today's standards) net with 1 hidden layer
• World class performance
• Trained by RL, more specifically the TD(λ) algorithm

29
