Lecture 21
Martin Müller
Fall 2024
1
455 Today - Lecture 21
Coursework
• Work on Assignment 4
• Reading and activities: Sutton RL tutorial + slides
• Quiz 11: Neural Networks and Deep Learning
(double length)
2
Part V
3
Reinforcement Learning
4
Reinforcement Learning (RL)
5
Basic Concepts of RL
6
RL vs Supervised Learning in Games
Supervised Learning
• Label for each move, or each position in training set
• Move label: Good/bad, expert move/not expert move
• Position label: evaluation, e.g. win/loss/draw in Tic Tac Toe
• Learn - minimize prediction error on given data set
• Can use mathematical optimization techniques, e.g.
gradient descent
Reinforcement Learning
• Reward for whole game sequence only
• Learn - try to improve gameplay by trial and error
• Which of our actions were good, and which were bad?
• Need to solve the credit assignment problem (see the contrast sketch below)
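To make the contrast concrete, here is a minimal sketch (toy linear evaluation and made-up names, an assumption rather than course code): supervised learning has a label for every training position and can take a gradient-descent step on each prediction error, while reinforcement learning initially has only a single reward for the whole game.

```python
# Toy contrast sketch (assumed setup, not lecture code).

def supervised_step(weights, position, label, lr=0.01):
    """Supervised: each position has its own label; take one gradient
    descent step on the squared prediction error for this example."""
    pred = sum(w * x for w, x in zip(weights, position))
    error = label - pred
    return [w + lr * error * x for w, x in zip(weights, position)]

def rl_signal(positions_in_game, final_reward):
    """RL: the only feedback is one reward for the whole game.
    Every position gets the same signal; deciding which moves deserve
    the credit or blame is the credit assignment problem."""
    return [(p, final_reward) for p in positions_in_game]
```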
7
Credit Assignment Problem
8
Basic Concepts - Policies and Value Functions
9
Value Functions
• Action-value function:
" T
#
X
t
qπ (s, a) = E γ Rt S0 = s, A0 = a, At>0 ∼ π(St )
t=0
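A minimal sketch of the quantity inside this expectation (assumed Python helpers, not lecture code): compute the discounted return of one episode's reward sequence, then average over sampled episodes that start with (s, a) and follow π to estimate q_π(s, a).

```python
# Sketch: the discounted return inside the expectation, and a sample
# average over episodes as an estimate of q_pi(s, a).

def discounted_return(rewards, gamma):
    """Sum of gamma^t * R_t over one episode's reward sequence R_0..R_T."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def estimate_q(reward_sequences, gamma):
    """Average discounted return over sampled episodes that start in
    state s, take action a, and then follow policy pi."""
    returns = [discounted_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)
```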
10
Policy Iteration
11
Self-Play In Games
12
Monte Carlo Reinforcement Learning
$q_\pi(s, a) = \mathbb{E}\left[\,\sum_{t=0}^{T} R_t \;\middle|\; S_0 = s,\ A_0 = a,\ A_{t>0} \sim \pi(S_t)\right]$
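A minimal every-visit Monte Carlo sketch (it assumes a play_episode(policy) helper that returns one finished game as (state, action, reward) triples; that interface is an assumption, not part of the lecture):

```python
from collections import defaultdict

def monte_carlo_q(play_episode, policy, num_episodes):
    """q(s, a) estimated as the average return observed after each
    occurrence of (s, a) in complete simulated games."""
    total = defaultdict(float)
    count = defaultdict(int)
    for _ in range(num_episodes):
        episode = play_episode(policy)        # [(state, action, reward), ...]
        rewards = [r for _, _, r in episode]
        for t, (s, a, _) in enumerate(episode):
            g = sum(rewards[t:])              # return from time t to the end
            total[(s, a)] += g
            count[(s, a)] += 1
    return {sa: total[sa] / count[sa] for sa in total}
```

Only (state, action) pairs that actually occur in some simulated game ever receive an estimate, which is the data-inefficiency noted on the next slide.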
13
Monte Carlo Advantages and Disadvantages
Advantages:
• Conceptually straightforward
• Very parallelizable
• No dependence at all between state estimates
Disadvantages:
• Estimates of one state’s value are not used to improve
estimates of another
• Can only estimate the value of states and actions that are
visited sufficiently often in some trajectory
=⇒ Slow, data-inefficient
14
Temporal Difference (TD) Learning and TD(λ)
• Sutton (1988)
• Learn a model - a function from inputs to outputs
• Given only action sequences and rewards
• Learns a prediction (what is the best move?)
• Samples the environment (plays games)
• Compares learned estimate in each state with reward
• Learns from the difference
• TD(λ): the parameter λ acts like a discount on how far credit for a reward propagates back to earlier states
• The sooner after the current state a reward happens, the stronger its effect on that state's estimate (a TD(0) update step is sketched below)
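A minimal tabular TD(0) sketch, i.e. the λ = 0 special case (the dictionary-based value table and names are assumptions, not lecture code):

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=1.0):
    """Move V(state) a little toward the TD target
    reward + gamma * V(next_state); the difference (the TD error)
    is what the method learns from."""
    td_error = reward + gamma * V.get(next_state, 0.0) - V.get(state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * td_error
    return V
```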
15
Monte Carlo vs. Temporal Difference
[Two diagrams on the slide: MC learning and TD learning]
• Monte Carlo: learn from whole simulations
• TD: learn from the difference between the current and the next value estimate (the two update targets are contrasted in the sketch below)
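Side by side, the two update targets differ as follows (a sketch with assumed names; either target would be used to move the estimate of the current state's value):

```python
def mc_target(rewards_from_t, gamma=1.0):
    """Monte Carlo target: the actual return of the whole remaining simulation."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards_from_t))

def td_target(reward, V, next_state, gamma=1.0):
    """TD target: one observed reward plus the current estimate of the next state."""
    return reward + gamma * V.get(next_state, 0.0)
```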
16
Temporal Difference High-level Ideas
17
Function Approximation
• Tabular learning:
Value of each state / state-action is tracked separately
• Function approximation:
Learn a model of values instead
• Based on features of the state / state-action
• Can use either Monte Carlo or TD updates (a TD version is sketched below)
• Advantage: Generalization. The model can guess values
for similar states that it has never visited before.
• Disadvantage: Over-generalization. Different states can be
conflated if the features are insufficiently detailed.
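A minimal sketch of linear function approximation with a semi-gradient TD(0) step (representing a state by a list of feature values is an assumption for illustration, not lecture code):

```python
def predict(weights, features):
    """Approximate value: dot product of weights and state features."""
    return sum(w * x for w, x in zip(weights, features))

def td_step(weights, feats, reward, next_feats, alpha=0.01, gamma=1.0):
    """Semi-gradient TD(0): update the shared weights instead of a table entry.
    Because the weights are shared, this step also changes the predicted value
    of every state with similar features (generalization, and potentially
    over-generalization)."""
    error = reward + gamma * predict(weights, next_feats) - predict(weights, feats)
    return [w + alpha * error * x for w, x in zip(weights, feats)]
```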
18
Review - Backgammon
19
Tesauro’s Neurogammon and TD-Gammon
20
Neurogammon Architecture
21
Limitations of Neurogammon
22
TD-Gammon
23
TD-Gammon Architecture
24
TD-Gammon - Examples of Weights Learned
Image source: Tesauro, Practical Issues in Temporal Difference Learning, Machine Learning, 1992
25
TD-Gammon - Examples of Weights Learned
26
TD-Gammon Impact
27
Computer Backgammon Now
28
Summary of RL Introduction
29