Reinforcement Learning
CSE 573
© Daniel S. Weld 2
573 Topics
[Figure: course topics stack, top to bottom — Reinforcement Learning, Supervised Learning, Planning; Knowledge Representation & Inference (Logic-Based, Probabilistic); Search; Problem Spaces; Agency]
© Daniel S. Weld 3
Pole Demo
© Daniel S. Weld 4
Review: MDPs
S = set of states (|S| = n)
A = set of actions
Pr(s′ | a, s) = transition model
R(s) = reward function; c(a) = action cost
γ = discount factor
© Daniel S. Weld 5
Goal for an MDP
• Find a policy which:
maximizes expected discounted reward
over an infinite horizon
for a fully observable
Markov decision process.
© Daniel S. Weld 6
Bellman Backup
[Figure: backup diagram — from state s, each action a1, a2, a3 has a value Qn+1(s,a) computed from the successor values Vn; Vn+1(s) is the max over the actions]

Vt+1(s) = R(s) + max_{a∈A} [ c(a) + γ Σ_{s′∈S} Pr(s′|a,s) Vt(s′) ]
(the bracketed term is the expected future reward, averaged over destination states)
© Daniel S. Weld 7
Value Iteration
• Assign arbitrary values to each state
(or use an admissible heuristic).
• Dynamic Programming: repeatedly apply Bellman backups until the values converge.
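Below is a minimal Python sketch of this procedure (not from the original slides); the dictionary encoding of P, R, and c, the discount value, and the tiny two-state MDP are illustrative assumptions.

```python
# Minimal value-iteration sketch (illustrative names; assumes a small, known MDP).
# V_{t+1}(s) = R(s) + max_a [ c(a) + gamma * sum_{s'} Pr(s'|a,s) * V_t(s') ]

def value_iteration(states, actions, P, R, c, gamma=0.9, eps=1e-6):
    """P[(s, a)] is a dict {s_next: probability}; R[s] is reward; c[a] is action cost."""
    V = {s: 0.0 for s in states}          # arbitrary initial values
    while True:
        V_new = {}
        for s in states:
            V_new[s] = R[s] + max(
                c[a] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions
            )
        # stop when the largest change (Bellman residual) is small
        if max(abs(V_new[s] - V[s]) for s in states) < eps:
            return V_new
        V = V_new

# Tiny made-up 2-state example with actions 'stay' and 'go'.
states, actions = ['s0', 's1'], ['stay', 'go']
P = {('s0', 'stay'): {'s0': 1.0}, ('s0', 'go'): {'s1': 1.0},
     ('s1', 'stay'): {'s1': 1.0}, ('s1', 'go'): {'s0': 1.0}}
R = {'s0': 0.0, 's1': 1.0}
c = {'stay': 0.0, 'go': -0.1}             # action costs are added like rewards here
print(value_iteration(states, actions, P, R, c))
```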
© Daniel S. Weld 8
Note on Value Iteration
• Order in which one applies Bellman backups is irrelevant!
[Figure: value iteration animation (backups across states, over iterations)]
• Use animation
© Daniel S. Weld 10
Policy evaluation
• Given a policy Π: S → A, find the value of each
state under this policy.
• VΠ(s) = R(s) + c(Π(s)) + γ Σ_{s′∈S} Pr(s′|Π(s), s) VΠ(s′)
• This is a system of linear equations
involving |S| variables.
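Because VΠ is defined by |S| linear equations, it can be computed with one linear solve. A hedged sketch, assuming numpy and states indexed 0..n−1; P_pi, R, and c_pi are illustrative names, not from the slides.

```python
import numpy as np

# Solve V = R + c_pi + gamma * P_pi @ V  <=>  (I - gamma * P_pi) V = R + c_pi
def evaluate_policy(P_pi, R, c_pi, gamma=0.9):
    """P_pi[i, j] = Pr(s_j | pi(s_i), s_i); R[i] = reward; c_pi[i] = cost of pi(s_i)."""
    n = len(R)
    A = np.eye(n) - gamma * P_pi
    return np.linalg.solve(A, R + c_pi)

# Made-up 3-state example with a fixed policy already folded into P_pi.
P_pi = np.array([[0.8, 0.2, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])
R    = np.array([0.0, 0.0, 1.0])
c_pi = np.array([-0.04, -0.04, 0.0])
print(evaluate_policy(P_pi, R, c_pi))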
© Daniel S. Weld 11
Policy iteration
• Start with any policy (Π0).
• Iterate
Policy evaluation : For each state find VΠi(s).
Policy improvement : For each state s, find action
a* that maximizes QΠi(a,s).
If QΠi(a*,s) > VΠi(s) let Πi+1(s) = a*
else let Πi+1(s) = Πi(s)
• Stop when Πi+1 = Πi
• Converges faster than value iteration but
policy evaluation step is more expensive.
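A sketch of the full policy-iteration loop, assuming the same kind of small, known model as above; the per-action transition matrices P, rewards R, and costs c are illustrative names.

```python
import numpy as np

# Policy-iteration sketch (illustrative). P[a] is an (n x n) transition matrix for
# action a, R the per-state reward, c[a] the per-action cost, gamma the discount.
def policy_iteration(P, R, c, gamma=0.9):
    n, actions = len(R), list(P.keys())
    pi = {s: actions[0] for s in range(n)}              # arbitrary initial policy
    while True:
        # Policy evaluation: solve the linear system for V^pi.
        P_pi = np.array([P[pi[s]][s] for s in range(n)])
        c_pi = np.array([c[pi[s]] for s in range(n)])
        V = np.linalg.solve(np.eye(n) - gamma * P_pi, R + c_pi)
        # Policy improvement: pick the best action per state.
        # (R(s) is common to all actions, so it is omitted from the comparison.)
        new_pi = {}
        for s in range(n):
            q = {a: c[a] + gamma * P[a][s] @ V for a in actions}
            best = max(q, key=q.get)
            new_pi[s] = best if q[best] > q[pi[s]] + 1e-12 else pi[s]
        if new_pi == pi:                                 # stop when policy is stable
            return pi, V
        pi = new_pi

# Tiny made-up example with two actions.
P = {'a': np.array([[0.9, 0.1], [0.0, 1.0]]),
     'b': np.array([[0.1, 0.9], [0.5, 0.5]])}
R, c = np.array([0.0, 1.0]), {'a': 0.0, 'b': -0.05}
print(policy_iteration(P, R, c))
```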
© Daniel S. Weld 12
Modified Policy iteration
• Instead of evaluating the actual value of
policy by
Solving system of linear equations, …
• Approximate it:
Value iteration with fixed policy.
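A sketch of that approximation: run k fixed-policy Bellman backups instead of an exact linear solve. The numpy encoding and the choice of k are assumptions.

```python
import numpy as np

# Approximate policy evaluation: k sweeps of value iteration with the policy fixed,
# instead of solving the linear system exactly (illustrative names, assumed model).
def approx_evaluate(P_pi, R, c_pi, gamma=0.9, k=10, V0=None):
    V = np.zeros(len(R)) if V0 is None else V0.copy()
    for _ in range(k):
        V = R + c_pi + gamma * (P_pi @ V)   # one fixed-policy Bellman backup
    return V

# Made-up 2-state usage example.
P_pi = np.array([[0.9, 0.1], [0.0, 1.0]])
print(approx_evaluate(P_pi, np.array([0.0, 1.0]), np.array([-0.04, 0.0])))
```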
© Daniel S. Weld 13
Excuse Me…
• MDPs are great, IF…
We know the state transition function P(s,a,s’)
We know the reward function R(s)
© Daniel S. Weld 14
How is learning to act possible when…
• Actions have non-deterministic effects
Which are initially unknown
© Daniel S. Weld 15
Naïve Approach
1. Act Randomly for a while
(Or systematically explore all possible actions)
2. Learn
Transition function
Reward function
3. Use value iteration, policy iteration, …
Problems?
© Daniel S. Weld 16
RL Techniques
1. Passive RL
3. Temporal-difference learning
Learns a utility function on states
• Treats the difference between expected and actual
reward as an error signal that is propagated
backward in time
© Daniel S. Weld 17
Concepts
• Exploration functions
Balance exploration / exploitation
• Function approximation
Compress a large state space into a small one
Linear function approximation, neural nets, …
Generalization
© Daniel S. Weld 18
Example:
• Suppose given policy
• Want to determine how good it is
© Daniel S. Weld 19
Objective: Value Function
© Daniel S. Weld 20
Just Like Policy Evaluation
• Except…?
© Daniel S. Weld 21
Passive RL
• Given policy Π,
estimate UΠ(s)
• Not given
transition matrix, nor
reward function!
• Epochs: training sequences
(1,1)(1,2)(1,3)(1,2)(1,3)(1,2)(1,1)(1,2)(2,2)(3,2) –1
(1,1)(1,2)(1,3)(2,3)(2,2)(2,3)(3,3) +1
(1,1)(1,2)(1,1)(1,2)(1,1)(2,1)(2,2)(2,3)(3,3) +1
(1,1)(1,2)(2,2)(1,2)(1,3)(2,3)(1,3)(2,3)(3,3) +1
(1,1)(2,1)(2,2)(2,1)(1,1)(1,2)(1,3)(2,3)(2,2)(3,2) -1
(1,1)(2,1)(1,1)(1,2)(2,2)(3,2) -1
© Daniel S. Weld 22
Approach 1
• Direct estimation
Estimate U(s) as the average total reward of the epochs
containing s (computed from s to the end of the epoch); see the sketch below
• Pros / Cons?
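A sketch of direct estimation on two of the training sequences above; encoding each epoch as (state, reward) pairs with zero per-step reward is an assumption about how the traces are meant to be read.

```python
from collections import defaultdict

# Direct utility estimation (illustrative): treat the observed reward-to-go from each
# visit to a state as a sample of U(s), and average the samples.
def direct_estimation(epochs):
    totals, counts = defaultdict(float), defaultdict(int)
    for epoch in epochs:
        rewards = [r for _, r in epoch]
        for i, (s, _) in enumerate(epoch):
            reward_to_go = sum(rewards[i:])      # total reward from s to end of epoch
            totals[s] += reward_to_go
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Two of the slide's training sequences (zero per-step reward is assumed).
epochs = [
    [((1,1),0),((1,2),0),((1,3),0),((2,3),0),((2,2),0),((2,3),0),((3,3),+1)],
    [((1,1),0),((2,1),0),((1,1),0),((1,2),0),((2,2),0),((3,2),-1)],
]
print(direct_estimation(epochs))
```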
© Daniel S. Weld 23
Approach 2
Adaptive Dynamic Programming
Requires fully observable environment
Estimate transition function M from training data
Solve Bellman eqn w/ modified policy iteration
U(s) = R(s) + Σ_{s′} M_{s,s′} U(s′)
Pros / Cons?
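A sketch of the ADP idea: estimate M and R from the observed transitions, then solve the fixed-policy Bellman equation by repeated backups. The epoch encoding, iteration count, and all names are assumptions.

```python
from collections import defaultdict

# ADP sketch (illustrative): estimate the transition model M and reward R from the
# observed transitions, then solve U(s) = R(s) + sum_{s'} M[s, s'] * U(s')
# by repeated backups. States with no observed successors keep U = R.
def adp(epochs, n_iters=200):
    counts = defaultdict(lambda: defaultdict(int))
    R = {}
    for epoch in epochs:
        for i, (s, r) in enumerate(epoch):
            R[s] = r                                  # last observed reward per state
            if i + 1 < len(epoch):
                counts[s][epoch[i + 1][0]] += 1       # count s -> s' transitions
    M = {s: {s2: c / sum(nxt.values()) for s2, c in nxt.items()}
         for s, nxt in counts.items()}
    U = {s: R[s] for s in R}
    for _ in range(n_iters):
        U = {s: R[s] + sum(p * U[s2] for s2, p in M.get(s, {}).items()) for s in R}
    return U

# Reuse the epoch encoding from the direct-estimation sketch (assumed format).
epochs = [
    [((1,1),0),((1,2),0),((1,3),0),((2,3),0),((2,2),0),((2,3),0),((3,3),+1)],
    [((1,1),0),((2,1),0),((1,1),0),((1,2),0),((2,2),0),((3,2),-1)],
]
print(adp(epochs))
```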
© Daniel S. Weld 24
Approach 3
• Temporal Difference Learning
Do backups on a per-action basis
Don’t try to estimate entire transition function!
For each transition from s to s’, update:
U(s) ← U(s) + α( R(s) + γ U(s′) − U(s) )
α = learning rate
γ = discount rate
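A sketch of the TD(0) update applied along one of the earlier training sequences; the zero per-step rewards and the α, γ values are assumptions.

```python
# TD(0) update sketch (illustrative names):
#     U(s) <- U(s) + alpha * ( R(s) + gamma * U(s_next) - U(s) )
def td_update(U, s, r, s_next, alpha=0.1, gamma=1.0):
    U.setdefault(s, 0.0)
    target = r + (gamma * U.setdefault(s_next, 0.0) if s_next is not None else 0.0)
    U[s] += alpha * (target - U[s])

# Replay one training sequence; the terminal state has no successor.
U = {}
epoch = [((1, 1), 0), ((1, 2), 0), ((1, 3), 0), ((2, 3), 0),
         ((2, 2), 0), ((2, 3), 0), ((3, 3), +1)]
for i, (s, r) in enumerate(epoch):
    s_next = epoch[i + 1][0] if i + 1 < len(epoch) else None
    td_update(U, s, r, s_next)
print(U)
```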
© Daniel S. Weld 25
Notes
• Once U is learned, updates become 0:
α( R(s) + γ U(s′) − U(s) ) = 0   when   U(s) = R(s) + γ U(s′)
• Similar to ADP
Adjusts the state's value to ‘agree’ with the observed successor
• Not with all possible successors, as ADP does
© Daniel S. Weld 26
Notes II
• “TD(0)”
One step lookahead
U(s) ← U(s) + α( R(s) + γ U(s′) − U(s) )
Can do 2 step, 3 step…
© Daniel S. Weld 27
TD(λ)
• Or, … take it to the limit!
• Compute weighted average of all future states
U(st) ← U(st) + α( R(st) + γ U(st+1) − U(st) )
becomes
U(st) ← U(st) + α( R(st) + γ (1−λ) Σ_{i≥0} λ^i U(st+i+1) − U(st) )
(a weighted average of all future state values)
• Implementation
Propagate current weighted TD onto past states
Must memorize states visited from start of epoch
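One way to implement "propagate the weighted TD error onto past states" is with eligibility traces; a sketch under that interpretation, with illustrative names and constants.

```python
# TD(lambda) backward-view sketch (illustrative): keep an eligibility trace e(s) for
# every state visited so far in the epoch, and propagate each TD error to all of
# them, with the trace decaying by gamma*lam at every step since the visit.
def td_lambda_epoch(U, epoch, alpha=0.1, gamma=1.0, lam=0.7):
    e = {}                                            # eligibility traces
    for i, (s, r) in enumerate(epoch):
        U.setdefault(s, 0.0)
        s_next = epoch[i + 1][0] if i + 1 < len(epoch) else None
        u_next = U.setdefault(s_next, 0.0) if s_next is not None else 0.0
        delta = r + gamma * u_next - U[s]             # TD error for this step
        e[s] = e.get(s, 0.0) + 1.0                    # bump trace of current state
        for s_past in e:                              # propagate delta to past states
            U[s_past] += alpha * delta * e[s_past]
            e[s_past] *= gamma * lam                  # decay the trace
    return U

U = td_lambda_epoch({}, [((1, 1), 0), ((1, 2), 0), ((1, 3), 0),
                         ((2, 3), 0), ((3, 3), +1)])
print(U)
```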
© Daniel S. Weld 28
Notes III
• Online: update immediately after actions
Works even if epochs are infinitely long
© Daniel S. Weld 29
Q-Learning
• Version of TD-learning where
instead of learning value funct on states
we learn funct on [state,action] pairs
U(s) ← U(s) + α( R(s) + γ U(s′) − U(s) )
becomes
Q(a, s) ← Q(a, s) + α( R(s) + γ max_{a′} Q(a′, s′) − Q(a, s) )
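A sketch of the Q-learning update; the state/action names and the defaultdict encoding are illustrative, not from the slides.

```python
from collections import defaultdict

# Q-learning update sketch (illustrative): learn Q(a, s) directly, no transition model.
#     Q(a,s) <- Q(a,s) + alpha * ( R(s) + gamma * max_a' Q(a', s') - Q(a,s) )
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q[(a2, s_next)] for a2 in actions) if s_next is not None else 0.0
    Q[(a, s)] += alpha * (r + gamma * best_next - Q[(a, s)])

Q = defaultdict(float)                    # Q values default to 0
actions = ['left', 'right']
# One made-up experience tuple (state, action, reward, next state):
q_update(Q, s='s0', a='right', r=0.0, s_next='s1', actions=actions)
print(dict(Q))
```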
[Video: CMU Robotics]
© Daniel S. Weld 32
Active Reinforcement Learning
Suppose the agent must make policy decisions while it is still learning
First approach:
Start with arbitrary policy
Apply Q-Learning
New policy:
In state s,
Choose action a that maximizes Q(a,s)
Problem?
© Daniel S. Weld 33
Utility of Exploration
• Too easily stuck in non-optimal space
“Exploration versus exploitation tradeoff”
• Solution 1
With fixed probability perform a random action
• Solution 2
Increase est expected value of infrequent states
U+(s) ← R(s) + γ max_a f( Σ_{s′} P(s′ | a, s) U+(s′), N(a, s) )
Properties of f(u, n)??
• What to do??
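Sketches of the two solutions: a fixed-probability random action (ε-greedy) and an optimistic exploration function f(u, n). The constants eps, R_plus, and N_e are illustrative assumptions.

```python
import random

# Solution 1: epsilon-greedy -- with fixed probability, perform a random action.
def epsilon_greedy(Q, s, actions, eps=0.1):
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((a, s), 0.0))

# Solution 2: an exploration function f(u, n) that is optimistic about rarely tried
# actions -- return a large value R_plus until the action has been tried N_e times.
# In the optimistic backup above, f(u, n) replaces the plain expected utility u.
def f(u, n, R_plus=2.0, N_e=5):
    return R_plus if n < N_e else u
```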
© Daniel S. Weld 35
Function Approximation
• Never enough training data!
Must generalize what is learned to new situations
• Idea:
Replace large state table by a smaller,
parameterized function
Updating the value of one state will change the values
assigned to many other, similar states
© Daniel S. Weld 36
Linear Function Approximation
• Represent U(s) as a weighted sum of
features (basis functions) of s
Û(s) = θ1 f1(s) + θ2 f2(s) + … + θn fn(s)

θi ← θi + α( R(s) + γ Û(s′) − Û(s) ) ∂Û(s)/∂θi
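A sketch of the gradient-TD update for a linear Û with features (1, x, y), as in the example on the next slide; the α, γ values and the terminal handling are assumptions.

```python
import numpy as np

# Linear function approximation sketch (illustrative): U_hat(s) = theta . f(s),
#     theta_i <- theta_i + alpha * ( R(s) + gamma*U_hat(s') - U_hat(s) ) * f_i(s)
# (for a linear U_hat, the gradient dU_hat/dtheta_i is just the feature f_i(s)).
def features(s):
    x, y = s
    return np.array([1.0, x, y])          # f0 = 1, f1 = x, f2 = y

def td_linear_update(theta, s, r, s_next, alpha=0.05, gamma=1.0):
    u = theta @ features(s)
    u_next = theta @ features(s_next) if s_next is not None else 0.0
    return theta + alpha * (r + gamma * u_next - u) * features(s)

theta = np.zeros(3)
theta = td_linear_update(theta, s=(3, 3), r=+1.0, s_next=None)   # terminal transition
print(theta)
```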
© Daniel S. Weld 37
Example
• Û(s) = θ0 + θ1 x + θ2 y
• Learns a good approximation
[Figure: grid-world example]
© Daniel S. Weld 38
But What If…
• Û(s) = θ0 + θ1 x + θ2 y + θ3 z
• Computed features:
z = (xg − x)² + (yg − y)²   (squared distance to the goal)
[Figure: grid-world example]
© Daniel S. Weld 39
Neural Nets
• Can create powerful function approximators
Nonlinear
Possibly unstable
• For TD-learning, apply difference signal to
neural net output and perform back-
propagation
© Daniel S. Weld 40
Policy Search
• Represent policy in terms of Q functions
• Gradient search
Requires differentiability
Stochastic policies; softmax
• Hillclimbing
Tweak policy and evaluate by running
• Replaying experience
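A sketch of a softmax (stochastic) policy built from Q values, one way to obtain a differentiable policy for gradient search; the names and temperature parameter are illustrative.

```python
import numpy as np

# Softmax policy sketch (illustrative): turn Q values into action probabilities.
def softmax_policy(Q, s, actions, temperature=1.0):
    q = np.array([Q.get((a, s), 0.0) for a in actions]) / temperature
    q -= q.max()                          # subtract max for numerical stability
    probs = np.exp(q) / np.exp(q).sum()
    return dict(zip(actions, probs))

print(softmax_policy({('go', 's0'): 1.0, ('stay', 's0'): 0.0}, 's0', ['go', 'stay']))
```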
© Daniel S. Weld 41
Walking Demo
UT AustinVilla
• Limited resources
Fetch most important pages first
• Topic-specific search engines
Only want pages which are relevant to topic
• Minimize stale pages
Efficient re-fetch to keep index timely
How to track the rate of change of pages?
© Daniel S. Weld 44
Standard Web Search Engine Architecture
[Figure: crawler fetches pages from the web → store documents, check for duplicates, extract links → DocIds → create an inverted index → search engine (index servers) uses the inverted index to answer user queries and show results to the user]
© Daniel S. Weld 46
Methods
• Agent Types
Utility-based
Action-value based (Q function)
Reflex
• Passive Learning
Direct utility estimation
Adaptive dynamic programming
Temporal difference learning
• Active Learning
Choose a random action 1/nth of the time
Exploration by adding to utility function
Q-learning (learn action/value f directly – model free)
• Generalization
Function approximation (linear function or neural networks)
• Policy Search
Stochastic policy repr / Softmax
Reusing past experience
© Daniel S. Weld 47
Summary
• Use reinforcement learning when
Model of world is unknown and/or rewards are delayed
• Temporal difference learning
Simple and efficient training rule
• Q-learning eliminates need for explicit T model
• Large state spaces can (sometimes!) be handled
Function approximation, using linear functions
Or neural nets
© Daniel S. Weld 48