Reinforcement Learning
Unit 5: Dynamic Learning
October 12, 2010
Byoung-Tak Zhang
http://bi.snu.ac.kr/~btzhang/
Overview
• Motivating Applications
– Learning robots
– Web agents
• Markov Decision Processes
– MDP
– POMDP
• Reinforcement Learning
– Adaptive Dynamic Programming (ADP)
– Temporal Difference (TD) Learning
– TD-Q Learning
Motivating Applications
• Generalized model learning for reinforcement learning
on a humanoid robot:
http://www.youtube.com/watch?v=mRpX9DFCdwI
• Autonomous spider learns to walk forward by
reinforcement learning:
http://www.youtube.com/watch?v=RZf8fR1SmNY&fe
ature=related
• Reinforcement learning for a robotic soccer goalkeeper:
http://www.youtube.com/watch?v=CIF2SBVY-
J0&feature=related
Document Filtering with WAIR
[Figure: the WAIR agent observes state_i (the user profile), takes action_i (document filtering), presents the filtered documents to the user, and receives reward_{i+1} as relevance feedback, which it uses to learn by modifying the profile.]
Zhang, B.-T. and Seo, Y.-W., Applied Artificial Intelligence, 15(7):665-685, 2001
Reinforcement Learning is…
• The agent's observation is a function of the current state
• Now you may need to remember previous observations in order to act optimally
POMDP
[Figure: grid world with goal state G — immediate rewards are 100 for transitions into the goal and 0 elsewhere; the resulting discounted state values (81, 90, 100) increase toward the goal.]
• A policy π: S → A (deterministic)
or π: S × A → [0, 1] (stochastic)
Optimality Criteria
Suppose an agent receives a reward r_t at time t. Then optimal behaviour might:
• Maximise the sum of expected future rewards: ∑_t r_t
• Maximise over a finite horizon: ∑_{t=0}^{T} r_t
• Maximise over an infinite horizon: ∑_{t=0}^{∞} r_t
• Maximise over a discounted infinite horizon: ∑_{t=0}^{∞} γ^t r_t
• Maximise average reward: lim_{n→∞} (1/n) ∑_{t=1}^{n} r_t
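As a concrete illustration of these criteria, here is a minimal Python sketch comparing the finite-horizon sum, the discounted sum, and the average reward for a fixed reward sequence; the rewards and the discount factor are made-up values, not taken from the slides.

```python
# Minimal sketch: comparing return criteria for a fixed reward sequence.
# The reward sequence and discount factor gamma are illustrative assumptions.
rewards = [0, 0, 1, 0, 2, 1]    # r_0, r_1, ..., r_5
gamma = 0.9                     # discount factor, 0 <= gamma < 1

finite_sum = sum(rewards)                                        # sum_{t=0}^{T} r_t
discounted = sum(gamma**t * r for t, r in enumerate(rewards))    # sum_t gamma^t r_t
average = sum(rewards) / len(rewards)                            # (1/n) sum_{t=1}^{n} r_t

print(finite_sum, discounted, average)
```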
Examples of MDPs
• Goal-directed, Indefinite Horizon, Cost Minimization MDP
– <S, A, Pr, C, G, s0>
– Most often studied in planning community
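As a rough sketch (not from the slides), the tuple <S, A, Pr, C, G, s0> could be held in a small Python data structure; all field names below are assumptions chosen to mirror the tuple.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

# Sketch of a goal-directed, cost-minimization MDP <S, A, Pr, C, G, s0>.
# The field names mirror the tuple on the slide; the structure itself is an assumption.
@dataclass
class GoalMDP:
    states: Set[str]                               # S
    actions: Set[str]                              # A
    transition: Dict[Tuple[str, str, str], float]  # Pr(s' | s, a) keyed by (s, a, s')
    cost: Dict[Tuple[str, str], float]             # C(s, a)
    goals: Set[str]                                # G, absorbing goal states
    start: str                                     # s0
```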
Policies (“Plans” for MDPs)
• Nonstationary policy
– π:S x T → A, where T is the non-negative integers
– π(s,t) is action to do at state s with t stages-to-go
– What if we want to keep acting indefinitely?
• Stationary policy
– π:S → A
– π(s) is action to do at state s (regardless of time)
– specifies a continuously reactive controller
• These assume or have these properties:
– full observability
– history-independence
– deterministic action choice
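A minimal sketch of the two policy types as plain Python functions; the states, actions, and decision rules are invented for illustration.

```python
# Sketch: stationary vs. nonstationary policies as Python functions.
# States, actions, and the decision rules below are illustrative assumptions.

def stationary_policy(s: int) -> str:
    # pi(s): the same action for a state, regardless of time
    return "left" if s % 2 == 0 else "right"

def nonstationary_policy(s: int, t: int) -> str:
    # pi(s, t): the action may change with the number of stages-to-go t
    return "left" if t > 1 else "stay"
```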
Value of a Policy
• How good is a policy π?
• How do we measure “accumulated” reward?
• Value function V: S → ℝ associates a value with each
state (or each state and time for non-stationary π)
• Vπ(s) denotes value of policy at state s
– Depends on immediate reward, but also what you achieve
subsequently by following π
– An optimal policy is one that is no worse than any other
policy at any state
• The goal of MDP planning is to compute an optimal
policy (method depends on how we define value)
Finite-Horizon Value Functions
• We first consider maximizing total reward over a
finite horizon
• Assumes the agent has n time steps to live
• To act optimally, should the agent use a
stationary or non-stationary policy?
• Put another way:
– If you had only one week to live would you act the
same way as if you had fifty years to live?
Finite Horizon Problems
• Value (utility) depends on stage-to-go
– hence so should policy: nonstationary π(s,k)
• Vπ^k(s) is the k-stage-to-go value function for π
Computing Finite-Horizon Value
• Can use dynamic programming to compute Vπ^k(s)
– Markov property is critical for this
(a) Vπ^0(s) = R(s), ∀s
(b) Vπ^k(s) = R(s) + ∑_{s'} T(s, π(s,k), s') · Vπ^{k-1}(s')
    [immediate reward plus the expected future payoff with k−1 stages to go]
• What is the time complexity?
[Figure: backup diagram — taking action π(s,k) in state s reaches successor states under V^{k-1} with probabilities 0.7 and 0.3.]
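The recursion (a)–(b) translates directly into a short dynamic-programming loop. Below is a minimal Python sketch; the two-state transition model, rewards, and policy are toy assumptions. Each of the K sweeps touches every (state, successor) pair once, so the cost is O(K · |S|²) for a fixed policy.

```python
# Sketch: k-stage-to-go value of a fixed (nonstationary) policy by dynamic programming.
# The states, rewards, transitions, and policy below are toy assumptions.
states = ["s1", "s2"]
R = {"s1": 0.0, "s2": 1.0}

# T[(s, a)] = list of (next state, probability)
T = {("s1", "a"): [("s1", 0.3), ("s2", 0.7)],
     ("s2", "a"): [("s2", 1.0)]}

def policy(s, k):
    return "a"                                 # trivial policy: always take action "a"

K = 3
V = {s: R[s] for s in states}                  # (a) V^0(s) = R(s)
for k in range(1, K + 1):                      # (b) one backup per stage-to-go
    V = {s: R[s] + sum(p * V[s2] for s2, p in T[(s, policy(s, k))])
         for s in states}

print(V)                                       # K-stage-to-go values under the policy
```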
Bellman Backup
How can we compute optimal Vt+1(s) given optimal Vt?
[Figure: backup diagram — from state s, action a1 reaches s1 with probability 0.7 and s4 with probability 0.3; action a2 reaches s2 with probability 0.4 and s3 with probability 0.6. Compute the expectation for each action, then take the max.]
Vt+1(s) = R(s) + max { 0.7 Vt(s1) + 0.3 Vt(s4),
                       0.4 Vt(s2) + 0.6 Vt(s3) }
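For concreteness, a few lines of Python evaluating this single backup; only the 0.7/0.3 and 0.4/0.6 branch probabilities come from the slide, while R(s) and the Vt values are made-up numbers.

```python
# One Bellman backup for state s, with the branch probabilities from the slide.
# R(s) and the V_t values are made-up numbers for illustration.
R_s = 0.0
Vt = {"s1": 10.0, "s2": 8.0, "s3": 6.0, "s4": 2.0}

q_a1 = 0.7 * Vt["s1"] + 0.3 * Vt["s4"]   # expectation under action a1 -> 7.6
q_a2 = 0.4 * Vt["s2"] + 0.6 * Vt["s3"]   # expectation under action a2 -> 6.8
V_next_s = R_s + max(q_a1, q_a2)         # max over actions -> 7.6

print(V_next_s)
```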
Value Iteration: Finite Horizon Case
• Markov property allows exploitation of the DP principle
for optimal policy construction
– no need to enumerate |A|^(Tn) possible policies
• Value Iteration (Bellman backup):
V^0(s) = R(s), ∀s
V^k(s) = R(s) + max_a ∑_{s'} T(s, a, s') · V^{k-1}(s')
π*(s, k) = argmax_a ∑_{s'} T(s, a, s') · V^{k-1}(s')
• V^k is the optimal k-stage-to-go value function
• π*(s, k) is the optimal k-stage-to-go policy
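A compact Python sketch of finite-horizon value iteration implementing these three equations; the two-state, two-action MDP below is a made-up example.

```python
# Sketch: finite-horizon value iteration.  V^k and the greedy policy pi*(s, k) are
# computed with the Bellman backup; the MDP below is a toy assumption.
states = ["s1", "s2"]
actions = ["a1", "a2"]
R = {"s1": 0.0, "s2": 1.0}
T = {("s1", "a1"): [("s1", 0.7), ("s2", 0.3)],
     ("s1", "a2"): [("s2", 1.0)],
     ("s2", "a1"): [("s2", 1.0)],
     ("s2", "a2"): [("s1", 0.5), ("s2", 0.5)]}

def expected_value(V, s, a):
    return sum(p * V[s2] for s2, p in T[(s, a)])

K = 5
V = {s: R[s] for s in states}                            # V^0(s) = R(s)
pi_star = {}                                             # pi*(s, k)
for k in range(1, K + 1):
    for s in states:
        pi_star[(s, k)] = max(actions, key=lambda a: expected_value(V, s, a))
    V = {s: R[s] + max(expected_value(V, s, a) for a in actions) for s in states}

print(V, pi_star[("s1", K)])
```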
Value Iteration
[Figure: value iteration unrolled as a trellis over states s1–s4 and stages V3, V2, V1, V0, with the transition probabilities 0.7, 0.4, and 0.6 repeated at each stage; π*(s4, t) is obtained by maximizing over the actions available at each stage.]
Discounted Infinite Horizon MDPs
• Defining value as total reward is problematic with
infinite horizons
– many or all policies have infinite expected reward
– some MDPs are OK (e.g., zero-cost absorbing states)
• “Trick”: introduce discount factor 0 ≤ β < 1
– future rewards are discounted by β per time step
Vπ(s) = E[ ∑_{t=0}^{∞} β^t R^t | π, s ]
• Note: Vπ(s) ≤ E[ ∑_{t=0}^{∞} β^t R_max ] = R_max / (1 − β)
• Value iteration with discounting:
V^0(s) = 0
V^k(s) = R(s) + β max_a ∑_{s'} T(s, a, s') · V^{k-1}(s')
• Will converge to the optimal value function as k gets
large. Why?
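A minimal Python sketch of discounted value iteration run until successive value functions barely change; the toy MDP, the discount β, and the stopping tolerance are assumptions.

```python
# Sketch: discounted value iteration, iterated until the largest change in any
# state's value falls below a small tolerance.  The MDP below is a toy assumption.
states = ["s1", "s2"]
actions = ["a1", "a2"]
R = {"s1": 0.0, "s2": 1.0}
T = {("s1", "a1"): [("s1", 0.7), ("s2", 0.3)],
     ("s1", "a2"): [("s2", 1.0)],
     ("s2", "a1"): [("s2", 1.0)],
     ("s2", "a2"): [("s1", 0.5), ("s2", 0.5)]}
beta, tol = 0.9, 1e-6

V = {s: 0.0 for s in states}                             # V^0(s) = 0
while True:
    V_new = {s: R[s] + beta * max(sum(p * V[s2] for s2, p in T[(s, a)])
                                  for a in actions)
             for s in states}
    if max(abs(V_new[s] - V[s]) for s in states) < tol:  # contraction => convergence
        break
    V = V_new

print(V_new)
```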
Policy Evaluation
• Value equation for a fixed policy:
Vπ(s) = R(s) + β ∑_{s'} T(s, π(s), s') · Vπ(s')
• How can we compute the value function for a
policy?
– we are given R and Pr
– simple linear system with n variables (each variable is the
value of a state) and n constraints (one value equation
for each state)
– Use linear algebra (e.g. matrix inverse)
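A minimal sketch of exact policy evaluation with NumPy, solving (I − β·Tπ)·Vπ = R; the transition matrix, rewards, and discount are made-up numbers.

```python
import numpy as np

# Sketch: exact policy evaluation, Vpi = (I - beta * T_pi)^(-1) R.
# T_pi[i, j] = probability of moving from state i to state j under the fixed policy.
# All numbers below are made-up for illustration.
T_pi = np.array([[0.7, 0.3],
                 [0.0, 1.0]])
R = np.array([0.0, 1.0])
beta = 0.9

V = np.linalg.solve(np.eye(2) - beta * T_pi, R)   # solve the n-by-n linear system
print(V)                                          # value of each state under the policy
```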
Policy Iteration
• Given fixed policy, can compute its value exactly:
Vπ(s) = R(s) + β ∑_{s'} T(s, π(s), s') · Vπ(s')
• Policy iteration exploits this: iterates steps of policy evaluation
and policy improvement
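A rough Python sketch of the policy evaluation / policy improvement loop on a toy two-state, two-action MDP; all numbers and the stopping test (stop when the policy no longer changes) are assumptions.

```python
import numpy as np

# Sketch: policy iteration on a toy 2-state, 2-action MDP (all numbers assumed).
# T[a][i, j] = P(j | i, a); R[i] = reward in state i.
T = {0: np.array([[0.7, 0.3], [0.0, 1.0]]),
     1: np.array([[0.0, 1.0], [0.5, 0.5]])}
R = np.array([0.0, 1.0])
beta, n_states, actions = 0.9, 2, [0, 1]

pi = np.zeros(n_states, dtype=int)               # start from an arbitrary policy
while True:
    # Policy evaluation: solve (I - beta * T_pi) V = R exactly
    T_pi = np.array([T[pi[s]][s] for s in range(n_states)])
    V = np.linalg.solve(np.eye(n_states) - beta * T_pi, R)
    # Policy improvement: greedy one-step lookahead against V
    new_pi = np.array([max(actions, key=lambda a: R[s] + beta * T[a][s] @ V)
                       for s in range(n_states)])
    if np.array_equal(new_pi, pi):
        break                                    # policy stable: stop
    pi = new_pi

print(pi, V)
```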
Policy Iteration Notes
Value Iteration vs. Policy Iteration
• Which is faster? VI or PI
– It depends on the problem
• VI takes more iterations than PI, but PI requires
more time on each iteration
– PI must perform policy evaluation on each step, which
involves solving a linear system
• Complexity:
– There are at most exp(n) policies, so PI is no worse
than exponential time in the number of states
– Empirically O(n) iterations are required
– Still no polynomial bound on the number of PI
iterations (open problem)!
Adaptive Dynamic Programming (ADP)
For each s, initialize V(s), P(s'|s,a) and R(s,a)
Initialize s to the current state that is perceived
Loop forever
{
    Select an action a and execute it (using the current model R and P):
        a = argmax_a ( R(s, a) + γ ∑_{s'} P(s' | s, a) V(s') )
    Receive the immediate reward r and observe the new state s'
    Use the transition tuple <s, a, s', r> to update the model R and P
    For all states s, update V(s) using the update rule:
        Q(s, a) = R(s, a) + γ ∑_{s'} P(s' | s, a) V(s'),   V(s) = max_a Q(s, a)
    s = s'
}
• Note: Delta rule (neural network learning):
w(t) = w(t) + α (d − o) x
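A rough Python sketch of the ADP loop above; the toy environment, the maximum-likelihood model updates, and the ε-greedy exploration are assumptions layered on top of the pseudocode, which itself leaves exploration unspecified.

```python
import random
from collections import defaultdict

# Sketch of the ADP loop on a tiny made-up environment.  Transition counts give
# maximum-likelihood estimates of P(s'|s,a); R(s,a) is a running average reward.
states, actions, gamma = [0, 1], [0, 1], 0.9

def env_step(s, a):
    # Toy dynamics (an assumption, not part of the slide): action 1 favours state 1.
    s2 = random.choices(states, weights=[0.8, 0.2] if a == 0 else [0.2, 0.8])[0]
    return s2, (1.0 if s2 == 1 else 0.0)

V = defaultdict(float)
counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] transition counts
R = defaultdict(float)                           # model estimate of R(s, a)
n_sa = defaultdict(int)

def P(s2, s, a):
    total = sum(counts[(s, a)].values())
    return counts[(s, a)][s2] / total if total else 1.0 / len(states)

s = 0
for step in range(2000):
    # Select a greedy action w.r.t. the current model (with epsilon-greedy exploration)
    if random.random() < 0.1:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda b: R[(s, b)] +
                gamma * sum(P(s2, s, b) * V[s2] for s2 in states))
    s2, r = env_step(s, a)
    # Update the model R and P from the transition <s, a, s', r>
    counts[(s, a)][s2] += 1
    n_sa[(s, a)] += 1
    R[(s, a)] += (r - R[(s, a)]) / n_sa[(s, a)]
    # Update V(s) for all states with one sweep of the rule above
    for x in states:
        V[x] = max(R[(x, b)] + gamma * sum(P(y, x, b) * V[y] for y in states)
                   for b in actions)
    s = s2

print(dict(V))
```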
TD-Q Learning Algorithm
For each pair (s,a), initialize Q(s,a)
Observe the current state s
Loop forever
{
    Select an action a and execute it:
        a = argmax_a Q(s, a)
    Receive the immediate reward r and observe the new state s'
    Update Q(s,a):
        Q(s, a) = Q(s, a) + α ( r + γ max_{a'} Q(s', a') − Q(s, a) )
    s = s'
}
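A minimal Python sketch of tabular TD-Q learning with the update rule above; the toy environment, the ε-greedy exploration, and the constants α, γ, ε are assumptions not specified on the slide.

```python
import random
from collections import defaultdict

# Sketch: tabular TD-Q learning using the update rule from the slide.
# The toy environment, epsilon-greedy exploration, and constants are assumptions.
states, actions = [0, 1], [0, 1]
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def env_step(s, a):
    # Made-up dynamics: action 1 tends to reach the rewarding state 1.
    s2 = random.choices(states, weights=[0.8, 0.2] if a == 0 else [0.2, 0.8])[0]
    return s2, (1.0 if s2 == 1 else 0.0)

Q = defaultdict(float)                                  # Q[(s, a)], initialized to 0
s = 0
for step in range(5000):
    if random.random() < epsilon:
        a = random.choice(actions)                      # occasional exploration
    else:
        a = max(actions, key=lambda b: Q[(s, b)])       # a = argmax_a Q(s, a)
    s2, r = env_step(s, a)
    # Q(s,a) <- Q(s,a) + alpha * ( r + gamma * max_a' Q(s',a') - Q(s,a) )
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
    s = s2

print({k: round(v, 2) for k, v in Q.items()})
```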
Generalization in RL