
Fall 2010 Graduate Course on Reinforcement Learning

Unit 5: Dynamic Learning
Chapter 6: MDP and Reinforcement Learning

October 12, 2010
Byoung-Tak Zhang
School of Computer Science and Engineering &
Cognitive Science and Brain Science Programs
Seoul National University
http://bi.snu.ac.kr/~btzhang/
Overview
• Motivating Applications
– Learning robots
– Web agents
• Markov Decision Processes
– MDP
– POMDP
• Reinforcement Learning
– Adaptive Dynamic Programming (ADP)
– Temporal Difference (TD) Learning
– TD-Q Learning

Motivating Applications
• Generalized model learning for reinforcement learning
on a humanoid robot:
http://www.youtube.com/watch?v=mRpX9DFCdwI
• Autonomous spider learns to walk forward by
reinforcement learning:
http://www.youtube.com/watch?v=RZf8fR1SmNY&feature=related
• Reinforcement learning for a robotic soccer goalkeeper:
http://www.youtube.com/watch?v=CIF2SBVY-J0&feature=related



WAIR: Reinforcement Learning Web Agent
for Personalized Information Filtering

[Figure: WAIR interaction loop — at step i the agent observes State_i and takes Action_i (modify the user profile), which drives document filtering against the user profile; the user's relevance feedback on the filtered documents supplies Reward_{i+1}, which WAIR uses to learn.]

Zhang, B.-T. and Seo, Y.-W., Applied Artificial Intelligence, 15(7):665-685, 2001
Reinforcement Learning is…

… learning from trial-and-error
and reward by interaction with an
environment.
Markov Decision Processes (MDPs)

• A simple, general framework for modeling
sequential decision making in a stochastic
environment
• Early work by Bellman and Howard in the 1950s
• Currently a popular model in OR and AI
• Used to model problems in robotics, finance,
agriculture, etc.
Markov Decision Processes (MDPs)

• The decision-making agent starts in some state and
chooses an action according to its policy
• This determines an immediate reward and causes a
stochastic transition to the next state
• Aim is to find the policy that leads to the highest total
reward over T time steps (finite horizon)
• Because of the Markov property, the policy does not
have to remember previous states
The Formal Decision Problem - MDP

• Given <S, A, P, R, T, K>
– S is a finite state set (with start state s0)
– A is a finite action set
– P(s' | s, a) is a table of transition probabilities
– R(s, a, s') is a reward function
• Policy π(s, t) = a
• Is there a policy that yields total reward at least K
over finite horizon T?
• We will require that T < |S|
POMDP: Partially Observable MDP

• The agent's observation is a function of the
current state
• Now you may need to remember previous
observations in order to act optimally
POMDP

• Given <S, A, P, R, Ω, O, T, K>
– Ω is a finite observation set
– O(o | s, a, s') is a table of observation probabilities
• Policy π(o1, …, ot) = a
• Does there exist a policy that yields total
reward at least K over horizon T?
DEC-POMDP2

• Now two cooperating (decentralized) agents
with different observation functions
DEC-POMDP2

• Given <S, A, P, R, Ω1, Ω2, O, T, K>
– Ω1 and Ω2 are finite observation sets
– O(o1, o2 | s, a1, a2, s') is a table of observation probabilities
• Local policies π1(o1,1, …, o1,t) = a1 and π2(o2,1, …, o2,t) = a2
• Joint policy <π1, π2>
• Does there exist a joint policy that yields total reward at
least K over horizon T?

Note: DEC-POMDP1 = POMDP


DEC-MDP2

• DEC-POMDP2 with the requirement that the state is
uniquely determined by the tuple of observations:

J : Ω1 × Ω2 → S

Note: DEC-MDP1 = MDP


MDP

• Can use dynamic programming to “back up” values
from the end:

V(s) = max_{a∈A} [ R(s, a) + Σ_{s'∈S} P(s' | s, a) V(s') ]
Elements of Reinforcement Learning

[Figure: grid world with goal state G — left panel shows r(state, action), the immediate reward values (100 for actions entering G, 0 elsewhere); right panel shows the resulting V*(state) values, e.g. 81, 90, 100.]

• Value function: maps states to state values

V(s) ≡ r(t) + γ r(t+1) + γ² r(t+2) + …

Discount factor γ ∈ [0, 1) (here 0.9)

Elements of MDP
• A set S of states {s1, s2, s3, …}
• A set A of actions {a1, a2, a3, …}
• A transition function T: S × A → S (deterministic)
or T: S × A × S → [0, 1] (stochastic)
• A reward function R: S × A → Real
or R: S × A × S → Real
• A policy π: S → A (deterministic)
or π: S × A → [0, 1] (stochastic)
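As an illustration only (not part of the original slides), a minimal Python sketch of these elements for a small stochastic MDP; the state and action names and the transition numbers are made-up assumptions.

from dataclasses import dataclass

@dataclass
class MDP:
    states: list           # S
    actions: list          # A
    T: dict                # transitions: (s, a) -> {s': P(s'|s,a)}
    R: dict                # rewards: (s, a) -> real value
    gamma: float = 0.9     # discount factor (for discounted formulations)

# Tiny illustrative instance (names and numbers are hypothetical)
mdp = MDP(
    states=["s1", "s2"],
    actions=["stay", "move"],
    T={("s1", "stay"): {"s1": 1.0},
       ("s1", "move"): {"s1": 0.2, "s2": 0.8},
       ("s2", "stay"): {"s2": 1.0},
       ("s2", "move"): {"s1": 0.8, "s2": 0.2}},
    R={("s1", "stay"): 0.0, ("s1", "move"): 0.0,
       ("s2", "stay"): 1.0, ("s2", "move"): 0.0},
)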
Optimality Criteria
Suppose an agent receives a reward rt at time t. Then optimal
behaviour might:
• Maximise the sum of expected future rewards: Σ_t r_t
• Maximise over a finite horizon: Σ_{t=0}^T r_t
• Maximise over an infinite horizon: Σ_{t=0}^∞ r_t
• Maximise over a discounted infinite horizon: Σ_{t=0}^∞ γ^t r_t
• Maximise average reward: lim_{n→∞} (1/n) Σ_{t=1}^n r_t
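To make the criteria concrete, a small sketch (illustrative only; the reward sequence is made up) that evaluates a finite reward sequence under the finite-horizon, discounted, and average-reward criteria:

def finite_horizon_return(rewards, T):
    """Sum of rewards over the first T+1 steps (t = 0..T)."""
    return sum(rewards[: T + 1])

def discounted_return(rewards, gamma=0.9):
    """Discounted sum: sum_t gamma^t * r_t (finite prefix of the infinite sum)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def average_reward(rewards):
    """Average reward over the observed steps."""
    return sum(rewards) / len(rewards)

rewards = [0, 0, 1, 0, 1, 1]                  # example reward sequence (made up)
print(finite_horizon_return(rewards, T=3))    # 1
print(round(discounted_return(rewards), 3))   # 2.057
print(round(average_reward(rewards), 3))      # 0.5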
Examples of MDPs
• Goal-directed, Indefinite Horizon, Cost Minimization MDP
– <S, A, Pr, C, G, s0>
– Most often studied in planning community

• Infinite Horizon, Discounted Reward Maximization MDP
– <S, A, Pr, R, γ>
– Most often studied in reinforcement learning

• Goal-directed, Finite Horizon, Prob. Maximization MDP
– <S, A, Pr, G, s0, T>
– Also studied in planning community

• Oversubscription Planning: Non-absorbing goals, Reward Max. MDP
– <S, A, Pr, G, R, s0>
– Relatively recent model
Assumptions

• First-Order Markovian dynamics (history independence)
– Pr(St+1 | At, St, At-1, St-1, …, S0) = Pr(St+1 | At, St)
– Next state only depends on current state and current action
• First-Order Markovian reward process
– Pr(Rt | At, St, At-1, St-1, …, S0) = Pr(Rt | At, St)
– Reward only depends on current state and action
– As described earlier we will assume reward is specified by a deterministic
function R(s)
• i.e. Pr(Rt = R(St) | At, St) = 1
• Stationary dynamics and reward
– Pr(St+1 | At, St) = Pr(Sk+1 | Ak, Sk) for all t, k
– The world dynamics do not depend on the absolute time
• Full observability
– Though we can't predict exactly which state we will reach when we execute an
action, once it is realized, we know what it is

Policies (“Plans” for MDPs)
• Nonstationary policy
– π: S × T → A, where T is the non-negative integers
– π(s, t) is the action to do at state s with t stages-to-go
– What if we want to keep acting indefinitely?
• Stationary policy
– π: S → A
– π(s) is the action to do at state s (regardless of time)
– specifies a continuously reactive controller
• These assume or have these properties:
– full observability
– history-independence
– deterministic action choice

Value of a Policy
• How good is a policy π?
• How do we measure “accumulated” reward?
• Value function V: S → ℝ associates a value with each
state (or each state and time for non-stationary π)
• Vπ(s) denotes the value of policy π at state s
– Depends on immediate reward, but also what you achieve
subsequently by following π
– An optimal policy is one that is no worse than any other
policy at any state
• The goal of MDP planning is to compute an optimal
policy (method depends on how we define value)
Finite-Horizon Value Functions
• We first consider maximizing total reward over a
finite horizon
• Assumes the agent has n time steps to live
• To act optimally, should the agent use a
stationary or non-stationary policy?
• Put another way:
– If you had only one week to live would you act the
same way as if you had fifty years to live?

Finite Horizon Problems
• Value (utility) depends on stages-to-go
– hence so should policy: nonstationary π(s, k)
• Vπ^k(s) is the k-stage-to-go value function for π
– expected total reward after executing π for k time steps

Vπ^k(s) = E[ Σ_{t=0}^k R^t | π, s ]
        = E[ Σ_{t=0}^k R(s^t) | a^t = π(s^t, k − t), s^0 = s ]

• Here R^t and s^t are random variables denoting the reward
received and state at stage t respectively

Computing Finite-Horizon Value
• Can use dynamic programming to compute Vπ^k(s)
– Markov property is critical for this

(a) Vπ^0(s) = R(s), ∀s

(b) Vπ^k(s) = R(s) + Σ_{s'} T(s, π(s, k), s') · Vπ^{k−1}(s')
    (immediate reward plus the expected future payoff with
    k−1 stages to go, under action π(s, k))

[Figure: backup diagram — action π(s, k) leads from the V^k stage to successor states at the V^{k−1} stage with transition probabilities 0.7 and 0.3.]

What is the time complexity?
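A minimal sketch of this recursion (illustrative only; the two-state example, transition numbers, and dictionary representation are assumptions, not from the slides):

def finite_horizon_policy_value(states, T, R, pi, k):
    """Backward DP: V^0(s) = R(s); V^j(s) = R(s) + sum_{s'} T(s, pi(s, j), s') * V^{j-1}(s')."""
    V = {s: R[s] for s in states}            # V^0
    for j in range(1, k + 1):
        V = {s: R[s] + sum(p * V[s2]
                           for s2, p in T[(s, pi(s, j))].items())
             for s in states}
    return V

# Tiny example (made up): two states, one action
states = ["s1", "s2"]
R = {"s1": 0.0, "s2": 1.0}
T = {("s1", "a"): {"s1": 0.7, "s2": 0.3},
     ("s2", "a"): {"s2": 1.0}}
pi = lambda s, j: "a"                        # trivial policy
print(finite_horizon_policy_value(states, T, R, pi, k=2))   # {'s1': 0.81, 's2': 3.0}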
Bellman Backup
How can we compute optimal V_{t+1}(s) given optimal V_t?

[Figure: backup diagram — from state s, actions a1 and a2 each lead stochastically to successor states s1–s4; compute the expectation over successor values for each action, then take the max over actions.]

V_{t+1}(s) = R(s) + max { 0.7 V_t(s1) + 0.3 V_t(s4),
                          0.4 V_t(s2) + 0.6 V_t(s3) }
Value Iteration: Finite Horizon Case
• Markov property allows exploitation of the DP principle
for optimal policy construction
– no need to enumerate |A|^{Tn} possible policies
• Value Iteration (Bellman backup):

V^0(s) = R(s), ∀s
V^k(s) = R(s) + max_a Σ_{s'} T(s, a, s') · V^{k−1}(s')
π*(s, k) = argmax_a Σ_{s'} T(s, a, s') · V^{k−1}(s')

• V^k is the optimal k-stage-to-go value function
• π*(s, k) is the optimal k-stage-to-go policy
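Illustrative only (not from the slides): a minimal finite-horizon value iteration sketch that also extracts the k-stage-to-go policy; the toy model below is a made-up assumption.

def value_iteration_finite(states, actions, T, R, horizon):
    """V^0(s) = R(s); V^k(s) = R(s) + max_a sum_{s'} T(s,a,s') V^{k-1}(s').
    Returns V^horizon and the k-stage-to-go policy pi_star[k][s]."""
    V = {s: R[s] for s in states}
    pi_star = {}
    for k in range(1, horizon + 1):
        Q = {s: {a: R[s] + sum(p * V[s2] for s2, p in T[(s, a)].items())
                 for a in actions}
             for s in states}
        pi_star[k] = {s: max(Q[s], key=Q[s].get) for s in states}
        V = {s: max(Q[s].values()) for s in states}
    return V, pi_star

# Toy example (made-up numbers)
states, actions = ["s1", "s2"], ["a1", "a2"]
R = {"s1": 0.0, "s2": 1.0}
T = {("s1", "a1"): {"s1": 1.0}, ("s1", "a2"): {"s1": 0.2, "s2": 0.8},
     ("s2", "a1"): {"s2": 1.0}, ("s2", "a2"): {"s1": 0.8, "s2": 0.2}}
V, pi_star = value_iteration_finite(states, actions, T, R, horizon=3)
print(V, pi_star[3])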
Value Iteration

[Figure: value iteration unrolled over stages V3 ← V2 ← V1 ← V0 for states s1–s4, with transition probabilities 0.7/0.3 and 0.4/0.6 at each stage.]

V1(s4) = R(s4) + max { 0.7 V0(s1) + 0.3 V0(s4),
                       0.4 V0(s2) + 0.6 V0(s3) }
Value Iteration

[Figure: the same staged diagram (V3, V2, V1, V0 over states s1–s4); the optimal action at s4 with t stages-to-go is read off as the maximizing branch.]

π*(s4, t) = argmax { … }
Discounted Infinite Horizon MDPs
• Defining value as total reward is problematic with
infinite horizons
– many or all policies have infinite expected reward
– some MDPs are ok (e.g., zero-cost absorbing states)
• “Trick”: introduce discount factor 0 ≤ β < 1
– future rewards discounted by β per time step

Vπ(s) = E[ Σ_{t=0}^∞ β^t R^t | π, s ]

• Note: Vπ(s) ≤ E[ Σ_{t=0}^∞ β^t R_max ] = (1 / (1 − β)) R_max

• Motivation: economic? failure prob? convenience?


Computing an Optimal Value Function
• Bellman equation for the optimal value function

V*(s) = R(s) + β max_a Σ_{s'} T(s, a, s') V*(s')

– Bellman proved this is always true
• How can we compute the optimal value function?
– The MAX operator makes the system non-linear, so the problem is more
difficult than policy evaluation
• Notice that the optimal value function is a fixed-point of
the Bellman Backup operator B
– B takes a value function as input and returns a new value function

B[V](s) = R(s) + β max_a Σ_{s'} T(s, a, s') V(s')
Value Iteration
• Can compute the optimal policy using value iteration,
just like finite-horizon problems (just include the
discount term)

V^0(s) = 0
V^k(s) = R(s) + β max_a Σ_{s'} T(s, a, s') V^{k−1}(s')

• Will converge to the optimal value function as k gets
large. Why?

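A minimal sketch of discounted value iteration (illustrative; the stopping threshold and the toy model are assumptions), iterating the Bellman backup B until successive value functions are close:

def value_iteration(states, actions, T, R, beta=0.9, eps=1e-6):
    """Iterate V <- B[V] until max_s |V_new(s) - V(s)| < eps."""
    V = {s: 0.0 for s in states}                        # V^0 = 0
    while True:
        V_new = {s: R[s] + beta * max(sum(p * V[s2]
                                          for s2, p in T[(s, a)].items())
                                      for a in actions)
                 for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < eps:
            return V_new
        V = V_new

# Toy model (made up): two states, two actions
states, actions = ["s1", "s2"], ["a1", "a2"]
R = {"s1": 0.0, "s2": 1.0}
T = {("s1", "a1"): {"s1": 1.0}, ("s1", "a2"): {"s1": 0.2, "s2": 0.8},
     ("s2", "a1"): {"s2": 1.0}, ("s2", "a2"): {"s1": 0.8, "s2": 0.2}}
print(value_iteration(states, actions, T, R))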
Policy Evaluation

• Value equation for a fixed policy

Vπ(s) = R(s) + β Σ_{s'} T(s, π(s), s') Vπ(s')

• How can we compute the value function for a
policy?
– we are given R and Pr
– simple linear system with n variables (each variable is the
value of a state) and n constraints (one value equation
for each state)
– Use linear algebra (e.g. matrix inverse)

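Since the fixed-policy value equation is linear, it can be solved directly. A sketch (illustrative only; the 2-state transition matrix is made up) using numpy to solve (I − β T_π) V = R:

import numpy as np

def policy_evaluation(T_pi, R, beta=0.9):
    """Solve V = R + beta * T_pi V  <=>  (I - beta * T_pi) V = R.
    T_pi[i, j] = probability of moving from state i to state j under policy pi."""
    n = len(R)
    return np.linalg.solve(np.eye(n) - beta * T_pi, R)

# Toy example (made up): 2 states, policy-induced transition matrix
T_pi = np.array([[0.2, 0.8],
                 [0.8, 0.2]])
R = np.array([0.0, 1.0])
print(policy_evaluation(T_pi, R))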
Policy Iteration
• Given a fixed policy, can compute its value exactly:

Vπ(s) = R(s) + β Σ_{s'} T(s, π(s), s') Vπ(s')

• Policy iteration exploits this: it iterates steps of policy evaluation
and policy improvement

1. Choose a random policy π
2. Loop:
   (a) Evaluate Vπ
   (b) For each s in S, set π'(s) = argmax_a Σ_{s'} T(s, a, s') Vπ(s')   [policy improvement]
   (c) Replace π with π'
   Until no improving action possible at any state

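A compact sketch of the loop above (illustrative only; matrix-form policy evaluation and the toy model are assumptions):

import numpy as np

def policy_iteration(n_states, n_actions, T, R, beta=0.9):
    """T[a] is an (n_states x n_states) transition matrix; R is a reward vector.
    Alternates exact policy evaluation with greedy policy improvement."""
    pi = np.zeros(n_states, dtype=int)                  # start from an arbitrary policy
    while True:
        T_pi = np.array([T[pi[s]][s] for s in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - beta * T_pi, R)    # evaluate pi
        Q = np.array([[T[a][s] @ V for a in range(n_actions)]     # improvement targets
                      for s in range(n_states)])
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):                  # no improving action at any state
            return pi, V
        pi = pi_new

# Toy model (made up): 2 states, 2 actions
T = [np.array([[1.0, 0.0], [0.0, 1.0]]),                # action 0: stay
     np.array([[0.2, 0.8], [0.8, 0.2]])]                # action 1: move
R = np.array([0.0, 1.0])
print(policy_iteration(2, 2, T, R))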
Policy Iteration Notes

• Each step of policy iteration is guaranteed to
strictly improve the policy at some state when
improvement is possible
• Convergence assured (Howard)
– intuitively: no local maxima in value space, and each
policy must improve value; since there is a finite number of
policies, it will converge to an optimal policy
• Gives exact value of the optimal policy

Value Iteration vs. Policy Iteration
• Which is faster? VI or PI?
– It depends on the problem
• VI takes more iterations than PI, but PI requires
more time on each iteration
– PI must perform policy evaluation on each step, which
involves solving a linear system
• Complexity:
– There are at most exp(n) policies, so PI is no worse
than exponential time in the number of states
– Empirically O(n) iterations are required
– Still no polynomial bound on the number of PI
iterations (open problem)!

Adaptive Dynamic Programming (ADP)

For each s, initialize V(s), P(s'|s,a) and R(s,a)
Initialize s to the current state that is perceived
Loop forever
{
    Select an action a and execute it (using the current model R and P), using
        a = argmax_a ( R(s, a) + γ Σ_{s'} P(s' | s, a) V(s') )
    Receive immediate reward r and observe the new state s'
    Using the transition tuple <s, a, s', r>, update the model R and P
    For all states s, update V(s) using the updating rule
        V(s) = max_a ( R(s, a) + γ Σ_{s'} P(s' | s, a) V(s') )
    s = s'
}
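A minimal sketch of the ADP loop above (illustrative only; the environment interface env.reset()/env.step(a) returning (s', r), the count-based model estimate, and the epsilon-greedy exploration term are assumptions not specified on the slide):

import random
from collections import defaultdict

def adp_agent(env, states, actions, gamma=0.9, n_steps=10000, epsilon=0.1):
    """ADP sketch: learn P(s'|s,a) and R(s,a) from observed transitions,
    act (mostly) greedily w.r.t. the current model, update V by full backups."""
    V = defaultdict(float)
    trans_counts = defaultdict(lambda: defaultdict(int))   # (s,a) -> {s': count}
    sa_counts = defaultdict(int)                            # (s,a) -> visit count
    R = defaultdict(float)                                  # running mean of rewards

    def P(s, a):
        n = sa_counts[(s, a)]
        return {s2: c / n for s2, c in trans_counts[(s, a)].items()} if n else {}

    def q(s, a):
        return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P(s, a).items())

    s = env.reset()
    for _ in range(n_steps):
        # greedy action under the current model (epsilon-greedy added for exploration;
        # the slide's pure argmax selection corresponds to epsilon = 0)
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: q(s, act))
        s2, r = env.step(a)                                 # assumed environment interface
        # update the model R and P from the tuple <s, a, s', r>
        sa_counts[(s, a)] += 1
        trans_counts[(s, a)][s2] += 1
        R[(s, a)] += (r - R[(s, a)]) / sa_counts[(s, a)]
        # for all states, apply V(s) = max_a ( R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s') )
        for st in states:
            V[st] = max(q(st, act) for act in actions)
        s = s2
    return V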
How to Learn Model?
• Use the transition tuple <s, a, s', r> to learn
P(s'|s,a) and R(s,a). That's supervised learning!
– Since the agent gets every transition (s, a, s', r)
directly, take (s, a) and s' as an input-output example
of the transition probability function P(s'|s,a).
– Different techniques from supervised learning can be used:
• Neural networks
• Decision trees
– Use r and P(s'|s,a) to learn R(s,a):

R(s, a) = Σ_{s'} P(s' | s, a) r
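The simplest such estimator is a plain maximum-likelihood count model; the sketch below is illustrative only and stands in for the neural-network or decision-tree learners mentioned above:

from collections import defaultdict

class CountModel:
    """Maximum-likelihood model: P(s'|s,a) from transition counts,
    R(s,a) as the running mean of observed rewards."""
    def __init__(self):
        self.n = defaultdict(int)                       # visits of (s, a)
        self.next_counts = defaultdict(lambda: defaultdict(int))
        self.r_mean = defaultdict(float)

    def update(self, s, a, s2, r):
        self.n[(s, a)] += 1
        self.next_counts[(s, a)][s2] += 1
        self.r_mean[(s, a)] += (r - self.r_mean[(s, a)]) / self.n[(s, a)]

    def P(self, s, a):
        n = self.n[(s, a)]
        return {s2: c / n for s2, c in self.next_counts[(s, a)].items()}

    def R(self, s, a):
        return self.r_mean[(s, a)]

model = CountModel()
model.update("s1", "a", "s2", 1.0)     # example transitions (made up)
model.update("s1", "a", "s1", 0.0)
print(model.P("s1", "a"), model.R("s1", "a"))   # {'s2': 0.5, 's1': 0.5} 0.5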
From ADP to TD Learning
• In each step, ADP updates V for all states s

V(s) = max_a ( R(s, a) + γ Σ_{s'} P(s' | s, a) V(s') )

– This is intractable for a large state space!
– Improve this by prioritized sweeping.
• Temporal difference (TD) learning:
– Adjust the estimated utility value V of the current state
based on its immediate reward R(s) and the estimated
value of the next state s'.

V(s) = V(s) + α ( R(s) + γ V(s') − V(s) )
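A one-function sketch of the TD update (illustrative; the learning rate, discount value, and the use of a Python dict for V are assumptions):

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """TD(0): move V(s) toward the one-step target r + gamma * V(s')."""
    V[s] = V[s] + alpha * (r + gamma * V[s_next] - V[s])
    return V

V = {"s1": 0.0, "s2": 0.0}
V = td_update(V, s="s1", r=1.0, s_next="s2")
print(V["s1"])    # 0.1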
TD-Q Learning
• Define the Q-value function

V(s) = max_a Q(s, a)

Q(s, a) = R(s, a) + γ Σ_{s'} P(s' | s, a) V(s')
        = R(s, a) + γ Σ_{s'} P(s' | s, a) max_{a'} Q(s', a')

• Recall: TD learning

V(s) = V(s) + α ( R(s) + γ V(s') − V(s) )

• Key idea of TD-Q learning

Q(s, a) = Q(s, a) + α ( r + γ max_{a'} Q(s', a') − Q(s, a) )

• Note: Delta rule (neural network learning)

w(t) = w(t) + α (d − o) x
TD-Q Learning Algorithm

For each pair (s, a), initialize Q(s, a)
Observe the current state s
Loop forever
{
    Select an action a and execute it
        a = argmax_a Q(s, a)
    Receive immediate reward r and observe the new state s'
    Update Q(s, a):
        Q(s, a) = Q(s, a) + α ( r + γ max_{a'} Q(s', a') − Q(s, a) )
    s = s'
}
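A minimal sketch of the loop above (illustrative only; the env.reset()/env.step interface and the epsilon-greedy exploration term are assumptions — the slide's pure greedy selection corresponds to epsilon = 0):

import random
from collections import defaultdict

def q_learning(env, actions, alpha=0.1, gamma=0.9, epsilon=0.1, n_steps=10000):
    """Tabular TD-Q learning."""
    Q = defaultdict(float)                      # Q[(s, a)], initialized to 0
    s = env.reset()
    for _ in range(n_steps):
        # select an action (epsilon-greedy; greedy = argmax_a Q(s, a))
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s2, r = env.step(a)                     # receive reward, observe next state
        # TD-Q update toward r + gamma * max_a' Q(s', a')
        target = r + gamma * max(Q[(s2, act)] for act in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2
    return Q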
Generalization in RL

• So far we assumed that all the functions
learned by the agent (V, P, R, Q) are in
tabular form, i.e. it is possible to enumerate
the state and action spaces.
• Use generalization techniques to deal with
large state or action spaces.
– Function approximation techniques
– Neural networks
– Other supervised/unsupervised learning methods
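For instance (illustrative only; a linear approximator over a user-supplied feature function is one common choice, not one prescribed by the slide), Q can be represented by a weight vector instead of a table and trained with the same TD-Q target via the delta rule:

import numpy as np

def q_approx(w, features, s, a):
    """Linear Q-function approximation: Q(s, a) ~= w . phi(s, a)."""
    return w @ features(s, a)

def td_q_step(w, features, actions, s, a, r, s2, alpha=0.01, gamma=0.9):
    """Gradient (delta-rule) update of w toward the TD-Q target."""
    target = r + gamma * max(q_approx(w, features, s2, a2) for a2 in actions)
    error = target - q_approx(w, features, s, a)
    return w + alpha * error * features(s, a)

# Example feature map (made up): one-hot over (state, action) pairs, which degenerates to a table
states, actions = ["s1", "s2"], ["a1", "a2"]
def features(s, a):
    phi = np.zeros(len(states) * len(actions))
    phi[states.index(s) * len(actions) + actions.index(a)] = 1.0
    return phi

w = np.zeros(4)
w = td_q_step(w, features, actions, "s1", "a1", r=1.0, s2="s2")
print(w)    # only the (s1, a1) weight moves: [0.01, 0, 0, 0]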
Alternatives

• Function approximation of the Q-table:
– Neural networks
– Decision trees
– Gradient descent methods

• Reinforcement learning variants:
– Relational reinforcement learning
– Hierarchical reinforcement learning
– Intrinsically motivated reinforcement learning
References and Further Reading

• Sutton, R., Barto, A., (2000) Reinforcement Learning:
An Introduction, The MIT Press.
http://www.cs.ualberta.ca/~sutton/book/the-book.html
• Kaelbling, L., Littman, M., Moore, A., (1996)
Reinforcement Learning: A Survey, Journal of
Artificial Intelligence Research, 4:237-285.
• Barto, A., Mahadevan, S., (2003) Recent Advances in
Hierarchical Reinforcement Learning, Discrete Event
Dynamic Systems: Theory and Applications, 13(4):41-77.
