
Administration

Midterms due
Daily Show video

Chapter 13: Reinforcement Learning

CS 536: Machine Learning
Littman (Wu, TA)
Reinforcement Learning

[Read Chapter 13]
[Exercises 13.1, 13.2, 13.4]

Control learning
Control policies that choose optimal actions
Q learning
Convergence

Control Learning

Consider learning to choose actions, like:
Robot learning to dock on battery charger
Learning to choose actions to optimize factory output
Learning to play Backgammon

Problem Characteristics

Note several problem characteristics:
Delayed reward
Opportunity for active exploration
Possibility that state is only partially observable
Possible need to learn multiple tasks with the same sensors/effectors

One Example: TD-Gammon

[Tesauro, 1995]
Learn to play Backgammon
Immediate reward:
+100 if win
-100 if lose
0 for all other states
Trained by playing 1.5 million games against itself.
Now approximately equal to the best human player.

The RL Problem

Goal: Learn to choose actions that maximize
r0 + γ r1 + γ² r2 + ... , where 0 ≤ γ < 1

Markov Decision Processes

Assume:
finite set of states S; set of actions A
at each discrete time, agent observes state st ∈ S and chooses action at ∈ A
then receives immediate reward rt, and state changes to st+1
Markov assumption: rt = r(st, at) and st+1 = δ(st, at) depend only on the current state and action
δ and r may be nondeterministic
δ and r not necessarily known to agent
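To make δ and r concrete, here is a minimal sketch of a deterministic MDP in Python. The two states, two actions, and reward values are invented for illustration; they are not the example MDP from these slides.

```python
# A minimal sketch of a deterministic MDP with hypothetical states and actions.
# DELTA encodes delta: S x A -> S, REWARD encodes r: S x A -> R.

STATES = ["s0", "s1"]          # finite set of states S
ACTIONS = ["stay", "move"]     # set of actions A

DELTA = {   # deterministic transition function
    ("s0", "stay"): "s0", ("s0", "move"): "s1",
    ("s1", "stay"): "s1", ("s1", "move"): "s0",
}

REWARD = {  # immediate reward function
    ("s0", "stay"): 0, ("s0", "move"): 0,
    ("s1", "stay"): 1, ("s1", "move"): 0,
}

def step(state, action):
    """One time step: the agent receives r(s, a) and the state becomes delta(s, a)."""
    return REWARD[(state, action)], DELTA[(state, action)]
```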

Agent's Learning Task

Execute actions in environment, observe results, and
learn action policy π: S → A that maximizes
E[rt + γ rt+1 + γ² rt+2 + ...]
from any starting state in S;
here 0 ≤ γ < 1 is the discount factor for future rewards.

Different Learning Problem

Note something new:
Target function is π: S → A
but we have no training examples of form <s, a>
training examples are of form <<s, a>, r>

Value Function

To begin, consider deterministic worlds...
For each possible policy π the agent might adopt, we can define an evaluation function over states
Vπ(s) ≡ rt + γ rt+1 + γ² rt+2 + ...
      ≡ Σi=0..∞ γ^i rt+i
where rt, rt+1, ... are generated by following policy π starting at state s.
Restated, the task is to learn the optimal policy π*:
π* ≡ argmaxπ Vπ(s), (∀ s).
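As a small illustration of the discounted sum defining Vπ, the sketch below rolls a policy forward in a deterministic world. The callables policy, delta, and reward and the truncation horizon are assumptions for illustration; truncation is safe because γ < 1 makes later terms vanish.

```python
# A minimal sketch of approximating V^pi(s) = sum_i gamma^i * r_{t+i}
# by following the policy pi from state s in a deterministic world.

def value_of_policy(s, policy, delta, reward, gamma=0.9, horizon=100):
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)                      # action chosen by pi in state s
        total += discount * reward(s, a)   # add the discounted immediate reward
        s = delta(s, a)                    # deterministic next state
        discount *= gamma
    return total
```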

Example MDP

What to Learn

We might try to have the agent learn the evaluation function Vπ* (which we write as V*).
It could then do a lookahead search to choose the best action from any state s because
π*(s) = argmaxa [r(s, a) + γ V*(δ(s, a))]
A problem:
This works well if the agent knows δ: S × A → S and r: S × A → ℝ.
But when it doesn't, it can't choose actions this way.

Q Function

Define a new function very similar to V*:
Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
If the agent learns Q, it can choose the optimal action even without knowing δ!
π*(s) = argmaxa [r(s, a) + γ V*(δ(s, a))]
      = argmaxa Q(s, a)
Q is the evaluation function the agent will learn.
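The practical payoff is that action selection reads straight off the learned table, with no access to δ or r. A minimal sketch, assuming Q is stored as a dict keyed by (state, action):

```python
# pi*(s) = argmax_a Q(s, a), computed from the table alone.
def greedy_action(Q, state, actions):
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```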

Training Rule to Learn Q

Note Q and V* are closely related:
V*(s) = maxa' Q(s, a')
This allows us to write Q recursively as
Q(st, at) = r(st, at) + γ V*(δ(st, at))
          = r(st, at) + γ maxa' Q(st+1, a')
Nice! Let Q̂ denote the learner's current approximation to Q. Use training rule
Q̂(s, a) ← r + γ maxa' Q̂(s', a')
where s' is the state resulting from applying action a in state s.

Q Learning in Deterministic Case

For each s, a initialize the table entry Q̂(s, a) ← 0
Observe the current state s
Do forever:
  Select an action a and execute it
  Receive immediate reward r
  Observe the new state s'
  Update the table entry for Q̂(s, a) via:
    Q̂(s, a) ← r + γ maxa' Q̂(s', a')
  s ← s'
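The loop above translates almost line for line into code. Below is a minimal sketch of tabular Q learning for the deterministic case; the environment interface (reset, step, is_terminal) and the epsilon-greedy exploration choice are assumptions added for illustration, while the table update itself is the training rule from this slide.

```python
import random
from collections import defaultdict

def q_learning(reset, step, is_terminal, actions, gamma=0.9,
               episodes=1000, epsilon=0.1):
    Q = defaultdict(float)                      # Q_hat(s, a), initialized to 0
    for _ in range(episodes):
        s = reset()                             # observe current state s
        while not is_terminal(s):
            if random.random() < epsilon:       # occasionally explore
                a = random.choice(actions)
            else:                               # otherwise act greedily on Q_hat
                a = max(actions, key=lambda a_: Q[(s, a_)])
            r, s_next = step(s, a)              # receive reward r, observe s'
            # Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
            s = s_next                          # s <- s'
    return Q
```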

Updating Q̂

Q̂(s1, aright) ← r + γ maxa' Q̂(s2, a')
             = 0 + 0.9 max{63, 81, 100} = 90
Notice that if rewards are non-negative, then
(∀ s, a, n)  Q̂n+1(s, a) ≥ Q̂n(s, a)
and
(∀ s, a, n)  0 ≤ Q̂n(s, a) ≤ Q(s, a)
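A one-line check of the arithmetic in this backup, using the values given above (reward 0, γ = 0.9, successor Q̂ entries 63, 81, 100):

```python
# Reproducing the single backup from this slide.
r, gamma = 0, 0.9
print(r + gamma * max(63, 81, 100))   # prints 90.0
```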

Convergence Proof

Q̂ converges to Q. Consider the case of a deterministic world where we see each <s, a> visited infinitely often.
Proof: Define a full interval to be an interval during which each <s, a> is visited.
During each full interval the largest error in the Q̂ table is reduced by a factor of γ.
Let Q̂n be the table after n updates, and Δn be the maximum error in Q̂n; that is,
Δn = maxs,a |Q̂n(s, a) - Q(s, a)|

Proof Continued

For a table entry Q̂n(s, a) updated on iteration n+1, the error in the revised estimate Q̂n+1(s, a) is
|Q̂n+1(s, a) - Q(s, a)| = |(r + γ maxa' Q̂n(s', a')) - (r + γ maxa' Q(s', a'))|
                       = γ |maxa' Q̂n(s', a') - maxa' Q(s', a')|
                       ≤ γ maxa' |Q̂n(s', a') - Q(s', a')|
                       ≤ γ maxs'',a' |Q̂n(s'', a') - Q(s'', a')|
so |Q̂n+1(s, a) - Q(s, a)| ≤ γ Δn.
Note that we used the fact that
|maxa f1(a) - maxa f2(a)| ≤ maxa |f1(a) - f2(a)|

Nondeterministic Case

What if reward and next state are non-deterministic?
We redefine V and Q by taking expected values:
Vπ(s) ≡ E[rt + γ rt+1 + γ² rt+2 + ...]
      ≡ E[Σi=0..∞ γ^i rt+i]
Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))]

Nondeterministic Case

Q learning generalizes to nondeterministic worlds.
Alter training rule to
Q̂n(s, a) ← (1 - αn) Q̂n-1(s, a) + αn [r + γ maxa' Q̂n-1(s', a')]
where
αn = 1/(1 + visitsn(s, a))
Can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992].
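A minimal sketch of this decaying-learning-rate update, assuming a dict-based Q̂ table plus a visit counter (both illustrative additions):

```python
from collections import defaultdict

Q = defaultdict(float)        # Q_hat(s, a)
visits = defaultdict(int)     # visits_n(s, a) before the current update

def q_update(s, a, r, s_next, actions, gamma=0.9):
    alpha = 1.0 / (1 + visits[(s, a)])     # alpha_n = 1/(1 + visits_n(s, a))
    target = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
    # blend the old estimate with the sampled backup
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    visits[(s, a)] += 1
```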

Temporal Difference Learning

Q learning: reduce discrepancy between successive Q estimates.
One-step time difference:
Q(1)(st, at) ≡ rt + γ maxa Q(st+1, a)
Why not 2 steps?
Q(2)(st, at) ≡ rt + γ rt+1 + γ² maxa Q(st+2, a)
Or n?
Q(n)(st, at) ≡ rt + γ rt+1 + ... + γ^(n-1) rt+n-1 + γ^n maxa Q(st+n, a)
Blend all of these:
Qλ(st, at) ≡ (1 - λ) [Q(1)(st, at) + λ Q(2)(st, at) + λ² Q(3)(st, at) + ...]
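The blended backup can be computed from a recorded episode. Below is a minimal sketch; the episode format, the truncation at the episode's end (which keeps the weights summing to 1), and all function names are illustrative assumptions rather than part of the original presentation.

```python
# rewards[t] and states[t] come from a recorded episode (states has one extra
# entry for the final state); Q is a dict {(state, action): value}.

def n_step_backup(rewards, states, t, n, Q, gamma, actions):
    """Q^(n)(s_t, a_t): n discounted rewards plus a bootstrapped max-Q tail."""
    g = sum(gamma ** k * rewards[t + k] for k in range(n))
    g += gamma ** n * max(Q.get((states[t + n], a), 0.0) for a in actions)
    return g

def lambda_backup(rewards, states, t, Q, gamma, lam, actions):
    """(1 - lam) * [Q^(1) + lam * Q^(2) + ...], truncated at the episode end."""
    max_n = len(rewards) - t              # steps remaining in the episode
    total, weight = 0.0, 1.0 - lam
    for n in range(1, max_n):
        total += weight * n_step_backup(rewards, states, t, n, Q, gamma, actions)
        weight *= lam
    # leftover weight lam^(max_n - 1) goes to the longest available backup
    total += lam ** (max_n - 1) * n_step_backup(rewards, states, t, max_n, Q, gamma, actions)
    return total
```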

Temporal Difference Learning

Qλ(st, at) ≡ (1 - λ) [Q(1)(st, at) + λ Q(2)(st, at) + λ² Q(3)(st, at) + ...]
Equivalent recursive expression:
Qλ(st, at) ≡ rt + γ [(1 - λ) maxa Q̂(st+1, a) + λ Qλ(st+1, at+1)]
TD(λ) algorithm uses the above training rule.
Sometimes converges faster than Q learning (not well understood in the control case).
Converges for learning V for any 0 ≤ λ ≤ 1 (Dayan, 1992).
Tesauro's TD-Gammon uses this algorithm to estimate the value function via self-play.

Subtleties & Ongoing Research

Replace Q̂ table with neural net or other generalizer
Handle case where state is only partially observable
Design optimal exploration strategies
Extend to continuous action, state
Learn and use δ̂: S × A → S
Relationship to dynamic programming and heuristic search
