Chapter 13: Reinforcement Learning
Control Learning
Problem Characteristics
[Tesauro, 1995]
Learn to play Backgammon.
Immediate reward:
+100 if win
-100 if lose
0 for all other states
Trained by playing 1.5 million games against itself.
Now approximately equal to the best human player.
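As a concrete illustration of this reward structure, here is a minimal sketch of such a terminal-only reward function in Python. The game-state arguments are hypothetical; only the +100 / -100 / 0 scheme comes from the slide.

```python
def backgammon_reward(game_over: bool, agent_won: bool) -> float:
    """Terminal-only reward in the TD-Gammon style described above.

    Assumes the caller reports whether the game has ended and who won;
    every non-terminal state yields zero reward.
    """
    if not game_over:
        return 0.0                      # no intermediate reward signal
    return 100.0 if agent_won else -100.0
```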
The RL Problem
Value Function
To begin, consider deterministic worlds...
For each possible policy $\pi$ the agent might adopt, we can define an evaluation function over states

$V^{\pi}(s) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \equiv \sum_{i=0}^{\infty} \gamma^i r_{t+i}$

where $r_t, r_{t+1}, \ldots$ are generated by following policy $\pi$ starting at state $s$.
Restated, the task is to learn the optimal policy $\pi^*$:

$\pi^* \equiv \operatorname{argmax}_{\pi} V^{\pi}(s), \quad (\forall s)$
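To make the definition concrete, here is a small sketch (not from the slides) that evaluates $V^{\pi}$ in a deterministic world by summing discounted rewards along the trajectory the policy generates. The three-state chain, its reward function, and the horizon cutoff are all illustrative assumptions.

```python
GAMMA = 0.9

# Hypothetical deterministic world with a fixed policy pi already folded in:
# next_state[s] is the state pi leads to from s, reward[s] is the immediate
# reward r_t received when acting in s.  State 2 is absorbing with zero reward.
next_state = {0: 1, 1: 2, 2: 2}
reward = {0: 0.0, 1: 10.0, 2: 0.0}

def value_of_policy(s: int, horizon: int = 100) -> float:
    """Approximate V^pi(s) = sum_i gamma^i * r_{t+i} by truncating the
    infinite sum after `horizon` steps (enough here, since rewards vanish
    once the absorbing state is reached)."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        total += discount * reward[s]
        discount *= GAMMA
        s = next_state[s]
    return total

print(value_of_policy(0))   # 0 + 0.9 * 10 = 9.0
```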
Example MDP
What to Learn
Q Function
Updating Q
Convergence Proof
Proof Continued
Nondeterministic Case
What if reward and next state are nondeterministic?
We redefine V and Q by taking expected values:

$V^{\pi}(s) \equiv E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots] \equiv E\!\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i}\right]$

$Q(s, a) \equiv E[r(s, a) + \gamma V^{*}(\delta(s, a))]$
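One way to read these expectations operationally is as averages over sampled trajectories. The sketch below is an illustration, not part of the slides: it estimates $V^{\pi}(s)$ by Monte Carlo, where `env_step(s, a)` (a stochastic simulator returning reward and next state) and `policy(s)` are assumed interfaces.

```python
GAMMA = 0.9

def sample_return(env_step, policy, s, horizon=100):
    """One sampled discounted return sum_i gamma^i * r_{t+i} starting from s."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        r, s = env_step(s, policy(s))   # stochastic reward and next state
        total += discount * r
        discount *= GAMMA
    return total

def estimate_value(env_step, policy, s, n_samples=1000):
    """Monte Carlo estimate of V^pi(s) = E[ sum_i gamma^i * r_{t+i} ]."""
    return sum(sample_return(env_step, policy, s) for _ in range(n_samples)) / n_samples
```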
Nondeterministic Case
Q learning generalizes to nondeterministic worlds.
Alter training rule to

$\hat{Q}_n(s, a) \leftarrow (1 - \alpha_n)\,\hat{Q}_{n-1}(s, a) + \alpha_n \left[ r + \gamma \max_{a'} \hat{Q}_{n-1}(s', a') \right]$

where

$\alpha_n = \dfrac{1}{1 + \text{visits}_n(s, a)}$
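A minimal sketch of this update rule in Python, assuming a small discrete state/action space. The dictionaries holding $\hat{Q}$ and the visit counts, and the per-step interface, are illustrative assumptions rather than anything the slides specify.

```python
from collections import defaultdict

GAMMA = 0.9
Q = defaultdict(float)        # Q-hat table, keyed by (state, action)
visits = defaultdict(int)     # visits_n(s, a)

def q_update(s, a, r, s_next, actions):
    """One nondeterministic Q-learning update:
    Q_n(s,a) <- (1 - alpha_n) Q_{n-1}(s,a) + alpha_n [r + gamma * max_a' Q_{n-1}(s', a')]
    with alpha_n = 1 / (1 + visits_n(s, a))."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + GAMMA * best_next)
```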
Can still prove convergence of $\hat{Q}$ to $Q$ [Watkins and Dayan, 1992].
Equivalent expression (TD($\lambda$)):

$Q^{\lambda}(s_t, a_t) \equiv r_t + \gamma \left[ (1 - \lambda) \max_{a'} \hat{Q}(s_{t+1}, a') + \lambda\, Q^{\lambda}(s_{t+1}, a_{t+1}) \right]$
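The recursion above can be unrolled backward over a finished episode to produce TD($\lambda$) training targets. The sketch below is illustrative only: the episode format, the Q-hat table, and the convention of bootstrapping with $\max_{a'} \hat{Q}$ at the episode's final state are assumptions, not something the slides prescribe.

```python
def td_lambda_targets(episode, s_final, Q, actions, gamma=0.9, lam=0.7):
    """Compute Q^lambda(s_t, a_t) for each step of a finished episode.

    episode: list of (s_t, a_t, r_t) tuples in time order.
    s_final: state reached after the last action.
    Q:       dict mapping (state, action) -> current Q-hat estimate.

    Uses the recursion
      Q^lam(s_t,a_t) = r_t + gamma*[(1-lam)*max_a' Q(s_{t+1},a') + lam*Q^lam(s_{t+1},a_{t+1})].
    """
    T = len(episode)
    targets = [0.0] * T
    for t in reversed(range(T)):
        s, a, r = episode[t]
        if t == T - 1:
            # No (s_{t+1}, a_{t+1}) beyond the episode: cut the recursion off
            # with the one-step bootstrap at the final state.
            blend = max(Q[(s_final, a2)] for a2 in actions)
        else:
            s_next = episode[t + 1][0]
            max_next = max(Q[(s_next, a2)] for a2 in actions)
            blend = (1 - lam) * max_next + lam * targets[t + 1]
        targets[t] = r + gamma * blend
    return targets
```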
Subtleties and Ongoing Research
Replace $\hat{Q}$ table with neural net or other generalizer.
Handle case where state is only partially observable.
Design optimal exploration strategies.
Extend to continuous action, state.
Learn and use $\hat{\delta}: S \times A \rightarrow S$.
Relationship to dynamic programming and heuristic search.