
Artificial Intelligence - INFOF311

Reinforcement learning part 2

Instructor : Tom Lenaerts


Acknowledgement

We thank Stuart Russell for his generosity in allowing us to use the slide set
of the UC Berkeley course CS188, Introduction to Artificial Intelligence. These
slides were created by Dan Klein, Pieter Abbeel and Anca Dragan for CS188 Intro
to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.


The INFOF311 slides are slightly modified versions of the slides from the
spring and summer CS188 sessions in 2021 and 2022.
Recap: Reinforcement Learning

[Diagram: the agent-environment loop. The agent observes state s and reward r from the environment and chooses actions a.]

▪ Basic idea:
▪ Receive feedback in the form of rewards
▪ Agent’s utility is defined by the reward function
▪ Must (learn to) act so as to maximize expected rewards
▪ All learning is based on observed samples of outcomes!
Recap: Reinforcement Learning
▪ Still assume a Markov decision process (MDP):
▪ A set of states s ∈ S
▪ A set of actions (per state) A(s)
▪ A transition model T(s,a,s’)
▪ A reward function R(s,a,s’)
▪ Still looking for a policy π(s)

▪ New twist: don’t know T or R


▪ I.e. we don’t know which states are good or what the actions do
▪ Must explore new states and actions to discover how the world works
Recap: Offline (MDPs) vs. Online (RL)

Offline/Planning Online/Learning
Recap: Passive vs Active RL

Passive (fixed π) Active (changing π)


Approaches to reinforcement learning
1. Model-based: Learn the model, solve it, execute the solution
2. Learn values from experiences, use to make decisions
a. Direct evaluation
b. Temporal difference learning
c. Q-learning
3. Optimize the policy directly
Example: Model-Based Learning
Input Policy π (assume γ = 1)

[Gridworld diagram: A on top; B, C, D in the middle row; E at the bottom]

Observed Episodes (Training)
  Episode 1: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 2: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 3: E, north, C, -1; C, east, D, -1;  D, exit, x, +10
  Episode 4: E, north, C, -1; C, east, A, -1;  A, exit, x, -10

Learned Model
  T(s,a,s'):
    T(B, east, C) = 1.00
    T(C, east, D) = 0.75
    T(C, east, A) = 0.25
    …
  R(s,a,s'):
    R(B, east, C) = -1
    R(C, east, D) = -1
    R(D, exit, x) = +10
    …
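To make step 1 (learn the empirical model) concrete, here is a minimal sketch that recovers the learned model above by counting observed transitions; the (s, a, s', r) episode encoding and the function name estimate_model are illustrative choices, not course code.

```python
from collections import defaultdict

# The four training episodes from the slide, as lists of (s, a, s', r) tuples (gamma = 1).
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

def estimate_model(episodes):
    """Estimate T(s,a,s') and R(s,a,s') by counting observed transitions."""
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
    rewards = {}                                     # (s, a, s') -> observed reward
    for episode in episodes:
        for (s, a, s_next, r) in episode:
            counts[(s, a)][s_next] += 1
            rewards[(s, a, s_next)] = r              # rewards here are deterministic
    T = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        for s_next, c in outcomes.items():
            T[(s, a, s_next)] = c / total
    return T, rewards

T, R = estimate_model(episodes)
print(T[("C", "east", "D")])   # 0.75, as on the slide
print(T[("C", "east", "A")])   # 0.25
print(R[("D", "exit", "x")])   # +10
```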
Example: Direct (aka “Monte Carlo”) Estimation
Input Policy π (assume γ = 1)

[Gridworld diagram: A on top; B, C, D in the middle row; E at the bottom]

Observed Episodes (Training)
  Episode 1: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 2: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 3: E, north, C, -1; C, east, D, -1;  D, exit, x, +10
  Episode 4: E, north, C, -1; C, east, A, -1;  A, exit, x, -10

Output Values
  V(A) = -10, V(B) = +8, V(C) = +4, V(D) = +10, V(E) = -2
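Direct evaluation just averages the observed (discounted) returns following each visit to a state. A minimal sketch, reusing the episodes list from the previous sketch; direct_evaluation is an illustrative name.

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    """Estimate V^pi(s) as the average return observed after visiting s under pi."""
    totals = defaultdict(float)
    visits = defaultdict(int)
    for episode in episodes:
        # Compute the return from each time step by summing rewards backwards.
        G = 0.0
        returns = []
        for (_, _, _, r) in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for (s, _, _, _), G_t in zip(episode, returns):
            totals[s] += G_t
            visits[s] += 1
    return {s: totals[s] / visits[s] for s in totals}

V = direct_evaluation(episodes)
print(V)   # {'B': 8.0, 'C': 4.0, 'D': 10.0, 'E': -2.0, 'A': -10.0}, matching the slide
```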
Temporal Difference Learning
▪ Passive setting (fixed policy π), like policy evaluation:
    V^π(s) = Σ_s' T(s, π(s), s') [ R(s, π(s), s') + γ V^π(s') ]
  [Diagram: one-step lookahead from s through (s, π(s)) to successors s']

▪ Modifications:
  1. Don't know T or R; estimate the expectation from samples:
       V^π(s) ≈ (1/N) Σ_i [ r_i + γ V^π(s_i') ]
  2. Update V(s) after each transition (s, a, s', r) using a running average.
  3. Decay older samples as new ones come in.


Example: TD Value Estimation
▪ Experience transition i: (s_i, a_i, s_i', r_i).
▪ Compute the sampled value "target": r_i + γ V^π(s_i').
▪ Compute the "TD error": δ_i = r_i + γ V^π(s_i') - V^π(s_i).
▪ Update: V^π(s_i) ← V^π(s_i) + α_i · δ_i.
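A minimal sketch of the update above applied to a stream of transitions; using a constant learning rate α is a simplifying assumption (the slide allows a per-step α_i), and the episodes list comes from the earlier model-based sketch.

```python
def td_update(V, s, s_next, r, gamma=1.0, alpha=0.5):
    """One TD update for an observed transition (s, a, s', r) under the fixed policy."""
    target = r + gamma * V.get(s_next, 0.0)   # sampled value "target"
    delta = target - V.get(s, 0.0)            # TD error
    V[s] = V.get(s, 0.0) + alpha * delta      # nudge V(s) toward the target

V = {}
for episode in episodes:                      # episodes as defined in the earlier sketch
    for (s, a, s_next, r) in episode:
        td_update(V, s, s_next, r)
print(V)
```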
Example: TD Value Estimation
Input Policy π (assume γ = 1)

[Gridworld diagram: A on top; B, C, D in the middle row; E at the bottom]

Observed Episodes (Training)
  Episode 1: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 2: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 3: E, north, C, -1; C, east, D, -1;  D, exit, x, +10
  Episode 4: E, north, C, -1; C, east, A, -1;  A, exit, x, -10

Output Values: (filled in on the following slides)
Example: TD Value Estimation
▪ Experience transition i: (s_i, a_i, s_i', r_i).
▪ Compute the sampled value "target": r_i + γ V^π(s_i').
▪ Compute the "TD error": δ_i = r_i + γ V^π(s_i') - V^π(s_i).
▪ Update: V^π(s_i) ← V^π(s_i) + α_i · δ_i.

Transitions observed, in order (episodes 1-4):
  B, east, C, -1;   C, east, D, -1;   D, exit, x, +10
  B, east, C, -1;   C, east, D, -1;   D, exit, x, +10
  E, north, C, -1;  C, east, D, -1;   D, exit, x, +10
  E, north, C, -1;  C, east, A, -1;   A, exit, x, -10

Value table (to be filled in):
  s | V(s)
  A |
  B |
  C |
  D |
  E |

Update table (rows 1-7, to be filled in):
  i | s | a | s' | r | r + γ V^π(s') | V^π(s) | δ
  1 |
  2 |
  3 |
  4 |
  5 |
  6 |
  7 |
Example: TD Value Estimation
▪ Experience transition i: (s_i, a_i, s_i', r_i).
▪ Compute the sampled value "target": r_i + γ V^π(s_i').
▪ Compute the "TD error": δ_i = r_i + γ V^π(s_i') - V^π(s_i).
▪ Update: V^π(s_i) ← V^π(s_i) + α_i · δ_i.

Update table (first seven transitions):
  i | s | a     | s'  | r  | r + γ V^π(s') | V^π(s) | δ
  1 | B | east  | C   | -1 | -1 + 0        |  0     | -1
  2 | C | east  | D   | -1 | -1 + 0        |  0     | -1
  3 | D | exit  | --- | 10 | 10 + 0        |  0     | +10
  4 | B | east  | C   | -1 | -1 + -1       | -1     | -1
  5 | C | east  | D   | -1 | -1 + 10       | -1     | +9
  6 | D | exit  | --- | 10 | 10 + 0        | 10     |  0
  7 | E | north | C   | -1 | -1 + 9        |  0     | +8

Value table:
  s | V(s)
  A |   0
  B |  -1
  C |   9
  D |  10
  E |   8
Example: TD Value Estimation
Input Policy π (assume γ = 1)

[Gridworld diagram: A on top; B, C, D in the middle row; E at the bottom]

Observed Episodes (Training)
  Episode 1: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 2: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 3: E, north, C, -1; C, east, D, -1;  D, exit, x, +10
  Episode 4: E, north, C, -1; C, east, A, -1;  A, exit, x, -10

Output Values
  V(A) = -10, V(B) = +3, V(C) = +4, V(D) = +10, V(E) = +3
TD as approximate Bellman update
▪ Experience transition i: (s_i, a_i, s_i', r_i).
▪ Compute the sampled value "target": r_i + γ V^π(s_i').
▪ Compute the "TD error": δ_i = r_i + γ V^π(s_i') - V^π(s_i).
▪ Update with the TD learning rule:
  ▪ V^π(s_i) ← V^π(s_i) + α · δ_i
  ▪ V(s) ← V(s) + α · [target - V(s)]
  ▪ V(s) ← (1-α) · V(s) + α · target
  ▪ α is the learning rate
▪ Observe a sample, then move V(s) a little bit to make it more
  consistent with its neighbor V(s')
TD Learning Happens in the Brain!
▪ Neurons transmit dopamine to encode reward or value prediction error:
    δ_i = r_i + γ V^π(s_i') - V^π(s_i)

▪ An example of neuroscience and AI informing each other
Problems with TD Value Learning

▪ Model-free policy evaluation!


▪ Bellman updates with running sample mean!

[Diagram: a state s and its q-states (s, a0), (s, a1), (s, a2)]

▪ Need the transition model to improve the policy!


Detour: Q-Value Iteration
▪ Value iteration: find successive (depth-limited) values
▪ Start with V0(s) = 0, which we know is right
▪ Given Vk, calculate the depth k+1 values for all states:

▪ But Q-values are more useful, so compute them instead


▪ Start with Q0(s,a) = 0, which we know is right
▪ Given Qk, calculate the depth (k+1) q-values for all q-states:
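Written out, the two backups referenced above are the standard Bellman updates (consistent with the Bellman equation for Q* on the next slide):

```latex
% Value iteration: depth-(k+1) values from V_k
V_{k+1}(s) \;\leftarrow\; \max_{a} \sum_{s'} T(s,a,s')\,\bigl[\,R(s,a,s') + \gamma\, V_k(s')\,\bigr]

% Q-value iteration: depth-(k+1) q-values from Q_k
Q_{k+1}(s,a) \;\leftarrow\; \sum_{s'} T(s,a,s')\,\bigl[\,R(s,a,s') + \gamma \max_{a'} Q_k(s',a')\,\bigr]
```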
Q-learning as approximate Q-iteration
▪ Recall the definition of Q values:
▪ Q*(s,a) = expected return from doing a in s and then behaving optimally
  thereafter; and π*(s) = argmax_a Q*(s,a)

▪ Bellman equation for Q values:
  ▪ Q*(s,a) = Σ_s' T(s,a,s') [ R(s,a,s') + γ max_a' Q*(s',a') ]

▪ Approximate Bellman update for Q values:
  ▪ Q(s,a) ← (1-α) · Q(s,a) + α · [ R(s,a,s') + γ max_a' Q(s',a') ]

▪ We obtain a policy from learned Q(s,a), with no model!


▪ (No free lunch: Q(s,a) table is |A| times bigger than V(s) table)
Q-Learning

▪ Learn Q(s,a) values as you go


▪ Receive a sample (s,a,s',r)
▪ Consider your old estimate: Q(s,a)
▪ Consider your new sample estimate:
    q_target = R(s,a,s') + γ max_a' Q(s',a')

▪ Incorporate the new estimate into a running average:
    Q(s,a) ← (1-α) · Q(s,a) + α · q_target

[Demo: Q-learning – gridworld (L10D2)]


[Demo: Q-learning – crawler (L10D3)]
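A minimal tabular Q-learning loop built around the update above; the environment interface (env.reset(), env.step(), env.actions()) and the ε-greedy action choice are assumptions for illustration, not the course's gridworld API.

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=1000, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Tabular Q-learning from samples (s, a, s', r); acts epsilon-greedily while learning."""
    Q = defaultdict(float)                    # Q[(s, a)] defaults to 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:     # explore (see the next section)
                a = random.choice(env.actions(s))
            else:                             # exploit the current estimates
                a = max(env.actions(s), key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            if done:
                q_target = r                  # no future value from a terminal state
            else:
                q_target = r + gamma * max(Q[(s_next, act)] for act in env.actions(s_next))
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * q_target
            s = s_next
    return Q
```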
Video of Demo Q-Learning -- Gridworld
Q-Learning Properties
▪ Amazing result: Q-learning converges to optimal policy -- even
if samples are generated from a suboptimal policy!

▪ This is called off-policy learning

▪ Caveats:
▪ You have to explore enough
▪ You have to eventually make the learning rate
small enough
▪ … but not decrease it too quickly
▪ Basically, in the limit, it doesn’t matter how you select actions (!)
Exploration vs. Exploitation
Exploration vs. Exploitation
▪ Exploration: try new things
▪ Exploitation: do what’s best given what you’ve learned so far
▪ Key point: pure exploitation often gets stuck in a rut and never
finds an optimal policy!

Exploration method 1: ε-greedy
▪ ε-greedy exploration
  ▪ Every time step, flip a biased coin
  ▪ With (small) probability ε, act randomly
  ▪ With (large) probability 1-ε, act on the current policy

▪ Properties of ε-greedy exploration
  ▪ Every (s,a) pair is tried infinitely often
  ▪ Does a lot of stupid things
    ▪ e.g., jumping off a cliff many times to make sure it hurts
  ▪ Keeps doing stupid things forever
    ▪ Fix: decay ε towards 0
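A minimal sketch of ε-greedy action selection plus a decay schedule; the particular 1/(1 + decay·t) schedule is an assumption (any schedule that drives ε to 0 slowly enough works).

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon act randomly, otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def decayed_epsilon(t, epsilon0=1.0, decay=0.01):
    """Decay epsilon towards 0 over time so the agent eventually stops acting randomly."""
    return epsilon0 / (1.0 + decay * t)
```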
Demo Q-learning – Epsilon-Greedy – Crawler
Method 2: Optimistic Exploration Functions
▪ Exploration functions implement this tradeoff
  ▪ An exploration function takes a value estimate u and a visit count n, and
    returns an optimistic utility, e.g., f(u,n) = u + k/n

▪ Regular Q-update:
  ▪ Q(s,a) ← (1-α) · Q(s,a) + α · [ R(s,a,s') + γ max_a' Q(s',a') ]
▪ Modified Q-update:
  ▪ Q(s,a) ← (1-α) · Q(s,a) + α · [ R(s,a,s') + γ max_a' f(Q(s',a'), n(s',a')) ]
▪ Note: this propagates the "bonus" back to states that lead to
  unknown states as well!
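A minimal sketch of the modified Q-update with an exploration function; using k/(n+1) instead of k/n to avoid dividing by zero for unvisited pairs is my own small tweak, and the visit-count bookkeeping (the N dictionary) is an assumption.

```python
def exploration_f(u, n, k=1.0):
    """Optimistic utility f(u, n): boost the value of rarely visited (s, a) pairs."""
    return u + k / (n + 1)

def q_update_with_exploration(Q, N, s, a, s_next, r, next_actions,
                              alpha=0.5, gamma=1.0, k=1.0):
    """Q-update that backs up optimistic values of the successor's actions."""
    N[(s, a)] = N.get((s, a), 0) + 1
    optimistic = max(exploration_f(Q.get((s_next, a2), 0.0), N.get((s_next, a2), 0), k)
                     for a2 in next_actions)
    q_target = r + gamma * optimistic
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * q_target
```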
Demo Q-learning – Exploration Function – Crawler
Approximate Q-Learning
Generalizing Across States
▪ Basic Q-Learning keeps a table of all Q-values

▪ In realistic situations, we cannot possibly learn


about every single state!
▪ Too many states to visit them all in training
▪ Too many states to hold the Q-tables in memory

▪ Instead, we want to generalize:


▪ Learn about some small number of training states from
experience
▪ Generalize that experience to new, similar situations
▪ Can we apply some machine learning tools to do this?

[demo – RL pacman]
Example: Pacman
Let's say we discover through experience that this state is bad:
In naïve q-learning, we know nothing about this state:
Or even this one!
[Three Pacman screenshots]
Demo Q-Learning Pacman – Tiny – Watch All
Demo Q-Learning Pacman – Tiny – Silent Train
Demo Q-Learning Pacman – Tricky – Watch All
Feature-Based Representations
▪ Solution: describe a state using a vector of
features
▪ Features are functions from states to real
numbers (often 0/1) that capture important
properties of the state
▪ Example features:
  ▪ Distance to closest ghost (f_GST)
  ▪ Distance to closest dot
  ▪ Number of ghosts
  ▪ 1 / (distance to closest dot) (f_DOT)
  ▪ Is Pacman in a tunnel? (0/1)
  ▪ … etc.
▪ Can also describe a q-state (s, a) with features
(e.g., action moves closer to food)
Linear Value Functions

▪ We can express V and Q (approximately) as weighted linear


functions of feature values:
▪ V_w(s) = w_1 f_1(s) + w_2 f_2(s) + … + w_n f_n(s)
▪ Q_w(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + … + w_n f_n(s,a)

▪ Advantage: our experience is summed up in a few powerful numbers


▪ Can compress a value function for chess (about 10^43 states) down to about 30 weights!

▪ Disadvantage: states may share features but have very different expected utility!
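A minimal sketch of a linear Q-function over features; the feature names mirror the Q-Pacman example that follows and are purely illustrative.

```python
def linear_q(weights, features):
    """Q_w(s,a) = w1*f1(s,a) + ... + wn*fn(s,a), with features given as a dict."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

weights = {"f_DOT": 4.0, "f_GST": -1.0}      # as in the Q-Pacman example below
features = {"f_DOT": 0.5, "f_GST": 1.0}      # feature values for one q-state
print(linear_q(weights, features))           # 4.0*0.5 - 1.0*1.0 = 1.0
```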
Updating a linear value function
▪ Original Q-learning rule tries to reduce the prediction error at (s,a):
  ▪ Q(s,a) ← Q(s,a) + α · [ R(s,a,s') + γ max_a' Q(s',a') - Q(s,a) ]
▪ Instead, we update the weights to try to reduce the error at (s,a):
  ▪ w_i ← w_i + α · [ R(s,a,s') + γ max_a' Q(s',a') - Q(s,a) ] · ∂Q_w(s,a)/∂w_i
       = w_i + α · [ R(s,a,s') + γ max_a' Q(s',a') - Q(s,a) ] · f_i(s,a)
▪ Intuitive interpretation:
▪ Adjust weights of active features
▪ If something bad happens, blame the features we saw; decrease value of
states with those features. If something good happens, increase value!
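A minimal sketch of the weight update above; the feature_fn(s, a) interface returning a dict of feature values is an assumption made for illustration.

```python
def approx_q_update(weights, feature_fn, s, a, r, s_next, next_actions,
                    alpha=0.01, gamma=1.0):
    """Adjust the weights of the active features to reduce the error at (s, a)."""
    def q(state, action):
        return sum(weights.get(f, 0.0) * v for f, v in feature_fn(state, action).items())

    future = max((q(s_next, a2) for a2 in next_actions), default=0.0)
    error = (r + gamma * future) - q(s, a)       # the "difference" on the next slide
    for f, v in feature_fn(s, a).items():
        weights[f] = weights.get(f, 0.0) + alpha * error * v
```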
Example: Q-Pacman
Q(s,a) = 4.0 f_DOT(s,a) - 1.0 f_GST(s,a)

Transition observed: a = NORTH, r = -500
  f_DOT(s, NORTH) = 0.5
  f_GST(s, NORTH) = 1.0

Q(s, NORTH) = +1        Q(s', ·) = 0
r + γ max_a' Q(s',a') = -500 + 0
difference = -501

w_DOT ← 4.0 + α · [-501] · 0.5
w_GST ← -1.0 + α · [-501] · 1.0

Q(s,a) = 3.0 f_DOT(s,a) - 3.0 f_GST(s,a)


Demo Approximate Q-Learning -- Pacman
Approaches to reinforcement learning
1. Model-based: Learn the model, solve it, execute the solution
2. Learn values from experiences, use to make decisions
a. Direct evaluation
b. Temporal difference learning
c. Q-learning
3. Optimize the policy directly
Policy Search
Policy Search
▪ Problem: often the feature-based policies that work well (win games, maximize
utilities) aren’t the ones that approximate V / Q best
▪ E.g. your value functions from project 2 were probably horrible estimates of future rewards, but they
still produced good decisions
▪ Q-learning’s priority: get Q-values close (modeling)
▪ Action selection priority: get ordering of Q-values right (prediction)

▪ Solution: learn policies that maximize rewards, not the values that predict them

▪ Policy search: start with an ok solution (e.g. Q-learning) then fine-tune by hill climbing
(or gradient ascent!) on feature weights
Policy Search
▪ Simplest policy search:
▪ Start with an initial linear value function or Q-function
▪ Nudge each feature weight up and down and see if your policy is better than before

▪ Pros:
▪ Works well for partial observability / stochastic policies

▪ Cons:
▪ How do we tell the policy got better?
▪ Need to run many sample episodes!
▪ If there are a lot of features, this can be impractical
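A minimal sketch of the "nudge each weight" idea; evaluate_policy is assumed to run a batch of sample episodes with the induced policy and return the average reward, which is exactly the expensive step noted in the cons above.

```python
def hill_climb_policy_search(weights, evaluate_policy, step=0.1, iterations=100):
    """Greedy hill climbing on feature weights: keep a nudge only if the policy improves."""
    best_score = evaluate_policy(weights)
    for _ in range(iterations):
        for name in list(weights):
            for delta in (+step, -step):
                candidate = dict(weights)
                candidate[name] += delta
                score = evaluate_policy(candidate)   # requires many sample episodes
                if score > best_score:
                    weights, best_score = candidate, score
    return weights
```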
Policy Search

[Andrew Ng] [Video: HELICOPTER]


Summary
▪ RL solves MDPs via direct experience of transitions and rewards
▪ There are several approaches:
▪ Learn the MDP model and solve it
▪ Learn V directly from sums of rewards, or by TD local adjustments
▪ Still need a model to make decisions by lookahead
▪ Learn Q by local Q-learning adjustments, use it directly to pick actions
▪ Optimize the policy directly
▪ Scaling up with feature representations and approximation
