Reinforcement Learning I
Reinforcement Learning
Video: AIBO - Initial, AIBO - Training, AIBO - Finished, SNAKE, Toddler, Crawler
Offline (MDPs) vs. Online (RL)
Model-Based Learning
Model-Based Idea:
• Learn an approximate model based on experiences
• Solve for values as if the learned model were correct
Step 1: Learn empirical MDP model
• Count outcomes s′ for each s, a
• Normalize to give an estimate of T̂(s, a, s′)
• Discover each R̂(s, a, s′) when we experience (s, a, s′)
Step 2: Solve the learned MDP
• For example, use value iteration, as before
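A minimal sketch of these two steps in Python (the function names, data layout, and transition format are illustrative assumptions, not from the slides): count outcomes to estimate T̂ and R̂, then run value iteration on the learned model as if it were correct.

```python
from collections import defaultdict

def learn_model(transitions):
    """Step 1: build empirical estimates T_hat, R_hat from observed (s, a, s', r) tuples."""
    counts = defaultdict(int)         # (s, a, s') -> number of times observed
    action_totals = defaultdict(int)  # (s, a)     -> total outcomes observed
    reward_sums = defaultdict(float)  # (s, a, s') -> summed observed reward
    for s, a, s2, r in transitions:
        counts[(s, a, s2)] += 1
        action_totals[(s, a)] += 1
        reward_sums[(s, a, s2)] += r
    # Normalize counts to get T_hat; average observed rewards to get R_hat.
    T_hat = {k: c / action_totals[(k[0], k[1])] for k, c in counts.items()}
    R_hat = {k: reward_sums[k] / counts[k] for k in counts}
    return T_hat, R_hat

def value_iteration(states, actions, T_hat, R_hat, gamma=0.9, iterations=100):
    """Step 2: solve the learned MDP with standard value iteration."""
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        V = {s: max((sum(T_hat.get((s, a, s2), 0.0) *
                         (R_hat.get((s, a, s2), 0.0) + gamma * V[s2])
                         for s2 in states)
                     for a in actions),
                    default=0.0)
             for s in states}
    return V
```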
Example: Model-Based Learning
Example: Expected Age
Model-Free Learning
Passive Reinforcement Learning
Simplified task: policy evaluation
• Input: a fixed policy π(s)
• You don’t know the transitions T(s, a, s′)
• You don’t know the rewards R(s, a, s′)
• Goal: learn the state values
In this case:
• Learner is “along for the ride”
• No choice about what actions to take
• Just execute the policy and learn from experience
• This is NOT offline planning! You actually take actions in the world
Direct Evaluation
Example: Direct Evaluation
Pros and Cons of Direct Evaluation
Pros
• It’s easy to understand
• It doesn’t require any knowledge of T, R
• It eventually computes the correct average values, using just sample transitions
Cons
• It wastes information about state connections
• Each state must be learned separately → takes a long time to learn
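As a concrete (if simplified) reading of direct evaluation, the sketch below averages the observed discounted returns for each state independently over complete episodes; the trajectory format and every-visit averaging are assumptions for illustration, not details from the slides.

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=0.9):
    """episodes: list of trajectories under pi; each step is (state, reward),
    where the reward is the one received on the transition out of that state."""
    totals = defaultdict(float)
    visits = defaultdict(int)
    for trajectory in episodes:
        G = 0.0
        # Walk backwards so the discounted return from each step accumulates in one pass.
        for s, r in reversed(trajectory):
            G = r + gamma * G
            totals[s] += G
            visits[s] += 1
    # Each state's value is just the average return observed from it,
    # with no use of the connections between states.
    return {s: totals[s] / visits[s] for s in totals}
```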
Why Not Use Policy Evaluation?
Simplified Bellman updates calculate V for a fixed policy:
• Each round, replace V with a one-step look-ahead layer over V

V_0^π(s) = 0
V_{k+1}^π(s) ← ∑_{s′} T(s, π(s), s′) [ R(s, π(s), s′) + γ V_k^π(s′) ]

• This approach fully exploited the connections between the states
• Unfortunately, we need T and R to do it!
Key question: how can we do this update to V without knowing T and R?
• In other words, how do we take a weighted average without knowing the weights?
Sample-Based Policy Evaluation
We want to improve our estimate of V by computing these averages:

V_{k+1}^π(s) ← ∑_{s′} T(s, π(s), s′) [ R(s, π(s), s′) + γ V_k^π(s′) ]

Idea: Take samples of outcomes s′ (by doing the action!) and average
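Concretely, if acting from s under π produces observed outcomes s′_1, …, s′_n, each gives a one-step estimate, and the update becomes their average:

sample_i = R(s, π(s), s′_i) + γ V_k^π(s′_i)
V_{k+1}^π(s) ← (1/n) ∑_i sample_i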
Temporal Difference Learning
Big idea: learn from every experience!
• Update V(s) each time we experience a transition (s, a, s′, r)
• Likely outcomes s′ will contribute updates more often
Temporal difference learning of values
• Policy still fixed, still doing evaluation!
• Move values toward value of whatever successor occurs: running average
Sample of V(s):
sample = R(s, π(s), s′) + γ V^π(s′)
Update to V(s):
V^π(s) ← (1 − α) V^π(s) + α · sample
Same update:
V^π(s) ← V^π(s) + α (sample − V^π(s))
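A minimal sketch of this update in Python, assuming a dict V of current value estimates (all names here are illustrative):

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD update from an observed transition (s, pi(s), s_next, r)."""
    sample = r + gamma * V[s_next]              # sample = R(s, pi(s), s') + gamma * V^pi(s')
    V[s] = (1 - alpha) * V[s] + alpha * sample  # running average toward the sample
    # Equivalently: V[s] += alpha * (sample - V[s])
    return V
```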
Exponential Moving Average
The running interpolation update:
x̄_n = (1 − α) · x̄_{n−1} + α · x_n
Forgets about the past (distant past values were wrong anyway)
Decreasing learning rate (α) can give converging averages
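Unrolling the update shows the exponentially decaying weights that make the average forget the distant past:

x̄_n = α·x_n + α(1 − α)·x_{n−1} + α(1 − α)²·x_{n−2} + ⋯ + (1 − α)ⁿ·x̄_0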
Example: Temporal Difference Learning
Problems with TD Value Learning
TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages
However, if we want to turn values into a (new) policy, we’re sunk:

π(s) = argmax_a ∑_{s′} T(s, a, s′) [ R(s, a, s′) + γ V(s′) ]
Active Reinforcement Learning
Full reinforcement learning: optimal policies (like value iteration)
• You don’t know the transitions T(s, a, s′)
• You don’t know the rewards R(s, a, s′)
• You choose the actions now
• Goal: learn the optimal policy/values
In this case:
• Learner makes choices!
• Fundamental tradeoff: exploration vs. exploitation
• This is NOT offline planning! You actually take actions in the world and find out what happens...
Detour: Q-Value Iteration
Value iteration: find successive (depth-limited) values
• Start with V_0(s) = 0, which we know is right
• Given V_k, calculate the depth k + 1 values for all states

V_{k+1}(s) ← max_a ∑_{s′} T(s, a, s′) [ R(s, a, s′) + γ V_k(s′) ]
Q-Learning
Q-Learning: sample-based Q-value iteration

Q_{k+1}(s, a) ← ∑_{s′} T(s, a, s′) [ R(s, a, s′) + γ max_{a′} Q_k(s′, a′) ]
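A minimal sketch of the sample-based version in Python: each experienced transition (s, a, r, s′) stands in for the expectation over s′, and Q(s, a) is nudged toward the sample exactly as in TD learning (the table layout and names are illustrative assumptions):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update from an observed transition (s, a, s_next, r)."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)    # max_a' Q_k(s', a')
    sample = r + gamma * best_next                                 # one-sample estimate of the target
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample  # running average toward the sample
    return Q
```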
Q-Learning Properties
Amazing result: Q-learning converges to the optimal policy, even if you’re acting suboptimally!
Suggested Reading
Russell & Norvig: Chapter 21
Sutton & Barto: Sections 6.1, 6.2, 6.5