An Introduction To Reinforcement Learning
Shivaram Kalyanakrishnan
[email protected]
August 2014
What is Reinforcement Learning?
1. https://www.youtube.com/watch?v=Qv43pKlVZXk
[Figure: fields and researchers behind reinforcement learning — Operations Research and Control Theory (Dynamic Programming); R. E. Bellman, D. P. Bertsekas, B. F. Skinner, W. Schultz, R. S. Sutton.]
RL Competition: http://www.rl-competition.org/.
[Figure: the agent-environment interaction loop; the agent acts according to a policy π: S → A and receives reward r_{t+1} at each step.]
S: set of states.
A: set of actions.
T: transition function. ∀s ∈ S, ∀a ∈ A, T(s, a) is a distribution over S.
R: reward function. ∀s, s′ ∈ S, ∀a ∈ A, R(s, a, s′) is a finite real number.
γ: discount factor. 0 ≤ γ < 1.
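The slide leaves the objective implicit; for completeness (standard, not taken from the slide), a policy π is evaluated by its expected discounted return V^π(s) = E[r_{t+1} + γ r_{t+2} + γ² r_{t+3} + · · · | s_t = s], and an optimal policy π∗ attains the maximum of V^π(s) at every state s. Since rewards are finite and 0 ≤ γ < 1, this sum is always finite.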
[Video3 of Tetris]
1. http://www.chess-game-strategies.com/images/kqa_chessboard_large-picture_2d.gif
2. http://www.aviationspectator.com/files/images/SH-3-Sea-King-helicopter-191.preview.jpg
3. https://www.youtube.com/watch?v=khHZyghXseE
[Figure: an example MDP with four states; each arrow is marked with "transition probability, reward".]
States: s1, s2, s3, and s4.
Actions: red (solid lines) and blue (dotted lines).
Transitions: The red action leads to the same state with 20% chance, and to the next-clockwise state with 80% chance. The blue action leads to the next-clockwise state or the 2-removed-clockwise state with equal (50%) probability.
Rewards: R(∗, ∗, s1) = 0, R(∗, ∗, s2) = 1, R(∗, ∗, s3) = −1, R(∗, ∗, s4) = 2.
Discount factor: γ = 0.9.
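For concreteness, this example MDP can be written down directly. The Python sketch below is mine, not from the talk; it assumes the states are arranged clockwise in the order s1, s2, s3, s4, and it represents T(s, a) as a dictionary over next states.

# Example MDP from the slide: four states, two actions (red, blue), gamma = 0.9.
S = ["s1", "s2", "s3", "s4"]
A = ["red", "blue"]
gamma = 0.9

def clockwise(i, k=1):
    # Index of the state k steps clockwise from state index i (assumed order s1 -> s2 -> s3 -> s4 -> s1).
    return (i + k) % len(S)

# T[(s, a)] is the distribution T(s, a) over next states.
T = {}
for i, s in enumerate(S):
    T[(s, "red")] = {s: 0.2, S[clockwise(i)]: 0.8}
    T[(s, "blue")] = {S[clockwise(i)]: 0.5, S[clockwise(i, 2)]: 0.5}

# Rewards depend only on the state entered: R(*, *, s1) = 0, and so on.
reward_on_entry = {"s1": 0.0, "s2": 1.0, "s3": -1.0, "s4": 2.0}

def R(s, a, s_next):
    return reward_on_entry[s_next]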
The variables in Bellman’s equation are the V^π(s): |S| linear equations in |S| unknowns.
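Bellman’s equation is not written out on the slide; its standard form for a fixed policy π is V^π(s) = Σ_{s′ ∈ S} T(s, π(s), s′) [R(s, π(s), s′) + γ V^π(s′)] for every s ∈ S. A minimal sketch of solving these |S| equations with numpy, reusing the S, T, R, gamma objects assumed in the earlier sketch:

import numpy as np

def evaluate_policy(policy, S, T, R, gamma):
    # Solve the |S| x |S| linear system (I - gamma * P_pi) v = r_pi for V^pi.
    n = len(S)
    index = {s: i for i, s in enumerate(S)}
    P = np.zeros((n, n))   # P[i, j] = T(s_i, pi(s_i), s_j)
    r = np.zeros(n)        # r[i] = expected one-step reward from s_i under pi
    for s in S:
        i, a = index[s], policy[s]
        for s_next, p in T[(s, a)].items():
            P[i, index[s_next]] += p
            r[i] += p * R(s, a, s_next)
    v = np.linalg.solve(np.eye(n) - gamma * P, r)
    return {s: v[index[s]] for s in S}

# Example: the policy that always takes the red action.
# print(evaluate_policy({s: "red" for s in S}, S, T, R, gamma))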
Planning problem:
Given S, A, T, R, γ, how can we find an optimal policy π∗? We need to be computationally efficient.
Learning problem:
Given S, A, γ, and the facility to follow a trajectory by sampling from T and R, how can we find an optimal policy π∗? We need to be sample-efficient.
Other ways: value iteration and its various “mixtures” with policy iteration.
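A bare-bones value-iteration sketch (again my own illustration, under the same S, A, T, R, gamma conventions as the earlier sketches):

def value_iteration(S, A, T, R, gamma, tol=1e-8):
    # Repeatedly apply V(s) <- max_a sum_s' T(s, a, s') [R(s, a, s') + gamma V(s')].
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            best = max(sum(p * (R(s, a, s_next) + gamma * V[s_next])
                           for s_next, p in T[(s, a)].items())
                       for a in A)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # A greedy policy with respect to the converged values.
    pi = {s: max(A, key=lambda a: sum(p * (R(s, a, s_next) + gamma * V[s_next])
                                      for s_next, p in T[(s, a)].items()))
          for s in S}
    return V, pi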
Learning
Given S, A, γ, and the facility to follow a trajectory by sampling from T and R, how can we find an optimal policy π∗?
Q-Learning
Christopher J. C. H. Watkins and Peter Dayan. Machine Learning, 1992.
1. http://www.youtube.com/watch?v=mRpX9DFCdwI
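The heart of Q-learning is the tabular update Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)], where α is a learning rate. The sketch below is my own illustration, assuming a generic episodic environment with env.reset() → state and env.step(action) → (next state, reward, done), and ε-greedy exploration:

import random
from collections import defaultdict

def q_learning(env, A, gamma, alpha=0.1, epsilon=0.1, episodes=1000):
    Q = defaultdict(float)   # Q[(s, a)], initialised to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                a = random.choice(A)
            else:
                a = max(A, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            # Bootstrap from the greedy value of the next state.
            target = r if done else r + gamma * max(Q[(s_next, a_)] for a_ in A)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q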
Exploration
Generalisation (over states and actions)
State aliasing (partial observability)
Multiple agents, nonstationary rewards and transitions
Abstraction (over states and over time)
Proofs of convergence, sample-complexity bounds
My thesis question:
Thank you!
Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore, 1996. Reinforcement
Learning: A Survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
Jette Randløv and Preben Alstrøm, 1998. Learning to Drive a Bicycle using Reinforcement
Learning and Shaping. In Proceedings of the Fifteenth International Conference on Machine
Learning (ICML 1998), pp. 463–471, Morgan Kaufmann, 1998.
Pieter Abbeel, Adam Coates, Morgan Quigley, and Andrew Y. Ng, 2006. An Application of
Reinforcement Learning to Aerobatic Helicopter Flight. In Advances in Neural Information
Processing Systems 19, pp. 1–8, MIT Press, 2006.
Todd Hester, Michael Quinlan, and Peter Stone, 2010. Generalized Model Learning for
Reinforcement Learning on a Humanoid Robot. In Proceedings of the IEEE International
Conference on Robotics and Automation (ICRA 2010), pp. 2369–2374, IEEE, 2010.
Csaba Szepesvári, 2010. Algorithms for Reinforcement Learning. Morgan & Claypool, 2010.
Shivaram Kalyanakrishnan, 2011. Learning Methods for Sequential Decision Making with
Imperfect Representations. Ph.D. dissertation, published as UT Austin Computer Science
Technical Report TR-11-41, 2011.