Lecture 1
Felipe Maldonado
Department of Mathematical Sciences
University of Essex
email: [email protected]
Key references
• Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction (2nd Edition), MIT
Press, Cambridge, MA, 2018. http://incompleteideas.net/book/RLbook2020.pdf
• Csaba Szepesvári, Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and
Machine Learning 4.1 (2010): 1-103. https://sites.ualberta.ca/~szepesva/papers/RLAlgsInMDPs.pdf
• Martin L. Puterman, Markov Decision Processes, Handbooks in Operations Research and Management
Science 2 (1990): 331-434.
• Wayne L. Winston, Operations Research: Applications and Algorithms, 4th Edition, 2004.
• Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön, Machine Learning - A
First Course for Engineers and Scientists, Cambridge University Press, 2022. http://smlbook.org
Module Information:
Assessment:
• Lab Assignment: 10%
• Project: 20%
• Examination: 70%
Module Outline
• Trial and error and delayed rewards: the most distinguishing features of Reinforcement Learning (RL).
• Formalisation of the methods: capture the most important aspects of the problem facing a learning agent that interacts
over time with its environment to achieve a goal.
• A reinforcement learning agent must be able to sense the environment, take actions that affect its
current state, and have a particular goal.
• In Supervised Learning (SL), learning occurs from a training set of labelled examples provided by
a knowledgeable external supervisor. But in uncharted territory, where one would expect learning to be most
beneficial, an agent must be able to learn from its own experience.
• Unsupervised Learning (UL) is typically about finding structure hidden in collections of
unlabelled data. Uncovering structure in an agent's experience can certainly be useful in RL, but by
itself it does not address the reinforcement learning problem of maximising a reward signal.
• UL could tell you how to identify places where wild animals live, SL could teach you to recognise a lion, and RL
could tell you that you need to run from that place.
• Unlike Supervised and Unsupervised Learning, RL has to deal with the trade-off of exploration and
exploitation.
• To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past
and found to be effective in producing reward. But what if I have not explored the best action yet?
• The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at
the task. The agent must try a variety of actions and progressively favour those that appear to be best.
• Core algorithms for RL were originally inspired by biological learning systems (e.g., genetic algorithms)
Examples
• Chess
• Roomba
• Prepare breakfast
Remarks
• These examples share features that are so basic that they are easy to overlook. All involve interaction
between an active decision-making agent and its environment, within which the agent seeks to achieve a
goal despite uncertainty about its environment.
• Correct choice requires taking into account indirect, delayed consequences of actions, and thus may
require foresight or planning.
• Exploration: Learning about the world by making decisions. But those decisions impact what we learn.
• Decisions: based on a POLICY that takes past experience and recommends an action.
A (reinforcement) learning agent uses a POLICY to make its decisions: it observes states and the environment
(given by a MODEL or by real experience), considers its past rewards, and decides accordingly.
A REWARD SIGNAL defines the goal of a reinforcement learning problem. On each time step, the
environment sends to the reinforcement learning agent a single number called the reward (good and bad
events for the agent).
A VALUE FUNCTION specifies what is good in the long run. The value of a state is the total amount of
reward an agent can expect to accumulate over the future, starting from that state.
Example: https://www.youtube.com/watch?v=qy_mIEnnlF4 Sheldon trains Penny with
positive reinforcement when she does what he thinks is good behaviour.
• Section 2: a special case of the reinforcement learning problem in which there is only a single state, called
bandit problems.
• Section 3: the general problem formulation, finite Markov decision processes, and its main ideas, including
Bellman equations and value functions.
• Sections 4, 5 and 6 describe three fundamental classes of methods for solving RL problems:
Dynamic Programming, Monte Carlo methods, and Temporal-Difference learning.
• Section 7: how the strengths of Monte Carlo methods can be combined with the strengths of temporal-
difference methods via multi-step bootstrapping methods, as well as how to combine them with model
learning and planning methods.
2. Multi-armed Bandits
Notation
q∗(a): the value of an arbitrary action a, defined as q∗(a) = E[Rt | At = a]. We assume that we do not know these values, but
we have estimates of them, Qt(a).
Definition: Assume that for all actions a we have good estimates Qt(a) ≈ q∗(a). An action is called a
Greedy Action if it is chosen as At := argmax_a Qt(a).
Selecting one of the greedy actions means that the agent is exploiting its current knowledge. If it chooses a
non-greedy action, we say that it is exploring. There are sophisticated ways to balance the exploration-
exploitation trade-off, but they rely on strong assumptions that are rarely satisfied in real examples. In this Module
we will focus only on finding a (reasonable) balance.
• Action-value methods are methods for estimating the values of actions (their mean rewards) and for
using those estimates to make action decisions.
• One natural way to estimate this is by averaging the rewards actually received:
Qt(a) := (sum of rewards when a taken prior to t) / (number of times a taken prior to t)
       = ( Σ_{i=1}^{t−1} Ri·1[Ai=a] ) / ( Σ_{i=1}^{t−1} 1[Ai=a] )
As the denominator goes to infinity (when t → ∞), by the Law of Large Numbers, Qt (a) converges to
q ∗ (a). We call this the sample-average method for estimating action values because each estimate is an
average of the sample of relevant rewards.
The simplest action selection rule is to select one of the actions with the highest estimated value: the greedy
action At = argmax_a Qt(a).
Any issue with this approach? .... only exploitation.
One simple alternative is to behave greedily most of the time, but with a small probability ε select
another action at random (independently of the action-value estimates). We call this type of method ε-greedy.
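A minimal sketch of these ideas in Python (not from the slides; the 10-armed Gaussian testbed, the value of ε and the random seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

k = 10                         # number of arms (hypothetical testbed)
q_star = rng.normal(0, 1, k)   # true action values q*(a), unknown to the agent
epsilon = 0.1                  # probability of exploring

Q = np.zeros(k)                # sample-average estimates Qt(a)
N = np.zeros(k, dtype=int)     # number of times each action has been taken

for t in range(1000):
    # epsilon-greedy: explore with probability epsilon, otherwise act greedily
    if rng.random() < epsilon:
        a = int(rng.integers(k))       # exploratory action, chosen uniformly at random
    else:
        a = int(np.argmax(Q))          # greedy action At = argmax_a Qt(a)

    R = rng.normal(q_star[a], 1.0)     # noisy reward for the selected arm
    N[a] += 1
    Q[a] += (R - Q[a]) / N[a]          # incremental form of the sample average

print("greedy arm after learning:", np.argmax(Q), "| true best arm:", np.argmax(q_star))
```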
• Let us consider a single action. Let Ri now denote the reward received after the i − th selection of this
action.
• Let Qn denote the estimate of its action value after it has been selected n − 1 times, which we can now
write simply as
Qn = (R1 + R2 + · · · + Rn−1) / (n − 1)    (2.1)
• Incremental formula for Qn can be written as
Qn+1 = Qn + (1/n) [Rn − Qn]    (2.2)
• Update rules of this form,
NewEstimate ← OldEstimate + StepSize [Target − OldEstimate],
are quite useful since they allow us to update averages with a small, constant amount of computation per step (a small sketch follows below).
• The term [Target − OldEstimate] is the error in the estimate; each update takes a step towards the target, reducing that error.
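As an illustration, a tiny Python sketch of this update pattern (the helper name incremental_update is made up for this example):

```python
def incremental_update(old_estimate, target, step_size):
    # NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)
    return old_estimate + step_size * (target - old_estimate)

# Running average of the rewards 5, 3, 4 using alpha = 1/n, as in (2.2):
Q = 0.0
for n, reward in enumerate([5.0, 3.0, 4.0], start=1):
    Q = incremental_update(Q, reward, 1.0 / n)
print(Q)  # 4.0, identical to the ordinary average (5 + 3 + 4) / 3
```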
Solution Exercise 1:
Exercise 3: Check that (1 − α)^n + Σ_{i=1}^{n} α(1 − α)^{n−i} = 1.
Note that the weight, α(1 − α)^{n−i}, given to the reward Ri depends on how many rewards ago, n − i, it was
observed. The quantity 1 − α ≤ 1, and thus the weight given to Ri decreases as n increases. In fact, the
weight decays exponentially according to the exponent on 1 − α.
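A quick numerical check of this identity (the values of α and n below are arbitrary):

```python
alpha, n = 0.3, 8   # arbitrary illustrative values
weights = [(1 - alpha) ** n] + [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]
print(sum(weights))   # 1.0, up to floating-point error
print(weights[1:])    # the weight on Ri grows as i approaches n: recent rewards count more
```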
Solution Exercise 2:
Solution Exercise 3:
Sometimes it is convenient to vary the step-size parameter from step to step. Let αn(a) denote the step-size
parameter used to process the reward received after the nth selection of action a. As we have noted, the
choice αn(a) = 1/n results in the sample-average method.
The conditions referred to below are the classical stochastic-approximation requirements
Σ_n αn(a) = ∞ and Σ_n αn(a)^2 < ∞.
The first condition is required to guarantee that the steps are large enough to eventually overcome any initial
conditions or random fluctuations. The second condition guarantees that eventually the steps become small
enough to assure convergence.
• A constant α does not satisfy the second condition, and hence there is no convergence: the estimates keep responding to the most recent rewards.
• A variable α does not work so well in nonstationary cases (which are the most common in RL); see the sketch below.
• Variable step sizes are not that common in RL, but there are other settings where they appear
naturally.
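The following sketch (a hypothetical single-action, random-walk example in Python) illustrates these points: the 1/n sample average lags behind a drifting value, while a constant step size keeps tracking it.

```python
import numpy as np

rng = np.random.default_rng(1)

q_true = 0.0                     # true value of a single action, drifting over time
Q_avg, Q_const, n = 0.0, 0.0, 0  # sample-average and constant-step-size estimates
alpha = 0.1

for t in range(10_000):
    q_true += rng.normal(0, 0.01)       # slow random-walk drift: a nonstationary problem
    R = q_true + rng.normal(0, 1.0)     # noisy reward
    n += 1
    Q_avg += (R - Q_avg) / n            # alpha_n = 1/n (sample average)
    Q_const += alpha * (R - Q_const)    # constant alpha (recency-weighted average)

print(f"true value {q_true:.2f} | sample average {Q_avg:.2f} | constant alpha {Q_const:.2f}")
```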
Nt (a) denotes the number of times that action a has been selected prior to time t, and the number c > 0
controls the degree of exploration. If Nt (a) = 0, then a is considered to be a maximising action.
The quantity whose argmax we take is an upper bound on the possible true value of action a, with c
determining the confidence level. UCB can be tricky to implement in more general RL problems, such as those
with large state spaces.
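The UCB selection rule itself is not reproduced above; the sketch below assumes the standard textbook form At = argmax_a [ Qt(a) + c·sqrt(ln t / Nt(a)) ] from Sutton and Barto, treating untried actions (Nt(a) = 0) as maximising, as stated in the text.

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Upper-Confidence-Bound action selection (standard textbook form, for illustration).

    Q : value estimates Qt(a); N : counts Nt(a); t : current step (t >= 1); c : degree of exploration.
    """
    Q = np.asarray(Q, dtype=float)
    N = np.asarray(N, dtype=float)
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:                 # an action with Nt(a) = 0 is considered maximising
        return int(untried[0])
    upper_bounds = Q + c * np.sqrt(np.log(t) / N)
    return int(np.argmax(upper_bounds))

# With equal estimates, the least-tried action receives the largest exploration bonus:
print(ucb_action(Q=[0.5, 0.5, 0.5], N=[10, 2, 30], t=42))   # -> 1
```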
Gradient bandit algorithms learn a numerical preference Ht(a) for each action a and select actions according to a soft-max distribution:
πt(a) = Pr{At = a} := e^{Ht(a)} / Σ_{b=1}^{k} e^{Ht(b)}
The larger the preference, the more often that action is taken, but the preference has no interpretation in terms
of reward.
On each step t, after selecting action At and receiving reward Rt, the preferences are updated according to the
following rule:
Ht+1(a) := Ht(a) + α(Rt − R̂t)(1[At=a] − πt(a))    (2.5)
R̂t serves as a baseline with which the reward is compared. If the reward is higher than the baseline, then the
probability of taking At in the future is increased, and if the reward is below baseline, then the probability is
decreased. The non-selected actions move in the opposite direction.
Remark: the name gradient bandit comes from the fact that (2.5) can be interpreted as the gradient of the
expected reward, and therefore this method is an instance of stochastic gradient ascent. This assures us
that the algorithm has robust convergence properties (we will discuss this further in Part II of the
Module).
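A compact Python sketch of the gradient bandit algorithm described above (the true action values, the step size and the running-average baseline R̂t are illustrative assumptions; the slides do not specify how the baseline is computed):

```python
import numpy as np

rng = np.random.default_rng(2)

q_star = np.array([0.2, 0.5, 1.0, 0.1])   # hypothetical true action values
k = len(q_star)
H = np.zeros(k)                            # preferences Ht(a)
alpha = 0.1
baseline, t = 0.0, 0                       # R_hat_t: running average of observed rewards

for _ in range(2000):
    pi = np.exp(H - H.max())               # soft-max probabilities pi_t(a)
    pi /= pi.sum()                         # (subtracting max(H) only improves numerical stability)
    a = rng.choice(k, p=pi)                # sample At from pi_t
    R = q_star[a] + rng.normal(0, 1.0)     # noisy reward Rt
    t += 1
    baseline += (R - baseline) / t         # update the baseline R_hat_t
    indicator = np.eye(k)[a]               # 1[At = a] for every action a
    H += alpha * (R - baseline) * (indicator - pi)   # update (2.5), applied to all actions at once

print(np.round(pi, 2))   # typically most of the probability mass ends up on the best arm (index 2)
```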
Notes: