16 - Reinforcement Learning and Bandits
Rui Zhang
Fall 2024
What types of ML are there?
Outline
● Introduction to Reinforcement learning
● Multi-armed Bandits
● Markov Decision Processes (MDP)
○ Dynamic Programming when we know the world
Note: All of these lectures use tabular methods; we will only briefly discuss the motivation for function approximation methods (e.g., DQN, policy gradient, deep reinforcement learning).
What is reinforcement learning?
How to build agents that learn behaviors in a dynamic world?
● Agent-oriented learning
● learning by interacting with an environment to achieve a goal
● more natural, realistic, and ambitious than other kinds of machine learning
https://www.samyzaf.com/ML/rl/qmaze.html
TD-Gammon
AlphaGo: Monte Carlo Tree Search, learning policy and value function networks for pruning the search tree, trained from expert demonstrations, self-play, and Tensor Processing Units
The RL interface between Agent and Environment
The agent is in a state, takes an action, gets some reward for the (state, action) pair, and goes to a new state!
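A minimal sketch of this interaction loop in Python, using a toy corridor environment invented here for illustration (not from the slides):

import random

# A toy 1-D corridor: states 0..4, start at state 0, goal at state 4.
# Illustrates the loop: observe state -> take action -> receive reward and next state.
class Corridor:
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):                  # action: -1 (left) or +1 (right)
        self.state = min(max(self.state + action, 0), 4)
        done = (self.state == 4)
        reward = 1.0 if done else 0.0        # reward only upon reaching the goal
        return self.state, reward, done

env = Corridor()
state = env.reset()
done = False
while not done:
    action = random.choice([-1, +1])         # a random (non-learning) policy
    state, reward, done = env.step(action)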
RL Terms
A set of States
● These are the possible positions of our mouse within the maze.
Policy
● A mapping from states to actions aiming to maximize cumulative reward (how to map situations to actions).
Notations
We will use the following notation for known and unknown variables:
Dynamics
State transition probability: may or may not be known; could be deterministic or random.
Distribution over rewards: may or may not be known; could be deterministic or random.
The goal is to learn a policy, a function whose input is a state and whose output is an action (possibly randomized).
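In symbols (standard MDP notation, assumed here as a convention rather than copied from the slide):

% Transition dynamics: probability of reaching s' after taking action a in state s
p(s' \mid s, a) = \Pr\{ S_{t+1} = s' \mid S_t = s, A_t = a \}

% Reward model: expected reward for taking action a in state s
r(s, a) = \mathbb{E}[ R_{t+1} \mid S_t = s, A_t = a ]

% Policy: a (possibly stochastic) mapping from states to actions
\pi(a \mid s) = \Pr\{ A_t = a \mid S_t = s \}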
Outline
● Introduction to Reinforcement learning
● Multi-armed Bandits
○ Formulation
○ Regret
○ Action-Value Methods
○ ε-greedy action selection
○ UCB action selection
○ Gradient bandits
● Markov Decision Processes (MDP)
● Learning in MDP: When we don't know the world
Multi-armed Bandits
Formulation: k-armed bandit problem
On each of an infinite sequence of time steps t = 1, 2, 3, ..., you choose an action A_t from k possibilities, and receive a real-valued reward R_t.
The true reward distributions are unknown. Nevertheless, you must maximize your total reward (equivalently, minimize your total regret).
You must both try actions to learn their values (explore), and prefer those that appear best (exploit).
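The quantities that action-value methods estimate, in the standard notation (assumed, following Sutton and Barto):

% True value of an action: expected reward given that a is selected
q_*(a) = \mathbb{E}[ R_t \mid A_t = a ]

% Sample-average estimate of q_*(a) before time t
Q_t(a) = \frac{\text{sum of rewards when } a \text{ was taken prior to } t}{\text{number of times } a \text{ was taken prior to } t}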
Regret
Goal: minimize the REGRET:
● Low regret means that we do not lose much from not knowing future events.
● We can perform almost as well as someone who observes the entire
sequence and picks the best prediction strategy in hindsight
● We cannot compute regret (because this requires knowing the best arm), but
we use it to analyze our algorithm
● We can also compete with a changing environment
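One standard way to write the expected regret after T steps (this particular form is the usual stochastic-bandit definition, not recovered from the slide):

% Expected regret: shortfall relative to always pulling the best arm
\mathrm{Regret}(T) = T \max_a q_*(a) \;-\; \mathbb{E}\!\left[ \sum_{t=1}^{T} R_t \right]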
Examples
If the rewards are deterministic and known:
● Policy: pull the arm with the highest reward (exploitation)
The exploration/exploitation dilemma
Suppose you form action-value estimates Q_t(a) ≈ q_*(a). Acting greedily with respect to Q_t exploits your current knowledge; choosing any other action explores and may improve the estimates. You cannot do both on a single step, yet you need to do both.
You can never stop exploring, but maybe you should explore less with time. Or maybe not.
ε-greedy action selection
In greedy action selection, you always exploit.
In ε-greedy, you are usually greedy, but with probability ε you instead pick an action at random (possibly the greedy action again).
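A minimal Python sketch of ε-greedy selection over the current estimates Q; the variable names are illustrative, not taken from the slides:

import random

def epsilon_greedy_action(Q, epsilon):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    k = len(Q)
    if random.random() < epsilon:
        return random.randrange(k)                                  # explore (may re-pick the greedy action)
    best = max(Q)
    return random.choice([a for a in range(k) if Q[a] == best])     # greedy, ties broken randomly

# Example: current estimates for a 4-armed bandit
Q = [0.2, 1.5, 0.9, 1.5]
action = epsilon_greedy_action(Q, epsilon=0.1)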
Incremental Implementation
To simplify notation, let us focus on one action.
● We consider only its rewards R_1, R_2, ..., R_n, and its estimate Q_{n+1} after n rewards:
Q_{n+1} = (R_1 + R_2 + ... + R_n) / n = Q_n + (1/n) [ R_n − Q_n ]
From Averaging to Learning Rule
● The incremental average update is an instance of a general learning rule:
NewEstimate ← OldEstimate + StepSize [ Target − OldEstimate ]
● With a constant step size α, the action-value update becomes Q_{n+1} = Q_n + α [ R_n − Q_n ].
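A sketch of the incremental update in Python (both the sample-average and constant-step-size forms); names are illustrative:

def update_sample_average(Q, N, action, reward):
    """Q_{n+1} = Q_n + (1/n) (R_n - Q_n): exact running average."""
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]

def update_constant_step(Q, action, reward, alpha=0.1):
    """Q_{n+1} = Q_n + alpha (R_n - Q_n): recency-weighted, tracks nonstationary rewards."""
    Q[action] += alpha * (reward - Q[action])

Q = [0.0] * 4      # value estimates for a 4-armed bandit
N = [0] * 4        # pull counts
update_sample_average(Q, N, action=2, reward=1.0)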
Optimistic Initial Values to Encourage Exploration
All methods so far depend on Q_1(a), the initial action-value estimates.
So far we have used Q_1(a) = 0. Setting the initial estimates optimistically high (e.g., Q_1(a) = +5 when rewards are much smaller) makes every action look worth trying, so even a greedy learner explores early on.
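A small simulation sketch of the effect: with optimistic initial values, even purely greedy selection tries every arm (the +5 initial value and the Gaussian arms are illustrative assumptions):

import random

k = 5
true_means = [0.1, 0.3, 0.5, 0.7, 0.9]     # unknown to the learner
Q = [5.0] * k                              # optimistic initial estimates
N = [0] * k

for t in range(100):
    a = Q.index(max(Q))                    # purely greedy selection
    r = random.gauss(true_means[a], 1.0)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]              # each pull drags the inflated estimate down
# Early on every arm looks best in turn, so the greedy learner samples all of them.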
Upper Confidence Bound (UCB) action selection
● ε-greedy action selection forces the non-greedy actions to be tried, but
indiscriminately, with no preference for those that are nearly greedy or
particularly uncertain.
● It would be better to select among the non-greedy actions according to their
potential for actually being optimal, taking into account both how close their
estimates are to being maximal and the uncertainties in those estimates.
● One effective way of doing this is to select actions according to the upper confidence bound:
A_t = argmax_a [ Q_t(a) + c * sqrt( ln t / N_t(a) ) ]
where N_t(a) is the number of times action a has been selected so far and c > 0 controls the degree of exploration.
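A Python sketch of UCB selection under these assumptions (untried actions are taken first, which matches treating N_t(a) = 0 as maximally uncertain):

import math

def ucb_action(Q, N, t, c=2.0):
    """Pick argmax_a Q[a] + c * sqrt(ln t / N[a]); untried actions go first."""
    for a in range(len(Q)):
        if N[a] == 0:
            return a
    scores = [Q[a] + c * math.sqrt(math.log(t) / N[a]) for a in range(len(Q))]
    return scores.index(max(scores))

# Example usage with running estimates Q and counts N at time step t
Q = [0.4, 0.9, 0.6]
N = [10, 3, 5]
a = ucb_action(Q, N, t=18)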
Standard stochastic approximation convergence condition
Sometimes it is convenient to vary the step-size parameter from step to step.
α_n(a): the step-size parameter used to process the reward received after the n-th selection of action a.
If α_n(a) = 1/n, this results in the sample average, which converges to the true action value by the law of large numbers.
But of course, convergence is not guaranteed for all choices of α_n(a).
A well-known result in stochastic approximation theory gives the conditions required to assure convergence with probability 1:
sum_{n=1}^{∞} α_n(a) = ∞    and    sum_{n=1}^{∞} α_n(a)^2 < ∞
The first condition is required to guarantee that the steps are large enough to eventually overcome any initial conditions or random fluctuations. The second condition guarantees that eventually the steps become small enough to assure convergence.
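A small sketch contrasting the two common schedules; note that a constant step size violates the second condition, so the estimates never fully converge, which is exactly what makes it useful for nonstationary problems:

def step_size_sample_average(n):
    """alpha_n = 1/n: satisfies both conditions, so the estimate converges."""
    return 1.0 / n

def step_size_constant(n, alpha=0.1):
    """alpha_n = alpha: the squared sum diverges, so the estimate keeps tracking recent rewards."""
    return alpha

def update(Q, reward, alpha):
    # Generic stochastic-approximation update: Q <- Q + alpha * (target - Q)
    return Q + alpha * (reward - Q)

Q = 0.0
for n, r in enumerate([1.0, 0.0, 1.0, 1.0], start=1):
    Q = update(Q, r, step_size_sample_average(n))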
Gradient Bandit Algorithm
Maintain a preference H_t(a) for each action and select actions with a softmax: π_t(a) = e^{H_t(a)} / Σ_b e^{H_t(b)}. After selecting A_t and receiving R_t, update the preferences relative to a baseline R̄_t (the average of the rewards received so far):
H_{t+1}(A_t) = H_t(A_t) + α (R_t − R̄_t) (1 − π_t(A_t))
H_{t+1}(a) = H_t(a) − α (R_t − R̄_t) π_t(a)    for all a ≠ A_t
If the reward is higher than the baseline, then the probability of taking A_t in the future is increased; if the reward is below the baseline, then the probability is decreased. The non-selected actions move in the opposite direction.
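A Python sketch of the gradient-bandit update with an average-reward baseline; the 3-armed Gaussian bandit at the bottom is an illustrative assumption:

import math
import random

def gradient_bandit_step(H, baseline, t, reward_fn, alpha=0.1):
    """One step: softmax over preferences H, sample an action, update H and the baseline."""
    exps = [math.exp(h) for h in H]
    total = sum(exps)
    pi = [e / total for e in exps]                    # softmax action probabilities

    a = random.choices(range(len(H)), weights=pi)[0]  # sample A_t ~ pi
    r = reward_fn(a)

    for b in range(len(H)):                           # selected action moves with (r - baseline),
        if b == a:                                    # the others move in the opposite direction
            H[b] += alpha * (r - baseline) * (1 - pi[b])
        else:
            H[b] -= alpha * (r - baseline) * pi[b]

    baseline += (r - baseline) / t                    # running average of rewards
    return H, baseline

H, baseline = [0.0, 0.0, 0.0], 0.0
for t in range(1, 501):
    H, baseline = gradient_bandit_step(
        H, baseline, t, lambda a: random.gauss([0.1, 0.5, 0.9][a], 1.0))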
Summary comparison of bandit algorithms
Derivation of gradient-bandit algorithm
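For reference, the standard derivation (following Sutton and Barto, Section 2.8) shows that the preference update given earlier is stochastic gradient ascent on the expected reward; a sketch in the notation used above:

% Expected reward as a function of the preferences
\mathbb{E}[R_t] = \sum_x \pi_t(x)\, q_*(x)

% Gradient with respect to one preference; a baseline B_t can be subtracted
% because \sum_x \partial \pi_t(x) / \partial H_t(a) = 0
\frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)}
  = \sum_x \big( q_*(x) - B_t \big) \frac{\partial \pi_t(x)}{\partial H_t(a)}
  = \mathbb{E}\!\left[ \big( q_*(A_t) - B_t \big)
      \frac{\partial \pi_t(A_t) / \partial H_t(a)}{\pi_t(A_t)} \right]

% With \partial \pi_t(x) / \partial H_t(a) = \pi_t(x) \big( \mathbb{1}_{a=x} - \pi_t(a) \big),
% and replacing q_*(A_t) by the sample R_t and B_t by \bar{R}_t:
\frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)}
  = \mathbb{E}\!\left[ \big( R_t - \bar{R}_t \big) \big( \mathbb{1}_{a=A_t} - \pi_t(a) \big) \right]

% Ascending this gradient with step size \alpha recovers the update rule:
H_{t+1}(a) = H_t(a) + \alpha \big( R_t - \bar{R}_t \big) \big( \mathbb{1}_{a=A_t} - \pi_t(a) \big)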