
CMPSC 448: Machine Learning

Lecture 16. Reinforcement Learning and Bandits

Rui Zhang
Fall 2024

1
What types of ML are there?

2
Outline
● Introduction to Reinforcement learning
● Multi-armed Bandits
● Markov Decision Processes (MDP)
○ Dynamic Programming when we know the world

● Learning in MDP: When we don't know the world


○ Monte Carlo Methods

○ Temporal-Difference Learning (TD): SARSA and Q-Learning

Note: All of the methods in these lectures are tabular; we will only briefly discuss the
motivation for function approximation methods (e.g., DQN, policy gradient, deep
reinforcement learning)
3
What is reinforcement learning?
How to build agents that learn behaviors in a dynamic world?
● Agent-oriented learning
● learning by interacting with an environment to achieve a goal
● more natural, realistic, and ambitious than other kinds of machine learning

RL is a general-purpose framework for decision-making


● RL is for an agent with the capacity to act
● Each action influences the agent’s future state
● Success is measured by a scalar reward signal
● Goal: select actions to maximize the future reward
● The learner is not told which actions to take, but instead must discover which actions
yield the most reward by trying them.
● The agent has to exploit what it has already experienced in order to obtain reward, but
it also has to explore in order to make better action selections in the future.
4
Maze

5
https://www.samyzaf.com/ML/rl/qmaze.html
TD-Gammon

estimated state value (≈ prob of winning)


Action selection by a shallow search

● Start with a random neural network
● Play millions of games against itself (i.e., self-play)
● Learn a value function from this simulated experience
● Six weeks later it's the best player of backgammon in the world
● Originally used expert handcrafted features; later repeated with raw board positions
6
AlphaGo
In October 2015, AlphaGo became the first computer Go program to beat a human
professional Go player without handicaps on a full-sized 19×19 board.

Monte Carlo Tree Search, learning policy and value function networks for pruning the search
tree, trained from expert demonstrations, self-play, and Tensor Processing Units
7
The RL interface between Agent and Environment
The agent is in a state, takes an action, gets some reward for the (state, action) pair,
and moves to a new state!


8
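To make this interface concrete, here is a minimal sketch of the interaction loop in Python. The environment object and its reset/step methods are hypothetical, gym-style stand-ins (not something defined in these slides); any environment exposing the same interface would work.

# Minimal sketch of the agent-environment loop (env and policy are hypothetical stand-ins).
def run_episode(env, policy, max_steps=1000):
    state = env.reset()                                # agent starts in some initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                         # pick an action in the current state
        next_state, reward, done = env.step(action)    # environment returns reward and next state
        total_reward += reward                         # success is measured by cumulative reward
        state = next_state                             # agent moves to the new state
        if done:
            break
    return total_reward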
RL Terms
A set of States
● These are the possible positions of our mouse within the maze.

A set of Actions available in each state


● This is {forward, back} in a corridor and {forward, back, left, right} at a crossroads.

Transitions between states


● For example, if you go left at a crossroads you end up in a new position. These can be a set of probabilities
that link to more than one possible state (e.g. when you use an attack in a game of Pokémon you can either
miss, inflict some damage, or inflict enough damage to knock out your opponent).

Rewards associated with each transition


● In the robot-mouse example, most of the rewards are 0, but they’re positive if you reach a point that
has water or cheese and negative if you reach a point that has an electric shock.

Policy
● A mapping from states to actions, aiming to maximize cumulative reward (how to map
situations to actions).
9
Notations
We will use the following notation for known and unknown variables:

10
Dynamics
State transition probability. May or may not be known. Could be deterministic or random

11
Dynamics
State transition probability. May or may not be known. Could be deterministic or random

Distribution over rewards. May or may not be known. Could be deterministic or random

12
Dynamics
State transition probability. May or may not be known. Could be deterministic or random

Distribution over rewards. May or may not be known. Could be deterministic or random

The goal is to learn a policy, which is a function whose input is a state and whose output is an
action (possibly randomized)

13
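Written out in the usual notation (assumed here, since the slide's own formulas are not shown), these three objects are:

\[
p(s' \mid s, a) \;=\; \Pr(S_{t+1} = s' \mid S_t = s,\ A_t = a) \qquad \text{(state-transition probability)}
\]
\[
R_{t+1} \sim \mathcal{R}(\cdot \mid S_t = s,\ A_t = a), \qquad r(s,a) \;=\; \mathbb{E}[R_{t+1} \mid S_t = s,\ A_t = a] \qquad \text{(reward distribution and its mean)}
\]
\[
\pi(a \mid s) \;=\; \Pr(A_t = a \mid S_t = s) \qquad \text{(policy, possibly randomized)}
\]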
Outline
● Introduction to Reinforcement learning
● Multi-armed Bandits
○ Formulation
○ Regret
○ Action-Value Methods
○ ε-greedy action selection
○ UCB action selection
○ Gradient bandits
● Markov Decision Processes (MDP)
● Learning in MDP: When we don't know the world

14
Multi-armed Bandits

The simplest reinforcement learning problem

One state (no state transition probabilities)


Actions: k levers (arms), each action is associated with a reward
Policy is to sequentially choose arms to maximize cumulative reward

15
Formulation: k-armed bandit problem
On each of an infinite sequence of time steps, t = 1, 2, 3, ...,
you choose an action A_t from k possibilities, and receive a real-valued reward R_t

The reward depends only on the action taken; it is independently and identically
distributed given the action

The true reward distributions are unknown. Nevertheless, you must maximize total
reward (equivalent to minimizing total regret)

You must both try actions to learn their values (explore), and prefer those that
appear best (exploit)
16
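In the standard notation (assumed here), the value of an arm is its expected reward, and the learner's estimates try to track it:

\[
q_*(a) \;=\; \mathbb{E}[R_t \mid A_t = a], \qquad a \in \{1, \dots, k\},
\]

with the rewards for a fixed arm drawn i.i.d. from an unknown distribution, and the learner maintaining estimates Q_t(a) ≈ q_*(a) from the rewards observed so far.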
Regret
Goal: minimize the REGRET:
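One standard way to write it (a sketch in the notation above, not copied from the slide): compare the reward we collect with always pulling the best arm.

\[
a^* \;=\; \arg\max_a q_*(a), \qquad
L_T \;=\; T\, q_*(a^*) \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} R_t\right]
\]

Sublinear regret means L_T / T → 0: per-step performance approaches that of the best arm.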

● Low regret means that we do not lose much from not knowing future events.
● We can perform almost as well as someone who observes the entire
sequence and picks the best prediction strategy in hindsight
● We cannot compute regret (because this requires knowing the best arm), but
we use it to analyze our algorithm
● We can also compete with a changing environment
17
Examples
If rewards are deterministic and known
● policy: pull arm with highest reward (exploitation)

18
Examples
If rewards are deterministic and known
● policy: pull arm with highest reward (exploitation)

If rewards are deterministic and unknown


● policy: try each arm (exploration), then use best one (exploitation)

19
Examples
If rewards are deterministic and known
● policy: pull arm with highest reward (exploitation)

If rewards are deterministic and unknown


● policy: try each arm (exploration), then use best one (exploitation)

If rewards are random and known


● policy: take action with highest expected reward

20
Examples
If rewards are deterministic and known
● policy: pull arm with highest reward (exploitation)

If rewards are deterministic and unknown


● policy: try each arm (exploration), then use best one (exploitation)

If rewards are random and known


● policy: take action with highest expected reward

If rewards are random and unknown


● policy: explore by trying each arm 10,000 times to estimate the rewards, then
exploit. But here exploration is too long and predetermined.
21
Action-value methods
Methods that learn action-value estimates and nothing else.

For example, estimate action values as sample averages:
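A sketch of the sample-average estimate in the usual notation (assumed here): the value of an arm is estimated by the mean of the rewards received when that arm was pulled.

\[
Q_t(a) \;=\; \frac{\sum_{i=1}^{t-1} R_i \,\mathbb{1}\{A_i = a\}}{\sum_{i=1}^{t-1} \mathbb{1}\{A_i = a\}}
\]

(with some default value, e.g. 0, if the arm has not been tried yet).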

22
The exploration/exploitation dilemma
Suppose you form estimates Q_t(a) ≈ q_*(a) for each action a

Define the greedy action at time t as A_t^* = argmax_a Q_t(a)

If A_t = A_t^*, then you are exploiting

If A_t ≠ A_t^*, then you are exploring

You can't do both, but you need to do both

You can never stop exploring, but maybe you should explore less with time. Or
maybe not.
23
ε-greedy action selection
In greedy action selection, you always exploit

In ε-greedy, you are usually greedy, but with probability ε you instead pick an
action at random (possibly the greedy action again)

This is perhaps the simplest way to balance exploration and exploitation

24
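A minimal sketch of ε-greedy selection with sample-average estimates, in Python (the function name and the pull_arm callback are illustrative assumptions, not part of the slides):

import random

def epsilon_greedy_bandit(pull_arm, k, steps, epsilon=0.1):
    """Run epsilon-greedy on a k-armed bandit; pull_arm(a) is assumed to return a sampled reward."""
    Q = [0.0] * k   # action-value estimates
    N = [0] * k     # pull counts
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(k)                  # explore: random arm
        else:
            a = max(range(k), key=lambda i: Q[i])    # exploit: current greedy arm
        r = pull_arm(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                    # incremental sample-average update
    return Q

With ε = 0 this reduces to pure greedy selection; a larger ε explores more, but keeps spending a fixed fraction of pulls on random arms forever.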
ε-greedy action selection

Exploration is needed because there is always uncertainty about the accuracy of


the action-value estimates.
25
ε-greedy action selection

Exploration is needed because there is always uncertainty about the accuracy of


the action-value estimates.
26
Linear vs sublinear Regret
● If an algorithm forever explores it will have linear total regret
● If an algorithm never explores it will have linear total regret
● Can we have sublinear total regret?

27
Incremental Implementation
To simplify notation, let us focus on one action
● We consider only its rewards R_1, R_2, ..., R_n, and its estimate Q_{n+1} after n rewards

How can we do this incrementally (without storing all the rewards)?

28
From Averaging to Learning Rule
To simplify notation, let us focus on one action
● We consider only its rewards R_1, R_2, ..., R_n, and its estimate Q_{n+1} after n rewards

How can we do this incrementally (without storing all the rewards)?


We can store a running sum and count (and divide), or equivalently:

This is a standard form for learning/update rules we will frequently use

29
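The incremental form referred to above, sketched in standard notation (assumed): the new estimate is the old estimate nudged toward the latest reward.

\[
Q_{n+1} \;=\; \frac{1}{n}\sum_{i=1}^{n} R_i \;=\; Q_n + \frac{1}{n}\bigl(R_n - Q_n\bigr)
\]

\[
\text{NewEstimate} \;\leftarrow\; \text{OldEstimate} \;+\; \text{StepSize}\,\bigl(\text{Target} - \text{OldEstimate}\bigr)
\]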
Optimistic Initial Values to Encourage Exploration
All methods so far depend on the initial action-value estimates Q_1(a)
So far we have used Q_1(a) = 0

Suppose we initialize the action values optimistically, i.e., higher than any reward we expect
to see; this encourages the agent to try all the arms.

30
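A brief illustrative sketch (the constant step size and the specific initial value are assumptions for illustration, not taken from the slides): with optimistically high initial estimates, even purely greedy selection samples every arm early on, because whichever arm has not yet disappointed still looks best.

def optimistic_greedy_bandit(pull_arm, k, steps, optimistic_value=5.0, alpha=0.1):
    """Purely greedy selection, but starting from optimistically high estimates."""
    Q = [optimistic_value] * k                  # deliberately too-high initial guesses
    for _ in range(steps):
        a = max(range(k), key=lambda i: Q[i])   # always greedy; no epsilon needed
        r = pull_arm(a)
        Q[a] += alpha * (r - Q[a])              # estimate drifts toward observed rewards
    return Q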
Upper Confidence Bound (UCB) action selection
● ε-greedy action selection forces the non-greedy actions to be tried, but
indiscriminately, with no preference for those that are nearly greedy or
particularly uncertain.
● It would be better to select among the non-greedy actions according to their
potential for actually being optimal, taking into account both how close their
estimates are to being maximal and the uncertainties in those estimates.
● One effective way of doing this is to select actions according to the upper
confidence bound (a standard form is sketched below):

○ Estimate an upper bound on the true action values


○ Select the action with the largest estimated upper bound
○ A clever way of reducing exploration over time
31
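A standard form of the UCB selection rule (sketched here in the usual notation; c > 0 is an exploration parameter and N_t(a) counts how often a has been selected before time t):

\[
A_t \;=\; \arg\max_a \left[\, Q_t(a) \;+\; c\,\sqrt{\frac{\ln t}{N_t(a)}} \,\right]
\]

The square-root bonus shrinks each time an arm is pulled, so exploration naturally decreases over time.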
Appendix

32
Standard stochastic approximation convergence condition
Sometimes it is convenient to vary the step-size parameter from step to step.
α_n(a): the step-size parameter used to process the reward received after the n-th
selection of action a.
If α_n(a) = 1/n, it results in the sample average, which will converge to the true action
value by the law of large numbers.
But of course, convergence is not guaranteed for all choices of α_n(a).
A well-known result in stochastic approximation theory gives us the conditions
required to assure convergence with probability 1:

∑_{n=1}^∞ α_n(a) = ∞    and    ∑_{n=1}^∞ α_n(a)² < ∞

The first condition is required to guarantee that the steps are large enough to eventually
overcome any initial conditions or random fluctuations. The second condition guarantees that
eventually the steps become small enough to assure convergence.
33
Standard stochastic approximation convergence condition
Sometimes it is convenient to vary the step-size parameter from step to step.
α_n(a): the step-size parameter used to process the reward received after the n-th
selection of action a.
If α_n(a) = 1/n, it results in the sample average, which will converge to the true action
value by the law of large numbers.
But of course, convergence is not guaranteed for all choices of α_n(a).
A well-known result in stochastic approximation theory gives us the conditions
required to assure convergence with probability 1:

∑_{n=1}^∞ α_n(a) = ∞    and    ∑_{n=1}^∞ α_n(a)² < ∞

Yes; No, because it is too small.


34
Action-Value vs Numerical Preference
We consider learning a numerical preference H_t(a) for each action a
The action probabilities follow a softmax distribution:

This is similar to a classification problem where the classes are actions.


Then, we can use stochastic gradient descent :)

35
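The softmax distribution over preferences, in its standard form (notation assumed): actions with larger preference H_t(a) are selected more often.

\[
\pi_t(a) \;=\; \Pr(A_t = a) \;=\; \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}
\]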
Gradient Bandit Algorithm

Then, we can use stochastic gradient descent :)


On each step, after selecting action A_t and receiving the reward R_t, we update the preferences.

The term R̄_t, the average of all rewards received so far, serves as a baseline with which the reward is compared.

If the reward is higher than the baseline, then the probability of taking A_t in the future is
increased, and if the reward is below the baseline, then the probability is decreased.
The non-selected actions move in the opposite direction.
36
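The per-step update in its standard gradient-bandit form (a sketch using the notation above; α > 0 is a step size and R̄_t is the average of all rewards received so far):

\[
H_{t+1}(A_t) \;=\; H_t(A_t) + \alpha\,(R_t - \bar{R}_t)\,\bigl(1 - \pi_t(A_t)\bigr)
\]
\[
H_{t+1}(a) \;=\; H_t(a) - \alpha\,(R_t - \bar{R}_t)\,\pi_t(a) \qquad \text{for all } a \neq A_t
\]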
Summary comparison of bandit algorithms

37
Derivation of gradient-bandit algorithm

38
