
CMPSC 448: Machine Learning

Lecture 16. Reinforcement Learning and Bandits

Rui Zhang
Fall 2024

1
What types of ML are there?

2
Outline
● Introduction to Reinforcement learning
● Multi-armed Bandits
● Markov Decision Processes (MDP)
○ Dynamic Programming when we know the world

● Learning in MDP: When we don't know the world


○ Monte Carlo Methods

○ Temporal-Difference Learning (TD): SARSA and Q-Learning

Note: All of the methods in these lectures are tabular; we will only briefly discuss the
motivation for function approximation methods (e.g., DQN, policy gradient, deep
reinforcement learning)
3
What is reinforcement learning?
How to build agents that learn behaviors in a dynamic world?
● Agent-oriented learning
● learning by interacting with an environment to achieve a goal
● more natural, realistic, and ambitious than other kinds of machine learning

RL is a general-purpose framework for decision-making


● RL is for an agent with the capacity to act
● Each action influences the agent’s future state
● Success is measured by a scalar reward signal
● Goal: select actions to maximize the future reward
● The learner is not told which actions to take, but instead must discover which actions
yield the most reward by trying them.
● The agent has to exploit what it has already experienced in order to obtain reward, but
it also has to explore in order to make better action selections in the future.
4
Maze

5
https://www.samyzaf.com/ML/rl/qmaze.html
TD-Gammon

estimated state value (≈ prob of winning)


Action selection by a shallow search

● Start with a random neural network
● Play millions of games against itself (i.e., self-play)
● Learn a value function from this simulated experience
● Six weeks later it's the best player of backgammon in the world
● Originally used expert handcrafted features; later repeated with raw board positions
6
AlphaGo
In October 2015, AlphaGo became the first computer Go program to beat a human
professional Go player without handicaps on a full-sized 19×19 board.

Monte Carlo Tree Search, learning policy and value function networks for pruning the search
tree, trained from expert demonstrations, self-play, and Tensor Processing Units
7
The RL interface between Agent and Environment
The agent is in a state, takes an action, gets some reward for the (state, action) pair,
and moves to a new state!


8
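To make this interface concrete, here is a minimal sketch of the interaction loop in Python. The environment object and its reset/step methods are hypothetical, gym-style stand-ins (not something defined in these slides); any environment exposing the same interface would work.

# Minimal sketch of the agent-environment loop (env and policy are hypothetical stand-ins).
def run_episode(env, policy, max_steps=1000):
    state = env.reset()                                # agent starts in some initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                         # pick an action in the current state
        next_state, reward, done = env.step(action)    # environment returns reward and next state
        total_reward += reward                         # success is measured by cumulative reward
        state = next_state                             # agent moves to the new state
        if done:
            break
    return total_reward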
RL Terms
A set of States
● These are the possible positions of our mouse within the maze.

A set of Actions available in each state


● This is {forward, back} in a corridor and {forward, back, left, right} at a crossroads.

Transitions between states


● For example, if you go left at a crossroads you end up in a new position. These can be a set of probabilities
that link to more than one possible state (e.g. when you use an attack in a game of Pokémon you can either
miss, inflict some damage, or inflict enough damage to knock out your opponent).

Rewards associated with each transition


● In the robot-mouse example, most of the rewards are 0, but they’re positive if you reach a point that
has water or cheese and negative if you reach a point that has an electric shock.

Policy
● A mapping from states to actions, aiming to maximize cumulative reward (how to map
situations to actions).
9
Notations
We will use the following notation for known and unknown variables:

10
Dynamics
State transition probability. May or may not be known. Could be deterministic or random

11
Dynamics
State transition probability. May or may not be known. Could be deterministic or random

Distribution over rewards. May or may not be known. Could be deterministic or random

12
Dynamics
State transition probability. May or may not be known. Could be deterministic or random

Distribution over rewards. May or may not be known. Could be deterministic or random

The goal is to learn a policy, which is a function whose input is a state and whose output is an
action (possibly randomized)

13
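Written out in the usual notation (assumed here, since the slide's own formulas are not shown), these three objects are:

\[
p(s' \mid s, a) \;=\; \Pr(S_{t+1} = s' \mid S_t = s,\ A_t = a) \qquad \text{(state-transition probability)}
\]
\[
R_{t+1} \sim \mathcal{R}(\cdot \mid S_t = s,\ A_t = a), \qquad r(s,a) \;=\; \mathbb{E}[R_{t+1} \mid S_t = s,\ A_t = a] \qquad \text{(reward distribution and its mean)}
\]
\[
\pi(a \mid s) \;=\; \Pr(A_t = a \mid S_t = s) \qquad \text{(policy, possibly randomized)}
\]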
Outline
● Introduction to Reinforcement learning
● Multi-armed Bandits
○ Formulation
○ Regret
○ Action-Value Methods
○ ε-greedy action selection
○ UCB action selection
○ Gradient bandits
● Markov Decision Processes (MDP)
● Learning in MDP: When we don't know the world

14
Multi-armed Bandits

The simplest reinforcement learning problem

One state (no state transition probabilities)


Actions: k levers (arms), each action is associated with a reward
Policy is to sequentially choose arms to maximize cumulative reward

15
Formulation: k-armed bandit problem
On each of an infinite sequence of time steps, t = 1, 2, 3, ...,
you choose an action A_t from k possibilities, and receive a real-valued reward R_t

The reward depends only on the action taken; it is independently and identically
distributed given the action

The true reward distributions are unknown. Nevertheless, you must maximize total
reward (equivalent to minimizing total regret)

You must both try actions to learn their values (explore), and prefer those that
appear best (exploit)
16
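In the standard notation (assumed here), the value of an arm is its expected reward, and the learner's estimates try to track it:

\[
q_*(a) \;=\; \mathbb{E}[R_t \mid A_t = a], \qquad a \in \{1, \dots, k\},
\]

with the rewards for a fixed arm drawn i.i.d. from an unknown distribution, and the learner maintaining estimates Q_t(a) ≈ q_*(a) from the rewards observed so far.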
Regret
Goal: minimize the REGRET:
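One standard way to write it (a sketch in the notation above, not copied from the slide): compare the reward we collect with always pulling the best arm.

\[
a^* \;=\; \arg\max_a q_*(a), \qquad
L_T \;=\; T\, q_*(a^*) \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} R_t\right]
\]

Sublinear regret means L_T / T → 0: per-step performance approaches that of the best arm.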

● Low regret means that we do not lose much from not knowing future events.
● We can perform almost as well as someone who observes the entire
sequence and picks the best prediction strategy in hindsight
● We cannot compute regret (because this requires knowing the best arm), but
we use it to analyze our algorithm
● We can also compete with a changing environment
17
Examples
If rewards are deterministic and known
● policy: pull arm with highest reward (exploitation)

18
Examples
If rewards are deterministic and known
● policy: pull arm with highest reward (exploitation)

If rewards are deterministic and unknown


● policy: try each arm (exploration), then use best one (exploitation)

19
Examples
If rewards are deterministic and known
● policy: pull arm with highest reward (exploitation)

If rewards are deterministic and unknown


● policy: try each arm (exploration), then use best one (exploitation)

If rewards are random and known


● policy: take action with highest expected reward

20
Examples
If rewards are deterministic and known
● policy: pull arm with highest reward (exploitation)

If rewards are deterministic and unknown


● policy: try each arm (exploration), then use best one (exploitation)

If rewards are random and known


● policy: take action with highest expected reward

If rewards are random and unknown


● policy: explore by trying each arm 10,000 times to estimate the rewards, then
exploit. But here exploration is too long and predetermined.
21
Action-value methods
Methods that learn action-value estimates and nothing else.

For example, estimate action values as sample averages:
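A sketch of the sample-average estimate in the usual notation (assumed here): the value of an arm is estimated by the mean of the rewards received when that arm was pulled.

\[
Q_t(a) \;=\; \frac{\sum_{i=1}^{t-1} R_i \,\mathbb{1}\{A_i = a\}}{\sum_{i=1}^{t-1} \mathbb{1}\{A_i = a\}}
\]

(with some default value, e.g. 0, if the arm has not been tried yet).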

22
The exploration/exploitation dilemma
Suppose you form estimates Q_t(a) ≈ q_*(a) for each action a

Define the greedy action at time t as A_t^* = argmax_a Q_t(a)

If A_t = A_t^*, then you are exploiting

If A_t ≠ A_t^*, then you are exploring

You can't do both, but you need to do both

You can never stop exploring, but maybe you should explore less with time. Or
maybe not.
23
ε-greedy action selection
In greedy action selection, you always exploit

In ε-greedy, you are usually greedy, but with probability ε you instead pick an
action at random (possibly the greedy action again)

This is perhaps the simplest way to balance exploration and exploitation

24
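A minimal sketch of ε-greedy selection with sample-average estimates, in Python (the function name and the pull_arm callback are illustrative assumptions, not part of the slides):

import random

def epsilon_greedy_bandit(pull_arm, k, steps, epsilon=0.1):
    """Run epsilon-greedy on a k-armed bandit; pull_arm(a) is assumed to return a sampled reward."""
    Q = [0.0] * k   # action-value estimates
    N = [0] * k     # pull counts
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(k)                  # explore: random arm
        else:
            a = max(range(k), key=lambda i: Q[i])    # exploit: current greedy arm
        r = pull_arm(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                    # incremental sample-average update
    return Q

With ε = 0 this reduces to pure greedy selection; a larger ε explores more, but keeps spending a fixed fraction of pulls on random arms forever.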
ε-greedy action selection

Exploration is needed because there is always uncertainty about the accuracy of


the action-value estimates.
25
ε-greedy action selection

Exploration is needed because there is always uncertainty about the accuracy of


the action-value estimates.
26
Linear vs sublinear Regret
● If an algorithm forever explores it will have linear total regret
● If an algorithm never explores it will have linear total regret
● Can we have sublinear total regret?

27
Incremental Implementation
To simplify notation, let us focus on one action
● We consider only its rewards R_1, R_2, ..., R_n, and its estimate Q_{n+1} after n rewards

How can we do this incrementally (without storing all the rewards)?

28
From Averaging to Learning Rule
To simplify notation, let us focus on one action
● We consider only its rewards R_1, R_2, ..., R_n, and its estimate Q_{n+1} after n rewards

How can we do this incrementally (without storing all the rewards)?


We can store a running sum and count (and divide), or equivalently:

This is a standard form for learning/update rules we will frequently use

29
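The incremental form referred to above, sketched in standard notation (assumed): the new estimate is the old estimate nudged toward the latest reward.

\[
Q_{n+1} \;=\; \frac{1}{n}\sum_{i=1}^{n} R_i \;=\; Q_n + \frac{1}{n}\bigl(R_n - Q_n\bigr)
\]

\[
\text{NewEstimate} \;\leftarrow\; \text{OldEstimate} \;+\; \text{StepSize}\,\bigl(\text{Target} - \text{OldEstimate}\bigr)
\]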
Optimistic Initial Values to Encourage Exploration
All methods so far depend on the initial action-value estimates Q_1(a)
So far we have used Q_1(a) = 0

Suppose we initialize the action values optimistically, i.e., higher than any reward we expect
to see; this encourages the agent to try all the arms.

30
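A brief illustrative sketch (the constant step size and the specific initial value are assumptions for illustration, not taken from the slides): with optimistically high initial estimates, even purely greedy selection samples every arm early on, because whichever arm has not yet disappointed still looks best.

def optimistic_greedy_bandit(pull_arm, k, steps, optimistic_value=5.0, alpha=0.1):
    """Purely greedy selection, but starting from optimistically high estimates."""
    Q = [optimistic_value] * k                  # deliberately too-high initial guesses
    for _ in range(steps):
        a = max(range(k), key=lambda i: Q[i])   # always greedy; no epsilon needed
        r = pull_arm(a)
        Q[a] += alpha * (r - Q[a])              # estimate drifts toward observed rewards
    return Q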
Upper Confidence Bound (UCB) action selection
● ε-greedy action selection forces the non-greedy actions to be tried, but
indiscriminately, with no preference for those that are nearly greedy or
particularly uncertain.
● It would be better to select among the non-greedy actions according to their
potential for actually being optimal, taking into account both how close their
estimates are to being maximal and the uncertainties in those estimates.
● One effective way of doing this is to select actions according to the upper
confidence bound (a standard form is sketched below):

○ Estimate an upper bound on the true action values


○ Select the action with the largest estimated upper bound
○ A clever way of reducing exploration over time
31
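A standard form of the UCB selection rule (sketched here in the usual notation; c > 0 is an exploration parameter and N_t(a) counts how often a has been selected before time t):

\[
A_t \;=\; \arg\max_a \left[\, Q_t(a) \;+\; c\,\sqrt{\frac{\ln t}{N_t(a)}} \,\right]
\]

The square-root bonus shrinks each time an arm is pulled, so exploration naturally decreases over time.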
Appendix

32
Standard stochastic approximation convergence condition
Sometimes it is convenient to vary the step-size parameter from step to step.
α_n(a): the step-size parameter used to process the reward received after the n-th
selection of action a.
If α_n(a) = 1/n, it results in the sample average, which will converge to the true action
value by the law of large numbers.
But of course, convergence is not guaranteed for all choices of α_n(a).
A well-known result in stochastic approximation theory gives us the conditions
required to assure convergence with probability 1:

∑_{n=1}^∞ α_n(a) = ∞    and    ∑_{n=1}^∞ α_n(a)² < ∞

The first condition is required to guarantee that the steps are large enough to eventually
overcome any initial conditions or random fluctuations. The second condition guarantees that
eventually the steps become small enough to assure convergence.
33
Standard stochastic approximation convergence condition
Sometimes it is convenient to vary the step-size parameter from step to step.
α_n(a): the step-size parameter used to process the reward received after the n-th
selection of action a.
If α_n(a) = 1/n, it results in the sample average, which will converge to the true action
value by the law of large numbers.
But of course, convergence is not guaranteed for all choices of α_n(a).
A well-known result in stochastic approximation theory gives us the conditions
required to assure convergence with probability 1:

∑_{n=1}^∞ α_n(a) = ∞    and    ∑_{n=1}^∞ α_n(a)² < ∞

Yes; No, because it is too small.


34
Action-Value vs Numerical Preference
We consider learning a numerical preference H_t(a) for each action a
The action probabilities follow a softmax distribution:

This is similar to a classification problem where the classes are actions.


Then, we can use stochastic gradient descent :)

35
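The softmax distribution over preferences, in its standard form (notation assumed): actions with larger preference H_t(a) are selected more often.

\[
\pi_t(a) \;=\; \Pr(A_t = a) \;=\; \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}
\]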
Gradient Bandit Algorithm

Then, we can use stochastic gradient descent :)


On each step, after selecting action A_t and receiving the reward R_t, we update the preferences.

The term R̄_t, the average of all rewards received so far, serves as a baseline with which the reward is compared.

If the reward is higher than the baseline, then the probability of taking A_t in the future is
increased, and if the reward is below the baseline, then the probability is decreased.
The non-selected actions move in the opposite direction.
36
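The per-step update in its standard gradient-bandit form (a sketch using the notation above; α > 0 is a step size and R̄_t is the average of all rewards received so far):

\[
H_{t+1}(A_t) \;=\; H_t(A_t) + \alpha\,(R_t - \bar{R}_t)\,\bigl(1 - \pi_t(A_t)\bigr)
\]
\[
H_{t+1}(a) \;=\; H_t(a) - \alpha\,(R_t - \bar{R}_t)\,\pi_t(a) \qquad \text{for all } a \neq A_t
\]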
Summary comparison of bandit algorithms

37
Derivation of gradient-bandit algorithm

38
