13 ML Reinforcement Learning - Policy Search
1. Motivation
2. General Idea of Policy Gradient
3. Search in Parameter Space
4. Search in Action Space
Boltzmann Policy: $\pi_\theta(a \mid s) = \dfrac{\exp\left(\theta^\top \phi(s,a)\right)}{\sum_{a'} \exp\left(\theta^\top \phi(s,a')\right)}$ (discrete action spaces)
Gaussian Policy: $\pi_\theta(a \mid s) = \mathcal{N}\!\left(a \mid \mu_\theta(s), \sigma_\theta^2(s)\right)$ (continuous action spaces)
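To make the two parameterizations concrete, here is a minimal NumPy sketch (not from the slides): the feature maps phi(s, a) and psi(s) and the fixed standard deviation are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def boltzmann_policy(theta, phi, s, actions):
    """Boltzmann (softmax) policy over a discrete action set.
    phi(s, a) is an assumed feature map; theta are the policy parameters."""
    logits = np.array([theta @ phi(s, a) for a in actions])
    logits -= logits.max()                         # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    idx = rng.choice(len(actions), p=probs)
    return actions[idx], np.log(probs[idx])

def gaussian_policy(theta, psi, s, sigma=0.5):
    """Gaussian policy for a 1-D continuous action: a ~ N(mu_theta(s), sigma^2),
    with a linear mean mu_theta(s) = theta @ psi(s)."""
    mu = theta @ psi(s)
    a = rng.normal(mu, sigma)
    log_prob = -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    return a, log_prob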
Objective
In policy gradient methods, we aim to maximize the expected return:
$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right],$$
where $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)$ is a trajectory generated by following the policy $\pi_\theta$.
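Since $J(\theta)$ is an expectation over trajectories, it can be estimated by averaging sampled rollouts. A minimal sketch, assuming a gym-style environment interface (reset() returning a state, step(a) returning (next state, reward, done, info)); these interface details are assumptions, not part of the slides.

import numpy as np

def estimate_return(env, policy, gamma=0.99, horizon=200, n_episodes=20):
    """Monte-Carlo estimate of J(theta): average discounted return of `policy`."""
    returns = []
    for _ in range(n_episodes):
        s = env.reset()
        G, discount = 0.0, 1.0
        for t in range(horizon):
            a = policy(s)
            s, r, done, _ = env.step(a)     # assumed gym-style step signature
            G += discount * r
            discount *= gamma
            if done:
                break
        returns.append(G)
    return np.mean(returns)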
Parameter Search
Grid Search: One possibility is to organize the possible values of the parameters into a grid and, for each combination, perform a Monte-Carlo evaluation. Grid search is usually unfeasible, since the policy is encoded with thousands or millions of parameters. Furthermore, the parameters are typically continuous.
Genetic Algorithms: Genetic algorithms are faster than grid search. However, they are still problematic when the size of the neural network is large.
Gradient Ascent: $\theta_{k+1} = \theta_k + \alpha_k \nabla_\theta J(\theta_k)$, where $\alpha_k$ is a learning rate (step size). However, $J(\theta)$ can only be evaluated by running episodes, so its gradient is not directly available.
As we have seen, writing things in probabilistic terms often helps (see all the estimators we have seen so far!). To this end, we introduce a probability distribution $p_\nu(\theta)$ over the parameters, and maximize $\mathbb{E}_{\theta \sim p_\nu}[J(\theta)]$ with respect to $\nu$.
Log-Ratio Trick
Using the log-ratio (likelihood-ratio) trick, the gradient of the expected objective with respect to $\nu$ becomes an expectation we can sample from:
$$\nabla_\nu \mathbb{E}_{\theta \sim p_\nu}\left[J(\theta)\right] = \mathbb{E}_{\theta \sim p_\nu}\left[J(\theta)\, \nabla_\nu \log p_\nu(\theta)\right].$$
This leads to the following parameter-space search algorithm.
1: Input: a policy $\pi_\theta$, a distribution over parameters $p_\nu$, number of parameter samples $N$, number of iterations $K$
2: for $k = 1, \dots, K$ do
3:   for $i = 1, \dots, N$ do
4:     Sample $\theta_i \sim p_\nu$.
5:     Estimate $J(\theta_i)$ with a Monte-Carlo evaluation (run one or more episodes with $\pi_{\theta_i}$).
6:   end for
7:   Compute the gradient estimate $\hat{g} = \frac{1}{N} \sum_{i=1}^{N} J(\theta_i)\, \nabla_\nu \log p_\nu(\theta_i)$.
8:   Update $\nu \leftarrow \nu + \alpha \hat{g}$.
9: end for
10: Return $\nu$.
$p_\nu$ is often very high dimensional, since it needs to encode a distribution over parameters. When $\theta$ are the parameters of a (deep) neural network, this estimator becomes intractable.
Overall: in some applications, where the policy needs to be deterministic, $\theta$ has a small dimension, or the environment is partially observable, this estimator makes a lot of sense.
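A minimal sketch of this parameter-space policy gradient, assuming $p_\nu$ is an isotropic Gaussian with mean $\nu$ and fixed standard deviation; the callables make_policy and estimate_j (e.g., the Monte-Carlo evaluator sketched earlier) are illustrative assumptions.

import numpy as np

def parameter_space_pg(make_policy, estimate_j, dim, sigma=0.1,
                       n_samples=16, n_iters=100, lr=1e-2, seed=0):
    """Search in parameter space with the log-ratio trick.
    p_nu(theta) = N(nu, sigma^2 I), so grad_nu log p_nu(theta) = (theta - nu) / sigma^2."""
    rng = np.random.default_rng(seed)
    nu = np.zeros(dim)
    for _ in range(n_iters):
        grad = np.zeros(dim)
        for _ in range(n_samples):
            theta = nu + sigma * rng.standard_normal(dim)   # theta ~ p_nu
            j = estimate_j(make_policy(theta))               # Monte-Carlo estimate of J(theta)
            grad += j * (theta - nu) / sigma**2              # J(theta) * grad log p_nu(theta)
        nu += lr * grad / n_samples                          # gradient ascent on nu
    return nu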
Naïve Policy Gradient
Applying the same log-ratio trick directly to trajectories (search in action space) gives
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[ R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right], \qquad R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t,$$
which can be estimated by sampling trajectories with the current policy.
1: Input: a policy $\pi_\theta$, an initial-state distribution $\mu_0$, number of episodes $N$, number of iterations $K$, episode truncation $T$
2: for $k = 1, \dots, K$ do
3:   for $i = 1, \dots, N$ do
4:     Sample $s_0 \sim \mu_0$.
5:     Initialize $R_i \leftarrow 0$ and $g_i \leftarrow 0$.
6:     for $t = 0, \dots, T-1$ do
7:       Choose action $a_t \sim \pi_\theta(\cdot \mid s_t)$.
8:       Execute the action on the environment, observe reward $r_t$ and next state $s_{t+1}$.
9:       $R_i \leftarrow R_i + \gamma^t r_t$; $\quad g_i \leftarrow g_i + \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.
10:    end for
11:  end for
12:  $\theta \leftarrow \theta + \alpha \frac{1}{N} \sum_{i=1}^{N} R_i\, g_i$.
13: end for
14: Return $\theta$.
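A minimal sketch of the resulting estimator, assuming each trajectory has already been collected as a list of (reward, grad_log_pi) pairs, where grad_log_pi is $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ as a NumPy array; this data format is an assumption for illustration.

def naive_policy_gradient(trajectories, gamma=0.99):
    """Naive estimator: g = (1/N) sum_i R(tau_i) * sum_t grad log pi(a_t | s_t)."""
    grad = None
    for traj in trajectories:
        R = sum(gamma**t * r for t, (r, _) in enumerate(traj))   # full return R(tau)
        score = sum(g for _, g in traj)                          # sum of grad-log-probs
        grad = score * R if grad is None else grad + score * R
    return grad / len(trajectories)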
Towards REINFORCE
What did we gain w.r.t. parameter search? We now use the gradient of the policy directly, without the need to model a distribution over parameters.
However, this estimator still exhibits extremely high variance! Can we do better? Yes! The trick is to take into account that the reward at time $t$ depends only on the current and previous transitions: future transitions don't matter.
Let's define the truncated trajectory $\tau_{0:t} = (s_0, a_0, r_0, \dots, s_t, a_t, r_t)$ and, more in general, $\tau_{t:t'} = (s_t, a_t, r_t, \dots, s_{t'}, a_{t'}, r_{t'})$.
Towards REINFORCE
Plugging it back together,
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[ \sum_{t=0}^{T-1} \gamma^t\, G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right],$$
where $G_t = \sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'}$ is the return from time step $t$. This estimator is called REINFORCE.
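The only change w.r.t. the naive estimator is that each grad-log-prob term is weighted by the return from that time step onward instead of the full-trajectory return. A minimal sketch of the reward-to-go computation:

def rewards_to_go(rewards, gamma=0.99):
    """G_t = sum_{t' >= t} gamma^(t'-t) r_{t'}, computed backwards in O(T)."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]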
REINFORCE
1: Input: a policy $\pi_\theta$, an initial-state distribution $\mu_0$, number of episodes $N$, number of iterations $K$, episode truncation $T$
2: for $k = 1, \dots, K$ do
3:   for $i = 1, \dots, N$ do
4:     Sample $s_0 \sim \mu_0$.
5:     Initialize an empty episode buffer.
6:     for $t = 0, \dots, T-1$ do
7:       Choose action $a_t \sim \pi_\theta(\cdot \mid s_t)$.
8:       Execute the action on the environment, observe reward $r_t$ and next state $s_{t+1}$.
9:       Store $(s_t, a_t, r_t)$ in the buffer.
10:    end for
11:    Compute the returns $G_t^i = \sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'}$ for all $t$.
12:  end for
13:  $\theta \leftarrow \theta + \alpha \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \gamma^t\, G_t^i\, \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)$.
14: Return $\theta$.
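A minimal sketch of the corresponding gradient estimate, reusing the same assumed trajectory format as before (lists of (reward, grad_log_pi) pairs); the learning rate in the comment is an illustrative choice.

def reinforce_gradient(trajectories, gamma=0.99):
    """REINFORCE estimator: (1/N) sum_i sum_t gamma^t * G_t * grad log pi(a_t | s_t)."""
    grad = None
    for traj in trajectories:
        rewards = [r for r, _ in traj]
        # rewards-to-go: G_t = sum_{t' >= t} gamma^(t'-t) r_{t'}
        G, rtg = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            rtg.append(G)
        rtg = rtg[::-1]
        for t, (_, glogpi) in enumerate(traj):
            term = (gamma ** t) * rtg[t] * glogpi
            grad = term if grad is None else grad + term
    return grad / len(trajectories)

# One policy-improvement step, for some learning rate alpha:
# theta = theta + alpha * reinforce_gradient(trajectories)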
REINFORCE
REINFORCE has much less variance than the naïve policy gradient, but it is still prohibitively high. Furthermore, the gradient can be computed only after the execution of many episodes, which makes the improvement of the policy very slow and sample inefficient.
There are two improvements that we can make:
(1) Instead of a Monte-Carlo estimate of the return $G_t$, we can use a temporal-difference estimate of the $Q$-function.
(2) Instead of waiting for the termination of many episodes, we can improve the policy at each step.
Temporal-Difference
The following equality holds:
$$\mathbb{E}\left[G_t \mid s_t, a_t\right] = Q^{\pi_\theta}(s_t, a_t), \quad\text{so}\quad \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[ \sum_{t=0}^{T-1} \gamma^t\, Q^{\pi_\theta}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right];$$
hence, by estimating $Q^{\pi_\theta}$ with temporal difference, we can greatly reduce the variance. We can use the temporal difference with function approximation seen in the previous lecture.
Policy gradient algorithms that use a temporal-difference approximation (or alike) of the $Q$-function are called actor-critic.
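A minimal sketch of one TD(0) update for a linear critic $Q_\psi(s,a) = \psi^\top \phi(s,a)$ with a SARSA-style target; the linear parameterization and the feature map phi are illustrative assumptions.

def td_critic_update(psi, phi, transition, beta=0.05, gamma=0.99):
    """One TD(0) update for a linear critic Q_psi(s, a) = psi @ phi(s, a).
    `transition` is (s, a, r, s_next, a_next), with a_next ~ pi_theta(.|s_next)."""
    s, a, r, s_next, a_next = transition
    q_sa = psi @ phi(s, a)
    q_next = psi @ phi(s_next, a_next)
    delta = r + gamma * q_next - q_sa          # TD error
    return psi + beta * delta * phi(s, a)      # grad of Q w.r.t. psi is phi(s, a)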
Bootstrapping
The following equality holds:
$$\nabla_\theta J(\theta) = \mathbb{E}_{(s,a) \sim d^{\pi_\theta}}\left[ Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right],$$
where $d^{\pi_\theta}(s, a) \propto \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s, a_t = a)$ is a discounted state-action visitation. This last expression allows us to write a gradient estimator based on single samples, allowing us to improve the policy at each step.
Online estimator:
$$\widehat{\nabla_\theta J}(\theta) = Q_\psi(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t),$$
where $(s_t, a_t)$ are observed by the interaction of the agent with the environment.
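As a one-line sketch, the corresponding single-sample actor step could look as follows; including the $\gamma^t$ discount on the step is one possible choice (implementations often drop it), and the argument names are illustrative.

def online_actor_update(theta, alpha, q_value, grad_log_pi, gamma=0.99, t=0):
    """theta <- theta + alpha * gamma^t * Q_psi(s_t, a_t) * grad log pi_theta(a_t | s_t)."""
    return theta + alpha * (gamma ** t) * q_value * grad_log_pi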
Policy Gradient
Notice that the equation in the previous slide forms the Policy Gradient Theorem:
$$\nabla_\theta J(\theta) \propto \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}\left[ Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right],$$
where $d^{\pi_\theta}(s) \propto \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s)$ and $Q^{\pi_\theta}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a\right]$.
Online Actor-Critic
1: Input: a policy $\pi_\theta$, a critic $Q_\psi$, an initial-state distribution $\mu_0$, number of episodes $N$, episode truncation $T$, learning rates $\alpha$ for the actor and $\beta$ for the critic
2: for $i = 1, \dots, N$ do
3:   Sample $s_0 \sim \mu_0$.
4:   for $t = 0, \dots, T-1$ do
5:     Choose action $a_t \sim \pi_\theta(\cdot \mid s_t)$.
6:     Execute the action on the environment, observe reward $r_t$ and next state $s_{t+1}$.
7:     Compute the TD error $\delta_t = r_t + \gamma Q_\psi(s_{t+1}, a_{t+1}) - Q_\psi(s_t, a_t)$, with $a_{t+1} \sim \pi_\theta(\cdot \mid s_{t+1})$.
8:     Critic update: $\psi \leftarrow \psi + \beta\, \delta_t\, \nabla_\psi Q_\psi(s_t, a_t)$.
9:     Actor update: $\theta \leftarrow \theta + \alpha\, \gamma^t\, Q_\psi(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.
10:   end for
11: end for
12: Return $\theta$ and $\psi$.
Nevertheless, the presented estimator captures well the core idea of actor-critic algorithms.
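To close the section, here is a minimal end-to-end sketch of such an online actor-critic loop, assuming a gym-style environment, a state-feature map phi_s, a linear softmax actor, and a linear critic $Q_\psi(s,a) = \psi_a^\top \phi_s(s)$; all of these modeling choices are illustrative assumptions, not the slides' specific setup.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def online_actor_critic(env, phi_s, n_actions, d, n_episodes=500, horizon=200,
                        alpha=1e-3, beta=1e-2, gamma=0.99, seed=0):
    """Online actor-critic: linear softmax actor pi_theta(a|s) = softmax(theta @ phi_s(s)),
    linear critic Q_psi(s, a) = psi[a] @ phi_s(s). Gym-style env interface is assumed."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_actions, d))       # actor parameters, one row per action
    psi = np.zeros((n_actions, d))         # critic parameters, one row per action
    for _ in range(n_episodes):
        s = env.reset()
        x = phi_s(s)
        pi = softmax(theta @ x)
        a = rng.choice(n_actions, p=pi)
        discount = 1.0
        for t in range(horizon):
            s_next, r, done, _ = env.step(a)
            x_next = phi_s(s_next)
            pi_next = softmax(theta @ x_next)
            a_next = rng.choice(n_actions, p=pi_next)
            # Critic: SARSA-style TD(0) update of Q_psi
            q_sa = psi[a] @ x
            target = r + (0.0 if done else gamma * (psi[a_next] @ x_next))
            psi[a] += beta * (target - q_sa) * x
            # Actor: single-sample policy gradient step, using the critic as Q estimate
            grad_log = -np.outer(pi, x)
            grad_log[a] += x                       # grad log pi_theta(a|s) for a softmax-linear actor
            theta += alpha * discount * q_sa * grad_log
            if done:
                break
            s, x, pi, a = s_next, x_next, pi_next, a_next
            discount *= gamma
    return theta, psi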