PPO (v3)

Proximal Policy Optimization (PPO) is the default reinforcement learning algorithm at OpenAI: it optimizes the policy while constraining how far the policy's behavior may drift in each update. These slides review the components of policy gradient methods (actor, environment, and reward function), introduce the tips of adding a baseline and assigning suitable credit, explain importance sampling and the difference between on-policy and off-policy learning, and present the PPO/TRPO algorithms and the PPO2 variant.


Proximal Policy Optimization (PPO)
The default reinforcement learning algorithm at OpenAI.

PPO = Policy Gradient + an added constraint.
DeepMind: https://youtu.be/gn4nRCC9TwQ
OpenAI: https://blog.openai.com/openai-baselines-ppo/
Policy Gradient (Review)
Basic Components
There are three basic components: the Actor, the Environment (Env), and the Reward Function. You can only control the actor; the environment and the reward function are given and cannot be controlled.
Video game example: the environment is the game, and the reward function may give 20 points for killing a monster.
Go example: the environment and the reward function are the rules of Go.
Policy of Actor
The policy takes the observation as input (e.g., the game pixels) and outputs a score for each action; the actor then takes an action based on the resulting probabilities, e.g., left 0.7, right 0.2, fire 0.1.
Example: Playing a Video Game
The actor observes the screen, takes an action, and receives a reward from the environment (e.g., for killing an alien). After many turns the game is over (e.g., the spaceship is destroyed); the whole sequence from start to Game Over is an episode. We want the total reward accumulated over the episode to be maximized.
Actor, Environment, Reward
The environment outputs a state, the actor responds with an action, the environment outputs the next state, and so on. The resulting sequence of states and actions, tau = {s_1, a_1, s_2, a_2, ...}, is a trajectory.
At every step of this loop the reward function returns a reward, and the state is updated; summing these rewards over a trajectory gives its total reward R(tau).
Expected Reward
Because both the environment and the actor's action sampling are stochastic, we maximize the expected total reward over trajectories rather than the reward of any single trajectory.
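
In formulas, following the standard policy-gradient derivation these slides use:

    \bar{R}_\theta = \sum_\tau R(\tau)\, p_\theta(\tau) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[R(\tau)\big]

    \nabla \bar{R}_\theta = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[R(\tau)\, \nabla \log p_\theta(\tau)\big]
                    \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n \mid s_t^n)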
Policy Gradient
The environment and the reward function can even be black boxes; we only need to be able to sample from them. Training alternates between two phases: data collection (run the current policy to sample trajectories) and model update (take a gradient step computed from those trajectories; the update rule is given below). The collected data is only used once: after the parameters change, the old samples no longer come from the current policy, so fresh data must be collected.
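
The update step, with learning rate eta:

    \theta \leftarrow \theta + \eta\, \nabla \bar{R}_\theta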
Implementation
Treat each time step as a classification problem: the network outputs a distribution over actions, and the action that was actually sampled serves as the one-hot target (e.g., left = 1, right = 0, fire = 0). The usual cross-entropy objective is then weighted by the total reward R(tau) of the trajectory the step came from. This can be implemented directly in TF, PyTorch, etc.
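
A minimal PyTorch sketch of this reward-weighted cross-entropy; the network size, helper name, and dummy data are illustrative assumptions, not from the slides:

    import torch
    import torch.nn.functional as F

    # Illustrative policy network: 4-dim observation, 3 actions (left/right/fire).
    policy_net = torch.nn.Sequential(
        torch.nn.Linear(4, 32), torch.nn.Tanh(), torch.nn.Linear(32, 3))

    def policy_gradient_loss(states, actions, returns):
        # Log-probability of each sampled action, weighted by its trajectory's R(tau).
        logits = policy_net(states)
        log_probs = F.log_softmax(logits, dim=-1)
        chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
        return -(returns * chosen).mean()  # minimize the negated objective

    # Usage on dummy data: 5 steps; each step carries its trajectory's total reward.
    states = torch.randn(5, 4)
    actions = torch.randint(0, 3, (5,))
    returns = torch.tensor([3.0, 3.0, 3.0, -7.0, -7.0])
    policy_gradient_loss(states, actions, returns).backward()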
Tip 1: Add a Baseline
In many tasks the reward is always positive, so every sampled action gets its probability increased, just by different amounts. In the ideal case, where every action is updated, this is fine: the output is a probability distribution, so actions a, b, c with smaller weights still fall in relative terms after normalization. But in practice we sample: the probability of the actions that are not sampled will decrease even when they are good, simply because the sampled ones are pushed up. Subtracting a baseline b from the reward makes the weight R(tau) - b positive or negative, so below-average actions are actively pushed down.
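
With the baseline, the gradient estimate becomes (b is often set to the average observed R(tau)):

    \nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \big(R(\tau^n) - b\big)\, \nabla \log p_\theta(a_t^n \mid s_t^n), \qquad b \approx \mathbb{E}[R(\tau)]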
Tip 2: Assign Suitable Credit
Weighting every action in a trajectory by the same total reward assigns credit unfairly. For example, a trajectory with step rewards (+5, +0, -2) has total reward +3, so even the action that caused the -2 gets reinforced; a trajectory with step rewards (-5, +0, -2) has total reward -7, so even the reasonable actions taken after the -5 get punished.
Tip 2: Assign Suitable Credit (cont.)
Credit each action only with the rewards obtained after it is taken, and add a discount factor gamma < 1 so that rewards further in the future count less. The baseline can be state-dependent. The whole weighting term is the advantage function A^theta(s_t, a_t): how much better it is to take a_t at s_t than the other actions. It can be estimated by a "critic" (introduced later).
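
In symbols, the weighted gradient and the advantage it uses:

    \nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} A^{\theta}(s_t^n, a_t^n)\, \nabla \log p_\theta(a_t^n \mid s_t^n), \qquad A^{\theta}(s_t, a_t) = \sum_{t'=t}^{T} \gamma^{\,t'-t}\, r_{t'} - b(s_t)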


From On-policy to Off-policy
Goal: use the collected experience more than once.
On-policy vs. Off-policy
• On-policy: the agent being learned and the agent interacting with the environment are the same. (Hikaru learns Go by playing it himself.)
• Off-policy: the agent being learned and the agent interacting with the environment are different. (Sai plays while Hikaru learns by watching from the side.)
Importance Sampling
Suppose we want the expectation of f(x) under a distribution p, but we can only sample from another distribution q. We can still sample from q, as long as we reweight each sample by the importance weight p(x)/q(x).
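
The identity behind this:

    \mathbb{E}_{x \sim p}[f(x)] = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx = \mathbb{E}_{x \sim q}\!\left[ f(x)\, \frac{p(x)}{q(x)} \right]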
Issue of Importance Sampling
The two expectations are equal, but their variances are not. When p and q differ a lot, the importance weight p(x)/q(x) can become very large, and with a limited number of samples the estimate can be far off; it can even get the sign wrong, e.g., appearing positive where the true expectation under p is negative.
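
Concretely, the variance of the reweighted estimator carries an extra p(x)/q(x) factor in its second moment:

    \operatorname{Var}_{x \sim q}\!\left[ f(x)\, \frac{p(x)}{q(x)} \right] = \mathbb{E}_{x \sim p}\!\left[ f(x)^2\, \frac{p(x)}{q(x)} \right] - \big( \mathbb{E}_{x \sim p}[f(x)] \big)^2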
Importance Sampling for Policy Gradient
Sample with an old policy theta' and train the current policy theta, correcting each term with the importance weight p_theta(a_t | s_t) / p_theta'(a_t | s_t). The advantage term is estimated from the data sampled with theta', and theta is updated several times on the same batch. When to stop? If theta drifts too far from theta', the importance weights blow up and the gradient estimate becomes unreliable.
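
Using the identity \nabla f(x) = f(x)\, \nabla \log f(x), this gradient corresponds to the off-policy objective:

    J^{\theta'}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}}\!\left[ \frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t) \right]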
Add Constraint
Advance steadily and secure every step: keep the new policy close to the old one.
Proximal Policy Optimization (PPO): optimize J^theta'(theta) with a KL penalty subtracted from the objective.
TRPO (Trust Region Policy Optimization): the predecessor of PPO; it imposes the KL divergence as a hard constraint, which makes the optimization harder, whereas PPO folds it into the objective.
Note that KL(theta, theta') here measures the distance between the action distributions of the two policies. It is a constraint on behavior, not on the parameters.
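
The two objectives side by side:

    J_{\text{PPO}}(\theta) = J^{\theta'}(\theta) - \beta\, \mathrm{KL}(\theta, \theta')

    J_{\text{TRPO}}(\theta) = J^{\theta'}(\theta), \quad \text{subject to } \mathrm{KL}(\theta, \theta') < \delta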


PPO Algorithm
• Initialize the policy parameters theta^0.
• In each iteration k:
  – Use theta^k to interact with the environment, collecting {(s_t, a_t)} and estimating the advantage A^{theta^k}(s_t, a_t).
  – Update the parameters several times on this batch by optimizing J_PPO(theta) = J^{theta^k}(theta) - beta KL(theta, theta^k).
• Adaptive KL penalty: if KL(theta, theta^k) > KL_max, increase beta; if KL(theta, theta^k) < KL_min, decrease beta.
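
A sketch of the adaptive rule in Python; the doubling/halving factors are an assumption, as the slide only specifies the direction of the change:

    def adapt_beta(beta, kl, kl_min, kl_max):
        # Hypothetical factors of 2; only the increase/decrease rule is from the slide.
        if kl > kl_max:
            beta *= 2.0   # policies drifted too far apart: strengthen the penalty
        elif kl < kl_min:
            beta /= 2.0   # penalty is over-constraining: relax it
        return beta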
PPO2 Algorithm
PPO2 replaces the KL penalty with clipping: the probability ratio p_theta(a_t | s_t) / p_{theta^k}(a_t | s_t) is clipped to the interval [1 - epsilon, 1 + epsilon], and the objective takes the minimum of the clipped and unclipped terms. This avoids computing the KL divergence while still preventing the new policy from moving too far from the old one.
https://arxiv.org/abs/1707.06347
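
The PPO2 objective, followed by a minimal PyTorch sketch (the function and variable names are illustrative):

    J^{\theta^k}_{\text{PPO2}}(\theta) \approx \sum_{(s_t, a_t)} \min\!\left( \frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)}\, A^{\theta^k}(s_t, a_t),\ \operatorname{clip}\!\left( \frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)},\, 1-\varepsilon,\, 1+\varepsilon \right) A^{\theta^k}(s_t, a_t) \right)

    import torch

    def ppo2_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
        # Probability ratio p_theta / p_theta_k, computed in log space for stability.
        ratio = torch.exp(log_probs_new - log_probs_old)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
        # Maximize the elementwise min of the two terms -> minimize its negated mean.
        return -torch.min(unclipped, clipped).mean()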

Experimental Results
