Proximal Policy Optimization (PPO)
PPO is the default reinforcement learning algorithm at OpenAI.
PPO = Policy Gradient + a constraint on the policy update
DeepMind: https://youtu.be/gn4nRCC9TwQ
OpenAI: https://blog.openai.com/openai-baselines-ppo/
Policy Gradient (Review)
Basic Components
The three basic components are the Actor, the Environment, and the Reward Function. You can only control the actor; the environment and the reward function are given and fixed (in Go, for example, the reward is determined by the rules of Go).
Policy of Actor: the policy $\pi$ is a network with parameters $\theta$. Input: the machine's observation (e.g., the game pixels). Output: a probability for each action (e.g., fire 0.1), from which the action is sampled.
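A minimal sketch of such a policy network in PyTorch (not from the slides; the layer sizes and the three-action space left/right/fire are assumptions for illustration):

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps an observation (flattened game pixels) to a distribution over actions."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),  # one output neuron per action
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(obs))

# Hypothetical sizes: 84*84 grayscale pixels, 3 actions (left, right, fire).
policy = PolicyNetwork(obs_dim=84 * 84, n_actions=3)
dist = policy(torch.rand(1, 84 * 84))
action = dist.sample()  # the actor samples its action from the distribution
```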
Example: Playing Video Game
The actor sees an observation $s_t$, takes an action $a_t$, and obtains a reward $r_t$ (e.g., killing an alien yields a positive reward).
After many turns the game is over (e.g., the spaceship is destroyed). The whole game is one episode, and the actor tries to maximize the total reward accumulated over the episode.
Actor, Environment, Reward
A trajectory is the whole interaction sequence $\tau = \{s_1, a_1, s_2, a_2, \ldots, s_T, a_T\}$. Its probability under the actor's policy is
$p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} p_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$,
where the environment terms $p(s_1)$ and $p(s_{t+1} \mid s_t, a_t)$ do not depend on $\theta$. As the actor is updated, the distribution over trajectories, and hence over rewards, changes.
Expected Reward
$\bar{R}_\theta = \sum_\tau R(\tau)\, p_\theta(\tau) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)]$, where $R(\tau) = \sum_{t=1}^{T} r_t$ is the total reward of trajectory $\tau$.
Policy Gradient
$\nabla \bar{R}_\theta = \sum_\tau R(\tau)\, \nabla p_\theta(\tau) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[R(\tau)\, \nabla \log p_\theta(\tau)\big] \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n \mid s_t^n)$
(using $\nabla p_\theta = p_\theta \nabla \log p_\theta$; the environment terms drop out of the gradient). Update: $\theta \leftarrow \theta + \eta\, \nabla \bar{R}_\theta$.
Data Collection
The data is sampled with the current policy $\pi_\theta$, so it can only be used once: after a single gradient update $\theta$ has changed, the old samples no longer follow $p_\theta(\tau)$, and we must interact with the environment again to collect new data (see the sketch below).
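A minimal sketch of this on-policy loop, reusing the PolicyNetwork sketch above (the rollout is a toy stub, since the point here is only the data flow):

```python
import torch

def collect_trajectories(policy):
    """Toy stub standing in for real environment interaction."""
    states = torch.rand(16, 4)            # 16 steps of a 4-dim observation
    actions = policy(states).sample()     # actions sampled from pi_theta
    returns = torch.rand(16)              # stand-in total rewards R(tau^n)
    return states, actions, returns

policy = PolicyNetwork(obs_dim=4, n_actions=3)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for iteration in range(3):
    states, actions, returns = collect_trajectories(policy)   # fresh data
    log_probs = policy(states).log_prob(actions)
    loss = -(returns * log_probs).mean()   # maximize R-weighted log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # theta has changed: the batch above no longer follows p_theta(tau),
    # so it is discarded and new data is collected at the top of the loop.
```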
Implementation
Treat it like a classification problem: for each sampled state, the "label" is the action that was actually sampled (e.g., a one-hot target with left 1, right 0, fire 0), and the cross-entropy loss of each example is weighted by the total reward $R(\tau^n)$. Automatic differentiation in TF, PyTorch, etc. then computes the policy gradient.
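Concretely, the update is a reward-weighted cross-entropy; a sketch in PyTorch (the numbers are made up for illustration):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(5, 3, requires_grad=True)       # 5 states, 3 actions
sampled = torch.tensor([2, 0, 1, 2, 0])              # sampled actions, e.g. 2 = fire
returns = torch.tensor([3.0, -1.0, 2.0, 0.5, -2.0])  # R(tau^n) per example

# Per-example cross-entropy against the sampled action = -log p_theta(a|s);
# weighting it by the return turns classification into policy gradient.
ce = F.cross_entropy(logits, sampled, reduction="none")
loss = (returns * ce).mean()
loss.backward()
```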
Tip 1: Add a Baseline
It is possible that $R(\tau^n)$ is always positive (in many games the reward is never negative). In the ideal case this would be fine: for actions a, b, c the probability of every action increases, just by different amounts, and after normalization the relatively better actions still gain probability. But we update from samples: the probability of the sampled actions increases, and after normalization the probability of the actions that were not sampled will decrease, even if they are good actions. Fix: subtract a baseline $b$ so that the weight can be negative:
$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \big(R(\tau^n) - b\big)\, \nabla \log p_\theta(a_t^n \mid s_t^n), \qquad b \approx \mathbb{E}[R(\tau)]$
A simple concrete choice is shown below.
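One common simple choice, used here as an illustrative assumption, is the batch mean of the returns:

```python
import torch

returns = torch.tensor([3.0, 1.0, 2.0, 4.0])  # all rewards positive
baseline = returns.mean()                     # b ~ E[R(tau)]
weights = returns - baseline                  # tensor([ 0.5, -1.5, -0.5,  1.5])
# Sampled actions that did worse than average are now actively discouraged.
```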
Tip 2: Assign Suitable Credit
Weighting every action in a trajectory by the same total reward $R(\tau^n)$ is unfair. Example: one episode has rewards +5, +0, −2 (total +3), so all of its actions are encouraged; another has rewards −5, +0, −2 (total −7), so all of its actions are discouraged, even the good ones. Given enough samples this averages out, but with limited data we should assign suitable credit: count only the rewards obtained after an action, $\sum_{t'=t}^{T_n} r_{t'}^n$, and discount later rewards, $\sum_{t'=t}^{T_n} \gamma^{t'-t} r_{t'}^n$ with $\gamma < 1$ (see the sketch below). Subtracting a baseline, which can be state-dependent, turns this weight into the Advantage Function $A^\theta(s_t, a_t)$: how much better taking $a_t$ at $s_t$ is than the other actions. It can be estimated by a "critic" (later).
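A sketch of the discounted reward-to-go computation, using the slide's +5, +0, −2 episode ($\gamma = 0.9$ is an arbitrary choice here):

```python
def rewards_to_go(rewards, gamma=0.9):
    """Credit each step only with its own and later (discounted) rewards."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]

print(rewards_to_go([5.0, 0.0, -2.0]))
# [3.38, -1.8, -2.0]: the last action is blamed for its -2,
# instead of being credited with the episode total of +3.
```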
On-policy vs. off-policy: on-policy learning means the agent learns from games it plays itself (Hikaru improving by playing Go); off-policy learning means it learns from games played by someone else (Sai plays Go while Hikaru watches from the side). Policy gradient as described is on-policy; we want an off-policy version so that one batch of data sampled from $\pi_{\theta'}$ can be used to update $\theta$ many times.
Importance Sampling
$\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\!\left[f(x)\, \frac{p(x)}{q(x)}\right]$
We cannot sample from $p$, so we sample from $q$ instead and correct each sample with the importance weight $p(x)/q(x)$.
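A quick numeric check of the identity (the distributions and $f$ are arbitrary choices for illustration):

```python
import torch

torch.manual_seed(0)
f = lambda x: x ** 2
p = torch.distributions.Normal(0.0, 1.0)   # target distribution
q = torch.distributions.Normal(0.5, 1.5)   # distribution we actually sample

x = q.sample((100_000,))
w = (p.log_prob(x) - q.log_prob(x)).exp()  # importance weight p(x)/q(x)
print((f(x) * w).mean())  # ~= 1.0, the true E_{x~p}[x^2]
```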
Issue of Importance Sampling
The corrected estimator has the right mean but a different variance:
$\mathrm{Var}_{x \sim q}\!\left[f(x)\frac{p(x)}{q(x)}\right] = \mathbb{E}_{x \sim p}\!\left[f(x)^2 \frac{p(x)}{q(x)}\right] - \big(\mathbb{E}_{x \sim p}[f(x)]\big)^2 \neq \mathrm{Var}_{x \sim p}[f(x)]$
If $p(x)/q(x)$ can be large, the variance blows up. With few samples the estimate can even get the sign wrong: if $f(x)$ is negative where $p$ is large but $q$ rarely samples there, most samples suggest a positive value, and only a rare sample with a huge importance weight corrects the estimate toward negative.
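A sketch of the blow-up, moving $q$ away from $p$ (same toy $f$ as above):

```python
import torch

torch.manual_seed(0)
f = lambda x: x ** 2
p = torch.distributions.Normal(0.0, 1.0)   # true E_{x~p}[x^2] = 1

for q_mean in (0.0, 2.0, 4.0):             # q drifts away from p
    q = torch.distributions.Normal(q_mean, 1.0)
    x = q.sample((10_000,))
    est = f(x) * (p.log_prob(x) - q.log_prob(x)).exp()
    print(q_mean, est.mean().item(), est.var().item())
# As q moves away from p, a few samples carry enormous weights: the
# variance explodes and the 10k-sample estimate becomes unreliable.
```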
Gradient for update
Using data sampled from $\pi_{\theta'}$ (and assuming the ratio of state distributions $p_\theta(s_t)/p_{\theta'}(s_t)$ is close to 1, so it can be dropped):
$\nabla \bar{R}_\theta = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}}\!\left[\frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t)\, \nabla \log p_\theta(a_t \mid s_t)\right]$
which is the gradient of the surrogate objective
$J^{\theta'}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}}\!\left[\frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t)\right]$
When to stop? If $\theta$ moves too far from $\theta'$, the importance weights become unreliable (the variance issue above), so the update must be constrained.
Add Constraint
Advance steadily, consolidating at every step: keep the new policy close to the old one.
PPO: $J_{\text{PPO}}^{\theta'}(\theta) = J^{\theta'}(\theta) - \beta\, \mathrm{KL}(\theta, \theta')$
TRPO: maximize $J^{\theta'}(\theta)$ subject to the hard constraint $\mathrm{KL}(\theta, \theta') < \delta$, which is harder to optimize; PPO puts the constraint into the objective instead.
$\mathrm{KL}(\theta, \theta')$ measures the distance between the behaviors of the two policies, i.e., between their action distributions, not between their parameters.
PPO algorithm
Initialize the policy parameters $\theta^0$.
In each iteration: use $\theta^k$ to interact with the environment to collect $\{(s_t, a_t)\}$ and compute the advantages $A^{\theta^k}(s_t, a_t)$; then find $\theta$ optimizing $J_{\text{PPO}}(\theta) = J^{\theta^k}(\theta) - \beta\, \mathrm{KL}(\theta, \theta^k)$. Thanks to the importance weights, the parameters can be updated several times on the same batch.
Adaptive KL Penalty: if $\mathrm{KL}(\theta, \theta^k) > \mathrm{KL}_{\max}$, increase $\beta$; if $\mathrm{KL}(\theta, \theta^k) < \mathrm{KL}_{\min}$, decrease $\beta$ (a sketch follows below).
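A sketch of the adaptive rule (the thresholds and the factor of 2 are assumptions; the slide does not fix them):

```python
def adapt_beta(beta, measured_kl, kl_min=0.003, kl_max=0.03):
    """Adjust the KL penalty weight between PPO iterations."""
    if measured_kl > kl_max:    # policies drifted too far apart
        beta *= 2.0             # strengthen the penalty
    elif measured_kl < kl_min:  # penalty too strong, updates too timid
        beta /= 2.0             # weaken the penalty
    return beta

beta = adapt_beta(1.0, measured_kl=0.05)  # KL too large -> beta becomes 2.0
```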
PPO2 algorithm
PPO2 replaces the KL penalty with clipping:
$J_{\text{PPO2}}^{\theta^k}(\theta) \approx \sum_{(s_t, a_t)} \min\!\left(\frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)}\, A^{\theta^k}(s_t, a_t),\ \mathrm{clip}\!\left(\frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)},\, 1-\varepsilon,\, 1+\varepsilon\right) A^{\theta^k}(s_t, a_t)\right)$
Clipping keeps the probability ratio within $[1-\varepsilon, 1+\varepsilon]$, so the objective gives no benefit for pushing $p_\theta$ far beyond $p_{\theta^k}$; this keeps the two policies close without computing a KL term.
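A sketch of the clipped objective as a PyTorch loss (negated because optimizers minimize; $\varepsilon = 0.2$ follows the paper's default):

```python
import torch

def ppo2_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
    ratio = (log_prob_new - log_prob_old).exp()      # p_theta / p_theta_k
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # min() removes any incentive to push the ratio outside [1-eps, 1+eps]
    return -torch.min(ratio * advantage, clipped * advantage).mean()

# With a positive advantage, raising the new log-prob past 1+eps stops helping:
adv, old = torch.tensor([1.0]), torch.tensor([0.0])
for new in (0.1, 0.2, 0.5):
    print(ppo2_loss(torch.tensor([new]), old, adv).item())
# ~ -1.105, -1.200, -1.200  (the loss plateaus once the ratio is clipped)
```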
Schulman et al., Proximal Policy Optimization Algorithms: https://arxiv.org/abs/1707.06347
Experimental Results