Now that you've studied the basics of Reinforcement Learning, you're ready to train your first agent and share it with the community through the Hub 🔥: a Lunar Lander agent that will learn to land correctly on the Moon 🌕 🤖
Unit 1: Train your first Deep Reinforcement Learning Agent
In this notebook, you'll train your first Deep Reinforcement Learning agent: a Lunar Lander agent that will learn to land correctly on the Moon 🌕. Using Stable-Baselines3, a Deep Reinforcement Learning library, you'll share it with the community and experiment with different configurations.
The environment 🎮
LunarLander-v2
We're constantly trying to improve our tutorials, so if you find any issues in this notebook, please open an issue on the GitHub repo.
🚀 Let's train our first Deep Reinforcement Learning agent and upload it to the Hub
During the notebook, we'll need to generate a replay video. To do so, in Colab, we need a virtual screen to be able to render the environment (and thus record the frames). Hence the following cell will install the virtual screen libraries and create and run a virtual screen 🖥.
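The install cell itself isn't reproduced above; a minimal sketch of what it typically contains on Colab (these exact package names are an assumption) is:

# Install a virtual framebuffer and the Python wrapper used below (assumed package set)
!apt install python3-opengl ffmpeg xvfb
!pip install pyvirtualdisplay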
To make sure the newly installed libraries are used, it's sometimes required to restart the notebook runtime. The next cell will force the runtime to crash, so you'll need to connect again and run the code starting from here. Thanks to this trick, we will be able to run our virtual screen.
import os
os.kill(os.getpid(), 9)  # forces the Colab runtime to restart (reconnect, then continue from here)

# Virtual display (run this in a new cell once the runtime has restarted)
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))  # size values are illustrative
virtual_display.start()
The Hugging Face Hub 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations, and other features that will allow you to easily collaborate with others.
You can see all the Deep Reinforcement Learning models available on the Hub here 👉
https://huggingface.co/models?pipeline_tag=reinforcement-learning&sort=downloads
import gymnasium
Gymnasium is the new version of the Gym library, maintained by the Farama Foundation.
At each step:
Our Agent receives a state (S0) from the Environment — we receive the first frame of our
game (Environment).
Based on that state (S0), the Agent takes an action (A0) — our Agent will move to the
right.
The environment transitions to a new state (S1) — new frame.
The environment gives some reward (R1) to the Agent — we’re not dead (Positive
Reward +1).
With Gymnasium:
1️⃣ We create our environment using gymnasium.make()
2️⃣ We reset the environment to its initial state with observation = env.reset()
At each step:
3️⃣ Get an action using our model (in our example we take a random action)
4️⃣ Using env.step(action), we perform this action in the environment and get:
observation: the new state (st+1)
reward: the reward we get after performing the action
terminated: indicates if the episode is over (the agent reached a terminal state)
truncated: indicates if a time limit was reached or the agent went out of bounds
info: a dictionary providing additional information (depends on the environment)
import gymnasium as gym

env = gym.make("LunarLander-v2")
observation, info = env.reset()
for _ in range(20):
    # Take a random action
    action = env.action_space.sample()
    print("Action taken:", action)
    # Perform this action in the environment
    observation, reward, terminated, truncated, info = env.step(action)
env.close()
💡 A good habit when you start to use an environment is to check its documentation
👉 https://gymnasium.farama.org/environments/box2d/lunar_lander/
Let’s see what the Environment looks like:
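A minimal sketch of how to inspect the spaces (assuming the env created above; the print labels are illustrative):

# Inspect the observation and action spaces of LunarLander-v2
print("Observation Space Shape", env.observation_space.shape)
print("Sample observation", env.observation_space.sample())  # a random observation
print("Action Space Shape", env.action_space.n)
print("Action Space Sample", env.action_space.sample())  # a random action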
We see with Observation Space Shape (8,) that the observation is a vector of size 8, where each value contains different information about the lander:
Horizontal pad coordinate (x)
Vertical pad coordinate (y)
Horizontal speed (x)
Vertical speed (y)
Angle
Angular speed
If the left leg contact point has touched the land
If the right leg contact point has touched the land
The action space (the set of possible actions the agent can take) is discrete with 4 actions
available 🎮 :
Action 0: Do nothing,
Action 1: Fire left orientation engine,
Action 2: Fire the main engine,
Action 3: Fire right orientation engine.
Reward function (the function that gives a reward at each timestep) 💰:
After every step a reward is granted. The total reward of an episode is the sum of the rewards for all the steps within that episode.
The episode receives an additional reward of -100 or +100 points for crashing or landing safely, respectively.
Vectorized Environment
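A vectorized environment stacks multiple independent copies of the environment into a single one, so the agent collects more diverse experience during training. A minimal sketch using Stable-Baselines3's make_vec_env helper (the choice of 16 environments is an assumption):

from stable_baselines3.common.env_util import make_vec_env

# Stack 16 independent copies of LunarLander-v2 into one vectorized environment
env = make_vec_env("LunarLander-v2", n_envs=16)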
To solve this problem, we’re going to use SB3 PPO. PPO (aka Proximal Policy Optimization) is
one of the SOTA (state of the art) Deep Reinforcement Learning algorithms that you’ll study
during this course.
1️⃣ You create your environment (in our case it was done above)
2️⃣ You define the model you want to use and instantiate it: model = PPO("MlpPolicy")
3️⃣ You train the agent with model.learn and define the number of training timesteps (see the sketch below).
# Create environment
env = gym.make('LunarLander-v2')
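A minimal sketch of steps 2️⃣ and 3️⃣, assuming the env created above and default PPO hyperparameters (the original notebook may use tuned values); the model name is an illustrative choice:

from stable_baselines3 import PPO

# Instantiate the PPO agent with an MLP policy on our environment
model = PPO("MlpPolicy", env, verbose=1)

# Train the agent (1,000,000 timesteps is the order of magnitude mentioned later in the text)
model.learn(total_timesteps=1_000_000)

# Save the trained model (the name is an example)
model_name = "ppo-LunarLander-v2"
model.save(model_name)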
💡 When you evaluate your agent, you should not use your training environment but create
an evaluation environment.
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.evaluation import evaluate_policy

# Create a fresh evaluation environment wrapped in a Monitor
eval_env = Monitor(gym.make("LunarLander-v2"))
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")
In my case, I got a mean reward of 200.20 +/- 20.80 after training for 1 million steps, which means that our lunar lander agent is ready to land on the Moon 🌛🥳.
Publish our trained model on the Hub 🔥
Now that we've seen that we got good results after training, we can publish our trained model on the Hub 🤗 with one line of code.
📚 The library documentation 👉 https://github.com/huggingface/huggingface_sb3/tree/main#hugging-face--x-stable-baselines3-v20
By using package_to_hub, you evaluate the agent, record a replay, generate a model card, and push everything to the Hub.
This way, you can showcase your work, visualize your agent playing, and share an agent that the community can reuse.
To be able to share your model with the community, there are three more steps to follow:
1️⃣ (If it’s not already done) create an account on Hugging Face
➡ https://huggingface.co/join
2️⃣ Sign in, then store your authentication token from the Hugging Face website.
from huggingface_hub import notebook_login

notebook_login()
!git config --global credential.helper store
If you don’t want to use a Google Colab or a Jupyter Notebook, you need to use this command
instead: huggingface-cli login
3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using the package_to_hub() function.
from huggingface_sb3 import package_to_hub

# Define the variables passed below (repo_id must point to your own Hub repository)
env_id = "LunarLander-v2"
model_architecture = "PPO"
repo_id = "ThomasSimonini/ppo-LunarLander-v2"  # Change with your {username}/{repo_name}
commit_message = "Upload PPO LunarLander-v2 trained agent"  # example commit message

# package_to_hub saves the model, evaluates it, generates a model card and records a replay
# video of your agent before pushing the repo to the Hub
package_to_hub(
    model=model,  # Our trained model
    model_name=model_name,  # The name of our trained model
    model_architecture=model_architecture,  # The model architecture we used: in our case PPO
    env_id=env_id,  # Name of the environment
    eval_env=eval_env,  # Evaluation environment
    repo_id=repo_id,  # id of the model repository on the Hugging Face Hub ({organization}/{repo_name})
    commit_message=commit_message,
)
Congrats 🥳 you’ve just trained and uploaded your first Deep Reinforcement Learning agent.
The script above should have displayed a link to a model repository such as https://huggingface.co/osanseviero/test_sb3. When you go to this link, you can see a video preview of your agent playing and read its auto-generated model card.
Under the hood, the Hub uses git-based repositories (don’t worry if you don’t know what git
is), which means you can update the model with new versions as you experiment and improve
your agent.
Compare the results of your LunarLander-v2 with your classmates using the leaderboard 🏆
👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
To load a saved model from the Hub, you need two things:
The repo_id
The filename: the saved model inside the repo and its extension (*.zip)
Because the model I downloaded from the Hub was trained with Gym (the former version of Gymnasium), we need to install shimmy, an API conversion tool that will help us run the environment correctly.
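A minimal sketch of downloading a checkpoint from the Hub with the huggingface_sb3 helper; the repo_id and filename below are placeholders for the repository you actually want to load:

from huggingface_sb3 import load_from_hub

repo_id = "ThomasSimonini/ppo-LunarLander-v2"  # placeholder: the repo you want to load
filename = "ppo-LunarLander-v2.zip"  # the saved model file inside the repo

checkpoint = load_from_hub(repo_id, filename)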
# When the model was trained on Python 3.8, the pickle protocol is 5
# But Python 3.6 and 3.7 use protocol 4
# In order to get compatibility we need to:
# 1. Install pickle5 (we did it at the beginning of the colab)
# 2. Create a custom empty object that we pass as a parameter to PPO.load()
custom_objects = {
"learning_rate": 0.0,
"lr_schedule": lambda _: 0.0,
"clip_range": lambda _: 0.0,
}
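With these custom objects, loading the downloaded checkpoint might look like this (a sketch, assuming the checkpoint variable from the download step above):

from stable_baselines3 import PPO

# Load the checkpoint into a PPO model, overriding the incompatible schedule objects
model = PPO.load(checkpoint, custom_objects=custom_objects, print_system_info=True)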
eval_env = Monitor(gym.make("LunarLander-v2"))
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")
In the Leaderboard you will find your agents. Can you get to the top?
If you still feel confused by all these elements... it's totally normal! This was the same for me and for everyone who has studied RL.
Take time to really grasp the material before continuing, and try the additional challenges. It's important to master these elements and have a solid foundation.
Naturally, during the course, we’re going to dive deeper into these concepts but it’s better to
have a good understanding of them now before diving into the next chapters.
Next time, in the bonus unit 1, you’ll train Huggy the Dog to fetch the stick.
Quiz
1: What is Reinforcement Learning?
Reinforcement learning is a framework for solving control tasks (also called decision
problems) by building agents that learn from the environment by interacting with it through
trial and error and receiving rewards (positive or negative) as unique feedback.
2: Define the RL loop
At every step, our Agent receives a state from the Environment, takes an action based on that state, and then receives a reward and the next state from the Environment.
3: What's the difference between a state and an observation?
The state is a complete description of the state of the world (there is no hidden
information)
The observation is a partial description of the state
We receive a state when we play with chess environment
We receive an observation when we play with Super Mario Bros
4: What are the two types of tasks?
Episodic: the task has a starting point and an ending point (a terminal state).
Continuing: the task continues forever (there is no terminal state).
5: What is the exploration/exploitation tradeoff?
In Reinforcement Learning, we need to balance how much we explore the environment and how much we exploit what we know about it.
Exploration is exploring the environment by trying random actions in order to find more
information about the environment.
Exploitation is exploiting known information to maximize the reward.
6: What is a policy?
The Policy π is the brain of our Agent. It’s the function that tells us what action to take
given the state we are in. So it defines the agent’s behavior at a given time.
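In standard notation (added here for clarity, not part of the original answer), a deterministic policy maps each state directly to one action, while a stochastic policy outputs a probability distribution over actions:

a = π(s)        (deterministic policy)
a ∼ π(a | s)    (stochastic policy)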
7: What are value-based methods?