nuPlan: A Closed-Loop ML-Based Planning Benchmark for Autonomous Vehicles
Holger Caesar Juraj Kabzan Kok Seang Tan Whye Kit Fong Eric Wolff
Alex Lang Luke Fletcher Oscar Beijbom Sammy Omari
Motional
Abstract
Dataset     Data    Cities  Sensor Data  Type   Evaluation
Argoverse   320h    2       -            Pred.  OL
nuPredict   5h      2       X            Pred.  OL
Lyft        1118h   1       -            Pred.  OL
Waymo       570h    6       -            Pred.  OL
nuPlan      1500h   4       X            Plan.  OL+CL

Table 1. A comparison of leading datasets for motion prediction (Pred.) and planning (Plan.). We show the dataset size, number of cities, availability of sensor data (X), dataset type, and whether it uses open-loop (OL) or closed-loop (CL) evaluation. nuPredict refers to the prediction challenge of the nuScenes [4] dataset.

2. Related Work

We review the relevant literature for prediction and planning datasets, simulation, and ML-based planning.
Prediction datasets. Table 1 shows a comparison between our dataset and relevant prediction datasets. Argoverse Motion Forecasting [6] was the first large-scale prediction dataset. With 320h of driving data, it was unprecedented in size and provides simple semantic maps with centerlines and driveable area annotations. However, the autolabeled trajectories in the dataset are of lower quality, due to the state of the object detection field at the time and the insufficient amount of human-labeled training data (113 scenes). The nuScenes prediction challenge [4] consists of 850 human-labeled scenes from the nuScenes dataset. While the annotations are of high quality and sensor data is provided, the small scale limits the number of driving variations. The Lyft Level 5 Prediction Dataset [11] contains 1118h of data from a single route of 6.8 miles. It features detailed semantic maps, aerial maps, and dynamic traffic light status. While the scale is unprecedented, the autolabeled tracks are often noisy and geographic diversity is limited. The Waymo Open Motion Dataset [8] focuses specifically on the interactions between agents, but does so using open-loop evaluation. While, at 570h, it is smaller than existing datasets, the autolabeled tracks are of high quality [17]. They provide semantic maps and dynamic traffic light status.
These datasets focus on prediction rather than planning. In this work we aim to overcome this limitation by using planning metrics and closed-loop evaluation. Ours is also the first large-scale planning dataset to provide sensor data.

Planning datasets. CommonRoad [1] provides a first-of-its-kind planning benchmark that is composed of different vehicle models, cost functions and scenarios (including goals and constraints). There are both pre-recorded and interactive scenarios. With 5700 scenarios in total, the scale of the dataset does not support training modern deep-learning-based methods. All scenarios lack sensor data.

Simulation. Simulators have enabled breakthroughs in planning and reinforcement learning with their ability to simulate physics, agents, and environmental conditions in a closed-loop environment. AirSim [19] is a high-fidelity simulator for AVs, such as drones and cars. It includes a physics engine that can operate at a high frequency for real-time hardware-in-the-loop simulation. CARLA [7] supports the training and validation of autonomous urban driving systems. It allows for flexible specification of sensor suites and environmental conditions. In the CARLA Autonomous Driving Challenge (see carlachallenge.org) the goal is to navigate a set of waypoints using different combinations of sensor data and HD maps. Alternatively, users can use scene abstraction to omit the perception task and focus on planning and control aspects. This challenge is conceptually similar to what we propose, but does not use real-world data and provides less detailed planning metrics.

Sim-to-real transfer is an active research area for diverse tasks such as localization, perception, prediction, planning and control. [21] show that the domain gap between simulated and real-world data remains an issue by transferring a synthetically trained tracking model to the KITTI [9] dataset. To overcome the domain gap, they jointly train their model using real-world data for visible objects and simulation data for occluded objects. [3] learn how to drive by transferring a vision-based lane-following driving policy from simulation to the real world without any real-world labels. [14] use reinforcement learning in simulation to obtain a driving system that controls a full-size real-world vehicle. They use mostly synthetic data, with labelled real-world data appearing only in the training of the segmentation network. However, all simulations have fundamental limits, since they introduce systematic biases. More work is required to plausibly emulate real-world sensors, e.g. to generate photo-realistic camera images.

ML-based planning. ML-based planning for AVs using real-world data is a newly emerging research field. However, the field has yet to converge on a common input/output space, dataset, or metrics. A jointly learnable behavior and trajectory planner is proposed in [18]: an interpretable cost function is learned on top of models for perception, prediction and vehicle dynamics, and evaluated in open-loop on two unpublished datasets. An end-to-end interpretable neural motion planner [24] takes raw lidar point clouds and dynamic map data as inputs and predicts a cost map for planning. They evaluate in open-loop on an unpublished dataset, with a planning horizon of only 3s. ChauffeurNet [2] finds that standard behavior cloning is insufficient for handling complex driving scenarios, even when using as many as 30 million examples. They propose exposing the learner to synthesized data in the form of perturbations to the expert's driving, and augment the imitation loss with additional losses that penalize undesirable events and encourage progress.
Their unpublished dataset contains 26 million examples, which correspond to 60 days of continuous driving. The method is evaluated in closed-loop and open-loop setups, as well as in the real world. They also show that open-loop evaluation can be misleading compared to closed-loop. MP3 [5] proposes an end-to-end approach to mapless driving, where the input is raw lidar data and a high-level navigation goal. They evaluate on an unpublished dataset in open and closed-loop. Multi-modal methods have also been explored in recent works [16, 20, 13]. These approaches explore different strategies for fusing the representations of various modalities in order to predict future waypoints or control commands. Neural planners were also used in [15, 10] to evaluate an object detector using the KL divergence of the planned trajectory and the observed route.

Existing works evaluate on different metrics, which are inconsistent across the literature. TransFuser [16] evaluates its method on the number of infractions, the percentage of the route distance completed, and the route completion weighted by an infraction multiplier. Infractions include collisions with other agents and running red lights. [20] evaluates its planner using off-road time, off-lane time and the number of crashes, while [13, 22] report the success rate of reaching a given destination within a fixed time window. [13] also introduces a metric that measures the average percentage of the distance travelled to the goal.

While ML-based planning has been studied in great detail, the lack of published datasets and of a standard set of metrics that provide a common framework for closed-loop evaluation has limited progress in this area. We aim to fill this gap by providing an ML-based planning dataset and metrics.
3. Dataset

Overview. We plan to release 1500 hours of data from Las Vegas, Boston, Pittsburgh, and Singapore. Each city provides its unique driving challenges. For example, Las Vegas includes bustling casino pick-up and drop-off points (PUDOs) with complex interactions and busy intersections with up to 8 parallel driving lanes per direction, Boston routes include drivers who love to double park, Pittsburgh has its own custom precedence pattern for left turns at intersections, and Singapore features left-hand traffic. For each city we provide semantic maps and an API for efficient map queries. The dataset includes lidar point clouds, camera images, localization information and steering inputs. While we release autolabeled agent trajectories for the entire dataset, we make only a subset of the sensor data available due to the vast scale of the dataset (200+ TB).
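The map API itself is not described further in this paper. As a rough sketch of the kind of efficient query such an API could expose (all class and method names below are hypothetical, not the released interface), lane centerlines can be spatially indexed for fast lookups around the ego pose:

```python
# Hypothetical sketch of an efficient semantic-map query, NOT the released
# nuPlan API: lane centerline points are indexed in a KD-tree so that
# "which lanes are near the ego vehicle?" is a cheap nearest-neighbor query.
from dataclasses import dataclass

import numpy as np
from scipy.spatial import cKDTree


@dataclass
class Lane:
    lane_id: str
    centerline: np.ndarray  # (N, 2) array of x, y points in meters


class SemanticMap:
    def __init__(self, lanes: list[Lane]):
        self._lanes = lanes
        # Flatten all centerline points and remember which lane each belongs to.
        points = np.concatenate([l.centerline for l in lanes])
        self._lane_of_point = np.concatenate(
            [np.full(len(l.centerline), i) for i, l in enumerate(lanes)]
        )
        self._tree = cKDTree(points)

    def lanes_within(self, xy: tuple[float, float], radius_m: float) -> list[Lane]:
        """Return all lanes with a centerline point within radius_m of xy."""
        point_idx = self._tree.query_ball_point(xy, r=radius_m)
        return [self._lanes[i] for i in sorted({self._lane_of_point[i] for i in point_idx})]


# Usage: find candidate lanes near the ego vehicle.
lane = Lane("lane_0", np.array([[0.0, 0.0], [5.0, 0.0], [10.0, 0.0]]))
nearby = SemanticMap([lane]).lanes_within((4.0, 1.0), radius_m=2.0)
```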
Autolabeling. We use an offline perception system to label the large-scale dataset at high accuracy, without the real-time constraints imposed on the online perception system of an AV. We use PointPillars [12] with CenterPoint [23], a modified version of multi-view fusion (MVF++) [17], and non-causal tracking to achieve near-human labeling performance.
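As a minimal illustration of what "non-causal" buys the autolabeler (this is not the released pipeline; the helper below and its inputs are assumptions), an offline tracker can smooth each track using future as well as past frames, which an online tracker cannot do:

```python
# Minimal sketch of non-causal (offline) track smoothing, assuming per-frame
# detections for the whole log have already been associated into tracks by
# the lidar detectors named in the text (stand-ins, not the real pipeline).
import numpy as np


def smooth_track(centers: np.ndarray, window: int = 5) -> np.ndarray:
    """Non-causal moving average over a track's (T, 2) center positions.

    Offline we may average symmetrically over past AND future detections,
    reducing jitter relative to a purely causal (online) filter.
    """
    kernel = np.ones(window) / window
    return np.stack(
        [np.convolve(centers[:, d], kernel, mode="same") for d in range(centers.shape[1])],
        axis=1,
    )


noisy_track = np.cumsum(np.random.randn(100, 2) * 0.1, axis=0)  # toy random walk
smoothed = smooth_track(noisy_track)
```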
Scenarios. To enable scenario-based metrics, we automatically annotate intervals with tags for complex scenarios. These scenarios include merges, lane changes, protected or unprotected left or right turns, interactions with cyclists, interactions with pedestrians at crosswalks or elsewhere, interactions with close proximity or high acceleration, double-parked vehicles, stop-controlled intersections, and driving in construction zones.
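The tagging rules themselves are not specified in this paper. A hedged sketch of how one such tag could be produced from per-frame ego lane assignments (the rule, padding, and inputs below are hypothetical):

```python
# Sketch of rule-based scenario tagging: mark intervals where the ego
# vehicle's lane assignment changes as "lane_change". Per-frame lane IDs
# are assumed to come from the autolabeled trajectories plus map matching;
# the actual nuPlan tagging rules are not given in the paper.
def tag_lane_changes(lane_ids: list[str], pad: int = 10) -> list[tuple[int, int, str]]:
    """Return (start_frame, end_frame, tag) intervals around lane switches."""
    intervals = []
    for t in range(1, len(lane_ids)):
        if lane_ids[t] != lane_ids[t - 1]:
            start = max(0, t - pad)
            end = min(len(lane_ids) - 1, t + pad)
            intervals.append((start, end, "lane_change"))
    return intervals


# Frames 0-2 in lane_a, frames 3-5 in lane_b -> one tagged interval.
print(tag_lane_changes(["lane_a"] * 3 + ["lane_b"] * 3, pad=2))
```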
4. Benchmarks

To further the state of the art in ML-based planning, we organize benchmark challenges with the tasks and metrics described below.

4.1. Overview

To evaluate a proposed method against the benchmark dataset, users submit ML-based planning code to our evaluation server. The code must follow a provided template. Contrary to most benchmarks, the code is containerized for portability, in order to enable closed-loop evaluation on a secret test set. The planner operates either on the autolabeled trajectories or, for end-to-end open-loop approaches, directly on the raw sensor data. When queried for a particular timestep, the planner returns the planned position and heading of the ego vehicle. A provided controller will then drive a vehicle while closely tracking the planned trajectory. We use a predefined motion model to simulate the ego vehicle motion in order to approximate a real system. The final driven trajectory is then scored against the metrics defined in Sec. 4.2.
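The official submission template is not reproduced in this paper. A minimal sketch of what such a planner interface could look like (class, method, and field names here are assumptions, not the released template) illustrates the query-per-timestep contract:

```python
# Hypothetical planner template, NOT the official nuPlan interface: the
# evaluation server queries the planner once per timestep and receives a
# planned trajectory of poses, which a provided controller then tracks.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class EgoPose:
    x: float        # meters, map frame
    y: float        # meters, map frame
    heading: float  # radians


class AbstractPlanner(ABC):
    @abstractmethod
    def plan(self, observation: dict) -> list[EgoPose]:
        """Return the planned trajectory (one pose per future timestep).

        `observation` would hold autolabeled agent tracks and map queries,
        or raw sensor data for end-to-end submissions.
        """


class KeepPosePlanner(AbstractPlanner):
    """Trivial baseline: stand still at the current pose."""

    def plan(self, observation: dict) -> list[EgoPose]:
        pose = observation["ego_pose"]
        return [pose] * 10  # 10-step horizon of the same pose


traj = KeepPosePlanner().plan({"ego_pose": EgoPose(0.0, 0.0, 0.0)})
```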
4.2. Tasks

We present the three tasks for our dataset in order of increasing difficulty.

Open-loop. In the first challenge, we task the planning system with mimicking a human driver. For every timestep, the planned trajectory is scored based on predefined metrics, but it is not used to control the vehicle. In this case, no interactions are considered.

Closed-loop. In the closed-loop setup the planner outputs a planned trajectory using the information available at each timestep, similar to the previous case. However, the proposed trajectory is used as a reference for a controller, and thus the planning system is gradually corrected at each timestep with the new state of the vehicle. While the new state of the vehicle may not coincide with the recorded state, leading to different camera views or lidar point clouds, we will not perform any sensor data warping or novel view synthesis. In this setting, we distinguish between two tasks. In the non-reactive closed-loop task we do not make any assumptions about other agents' behavior and simply use the observed agent trajectories. As shown in [11], the vast majority of interventions in closed-loop simulation are due to this non-reactive behavior, e.g. vehicles naively colliding with the ego vehicle. In the reactive closed-loop task we provide a planning model for all other agents, which are tracked like the ego vehicle.
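The paper does not name the predefined motion model. A kinematic bicycle model is a common choice for this role; the sketch below, with assumed parameters, shows how the simulated state (not the planned one) is fed back to the planner at every step:

```python
# Sketch of one closed-loop step with a kinematic bicycle model standing in
# for the "predefined motion model". The model choice, timestep, and
# wheelbase are illustrative assumptions; the paper does not specify them.
import math
from dataclasses import dataclass


@dataclass
class State:
    x: float
    y: float
    heading: float
    speed: float


def bicycle_step(s: State, accel: float, steer: float,
                 dt: float = 0.1, wheelbase: float = 2.7) -> State:
    """Propagate the ego state one timestep under a kinematic bicycle model."""
    return State(
        x=s.x + s.speed * math.cos(s.heading) * dt,
        y=s.y + s.speed * math.sin(s.heading) * dt,
        heading=s.heading + s.speed / wheelbase * math.tan(steer) * dt,
        speed=s.speed + accel * dt,
    )


# Closed-loop skeleton: the planner always sees the SIMULATED state,
# which gradually diverges from the recorded log.
state = State(0.0, 0.0, 0.0, 5.0)
for _ in range(50):
    accel, steer = 0.0, 0.01  # stand-in for controller output toward the plan
    state = bicycle_step(state, accel, steer)
```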
4.3. Metrics

We split the metrics into two categories: common metrics, which are computed for every scenario, and scenario-based metrics, which are tailored to predefined scenarios. A code sketch after the list below illustrates two of the common metrics.

Common metrics.

• Traffic rule violation is used to measure compliance with common traffic rules. We compute the rate of collisions with other agents, the rate of off-road trajectories, the time gap to lead agents, the time to collision, and the relative velocity while passing an agent as a function of the passing distance.

• Human driving similarity is used to quantify how closely a maneuver matches a human's, e.g. longitudinal velocity error, longitudinal stop position error and lateral position error. In addition, the resulting jerk/acceleration is compared to the human-level jerk/acceleration.

• Vehicle dynamics quantify rider comfort and the feasibility of a trajectory. Rider comfort is measured by jerk, acceleration, steering rate and vehicle oscillation. Feasibility is measured by violations of predefined limits on the same criteria.

• Goal achievement measures the route progress towards a goal waypoint on the map using the L2 distance.
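As a concrete illustration of two of these quantities, the sketch below computes jerk from a driven trajectory by finite differences and goal progress by L2 distance; the exact benchmark definitions and thresholds are not given in this paper, so both are assumptions:

```python
# Illustrative sketch of two common metrics from a driven trajectory sampled
# at a fixed dt: comfort via peak jerk (third derivative of position) and
# goal achievement via L2 distance to a goal waypoint. Not the benchmark's
# exact definitions.
import numpy as np


def max_jerk(positions: np.ndarray, dt: float) -> float:
    """Peak jerk magnitude from (T, 2) positions via finite differences."""
    vel = np.diff(positions, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    jerk = np.diff(acc, axis=0) / dt
    return float(np.linalg.norm(jerk, axis=1).max())


def goal_progress(positions: np.ndarray, goal: np.ndarray) -> float:
    """Fraction of the initial L2 distance to the goal that was closed."""
    d_start = np.linalg.norm(positions[0] - goal)
    d_end = np.linalg.norm(positions[-1] - goal)
    return float((d_start - d_end) / max(d_start, 1e-6))


# Straight constant-velocity trajectory: zero jerk, ~83% progress to (60, 0).
traj = np.stack([np.linspace(0.0, 50.0, 100), np.zeros(100)], axis=1)
print(max_jerk(traj, dt=0.1), goal_progress(traj, np.array([60.0, 0.0])))
```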
Scenario-based metrics. Based on the scenario tags from Sec. 3, we use additional metrics for challenging maneuvers. For lane changes, the time to collision and the time gap to the lead/rear agent on the target lane are measured and scored. For pedestrian/cyclist interactions, we quantify the passing relative velocity while differentiating by their location. Furthermore, we compare the agreement between the decisions made by a planner and a human at crosswalks and unprotected turns (right of way).

Community feedback. Note that the metrics shown here are an initial proposal and do not form an exhaustive list. We will work closely with the community to add novel scenarios and metrics and to achieve consensus across the community. Likewise, for the main challenge metric we see multiple options, such as a weighted sum of metrics, a weighted sum of metric violations above a predefined threshold, or a hierarchy of metrics. We invite the community to collaborate with us to define the metrics that will drive this field forward.

5. Conclusion

In this work we proposed the first ML-based planning benchmark for AVs. Contrary to existing forecasting benchmarks, we focus on goal-based planning, planning metrics and closed-loop evaluation. We hope that by providing a common benchmark, we will pave a path towards progress in ML-based planning, which is one of the final frontiers in autonomous driving.

References

[1] Matthias Althoff, Markus Koschi, and Stefanie Manzinger. CommonRoad: Composable benchmarks for motion planning on roads. In Proc. of the IEEE Intelligent Vehicles Symposium, 2017. 2
[2] Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst. In RSS, 2019. 2
[3] Alex Bewley, Jessica Rigley, Yuxuan Liu, Jeffrey Hawke, Richard Shen, Vinh-Dieu Lam, and Alex Kendall. Learning to drive from simulation without real world labels. In ICRA, 2019. 2
[4] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In CVPR, 2020. 1, 2
[5] Sergio Casas, Abbas Sadat, and Raquel Urtasun. MP3: A unified model to map, perceive, predict and plan. In CVPR, 2021. 3
[6] Ming-Fang Chang, John W Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, and James Hays. Argoverse: 3d tracking and forecasting with rich maps. In CVPR, 2019. 1, 2
[7] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. CoRR, 2017. 2
[8] Scott Ettinger, Shuyang Cheng, and Benjamin Caine et al. Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset. arXiv preprint arXiv:2104.10133, 2021. 1, 2
[9] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. IJRR, 32(11):1231–1237, 2013. 1, 2
[10] Yiluan Guo, Holger Caesar, Oscar Beijbom, Jonah Philion, and Sanja Fidler. The efficacy of neural planning metrics: A meta-analysis of PKL on nuScenes. In IROS Workshop on Benchmarking Progress in Autonomous Driving, 2020. 3
[11] John Houston, Guido Zuidhof, and Luca Bergamini et al. One thousand and one hours: Self-driving motion prediction dataset. arXiv preprint arXiv:2006.14480, 2020. 1, 2, 4
[12] Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. PointPillars: Fast encoders for object detection from point clouds. In CVPR, 2019. 3
[13] Hesham M. Eraqi, Mohamed N. Moustafa, and Jens Honer. Conditional imitation learning driving considering camera and lidar fusion. In NeurIPS, 2020. 3
[14] Blazej Osinski, Adam Jakubowski, Pawel Ziecina, Piotr Milos, Christopher Galias, Silviu Homoceanu, and Henryk Michalewski. Simulation-based reinforcement learning for real-world autonomous driving. In ICRA, 2020. 2
[15] Jonah Philion, Amlan Kar, and Sanja Fidler. Learning to evaluate perception models using planner-centric metrics. In CVPR, 2020. 3
[16] Aditya Prakash, Kashyap Chitta, and Andreas Geiger. Multi-modal fusion transformer for end-to-end autonomous driving. In CVPR, 2021. 3
[17] Charles R. Qi, Yin Zhou, Mahyar Najibi, Pei Sun, Khoa Vo, Boyang Deng, and Dragomir Anguelov. Offboard 3d object detection from point cloud sequences. arXiv preprint arXiv:2103.05073, 2021. 2, 3
[18] Abbas Sadat, Mengye Ren, Andrei Pokrovsky, Yen-Chen Lin, Ersin Yumer, and Raquel Urtasun. Jointly learnable behavior and trajectory planning for self-driving vehicles. In IROS, 2019. 2
[19] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. AirSim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, 2017. 2
[20] Ibrahim Sobh, Loay Amin, Sherif Abdelkarim, Khaled Elmadawy, Mahmoud Saeed, Omar Abdeltawab, Mostafa Gamal, and Ahmad El Sallab. End-to-end multi-modal sensors fusion system for urban automated driving. In NeurIPS, 2018. 3
[21] Pavel Tokmakov, Jie Li, Wolfram Burgard, and Adrien Gaidon. Learning to track with object permanence. arXiv preprint arXiv:2103.14258, 2021. 2
[22] Yi Xiao, Felipe Codevilla, Akhil Gurram, Onay Urfalioglu, and Antonio M. López. Multimodal end-to-end autonomous driving. arXiv preprint arXiv:1906.03199, 2019. 3
[23] Tianwei Yin, Xingyi Zhou, and Philipp Krähenbühl. Center-based 3d object detection and tracking. arXiv preprint arXiv:2006.11275, 2020. 3
[24] Wenyuan Zeng, Wenjie Luo, Simon Suo, Abbas Sadat, Bin Yang, Sergio Casas, and Raquel Urtasun. End-to-end interpretable neural motion planner. In CVPR, 2021. 2