
LLM4Drive: A Survey of Large Language Models for Autonomous Driving

Zhenjie Yang1∗, Xiaosong Jia1∗, Hongyang Li1,2, Junchi Yan1†

1 School of AI and Department of CSE, Shanghai Jiao Tong University
2 OpenDriveLab
{yangzhenjie, jiaxiaosong, hongyangli, yanjunchi}@sjtu.edu.cn
∗ Equal contributions, order decided by a coin toss. † Correspondence author.
arXiv:2311.01043v4 [cs.AI] 12 Aug 2024

Abstract

Autonomous driving technology, a catalyst for revolutionizing transportation and urban mobility, has a tendency to transition from rule-based systems to data-driven strategies. Traditional module-based systems are constrained by cumulative errors among cascaded modules and inflexible pre-set rules. In contrast, end-to-end autonomous driving systems have the potential to avoid error accumulation due to their fully data-driven training process, although they often lack transparency due to their "black box" nature, complicating the validation and traceability of decisions. Recently, large language models (LLMs) have demonstrated abilities including understanding context, logical reasoning, and generating answers. A natural thought is to utilize these abilities to empower autonomous driving. By combining LLMs with foundation vision models, it could open the door to open-world understanding, reasoning, and few-shot learning, which current autonomous driving systems are lacking. In this paper, we systematically review the research line about (Vision) Large Language Models for Autonomous Driving ((V)LLM4Drive). This study evaluates the current state of technological advancements, distinctly outlining the principal challenges and prospective directions for the field. For the convenience of researchers in academia and industry, we provide real-time updates on the latest advances in the field as well as relevant open-source resources via the designated link: https://github.com/Thinklab-SJTU/Awesome-LLM4AD.

1 Introduction

Autonomous driving is rapidly reshaping our understanding of transportation, heralding a new era of technological revolution. This transformation concerns not only the future of transportation but also a fundamental shift across various industries. In conventional autonomous driving systems, algorithms typically adopt a modular design [Liang et al., 2020; Luo et al., 2018; Sadat et al., 2020], with separate components responsible for critical tasks such as perception [Li et al., 2022c; Liu et al., 2023d], prediction [Shi et al., 2022; Jia et al., 2022a; Jia et al., 2023c; Jia et al., 2024a], and planning [Treiber et al., 2000; Dauner et al., 2023; Li et al., 2024b; Jia et al., 2024b]. Specifically, the perception component handles object detection [Li et al., 2022c; Liu et al., 2023d], tracking [Zeng et al., 2022], and sophisticated semantic segmentation tasks [Cheng et al., 2022]. The prediction component analyzes the external environment [Jia et al., 2021] and estimates the future states of the surrounding agents [Jia et al., 2022b]. The planning component, often reliant on rule-based decision algorithms [Treiber et al., 2000], determines the optimal and safest route to a predetermined destination. While the module-based approach provides reliability and enhanced security in a variety of scenarios, it also presents challenges. The decoupled design between system components may lead to key information loss during transitions and potentially redundant computation as well. Additionally, errors may accumulate within the system due to inconsistencies in optimization objectives among the modules, affecting the vehicle's overall decision-making performance [Chen et al., 2023a].

Rule-based decision systems, with their inherent limitations and scalability issues, are gradually giving way to data-driven methods. End-to-end autonomous driving solutions are increasingly becoming a consensus in the field [Wu et al., 2022b; Chitta et al., 2023; Chen and Krähenbühl, 2022; Jia et al., 2023d; Jia et al., 2023b; Hu et al., 2023b]. By eliminating integration errors between multiple modules and reducing redundant computations, the end-to-end system enhances the expression of visual [Wu et al., 2022a] and sensory information while ensuring greater efficiency. However, this approach also introduces the "black box" problem, meaning a lack of transparency in the decision-making process, complicating interpretation and validation.

Simultaneously, the explainability of autonomous driving has become an important research focus [Jin et al., 2023a]. Although smaller language models (like early versions of BERT [Devlin et al., 2018] and GPT [Brown et al., 2020]) employed in massive data collection from driving scenarios help address this issue, they often lack sufficient generalization capabilities to perform optimally. Recently, large language models [OpenAI, 2023; Touvron et al., 2023] have demonstrated remarkable abilities in understanding context, generating answers, and handling complex tasks. They are also now integrated with multimodal models [Brohan et al., 2023a; Liu et al., 2023a; Driess et al., 2023; Xu et al., 2023; Chen et al., 2023b]. This integration achieves a unified feature-space mapping for images, text, videos, point clouds, etc. Such consolidation significantly enhances the system's generalization capabilities and equips it with the capacity to quickly adapt to new scenarios in a zero-shot or few-shot manner.
Figure 1: The limitation of the current autonomous driving paradigm (green arrow) and where LLMs can potentially enhance autonomous driving ability (blue arrow).

In this context, developing an interpretable and efficient end-to-end autonomous driving system has become a research hotspot [Chen et al., 2023a]. Large language models, with their extensive knowledge base and exceptional generalization, could facilitate easier learning of complex driving behaviors. By leveraging the visual-language model (VLM)'s robust and comprehensive capabilities of open-world understanding and in-context learning [Bommasani et al., 2021; Brohan et al., 2023b; Liu et al., 2023a; Driess et al., 2023], it becomes possible to address the long-tail problem for perception networks, assist in decision-making, and provide intuitive explanations for these decisions.

This paper aims to provide a comprehensive overview of this rapidly emerging research field, analyze its basic principles, methods, and implementation processes, and introduce in detail the application of LLMs for autonomous driving. Finally, we discuss related challenges and future research directions.

2 Motivation of LLM4AD

In today's technological landscape, large language models such as GPT-4 and GPT-4V [OpenAI, 2023; Yang et al., 2023c] are drawing attention with their superior contextual understanding and in-context learning capabilities. Their enriched common-sense knowledge has facilitated significant advancements in many downstream tasks. We ask the question: how do these large models assist in the domain of autonomous driving, especially in playing a critical role in the decision-making process?

In Fig. 1, we give an intuitive demonstration of the limitations of the current autonomous driving paradigm and where LLMs can potentially enhance autonomous driving ability. We summarize two primary aspects of driving skills. The orange circle represents the ideal level of driving competence, akin to that possessed by an experienced human driver. There are two main methods to acquire such proficiency: one, through learning-based techniques within simulated environments; and two, by learning from offline data through similar methodologies. It is important to note that, due to discrepancies between simulations and the real world, these two domains are not fully the same, i.e., the sim2real gap [Höfer et al., 2021]. Concurrently, offline data serves as a subset of real-world data since it is collected directly from actual surroundings. However, it is difficult to fully cover the distribution as well, due to the notorious long-tailed nature [Jain et al., 2021] of autonomous driving tasks.

The final goal of autonomous driving is to elevate driving abilities from a basic green stage to a more advanced blue level through extensive data collection and deep learning. However, the high cost associated with data gathering and annotation, along with the inherent differences between simulated and real-world environments, means there is still a gap before reaching the expert level of driving skills. In this scenario, if we can effectively utilize the innate common sense embedded within large language models, we might gradually narrow this gap. Intuitively, by adopting this approach, we could progressively enhance the capabilities of autonomous driving systems, bringing them closer to, or potentially reaching, the ideal expert level of driving proficiency. Through such technological integration and innovation, we anticipate significant improvements in the overall performance and safety of autonomous driving.

The application of large language models in the field of autonomous driving covers a wide range of task types, combining depth and breadth with revolutionary potential. The role of LLMs in autonomous driving pipelines is shown in Fig. 2.

3 Application of LLM4AD

In the following sections, we divide existing works based on the perspective of applying LLMs: planning, perception, question answering, and generation. The corresponding taxonomy tree is shown in Fig. 3.
Figure 2: LLMs in Autonomous Driving Pipelines. (Sensor inputs such as LiDAR, camera, GPS, and vehicle dynamics are tokenized together with prompts and processed by visual networks, LLMs, and multi-modal models such as the GPT series, Llama series, RT-2, LLaVA, and PaLM-E, which feed downstream tasks: planning & control, perception, question answering, and generation.)

3.1 Planning & Control

Large language models (LLMs) have achieved great success with their open-world cognitive and reasoning capabilities [Radford et al., 2018; Radford et al., 2019; Brown et al., 2020; Ouyang et al., 2022; OpenAI, 2023]. These capabilities could provide a transparent explanation of the autonomous driving decision-making process, significantly enhancing system reliability and user trust in the technology [Deruyttere et al., 2019; Kim et al., 2019a; Atakishiyev et al., 2023; Jin et al., 2023a; Malla et al., 2023]. Within this domain, based on whether the LLM is tuned, related research can be categorized into two main types: fine-tuning pre-trained models and prompt engineering.

Fine-tuning pre-trained models

In the application of fine-tuning pre-trained models, MTD-GPT [Liu et al., 2023b] translates multi-task decision-making problems into sequence modeling problems. Through training on a mixed multi-task dataset, it addresses various decision-making tasks at unsignalized intersections. Although this approach outperforms single-task decision-making RL models, the scenes used are limited to unsignalized intersections, which might not be enough to demonstrate the complexity of real-world applications. Driving with LLMs [Chen et al., 2023b] designs an architecture that fuses vectorized inputs into LLMs with a two-stage pretraining and fine-tuning method. Due to the limitation of vectorized representations, their method is only tested in simulation. DriveGPT4 [Xu et al., 2023] presents a multi-modal LLM based on Valley [Luo et al., 2023a] and develops a visual instruction tuning dataset for interpretable autonomous driving. Besides predicting a vehicle's basic control signals, it also responds in real time, explaining why the action is taken. It outperforms baseline models on a variety of QA tasks, while its planning experiments are simple. GPT-Driver [Mao et al., 2023a] transforms the motion planning task into a language modeling problem. It exceeds UniAD [Hu et al., 2023b] in the L2 metric. Nevertheless, since it uses past speed and acceleration information, there is concern about an unfair comparison with UniAD. Additionally, L2 only reflects the fitting degree of the driving route and might not reflect driving performance [Dauner et al., 2023]. Agent-Driver [Mao et al., 2023b] leverages LLMs' common sense and robust reasoning capabilities to improve planning by designing a tool library, a cognitive memory, and a reasoning engine. This paradigm achieves better results on the nuScenes dataset. Meanwhile, shortening the inference time is also an urgent problem. DriveLM [Sima et al., 2023] uses a trajectory tokenizer to convert ego-trajectory signals into text so that they belong to the same domain space. Such a tokenizer can be applied to any general vision-language model. Moreover, it utilizes graph-structured inference with multiple QA pairs in logical order, thus improving the final planning performance. [Wang et al., 2023c] adapts LLMs as a vehicle "Co-Pilot" for driving, which can accomplish specific driving tasks that satisfy human intentions based on the information provided; it lacks verification in complex interaction scenarios. LMDrive [Shao et al., 2023] designs a multi-modal framework to predict the control signal and whether the given instruction is completed. It adopts ResNet [He et al., 2016] as the vision encoder, which has not been through image-text alignment pretraining. In addition, it introduces a benchmark, LangAuto, which includes approximately 64K instruction-following data clips in CARLA. The LangAuto benchmark tests the system's ability to handle complex instructions and challenging driving scenarios.
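GPT-Driver and DriveLM both rest on the idea of mapping continuous ego-trajectory signals into the token space of a language model. The snippet below is a minimal, illustrative sketch of such a waypoint-to-text round trip; the fixed-precision format and the waypoints_to_text/text_to_waypoints helpers are hypothetical choices made here for illustration, not the tokenizers used by those papers.

    # Illustrative waypoint <-> text conversion, in the spirit of GPT-Driver / DriveLM.
    # A real system would align this with the LLM's tokenizer and coordinate frame.
    from typing import List, Tuple

    def waypoints_to_text(waypoints: List[Tuple[float, float]], precision: int = 2) -> str:
        """Serialize (x, y) waypoints in the ego frame as a plain-text sequence."""
        return " ".join(f"({x:.{precision}f},{y:.{precision}f})" for x, y in waypoints)

    def text_to_waypoints(text: str) -> List[Tuple[float, float]]:
        """Parse the textual trajectory back into float waypoints."""
        points = []
        for token in text.split():
            x_str, y_str = token.strip("()").split(",")
            points.append((float(x_str), float(y_str)))
        return points

    if __name__ == "__main__":
        traj = [(0.0, 0.0), (1.2, 0.1), (2.5, 0.3), (3.9, 0.6)]
        prompt_fragment = waypoints_to_text(traj)   # fed to the LLM as ordinary text
        recovered = text_to_waypoints(prompt_fragment)
        assert all(abs(a - b) < 1e-2 for p, q in zip(traj, recovered) for a, b in zip(p, q))

Once trajectories live in text space, planning reduces to next-token prediction, which is why a general LLM or VLM can be fine-tuned on it without architectural changes.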
Figure 3: Large Language Models for Autonomous Driving Research Tree.

LLM4AD
- Planning & Control
  - Fine-tuning Pre-trained Model: DriveMLM [Wang et al., 2023d], LMDrive [Shao et al., 2023], Agent-Driver [Mao et al., 2023b], GPT-Driver [Mao et al., 2023a], DriveLM [Sima et al., 2023], DriveGPT4 [Xu et al., 2023], Driving with LLMs [Chen et al., 2023b], MTD-GPT [Liu et al., 2023b], KoMA [Jiang et al., 2024], AsyncDriver [Chen et al., 2024], PlanAgent [Zheng et al., 2024b], AgentsCoDriver [Hu et al., 2024], DriveVLM [Tian et al., 2024], RAG-Driver [Yuan et al., 2024], VLP [Pan et al., 2024], DME-Driver [Han et al., 2024]
  - Prompt Engineering: A Safety Perspective [Wang et al., 2023f], Talk2Drive [Cui et al., 2023c], ChatGPT as Your Vehicle Co-Pilot [Wang et al., 2023c], Receive Reason and React [Cui et al., 2023b], LanguageMPC [Sha et al., 2023], Talk2BEV [Dewangan et al., 2023], SurrealDriver [Jin et al., 2023b], Drive as You Speak [Cui et al., 2023a], TrafficGPT [Zhang et al., 2023b], Drive Like a Human [Fu et al., 2023a], DiLu [Wen et al., 2023a], LLM-Assisted Light [Wang et al., 2024b], AccidentGPT [Wang et al., 2023b], LLM-Assist [Sharan et al., 2023], LLaDA [Li et al., 2024a]
- Perception
  - Tracking: LanguagePrompt [Wu et al., 2023]
  - Detection: HiLM-D [Ding et al., 2023]
  - Prediction: Can you text what is happening [Keysan et al., 2023], MTD-GPT [Liu et al., 2023b], LC-LLM [Peng et al., 2024], LeGo-Drive [Paul et al., 2024], Context-aware Motion Prediction [Zheng et al., 2024a]
- Question Answering (QA)
  - Visual QA: DriveMLM [Wang et al., 2023d], DriveLM [Sima et al., 2023], Reason2Drive [Nie et al., 2023], LingoQA [Marcu et al., 2023], Dolphins [Ma et al., 2023a], DriveGPT4 [Xu et al., 2023], A Superalignment Framework [Kong et al., 2024], EM-VLM4AD [Gopalkrishnan et al., 2024], TransGPT [Wang et al., 2024c]
  - Traditional QA: Domain Knowledge Distillation [Tang et al., 2023], Human-Centric Autonomous Systems [Yang et al., 2023b], Engineering Safety [Nouri et al., 2024], Hybrid Reasoning [Azarafza et al., 2024]
- Generation (Diffusion): ADriver-I [Jia et al., 2023a], DrivingDiffusion [Li et al., 2023c], DriveDreamer [Wang et al., 2023e], CTG++ [Zhong et al., 2023], GAIA-1 [Hu et al., 2023a], MagicDrive [Gao et al., 2023], Driving into the Future [Wang et al., 2023g], ChatScene [Zhang et al., 2024], REvolve [Hazra et al., 2024], GenAD [Yang et al., 2024], DriveDreamer-2 [Zhao et al., 2024], ChatSim [Wei et al., 2024], LLM-Assisted Light [Wang et al., 2024a], LangProp [Ishida et al., 2024]
- Evaluation & Benchmark: On the Road with GPT-4V [Wen et al., 2023b], GPT-4V Takes the Wheel [Huang et al., 2023a], LaMPilot [Ma et al., 2023b], Evaluation of LLMs [Tanahashi et al., 2023], Testing LLMs [Tang et al., 2024], DriveSim [Sreeram et al., 2024], ELM [Zhou et al., 2024], LimSim++ [Fu et al., 2024], OmniDrive [Wang et al., 2024d], AIDE [Liang et al., 2024]


DriveMLM [Wang et al., 2023d] adopts a multi-modal LLM (taking multi-view images, point clouds, and prompts as input) to generate high-level decision commands and uses Apollo as a planner to obtain the control signal. Moreover, its training data is generated by experts, and GPT-3.5 is used to increase data diversity. It achieves a 76.1 driving score on CARLA Town05 Long, which reaches the level of classic end-to-end autonomous driving. KoMA [Jiang et al., 2024] is a knowledge-driven multi-agent framework in which each agent is powered by large language models. These agents analyze and infer the intentions of surrounding vehicles to enhance decision-making. AsyncDriver [Chen et al., 2024] is an asynchronous LLM-enhanced framework where the inference frequency of the LLM is controllable and can be decoupled from the real-time planner. It shows good closed-loop evaluation performance in challenging scenarios of nuPlan. PlanAgent [Zheng et al., 2024b] extracts a bird's-eye view (BEV) representation and generates a text description input based on the lane map through an environment transformation module. It uses a reasoning engine module to perform a hierarchical chain of thought to guide driving scene understanding, motion command generation, and planning code writing. AgentsCoDriver [Hu et al., 2024] is an LLM-powered framework for multi-vehicle collaborative driving with lifelong learning, enabling communication and collaboration among driving agents in complex traffic scenarios and featuring a reasoning engine, cognitive memory, reinforcement reflection, and a communication module. DriveVLM [Tian et al., 2024] leverages vision-language models to enhance scene understanding and planning capabilities for autonomous driving, while DriveVLM-Dual synergizes these advancements with traditional 3D perception and planning approaches to effectively address spatial reasoning and computational challenges, demonstrating superior performance in complex and dynamic driving scenarios. RAG-Driver [Yuan et al., 2024], a multi-modal large language model with retrieval-augmented in-context learning, provides explainable and generalizable end-to-end driving by producing numerical control signals along with explanations and justifications for driving actions, and demonstrates impressive zero-shot generalization to unseen environments without additional training. LLaDA [Li et al., 2024a] designs a training-free mechanism to assist human drivers and adapt autonomous driving policies to new environments. VLP [Pan et al., 2024] is a Vision-Language-Planning model intended to enhance autonomous driving systems (ADS) by incorporating two novel components: ALP and SLP. ALP (Agent-wise Learning Paradigm) aligns the generated bird's-eye view (BEV) with the true BEV map, improving self-driving BEV reasoning. SLP (Self-Driving-Car-Centric Learning Paradigm) aligns the ego vehicle's query features with its textual planning features, enhancing self-driving decision-making. DME-Driver [Han et al., 2024] enhances decision-logic explainability and environmental perception accuracy by using a vision-language model for decision-making and a planning-oriented perception model for generating precise control signals, effectively translating human-like driving logic into actionable commands and achieving high-precision planning accuracy through the comprehensive HBD dataset.

Prompt engineering

From the prompt engineering perspective, some methods try to tap into the deep reasoning potential of LLMs through clever prompt design. DiLu [Wen et al., 2023a] designs a framework with LLMs as agents to solve closed-loop driving tasks. This method introduces a memory module to record experience and leverages LLMs to facilitate reasoning and reflection processes. DiLu exhibits strong generalization capabilities compared with SOTA RL-based methods. However, the reasoning and reflection processes require multiple rounds of question answering, and the resulting inference time cannot be ignored. Similarly, Receive Reason and React [Cui et al., 2023b] and Drive as You Speak [Cui et al., 2023a] integrate the language and reasoning capabilities of LLMs into autonomous vehicles. In addition to memory and reflection processes, these methods introduce additional raw sensor information such as camera, GNSS, lidar, and radar. However, the inference-speed issue remains unsolved as well. Furthermore, SurrealDriver [Jin et al., 2023b] divides the memory module into short-term memory, long-term guidelines, and safety criteria. Meanwhile, it interviews 24 drivers and uses their detailed descriptions of driving behaviors as chain-of-thought prompts to develop a "coach agent" module. However, there is a lack of comparison with traditional algorithms to prove that large language models indeed bring performance improvements. LanguageMPC [Sha et al., 2023] also designs a chain-of-thought framework for LLMs in driving scenarios, and it integrates with low-level controllers by guided parameter matrix adaptation. Although its performance exceeds MPC and RL-based methods in simplified simulator environments, it lacks validation in complex environments. TrafficGPT [Zhang et al., 2023b] is a fusion of ChatGPT and traffic foundation models which can tackle complex traffic-related problems and provide insightful suggestions. It leverages multimodal data as a data source, offering comprehensive support for various traffic-related tasks. Talk2BEV [Dewangan et al., 2023] introduces a large vision-language model (LVLM) interface for bird's-eye view (BEV) maps in autonomous driving contexts. It does not require any training or fine-tuning, relying only on pre-trained image-language models. In addition, it presents a benchmark for evaluating subsequent work on LVLMs for AD applications. Talk2Drive [Cui et al., 2023a] utilizes human verbal commands and makes autonomous driving decisions based on contextual information to meet personalized human preferences for safety, efficiency, and comfort. AccidentGPT [Wang et al., 2023b] integrates multi-vehicle collaborative perception to improve environmental understanding and collision avoidance, offering advanced safety features like proactive remote safety warnings and blind-spot alerts. It also supports traffic police and management agencies by providing real-time intelligent analysis of traffic safety factors.
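Several of the prompt-engineering methods above (DiLu, SurrealDriver, Receive Reason and React) share the same skeleton: describe the scene in text, retrieve similar past experiences from a memory module, ask the LLM to reason and act, then reflect and write the episode back into memory. The following is a minimal sketch of that loop under assumed interfaces; llm_complete, Experience, and the toy retrieval heuristic are hypothetical placeholders rather than the APIs of any of the surveyed systems.

    # Hedged sketch of a memory-augmented LLM driving agent loop
    # (in the spirit of DiLu / SurrealDriver; all interfaces are illustrative).
    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class Experience:
        scene_text: str      # textual scene description at decision time
        decision: str        # action chosen by the LLM, e.g. "decelerate and keep lane"
        reflection: str      # post-hoc critique used for future retrieval

    @dataclass
    class MemoryModule:
        items: List[Experience] = field(default_factory=list)

        def retrieve(self, scene_text: str, k: int = 3) -> List[Experience]:
            # Toy similarity: shared-word overlap; real systems use embeddings.
            def score(e: Experience) -> int:
                return len(set(e.scene_text.split()) & set(scene_text.split()))
            return sorted(self.items, key=score, reverse=True)[:k]

        def store(self, experience: Experience) -> None:
            self.items.append(experience)

    def drive_step(scene_text: str, memory: MemoryModule,
                   llm_complete: Callable[[str], str]) -> str:
        """One closed-loop decision: retrieve -> reason -> act -> reflect -> store."""
        few_shot = "\n".join(f"- Scene: {e.scene_text} | Action: {e.decision}"
                             for e in memory.retrieve(scene_text))
        prompt = (
            "You are a cautious driving agent.\n"
            f"Relevant past experiences:\n{few_shot}\n"
            f"Current scene: {scene_text}\n"
            "Reason step by step, then output one high-level action."
        )
        decision = llm_complete(prompt)
        reflection = llm_complete(f"Critique this decision for safety: {decision}")
        memory.store(Experience(scene_text, decision, reflection))
        return decision

The repeated llm_complete calls in this loop are exactly why the surveyed papers flag inference latency as an open problem for closed-loop use.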
Metric:

MTD-GPT [Liu et al., 2023b] uses single-subtask success rates as the metric in simulation, and it exceeds the RL expert. DriveGPT4 [Xu et al., 2023] and RAG-Driver [Yuan et al., 2024] use root mean squared error (RMSE) and threshold accuracies for evaluation. For vehicle action description, justification, and full sentences, they use BLEU-4 [Papineni et al., 2002], METEOR [Banerjee and Lavie, 2005], CIDEr [Vedantam et al., 2015], and the ChatGPT score [Fu et al., 2023b]. Driving with LLMs [Chen et al., 2023b] uses the Mean Absolute Error (MAE) for the predictions of the number of cars and pedestrians, normalized acceleration, and brake pressure. Additionally, it measures the accuracy of traffic light detection as well as the mean absolute distance error in meters for traffic light distance prediction. Besides perception-related metrics, it also uses GPT-3.5 to grade the model's answers, a recently emerging technique of grading natural language responses [Fu et al., 2023b; Wang et al., 2023a; Liu et al., 2023c]. DiLu [Wen et al., 2023a] uses Success Steps in simulation as a metric to evaluate generalization and transformation abilities. SurrealDriver [Jin et al., 2023b] evaluates agents along two main dimensions: safe driving ability and humanness. Safe driving capabilities are assessed through collision rates, while human likeness is assessed through user experiments with 24 adult participants (age 29.3 ± 4.9 years, 17 male) who are legally allowed to drive. LanguageMPC [Sha et al., 2023] customizes several metrics: failure/collision cases, the efficiency of traffic flow, the time cost of the ego vehicle, and the safety of the ego vehicle's driving behavior. Similarly, Talk2BEV [Dewangan et al., 2023] measures its method from the perspectives of spatial reasoning, instance attributes, instance counting, and visual reasoning. GPT-Driver [Mao et al., 2023a], LLaDA [Li et al., 2024a], DriveLM [Sima et al., 2023], Agent-Driver [Mao et al., 2023b], VLP [Pan et al., 2024], DME-Driver [Han et al., 2024], and DriveVLM [Tian et al., 2024] report two metrics: L2 error (in meters) and collision rate (in percentage). The average L2 error is calculated by measuring the distance between each waypoint of the planned trajectory and the offline recorded human driver trajectory. It reflects the fit of the planned trajectory to the human driving trajectory. The collision rate is calculated by placing an ego-vehicle box at each waypoint of the planned trajectory and then checking for collisions with the ground-truth bounding boxes of other objects. It reflects the safety of the planned trajectory. LMDrive [Shao et al., 2023] and DriveMLM [Wang et al., 2023d] adopt CARLA's official metrics, including Driving Score (DS), Route Completion (RC), and Infraction Score (IS). At present, LLM4AD for the planning task lacks a unified metric and cannot uniformly evaluate the pros and cons between each method and its traditional counterparts. KoMA [Jiang et al., 2024] demonstrates its effectiveness and a high success rate in the Highway MARL simulator. AsyncDriver [Chen et al., 2024] obtains superior closed-loop evaluation performance in nuPlan Closed-Loop Reactive Hard20 scenarios. PlanAgent [Zheng et al., 2024b] achieves competitive and general results on the nuPlan Val14 and Test14-hard benchmarks and improves the efficiency of token usage when describing driving scenarios. AgentsCoDriver [Hu et al., 2024] adopts Success Rate (SR) and Success Step (SS) in the HighwayEnv simulator [Leurent, 2018]. LeGo-Drive [Paul et al., 2024] is a novel planning-guided end-to-end LLM-based goal-point navigation solution that predicts and improves the desired state by dynamically interacting with the environment and generating a collision-free trajectory.

3.2 Perception

Large language models have demonstrated their unique value and strong capabilities in "perception" tasks [Radford et al., 2021; Li et al., 2022b; Li et al., 2023a; Li et al., 2023b; Li et al., 2022a]. Especially in environments where data is relatively scarce, these models can rely on their few-shot learning characteristics to achieve fast and accurate learning and reasoning [P et al., 2023; Lin et al., 2023]. This learning ability is significant in the perception stage of the autonomous driving system, as it greatly improves the system's adaptability and generalization capabilities in changing and complex driving environments. PromptTrack [Wu et al., 2023] fuses cross-modal features in a prompt reasoning branch to predict 3D objects. It uses language prompts as semantic cues and combines LLMs with 3D detection and tracking tasks. Although it achieves better performance compared to other methods, the advantages of LLMs do not directly affect the tracking task. Rather, the tracking task serves as a query to assist LLMs in performing 3D detection. HiLM-D [Ding et al., 2023] incorporates high-resolution information into multimodal large language models for the Risk Object Localization and Intention and Suggestion Prediction (ROLISP) task. It combines LLMs with 2D detection tasks and obtains better performance on detection and QA tasks compared to other multi-modal large models such as Video-LLaMA [Zhang et al., 2023a] and eP-ALM [Shukor et al., 2023]. It is worth pointing out one potential limitation of the dataset: each video contains only one risk object, which might not capture the complexity of real-world scenarios. [Keysan et al., 2023] integrates pre-trained language models as text-based input encoders for the autonomous driving trajectory prediction task. Joint encoders (image and text) over both modalities perform better than using a single encoder in isolation. While the joint model significantly improves the baseline, its performance has not reached the state-of-the-art level yet [Deo et al., 2021; Gilles et al., 2021]. LC-LLM [Peng et al., 2024] is designed for lane-change prediction, leveraging LLM capabilities to understand complex scenarios, enhancing prediction performance, and providing explainable predictions by generating explanations for lane-change intentions and trajectories. AIDE [Liang et al., 2024] introduces a paradigm for an automatic data engine, incorporating automatic data querying and labeling using VLMs, and continual learning with pseudo labels. It introduces a new benchmark to evaluate such automated data engines for AV perception that allows combined insights across multiple paradigms of open-vocabulary detection, semi-supervised learning, and continual learning. Context-aware Motion Prediction [Zheng et al., 2024a] designs and conducts prompt engineering to enable GPT-4V to comprehend complex traffic scenarios. It combines the context information output by GPT-4V with MTR [Shi et al., 2023] to enhance motion prediction.

Metric:

PromptTrack [Wu et al., 2023] uses the Average Multi-Object Tracking Accuracy (AMOTA) metric [Bernardin and Stiefelhagen, 2008], the Average Multi-Object Tracking Precision (AMOTP) [Bashar et al., 2022], and Identity Switches (IDS) [Huang et al., 2023b]. HiLM-D [Ding et al., 2023] uses BLEU-4 [Papineni et al., 2002], METEOR [Banerjee and Lavie, 2005], CIDEr [Vedantam et al., 2015], SPICE [Anderson et al., 2016], and IoU [Rezatofighi et al., 2019] as metrics to compare with the state of the art. [Keysan et al., 2023] uses the standard evaluation metrics provided in the nuScenes devkit [Caesar et al., 2019; Fong et al., 2021]: minimum Average Displacement Error (minADEk), minimum Final Displacement Error (minFDEk), and the miss rate over 2 meters. LC-LLM [Peng et al., 2024] uses RMSE to assess lateral and longitudinal prediction error. LeGo-Drive [Paul et al., 2024] adopts minFDE (minimum Final Displacement Error) and the L2 distance between the goal location and the trajectory endpoint as evaluation metrics. Context-aware Motion Prediction [Zheng et al., 2024a] uses the mean Average Precision (mAP) of the official WOMD [Ettinger et al., 2021] evaluation.
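Most of the open-loop numbers in the two Metric paragraphs above (average L2 error, minADE/minFDE, miss rate, collision rate) are simple functions of predicted and reference waypoints. A compact sketch is given below; it assumes trajectories are already time-aligned in a common frame and simplifies the ego footprint to a safety radius, which is a simplification relative to the box-based checks used in the actual benchmarks.

    # Hedged sketch of common open-loop trajectory metrics.
    # Alignment, units, and footprint handling vary per benchmark.
    import numpy as np

    def ade(pred: np.ndarray, gt: np.ndarray) -> float:
        """Average displacement / L2 error over time; pred, gt: (T, 2)."""
        return float(np.linalg.norm(pred - gt, axis=1).mean())

    def fde(pred: np.ndarray, gt: np.ndarray) -> float:
        """Final displacement error at the last waypoint."""
        return float(np.linalg.norm(pred[-1] - gt[-1]))

    def min_ade_fde(modes: np.ndarray, gt: np.ndarray) -> tuple:
        """minADE_k / minFDE_k over k predicted modes; modes: (k, T, 2)."""
        return min(ade(m, gt) for m in modes), min(fde(m, gt) for m in modes)

    def miss_rate(modes_per_sample, gts, threshold: float = 2.0) -> float:
        """Fraction of samples whose best final error exceeds the threshold (meters)."""
        misses = [min(fde(m, gt) for m in modes) > threshold
                  for modes, gt in zip(modes_per_sample, gts)]
        return float(np.mean(misses))

    def collision_rate(plan: np.ndarray, obstacle_centers, safe_radius: float = 2.5) -> float:
        """Percentage of waypoints closer than safe_radius to any obstacle center."""
        hits = sum(
            any(np.linalg.norm(wp - np.asarray(c)) < safe_radius for c in obstacles)
            for wp, obstacles in zip(plan, obstacle_centers)
        )
        return 100.0 * hits / len(plan)

    if __name__ == "__main__":
        gt = np.array([[0, 0], [2, 0], [4, 0.5]], dtype=float)
        modes = np.array([[[0, 0], [1.8, 0.1], [3.9, 0.4]],
                          [[0, 0], [2.5, 0.5], [5.0, 1.0]]])
        print(min_ade_fde(modes, gt), collision_rate(gt, [[], [(2.1, 0.2)], []]))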
3.3 Question Answering

Question answering is an important task with a wide range of applications in intelligent transportation, assisted driving, and autonomous vehicles [Xu et al., 2021a; Xu et al., 2021b]. It is mainly reflected through different question-and-answer paradigms, including the traditional QA mechanism [Tang et al., 2023] and more detailed visual QA methods [Xu et al., 2023]. [Tang et al., 2023] constructs a domain knowledge ontology by "chatting" with ChatGPT. It develops a web-based assistant to enable manual supervision and early intervention at runtime, and it guarantees the quality of fully automated distillation results. This question-and-answer system enhances the interactivity of the vehicle, transforms the traditional one-way human-machine interface into an interactive communication experience, and might be able to cultivate the user's sense of participation and control. These sophisticated models [Tang et al., 2023; Xu et al., 2023], equipped with the ability to parse, understand, and generate human-like responses, are pivotal for real-time information processing and provision. They design comprehensive questions related to the scene, including but not limited to vehicle states, navigation assistance, and understanding of traffic situations. [Yang et al., 2023b] provides a human-centered perspective and gives several key insights through different prompt designs to enable LLMs to achieve AD system requirements within the cabin. Dolphins [Ma et al., 2023a] enhances reasoning capabilities through the innovative Grounded Chain of Thought (GCoT) process and specifically adapts to the driving domain by building driving-specific command data and command adjustments. LingoQA [Marcu et al., 2023] develops a QA benchmark and datasets; details are given in Sections 3.5 and 4. EM-VLM4AD [Gopalkrishnan et al., 2024] is an efficient, lightweight, multi-frame vision-language model for visual question answering in autonomous driving, and it requires much less memory and far fewer floating-point operations than DriveLM [Sima et al., 2023]. [Nouri et al., 2024] proposes a prototype of a pipeline of prompts and LLMs that receives an item definition and outputs solutions in the form of safety requirements. Hybrid Reasoning [Azarafza et al., 2024] uses large language models (LLMs) with inputs from image-detected objects and sensor data, including parameters like object distance, car speed, direction, and location, to generate precise brake and speed control values based on weather conditions. TransGPT [Wang et al., 2024c] is a novel large language model for the transportation domain that comes in two variants, TransGPT-SM for single-modal data and TransGPT-MM for multi-modal data, and is designed to enhance traffic analysis and modeling by generating synthetic traffic scenarios, explaining traffic phenomena, answering traffic-related questions, offering recommendations, and creating comprehensive traffic reports.

Metric:

For QA tasks, NLP metrics are often used. DriveGPT4 [Xu et al., 2023] and EM-VLM4AD [Gopalkrishnan et al., 2024] use BLEU-4 [Papineni et al., 2002], METEOR [Banerjee and Lavie, 2005], CIDEr [Vedantam et al., 2015], and the ChatGPT score [Fu et al., 2023b]. [Yang et al., 2023b] adopts accuracy at the individual-question level and at the command level, which includes several sub-questions.

3.4 Generation

In the realm of the "generation" task, large language models leverage their advanced knowledge base and generative capabilities to create realistic driving videos or intricate driving scenarios under specific environmental factors [Khachatryan et al., 2023; Luo et al., 2023b]. This approach offers revolutionary solutions to the challenges of data collection and labeling for autonomous driving, while also constructing a safe and easily controllable setting for testing and validating the decision boundaries of autonomous driving systems. Moreover, by simulating a variety of driving situations and emergency conditions, the generated content becomes a crucial resource for refining and enriching the emergency response strategies of autonomous driving systems.

Common generative models include the Variational Auto-Encoder (VAE) [Kingma and Welling, 2022], the Generative Adversarial Network (GAN) [Goodfellow et al., 2014], Normalizing Flows [Rezende and Mohamed, 2016], and the Denoising Diffusion Probabilistic Model (Diffusion) [Ho et al., 2020]. As diffusion models have recently achieved great success in text-to-image generation [Ronneberger et al., 2015; Rombach et al., 2021; Ramesh et al., 2022], some research has begun to study using diffusion models to generate autonomous driving images or videos. DriveDreamer [Wang et al., 2023e] is a world model derived from real-world driving scenarios. It uses text, an initial image, HD maps, and 3D boxes as input, then generates high-quality driving videos and reasonable driving policies. Similarly, DrivingDiffusion [Li et al., 2023c] adopts a 3D layout as a control signal to generate realistic multi-view videos. GAIA-1 [Hu et al., 2023a] leverages video, text, and action inputs to generate traffic scenarios, environmental elements, and potential risks. In these methods, the text encoders all adopt CLIP [Radford et al., 2021], which has better alignment between image and text. In addition to generating autonomous driving videos, traffic scenes can also be generated. CTG++ [Zhong et al., 2023] is a scene-level diffusion model that can generate realistic and controllable traffic. It leverages LLMs to translate a user query into a differentiable loss function and uses a diffusion model to transform the loss function into realistic, query-compliant trajectories.
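CTG++'s core idea of steering a trajectory diffusion model with an LLM-produced differentiable loss is an instance of guidance at sampling time: at every denoising step, the intermediate trajectory is nudged down the gradient of the loss. The sketch below illustrates only that mechanism; denoise_step and the toy "stay under a speed limit" loss stand in for the model and for an LLM-generated loss, and none of it reflects CTG++'s actual implementation.

    # Hedged sketch of loss-guided diffusion sampling for trajectories
    # (the mechanism behind CTG++-style query compliance; all pieces are illustrative).
    import torch

    def speed_limit_loss(traj: torch.Tensor, v_max: float = 10.0, dt: float = 0.5) -> torch.Tensor:
        """Example differentiable loss an LLM might emit for 'stay under 10 m/s'.
        traj: (T, 2) waypoints; penalize per-step speed above v_max."""
        speeds = torch.linalg.norm(traj[1:] - traj[:-1], dim=-1) / dt
        return torch.relu(speeds - v_max).pow(2).mean()

    def guided_sampling(denoise_step, num_steps: int, horizon: int,
                        guidance_scale: float = 1.0) -> torch.Tensor:
        """Reverse diffusion with gradient guidance from the query loss."""
        traj = torch.randn(horizon, 2)                      # start from noise
        for t in reversed(range(num_steps)):
            traj = traj.detach().requires_grad_(True)
            loss = speed_limit_loss(traj)
            grad = torch.autograd.grad(loss, traj)[0]       # d(loss)/d(trajectory)
            with torch.no_grad():
                traj = denoise_step(traj, t)                # model's denoising update
                traj = traj - guidance_scale * grad         # nudge toward query compliance
        return traj

    if __name__ == "__main__":
        # Placeholder denoiser: slightly shrinks the sample toward zero each step.
        fake_denoiser = lambda x, t: 0.98 * x
        print(guided_sampling(fake_denoiser, num_steps=50, horizon=12).shape)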
Table 1: Description of different datasets regarding LLM4AD (Dataset; Task; Size; Annotator; Description).

- BDD-X [Kim et al., 2018]. Task: Planning, VQA. Size: 77 hours, 6,970 videos, 8.4M frames, 26,228 captions. Annotator: Human. Description: Ego-vehicle action descriptions and explanations.
- HAD [Kim et al., 2019b]. Task: Planning, Perception. Size: 30 hours, 5,744 videos, 22,366 captions. Annotator: Human. Description: Joint action description for goal-oriented advice and attention description for stimulus-driven advice.
- Talk2Car [Deruyttere et al., 2019]. Task: Planning, Perception. Size: 15 hours, 850 videos of 20s each, 30k frames, 11,959 captions. Annotator: Human. Description: Object referral dataset that contains commands written in natural language for self-driving cars.
- DriveLM [Sima et al., 2023]. Task: Perception, Prediction, Planning, VQA. Size: in CARLA, 18k frames and 3.7M QA pairs; in nuScenes, 4.8k frames and 450k QA pairs. Annotator: Human, rule-based. Description: P3 with reasoning logic; connects QA pairs in a graph-style structure; uses "What if"-style questions.
- DRAMA [Malla et al., 2023]. Task: VQA. Size: 91 hours, 17,785 videos, 77,639 questions, 102,830 answers, 17,066 captions. Annotator: Human. Description: Joint risk localization with visual reasoning of driving risks in a free-form language description.
- Rank2Tell [Sachdeva et al., 2023]. Task: Perception, VQA. Size: several hours, 118 videos of 20s each. Annotator: Human. Description: Joint important-object identification, important-object localization ranking, and reasoning.
- NuPrompt [Wu et al., 2023]. Task: Perception. Size: 15 hours, 35,367 prompts for 3D objects. Annotator: Human, GPT-3.5. Description: Object-centric language prompt set for perception tasks.
- NuScenes-QA [Qian et al., 2023]. Task: VQA. Size: 15 hours; train: 24,149 scenes, 459,941 QA pairs; test: 6,019 scenes, 83,337 QA pairs. Annotator: Rule-based. Description: Leverages 3D annotations (object category, position, orientation, and relationship information) and designed question templates to construct QA pairs.
- Reason2Drive [Nie et al., 2023]. Task: Perception, Prediction, VQA. Size: 600K video-text pairs. Annotator: Human, GPT-4. Description: Composed of nuScenes, Waymo, and ONCE, with driving instructions.
- LingoQA [Marcu et al., 2023]. Task: VQA. Size: 419.9k QA pairs, 28k scenarios. Annotator: Rule-based, GPT-3.5/4, software. Description: Contains reasoning pairs in addition to object presence, description, and localisation.
- NuInstruct [Ding et al., 2024]. Task: Perception, Prediction, VQA. Size: 91k QA pairs, 17 subtasks. Annotator: Human, GPT-4. Description: Integrates multi-view information, requiring responses from multiple perspectives, with balanced view distribution for perception tasks.
- OpenDV-2K [Yang et al., 2024]. Task: Perception, Prediction, VQA. Size: 2,059 hours of videos paired with texts (1,747 hours from YouTube and 312 hours from public datasets). Annotator: BLIP-2. Description: A large-scale multimodal dataset for autonomous driving, to support the training of a generalized video prediction model.

MagicDrive [Gao et al., 2023] generates highly realistic images, exploiting geometric information from 3D annotations by independently encoding road maps, object boxes, and camera parameters for precise, geometry-guided synthesis. This approach effectively solves the challenge of multi-camera view consistency. Although it achieves better performance in terms of generation fidelity compared to BEVGen [Swerdlow et al., 2023] and BEVControl [Yang et al., 2023a], it still faces huge challenges in some complex scenes, such as night views and unseen weather conditions. ADriver-I [Jia et al., 2023a] combines a Multimodal Large Language Model (MLLM) and a Video Diffusion Model (VDM) to predict the control signal of the current frame and the future frames. It shows impressive performance on nuScenes and their private datasets. However, the MLLM and VDM are trained separately, so they cannot be optimized jointly. Driving into the Future [Wang et al., 2023g] develops a multiview world model, named Drive-WM, which is capable of generating high-quality, controllable, and consistent multi-view videos in autonomous driving scenes. It explores the potential application of the world model in end-to-end planning for autonomous driving. ChatScene [Zhang et al., 2024] designs an LLM-based agent that generates and simulates challenging safety-critical scenarios in CARLA, improving the collision-avoidance capabilities and robustness of autonomous vehicles. REvolve [Hazra et al., 2024] is an evolutionary framework utilizing GPT-4 to generate and refine reward functions for autonomous driving through human feedback. The reward function is used for RL, and the achieved scores closely match human driving standards. GenAD [Yang et al., 2024] is a large-scale video prediction model for autonomous driving that uses extensive web-sourced data and novel temporal reasoning blocks to handle diverse driving scenarios, generalize to unseen datasets in a zero-shot manner, and adapt to action-conditioned prediction or motion planning. DriveDreamer-2 [Zhao et al., 2024] builds on DriveDreamer with a large language model (LLM) and generates customized, high-quality multi-view driving videos by converting user queries into agent trajectories and HD maps, enhancing training for driving perception methods. ChatSim [Wei et al., 2024] enables editable, photo-realistic 3D driving scene simulations via natural language commands with external digital assets, leveraging a large language model agent collaboration framework and novel multi-camera neural radiance field and lighting estimation methods to produce scene-consistent, high-quality outputs. LLM-Assisted Light [Wang et al., 2024a] integrates the human-mimetic reasoning capabilities of LLMs, enabling the signal control algorithm to interpret and respond to complex traffic scenarios with the nuanced judgment typical of human cognition. It develops a closed-loop traffic signal control system, integrating LLMs with a comprehensive suite of interoperable tools. LangProp [Ishida et al., 2024] is a framework that iteratively optimizes code generated by large language models (LLMs) using both supervised and reinforcement learning, automatically evaluating code performance, catching exceptions, and feeding results back to the LLM to improve code generation for autonomous driving in CARLA. These methods explore customized, authentic generation of autonomous driving data. Although these diffusion-based models have achieved good results on video and image generation metrics, it is still unclear whether they could really be used in a closed loop to boost the performance of the autonomous driving system.

Metric:

DriveDreamer [Wang et al., 2023e], DriveDreamer-2 [Zhao et al., 2024], DrivingDiffusion [Li et al., 2023c], and GenAD [Yang et al., 2024] use the frame-wise Fréchet Inception Distance (FID) [Parmar et al., 2022] to evaluate the quality of generated images and the Fréchet Video Distance (FVD) [Unterthiner et al., 2019] for video quality evaluation. DrivingDiffusion also uses mean intersection over union (mIoU) [Rezatofighi et al., 2019] scores for drivable areas and NDS [Yin et al., 2021] for all object classes, comparing the predicted layout with the ground-truth BEV layout. CTG++ [Zhong et al., 2023], following [Xu et al., 2022; Zhong et al., 2022], uses the failure rate, the Wasserstein distance between normalized histograms of driving profiles, realism deviation (real), and a scene-level realism metric (rel real) as metrics. MagicDrive [Gao et al., 2023] utilizes segmentation metrics such as Road mIoU and Vehicle mIoU [Taran et al., 2018], as well as 3D object detection metrics like mAP [Henderson and Ferrari, 2017] and NDS [Yin et al., 2021]. ADriver-I [Jia et al., 2023a] adopts the L1 error of the speed and steering angle of the current frame, the Fréchet Inception Distance (FID), and the Fréchet Video Distance (FVD) as evaluation indicators. ChatScene [Zhang et al., 2024] provides a thorough evaluation of various scenario generation algorithms, assessed based on the collision rate (CR), overall score (OS), and average displacement error (ADE). REvolve [Hazra et al., 2024] adopts fitness scores and episodic steps as metrics in the AirSim simulator [Shah et al., 2017].
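FID, the most common of the image metrics above, compares Gaussian fits of Inception features from real and generated frames: FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2)). A small sketch of that formula follows; it assumes the feature matrices have already been extracted with an Inception network (or any fixed encoder), which is the expensive part omitted here.

    # Hedged sketch of the Frechet Inception Distance given pre-extracted features.
    # features_real / features_gen: (N, D) arrays of Inception activations.
    import numpy as np
    from scipy.linalg import sqrtm

    def frechet_distance(features_real: np.ndarray, features_gen: np.ndarray) -> float:
        mu_r, mu_g = features_real.mean(axis=0), features_gen.mean(axis=0)
        cov_r = np.cov(features_real, rowvar=False)
        cov_g = np.cov(features_gen, rowvar=False)
        covmean = sqrtm(cov_r @ cov_g)
        if np.iscomplexobj(covmean):      # numerical noise can create tiny imaginary parts
            covmean = covmean.real
        diff = mu_r - mu_g
        return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        real = rng.normal(size=(256, 64))
        fake = rng.normal(loc=0.5, size=(256, 64))
        print(frechet_distance(real, fake))   # grows as the two feature distributions diverge

FVD follows the same recipe with video-level features, which is why the two numbers are usually reported together.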
3.5 Evaluation & Benchmark

In terms of evaluation, On the Road with GPT-4V [Wen et al., 2023b] conducts a comprehensive and multi-faceted evaluation of GPT-4V in various autonomous driving scenarios, covering scenario understanding, reasoning, and acting as a driver. GPT-4V performs well in scene understanding, intent recognition, and driving decision-making. It is good at handling out-of-distribution situations, can accurately assess the intentions of traffic participants, uses multi-view images to comprehensively perceive the environment, and accurately identifies dynamic interactions between traffic participants. However, GPT-4V still has certain limitations in direction recognition, interpretation of traffic lights, and non-English traffic signs. GPT-4V Takes the Wheel [Huang et al., 2023a] evaluates the potential of GPT-4V for autonomous pedestrian behavior prediction using publicly available datasets. Although GPT-4V has made significant advances in AI capabilities for pedestrian behavior prediction, it still falls short of leading traditional domain-specific models.

In terms of benchmarks, LMDrive [Shao et al., 2023] introduces the LangAuto (Language-guided Autonomous Driving) CARLA benchmark. It covers various driving scenarios in 8 towns and takes into account 16 different environmental conditions. It contains three tracks: the LangAuto track (which updates navigation instructions based on location and is divided into sub-tracks of different route lengths), the LangAuto-Notice track (which adds notification instructions on top of navigation instructions), and the LangAuto-Sequential track (which combines consecutive instructions into a single long instruction). In addition, LangAuto uses three main evaluation indicators, route completion, violation score, and driving score, to comprehensively evaluate the autonomous driving system's ability to follow instructions and its driving safety. LingoQA [Marcu et al., 2023] develops LingoQA, which is used for evaluating video question-answering models for autonomous driving. The evaluation system consists of three main parts: a GPT-4-based evaluation, which determines whether the model's answers are consistent with human answers; the Lingo-Judge metric, which evaluates the accuracy of the model's answers using a trained text classifier called Lingo-Judge; and correlation analysis with human ratings. This analysis involves multiple human annotators rating the responses of 17 different models on a scale of 0 to 1, interpreted as the likelihood that the response accurately solves the problem. Reason2Drive [Nie et al., 2023] introduces a protocol to measure the correctness of reasoning chains and resolve semantic ambiguities. The evaluation process includes four key metrics: Reasoning Alignment, which measures the extent of overlap in logical reasoning; Redundancy, aimed at identifying any repetitive steps; Missing Step, which focuses on pinpointing any crucial steps that are absent but necessary for problem-solving; and Strict Reason, which evaluates scenarios involving visual elements. LaMPilot [Ma et al., 2023b] is a benchmark used to evaluate the instruction-execution capabilities of autonomous vehicles, including three parts: a simulator, a dataset, and an evaluator. It employs Python Language Model Programs (LMPs) to interpret human-annotated driving instructions and to execute and evaluate them within its framework. [Tanahashi et al., 2023] evaluates two core capabilities of large language models (LLMs) in the field of autonomous driving: first, spatial-awareness decision-making, that is, whether LLMs can accurately identify the spatial layout based on coordinate information; and second, the ability to follow traffic rules, that is, whether LLMs can strictly abide by traffic laws while driving. [Tang et al., 2024] tests three OpenAI LLMs and several other LLMs on UK Driving Theory Test practice questions and answers, and only GPT-4o passed the test, indicating that the performance of LLMs still needs further improvement. [Kong et al., 2024] develops an LLM-based safe autonomous driving framework, which evaluates and enhances the performance of existing LLM-AD methods in driving safety, sensitive data usage, token consumption, and alignment scenarios by integrating security assessment agents. DriveSim [Sreeram et al., 2024] is a specialized simulator that creates diverse driving scenarios to test and benchmark MLLMs' understanding of and reasoning about real-world driving scenes from a fixed in-car camera perspective. OmniDrive [Wang et al., 2024d] introduces a comprehensive benchmark for visual question-answering tasks related to 3D driving, ensuring strong alignment between agent models and driving tasks through scene description, traffic regulation, 3D grounding, counterfactual reasoning, decision-making, and planning. AIDE [Liang et al., 2024] proposes an automatic data-engine design paradigm, which features automatic data querying and labeling using VLMs and continual learning with pseudo-labels. It also introduces a new benchmark to evaluate such automatic data engines for self-driving-car perception, providing comprehensive insights across multiple paradigms including open-vocabulary detection, semi-supervised learning, and continual learning. ELM [Zhou et al., 2024] is proposed to understand driving scenes over a long span of space and time, showing promising generalization performance in handling complex driving scenarios. LimSim++ [Fu et al., 2024] introduces an open-source evaluation platform for (M)LLMs in autonomous driving, supporting scenario understanding, decision-making, and evaluation systems.

4 Datasets in LLM4AD

Traditional datasets such as the nuScenes dataset [Caesar et al., 2019; Fong et al., 2021] lack action descriptions [Lu et al., 2024], detailed captions, and question-answering pairs, which are needed to interact with LLMs. The BDD-X [Kim et al., 2018], Rank2Tell [Sachdeva et al., 2023], DriveLM [Sima et al., 2023], DRAMA [Malla et al., 2023], NuPrompt [Wu et al., 2023], and NuScenes-QA [Qian et al., 2023] datasets represent key developments in LLM4AD research, each bringing unique contributions to understanding agent behaviors and urban traffic dynamics through extensive, diverse, and situation-rich annotations. We give a summary of each dataset in Table 1 and detailed descriptions below.
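The datasets below mostly share the same unit of supervision: a frame or clip reference plus one or more question-answer pairs, sometimes with boxes, captions, or intention labels attached. A minimal, hypothetical record structure in that spirit is sketched here; the field names are illustrative and do not follow any single dataset's schema.

    # Illustrative record structure for a driving VQA sample; field names are hypothetical
    # and do not match the exact schema of BDD-X, DriveLM, NuScenes-QA, etc.
    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    @dataclass
    class QAPair:
        question: str
        answer: str
        reasoning: Optional[str] = None        # chain-of-thought style rationale, if annotated

    @dataclass
    class DrivingVQASample:
        scene_id: str                           # e.g. a nuScenes/CARLA scene token
        frame_paths: List[str]                  # one or more camera frames or a clip
        qa_pairs: List[QAPair] = field(default_factory=list)
        risk_object_box: Optional[Tuple[float, float, float, float]] = None  # x, y, w, h
        ego_intention: Optional[str] = None     # e.g. "turn left at the intersection"

    sample = DrivingVQASample(
        scene_id="scene-0001",
        frame_paths=["cam_front/000123.jpg"],
        qa_pairs=[QAPair("Is the pedestrian on the right crossing?",
                         "Yes, they have entered the crosswalk.",
                         "The pedestrian's position overlaps the crosswalk polygon.")],
        ego_intention="slow down and yield",
    )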
BDD-X Dataset [Kim et al., 2018]: With over 77 hours of diverse driving conditions captured in 6,970 videos, this dataset is a collection of real-world driving behaviors, each annotated with descriptions and explanations. It includes 26K activities across 8.4M frames and thus provides a resource for understanding and predicting driver behaviors across different conditions.

Honda Research Institute-Advice Dataset (HAD) [Kim et al., 2019b]: HAD offers 30 hours of driving video data paired with natural language advice and videos integrated with CAN-bus signal data. The advice includes goal-oriented advice (top-down signal), which is designed to guide the vehicle in a navigation task, and stimulus-driven advice (bottom-up signal), which highlights specific visual cues that the user expects the vehicle controller to actively focus on.

Talk2Car [Deruyttere et al., 2019]: The Talk2Car dataset contains 11,959 commands for the 850 videos of the nuScenes [Caesar et al., 2019; Fong et al., 2021] training set, with 3D bounding box annotations. Of the commands, 55.94% were from videos recorded in Boston, while 44.06% were from Singapore. On average, each command contains 11.01 words, including 2.32 nouns, 2.29 verbs, and 0.62 adjectives. Typically, there are about 14.07 commands in every video. It is an object referral dataset that contains commands written in natural language for self-driving cars.

DriveLM Dataset [Sima et al., 2023]: This dataset integrates human-like reasoning into autonomous driving systems, enhancing Perception, Prediction, and Planning (P3). It employs a "Graph-of-Thought" structure, encouraging a futuristic approach through "What if" scenarios, thereby promoting advanced, logic-based reasoning and decision-making mechanisms in driving systems.

DRAMA Dataset [Malla et al., 2023]: Collected from Tokyo's streets, it includes 17,785 scenario clips captured with a video camera, each clipped to 2 seconds in duration. It contains different annotations: video-level Q/A, object-level Q/A, risk-object bounding boxes, free-form captions, and separate labels for ego-car intention, a scene classifier, and suggestions to the driver.

Rank2Tell Dataset [Sachdeva et al., 2023]: It is captured from a moving vehicle in highly interactive traffic scenes in the San Francisco Bay Area. It includes 116 clips (about 20s each) at 10 FPS captured using an instrumented vehicle equipped with three Point Grey Grasshopper video cameras with a resolution of 1920 × 1200 pixels, a Velodyne HDL-64E S2 LiDAR sensor, and high-precision GPS. The dataset includes video-level Q/A, object-level Q/A, LiDAR and 3D bounding boxes (with tracking), the stitched field of view from the three cameras, important-object bounding boxes (multiple important objects per frame with multiple levels of importance: high, medium, low), free-form captions (multiple captions per object for multiple objects), and ego-car intention.

NuPrompt Dataset [Wu et al., 2023]: It represents an expansion of the nuScenes dataset, enriched with annotated language prompts specifically designed for driving scenes. This dataset includes 35,367 language prompts for 3D objects, averaging 5.3 instances per object. This annotation enhances the dataset's practicality in autonomous driving testing and training, particularly in complex scenarios requiring linguistic processing and comprehension.

NuScenes-QA Dataset [Qian et al., 2023]: It is a dataset for question answering in autonomous driving, containing 459,941 question-answer pairs from 34,149 distinct visual scenes. They are partitioned into 376,604 questions from 28,130 scenes for training and 83,337 questions from 6,019 scenes for testing. NuScenes-QA showcases a wide array of question lengths, reflecting different complexity levels and making it challenging for AI models. Beyond sheer numbers, the dataset ensures a balanced range of question types and categories, from identifying objects to assessing their behavior, such as whether they are moving or parked. This design inhibits the model's tendency to be biased or to rely on linguistic shortcuts.

Reason2Drive [Nie et al., 2023]: It consists of the nuScenes, Waymo, and ONCE datasets with 600,000 video-text pairs labeled by humans and GPT-4. It provides a detailed representation of the driving scene through a unique automatic annotation mode, capturing various elements such as object types, visual and kinematic attributes, and their relationship to the ego vehicle. It has been enhanced with GPT-4 to include complex question-answer pairs and detailed reasoning narratives.

LingoQA [Marcu et al., 2023]: This dataset is a large-scale, diverse collection for autonomous driving, containing approximately 419,000 question-answer pairs covering both action and scenery subsets. It provides rich information about driving behavior, environmental perception, and road conditions through high-quality videos and detailed annotations. It features complex questions and free-form answers, leveraging GPT-3.5/4 to enhance the diversity and depth of the content. The driving capabilities covered include actions, reasons, attention, recognition, positioning, etc., which makes it particularly suitable for improving the understanding and decision-making capabilities of autonomous driving systems.

NuInstruct [Ding et al., 2024]: It is a dataset featuring 91K multi-view video-QA pairs spanning 17 subtasks, each requiring comprehensive information such as temporal, multi-view, and spatial data, thereby significantly raising the complexity of the challenges. It develops an SQL-based method that automatically generates instruction-response pairs, inspired by the logical progression of human decision-making.

OpenDV-2K [Yang et al., 2024]: This dataset is a large-scale multimodal dataset for autonomous driving, comprising 2,059 hours of curated driving videos, including 1,747 hours from YouTube and 312 hours from public datasets, with automatically generated language annotations to support generalized video-prediction model training.

Ethical Statement

When applying LLMs to the field of autonomous driving, we must deeply consider their potential ethical implications. First, model hallucination may cause the vehicle to misunderstand the external environment or traffic conditions, creating safety hazards. Second, model discrimination and bias may lead to vehicles making unfair or biased decisions in different environments or when facing different groups. Additionally, false information and errors in reasoning can cause a vehicle to adopt inappropriate or dangerous driving behaviors. Inductive advice may leave the vehicle vulnerable to external interference or malicious behavior. Finally, privacy leakage is also a serious issue, as vehicles may inadvertently reveal sensitive information about the user or the surrounding environment. To sum up, we strongly recommend that, before deploying a large language model in an autonomous driving system, an in-depth and detailed ethical review be conducted to ensure that its decision-making logic is not only technically accurate but also ethically appropriate. At the same time, we call for following the principles of transparency, responsibility, and fairness to ensure the ethics and safety of technology applications. We call on the entire community to work together to ensure the reliable and responsible deployment of autonomous driving technology based on large language models.

Acknowledgments

This work was partly supported by NSFC (92370201, 62222607, 61972250) and Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102).

References

[Anderson et al., 2016] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic propositional image caption evaluation, 2016.

[Atakishiyev et al., 2023] Shahin Atakishiyev, Mohammad Salameh, Hengshuai Yao, and Randy Goebel. Explainable artificial intelligence for autonomous driving: A comprehensive overview and field guide for future research directions, 2023.

[Azarafza et al., 2024] Mehdi Azarafza, Mojtaba Nayyeri, Charles Steinmetz, Steffen Staab, and Achim Rettberg. Hybrid reasoning based on large language models for autonomous car driving. arXiv preprint arXiv:2402.13602, 2024.

[Banerjee and Lavie, 2005] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation
5 Conclusion with improved correlation with human judgments. In Jade
Goldstein, Alon Lavie, Chin-Yew Lin, and Clare Voss, ed-
In this paper, we have provided a comprehensive survey on itors, Proceedings of the ACL Workshop on Intrinsic and
LLM4AD. We classify and introduce different applications Extrinsic Evaluation Measures for Machine Translation
employing LLMs for autonomous driving and summarize the and/or Summarization, pages 65–72, Ann Arbor, Michi-
representative approaches in each category. At the same time, gan, June 2005. Association for Computational Linguis-
we summarize the latest datasets related to LLM4AD. We tics.
will continue to monitor developments in the field and high- [Bashar et al., 2022] Mk Bashar, Samia Islam,
light future research directions. Kashifa Kawaakib Hussain, Md. Bakhtiar Hasan,
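NuInstruct's actual database schema and query templates are not reproduced in this survey, so the snippet below is only a minimal sketch of the general idea: a hypothetical table of scene annotations, one instruction template paired with one SQL template, and the query result reused as the response. All table, column, and sample names are illustrative assumptions, not the dataset's real schema.

# Minimal sketch (hypothetical schema): turning SQL queries over scene
# annotations into instruction-response pairs, in the spirit of the
# SQL-based generation described for NuInstruct.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE objects (
    scene_id TEXT, camera TEXT, category TEXT, distance_m REAL)""")
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?, ?)",
    [("scene_0001", "CAM_FRONT", "pedestrian", 12.4),
     ("scene_0001", "CAM_FRONT", "car", 25.1),
     ("scene_0001", "CAM_BACK", "truck", 40.8)],
)

# One instruction template paired with one SQL template.
instruction = "How many {category}s are visible in {camera} of {scene}?"
sql = ("SELECT COUNT(*) FROM objects "
       "WHERE scene_id = ? AND camera = ? AND category = ?")

def make_pair(scene, camera, category):
    """Fill the instruction template and answer it by executing the SQL."""
    count = conn.execute(sql, (scene, camera, category)).fetchone()[0]
    return {
        "instruction": instruction.format(
            category=category, camera=camera, scene=scene),
        "response": str(count),
    }

print(make_pair("scene_0001", "CAM_FRONT", "pedestrian"))
# -> {'instruction': 'How many pedestrians are visible in CAM_FRONT of scene_0001?', 'response': '1'}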
OpenDV-2K [Yang et al., 2024]: This dataset is a large-scale multimodal dataset for autonomous driving, comprising 2,059 hours of curated driving videos, including 1,747 hours from YouTube and 312 hours from public datasets, with automatically generated language annotations to support generalized video prediction model training.
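As a rough illustration of the question-answering annotations shared by several of the datasets above, the sketch below shows one plausible way such samples could be represented in code. The class and field names are hypothetical (no dataset above prescribes this exact schema); the split arithmetic simply reuses the NuScenes-QA numbers quoted earlier.

# Minimal sketch (hypothetical field names): representing a language-
# annotated driving QA sample and checking the train/test proportions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DrivingQASample:
    scene_token: str              # link back to the underlying sensor data
    question: str
    answer: str
    question_type: str = "exist"  # e.g. exist / count / status / comparison
    object_ids: List[str] = field(default_factory=list)  # referenced objects

train = [DrivingQASample("scene_0001", "Is the pedestrian ahead moving?", "yes", "status")]
test = [DrivingQASample("scene_0002", "How many parked cars are there?", "3", "count")]

# NuScenes-QA-style bookkeeping: 376,604 train / 83,337 test QA pairs.
print(f"train fraction: {376_604 / (376_604 + 83_337):.2%}")  # ~81.88%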
5 Conclusion
In this paper, we have provided a comprehensive survey of LLM4AD. We classify and introduce the different applications employing LLMs for autonomous driving and summarize the representative approaches in each category. We also summarize the latest datasets related to LLM4AD. We will continue to monitor developments in the field and highlight future research directions.

Ethical Statement
When applying LLMs to the field of autonomous driving, we must carefully consider their potential ethical implications. First, model hallucination may cause the vehicle to misinterpret the external environment or traffic conditions, creating safety hazards. Second, model discrimination and bias may lead vehicles to make unfair or biased decisions in different environments or when facing different groups. Additionally, false information and errors in reasoning can cause a vehicle to adopt inappropriate or dangerous driving behaviors. Manipulative or leading prompts may leave the vehicle vulnerable to external interference or malicious behavior. Finally, privacy leakage is also a serious issue, as vehicles may inadvertently reveal sensitive information about the user or the surrounding environment. To sum up, we strongly recommend that, before deploying a large language model in an autonomous driving system, an in-depth and detailed ethical review be conducted to ensure that its decision-making logic is not only technically accurate but also ethically appropriate. At the same time, we advocate adherence to the principles of transparency, responsibility, and fairness to ensure the ethics and safety of technology applications. We call on the entire community to work together to ensure the reliable and responsible deployment of autonomous driving technology based on large language models.

Acknowledgments
This work was partly supported by NSFC (92370201, 62222607, 61972250) and the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102).

References
[Anderson et al., 2016] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation, 2016.
[Atakishiyev et al., 2023] Shahin Atakishiyev, Mohammad Salameh, Hengshuai Yao, and Randy Goebel. Explainable artificial intelligence for autonomous driving: A comprehensive overview and field guide for future research directions, 2023.
[Azarafza et al., 2024] Mehdi Azarafza, Mojtaba Nayyeri, Charles Steinmetz, Steffen Staab, and Achim Rettberg. Hybrid reasoning based on large language models for autonomous car driving. arXiv preprint arXiv:2402.13602, 2024.
[Banerjee and Lavie, 2005] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare Voss, editors, Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.
[Bashar et al., 2022] Mk Bashar, Samia Islam, Kashifa Kawaakib Hussain, Md. Bakhtiar Hasan, A. B. M. Ashikur Rahman, and Md. Hasanul Kabir. Multiple object tracking in recent times: A literature review, 2022.
[Bernardin and Stiefelhagen, 2008] Keni Bernardin and [Chen et al., 2024] Yuan Chen, Zi-han Ding, Ziqin Wang,
Rainer Stiefelhagen. Evaluating multiple object tracking Yan Wang, Lijun Zhang, and Si Liu. Asynchronous large
performance: The clear mot metrics. EURASIP J. Image language model enhanced planner for autonomous driving.
Video Process., 2008, 2008. arXiv preprint arXiv:2406.14556, 2024.
[Bommasani et al., 2021] Rishi Bommasani, Drew A Hud- [Cheng et al., 2022] Bowen Cheng, Ishan Misra, Alexan-
son, Ehsan Adeli, Russ Altman, Simran Arora, Syd- der G. Schwing, Alexander Kirillov, and Rohit Girdhar.
ney von Arx, Michael S Bernstein, Jeannette Bohg, An- Masked-attention mask transformer for universal image
toine Bosselut, Emma Brunskill, et al. On the oppor- segmentation. 2022.
tunities and risks of foundation models. arXiv preprint [Chitta et al., 2023] Kashyap Chitta, Aditya Prakash, Bern-
arXiv:2108.07258, 2021. hard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger.
[Brohan et al., 2023a] Anthony Brohan, Noah Brown, Jus- Transfuser: Imitation with transformer-based sensor fu-
tice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof sion for autonomous driving. IEEE Pattern Analysis and
Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Machine Intelligence (PAMI), 2023.
Chelsea Finn, et al. Rt-2: Vision-language-action models [Cui et al., 2023a] Can Cui, Yunsheng Ma, Xu Cao, Wen-
transfer web knowledge to robotic control. arXiv preprint qian Ye, and Ziran Wang. Drive as you speak: En-
arXiv:2307.15818, 2023. abling human-like interaction with large language models
[Brohan et al., 2023b] Anthony Brohan, Noah Brown, Jus- in autonomous vehicles. arXiv preprint arXiv:2309.10228,
tice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof 2023.
Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, [Cui et al., 2023b] Can Cui, Yunsheng Ma, Xu Cao, Wen-
Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gon- qian Ye, and Ziran Wang. Receive, reason, and re-
zalez Arenas, Keerthana Gopalakrishnan, Kehang Han, act: Drive as you say with large language models in
Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian autonomous vehicles. arXiv preprint arXiv:2310.08034,
Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry 2023.
Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, [Cui et al., 2023c] Can Cui, Zichong Yang, Yupeng Zhou,
Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Yunsheng Ma, Juanwu Lu, and Ziran Wang. Large lan-
Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, guage models for autonomous driving: Real-world exper-
Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag iments, 2023.
Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh,
Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan [Dauner et al., 2023] Daniel Dauner, Marcel Hallgarten, An-
Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, dreas Geiger, and Kashyap Chitta. Parting with miscon-
Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe ceptions about learning-based vehicle motion planning. In
Yu, and Brianna Zitkovich. Rt-2: Vision-language-action CoRL, 2023.
models transfer web knowledge to robotic control, 2023. [Deo et al., 2021] Nachiket Deo, Eric M. Wolff, and Oscar
[Brown et al., 2020] Tom Brown, Benjamin Mann, Nick Ry- Beijbom. Multimodal trajectory prediction conditioned on
der, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- lane-graph traversals, 2021.
wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, [Deruyttere et al., 2019] Thierry Deruyttere, Simon Vanden-
Amanda Askell, et al. Language models are few-shot hende, Dusan Grujicic, Luc Van Gool, and Marie-Francine
learners. Advances in neural information processing sys- Moens. Talk2car: Taking control of your self-driving
tems, 33:1877–1901, 2020. car. In Proceedings of the 2019 Conference on Empir-
[Caesar et al., 2019] Holger Caesar, Varun Bankiti, Alex H. ical Methods in Natural Language Processing and the
Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush 9th International Joint Conference on Natural Language
Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. Processing (EMNLP-IJCNLP). Association for Computa-
nuscenes: A multimodal dataset for autonomous driving. tional Linguistics, 2019.
arXiv preprint arXiv:1903.11027, 2019. [Devlin et al., 2018] Jacob Devlin, Ming-Wei Chang, Ken-
[Chen and Krähenbühl, 2022] Dian Chen and Philipp ton Lee, and Kristina Toutanova. Bert: Pre-training of
Krähenbühl. Learning from all vehicles. In CVPR, 2022. deep bidirectional transformers for language understand-
ing. arXiv preprint arXiv:1810.04805, 2018.
[Chen et al., 2023a] Li Chen, Penghao Wu, Kashyap Chitta,
[Dewangan et al., 2023] Vikrant Dewangan, Tushar Choud-
Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-
hary, Shivam Chandhok, Shubham Priyadarshan, Anushka
to-end autonomous driving: Challenges and frontiers.
Jain, Arun K. Singh, Siddharth Srivastava, Krishna Murthy
arXiv preprint arXiv:2306.16927, 2023.
Jatavallabhula, and K. Madhava Krishna. Talk2bev:
Language-enhanced bird's-eye view maps for autonomous driving, 2023.
[Chen et al., 2023b] Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, and Jamie Shotton. Driving with llms: Fusing object-level vector modality for explainable autonomous driving, 2023.
[Ding et al., 2023] Xinpeng Ding, Jianhua Han, Hang Xu, efficient vision-language models for question answering
Wei Zhang, and Xiaomeng Li. Hilm-d: Towards high- in autonomous driving. arXiv preprint arXiv:2403.19838,
resolution understanding in multimodal large language 2024.
models for autonomous driving, 2023. [Han et al., 2024] Wencheng Han, Dongqian Guo, Cheng-
[Ding et al., 2024] Xinpeng Ding, Jianhua Han, Hang Xu, Zhong Xu, and Jianbing Shen. Dme-driver: Integrat-
Xiaodan Liang, Wei Zhang, and Xiaomeng Li. Holistic ing human decision logic and 3d scene perception in
autonomous driving understanding by bird’s-eye-view in- autonomous driving. arXiv preprint arXiv:2401.03641,
jected multi-modal large models. In Proceedings of the 2024.
IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 13668–13677, 2024. [Hazra et al., 2024] Rishi Hazra, Alkis Sygkounas, Andreas
Persson, Amy Loutfi, and Pedro Zuidberg Dos Martires.
[Driess et al., 2023] Danny Driess, Fei Xia, Mehdi S. M. Revolve: Reward evolution with large language models
Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian for autonomous driving. arXiv preprint arXiv:2406.01309,
Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, 2024.
Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Ser-
manet, Daniel Duckworth, Sergey Levine, Vincent Van- [He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing
houcke, Karol Hausman, Marc Toussaint, Klaus Greff, Ren, and Jian Sun. Deep residual learning for image recog-
Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: nition. In Proceedings of the IEEE Conference on Com-
An embodied multimodal language model, 2023. puter Vision and Pattern Recognition (CVPR), June 2016.
[Ettinger et al., 2021] Scott Ettinger, Shuyang Cheng, Ben- [Henderson and Ferrari, 2017] Paul Henderson and Vittorio
jamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Ferrari. End-to-end training of object class detectors for
Yuning Chai, Ben Sapp, Charles Qi, Yin Zhou, Zoey Yang, mean average precision, 2017.
Aurelien Chouard, Pei Sun, Jiquan Ngiam, Vijay Vasude- [Ho et al., 2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel.
van, Alexander McCauley, Jonathon Shlens, and Dragomir Denoising diffusion probabilistic models, 2020.
Anguelov. Large scale interactive motion forecasting for
autonomous driving : The waymo open motion dataset, [Höfer et al., 2021] Sebastian Höfer, Kostas Bekris, Ankur
2021. Handa, Juan Camilo Gamboa, Melissa Mozifian, Florian
[Fong et al., 2021] Whye Kit Fong, Rohit Mohan, Juana Va- Golemo, Chris Atkeson, Dieter Fox, Ken Goldberg, John
leria Hurtado, Lubing Zhou, Holger Caesar, Oscar Bei- Leonard, et al. Sim2real in robotics and automation: Ap-
jbom, and Abhinav Valada. Panoptic nuscenes: A large- plications and challenges. IEEE transactions on automa-
scale benchmark for lidar panoptic segmentation and tion science and engineering, 18(2):398–400, 2021.
tracking. arXiv preprint arXiv:2109.03805, 2021. [Hu et al., 2023a] Anthony Hu, Lloyd Russell, Hudson Yeo,
[Fu et al., 2023a] Daocheng Fu, Xin Li, Licheng Wen, Min Zak Murez, George Fedoseev, Alex Kendall, Jamie Shot-
Dou, Pinlong Cai, Botian Shi, and Yu Qiao. Drive like ton, and Gianluca Corrado. Gaia-1: A generative
a human: Rethinking autonomous driving with large lan- world model for autonomous driving. arXiv preprint
guage models, 2023. arXiv:2309.17080, 2023.
[Fu et al., 2023b] Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, [Hu et al., 2023b] Yihan Hu, Jiazhi Yang, Li Chen, Keyu
and Pengfei Liu. Gptscore: Evaluate as you desire, 2023. Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du,
Tianwei Lin, Wenhai Wang, et al. Planning-oriented
[Fu et al., 2024] Daocheng Fu, Wenjie Lei, Licheng Wen,
autonomous driving. In Proceedings of the IEEE/CVF
Pinlong Cai, Song Mao, Min Dou, Botian Shi, and
Conference on Computer Vision and Pattern Recognition,
Yu Qiao. Limsim++: A closed-loop platform for deploy-
pages 17853–17862, 2023.
ing multimodal llms in autonomous driving. 2024.
[Gao et al., 2023] Ruiyuan Gao, Kai Chen, Enze Xie, Lan- [Hu et al., 2024] Senkang Hu, Zhengru Fang, Zihan Fang,
qing Hong, Zhenguo Li, Dit-Yan Yeung, and Qiang Xu. Xianhao Chen, and Yuguang Fang. Agentscodriver: Large
Magicdrive: Street view generation with diverse 3d geom- language model empowered collaborative driving with
etry control, 2023. lifelong learning. arXiv preprint arXiv:2404.06345, 2024.
[Gilles et al., 2021] Thomas Gilles, Stefano Sabatini, [Huang et al., 2023a] Jia Huang, Peng Jiang, Alvika Gau-
Dzmitry Tsishkou, Bogdan Stanciulescu, and Fabien tam, and Srikanth Saripalli. Gpt-4v takes the wheel: Eval-
Moutarde. Gohome: Graph-oriented heatmap output for uating promise and challenges for pedestrian behavior pre-
future motion estimation, 2021. diction, 2023.
[Goodfellow et al., 2014] Ian J. Goodfellow, Jean Pouget- [Huang et al., 2023b] Junchao Huang, Xiaoqi He, and Sheng
Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Zhao. The detection and rectification for identity-switch
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Gen- based on unfalsified control, 2023.
erative adversarial networks, 2014. [Ishida et al., 2024] Shu Ishida, Gianluca Corrado, George
[Gopalkrishnan et al., 2024] Akshay Gopalkrishnan, Ross Fedoseev, Hudson Yeo, Lloyd Russell, Jamie Shotton,
Greer, and Mohan Trivedi. Multi-frame, lightweight & Joao F. Henriques, and Anthony Hu. Langprop: A code
optimization framework using large language models ap- and Jingjing Liu. Adapt: Action-aware driving caption
plied to driving. In ICLR 2024 Workshop on Large Lan- transformer, 2023.
guage Model (LLM) Agents, 2024. [Jin et al., 2023b] Ye Jin, Xiaoxi Shen, Huiling Peng, Xi-
[Jain et al., 2021] Ashesh Jain, Luca Del Pero, Hugo Grim- aoan Liu, Jingli Qin, Jiayang Li, Jintao Xie, Peizhong Gao,
mett, and Peter Ondruska. Autonomy 2.0: Why is Guyue Zhou, and Jiangtao Gong. Surrealdriver: Design-
self-driving always 5 years away? arXiv preprint ing generative driver agent simulation framework in urban
arXiv:2107.08142, 2021. contexts based on large language model, 2023.
[Jia et al., 2021] Xiaosong Jia, Liting Sun, Masayoshi [Keysan et al., 2023] Ali Keysan, Andreas Look, Eitan Kos-
Tomizuka, and Wei Zhan. Ide-net: Interactive driving man, Gonca Gürsun, Jörg Wagner, Yu Yao, and Barbara
event and pattern extraction from human data. IEEE Rakitsch. Can you text what is happening? integrating pre-
Robotics and Automation Letters, 6(2):3065–3072, 2021. trained language encoders into trajectory prediction mod-
[Jia et al., 2022a] Xiaosong Jia, Li Chen, Penghao Wu, Jia els for autonomous driving, 2023.
Zeng, Junchi Yan, Hongyang Li, and Yu Qiao. Towards [Khachatryan et al., 2023] Levon Khachatryan, Andranik
capturing the temporal dynamics for trajectory prediction: Movsisyan, Vahram Tadevosyan, Roberto Henschel,
a coarse-to-fine approach. In CoRL, 2022. Zhangyang Wang, Shant Navasardyan, and Humphrey
[Jia et al., 2022b] Xiaosong Jia, Liting Sun, Hang Zhao, Shi. Text2video-zero: Text-to-image diffusion models are
Masayoshi Tomizuka, and Wei Zhan. Multi-agent trajec- zero-shot video generators, 2023.
tory prediction by combining egocentric and allocentric [Kim et al., 2018] Jinkyu Kim, Anna Rohrbach, Trevor Dar-
views. In Conference on Robot Learning, pages 1434– rell, John Canny, and Zeynep Akata. Textual explanations
1443. PMLR, 2022. for self-driving vehicles. Proceedings of the European
[Jia et al., 2023a] Fan Jia, Weixin Mao, Yingfei Liu, Conference on Computer Vision (ECCV), 2018.
Yucheng Zhao, Yuqing Wen, Chi Zhang, Xiangyu Zhang, [Kim et al., 2019a] Jinkyu Kim, Teruhisa Misu, Yi-Ting
and Tiancai Wang. Adriver-i: A general world model for Chen, Ashish Tawari, and John Canny. Grounding human-
autonomous driving, 2023. to-vehicle advice for self-driving vehicles, 2019.
[Jia et al., 2023b] Xiaosong Jia, Yulu Gao, Li Chen, Junchi [Kim et al., 2019b] Jinkyu Kim, Teruhisa Misu, Yi-Ting
Yan, Patrick Langechuan Liu, and Hongyang Li. Chen, Ashish Tawari, and John Canny. Grounding human-
Driveadapter: Breaking the coupling barrier of perception to-vehicle advice for self-driving vehicles. In The IEEE
and planning in end-to-end autonomous driving, 2023. Conference on Computer Vision and Pattern Recognition
[Jia et al., 2023c] Xiaosong Jia, Penghao Wu, Li Chen, (CVPR), 2019.
Yu Liu, Hongyang Li, and Junchi Yan. Hdgt: Heteroge- [Kingma and Welling, 2022] Diederik P Kingma and Max
neous driving graph transformer for multi-agent trajectory Welling. Auto-encoding variational bayes, 2022.
prediction via scene encoding. IEEE Transactions on Pat- [Kong et al., 2024] Xiangrui Kong, Thomas Braunl, Marco
tern Analysis and Machine Intelligence (TPAMI), 2023. Fahmi, and Yue Wang. A superalignment framework in
[Jia et al., 2023d] Xiaosong Jia, Penghao Wu, Li Chen, autonomous driving with large language models. arXiv
Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. preprint arXiv:2406.05651, 2024.
Think twice before driving: Towards scalable decoders for [Leurent, 2018] Edouard Leurent. An environment for au-
end-to-end autonomous driving, 2023. tonomous driving decision-making. https://github.com/
[Jia et al., 2024a] Xiaosong Jia, Shaoshuai Shi, Zijun Chen, eleurent/highway-env, 2018.
Li Jiang, Wenlong Liao, Tao He, and Junchi Yan. Amp: [Li et al., 2022a] Hongyang Li, Chonghao Sima, Jifeng Dai,
Autoregressive motion prediction revisited with next to- Wenhai Wang, Lewei Lu, Huijie Wang, Jia Zeng, Zhiqi Li,
ken prediction for autonomous driving. arXiv preprint Jiazhi Yang, Hanming Deng, Hao Tian, Enze Xie, Jiang-
arXiv:2403.13331, 2024. wei Xie, Li Chen, Tianyu Li, Yang Li, Yulu Gao, Xiaosong
[Jia et al., 2024b] Xiaosong Jia, Zhenjie Yang, Qifeng Li, Jia, Si Liu, Jianping Shi, Dahua Lin, and Yu Qiao. Delv-
Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards ing into the devils of bird’s-eye-view perception: A review,
multi-ability benchmarking of closed-loop end-to-end au- evaluation and recipe. arXiv preprint arXiv:2209.05324,
tonomous driving. arXiv preprint arXiv:2406.03877, 2022.
2024. [Li et al., 2022b] Junnan Li, Dongxu Li, Caiming Xiong,
[Jiang et al., 2024] Kemou Jiang, Xuan Cai, Zhiyong Cui, and Steven Hoi. Blip: Bootstrapping language-image
Aoyong Li, Yilong Ren, Haiyang Yu, Hao Yang, pre-training for unified vision-language understanding and
Daocheng Fu, Licheng Wen, and Pinlong Cai. Koma: generation, 2022.
Knowledge-driven multi-agent framework for autonomous [Li et al., 2022c] Zhiqi Li, Wenhai Wang, Hongyang Li,
driving with large language models. arXiv preprint Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng
arXiv:2407.14239, 2024. Dai. Bevformer: Learning bird’s-eye-view representation
[Jin et al., 2023a] Bu Jin, Xinyu Liu, Yupeng Zheng, Pengfei from multi-camera images via spatiotemporal transform-
Li, Hao Zhao, Tong Zhang, Yuhang Zheng, Guyue Zhou, ers. In ECCV, pages 1–18. Springer, 2022.
[Li et al., 2023a] Junnan Li, Dongxu Li, Silvio Savarese, and [Lu et al., 2024] Han Lu, Xiaosong Jia, Yichen Xie, Wen-
Steven Hoi. Blip-2: Bootstrapping language-image pre- long Liao, Xiaokang Yang, and Junchi Yan. Ac-
training with frozen image encoders and large language tivead: Planning-oriented active learning for end-to-end
models, 2023. autonomous driving, 2024.
[Li et al., 2023b] Tianyu Li, Li Chen, Huijie Wang, Yang [Luo et al., 2018] Wenjie Luo, Bin Yang, and Raquel Urta-
Li, Jiazhi Yang, Xiangwei Geng, Shengyin Jiang, Yuting sun. Fast and furious: Real time end-to-end 3d detec-
Wang, Hang Xu, Chunjing Xu, Junchi Yan, Ping Luo, and tion, tracking and motion forecasting with a single con-
Hongyang Li. Graph-based topology reasoning for driving volutional net. In Proceedings of the IEEE conference on
scenes. arXiv preprint arXiv:2304.05277, 2023. Computer Vision and Pattern Recognition, pages 3569–
3577, 2018.
[Li et al., 2023c] Xiaofan Li, Yifu Zhang, and Xiaoqing Ye.
Drivingdiffusion: Layout-guided multi-view driving scene [Luo et al., 2023a] Ruipu Luo, Ziwang Zhao, Min Yang,
video generation with latent diffusion model. arXiv Junwei Dong, Da Li, Pengcheng Lu, Tao Wang, Linmei
preprint arXiv:2310.07771, 2023. Hu, Minghui Qiu, and Zhongyu Wei. Valley: Video assis-
tant with large language model enhanced ability, 2023.
[Li et al., 2024a] Boyi Li, Yue Wang, Jiageng Mao, Boris
[Luo et al., 2023b] Zhengxiong Luo, Dayou Chen, Yingya
Ivanovic, Sushant Veer, Karen Leung, and Marco Pavone.
Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao,
Driving everywhere with large language model policy
Jingren Zhou, and Tieniu Tan. Videofusion: Decomposed
adaptation. In Proceedings of the IEEE/CVF Conference
diffusion models for high-quality video generation, 2023.
on Computer Vision and Pattern Recognition (CVPR),
pages 14948–14957, June 2024. [Ma et al., 2023a] Yingzi Ma, Yulong Cao, Jiachen Sun,
Marco Pavone, and Chaowei Xiao. Dolphins: Multimodal
[Li et al., 2024b] Qifeng Li, Xiaosong Jia, Shaobo Wang, language model for driving, 2023.
and Junchi Yan. Think2drive: Efficient reinforcement
learning by thinking in latent world model for quasi- [Ma et al., 2023b] Yunsheng Ma, Can Cui, Xu Cao, Wen-
realistic autonomous driving (in carla-v2). In ECCV, 2024. qian Ye, Peiran Liu, Juanwu Lu, Amr Abdelraouf, Rohit
Gupta, Kyungtae Han, Aniket Bera, James M. Rehg, and
[Liang et al., 2020] Ming Liang, Bin Yang, Wenyuan Zeng, Ziran Wang. Lampilot: An open benchmark dataset for au-
Yun Chen, Rui Hu, Sergio Casas, and Raquel Urtasun. tonomous driving with language model programs, 2023.
Pnpnet: End-to-end perception and prediction with track-
[Malla et al., 2023] Srikanth Malla, Chiho Choi, Isht
ing in the loop. In Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition, pages Dwivedi, Joon Hee Choi, and Jiachen Li. Drama: Joint
11553–11562, 2020. risk localization and captioning in driving. In Proceedings
of the IEEE/CVF Winter Conference on Applications of
[Liang et al., 2024] Mingfu Liang, Jong-Chyi Su, Samuel Computer Vision, pages 1043–1052, 2023.
Schulter, Sparsh Garg, Shiyu Zhao, Ying Wu, and Man- [Mao et al., 2023a] Jiageng Mao, Yuxi Qian, Hang Zhao,
mohan Chandraker. Aide: An automatic data engine for and Yue Wang. Gpt-driver: Learning to drive with gpt.
object detection in autonomous driving. In Proceedings of arXiv preprint arXiv:2310.01415, 2023.
the IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 14695–14706, 2024. [Mao et al., 2023b] Jiageng Mao, Junjie Ye, Yuxi Qian,
Marco Pavone, and Yue Wang. A language agent for au-
[Lin et al., 2023] Zhiqiu Lin, Samuel Yu, Zhiyi Kuang, tonomous driving, 2023.
Deepak Pathak, and Deva Ramanan. Multimodality helps
unimodality: Cross-modal few-shot learning with multi- [Marcu et al., 2023] Ana-Maria Marcu, Long Chen, Jan
modal models, 2023. Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal
Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex
[Liu et al., 2023a] Haotian Liu, Chunyuan Li, Qingyang Wu, Kendall, Jamie Shotton, and Oleg Sinavski. Lingoqa:
and Yong Jae Lee. Visual instruction tuning, 2023. Video question answering for autonomous driving, 2023.
[Liu et al., 2023b] Jiaqi Liu, Peng Hang, Xiao qi, Jianqiang [Nie et al., 2023] Ming Nie, Renyuan Peng, Chunwei Wang,
Wang, and Jian Sun. Mtd-gpt: A multi-task decision- Xinyue Cai, Jianhua Han, Hang Xu, and Li Zhang. Rea-
making gpt model for autonomous driving at unsignalized son2drive: Towards interpretable and chain-based reason-
intersections, 2023. ing for autonomous driving, 2023.
[Liu et al., 2023c] Yang Liu, Dan Iter, Yichong Xu, Shuo- [Nouri et al., 2024] Ali Nouri, Beatriz Cabrero-Daniel,
hang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Fredrik Torner, Hakan Sivencrona, and Christian Berger.
Nlg evaluation using gpt-4 with better human alignment, Engineering safety requirements for autonomous driving
2023. with large language models, 2024.
[Liu et al., 2023d] Zhijian Liu, Haotian Tang, Alexander [OpenAI, 2023] OpenAI. Gpt-4 technical report, 2023.
Amini, Xingyu Yang, Huizi Mao, Daniela Rus, and Song [Ouyang et al., 2022] Long Ouyang, Jeff Wu, Xu Jiang,
Han. Bevfusion: Multi-task multi-sensor fusion with uni- Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin,
fied bird’s-eye view representation. In ICRA, 2023. Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex
Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke [Rezatofighi et al., 2019] Hamid Rezatofighi, Nathan Tsoi,
Miller, Maddie Simens, Amanda Askell, Peter Welinder, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio
Paul Christiano, Jan Leike, and Ryan Lowe. Training lan- Savarese. Generalized intersection over union: A metric
guage models to follow instructions with human feedback, and a loss for bounding box regression, 2019.
2022. [Rezende and Mohamed, 2016] Danilo Jimenez Rezende
[P et al., 2023] Jishnu Jaykumar P, Kamalesh Palanisamy, and Shakir Mohamed. Variational inference with
Yu-Wei Chao, Xinya Du, and Yu Xiang. Proto-clip: normalizing flows, 2016.
Vision-language prototypical network for few-shot learn- [Rombach et al., 2021] Robin Rombach, Andreas
ing, 2023.
Blattmann, Dominik Lorenz, Patrick Esser, and Björn
[Pan et al., 2024] Chenbin Pan, Burhaneddin Yaman, Tom- Ommer. High-resolution image synthesis with latent
maso Nesti, Abhirup Mallik, Alessandro G Allievi, Senem diffusion models, 2021.
Velipasalar, and Liu Ren. Vlp: Vision language planning
[Ronneberger et al., 2015] Olaf Ronneberger, Philipp Fis-
for autonomous driving. In Proceedings of the IEEE/CVF
cher, and Thomas Brox. U-net: Convolutional networks
Conference on Computer Vision and Pattern Recognition,
for biomedical image segmentation, 2015.
pages 14760–14769, 2024.
[Papineni et al., 2002] Kishore Papineni, Salim Roukos, [Sachdeva et al., 2023] Enna Sachdeva, Nakul Agarwal,
Todd Ward, and Wei-Jing Zhu. Bleu: a method for auto- Suhas Chundi, Sean Roelofs, Jiachen Li, Behzad Dariush,
matic evaluation of machine translation. In Proceedings of Chiho Choi, and Mykel Kochenderfer. Rank2tell: A mul-
the 40th annual meeting on association for computational timodal driving dataset for joint importance ranking and
linguistics, pages 311–318. Association for Computational reasoning. arXiv preprint arXiv:2309.06597, 2023.
Linguistics, 2002. [Sadat et al., 2020] Abbas Sadat, Sergio Casas, Mengye
[Parmar et al., 2022] Gaurav Parmar, Richard Zhang, and Ren, Xinyu Wu, Pranaab Dhawan, and Raquel Urta-
Jun-Yan Zhu. On aliased resizing and surprising subtleties sun. Perceive, predict, and plan: Safe motion planning
in gan evaluation, 2022. through interpretable semantic representations. In Com-
puter Vision–ECCV 2020: 16th European Conference,
[Paul et al., 2024] Pranjal Paul, Anant Garg, Tushar Choud- Glasgow, UK, August 23–28, 2020, Proceedings, Part
hary, Arun Kumar Singh, and K Madhava Krishna. XXIII 16, pages 414–430. Springer, 2020.
Lego-drive: Language-enhanced goal-oriented closed-
loop end-to-end autonomous driving. arXiv preprint [Sha et al., 2023] Hao Sha, Yao Mu, Yuxuan Jiang, Li Chen,
arXiv:2403.20116, 2024. Chenfeng Xu, Ping Luo, Shengbo Eben Li, Masayoshi
Tomizuka, Wei Zhan, and Mingyu Ding. Languagempc:
[Peng et al., 2024] Mingxing Peng, Xusen Guo, Xianda Large language models as decision makers for au-
Chen, Meixin Zhu, Kehua Chen, Xuesong Wang, Yin- tonomous driving. arXiv preprint arXiv:2310.03026,
hai Wang, et al. Lc-llm: Explainable lane-change inten- 2023.
tion and trajectory predictions with large language models.
arXiv preprint arXiv:2403.18344, 2024. [Shah et al., 2017] Shital Shah, Debadeepta Dey, Chris
Lovett, and Ashish Kapoor. Airsim: High-fidelity visual
[Qian et al., 2023] Tianwen Qian, Jingjing Chen, Linhai and physical simulation for autonomous vehicles, 2017.
Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-
qa: A multi-modal visual question answering bench- [Shao et al., 2023] Hao Shao, Yuxuan Hu, Letian Wang,
mark for autonomous driving scenario. arXiv preprint Steven L. Waslander, Yu Liu, and Hongsheng Li. Lm-
arXiv:2305.14836, 2023. drive: Closed-loop end-to-end driving with large language
models, 2023.
[Radford et al., 2018] Alec Radford, Karthik Narasimhan,
Tim Salimans, Ilya Sutskever, et al. Improving language [Sharan et al., 2023] SP Sharan, Francesco Pittaluga, Man-
understanding by generative pre-training. 2018. mohan Chandraker, et al. Llm-assist: Enhancing closed-
loop planning with language-based reasoning. arXiv
[Radford et al., 2019] Alec Radford, Jeffrey Wu, Rewon
preprint arXiv:2401.00125, 2023.
Child, David Luan, Dario Amodei, Ilya Sutskever, et al.
Language models are unsupervised multitask learners. [Shi et al., 2022] Shaoshuai Shi, Li Jiang, Dengxin Dai, and
OpenAI blog, 1(8):9, 2019. Bernt Schiele. Motion transformer with global intention
[Radford et al., 2021] Alec Radford, Jong Wook Kim, Chris localization and local movement refinement. Advances
in Neural Information Processing Systems, 35:6531–6543,
Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar-
2022.
wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack
Clark, Gretchen Krueger, and Ilya Sutskever. Learning [Shi et al., 2023] Shaoshuai Shi, Li Jiang, Dengxin Dai, and
transferable visual models from natural language supervi- Bernt Schiele. Motion transformer with global intention
sion, 2021. localization and local movement refinement, 2023.
[Ramesh et al., 2022] Aditya Ramesh, Prafulla Dhariwal, [Shukor et al., 2023] Mustafa Shukor, Corentin Dancette,
Alex Nichol, Casey Chu, and Mark Chen. Hierarchical and Matthieu Cord. ep-alm: Efficient perceptual augmen-
text-conditional image generation with clip latents, 2022. tation of language models, 2023.
[Sima et al., 2023] Chonghao Sima, Katrin Renz, Kashyap [Vedantam et al., 2015] Ramakrishna Vedantam,
Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Ping Luo, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-
Andreas Geiger, and Hongyang Li. Drivelm: Driving based image description evaluation, 2015.
with graph visual question answering. arXiv preprint [Wang et al., 2023a] Jiaan Wang, Yunlong Liang, Fandong
arXiv:2312.14150, 2023. Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu,
[Sreeram et al., 2024] Shiva Sreeram, Tsun-Hsuan Wang, Jianfeng Qu, and Jie Zhou. Is chatgpt a good nlg evalua-
Alaa Maalouf, Guy Rosman, Sertac Karaman, and Daniela tor? a preliminary study, 2023.
Rus. Probing multimodal llms as world models for driv- [Wang et al., 2023b] Lening Wang, Han Jiang, Pinlong Cai,
ing. arXiv preprint arXiv:2405.05956, 2024. Daocheng Fu, Tianqi Wang, Zhiyong Cui, Yilong Ren,
[Swerdlow et al., 2023] Alexander Swerdlow, Runsheng Xu, Haiyang Yu, Xuesong Wang, and Yinhai Wang. Accident-
and Bolei Zhou. Street-view image generation from a gpt: Accident analysis and prevention from v2x environ-
bird’s-eye view layout. arXiv preprint arXiv:2301.04634, mental perception with multi-modal large model. arXiv
2023. preprint arXiv:2312.13156, 2023.
[Tanahashi et al., 2023] Kotaro Tanahashi, Yuichi Inoue, [Wang et al., 2023c] Shiyi Wang, Yuxuan Zhu, Zhiheng Li,
Yu Yamaguchi, Hidetatsu Yaginuma, Daiki Shiotsuka, Hi- Yutong Wang, Li Li, and Zhengbing He. Chatgpt as your
royuki Shimatani, Kohei Iwamasa, Yoshiaki Inoue, Taka- vehicle co-pilot: An initial attempt. IEEE Transactions on
fumi Yamaguchi, Koki Igari, Tsukasa Horinouchi, Kento Intelligent Vehicles, pages 1–17, 2023.
Tokuhiro, Yugo Tokuchi, and Shunsuke Aoki. Evalua- [Wang et al., 2023d] Wenhai Wang, Jiangwei Xie,
tion of large language models for decision making in au- ChuanYang Hu, Haoming Zou, Jianan Fan, Wenwen
tonomous driving, 2023. Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li,
[Tang et al., 2023] Yun Tang, Antonio A. Bruto da Costa, Ja- et al. Drivemlm: Aligning multi-modal large language
son Zhang, Irvine Patrick, Siddartha Khastgir, and Paul models with behavioral planning states for autonomous
Jennings. Domain knowledge distillation from large lan- driving. arXiv preprint arXiv:2312.09245, 2023.
guage model: An empirical study in the autonomous driv- [Wang et al., 2023e] Xiaofeng Wang, Zheng Zhu, Guan
ing domain, 2023. Huang, Xinze Chen, and Jiwen Lu. Drivedreamer: To-
[Tang et al., 2024] Zuoyin Tang, Jianhua He, Dashuai Pei, wards real-world-driven world models for autonomous
Kezhong Liu, and Tao Gao. Testing large language mod- driving. arXiv preprint arXiv:2309.09777, 2023.
els on driving theory knowledge and skills for connected [Wang et al., 2023f] Yixuan Wang, Ruochen Jiao, Chengtian
autonomous vehicles. arXiv preprint arXiv:2407.17211, Lang, Sinong Simon Zhan, Chao Huang, Zhaoran Wang,
2024. Zhuoran Yang, and Qi Zhu. Empowering autonomous
[Taran et al., 2018] Vlad Taran, Nikita Gordienko, Yuriy driving with large language models: A safety perspective,
Kochura, Yuri Gordienko, Alexandr Rokovyi, Oleg 2023.
Alienin, and Sergii Stirenko. Performance evaluation of [Wang et al., 2023g] Yuqi Wang, Jiawei He, Lue Fan,
deep learning networks for semantic segmentation of traf- Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving
fic stereo-pair images. In Proceedings of the 19th Interna- into the future: Multiview visual forecasting and planning
tional Conference on Computer Systems and Technologies. with world model for autonomous driving, 2023.
ACM, sep 2018. [Wang et al., 2024a] Maonan Wang, Aoyu Pang, Yuheng
[Tian et al., 2024] Xiaoyu Tian, Junru Gu, Bailin Li, Kan, Man-On Pun, Chung Shue Chen, and Bo Huang.
Yicheng Liu, Chenxu Hu, Yang Wang, Kun Zhan, Peng Llm-assisted light: Leveraging large language model ca-
Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The con- pabilities for human-mimetic traffic signal control in com-
vergence of autonomous driving and large vision-language plex urban environments, 2024.
models. arXiv preprint arXiv:2402.12289, 2024. [Wang et al., 2024b] Maonan Wang, Aoyu Pang, Yuheng
[Touvron et al., 2023] Hugo Touvron, Thibaut Lavril, Gau- Kan, Man-On Pun, Chung Shue Chen, and Bo Huang.
tier Izacard, Xavier Martinet, Marie-Anne Lachaux, Tim- Llm-assisted light: Leveraging large language model
othée Lacroix, Baptiste Rozière, Naman Goyal, Eric Ham- capabilities for human-mimetic traffic signal control
bro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, in complex urban environments. arXiv preprint
Edouard Grave, and Guillaume Lample. Llama: Open and arXiv:2403.08337, 2024.
efficient foundation language models, 2023. [Wang et al., 2024c] Peng Wang, Xiang Wei, Fangxu Hu,
[Treiber et al., 2000] Martin Treiber, Ansgar Hennecke, and and Wenjuan Han. Transgpt: Multi-modal generative pre-
Dirk Helbing. Congested traffic states in empirical obser- trained transformer for transportation, 2024.
vations and microscopic simulations. Physical review E, [Wang et al., 2024d] Shihao Wang, Zhiding Yu, Xiaohui
62(2):1805, 2000. Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying
[Unterthiner et al., 2019] Thomas Unterthiner, Sjoerd van Li, and Jose M Alvarez. Omnidrive: A holistic llm-agent
Steenkiste, Karol Kurach, Raphael Marinier, Marcin framework for autonomous driving with 3d perception,
Michalski, and Sylvain Gelly. Towards accurate genera- reasoning and planning. arXiv preprint arXiv:2405.01533,
tive models of video: A new metric and challenges, 2019. 2024.
[Wei et al., 2024] Yuxi Wei, Zi Wang, Yifan Lu, Chenxin [Yang et al., 2023c] Zhengyuan Yang, Linjie Li, Kevin Lin,
Xu, Changxing Liu, Hao Zhao, Siheng Chen, and Yan- Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Li-
feng Wang. Editable scene simulation for autonomous juan Wang. The dawn of lmms: Preliminary explorations
driving via collaborative llm-agents. In Proceedings of with gpt-4v(ision), 2023.
the IEEE/CVF Conference on Computer Vision and Pat- [Yang et al., 2024] Jiazhi Yang, Shenyuan Gao, Yihang Qiu,
tern Recognition, pages 15077–15087, 2024. Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu,
[Wen et al., 2023a] Licheng Wen, Daocheng Fu, Xin Li, Jia Zeng, Ping Luo, et al. Generalized predictive model
Xinyu Cai, Tao Ma, Pinlong Cai, Min Dou, Botian Shi, for autonomous driving. In Proceedings of the IEEE/CVF
Liang He, and Yu Qiao. Dilu: A knowledge-driven ap- Conference on Computer Vision and Pattern Recognition,
proach to autonomous driving with large language models. pages 14662–14672, 2024.
arXiv preprint arXiv:2309.16292, 2023. [Yin et al., 2021] Tianwei Yin, Xingyi Zhou, and Philipp
[Wen et al., 2023b] Licheng Wen, Xuemeng Yang, Krähenbühl. Center-based 3d object detection and track-
Daocheng Fu, Xiaofeng Wang, Pinlong Cai, Xin Li, ing, 2021.
Tao Ma, Yingxuan Li, Linran Xu, Dengke Shang, Zheng [Yuan et al., 2024] Jianhao Yuan, Shuyang Sun, Daniel
Zhu, Shaoyan Sun, Yeqi Bai, Xinyu Cai, Min Dou, Omeiza, Bo Zhao, Paul Newman, Lars Kunze, and
Shuanglu Hu, Botian Shi, and Yu Qiao. On the road Matthew Gadd. Rag-driver: Generalisable driving ex-
with gpt-4v(ision): Early explorations of visual-language planations with retrieval-augmented in-context learning
model on autonomous driving, 2023. in multi-modal large language model. arXiv preprint
[Wu et al., 2022a] Penghao Wu, Li Chen, Hongyang Li, arXiv:2402.10828, 2024.
Xiaosong Jia, Junchi Yan, and Yu Qiao. Policy pre- [Zeng et al., 2022] Fangao Zeng, Bin Dong, Yuang Zhang,
training for autonomous driving via self-supervised geo- Tiancai Wang, Xiangyu Zhang, and Yichen Wei. Motr:
metric modeling. In The Eleventh International Confer- End-to-end multiple-object tracking with transformer. In
ence on Learning Representations, 2022. European Conference on Computer Vision (ECCV), 2022.
[Wu et al., 2022b] Penghao Wu, Xiaosong Jia, Li Chen, [Zhang et al., 2023a] Hang Zhang, Xin Li, and Lidong
Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided Bing. Video-llama: An instruction-tuned audio-visual
control prediction for end-to-end autonomous driving: A language model for video understanding. arXiv preprint
simple yet strong baseline, 2022. arXiv:2306.02858, 2023.
[Wu et al., 2023] Dongming Wu, Wencheng Han, Tiancai [Zhang et al., 2023b] Siyao Zhang, Daocheng Fu, Zhao
Wang, Yingfei Liu, Xiangyu Zhang, and Jianbing Shen. Zhang, Bin Yu, and Pinlong Cai. Trafficgpt: Viewing,
Language prompt for autonomous driving, 2023. processing and interacting with traffic foundation models.
arXiv preprint arXiv:2309.06719, 2023.
[Xu et al., 2021a] Li Xu, He Huang, and Jun Liu. Sutd-
trafficqa: A question answering benchmark and an effi- [Zhang et al., 2024] Jiawei Zhang, Chejian Xu, and Bo Li.
cient network for video reasoning over traffic events. In Chatscene: Knowledge-enabled safety-critical scenario
Proceedings of the IEEE/CVF Conference on Computer generation for autonomous vehicles. In Proceedings of
Vision and Pattern Recognition (CVPR), pages 9878– the IEEE/CVF Conference on Computer Vision and Pat-
9888, June 2021. tern Recognition, pages 15459–15469, 2024.
[Xu et al., 2021b] Li Xu, He Huang, and Jun Liu. Sutd- [Zhao et al., 2024] Guosheng Zhao, Xiaofeng Wang, Zheng
trafficqa: A question answering benchmark and an effi- Zhu, Xinze Chen, Guan Huang, Xiaoyi Bao, and Xin-
cient network for video reasoning over traffic events, 2021. gang Wang. Drivedreamer-2: Llm-enhanced world mod-
els for diverse driving video generation. arXiv preprint
[Xu et al., 2022] Danfei Xu, Yuxiao Chen, Boris Ivanovic, arXiv:2403.06845, 2024.
and Marco Pavone. Bits: Bi-level imitation for traffic sim-
[Zheng et al., 2024a] Xiaoji Zheng, Lixiu Wu, Zhijie Yan,
ulation, 2022.
Yuanrong Tang, Hao Zhao, Chen Zhong, Bokui Chen,
[Xu et al., 2023] Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen and Jiangtao Gong. Large language models pow-
Zhao, Yong Guo, Kwan-Yee. K. Wong, Zhenguo Li, and ered context-aware motion prediction. arXiv preprint
Hengshuang Zhao. Drivegpt4: Interpretable end-to-end arXiv:2403.11057, 2024.
autonomous driving via large language model, 2023. [Zheng et al., 2024b] Yupeng Zheng, Zebin Xing, Qichao
[Yang et al., 2023a] Kairui Yang, Enhui Ma, Jibin Peng, Zhang, Bu Jin, Pengfei Li, Yuhang Zheng, Zhongpu Xia,
Qing Guo, Di Lin, and Kaicheng Yu. Bevcontrol: Kun Zhan, Xianpeng Lang, Yaran Chen, et al. Planagent:
Accurately controlling street-view elements with multi- A multi-modal large language agent for closed-loop ve-
perspective consistency via bev sketch layout. arXiv hicle motion planning. arXiv preprint arXiv:2406.01587,
preprint arXiv:2308.01661, 2023. 2024.
[Yang et al., 2023b] Yi Yang, Qingwen Zhang, Ci Li, [Zhong et al., 2022] Ziyuan Zhong, Davis Rempe, Danfei
Daniel Simões Marta, Nazre Batool, and John Folkesson. Xu, Yuxiao Chen, Sushant Veer, Tong Che, Baishakhi Ray,
Human-centric autonomous systems with llms for user and Marco Pavone. Guided conditional diffusion for con-
command reasoning, 2023. trollable traffic simulation, 2022.
[Zhong et al., 2023] Ziyuan Zhong, Davis Rempe, Yuxiao
Chen, Boris Ivanovic, Yulong Cao, Danfei Xu, Marco
Pavone, and Baishakhi Ray. Language-guided traffic sim-
ulation via scene-level diffusion, 2023.
[Zhou et al., 2024] Yunsong Zhou, Linyan Huang, Qingwen
Bu, Jia Zeng, Tianyu Li, Hang Qiu, Hongzi Zhu, Minyi
Guo, Yu Qiao, and Hongyang Li. Embodied understand-
ing of driving scenarios. arXiv preprint arXiv:2403.04593,
2024.