LLM4Drive: A Survey of Large Language Models for Autonomous Driving
Abstract

Autonomous driving technology, a catalyst for revolutionizing transportation and urban mobility, is tending to transition from rule-based systems to data-driven strategies. Traditional module-based systems are constrained by cumulative errors among cascaded modules and inflexible pre-set rules. In contrast, end-to-end autonomous driving systems have the potential to avoid error accumulation due to their fully data-driven training process, although they often lack transparency due to their "black box" nature, complicating the validation and traceability of decisions. Recently, large language models (LLMs) have demonstrated abilities including understanding context, logical reasoning, and generating answers. A natural thought is to utilize these abilities to empower autonomous driving. Combining LLMs with foundation vision models could open the door to open-world understanding, reasoning, and few-shot learning, which current autonomous driving systems are lacking. In this paper, we systematically review the research line of (Vision) Large Language Models for Autonomous Driving ((V)LLM4Drive). This study evaluates the current state of technological advancements, distinctly outlining the principal challenges and prospective directions for the field. For the convenience of researchers in academia and industry, we provide real-time updates on the latest advances in the field as well as relevant open-source resources via the designated link: https://github.com/Thinklab-SJTU/Awesome-LLM4AD.

1 Introduction

Autonomous driving is rapidly reshaping our understanding of transportation, heralding a new era of technological revolution. This transformation signals not only the future of transportation but also a fundamental shift across various industries. In conventional autonomous driving systems, algorithms typically adopt a modular design [Liang et al., 2020; Luo et al., 2018; Sadat et al., 2020], with separate components responsible for critical tasks such as perception [Li et al., 2022c; Liu et al., 2023d], prediction [Shi et al., 2022; Jia et al., 2022a; Jia et al., 2023c; Jia et al., 2024a], and planning [Treiber et al., 2000; Dauner et al., 2023; Li et al., 2024b; Jia et al., 2024b]. Specifically, the perception component handles object detection [Li et al., 2022c; Liu et al., 2023d], tracking [Zeng et al., 2022], and sophisticated semantic segmentation tasks [Cheng et al., 2022]. The prediction component analyzes the external environment [Jia et al., 2021] and estimates the future states of the surrounding agents [Jia et al., 2022b]. The planning component, often reliant on rule-based decision algorithms [Treiber et al., 2000], determines the optimal and safest route to a predetermined destination. While the module-based approach provides reliability and enhanced security in a variety of scenarios, it also presents challenges. The decoupled design between system components may lead to key information loss during transitions and potentially redundant computation as well. Additionally, errors may accumulate within the system due to inconsistencies in optimization objectives among the modules, affecting the vehicle's overall decision-making performance [Chen et al., 2023a].

Rule-based decision systems, with their inherent limitations and scalability issues, are gradually giving way to data-driven methods. End-to-end autonomous driving solutions are increasingly becoming a consensus in the field [Wu et al., 2022b; Chitta et al., 2023; Chen and Krähenbühl, 2022; Jia et al., 2023d; Jia et al., 2023b; Hu et al., 2023b]. By eliminating integration errors between multiple modules and reducing redundant computations, the end-to-end system enhances the expression of visual [Wu et al., 2022a] and sensory information while ensuring greater efficiency. However, this approach also introduces the "black box" problem, meaning a lack of transparency in the decision-making process, complicating interpretation and validation.

Simultaneously, the explainability of autonomous driving has become an important research focus [Jin et al., 2023a]. Although smaller language models (like early versions of BERT [Devlin et al., 2018] and GPT [Brown et al., 2020]) employed in massive data collection from driving scenarios help address this issue, they often lack sufficient generalization capabilities to perform optimally. Recently, large language models [OpenAI, 2023; Touvron et al., 2023] have demonstrated remarkable abilities in understanding context, generating answers, and handling complex tasks. They are also now integrated with multimodal models [Brohan et al., 2023a; Liu et al., 2023a; Driess et al., 2023; Xu et al., 2023; Chen et al., 2023b]. This integration achieves a unified feature space mapping for images, text, videos, point clouds, etc. Such consolidation significantly enhances the system's generalization capabilities and equips it with the capacity to quickly adapt to new scenarios in a zero-shot or few-shot manner.
Figure 1: The limitation of the current autonomous driving paradigm (green arrow) and where LLMs can potentially enhance autonomous driving ability (blue arrow).
In this context, developing an interpretable and efficient end-to-end autonomous driving system has become a research hotspot [Chen et al., 2023a]. Large language models, with their extensive knowledge base and exceptional generalization, could facilitate easier learning of complex driving behaviors. By leveraging the visual-language model (VLM)'s robust and comprehensive capabilities of open-world understanding and in-context learning [Bommasani et al., 2021; Brohan et al., 2023b; Liu et al., 2023a; Driess et al., 2023], it becomes possible to address the long-tail problem for perception networks, assist in decision-making, and provide intuitive explanations for these decisions.

This paper aims to provide a comprehensive overview of this rapidly emerging research field, analyze its basic principles, methods, and implementation processes, and introduce the application of LLMs for autonomous driving in detail. Finally, we discuss related challenges and future research directions.

2 Motivation of LLM4AD

In today's technological landscape, large language models such as GPT-4 and GPT-4V [OpenAI, 2023; Yang et al., 2023c] are drawing attention with their superior contextual understanding and in-context learning capabilities. Their enriched common sense knowledge has facilitated significant advancements in many downstream tasks. We ask the question: how do these large models assist in the domain of autonomous driving, especially in playing a critical role in the decision-making process?

In Fig. 1, we give an intuitive demonstration of the limitations of the current autonomous driving paradigm and where LLMs can potentially enhance autonomous driving ability. We summarize two primary aspects of driving skills. The orange circle represents the ideal level of driving competence, akin to that possessed by an experienced human driver. There are two main methods to acquire such proficiency: one, through learning-based techniques within simulated environments; and two, by learning from offline data through similar methodologies. It is important to note that, due to discrepancies between simulations and the real world, these two domains are not fully the same, i.e., there is a sim2real gap [Höfer et al., 2021]. Concurrently, offline data serves as a subset of real-world data since it is collected directly from actual surroundings. However, it is difficult to fully cover the distribution as well, due to the notorious long-tailed nature [Jain et al., 2021] of autonomous driving tasks.

The final goal of autonomous driving is to elevate driving abilities from a basic green stage to a more advanced blue level through extensive data collection and deep learning. However, the high cost associated with data gathering and annotation, along with the inherent differences between simulated and real-world environments, means there is still a gap before reaching the expert level of driving skills. In this scenario, if we can effectively utilize the innate common sense embedded within large language models, we might gradually narrow this gap. Intuitively, by adopting this approach, we could progressively enhance the capabilities of autonomous driving systems, bringing them closer to, or potentially reaching, the ideal expert level of driving proficiency. Through such technological integration and innovation, we anticipate significant improvements in the overall performance and safety of autonomous driving.

The application of large language models in the field of autonomous driving covers a wide range of task types, combining depth and breadth with revolutionary potential. The role of LLMs in autonomous driving pipelines is shown in Fig. 2.

3 Application of LLM4AD

In the following sections, we divide existing works based on the perspective of applying LLMs: planning, perception, question answering, and generation. The corresponding taxonomy tree is shown in Fig. 3.
Figure 2: LLMs in autonomous driving pipelines (sensor inputs, token inputs, modalities, and tasks).
3.1 Planning & Control

Large language models (LLMs) have achieved great success with their open-world cognitive and reasoning capabilities [Radford et al., 2018; Radford et al., 2019; Brown et al., 2020; Ouyang et al., 2022; OpenAI, 2023]. These capabilities could provide a transparent explanation of the autonomous driving decision-making process, significantly enhancing system reliability and user trust in the technology [Deruyttere et al., 2019; Kim et al., 2019a; Atakishiyev et al., 2023; Jin et al., 2023a; Malla et al., 2023]. Within this domain, based on whether the LLM is tuned, related research can be categorized into two main types: fine-tuning pre-trained models and prompt engineering.

Fine-tuning pre-trained models

In the application of fine-tuning pre-trained models, MTD-GPT [Liu et al., 2023b] translates multi-task decision-making problems into sequence modeling problems. Through training on a mixed multi-task dataset, it addresses various decision-making tasks at unsignalized intersections. Although this approach outperforms single-task decision-making RL models, the scenes used are limited to unsignalized intersections, which might not be enough to demonstrate the complexity of real-world applications. Driving with LLMs [Chen et al., 2023b] designs an architecture that fuses vectorized inputs into LLMs with a two-stage pretraining and fine-tuning method. Due to the limitation of vectorized representations, their method is only tested in simulation. DriveGPT4 [Xu et al., 2023] presents a multi-modal LLM based on Valley [Luo et al., 2023a] and develops a visual instruction tuning dataset for interpretable autonomous driving. Besides predicting a vehicle's basic control signals, it also responds in real time, explaining why the action is taken. It outperforms baseline models in a variety of QA tasks, while its experiments on planning are simple. GPT-Driver [Mao et al., 2023a] transforms the motion planning task into a language modeling problem. It exceeds UniAD [Hu et al., 2023b] in the L2 metric. Nevertheless, since it uses past speed and acceleration information, there is concern about an unfair comparison with UniAD. Additionally, L2 only reflects the fitting degree of the driving route and might not reflect the driving performance [Dauner et al., 2023].
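As a concrete illustration of the planning-as-language-modeling idea discussed above, the sketch below serializes the ego history and detected objects into a text prompt, queries a generic completion function, and parses waypoints back out. The prompt wording, field names, and the `complete_fn` wrapper are assumptions for illustration; they do not reproduce the prompt format of GPT-Driver or any other cited method.

```python
# Minimal, hypothetical sketch of "planning as language modeling": numeric driving
# context is serialized to text, an LLM completes the prompt, and the answer is
# parsed back into waypoints. All names and formats here are illustrative only.
import re
from typing import Callable, List, Sequence, Tuple

Waypoint = Tuple[float, float]  # (x, y) in the ego frame, meters

def serialize_scene(ego_history: Sequence[Waypoint],
                    objects: Sequence[dict]) -> str:
    """Turn numeric driving context into plain text the LLM can read."""
    hist = "; ".join(f"({x:.1f}, {y:.1f})" for x, y in ego_history)
    objs = "; ".join(
        f"{o['category']} at ({o['x']:.1f}, {o['y']:.1f}) moving {o['speed']:.1f} m/s"
        for o in objects
    )
    return (
        "You are a motion planner for an autonomous vehicle.\n"
        f"Ego trajectory over the last 2 s: {hist}\n"
        f"Detected objects: {objs}\n"
        "Output 6 future waypoints for the next 3 s as lines 'x, y'."
    )

def parse_waypoints(text: str) -> List[Waypoint]:
    """Extract 'x, y' pairs from the LLM answer; ignores any surrounding prose."""
    pairs = re.findall(r"(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)", text)
    return [(float(x), float(y)) for x, y in pairs]

def plan(ego_history, objects, complete_fn: Callable[[str], str]) -> List[Waypoint]:
    """complete_fn wraps whatever LLM backend is available (API or local model)."""
    return parse_waypoints(complete_fn(serialize_scene(ego_history, objects)))

if __name__ == "__main__":
    # Stand-in for an actual LLM call, so the sketch runs end to end.
    fake_llm = lambda prompt: "0.0, 1.2\n0.1, 2.5\n0.1, 3.9\n0.2, 5.3\n0.2, 6.8\n0.3, 8.2"
    demo_history = [(0.0, -4.0), (0.0, -2.0), (0.0, 0.0)]
    demo_objects = [{"category": "car", "x": 3.5, "y": 12.0, "speed": 4.2}]
    print(plan(demo_history, demo_objects, fake_llm))
```

In such a text interface, the serialization and parsing conventions largely determine how robust the planner is to malformed model outputs, which is one reason several of the works below introduce dedicated trajectory tokenizers.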
Agent-Driver [Mao et al., 2023b] leverages LLMs' common sense and robust reasoning capabilities to improve planning by designing a tool library, a cognitive memory, and a reasoning engine. This paradigm achieves better results on the nuScenes dataset. Meanwhile, shortening the inference time is also an urgent problem. DriveLM [Sima et al., 2023] uses a trajectory tokenizer to process ego-trajectory signals into texts, making them belong to the same domain space. Such a tokenizer can be applied to any general vision-language model. Moreover, they utilize graph-structured inference with multiple QA pairs in logical order, thus improving the final planning performance. [Wang et al., 2023c] adapts LLMs as a vehicle "Co-Pilot" for driving, which can accomplish specific driving tasks with human intention satisfied based on the information provided. It lacks verification in complex interaction scenarios. LMDrive [Shao et al., 2023] designs a multi-modal framework to predict the control signal and whether the given instruction is completed. It adopts ResNet [He et al., 2016] as the vision encoder, which has not been through image-text alignment pretraining. In addition, it introduces a benchmark, LangAuto, which includes approximately 64K instruction-following data clips in CARLA. The LangAuto benchmark tests the system's ability to handle complex instructions and challenging driving scenarios. DriveMLM [Wang et al., 2023d] adopts a multi-modal
(Figure 3, excerpt of the taxonomy tree. Tracking: LanguagePrompt [Wu et al., 2023]. Evaluation & Benchmark: On the Road with GPT-4V [Wen et al., 2023b], GPT-4V Takes the Wheel [Huang et al., 2023a], LaMPilot [Ma et al., 2023b], Evaluation of LLMs [Tanahashi et al., 2023], Testing LLMs [Tang et al., 2024], DriveSim [Sreeram et al., 2024], ELM [Zhou et al., 2024], LimSim++ [Fu et al., 2024], OmniDrive [Wang et al., 2024d], AIDE [Liang et al., 2024].)
tic images, exploiting geometric information from 3D annotations by independently encoding road maps, object boxes, and camera parameters for precise, geometry-guided synthesis. This approach effectively solves the challenge of multi-camera view consistency. Although it achieves better performance in terms of generation fidelity compared to BEVGen [Swerdlow et al., 2023] and BEVControl [Yang et al., 2023a], it also faces huge challenges in some complex scenes, such as night views and unseen weather conditions. ADriver-I [Jia et al., 2023a] combines a Multimodal Large Language Model (MLLM) and a Video Diffusion Model (VDM) to predict the control signal of the current frame and the future frames. It shows impressive performance on nuScenes and their private datasets. However, the MLLM and VDM are trained separately, which precludes joint optimization. Driving into the Future [Wang et al., 2023g] develops a multiview world model, named Drive-WM, which is capable of generating high-quality, controllable, and consistent multi-view videos in autonomous driving scenes. It explores the potential application of the world model in end-to-end planning for autonomous driving. ChatScene [Zhang et al., 2024] designs an LLM-based agent that generates and simulates challenging safety-critical scenarios in CARLA, improving the collision avoidance capabilities and robustness of autonomous vehicles. REvolve [Hazra et al., 2024] is an evolutionary framework utilizing GPT-4 to generate and refine reward functions for autonomous driving through human feedback. The reward function is used for RL, and the achieved score closely matches human driving standards. GenAD [Yang et al., 2024] is a large-scale video prediction model for autonomous driving that uses extensive web-sourced data and novel temporal reasoning blocks to handle diverse driving scenarios, generalize to unseen datasets in a zero-shot manner, and adapt for action-conditioned prediction or motion planning. DriveDreamer-2 [Zhao et al., 2024] builds on DriveDreamer with a Large Language Model (LLM) and generates customized, high-quality multi-view driving videos by converting user queries into agent trajectories and HD maps, enhancing training for driving perception methods. ChatSim [Wei et al., 2024] enables editable photo-realistic 3D driving scene simulations via natural language commands with external digital assets; it leverages a large language model agent collaboration framework and novel multi-camera neural radiance field and lighting estimation methods to produce scene-consistent, high-quality outputs. LLM-Assisted Light [Wang et al., 2024a] integrates the human-mimetic reasoning capabilities of LLMs, enabling the signal control algorithm to interpret and respond to complex traffic scenarios with the nuanced judgment typical of human cognition. It develops a closed-loop traffic signal control system, integrating LLMs with a comprehensive suite of interoperable tools. LangProp [Ishida et al., 2024] is a framework that iteratively optimizes code generated by large language models (LLMs) using both supervised and reinforcement learning, automatically evaluating code performance, catching exceptions, and feeding results back to the LLM to improve code generation for autonomous driving in CARLA. These methods explore the customized, authentic generation of autonomous driving data. Although these diffusion-based models achieve good results on video and image generation metrics, it is still unclear whether they could be used in a closed loop to truly boost the performance of the autonomous driving system.
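The generate-evaluate-refine loop described for LangProp (and, in spirit, the reward evolution of REvolve) can be summarized with a small sketch. This is a simplified illustration under assumed interfaces: `llm` is any text-completion callable and `evaluate` is a toy stand-in for closed-loop scoring, whereas the actual frameworks train and score policies in CARLA or AirSim with supervised and reinforcement signals.

```python
# Simplified sketch of an LLM code-refinement loop: generate candidate driving
# code, score it, and feed scores and exceptions back into the next prompt.
from typing import Callable

PROMPT_TEMPLATE = (
    "Write a Python function `drive(obs)` returning (steer, throttle) in [-1, 1] x [0, 1].\n"
    "Observation keys: 'lane_offset_m', 'speed_mps', 'target_speed_mps'.\n"
    "Previous attempt feedback: {feedback}\n"
    "Return only code."
)

def evaluate(policy: Callable[[dict], tuple]) -> float:
    """Toy stand-in for closed-loop evaluation: reward staying centered at target speed."""
    score = 0.0
    obs = {"lane_offset_m": 0.5, "speed_mps": 3.0, "target_speed_mps": 5.0}
    for _ in range(20):
        steer, throttle = policy(obs)
        obs["lane_offset_m"] *= 1.0 - 0.5 * max(min(steer, 1.0), -1.0)   # crude dynamics
        obs["speed_mps"] += 0.5 * max(min(throttle, 1.0), 0.0) - 0.1
        score -= abs(obs["lane_offset_m"]) + 0.1 * abs(obs["speed_mps"] - obs["target_speed_mps"])
    return score

def refine(llm: Callable[[str], str], rounds: int = 3) -> str:
    feedback, best_code, best_score = "none", "", float("-inf")
    for _ in range(rounds):
        code = llm(PROMPT_TEMPLATE.format(feedback=feedback))
        namespace: dict = {}
        try:
            exec(code, namespace)                 # materialize the candidate policy
            score = evaluate(namespace["drive"])  # run it and measure performance
            feedback = f"score={score:.2f}"
        except Exception as err:                  # exceptions become feedback too
            score, feedback = float("-inf"), f"raised {type(err).__name__}: {err}"
        if score > best_score:
            best_code, best_score = code, score
    return best_code

if __name__ == "__main__":
    fake_llm = lambda prompt: (
        "def drive(obs):\n"
        "    steer = 0.8 if obs['lane_offset_m'] > 0 else -0.8\n"
        "    throttle = 1.0 if obs['speed_mps'] < obs['target_speed_mps'] else 0.0\n"
        "    return steer, throttle\n"
    )
    print(refine(fake_llm, rounds=2))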
Metric:
DriveDreamer [Wang et al., 2023e], DriveDreamer-2 [Zhao et al., 2024], DrivingDiffusion [Li et al., 2023c], and GenAD [Yang et al., 2024] use the frame-wise Frechet Inception Distance (FID) [Parmar et al., 2022] to evaluate the quality of generated images and the Frechet Video Distance (FVD) [Unterthiner et al., 2019] for video quality evaluation. DrivingDiffusion also uses mean intersection over union (mIoU) [Rezatofighi et al., 2019] scores for drivable areas and NDS [Yin et al., 2021] for all the object classes by comparing the predicted layout with the ground-truth BEV layout. CTG++ [Zhong et al., 2023], following [Xu et al., 2022; Zhong et al., 2022], uses the failure rate, the Wasserstein distance between normalized histograms of driving profiles, realism deviation (real), and a scene-level realism metric (rel real) as metrics. MagicDrive [Gao et al., 2023] utilizes segmentation metrics such as Road mIoU and Vehicle mIoU [Taran et al., 2018], as well as 3D object detection metrics like mAP [Henderson and Ferrari, 2017] and NDS [Yin et al., 2021]. ADriver-I [Jia et al., 2023a] adopts the L1 error of speed and steering angle of the current frame, the Frechet Inception Distance (FID), and the Frechet Video Distance (FVD) as evaluation indicators. ChatScene [Zhang et al., 2024] provides a thorough evaluation of various scenario generation algorithms; these are assessed based on the collision rate (CR), overall score (OS), and average displacement error (ADE). REvolve [Hazra et al., 2024] adopts fitness scores and episodic steps as metrics in the AirSim simulator [Shah et al., 2017].
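For reference, two of the metrics named above can be written down compactly. The sketch below computes FID from precomputed feature statistics (real evaluations first extract InceptionV3 features from the generated and reference images, which is omitted here) and the average displacement error (ADE) between predicted and ground-truth trajectories; it is an illustrative summary, not the exact evaluation code of any cited work.

```python
# Illustrative formulas only: FID from precomputed feature statistics and ADE
# for trajectories. Real evaluations extract Inception features from images first.
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """FID = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^{1/2}) over (N, D) feature sets."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):          # numerical noise can give tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

def average_displacement_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """ADE: mean Euclidean distance between predicted and ground-truth waypoints (T, 2)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(frechet_inception_distance(rng.normal(size=(256, 64)),
                                     rng.normal(0.1, 1.0, size=(256, 64))))
    print(average_displacement_error(rng.normal(size=(6, 2)), rng.normal(size=(6, 2))))
```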
3.5 Evaluation & Benchmark

In terms of evaluation, On the Road with GPT-4V [Wen et al., 2023b] conducts a comprehensive and multi-faceted evaluation of GPT-4V in various autonomous driving scenarios, including Scenario Understanding, Reasoning, and Acting as a Driver. GPT-4V performs well in scene understanding, intent recognition, and driving decision-making. It is good at handling out-of-distribution situations, can accurately assess the intentions of traffic participants, uses multi-view images to comprehensively perceive the environment, and accurately identifies dynamic interactions between traffic participants. However, GPT-4V still has certain limitations in direction recognition, interpretation of traffic lights, and non-English traffic signs. GPT-4V Takes the Wheel [Huang et al., 2023a] evaluates the potential of GPT-4V for autonomous pedestrian behavior prediction using publicly available datasets. Although GPT-4V has made significant advances in AI capabilities for pedestrian behavior prediction, it still has shortcomings compared with leading traditional domain-specific models.

In terms of benchmarks, LMDrive [Shao et al., 2023] introduces the LangAuto (Language-guided Autonomous Driving) CARLA benchmark. It covers various driving scenarios in 8 towns and takes into account 16 different environmental conditions. It contains three tracks: the LangAuto track (which updates navigation instructions based on location and is divided into sub-tracks of different route lengths), the LangAuto-Notice track (which adds notification instructions on top of navigation instructions), and the LangAuto-Sequential track (which combines consecutive instructions into a single long instruction). In addition, LangAuto uses three main evaluation indicators, route completion, violation score, and driving score, to comprehensively evaluate the autonomous driving system's ability to follow instructions and its driving safety. LingoQA [Marcu et al., 2023] develops LingoQA, which is used for evaluating video question-answering models for autonomous driving. The evaluation system consists of three main parts: a GPT-4-based evaluation, which determines whether the model's answers are consistent with human answers; the Lingo-Judge metric, which evaluates the accuracy of the model's answers using a trained text classifier called Lingo-Judge; and correlation analysis with human ratings. This analysis involves multiple human annotators rating responses from 17 different models on a scale of 0 to 1, interpreted as the likelihood that the response accurately solves the problem. Reason2Drive [Nie et al., 2023] introduces a protocol to measure the correctness of reasoning chains in order to resolve semantic ambiguities. The evaluation process includes four key metrics: Reasoning Alignment, which measures the extent of overlap in logical reasoning; Redundancy, aimed at identifying any repetitive steps; Missing Step, which focuses on pinpointing any crucial steps that are absent but necessary for problem-solving; and Strict Reason, which evaluates scenarios involving visual elements.
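GPT-based grading of free-form answers, as in the LingoQA evaluation described above, boils down to prompting a judge model with the question, a reference answer, and the candidate answer, then parsing a score. The sketch below is a hypothetical version of that loop; the rubric wording, the `ask_llm` wrapper, and the score parsing are assumptions and do not reproduce LingoQA's prompts or the trained Lingo-Judge classifier.

```python
# Hypothetical LLM-as-judge grader for free-form driving QA answers.
# `ask_llm` is any chat/completion wrapper; prompt and parsing are illustrative.
import re
from statistics import mean
from typing import Callable, Dict, List

JUDGE_PROMPT = (
    "You are grading an answer from a driving question-answering model.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {candidate}\n"
    "Reply with a single number between 0 and 1: the probability that the model "
    "answer correctly resolves the question, judged against the reference."
)

def judge_one(item: Dict[str, str], ask_llm: Callable[[str], str]) -> float:
    reply = ask_llm(JUDGE_PROMPT.format(**item))
    match = re.search(r"[01](?:\.\d+)?", reply)      # grab the first 0-1 number
    return float(match.group()) if match else 0.0

def judge_benchmark(items: List[Dict[str, str]], ask_llm) -> float:
    """Return the mean judged correctness over a list of QA items."""
    return mean(judge_one(item, ask_llm) for item in items)

if __name__ == "__main__":
    demo = [{"question": "Why is the ego car slowing down?",
             "reference": "A pedestrian is crossing ahead.",
             "candidate": "Because a pedestrian is stepping onto the crosswalk."}]
    print(judge_benchmark(demo, ask_llm=lambda prompt: "0.9"))
```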
LaMPilot [Ma et al., 2023b] is a benchmark used to evaluate the instruction execution capabilities of autonomous vehicles, including three parts: a simulator, a dataset, and an evaluator. It employs Python Language Model Programs (LMPs) to interpret human-annotated driving instructions and to execute and evaluate them within its framework. [Tanahashi et al., 2023] evaluates two core capabilities of large language models (LLMs) in the field of autonomous driving: first, spatial-awareness decision-making, that is, whether LLMs can accurately identify the spatial layout based on coordinate information; and second, the ability to follow traffic rules, that is, whether LLMs strictly abide by traffic laws while driving. [Tang et al., 2024] tests three OpenAI LLMs and several other LLMs on UK Driving Theory Test practice questions and answers, and only GPT-4o passed the test, indicating that the performance of LLMs still needs to be further improved. [Kong et al., 2024] develops an LLM-based safe autonomous driving framework, which evaluates and enhances the performance of existing LLM-AD methods in driving safety, sensitive data usage, token consumption, and alignment scenarios by integrating security assessment agents. DriveSim [Sreeram et al., 2024] is a specialized simulator that creates diverse driving scenarios to test and benchmark MLLMs' understanding and reasoning of real-world driving scenes from a fixed in-car camera perspective. OmniDrive [Wang et al., 2024d] introduces a comprehensive benchmark for visual question-answering tasks related to 3D driving, ensuring strong alignment between agent models and driving tasks through scene description, traffic regulation, 3D grounding, counterfactual reasoning, decision-making, and planning. AIDE [Liang et al., 2024] proposes an automatic data engine design paradigm, which features automatic data query and labeling using VLMs and continuous learning with pseudo-labels. It also introduces a new benchmark to evaluate such automatic data engines for self-driving car perception, providing comprehensive insights across multiple paradigms including open-vocabulary detection, semi-supervised learning, and continuous learning. ELM [Zhou et al., 2024] is proposed to understand driving scenes over long-scope space and time, showing promising generalization performance in handling complex driving scenarios. LimSim++ [Fu et al., 2024] introduces an open-source evaluation platform for (M)LLMs in autonomous driving, supporting scenario understanding, decision-making, and evaluation systems.
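Several of the evaluations above, such as the driving-theory-test study of [Tang et al., 2024], reduce to scoring an LLM on multiple-choice items. A minimal, hypothetical harness of that kind is sketched below; the item format, the example question, and the `ask_llm` wrapper are illustrative assumptions rather than the protocol of the cited work.

```python
# Hypothetical multiple-choice evaluation harness for LLMs on driving-rule questions.
# The items below are invented examples; `ask_llm` wraps whatever model is tested.
from typing import Callable, Dict, List

def format_item(item: Dict) -> str:
    options = "\n".join(f"{label}. {text}" for label, text in item["options"].items())
    return (f"{item['question']}\n{options}\n"
            "Answer with the single letter of the correct option.")

def first_choice_letter(reply: str, valid: set) -> str:
    """Take the first character in the reply that is a valid option letter."""
    return next((ch.upper() for ch in reply if ch.upper() in valid), "")

def accuracy(items: List[Dict], ask_llm: Callable[[str], str]) -> float:
    correct = 0
    for item in items:
        reply = ask_llm(format_item(item))
        correct += first_choice_letter(reply, set(item["options"])) == item["answer"]
    return correct / len(items)

if __name__ == "__main__":
    demo_items = [{
        "question": "What should you do when approaching a zebra crossing with a waiting pedestrian?",
        "options": {"A": "Speed up to pass first", "B": "Stop and give way", "C": "Sound the horn"},
        "answer": "B",
    }]
    print(accuracy(demo_items, ask_llm=lambda prompt: "B"))
```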
4 Datasets in LLM4AD

Traditional datasets such as the nuScenes dataset [Caesar et al., 2019; Fong et al., 2021] lack the action descriptions [Lu et al., 2024], detailed captions, and question-answering pairs that are used to interact with LLMs. The BDD-X [Kim et al., 2018], Rank2Tell [Sachdeva et al., 2023], DriveLM [Sima et al., 2023], DRAMA [Malla et al., 2023], NuPrompt [Wu et al., 2023], and NuScenes-QA [Qian et al., 2023] datasets represent key developments in LLM4AD research, each bringing unique contributions to understanding agent behaviors and urban traffic dynamics through extensive, diverse, and situation-rich annotations. We give a summary of each dataset in Table 1 and detailed descriptions below.

BDD-X Dataset [Kim et al., 2018]: With over 77 hours of diverse driving conditions captured in 6,970 videos, this dataset is a collection of real-world driving behaviors, each annotated with descriptions and explanations. It includes 26K activities across 8.4M frames and thus provides a resource for understanding and predicting driver behaviors across different conditions.
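The language-augmented datasets described in this section typically attach free-form text (actions, justifications, QA pairs, risk objects) to timestamped clips. The snippet below shows one plausible way to type and load such records; the schema and field names are a hypothetical illustration loosely inspired by BDD-X-style annotations, not the actual format of any dataset listed here.

```python
# Hypothetical record layout for a language-annotated driving clip; the field
# names are illustrative only and do not match any specific dataset's schema.
import json
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AnnotatedClip:
    clip_id: str
    video_path: str
    start_s: float
    end_s: float
    action: str                 # e.g. "the car slows to a stop"
    justification: str          # e.g. "because the light ahead turned red"
    qa_pairs: List[dict]        # optional question-answer annotations
    risk_object_box: Optional[List[float]] = None  # [x1, y1, x2, y2] if labeled

def load_clips(jsonl_path: str) -> List[AnnotatedClip]:
    """Read one JSON object per line and build typed records."""
    clips = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                clips.append(AnnotatedClip(**json.loads(line)))
    return clips

if __name__ == "__main__":
    example = AnnotatedClip(
        clip_id="demo_0001", video_path="clips/demo_0001.mp4",
        start_s=12.0, end_s=16.0,
        action="the car merges into the left lane",
        justification="because the right lane is closed for construction",
        qa_pairs=[{"q": "Why did the car change lanes?", "a": "The right lane was closed."}],
    )
    print(json.dumps(example.__dict__, indent=2))
```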
Honda Research Institute-Advice Dataset (HAD) [Kim et al., 2019b]: HAD offers 30 hours of driving video data paired with natural language advice, with the videos integrated with CAN-bus signal data. The advice includes goal-oriented advice (top-down signals), which is designed to guide the vehicle in a navigation task, and stimulus-driven advice (bottom-up signals), which highlights specific visual cues that the user expects the vehicle controller to actively focus on.

Talk2Car [Deruyttere et al., 2019]: The Talk2Car dataset contains 11,959 commands for the 850 videos of the nuScenes [Caesar et al., 2019; Fong et al., 2021] training set, with 3D bounding box annotations. Of the commands, 55.94% were from videos recorded in Boston, while 44.06% were from Singapore. On average, each command contains 11.01 words, including 2.32 nouns, 2.29 verbs, and 0.62 adjectives. Typically, there are about 14.07 commands in every video. It is an object-referral dataset that contains commands written in natural language for self-driving cars.

DriveLM Dataset [Sima et al., 2023]: This dataset integrates human-like reasoning into autonomous driving systems, enhancing Perception, Prediction, and Planning (P3). It employs a "Graph-of-Thought" structure, encouraging a futuristic approach through "What if" scenarios, thereby promoting advanced, logic-based reasoning and decision-making mechanisms in driving systems.

DRAMA Dataset [Malla et al., 2023]: Collected from Tokyo's streets, it includes 17,785 scenario clips captured using a video camera, each clipped to 2 seconds in duration. It contains different annotations: video-level Q/A, object-level Q/A, risk object bounding boxes, free-form captions, and separate labels for ego-car intention, a scene classifier, and suggestions to the driver.

Rank2Tell Dataset [Sachdeva et al., 2023]: It is captured from a moving vehicle in highly interactive traffic scenes in the San Francisco Bay Area. It includes 116 clips (about 20 s each) at 10 FPS, captured using an instrumented vehicle equipped with three Point Grey Grasshopper video cameras with a resolution of 1920 × 1200 pixels, a Velodyne HDL-64E S2 LiDAR sensor, and high-precision GPS. The dataset includes video-level Q/A, object-level Q/A, LiDAR and 3D bounding boxes (with tracking), the field of view from the three cameras (stitched), important object bounding boxes (multiple important objects per frame with multiple levels of importance: High, Medium, Low), free-form captions (multiple captions per object for multiple objects), and ego-car intention.

NuPrompt Dataset [Wu et al., 2023]: It represents an expansion of the nuScenes dataset, enriched with annotated language prompts specifically designed for driving scenes. This dataset includes 35,367 language prompts for 3D objects, averaging 5.3 instances per object. This annotation enhances the dataset's practicality in autonomous driving testing and training, particularly in complex scenarios requiring linguistic processing and comprehension.
NuScenes-QA Dataset [Qian et al., 2023]: It is a question-answering dataset for autonomous driving, containing 459,941 question-answer pairs from 34,149 distinct visual scenes. They are partitioned into 376,604 questions from 28,130 scenes for training and 83,337 questions from 6,019 scenes for testing. NuScenes-QA showcases a wide array of question lengths, reflecting different complexity levels and making it challenging for AI models. Beyond sheer numbers, the dataset ensures a balanced range of question types and categories, from identifying objects to assessing their behavior, such as whether they are moving or parked. This design inhibits the model's tendency to be biased or to rely on linguistic shortcuts.

Reason2Drive [Nie et al., 2023]: It consists of the nuScenes, Waymo, and ONCE datasets with 600,000 video-text pairs labeled by humans and GPT-4. It provides a detailed representation of the driving scene through a unique automatic annotation mode, capturing various elements such as object types, visual and kinematic attributes, and their relationship to the ego vehicle. It has been enhanced with GPT-4 to include complex question-answer pairs and detailed reasoning narratives.

LingoQA [Marcu et al., 2023]: This dataset is a large-scale, diverse collection for autonomous driving, containing approximately 419,000 question-answer pairs covering both action and scenery subsets. It provides rich information about driving behavior, environmental perception, and road conditions through high-quality videos and detailed annotations. It features complex questions and free-form answers, leveraging GPT-3.5/4 to enhance the diversity and depth of the content. The driving capabilities covered include actions, reasons, attention, recognition, positioning, etc., which are particularly suitable for improving the understanding and decision-making capabilities of autonomous driving systems.

NuInstruct [Ding et al., 2024]: It is a dataset featuring 91K multi-view video-QA pairs spanning 17 subtasks, each requiring comprehensive information such as temporal, multi-view, and spatial data, thereby significantly raising the complexity of the challenges. It develops a SQL-based method that automatically generates instruction-response pairs, inspired by the logical progression of human decision-making.

OpenDV-2K [Yang et al., 2024]: This dataset is a large-scale multimodal dataset for autonomous driving, comprising 2059 hours of curated driving videos, including 1747 hours from YouTube and 312 hours from public datasets, with automatically generated language annotations to support generalized video prediction model training.

5 Conclusion

In this paper, we have provided a comprehensive survey on LLM4AD. We classify and introduce different applications employing LLMs for autonomous driving and summarize the representative approaches in each category. At the same time, we summarize the latest datasets related to LLM4AD. We will continue to monitor developments in the field and highlight future research directions.

Ethical Statement

When applying LLMs to the field of autonomous driving, we must deeply consider their potential ethical implications. First, model hallucination may cause the vehicle to misunderstand the external environment or traffic conditions, thus causing safety hazards. Second, model discrimination and bias may lead to vehicles making unfair or biased decisions in different environments or when facing different groups. Additionally, false information and errors in reasoning can cause a vehicle to adopt inappropriate or dangerous driving behaviors. Inductive advice may leave the vehicle vulnerable to external interference or malicious behavior. Finally, privacy leakage is also a serious issue, as vehicles may inadvertently reveal sensitive information about the user or the surrounding environment. To sum up, we strongly recommend that before deploying a large language model to an autonomous driving system, an in-depth and detailed ethical review should be conducted to ensure that its decision-making logic is not only technically accurate but also ethically appropriate. At the same time, we call for following the principles of transparency, responsibility, and fairness to ensure the ethics and safety of technology applications. We call on the entire community to work together to ensure the reliable and responsible deployment of autonomous driving technology based on large language models.

Acknowledgments

This work was partly supported by NSFC (92370201, 62222607, 61972250) and the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102).

References

[Anderson et al., 2016] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation, 2016.

[Atakishiyev et al., 2023] Shahin Atakishiyev, Mohammad Salameh, Hengshuai Yao, and Randy Goebel. Explainable artificial intelligence for autonomous driving: A comprehensive overview and field guide for future research directions, 2023.

[Azarafza et al., 2024] Mehdi Azarafza, Mojtaba Nayyeri, Charles Steinmetz, Steffen Staab, and Achim Rettberg. Hybrid reasoning based on large language models for autonomous car driving. arXiv preprint arXiv:2402.13602, 2024.

[Banerjee and Lavie, 2005] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare Voss, editors, Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65-72, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.

[Bashar et al., 2022] Mk Bashar, Samia Islam, Kashifa Kawaakib Hussain, Md. Bakhtiar Hasan,
A. B. M. Ashikur Rahman, and Md. Hasanul Kabir. Danny Birch, Daniel Maund, and Jamie Shotton. Driv-
Multiple object tracking in recent times: A literature ing with llms: Fusing object-level vector modality for
review, 2022. explainable autonomous driving, 2023.
[Bernardin and Stiefelhagen, 2008] Keni Bernardin and [Chen et al., 2024] Yuan Chen, Zi-han Ding, Ziqin Wang,
Rainer Stiefelhagen. Evaluating multiple object tracking Yan Wang, Lijun Zhang, and Si Liu. Asynchronous large
performance: The clear mot metrics. EURASIP J. Image language model enhanced planner for autonomous driving.
Video Process., 2008, 2008. arXiv preprint arXiv:2406.14556, 2024.
[Bommasani et al., 2021] Rishi Bommasani, Drew A Hud- [Cheng et al., 2022] Bowen Cheng, Ishan Misra, Alexan-
son, Ehsan Adeli, Russ Altman, Simran Arora, Syd- der G. Schwing, Alexander Kirillov, and Rohit Girdhar.
ney von Arx, Michael S Bernstein, Jeannette Bohg, An- Masked-attention mask transformer for universal image
toine Bosselut, Emma Brunskill, et al. On the oppor- segmentation. 2022.
tunities and risks of foundation models. arXiv preprint [Chitta et al., 2023] Kashyap Chitta, Aditya Prakash, Bern-
arXiv:2108.07258, 2021. hard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger.
[Brohan et al., 2023a] Anthony Brohan, Noah Brown, Jus- Transfuser: Imitation with transformer-based sensor fu-
tice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof sion for autonomous driving. IEEE Pattern Analysis and
Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Machine Intelligence (PAMI), 2023.
Chelsea Finn, et al. Rt-2: Vision-language-action models [Cui et al., 2023a] Can Cui, Yunsheng Ma, Xu Cao, Wen-
transfer web knowledge to robotic control. arXiv preprint qian Ye, and Ziran Wang. Drive as you speak: En-
arXiv:2307.15818, 2023. abling human-like interaction with large language models
[Brohan et al., 2023b] Anthony Brohan, Noah Brown, Jus- in autonomous vehicles. arXiv preprint arXiv:2309.10228,
tice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof 2023.
Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, [Cui et al., 2023b] Can Cui, Yunsheng Ma, Xu Cao, Wen-
Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gon- qian Ye, and Ziran Wang. Receive, reason, and re-
zalez Arenas, Keerthana Gopalakrishnan, Kehang Han, act: Drive as you say with large language models in
Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian autonomous vehicles. arXiv preprint arXiv:2310.08034,
Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry 2023.
Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, [Cui et al., 2023c] Can Cui, Zichong Yang, Yupeng Zhou,
Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Yunsheng Ma, Juanwu Lu, and Ziran Wang. Large lan-
Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, guage models for autonomous driving: Real-world exper-
Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag iments, 2023.
Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh,
Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan [Dauner et al., 2023] Daniel Dauner, Marcel Hallgarten, An-
Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, dreas Geiger, and Kashyap Chitta. Parting with miscon-
Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe ceptions about learning-based vehicle motion planning. In
Yu, and Brianna Zitkovich. Rt-2: Vision-language-action CoRL, 2023.
models transfer web knowledge to robotic control, 2023. [Deo et al., 2021] Nachiket Deo, Eric M. Wolff, and Oscar
[Brown et al., 2020] Tom Brown, Benjamin Mann, Nick Ry- Beijbom. Multimodal trajectory prediction conditioned on
der, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- lane-graph traversals, 2021.
wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, [Deruyttere et al., 2019] Thierry Deruyttere, Simon Vanden-
Amanda Askell, et al. Language models are few-shot hende, Dusan Grujicic, Luc Van Gool, and Marie-Francine
learners. Advances in neural information processing sys- Moens. Talk2car: Taking control of your self-driving
tems, 33:1877–1901, 2020. car. In Proceedings of the 2019 Conference on Empir-
[Caesar et al., 2019] Holger Caesar, Varun Bankiti, Alex H. ical Methods in Natural Language Processing and the
Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush 9th International Joint Conference on Natural Language
Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. Processing (EMNLP-IJCNLP). Association for Computa-
nuscenes: A multimodal dataset for autonomous driving. tional Linguistics, 2019.
arXiv preprint arXiv:1903.11027, 2019. [Devlin et al., 2018] Jacob Devlin, Ming-Wei Chang, Ken-
[Chen and Krähenbühl, 2022] Dian Chen and Philipp ton Lee, and Kristina Toutanova. Bert: Pre-training of
Krähenbühl. Learning from all vehicles. In CVPR, 2022. deep bidirectional transformers for language understand-
ing. arXiv preprint arXiv:1810.04805, 2018.
[Chen et al., 2023a] Li Chen, Penghao Wu, Kashyap Chitta,
[Dewangan et al., 2023] Vikrant Dewangan, Tushar Choud-
Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-
hary, Shivam Chandhok, Shubham Priyadarshan, Anushka
to-end autonomous driving: Challenges and frontiers.
Jain, Arun K. Singh, Siddharth Srivastava, Krishna Murthy
arXiv preprint arXiv:2306.16927, 2023.
Jatavallabhula, and K. Madhava Krishna. Talk2bev:
[Chen et al., 2023b] Long Chen, Oleg Sinavski, Jan Language-enhanced bird’s-eye view maps for autonomous
Hünermann, Alice Karnsund, Andrew James Willmott, driving, 2023.
[Ding et al., 2023] Xinpeng Ding, Jianhua Han, Hang Xu, efficient vision-language models for question answering
Wei Zhang, and Xiaomeng Li. Hilm-d: Towards high- in autonomous driving. arXiv preprint arXiv:2403.19838,
resolution understanding in multimodal large language 2024.
models for autonomous driving, 2023. [Han et al., 2024] Wencheng Han, Dongqian Guo, Cheng-
[Ding et al., 2024] Xinpeng Ding, Jianhua Han, Hang Xu, Zhong Xu, and Jianbing Shen. Dme-driver: Integrat-
Xiaodan Liang, Wei Zhang, and Xiaomeng Li. Holistic ing human decision logic and 3d scene perception in
autonomous driving understanding by bird’s-eye-view in- autonomous driving. arXiv preprint arXiv:2401.03641,
jected multi-modal large models. In Proceedings of the 2024.
IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 13668–13677, 2024. [Hazra et al., 2024] Rishi Hazra, Alkis Sygkounas, Andreas
Persson, Amy Loutfi, and Pedro Zuidberg Dos Martires.
[Driess et al., 2023] Danny Driess, Fei Xia, Mehdi S. M. Revolve: Reward evolution with large language models
Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian for autonomous driving. arXiv preprint arXiv:2406.01309,
Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, 2024.
Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Ser-
manet, Daniel Duckworth, Sergey Levine, Vincent Van- [He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing
houcke, Karol Hausman, Marc Toussaint, Klaus Greff, Ren, and Jian Sun. Deep residual learning for image recog-
Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: nition. In Proceedings of the IEEE Conference on Com-
An embodied multimodal language model, 2023. puter Vision and Pattern Recognition (CVPR), June 2016.
[Ettinger et al., 2021] Scott Ettinger, Shuyang Cheng, Ben- [Henderson and Ferrari, 2017] Paul Henderson and Vittorio
jamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Ferrari. End-to-end training of object class detectors for
Yuning Chai, Ben Sapp, Charles Qi, Yin Zhou, Zoey Yang, mean average precision, 2017.
Aurelien Chouard, Pei Sun, Jiquan Ngiam, Vijay Vasude- [Ho et al., 2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel.
van, Alexander McCauley, Jonathon Shlens, and Dragomir Denoising diffusion probabilistic models, 2020.
Anguelov. Large scale interactive motion forecasting for
autonomous driving : The waymo open motion dataset, [Höfer et al., 2021] Sebastian Höfer, Kostas Bekris, Ankur
2021. Handa, Juan Camilo Gamboa, Melissa Mozifian, Florian
[Fong et al., 2021] Whye Kit Fong, Rohit Mohan, Juana Va- Golemo, Chris Atkeson, Dieter Fox, Ken Goldberg, John
leria Hurtado, Lubing Zhou, Holger Caesar, Oscar Bei- Leonard, et al. Sim2real in robotics and automation: Ap-
jbom, and Abhinav Valada. Panoptic nuscenes: A large- plications and challenges. IEEE transactions on automa-
scale benchmark for lidar panoptic segmentation and tion science and engineering, 18(2):398–400, 2021.
tracking. arXiv preprint arXiv:2109.03805, 2021. [Hu et al., 2023a] Anthony Hu, Lloyd Russell, Hudson Yeo,
[Fu et al., 2023a] Daocheng Fu, Xin Li, Licheng Wen, Min Zak Murez, George Fedoseev, Alex Kendall, Jamie Shot-
Dou, Pinlong Cai, Botian Shi, and Yu Qiao. Drive like ton, and Gianluca Corrado. Gaia-1: A generative
a human: Rethinking autonomous driving with large lan- world model for autonomous driving. arXiv preprint
guage models, 2023. arXiv:2309.17080, 2023.
[Fu et al., 2023b] Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, [Hu et al., 2023b] Yihan Hu, Jiazhi Yang, Li Chen, Keyu
and Pengfei Liu. Gptscore: Evaluate as you desire, 2023. Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du,
Tianwei Lin, Wenhai Wang, et al. Planning-oriented
[Fu et al., 2024] Daocheng Fu, Wenjie Lei, Licheng Wen,
autonomous driving. In Proceedings of the IEEE/CVF
Pinlong Cai, Song Mao, Min Dou, Botian Shi, and
Conference on Computer Vision and Pattern Recognition,
Yu Qiao. Limsim++: A closed-loop platform for deploy-
pages 17853–17862, 2023.
ing multimodal llms in autonomous driving. 2024.
[Gao et al., 2023] Ruiyuan Gao, Kai Chen, Enze Xie, Lan- [Hu et al., 2024] Senkang Hu, Zhengru Fang, Zihan Fang,
qing Hong, Zhenguo Li, Dit-Yan Yeung, and Qiang Xu. Xianhao Chen, and Yuguang Fang. Agentscodriver: Large
Magicdrive: Street view generation with diverse 3d geom- language model empowered collaborative driving with
etry control, 2023. lifelong learning. arXiv preprint arXiv:2404.06345, 2024.
[Gilles et al., 2021] Thomas Gilles, Stefano Sabatini, [Huang et al., 2023a] Jia Huang, Peng Jiang, Alvika Gau-
Dzmitry Tsishkou, Bogdan Stanciulescu, and Fabien tam, and Srikanth Saripalli. Gpt-4v takes the wheel: Eval-
Moutarde. Gohome: Graph-oriented heatmap output for uating promise and challenges for pedestrian behavior pre-
future motion estimation, 2021. diction, 2023.
[Goodfellow et al., 2014] Ian J. Goodfellow, Jean Pouget- [Huang et al., 2023b] Junchao Huang, Xiaoqi He, and Sheng
Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Zhao. The detection and rectification for identity-switch
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Gen- based on unfalsified control, 2023.
erative adversarial networks, 2014. [Ishida et al., 2024] Shu Ishida, Gianluca Corrado, George
[Gopalkrishnan et al., 2024] Akshay Gopalkrishnan, Ross Fedoseev, Hudson Yeo, Lloyd Russell, Jamie Shotton,
Greer, and Mohan Trivedi. Multi-frame, lightweight & Joao F. Henriques, and Anthony Hu. Langprop: A code
optimization framework using large language models ap- and Jingjing Liu. Adapt: Action-aware driving caption
plied to driving. In ICLR 2024 Workshop on Large Lan- transformer, 2023.
guage Model (LLM) Agents, 2024. [Jin et al., 2023b] Ye Jin, Xiaoxi Shen, Huiling Peng, Xi-
[Jain et al., 2021] Ashesh Jain, Luca Del Pero, Hugo Grim- aoan Liu, Jingli Qin, Jiayang Li, Jintao Xie, Peizhong Gao,
mett, and Peter Ondruska. Autonomy 2.0: Why is Guyue Zhou, and Jiangtao Gong. Surrealdriver: Design-
self-driving always 5 years away? arXiv preprint ing generative driver agent simulation framework in urban
arXiv:2107.08142, 2021. contexts based on large language model, 2023.
[Jia et al., 2021] Xiaosong Jia, Liting Sun, Masayoshi [Keysan et al., 2023] Ali Keysan, Andreas Look, Eitan Kos-
Tomizuka, and Wei Zhan. Ide-net: Interactive driving man, Gonca Gürsun, Jörg Wagner, Yu Yao, and Barbara
event and pattern extraction from human data. IEEE Rakitsch. Can you text what is happening? integrating pre-
Robotics and Automation Letters, 6(2):3065–3072, 2021. trained language encoders into trajectory prediction mod-
[Jia et al., 2022a] Xiaosong Jia, Li Chen, Penghao Wu, Jia els for autonomous driving, 2023.
Zeng, Junchi Yan, Hongyang Li, and Yu Qiao. Towards [Khachatryan et al., 2023] Levon Khachatryan, Andranik
capturing the temporal dynamics for trajectory prediction: Movsisyan, Vahram Tadevosyan, Roberto Henschel,
a coarse-to-fine approach. In CoRL, 2022. Zhangyang Wang, Shant Navasardyan, and Humphrey
[Jia et al., 2022b] Xiaosong Jia, Liting Sun, Hang Zhao, Shi. Text2video-zero: Text-to-image diffusion models are
Masayoshi Tomizuka, and Wei Zhan. Multi-agent trajec- zero-shot video generators, 2023.
tory prediction by combining egocentric and allocentric [Kim et al., 2018] Jinkyu Kim, Anna Rohrbach, Trevor Dar-
views. In Conference on Robot Learning, pages 1434– rell, John Canny, and Zeynep Akata. Textual explanations
1443. PMLR, 2022. for self-driving vehicles. Proceedings of the European
[Jia et al., 2023a] Fan Jia, Weixin Mao, Yingfei Liu, Conference on Computer Vision (ECCV), 2018.
Yucheng Zhao, Yuqing Wen, Chi Zhang, Xiangyu Zhang, [Kim et al., 2019a] Jinkyu Kim, Teruhisa Misu, Yi-Ting
and Tiancai Wang. Adriver-i: A general world model for Chen, Ashish Tawari, and John Canny. Grounding human-
autonomous driving, 2023. to-vehicle advice for self-driving vehicles, 2019.
[Jia et al., 2023b] Xiaosong Jia, Yulu Gao, Li Chen, Junchi [Kim et al., 2019b] Jinkyu Kim, Teruhisa Misu, Yi-Ting
Yan, Patrick Langechuan Liu, and Hongyang Li. Chen, Ashish Tawari, and John Canny. Grounding human-
Driveadapter: Breaking the coupling barrier of perception to-vehicle advice for self-driving vehicles. In The IEEE
and planning in end-to-end autonomous driving, 2023. Conference on Computer Vision and Pattern Recognition
[Jia et al., 2023c] Xiaosong Jia, Penghao Wu, Li Chen, (CVPR), 2019.
Yu Liu, Hongyang Li, and Junchi Yan. Hdgt: Heteroge- [Kingma and Welling, 2022] Diederik P Kingma and Max
neous driving graph transformer for multi-agent trajectory Welling. Auto-encoding variational bayes, 2022.
prediction via scene encoding. IEEE Transactions on Pat- [Kong et al., 2024] Xiangrui Kong, Thomas Braunl, Marco
tern Analysis and Machine Intelligence (TPAMI), 2023. Fahmi, and Yue Wang. A superalignment framework in
[Jia et al., 2023d] Xiaosong Jia, Penghao Wu, Li Chen, autonomous driving with large language models. arXiv
Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. preprint arXiv:2406.05651, 2024.
Think twice before driving: Towards scalable decoders for [Leurent, 2018] Edouard Leurent. An environment for au-
end-to-end autonomous driving, 2023. tonomous driving decision-making. https://github.com/
[Jia et al., 2024a] Xiaosong Jia, Shaoshuai Shi, Zijun Chen, eleurent/highway-env, 2018.
Li Jiang, Wenlong Liao, Tao He, and Junchi Yan. Amp: [Li et al., 2022a] Hongyang Li, Chonghao Sima, Jifeng Dai,
Autoregressive motion prediction revisited with next to- Wenhai Wang, Lewei Lu, Huijie Wang, Jia Zeng, Zhiqi Li,
ken prediction for autonomous driving. arXiv preprint Jiazhi Yang, Hanming Deng, Hao Tian, Enze Xie, Jiang-
arXiv:2403.13331, 2024. wei Xie, Li Chen, Tianyu Li, Yang Li, Yulu Gao, Xiaosong
[Jia et al., 2024b] Xiaosong Jia, Zhenjie Yang, Qifeng Li, Jia, Si Liu, Jianping Shi, Dahua Lin, and Yu Qiao. Delv-
Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards ing into the devils of bird’s-eye-view perception: A review,
multi-ability benchmarking of closed-loop end-to-end au- evaluation and recipe. arXiv preprint arXiv:2209.05324,
tonomous driving. arXiv preprint arXiv:2406.03877, 2022.
2024. [Li et al., 2022b] Junnan Li, Dongxu Li, Caiming Xiong,
[Jiang et al., 2024] Kemou Jiang, Xuan Cai, Zhiyong Cui, and Steven Hoi. Blip: Bootstrapping language-image
Aoyong Li, Yilong Ren, Haiyang Yu, Hao Yang, pre-training for unified vision-language understanding and
Daocheng Fu, Licheng Wen, and Pinlong Cai. Koma: generation, 2022.
Knowledge-driven multi-agent framework for autonomous [Li et al., 2022c] Zhiqi Li, Wenhai Wang, Hongyang Li,
driving with large language models. arXiv preprint Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng
arXiv:2407.14239, 2024. Dai. Bevformer: Learning bird’s-eye-view representation
[Jin et al., 2023a] Bu Jin, Xinyu Liu, Yupeng Zheng, Pengfei from multi-camera images via spatiotemporal transform-
Li, Hao Zhao, Tong Zhang, Yuhang Zheng, Guyue Zhou, ers. In ECCV, pages 1–18. Springer, 2022.
[Li et al., 2023a] Junnan Li, Dongxu Li, Silvio Savarese, and [Lu et al., 2024] Han Lu, Xiaosong Jia, Yichen Xie, Wen-
Steven Hoi. Blip-2: Bootstrapping language-image pre- long Liao, Xiaokang Yang, and Junchi Yan. Ac-
training with frozen image encoders and large language tivead: Planning-oriented active learning for end-to-end
models, 2023. autonomous driving, 2024.
[Li et al., 2023b] Tianyu Li, Li Chen, Huijie Wang, Yang [Luo et al., 2018] Wenjie Luo, Bin Yang, and Raquel Urta-
Li, Jiazhi Yang, Xiangwei Geng, Shengyin Jiang, Yuting sun. Fast and furious: Real time end-to-end 3d detec-
Wang, Hang Xu, Chunjing Xu, Junchi Yan, Ping Luo, and tion, tracking and motion forecasting with a single con-
Hongyang Li. Graph-based topology reasoning for driving volutional net. In Proceedings of the IEEE conference on
scenes. arXiv preprint arXiv:2304.05277, 2023. Computer Vision and Pattern Recognition, pages 3569–
3577, 2018.
[Li et al., 2023c] Xiaofan Li, Yifu Zhang, and Xiaoqing Ye.
Drivingdiffusion: Layout-guided multi-view driving scene [Luo et al., 2023a] Ruipu Luo, Ziwang Zhao, Min Yang,
video generation with latent diffusion model. arXiv Junwei Dong, Da Li, Pengcheng Lu, Tao Wang, Linmei
preprint arXiv:2310.07771, 2023. Hu, Minghui Qiu, and Zhongyu Wei. Valley: Video assis-
tant with large language model enhanced ability, 2023.
[Li et al., 2024a] Boyi Li, Yue Wang, Jiageng Mao, Boris
[Luo et al., 2023b] Zhengxiong Luo, Dayou Chen, Yingya
Ivanovic, Sushant Veer, Karen Leung, and Marco Pavone.
Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao,
Driving everywhere with large language model policy
Jingren Zhou, and Tieniu Tan. Videofusion: Decomposed
adaptation. In Proceedings of the IEEE/CVF Conference
diffusion models for high-quality video generation, 2023.
on Computer Vision and Pattern Recognition (CVPR),
pages 14948–14957, June 2024. [Ma et al., 2023a] Yingzi Ma, Yulong Cao, Jiachen Sun,
Marco Pavone, and Chaowei Xiao. Dolphins: Multimodal
[Li et al., 2024b] Qifeng Li, Xiaosong Jia, Shaobo Wang, language model for driving, 2023.
and Junchi Yan. Think2drive: Efficient reinforcement
learning by thinking in latent world model for quasi- [Ma et al., 2023b] Yunsheng Ma, Can Cui, Xu Cao, Wen-
realistic autonomous driving (in carla-v2). In ECCV, 2024. qian Ye, Peiran Liu, Juanwu Lu, Amr Abdelraouf, Rohit
Gupta, Kyungtae Han, Aniket Bera, James M. Rehg, and Ziran Wang. Lampilot: An open benchmark dataset for autonomous driving with language model programs, 2023.
[Liang et al., 2020] Ming Liang, Bin Yang, Wenyuan Zeng, Yun Chen, Rui Hu, Sergio Casas, and Raquel Urtasun. Pnpnet: End-to-end perception and prediction with tracking in the loop. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11553–11562, 2020.
[Liang et al., 2024] Mingfu Liang, Jong-Chyi Su, Samuel Schulter, Sparsh Garg, Shiyu Zhao, Ying Wu, and Manmohan Chandraker. Aide: An automatic data engine for object detection in autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14695–14706, 2024.
[Lin et al., 2023] Zhiqiu Lin, Samuel Yu, Zhiyi Kuang, Deepak Pathak, and Deva Ramanan. Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models, 2023.
[Liu et al., 2023a] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023.
[Liu et al., 2023b] Jiaqi Liu, Peng Hang, Xiao Qi, Jianqiang Wang, and Jian Sun. Mtd-gpt: A multi-task decision-making gpt model for autonomous driving at unsignalized intersections, 2023.
[Liu et al., 2023c] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment, 2023.
[Liu et al., 2023d] Zhijian Liu, Haotian Tang, Alexander Amini, Xingyu Yang, Huizi Mao, Daniela Rus, and Song Han. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In ICRA, 2023.
[Malla et al., 2023] Srikanth Malla, Chiho Choi, Isht Dwivedi, Joon Hee Choi, and Jiachen Li. Drama: Joint risk localization and captioning in driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1043–1052, 2023.
[Mao et al., 2023a] Jiageng Mao, Yuxi Qian, Hang Zhao, and Yue Wang. Gpt-driver: Learning to drive with gpt. arXiv preprint arXiv:2310.01415, 2023.
[Mao et al., 2023b] Jiageng Mao, Junjie Ye, Yuxi Qian, Marco Pavone, and Yue Wang. A language agent for autonomous driving, 2023.
[Marcu et al., 2023] Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, and Oleg Sinavski. Lingoqa: Video question answering for autonomous driving, 2023.
[Nie et al., 2023] Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, and Li Zhang. Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving, 2023.
[Nouri et al., 2024] Ali Nouri, Beatriz Cabrero-Daniel, Fredrik Torner, Hakan Sivencrona, and Christian Berger. Engineering safety requirements for autonomous driving with large language models, 2024.
[OpenAI, 2023] OpenAI. Gpt-4 technical report, 2023.
[Ouyang et al., 2022] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022.
[P et al., 2023] Jishnu Jaykumar P, Kamalesh Palanisamy, Yu-Wei Chao, Xinya Du, and Yu Xiang. Proto-clip: Vision-language prototypical network for few-shot learning, 2023.
[Pan et al., 2024] Chenbin Pan, Burhaneddin Yaman, Tommaso Nesti, Abhirup Mallik, Alessandro G Allievi, Senem Velipasalar, and Liu Ren. Vlp: Vision language planning for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14760–14769, 2024.
[Papineni et al., 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002.
[Parmar et al., 2022] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation, 2022.
[Paul et al., 2024] Pranjal Paul, Anant Garg, Tushar Choudhary, Arun Kumar Singh, and K Madhava Krishna. Lego-drive: Language-enhanced goal-oriented closed-loop end-to-end autonomous driving. arXiv preprint arXiv:2403.20116, 2024.
[Peng et al., 2024] Mingxing Peng, Xusen Guo, Xianda Chen, Meixin Zhu, Kehua Chen, Xuesong Wang, Yinhai Wang, et al. Lc-llm: Explainable lane-change intention and trajectory predictions with large language models. arXiv preprint arXiv:2403.18344, 2024.
[Qian et al., 2023] Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. arXiv preprint arXiv:2305.14836, 2023.
[Radford et al., 2018] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
[Radford et al., 2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[Radford et al., 2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
[Ramesh et al., 2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents, 2022.
[Rezatofighi et al., 2019] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression, 2019.
[Rezende and Mohamed, 2016] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows, 2016.
[Rombach et al., 2021] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
[Ronneberger et al., 2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation, 2015.
[Sachdeva et al., 2023] Enna Sachdeva, Nakul Agarwal, Suhas Chundi, Sean Roelofs, Jiachen Li, Behzad Dariush, Chiho Choi, and Mykel Kochenderfer. Rank2tell: A multimodal driving dataset for joint importance ranking and reasoning. arXiv preprint arXiv:2309.06597, 2023.
[Sadat et al., 2020] Abbas Sadat, Sergio Casas, Mengye Ren, Xinyu Wu, Pranaab Dhawan, and Raquel Urtasun. Perceive, predict, and plan: Safe motion planning through interpretable semantic representations. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16, pages 414–430. Springer, 2020.
[Sha et al., 2023] Hao Sha, Yao Mu, Yuxuan Jiang, Li Chen, Chenfeng Xu, Ping Luo, Shengbo Eben Li, Masayoshi Tomizuka, Wei Zhan, and Mingyu Ding. Languagempc: Large language models as decision makers for autonomous driving. arXiv preprint arXiv:2310.03026, 2023.
[Shah et al., 2017] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. Airsim: High-fidelity visual and physical simulation for autonomous vehicles, 2017.
[Shao et al., 2023] Hao Shao, Yuxuan Hu, Letian Wang, Steven L. Waslander, Yu Liu, and Hongsheng Li. Lmdrive: Closed-loop end-to-end driving with large language models, 2023.
[Sharan et al., 2023] SP Sharan, Francesco Pittaluga, Manmohan Chandraker, et al. Llm-assist: Enhancing closed-loop planning with language-based reasoning. arXiv preprint arXiv:2401.00125, 2023.
[Shi et al., 2022] Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. Motion transformer with global intention localization and local movement refinement. Advances in Neural Information Processing Systems, 35:6531–6543, 2022.
[Shi et al., 2023] Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. Motion transformer with global intention localization and local movement refinement, 2023.
[Shukor et al., 2023] Mustafa Shukor, Corentin Dancette, and Matthieu Cord. ep-alm: Efficient perceptual augmentation of language models, 2023.
[Sima et al., 2023] Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. arXiv preprint arXiv:2312.14150, 2023.
[Sreeram et al., 2024] Shiva Sreeram, Tsun-Hsuan Wang, Alaa Maalouf, Guy Rosman, Sertac Karaman, and Daniela Rus. Probing multimodal llms as world models for driving. arXiv preprint arXiv:2405.05956, 2024.
[Swerdlow et al., 2023] Alexander Swerdlow, Runsheng Xu, and Bolei Zhou. Street-view image generation from a bird’s-eye view layout. arXiv preprint arXiv:2301.04634, 2023.
[Tanahashi et al., 2023] Kotaro Tanahashi, Yuichi Inoue, Yu Yamaguchi, Hidetatsu Yaginuma, Daiki Shiotsuka, Hiroyuki Shimatani, Kohei Iwamasa, Yoshiaki Inoue, Takafumi Yamaguchi, Koki Igari, Tsukasa Horinouchi, Kento Tokuhiro, Yugo Tokuchi, and Shunsuke Aoki. Evaluation of large language models for decision making in autonomous driving, 2023.
[Tang et al., 2023] Yun Tang, Antonio A. Bruto da Costa, Jason Zhang, Irvine Patrick, Siddartha Khastgir, and Paul Jennings. Domain knowledge distillation from large language model: An empirical study in the autonomous driving domain, 2023.
[Tang et al., 2024] Zuoyin Tang, Jianhua He, Dashuai Pei, Kezhong Liu, and Tao Gao. Testing large language models on driving theory knowledge and skills for connected autonomous vehicles. arXiv preprint arXiv:2407.17211, 2024.
[Taran et al., 2018] Vlad Taran, Nikita Gordienko, Yuriy Kochura, Yuri Gordienko, Alexandr Rokovyi, Oleg Alienin, and Sergii Stirenko. Performance evaluation of deep learning networks for semantic segmentation of traffic stereo-pair images. In Proceedings of the 19th International Conference on Computer Systems and Technologies. ACM, September 2018.
[Tian et al., 2024] Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Chenxu Hu, Yang Wang, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289, 2024.
[Touvron et al., 2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023.
[Treiber et al., 2000] Martin Treiber, Ansgar Hennecke, and Dirk Helbing. Congested traffic states in empirical observations and microscopic simulations. Physical Review E, 62(2):1805, 2000.
[Unterthiner et al., 2019] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric and challenges, 2019.
[Vedantam et al., 2015] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation, 2015.
[Wang et al., 2023a] Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. Is chatgpt a good nlg evaluator? a preliminary study, 2023.
[Wang et al., 2023b] Lening Wang, Han Jiang, Pinlong Cai, Daocheng Fu, Tianqi Wang, Zhiyong Cui, Yilong Ren, Haiyang Yu, Xuesong Wang, and Yinhai Wang. Accidentgpt: Accident analysis and prevention from v2x environmental perception with multi-modal large model. arXiv preprint arXiv:2312.13156, 2023.
[Wang et al., 2023c] Shiyi Wang, Yuxuan Zhu, Zhiheng Li, Yutong Wang, Li Li, and Zhengbing He. Chatgpt as your vehicle co-pilot: An initial attempt. IEEE Transactions on Intelligent Vehicles, pages 1–17, 2023.
[Wang et al., 2023d] Wenhai Wang, Jiangwei Xie, ChuanYang Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, et al. Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving. arXiv preprint arXiv:2312.09245, 2023.
[Wang et al., 2023e] Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, and Jiwen Lu. Drivedreamer: Towards real-world-driven world models for autonomous driving. arXiv preprint arXiv:2309.09777, 2023.
[Wang et al., 2023f] Yixuan Wang, Ruochen Jiao, Chengtian Lang, Sinong Simon Zhan, Chao Huang, Zhaoran Wang, Zhuoran Yang, and Qi Zhu. Empowering autonomous driving with large language models: A safety perspective, 2023.
[Wang et al., 2023g] Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving, 2023.
[Wang et al., 2024a] Maonan Wang, Aoyu Pang, Yuheng Kan, Man-On Pun, Chung Shue Chen, and Bo Huang. Llm-assisted light: Leveraging large language model capabilities for human-mimetic traffic signal control in complex urban environments, 2024.
[Wang et al., 2024b] Maonan Wang, Aoyu Pang, Yuheng Kan, Man-On Pun, Chung Shue Chen, and Bo Huang. Llm-assisted light: Leveraging large language model capabilities for human-mimetic traffic signal control in complex urban environments. arXiv preprint arXiv:2403.08337, 2024.
[Wang et al., 2024c] Peng Wang, Xiang Wei, Fangxu Hu, and Wenjuan Han. Transgpt: Multi-modal generative pre-trained transformer for transportation, 2024.
[Wang et al., 2024d] Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning. arXiv preprint arXiv:2405.01533, 2024.
[Wei et al., 2024] Yuxi Wei, Zi Wang, Yifan Lu, Chenxin Xu, Changxing Liu, Hao Zhao, Siheng Chen, and Yanfeng Wang. Editable scene simulation for autonomous driving via collaborative llm-agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15077–15087, 2024.
[Wen et al., 2023a] Licheng Wen, Daocheng Fu, Xin Li, Xinyu Cai, Tao Ma, Pinlong Cai, Min Dou, Botian Shi, Liang He, and Yu Qiao. Dilu: A knowledge-driven approach to autonomous driving with large language models. arXiv preprint arXiv:2309.16292, 2023.
[Wen et al., 2023b] Licheng Wen, Xuemeng Yang, Daocheng Fu, Xiaofeng Wang, Pinlong Cai, Xin Li, Tao Ma, Yingxuan Li, Linran Xu, Dengke Shang, Zheng Zhu, Shaoyan Sun, Yeqi Bai, Xinyu Cai, Min Dou, Shuanglu Hu, Botian Shi, and Yu Qiao. On the road with gpt-4v(ision): Early explorations of visual-language model on autonomous driving, 2023.
[Wu et al., 2022a] Penghao Wu, Li Chen, Hongyang Li, Xiaosong Jia, Junchi Yan, and Yu Qiao. Policy pre-training for autonomous driving via self-supervised geometric modeling. In The Eleventh International Conference on Learning Representations, 2022.
[Wu et al., 2022b] Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline, 2022.
[Wu et al., 2023] Dongming Wu, Wencheng Han, Tiancai Wang, Yingfei Liu, Xiangyu Zhang, and Jianbing Shen. Language prompt for autonomous driving, 2023.
[Xu et al., 2021a] Li Xu, He Huang, and Jun Liu. Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9878–9888, June 2021.
[Xu et al., 2021b] Li Xu, He Huang, and Jun Liu. Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events, 2021.
[Xu et al., 2022] Danfei Xu, Yuxiao Chen, Boris Ivanovic, and Marco Pavone. Bits: Bi-level imitation for traffic simulation, 2022.
[Xu et al., 2023] Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K. Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model, 2023.
[Yang et al., 2023a] Kairui Yang, Enhui Ma, Jibin Peng, Qing Guo, Di Lin, and Kaicheng Yu. Bevcontrol: Accurately controlling street-view elements with multi-perspective consistency via bev sketch layout. arXiv preprint arXiv:2308.01661, 2023.
[Yang et al., 2023b] Yi Yang, Qingwen Zhang, Ci Li, Daniel Simões Marta, Nazre Batool, and John Folkesson. Human-centric autonomous systems with llms for user command reasoning, 2023.
[Yang et al., 2023c] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v(ision), 2023.
[Yang et al., 2024] Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, et al. Generalized predictive model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14662–14672, 2024.
[Yin et al., 2021] Tianwei Yin, Xingyi Zhou, and Philipp Krähenbühl. Center-based 3d object detection and tracking, 2021.
[Yuan et al., 2024] Jianhao Yuan, Shuyang Sun, Daniel Omeiza, Bo Zhao, Paul Newman, Lars Kunze, and Matthew Gadd. Rag-driver: Generalisable driving explanations with retrieval-augmented in-context learning in multi-modal large language model. arXiv preprint arXiv:2402.10828, 2024.
[Zeng et al., 2022] Fangao Zeng, Bin Dong, Yuang Zhang, Tiancai Wang, Xiangyu Zhang, and Yichen Wei. Motr: End-to-end multiple-object tracking with transformer. In European Conference on Computer Vision (ECCV), 2022.
[Zhang et al., 2023a] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.
[Zhang et al., 2023b] Siyao Zhang, Daocheng Fu, Zhao Zhang, Bin Yu, and Pinlong Cai. Trafficgpt: Viewing, processing and interacting with traffic foundation models. arXiv preprint arXiv:2309.06719, 2023.
[Zhang et al., 2024] Jiawei Zhang, Chejian Xu, and Bo Li. Chatscene: Knowledge-enabled safety-critical scenario generation for autonomous vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15459–15469, 2024.
[Zhao et al., 2024] Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Xinze Chen, Guan Huang, Xiaoyi Bao, and Xingang Wang. Drivedreamer-2: Llm-enhanced world models for diverse driving video generation. arXiv preprint arXiv:2403.06845, 2024.
[Zheng et al., 2024a] Xiaoji Zheng, Lixiu Wu, Zhijie Yan, Yuanrong Tang, Hao Zhao, Chen Zhong, Bokui Chen, and Jiangtao Gong. Large language models powered context-aware motion prediction. arXiv preprint arXiv:2403.11057, 2024.
[Zheng et al., 2024b] Yupeng Zheng, Zebin Xing, Qichao Zhang, Bu Jin, Pengfei Li, Yuhang Zheng, Zhongpu Xia, Kun Zhan, Xianpeng Lang, Yaran Chen, et al. Planagent: A multi-modal large language agent for closed-loop vehicle motion planning. arXiv preprint arXiv:2406.01587, 2024.
[Zhong et al., 2022] Ziyuan Zhong, Davis Rempe, Danfei Xu, Yuxiao Chen, Sushant Veer, Tong Che, Baishakhi Ray, and Marco Pavone. Guided conditional diffusion for controllable traffic simulation, 2022.
[Zhong et al., 2023] Ziyuan Zhong, Davis Rempe, Yuxiao Chen, Boris Ivanovic, Yulong Cao, Danfei Xu, Marco Pavone, and Baishakhi Ray. Language-guided traffic simulation via scene-level diffusion, 2023.
[Zhou et al., 2024] Yunsong Zhou, Linyan Huang, Qingwen Bu, Jia Zeng, Tianyu Li, Hang Qiu, Hongzi Zhu, Minyi Guo, Yu Qiao, and Hongyang Li. Embodied understanding of driving scenarios. arXiv preprint arXiv:2403.04593, 2024.