iMotion-LLM: Motion Prediction Instruction Tuning

Abdulwahab Felemban1  Eslam Mohamed Bakr1  Xiaoqian Shen1
Jian Ding1  Abduallah Mohamed2*  Mohamed Elhoseiny1

1 KAUST   2 Meta Reality Labs
1 {abdulwahab.felemban, eslam.abdelrahman, xiaoqian.shen, jian.ding, mohamed.elhoseiny}@kaust.edu.sa
2 [email protected]

arXiv:2406.06211v2 [cs.CV] 11 Jun 2024

Abstract
We introduce iMotion-LLM, a Multimodal Large Language Model (LLM) with trajectory prediction capabilities, tailored to guide interactive multi-agent scenarios. Different from conventional motion prediction approaches, iMotion-LLM capitalizes on textual instructions as key inputs for generating contextually relevant trajectories. By enriching the real-world driving scenarios in the Waymo Open Dataset with textual motion instructions, we created InstructWaymo. Leveraging this dataset, iMotion-LLM integrates a pretrained LLM, fine-tuned with LoRA, to translate scene features into the LLM input space. iMotion-LLM offers significant advantages over conventional motion prediction models. First, it can generate trajectories that align with the provided instruction when the instructed direction is feasible. Second, when given an infeasible direction, it can reject the instruction, thereby enhancing safety.
These findings act as milestones in empowering autonomous navigation systems
to interpret and predict the dynamics of multi-agent environments, laying the
groundwork for future advancements in this field.

1 Introduction
Motion prediction is a crucial component in autonomous driving. Forecasting the future trajectories of surrounding entities based on historical data is vital for safety and motion planning. Recent challenges, e.g., the Waymo Open Dataset challenges [18], introduce a track specifically designed to concentrate on motion prediction, where 1.1 seconds of past motion are observed and 8 seconds are to be predicted into the future. Various methodologies [25, 43] have been developed to tackle this challenge; however, they lack support for human-vehicle interaction through interactive instructions.
With the advent of large language models (LLMs), significant advancements have been made in
applying LLMs to the autonomous driving context [50, 11]. Integrating LLMs into autonomous
driving systems markedly improves decision-making and vehicle adaptability. This advancement
results in more intelligent, reactive vehicles and promotes more intuitive human-machine interactions,
making autonomous driving more effective and user-friendly. However, the majority of existing
LLM applications in autonomous driving [50, 11] concentrate on text or image inputs, neglecting
the potential of vectorized motion prediction data. Vectorized motion data offers an abstract view of driving scenarios, providing essential information for motion prediction, such as the historical states of agents, along with other vectorized map features like the locations of traffic lights or lane centers. LLM applications for vector data remain underexplored [7], indicating an opportunity to improve autonomous driving technology by exploiting the benefits of vector data.
To this end, we integrate LLMs with vectorized motion prediction data and introduce the instructing
motion prediction task that utilizes human instructions and scene data as inputs. iMotion-LLM can

* This work was done outside of Meta in a personal capacity.

Preprint. Under review.


(Actual-scenario direction)
Human: Make the ego vehicle move straight.
iMotion-LLM: The ego vehicle can move straight, where it will first move straight with a very slow speed and a moderate acceleration, then move straight with a slow speed and a moderate acceleration. Agent-2 is 3.26m far on the left. There are no traffic lights. The closest stop sign is 3.76m but not in front. Decision: <Accepted>.

(Other feasible direction)
Human: Make the ego vehicle turn right.
iMotion-LLM: The ego vehicle can turn right, where it will first move straight with a very slow speed and a mild acceleration, then turn right with a slow speed and a moderate acceleration. Agent-2 is 11.22m far on the left. There are no traffic lights. The closest stop sign is 4.49m in front. Decision: <Accepted>.

(Infeasible direction)
Human: Make the ego vehicle take right U-turn.
iMotion-LLM: The ego vehicle cannot take right U-turn, because there are no lanes that allow this direction based on the current location, heading, and speed of the ego vehicle. Decision: <Rejected>.

Ground truth caption: The ego vehicle can move straight, where it will first move straight with a very slow speed and a moderate acceleration, then move straight with a slow speed and a mild acceleration. Agent-2 is 12.72m far on the left. There are no traffic lights. The closest stop sign is 6.27m in front.

Figure 1: Our iMotion-LLM model can process three types of instructions and predict the corre-
sponding trajectories. First, it can handle ground truth instructions that align with the direction of the
recorded real-scenario trajectory (e.g., Waymo Open Dataset), correctly accepting the instruction and
providing an explanation and trajectory. Second, iMotion-LLM can follow other feasible non-ground
truth directions and predict the correct explanation and trajectory. Finally, when given an infeasible
direction, iMotion-LLM correctly rejects the instruction.

output trajectory forecasts, a language explanation of how the vehicle executes this trajectory if it can, and a decision on whether to accept or reject the instruction based on feasibility, as illustrated in Figure 1 and Figure 3. To support this task, we augment the Waymo Open Dataset [17] with ego vehicle direction instructions. The instruction details and statistics are explained in Section 3. Subsequently, we introduce iMotion-LLM, an instructable motion prediction model based on Large Language Models (LLMs). iMotion-LLM harnesses pretrained models' multi-modal trajectory prediction capabilities through an encoder-decoder transformer architecture. As shown in Figure 3, it employs an LLM Projection to project encoded scene context embeddings from the Scene Encoder into the LLM input space. The LLM generates an instruction token [I] and N [S] tokens representing the scene context embeddings. These are combined into a single query by the Instruct Mapper. The
resulting keys and values, derived and projected by the Scene Mapper, are used by the Multi-modal
Trajectory Prediction Decoder. Our experiments, using GameFormer [25] as a backbone, show that
iMotion-LLM empowers autonomous navigation systems to interpret and predict the dynamics of
multi-agent environments, while matching the performance of existing models.
Our contributions can be summarized as:

• We propose the instructing motion prediction task, which takes human instructions and scene data as inputs and outputs a text indicating a decision, a text describing how the action is executed, and trajectory predictions.
• We augment the Waymo Open Dataset with instruction categories, enabling the instructed motion prediction task. This augmentation, named InstructWaymo, is easily expandable to include driving-scenario information of higher granularity and will benefit future research in this direction.
• We introduce two evaluation metrics, Instruction Following Recall (IFR) and Direction Variety Score (DVS), to measure the model's ability to adhere to instructions and the diversity of the predicted modalities across different directional categories. These attributes cannot be reflected by the conventional metrics used in the motion prediction task.
• We introduce iMotion-LLM, an instructable motion prediction model based on Large Language Models (LLMs). Different from earlier motion prediction models, iMotion-LLM leverages instructions, scene features, and vectorized motion data as inputs to generate contextually relevant predictions.

Table 1: Direction categories with their corresponding presence proportion in the train set of 327,391
driving scenarios and in the test set of 2,311 driving scenarios.
Category Stationary Straight Straight-right Straight-left Right Left Right u-turn Left u-turn
Train 1.6% 55.8% 3.3% 3.7% 16.7% 17.5% 0.1% 1.4%
Test 1.8% 56.1% 3.3% 3.2% 16.7% 17.7% 0.0% 1.1%

2 Related Work

Multimodal Large Language Models. Large Language Models (LLMs) have significantly advanced
in recent years [40, 15, 5, 47, 46, 1], with models like GPT-4 [1] demonstrating remarkable abili-
ties in generating coherent, contextually relevant text across numerous domains. With the strong
performance of LLMs, there is an emergence of multi-modal LLMs (MLLMs) [3], which extend
the LLMs with reasoning abilities across diverse modalities. Notable works includes Flamingo [3],
InstructBLIP [14], MiniGPT-4 [6, 52], LLaVA [32, 31], and Vicuna [10]. These works used visual
instruction tuning to align with human intentions. There are some extensions that focus on detection
and segmentation [52, 48, 29, 4], videos [30, 51, 33], and 3D [22, 49, 20]. Our work focuses on
MLLMs for motion prediction tasks.
Motion Prediction Models for Driving Scenarios. The task of motion prediction involves analyzing
the historical tracks of agents on a corresponding map to predict their joint future positions several
seconds into the future. LSTMs [2, 21] have been used to encode the historical states of agents,
while CNNs [12, 19, 41] have been employed to encode the rasterized images of the scene. Recently,
GNNs [9, 27, 36] have been employed to depict agent interactions effectively. The advent of
Transformer-based models, like SceneTransformer [39] and WayFormer [38], has further enhanced
prediction through their efficient structure, though they primarily focus on the encoding process.
Motion Transformer [44, 45] and GameFormer [26] innovate by improving the decoding stage, leading to better accuracy. MotionLM [42] uses an LLM-like structure for modeling, but still does not introduce language reasoning ability to the motion prediction task.
Multimodal Large Language Models for autonomous driving. With the emergence of LLMs, there is a trend to adapt them to autonomous driving scenes [8, 16, 23, 24]. Innovations like
GPT-Driver [34] and SurrealDriver [28] exemplify the transformative impact of LLMs in motion
planning and driving maneuver generation, marking significant advancements in autonomous vehicle
technology. However, the majority of existing methods primarily focus on text or image inputs,
overlooking the benefits of vector data in motion prediction, which include reduced computational
complexity and improved accuracy. In this paper, we integrate Large Language Models (LLMs) with
vector-based data for motion prediction.

3 InstructWaymo: Instruction Augmentation of Waymo Open Dataset

InstructWaymo introduces a new perspective on the Waymo Open Dataset by making motion pre-
diction instructable and language descriptive. We designed a module that categorizes future motion
into different directions, speeds, and acceleration categories. Additionally, we extract scene attributes
from the most recent observed time step (at 1.1 seconds), including the relative locations of nearby
agents, traffic lights, and stop signs. InstructWaymo uses future direction information as an instruc-
tion and all other information, including future motion details, as an output caption. To enhance
iMotion-LLM training and evaluation, we determine direction feasibility based on the ego vehicle's current location, heading, speed, and the detected road center lanes, assessing whether specific directions are feasible to execute. InstructWaymo integrates this information into the output caption. We provide InstructWaymo as a script to augment the Waymo Open Dataset.
Direction. Direction is fundamental for instructing navigation, drawing inspiration from the mean
Average Precision (mAP) metric utilized in Waymo motion prediction challenges [17], where they
compute mAP across various motion ranges to comprehensively assess performance across diverse
driving behaviors. We reuse their definition of driving behaviors, obtaining the eight direction classes listed in Table 1, together with their statistics in the training and evaluation sets. See the calculation of the direction in Appendix B.

Speed and Acceleration. Following the intuition used in [37], we categorize the trajectories of moving vehicles based on their speeds and relative changes in speed. For this, we heuristically defined five speed categories and nine acceleration/deceleration categories; the suggested upper thresholds and the categories are listed in Table 6 in the appendix.
Transcribing scene and map information. Based on the latest observed time step, the Waymo Open Dataset provides easy access to different agent and scene attributes. Agent and map information is transcribed as locations relative to the ego vehicle, enriching the contextual driving information. A relative location is transcribed as a relative distance with a relative directional position (for example, Agent-2 is 10m far behind).
Feasibility of directions. We define the feasibility of directions into three categories: 1) the actual-scenario direction, which is based on the ground truth future trajectory and hence is always assumed to be a feasible direction; 2) other feasible directions that are not the actual-scenario direction; and 3) infeasible directions, which is the complement set of the feasible directions. To assess feasibility, we consider a set of candidate destinations relative to the ego vehicle's current location and heading. These candidate destinations are possible locations on associated lanes within a range determined by the vehicle's speed (minimum range r1, maximum range r2). Figure 2 illustrates this concept with two feasible directions.

Figure 2: Feasibility detection of lane directions with "move straight" and "turn right" as feasible directions within a range of (r1, r2).

LLM Instruction and caption. Based on the previously extracted attributes, we generate a template of the input instruction and output caption that the LLM can process. The input can be either the final future direction or, additionally, two-step directions and speeds. In the results section, we focus on the model trained and evaluated with the final future direction instructions only, unless otherwise stated. The output caption that the LLM is aligned to generate auto-regressively includes the final direction, with the two-step directions, speeds, and accelerations achieving the final direction. Additionally, it states the relative position of the interactive agent (Agent-2) and the relative locations of the closest traffic light and stop sign. In this work, we select the agent labeled as the interactive agent (Agent-2) in the Waymo Interaction Prediction challenge.
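For concreteness, the following is a minimal sketch of how an InstructWaymo-style instruction and output caption could be assembled from the extracted attributes, following the templates shown in Figure 1; the function names and argument structure are our own illustrative assumptions, not the released InstructWaymo script.

    # Minimal sketch (not the released InstructWaymo script) of assembling an
    # instruction and output caption from the extracted attributes.

    def build_instruction(direction):
        # e.g., "Make the ego vehicle turn right."
        return f"Make the ego vehicle {direction}."

    def build_caption(final_dir, step_dirs, speeds, accels, agent2, lights, stop_sign, feasible):
        if not feasible:
            return (f"The ego vehicle cannot {final_dir}, because there are no lanes that allow this "
                    "direction based on the current location, heading, and speed of the ego vehicle. "
                    "Decision: <Rejected>.")
        return (f"The ego vehicle can {final_dir}, where it will first {step_dirs[0]} with a "
                f"{speeds[0]} speed and a {accels[0]} acceleration, then {step_dirs[1]} with a "
                f"{speeds[1]} speed and a {accels[1]} acceleration. Agent-2 is {agent2}. "
                f"{lights} {stop_sign} Decision: <Accepted>.")

    # Illustrative values taken from the second example in Figure 1:
    print(build_instruction("turn right"))
    print(build_caption("turn right", ["move straight", "turn right"], ["very slow", "slow"],
                        ["mild", "moderate"], "11.22m far on the left",
                        "There are no traffic lights.", "The closest stop sign is 4.49m in front.",
                        True))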

4 iMotion-LLM
Revisiting existing models. Recent successful transformer-based interactive trajectory prediction
models [25, 43] commonly employ a schema comprising two main blocks. Initially, a scene encoder
encodes the observed map and agent information into embeddings representing scene context information S ∈ R^{R×d_scene}, where d_scene is the embedding dimension. Subsequently, the multimodal trajectory prediction decoder utilizes cross-attention with S as keys and values, employing K learnable queries q_motion ∈ R^{K×d_scene} to predict a Gaussian Mixture Model (GMM) of future trajectories for the interactive agents. Both the Scene Encoder and Trajectory Decoder are depicted in Figure 3. The
vectorized motion data is encoded through an LSTM, while the map features are processed using
Multi-Layer Perceptrons (MLPs) for continuous features such as center lanes, or embedding layers
for categorical features like the state of traffic lights. Subsequently, the Scene Encoder functions as a
feature fusion layer.
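For concreteness, the following is a rough sketch of such an encoder-decoder schema (not GameFormer's actual code): K learnable motion queries cross-attend to the scene embeddings S, followed by a GMM head. All dimensions, layer choices, and the GMM parameterization are assumptions.

    import torch
    import torch.nn as nn

    class MultimodalTrajectoryDecoder(nn.Module):
        """Simplified sketch: K learnable queries cross-attend to scene embeddings, then a GMM head."""
        def __init__(self, d_scene=256, num_modes=6, horizon=80):
            super().__init__()
            self.q_motion = nn.Parameter(torch.randn(num_modes, d_scene))     # K learnable queries
            self.cross_attn = nn.MultiheadAttention(d_scene, num_heads=8, batch_first=True)
            self.gmm_head = nn.Linear(d_scene, horizon * 5)   # per step: mu_x, mu_y, sig_x, sig_y, rho
            self.mode_prob = nn.Linear(d_scene, 1)

        def forward(self, scene_emb):                          # scene_emb: (B, R, d_scene)
            q = self.q_motion.unsqueeze(0).expand(scene_emb.size(0), -1, -1)
            out, _ = self.cross_attn(q, scene_emb, scene_emb)  # S provides the keys and values
            gmm_params = self.gmm_head(out)                    # (B, K, horizon * 5)
            mode_probs = self.mode_prob(out).squeeze(-1).softmax(dim=-1)
            return gmm_params, mode_probs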

4.1 Conditional Multimodal Trajectory Prediction Decoder

To generate a conditional output, cGAN [35] uses a conditioning signal in the generator model's input. Inspired by this, we fuse an additional learnable query, q_instruction, with the motion generation queries, q_motion. For the base model, q_instruction is learned using a simple embedding layer with a categorical class as input. When integrating an LLM with the base model, q_instruction is derived from the LLM's output embeddings, as described in the next subsection.
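Below is a minimal sketch of the base conditional variant, assuming a simple embedding layer over eight direction classes and additive fusion with the motion queries; the dimensions are illustrative.

    import torch
    import torch.nn as nn

    class InstructionConditioning(nn.Module):
        """Sketch: embed a discrete direction class as q_instruction and add it to the motion queries."""
        def __init__(self, num_directions=8, d_scene=256):
            super().__init__()
            self.embed = nn.Embedding(num_directions, d_scene)

        def forward(self, q_motion, direction_id):
            # q_motion: (B, K, d_scene); direction_id: (B,) integer direction category
            q_instruction = self.embed(direction_id).unsqueeze(1)   # (B, 1, d_scene)
            return q_motion + q_instruction                         # fused, conditioned queries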

4.2 Integration of iMotion-LLM

In our proposed design, we integrate, align, and instruction fine-tune the LLM with a pretrained Scene Encoder and Multi-modal Trajectory Prediction Decoder. The LLM lies between them, and

[Figure 3 diagram: the Scene Encoder (Motion Encoder and Map Encoder) feeds the LLM Projection; the Large Language Model consumes the projected scene embeddings and the textual instruction ("Predict the future multimodal trajectory ... Make the ego vehicle move straight / turn right / ...") and produces an output text, an instruction token [I], and scene tokens [S1] ... [SN]; the Instruct Mapper and Scene Mapper provide q_motion and the keys/values of the Multimodal Trajectory Prediction Decoder.]
Figure 3: The proposed pipeline, referred to as iMotion-LLM, leverages the multi-modal trajectory
prediction capabilities of pretrained models, employing an encoder-decoder transformer architecture.
Given a textual instruction and scene context embeddings, iMotion-LLM utilizes an LLM Mapper to
project the encoded scene context embeddings from the Scene Encoder into the LLM input space.
Subsequently, the LLM generates an instruction token [I] and a sequence of [S] tokens representing
the scene context embeddings. The [I] token is projected to a query, and the scene context-generated
tokens are projected to be the keys and values utilized by the multi-modal trajectory prediction
decoder.

enables instructability. To enable this integrated design, illustrated in Figure 3, five main blocks are required: 1) the LLM Projection module; 2) the LLM itself; 3) the Scene Mapper; 4) the Instruct Mapper; and 5) the Output Caption.
LLM Projection. Inspired by Vision-LLMs [13, 53], we employ a simple MLP-based projection layer to map the input scene embeddings S ∈ R^{R×d_scene} to S̃ ∈ R^{R×d_LLM}, aligning with the LLM embedding dimension d_LLM.
LLM. The projected scene embeddings S̃ and the input instruction T_I are fed to the LLM to generate new tokens, [I; S_1; S_2; . . . ; S_N], where I represents the instruction embedding and S_n represents the scene embeddings after grounding the instruction T_I.
Scene Mapper. To ensure seamless integration, we freeze the motion prediction model's encoder and decoder. Consequently, we map the instruction-grounded tokens [S_i] ∈ R^{d_LLM} back to R^{d_scene}, serving as keys and values in the Multimodal Trajectory Prediction Decoder, as defined in Eq. 1:

    K_i & V_i = MLP([S_i]),   i ∈ {1, . . . , N}.   (1)

Instruct Mapper. Following the Scene Mapper, we project the instruction token I back to the motion prediction model's embedding space (d_scene), which is fused with q_motion through a simple addition operation, as shown in Eq. 2:

    Q = q_motion + MLP([I]).   (2)
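A compact sketch of this integration is given below; the MLP depths, the embedding sizes, and how the [I] and [S_i] tokens are taken from the LLM output are assumptions rather than the exact implementation.

    import torch
    import torch.nn as nn

    class LLMIntegration(nn.Module):
        """Sketch of the LLM Projection, Scene Mapper, and Instruct Mapper (assumed sizes)."""
        def __init__(self, d_scene=256, d_llm=4096):
            super().__init__()
            self.llm_projection = nn.Sequential(nn.Linear(d_scene, d_llm), nn.GELU(),
                                                nn.Linear(d_llm, d_llm))       # S -> S~
            self.scene_mapper = nn.Linear(d_llm, d_scene)      # [S_i] -> keys/values (Eq. 1)
            self.instruct_mapper = nn.Linear(d_llm, d_scene)   # [I]   -> query offset (Eq. 2)

        def project_scene(self, scene_emb):                    # (B, R, d_scene) -> (B, R, d_llm)
            return self.llm_projection(scene_emb)              # fed to the LLM alongside the instruction

        def map_outputs(self, instr_token, scene_tokens, q_motion):
            # instr_token: (B, d_llm); scene_tokens: (B, N, d_llm); q_motion: (B, K, d_scene)
            kv = self.scene_mapper(scene_tokens)                               # K_i & V_i = MLP([S_i])
            q = q_motion + self.instruct_mapper(instr_token).unsqueeze(1)      # Q = q_motion + MLP([I])
            return q, kv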

Output Caption. Along with generating the scene and instruction tokens, the LLM outputs a text that describes how the instruction is executed, key scene information, and a textual decision ("<Accepted>" or "<Rejected>") indicating whether the instruction is feasible.

5 Instruction Following and Diversity Metrics


Our primary objective is to render current motion prediction models interactive and controllable.
Hence, conventional metrics like Average Displacement Error (ADE) and Final Displacement Error

Figure 4: Illustrative examples of IFR and DVS of 6 modalities given a direction instruction of "move
straight".

(FDE) alone may not suffice to adequately evaluate the instruction-following capabilities of the
proposed model. To address this, we introduce two metrics: Instruction Following Recall (IF R) and
Direction Variety Score (DV S).
Instruction Following Recall (IFR). To gauge the model's ability to adhere to instructions, we compare the given instruction's direction, D_instruct, with the directions of the generated multimodal trajectories. For each of the M modalities, we calculate its direction, D_pred_j, using the same module used to extract the actual-scenario ground truth future direction. Based on that, IFR is computed as the average recall across N samples of multimodal trajectory predictions:

    IFR = (1/N) Σ_{i=1}^{N} (1/M) Σ_{j=1}^{M} Recall(D^i_{pred_j} | D^i_{instruct}),   (3)

where a higher IFR indicates higher adherence to the given instruction signal. For an unconditional model that takes no instruction signal, we can still measure the IFR, where D_instruct is taken to be the actual-scenario ground truth future direction.
Direction Variety Score (DVS). To assess the directional diversity of the predicted modalities, we measure the ratio of unique direction categories predicted over the total number of modalities M. This metric is calculated irrespective of the actual or given instruction as:

    DVS = (1/N) Σ_{i=1}^{N} Unique(D^i_pred) / M,   (4)

where a higher DVS indicates more diversity in the predicted directions.
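A corresponding sketch of the DVS in Eq. 4, under the same assumptions:

    import numpy as np

    def direction_variety_score(pred_dirs):
        """pred_dirs: (N, M) predicted direction categories per sample."""
        pred_dirs = np.asarray(pred_dirs)
        M = pred_dirs.shape[1]
        uniques = [len(np.unique(row)) for row in pred_dirs]
        return float(np.mean(uniques)) / M

    # Middle example of Figure 4: two of six modalities match "move straight" and
    # three unique directions are covered, i.e., IFR = 2/6 and DVS = 3/6.
    preds = [["straight", "straight", "left", "left", "right", "right"]]
    print(direction_variety_score(preds))   # 0.5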


Figure 4 shows three illustrative examples given an input instruction or an actual-scenario instruction of "move straight". The left example shows the highest possible IFR, where all modalities are precisely in the "move straight" direction. The middle example has only two true positives while covering three unique directions, resulting in an IFR of 2/6 and a DVS of 3/6. The right example shows the highest possible directional diversity, with the maximum possible number of unique directions and only one true positive, resulting in an IFR of 1/6. In our experiments, we report the values in percentages.

6 Experiments
6.1 Experimental Setup

Implementation Details. We adopt GameFormer [25] as our trajectory forecasting backbone, using 327,391 training samples. The LLM projection layers and LoRA weights are fine-tuned over 15,000 training steps, with 4,000 iterations per inner epoch and a batch size of 24, effectively covering 360,000 training samples, with LoRA parameters of r=8 and alpha=16. We utilize the Adam optimizer with an initial learning rate (LR) of 1e-4, incorporating a linear warmup for the first 100 steps starting from a warmup LR of 1e-6, followed by a cosine LR scheduler. The training process takes 12 hours on four A100-80GB GPUs.
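As a rough sketch of this fine-tuning setup, the following shows LoRA with r=8 and alpha=16, Adam at 1e-4, and a warmup-plus-cosine schedule using the Hugging Face peft and transformers libraries; the base checkpoint, the LoRA target modules, and the warmup behavior (which here starts from 0 rather than 1e-6) are assumptions and will differ from the actual training code.

    import torch
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

    # The base checkpoint and target modules below are assumptions for illustration.
    llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    llm = get_peft_model(llm, LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"]))

    trainable = [p for p in llm.parameters() if p.requires_grad]   # LoRA (and projection) weights
    optimizer = torch.optim.Adam(trainable, lr=1e-4)
    # Linear warmup over the first 100 steps, then cosine decay over 15,000 total steps.
    scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=100,
                                                num_training_steps=15_000)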
Metrics. In addition to the proposed metrics, i.e., Instruction Following Recall (IFR) and Direction Variety Score (DVS), which are discussed in Section 5, we employ the conventional motion metrics minADE and minFDE [17].

Evaluation. Each model is evaluated with three instruction types: actual-scenario, other feasible, and infeasible. We use 2,311 evaluation examples. Each scenario has a set of feasible and infeasible instructions. When evaluating these two types, we sample instructions with a bias toward the least represented directions in order to balance the categories used. The details of this sampling are explained in Appendix D. Evaluation takes 4 hours on a single A100-80GB GPU.

6.2 Results & Discussion

Table 2: Instruction Following Recall (IFR) and Direction Variety Score (DVS) for different models across three input instruction categories. This table compares the base GameFormer model, which takes no instruction as input, the conditional GameFormer, which takes a discrete direction category as input, and iMotion-LLM. The results for iMotion-LLM in this table assume all generated trajectories are valid (i.e., without feasibility classification).
Instruction Type:                        Actual-scenario       Other feasible        Infeasible
                                         instructions          instructions          instructions
Model                                    IFR ↑     DVS ↓       IFR ↑     DVS         IFR ↓     DVS
Baseline (GameFormer)                    80.4%     9.1%        –         –           –         –
Conditional Baseline (C-GameFormer)      87.6%     5.0%        43.1%     18.1%       16.8%     20.5%
iMotion-LLM (without classification)     89.6%     2.1%        37.4%     10.7%       20.4%     13.8%

GameFormer. The baseline model takes no signal as input other than the scene and historical trajectory features. For comparison, we nevertheless measure the unconditional model's adherence to the ground truth future direction. The results are shown in Table 2.
Conditional GameFormer. As we show in Table 2, making GameFormer conditional on a discrete direction unlocks the model's capability to follow actual-scenario instructions, indicated by an increase of 7.2% in IFR combined with a drop of 4.1% in DVS, i.e., less variety in the predicted directions and thus higher-precision instruction following. The conditional model also follows other feasible instructions to some extent, while following infeasible instructions significantly less.
iMotion-LLM without classification. Even though iMotion-LLM was trained with a feasibility classification capability, we show in Table 2 how the model performs when assuming all generated trajectories are valid. With this setup, iMotion-LLM follows actual-scenario instructions with higher precision, indicated by an increase of 2% in IFR and a drop of 2.9% in DVS, but it does not show improvement for instructions other than the actual-scenario one.
iMotion-LLM with feasibility detection. Given iMotion-LLM's capability of detecting whether an instruction should be accepted or rejected, Table 3 shows that iMotion-LLM accepts actual-scenario instructions at a rate of 95.1%, which matches the ensured feasibility of these instructions. Even though there is a drop in IFR due to false negatives, the actual-scenario performance is combined with a rate of 85.5% of detecting true negatives. Rejecting infeasible instructions reduces the IFR for infeasible instructions from 20.4% to 6.6%.
Insignificant following of other feasible instructions. For other feasible instructions besides the actual-scenario instruction, as shown in Table 3, both the conditional GameFormer and iMotion-LLM exhibit lower IFR and higher DVS compared to the actual-scenario case. Intuitively, this behavior correlates with the infeasibility of instructions rather than their feasibility. Even though iMotion-LLM detects the feasibility of actual-scenario and infeasible instructions at a high rate, it does not detect true positives for other feasible instructions as reliably. We attribute this to two factors. First,
driving behaviors for other feasible instructions may diverge from real scenarios, making the task
more complex and requiring better generalizability. Second, as shown in Figure 5, feasible directions
might not always align with safety, laws, or convenience; interestingly, iMotion-LLM rejects this instruction. Figure 6 shows a successful case of accepting feasible instructions and rejecting infeasible ones; "stationary" was labeled as infeasible due to the vehicle's current velocity. We show additional results in Appendix C.
Vehicle minADE and minFDE. Even though this work focuses on a new task, Table 5 shows that our model does not diverge significantly from the baseline, and it still inherits the conditional baseline's capability of generating trajectories that diverge from the ground truth future when an instruction other

Table 3: Accuracy and effect on the instruction following metrics of iMotion-LLM when considering direction feasibility detection. iMotion-LLM (without classification) refers to the exact same model without using feasibility detection, blindly evaluating all of its generated trajectory modalities. iMotion-LLM does not accumulate rejected instructions in the metric calculations, reflecting no instruction following.
                                         Actual-scenario             Other feasible              Infeasible
                                         instructions                instructions                instructions
Model                                    Acc. ↑   IFR ↑   DVS ↓      Acc. ↑   IFR ↑   DVS ↓      Acc. ↑   IFR ↓   DVS ↑
Conditional Baseline (C-GameFormer)      –        87.6%   5.0%       –        43.1%   18.1%      –        16.8%   20.5%
iMotion-LLM (without classification)     –        89.6%   2.1%       –        37.4%   10.7%      –        20.4%   13.8%
iMotion-LLM                              95.1%    86.2%   1.9%       41.2%    21.7%   6.8%       85.8%    6.6%    8.1%

Table 4: Instruction following and displacement error when considering a more detailed instruction
of two-step direction and speed in addition to the final direction instruction.
Instruction Type: Actual-scenario instructions
Model                                    Instruction                     IFR ↑    DVS ↓    Vehicles minADE ↓    Vehicles minFDE ↓
Baseline (GameFormer)                    None                            80.4%    9.1%     1.18                 2.44
Conditional Baseline (C-GameFormer)      Final direction                 87.6%    5.0%     1.09                 2.22
iMotion-LLM (without classification)     Final direction                 89.6%    2.1%     1.49                 3.33
iMotion-LLM                              + Two-step direction & speed    87.5%    2.2%     1.23                 2.49

Table 5: The minADE and minFDE for vehicle agents for all models.
                                         Actual-scenario instructions    Other feasible instructions     Infeasible instructions
Model                                    minADE ↓    minFDE ↓            minADE ↓    minFDE ↓            minADE ↓    minFDE ↓
Baseline (GameFormer)                    1.18        2.44                –           –                   –           –
Conditional Baseline (C-GameFormer)      1.09        2.22                2.14        5.39                2.29        5.56
iMotion-LLM (without classification)     1.49        3.33                2.40        6.16                2.57        6.59
iMotion-LLM                              1.48        3.32                2.50        6.65                3.04        8.11

Figure 5: Qualitative result showing, in the top figure, how iMotion-LLM accepts and follows the actual-scenario instruction "move straight" and, in the bottom figure, how it rejects "turn right" even though "turn right" is labeled as a feasible direction.

than the actual-scenario instruction is used, as indicated by the higher displacement errors; displacement errors are always calculated with the ground truth actual-scenario future trajectory as the reference.
Longer detailed instruction. In Table 4 we show that a longer instruction setting allows the LLM integration with GameFormer in iMotion-LLM to approach displacement errors closer to those of the base GameFormer. A longer instruction considers two-step direction and speed descriptions of the first and second halves of the future (from the latest observed step to 4 seconds into the future, and from the 4th second to 8 seconds into the future) in addition to the final direction. For example, "Turn right by first moving straight with a moderate speed, then turn right with a slow speed."

(Actual-scenario direction)
Human: Make the ego vehicle turn right.
iMotion-LLM: The ego vehicle can turn right, where it will first move straight with a slow speed and a mild acceleration, then turn right with a slow speed and a mild acceleration. Agent-2 is 14.97m far behind. The closest traffic light (green/go) is 14.16m in front. The closest stop sign is 66.12m but not in front. Decision: <Accepted>.

(Other feasible direction)
Human: Make the ego vehicle move straight veering left.
iMotion-LLM: The ego vehicle can move straight veering left, where it will first move straight with a slow speed and a aggressive acceleration, then move straight with a moderate speed and a moderate acceleration. Agent-2 is 18.1m far behind. The closest traffic light (green/go) is 32.82m in front. The closest stop sign is 98.59m but not in front. Decision: <Accepted>.

(Infeasible direction)
Human: Make the ego vehicle stay stationary.
iMotion-LLM: The ego vehicle cannot stay stationary, because there are no lanes that allow this direction based on the current location, heading, and speed of the ego vehicle. Decision: <Rejected>.

Ground truth: The ego vehicle can turn right, where it will first move straight with a slow speed and a mild deceleration, then turn right with a very slow speed and a constant velocity. Agent-2 is 11.62m far behind. The closest traffic light (green/go) is 13.5m in front. The closest stop sign is 64.27m but not in front.

Figure 6: Qualitative result showing the model's ability to follow feasible instructions (top two examples) and make sense of its surroundings, while rejecting irrational scenarios such as staying stationary (bottom example). Nonetheless, it generates a trajectory in which the ego vehicle stops and the interactive agent (Agent-2) overtakes it.

7 Limitations and Future Directions

Our study provides a key step by focusing on direction-based instructions, illustrating the potential
of the LLM in executing driving tasks. By showing that the model can effectively interpret and act
on these instructions, we have established a baseline that future research can build upon. Exploring
more complex instructions that encompass greater granularity and contextual information will further
enhance the model’s nuanced understanding and execution of multifaceted driving tasks. Furthermore,
we employed relatively simple instructions and output captions, demonstrating the feasibility and
effectiveness of this approach. Our work paves the way for incorporating more advanced and diverse
input instructions and output captioning with varying levels of reasoning based on the ego state and
surroundings. Although these elements were not included in this study, the attributes we extracted in
InstructWaymo can facilitate their seamless integration. This presents an exciting opportunity for
future research to develop more sophisticated and naturalistic implementations, extending the impact
of our initial findings.

8 Conclusion

In conclusion, we introduce iMotion-LLM, a Large Multimodal Model powered by LLMs, tailored for
trajectory prediction in interactive multi-agent scenarios within autonomous navigation. By leveraging
textual instructions as key inputs, our model not only generates contextually relevant trajectory
predictions but also showcases an enhanced ability to interpret and act upon these instructions.
Through integration with a pretrained LLM fine-tuned with LoRA, iMotion-LLM effectively translates
scene features into the LLM input space, enabling accurate multimodal trajectory forecasts. Notably,
our model’s ability to generate trajectories aligned with provided instructions inherits the performance
of the underlying backbone model, marking a significant advancement in empowering autonomous
navigation systems to anticipate the dynamics of multi-agent environments. iMotion-LLM, combined
with InstructWaymo instructions and captions, provides the capability to align trajectories with
feasible instructions and reject infeasible ones, thereby enhancing operational safety. This work not
only advances the field of autonomous navigation by enabling systems to better anticipate and react
within multi-agent environments but also sets a solid foundation for further innovations in interactive
autonomous systems.

References
[1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D.,
Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint
arXiv:2303.08774 (2023)
[2] Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., Savarese, S.: Social lstm:
Human trajectory prediction in crowded spaces. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. pp. 961–971 (2016)
[3] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A.,
Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning.
Advances in Neural Information Processing Systems 35, 23716–23736 (2022)
[4] Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-vl: A
versatile vision-language model for understanding, localization, text reading, and beyond (2023)
[5] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A.,
Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in
neural information processing systems 33, 1877–1901 (2020)
[6] Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krishnamoorthi, R., Chandra, V., Xiong,
Y., Elhoseiny, M.: Minigpt-v2: large language model as a unified interface for vision-language
multi-task learning. arXiv preprint arXiv:2310.09478 (2023)
[7] Chen, L., Sinavski, O., Hünermann, J., Karnsund, A., Willmott, A.J., Birch, D., Maund, D.,
Shotton, J.: Driving with llms: Fusing object-level vector modality for explainable autonomous
driving. arXiv preprint arXiv:2310.01957 (2023)
[8] Chen, L., Sinavski, O., Hünermann, J., Karnsund, A., Willmott, A.J., Birch, D., Maund, D.,
Shotton, J.: Driving with llms: Fusing object-level vector modality for explainable autonomous
driving. arXiv preprint arXiv:2310.01957 (2023)
[9] Chen, Y., Ivanovic, B., Pavone, M.: Scept: Scene-consistent, policy-based trajectory predictions
for planning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. pp. 17103–17112 (2022)
[10] Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y.,
Gonzalez, J.E., Stoica, I., Xing, E.P.: Vicuna: An open-source chatbot impressing gpt-4 with
90%* chatgpt quality (March 2023), https://lmsys.org/blog/2023-03-30-vicuna/
[11] Cui, C., Ma, Y., Cao, X., Ye, W., Wang, Z.: Drive as you speak: Enabling human-like interaction
with large language models in autonomous vehicles. In: Proceedings of the IEEE/CVF Winter
Conference on Applications of Computer Vision. pp. 902–909 (2024)
[12] Cui, H., Radosavljevic, V., Chou, F.C., Lin, T.H., Nguyen, T., Huang, T.K., Schneider, J.,
Djuric, N.: Multimodal trajectory predictions for autonomous driving using deep convolutional
networks. In: 2019 International Conference on Robotics and Automation (ICRA). pp. 2090–
2096. IEEE (2019)
[13] Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S.: InstructBLIP:
Towards general-purpose vision-language models with instruction tuning. In: Thirty-seventh
Conference on Neural Information Processing Systems (2023)
[14] Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S.C.H.:
Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR
abs/2305.06500 (2023). https://doi.org/10.48550/ARXIV.2305.06500
[15] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
[16] Dewangan, V., Choudhary, T., Chandhok, S., Priyadarshan, S., Jain, A., Singh, A.K., Srivastava,
S., Jatavallabhula, K.M., Krishna, K.M.: Talk2bev: Language-enhanced bird’s-eye view maps
for autonomous driving. arXiv preprint arXiv:2310.02251 (2023)

[17] Ettinger, S., Cheng, S., Caine, B., Liu, C., Zhao, H., Pradhan, S., Chai, Y., Sapp, B.,
Qi, C., Zhou, Y., Yang, Z., Chouard, A., Sun, P., Ngiam, J., Vasudevan, V., McCauley,
A., Shlens, J., Anguelov, D.: Large scale interactive motion forecasting for autonomous
driving : The waymo open motion dataset. In: 2021 IEEE/CVF International Confer-
ence on Computer Vision (ICCV). pp. 9690–9699. IEEE Computer Society, Los Alami-
tos, CA, USA (Oct 2021). https://doi.org/10.1109/ICCV48922.2021.00957, https://doi.ieeecomputersociety.org/10.1109/ICCV48922.2021.00957
[18] Ettinger, S., Cheng, S., Caine, B., Liu, C., Zhao, H., Pradhan, S., Chai, Y., Sapp, B., Qi, C.R.,
Zhou, Y., et al.: Large scale interactive motion forecasting for autonomous driving: The waymo
open motion dataset. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision. pp. 9710–9719 (2021)
[19] Gilles, T., Sabatini, S., Tsishkou, D., Stanciulescu, B., Moutarde, F.: Home: Heatmap output
for future motion estimation. In: 2021 IEEE International Intelligent Transportation Systems
Conference (ITSC). pp. 500–507. IEEE (2021)
[20] Guo, Z., Zhang, R., Zhu, X., Tang, Y., Ma, X., Han, J., Chen, K., Gao, P., Li, X., Li, H.,
et al.: Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding,
generation, and instruction following. arXiv preprint arXiv:2309.00615 (2023)
[21] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780
(1997)
[22] Hong, Y., Zhen, H., Chen, P., Zheng, S., Du, Y., Chen, Z., Gan, C.: 3d-llm: Injecting the 3d
world into large language models. arXiv preprint arXiv:2307.12981 (2023)
[23] Hu, A., Russell, L., Yeo, H., Murez, Z., Fedoseev, G., Kendall, A., Shotton, J., Corrado, G.:
Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080
(2023)
[24] Huang, W., Abbeel, P., Pathak, D., Mordatch, I.: Language models as zero-shot planners:
Extracting actionable knowledge for embodied agents. In: International Conference on Machine
Learning. pp. 9118–9147. PMLR (2022)
[25] Huang, Z., Liu, H., Lv, C.: Gameformer: Game-theoretic modeling and learning of transformer-
based interactive prediction and planning for autonomous driving. In: Proceedings of the
IEEE/CVF International Conference on Computer Vision (ICCV). pp. 3903–3913 (October
2023)
[26] Huang, Z., Liu, H., Lv, C.: Gameformer: Game-theoretic modeling and learning of
transformer-based interactive prediction and planning for autonomous driving. arXiv preprint
arXiv:2303.05760 (2023)
[27] Huang, Z., Mo, X., Lv, C.: Multi-modal motion prediction with transformer-based neural
network for autonomous driving. In: 2022 International Conference on Robotics and Automation
(ICRA). pp. 2605–2611. IEEE (2022)
[28] Jin, Y., Shen, X., Peng, H., Liu, X., Qin, J., Li, J., Xie, J., Gao, P., Zhou, G., Gong, J.:
Surrealdriver: Designing generative driver agent simulation framework in urban contexts based
on large language model. arXiv preprint arXiv:2309.13193 (2023)
[29] Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: Lisa: Reasoning segmentation via
large language model. arXiv preprint arXiv:2308.00692 (2023)
[30] Li, K., He, Y., Wang, Y., Li, Y., Wang, W., Luo, P., Wang, Y., Wang, L., Qiao, Y.: Videochat:
Chat-centric video understanding. arXiv preprint arXiv:2305.06355 (2023)
[31] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. arXiv
preprint arXiv:2310.03744 (2023)
[32] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information
processing systems 36 (2024)

[33] Maaz, M., Rasheed, H., Khan, S., Khan, F.S.: Video-chatgpt: Towards detailed video under-
standing via large vision and language models. arXiv preprint arXiv:2306.05424 (2023)
[34] Mao, J., Qian, Y., Zhao, H., Wang, Y.: Gpt-driver: Learning to drive with gpt. arXiv preprint
arXiv:2310.01415 (2023)
[35] Mirza, M., Osindero, S.: Conditional generative adversarial nets (2014)
[36] Mo, X., Huang, Z., Xing, Y., Lv, C.: Multi-agent trajectory prediction with heterogeneous edge-
enhanced graph attention network. IEEE Transactions on Intelligent Transportation Systems
23(7), 9554–9567 (2022)
[37] Mohamed, A., Zhu, D., Vu, W., Elhoseiny, M., Claudel, C.: Social-implicit: Rethinking
trajectory prediction evaluation and the effectiveness of implicit maximum likelihood estimation.
In: European Conference on Computer Vision. pp. 463–479. Springer (2022)
[38] Nayakanti, N., Al-Rfou, R., Zhou, A., Goel, K., Refaat, K.S., Sapp, B.: Wayformer: Motion
forecasting via simple & efficient attention networks. In: 2023 IEEE International Conference
on Robotics and Automation (ICRA). pp. 2980–2987. IEEE (2023)
[39] Ngiam, J., Vasudevan, V., Caine, B., Zhang, Z., Chiang, H.T.L., Ling, J., Roelofs, R., Bewley,
A., Liu, C., Venugopal, A., et al.: Scene transformer: A unified architecture for predicting future
trajectories of multiple agents. In: International Conference on Learning Representations (2021)
[40] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are
unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)
[41] Salzmann, T., Ivanovic, B., Chakravarty, P., Pavone, M.: Trajectron++: Dynamically-feasible
trajectory forecasting with heterogeneous data. In: Computer Vision–ECCV 2020: 16th Euro-
pean Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16. pp. 683–700.
Springer (2020)
[42] Seff, A., Cera, B., Chen, D., Ng, M., Zhou, A., Nayakanti, N., Refaat, K.S., Al-Rfou, R., Sapp,
B.: Motionlm: Multi-agent motion forecasting as language modeling. In: Proceedings of the
IEEE/CVF International Conference on Computer Vision. pp. 8579–8590 (2023)
[43] Shi, S., Jiang, L., Dai, D., Schiele, B.: Motion transformer with global intention localization
and local movement refinement. Advances in Neural Information Processing Systems (2022)
[44] Shi, S., Jiang, L., Dai, D., Schiele, B.: Motion transformer with global intention localization and
local movement refinement. Advances in Neural Information Processing Systems 35, 6531–6543
(2022)
[45] Shi, S., Jiang, L., Dai, D., Schiele, B.: Mtr++: Multi-agent motion prediction with symmetric
scene modeling and guided intention querying. IEEE Transactions on Pattern Analysis and
Machine Intelligence (2024)
[46] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B.,
Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models.
arXiv preprint arXiv:2302.13971 (2023)
[47] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra,
S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models.
arXiv preprint arXiv:2307.09288 (2023)
[48] Wang, W., Chen, Z., Chen, X., Wu, J., Zhu, X., Zeng, G., Luo, P., Lu, T., Zhou, J., Qiao, Y.,
et al.: Visionllm: Large language model is also an open-ended decoder for vision-centric tasks.
Advances in Neural Information Processing Systems 36 (2024)
[49] Xu, R., Wang, X., Wang, T., Chen, Y., Pang, J., Lin, D.: Pointllm: Empowering large language
models to understand point clouds. arXiv preprint arXiv:2308.16911 (2023)
[50] Xu, Z., Zhang, Y., Xie, E., Zhao, Z., Guo, Y., Wong, K.K., Li, Z., Zhao, H.: Drivegpt4:
Interpretable end-to-end autonomous driving via large language model. arXiv preprint
arXiv:2310.01412 (2023)

[51] Zhang, H., Li, X., Bing, L.: Video-llama: An instruction-tuned audio-visual language model for
video understanding. arXiv preprint arXiv:2306.02858 (2023)
[52] Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: Minigpt-4: Enhancing vision-language
understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)
[53] Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: Enhancing vision-language
understanding with advanced large language models. In: The Twelfth International Conference
on Learning Representations (2024)

Table 6: Speed and acceleration/deceleration categories and upper thresholds.
Speed category                    Very slow            Slow    Moderate    Fast          Very fast
Threshold (km/h)                  20                   40      90          120           >120
Accel./Decel. category            Constant velocity    Mild    Moderate    Aggressive    Extreme
Threshold (km/h increase in 8s)   6                    25      46          65            >65

Figure 7: Illustrative examples of the direction categories.

A Speed and acceleration categories

We define five speed categories ranging from very slow to very fast, and a set of acceleration/deceleration categories ranging from mild to extreme, including no acceleration (i.e., constant velocity). We designed these thresholds heuristically, yet they can be easily adapted. Table 6 shows the thresholds used.
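A small sketch mapping a speed (km/h) and an 8-second change in speed (km/h) to the categories of Table 6; the exact binning and boundary handling in InstructWaymo may differ.

    import bisect

    SPEED_BINS = [20, 40, 90, 120]                # upper thresholds in km/h (Table 6)
    SPEED_LABELS = ["very slow", "slow", "moderate", "fast", "very fast"]
    ACCEL_BINS = [6, 25, 46, 65]                  # km/h change over the 8 s horizon
    ACCEL_LABELS = ["constant velocity", "mild", "moderate", "aggressive", "extreme"]

    def speed_category(speed_kmh):
        return SPEED_LABELS[bisect.bisect_left(SPEED_BINS, speed_kmh)]

    def acceleration_category(delta_kmh_over_8s):
        label = ACCEL_LABELS[bisect.bisect_left(ACCEL_BINS, abs(delta_kmh_over_8s))]
        if label == "constant velocity":
            return label
        return label + (" acceleration" if delta_kmh_over_8s > 0 else " deceleration")

    print(speed_category(35.0))             # "slow"
    print(acceleration_category(-30.0))     # "moderate deceleration"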

B Calculation of the directions

Following the illustration in Figure 7, the motion direction is measured based on the relative heading angle between a time step and a future target step. We calculate the direction solely from trajectory information; the heading angle is computed using two consecutive discrete trajectory samples. If the maximum future speed is within a threshold of v_stationary = 2 m/s and the vehicle travels a distance within d_stationary = 5 m, the vehicle is considered stationary. Otherwise, the vehicle is moving straight if the relative heading is within θ_s = 30 degrees; but if the lateral displacement is greater than d_v = 5 m, it is categorized as straight veering right/left. If the relative heading exceeds θ_s and the longitudinal shift is less than d_u = 5 m in the opposite direction, it is considered as turning right/left; otherwise, it is a U-turn. Right and left directions are distinguished based on the sign of the relative heading. Figure 7 illustrates the different classes. Table 1 provides detailed statistics on these eight categories.
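The following is a simplified sketch of this categorization with the thresholds above; the coordinate conventions, the veering and U-turn tests, and the omission of the lane-feasibility logic are simplifying assumptions on our part.

    import numpy as np

    V_STATIONARY = 2.0          # m/s
    D_STATIONARY = 5.0          # m
    THETA_S = np.deg2rad(30.0)  # straight-vs-turn heading threshold
    D_V = 5.0                   # m, veering (lateral) threshold
    D_U = 5.0                   # m, U-turn (backward progress) threshold

    def direction_category(traj, speeds):
        """traj: (T, 2) future positions in the ego frame (x forward, y left); speeds: (T,) in m/s."""
        traj, speeds = np.asarray(traj), np.asarray(speeds)
        start_heading = np.arctan2(traj[1, 1] - traj[0, 1], traj[1, 0] - traj[0, 0])
        end_heading = np.arctan2(traj[-1, 1] - traj[-2, 1], traj[-1, 0] - traj[-2, 0])
        rel = np.arctan2(np.sin(end_heading - start_heading), np.cos(end_heading - start_heading))
        dx, dy = traj[-1] - traj[0]                       # displacement over the horizon

        if speeds.max() < V_STATIONARY and np.hypot(dx, dy) < D_STATIONARY:
            return "stationary"
        if abs(rel) < THETA_S:                            # heading stays roughly straight
            if abs(dy) > D_V:                             # sideways drift -> veering
                return "straight-left" if dy > 0 else "straight-right"
            return "straight"
        side = "left" if rel > 0 else "right"
        if dx < -D_U:                                     # ends up behind the start point -> U-turn
            return side + " u-turn"
        return side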

C Additional Qualitative Results

Figure 8: Qualitative Results.

Figure 9: Qualitative Results.

Figure 10: Qualitative Results.

Figure 11: Qualitative Results.

D Evaluation Sampling
A total of 2,311 evaluation examples were used, where the direction categories were validated to have a distribution similar to the training data. Since the set of other feasible and infeasible directions for a given example can contain up to 7 entries (where 8 is the number of possible categories), we do not evaluate all of them. For evaluation, we sample according to the reciprocal of the category distribution across the evaluation sets of other feasible instructions and infeasible instructions. Through this, we avoid an excessive number of instructions from categories that frequently appear as other feasible or infeasible instructions. Table 7 shows the distribution of each category.
Table 7: The categorical distribution of directions for the other feasible and infeasible instructions in
the test set.
Category Stationary Straight Straight-right Straight-left Right Left Right u-turn Left u-turn
Other feasible instructions 18.23% 2.7% 3.88% 3.91% 8.36% 16.04% 23.75% 23.12%
Infeasible instructions 7.04% 6.76% 15.34% 9.14% 39.00% 13.54% 4.32% 4.85%

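A minimal sketch of this reciprocal-frequency sampling (category names, data structures, and weighting details are illustrative assumptions):

    import random
    from collections import Counter

    def sample_instruction(candidate_dirs, category_counts):
        """Pick one direction from a scenario's candidates, weighting rare categories higher.

        candidate_dirs: feasible (or infeasible) directions available for this scenario.
        category_counts: Counter of how often each category appears across the evaluation set.
        """
        weights = [1.0 / max(category_counts[d], 1) for d in candidate_dirs]   # reciprocal frequency
        return random.choices(candidate_dirs, weights=weights, k=1)[0]

    # Illustrative usage: "straight" is frequent, so rarer categories are sampled more often.
    counts = Counter({"straight": 1500, "left": 400, "left u-turn": 30})
    print(sample_instruction(["straight", "left", "left u-turn"], counts))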
