Navigation World Models

Amir Bar¹  Gaoyue Zhou²  Danny Tran³  Trevor Darrell³  Yann LeCun¹,²
¹FAIR at Meta  ²New York University  ³Berkeley AI Research
Figure 1. We train a Navigation World Model (NWM) from video footage of robots and their associated navigation actions (a). After
training, NWM can evaluate trajectories by synthesizing their videos and scoring the final frame’s similarity with the goal (b). We use
NWM to plan from scratch or to rank expert navigation trajectories, improving downstream visual navigation performance. In unknown
environments, NWM can simulate imagined trajectories from a single image (c). In all examples above, the input to the model is the first
image and actions, then the model auto-regressively synthesizes future observations. Click on the image to view examples in a browser.
1. Introduction

Navigation is a fundamental skill for any organism with vision, playing a crucial role in survival by allowing agents to locate food and shelter and to avoid predators. In order to successfully navigate environments, smart agents primarily rely on vision, which allows them to construct representations of their surroundings, assess distances, and capture landmarks in the environment, all useful for planning a navigation route.

When human agents plan, they often imagine their future trajectories while considering constraints and counterfactuals. Current state-of-the-art robotic navigation policies [53, 55], on the other hand, are "hard-coded": after training, new constraints cannot be easily introduced (e.g., "no left turns"). Another limitation of current supervised visual navigation models is that they cannot dynamically allocate more computational resources to address hard problems. We aim to design a new model that mitigates these issues.

In this work, we propose a Navigation World Model (NWM), trained to predict the future representation of a video frame based on past frame representation(s) and action(s) (see Figure 1(a)). NWM is trained on video footage and navigation actions collected from various robotic agents. After training, NWM is used to plan novel navigation trajectories by simulating potential navigation plans and verifying whether they reach a target goal (see Figure 1(b)). To evaluate its navigation skills, we test NWM in known environments, assessing its ability to plan novel trajectories either independently or by ranking an external navigation policy. In the planning setup, we use NWM in a Model Predictive Control (MPC) framework, optimizing the action sequence that enables NWM to reach a target goal. In the ranking setup, we assume access to an existing navigation policy, such as NoMaD [55], which allows us to sample trajectories, simulate them using NWM, and select the best ones. Our NWM achieves state-of-the-art standalone performance and competitive results when combined with existing methods.

NWM is conceptually similar to recent diffusion-based world models for offline model-based reinforcement learning, such as DIAMOND [1] and GameNGen [66]. However, unlike these models, NWM is trained across a wide range of environments and embodiments, leveraging the diversity of navigation data from robotic and human agents. This allows us to train a large diffusion transformer model capable of scaling effectively with model size and data to adapt to multiple environments. Our approach also shares similarities with Novel View Synthesis (NVS) methods like NeRF [40], Zero-1-2-3 [38], and GDC [67], from which we draw inspiration. However, unlike NVS approaches, our goal is to train a single model for navigation across diverse environments and to model temporal dynamics from natural videos, without relying on 3D priors.

To learn a NWM, we propose a novel Conditional Diffusion Transformer (CDiT), trained to predict the next image state given past image states and actions as context. Unlike a DiT [44], CDiT's computational complexity is linear with respect to the number of context frames, and it scales favorably for models trained up to 1B parameters across diverse environments and embodiments, requiring 4× fewer FLOPs compared to a standard DiT while achieving better future prediction results.

In unknown environments, our results show that NWM benefits from training on unlabeled, action- and reward-free video data from Ego4D. Qualitatively, we observe improved video prediction and generation performance on single images (see Figure 1(c)). Quantitatively, with additional unlabeled data, NWM produces more accurate predictions when evaluated on the held-out GO Stanford [24] dataset.

Our contributions are as follows. We introduce a Navigation World Model (NWM) and propose a novel Conditional Diffusion Transformer (CDiT), which scales efficiently up to 1B parameters with significantly reduced computational requirements compared to a standard DiT. We train CDiT on video footage and navigation actions from diverse robotic agents, enabling planning by simulating navigation plans independently or alongside external navigation policies, and achieving state-of-the-art visual navigation performance. Finally, by training NWM on action- and reward-free video data, such as Ego4D, we demonstrate improved video prediction and generation performance in unseen environments.

2. Related Work

Goal-conditioned visual navigation is an important task in robotics requiring both perception and planning skills [8, 13, 15, 41, 43, 51, 55]. Given context image(s) and an image specifying the navigation goal, goal-conditioned visual navigation models [51, 55] aim to generate a viable path towards the goal if the environment is known, or to explore it otherwise. Recent visual navigation methods like NoMaD [55] train a diffusion policy via behavior cloning and a temporal distance objective to follow goals in the conditional setting or to explore new environments in the unconditional setting. Previous approaches like Active Neural SLAM [8] used neural SLAM together with analytical planners to plan trajectories in the 3D environment, while other approaches like [9] learn policies via reinforcement learning. Here we show that world models can use exploratory data to plan or improve existing navigation policies.

In contrast to learning a policy, the goal of a world model [19] is to simulate the environment, e.g., given the current state and action, to predict the next state and an associated reward. Previous works have shown that jointly learning a policy and a world model can improve sample efficiency on Atari [1, 20, 21], in simulated robotics environments [50], and even when applied to real-world robots [71]. More recently, [22] proposed to use a single world model shared across tasks by introducing action and task embeddings, while [37, 73] proposed to describe actions in language, and [6] proposed to learn latent actions. World models were also explored in the context of game simulation: DIAMOND [1] and GameNGen [66] propose to use diffusion models to learn game engines of computer games like Atari and Doom. Our work is inspired by these works, and we aim to learn a single general diffusion video transformer that can be shared across many environments and different embodiments for navigation.

In computer vision, generating videos has been a long-standing challenge [3, 4, 17, 29, 32, 62, 74]. Most recently, there has been tremendous progress in text-to-video synthesis with methods like Sora [5] and MovieGen [45]. Past works proposed to control video synthesis given structured action-object class categories [61] or Action Graphs [2]. Video generation models were previously used in reinforcement learning as rewards [10], as pretraining methods [59], for simulating and planning manipulation actions [11, 35], and for generating paths in indoor environments [26, 31]. Interestingly, diffusion models [28, 54] are useful both for video tasks like generation [69] and prediction [36], and also for view synthesis [7, 46, 63]. Differently, we use a conditional diffusion transformer to simulate trajectories for planning without explicit 3D representations or priors.

3. Navigation World Models

Navigation actions are estimated based on the change in the agent's location. Our goal is to learn a world model F, a stochastic mapping from previous latent observation(s) s_τ and action a_τ to the future latent state representation s_{τ+1}:

s_i = enc_θ(x_i),    s_{τ+1} ∼ F_θ(s_{τ+1} | s_τ, a_τ)    (1)

where the context s_τ = (s_τ, ..., s_{τ−m}) collects the past m visual observations encoded via a pretrained VAE [4]. Using a VAE has the benefit of working with compressed latents, allowing predictions to be decoded back to pixel space for visualization. Due to the simplicity of this formulation, it can be naturally shared across environments and easily extended to more complex action spaces, like controlling a robotic arm. Differently than [20], we aim to train a single world model across environments and embodiments, without using task or action embeddings like in [22].

The formulation in Equation 1 models action but does not allow control over the temporal dynamics. We extend this formulation with a time shift input k ∈ [T_min, T_max], setting a_τ = (u, φ, k); thus a_τ now specifies the time change k, used to determine how many steps the model should move into the future (or past). Hence, given a current state s_τ, we can randomly choose a time shift k and use the corresponding time-shifted video frame as our next state s_{τ+1}. The navigation actions can then be approximated as a summation from time τ to m = τ + k − 1:

u_{τ→m} = Σ_{t=τ}^{m} u_t,    φ_{τ→m} = (Σ_{t=τ}^{m} φ_t) mod 2π    (2)
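To make the time-shifted action concrete, here is a small sketch (our own illustration, not code from the paper; the helper name and array shapes are assumptions) that aggregates per-step translations and yaw changes into a single action for a shift of k steps, wrapping the summed yaw as in Equation 2.

```python
import numpy as np

def aggregate_action(translations, yaws, tau, k):
    """Approximate the navigation action for a time shift of k steps (Eq. 2).

    translations: array of shape (N, 2), per-step (x, y) displacements u_t
    yaws:         array of shape (N,), per-step yaw changes phi_t (radians)
    tau:          starting time index
    k:            time shift (number of steps into the future)
    Returns the summed translation and the summed yaw wrapped to [0, 2*pi).
    """
    m = tau + k - 1                                   # last step included in the sum
    u = translations[tau:m + 1].sum(axis=0)           # u_{tau->m} = sum_t u_t
    phi = np.mod(yaws[tau:m + 1].sum(), 2 * np.pi)    # phi_{tau->m}, wrapped mod 2*pi
    return u, phi

# Example: three unit steps forward while turning slightly left each step.
u, phi = aggregate_action(
    translations=np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]),
    yaws=np.array([0.1, 0.1, 0.1]),
    tau=0, k=3,
)
print(u, phi)  # -> [3. 0.] 0.30000000000000004
```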
Figure 2. Conditional Diffusion Transformer (CDiT) Block. The block's complexity is linear with the number of frames.

CDiT enables time-efficient autoregressive modeling by constraining the attention in the first attention block only to tokens from the target frame which is being denoised. To condition on tokens from past frames, we incorporate a cross-attention layer, such that every query token from the current target attends to tokens from past frames, which are used as keys and values. The cross-attention then contextualizes the representations using a skip connection layer.

To condition on the navigation action a ∈ R³, we first map each scalar to R^{d/3} by extracting sine-cosine features and applying a 2-layer MLP, then concatenate the results into a single vector ψ_a ∈ R^d. We follow a similar process to map the time shift k ∈ R to ψ_k ∈ R^d and the diffusion timestep t ∈ R to ψ_t ∈ R^d. Finally, we sum all embeddings into a single vector used for conditioning:

ξ = ψ_a + ψ_k + ψ_t    (3)

ξ is then fed to an AdaLN [72] block to generate scale and shift coefficients that modulate the Layer Normalization [34] outputs, as well as the outputs of the attention layers. To train on unlabeled data, we simply omit explicit navigation actions when computing ξ (see Eq. 3).

An alternative approach is to simply use a DiT [44]; however, applying a DiT to the full input is computationally expensive. Denote by n the number of input tokens per frame, m the number of frames, and d the token dimension. The complexity of a scaled multi-head attention layer [68] is dominated by the attention term O(m²n²d), which is quadratic in the context length. In contrast, our CDiT block is dominated by the cross-attention layer complexity O(mn²d), which is linear with respect to the context, allowing us to use a longer context. We analyze these two design choices in Section 4. CDiT resembles the original Transformer block [68], without applying expensive self-attention over the context tokens.
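As a rough illustration of the block described above (a simplified sketch, not the authors' implementation; the exact AdaLN parameterization, gating, and layer sizes are assumptions), a CDiT block can be written so that self-attention touches only the n target-frame tokens, while context frames enter solely through cross-attention:

```python
import torch
import torch.nn as nn

def modulate(x, shift, scale):
    # Scale/shift the normalized activations with coefficients produced from xi.
    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class CDiTBlock(nn.Module):
    """Sketch of a conditional diffusion transformer block.

    Self-attention is restricted to the n tokens of the target (denoised) frame;
    context frames are only reached through cross-attention, so the cost is
    O(m * n^2 * d) rather than O(m^2 * n^2 * d).
    """

    def __init__(self, d, heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(d, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d, elementwise_affine=False)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(d, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        # AdaLN-style conditioning: map xi to shift/scale/gate for each sub-block.
        self.ada = nn.Linear(d, 9 * d)

    def forward(self, x, ctx, xi):
        # x:   (B, n, d)   tokens of the target frame being denoised
        # ctx: (B, m*n, d) tokens of the m past context frames
        # xi:  (B, d)      conditioning vector xi = psi_a + psi_k + psi_t
        sh1, sc1, g1, sh2, sc2, g2, sh3, sc3, g3 = self.ada(xi).chunk(9, dim=-1)

        h = modulate(self.norm1(x), sh1, sc1)  # self-attention over target tokens only
        x = x + g1.unsqueeze(1) * self.self_attn(h, h, h, need_weights=False)[0]

        h = modulate(self.norm2(x), sh2, sc2)  # cross-attention: queries from the target,
        x = x + g2.unsqueeze(1) * self.cross_attn(h, ctx, ctx, need_weights=False)[0]  # keys/values from context

        h = modulate(self.norm3(x), sh3, sc3)  # pointwise MLP with a residual connection
        return x + g3.unsqueeze(1) * self.mlp(h)
```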
Diffusion Training. In the forward process, noise is added to the target state s_{τ+1} according to a randomly chosen timestep t ∈ {1, ..., T}. The noisy state s^{(t)}_{τ+1} can be defined as s^{(t)}_{τ+1} = √(ᾱ_t) s_{τ+1} + √(1 − ᾱ_t) ε, where ε ∼ N(0, I) is Gaussian noise and {ᾱ_t} is a noise schedule controlling the variance. As t increases, s^{(t)}_{τ+1} converges to pure noise. The reverse process attempts to recover the original state representation s_{τ+1} from the noisy version s^{(t)}_{τ+1}, conditioned on the context s_τ, the current action a_τ, and the diffusion timestep t. We define F_θ(s_{τ+1} | s_τ, a_τ, t) as the denoising neural network model parameterized by θ. We follow the same noise schedule and hyperparameters as DiT [44].

Training Objective. The model is trained to minimize the mean-squared error between the clean and predicted target, aiming to learn the denoising process:

L_simple = E_{s_{τ+1}, a_τ, s_τ, ε, t} [ ‖ s_{τ+1} − F_θ(s^{(t)}_{τ+1} | s_τ, a_τ, t) ‖²₂ ]

In this objective, the timestep t is sampled randomly to ensure that the model learns to denoise frames across varying levels of corruption. By minimizing this loss, the model learns to reconstruct s_{τ+1} from its noisy version s^{(t)}_{τ+1}, conditioned on the context s_τ and action a_τ, thereby enabling the generation of realistic future frames. Following [44], we also predict the covariance matrix of the noise and supervise it with the variational lower bound loss L_vlb [42].
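A minimal sketch of one training step for this objective, assuming a DDPM-style cumulative schedule and a placeholder model interface (names such as `model` and `alphas_bar` are illustrative, not from the paper):

```python
import torch

def diffusion_training_step(model, s_context, action, s_target, alphas_bar):
    """One denoising step for L_simple: noise the target latent and regress the clean one.

    s_context:  (B, m, ...) past latent states s_tau (already VAE-encoded)
    action:     (B, 3)      navigation action (translation and yaw); the time shift and
                            diffusion timestep are embedded separately, as described above
    s_target:   (B, ...)    clean future latent s_{tau+1}
    alphas_bar: (T,)        cumulative noise schedule (as in DiT/DDPM)
    """
    B = s_target.shape[0]
    T = alphas_bar.shape[0]
    t = torch.randint(0, T, (B,), device=s_target.device)                     # random timestep per sample
    a_bar = alphas_bar.to(s_target.device)[t].view(B, *([1] * (s_target.dim() - 1)))

    eps = torch.randn_like(s_target)                                          # epsilon ~ N(0, I)
    s_noisy = a_bar.sqrt() * s_target + (1.0 - a_bar).sqrt() * eps            # forward (noising) process

    s_pred = model(s_noisy, s_context, action, t)                             # F_theta(s^(t) | s_tau, a_tau, t)
    return torch.mean((s_target - s_pred) ** 2)                               # L_simple: MSE to the clean target
```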
3.3. Navigation Planning with World Models

Here we describe how to use a trained NWM to plan navigation trajectories. Intuitively, if our world model is familiar with an environment, we can use it to simulate navigation trajectories and choose the ones which reach the goal. In unknown, out-of-distribution environments, long-term planning might rely on imagination.

Formally, given the latent encoding s_0 and a navigation target s*, we look for a sequence of actions (a_0, ..., a_{T−1}) that maximizes the likelihood of reaching s*. Let S(s_T, s*) represent the unnormalized score for reaching state s* with s_T, given the initial condition s_0, actions a = (a_0, ..., a_{T−1}), and states s = (s_1, ..., s_T) obtained by autoregressively rolling out the NWM: s ∼ F_θ(· | s_0, a). We define the energy function E(s_0, a_0, ..., a_{T−1}, s_T) such that minimizing the energy corresponds to maximizing the unnormalized perceptual similarity score and following
potential constraints on the states and actions:

E(s_0, a_0, …, a_{T−1}, s_T) = −S(s_T, s*) + Σ_{τ=0}^{T−1} 1(a_τ ∉ A_valid) + Σ_{τ=0}^{T−1} 1(s_τ ∉ S_safe)    (4)

The similarity is computed by decoding s* and s_T to pixels using a pretrained VAE decoder [4] and then measuring perceptual similarity [14, 75]. Constraints like "never go left then right" can be encoded by constraining a_τ to be in a valid action set A_valid, and "never explore the edge of the cliff" by ensuring such states s_τ are in S_safe. 1(·) denotes the indicator function, which applies a large penalty if any action or state constraint is violated.

The problem then reduces to finding the actions that minimize this energy function:

arg min_{a_0, …, a_{T−1}} E_s[ E(s_0, a_0, …, a_{T−1}, s_T) ]    (5)

This objective can be reformulated as a Model Predictive Control (MPC) problem, and we optimize it using the Cross-Entropy Method [48], a simple derivative-free, population-based optimization method which was recently used with world models for planning [77]. We include an overview of the Cross-Entropy Method and the full optimization technical details in Appendix 7.

Ranking Navigation Trajectories. Assuming we have an existing navigation policy Π(a | s_0, s*), we can use NWMs to rank sampled trajectories. Here we use NoMaD [55], a state-of-the-art navigation policy for robotic navigation. To rank trajectories, we draw multiple samples from Π and choose the one with the lowest energy, as in Eq. 5.
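The planning loop can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: `nwm_rollout`, `decode`, and `lpips` stand in for the world-model rollout, the VAE decoder, and the perceptual metric, and the indicator terms of Eq. 4 are folded into a single penalty.

```python
import numpy as np

def energy(actions, s0, goal_image, nwm_rollout, decode, lpips, valid_fn):
    """Eq. 4-style energy: perceptual distance to the goal plus a constraint penalty."""
    s_T = nwm_rollout(s0, actions)                   # autoregressive rollout to the final latent
    dissimilarity = lpips(decode(s_T), goal_image)   # lower LPIPS = closer to the goal
    penalty = 0.0 if valid_fn(actions) else 1e6      # indicator terms for invalid actions/states
    return dissimilarity + penalty

def cem_plan(s0, goal_image, nwm_rollout, decode, lpips, valid_fn,
             horizon=8, action_dim=3, iters=5, pop=64, elite=8, seed=0):
    """Cross-Entropy Method over action sequences (a_0, ..., a_{T-1})."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(iters):
        candidates = rng.normal(mu, sigma, size=(pop, horizon, action_dim))
        scores = np.array([energy(a, s0, goal_image, nwm_rollout,
                                  decode, lpips, valid_fn) for a in candidates])
        elites = candidates[np.argsort(scores)[:elite]]   # keep the lowest-energy sequences
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu  # mean of the final elite set as the planned action sequence
```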
4. Experiments and Results

We describe the experimental setting and our design choices, and compare NWM to previous approaches. Additional results are included in the Supplementary Material.

4.1. Experimental Setting

Datasets. For all robotics datasets (SCAND [30], TartanDrive [60], RECON [52], and HuRoN [27]), we have access to the location and rotation of the robots, allowing us to infer relative actions compared to the current location (see Eq. 2). To standardize the step size across agents, we divide the distance agents travel between frames by their average step size in meters, ensuring the action space is similar for different agents. We further filter out backward movements, following NoMaD [55]. Additionally, we use unlabeled Ego4D [18] videos, where the only action we consider is the time shift. SCAND provides video footage of socially compliant navigation in diverse environments, TartanDrive focuses on off-road driving, RECON covers open-world navigation, and HuRoN captures social interactions. We train on unlabeled Ego4D videos, and GO Stanford [24] serves as an unknown evaluation environment. For the full details, see Appendix 8.1.

Evaluation Metrics. We evaluate predicted navigation trajectories using Absolute Trajectory Error (ATE) for accuracy and Relative Pose Error (RPE) for pose consistency [57]. To check how semantically similar the world model predictions are to ground-truth images, we apply LPIPS [76] and DreamSim [14], which measure perceptual similarity by comparing deep features, and PSNR for pixel-level quality. For image and video synthesis quality, we use FID [23] and FVD [64], which evaluate the generated data distribution. See Appendix 8.1 for more details.

Baselines. We consider all the following baselines.
• DIAMOND [1] is a diffusion world model based on the UNet [47] architecture. We use DIAMOND in the offline reinforcement learning setting following their public code. The diffusion model is trained to autoregressively predict at 56x56 resolution alongside an upsampler to obtain 224x224 resolution predictions. To condition on continuous actions, we use a linear embedding layer.
• GNM [53] is a general goal-conditioned navigation policy trained on a dataset soup of robotic navigation datasets with a fully connected trajectory prediction network. GNM is trained on multiple datasets including SCAND, TartanDrive, GO Stanford, and RECON.
• NoMaD [55] extends GNM using a diffusion policy for predicting trajectories for robot exploration and visual navigation. NoMaD is trained on the same datasets used by GNM and on HuRoN.

Implementation Details. In the default experimental setting we use a CDiT-XL of 1B parameters with a context of 4 frames, a total batch size of 1024, and 4 different navigation goals, leading to a final total batch size of 4096. We use the Stable Diffusion [4] VAE tokenizer, similarly to DiT [44]. We use the AdamW [39] optimizer with a learning rate of 8e−5. After training, we sample 5 times from each model to report mean and std results. XL-sized models are trained on 8 H100 machines, each with 8 GPUs. Unless otherwise mentioned, we use the same setting as in the DiT-*/2 models.

4.2. Ablations

Models are evaluated on single-step, 4-second future prediction on validation-set trajectories of the known environment RECON. We evaluate the performance against the ground-truth frame by measuring LPIPS, DreamSim, and PSNR. We provide qualitative examples in Figure 3.

Model Size and CDiT. We compare CDiT (see Section 3.2) with a standard DiT in which all context tokens are fed as inputs. We hypothesize that for navigating known environments, the capacity of the model is the most important factor, and the results in Figure 5 indicate that CDiT indeed performs better with models of up to 1B parameters, while consuming less than 2× the FLOPs. Surprisingly, even with an equal number of parameters (e.g., CDiT-L compared to DiT-XL), CDiT is 4× faster and performs better.
Figure 3. Following trajectories in known environments. We include qualitative video generation comparisons of different models following ground truth trajectories. Click on the image to play the video clip in a browser.

Number of Goals. We train models with a variable number of goal states given a fixed context, changing the number of goals from 1 to 4. Each goal is randomly chosen within a ±16-second window around the current state. The results reported in Table 1 indicate that using 4 goals leads to significantly improved prediction performance in all metrics.

Context Size. We train models while varying the number of conditioning frames from 1 to 4 (see Table 1). Unsurprisingly, more context helps, and with short context the model often "loses track", leading to poor predictions.

Time and Action Conditioning. We train our model with both time and action conditioning and test how much each input contributes to the prediction performance (we include the results in Table 1). We find that running the model with time only leads to poor performance, while not conditioning on time leads to a small drop in performance as well. This confirms that both inputs are beneficial to the model.

4.3. Video Prediction and Synthesis

We evaluate how well our model follows ground truth actions and predicts future states. The model is conditioned on the first image and context frames, then autoregressively predicts the next state using ground truth actions, feeding back each prediction. We compare predictions to ground truth images at 1, 2, 4, 8, and 16 seconds, reporting FID and LPIPS on the RECON dataset. Figure 4 shows performance over time compared to DIAMOND at 4 FPS and 1 FPS, showing that NWM predictions are significantly more accurate than DIAMOND's. Initially, the NWM 1 FPS variant performs better, but after 8 seconds predictions degrade due to accumulated errors and loss of context, and the 4 FPS variant becomes superior. See qualitative examples in Figure 3.
Figure 7. Ranking an external policy's trajectories using NWM. To navigate from the observation image to the goal, we sample trajectories from NoMaD [55], simulate each of these trajectories using NWM, score them (see Equation 4), and rank them. With NWM we can accurately choose trajectories that are closer to the ground-truth trajectory. Click the image to play examples in a browser.

Generation Quality. To evaluate video quality, we autoregressively predict videos at 4 FPS for 16 seconds, while conditioning on ground truth actions. We then evaluate the quality of the generated videos using FVD, compared to DIAMOND [1]. The results in Figure 6 indicate that NWM outputs higher-quality videos.

4.4. Planning Using a Navigation World Model

Next, we describe experiments that measure how well we can navigate using a NWM. We include the full technical details of the experiments in Appendix 8.2.

Standalone Planning. We demonstrate that NWM can be effectively used independently for goal-conditioned navigation. We condition it on past observations and a goal image, and use the Cross-Entropy Method to find a trajectory that minimizes the LPIPS distance of the last predicted image to the goal image (see Equation 5). To rank an action sequence, we execute the NWM and measure LPIPS between the last state and the goal 3 times to get an average score. We generate trajectories of length 8, with a temporal shift of k = 0.25. We evaluate the model performance in Table 2 and find that using a NWM for planning leads to competitive results with state-of-the-art policies.

Planning with Constraints. World models allow planning under constraints, for example requiring straight motion or a single turn. We show that NWM supports constraint-aware planning. In forward-first, the agent moves forward for 5 steps, then turns for 3. In left-right first, it turns for 3 steps before moving forward. In straight then forward, it moves straight for 3 steps, then forward. Constraints are enforced by zeroing out specific actions; e.g., in left-right first, forward motion is zeroed for the first 3 steps, and Standalone Planning optimizes the rest. We report the norm of the difference in final position and yaw relative to unconstrained planning. Results (Table 3) show that NWM plans effectively under constraints, with only minor performance drops (see examples in Figure 9).

Using a Navigation World Model for Ranking. NWM can enhance existing navigation policies in goal-conditioned navigation. Conditioning NoMaD on past observations and a goal image, we sample n ∈ {16, 32} trajectories, each of length 8, and evaluate them by autoregressively following the actions using NWM. Finally, we rank each trajectory's final prediction by measuring LPIPS similarity with the goal image (see Figure 7). We report ATE and RPE on all in-domain datasets (Table 2) and find that NWM-based trajectory ranking improves navigation performance, with more samples yielding better results.
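Sampling-and-ranking with an external policy can be sketched in the same spirit (again an illustration only; `policy_sample`, `nwm_rollout`, `decode`, and `lpips` are hypothetical stand-ins for drawing one NoMaD trajectory, the world-model rollout, the VAE decoder, and the perceptual metric):

```python
def rank_trajectories(s0, goal_image, policy_sample, nwm_rollout, decode, lpips,
                      n_samples=32, horizon=8):
    """Draw candidate action sequences from a policy and keep the one whose
    NWM-simulated final frame is perceptually closest (lowest LPIPS) to the goal."""
    best_actions, best_score = None, float("inf")
    for _ in range(n_samples):
        actions = policy_sample(s0, goal_image, horizon)  # one trajectory from the external policy
        s_final = nwm_rollout(s0, actions)                # simulate it with the world model
        score = lpips(decode(s_final), goal_image)        # perceptual distance to the goal image
        if score < best_score:
            best_actions, best_score = actions, score
    return best_actions
```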
4.5. Generalization to Unknown Environments

Here we experiment with adding unlabeled data and ask whether NWM can make predictions in new environments using imagination. In this experiment, we train a model on all in-domain datasets, as well as a subset of unlabeled videos from Ego4D, for which we only have access to the time-shift action.
Figure 8. Navigating Unknown Environments. NWM is conditioned on a single image, and autoregressively predicts the next states
given the associated actions (marked in yellow). Click on the image to play the video clip in a browser.
Table 4. Training on additional unlabeled data improves performance on unseen environments. We report results on an unknown environment (Go Stanford) and a known one (RECON). Results are reported by evaluating 4 seconds into the future.
Figure 9. Planning with Constraints Using NWM. We visualize trajectories planned with NWM under the constraint of moving left or right first, followed by forward motion. The planning objective is to reach the same final position and orientation as the ground truth (GT) trajectory. Shown are the costs for proposed trajectories 0, 1, and 2, with trajectory 0 (in green) achieving the lowest cost.

Figure 10. Limitations and Failure Cases. In unknown environments, a common failure case is mode collapse, where the model outputs slowly become more similar to data seen in training. Click on the image to play the video clip in a browser.

We train a CDiT-XL model and test it on the Go Stanford dataset as well as other random images. We report the results in Table 4, finding that training on unlabeled data leads to significantly better video predictions according to all metrics, including improved generation quality. We include qualitative examples in Figure 8. Compared to the in-domain setting (Figure 3), the model breaks down faster and, as expected, hallucinates paths as it generates traversals of imagined environments.

5. Limitations

We identify multiple limitations. First, when applied to out-of-distribution data, the model tends to slowly lose context and generate next states that resemble the training data, a phenomenon that has been observed in image generation and is known as mode collapse [56, 58]. We include such an example in Figure 10. Second, while the model can plan, it struggles with simulating temporal dynamics such as pedestrian motion (although in some cases it does). Both limitations are likely to be mitigated with longer context and more training data. Additionally, the model currently uses 3-DoF navigation actions; extending it to 6-DoF navigation, and potentially to richer action spaces (like controlling the joints of a robotic arm), is possible as well, which we leave for future work.

6. Discussion

Our proposed Navigation World Model (NWM) offers a scalable, data-driven approach to learning world models for visual navigation. However, we are not yet sure which representations enable this, as our NWM does not explicitly utilize a structured map of the environment. One idea is that next-frame prediction from an egocentric point of view can drive the emergence of allocentric representations [65]. Ultimately, our approach bridges learning from video, visual navigation, and model-based planning, and could potentially open the door to self-supervised systems that not only perceive but can also plan to inform action.

Acknowledgments. We thank Noriaki Hirose for his help with the HuRoN dataset and for sharing his insights, and Manan Tomar, David Fan, Sonia Joseph, Angjoo Kanazawa, Ethan Weber, Nicolas Ballas, and the anonymous reviewers for their helpful discussions and feedback.
References

[1] Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in Atari. In Thirty-eighth Conference on Neural Information Processing Systems.
[2] Amir Bar, Roei Herzig, Xiaolong Wang, Anna Rohrbach, Gal Chechik, Trevor Darrell, and Amir Globerson. Compositional video synthesis with action graphs. In International Conference on Machine Learning, pages 662–673. PMLR, 2021.
[3] Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945, 2024.
[4] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
[5] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators, 2024.
[6] Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024.
[7] Eric R. Chan, Koki Nagano, Matthew A. Chan, Alexander W. Bergman, Jeong Joon Park, Axel Levy, Miika Aittala, Shalini De Mello, Tero Karras, and Gordon Wetzstein. Generative novel view synthesis with 3d-aware diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4217–4229, 2023.
[8] Devendra Singh Chaplot, Dhiraj Gandhi, Saurabh Gupta, Abhinav Gupta, and Ruslan Salakhutdinov. Learning to explore using active neural SLAM. In International Conference on Learning Representations.
[9] Tao Chen, Saurabh Gupta, and Abhinav Gupta. Learning exploration policies for navigation. In International Conference on Learning Representations.
[10] Alejandro Escontrela, Ademi Adeniji, Wilson Yan, Ajay Jain, Xue Bin Peng, Ken Goldberg, Youngwoon Lee, Danijar Hafner, and Pieter Abbeel. Video prediction models as rewards for reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.
[11] Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793. IEEE, 2017.
[12] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.
[13] J Frey, M Mattamala, N Chebrolu, C Cadena, M Fallon, and M Hutter. Fast traversability estimation for wild visual navigation. Robotics: Science and Systems Proceedings, 19, 2023.
[14] Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. DreamSim: Learning new dimensions of human visual similarity using synthetic data. Advances in Neural Information Processing Systems, 36, 2024.
[15] Zipeng Fu, Ashish Kumar, Ananye Agarwal, Haozhi Qi, Jitendra Malik, and Deepak Pathak. Coupling vision and proprioception for navigation of legged robots. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17273–17283, 2022.
[16] Junyu Gao, Xuan Yao, and Changsheng Xu. Fast-slow test-time adaptation for online vision-and-language navigation. In Proceedings of the 41st International Conference on Machine Learning, pages 14902–14919. PMLR, 2024.
[17] Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu Video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023.
[18] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022.
[19] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
[20] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations.
[21] Danijar Hafner, Timothy P Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering Atari with discrete world models. In International Conference on Learning Representations.
[22] Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. In The Twelfth International Conference on Learning Representations.
[23] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
[24] Noriaki Hirose, Amir Sadeghian, Marynel Vázquez, Patrick Goebel, and Silvio Savarese. GoNet: A semi-supervised deep learning approach for traversability estimation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3044–3051. IEEE, 2018.
[25] Noriaki Hirose, Amir Sadeghian, Fei Xia, Roberto Martín-Martín, and Silvio Savarese. VUNet: Dynamic scene view synthesis for traversability estimation using an RGB camera. IEEE Robotics and Automation Letters, 2019.
[26] Noriaki Hirose, Fei Xia, Roberto Martín-Martín, Amir Sadeghian, and Silvio Savarese. Deep visual MPC-policy learning for navigation. IEEE Robotics and Automation Letters, 4(4):3184–3191, 2019.
[27] Noriaki Hirose, Dhruv Shah, Ajay Sridhar, and Sergey Levine. SACSoN: Scalable autonomous control for social navigation. IEEE Robotics and Automation Letters, 2023.
[28] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[29] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
[30] Haresh Karnan, Anirudh Nair, Xuesu Xiao, Garrett Warnell, Sören Pirk, Alexander Toshev, Justin Hart, Joydeep Biswas, and Peter Stone. Socially compliant navigation dataset (SCAND): A large-scale dataset of demonstrations for social navigation. IEEE Robotics and Automation Letters, 7(4):11807–11814, 2022.
[31] Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge, and Peter Anderson. Pathdreamer: A world model for indoor navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14738–14748, 2021.
[32] Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jose Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. VideoPoet: A large language model for zero-shot video generation. In Forty-first International Conference on Machine Learning.
[33] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
[34] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. ArXiv e-prints, pages arXiv–1607, 2016.
[35] Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, and Carl Vondrick. Dreamitate: Real-world visuomotor policy learning via video generation, 2024.
[36] Han Lin, Tushar Nagarajan, Nicolas Ballas, Mido Assran, Mojtaba Komeili, Mohit Bansal, and Koustuv Sinha. VEDIT: Latent prediction architecture for procedural video representation learning, 2024.
[37] Jessy Lin, Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, and Anca Dragan. Learning to model the world with language, 2024.
[38] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9298–9309, 2023.
[39] I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[40] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
[41] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andy Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. Learning to navigate in complex environments. In International Conference on Learning Representations, 2022.
[42] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In Proceedings of the 38th International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
[43] Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-shot visual imitation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2050–2053, 2018.
[44] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4195–4205, 2023.
[45] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie Gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024.
[46] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. DreamFusion: Text-to-3d using 2d diffusion. In The Eleventh International Conference on Learning Representations.
[47] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
[48] Reuven Y Rubinstein. Optimization of computer simulation models with rare events. European Journal of Operational Research, 99(1):89–112, 1997.
[49] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied AI research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9339–9347, 2019.
[50] Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, and Pieter Abbeel. Masked world models for visual control. In Conference on Robot Learning, pages 1332–1344. PMLR, 2023.
[51] Dhruv Shah, Ajay Sridhar, Nitish Dashora, Kyle Stachowicz, Kevin Black, Noriaki Hirose, and Sergey Levine. ViNT: A foundation model for visual navigation. In 7th Annual Conference on Robot Learning.
[52] Dhruv Shah, Benjamin Eysenbach, Gregory Kahn, Nicholas Rhinehart, and Sergey Levine. Rapid exploration for open-world navigation with latent goal models. arXiv preprint arXiv:2104.05859, 2021.
[53] Dhruv Shah, Ajay Sridhar, Arjun Bhorkar, Noriaki Hirose, and Sergey Levine. GNM: A general navigation model to drive any robot. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7226–7233. IEEE, 2023.
[54] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
[55] Ajay Sridhar, Dhruv Shah, Catherine Glossop, and Sergey Levine. NoMaD: Goal masked diffusion policies for navigation and exploration. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 63–70. IEEE, 2024.
[56] Akash Srivastava, Lazar Valkov, Chris Russell, Michael U Gutmann, and Charles Sutton. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. Advances in Neural Information Processing Systems, 30, 2017.
[57] Jürgen Sturm, Wolfram Burgard, and Daniel Cremers. Evaluating egomotion and structure-from-motion approaches using the TUM RGB-D benchmark. In Proc. of the Workshop on Color-Depth Camera Fusion in Robotics at the IEEE/RSJ International Conference on Intelligent Robot Systems (IROS), page 6, 2012.
[58] Hoang Thanh-Tung and Truyen Tran. Catastrophic forgetting and mode collapse in GANs. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–10. IEEE, 2020.
[59] Manan Tomar, Philippe Hansen-Estruch, Philip Bachman, Alex Lamb, John Langford, Matthew E. Taylor, and Sergey Levine. Video occupancy models, 2024.
[60] Samuel Triest, Matthew Sivaprakasam, Sean J Wang, Wenshan Wang, Aaron M Johnson, and Sebastian Scherer. TartanDrive: A large-scale dataset for learning off-road dynamics models. In 2022 International Conference on Robotics and Automation (ICRA), pages 2546–2552. IEEE, 2022.
[61] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1526–1535, 2018.
[62] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1526–1535, 2018.
[63] Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, and Noah Snavely. MegaScenes: Scene-level view synthesis at scale. In Computer Vision – ECCV 2024, pages 197–214, Cham, 2025. Springer Nature Switzerland.
[64] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. FVD: A new metric for video generation. 2019.
[65] Benigno Uria, Borja Ibarz, Andrea Banino, Vinicius Zambaldi, Dharshan Kumaran, Demis Hassabis, Caswell Barry, and Charles Blundell. A model of egocentric to allocentric understanding in mammalian brains. bioRxiv, 2022.
[66] Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837, 2024.
[67] Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl Vondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis. 2024.
[68] A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017.
[69] Vikram Voleti, Alexia Jolicoeur-Martineau, and Chris Pal. MCVD – Masked conditional video diffusion for prediction, generation, and interpolation. Advances in Neural Information Processing Systems, 35:23371–23385, 2022.
[70] Fu-Yun Wang, Zhaoyang Huang, Alexander Bergman, Dazhong Shen, Peng Gao, Michael Lingelbach, Keqiang Sun, Weikang Bian, Guanglu Song, Yu Liu, et al. Phased consistency models. Advances in Neural Information Processing Systems, 37:83951–84009, 2024.
[71] Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, and Ken Goldberg. DayDreamer: World models for physical robot learning. In Conference on Robot Learning, pages 2226–2240. PMLR, 2023.
[72] Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, and Junyang Lin. Understanding and improving layer normalization, 2019.
[73] Sherry Yang, Yilun Du, Seyed Kamyar Seyed Ghasemipour, Jonathan Tompson, Leslie Pack Kaelbling, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. In The Twelfth International Conference on Learning Representations.
[74] Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. MAGVIT: Masked generative video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10459–10469, 2023.
[75] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[76] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
[77] Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. DINO-WM: World models on pre-trained visual features enable zero-shot planning, 2024.