MindJourney: Test-Time Scaling with World Models for Spatial Reasoning


Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state-of-the-art vision-language models (VLMs) frequently struggle with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test-time scaling framework that grants a VLM this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over the multi-view evidence gathered during this interactive exploration. Without any fine-tuning, MindJourney achieves an average performance boost of over 8% on the representative spatial reasoning benchmark SAT, showing that pairing VLMs with world models for test-time scaling offers a simple, plug-and-play route to robust 3D reasoning. Moreover, our method also improves upon test-time inference VLMs trained through reinforcement learning, which demonstrates the potential of using world models for test-time scaling.
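
The propose-imagine-reason loop described above can be summarized in a short sketch. The interfaces below (`VLM.propose_action`, `VLM.answer`, `WorldModel.render`, and the `mindjourney_infer` driver) are hypothetical placeholders, not the released API; this is a minimal illustration of the test-time scaling loop under those assumptions.

```python
from typing import Any, List, Optional, Protocol

Image = Any  # placeholder for whatever image type the VLM and world model consume


class VLM(Protocol):
    """Assumed interface for the vision-language model (hypothetical names)."""

    def propose_action(self, question: str, views: List[Image],
                       trajectory: List[str]) -> Optional[str]:
        """Sketch the next egocentric camera move (e.g. 'turn_left'), or None to stop."""
        ...

    def answer(self, question: str, views: List[Image]) -> str:
        """Answer the spatial question using all gathered multi-view evidence."""
        ...


class WorldModel(Protocol):
    """Assumed interface for the controllable video-diffusion world model."""

    def render(self, views: List[Image], action: str) -> Image:
        """Synthesize the view that would be observed after taking `action`."""
        ...


def mindjourney_infer(vlm: VLM, world_model: WorldModel, image: Image,
                      question: str, max_steps: int = 8) -> str:
    """Test-time scaling loop: the VLM explores an imagined scene rendered by
    the world model, then answers from the accumulated multi-view evidence."""
    views: List[Image] = [image]   # observations so far, starting from the input image
    trajectory: List[str] = []     # camera moves proposed so far
    for _ in range(max_steps):
        action = vlm.propose_action(question, views, trajectory)
        if action is None:         # the VLM decides it has seen enough
            break
        views.append(world_model.render(views, action))  # imagine the next view
        trajectory.append(action)
    return vlm.answer(question, views)
```

No fine-tuning is involved anywhere in this loop; the extra compute is spent entirely at inference time on imagined exploration.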

MindJourney: Test-Time Scaling with World Models for Spatial Reasoning

The video introduces MindJourney, a framework that enhances Vision-Language Models (VLMs), which excel at interpreting single images but struggle to infer the underlying three-dimensional world. Given a spatial reasoning question, MindJourney lets the VLM "imagine" moving through the scene by proposing camera trajectories in a simulated imagination space. A world model then generates novel views along these paths, expanding the available observations beyond the single input image. This richer 3D context enables the VLM to answer previously challenging questions with greater ease.