MindJourney: Test-Time Scaling with World Models for Spatial Reasoning

  • Reuben Tan, Microsoft

The video introduces MindJourney, a framework for improving the spatial reasoning of Vision-Language Models (VLMs), which excel at interpreting single images but struggle to infer the underlying three-dimensional world. Given a spatial reasoning question, MindJourney lets the VLM “imagine” moving through the scene: the VLM proposes trajectories in a simulated imagination space, and a world model generates novel views along these paths, expanding the available observations beyond the single input image. This richer 3D context enables the VLM to answer questions it previously found challenging.
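
The loop described above can be sketched in a few lines of Python. This is only an illustrative outline under stated assumptions, not the authors' implementation: the function names (`imagine_and_answer`, `propose_trajectories`, `render_view`, `answer`) and the string-based toy stand-ins for the VLM and the world model are all hypothetical, and a real system would plug in actual models in their place.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: an "image" is any object, an action is a short string.
Image = object
Trajectory = Sequence[str]


def imagine_and_answer(
    image: Image,
    question: str,
    propose_trajectories: Callable[[Image, str], List[Trajectory]],
    render_view: Callable[[Image, Trajectory], Image],
    answer: Callable[[List[Image], str], str],
) -> Tuple[str, List[Image]]:
    """Collect imagined views along proposed paths, then answer the question."""
    views: List[Image] = [image]  # always keep the original observation
    for trajectory in propose_trajectories(image, question):
        # Ask the (assumed) world model for a view after each step of the path.
        for step in range(1, len(trajectory) + 1):
            views.append(render_view(image, trajectory[:step]))
    return answer(views, question), views


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_propose(image: Image, question: str) -> List[Trajectory]:
    return [("forward",), ("turn_left", "forward")]


def toy_render(image: Image, actions: Trajectory) -> Image:
    return f"{image}|{'>'.join(actions)}"


def toy_answer(views: List[Image], question: str) -> str:
    return f"answered using {len(views)} views"


result, collected = imagine_and_answer(
    "img", "Is the chair left of the table?", toy_propose, toy_render, toy_answer
)
```

Passing the models in as callables keeps the sketch independent of any particular VLM or world model; the only structure it commits to is the one stated in the summary: propose paths, generate views along them, then answer with the expanded set of observations.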