1. 📘 Topic and Domain: The paper presents a real-world grounded video world simulation model that generates city-scale videos anchored in actual urban environments, specifically Seoul.
2. 💡 Previous Research and New Ideas: Building on pretrained video world models and diffusion transformers, the paper introduces retrieval-augmented generation that uses street-view images to ground video generation in real locations rather than imagined environments (a retrieval sketch follows after this list).
3. ❓ Problem: The paper addresses the limitation that existing world models operate in entirely imagined environments, proposing to generate temporally consistent, spatially faithful videos grounded in actual physical locations.
4. 🛠️ Methods: The authors use cross-temporal pairing to handle temporal misalignment, synthetic urban datasets for trajectory diversity, view interpolation for sparse data, and a Virtual Lookahead Sink mechanism for long-horizon stability (a rollout sketch follows after this list).
5. 📊 Results and Evaluation: SWM, the proposed model, outperforms existing world models on benchmarks across Seoul, Busan, and Ann Arbor in visual quality, camera adherence, temporal coherence, and structural fidelity, maintaining stable generation over trajectories reaching hundreds of meters.
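
To make the retrieval-augmented grounding in item 2 concrete, here is a minimal sketch of one way street-view retrieval could work: given a query location, the k nearest panoramas are looked up by geographic distance and would then be encoded as conditioning context for the video diffusion transformer. The function names (`retrieve_street_views`, `haversine_m`), the brute-force nearest-neighbour search, and the toy data are illustrative assumptions, not the paper's actual retrieval pipeline.

```python
import numpy as np

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between WGS84 points."""
    r = 6_371_000.0
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dp, dl = np.radians(lat2 - lat1), np.radians(lon2 - lon1)
    a = np.sin(dp / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dl / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

def retrieve_street_views(query_latlon, db_latlon, db_images, k=4):
    """Return the k street-view panoramas closest to the query location.

    db_latlon: (N, 2) array of panorama coordinates (lat, lon).
    db_images: (N, H, W, 3) array of the corresponding panoramas.
    The retrieved images would then be encoded and fed to the video
    diffusion transformer as grounding context alongside the trajectory.
    """
    d = haversine_m(query_latlon[0], query_latlon[1],
                    db_latlon[:, 0], db_latlon[:, 1])
    idx = np.argsort(d)[:k]
    return db_images[idx], d[idx]

# Toy usage: 100 random panoramas scattered around a point in Seoul.
rng = np.random.default_rng(0)
db_latlon = np.array([37.5665, 126.9780]) + rng.normal(0, 0.002, size=(100, 2))
db_images = rng.integers(0, 255, size=(100, 8, 16, 3), dtype=np.uint8)
views, dists = retrieve_street_views((37.5665, 126.9780), db_latlon, db_images)
print(views.shape, np.round(dists, 1))
```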
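
Item 4's Virtual Lookahead Sink is described here only at a high level, so the following is a hedged sketch assuming it behaves like an attention-sink mechanism: a few persistent tokens stay visible in the key/value cache for the whole rollout while per-frame entries are evicted from a rolling window, which is one way long-horizon generation can be kept stable. All names (`attend_with_sink`, `sink_k`, `sink_v`) and the single-head NumPy attention are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_sink(q, k_cache, v_cache, sink_k, sink_v):
    """Single-head attention over a rolling frame cache plus persistent
    'sink' tokens.  The sink keys/values are always visible, so evicting
    old frames from the cache does not collapse the attention
    distribution; this is the intuition for stable long rollouts.

    q:                (T, d) queries for the frame being generated
    k_cache, v_cache: (C, d) keys/values of recent frames (rolling window)
    sink_k, sink_v:   (S, d) persistent sink keys/values
    """
    keys = np.concatenate([sink_k, k_cache], axis=0)
    vals = np.concatenate([sink_v, v_cache], axis=0)
    scores = q @ keys.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ vals

# Toy rollout: generate 20 'frames', keeping only the last 8 in cache
# while 2 sink tokens persist for the whole trajectory.
rng = np.random.default_rng(0)
d, window, n_sink = 16, 8, 2
sink_k, sink_v = rng.normal(size=(n_sink, d)), rng.normal(size=(n_sink, d))
k_cache, v_cache = np.empty((0, d)), np.empty((0, d))
for step in range(20):
    q = rng.normal(size=(1, d))
    out = attend_with_sink(q, k_cache, v_cache, sink_k, sink_v)
    k_cache = np.concatenate([k_cache, rng.normal(size=(1, d))])[-window:]
    v_cache = np.concatenate([v_cache, rng.normal(size=(1, d))])[-window:]
print(out.shape)  # (1, 16)
```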