2026-03-20 Papers

1/2

Paper 1

Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Published: 2026-03-19

Link: http://arxiv.org/pdf/2603.19235

1. 📘 Topic and Domain: The paper addresses spatial blindness in Multimodal Large Language Models (MLLMs) for 3D scene understanding by leveraging implicit 3D priors from video generation models.
2. 💡 Previous Research and New Ideas: Building on existing MLLMs that rely on explicit 3D modalities or complex geometric supervision, the paper proposes using pre-trained video diffusion models as "Latent World Simulators" to extract implicit spatial priors without requiring 3D annotations.
3. ❓ Problem: The paper aims to solve the "spatial blindness" problem in MLLMs, where models struggle with fine-grained geometric reasoning and physical dynamics understanding despite strong semantic capabilities.
4. 🛠️ Methods: The authors introduce VEGA-3D, a plug-and-play framework that extracts spatiotemporal features from video diffusion models through noise injection and integrates them with semantic representations via a token-level adaptive gated fusion mechanism.
5. 📊 Results and Evaluation: VEGA-3D achieves superior performance across 3D scene understanding benchmarks (e.g., 63.2% on ScanRefer, 106.3 CIDEr on ScanQA), spatial reasoning tasks, and robotic manipulation, consistently outperforming state-of-the-art baselines without requiring explicit 3D supervision.
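
The noise-injection step in item 4 can be sketched in a few lines. The interpolation form and t_k = 0.3 come from the paper's figure; the latent shape and the standalone helper are illustrative assumptions (the real pipeline operates on the video model's VAE latents):

```python
import numpy as np

def inject_noise(z0, t_k=0.3, rng=None):
    """Noise injection from the VEGA-3D figure: z_k = (1 - t_k) * z_0 + t_k * eps.

    Partially noising a clean video latent pushes the frozen video diffusion
    model into its denoising regime, where intermediate layers (layer 20 in
    the paper) expose multi-view-consistent spatial cues.
    """
    rng = np.random.default_rng(rng)
    eps = rng.standard_normal(z0.shape)  # Gaussian noise, same shape as the latent
    return (1.0 - t_k) * z0 + t_k * eps

# toy latent with assumed shape (frames, channels, height, width)
z0 = np.zeros((4, 16, 8, 8))
z_k = inject_noise(z0, t_k=0.3, rng=0)  # 30% noise, 70% clean latent
```

The partially noised latent is then passed through the frozen simulator, and intermediate features are tapped rather than running the full denoising loop.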

Figure: VEGA-3D workflow (repurposing video generation models as "Latent World Simulators")

- Input: multi-view images and a text query.
- Semantic branch: a visual encoder (e.g., SigLIP) produces semantic features.
- Generative branch: a latent world simulator (e.g., Wan2.1-T2V) produces generative features via noise injection, z_k = (1 - t_k)·z_0 + t_k·ε with t_k = 0.3, tapping layer-20 features; these carry multi-view consistency, 3D structural priors, and geometric cues.
- Adaptive gated fusion (token-level gating): g_i = σ(W_g^T · Concat(F_gen,i, F_sem,i)); F_fused = (1 - g)·F_gen + g·F_sem.
- MLLM processing: tokens enriched with 3D structural awareness serve as dense geometric anchors.
- Downstream tasks: 3D scene understanding (grounding, captioning, QA), spatial reasoning (VSI-Bench), and robotic manipulation (LIBERO).
- Key innovation: a plug-and-play framework with no explicit 3D supervision; gains of +5.1% on ScanRefer, +2.8% on Multi3DRefer, +4.2% on ScanQA, and +2.7% on SQA3D.
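
The gating equations in the figure translate directly into a small PyTorch module; the hidden size, batch shapes, and the single-linear-layer gate are assumptions beyond what the figure states:

```python
import torch
import torch.nn as nn

class AdaptiveGatedFusion(nn.Module):
    """Token-level gated fusion from the VEGA-3D figure:
        g_i     = sigmoid(W_g^T [F_gen,i ; F_sem,i])
        F_fused = (1 - g) * F_gen + g * F_sem
    """
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)  # W_g over concatenated features

    def forward(self, f_gen: torch.Tensor, f_sem: torch.Tensor) -> torch.Tensor:
        # per-token scalar gate in (0, 1), shape (batch, tokens, 1)
        g = torch.sigmoid(self.gate(torch.cat([f_gen, f_sem], dim=-1)))
        return (1.0 - g) * f_gen + g * f_sem

fusion = AdaptiveGatedFusion(dim=64)
f_gen = torch.randn(2, 16, 64)  # generative-branch tokens
f_sem = torch.randn(2, 16, 64)  # semantic-branch (e.g., SigLIP) tokens
fused = fusion(f_gen, f_sem)
```

Because the gate is a per-token convex combination, each fused token interpolates between its generative and semantic counterparts, letting the model lean on geometry or semantics token by token.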
Q1. What is the key insight behind VEGA-3D's approach to solving spatial blindness in MLLMs?
a) Video generation models inherently learn 3D structural priors through synthesizing temporally coherent videos
b) Adding more 3D point cloud data is the only way to improve spatial understanding
c) Semantic features alone are sufficient for geometric reasoning tasks

Q2. At what diffusion timestep does VEGA-3D extract the most informative spatial cues from the video generation model?
a) At the very beginning (t=0), when the latent is clean
b) At intermediate noise levels (around t=0.3)
c) At the end of diffusion (t=1.0), with maximum noise

Q3. Which architectural choice showed the strongest correlation with downstream 3D understanding performance?
a) UNet-based models, due to their local convolutional bias
b) DiT-based models with global attention mechanisms, achieving >96% multi-view consistency
c) Traditional discriminative encoders such as DINO and V-JEPA

Paper 2

SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Published: 2026-03-19

Link: http://arxiv.org/pdf/2603.19228

1. 📘 Topic and Domain: The paper focuses on instruction-guided video editing using diffusion models in the computer vision domain.
2. 💡 Previous Research and New Ideas: Building on existing diffusion-based video editing approaches that rely on external priors (VLM features, structural conditions), the paper proposes SAMA which factorizes video editing into semantic anchoring and motion alignment without requiring explicit external priors.
3. ❓ Problem: The paper addresses the challenge of balancing precise semantic modifications with faithful motion preservation in instruction-guided video editing, where current models struggle with conflicts between these two requirements.
4. 🛠️ Methods: SAMA uses a two-stage training pipeline: factorized pre-training with semantic anchoring (predicting semantic tokens at sparse anchor frames) and motion alignment (motion-centric video restoration tasks), followed by supervised fine-tuning on paired editing data.
5. 📊 Results and Evaluation: SAMA achieves state-of-the-art performance among open-source models on VIE-Bench, OpenVE-Bench, and ReCo-Bench, with competitive results against commercial systems like Kling-Omni, demonstrating strong zero-shot editing capabilities and improved instruction following with temporal consistency.
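
Of the motion-centric pretext tasks named in item 4, Tube Shuffle is the easiest to illustrate. The sketch below is one plausible reading, not the paper's implementation (the summary only names the task): the clip is cut into fixed-length temporal tubes and permuted, and the model would be trained to restore the original order, which forces it to attend to motion rather than per-frame appearance.

```python
import numpy as np

def tube_shuffle(video, tube=2, rng=None):
    """Hypothetical 'Tube Shuffle' corruption: permute fixed-length
    temporal tubes of a clip. Tube length and the exact corruption
    scheme are assumptions beyond what the summary states.
    """
    rng = np.random.default_rng(rng)
    t = video.shape[0]
    assert t % tube == 0, "frame count must be divisible by tube length"
    # group frames into (num_tubes, tube, ...) and permute the tubes
    tubes = video.reshape(t // tube, tube, *video.shape[1:])
    perm = rng.permutation(t // tube)
    return tubes[perm].reshape(video.shape), perm

video = np.arange(8 * 3).reshape(8, 3).astype(float)  # 8 frames, 3 "pixels" each
shuffled, perm = tube_shuffle(video, tube=2, rng=0)
```

The returned permutation would serve as the restoration target in a motion-centric video restoration objective.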

Figure: SAMA factorized video editing workflow

- Stage 0, factorized pre-training:
  - Semantic Anchoring: predict semantic tokens at sparse anchor frames, instruction-aware.
  - Motion Alignment: motion-centric pretext tasks (Cube Inpainting, Speed Perturbation, Tube Shuffle).
  - Training data: image editing pairs (NHR-Edit, GPT-Image-Edit, X2Edit), text-to-video data (Koala-36M, MotionBench), and motion-centric pretext tasks on videos.
- Stage 1, supervised fine-tuning:
  - Paired video editing: source video + instruction + target video, with Semantic Anchoring still enabled to resolve semantic-motion conflicts.
  - Training data: video editing datasets (Ditto-1M, OpenVE-3M, ReCo-Data) and image editing data (NHR-Edit, Pico-Banana-400K), with VLM-based quality filtering.
- Model architecture: DiT backbone (Wan2.1-T2V-14B), VAE encoder for video tokenization, semantic encoder (SigLIP + projector), and type embeddings (0: source, 1: semantic, 2: target).
- Training setup: learning rate 2×10⁻⁵; batch size 448 (image) / 112 (video); resolution 480p; λ = 0.1; EMA decay 0.9998.
- Output capabilities: zero-shot video editing, state-of-the-art performance on benchmarks, competitive with commercial systems.
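
The type-embedding scheme from the architecture notes (learned IDs 0/1/2 for source, semantic, and target tokens) can be sketched as follows; the hidden size and token counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Learned type embeddings let a single DiT backbone distinguish the three
# token streams in SAMA's unified formulation: 0 = source video,
# 1 = semantic anchor tokens, 2 = (noisy) target video.
dim = 64
type_emb = nn.Embedding(3, dim)

src = torch.randn(1, 10, dim)  # source-video tokens
sem = torch.randn(1, 4, dim)   # semantic anchor tokens
tgt = torch.randn(1, 10, dim)  # target-video tokens

tokens = torch.cat([src, sem, tgt], dim=1)
ids = torch.cat([torch.full((1, 10), 0),
                 torch.full((1, 4), 1),
                 torch.full((1, 10), 2)], dim=1)
tokens = tokens + type_emb(ids)  # one Transformer sees all three streams
```

This keeps the formulation unified: no separate branches per stream, just an additive tag on each token.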
Q1. What are the three motion-centric pretext tasks used in SAMA's Motion Alignment component?
a) Cube Inpainting, Speed Perturbation, and Tube Shuffle
b) Frame Interpolation, Motion Blur, and Temporal Masking
c) Optical Flow Prediction, Depth Estimation, and Object Tracking

Q2. What surprising capability emerges from SAMA's factorized pre-training stage alone, even without paired video editing data?
a) The ability to generate videos from text descriptions
b) Strong zero-shot video editing behavior
c) Automatic video segmentation and object detection

Q3. How does SAMA distinguish between the different token types (source video, semantic tokens, and target video) in its unified formulation?
a) By using different neural network branches for each token type
b) By applying shifted Rotary Position Embeddings (RoPE)
c) By adding learned type embeddings with IDs 0, 1, and 2, respectively

Paper 3

FASTER: Rethinking Real-Time Flow VLAs

Published: 2026-03-19

Link: http://arxiv.org/pdf/2603.19199

1. 📘 Topic and Domain: The paper focuses on real-time Vision-Language-Action (VLA) models for robotic manipulation, specifically addressing reaction latency in flow-based action chunking policies.
2. 💡 Previous Research and New Ideas: The paper builds on existing flow-based VLAs (π0.5, X-VLA) and asynchronous inference methods, proposing a novel Horizon-Aware Schedule that prioritizes immediate actions during flow sampling to achieve 10× faster reaction times without architectural changes.
3. ❓ Problem: The paper solves the reaction latency bottleneck in action chunking VLA policies, where constant timestep schedules force completion of all sampling steps before any movement can start, limiting real-time responsiveness in dynamic tasks.
4. 🛠️ Methods: The paper introduces FASTER (Fast Action Sampling for ImmediaTE Reaction) using a Horizon-Aware Schedule that adaptively allocates sampling steps across the action chunk, enabling single-step generation of immediate actions while maintaining long-horizon trajectory quality, coupled with a streaming client-server interface.
5. 📊 Results and Evaluation: FASTER achieves 10× acceleration in Time to First Action (TTFA) compared to baselines, demonstrates superior performance in real-world tasks including table tennis (0.80 vs 0.20 score on RTX 4090), and maintains competitive performance on simulation benchmarks (96.5% on LIBERO, 4.292 on CALVIN).

Figure: FASTER method workflow

1. Problem analysis: action chunking policy inference; reaction-time analysis; the constant-schedule bottleneck; introduction of the TTFA (Time to First Action) metric.
2. Pilot study: straightness analysis S(A) = ∫ E[‖(A¹ − A⁰) − Ż‖²] dτ; clean action estimates Ã = A − v(o, A, τ)·τ. Finding: early actions in the chunk have lower straightness scores (i.e., straighter interpolation paths) and are easier to generate.
3. FASTER method: Horizon-Aware Schedule (HAS) u_i = (1 − i/(H−1))^α · u₀; adaptive sampling τ_i = max(0, (ρ − u_i)/(1 − u_i)); mixed-schedule training with mixing probability p = 0.5; single-step generation of the first action (10× faster).
4. Implementation: streaming client-server interface and an early-stopping strategy, yielding reduced TTFA, higher control frequency, and real-time responsiveness.
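
The two schedule formulas above turn into code directly. The sketch below implements them as stated; interpreting ρ as the global sampling progress, and the example values of H, α, and N, are assumptions beyond what the summary gives:

```python
import numpy as np

def horizon_aware_schedule(H, alpha, u0, rho):
    """FASTER's Horizon-Aware Schedule, per the workflow figure:

        u_i   = (1 - i / (H - 1))**alpha * u0      # per-action "hit time"
        tau_i = max(0, (rho - u_i) / (1 - u_i))    # local timestep

    i indexes actions in the chunk (0 = immediate action) and rho is taken
    here as the global sampling progress. Early actions get larger u_i, so
    their local timestep tau_i hits 0 (fully denoised) sooner; per the quiz,
    u0 = (N - 1)/N makes the first action ready after a single step.
    """
    i = np.arange(H)
    u = (1.0 - i / (H - 1)) ** alpha * u0
    return np.maximum(0.0, (rho - u) / (1.0 - u))

# chunk of H = 8 actions, N = 4 sampling steps, so u0 = (N - 1)/N = 0.75;
# after one step the global progress reaches rho = 0.75
tau = horizon_aware_schedule(H=8, alpha=2.0, u0=0.75, rho=0.75)
```

At this point τ₀ = 0: the immediate action is fully denoised and can be streamed to the robot, while later actions in the chunk still carry residual noise and continue sampling.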
Q1. What key insight led to the development of FASTER's Horizon-Aware Schedule?
a) Near-term actions in a chunk follow straighter interpolation paths and require fewer sampling steps than future actions
b) The VLM backbone computation time dominates the overall inference latency
c) Asynchronous inference completely eliminates reaction-time bottlenecks

Q2. In the table tennis experiment on an RTX 4090, what score did FASTER achieve compared to synchronous inference?
a) 0.47 vs 0.00
b) 0.80 vs 0.20
c) 0.95 vs 0.53

Q3. How does FASTER achieve single-step generation of immediate actions without architectural modifications?
a) By distilling the multi-step model into a one-step model through knowledge transfer
b) By setting the hit time u₀ = (N−1)/N so the first action's local timestep reaches zero after one sampling step
c) By updating the observation input at every denoising step to reduce uncertainty