2026-03-20 Papers

1/2

Paper 1

Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Published: 2026-03-19

Link: http://arxiv.org/pdf/2603.19235

1. 📘 Topic and Domain: The paper addresses spatial blindness in Multimodal Large Language Models (MLLMs) for 3D scene understanding by leveraging implicit 3D priors from video generation models.
2. 💡 Previous Research and New Ideas: Building on existing MLLMs that rely on explicit 3D modalities or complex geometric supervision, the paper proposes using pre-trained video diffusion models as "Latent World Simulators" to extract implicit spatial priors without requiring 3D annotations.
3. ❓ Problem: The paper aims to solve the "spatial blindness" problem in MLLMs, where models struggle with fine-grained geometric reasoning and physical dynamics understanding despite strong semantic capabilities.
4. 🛠️ Methods: The authors introduce VEGA-3D, a plug-and-play framework that extracts spatiotemporal features from video diffusion models through noise injection and integrates them with semantic representations via a token-level adaptive gated fusion mechanism.
5. 📊 Results and Evaluation: VEGA-3D achieves superior performance across 3D scene understanding benchmarks (e.g., 63.2% on ScanRefer, 106.3 CIDEr on ScanQA), spatial reasoning tasks, and robotic manipulation, consistently outperforming state-of-the-art baselines without requiring explicit 3D supervision.
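
The noise-injection step in item 4 can be sketched in a few lines. The interpolation form and t_k = 0.3 come from the paper's figure; the latent shape and the standalone helper are illustrative assumptions (the real pipeline operates on the video model's VAE latents):

```python
import numpy as np

def inject_noise(z0, t_k=0.3, rng=None):
    """Noise injection from the VEGA-3D figure: z_k = (1 - t_k) * z_0 + t_k * eps.

    Partially noising a clean video latent pushes the frozen video diffusion
    model into its denoising regime, where intermediate layers (layer 20 in
    the paper) expose multi-view-consistent spatial cues.
    """
    rng = np.random.default_rng(rng)
    eps = rng.standard_normal(z0.shape)  # Gaussian noise, same shape as the latent
    return (1.0 - t_k) * z0 + t_k * eps

# toy latent with assumed shape (frames, channels, height, width)
z0 = np.zeros((4, 16, 8, 8))
z_k = inject_noise(z0, t_k=0.3, rng=0)  # 30% noise, 70% clean latent
```

The partially noised latent is then passed through the frozen simulator, and intermediate features are tapped rather than running the full denoising loop.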

Figure: VEGA-3D workflow (repurposing video generation models as "Latent World Simulators")

- Input: multi-view images and a text query.
- Semantic branch: a visual encoder (e.g., SigLIP) produces semantic features.
- Generative branch: a latent world simulator (e.g., Wan2.1-T2V) produces generative features via noise injection, z_k = (1 - t_k)·z_0 + t_k·ε with t_k = 0.3, tapping layer-20 features; these carry multi-view consistency, 3D structural priors, and geometric cues.
- Adaptive gated fusion (token-level gating): g_i = σ(W_g^T · Concat(F_gen,i, F_sem,i)); F_fused = (1 - g)·F_gen + g·F_sem.
- MLLM processing: tokens enriched with 3D structural awareness serve as dense geometric anchors.
- Downstream tasks: 3D scene understanding (grounding, captioning, QA), spatial reasoning (VSI-Bench), and robotic manipulation (LIBERO).
- Key innovation: a plug-and-play framework with no explicit 3D supervision; gains of +5.1% on ScanRefer, +2.8% on Multi3DRefer, +4.2% on ScanQA, and +2.7% on SQA3D.
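
The gating equations in the figure translate directly into a small PyTorch module; the hidden size, batch shapes, and the single-linear-layer gate are assumptions beyond what the figure states:

```python
import torch
import torch.nn as nn

class AdaptiveGatedFusion(nn.Module):
    """Token-level gated fusion from the VEGA-3D figure:
        g_i     = sigmoid(W_g^T [F_gen,i ; F_sem,i])
        F_fused = (1 - g) * F_gen + g * F_sem
    """
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)  # W_g over concatenated features

    def forward(self, f_gen: torch.Tensor, f_sem: torch.Tensor) -> torch.Tensor:
        # per-token scalar gate in (0, 1), shape (batch, tokens, 1)
        g = torch.sigmoid(self.gate(torch.cat([f_gen, f_sem], dim=-1)))
        return (1.0 - g) * f_gen + g * f_sem

fusion = AdaptiveGatedFusion(dim=64)
f_gen = torch.randn(2, 16, 64)  # generative-branch tokens
f_sem = torch.randn(2, 16, 64)  # semantic-branch (e.g., SigLIP) tokens
fused = fusion(f_gen, f_sem)
```

Because the gate is a per-token convex combination, each fused token interpolates between its generative and semantic counterparts, letting the model lean on geometry or semantics token by token.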
Q1. What is the key insight behind VEGA-3D's approach to solving spatial blindness in MLLMs?
a) Video generation models inherently learn 3D structural priors through synthesizing temporally coherent videos
b) Adding more 3D point cloud data is the only way to improve spatial understanding
c) Semantic features alone are sufficient for geometric reasoning tasks

Q2. At what diffusion timestep does VEGA-3D extract the most informative spatial cues from the video generation model?
a) At the very beginning (t=0), when the latent is clean
b) At intermediate noise levels (around t=0.3)
c) At the end of diffusion (t=1.0), with maximum noise

Q3. Which architectural choice showed the strongest correlation with downstream 3D understanding performance?
a) UNet-based models, due to their local convolutional bias
b) DiT-based models with global attention mechanisms, achieving >96% multi-view consistency
c) Traditional discriminative encoders such as DINO and V-JEPA

Paper 2

SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Published: 2026-03-19

Link: http://arxiv.org/pdf/2603.19228

1. 📘 Topic and Domain: The paper focuses on instruction-guided video editing using diffusion models in the computer vision domain.
2. 💡 Previous Research and New Ideas: Building on existing diffusion-based video editing approaches that rely on external priors (VLM features, structural conditions), the paper proposes SAMA which factorizes video editing into semantic anchoring and motion alignment without requiring explicit external priors.
3. ❓ Problem: The paper addresses the challenge of balancing precise semantic modifications with faithful motion preservation in instruction-guided video editing, where current models struggle with conflicts between these two requirements.
4. 🛠️ Methods: SAMA uses a two-stage training pipeline: factorized pre-training with semantic anchoring (predicting semantic tokens at sparse anchor frames) and motion alignment (motion-centric video restoration tasks), followed by supervised fine-tuning on paired editing data.
5. 📊 Results and Evaluation: SAMA achieves state-of-the-art performance among open-source models on VIE-Bench, OpenVE-Bench, and ReCo-Bench, with competitive results against commercial systems like Kling-Omni, demonstrating strong zero-shot editing capabilities and improved instruction following with temporal consistency.
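
Of the motion-centric pretext tasks named in item 4, Tube Shuffle is the easiest to illustrate. The sketch below is one plausible reading, not the paper's implementation (the summary only names the task): the clip is cut into fixed-length temporal tubes and permuted, and the model would be trained to restore the original order, which forces it to attend to motion rather than per-frame appearance.

```python
import numpy as np

def tube_shuffle(video, tube=2, rng=None):
    """Hypothetical 'Tube Shuffle' corruption: permute fixed-length
    temporal tubes of a clip. Tube length and the exact corruption
    scheme are assumptions beyond what the summary states.
    """
    rng = np.random.default_rng(rng)
    t = video.shape[0]
    assert t % tube == 0, "frame count must be divisible by tube length"
    # group frames into (num_tubes, tube, ...) and permute the tubes
    tubes = video.reshape(t // tube, tube, *video.shape[1:])
    perm = rng.permutation(t // tube)
    return tubes[perm].reshape(video.shape), perm

video = np.arange(8 * 3).reshape(8, 3).astype(float)  # 8 frames, 3 "pixels" each
shuffled, perm = tube_shuffle(video, tube=2, rng=0)
```

The returned permutation would serve as the restoration target in a motion-centric video restoration objective.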

Figure: SAMA factorized video editing workflow

- Stage 0, factorized pre-training:
  - Semantic Anchoring: predict semantic tokens at sparse anchor frames, instruction-aware.
  - Motion Alignment: motion-centric pretext tasks (Cube Inpainting, Speed Perturbation, Tube Shuffle).
  - Training data: image editing pairs (NHR-Edit, GPT-Image-Edit, X2Edit), text-to-video data (Koala-36M, MotionBench), and motion-centric pretext tasks on videos.
- Stage 1, supervised fine-tuning:
  - Paired video editing: source video + instruction + target video, with Semantic Anchoring still enabled to resolve semantic-motion conflicts.
  - Training data: video editing datasets (Ditto-1M, OpenVE-3M, ReCo-Data) and image editing data (NHR-Edit, Pico-Banana-400K), with VLM-based quality filtering.
- Model architecture: DiT backbone (Wan2.1-T2V-14B), VAE encoder for video tokenization, semantic encoder (SigLIP + projector), and type embeddings (0: source, 1: semantic, 2: target).
- Training setup: learning rate 2×10⁻⁵; batch size 448 (image) / 112 (video); resolution 480p; λ = 0.1; EMA decay 0.9998.
- Output capabilities: zero-shot video editing, state-of-the-art performance on benchmarks, competitive with commercial systems.
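
The type-embedding scheme from the architecture notes (learned IDs 0/1/2 for source, semantic, and target tokens) can be sketched as follows; the hidden size and token counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Learned type embeddings let a single DiT backbone distinguish the three
# token streams in SAMA's unified formulation: 0 = source video,
# 1 = semantic anchor tokens, 2 = (noisy) target video.
dim = 64
type_emb = nn.Embedding(3, dim)

src = torch.randn(1, 10, dim)  # source-video tokens
sem = torch.randn(1, 4, dim)   # semantic anchor tokens
tgt = torch.randn(1, 10, dim)  # target-video tokens

tokens = torch.cat([src, sem, tgt], dim=1)
ids = torch.cat([torch.full((1, 10), 0),
                 torch.full((1, 4), 1),
                 torch.full((1, 10), 2)], dim=1)
tokens = tokens + type_emb(ids)  # one Transformer sees all three streams
```

This keeps the formulation unified: no separate branches per stream, just an additive tag on each token.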
Q1. What are the three motion-centric pretext tasks used in SAMA's Motion Alignment component?
a) Cube Inpainting, Speed Perturbation, and Tube Shuffle
b) Frame Interpolation, Motion Blur, and Temporal Masking
c) Optical Flow Prediction, Depth Estimation, and Object Tracking

Q2. What surprising capability emerges from SAMA's factorized pre-training stage alone, even without paired video editing data?
a) The ability to generate videos from text descriptions
b) Strong zero-shot video editing behavior
c) Automatic video segmentation and object detection

Q3. How does SAMA distinguish between the different token types (source video, semantic tokens, and target video) in its unified formulation?
a) By using different neural network branches for each token type
b) By applying shifted Rotary Position Embeddings (RoPE)
c) By adding learned type embeddings with IDs 0, 1, and 2, respectively

Paper 3

FASTER: Rethinking Real-Time Flow VLAs

Published: 2026-03-19

Link: http://arxiv.org/pdf/2603.19199

1. 📘 Topic and Domain: The paper focuses on real-time Vision-Language-Action (VLA) models for robotic manipulation, specifically addressing reaction latency in flow-based action chunking policies.
2. 💡 Previous Research and New Ideas: The paper builds on existing flow-based VLAs (π0.5, X-VLA) and asynchronous inference methods, proposing a novel Horizon-Aware Schedule that prioritizes immediate actions during flow sampling to achieve 10× faster reaction times without architectural changes.
3. ❓ Problem: The paper solves the reaction latency bottleneck in action chunking VLA policies, where constant timestep schedules force completion of all sampling steps before any movement can start, limiting real-time responsiveness in dynamic tasks.
4. 🛠️ Methods: The paper introduces FASTER (Fast Action Sampling for ImmediaTE Reaction) using a Horizon-Aware Schedule that adaptively allocates sampling steps across the action chunk, enabling single-step generation of immediate actions while maintaining long-horizon trajectory quality, coupled with a streaming client-server interface.
5. 📊 Results and Evaluation: FASTER achieves 10× acceleration in Time to First Action (TTFA) compared to baselines, demonstrates superior performance in real-world tasks including table tennis (0.80 vs 0.20 score on RTX 4090), and maintains competitive performance on simulation benchmarks (96.5% on LIBERO, 4.292 on CALVIN).

Figure: FASTER method workflow

1. Problem analysis: action chunking policy inference; reaction-time analysis; the constant-schedule bottleneck; introduction of the TTFA (Time to First Action) metric.
2. Pilot study: straightness analysis S(A) = ∫ E[‖(A¹ − A⁰) − Ż‖²] dτ; clean action estimates Ã = A − v(o, A, τ)·τ. Finding: early actions in the chunk have lower straightness scores (i.e., straighter interpolation paths) and are easier to generate.
3. FASTER method: Horizon-Aware Schedule (HAS) u_i = (1 − i/(H−1))^α · u₀; adaptive sampling τ_i = max(0, (ρ − u_i)/(1 − u_i)); mixed-schedule training with mixing probability p = 0.5; single-step generation of the first action (10× faster).
4. Implementation: streaming client-server interface and an early-stopping strategy, yielding reduced TTFA, higher control frequency, and real-time responsiveness.
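
The two schedule formulas above turn into code directly. The sketch below implements them as stated; interpreting ρ as the global sampling progress, and the example values of H, α, and N, are assumptions beyond what the summary gives:

```python
import numpy as np

def horizon_aware_schedule(H, alpha, u0, rho):
    """FASTER's Horizon-Aware Schedule, per the workflow figure:

        u_i   = (1 - i / (H - 1))**alpha * u0      # per-action "hit time"
        tau_i = max(0, (rho - u_i) / (1 - u_i))    # local timestep

    i indexes actions in the chunk (0 = immediate action) and rho is taken
    here as the global sampling progress. Early actions get larger u_i, so
    their local timestep tau_i hits 0 (fully denoised) sooner; per the quiz,
    u0 = (N - 1)/N makes the first action ready after a single step.
    """
    i = np.arange(H)
    u = (1.0 - i / (H - 1)) ** alpha * u0
    return np.maximum(0.0, (rho - u) / (1.0 - u))

# chunk of H = 8 actions, N = 4 sampling steps, so u0 = (N - 1)/N = 0.75;
# after one step the global progress reaches rho = 0.75
tau = horizon_aware_schedule(H=8, alpha=2.0, u0=0.75, rho=0.75)
```

At this point τ₀ = 0: the immediate action is fully denoised and can be streamed to the robot, while later actions in the chunk still carry residual noise and continue sampling.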
Q1. What key insight led to the development of FASTER's Horizon-Aware Schedule?
a) Near-term actions in a chunk follow straighter interpolation paths and require fewer sampling steps than future actions
b) The VLM backbone computation time dominates the overall inference latency
c) Asynchronous inference completely eliminates reaction-time bottlenecks

Q2. In the table tennis experiment on an RTX 4090, what score did FASTER achieve compared to synchronous inference?
a) 0.47 vs 0.00
b) 0.80 vs 0.20
c) 0.95 vs 0.53

Q3. How does FASTER achieve single-step generation of immediate actions without architectural modifications?
a) By distilling the multi-step model into a one-step model through knowledge transfer
b) By setting the hit time u₀ = (N−1)/N so the first action's local timestep reaches zero after one sampling step
c) By updating the observation input at every denoising step to reduce uncertainty