2026-03-23 Papers

Paper 1

Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

Published: 2026-03-17

Link: http://arxiv.org/pdf/2603.17051

1. 📘 Topic and Domain: The paper focuses on aligning distilled autoregressive (AR) video models with human preferences using reinforcement learning in the domain of video generation.
2. 💡 Previous Research and New Ideas: The paper builds on distilled AR video models (Self-Forcing, Causal-Forcing, LongLive) and forward-process RL (DiffusionNFT), proposing Astrolabe - a memory-efficient online RL framework that avoids re-distillation and reverse-process optimization overhead.
3. ❓ Problem: The paper addresses the misalignment of distilled streaming video models with human preferences: despite their efficient generation, these models frequently exhibit artifacts and unnatural motion.
4. 🛠️ Methods: The authors use forward-process RL with negative-aware fine-tuning, streaming training with rolling KV-cache for long videos, and multi-reward optimization (visual quality, motion quality, text alignment) with uncertainty-aware selective regularization.
5. 📊 Results and Evaluation: Astrolabe consistently improves generation quality across multiple benchmarks (VBench, VBench-Long), raising HPSv3 scores by ~1.3-1.6 points and improving motion quality, while maintaining the same inference speed as the baseline models.
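The streaming training in point 4 hinges on a rolling KV-cache that pins a few "sink" frames forever and keeps only a sliding local window of recent frames, so memory stays bounded for long videos. A minimal sketch of that eviction policy (class and variable names are hypothetical, not from the paper's code):

```python
from collections import deque

class RollingKVCache:
    """Keeps the first `sink` frames permanently plus the most recent
    `window` frames; older non-sink frames are evicted. Hypothetical
    sketch of the cache policy described in the paper."""

    def __init__(self, sink: int = 1, window: int = 4):
        self.sink = sink
        self.sink_frames = []                 # never evicted
        self.recent = deque(maxlen=window)    # rolling local window

    def append(self, frame_kv):
        if len(self.sink_frames) < self.sink:
            self.sink_frames.append(frame_kv)
        else:
            self.recent.append(frame_kv)      # deque drops the oldest

    def context(self):
        # Attention context = sink frames + rolling window, so memory
        # stays O(sink + window) regardless of video length.
        return self.sink_frames + list(self.recent)


cache = RollingKVCache(sink=1, window=3)
for t in range(6):
    cache.append(f"kv_{t}")
print(cache.context())  # ['kv_0', 'kv_3', 'kv_4', 'kv_5']
```

In the paper's setting, gradients would flow only through the local window while the historical context stays detached; here that split is represented only by the two storage buckets.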

Workflow overview (figure): text prompts from the VidProM dataset feed a memory-efficient streaming rollout (rolling KV cache with frame sink + local window) over the distilled AR video model; group-wise sampling produces G candidate clips from a shared context; forward-process RL trains implicit policies (v⁺, v⁻) with an advantage-based loss; streaming long tuning detaches historical context and applies gradients only to the local window; a multi-reward system aggregates visual quality, motion quality, and text alignment; reward hacking is mitigated by an uncertainty-aware selective KL penalty and dynamic EMA reference updates, yielding the aligned output.
Q1
1. What does the paper's title 'Astrolabe', named after the astronomical navigation instrument, metaphorically represent in the context of video generation?
A tool for steering distilled AR models toward human preferences without re-distillation
A celestial coordinate system for mapping video frame sequences
A telescope for observing long-range temporal dependencies
Q2
2. How does Astrolabe handle the memory challenge when training on long videos?
By compressing all frames into a single latent representation
By maintaining a rolling KV-cache with fixed frame sinks and applying gradients only to local windows
By training separate models for each video segment and merging them later
Q3
3. What happens when the paper's authors remove the adaptive loss weighting from DiffusionNFT in their distilled AR setting?
The model loses its ability to generate coherent motion
The training becomes unstable with gradient explosion due to volatile x₀ norm under large discretization gaps
The inference speed decreases by approximately 50%

Paper 2

MosaicMem: Hybrid Spatial Memory for Controllable Video World Models

Published: 2026-03-17

Link: http://arxiv.org/pdf/2603.17117

1. 📘 Topic and Domain: The paper presents MosaicMem, a hybrid spatial memory mechanism for controllable video world models that enables long-horizon video generation with consistent camera motion and scene revisitation.
2. 💡 Previous Research and New Ideas: The paper builds on existing explicit memory methods (that use 3D structures like point clouds) and implicit memory methods (that store posed frames), proposing a novel hybrid approach that lifts patches into 3D for localization while using attention-based conditioning for dynamic scene generation.
3. ❓ Problem: The paper targets the spatial-memory bottleneck in video diffusion models: explicit 3D structures struggle with moving objects, while implicit memory produces inaccurate camera motion even when conditioned on correct poses.
4. 🛠️ Methods: The authors build a patch-based memory that lifts video patches into 3D space with off-the-shelf depth estimators, retrieves spatially aligned patches via warped RoPE positional encoding and warped latent feature transforms, and conditions generation through PRoPE camera control inside a fine-tuned video diffusion model.
5. 📊 Results and Evaluation: MosaicMem achieves superior performance compared to both explicit and implicit baselines, with rotation error of 0.51°, translation error of 0.06, FID of 65.67, and enables minute-level navigation, memory-based scene editing, and autoregressive generation at 16 FPS.
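The 3D lifting step in point 4 amounts to standard pinhole back-projection of a patch location given estimated depth and camera pose. A hedged sketch with made-up intrinsics (the paper's actual depth estimator and camera conventions are not shown here):

```python
import numpy as np

def lift_patch_to_3d(u, v, depth, K, cam_to_world):
    """Back-project pixel (u, v) with depth d into world coordinates:
    X_cam = d * K^{-1} [u, v, 1]^T, then X_world = R @ X_cam + t."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    x_cam = depth * ray
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return R @ x_cam + t

# Toy example: identity pose, focal length 100, principal point (64, 64).
K = np.array([[100.0,   0.0, 64.0],
              [  0.0, 100.0, 64.0],
              [  0.0,   0.0,  1.0]])
pose = np.eye(4)  # camera frame coincides with world frame
p = lift_patch_to_3d(64.0, 64.0, depth=2.0, K=K, cam_to_world=pose)
print(p)  # ray through the principal point at depth 2 -> [0. 0. 2.]
```

Each stored memory patch would carry such a 3D anchor, which is what later lets retrieval align patches spatially before the attention-based conditioning.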

Workflow overview (figure): a real image I, text prompts L, and camera poses C are split into patches (the basic memory unit); patches are lifted into 3D via depth estimation and camera information and stored in the mosaic memory; retrieval aligns stored patches through warped RoPE positional encoding and warped latent feature transforms; PRoPE camera control and patch composition condition a DiT video model (flow matching) via native attention, producing the generated video X = {X₁, ..., Xᴛ} with long-horizon rollout; enabled capabilities include minute-level navigation, memory-based editing, autoregressive generation, and dynamic scene modeling.
Q1
1. What is the fundamental unit of memory storage in MosaicMem that distinguishes it from both explicit and implicit memory approaches?
3D Gaussian splats that are dynamically updated
Video patches that are lifted into 3D space
Compressed frame tokens stored in latent space
Q2
2. According to the paper's results, which capability does MosaicMem enable that demonstrates its practical advantage over existing methods?
Real-time video generation at 16 FPS with autoregressive rollout
Perfect 3D reconstruction without any depth estimation errors
Automatic text prompt generation from visual scenes
Q3
3. What innovative scene manipulation capability does MosaicMem enable through its patch-based memory design?
Automatic conversion of 2D videos into fully interactive 3D environments
Creating surreal Inception-like scenes by flipping and registering memory in the sky
Real-time physics simulation for all objects in the scene

Paper 3

HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

Published: 2026-03-17

Link: http://arxiv.org/pdf/2603.17024

1. 📘 Topic and Domain: The paper focuses on multi-hop vision-language reasoning data synthesis for training vision-language models (VLMs) in the domain of multimodal AI and computer vision.
2. 💡 Previous Research and New Ideas: The paper builds on reinforcement learning with verifiable rewards (RLVR) for VLMs but proposes HopChain, a novel framework that synthesizes multi-hop reasoning data where each hop requires visual re-grounding and earlier hops establish dependencies for later ones.
3. ❓ Problem: The paper addresses VLMs' struggle with fine-grained, multi-step vision-language reasoning due to compounding errors (perception, reasoning, knowledge, hallucination) in long chain-of-thought reasoning, which existing training data fails to adequately expose.
4. 🛠️ Methods: The authors use a four-stage pipeline: category identification via VLM, instance segmentation via SAM3, multi-hop query generation creating logically dependent chains, and human-in-the-loop verification, then train models using RLVR with Soft Adaptive Policy Optimization (SAPO).
5. 📊 Results and Evaluation: On Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, both models improved on 20 of 24 benchmarks, with gains exceeding 50 points on ultra-long-CoT reasoning; averaged over five representative benchmarks, performance dropped from 70.4 (full multi-hop) to 66.7 (half-multi-hop) and 64.3 (single-hop).

Pipeline overview (figure): (1) problem analysis of long-CoT failures, where perception, reasoning, knowledge, and hallucination errors compound; (2) HopChain data synthesis in four stages: category identification (Qwen3-VL), instance segmentation (SAM3), multi-hop query generation (Qwen3-VL-235B), and ground-truth annotation with difficulty calibration; (3) multi-hop query structure combining perception-level hops (single ↔ multi-object) and instance-chain hops (A → B → C → ...), with requirements of logically dependent hops, a numerical final answer, and visual re-grounding; (4) RLVR training with SAPO on multi-hop plus original RLVR data for Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, yielding improvements on 20/24 benchmarks and +50 points on long-CoT reasoning.
Q1
1. What are the two types of hops that HopChain uses to create multi-hop vision-language reasoning queries?
Perception-level hops (switching between single-object and multi-object perception) and instance-chain hops (following dependency chains like A→B→C)
Visual hops (processing image features) and language hops (processing text features)
Forward hops (moving to next objects) and backward hops (returning to previous objects)
Q2
2. When comparing full multi-hop training with simplified variants on Qwen3.5-35B-A3B, what were the average scores across five representative benchmarks?
Full multi-hop: 64.3, Half-multi-hop: 66.7, Single-hop: 70.4
Full multi-hop: 70.4, Half-multi-hop: 66.7, Single-hop: 64.3
Full multi-hop: 66.7, Half-multi-hop: 70.4, Single-hop: 64.3
Q3
3. What was the most dominant error type identified in the paper's analysis of long chain-of-thought reasoning failures, and why is this significant?
Hallucination errors were dominant, showing that VLMs primarily struggle with generating false information
Knowledge errors were dominant, indicating that VLMs lack sufficient world knowledge for reasoning
Perception errors were the largest group, highlighting that VLMs often fail at the fundamental level of correctly interpreting visual information during multi-step reasoning