2026-03-23 Papers

Paper 1

Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

Published: 2026-03-17

Link: http://arxiv.org/pdf/2603.17051

1. 📘 Topic and Domain: The paper focuses on aligning distilled autoregressive (AR) video models with human preferences using reinforcement learning in the domain of video generation.
2. 💡 Previous Research and New Ideas: The paper builds on distilled AR video models (Self-Forcing, Causal-Forcing, LongLive) and forward-process RL (DiffusionNFT), proposing Astrolabe - a memory-efficient online RL framework that avoids re-distillation and reverse-process optimization overhead.
3. ❓ Problem: The paper addresses the misalignment of distilled streaming video models with human preferences: despite their efficient generation, these models frequently exhibit artifacts and unnatural motion.
4. 🛠️ Methods: The authors use forward-process RL with negative-aware fine-tuning, streaming training with rolling KV-cache for long videos, and multi-reward optimization (visual quality, motion quality, text alignment) with uncertainty-aware selective regularization.
5. 📊 Results and Evaluation: Astrolabe consistently improves generation quality across multiple benchmarks (VBench, VBench-Long), raising HPSv3 scores by ~1.3-1.6 points and improving motion quality, while maintaining the same inference speed as the baseline models.
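The streaming training in point 4 hinges on a rolling KV-cache that pins a few "sink" frames forever and keeps only a sliding local window of recent frames, so memory stays bounded for long videos. A minimal sketch of that eviction policy (class and variable names are hypothetical, not from the paper's code):

```python
from collections import deque

class RollingKVCache:
    """Keeps the first `sink` frames permanently plus the most recent
    `window` frames; older non-sink frames are evicted. Hypothetical
    sketch of the cache policy described in the paper."""

    def __init__(self, sink: int = 1, window: int = 4):
        self.sink = sink
        self.sink_frames = []                 # never evicted
        self.recent = deque(maxlen=window)    # rolling local window

    def append(self, frame_kv):
        if len(self.sink_frames) < self.sink:
            self.sink_frames.append(frame_kv)
        else:
            self.recent.append(frame_kv)      # deque drops the oldest

    def context(self):
        # Attention context = sink frames + rolling window, so memory
        # stays O(sink + window) regardless of video length.
        return self.sink_frames + list(self.recent)


cache = RollingKVCache(sink=1, window=3)
for t in range(6):
    cache.append(f"kv_{t}")
print(cache.context())  # ['kv_0', 'kv_3', 'kv_4', 'kv_5']
```

In the paper's setting, gradients would flow only through the local window while the historical context stays detached; here that split is represented only by the two storage buckets.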

Workflow overview (figure): text prompts from the VidProM dataset feed a memory-efficient streaming rollout (rolling KV cache with frame sink + local window) over the distilled AR video model; group-wise sampling produces G candidate clips from a shared context; forward-process RL trains implicit policies (v⁺, v⁻) with an advantage-based loss; streaming long tuning detaches historical context and applies gradients only to the local window; a multi-reward system aggregates visual quality, motion quality, and text alignment; reward hacking is mitigated by an uncertainty-aware selective KL penalty and dynamic EMA reference updates, yielding the aligned output.
Q1
1. What does the paper's title 'Astrolabe', named after the astronomical navigation instrument, metaphorically represent in the context of video generation?
A tool for steering distilled AR models toward human preferences without re-distillation
A celestial coordinate system for mapping video frame sequences
A telescope for observing long-range temporal dependencies
Q2
2. How does Astrolabe handle the memory challenge when training on long videos?
By compressing all frames into a single latent representation
By maintaining a rolling KV-cache with fixed frame sinks and applying gradients only to local windows
By training separate models for each video segment and merging them later
Q3
3. What happens when the paper's authors remove the adaptive loss weighting from DiffusionNFT in their distilled AR setting?
The model loses its ability to generate coherent motion
The training becomes unstable with gradient explosion due to volatile x₀ norm under large discretization gaps
The inference speed decreases by approximately 50%

Paper 2

MosaicMem: Hybrid Spatial Memory for Controllable Video World Models

Published: 2026-03-17

Link: http://arxiv.org/pdf/2603.17117

1. 📘 Topic and Domain: The paper presents MosaicMem, a hybrid spatial memory mechanism for controllable video world models that enables long-horizon video generation with consistent camera motion and scene revisitation.
2. 💡 Previous Research and New Ideas: The paper builds on existing explicit memory methods (that use 3D structures like point clouds) and implicit memory methods (that store posed frames), proposing a novel hybrid approach that lifts patches into 3D for localization while using attention-based conditioning for dynamic scene generation.
3. ❓ Problem: The paper targets the spatial-memory bottleneck in video diffusion models: explicit 3D structures struggle with moving objects, while implicit memory produces inaccurate camera motion even when conditioned on correct poses.
4. 🛠️ Methods: The authors build a patch-based memory that lifts video patches into 3D space with off-the-shelf depth estimators, retrieves spatially aligned patches via warped RoPE positional encoding and warped latent feature transforms, and conditions generation through PRoPE camera control inside a fine-tuned video diffusion model.
5. 📊 Results and Evaluation: MosaicMem achieves superior performance compared to both explicit and implicit baselines, with rotation error of 0.51°, translation error of 0.06, FID of 65.67, and enables minute-level navigation, memory-based scene editing, and autoregressive generation at 16 FPS.
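The 3D lifting step in point 4 amounts to standard pinhole back-projection of a patch location given estimated depth and camera pose. A hedged sketch with made-up intrinsics (the paper's actual depth estimator and camera conventions are not shown here):

```python
import numpy as np

def lift_patch_to_3d(u, v, depth, K, cam_to_world):
    """Back-project pixel (u, v) with depth d into world coordinates:
    X_cam = d * K^{-1} [u, v, 1]^T, then X_world = R @ X_cam + t."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    x_cam = depth * ray
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return R @ x_cam + t

# Toy example: identity pose, focal length 100, principal point (64, 64).
K = np.array([[100.0,   0.0, 64.0],
              [  0.0, 100.0, 64.0],
              [  0.0,   0.0,  1.0]])
pose = np.eye(4)  # camera frame coincides with world frame
p = lift_patch_to_3d(64.0, 64.0, depth=2.0, K=K, cam_to_world=pose)
print(p)  # ray through the principal point at depth 2 -> [0. 0. 2.]
```

Each stored memory patch would carry such a 3D anchor, which is what later lets retrieval align patches spatially before the attention-based conditioning.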

Workflow overview (figure): a real image I, text prompts L, and camera poses C are split into patches (the basic memory unit); patches are lifted into 3D via depth estimation and camera information and stored in the mosaic memory; retrieval aligns stored patches through warped RoPE positional encoding and warped latent feature transforms; PRoPE camera control and patch composition condition a DiT video model (flow matching) via native attention, producing the generated video X = {X₁, ..., Xᴛ} with long-horizon rollout; enabled capabilities include minute-level navigation, memory-based editing, autoregressive generation, and dynamic scene modeling.
Q1
1. What is the fundamental unit of memory storage in MosaicMem that distinguishes it from both explicit and implicit memory approaches?
3D Gaussian splats that are dynamically updated
Video patches that are lifted into 3D space
Compressed frame tokens stored in latent space
Q2
2. According to the paper's results, which capability does MosaicMem enable that demonstrates its practical advantage over existing methods?
Real-time video generation at 16 FPS with autoregressive rollout
Perfect 3D reconstruction without any depth estimation errors
Automatic text prompt generation from visual scenes
Q3
3. What innovative scene manipulation capability does MosaicMem enable through its patch-based memory design?
Automatic conversion of 2D videos into fully interactive 3D environments
Creating surreal Inception-like scenes by flipping and registering memory in the sky
Real-time physics simulation for all objects in the scene

Paper 3

HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

Published: 2026-03-17

Link: http://arxiv.org/pdf/2603.17024

1. 📘 Topic and Domain: The paper focuses on multi-hop vision-language reasoning data synthesis for training vision-language models (VLMs) in the domain of multimodal AI and computer vision.
2. 💡 Previous Research and New Ideas: The paper builds on reinforcement learning with verifiable rewards (RLVR) for VLMs but proposes HopChain, a novel framework that synthesizes multi-hop reasoning data where each hop requires visual re-grounding and earlier hops establish dependencies for later ones.
3. ❓ Problem: The paper addresses VLMs' struggle with fine-grained, multi-step vision-language reasoning due to compounding errors (perception, reasoning, knowledge, hallucination) in long chain-of-thought reasoning, which existing training data fails to adequately expose.
4. 🛠️ Methods: The authors use a four-stage pipeline: category identification via VLM, instance segmentation via SAM3, multi-hop query generation creating logically dependent chains, and human-in-the-loop verification, then train models using RLVR with Soft Adaptive Policy Optimization (SAPO).
5. 📊 Results and Evaluation: On Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, both models improved on 20 of 24 benchmarks, with gains exceeding 50 points on ultra-long-CoT reasoning; averaged over five representative benchmarks, performance dropped from 70.4 (full multi-hop) to 66.7 (half-multi-hop) and 64.3 (single-hop).

Pipeline overview (figure): (1) problem analysis of long-CoT failures, where perception, reasoning, knowledge, and hallucination errors compound; (2) HopChain data synthesis in four stages: category identification (Qwen3-VL), instance segmentation (SAM3), multi-hop query generation (Qwen3-VL-235B), and ground-truth annotation with difficulty calibration; (3) multi-hop query structure combining perception-level hops (single ↔ multi-object) and instance-chain hops (A → B → C → ...), with requirements of logically dependent hops, a numerical final answer, and visual re-grounding; (4) RLVR training with SAPO on multi-hop plus original RLVR data for Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, yielding improvements on 20/24 benchmarks and +50 points on long-CoT reasoning.
Q1
1. What are the two types of hops that HopChain uses to create multi-hop vision-language reasoning queries?
Perception-level hops (switching between single-object and multi-object perception) and instance-chain hops (following dependency chains like A→B→C)
Visual hops (processing image features) and language hops (processing text features)
Forward hops (moving to next objects) and backward hops (returning to previous objects)
Q2
2. When comparing full multi-hop training with simplified variants on Qwen3.5-35B-A3B, what were the average scores across five representative benchmarks?
Full multi-hop: 64.3, Half-multi-hop: 66.7, Single-hop: 70.4
Full multi-hop: 70.4, Half-multi-hop: 66.7, Single-hop: 64.3
Full multi-hop: 66.7, Half-multi-hop: 70.4, Single-hop: 64.3
Q3
3. What was the most dominant error type identified in the paper's analysis of long chain-of-thought reasoning failures, and why is this significant?
Hallucination errors were dominant, showing that VLMs primarily struggle with generating false information
Knowledge errors were dominant, indicating that VLMs lack sufficient world knowledge for reasoning
Perception errors were the largest group, highlighting that VLMs often fail at the fundamental level of correctly interpreting visual information during multi-step reasoning