2026-02-24 Papers


Paper 1

A Very Big Video Reasoning Suite

Published: 2026-02-23

Link: http://arxiv.org/pdf/2602.20159

1. 📘 Topic and Domain: The paper presents a large-scale video reasoning dataset and benchmark for evaluating video generation models' reasoning capabilities across cognitive tasks.
2. 💡 Previous Research and New Ideas: Building on existing video reasoning benchmarks that are limited in scale (12.8K samples combined), the paper proposes VBVR with 2M+ samples and introduces a principled cognitive architecture framework organizing tasks into five faculties: perception, transformation, spatiality, abstraction, and knowledge.
3. ❓ Problem: Current video generation models focus primarily on visual quality while their reasoning capabilities remain underexplored, hindered by the lack of large-scale training data and verifiable evaluation frameworks for video reasoning.
4. 🛠️ Methods: The authors created 200 parameterized task generators based on cognitive theory, generated 1M training and 7.5K test samples via distributed cloud infrastructure, and developed rule-based scorers for reproducible evaluation instead of model-based judging.
5. 📊 Results and Evaluation: Fine-tuning Wan2.2 on VBVR improved its performance from 0.371 to 0.685 (84.6% gain), surpassing all evaluated models including Sora 2 (0.546) and Veo 3.1 (0.480), while scaling studies showed emergent generalization to out-of-domain tasks but persistent gaps to human performance (0.974).
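
The methods point above pairs parameterized task generators with rule-based scorers for reproducible evaluation. A minimal sketch of that generator-plus-scorer pattern, assuming a hypothetical counting task (the function names and task are illustrative, not from the paper):

```python
import random

def make_counting_task(n_objects: int, seed: int) -> dict:
    """Hypothetical parameterized generator: place n_objects on an 8x8 grid
    and record the ground truth needed for rule-based scoring."""
    rng = random.Random(seed)
    positions = set()
    while len(positions) < n_objects:
        positions.add((rng.randrange(8), rng.randrange(8)))
    return {
        "prompt": f"Show {n_objects} objects appearing one by one.",
        "ground_truth": {"count": n_objects, "positions": sorted(positions)},
    }

def rule_based_score(task: dict, predicted_count: int) -> float:
    """Deterministic scorer: exact match against the parameterized ground
    truth, so repeated evaluation runs give identical results (no
    model-based judging in the loop)."""
    return 1.0 if predicted_count == task["ground_truth"]["count"] else 0.0
```

Because every task carries its parameters and ground truth, scoring is a pure function of the sample, which is what makes large-scale generation (2M+ samples) and reproducible benchmarking compatible.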

[Figure] VBVR pipeline overview: a cognitive-architecture taxonomy (perception, transformation, spatiality, abstraction, knowledge) grounds 200+ task designs vetted against 6 quality criteria with peer review; a cloud-based generation pipeline (AWS Lambda workers, S3 storage) produces the VBVR dataset of 2,015,000 images and 1,007,500 videos (1,000,000 training / 7,500 test samples); VBVR-Bench evaluates 100 test tasks with rule-based, human-aligned, reproducible scoring. Models evaluated: CogVideoX, Wan2.2, Sora 2, HunyuanVideo, Veo 3.1, Kling 2.6. Key findings: VBVR-Wan2.2 reaches a 0.685 overall score (84.6% improvement), shows early signs of emergent generalization with scale, and a significant gap remains to human performance (0.974). Scale comparison: VBVR 2.015M samples vs. 12.8K combined in prior benchmarks.
Q1. What philosophical foundation does VBVR use to organize its cognitive architecture, and which philosopher's concept of 'dunameis' (cognitive faculties) inspired the framework?
A) Kant's categories of understanding, with faculties organized around his concept of 'Vernunft'
B) Aristotle's cognitive hierarchy, ascending from 'aisthesis' (perception) through 'phantasia' to 'nous' (understanding)
C) Plato's theory of forms, with tasks designed around the concept of ideal representations

Q2. In the capability correlation analysis, which two cognitive faculties showed a strong positive correlation (ρ = 0.461), and what neuroscience evidence supports this connection?
A) Knowledge and Spatiality, supported by hippocampal place cells and grid cells that enable both spatial navigation and concept learning
B) Perception and Transformation, supported by visual cortex regions that handle both recognition and mental rotation
C) Abstraction and Knowledge, supported by prefrontal cortex regions that manage both rule extraction and memory formation

Q3. What key insight emerged from VBVR-Wan2.2's qualitative analysis regarding the relationship between controllability and reasoning in video generation?
A) Models need to generate photorealistic videos first before attempting reasoning tasks
B) Controllability is the bedrock of verifiable reasoning: models must maintain stable scenes and precise object manipulation rather than freely rewriting content
C) Reasoning emerges naturally from larger model scale without requiring specific controllability constraints

Paper 2

ManCAR: Manifold-Constrained Latent Reasoning with Adaptive Test-Time Computation for Sequential Recommendation

Published: 2026-02-23

Link: http://arxiv.org/pdf/2602.20093

1. 📘 Topic and Domain: The paper focuses on sequential recommendation systems, specifically addressing latent multi-step reasoning with adaptive test-time computation.
2. 💡 Previous Research and New Ideas: The paper builds on existing latent reasoning methods for sequential recommendation (like ReaRec, PLR, LARES) but introduces manifold-constrained reasoning that restricts latent states to evolve within graph-induced collaborative neighborhoods rather than unconstrained latent space.
3. ❓ Problem: The paper aims to solve the "latent drift" problem where unconstrained latent reasoning trajectories in existing methods deviate into implausible regions, degrading model robustness and generalization.
4. 🛠️ Methods: ManCAR uses a variational framework with graph-conditioned teacher priors to constrain reasoning trajectories, employs progressive teacher scheduling during training, and implements adaptive test-time termination based on KL-divergence convergence between consecutive reasoning steps.
5. 📊 Results and Evaluation: ManCAR achieves up to 46.88% relative improvement in NDCG@10 over state-of-the-art baselines across seven Amazon datasets, with adaptive reasoning achieving near-ceiling performance while reducing unnecessary computation steps.
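
The methods point above describes adaptive test-time termination based on KL-divergence convergence between consecutive reasoning steps. A minimal sketch of that stopping rule, where `step_fn` is a stand-in for the model's reasoning module and the distributions, threshold, and step cap are illustrative assumptions:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for discrete distributions over the same item set."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def adaptive_reasoning(step_fn, p0, max_steps=8, threshold=1e-3):
    """Run latent reasoning steps until the predictive distribution stops
    moving, i.e. D_KL(p_{t-1} || p_t) < threshold, mirroring the test-time
    convergence check; otherwise stop at max_steps."""
    p_prev = p0
    for t in range(1, max_steps + 1):
        p_next = step_fn(p_prev)
        if kl_divergence(p_prev, p_next) < threshold:
            return p_next, t  # converged early: skip remaining computation
        p_prev = p_next
    return p_prev, max_steps
```

The early exit is what lets adaptive reasoning reach near-ceiling accuracy while spending fewer steps on sequences whose predictive distribution stabilizes quickly.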

[Figure] ManCAR workflow: the user interaction history h = (h₁, h₂, ..., hₙ₋₁) is encoded by a sequential encoder f_θ, giving h_{n-1} = f_θ(h); an item interaction graph G = (I, E) built with the Swing algorithm defines the candidate set C(h_R) = {y*} ∪ N(h_R; G; k). Multi-step latent reasoning (t' = 1, ..., T') runs a reasoning module r_{t'} = ρ_θ(h; r_{1:t'-1}) with r_1 = h_{n-1}, producing predictive distributions p_θ^{(t')}(i|h) from logits z_{t'} = r_{t'}^T E. Training combines a target loss L_{main}^{(t')} with a KL regularizer L_{reg}^{(t')} = D_{KL}(q || p_θ^{(t')}) against the graph-conditioned teacher prior q(c|h_R, G) (RDMA strategy), plus norm rescaling h ← α·h/||h||·avg(E). At test time, reasoning continues until the convergence check D_{KL}(p_θ^{(t'-1)} || p_θ^{(t')}) < ε passes, then the top-k items under p_θ^{(t')}(i|h) are recommended.
Q1. What geometric concept does ManCAR use to prevent latent drift during multi-step reasoning?
A) A collaborative manifold defined by graph-induced neighborhoods on the item probability simplex
B) A hyperbolic embedding space that captures hierarchical item relationships
C) A Euclidean distance metric between consecutive reasoning states

Q2. How does ManCAR determine when to stop reasoning at test time?
A) By using a fixed number of reasoning steps determined during training
B) When the KL divergence between consecutive reasoning states falls below a threshold
C) By measuring the cosine similarity between the final state and item embeddings

Q3. What is the key difference between ManCAR's teacher scheduling and ReaRec's PRL mechanism?
A) ManCAR uses a decreasing temperature schedule while ReaRec uses an increasing one
B) ManCAR employs parallel reasoning streams while ReaRec uses sequential refinement
C) ManCAR uses an increasing temperature schedule while ReaRec uses a decreasing one

Paper 3

Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Published: 2026-02-23

Link: http://arxiv.org/pdf/2602.20161

1. 📘 Topic and Domain: The paper presents Mobile-O, a unified multimodal model for both visual understanding and image generation optimized for deployment on mobile devices.
2. 💡 Previous Research and New Ideas: Building on existing unified models like BLIP-3o and Show-O, the paper introduces a Mobile Conditioning Projector (MCP) for efficient cross-modal fusion and a novel quadruplet training format (generation prompt, image, question, answer) for simultaneous improvement of both tasks.
3. ❓ Problem: The paper addresses the challenge that existing unified multimodal models are too computationally expensive and memory-intensive for deployment on edge devices like smartphones.
4. 🛠️ Methods: The authors use a lightweight architecture combining FastVLM for understanding and SANA for generation, connected via the MCP module using depthwise-separable convolutions, and employ a three-stage training scheme with unified post-training on 105k quadruplet samples.
5. 📊 Results and Evaluation: Mobile-O achieves 74% on GenEval (5-11% better than Show-O and JanusFlow) while running 6-11× faster, and attains 62.1% average accuracy across seven visual understanding benchmarks, all while maintaining under 2GB memory footprint and 3-second generation time on iPhone.
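
The MCP described above relies on depthwise-separable convolutions to keep the cross-modal bridge small. A minimal numpy sketch of that building block, showing why it is cheap; the function, shapes, and single-layer setup are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def depthwise_separable_conv1d(x, dw_kernel, pw_weight):
    """Depthwise-separable 1D convolution over a token sequence.

    x:         (T, C)      input tokens (e.g., VLM hidden states)
    dw_kernel: (K, C)      one K-tap filter per channel (depthwise step)
    pw_weight: (C, C_out)  1x1 projection mixing channels (pointwise step)

    Parameter count is K*C + C*C_out instead of K*C*C_out for a full
    convolution, which is how an MCP-style bridge stays at millions
    rather than tens of millions of parameters.
    """
    T, C = x.shape
    K = dw_kernel.shape[0]
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # "same" padding along time
    # Depthwise step: each channel is filtered independently.
    dw = np.stack([sum(dw_kernel[k] * xp[t + k] for k in range(K))
                   for t in range(T)])
    # Pointwise step: 1x1 conv mixes channels into the target width.
    return dw @ pw_weight
```

With kernel size K=3, C=512 and C_out=512, this costs 1,536 + 262,144 weights versus 786,432 for a full conv, a roughly 3× saving that grows with K.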

[Figure] Mobile-O workflow: Stage 1, cross-modal alignment — pre-train DiT + MCP on JourneyDB (4M samples) and BLIP3o-Short (5M samples) with frozen VE and LLM (~3 days); Stage 2, supervised fine-tuning on BLIP3o-60K and ShareGPT-4o (45K), targeting complex gestures while maintaining the frozen configuration (~15 hours); Stage 3, unified post-training on the quadruplet format (generation prompt, image, question, answer) with a joint I2T + T2I loss and LoRA on the LLM + VE (~5 hours). The Mobile Conditioning Projector (layer fusion, depthwise conv, channel attention; only 2.4M parameters) bridges the VLM to the diffusion model. Understanding module: FastVLM-0.5B image encoder with a Qwen2-0.5B LLM; generation module: SANA-600M DiT with VAE encoder/decoder at 512×512. Highlights: 74% on GenEval (5% better than Show-O), 62.1% average on 7 understanding benchmarks, ~3 seconds per image on iPhone 17 Pro, <2GB memory footprint.
Q1. What is the key architectural innovation in Mobile-O that enables efficient cross-modal fusion between understanding and generation tasks?
A) A Mobile Conditioning Projector (MCP) using depthwise-separable convolutions and layerwise alignment
B) A 2.6B-parameter UNet combined with learnable query tokens
C) A transformer-based adapter with full 2D convolutions

Q2. How does Mobile-O's training data requirement compare to existing unified models like BLIP-3o?
A) Mobile-O requires 50-100M samples, similar to other unified models
B) Mobile-O achieves strong performance with only a few million pre-training samples (about 5× less than BLIP-3o)
C) Mobile-O needs at least 1B samples for effective cross-modal alignment

Q3. What unique training format does Mobile-O introduce in its unified multimodal post-training stage?
A) Sequential training where understanding is frozen while training generation
B) Joint training on disjoint understanding and generation datasets
C) Quadruplet format (generation prompt, image, question, answer) where each sample supports both tasks