2026-02-24 Papers


Paper 1

A Very Big Video Reasoning Suite

Published: 2026-02-23

Link: http://arxiv.org/pdf/2602.20159

1. 📘 Topic and Domain: The paper presents a large-scale video reasoning dataset and benchmark for evaluating video generation models' reasoning capabilities across cognitive tasks.
2. 💡 Previous Research and New Ideas: Building on existing video reasoning benchmarks that are limited in scale (12.8K samples combined), the paper proposes VBVR with 2M+ samples and introduces a principled cognitive architecture framework organizing tasks into five faculties: perception, transformation, spatiality, abstraction, and knowledge.
3. ❓ Problem: Current video generation models focus primarily on visual quality while their reasoning capabilities remain underexplored, hindered by the lack of large-scale training data and verifiable evaluation frameworks for video reasoning.
4. 🛠️ Methods: The authors created 200 parameterized task generators based on cognitive theory, generated 1M training and 7.5K test samples via distributed cloud infrastructure, and developed rule-based scorers for reproducible evaluation instead of model-based judging.
5. 📊 Results and Evaluation: Fine-tuning Wan2.2 on VBVR improved its performance from 0.371 to 0.685 (84.6% gain), surpassing all evaluated models including Sora 2 (0.546) and Veo 3.1 (0.480), while scaling studies showed emergent generalization to out-of-domain tasks but persistent gaps to human performance (0.974).
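
The methods point above pairs parameterized task generators with rule-based scorers for reproducible evaluation. A minimal sketch of that generator-plus-scorer pattern, assuming a hypothetical counting task (the function names and task are illustrative, not from the paper):

```python
import random

def make_counting_task(n_objects: int, seed: int) -> dict:
    """Hypothetical parameterized generator: place n_objects on an 8x8 grid
    and record the ground truth needed for rule-based scoring."""
    rng = random.Random(seed)
    positions = set()
    while len(positions) < n_objects:
        positions.add((rng.randrange(8), rng.randrange(8)))
    return {
        "prompt": f"Show {n_objects} objects appearing one by one.",
        "ground_truth": {"count": n_objects, "positions": sorted(positions)},
    }

def rule_based_score(task: dict, predicted_count: int) -> float:
    """Deterministic scorer: exact match against the parameterized ground
    truth, so repeated evaluation runs give identical results (no
    model-based judging in the loop)."""
    return 1.0 if predicted_count == task["ground_truth"]["count"] else 0.0
```

Because every task carries its parameters and ground truth, scoring is a pure function of the sample, which is what makes large-scale generation (2M+ samples) and reproducible benchmarking compatible.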

[Figure] VBVR pipeline overview: a cognitive-architecture taxonomy (perception, transformation, spatiality, abstraction, knowledge) grounds 200+ task designs vetted against 6 quality criteria with peer review; a cloud-based generation pipeline (AWS Lambda workers, S3 storage) produces the VBVR dataset of 2,015,000 images and 1,007,500 videos (1,000,000 training / 7,500 test samples); VBVR-Bench evaluates 100 test tasks with rule-based, human-aligned, reproducible scoring. Models evaluated: CogVideoX, Wan2.2, Sora 2, HunyuanVideo, Veo 3.1, Kling 2.6. Key findings: VBVR-Wan2.2 reaches a 0.685 overall score (84.6% improvement), shows early signs of emergent generalization with scale, and a significant gap remains to human performance (0.974). Scale comparison: VBVR 2.015M samples vs. 12.8K combined in prior benchmarks.
Q1. What philosophical foundation does VBVR use to organize its cognitive architecture, and which philosopher's concept of 'dunameis' (cognitive faculties) inspired the framework?
A) Kant's categories of understanding, with faculties organized around his concept of 'Vernunft'
B) Aristotle's cognitive hierarchy, ascending from 'aisthesis' (perception) through 'phantasia' to 'nous' (understanding)
C) Plato's theory of forms, with tasks designed around the concept of ideal representations

Q2. In the capability correlation analysis, which two cognitive faculties showed a strong positive correlation (ρ = 0.461), and what neuroscience evidence supports this connection?
A) Knowledge and Spatiality, supported by hippocampal place cells and grid cells that enable both spatial navigation and concept learning
B) Perception and Transformation, supported by visual cortex regions that handle both recognition and mental rotation
C) Abstraction and Knowledge, supported by prefrontal cortex regions that manage both rule extraction and memory formation

Q3. What key insight emerged from VBVR-Wan2.2's qualitative analysis regarding the relationship between controllability and reasoning in video generation?
A) Models need to generate photorealistic videos first before attempting reasoning tasks
B) Controllability is the bedrock of verifiable reasoning: models must maintain stable scenes and precise object manipulation rather than freely rewriting content
C) Reasoning emerges naturally from larger model scale without requiring specific controllability constraints

Paper 2

ManCAR: Manifold-Constrained Latent Reasoning with Adaptive Test-Time Computation for Sequential Recommendation

Published: 2026-02-23

Link: http://arxiv.org/pdf/2602.20093

1. 📘 Topic and Domain: The paper focuses on sequential recommendation systems, specifically addressing latent multi-step reasoning with adaptive test-time computation.
2. 💡 Previous Research and New Ideas: The paper builds on existing latent reasoning methods for sequential recommendation (like ReaRec, PLR, LARES) but introduces manifold-constrained reasoning that restricts latent states to evolve within graph-induced collaborative neighborhoods rather than unconstrained latent space.
3. ❓ Problem: The paper aims to solve the "latent drift" problem where unconstrained latent reasoning trajectories in existing methods deviate into implausible regions, degrading model robustness and generalization.
4. 🛠️ Methods: ManCAR uses a variational framework with graph-conditioned teacher priors to constrain reasoning trajectories, employs progressive teacher scheduling during training, and implements adaptive test-time termination based on KL-divergence convergence between consecutive reasoning steps.
5. 📊 Results and Evaluation: ManCAR achieves up to 46.88% relative improvement in NDCG@10 over state-of-the-art baselines across seven Amazon datasets, with adaptive reasoning achieving near-ceiling performance while reducing unnecessary computation steps.
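
The methods point above describes adaptive test-time termination based on KL-divergence convergence between consecutive reasoning steps. A minimal sketch of that stopping rule, where `step_fn` is a stand-in for the model's reasoning module and the distributions, threshold, and step cap are illustrative assumptions:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for discrete distributions over the same item set."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def adaptive_reasoning(step_fn, p0, max_steps=8, threshold=1e-3):
    """Run latent reasoning steps until the predictive distribution stops
    moving, i.e. D_KL(p_{t-1} || p_t) < threshold, mirroring the test-time
    convergence check; otherwise stop at max_steps."""
    p_prev = p0
    for t in range(1, max_steps + 1):
        p_next = step_fn(p_prev)
        if kl_divergence(p_prev, p_next) < threshold:
            return p_next, t  # converged early: skip remaining computation
        p_prev = p_next
    return p_prev, max_steps
```

The early exit is what lets adaptive reasoning reach near-ceiling accuracy while spending fewer steps on sequences whose predictive distribution stabilizes quickly.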

[Figure] ManCAR workflow: the user interaction history h = (h₁, h₂, ..., hₙ₋₁) is encoded by a sequential encoder f_θ, giving h_{n-1} = f_θ(h); an item interaction graph G = (I, E) built with the Swing algorithm defines the candidate set C(h_R) = {y*} ∪ N(h_R; G; k). Multi-step latent reasoning (t' = 1, ..., T') runs a reasoning module r_{t'} = ρ_θ(h; r_{1:t'-1}) with r_1 = h_{n-1}, producing predictive distributions p_θ^{(t')}(i|h) from logits z_{t'} = r_{t'}^T E. Training combines a target loss L_{main}^{(t')} with a KL regularizer L_{reg}^{(t')} = D_{KL}(q || p_θ^{(t')}) against the graph-conditioned teacher prior q(c|h_R, G) (RDMA strategy), plus norm rescaling h ← α·h/||h||·avg(E). At test time, reasoning continues until the convergence check D_{KL}(p_θ^{(t'-1)} || p_θ^{(t')}) < ε passes, then the top-k items under p_θ^{(t')}(i|h) are recommended.
Q1. What geometric concept does ManCAR use to prevent latent drift during multi-step reasoning?
A) A collaborative manifold defined by graph-induced neighborhoods on the item probability simplex
B) A hyperbolic embedding space that captures hierarchical item relationships
C) A Euclidean distance metric between consecutive reasoning states

Q2. How does ManCAR determine when to stop reasoning at test time?
A) By using a fixed number of reasoning steps determined during training
B) When the KL divergence between consecutive reasoning states falls below a threshold
C) By measuring the cosine similarity between the final state and item embeddings

Q3. What is the key difference between ManCAR's teacher scheduling and ReaRec's PRL mechanism?
A) ManCAR uses a decreasing temperature schedule while ReaRec uses an increasing one
B) ManCAR employs parallel reasoning streams while ReaRec uses sequential refinement
C) ManCAR uses an increasing temperature schedule while ReaRec uses a decreasing one

Paper 3

Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Published: 2026-02-23

Link: http://arxiv.org/pdf/2602.20161

1. 📘 Topic and Domain: The paper presents Mobile-O, a unified multimodal model for both visual understanding and image generation optimized for deployment on mobile devices.
2. 💡 Previous Research and New Ideas: Building on existing unified models like BLIP-3o and Show-O, the paper introduces a Mobile Conditioning Projector (MCP) for efficient cross-modal fusion and a novel quadruplet training format (generation prompt, image, question, answer) for simultaneous improvement of both tasks.
3. ❓ Problem: The paper addresses the challenge that existing unified multimodal models are too computationally expensive and memory-intensive for deployment on edge devices like smartphones.
4. 🛠️ Methods: The authors use a lightweight architecture combining FastVLM for understanding and SANA for generation, connected via the MCP module using depthwise-separable convolutions, and employ a three-stage training scheme with unified post-training on 105k quadruplet samples.
5. 📊 Results and Evaluation: Mobile-O achieves 74% on GenEval (5-11% better than Show-O and JanusFlow) while running 6-11× faster, and attains 62.1% average accuracy across seven visual understanding benchmarks, all while maintaining under 2GB memory footprint and 3-second generation time on iPhone.
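
The MCP described above relies on depthwise-separable convolutions to keep the cross-modal bridge small. A minimal numpy sketch of that building block, showing why it is cheap; the function, shapes, and single-layer setup are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def depthwise_separable_conv1d(x, dw_kernel, pw_weight):
    """Depthwise-separable 1D convolution over a token sequence.

    x:         (T, C)      input tokens (e.g., VLM hidden states)
    dw_kernel: (K, C)      one K-tap filter per channel (depthwise step)
    pw_weight: (C, C_out)  1x1 projection mixing channels (pointwise step)

    Parameter count is K*C + C*C_out instead of K*C*C_out for a full
    convolution, which is how an MCP-style bridge stays at millions
    rather than tens of millions of parameters.
    """
    T, C = x.shape
    K = dw_kernel.shape[0]
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # "same" padding along time
    # Depthwise step: each channel is filtered independently.
    dw = np.stack([sum(dw_kernel[k] * xp[t + k] for k in range(K))
                   for t in range(T)])
    # Pointwise step: 1x1 conv mixes channels into the target width.
    return dw @ pw_weight
```

With kernel size K=3, C=512 and C_out=512, this costs 1,536 + 262,144 weights versus 786,432 for a full conv, a roughly 3× saving that grows with K.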

[Figure] Mobile-O workflow: Stage 1, cross-modal alignment — pre-train DiT + MCP on JourneyDB (4M samples) and BLIP3o-Short (5M samples) with frozen VE and LLM (~3 days); Stage 2, supervised fine-tuning on BLIP3o-60K and ShareGPT-4o (45K), targeting complex gestures while maintaining the frozen configuration (~15 hours); Stage 3, unified post-training on the quadruplet format (generation prompt, image, question, answer) with a joint I2T + T2I loss and LoRA on the LLM + VE (~5 hours). The Mobile Conditioning Projector (layer fusion, depthwise conv, channel attention; only 2.4M parameters) bridges the VLM to the diffusion model. Understanding module: FastVLM-0.5B image encoder with a Qwen2-0.5B LLM; generation module: SANA-600M DiT with VAE encoder/decoder at 512×512. Highlights: 74% on GenEval (5% better than Show-O), 62.1% average on 7 understanding benchmarks, ~3 seconds per image on iPhone 17 Pro, <2GB memory footprint.
Q1. What is the key architectural innovation in Mobile-O that enables efficient cross-modal fusion between understanding and generation tasks?
A) A Mobile Conditioning Projector (MCP) using depthwise-separable convolutions and layerwise alignment
B) A 2.6B-parameter UNet combined with learnable query tokens
C) A transformer-based adapter with full 2D convolutions

Q2. How does Mobile-O's training data requirement compare to existing unified models like BLIP-3o?
A) Mobile-O requires 50-100M samples, similar to other unified models
B) Mobile-O achieves strong performance with only a few million pre-training samples (about 5× less than BLIP-3o)
C) Mobile-O needs at least 1B samples for effective cross-modal alignment

Q3. What unique training format does Mobile-O introduce in its unified multimodal post-training stage?
A) Sequential training where understanding is frozen while training generation
B) Joint training on disjoint understanding and generation datasets
C) Quadruplet format (generation prompt, image, question, answer) where each sample supports both tasks