1. 📘 Topic and Domain: The paper presents a large-scale dataset and benchmark for evaluating the reasoning capabilities of video generation models across a range of cognitive tasks.
2. 💡 Previous Research and New Ideas: Existing video reasoning benchmarks are limited in scale (12.8K samples combined); building on them, the paper proposes VBVR with 2M+ samples and introduces a principled cognitive architecture that organizes tasks into five faculties: perception, transformation, spatiality, abstraction, and knowledge (a taxonomy sketch follows this list).
3. ❓ Problem: Current video generation models focus primarily on visual quality, while their reasoning capabilities remain underexplored; progress is hindered by the lack of both large-scale training data and verifiable evaluation frameworks for video reasoning.
4. 🛠️ Methods: The authors created 200 parameterized task generators grounded in cognitive theory, generated 1M training and 7.5K test samples on distributed cloud infrastructure, and, instead of model-based judging, developed rule-based scorers for reproducible evaluation (a minimal generator/scorer sketch appears after this list).
5. 📊 Results and Evaluation: Fine-tuning Wan2.2 on VBVR raised its score from 0.371 to 0.685 (an 84.6% relative gain, verified below), surpassing all evaluated models including Sora 2 (0.546) and Veo 3.1 (0.480), while scaling studies showed emergent generalization to out-of-domain tasks but a persistent gap to human performance (0.974).
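To make the five-faculty organization in point 2 concrete, here is a minimal sketch of the taxonomy as a data structure. The faculty names come from the summary itself, but the example tasks listed under each faculty are hypothetical placeholders, not the paper's actual 200 generators.

```python
# Sketch of VBVR's five-faculty task taxonomy.
# Faculty names are from the summary; the tasks under each faculty
# are illustrative placeholders, not the paper's real generator list.
TAXONOMY = {
    "perception":     ["color_matching", "object_counting"],
    "transformation": ["rotation_prediction", "state_change"],
    "spatiality":     ["maze_navigation", "occlusion_reasoning"],
    "abstraction":    ["pattern_completion", "analogy"],
    "knowledge":      ["physics_consistency", "commonsense_ordering"],
}

assert len(TAXONOMY) == 5  # the five cognitive faculties
```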
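Point 4 pairs parameterized task generators with rule-based scorers. The sketch below illustrates that pattern under stated assumptions: a generator samples parameters into a prompt with a verifiable ground truth, and a deterministic scorer checks an extracted answer against it. All names here (`counting_task_generator`, `rule_based_score`, and so on) are hypothetical, and the step that extracts `predicted_count` from a generated video (e.g., an object detector) is outside this sketch.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str        # text prompt given to the video model
    ground_truth: int  # verifiable target the scorer checks against


def counting_task_generator(seed: int, max_objects: int = 9) -> Sample:
    """Hypothetical parameterized generator for an object-counting task.

    Each (seed, max_objects) setting yields a distinct but verifiable
    sample, which is how a small set of generators can scale to
    millions of training examples.
    """
    rng = random.Random(seed)
    n = rng.randint(1, max_objects)
    return Sample(
        prompt=f"Generate a video showing exactly {n} red cubes on a table.",
        ground_truth=n,
    )


def rule_based_score(predicted_count: int, sample: Sample) -> float:
    """Deterministic scorer: 1.0 on exact match, 0.0 otherwise.

    Because the check is a rule rather than a judge model, repeated
    evaluations of the same output always yield the same score.
    """
    return 1.0 if predicted_count == sample.ground_truth else 0.0


# Example: enumerate seeds to mass-produce verifiable samples.
samples = [counting_task_generator(seed) for seed in range(5)]
print([s.ground_truth for s in samples])
```

The design choice the summary highlights is that determinism, not judge quality, is what makes the evaluation reproducible: any two runs over the same outputs produce identical scores.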
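As a quick check of the 84.6% figure in point 5, the gain is relative to the pre-fine-tuning score:

```python
# Relative improvement of fine-tuned Wan2.2 over its base score.
base, tuned = 0.371, 0.685
print(f"relative gain: {(tuned - base) / base:.1%}")  # relative gain: 84.6%
```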