1. 📘 Topic and Domain: The paper introduces V-ReasonBench, a benchmark suite for evaluating reasoning capabilities in video generation models across four dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics.
2. 💡 Previous Research and New Ideas: Building on the Chain-of-Frame paradigm in video generation and Chain-of-Thought prompting in language models, the paper proposes a unified benchmark framework that specifically evaluates reasoning ability rather than visual quality alone.
3. ❓ Problem: The paper addresses the lack of systematic and reliable evaluation methods for assessing reasoning capabilities in video generation models.
4. 🛠️ Methods: The benchmark uses a hybrid evaluation strategy that combines mask-based, grid-based, and VLM-based evaluation methods across 326 reasoning instances, with pass@k as the primary metric.
5. 📊 Results and Evaluation: Evaluating six state-of-the-art video models revealed varying strengths across the reasoning dimensions, with Sora-2 leading overall (43.86% average), followed by Hailuo-02 (37.52%); the benchmark's automated scoring achieved 97.09% agreement with human judgment.
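For context on the primary metric: pass@k is commonly computed with the unbiased estimator introduced for code-generation benchmarks (estimating the probability that at least one of k samples, drawn from n generations of which c are correct, passes). The sketch below shows that standard estimator; the function name and the sample counts in the example are illustrative assumptions, not values taken from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (illustrative):
    n = total generations per task, c = number judged correct,
    k = samples allowed. Returns P(at least one of k passes)."""
    if n - c < k:
        # Fewer than k incorrect generations exist, so any
        # k-sample draw must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 10 generations per task, 3 correct.
print(pass_at_k(10, 3, 1))  # 0.3  (= 1 - 7/10)
print(pass_at_k(10, 3, 5))  # 1 - C(7,5)/C(10,5) ≈ 0.9167
```

Averaging this quantity over all benchmark instances gives the per-model pass@k score.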