1. 📘 Topic and Domain: The paper evaluates the reasoning capabilities of Text-Image-to-Video (TI2V) generation models, specifically their ability to internalize and apply implicit world rules, a capability that goes beyond surface-level visual quality.
2. 💡 Previous Research and New Ideas: Building on existing video generation benchmarks that emphasize perceptual quality and temporal coherence, such as VBench, this work introduces a reasoning-oriented evaluation framework with 467 human-annotated samples spanning eight reasoning dimensions, together with an automated LMM-based evaluation pipeline (a hypothetical sample schema is sketched after this list).
3. ❓ Problem: The paper addresses the lack of evaluation protocols for assessing whether TI2V models can reliably understand and reason over implicit world rules, as current benchmarks predominantly focus on visual fidelity rather than cognitive reasoning abilities.
4. 🛠️ Methods: The authors develop the RISE-Video benchmark with four evaluation metrics (Reasoning Alignment, Temporal Consistency, Physical Rationality, Visual Quality), use GPT-5 as an automated judge driven by specially designed prompts, and evaluate 11 state-of-the-art TI2V models across the eight reasoning categories (see the judge-pipeline sketch after this list).
5. 📊 Results and Evaluation: Experiments reveal that all models achieve low accuracy (best: 22.5%, by Hailuo 2.3), with closed-source models outperforming open-source ones. Models handle perceptual-knowledge tasks noticeably better than logical-capability tasks, where they struggle significantly, and the LMM-based evaluation shows strong alignment with human judgments (a minimal agreement check is sketched after this list).
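
To make the benchmark's composition concrete, here is a minimal sketch of what one RISE-Video sample might look like. The field names are hypothetical; the summary only states that each sample pairs an image-text input with one of eight human-annotated reasoning dimensions.

```python
from dataclasses import dataclass

@dataclass
class RiseVideoSample:
    """One benchmark item: an image-text pair whose correct video continuation
    depends on an implicit world rule. All field names are hypothetical."""
    sample_id: str
    image_path: str           # conditioning frame given to the TI2V model
    prompt: str               # text instruction describing the target video
    reasoning_dimension: str  # one of the eight annotated reasoning categories
    rule_description: str     # human-written note on the implicit rule to satisfy
```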
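The judge pipeline from point 4 could be wired up roughly as follows, assuming an OpenAI-style chat API and judging from frames sampled out of the generated video. The rubric text, frame-sampling scheme, and JSON output schema are illustrative stand-ins for the paper's specially designed prompts, not the authors' actual implementation.

```python
import base64
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative rubric covering the four RISE-Video metrics.
RUBRIC = """You are a video evaluation judge. Given the text prompt and frames
sampled from a generated video, score the video from 0-10 on each axis and
decide whether the implicit world rule implied by the prompt was followed:
- reasoning_alignment: does the video realize the rule implied by the prompt?
- temporal_consistency: are objects and identities stable across frames?
- physical_rationality: does the motion obey basic physics?
- visual_quality: overall fidelity and absence of artifacts.
Return JSON: {"reasoning_alignment": int, "temporal_consistency": int,
"physical_rationality": int, "visual_quality": int, "rule_followed": bool}"""

def encode_frame(path: str) -> dict:
    """Pack one sampled video frame as a base64 image part for the chat API."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

def judge_video(prompt: str, frame_paths: list[str]) -> dict:
    """Ask the judge model to score one generated video against the rubric."""
    content = [{"type": "text", "text": f"{RUBRIC}\n\nPrompt: {prompt}"}]
    content += [encode_frame(p) for p in frame_paths]
    resp = client.chat.completions.create(
        model="gpt-5",  # judge model named in the summary
        messages=[{"role": "user", "content": content}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```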
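The human-alignment claim in point 5 is typically verified with a rank correlation between the automated judge's scores and human ratings of the same videos. The summary does not name the statistic, so Spearman's ρ below is an assumption.

```python
from scipy.stats import spearmanr

def human_alignment(lmm_scores: list[float], human_scores: list[float]) -> float:
    """Rank correlation between judge scores and human ratings for the same
    set of generated videos; values near 1.0 indicate strong agreement."""
    rho, _ = spearmanr(lmm_scores, human_scores)
    return rho
```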