2025-11-21 Papers

Paper 1

V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models

Published: 2025-11-20

Link: http://arxiv.org/pdf/2511.16668

1. 📘 Topic and Domain: The paper introduces V-ReasonBench, a benchmark suite for evaluating reasoning capabilities in video generation models across four dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics.
2. 💡 Previous Research and New Ideas: Building on the Chain-of-Frame (CoF) paradigm in video generation and Chain-of-Thought prompting in language models, the paper proposes a unified benchmark framework that evaluates reasoning ability rather than visual quality alone.
3. ❓ Problem: The paper addresses the lack of systematic and reliable evaluation methods for assessing reasoning capabilities in video generation models.
4. 🛠️ Methods: The benchmark uses a hybrid evaluation strategy combining mask-based, grid-based, and VLM-based evaluation methods across 326 reasoning instances, with pass@k as the primary metric for assessment.
5. 📊 Results and Evaluation: Testing six state-of-the-art video models revealed varying strengths across different reasoning dimensions, with Sora-2 leading overall (43.86% average), followed by Hailuo-02 (37.52%), while the benchmark achieved 97.09% alignment with human judgment.
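Since each instance is scored with pass@5 over 5 generated videos, the metric can be sketched with the standard unbiased pass@k estimator. This is the common combinatorial form, assuming n generations per instance of which c are judged correct; it is not necessarily the paper's exact implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n generations, c of them correct,
    passes. Computed as 1 minus the chance all k draws are incorrect."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples exist, so some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = k = 5 (as in the benchmark), this reduces to 1.0 whenever any of the five videos passes and 0.0 otherwise; the general form matters when n exceeds k.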

V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models

Evaluation pipeline (reconstructed from the paper's overview figure):

- Data preparation: 326 reasoning instances (652 image pairs) across four dimensions:
  - Structured problem-solving: arithmetic operation, code execution, Sudoku, Tic-Tac-Toe
  - Spatial cognition: shape fitting, visual symmetry, color connection
  - Pattern-based inference: sequence completion, analogy solving, rule following
  - Physical dynamics: block sliding, communicating vessels, temperature deformation
- Models evaluated: Sora-2, Veo-3.1, Hailuo-02, Kling-2.5, Vidu-Q2, Seedance-1.0
- Chain-of-Frame (CoF) reasoning: initial image → intermediate frames → final frame; 5 videos generated per instance (pass@5 evaluation)
- Evaluation strategies:
  - Mask-based (pixel-level MSE, for clear object boundaries): sequence, analogy, physics tasks
  - Grid-based (cell-wise accuracy, for structured layouts): symmetry, rule following
  - VLM-based (Gemini-2.5-Pro assessment, for simple visual outputs): math, code, shapes
- Results and analysis: pass@5 scores across the four dimensions over 9,780 generated videos; 97.09% alignment with human preference; analyses of duration effects, hallucination patterns, and failure modes
- Key findings: Sora-2 leads structured reasoning (72%); Hailuo-02 excels at physical dynamics (36.67%); video models show distinct reasoning patterns and systematic failure modes
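The mask-based and grid-based checks are simple to sketch. A minimal illustration, assuming images arrive as flattened lists of pixel intensities and grids as flat lists of cell values; the function names and data layout are hypothetical, not the authors' code:

```python
def masked_mse(pred, target, mask):
    """Mean squared error restricted to pixels where mask is truthy.
    Assumes the mask selects at least one pixel."""
    sq_errors = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(sq_errors) / len(sq_errors)

def grid_accuracy(pred_cells, target_cells):
    """Fraction of grid cells where the predicted value matches the target."""
    matches = sum(p == t for p, t in zip(pred_cells, target_cells))
    return matches / len(pred_cells)
```

Scoring only the masked region is what makes the first check suitable for tasks with clear object boundaries, while cell-wise matching mirrors the grid-based comparison used for symmetry and rule-following tasks.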
Q1
1. What unexpected finding emerged regarding video duration in the Chain-of-Frame (CoF) reasoning process?
Longer durations consistently improved reasoning accuracy
Longer durations often introduced irrelevant content without improving reasoning
Video duration had no effect on reasoning performance
Q2
2. Why did the benchmark avoid relying solely on Vision-Language Models (VLMs) for evaluation?
VLMs were too computationally expensive
VLMs struggled with interpreting grid-structured and densely laid out visual content
VLMs could not process video content
Q3
3. Among the four reasoning dimensions tested, which showed the highest performance gap between top-performing Sora-2 and other models?
Structured Problem-Solving (72% vs average ~16%)
Physical Dynamics (26.67% vs average ~32%)
Spatial Cognition (36.76% vs average ~17%)
Paper 2

Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

Published: 2025-11-20

Link: http://arxiv.org/pdf/2511.16669

1. 📘 Topic and Domain: Video-Next-Event Prediction (VNEP), using video generation as an answer modality for predicting future events in video understanding and AI reasoning.
2. 💡 Previous Research and New Ideas: Builds on Next-Event Prediction (NEP), which answers with textual descriptions of future events; introduces a new paradigm in which the answer is a generated video that demonstrates the predicted event.
3. ❓ Problem: Text-only answers struggle to convey complex physical actions and procedures; the task calls for more intuitive, customized visual demonstrations.
4. 🛠️ Methods: Proposes the VANS model, which uses Joint-GRPO (Group Relative Policy Optimization) to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) through reinforcement learning, trained on the newly created VANS-Data-100K dataset.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance in both event prediction accuracy and video generation quality, with significant improvements in ROUGE-L scores (0.3631), CLIP-V scores (0.8021), and reduced FVD (78.32) compared to baseline methods.
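GRPO's core step, which Joint-GRPO applies in a coordinated way to the VLM and VDM, replaces a learned value function with group-relative reward normalization. A minimal sketch of that advantage computation, illustrative only; in the paper each sample's scalar reward would combine the stage-specific terms (rf, rt1, rv1 in Stage 1; rv2, rc2 in Stage 2):

```python
def grpo_advantages(rewards):
    """Group Relative Policy Optimization advantage: normalize each
    sampled response's reward against the group's mean and standard
    deviation, so no separate value network is required."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # fall back to 1.0 when all rewards are identical
    return [(r - mean) / std for r in rewards]
```

Responses scoring above the group mean get positive advantages and are reinforced; below-mean responses are penalized, which is what lets the two models' reward signals be balanced within each sampled group.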

Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

VANS next-video-event prediction workflow (reconstructed from the paper's overview figure):

- Data preparation: raw data collection → shot split & crop → clip selection → QA pair generation → VANS-Data-100K
- Model architecture: VLM (Qwen2.5-VL-3B) and VDM (Wan-2.1-1.3B) with ViT + VAE encoders and cross-modal conditioning
- Supervised fine-tuning: VLM 10K steps, VDM 20K steps, learning rate 5e-5
- Joint-GRPO training pipeline:
  - Stage 1 (VLM tuning): visualization-friendly caption generation; rewards: format rf, text fidelity rt1, video fidelity rv1
  - Stage 2 (VDM adaptation): context-faithful video generation; rewards: video fidelity rv2, semantic alignment rc2
- Evaluation metrics: text (BLEU, ROUGE-L), video (FVD, CLIP-V), semantic (CLIP-T), on procedural and predictive tasks
- Input/output flow: input video + question → VLM reasoning → caption → VDM generation → output video
- Key innovation: Joint-GRPO bridges the semantic-to-visual gap via coordinated RL training
- Results: state-of-the-art performance on VNEP; procedural ROUGE-L 0.3631, predictive CLIP-V 0.7872
Q1
1. What is the main innovation of VNEP compared to traditional Next-Event Prediction (NEP)?
It uses more advanced language models for prediction
It generates video demonstrations instead of text descriptions
It processes videos at a higher resolution
Q2
2. What is the key challenge addressed by the Joint-GRPO strategy in the VANS model?
Reducing computational costs of video generation
Improving video resolution quality
Aligning the semantic understanding of VLM with visual generation of VDM
Q3
3. The VANS-Data-100K dataset contains what ratio of procedural to predictive samples?
50K : 50K
30K : 70K
70K : 30K
Paper 3

What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity

Published: 2025-11-19

Link: http://arxiv.org/pdf/2511.15593

1. 📘 Topic and Domain: Study of ideation diversity in AI research agents' performance on machine learning tasks using the MLE-bench benchmark.
2. 💡 Previous Research and New Ideas: Based on previous work on AI research agents and automated machine learning tools, proposes new methods to quantify and control agents' ideation diversity.
3. ❓ Problem: Understanding what factors drive success in AI research agents' performance, specifically focusing on whether ideation diversity is a key bottleneck.
4. 🛠️ Methods: Analyzed 11,000 agent trajectories across different models and scaffolds, measured ideation diversity as the Shannon entropy of the distribution of proposed model architectures, and ran controlled experiments that modified prompts to raise or lower diversity.
5. 📊 Results and Evaluation: Found strong correlation between ideation diversity and agent performance, with higher-diversity agents achieving better results across multiple evaluation metrics, demonstrating that ideation diversity is indeed a key performance factor.
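The diversity metric itself is just Shannon entropy over the architectures named in an agent's initial draft ideas. A minimal sketch, assuming the architectures have already been extracted as strings; the function name is hypothetical:

```python
from collections import Counter
from math import log2

def ideation_diversity(draft_architectures):
    """Shannon entropy (in bits) of the architecture distribution across
    an agent's initial draft ideas; higher entropy means more diverse
    ideation (0.0 when every draft proposes the same architecture)."""
    counts = Counter(draft_architectures)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())
```

Five drafts all proposing a CNN score 0.0 bits, while five drafts split evenly across distinct architectures approach log2(5) ≈ 2.32 bits, matching the study's intuition that varied initial proposals signal a healthier search.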

What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity

Ideation-diversity study workflow (reconstructed from the paper's overview figure):

- Large-scale data collection: 11,000 agent trajectories; 6 LLM backbones × 2 scaffolds; 75 MLE-bench tasks; 264,000 GPU hours; 10-20 random seeds
- Diversity measurement: extract the ML architectures proposed in each agent's initial 5 draft ideas and compute Shannon entropy over the architecture distribution (a tree-level diversity metric)
- Correlation analysis: diversity vs. performance, Pearson r = 0.57 (p < 0.001) against medal rate; high-performing models show higher diversity
- Controlled experiment: baseline vs. low-diversity prompting on 22 MLE-bench Lite tasks (2 scaffolds × 10 seeds); diversity reduced by removing sibling-memory diversity, prompt-adaptive complexity, and diversity mentions, by prompting for similar ideas, and by varying temperature
- Causal validation: medal rates drop 6.9% (AIRA Greedy) and 8.4% (AIRA MCTS); the share of runs using ≤2 architectures rises from 40% to 70%
- Robustness: the effect holds under valid-submission rate, average normalized score, percentile rankings, and ELO-based rankings
- Implementation bottleneck: diversity helps agents design implementable solutions and de-risks implementation failures; execution time correlates with performance
- Setup: backbones o3, GPT-OSS (20B/120B), Llama Maverick, Devstral, CWM, DeepSeek-R1; scaffolds AIDE (greedy tree search), AIRA Greedy, AIRA MCTS; architectures include CNN, Transformer, GBDT, EfficientNet, ResNet, ViT, LightGBM, ConvNeXt
- Conclusion: ideation diversity is a key bottleneck in AI research agent performance; higher diversity in initial architecture ideas leads to better task performance, and the causal relationship is confirmed through controlled experimentation
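The headline Pearson r = 0.57 between per-trajectory diversity and medal rate is a standard sample correlation. A minimal sketch of the computation, with a hypothetical helper name and assuming paired per-trajectory scores:

```python
def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length sequences:
    covariance of the deviations divided by the product of their norms."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    norm_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (norm_x * norm_y)
```

In practice one would feed in each trajectory's entropy score and its task outcome; the study pairs this with a prompt-manipulation experiment precisely because correlation alone cannot establish that diversity causes the performance gap.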
Q1
1. How did the researchers measure ideation diversity in AI research agents?
By counting the number of unique keywords in agent outputs
By calculating Shannon entropy on the distribution of model architectures
By measuring the time taken to generate different ideas
Q2
2. What happened when the researchers deliberately reduced ideation diversity in their controlled experiment?
The agents completed tasks faster but with lower accuracy
The agents' performance improved slightly
The agents showed significant decrease in performance across multiple metrics
Q3
3. Which of the following agent scaffolds demonstrated higher ideation diversity in the study?
AIDE, with 70% of initial drafts using just GBDT and CNN
AIRAGreedy, with a more balanced distribution of different architectures
Both scaffolds showed equal levels of diversity