2026-01-13 Papers


Paper 1

Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

Published: 2026-01-11

Link: http://arxiv.org/pdf/2601.06943

1. 📘 Topic and Domain: A benchmark for video-based deep research that evaluates AI models' ability to answer questions by combining video understanding with web searching and reasoning.
2. 💡 Previous Research and New Ideas: Based on existing deep research benchmarks that focus on text-only queries and closed-video understanding tasks, this paper introduces a novel paradigm requiring models to connect video cues with open web search.
3. ❓ Problem: Existing benchmarks don't evaluate how AI models can use video content as clues to search and verify information across the open web, which is important for real-world video question answering.
4. 🛠️ Methods: Created the VideoDR benchmark through a rigorous human annotation process with strict quality control, and tested models under two paradigms (Workflow and Agentic) across difficulty levels, video durations, and semantic domains.
5. 📊 Results and Evaluation: Leading models like Gemini-3-pro-preview achieved 76% accuracy under the Agentic setting, but Agentic is not consistently superior to Workflow; success depends on the model's ability to maintain its initial video anchors during long search chains.


VideoDR: Video Deep Research Benchmark (infographic summary)

Task definition: f(V, Q; S) → A — given video V and question Q, use search tool S to produce answer A. This requires multi-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence.

Data construction pipeline:
• Candidate video pool with stratified sampling by source, domain, and duration
• Filtering: remove clips lacking prominent visual anchors or multi-frame coherence
• Question design: multi-frame reasoning, multi-hop reasoning, verifiable answers
• Quality control: web-only test, video-only test, human testing (5 subjects)

Evaluation paradigms:
• Workflow: extract visual cues into structured text, then search with the text plus the question over multiple rounds of retrieval and reasoning (stable anchors, controllable)
• Agentic: an end-to-end single agent takes the video and question directly and searches and reasons autonomously (preserves details, requires consistency)

Models evaluated:
• Closed-source: GPT-4o, GPT-5.2, Gemini-3-pro-preview
• Open-source: MiniCPM-V 4.5, InternVL3.5-14B, Qwen3-Omni-30B-A3B

Key findings:
• Agentic is not always better; success depends on the model's ability to maintain initial video anchors over long retrieval chains
• Core bottlenecks: goal drift, long-horizon consistency, numerical error persistence
• Workflow → Agentic accuracy: Gemini-3-pro 69% → 76%; GPT-5.2 69% → 69%; human baseline 50.4%
• Benchmark stats: 100 samples across 6 domains, 25.5 average question tokens, 3 difficulty levels
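The Workflow paradigm described above is essentially a fixed pipeline: extract visual cues into text, search with the cues plus the question, then iterate retrieval and reasoning. A minimal sketch, assuming hypothetical `extract_cues`, `search`, and `reason` callables (none of these names come from the paper):

```python
def workflow_answer(video, question, extract_cues, search, reason, max_rounds=3):
    """f(V, Q; S) -> A: answer `question` about `video` via a search tool."""
    cues = extract_cues(video)            # step 1: video -> structured text anchors
    query = f"{cues} {question}"          # step 2: query grounded in the video cues
    evidence, answer = [], None
    for _ in range(max_rounds):           # step 3: multi-round retrieval & reasoning
        evidence.extend(search(query))
        answer, done, query = reason(cues, question, evidence)
        if done:
            break
    return answer

# Toy stubs to exercise the loop (purely illustrative):
extract_cues = lambda v: "red-jersey cyclist, mountain stage"
search = lambda q: [f"result for: {q}"]
def reason(cues, question, evidence):
    # Pretend one retrieval round is enough to commit to an answer.
    return "Tour de France", len(evidence) >= 1, f"{cues} {question}"

print(workflow_answer("clip.mp4", "Which race is shown?", extract_cues, search, reason))
```

The point of the fixed pipeline is that the video anchors are frozen into text once, which is what the paper credits for its stability over long search chains.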
Q1. What is the main innovation of VideoDR compared to existing benchmarks?
It only focuses on video understanding without web search
It combines video understanding with open web search and reasoning
It evaluates text-only queries on web search
Q2. Which of the following best describes the relationship between the Workflow and Agentic paradigms, based on the paper's findings?
Agentic is consistently superior to Workflow in all scenarios
Workflow always performs better than Agentic
Agentic's success depends on the model's ability to maintain initial video anchors during search
Q3. What was Gemini-3-pro-preview's key strength in tool usage, according to the paper?
It made the most tool calls among all models
It converted additional retrieval and reflection into more reliable evidence integration
It had the fastest runtime among all models

Paper 2

BabyVision: Visual Reasoning Beyond Language

Published: 2026-01-10

Link: http://arxiv.org/pdf/2601.06521

1. 📘 Topic and Domain: The paper introduces BabyVision, a benchmark for evaluating fundamental visual reasoning abilities in multimodal large language models (MLLMs), focusing on basic visual skills that humans develop before language acquisition.
2. 💡 Previous Research and New Ideas: Based on developmental psychology research showing humans acquire core visual skills before language, the paper proposes a new evaluation approach focused on pre-linguistic visual abilities, rather than existing benchmarks that test high-level semantic reasoning.
3. ❓ Problem: The paper addresses the gap between MLLMs' strong performance on knowledge-intensive tasks versus their weakness in basic visual tasks that even young children can solve effortlessly.
4. 🛠️ Methods: The authors created a benchmark with 388 questions across 22 subtypes in 4 categories (fine-grained discrimination, visual tracking, spatial perception, pattern recognition), evaluated leading MLLMs against human performance, and introduced BABYVISION-GEN for testing visual generation capabilities.
5. 📊 Results and Evaluation: The best model (Gemini3-Pro-Preview) achieved only 49.7% accuracy compared to human performance of 94.1%, with consistent deficits across all categories, revealing significant gaps in MLLMs' fundamental visual understanding abilities.


BabyVision research methodology (infographic summary)

Data curation pipeline:
1. Taxonomy & seed collection: define 4 categories and 22 subtypes from ~50 seed images
2. Data collection & filtering: reverse image search yielding ~4000 candidates
3. Annotation & QA: double-blind expert review, 388 final questions

Four core visual categories:
• Fine-grained discrimination (163 questions, 8 subtypes): find the difference, shadows, pattern completion and reconstruction
• Visual tracking (83 questions, 5 subtypes): maze navigation, line observation, connecting lines
• Spatial perception (91 questions, 5 subtypes): 3D views, cube unfolding, paper folding, counting 3D blocks
• Visual pattern recognition (51 questions, 4 subtypes): logic patterns, rotation, mirroring, overlay patterns

Evaluation methodologies:
• BabyVision (language-based): 388 questions across 22 subtypes; multiple choice and fill-in-the-blank; LLM-as-judge evaluation; human baseline 94.1%; best model 49.7% (Gemini3); 11 frontier models tested
• BabyVision-Gen (visual generation): 280 questions across 21 subtypes; visual output assessment with an automatic evaluation toolkit (96% agreement with humans); best model 18.3% (NanoBanana-Pro); tests visual reasoning via generation

Key research insights:
• Verbalization bottleneck: models compress visual information into language tokens
• RLVR training benefits: +4.8% improvement with reinforcement learning
• Four failure modes: detail loss, tracking, spatial, and pattern failures
• Visual externalization: generation models show promising visual reasoning
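The language-based evaluation above reduces to scoring model answers against references and reporting accuracy. A minimal sketch, with exact matching standing in for the paper's LLM-as-judge; the items below are invented examples in the spirit of the benchmark's subtypes, not benchmark data:

```python
def accuracy(items, model):
    """Fraction of multiple-choice items the model answers correctly."""
    correct = sum(model(it["question"], it["choices"]) == it["answer"] for it in items)
    return correct / len(items)

# Invented examples (hypothetical, not from BabyVision):
items = [
    {"question": "Which shadow matches the object?", "choices": ["A", "B", "C"], "answer": "B"},
    {"question": "Which path exits the maze?", "choices": ["A", "B"], "answer": "A"},
]
always_first = lambda question, choices: choices[0]  # trivial baseline "model"
print(accuracy(items, always_first))  # 0.5
```

An LLM-as-judge would replace the equality check with a judged comparison, which is what lets the benchmark also score free-form fill-in-the-blank answers.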
Q1. What is identified as the fundamental limitation preventing MLLMs from performing well on BabyVision tasks?
Insufficient training data
The verbalization bottleneck
Limited computational resources
Q2. How did Gemini3-Pro-Preview's performance on BabyVision compare to human age groups?
Performed better than 12-year-olds
Performed between 3-year-olds and 6-year-olds
Performed worse than 3-year-olds
Q3. What unique aspect does BABYVISION-GEN introduce to evaluate visual reasoning?
Text-based explanations of visual reasoning
Multiple choice selection format
Visual generation outputs like drawing lines or marking regions

Paper 3

MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head

Published: 2026-01-12

Link: http://arxiv.org/pdf/2601.07832

1. 📘 Topic and Domain: A novel linear attention mechanism called MHLA (Multi-Head Linear Attention) for transformer architectures in computer vision and natural language processing tasks.
2. 💡 Previous Research and New Ideas: Based on linear attention mechanisms that reduce computational complexity but suffer from performance degradation; proposes token-level multi-head attention to restore expressivity while maintaining efficiency.
3. ❓ Problem: Addresses the "global context collapse" problem in linear attention where models lose representational diversity and performance degrades due to using a single global key-value summary.
4. 🛠️ Methods: Divides tokens into non-overlapping blocks ("heads") along spatial dimensions, computes local key-value summaries, and uses learnable mixing coefficients to create query-specific context while maintaining linear complexity.
5. 📊 Results and Evaluation: Achieved significant improvements across multiple domains: 3.6% on ImageNet classification, 6.3% on NLP tasks, 12.6% on image generation, and 41% on video generation while maintaining the same time complexity as linear attention.


MHLA infographic summary

Problem analysis:
• Linear attention has O(N) complexity but suffers from global context collapse
• Attention rank is limited by the dimension d; query-conditioned selectivity is lost; attention distributions become near-uniform

Root cause:
• A single global KV summary is shared by all queries
• rank(A_lin) ≤ min{rank(Q̃), rank(K̃)} ≤ d
• High entropy in the attention weights; an information bottleneck in a fixed-size matrix; no token-level diversity

MHLA architecture:
1. Project the input X ∈ R^(N×d) to Q, K, V
2. Partition the tokens into M non-overlapping blocks
3. Compute local KV summaries per block: S_b = Σ_{j∈b} K̃_j V_j^T and z_b = Σ_{j∈b} K̃_j
4. Mix with learnable, query-conditioned coefficients: S̃_i = Σ_{b=1}^M m_{i,b} S_b
5. Output: o_i = q̃_i^T S̃_i / q̃_i^T z̃_i

Key properties:
• Complexity O(Nd² + M²d²) = O(Nd²)
• Rank bound raised to Σ_{b=1}^M min(n_b, d), much higher than d
• Query selectivity restored via mixing (lower entropy)
• No extra modules: a pure attention design built from standard GEMMs

Experimental validation:
• Image classification (ImageNet): +3.6% vs self-attention
• Image generation (DiT): +12.6% FID at the same throughput as linear attention
• NLP (MMLU): +6.3%, long-context modeling
• Video generation: +41% improvement on ultra-long sequences

Core innovation: token-level diversity via query-conditioned block mixing, maintaining linear complexity while restoring expressivity.
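The block-local summaries and query-conditioned mixing described above can be sketched in NumPy. This is a naive reference sketch, not the paper's implementation: the ReLU feature map and the random mixing weights are stand-ins for the learned components.

```python
import numpy as np

def mhla(Q, K, V, M, rng):
    """Sketch of Multi-Head Linear Attention: block-local KV summaries + mixing.

    Q, K, V: (N, d) arrays; M: number of non-overlapping token blocks.
    """
    N, d = Q.shape
    phi = lambda x: np.maximum(x, 0.0) + 1e-6         # assumed positive feature map
    Qt, Kt = phi(Q), phi(K)

    blocks = np.array_split(np.arange(N), M)          # partition tokens into M blocks
    # Local summaries per block: S_b = sum_{j in b} K~_j V_j^T, z_b = sum_{j in b} K~_j
    S = np.stack([Kt[b].T @ V[b] for b in blocks])    # (M, d, d)
    z = np.stack([Kt[b].sum(axis=0) for b in blocks]) # (M, d)

    m = rng.random((N, M))                            # stand-in for learned mixing coefficients
    m /= m.sum(axis=1, keepdims=True)

    # S~_i = sum_b m_{i,b} S_b ;  o_i = q~_i^T S~_i / (q~_i^T z~_i)
    S_mix = np.einsum('im,mde->ide', m, S)            # (N, d, d) query-specific contexts
    z_mix = m @ z                                     # (N, d)
    num = np.einsum('id,ide->ie', Qt, S_mix)          # (N, d)
    den = np.einsum('id,id->i', Qt, z_mix)[:, None]   # (N, 1), positive by construction
    return num / den

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
out = mhla(Q, K, V, M=2, rng=rng)
print(out.shape)  # (8, 4)
```

With M = 1 this collapses back to plain linear attention with one global KV summary; larger M is what restores token-level diversity in the mixed context.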
Q1. What is the key problem that MHLA aims to solve in linear attention mechanisms?
High computational complexity
Global context collapse and loss of representational diversity
Inability to process long sequences
Q2. How does MHLA achieve better performance while maintaining linear complexity?
By using multiple attention heads in parallel
By adding convolutional layers and gating modules
By dividing tokens into spatial blocks and using learnable mixing coefficients
Q3. On which task did MHLA achieve the most significant performance improvement?
Image classification (3.6% gain)
Image generation (12.6% gain)
Video generation (41% gain)