2026-01-13 Papers


Paper 1

Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

Published: 2026-01-11

Link: http://arxiv.org/pdf/2601.06943

1. 📘 Topic and Domain: A benchmark for video-based deep research that evaluates AI models' ability to answer questions by combining video understanding with web searching and reasoning.
2. 💡 Previous Research and New Ideas: Based on existing deep research benchmarks that focus on text-only queries and closed-video understanding tasks, this paper introduces a novel paradigm requiring models to connect video cues with open web search.
3. ❓ Problem: Existing benchmarks don't evaluate how AI models can use video content as clues to search and verify information across the open web, which is important for real-world video question answering.
4. 🛠️ Methods: Created the VideoDR benchmark through a rigorous human annotation process with strict quality control, and tested models under two paradigms (Workflow and Agentic) across difficulty levels, video durations, and semantic domains.
5. 📊 Results and Evaluation: Leading models like Gemini-3-pro-preview achieved 76% accuracy under the Agentic setting, but Agentic is not consistently superior to Workflow; success depends on the model's ability to maintain its initial video anchors during long search chains.


VideoDR: Video Deep Research Benchmark (infographic summary)

Task definition: f(V, Q; S) → A — given video V and question Q, use search tool S to produce answer A. This requires multi-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence.

Data construction pipeline:
• Candidate video pool with stratified sampling by source, domain, and duration
• Filtering: remove clips lacking prominent visual anchors or multi-frame coherence
• Question design: multi-frame reasoning, multi-hop reasoning, verifiable answers
• Quality control: web-only test, video-only test, human testing (5 subjects)

Evaluation paradigms:
• Workflow: extract visual cues into structured text, then search with the text plus the question over multiple rounds of retrieval and reasoning (stable anchors, controllable)
• Agentic: an end-to-end single agent takes the video and question directly and searches and reasons autonomously (preserves details, requires consistency)

Models evaluated:
• Closed-source: GPT-4o, GPT-5.2, Gemini-3-pro-preview
• Open-source: MiniCPM-V 4.5, InternVL3.5-14B, Qwen3-Omni-30B-A3B

Key findings:
• Agentic is not always better; success depends on the model's ability to maintain initial video anchors over long retrieval chains
• Core bottlenecks: goal drift, long-horizon consistency, numerical error persistence
• Workflow → Agentic accuracy: Gemini-3-pro 69% → 76%; GPT-5.2 69% → 69%; human baseline 50.4%
• Benchmark stats: 100 samples across 6 domains, 25.5 average question tokens, 3 difficulty levels
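The Workflow paradigm described above is essentially a fixed pipeline: extract visual cues into text, search with the cues plus the question, then iterate retrieval and reasoning. A minimal sketch, assuming hypothetical `extract_cues`, `search`, and `reason` callables (none of these names come from the paper):

```python
def workflow_answer(video, question, extract_cues, search, reason, max_rounds=3):
    """f(V, Q; S) -> A: answer `question` about `video` via a search tool."""
    cues = extract_cues(video)            # step 1: video -> structured text anchors
    query = f"{cues} {question}"          # step 2: query grounded in the video cues
    evidence, answer = [], None
    for _ in range(max_rounds):           # step 3: multi-round retrieval & reasoning
        evidence.extend(search(query))
        answer, done, query = reason(cues, question, evidence)
        if done:
            break
    return answer

# Toy stubs to exercise the loop (purely illustrative):
extract_cues = lambda v: "red-jersey cyclist, mountain stage"
search = lambda q: [f"result for: {q}"]
def reason(cues, question, evidence):
    # Pretend one retrieval round is enough to commit to an answer.
    return "Tour de France", len(evidence) >= 1, f"{cues} {question}"

print(workflow_answer("clip.mp4", "Which race is shown?", extract_cues, search, reason))
```

The point of the fixed pipeline is that the video anchors are frozen into text once, which is what the paper credits for its stability over long search chains.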
Q1. What is the main innovation of VideoDR compared to existing benchmarks?
It only focuses on video understanding without web search
It combines video understanding with open web search and reasoning
It evaluates text-only queries on web search
Q2. Which of the following best describes the relationship between the Workflow and Agentic paradigms, based on the paper's findings?
Agentic is consistently superior to Workflow in all scenarios
Workflow always performs better than Agentic
Agentic's success depends on the model's ability to maintain initial video anchors during search
Q3. What was Gemini-3-pro-preview's key strength in tool usage, according to the paper?
It made the most tool calls among all models
It converted additional retrieval and reflection into more reliable evidence integration
It had the fastest runtime among all models

Paper 2

BabyVision: Visual Reasoning Beyond Language

Published: 2026-01-10

Link: http://arxiv.org/pdf/2601.06521

1. 📘 Topic and Domain: The paper introduces BabyVision, a benchmark for evaluating fundamental visual reasoning abilities in multimodal large language models (MLLMs), focusing on basic visual skills that humans develop before language acquisition.
2. 💡 Previous Research and New Ideas: Based on developmental psychology research showing humans acquire core visual skills before language, the paper proposes a new evaluation approach focused on pre-linguistic visual abilities, rather than existing benchmarks that test high-level semantic reasoning.
3. ❓ Problem: The paper addresses the gap between MLLMs' strong performance on knowledge-intensive tasks versus their weakness in basic visual tasks that even young children can solve effortlessly.
4. 🛠️ Methods: The authors created a benchmark with 388 questions across 22 subtypes in 4 categories (fine-grained discrimination, visual tracking, spatial perception, pattern recognition), evaluated leading MLLMs against human performance, and introduced BABYVISION-GEN for testing visual generation capabilities.
5. 📊 Results and Evaluation: The best model (Gemini3-Pro-Preview) achieved only 49.7% accuracy compared to human performance of 94.1%, with consistent deficits across all categories, revealing significant gaps in MLLMs' fundamental visual understanding abilities.


BabyVision research methodology (infographic summary)

Data curation pipeline:
1. Taxonomy & seed collection: define 4 categories and 22 subtypes from ~50 seed images
2. Data collection & filtering: reverse image search yielding ~4000 candidates
3. Annotation & QA: double-blind expert review, 388 final questions

Four core visual categories:
• Fine-grained discrimination (163 questions, 8 subtypes): find the difference, shadows, pattern completion and reconstruction
• Visual tracking (83 questions, 5 subtypes): maze navigation, line observation, connecting lines
• Spatial perception (91 questions, 5 subtypes): 3D views, cube unfolding, paper folding, counting 3D blocks
• Visual pattern recognition (51 questions, 4 subtypes): logic patterns, rotation, mirroring, overlay patterns

Evaluation methodologies:
• BabyVision (language-based): 388 questions across 22 subtypes; multiple choice and fill-in-the-blank; LLM-as-judge evaluation; human baseline 94.1%; best model 49.7% (Gemini3); 11 frontier models tested
• BabyVision-Gen (visual generation): 280 questions across 21 subtypes; visual output assessment with an automatic evaluation toolkit (96% agreement with humans); best model 18.3% (NanoBanana-Pro); tests visual reasoning via generation

Key research insights:
• Verbalization bottleneck: models compress visual information into language tokens
• RLVR training benefits: +4.8% improvement with reinforcement learning
• Four failure modes: detail loss, tracking, spatial, and pattern failures
• Visual externalization: generation models show promising visual reasoning
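The language-based evaluation above reduces to scoring model answers against references and reporting accuracy. A minimal sketch, with exact matching standing in for the paper's LLM-as-judge; the items below are invented examples in the spirit of the benchmark's subtypes, not benchmark data:

```python
def accuracy(items, model):
    """Fraction of multiple-choice items the model answers correctly."""
    correct = sum(model(it["question"], it["choices"]) == it["answer"] for it in items)
    return correct / len(items)

# Invented examples (hypothetical, not from BabyVision):
items = [
    {"question": "Which shadow matches the object?", "choices": ["A", "B", "C"], "answer": "B"},
    {"question": "Which path exits the maze?", "choices": ["A", "B"], "answer": "A"},
]
always_first = lambda question, choices: choices[0]  # trivial baseline "model"
print(accuracy(items, always_first))  # 0.5
```

An LLM-as-judge would replace the equality check with a judged comparison, which is what lets the benchmark also score free-form fill-in-the-blank answers.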
Q1. What is identified as the fundamental limitation preventing MLLMs from performing well on BabyVision tasks?
Insufficient training data
The verbalization bottleneck
Limited computational resources
Q2. How did Gemini3-Pro-Preview's performance on BabyVision compare to human age groups?
Performed better than 12-year-olds
Performed between 3-year-olds and 6-year-olds
Performed worse than 3-year-olds
Q3. What unique aspect does BABYVISION-GEN introduce to evaluate visual reasoning?
Text-based explanations of visual reasoning
Multiple choice selection format
Visual generation outputs like drawing lines or marking regions

Paper 3

MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head

Published: 2026-01-12

Link: http://arxiv.org/pdf/2601.07832

1. 📘 Topic and Domain: A novel linear attention mechanism called MHLA (Multi-Head Linear Attention) for transformer architectures in computer vision and natural language processing tasks.
2. 💡 Previous Research and New Ideas: Based on linear attention mechanisms that reduce computational complexity but suffer from performance degradation; proposes token-level multi-head attention to restore expressivity while maintaining efficiency.
3. ❓ Problem: Addresses the "global context collapse" problem in linear attention where models lose representational diversity and performance degrades due to using a single global key-value summary.
4. 🛠️ Methods: Divides tokens into non-overlapping blocks ("heads") along spatial dimensions, computes local key-value summaries, and uses learnable mixing coefficients to create query-specific context while maintaining linear complexity.
5. 📊 Results and Evaluation: Achieved significant improvements across multiple domains: 3.6% on ImageNet classification, 6.3% on NLP tasks, 12.6% on image generation, and 41% on video generation while maintaining the same time complexity as linear attention.


MHLA infographic summary

Problem analysis:
• Linear attention has O(N) complexity but suffers from global context collapse
• Attention rank is limited by the dimension d; query-conditioned selectivity is lost; attention distributions become near-uniform

Root cause:
• A single global KV summary is shared by all queries
• rank(A_lin) ≤ min{rank(Q̃), rank(K̃)} ≤ d
• High entropy in the attention weights; an information bottleneck in a fixed-size matrix; no token-level diversity

MHLA architecture:
1. Project the input X ∈ R^(N×d) to Q, K, V
2. Partition the tokens into M non-overlapping blocks
3. Compute local KV summaries per block: S_b = Σ_{j∈b} K̃_j V_j^T and z_b = Σ_{j∈b} K̃_j
4. Mix with learnable, query-conditioned coefficients: S̃_i = Σ_{b=1}^M m_{i,b} S_b
5. Output: o_i = q̃_i^T S̃_i / q̃_i^T z̃_i

Key properties:
• Complexity O(Nd² + M²d²) = O(Nd²)
• Rank bound raised to Σ_{b=1}^M min(n_b, d), much higher than d
• Query selectivity restored via mixing (lower entropy)
• No extra modules: a pure attention design built from standard GEMMs

Experimental validation:
• Image classification (ImageNet): +3.6% vs self-attention
• Image generation (DiT): +12.6% FID at the same throughput as linear attention
• NLP (MMLU): +6.3%, long-context modeling
• Video generation: +41% improvement on ultra-long sequences

Core innovation: token-level diversity via query-conditioned block mixing, maintaining linear complexity while restoring expressivity.
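The block-local summaries and query-conditioned mixing described above can be sketched in NumPy. This is a naive reference sketch, not the paper's implementation: the ReLU feature map and the random mixing weights are stand-ins for the learned components.

```python
import numpy as np

def mhla(Q, K, V, M, rng):
    """Sketch of Multi-Head Linear Attention: block-local KV summaries + mixing.

    Q, K, V: (N, d) arrays; M: number of non-overlapping token blocks.
    """
    N, d = Q.shape
    phi = lambda x: np.maximum(x, 0.0) + 1e-6         # assumed positive feature map
    Qt, Kt = phi(Q), phi(K)

    blocks = np.array_split(np.arange(N), M)          # partition tokens into M blocks
    # Local summaries per block: S_b = sum_{j in b} K~_j V_j^T, z_b = sum_{j in b} K~_j
    S = np.stack([Kt[b].T @ V[b] for b in blocks])    # (M, d, d)
    z = np.stack([Kt[b].sum(axis=0) for b in blocks]) # (M, d)

    m = rng.random((N, M))                            # stand-in for learned mixing coefficients
    m /= m.sum(axis=1, keepdims=True)

    # S~_i = sum_b m_{i,b} S_b ;  o_i = q~_i^T S~_i / (q~_i^T z~_i)
    S_mix = np.einsum('im,mde->ide', m, S)            # (N, d, d) query-specific contexts
    z_mix = m @ z                                     # (N, d)
    num = np.einsum('id,ide->ie', Qt, S_mix)          # (N, d)
    den = np.einsum('id,id->i', Qt, z_mix)[:, None]   # (N, 1), positive by construction
    return num / den

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
out = mhla(Q, K, V, M=2, rng=rng)
print(out.shape)  # (8, 4)
```

With M = 1 this collapses back to plain linear attention with one global KV summary; larger M is what restores token-level diversity in the mixed context.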
Q1. What is the key problem that MHLA aims to solve in linear attention mechanisms?
High computational complexity
Global context collapse and loss of representational diversity
Inability to process long sequences
Q2. How does MHLA achieve better performance while maintaining linear complexity?
By using multiple attention heads in parallel
By adding convolutional layers and gating modules
By dividing tokens into spatial blocks and using learnable mixing coefficients
Q3. On which task did MHLA achieve the most significant performance improvement?
Image classification (3.6% gain)
Image generation (12.6% gain)
Video generation (41% gain)