2025-10-29 Papers

Paper 1

InteractComp: Evaluating Search Agents With Ambiguous Queries

Published: 2025-10-28

Link: http://arxiv.org/pdf/2510.24668

1. 📘 Topic and Domain: Evaluating language models' ability to handle ambiguous search queries through interactive clarification in information retrieval and natural language processing.
2. 💡 Previous Research and New Ideas: Based on existing search agent benchmarks like GAIA and BrowseComp, but introduces a novel focus on interaction capabilities during search, which previous benchmarks overlooked.
3. ❓ Problem: Current search agents assume user queries are complete and unambiguous, failing to handle real-world scenarios where queries require clarification through interaction.
4. 🛠️ Methods: Created InteractComp benchmark with 210 expert-curated questions across 9 domains using a target-distractor methodology that creates genuine ambiguity resolvable only through interaction with users.
5. 📊 Results and Evaluation: The best model achieved only 13.73% accuracy with interaction available versus 71.50% with complete context, revealing systematic overconfidence rather than reasoning deficits, while forced interaction improved performance from 14% to 40%.

Figure: InteractComp benchmark overview (reconstructed from the paper's diagram)

- Data construction: target-distractor methodology; 210 expert-curated questions across 9 domains; answers are easy to verify but require interaction to disambiguate.
- Two-stage verification: (1) completeness validation, (2) interaction-necessity validation; manual plus automated checks.
- Agent architecture: ReAct framework in three configurations: answer-only, answer + search, and answer + search + interact.
- Model evaluation: 17 models tested, proprietary and open-weight (GPT-5, DeepSeek-R1, Claude-4, Qwen-2.5, etc.).
- Ablation across the three evaluation modes: answer-only 5.18%, search-only 8.81%, with full context 71.50%, a 13.8x performance gap.
- Scaling analysis: raising round limits from 5 to 10 to 20 yields minimal improvement; models underutilize interaction opportunities.
- Forced interaction: requiring a minimum of 2-10 questions lifts GPT-5 from 20% to 40%, with dramatic gains when interaction is mandatory.
- Longitudinal study (15 months): BrowseComp performance improved 7x while InteractComp stagnated at 6-14%, exposing a critical blind spot.
- Key findings: systematic overconfidence (13.73% accuracy vs. 71.50% with context), not a capability deficit; diverse interaction strategies across models (interaction rates 0.25%-73.95%) with better calibration when models do interact; clean reward signals from search outcomes make the benchmark suitable for RLVR-style training.
- Impact: while search performance improved 7x, interaction capabilities stagnated; InteractComp provides a foundation for training uncertainty-aware, interactive agents.
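The forced-interaction setting described above (the agent must ask a minimum number of clarification questions before any answer is accepted) can be sketched as a simple agent loop. This is a minimal illustration only: `run_forced_interaction`, `agent_step`, and `toy_policy` are hypothetical names, not the benchmark's actual harness.

```python
# Sketch of forced interaction: block "answer" actions until the agent has
# asked at least `min_questions` clarification questions.

def run_forced_interaction(agent_step, min_questions=2, max_rounds=20):
    """Run an agent loop; premature answers are rejected and cost a round."""
    questions_asked = 0
    for _ in range(max_rounds):
        action, payload = agent_step(questions_asked)
        if action == "interact":
            questions_asked += 1      # ask the user, receive a clarification
        elif action == "answer":
            if questions_asked >= min_questions:
                return payload        # answer allowed only after enough questions
            # otherwise the answer is rejected; the agent must try again
    return None                       # round budget exhausted

# Toy policy: ask clarification questions until the minimum is met, then answer.
def toy_policy(questions_asked):
    if questions_asked < 2:
        return ("interact", "Which of the two candidate entities do you mean?")
    return ("answer", "final answer")

print(run_forced_interaction(toy_policy, min_questions=2))
```

An overconfident policy that answers immediately would have its answers rejected every round and exhaust the budget, which mirrors the paper's point that models underutilize interaction unless forced.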
Q1
1. What is the most striking finding about model performance in the InteractComp benchmark?
Models failed completely with 0% accuracy on all tasks
Models achieved high accuracy (71.50%) with complete context but only 13.73% with interaction available
Models performed equally well with or without interaction capabilities
Q2
2. How does the InteractComp benchmark create genuinely ambiguous questions?
By using random word generators to create confusing queries
By intentionally using grammatically incorrect sentences
By pairing a lesser-known target with a popular distractor that shares overlapping attributes
Q3
3. What surprising trend was revealed in the 15-month longitudinal study of model development?
Both search and interaction capabilities improved dramatically
While BrowseComp performance improved seven-fold, InteractComp performance remained stagnant
All model capabilities decreased over time

Paper 2

Tongyi DeepResearch Technical Report

Published: 2025-10-28

Link: http://arxiv.org/pdf/2510.24701

1. 📘 Topic and Domain: The paper presents Tongyi DeepResearch, an open-source large language model designed specifically for autonomous deep information-seeking research tasks.
2. 💡 Previous Research and New Ideas: Based on previous work in LLMs and agent systems, it introduces a novel end-to-end agentic training framework combining mid-training and post-training phases, along with automated data synthesis and customized environments.
3. ❓ Problem: The paper aims to develop an efficient, open-source AI research agent capable of conducting complex, multi-step reasoning and information-seeking tasks that would typically take humans several hours.
4. 🛠️ Methods: The authors used a combination of agentic mid-training, post-training, automated data synthesis pipeline, and stage-specific environments, built on a 30.5B parameter model that activates only 3.3B parameters per token.
5. 📊 Results and Evaluation: The model achieved state-of-the-art performance across multiple benchmarks, including 32.9 on Humanity's Last Exam, 43.4 on BrowseComp, 72.2 on WebWalkerQA, and others, outperforming both open-source and proprietary systems.
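The 30.5B-total / 3.3B-active parameter split comes from sparse mixture-of-experts routing: a router selects a few experts per token, so only their weights participate in that token's forward pass. A toy sketch of the idea, with illustrative sizes (8 tiny experts, top-1 routing) that are not the model's real configuration:

```python
import numpy as np

# Toy mixture-of-experts layer: total parameter count spans all experts,
# but each token only "activates" the experts its router selects.
rng = np.random.default_rng(0)
n_experts, k = 8, 1                # 8 experts, route each token to its top 1
d = 4                              # toy hidden size
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router = rng.normal(size=(d, n_experts))

def moe_forward(x):
    """Route token x to its top-k experts; only those weights are active."""
    scores = x @ router
    top = np.argsort(scores)[-k:]              # indices of chosen experts
    out = sum(x @ experts[i] for i in top) / k
    active = k * d * d                          # expert params touched this token
    total = n_experts * d * d                   # expert params in the whole layer
    return out, active, total

x = rng.normal(size=d)
out, active, total = moe_forward(x)
print(f"active/total expert params: {active}/{total}")  # 16/128 in this toy
```

The same ratio logic, scaled up, is how a 30.5B model can run with roughly 3.3B parameters active per token.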

Figure: Tongyi DeepResearch training pipeline (reconstructed from the report's diagram)

- Base model: Qwen3-30B-A3B.
- Agentic mid-training: Agentic CPT stage 1 (32K context) and stage 2 (128K context); large-scale agent-behavior data covering question planning, reasoning, and decision-making; environment scaling and function-calling.
- Agentic post-training: high-quality data synthesis (graph construction, uncertainty injection); supervised fine-tuning in ReAct mode and context-management mode; agentic reinforcement learning in a simulated environment (Wikipedia RAG) and a real-world environment (search, visit, etc.) with integrated tools (Search, Visit, Python, Scholar, Parser), using the GRPO algorithm with on-policy RL and dynamic data curation; model merging via parameter interpolation across multiple variants.
- Result: Tongyi DeepResearch, 30.5B total / 3.3B active parameters, state-of-the-art performance.
- Context management: Markovian state reconstruction with dynamic summary updates, S(t) = π(S(t-1), a(t), o(t)), preventing context overflow in long-horizon tasks.
- Design principles: synthetic-data-centric scaling, environment-interaction learning, end-to-end agent training, automated data generation.
- Benchmark results: HLE 32.9, BrowseComp 43.4, GAIA 70.9, WebWalkerQA 72.2, xbench 75.0, FRAMES 90.6.
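The context-management rule above, S(t) = π(S(t-1), a(t), o(t)), can be sketched as a loop that rewrites a bounded summary state each step instead of appending every action and observation to an ever-growing prompt. A minimal sketch, assuming a placeholder `summarize` function in place of the model's actual learned state reconstruction:

```python
# Sketch of Markovian context management: the agent carries a bounded summary
# state S that is reconstructed each step from (previous state, action,
# observation), so the prompt never grows with the horizon.

MAX_STATE_CHARS = 200  # illustrative context budget

def summarize(prev_state, action, observation):
    """Placeholder for the LLM's state-reconstruction call π(...)."""
    merged = f"{prev_state} | did {action}, saw {observation}"
    # Keep only the most recent characters: a crude stand-in for the
    # model's learned compression of what still matters.
    return merged[-MAX_STATE_CHARS:]

def rollout(steps):
    state = "task: find the report's benchmark scores"
    for action, observation in steps:
        state = summarize(state, action, observation)  # S(t) = π(S(t-1), a(t), o(t))
        assert len(state) <= MAX_STATE_CHARS           # context never overflows
    return state

steps = [("search('HLE score')", "HLE: 32.9"),
         ("search('BrowseComp')", "BrowseComp: 43.4"),
         ("visit(arxiv)", "confirmed results table")]
final = rollout(steps)
print(len(final) <= MAX_STATE_CHARS)
```

The key property is that state size is bounded regardless of how many tool calls the rollout makes, which is what the report credits for stable long-horizon behavior.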
Q1
1. What is the most innovative aspect of Tongyi DeepResearch's training approach?
Using only post-training phase like other models
Combining agentic mid-training and post-training phases
Relying solely on human-annotated training data
Q2
2. How many total parameters does Tongyi DeepResearch have, and how many are actually activated per token?
30.5B total, with 30.5B activated per token
3.3B total, with 3.3B activated per token
30.5B total, with 3.3B activated per token
Q3
3. What unique feature of the model's data synthesis pipeline makes it more efficient than traditional approaches?
It requires extensive human annotation
It only works with small datasets
It is fully automated and requires no human annotation

Paper 3

Uniform Discrete Diffusion with Metric Path for Video Generation

Published: 2025-10-28

Link: http://arxiv.org/pdf/2510.24717

1. 📘 Topic and Domain: Video generation using discrete diffusion models in the computer vision and deep learning domain.
2. 💡 Previous Research and New Ideas: Based on continuous diffusion models and discrete tokenization approaches, proposes a novel framework called URSA that bridges discrete and continuous methods through iterative global refinement of discrete tokens.
3. ❓ Problem: Addresses the gap between discrete and continuous video generation approaches, particularly the challenges of error accumulation and long-context inconsistency in discrete methods.
4. 🛠️ Methods: Introduces Uniform discRete diffuSion with metric pAth (URSA) featuring a Linearized Metric Path and Resolution-dependent Timestep Shifting mechanism, along with asynchronous temporal fine-tuning for multi-task capabilities.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance with a text-to-video score of 82.4 on VBench, image-to-video score of 86.2, and text-to-image score of 86.0 on DPG-Bench, demonstrating competitive results against both discrete and continuous approaches.

Figure: URSA framework overview (reconstructed from the paper's diagram)

- Inputs: video/image input with text-prompt conditioning; discrete tokenization via Cosmos/IBQ tokenizers.
- Linearized Metric Path: p_t(x | x_1) = softmax(-β_t d(x, x_1)) with β_t = c · (t/(1-t))^α, controlling perturbation linearly in t.
- Resolution-dependent Timestep Shifting: t̃ = t/(t + λ(1-t)); λ > 1 gives stronger perturbation, λ < 1 more gradual, adapting the schedule to resolution.
- Asynchronous timestep scheduling: t_i ~ U(0,1) sampled independently for each frame, enabling multi-task learning through frame-wise independence.
- Iterative global refinement: from categorical noise x_0 ~ Unif([K]) through intermediate states x_t and refined states x_{t+1} to clean data x_1.
- Training: cross-entropy loss on predicted tokens, L = E[-log p(x_1 | x_t, e)], on an LLM backbone (Qwen3 architecture).
- Sampling: Euler solver for the velocity field with 25-50 iterative refinement steps.
- Results: text-to-video VBench 82.4; image-to-video VBench++ 86.2; text-to-image DPG-Bench 86.0; long-video generation of 40s and beyond.
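The two formulas above can be made concrete numerically. The sketch below evaluates the metric-path distribution p_t(x | x_1) = softmax(-β_t d(x, x_1)) and the timestep shift t̃ = t/(t + λ(1-t)); the values of c, α, λ and the index-distance metric d are illustrative choices, not the paper's tuned settings.

```python
import numpy as np

def metric_path(x1, K, t, c=1.0, alpha=1.0):
    """p_t(x | x_1) = softmax(-beta_t * d(x, x_1)) over a vocabulary of K tokens."""
    beta_t = c * (t / (1.0 - t)) ** alpha        # beta -> 0 as t -> 0, -> inf as t -> 1
    d = np.abs(np.arange(K) - x1).astype(float)  # toy metric: token-index distance
    logits = -beta_t * d
    p = np.exp(logits - logits.max())            # numerically stable softmax
    return p / p.sum()

def shift_timestep(t, lam):
    """t_tilde = t / (t + lam * (1 - t)); lam > 1 strengthens perturbation."""
    return t / (t + lam * (1.0 - t))

# Near t = 0 the path is close to uniform noise; near t = 1 it concentrates
# on the clean token x_1, which is the interpolation the figure describes.
p_early = metric_path(x1=5, K=10, t=0.01)
p_late = metric_path(x1=5, K=10, t=0.99)
print(p_early.max(), p_late.argmax())
```

Sweeping t from 0 to 1 traces the full noise-to-data path, and applying `shift_timestep` before the sweep reshapes where along that path the model spends its refinement steps.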
Q1
1. What is the main innovation of URSA that helps bridge the gap between discrete and continuous approaches?
The use of a large language model architecture
Iterative global refinement of discrete tokens
Increased model parameter count
Q2
2. Which feature allows URSA to handle multiple video generation tasks within a single model?
Resolution-dependent Timestep Shifting
Linearized Metric Path
Asynchronous temporal fine-tuning
Q3
3. What is the key advantage of URSA's approach compared to traditional discrete methods like autoregressive and masked diffusion models?
It requires more computational resources
It processes tokens sequentially
It allows refinement of already generated tokens