2025-09-17 Papers


Paper 1

Scaling Agents via Continual Pre-training

Published: 2025-09-16

Link: http://arxiv.org/pdf/2509.13310

1. 📘 Topic and Domain: The paper focuses on scaling language models into agentic systems through continual pre-training in the domain of AI/ML, specifically addressing deep research agents capable of autonomous tool use and complex problem-solving.
2. 💡 Previous Research and New Ideas: Prior work relied on traditional post-training approaches (SFT and RL) to turn language models into agents; this paper proposes a novel Agentic Continual Pre-training (Agentic CPT) framework as an intermediate stage between pre-training and post-training.
3. ❓ Problem: The paper aims to solve the limitation of post-training approaches that force models to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, creating optimization tensions.
4. 🛠️ Methods: The authors developed AgentFounder using First-order Action Synthesis (FAS) and Higher-order Action Synthesis (HAS) for data generation, implemented through a two-stage training strategy with progressive context window expansion (32K to 128K).
5. 📊 Results and Evaluation: AgentFounder-30B achieved state-of-the-art performance across 10 benchmarks, notably scoring 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE, outperforming both open-source and some commercial models.
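To make the FAS idea concrete, here is a minimal sketch of first-order action synthesis: one memory item becomes a question paired with a single planning action and a single reasoning action. All record fields and string formats are hypothetical illustrations, not the paper's actual data schema.

```python
from dataclasses import dataclass

@dataclass
class FASRecord:
    question: str   # synthesized question anchored on an entity
    planning: str   # problem decomposition + tool-invocation prediction
    reasoning: str  # step-by-step reasoning text

def synthesize_fas(entity: str, fact: str) -> FASRecord:
    # Turn one (entity, fact) item from an open-world memory into a
    # training record with first-order planning and reasoning actions.
    question = f"What is known about {entity}?"
    planning = f"decompose: identify '{entity}'; predict tool call: search('{entity}')"
    reasoning = f"Step 1: recall the stored note. Step 2: it states that {fact}."
    return FASRecord(question, planning, reasoning)
```

HAS would then expand such single-step records into multi-option, multi-step decision processes.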

Scaling Agents via Continual Pre-training

AgentFounder: Scaling Agents via Continual Pre-training

Data Collection
• Sources: web-crawled data, tool invocation records, Wikipedia data, discarded trajectories, search results
• Knowledge-to-question transformation: entity-anchored open-world memory, multi-style question synthesis

First-order Action Synthesis (FAS)
• Planning actions: problem decomposition, tool invocation prediction
• Reasoning actions: step-by-step reasoning

Higher-order Action Synthesis (HAS)
• Step-level scaling: multi-option generation, action-space exploration
• Decision synthesis: multi-step decision processes

Two-Stage Training
• Stage 1: 200B tokens, 32K context, FAS + short HAS
• Stage 2: 100B tokens, 128K context, high-quality HAS
• Output: AgentFounder-30B base model

Post-training
• SFT-A: two-stage ReAct • SFT-B: mixed training • SFT-C: summarized reasoning • General + agent data
• Final model: AgentFounder-30B deep research agent

Performance Results
• BrowseComp-en: 39.9% (SOTA among open-source models)
• BrowseComp-zh: 43.3%
• GAIA: 72.8% (highest single-agent accuracy)
• HLE: 31.5% Pass@1 (first open-source model above 30%)
• Xbench-DeepSearch: 73.0%

Scaling Properties
• Logarithmic scaling law holds, with consistent improvements up to 315B tokens
• Superior scaling efficiency vs. larger models
• Maintains general tool-use capabilities; adaptable to different post-training methods

Key Innovation: Agentic Continual Pre-training (Agentic CPT)
• Problem solved: traditional post-training forces models to learn capabilities and alignment simultaneously, creating optimization tensions and limiting performance
• Solution: Agentic CPT creates pre-aligned foundation models with inherent agentic behaviors before post-training
• Impact: enables effective downstream fine-tuning; reduces post-training complexity and improves convergence
• Scalability: offline synthesis without API costs; systematic and scalable data generation
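The two-stage training recipe can be written down as a simple schedule. The token and context figures follow the summary; the data structure itself is a hypothetical sketch, not the paper's training configuration.

```python
# Hypothetical schedule mirroring the two-stage recipe; the paper's actual
# batch/token bookkeeping may differ.
STAGES = [
    {"name": "stage1", "tokens": 200_000_000_000, "context": 32 * 1024,
     "data": ("FAS", "short HAS")},
    {"name": "stage2", "tokens": 100_000_000_000, "context": 128 * 1024,
     "data": ("high-quality HAS",)},
]

def total_tokens(stages) -> int:
    # Total continual pre-training budget across all stages.
    return sum(s["tokens"] for s in stages)
```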
Q1
1. What is the main innovation that AgentFounder introduces to address the limitations of traditional post-training approaches?
A larger context window of 128K tokens
Agentic Continual Pre-training (Agentic CPT) as an intermediate step
More efficient supervised fine-tuning methods
Q2
2. In the two-stage data synthesis approach of AgentFounder, what is the main advantage of Higher-order Action Synthesis (HAS) over First-order Action Synthesis (FAS)?
It requires fewer computational resources
It generates shorter training sequences
It transforms trajectories into multi-step decision-making processes with expanded exploration paths
Q3
3. What performance breakthrough did AgentFounder-30B achieve in the HLE benchmark?
It became the first open-source model to surpass the 30-point threshold with 31.5%
It matched the performance of commercial deep research products
It achieved perfect accuracy on academic questions

Paper 2

Towards General Agentic Intelligence via Environment Scaling

Published: 2025-09-16

Link: http://arxiv.org/pdf/2509.13311

1. 📘 Topic and Domain: The paper focuses on developing general agentic intelligence for Large Language Models through environment scaling and tool-learning capabilities.
2. 💡 Previous Research and New Ideas: Prior research used real-world APIs, LLM simulations, or manual environment construction for tool learning, while this paper proposes automatic environment construction and a two-phase agent training strategy.
3. ❓ Problem: The paper addresses the challenge of scaling up environments for training language agents' function-calling capabilities and effectively training agents in these environments.
4. 🛠️ Methods: The authors develop a scalable framework that automatically constructs heterogeneous environments through tool graph modeling and programmatic materialization, combined with a two-phase agent training approach (general foundation learning followed by domain specialization).
5. 📊 Results and Evaluation: Their AgentScaler models achieved state-of-the-art performance among open-source models under 1T parameters on τ-bench, τ2-Bench, and ACEBench benchmarks, with AgentScaler-30B-A3B performing comparably to trillion-parameter models.
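A minimal sketch of the tool-graph idea: assume tools are linked when they share parameter names, and use connected components as a crude stand-in for Louvain community detection (the paper's actual clustering is more sophisticated; all tool names here are made up).

```python
from collections import defaultdict

def build_tool_graph(tools):
    # tools: {tool_name: set(parameter_names)}; edge when two tools share
    # at least one parameter, a crude dependency signal.
    graph = defaultdict(set)
    names = list(tools)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if tools[a] & tools[b]:
                graph[a].add(b)
                graph[b].add(a)
    return graph

def partition_domains(tools):
    # Connected components as a stand-in for Louvain community detection:
    # each component becomes one "domain" of related tools.
    graph = build_tool_graph(tools)
    seen, domains = set(), []
    for t in tools:
        if t in seen:
            continue
        stack, comp = [t], set()
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            comp.add(n)
            stack.extend(graph[n] - seen)
        domains.append(comp)
    return domains
```

On a toy travel API set, tools that chain through shared parameters land in one domain while unrelated tools form their own.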

Towards General Agentic Intelligence via Environment Scaling

AgentScaler: Environment Scaling for General Agentic Intelligence

Phase 1: Environment Construction & Scaling
• Scenario collection: 30,000+ APIs (ToolBench, API-Gen)
• Tool dependency graph modeling: Louvain clustering (community detection) partitions tools into 1,000+ domains
• Function schema materialization: database operations
• Agentic task construction: tool sequences

Phase 2: Agent Experience Learning
• Experience collection via human-agent interplay
• Trajectory filtering: 3-stage funnel (validity + state + match)
• Stage 1 training: general domains, foundational skills
• Stage 2 training: vertical domains, specialization
• Models: AgentScaler 4B, 8B, 30B-A3B

Evaluation & Results
• Benchmarks: τ-bench (retail & airline), τ2-Bench (multi-domain), ACEBench (multi-category)
• State-of-the-art performance; strong cross-domain generalization; stability and consistency (Pass@k analysis)

Core Design Principles
• Function calls as read-write operations on a database D
• Tools grouped by domains with shared database schemas
• Verifiable environments through state tracking
• Two-stage learning: foundation → specialization
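The "function calls as read-write operations on a database D" principle can be sketched as follows; the toy retail tools and schema are hypothetical, but the point carries over: because every call reads or mutates shared state, a trajectory can be verified by checking the final database state rather than the text of the interaction.

```python
def call(db: dict, tool: str, **kwargs):
    # Hypothetical retail-domain tools operating on a shared database.
    if tool == "get_order":                        # read-only operation
        return db["orders"][kwargs["order_id"]]["status"]
    if tool == "cancel_order":                     # write operation
        db["orders"][kwargs["order_id"]]["status"] = "cancelled"
        return "ok"
    raise ValueError(f"unknown tool: {tool}")

def verify(db: dict, expected: dict) -> bool:
    # State-based verification: compare the final database state to the
    # expected end state instead of inspecting the trajectory itself.
    return db == expected
```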
Q1
1. What is the main limitation of the AgentScaler framework according to the paper?
Inability to handle multiple languages
Lack of reinforcement learning integration
Poor performance on simple tasks
Q2
2. In the two-phase agent training strategy, what is the focus of the first phase?
Domain-specific specialization
Fundamental tool usage skills across general domains
Real-world API integration
Q3
3. What interesting trend was observed regarding tool-calling complexity in the experimental analysis?
More tool calls led to higher accuracy
Tool call count had no impact on performance
There was a negative correlation between number of tool calls and task accuracy

Paper 3

WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents

Published: 2025-09-16

Link: http://arxiv.org/pdf/2509.13309

1. 📘 Topic and Domain: Development of an advanced AI research agent (WebResearcher) that can autonomously discover and synthesize knowledge from external sources through web search and tool use.
2. 💡 Previous Research and New Ideas: Builds on previous deep-research systems such as OpenAI's Deep Research and Google's Gemini Deep Research, but introduces a novel iterative paradigm in place of the traditional mono-contextual approach to information accumulation.
3. ❓ Problem: Addresses the limitations of current mono-contextual AI research agents that suffer from context suffocation and noise contamination when handling complex, long-horizon research tasks.
4. 🛠️ Methods: Implements IterResearch (an iterative deep-research paradigm), WebFrontier (a data synthesis engine), and a Research-Synthesis Framework using multiple parallel agents, with periodic consolidation of findings into evolving reports.
5. 📊 Results and Evaluation: Achieved state-of-the-art performance across 6 benchmarks, notably scoring 36.7% accuracy on Humanity's Last Exam (surpassing DeepSeek-V3.1's 29.8%) and 51.7% on BrowseComp-en (matching OpenAI's Deep Research system).
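A minimal sketch of the IterResearch loop, with hypothetical `think` and `act` callables standing in for the model and the tool layer: each round sees only the task and the evolving report, not the full interaction history, and findings are consolidated into the report between rounds.

```python
def iter_research(task, rounds, think, act):
    # The report is the agent's only carried-over memory ("cognitive
    # scratchpad"); the raw tool transcript is discarded each round.
    report = ""
    for _ in range(rounds):
        thought = think(task, report)
        action, result = act(thought)
        if action == "final_answer":
            return result
        # Periodic synthesis, radically simplified to an append + strip.
        report = (report + "\n" + result).strip()
    return report
```

The key contrast with a mono-contextual agent is that the context passed to `think` stays bounded by the report's size, not by the length of the whole trajectory.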

WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents

WebResearcher: Methodology Flow

IterResearch Paradigm
• Reformulates deep research as a Markov Decision Process (MDP)
• Think → Report → Action loop: the report serves as a cognitive scratchpad and central memory
• Each action is a tool call or a final answer; periodic synthesis reconstructs a clean workspace

WebFrontier Data Engine
• Scalable data synthesis: seed data generation → tool-augmented complexity escalation → quality control
• Multi-agent collaborative framework: knowledge expansion → abstraction; factual grounding → validation

Training Optimization
• Rejection-sampling fine-tuning; Group Sequence Policy Optimization
• Multi-round trajectory training; data amplification via decomposition; Markov property enforcement

Research-Synthesis Framework (Test-Time Scaling)
• n parallel research agents (Agent 1 … Agent n) run IterResearch concurrently, producing multiple final reports
• A synthesis agent consolidates the reports and generates the final answer

External Tools
• Search: web information retrieval
• Scholar: academic literature search
• Visit: web page content extraction
• Python: code execution & computation

Key Advantages
✓ Prevents context suffocation: maintains a focused workspace
✓ Eliminates noise contamination: periodic synthesis filtering
✓ Enables unbounded reasoning: arbitrary research depth
✓ Superior performance: state-of-the-art results across benchmarks
✓ Test-time scaling: parallel agent exploration
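The Research-Synthesis framework reduces to a map-then-consolidate pattern. This sketch runs the agents sequentially for clarity (the paper's agents run concurrently), and both the agent and synthesis callables are hypothetical stand-ins.

```python
def research_synthesis(task: str, agents, synthesize):
    # Each agent would run a full IterResearch loop on the same task;
    # here they are plain callables returning a final report.
    reports = [agent(task) for agent in agents]   # conceptually concurrent
    # The synthesis agent consolidates the n reports into one answer.
    return synthesize(task, reports)
```

Increasing the number of agents widens the explored search space at test time, at the cost of n independent research runs.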
Q1
1. What is the main limitation of mono-contextual approaches that WebResearcher aims to solve?
High computational costs and slow processing speed
Context suffocation and noise contamination in long-horizon tasks
Inability to access multiple web sources simultaneously
Q2
2. In WebResearcher's Research-Synthesis Framework, what happens when n (number of parallel research agents) increases?
Performance decreases due to conflicting information
No significant change in performance is observed
Performance improves but shows diminishing returns after n>8
Q3
3. On the BrowseComp benchmark, which tools were most frequently used by WebResearcher?
Scholar and Python tools
Search and Visit tools (96% of all tool invocations)
Python and Visit tools