1. 📘 Topic and Domain: A benchmark for evaluating language agents' ability to use tools and execute complex, real-world tasks across multiple software applications.
2. 💡 Previous Research and New Ideas: Previous benchmarks focused on narrow domains or simplified tasks; this paper proposes a more comprehensive benchmark with diverse applications, realistic environments, and complex multi-step workflows.
3. ❓ Problem: Existing language-agent benchmarks lack the diversity, realism, and long-horizon complexity needed to evaluate real-world performance.
4. 🛠️ Methods: Introduced the Tool Decathlon (TOOLATHLON), a benchmark spanning 32 software applications and 604 tools across 108 tasks that require multi-step execution; each task starts from a realistic environment state and is graded by a verifiable evaluation script (see the checker sketch after this list).
5. 📊 Results and Evaluation: The best model (Claude-4.5-Sonnet) achieved only a 38.6% success rate, averaging 20.2 tool-calling turns per task, while the top open-source model (DeepSeek-V3.2-Exp) reached 20.1%, leaving substantial room for improvement (a metric-aggregation sketch follows below).
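
To make the evaluation design concrete, here is a minimal sketch of what a state-based verification script could look like. The task, file layout, database schema, and helper name (`verify_task`) are illustrative assumptions, not TOOLATHLON's actual code; the source only establishes that each task's final environment state is checked programmatically.

```python
import sqlite3
from pathlib import Path

# Hypothetical checker for a task like "export all overdue invoices to CSV".
# Paths, schema, and pass criteria are assumptions for illustration;
# the benchmark's real scripts verify each task's final environment state.

def verify_task(workdir: Path) -> bool:
    """Return True iff the agent left the environment in the goal state."""
    export = workdir / "overdue_invoices.csv"
    if not export.exists():  # the required artifact must exist at all
        return False

    # Ground truth comes from the task's seeded database state.
    with sqlite3.connect(workdir / "billing.db") as db:
        expected = {
            row[0]
            for row in db.execute(
                "SELECT invoice_id FROM invoices WHERE status = 'overdue'"
            )
        }

    # The export must list exactly the overdue invoice IDs (header skipped).
    lines = export.read_text().splitlines()[1:]
    exported = {line.split(",")[0] for line in lines if line.strip()}
    return exported == expected

if __name__ == "__main__":
    print("PASS" if verify_task(Path("/tmp/task_env")) else "FAIL")
```

Checking the resulting state rather than the agent's transcript keeps the grading robust: any sequence of tool calls that reaches the goal passes, and partial or cosmetic progress does not.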
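The headline numbers reduce to simple per-task aggregates. Below is a sketch of that computation, assuming each run yields a record with `success` and `turns` fields; the field names and sample values are made up for illustration.

```python
from statistics import mean

# Assumed per-task log records: one dict per benchmark task.
results = [
    {"task": "t1", "success": True,  "turns": 18},
    {"task": "t2", "success": False, "turns": 25},
    {"task": "t3", "success": True,  "turns": 17},
]

success_rate = mean(r["success"] for r in results) * 100  # % of tasks passed
avg_turns = mean(r["turns"] for r in results)             # mean tool-calling turns

print(f"success rate: {success_rate:.1f}%, avg turns: {avg_turns:.1f}")
```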