1. 📘 Topic and Domain: A benchmark called MCPMark for evaluating how well large language models can use the Model Context Protocol (MCP) to interact with external systems and tools in realistic scenarios.
2. 💡 Previous Research and New Ideas: Builds on existing MCP benchmarks, which focus on simple, read-heavy tasks; proposes a more comprehensive benchmark with complex multi-step workflows and diverse CRUD operations across multiple environments.
3. ❓ Problem: Existing MCP benchmarks are too narrow in scope and fail to capture the complexity of real-world workflows, making it difficult to properly evaluate models' capabilities in realistic scenarios.
4. 🛠️ Methods: Created 127 tasks across 5 MCP environments (Notion, GitHub, Filesystem, PostgreSQL, Playwright) through a human-AI collaborative pipeline, with programmatic verification scripts and full state tracking.
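To make "programmatic verification" concrete, here is a minimal sketch of what a task verifier might look like for the Filesystem environment. The file name, expected content, and exit-code convention are all hypothetical assumptions for illustration; MCPMark's actual verification scripts are task-specific.

```python
import os
import sys

def verify(workdir: str) -> bool:
    """Hypothetical check: did the agent create `report.txt`
    containing a summary? (Illustrative only, not MCPMark's code.)"""
    path = os.path.join(workdir, "report.txt")
    if not os.path.isfile(path):
        return False
    with open(path) as f:
        return "summary" in f.read().lower()

if __name__ == "__main__":
    # Exit 0 on pass, 1 on fail, so a harness can score the run.
    sys.exit(0 if verify(sys.argv[1] if len(sys.argv) > 1 else ".") else 1)
```

A harness can run such a script against the environment's final state after each agent rollout, which is what makes scoring fully automatic.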
5. 📊 Results and Evaluation: The best model (gpt-5-medium) achieved only 52.56% pass@1 and 33.86% pass^4, with most models performing below 30% pass@1, demonstrating the benchmark's challenging nature and revealing significant gaps in current model capabilities.
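The two metrics above measure different things: pass@1 is the chance a single attempt succeeds, while pass^4 is the chance all four attempts succeed, i.e. a stability measure. A common way to estimate these from n trials with c successes uses the standard unbiased pass@k estimator; the pass^k form below is an assumption based on that same combinatorial style, not necessarily MCPMark's exact implementation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k: probability that at least one of k sampled
    attempts succeeds, given c successes observed in n trials."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k: probability that all k sampled attempts
    succeed, given c successes observed in n trials."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# Hypothetical task with 4 runs and 2 successes:
print(pass_at_k(4, 2, 1))   # 0.5 — one attempt succeeds half the time
print(pass_hat_k(4, 2, 4))  # 0.0 — all four attempts never all pass
```

Because pass^4 requires every attempt to succeed, it is always at most pass@1, which is why the gap between 52.56% and 33.86% for the best model signals inconsistency across runs.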