2025-10-02 Papers

Paper 1

GEM: A Gym for Agentic LLMs

Published: 2025-10-01

Link: http://arxiv.org/pdf/2510.01051

1. 📘 Topic and Domain: A framework called GEM (General Experience Maker) for training large language models through reinforcement learning in interactive environments.
2. 💡 Previous Research and New Ideas: Builds on OpenAI Gym, the standard toolkit for traditional reinforcement learning, and proposes a standardized framework designed specifically for training LLMs through experience-based learning rather than from static datasets.
3. ❓ Problem: The lack of standardized environments and tools for training LLMs through reinforcement learning in complex, multi-turn interactive scenarios.
4. 🛠️ Methods: Implemented a REINFORCE variant with Return Batch Normalization (ReBN), and built the framework with diverse environment suites, asynchronous vectorized execution, and flexible wrappers for easy extensibility.
5. 📊 Results and Evaluation: The framework demonstrated successful training across 24 environments, with ReBN consistently improving performance over baseline methods, and showed effective tool integration in math and question-answering tasks.
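The ReBN idea from point 4 can be sketched in a few lines. The following is a minimal, hypothetical reading of Return Batch Normalization, assuming per-turn scalar rewards and normalization pooled across all turns of all episodes in a batch (the paper's exact implementation may differ):

```python
import math

def discounted_returns(rewards, gamma):
    """Per-turn discounted returns G_t = r_t + gamma * G_{t+1},
    computed backwards over one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

def rebn_advantages(batch_rewards, gamma=0.95, eps=1e-8):
    """Return Batch Normalization: pool the per-turn returns of every
    episode in the batch, then normalize them to zero mean / unit std.
    The normalized returns weight the per-turn log-probs in REINFORCE."""
    pooled = [g for ep in batch_rewards for g in discounted_returns(ep, gamma)]
    mean = sum(pooled) / len(pooled)
    std = math.sqrt(sum((g - mean) ** 2 for g in pooled) / len(pooled))
    return [(g - mean) / (std + eps) for g in pooled]
```

Because returns are normalized rather than compared within a response group, this remains well defined with per-turn dense rewards and any discount factor γ < 1.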

GEM: A Gym for Agentic LLMs - Methodology Flow

GEM Framework Architecture
• Standardized environment interface (reset(), step())
• Asynchronous vectorized execution
• Modular wrappers for extensibility
• 7 task categories (math, code, games, QA, etc.)
• 3 tool types (Python, search, MCP)
• Multi-turn and single-turn support

Task Categories: math (with/without images), code generation, language games, QA & ReasoningGym, terminal environment
Tool Integration: Python interpreter, search engine, MCP protocol
Wrapper System: observation wrappers, action wrappers, autoreset mechanism, vectorization support
Execution Features: parallel environments, async tool calls, batch collection, high throughput

Reinforcement Learning Algorithms
• Action formulations: single token as action; response as action; whole interaction as action
• REINFORCE + ReBN (proposed): Return Batch Normalization, compatible with γ < 1, per-turn dense rewards
• Baseline methods: GRPO, PPO, vanilla REINFORCE

Empirical Studies & Evaluation
• Algorithm benchmarking: 8 environments, 4 algorithms
• Discount factor analysis: impact of γ; binary-search strategy discovery
• Tool integration effects: math + Python, QA + search
• Framework integration: 5 RL frameworks, single-file scripts
• Agent evaluation: MCP integration, terminal tasks, tests with strong LLMs

Key innovation: multi-turn RL with fine-grained rewards
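GEM environments expose the familiar Gym-style reset()/step() interface. To give a flavor of a multi-turn text environment, here is a hypothetical, self-contained GuessTheNumber-style environment; the class name echoes the paper's experiment, but the reward values and wording are illustrative, not taken from the paper:

```python
import random

class GuessTheNumberEnv:
    """Multi-turn text environment: the agent guesses a hidden integer
    and receives textual feedback plus a reward each turn."""

    def __init__(self, low=1, high=64, max_turns=10, seed=None):
        self.low, self.high, self.max_turns = low, high, max_turns
        self.rng = random.Random(seed)

    def reset(self):
        """Start a new episode and return the initial observation."""
        self.target = self.rng.randint(self.low, self.high)
        self.turns = 0
        return f"Guess a number between {self.low} and {self.high}."

    def step(self, action: str):
        """Consume the agent's text action; return (obs, reward, done)."""
        self.turns += 1
        try:
            guess = int(action.strip())
        except ValueError:
            return "Please reply with an integer.", -0.1, False
        if guess == self.target:
            return "Correct!", 1.0, True
        if self.turns >= self.max_turns:
            return "Out of turns.", -1.0, True
        hint = "higher" if guess < self.target else "lower"
        return f"Try {hint}.", -0.1, False
```

Under a small per-turn penalty or a discount factor γ < 1, a policy that finishes in fewer turns (i.e. binary search) earns a higher return, which is consistent with the discount-factor finding summarized above.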
Q1
1. What key innovation does GEM introduce compared to traditional LLM training approaches?
a) Using larger datasets for training
b) Experience-based learning through interactive environments
c) Faster computation through parallel processing
Q2
2. What is the main advantage of Return Batch Normalization (ReBN) in the paper?
a) It reduces training time significantly
b) It enables multi-GPU training
c) It allows for fine-grained per-turn rewards and arbitrary discount factors
Q3
3. In the GuessTheNumber environment experiment, what effect did a lower discount factor (γ) have?
a) It led to slower but more accurate solutions
b) It encouraged finding solutions with fewer turns
c) It had no significant impact on performance

Paper 2

MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

Published: 2025-09-28

Link: http://arxiv.org/pdf/2509.24002

1. 📘 Topic and Domain: A benchmark called MCPMark for evaluating how well large language models can use the Model Context Protocol (MCP) to interact with external systems and tools in realistic scenarios.
2. 💡 Previous Research and New Ideas: Builds on existing MCP benchmarks, which focus on simple, read-heavy tasks; proposes a more comprehensive benchmark with complex multi-step workflows and diverse CRUD operations across multiple environments.
3. ❓ Problem: Existing MCP benchmarks are too narrow in scope and fail to capture the complexity of real-world workflows, making it difficult to properly evaluate models' capabilities in realistic scenarios.
4. 🛠️ Methods: Created 127 tasks across 5 MCP environments (Notion, GitHub, Filesystem, PostgreSQL, Playwright) through a human-AI collaborative pipeline, with programmatic verification scripts and full state tracking.
5. 📊 Results and Evaluation: The best model (gpt-5-medium) achieved only 52.56% pass@1 and 33.86% pass^4, with most models performing below 30% pass@1, demonstrating the benchmark's challenging nature and revealing significant gaps in current model capabilities.
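The pass@1 / pass^4 numbers above follow the usual k-attempt conventions. A minimal sketch, assuming each task has a list of boolean attempt outcomes (the benchmark's exact estimator may differ):

```python
def pass_at_k(results):
    """pass@k: a task counts as solved if ANY of its k attempts succeeds.
    results: one list of boolean attempt outcomes per task."""
    return sum(any(task) for task in results) / len(results)

def pass_pow_k(results):
    """pass^k: a task counts only if ALL k attempts succeed; a stricter
    consistency measure, so pass^k <= pass@k on the same runs."""
    return sum(all(task) for task in results) / len(results)
```

This is why pass^4 (33.86%) sits well below pass@1 (52.56%) for the same model: pass^k penalizes any run-to-run inconsistency.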

MCPMark: Workflow Methodology

Task Creation Pipeline
• Stages: I. Explore → II. Evolve → III. Verify → IV. Iterate
• Human experts: domain knowledge, quality control
• AI agents: task creation, task execution

MCP Environment Setup
• Environments: Notion, GitHub, Filesystem, PostgreSQL, Playwright
• 38 curated initial states: realistic templates and scenarios, containerized environments

Task Specifications
• 127 tasks total, with natural-language instructions
• Programmatic verification scripts
• CRUD-diverse operations, multi-step workflows

Evaluation Framework
• MCPMark-Agent: minimal framework, tool-calling loop
• State management: sandboxed execution, full state tracking
• Verification: programmatic scripts, automatic evaluation
• Metrics: pass@1, pass@4, pass^4

Model Testing & Analysis
• Proprietary models: GPT-5, Claude-4, o3
• Open-source models: Qwen3, DeepSeek, GLM
• Analysis focus: reasoning-effort impact, failure-pattern analysis, environment-specific performance, tool-usage efficiency

Key Findings
• Best model: GPT-5-medium (52.56% pass@1, 33.86% pass^4)
• Average complexity: 16.2 turns, 17.4 tool calls
• Challenges: local vs. remote services, robustness gaps, efficiency requirements
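The "minimal framework, tool-calling loop" of an MCPMark-style agent can be sketched as below. Everything here is a hypothetical stand-in (the model_call and tools arguments are placeholders, not the paper's code); the point is the shape of the loop, not its contents:

```python
def agent_loop(model_call, tools, task, max_turns=30):
    """Repeatedly ask the model for its next action; execute any tool it
    requests against the (sandboxed) environment; stop when it says done.
    Returns the full interaction history for later state verification."""
    history = [("task", task)]
    for _ in range(max_turns):
        action = model_call(history)      # e.g. {"tool": ..., "args": ...} or {"done": True}
        if action.get("done"):
            return history
        name, args = action["tool"], action.get("args", {})
        result = tools[name](**args)      # execute against the sandbox
        history.append((name, result))
    return history
```

Success is then decided not by the transcript but by running the task's programmatic verification script against the final environment state, as described above.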
Q1
1. What was the key innovation in MCPMark's task creation process?
a) Using only AI agents to create tasks autonomously
b) A human-AI collaborative pipeline with expert validation
c) Random generation of tasks from existing databases
Q2
2. Why did models generally perform better on local service tasks compared to remote services?
a) Local services had simpler verification scripts
b) Remote services had stricter security protocols
c) Local services were easier to simulate and had better training data availability
Q3
3. What surprising finding emerged about the relationship between model performance and tool calls?
a) More tool calls always led to better performance
b) Successful models used fewer, better-targeted tool calls
c) Tool calls had no impact on task success rates

Paper 3

Code2Video: A Code-centric Paradigm for Educational Video Generation

Published: 2025-10-01

Link: http://arxiv.org/pdf/2510.01174

1. 📘 Topic and Domain: Educational video generation using a code-centric approach, focusing on creating high-quality instructional videos through executable Python code.
2. 💡 Previous Research and New Ideas: Builds on recent advances in video generation and coding agents, proposing a code-centric paradigm instead of pixel-based generation for better control and interpretability.
3. ❓ Problem: Current generative models struggle to produce professional educational videos that require disciplinary knowledge, precise visual structures, and coherent transitions.
4. 🛠️ Methods: Developed Code2Video framework with three collaborative agents: Planner (structures content), Coder (converts instructions to executable code), and Critic (refines spatial layout using vision-language models).
5. 📊 Results and Evaluation: Achieved a 40% improvement over direct code generation, evaluated on the MMMC benchmark across dimensions including VLM-as-a-Judge aesthetic scores, code efficiency, and the TeachQuiz knowledge-transfer metric.
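TeachQuiz, mentioned in point 5, scores knowledge transfer as the quiz-accuracy gain a judge model shows after watching the video, relative to its "unlearned" baseline. A minimal sketch of the score arithmetic, with quiz accuracies as hypothetical inputs in [0, 1]:

```python
def teachquiz_score(after_video, after_unlearn):
    """TeachQuiz-style transfer score (sketch): quiz accuracy after the
    judge model watches the video, minus its accuracy right after the
    topic has been 'unlearned'. Higher means the video taught more."""
    return after_video - after_unlearn

def benchmark_teachquiz(per_topic_scores):
    """Average the per-topic transfer scores across a benchmark.
    per_topic_scores: list of (after_video, after_unlearn) pairs."""
    gains = [teachquiz_score(v, u) for v, u in per_topic_scores]
    return sum(gains) / len(gains)
```

Subtracting the unlearned baseline isolates what the video itself contributed, rather than what the judge model already knew.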

Code2Video: Methodology Flow

Input: learning topic query

Stage 1: Planner (temporal coherence)
• Outline generation, storyboard construction
• External database: reference images, visual assets, asset cache

Stage 2: Coder (bridge from lecture to code)
• Parallel code generation, auto-debugging
• Scope Refine (SR): progressive repair at line, block, and global scope

Stage 3: Critic (spatial refinement)
• Visual anchor prompt (6×6 grid), VideoLLM feedback, layout optimization

Output: educational video and executable code, with temporal coherence and spatial clarity

MMMC Benchmark
• Efficiency: time, tokens
• Aesthetics: VLM-as-a-Judge across 5 dimensions
• TeachQuiz (knowledge transfer): the VLM is first made to unlearn the topic, then quizzed after watching the video; Score = S(V) - S(V | unlearn)

Key features of the code-centric paradigm
• Scalable: modular integration, external assets
• Interpretable: explicit scripting, auditable decisions
• Controllable: precise timing, spatial organization
• Reproducible: deterministic, extensible code
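The three-stage Planner → Coder → Critic pipeline can be sketched as a simple composition. All function bodies below are hypothetical stand-ins for the paper's agents (which are LLM-driven and target Manim-style rendering code); only the data flow mirrors the described architecture:

```python
from dataclasses import dataclass

@dataclass
class Scene:
    narration: str
    code: str  # executable rendering code for this scene

def planner(topic: str) -> list:
    """Stage 1: break the topic into a temporally coherent storyboard."""
    return [f"{topic}: definition", f"{topic}: worked example", f"{topic}: summary"]

def coder(outline: str) -> Scene:
    """Stage 2: turn one scene outline into executable rendering code
    (the real agent also auto-debugs via Scope Refine)."""
    return Scene(narration=outline, code=f"render_text({outline!r})")

def critic(scene: Scene) -> Scene:
    """Stage 3: refine spatial layout; here just a placeholder anchor hint
    standing in for VideoLLM-guided layout optimization."""
    scene.code += "  # anchored to a 6x6 layout grid"
    return scene

def code2video(topic: str) -> list:
    """Compose the three agents over every storyboard scene."""
    return [critic(coder(outline)) for outline in planner(topic)]
```

Because the output is code rather than pixels, every scene remains auditable and re-renderable, which is the reproducibility claim in the feature list above.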
Q1
1. What is the main advantage of using a code-centric approach over pixel-based video generation for educational content?
a) Faster rendering speed and lower computational requirements
b) Better control over temporal sequencing and spatial organization
c) Higher resolution output and better video quality
Q2
2. Which component of Code2Video is responsible for ensuring visual elements are properly arranged without overlapping?
a) The Planner agent
b) The Coder agent
c) The Critic agent
Q3
3. How does the TeachQuiz evaluation metric work?
a) By measuring the video's visual quality and aesthetic appeal
b) By comparing the generated video duration with human-made videos
c) By testing knowledge transfer through unlearning and relearning cycles