2025-10-02 Papers

Paper 1

GEM: A Gym for Agentic LLMs

Published: 2025-10-01

Link: http://arxiv.org/pdf/2510.01051

1. 📘 Topic and Domain: A framework called GEM (General Experience Maker) for training large language models through reinforcement learning in interactive environments.
2. 💡 Previous Research and New Ideas: Builds on OpenAI Gym, the standard toolkit for traditional reinforcement learning, and proposes a standardized framework designed specifically for training LLMs through experience-based learning rather than from static datasets.
3. ❓ Problem: The lack of standardized environments and tools for training LLMs through reinforcement learning in complex, multi-turn interactive scenarios.
4. 🛠️ Methods: Implemented a REINFORCE variant with Return Batch Normalization (ReBN), and built the framework with diverse environment suites, asynchronous vectorized execution, and flexible wrappers for easy extensibility.
5. 📊 Results and Evaluation: The framework demonstrated successful training across 24 environments, with ReBN consistently improving performance over baseline methods, and showed effective tool integration in math and question-answering tasks.
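The ReBN idea from point 4 can be sketched in a few lines. The following is a minimal, hypothetical reading of Return Batch Normalization, assuming per-turn scalar rewards and normalization pooled across all turns of all episodes in a batch (the paper's exact implementation may differ):

```python
import math

def discounted_returns(rewards, gamma):
    """Per-turn discounted returns G_t = r_t + gamma * G_{t+1},
    computed backwards over one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

def rebn_advantages(batch_rewards, gamma=0.95, eps=1e-8):
    """Return Batch Normalization: pool the per-turn returns of every
    episode in the batch, then normalize them to zero mean / unit std.
    The normalized returns weight the per-turn log-probs in REINFORCE."""
    pooled = [g for ep in batch_rewards for g in discounted_returns(ep, gamma)]
    mean = sum(pooled) / len(pooled)
    std = math.sqrt(sum((g - mean) ** 2 for g in pooled) / len(pooled))
    return [(g - mean) / (std + eps) for g in pooled]
```

Because returns are normalized rather than compared within a response group, this remains well defined with per-turn dense rewards and any discount factor γ < 1.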

GEM: A Gym for Agentic LLMs - Methodology Flow

GEM Framework Architecture
• Standardized environment interface (reset(), step())
• Asynchronous vectorized execution
• Modular wrappers for extensibility
• 7 task categories (math, code, games, QA, etc.)
• 3 tool types (Python, search, MCP)
• Multi-turn and single-turn support

Task Categories: math (with/without images), code generation, language games, QA & ReasoningGym, terminal environment
Tool Integration: Python interpreter, search engine, MCP protocol
Wrapper System: observation wrappers, action wrappers, autoreset mechanism, vectorization support
Execution Features: parallel environments, async tool calls, batch collection, high throughput

Reinforcement Learning Algorithms
• Action formulations: single token as action; response as action; whole interaction as action
• REINFORCE + ReBN (proposed): Return Batch Normalization, compatible with γ < 1, per-turn dense rewards
• Baseline methods: GRPO, PPO, vanilla REINFORCE

Empirical Studies & Evaluation
• Algorithm benchmarking: 8 environments, 4 algorithms
• Discount factor analysis: impact of γ; binary-search strategy discovery
• Tool integration effects: math + Python, QA + search
• Framework integration: 5 RL frameworks, single-file scripts
• Agent evaluation: MCP integration, terminal tasks, tests with strong LLMs

Key innovation: multi-turn RL with fine-grained rewards
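GEM environments expose the familiar Gym-style reset()/step() interface. To give a flavor of a multi-turn text environment, here is a hypothetical, self-contained GuessTheNumber-style environment; the class name echoes the paper's experiment, but the reward values and wording are illustrative, not taken from the paper:

```python
import random

class GuessTheNumberEnv:
    """Multi-turn text environment: the agent guesses a hidden integer
    and receives textual feedback plus a reward each turn."""

    def __init__(self, low=1, high=64, max_turns=10, seed=None):
        self.low, self.high, self.max_turns = low, high, max_turns
        self.rng = random.Random(seed)

    def reset(self):
        """Start a new episode and return the initial observation."""
        self.target = self.rng.randint(self.low, self.high)
        self.turns = 0
        return f"Guess a number between {self.low} and {self.high}."

    def step(self, action: str):
        """Consume the agent's text action; return (obs, reward, done)."""
        self.turns += 1
        try:
            guess = int(action.strip())
        except ValueError:
            return "Please reply with an integer.", -0.1, False
        if guess == self.target:
            return "Correct!", 1.0, True
        if self.turns >= self.max_turns:
            return "Out of turns.", -1.0, True
        hint = "higher" if guess < self.target else "lower"
        return f"Try {hint}.", -0.1, False
```

Under a small per-turn penalty or a discount factor γ < 1, a policy that finishes in fewer turns (i.e. binary search) earns a higher return, which is consistent with the discount-factor finding summarized above.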
Q1
1. What key innovation does GEM introduce compared to traditional LLM training approaches?
a) Using larger datasets for training
b) Experience-based learning through interactive environments
c) Faster computation through parallel processing
Q2
2. What is the main advantage of Return Batch Normalization (ReBN) in the paper?
a) It reduces training time significantly
b) It enables multi-GPU training
c) It allows for fine-grained per-turn rewards and arbitrary discount factors
Q3
3. In the GuessTheNumber environment experiment, what effect did a lower discount factor (γ) have?
a) It led to slower but more accurate solutions
b) It encouraged finding solutions with fewer turns
c) It had no significant impact on performance

Paper 2

MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

Published: 2025-09-28

Link: http://arxiv.org/pdf/2509.24002

1. 📘 Topic and Domain: A benchmark called MCPMark for evaluating how well large language models can use the Model Context Protocol (MCP) to interact with external systems and tools in realistic scenarios.
2. 💡 Previous Research and New Ideas: Builds on existing MCP benchmarks, which focus on simple, read-heavy tasks; proposes a more comprehensive benchmark with complex multi-step workflows and diverse CRUD operations across multiple environments.
3. ❓ Problem: Existing MCP benchmarks are too narrow in scope and fail to capture the complexity of real-world workflows, making it difficult to properly evaluate models' capabilities in realistic scenarios.
4. 🛠️ Methods: Created 127 tasks across 5 MCP environments (Notion, GitHub, Filesystem, PostgreSQL, Playwright) through a human-AI collaborative pipeline, with programmatic verification scripts and full state tracking.
5. 📊 Results and Evaluation: The best model (gpt-5-medium) achieved only 52.56% pass@1 and 33.86% pass^4, with most models performing below 30% pass@1, demonstrating the benchmark's challenging nature and revealing significant gaps in current model capabilities.
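The pass@1 / pass^4 numbers above follow the usual k-attempt conventions. A minimal sketch, assuming each task has a list of boolean attempt outcomes (the benchmark's exact estimator may differ):

```python
def pass_at_k(results):
    """pass@k: a task counts as solved if ANY of its k attempts succeeds.
    results: one list of boolean attempt outcomes per task."""
    return sum(any(task) for task in results) / len(results)

def pass_pow_k(results):
    """pass^k: a task counts only if ALL k attempts succeed; a stricter
    consistency measure, so pass^k <= pass@k on the same runs."""
    return sum(all(task) for task in results) / len(results)
```

This is why pass^4 (33.86%) sits well below pass@1 (52.56%) for the same model: pass^k penalizes any run-to-run inconsistency.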

MCPMark: Workflow Methodology

Task Creation Pipeline
• Stages: I. Explore → II. Evolve → III. Verify → IV. Iterate
• Human experts: domain knowledge, quality control
• AI agents: task creation, task execution

MCP Environment Setup
• Environments: Notion, GitHub, Filesystem, PostgreSQL, Playwright
• 38 curated initial states: realistic templates and scenarios, containerized environments

Task Specifications
• 127 tasks total, with natural-language instructions
• Programmatic verification scripts
• CRUD-diverse operations, multi-step workflows

Evaluation Framework
• MCPMark-Agent: minimal framework, tool-calling loop
• State management: sandboxed execution, full state tracking
• Verification: programmatic scripts, automatic evaluation
• Metrics: pass@1, pass@4, pass^4

Model Testing & Analysis
• Proprietary models: GPT-5, Claude-4, o3
• Open-source models: Qwen3, DeepSeek, GLM
• Analysis focus: reasoning-effort impact, failure-pattern analysis, environment-specific performance, tool-usage efficiency

Key Findings
• Best model: GPT-5-medium (52.56% pass@1, 33.86% pass^4)
• Average complexity: 16.2 turns, 17.4 tool calls
• Challenges: local vs. remote services, robustness gaps, efficiency requirements
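The "minimal framework, tool-calling loop" of an MCPMark-style agent can be sketched as below. Everything here is a hypothetical stand-in (the model_call and tools arguments are placeholders, not the paper's code); the point is the shape of the loop, not its contents:

```python
def agent_loop(model_call, tools, task, max_turns=30):
    """Repeatedly ask the model for its next action; execute any tool it
    requests against the (sandboxed) environment; stop when it says done.
    Returns the full interaction history for later state verification."""
    history = [("task", task)]
    for _ in range(max_turns):
        action = model_call(history)      # e.g. {"tool": ..., "args": ...} or {"done": True}
        if action.get("done"):
            return history
        name, args = action["tool"], action.get("args", {})
        result = tools[name](**args)      # execute against the sandbox
        history.append((name, result))
    return history
```

Success is then decided not by the transcript but by running the task's programmatic verification script against the final environment state, as described above.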
Q1
1. What was the key innovation in MCPMark's task creation process?
a) Using only AI agents to create tasks autonomously
b) A human-AI collaborative pipeline with expert validation
c) Random generation of tasks from existing databases
Q2
2. Why did models generally perform better on local service tasks compared to remote services?
a) Local services had simpler verification scripts
b) Remote services had stricter security protocols
c) Local services were easier to simulate and had better training data availability
Q3
3. What surprising finding emerged about the relationship between model performance and tool calls?
a) More tool calls always led to better performance
b) Successful models used fewer, better-targeted tool calls
c) Tool calls had no impact on task success rates

Paper 3

Code2Video: A Code-centric Paradigm for Educational Video Generation

Published: 2025-10-01

Link: http://arxiv.org/pdf/2510.01174

1. 📘 Topic and Domain: Educational video generation using a code-centric approach, focusing on creating high-quality instructional videos through executable Python code.
2. 💡 Previous Research and New Ideas: Builds on recent advances in video generation and coding agents, proposing a code-centric paradigm instead of pixel-based generation for better control and interpretability.
3. ❓ Problem: Current generative models struggle to produce professional educational videos that require disciplinary knowledge, precise visual structures, and coherent transitions.
4. 🛠️ Methods: Developed Code2Video framework with three collaborative agents: Planner (structures content), Coder (converts instructions to executable code), and Critic (refines spatial layout using vision-language models).
5. 📊 Results and Evaluation: Achieved a 40% improvement over direct code generation, evaluated on the MMMC benchmark across dimensions including VLM-as-a-Judge aesthetic scores, code efficiency, and the TeachQuiz knowledge-transfer metric.
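TeachQuiz, mentioned in point 5, scores knowledge transfer as the quiz-accuracy gain a judge model shows after watching the video, relative to its "unlearned" baseline. A minimal sketch of the score arithmetic, with quiz accuracies as hypothetical inputs in [0, 1]:

```python
def teachquiz_score(after_video, after_unlearn):
    """TeachQuiz-style transfer score (sketch): quiz accuracy after the
    judge model watches the video, minus its accuracy right after the
    topic has been 'unlearned'. Higher means the video taught more."""
    return after_video - after_unlearn

def benchmark_teachquiz(per_topic_scores):
    """Average the per-topic transfer scores across a benchmark.
    per_topic_scores: list of (after_video, after_unlearn) pairs."""
    gains = [teachquiz_score(v, u) for v, u in per_topic_scores]
    return sum(gains) / len(gains)
```

Subtracting the unlearned baseline isolates what the video itself contributed, rather than what the judge model already knew.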

Code2Video: Methodology Flow

Input: learning topic query

Stage 1: Planner (temporal coherence)
• Outline generation, storyboard construction
• External database: reference images, visual assets, asset cache

Stage 2: Coder (bridge from lecture to code)
• Parallel code generation, auto-debugging
• Scope Refine (SR): progressive repair at line, block, and global scope

Stage 3: Critic (spatial refinement)
• Visual anchor prompt (6×6 grid), VideoLLM feedback, layout optimization

Output: educational video and executable code, with temporal coherence and spatial clarity

MMMC Benchmark
• Efficiency: time, tokens
• Aesthetics: VLM-as-a-Judge across 5 dimensions
• TeachQuiz (knowledge transfer): the VLM is first made to unlearn the topic, then quizzed after watching the video; Score = S(V) - S(V | unlearn)

Key features of the code-centric paradigm
• Scalable: modular integration, external assets
• Interpretable: explicit scripting, auditable decisions
• Controllable: precise timing, spatial organization
• Reproducible: deterministic, extensible code
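The three-stage Planner → Coder → Critic pipeline can be sketched as a simple composition. All function bodies below are hypothetical stand-ins for the paper's agents (which are LLM-driven and target Manim-style rendering code); only the data flow mirrors the described architecture:

```python
from dataclasses import dataclass

@dataclass
class Scene:
    narration: str
    code: str  # executable rendering code for this scene

def planner(topic: str) -> list:
    """Stage 1: break the topic into a temporally coherent storyboard."""
    return [f"{topic}: definition", f"{topic}: worked example", f"{topic}: summary"]

def coder(outline: str) -> Scene:
    """Stage 2: turn one scene outline into executable rendering code
    (the real agent also auto-debugs via Scope Refine)."""
    return Scene(narration=outline, code=f"render_text({outline!r})")

def critic(scene: Scene) -> Scene:
    """Stage 3: refine spatial layout; here just a placeholder anchor hint
    standing in for VideoLLM-guided layout optimization."""
    scene.code += "  # anchored to a 6x6 layout grid"
    return scene

def code2video(topic: str) -> list:
    """Compose the three agents over every storyboard scene."""
    return [critic(coder(outline)) for outline in planner(topic)]
```

Because the output is code rather than pixels, every scene remains auditable and re-renderable, which is the reproducibility claim in the feature list above.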
Q1
1. What is the main advantage of using a code-centric approach over pixel-based video generation for educational content?
a) Faster rendering speed and lower computational requirements
b) Better control over temporal sequencing and spatial organization
c) Higher resolution output and better video quality
Q2
2. Which component of Code2Video is responsible for ensuring visual elements are properly arranged without overlapping?
a) The Planner agent
b) The Coder agent
c) The Critic agent
Q3
3. How does the TeachQuiz evaluation metric work?
a) By measuring the video's visual quality and aesthetic appeal
b) By comparing the generated video duration with human-made videos
c) By testing knowledge transfer through unlearning and relearning cycles