1. 📘 Topic and Domain: The paper explores video content customization from a novel perspective on the role of the first frame in video generation models, situated in computer vision and deep learning.
2. 💡 Previous Research and New Ideas: Building on existing video generation models such as Wan2.2, the paper proposes that the first frame acts as a conceptual memory buffer storing visual entities, rather than merely a starting point for generation.
3. ❓ Problem: The paper aims to solve the challenge of incorporating multiple reference images into pre-trained video generation models without architectural modifications or large-scale fine-tuning.
4. 🛠️ Methods: The authors develop FFGo, a lightweight add-on that uses Vision-Language Models for data curation and applies LoRA adaptation with only 20-50 training examples, invoking the model's innate ability to composite subjects through the first frame.
5. 📊 Results and Evaluation: In user studies totaling 200 annotations, FFGo outperformed baseline models on object identity, scene identity, and overall quality, with 81.2% of users ranking it first despite its minimal training data.
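The LoRA adaptation mentioned in item 4 explains why so few training examples suffice: only a low-rank correction to the frozen weights is learned. Below is a minimal numpy sketch of that idea; the hidden size and rank are illustrative assumptions, not FFGo's actual settings, and this is not the paper's implementation.

```python
import numpy as np

d, r = 1024, 8  # hidden size and LoRA rank (illustrative assumptions)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # frozen pre-trained weight matrix
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection; zero init
                                        # means W_adapted == W at the start

W_adapted = W + B @ A                   # effective weight after adaptation

full_params = W.size          # parameters updated by full fine-tuning
lora_params = A.size + B.size # parameters updated by LoRA
print(full_params, lora_params)  # 1048576 vs 16384, about 1.6%
```

With only ~1.6% of the weights trainable, a handful of curated examples can steer the model without disturbing its pre-trained behavior, which is consistent with the paper's claim of adaptation from 20-50 examples.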