2025-08-07 Papers

Paper 1

SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience

Published: 2025-08-06

Link: http://arxiv.org/pdf/2508.04700

1. 📘 Topic and Domain: The paper presents SEAgent, a self-evolving computer use agent framework that enables autonomous learning and adaptation to unfamiliar software environments through experience.
2. 💡 Previous Research and New Ideas: Building on prior research in large vision-language models and computer use agents, which relied heavily on human-labeled data, this paper proposes a framework that lets agents learn autonomously through self-exploration and experiential learning.
3. ❓ Problem: The paper addresses the challenge of enabling computer use agents to effectively learn and adapt to new and specialized software environments without requiring human annotations or supervision.
4. 🛠️ Methods: The authors developed a framework combining a World State Model for trajectory assessment, a Curriculum Generator for creating progressively challenging tasks, and a reinforcement learning approach using both adversarial imitation for failure actions and Group Relative Policy Optimization for successful ones.
5. 📊 Results and Evaluation: The system raised the success rate from the UI-TARS baseline's 11.3% to 34.5% across five professional software applications, with the specialist-to-generalist strategy outperforming both pure-specialist and direct-generalist training.
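The two-part policy update in point 4 can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: `grpo_advantages` normalizes rewards within a rollout group (the core of GRPO), and `policy_loss` rewards successful actions by advantage-weighted log-probability while adding a term that pushes down the probability of labeled failure actions, standing in for the adversarial-imitation objective. Function names and the `beta` weight are assumptions.

```python
def grpo_advantages(rewards):
    """Group Relative Policy Optimization: normalize each reward
    against the mean and std of its rollout group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    std = std if std > 0 else 1.0  # avoid division by zero for uniform groups
    return [(r - mean) / std for r in rewards]

def policy_loss(success_logprobs, advantages, failure_logprobs, beta=0.1):
    """Combined objective (illustrative): maximize advantage-weighted
    log-prob of successful actions; penalize log-prob of failure actions."""
    loss = -sum(a * lp for a, lp in zip(advantages, success_logprobs))
    loss += beta * sum(failure_logprobs)  # lowering loss lowers failure prob
    return loss
```

Successful actions with above-average group reward get a positive advantage and are reinforced; the failure term is minimized by making failure actions less likely, which is the intuition behind learning from mistakes without a human label.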

SEAgent workflow (from the paper's overview diagram):

- Initialization: GUI state parsing, initial task generation, and creation of the initial software guidebook (U₀) by the World State Model and Curriculum Generator.
- Autonomous exploration: the actor model π (built on UI-TARS) executes task instructions through trial-and-error learning.
- World state evaluation: step-wise trajectory assessment with success/failure labeling, performed by a fine-tuned Qwen2.5-VL.
- Curriculum update: Qwen2.5-72B generates more challenging tasks and updates the software guidebook.
- Policy update via reinforcement learning: GRPO on correct actions (aᵀ) with verifiable, distance-based rewards; adversarial imitation on failure actions (aᶠ) via a negative KL-divergence term, so the agent learns from its mistakes.
- Specialist-to-generalist training: train software specialists individually, distill their knowledge via SFT on specialist trajectories, then run multi-software RL to obtain a unified generalist with superior performance.

Performance results: UI-TARS baseline 11.3%; SEAgent (specialist) 32.2%; SEAgent (generalist) 34.5%, a 23.2-point improvement in success rate.

Key components: World State Model (Qwen2.5-VL-7B), Curriculum Generator (Qwen2.5-72B), actor model (UI-TARS-7B-DPO), software guidebook memory, step-wise reward signals.

Software environments: VSCode (development), GIMP (image editing), LibreOffice Impress (presentations), VLC (media playback), LibreOffice Writer (documents).

Key innovations: autonomous exploration without human supervision, self-evolving curriculum learning, experience-based policy updates, the specialist-to-generalist strategy, and iterative self-evolution.
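The iterative self-evolution loop described above (explore, evaluate, update the curriculum, update the policy) can be caricatured with toy stand-ins. Every class below is an assumption made for illustration; in the paper the components are large models (Qwen2.5-VL, Qwen2.5-72B, UI-TARS), not scalar counters.

```python
class ToyActor:
    """Stand-in for the UI-TARS-based actor; `skill` abstracts its policy."""
    def __init__(self):
        self.skill = 0.0

    def execute(self, task):
        # autonomous exploration: attempt the task, record success/failure
        return {"task": task, "success": self.skill >= task["difficulty"]}

    def update(self, trajectory):
        # stand-in for GRPO on successes + adversarial imitation on failures
        self.skill += 0.05 if trajectory["success"] else 0.1

class ToyCurriculum:
    """Stand-in for the Curriculum Generator: tasks get progressively harder."""
    def __init__(self):
        self.difficulty = 0.0

    def generate(self, n=3):
        tasks = [{"difficulty": self.difficulty}] * n
        self.difficulty += 0.1  # next phase is more challenging
        return tasks

def self_evolve(actor, curriculum, phases=5):
    for _ in range(phases):
        for task in curriculum.generate():    # curriculum update
            trajectory = actor.execute(task)  # autonomous exploration
            actor.update(trajectory)          # policy update from experience
    return actor
```

The point of the sketch is the control flow: no human labels appear anywhere in the loop; the evaluator and curriculum close the feedback cycle on their own.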
Q1. What is the main innovation of SEAgent compared to previous computer use agents?

- It uses more sophisticated vision-language models
- It learns autonomously through self-exploration without human supervision
- It can only work with specialized software applications

Q2. In the specialist-to-generalist training strategy, what is the correct sequence of steps?

- Train generalist model first, then specialize for each software
- Train multiple generalist models and combine them together
- Train specialist agents first, then distill into a generalist model

Q3. What was the main performance improvement achieved by SEAgent?

- Improved success rate from 11.3% to 34.5% across five software applications
- Reduced training time by 50% compared to baseline models
- Achieved 100% accuracy on simple computer tasks

Paper 2

Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

Published: 2025-08-04

Link: http://arxiv.org/pdf/2508.02193

1. 📘 Topic and Domain: The paper introduces Seed Diffusion Preview, a large-scale discrete-state diffusion language model for code generation with high-speed inference capabilities.
2. 💡 Previous Research and New Ideas: Building on previous work in discrete diffusion models and non-autoregressive generation, the paper proposes new techniques that balance generation quality and speed while addressing the limitations of traditional token-by-token decoding.
3. ❓ Problem: The paper aims to solve the challenges of slow inference speed in language models while maintaining competitive performance, particularly addressing the inefficiencies in token-by-token generation.
4. 🛠️ Methods: The paper employs a two-stage curriculum (TSC) for diffusion training, constrained-order diffusion training, on-policy diffusion learning, and block-level parallel sampling with system optimizations.
5. 📊 Results and Evaluation: The model achieves 2,146 tokens/second inference speed on H20 GPUs while maintaining competitive performance across various code benchmarks, establishing new state-of-the-art on the speed-quality trade-off frontier.
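The two forward processes named in point 4 can be made concrete. This is a hedged reconstruction from the formulas in the paper's diagram (the `MASK` sentinel, token ids, and function names are illustrative): stage 1 masks each token independently, while stage 2 applies k_t = ⌊|x₀|·α_t⌋ random substitutions, bounding the Levenshtein distance from x₀ by k_t.

```python
import random

MASK = -1  # illustrative mask-token id

def mask_forward(x0, alpha_t, rng):
    """Stage-1 forward process: each token is masked independently
    with probability alpha_t (a factorized q_mask)."""
    return [MASK if rng.random() < alpha_t else tok for tok in x0]

def edit_forward(x0, alpha_t, rng, vocab_size=100):
    """Stage-2 forward process: apply k_t = floor(|x0| * alpha_t) random
    substitutions, so the Levenshtein distance to x0 is at most k_t."""
    k_t = int(len(x0) * alpha_t)
    x_t = list(x0)
    for i in rng.sample(range(len(x0)), k_t):
        x_t[i] = rng.randrange(vocab_size)
    return x_t
```

The edit-based stage matters for the model's editing ability: unlike masking, it forces the denoiser to detect and repair wrong tokens, not just fill blanks.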

Seed Diffusion methodology flow (from the paper's diagram):

- Two-Stage Curriculum (TSC): stage 1 (80% of training) uses a mask-based forward process; stage 2 (20%) uses an edit-based forward process.
- Trajectory space tailoring: constrained-order diffusion with ELBO-based trajectory selection and fine-tuning.
- On-policy diffusion learning: minimize trajectory length under a verifier constraint, using a progressive surrogate loss.
- Block-level inference: semi-autoregressive parallel sampling with KV-caching.

Key mathematical components:

- Mask-based process: q_mask(x_t | x_0) = ∏_i q_mask(x_t[i] | x_0[i])
- Edit-based process: k_t = ⌊|x_0| · α_t⌋ (controls the Levenshtein distance)
- Combined loss: L_diff(θ) = −E_{q_edit,t} log p_θ(x_0 | x_t) − E_{q_mask,t} [weighted reconstruction]

Performance achievements: 2,146 tokens/s inference on H20 GPUs, competitive results on code benchmarks, a superior speed-quality Pareto frontier, and strong performance on editing tasks.

Evaluation benchmarks: HumanEval, MBPP, BigCodeBench, LiveCodeBench, MBXP, NCB, Aider/CanItEdit.

System infrastructure: a specialized framework for diffusion sampling, with block-size optimization and a KV-caching implementation, balancing computation latency against token generation rate.

Key innovations: the two-stage (mask + edit) curriculum, constrained-order trajectory filtering, on-policy learning for speedup, and block-level parallel inference.
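Block-level inference can be sketched as follows. The denoiser here is a placeholder callable (an assumption, not the paper's model), but the control flow shows the idea: blocks are committed left to right, so earlier blocks can be KV-cached like an autoregressive prefix, while positions inside the active block are filled in parallel at each diffusion step.

```python
MASK = None  # placeholder for a still-masked position

def block_parallel_decode(denoise, length, block_size):
    """Semi-autoregressive sampling sketch: left-to-right over blocks,
    parallel denoising of all masked positions within each block."""
    seq = [MASK] * length
    for start in range(0, length, block_size):
        block = range(start, min(start + block_size, length))
        while any(seq[i] is MASK for i in block):
            # one parallel step: propose tokens for every masked position at once
            proposals = {i: denoise(seq, i) for i in block if seq[i] is MASK}
            for i, tok in proposals.items():
                seq[i] = tok
    return seq
```

This is where the speed-quality trade-off lives: a larger block size means more tokens per parallel step but weaker left-to-right conditioning, which the paper's system work tunes against hardware latency.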
Q1. What is the main innovation that allows Seed Diffusion to achieve high-speed inference?

- Using larger GPU clusters
- Non-sequential, parallel generation through discrete diffusion
- Reducing the model size and parameters

Q2. In the Two-Stage Curriculum (TSC) training, what is the ratio between mask-based and edit-based forward processes?

- 50% mask-based, 50% edit-based
- 90% mask-based, 10% edit-based
- 80% mask-based, 20% edit-based

Q3. What inference speed did Seed Diffusion Preview achieve on H20 GPUs?

- 1,489 tokens/s
- 2,146 tokens/s
- 737 tokens/s

Paper 3

Agent Lightning: Train ANY AI Agents with Reinforcement Learning

Published: 2025-08-05

Link: http://arxiv.org/pdf/2508.03680

1. 📘 Topic and Domain: The paper presents Agent Lightning, a framework for applying Reinforcement Learning (RL) to train Large Language Models (LLMs) in any AI agent system.
2. 💡 Previous Research and New Ideas: Previous work focused on static, single-call RL tasks, while this paper proposes a novel framework that decouples agent execution from RL training to enable seamless integration with any existing agent.
3. ❓ Problem: The paper addresses the challenge of applying RL to complex AI agents, which currently lack mechanisms for automated optimization and struggle with reliability in real-world tasks.
4. 🛠️ Methods: The authors formulate agent execution as a Markov Decision Process, introduce a unified data interface for RL training, and develop the LightningRL algorithm together with a Training-Agent Disaggregation architecture.
5. 📊 Results and Evaluation: The framework demonstrated stable performance improvements across three different tasks (text-to-SQL, retrieval-augmented generation, and math QA) implemented with different agent frameworks (LangChain, OpenAI Agents SDK, and AutoGen).
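The unified data interface in point 4 can be illustrated with a minimal sketch (the field names and the zero-reward placeholder are assumptions): each LLM call in an agent run becomes one (input, output, reward) transition, rather than all turns being concatenated into one long training sequence. Here the terminal task reward is naively assigned to the last call; LightningRL's credit assignment distributes reward more carefully.

```python
def decompose_trajectory(calls, task_reward):
    """Turn one agent run into per-call RL transitions: the state is the
    agent's snapshot (the prompt at call time), the action is the LLM
    output, and intermediate calls get a placeholder reward of 0.0."""
    transitions = []
    for i, call in enumerate(calls):
        transitions.append({
            "input": call["prompt"],      # state: agent snapshot at call time
            "output": call["response"],   # action: the LLM's output
            "reward": task_reward if i == len(calls) - 1 else 0.0,
        })
    return transitions
```

Because each transition is independent of the others, sequence length stays bounded by a single call's context regardless of how many turns the agent took, which is how the framework sidesteps the long-sequence problem in multi-turn interactions.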

Agent Lightning training workflow (from the paper's diagram):

- Agent execution: LangChain, OpenAI Agents SDK, AutoGen, or custom agents run with zero code changes.
- Data collection: a unified data interface records states, LLM calls, and rewards via OpenTelemetry.
- MDP formulation: the state is an agent snapshot, the action is the LLM output, and the reward measures task quality; trajectories are decomposed into (input, output, reward) transitions.
- LightningRL algorithm: a hierarchical RL approach with credit assignment and token-level optimization, compatible with GRPO, PPO, and REINFORCE++.
- Training-Agent Disaggregation: a Lightning Server performs training and model updates while a Lightning Client runs the agent, yielding continuous and stable performance improvement.

Key features: complete decoupling of agent and training, a unified data interface for any agent, hierarchical RL with credit assignment, and Automatic Intermediate Rewarding (AIR).

Applications: text-to-SQL (LangChain), RAG (OpenAI Agents SDK), math QA with tools (AutoGen), and multi-agent scenarios.

Benefits: framework-agnostic, scalable and robust, suited to real-world deployment, with observability integration.
Q1. What is the key innovation in Agent Lightning's architecture that differentiates it from existing RL frameworks?

- Its ability to handle multiple LLMs simultaneously
- Complete decoupling between agent execution and RL training
- The use of a new type of neural network architecture

Q2. In the experimental evaluation, which combination of task and framework was NOT tested?

- Text-to-SQL with LangChain
- Math QA with AutoGen
- Image generation with Stable Diffusion

Q3. How does Agent Lightning handle the challenge of long sequences in multi-turn interactions?

- By using advanced compression algorithms
- By organizing data as individual transitions rather than concatenated turns
- By limiting the maximum context length