2026-01-26 Papers

Paper 1

LongCat-Flash-Thinking-2601 Technical Report

Published: 2026-01-23

Link: http://arxiv.org/pdf/2601.16725

1. 📘 Topic and Domain: The paper presents LongCat-Flash-Thinking-2601, a 560-billion-parameter open-source Mixture-of-Experts (MoE) reasoning model focused on agentic reasoning capabilities in the domain of large language models.
2. 💡 Previous Research and New Ideas: The paper builds on LongCat-Flash-Chat's pre-training recipe and extends it with novel ideas including environment scaling for multi-domain training, robust training under noisy environments, and a Heavy Thinking mode that jointly scales reasoning depth and width.
3. ❓ Problem: The paper aims to solve the challenge of enabling models to perform complex real-world tasks through adaptive interaction with external environments, addressing the limitations of existing models in long-horizon trajectories and heterogeneous environment interactions.
4. 🛠️ Methods: The authors use a unified training framework combining domain-parallel expert training, scalable environment construction across 20+ domains, asynchronous reinforcement learning (DORA system), curriculum-based noise injection, and a two-stage Heavy Thinking mode for test-time scaling.
5. 📊 Results and Evaluation: LongCat-Flash-Thinking-2601 achieves state-of-the-art performance among open-source models on agentic benchmarks (73.1% on BrowseComp, 77.7% on RWSearch, 88.2% on τ2-Bench, 29.3% on VitaBench) while maintaining competitive performance on general reasoning tasks, demonstrating strong generalization and robustness to real-world noise.
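The Heavy Thinking mode from the methods above can be sketched as a two-stage test-time procedure: widen (sample parallel trajectories), then deepen (reflective refinement over the candidates). This is a minimal illustration, not the paper's implementation; the `generate` callable is a hypothetical stand-in for a real model call.

```python
from typing import Callable, List

def heavy_thinking(
    generate: Callable[[str], str],  # hypothetical model call: prompt -> reasoning trace
    problem: str,
    width: int = 4,                  # stage 1: number of parallel trajectories
) -> str:
    """Two-stage test-time scaling sketch: expand width, then depth."""
    # Stage 1 (width): sample several independent reasoning trajectories.
    trajectories: List[str] = [generate(problem) for _ in range(width)]

    # Stage 2 (depth): reflective refinement over the pooled candidates.
    refine_prompt = (
        problem
        + "\n\nCandidate solutions:\n"
        + "\n---\n".join(trajectories)
        + "\n\nReflect on the candidates and produce a corrected final answer."
    )
    return generate(refine_prompt)
```

The key property is that stage 2 conditions on all of stage 1's outputs, so extra width feeds directly into deeper reflection rather than a simple majority vote.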

Workflow figure (key components):

- Pre-training: 560B MoE, LongCat-Flash recipe; mid-training at 32K/128K/256K context
- Agentic data synthesis: text-driven, environment-grounded, planning-oriented construction
- Environment scaling: domain-graph construction, tool-dependency analysis; 20+ domains, 10K+ environments
- RL preparation: cold-start policy, task-set construction, quality filtering
- DORA asynchronous RL system: multi-version training, 32K concurrent environments; strategies include curriculum learning, dynamic budget allocation, context management, multi-domain environment training, and robust RL with noise
- Heavy Thinking mode: test-time scaling via parallel reasoning and reasoning refinement (width + depth expansion)
- Zigzag sparse attention: MLA + SSA, 1M-context support, 1.5× speedup
- Evaluation: math reasoning, agentic search, tool use, coding, general QA
Q1. What unique infrastructure system does LongCat-Flash-Thinking-2601 use to handle the challenges of multi-turn agentic interactions and long-tailed generation?
- DORA (Dynamic ORchestration for Asynchronous Rollout), a multi-version asynchronous training system
- MCTS (Monte Carlo Tree Search), a traditional reinforcement learning framework
- RLHF (Reinforcement Learning from Human Feedback), a standard alignment technique

Q2. How does the model's Heavy Thinking mode improve reasoning performance at test time?
- By simply increasing the temperature parameter to generate more diverse outputs
- By jointly expanding both reasoning width (parallel trajectories) and depth (reflective reasoning) through two complementary stages
- By using a larger model checkpoint with more parameters activated

Q3. What innovative approach does the paper take to improve the model's robustness in real-world deployments?
- Training exclusively on clean, high-quality data to avoid learning from errors
- Using larger batch sizes and more compute to overfit on perfect environments
- Systematically analyzing real-world noise patterns and progressively injecting multi-type environmental imperfections during training
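The curriculum-based noise injection named in the methods can be sketched as a schedule that ramps up the probability of corrupting environment observations as training progresses. The noise taxonomy and linear schedule below are illustrative assumptions, not the paper's actual design.

```python
import random

# Hypothetical fault types modeled on common real-world environment noise;
# the paper's actual noise taxonomy is not reproduced here.
NOISE_TYPES = ["tool_timeout", "truncated_output", "stale_page", "malformed_json"]

def noise_schedule(step: int, total_steps: int, max_rate: float = 0.3) -> float:
    """Curriculum: corruption probability ramps up linearly with training progress."""
    return max_rate * min(1.0, step / total_steps)

def maybe_corrupt(observation: str, step: int, total_steps: int,
                  rng: random.Random) -> str:
    """Inject one environmental imperfection with schedule-dependent probability."""
    if rng.random() < noise_schedule(step, total_steps):
        fault = rng.choice(NOISE_TYPES)
        if fault == "tool_timeout":
            return "ERROR: tool call timed out"
        if fault == "truncated_output":
            return observation[: max(1, len(observation) // 2)]
        if fault == "stale_page":
            return observation + "\n[warning: content may be outdated]"
        return "{invalid json"  # malformed_json
    return observation
```

Early in training the policy sees mostly clean observations; by the end, a bounded fraction of tool returns are faulty, forcing recovery behavior rather than overfitting to perfect environments.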

Paper 2

SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents

Published: 2026-01-23

Link: http://arxiv.org/pdf/2601.16746

1. 📘 Topic and Domain: The paper focuses on context pruning for coding agents in software engineering, specifically addressing the challenge of long interaction contexts in LLM-based agents.
2. 💡 Previous Research and New Ideas: The paper builds on existing context compression methods like LLMLingua and Selective-Context, but proposes a novel task-aware, line-level pruning approach that preserves syntactic structure and uses agent-generated goal hints to guide selective attention.
3. ❓ Problem: The paper aims to solve the "Context Wall" problem where coding agents accumulate excessive context during multi-turn interactions, leading to high API costs, latency, and performance degradation.
4. 🛠️ Methods: The authors develop SWE-Pruner, a lightweight neural skimmer (0.6B parameters) trained on 61K synthetic samples that performs adaptive line-level pruning based on explicit goal hints from agents, using CRF-based scoring and aggregation.
5. 📊 Results and Evaluation: SWE-Pruner achieves 23-54% token reduction on agent tasks (SWE-Bench Verified, SWE-QA) and up to 14.84× compression on single-turn tasks (LongCodeQA) with minimal performance impact, while also reducing agent interaction rounds by up to 26%.
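Goal-hint-guided, line-level pruning can be sketched as follows. The 0.6B neural skimmer is not reproducible here, so this toy substitutes a keyword-overlap score and a neighbor-smoothing pass for the CRF aggregation; only the line-level granularity and the τ = 0.5 threshold follow the paper.

```python
def score_line(line: str, goal_hint: str) -> float:
    """Toy relevance score: fraction of goal-hint keywords found in the line.
    (A crude stand-in for SWE-Pruner's neural skimmer.)"""
    keywords = goal_hint.lower().split()
    hits = sum(1 for kw in keywords if kw in line.lower())
    return hits / max(1, len(keywords))

def prune(context: str, goal_hint: str, tau: float = 0.5) -> str:
    """Keep lines whose neighborhood scores at least tau, so contiguous
    relevant regions survive intact (a cheap proxy for CRF aggregation)."""
    lines = context.splitlines()
    scores = [score_line(line, goal_hint) for line in lines]
    kept = [
        line
        for i, line in enumerate(lines)
        if max(scores[max(0, i - 1): i + 2]) >= tau  # smooth over neighbors
    ]
    return "\n".join(kept)
```

Operating on whole lines, rather than tokens as LLMLingua-style compressors do, is what lets the pruned output stay syntactically valid code.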

Workflow figure (key components):

- Coding agent (Claude/GLM) issues read commands (grep, cat, etc.) against the repository, yielding raw context: thousands of lines, high noise
- Goal hint generation, e.g. "Focus on MRO resolution logic"
- SWE-Pruner framework: neural skimmer (0.6B parameters, Qwen3-Reranker), line-level scoring, CRF + aggregation, adaptive selection at τ = 0.5
- Training data: 61K samples across 9 task types
- Pruned context: relevant and low-noise, 23-54% token reduction
- Performance: 70.2% success on SWE-Bench, 58.71% accuracy on LongCodeQA, up to 14.84× compression
Q1. According to the paper's analysis, what percentage of token consumption do 'read' operations account for in coding agents?
- Approximately 50% across different models
- Between 67% and 76%, depending on the model
- Less than 40% on average

Q2. How does SWE-Pruner differ from traditional context compression methods like LLMLingua when handling code?
- It operates at line-level granularity to preserve syntactic structure
- It uses larger neural models for better compression ratios
- It compresses code into natural language summaries

Q3. What unexpected benefit did SWE-Pruner demonstrate beyond token reduction?
- It improved code generation accuracy by 50%
- It reduced agent interaction rounds by up to 26%
- It eliminated the need for file search operations

Paper 3

EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience

Published: 2026-01-22

Link: http://arxiv.org/pdf/2601.15876

1. 📘 Topic and Domain: The paper focuses on developing native computer use agents (CUA) that can autonomously interact with graphical user interfaces, advancing multimodal AI capabilities for desktop automation.
2. 💡 Previous Research and New Ideas: Building on existing CUA models like OpenCUA and UI-TARS that rely on static dataset imitation, the paper proposes a paradigm shift to "evolving experience learning" through verifiable synthesis engines and massive-scale interactive rollouts.
3. ❓ Problem: The paper addresses the bottleneck of static data scaling in training computer use agents, where passive imitation fails to capture causal dynamics and environmental feedback necessary for long-horizon computer tasks.
4. 🛠️ Methods: The authors employ a three-pillar approach: a verifiable synthesis engine for generating tasks with executable validators, scalable infrastructure supporting tens of thousands of concurrent sandbox sessions, and an iterative learning strategy combining rejection sampling fine-tuning with reinforcement learning.
5. 📊 Results and Evaluation: EvoCUA-32B achieves 56.7% success rate on OSWorld benchmark, surpassing the previous open-source SOTA OpenCUA-72B (45.0%) and closed-weights UI-TARS-2 (53.1%), demonstrating consistent performance gains across different model scales.
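The verifiable-synthesis idea above — every synthesized task ships with an executable validator, and rejection sampling keeps only trajectories that pass it — can be sketched as follows. The file-rename task and the flat state dict are invented stand-ins; EvoCUA's engine targets GUI environments, not this toy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Task:
    instruction: str
    # Deterministic success check over the final environment state.
    validator: Callable[[Dict], bool]

def make_rename_task(src: str, dst: str) -> Task:
    """Synthesize a task together with the validator that verifies its end state."""
    return Task(
        instruction=f"Rename file {src} to {dst}",
        validator=lambda state: dst in state["files"] and src not in state["files"],
    )

def rejection_sample(task: Task, rollouts: List[Dict]) -> List[Dict]:
    """Keep only trajectories whose final state passes the validator --
    these become fine-tuning data in the iterative learning loop."""
    return [r for r in rollouts if task.validator(r["final_state"])]
```

Because the validator executes against the final state, the reward signal is binary and unambiguous, which is what makes rejection sampling (and later RL) scale without human grading.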

Workflow figure (key components):

- Verifiable synthesis engine: structured task space, dual-stream generation
- Scalable interaction infrastructure: asynchronous gateway service, 10K+ sandboxes
- Evolving experience learning: iterative optimization via cold start, rejection sampling, and reinforcement learning
- Performance: 56.7% OSWorld success rate, state-of-the-art among open-weights models
- Key innovation: from static data to dynamic experience
Q1. What key innovation distinguishes EvoCUA from previous computer use agents like OpenCUA?
- It uses a larger model with 235 billion parameters for better performance
- It shifts from static data imitation to dynamic experience learning through verifiable synthesis and massive interactive rollouts
- It focuses exclusively on web browser automation rather than general desktop tasks

Q2. How does EvoCUA's infrastructure handle the computational demands of its evolving learning paradigm?
- By using a single powerful GPU to process all training data sequentially
- By limiting training to only 100 sandbox sessions at a time to ensure quality
- By orchestrating tens of thousands of concurrent sandbox sessions through asynchronous gateways and distributed scheduling

Q3. What unique approach does EvoCUA use to ensure the quality of its synthesized training data?
- It generates tasks alongside executable validators that provide deterministic success verification, eliminating ambiguity in reward signals
- It relies on human annotators to manually verify each generated task before inclusion
- It uses GPT-4 to score the quality of generated tasks based on linguistic coherence