2026-01-26 Papers

Paper 1

LongCat-Flash-Thinking-2601 Technical Report

Published: 2026-01-23

Link: http://arxiv.org/pdf/2601.16725

1. 📘 Topic and Domain: The paper presents LongCat-Flash-Thinking-2601, a 560-billion-parameter open-source Mixture-of-Experts (MoE) reasoning model focused on agentic reasoning capabilities in the domain of large language models.
2. 💡 Previous Research and New Ideas: The paper builds on LongCat-Flash-Chat's pre-training recipe and extends it with novel ideas including environment scaling for multi-domain training, robust training under noisy environments, and a Heavy Thinking mode that jointly scales reasoning depth and width.
3. ❓ Problem: The paper aims to solve the challenge of enabling models to perform complex real-world tasks through adaptive interaction with external environments, addressing the limitations of existing models in long-horizon trajectories and heterogeneous environment interactions.
4. 🛠️ Methods: The authors use a unified training framework combining domain-parallel expert training, scalable environment construction across 20+ domains, asynchronous reinforcement learning (DORA system), curriculum-based noise injection, and a two-stage Heavy Thinking mode for test-time scaling.
5. 📊 Results and Evaluation: LongCat-Flash-Thinking-2601 achieves state-of-the-art performance among open-source models on agentic benchmarks (73.1% on BrowseComp, 77.7% on RWSearch, 88.2% on τ2-Bench, 29.3% on VitaBench) while maintaining competitive performance on general reasoning tasks, demonstrating strong generalization and robustness to real-world noise.
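The Heavy Thinking mode from the methods above can be sketched as a two-stage test-time procedure: widen (sample parallel trajectories), then deepen (reflective refinement over the candidates). This is a minimal illustration, not the paper's implementation; the `generate` callable is a hypothetical stand-in for a real model call.

```python
from typing import Callable, List

def heavy_thinking(
    generate: Callable[[str], str],  # hypothetical model call: prompt -> reasoning trace
    problem: str,
    width: int = 4,                  # stage 1: number of parallel trajectories
) -> str:
    """Two-stage test-time scaling sketch: expand width, then depth."""
    # Stage 1 (width): sample several independent reasoning trajectories.
    trajectories: List[str] = [generate(problem) for _ in range(width)]

    # Stage 2 (depth): reflective refinement over the pooled candidates.
    refine_prompt = (
        problem
        + "\n\nCandidate solutions:\n"
        + "\n---\n".join(trajectories)
        + "\n\nReflect on the candidates and produce a corrected final answer."
    )
    return generate(refine_prompt)
```

The key property is that stage 2 conditions on all of stage 1's outputs, so extra width feeds directly into deeper reflection rather than a simple majority vote.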

Workflow figure (key components):

- Pre-training: 560B MoE, LongCat-Flash recipe; mid-training at 32K/128K/256K context
- Agentic data synthesis: text-driven, environment-grounded, planning-oriented construction
- Environment scaling: domain-graph construction, tool-dependency analysis; 20+ domains, 10K+ environments
- RL preparation: cold-start policy, task-set construction, quality filtering
- DORA asynchronous RL system: multi-version training, 32K concurrent environments; strategies include curriculum learning, dynamic budget allocation, context management, multi-domain environment training, and robust RL with noise
- Heavy Thinking mode: test-time scaling via parallel reasoning and reasoning refinement (width + depth expansion)
- Zigzag sparse attention: MLA + SSA, 1M-context support, 1.5× speedup
- Evaluation: math reasoning, agentic search, tool use, coding, general QA
Q1. What unique infrastructure system does LongCat-Flash-Thinking-2601 use to handle the challenges of multi-turn agentic interactions and long-tailed generation?
- DORA (Dynamic ORchestration for Asynchronous Rollout), a multi-version asynchronous training system
- MCTS (Monte Carlo Tree Search), a traditional reinforcement learning framework
- RLHF (Reinforcement Learning from Human Feedback), a standard alignment technique

Q2. How does the model's Heavy Thinking mode improve reasoning performance at test time?
- By simply increasing the temperature parameter to generate more diverse outputs
- By jointly expanding both reasoning width (parallel trajectories) and depth (reflective reasoning) through two complementary stages
- By using a larger model checkpoint with more parameters activated

Q3. What innovative approach does the paper take to improve the model's robustness in real-world deployments?
- Training exclusively on clean, high-quality data to avoid learning from errors
- Using larger batch sizes and more compute to overfit on perfect environments
- Systematically analyzing real-world noise patterns and progressively injecting multi-type environmental imperfections during training
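The curriculum-based noise injection named in the methods can be sketched as a schedule that ramps up the probability of corrupting environment observations as training progresses. The noise taxonomy and linear schedule below are illustrative assumptions, not the paper's actual design.

```python
import random

# Hypothetical fault types modeled on common real-world environment noise;
# the paper's actual noise taxonomy is not reproduced here.
NOISE_TYPES = ["tool_timeout", "truncated_output", "stale_page", "malformed_json"]

def noise_schedule(step: int, total_steps: int, max_rate: float = 0.3) -> float:
    """Curriculum: corruption probability ramps up linearly with training progress."""
    return max_rate * min(1.0, step / total_steps)

def maybe_corrupt(observation: str, step: int, total_steps: int,
                  rng: random.Random) -> str:
    """Inject one environmental imperfection with schedule-dependent probability."""
    if rng.random() < noise_schedule(step, total_steps):
        fault = rng.choice(NOISE_TYPES)
        if fault == "tool_timeout":
            return "ERROR: tool call timed out"
        if fault == "truncated_output":
            return observation[: max(1, len(observation) // 2)]
        if fault == "stale_page":
            return observation + "\n[warning: content may be outdated]"
        return "{invalid json"  # malformed_json
    return observation
```

Early in training the policy sees mostly clean observations; by the end, a bounded fraction of tool returns are faulty, forcing recovery behavior rather than overfitting to perfect environments.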

Paper 2

SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents

Published: 2026-01-23

Link: http://arxiv.org/pdf/2601.16746

1. 📘 Topic and Domain: The paper focuses on context pruning for coding agents in software engineering, specifically addressing the challenge of long interaction contexts in LLM-based agents.
2. 💡 Previous Research and New Ideas: The paper builds on existing context compression methods like LLMLingua and Selective-Context, but proposes a novel task-aware, line-level pruning approach that preserves syntactic structure and uses agent-generated goal hints to guide selective attention.
3. ❓ Problem: The paper aims to solve the "Context Wall" problem where coding agents accumulate excessive context during multi-turn interactions, leading to high API costs, latency, and performance degradation.
4. 🛠️ Methods: The authors develop SWE-Pruner, a lightweight neural skimmer (0.6B parameters) trained on 61K synthetic samples that performs adaptive line-level pruning based on explicit goal hints from agents, using CRF-based scoring and aggregation.
5. 📊 Results and Evaluation: SWE-Pruner achieves 23-54% token reduction on agent tasks (SWE-Bench Verified, SWE-QA) and up to 14.84× compression on single-turn tasks (LongCodeQA) with minimal performance impact, while also reducing agent interaction rounds by up to 26%.
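Goal-hint-guided, line-level pruning can be sketched as follows. The 0.6B neural skimmer is not reproducible here, so this toy substitutes a keyword-overlap score and a neighbor-smoothing pass for the CRF aggregation; only the line-level granularity and the τ = 0.5 threshold follow the paper.

```python
def score_line(line: str, goal_hint: str) -> float:
    """Toy relevance score: fraction of goal-hint keywords found in the line.
    (A crude stand-in for SWE-Pruner's neural skimmer.)"""
    keywords = goal_hint.lower().split()
    hits = sum(1 for kw in keywords if kw in line.lower())
    return hits / max(1, len(keywords))

def prune(context: str, goal_hint: str, tau: float = 0.5) -> str:
    """Keep lines whose neighborhood scores at least tau, so contiguous
    relevant regions survive intact (a cheap proxy for CRF aggregation)."""
    lines = context.splitlines()
    scores = [score_line(line, goal_hint) for line in lines]
    kept = [
        line
        for i, line in enumerate(lines)
        if max(scores[max(0, i - 1): i + 2]) >= tau  # smooth over neighbors
    ]
    return "\n".join(kept)
```

Operating on whole lines, rather than tokens as LLMLingua-style compressors do, is what lets the pruned output stay syntactically valid code.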

Workflow figure (key components):

- Coding agent (Claude/GLM) issues read commands (grep, cat, etc.) against the repository, yielding raw context: thousands of lines, high noise
- Goal hint generation, e.g. "Focus on MRO resolution logic"
- SWE-Pruner framework: neural skimmer (0.6B parameters, Qwen3-Reranker), line-level scoring, CRF + aggregation, adaptive selection at τ = 0.5
- Training data: 61K samples across 9 task types
- Pruned context: relevant and low-noise, 23-54% token reduction
- Performance: 70.2% success on SWE-Bench, 58.71% accuracy on LongCodeQA, up to 14.84× compression
Q1. According to the paper's analysis, what percentage of token consumption do 'read' operations account for in coding agents?
- Approximately 50% across different models
- Between 67% and 76%, depending on the model
- Less than 40% on average

Q2. How does SWE-Pruner differ from traditional context compression methods like LLMLingua when handling code?
- It operates at line-level granularity to preserve syntactic structure
- It uses larger neural models for better compression ratios
- It compresses code into natural language summaries

Q3. What unexpected benefit did SWE-Pruner demonstrate beyond token reduction?
- It improved code generation accuracy by 50%
- It reduced agent interaction rounds by up to 26%
- It eliminated the need for file search operations

Paper 3

EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience

Published: 2026-01-22

Link: http://arxiv.org/pdf/2601.15876

1. 📘 Topic and Domain: The paper focuses on developing native computer use agents (CUA) that can autonomously interact with graphical user interfaces, advancing multimodal AI capabilities for desktop automation.
2. 💡 Previous Research and New Ideas: Building on existing CUA models like OpenCUA and UI-TARS that rely on static dataset imitation, the paper proposes a paradigm shift to "evolving experience learning" through verifiable synthesis engines and massive-scale interactive rollouts.
3. ❓ Problem: The paper addresses the bottleneck of static data scaling in training computer use agents, where passive imitation fails to capture causal dynamics and environmental feedback necessary for long-horizon computer tasks.
4. 🛠️ Methods: The authors employ a three-pillar approach: a verifiable synthesis engine for generating tasks with executable validators, scalable infrastructure supporting tens of thousands of concurrent sandbox sessions, and an iterative learning strategy combining rejection sampling fine-tuning with reinforcement learning.
5. 📊 Results and Evaluation: EvoCUA-32B achieves 56.7% success rate on OSWorld benchmark, surpassing the previous open-source SOTA OpenCUA-72B (45.0%) and closed-weights UI-TARS-2 (53.1%), demonstrating consistent performance gains across different model scales.
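The verifiable-synthesis idea above — every synthesized task ships with an executable validator, and rejection sampling keeps only trajectories that pass it — can be sketched as follows. The file-rename task and the flat state dict are invented stand-ins; EvoCUA's engine targets GUI environments, not this toy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Task:
    instruction: str
    # Deterministic success check over the final environment state.
    validator: Callable[[Dict], bool]

def make_rename_task(src: str, dst: str) -> Task:
    """Synthesize a task together with the validator that verifies its end state."""
    return Task(
        instruction=f"Rename file {src} to {dst}",
        validator=lambda state: dst in state["files"] and src not in state["files"],
    )

def rejection_sample(task: Task, rollouts: List[Dict]) -> List[Dict]:
    """Keep only trajectories whose final state passes the validator --
    these become fine-tuning data in the iterative learning loop."""
    return [r for r in rollouts if task.validator(r["final_state"])]
```

Because the validator executes against the final state, the reward signal is binary and unambiguous, which is what makes rejection sampling (and later RL) scale without human grading.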

Workflow figure (key components):

- Verifiable synthesis engine: structured task space, dual-stream generation
- Scalable interaction infrastructure: asynchronous gateway service, 10K+ sandboxes
- Evolving experience learning: iterative optimization via cold start, rejection sampling, and reinforcement learning
- Performance: 56.7% OSWorld success rate, state-of-the-art among open-weights models
- Key innovation: from static data to dynamic experience
Q1. What key innovation distinguishes EvoCUA from previous computer use agents like OpenCUA?
- It uses a larger model with 235 billion parameters for better performance
- It shifts from static data imitation to dynamic experience learning through verifiable synthesis and massive interactive rollouts
- It focuses exclusively on web browser automation rather than general desktop tasks

Q2. How does EvoCUA's infrastructure handle the computational demands of its evolving learning paradigm?
- By using a single powerful GPU to process all training data sequentially
- By limiting training to only 100 sandbox sessions at a time to ensure quality
- By orchestrating tens of thousands of concurrent sandbox sessions through asynchronous gateways and distributed scheduling

Q3. What unique approach does EvoCUA use to ensure the quality of its synthesized training data?
- It generates tasks alongside executable validators that provide deterministic success verification, eliminating ambiguity in reward signals
- It relies on human annotators to manually verify each generated task before inclusion
- It uses GPT-4 to score the quality of generated tasks based on linguistic coherence