2026-04-02 Papers


Paper 1

Reasoning Shift: How Context Silently Shortens LLM Reasoning

Published: 2026-04-01

Link: http://arxiv.org/pdf/2604.01161

1. 📘 Topic and Domain: This paper investigates the robustness of reasoning behaviors in large language models (LLMs) with test-time scaling, specifically how different context conditions affect reasoning quality and length.
2. 💡 Previous Research and New Ideas: Builds on prior work in Chain-of-Thought reasoning, test-time scaling, and long-context language models, and introduces the novel "reasoning shift" phenomenon: reasoning models produce significantly shorter reasoning traces (up to 50% fewer tokens) when problems are surrounded by irrelevant context.
3. ❓ Problem: The paper investigates whether models can solve isolated subproblems as effectively when surrounded by irrelevant context as they do in isolation, and how context length and content affect LLM reasoning capabilities.
4. 🛠️ Methods: Systematic evaluation across three context scenarios—problems with lengthy irrelevant context, multi-turn conversational settings, and subtasks within complex tasks—using multiple reasoning models (Qwen-3.5-27B, GPT-OSS-120B, Gemini 3 Flash Preview, Kimi K2 Thinking) on IMOAnswerBench and MATH500, with sentence-level analysis of reasoning traces.
5. 📊 Results and Evaluation: All models generated up to 50% fewer reasoning tokens under non-baseline conditions, with reduced self-verification and uncertainty-management behaviors; performance dropped by 9-15% on challenging tasks, though the position of the first answer candidate remained similar, indicating that compression occurs after the first answer rather than in the initial reasoning.
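The evaluated context conditions reduce to simple prompt transformations. A minimal sketch, assuming placeholder filler text, a crude whitespace token count, and illustrative instruction wording (the paper's exact prompts and tokenizer are not reproduced here):

```python
# Illustrative sketch of the evaluation conditions; the filler text,
# instruction wording, and whitespace "tokens" are assumptions, not
# the paper's actual prompts or tokenizer.

IRRELEVANT_TEXT = "Shall I compare thee to a summer's day? " * 50  # Shakespeare stand-in

def truncate_to_tokens(text: str, n_tokens: int) -> str:
    """Crude whitespace-token truncation."""
    return " ".join(text.split()[:n_tokens])

def baseline(problem: str) -> str:
    """Single problem in isolation."""
    return problem

def long_input(problem: str, prefix_tokens: int) -> str:
    """Irrelevant context prepended before the problem."""
    prefix = truncate_to_tokens(IRRELEVANT_TEXT, prefix_tokens)
    return f"{prefix}\n\nNow solve the following problem:\n{problem}"

def subtask(problem_a: str, problem_b: str) -> str:
    """Two independent problems packed into one query."""
    return f"Solve both problems independently.\n1) {problem_a}\n2) {problem_b}"

def multi_turn(history: list[dict], problem: str) -> list[dict]:
    """Problem appended to an unrelated multi-turn chat history."""
    return history + [{"role": "user", "content": problem}]
```

Comparing reasoning-trace lengths across these wrappers for the same underlying problem is what surfaces the reasoning-shift effect.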

[Infographic] Reasoning Shift: paper workflow
- Background: test-time scaling, Chain-of-Thought reasoning, and long-context LLMs.
- Research question: how does context affect LLM reasoning capabilities? Key phenomenon: non-isolated context leads to shorter reasoning traces (up to 50% reduction).
- Experimental setup: Baseline (single problem in isolation), Subtask (two independent problems in one query), Long Input (irrelevant context prefix from Shakespeare), Multi-turn (multi-turn chat history). Models: Qwen-3.5-27B, GPT-OSS-120B, Gemini 3 Flash Preview, Kimi K2 Thinking. Benchmarks: IMOAnswerBench, MATH-500.
- Key findings: accuracy drops of 9-15% across models in non-baseline conditions, more pronounced in thinking mode; up to 50% fewer tokens generated (even a 128-token prefix causes an 18% reduction, growing to 50% as context increases); reduced self-verification, uncertainty management, and double-checking behaviors.
- Analysis methods: manual trace inspection (no semantic confusion found); sentence-level analysis with functional categories and transition matrices; resampling experiments showing that context suppresses self-verification patterns.
- Conclusions: context conditions affect reasoning behavior; high-level behaviors are fragile; performance on harder tasks is impacted, though the effect may reduce overthinking on easier problems.
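The sentence-level analysis (assigning each reasoning sentence a functional category, then examining transition matrices) can be sketched as follows; the category names and example traces are illustrative, not the paper's taxonomy:

```python
from collections import Counter

# Functional sentence categories; illustrative labels, not the paper's taxonomy.
CATEGORIES = ["setup", "compute", "verify", "uncertainty", "answer"]

def transition_matrix(labels):
    """Row-normalized transition probabilities between consecutive
    sentence categories within one reasoning trace."""
    counts = Counter(zip(labels, labels[1:]))
    matrix = {}
    for src in CATEGORIES:
        row_total = sum(counts[(src, dst)] for dst in CATEGORIES)
        matrix[src] = {dst: (counts[(src, dst)] / row_total if row_total else 0.0)
                       for dst in CATEGORIES}
    return matrix

# Hypothetical traces: under irrelevant context, fewer 'verify' and
# 'uncertainty' sentences appear, shifting probability mass toward a
# direct compute -> answer transition.
isolated = ["setup", "compute", "verify", "uncertainty", "compute", "verify", "answer"]
in_context = ["setup", "compute", "compute", "answer"]
```

Comparing the two matrices row by row makes the suppressed self-verification transitions visible.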
Q1
1. What is the maximum reduction in reasoning tokens observed when models solve problems surrounded by irrelevant context compared to isolated conditions?
25% fewer tokens
50% fewer tokens
75% fewer tokens
Q2
2. According to the analysis, what behavioral patterns decrease when reasoning models face non-isolated context conditions?
Problem setup and plan generation
Self-verification and uncertainty management
Fact retrieval and active computation
Q3
3. The paper found that the compression of reasoning traces primarily occurs at which stage of the reasoning process?
At the beginning when setting up the problem
After the first answer candidate is found
During active computation steps

Paper 2

Terminal Agents Suffice for Enterprise Automation

Published: 2026-03-31

Link: http://arxiv.org/pdf/2604.00073

1. 📘 Topic and Domain: The paper investigates enterprise automation using AI agents, comparing terminal-based coding agents, web/GUI agents, and tool-augmented MCP agents across enterprise platforms (ServiceNow, GitLab, ERPNext).
2. 💡 Previous Research and New Ideas: Based on prior work on web agents, GUI agents, and tool-augmented agents (MCP), the paper proposes that simple terminal-based coding agents operating through direct API interaction can be sufficient and more cost-effective than complex agent architectures for enterprise automation.
3. ❓ Problem: The paper investigates whether sophisticated agent stacks with heavy abstraction layers (like MCP tool registries or GUI interfaces) are necessary for enterprise automation, or whether minimal terminal agents interacting directly with platform APIs can achieve comparable or better results.
4. 🛠️ Methods: The authors implemented three agent paradigms with identical LLM backbones: MCP-based tool agents, Playwright web agents, and terminal-based coding agents (StarShell). They evaluated these on 729 tasks across three enterprise platforms using four frontier LLMs, measuring success rate and inference cost.
5. 📊 Results and Evaluation: Terminal agents achieved the best cost-performance tradeoff, matching or exceeding web agent accuracy in 7/12 platform-model combinations while costing 5× less on average; MCP agents had the lowest success rates (11.5-68.9%) due to limited tool coverage; documentation access and self-generated skills provided marginal or platform-dependent benefits.
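The terminal agent's reason-execute-observe-correct loop can be sketched generically; `plan_next_command` here stands in for the LLM backbone and the stop condition is an assumption, not the paper's StarShell implementation:

```python
import subprocess

def run_agent(task: str, plan_next_command, max_steps: int = 20):
    """Minimal reason -> execute -> observe -> correct loop for a
    terminal agent. `plan_next_command(task, history)` is a placeholder
    for the LLM backbone: it returns a shell command string, or None
    when it judges the task complete."""
    history = []  # (command, stdout, stderr, returncode) tuples
    for _ in range(max_steps):
        command = plan_next_command(task, history)
        if command is None:  # model declares the task done
            return history
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=60
        )
        # The full observation, including errors, feeds the next planning
        # step, letting the agent retry or adapt (e.g., fix a curl payload).
        history.append((command, result.stdout, result.stderr, result.returncode))
    return history

# Example: a scripted "planner" that issues one command and then stops.
def scripted_planner(task, history):
    return None if history else "echo hello"
```

In the paper's setting the planned commands would be direct API calls (curl with JSON payloads) against ServiceNow, GitLab, or ERPNext.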

[Infographic] Terminal Agents for Enterprise Automation: method flow
- Three interaction paradigms compared: MCP tool-augmented (curated API tools via MCP servers; lowest flexibility), GUI web agent (browser interface via Playwright; high flexibility, high cost), and terminal agent StarShell (direct API via terminal/filesystem; best cost-performance tradeoff).
- StarShell architecture: a natural-language goal plus execution context drives a REPL-style loop (Reason → Execute → Observe → Correct) over a terminal and filesystem, with a skills memory; direct API calls use curl commands with JSON payloads against ServiceNow, GitLab, and ERPNext.
- Task execution workflow: (1) receive task and parse goal plus context, (2) check the skills memory directory, (3) explore the API to discover endpoints, (4) execute curl commands or scripts, (5) handle errors and retry or adapt, (6) store new procedures as skills; iterate as needed.
- Benchmark task types: record CRUD, filtering and sorting, navigation, knowledge base, service catalog, impersonation, composite workflows.
- Key findings: terminal vs. web success rates of 78.5% vs. 62.1% (ServiceNow), 79.8% vs. 84.6% (GitLab), and 73.9% vs. 65.2% (ERPNext); average cost per task with Gemini 3.1 Pro of $0.09 (terminal), $0.69 (web), and $0.11 (MCP); skills memory adds +3.6pp on ServiceNow (43.7% cost reduction), +5.8pp on ERPNext (16.8% cost reduction), and +1.6pp on GitLab (minimal gain); MCP agents reach only 11.5-18.5% on ServiceNow, constrained by a tool registry that omits operations tasks require.
- Key insight: simple programmatic interfaces plus strong models suffice; direct API access outperforms complex abstraction layers for enterprise tasks.
- Practical extensions: documentation access, self-created reusable skills, planner/executor multi-agent setups, hybrid terminal+browser agents.
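The check-skills and store-skills steps suggest a small filesystem-backed memory that persists procedures across tasks. A sketch under assumed conventions (the on-disk layout and the word-overlap matching rule are illustrative, not StarShell's):

```python
import json
from pathlib import Path

class SkillsMemory:
    """Filesystem-backed store of reusable procedures, mirroring the
    'check skills -> execute -> store skills' workflow steps. The JSON
    layout and the matching heuristic are illustrative assumptions."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, name: str, description: str, script: str) -> None:
        """Persist a successful procedure for reuse on later tasks."""
        path = self.root / f"{name}.json"
        path.write_text(json.dumps({"description": description, "script": script}))

    def lookup(self, task: str):
        """Return the stored script whose description shares the most
        words with the task, or None if nothing overlaps at all."""
        task_words = set(task.lower().split())
        best, best_overlap = None, 0
        for path in self.root.glob("*.json"):
            skill = json.loads(path.read_text())
            overlap = len(task_words & set(skill["description"].lower().split()))
            if overlap > best_overlap:
                best, best_overlap = skill["script"], overlap
        return best
```

Reusing a stored procedure skips the API-exploration step entirely, which is consistent with the reported cost reductions on ServiceNow and ERPNext.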
Q1
1. What was the main conclusion of the paper regarding terminal-based agents for enterprise automation?
They are too simple and ineffective for enterprise tasks
They match or outperform more complex architectures while costing significantly less
They only work well for simple, single-step tasks
Q2
2. What was the primary reason MCP-based tool-augmented agents achieved the lowest success rates?
They were too expensive to run
They were limited by predefined tool coverage and could only perform operations explicitly exposed by the tool registry
They lacked access to web browsing capabilities
Q3
3. According to the paper, which strategy allowed terminal agents to reduce costs by up to 43.7% on ServiceNow tasks?
Using hybrid browser and terminal access
Accessing official platform documentation
Creating and reusing persistent self-generated skills across tasks

Paper 3

Universal YOCO for Efficient Depth Scaling

Published: 2026-04-01

Link: http://arxiv.org/pdf/2604.01220

1. 📘 Topic and Domain: The paper addresses efficient inference-time scaling for large language models through architectural innovation in decoder-decoder Transformer designs.
2. 💡 Previous Research and New Ideas: Building on the YOCO (You Only Cache Once) decoder-decoder architecture and the Universal Transformer, the paper proposes combining YOCO with recursive computation via a Universal Self-Decoder that iterates efficient-attention layers.
3. ❓ Problem: Standard Transformers and prior recursive approaches such as the Universal Transformer suffer from high computational overhead and a linearly growing KV cache as depth increases, making efficient inference-time scaling difficult.
4. 🛠️ Methods: YOCO-U replaces the static Self-Decoder with a Universal Self-Decoder that performs T iterations of parameter-shared efficient self-attention (e.g., sliding-window attention) to enhance representational depth, while keeping the Cross-Decoder unchanged so the global KV cache stays constant.
5. 📊 Results and Evaluation: YOCO-U achieves 0.033 lower loss than YOCO at equal FLOPs, requires ~62% fewer training tokens for comparable performance, and maintains efficient inference with linear pre-filling and negligible KV cache overhead, while outperforming baselines on general and long-context benchmarks.
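The decoder-decoder split can be sketched structurally: one parameter-shared self block iterated T times, a global KV cache built once, and cross-decoder layers that all reuse it. The toy numpy operations below are stand-ins for the paper's sliding-window attention and SwiGLU blocks, not its implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy hidden size

def sliding_window_attention(X, W, window=4):
    """Toy causal sliding-window mixer: each position averages its
    last `window` projected positions (stand-in for real attention)."""
    H = X @ W
    out = np.zeros_like(H)
    for i in range(len(H)):
        out[i] = H[max(0, i - window + 1): i + 1].mean(axis=0)
    return out

def yoco_u_forward(X, T=3, n_cross=2):
    """Structural sketch of YOCO-U: the Universal Self-Decoder applies
    one shared-parameter block T times (deeper computation, no extra
    parameters), then a single global KV cache feeds every
    Cross-Decoder layer, so cache size is independent of T."""
    W_self = rng.standard_normal((D, D)) / np.sqrt(D)  # shared across all T iterations
    H = X
    for _ in range(T):                      # Universal Self-Decoder
        H = H + sliding_window_attention(H, W_self)
    K, V = H.copy(), H.copy()               # global KV cache, built once
    Y = H
    for _ in range(n_cross):                # Cross-Decoder reuses (K, V)
        attn = (Y @ K.T) / np.sqrt(D)
        attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)
        Y = Y + attn @ V
    return Y, (K, V)
```

Increasing T deepens the self-decoder computation while the cached (K, V) tensors keep the same shape, which is the property that makes the depth scaling cheap at inference.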

[Infographic] YOCO-U architecture flow
- Input sequence x₁…xₙ → embedding layer → X⁰.
- Universal Self-Decoder (iterative computation): T iterations of efficient self-attention (sliding-window) with SwiGLU + RMSNorm; parameters are shared across all T iterations, so recursion adds no parameters.
- Global KV cache generation (K̂, V̂): O(N) memory, built once, constant and independent of T.
- Cross-Decoder: L/2 layers of cross-attention + SwiGLU, all reusing the same global KV cache.
- Output: next-token prediction Y.
- Key properties: linear pre-filling O(N); constant KV cache O(N); efficient attention O(1) per step; parameter sharing; enhanced expressiveness via recursion.
- vs. standard methods: a standard Transformer needs O(LND) KV cache and O(LN²D) prefill, while YOCO-U needs O(ND) KV cache and O(LND) prefill with negligible overhead.
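The O(LND) vs. O(ND) cache comparison is easy to make concrete with a back-of-envelope calculation (the model dimensions below are illustrative, not taken from the paper):

```python
def kv_cache_bytes(n_tokens, d_model, n_caching_layers, bytes_per_elem=2):
    """KV cache size: keys + values (factor 2), per caching layer, fp16."""
    return 2 * n_tokens * d_model * n_caching_layers * bytes_per_elem

# Illustrative 32-layer model, d_model = 4096, 128k-token context, fp16.
N, D, L = 128_000, 4096, 32

standard = kv_cache_bytes(N, D, L)  # every layer caches: O(L*N*D)
yoco_u = kv_cache_bytes(N, D, 1)    # one global cache:   O(N*D)

print(f"standard: {standard / 2**30:.1f} GiB, YOCO-U: {yoco_u / 2**30:.1f} GiB")
```

Under these assumed dimensions the global cache is L times smaller than a standard Transformer's, which is what makes long-context serving cheap regardless of the recursion depth T.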
Q1
1. What is the key innovation of YOCO-U compared to the original YOCO architecture?
It adds more layers to both Self-Decoder and Cross-Decoder
It replaces the static Self-Decoder with a Universal Self-Decoder that performs recursive computation
It removes the Cross-Decoder and uses only self-attention
Q2
2. What is the main advantage of YOCO-U's recursive computation compared to Universal Transformer?
YOCO-U has fewer parameters
YOCO-U applies recursion only to shallow efficient-attention layers, keeping KV cache constant
YOCO-U uses full attention instead of sliding-window attention
Q3
3. According to the paper, what does YOCO-U achieve compared to the non-recursive YOCO baseline?
It requires more training tokens for comparable performance
It achieves lower loss at equal FLOPs and requires ~62% fewer training tokens
It doubles the KV cache size

Today's Reading Tips

Read Paper 1 first for its novel 'reasoning shift' phenomenon—a 50% reduction in reasoning tokens with irrelevant context—which has immediate practical implications for deploying LLMs in real-world applications. Papers 1 and 3 both address inference efficiency from different angles (test-time vs. architectural), while Paper 2's findings on terminal agents complement the efficiency theme by demonstrating cost-effective agent design in enterprise settings.