2026-02-26 Papers


Paper 1

On Data Engineering for Scaling LLM Terminal Capabilities

Published: 2026-02-24

Link: http://arxiv.org/pdf/2602.21193

1. 📘 Topic and Domain: The paper focuses on data engineering strategies for scaling terminal capabilities in large language models, specifically for command-line interface interactions.
2. 💡 Previous Research and New Ideas: The paper builds on existing dataset adaptation approaches and synthetic data generation techniques (like Evol-Instruct), proposing Terminal-Task-Gen, a novel dual-strategy pipeline combining dataset adaptation with skill-based synthetic task generation.
3. ❓ Problem: The paper addresses the data scarcity bottleneck in training terminal agents, where existing approaches lack transparency in training data strategies and face challenges in generating diverse, high-quality terminal interaction data at scale.
4. 🛠️ Methods: The authors use a two-pronged approach: (1) adapting existing math, code, and SWE datasets to terminal format, and (2) generating synthetic tasks through seed-based and skill-based methods using DeepSeek-V3.2 as the teacher model.
5. 📊 Results and Evaluation: Nemotron-Terminal models achieve substantial improvements on Terminal-Bench 2.0: 8B model improves from 2.5% to 13.0%, 14B from 4.0% to 20.2%, and 32B from 3.4% to 27.4%, matching significantly larger models' performance.
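The adaptation step in (1) can be sketched in code. Below is a minimal, hypothetical illustration of wrapping an existing math/code sample into a terminal-style task spec; the field names, image name, and instruction suffix are assumptions for illustration, not the paper's actual schema.

```python
# Hypothetical sketch of the dataset-adaptation step: wrapping a
# (question, answer) sample from an existing math/code/SWE dataset
# into a terminal task. All field names are illustrative assumptions.

INSTRUCTION_SUFFIX = (
    "Solve the task in the shell and write the final answer to /app/answer.txt."
)

def adapt_to_terminal_task(sample: dict, domain: str) -> dict:
    """Wrap a (question, answer) sample as a terminal task spec."""
    return {
        "domain": domain,                       # e.g. "math", "code", or "swe"
        "instruction": sample["question"] + "\n\n" + INSTRUCTION_SUFFIX,
        # Pre-built domain-specific image shared across tasks (assumed name):
        "docker_image": f"terminal-tasks/{domain}:latest",
        # The answer is verified by reading the file the agent writes:
        "verifier_cmd": "cat /app/answer.txt",
        "expected_output": str(sample["answer"]).strip(),
    }

task = adapt_to_terminal_task({"question": "What is 17 * 23?", "answer": 391}, "math")
print(task["instruction"])
```

The key design point this mirrors is that the environment is a shared pre-built image per domain, so adapting a new sample only requires rewriting the instruction and verifier, not building a new container.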

Figure: Terminal-Task-Gen data engineering pipeline.
• Dataset adaptation: math, code, and SWE datasets wrapped into Terminus 2 format with an instruction suffix.
• Synthetic task generation: seed-based and skill-based, over a skill taxonomy (algorithmic, systems, data), with DeepSeek-V3.2 performing task synthesis.
• Trajectory generation: Terminus 2 agent running in Docker environments from pre-built images.
• Post-processing: filtering and decontamination, comparing no filter vs. complete-only vs. success-only.
• Output: Terminal-Corpus SFT dataset; supervised fine-tuning turns Qwen3 into Nemotron-Terminal (2 epochs, LR 2e-5, 32K context length, batch size 128).
• Terminal-Bench 2.0 results: 8B 2.5%→13.0%, 32B 3.4%→27.4%.
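The three post-processing regimes compared above (no filter, complete-only, success-only) can be sketched as simple trajectory filters. This is an illustrative sketch only; the trajectory fields `completed` and `success` are assumed names, not the paper's schema.

```python
# Illustrative sketch of the three trajectory-filtering strategies the
# pipeline compares. "completed" = the agent finished without hitting a
# limit; "success" = the verifier accepted the result. Assumed field names.

def filter_trajectories(trajs: list[dict], strategy: str) -> list[dict]:
    if strategy == "no_filter":       # keep everything, including failures
        return trajs
    if strategy == "complete_only":   # drop runs that hit step/time limits
        return [t for t in trajs if t["completed"]]
    if strategy == "success_only":    # keep only verified-correct runs
        return [t for t in trajs if t["completed"] and t["success"]]
    raise ValueError(f"unknown strategy: {strategy}")

trajs = [
    {"completed": True,  "success": True},
    {"completed": True,  "success": False},
    {"completed": False, "success": False},
]
print([len(filter_trajectories(trajs, s))
       for s in ("no_filter", "complete_only", "success_only")])  # [3, 2, 1]
```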
Q1. What surprising finding did the authors discover about filtering strategies for synthetic task trajectories?
Keeping only successful trajectories led to the best performance
Removing all incomplete trajectories improved model accuracy by 50%
No filtering at all yielded significantly better performance than strict filtering
Q2. How does Nemotron-Terminal-32B's performance compare to much larger models on Terminal-Bench 2.0?
It achieves 27.4% accuracy, outperforming the 480B Qwen3-Coder's 23.9%
It reaches 15% accuracy, falling short of all models above 100B parameters
It matches GPT-5.2's performance at 54.0% accuracy
Q3. What key design decision enabled Terminal-Task-Gen to achieve efficient large-scale task generation?
Using multi-agent systems to coordinate task creation across domains
Generating unique Docker environments for each individual task
Using pre-built domain-specific Docker images shared across tasks

Paper 2

Test-Time Training with KV Binding Is Secretly Linear Attention

Published: 2026-02-24

Link: http://arxiv.org/pdf/2602.21204

1. 📘 Topic and Domain: The paper investigates Test-Time Training (TTT) with key-value binding in sequence modeling and transformer architectures.
2. 💡 Previous Research and New Ideas: Building on prior TTT work that treats it as online meta-learning for memorizing key-value mappings, the paper proposes that TTT is actually a form of learned linear attention operator rather than a memorization mechanism.
3. ❓ Problem: The paper addresses empirical contradictions in the prevailing memorization-based interpretation of TTT, such as gradient ascent preserving performance and distributional asymmetry between queries and keys.
4. 🛠️ Methods: The authors analytically derive that TTT variants (including multi-layer MLPs with momentum) can be rewritten as linear attention operators, and conduct empirical ablations on LaCT and ViTTT architectures.
5. 📊 Results and Evaluation: The linear attention reformulation enables a 4.0× inference speedup through parallelization while maintaining performance. It also shows that many TTT design choices (weight normalization, momentum) are redundant: simplified variants achieve comparable results across language modeling, novel view synthesis, and image classification.

Figure: TTT with KV binding, from memorization to linear attention.
• Traditional interpretation: TTT as storage-and-retrieval (online meta-learning, key-value memorization, query-based retrieval).
• Empirical contradictions: gradient ascent preserves performance; queries and keys are distributionally asymmetric; a better inner loss can yield worse performance; replacing Q with K causes no degradation.
• Mathematical analysis: unrolling the inner-loop updates of f(x) = φ(x; Θ)W yields o = φ(q)[S₀ + Σᵢ φ(kᵢ)ᵀvᵢ], the linear attention form.
• New interpretation: TTT as a learned, history-dependent linear attention operator performing feature mixing, with no explicit memorization.
• Practical benefits: simplification (redundant components removed), 4.0× inference speedup via parallelization, a unified framework, and an enlarged design space.
• Example implementations: LaCT (novel view synthesis and LLMs), ViTTT (image classification).
• Key finding: TTT is not test-time memorization but learned linear attention with enhanced representational capacity.
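The identity o = φ(q)[S₀ + Σ φ(kᵢ)ᵀvᵢ] above can be checked numerically. The sketch below takes φ as the identity map and a dot-product binding loss, then verifies that the sequential TTT inner loop and the parallel (causal, unnormalized) linear attention form produce the same outputs; shapes, step size, and the zero initial state are illustrative assumptions.

```python
import numpy as np

# Sketch of the paper's core identity: a TTT inner loop with a linear
# fast-weight model and a dot-product KV-binding loss reduces to linear
# attention. phi is taken as identity; S0 = 0 for simplicity.

rng = np.random.default_rng(0)
T, d = 8, 4
q = rng.normal(size=(T, d))
k = rng.normal(size=(T, d))
v = rng.normal(size=(T, d))
eta, S0 = 0.1, np.zeros((d, d))

# View 1: TTT inner loop. A gradient step on the loss -v_i @ (k_i @ S)
# is the rank-1 update S <- S + eta * k_i^T v_i; read out o_i = q_i @ S.
S = S0.copy()
o_ttt = np.empty((T, d))
for i in range(T):
    S += eta * np.outer(k[i], v[i])
    o_ttt[i] = q[i] @ S

# View 2: the same outputs written as causal, unnormalized linear attention:
# o_i = q_i @ [S0 + eta * sum_{j<=i} k_j^T v_j]
#     = eta * sum_{j<=i} (q_i . k_j) v_j
A = np.tril(q @ k.T)   # causally masked q.k score matrix
o_lin = eta * A @ v

print("TTT inner loop == linear attention:", np.allclose(o_ttt, o_lin))
```

Because View 2 has no sequential dependence between timesteps, it can be computed in parallel over the sequence, which is the source of the inference speedup the paper reports.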
Q1. What surprising empirical finding challenges the memorization-based interpretation of TTT?
Replacing gradient descent with gradient ascent in the inner loop maintains or even improves task performance
Increasing the number of inner-loop iterations always improves downstream task performance
Queries and keys must share the same distributional properties for TTT to function properly
Q2. According to the paper, what is the true nature of TTT's inner loop mechanism?
A sophisticated key-value storage system that memorizes associations at test time
A learned linear attention operator that performs structured mixing of queries, keys, and values
A meta-learning algorithm that requires deep MLPs for optimal performance
Q3. What practical benefit was achieved by reformulating TTT as linear attention?
The need for gradient orthogonalization and momentum was eliminated without performance loss
The inference throughput improved by up to 4.0× through parallel implementation
The model size was reduced by 50% while maintaining the same accuracy

Paper 3

SkillOrchestra: Learning to Route Agents via Skill Transfer

Published: 2026-02-23

Link: http://arxiv.org/pdf/2602.19672

1. 📘 Topic and Domain: The paper focuses on skill-aware orchestration for routing agents in compound AI systems, specifically addressing multi-turn model routing and agent orchestration in the domain of large language models (LLMs).
2. 💡 Previous Research and New Ideas: The paper builds on existing model routing approaches (heuristic, discriminative, and RL-based methods like Router-R1 and ToolOrchestra) but proposes SkillOrchestra, which learns fine-grained skills from execution experience and models agent-specific competence rather than directly learning routing policies end-to-end.
3. ❓ Problem: The paper addresses limitations in current routing approaches: input-level routers make coarse query-level decisions ignoring evolving task requirements, and RL-trained orchestrators are expensive to adapt and suffer from routing collapse (repeatedly invoking one strong but costly option).
4. 🛠️ Methods: SkillOrchestra learns a reusable Skill Handbook from execution traces containing mode-level insights, fine-grained skills, and agent profiles, then performs skill-grounded routing by selecting agents based on required skills and explicit performance-cost trade-offs at deployment time.
5. 📊 Results and Evaluation: SkillOrchestra outperforms state-of-the-art RL-based orchestrators by up to 22.5% while cutting learning cost by 700× relative to Router-R1 and 300× relative to ToolOrchestra. It achieves higher accuracy at lower cost across ten benchmarks and demonstrates better routing balance and transferability across orchestrator models.
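The skill-grounded routing step (Beta-distribution competence estimates combined with an explicit performance-cost trade-off) can be sketched as follows. Everything here is an illustrative assumption for the sketch, not the paper's actual implementation: the agent names, skill names, Beta parameters, the cost weight, and the choice to aggregate competence as a product of Beta means.

```python
import math

# Hedged sketch of skill-grounded routing: each agent's competence on each
# skill is modeled as a Beta(alpha, beta) posterior estimated from execution
# traces; the router picks the agent maximizing expected success over the
# currently active skills, minus a cost penalty. All numbers illustrative.

profiles = {  # per-agent: Beta params per skill + relative invocation cost
    "small-model": {"skills": {"arithmetic": (9, 1), "web-search": (2, 8)},
                    "cost": 1.0},
    "large-model": {"skills": {"arithmetic": (9, 2), "web-search": (8, 2)},
                    "cost": 10.0},
}

def route(active_skills, profiles, cost_weight=0.01):
    def score(agent):
        p = profiles[agent]
        # Expected competence: product of Beta means over required skills
        # (mean of Beta(a, b) is a / (a + b)).
        comp = math.prod(a / (a + b)
                         for a, b in (p["skills"][s] for s in active_skills))
        return comp - cost_weight * p["cost"]
    return max(profiles, key=score)

print(route(["arithmetic"], profiles))                # the cheap agent suffices
print(route(["arithmetic", "web-search"], profiles))  # needs the stronger agent
```

Because the cost term is explicit in the score, the router keeps using the cheap agent whenever its expected competence is adequate, which is the mechanism that avoids the routing-collapse behavior described above.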

Figure: SkillOrchestra overview.
• Skill Handbook learning: agent execution traces → skill discovery and refinement → a handbook containing mode-selection insights, fine-grained skills, agent profiles, and cost estimates.
• Handbook selection: candidate handbooks of different granularities (e.g. 98, 43, or 65 skills) compared by Pareto-optimal reward-vs-cost selection on held-out data.
• Deployment: at each timestep the orchestrator selects a mode, identifies the active skills, and routes to the optimal agent (model + tool).
• Skill discovery and refinement: contrast successful vs. failed trajectories, abstract capability gaps into reusable skills, estimate agent competence via Beta distributions, and merge/split skills based on performance variance.
• Granularity-aware selection: match skill detail to orchestrator capacity, balancing expressiveness against decision reliability and cost-performance trade-offs.
• Skill-grounded routing: choose a mode from execution insights, aggregate competence over the active skill set, and select the agent optimizing the performance-cost trade-off.
• Transferable knowledge: the handbook is decoupled from orchestrator parameters, reusable across backbones without retraining, and extensible to new models and tools.
• Headline result: 22.5% improvement with 700× learning-cost reduction vs. RL baselines.
Q1. What is 'routing collapse' as identified in the SkillOrchestra paper?
When the routing system crashes due to excessive computational load from managing too many models
The tendency of RL-trained orchestrators to repeatedly select a single strong but expensive model despite having better alternatives
The failure of skill discovery when agent profiles become too similar to distinguish between them
Q2. How does SkillOrchestra achieve its dramatic cost reduction compared to Router-R1?
By using smaller, less capable models that consume fewer tokens during inference
By learning a reusable Skill Handbook instead of expensive end-to-end RL training, enabling skill-based routing that better matches model capabilities to task requirements
By limiting the maximum number of routing steps to 4 turns instead of allowing unlimited iterations
Q3. What unique capability does the Skill Handbook provide that traditional routing methods lack?
The ability to transfer learned orchestration knowledge across different orchestrator backbones without retraining
Support for more than 10 different language models in the routing pool
Automatic code generation for solving mathematical problems