2026-02-26 Papers


Paper 1

On Data Engineering for Scaling LLM Terminal Capabilities

Published: 2026-02-24

Link: http://arxiv.org/pdf/2602.21193

1. 📘 Topic and Domain: The paper focuses on data engineering strategies for scaling terminal capabilities in large language models, specifically for command-line interface interactions.
2. 💡 Previous Research and New Ideas: The paper builds on existing dataset adaptation approaches and synthetic data generation techniques (like Evol-Instruct), proposing Terminal-Task-Gen, a novel dual-strategy pipeline combining dataset adaptation with skill-based synthetic task generation.
3. ❓ Problem: The paper addresses the data scarcity bottleneck in training terminal agents, where existing approaches lack transparency in training data strategies and face challenges in generating diverse, high-quality terminal interaction data at scale.
4. 🛠️ Methods: The authors use a two-pronged approach: (1) adapting existing math, code, and SWE datasets to terminal format, and (2) generating synthetic tasks through seed-based and skill-based methods using DeepSeek-V3.2 as the teacher model.
5. 📊 Results and Evaluation: Nemotron-Terminal models achieve substantial improvements on Terminal-Bench 2.0: 8B model improves from 2.5% to 13.0%, 14B from 4.0% to 20.2%, and 32B from 3.4% to 27.4%, matching significantly larger models' performance.
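The adaptation step in (1) can be sketched in code. Below is a minimal, hypothetical illustration of wrapping an existing math/code sample into a terminal-style task spec; the field names, image name, and instruction suffix are assumptions for illustration, not the paper's actual schema.

```python
# Hypothetical sketch of the dataset-adaptation step: wrapping a
# (question, answer) sample from an existing math/code/SWE dataset
# into a terminal task. All field names are illustrative assumptions.

INSTRUCTION_SUFFIX = (
    "Solve the task in the shell and write the final answer to /app/answer.txt."
)

def adapt_to_terminal_task(sample: dict, domain: str) -> dict:
    """Wrap a (question, answer) sample as a terminal task spec."""
    return {
        "domain": domain,                       # e.g. "math", "code", or "swe"
        "instruction": sample["question"] + "\n\n" + INSTRUCTION_SUFFIX,
        # Pre-built domain-specific image shared across tasks (assumed name):
        "docker_image": f"terminal-tasks/{domain}:latest",
        # The answer is verified by reading the file the agent writes:
        "verifier_cmd": "cat /app/answer.txt",
        "expected_output": str(sample["answer"]).strip(),
    }

task = adapt_to_terminal_task({"question": "What is 17 * 23?", "answer": 391}, "math")
print(task["instruction"])
```

The key design point this mirrors is that the environment is a shared pre-built image per domain, so adapting a new sample only requires rewriting the instruction and verifier, not building a new container.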

Figure: Terminal-Task-Gen data engineering pipeline.
• Dataset adaptation: math, code, and SWE datasets wrapped into Terminus 2 format with an instruction suffix.
• Synthetic task generation: seed-based and skill-based, over a skill taxonomy (algorithmic, systems, data), with DeepSeek-V3.2 performing task synthesis.
• Trajectory generation: Terminus 2 agent running in Docker environments from pre-built images.
• Post-processing: filtering and decontamination, comparing no filter vs. complete-only vs. success-only.
• Output: Terminal-Corpus SFT dataset; supervised fine-tuning turns Qwen3 into Nemotron-Terminal (2 epochs, LR 2e-5, 32K context length, batch size 128).
• Terminal-Bench 2.0 results: 8B 2.5%→13.0%, 32B 3.4%→27.4%.
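The three post-processing regimes compared above (no filter, complete-only, success-only) can be sketched as simple trajectory filters. This is an illustrative sketch only; the trajectory fields `completed` and `success` are assumed names, not the paper's schema.

```python
# Illustrative sketch of the three trajectory-filtering strategies the
# pipeline compares. "completed" = the agent finished without hitting a
# limit; "success" = the verifier accepted the result. Assumed field names.

def filter_trajectories(trajs: list[dict], strategy: str) -> list[dict]:
    if strategy == "no_filter":       # keep everything, including failures
        return trajs
    if strategy == "complete_only":   # drop runs that hit step/time limits
        return [t for t in trajs if t["completed"]]
    if strategy == "success_only":    # keep only verified-correct runs
        return [t for t in trajs if t["completed"] and t["success"]]
    raise ValueError(f"unknown strategy: {strategy}")

trajs = [
    {"completed": True,  "success": True},
    {"completed": True,  "success": False},
    {"completed": False, "success": False},
]
print([len(filter_trajectories(trajs, s))
       for s in ("no_filter", "complete_only", "success_only")])  # [3, 2, 1]
```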
Q1. What surprising finding did the authors discover about filtering strategies for synthetic task trajectories?
Keeping only successful trajectories led to the best performance
Removing all incomplete trajectories improved model accuracy by 50%
No filtering at all yielded significantly better performance than strict filtering
Q2. How does Nemotron-Terminal-32B's performance compare to much larger models on Terminal-Bench 2.0?
It achieves 27.4% accuracy, outperforming the 480B Qwen3-Coder's 23.9%
It reaches 15% accuracy, falling short of all models above 100B parameters
It matches GPT-5.2's performance at 54.0% accuracy
Q3. What key design decision enabled Terminal-Task-Gen to achieve efficient large-scale task generation?
Using multi-agent systems to coordinate task creation across domains
Generating unique Docker environments for each individual task
Using pre-built domain-specific Docker images shared across tasks

Paper 2

Test-Time Training with KV Binding Is Secretly Linear Attention

Published: 2026-02-24

Link: http://arxiv.org/pdf/2602.21204

1. 📘 Topic and Domain: The paper investigates Test-Time Training (TTT) with key-value binding in sequence modeling and transformer architectures.
2. 💡 Previous Research and New Ideas: Building on prior TTT work that treats it as online meta-learning for memorizing key-value mappings, the paper proposes that TTT is actually a form of learned linear attention operator rather than a memorization mechanism.
3. ❓ Problem: The paper addresses empirical contradictions in the prevailing memorization-based interpretation of TTT, such as gradient ascent preserving performance and distributional asymmetry between queries and keys.
4. 🛠️ Methods: The authors analytically derive that TTT variants (including multi-layer MLPs with momentum) can be rewritten as linear attention operators, and conduct empirical ablations on LaCT and ViTTT architectures.
5. 📊 Results and Evaluation: The linear attention reformulation enables a 4.0× inference speedup through parallelization while maintaining performance. It also shows that many TTT design choices (weight normalization, momentum) are redundant: simplified variants achieve comparable results across language modeling, novel view synthesis, and image classification.

Figure: TTT with KV binding, from memorization to linear attention.
• Traditional interpretation: TTT as storage-and-retrieval (online meta-learning, key-value memorization, query-based retrieval).
• Empirical contradictions: gradient ascent preserves performance; queries and keys are distributionally asymmetric; a better inner loss can yield worse performance; replacing Q with K causes no degradation.
• Mathematical analysis: unrolling the inner-loop updates of f(x) = φ(x; Θ)W yields o = φ(q)[S₀ + Σᵢ φ(kᵢ)ᵀvᵢ], the linear attention form.
• New interpretation: TTT as a learned, history-dependent linear attention operator performing feature mixing, with no explicit memorization.
• Practical benefits: simplification (redundant components removed), 4.0× inference speedup via parallelization, a unified framework, and an enlarged design space.
• Example implementations: LaCT (novel view synthesis and LLMs), ViTTT (image classification).
• Key finding: TTT is not test-time memorization but learned linear attention with enhanced representational capacity.
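The identity o = φ(q)[S₀ + Σ φ(kᵢ)ᵀvᵢ] above can be checked numerically. The sketch below takes φ as the identity map and a dot-product binding loss, then verifies that the sequential TTT inner loop and the parallel (causal, unnormalized) linear attention form produce the same outputs; shapes, step size, and the zero initial state are illustrative assumptions.

```python
import numpy as np

# Sketch of the paper's core identity: a TTT inner loop with a linear
# fast-weight model and a dot-product KV-binding loss reduces to linear
# attention. phi is taken as identity; S0 = 0 for simplicity.

rng = np.random.default_rng(0)
T, d = 8, 4
q = rng.normal(size=(T, d))
k = rng.normal(size=(T, d))
v = rng.normal(size=(T, d))
eta, S0 = 0.1, np.zeros((d, d))

# View 1: TTT inner loop. A gradient step on the loss -v_i @ (k_i @ S)
# is the rank-1 update S <- S + eta * k_i^T v_i; read out o_i = q_i @ S.
S = S0.copy()
o_ttt = np.empty((T, d))
for i in range(T):
    S += eta * np.outer(k[i], v[i])
    o_ttt[i] = q[i] @ S

# View 2: the same outputs written as causal, unnormalized linear attention:
# o_i = q_i @ [S0 + eta * sum_{j<=i} k_j^T v_j]
#     = eta * sum_{j<=i} (q_i . k_j) v_j
A = np.tril(q @ k.T)   # causally masked q.k score matrix
o_lin = eta * A @ v

print("TTT inner loop == linear attention:", np.allclose(o_ttt, o_lin))
```

Because View 2 has no sequential dependence between timesteps, it can be computed in parallel over the sequence, which is the source of the inference speedup the paper reports.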
Q1. What surprising empirical finding challenges the memorization-based interpretation of TTT?
Replacing gradient descent with gradient ascent in the inner loop maintains or even improves task performance
Increasing the number of inner-loop iterations always improves downstream task performance
Queries and keys must share the same distributional properties for TTT to function properly
Q2. According to the paper, what is the true nature of TTT's inner loop mechanism?
A sophisticated key-value storage system that memorizes associations at test time
A learned linear attention operator that performs structured mixing of queries, keys, and values
A meta-learning algorithm that requires deep MLPs for optimal performance
Q3. What practical benefit was achieved by reformulating TTT as linear attention?
The need for gradient orthogonalization and momentum was eliminated without performance loss
The inference throughput improved by up to 4.0× through parallel implementation
The model size was reduced by 50% while maintaining the same accuracy

Paper 3

SkillOrchestra: Learning to Route Agents via Skill Transfer

Published: 2026-02-23

Link: http://arxiv.org/pdf/2602.19672

1. 📘 Topic and Domain: The paper focuses on skill-aware orchestration for routing agents in compound AI systems, specifically addressing multi-turn model routing and agent orchestration in the domain of large language models (LLMs).
2. 💡 Previous Research and New Ideas: The paper builds on existing model routing approaches (heuristic, discriminative, and RL-based methods like Router-R1 and ToolOrchestra) but proposes SkillOrchestra, which learns fine-grained skills from execution experience and models agent-specific competence rather than directly learning routing policies end-to-end.
3. ❓ Problem: The paper addresses limitations in current routing approaches: input-level routers make coarse query-level decisions ignoring evolving task requirements, and RL-trained orchestrators are expensive to adapt and suffer from routing collapse (repeatedly invoking one strong but costly option).
4. 🛠️ Methods: SkillOrchestra learns a reusable Skill Handbook from execution traces containing mode-level insights, fine-grained skills, and agent profiles, then performs skill-grounded routing by selecting agents based on required skills and explicit performance-cost trade-offs at deployment time.
5. 📊 Results and Evaluation: SkillOrchestra outperforms state-of-the-art RL-based orchestrators by up to 22.5% while cutting learning cost by 700× relative to Router-R1 and 300× relative to ToolOrchestra. It achieves higher accuracy at lower cost across ten benchmarks and demonstrates better routing balance and transferability across orchestrator models.
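The skill-grounded routing step (Beta-distribution competence estimates combined with an explicit performance-cost trade-off) can be sketched as follows. Everything here is an illustrative assumption for the sketch, not the paper's actual implementation: the agent names, skill names, Beta parameters, the cost weight, and the choice to aggregate competence as a product of Beta means.

```python
import math

# Hedged sketch of skill-grounded routing: each agent's competence on each
# skill is modeled as a Beta(alpha, beta) posterior estimated from execution
# traces; the router picks the agent maximizing expected success over the
# currently active skills, minus a cost penalty. All numbers illustrative.

profiles = {  # per-agent: Beta params per skill + relative invocation cost
    "small-model": {"skills": {"arithmetic": (9, 1), "web-search": (2, 8)},
                    "cost": 1.0},
    "large-model": {"skills": {"arithmetic": (9, 2), "web-search": (8, 2)},
                    "cost": 10.0},
}

def route(active_skills, profiles, cost_weight=0.01):
    def score(agent):
        p = profiles[agent]
        # Expected competence: product of Beta means over required skills
        # (mean of Beta(a, b) is a / (a + b)).
        comp = math.prod(a / (a + b)
                         for a, b in (p["skills"][s] for s in active_skills))
        return comp - cost_weight * p["cost"]
    return max(profiles, key=score)

print(route(["arithmetic"], profiles))                # the cheap agent suffices
print(route(["arithmetic", "web-search"], profiles))  # needs the stronger agent
```

Because the cost term is explicit in the score, the router keeps using the cheap agent whenever its expected competence is adequate, which is the mechanism that avoids the routing-collapse behavior described above.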

Figure: SkillOrchestra overview.
• Skill Handbook learning: agent execution traces → skill discovery and refinement → a handbook containing mode-selection insights, fine-grained skills, agent profiles, and cost estimates.
• Handbook selection: candidate handbooks of different granularities (e.g. 98, 43, or 65 skills) compared by Pareto-optimal reward-vs-cost selection on held-out data.
• Deployment: at each timestep the orchestrator selects a mode, identifies the active skills, and routes to the optimal agent (model + tool).
• Skill discovery and refinement: contrast successful vs. failed trajectories, abstract capability gaps into reusable skills, estimate agent competence via Beta distributions, and merge/split skills based on performance variance.
• Granularity-aware selection: match skill detail to orchestrator capacity, balancing expressiveness against decision reliability and cost-performance trade-offs.
• Skill-grounded routing: choose a mode from execution insights, aggregate competence over the active skill set, and select the agent optimizing the performance-cost trade-off.
• Transferable knowledge: the handbook is decoupled from orchestrator parameters, reusable across backbones without retraining, and extensible to new models and tools.
• Headline result: 22.5% improvement with 700× learning-cost reduction vs. RL baselines.
Q1. What is 'routing collapse' as identified in the SkillOrchestra paper?
When the routing system crashes due to excessive computational load from managing too many models
The tendency of RL-trained orchestrators to repeatedly select a single strong but expensive model despite having better alternatives
The failure of skill discovery when agent profiles become too similar to distinguish between them
Q2. How does SkillOrchestra achieve its dramatic cost reduction compared to Router-R1?
By using smaller, less capable models that consume fewer tokens during inference
By learning a reusable Skill Handbook instead of expensive end-to-end RL training, enabling skill-based routing that better matches model capabilities to task requirements
By limiting the maximum number of routing steps to 4 turns instead of allowing unlimited iterations
Q3. What unique capability does the Skill Handbook provide that traditional routing methods lack?
The ability to transfer learned orchestration knowledge across different orchestrator backbones without retraining
Support for more than 10 different language models in the routing pool
Automatic code generation for solving mathematical problems