2025-10-08 Papers


Paper 1

TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning

Published: 2025-10-07

Link: http://arxiv.org/pdf/2510.06217

1. 📘 Topic and Domain: This paper focuses on developing a tool-grounded Process Reward Model (PRM) called TaTToo for improving the tabular reasoning capabilities of large language models.
2. 💡 Previous Research and New Ideas: It builds upon existing PRM frameworks but introduces novel table-specific supervision and tool integration, addressing limitations of current PRMs that struggle with table operations.
3. ❓ Problem: The paper aims to solve the challenge of providing reliable step-level supervision for large reasoning models when performing table-based operations, as existing PRMs fail to effectively verify table retrieval and schema interaction steps.
4. 🛠️ Methods: The authors develop a dual-stage training approach combining supervised fine-tuning and reinforcement learning, using a curated dataset of 60k high-quality step-level annotations with integrated tool-based verification.
5. 📊 Results and Evaluation: TaTToo improves downstream policy models by 30.9% across 5 tabular reasoning benchmarks, outperforming larger models such as Qwen-2.5-Math-PRM-72B while using only 8B parameters, and generalizes well across different test-time scaling strategies.
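The best-of-N test-time scaling described above can be illustrated with a minimal sketch: a step-level PRM assigns a reward to each reasoning step of every candidate trajectory, and the candidate with the highest aggregate score is selected. All names here (`prm_score`, `best_of_n`, the candidate format) are illustrative stand-ins, not the paper's actual API.

```python
# Minimal best-of-N selection sketch: a step-level PRM scores each
# candidate reasoning trajectory, and the highest-scoring one wins.
# `prm_score` is a toy stand-in for a learned reward model like TaTToo.

def prm_score(step_rewards):
    """Aggregate per-step rewards in [0, 1] into one trajectory score."""
    return sum(step_rewards) / len(step_rewards)

def best_of_n(candidates):
    """Pick the trajectory whose step-level scores aggregate highest."""
    return max(candidates, key=lambda c: prm_score(c["step_rewards"]))

candidates = [
    {"answer": "42",  "step_rewards": [0.9, 0.4, 0.7]},   # shaky middle step
    {"answer": "108", "step_rewards": [0.8, 0.9, 0.85]},  # consistently strong
    {"answer": "7",   "step_rewards": [0.3, 0.2, 0.9]},   # weak early steps
]
print(best_of_n(candidates)["answer"])  # -> 108
```

Step-level aggregation is what lets a PRM reject a trajectory whose final answer looks plausible but whose intermediate table operations were wrong.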


Overview (from the paper's summary figure):

• Problem analysis: existing PRMs fail on table-specific operations. Error categories: Table Retrieval (47.7%), Schema Interaction (34.3%), Inner-thinking (12.0%).
• Data curation pipeline: trajectory generation → verification synthesis → tool-use synthesis, yielding 60k high-quality training instances.
• Dual-stage training: supervised fine-tuning (tool-use patterns), then RL policy optimization (reward shaping).
• Tool-grounded reward shaping: table-aware reward design with r_i,rea (inner reasoning) and r_i,tab (table operations); external tools include computation tools (Python, SQL) and table lookup tools (DataFrame APIs).
• Test-time scaling strategies: Best-of-N, Beam Search, and DVTS, with policy improvement via TaTToo.
• Evaluation results: 30.9% improvement on downstream LRMs; outperforms 72B PRMs with only 8B parameters; evaluated on 5 tabular reasoning benchmarks (TB-NR, TB-FC, TB-DA, WTQ, MMQA).
• Key contributions: tool-integrated verification, table-aware reward design, dual-stage training paradigm, strong generalizability.
• Theoretical foundation: a policy improvement lower bound (Theorem 4.1), V^{π'}(s_i) − V^{π}(s_i) ≳ Var[r_{i,tab}] + Var[r_{i,rea}] + alignment terms; the decomposable reward design enables additive policy improvement.
• Summary: TaTToo is an 8B-parameter tool-grounded PRM providing robust step-level supervision for tabular reasoning tasks.
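The decomposable step reward (an inner-reasoning term r_i,rea plus a tool-verified table term r_i,tab) can be sketched as follows. The "tool" here is a plain dict lookup standing in for the paper's Python/SQL/DataFrame verification tools, and all function names and the step format are hypothetical.

```python
# Sketch of the decomposable step reward r_i = r_i,rea + r_i,tab.
# A dict lookup stands in for tool-grounded table verification.

table = {("Alice", "score"): 91, ("Bob", "score"): 78}

def r_tab(step):
    """Tool-grounded check: did the step retrieve the right cell value?"""
    key = (step["row"], step["col"])
    return 1.0 if table.get(key) == step["claimed_value"] else 0.0

def r_rea(step):
    """Toy inner-reasoning score in [0, 1] (would come from the PRM)."""
    return step["reasoning_conf"]

def step_reward(step):
    return r_rea(step) + r_tab(step)

good = {"row": "Alice", "col": "score", "claimed_value": 91, "reasoning_conf": 0.8}
bad  = {"row": "Bob",   "col": "score", "claimed_value": 99, "reasoning_conf": 0.8}
print(step_reward(good), step_reward(bad))
```

The point of the decomposition is visible even in this toy: the two steps have identical reasoning confidence, but only the tool check separates the correct table lookup from the hallucinated one.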
Q1. What is the main limitation of existing Process Reward Models (PRMs) that TaTToo aims to address?
a) Their inability to handle mathematical calculations
b) Their failure to effectively verify table retrieval and schema interaction steps
c) Their large model size and computational requirements
Q2. How does TaTToo achieve better performance with fewer parameters compared to larger models?
a) By using a simpler architecture with fewer layers
b) By incorporating external tools and table-aware reward supervision
c) By training on a much larger dataset
Q3. What is unique about TaTToo's training approach?
a) It uses only supervised learning with human annotations
b) It relies solely on reinforcement learning
c) It combines supervised fine-tuning with reinforcement learning and tool integration

Paper 2

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

Published: 2025-10-06

Link: http://arxiv.org/pdf/2510.04618

1. 📘 Topic and Domain: The paper focuses on context engineering for large language models (LLMs), specifically developing evolving contexts to improve LLM performance and self-improvement capabilities.
2. 💡 Previous Research and New Ideas: Building on Dynamic Cheatsheet's adaptive memory approach, the paper proposes ACE (Agentic Context Engineering), which introduces structured context updates and preserves detailed domain knowledge rather than compressing it.
3. ❓ Problem: The paper addresses two key limitations in existing context adaptation methods: brevity bias (oversimplifying contexts) and context collapse (degradation of information during iterative rewrites).
4. 🛠️ Methods: ACE uses a three-component system (Generator, Reflector, Curator) with incremental delta updates and a grow-and-refine mechanism to maintain comprehensive, evolving contexts without losing detailed knowledge.
5. 📊 Results and Evaluation: ACE achieved significant improvements over baselines: +10.6% on agents and +8.6% on finance benchmarks, while reducing adaptation latency by 86.9% and matching top-ranked production-level agents despite using smaller models.
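The grow-and-refine mechanism mentioned above can be sketched with a redundancy-control pass: entries keep accumulating, but near-duplicates are pruned. Here a crude textual-similarity check via `difflib` stands in for semantic deduplication (the paper does not specify this implementation; the function name and threshold are assumptions).

```python
# Sketch of a grow-and-refine pass: keep adding playbook entries, but
# prune near-duplicates. difflib's SequenceMatcher is a crude textual
# stand-in for embedding-based semantic deduplication.
import difflib

def refine(playbook, threshold=0.9):
    """Keep an entry only if it is not too similar to any kept entry."""
    kept = []
    for entry in playbook:
        if all(difflib.SequenceMatcher(None, entry, k).ratio() < threshold
               for k in kept):
            kept.append(entry)
    return kept

playbook = [
    "check column units before summing",
    "check the column units before summing",   # near-duplicate, pruned
    "prefer vectorized pandas operations",
]
print(refine(playbook))
```

Pruning only near-duplicates (rather than rewriting or compressing the whole context) is what lets the playbook grow without the brevity bias the paper criticizes.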


Overview (from the paper's summary figure):

• ACE workflow: input query & context → Generator (produces reasoning trajectories) → Reflector (critiques trajectories and extracts insights and lessons) → Curator (synthesizes insights into delta context entries) → context playbook (an evolving base of strategies and knowledge), refined iteratively.
• Incremental delta updates: structured bullet points, localized modifications, knowledge preservation, collapse prevention, parallel merging.
• Grow-and-refine: adaptive expansion, redundancy control, semantic deduplication, relevance maintenance, scalable contexts.
• Multi-epoch adaptation: progressive refinement and iterative improvement that strengthen contexts, improve generalization, and enable self-improvement.
• Key benefits: 86.9% lower adaptation latency, reduced rollout cost, no labeled supervision required, comprehensive playbooks, self-improving LLMs.
• Performance results: +10.6% on agent benchmarks and +8.6% on finance benchmarks; matches a GPT-4.1 agent while using a smaller model; consistent improvements across agent and domain-specific benchmarks.
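The Generator → Reflector → Curator loop with incremental delta updates can be sketched as below. Each role is a toy function (in the paper they are LLM calls), and the exact data passed between roles is an assumption made for illustration.

```python
# Sketch of ACE's Generator -> Reflector -> Curator loop.
# The context playbook grows by localized delta entries instead of
# being rewritten wholesale, which is what prevents context collapse.

def generator(query, context):
    """Produce a (toy) trajectory using the current context playbook."""
    return {"query": query, "context_used": list(context)}

def reflector(trajectory, feedback):
    """Critique the trajectory and extract an insight from feedback."""
    return f"lesson: {feedback}"

def curator(context, insight):
    """Merge the insight as a delta entry; skip exact duplicates."""
    if insight not in context:      # dedup stand-in for semantic matching
        context.append(insight)     # localized, incremental update
    return context

context = ["always check units"]    # the evolving playbook
for query, feedback in [("q1", "verify totals"), ("q2", "verify totals")]:
    traj = generator(query, context)
    insight = reflector(traj, feedback)
    context = curator(context, insight)

print(context)
```

Note that the second iteration's insight is deduplicated rather than appended, so repeated lessons do not bloat the playbook while earlier knowledge is never discarded.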
Q1. What was the key innovation in ACE's approach to handling context compared to previous methods?
a) Using completely compressed contexts to maximize efficiency
b) Treating contexts as evolving playbooks that accumulate knowledge
c) Eliminating all previous context after each update
Q2. Which performance metric did ACE significantly improve according to the paper?
a) Reduced model size by 86.9%
b) Increased memory efficiency by 10.6%
c) Reduced adaptation latency by 86.9%
Q3. What are the two key limitations of existing context adaptation methods that ACE addresses?
a) High computational cost and memory usage
b) Slow processing speed and poor accuracy
c) Brevity bias and context collapse

Paper 3

Less is More: Recursive Reasoning with Tiny Networks

Published: 2025-10-06

Link: http://arxiv.org/pdf/2510.04871

1. 📘 Topic and Domain: The paper focuses on recursive reasoning models for solving complex puzzle tasks using small neural networks in the domain of machine learning and artificial intelligence.
2. 💡 Previous Research and New Ideas: Based on the Hierarchical Reasoning Model (HRM), the paper proposes a simpler Tiny Recursive Model (TRM) that uses a single tiny network instead of two networks recursing at different frequencies.
3. ❓ Problem: The paper aims to solve the challenge of achieving high performance on complex puzzle tasks (like Sudoku, Maze, ARC-AGI) with minimal parameters while avoiding the complexity and theoretical requirements of existing approaches.
4. 🛠️ Methods: TRM uses a single tiny 2-layer network that recursively improves its latent reasoning feature and predicted answer through multiple supervision steps, incorporating exponential moving average and simplified adaptive computational time.
5. 📊 Results and Evaluation: TRM achieved better results than HRM and large language models on multiple benchmarks while using fewer parameters (7M vs 27M), including 87.4% accuracy on Sudoku-Extreme, 85.3% on Maze-Hard, 44.6% on ARC-AGI-1, and 7.8% on ARC-AGI-2.


Overview (from the paper's summary figure):

• TRM workflow: input question x → input embedding f_I(x) → initial states y_init, z_init.
• Deep supervision loop (up to N_sup = 16 steps): recursive reasoning runs T−1 times without gradients plus 1 time with gradients. Each pass performs latent recursion (n = 6 times), z = net(x, y, z), with a tiny 2-layer network, followed by an answer update, y = net(y, z), using the same network.
• Output and training: output head ŷ = argmax(f_O(y)); a simplified ACT mechanism makes the halting decision; the loss is cross-entropy plus a halting loss; the final prediction is the progressively improved answer.
• Key features: a single tiny 2-layer network, only 7M parameters, no fixed-point theorem required, full gradient backpropagation, simplified ACT, EMA for stability, progressive improvement.
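The control flow above (latent recursion on z, then an answer update on y, repeated under deep supervision) can be sketched numerically. The real TRM uses a shared 2-layer network; here `net` is a toy linear contraction chosen only so the recursion visibly converges, and all constants are illustrative.

```python
# Sketch of TRM's recursion: one shared function plays both roles,
# refining latent z against (x, y), then updating the answer y from (y, z).
# `net` is a toy contraction standing in for the tiny 2-layer network.

def net(a, b, c=0.0):
    """Toy stand-in for the shared tiny network (a fixed linear mix)."""
    return 0.5 * a + 0.3 * b + 0.2 * c

def trm_step(x, y, z, n=6):
    for _ in range(n):          # latent recursion: z = net(x, y, z)
        z = net(x, y, z)
    y = net(y, z)               # answer update: y = net(y, z), same net
    return y, z

x, y, z = 1.0, 0.0, 0.0
for _ in range(16):             # deep supervision loop (N_sup = 16)
    y, z = trm_step(x, y, z)

print(round(y, 3))
```

Even this toy shows the key structural idea: depth comes from recursing a single small function many times, not from stacking more layers.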
Q1. What is the main advantage of TRM over HRM according to the paper?
a) It uses more complex biological arguments
b) It achieves better results with fewer parameters and a simpler architecture
c) It requires more forward passes during training
Q2. On which dataset did TRM achieve its most significant improvement over HRM?
a) Sudoku-Extreme (from 55% to 87.4%)
b) ARC-AGI-2 (from 5% to 7.8%)
c) Maze-Hard (from 74.5% to 85.3%)
Q3. Why did the authors choose to use only 2 layers in TRM?
a) To reduce computational cost
b) Because it was inspired by biological neural networks
c) Because fewer layers actually improved generalization due to less overfitting