2025-10-08 Papers


Paper 1

TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning

Published: 2025-10-07

Link: http://arxiv.org/pdf/2510.06217

1. 📘 Topic and Domain: This paper focuses on developing a tool-grounded Process Reward Model (PRM) called TaTToo for improving the tabular reasoning capabilities of large language models.
2. 💡 Previous Research and New Ideas: It builds upon existing PRM frameworks but introduces novel table-specific supervision and tool integration, addressing limitations of current PRMs that struggle with table operations.
3. ❓ Problem: The paper aims to solve the challenge of providing reliable step-level supervision for large reasoning models when performing table-based operations, as existing PRMs fail to effectively verify table retrieval and schema interaction steps.
4. 🛠️ Methods: The authors develop a dual-stage training approach combining supervised fine-tuning and reinforcement learning, using a curated dataset of 60k high-quality step-level annotations with integrated tool-based verification.
5. 📊 Results and Evaluation: TaTToo improves downstream policy models by 30.9% across 5 tabular reasoning benchmarks, outperforming larger models such as Qwen-2.5-Math-PRM-72B while using only 8B parameters, and generalizes well across different test-time scaling strategies.
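The best-of-N test-time scaling described above can be illustrated with a minimal sketch: a step-level PRM assigns a reward to each reasoning step of every candidate trajectory, and the candidate with the highest aggregate score is selected. All names here (`prm_score`, `best_of_n`, the candidate format) are illustrative stand-ins, not the paper's actual API.

```python
# Minimal best-of-N selection sketch: a step-level PRM scores each
# candidate reasoning trajectory, and the highest-scoring one wins.
# `prm_score` is a toy stand-in for a learned reward model like TaTToo.

def prm_score(step_rewards):
    """Aggregate per-step rewards in [0, 1] into one trajectory score."""
    return sum(step_rewards) / len(step_rewards)

def best_of_n(candidates):
    """Pick the trajectory whose step-level scores aggregate highest."""
    return max(candidates, key=lambda c: prm_score(c["step_rewards"]))

candidates = [
    {"answer": "42",  "step_rewards": [0.9, 0.4, 0.7]},   # shaky middle step
    {"answer": "108", "step_rewards": [0.8, 0.9, 0.85]},  # consistently strong
    {"answer": "7",   "step_rewards": [0.3, 0.2, 0.9]},   # weak early steps
]
print(best_of_n(candidates)["answer"])  # -> 108
```

Step-level aggregation is what lets a PRM reject a trajectory whose final answer looks plausible but whose intermediate table operations were wrong.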


Overview (from the paper's summary figure):

• Problem analysis: existing PRMs fail on table-specific operations. Error categories: Table Retrieval (47.7%), Schema Interaction (34.3%), Inner-thinking (12.0%).
• Data curation pipeline: trajectory generation → verification synthesis → tool-use synthesis, yielding 60k high-quality training instances.
• Dual-stage training: supervised fine-tuning (tool-use patterns), then RL policy optimization (reward shaping).
• Tool-grounded reward shaping: table-aware reward design with r_i,rea (inner reasoning) and r_i,tab (table operations); external tools include computation tools (Python, SQL) and table lookup tools (DataFrame APIs).
• Test-time scaling strategies: Best-of-N, Beam Search, and DVTS, with policy improvement via TaTToo.
• Evaluation results: 30.9% improvement on downstream LRMs; outperforms 72B PRMs with only 8B parameters; evaluated on 5 tabular reasoning benchmarks (TB-NR, TB-FC, TB-DA, WTQ, MMQA).
• Key contributions: tool-integrated verification, table-aware reward design, dual-stage training paradigm, strong generalizability.
• Theoretical foundation: a policy improvement lower bound (Theorem 4.1), V^{π'}(s_i) − V^{π}(s_i) ≳ Var[r_{i,tab}] + Var[r_{i,rea}] + alignment terms; the decomposable reward design enables additive policy improvement.
• Summary: TaTToo is an 8B-parameter tool-grounded PRM providing robust step-level supervision for tabular reasoning tasks.
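The decomposable step reward (an inner-reasoning term r_i,rea plus a tool-verified table term r_i,tab) can be sketched as follows. The "tool" here is a plain dict lookup standing in for the paper's Python/SQL/DataFrame verification tools, and all function names and the step format are hypothetical.

```python
# Sketch of the decomposable step reward r_i = r_i,rea + r_i,tab.
# A dict lookup stands in for tool-grounded table verification.

table = {("Alice", "score"): 91, ("Bob", "score"): 78}

def r_tab(step):
    """Tool-grounded check: did the step retrieve the right cell value?"""
    key = (step["row"], step["col"])
    return 1.0 if table.get(key) == step["claimed_value"] else 0.0

def r_rea(step):
    """Toy inner-reasoning score in [0, 1] (would come from the PRM)."""
    return step["reasoning_conf"]

def step_reward(step):
    return r_rea(step) + r_tab(step)

good = {"row": "Alice", "col": "score", "claimed_value": 91, "reasoning_conf": 0.8}
bad  = {"row": "Bob",   "col": "score", "claimed_value": 99, "reasoning_conf": 0.8}
print(step_reward(good), step_reward(bad))
```

The point of the decomposition is visible even in this toy: the two steps have identical reasoning confidence, but only the tool check separates the correct table lookup from the hallucinated one.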
Q1. What is the main limitation of existing Process Reward Models (PRMs) that TaTToo aims to address?
a) Their inability to handle mathematical calculations
b) Their failure to effectively verify table retrieval and schema interaction steps
c) Their large model size and computational requirements
Q2. How does TaTToo achieve better performance with fewer parameters compared to larger models?
a) By using a simpler architecture with fewer layers
b) By incorporating external tools and table-aware reward supervision
c) By training on a much larger dataset
Q3. What is unique about TaTToo's training approach?
a) It uses only supervised learning with human annotations
b) It relies solely on reinforcement learning
c) It combines supervised fine-tuning with reinforcement learning and tool integration

Paper 2

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

Published: 2025-10-06

Link: http://arxiv.org/pdf/2510.04618

1. 📘 Topic and Domain: The paper focuses on context engineering for large language models (LLMs), specifically developing evolving contexts to improve LLM performance and self-improvement capabilities.
2. 💡 Previous Research and New Ideas: Building on Dynamic Cheatsheet's adaptive memory approach, the paper proposes ACE (Agentic Context Engineering), which introduces structured context updates and preserves detailed domain knowledge rather than compressing it.
3. ❓ Problem: The paper addresses two key limitations in existing context adaptation methods: brevity bias (oversimplifying contexts) and context collapse (degradation of information during iterative rewrites).
4. 🛠️ Methods: ACE uses a three-component system (Generator, Reflector, Curator) with incremental delta updates and a grow-and-refine mechanism to maintain comprehensive, evolving contexts without losing detailed knowledge.
5. 📊 Results and Evaluation: ACE achieved significant improvements over baselines: +10.6% on agents and +8.6% on finance benchmarks, while reducing adaptation latency by 86.9% and matching top-ranked production-level agents despite using smaller models.
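The grow-and-refine mechanism mentioned above can be sketched with a redundancy-control pass: entries keep accumulating, but near-duplicates are pruned. Here a crude textual-similarity check via `difflib` stands in for semantic deduplication (the paper does not specify this implementation; the function name and threshold are assumptions).

```python
# Sketch of a grow-and-refine pass: keep adding playbook entries, but
# prune near-duplicates. difflib's SequenceMatcher is a crude textual
# stand-in for embedding-based semantic deduplication.
import difflib

def refine(playbook, threshold=0.9):
    """Keep an entry only if it is not too similar to any kept entry."""
    kept = []
    for entry in playbook:
        if all(difflib.SequenceMatcher(None, entry, k).ratio() < threshold
               for k in kept):
            kept.append(entry)
    return kept

playbook = [
    "check column units before summing",
    "check the column units before summing",   # near-duplicate, pruned
    "prefer vectorized pandas operations",
]
print(refine(playbook))
```

Pruning only near-duplicates (rather than rewriting or compressing the whole context) is what lets the playbook grow without the brevity bias the paper criticizes.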


Overview (from the paper's summary figure):

• ACE workflow: input query & context → Generator (produces reasoning trajectories) → Reflector (critiques trajectories and extracts insights and lessons) → Curator (synthesizes insights into delta context entries) → context playbook (an evolving base of strategies and knowledge), refined iteratively.
• Incremental delta updates: structured bullet points, localized modifications, knowledge preservation, collapse prevention, parallel merging.
• Grow-and-refine: adaptive expansion, redundancy control, semantic deduplication, relevance maintenance, scalable contexts.
• Multi-epoch adaptation: progressive refinement and iterative improvement that strengthen contexts, improve generalization, and enable self-improvement.
• Key benefits: 86.9% lower adaptation latency, reduced rollout cost, no labeled supervision required, comprehensive playbooks, self-improving LLMs.
• Performance results: +10.6% on agent benchmarks and +8.6% on finance benchmarks; matches a GPT-4.1 agent while using a smaller model; consistent improvements across agent and domain-specific benchmarks.
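The Generator → Reflector → Curator loop with incremental delta updates can be sketched as below. Each role is a toy function (in the paper they are LLM calls), and the exact data passed between roles is an assumption made for illustration.

```python
# Sketch of ACE's Generator -> Reflector -> Curator loop.
# The context playbook grows by localized delta entries instead of
# being rewritten wholesale, which is what prevents context collapse.

def generator(query, context):
    """Produce a (toy) trajectory using the current context playbook."""
    return {"query": query, "context_used": list(context)}

def reflector(trajectory, feedback):
    """Critique the trajectory and extract an insight from feedback."""
    return f"lesson: {feedback}"

def curator(context, insight):
    """Merge the insight as a delta entry; skip exact duplicates."""
    if insight not in context:      # dedup stand-in for semantic matching
        context.append(insight)     # localized, incremental update
    return context

context = ["always check units"]    # the evolving playbook
for query, feedback in [("q1", "verify totals"), ("q2", "verify totals")]:
    traj = generator(query, context)
    insight = reflector(traj, feedback)
    context = curator(context, insight)

print(context)
```

Note that the second iteration's insight is deduplicated rather than appended, so repeated lessons do not bloat the playbook while earlier knowledge is never discarded.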
Q1. What was the key innovation in ACE's approach to handling context compared to previous methods?
a) Using completely compressed contexts to maximize efficiency
b) Treating contexts as evolving playbooks that accumulate knowledge
c) Eliminating all previous context after each update
Q2. Which performance metric did ACE significantly improve according to the paper?
a) Reduced model size by 86.9%
b) Increased memory efficiency by 10.6%
c) Reduced adaptation latency by 86.9%
Q3. What are the two key limitations of existing context adaptation methods that ACE addresses?
a) High computational cost and memory usage
b) Slow processing speed and poor accuracy
c) Brevity bias and context collapse

Paper 3

Less is More: Recursive Reasoning with Tiny Networks

Published: 2025-10-06

Link: http://arxiv.org/pdf/2510.04871

1. 📘 Topic and Domain: The paper focuses on recursive reasoning models for solving complex puzzle tasks using small neural networks in the domain of machine learning and artificial intelligence.
2. 💡 Previous Research and New Ideas: Based on the Hierarchical Reasoning Model (HRM), the paper proposes a simpler Tiny Recursive Model (TRM) that uses a single tiny network instead of two networks recursing at different frequencies.
3. ❓ Problem: The paper aims to solve the challenge of achieving high performance on complex puzzle tasks (like Sudoku, Maze, ARC-AGI) with minimal parameters while avoiding the complexity and theoretical requirements of existing approaches.
4. 🛠️ Methods: TRM uses a single tiny 2-layer network that recursively improves its latent reasoning feature and predicted answer through multiple supervision steps, incorporating exponential moving average and simplified adaptive computational time.
5. 📊 Results and Evaluation: TRM achieved better results than HRM and large language models on multiple benchmarks while using fewer parameters (7M vs 27M), including 87.4% accuracy on Sudoku-Extreme, 85.3% on Maze-Hard, 44.6% on ARC-AGI-1, and 7.8% on ARC-AGI-2.


Overview (from the paper's summary figure):

• TRM workflow: input question x → input embedding f_I(x) → initial states y_init, z_init.
• Deep supervision loop (up to N_sup = 16 steps): recursive reasoning runs T−1 times without gradients plus 1 time with gradients. Each pass performs latent recursion (n = 6 times), z = net(x, y, z), with a tiny 2-layer network, followed by an answer update, y = net(y, z), using the same network.
• Output and training: output head ŷ = argmax(f_O(y)); a simplified ACT mechanism makes the halting decision; the loss is cross-entropy plus a halting loss; the final prediction is the progressively improved answer.
• Key features: a single tiny 2-layer network, only 7M parameters, no fixed-point theorem required, full gradient backpropagation, simplified ACT, EMA for stability, progressive improvement.
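The control flow above (latent recursion on z, then an answer update on y, repeated under deep supervision) can be sketched numerically. The real TRM uses a shared 2-layer network; here `net` is a toy linear contraction chosen only so the recursion visibly converges, and all constants are illustrative.

```python
# Sketch of TRM's recursion: one shared function plays both roles,
# refining latent z against (x, y), then updating the answer y from (y, z).
# `net` is a toy contraction standing in for the tiny 2-layer network.

def net(a, b, c=0.0):
    """Toy stand-in for the shared tiny network (a fixed linear mix)."""
    return 0.5 * a + 0.3 * b + 0.2 * c

def trm_step(x, y, z, n=6):
    for _ in range(n):          # latent recursion: z = net(x, y, z)
        z = net(x, y, z)
    y = net(y, z)               # answer update: y = net(y, z), same net
    return y, z

x, y, z = 1.0, 0.0, 0.0
for _ in range(16):             # deep supervision loop (N_sup = 16)
    y, z = trm_step(x, y, z)

print(round(y, 3))
```

Even this toy shows the key structural idea: depth comes from recursing a single small function many times, not from stacking more layers.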
Q1. What is the main advantage of TRM over HRM according to the paper?
a) It uses more complex biological arguments
b) It achieves better results with fewer parameters and a simpler architecture
c) It requires more forward passes during training
Q2. On which dataset did TRM achieve its most significant improvement over HRM?
a) Sudoku-Extreme (from 55% to 87.4%)
b) ARC-AGI-2 (from 5% to 7.8%)
c) Maze-Hard (from 74.5% to 85.3%)
Q3. Why did the authors choose to use only 2 layers in TRM?
a) To reduce computational cost
b) Because it was inspired by biological neural networks
c) Because fewer layers actually improved generalization due to less overfitting