2026-01-15 Papers

Paper 1

MAXS: Meta-Adaptive Exploration with LLM Agents

Published: 2026-01-14

Link: http://arxiv.org/pdf/2601.09259

1. 📘 Topic and Domain: The paper proposes MAXS, a meta-adaptive exploration framework for Large Language Model (LLM) Agents to improve multi-tool reasoning and decision-making.
2. 💡 Previous Research and New Ideas: Building on Chain-of-Thought (CoT), Tree-of-Thoughts (ToT), and Monte Carlo Tree Search (MCTS) methods, it introduces a novel lookahead strategy and a value-estimation mechanism for more efficient reasoning.
3. ❓ Problem: The paper addresses two key issues in LLM Agent reasoning: locally myopic generation (lack of foresight in decision-making) and trajectory instability (where small early errors can lead to divergent reasoning paths).
4. 🛠️ Methods: MAXS employs a lookahead strategy to simulate future steps, combines three metrics (advantage score, step consistency variance, and inter-step trend slopes) for value estimation, and uses a trajectory convergence mechanism to control computational costs.
5. 📊 Results and Evaluation: Tested across five datasets and three base models, MAXS consistently outperformed existing methods in both accuracy and efficiency, showing particular strength on MathVista (85.5% accuracy) while using significantly fewer tokens than alternatives like MCTS.
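The lookahead-with-convergence loop from point 4 can be sketched in a few lines. Here `rollout` and `score` are hypothetical stand-ins for the LLM's step simulation and value estimate (not the paper's code); K = 4 and δ = 0.002 are the values reported by the paper, while the variance window of 3 is my own simplification:

```python
import statistics

# Sketch of MAXS-style lookahead with variance-based early stopping:
# roll out up to K future steps, score each partial trajectory, and stop
# once the variance of recent rewards falls below the delta threshold.
K, DELTA = 4, 0.002

def lookahead(rollout, score, state, window=3):
    """Return per-step rewards, stopping early once Var(recent rewards) <= DELTA."""
    rewards = []
    for _ in range(K):
        state = rollout(state)          # simulate one future step
        rewards.append(score(state))    # value estimate for the trajectory so far
        if len(rewards) >= window and statistics.pvariance(rewards[-window:]) <= DELTA:
            break                       # trajectory has converged; save tokens
    return rewards
```

With a constant value estimate the loop stops as soon as the window fills, which is exactly the token-saving behavior the convergence mechanism is meant to provide.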

[Figure: MAXS framework overview]
- Lookahead strategy: rolls out K = 4 future steps via Bellman recursion over R(s₀, s≤ᵢ, s>ᵢ) to mitigate locally myopic generation.
- Value estimation: advantage score Aᵢ = Fᵢ − Fᵢ₋₁, step-level variance (Lyapunov stability), and slope-level variance (Lipschitz continuity), combined for trajectory stability.
- Combined reward: R(s₀, s≤ᵢ, s>ᵢ) = (1 − α − β)·Norm(R^adv_i) + α·Norm(R^step_i) + β·Norm(R^slope_i), with α = 0.3, β = 0.2.
- Step selection: ŝᵢ ~ πθ(sᵢ | s₀, s<ᵢ)·e^(R/τ) (softmax selection, τ = 0.6), iterated until convergence or a maximum of 13 steps.
- Tool integration: search and code execution, invoked dynamically.
- Trajectory convergence: early stopping once Var(Rᵢ) ≤ δ, with δ = 0.002, for efficiency.
- Reported highlights: ~1000× fewer tokens than MCTS; 63.46% average accuracy with MiMo-VL-7B; superior efficiency-performance trade-off, consistent across 5 benchmarks.
Q1. What is the primary innovation of MAXS compared to previous methods?
  a) It uses a larger language model than previous approaches
  b) It combines lookahead strategy with value estimation for balanced exploration
  c) It completely eliminates the need for external tools in reasoning
Q2. According to the ablation studies, which component removal caused the largest performance drop in MAXS?
  a) Removing the lookahead module
  b) Removing the slope variance score
  c) Removing the trajectory convergence mechanism
Q3. What was the computational efficiency advantage of MAXS compared to MCTS?
  a) It used about 10 times fewer tokens
  b) It used about 100 times fewer tokens
  c) It used about 1000 times fewer tokens

Paper 2

A^3-Bench: Benchmarking Memory-Driven Scientific Reasoning via Anchor and Attractor Activation

Published: 2026-01-14

Link: http://arxiv.org/pdf/2601.09274

1. 📘 Topic and Domain: A benchmark dataset (A³-Bench) for evaluating memory-driven scientific reasoning in the math, physics, and chemistry domains.
2. 💡 Previous Research and New Ideas: Building on existing memory and scientific-reasoning benchmarks, it proposes a novel dual-scale memory framework using anchors (foundational knowledge) and attractors (experience-based templates).
3. ❓ Problem: Existing benchmarks only evaluate final answers or step-by-step coherence, without evaluating how models activate and utilize memory during scientific reasoning.
4. 🛠️ Methods: Created 2,198 scientific problems using SAPM process (subject benchmarking, anchor/attractor development, problem reconstruction, memory mapping) and introduced AAUI metric to measure memory activation rates.
5. 📊 Results and Evaluation: Models showed improved accuracy under memory activation (13.48% average increase), with strongest gains on difficult problems, while maintaining reasonable token costs and achieving AAUI scores up to 0.97.

[Figure: A³-Bench memory-driven scientific reasoning workflow]
- SAPM annotation process: Subject benchmarking → Anchor & attractor development → Problem reconstruction → Memory mapping.
- Memory architecture: anchors (concepts, principles, formulas) and attractors (abstract schemas, solution templates, episodic exemplars).
- Memory activation objective: m* = arg min_m [D_KL(q ‖ p(m|A)) + H(m)].
- HybridRAG framework: a twin-needle activator (vector needle for semantics, graph needle for logic) with activation z* ≈ Φ_hybrid(x) = V(x) ⊕ G(V(x)), plus a context composer weaving the query with the activated state into the final context.
- Evaluation settings: Vanilla (no memory, parametric knowledge only), Full Memory (anchor & attractor activation), and Gold Memory (human-labeled annotated subset); the AAUI (Anchor-Attractor Utilization Index) metric measures memory activation.
- Dataset: 2,198 problems — math 998 (45.4%), physics 600 (27.3%), chemistry 600 (27.3%).
- Key findings: memory augmentation yields a +13.48% average improvement across all LLMs, most beneficial on hard problems; higher AAUI correlates with better accuracy; gains generalize beyond the source datasets.
Q1. What is the main innovation of A³-Bench compared to existing scientific reasoning benchmarks?
  a) It includes more diverse scientific problems
  b) It evaluates memory activation and utilization during reasoning
  c) It focuses only on final answer accuracy
Q2. In the SAPM process used to create A³-Bench, what is the maximum number of anchors and attractors allowed per question?
  a) 4 anchors and 2 attractors
  b) 3 anchors and 3 attractors
  c) 6 anchors and 4 attractors
Q3. When comparing memory activation paradigms, which observation was made about model performance?
  a) Attractor-only activation consistently performed worse than anchor-only activation
  b) The combination of anchors and attractors showed no significant improvement over single memory type
  c) Most models performed better with attractor-only activation compared to anchor-only activation

Paper 3

Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

Published: 2026-01-14

Link: http://arxiv.org/pdf/2601.09708

1. 📘 Topic and Domain: The paper focuses on efficient Vision-Language-Action (VLA) reasoning for robotic tasks, specifically developing a framework called Fast-ThinkAct to improve the speed and effectiveness of robots' decision-making processes.
2. 💡 Previous Research and New Ideas: Based on previous chain-of-thought (CoT) reasoning approaches in VLA models, this paper proposes a novel method of compressing lengthy reasoning chains into compact latent representations while maintaining performance.
3. ❓ Problem: The paper addresses the high inference latency in current VLA systems that use lengthy reasoning traces, which hampers real-time performance in robotic applications requiring rapid decision-making.
4. 🛠️ Methods: The authors implement a teacher-student framework with preference-guided distillation that compresses linguistic and visual planning into compact continuous latents, using a verbalizer LLM to ensure the latent representations remain interpretable.
5. 📊 Results and Evaluation: The framework achieved up to 89.3% reduction in inference latency compared to state-of-the-art reasoning VLAs while maintaining or improving performance across various robotic manipulation and reasoning benchmarks.

[Figure: Fast-ThinkAct framework overview]
- Phase 1, verbalizable latent planning: given an observation o_t and instruction l, a GRPO-trained teacher VLM F^T_θ generates CoT traces τ (~250 tokens); the student VLM F_θ distills them into latent reasoning tokens z = {z_m} (M = 6), plus spatial tokens encoding K waypoints of the visual trajectory.
- Phase-1 losses: L_distill (preference-guided trajectory distillation), L_verb (a verbalizer V_ψ decodes latents back to text, keeping them interpretable), and L_ans (waypoint supervision).
- Phase 2, reasoning-enhanced policy learning: the frozen student's visual planning context c_t (from the KV cache) conditions an action model π_φ (DiT-Policy / RDT) trained with an imitation loss L_IL to generate robot actions a_t in real time.
- Key achievements: 89.3% latency reduction, superior performance, long-horizon planning, few-shot adaptation.
Q1. What is the main innovation of Fast-ThinkAct compared to previous VLA systems?
  a) It introduces a new type of robotic hardware
  b) It compresses lengthy reasoning chains into compact latent representations
  c) It completely eliminates the need for reasoning in robotic tasks
Q2. How much inference latency reduction did Fast-ThinkAct achieve compared to state-of-the-art reasoning VLAs?
  a) Up to 50%
  b) Up to 75%
  c) Up to 89.3%
Q3. What unique component does Fast-ThinkAct use to ensure its latent representations remain interpretable?
  a) A verbalizer LLM
  b) A neural network translator
  c) A binary encoder-decoder