2026-01-12 Papers


Paper 1

Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization

Published: 2026-01-08

Link: http://arxiv.org/pdf/2601.05432

1. 📘 Topic and Domain: The paper focuses on image geolocalization using large vision-language models (LVLMs) and map tools to determine the location where an image was taken.
2. 💡 Previous Research and New Ideas: Previous research treated geolocalization as a classification/retrieval task or used LVLMs with chain-of-thought reasoning; this paper introduces a novel "Thinking with Map" approach that lets the model consult map tools the way a human would.
3. ❓ Problem: The paper aims to solve the challenge of accurately determining image locations by addressing the limitations of existing approaches that rely solely on internal model knowledge without using maps.
4. 🛠️ Methods: The authors developed a two-stage optimization scheme: agentic reinforcement learning to improve sampling efficiency, followed by parallel test-time scaling to explore multiple candidate paths, along with map-based tools for verification.
5. 📊 Results and Evaluation: The method outperformed existing models on most metrics across multiple benchmarks, most notably improving fine-grained Acc@500m from 8.0% to 22.1% over Gemini-3-Pro with Google Search/Map grounding.
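The agentic RL stage trains the policy with GRPO-style group-relative advantages. As a minimal sketch of that core computation (the reward design and function names here are illustrative, not the paper's implementation):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sampled trajectory's reward
    against the mean and std of its own sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid div-by-zero on uniform groups
    return [(r - mean) / std for r in rewards]

# Example: 4 sampled trajectories, reward 1.0 if the predicted point
# lands within 500 m of the ground truth (hypothetical reward rule)
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Trajectories that beat their group's average get positive advantage, which is what lets successful map-tool call patterns be reinforced without an absolute reward scale.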

Workflow overview (reconstructed from the paper's figure):

- Phase 1: Thinking with Map. The input image enters an agent-in-map loop that calls map tools (POI search, static map) for hypothesis generation and cross-validation.
- Phase 2: Agentic reinforcement learning. The policy πθ is trained with GRPO: trajectories are generated, rewards assigned, and advantages computed, lifting Pass@N toward Pass@K. Training data: MAPBench + IMAGEO-2 (4,500 samples).
- Phase 3: Parallel test-time scaling. N trajectories are sampled in parallel; a verifier aggregates their evidence into the final answer, converting Pass@K into Pass@1. Result: Acc@500m improves from 8.0% to 22.1% versus Gemini-3-Pro.
- Evaluation datasets: MAPBench (5,000 Chinese urban images; Easy: 599, Hard: 1,901), GeoBench (1,132 global images: photos, panoramas, satellite), IMAGEO-Bench (2,929 crowdsourced Google Map POIs).
- Key innovations: map-augmented reasoning, self-verifiable trajectories, parallel exploration, evidence-based verification.
- Performance: MAPBench-hard Acc@500m 14.86% (best), GeoBench Acc@500m 57.94% (best), IMAGEO-2 Acc@500m 20.53% (best). Base model: Qwen3-VL-30B-A3B.
- Map tool integration: POI keyword search (location detail lookup), static map query (visual verification), POI detail query (detailed information), image zoom tool (visual clue inspection), satellite map query (aerial-view verification), POI input tips (search suggestions).
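The parallel test-time scaling idea, sampling several candidate trajectories and letting a verifier select one answer, can be sketched as evidence-weighted voting over predicted coordinates. The sampler, scores, and coordinates below are hypothetical placeholders, not the paper's verifier:

```python
from collections import defaultdict

def aggregate_candidates(candidates):
    """Select the final location by summing verifier evidence scores per
    candidate coordinate (the Pass@K -> Pass@1 selection step)."""
    scores = defaultdict(float)
    for coord, evidence_score in candidates:
        scores[coord] += evidence_score
    return max(scores, key=scores.get)

# Three parallel trajectories: ((lat, lon) rounded to a grid, verifier score)
trajs = [((39.90, 116.40), 0.8),
         ((39.90, 116.40), 0.6),
         ((31.23, 121.47), 0.9)]
best = aggregate_candidates(trajs)
```

Even though one trajectory scores 0.9 alone, the two agreeing trajectories accumulate more total evidence, which is the intuition behind verifier-based aggregation over independent parallel samples.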
Q1
1. What is the main innovation of this paper compared to previous geolocalization approaches?
Using classification and retrieval methods
Integrating map tools with LVLM reasoning
Applying chain-of-thought reasoning only
Q2
2. In the two-stage optimization scheme proposed by the paper, what is the correct order of stages?
Parallel test-time scaling followed by reinforcement learning
Chain-of-thought reasoning followed by map verification
Agentic reinforcement learning followed by parallel test-time scaling
Q3
3. What improvement did the paper's method achieve in fine-grained localization (Acc@500m) compared to Gemini-3-Pro?
From 8.0% to 14.1%
From 8.0% to 22.1%
From 8.0% to 18.5%

Paper 2

Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards

Published: 2026-01-09

Link: http://arxiv.org/pdf/2601.06021

1. 📘 Topic and Domain: The paper focuses on improving reinforcement learning for large language model-based deep search agents through better reward mechanisms.
2. 💡 Previous Research and New Ideas: Previous work used binary outcome rewards for training deep search agents; this paper proposes a novel Citation-aware Rubric Rewards (CaRR) framework that evaluates reasoning comprehensiveness and factual grounding.
3. ❓ Problem: The paper addresses the limitations of pure outcome-based rewards, which fail to capture reasoning comprehensiveness and factuality, leading to shortcut exploitation and hallucination in deep search agents.
4. 🛠️ Methods: The authors developed CaRR to decompose complex questions into verifiable rubrics and introduced Citation-aware Group Relative Policy Optimization (C-GRPO) that combines rubric rewards with outcome rewards.
5. 📊 Results and Evaluation: C-GRPO consistently outperformed standard outcome-based RL baselines across multiple deep search benchmarks, showing better performance with extended context budgets and strong generalization to open-ended research tasks.

Workflow overview (reconstructed from the paper's figure):

- Input processing: a multi-hop question (complex QA from the DeepDive dataset) is decomposed by an LLM into single-hop rubrics (rubric initialization); the agent then produces a trajectory under the ReAct paradigm using search, open, and find tools.
- CaRR framework:
  - Step 1: Hidden entity identification. A judge LLM checks whether the hidden entities are explicitly identified in the final response.
  - Step 2: Citation-based rubric judgment. Each rubric must be supported by web contents cited in the trajectory.
  - Step 3: Evidence connectivity check. Build a bipartite graph and check via BFS which rubrics are connected to the final answer.
  - Rubric reward: R_r = |R_connect| / |R_q|, the ratio of satisfied rubrics.
- C-GRPO training: the outcome reward R_o ∈ {0, 1} is a binary signal for a correct final answer; the mixed reward R = (1 − α)·R_o + α·R_o·R̂_r combines outcome and rubric rewards; GRPO optimization then applies a token-level loss, yielding a robust agent with more comprehensive deep-search reasoning.
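The connectivity check and mixed reward can be sketched directly: rubric-to-answer connectivity is a plain BFS over a small evidence graph (the graph construction from citations is heavily simplified here, and the edge data is illustrative):

```python
from collections import deque

def connected_rubrics(edges, rubrics, answer="ANSWER"):
    """BFS from the final answer over the (undirected) evidence graph;
    a rubric counts as satisfied only if it is reachable, i.e. chained
    to the answer through cited evidence."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, queue = {answer}, deque([answer])
    while queue:
        for nxt in adj.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return [r for r in rubrics if r in seen]

def mixed_reward(outcome, rubric_ratio, alpha=0.5):
    """R = (1 - alpha) * R_o + alpha * R_o * R_r: rubric credit is gated
    on a correct outcome, so shortcut answers earn no rubric bonus."""
    return (1 - alpha) * outcome + alpha * outcome * rubric_ratio

rubrics = ["r1", "r2", "r3"]
edges = [("r1", "e1"), ("e1", "ANSWER"), ("r2", "ANSWER")]  # r3 unsupported
r_connect = connected_rubrics(edges, rubrics)
r_r = len(r_connect) / len(rubrics)   # 2 of 3 rubrics satisfied
reward = mixed_reward(outcome=1, rubric_ratio=r_r)
```

Note how multiplying the rubric term by R_o means a wrong final answer zeroes the whole reward, while among correct answers, better-evidenced trajectories score higher.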
Q1
1. What is the main limitation of using pure outcome-based rewards in training deep search agents?
They require too much computational resources
They cannot capture reasoning comprehensiveness and factuality
They only work with small language models
Q2
2. How does CaRR evaluate an agent's response trajectory?
By counting the number of search queries made
By measuring response generation speed
By checking if entities are identified, facts are citation-supported, and evidence chains are connected
Q3
3. What innovative aspect does C-GRPO introduce to the training process?
It combines citation-aware rubric rewards with outcome rewards
It eliminates the need for human supervision
It reduces training time by 50%

Paper 3

Evolving Programmatic Skill Networks

Published: 2026-01-06

Link: http://arxiv.org/pdf/2601.03509

1. 📘 Topic and Domain: The paper focuses on continual skill acquisition for embodied AI agents, introducing a framework called Programmatic Skill Network (PSN) that enables agents to learn, refine, and reuse executable skills in open-ended environments.
2. 💡 Previous Research and New Ideas: Based on existing work in programmatic skill representations and LLM-based agents, the paper proposes a novel framework where skills are represented as executable symbolic programs forming a compositional network that evolves through experience, with unique mechanisms for credit assignment and structural refactoring.
3. ❓ Problem: The paper addresses limitations of current approaches where skills are typically represented as flat libraries or static graphs lacking principled mechanisms for continual improvement and unified frameworks for credit assignment over hierarchical skill compositions.
4. 🛠️ Methods: The authors develop three core mechanisms: (1) REFLECT for structured fault localization over skill compositions, (2) maturity-aware update gating for stabilizing reliable skills while maintaining plasticity for uncertain ones, and (3) canonical structural refactoring under rollback validation to maintain network compactness.
5. 📊 Results and Evaluation: Experiments on MineDojo and Crafter environments demonstrate that PSN achieves robust skill reuse, rapid adaptation, and strong generalization across open-ended task distributions, with better performance than baseline approaches in technology tree progression and survival tasks.
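The two-phase REFLECT pass (top-down feedback, bottom-up repair) can be sketched as a recursive walk over a skill's composition tree. The skill structure, trace format, and feedback strings below are illustrative stand-ins, not the paper's representation:

```python
def reflect(skill, feedback, trace):
    """Phase I: push failure feedback top-down into the child skills the
    execution trace implicates; Phase II: collect repair targets bottom-up
    (analogous to backpropagation over the skill graph)."""
    repairs = []
    for child in skill.get("children", []):
        if trace.get(child["name"]) == "failed":      # fault localization
            child_feedback = f"{feedback} -> {child['name']}"
            repairs.extend(reflect(child, child_feedback, trace))
    if not repairs:                                   # leaf-level fault: repair here
        repairs.append((skill["name"], feedback))
    return repairs

mine_wood = {"name": "mine_wood", "children": [
    {"name": "find_tree", "children": []},
    {"name": "chop", "children": []},
]}
trace = {"find_tree": "ok", "chop": "failed"}
repairs = reflect(mine_wood, "task failed", trace)
```

The top-level failure is attributed to the failing sub-skill rather than to the composite, which is the compositional credit assignment the paper's REFLECT operator formalizes.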

Framework flow (reconstructed from the paper's figure):

- A task stream {ω₁, ω₂, ...} feeds a hybrid planner (backward-chaining + LLM planning; S(g) = {s : E^post_s ⇒ g}).
- The PSN manager generates and executes code, sₜ = CodeGen(Pₜ, Contextₜ), in environment E, producing a success signal δₜ ∈ {0, 1}.
- On failure, the skill optimizer runs trace-based credit assignment, ∇̃s = REFLECT(fₜ, s; Tₜ), in two phases (Phase I: top-down feedback; Phase II: bottom-up repair), gated by maturity-aware updating: P(update) = (1 − ε)·σ(γ·(0.6 − V(s))) + ε.
- On success, online refactoring performs structural optimization: merging redundant skills, abstracting common patterns, and pruning unused branches, all under rollback validation.
- The skill network Nₜ = (Sₜ, Lₜ) stores skills s = (Cₛ, Pₛ, Eₛ, CHILDREN(s)): control flow plus parameters, pre/postconditions, and a reliability estimate V(s) = p̂ₛ − uₛ.

Core PSN mechanisms:
1. REFLECT (credit assignment): trace-based fault localization via symbolic differentiation over the skill graph; top-down feedback propagation, bottom-up gradient application, and compositional error attribution, ∇̃s′ = REFLECT(∇̃s, s′). Analogous to backpropagation.
2. Maturity-aware gating: a stability-plasticity tradeoff; reliable skills receive low update rates while uncertain skills stay plastic, preventing catastrophic forgetting and progressively stabilizing skills. Analogous to learning-rate scheduling.
3. Structural refactoring: merge redundant skills, extract common abstractions, and prune unused components, with rollback validation for safety. Analogous to neural architecture search.
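The maturity-aware gate reduces to a single formula, P(update) = (1 − ε)·σ(γ·(0.6 − V(s))) + ε. A direct implementation (the γ and ε values are illustrative, not taken from the paper):

```python
import math

def update_probability(reliability, gamma=8.0, eps=0.05):
    """P(update) = (1 - eps) * sigmoid(gamma * (0.6 - V(s))) + eps.
    Reliable skills (V near 1) are rarely modified; uncertain skills
    (V near 0) stay plastic; eps keeps a floor of exploration so no
    skill is ever frozen completely."""
    sigmoid = 1.0 / (1.0 + math.exp(-gamma * (0.6 - reliability)))
    return (1.0 - eps) * sigmoid + eps

p_mature = update_probability(0.95)  # stable skill: low update chance
p_young = update_probability(0.10)   # uncertain skill: high update chance
```

The 0.6 threshold acts as the reliability level at which a skill transitions from "still plastic" to "mostly stabilized", which is how the framework trades stability against plasticity per skill.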
Q1
1. What is the key parallel drawn between PSN's learning dynamics and neural network training?
Both use gradient descent optimization
Both require GPU acceleration for training
Both employ structured credit assignment and stability-plasticity tradeoffs
Q2
2. When does the PSN framework invoke the REFLECT operator?
After every skill execution regardless of outcome
Only when a skill execution fails to perform credit assignment
Only during the initial skill learning phase
Q3
3. What is the main advantage demonstrated by PSN over Voyager in the experimental results?
PSN requires less computational resources
PSN achieves better skill retention and reduced catastrophic forgetting
PSN learns skills in fewer iterations but with higher variance