2026-01-12 Papers


Paper 1

Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization

Published: 2026-01-08

Link: http://arxiv.org/pdf/2601.05432

1. 📘 Topic and Domain: The paper focuses on image geolocalization using large vision-language models (LVLMs) and map tools to determine the location where an image was taken.
2. 💡 Previous Research and New Ideas: Previous research treated geolocalization as a classification/retrieval task or used LVLMs with chain-of-thought reasoning; this paper introduces a novel "Thinking with Map" approach that lets the model consult map tools the way a human would.
3. ❓ Problem: The paper aims to solve the challenge of accurately determining image locations by addressing the limitations of existing approaches that rely solely on internal model knowledge without using maps.
4. 🛠️ Methods: The authors developed a two-stage optimization scheme: agentic reinforcement learning to improve sampling efficiency, followed by parallel test-time scaling to explore multiple candidate paths, along with map-based tools for verification.
5. 📊 Results and Evaluation: The method outperformed existing models on most metrics across multiple benchmarks, most notably improving fine-grained Acc@500m from 8.0% to 22.1% over Gemini-3-Pro with Google Search/Map grounding.
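The agentic RL stage trains the policy with GRPO-style group-relative advantages. As a minimal sketch of that core computation (the reward design and function names here are illustrative, not the paper's implementation):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sampled trajectory's reward
    against the mean and std of its own sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid div-by-zero on uniform groups
    return [(r - mean) / std for r in rewards]

# Example: 4 sampled trajectories, reward 1.0 if the predicted point
# lands within 500 m of the ground truth (hypothetical reward rule)
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Trajectories that beat their group's average get positive advantage, which is what lets successful map-tool call patterns be reinforced without an absolute reward scale.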

Workflow overview (reconstructed from the paper's figure):

- Phase 1: Thinking with Map. The input image enters an agent-in-map loop that calls map tools (POI search, static map) for hypothesis generation and cross-validation.
- Phase 2: Agentic reinforcement learning. The policy πθ is trained with GRPO: trajectories are generated, rewards assigned, and advantages computed, lifting Pass@N toward Pass@K. Training data: MAPBench + IMAGEO-2 (4,500 samples).
- Phase 3: Parallel test-time scaling. N trajectories are sampled in parallel; a verifier aggregates their evidence into the final answer, converting Pass@K into Pass@1. Result: Acc@500m improves from 8.0% to 22.1% versus Gemini-3-Pro.
- Evaluation datasets: MAPBench (5,000 Chinese urban images; Easy: 599, Hard: 1,901), GeoBench (1,132 global images: photos, panoramas, satellite), IMAGEO-Bench (2,929 crowdsourced Google Map POIs).
- Key innovations: map-augmented reasoning, self-verifiable trajectories, parallel exploration, evidence-based verification.
- Performance: MAPBench-hard Acc@500m 14.86% (best), GeoBench Acc@500m 57.94% (best), IMAGEO-2 Acc@500m 20.53% (best). Base model: Qwen3-VL-30B-A3B.
- Map tool integration: POI keyword search (location detail lookup), static map query (visual verification), POI detail query (detailed information), image zoom tool (visual clue inspection), satellite map query (aerial-view verification), POI input tips (search suggestions).
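The parallel test-time scaling idea, sampling several candidate trajectories and letting a verifier select one answer, can be sketched as evidence-weighted voting over predicted coordinates. The sampler, scores, and coordinates below are hypothetical placeholders, not the paper's verifier:

```python
from collections import defaultdict

def aggregate_candidates(candidates):
    """Select the final location by summing verifier evidence scores per
    candidate coordinate (the Pass@K -> Pass@1 selection step)."""
    scores = defaultdict(float)
    for coord, evidence_score in candidates:
        scores[coord] += evidence_score
    return max(scores, key=scores.get)

# Three parallel trajectories: ((lat, lon) rounded to a grid, verifier score)
trajs = [((39.90, 116.40), 0.8),
         ((39.90, 116.40), 0.6),
         ((31.23, 121.47), 0.9)]
best = aggregate_candidates(trajs)
```

Even though one trajectory scores 0.9 alone, the two agreeing trajectories accumulate more total evidence, which is the intuition behind verifier-based aggregation over independent parallel samples.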
Q1
1. What is the main innovation of this paper compared to previous geolocalization approaches?
Using classification and retrieval methods
Integrating map tools with LVLM reasoning
Applying chain-of-thought reasoning only
Q2
2. In the two-stage optimization scheme proposed by the paper, what is the correct order of stages?
Parallel test-time scaling followed by reinforcement learning
Chain-of-thought reasoning followed by map verification
Agentic reinforcement learning followed by parallel test-time scaling
Q3
3. What improvement did the paper's method achieve in fine-grained localization (Acc@500m) compared to Gemini-3-Pro?
From 8.0% to 14.1%
From 8.0% to 22.1%
From 8.0% to 18.5%

Paper 2

Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards

Published: 2026-01-09

Link: http://arxiv.org/pdf/2601.06021

1. 📘 Topic and Domain: The paper focuses on improving reinforcement learning for large language model-based deep search agents through better reward mechanisms.
2. 💡 Previous Research and New Ideas: Previous work used binary outcome rewards for training deep search agents; this paper proposes a novel Citation-aware Rubric Rewards (CaRR) framework that evaluates reasoning comprehensiveness and factual grounding.
3. ❓ Problem: The paper addresses the limitations of pure outcome-based rewards, which fail to capture reasoning comprehensiveness and factuality, leading to shortcut exploitation and hallucination in deep search agents.
4. 🛠️ Methods: The authors developed CaRR to decompose complex questions into verifiable rubrics and introduced Citation-aware Group Relative Policy Optimization (C-GRPO) that combines rubric rewards with outcome rewards.
5. 📊 Results and Evaluation: C-GRPO consistently outperformed standard outcome-based RL baselines across multiple deep search benchmarks, showing better performance with extended context budgets and strong generalization to open-ended research tasks.

Workflow overview (reconstructed from the paper's figure):

- Input processing: a multi-hop question (complex QA from the DeepDive dataset) is decomposed by an LLM into single-hop rubrics (rubric initialization); the agent then produces a trajectory under the ReAct paradigm using search, open, and find tools.
- CaRR framework:
  - Step 1: Hidden entity identification. A judge LLM checks whether the hidden entities are explicitly identified in the final response.
  - Step 2: Citation-based rubric judgment. Each rubric must be supported by web contents cited in the trajectory.
  - Step 3: Evidence connectivity check. Build a bipartite graph and check via BFS which rubrics are connected to the final answer.
  - Rubric reward: R_r = |R_connect| / |R_q|, the ratio of satisfied rubrics.
- C-GRPO training: the outcome reward R_o ∈ {0, 1} is a binary signal for a correct final answer; the mixed reward R = (1 − α)·R_o + α·R_o·R̂_r combines outcome and rubric rewards; GRPO optimization then applies a token-level loss, yielding a robust agent with more comprehensive deep-search reasoning.
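The connectivity check and mixed reward can be sketched directly: rubric-to-answer connectivity is a plain BFS over a small evidence graph (the graph construction from citations is heavily simplified here, and the edge data is illustrative):

```python
from collections import deque

def connected_rubrics(edges, rubrics, answer="ANSWER"):
    """BFS from the final answer over the (undirected) evidence graph;
    a rubric counts as satisfied only if it is reachable, i.e. chained
    to the answer through cited evidence."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, queue = {answer}, deque([answer])
    while queue:
        for nxt in adj.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return [r for r in rubrics if r in seen]

def mixed_reward(outcome, rubric_ratio, alpha=0.5):
    """R = (1 - alpha) * R_o + alpha * R_o * R_r: rubric credit is gated
    on a correct outcome, so shortcut answers earn no rubric bonus."""
    return (1 - alpha) * outcome + alpha * outcome * rubric_ratio

rubrics = ["r1", "r2", "r3"]
edges = [("r1", "e1"), ("e1", "ANSWER"), ("r2", "ANSWER")]  # r3 unsupported
r_connect = connected_rubrics(edges, rubrics)
r_r = len(r_connect) / len(rubrics)   # 2 of 3 rubrics satisfied
reward = mixed_reward(outcome=1, rubric_ratio=r_r)
```

Note how multiplying the rubric term by R_o means a wrong final answer zeroes the whole reward, while among correct answers, better-evidenced trajectories score higher.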
Q1
1. What is the main limitation of using pure outcome-based rewards in training deep search agents?
They require too much computational resources
They cannot capture reasoning comprehensiveness and factuality
They only work with small language models
Q2
2. How does CaRR evaluate an agent's response trajectory?
By counting the number of search queries made
By measuring response generation speed
By checking if entities are identified, facts are citation-supported, and evidence chains are connected
Q3
3. What innovative aspect does C-GRPO introduce to the training process?
It combines citation-aware rubric rewards with outcome rewards
It eliminates the need for human supervision
It reduces training time by 50%

Paper 3

Evolving Programmatic Skill Networks

Published: 2026-01-06

Link: http://arxiv.org/pdf/2601.03509

1. 📘 Topic and Domain: The paper focuses on continual skill acquisition for embodied AI agents, introducing a framework called Programmatic Skill Network (PSN) that enables agents to learn, refine, and reuse executable skills in open-ended environments.
2. 💡 Previous Research and New Ideas: Based on existing work in programmatic skill representations and LLM-based agents, the paper proposes a novel framework where skills are represented as executable symbolic programs forming a compositional network that evolves through experience, with unique mechanisms for credit assignment and structural refactoring.
3. ❓ Problem: The paper addresses limitations of current approaches where skills are typically represented as flat libraries or static graphs lacking principled mechanisms for continual improvement and unified frameworks for credit assignment over hierarchical skill compositions.
4. 🛠️ Methods: The authors develop three core mechanisms: (1) REFLECT for structured fault localization over skill compositions, (2) maturity-aware update gating for stabilizing reliable skills while maintaining plasticity for uncertain ones, and (3) canonical structural refactoring under rollback validation to maintain network compactness.
5. 📊 Results and Evaluation: Experiments on MineDojo and Crafter environments demonstrate that PSN achieves robust skill reuse, rapid adaptation, and strong generalization across open-ended task distributions, with better performance than baseline approaches in technology tree progression and survival tasks.
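The two-phase REFLECT pass (top-down feedback, bottom-up repair) can be sketched as a recursive walk over a skill's composition tree. The skill structure, trace format, and feedback strings below are illustrative stand-ins, not the paper's representation:

```python
def reflect(skill, feedback, trace):
    """Phase I: push failure feedback top-down into the child skills the
    execution trace implicates; Phase II: collect repair targets bottom-up
    (analogous to backpropagation over the skill graph)."""
    repairs = []
    for child in skill.get("children", []):
        if trace.get(child["name"]) == "failed":      # fault localization
            child_feedback = f"{feedback} -> {child['name']}"
            repairs.extend(reflect(child, child_feedback, trace))
    if not repairs:                                   # leaf-level fault: repair here
        repairs.append((skill["name"], feedback))
    return repairs

mine_wood = {"name": "mine_wood", "children": [
    {"name": "find_tree", "children": []},
    {"name": "chop", "children": []},
]}
trace = {"find_tree": "ok", "chop": "failed"}
repairs = reflect(mine_wood, "task failed", trace)
```

The top-level failure is attributed to the failing sub-skill rather than to the composite, which is the compositional credit assignment the paper's REFLECT operator formalizes.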

Framework flow (reconstructed from the paper's figure):

- A task stream {ω₁, ω₂, ...} feeds a hybrid planner (backward-chaining + LLM planning; S(g) = {s : E^post_s ⇒ g}).
- The PSN manager generates and executes code, sₜ = CodeGen(Pₜ, Contextₜ), in environment E, producing a success signal δₜ ∈ {0, 1}.
- On failure, the skill optimizer runs trace-based credit assignment, ∇̃s = REFLECT(fₜ, s; Tₜ), in two phases (Phase I: top-down feedback; Phase II: bottom-up repair), gated by maturity-aware updating: P(update) = (1 − ε)·σ(γ·(0.6 − V(s))) + ε.
- On success, online refactoring performs structural optimization: merging redundant skills, abstracting common patterns, and pruning unused branches, all under rollback validation.
- The skill network Nₜ = (Sₜ, Lₜ) stores skills s = (Cₛ, Pₛ, Eₛ, CHILDREN(s)): control flow plus parameters, pre/postconditions, and a reliability estimate V(s) = p̂ₛ − uₛ.

Core PSN mechanisms:
1. REFLECT (credit assignment): trace-based fault localization via symbolic differentiation over the skill graph; top-down feedback propagation, bottom-up gradient application, and compositional error attribution, ∇̃s′ = REFLECT(∇̃s, s′). Analogous to backpropagation.
2. Maturity-aware gating: a stability-plasticity tradeoff; reliable skills receive low update rates while uncertain skills stay plastic, preventing catastrophic forgetting and progressively stabilizing skills. Analogous to learning-rate scheduling.
3. Structural refactoring: merge redundant skills, extract common abstractions, and prune unused components, with rollback validation for safety. Analogous to neural architecture search.
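The maturity-aware gate reduces to a single formula, P(update) = (1 − ε)·σ(γ·(0.6 − V(s))) + ε. A direct implementation (the γ and ε values are illustrative, not taken from the paper):

```python
import math

def update_probability(reliability, gamma=8.0, eps=0.05):
    """P(update) = (1 - eps) * sigmoid(gamma * (0.6 - V(s))) + eps.
    Reliable skills (V near 1) are rarely modified; uncertain skills
    (V near 0) stay plastic; eps keeps a floor of exploration so no
    skill is ever frozen completely."""
    sigmoid = 1.0 / (1.0 + math.exp(-gamma * (0.6 - reliability)))
    return (1.0 - eps) * sigmoid + eps

p_mature = update_probability(0.95)  # stable skill: low update chance
p_young = update_probability(0.10)   # uncertain skill: high update chance
```

The 0.6 threshold acts as the reliability level at which a skill transitions from "still plastic" to "mostly stabilized", which is how the framework trades stability against plasticity per skill.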
Q1
1. What is the key parallel drawn between PSN's learning dynamics and neural network training?
Both use gradient descent optimization
Both require GPU acceleration for training
Both employ structured credit assignment and stability-plasticity tradeoffs
Q2
2. When does the PSN framework invoke the REFLECT operator?
After every skill execution regardless of outcome
Only when a skill execution fails to perform credit assignment
Only during the initial skill learning phase
Q3
3. What is the main advantage demonstrated by PSN over Voyager in the experimental results?
PSN requires less computational resources
PSN achieves better skill retention and reduced catastrophic forgetting
PSN learns skills in fewer iterations but with higher variance