2026-03-06 Papers


Paper 1

RoboPocket: Improve Robot Policies Instantly with Your Phone

Published: 2026-03-05

Link: http://arxiv.org/pdf/2603.05504

1. 📘 Topic and Domain: The paper presents RoboPocket, a smartphone-based system for robot-free instant policy iteration in robotic manipulation learning.
2. 💡 Previous Research and New Ideas: Building on handheld data collection interfaces like UMI and interactive learning methods like DAgger, the paper proposes using AR Visual Foresight on smartphones to enable real-time policy feedback and correction without physical robots.
3. ❓ Problem: The paper addresses the inefficiency of robot-learning data collection, where current methods either require expensive physical robots for interactive learning or run open-loop without insight into the policy's weaknesses.
4. 🛠️ Methods: The system uses an iPhone with custom gripper hardware, real-time AR visualization of policy predictions, remote inference servers, and online finetuning with weighted sampling to enable instant policy updates.
5. 📊 Results and Evaluation: RoboPocket achieved 2× data efficiency compared to offline baselines across four manipulation tasks, adhered to data scaling laws, and enabled distributed policy improvement with as few as 12 corrections per user.
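The online finetuning step pairs the user's fresh corrections with the existing offline demonstrations via 50/50 weighted sampling (RLPD-style, per the paper's workflow). A minimal Python sketch of that sampling; the buffer contents and batch size are illustrative, not from the paper:

```python
import random

def sample_batch(offline_buffer, online_buffer, batch_size=64):
    """Draw a symmetric batch: half from the offline demonstration
    buffer, half from freshly collected online corrections."""
    half = batch_size // 2
    # Sample with replacement so a small online buffer still fills its half.
    offline = random.choices(offline_buffer, k=half)
    online = random.choices(online_buffer, k=batch_size - half)
    batch = offline + online
    random.shuffle(batch)
    return batch

# Toy usage: 1,000 offline demos vs. a dozen user corrections.
demos = [{"src": "offline", "id": i} for i in range(1000)]
corrections = [{"src": "online", "id": i} for i in range(12)]
batch = sample_batch(demos, corrections)
```

Sampling with replacement lets a handful of corrections (e.g. the 12 per user reported above) appear in every batch instead of being drowned out by the much larger offline buffer.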


[Figure: RoboPocket's robot-free instant policy iteration workflow. Hardware: an iPhone as edge-compute hub with an isomorphic adaptive gripper for sensory completeness. Software: active data verification, real-time constraints and feedback, AR trajectory replay, and a low-latency server-client remote inference system with AR Visual Foresight and a proactive intervention button. Loop: (1) the user identifies OOD states via AR Visual Foresight and collects data; (2) data streams in real time to a data-serving node; (3) online finetuning uses RLPD-style weighted sampling (50% offline + 50% online); (4) the updated model is pushed to the inference server, closing a ~150 ms feedback loop. Key innovations: robot-free operation, policy updates in minutes rather than hours, and 2× data efficiency that breaks the diminishing returns of pure data scaling.]
Q1
1. What is the core innovation that enables RoboPocket to achieve 'Robot-Free Instant Policy Iteration'?
Using AR Visual Foresight to visualize the policy's predicted trajectory on the smartphone screen, allowing users to proactively identify failures
Deploying multiple physical robots in distributed environments to collect data simultaneously
Using a GoPro camera with higher frame rate to capture more detailed robot movements
Q2
2. According to the paper, what is the 'deployment paradox' that RoboPocket aims to resolve?
Robots are too expensive to deploy but cheap to maintain in laboratories
Interactive learning requires physical robots for safety and scalability, but deploying unrefined policies in diverse environments is impractical and risky
Smartphone batteries drain too quickly when running complex AI models locally
Q3
3. How long does it take for RoboPocket's instant policy iteration loop to update and reflect improvements in the policy?
Several hours with offline batch processing
Minutes, with round-trip inference latency under 150ms over Wi-Fi
24-48 hours depending on the size of the dataset

Paper 2

Heterogeneous Agent Collaborative Reinforcement Learning

Published: 2026-03-03

Link: http://arxiv.org/pdf/2603.02604

1. 📘 Topic and Domain: The paper introduces Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), focusing on collaborative policy optimization for heterogeneous large language model agents in reinforcement learning with verifiable rewards.
2. 💡 Previous Research and New Ideas: The paper builds on Group Sequence Policy Optimization (GSPO) and RLVR paradigms, proposing a new approach where heterogeneous agents share verified rollouts during training for mutual improvement while maintaining independent execution at inference time.
3. ❓ Problem: The paper addresses the inefficiency of isolated on-policy optimization where multiple agents solving the same task repeatedly generate costly trajectories that are only used for self-training, missing opportunities for cross-agent knowledge transfer.
4. 🛠️ Methods: The authors propose the HACPO algorithm with four mechanisms: agent-capability-aware advantage estimation, a model-capability-discrepancy coefficient, exponential importance sampling, and stepwise clipping, which together enable effective rollout sharing among heterogeneous agents.
5. 📊 Results and Evaluation: HACPO consistently improves all participating agents across three heterogeneity types and seven mathematical reasoning benchmarks, achieving an average 3.3% improvement over GSPO while using only half the rollout cost.
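To make the first three mechanisms concrete, here is a minimal Python sketch written from the formulas in the paper's overview figure (A^(k)_i = (R(y_i) − μ̂^(k))/σ, Ã^(k)_i = ω^(j,k)·A^(j)_i, s̃ = s·sg(s)^α). Function names are illustrative, and the stop-gradient is omitted in this plain-Python version:

```python
def capability_aware_advantage(rewards, mu_k, sigma):
    """Advantage of shared rollouts measured against agent k's own
    reward baseline mu_k with group std sigma: A_i = (R_i - mu_k) / sigma."""
    return [(r - mu_k) / sigma for r in rewards]

def modulated_advantage(adv_j, omega_jk):
    """Scale agent j's advantages by the capability ratio omega(j,k)
    before agent k consumes them, damping gradients from mismatched peers."""
    return [omega_jk * a for a in adv_j]

def exponential_is_weight(s, alpha=0.5):
    """Exponential importance sampling: s_tilde = s * s**alpha, shrinking
    the weight of rollouts far from the learner's own distribution.
    (In training, the s**alpha factor sits behind a stop-gradient.)"""
    return s * (s ** alpha)
```

The same shared rollout thus contributes a differently scaled gradient to each participating agent, which is what allows weak and strong models to train on one pool without destabilizing either.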


[Figure: HACPO overview. Each agent π^(k)_{θ_k} generates rollouts Y_k(q) ~ π^(k)_{θ_k}(·|q); these are merged into a shared pool Y(q) = Y_1(q) ∪ Y_2(q) with rewards R(q) = {R(y) | y ∈ Y(q)}. Four mechanisms: (1) agent-capability-aware advantage, A^(k)_i = (R(y_i) − μ̂^(k))/σ, with capability ratio ω^(k,j); (2) model-capability-discrepancy modulation of gradients, Ã^(k)_i = ω^(j,k)·A^(j)_i; (3) exponential importance sampling for distribution-shift control, s̃^(k,j) = s^(k,j)·[sg(s^(k,j))]^α; (4) stepwise clipping for stability, clip(s, 1 − δ + k·δ_step, 1.0). The joint objective J^(k) = J^(k)_homo + J^(k)_hete combines self-generated and cross-agent rollouts for bidirectional knowledge transfer, yielding an average +3.3% over GSPO at 50% of the rollout cost.]
Q1
1. What is the key innovation that distinguishes HACRL from traditional multi-agent reinforcement learning (MARL)?
HACRL requires all agents to be deployed together and coordinate during inference time
HACRL enables collaborative optimization during training while allowing independent execution at inference
HACRL only works with homogeneous agents that share the same architecture
Q2
2. According to the paper's taxonomy, which represents the highest degree of heterogeneity between LLM agents?
Heterogeneous state - agents differ only in parameter values
Heterogeneous size - agents from same family but different parameter dimensions
Heterogeneous model - agents with different architectures, tokenizers, and training objectives
Q3
3. What is the purpose of the capability ratio ω(k,j) in HACPO's algorithm design?
It only serves to filter out weak agents from participating in collaborative training
It serves dual roles: calibrating reward baselines and modulating gradient updates based on agent capabilities
It is used exclusively to determine which agent should be the teacher in distillation

Paper 3

MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier

Published: 2026-03-04

Link: http://arxiv.org/pdf/2603.03756

1. 📘 Topic and Domain: The paper focuses on training large language models (LLMs) for scientific discovery, specifically addressing the computational intractability of modeling P(hypothesis|background).
2. 💡 Previous Research and New Ideas: Building on MOOSE-Chem's probabilistic decomposition theory and existing LLM discovery methods that rely on external feedback, the paper proposes MOOSE-STAR, which directly trains P(h|b) through decomposed subtasks and hierarchical search.
3. ❓ Problem: The paper solves the combinatorial complexity barrier (O(N^k)) that makes directly training LLMs to generate scientific hypotheses from research backgrounds mathematically intractable.
4. 🛠️ Methods: The authors decompose the intractable objective into sequential subtasks (inspiration retrieval and hypothesis composition), employ hierarchical search trees, introduce bounded composition for robustness, and use motivation planning to guide search.
5. 📊 Results and Evaluation: MOOSE-STAR reduces complexity from exponential to logarithmic O(log N) in best cases, achieves 54.37% inspiration retrieval accuracy, and demonstrates continuous test-time scaling while brute-force methods hit a "complexity wall" at 41.3% success rate.
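The hierarchical-search idea behind the O(N) → O(log N) reduction can be illustrated with a toy Python sketch. It assumes the candidate pool is already organized so that a group's first element is a faithful representative of the group (as a semantic tree would arrange it); the fan-out `branch` and the representative rule are illustrative:

```python
import math

def hierarchical_retrieve(items, score, branch=4):
    """Find the best-scoring item by repeatedly partitioning the pool into
    `branch` groups and descending into the group whose representative
    (here: its first element) scores highest. Score evaluations drop from
    O(N) to roughly O(branch * log_branch(N)) when the pool is pre-ordered
    so that representatives are informative."""
    comparisons = 0
    pool = list(items)
    while len(pool) > branch:
        size = math.ceil(len(pool) / branch)
        groups = [pool[i:i + size] for i in range(0, len(pool), size)]
        comparisons += len(groups)          # one representative scored per group
        pool = max(groups, key=lambda g: score(g[0]))
    comparisons += len(pool)                # exhaustive scan of the final group
    best = max(pool, key=score)
    return best, comparisons
```

On 64 pre-ordered candidates this scores only 12 items instead of 64, and the gap widens logarithmically with pool size; with unstructured pools the greedy descent becomes approximate, which is the trade-off a real semantic tree mitigates.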


[Figure: MOOSE-Star overview. Problem: directly training P(h|b) is intractable; end-to-end training is mathematically ill-posed under O(N^k) combinatorial complexity with N ≈ 10^7. Framework (decomposition theory): (I) decomposed sequential training into inspiration retrieval (IR) and hypothesis composition (HC), O(N^k) → O(kN); (II) bounded composition with semantic tolerance, O(kN) → O(k·N/M); (III) hierarchical tree-based search, O(N/M) → O(log N); (IV) motivation planning, pruning the pool to N_m < N via directional guidance. Training uses the TOMATO-STAR dataset of 108,717 decomposed papers (38,400 GPU hours): IR selects from 15 candidates, then HC generates Δh_j. Key results: IR accuracy 28.42% → 54.37%; HC total score 4.34 → 5.08; 3× efficiency gain from hierarchical search; continuous test-time scaling versus a complexity wall. MOOSE-Star reaches 100% success at 5,979 steps and works for k = 1, 2, 3, while the brute-force baseline peaks at 41.3% at 9,499 steps and fails for k ≥ 2 (8% for k = 3).]
Q1
1. What is the fundamental computational barrier that MOOSE-STAR addresses in training LLMs for scientific discovery?
The O(N^k) combinatorial complexity of retrieving k inspirations from N literature sources
The lack of sufficient GPU memory to process large scientific datasets
The difficulty in parsing PDF documents into structured formats
Q2
2. How does MOOSE-STAR's 'Bounded Composition' technique improve the model's performance?
It limits the model to only process papers published within the last 5 years
It trains the model to robustly handle semantically similar but inexact inspirations within a tolerance radius
It restricts the output hypothesis length to exactly 200 tokens
Q3
3. What happens to brute-force sampling methods when attempting to discover hypotheses requiring multiple inspirations (k≥2)?
They achieve 92% success rate by leveraging massive parallel computation
They hit a 'complexity wall' with performance collapsing to 36% for k=2 and 8% for k=3
They automatically switch to hierarchical search to maintain efficiency