2026-02-04 Papers


Paper 1

AOrchestra: Automating Sub-Agent Creation for Agentic Orchestration

Published: 2026-02-03

Link: http://arxiv.org/pdf/2602.03786

1. 📘 Topic and Domain: The paper focuses on agentic orchestration systems for automating complex, long-horizon tasks through dynamic sub-agent creation in the domain of AI agent systems.
2. 💡 Previous Research and New Ideas: Building on existing sub-agent-as-tools paradigms that use fixed roles or context isolation, the paper proposes a unified four-tuple abstraction (Instruction, Context, Tools, Model) for on-demand, dynamic sub-agent creation.
3. ❓ Problem: The paper addresses the lack of flexibility and adaptability in current multi-agent systems, which rely on static sub-agent roles or simple context isolation, limiting their effectiveness in open-ended environments.
4. 🛠️ Methods: The authors develop AOrchestra, an orchestrator-centric framework that dynamically creates specialized sub-agents using the four-tuple abstraction, with learnable orchestration through supervised fine-tuning and in-context learning.
5. 📊 Results and Evaluation: AOrchestra achieves a 16.28% relative improvement over the strongest baseline when paired with Gemini-3-Flash across three benchmarks (GAIA, SWE-Bench-Verified, Terminal-Bench), demonstrating superior performance in complex task automation.
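The four-tuple abstraction and the decompose-then-instantiate loop can be sketched in a few lines. This is an illustrative reconstruction, not the paper's published API: all function names, the splitting heuristic, and the tool/model routing rules below are hypothetical stand-ins.

```python
# Sketch of the paper's four-tuple agent abstraction
# Φ = (Instruction, Context, Tools, Model). Names and routing
# heuristics are hypothetical, not AOrchestra's actual API.
from dataclasses import dataclass


@dataclass
class SubAgentSpec:
    instruction: str      # I: clear, actionable success criteria
    context: str          # C: task-relevant curated details
    tools: list[str]      # T: minimal required tool subset
    model: str            # M: capability matched to task difficulty


# Toy stand-ins so the sketch runs end to end; the real system
# would make LLM calls here.
def decompose(task: str) -> list[str]:
    return [s.strip() for s in task.split(";")]

def select_tools(subtask: str) -> list[str]:
    return ["search"] if "find" in subtask else ["editor"]

def route_model(subtask: str) -> str:
    return "large" if len(subtask) > 40 else "small"


def orchestrate(task: str) -> list[SubAgentSpec]:
    """Decompose a task and instantiate one tailored sub-agent per subtask."""
    return [
        SubAgentSpec(
            instruction=f"Solve: {sub}",
            context=f"Parent task: {task}",
            tools=select_tools(sub),
            model=route_model(sub),
        )
        for sub in decompose(task)
    ]
```

For example, `orchestrate("find the failing test; patch the bug")` yields two specs with different tool subsets and model routes, mirroring the per-subtask specialization the paper describes.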

Figure: AOrchestra workflow. The orchestrator receives a user task, decomposes it into subtasks, and instantiates a tailored sub-agent for each via explicit tool calls, using the 4-tuple abstraction Φ = (Instruction, Context, Tools, Model): Instruction (I) states clear, actionable success criteria; Context (C) curates task-relevant details; Tools (T) grants the minimal required tool subset; Model (M) matches capability to task difficulty. Dynamic sub-agents (built on plug-and-play scaffolds such as ReAct, OpenHands, and Mini-SWE) receive delegated work and return results toward the final answer. The orchestrator itself is learnable: supervised fine-tuning teaches task orchestration skills, and in-context learning enables cost-aware routing.
Q1
1. What is the core abstraction that AOrchestra uses to model both main agents and sub-agents?
A three-tuple: (Task, Environment, Reward)
A four-tuple: (Instruction, Context, Tools, Model)
A five-tuple: (Goal, State, Action, Observation, Reward)
Q2
2. How does AOrchestra's approach to sub-agents differ from existing systems like Claude Code?
AOrchestra uses sub-agents as static, pre-defined roles with fixed capabilities
AOrchestra creates sub-agents dynamically on-demand with task-specific specialization
AOrchestra only uses sub-agents for context isolation without any specialization
Q3
3. What learning methods does AOrchestra employ to improve its orchestration capabilities?
Reinforcement learning and genetic algorithms
Supervised fine-tuning and in-context learning
Unsupervised clustering and transfer learning

Paper 2

3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation

Published: 2026-02-03

Link: http://arxiv.org/pdf/2602.03796

1. 📘 Topic and Domain: The paper addresses 3D-aware human motion control for video generation, enabling view-adaptive human animation from 2D driving videos.
2. 💡 Previous Research and New Ideas: Building on existing 2D pose-based and SMPL-based motion control methods, the paper proposes learning implicit view-agnostic motion representations that align with pretrained video generators' 3D priors rather than relying on external 3D reconstructions.
3. ❓ Problem: Current methods either rigidly bind motion to 2D driving viewpoints (preventing novel-view synthesis) or rely on inaccurate external 3D parametric models that override video generators' intrinsic spatial understanding.
4. 🛠️ Methods: The authors develop 3DiMo, jointly training a transformer-based motion encoder with a pretrained DiT video generator using view-rich supervision (single-view, multi-view, and moving-camera videos) and auxiliary geometric supervision that is gradually annealed.
5. 📊 Results and Evaluation: 3DiMo outperforms baselines on LPIPS, FID, and FVD metrics, with user studies confirming superior motion accuracy (4.28±0.08), naturalness (4.18±0.06), and 3D plausibility (4.05±0.09), demonstrating faithful motion reproduction with flexible camera control.
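The annealed auxiliary supervision from point 4 amounts to a loss weight that decays to zero over training, so the model transitions from external SMPL/MANO guidance to the generator's own learned 3D priors. A minimal sketch, assuming a linear schedule (the paper's exact schedule and step counts are not given here, so these are illustrative):

```python
def aux_weight(step: int, anneal_steps: int = 10_000) -> float:
    """Linearly anneal the auxiliary geometric loss weight from 1 to 0.

    Early in training the external SMPL/MANO prior guides the motion
    encoder; after `anneal_steps` the weight is zero and the model
    relies on the video generator's intrinsic 3D understanding.
    Linear decay and the 10k horizon are illustrative assumptions.
    """
    return max(0.0, 1.0 - step / anneal_steps)


def total_loss(recon_loss: float, aux_geom_loss: float, step: int) -> float:
    """Combine reconstruction loss with the annealed geometric term."""
    return recon_loss + aux_weight(step) * aux_geom_loss
```

The key design point is that the auxiliary term shapes the motion tokens only early on; removing it entirely at the end avoids the problem point 3 raises, where an inaccurate external 3D model would permanently override the generator's spatial understanding.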

Figure: 3DiMo workflow. A reference image (I_R) and driving video (V_D) are augmented (perspective transform, appearance augmentation) and passed to a motion encoder with separate body (E_b) and hand (E_h) encoders, yielding 1D motion tokens [z_b; z_h]. These tokens condition a DiT-based video generator (which also takes a text prompt and camera control) via cross-attention in the VAE latent space, while an auxiliary decoder provides SMPL/MANO geometric supervision. Training draws on a view-rich dataset (600K single-view, 80K multi-view, 80K camera-motion videos) and a progressive three-stage strategy: (1) single-view reconstruction, (2) mixed view-rich supervision, (3) cross-view refinement. The output is view-adaptive video with 3D-aware motion and camera control. Key features: end-to-end learning, view-agnostic design, 3D-aware supervision, flexible camera control.
Q1
1. What is the key innovation in 3DiMo's approach to motion representation compared to existing methods?
It uses higher resolution SMPL models with more accurate depth estimation
It learns implicit view-agnostic motion tokens that align with the video generator's 3D priors
It requires multi-camera setups to capture ground truth 3D motion data
Q2
2. How does 3DiMo handle the auxiliary geometric supervision during training?
It maintains constant SMPL supervision throughout all training stages
It only uses SMPL supervision in the final training stage for refinement
It gradually anneals the supervision to zero, transitioning from external guidance to learned 3D understanding
Q3
3. What type of data supervision strategy does 3DiMo employ to achieve genuine 3D awareness?
Only high-quality motion capture data with precise 3D joint annotations
View-rich supervision combining single-view, multi-view, and moving-camera videos
Exclusively synthetic data rendered from 3D human models

Paper 3

Green-VLA: Staged Vision-Language-Action Model for Generalist Robots

Published: 2026-01-31

Link: http://arxiv.org/pdf/2602.00919

1. 📘 Topic and Domain: The paper presents Green-VLA, a staged Vision-Language-Action framework for training generalist robots, with focus on humanoid robot control and multi-embodiment generalization.
2. 💡 Previous Research and New Ideas: The paper builds on existing VLA models (π0, OpenVLA, RT-2) and proposes a five-stage training curriculum (L0-L1-R0-R1-R2), unified action space across embodiments, and quality-focused data curation with temporal alignment.
3. ❓ Problem: The paper aims to solve the challenges of heterogeneous robotics datasets, poor data quality, behavior cloning limitations, and the difficulty of deploying VLA models across diverse robot embodiments while maintaining real-world performance.
4. 🛠️ Methods: The authors use a DataQA pipeline for quality filtering, unified action space with semantic layout, flow-matching action expert, joint prediction module for guidance, and two-phase RL fine-tuning (trajectory optimization and source distribution optimization).
5. 📊 Results and Evaluation: Green-VLA achieves 69.5% success rate on ALOHA table-cleaning (vs 35.6% for π0), 71.8% on Google Robot tasks, 91.7% on WidowX tasks, and demonstrates successful deployment on the Green humanoid robot with 90% average success across manipulation tasks.
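The unified action space in points 2 and 4 can be sketched as a fixed 64-dimensional vector with named slices per action group, plus a per-embodiment binary mask. The paper specifies a 64-dim semantic layout with embodiment-aware masking, but the slice assignments and group names below are hypothetical illustrations:

```python
# Sketch of a unified action vector with embodiment-aware masking.
# The 64-dim semantic layout is from the paper; the specific slice
# boundaries and embodiment groups here are invented for illustration.
import numpy as np

ACTION_DIM = 64

# Hypothetical semantic layout: fixed slices per action group.
LAYOUT = {
    "left_arm_joints":  slice(0, 7),
    "right_arm_joints": slice(7, 14),
    "left_gripper":     slice(14, 15),
    "right_gripper":    slice(15, 16),
    "base":             slice(16, 19),
    # remaining dims reserved for other embodiments / dex-hands
}

EMBODIMENT_GROUPS = {
    "single_arm": ["right_arm_joints", "right_gripper"],
    "dual_arm":   ["left_arm_joints", "right_arm_joints",
                   "left_gripper", "right_gripper"],
    "humanoid":   list(LAYOUT.keys()),
}


def embodiment_mask(embodiment: str) -> np.ndarray:
    """Binary mask selecting the action dims a given embodiment uses."""
    mask = np.zeros(ACTION_DIM)
    for group in EMBODIMENT_GROUPS[embodiment]:
        mask[LAYOUT[group]] = 1.0
    return mask


def masked_action(raw_action: np.ndarray, embodiment: str) -> np.ndarray:
    """Zero out dims the target embodiment cannot actuate."""
    return raw_action * embodiment_mask(embodiment)
```

Because every embodiment reads and writes the same semantically laid-out vector, trajectories from heterogeneous robots can be mixed in one pretraining corpus (stage R0) and the policy transfers by swapping the mask rather than the network.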

Figure: Green-VLA staged training pipeline. L0: base VLM (foundational vision-language model). L1: multimodal grounding on 24M web samples (VQA, pointing, spatial reasoning). R0: multi-embodiment pretraining on 3,000 hours of robot data with a unified action space. R1: embodiment adaptation (target-robot tuning, efficiency optimization). R2: RL alignment (trajectory optimization, long-horizon robustness). DataQA pipeline: jitter filtering (τ), sharpness scoring (σ), visual diversity (δ), state variance (σ²), and temporal alignment via optical flow. Key components: flow-matching action expert, episode progress prediction, OOD detection. JPM guidance: 2D affordance point, 3D lifting, target steering. Unified action space A*: a 64-dimensional semantic layout with embodiment-aware masking spanning single-arm EEF, dual-arm joints, humanoid joints, gripper/dex-hand, and mobile/static base. Deployment results: Green humanoid with 32-DoF control; ALOHA table cleaning 83.1% SR; Simpler 71.8% avg success; CALVIN 4.63 ACL (R2); e-commerce tasks 95.4% with JPM; multi-embodiment transfer.
Q1
1. What is the key innovation in Green-VLA's approach to handling multiple robot embodiments?
Using separate neural networks for each robot type
A unified action space with semantic layout and embodiment prompting
Training only on humanoid robots and transferring to other platforms
Q2
2. How does Green-VLA address the problem of varying execution speeds across different robotics datasets?
By discarding all slow-motion trajectories from the training set
Through optical flow-based temporal alignment and action interpolation
By training separate models for fast and slow robot movements
Q3
3. What performance improvement did Green-VLA achieve on the ALOHA table-cleaning task compared to π0?
From 35.6% to 69.5% success rate
From 69.5% to 83.1% success rate
From 12.1% to 35.6% success rate