2026-02-04 Papers


Paper 1

AOrchestra: Automating Sub-Agent Creation for Agentic Orchestration

Published: 2026-02-03

Link: http://arxiv.org/pdf/2602.03786

1. 📘 Topic and Domain: The paper focuses on agentic orchestration systems for automating complex, long-horizon tasks through dynamic sub-agent creation in the domain of AI agent systems.
2. 💡 Previous Research and New Ideas: Building on existing sub-agent-as-tools paradigms that use fixed roles or context isolation, the paper proposes a unified four-tuple abstraction (Instruction, Context, Tools, Model) for on-demand, dynamic sub-agent creation.
3. ❓ Problem: The paper addresses the lack of flexibility and adaptability in current multi-agent systems, which rely on static sub-agent roles or simple context isolation, limiting their effectiveness in open-ended environments.
4. 🛠️ Methods: The authors develop AOrchestra, an orchestrator-centric framework that dynamically creates specialized sub-agents using the four-tuple abstraction, with learnable orchestration through supervised fine-tuning and in-context learning.
5. 📊 Results and Evaluation: AOrchestra achieves a 16.28% relative improvement over the strongest baseline when paired with Gemini-3-Flash across three benchmarks (GAIA, SWE-Bench-Verified, Terminal-Bench), demonstrating superior performance in complex task automation.
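The four-tuple abstraction and the decompose-then-instantiate loop can be sketched in a few lines. This is an illustrative reconstruction, not the paper's published API: all function names, the splitting heuristic, and the tool/model routing rules below are hypothetical stand-ins.

```python
# Sketch of the paper's four-tuple agent abstraction
# Φ = (Instruction, Context, Tools, Model). Names and routing
# heuristics are hypothetical, not AOrchestra's actual API.
from dataclasses import dataclass


@dataclass
class SubAgentSpec:
    instruction: str      # I: clear, actionable success criteria
    context: str          # C: task-relevant curated details
    tools: list[str]      # T: minimal required tool subset
    model: str            # M: capability matched to task difficulty


# Toy stand-ins so the sketch runs end to end; the real system
# would make LLM calls here.
def decompose(task: str) -> list[str]:
    return [s.strip() for s in task.split(";")]

def select_tools(subtask: str) -> list[str]:
    return ["search"] if "find" in subtask else ["editor"]

def route_model(subtask: str) -> str:
    return "large" if len(subtask) > 40 else "small"


def orchestrate(task: str) -> list[SubAgentSpec]:
    """Decompose a task and instantiate one tailored sub-agent per subtask."""
    return [
        SubAgentSpec(
            instruction=f"Solve: {sub}",
            context=f"Parent task: {task}",
            tools=select_tools(sub),
            model=route_model(sub),
        )
        for sub in decompose(task)
    ]
```

For example, `orchestrate("find the failing test; patch the bug")` yields two specs with different tool subsets and model routes, mirroring the per-subtask specialization the paper describes.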

Figure: AOrchestra workflow. The orchestrator receives a user task, decomposes it into subtasks, and instantiates a tailored sub-agent for each via explicit tool calls, using the 4-tuple abstraction Φ = (Instruction, Context, Tools, Model): Instruction (I) states clear, actionable success criteria; Context (C) curates task-relevant details; Tools (T) grants the minimal required tool subset; Model (M) matches capability to task difficulty. Dynamic sub-agents (built on plug-and-play scaffolds such as ReAct, OpenHands, and Mini-SWE) receive delegated work and return results toward the final answer. The orchestrator itself is learnable: supervised fine-tuning teaches task orchestration skills, and in-context learning enables cost-aware routing.
Q1
1. What is the core abstraction that AOrchestra uses to model both main agents and sub-agents?
A three-tuple: (Task, Environment, Reward)
A four-tuple: (Instruction, Context, Tools, Model)
A five-tuple: (Goal, State, Action, Observation, Reward)
Q2
2. How does AOrchestra's approach to sub-agents differ from existing systems like Claude Code?
AOrchestra uses sub-agents as static, pre-defined roles with fixed capabilities
AOrchestra creates sub-agents dynamically on-demand with task-specific specialization
AOrchestra only uses sub-agents for context isolation without any specialization
Q3
3. What learning methods does AOrchestra employ to improve its orchestration capabilities?
Reinforcement learning and genetic algorithms
Supervised fine-tuning and in-context learning
Unsupervised clustering and transfer learning

Paper 2

3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation

Published: 2026-02-03

Link: http://arxiv.org/pdf/2602.03796

1. 📘 Topic and Domain: The paper addresses 3D-aware human motion control for video generation, enabling view-adaptive human animation from 2D driving videos.
2. 💡 Previous Research and New Ideas: Building on existing 2D pose-based and SMPL-based motion control methods, the paper proposes learning implicit view-agnostic motion representations that align with pretrained video generators' 3D priors rather than relying on external 3D reconstructions.
3. ❓ Problem: Current methods either rigidly bind motion to 2D driving viewpoints (preventing novel-view synthesis) or rely on inaccurate external 3D parametric models that override video generators' intrinsic spatial understanding.
4. 🛠️ Methods: The authors develop 3DiMo, jointly training a transformer-based motion encoder with a pretrained DiT video generator using view-rich supervision (single-view, multi-view, and moving-camera videos) and auxiliary geometric supervision that is gradually annealed.
5. 📊 Results and Evaluation: 3DiMo outperforms baselines on LPIPS, FID, and FVD metrics, with user studies confirming superior motion accuracy (4.28±0.08), naturalness (4.18±0.06), and 3D plausibility (4.05±0.09), demonstrating faithful motion reproduction with flexible camera control.
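The annealed auxiliary supervision from point 4 amounts to a loss weight that decays to zero over training, so the model transitions from external SMPL/MANO guidance to the generator's own learned 3D priors. A minimal sketch, assuming a linear schedule (the paper's exact schedule and step counts are not given here, so these are illustrative):

```python
def aux_weight(step: int, anneal_steps: int = 10_000) -> float:
    """Linearly anneal the auxiliary geometric loss weight from 1 to 0.

    Early in training the external SMPL/MANO prior guides the motion
    encoder; after `anneal_steps` the weight is zero and the model
    relies on the video generator's intrinsic 3D understanding.
    Linear decay and the 10k horizon are illustrative assumptions.
    """
    return max(0.0, 1.0 - step / anneal_steps)


def total_loss(recon_loss: float, aux_geom_loss: float, step: int) -> float:
    """Combine reconstruction loss with the annealed geometric term."""
    return recon_loss + aux_weight(step) * aux_geom_loss
```

The key design point is that the auxiliary term shapes the motion tokens only early on; removing it entirely at the end avoids the problem point 3 raises, where an inaccurate external 3D model would permanently override the generator's spatial understanding.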

Figure: 3DiMo workflow. A reference image (I_R) and driving video (V_D) are augmented (perspective transform, appearance augmentation) and passed to a motion encoder with separate body (E_b) and hand (E_h) encoders, yielding 1D motion tokens [z_b; z_h]. These tokens condition a DiT-based video generator (which also takes a text prompt and camera control) via cross-attention in the VAE latent space, while an auxiliary decoder provides SMPL/MANO geometric supervision. Training draws on a view-rich dataset (600K single-view, 80K multi-view, 80K camera-motion videos) and a progressive three-stage strategy: (1) single-view reconstruction, (2) mixed view-rich supervision, (3) cross-view refinement. The output is view-adaptive video with 3D-aware motion and camera control. Key features: end-to-end learning, view-agnostic design, 3D-aware supervision, flexible camera control.
Q1
1. What is the key innovation in 3DiMo's approach to motion representation compared to existing methods?
It uses higher resolution SMPL models with more accurate depth estimation
It learns implicit view-agnostic motion tokens that align with the video generator's 3D priors
It requires multi-camera setups to capture ground truth 3D motion data
Q2
2. How does 3DiMo handle the auxiliary geometric supervision during training?
It maintains constant SMPL supervision throughout all training stages
It only uses SMPL supervision in the final training stage for refinement
It gradually anneals the supervision to zero, transitioning from external guidance to learned 3D understanding
Q3
3. What type of data supervision strategy does 3DiMo employ to achieve genuine 3D awareness?
Only high-quality motion capture data with precise 3D joint annotations
View-rich supervision combining single-view, multi-view, and moving-camera videos
Exclusively synthetic data rendered from 3D human models

Paper 3

Green-VLA: Staged Vision-Language-Action Model for Generalist Robots

Published: 2026-01-31

Link: http://arxiv.org/pdf/2602.00919

1. 📘 Topic and Domain: The paper presents Green-VLA, a staged Vision-Language-Action framework for training generalist robots, with focus on humanoid robot control and multi-embodiment generalization.
2. 💡 Previous Research and New Ideas: The paper builds on existing VLA models (π0, OpenVLA, RT-2) and proposes a five-stage training curriculum (L0-L1-R0-R1-R2), unified action space across embodiments, and quality-focused data curation with temporal alignment.
3. ❓ Problem: The paper aims to solve the challenges of heterogeneous robotics datasets, poor data quality, behavior cloning limitations, and the difficulty of deploying VLA models across diverse robot embodiments while maintaining real-world performance.
4. 🛠️ Methods: The authors use a DataQA pipeline for quality filtering, unified action space with semantic layout, flow-matching action expert, joint prediction module for guidance, and two-phase RL fine-tuning (trajectory optimization and source distribution optimization).
5. 📊 Results and Evaluation: Green-VLA achieves 69.5% success rate on ALOHA table-cleaning (vs 35.6% for π0), 71.8% on Google Robot tasks, 91.7% on WidowX tasks, and demonstrates successful deployment on the Green humanoid robot with 90% average success across manipulation tasks.
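The unified action space in points 2 and 4 can be sketched as a fixed 64-dimensional vector with named slices per action group, plus a per-embodiment binary mask. The paper specifies a 64-dim semantic layout with embodiment-aware masking, but the slice assignments and group names below are hypothetical illustrations:

```python
# Sketch of a unified action vector with embodiment-aware masking.
# The 64-dim semantic layout is from the paper; the specific slice
# boundaries and embodiment groups here are invented for illustration.
import numpy as np

ACTION_DIM = 64

# Hypothetical semantic layout: fixed slices per action group.
LAYOUT = {
    "left_arm_joints":  slice(0, 7),
    "right_arm_joints": slice(7, 14),
    "left_gripper":     slice(14, 15),
    "right_gripper":    slice(15, 16),
    "base":             slice(16, 19),
    # remaining dims reserved for other embodiments / dex-hands
}

EMBODIMENT_GROUPS = {
    "single_arm": ["right_arm_joints", "right_gripper"],
    "dual_arm":   ["left_arm_joints", "right_arm_joints",
                   "left_gripper", "right_gripper"],
    "humanoid":   list(LAYOUT.keys()),
}


def embodiment_mask(embodiment: str) -> np.ndarray:
    """Binary mask selecting the action dims a given embodiment uses."""
    mask = np.zeros(ACTION_DIM)
    for group in EMBODIMENT_GROUPS[embodiment]:
        mask[LAYOUT[group]] = 1.0
    return mask


def masked_action(raw_action: np.ndarray, embodiment: str) -> np.ndarray:
    """Zero out dims the target embodiment cannot actuate."""
    return raw_action * embodiment_mask(embodiment)
```

Because every embodiment reads and writes the same semantically laid-out vector, trajectories from heterogeneous robots can be mixed in one pretraining corpus (stage R0) and the policy transfers by swapping the mask rather than the network.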

Figure: Green-VLA staged training pipeline. L0: base VLM (foundational vision-language model). L1: multimodal grounding on 24M web samples (VQA, pointing, spatial reasoning). R0: multi-embodiment pretraining on 3,000 hours of robot data with a unified action space. R1: embodiment adaptation (target-robot tuning, efficiency optimization). R2: RL alignment (trajectory optimization, long-horizon robustness). DataQA pipeline: jitter filtering (τ), sharpness scoring (σ), visual diversity (δ), state variance (σ²), and temporal alignment via optical flow. Key components: flow-matching action expert, episode progress prediction, OOD detection. JPM guidance: 2D affordance point, 3D lifting, target steering. Unified action space A*: a 64-dimensional semantic layout with embodiment-aware masking spanning single-arm EEF, dual-arm joints, humanoid joints, gripper/dex-hand, and mobile/static base. Deployment results: Green humanoid with 32-DoF control; ALOHA table cleaning 83.1% SR; Simpler 71.8% avg success; CALVIN 4.63 ACL (R2); e-commerce tasks 95.4% with JPM; multi-embodiment transfer.
Q1
1. What is the key innovation in Green-VLA's approach to handling multiple robot embodiments?
Using separate neural networks for each robot type
A unified action space with semantic layout and embodiment prompting
Training only on humanoid robots and transferring to other platforms
Q2
2. How does Green-VLA address the problem of varying execution speeds across different robotics datasets?
By discarding all slow-motion trajectories from the training set
Through optical flow-based temporal alignment and action interpolation
By training separate models for fast and slow robot movements
Q3
3. What performance improvement did Green-VLA achieve on the ALOHA table-cleaning task compared to π0?
From 35.6% to 69.5% success rate
From 69.5% to 83.1% success rate
From 12.1% to 35.6% success rate