1. 📘 Topic and Domain: The paper presents Green-VLA, a staged Vision-Language-Action framework for training generalist robots, with a focus on humanoid robot control and multi-embodiment generalization.
2. 💡 Previous Research and New Ideas: The paper builds on existing VLA models (π0, OpenVLA, RT-2) and proposes a five-stage training curriculum (L0-L1-R0-R1-R2), a unified action space shared across embodiments (see the first sketch after this list), and quality-focused data curation with temporal alignment.
3. ❓ Problem: The paper addresses the heterogeneity of robotics datasets, inconsistent data quality, the limitations of behavior cloning, and the difficulty of deploying VLA models across diverse robot embodiments without sacrificing real-world performance.
4. 🛠️ Methods: The authors use a DataQA pipeline for quality filtering, a unified action space with a semantic layout, a flow-matching action expert (see the second sketch after this list), a joint prediction module for guidance, and two-phase RL fine-tuning consisting of trajectory optimization followed by source-distribution optimization.
5. 📊 Results and Evaluation: Green-VLA achieves 69.5% success rate on ALOHA table-cleaning (vs 35.6% for π0), 71.8% on Google Robot tasks, 91.7% on WidowX tasks, and demonstrates successful deployment on the Green humanoid robot with 90% average success across manipulation tasks.
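To make the unified action space concrete, the sketch below shows one plausible way a shared, semantically laid-out action vector could be populated from different embodiments. The slot names, dimensions, and the `to_unified` helper are illustrative assumptions, not the paper's actual specification.

```python
import numpy as np

# Hypothetical semantic layout for a shared action vector: fixed slots that every
# embodiment writes into, with unused slots left at zero. Slot names and sizes
# are illustrative assumptions, not taken from the paper.
UNIFIED_LAYOUT = {
    "left_arm_joints":  slice(0, 7),
    "right_arm_joints": slice(7, 14),
    "left_gripper":     slice(14, 15),
    "right_gripper":    slice(15, 16),
    "base_velocity":    slice(16, 19),
}
UNIFIED_DIM = 19

def to_unified(embodiment_action: dict[str, np.ndarray]) -> np.ndarray:
    """Map an embodiment-specific action dict into the shared semantic layout."""
    unified = np.zeros(UNIFIED_DIM, dtype=np.float32)
    for name, values in embodiment_action.items():
        unified[UNIFIED_LAYOUT[name]] = values   # write into the matching slot
    return unified

# A single-arm WidowX-style action only fills the slots it actually has.
widowx_action = {
    "right_arm_joints": np.zeros(7, dtype=np.float32),
    "right_gripper":    np.array([1.0], dtype=np.float32),
}
print(to_unified(widowx_action).shape)  # (19,)
```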
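The flow-matching action expert is described only at a high level; flow-matching inference generally amounts to integrating a learned velocity field from Gaussian noise to an action chunk. The sketch below shows that generic sampling loop; `velocity_model`, the chunk horizon, action dimension, and step count are placeholders rather than the paper's actual architecture or settings.

```python
import torch

def sample_action_chunk(velocity_model, obs_embedding, horizon=50, action_dim=32, steps=10):
    """Generic flow-matching sampler: Euler-integrate the learned velocity field
    from noise (t=0) toward an action chunk (t=1). Shapes and the velocity_model
    interface are assumptions for illustration."""
    a = torch.randn(horizon, action_dim)          # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)              # scalar flow time in [0, 1)
        v = velocity_model(a, t, obs_embedding)   # predicted velocity at (a, t)
        a = a + dt * v                            # Euler step toward the data manifold
    return a                                      # action chunk to execute on the robot
```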