1. 📘 Topic and Domain: Development of GR-3, a large-scale vision-language-action (VLA) model for robotic manipulation and control.
2. 💡 Previous Research and New Ideas: Building on prior VLA models and pre-trained vision-language models, the work proposes co-training with web-scale vision-language data and efficient fine-tuning from human trajectory data.
3. ❓ Problem: Creating generalist robot policies that generalize to novel objects, environments, and instructions while reliably performing complex long-horizon tasks.
4. 🛠️ Methods: Three-way co-training on robot trajectory data, web-scale vision-language data, and human trajectory data collected via VR devices; the policy generates actions with a flow-matching architecture and uses RMSNorm normalization for training stability.
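A minimal sketch of the flow-matching objective mentioned above, for intuition only: the network, layer sizes, and dimensions (`ActionDenoiser`, `obs_dim`, `act_dim`) are illustrative assumptions, not the GR-3 paper's actual architecture. The model learns a velocity field that transports Gaussian noise to action chunks, conditioned on an observation embedding.

```python
import torch
import torch.nn as nn

class ActionDenoiser(nn.Module):
    """Hypothetical toy velocity-field network (not GR-3's real architecture)."""
    def __init__(self, obs_dim=32, act_dim=8, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, noisy_action, t):
        # Predict the velocity v(a_t, t | obs).
        return self.net(torch.cat([obs, noisy_action, t], dim=-1))

def flow_matching_loss(model, obs, action):
    # Sample a random interpolation time t in [0, 1] per example.
    t = torch.rand(action.shape[0], 1)
    noise = torch.randn_like(action)
    # Linear interpolation: pure noise at t=0, real action at t=1.
    a_t = (1 - t) * noise + t * action
    # Target velocity is the straight-line direction from noise to data.
    target = action - noise
    pred = model(obs, a_t, t)
    return ((pred - target) ** 2).mean()

model = ActionDenoiser()
obs = torch.randn(16, 32)      # batch of observation embeddings
action = torch.randn(16, 8)    # batch of ground-truth action vectors
loss = flow_matching_loss(model, obs, action)
```

At inference, a trained model would integrate the predicted velocity field from noise toward an action, which is what makes flow matching attractive for fast, smooth action generation.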
5. 📊 Results and Evaluation: GR-3 outperformed the π0 baseline across three challenging task suites (pick-and-place, long-horizon table bussing, and cloth manipulation), demonstrating stronger instruction following and generalization to novel scenarios.