1. 📘 Topic and Domain: The paper presents SmolVLA, a compact and efficient vision-language-action (VLA) model for robotics that enables natural language-driven robot control.
2. 💡 Previous Research and New Ideas: Building on prior work in vision-language models (VLMs) and robotics foundation models, the paper introduces a lightweight VLA architecture and an asynchronous inference stack, and it trains on community-collected datasets rather than expensive, industrially collected ones.
3. ❓ Problem: The paper addresses the challenge of making VLA models more accessible and efficient, as existing models are typically massive (billions of parameters), expensive to train, and rely on costly robotic platforms and datasets.
4. 🛠️ Methods: The authors developed a compact VLA model combining a small pretrained vision-language model with an action expert trained via flow matching, implemented layer skipping for efficiency, and created an asynchronous inference stack that decouples perception from action execution.
5. 📊 Results and Evaluation: SmolVLA achieved performance comparable to that of VLA models up to 10x larger across both simulated and real-world robotic tasks, while being trainable on a single GPU and deployable on consumer-grade hardware; asynchronous inference additionally enabled roughly 30% faster task completion.
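The asynchronous inference stack described in item 4 can be sketched as follows: a background thread predicts the next chunk of actions while the control loop keeps executing the current chunk, so the robot never has to idle during a slow forward pass. This is a minimal illustration only; the stub policy, chunk size, trigger threshold, and timing values are hypothetical stand-ins, not SmolVLA's actual implementation.

```python
# Minimal sketch of asynchronous inference: decouple (slow) action-chunk
# prediction from (steady) action execution. All names and numbers here are
# illustrative assumptions, not the paper's real code.
import queue
import threading
import time

CHUNK_SIZE = 4  # actions produced per inference call (illustrative)
TRIGGER = 2     # swap in a fresh chunk once this few actions remain


def predict_chunk(step):
    """Stub policy: stands in for the VLA model's forward pass."""
    time.sleep(0.05)  # simulate inference latency
    return [f"action(step={step}, i={i})" for i in range(CHUNK_SIZE)]


def run_episode(num_steps=12):
    chunks = queue.Queue(maxsize=1)   # at most one pending chunk
    latest_obs = {"step": 0}

    def inference_worker():
        # Continuously predict from the newest observation; put() blocks
        # while an unconsumed chunk is already waiting.
        while True:
            obs = dict(latest_obs)
            chunks.put(predict_chunk(obs["step"]))

    threading.Thread(target=inference_worker, daemon=True).start()

    executed = []
    current = chunks.get()            # first chunk: must wait once
    for step in range(num_steps):
        latest_obs["step"] = step
        # If a fresher chunk is ready and we are running low, switch to it.
        if len(current) <= TRIGGER and not chunks.empty():
            current = chunks.get()
        if not current:
            current = chunks.get()    # ran dry: block until a chunk arrives
        executed.append(current.pop(0))
        time.sleep(0.02)              # simulate the actuator executing one action
    return executed


if __name__ == "__main__":
    print(run_episode())
```

With these illustrative timings, executing a chunk (4 × 20 ms) takes longer than predicting the next one (50 ms), so the fresh chunk is usually ready before the current one runs out and execution proceeds without stalls.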