2025-06-03 Papers


Paper 1

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Published: 2025-06-02

Link: http://arxiv.org/pdf/2506.01844

1. 📘 Topic and Domain: The paper presents SmolVLA, a compact and efficient vision-language-action (VLA) model for robotics that enables natural language-driven robot control.
2. 💡 Previous Research and New Ideas: Based on previous work in vision-language models (VLMs) and robotics foundation models, it introduces a lightweight VLA architecture and asynchronous inference stack while utilizing community-collected datasets rather than expensive industrial ones.
3. ❓ Problem: The paper addresses the challenge of making VLA models more accessible and efficient, as existing models are typically massive (billions of parameters), expensive to train, and rely on costly robotic platforms and datasets.
4. 🛠️ Methods: The authors developed a compact VLA model combining a small pretrained vision-language model with an action expert trained via flow matching, implemented layer skipping for efficiency, and created an asynchronous inference stack that decouples perception from action execution.
5. 📊 Results and Evaluation: SmolVLA achieved performance comparable to VLA models 10x larger across both simulated and real-world robotic tasks, while being trainable on a single GPU and deployable on consumer-grade hardware, with the asynchronous inference enabling 30% faster task completion.
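The asynchronous inference idea above can be sketched in a few lines: a background thread runs (slow) model inference to produce chunks of actions, while the control loop executes already-queued actions and requests the next chunk before the queue runs dry. This is a minimal illustration of the decoupling principle, not SmolVLA's actual implementation; the policy, chunk length, and refill threshold are assumptions.

```python
# Sketch of asynchronous, chunked inference: perception/inference runs in a
# worker thread while the control loop executes queued actions in parallel.
import queue
import threading
import time

CHUNK_LEN = 4          # actions produced per inference call (illustrative)
REFILL_THRESHOLD = 2   # request a new chunk before the queue empties

def fake_policy(observation):
    """Stand-in for the VLA forward pass: returns a chunk of actions."""
    time.sleep(0.05)  # simulate slow model inference
    return [f"action({observation}, step={i})" for i in range(CHUNK_LEN)]

def inference_worker(obs_q, act_q):
    """Runs inference in the background, decoupled from action execution."""
    while True:
        obs = obs_q.get()
        if obs is None:          # shutdown signal
            break
        for action in fake_policy(obs):
            act_q.put(action)

obs_q, act_q = queue.Queue(), queue.Queue()
worker = threading.Thread(target=inference_worker, args=(obs_q, act_q))
worker.start()

obs_q.put("obs_0")               # prime the first chunk
executed = []
for t in range(8):               # robot control loop
    if act_q.qsize() <= REFILL_THRESHOLD:
        obs_q.put(f"obs_{t + 1}")  # ask for the next chunk early, in parallel
    executed.append(act_q.get())   # execute without waiting on a full inference

obs_q.put(None)
worker.join()
print(len(executed))  # 8 actions executed
```

A synchronous policy would instead block the control loop for the full inference latency between every chunk, which is the overhead the paper's asynchronous stack avoids.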

SmolVLA workflow overview:

- Inputs: language instruction, RGB images, robot state
- Vision-language model: SmolVLM-2 backbone, skipping the last L−N layers, reduced visual tokens
- Action expert: flow matching, interleaved cross-attention and self-attention
- Training process: community datasets (23k episodes), single-GPU training, end-to-end optimization
- Asynchronous inference: decoupled action execution, chunked action generation, parallel processing
Q1. What is the main innovation in SmolVLA's architecture that helps reduce computational costs while maintaining performance?
- Using a completely new type of neural network architecture
- Skipping layers in the vision-language model and interleaving cross/self-attention
- Removing the vision component entirely and only using language processing

Q2. How does SmolVLA's asynchronous inference improve robot performance compared to synchronous inference?
- It makes the robot movements more precise but slower
- It reduces power consumption but increases error rates
- It completes tasks 30% faster by decoupling perception from action execution

Q3. What unique approach does SmolVLA take regarding training data compared to other VLA models?
- It uses synthetic data generated by AI
- It relies on community-contributed datasets from affordable robots
- It only uses data from industrial robotic arms

Paper 2

Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control

Published: 2025-06-02

Link: http://arxiv.org/pdf/2506.01943

1. 📘 Topic and Domain: Video generation for robotic manipulation using trajectory control and diffusion models, at the intersection of computer vision and robotics.
2. 💡 Previous Research and New Ideas: Based on prior trajectory-controlled video generation methods that focus on individual object motion, this paper proposes a novel collaborative trajectory approach that decomposes interaction into phases.
3. ❓ Problem: Existing trajectory-based methods struggle with multi-object interaction and feature entanglement in overlapping regions during robotic manipulation, leading to degraded visual quality.
4. 🛠️ Methods: Introduces RoboMaster framework that decomposes interaction into pre-interaction, interaction, and post-interaction phases, incorporating appearance and shape-aware latent representations with mask-based object embeddings.
5. 📊 Results and Evaluation: Outperforms existing approaches on the Bridge V2 dataset in both visual quality (FVD, PSNR, SSIM) and trajectory accuracy, while demonstrating better generalization to in-the-wild scenarios.
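The phase decomposition above can be illustrated with a toy segmentation: split a clip into pre-interaction, interaction, and post-interaction frames according to when the robot arm is within contact range of the object. The 1-D positions, contact test, and distance threshold are illustrative assumptions, not RoboMaster's actual formulation.

```python
# Toy sketch of three-phase decomposition: frames before first contact are
# "pre-interaction", frames while in contact range are "interaction", and
# frames after the last contact are "post-interaction".

def decompose_phases(robot_traj, obj_pos, contact_dist=1.0):
    """Return frame indices per phase, given 1-D robot positions over time."""
    in_contact = [abs(p - obj_pos) <= contact_dist for p in robot_traj]
    if True not in in_contact:              # robot never reaches the object
        return {"pre": list(range(len(robot_traj))), "interaction": [], "post": []}
    start = in_contact.index(True)
    end = len(in_contact) - 1 - in_contact[::-1].index(True)
    return {
        "pre": list(range(0, start)),                   # robot approaches alone
        "interaction": list(range(start, end + 1)),     # robot and object move jointly
        "post": list(range(end + 1, len(robot_traj))),  # object settles, robot retreats
    }

# Robot approaches an object at position 0.0, manipulates it, then retreats
phases = decompose_phases([5.0, 3.0, 1.5, 0.5, 0.2, 0.8, 2.5], obj_pos=0.0)
print(phases)  # {'pre': [0, 1, 2], 'interaction': [3, 4, 5], 'post': [6]}
```

Segmenting the clip this way lets each phase condition generation on only the entity that is actually moving, which is the intuition behind avoiding feature entanglement in overlapping regions.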

RoboMaster workflow overview:

- Input: initial frame, text prompt
- Subject embedding: object mask, appearance and shape features
- Collaborative trajectory: pre-interaction, interaction, post-interaction
- Motion injector → generated video
Q1. What is the key innovation in RoboMaster's approach to handling robotic manipulation videos compared to previous methods?
- Using multiple separate trajectories for each object
- Decomposing the interaction into three temporal phases (pre-interaction, interaction, post-interaction)
- Focusing only on the robot arm's trajectory

Q2. Why does the paper use mask-based object representation instead of point-based representation?
- Because masks are easier to generate automatically
- Because point-based representations are too computationally expensive
- Because masks better preserve object identity and shape consistency across frames

Q3. What advantage does RoboMaster offer in terms of user interaction?
- Users only need to annotate interaction phase start/end frames instead of complete trajectories for both objects
- Users can control the robot with voice commands
- Users can train the model with their own datasets

Paper 3

ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding

Published: 2025-06-02

Link: http://arxiv.org/pdf/2506.01853

1. 📘 Topic and Domain: Native multimodal large language model (LLM) for 3D content generation and understanding, extending beyond traditional text-image capabilities.
2. 💡 Previous Research and New Ideas: Based on previous work in text-to-image LLMs like GPT-4o, this paper proposes the first unified model that integrates 3D capabilities into a multimodal LLM framework through discrete token representation.
3. ❓ Problem: The paper addresses the limitation of current multimodal LLMs being confined to only text and images, lacking the ability to understand and generate 3D content.
4. 🛠️ Methods: Uses a 3D vector-quantized variational autoencoder (VQVAE) to convert 3D objects into discrete tokens, constructs a 3D-Alpaca dataset for training, and fine-tunes the Qwen2.5-VL-7B-Instruct model.
5. 📊 Results and Evaluation: The model achieves strong performance in text-to-3D, image-to-3D generation, and 3D understanding tasks, demonstrating capabilities close to specialized models while maintaining general language abilities.
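The core of the VQVAE tokenization step is vector quantization: each continuous latent is snapped to its nearest codebook entry and represented by that entry's index, yielding the discrete tokens the LLM consumes. The codebook below matches the paper's 1024-token count, but the latent dimension and data are toy values, not ShapeLLM-Omni's actual model.

```python
# Minimal sketch of the vector-quantization step behind a 3D VQVAE:
# continuous latents -> indices of nearest codebook entries ("discrete tokens").
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 8))   # 1024 entries (paper's token count), toy dim 8

def quantize(latents):
    """Map continuous latents of shape (N, 8) to discrete token ids of shape (N,)."""
    # Squared distance from every latent to every codebook entry
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)              # index of the nearest entry per latent

latents = rng.normal(size=(5, 8))
tokens = quantize(latents)
print(tokens.shape)  # (5,) — token ids in the range [0, 1024)
```

Once 3D shapes are expressed as such token sequences, generation and understanding reduce to ordinary next-token prediction, which is why the authors can fine-tune an existing multimodal LLM rather than design a new architecture.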

ShapeLLM-Omni workflow overview:

- 3D VQVAE training: compress 3D shapes into discrete tokens (64³ voxel grid, 1024 tokens per object)
- 3D-Alpaca dataset: text/image-3D pairs (710k text/image pairs, 62k editing instructions)
- Model training: based on Qwen2.5-VL (3.46B tokens, 2.56M samples)
- Capabilities: text-to-3D generation, image-to-3D generation, and 3D understanding, unified in one model
Q1. What is the key innovation in how ShapeLLM-Omni represents 3D objects compared to previous approaches?
- It uses continuous vector representations
- It converts 3D objects into discrete tokens using VQVAE
- It directly processes raw 3D mesh files

Q2. What is the main limitation addressed by this paper regarding current multimodal LLMs like GPT-4o?
- They cannot process text inputs efficiently
- They have poor image generation quality
- They lack 3D content understanding and generation capabilities

Q3. How many discrete tokens does ShapeLLM-Omni use to represent a single 3D object?
- 4096 tokens
- 2048 tokens
- 1024 tokens