2025-09-12 Papers

Paper 1

VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

Published: 2025-09-11

Link: http://arxiv.org/pdf/2509.09372

1. 📘 Topic and Domain: Vision-Language-Action (VLA) modeling for robotic control, focusing on bridging visual-language perception with action generation.
2. 💡 Previous Research and New Ideas: Builds on prior VLA models that rely on large pre-trained vision-language backbones and extensive robotic-data pre-training; proposes a lightweight "VLA-Adapter" paradigm that reduces this reliance.
3. ❓ Problem: Current VLA models face bottlenecks including dependence on large-scale vision-language models, slow fine-tuning, high GPU memory usage, and low inference efficiency.
4. 🛠️ Methods: Introduces a Policy module with Bridge Attention that uses both Raw features and ActionQuery features from all layers of a small backbone model to effectively bridge perception and action spaces, without requiring robotic pre-training.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance (97.3% success rate) with only a 0.5B-parameter backbone (vs. 7B for previous methods), 3× faster inference, 1/38 the training cost, and successful real-world robot deployment, evaluated on the LIBERO and CALVIN benchmarks.

[Figure: VLA-Adapter workflow]
• Inputs: 3rd-view image (X_v), gripper image (X_g), instruction (L_t)
• Vision features: DINOv2 + SigLIP embeddings
• VLM backbone: Qwen2.5-0.5B (Prismatic-VLMs architecture, M layers); Raw latents (C_R) and ActionQuery latents (C_AQ, 64 learnable tokens) are extracted from every layer
• Condition analysis: middle-layer Raw features and deep-layer ActionQuery features perform best; using features from all layers is optimal
• Policy network: 97M parameters, trained from scratch, M layers mirroring the VLM, each with a Bridge Attention module and an FFN
• Bridge Attention: cross-attention over C_R (selective injection via a learnable ratio with tanh activation for stability), cross-attention over C_AQ plus proprioception (full injection), and self-attention over the action latent
• Training: end-to-end with an L1 loss, 8 hours on a single GPU
• Output: H-step chunks of 7D continuous actions at 219.2 Hz inference
• Results: 97.3% success rate on LIBERO, 4.42 average length on CALVIN, 1/14 the backbone size of prior SOTA, 1/38 the training cost, 3× faster inference
• Key innovation: effective VL→A bridging with a tiny-scale (0.5B) backbone, no robotic pre-training, and autonomous condition injection
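The Bridge Attention module described above can be sketched roughly as follows. This is a minimal illustration based only on the summary: the class and argument names are invented, and details such as layer norms, projections, and the per-layer feature aggregation are omitted.

```python
import torch
import torch.nn as nn

class BridgeAttention(nn.Module):
    """Hedged sketch of the Bridge Attention idea: the action latent
    cross-attends to Raw (C_R) and ActionQuery (C_AQ) features from the
    VLM, with a learnable tanh-gated ratio controlling how much of C_R
    is injected. Names and structure here are illustrative, not the
    paper's actual implementation."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.cross_raw = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_aq = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # learnable injection ratio

    def forward(self, action_latent, c_raw, c_aq):
        # Selective injection of Raw features, scaled by a tanh gate
        # (tanh keeps the ratio bounded, which the figure credits for stability)
        raw_out, _ = self.cross_raw(action_latent, c_raw, c_raw)
        action_latent = action_latent + torch.tanh(self.gate) * raw_out
        # Full injection of ActionQuery (+ proprioception) features
        aq_out, _ = self.cross_aq(action_latent, c_aq, c_aq)
        action_latent = action_latent + aq_out
        # Self-attention over the action latent
        sa_out, _ = self.self_attn(action_latent, action_latent, action_latent)
        return action_latent + sa_out
```

With the gate initialized at zero, tanh(0) = 0, so Raw-feature injection starts switched off and the model learns how much to use per layer.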
Q1. What is the main innovation of VLA-Adapter compared to previous VLA models?
a) It uses a much larger vision-language model
b) It eliminates the need for robotic pre-training while using a smaller backbone
c) It focuses only on action generation without visual inputs

Q2. What key component does VLA-Adapter use to bridge perception and action spaces?
a) Bridge Attention with both Raw and ActionQuery features
b) Only last-layer features from the vision model
c) Random feature selection from different layers

Q3. What is the most significant practical advantage of VLA-Adapter?
a) It achieves 100% accuracy on all tasks
b) It can only work in simulation environments
c) It can be trained in just 8 hours on a single consumer GPU

Paper 2

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Published: 2025-09-11

Link: http://arxiv.org/pdf/2509.09674

1. 📘 Topic and Domain: The paper focuses on developing SimpleVLA-RL, an efficient reinforcement learning framework for Vision-Language-Action (VLA) models in robotic manipulation tasks.
2. 💡 Previous Research and New Ideas: Based on veRL (Volcano Engine Reinforcement Learning for LLMs), the paper proposes new VLA-specific trajectory sampling, parallel rendering, and optimized loss computation for robotic applications.
3. ❓ Problem: The paper addresses two key challenges in VLA models: the scarcity of large-scale human-operated robotic trajectories required for training, and limited generalization to tasks involving distribution shift.
4. 🛠️ Methods: The paper implements an end-to-end online RL framework with dynamic sampling, higher rollout temperature, and modified clipping range, using binary outcome rewards (1 for success, 0 for failure) for training.
5. 📊 Results and Evaluation: The framework achieved state-of-the-art performance on LIBERO and RoboTwin benchmarks, improving success rates by 10-15%, demonstrating strong generalization capabilities, and effectively transferring from simulation to real-world tasks.

[Figure: SimpleVLA-RL training pipeline]
• Input: visual + language query; interactive VLA rollout with token-based sampling (temperature 1.6, dynamic sampling) produces a group of trajectories τ₁…τ_G in parallel-rendered, multi-instance environments
• Outcome rewards: binary feedback, R = 1 on success and R = 0 on failure
• Exploration enhancements: dynamic sampling (exclude groups with uniform rewards), "clip higher" range [0.8, 1.28] instead of [0.8, 1.2], temperature 1.6 instead of 1.0, KL regularization removed
• GRPO training: group advantage Â = (R − mean(R)) / std(R), PPO-style clipping, policy optimization
• Key results: LIBERO success rate 91.0% → 99.1%; one demonstration suffices for 96.9% performance; generalization across spatial/object/goal shifts; real-world sim-to-real transfer
• "Pushcut" phenomenon: RL discovers novel pushing strategies absent from the demonstration data (e.g., in move-can-pot, pushing the can instead of grasp-move-place)
• Infrastructure: veRL extended with VLA-specific trajectory sampling, scalable parallel/distributed training, and an integrated training-inference-rendering pipeline
• Evaluation: LIBERO, RoboTwin 1.0 & 2.0, and real-world tasks; state-of-the-art across simulation and real-world benchmarks
• Key innovation: extending LLM RL techniques to the VLA domain with outcome-based rewards
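The GRPO-style update with binary outcome rewards can be illustrated with a small numeric sketch. The advantage formula and the clipping/dynamic-sampling settings come from the summary above; the function names are invented, and the actual policy network and rollout machinery are omitted.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: for a group of G rollouts of the same
    task, normalize the binary outcome rewards within the group:
    A_i = (R_i - mean(R)) / (std(R) + eps)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def keep_group(rewards):
    """Dynamic sampling as described in the figure: discard groups whose
    rewards are uniform (all 0 or all 1), since their normalized
    advantages vanish and carry no gradient signal."""
    m = float(np.mean(rewards))
    return 0.0 < m < 1.0

def clipped_objective(ratio, advantage, clip_low=0.8, clip_high=1.28):
    """PPO-style clipped surrogate, using the asymmetric 'clip higher'
    range [0.8, 1.28] the paper adopts to encourage exploration."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, clip_low, clip_high) * advantage)
```

For a group with rewards [1, 0, 1, 1], the advantages are positive for the successes and negative for the failure, and a group of all-zero (or all-one) rewards is skipped entirely.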
Q1. What novel phenomenon did researchers observe during RL training that was not present in supervised data?
a) The 'pushcut' phenomenon where models learned to push objects instead of grasp-move-place
b) The 'speedup' phenomenon where models executed tasks faster than demonstrations
c) The 'multipath' phenomenon where models found multiple solutions for the same task

Q2. In the data scarcity experiment with One-Trajectory SFT, what remarkable improvement did SimpleVLA-RL achieve on LIBERO-Long tasks?
a) Improved from 17.3% to 48.9% success rate
b) Improved from 17.3% to 91.7% success rate
c) Improved from 48.9% to 91.7% success rate

Q3. What unique approach does SimpleVLA-RL take regarding reward design compared to traditional robotic RL?
a) It uses dense rewards based on distance to goal
b) It combines multiple weighted reward components
c) It uses simple binary rewards (1 for success, 0 for failure) based only on task completion

Paper 3

Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis

Published: 2025-09-11

Link: http://arxiv.org/pdf/2509.09595

1. 📘 Topic and Domain: The paper focuses on AI-driven avatar animation synthesis, specifically generating high-fidelity portrait animations from audio, image, and text inputs.
2. 💡 Previous Research and New Ideas: Based on previous video diffusion models and audio-driven avatar generation, it introduces a novel approach using multimodal language models for semantic understanding of instructions rather than just low-level signal tracking.
3. ❓ Problem: The paper addresses the challenge of generating coherent, long-duration avatar animations that maintain semantic consistency with multimodal inputs while preserving high visual quality and lip synchronization.
4. 🛠️ Methods: Uses a two-stage cascaded framework with an MLLM Director for instruction understanding and planning, followed by parallel generation of video sub-clips with blueprint keyframes for long-duration synthesis.
5. 📊 Results and Evaluation: Achieves superior performance in generating 1080p 48fps videos with precise lip sync, emotional expressiveness, and identity consistency, outperforming baselines OmniHuman-1 and HeyGen across multiple evaluation metrics on a 375-sample benchmark.

[Figure: Kling-Avatar two-stage pipeline]
• Inputs: reference image + audio + text prompt
• MLLM Director (Qwen2.5-Omni/VL) grounds the multimodal instructions into a storyline
• Stage 1: a human-video DiT generates a blueprint video carrying the global semantics; anchor keyframes are extracted with a first-last frame strategy
• Stage 2: sub-clips filling in local details are generated in parallel, each conditioned on its blueprint keyframes, then assembled into the long-duration video
• Video DiT architecture: audio cross-attention over Whisper features, T5 text encoder, parallel processing
• Data preparation pipeline: lip-clarity filtering, temporal-continuity detection, audio-visual synchronization, aesthetic quality assessment, expert-model filtering
• Training and inference strategies: sliding-window audio injection, mouth-region loss weighting, random padding for robustness, negative-frame CFG, frozen text cross-attention
• Output: high-fidelity long-duration avatar video at 1080p@48fps with precise lip-sync and vivid emotions
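The first-last frame strategy behind Stage 2 can be sketched as a simple scheduling step: consecutive blueprint keyframes define the boundary frames of each sub-clip, so neighbouring clips share a frame and can be generated independently, then concatenated. This is a toy illustration; the field names are hypothetical and the actual video DiT call is not shown.

```python
def plan_subclips(keyframes, clip_len):
    """Turn an ordered list of blueprint keyframes into independent
    sub-clip generation jobs. Each job is conditioned on its (first,
    last) keyframe pair, so adjacent sub-clips share a boundary frame
    and join seamlessly when concatenated. Keys are illustrative."""
    jobs = []
    for i in range(len(keyframes) - 1):
        jobs.append({
            "first_frame": keyframes[i],
            "last_frame": keyframes[i + 1],
            "num_frames": clip_len,
        })
    # Every job is self-contained, so all sub-clips can be dispatched
    # to parallel workers, which is what makes long-duration synthesis
    # scale.
    return jobs
```

For N keyframes this yields N − 1 sub-clips, and the shared boundary frames are what keep identity and motion coherent across the stitched long video.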
Q1. What is the main innovation in Kling-Avatar's approach compared to previous avatar animation methods?
a) Using higher resolution video output
b) Using multimodal language models for semantic understanding of instructions
c) Using faster parallel processing techniques

Q2. How does Kling-Avatar handle long-duration video generation?
a) By generating the entire video in one pass
b) By using recursive generation frame by frame
c) By generating parallel sub-clips guided by blueprint keyframes

Q3. What unique approach does Kling-Avatar use for data preparation?
a) Simply collecting as much data as possible
b) Using expert models to filter data quality across multiple dimensions
c) Only using manually curated data