2025-09-12 Papers

Paper 1

VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

Published: 2025-09-11

Link: http://arxiv.org/pdf/2509.09372

1. 📘 Topic and Domain: Vision-Language-Action (VLA) modeling for robotic control, focusing on bridging visual-language perception with action generation.
2. 💡 Previous Research and New Ideas: Builds on prior VLA models that rely on large pre-trained vision-language backbones and extensive robotic-data pre-training; proposes a lightweight "VLA-Adapter" paradigm that reduces this reliance.
3. ❓ Problem: Current VLA models face bottlenecks including dependence on large-scale vision-language models, slow fine-tuning, high GPU memory usage, and low inference efficiency.
4. 🛠️ Methods: Introduces a Policy module with Bridge Attention that uses both Raw features and ActionQuery features from all layers of a small backbone model to effectively bridge perception and action spaces, without requiring robotic pre-training.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance (97.3% success rate) with only a 0.5B-parameter backbone (vs. 7B for previous methods), 3× faster inference, 1/38 the training cost, and successful real-world robot deployment, evaluated on the LIBERO and CALVIN benchmarks.

[Figure: VLA-Adapter workflow]
• Inputs: 3rd-view image (X_v), gripper image (X_g), instruction (L_t)
• Vision features: DINOv2 + SigLIP embeddings
• VLM backbone: Qwen2.5-0.5B (Prismatic-VLMs architecture, M layers); Raw latents (C_R) and ActionQuery latents (C_AQ, 64 learnable tokens) are extracted from every layer
• Condition analysis: middle-layer Raw features and deep-layer ActionQuery features perform best; using features from all layers is optimal
• Policy network: 97M parameters, trained from scratch, M layers mirroring the VLM, each with a Bridge Attention module and an FFN
• Bridge Attention: cross-attention over C_R (selective injection via a learnable ratio with tanh activation for stability), cross-attention over C_AQ plus proprioception (full injection), and self-attention over the action latent
• Training: end-to-end with an L1 loss, 8 hours on a single GPU
• Output: H-step chunks of 7D continuous actions at 219.2 Hz inference
• Results: 97.3% success rate on LIBERO, 4.42 average length on CALVIN, 1/14 the backbone size of prior SOTA, 1/38 the training cost, 3× faster inference
• Key innovation: effective VL→A bridging with a tiny-scale (0.5B) backbone, no robotic pre-training, and autonomous condition injection
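The Bridge Attention module described above can be sketched roughly as follows. This is a minimal illustration based only on the summary: the class and argument names are invented, and details such as layer norms, projections, and the per-layer feature aggregation are omitted.

```python
import torch
import torch.nn as nn

class BridgeAttention(nn.Module):
    """Hedged sketch of the Bridge Attention idea: the action latent
    cross-attends to Raw (C_R) and ActionQuery (C_AQ) features from the
    VLM, with a learnable tanh-gated ratio controlling how much of C_R
    is injected. Names and structure here are illustrative, not the
    paper's actual implementation."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.cross_raw = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_aq = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # learnable injection ratio

    def forward(self, action_latent, c_raw, c_aq):
        # Selective injection of Raw features, scaled by a tanh gate
        # (tanh keeps the ratio bounded, which the figure credits for stability)
        raw_out, _ = self.cross_raw(action_latent, c_raw, c_raw)
        action_latent = action_latent + torch.tanh(self.gate) * raw_out
        # Full injection of ActionQuery (+ proprioception) features
        aq_out, _ = self.cross_aq(action_latent, c_aq, c_aq)
        action_latent = action_latent + aq_out
        # Self-attention over the action latent
        sa_out, _ = self.self_attn(action_latent, action_latent, action_latent)
        return action_latent + sa_out
```

With the gate initialized at zero, tanh(0) = 0, so Raw-feature injection starts switched off and the model learns how much to use per layer.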
Q1. What is the main innovation of VLA-Adapter compared to previous VLA models?
a) It uses a much larger vision-language model
b) It eliminates the need for robotic pre-training while using a smaller backbone
c) It focuses only on action generation without visual inputs

Q2. What key component does VLA-Adapter use to bridge perception and action spaces?
a) Bridge Attention with both Raw and ActionQuery features
b) Only last-layer features from the vision model
c) Random feature selection from different layers

Q3. What is the most significant practical advantage of VLA-Adapter?
a) It achieves 100% accuracy on all tasks
b) It can only work in simulation environments
c) It can be trained in just 8 hours on a single consumer GPU

Paper 2

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Published: 2025-09-11

Link: http://arxiv.org/pdf/2509.09674

1. 📘 Topic and Domain: The paper focuses on developing SimpleVLA-RL, an efficient reinforcement learning framework for Vision-Language-Action (VLA) models in robotic manipulation tasks.
2. 💡 Previous Research and New Ideas: Based on veRL (Volcano Engine Reinforcement Learning for LLMs), the paper proposes new VLA-specific trajectory sampling, parallel rendering, and optimized loss computation for robotic applications.
3. ❓ Problem: The paper addresses two key challenges in VLA models: the scarcity of large-scale human-operated robotic trajectories required for training, and limited generalization to tasks involving distribution shift.
4. 🛠️ Methods: The paper implements an end-to-end online RL framework with dynamic sampling, higher rollout temperature, and modified clipping range, using binary outcome rewards (1 for success, 0 for failure) for training.
5. 📊 Results and Evaluation: The framework achieved state-of-the-art performance on LIBERO and RoboTwin benchmarks, improving success rates by 10-15%, demonstrating strong generalization capabilities, and effectively transferring from simulation to real-world tasks.

[Figure: SimpleVLA-RL training pipeline]
• Input: visual + language query; interactive VLA rollout with token-based sampling (temperature 1.6, dynamic sampling) produces a group of trajectories τ₁…τ_G in parallel-rendered, multi-instance environments
• Outcome rewards: binary feedback, R = 1 on success and R = 0 on failure
• Exploration enhancements: dynamic sampling (exclude groups with uniform rewards), "clip higher" range [0.8, 1.28] instead of [0.8, 1.2], temperature 1.6 instead of 1.0, KL regularization removed
• GRPO training: group advantage Â = (R − mean(R)) / std(R), PPO-style clipping, policy optimization
• Key results: LIBERO success rate 91.0% → 99.1%; one demonstration suffices for 96.9% performance; generalization across spatial/object/goal shifts; real-world sim-to-real transfer
• "Pushcut" phenomenon: RL discovers novel pushing strategies absent from the demonstration data (e.g., in move-can-pot, pushing the can instead of grasp-move-place)
• Infrastructure: veRL extended with VLA-specific trajectory sampling, scalable parallel/distributed training, and an integrated training-inference-rendering pipeline
• Evaluation: LIBERO, RoboTwin 1.0 & 2.0, and real-world tasks; state-of-the-art across simulation and real-world benchmarks
• Key innovation: extending LLM RL techniques to the VLA domain with outcome-based rewards
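The GRPO-style update with binary outcome rewards can be illustrated with a small numeric sketch. The advantage formula and the clipping/dynamic-sampling settings come from the summary above; the function names are invented, and the actual policy network and rollout machinery are omitted.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: for a group of G rollouts of the same
    task, normalize the binary outcome rewards within the group:
    A_i = (R_i - mean(R)) / (std(R) + eps)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def keep_group(rewards):
    """Dynamic sampling as described in the figure: discard groups whose
    rewards are uniform (all 0 or all 1), since their normalized
    advantages vanish and carry no gradient signal."""
    m = float(np.mean(rewards))
    return 0.0 < m < 1.0

def clipped_objective(ratio, advantage, clip_low=0.8, clip_high=1.28):
    """PPO-style clipped surrogate, using the asymmetric 'clip higher'
    range [0.8, 1.28] the paper adopts to encourage exploration."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, clip_low, clip_high) * advantage)
```

For a group with rewards [1, 0, 1, 1], the advantages are positive for the successes and negative for the failure, and a group of all-zero (or all-one) rewards is skipped entirely.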
Q1. What novel phenomenon did researchers observe during RL training that was not present in supervised data?
a) The 'pushcut' phenomenon where models learned to push objects instead of grasp-move-place
b) The 'speedup' phenomenon where models executed tasks faster than demonstrations
c) The 'multipath' phenomenon where models found multiple solutions for the same task

Q2. In the data scarcity experiment with One-Trajectory SFT, what remarkable improvement did SimpleVLA-RL achieve on LIBERO-Long tasks?
a) Improved from 17.3% to 48.9% success rate
b) Improved from 17.3% to 91.7% success rate
c) Improved from 48.9% to 91.7% success rate

Q3. What unique approach does SimpleVLA-RL take regarding reward design compared to traditional robotic RL?
a) It uses dense rewards based on distance to goal
b) It combines multiple weighted reward components
c) It uses simple binary rewards (1 for success, 0 for failure) based only on task completion

Paper 3

Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis

Published: 2025-09-11

Link: http://arxiv.org/pdf/2509.09595

1. 📘 Topic and Domain: The paper focuses on AI-driven avatar animation synthesis, specifically generating high-fidelity portrait animations from audio, image, and text inputs.
2. 💡 Previous Research and New Ideas: Based on previous video diffusion models and audio-driven avatar generation, it introduces a novel approach using multimodal language models for semantic understanding of instructions rather than just low-level signal tracking.
3. ❓ Problem: The paper addresses the challenge of generating coherent, long-duration avatar animations that maintain semantic consistency with multimodal inputs while preserving high visual quality and lip synchronization.
4. 🛠️ Methods: Uses a two-stage cascaded framework with an MLLM Director for instruction understanding and planning, followed by parallel generation of video sub-clips with blueprint keyframes for long-duration synthesis.
5. 📊 Results and Evaluation: Achieves superior performance in generating 1080p 48fps videos with precise lip sync, emotional expressiveness, and identity consistency, outperforming baselines OmniHuman-1 and HeyGen across multiple evaluation metrics on a 375-sample benchmark.

[Figure: Kling-Avatar two-stage pipeline]
• Inputs: reference image + audio + text prompt
• MLLM Director (Qwen2.5-Omni/VL) grounds the multimodal instructions into a storyline
• Stage 1: a human-video DiT generates a blueprint video carrying the global semantics; anchor keyframes are extracted with a first-last frame strategy
• Stage 2: sub-clips filling in local details are generated in parallel, each conditioned on its blueprint keyframes, then assembled into the long-duration video
• Video DiT architecture: audio cross-attention over Whisper features, T5 text encoder, parallel processing
• Data preparation pipeline: lip-clarity filtering, temporal-continuity detection, audio-visual synchronization, aesthetic quality assessment, expert-model filtering
• Training and inference strategies: sliding-window audio injection, mouth-region loss weighting, random padding for robustness, negative-frame CFG, frozen text cross-attention
• Output: high-fidelity long-duration avatar video at 1080p@48fps with precise lip-sync and vivid emotions
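The first-last frame strategy behind Stage 2 can be sketched as a simple scheduling step: consecutive blueprint keyframes define the boundary frames of each sub-clip, so neighbouring clips share a frame and can be generated independently, then concatenated. This is a toy illustration; the field names are hypothetical and the actual video DiT call is not shown.

```python
def plan_subclips(keyframes, clip_len):
    """Turn an ordered list of blueprint keyframes into independent
    sub-clip generation jobs. Each job is conditioned on its (first,
    last) keyframe pair, so adjacent sub-clips share a boundary frame
    and join seamlessly when concatenated. Keys are illustrative."""
    jobs = []
    for i in range(len(keyframes) - 1):
        jobs.append({
            "first_frame": keyframes[i],
            "last_frame": keyframes[i + 1],
            "num_frames": clip_len,
        })
    # Every job is self-contained, so all sub-clips can be dispatched
    # to parallel workers, which is what makes long-duration synthesis
    # scale.
    return jobs
```

For N keyframes this yields N − 1 sub-clips, and the shared boundary frames are what keep identity and motion coherent across the stitched long video.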
Q1. What is the main innovation in Kling-Avatar's approach compared to previous avatar animation methods?
a) Using higher resolution video output
b) Using multimodal language models for semantic understanding of instructions
c) Using faster parallel processing techniques

Q2. How does Kling-Avatar handle long-duration video generation?
a) By generating the entire video in one pass
b) By using recursive generation frame by frame
c) By generating parallel sub-clips guided by blueprint keyframes

Q3. What unique approach does Kling-Avatar use for data preparation?
a) Simply collecting as much data as possible
b) Using expert models to filter data quality across multiple dimensions
c) Only using manually curated data