2026-03-26 Papers


Paper 1

UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience

Published: 2026-03-25

Link: http://arxiv.org/pdf/2603.24533

1. 📘 Topic and Domain: The paper addresses autonomous mobile GUI agents using Multimodal Large Language Models (MLLMs) for tasks like clicking, swiping, and interacting with mobile phone interfaces.
2. 💡 Previous Research and New Ideas: Building on prior work in GUI agents, RL methods (GRPO/PPO), and self-evolving training pipelines, it proposes two novel techniques: Rejection Fine-Tuning (RFT) for autonomous data-model co-evolution and Group Relative Self-Distillation (GRSD) for step-level supervision via fork-point detection.
3. ❓ Problem: The paper solves inefficient learning from failed trajectories and ambiguous credit assignment under sparse rewards in long-horizon mobile GUI tasks, where traditional RL methods fail to identify which specific step caused task failure.
4. 🛠️ Methods: A two-stage pipeline consisting of (1) Rejection Fine-Tuning with iterative trajectory generation filtered by rule-based verifiers, and (2) GRSD that identifies fork points using SSIM-based screenshot matching to extract dense step-level supervision from successful trajectories to correct failed ones.
5. 📊 Results and Evaluation: The 4B model achieves 81.0% Pass@1 success rate on AndroidWorld (116 tasks), surpassing all baselines including models up to 235B parameters and exceeding reported human-level performance of 80.0%; ablation studies confirm GRSD's effectiveness over GRPO/PPO.
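
The RFT loop in item 4 above (generate k trajectories per query, keep only verifier-approved successes, fine-tune on them, repeat) can be sketched as a single round; `policy_generate` and `verifier` are hypothetical stand-in interfaces, not names from the paper:

```python
def rft_round(policy_generate, verifier, tasks, k):
    """One round of rejection fine-tuning (sketch, hypothetical interfaces):
    sample k trajectories per task, keep only verifier-approved successes,
    and return them as the SFT dataset for the next policy update."""
    dataset = []
    for task in tasks:
        rollouts = [policy_generate(task) for _ in range(k)]
        # Rejection sampling: a rule-based verifier filters out failures.
        dataset += [traj for traj in rollouts if verifier(task, traj)]
    return dataset
```

Repeating this round with the updated policy gives the multi-round self-evolution π₁ → π₂ → … → πₘ described in the summary.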


Figure: UI-Voyager's two-stage self-evolving framework. Stage 1, Rejection Fine-Tuning (RFT): for each query (a mobile OS task such as "Turn Bluetooth off"), the policy generates k trajectories, a rule-based verifier filters them (rejection sampling), and the successes are used for supervised fine-tuning, evolving the policy over multiple rounds (π₁ → π₂ → … → πₘ) with a seed task generator supplying queries. Stage 2, Group Relative Self-Distillation (GRSD): the RFT model rolls out G trajectories per task; successful (τ⁺) and failed (τ⁻) trajectories are compared via cross-trajectory state matching and transition alignment to find fork points, i.e., steps with the same observation (SAME) but different actions (DIVERGE), and the model is fine-tuned on mixed samples pairing the failed trajectory's context with the successful (teacher) trajectory's correct action. State matching uses SAME(o_a, o_b) = 1[SSIM(φ(o_a), φ(o_b)) ≥ θ], where φ is crop → resize → grayscale preprocessing and θ is a similarity threshold (default 0.80). Whereas GRPO/PPO see only a trajectory-level 0/1 reward, GRSD turns sparse rewards into dense step-level supervision at fork points. Final result: UI-Voyager (4B) reaches 81.0% Pass@1, outperforming all baselines and exceeding human-level performance (80.0%).
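
The SSIM matching rule and the fork-point criterion can be sketched in plain NumPy. The single-window SSIM and the index-sampled resize below are simplifications (the exact crop region and SSIM windowing are not specified here), and aligning the two trajectories step-by-step with `zip` approximates the paper's cross-trajectory state matching:

```python
import numpy as np

def phi(img, size=64):
    """Preprocess phi: resize to a small square and scale to [0, 1].
    Approximates the crop -> resize -> grayscale pipeline by index-sampled
    resize of an already-grayscale screenshot (an assumption)."""
    h, w = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[np.ix_(rows, cols)].astype(np.float64) / 255.0

def ssim(x, y, c1=0.01**2, c2=0.03**2):
    """Single-window (global) SSIM between two equal-size images in [0, 1]."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx**2 + my**2 + c1) * (x.var() + y.var() + c2)
    return num / den

def same_state(obs_a, obs_b, theta=0.80):
    """SAME(o_a, o_b) = 1[SSIM(phi(o_a), phi(o_b)) >= theta], theta = 0.80."""
    return int(ssim(phi(obs_a), phi(obs_b)) >= theta)

def find_fork_point(traj_fail, traj_succ, matcher=same_state):
    """First step where the two trajectories see a matching screen (SAME)
    but choose different actions (DIVERGE); None if no fork is found.
    Trajectories are lists of (observation, action) pairs."""
    for i, ((o_f, a_f), (o_s, a_s)) in enumerate(zip(traj_fail, traj_succ)):
        if matcher(o_f, o_s) and a_f != a_s:
            return i
    return None
```

The fork index is where the failed trajectory's context is paired with the successful trajectory's action to build a dense step-level training sample.
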
Q1. Why do traditional reinforcement learning methods like GRPO and PPO struggle with long-horizon GUI tasks, according to the paper?
A. They require too much computational power for mobile devices
B. They only receive trajectory-level rewards (success/failure), making it impossible to identify which specific step caused task failure
C. They cannot process visual information from mobile screens

Q2. In UI-Voyager, how does the Group Relative Self-Distillation (GRSD) method identify "fork points" between successful and failed trajectories?
A. Using pretrained vision encoders to compute cosine similarity between embeddings
B. Using the Structural Similarity Index (SSIM) on cropped, resized, grayscale screenshots to detect matching screen states
C. Using OCR to extract text from screenshots and compare text content

Q3. What makes UI-Voyager's performance on AndroidWorld particularly impressive?
A. It achieves an 81.0% success rate with 235B parameters, the largest model in the benchmark
B. It achieves an 81.0% success rate with only 4B parameters, surpassing all larger models and exceeding human-level performance
C. It achieves a 73.2% success rate with 4B parameters, matching human-level performance exactly

Paper 2

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Published: 2026-03-24

Link: http://arxiv.org/pdf/2603.23483

1. 📘 Topic and Domain: The paper addresses efficient inference for agentic multimodal large language models (MLLMs) that use iterative visual tool invocation, focusing on the domain of accelerating agentic AI systems.
2. 💡 Previous Research and New Ideas: Based on token-level speculative decoding and multimodal efficiency techniques, the paper proposes lifting speculation to the "agentic level" using a lightweight, tool-free small model as a speculative planner to bypass entire tool-use loops for queries not requiring them.
3. ❓ Problem: The paper solves the "stateful bottleneck" where agentic MLLMs suffer from sequential perception-reasoning-tool loops that cause latency explosion and concurrency collapse, making real-world deployment prohibitive.
4. 🛠️ Methods: SpecEyes uses a four-phase pipeline: heuristic tool-use judgment, speculative prediction with a stateless small model, cognitive gating via an answer separability score to decide acceptance, and agentic fallback for rejected queries, combined with heterogeneous parallel serving.
5. 📊 Results and Evaluation: Evaluated on V* Bench, HR-Bench, and POPE, SpecEyes achieved 1.1–3.35× speedup while preserving or improving accuracy (up to +6.7%), with the min aggregation strategy for answer separability delivering the best accuracy-speed trade-off.
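
The cognitive gate and the throughput model can be sketched numerically. Whether μ_K and σ_K include the top logit is not stated here, so this sketch computes them over the K runner-up logits (an assumption); `min` aggregation accepts the draft only if every decoding step clears the threshold:

```python
import math

def token_separability(logits, k=5, eps=1e-6):
    """S_sep for one decoding step: z-score of the top logit against the
    next k runner-up logits (a margin-style confidence signal)."""
    ranked = sorted(logits, reverse=True)
    top, rest = ranked[0], ranked[1:k + 1]
    mu = sum(rest) / len(rest)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in rest) / len(rest))
    return (top - mu) / (sigma + eps)

def accept_speculation(step_logits, tau):
    """Cognitive gating with `min` aggregation: S_min = min over steps of
    S_sep, accepted only if S_min >= tau."""
    return min(token_separability(l) for l in step_logits) >= tau

def throughput_speedup(beta, alpha):
    """Expected throughput gain 1/(1 - beta * alpha):
    beta = tool-free ratio, alpha = speculation acceptance rate."""
    return 1.0 / (1.0 - beta * alpha)
```

For example, a tool-free ratio of 0.5 with an 0.8 acceptance rate already yields a 1/(1 − 0.4) ≈ 1.67× throughput gain, consistent with the reported 1.1-3.35× range depending on workload.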


Figure: SpecEyes agentic-level speculative acceleration framework. Phase I, heuristic tool-use judgment: the large agentic MLLM M_L makes a binary classification g(q, I); tool-required queries (g = 1) skip straight to Phase IV. Phase II, speculative prediction: for tool-free queries (g = 0), a small non-agentic (stateless) MLLM M_S generates a draft answer ŷ_S with per-token logits {ℓ(n)}. Phase III, cognitive gating: a token-level answer separability score S_sep = (ℓ[1] − μ_K) / (σ_K + ε) is computed at each step and aggregated as S_min = min_n S_sep(n); the draft is accepted if S_min ≥ τ, otherwise rejected. Phase IV, agentic fallback: rejected and tool-required queries run the stateful tool loop s_{d+1} = f(s_d, t_d(s_d)), the sequential perception-reasoning bottleneck the method avoids. With tool-free ratio β ∈ [0, 1] and acceptance rate α ∈ [0, 1], the expected throughput speedup is 1/(1 − βα); heterogeneous parallel serving delivers the reported 1.1-3.35× speedup with up to +6.7% accuracy gains.
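
The four-phase routing can be written as one dispatch function; all callables passed in are hypothetical stand-ins for the models and the gate, not a published API:

```python
def speceyes_route(query, image, needs_tools, speculate, gate, agentic_run):
    """Sketch of the four-phase pipeline; arguments after (query, image) are
    caller-supplied callables (hypothetical interfaces):
      needs_tools(q, i) -> bool         Phase I: tool-use judgment g(q, I)
      speculate(q, i) -> (answer, aux)  Phase II: stateless small-model draft
      gate(aux) -> bool                 Phase III: cognitive gating
      agentic_run(q, i) -> answer       Phase IV: stateful tool loop"""
    if needs_tools(query, image):           # g = 1: skip straight to Phase IV
        return agentic_run(query, image)
    answer, aux = speculate(query, image)   # Phase II: draft + gating signal
    if gate(aux):                           # Phase III: accept the draft
        return answer
    return agentic_run(query, image)        # Phase IV: agentic fallback
```

Because the small model is stateless, accepted queries bypass the entire perception-reasoning-tool loop, which is where the concurrency gains come from.
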
Q1. What is the fundamental performance barrier in agentic multimodal LLMs that SpecEyes identifies and aims to overcome?
A. Insufficient training data for tool-using capabilities
B. The stateful bottleneck caused by strict causal dependencies in tool-use chains
C. Limited GPU memory for processing high-resolution images

Q2. What is the key innovation of SpecEyes compared to existing token-level speculative decoding methods?
A. Using a larger draft model to propose tokens for verification
B. Lifting speculation from the token level to the entire agentic pipeline to bypass tool-use loops
C. Compressing visual tokens before processing by the language model

Q3. According to the paper, which aggregation strategy for the answer separability score provides the best accuracy-speed trade-off?
A. Mean aggregation (averaging all token-level separability scores)
B. Bottom-r aggregation (focusing on the lowest-confidence tokens)
C. Min aggregation (using the minimum token-level separability score)

Paper 3

SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM

Published: 2026-03-24

Link: http://arxiv.org/pdf/2603.23386

1. 📘 Topic and Domain: The paper addresses the generation of simulation-ready articulated 3D assets from monolithic static meshes using multimodal large language models (MLLMs), falling within the domain of 3D computer vision, embodied AI, and physics-based simulation.
2. 💡 Previous Research and New Ideas: The paper builds upon prior work in articulated object reconstruction (e.g., ArtGS, PartField), MLLM-based kinematic reasoning (e.g., Articulate-Anything, PhysX-Anything), and 3D tokenization. Its new ideas include a unified MLLM framework that jointly performs part decomposition and kinematic prediction, and a Sparse 3D VQ-VAE that reduces token counts by 70% to overcome memory limitations of dense voxel representations.
3. ❓ Problem: The paper aims to solve the lack of "sim-ready" articulated assets, as most existing 3D meshes are static and non-decomposed, and existing multi-stage pipelines for articulated object creation suffer from accumulated errors and incompatibility between part geometry and joint predictions.
4. 🛠️ Methods: The authors propose SIMART, which uses a Sparse 3D VQ-VAE for efficient geometric tokenization and a Qwen3-VL-based MLLM backbone to jointly perform part-level mesh decomposition and kinematic parameter (joint type, axis, limits) prediction, outputting URDF specifications and segmented meshes.
5. 📊 Results and Evaluation: SIMART achieves state-of-the-art performance on PartNet-Mobility and a newly curated AI-generated benchmark (SIMART-Bench), outperforming baselines like Urdformer, Articulate-Anything, and PhysX-Anything in joint classification accuracy (Type↑), axis error (Axis↓), origin error (Origin↓), part IoU (↑), and Chamfer distance (CD↓), while enabling physics-based robotic simulation and VR/AR applications.
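
The sparse, coordinate-aware tokenization in item 2 (a reserved zero token for empty cells, one [xyz][K] record per occupied voxel) can be sketched over a small grid of codebook indices; the 8³ grid size follows the paper's quantized resolution, while the function names are illustrative:

```python
import numpy as np

ZERO_TOKEN = 0  # special codebook entry reserved for empty space

def sparse_tokenize(voxel_codes):
    """Turn a dense grid of codebook indices into a coordinate-aware sparse
    stream: one (x, y, z, K) record per occupied voxel, empty cells skipped."""
    coords = np.argwhere(voxel_codes != ZERO_TOKEN)
    codes = voxel_codes[tuple(coords.T)]
    return [(int(x), int(y), int(z), int(k))
            for (x, y, z), k in zip(coords, codes)]

def sparse_detokenize(tokens, size=8):
    """Invert the sparse stream back to a dense grid (empty = ZERO_TOKEN)."""
    grid = np.full((size, size, size), ZERO_TOKEN, dtype=np.int64)
    for x, y, z, k in tokens:
        grid[x, y, z] = k
    return grid

def token_reduction(voxel_codes):
    """Fraction of dense-grid tokens saved by the sparse encoding."""
    occupied = int((voxel_codes != ZERO_TOKEN).sum())
    return 1.0 - occupied / voxel_codes.size
```

Since meshes occupy only a thin surface shell of the voxel grid, skipping empty cells is what yields the reported ~70% token reduction versus dense O(N³) tokenization.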


Figure: SIMART pipeline, from monolithic mesh to sim-ready articulated asset. Inputs: an RGB image I_vis, a raw mesh G_geo, and an instruction T_txt. A Sparse 3D VQ-VAE (3D-UNet encoder, 4096-entry codebook, 64³ → 8³ voxel quantization) tokenizes the geometry: dense voxels waste O(N³) tokens on empty space, so only occupied surface voxels are encoded in a coordinate-aware [xyz][K] format, with a special zero token (codebook index 0) marking empty space, for a 70% token reduction that makes complex meshes tractable. A unified MLLM (Qwen3-VL backbone) fuses the visual, geometric, and text tokens through its transformer layers and jointly outputs part segmentation and kinematic parameters as URDF metadata (e.g., joint_type: revolute, axis: [1, 0, 0], limits: [−54°, 45°], density: 1.2 g/cm³) plus part-segmented meshes (e.g., Part 0 fixed, Part 1 revolute, Part 2 prismatic). The resulting asset A = (M_seg, P_sim) is ready for physics simulators such as NVIDIA Isaac Sim and for interactive VR/AR applications.
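
A minimal sketch of serializing one predicted joint into a URDF <joint> element, using the figure's example metadata; the helper and its signature are illustrative, not SIMART's actual output code:

```python
import math
from xml.sax.saxutils import escape

def joint_urdf(name, joint_type, parent, child, axis, limit_deg):
    """Emit one URDF <joint> element from predicted kinematic parameters.
    `axis` is an (x, y, z) direction; `limit_deg` is (lower, upper) in
    degrees, converted to radians as URDF expects for revolute joints."""
    lo, hi = (math.radians(d) for d in limit_deg)
    return (
        f'<joint name="{escape(name)}" type="{joint_type}">\n'
        f'  <parent link="{escape(parent)}"/>\n'
        f'  <child link="{escape(child)}"/>\n'
        f'  <axis xyz="{axis[0]} {axis[1]} {axis[2]}"/>\n'
        f'  <limit lower="{lo:.4f}" upper="{hi:.4f}"/>\n'
        f'</joint>'
    )
```

One such element per predicted part, together with the segmented meshes as links, is what makes the asset loadable by URDF-consuming simulators.
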
Q1. What is the primary problem that SIMART aims to solve?
A. Converting static monolithic meshes into simulation-ready articulated assets
B. Generating high-quality 2D images from 3D articulated objects
C. Improving rendering speed for VR applications

Q2. What is the key innovation of SIMART's Sparse 3D VQ-VAE?
A. It uses dense voxel representations for maximum geometric detail
B. It reduces token counts by 70% compared to dense voxel tokens
C. It processes only text inputs without visual or geometric data

Q3. Which MLLM backbone does SIMART use as its core reasoning engine?
A. GPT-4 Vision
B. Qwen3-VL-8B
C. LLaVA-2