2026-03-26 Papers


Paper 1

UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience

Published: 2026-03-25

Link: http://arxiv.org/pdf/2603.24533

1. 📘 Topic and Domain: The paper addresses autonomous mobile GUI agents using Multimodal Large Language Models (MLLMs) for tasks like clicking, swiping, and interacting with mobile phone interfaces.
2. 💡 Previous Research and New Ideas: Building on prior work in GUI agents, RL methods (GRPO/PPO), and self-evolving training pipelines, it proposes two novel techniques: Rejection Fine-Tuning (RFT) for autonomous data-model co-evolution and Group Relative Self-Distillation (GRSD) for step-level supervision via fork-point detection.
3. ❓ Problem: The paper solves inefficient learning from failed trajectories and ambiguous credit assignment under sparse rewards in long-horizon mobile GUI tasks, where traditional RL methods fail to identify which specific step caused task failure.
4. 🛠️ Methods: A two-stage pipeline consisting of (1) Rejection Fine-Tuning with iterative trajectory generation filtered by rule-based verifiers, and (2) GRSD that identifies fork points using SSIM-based screenshot matching to extract dense step-level supervision from successful trajectories to correct failed ones.
5. 📊 Results and Evaluation: The 4B model achieves 81.0% Pass@1 success rate on AndroidWorld (116 tasks), surpassing all baselines including models up to 235B parameters and exceeding reported human-level performance of 80.0%; ablation studies confirm GRSD's effectiveness over GRPO/PPO.
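
The RFT loop in item 4 above (generate k trajectories per query, keep only verifier-approved successes, fine-tune on them, repeat) can be sketched as a single round; `policy_generate` and `verifier` are hypothetical stand-in interfaces, not names from the paper:

```python
def rft_round(policy_generate, verifier, tasks, k):
    """One round of rejection fine-tuning (sketch, hypothetical interfaces):
    sample k trajectories per task, keep only verifier-approved successes,
    and return them as the SFT dataset for the next policy update."""
    dataset = []
    for task in tasks:
        rollouts = [policy_generate(task) for _ in range(k)]
        # Rejection sampling: a rule-based verifier filters out failures.
        dataset += [traj for traj in rollouts if verifier(task, traj)]
    return dataset
```

Repeating this round with the updated policy gives the multi-round self-evolution π₁ → π₂ → … → πₘ described in the summary.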


Figure: UI-Voyager's two-stage self-evolving framework. Stage 1, Rejection Fine-Tuning (RFT): for each query (a mobile OS task such as "Turn Bluetooth off"), the policy generates k trajectories, a rule-based verifier filters them (rejection sampling), and the successes are used for supervised fine-tuning, evolving the policy over multiple rounds (π₁ → π₂ → … → πₘ) with a seed task generator supplying queries. Stage 2, Group Relative Self-Distillation (GRSD): the RFT model rolls out G trajectories per task; successful (τ⁺) and failed (τ⁻) trajectories are compared via cross-trajectory state matching and transition alignment to find fork points, i.e., steps with the same observation (SAME) but different actions (DIVERGE), and the model is fine-tuned on mixed samples pairing the failed trajectory's context with the successful (teacher) trajectory's correct action. State matching uses SAME(o_a, o_b) = 1[SSIM(φ(o_a), φ(o_b)) ≥ θ], where φ is crop → resize → grayscale preprocessing and θ is a similarity threshold (default 0.80). Whereas GRPO/PPO see only a trajectory-level 0/1 reward, GRSD turns sparse rewards into dense step-level supervision at fork points. Final result: UI-Voyager (4B) reaches 81.0% Pass@1, outperforming all baselines and exceeding human-level performance (80.0%).
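
The SSIM matching rule and the fork-point criterion can be sketched in plain NumPy. The single-window SSIM and the index-sampled resize below are simplifications (the exact crop region and SSIM windowing are not specified here), and aligning the two trajectories step-by-step with `zip` approximates the paper's cross-trajectory state matching:

```python
import numpy as np

def phi(img, size=64):
    """Preprocess phi: resize to a small square and scale to [0, 1].
    Approximates the crop -> resize -> grayscale pipeline by index-sampled
    resize of an already-grayscale screenshot (an assumption)."""
    h, w = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[np.ix_(rows, cols)].astype(np.float64) / 255.0

def ssim(x, y, c1=0.01**2, c2=0.03**2):
    """Single-window (global) SSIM between two equal-size images in [0, 1]."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx**2 + my**2 + c1) * (x.var() + y.var() + c2)
    return num / den

def same_state(obs_a, obs_b, theta=0.80):
    """SAME(o_a, o_b) = 1[SSIM(phi(o_a), phi(o_b)) >= theta], theta = 0.80."""
    return int(ssim(phi(obs_a), phi(obs_b)) >= theta)

def find_fork_point(traj_fail, traj_succ, matcher=same_state):
    """First step where the two trajectories see a matching screen (SAME)
    but choose different actions (DIVERGE); None if no fork is found.
    Trajectories are lists of (observation, action) pairs."""
    for i, ((o_f, a_f), (o_s, a_s)) in enumerate(zip(traj_fail, traj_succ)):
        if matcher(o_f, o_s) and a_f != a_s:
            return i
    return None
```

The fork index is where the failed trajectory's context is paired with the successful trajectory's action to build a dense step-level training sample.
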
Q1. Why do traditional reinforcement learning methods like GRPO and PPO struggle with long-horizon GUI tasks, according to the paper?
A. They require too much computational power for mobile devices
B. They only receive trajectory-level rewards (success/failure), making it impossible to identify which specific step caused task failure
C. They cannot process visual information from mobile screens

Q2. In UI-Voyager, how does the Group Relative Self-Distillation (GRSD) method identify "fork points" between successful and failed trajectories?
A. Using pretrained vision encoders to compute cosine similarity between embeddings
B. Using the Structural Similarity Index (SSIM) on cropped, resized, grayscale screenshots to detect matching screen states
C. Using OCR to extract text from screenshots and compare text content

Q3. What makes UI-Voyager's performance on AndroidWorld particularly impressive?
A. It achieves an 81.0% success rate with 235B parameters, the largest model in the benchmark
B. It achieves an 81.0% success rate with only 4B parameters, surpassing all larger models and exceeding human-level performance
C. It achieves a 73.2% success rate with 4B parameters, matching human-level performance exactly

Paper 2

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Published: 2026-03-24

Link: http://arxiv.org/pdf/2603.23483

1. 📘 Topic and Domain: The paper addresses efficient inference for agentic multimodal large language models (MLLMs) that use iterative visual tool invocation, focusing on the domain of accelerating agentic AI systems.
2. 💡 Previous Research and New Ideas: Based on token-level speculative decoding and multimodal efficiency techniques, the paper proposes lifting speculation to the "agentic level" using a lightweight, tool-free small model as a speculative planner to bypass entire tool-use loops for queries not requiring them.
3. ❓ Problem: The paper solves the "stateful bottleneck" where agentic MLLMs suffer from sequential perception-reasoning-tool loops that cause latency explosion and concurrency collapse, making real-world deployment prohibitive.
4. 🛠️ Methods: SpecEyes uses a four-phase pipeline: heuristic tool-use judgment, speculative prediction with a stateless small model, cognitive gating via an answer separability score to decide acceptance, and agentic fallback for rejected queries, combined with heterogeneous parallel serving.
5. 📊 Results and Evaluation: Evaluated on V* Bench, HR-Bench, and POPE, SpecEyes achieved 1.1–3.35× speedup while preserving or improving accuracy (up to +6.7%), with the min aggregation strategy for answer separability delivering the best accuracy-speed trade-off.
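
The cognitive gate and the throughput model can be sketched numerically. Whether μ_K and σ_K include the top logit is not stated here, so this sketch computes them over the K runner-up logits (an assumption); `min` aggregation accepts the draft only if every decoding step clears the threshold:

```python
import math

def token_separability(logits, k=5, eps=1e-6):
    """S_sep for one decoding step: z-score of the top logit against the
    next k runner-up logits (a margin-style confidence signal)."""
    ranked = sorted(logits, reverse=True)
    top, rest = ranked[0], ranked[1:k + 1]
    mu = sum(rest) / len(rest)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in rest) / len(rest))
    return (top - mu) / (sigma + eps)

def accept_speculation(step_logits, tau):
    """Cognitive gating with `min` aggregation: S_min = min over steps of
    S_sep, accepted only if S_min >= tau."""
    return min(token_separability(l) for l in step_logits) >= tau

def throughput_speedup(beta, alpha):
    """Expected throughput gain 1/(1 - beta * alpha):
    beta = tool-free ratio, alpha = speculation acceptance rate."""
    return 1.0 / (1.0 - beta * alpha)
```

For example, a tool-free ratio of 0.5 with an 0.8 acceptance rate already yields a 1/(1 − 0.4) ≈ 1.67× throughput gain, consistent with the reported 1.1-3.35× range depending on workload.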


Figure: SpecEyes agentic-level speculative acceleration framework. Phase I, heuristic tool-use judgment: the large agentic MLLM M_L makes a binary classification g(q, I); tool-required queries (g = 1) skip straight to Phase IV. Phase II, speculative prediction: for tool-free queries (g = 0), a small non-agentic (stateless) MLLM M_S generates a draft answer ŷ_S with per-token logits {ℓ(n)}. Phase III, cognitive gating: a token-level answer separability score S_sep = (ℓ[1] − μ_K) / (σ_K + ε) is computed at each step and aggregated as S_min = min_n S_sep(n); the draft is accepted if S_min ≥ τ, otherwise rejected. Phase IV, agentic fallback: rejected and tool-required queries run the stateful tool loop s_{d+1} = f(s_d, t_d(s_d)), the sequential perception-reasoning bottleneck the method avoids. With tool-free ratio β ∈ [0, 1] and acceptance rate α ∈ [0, 1], the expected throughput speedup is 1/(1 − βα); heterogeneous parallel serving delivers the reported 1.1-3.35× speedup with up to +6.7% accuracy gains.
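
The four-phase routing can be written as one dispatch function; all callables passed in are hypothetical stand-ins for the models and the gate, not a published API:

```python
def speceyes_route(query, image, needs_tools, speculate, gate, agentic_run):
    """Sketch of the four-phase pipeline; arguments after (query, image) are
    caller-supplied callables (hypothetical interfaces):
      needs_tools(q, i) -> bool         Phase I: tool-use judgment g(q, I)
      speculate(q, i) -> (answer, aux)  Phase II: stateless small-model draft
      gate(aux) -> bool                 Phase III: cognitive gating
      agentic_run(q, i) -> answer       Phase IV: stateful tool loop"""
    if needs_tools(query, image):           # g = 1: skip straight to Phase IV
        return agentic_run(query, image)
    answer, aux = speculate(query, image)   # Phase II: draft + gating signal
    if gate(aux):                           # Phase III: accept the draft
        return answer
    return agentic_run(query, image)        # Phase IV: agentic fallback
```

Because the small model is stateless, accepted queries bypass the entire perception-reasoning-tool loop, which is where the concurrency gains come from.
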
Q1. What is the fundamental performance barrier in agentic multimodal LLMs that SpecEyes identifies and aims to overcome?
A. Insufficient training data for tool-using capabilities
B. The stateful bottleneck caused by strict causal dependencies in tool-use chains
C. Limited GPU memory for processing high-resolution images

Q2. What is the key innovation of SpecEyes compared to existing token-level speculative decoding methods?
A. Using a larger draft model to propose tokens for verification
B. Lifting speculation from the token level to the entire agentic pipeline to bypass tool-use loops
C. Compressing visual tokens before processing by the language model

Q3. According to the paper, which aggregation strategy for the answer separability score provides the best accuracy-speed trade-off?
A. Mean aggregation (averaging all token-level separability scores)
B. Bottom-r aggregation (focusing on the lowest-confidence tokens)
C. Min aggregation (using the minimum token-level separability score)

Paper 3

SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM

Published: 2026-03-24

Link: http://arxiv.org/pdf/2603.23386

1. 📘 Topic and Domain: The paper addresses the generation of simulation-ready articulated 3D assets from monolithic static meshes using multimodal large language models (MLLMs), falling within the domain of 3D computer vision, embodied AI, and physics-based simulation.
2. 💡 Previous Research and New Ideas: The paper builds upon prior work in articulated object reconstruction (e.g., ArtGS, PartField), MLLM-based kinematic reasoning (e.g., Articulate-Anything, PhysX-Anything), and 3D tokenization. Its new ideas include a unified MLLM framework that jointly performs part decomposition and kinematic prediction, and a Sparse 3D VQ-VAE that reduces token counts by 70% to overcome memory limitations of dense voxel representations.
3. ❓ Problem: The paper aims to solve the lack of "sim-ready" articulated assets, as most existing 3D meshes are static and non-decomposed, and existing multi-stage pipelines for articulated object creation suffer from accumulated errors and incompatibility between part geometry and joint predictions.
4. 🛠️ Methods: The authors propose SIMART, which uses a Sparse 3D VQ-VAE for efficient geometric tokenization and a Qwen3-VL-based MLLM backbone to jointly perform part-level mesh decomposition and kinematic parameter (joint type, axis, limits) prediction, outputting URDF specifications and segmented meshes.
5. 📊 Results and Evaluation: SIMART achieves state-of-the-art performance on PartNet-Mobility and a newly curated AI-generated benchmark (SIMART-Bench), outperforming baselines like Urdformer, Articulate-Anything, and PhysX-Anything in joint classification accuracy (Type↑), axis error (Axis↓), origin error (Origin↓), part IoU (↑), and Chamfer distance (CD↓), while enabling physics-based robotic simulation and VR/AR applications.
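
The sparse, coordinate-aware tokenization in item 2 (a reserved zero token for empty cells, one [xyz][K] record per occupied voxel) can be sketched over a small grid of codebook indices; the 8³ grid size follows the paper's quantized resolution, while the function names are illustrative:

```python
import numpy as np

ZERO_TOKEN = 0  # special codebook entry reserved for empty space

def sparse_tokenize(voxel_codes):
    """Turn a dense grid of codebook indices into a coordinate-aware sparse
    stream: one (x, y, z, K) record per occupied voxel, empty cells skipped."""
    coords = np.argwhere(voxel_codes != ZERO_TOKEN)
    codes = voxel_codes[tuple(coords.T)]
    return [(int(x), int(y), int(z), int(k))
            for (x, y, z), k in zip(coords, codes)]

def sparse_detokenize(tokens, size=8):
    """Invert the sparse stream back to a dense grid (empty = ZERO_TOKEN)."""
    grid = np.full((size, size, size), ZERO_TOKEN, dtype=np.int64)
    for x, y, z, k in tokens:
        grid[x, y, z] = k
    return grid

def token_reduction(voxel_codes):
    """Fraction of dense-grid tokens saved by the sparse encoding."""
    occupied = int((voxel_codes != ZERO_TOKEN).sum())
    return 1.0 - occupied / voxel_codes.size
```

Since meshes occupy only a thin surface shell of the voxel grid, skipping empty cells is what yields the reported ~70% token reduction versus dense O(N³) tokenization.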


Figure: SIMART pipeline, from monolithic mesh to sim-ready articulated asset. Inputs: an RGB image I_vis, a raw mesh G_geo, and an instruction T_txt. A Sparse 3D VQ-VAE (3D-UNet encoder, 4096-entry codebook, 64³ → 8³ voxel quantization) tokenizes the geometry: dense voxels waste O(N³) tokens on empty space, so only occupied surface voxels are encoded in a coordinate-aware [xyz][K] format, with a special zero token (codebook index 0) marking empty space, for a 70% token reduction that makes complex meshes tractable. A unified MLLM (Qwen3-VL backbone) fuses the visual, geometric, and text tokens through its transformer layers and jointly outputs part segmentation and kinematic parameters as URDF metadata (e.g., joint_type: revolute, axis: [1, 0, 0], limits: [−54°, 45°], density: 1.2 g/cm³) plus part-segmented meshes (e.g., Part 0 fixed, Part 1 revolute, Part 2 prismatic). The resulting asset A = (M_seg, P_sim) is ready for physics simulators such as NVIDIA Isaac Sim and for interactive VR/AR applications.
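
A minimal sketch of serializing one predicted joint into a URDF <joint> element, using the figure's example metadata; the helper and its signature are illustrative, not SIMART's actual output code:

```python
import math
from xml.sax.saxutils import escape

def joint_urdf(name, joint_type, parent, child, axis, limit_deg):
    """Emit one URDF <joint> element from predicted kinematic parameters.
    `axis` is an (x, y, z) direction; `limit_deg` is (lower, upper) in
    degrees, converted to radians as URDF expects for revolute joints."""
    lo, hi = (math.radians(d) for d in limit_deg)
    return (
        f'<joint name="{escape(name)}" type="{joint_type}">\n'
        f'  <parent link="{escape(parent)}"/>\n'
        f'  <child link="{escape(child)}"/>\n'
        f'  <axis xyz="{axis[0]} {axis[1]} {axis[2]}"/>\n'
        f'  <limit lower="{lo:.4f}" upper="{hi:.4f}"/>\n'
        f'</joint>'
    )
```

One such element per predicted part, together with the segmented meshes as links, is what makes the asset loadable by URDF-consuming simulators.
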
Q1. What is the primary problem that SIMART aims to solve?
A. Converting static monolithic meshes into simulation-ready articulated assets
B. Generating high-quality 2D images from 3D articulated objects
C. Improving rendering speed for VR applications

Q2. What is the key innovation of SIMART's Sparse 3D VQ-VAE?
A. It uses dense voxel representations for maximum geometric detail
B. It reduces token counts by 70% compared to dense voxel tokens
C. It processes only text inputs without visual or geometric data

Q3. Which MLLM backbone does SIMART use as its core reasoning engine?
A. GPT-4 Vision
B. Qwen3-VL-8B
C. LLaVA-2