2025-05-28 Papers

Paper 1

UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents

Published: 2025-05-27

Link: http://arxiv.org/pdf/2505.21496

1. 📘 Topic and Domain: The paper presents UI-Genie, a self-improving framework for mobile GUI agents using multimodal large language models (MLLMs) to automate mobile interface interactions.
2. 💡 Previous Research and New Ideas: Building on prior MLLM research for GUI agents, the paper introduces a self-improving approach with a specialized reward model and automatic trajectory generation, eliminating the reliance on manual annotation.
3. ❓ Problem: The paper addresses two key challenges in GUI agents: the difficulty of verifying trajectory outcomes and the lack of scalable high-quality training data.
4. 🛠️ Methods: The authors developed UI-Genie-RM (a specialized reward model), generated synthetic training data through rule-based verification and trajectory corruption, and implemented an iterative self-improvement pipeline where both agent and reward models evolve together.
5. 📊 Results and Evaluation: UI-Genie achieved state-of-the-art performance across multiple GUI agent benchmarks after three generations of self-improvement, while generating two novel datasets (UI-Genie-RM-517k and UI-Genie-Agent-16k) without manual annotation.
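The iterative pipeline in step 4 can be sketched as a toy simulation. This is pure illustration, not the paper's training procedure: the rollout counts, the reward model's 90% verification rate, and modeling "fine-tuning" as a bump in success rate are all invented placeholders.

```python
import random

def self_improvement(success_rate, generations=3, rollouts=100, seed=0):
    """Toy sketch of the UI-Genie loop: each generation the agent explores
    trajectories, the reward model verifies outcomes, verified trajectories
    expand the dataset, and both models are 'fine-tuned' (here simplified
    to a bump in the agent's success rate proportional to new data)."""
    rng = random.Random(seed)
    dataset = []
    for gen in range(generations):
        verified = [t for t in range(rollouts)
                    if rng.random() < success_rate   # rollout succeeded
                    and rng.random() < 0.9]          # reward model confirmed it
        dataset.extend((gen, t) for t in verified)
        # stand-in for fine-tuning on the expanded dataset
        success_rate = min(0.95, success_rate + 0.05 * len(verified) / rollouts)
    return success_rate, len(dataset)

final_rate, n_trajectories = self_improvement(success_rate=0.4)
```

The key property the sketch preserves is the feedback loop: trajectories the reward model accepts become training data, which improves the agent, which yields better trajectories in the next generation.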

[Figure: UI-Genie framework. Pipeline: (1) build the reward model (UI-Genie-RM), (2) generate training data, (3) self-improvement via trajectory exploration with outcome verification plus dataset expansion with model fine-tuning, (4) train the agent model (UI-Genie-Agent). Outputs: the UI-Genie-RM-517k reward dataset and the UI-Genie-Agent-16k synthetic trajectory dataset.]
Q1. What is the main innovation of UI-Genie compared to previous GUI agent approaches?
a) It uses a larger language model
b) It eliminates the need for manual annotation through self-improvement
c) It only works with Android devices

Q2. How does UI-Genie-RM process historical context to evaluate actions?
a) It only looks at the current screenshot
b) It uses the full history of all screenshots
c) It uses the 5 most recent screenshots plus summarized earlier actions

Q3. What is the size of the synthetic reward dataset used to train UI-Genie-RM?
a) 16,000 samples
b) 517,000 samples
c) 1 million samples

Paper 2

Exploring the Latent Capacity of LLMs for One-Step Text Generation

Published: 2025-05-27

Link: http://arxiv.org/pdf/2505.21189

1. 📘 Topic and Domain: The paper explores large language models' ability to generate text in a single forward pass using specially trained input embeddings, in the domain of natural language processing and neural architectures.
2. 💡 Previous Research and New Ideas: Based on research showing LLMs can reconstruct text autoregressively from trained embeddings, this paper proposes non-autoregressive generation using just two trainable "proto-tokens."
3. ❓ Problem: The paper investigates whether LLMs can generate accurate multi-token sequences in one forward pass without iterative decoding, challenging the assumption that autoregressive generation is necessary.
4. 🛠️ Methods: Uses two trainable embeddings ("proto-tokens") fed into frozen LLMs, optimizing them to generate target sequences in a single pass, with one token shared across texts and the other unique to each text.
5. 📊 Results and Evaluation: Successfully generated hundreds of accurate tokens in one forward pass (up to 724 tokens for largest models), achieving 279x faster generation than autoregressive methods, though with approximately half the maximum sequence length capacity.
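The method in step 4 amounts to optimizing two vectors against a frozen network, and the input layout is the part the summary pins down. A minimal sketch of that layout follows; the names e and m come from the paper's figure, treating e as the per-text token and m as the shared one is an assumption, and the 8-dimensional placeholder vectors are made up.

```python
def build_one_step_input(e, m, n_positions):
    """[e][m]x(N-1) arrangement: one copy of the per-text proto-token e,
    then N-1 copies of the shared proto-token m. This sequence of embeddings
    replaces ordinary token embeddings as input to the frozen LLM, whose
    N output positions are then decoded in a single forward pass."""
    return [e] + [m] * (n_positions - 1)

e = [0.5] * 8   # trainable, unique to this text (placeholder values)
m = [1.0] * 8   # trainable, shared across texts (placeholder values)
seq = build_one_step_input(e, m, 5)   # 5 input positions -> 5 output tokens
```

In training, only e and m would be updated (e.g. by gradient descent on the cross-entropy of the target text) while every LLM weight stays frozen.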

[Figure: One-step text generation with LLMs. Two "proto-tokens" (e and m) are fed into a frozen pre-trained LLM, which emits the full text sequence in one forward pass. Key findings: hundreds of tokens can be generated in a single pass; both tokens are essential (a one-token setup fails); token arrangement matters ([e][m]×(N-1) works best); generation is 279× faster than autoregressive decoding; the method works better on natural text than on random sequences; solutions form connected regions in embedding space.]
Q1. What is the key innovation of the paper's text generation approach compared to traditional methods?
a) Using a single trainable token for generation
b) Using two specially trained proto-tokens in one forward pass
c) Using multiple forward passes with shared embeddings

Q2. What surprising finding did the researchers discover about the proto-tokens?
a) They only work with large language models
b) They must be completely unique for each text
c) One proto-token can be shared across multiple texts while maintaining performance

Q3. How does the speed improvement of this method compare to traditional autoregressive generation?
a) About 279 times faster
b) About 50 times faster
c) About 100 times faster

Paper 3

MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

Published: 2025-05-27

Link: http://arxiv.org/pdf/2505.21333

1. 📘 Topic and Domain: The paper focuses on evaluating Optical Character Recognition (OCR) capabilities of Multimodal Large Language Models (MLLMs) in video scenarios.
2. 💡 Previous Research and New Ideas: Previous research mainly focused on OCR in static images, while this paper introduces a comprehensive benchmark for video OCR tasks and proposes new evaluation methods for dynamic text recognition.
3. ❓ Problem: The paper addresses the challenge of evaluating MLLMs' ability to recognize, understand, and reason about text in videos, which is more complex than static image OCR due to motion blur, temporal variations, and visual effects.
4. 🛠️ Methods: The authors created MME-VideoOCR benchmark with 1,464 videos and 2,000 manually annotated question-answer pairs across 25 tasks in 10 categories, evaluating 18 state-of-the-art MLLMs using containment match, GPT-assisted scoring, and multiple-choice evaluation methods.
5. 📊 Results and Evaluation: The best-performing model (Gemini-2.5 Pro) achieved 73.7% accuracy, while most models struggled with tasks requiring spatio-temporal reasoning and cross-frame information integration, highlighting the need for improved video OCR capabilities in MLLMs.
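Of the three scoring methods in step 4, containment match is simple enough to sketch. The normalization below (lower-casing and whitespace collapsing) is an assumption, since the summary does not specify the paper's exact matching rule.

```python
def containment_match(prediction, answer):
    """Score 1 if the normalized gold answer occurs as a substring of the
    normalized model output, else 0."""
    def norm(s):
        return " ".join(s.lower().split())
    return int(norm(answer) in norm(prediction))

# Hypothetical OCR answers, for illustration only:
hit = containment_match("The banner reads  GRAND OPENING today", "grand opening")
miss = containment_match("no readable text in this clip", "grand opening")
```

Substring containment tolerates verbose model answers ("The banner reads ...") without penalizing them, which is why it is paired with GPT-assisted scoring and multiple choice for answers that need stricter judging.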

[Figure: MME-VideoOCR workflow. Data collection (public and AI-generated videos) → video filtering (visual dynamics, meaningful text) → manual annotation (QA-pair creation, expert verification). Task categories: text recognition, visual text QA, text grounding, attribute recognition, change detection, special text parsing, cross-frame understanding, text-based reasoning. Evaluation methods: containment match, GPT-assisted scoring, multiple choice.]
Q1. What was the key innovation of MME-VideoOCR compared to previous OCR benchmarks?
a) It used a larger dataset of static images
b) It introduced comprehensive evaluation across temporal and spatial dimensions
c) It only focused on text recognition accuracy

Q2. Why did the researchers introduce a 'debiasing test' in their evaluation methodology?
a) To test if models could work without visual input
b) To prevent models from relying on textual priors and knowledge leakage
c) To evaluate models' language translation capabilities

Q3. What surprising limitation was revealed about current MLLMs through this benchmark?
a) They couldn't read text at all in videos
b) They performed better on long videos than short ones
c) They struggled to integrate information across multiple frames