2025-05-28 Papers

Paper 1

UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents

Published: 2025-05-27

Link: http://arxiv.org/pdf/2505.21496

1. 📘 Topic and Domain: The paper presents UI-Genie, a self-improving framework for mobile GUI agents using multimodal large language models (MLLMs) to automate mobile interface interactions.
2. 💡 Previous Research and New Ideas: Building on prior MLLM research for GUI agents, the paper introduces a self-improving approach with a specialized reward model and automatic trajectory generation, eliminating the reliance on manual annotation.
3. ❓ Problem: The paper addresses two key challenges in GUI agents: the difficulty of verifying trajectory outcomes and the lack of scalable high-quality training data.
4. 🛠️ Methods: The authors developed UI-Genie-RM (a specialized reward model), generated synthetic training data through rule-based verification and trajectory corruption, and implemented an iterative self-improvement pipeline where both agent and reward models evolve together.
5. 📊 Results and Evaluation: UI-Genie achieved state-of-the-art performance across multiple GUI agent benchmarks after three generations of self-improvement, while generating two novel datasets (UI-Genie-RM-517k and UI-Genie-Agent-16k) without manual annotation.
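The iterative pipeline in step 4 can be sketched as a toy simulation. This is pure illustration, not the paper's training procedure: the rollout counts, the reward model's 90% verification rate, and modeling "fine-tuning" as a bump in success rate are all invented placeholders.

```python
import random

def self_improvement(success_rate, generations=3, rollouts=100, seed=0):
    """Toy sketch of the UI-Genie loop: each generation the agent explores
    trajectories, the reward model verifies outcomes, verified trajectories
    expand the dataset, and both models are 'fine-tuned' (here simplified
    to a bump in the agent's success rate proportional to new data)."""
    rng = random.Random(seed)
    dataset = []
    for gen in range(generations):
        verified = [t for t in range(rollouts)
                    if rng.random() < success_rate   # rollout succeeded
                    and rng.random() < 0.9]          # reward model confirmed it
        dataset.extend((gen, t) for t in verified)
        # stand-in for fine-tuning on the expanded dataset
        success_rate = min(0.95, success_rate + 0.05 * len(verified) / rollouts)
    return success_rate, len(dataset)

final_rate, n_trajectories = self_improvement(success_rate=0.4)
```

The key property the sketch preserves is the feedback loop: trajectories the reward model accepts become training data, which improves the agent, which yields better trajectories in the next generation.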

[Figure: UI-Genie framework. Pipeline: (1) build the reward model (UI-Genie-RM), (2) generate training data, (3) self-improvement via trajectory exploration with outcome verification plus dataset expansion with model fine-tuning, (4) train the agent model (UI-Genie-Agent). Outputs: the UI-Genie-RM-517k reward dataset and the UI-Genie-Agent-16k synthetic trajectory dataset.]
Q1. What is the main innovation of UI-Genie compared to previous GUI agent approaches?
a) It uses a larger language model
b) It eliminates the need for manual annotation through self-improvement
c) It only works with Android devices

Q2. How does UI-Genie-RM process historical context to evaluate actions?
a) It only looks at the current screenshot
b) It uses the full history of all screenshots
c) It uses the 5 most recent screenshots plus summarized earlier actions

Q3. What is the size of the synthetic reward dataset used to train UI-Genie-RM?
a) 16,000 samples
b) 517,000 samples
c) 1 million samples

Paper 2

Exploring the Latent Capacity of LLMs for One-Step Text Generation

Published: 2025-05-27

Link: http://arxiv.org/pdf/2505.21189

1. 📘 Topic and Domain: The paper explores large language models' ability to generate text in a single forward pass using specially trained input embeddings, in the domain of natural language processing and neural architectures.
2. 💡 Previous Research and New Ideas: Based on research showing LLMs can reconstruct text autoregressively from trained embeddings, this paper proposes non-autoregressive generation using just two trainable "proto-tokens."
3. ❓ Problem: The paper investigates whether LLMs can generate accurate multi-token sequences in one forward pass without iterative decoding, challenging the assumption that autoregressive generation is necessary.
4. 🛠️ Methods: Uses two trainable embeddings ("proto-tokens") fed into frozen LLMs, optimizing them to generate target sequences in a single pass, with one token shared across texts and the other unique to each text.
5. 📊 Results and Evaluation: Successfully generated hundreds of accurate tokens in one forward pass (up to 724 tokens for largest models), achieving 279x faster generation than autoregressive methods, though with approximately half the maximum sequence length capacity.
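The method in step 4 amounts to optimizing two vectors against a frozen network, and the input layout is the part the summary pins down. A minimal sketch of that layout follows; the names e and m come from the paper's figure, treating e as the per-text token and m as the shared one is an assumption, and the 8-dimensional placeholder vectors are made up.

```python
def build_one_step_input(e, m, n_positions):
    """[e][m]x(N-1) arrangement: one copy of the per-text proto-token e,
    then N-1 copies of the shared proto-token m. This sequence of embeddings
    replaces ordinary token embeddings as input to the frozen LLM, whose
    N output positions are then decoded in a single forward pass."""
    return [e] + [m] * (n_positions - 1)

e = [0.5] * 8   # trainable, unique to this text (placeholder values)
m = [1.0] * 8   # trainable, shared across texts (placeholder values)
seq = build_one_step_input(e, m, 5)   # 5 input positions -> 5 output tokens
```

In training, only e and m would be updated (e.g. by gradient descent on the cross-entropy of the target text) while every LLM weight stays frozen.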

[Figure: One-step text generation with LLMs. Two "proto-tokens" (e and m) are fed into a frozen pre-trained LLM, which emits the full text sequence in one forward pass. Key findings: hundreds of tokens can be generated in a single pass; both tokens are essential (a one-token setup fails); token arrangement matters ([e][m]×(N-1) works best); generation is 279× faster than autoregressive decoding; the method works better on natural text than on random sequences; solutions form connected regions in embedding space.]
Q1. What is the key innovation of the paper's text generation approach compared to traditional methods?
a) Using a single trainable token for generation
b) Using two specially trained proto-tokens in one forward pass
c) Using multiple forward passes with shared embeddings

Q2. What surprising finding did the researchers discover about the proto-tokens?
a) They only work with large language models
b) They must be completely unique for each text
c) One proto-token can be shared across multiple texts while maintaining performance

Q3. How does the speed improvement of this method compare to traditional autoregressive generation?
a) About 279 times faster
b) About 50 times faster
c) About 100 times faster

Paper 3

MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

Published: 2025-05-27

Link: http://arxiv.org/pdf/2505.21333

1. 📘 Topic and Domain: The paper focuses on evaluating Optical Character Recognition (OCR) capabilities of Multimodal Large Language Models (MLLMs) in video scenarios.
2. 💡 Previous Research and New Ideas: Previous research mainly focused on OCR in static images, while this paper introduces a comprehensive benchmark for video OCR tasks and proposes new evaluation methods for dynamic text recognition.
3. ❓ Problem: The paper addresses the challenge of evaluating MLLMs' ability to recognize, understand, and reason about text in videos, which is more complex than static image OCR due to motion blur, temporal variations, and visual effects.
4. 🛠️ Methods: The authors created MME-VideoOCR benchmark with 1,464 videos and 2,000 manually annotated question-answer pairs across 25 tasks in 10 categories, evaluating 18 state-of-the-art MLLMs using containment match, GPT-assisted scoring, and multiple-choice evaluation methods.
5. 📊 Results and Evaluation: The best-performing model (Gemini-2.5 Pro) achieved 73.7% accuracy, while most models struggled with tasks requiring spatio-temporal reasoning and cross-frame information integration, highlighting the need for improved video OCR capabilities in MLLMs.
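Of the three scoring methods in step 4, containment match is simple enough to sketch. The normalization below (lower-casing and whitespace collapsing) is an assumption, since the summary does not specify the paper's exact matching rule.

```python
def containment_match(prediction, answer):
    """Score 1 if the normalized gold answer occurs as a substring of the
    normalized model output, else 0."""
    def norm(s):
        return " ".join(s.lower().split())
    return int(norm(answer) in norm(prediction))

# Hypothetical OCR answers, for illustration only:
hit = containment_match("The banner reads  GRAND OPENING today", "grand opening")
miss = containment_match("no readable text in this clip", "grand opening")
```

Substring containment tolerates verbose model answers ("The banner reads ...") without penalizing them, which is why it is paired with GPT-assisted scoring and multiple choice for answers that need stricter judging.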

[Figure: MME-VideoOCR workflow. Data collection (public and AI-generated videos) → video filtering (visual dynamics, meaningful text) → manual annotation (QA-pair creation, expert verification). Task categories: text recognition, visual text QA, text grounding, attribute recognition, change detection, special text parsing, cross-frame understanding, text-based reasoning. Evaluation methods: containment match, GPT-assisted scoring, multiple choice.]
Q1. What was the key innovation of MME-VideoOCR compared to previous OCR benchmarks?
a) It used a larger dataset of static images
b) It introduced comprehensive evaluation across temporal and spatial dimensions
c) It only focused on text recognition accuracy

Q2. Why did the researchers introduce a 'debiasing test' in their evaluation methodology?
a) To test if models could work without visual input
b) To prevent models from relying on textual priors and knowledge leakage
c) To evaluate models' language translation capabilities

Q3. What surprising limitation was revealed about current MLLMs through this benchmark?
a) They couldn't read text at all in videos
b) They performed better on long videos than short ones
c) They struggled to integrate information across multiple frames