2026-02-10 Papers


Paper 1

GEBench: Benchmarking Image Generation Models as GUI Environments

Published: 2026-02-09

Link: http://arxiv.org/pdf/2602.09007

1. 📘 Topic and Domain: This paper introduces GEBench, a benchmark for evaluating image generation models as interactive GUI environments in the computer vision and human-computer interaction domain.
2. 💡 Previous Research and New Ideas: The paper builds on existing image generation models and GUI automation research, proposing a novel evaluation framework that shifts focus from general visual fidelity to GUI-specific interaction logic and temporal coherence across discrete state transitions.
3. ❓ Problem: The paper addresses the lack of evaluation methods for assessing whether image generation models can reliably function as GUI environments, as existing benchmarks focus on general visual quality rather than GUI-specific requirements like state transitions and interaction logic.
4. 🛠️ Methods: The authors created a 700-sample benchmark across five task categories (single-step, multi-step, fiction-app, real-app, grounding) and developed GE-Score, a five-dimensional metric evaluated by VLM judges across Goal Achievement, Interaction Logic, Consistency, UI Plausibility, and Visual Quality dimensions.
5. 📊 Results and Evaluation: Results show that while current models perform well on single-step transitions (top models achieving 80+ scores), they struggle significantly with multi-step planning and spatial grounding tasks, with major bottlenecks in icon interpretation, text rendering, and localization precision.
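The GE-Score aggregation described in the methods is a simple mean of the five dimension scores, normalized to [0, 100]. A minimal sketch (the function name and range check are my own; in the paper the per-dimension scores come from VLM judges):

```python
def ge_score(goal, logic, cons, ui, qual):
    """Aggregate the five GE-Score dimensions into one [0, 100] score.

    Each dimension (Goal Achievement, Interaction Logic, Consistency,
    UI Plausibility, Visual Quality) is assumed already normalized
    to the [0, 100] scale.
    """
    dims = (goal, logic, cons, ui, qual)
    for d in dims:
        if not 0 <= d <= 100:
            raise ValueError("each dimension score must lie in [0, 100]")
    return sum(dims) / len(dims)  # (1/5) * Σ(GOAL + LOGIC + CONS + UI + QUAL)
```

Averaging across the three VLM judges per sample would follow the same pattern.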

[Figure: GEBench pipeline overview]
- Data collection: raw screen recordings → task annotation → quality control (rule-based preprocessing, expert verification)
- Dataset: 700 samples across five task categories — Single-step (200), Multi-step (200), Fiction App (100), Real App (100), Grounding (100)
- Model evaluation: 12 image generation models (8 commercial + 4 open-source; GPT-Image, Nano Banana, Flux, etc.) judged by 3 VLM evaluators (GPT-4o, Gemini-3, Qwen3-VL) with a cross-validation strategy
- GE-Score dimensions: Goal Achievement (GOAL), Interaction Logic (LOGIC), Consistency (CONS), UI Plausibility (UI), Visual Quality (QUAL); GE-Score = (1/5) × Σ(GOAL + LOGIC + CONS + UI + QUAL), normalized to [0, 100]
- Key findings: models excel at single-step transitions but struggle with multi-step planning; critical bottlenecks in icon interpretation, text rendering, and localization precision; performance gap between commercial and open-source models
Q1. What is the primary limitation of existing image generation benchmarks when evaluating models as GUI environments?
A. They focus on general-domain visual fidelity rather than GUI-specific interaction logic and state transitions
B. They only evaluate single images instead of video sequences
C. They require too much computational power to run effectively

Q2. According to the evaluation results, what represents the biggest performance gap for current image generation models in GEBench?
A. Visual quality degradation in high-resolution images
B. The dramatic drop from single-step transitions (80+ scores) to multi-step planning (often below 60 points)
C. Inability to generate fictional app interfaces from scratch

Q3. What are the three main technical bottlenecks identified in the qualitative analysis of GUI generation failures?
A. Memory limitations, processing speed, and storage capacity
B. Icon interpretation, text rendering accuracy, and localization precision
C. Color accuracy, resolution scaling, and compression artifacts

Paper 2

LLaDA2.1: Speeding Up Text Diffusion via Token Editing

Published: 2026-02-09

Link: http://arxiv.org/pdf/2602.08676

1. 📘 Topic and Domain: This paper presents LLaDA2.1, a diffusion-based large language model that accelerates text generation through token editing mechanisms in the natural language processing domain.
2. 💡 Previous Research and New Ideas: The paper builds on LLaDA2.0 and previous discrete diffusion language models, proposing a novel "Draft-and-Edit" paradigm that combines Mask-to-Token (M2T) and Token-to-Token (T2T) operations with configurable threshold decoding to enable error correction during generation.
3. ❓ Problem: The paper aims to solve the trade-off between decoding speed and generation quality in discrete diffusion language models, addressing exposure bias and token-level inconsistencies that occur during parallel decoding.
4. 🛠️ Methods: The authors use dual probability thresholds for configurable decoding (Speedy Mode and Quality Mode), mixture of M2T and T2T training objectives, multi-turn forward data augmentation, and ELBO-based Block-level Policy Optimization (EBPO) for reinforcement learning.
5. 📊 Results and Evaluation: LLaDA2.1 achieves significant speed improvements (892 TPS on HumanEval+, 801 TPS on BigCodeBench, 663 TPS on LiveCodeBench) while maintaining competitive performance across 33 benchmarks covering knowledge, reasoning, coding, math, and alignment tasks.
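The dual-threshold Draft-and-Edit decoding described above can be sketched as one parallel refinement step. This is an illustrative reconstruction under stated assumptions, not the paper's implementation (the data structures and names are mine):

```python
def draft_and_edit_step(probs, tokens, mask_tok, w_mask, w_edit):
    """One parallel refinement step of dual-threshold Draft-and-Edit.

    probs:    per-position token distributions, e.g. [{"a": 0.9, ...}, ...]
    tokens:   current sequence, with mask_tok at still-masked positions
    w_mask:   M2T threshold -- a masked position is filled (drafted) when
              its top token's probability exceeds w_mask (unmasking set)
    w_edit:   T2T threshold -- an already-placed token is overwritten when
              a different token's probability exceeds w_edit (editing set)
    """
    new_tokens = list(tokens)
    for i, dist in enumerate(probs):
        top_tok, top_p = max(dist.items(), key=lambda kv: kv[1])
        if tokens[i] == mask_tok:
            if top_p > w_mask:
                new_tokens[i] = top_tok      # M2T: draft Mask -> Token
        elif top_tok != tokens[i] and top_p > w_edit:
            new_tokens[i] = top_tok          # T2T: edit Token -> Token
    return new_tokens
```

In this picture, Speedy Mode corresponds to a low w_mask (accept low-confidence drafts and rely on T2T edits to correct them later), while Quality Mode raises w_mask to admit only high-confidence tokens.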

[Figure: LLaDA2.1 method flow]
- Stage 1, training paradigm: CPT with mixed M2T + T2T objectives → SFT with multi-turn forward data → RL via EBPO policy optimization
- Stage 2, configurable decoding: dual-threshold control (ωmask + ωedit); M2T drafts Mask→Token, T2T edits Token→Token
- Stage 3, operating modes: Speedy Mode (low ωmask, speed priority), Quality Mode (high ωmask, quality priority), Multi-Block Edit (MBE)
- Core innovation, Draft-and-Edit: unmasking set Γt where p(v|xt) > ωmask; editing set Δt where p(v|xt) > ωedit
- Performance: 892 TPS (HumanEval+), 801 TPS (BigCodeBench), 663 TPS (LiveCodeBench); evaluated on 33 benchmark tasks
- Infrastructure: dFactory (training), SGLang (inference), AReaL (RL), Alpha-MoE FP8 quantization
- Models: LLaDA2.1-Mini (16B) and LLaDA2.1-Flash (100B); editable state evolution enables a dynamic speed-quality trade-off
Q1. What is the key innovation that LLaDA2.1 introduces to address the speed-quality trade-off in discrete diffusion language models?
A. A dual-threshold 'Draft-and-Edit' paradigm that combines Mask-to-Token (M2T) and Token-to-Token (T2T) operations
B. A completely new autoregressive architecture that replaces the diffusion mechanism
C. A simple parameter scaling approach that increases model size to 200B parameters

Q2. According to the paper, what is the peak throughput (TPS) that LLaDA2.1-Flash achieves on HumanEval+ coding benchmark?
A. 663 TPS
B. 801 TPS
C. 892 TPS

Q3. What does the 'Speedy Mode (S Mode)' in LLaDA2.1 prioritize compared to 'Quality Mode (Q Mode)'?
A. Higher accuracy by using conservative thresholds and avoiding token editing
B. High-throughput generation by accepting lower-confidence tokens and relying on T2T correction
C. Balanced performance by using identical thresholds for both M2T and T2T operations

Paper 3

Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition

Published: 2026-02-09

Link: http://arxiv.org/pdf/2602.08439

1. 📘 Topic and Domain: This paper introduces Demo-driven Video In-Context Learning for procedural video knowledge acquisition, focusing on enabling multimodal large language models to learn from video demonstrations rather than relying solely on pre-trained knowledge.
2. 💡 Previous Research and New Ideas: The paper builds on existing multimodal video understanding and in-context learning research, proposing a novel paradigm where models learn from text instructions, video demonstrations, or selected demonstrations to answer questions about target videos, moving beyond static knowledge retrieval.
3. ❓ Problem: The paper addresses the limitation that current video understanding models primarily rely on internal pre-trained knowledge or visible facts rather than learning new skills from contextual demonstrations, which is crucial for human-like learning and adaptation.
4. 🛠️ Methods: The authors develop Demo-ICL using a two-stage training strategy: video supervised fine-tuning followed by information-assisted Direct Preference Optimization (DPO), along with constructing Demo-ICL-Bench benchmark with 1,200 questions from instructional YouTube videos.
5. 📊 Results and Evaluation: The Demo-ICL model achieves state-of-the-art results on demo-driven learning tasks, while existing models struggle: Gemini-2.5-Pro reaches only 46.6% and 32.0% accuracy on text and video demonstrations respectively, underscoring the benchmark's difficulty.
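Demo-ICL's second training stage applies information-assisted DPO, where the preferred (chosen) response is generated with assistive information such as timestamps or text guidance. The standard per-pair DPO objective it builds on can be sketched as follows (a toy scalar version with hypothetical names; the paper's batched, information-assisted variant adds details not shown here):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_w / logp_l: policy log-probs of the chosen / rejected response
    ref_logp_*:      the same quantities under the frozen reference model
    beta:            strength of the KL-style regularization
    """
    # Implicit reward margin between chosen and rejected responses
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): small when the chosen response is preferred
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the policy assigns relatively more probability to the assisted (chosen) response than the reference model does, which is what pushes the model toward demo-driven answers.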

[Figure: Demo-ICL methodological workflow]
- Data collection: HowTo100M videos with ASR transcripts; metadata filtering and quality control
- Text demo generation: Qwen2.5-72B summarization → step filtering & merging → Qwen2.5-VL refinement for visual-text alignment
- Video demo selection: metadata ranking, title similarity, LLM validation
- Pair construction: question generation, step selection, QA creation, human review
- Three settings: text-demo ICL (text instructions as context), video-demo ICL (video demonstrations as reference), demo selection (choose from a video pool)
- Two-stage training: Stage 1 — video SFT on multi-source data (LLaVA-OneVision, Oryx, COIN, Cross-Task, Demo-ICL samples); Stage 2 — info-assisted DPO (timestamp assistance for text demos, text guidance for video demos) with iterative preference optimization
- Demo-ICL-Bench results: text-demo 43.4%, video-demo 32.0%, demo selection 58.0%
Q1. What is the fundamental difference between Demo-ICL's approach and existing video understanding benchmarks?
A. Demo-ICL uses longer videos while existing benchmarks use short clips
B. Demo-ICL requires models to learn from in-context demonstrations rather than relying on pre-trained knowledge or visible facts
C. Demo-ICL focuses on audio understanding while existing benchmarks focus on visual content

Q2. In the Demo-ICL training strategy, what is the purpose of the information-assisted Direct Preference Optimization (DPO) stage?
A. To reduce the computational cost of video processing by using fewer frames
B. To generate high-quality responses by providing assistive information like timestamps and text guidance, overcoming current models' limitations in demo-driven learning
C. To compress video data into smaller file sizes for faster processing

Q3. According to the experimental results, what does the poor performance of current state-of-the-art models on Demo-ICL-Bench reveal?
A. Current models need more computational power to process video data effectively
B. The benchmark questions are poorly designed and need to be simplified
C. Current models struggle with extracting and transferring knowledge from demonstrations, highlighting the need for specialized training approaches