2026-02-10 Papers


Paper 1

GEBench: Benchmarking Image Generation Models as GUI Environments

Published: 2026-02-09

Link: http://arxiv.org/pdf/2602.09007

1. 📘 Topic and Domain: This paper introduces GEBench, a benchmark for evaluating image generation models as interactive GUI environments in the computer vision and human-computer interaction domain.
2. 💡 Previous Research and New Ideas: The paper builds on existing image generation models and GUI automation research, proposing a novel evaluation framework that shifts focus from general visual fidelity to GUI-specific interaction logic and temporal coherence across discrete state transitions.
3. ❓ Problem: The paper addresses the lack of evaluation methods for assessing whether image generation models can reliably function as GUI environments, as existing benchmarks focus on general visual quality rather than GUI-specific requirements like state transitions and interaction logic.
4. 🛠️ Methods: The authors created a 700-sample benchmark across five task categories (single-step, multi-step, fiction-app, real-app, grounding) and developed GE-Score, a five-dimensional metric evaluated by VLM judges across Goal Achievement, Interaction Logic, Consistency, UI Plausibility, and Visual Quality dimensions.
5. 📊 Results and Evaluation: Results show that while current models perform well on single-step transitions (top models achieving 80+ scores), they struggle significantly with multi-step planning and spatial grounding tasks, with major bottlenecks in icon interpretation, text rendering, and localization precision.
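The GE-Score aggregation described in the methods is a simple mean of the five dimension scores, normalized to [0, 100]. A minimal sketch (the function name and range check are my own; in the paper the per-dimension scores come from VLM judges):

```python
def ge_score(goal, logic, cons, ui, qual):
    """Aggregate the five GE-Score dimensions into one [0, 100] score.

    Each dimension (Goal Achievement, Interaction Logic, Consistency,
    UI Plausibility, Visual Quality) is assumed already normalized
    to the [0, 100] scale.
    """
    dims = (goal, logic, cons, ui, qual)
    for d in dims:
        if not 0 <= d <= 100:
            raise ValueError("each dimension score must lie in [0, 100]")
    return sum(dims) / len(dims)  # (1/5) * Σ(GOAL + LOGIC + CONS + UI + QUAL)
```

Averaging across the three VLM judges per sample would follow the same pattern.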

[Figure: GEBench pipeline overview]
- Data collection: raw screen recordings → task annotation → quality control (rule-based preprocessing, expert verification)
- Dataset: 700 samples across five task categories — Single-step (200), Multi-step (200), Fiction App (100), Real App (100), Grounding (100)
- Model evaluation: 12 image generation models (8 commercial + 4 open-source; GPT-Image, Nano Banana, Flux, etc.) judged by 3 VLM evaluators (GPT-4o, Gemini-3, Qwen3-VL) with a cross-validation strategy
- GE-Score dimensions: Goal Achievement (GOAL), Interaction Logic (LOGIC), Consistency (CONS), UI Plausibility (UI), Visual Quality (QUAL); GE-Score = (1/5) × Σ(GOAL + LOGIC + CONS + UI + QUAL), normalized to [0, 100]
- Key findings: models excel at single-step transitions but struggle with multi-step planning; critical bottlenecks in icon interpretation, text rendering, and localization precision; performance gap between commercial and open-source models
Q1. What is the primary limitation of existing image generation benchmarks when evaluating models as GUI environments?
A. They focus on general-domain visual fidelity rather than GUI-specific interaction logic and state transitions
B. They only evaluate single images instead of video sequences
C. They require too much computational power to run effectively

Q2. According to the evaluation results, what represents the biggest performance gap for current image generation models in GEBench?
A. Visual quality degradation in high-resolution images
B. The dramatic drop from single-step transitions (80+ scores) to multi-step planning (often below 60 points)
C. Inability to generate fictional app interfaces from scratch

Q3. What are the three main technical bottlenecks identified in the qualitative analysis of GUI generation failures?
A. Memory limitations, processing speed, and storage capacity
B. Icon interpretation, text rendering accuracy, and localization precision
C. Color accuracy, resolution scaling, and compression artifacts

Paper 2

LLaDA2.1: Speeding Up Text Diffusion via Token Editing

Published: 2026-02-09

Link: http://arxiv.org/pdf/2602.08676

1. 📘 Topic and Domain: This paper presents LLaDA2.1, a diffusion-based large language model that accelerates text generation through token editing mechanisms in the natural language processing domain.
2. 💡 Previous Research and New Ideas: The paper builds on LLaDA2.0 and previous discrete diffusion language models, proposing a novel "Draft-and-Edit" paradigm that combines Mask-to-Token (M2T) and Token-to-Token (T2T) operations with configurable threshold decoding to enable error correction during generation.
3. ❓ Problem: The paper aims to solve the trade-off between decoding speed and generation quality in discrete diffusion language models, addressing exposure bias and token-level inconsistencies that occur during parallel decoding.
4. 🛠️ Methods: The authors use dual probability thresholds for configurable decoding (Speedy Mode and Quality Mode), mixture of M2T and T2T training objectives, multi-turn forward data augmentation, and ELBO-based Block-level Policy Optimization (EBPO) for reinforcement learning.
5. 📊 Results and Evaluation: LLaDA2.1 achieves significant speed improvements (892 TPS on HumanEval+, 801 TPS on BigCodeBench, 663 TPS on LiveCodeBench) while maintaining competitive performance across 33 benchmarks covering knowledge, reasoning, coding, math, and alignment tasks.
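The dual-threshold Draft-and-Edit decoding described above can be sketched as one parallel refinement step. This is an illustrative reconstruction under stated assumptions, not the paper's implementation (the data structures and names are mine):

```python
def draft_and_edit_step(probs, tokens, mask_tok, w_mask, w_edit):
    """One parallel refinement step of dual-threshold Draft-and-Edit.

    probs:    per-position token distributions, e.g. [{"a": 0.9, ...}, ...]
    tokens:   current sequence, with mask_tok at still-masked positions
    w_mask:   M2T threshold -- a masked position is filled (drafted) when
              its top token's probability exceeds w_mask (unmasking set)
    w_edit:   T2T threshold -- an already-placed token is overwritten when
              a different token's probability exceeds w_edit (editing set)
    """
    new_tokens = list(tokens)
    for i, dist in enumerate(probs):
        top_tok, top_p = max(dist.items(), key=lambda kv: kv[1])
        if tokens[i] == mask_tok:
            if top_p > w_mask:
                new_tokens[i] = top_tok      # M2T: draft Mask -> Token
        elif top_tok != tokens[i] and top_p > w_edit:
            new_tokens[i] = top_tok          # T2T: edit Token -> Token
    return new_tokens
```

In this picture, Speedy Mode corresponds to a low w_mask (accept low-confidence drafts and rely on T2T edits to correct them later), while Quality Mode raises w_mask to admit only high-confidence tokens.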

[Figure: LLaDA2.1 method flow]
- Stage 1, training paradigm: CPT with mixed M2T + T2T objectives → SFT with multi-turn forward data → RL via EBPO policy optimization
- Stage 2, configurable decoding: dual-threshold control (ωmask + ωedit); M2T drafts Mask→Token, T2T edits Token→Token
- Stage 3, operating modes: Speedy Mode (low ωmask, speed priority), Quality Mode (high ωmask, quality priority), Multi-Block Edit (MBE)
- Core innovation, Draft-and-Edit: unmasking set Γt where p(v|xt) > ωmask; editing set Δt where p(v|xt) > ωedit
- Performance: 892 TPS (HumanEval+), 801 TPS (BigCodeBench), 663 TPS (LiveCodeBench); evaluated on 33 benchmark tasks
- Infrastructure: dFactory (training), SGLang (inference), AReaL (RL), Alpha-MoE FP8 quantization
- Models: LLaDA2.1-Mini (16B) and LLaDA2.1-Flash (100B); editable state evolution enables a dynamic speed-quality trade-off
Q1. What is the key innovation that LLaDA2.1 introduces to address the speed-quality trade-off in discrete diffusion language models?
A. A dual-threshold 'Draft-and-Edit' paradigm that combines Mask-to-Token (M2T) and Token-to-Token (T2T) operations
B. A completely new autoregressive architecture that replaces the diffusion mechanism
C. A simple parameter scaling approach that increases model size to 200B parameters

Q2. According to the paper, what is the peak throughput (TPS) that LLaDA2.1-Flash achieves on HumanEval+ coding benchmark?
A. 663 TPS
B. 801 TPS
C. 892 TPS

Q3. What does the 'Speedy Mode (S Mode)' in LLaDA2.1 prioritize compared to 'Quality Mode (Q Mode)'?
A. Higher accuracy by using conservative thresholds and avoiding token editing
B. High-throughput generation by accepting lower-confidence tokens and relying on T2T correction
C. Balanced performance by using identical thresholds for both M2T and T2T operations

Paper 3

Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition

Published: 2026-02-09

Link: http://arxiv.org/pdf/2602.08439

1. 📘 Topic and Domain: This paper introduces Demo-driven Video In-Context Learning for procedural video knowledge acquisition, focusing on enabling multimodal large language models to learn from video demonstrations rather than relying solely on pre-trained knowledge.
2. 💡 Previous Research and New Ideas: The paper builds on existing multimodal video understanding and in-context learning research, proposing a novel paradigm where models learn from text instructions, video demonstrations, or selected demonstrations to answer questions about target videos, moving beyond static knowledge retrieval.
3. ❓ Problem: The paper addresses the limitation that current video understanding models primarily rely on internal pre-trained knowledge or visible facts rather than learning new skills from contextual demonstrations, which is crucial for human-like learning and adaptation.
4. 🛠️ Methods: The authors develop Demo-ICL using a two-stage training strategy: video supervised fine-tuning followed by information-assisted Direct Preference Optimization (DPO), along with constructing Demo-ICL-Bench benchmark with 1,200 questions from instructional YouTube videos.
5. 📊 Results and Evaluation: The Demo-ICL model achieves state-of-the-art results on demo-driven learning tasks, while existing models struggle: Gemini-2.5-Pro reaches only 46.6% and 32.0% accuracy on text and video demonstrations respectively, underscoring the benchmark's difficulty.
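Demo-ICL's second training stage applies information-assisted DPO, where the preferred (chosen) response is generated with assistive information such as timestamps or text guidance. The standard per-pair DPO objective it builds on can be sketched as follows (a toy scalar version with hypothetical names; the paper's batched, information-assisted variant adds details not shown here):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_w / logp_l: policy log-probs of the chosen / rejected response
    ref_logp_*:      the same quantities under the frozen reference model
    beta:            strength of the KL-style regularization
    """
    # Implicit reward margin between chosen and rejected responses
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): small when the chosen response is preferred
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the policy assigns relatively more probability to the assisted (chosen) response than the reference model does, which is what pushes the model toward demo-driven answers.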

[Figure: Demo-ICL methodological workflow]
- Data collection: HowTo100M videos with ASR transcripts; metadata filtering and quality control
- Text demo generation: Qwen2.5-72B summarization → step filtering & merging → Qwen2.5-VL refinement for visual-text alignment
- Video demo selection: metadata ranking, title similarity, LLM validation
- Pair construction: question generation, step selection, QA creation, human review
- Three settings: text-demo ICL (text instructions as context), video-demo ICL (video demonstrations as reference), demo selection (choose from a video pool)
- Two-stage training: Stage 1 — video SFT on multi-source data (LLaVA-OneVision, Oryx, COIN, Cross-Task, Demo-ICL samples); Stage 2 — info-assisted DPO (timestamp assistance for text demos, text guidance for video demos) with iterative preference optimization
- Demo-ICL-Bench results: text-demo 43.4%, video-demo 32.0%, demo selection 58.0%
Q1. What is the fundamental difference between Demo-ICL's approach and existing video understanding benchmarks?
A. Demo-ICL uses longer videos while existing benchmarks use short clips
B. Demo-ICL requires models to learn from in-context demonstrations rather than relying on pre-trained knowledge or visible facts
C. Demo-ICL focuses on audio understanding while existing benchmarks focus on visual content

Q2. In the Demo-ICL training strategy, what is the purpose of the information-assisted Direct Preference Optimization (DPO) stage?
A. To reduce the computational cost of video processing by using fewer frames
B. To generate high-quality responses by providing assistive information like timestamps and text guidance, overcoming current models' limitations in demo-driven learning
C. To compress video data into smaller file sizes for faster processing

Q3. According to the experimental results, what does the poor performance of current state-of-the-art models on Demo-ICL-Bench reveal?
A. Current models need more computational power to process video data effectively
B. The benchmark questions are poorly designed and need to be simplified
C. Current models struggle with extracting and transferring knowledge from demonstrations, highlighting the need for specialized training approaches