1. 📘 Topic and Domain: This paper introduces GEBench, a benchmark for evaluating image generation models as interactive GUI environments, situated in the domains of computer vision and human-computer interaction.
2. 💡 Previous Research and New Ideas: The paper builds on existing image generation models and GUI automation research, proposing a novel evaluation framework that shifts focus from general visual fidelity to GUI-specific interaction logic and temporal coherence across discrete state transitions.
3. ❓ Problem: The paper addresses the lack of evaluation methods for assessing whether image generation models can reliably function as GUI environments, as existing benchmarks focus on general visual quality rather than GUI-specific requirements like state transitions and interaction logic.
4. 🛠️ Methods: The authors created a 700-sample benchmark across five task categories (single-step, multi-step, fiction-app, real-app, grounding) and developed GE-Score, a five-dimensional metric evaluated by VLM judges across Goal Achievement, Interaction Logic, Consistency, UI Plausibility, and Visual Quality dimensions.
5. 📊 Results and Evaluation: Results show that while current models perform well on single-step transitions (top models scoring above 80), they struggle significantly with multi-step planning and spatial grounding tasks, with the major bottlenecks being icon interpretation, text rendering, and localization precision.
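The five-dimensional GE-Score described in the Methods item can be sketched as a simple aggregation of per-dimension judge scores. This is a minimal illustration only: the 0-100 scale, the dimension key names, and the unweighted mean are assumptions for the sketch, not details taken from the paper.

```python
# Hypothetical sketch of GE-Score aggregation. The dimension names,
# the 0-100 scale, and the unweighted mean are illustrative assumptions;
# the paper's actual VLM-judge interface and weighting may differ.

GE_SCORE_DIMENSIONS = (
    "goal_achievement",
    "interaction_logic",
    "consistency",
    "ui_plausibility",
    "visual_quality",
)

def aggregate_ge_score(judge_scores: dict) -> float:
    """Combine per-dimension VLM-judge scores into a single GE-Score.

    Assumes every dimension is present and equally weighted.
    """
    missing = [d for d in GE_SCORE_DIMENSIONS if d not in judge_scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return sum(judge_scores[d] for d in GE_SCORE_DIMENSIONS) / len(GE_SCORE_DIMENSIONS)

# Example: a strong single-step transition as judged by a VLM
scores = {
    "goal_achievement": 90.0,
    "interaction_logic": 85.0,
    "consistency": 88.0,
    "ui_plausibility": 82.0,
    "visual_quality": 80.0,
}
print(aggregate_ge_score(scores))  # 85.0
```

An unweighted mean keeps each dimension equally influential, which matches the summary's framing of GE-Score as five coordinate dimensions rather than a single weighted objective.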