1. 📘 Topic and Domain: The paper introduces CIGEVAL, a unified agentic framework for evaluating conditional image generation across various tasks such as text-guided image generation, subject-driven image editing, and control-guided image generation.
2. 💡 Previous Research and New Ideas: The paper builds upon previous image evaluation metrics like CLIP-Score, LPIPS, and VIESCORE, but proposes a novel approach that integrates large multimodal models (LMMs) with specialized tools to overcome limitations in task specificity, explainability, and human alignment.
3. ❓ Problem: The paper addresses the challenge of developing task-agnostic, reliable, and explainable evaluation metrics for conditional image generation that can align with human judgment across diverse generation tasks.
4. 🛠️ Methods: The authors implement an agentic framework that pairs LMMs (such as GPT-4o or open-source models) with a multi-functional toolbox (including Grounding, Highlight, Difference, and Scene Graph tools), and performs fine-grained evaluation through task decomposition, tool selection, and analysis.
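The three-stage loop above (decompose the task, select a tool per sub-question, analyze the gathered evidence) can be sketched in miniature. This is a hedged illustration only: the tool implementations, the `DummyLMM` class, and the scoring rule are all made-up stand-ins, not the paper's actual API.

```python
# Minimal sketch of a CIGEVAL-style agentic evaluation loop.
# All names below (tool functions, DummyLMM, scoring rule) are
# illustrative assumptions, not the paper's implementation.

def grounding_tool(image, phrase):
    """Stand-in for the Grounding tool: locate `phrase` in `image`.
    Here 'images' are just strings so the sketch stays self-contained."""
    return {"tool": "Grounding", "phrase": phrase, "found": phrase in image}

def difference_tool(image, reference):
    """Stand-in for the Difference tool: compare two 'images'."""
    return {"tool": "Difference", "identical": image == reference}

TOOLBOX = {"Grounding": grounding_tool, "Difference": difference_tool}

class DummyLMM:
    """Rule-based stand-in for the LMM agent (e.g. GPT-4o)."""

    def decompose(self, condition):
        # 1. Task decomposition: split the condition into per-object checks.
        return [w for w in condition.split() if w not in {"a", "and", "the"}]

    def select_tool(self, subtask):
        # 2. Tool selection: trivial policy, always ground the object.
        return "Grounding"

    def score(self, evidence):
        # 3. Analysis: fraction of sub-checks satisfied, scaled to 0-10.
        hits = sum(e["found"] for e in evidence)
        return 10 * hits / max(len(evidence), 1)

def evaluate(condition, generated_image, lmm):
    """Run the decompose -> select-tool -> analyze loop once."""
    evidence = []
    for sub in lmm.decompose(condition):
        tool = TOOLBOX[lmm.select_tool(sub)]
        evidence.append(tool(generated_image, sub))
    return lmm.score(evidence)

score = evaluate("a cat and a dog", "photo: cat dog", DummyLMM())  # -> 10.0
```

In the real framework the LMM itself decides which tool to invoke and produces the final score from visual evidence; the stub here only mimics that control flow.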
5. 📊 Results and Evaluation: CIGEVAL with GPT-4o achieves a Spearman correlation of 0.4625 with human assessments across seven tasks, closely matching the human-to-human correlation of 0.47. When implemented with 7B open-source LMMs fine-tuned on only 2.3K training trajectories, it surpasses previous GPT-4o-based state-of-the-art methods.
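The alignment numbers above are Spearman rank correlations between metric scores and human ratings. As a reminder of what is being reported, here is a minimal sketch of the computation using the tie-free rank-difference formula; the example scores are invented for illustration and are not from the paper.

```python
# Spearman's rho between metric scores and human ratings,
# tie-free case. Example data below is hypothetical.

def rank(values):
    """Return 1-based ranks of `values` (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(xs, ys):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(xs)
    rx, ry = rank(xs), rank(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical metric scores vs. human ratings for five images:
metric = [0.9, 0.2, 0.6, 0.4, 0.8]
human = [5, 1, 4, 2, 3]
rho = spearman(metric, human)  # -> 0.9
```

In practice, library routines such as `scipy.stats.spearmanr` handle ties and p-values; the hand-rolled version above just makes the rank-based nature of the reported 0.4625 figure concrete.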