1. 📘 Topic and Domain: The paper focuses on benchmarking the spatial intelligence of text-to-image (T2I) models, evaluating their ability to understand and generate complex spatial relationships in images.
2. 💡 Previous Research and New Ideas: Whereas existing T2I benchmarks rely on short, information-sparse prompts, this work proposes long, information-dense prompts covering 10 spatial sub-domains and replaces simple yes/no questions with omni-dimensional multiple-choice evaluation.
3. ❓ Problem: Current T2I models excel at rendering individual objects but struggle with complex spatial relationships such as positioning, orientation, occlusion, and causal interactions, and existing benchmarks do not evaluate these abilities adequately.
4. 🛠️ Methods: The authors build SpatialGenEval, comprising 1,230 information-dense prompts spanning 25 real-world scenes, each prompt paired with 10 multiple-choice questions targeting different spatial abilities, and construct the SpatialT2I dataset of 15,400 text-image pairs for fine-tuning; a minimal sketch of one benchmark item and its scoring is given after this list.
5. 📊 Results and Evaluation: Evaluation of 23 SOTA models reveals spatial reasoning as the primary bottleneck (scores often below 30%), while fine-tuning with SpatialT2I yields consistent improvements (+4.2% for SD-XL, +5.7% for UniWorld-V1, +4.4% for OmniGen2).
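
For concreteness, below is a minimal Python sketch of how a SpatialGenEval-style item and its scoring could be represented, assuming the description above (one information-dense prompt per item, 10 multiple-choice questions, and an overall score taken as mean per-question accuracy). The field names, the `answer_question` judge interface, and the scene/sub-domain labels are illustrative assumptions, not the authors' actual schema or evaluation code.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class MultiChoiceQuestion:
    """One of the 10 questions attached to a prompt, probing a single spatial sub-domain."""
    sub_domain: str        # e.g. "occlusion" or "orientation" (labels are illustrative)
    question: str
    options: List[str]
    answer_index: int      # index of the correct option


@dataclass
class SpatialGenEvalItem:
    """Hypothetical benchmark entry: one information-dense prompt plus its question set."""
    scene: str                           # one of the 25 real-world scenes
    prompt: str                          # long prompt given to the T2I model
    questions: List[MultiChoiceQuestion]


# `answer_question(image_path, question, options) -> chosen option index` stands in for
# whatever multimodal judge answers the questions about a generated image.
Judge = Callable[[str, str, List[str]], int]


def score_item(image_path: str, item: SpatialGenEvalItem, answer_question: Judge) -> float:
    """Fraction of this item's questions the judge answers correctly for one generated image."""
    correct = sum(
        int(answer_question(image_path, q.question, q.options) == q.answer_index)
        for q in item.questions
    )
    return correct / len(item.questions)


def benchmark_score(per_item_scores: List[float]) -> float:
    """Overall score as the unweighted mean over all 1,230 prompts, in percent."""
    return 100.0 * mean(per_item_scores)
```

Under this reading, the reported fine-tuning gains (e.g., +4.2% for SD-XL) would correspond to changes in this overall percentage, though the summary does not state whether the gains are absolute or relative.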