1. 📘 Topic and Domain: Automating the conversion of UI designs into front-end code using vision-language models and multi-agent systems.
2. 💡 Previous Research and New Ideas: Building on prior research in vision-language models and UI-to-code generation, proposes a novel modular multi-agent framework that decomposes the task into grounding, planning, and generation stages.
3. ❓ Problem: Addresses the limitations of existing text-based and vision-based code generation systems, which struggle to capture spatial layouts and visual design intent in UI development.
4. 🛠️ Methods: Implements a three-stage pipeline: a grounding agent for UI component detection, a planning agent for hierarchical layout construction, and a generation agent for HTML/CSS code synthesis, combined with dual-stage post-training of the underlying vision-language models.
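The three-stage pipeline can be sketched as below. This is an illustrative sketch only: the agent interfaces, the `Component` data structure, and the stubbed detection results are assumptions for demonstration, not the paper's actual implementation (each agent would in practice call a vision-language model).

```python
from dataclasses import dataclass

@dataclass
class Component:
    kind: str    # e.g. "button", "text", "image" (illustrative categories)
    bbox: tuple  # (x, y, width, height) in screenshot pixels
    text: str = ""

def grounding_agent(screenshot_path: str) -> list[Component]:
    """Stage 1: detect UI components and their bounding boxes.
    A real system would query a VLM on the screenshot; here a fixed
    result is returned so the sketch stays runnable."""
    return [
        Component("text", (20, 10, 200, 30), "Welcome"),
        Component("button", (20, 60, 100, 40), "Sign in"),
    ]

def planning_agent(components: list[Component]) -> dict:
    """Stage 2: arrange detected components into a hierarchical layout,
    here simply ordered top-to-bottom by their y coordinate."""
    ordered = sorted(components, key=lambda c: c.bbox[1])
    return {"tag": "div", "children": ordered}

def generation_agent(layout: dict) -> str:
    """Stage 3: render the layout tree as HTML/CSS-ready markup."""
    parts = []
    for c in layout["children"]:
        if c.kind == "button":
            parts.append(f"<button>{c.text}</button>")
        else:
            parts.append(f"<p>{c.text}</p>")
    return f"<{layout['tag']}>" + "".join(parts) + f"</{layout['tag']}>"

def ui_to_code(screenshot_path: str) -> str:
    """Chain the three agents: grounding -> planning -> generation."""
    components = grounding_agent(screenshot_path)
    layout = planning_agent(components)
    return generation_agent(layout)

print(ui_to_code("design.png"))
```

The decomposition lets each stage be prompted, trained, and evaluated in isolation, which is the motivation usually given for modular agent pipelines over a single end-to-end model.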
5. 📊 Results and Evaluation: Achieves state-of-the-art performance across five metrics (block match, text similarity, position alignment, color consistency, and CLIP similarity), outperforming existing open-source models and remaining competitive with proprietary systems.
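Two of the listed metrics can be illustrated with simplified stand-ins. These definitions (character-level sequence similarity for text, intersection-over-union for position alignment) are common proxies and assumptions for this sketch; the paper's exact formulas may differ.

```python
from difflib import SequenceMatcher

def text_similarity(pred: str, ref: str) -> float:
    """Character-level similarity between generated and reference text,
    via difflib's matching-block ratio (1.0 = identical)."""
    return SequenceMatcher(None, pred, ref).ratio()

def position_alignment(box_a: tuple, box_b: tuple) -> float:
    """Intersection-over-union of two (x, y, w, h) boxes, a common
    proxy for how well matched blocks align spatially."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

print(round(text_similarity("Sign in", "Sign up"), 2))
print(round(position_alignment((0, 0, 10, 10), (5, 0, 10, 10)), 2))
```

Both scores are in [0, 1], so per-block scores can be averaged over a page to give a single number per metric.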