1. 📘 Topic and Domain: Development of Step1X-Edit, a practical framework for general image editing using natural language instructions in the domain of computer vision and AI-powered image manipulation.
2. 💡 Previous Research and New Ideas: Builds on existing diffusion models and multimodal LLMs (MLLMs), and proposes a unified framework that combines an MLLM's semantic reasoning with a DiT-style diffusion architecture, with the goal of performance comparable to closed-source models like GPT-4o.
3. ❓ Problem: A significant performance gap separates open-source from closed-source image editing algorithms, which limits accessibility and reproducibility in the field.
4. 🛠️ Methods: Developed a data generation pipeline that produces over 1 million high-quality training triplets across 11 editing categories, integrated an MLLM with a diffusion decoder, and created GEdit-Bench for evaluation.
5. 📊 Results and Evaluation: Step1X-Edit outperformed existing open-source baselines by a substantial margin and approached the performance of proprietary models like GPT-4o and Gemini2 Flash, as evaluated on GEdit-Bench through both automated metrics and user studies.
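
To make the "training triplets" in item 4 concrete, here is a minimal sketch of how one such record might be represented, assuming each triplet pairs a source image and a natural-language instruction with the edited target image. The class name `EditTriplet`, its field names, the file paths, and the category labels are all hypothetical placeholders, not taken from the Step1X-Edit data pipeline.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EditTriplet:
    """One training example: (source image, instruction, edited image).

    All field names are illustrative assumptions, not the paper's schema.
    """
    source_image: str   # path to the original image
    instruction: str    # natural-language edit instruction
    target_image: str   # path to the edited result
    category: str       # editing category label (placeholder values below)


def count_by_category(triplets):
    """Return how many triplets fall into each editing category."""
    return Counter(t.category for t in triplets)


# Placeholder records; paths and category names are illustrative only.
examples = [
    EditTriplet("img/0001.png", "Remove the car on the left.",
                "img/0001_edit.png", "category_a"),
    EditTriplet("img/0002.png", "Make the sky look like sunset.",
                "img/0002_edit.png", "category_b"),
    EditTriplet("img/0003.png", "Remove the lamp post.",
                "img/0003_edit.png", "category_a"),
]

counts = count_by_category(examples)
print(dict(counts))
```

A per-category count like this is one simple way to check how a dataset is balanced across editing categories; the actual category taxonomy would come from the paper itself.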