1. 📘 Topic and Domain: A technical report introducing Ovis-U1, a 3-billion-parameter unified multimodal AI model for image understanding, text-to-image generation, and image editing.
2. 💡 Previous Research and New Ideas: Inspired by GPT-4o-style unified models and building on the earlier Ovis series, the report proposes a unified training approach that starts from a language model rather than a frozen multimodal large language model (MLLM).
3. ❓ Problem: How to endow a multimodal understanding model with image-generation capability, and how to train a single unified model effectively on both understanding and generation tasks.
4. 🛠️ Methods: Implements a diffusion-based visual decoder with a bidirectional token refiner, trained through a 6-stage unified training process that combines understanding, generation, and editing tasks.
5. 📊 Results and Evaluation: Achieves 69.6 on the OpenCompass Multi-modal Academic Benchmark, 83.72 on DPG-Bench, 0.89 on GenEval, and 4.00 and 6.42 on ImgEdit-Bench and GEdit-Bench-EN, respectively, surpassing several state-of-the-art models.
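The "bidirectional token refiner" in the methods item can be pictured as a self-attention block with no causal mask, so every visual token can attend to every other token before conditioning the diffusion decoder. The sketch below is a minimal single-head illustration under assumed shapes (16 tokens, width 32); the function names, dimensions, and single residual block are assumptions for exposition, not the report's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_refine(tokens, wq, wk, wv, wo):
    """One illustrative refiner block: full (non-causal) self-attention
    plus a residual connection, so information flows in both directions."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))   # (n, n); no causal mask applied
    return tokens + (attn @ v) @ wo        # residual connection

rng = np.random.default_rng(0)
n, d = 16, 32                              # assumed token count and width
tokens = rng.normal(size=(n, d))
wq, wk, wv, wo = (rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(4))
refined = bidirectional_refine(tokens, wq, wk, wv, wo)
print(refined.shape)  # (16, 32)
```

The absence of a causal mask is the key contrast with autoregressive text decoding: refined visual tokens may depend on tokens that appear later in the sequence.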