1. 📘 Topic and Domain: The paper introduces Skywork UniPic, a unified autoregressive model for visual AI tasks including image understanding, text-to-image generation, and image editing.
2. 💡 Previous Research and New Ideas: Previous approaches were fragmented, relying on separate models for different tasks. The paper proposes a unified architecture with a decoupled visual encoding strategy that uses MAR for generation and SigLIP2 for understanding.
3. ❓ Problem: The paper addresses the challenge of creating a single, parameter-efficient architecture that can excel at multiple visual AI tasks while remaining deployable on commodity hardware.
4. 🛠️ Methods: The method employs a 1.5B-parameter model with four core components: a MAR encoder-decoder, a SigLIP2 encoder, a shared language model backbone, and MLP projection layers. The model is trained through a progressive four-stage curriculum.
5. 📊 Results and Evaluation: The paper reports state-of-the-art performance across multiple benchmarks: 0.86 on GenEval, 85.5 on DPG-Bench, 5.83 on GEditBench-EN, and 3.49 on ImgEdit-Bench. The model requires only 15 GB of GPU memory for 1024×1024 image generation.
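
The decoupled encoding idea in items 2 and 4 can be illustrated with a small routing sketch. This is not the paper's implementation: the component identifiers, the ordering within each path, and the choice to give each visual encoder its own projection are all assumptions made for illustration. Only the understanding and generation paths are shown, because the summary does not say how image editing is routed.

```python
from typing import Dict, List, Set

# Hypothetical component names; the summary only lists the four component types.
# The order within each path is an illustrative assumption.
PIPELINES: Dict[str, List[str]] = {
    # Understanding: SigLIP2 features are projected into the language model.
    "understanding": ["siglip2_encoder", "mlp_proj_understanding", "shared_llm"],
    # Generation: MAR handles image tokens on both sides of the language model.
    "generation": ["mar_encoder", "mlp_proj_generation", "shared_llm", "mar_decoder"],
}


def components_for(task: str) -> List[str]:
    """Return the (assumed) ordered component path for a task."""
    if task not in PIPELINES:
        raise ValueError(f"no routing defined for task: {task!r}")
    return list(PIPELINES[task])


def shared_components() -> Set[str]:
    """Components that appear in every task path."""
    paths = [set(path) for path in PIPELINES.values()]
    return set.intersection(*paths)


if __name__ == "__main__":
    for task in PIPELINES:
        print(task, "->", " -> ".join(components_for(task)))
    print("shared:", sorted(shared_components()))
```

Under these assumptions the two paths overlap only at the language model backbone, which is what "decoupled visual encoding with a shared backbone" amounts to at the level of this summary.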