1. 📘 Topic and Domain: The paper introduces NextStep-1, a large-scale autoregressive model for text-to-image generation and editing, operating in the domain of artificial intelligence and computer vision.
2. 💡 Previous Research and New Ideas: Building on prior autoregressive language models and diffusion models, the paper proposes a novel approach that autoregressively generates continuous image tokens with a flow matching head, rather than relying on traditional vector quantization or heavyweight diffusion decoders.
3. ❓ Problem: The paper aims to overcome the limitations of existing autoregressive text-to-image models, which either depend on computationally intensive diffusion models or suffer from the information loss introduced by vector quantization.
4. 🛠️ Methods: The paper implements a 14B-parameter autoregressive Transformer paired with a lightweight 157M-parameter flow matching head; the backbone performs next-token prediction over discrete text tokens and continuous image tokens, and is trained on a diverse mix of text-only corpora, image-text pairs, and interleaved image-text data.
5. 📊 Results and Evaluation: The model achieves state-of-the-art performance among autoregressive models in text-to-image generation, scoring 0.54 on WISE, 0.67 on GenAI-Bench advanced prompts, 85.28 on DPG-Bench, and 0.417 on OneIG-Bench English prompts, while also demonstrating strong image-editing capabilities.
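The core training idea in point 4 — a small head regressing a flow-matching velocity for each continuous image token, conditioned on the backbone's hidden state — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the `toy_head` linear map, the dimension `d`, and the function names are all hypothetical stand-ins (the real head is a 157M-parameter network, and the real conditioning comes from the 14B Transformer).

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(head, cond, x1, rng):
    """Rectified-flow-style flow matching loss for one continuous image token.

    cond : conditioning vector (stand-in for the AR backbone's hidden state)
    x1   : target continuous token (e.g., a latent patch from an image tokenizer)
    """
    x0 = rng.standard_normal(x1.shape)        # Gaussian noise sample
    t = rng.uniform()                         # random time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1              # linear interpolation between noise and data
    v_target = x1 - x0                        # constant velocity along the straight path
    v_pred = head(xt, t, cond)                # head predicts the velocity field
    return float(np.mean((v_pred - v_target) ** 2))  # simple MSE regression objective

# Toy "head": a random linear map standing in for the flow matching network.
d = 8
W = rng.standard_normal((2 * d + 1, d)) * 0.1

def toy_head(xt, t, cond):
    inp = np.concatenate([xt, cond, [t]])     # concatenate noisy token, condition, and time
    return inp @ W

cond = rng.standard_normal(d)                 # hypothetical backbone hidden state
x1 = rng.standard_normal(d)                   # hypothetical target token
loss = flow_matching_loss(toy_head, cond, x1, rng)
print(loss >= 0.0)                            # the MSE loss is always non-negative
```

At generation time the same head would be used in reverse: starting from noise and integrating the predicted velocity over `t` from 0 to 1 to produce each continuous token, conditioned on the sequence so far.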