1. 📘 Topic and Domain: The paper presents VINO, a visual generator that unifies image and video generation and editing within a single framework, in the domain of computer vision and deep learning.
2. 💡 Previous Research and New Ideas: Building on prior diffusion models and multimodal assistants, the paper proposes a unified framework that couples a vision-language model with a Multimodal Diffusion Transformer via interleaved conditioning tokens.
3. ❓ Problem: The paper addresses the fragmentation of visual generation pipelines, in which text-to-image, text-to-video, and visual editing models are developed separately, with no unified framework that handles all of these tasks.
4. 🛠️ Methods: VINO couples a vision-language model with a Multimodal Diffusion Transformer, using learnable query tokens and a token-boundary mechanism, and is trained through a progressive three-stage pipeline that gradually expands its capabilities.
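To make the conditioning design above concrete, here is a minimal, purely illustrative sketch of how interleaved conditioning might be assembled: modality segments separated by token-boundary markers, followed by learnable query slots. All names (`interleave_condition`, the marker strings, `num_queries`) are assumptions for illustration, not VINO's actual API or token vocabulary.

```python
# Hypothetical sketch of interleaved conditioning (names are assumptions,
# not VINO's actual implementation). Each (modality, tokens) segment from
# the vision-language model is wrapped in boundary markers, and learnable
# query slots are appended for the diffusion transformer to attend to.

def interleave_condition(segments, num_queries=4):
    """Build an interleaved conditioning sequence with boundary markers
    and trailing query slots."""
    seq = []
    for modality, tokens in segments:
        seq.append(f"<{modality}_start>")  # token-boundary marker
        seq.extend(tokens)
        seq.append(f"<{modality}_end>")
    # learnable query tokens act as slots summarizing the VLM context
    seq.extend(f"<query_{i}>" for i in range(num_queries))
    return seq

cond = interleave_condition(
    [("text", ["a", "red", "car"]), ("image", ["img_0", "img_1"])],
    num_queries=2,
)
print(cond)
# ['<text_start>', 'a', 'red', 'car', '<text_end>',
#  '<image_start>', 'img_0', 'img_1', '<image_end>',
#  '<query_0>', '<query_1>']
```

The boundary markers let the downstream transformer distinguish where one modality's tokens end and another's begin, which is the role the summary attributes to the token-boundary mechanism.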
5. 📊 Results and Evaluation: The model demonstrates strong performance across diverse generation and editing benchmarks, showing improved identity preservation, faithful instruction following, and better controllability in multi-identity edits compared to existing task-specific models.