1. 📘 Topic and Domain: The paper introduces Kandinsky 5.0, a family of foundation models for high-resolution image and video generation, consisting of three core models: Image Lite (6B parameters), Video Lite (2B parameters), and Video Pro (19B parameters).
2. 💡 Previous Research and New Ideas: Building on prior diffusion-model and flow-matching approaches, the paper proposes architectural optimizations, including the CrossDiT (Cross-Attention Diffusion Transformer) backbone and the NABLA (Neighborhood Adaptive Block-Level Attention) mechanism for efficient video generation.
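The core idea behind block-level attention schemes like NABLA can be illustrated with a minimal sketch: pool tokens into blocks, score block pairs with a low-resolution attention map, and keep only the highest-scoring key blocks per query block. All details here (pooling choice, the cumulative-probability budget, the function name) are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def nabla_style_block_mask(q, k, block=4, keep=0.8):
    """Illustrative sketch of block-level sparse attention selection
    (assumed details, not the paper's exact NABLA algorithm)."""
    T, d = q.shape
    nb = T // block
    # Average-pool tokens into block-level representations.
    qb = q[:nb * block].reshape(nb, block, d).mean(axis=1)
    kb = k[:nb * block].reshape(nb, block, d).mean(axis=1)
    # Low-resolution attention map over block pairs.
    scores = qb @ kb.T / np.sqrt(d)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # For each query block, keep the smallest set of key blocks whose
    # probability mass reaches the `keep` budget; all other blocks are
    # skipped, reducing attention cost on long video sequences.
    mask = np.zeros((nb, nb), dtype=bool)
    for i in range(nb):
        order = np.argsort(probs[i])[::-1]
        csum = np.cumsum(probs[i][order])
        cutoff = np.searchsorted(csum, keep) + 1
        mask[i, order[:cutoff]] = True
    return mask

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))
k = rng.standard_normal((16, 8))
m = nabla_style_block_mask(q, k, block=4, keep=0.8)
print(m.shape, int(m.sum()))
```

Dense attention would then be run only on the block pairs the mask keeps, which is where the reported speedup comes from.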
3. ❓ Problem: The paper addresses the challenge of generating high-quality, temporally consistent, and controllable video while remaining computationally efficient, in particular by reducing the cost of attention over long video sequences.
4. 🛠️ Methods: The paper employs a multi-stage training pipeline (pre-training, supervised fine-tuning, distillation, and RL-based post-training), along with comprehensive data processing and curation. It also introduces optimizations for VAE encoding, memory efficiency, and inference speed.
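The pre-training stage of such a pipeline typically optimizes the flow-matching objective mentioned above. A minimal sketch, assuming the standard conditional flow-matching formulation with a linear interpolation path (all names here are illustrative, not the paper's implementation):

```python
import numpy as np

def flow_matching_loss(model, x1, rng):
    """Standard conditional flow-matching objective (sketch): the model
    predicts the velocity field along a straight path from noise to data."""
    x0 = rng.standard_normal(x1.shape)       # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))   # per-example time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1            # linear interpolation path
    target = x1 - x0                         # constant velocity target
    pred = model(x_t, t)
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
x1 = rng.standard_normal((8, 4))             # a toy batch of "data"
# A trivial stand-in model that always predicts zero velocity.
loss = flow_matching_loss(lambda x, t: np.zeros_like(x), x1, rng)
print(loss >= 0.0)
```

Subsequent stages (SFT, distillation, RL-based post-training) would fine-tune the same velocity model with different data and objectives.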
5. 📊 Results and Evaluation: In human side-by-side evaluations, the models outperformed or matched leading models such as Sora, Veo, and Wan on key metrics, including visual quality, motion dynamics, and prompt adherence. The NABLA mechanism reduced training and inference time by 2.7× while preserving quality.