1. 📘 Topic and Domain: Video generation using diffusion transformers, focusing on optimizing and accelerating the attention mechanism in video diffusion models.
2. 💡 Previous Research and New Ideas: Builds on existing Video Diffusion Transformer (vDiT) architectures, proposing new sparse attention patterns and optimization techniques that reduce computational overhead while maintaining generation quality.
3. ❓ Problem: The quadratic computational complexity of attention mechanisms in video diffusion transformers leads to significant inference latency, making video generation slow and computationally expensive.
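The quadratic scaling can be made concrete with a back-of-the-envelope FLOP count. The sketch below uses the standard approximation of ~4·L²·d multiply-adds for the two attention matmuls (QKᵀ and softmax·V); the function name and numbers are illustrative, not taken from the paper.

```python
# Illustrative only: rough FLOP count for one full self-attention layer,
# showing the quadratic growth in sequence length L that motivates sparsity.

def attention_flops(seq_len: int, head_dim: int) -> int:
    """Approximate FLOPs for scores = Q @ K.T and out = softmax(scores) @ V."""
    # Each of the two (L x d) @ (d x L) / (L x L) @ (L x d) matmuls
    # costs ~2 * L * L * d multiply-adds.
    return 4 * seq_len * seq_len * head_dim

# Doubling the token count quadruples the attention cost; long video
# token sequences make this the dominant inference bottleneck.
base = attention_flops(seq_len=1024, head_dim=64)
doubled = attention_flops(seq_len=2048, head_dim=64)
print(doubled / base)  # -> 4.0
```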
4. 🛠️ Methods: Introduces Sparse-vDiT framework that combines pattern-optimized sparse kernels, offline sparse diffusion search algorithm, and head fusion techniques to optimize attention computation based on identified sparsity patterns.
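The core idea of exploiting structured sparsity in attention can be sketched as masked attention over a fixed pattern. The banded (multi-diagonal) mask below is a generic illustration of this kind of pattern; the paper's actual pattern-optimized kernels, offline sparse diffusion search, and head-fusion logic are not reproduced here.

```python
# Minimal numpy sketch of attention restricted to a structured sparsity
# pattern. Entries outside the mask are never used, which is what a
# pattern-optimized sparse kernel exploits to skip computation.
import numpy as np

def sparse_attention(q, k, v, mask):
    """Attention where score entries outside boolean `mask` (L x L) are dropped."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)       # exclude masked positions
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def banded_mask(seq_len: int, bandwidth: int) -> np.ndarray:
    """Multi-diagonal (banded) pattern: each token attends only to neighbors."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= bandwidth

rng = np.random.default_rng(0)
L, d = 16, 8
q, k, v = (rng.standard_normal((L, d)) for _ in range(3))
mask = banded_mask(L, bandwidth=2)
out = sparse_attention(q, k, v, mask)
# Mask density well below 1.0 means proportionally fewer effective FLOPs.
print(mask.mean())
```

A dedicated sparse kernel only materializes the masked-in entries, so the FLOP savings track the mask density rather than being simulated with `-inf` as in this dense reference version.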
5. 📊 Results and Evaluation: Achieves theoretical FLOP reductions of 2.09×, 2.38×, and 1.67× on CogVideoX1.5, HunyuanVideo, and Wan2.1 respectively, with measured end-to-end speedups of 1.76×, 1.85×, and 1.58×, while maintaining high visual quality (PSNR of 24.13, 27.09, and 22.59).