1. 📘 Topic and Domain: A unified AI framework called UniVideo for video understanding, generation, and editing that combines multimodal capabilities in a single system.
2. 💡 Previous Research and New Ideas: Builds on previous unified text-image models and task-specific video models, and proposes a novel dual-stream architecture that combines a Multimodal Large Language Model (MLLM) for understanding with a Multimodal DiT (MMDiT) for generation.
3. ❓ Problem: Current video AI models are restricted to single tasks or modalities and lack a unified ability to understand complex instructions and perform diverse video tasks.
4. 🛠️ Methods: Uses a dual-stream architecture with a frozen MLLM for instruction understanding and an MMDiT for video generation, trained in three stages across multiple tasks, including text-to-video generation, image-to-video generation, and video editing.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance across multiple video tasks, demonstrates zero-shot generalization to unseen tasks, and shows strong capabilities in visual prompt understanding and task composition, evaluated through both human assessment and automatic metrics.
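
The frozen-understanding / trainable-generation split described under Methods can be illustrated with a toy sketch. This is not UniVideo's implementation: `ToyStream`, `sgd_step`, the weight values, and the made-up gradients are all hypothetical stand-ins, and element-wise scaling replaces the real networks. It only shows the general idea that a frozen branch supplies conditioning features while updates touch the generation branch alone.

```python
from dataclasses import dataclass


@dataclass
class ToyStream:
    """Hypothetical stand-in for one branch: a weight vector plus a trainable flag."""
    name: str
    weights: list
    trainable: bool = True

    def forward(self, features):
        # Element-wise scaling stands in for a real network's forward pass.
        return [w * x for w, x in zip(self.weights, features)]


def sgd_step(streams, grads, lr=0.1):
    """Apply a gradient step to trainable streams only; frozen streams are skipped."""
    for stream in streams:
        if not stream.trainable:
            continue
        g = grads[stream.name]
        stream.weights = [w - lr * gi for w, gi in zip(stream.weights, g)]


# Assumed toy setup: the understanding branch is frozen,
# the generation branch is trainable.
understanding = ToyStream("mllm", [1.0, 2.0, 3.0], trainable=False)
generation = ToyStream("mmdit", [0.5, 0.5, 0.5], trainable=True)

instruction_features = [1.0, 1.0, 1.0]
condition = understanding.forward(instruction_features)  # frozen branch output
output = generation.forward(condition)                   # generation conditioned on it

# Made-up gradients of 1.0 for every weight in both branches.
grads = {"mllm": [1.0, 1.0, 1.0], "mmdit": [1.0, 1.0, 1.0]}
sgd_step([understanding, generation], grads)
```

After the step, `understanding.weights` is unchanged while `generation.weights` has moved, which is the property a frozen-branch design relies on: the instruction-understanding ability of the pretrained model is preserved while the generator learns to use its features.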