1. 📘 Topic and Domain: Human-centric video generation via collaborative multi-modal conditioning, where text, image, and audio jointly steer the synthesized video.
2. 💡 Previous Research and New Ideas: Builds on DiT-based text-to-video models and introduces collaborative multi-modal control: a minimal-invasive image injection scheme for subject preservation and a focus-by-predicting strategy for audio-visual sync (illustrative sketches follow this list).
3. ❓ Problem: Tackles the scarcity of paired triplet training data (text-image-audio) and the difficulty of balancing the competing sub-tasks of multi-modal video generation, chiefly subject preservation and audio-visual sync.
4. 🛠️ Methods: A two-stage progressive training paradigm backed by a multimodal data-processing pipeline; minimal-invasive image injection preserves the subject, the focus-by-predicting strategy enforces audio-visual sync, and time-adaptive Classifier-Free Guidance modulates condition strength at inference (see the CFG sketch after this list).
5. 📊 Results and Evaluation: Outperforms state-of-the-art methods on both subject preservation and audio-visual sync, with gains in aesthetic quality, text following, identity preservation, and synchronization metrics.
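
The sketch below illustrates one plausible reading of the minimal-invasive image injection named in items 2 and 4: the reference image's latent tokens are simply appended to the noisy video latent sequence, so the pretrained DiT picks up subject identity through its existing self-attention, with no new layers. The function name, tensor layout, and single-extra-frame treatment are illustrative assumptions, not the paper's actual code.

```python
import torch

def inject_reference(video_latents: torch.Tensor,
                     ref_image_latent: torch.Tensor) -> torch.Tensor:
    """video_latents: (B, T, N, C) noisy video latent tokens.
    ref_image_latent: (B, N, C) clean latent tokens of the reference image."""
    # Treat the reference latent as one extra "frame" of clean tokens.
    ref_tokens = ref_image_latent.unsqueeze(1)            # (B, 1, N, C)
    # Append along the temporal axis; the pretrained DiT's self-attention
    # then lets every video token read subject identity from the reference,
    # without modifying any existing weights or layers.
    return torch.cat([video_latents, ref_tokens], dim=1)  # (B, T+1, N, C)
```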
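Similarly, a hedged sketch of the focus-by-predicting idea: during training, the model's audio cross-attention over latent tokens is supervised against a detected face-region mask, so synchronization is learned where lip motion actually occurs. The BCE formulation, shapes, and names are assumptions about how such an objective could look.

```python
import torch
import torch.nn.functional as F

def focus_by_predicting_loss(audio_attn: torch.Tensor,
                             face_mask: torch.Tensor) -> torch.Tensor:
    """audio_attn: (B, N) per-token audio cross-attention mass in [0, 1].
    face_mask:  (B, N) binary face-region mask from an off-the-shelf detector."""
    # Supervise where the audio attends: concentrating attention on face
    # tokens ties lip motion, rather than background motion, to the audio.
    return F.binary_cross_entropy(audio_attn.clamp(1e-6, 1 - 1e-6),
                                  face_mask.float())
```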
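Finally, a minimal sketch of time-adaptive Classifier-Free Guidance from item 4: instead of fixed guidance scales, the text and audio weights are scheduled over the denoising trajectory. The model signature, the compositional two-delta form, and the specific early-text/late-audio schedule are illustrative assumptions; any monotone schedule could be swapped in.

```python
import torch

def time_adaptive_cfg(model, x_t: torch.Tensor, t: torch.Tensor,
                      step: int, num_steps: int,
                      text_cond, audio_cond,
                      w_text_max: float = 7.5,
                      w_audio_max: float = 4.0) -> torch.Tensor:
    # progress runs 0 -> 1 from the noisiest step to the cleanest.
    progress = step / max(num_steps - 1, 1)
    # Illustrative policy: strong text guidance early (global layout and
    # semantics), strong audio guidance late (fine-grained lip motion).
    w_text = w_text_max * (1.0 - progress)
    w_audio = w_audio_max * progress

    eps_uncond = model(x_t, t, text=None,      audio=None)
    eps_text   = model(x_t, t, text=text_cond, audio=None)
    eps_full   = model(x_t, t, text=text_cond, audio=audio_cond)

    # Compositional CFG: one time-weighted delta per condition.
    return (eps_uncond
            + w_text  * (eps_text - eps_uncond)
            + w_audio * (eps_full - eps_text))
```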