1. 📘 Topic and Domain: The paper presents "Captain Cinema," a framework for generating short movies from textual descriptions, operating in the domain of AI-generated video content and narrative storytelling.
2. 💡 Previous Research and New Ideas: Whereas previous text-to-video models could only generate 5-10 second clips, this paper introduces a novel two-stage approach combining top-down keyframe planning with bottom-up video synthesis to produce longer, narratively coherent videos.
3. ❓ Problem: The paper addresses the challenge of generating long-form, narratively coherent videos with consistent characters and scenes, as existing approaches struggle with maintaining coherence beyond short clips.
4. 🛠️ Methods: The method uses a two-stage approach: first generating keyframes using a Multimodal Diffusion Transformer with GoldenMem compression for long-context memory, then synthesizing video between keyframes using interleaved conditioning.
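The two-stage pipeline described above can be sketched in simplified form. Everything here is an illustrative stand-in, not the paper's actual implementation: `plan_keyframes`, `synthesize_segment`, and `MEMORY_WINDOW` are hypothetical names, and the bounded memory window is only a rough analogue of GoldenMem-style long-context compression.

```python
# Hypothetical sketch of a two-stage movie-generation pipeline.
# All names are illustrative; the real system uses a Multimodal
# Diffusion Transformer and GoldenMem compression.

MEMORY_WINDOW = 3  # stand-in for a compressed long-context memory


def plan_keyframes(script: str, n: int) -> list[str]:
    """Stage 1 (top-down): turn a script into n keyframe descriptors."""
    # A real system would sample keyframes from a diffusion model here.
    return [f"keyframe[{i}]:{script}" for i in range(n)]


def synthesize_segment(start: str, end: str, memory: list[str]) -> str:
    """Stage 2 (bottom-up): fill in video between two keyframes,
    conditioned on a compressed memory of earlier keyframes."""
    return f"clip({start} -> {end} | mem={len(memory)})"


def generate_movie(script: str, n_keyframes: int) -> list[str]:
    keyframes = plan_keyframes(script, n_keyframes)
    segments = []
    for i in range(len(keyframes) - 1):
        # Interleaved conditioning: each segment sees its bounding
        # keyframes plus a bounded window of earlier ones, mimicking
        # long-context memory compression.
        memory = keyframes[max(0, i - MEMORY_WINDOW):i]
        segments.append(
            synthesize_segment(keyframes[i], keyframes[i + 1], memory)
        )
    return segments


segments = generate_movie("a heist in Paris", 5)
```

The key design point this sketch illustrates is the division of labor: global narrative structure is fixed first by the keyframes, so the per-segment synthesis only needs local coherence plus a bounded memory, which keeps the conditioning context from growing with movie length.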
5. 📊 Results and Evaluation: The results show superior performance in generating visually coherent and narratively consistent short movies compared to baselines, evaluated through automated metrics and user studies, with particularly strong results in temporal dynamics and preservation of character consistency.