1. 📘 Topic and Domain: The paper addresses one-minute video generation from text storyboards using Test-Time Training (TTT) layers to overcome the limitations of Transformer models in handling long contexts.
2. 💡 Previous Research and New Ideas: The paper builds on Diffusion Transformers but proposes using TTT layers with neural network hidden states instead of traditional RNN approaches like Mamba or DeltaNet which use matrix hidden states.
3. ❓ Problem: The paper aims to solve the inefficiency of self-attention in generating long videos, as traditional Transformers struggle with one-minute videos due to quadratic complexity with context length.
4. 🛠️ Methods: The authors add TTT-MLP layers to a pre-trained Diffusion Transformer (CogVideo-X 5B), fine-tune on Tom and Jerry cartoons, and implement on-chip tensor parallelism for efficiency while limiting self-attention to 3-second segments.
5. 📊 Results and Evaluation: TTT-MLP outperformed baselines (Mamba 2, Gated DeltaNet, sliding-window attention) by 34 Elo points in human evaluation across four metrics, generating more coherent videos with complex stories, though still containing some artifacts.