1. 📘 Topic and Domain: Video world models with memory mechanisms for long-term consistent video generation, in the domain of computer vision and generative AI.
2. 💡 Previous Research and New Ideas: Builds on diffusion-based video generation models and proposes a novel three-part memory system (spatial, working, and episodic memory) inspired by human memory mechanisms.
3. ❓ Problem: Addresses the limited temporal context window and the resulting forgetting problem in existing video world models, which cause inconsistency when previously generated scenes are revisited.
4. 🛠️ Methods: Implements a geometry-grounded point cloud for spatial memory, recent context frames for working memory, and sparse historical keyframes for episodic memory, all integrated into a diffusion transformer architecture.
5. 📊 Results and Evaluation: Achieves significantly improved view-recall consistency (PSNR 19.10 vs. ~12.0 for baselines) and higher user-study ratings for camera accuracy, static consistency, and dynamic plausibility.
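The three-part memory design in item 4 can be sketched as a simple data structure: a growing point cloud for spatial memory, a bounded buffer of recent frames for working memory, and sparsely sampled keyframes for episodic memory. All names, interfaces, and the keyframe-sampling rule below are illustrative assumptions, not the paper's actual implementation.

```python
from collections import deque

import numpy as np


class WorldModelMemory:
    """Hypothetical sketch of a three-part memory (spatial / working / episodic)."""

    def __init__(self, working_size=8, keyframe_stride=16):
        self.spatial = []  # spatial memory: accumulated geometry-grounded 3D points
        self.working = deque(maxlen=working_size)  # working memory: recent context frames
        self.episodic = []  # episodic memory: sparse historical keyframes
        self.keyframe_stride = keyframe_stride  # assumed fixed-stride keyframe selection
        self._t = 0

    def observe(self, frame, points):
        """Ingest one generated frame plus its back-projected 3D points."""
        self.spatial.append(points)  # grow the persistent point cloud
        self.working.append(frame)  # deque drops the oldest frame automatically
        if self._t % self.keyframe_stride == 0:
            self.episodic.append((self._t, frame))  # keep a sparse keyframe with its timestep
        self._t += 1

    def condition(self):
        """Assemble the three memory streams as conditioning inputs for the generator."""
        cloud = np.concatenate(self.spatial) if self.spatial else np.empty((0, 3))
        return cloud, list(self.working), self.episodic


# Usage: simulate 20 generation steps, each emitting a frame and 5 scene points.
mem = WorldModelMemory(working_size=4, keyframe_stride=8)
for t in range(20):
    mem.observe(frame=f"frame_{t}", points=np.random.rand(5, 3))
cloud, working, episodic = mem.condition()
```

In this sketch the working memory stays small and dense while the episodic memory stays sparse and unbounded, mirroring the trade-off the summary describes: recent frames give fine-grained continuity, keyframes give cheap long-horizon recall, and the point cloud anchors both to scene geometry.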